Introductory Statistics Brad Krein Math Department Cabrillo College January, 2020

Introductory Statistics

Brad Krein

Math Department

Cabrillo College

January, 2020


Dedicated to my wife


Contents

1 Introducing Statistics
  1.1 Sampling Techniques
    1.1.1 Exercises
  1.2 Variable Types
    1.2.1 Exercises
  1.3 Observational Studies and Experiments
    1.3.1 Exercises

2 Describing Data Graphically
  2.1 Describing Qualitative Data Graphically
    2.1.1 Pie Charts
    2.1.2 Bar Charts
    2.1.3 Exercises
  2.2 Describing Quantitative Data Graphically
    2.2.1 Histograms and Polygons
    2.2.2 Frequency Polygons
    2.2.3 Describing Distributions
    2.2.4 Exercises
  2.3 Dotplots and Stem and Leaf Displays
    2.3.1 Dotplots
    2.3.2 Stem and Leaf Displays
    2.3.3 Exercises

3 Numerical Descriptors of Data
  3.1 Measures of Central Tendency
    3.1.1 Exercises
  3.2 Measures of Position, Box and Whisker Plots
    3.2.1 Identifying Outliers
    3.2.2 Exercises
  3.3 Measures of Spread of a Data Set
    3.3.1 Exercises
  3.4 Weighted Mean and Standard Deviation
    3.4.1 Exercises
  3.5 Chebyshev’s Theorem and the Empirical Rule
    3.5.1 Exercises


4 Probability
  4.1 Probability
    4.1.1 Exercises
  4.2 Conditional Probability and Independence of Events
    4.2.1 Exercises
  4.3 Intersection of Events
    4.3.1 Exercises
  4.4 Union of Events
    4.4.1 Exercises
  4.5 Counting Techniques
    4.5.1 Factorials
    4.5.2 Permutations
    4.5.3 Combinations
    4.5.4 Exercises

5 Discrete Probability Distributions
  5.1 Random Variables
    5.1.1 The Mean and Standard Deviation of a Discrete Random Variable
    5.1.2 Exercises
  5.2 Binomial Distribution
    5.2.1 Exercises
  5.3 The Hypergeometric Distribution
    5.3.1 Exercises
  5.4 The Poisson Probability Distribution
    5.4.1 Approximating the Binomial Distribution with the Poisson Distribution
    5.4.2 Exercises

6 Continuous Probability Distributions
  6.1 Continuous Probability Distributions
    6.1.1 The Normal Probability Distribution
    6.1.2 The Standard Normal Probability Distribution
    6.1.3 Probabilities of the Non-Standard Normal Distribution
    6.1.4 Adding and Subtracting Independent Normal Random Variables
    6.1.5 Exercises
  6.2 Finding X When the Probability is Given
    6.2.1 Exercises
  6.3 Normal Approximation to the Binomial Distribution
    6.3.1 Exercises

7 Sampling Distributions of the Population Mean and Proportion
  7.1 Sampling Distributions
    7.1.1 The Mean and Standard Deviation of the Sampling Distribution
    7.1.2 The Shape of the Sampling Distribution of X
    7.1.3 Exercises
  7.2 Probabilities of X
    7.2.1 Exercises
  7.3 The Sampling Distribution of p


    7.3.1 Exercises

8 Confidence Intervals for The Mean and Proportion
  8.1 Confidence Interval for µ with Known σ
    8.1.1 Determining the Sample Size for Estimation of µ
    8.1.2 Exercises
  8.2 Confidence Interval for µ with Unknown σ
    8.2.1 The t-distribution
    8.2.2 Confidence Intervals of µ using the t-Distribution
    8.2.3 Exercises
  8.3 Confidence Intervals for p
    8.3.1 Determining the Sample Size for Estimates of p
    8.3.2 Exercises

9 Hypothesis Tests for The Mean and Proportion
  9.1 Hypothesis Tests of µ, σ Known
    9.1.1 Exercises
  9.2 Hypothesis Tests of µ using the p-Value
    9.2.1 Exercises
  9.3 Hypothesis Tests of µ, σ Unknown
    9.3.1 Exercises
  9.4 Hypothesis Tests of the Population Proportion
    9.4.1 Exercises

10 Confidence Intervals and Hypothesis Tests for Two Population Data
  10.1 Confidence Intervals and Hypothesis Tests of µ1 − µ2 with known σ’s
    10.1.1 Confidence Intervals of µ1 − µ2 with known σ’s
    10.1.2 Hypothesis Tests of µ1 − µ2 with known σ’s
    10.1.3 Exercises
  10.2 Confidence Intervals and Hypothesis Tests of µ1 − µ2 with unknown σ’s
    10.2.1 Confidence intervals for µ1 − µ2 with σ1 = σ2
    10.2.2 Confidence intervals for µ1 − µ2 with σ1 ≠ σ2
    10.2.3 Hypothesis Tests of µ1 − µ2 with unknown σ’s
    10.2.4 Exercises
  10.3 Confidence Intervals and Hypothesis Tests of µd
    10.3.1 Confidence Intervals of µd
    10.3.2 Hypothesis Tests of µd
    10.3.3 Exercises
  10.4 Confidence Intervals and Hypothesis Tests of p1 − p2
    10.4.1 Confidence intervals for p1 − p2
    10.4.2 Hypothesis Tests of p1 − p2
    10.4.3 Exercises
  10.5 The F-Distribution
    10.5.1 Exercises


11 Inferential Statistics with Chi-Square Distribution
  11.1 Chi-Square Distribution
    11.1.1 Sampling Distribution of s²
    11.1.2 Confidence intervals of σ²
    11.1.3 Hypothesis tests of σ²
    11.1.4 Exercises
  11.2 Goodness of Fit Tests
    11.2.1 Exercises
  11.3 Tests of Independence and Homogeneity
    11.3.1 Exercises

12 Analysis of Variance
  12.1 Analysis of Variance
    12.1.1 Exercises

13 Linear Regression
  13.1 Descriptive Statistics using Linear Regression
    13.1.1 Scatter Plots
    13.1.2 Finding the Line of Best Fit
    13.1.3 Interpreting a and b
    13.1.4 Exercises
  13.2 Hypothesis Tests and Confidence Intervals for B
    13.2.1 Hypothesis Tests of H0 : B = 0 and H0 : ρ = 0
    13.2.2 Hypothesis Tests of B when B0 ≠ 0
    13.2.3 Confidence Intervals of B
    13.2.4 Exercises
  13.3 Prediction Intervals and Confidence Intervals for µy|x
    13.3.1 Exercises

A Tables


Chapter 1

Introducing Statistics


1.1 Sampling Techniques

We are constantly bombarded by statistics, and our lives are affected by them. The subject could be as mundane as ‘what is the most popular television show’ or as serious as ‘what treatment is best for my battle with cancer’. We hear the word statistics, but what is it? Like many words, it has more than one meaning.

Statistics is the collection of methods used to collect, analyze, and interpret data and to use the data to make decisions.

Statistics is a numerical description of a sample.

This last definition is given as a preview of things to come. When a baseball fan speaks of a player’s stats, the second definition is what they are referring to.

The second definition also brings us to the following definitions.

A population is the collection of all objects that a researcher is studying.

A sample is a subset of the population.

The human resources department at a large company is interested in what percent of its employees plan to leave the company in the next five years. Fifty employees are asked. The population is the set of all employees at the company and the sample is the 50 employees who were asked.

When a pollster wants to know who people are going to vote for in an upcoming election, the pollster is interested in the collection of all people who are going to vote. That is the population. The pollster can’t reach the entire population, so they take, say, 1000 people who are going to vote and ask them who they plan to vote for. The 1000 voters make up the sample.

In the last two examples we see some subtle problems that arise in statistics. We aren’t going to make them part of this course, but the consumer should always be wary. If you are reading the results in a publication, then you are the consumer of the statistics. So, what are the problems? In both examples, we are asking people what they plan to do. People don’t always tell the truth. In the first example, an employee might not want to disclose that they are actively looking for a new job, for fear of retribution by their boss. In the second example, a voter might feel uncomfortable revealing that they plan to vote for an unpopular candidate. In addition, the second example has a more serious problem. The population is the set of all people who are going to vote in the next election. One cannot determine this set until after the election. There are ways to lessen these issues, but they remain to some extent.

Once we have identified the population in question, we need to determine how we are going to collect a sample. Several techniques are employed.

A Simple Random Sample is a sample chosen so that each element of the population has an equal chance of being selected and the selections are independent of one another.

We will refer to a simple random sample as simply a random sample. The best way to think of a simple random sample is as picking names out of a hat.
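The names-out-of-a-hat picture can be sketched in a few lines of Python. This is only an illustration under assumed values: the list of 20 employees, the sample size of 5, and the seed are all made up for the example. The standard-library call `random.Random.sample` draws without replacement, so no name can come out of the hat twice.

```python
import random

# Hypothetical population: 20 employees identified by number.
population = [f"employee_{i}" for i in range(1, 21)]

# A seeded generator makes the illustration repeatable.
rng = random.Random(2020)

# Draw 5 names "out of the hat" without replacement.
sample = rng.sample(population, 5)
print(sample)
```

Changing the seed gives a different sample, but every group of 5 employees is equally likely to be drawn.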


One problem with random sampling is that the sample may appear not to be random. For example, if a large company randomly selects employees to assess what it is like to work at the company, the sample could, just by chance, contain a larger proportion of managers than the company has, in which case the results could be unreliable. The larger the sample, the less likely this is.

In this example, it would be better to take a random sample of management and a random sample of non-management workers, sized so that together they reflect the population. This is what is called a representative sample.

A Representative Sample is a sample randomly chosen to match the characteristics of the population.

One of the challenges of selecting a representative sample is identifying the important characteristics. Does gender matter? What about race? Political affiliation? Being a pet owner? The list goes on. The more categories you include, the larger your sample needs to be.

A pollster wants to know if people support a new tax bill. How the bill affects a person should be a good indicator of whether or not they are in favor of it. The pollster can split the population into categories based on income (high income, low income, etc.) and take a sample from each income level so that the sample reflects the population. This is what is called stratified sampling.

Stratified Sampling is a sampling technique where the population is separated into distinct categories, or strata, and a sample is selected from each stratum so that the proportion of the sample in each stratum matches the proportion of the population in that stratum.

This is very similar to representative sampling. In stratified sampling, we focus on one category that is split into strata. In representative sampling, several different categories may be used, so you can be in more than one category.

Like representative sampling, it is sometimes hard to identify the important strata.
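The proportional matching in stratified sampling can be sketched as follows. The income strata, their sizes (200, 500, and 300 people), the sample size of 50, and the helper name `stratified_sample` are all hypothetical and chosen so the arithmetic comes out evenly. It is a sketch of the idea, not a general-purpose routine.

```python
import random

def stratified_sample(strata, sample_size, rng):
    """Draw a proportional random sample from each stratum.

    strata maps a stratum name to the list of its members.
    """
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # The stratum's share of the sample matches its share of the population.
        # round() is adequate here because the hypothetical sizes divide evenly.
        k = round(sample_size * len(members) / total)
        sample[name] = rng.sample(members, k)
    return sample

# Hypothetical population of 1000 people split by income level.
strata = {
    "high": [f"high_{i}" for i in range(200)],
    "middle": [f"middle_{i}" for i in range(500)],
    "low": [f"low_{i}" for i in range(300)],
}

rng = random.Random(1)
sample = stratified_sample(strata, 50, rng)
sizes = {name: len(members) for name, members in sample.items()}
print(sizes)  # {'high': 10, 'middle': 25, 'low': 15}
```

The strata make up 20%, 50%, and 30% of the population, and the sample of 50 is split 10, 25, and 15 to match.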

Suppose you are in charge of taking a sample of elementary school students in a large school district and interviewing them. You could use stratified sampling and take a sample of students from each class in each grade at each school. That would take a bit of traveling on your part. An easier, and cheaper, way to sample would be to randomly select a few of the schools and interview everyone at those schools. This is what is called cluster sampling.

Cluster Sampling is a sampling technique in which the population is separated into groups, or clusters, a random sample of clusters is selected, and all elements from those clusters are sampled.

Cluster sampling tends to have greater variability than stratified sampling. To counter this, a larger sample may need to be taken.
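The two steps of cluster sampling, picking clusters at random and then taking everyone in them, can be sketched like this. The six schools, the four students per school, and the choice of two clusters are invented for the illustration.

```python
import random

# Hypothetical district: 6 schools, each with 4 students.
schools = {
    f"school_{s}": [f"school_{s}_student_{i}" for i in range(1, 5)]
    for s in "ABCDEF"
}

rng = random.Random(7)

# Step 1: randomly select clusters (schools), not individual students.
chosen_schools = rng.sample(sorted(schools), 2)

# Step 2: include every student from the chosen schools.
sample = [student for school in chosen_schools for student in schools[school]]

print(chosen_schools)
print(len(sample))  # 8: two schools with four students each
```

Only the choice of schools is random. Once a school is chosen, none of its students is left out.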

If you were in charge of randomly selecting people passing through a DUI checkpoint to determine if they were under the influence of drugs and/or alcohol, how would you proceed? You would probably use what is called systematic sampling. At the DUI checkpoint, you pull over, say, every 5th car. This would eliminate any bias you have in choosing which cars to stop.


A Systematic Sample is a sample in which data values are ordered and every nth item is selected.

A potential problem with systematic sampling arises if there is a pattern in the data that matches the frequency used. For example, if a worker who works five days a week times the commute on every fifth workday, they would always time the commute on the same day of the week. If on that day a school had a different start time than on the other days, that could affect the commute, and the results would not be a good indication of the commute time.
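Both the every-nth-item rule and the pattern problem can be seen in a short sketch. The helper name `systematic_sample`, the 30 numbered cars, and the four work weeks are assumptions made for the illustration.

```python
def systematic_sample(items, step, start=0):
    """Return every `step`-th item, beginning at index `start`."""
    return items[start::step]

# DUI checkpoint: cars numbered in the order they arrive (hypothetical).
cars = list(range(1, 31))
stopped = systematic_sample(cars, 5, start=4)
print(stopped)  # [5, 10, 15, 20, 25, 30]

# Commute example: four 5-day work weeks, recorded by weekday name.
weekdays = ["Mon", "Tue", "Wed", "Thu", "Fri"] * 4
timed_days = systematic_sample(weekdays, 5)
print(set(timed_days))  # {'Mon'}: the step matches the weekly pattern
```

Because the step of 5 equals the length of the work week, every timed commute falls on the same weekday, which is exactly the problem described above.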

In Convenience Sampling, the sample is chosen in a convenient fashion.

These samples are generally not reliable and are rife with the potential for bias.

In Voluntary Sampling, participants decide for themselves whether or not to participate.

Generally, voluntary sampling produces a bad sample. If a call-in radio show is soliciting opinions on a topic on which people have highly polarized opinions, who is going to call? Those people who have very strong opinions.

1.1.1 Exercises

Identify the sampling technique used.

1. A high school principal wants to randomly select a group of students. The school consists of 22% seniors, 26% juniors, 27% sophomores and 25% freshmen. The principal randomly selects 22 seniors, 26 juniors, 27 sophomores and 25 freshmen.

2. An agriculture inspector is inspecting potatoes from a shipment. The inspector grabs the first 5 potatoes that they see.

3. While working at an assembly plant that fills milk cartons, the inspector grabs every tenth carton coming off the assembly line.

4. To get their sample, a medical researcher advertises that they are doing a study to reduce anxiety and will pay subjects to participate.

5. The California State Lottery mixes numbered balls in a big hopper and 5 of the balls are ejected one at a time.

6. A beer manufacturer wants to check if the cans are properly filled and randomly selects a case.

7. The overseer of a fast food chain randomly selects 20 stores and determines whether or not the employees at the 20 stores are all properly trained.

8. A statistics student has an assignment to determine whether or not students use tobacco products, and so they ask 20 people that pass by them whether or not they use tobacco products.


9. A company that makes carabiners tests them for breaking strength. The test engineer reaches into a box containing the carabiners and grabs a handful.

10. A health care provider solicits comments by providing forms and a box in which patients can leave their comments.

11. A barista is interested in how much people tip, on average, and so the barista watches every tenth customer to see what tip is left.

12. While listening to music on their mp3 player in shuffle mode, a music lover looks at the times of the next 10 songs that come up.

13. An insurance company insures a diverse population: 30% are married, 40% have no points, 50% are women. A sample of 100 is selected and 30 are married, 40 have no points, and 50 are women.

14. The owners of a sports arena want to poll residents in a city where an arena is proposed to be located. The owners put up signs all over town with a website address where residents can express their opinions.

15. In order to see how high school seniors do on a standardized test, an administrator selects several seniors so that the proportion of students from single-parent households, as well as the ethnicity and gender of the group, matches that of all seniors at the high school.

16. To estimate the average amount customers spend at a store, the manager selects the next 10 customers.


1.2 Variable Types

Once we have collected our sample it is time to collect the data. Let’s say we go out and collect data from several students at a local college. The data might look like this:

Subject  Age      Height    Number of  Number of  Relationship  Transport
         (years)  (inches)  Children   Classes    Status        mode
#1       32       63        2          3          Married       Bus
#2       22       70        0          4          Single        Bike
...      ...      ...       ...        ...        ...           ...

The variables here are: Age, Height, Number of Children, Number of Classes, Relationship Status, and Transportation mode. There is a natural break in the variables: Age, Height, Number of Children, and Number of Classes versus Relationship Status and Transportation mode. Don’t see why? Look at the data. For the first 4 variables, the data are all numeric and the last two are non-numeric. These are called quantitative and qualitative, respectively.

A Quantitative Variable is a variable whose data is quantitative, that is, numeric.

A Qualitative (or categorical) Variable is a variable whose data is qualitative (or categorical), that is, non-numeric.

There is also a way to split the quantitative variables/data further. That split is not as obvious: Age and Height versus Number of Children and Number of Classes. The distinction is what is called continuous versus discrete. The Number of Children can only be 0, 1, 2, etc. The same is true of the Number of Classes. With height, however, a person can be 62.3 inches or 61.2 inches, etc. Subject #1 didn’t go to bed 62 inches tall and wake up 63 inches tall. The height increased continuously. Same with age. You can be 4 1/2 years old. Well, not again.

Variables
  Quantitative
    Continuous
    Discrete
  Qualitative
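The numeric versus non-numeric split can be checked mechanically for the two subjects shown in the table. The sketch below is an illustration only: the function name `variable_kind` is made up, and the rule it applies (every recorded value is a number) is a simplification of the definitions above.

```python
# Two subjects from the sample table in this section.
subjects = [
    {"Age": 32, "Height": 63, "Number of Children": 2, "Number of Classes": 3,
     "Relationship Status": "Married", "Transport mode": "Bus"},
    {"Age": 22, "Height": 70, "Number of Children": 0, "Number of Classes": 4,
     "Relationship Status": "Single", "Transport mode": "Bike"},
]

def variable_kind(values):
    """Label a variable by whether all of its recorded values are numbers."""
    if all(isinstance(v, (int, float)) for v in values):
        return "quantitative"
    return "qualitative"

kinds = {
    name: variable_kind([subject[name] for subject in subjects])
    for name in subjects[0]
}
for name, kind in kinds.items():
    print(f"{name}: {kind}")
```

A check like this cannot separate discrete from continuous. Height is recorded here as a whole number of inches, yet it is continuous, so that distinction depends on what values the variable could take rather than on how the data happen to be written down.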

1.2.1 Exercises

Classify the following as Qualitative, Quantitative and Discrete, or Quantitative and Continuous.

1. The number of grammatical errors in a randomly selected student’s essay.

2. The speed of a randomly selected car on a highway.

3. The number of left-handed people in a randomly selected class.

4. The type of car a randomly selected person drives.

5. The number of spam emails a person receives on a randomly selected day.

6. A randomly selected person’s favorite type of food.

7. The time a randomly selected person spends watching videos per day.

8. How many calls a tow truck driver gets in a randomly selected day.

9. How long it takes for a randomly selected person to run a mile.

10. A person’s favorite streaming service.


1.3 Observational Studies and Experiments

Do you own a pet? According to a report, people who own pets live longer than those who don’t own pets.¹ The article notes that the conclusions were based on a long-term sample of 1500 people. Hmmm. Sounds convincing enough. But is it? We don’t doubt that the study found that pet owners live longer than their non-pet-owner counterparts. What readers of the article need to consider is the implication of the study: ‘If I want to live longer, I should get a pet!’ Although this may be true, we cannot infer causation here. It could be that people who are disposed to live longer like pets better.

The above example is what is called an observational study. In it we are simply ‘observing’ different variables to see if they are related. If we wanted to conclude that owning a pet causes people to live longer, we would need to do an experiment.

In an experiment one variable, called the independent variable or explanatory variable, is controlled while another variable, called the dependent variable or the response variable, is observed. For the pet/longevity example, the experimenter would need to randomly split the subjects into two groups: one that gets a pet and another that doesn’t get a pet. This would be impossible to do. Just try to tell a pet owner that they can’t have their pets. Or try to tell a pet hater that they are getting a pet! This is one reason we need to rely on observational data at times.

In an experiment, the researcher controls one variable, called the treatment, while observing one or more other variables.

In an observational study, the researcher simply observes the values of the variables without effecting a change in any variable.

An experiment can establish causality, whereas in an observational study causality is extremely difficult to establish. Consider smoking and lung cancer. The causation took a long time to be accepted. It was established with observational data. What would need to happen if you wanted to perform an experiment to establish causality?

Example 1.3.1.

For the following, determine which would be more appropriate: an observational study or an experiment.

1. How much a person washes their face and the amount of acne they have.

2. Whether or not a person has attached earlobes² and hearing loss.

Solution.

1. How much a person washes their face and the amount of acne they have.

¹ https://www.verywellhealth.com/pets-and-longevity-2223874 Accessed 10/14/19

² Earlobes, not ears. This is a recessive genetic trait.


This would be more appropriate for an experiment. We can get volunteers to participate and randomly divide them into groups, with the subjects in each group instructed how often to wash their face. The amount of acne would then be measured. Although an experiment would be best, this is an example where an observational study is probably going to be used. An observational study is often less expensive than an experiment.

2. Whether or not a person has attached earlobes and hearing loss.

In an experiment we are changing one variable and observing the other. We cannot assign people to different earlobe groups: we are born with earlobes attached or unattached. Thus, an observational study is more appropriate.

An experiment is the preferred method in drug studies. The researcher splits the subjects into different groups, and the treatments differ among the groups. In those studies, the researcher does not want the subjects to know whether or not they are given the drug. All patients are treated exactly the same with the exception of the treatment. Patients who aren’t receiving the drug receive a placebo: a non-therapeutic pill, shot, or whatever form the treatment group’s treatment takes. A placebo is sometimes referred to as a sugar pill, although this is a misnomer. What we have is what is called a randomized double-blind experiment.

In a Randomized Experiment, subjects are randomly assigned to their treatment group(s).

In a Quasi-Experiment, subjects are not randomly assigned to their treatment group(s).

In a Double-Blind Experiment, neither the subject nor the person administering the treatment knows what type of treatment, if any, the subject is receiving.

The intent of the randomized double-blind experiment is to eliminate as much bias as possible. If researchers were to assign subjects to treatments, their biases could come into play and affect the experiment. Likewise, if the patient knows if they are receiving the drug, their mind might influence their reporting of side-effects.
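Random assignment can be sketched in a few lines of Python. This is only an illustration, not a procedure from the text: the subject IDs, the group names, and the helper `assign_groups` are all hypothetical, and a real study would follow its own protocol.

```python
import random

def assign_groups(subjects, group_names, seed=None):
    """Randomly split subjects into groups of (nearly) equal size."""
    rng = random.Random(seed)      # a seed makes the assignment reproducible
    shuffled = list(subjects)      # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    groups = {name: [] for name in group_names}
    # Deal the shuffled subjects out like cards, one group at a time.
    for i, subject in enumerate(shuffled):
        groups[group_names[i % len(group_names)]].append(subject)
    return groups

# Hypothetical subject IDs S01..S20 (not data from the text).
subjects = [f"S{i:02d}" for i in range(1, 21)]
groups = assign_groups(subjects, ["drug", "placebo"], seed=2020)
for name, members in groups.items():
    print(name, sorted(members))
```

Because chance, not the researcher, decides who lands in each group, the researcher's preferences cannot influence the assignment.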

Not all experiments can be double-blind. If an experimenter is comparing two therapies for a broken ankle, one where the patient wears a cast and the other where the patient uses a walking boot, a double-blind experiment will not be possible. Why?


[Figure: flowchart for classifying a study. Is the independent variable manipulated? No: Observational Study. Yes: Are groups randomly assigned? Yes: Experiment. No: Quasi-Experiment.]

1.3.1 Exercises

Classify the following as better suited for an observational study or an experiment for the variables listed. If an observational study, explain why. If an experiment, briefly explain the process.

1. Average daily sun exposure and skin cancer rates.

2. Daily high summer temperature in Bakersfield and household electric consumption.

3. A person’s race and their likelihood of developing heart disease.

4. A person’s fat intake and their cholesterol level.

5. Amount of water used in concrete and its breaking strength.

6. Age of popcorn kernels and the percent of kernels that don’t pop.

7. The snow pack in spring and the water level in a lake.

8. The average wave height and the number of surfers at the Point.

9. A person’s birth order and their mental health.

10. The blood alcohol content of a person who was in an accident and the cost of the damages.

11. The amount of rain a region receives during winter/spring and the fire insurance payouts in that region the following summer.

12. Comparing breast cancer rates of women whose mothers had breast cancer and those whose mothers did not have breast cancer.

13. Whether or not a person takes a new drug being developed and their level of anxiety.

14. A new type of bed and how well a person sleeps at night.


Chapter 2

Describing Data Graphically


2.1 Describing Qualitative Data Graphically

One of the things we are going to want to do is to summarize a data set. We begin with describing qualitative data.

Consider the following situation: you have asked several registered voters whether or not they support the reelection of the incumbent congressional representative. You have coded the data as: F for in Favor, O for Opposed, U for Undecided, and N for No reply.

F O U F N F F O U U N N O O F F F O U O F O O F U
O O F U N O O F F O U U N O F F F F O F O O U U N
F O O U U N N O O N U U N N O F F O O O N U U O F

At a glance it is impossible to know how the incumbent stands: does the data support a reelection or not? If we summarize the data the answer should be obvious. We will start with a Frequency Distribution.

A Frequency Distribution is a listing of all the different outcomes along with the frequencies (counts).

Before we proceed with construction of our frequency distribution, we need to remember that our goal here is to summarize the data and make the data usable for the reader. Since this is the case, we want to make sure that any display we produce can stand on its own. In other words, the reader shouldn’t have to sort through the reading to figure out what the display is all about. We proceed.

For the data above, we have only four different possibilities: F, O, U, and N. We need to count the number of each reply. There are 21 F’s, 26 O’s, 16 U’s, and 12 N’s.
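Counting by hand is error-prone, so the tally can be checked with a short Python sketch. The variable names and the `labels` mapping are illustrative choices, not part of the text; the data is the coded list of responses given above.

```python
from collections import Counter

# The 75 coded responses from the text, in the order listed.
responses = (
    "F O U F N F F O U U N N O O F F F O U O F O O F U "
    "O O F U N O O F F O U U N O F F F F O F O O U U N "
    "F O O U U N N O O N U U N N O F F O O O N U U O F"
).split()

# Readable labels for each code (wording chosen for this sketch).
labels = {
    "F": "For the Incumbent",
    "O": "Against the Incumbent",
    "U": "Undecided",
    "N": "Non-responsive",
}

counts = Counter(responses)  # maps each code to how often it appears
for code in "FOUN":
    print(f"{labels[code]:<22} {counts[code]:>3}")
print(f"{'Total':<22} {sum(counts.values()):>3}")
```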

How are you Planning on Voting in the Congressional Race?

Response                 Number of Voters
For the Incumbent        21
Against the Incumbent    26
Undecided                16
Non-responsive           12

We can now see that the representative doesn’t have the majority required to ensure reelection if this sample is representative of the population.¹

We have two variations on the frequency distribution: the percentage distribution and the relative frequency distribution. (The relative frequency is the decimal form of the percent.)

These two are simply the frequency distribution with the frequencies replaced with the percentages or the relative frequencies, whichever is appropriate.

Example 2.1.1.

¹ It probably isn’t. The sample size is very small.


Using the information from above construct a percentage distribution and a relative frequency distribution.

Solution.

If we add the frequencies from the table above we will see that we have a total of 75 voters. To get the percentages we divide the values by the total and then multiply by 100%.

For the first category we have

$$\frac{21}{75} \times 100\% = 28.0\%$$

For the other three we have 34.7%, 21.3%, and 16.0%. Our percentage distribution is

How are you Planning on Voting in the Congressional Race?

Response                 Percent of Voters
For the Incumbent        28.0%
Against the Incumbent    34.7%
Undecided                21.3%
Non-responsive           16.0%

n = 75

How are you Planning on Voting in the Congressional Race?

Response                 Relative Frequency of Voters
For the Incumbent        .280
Against the Incumbent    .347
Undecided                .213
Non-responsive           .160

n = 75

We have also added the total. In looking at a distribution such as this, we should include the sample size, n. This gives us some sense of how reliable the percentages are. If we noted that 75% of people asked were in favor we might think ‘landslide’, but if this is based on only four people asked, we are not as impressed as if a thousand² were asked. Instead of including n, some polls include the margin of error. That will be covered in a later chapter. Further note that the two distributions here are essentially the same. You can go back and forth from one to the other by moving the decimal point.

We can see that the incumbent should either get to work on getting some votes or start looking for a new job.
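The conversion from frequencies to relative frequencies and percentages can be sketched as follows. The names are illustrative, and the rounding (three decimals for relative frequencies, one for percentages) is chosen to match the tables above.

```python
# Frequencies from the frequency distribution in the text.
frequencies = {
    "For the Incumbent": 21,
    "Against the Incumbent": 26,
    "Undecided": 16,
    "Non-responsive": 12,
}
n = sum(frequencies.values())  # sample size

# For each response: (relative frequency, percentage).
distribution = {
    response: (round(count / n, 3), round(100 * count / n, 1))
    for response, count in frequencies.items()
}

for response, (rel_freq, percent) in distribution.items():
    print(f"{response:<22} {rel_freq:.3f} {percent:5.1f}%")
print(f"n = {n}")
```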

Once we have our data summarized in a distribution, we would like to get a ‘picture’ of the data. We will discuss two here: the pie chart and the bar chart.

² Pretty close to the sample size a lot of polls use.


2.1.1 Pie Charts

In a pie chart, we take a ‘pie’ and slice it into pieces that correspond to the different categories, with the size of each wedge proportional to that category’s share of the sample (or population). In the case above, the F category had 28% of the sample so it gets 28% of the pie, etc.

Example 2.1.2.

Draw a pie chart using the information above.

Solution.

We need to determine how many degrees each piece gets. For F, we need 28% of 360°.

28% of 360° = 100.8°

34.7% of 360° = 124.9°

21.3% of 360° = 76.7°

16% of 360° = 57.6°
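The same arithmetic can be sketched in Python. The helper name `slice_degrees` is hypothetical; the percentages are the rounded values from the percentage distribution.

```python
# Rounded percentages from the percentage distribution in the text.
percentages = {
    "For the Incumbent": 28.0,
    "Against the Incumbent": 34.7,
    "Undecided": 21.3,
    "Non-responsive": 16.0,
}

def slice_degrees(percent):
    """Central angle, in degrees, for a slice covering `percent` of the pie."""
    return round(percent / 100 * 360, 1)

angles = {response: slice_degrees(p) for response, p in percentages.items()}
for response, angle in angles.items():
    print(f"{response:<22} {angle:6.1f} degrees")
print(f"{'Total':<22} {round(sum(angles.values()), 1):6.1f} degrees")
```

Working from the exact counts instead (for example 26/75 of 360) gives 124.8 and 76.8 degrees for the middle two slices; the small differences come from rounding the percentages first.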

Where we begin with our slices doesn’t matter. By measuring from the right side of a circle we get the following.

[Figure: pie chart titled "How are you Planning on Voting in the Congressional Race?" with slices For the Incumbent 28%, Against the Incumbent 34.7%, Undecided 21.3%, and Non-responsive 16%.]

Notice in the graph we have included labels, a title, and the percentages. We want the graph to be able to be understood without looking at the original information given.

2.1.2 Bar Charts

An alternative graphical display would be a bar chart. In a bar chart we draw bars whose height is determined by the frequency (or percentage, or relative frequency).


Example 2.1.3.

Use the voter data to draw a bar chart using the frequencies.

Solution.

The graph is included here. Note that the axes are clearly labeled, the graph has a title, and the scale on the vertical axis is linear. The bars are labeled and the vertical axis starts at 0. Also note that the widths of the bars are all the same. If the widths are not the same, your brain is confused about whether it is looking at the height of the bars or the areas. If the widths are equal, your brain doesn’t need to worry about that.

Further, if we had used the relative frequencies or the percentages, our graph would still have the same shape. The only difference would be the scale on the vertical axis.

[Figure: bar chart titled "Voter Preference: Voters in Favor of Reelecting the Incumbent"; vertical axis "Number of Voters" from 0 to 30; bars, in order, for For, Against, Undecided, and No response.]

A variation of the bar chart is the Pareto chart. It is a bar chart with the bars going from tallest to shortest. One advantage of the Pareto chart is that it removes the arbitrary nature of picking which categories go first, second, etc.
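The ordering step for a Pareto chart can be sketched as follows. This is a text-only illustration with made-up variable names; it sorts the voter categories by frequency and prints a rough bar for each rather than drawing a real chart.

```python
# Frequencies from the voter data, using the bar labels from the chart.
frequencies = {"For": 21, "Against": 26, "Undecided": 16, "No response": 12}

# Pareto order: categories sorted from largest frequency to smallest.
pareto_order = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)

for label, count in pareto_order:
    # One '#' per voter gives a rough horizontal bar.
    print(f"{label:<12} {'#' * count} {count}")
```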


[Figure: Pareto chart titled "Voter Preference: Voters in Favor of Reelecting the Incumbent"; vertical axis "Number of Voters" from 0 to 30; bars, in order, for Against, For, Undecided, and No response.]

2.1.3 Exercises

1. According to Beer Marketer’s Insights, 2019, in 2018 Anheuser-Busch owned 40.8% of the United States beer market, MillerCoors had 23.5%, Constellation had 9.9%, Heineken USA had 3.5%, Pabst Brewing had 2.1%, and the remaining 20% went to other domestics and imports.

(a) Summarize the information in a percentage distribution.

(b) Draw a pie chart.

(c) Draw a Pareto chart.

2. Madison Bumgarner, an MLB pitcher, has four types of pitches (with percent thrown): a changeup (7.7%), curveball (22.8%), two-seamer (34.3%), and cutter (35.2%).

(a) Summarize the information in a percentage distribution.

(b) Draw a pie chart.

(c) Draw a bar chart.

3. On the game Family Feud, 99 adults were asked to ‘Name something people try to kill by using poison’. 68 said Rats/Mice, 13 said Bugs/Ants, 11 said Weeds, and 7 said a Cheating Mate.

(a) Summarize the information in a percentage distribution.

(b) Draw a pie chart.

(c) Draw a bar chart using frequencies.


4. According to cbpp.org, 15% of your tax dollars go to Defense, 24% to Social Security, 16% to Medicare/Medicaid, 9% to Safety net programs, and 7% to Interest on debt. The rest goes to Others.

(a) Summarize the information in a percentage distribution.

(b) Draw a pie chart.

(c) Draw a Pareto chart.

5. According to the International Bartenders Association, a Mai Tai is a drink which contains 40 ml light rum, 20 ml dark rum, 15 ml orange curacao, 15 ml orgeat syrup, and 10 ml lime juice.

(a) Summarize the information in a percentage distribution.

(b) Draw a pie chart.

(c) Draw a Pareto chart.

6. According to the UCS Satellite Database, ucsusa.org, there are 2062 operating satellites. Of those, 901 are from the United States, 153 are from Russia, 299 are from China, and 709 are from other countries.

(a) Summarize the information in a percentage distribution.

(b) Draw a pie chart.

(c) Draw a bar chart using frequencies.

7. Of the 901 United States satellites orbiting the Earth, 38 are civil, 523 are commercial, 164 are government, and 176 are military.

(a) Summarize the information in a percentage distribution.

(b) Draw a pie chart.

(c) Draw a bar chart using frequencies.

8. A May 2019 Pew poll asked teens what their favorite online platform was. 32% said YouTube,15% said Instagram, 35% said Snapchat, 10% said Facebook. Group the remaining as ‘Other’.

(a) Summarize the information in a percentage distribution.

(b) Draw a pie chart.

(c) Draw a Pareto chart.

9. In the 2017-18 school year, Cabrillo College transferred 252 students to the University of California. 124 of those transfers went to the Santa Cruz campus, 28 to Santa Barbara, 17 to San Diego, 16 to Los Angeles, 29 to Davis, 30 to Berkeley, and the remainder to the other campuses.

(a) Summarize the information in a percentage distribution.

(b) Draw a pie chart.


(c) Draw a Pareto chart using frequencies.

10. The US military has 5 branches: the Army, Marine Corps, Navy, Air Force, and Coast Guard. In February 2018 the breakdown of the military was 471,513 in the Army, 184,427 in the Marine Corps, 325,802 in the Navy, 323,222 in the Air Force, and 42,042 in the Coast Guard.

(a) Summarize the information in a percentage distribution.

(b) Draw a pie chart.

(c) Draw a bar chart using percentages.

11. According to the 2017 Annual Report of the California DUI Management Information System, in 2015 there were 130,468 driver’s licenses suspended or revoked in California due to alcohol use. There were 9,074 .01 zero-tolerance suspensions, 86,933 first-offender suspensions, 31,093 repeat-offender suspensions, and 3,368 repeat-offender revocations.

(a) Summarize the information in a frequency distribution.

(b) Draw a pie chart.

(c) Draw a bar chart using the percentages.

12. According to Wikipedia, in the Gulf War (August 1990 to February 1991) there were 146 US troops killed, 92 Senegal troops killed, 47 United Kingdom troops killed, 24 Saudi Arabia troops killed, and 32 troops from other countries killed.

(a) Summarize the information in a frequency distribution.

(b) Draw a pie chart.

(c) Draw a bar chart using frequencies.

13. nps.gov reports that in the American Civil War, there were 110,100 Union soldiers killed in battle, 224,580 died due to diseases, 275,174 were wounded in action, and 30,192 died while prisoners of war.

(a) Summarize the information in a percentage distribution.

(b) Draw a pie chart.

(c) Draw a Pareto chart using frequencies.

14. nps.gov reports that in the American Civil War, there were 94,000 Confederate soldiers killed in battle, 164,000 died due to diseases, 194,026 were wounded in action, and 31,000 died while prisoners of war.

(a) Summarize the information in a percentage distribution.

(b) Draw a pie chart.

(c) Draw a Pareto chart using frequencies.


2.2 Describing Quantitative Data Graphically

Just like with qualitative data, we need to be able to present the data in a way that is useful to the reader. It needs to be organized. We will organize the data as we did before with a few more details that need to be ironed out.

On June 8th and June 13th, 2002, volunteers of the Pelagic Shark Research Foundation³ collected, among other fish, Myliobatis californica, or bat ray, and made measurements. One of the measurements was the total length. Several lengths, in cm, are given:

27 36 22 26 26 26 33 36 22 22 34 29 39 28 31 25 35 22 20 44

27 34 30 26 34 27 28 28 28 24 29 22 24 36 33 33 34 37 34 35

We would like to organize this. We would like to create a frequency distribution like we did with the election data when we discussed qualitative data. We need to make classes. With the qualitative data, the classes were obvious (count the ‘For’s’, etc.). For quantitative data, we need to create the classes. We will proceed by first deciding how many different classes to have. Generally speaking, we want 6 to 12 classes. The more data we have, the more classes we can have.

By scanning the data we notice that the minimum value is 20 and the maximum value is 39. If we think about putting the data on a number line, the data is 19 cm across (= 39 − 20). This is also called the range; more on the range later. Since we only have 40 data values, not a lot, we will shoot for about 6 classes.

$$\text{Class Width} = \frac{\text{range}}{\text{number of classes}} = \frac{19}{6} \approx 3.17$$

If we look at the data we will notice that the data values are all measured to the nearest whole number. We would like our classes to also be whole numbers. We will choose to round this to 3. Since 20 is the minimum value, we will start with 20. The first class will be 20 to 22. Since the data values are rounded off, this is really getting bat rays with a length of 19.5 to 22.5 cm. This is the width of 3 we want.
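The class-building steps can be sketched in Python under the assumptions stated in the text: a minimum of 20, a maximum of 39, and a target of about 6 classes. The function `build_classes` is a hypothetical helper, and rounding the width to the nearest whole number is just the choice made above, not the only reasonable one.

```python
def build_classes(minimum, maximum, target_classes):
    """Return (raw width, rounded width, list of (lower, upper) class limits)."""
    data_range = maximum - minimum
    raw_width = data_range / target_classes
    # Whole-number data gets a whole-number width; never let it drop below 1.
    width = max(1, round(raw_width))
    classes = []
    lower = minimum
    while lower <= maximum:  # keep adding classes until the maximum is covered
        classes.append((lower, lower + width - 1))
        lower += width
    return raw_width, width, classes

raw_width, width, classes = build_classes(20, 39, 6)
print(f"raw width = {raw_width:.2f}, rounded width = {width}")
for lower, upper in classes:
    print(f"{lower}-{upper}")
```

Rounding the width down from 3.17 to 3 means seven classes, not six, are needed to reach 39, which is why the table has a 38-40 class.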

Let us construct our table and count the data values.

Total Length of Bat Rays Collected at Elkhorn Slough, June 8, 13, 2002

Total Length    Number of Bat Rays
20-22            6
23-25            3
26-28           11
29-31            4
32-34            8
35-37            6
38-40            2

³ pelagic.org/slough/index.html


2.2.1 Histograms and Polygons

Our next goal is to get a graph of the data. To do this we will draw a histogram. A histogram is similar to a bar chart. One big difference with histograms is that they are representing quantitative data. As such, the horizontal axis will be numeric, whereas in a bar chart, it is non-numeric. Since the lengths of the bat rays are continuous, we would like to see that in our graphs. To realize this, we will not have any gaps between the bars. Gaps would suggest discrete data.

Since the lengths for the bat rays were measured to the nearest cm, we have some apparent gaps in the table. For example we have 22 to 23 missing, 25 to 26, etc. They aren’t really missing. The lengths need to be measured to some digit. In this case the data is measured to the nearest cm, so if a bat ray is 25.8 cm long, it is classified as being 26 cm. With this in mind, we will use the values halfway between classes (22.5, 25.5, etc.) when we create a graph.

[Figure: histogram titled "Bat Ray Lengths, Elkhorn Slough, June 8, 13, 2002"; horizontal axis "Total Length, cm" from 20 to 42; vertical axis "Number of Bat Rays" from 0 to 14.]

Notice that, like the qualitative graphs, we can see what information the graph is trying to convey without having to read the problem. It has a title, the axes are labeled, the scales are linear. Further, the bars are touching and they start and end on the class boundaries, not the class limits. Along the horizontal axis, you will notice 2 small diagonal lines. These lines indicate a break in the scale. If we required the axis to start at 0, we would have a lot of ‘dead’ space.

So what can we get from the graph? It appears that there is a lot of up-down. If you look at the scale, the big jumps are only a few rays. To make a statement about the distribution of the lengths of the rays would not be very reliable. The graph is based on a sample of only 40 rays. If you are the one catching, tagging, measuring, and releasing the rays, 40 may seem like a lot. From a statistical standpoint, it isn’t very large. Finally, we used the frequencies along the vertical axis. We could just as easily have used percentages or relative frequencies.


Before we proceed with our next graphical display, some terminology is in order. For quantitative data we have class limits, class boundaries, class midpoints, and class widths. Additionally, we have frequencies, relative frequencies, and percents. The frequencies, relative frequencies, and percents are done exactly as is done with qualitative data.

Class limits are the minimum and maximum possible measured data values. Class boundaries are the minimum and maximum possible values the data can actually be. Midpoints are the middle possible value of the class. The class width is the distance between consecutive class midpoints.

In the frequency distribution above, the last class was 38-40. These are the class limits. The actual possible lengths of the fish for this class are between 37.5 and 40.5. These are the class boundaries. The class midpoint is the average of the class limits or boundaries. So the class midpoint is 39. The class width is the difference of the class boundaries. In this case it is 40.5 − 37.5 = 3, as we started with.

We can add all of this to our frequency distribution, if we like.

Class     Class         Class                   Relative
Limits    Boundaries    Midpoint    Frequency   Frequency   Percent
20-22     19.5-22.5     21           6          .150        15
23-25     22.5-25.5     24           3          .075         7.5
26-28     25.5-28.5     27          11          .275        27.5
29-31     28.5-31.5     30           4          .100        10
32-34     31.5-34.5     33           8          .200        20
35-37     34.5-37.5     36           6          .150        15
38-40     37.5-40.5     39           2          .050         5

This is awfully busy. We include it here for the sake of completeness. We use the boundaries on our histograms. In the histograms, and other graphical displays, we can use either the frequency, relative frequency, or percent. Furthermore, we will use the midpoints in the next graphical display, the polygon.
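The columns of the expanded table can be computed from the class limits and frequencies alone. The sketch below uses illustrative names and assumes, as in the text, that the data is measured to the nearest whole cm, so each boundary sits half a unit beyond the corresponding limit.

```python
# Class limits and frequencies from the bat ray frequency distribution.
class_limits = [(20, 22), (23, 25), (26, 28), (29, 31), (32, 34), (35, 37), (38, 40)]
frequencies = [6, 3, 11, 4, 8, 6, 2]
n = sum(frequencies)  # 40 bat rays

rows = []
for (lower, upper), freq in zip(class_limits, frequencies):
    rows.append({
        "limits": (lower, upper),
        "boundaries": (lower - 0.5, upper + 0.5),  # half a unit beyond each limit
        "midpoint": (lower + upper) / 2,           # average of the class limits
        "frequency": freq,
        "relative": freq / n,
        "percent": 100 * freq / n,
    })

for row in rows:
    lo, hi = row["limits"]
    b_lo, b_hi = row["boundaries"]
    print(f"{lo}-{hi}  {b_lo}-{b_hi}  {row['midpoint']:4.0f}  "
          f"{row['frequency']:2d}  {row['relative']:.3f}  {row['percent']:4.1f}")
```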

2.2.2 Frequency Polygons

Another graphical display we will discuss is the Polygon. The polygon is made up of linear segments connecting the top center of each bar in a histogram, if it were there. We can either construct a histogram first and then ‘connect the dots’ or we can find the points, plot them, and then draw the line segments. The points we are looking for are of the form

(Midpoint of class, Number in the class)

In the following table, the midpoints have been added. Also, the classes before and after the classes with data have been included. We will need them for the graph.


Total Length    Midpoint    Number of Bat Rays
17-19           18           0
20-22           21           6
23-25           24           3
26-28           27          11
29-31           30           4
32-34           33           8
35-37           36           6
38-40           39           2
41-43           42           0

[Figure: frequency polygon titled "Bat Ray Lengths, Elkhorn Slough, June 8, 13, 2002"; horizontal axis "Total Length, cm" from 18 to 42; vertical axis "Number of Bat Rays" from 0 to 14.]

In the table above, we included the classes 17-19 and 41-43 even though they have no data values in the class. If we don’t include those classes, our polygon will be ‘floating’ above the horizontal axis. We didn’t have this issue with the histograms. If there were no data values in a class, there was no bar there.
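The points for the polygon, including the two empty end classes, can be sketched as follows. The variable names are illustrative; the midpoints, frequencies, and class width of 3 come from the tables above.

```python
# Midpoints and frequencies of the classes that contain data.
midpoints = [21, 24, 27, 30, 33, 36, 39]
frequencies = [6, 3, 11, 4, 8, 6, 2]
width = 3  # class width, so neighboring midpoints are 3 apart

# Pad with one empty class on each side so the polygon returns to the axis.
points = (
    [(midpoints[0] - width, 0)]
    + list(zip(midpoints, frequencies))
    + [(midpoints[-1] + width, 0)]
)

for midpoint, count in points:
    print(f"({midpoint}, {count})")
```

Plotting these points in order and joining consecutive points with line segments gives the frequency polygon.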

2.2.3 Describing Distributions

In the graphs we looked at before (histograms and polygons), the histograms had a ‘stair-step’ look to them and the polygons had the ‘connect-the-dots’ look. Imagine that we had many more data values than we have: we could then use more classes. This would reduce the effects we note in the graphs. If we continue, we will get what looks like a curve.

Not only do we want to summarize distributions using a graph, we would also like to describe them verbally. The adage ‘a picture is worth a thousand words’ is true when we look at graphs.


We are not going to try to describe it in all detail. We have a few standard descriptors that we will employ.

In the bat ray example, we are noticing the up-down effect of having not many data values. The bat ray data has been extended to include all data from 2001 to 2007. The graph is given.

[Figure: histogram titled "Bat Ray Lengths, Elkhorn Slough, 2001-2007"; horizontal axis "Total Length, cm" from 0 to 140; vertical axis "Number of Bat Rays" from 0 to 120.]

The break on the horizontal axis is gone: the dead space isn’t as much of an issue here, so we chose to remove the break. Also, the scales have changed. We have many more bat rays (n = 564), so the vertical axis needs to accommodate the larger frequencies, and the horizontal axis extends much further to include some larger specimens that were captured. They are easy to overlook on the graph: the graph has a few small rectangles for the now non-empty classes.

Even though it would have been more convenient to not include the larger data values, this is poor statistical practice. Hence, they are included.

When we look at the graph, the graph seems to rise to a peak and then trail off more gradually than it rose. This is what we call ‘skewed right’. We also have a few bat rays that are long for a bat ray in the slough. These are what are called outliers. An outlier is a data value which is very large or very small for the data set. More on those later. The graphs below show the basic types of distributions. Not all distributions follow one of these: some defy description.

[Figure: four example distribution shapes: Uniform Distribution, Bell Shaped Distribution, Skewed Right Distribution, Skewed Left Distribution.]

When we are describing our distributions, we cannot expect them to look exactly like the graphs above. The larger bat ray data set would be described as skewed right. The smaller data set we can’t really classify using one of the terms above.

2.2.4 Exercises

1. The speeds of vehicles on Highway 1 are measured and the speeds in mph are summarized below.

Speed (mph)    Number of vehicles
55-57           3
58-60          15
61-63          26
64-66          43
67-69          32
70-72          21
73-75          15
76-78           9
79-81           5
82-84           2

(a) For each class find the class boundaries, class midpoints, percentages, and relative frequencies.

(b) Using the percentages, draw a histogram.

(c) Using the frequencies, draw a polygon.

(d) Describe the histogram as skewed left, skewed right, bell shaped, or none of these.

2. At Mountain Middle School, the PE teacher has students run as part of their exercise regimen. The times to run a sprint are in the table.

Time (seconds)    Number of Students
8-11               8
12-15             46
16-19             34
20-23             12
24-27              2


(a) For each class find the class boundaries, class midpoints, percentages, and relative frequencies.

(b) Using the percentages, draw a histogram.

(c) Using the frequencies, draw a polygon.

(d) Describe the histogram as skewed left, skewed right, bell shaped, or none of these.

3. The 2018 Big Sur International Marathon race times for women are summarized below.⁴

Times (hr:min)    Number of Runners
3:00 - 3:30        45
3:30 - 4:00       184
4:00 - 4:30       330
4:30 - 5:00       457
5:00 - 5:30       316
5:30 - 6:00       282
6:00 - 6:30        67

(a) For each class find the class midpoints, percentages, and relative frequencies.

(b) Using the relative frequencies, draw a histogram.

(c) Using the frequencies, draw a polygon.

(d) Describe the histogram as skewed left, skewed right, bell shaped, or none of these.

4. The 2018 Big Sur International Marathon race times for men are summarized below.⁵

Times (hr:min)    Number of Runners
2:00 - 2:30         1
2:30 - 3:00        26
3:00 - 3:30       165
3:30 - 4:00       329
4:00 - 4:30       357
4:30 - 5:00       338
5:00 - 5:30       225
5:30 - 6:00       186
6:00 - 6:30        34

(a) For each class find the class midpoints, percentages, and relative frequencies.

(b) Using the relative frequencies, draw a histogram.

(c) Using the frequencies, draw a polygon.

(d) Describe the histogram as skewed left, skewed right, bell shaped, or none of these.

5. The populations of the 50 states of the US are summarized in the table below. The data is from the 2000 Census. Draw a histogram and comment on the shape of the distribution.

⁴ Several finished after 6:00 without official times and were included in the last class.
⁵ Several finished after 6:00 without official times and were included in the last class.


Population                  Number of States
0 - 3,000,000               22
3,000,000 - 6,000,000       15
6,000,000 - 9,000,000        6
9,000,000 - 12,000,000       2
12,000,000 - 15,000,000      2
15,000,000 - 18,000,000      1
18,000,000 - 21,000,000      2
21,000,000 - 24,000,000      0
24,000,000 - 27,000,000      0
27,000,000 - 30,000,000      0
30,000,000 - 33,000,000      0
33,000,000 - 36,000,000      1

6. The area of each state in the Continental United States, in square miles, is summarized in the table.

Area (mi²)           Number of States
0 - 50,000           19
50,000 - 100,000     23
100,000 - 150,000     4
150,000 - 200,000     1
200,000 - 250,000     0
250,000 - 300,000     1

(a) Draw a frequency histogram. Using a ruler, make each bar in the histogram between half an inch and an inch wide.

(b) Describe the histogram as skewed left, skewed right, bell shaped, or none of these.

(c) Alaska has an area of 663,000 mi². How far off your paper would a bar need to be to accommodate Alaska if it were part of the data set?

7. According to the Roller Coaster Database, rcdb.com, in November 2019, the heights of operating roller coasters with reported heights in North America are given in the following table. FYI: The tallest is Kingda Ka in New Jersey at 456 feet. By comparison, the Giant Dipper in Santa Cruz is 70 feet tall.

Height (feet)    Number of Roller Coasters
1-50             191
51-100           174
101-150          127
151-200           34
201-250           25
251-300            0
301-350            4
351-400            0
401-450            2
451-500            1

(a) Draw a frequency histogram.


(b) Draw the frequency polygon for the data.

(c) Describe the histogram as skewed left, skewed right, bell shaped, or none of these.

8. According to an online Beatles blog, the lengths of Beatles songs are listed in the table below.

Length (seconds)    Number of Songs
1-60                  4
61-120               24
121-180             134
181-240              34
241-300               9
301-360               2
361-420               1
421-480               2
481-540               1

(a) Draw a frequency histogram.

(b) Draw the frequency polygon for the data.

(c) Describe the histogram as skewed left, skewed right, bell shaped, or none of these.

9. The total annual rainfall amounts for the years 1916 through 2015 are given (a few years have missing reports) in inches per year.

20.29  9.77 35.74 20.01 25.14 28.25 22.94  8.71 23.76 18.37
24.28 14.92 14.46 16.45 11.02 24.50 15.12 12.46 19.87 21.04
26.38 25.97 14.71 24.14 36.59 24.80 20.82 18.55 21.63 18.66
13.28 17.84 15.76 19.30 23.62 27.78 19.38 14.26 17.96 26.63
14.40 29.82 15.51 17.07 10.53 18.52 25.41 15.31 21.31 14.38
31.93 14.09 28.22 20.78 18.48 13.00 31.79 32.14 17.10  8.49
11.07 29.40 19.01 26.65 16.92 39.02 36.78 19.82 19.16 28.25
14.39 13.36 16.58 14.81 18.34 19.05 29.24  5.14 22.66 21.32
19.68 21.52 22.27 15.44 29.30 24.91  8.90 10.14 11.73 18.39
25.48 16.35 13.09  9.28 17.11 24.80

(a) Construct a frequency distribution for the annual rainfall in Watsonville.

(b) Draw the frequency histogram for the data.

(c) Draw the frequency polygon for the data.

(d) Describe the histogram as skewed left, skewed right, bell shaped, or none of these.

10. The following are the heights, in cm, of the first 45 Presidents. (There are only 44 data values because one President was elected to two nonconsecutive terms.)

188 170 189 163 183 171 185 168 173
183 173 173 175 178 183 193 178 173
174 183 188 180 168 170 178 182 180
183 178 182 188 175 179 185 192 182
183 177 185 188 188 182 185 191


(a) Construct a frequency distribution for the heights of US presidents.

(b) Draw the frequency histogram for the data.

(c) Draw the frequency polygon for the data.

(d) Describe the histogram as skewed left, skewed right, bell shaped, or none of these.


2.3 Dotplots and Stem and Leaf Displays

There are several different ways to display data graphically. We are only looking at a few of them. The next one we will look at is the dotplot. As the name suggests, we will look at the data individually as dots. Unlike the histograms and polygons where we looked at graphs of summary data, in dotplots we will be able to see the individual data values. Since it is a graphical display, we would like the graph to stand alone, so it needs a title, a labeled axis, etc.

2.3.1 Dotplots

Example 2.3.1.

The 2019 Oscar nominations for Best Picture are: “Black Panther”, “BlacKkKlansman”, “Bohemian Rhapsody”, “The Favourite”, “Green Book”, “Roma”, “A Star Is Born”, and “Vice”. The running times for these movies are, respectively, 2 h 14 min, 2 h 15 min, 2 h 14 min, 1 h 59 min, 2 h 10 min, 2 h 15 min, 2 h 16 min, and 2 h 12 min. Construct a dotplot of the running times of the Best Picture nominees.

Solution.

We start with a linear scale. Notice in the plot the scale is in hours and minutes. This is a more natural way to display the time than converting to minutes only. Also note that each movie gets its own dot. When there are two or more data values that are the same, we simply stack them as shown in the plot. Notice that one of the data values is much smaller than the others. This is what is called an outlier. Right now, we have no way to objectively determine if a data value is an outlier or not.

[Figure: dotplot titled “Running Times of 2019 Best Picture Oscar Nominees”; horizontal axis: running time, hours:minutes, from 1:58 to 2:18.]

An outlier is a data value which is very large or very small relative to the rest of the data set.
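The stacking idea from Example 2.3.1 can be sketched in a few lines of Python. This is only an illustration, not how the plot above was produced: the function name text_dotplot is made up for this sketch, and the running times are converted to minutes (for example, 2 h 14 min becomes 134) purely to make the counting easy.

```python
from collections import Counter

# Running times from Example 2.3.1, converted to minutes (an assumption
# made only so the values are easy to count; the text plots hours:minutes).
running_times = [134, 135, 134, 119, 130, 135, 136, 132]

def text_dotplot(values, low, high):
    """Return the rows of a text dotplot with one column per minute from low to high."""
    counts = Counter(values)
    tallest = max(counts.values())
    rows = []
    # Build the top row first so repeated values appear as stacked dots.
    for level in range(tallest, 0, -1):
        row = "".join("*" if counts[v] >= level else " " for v in range(low, high + 1))
        rows.append(row.rstrip())
    return rows

for row in text_dotplot(running_times, 118, 138):
    print(row)
```

The lone star near the left edge of the bottom row is the 1 h 59 min movie, which is the data value the solution points out as being far from the rest.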

Let us look at two dotplots together.

Example 2.3.2.

The scores for the Super Bowls of the 2009-2018 seasons are: 31-17, 31-25, 21-17, 34-31, 43-8, 28-24, 24-10, 34-28, 41-33, and 13-3.

Construct a dotplot with the winning and losing scores. What do you observe?

Construct a dotplot for the winning point differential (winning score - losing score).

Solution.

The winning scores are: 31, 31, 21, 34, 43, 28, 24, 34, 41, and 13.
The losing scores are: 17, 25, 17, 31, 8, 24, 10, 28, 33, and 3.

[Figure: stacked dotplots titled “Super Bowl XLIV thru LIII Scores”, one row for the winning scores and one row for the losing scores; horizontal axis: score, from 0 to 45.]

Notice that the winning scores tend to be greater than the losing scores. We can see that because the ‘dots’ are further to the right for the winning scores. There are no outliers for either group and neither distribution shows any great deal of skewness.

The differential scores are 14, 6, 4, 3, 35, 4, 14, 6, 8, and 10.
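The subtraction behind these differentials can be checked with a short Python sketch. The variable names are chosen here for illustration; the scores are the ones listed in the solution.

```python
# Winning and losing scores from Example 2.3.2, in the order given.
winning = [31, 31, 21, 34, 43, 28, 24, 34, 41, 13]
losing = [17, 25, 17, 31, 8, 24, 10, 28, 33, 3]

# Pair each winning score with the losing score from the same game
# and subtract to get the point differential.
differentials = [w - l for w, l in zip(winning, losing)]

print(differentials)
```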

[Figure: dotplot titled “Super Bowl XLIV thru LIII Final Score Differential”; horizontal axis: winning score - losing score, from 0 to 40.]

For this, we have an outlier of 35 (a sad day for Bronco fans). Furthermore, there are no data values which are negative. Why?

When we are looking at a dotplot, we are looking for several things. We can see if the data values are clustered in any particular place, we can assess skewness, etc. If you look back at the dotplot of running times in Example 2.3.1, you will notice the data values are all clustered to the right with the exception of one on the left. It is ‘far away’ from the rest of the data. This is what is called an outlier.

2.3.2 Stem and Leaf Displays

One of the disadvantages of looking at a frequency distribution of a data set is we lose the actual data values. When we have a class of, say, 10-12 and we say there are 4 data values in that class, what are the values? Although we lose the individual data values, we gain an overall idea of the distribution as a whole. One way to get the best of both worlds, when possible, is to construct a stem and leaf plot.


Consider the data set

25 30 32 67 40 55 53 76 48 31
32 58 42 66 60 51 42 39 61 48
38 27 43 34 52 60 44 31 20 38

For each data value we will split it into two parts: a stem and a leaf. In this case we will split the data values between the tens place and the ones place, so for the first data value, 25, we get

Stem | Leaf
   2 | 5

If we look at the data values we will have the stems 2 through 7. We list them vertically, in order, and then to the right of each stem we list all of its leaves and we get:

2 | 5 7 0
3 | 0 2 1 2 9 8 4 1 8
4 | 0 8 2 2 8 3 4
5 | 5 3 8 1 2
6 | 7 6 0 1 0
7 | 6

We can sort the leaves and we get the following display:

2 | 0 5 7
3 | 0 1 1 2 2 4 8 8 9
4 | 0 2 2 3 4 8 8
5 | 1 2 3 5 8
6 | 0 0 1 6 7
7 | 6

To see the shape of the distribution, think of this as a sideways histogram. We can see that the distribution could be described as skewed right. This is a fairly efficient way to sort data as well.
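The splitting and sorting steps above can be sketched in Python. This is a minimal illustration under the same assumption as the text (two-digit values split between the tens place and the ones place); the function name stem_and_leaf is made up for this sketch.

```python
from collections import defaultdict

# The 30 data values from the start of this section.
data = [25, 30, 32, 67, 40, 55, 53, 76, 48, 31,
        32, 58, 42, 66, 60, 51, 42, 39, 61, 48,
        38, 27, 43, 34, 52, 60, 44, 31, 20, 38]

def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted list of leaves (ones digits)."""
    display = defaultdict(list)
    # Sorting first means the leaves arrive in order for every stem.
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        display[stem].append(leaf)
    return dict(display)

for stem, leaves in sorted(stem_and_leaf(data).items()):
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Because the values are sorted before they are split, the printed rows match the ordered display rather than the first, unsorted one.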

There are times when splitting at the tens place gives too few stems to show the shape of the data. We then need to modify the display slightly, as we will see in the next example.

Example 2.3.3.

Recall the data for the bat rays from before: On June 8th and June 13th, 2002, volunteers of the Pelagic Shark Research Foundation collected Myliobatis californica, or bat ray, and made measurements. One of the measurements was the total length. Several lengths, in cm, are given below. Construct a stem and leaf display of the lengths.

27 36 22 26 26 26 33 36 22 22 34 29 39 28 31 25 35 22 20 44
27 34 30 26 34 27 28 28 28 24 29 22 24 36 33 33 34 37 34 35

Solution.


If we look at the data, almost all the data have stems of 2 or 3. One way to deal with this is to split the classes up. We will have two stems with 2’s. The first will take leaves that are 0-4, and the other leaves that are 5-9. We do the same for the other stems and proceed.

The leaves here are in the order given in the data.

Bat Ray Lengths, cm, Collected June 8, 13, 2002

2 | 2 2 2 2 0 4 2 4
2 | 7 6 6 6 9 8 5 7 6 7 8 8 8 9
3 | 3 4 1 4 0 4 3 3 4 4
3 | 6 6 9 5 6 7 5
4 | 4

Here is the display with the leaves ordered.

Bat Ray Lengths, cm, Collected June 8, 13, 2002

2 | 0 2 2 2 2 2 4 4
2 | 5 6 6 6 6 7 7 7 8 8 8 8 9 9
3 | 0 1 3 3 3 4 4 4 4 4
3 | 5 5 6 6 6 7 9
4 | 4
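The split-stem idea from Example 2.3.3 can be sketched the same way. In this hypothetical helper, each stem gets two rows: one for leaves 0-4 and one for leaves 5-9. The name split_stem_and_leaf and the (stem, half) keys are choices made for this sketch only.

```python
# Bat ray lengths, in cm, from Example 2.3.3.
bat_ray_lengths = [27, 36, 22, 26, 26, 26, 33, 36, 22, 22,
                   34, 29, 39, 28, 31, 25, 35, 22, 20, 44,
                   27, 34, 30, 26, 34, 27, 28, 28, 28, 24,
                   29, 22, 24, 36, 33, 33, 34, 37, 34, 35]

def split_stem_and_leaf(values):
    """Map (stem, half) to sorted leaves; half 0 holds leaves 0-4, half 1 holds 5-9."""
    display = {}
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        key = (stem, 0 if leaf <= 4 else 1)
        display.setdefault(key, []).append(leaf)
    return display

for (stem, _half), leaves in sorted(split_stem_and_leaf(bat_ray_lengths).items()):
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Rows with no data (here, the 5-9 half of stem 4) are simply not printed in this sketch.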

2.3.3 Exercises

1. While on vacation, a vacationer spent, on food, the following amounts per day, in dollars: 26.31, 18.56, 43.65, 19.22, 25.63, 31.20, 14.59. Draw a dotplot using the data and comment on the data.

2. A basketball player records the number of points scored in the last several games. The scores were: 16, 13, 2, 18, 20, 15, 19, 15. Draw a dotplot using the data and comment on the data.

3. At an assembly line, workers assemble computers. At one stage of the process, the time to complete the task is too long, so employees are getting retrained. The times, in seconds, to complete the stage before and after the training are recorded:

Before: 125, 136, 129, 133, 141, 155, 135

After: 118, 123, 122, 126, 126, 138, 150, 124

(a) Draw two dotplots using the same axis and scale.

(b) Does it appear as if the training has helped? Explain.

4. A restaurant owner owns two pizza shops: UberThick and UltraThin. The numbers of pizzas sold for several days at each restaurant are below.

UberThick: 56, 84, 99, 67, 77, 66, 81, 103


UltraThin: 46, 55, 90, 74, 66, 50, 71

(a) Draw two dotplots using the same axis and scale.

(b) Based on the dotplots, which store seems busier?

5. Following are the numbers of non-fatal shark attacks in the United States for the years 1987 to 2016. Use the data to draw a stem and leaf display. Put the leaves in order and describe the distribution, if possible.

53 56 55 52 57 38 33 31 52 53

45 39 30 45 43 51 45 29 21 27

23 39 17 18 17 16 15 23 12 13

6. The numbers of US fatalities from hurricanes for the years 1990 through 2019 are given below.

(a) One of the data values is clearly an outlier. Which data value?

(b) Remove the data value and construct an ordered stem and leaf display for the remaining data.⁶

42, 10, 35, 18, 36, 74, 55, 69, 553, 45,

22, 126, 81, 67, 38, 35, 54, 55, 40, 41,

94, 130, 68, 26, 30, 69, 33, 39, 39, 53

7. The numbers of tornadoes in the US for the years 1990 through 2019 are given. Construct a stem and leaf plot.

1,482 1,123 1,418   976 1,178   928   903   939 1,703 1,282
1,159 1,692 1,092 1,103 1,265 1,817 1,374   934 1,215 1,075
1,339 1,424 1,148 1,173 1,235 1,082 1,173 1,297 1,132 1,133

8. At The Spinning Wheel, a bicycle shop, the owner checks the tire pressure and records the pressures, in kilopascals (kPa). The pressures follow. Construct a stem and leaf plot. Determine if there are any outliers and describe the distribution.

331 263 214  85 609 407 252 289  70 122
580 147 489 593 807 390 135 379 278 241

⁶ Usually we don’t remove data values simply because they are outliers.


Chapter 3

Numerical Descriptors of Data


3.1 Measures of Central Tendency

One of our goals in statistics is to describe a data set. The most common description of a data set is a measure of central tendency, or what the ‘middle’ of the data set is. The three most common are the mean, median, and mode. The mean is what most people think of when they hear the word ‘average’. The median is the ‘middle’ data value, and the mode is the most common data value.

Definition: The mode is the most common data value for a data set.

Definition: The median for a data set with an odd number of data values is the data value that is in the middle of the data sorted from least to greatest.

Definition: The median for a data set with an even number of data values is the average of the two data values closest to the middle of the data sorted from least to greatest.

Example 3.1.1.

Consider the data set: 34, 27, 63, 27, 66, 53, 70. Find the mode and the median of the data set.

Find the median of the data set: 3, 6, 8, 11, 12, 15.

Solution.

For the first data set, the mode is 27 since it occurs twice and all other data values only occur once.

The median is the middle data value in a sorted list. For the first data set we need to first put the data values in order: 27, 27, 34, 53, 63, 66, 70. The data value in the middle is 53.

For the second data set, there isn’t a ‘middle’ data value, so we average the two middlemost data values. In this case, they are 8 and 11, so our median is 9.5.
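As a quick check of Example 3.1.1, Python’s standard statistics module has mode and median functions that follow the same definitions. The variable names below are chosen for this sketch.

```python
from statistics import median, mode

# The two data sets from Example 3.1.1.
first_data_set = [34, 27, 63, 27, 66, 53, 70]
second_data_set = [3, 6, 8, 11, 12, 15]

# mode returns the most common value; median sorts the data internally and,
# for an even number of values, averages the two middle values.
print(mode(first_data_set))
print(median(first_data_set))
print(median(second_data_set))
```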

The third measure of the ‘middle’ is the mean. The mean (more precisely the arithmetic mean) is what most people refer to as the average. One of the goals of this text is to ultimately make inferences about a population based on a sample. (More on this later.) This being the case, we need to distinguish between the mean of a population and the mean of a sample. Each has its own notation.

Definition: The population mean, denoted µ (read ‘mu’, which rhymes with ‘new’), or \(\mu_X\) (read ‘mu sub X’) if we want to emphasize the variable X, is given by the formula

\[ \mu = \frac{\sum X}{N} \]

where N is the population size.

Definition: The sample mean, denoted \(\bar{X}\), read ‘X bar’, is given by the formula

\[ \bar{X} = \frac{\sum X}{n} \]

where n is the sample size.

In the last two definitions note the distinction between the population and sample sizes, N and n, respectively. Generally, populations will be described with upper case and Greek letters.

Example 3.1.2.

All of the students in a seminar report the following weights, in pounds.

126, 135, 186, 164, 152

Find the mean weight for this data set.

Solution.

Since this is all of the students, the population formula is appropriate. In this case N = 5 and \(\sum X = 126 + 135 + 186 + 164 + 152 = 763\), so the mean would be

\[ \mu = \frac{763}{5} = 152.6 \text{ pounds.} \]

With all measures of central tendency, we need to include the same units as the units on the data. If we want the average weight, we wouldn’t say 152.6, we would say 152.6 pounds. In everyday speech, the units are often left off when it is clear what units are appropriate. We will make it a point to include units.

Example 3.1.3.

A sample of five students in a seminar report the following weights, in pounds.

126, 135, 186, 164, 152

Find the mean weight for this data set.

Solution.

Since this is a sample of the students, the sample formula is appropriate. In this case n = 5 and \(\sum X = 126 + 135 + 186 + 164 + 152 = 763\), so the mean would be

\[ \bar{X} = \frac{763}{5} = 152.6 \text{ pounds.} \]

Note that the numbers in the last two examples were exactly the same and the means we got in both cases were exactly the same. This will always be the case in a similar situation. It is important to distinguish between the two means. What we want to know is µ. This is often very difficult or impossible to calculate. Since we want to know µ but can’t, the next best thing would be an estimate of it. This is how we want to think about \(\bar{X}\): it is our best estimate of µ.
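The arithmetic in Examples 3.1.2 and 3.1.3 is the same, so one small Python sketch covers both. The function name mean_of is made up here; whether the result is called µ or the sample mean depends only on whether the list is the whole population or a sample.

```python
# Weights, in pounds, from Examples 3.1.2 and 3.1.3.
weights = [126, 135, 186, 164, 152]

def mean_of(values):
    """Add the data values and divide by how many there are."""
    return sum(values) / len(values)

print(sum(weights))       # the sum of the data values
print(mean_of(weights))   # the mean weight, in pounds
```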

Example 3.1.4.

An instructor requires a minimum number of words in written work. The instructor estimates the average number of words per line in an essay to be 16 words. The essay consisted of 56 lines. Estimate the total number of words.


Solution.

First note that we will only be able to estimate the total number of words. We have

\[ \mu = \frac{\sum X}{N} \]

In this case µ is the average number of words per line, \(\sum X\) is what we get when we add the number of words in each line together (this gives us the total number of words), and N is the total number of lines. Substituting into the formula gives

\[ 16 = \frac{\sum X}{56} \]

Solving for \(\sum X\) we get \(\sum X = 16 \times 56 = 896\) words. Since the mean is a crude estimate, we will say that the essay is about 900 words.
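The rearranged formula, total = mean × count, is easy to sketch in Python. The function name estimate_total is hypothetical, and rounding to the nearest hundred is just one way to express ‘about 900’.

```python
def estimate_total(mean_per_item, item_count):
    """Rearranged mean formula: the sum of the data equals the mean times the count."""
    return mean_per_item * item_count

# Example 3.1.4: about 16 words per line over 56 lines.
total_words = estimate_total(16, 56)

# Round to the nearest hundred, since the mean was only a rough estimate.
rounded_words = round(total_words, -2)

print(total_words, rounded_words)
```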

We have three measures of central tendency. Each has its own advantages and disadvantages. The mode and the median are both easy to calculate. A disadvantage of the mode is that it can give a false sense of the ‘middle’. In the first example, we had a mode which was the smallest data value. An advantage of the mean is the fact that it uses all of the data values in its calculation. A disadvantage of the median is that it only uses one or two data values in its calculation. For data sets which tend to have outliers, the median is preferred over the mean.

Example 3.1.5.

Consider the data set: 543, 468, 681, 795, 862, which represents the selling prices of 5 homes in an area, in thousands of dollars. The mean is $669,800 and the median is $681,000.

Consider what happens when we add an outlier to the data set and get: 543, 468, 681, 795, 862, 2656. The mean has changed significantly to about $1,000,833, while the median is now $738,000.

Although both the median and the mean changed, the median didn’t change as much. It is customary to report the median rather than the mean for data sets such as incomes and selling prices of homes. This is a much more useful measure of central tendency. If the mean price of a home were reported each month, for example, the mean would bounce up and down, giving the impression that typical home prices were changing when they weren’t.
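The effect of the outlier in Example 3.1.5 can be seen by computing both measures before and after it is added. This sketch uses Python’s standard statistics module; the variable names are chosen for illustration.

```python
from statistics import mean, median

# Selling prices from Example 3.1.5, in thousands of dollars.
prices = [543, 468, 681, 795, 862]
prices_with_outlier = prices + [2656]

for label, values in [("without outlier", prices), ("with outlier", prices_with_outlier)]:
    # Report both measures side by side so the change in each is easy to compare.
    print(label, "mean:", round(mean(values), 3), "median:", median(values))
```

The mean moves by roughly 331 thousand dollars, while the median moves by 57 thousand dollars.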

3.1.1 Exercises

1. For the population with data values 5, 6, 8, 4, 6, 10, 16, find the mean, median, and mode.

2. For the population with data values 56, 45, 48, 66, 52, 45, 81, 22, find the mean, median, and mode.

3. For the population consisting of 15.6, 23.4, 46.8, 22.9, 18.8, find the mean, median, and mode.

4. For the population with data values 11.5, 12.6, 16.5, 12.4, 11.9, 34.2, find the mean, median, and mode.

5. Joey Chestnut’s claim to fame is as a competitive eater. He won Nathan’s Hot Dog Eating Contest several times. To win, contestants eat as many hot dogs and buns as they can in 10 minutes. The numbers of hot dogs, with buns, eaten by Joey to win the contest were 71, 74, 72, 70, 61, 69, 68, 62, 54, 68, and 59.¹ Find the mean, median, and mode number of hot dogs eaten to win the contest. Find the average time it took to eat one hot dog.

¹ He also ate 66 to win when the contest was 12 minutes long, if you care.


6. At a Rubik’s Cube contest, contestants were asked to solve the Rubik’s Cube five times; the contestant with the lowest average is declared the winner. Feliks Zemdegs of Australia solved five different problems in 7.16, 5.04, 4.67, 6.55, and 4.99 seconds and currently holds the record for the shortest average. Find the mean and median times it took Feliks to solve the cube.

7. In California, the numbers of DUI convictions for the years from 2005 to 2014 were: 140,879, 156,595, 160,591, 169,053, 161,074, 148,042, 142,121, 133,525, 121,304, and 116,190. Find the mean and median number of DUI convictions per year for the years 2005 to 2014.

8. In several randomly chosen years, the annual rainfall in Watsonville, CA is reported to be 25.14, 11.02, 36.59, 23.62, 10.53, 18.48, 16.92, 18.34, 19.68, and 25.48. Find the mean, median, and mode of the annual rainfall for the given sample.

9. There are 8 national parks in Alaska. The areas, in acres, are 669,650, 1,750,716, 2,619,816, 3,223,383, 3,674,529, 4,740,911, 7,512,897, and 8,323,146. Find the mean and median sizes of the national parks in Alaska.

10. In the 1990s the total Dungeness crab harvests, in pounds, were: 10,369,518, 4,246,044, 8,327,150, 11,958,039, 13,491,363, 9,236,191, 12,331,365, 9,908,520, 10,692,760, and 8,713,823. Find the mean and median amount harvested for the years.

11. Seven apples are selected from a large orchard. The weights, in ounces, of the apples are 7.5, 6.3, 8.1, 5.9, 7.7, 6.4, and 6.0. Find the mean weight of the apples.

12. An instructor records the number of students observed using their phones in class. The numbers of students from 9 days were 5, 6, 8, 4, 3, 6, 6, 4, 8. What is the mean number of students observed using their phones in class?

13. A class of 25 students was asked how much money they had in cash on them. The average was $16.53. How much did the class have altogether?

14. At a burger shack, an average of 3.56 hamburger patties were needed per order for the day. There were 200 orders. How many hamburger patties were used that day?

15. The average amount of money spent by 5 visitors to a fair was $52.43. If 4 of the visitors spent $50.31, $47.80, $62.36, and $44.98, how much did the fifth visitor spend?

16. The average weight of 6 newborn babies is 7.6 pounds. Five of the babies weighed 9.5, 6.2, 5.8, 7.8, and 8.1 pounds. Find the weight of the sixth baby.

17. A student needs an average of 80 on their exams. The grades are based on 4 scores. If the scores of 3 of the exams are 76, 81, and 73, what score is required by the student to get an average of 80? If all tests are out of 100, is it possible for the student to get an average of 90? What is the minimum average the student can get? What is the maximum average the student can get?

18. A gardener is moving sand. The gardener needs to move an average of 50 pounds of sand in 5 trips to the garden. On 4 trips the amounts of sand were 64.2, 49.8, 53.2, and 51.5 pounds. How much sand must the gardener move on the last trip? If the gardener can move at most 90 pounds on the fifth trip, what are the maximum and minimum amounts the average can be?


3.2 Measures of Position, Box and Whisker Plots.

Imagine you are considering the purchase of a used car. The asking price is $5,000. Is it a good deal? The only way to decide this is to compare the prices of similar cars. You are ultimately trying to assess if the price is high or low. We could use the mean and the standard deviation to determine if the price is right. Another way is to look at percentiles and quartiles. Let’s start with quartiles. Sort the prices of cars from high to low. Determine the median. This cuts the data set in half. Let us assume that the median for the price of cars is $5,300. So far, it looks like a good deal. Half of similar cars cost more than $5,300.

Now let’s find the median of the upper and lower halves of the data set. The median of the lower half, which we will call Q1, is $4,900 and the median of the upper half, which we call Q3, is $5,700. The median is also called Q2. We can visualize the prices by locating them on a number line. The price of the car you are looking at is indicated with the asterisks (*).

As you can see from the graph, more than 25% of the prices of similar cars are less than the asking price of the car you are looking at.

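[Figure: number line marking Q1 = 4,900, the median = 5,300, and Q3 = 5,700, with 25% of the prices in each of the four intervals; the asking price is marked with asterisks between 4,900 and 5,300.]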

Let us look at what is called a box and whisker plot, or simply a box plot. Suppose further that the minimum price of all such cars is $4,400 and the maximum price is $6,300. We construct a box that consists of vertical bars indicating the location of the quartiles. The ‘whiskers’ go from the box to the most extreme non-outliers. We might classify the asking price as a good deal but not a great deal. (Of course “good” and “great” are subjective and aren’t defined here.) The asking price of $5,000 is indicated with an asterisk (*) below the axis.

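[Figure: box and whisker plot titled “Price of Used Car”; horizontal axis: cost ($100), from 42 to 64; the asking price is marked with an asterisk (*) below the axis.]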

We have the obligatory linear scale in the graph. We have also written the numbers in hundreds of dollars, as indicated below the axis. The distribution appears symmetric and reasonably uniform. What would the box and whisker plot of a bell shaped distribution look like? What about skewed?

3.2.1 Identifying Outliers

Until now, we have not objectively assessed whether or not a data value was an outlier. We will do so now. Recall that an outlier is a very large or very small data value, relative to the data set.

We construct the box as before, but the whiskers only extend to the largest non-outlier and the smallest non-outlier. Any outliers are shown on the graph as dots or any convenient symbol.


Example 3.2.1.

Following are the areas of all 50 of the United States, in square miles, sorted from least to greatest reading down each column. Construct a box and whisker plot.

1,545    9,614   32,020   42,774   52,419   57,914   69,898   82,278    97,818   121,589
2,489   10,555   35,385   44,825   53,179   59,425   70,700   83,569    98,381   147,042
5,543   10,931   36,418   46,055   53,819   65,498   71,300   84,897   104,094   163,696
8,721   12,407   40,409   48,430   54,556   65,755   77,116   86,936   110,561   268,581
9,350   24,230   42,143   51,840   56,272   69,704   77,354   96,714   113,998   663,267

Solution.

Since the data values are in order, the sorting work has been eliminated. Now we need to find our five number summary. Recall that the position of the median is given by

\[ \frac{n+1}{2} = \frac{50+1}{2} = 25.5 \]

which tells us to average the 25th and 26th data values. So the median = Q2 = 57,093 (the average of 56,272 and 57,914). Now to find Q1 we need the median of the lower half of the data values. There are 25 data values, so the position of the median is given by

\[ \frac{n+1}{2} = \frac{25+1}{2} = 13 \]

The median of this half is in the thirteenth position, so Q1 = 36,418. Similarly, Q3 is the 13th data value of the upper half of the data (the last 25 values), so we get 84,897. Putting this all together we get:

minimum = 1,545
Q1 = 36,418
median = 57,093
Q3 = 84,897
maximum = 663,267

A data value is considered an outlier if it is more than 1.5 IQRs from the box, where the IQR (interquartile range) is the width of the box, Q3 − Q1. Here the IQR is 84,897 − 36,418 = 48,479.

[Figure: box and whisker plot titled “Areas of US States”; horizontal axis: area (thousands of square miles), from 0 to 700.]

Note in the graph we have three dots. These correspond to the areas of the three largest states by area: California, Texas, and Alaska (the largest). These data values are all considered outliers. However, the area of Alaska isn’t just big, it is very big. As a result we will refer to the data value as an extreme outlier. An extreme outlier is a data value that is more than 3 times the IQR away from the box.

To be considered an outlier, a data value must be greater than Q3 + 1.5 × IQR or less than Q1 − 1.5 × IQR. For this problem we have

\[ Q_3 + 1.5 \times \mathrm{IQR} = 84,897 + 1.5 \times 48,479 = 157,615.5 \]

So any data value that is above 157,615.5 is considered an outlier. From the data set we see that there are 3 outliers: 163,696, 268,581, and 663,267.

To be considered an extreme outlier, a data value must be greater than Q3 + 3 × IQR or less than Q1 − 3 × IQR.

Here we have

\[ Q_3 + 3 \times \mathrm{IQR} = 84,897 + 3 \times 48,479 = 230,334 \]

An extreme outlier would be any data value that is greater than 230,334. There are two: 268,581 and 663,267.

We did not do the similar calculations on the left hand side of the box. A data value must be more than 1.5 times the width of the box (= IQR) away from the box to be considered an outlier. If we are using a linear scale, we can see that the minimum value, 1,545, is not more than 1.5 times the width of the box away from the box. (Indeed, Q1 − 1.5 × IQR = 36,418 − 72,718.5 = −36,300.5, and no area can be negative.) As mentioned previously, whenever we have a graphical display of quantitative data we should think about the shape of the distribution. In this case we would describe the distribution as skewed right.
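The whole of Example 3.2.1 can be retraced with a short Python sketch. It follows the method used in the text, where Q1 and Q3 are the medians of the lower and upper halves. The function names are made up for this sketch, and leaving the middle value out of both halves when the number of values is odd is an assumption, since the example only covers an even count.

```python
from statistics import median

# Areas of the 50 states, in square miles, from Example 3.2.1.
areas = [
    1545, 2489, 5543, 8721, 9350, 9614, 10555, 10931, 12407, 24230,
    32020, 35385, 36418, 40409, 42143, 42774, 44825, 46055, 48430, 51840,
    52419, 53179, 53819, 54556, 56272, 57914, 59425, 65498, 65755, 69704,
    69898, 70700, 71300, 77116, 77354, 82278, 83569, 84897, 86936, 96714,
    97818, 98381, 104094, 110561, 113998, 121589, 147042, 163696, 268581, 663267,
]

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using medians of the two halves."""
    data = sorted(values)
    half = len(data) // 2
    # Assumption: with an odd count, the middle value belongs to neither half.
    return data[0], median(data[:half]), median(data), median(data[-half:]), data[-1]

def find_outliers(values, multiplier=1.5):
    """Return the values more than multiplier * IQR away from the box."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - multiplier * iqr, q3 + multiplier * iqr
    return sorted(v for v in values if v < low or v > high)

print(five_number_summary(areas))
print(find_outliers(areas))       # outliers (more than 1.5 IQRs from the box)
print(find_outliers(areas, 3))    # extreme outliers (more than 3 IQRs from the box)
```

Software packages often compute quartiles with slightly different rules, so their Q1 and Q3 may not match the hand method exactly.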

In general we have the following:

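[Figure: a generic box and whisker plot. The box runs from Q1 to Q3 with the median marked inside, and the whiskers extend to the smallest and largest non-outliers. Values between 1.5 × IQR and 3 × IQR away from the box are outliers; values more than 3 × IQR away from the box are extreme outliers.]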

Percentiles

For quartiles we divided the data set into 4 equal parts. There is no reason we can’t divide the data into more parts.² We could divide the data set into ten parts (deciles) if we chose. We have already divided a data set into two parts; this is what the median does. For data sets with a lot of data, the more common way is to divide the data set into 100 parts. Since we are not using ‘large’ data sets, we will discuss percentiles here but put off their use until later when we are studying the normal distribution.

Definition: The kth percentile, Pk, is the value below which k% of the data falls.

² Wikipedia lists several. Search for quantile.


The median, which divides the data set into two equal pieces, is P50. Q1 is the same as P25. For people’s heights, if someone was in the 97th percentile, you would call them ‘tall’. Only 3% of people are taller than that. Someone in the 5th percentile is ‘short’ because only 5% of people are shorter than that person.
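To make the definition concrete, here is a small Python sketch that finds what percent of a data set falls below a given value. It reuses the 30-value data set from Section 2.3.2, and the function name percent_below is made up for this sketch.

```python
# The 30 data values from the stem and leaf discussion in Section 2.3.2.
data = [25, 30, 32, 67, 40, 55, 53, 76, 48, 31,
        32, 58, 42, 66, 60, 51, 42, 39, 61, 48,
        38, 27, 43, 34, 52, 60, 44, 31, 20, 38]

def percent_below(values, x):
    """Percent of the data values that are strictly less than x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

print(percent_below(data, 60))
```

Since 24 of the 30 values are below 60, the value 60 sits at about the 80th percentile of this data set.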

3.2.2 Exercises

1. Consider the five-number summary of the visitors’ ages at the Happy Time Senior Center.

Minimum = 34
Q1 = 45
Median = 49
Q3 = 55
Maximum = 68

(a) Draw a box and whisker plot.

(b) Describe the distribution.

(c) Find a value that, if added to the data set would be an outlier but not extreme.

2. Consider the five-number summary for the weights, in ounces, of fish caught by a shark.

Minimum = 48.5
Q1 = 59.8
Median = 66.4
Q3 = 70.1
Maximum = 75.2

(a) Draw a box and whisker plot.

(b) Describe the distribution.

(c) Find a value that, if added to the data set would be an outlier but not extreme.

3. The pairs of shoes sold at the Cobbler’s Sole each day for 2018 had a minimum of 57 pairs sold. Q1 was 64 pairs, the median was 67 pairs, Q3 was 69 pairs, and the maximum was 75 pairs. Draw a box and whisker plot and describe the distribution.

4. Sleepy is a tiger in a zoo that sleeps most of the day. The daily sleep times, in hours, for the tiger in the month of June, 2018 had a minimum of 15.8 hours. Q1 was 17.1 hours, the median was 18.6 hours, Q3 was 19.9 hours, and the maximum was 21.6 hours. Construct a box and whisker plot and describe the distribution.

5. The management of the Super Pop Carmel store takes samples of packages of carmel corn and weighs them. The five-number summary of the weights, in ounces, is as follows.

Minimum = 15.84
Q1 = 16.02
Median = 16.33
Q3 = 16.73
Maximum = 17.09


(a) Draw a box and whisker plot.

(b) Describe the distribution.

(c) Find a value that, if added to the data set, would be an outlier but not extreme.

6. The Happy Joggers Club meets every week for a run along the coast. The five-number summary for the times to run the course, in minutes, follows.

Minimum = 38.5, Q1 = 42.8, Median = 48.5, Q3 = 54.9, Maximum = 68.7

(a) Draw a box and whisker plot.

(b) Describe the distribution.

(c) Find a value that, if added to the data set, would be an outlier but not extreme.

7. If Q1 = 110, the median = 120, and Q3 = 126, determine which of the following are outliers and classify any that are extreme: 64, 71, 89, 121, 138, 148, 152, 175. Draw the box and whisker plot.

8. If Q1 = 84, the median = 90, and Q3 = 96, determine which of the following are outliers and classify any that are extreme: 44, 49, 64, 86, 113, 130, 135. Draw the box and whisker plot.

9. Disneyland annual attendance has been tracked since it opened in 1955. The data given below lists the annual attendance, in millions of guests, for the years 1956-2018.[3]

 4     4.5   4.6   5.0   5.1   5.3   5.5   5.7   6.0   6.5
 6.7   7.8   9.1   9.2   9.3   9.4   9.5   9.8   9.8   9.8
 9.8   9.9  10.0  10.3  10.4  10.9  11.0  11.0  11.3  11.4
11.5  11.6  11.6  12.0  12.0  12.3  12.7  12.7  12.9  13.0
13.3  13.5  13.5  13.7  13.9  14.1  14.2  14.3  14.4  14.7
14.7  14.9  15.0  15.9  16.0  16.0  16.1  16.2  16.8  17.9
18.3  18.3  18.7

(a) Without using technology, find the five-number summary.

(b) Draw a box and whisker plot.

(c) Describe the distribution.

(d) If done correctly, there will be no outliers. In the first year the park opened, the attendance was light and would be considered an outlier. Find a value that could be the attendance for the first year. (Do not go look it up. That is not the point of the problem.)

(e) In the year 2000, the attendance was 13.9 million. Which of the following would be your best choice for 13.9: P20, P70, or P95? Explain.

[3] Since the park was only open for part of the year, 1955 attendance is not included.


10. The areas of the 58 counties of California are given below. The areas are in square miles and the data is in order.

   47    446    449    520    603    606    630    720     738     739
  754    828    948    953    958    966  1,008  1,012   1,020   1,151
1,258  1,291  1,315  1,389  1,390  1,399  1,407  1,451   1,495   1,576
1,640  1,712  1,846  1,929  2,138  2,236  2,554  2,738   2,951   3,044
3,179  3,304  3,322  3,509  3,573  3,786  3,944  4,060   4,175   4,204
4,558  4,824  5,963  6,287  7,208  8,142  10,192  20,062

(a) Without using technology, find the five-number summary.

(b) Draw a box and whisker plot.

(c) Describe the distribution.

(d) Classify any outliers.

(e) If you had to assign a percentile to the value of 6,287 square miles, which of the following would be your best choice: P10, P50, or P90? Explain.

11. The data below represent the enrollments for the Fall 2018 term for all colleges in the California Community College System according to the CCC Chancellor’s Office Management Information Systems Data Mart. (N=118)

   257   1,953   2,109   2,329   2,512   2,839   2,922   2,936   3,765
 3,885   4,005   4,267   4,470   4,628   4,961   4,989   5,454   5,607
 5,861   5,971   6,080   6,235   6,408   6,493   6,552   6,688   7,046
 7,158   7,220   7,672   7,689   8,256   8,341   8,347   8,466   8,566
 8,663   8,770   8,782   8,897   8,948   9,064   9,172   9,307   9,465
 9,505   9,588   9,595   9,640  10,039  10,441  10,727  10,789  11,069
11,243  11,477  11,820  11,869  11,909  11,917  12,503  12,538  12,794
13,023  13,147  13,154  13,220  13,347  13,355  13,410  13,710  14,089
14,107  14,429  14,945  14,996  15,001  15,082  15,438  15,606  15,706
15,950  16,515  16,573  17,037  17,313  17,420  17,611  18,111  18,364
18,888  19,476  19,669  19,753  19,983  20,207  20,730  20,902  20,964
21,225  21,247  22,344  23,045  23,259  24,323  24,396  24,496  24,819
24,951  24,985  25,498  27,064  29,267  30,476  30,854  36,411  36,574
37,359

(a) Without using technology, find the five-number summary.

(b) Draw a box and whisker plot.

(c) Describe the distribution.

(d) Classify any outliers.

(e) If you had to assign a percentile to the value of 4,005 students, which of the following would be your best choice: P10, P50, or P90? Explain.

12. The commercial landings of Dungeness crab for the years 1916-2001 are given in the table below. The numbers represent the pounds of crab commercially landed. The data have been sorted for your convenience.

   685,000     800,952     860,328   1,022,873   1,075,800   1,220,568
 1,296,912   1,304,904   1,506,816   1,563,006   1,619,280   1,627,753
 1,792,776   1,815,363   1,951,461   1,992,384   2,231,384   2,311,802
 2,315,338   2,414,110   2,433,987   2,580,840   2,934,776   2,960,712
 3,208,494   3,222,580   3,234,312   3,296,280   3,536,099   3,574,464
 3,680,188   3,768,081   3,873,600   3,934,663   4,246,044   4,260,340
 4,334,383   4,803,906   5,151,014   5,301,828   5,340,031   5,718,017
 5,953,361   6,119,320   6,210,359   6,476,494   6,857,070   6,973,679
 7,758,251   7,829,651   7,938,996   8,278,519   8,327,150   8,713,823
 9,236,191   9,362,197   9,624,368   9,662,265   9,908,520  10,369,518
10,435,441  10,692,760  10,733,398  11,115,476  11,297,696  11,568,353
11,704,648  11,711,327  11,716,488  11,892,891  11,958,039  12,331,365
12,376,390  12,978,505  12,997,451  13,491,363  14,320,549  14,876,148
15,413,589  15,726,774  15,934,778  16,015,581  17,262,261  17,282,766
19,118,484  33,647,863

(a) Without using technology, find the five number summary.

(b) Draw a box and whisker plot.

(c) Describe the distribution.

(d) Classify any outliers.

(e) Give a reasonable explanation as to why the data should not be symmetric.

(f) Which of the following would be the best estimate for P85: 2,000,000, 10,000,000, or 13,000,000 pounds? Explain.


3.3 Measures of Spread of a Data Set

We have already seen the mean, median, and mode as measures of central tendency. What we would like to do next is measure how spread out the data is. One way to do this is to see how far apart the minimum and maximum data values are. This is what is called the range of a data set.

range = maximum − minimum

Every week a commuter fills the gas tank and calculates the miles per gallon for that week. The mileages for 5 weeks are 26.1, 24.5, 24.6, 26.7, and 24.1 mpg.

Another commuter calculates their mileages to be 20.1, 23.9, 29.6, 30.1, and 22.3 mpg.

[Dot plot: Mileage, mpg, on an axis from 20 to 31, with one row of dots for the First Commuter and one for the Second Commuter, and a line marking the mean.]

A simple calculation shows that the means of the mileages for the two commuters are the same, 25.2 mpg (indicated with a line on the graph). But if we look at how spread out the data values are, there is a marked difference. The ranges of the two are 2.6 mpg (= 26.7 − 24.1) and 10.0 mpg (= 30.1 − 20.1). The mpg’s for the first commuter are much less spread out. We see that in the graph and also in the ranges.
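As a quick check of the arithmetic above, here is a minimal Python sketch that computes the mean and range for both commuters. The function names `mean` and `data_range` are chosen here for illustration and are not part of the text.

```python
# Weekly mileages (mpg) for the two commuters in the text.
first = [26.1, 24.5, 24.6, 26.7, 24.1]
second = [20.1, 23.9, 29.6, 30.1, 22.3]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def data_range(values):
    """Range = maximum - minimum."""
    return max(values) - min(values)


for name, data in [("first", first), ("second", second)]:
    # Round for display only; floating-point sums carry tiny errors.
    print(name, round(mean(data), 2), round(data_range(data), 2))
```

Both commuters should show the same mean but very different ranges, which is the point of the comparison.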

Although the range is an attractive option for measuring the spread of a data set, it has a couple of undesirable qualities: it only uses two data values in its calculation and it is very susceptible to outliers. Another way to measure the spread is with what is called the Mean Absolute Deviation (MAD). It is the average distance the data values fall from the mean.

For the commuters above the calculations are in the following table

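First commuter

$X$     $X - \bar{X}$   $|X - \bar{X}|$
26.1     0.9            0.9
24.5    -0.7            0.7
24.6    -0.6            0.6
26.7     1.5            1.5
24.1    -1.1            1.1
$\sum |X - \bar{X}| = 4.8$

Second commuter

$X$     $X - \bar{X}$   $|X - \bar{X}|$
20.1    -5.1            5.1
23.9    -1.3            1.3
29.6     4.4            4.4
30.1     4.9            4.9
22.3    -2.9            2.9
$\sum |X - \bar{X}| = 18.6$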

The first data value, 26.1, represents the mpg for the first week for the first commuter, and 0.9 tells us that this is 0.9 mpg more than the average (more because the number is positive). The second column contains the deviations, which are the signed distances a data value falls from the mean. A positive value tells you the data value is to the right of the mean and a negative value tells you the data value is to the left of the mean. The third column is simply the absolute values of the second column. If we add up the numbers in this column we get 4.8, which gets us a MAD = 0.96 mpg (= 4.8/5). This tells us that the average distance the data values are from the mean is 0.96 mpg for the first sample. The MAD = 3.72 mpg (= 18.6/5) for the second data set. We can see from the numbers and by looking at the data that the mileages for the first data set are more consistent. It is also evident from the dot plot.

$$\text{MAD} = \frac{\sum |X - \bar{X}|}{n} \quad \text{for a sample}$$

$$\text{MAD} = \frac{\sum |X - \mu|}{N} \quad \text{for a population}$$
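The MAD formula translates almost directly into code. The following is a small sketch in Python (the function name `mean_absolute_deviation` is an assumption made here, not something from the text), applied to the two commuters’ mileages.

```python
def mean_absolute_deviation(values):
    """Average distance of the data values from their mean.

    The sample and population versions use the same arithmetic:
    sum the absolute deviations and divide by the number of values.
    """
    center = sum(values) / len(values)
    return sum(abs(x - center) for x in values) / len(values)


first = [26.1, 24.5, 24.6, 26.7, 24.1]
second = [20.1, 23.9, 29.6, 30.1, 22.3]

# Round for display; the text reports 0.96 mpg and 3.72 mpg.
print(round(mean_absolute_deviation(first), 2))
print(round(mean_absolute_deviation(second), 2))
```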

Example 3.3.1.

A population consists of the data values: 8, 12, 16, 9, 10. Calculate the MAD for this popula-tion.

Solution.

Although this is a population and the previous example was for a sample, the calculations are done the same way. A simple calculation gives the mean to be 11.

$X$    $X - \mu$   $|X - \mu|$
8      -3          3
12      1          1
16      5          5
9      -2          2
10     -1          1
$\sum |X - \mu| = 12$

The MAD = 12/5 = 2.4

Although the MAD is fairly straightforward to calculate, it is rarely used in serious statisticalcalculations. A better way is using what is called the standard deviation.

If we look at the same data set but square the deviations instead of taking the absolute value we get the following

$X$    $X - \mu$   $(X - \mu)^2$
8      -3           9
12      1           1
16      5          25
9      -2           4
10     -1           1
$\sum (X - \mu)^2 = 40$

If we average the squares of the deviations we get 40/5 = 8. This is called the population variance. This is a poor approximation for the MAD, which was 2.4. It isn’t a surprise that it is a bad estimate since we squared all the deviations, which made the numbers we average bigger, except for the ones. To make a more reasonable measure of the spread of the data set we will take the square root of the number we just calculated, and we get 2.83. This is called the population standard deviation, a much better estimate of the MAD. In fact, we will think about the standard deviation the same way we think of the MAD. That is, it is the ‘average’ distance the data values fall from the mean.

The population variance is given by

$$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$$

The population standard deviation is given by

$$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$$
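To see the population formulas in action, here is a short Python sketch that repeats the calculation for the population 8, 12, 16, 9, 10. The function names are illustrative choices, not notation from the text.

```python
from math import sqrt


def population_variance(values):
    """Average of the squared deviations from the population mean."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def population_std_dev(values):
    """Square root of the population variance."""
    return sqrt(population_variance(values))


population = [8, 12, 16, 9, 10]
print(population_variance(population))           # 40 / 5
print(round(population_std_dev(population), 2))  # square root of 8, rounded
```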

Now let’s consider the sample variance and standard deviation. Recall in the discussion of the mean that the reason for taking a sample is to estimate the population mean with the sample mean. The idea is the same for the variance and standard deviation. Since our goal is to estimate these, we want as good an estimate as possible. As a result our formulas are slightly different.

The sample variance is given by

$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$$

The sample standard deviation is given by

$$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$$

If you compare the variance of the population, $\sigma^2$, with the sample variance, $s^2$, there is a notable difference between the two formulas: the denominators are different. Just like with the mean, we want to estimate the population variance with the sample variance. If the denominator of the sample variance were $n$ instead of $n - 1$ then the calculated variance would tend to underestimate the true population variance. If we divide by $n - 1$ then the sample variance will, in the long run, be right on target even though the individual estimates of $\sigma^2$ will be low or high.
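One way to see the ‘in the long run’ claim is to list every possible sample rather than simulate. The sketch below is an illustration constructed here, not an example from the text: it assumes samples of size 2 drawn with replacement from the small population 8, 12, 16, 9, 10, computes the variance of each sample with both denominators, and averages the results.

```python
from itertools import product

population = [8, 12, 16, 9, 10]


def variance(values, divisor):
    """Sum of squared deviations from the mean, divided by `divisor`."""
    center = sum(values) / len(values)
    return sum((x - center) ** 2 for x in values) / divisor


# Population variance: divide by N.
sigma_sq = variance(population, len(population))

# Every ordered sample of size n = 2, drawn with replacement (25 samples).
samples = list(product(population, repeat=2))

# Average of the sample variances using n and using n - 1.
avg_dividing_by_n = sum(variance(s, 2) for s in samples) / len(samples)
avg_dividing_by_n_minus_1 = sum(variance(s, 1) for s in samples) / len(samples)

print(sigma_sq, avg_dividing_by_n, avg_dividing_by_n_minus_1)
```

Under these assumptions, dividing by $n$ averages out to half the population variance, while dividing by $n - 1$ averages out to the population variance itself.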

We will refer to the formulas given for the variances and standard deviations in the future although we will rarely use the formulas. Instead we will rely on the calculator to do the calculations.

One note about your calculator. Some calculators will have a button which is labeled $\sigma_{n-1}$ or $S_x$, which is what we call $s$. For the population standard deviation, $\sigma$, you might see $\sigma_n$ or $\sigma_x$.

Example 3.3.2.

A sample consists of the data values: 8, 12, 16, 9, 10. Calculate its standard deviation and variance.


Solution.

The calculations above are almost identical with the exception of the denominator. So we get $s^2 = 40/4 = 10$ (the variance) and $s = 3.16$.

Although the formulas for the standard deviation are presented here, for almost all cases a calculator will be used.

Example 3.3.3.

Calculate the standard deviation and variance for the commuters’ mileages in the first example.

Solution.

First commuter: 26.1, 24.5, 24.6, 26.7, and 24.1 mpg.

Second commuter: 20.1, 23.9, 29.6, 30.1, and 22.3 mpg.

For the first commuter the calculator yields 1.13 mpg and for the second 4.46 mpg (see below). The variances will be 1.28 mpg² and 19.87 mpg², respectively. Note that the units for the standard deviation will always be the units of the individual data values and the units of the variance will always be the square of the units.

To Find Descriptive Statistics

STAT>EDIT>1:Edit

Use the arrow keys to highlight L1, hit CLEAR then ENTER (If L1 is not there, STAT>EDIT>5:SetUpEditor>ENTER)

Input data values in L1

STAT>CALC>1:1-Var Stats

This should copy 1-Var Stats command to your homescreen.

Specify the list: 2nd>L1 (located with the 1)

ENTER

To get the variance, we simply need to square the standard deviation. To find the range, subtract the minimum value from the maximum value.
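For readers working without a TI calculator, the same sample statistics can be sketched in Python. This is an alternative offered here, not the method the text relies on. It uses the standard library’s `statistics.stdev` and `statistics.variance`, which divide by $n - 1$ (the module’s `pstdev` and `pvariance` divide by $N$ instead).

```python
import statistics

first = [26.1, 24.5, 24.6, 26.7, 24.1]
second = [20.1, 23.9, 29.6, 30.1, 22.3]

for name, data in [("first", first), ("second", second)]:
    s = statistics.stdev(data)         # sample standard deviation
    s_sq = statistics.variance(data)   # sample variance
    spread = max(data) - min(data)     # range
    # Round for display to match the two decimals used in the text.
    print(name, round(s, 2), round(s_sq, 2), round(spread, 1))
```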


3.3.1 Exercises

1. Population given by 8, 3, 5, 10, 12, 16, 12. Find the range, MAD, standard deviation, and variance.

2. Population given by 22, 18, 13, 18, 15, 21, 33. Find the range, MAD, standard deviation, and variance.

3. Sample given by 12.3, 15.6, 14.6, 19.8, 14.4. Find the range, MAD, standard deviation, and variance.

4. Sample given by 22.6, 45.6, 34.6, 38.7, 27.6. Find the range, MAD, standard deviation, and variance.

5. The times for 5 randomly selected students to take an exam, in minutes, are 53.1, 46.8, 55.1, 50.6, and 44.9. Find the range, MAD, standard deviation, and variance.

6. Nine cars are randomly selected and the number of miles on each odometer is recorded. The miles were 26135, 31687, 106548, 65987, 29568, 197564, 66287, 106548, and 46389. Find the range, MAD, standard deviation, and variance.

7. The amounts of water, in ml, in a random sample of bottles are 501.3, 502.6, 500.6, 499.8, 498.6, 503.1. Find the range, MAD, standard deviation, and variance.

8. The Apgar Score is a test with scores from 0 to 10 used to assess infants one minute and again five minutes after birth. The Apgar scores for several infants five minutes after birth are 5, 9, 7, 8, 6, 3, 4, 9. Find the range, MAD, standard deviation, and variance.

9. You have an interview for a job. You do not want to be late. The interview starts 4 hours from now. You have two routes: the East Bay route and the West Bay route. Both take an average of 4.5 hours. The standard deviation of the East Bay route is 0.64 hours. The standard deviation of the West Bay route is 0.31 hours. Which route should you take? Answer the question again if the mean times are 3.8 hours. Explain.

10. What can you say about a data set if the standard deviation is 0?

11. Several friends step on the scale and they find the average weight to be 165.4 pounds with a standard deviation of 26.5 pounds. It is then pointed out that the scale registers everything it weighs as one pound over the actual weight. What are the actual mean and standard deviation of their weights?

12. The mean weight of several packages to be mailed is determined to be 15.6 ounces with a standard deviation of 3.64 ounces. It is then discovered that each package needs to include an envelope with advertisements and offers. The envelopes each weigh one ounce. What are the mean and standard deviation of the weights of the packages now?


3.4 Weighted Mean and Standard Deviation

A student takes two courses and gets an A in one class and a C in another class but is surprised when the average grade is not a B. The reason why is simply that the A and C do not have equal ‘weights’. For grades, the weights are the units for the course.

If we look further and find that the A was received in a class that was 5 units and the C was in a class that was 1 unit, the A counts 5 times as much as the C. For grades, an A is worth 4 points, a B 3 points, a C 2 points, etc. For this example we can think of this as 5 A’s and 1 C so we would get an average of (4 + 4 + 4 + 4 + 4 + 2)/(5 + 1) = 3.67

We can rewrite this as (5 × 4 + 1 × 2)/(5 + 1). The units, 5 and 1, are the weights and the grade points are 4 and 2. We can write this more formally as

Definition: Weighted Mean

$$\text{mean} = \frac{\sum wX}{\sum w}$$

where $w$ are the weights and $X$ are the data values.

Any decent calculator will handle the calculation. (See below.) Since we will rely on our calculator exclusively to find the standard deviation we will not give the formula here.

Example 3.4.1.

A student receives the following grades: A, B, D, C, B and the unit values for the courses are: 3, 2, 4, 4, 5, respectively. Find the grade point average.

Solution.

The grade points for the grades A, B, D, C, B are 4, 3, 1, 2, 3, respectively. These are the values of X. The units, or weights, are 3, 2, 4, 4, 5, respectively. So we get

$$\frac{(3 \times 4) + (2 \times 3) + (4 \times 1) + (4 \times 2) + (5 \times 3)}{3 + 2 + 4 + 4 + 5} = \frac{45}{18} = 2.5$$

The student earned a GPA of 2.5
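The weighted mean formula is short enough to write as a one-line function. Below is a minimal Python sketch (the name `weighted_mean` is chosen here for illustration) applied to the GPA example and to the earlier 5-unit A and 1-unit C case.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


grade_points = [4, 3, 1, 2, 3]  # A, B, D, C, B
units = [3, 2, 4, 4, 5]         # weights

print(weighted_mean(grade_points, units))           # 45 / 18
print(round(weighted_mean([4, 2], [5, 1]), 2))      # 22 / 6, rounded
```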

Often times, we are not presented with the raw data but data which has been summarized in a table. In such cases we would like to determine a way to find the mean and standard deviation. This is done the same way as for data with weights.

Consider the breakdown of heights of a sample of students in a college class presented in the table below. Find the mean and standard deviation of the heights.


Height, inches    Number of Students
58-61             5
62-65             8
66-69             4
70-73             3
74-77             1

When we look at the table we know that there are 5 students in the first category. But how tall are they? We simply don’t have that information.

Idea: Construct a data set with the same distribution and find the mean and standard deviation of the constructed data set.

We need to be consistent in the way in which we construct the data set. What we will do is use the midpoint of each class as the value for every student in that class. We then will get:

59.5, 59.5, 59.5, 59.5, 59.5, 63.5, etc. (59.5 is the midpoint of 58-61, etc)

We can then replace our distribution with

Height, inches    Number of Students
59.5              5
63.5              8
67.5              4
71.5              3
75.5              1

Our calculator can easily handle the data values with the frequencies. We obtain $\bar{X} = 65.0$ inches and $s = 4.64$ inches.
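The midpoint idea can also be sketched in Python. The code below is an illustration written for this example, not the calculator’s procedure: it treats each frequency as the number of times the class midpoint is repeated and then applies the sample formulas, dividing by the total count minus 1.

```python
from math import sqrt

# (lower limit, upper limit, number of students) for each height class.
classes = [(58, 61, 5), (62, 65, 8), (66, 69, 4), (70, 73, 3), (74, 77, 1)]

midpoints = [(low + high) / 2 for low, high, _ in classes]
freqs = [count for _, _, count in classes]

n = sum(freqs)  # total number of students

# Weighted mean of the midpoints, with the frequencies as weights.
mean = sum(f * x for x, f in zip(midpoints, freqs)) / n

# Sample standard deviation of the constructed data set.
sum_sq = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs))
s = sqrt(sum_sq / (n - 1))

print(n, round(mean, 1), round(s, 2))
```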

To Find Descriptive Statistics for Grouped Data

STAT>EDIT>1:Edit

Use the arrow keys to highlight L1, hit CLEAR then ENTER (If L1 is not there, STAT>EDIT>5:SetUpEditor>ENTER)

Input data values in L1

Input weights in L2

STAT>CALC>1:1-Var Stats

This should copy 1-Var Stats command to your homescreen.

Specify the list and frequency, separated with a comma: 2nd>L1 , (comma above the ‘7’) 2nd>L2

ENTER

3.4.1 Exercises

1. A student’s grade is based on three exams and a final. The three exams are each worth 20% of the grade and the final is worth 40% of the grade. If the student received 76, 86, and 79 on the exams and 72 on the final, what is the overall average grade?

2. For a particular class, homework is 10% of the grade, two midterm exams are each worth 25%, and the final is worth 40% of the grade. The student got 100 on the homework, 81 on the first exam, 64 on the second exam, and 78 on the final. What is the overall average score?

3. A student got an A in a 3 unit class, a D in a 5 unit class, a C in a 2 unit class and a C in a 3 unit class. Find the student’s GPA.

4. A student got a C in a 4 unit class, a B in a 1 unit class, a B in a 2 unit class and a C in a 1 unit class. Find the student’s GPA.

5. In the past 3 semesters, a student got a 3.75 GPA while taking 12 units, a 3.00 GPA while taking 14 units, and a 2.5 GPA while taking 18 units.

(a) Find the student’s overall GPA.

(b) If the student is currently taking 10 units, what are the minimum and maximum values the overall GPA can be after the current semester?

(c) If the student is currently taking 10 units, what GPA this semester is required to get an overall GPA of 3.10?

6. In the past 3 semesters, a student got a 2.67 GPA while taking 15 units, a 3.50 GPA while taking 10 units, and a 1.5 GPA while taking 8 units.

(a) Find the student’s overall GPA.

(b) If the student is currently taking 16 units, what are the minimum and maximum values the overall GPA can be after the current semester?

(c) If the student is currently taking 16 units, what GPA this semester is required to get an overall GPA of 3.00?

7. The weights of watermelons, in ounces, randomly selected from a farm are given in the table. Find the mean, standard deviation, and variance.

Weight     Number
89-92      8
93-96      26
97-100     19
101-104    13
105-108    7


8. Recall from before, the Elkhorn Slough bat rays. The bat rays’ lengths, in cm, are given in the table along with the frequencies. Find the mean, standard deviation, and variance.

Total Length of Bat Rays Collected at Elkhorn Slough, June 8,13, 2002

Total Length    Number of Bat Rays
20-22           6
23-25           3
26-28           11
29-31           4
32-34           8
35-37           6
38-40           2

9. The 2018 Big Sur International Marathon race times for women are summarized below. Find the mean, standard deviation, and variance. (Beware of units.)

Times (hr:min)    Number of Runners
3:00 - 3:30       45
3:30 - 4:00       184
4:00 - 4:30       330
4:30 - 5:00       457
5:00 - 5:30       316
5:30 - 6:00       282
6:00 - 6:30       67

10. The 2018 Big Sur International Marathon race times for men are summarized below. Find the mean, standard deviation, and variance. (Beware of units.)

Times (hr:min)    Number of Runners
2:00 - 2:30       1
2:30 - 3:00       26
3:00 - 3:30       165
3:30 - 4:00       329
4:00 - 4:30       357
4:30 - 5:00       338
5:00 - 5:30       225
5:30 - 6:00       186
6:00 - 6:30       34

11. The number of eggs in a clutch for a group of red-tailed hawks is recorded. The results are in the table that follows. Find the mean, standard deviation, and variance.


Number of Eggs    Number of Birds
1                 35
2                 86
3                 136
4                 108
5                 23
6                 8

12. The Apgar Score is a way to quickly assess the health of a newborn. The scores for several newborns were recorded five minutes after birth.[4] Find the mean, standard deviation, and variance.

Score    Number of Newborns
1        13,737
2        12,104
3        18,532
4        29,320
5        55,968
6        129,788
7        341,768
8        1,680,329
9        20,760,364
10       1,968,967

[4] Li, Fei et al. The Apgar score and infant mortality. PloS one vol. 8,7 e69072. 29 Jul. 2013, doi:10.1371/journal.pone.0069072


3.5 Chebyshev’s Theorem and the Empirical Rule

For data which has a ‘small’ standard deviation, there can’t be a lot of data values ‘far’ from the mean. Otherwise the standard deviation would need to be larger.

Chebyshev’s Theorem:

For any $k > 1$, at least $\left(1 - \frac{1}{k^2}\right) \times 100\%$ of data falls within $k$ standard deviations of the mean.

Equivalently, at most $\frac{1}{k^2} \times 100\%$ of data falls more than $k$ standard deviations from the mean.

Example 3.5.1.

The mean time taken for all students to finish an exam is 53.6 minutes with a standard deviation of 4.6 minutes. What percent of students finish the exam in 44.4 to 62.8 minutes?

Solution.

We are given µ = 53.6 and σ = 4.6. We want to know what percent are between 44.4 and 62.8. If we organize the information on a number line we obtain:

[Number line: 44.4 and 62.8 marked on either side of µ = 53.6, with 53.6 − 44.4 = 9.2 and 62.8 − 53.6 = 9.2.]

$k$ in the theorem is the ‘number of standard deviations from the mean’. The distance between 62.8 and 53.6 (= µ) is 9.2. The distance from 44.4 to 53.6 is also 9.2. To get the number of standard deviations from the mean we need to divide 9.2 by the standard deviation: $k = 9.2/4.6 = 2$. If we plug into the formula we get $\left(1 - \frac{1}{2^2}\right) \times 100\% = 75\%$. So, at least 75% of students take between 44.4 and 62.8 minutes to complete the exam.

Example 3.5.2.

The average amount of rain in a town each year is 56.9 inches with a standard deviation of 6.8 inches. In what percent of years will there be more than 80.7 inches or less than 33.1 inches of rain?

Solution.


Following the example above we find that 80.7 and 33.1 are 3.5 standard deviations from the mean. So we get $\frac{1}{3.5^2} \times 100\% = 8.2\%$. We conclude that at most 8.2% of years have more than 80.7 or less than 33.1 inches of rain.
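Both Chebyshev examples follow the same two steps: find $k$, then plug it into the bound. Here is a small Python sketch of those steps. The function names are made up for this illustration.

```python
def number_of_std_devs(x, mean, std_dev):
    """How many standard deviations the value x lies from the mean."""
    return abs(x - mean) / std_dev


def chebyshev_at_least_within(k):
    """Lower bound on the fraction of data within k standard deviations (k > 1)."""
    return 1 - 1 / k ** 2


def chebyshev_at_most_outside(k):
    """Upper bound on the fraction of data more than k standard deviations away."""
    return 1 / k ** 2


# Exam times: mean 53.6, standard deviation 4.6, interval 44.4 to 62.8.
k_exam = number_of_std_devs(62.8, 53.6, 4.6)
print(round(k_exam, 2), round(100 * chebyshev_at_least_within(k_exam), 1))

# Rainfall: mean 56.9, standard deviation 6.8, cutoffs 33.1 and 80.7.
k_rain = number_of_std_devs(80.7, 56.9, 6.8)
print(round(k_rain, 2), round(100 * chebyshev_at_most_outside(k_rain), 1))
```

The rounding is only for display, since subtracting decimals in floating point can leave tiny errors in $k$.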

Chebyshev’s theorem can be used whenever we know the mean and standard deviation without regard to the distribution. If we also know that the distribution follows a bell shape we get the following

The Empirical Rule (68-95-99.7 Rule)

For data which follows a bell-shaped distribution,
approximately 68% of data falls within 1 standard deviation of the mean.
approximately 95% of data falls within 2 standard deviations of the mean.
approximately 99.7% of data falls within 3 standard deviations of the mean.

Applying the empirical rule is done almost the same as with Chebyshev’s theorem. The main difference is that you don’t use a formula to get the percentages.

Example 3.5.3.

The amount of water a household consumes varies from day to day. It is known that its daily water consumption follows a bell-shaped distribution with a mean of 236.5 gallons and a standard deviation of 56.9 gallons. On what percent of days does the household use between 122.7 and 350.3 gallons of water?

Solution.

We calculate $k = \frac{350.3 - 236.5}{56.9} = \frac{236.5 - 122.7}{56.9} = 2$. Since it is stated that the water consumption is bell-shaped we use the 68-95-99.7 rule and conclude that on approximately 95% of days the household uses between 122.7 and 350.3 gallons.
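The empirical rule is a lookup rather than a formula, which the following Python sketch makes explicit. The function name and the tolerance used to compare decimals are assumptions made for this illustration. It only handles intervals that are symmetric about the mean and end exactly 1, 2, or 3 standard deviations away.

```python
# Approximate fraction of bell-shaped data within k standard deviations.
EMPIRICAL_RULE = {1: 0.68, 2: 0.95, 3: 0.997}


def empirical_rule_fraction(lower, upper, mean, std_dev, tolerance=1e-6):
    """Look up the empirical-rule fraction for a symmetric interval."""
    k_low = (mean - lower) / std_dev
    k_high = (upper - mean) / std_dev
    k = round(k_high)
    symmetric = abs(k_low - k_high) <= tolerance
    whole_number = abs(k_high - k) <= tolerance
    if not (symmetric and whole_number and k in EMPIRICAL_RULE):
        raise ValueError("interval is not 1, 2, or 3 standard deviations on each side")
    return EMPIRICAL_RULE[k]


# Water consumption: mean 236.5 gallons, standard deviation 56.9 gallons.
print(empirical_rule_fraction(122.7, 350.3, 236.5, 56.9))
```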

3.5.1 Exercises

1. When is it appropriate to use the empirical rule and when should Chebyshev’s theorem be used?

2. The empirical rule says approximately 95% of data falls within 2 standard deviations from the mean where Chebyshev’s theorem states that at least 75% of data is within 2 standard deviations from the mean. Are these contradictory statements? Explain.

3. The Empirical rule states that approximately 95% of data is within 2 standard deviations from the mean. Can we conclude that approximately 47.5% of data is between the mean and 2 standard deviations to the right of the mean? Explain.


4. Chebyshev’s theorem tells us that at least 75% of data is within 2 standard deviations from the mean. Can we conclude that at least 37.5% of data is between the mean and 2 standard deviations to the right of the mean? Explain.

5. The mean speed of all cars in front of a school is 25.8 mph with a standard deviation of 3.4 mph. What percent of all cars drive between 15.6 and 36.0 mph?

6. The Golden Gate Bridge has an average of 112,000 vehicles cross it per day. Assume the standard deviation of the number of vehicles that cross the bridge is 15,000. What percent of days have between 74,500 and 149,500 vehicles cross the bridge?

7. According to heart.org, Americans consume more than 3400 mg of sodium each day, on average. Assume that the mean amount of sodium consumed per day is 3400 mg with a standard deviation of 550 mg. What percent of all Americans consume more than 5600 mg or less than 1200 mg of sodium each day?

8. The average amount of water consumed by a small business is 52.6 gallons per day with a standard deviation of 6.9 gallons per day. On what percent of all days does the business consume more than 76.75 gallons or less than 28.45 gallons of water?

9. The times spent doing homework by all students are bell-shaped with a mean of 15.6 hours per week and a standard deviation of 6.4 hours. What percent of students spend between 2.8 and 28.4 hours per week?

10. The time employees spend commuting to work is approximately bell-shaped with a mean of 35.4 minutes and a standard deviation of 4.65 minutes. What percent of all employees spend between 26.1 and 44.7 minutes?

11. The amount of tips a tip jar receives varies from night to night. On average, there is $94.50 with a standard deviation of $16.20. On what percent of nights does the tip jar get between $49.14 and $139.86?

12. Every day, a runner ‘runs the stairs’ at a stadium as part of their workout routine. The times are approximately bell-shaped with a mean of 12.64 minutes and a standard deviation of 1.26 minutes. What percent of workouts have the runner taking between 10.12 and 15.16 minutes to ‘run the stairs’?

13. At a large company, the computer time of employees is monitored. The times employees spend online are bell-shaped with a mean of 56.4 minutes and a standard deviation of 6.8 minutes. What percent of employees are online more than 70 minutes a day?

14. A large company uses a lot of staples. On average, they use 15,465 staples each week with a standard deviation of 264 staples. The number of staples is known to be approximately bell-shaped. In what percent of weeks do they use more than 15,201 staples?

15. At an amusement park, the average amount spent on food and beverages by guests is $19.85 with a standard deviation of $5.26. At least 75% of guests spend between what two amounts?

16. The time spent on hold before being able to speak to a representative is approximately bell-shaped with a mean of 24.6 minutes and a standard deviation of 5.2 minutes. 95% of times spent on hold are between what two times?


Chapter 4

Probability


4.1 Probability

Is it going to rain tomorrow? Will my favorite sports team win the championship? Will I win the Powerball lottery? Although we would like to answer these questions, math isn’t that awesome. We can’t give definitive answers, but we can try to address the likelihood of one of these events happening. That is what probability is about. Before we get to probability we need a little set theory along with some additional terminology.

A Random Variable is a variable whose value is determined by an experiment whose outcomes are random.

Example: Roll a die. In this case, there are 6 possible outcomes, and which one occurs is random. The possible values are 1, 2, 3, 4, 5, and 6. The set that contains all outcomes of an experiment is called the Sample Space, traditionally denoted with the letter ‘S’, so $S = \{1, 2, 3, 4, 5, 6\}$.

The Sample Space, S, of a random variable is a set consisting of all the possible outcomes of the experiment.

When calculating probabilities we are going to be interested in only some of the possible outcomes. If we are playing a board game, for example, we may only want to roll a ‘2’ or ‘5’. In this case, the outcomes we are interested in form a subset of the sample space, called an event.

An Event is a subset of the sample space.

An event with only one outcome is called a Simple Event.

An event with more than one outcome is called a Compound Event.

Example: Let $S = \{1, 2, 3, 4, 5, 6\}$. List several events. Identify each as simple or compound.

It turns out there are dozens of different possible events, so here we will give only a few.

Some compound events: $\{1, 3\}$, $\{2, 4, 6\}$, $\{1, 5\}$, $\{3, 6\}$.

Some simple events: $\{5\}$, $\{1\}$, $\{2\}$.
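The claim that there are dozens of possible events can be checked by listing every subset of the sample space. The sketch below is only an illustration: the helper names `all_events` and `classify` are made up here, and the empty set is left out on the assumption that we only care about events containing at least one outcome.

```python
from itertools import combinations

S = [1, 2, 3, 4, 5, 6]

def all_events(sample_space):
    """Return every non-empty subset of the sample space as a frozenset."""
    events = []
    for size in range(1, len(sample_space) + 1):
        for combo in combinations(sample_space, size):
            events.append(frozenset(combo))
    return events

def classify(event):
    """A simple event has exactly one outcome; otherwise it is compound."""
    return "simple" if len(event) == 1 else "compound"

events = all_events(S)
simple_events = [e for e in events if classify(e) == "simple"]
compound_events = [e for e in events if classify(e) == "compound"]

print(len(events), len(simple_events), len(compound_events))
```

Under these assumptions the die has 63 non-empty events, 6 of them simple and 57 compound.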

Example 4.1.1.

You ask two people whether they have ridden the bus in the last month. Observe how many people have ridden the bus in the last month. Find the sample space.

Solution.

In this case we have only 3 possibilities: neither one has taken the bus, exactly one has taken the bus, or both have taken the bus. So our sample space is $S = \{0, 1, 2\}$.


Example 4.1.2.

You ask two people whether they have ridden the bus in the last month. Observe the order of responses obtained. Find the sample space.

Solution.

For this our sample space is not going to consist of numbers. Instead it will consist of verbal descriptions. For example: the first person rode the bus and the second person didn’t. Let’s abbreviate this as ‘BD’, with ‘B’ for rode the bus and ‘D’ for didn’t ride the bus. We could just think about it and try to reason it through, but this is a good place for a tree diagram.

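[Tree diagram: the first person branches to B or D; each of those branches again to B or D for the second person, giving the outcomes BB, BD, DB, DD.]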

In the diagram we start on the left with a branch for the first person, who either rode the bus or didn’t, denoted ‘B’ and ‘D’, respectively. Each branch then splits again for the second person, and the end of each path is a final outcome. Our sample space is $S = \{BB, BD, DB, DD\}$.
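A tree diagram is a way of listing every ordered combination of choices, which is the same job a Cartesian product does. As a small sketch (the variable names are illustrative, not part of the text), the two bus examples can be reproduced like this:

```python
from itertools import product

# "B" = rode the bus, "D" = didn't ride the bus.
responses = "BD"

# Each path through the tree is one ordered pair of responses.
ordered_space = ["".join(path) for path in product(responses, repeat=2)]

# Example 4.1.1 only counts riders, so several paths collapse to one outcome.
count_space = sorted({path.count("B") for path in ordered_space})

print(ordered_space)
print(count_space)
```

The first list matches the sample space of Example 4.1.2 and the second matches Example 4.1.1, which shows that the same experiment can have different sample spaces depending on what is observed.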

We are now ready to calculate some probabilities. The probability of an event is a numerical measure of the likelihood of the event occurring. It is a number between 0 and 1, inclusive. We have all heard the meteorologist on the local news state that there is a ‘30% chance of rain’ tomorrow. This simply means that if the current conditions were to happen over and over again, 30% of the time it would rain.

Let’s start with the die example from above. We have $S = \{1, 2, 3, 4, 5, 6\}$. We want to roll either a ‘2’ or a ‘5’. What is the probability of that happening? Before we can do this we need a definition.

Definition: If a sample space consists of equally likely outcomes, then the probability of an event, A, is given by

$$P(A) = \frac{\text{number of elements in } A}{\text{number of elements in } S}$$

For our die example, $A = \{2, 5\}$, so $P(A) = \frac{2}{6} = \frac{1}{3}$.

A warning: the outcomes in the sample space must be equally likely. If not, you cannot use the definition of probability as stated. If the die were loaded (unfair) then we would not expect the probabilities to be the same.
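The definition amounts to counting two sets and dividing. A minimal sketch, assuming equally likely outcomes and using a hypothetical helper named `probability`, keeps the answer as an exact fraction:

```python
from fractions import Fraction

def probability(event, sample_space):
    """P(event) for a sample space of equally likely outcomes."""
    event = set(event)
    sample_space = set(sample_space)
    if not event <= sample_space:
        raise ValueError("an event must be a subset of the sample space")
    return Fraction(len(event), len(sample_space))

S = {1, 2, 3, 4, 5, 6}
A = {2, 5}
p_A = probability(A, S)

print(p_A)
```

The function says nothing about whether the outcomes really are equally likely. For a loaded die it would still return 1/3, and that answer would be wrong.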

Probability is, theoretically, straightforward. All you need to do is count how many elements are in two sets and divide. In the last example the counting was very easy. For many cases counting is what makes the problem difficult.


For many problems we can’t find, let alone count, the elements in S. We need to estimate the probabilities in such cases.

To estimate the probability $P(A)$ we repeat our experiment several times and calculate the ratio to get the

Relative Frequency Estimate of Probability

$$P(A) \approx \frac{\text{number of times } A \text{ occurs}}{\text{number of times experiment is repeated}}$$

Note: the equal sign has been replaced with an approximately equal sign. The approximation will get better the more times the experiment is repeated. We see these estimates whenever we look at polls. If a poll asks whether or not a person will vote for a particular candidate, the reported percentage is really just an estimate of the probability. It would be too expensive to ask all voters, so pollsters take a sample to get an estimate.
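To see the estimate settle down as the experiment is repeated, we can simulate rolling a fair die. This is only a sketch: the function name `estimate_probability`, the fixed seed, and the trial counts are all choices made for illustration.

```python
import random

def estimate_probability(event, trials, seed=0):
    """Relative frequency estimate of P(event) for a fair six-sided die."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(1 for _ in range(trials) if rng.randint(1, 6) in event)
    return hits / trials

A = {2, 5}
for trials in (10, 100, 10_000, 100_000):
    print(trials, estimate_probability(A, trials))
```

With only 10 rolls the estimate can land far from 1/3, while with 100,000 rolls it should be close to it.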

Oftentimes, we want to calculate the probability of an event that is difficult to find directly, but it is much easier to calculate the probability of what is called the complement of the event.

The Complement of an event A, denoted $\bar{A}$, consists of all elements in the sample space, S, that are not in the event A.

The probabilities of an event and its complement are related by

$$P(A) + P(\bar{A}) = 1$$

The more useful form is given by

$$P(A) = 1 - P(\bar{A})$$

This last formula is one we are probably all familiar with, perhaps not written as a formula, but as an idea. For example, if 30% of voters voted for a proposition, then the rest, 70% (= 100% − 30%), voted against the proposition.

Example 4.1.3.

Let $S = \{1, 2, 3, 4, 5, 6\}$ from before and $A = \{2, 5\}$. Find $\bar{A}$ and $P(\bar{A})$.

Solution.

Since $\bar{A}$ consists of the elements in S that are not in A, we have $\bar{A} = \{1, 3, 4, 6\}$. The probability is $P(\bar{A}) = 4/6 = 2/3$.
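The same example can be checked with set subtraction. This is a small sketch with illustrative variable names, again assuming equally likely outcomes:

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
A = {2, 5}

# The complement holds everything in S that is not in A.
A_complement = S - A

p_A = Fraction(len(A), len(S))
p_A_complement = Fraction(len(A_complement), len(S))

print(sorted(A_complement), p_A_complement, 1 - p_A)
```

Both routes, counting the complement directly and subtracting $P(A)$ from 1, give 2/3.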


4.1.1 Exercises

1. We are to randomly select 2 registered voters and determine if they voted in the last election. Use a tree diagram to determine the sample space.

2. Two students at a large university are to be selected and it is to be determined if they are undergraduates or graduate students. Use a tree diagram to find the sample space of the experiment.

3. A box contains 12 marbles. 8 of the marbles are green. If you reach in and grab 5 marbles and count how many are green, find the sample space.

4. You flip a coin 4 times. If you are observing the number of heads in the four flips, find the sample space.

5. You are to select 3 people and determine if they have a criminal record. Use a tree diagram to find the sample space where the outcomes note the order of whether or not a person has a criminal record.

6. A pollster wishes to ask 3 workers if they use mass transit to get to work. Use a tree diagram to find the sample space, where the pollster observes the order of the responses, e.g. Yes-Yes-No.

7. Let $S = \{a, b, c, d, e\}$ be the sample space that consists of equally likely outcomes.

(a) Let $A = \{a, b, e\}$. Find $P(A)$.

(b) Let $B = \{c, d\}$. Find $P(B)$.

(c) Let $C = \{d\}$. Find $P(C)$.

(d) Let $D = \{b\}$. Find $P(D)$.

(e) Which pair of events listed above are complements?

8. Let $S = \{1, 2, 3, 4, 5\}$ be the sample space that consists of equally likely outcomes.

(a) Let $A = \{1, 2, 4, 5\}$. Find $P(A)$.

(b) Let $B = \{2, 4\}$. Find $P(B)$.

(c) Let $C = \{3\}$. Find $P(C)$.

(d) Let $D = \{2\}$. Find $P(D)$.

(e) Which pair of events listed above are complements?

9. A company consists of 26 employees. 12 of them are full time employees and 14 are part time employees. If you randomly select a worker, find the probability that the worker is a full time employee.

10. A parking lot has 35 cars that have California license plates and 12 that have out of state license plates. If one car is selected at random, find the probability that the car has out of state plates.

11. A bag of candies contains 13 red candies, 17 blue candies, and 25 white candies. A candy is to be selected. Find the probability that you will get a blue candy.


12. A box of apples contains 3 granny smith apples, 6 golden delicious apples, and 5 pink lady apples. One of the apples is to be selected. Find the probability that the apple will be a pink lady apple.

13. A bag of candies contains 13 red candies, 17 blue candies, and 25 white candies. A candy is to be selected. Find the probability that you don’t get a blue candy.

14. A box of apples contains 3 granny smith apples, 6 golden delicious apples, and 5 pink lady apples. One of the apples is to be selected. Find the probability that the apple will not be a pink lady apple.

15. At a blood drive 40 people have O+ blood, 31 have A+ blood, 8 have B+, 2 have AB+, 9 have O-, 7 have A-, 2 have B-, and 1 has AB-. A person is to be selected.

(a) Find the probability the person has O- blood.

(b) Find the probability the person has A blood (+ or -).

(c) Find the probability that the person is Rh +. (has a ‘+’ blood type).

(d) Find the probability that the person does not have A+ blood.

16. At a car show, there are 23 Ford convertibles, 12 Ford hard tops, 18 Chevy convertibles, 25 Chevy hard tops, 9 Chrysler convertibles, and 16 Chrysler hard tops. You are to randomly select one car.

(a) Find the probability of selecting a Chevy convertible.

(b) Find the probability of getting a convertible.

(c) Find the probability that the car is a Ford.

(d) Find the probability that the car is not a Chevy hard top.


4.2 Conditional Probability and Independence of Events

When we are calculating probabilities, we sometimes need to restrict our attention to a subpopulation of our population. We may want to examine the differences in disease rates for different ethnic groups, for example. This is where conditional probability is applied.

The probability of an event A occurring given that another event, B, has already occurred is called the Conditional Probability of A given B and is written $P(A \mid B)$.

Example 4.2.1.

Define events and write the following as conditional probabilities: ‘twelve percent of women will develop breast cancer’, ‘forty-five percent of people in Argentina have type O+ blood’, ‘fifteen percent of men are left-handed’.

Solution.

Let W be the event ‘Woman’ and C be the event ‘Breast Cancer’. Then the conditional probability is $P(C \mid W) = .12$.

Note that the order is important: $P(C \mid W) = .12 \neq P(W \mid C)$. We expect $P(W \mid C) \approx 1$. Why?

Similarly, if we define A to be the event ‘from Argentina’ and O as the event ‘has O+ blood’ then we get $P(O \mid A) = .45$.

Lastly, let M be the event ‘Man’ and L be the event ‘left-handed’; then we get $P(L \mid M) = .15$.

We will use the idea of independence quite a bit as we proceed. Two events are independent if one event occurring does not affect the probability of the other event occurring. Consider the events ‘Breast Cancer’ and ‘Woman’ that we looked at before. If you choose a woman, then the likelihood of that woman developing breast cancer is greater than for the population as a whole. If we randomly select an adult from the population, the probability is about .06 that they will develop breast cancer. If the person we chose is a woman, the probability is about .12, different from the entire population. If we were to choose a man, the probability they will develop breast cancer drops to .001.¹ Whether or not a person is a woman and the likelihood of developing breast cancer are related, or dependent. We can generalize:

The events A and B are independent if

$$P(A) = P(A \mid B) \quad \text{and} \quad P(B) = P(B \mid A)$$

It turns out that if one of the statements is true then the other one is true. This tells us we only need to check if one is true.

¹ nationalbreastcancer.org


Example 4.2.2.

224 employees at a company were asked about their full/part time status and whether they are a college graduate or not. The results are summarized in the table that follows.

| | Full Time | Part Time |
|---|---|---|
| College Graduate | 59 | 15 |
| Not a College Graduate | 97 | 53 |

1. One of these employees is to be selected. Find the probability that . . .

(a) A college graduate is chosen

(b) A part time employee is chosen

(c) A college graduate is chosen given that they are a part-time employee

(d) A full-time employee given that they are not a college graduate

2. Are the events ‘Part Time’ and ‘College Graduate’ independent?

Solution.

In order to use the definition of probability from before we will need the totals. Also notice we have given letters to the events: F = Full Time, etc.

| | Full Time (F) | Part Time (PT) | Total |
|---|---|---|---|
| College Graduate (G) | 59 | 15 | 74 |
| Not a College Graduate (N) | 97 | 53 | 150 |
| Total | 156 | 68 | 224 |

We can restate the problem as

1. $P(G)$

2. $P(PT)$

3. $P(G \mid PT)$

4. $P(F \mid N)$

5. Are PT and G independent?

To determine the probability that a college graduate (G) is chosen, we see that there are 74 college graduates (G) and a total of 224 employees. So we get $P(G) = 74/224 = 0.3304$.

For the probability of PT, we have 68 total PTs and a total of 224 employees. So $P(PT) = 68/224 = 0.3036$.

Both of these require the use of the definition of probability given before. For $P(G \mid PT)$, we are restricting our attention to just the part-time employees. So we see the table as . . .


| | Full Time (F) | Part Time (PT) | Total |
|---|---|---|---|
| College Graduate (G) | | 15 | |
| Not a College Graduate (N) | | 53 | |
| Total | | 68 | |

From this we see that there are 15 college graduates out of a total of 68. So we get $P(G \mid PT) = \frac{15}{68} = 0.2206$.

Likewise, for $P(F \mid N)$, we see the table as . . .

| | Full Time (F) | Part Time (PT) | Total |
|---|---|---|---|
| College Graduate (G) | | | |
| Not a College Graduate (N) | 97 | 53 | 150 |
| Total | | | |

There are 97 full time employees out of 150 non college graduates (N). So we get $P(F \mid N) = \frac{97}{150} = .6467$.

To see if PT and G are independent, we need to check whether either $P(PT) = P(PT \mid G)$ or $P(G) = P(G \mid PT)$. We have already determined the probabilities for the second equality: $P(G) = .3304$ and $P(G \mid PT) = .2206$. Since these are not equal, G and PT are not independent (they are dependent).
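The whole example reduces to adding up cells of the two-way table and dividing. The sketch below stores the four counts in a dictionary and uses a hypothetical helper named `count`. The row and column labels follow the letters used in the solution, and the independence check is the exact equality used in the text.

```python
from fractions import Fraction

# Counts from the two-way table: (education, status) -> number of employees.
counts = {
    ("G", "F"): 59, ("G", "PT"): 15,
    ("N", "F"): 97, ("N", "PT"): 53,
}

def count(education=None, status=None):
    """Number of employees matching the given row and/or column label."""
    return sum(
        n for (e, s), n in counts.items()
        if education in (None, e) and status in (None, s)
    )

total = count()  # every cell: 224 employees

p_G = Fraction(count(education="G"), total)
p_PT = Fraction(count(status="PT"), total)
# Conditional probabilities divide by the size of the restricted group.
p_G_given_PT = Fraction(count(education="G", status="PT"), count(status="PT"))
p_F_given_N = Fraction(count(education="N", status="F"), count(education="N"))
independent = (p_G == p_G_given_PT)

print(float(p_G), float(p_PT), float(p_G_given_PT), float(p_F_given_N), independent)
```

Conditioning only changes the denominator: $P(G \mid PT)$ divides by the 68 part-time employees instead of all 224.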

4.2.1 Exercises

1. According to the CDC, 18 of every 100 adult men are smokers. Define appropriate events and write the statement as a conditional probability.

2. According to the CDC, 14 of every 100 adult women are smokers. Define appropriate events and write the statement as a conditional probability.

3. According to the CDC, 28% of all traffic-related deaths in the US were in alcohol-impaired driving crashes. Define appropriate events and write the statement as a conditional probability.

4. According to cpcstrategy.com, 67% of millennials shop online. Define appropriate events and write the statement as a conditional probability.

5. Several people who wanted to have children were asked how many siblings they have and how many children they wanted to have. See the table for the results.

| | 0 siblings | 1+ siblings |
|---|---|---|
| Want 1 child | 49 | 94 |
| Want 2+ children | 68 | 106 |

(a) One of these persons is to be selected. Find the probability that . . .

i. They have no siblings

ii. They only want one child

iii. They want 2+ children given that they have 1+ siblings


iv. They are an only child given that they only want one child.

(b) Are the events ‘1+ siblings’ and ‘Want 2+ children’ dependent?

6. Several people with smart phones were asked their age and whether they use the phone to check their email. The results are in the table that follows.

| | Check email | Don’t check email |
|---|---|---|
| Under 30 years old | 49 | 94 |
| 30+ years old | 68 | 106 |

(a) One of these persons is to be selected. Find the probability that . . .

i. They don’t check email

ii. They are under 30 years old

iii. They check their email given that they are 30+ years old

iv. They are under 30 years old given that they check their email

(b) Are the events ‘Check email’ and ‘30+ years old’ dependent?

7. An airline is interested in how late planes arrive. Several of the company’s flights were selected and it was determined how late, if at all, the planes were and the size of the plane. The results are in the table that follows.

| | On Time or Early | Late: < 15 minutes | 15+ minutes late |
|---|---|---|---|
| Small aircraft | 302 | 198 | 58 |
| Large aircraft | 83 | 123 | 91 |

(a) One of these flights is to be selected. Find the probability that . . .

i. A large aircraft is chosen

ii. A flight that is 15+ minutes late is chosen

iii. A small aircraft is chosen given that it is on time

iv. A flight is on time or early given that a large aircraft was chosen

(b) Are the events ‘Small Aircraft’ and ‘15+ Minutes Late’ independent?

8. Several students at a high school were asked their class and whether or not they planned on going to college. The results follow.

| | Frosh | Soph | Junior | Senior |
|---|---|---|---|---|
| College:Yes | 302 | 201 | 186 | 166 |
| College:No | 136 | 124 | 105 | 86 |

(a) One of these students is to be selected. Find the probability that . . .

i. A freshman is chosen

ii. A person planning on going to college is chosen

iii. A junior given that they are planning on going to college

iv. A person not planning on going to college given that they are a sophomore

(b) Are the events ‘Senior’ and ‘College:Yes’ independent?


9. Several movie-goers were asked to rate a movie that had a lot of violence prior to release. They were also asked if they had children. The results are in the table.

| | Thumbs Up | Thumbs Down |
|---|---|---|
| Have Children | 26 | 75 |
| Don’t Have Children | 67 | 96 |

(a) We are to select one of these movie-goers. Find the probability that . . .

i. They gave the movie a thumbs down

ii. They have children

iii. A person without children given that they gave it a thumbs up?

(b) What percent of people with children gave it a thumbs down?

(c) Are the events ‘Have Children’ and ‘Thumbs Up’ independent?

10. A biologist is examining salmon carcasses on a stretch of river and notes the gender and whether or not the salmon was hatchery raised. The results follow in the table.

| | Female | Male |
|---|---|---|
| Hatchery | 147 | 127 |
| Wild | 116 | 107 |

(a) One of these salmon carcasses is to be selected at random. Find the probability that the salmon . . .

i. Is a female.

ii. Is hatchery raised.

iii. Is a male given that it is wild.

iv. Is hatchery raised given that it is a female.

(b) Are the events ‘Female’ and ‘Wild’ independent?

11. A survey asked several people what type of phone and computer they own. The results are summarized below.

| | iPhone | Other Phone |
|---|---|---|
| Apple Computer | 89 | 32 |
| Other Computer | 167 | 216 |

(a) One of these persons is to be selected at random. Find the probability of selecting . . .

i. Someone who uses a non-Apple computer.

ii. Someone who has an iPhone.

iii. Someone who uses a non-Apple phone given that they use a non-Apple computer.

iv. Someone who uses an Apple computer given that they use a non-Apple phone.

(b) Are the events ‘iPhone’ and ‘Apple Computer’ independent?


12. At a company meeting, several donuts were consumed. The donuts, by type and topping, are summarized in the table.

| | Chocolate | Glazed | Maple |
|---|---|---|---|
| Cake | 11 | 24 | 9 |
| Yeast | 18 | 33 | 11 |

One of these donuts is to be selected. Find the probability that . . .

(a) A glazed donut is selected.

(b) A cake donut is selected.

(c) A cake donut is selected given it was maple.

(d) A chocolate donut is selected given it was a yeast donut.

(e) Are the events ‘Cake’ and ‘Chocolate’ independent?

13. At a business luncheon, three entrees were served: a beef dish, a chicken dish, and a vegetarian dish. Additionally, guests were given the option of water, iced tea, or lemonade. The results are summarized in the table below.

| | Water | Iced Tea | Lemonade |
|---|---|---|---|
| Beef | 17 | 35 | 16 |
| Chicken | 23 | 33 | 10 |
| Vegetarian | 12 | 11 | 8 |

One of these meals is to be selected. Find the probability that . . .

(a) A meal with iced tea is selected.

(b) A meal with the vegetarian entree is selected.

(c) A meal with water is selected given that they had the beef.

(d) A meal with chicken is chosen given that they had lemonade.

(e) Are the events ‘Lemonade’ and ‘Beef’ independent?

14. Some roses are bred to have no, or very little, scent. Others have a noticeable scent. Several roses are individually sold. The color of each rose and whether it has a scent are in the following table.

| | Red | White | Yellow |
|---|---|---|---|
| Scented | 75 | 44 | 23 |
| Unscented | 52 | 26 | 10 |

(a) What percent of roses sold were red?

(b) What percent of roses sold were scented?

(c) What percent of red roses had no scent?

(d) What percent of scented roses were yellow?

(e) Are the events ‘Scented’ and ‘Red’ independent?


15. In a triathlon, participants swim, cycle, and run. At a local triathlon, several participants were asked their favorite part of the race and if they are a ‘local’. A summary of the results follows.

| | Swim | Cycle | Run |
|---|---|---|---|
| Local | 56 | 43 | 61 |
| Not Local | 153 | 162 | 208 |

(a) What percent of participants prefer the swim?

(b) What percent of participants are local?

(c) What percent of locals prefer the run?

(d) What percent of participants who prefer the cycle are not local?

(e) Are the events ‘Local’ and ‘Run’ independent?

16. A pollster is taking a poll about support for a new tax law and whether the person paid taxes with their federal tax return or got a refund in the last year.

| | Refund | Paid Some | Paid a Lot |
|---|---|---|---|
| In Favor | 68 | 268 | 267 |
| Opposed | 164 | 316 | 218 |

(a) What percent of those polled are in favor of the new tax law?

(b) What percent of those who got a refund are opposed to the new law?

(c) What percent of people polled who paid a lot of taxes are in favor of the new law?

(d) What percent of people who are in favor of the new law are in favor of the new bill?

(e) Are the events ‘Paid Some’ and ‘Opposed’ independent?

17. Go online, find a percentage that is a conditional probability, define the events, and write the probability.


4.3 Intersection of Events

We often want to find the probability of two events occurring at the same time. For example, suppose we are selecting a person from the group summarized in the following table:

| | Full Time (F) | Part Time | Total |
|---|---|---|---|
| College Graduate (G) | 59 | 15 | 74 |
| Not a College Graduate | 97 | 53 | 150 |
| Total | 156 | 68 | 224 |

We might want the probability of selecting someone who is a Full Time College Graduate. Then we need to select someone who is in both the Full Time group and the College Graduate group. Where the row and column intersect we see that there are 59 such people out of a total of 224. So the probability of selecting someone who is a full time college graduate is given by

$$P(F \cap G) = \frac{59}{224} = .2634$$

Definition: The Intersection of two events, A and B, consists of all elements that are in both A and B and is denoted $A \cap B$.

If $A = \{1, 2, 4, 5, 8\}$ and $B = \{2, 5, 8, 9\}$ then the intersection of A and B is $A \cap B = \{2, 5, 8\}$.
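For readers who like to check set operations by machine, Python's built-in sets use `&` for intersection. This is just a quick illustration of the example above:

```python
A = {1, 2, 4, 5, 8}
B = {2, 5, 8, 9}

# Elements that belong to both A and B.
intersection = A & B

print(sorted(intersection))
```

The order of the two sets does not matter: `B & A` gives the same result.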

With a two way table it is a simple matter to use the definition of probability given before. If we don’t have a two way table we need to find another way.

When we deal with continuous probabilities later we will think about probabilities in terms of areas. That’s the approach we will take here. Consider the Venn diagram below. It shows the event A and the sample space S. The probability of A would be the fraction of the area that A takes up of the box representing the sample space, S.

[Venn diagram: a circle labeled A inside a rectangle labeled S.]

A very crude estimate for $P(A)$ would be around .2. This is the area of the circle divided by the area of the entire rectangle. (Yes, I measured it and calculated the areas.)

Let’s look at $P(A \cap B)$.

[Venn diagram: overlapping circles A and B inside the rectangle S, with the overlap $A \cap B$ shaded.]


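If we know that B has occurred, then B plays the role of the sample space, so

$$P(A \mid B) = \frac{\text{Area of the shaded region}}{\text{Area of } B}$$

If we divide the numerator and denominator by the area of S we get

$$P(A \mid B) = \frac{\text{Area of the shaded region} / \text{Area of } S}{\text{Area of } B / \text{Area of } S}$$

The fraction in the numerator is $P(A \cap B)$ and the fraction in the denominator is $P(B)$, so we get

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

Equivalently,

$$P(A \cap B) = P(B)\,P(A \mid B)$$

or

$$P(A \cap B) = P(A)\,P(B \mid A)$$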

Example 4.3.1.

Use the formula above to find the probability of selecting someone who is a Full Time College Graduate from the first example.

Solution.

In this case we have the events F and G as defined before and we want $P(F \cap G) = P(F)\,P(G \mid F)$. We get $P(F) = \frac{156}{224}$ and $P(G \mid F) = \frac{59}{156}$ from the table, and this yields

$$P(F \cap G) = P(F)\,P(G \mid F) = \frac{156}{224} \times \frac{59}{156} = \frac{59}{224} = .2634$$

which is what we obtained before. Clearly, this way is more time consuming than the way we solved the problem before.

Example 4.3.2.

In a group of 8 people, 3 are college graduates. 2 are to be chosen. What is the probability of selecting two college graduates?

Solution.

In this problem we have events occurring one after another. Therefore, let $G_1$ be the event: a college graduate was chosen on the first pick. Similarly, let $G_2$ be the event: a college graduate was chosen on the second pick. In order to get two college graduates you have to get a college graduate on the first pick and get a college graduate on the second pick. So to say that we selected two college graduates is the same thing as $G_1 \cap G_2$. So we want $P(G_1 \cap G_2) = P(G_1)\,P(G_2 \mid G_1)$.


Since there are 3 college graduates and 8 people, $P(G_1) = \frac{3}{8}$. For $P(G_2 \mid G_1)$, the event $G_1$ means we picked a college graduate, so we now have only 2 college graduates left and 7 people left. (Once someone is picked, they can’t be picked again.) So $P(G_2 \mid G_1) = \frac{2}{7}$. If we put this together we get $P(G_1 \cap G_2) = \frac{3}{8} \times \frac{2}{7} = .1071$. We can organize it by using a tree diagram:

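[Tree diagram: first pick $G_1$ with $P(G_1) = 3/8$, followed by second pick $G_2$ with $P(G_2 \mid G_1) = 2/7$; the path through both gives $P(G_1 \cap G_2) = 3/8 \times 2/7 = 3/28$.]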
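One way to double check a multiplication-rule answer is to list every ordered pair of picks and count the favorable ones. The sketch below does both; the list `people` and the other names are illustrative only.

```python
from fractions import Fraction
from itertools import permutations

# 3 college graduates ("G") and 5 non-graduates ("N").
people = ["G"] * 3 + ["N"] * 5

# Multiplication rule: P(G1 and G2) = P(G1) * P(G2 | G1).
p_g1 = Fraction(3, 8)
p_g2_given_g1 = Fraction(2, 7)  # one graduate and one person fewer
p_both = p_g1 * p_g2_given_g1

# Brute force: every ordered pair of two different people is equally likely.
pairs = list(permutations(range(len(people)), 2))
favorable = [
    (first, second) for first, second in pairs
    if people[first] == "G" and people[second] == "G"
]
p_enumerated = Fraction(len(favorable), len(pairs))

print(p_both, p_enumerated, round(float(p_both), 4))
```

There are 56 ordered pairs and 6 of them consist of two graduates, so both approaches give 3/28.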

Example 4.3.3.

In a large city, 35% of adults are married. Two adults from this city are to be selected at random. Find the probability of selecting two that are married.

Solution.

This is similar to the last example with one major difference. When we select an adult from the population, the probability of getting someone that is married on the next pick changes so little that we can ignore the difference (it is a large city). In practical terms we treat it as if the probabilities don’t change. Let $M_1$ be the event: a married person was selected on the first pick, and $M_2$ the event: a married person was selected on the second pick. So in this case, we have $P(M_1) = P(M_2 \mid M_1) = .35$, giving $P(M_1 \cap M_2) = P(M_1)\,P(M_2 \mid M_1) = .35 \times .35 = .1225$.

Using a tree diagram, we have

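[Tree diagram: first pick $M_1$ with $P(M_1) = .35$, followed by second pick $M_2$ with $P(M_2 \mid M_1) = .35$; the path through both gives $P(M_1 \cap M_2) = .35 \times .35 = .1225$.]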
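To see how little the second probability changes in a large city, we can compare the exact without-replacement answer to $.35 \times .35$ for a few population sizes. The population sizes below are made up for illustration, and `p_two_married` is a hypothetical helper.

```python
from fractions import Fraction

def p_two_married(population, married):
    """Exact P(both married) when two adults are picked without replacement."""
    first = Fraction(married, population)
    second_given_first = Fraction(married - 1, population - 1)
    return first * second_given_first

approx = 0.35 ** 2  # treats the two picks as if the probability never changes

for population in (20, 1000, 1_000_000):
    married = population * 35 // 100  # 35% of the adults are married
    exact = float(p_two_married(population, married))
    print(population, round(exact, 6), round(approx, 6))
```

For 20 adults the exact answer is noticeably smaller than .1225, but for a million adults the two agree to about six decimal places. That is why the example treats the probabilities as unchanged.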


4.3.1 Exercises

1. A fair die is to be tossed twice. What is the probability of throwing a ‘5’ followed by a ‘3’?

2. 20% of residents in a large city smoke. If you are to randomly select 2 residents from the city, find the probability that the first one is a smoker and the second one isn’t.

3. A hat contains the names of 8 students. 3 of the students haven’t completed their homework and the remaining 5 have completed their homework. If you randomly select 2 names, find the probability that . . .

(a) Both have done their homework

(b) The first has completed their homework and the second hasn’t.

(c) Both haven’t done their homework

4. In a standard deck of cards, there are 12 face cards and 40 non-face cards. If two cards are to be selected at random find the probability of

(a) Both being face cards if the first card is returned and the deck is shuffled before the second card is drawn

(b) Both being face cards if the first card is not returned to the deck before the second card is drawn

5. In a standard deck of cards, there are 13 hearts, 13 diamonds, 13 clubs, and 13 spades. If two cards are to be selected at random find the probability of

(a) Both being diamonds if the first card is returned and the deck is shuffled before the second card is drawn

(b) Both being diamonds if the first card is not returned to the deck before the second card is drawn

6. For a research paper, a student listed 10 sources. 4 of the sources were online and the remaining 6 were hard copies. If you randomly select 2 of the sources find the probability that . . .

(a) Both are online sources

(b) The first was online and the second was a hard copy

(c) Neither were online sources

7. 22% of all college students at a large university are seniors. 64% of all seniors are full-time students. What percent of all students are full-time seniors?

8. 37% of cars on a highway are driving over the speed limit. 29% of cars that are speeding have gotten at least one ticket in the past year. What percent of all cars are speeding with at least one ticket in the last year?


9. Several people who wanted to have children in their family were asked how many siblings they have and how many children they wanted to have. See the table for the results.

| | 0 siblings | 1+ siblings |
|---|---|---|
| Want 1 child | 49 | 94 |
| Want 2+ children | 68 | 106 |

One of these persons is to be selected. Find the probability that . . .

(a) They have no siblings and want to have 2 or more children.

(b) They weren’t an only child and don’t want to have an only child.

10. Several people with smart phones were asked their age and whether they use their phone to check their email. The results are in the table that follows.

                      Check email    Don’t check email
Under 30 years old         49               94
30+ years old              68              106

One of these persons is to be selected. Find the probability that . . .

(a) They don’t check email on their smart phone and are under 30 years old.

(b) They are 30+ years old and they check their email on their smart phone.

11. An airline is interested in how late planes arrive. Several of the company’s flights were selected and it was determined how late, if at all, the planes were and the size of the plane. The results are in the table that follows.

                  On Time or Early    Late: < 15 minutes late    15+ minutes late
Small aircraft          302                    198                      58
Large aircraft           83                    123                      91

One of these flights is to be selected. Find the probability that . . .

(a) A large aircraft is chosen and it is 15+ minutes late.

(b) The flight wasn’t late and the aircraft was small.

12. Several students at a high school were asked their class and whether or not they planned on going to college. The results follow.

                Frosh    Soph    Junior    Senior
College: Yes     302     201      186       166
College: No      136     124      105        86

One of these students is to be selected. Find the probability that . . .

(a) A freshman planning on going to college is chosen.

(b) A senior not planning on going to college is chosen.

13. Several movie-goers were asked to rate a movie that had a lot of violence prior to release. They were also asked if they had children. The results are in the table.

                       Thumbs Up    Thumbs Down
Have Children              26            75
Don’t Have Children        67            96

We are to randomly select one of these movie-goers. Find the probability that the movie-goer. . .

(a) is a childless movie-goer that gave it a thumbs up.

(b) is a movie-goer with children that gave it a thumbs down.

14. A biologist is examining salmon carcasses on a stretch of river and notes the gender and whether or not the salmon was hatchery raised. The results follow in the table.

            Female    Male
Hatchery      147      127
Wild          116      107

One of these salmon carcasses is to be selected at random. Find the probability that the salmon . . .

(a) Is a hatchery raised female.

(b) Is a wild male.

15. A survey asked several people what type of phone and computer they own. The results are summarized below.

                  iPhone    Other Phone
Apple Computer      89          32
Other Computer     167         216

One of these persons is to be selected at random. Find the probability of selecting . . .

(a) An iPhone owner that has an Apple computer.

(b) A non-Apple computer owner that has an Apple phone.

16. At a company meeting, several donuts were consumed. The results of the donut by type and topping are summarized in the table.

         Chocolate    Glazed    Maple
Cake        11          24        9
Yeast       18          33       11

One of these donuts is to be selected. Find the probability that . . .

(a) A chocolate cake donut is selected.

(b) A yeast donut with maple is selected.

17. At a business luncheon, three entrees were served: a beef dish, chicken dish, and a vegetarian dish. Additionally guests were given the option of water, tea, or lemonade. The results are summarized in the table below.

              Water    Iced Tea    Lemonade
Beef            17        35          16
Chicken         23        33          10
Vegetarian      12        11           8

One of these meals is to be selected. Find the probability that . . .

(a) A meal with beef and water is selected.

(b) A meal with lemonade and chicken is selected.

18. Some roses are bred to have no, or very little, scent. Others have a noticeable scent. Several roses are individually sold. The color and if they have a scent are in the following table.

             Red    White    Yellow
Scented       75      44       23
Unscented     52      26       10

(a) What percent of roses sold were scented yellow roses?

(b) What percent of roses sold were unscented red roses?

19. In a triathlon, participants swim, cycle, and run. At a local triathlon, several participants were asked their favorite part of the race and if they are a ‘local’. A summary of the results follows.

             Swim    Cycle    Run
Local          56      43      61
Not Local     153     162     208

(a) What percent of participants preferred to run and were local?

(b) What percent of participants aren’t local and don’t prefer to swim?

20. A pollster is taking a poll about the support of a new tax law and whether or not the person paid taxes with their federal tax returns or got a refund in the last year.

             Refund    Paid Some    Paid a Lot
In Favor       68         268          267
Opposed       164         316          218

One of these people is to be selected at random. Find the probability of selecting . . .

(a) A person in favor of the law that also got a refund.

(b) A person who paid a lot that is opposed to the bill.

4.4 Union of Events

We have seen the probability of the intersection of events, i.e. A and B. Now we would like to examine the probability of A or B. Technically, we are looking at the union of events.

Definition: The Union of two events, A and B, consists of all elements that are in either A or B and is denoted as A ∪ B.

In the Venn diagram below, A ∪ B is shaded.

[Venn diagram: sample space S containing the events A and B, with A ∪ B shaded]

Recall the following table from a previous section

                          Full Time    Part Time    Total
College Graduate              59           15         74
Not a College Graduate        97           53        150
Total                        156           68        224

Let’s say we want to find the probability that someone selected is either a college graduate or a full time employee. We are looking to get P(F ∪ G). We need to find how many people are in the union of the two events F and G. We can simply add the numbers in the cells that are in F or G (or both) and then divide by the total number of employees.

$$P(F \cup G) = \frac{59 + 15 + 97}{224} = \frac{171}{224} = .7634$$

Let’s look at deriving a formula for the union of events. In the Venn diagram we think of the probabilities as ratios of areas. In the diagram we need to subtract off the area of the intersection because it is counted in both A and B.

[Diagram: area of A ∪ B = area of A + area of B − area of A ∩ B]

Dividing by the area of the sample space, S, we get the following

[Diagram: the same picture with each area labeled by its probability, P(A ∪ B) = P(A) + P(B) − P(A ∩ B)]

Finally, we get

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

If A and B are mutually exclusive then

P (A ∩B) = 0 and P (A ∪B) = P (A) + P (B)

We can use this to find the probability in the opening example of this section. Substituting in for A and B we get

$$P(F \cup G) = P(F) + P(G) - P(F \cap G)$$

$$P(F \cup G) = \frac{156}{224} + \frac{74}{224} - \frac{59}{224} = \frac{171}{224} = .7634$$

As expected, we get the same answer as before.
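The agreement between the two approaches can also be checked numerically. The following is a minimal Python sketch, not part of the original text; the dictionary keys and variable names are illustrative assumptions, while the counts are taken from the table above.

```python
from fractions import Fraction

# Counts from the employment table (graduate status, work status).
counts = {
    ("graduate", "full"): 59,
    ("graduate", "part"): 15,
    ("not_graduate", "full"): 97,
    ("not_graduate", "part"): 53,
}
total = sum(counts.values())  # 224 employees

# Direct count: add every cell that is in F (full time) or G (graduate).
in_union = sum(n for (grad, work), n in counts.items()
               if grad == "graduate" or work == "full")
p_direct = Fraction(in_union, total)

# Addition rule: P(F) + P(G) - P(F and G).
p_f = Fraction(counts[("graduate", "full")] + counts[("not_graduate", "full")], total)
p_g = Fraction(counts[("graduate", "full")] + counts[("graduate", "part")], total)
p_fg = Fraction(counts[("graduate", "full")], total)
p_rule = p_f + p_g - p_fg

print(p_direct, p_rule, round(float(p_rule), 4))
```

Using exact fractions avoids any rounding differences between the two calculations, so they can be compared with a plain equality test.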

Example 4.4.1.

Bob lives in a rural area. His only source of transportation into the city is his car. On 23% of days his car won’t start. On 18% of days the roads are impassable. On 5% of days both the roads are impassable and his car won’t start. What is the probability that Bob won’t be able to get to town on a randomly selected day?

Solution.

The problem is asking for the union of two events. Let’s start by defining a few events.

Let C = "the car won’t start"
Let R = "the roads are impassable"

Translating, we are looking for P(C ∪ R). We are given P(C) = .23, P(R) = .18, and P(C ∩ R) = .05. Putting these into the formula we get

P (C ∪R) = P (C) + P (R)− P (C ∩R) = .23 + .18− .05 = .36

So the probability is .36 that Bob won’t be able to get to town. Put in percent form, on 36% of days Bob can’t get to town.
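As a cross-check on this example, the union can also be built from three non-overlapping pieces: only the car fails, only the roads fail, or both fail. The Python sketch below is an illustration under that assumption; the variable names are not from the text.

```python
from fractions import Fraction

# Probabilities given in Example 4.4.1, written as exact fractions.
p_car = Fraction(23, 100)    # P(C): the car won't start
p_roads = Fraction(18, 100)  # P(R): the roads can't be driven
p_both = Fraction(5, 100)    # P(C and R)

# Addition rule for the union.
p_union = p_car + p_roads - p_both

# Cross-check: split the union into three pieces that do not overlap.
only_car = p_car - p_both      # car fails, roads fine
only_roads = p_roads - p_both  # roads fail, car fine
p_pieces = only_car + only_roads + p_both

print(float(p_union), float(p_pieces))
```

Both routes give the same value because subtracting the intersection once is exactly what keeps the "both" days from being counted twice.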

4.4.1 Exercises

1. What is the probability of throwing a pair of dice and getting at least one 3 showing?

2. A die has 3 sides that are blue, 2 sides that are red, and 1 side that is black. What is the probability of throwing the die twice and getting at least one red face up?

3. In a large city, 52% of households own dogs, 38% of households own cats, and 19% own both cats and dogs. What percent of households own a pet (cat or dog)?

4. A baseball fan is considering going to a game on the weekend. The fan is broke and will only be able to buy tickets after getting paid on Friday. The fan’s favorite team has a game on both Saturday and Sunday. The fan estimates that the probability that Saturday’s game will have tickets available on Friday is .67. The fan also estimates a .84 probability that the Sunday game will have tickets available on Friday. There is a .53 probability that on Friday there will be tickets available for both days. What is the probability that the fan will be able to see the team play this weekend?

5. Several people who wanted to have children in their family were asked how many siblings they have and how many children they wanted to have. See the table for the results.

                    0 siblings    1+ siblings
Want 1 child            49             94
Want 2+ children        68            106

One of these persons is to be selected. Find the probability that . . .

(a) They have no siblings or want to have 2 or more children.

(b) They weren’t an only child or don’t want to have an only child.

6. Several people with smart phones were asked their age and whether they use their phone to check their email. The results are in the table that follows.

                      Check email    Don’t check email
Under 30 years old         49               94
30+ years old              68              106

One of these persons is to be selected. Find the probability that . . .

(a) They don’t check email on their smart phone or are under 30 years old.

(b) They are 30+ years old or they check their email on their smart phone.

7. An airline is interested in how late planes arrive. Several of the company’s flights were selected and it was determined how late, if at all, the planes were and the size of the plane. The results are in the table that follows.

                  On Time or Early    Late: < 15 minutes    15+ minutes late
Small aircraft          302                  198                   58
Large aircraft           83                  123                   91

One of these flights is to be selected. Find the probability that . . .

(a) A large aircraft is chosen or it is 15+ minutes late.

(b) The flight wasn’t late or the aircraft was small.

8. Several students at a high school were asked their class and whether or not they planned on going to college. The results follow.

                Frosh    Soph    Junior    Senior
College: Yes     302     201      186       166
College: No      136     124      105        86

One of these students is to be selected. Find the probability that . . .

(a) A freshman or someone planning on going to college is chosen.

(b) A senior or a student not planning on going to college is chosen.

9. Several movie-goers were asked to rate a movie that had a lot of violence prior to release. They were also asked if they had children. The results are in the table.

                       Thumbs Up    Thumbs Down
Have Children              26            75
Don’t Have Children        67            96

We are to select one of these movie-goers. Find the probability that . . .

(a) A childless movie-goer or someone that gave it a thumbs up.

(b) A movie-goer that gave it a thumbs down or someone that has children.

10. A biologist is examining salmon carcasses on a stretch of river and notes the gender and whether or not the salmon was hatchery raised. The results follow in the table.

            Female    Male
Hatchery      147      127
Wild          116      107

One of these salmon carcasses is to be selected at random. Find the probability that the salmon . . .

(a) Is hatchery raised or is a female.

(b) Is a wild salmon or is a male.

11. A survey asked several people what type of phone and computer they own. The results are summarized below.

                  iPhone    Other Phone
Apple Computer      89          32
Other Computer     167         216

One of these persons is to be selected at random. Find the probability of selecting . . .

(a) An iPhone owner or someone that has an Apple computer.

(b) A non-Apple computer owner or someone that has an ‘other’ phone.

12. At a company meeting, several donuts were consumed. The results of the donut by type and topping are summarized in the table.

         Chocolate    Glazed    Maple
Cake        11          24        9
Yeast       18          33       11

One of these donuts is to be selected. Find the probability that . . .

(a) A yeast or glazed donut is selected.

(b) A maple or chocolate donut is selected.

13. At a business luncheon, three entrees were served: a beef dish, chicken dish, and a vegetarian dish. Additionally guests were given the option of water, tea, or lemonade. The results are summarized in the table below.

              Water    Iced Tea    Lemonade
Beef            17        35          16
Chicken         23        33          10
Vegetarian      12        11           8

One of these meals is to be selected. Find the probability that . . .

(a) A meal with chicken or water is selected.

(b) A meal with water or iced tea is selected.

14. Some roses are bred to have no, or very little, scent. Others have a noticeable scent. Several roses are individually sold. The color and if they have a scent are in the following table.

             Red    White    Yellow
Scented       75      44       23
Unscented     52      26       10

(a) What percent of roses sold were red or unscented?

(b) What percent of roses sold were not red?

15. In a triathlon, participants swim, cycle, and run. At a local triathlon, several participants were asked their favorite part of the race and if they are a ‘local’. A summary of the results follows.

             Swim    Cycle    Run
Local          56      43      61
Not Local     153     162     208

(a) What percent of participants were local or preferred the swim?

(b) What percent of participants weren’t local or preferred the run?

16. A pollster is taking a poll about the support of a new tax law and whether or not the person paid taxes with their federal tax returns or got a refund in the last year.

             Refund    Paid Some    Paid a Lot
In Favor       68         268          267
Opposed       164         316          218

One of these people is to be selected at random. Find the probability of selecting . . .

(a) A person in favor of the law or someone that got a refund.

(b) A person who paid a lot or someone that is opposed to the bill.

4.5 Counting Techniques

If we are interested in playing the Powerball Lottery and want to know the probability of winning, how do we do this? We have, conceptually, the idea already. We simply need to divide one (there is only one way to pick the correct numbers) by the total number of ways of picking the numbers. How do we count the number of ways? In this section we will address this.

When we are counting, we need to distinguish whether we are sampling without replacement or with replacement. We also need to determine if the order the objects are selected matters or not.

Example 4.5.1.

A vacationer is picking shorts, a shirt, and a hat to go explore. The vacationer brought 3 shorts, 4 shirts, and 2 hats. How many different outfits are possible?

Solution.

This problem is really a tree diagram problem. We will see that the tree needs some ‘pruning’. When our vacationer goes to get dressed, they need to pick out shorts: there are 3 different ways to do this. This is seen in the tree diagram where we have three branches. Next, we need a shirt. There are 4 ways to pick out a shirt for each way we pick a pair of shorts. In the tree diagram there are 4 branches coming off each of the 3 branches from before. This gives us 12 ways to pick out a pair of shorts and a shirt. Lastly, we need to pick a hat. There are 2 branches for each short/shirt combination. This gives us a total of 24 different possible outfits.

In the problem we made the assumption that there were no restrictions on combinations of outfits (as if our vacationer had no sense of fashion). If there were restrictions, e.g. this pair of shorts does not look good with this shirt, then the counting becomes more difficult.
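The tree diagram for the outfit example can be mimicked by listing every path through it. This is a small Python sketch using `itertools.product`; the item labels are hypothetical, since the example only gives the counts 3, 4, and 2.

```python
from itertools import product

# Hypothetical labels; only the counts (3, 4, 2) come from the example.
shorts = ["shorts 1", "shorts 2", "shorts 3"]
shirts = ["shirt 1", "shirt 2", "shirt 3", "shirt 4"]
hats = ["hat 1", "hat 2"]

# Each tuple is one path through the tree diagram: (shorts, shirt, hat).
outfits = list(product(shorts, shirts, hats))

print(len(outfits))  # number of leaves on the tree
print(outfits[0])    # the first path through the tree
```

If some combinations were ruled out, one way to handle it would be to filter this list before counting, which is the code analogue of the counting becoming more difficult.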

Counting Principle:

If an experiment consists of 2 steps, where the steps have n1 and n2 possible outcomes respectively, then the total number of outcomes of the experiment is n = n1 × n2

For three steps we get n = n1 × n2 × n3.

This can be extended to any number of steps in the experiment.

Example 4.5.2.

Our vacationer has stopped to get lunch. There are 8 different sandwiches to choose from, 6 different chips possible, 12 drinks possible, and 3 different cookies. The vacationer will choose one sandwich, one type of chip, one drink, and one cookie. How many different lunches are possible?

Solution.

Imagine drawing a tree diagram here! We will use the above counting principle. We have 8 × 6 × 12 × 3 = 1728 different ways to pick a lunch.
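The counting principle is a single multiplication, so it is easy to sketch in code. The helper name `count_outcomes` below is an illustrative assumption, not something defined in the text.

```python
from math import prod

def count_outcomes(*choices_per_step):
    """Multiply the number of choices at each step (counting principle)."""
    return prod(choices_per_step)

lunches = count_outcomes(8, 6, 12, 3)   # sandwiches, chips, drinks, cookies
outfit_count = count_outcomes(3, 4, 2)  # shorts, shirts, hats (Example 4.5.1)

print(lunches, outfit_count)
```

The same function works for any number of steps, which matches the remark that the principle extends beyond two or three steps.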

4.5.1 Factorials

We now look at what are called factorials. These come from our counting principle.

Example 4.5.3.

Our vacationer has brought four books to read while vacationing. In how many different orders can the vacationer read them?

Solution.

From our counting principle it should be: the number of ways to pick the first book × the number of ways to pick the second book × the number of ways to pick the third book × the number of ways to pick the fourth book.

There are 4 books so there are 4 ways to pick the first book.
There are now 3 books left so there are 3 ways to pick the second book.
There are now 2 books left so there are 2 ways to pick the third book.
There is only 1 book left so there is 1 way to pick the fourth book.

So there are 4 × 3 × 2 × 1 = 24 ways to pick the order to read the books.

n! is read ‘n factorial’ and represents the number of ways to arrange n items. By definition, 0! = 1.

$$n! = n \times (n-1) \times \cdots \times 3 \times 2 \times 1$$

To do this on your calculator, input the number first then go to MATH > PRB > 4:! then hit ENTER.
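For readers working on a computer rather than a calculator, the book-ordering example can be checked by listing every reading order. This is a Python sketch; the one-letter book titles are hypothetical.

```python
from itertools import permutations
from math import factorial

books = ["A", "B", "C", "D"]  # hypothetical titles for the four books

# Every possible reading order of all four books.
orders = list(permutations(books))

print(len(orders), factorial(4), factorial(0))
```

The list length and `factorial(4)` agree, and `factorial(0)` returns 1, in line with the definition above.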

Example 4.5.4.

In how many ways can we arrange the cards in a standard deck of cards?

Solution.

There are 52 cards in a standard deck of cards so there are 52! ways to arrange them. We get $8.07 \times 10^{67}$ from our calculator.²
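A calculator shows 52! only in scientific notation. As a sketch, Python can compute the exact integer and then format it the same way; the variable name is illustrative.

```python
from math import factorial

deck_orders = factorial(52)   # exact integer, no rounding

print(deck_orders)            # every digit of 52!
print(f"{deck_orders:.4e}")   # scientific notation, like a calculator display
print(len(str(deck_orders)))  # how many digits the exact value has
```

The formatted value matches the calculator display up to rounding, while the exact integer shows how many digits are hidden behind the exponent.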

4.5.2 Permutations

Permutations describe the number of ways of selecting items when order matters. Unlike factorials, where order also matters, we are not selecting all items.

nPr represents the number of ways of selecting r items from n items when order matters.

Example 4.5.5.

From 12 members of a council, 3 members are to be selected: one will serve as the chair, another as secretary, and a third as treasurer. How many ways can the assignments be made?

Solution.

We are selecting 3 people out of 12. Order matters here: being picked as the secretary is different than being selected as treasurer, etc. This is a permutation problem. Although we haven’t figured out how to do this yet as a permutation, we can use the counting principle from before. There are 12 ways to pick the first, 11 ways to pick the second and 10 ways to pick the third. There are a total of 12 × 11 × 10 = 1320 different ways to make the assignments.

nPr can be accessed from your calculator.

Input the first number (n)

MATH>PRB>2:nPr

Input the second number (r)

ENTER

For the above example first input 12 then MATH > PRB > 2:nPr, input the 3 then ENTER. If you aren’t selecting more than a few items, it is probably easier to use the counting principle.
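The same permutation count can be sketched in Python, where `math.perm` (available in Python 3.8 and later) plays the role of the calculator's nPr key. Labeling the council members 1 through 12 is an assumption made for illustration.

```python
from itertools import permutations
from math import perm

# Closed-form count, the analogue of 12 nPr 3 on the calculator.
by_formula = perm(12, 3)

# Brute-force check: list every (chair, secretary, treasurer) assignment.
members = range(1, 13)  # council members labeled 1..12 for illustration
assignments = list(permutations(members, 3))

print(by_formula, len(assignments))
```

Listing the assignments is only practical for small cases, but it confirms that the formula counts ordered selections without repetition.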

² Most calculators will display this as 8.0658E67

4.5.3 Combinations

Combinations are just like permutations with one major difference: with combinations order does not matter.

nCr or, equivalently, $\binom{n}{r}$, read ‘n choose r’, is the number of ways of selecting r items out of n items if order does not matter.

To calculate the combinations, we proceed exactly as we do with permutations but instead of choosing nPr on our calculator, we choose nCr.

Example 4.5.6.

A youngster is standing in front of a soda dispensing machine that has 8 different flavors. The youth will dispense equal amounts of 5 different sodas into their cup, mix and drink. How many different ways can our adventurous youth do this?

Solution.

A few subtleties we need to see: equal amounts of different drinks are being mixed together. Since they are equal amounts and different, we can think of this as choosing 5 items out of 8. Since they are being mixed together, it doesn’t matter which beverage is first, second, etc. So order does not matter.

We get 8C5 = $\binom{8}{5}$ = 56 different possible beverages.
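To see the difference between combinations and permutations concretely, the soda example can be enumerated. In this Python sketch the flavors are simply labeled 1 through 8, which is an assumption for illustration.

```python
from itertools import combinations
from math import comb, factorial, perm

flavors = range(1, 9)  # 8 flavors, labeled 1..8 for illustration

# Every unordered choice of 5 different flavors.
mixes = list(combinations(flavors, 5))
print(len(mixes), comb(8, 5))

# Each mix could be poured in 5! different orders, so 8P5 = 8C5 * 5!.
print(perm(8, 5), comb(8, 5) * factorial(5))
```

The second line of output shows why ignoring order shrinks the count: every mix corresponds to 5! ordered selections.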

The following example uses more than one of the previous methods.

Example 4.5.7.

Find the probability of winning the top prize in the Powerball Lottery.

Solution.

We need to do some research. Here we go: There are 69 white balls, numbered 1 to 69, and 26 Powerballs, numbered 1 to 26. 5 white balls are picked and 1 Powerball. To play, you bubble in the numbers you want for the white balls and bubble in one Powerball number³. To win, you need to match the correct 5 white balls and the Powerball.

The order the balls are selected does not matter. Think of this as a two step experiment: first pick the white balls then pick the Powerball.

For the white balls, we are picking 5 balls out of 69 balls. There are $\binom{69}{5} = 11{,}238{,}513$ ways to choose the white balls. If we got all of these, we need to match the Powerball as well.

There are 26 Powerballs so there are 26 ways to pick one Powerball.

³ Just like the bubble-in scantron form you use for multiple choice tests.

Putting it together, there are 11,238,513 × 26 = 292,201,338⁴ different ways for the Powerball lottery to turn out. There is only one way to match all the numbers chosen. So, using the classical approach to probability we get

$$P(\text{winning}) = \frac{1}{292{,}201{,}338}$$
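The two-step Powerball count can be reproduced with a few lines of Python. This is a sketch of the calculation above, with illustrative variable names.

```python
from fractions import Fraction
from math import comb

white_ways = comb(69, 5)  # choose 5 of 69 white balls, order ignored
powerball_ways = 26       # choose 1 of 26 Powerballs

# Counting principle: multiply the outcomes of the two steps.
total_ways = white_ways * powerball_ways

p_win = Fraction(1, total_ways)
print(white_ways, total_ways, float(p_win))
```

Keeping the probability as a fraction preserves the exact "1 in total_ways" form; converting to a float is only for display.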

Calculating Combinations and Permutations by hand. (Optional)

For small values of r, we can calculate combinations and permutations just as easily by hand as with the nCr or nPr functions on our calculator. The formulas follow, presented for the sake of completeness. We will never use these formulas as written. To do so, we would need to find factorials. To find the factorials, we use our calculator, which has nPr and nCr as neighbors. So we will use nPr and nCr directly in those cases.

$${}_nP_r = \frac{n!}{(n-r)!} \quad \text{and} \quad {}_nC_r = \binom{n}{r} = \frac{n!}{(n-r)!\,r!}$$

To calculate permutations by hand, nPr, we start with the first number, n, multiply by one less than the first number, multiply by two less than the first number, etc., until the number of factors equals r.

$${}_{12}P_3 = 12 \times 11 \times 10 = 1320$$

(Start with 12 and use 3 factors.)

To calculate combinations, $\binom{n}{r}$, start with a fraction with n in the numerator and r in the denominator. Then multiply by one less than each number, and continue until the last factor in the denominator is 1.

$${}_{12}C_3 = \binom{12}{3} = \frac{12 \times 11 \times 10}{3 \times 2 \times 1} = 220$$
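The by-hand procedure translates directly into short loops. The function names `npr_by_hand` and `ncr_by_hand` in this Python sketch are illustrative assumptions; the built-in `math.perm` and `math.comb` are used only to check the results.

```python
from math import comb, perm

def npr_by_hand(n, r):
    """Multiply r factors, starting at n and counting down by one each time."""
    result = 1
    for k in range(r):
        result *= n - k
    return result

def ncr_by_hand(n, r):
    """Divide the product n(n-1)... (r factors) by r(r-1)...1."""
    denominator = 1
    for k in range(1, r + 1):
        denominator *= k
    return npr_by_hand(n, r) // denominator

print(npr_by_hand(12, 3), perm(12, 3))
print(ncr_by_hand(12, 3), comb(12, 3))
```

Integer division is safe here because the product of r consecutive integers is always divisible by r!.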

4.5.4 Exercises

1. After having their computer stolen, a computer user is looking at purchasing a new system. There are 6 different computers that are acceptable, 8 different monitors, and 6 different printers. How many ways can the user select a new system?

2. A car purchaser is considering purchasing a new car. They have the model picked out but now need to decide on options. There are 5 different colors that the purchaser likes, 4 different trim options, and 4 different stereo options. How many different ways can the car be specified?

3. A burger joint only sells burgers, fries, and drinks. If there are 4 different burger options, 3 different size fries, and 12 different drinks what is the probability of someone guessing what the next customer orders, assuming they order one of each?

⁴ Compare to the US population in 2020: 329,174,929

4. A school has 10 classrooms in a building. Over summer break there are enough funds to remodel 4 of the classrooms. In how many different ways can the rooms be chosen for remodeling?

5. A third-grade teacher is casting a play about animals in the woods for the class. The teacher needs to cast an owl, a bear, a wolf, and an eagle. In how many ways can the teacher cast the play if there are 25 students in the class?

6. A music teacher has only four instruments but 20 students. There is one guitar, one tambourine, one ukulele, and one tom-tom drum.

(a) In how many ways can the instruments be assigned?

(b) Suppose it turns out that 7 of the students only know how to play guitar, 3 can only play the ukulele, and the remaining students can play both the tambourine and the tom-tom, but not the guitar or ukulele. How many ways can the instruments be assigned if only people that know how to play the instrument get assigned to a particular instrument?

7. A child is planning on coloring while on vacation. At the last minute they realize that they forgot the crayons. There are 30 different crayons but the child is in a hurry and can only take 6 colors. How many ways can the child make the choices?

8. From biology we learned that there are 4 different bases for the DNA structure abbreviated T, A, C, and G. According to Wikipedia, the first DNA sequencing was done in 1977 of a bacteriophage that has 5386 bases. (Sequencing is finding the order of the bases in the DNA.) Find an expression that gives the number of ways to pick the bases. (Your calculator will not be able to do the calculation.)

9. A quiz consists of 10 multiple choice questions. Each question has four possible answers but only one answer is correct for each question. If a student guesses, what is the probability of getting all questions correct?

10. A student is picking classes for the upcoming semester and wants to take 3 classes: a math class, an English class, and an art class. There are 5 different math classes that the student can take, 9 different English classes, and 4 art classes. None of the times conflict for the classes. How many ways can the student make their schedule?

11. A baseball team has 16 players but only 11 can play at a time.

(a) How many different ways can the coach pick the players to start the game, including what position they will play?

(b) The assistant coach will decide the order that the players bat. In how many ways can this be done once the starters have been chosen?

12. A manufacturer of chips is planning on a multipack. There are 12 different varieties of chips from which to choose. The multipack will have 4 different types of chips.

(a) How many different ways can the chips be selected if the multipack contains 20 bags of chips with equal numbers of the different types of chips?

(b) How many different ways can the chips be selected if the multipack contains 20 bags of chips with 8 of one kind, 6 of another kind, 4 of another kind, and 2 of the last kind?

13. In a department of 20 people, 5 are to be selected to evaluate their supervisor. In how many different ways can the selection be made?

14. A combination lock is opened by rotating a wheel clockwise to a number, then counterclockwise to a different number, then finally clockwise to a number different from the second number. The same number can’t be consecutive: 34-5-34 is ok, but 34-34-5 is not. If the wheel has 35 numbers on it, how many different combinations are possible?

15. In California, the standard licence plate for a car is a digit, followed by 3 letters, and followed by 3 digits. How many different licence plates are possible?

16. A state offers personalized plates for automobiles. Residents can pick any 6 characters (either a letter or digit). How many different licence plates are possible? (In reality, the actual number is a little less than what you will calculate because you can’t create a licence plate that says anything offensive.)

17. A computer system at a company has been hacked so a user needs to change their password. The user is lazy so they are going to create a password that consists of only letters from the home row of their keyboard: ASDFGHJKL. The user needs to pick a password that is exactly 6 characters long.

(a) How many passwords are possible if they only use lowercase letters?

(b) How many passwords are possible if they are able to use uppercase or lowercase letters?

(c) If the user needs a password that contains at least one uppercase letter and at least one lowercase letter how many passwords are possible? Hint: there are three distinct possibilities: all uppercase, all lowercase, and at least one of each.

18. A movie has 5 famous performers in it. The producer has the task of listing the big stars, with big egos to match, on the credits. How many different ways can the names be arranged in the credits?

19. A chocolate lover has 8 different types of chocolate and is planning on melting equal amounts of 4 of the chocolates together for a fondue. How many different flavors are possible?

20. For a coworker’s birthday, a worker is planning on bringing cupcakes for the festivities. There are 12 different flavors of cake that are possible and 8 different frosting flavors.

(a) How many different options are there if each cupcake needs frosting?

(b) How many different options are there if not all cupcakes need to have frosting?

21. After receiving gifts for their birthday, a grandparent writes 6 thank-yous. The cards are written and the envelopes addressed when they ask you to mail them. Unfortunately, the cards and envelopes fall to the ground scattered. You decide to guess which thank-you goes in which envelope. What is the probability that you guess all correctly?

22. A phone has a code needed to get in. The user needs to enter 4 digits to gain access.

(a) What is the probability of you guessing the code on the first guess?

(b) What is the probability of you needing no more than two guesses to get in?

(c) What is the probability that 4 guesses aren’t enough?

23. In the game of Clue, players need to figure out who the murderer is, what room it happened in, and the weapon used. Three cards are hidden from view with the solution. If they correctly figure it out, they win the game. There are 9 rooms, 6 suspects, and 6 weapons.

(a) What is the probability of you guessing the three cards that are to be selected?

(b) You have been dealt 2 suspect cards and 1 weapon card. What is the probability of winning after looking at your cards and making one guess?

(c) What is the probability that 4 guesses aren’t enough?

24. At a party, the party planner wants two cakes and three types of ice cream available to guests for dessert. There are 8 different cakes available and 6 different ice creams. In how many different ways can the planner select the desserts for the party?

Chapter 5

Discrete Probability Distributions

5.1 Random Variables

We have looked at some probability calculations in the previous chapter. In this chapter we will look at some situations where the type of problem is the same but the only difference is the specific application. For example, if we are looking at probabilities of flipping a coin five times and looking at the probability of getting three heads or tossing a die five times and looking at the probability of getting 3 even numbers, the problems are the same. The only difference is the setting of the problem. We need to distinguish discrete and continuous random variables.

We now turn our attention to Random Variables. These are just a more formal way to talk about variables. In this chapter, our focus will be on quantitative variables. We will be examining two different types of random variables: continuous and discrete.

A variable whose value is determined by a random process is a Random Variable.

A Discrete Random Variable is a random variable that can only take on a countable number of possible values.

A Continuous Random Variable is a random variable that can take on any value in an interval or intervals.

We will discuss discrete random variables in this chapter and continuous random variables in the next chapter.

Example 5.1.1.

Classify the following as continuous or discrete random variables:

1. The number of DUI arrests at a randomly selected sobriety checkpoint
2. The height of a randomly selected adult
3. The number of strikeouts a pitcher has in a randomly selected game
4. The time it takes a randomly selected taxpayer to prepare their taxes
5. The speed of a randomly selected car on the freeway at noon

Solution.

The key to determining which it is lies in figuring out what the possible values of the random variable are. If you can list them all (even if you use an ellipsis . . .) then it is discrete. If the random variable can be anything between values, it is a continuous random variable.

Notice that they all include ‘random’ or ‘randomly’ in their definition. It needs to be clear how the value is obtained and what it represents. In the first one, we randomly select a sobriety checkpoint and then count the number of DUI arrests. The random variable is not ‘DUI arrests’: it is not clear what we are talking about. Is this in the entire United States? Santa Cruz county? We don’t know. Our random variable would not be well defined.

1. The number of DUI arrests at a randomly selected sobriety checkpoint: The possible values are 0, 1, 2, 3, . . . Therefore this random variable is discrete.

2. The height of a randomly selected adult: A typical height might be 67 inches. What is next? There isn’t a ‘next’ possible value. If you think 68 inches, what about 67.5? or 67.1? The height can take on any value in an interval. This random variable is continuous.

3. The number of strikeouts a pitcher has in a randomly selected game: How many strikeouts can a pitcher have? 0, 1, 2, 3, . . . Therefore this is a discrete random variable.

4. The time it takes a randomly selected taxpayer to prepare their taxes: also continuous. We can’t list the possible values. As soon as you list two possible values the variable can take on any value between the two values. That makes this random variable continuous.

5. The speed of a randomly selected car on the freeway at noon: Again, this random variable is continuous. If it were discrete, instead of your car accelerating smoothly, your drive would be very jerky. You’d be driving at 60 mph then, BAM, you’re driving 61 mph etc.

We have looked in the past at frequency distributions, percentage distributions, etc. In this section we introduce another distribution: the probability distribution.

Just as the frequency distribution has a list of all possible values and the associated frequencies, a probability distribution will have a list of all possible values of the random variable and the associated probabilities.

The following are examples of probability distributions.

X    P(X)
0    0.26
1    0.23
2    0.18
3    0.17
4    0.12
5    0.04

X    P(X)
0    0.62
1    0.25
2    0.13

X    P(X)
2    1/16
3    3/16
4    4/16
5    4/16
6    3/16
7    1/16

Notice that there are two conditions that every probability distribution must meet: the probabilities must add up to one, and each probability must be between 0 and 1 (in particular, none can be negative). You can verify that the above distributions meet both conditions.
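The two conditions can also be checked mechanically. The following is a minimal Python sketch, not part of the original text; the function name `is_probability_distribution` and the tolerance `tol` (to allow for rounding in decimal probabilities) are illustrative assumptions.

```python
from fractions import Fraction

def is_probability_distribution(probs, tol=1e-9):
    """Return True if every probability lies in [0, 1] and they sum to 1."""
    in_range = all(0 <= p <= 1 for p in probs)
    total = sum(probs)
    # Allow a tiny tolerance because decimals such as 0.26 are not exact floats.
    return in_range and abs(total - 1) <= tol

# The three example distributions listed above.
dist_a = [0.26, 0.23, 0.18, 0.17, 0.12, 0.04]
dist_b = [0.62, 0.25, 0.13]
dist_c = [Fraction(k, 16) for k in (1, 3, 4, 4, 3, 1)]

for name, dist in [("first", dist_a), ("second", dist_b), ("third", dist_c)]:
    print(name, is_probability_distribution(dist))
```

A list containing a negative entry, or one whose entries total more than one, would be reported as `False`.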

Once we have the distribution, we can also look at a graph of the distribution similar to a bar chart.

Following is the graph of the first probability distribution above.

[Figure: graph of the probability distribution, with X (0 to 6) on the horizontal axis and P(X) (0 to .30) on the vertical axis.]

The ‘bars’ that we use in histograms and bar charts are replaced with line segments. If there were any width to these ‘bars’ we would wonder what the width represented.

For this distribution we would describe it as skewed right.

5.1.1 The Mean and Standard Deviation of a Discrete Random Variable

For a random variable the mean, µ, is referred to as the Expected Value, E(X). It is what we ‘expect’. If X represents the number of calls a tow truck receives on a randomly selected day, the mean would simply represent the number of tows we can ‘expect’ in a given day.

Let us assume that the probability distribution is based on a frequency distribution with a total of 100. Our distribution becomes

X    Frequency
0    26
1    23
2    18
3    17
4    12
5    4

The mean, or expected value, would be simply obtained as

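\[ \mu = \frac{0 + 0 + \cdots + 0 + 1 + 1 + \cdots + 1 + 2 + 2 + \cdots + 2 + 3 + \cdots + 5}{100} \]

There are 26 0’s, hence the · · · etc. This then becomes

\[ \mu = \frac{(0 \times 26) + (1 \times 23) + (2 \times 18) + (3 \times 17) + (4 \times 12) + (5 \times 4)}{100} \]

This can be written as

\[ \mu = 0 \times \frac{26}{100} + 1 \times \frac{23}{100} + 2 \times \frac{18}{100} + 3 \times \frac{17}{100} + 4 \times \frac{12}{100} + 5 \times \frac{4}{100} \]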

Notice that this is the sum of the random variable times the associated probability. We can write a formula:

\[ \mu = E(X) = \sum X P(X) \]

We can also do this on our calculator:

Input the values of X into L1 and the probabilities into L2.
STAT > CALC > 1:1-Var Stats
Specify lists L1, L2 (you need to make sure you have the comma between the lists)

Our calculator gives us

x̄ = 1.78
σx = 1.49
n = 1

Notice that the Sx is blank. This is expected because n = 1. So we have E(X) = 1.78 and σ = 1.49.

Recall that the mean is where the graph ‘balances’. Look at the graph. It is reasonable that the graph would balance at 1.78.
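For readers without the calculator, the same two numbers can be reproduced in a few lines of Python. This is only a sketch: the function name `mean_and_sd` is made up for illustration, and the standard deviation uses the usual definition for a discrete random variable, σ = √(∑(X − µ)²P(X)), which this section does not write out explicitly.

```python
from math import sqrt

def mean_and_sd(values, probs):
    """Mean and standard deviation of a discrete random variable."""
    # Expected value: sum of each value times its probability.
    mu = sum(x * p for x, p in zip(values, probs))
    # Variance: probability-weighted squared distance from the mean.
    variance = sum((x - mu) ** 2 * p for x, p in zip(values, probs))
    return mu, sqrt(variance)

values = [0, 1, 2, 3, 4, 5]
probs = [0.26, 0.23, 0.18, 0.17, 0.12, 0.04]

mu, sigma = mean_and_sd(values, probs)
print(round(mu, 2), round(sigma, 2))  # 1.78 1.49
```

The rounded results agree with the calculator output E(X) = 1.78 and σ = 1.49.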

Example 5.1.2.

In roulette, the wheel consists of 38 slots: 18 red slots, 18 black slots, and 2 green slots. You bet $1 on black. If black comes up, you win $1. If not, you lose. Construct a probability distribution and find the expected value where X is the amount of money won. Interpret the expected value.

Solution.

Each slot is equally likely so the probabilities are straightforward. There are only two things that can happen: win or lose. There are a total of 18 winning slots and 20 losers.

          X     P(X)
(win)     1     18/38
(lose)    -1    20/38

To calculate the expected value we will use the formula µ = E(X) = ∑ XP(X).

          X     P(X)     XP(X)
(win)     1     18/38    18/38
(lose)    -1    20/38    -20/38
                Sum      -2/38

The expected value is -$.0526 (=-2/38) or -5.26 ¢. What this means is that every time we bet one dollar, we lose, on average, about a nickel.
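The roulette calculation can be reproduced exactly with fractions. The sketch below is illustrative only; the dictionary name `outcomes` is an assumption, and it maps each amount won to its probability.

```python
from fractions import Fraction

# Amount won (in dollars) mapped to its probability for a $1 bet on black.
outcomes = {1: Fraction(18, 38), -1: Fraction(20, 38)}

# Expected value: sum of X times P(X).
expected = sum(x * p for x, p in outcomes.items())
print(expected, round(float(expected), 4))  # -1/19 -0.0526
```

Python reduces -2/38 to -1/19, which is the same number as the -$.0526 found above.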

Example 5.1.3.

A car dealership has sent out an advertisement that promises people who come in will receive a gift. Guests are guaranteed one of the following: a new car or $50,000, a TV worth $2,000, tickets to a local theme park worth $200, or a $5 Target gift card. What are your expected winnings? How much does it cost the dealership? The probabilities are stated as follows: the car/$50,000 is 1 in 20,000, the TV is 2 in 20,000, the theme park tickets are 5 in 20,000, and the $5 gift card is 19,993 in 20,000.

Solution.

Since we want to know our ‘expected winnings’ we are looking for the mean of the winnings. This means our random variable will be the amount we win. Let us construct the probability distribution and use it to find the mean.

X        P(X)           XP(X)
50000    1/20000        50000/20000
2000     2/20000        4000/20000
200      5/20000        1000/20000
5        19993/20000    99965/20000
         Sum            154965/20000

The probability distribution consists of just the first two columns above. The third is used to calculate the mean which is $7.75 (=154965/20000). When we look at the probabilities, we assume that the flier is being sent to 20,000 customers. Assuming this is the case, we find the total amount is $154,965 plus any additional costs: postage, printing, etc.

This example was inspired by an advertisement received by the author in the mail. If several dealerships work together to put this on they can split the expenses and focus on only customers that might be in the market for a car.

Example 5.1.4.

20% of students at a large university are juniors. A random sample of 2 students is to be selected. Let X be the number of juniors selected. Find the probability distribution.

Solution.

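Let us construct a tree diagram with the information. In the diagram, J represents getting a Junior and J̄ represents not getting a Junior.

[Tree diagram: each selection branches to J with probability .20 and to J̄ with probability .80. The four paths are summarized below.]

First student    Second student    Probability    X
J (.20)          J (.20)           .04            X = 2
J (.20)          J̄ (.80)           .16            X = 1
J̄ (.80)          J (.20)           .16            X = 1
J̄ (.80)          J̄ (.80)           .64            X = 0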

To the right of the tree are the probabilities and the appropriate values of the random variable. Starting with the root of the tree, count the number of J’s along each path and we get the values of X.

Note that there are two different places in the tree where X = 1. We need to add these together. Summarizing, we get

X    P(X)
0    .64
1    .32
2    .04

Note that, as needed, the probabilities add up to one.
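The tree diagram amounts to listing every path, multiplying the probabilities along it, and adding up the paths that give the same value of X. A minimal Python sketch of that bookkeeping follows; the labels `"J"` and `"not J"` and the variable names are illustrative assumptions, and the two selections are treated as independent, just as in the tree.

```python
from collections import defaultdict
from itertools import product

p_junior = 0.20
# Probability attached to each branch of the tree.
branch = {"J": p_junior, "not J": 1 - p_junior}

dist = defaultdict(float)
for path in product(branch, repeat=2):      # the four paths through the tree
    prob = 1.0
    for outcome in path:
        prob *= branch[outcome]             # multiply along the path
    x = path.count("J")                     # number of juniors on this path
    dist[x] += prob                         # add paths with the same X

for x in sorted(dist):
    print(x, round(dist[x], 2))
```

Changing `repeat=2` to `repeat=3` would enumerate a three-level tree in the same way.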

5.1.2 Exercises

1. Determine if the following are continuous or discrete random variables.

(a) The number of fish a fisherman catches during a randomly selected fishing trip

(b) The number of foul balls hit into the stands during a randomly selected baseball game

(c) The weight of a randomly selected blue whale

(d) The time a randomly selected teenager spent online in the previous 24 hours

2. Determine if the following are continuous or discrete random variables.

(a) The time it takes for a randomly selected flight to get to its destination

(b) How many dogs are rescued at a particular animal shelter in a randomly selected week

(c) The number of high school students that graduated with a 4.0 or higher last year at a randomly selected high school

(d) The amount of gas a randomly selected car gets the next time they visit a gas station

3. For the following, determine if the given distribution is a possible probability distribution. If it is, find the mean and standard deviation of the random variable.

X    P(X)
0    .32
1    .21
2    .20
3    .12
4    .09
5    .06

X    P(X)
4    1/10
5    2/10
6    2/10
7    3/10
8    1/10
9    1/10

X    P(X)
0    .20
1    .36
2    .18
3    .19
4    .11
5    .10

X    P(X)
2    -.21
3    .35
4    .41
5    .18
6    .15
7    .12

4. For the following, determine if the given distribution is a possible probability distribution. If it is, find the mean and standard deviation of the random variable.

X       P(X)
.25     .41
.50     .26
.75     .13
1.25    .11
1.50    .09

X     P(X)
2     .33
4     .26
6     .20
8     .49
10    .11

X     P(X)
-5    .10
-4    .23
-3    .26
-2    .39
-1    .02

X       P(X)
10      .18
20      .32
200     .22
2000    .15
-10     .13

5. One estimate gives the percentage of people that are left-handed at 15%. Assuming this is true, find the probability distribution of X where X is the number of left-handed people in a random sample of 2.

6. In a large city, 30% of residents are over the age of 50. Let X be the random variable that represents the number of residents that are over 50 in a random sample of 2 residents. Find the probability distribution of X.

7. A pen holder contains 5 black pens and 3 blue pens. A person randomly grabs two pens. Let X be the number of blue pens selected in a random selection of two pens from the container. Find the probability distribution of X.

8. While getting ready for a long trip, a book lover randomly grabs 2 books off the shelf. The shelf contains 5 romance novels and 6 mysteries. Let X be the number of romance novels selected in a random sample of 2 books. Find the probability distribution of X.

9. 30% of a particular model of a car are painted white. If 3 cars are randomly selected, find the probability distribution of X where X is the number of white cars in a random sample of 3 cars.

10. It has been reported that 40% of teens text while they drive. Assuming this is true, find the probability distribution for the random variable that describes the number of teens that text while they drive in a random sample of 3 teens.

11. A farmer is looking ahead to the amount of rain during the growing season. If there is a drought, the farmer will lose $1,000. If there is a marginal amount of rain, the farmer will earn $1,500. If there is sufficient rain, the farmer will earn $5,000. If there is too much rain, the farmer will experience damaged fruit and will only make $3,000. The probability of a drought is estimated to be .08, marginal rain is estimated to have a probability of .35, sufficient rain has a probability of .53 and the probability of too much rain is .04. Find the expected amount the farmer will earn.

12. An insurance company has a policy for a concert. If the headliner cancels, the insurance company will pay the concert promoters $1,000,000 to cover lost revenue from the concert. The concert promoter has paid $15,000 for this policy. The insurance company estimates there is a 2.5% chance the headliner will cancel. Find the expected value of the payout to the promoter. Is this a good deal for the insurance company?

13. A company offers extended warranties to car owners. The policy costs $300 per year. The company estimates the following amounts will need to be paid out, with the given probabilities: $3,000, probability of .001; $1,500, probability of .02; $500, probability of .25. What is the expected amount of money the company takes in for a policy?

14. In roulette, there are 38 slots, each numbered. If you bet $1 on a number and it comes up, you win $35. Find the probability distribution and the expected winnings for this game. Compare with betting on a color. See the example in this section. Which bet is better: a number or a color?

15. Two friends come up with a game. A person rolls a fair die. If a 1 comes up, player A gives player B $8. If a 2 or 3 comes up, player A gives player B $2. If a 4 comes up, player A gives player B $5. If a 5 or 6 comes up, player B gives player A $10. If the game is to be played several times, who is expected to come out ahead: A or B? If they play the game 100 times, who do we expect to end up richer and by how much?

16. A fair game is a game where the expected value is 0. In a die game, a player pays a fee to play. The die is rolled and the player receives an amount, in dollars, equal to the number rolled (for example, if a 3 comes up, the player gets $3). How much does the player need to pay to play the game if it is going to be a fair game?

5.2 Binomial Distribution

Some of the probability problems we have seen before can be grouped together by type once we strip away the specifics of the problem. The binomial distribution is one such distribution. We can think of it as a coin flipping problem where we want to know the probability of flipping a specified number of heads in a given number of flips. In these ‘coin flipping problems’, the probability of flipping heads is not necessarily .5. We need to get some terminology down.

There are some specific properties of a coin flipping problem that, although perhaps obvious, we need to list.

1. We flip the same coin several times.
2. Each time we flip the coin there are only two possibilities: heads or tails.
3. The probability of getting heads doesn’t change.
4. If we get heads, the probability of getting heads on the next flip doesn’t change.

If we look at rolling a die several times and we are only interested in, say, how many times a one is rolled, this follows the four items listed about a coin if we simply replace: ‘coin’ with ‘die’, ‘heads’ with ‘one’, and ‘tails’ with ‘not a one’. Now let us generalize this.

A Binomial Experiment is an experiment in which the following conditions are met:

1. A trial is repeated n times.
2. Each trial consists of two possible outcomes: success and failure.
3. The probability of success does not change throughout the experiment.
4. The trials are independent.

What we are interested in is the number of successes in a binomial experiment. This leads to the following.

A Binomial Random Variable is a random variable that represents the number of successes in a binomial experiment. We write

X ∼ B(n, p)

n and p are called the parameters of the distribution.

This notation, and similar notation for other distributions, will be used throughout. The X is the name of the random variable, B stands for Binomial, and the parameters n and p give the number of trials and the probability of success. The parameters are what we need to totally describe the distribution.

If we flip a fair coin 10 times and are looking at how many times heads is flipped, this will be a binomial experiment. In this case, n is 10, there are only two possible outcomes (heads and tails), the probabilities of heads and tails never change, and what happens on a flip does not affect what happens on subsequent flips. Therefore, this is a binomial experiment. We would write X ∼ B(10, .5).

We can use a formula to calculate the probability or use technology to find the probability.

If X ∼ B(n, p)

\[ P(X = k) = \binom{n}{k} p^{k} q^{n-k} \]

n is the number of trials
k is the number of successes
p is the probability of success
q is the probability of failure and q = 1 − p

Your calculator will also give this probability using a binomialPDF command: P(X = k) = binomPDF(n,p,k).

For cumulative probabilities P(X ≤ k), you would need a binomialCDF command: P(X ≤ k) = binomCDF(n,p,k). (C stands for cumulative.)
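The two calculator commands can be imitated in Python. The functions below are a sketch, not the calculator's actual implementation: the names `binom_pmf` and `binom_cdf` are assumptions chosen to mirror binomPDF and binomCDF, and `math.comb` (available in Python 3.8 and later) supplies the number of combinations.

```python
from math import comb

def binom_pmf(n, p, k):
    """P(X = k) for X ~ B(n, p): k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(n, p, k):
    """P(X <= k) for X ~ B(n, p): add P(X = 0) through P(X = k)."""
    return sum(binom_pmf(n, p, j) for j in range(k + 1))

# Fair coin flipped 10 times, X ~ B(10, .5).
print(binom_pmf(10, 0.5, 5))   # 0.24609375, which is 252/1024
print(binom_cdf(10, 0.5, 4))   # probability of at most 4 heads
```

The argument order (n, p, k) matches the order used for the calculator commands in the text.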

When we attack our probability problems in this chapter and beyond we will follow the following 4 steps.

Four steps in solving probability problems:

1. Indicate what the random variable represents
2. Indicate the appropriate distribution and parameter(s)
3. Indicate what you are calculating, symbolically
4. Answer the question

Example 5.2.1.

According to a report, 20% of all adults in the United States are smokers. Find the probability that in a random sample of 18 adults in the US, 5 are smokers.

Solution.

Let us start by determining what makes up a trial. In this case it would be randomly selecting one person and determining if they were a smoker or not. This will be done 18 times in total. Also note there are two possible outcomes: smoker or non-smoker. We are randomly choosing 18 from the population of US adults, a huge population compared to 18, so our probability of success will not change enough to worry about. Also, we are randomly choosing these people so the trials will be independent. This meets all the criteria of a binomial experiment. Our ‘success’ is whatever we are counting. In this case a ‘smoker’ is our ‘success’.

1. Our random variable X is the number of successes. In this example we get X = number of smokers in a random sample of 18 US adults.

2. We have decided that it is a binomial experiment and so X is a binomial random variable, so we have X ∼ B(18, .2)

3. We want the probability that we get 5 smokers so we write P (X = 5)

4. Lastly, we can get the probability using the formula or our calculator. So we have $\binom{18}{5}(.2)^{5}(.8)^{13} = .1507$ (or binomPDF(18,.2,5) on our calculator), so about 15% of the time you select 18 adults, 5 will be smokers.
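For readers who want to see where .1507 comes from without a calculator command, the arithmetic in step 4 can be broken into pieces. The intermediate values below are rounded, so the last digit of each is approximate.

```latex
\begin{align*}
\binom{18}{5} &= \frac{18 \cdot 17 \cdot 16 \cdot 15 \cdot 14}{5 \cdot 4 \cdot 3 \cdot 2 \cdot 1}
               = \frac{1028160}{120} = 8568 \\
(.2)^{5} &= .00032 \\
(.8)^{13} &\approx .054976 \\
P(X = 5) &= 8568 \times .00032 \times (.8)^{13} \approx 2.74176 \times .054976 \approx .1507
\end{align*}
```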

In the next example we will look at using our calculator to evaluate the probabilities. We have two options on our calculator: PDF and CDF corresponding to P(X = k) and P(X ≤ k), respectively. In order to use the calculator we will need to write all of our probabilities as a combination of these.

Example 5.2.2.

According to the National Institute of Mental Health, in 15% of births the mother suffers from postpartum depression (PPD). Assume that this is true for all births. A random sample of 25 women who recently gave birth is to be selected. What is the probability that the number of these new moms who suffer PPD is...

(a) Fewer than 5
(b) At least 3
(c) Exactly 6
(d) 2 to 7

Solution.

We will proceed with our 4 steps as before. Note that this is a binomial experiment: there are two possible outcomes, we are repeating a trial (select one mom and determine if she had PPD), the probability is not expected to change, and the trials are independent. Also, we are counting the number of women with PPD so that is what we consider a ‘success’.

1. Let X = The number of women with PPD in a random sample of 25 women who recently gave birth.

2. X ∼ B(25, .15)

3. (a) P(X < 5)

(b) P(X ≥ 3)

(c) P(X = 6)

(d) P(2 ≤ X ≤ 7)

4. (a) For P(X < 5), we need to rewrite as P(X ≤ 4). This can easily be evaluated using binomCDF(25,.15,4) = .6821

(b) For P(X ≥ 3), we must rewrite as 1 − P(X ≤ 2) = 1 − binomCDF(25,.15,2) = .7463

(c) P(X = 6) is ready to input as binomPDF(25,.15,6) = .0920

(d) For P(2 ≤ X ≤ 7) we rewrite as P(X ≤ 7) − P(X ≤ 1) = binomCDF(25,.15,7) − binomCDF(25,.15,1) = .8815.

A common point of confusion amongst students is the 1 in the second term. Below is the reasoning, writing P(7) for P(X = 7) and so on for a better fit on the page.

\begin{align*}
P(2 \le X \le 7) &= P(7) + P(6) + P(5) + P(4) + P(3) + P(2) \\
&= P(7) + P(6) + P(5) + P(4) + P(3) + P(2) + P(1) + P(0) - (P(1) + P(0)) \\
&= P(X \le 7) - P(X \le 1)
\end{align*}
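The four rewrites in this example can be checked numerically. The sketch below is illustrative only: `binom_pmf` and `binom_cdf` are hypothetical stand-ins for the calculator's binomPDF and binomCDF commands, and each part is written as the PDF/CDF combination described in the solution, with "fewer than 5" taken as P(X ≤ 4).

```python
from math import comb

def binom_pmf(n, p, k):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(n, p, k):
    """P(X <= k) for X ~ B(n, p)."""
    return sum(binom_pmf(n, p, j) for j in range(k + 1))

n, p = 25, 0.15  # X ~ B(25, .15)

fewer_than_5 = binom_cdf(n, p, 4)                      # P(X < 5) = P(X <= 4)
at_least_3 = 1 - binom_cdf(n, p, 2)                    # P(X >= 3) = 1 - P(X <= 2)
exactly_6 = binom_pmf(n, p, 6)                         # P(X = 6)
from_2_to_7 = binom_cdf(n, p, 7) - binom_cdf(n, p, 1)  # P(2 <= X <= 7)

for label, value in [("(a)", fewer_than_5), ("(b)", at_least_3),
                     ("(c)", exactly_6), ("(d)", from_2_to_7)]:
    print(label, round(value, 4))
```

Note that part (d) subtracts the cumulative probability up to 1, not up to 2, so that P(2) stays in the answer.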

5.2.1 Exercises

Solve the following probability problems by addressing the 4 steps outlined in the section.

1. According to one site, 15% of Americans are left-handed. Assume this is true for the current population. What is the probability that in a random sample of 16 Americans, 4 are left-handed?

2. According to breastcancer.org, 12.4% of women will develop invasive breast cancer in the course of their life. In a random sample of 17 women, what is the probability that 3 to 6 women will develop breast cancer in the course of their life?

3. According to the Red Cross, 53% of Latino-Americans have O+ blood. What is the probability that in a random sample of 10 Latino-Americans, at most 5 have O+ blood?

4. A professional baseball player has a batting average of .302. This means they get a hit 30.2% of the time. What is the probability that in 12 randomly selected trips to the plate, the player will get at most 5 hits?

5. An April 8, 2015 Pew Research Center survey indicated that 73% of teens have or have access to a smartphone. Assume that is true for the current population of American teens. Find the probability that of 18 randomly selected American teens more than 14 will have or have access to smartphones.

6. The California Elections Board reported that as of October 24, 2016, 24.27% of registered voters were registered as ‘No Party Preference’. Assume that is true for the current population of registered voters in California. A random sample of 13 California registered voters is to be selected. What is the probability of selecting at most 5 voters that are registered as ‘No Party Preference’?

7. In the report ‘Sexual Activity and Contraceptive Use Among Teenagers in the United States: 2011-2015’, the CDC reported that 42% of female teens aged 15-19 reported having ever had sex. Assume that the report is an accurate depiction of the current population of female teens aged 15-19. What is the probability that in a random sample of 14 female teens aged 15-19, more than 8 will have ever had sex?

8. www.ptsd.va.gov estimates that 30% of Vietnam Veterans have had Post Traumatic Stress Disorder (PTSD) in their lifetime. Assume this is true. What is the probability that in a random sample of 9 Vietnam Veterans, 3 to 6 will have had PTSD at some point in their lifetime?

9. www.ptsd.va.gov goes on to report that 23% of women who use VA health care reported sexual assault when in the military. Find the probability that in a random sample of 8 women who use VA health care, fewer than 3 will have reported sexual assault.

10. The National Alliance of Mental Illness (NAMI) reports that 18.1% of American adults live with anxiety disorders. If this is true of the current population, find the probability that more than 4 of 12 randomly selected American adults live with anxiety disorders.

11. The National Association of Anorexia Nervosa and Associated Disorders reports that 13% of women over 50 engage in eating disorder behaviors. If this accurately describes the current population of women over 50, find the probability of selecting 4 women over 50 who engage in eating disorder behaviors from a random sample of 13.

12. The American Veterinarian Medical Association (AVMA) reports that 36.5% of American households own at least one dog. A random sample of 9 households is to be selected. Find the probability that fewer than 4 own at least one dog.

13. The American Veterinarian Medical Association (AVMA) reports that 30.4% of American households own at least one cat. A random sample of 12 households is to be selected. Find the probability that fewer than 6 own at least one cat.

14. Your parents are watching you. According to an October 24th, 2018 report by Pew Research, 68% of US adults use Facebook. 7 randomly selected US adults are asked about Facebook. What is the probability that at least 5 use Facebook?

15. Drunk drivers accounted for 29% of all fatalities on American roads in 2015 according to finder.com. If the percentage is still the same, find the probability that in a random sample of 17 fatalities on American roads, at most 4 are attributed to drunk drivers.

16. In the report ‘Parents in Prison and Their Minor Children’ (US Department of Justice, 2008), it is reported that 54% of Americans in federal prisons were parents of minor children. Find the probability that more than 4 of 10 randomly selected Americans in federal prisons are parents of minor children.

17. 61% of Americans wear visual aids occasionally according to a CBS report from September 20, 2013. Assume the current population is the same. Determine the probability that 2 to 6 of a random sample of 9 Americans wear visual aids occasionally.

18. An article from the Huffington Post from May 25, 2011 reports that 18% of Americans think the Sun revolves around the Earth. Proceed as if this is true currently. What is the probability that in a random sample of 15 Americans, 3 to 6 think the Sun revolves around the Earth?

19. Go online and find a site that gives the percent of a large population that has some characteristic. Write up a binomial probability problem and solve.

5.3 The Hypergeometric Distribution

In the preceding section we looked at the binomial distribution. In this section we look at a similar distribution, the hypergeometric distribution. The main distinction between the binomial and the hypergeometric experiment is that in a binomial experiment the probability of success remains constant whereas in a hypergeometric experiment the probability changes from one selection to the next. Let us begin with an example.

Example 5.3.1.

A box contains 12 light bulbs. Four of the bulbs are burned out. You randomly select 3 bulbs. Let X be the number of burned out bulbs in a random sample of 3 bulbs. Find the probability of getting 1 burned out bulb.

Solution.

To find the required probability for this problem we need to count the number of ways to get the different possible outcomes.

First off, how many ways can we pick 3 bulbs? We are picking 3 items out of 12; this means either a combination or permutation. Since order doesn’t matter, we want a combination. So we have $\binom{12}{3} = 220$. Recall that to get the probability distribution we need to determine all possible values the random variable can take on and the associated probabilities.

When we say we have one defect (X = 1) what we really mean is: one bulb is defective and 2 are not defective (remember, we are picking 3 bulbs). Now let’s count the number of ways to do this. First, count the number of ways of picking the one defective. This is $\binom{4}{1} = 4$. The number of ways of picking the 2 good bulbs is $\binom{8}{2} = 28$ (if 4 are burned out then the rest, 8, are good). For each way to pick 1 burned out bulb there are 28 ways to pick the remaining 2 good bulbs. To determine the total number of ways of picking 1 burned out and 2 good bulbs we multiply together and get 4 × 28 = 112. Finally, the probability is 112/220 = .5091.

Let us now formally define a few things.

A hypergeometric experiment must meet the following conditions:

A sample of n items is selected at random from a population, without replacement.

The population consists of N items.

The population consists of only ‘successes’ and ‘failures’.

There are r ‘successes’ in the population.

If we let X be the number of ‘successes’ in a hypergeometric experiment then we get a hypergeometric random variable and we write

X ∼ H(N, r, n)

To calculate the probability we use the formula


$$P(X = k) = \frac{\binom{r}{k}\binom{N-r}{n-k}}{\binom{N}{n}}$$

r is the total number of successes

k is the number of successes you want the probability of

n− k is the number of failures you want the probability of

N − r is the total number of failures

N is the population size

n is the sample size
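For readers who would like to check a hypergeometric probability with software, the formula can be sketched in Python. This is only an illustration and not one of the calculator steps used in this text. The function name hypergeom_pmf is our own invention, and the sketch assumes Python 3.8 or later so that math.comb is available for the combinations.

```python
from math import comb

def hypergeom_pmf(k, N, r, n):
    """P(X = k) for X ~ H(N, r, n).

    N: population size, r: successes in the population,
    n: sample size, k: successes wanted in the sample.
    (Hypothetical helper name, chosen for this sketch.)
    """
    # ways to pick k successes * ways to pick n - k failures,
    # divided by the ways to pick any n items
    return comb(r, k) * comb(N - r, n - k) / comb(N, n)

# Example 5.3.1: 12 bulbs, 4 burned out, sample of 3, exactly 1 burned out
p = hypergeom_pmf(1, N=12, r=4, n=3)
print(round(p, 4))  # 112/220, which rounds to 0.5091
```

Changing the four arguments lets the same function check the other examples and exercises in this section.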

Example 5.3.2.

An elementary school has 13 teachers. Seven of the teachers have CPR certifications. Five teachers from the school are to be randomly selected to attend a conference. Find the probability that three of the selected teachers have CPR certifications.

Solution.

Let’s begin by noting that the ‘population’ (the 13 teachers) falls into two categories: they have CPR certification or they don’t. We are picking 5 of them without replacement. This sounds like either binomial or hypergeometric. To determine which one, we need to ask ourselves one question: we are picking five teachers out of how many? If the population were large relative to the sample then the binomial would be appropriate, but 13 isn’t large. So we use the hypergeometric distribution.

Let X = the number of teachers with CPR certification in a random sample of 5 teachers.

X ∼ H(13, 7, 5)

P(X = 3)

$$P(X = 3) = \frac{\binom{7}{3}\binom{6}{2}}{\binom{13}{5}} = \frac{35 \times 15}{1287} = .4079$$

The formula for hypergeometric probabilities gives the probability that X equals a single number. If we have an inequality we will need to apply the formula more than once. This is illustrated in the following example.

Example 5.3.3.

Twenty-one chickens have just hatched. Twelve of the newly hatched chicks are female. Four chicks are to be selected. Find the probability that at most 2 are female.


Solution.

We are picking 4 items without replacement out of 21 and we want the probability of at most 2 ‘successes’. Hypergeometric is appropriate.

Let X = the number of females selected in a random sample of 4 chicks.

X ∼ H(21, 12, 4)

P(X ≤ 2)

$$P(X \le 2) = P(X = 0) + P(X = 1) + P(X = 2)$$

$$= \frac{\binom{12}{0}\binom{9}{4}}{\binom{21}{4}} + \frac{\binom{12}{1}\binom{9}{3}}{\binom{21}{4}} + \frac{\binom{12}{2}\binom{9}{2}}{\binom{21}{4}}$$

$$= \frac{1 \times 126}{5985} + \frac{12 \times 84}{5985} + \frac{66 \times 36}{5985} = .5865$$
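Since an ‘at most’ probability means adding several point probabilities, a short loop can save some arithmetic. The following is a rough Python sketch under our own assumptions: the helper names are hypothetical, math.comb requires Python 3.8 or later, and the population size is taken to be N = 21, the value that appears in the denominators of the solution above.

```python
from math import comb

def hypergeom_pmf(k, N, r, n):
    """P(X = k) for X ~ H(N, r, n) (hypothetical helper name)."""
    return comb(r, k) * comb(N - r, n - k) / comb(N, n)

def hypergeom_at_most(k, N, r, n):
    """P(X <= k): add the point probabilities for 0, 1, ..., k."""
    return sum(hypergeom_pmf(j, N, r, n) for j in range(k + 1))

# Example 5.3.3: 21 chicks, 12 female, sample of 4, at most 2 female
p_at_most_2 = hypergeom_at_most(2, N=21, r=12, n=4)
print(round(p_at_most_2, 4))  # 3510/5985, which rounds to 0.5865

# An "at least" question can go through the complement:
# P(X >= 3) = 1 - P(X <= 2)
p_at_least_3 = 1 - p_at_most_2
print(round(p_at_least_3, 4))
```

The complement on the last lines is often the quicker route for ‘at least’ exercises, since it reuses the same sum.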

5.3.1 Exercises

1. Let X ∼ H(12, 5, 4). Find P(X = 3).

2. Let X ∼ H(20, 8, 5). Find P(X = 2).

3. Let X ∼ H(9, 4, 4). Find P(X ≤ 3).

4. Let X ∼ H(16, 12, 5). Find P(X ≥ 4).

For the following problems address all four steps used to solve probability problems.

5. A teacher has a box which contains 8 whiteboard markers. Three of them are dried out. The teacher randomly selects 4 markers. What is the probability that 2 of the markers selected are dried out?

6. At a party, the hosts have put out a bowl of cookies for their guests. Five of the cookies have been made with organic ingredients and the remaining 8 have been made with non-organic ingredients. A partygoer randomly selects 5 cookies to take home. What is the probability that the guest gets 3 organic cookies?

7. A candy lover is also allergic to peanuts. Before our candy lover is a plate which contains 20 pieces of fudge. Four of the pieces of fudge have traces of peanuts in them. Our candy fan is planning on eating four pieces of fudge. What is the probability of the candy lover having an allergic reaction to peanuts?

8. At an NHL game, there are 25 pucks. Unfortunately, 9 of the pucks are not of the proper weight. Seven pucks are to be randomly selected. Find the probability of getting 2 pucks that are not of the proper weight.

9. A case of beer has a variety of types: 6 IPAs, 4 stouts, and 2 lagers. If a beer lover were to randomly select 5 beers, what is the probability of getting 2 IPAs?


10. An appliance store has 19 refrigerators. Six of the refrigerators have a stainless steel front. What is the probability that of 10 randomly selected refrigerators at most 2 will have stainless steel fronts?

11. A car dealer tells the assistant to park some of the new models in front of the dealership. The assistant plans to randomly select 6 car keys to determine which cars to park in front. The dealership has 12 new models from which the assistant can choose. Three of the cars are convertibles. Find the probability that the assistant doesn’t park any convertibles in front.

12. A classroom has 12 students in it. All have strong preferences about which computer they prefer: PC or Apple. Eight are Apple users and 4 are PC users. The teacher has no idea about this and plans to randomly select 12 computers from the storage room. Students then will select one of the computers. In the storage room there are 20 Apple computers and 9 PCs. What is the probability that the teacher will not have any students that are unhappy with what computer they have?

13. A box of light bulbs contains 8 halogen light bulbs and 9 incandescent bulbs. A homeowner will randomly pick 5 of the bulbs to replace some burned out bulbs. Find the probability of getting at most 2 halogen bulbs.

14. While getting ready for an early morning flight a traveler is going to grab 8 loose socks from a drawer that contains 10 black socks and 6 dark blue socks. The blue socks all match and the black socks all match. Unfortunately for the traveler, the electricity is out and the socks all look alike in the dark. Find the probability that the traveler will be able to match the socks without any black/blue combinations.

15. At a traffic school for people that get tickets, there are 21 students: 16 got speeding tickets and the remaining 5 got other traffic violations. For an in-class demonstration, 6 students are to be selected to perform a traffic safety skit. What is the probability that 4 of them got speeding tickets?

16. At a meeting of engineers, 7 are electrical engineers and 9 are mechanical engineers. If a random sample of 3 engineers is to be selected, find the probability of getting at least 2 mechanical engineers.

17. For an article on airline arrival times, a reporter is going to randomly select 11 flights that arrived at an airport during a given hour and report on how late the flights were. During the hour in question, there were 23 flights and 5 were more than 30 minutes late. What is the probability of the reporter selecting at least 2 flights that were more than 30 minutes late?

18. It is time for standardized testing for a student. The student was told to bring a number 2 pencil for the test. The student prefers to have at least 2 pencils when taking a test. The student has a cup on their desk with 12 pencils in it. Although the student thinks they are all number 2 pencils, 5 of them are not. The student is in a hurry and will just grab 4 from the cup. What is the probability that the student will have at least two number 2 pencils?

19. At a veterans’ reunion, there are 5 officers and 12 enlisted. The veterans are challenged to a pick-up game of basketball so they will randomly select 5 of the veterans. What is the probability of selecting no more than 2 officers?


20. While planning a trip, a vacationer takes 15 DVDs with them. Of those taken, 9 are comedies. The vacationer is planning to have a movie-fest and so is going to randomly pick 5 DVDs to watch. Find the probability of selecting at least 1 comedy.

21. In the fall an avid gardener is planting bulbs for spring. The gardener has 5 white bulbs and 7 red bulbs. The gardener is planning on planting 4 bulbs. What is the probability that all of the bulbs will be the same color?

22. A photographer is applying to a program and needs to send in 5 photos. The photographer narrows down the best photos taken: it is down to 8 portraits and 6 panoramas. If the choice is made at random, find the probability that there is at least one of each type.

23. At a costume party there are 15 people in costume and 4 that aren’t in costume. What is the probability that in a random sample of 6 partiers at least 4 are in costume?

24. The dog park in a park is a popular spot. On a specific day there are 12 dogs present, each with a different owner. Of the dogs present, 7 have been spayed or neutered. A representative from a local animal group is also at the park and has information about spaying/neutering pets. The representative only has time to talk to 4 randomly selected pet owners. What is the probability of selecting at most 1 owner whose dog has already been spayed/neutered?


5.4 The Poisson Probability Distribution

The final discrete distribution we discuss is the Poisson probability distribution. The distribution is named after Siméon Poisson, a French mathematician. The distribution deals with the number of occurrences in a time or space interval.

Let X be the random variable that represents the number of occurrences in a time or space interval where the occurrences are random and independent. Then X is a Poisson random variable and we write X ∼ P(λ).

We use a script P for the distribution so we don’t confuse it with the P used for probability. It would be a good idea at this point to look at some examples.

Consider the number of pieces of space debris that fall to Earth in a given week. The reader may be unaware, but there are a great number of objects orbiting the Earth. It is estimated that an average of 5 objects fall to the Earth each week.

It is reasonable to assume that the pieces of debris fall independently of one another. That is, an object falling does not cause another object to fall or not fall. That is not to say that they cannot be dependent. It is just that most of the time they will be independent. Our model here, like all models, is not going to match reality 100%. But it will be close enough. As for randomness, the objects could hit the Earth at any time. If X is the number of pieces of debris falling to Earth in a randomly selected week then X would be a Poisson random variable and we would write X ∼ P(5).

Emergency room visits are another example of a Poisson random variable. Suppose an average of 8.2 people come into the emergency room to be seen between 2 and 3 in the afternoon. Let X be the number of people that come into the emergency room to be seen during that time interval. In this case, when people arrive at the emergency room is random: you go to the ER when you are in need of stitches, diagnosis for an injury, etc. Also, the events are independent of one another. (This could be violated if a car accident resulted in two people needing to go to the ER, but that is not very likely so we won’t let it concern us.) We would write X ∼ P(8.2).

Don’t you just love junk mail? The amount of junk mail you receive in a randomly selected day can be modeled as a Poisson random variable. Assume you receive 2.1 pieces of junk mail each day, on average. It is reasonable to assume that the pieces of junk mail you get are independent of one another and that when they hit the mail stream prior to you getting them is random. In this case if X is the number of pieces of junk mail in a randomly selected day then we would write X ∼ P(2.1).

As with the binomial distribution we can either use a formula or our calculators to find a probability.

If X ∼ P(λ), then

$$P(X = k) = \frac{e^{-\lambda}\lambda^{k}}{k!}$$

λ is the average number of occurrences in the interval.


Your calculator will also give these probabilities using the poissonpdf and poissoncdf commands:

For single probabilities: P(X = k) = poissonpdf(λ, k).

For cumulative probabilities: P(X ≤ k) = poissoncdf(λ, k).
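If a calculator is not at hand, the two commands can be imitated in a few lines of Python. This is a sketch under our own assumptions and not a description of how any calculator works internally: the names poisson_pmf and poisson_cdf are hypothetical, and the argument order (λ first, then k) simply mirrors the calculator commands above.

```python
from math import exp, factorial

def poisson_pmf(lam, k):
    """P(X = k) for X ~ P(lam), using e^(-lam) * lam^k / k!"""
    return exp(-lam) * lam**k / factorial(k)

def poisson_cdf(lam, k):
    """P(X <= k): add the point probabilities for 0, 1, ..., k."""
    return sum(poisson_pmf(lam, j) for j in range(k + 1))

# Junk mail model from above: an average of 2.1 pieces per day
lam = 2.1
print(round(poisson_pmf(lam, 0), 4))  # probability of no junk mail in a day
print(round(poisson_cdf(lam, 2), 4))  # probability of at most 2 pieces

# "More than" questions use the complement: P(X > 2) = 1 - P(X <= 2)
print(round(1 - poisson_cdf(lam, 2), 4))
```

Since the Poisson random variable has no largest possible value, ‘at least’ and ‘more than’ probabilities are always found through the complement, as on the last line.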

Example 5.4.1.

The average number of battery fires for a popular cell phone is 2.3 fires per week. Find the probability that there will be 4 battery fires for this phone in a randomly selected week.

Solution.

We are looking at the number of occurrences (an occurrence is a phone catching fire) in a time interval (one week). It is reasonable to assume that the incidents are independent of one another and that they could occur at any time during the week. So the Poisson distribution is appropriate.

1. Our random variable X is the number of occurrences in a time interval. In this example we get X = number of cell phone batteries that catch fire in a randomly selected week.

2. We have decided that the Poisson distribution is appropriate so we have X ∼ P(2.3).

3. We want the probability of 4 fires so we write P(X = 4).

4. Lastly, we can get the probability using the formula or our calculator. So we have

$$P(X = 4) = \frac{e^{-2.3}\,2.3^{4}}{4!} = .1169$$

So in about 12% of all weeks there will be 4 cell phones that catch fire.

Example 5.4.2.

For each large truckload of earth that is excavated from a mine, an average of 1.4 diamonds of gem quality are found. Find the probability that in a randomly selected large truckload of earth there will be at most 2 diamonds of gem quality found.

Solution.

Unlike the last example, in this example we are looking at the number of occurrences in a space interval. We still will apply the Poisson probability distribution. It is reasonable to assume that the diamonds are randomly distributed and hence that the diamonds mined will be independent of one another. They could occur anywhere in the truckload (randomness).

1. Let X = number of gem quality diamonds found in a randomly selected large truckload.

2. X ∼ P(1.4)

3. P(X ≤ 2)

4. Lastly, we can get the probability using our calculator. So we have poissoncdf(1.4, 2) = .8335.


5.4.1 Approximating the Binomial Distribution with the Poisson Distribution

As we look at the problems above and those in the exercises, we might notice that some of the problems look a lot like the binomial problems we have seen in the past. This is because one of the uses of the Poisson distribution is to approximate a binomial distribution. We can do so when the probability of success, p, is very small and the number of trials, n, is very large. If you examine the cell phone fire example from above, it is really a binomial problem: a randomly selected cell phone will catch fire or it won’t, and so on. In this case, the probability of a cell phone catching fire, p, is extremely small (and unknown) but the number of cell phones used, n, is very large (also unknown). The mean of this binomial distribution would be np, a very large number times a very small number, yielding 2.3 fires per week. This is a common situation: n and p are unknown but we know what np is. As you proceed with the exercises, look for the problems which are really binomial.
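To see how close the approximation can be, we can compare the two distributions numerically. The sketch below is purely illustrative: the value n = 1,000,000 is made up (the true number of phones is unknown, as noted above), p is chosen so that np = 2.3, and the function names are our own.

```python
from math import comb, exp, factorial

def binomial_pmf(n, p, k):
    """P(X = k) for a binomial with n trials and success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(lam, k):
    """P(X = k) for a Poisson with mean lam."""
    return exp(-lam) * lam**k / factorial(k)

# Hypothetical values: n is invented for illustration, p is set so np = 2.3
n = 1_000_000
p = 2.3 / n

# Print k, the binomial probability, and the Poisson probability side by side
for k in range(7):
    print(k, round(binomial_pmf(n, p, k), 4), round(poisson_pmf(n * p, k), 4))
```

With n this large and p this small the two columns should agree to the four decimal places shown in nearly every row, which is why knowing np alone is enough to work the problem.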

5.4.2 Exercises

1. Let X ∼ P(1.4). Find P(X = 2).

2. Let X ∼ P(3.6). Find P(X = 3).

3. Let X ∼ P(3.2). Find P(X ≤ 3).

4. Let X ∼ P(5.1). Find P(X ≥ 4).

5. Let X ∼ P(2.7). Find P(2 ≤ X ≤ 5).

6. Let X ∼ P(4.3). Find P(3 ≤ X ≤ 7).

For the following problems address all four steps used to solve the problems.

7. On average, there are 5.2 broken bones seen at a local hospital’s emergency room in a given week. Find the probability that in a randomly selected week there will be 4 broken bones seen in the emergency room.

8. At a large assembly line, production needs to be shut down unexpectedly an average of 2.6 times per day. What is the probability that the line will be shut down 4 times in a randomly selected day?

9. It is raining at a fairly steady rate when you notice that an average of 5.6 raindrops fall in a bird bath each second. Find the probability that you will observe at most 8 raindrops in the bird bath in a randomly chosen second.

10. At a large orange orchard, a farmer grows seedless oranges. The variety isn’t truly seedless: a very rare few have seeds. The farmer notes that during harvest, an average of 5.9 oranges are found each day that have seeds. What is the probability that the farmer will find at most 4 oranges with seeds in a randomly selected day?

11. A high energy particle detector detects an average of 9.4 particles per day. What is the probability that there will be more than 8 particles detected on a randomly selected day?


12. A rare tumor is diagnosed an average of 7.4 times per week. Find the probability that the tumor will be diagnosed more than 8 times in a randomly selected week.

13. An entomologist has set traps to collect bugs. On average, there are 2.1 uncommon flies that are found in the traps each week. What is the probability that in a week there will be 2 to 6 uncommon flies found?

14. A contractor specializes in concrete work. For each 1000 square feet of concrete laid, an average of 1.6 cracks form. What is the probability that in a randomly selected 1000 square feet poured, there will be 2 to 5 cracks formed?

15. A manufacturer of glass extrudes the glass into a continuous sheet. Although the manufacturer tries to avoid detectable air bubbles, an average of 2.6 bubbles are detected for each 100 feet of glass. Determine the probability of finding more than 3 air bubbles in a 100 foot long stretch of glass.

16. A nurse administers flu shots to patients. Although negative reactions to the flu shot are rare, the nurse has an average of 2.3 patients per month exhibit a fever after getting the shot. What is the probability that in a randomly selected month more than 4 patients will get a fever after getting the shot?

17. The Empire State Building is hit by lightning an average of 23 times a year. What is the probability that the building will be hit by lightning at least 25 times in a randomly selected year?

18. California is hit by an average of 6.4 tornadoes each year. What is the probability that there will be at least 5 tornadoes in a randomly selected year?

19. A copy machine that gets a lot of use jams an average of 1.2 times for each ream of paper used. What is the probability that there will be no jams if one ream of paper is used?

20. It has been estimated that about 5 people are fatally attacked by sharks each year. Assuming this is true, what is the probability that there will be at least 7 fatal shark attacks in a randomly selected year?

21. At 9 am each weekday, a maintenance worker comes to work and replaces any light bulbs on a large lighted display that are burned out. On average, 1.3 light bulbs burn out each day. The maintenance worker is the only one that replaces the bulbs and only works Monday through Friday. What is the probability that when the worker comes to work on Monday, there will be at most five bulbs that are burned out? What is the probability that no bulbs will burn out while the worker is there if the worker leaves at 5 pm?

22. A secretary has an old computer. On average, it freezes up 1.8 times per day. The secretary works Monday through Friday. What is the probability that the computer will freeze at most 5 times during a work week?


Chapter 6

Continuous Probability Distributions


6.1 Continuous Probability Distributions

For discrete probability distributions we were able to list all possible values of the random variable and the corresponding probabilities. We cannot do this with continuous probability distributions since there are an uncountable number of possible outcomes. Instead, we will equate probabilities with areas under a curve.

Let X be a continuous random variable. Then P(a < X < b) is the area under the curve of the probability density function of X between a and b.

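[Figure: curve of a probability density function over the X axis, with the region between a and b shaded to represent P(a < X < b).]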

This being the case, the total area under the curve of the probability density function is one. (This is analogous to the sum of the probabilities for a discrete random variable being one.)

Since the area represents the probability, the probability of a continuous random variable taking on a given single value is zero: P(X = a) = 0. This being the case, whether or not the inequalities include ‘=’ doesn’t matter: P(a < X < b) = P(a ≤ X ≤ b). This is not true for discrete probability distributions.

Example 6.1.1.

Consider the graph below which is the graph of the probability density function of X. Find P(1.2 < X < 4.1).

[Figure: graph of the probability density function of X over an X axis labeled from −1 to 7.]


Solution.

First note that the area under the curve is going to represent the probability.

[Figure: the same graph with the region between X = 1.2 and X = 4.1 shaded.]

Since the region is a rectangle, to get the probability we simply need to multiply the width of the base by the height. The width of the rectangle is 2.9 (= 4.1 − 1.2). The height is not indicated. However, we know from above that the total area under the curve has to be one. Since the width of the base of the whole density is five, the height of the rectangle is 1/5 = .2. So the area of the shaded region is 2.9 × .2 = .58. If you look at the graph this is reasonable: the shaded area looks like it is a little bit more than half of the total area.
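The same width-times-height reasoning can be written as a short Python sketch. The endpoints of the flat density are an assumption on our part: the solution only states that the base has width five, so the values 1 and 6 below are hypothetical, as is the function name.

```python
def flat_density_probability(a, b, left, right):
    """P(a < X < b) when the density is flat over [left, right]
    and zero elsewhere (hypothetical helper for this sketch)."""
    height = 1 / (right - left)           # total area must equal 1
    lo, hi = max(a, left), min(b, right)  # keep only the part over the base
    return max(hi - lo, 0) * height       # width times height

# Assumed endpoints 1 and 6: only the base width of five is given in the text
p = flat_density_probability(1.2, 4.1, left=1, right=6)
print(round(p, 2))  # 2.9 * 0.2, which rounds to 0.58
```

Any other pair of endpoints that is five units apart and contains both 1.2 and 4.1 would give the same answer, since only the width and height matter.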

6.1.1 The Normal Probability Distribution

The most important distribution we will encounter is what is referred to as the normal probability distribution.

A typical graph is given below.

[Figure: a typical normal curve over the X axis.]

Some properties:

The graph is bell-shaped.

The total area under the curve is 1.

The tails continue on to infinity in either direction.

From our earlier investigation, the mean is located at the point about which the graph is symmetric.

If we look at the graphs below they all have the properties listed above. (Take our word about the area being 1.)


[Figure: several normal curves over a common X axis.]

What is clearly different is how spread out the graphs are and where they are located on the axis. Recall that the spread of a distribution is given by the standard deviation or variance. In our work it was more meaningful to use the standard deviation and we will do so here. Once we have the mean and the standard deviation we have everything we need to describe a normal distribution.

For a random variable X that has a normal distribution we will write X ∼ N(µ, σ).

It is easy to locate the mean in a normal probability distribution: it corresponds to the ‘hump’ in the graph. The standard deviation can be readily found as well. To find the standard deviation start with the mean. Next locate the points on the graph where the graph is ‘steepest’. These two points are one standard deviation σ from the mean. See the diagram below.

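[Figure: normal curve over the X axis centered at µ, with the two steepest points marked at a distance σ on either side of µ.]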

6.1.2 The Standard Normal Probability Distribution

To find the probabilities coming up we will need to find the area under a normal curve. If you have taken calculus, you should know that finding the area under this curve is not a trivial problem: the function has no antiderivative that can be written in terms of elementary functions. If you haven’t had calculus, don’t panic. We will do it without having to know any calculus.

The Standard Normal Probability Distribution is the normal probability distribution with a mean of 0 and a standard deviation of 1. We use the letter z for the standard normal probability distribution. We write z ∼ N(0, 1).

To find the areas under the standard normal probability distribution we can use a table of values. We can also use technology. We opt for the table. We will illustrate this with an example.


Example 6.1.2.

Find P (z < 1.25)

Solution.

To find the probability we will use the Standard Normal Probability Distribution Table.

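[Figure: standard normal curve with the area to the left of z = 1.25 shaded, representing P(z < 1.25).]

Let’s look at the table. First notice the graph at the top of the table. It shows the area to the left of the given value, just like what we want. To look up 1.25, go to the row labeled 1.2 and the column labeled .05.

z      .00    ...    .05    ...    .09
0.0
...
1.2                  .8944

So P(z < 1.25) = .8944.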

Example 6.1.3.

Find P (z > 2.14)

Solution.

Let’s start with a picture.

[Figure: standard normal curve with the area to the right of z = 2.14 shaded, representing P(z > 2.14).]


Note that the area we want is above the value. We have two options:

First solution: since the total area under the curve is one we have P(z < 2.14) + P(z > 2.14) = 1, or P(z > 2.14) = 1 − P(z < 2.14). This last probability we can get from the table: P(z > 2.14) = 1 − .9838 = .0162.

Second solution: We will exploit the symmetry in the graph. We have P(z > 2.14) = P(z < −2.14). See the diagram.

[Figure: two standard normal curves side by side; the right tail above z = 2.14, P(z > 2.14), has the same area as the left tail below z = −2.14, P(z < −2.14).]

So we end up with P(z > 2.14) = .0162.

Example 6.1.4.

Find the area under the standard normal distribution from z = −1.62 to z = .94.

Solution.

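The problem is asking for the area. This is the same as P(−1.62 < z < .94).

The graph below demonstrates how we will calculate the probability.

[Figure: three standard normal curves, each marked at z = −1.62 and z = .94, showing the area between the two values as the area to the left of .94 minus the area to the left of −1.62.]

P(−1.62 < z < 0.94) = P(z < 0.94) − P(z < −1.62)

= 0.8264 − 0.0526

= 0.7738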

Using the z-table to find probabilities

To find P(z < a) simply look up a in the table. If a is beyond the table, use either 1 or 0, whichever is appropriate.

To find P (z > a) change the problem to P (z < −a) then proceed as above.


To find P (a < z < b) rewrite as P (z < b)− P (z < a) and look these up as above.
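The three rules above also work with technology in place of the table. The sketch below is one possible way to do this in Python and is not the method used in this text: it relies on the standard identity that the area to the left of a equals (1 + erf(a/√2))/2, where erf is the error function from Python's math module. The function names are our own.

```python
from math import erf, sqrt

def area_left(a):
    """P(z < a): plays the role of a z-table lookup."""
    return 0.5 * (1 + erf(a / sqrt(2)))

def area_right(a):
    """P(z > a), rewritten by symmetry as P(z < -a)."""
    return area_left(-a)

def area_between(a, b):
    """P(a < z < b), rewritten as P(z < b) - P(z < a)."""
    return area_left(b) - area_left(a)

print(round(area_left(1.25), 4))            # Example 6.1.2: 0.8944
print(round(area_right(2.14), 4))           # Example 6.1.3: 0.0162
print(round(area_between(-1.62, 0.94), 4))  # Example 6.1.4: 0.7738
```

Values of a beyond the range of the table need no special handling here, since the function simply returns a number very close to 0 or 1.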

6.1.3 Probabilities of the Non-Standard Normal Distribution

A lot of random variables are normally distributed. This makes the normal distribution one of the most important distributions we will use. Not many distributions that we encounter have a mean of 0 and a standard deviation of 1. In order to solve a non-standard normal probability problem we will need to turn it into a standard normal probability problem. Let us see how to proceed with an example.

Example 6.1.5.

Let X ∼ N(35, 4). Find P (X < 42)

Solution.

Recall that X ∼ N(35, 4) means X is a normal random variable (normal, hence the ‘N’) with mean µ = 35 and standard deviation σ = 4.

Until we are comfortable with this type of problem we will draw a picture.

[Figure: normal curve over an X axis labeled 27, 31, 35, 39, 43, with 42 marked and the region to its left shaded; a z axis below is labeled −2, −1, 0, 1, 2.]

Notice in the diagram that in addition to 42, we have the numbers 27, 31, 35, 39, and 43 labeled. These correspond to µ − 2σ, µ − σ, µ, µ + σ, and µ + 2σ, respectively, for X. Also, the region shaded corresponds to P(X < 42). Below the X axis we have the z-axis indicated with −2, −1, 0, 1, and 2 labeled. Those correspond to µ − 2σ, µ − σ, µ, µ + σ, and µ + 2σ, respectively, for z.

To get the area/probability, we need to determine the value of z and look it up in our table. We can see that z is about 1.75 by just estimating it from the graph. But we want to be able to do this without relying on a graph.

By examining the picture we notice an important fact about the z-score of a data value. For data values that have a z-score of 1, the data value is one standard deviation to the right of the mean; for a z-score of 2 the data value is 2 standard deviations to the right, etc.

The z-score of a data value is given by

$$z = \frac{X - \mu}{\sigma}$$


and represents the number of standard deviations that the data value is from the mean. (A positive z-score means it is to the right of the mean, a negative one to the left.)

This is a formula that will be used quite often throughout the text, along with variations. Let us now continue with the problem. We have

$$P(X < 42) = P\left(z < \frac{42 - 35}{4}\right) = P(z < 1.75) = 0.9599$$

This last number is from the z-table.

Example 6.1.6.

Let X ∼ N(145.3, 26.54). Find P (106.5 < X < 177.6)

Solution.

We will proceed as in the last example by first calculating the corresponding z-scores.

$$P(106.5 < X < 177.6) = P\left(\frac{106.5 - 145.3}{26.54} < z < \frac{177.6 - 145.3}{26.54}\right)$$

$$= P(-1.46 < z < 1.22)$$

$$= 0.8888 - 0.0721 \quad \text{(from the z-table)}$$

$$= 0.8167$$

Finding normal probabilities of

Convert the given values of X to values of z using $z = \frac{X - \mu}{\sigma}$.

Now that the probabilities are in terms of z, use the table as before.
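The two steps in the box (convert to z, then find the area) can also be sketched in Python. This is an illustration under our own assumptions rather than the table method of this text: the function name is hypothetical, and the area is computed with the error function erf from Python's math module instead of being looked up.

```python
from math import erf, sqrt

def normal_area_left(x, mu, sigma):
    """P(X < x) for X ~ N(mu, sigma), where sigma is the standard deviation."""
    z = (x - mu) / sigma                 # step 1: convert x to a z-score
    return 0.5 * (1 + erf(z / sqrt(2)))  # step 2: area to the left of z

# Example 6.1.5: X ~ N(35, 4), P(X < 42); here z is exactly 1.75
print(round(normal_area_left(42, 35, 4), 4))  # 0.9599

# Example 6.1.6: X ~ N(145.3, 26.54), P(106.5 < X < 177.6)
p = normal_area_left(177.6, 145.3, 26.54) - normal_area_left(106.5, 145.3, 26.54)
print(round(p, 4))
```

The second result may differ from the table answer in the last decimal place or two. The table method rounds each z-score to two decimal places before the lookup, while the code keeps the unrounded z-score.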

This is an important skill to master. We will use it in several places throughout the text.

Let us now look at an application of the normal probability distribution.


Example 6.1.7.

A large shipment of apples has just arrived from an orchard. The weights of the apples varyfrom apple to apple but it is known that the weights of apples follows a normal distribution witha mean of 164 grams and a standard deviation of 23.5 grams. What percent of apples weigh lessthan 150 grams?

Solution.

We are given that µ = 164 and σ = 23.5. We want the percent of apples that weigh less than 150 grams. The percent is equivalent to the probability that a randomly selected apple will weigh less than 150 grams. As with the binomial and other probability problems, we will list the four steps.

1. Let X = weight of a randomly selected apple from the shipment.

2. X ∼ N(164, 23.5)

3.
\[
\begin{aligned}
P(X < 150) &= P\left(z < \frac{150 - 164}{23.5}\right) \\
&= P(z < -0.60) \\
&= 0.2743
\end{aligned}
\]

4. 27.43% of apples weigh less than 150 grams.

Example 6.1.8.

A company that produces 4-inch nails uses machines that produce nails with an average length of 4 inches and a standard deviation of 0.12 inches. The company has observed that the lengths of the nails are approximately normally distributed. What percent of 4-inch nails are between 3.8 and 4.2 inches long?

Solution.

1. Let X = length, in inches, of a randomly selected 4-inch nail.

2. X ∼ N(4, 0.12)

3.
\[
\begin{aligned}
P(3.8 < X < 4.2) &= P\left(\frac{3.8 - 4}{0.12} < z < \frac{4.2 - 4}{0.12}\right) \\
&= P(-1.67 < z < 1.67) \\
&= P(z < 1.67) - P(z < -1.67) \\
&= 0.9525 - 0.0475 \\
&= 0.9050
\end{aligned}
\]

4. 90.5% of 4-inch nails are between 3.8 and 4.2 inches long.

Example 6.1.9.


From CDC growth charts, it can be seen that the lengths of newborn girls follow a normal distribution, with the mean length of a newborn girl estimated to be 49.5 cm and a standard deviation of 2.45 cm. Assuming that these estimates are correct, what is the probability that a randomly selected newborn girl will be between 46.1 cm and 53.7 cm long?

Solution.

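1. Let X = length of a randomly selected newborn girl, in cm.

2. X ∼ N(49.5, 2.45)

3.
\[
\begin{aligned}
P(46.1 < X < 53.7) &= P\left(\frac{46.1 - 49.5}{2.45} < z < \frac{53.7 - 49.5}{2.45}\right) \\
&= P(-1.39 < z < 1.71) \\
&= P(z < 1.71) - P(z < -1.39) \\
&= 0.9564 - 0.0823 \\
&= 0.8741
\end{aligned}
\]

4. The probability that a randomly selected newborn girl is between 46.1 cm and 53.7 cm long is 0.8741.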

6.1.4 Adding and Subtracting Independent Normal Random Variables

We now want to look at what happens when we add or subtract two independent normal random variables. By independent, we mean that for two random variables, X1 and X2, the outcome of X1 does not affect the outcome of X2, and vice versa.

For example, let X1 be the amount of garbage picked up each week at a local pizza restaurant in California and X2 the amount of garbage picked up each week at a local pizza restaurant in New York. Let us assume that the restaurant in California has an average of 250 gallons of garbage picked up each week with a standard deviation of 30 gallons. Assume that the restaurant in New York has an average of 230 gallons of garbage picked up each week with a standard deviation of 40 gallons. On average, the restaurant in California has 20 more gallons of garbage than the one in New York and, combined, they throw out a total of 480 gallons. (If the standard deviations were both 0, then it would always be 20, or 480, gallons every week.) It is reasonable that the mean of the difference or sum of two random variables is simply the difference or sum of the means. Some notation:

X1 = the amount of garbage picked up in a randomly selected week at the restaurant in California

X2 = the amount of garbage picked up in a randomly selected week at the restaurant in New York

X1 − X2 = how much more garbage the restaurant in California throws out than the one in New York

We get

\[ \mu_{X_1 - X_2} = \mu_{X_1} - \mu_{X_2} \quad \text{and} \quad \mu_{X_1 + X_2} = \mu_{X_1} + \mu_{X_2} \]

where $\mu_{X_1}$ is the mean of the random variable $X_1$, etc.


Now, let us look at the standard deviation. One way to think about the standard deviation is as a measure of uncertainty. In California, the typical amount of garbage is 250 gallons, give or take around 30 gallons.¹

One important observation to make is that the uncertainty will tend to accumulate, not cancel out. In individual cases the deviations can cancel, but in general they don't. It turns out that when we add (or subtract) two independent random variables, the variances add. So we have the following:

\[ \sigma^2_{X_1 \pm X_2} = \sigma^2_{X_1} + \sigma^2_{X_2} \]

\[ \sigma_{X_1 \pm X_2} = \sqrt{\sigma^2_{X_1} + \sigma^2_{X_2}} \]

Lastly, let us consider the distribution. It also turns out that if we subtract, or add, two independent normal random variables, then the result is also normal.

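Putting it all together:

If X1 ∼ N(µ1, σ1) and X2 ∼ N(µ2, σ2) are independent, then

\[ X_1 - X_2 \sim N\left(\mu_1 - \mu_2,\ \sqrt{\sigma_1^2 + \sigma_2^2}\right) \]

and

\[ X_1 + X_2 \sim N\left(\mu_1 + \mu_2,\ \sqrt{\sigma_1^2 + \sigma_2^2}\right) \]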

Example 6.1.10.

At a local pizza restaurant in California, the amount of garbage picked up each week varies from week to week. It is known that the amount of garbage picked up each week is approximately normal with a mean of 250 gallons and a standard deviation of 30 gallons. At a local pizza restaurant in New York the amount is also normal, with a mean of 230 gallons and a standard deviation of 40 gallons. Find the probability that in a randomly selected week the restaurant in California will produce at least 80 gallons more than the restaurant in New York.

Solution.

Define the random variables X1 and X2:

X1 = the amount of garbage picked up in a randomly selected week at the restaurant in California

X2 = the amount of garbage picked up in a randomly selected week at the restaurant in New York

We have X1 ∼ N(250, 30) and X2 ∼ N(230, 40).

¹ For an untypical amount it could be a lot more than 30 gallons.


\[ \mu_{X_1 - X_2} = \mu_{X_1} - \mu_{X_2} = 250 - 230 = 20 \]

\[ \sigma_{X_1 - X_2} = \sqrt{\sigma^2_{X_1} + \sigma^2_{X_2}} = \sqrt{30^2 + 40^2} = 50 \]

Putting this together, we get X1 − X2 ∼ N(20, 50).

1. Let X1 − X2 = how much more garbage the restaurant in California throws out than the one in New York.

2. X1 − X2 ∼ N(20, 50)

3.
\[
\begin{aligned}
P(X_1 - X_2 > 80) &= P\left(z > \frac{80 - 20}{50}\right) \\
&= P(z > 1.20) \\
&= P(z < -1.20) \\
&= 0.1151
\end{aligned}
\]

4. In 11.51% of all weeks, the restaurant in California throws out more than 80 gallons more garbage than the restaurant in New York.
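The rule for combining independent normal random variables can be checked with a short Python sketch. The function names `difference_of_normals` and `sum_of_normals` are hypothetical helpers introduced here for illustration, and `statistics.NormalDist` is assumed as a stand-in for the z-table.

```python
import math
from statistics import NormalDist

def difference_of_normals(mu1, sigma1, mu2, sigma2):
    """Distribution of X1 - X2 for independent normal X1 and X2."""
    sd = math.sqrt(sigma1**2 + sigma2**2)  # variances add, even when subtracting
    return NormalDist(mu1 - mu2, sd)

def sum_of_normals(mu1, sigma1, mu2, sigma2):
    """Distribution of X1 + X2 for independent normal X1 and X2."""
    sd = math.sqrt(sigma1**2 + sigma2**2)
    return NormalDist(mu1 + mu2, sd)

# California: mean 250, sd 30; New York: mean 230, sd 40
diff = difference_of_normals(250, 30, 230, 40)
print(diff.mean, diff.stdev)        # mean 20, standard deviation 50
print(round(1 - diff.cdf(80), 4))   # 0.1151
```

Note that the standard deviation of the difference (50) is smaller than 30 + 40 = 70: it is the variances that add, not the standard deviations.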

6.1.5 Exercises

Find the following probabilities

1. P (z < 2.01)

2. P (z < 1.03)

3. P (z > −2.36)

4. P (z > −1.28)

5. P (z > −5.68)

6. P (z > −8.12)

7. P (−.32 < z < 1.68)

8. P (−3.24 < z < .59)

9. P (−1.06 < z < 3.35)

10. P (−5.66 < z < 1.57)

11. Let X ∼ N(26.3, 12.38), find P (X < 36.9).

12. Let X ∼ N(459.6, 44.68), find P (X > 419.0).

13. Let X ∼ N(−52.3, 15.73), find P (−45.6 < X < −22.4).

14. Let X ∼ N(0.235, 0.0156), find P (X > 0.200).

15. Let X ∼ N(684.5, 47.65), find P (700 < X < 750).


16. Let X ∼ N(10226.4, 51.44), find P (10000 < X < 10300).

17. Let X1 ∼ N(35.6, 12), X2 ∼ N(26.8, 5), find P (X1 −X2 < 1.6).

18. Let X1 ∼ N(2.36, 1.6), X2 ∼ N(4.35, 1.2), find P (X1 −X2 < 0).

19. Let X1 ∼ N(8.23, 3.5), X2 ∼ N(13.26, 3.5), find P (X1 +X2 > 24.1).

20. Let X1 ∼ N(134, 24), X2 ∼ N(165, 20), find P (X1 +X2 > 350).

21. Let X1 ∼ N(26.5, 5.0), X2 ∼ N(33.6, 4.1), find P (44.0 < X1 +X2 < 59.2).

22. Let X1 ∼ N(398, 18), X2 ∼ N(265, 24), find P (650 < X1 +X2 < 700).

23. The speeds of cars on a long stretch of highway follow a normal distribution with a mean of 68.2 mph and a standard deviation of 3.67 mph. What percent of cars are traveling under the posted speed limit of 65 mph?

24. Pistons for an engine are expected to be 4.000 inches. The actual mean is 4.008 inches with a standard deviation of 0.0028 inches, and the pistons follow a normal distribution. What is the probability that a randomly selected piston will be over 4.000 inches?

25. The time customers spend on the phone with a customer service representative for a phone carrier varies from customer to customer. It is known that the average time is 7.68 minutes. The population of times is approximately normally distributed with a standard deviation of 2.39 minutes. The supervisor of the phone call center wants the times to be less than 5 minutes. What percent of all calls last less than 5.0 minutes?

26. The times customers spend on hold waiting to speak to a representative for a phone carrier are known to be approximately normally distributed with a mean of 26.5 minutes and a standard deviation of 9.56 minutes. What percent of all callers wait on hold between 20 and 40 minutes?

27. A model rocket maker has a rocket that will fly to an average of 987.9 feet each time it is sent up into the air. The heights the rocket achieves are normally distributed with a standard deviation of 9.64 feet. If the rocket maker sends up the rocket today, what is the probability that the rocket will fly to over 1000 feet?

28. A commuter drives the car to and from work and that is it. The mileage is kept each week for the car. The mileage varies each week, but the average is 16.8 mpg with a standard deviation of 1.67 mpg. What percent of weeks does the commuter get between 15 and 20 mpg?

29. A comedian does the same routine each night. The time on stage varies due to differing responses from the crowd, the tempo the comedian sets each night, etc. The time is approximately normally distributed with a mean time of 98.7 minutes and a standard deviation of 3.64 minutes. What is the probability that on a randomly selected night, the time on stage will be more than 100 minutes?

30. According to ‘Consumer Expenditures in 2017’, released in 2018 by the US Department of Labor, the average cost to own and operate a car is $9,576 per year. Assume that this is true for the current cost of owning a car and that the costs are normally distributed with a standard deviation of $1,268. What percent of cars cost more than $10,000 per year to operate?


31. According to Wikipedia, the mean math score on the SAT in 2018 was 531. The test is designed so that the distribution will be normal with a standard deviation around 100. (In the early days of the SAT, the scores were scaled each time so the mean was 500 and the standard deviation was 100.) Assume the scores are, as hoped for, normally distributed with a mean of 531 and a standard deviation of 96.4. Your friend reports that they got a 684 on the math section. What is the percentile rank of their score?

32. Mensa is a high IQ society. In order to be a member of Mensa, your IQ must be in the 98th percentile or higher. Your friend just took an IQ test and scored 125. They are bragging that they are eligible to join Mensa. You investigate and find the IQ test they took has a mean of 100 with a standard deviation of 15. Such tests are well known to have a normal distribution. Is your friend eligible for Mensa or not? What is their IQ percentile rank?

33. A soccer ball needs to weigh between 410 and 450 grams. A producer of soccer balls produces soccer balls that have a mean weight of 435 grams with a standard deviation of 9.84 grams. The balls are weighed and, if not in the range, they are discarded; otherwise they are sold. What percent of all balls produced are discarded? If the manufacturer could adjust the weight of the balls produced to any mean desired while keeping the standard deviation the same, what weight should the manufacturer choose?

34. Every day, a quality control inspector pulls a bag of chips off the assembly line and weighs it. The inspector will shut down the production line and recalibrate the machine if the bag weighs either more than 16.3 ounces or less than 16.0 ounces. The machine produces bags of chips with an average of 16.15 ounces with a standard deviation of 0.094 ounces. If the weights of chips in the bags follow a normal distribution, what is the probability that the inspector will shut down the assembly line today? If the inspector shuts down the machine, does the machine need to be adjusted?

35. Go online and find the average of something that you find interesting and fashion the statistic into a problem where you want a greater-than probability. It is not difficult to search for ‘the average. . . ’ and find something. It is much more difficult to find both the mean and the standard deviation, so make up the standard deviation. (The operating a car problem in this section is a good model.)

36. Go online and find the average of something that you find interesting and fashion the statistic into a problem where you want an interval probability. (For example, P(32.5 < X < 56.7), but pick your own numbers.) See the previous problem for tips.

37. Two plants process and bag sugar to be shipped out for industrial use. The first plant fills bags with an average of 100.46 pounds with a standard deviation of 0.16 pounds. The second plant fills bags with an average of 100.23 pounds with a standard deviation of 0.15 pounds. It is known that the amounts dispensed into all bags at both plants are normally distributed. Find the probability that if a bag from each plant is chosen, the bag from the first plant will have at least half a pound more than the bag from the second plant.

38. As part of a physics assignment, classmates are required to construct a scaled down trebuchet (a medieval siege weapon). The first classmate’s trebuchet will project a ball 135 feet with a standard deviation of 3.6 feet. The second classmate’s trebuchet will project the ball 129 feet with a standard deviation of 3.4 feet. The distances from both trebuchets are normally distributed. The first trebuchet is going to send the ball flying. The second trebuchet will then move to where the ball landed and shoot back at the first trebuchet. What is the probability that the ball will either hit the first trebuchet or go past it?

39. At an assembly line, orders are put together for shipping. The process requires only two steps, which are done by different people. The first step takes a mean time of 153 seconds with a standard deviation of 8.95 seconds. The second step takes an average of 123 seconds with a standard deviation of 7.68 seconds. What is the probability that a randomly selected order will take more than 4 minutes? (60 seconds = 1 minute.) Assume the times for both processes are normally distributed.

40. To perform an experiment, a chemist needs 1000 mg of aspirin. The chemist has available a bottle of 500 mg tablets. The tablets don’t have exactly 500 mg each. The amounts of aspirin in the tablets are normally distributed with an average amount of 501.2 mg and a standard deviation of 1.3 mg. What is the probability that if the chemist randomly selects two tablets from the bottle they will have at least 1000 mg of aspirin?


6.2 Finding X When the Probability is Given

In the last section we found probabilities of a normal random variable given the mean and the standard deviation. In this section we will focus on going in the other direction. That is, given a particular probability, we would like to find the value of the random variable that yields that probability.

We start with an example:

Example 6.2.1.

Find the value of z such that the area to the left of z is .9732

Solution.

The problem is very similar to the first problems in the last section. Let us look at the picture.

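[Figure: standard normal curve with the area to the left of an unknown value (marked "?", to the right of 0) shaded; P(z < ?) = .9732]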

Note that we have shaded the left side and the picture looks just like the picture in the standard normal table. The area to the left of 0 is 0.5. Since the area to the left of the value we are looking for is greater than 0.5, the value of z is to the right of the mean (= 0). The difference here is that we already have the area and we need the value of z. When we look at the table, the z values are on the outside of the table and the areas (probabilities) are in the middle of the table. Since 0.9732 is an area, look for it in the area section. The area corresponds to the z-score of 1.93.

[Table excerpt: in the standard normal table, the row for z = 1.9 and the column for .03 intersect at the area .9732.]

We can do the same for areas to the right of a value.

Example 6.2.2.

Find the value of z such that the area to the right of z is .8810


Solution.

Since we want the area to the right and the probability is greater than 0.5, the value we are looking for is to the left of 0. We will indicate the value of z we want as z0.

[Figure: standard normal curve with the area to the right of z0 shaded, where z0 is to the left of 0; P(z > z0) = .8810]

This graph does not look like the graph at the top of our table.

Like all distributions, the total area under the curve is 1. The area to the left of z0 is therefore .1190 (= 1 − 0.8810). Now the problem we started with is just like the last example. Looking at the table, we get z0 = −1.18.

Solution.

We present an alternate solution. If we use the symmetry we get the following graphs.

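[Figure: two standard normal curves. Left: area to the right of z0 shaded, P(z > z0). Right: the mirror image, with the area to the left of −z0 shaded, P(z < −z0).]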

If we use the table, we get −z0 = 1.18, so we have z0 = −1.18.
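Reading the table "inside out" corresponds to evaluating the inverse of the standard normal cumulative distribution. As a minimal sketch, assuming Python's `statistics.NormalDist` in place of the printed table (the helper names are ours, not part of the text):

```python
from statistics import NormalDist

std = NormalDist()  # standard normal: mean 0, standard deviation 1

def z_for_left_area(area):
    """Value z0 with P(z < z0) = area."""
    return std.inv_cdf(area)

def z_for_right_area(area):
    """Value z0 with P(z > z0) = area; the area to the left is 1 - area."""
    return std.inv_cdf(1 - area)

print(round(z_for_left_area(0.9732), 2))    # 1.93  (Example 6.2.1)
print(round(z_for_right_area(0.8810), 2))   # -1.18 (Example 6.2.2)
```

The second function mirrors the first solution of Example 6.2.2: convert the right-hand area to a left-hand area, then look it up.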

In the last section we went from X → z → probability. In this section we will go in the other direction. That is, probability → z → X. We have just seen how to go from the probability to the value of z. We will now look at going from the value of z to the value of X. To do this we will use the same formula as before: $z = \frac{X - \mu}{\sigma}$.

We will illustrate this with an example.

Example 6.2.3.

Let X ∼ N(24.3, 5.96). Find X0 such that P (X < X0) = 0.04

Solution.

For this first one we will draw a graph. To do so, notice that X0 is to the left of the mean, since the area to the left of X0 is .04 and the area to the left of µ = 24.3 is 0.5. How far to the left, we don’t know yet.


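[Figure: normal curve drawn over both an X-axis and a z-axis; X0 lies to the left of the mean 24.3 on the X-axis, and the corresponding z0 lies to the left of 0 on the z-axis.]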

In the graph we have included both the X and z axes. As in the previous problems, we need to find .0400 in the area section. The value is not there. We have two options: we can either interpolate between the values or take the closest value. We will do the latter as a matter of routine, since it is unlikely that a given probability will be in the table. Although .0400 is not there, we find the values .0401 and .0392 in the table. The value .0401 is closer, so we get z0 = −1.75.

Recall that the z-score of a value is the number of standard deviations from the mean. What this means is that X0 is 1.75 standard deviations to the left of the mean. (To the left because the value of z is negative.) This tells us that X0 is 1.75 × 5.96 = 10.43 to the left of the mean. So, X0 is 24.3 − 10.43 = 13.87. Since the mean is given to the nearest tenth, we will round to 13.9.

We now present a slightly different way to do the problem.

Solution.

The last solution was a more visual approach to solving the problem. This approach is more algebraic. We start with our formula.

\[ z = \frac{X - \mu}{\sigma} \]

We can solve for X algebraically and get

\[ X = \mu + z \times \sigma \]

So we get

\[ X = 24.3 + (-1.75) \times 5.96 = 13.87 \]

Again, we would round to 13.9.
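The chain probability → z → X can be sketched in a few lines of Python. This is only an illustration: `x_for_left_area` is a hypothetical helper, and the z value comes from `statistics.NormalDist().inv_cdf` rather than from the closest table entry, so it is not rounded to −1.75.

```python
from statistics import NormalDist

def x_for_left_area(area, mu, sigma):
    """Value x0 with P(X < x0) = area for X ~ N(mu, sigma)."""
    z0 = NormalDist().inv_cdf(area)   # probability -> z
    return mu + z0 * sigma            # z -> X, using X = mu + z * sigma

# Example 6.2.3: X ~ N(24.3, 5.96), P(X < X0) = 0.04
x0 = x_for_left_area(0.04, 24.3, 5.96)
print(round(x0, 1))   # 13.9
```

Because z is not rounded here, the unrounded result differs from 13.87 in the second decimal place, but it rounds to the same 13.9.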

We end this section with an application. We will use the same four steps we used before, with a slight modification because we are not looking for a probability.

Example 6.2.4.

A large shipment of apples has just arrived from an orchard. The weights of the apples vary from apple to apple, but it is known that the weights of the apples follow a normal distribution with a mean of 164 grams and a standard deviation of 23.5 grams. A buyer will purchase only the largest 20% of the apples. What is the weight of the smallest apple the buyer will accept?


Solution.

As in the last section, we are given that µ = 164 and σ = 23.5. We are not looking for a probability. We are looking for a value of X (What is the weight . . .). We now list the four steps.

1. Let X = weight of a randomly selected apple from the shipment.

2. X ∼ N(164, 23.5)

3. P(X > X0) = .20, so P(X < X0) = .80 and z0 = 0.84.

\[ X_0 = 164 + .84 \times 23.5 = 183.74 \]

4. The buyer will only accept apples that weigh 184 grams or more.

Notice that the first two steps are exactly like the example in the previous section. It is common for students to want to let the random variable be the minimum weight. The minimum weight is not a random variable; it is a fixed value that we are trying to find.
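The fixed cutoff in this application can also be computed directly. The sketch below assumes Python's `statistics.NormalDist`; the helper name `cutoff_for_top_fraction` is hypothetical and introduced only for illustration.

```python
from statistics import NormalDist

def cutoff_for_top_fraction(fraction, mu, sigma):
    """Value x0 with P(X > x0) = fraction for X ~ N(mu, sigma)."""
    # The top `fraction` of values lies above the (1 - fraction) quantile.
    return NormalDist(mu, sigma).inv_cdf(1 - fraction)

# Example 6.2.4: apples with mean 164 g and standard deviation 23.5 g
cutoff = cutoff_for_top_fraction(0.20, 164, 23.5)
print(round(cutoff))   # 184
```

The function takes the distribution's parameters and a fraction and returns a single number, which matches the point made above: the cutoff is a fixed value, not a random variable.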

6.2.1 Exercises

Find z0 for each of the following

1. P (z < z0) = .7389

2. P (z < z0) = .9463

3. P (z < z0) = .30

4. P (z < z0) = .15

5. P (z > z0) = .23

6. P (z > z0) = .17

Find the values of X0 in the following.

7. Let X ∼ N(23.68, 4.689) and P (X < X0) = .95

8. Let X ∼ N(126.8, 16.59) and P (X < X0) = .62

9. Let X ∼ N(1.23, 0.264) and P (X > X0) = .43

10. Let X ∼ N(2.6, 1.29) and P (X > X0) = .025

11. Let X ∼ N(6.54, 3.65) and P (X < X0) = .01

12. Let X ∼ N(1246, 298) and P (X < X0) = .28

Solve the following problems using the four steps to solve probability problems.

13. Heights of adult women in the United States are approximately normally distributed with a mean of 163.2 cm and a standard deviation of 4.05 cm. 80% of adult women in the United States are shorter than what height?


14. Heights of adult men in the United States are approximately normally distributed with a mean of 176.8 cm and a standard deviation of 4.48 cm. 15% of adult men in the United States are shorter than what height?

15. At an egg ranch, the sizes of the eggs vary. The average weight of the eggs is 56.5 grams with a standard deviation of 6.52 grams. Find the weight of an egg in the 95th percentile if the weights of the eggs are approximately normally distributed.

16. The annual rainfall in Watsonville is approximately normally distributed with a mean of 20.03 inches and a standard deviation of 6.97 inches. What annual rainfall represents the 99th percentile?

17. The lengths of newborn babies are known to be normally distributed with a mean length of 50.1 cm and a standard deviation of 2.45 cm. A baby that is in the 99th percentile is how long?

18. The amount of beer dispensed into kegs is approximately normal. The brew master informs us that the average amount of beer in all kegs is 15.64 gallons with a standard deviation of 0.135 gallons. A keg is selected and the brew master states that only 3% of kegs have more beer than the selected keg. How much beer is in the keg?

19. A city is reassessing the speed limit on a stretch of road. They observe that cars drive on average 56.4 mph with a standard deviation of 3.97 mph. The speeds are also observed to be approximately normally distributed. The speed limit is to be set so that only 20% of the current drivers drive faster than the limit. Find the speed limit.

20. A vintner has a vineyard that produces an average of 15,264 pounds of grapes each year with a standard deviation of 649 pounds. The annual yields are approximately normally distributed. The vintner is heard stating that this year is a bumper crop: in only 2% of years is there a better yield. What was the yield this year?

21. At a bottling plant, maple syrup is put into 16-ounce jars. The dispensing machine does not put exactly the same amount of syrup in each bottle; the amounts vary. The machine can be adjusted so that the average amount is whatever is required. The standard deviation is always 0.26 ounces. If the mean is set at 16 ounces, half of the bottles will have less than 16 ounces, which is totally unacceptable. What should the mean be set at so that 99% of all bottles have at least 16 ounces of syrup?

22. A homeowner has just received a notice from the electric company. It is telling them that their energy usage for the month was more than that of 92% of residents in the city. It goes on to say that the average amount of energy used was 895 kWh. The homeowner’s usage was 987 kWh. If the energy usage for homes is approximately normally distributed, find the standard deviation of the monthly usage.


6.3 Normal Approximation to the Binomial Distribution

In previous sections we have approximated one distribution with another. For example, if we have a hypergeometric distribution where n is small and N is big, we use the binomial distribution as an approximation. In this section we will approximate the binomial distribution with the normal distribution. This is significantly different than the previous approximations. In the approximations before, the original distribution and the approximating distribution were both discrete. Here we are approximating a discrete random variable with a continuous random variable. We will need to adjust our probabilities accordingly. The details will come shortly.

Before we start approximating the binomial distribution with a normal distribution, let us take a look at a binomial distribution.

[Figure: bar graph of P(X) against X, for X from 0 to 40, where X ∼ B(40, .4)]

Not perfect, but a reasonable likeness to the bell curve. In the graph the heights are the probabilities of the individual values of the random variable, X. We can see that the mean is around 16 and the standard deviation is around 3 or so. We need to be more precise. Recall the following formulas for the binomial distribution:

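\[ \mu = np \quad \text{and} \quad \sigma = \sqrt{npq}, \quad \text{where } q = 1 - p \]

We can find the mean and standard deviation:

\[ \mu = np = 40 \times .4 = 16 \quad \text{and} \quad \sigma = \sqrt{npq} = \sqrt{40 \times .4 \times .6} = \sqrt{9.6} \approx 3.1 \]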

If we look at the graph, the heights of the bars are the probabilities. Also note that the bars have widths of one each. This tells us that the area of each bar is the associated probability. Let us look at an example.

Example 6.3.1.

If X ∼ B(40, .4) estimate P (13 ≤ X ≤ 21) with the normal distribution.

Solution.

Let us examine the graph further. We include a graph of the normal distribution with the mean and standard deviation just calculated.

[Figure: bar graph of P(X) for X ∼ B(40, .4), shown for X from 8 to 24, with the normal curve Y ∼ N(16, √9.6) drawn over it and the bars from 13 through 21 shaded]


Notice that the area under the binomial bars is pretty close to the corresponding area under the normal curve. We want to distinguish the two distributions. We will use X to denote the binomial random variable and Y the approximating normal random variable. We have X ∼ B(40, .4) and Y ∼ N(16, √9.6) (calculations above).

What we need to determine are the values of Y. Each bar is one unit wide and centered on its value of X, so the bar for X = 13 starts at 12.5 and the bar for X = 21 ends at 21.5. From the graph we can see that the shading begins at 12.5 and ends at 21.5. The ±.5 is what is called a continuity correction factor. It is required whenever we approximate a discrete distribution (here binomial) with a continuous distribution (here normal).

So what we have is P(13 ≤ X ≤ 21) ≈ P(12.5 < Y < 21.5). To determine the latter probability, we simply use the techniques from before: convert to z-scores, then look up the areas in the table.

\[
\begin{aligned}
P(12.5 < Y < 21.5) &= P\left(\frac{12.5 - 16}{\sqrt{9.6}} < z < \frac{21.5 - 16}{\sqrt{9.6}}\right) \\
&= P(-1.13 < z < 1.78) \\
&= .9625 - .1292 \\
&= .8333
\end{aligned}
\]

(The actual probability is .8323. We have a pretty good approximation.)
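The comparison between the exact binomial probability and the normal approximation can be reproduced with a short Python sketch. The function names are hypothetical, the exact value is computed by summing binomial probabilities with `math.comb`, and `statistics.NormalDist` is assumed in place of the z-table (so z is not rounded to two decimal places).

```python
from math import comb, sqrt
from statistics import NormalDist

def binom_prob_between(n, p, lo, hi):
    """Exact P(lo <= X <= hi) for X ~ B(n, p)."""
    q = 1 - p
    return sum(comb(n, k) * p**k * q**(n - k) for k in range(lo, hi + 1))

def normal_approx_between(n, p, lo, hi):
    """Approximate P(lo <= X <= hi) using Y ~ N(np, sqrt(npq))."""
    y = NormalDist(n * p, sqrt(n * p * (1 - p)))
    # Continuity correction: widen the interval by 0.5 on each side.
    return y.cdf(hi + 0.5) - y.cdf(lo - 0.5)

# Example 6.3.1: X ~ B(40, .4), P(13 <= X <= 21)
exact = binom_prob_between(40, 0.4, 13, 21)
approx = normal_approx_between(40, 0.4, 13, 21)
print(f"exact binomial:       {exact:.4f}")
print(f"normal approximation: {approx:.4f}")
```

Dropping the ±0.5 in `normal_approx_between` and rerunning is a quick way to see how much the continuity correction improves the estimate.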

At this point, using the normal approximation seems to have no advantage over using the binomial directly. Using the binomial directly gives the actual answer, not an approximation, and it is faster and easier than using the approximation. So why do it this way? If all we wanted was to approximate the binomial, this would be simply an interesting diversion and would most likely be omitted. The advantage is to come: if we can think of a binomial distribution as being approximately normal, then we will be able to apply inferential techniques that require a normal distribution, which will be developed later on in the text. Be patient.

Can we always use the normal distribution to approximate the binomial distribution? In a word, no. But the binomial distribution will be close enough to normal if n is not too small and p is not too close to 0 or 1. A general rule of thumb is that we can use the normal approximation if np > 5 and nq > 5. Nothing magical happens when they both hit 5. As np and nq get bigger, the approximation gets better. Empirically, it has been decided that 5 is probably big enough, although some texts may require 10. For most of the problems we are doing, np and nq will typically be much larger than 5, so it is a non-issue.
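The rule of thumb is easy to express as a small check. The sketch below is only an illustration; the function name `normal_approx_ok` and its `threshold` parameter are our own, with the default of 5 taken from the rule above.

```python
def normal_approx_ok(n, p, threshold=5):
    """Rule of thumb: use the normal approximation if np and nq both exceed threshold."""
    q = 1 - p
    return n * p > threshold and n * q > threshold

print(normal_approx_ok(40, 0.4))                 # True:  np = 16, nq = 24
print(normal_approx_ok(20, 0.1))                 # False: np = 2
print(normal_approx_ok(40, 0.4, threshold=10))   # True under the stricter rule too
```

Passing `threshold=10` corresponds to the stricter requirement that some texts use.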

Let us look at an application.

Example 6.3.2.

According to the CDC, 8.1% of adults have asthma. Assume this is true. What is the probability that in a random sample of 200 adults, more than 20 will have asthma? What is the probability that exactly 15 will have asthma?

Solution.

We start as we did before when we did binomial problems.

1. Let X = number of people that have asthma in a random sample of 200 adults.

2. X ∼ B(200, .081)


3. (a) P(X > 20)

(b) P(X = 15)

Let us now convert to the normal:

\[ \mu = 200 \times .081 = 16.2, \qquad \sigma = \sqrt{200 \times .081 \times .919} = \sqrt{14.8878} \]

(Leave σ as a radical to avoid rounding errors.)

1. Y = normal approximation to X (note: it is always this in this section)

2. Y ∼ N(16.2, √14.8878) (np and nq are both greater than 5)

3. (a)
\[
\begin{aligned}
P(Y > 20.5) &= P\left(z > \frac{20.5 - 16.2}{\sqrt{14.8878}}\right) \\
&= P(z > 1.11) \\
&= P(z < -1.11)
\end{aligned}
\]

(b)
\[
\begin{aligned}
P(14.5 < Y < 15.5) &= P\left(\frac{14.5 - 16.2}{\sqrt{14.8878}} < z < \frac{15.5 - 16.2}{\sqrt{14.8878}}\right) \\
&= P(-.44 < z < -.18) \\
&= P(z < -.18) - P(z < -.44)
\end{aligned}
\]

4. (a) .1335

(b) .4286 − .3300 = .0986
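Both parts of this example follow the same pattern: state the event for X, widen it by 0.5 for the continuity correction, and evaluate it for Y. A minimal Python sketch of that pattern follows; the helper names are hypothetical and `statistics.NormalDist` is assumed in place of the z-table, so the results will differ slightly from the table-based answers.

```python
from math import sqrt
from statistics import NormalDist

def approx_dist(n, p):
    """Approximating normal Y ~ N(np, sqrt(npq)) for X ~ B(n, p)."""
    return NormalDist(n * p, sqrt(n * p * (1 - p)))

def approx_more_than(n, p, k):
    """P(X > k) is approximated by P(Y > k + 0.5)."""
    return 1 - approx_dist(n, p).cdf(k + 0.5)

def approx_exactly(n, p, k):
    """P(X = k) is approximated by P(k - 0.5 < Y < k + 0.5)."""
    y = approx_dist(n, p)
    return y.cdf(k + 0.5) - y.cdf(k - 0.5)

# Example 6.3.2: X ~ B(200, .081)
more_than_20 = approx_more_than(200, 0.081, 20)
exactly_15 = approx_exactly(200, 0.081, 15)
print(f"P(X > 20) is about {more_than_20:.4f}")
print(f"P(X = 15) is about {exactly_15:.4f}")
```

The printed values should be close to .1335 and .0986; the small differences come from rounding z to two decimal places when using the table.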

6.3.1 Exercises

Use the normal approximation to the binomial to estimate the following probabilities.

1. Let X ∼ B(40, .7), find P (25 ≤ X ≤ 30)

2. Let X ∼ B(85, .25), find P (18 ≤ X ≤ 25)

3. Let X ∼ B(50, .3), find P (X < 12)

4. Let X ∼ B(200, .1), find P (X > 23)

5. Let X ∼ B(90, .27), find P (X = 20)

6. Let X ∼ B(70, .4), find P (X = 30)

For the following, set up the binomial problem, then use the normal approximation to the binomial.

7. At a large sports arena, 43% of fans purchase at least one food item. Find the probability that in a random sample of 60 fans at the sports arena, at least 20 purchase at least one food item.


8. In 2018 it was reported in a California Secretary of State press release that 25.5% of registered voters in California were registered as ‘no party preference’. Assume this is true of the current population in California. Find the probability that if 150 registered voters are selected at random, at most 40 are registered as ‘no party preference’.

9. Psychology Today states that about 45% of marriages end in divorce. Find the probability that of 60 randomly selected marriages, at most 25 will end in divorce.

10. Although estimates vary, about 12% of the population are left-handed. Assume this is true of the current population. If you randomly select 80 people, what is the probability of selecting more than 10 lefties?

11. On a desolate stretch of road, 67% of drivers drive at least 10 mph over the posted speed limit. If the local authorities take a sample of 90 drivers on the stretch of road, what is the probability that fewer than 50 are driving at least 10 mph over the speed limit?

12. According to hootsuite.com, 59% of Instagram users are between 18 and 29 years old. If this is so, find the probability that if a random sample of 150 Instagram users is selected, more than 100 are between 18 and 29 years old.

13. Statistica.com states that 10.1% of respondents aged 18 to 29 years old played a musical instrument in the past year. Assume this is true for the current population of people aged 18 to 29 years of age. Find the probability that if 75 people aged 18 to 29 are randomly selected, at most 10 have played a musical instrument in the past year.

14. A Gallup Poll reported that 53% of Americans with college degrees play the lottery. Suppose 65 Americans with college degrees are randomly selected. What is the probability that more than 30 play the lottery?

15. Many people shop online. Pew Research reported that 79% of US consumers shop online. If 80 US consumers are selected at random, what is the probability that fewer than 60 shop online?

16. The travel site TripAdvisor states that 59% of site visitors report taking separate vacations from their significant others in the past. If the result holds for all couples that take vacations, what is the probability that at least 45 of 90 adults have taken separate vacations from their significant other?

17. Many students prefer to use ebooks instead of a hard copy. A popular textbook has an ebook version that is used by 46% of all students. Find the probability that in a random sample of 31 students, exactly 14 use an ebook.

18. A large hotel chain reports that 16% of current occupants have stayed with the hotel at least once in the past year. In a random sample of 120 occupants, what is the probability that exactly 20 of them have stayed with the hotel at least once in the past year?


Chapter 7

Sampling Distributions of the Population Mean and Proportion


7.1 Sampling Distributions

Thus far we have examined both continuous and discrete random variables and their applications. We now want to examine what are called ‘Sampling Distributions’. If we consider the random variable that represents the amount of beer dispensed into a bottle, we probably would be able to calculate probabilities using the normal distribution. Now, consider the following problem: if we look at a random sample of 6 bottles and want to know the probability that the average amount of beer dispensed into the 6 bottles falls in some range, we need to know how to calculate probabilities for a new random variable, specifically X̄. Before we get there, we need some new ideas. We begin with the Population Distribution.

Definition: The Population Distribution is the probability distribution of the population.

Example 7.1.1.

Consider the population that consists of the values: 2, 3, 6, 8, 12, and 17. One of these is to be selected at random. Find the population distribution.

Solution.

Since the outcomes are all equally likely, the probabilities will all be the same, namely 1/6. The population distribution is given as

X     P(X)
2     1/6
3     1/6
6     1/6
8     1/6
12    1/6
17    1/6

Now let’s look at what is called ‘The Sampling Distribution of X̄’.

Definition: The Sampling Distribution of X̄ is the probability distribution of the random variable X̄.

We will look at several sampling distributions. When it is clear what statistic we are talking about (here it is X̄) we will simply say ‘the sampling distribution’ (dropping the ‘of X̄’).

Example 7.1.2.

Using the population given above, construct the sampling distribution of X̄ where we take samples of size 3 without replacement.


Solution.

Since order does not matter, there will be (6 choose 3) = 20 ways to select samples of size 3. It would be instructive to list the samples below along with each sample’s mean.

Sample      X̄       Sample      X̄       Sample      X̄       Sample      X̄
2, 3, 6     11/3    2, 6, 12    20/3    3, 6, 8     17/3    3, 12, 17   32/3
2, 3, 8     13/3    2, 6, 17    25/3    3, 6, 12    21/3    6, 8, 12    26/3
2, 3, 12    17/3    2, 8, 12    22/3    3, 6, 17    26/3    6, 8, 17    31/3
2, 3, 17    22/3    2, 8, 17    27/3    3, 8, 12    23/3    6, 12, 17   35/3
2, 6, 8     16/3    2, 12, 17   31/3    3, 8, 17    28/3    8, 12, 17   37/3

We now need to summarize the means above. There are twenty different samples and each one is equally likely. So we get the following.

X̄       P(X̄)    X̄       P(X̄)    X̄       P(X̄)    X̄       P(X̄)
11/3    1/20    20/3    1/20    25/3    1/20    31/3    2/20
13/3    1/20    21/3    1/20    26/3    2/20    32/3    1/20
16/3    1/20    22/3    2/20    27/3    1/20    35/3    1/20
17/3    2/20    23/3    1/20    28/3    1/20    37/3    1/20

Note that the distribution has been split into several columns to save space.
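The listing-and-tallying procedure in this example can be sketched in a few lines of Python. This is only an illustration using the standard library; the names `population`, `samples`, and `sampling_dist` are chosen here and do not come from the text. Fractions are kept exact, so 21/3 prints as 7 and 2/20 prints as 1/10.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

population = [2, 3, 6, 8, 12, 17]
n = 3  # sample size, drawn without replacement

# Every unordered sample of size n is equally likely.
samples = list(combinations(population, n))
means = [Fraction(sum(s), n) for s in samples]

# Sampling distribution: each distinct mean with its probability.
counts = Counter(means)
sampling_dist = {m: Fraction(c, len(samples)) for m, c in sorted(counts.items())}

for mean, prob in sampling_dist.items():
    print(f"x-bar = {float(mean):7.4f}   P = {prob}")
```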

Recall what we are doing when we take a sample and calculate X̄: we are trying to estimate the population mean, µ. Although we hope that X̄ is close to µ, we realize that they will almost never be exactly equal. That leads us to the idea of ‘sampling error’.

The Sampling Error for an Estimate of the Population Mean is given by

X̄ − µ

The sampling error is the signed distance the sample mean is from the actual mean.

Example 7.1.3.

For the samples 2, 8, 12 and 6, 8, 17 selected from the population above, find the sampling errors and interpret what the results say.

Solution.

First let’s find µ from the population distribution: µ = (2 + 3 + 6 + 8 + 12 + 17)/6 = 8.

For the first sample, 2, 8, 12, X̄ = 22/3, so the sampling error is X̄ − µ = 22/3 − 8 = −2/3 ≈ −.6667. What this tells us is that our estimate of µ underestimates µ by 2/3.

For the sample 6, 8, 17, X̄ = 31/3, so the sampling error is X̄ − µ = 31/3 − 8 = 7/3 ≈ 2.3333. This tells us that our estimate of µ overestimates µ by 7/3.


The negative value tells us that we have an underestimate, and a positive value tells us we have an overestimate.

When most people hear the word ‘error’, they equate it with the word ‘mistake’. In statistics this is not the case. We are simply acknowledging the fact that, due to the nature of sampling, we can do everything without making any mistakes and our estimate will still not be equal to our parameter. We do have the ability to make plenty of mistakes. That is addressed in what are called ‘non-sampling errors’.

In the last example we needed to calculate X̄. If we attempt to find X̄ for 2, 8, 12 and in our excitement we enter 21 instead of 12 on our calculator and get X̄ = 31/3, then our X̄ is clearly wrong. In fact it is 3 greater than what it should be, since 31/3 − 22/3 = 3. This brings us to a definition for non-sampling error.

The Non-Sampling Error for an Estimate of µ is given by

X̄_incorrect − X̄_correct

Note: in the ‘real world’ one does not ever calculate the sampling or non-sampling errors. In order to calculate the sampling error you need the mean of the population. If you have the mean of the population then there is no point in taking a sample mean. The sample mean is an estimate of the population mean. If you know what µ is, why are you trying to estimate µ?

During an exam, if you were to make the same mistake we made in the above discussion, would you leave it? The answer is ‘no’ if you know that it is wrong. You would fix it and turn in the correct value. If you don’t know that you made a mistake, you can’t determine how far off your answer is from the actual value.

Although we don’t calculate these in the real world, we will be able to get an idea of a reasonable bound on the sampling error. We also need to have a discussion of non-sampling errors so that we can identify the source and do our best to eliminate them. If we were able to somehow sample perfectly, we could eliminate the non-sampling errors. There is no way to eliminate sampling errors altogether. They are an integral part of sampling.

7.1.1 The Mean and Standard Deviation of the Sampling Distribution

We would like to find the mean and standard deviation of our new random variable, X̄. Let us continue on with our sampling distribution from before. We need to realize that in these problems we will have two different distributions, X and X̄, and we need to distinguish them when we talk about the mean and standard deviation. For the random variable X we will continue to use µ and σ. For the random variable X̄ we will use µ_X̄ and σ_X̄. (Read ‘mu sub x bar’ and ‘sigma sub x bar’.)

Example 7.1.4.

Find the mean and standard deviation of the population distribution and the sampling distribution of X̄ from before. Compare results.

Solution.


This is a simple matter to do with our calculator. For the population distribution we input the data values in list 1. Since the probabilities are all the same, we can leave off the probabilities and run one-variable statistics. We get

µ = 8 and σ = 5.196

For the sampling distribution, input the means into list 1 and the probabilities into list 2 and run one-variable statistics. We get

µ_X̄ = 8 and σ_X̄ = 2.324

Notice that the mean values are the same and the standard deviation of the sampling distribution is smaller.

This may seem odd at first but it is exactly what we want to happen. We get the following:

µ_X̄ = µ

The standard deviation may at first seem strange, but recall what the standard deviation tells us. We think of the standard deviation as the ‘average’ distance the data values fall from the mean. Look at graphs of the probability distributions below.

[Figure: the population distribution of X and the sampling distribution of X̄, each graphed on an axis from 0 to 20.]

The sampling distribution is not as spread out as the population distribution. We get the following, where n is the sample size, N is the population size and σ is for the population distribution.

σ_X̄ = (σ/√n) × √((N − n)/(N − 1)) (sampling without replacement)

or

σ_X̄ = σ/√n (sampling with replacement or if n/N is ‘small’)


We will use the arbitrary value of 0.05 for ‘small’ here. In other words, if we sample less than 5% of the population we will treat the problem as if we were sampling with replacement.

Example 7.1.5.

Find the mean and standard deviation of the sampling distribution above without using the sampling distribution.

Solution.

Although we are not able to use the sampling distribution, we can use the population distribution. The population distribution had µ = 8 and σ = 5.196. Since we sampled without replacing, we use the formula for the standard deviation without replacement. We took samples of size 3, so n = 3. The population consisted of 6 items, so N = 6.

Since µ_X̄ = µ, we have µ_X̄ = 8.

σ_X̄ = (σ/√n) × √((N − n)/(N − 1)) = (5.196/√3) × √((6 − 3)/(6 − 1)) = 2.324
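The agreement between the formula and the full sampling distribution can be checked numerically. The sketch below is an illustration only, assuming Python's standard library; it computes σ_X̄ once from the without-replacement formula and once by brute force over all 20 samples. The variable names are chosen here for readability.

```python
from itertools import combinations
from math import sqrt
from statistics import fmean, pstdev

population = [2, 3, 6, 8, 12, 17]
N, n = len(population), 3

mu = fmean(population)
sigma = pstdev(population)          # population standard deviation

# Formula route: sigma / sqrt(n) times the finite population correction.
sigma_xbar_formula = sigma / sqrt(n) * sqrt((N - n) / (N - 1))

# Brute-force route: standard deviation of all 20 equally likely sample means.
sample_means = [fmean(s) for s in combinations(population, n)]
mu_xbar = fmean(sample_means)
sigma_xbar_direct = pstdev(sample_means)

print(f"mu = {mu:.3f}, sigma = {sigma:.3f}")
print(f"mu_xbar = {mu_xbar:.3f}")
print(f"sigma_xbar: formula {sigma_xbar_formula:.3f}, direct {sigma_xbar_direct:.3f}")
```

Dropping the factor √((N − n)/(N − 1)) here would give 5.196/√3 = 3.000, which overstates the spread because half of this small population is sampled.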

Example 7.1.6.

The heights of residents of a large city have a mean of 65.9 inches and a standard deviation of 2.78 inches. A random sample of 8 residents is to be taken. Find the mean and standard deviation of the sampling distribution of X̄ with n = 8.

Solution.

Listing all samples, as we did when we started our discussion of sampling distributions, would be impossible. However, we have the formulas from above. Here we have µ = 65.9 and σ = 2.78 with n = 8. Since we are given that this is coming from a large city, we are clearly sampling less than 5% of the population, so we can use the formula σ_X̄ = σ/√n. So we have

µ_X̄ = µ = 65.9 inches

σ_X̄ = σ/√n = 2.78/√8 = 0.98 inches

7.1.2 The Shape of the Sampling Distribution of X̄

We have established the mean and standard deviation of the sampling distribution. We now turn our attention to the shape of the sampling distribution. We have added/subtracted normally distributed random variables before. When we added two random variables that were normal, the shape was normal and the variances added. If we then divide the random variable that represents the sum by two, we get a normal distribution whose variance is half the variance of a single observation.

If X₁ ∼ N(µ, σ) and X₂ ∼ N(µ, σ)

then X₁ + X₂ ∼ N(µ + µ, √(σ² + σ²))

or X₁ + X₂ ∼ N(µ + µ, √(2σ²))

If we divide by 2 we get

(X₁ + X₂)/2 ∼ N((µ + µ)/2, √(2σ²)/2),

that is,

X̄ ∼ N(µ, σ/√2)

If we generalize, we will get

X̄ ∼ N(µ, σ/√n)

This is consistent with the mean and standard deviation formulas from before. The shape is the focus here, however. If we look at a population distribution and a few sampling distributions of X̄ for several n’s we will get a better idea.

[Figure: four panels on an axis from −2 to 2: a normal population distribution of X and the sampling distributions of X̄ for n = 2, n = 3, and n = 4.]

Notice that as the sample size gets larger, the standard deviation of the sampling distribution decreases. Since the total area is one for each graph, the graphs will get taller. Also note that the mean is the same for all the distributions and the shapes of the distributions all appear to be normal. This is because the population distribution is normal. Let us proceed by looking at a non-normal distribution.

Consider a uniform distribution.

[Figure: a uniform population distribution of X on the interval from 0 to 1.]


This is similar to the roll of a single die, although with a die we would have a discrete random variable. If we take samples of size 2 we will get a sampling distribution that looks like the following.

[Figure: the sampling distribution of X̄ for n = 2, on the interval from 0 to 1.]

This may seem odd at first. Our first reaction might be that the distribution should be uniform. But what does it take for the mean to be near 0? It takes both values in your sample to be near 0, which is not very likely. To get a mean near 1/2 is a lot more likely: the first value picked can be anywhere, and you only need to worry about the second value picked, not both. In the die analogy mentioned above, if we throw the die twice, the probability that the average, X̄, is 1 is 1/36 (both values need to be 1’s, and there is only one way out of 36 possible ways to roll a die twice). To get X̄ to be 3.5, you need a sum of 7. There are 6 ways to get this out of 36 possible, so the probability of getting an average of 3.5 is 1/6. We see this in the distribution above. It starts unlikely at 0, rises to its maximum value at 0.5 and then drops back down to the axis.
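The die analogy can be checked by listing all 36 outcomes of two rolls and tallying the averages. The following is a small illustrative sketch using Python's standard library; the names `rolls` and `dist` are chosen here and are not from the text.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 equally likely ordered outcomes of rolling a fair die twice.
rolls = list(product(range(1, 7), repeat=2))
counts = Counter(Fraction(a + b, 2) for a, b in rolls)
dist = {avg: Fraction(c, len(rolls)) for avg, c in sorted(counts.items())}

for avg, prob in dist.items():
    print(f"average {float(avg):.1f}: probability {prob}")
```

The probabilities rise steadily from the smallest average to the middle value of 3.5 and then fall again, which is the discrete version of the triangular shape described above.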

So what happens if we continue to take larger and larger samples? The distribution with n = 3 is below.

[Figure: the sampling distribution of X̄ for n = 3, on the interval from 0 to 1.]

This distribution looks bell shaped!

We look at the sampling distributions of X̄ for n = 2, 3, and 4 as well as the population distribution.


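[Figure: four panels on the interval from 0 to 1: the population distribution of X and the sampling distributions of X̄ for n = 2, n = 3, and n = 4.]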

Notice that the graphs are getting narrower, taller, and more bell shaped as n gets bigger. The means of the distributions are all the same, as well.

We also do this for another distribution.

[Figure: four panels on the interval from 0 to 1: another population distribution of X and the sampling distributions of X̄ for n = 2, n = 3, and n = 4.]

Notice that the distribution goes from something that is non-normal (triangular) to something that is almost bell shaped when the sample size is 4.

This brings us to an important theorem.

Central Limit Theorem

The sampling distribution of X̄ will be approximately normal if the sample size is sufficiently large.

This raises a question. What is sufficiently large? If you look at the examples above, the sample size didn’t need to be very large. For the population distribution that was uniform, X̄ had a bell-shaped distribution at n ≥ 3. For the population distribution that looked triangular, we had a reasonable bell-shaped distribution for X̄ at n ≥ 4. There is no specific value that n needs to be. If the population distribution is normal, the sampling distribution of X̄ will be normal for any n. If the population distribution is very non-normal then we need n to be much larger. The more non-normal the population distribution is, the larger you need n to be. The general consensus among most authors is that a sample size bigger than 30 is sufficient.
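The effect of the sample size on shape can also be explored by simulation. The sketch below is only an illustration and rests on assumptions not in the text: it uses an exponential population (strongly skewed right, with mean 1 and standard deviation 1), 20,000 simulated samples per sample size, a fixed random seed, and skewness as a rough numerical stand-in for 'how non-normal the shape is'.

```python
import random
from statistics import fmean, pstdev

def skewness(values):
    """Population skewness: mean cubed z-score (0 for a symmetric shape)."""
    m, s = fmean(values), pstdev(values)
    return fmean(((v - m) / s) ** 3 for v in values)

def simulate_sample_means(n, reps=20_000, seed=0):
    """Means of `reps` random samples of size n from an exponential population."""
    rng = random.Random(seed)
    return [fmean(rng.expovariate(1.0) for _ in range(n)) for _ in range(reps)]

for n in (1, 2, 5, 30):
    means = simulate_sample_means(n)
    print(f"n = {n:2d}: mean {fmean(means):.3f}, "
          f"std dev {pstdev(means):.3f}, skewness {skewness(means):.2f}")
```

In runs of this kind the mean of the sample means stays near 1, the standard deviation shrinks roughly like 1/√n, and the skewness moves toward 0 as n grows, though for a population this skewed it has not vanished even at n = 30.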


Some texts refer to the population distribution as the parent distribution. The author refers to the Central Limit Theorem as the ‘Teenager’s Theorem’: teenagers think they are normal despite the fact that their parents are not normal.

We can put this together with what we learned before and we get

X̄ ∼ N if X ∼ N or n > 30

Example 7.1.7.

The weights of a particular breed of dog are skewed to the right with a mean of 25.4 pounds and a standard deviation of 6.7 pounds. A random sample of 35 of these dogs is to be selected. Let X̄ be the mean weight of a random sample of 35 of this breed. Find the mean and standard deviation of the sampling distribution of X̄ and describe its shape.

Solution.

The example is asking for µ_X̄ and σ_X̄ just like we did before. We have

µ_X̄ = µ = 25.4 pounds

σ_X̄ = σ/√n = 6.7/√35 = 1.13 pounds

For the shape, don’t let the ‘skewed to the right’ confuse you. This statement applies to the population distribution, X. We want to know about the sampling distribution of X̄. Since n = 35 > 30, the Central Limit Theorem tells us that the sampling distribution of X̄ will be approximately normal. So we have X̄ ∼ N. If we put all this together we have

X̄ ∼ N(25.4, 1.13)

The Central Limit Theorem’s importance is huge in the world of statistics. Determining whether a distribution is normal is not an easy task. In the previous part of the text, we have been given a lot of information about a population. This is now starting to change. We will be able to infer that a distribution is normal without needing to be told (the distribution of X̄, not X). As we progress in the subsequent chapters we will find that we are given less and less information about a population.

7.1.3 Exercises

1. If µ = 34.8, σ = 5.98, and n = 13, find the mean and standard deviation of the sampling distribution of X̄.

2. If µ = 125.6, σ = 26.4, and n = 27, find the mean and standard deviation of the sampling distribution of X̄.


3. The speeds of cars on a highway have a mean of 67.4 mph with a standard deviation of 4.26 mph. Find the mean and standard deviation of the sampling distribution of X̄ for a sample size of 18.

4. The average time a physician spends with a patient is 8.9 minutes with a standard deviation of 2.16 minutes. Find the mean and standard deviation of the sampling distribution of X̄ for a sample size of 31.

5. The cost of repairs for cars in an accident has a distribution which is skewed left with a mean of $3,219 and a standard deviation of $958. Find the mean and standard deviation of the sampling distribution of X̄ where the mean is based on a sample of 35. Comment on the shape of the sampling distribution of X̄.

6. The amount of beer sold in a day at a sports complex has a distribution which is skewed right with a mean of 864 gallons and a standard deviation of 102.8 gallons. Find the mean and standard deviation of the sampling distribution of X̄ if the mean is based on a sample of 40 days. Comment on the shape of the sampling distribution of X̄.

7. The mean amount of grapes harvested from a grape vine is 16.8 pounds with a standard deviation of 6.48 pounds. A random sample of 42 vines is to be selected. Find the mean and standard deviation of the sampling distribution of the sample mean with a sample size of 42 and comment on the shape of the sampling distribution.

8. An online report indicated that ‘tweens’ spend an average of 6 hours using media: watching videos, listening to music, playing games, etc. Assume this is true and the standard deviation is 5.3 hours. Find the mean and standard deviation of the sampling distribution of the sample mean for a sample size of 60. Comment on the shape of the sampling distribution. What can you say about the shape of the population distribution?


7.2 Probabilities of X̄

We have established that the sample mean, X̄, is approximately normally distributed under some fairly mild conditions. Since X̄ is approximately normally distributed, we can proceed and calculate some probabilities like we did in the normal probability section.

Example 7.2.1.

The amount of sugar dispensed into 50-pound bags varies from bag to bag. The mean amount dispensed into all bags is 50.06 pounds with a standard deviation of 0.46 pounds. The management of the company is planning on taking a sample of 36 bags and determining the mean weight of the bags. Find the probability that the mean weight is less than 50 pounds.

Solution.

There is no mention of the type of distribution. However, we are looking at ‘the probability that the mean . . .’ The Central Limit Theorem (CLT) tells us that the mean will be approximately normal since n > 30. As such, we will proceed as before. Since we want the probability of the mean, the mean is our random variable.

1. Let X̄ = mean weight of a random sample of 36 bags.

2. X̄ ∼ N(50.06, .46/√36)

3. P(X̄ < 50)
   = P(z < (50 − 50.06)/(.46/√36))
   = P(z < −0.78)
   = 0.2177

4. 0.2177

Example 7.2.2.

The average speed of cars traveling on a freeway is 71.6 mph with a standard deviation of 5.64 mph. Find the probability that the mean speed of a random sample of 40 cars will be between 70 and 73 mph.

Solution.

As with the last problem, we are looking for the probability of the mean. Since n > 30 we can use the CLT.

1. Let X̄ = mean speed of a random sample of 40 cars.

2. X̄ ∼ N(71.6, 5.64/√40)


3. P(70 < X̄ < 73)
   = P((70 − 71.6)/(5.64/√40) < z < (73 − 71.6)/(5.64/√40))
   = P(−1.79 < z < 1.57)
   = 0.9418 − 0.0367

4. 0.9051

The formula to calculate the z-score is essentially the same as before.

From before, we have

z = (X − µ)/σ

Now, we have

z = (X̄ − µ)/(σ/√n) = (X̄ − µ_X̄)/σ_X̄

The best way to think about it is as

z = (RV − µ_RV)/σ_RV

where RV is a random variable.
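The two examples in this section follow the same pattern, which can be sketched with Python's standard library. The helper name `sample_mean_dist` is hypothetical and chosen here for illustration. Because the code does not round z to two decimal places, its results can differ slightly from the table-based answers 0.2177 and 0.9051.

```python
from math import sqrt
from statistics import NormalDist

def sample_mean_dist(mu: float, sigma: float, n: int) -> NormalDist:
    """Normal model for the sample mean: N(mu, sigma / sqrt(n))."""
    return NormalDist(mu=mu, sigma=sigma / sqrt(n))

# Example 7.2.1: sugar bags, P(X-bar < 50)
sugar = sample_mean_dist(50.06, 0.46, 36)
p_sugar = sugar.cdf(50)

# Example 7.2.2: freeway speeds, P(70 < X-bar < 73)
speed = sample_mean_dist(71.6, 5.64, 40)
p_speed = speed.cdf(73) - speed.cdf(70)

print(f"P(X-bar < 50)      = {p_sugar:.4f}")
print(f"P(70 < X-bar < 73) = {p_speed:.4f}")
```

The model is only appropriate when the population is normal or n > 30, as stated earlier; the helper does not check this.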

7.2.1 Exercises

1. µ = 26.4, σ = 7.96, n = 35. Find P(X̄ < 25.0)

2. µ = 56.23, σ = 4.35, n = 44. Find P(X̄ < 55.0)

3. µ = 1.765, σ = 1.026, n = 15, X ∼ N. Find P(X̄ > 1.530)

4. µ = 25.9, σ = 9.46, n = 8, X ∼ N. Find P(X̄ > 30)

5. µ = 14.8, σ = 5.97, n = 37. Find P(12.68 < X̄ < 16.98)

6. µ = 0.21, σ = 7.64, n = 50, X ∼ N. Find P(1.35 < X̄ < 2.67)

For the following problems, address the following:

(a) What the random variable represents

(b) What the appropriate distribution is and the parameters

(c) What probability is being calculated

(d) What the probability is

7. The mean germination time for a seed variety in ideal conditions is 5.68 days with a standard deviation of 2.67 days. What is the probability that the mean germination time for a random sample of 38 seeds under ideal conditions is more than 6.5 days?


8. The time required for a customer to make it through drive thru has a mean of 98.7 seconds with a standard deviation of 12.6 seconds. Find the probability that the mean time for a random sample of 41 customers to make it through drive thru will be more than 101 seconds.

9. The average height of an American woman is 163.4 cm with a standard deviation of 5.93 cm. It is a well known fact that heights of women are approximately normally distributed. Find the probability that in a random sample of 7 women, the average height will be between 160 cm and 165 cm.

10. The average height of an American man is 176.9 cm with a standard deviation of 6.70 cm. It is a well known fact that heights of men are approximately normally distributed. Find the probability that in a random sample of 12 men, the average height will be between 175 cm and 180 cm.

11. Although heights of adults tend to be approximately normal, this is not true for weights of adults. The distribution of weights of 20-year old males is skewed to the right with a mean of 158.9 pounds and a standard deviation of 29.7 pounds. What is the probability that a random sample of 50 20-year old males will have a mean weight less than 150 pounds?

12. The distribution of weights of 20-year old women is skewed to the right with a mean of 142.5 pounds and a standard deviation of 25.9 pounds. What is the probability that a random sample of 40 20-year old women will have a mean weight more than 150 pounds?

13. A factory fills beer bottles. The management is planning an inspection. As part of the inspection, the management will take a sample of 36 bottles. If the mean amount of beer in the bottles is within 0.1 ounces of the required mean of 12.1 ounces in the bottle, the workers will get a bonus. If the mean is not within 0.1 ounces, the machines will be shut down and adjusted. The mean amount of beer dispensed into all bottles is 12.1 ounces and the standard deviation is 0.26 ounces.

(a) What is the probability that the employees will get a bonus?

(b) What is the probability the machines will need to be adjusted?

(c) If the machines are stopped and adjusted, does that mean they were not dispensing the correct amount of beer?

14. The owner of ‘While U Wait’ oil and lube shop strives to have the mean wait time be less than 15 minutes. The owner is planning on taking a sample of 32 cars that are serviced and finding the mean time. If the mean time is less than 15 minutes, the managers will get a raise. If the mean time is more than 17 minutes, the managers will get fired. Assume the mean time for all oil and lube jobs is 15.68 minutes with a standard deviation of 2.64 minutes.

(a) What is the probability that the managers get a raise?

(b) What is the probability that the managers get fired?

(c) Does the owner’s policy seem fair? Explain.

15. A sheriff wants to determine the average speed of cars on a particularly dangerous stretch of highway. It is known that the standard deviation of the speeds is 5.68 mph. If a random sample of 60 cars is to be taken, what is the probability that the mean speed of the sample will be within 1 mph of the actual mean?


16. The amount of flour dispensed into bags varies from bag to bag. The standard deviation is known to be 4.43 ounces. What is the probability that the mean amount of flour dispensed into a random sample of 40 bags will be within 1 ounce of the actual mean?


7.3 The Sampling Distribution of p̂

When we looked at the normal approximation to the binomial distribution, it was a hard sell. The normal distribution gave an approximation, not the actual value, and it was more work!

The reason we discuss the normal approximation is that if we can think of a distribution as being approximately normal, we can use all the techniques for a normally distributed random variable. This will be especially useful when we look at inferential statistics.

We need to define the proportion. We have calculated this before but the term is new. It is just the relative frequency we calculated before.

Definition

The population proportion, p, is given by

p = (number of items in a population with a certain characteristic) / (population size)

The sample proportion, p̂ (read ‘p hat’), is given by

p̂ = (number of items in a sample with a certain characteristic) / (sample size)

Example 7.3.1.

According to the California Secretary of State, 17.8 million of the 24.3 million eligible California adults were registered to vote just before the 2014 general election. In a group of 45 randomly selected eligible California voters, 26 were registered to vote before the 2014 general election. Find the proportions of registered voters in the 2014 election.

Solution.

In the first part we are dealing with the population. In the second part we have ‘randomly selected . . .’, which means that it is a sample. For the first part,

p = 17.8 million / 24.3 million = 0.7325

For the second part,

p̂ = 26/45 = 0.5778

When we look at the numerator of the formula for p̂, we see ‘number of items in a sample with a certain characteristic’. This is going to be a random variable that has either a hypergeometric or a binomial distribution. Recall these distributions are similar. The main difference is that for a binomial distribution, we are sampling either with replacement or taking a small portion of the population. In a hypergeometric distribution, we are sampling without replacement from a small population.

In the last example, we took a sample of 45 eligible voters and were interested in the number of those that were registered. That is a binomial problem. Most of what follows begins with a binomial distribution.

Let $X \sim B(n, p)$. Recall that for $np > 5$ and $nq > 5$ a binomial distribution will be approximately normal.

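$$X \sim B(n, p)$$

can be approximated by

$$X \sim N\left(np, \sqrt{npq}\right)$$

If we divide by $n$ we get

$$\frac{X}{n} \sim N\left(\frac{np}{n}, \frac{\sqrt{npq}}{n}\right)$$

This simplifies to

$$\hat{p} \sim N\left(p, \sqrt{\frac{pq}{n}}\right)$$

From this we get $\mu_{\hat{p}} = p$, $\sigma_{\hat{p}} = \sqrt{\frac{pq}{n}}$, and, if $np > 5$ and $nq > 5$, $\hat{p} \sim N$.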
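The result above can be sketched in a few lines of Python. This is only an illustration: the function name `p_hat_distribution` is our own, and the call uses the values from Example 7.3.1 (a sample of 45 with $p = 0.7325$).

```python
from math import sqrt

def p_hat_distribution(n, p):
    """Return (mean, standard deviation, normal_ok) for the sampling distribution of p-hat."""
    q = 1 - p
    mean = p                               # mu of p-hat equals p
    sd = sqrt(p * q / n)                   # sigma of p-hat equals sqrt(pq/n)
    normal_ok = n * p > 5 and n * q > 5    # rule of thumb used in this section
    return mean, sd, normal_ok

# Hypothetical usage with the numbers from Example 7.3.1
mean, sd, normal_ok = p_hat_distribution(n=45, p=0.7325)
print(f"mean = {mean:.4f}, sd = {sd:.4f}, approximately normal: {normal_ok}")
```

The third value simply reports whether both $np > 5$ and $nq > 5$ hold, so it can be checked before any normal calculation is attempted.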

Now that we know $\hat{p}$ is approximately normal, we can apply the steps we used before to calculate probabilities.

Example 7.3.2.

The proportion of adults who have gotten a flu shot is 0.46. If a sample of 790 adults is randomly selected, find the probability that the proportion of those that have gotten the flu shot is between 0.425 and 0.482.

Solution.

Note ‘the probability . . . the proportion . . .’. This tells us that we want $\hat{p}$ as our random variable.

1. Let $\hat{p}$ = the proportion of adults who have gotten a flu shot in a random sample of 790 adults.

2. $\hat{p} \sim N\left(0.46, \sqrt{\frac{.46 \times .54}{790}}\right)$

3. $P(.425 < \hat{p} < .482)$

$$= P\left(\frac{.425 - .46}{\sqrt{\frac{.46 \times .54}{790}}} < z < \frac{.482 - .46}{\sqrt{\frac{.46 \times .54}{790}}}\right)$$

$$= P(-1.97 < z < 1.24) = 0.8925 - 0.0244 = .8681$$

4. 0.8681
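As a check on the table work, the same probability can be computed with Python's standard library. This is a sketch, not part of the original solution; the helper name `prob_p_hat_between` is our own. Because it does not round $z$ to two decimals, its answer may differ from the table value in the third or fourth decimal place.

```python
from math import sqrt
from statistics import NormalDist

def prob_p_hat_between(p, n, low, high):
    """P(low < p-hat < high) using the normal model p-hat ~ N(p, sqrt(pq/n))."""
    sd = sqrt(p * (1 - p) / n)
    dist = NormalDist(mu=p, sigma=sd)
    return dist.cdf(high) - dist.cdf(low)

# Values from Example 7.3.2
prob = prob_p_hat_between(p=0.46, n=790, low=0.425, high=0.482)
print(f"P(0.425 < p-hat < 0.482) = {prob:.4f}")
```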

Just as we often interpreted or wrote probabilities as percentages, we will do or see the same thing with proportions. In the last example it could have been stated that 46% of adults have gotten a flu shot. In the next example we will look at percentages.

Example 7.3.3.

A current estimate for the percent of American adults who are smokers is 14%. Assume this is true for the current population of American adults. Find the probability that in a random sample of 1000 Americans, at most 16% are smokers.

Solution.

The wording is not as direct as we had in the last example. It is much more subtle. We need to see ‘the probability . . . 16 percent’. The percent here is leading us to the proportion. Since we are looking at the probability of the proportion, we will use $\hat{p}$ as our random variable.

1. Let $\hat{p}$ = the proportion of American adults who are smokers in a random sample of 1000 American adults.

2. $\hat{p} \sim N\left(0.14, \sqrt{\frac{.14 \times .86}{1000}}\right)$

3. $P(\hat{p} < 0.16)$

$$= P\left(z < \frac{.16 - .14}{\sqrt{\frac{.14 \times .86}{1000}}}\right) = P(z < 1.82) = 0.9656$$

4. 0.9656

In this section we are looking at the sampling distribution of $\hat{p}$. If we look at the application problems in the examples and the applications in the exercises, we will notice that they have a binomial look to them: they have successes and failures. That is exactly right. We are actually using the normal approximation to the binomial without the continuity correction factor. Since the samples are very large (several hundred), we can ignore the correction factor. In the problems that follow, do not convert them to binomial problems but use $\hat{p}$ in the solutions.
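To see how little the continuity correction matters for a sample this large, we can compare three calculations for Example 7.3.3: the exact binomial probability, the normal approximation through $\hat{p}$ without the correction, and the normal approximation to $X$ with the correction. The sketch below is illustrative only and all variable names are our own.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 1000, 0.14
q = 1 - p
k = 160  # "at most 16%" of 1000 means X <= 160

# Exact binomial probability P(X <= 160)
exact = sum(comb(n, i) * p**i * q**(n - i) for i in range(k + 1))

# Normal approximation through p-hat, no continuity correction (the method of this section)
no_correction = NormalDist(p, sqrt(p * q / n)).cdf(k / n)

# Normal approximation to X with the continuity correction (X <= 160.5)
with_correction = NormalDist(n * p, sqrt(n * p * q)).cdf(k + 0.5)

print(f"exact binomial:          {exact:.4f}")
print(f"normal, no correction:   {no_correction:.4f}")
print(f"normal, with correction: {with_correction:.4f}")
```

For a sample of this size all three values should agree to about two decimal places, which is why the section drops the correction.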


7.3.1 Exercises

1. If $n = 785$ and $p = 0.67$, find the mean and standard deviation of the sampling distribution of $\hat{p}$ and comment on the shape of the sampling distribution.

2. If $n = 958$ and $p = 0.704$, find the mean and standard deviation of the sampling distribution of $\hat{p}$ and comment on the shape of the sampling distribution.

3. If $n = 1264$ and $p = 0.642$, find the mean and standard deviation of the sampling distribution of $\hat{p}$ and comment on the shape of the sampling distribution.

4. If $n = 1059$ and $p = 0.310$, find the mean and standard deviation of the sampling distribution of $\hat{p}$ and comment on the shape of the sampling distribution.

For the following problems, address the following:

(a) What the random variable represents

(b) What the appropriate distribution is and the parameters

(c) What probability is being calculated

(d) What the probability is

5. The proportion of eggs from a large egg farm that are classified as large is 0.57. In a random sample of 1240 eggs, what is the probability that the proportion is between 0.55 and 0.58?

6. The proportion of registered voters in favor of a proposition on an upcoming ballot is 0.64. Find the probability that if a sample of 860 registered voters is taken, the proportion of voters in favor of the proposition will be more than 2/3.

7. At a large university, 28.3% of students rent books instead of buying them. A random sample of 500 students is to be selected. Find the probability that the proportion of students who rent their books is less than .25.

8. According to a Gallup poll, Americans’ favorite sport is football: 37% of American adults reported that it was their favorite. A random sample of 950 is to be taken. What is the probability that the proportion of those surveyed whose favorite sport is football is between .35 and .38?

9. Aside from taking coursework, in order to become an RN, you need to pass the NCLEX exam. In 2018 the pass rate for all first-time takers of the exam was 88%. If you randomly select 500 people taking the NCLEX for the first time, what is the probability that the proportion who pass is more than 90%?

10. According to the CDC, 34% of US adults had prediabetes in 2015. Assume the percentage is currently the same. What is the probability that in a random sample of 300 US adults the proportion with prediabetes will be between .30 and .40?

11. The death rates during the bubonic plague varied quite a bit due to environmental factors, population density, etc. It has been estimated that the death rate in Egypt was 40%, according to Wikipedia. Assume this is true. A researcher is investigating Egyptians during the Black Death. The researcher is going to select 600 Egyptians and determine if they were victims of the Black Death. Find the probability that the proportion of those who died is less than .38.


12. An executive at a large company wants the time customers spend on the phone with customer service to be under 10 minutes. The manager of the call center knows that 39 percent of customers are on the phone for more than 10 minutes. A random sample of 500 calls is to be inspected. If more than 40% of calls have customers on the phone for more than 10 minutes, the manager will be fired. What is the probability that the manager will be fired?

13. In 2019, UCLA admitted 12% of first-year applicants. A random sample of 800 first-year applicants is to be inspected. Find the probability that more than 10% of those surveyed were admitted. (FYI: UCLA had over 100,000 applicants.)

14. At a large company, 64% of orders are made online. A random sample of 850 orders is to be taken. Find the probability that the percentage of online orders is greater than 68%.

15. At a very popular bridge in a large city, 19% of cars are from out of state. If we take a sample of 1200 cars on the bridge, what is the probability that the percentage of cars that are from out of state is more than 22%?

16. The manager of an online store wants to determine the percent of orders shipped within 24 hours of the placement of the order. If the percent of such orders is 84% and the manager takes a sample of 920 orders, what is the probability that the percent of orders that are shipped within 24 hours is within 2 percentage points of the actual percentage?

17. At a fast food restaurant with a drive-thru, the owner is considering adding a second window to speed up the drive-thru. Currently, 36% of orders are made at a drive-thru window. Find the probability that a random sample of 800 orders at the restaurant will have the percent of orders made at a drive-thru window within 3 percentage points of the actual percentage.

18. An online report indicated that 37% of high school seniors reported vaping in the previous 12 months. Assume this is true of the current population of high school seniors. Find the probability that in a random sample of 800 high school seniors more than 40% vaped in the previous 12 months.

19. According to the 2017 Used Car Industry Report, 77% of car buyers buy within a week of beginning shopping for a car. A random sample of 1200 car buyers is to be taken. Find the probability that the percent of car buyers who bought within a week of beginning shopping is more than 80%.

20. The 2017 National Survey on Drug Use and Health (NSDUH) reported that 12% of people aged 12-20 years reported binge drinking in the past month. Assume this is true for the current population of 12-20 year-olds. A random sample of 700 12-20 year-olds is to be taken. What is the probability that the proportion of those who reported binge drinking in the last month is within .01 of the actual value?


Chapter 8

Confidence Intervals for the Mean and Proportion

8.1 Confidence Interval for µ with Known σ

Suppose we are interested in the average amount of beer dispensed into all bottles of beer at a brewery. This is something that we clearly cannot determine exactly. To do so would require us to determine the amount of beer in all bottles. Even if we could do this, the cost would be prohibitive. We can take a sample of beer bottles and find the average amount of beer in the bottles sampled. We can easily calculate the mean amount in the bottles. This will give us $\bar{X}$, not $\mu$. Recall that the point of taking a sample and calculating the sample statistic is to estimate the corresponding population parameter. We call this estimate a point estimate.

The Point Estimate of $\mu$ is $\bar{X}$

In general, we have

The Point Estimate of a parameter is the corresponding statistic.

Recall that parameters describe populations and statistics describe samples.

We use $\bar{X}$ as a point estimate of $\mu$ because we expect to find $\bar{X}$ near $\mu$. Sometimes it will be high, sometimes it will be low, but in the long run it will be right on. If the mean of a point estimator is equal to the parameter it estimates then we say that it is an unbiased estimator.

Unbiased Estimator: An estimator is said to be an unbiased estimator if the mean of the estimator is equal to the parameter it is estimating.ᵃ

ᵃ This is why we have a different formula for the sample variance than for the population variance. If we use the same formula for both variances we will have a biased estimate.

In this book we will only be using unbiased estimators.

We have a problem with the point estimate. We know our estimate of the mean amount of beer mentioned above will not exactly equal the actual mean amount of beer. What we want to do is replace the point estimate, which we know does not equal the corresponding parameter, with an interval and a level of confidence for the interval.

Example 8.1.1.

Let’s say we have taken a sample of 36 people and determined the time it takes to get to work during the morning commute. Assume that $\sigma = 20$ minutes. The average turned out to be $\bar{X} = 37.9$ minutes.

1. What does $\mu$ equal?
2. Find an interval that you are 95% certain contains $\mu$.

Solution.


1. What does $\mu$ equal? Quite simply, we don’t know. That is the point of the discussion that will follow shortly. Although we don’t know the mean time spent commuting by all commuters, we can make an educated guess. That is the point estimate. So, our guess for the mean time is 37.9 minutes.

2. In the example we have:

$\bar{X} = 37.9$
$\sigma = 20$
$n = 36$

Although this is completely new to us, the concepts we have from before will help us. First note that since the sample size is greater than 30, the sampling distribution of $\bar{X}$ will be approximately normal. This gives us $\bar{X} \sim N(\mu, \sigma_{\bar{X}})$. Note that we are given $\sigma$ but we want $\sigma_{\bar{X}} = \sigma/\sqrt{n}$, so $\bar{X} \sim N(\mu, \sigma/\sqrt{n})$; specifically, $\bar{X} \sim N(\mu, 20/\sqrt{36})$.

Since we want to be 95% certain $\mu$ is in our interval, let’s start by finding the interval that contains 95% of all $\bar{X}$’s. The distribution of $\bar{X}$ is given below. Note that the distribution has a mean of $\mu$, which we don’t know. Furthermore, note that the center of the distribution contains 95% of the area and is bounded by $\mu \pm 1.96\sigma_{\bar{X}}$.

[Figure: distribution of $\bar{X}$; the central 95% of the area lies between $\mu - 1.96\sigma_{\bar{X}}$ and $\mu + 1.96\sigma_{\bar{X}}$]

95% of the time,

$$\mu - 1.96\sigma_{\bar{X}} < \bar{X} < \mu + 1.96\sigma_{\bar{X}}$$

This says that the distance between $\mu$ and $\bar{X}$ is less than $1.96\sigma_{\bar{X}}$. So we can rewrite this as

$$\bar{X} - 1.96\sigma_{\bar{X}} < \mu < \bar{X} + 1.96\sigma_{\bar{X}}$$

which also happens 95% of the time. This last line is what is called a 95% confidence interval for $\mu$. If we calculate the interval, we will be 95% ‘sure’ that $\mu$ is in the interval. Substituting in we get

$$37.9 - 1.96 \times \frac{20}{\sqrt{36}} < \mu < 37.9 + 1.96 \times \frac{20}{\sqrt{36}}$$

$$37.9 - 6.53 < \mu < 37.9 + 6.53$$

Finally,

$$31.37 < \mu < 44.43$$

So, what does the average commute time equal? We still don’t know! However, we are 95% certain that it is in the interval given. In other words, we are 95% confident that the mean time for the commute for all commuters is between 31.37 and 44.43 minutes. If we think the interval is too wide, we can simply take a larger sample. More on that later.
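The arithmetic in this example can be reproduced with a short Python sketch. The function name `z_confidence_interval` is our own, and the critical value comes from the standard library rather than a table, so it uses 1.95996... in place of the rounded 1.96.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(x_bar, sigma, n, confidence):
    """Return (low, high, margin) for a confidence interval of mu when sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z value with area alpha/2 to its right
    margin = z * sigma / sqrt(n)              # margin of error
    return x_bar - margin, x_bar + margin, margin

# Values from Example 8.1.1
low, high, margin = z_confidence_interval(x_bar=37.9, sigma=20, n=36, confidence=0.95)
print(f"margin of error = {margin:.2f}, interval = {low:.2f} to {high:.2f}")
```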


In the above problem, 6.53 minutes is what is called the 95% margin of error. The margin of error gives us an indication of how large the sampling error can reasonably be. A word or two is in order here regarding the wording of our statement. First, note that the interval we got relies on the fact that the sampling distribution of $\bar{X}$ was approximately normal. For all of our confidence intervals of $\mu$ we will require this normality. This can be achieved by sampling from a population that is normal or by taking a sufficiently large sample ($n > 30$). Second, note that we are not using the word ‘probability’ in our statement. The probability occurs before the sample is taken. If we are going to take a sample, then there is a 95% chance our interval will contain $\mu$. Once the sample is taken, there is no longer a probability involved. Our not knowing whether or not the mean is actually in the interval doesn’t turn it into a probability. This is similar to flipping a coin: if we are going to flip a coin, there is a .5 probability that the coin will come up heads. As you are staring at heads, say, after the flip you aren’t going to state that there is a .5 probability it is heads. It is heads!
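The ‘95% chance before the sample is taken’ idea can be illustrated with a small simulation. The sketch below assumes a normal population with a made-up mean of 40 minutes (in practice $\mu$ is unknown; we fix it here only so that we can check each interval). It repeatedly draws samples of 36 and counts how often the interval captures $\mu$.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def coverage(mu, sigma, n, confidence, trials, seed=0):
    """Fraction of simulated confidence intervals that contain the true mean mu."""
    rng = random.Random(seed)                  # fixed seed so the run is repeatable
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / sqrt(n)
    hits = 0
    for _ in range(trials):
        x_bar = fmean(rng.gauss(mu, sigma) for _ in range(n))
        if x_bar - margin < mu < x_bar + margin:
            hits += 1
    return hits / trials

# mu = 40 is a hypothetical value chosen only for this illustration
rate = coverage(mu=40, sigma=20, n=36, confidence=0.95, trials=4000)
print(f"fraction of intervals containing mu: {rate:.3f}")
```

The fraction should land close to 0.95. Any single interval, once computed, either contains $\mu$ or it does not.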

There is nothing special in the previous example about the 95%. We can use any percent we would like. Before we do that, some notation is in order.

Definition: $z_\alpha$ is the value of $z$ such that the area to the right of $z_\alpha$ is $\alpha$.

[Figure: standard normal curve with the area $\alpha$ to the right of $z_\alpha$]

Example 8.1.2.

Find $z_{.025}$.

Solution.

Method 1: To find this we first note that if the area to the right of $z_{.025}$ is .025 then the area to the left of $z_{.025}$ is 1 − .025, or .975. We can look this up in the z-table and we find $z_{.025} = 1.96$.

Method 2 (preferred): We can look at another table, the t-table. We will discuss why we can do this in a later section. To use the t-table, find the column heading of .025 and go to the bottom of the table to find 1.960. This is easier to get and gives us an additional digit.
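A third option, offered here only as a sketch, is to let software invert the normal distribution. The helper name `z_alpha` is our own; it uses the fact from Method 1 that the area to the left of $z_\alpha$ is $1 - \alpha$.

```python
from statistics import NormalDist

def z_alpha(alpha):
    """Value of z with area alpha to its right under the standard normal curve."""
    return NormalDist().inv_cdf(1 - alpha)

for alpha in (0.05, 0.025, 0.01, 0.005):
    print(f"z_{alpha} = {z_alpha(alpha):.3f}")
```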

With $z_\alpha$ and the discussion above, we can give the formula for a confidence interval for any confidence level.


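A $(1-\alpha)100\%$ confidence interval of $\mu$ is given by $\bar{X} \pm z_{\alpha/2}\,\sigma_{\bar{X}}$.

The $(1-\alpha)100\%$ margin of error is given by $\pm z_{\alpha/2}\,\sigma_{\bar{X}}$.

$(1-\alpha)100\%$ is called the Confidence Level.

These hold provided that $\bar{X}$ is normally distributed.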

Example 8.1.3.

A sample of 40 bottles of beer is collected and the average amount of beer turns out to be 16.152 ounces. Assume further that the standard deviation of all bottles is .24 ounces. Construct confidence intervals with confidence levels 90%, 95%, 98%, and 99%. Compare the intervals obtained.

Solution.

First we note that since the sample size is greater than 30 we know, from the Central Limit Theorem, that $\bar{X}$ is approximately normal, so we can use the formula above. We have $\bar{X} = 16.152$ and $\sigma = .24$. For a 90% confidence interval, $\alpha$ is .10 (= 100% − 90%). From our t-table we get $z_{.05} = 1.645$ so our confidence interval is

$$16.152 \pm 1.645 \times \frac{.24}{\sqrt{40}}$$

which gives us

$$16.152 \pm .062$$

Our confidence interval is 16.090 to 16.214 ounces. We can state that we are 90% confident that the average amount of beer dispensed into all bottles is between 16.090 and 16.214 ounces.

We can do the same thing to get the other confidence intervals. Our statements will be worded similarly.

Below is a summary of the results.

Confidence             Point      Margin     Confidence
Level       z_{α/2}    Estimate   of Error   Interval
90%         1.645      16.152     .062       16.090 to 16.214
95%         1.960      16.152     .074       16.078 to 16.226
98%         2.326      16.152     .088       16.064 to 16.240
99%         2.576      16.152     .098       16.054 to 16.250
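The summary table can be regenerated with a short loop. This is a sketch using the standard library; the variable names are our own, and the critical values are computed rather than read from the t-table.

```python
from math import sqrt
from statistics import NormalDist

x_bar, sigma, n = 16.152, 0.24, 40   # values from Example 8.1.3

rows = []
for level in (0.90, 0.95, 0.98, 0.99):
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # z_{alpha/2}
    margin = z * sigma / sqrt(n)                    # margin of error
    rows.append((level, z, margin, x_bar - margin, x_bar + margin))

for level, z, margin, low, high in rows:
    print(f"{level:.0%}  z = {z:.3f}  margin = {margin:.3f}  interval = {low:.3f} to {high:.3f}")
```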

Notice what is happening here. As the confidence level gets larger, the margin of error gets bigger as well. Since our confidence interval is of the form

point estimate ± margin of error

as the margin of error gets bigger, so too does the width of the confidence interval.¹

Example 8.1.4.

From your previous work with the weights of apples in the family orchard, you know that the weights of the apples are approximately normally distributed with a population standard deviation of 2.64 ounces. You take a sample of 9 apples and find the average weight to be 11.46 ounces.

1. Find a point estimate for the average weight of all apples in the orchard.
2. Find the 98% margin of error for the point estimate and interpret it.
3. Construct a 98% confidence interval for the average weight of all apples in the orchard.

Solution.

1. The point estimate of $\mu$ is simply $\bar{X}$, so the point estimate is 11.46 ounces.

2. Note that although $n$ is small, the population is normally distributed, which implies that the sampling distribution of $\bar{X}$ is normally distributed, so we can proceed with the margin of error and confidence interval.

To get the margin of error we first need to find $z_{.01} = 2.326$ from our t-table. Then the margin of error is

$$2.326 \times \frac{2.64}{\sqrt{9}} = 2.05 \text{ ounces}$$

The margin of error tells us that we are 98% certain our estimate of $\mu$ is within 2.05 ounces of the actual value, even though we don’t know what the value of $\mu$ is.

3. Once we have the point estimate and the margin of error, it is a relatively easy calculation to get the interval. We get

point estimate ± margin of error

11.46 ± 2.05

This yields

9.41 to 13.51

So we can state that we are 98% confident that the average weight of all apples in the orchard is between 9.41 and 13.51 ounces.

¹ The margin of error is similar to the radius of a circle, whereas the width of the confidence interval is like the diameter of a circle.

Note that this is a pretty wide interval. If we want to make it narrower we need to look at the formula for the margin of error.

$$\text{Margin of error} = \pm z_{\alpha/2}\,\sigma_{\bar{X}} = \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$

8.1.1 Determining the Sample Size for Estimation of µ

To make this margin of error small, our options are: make the denominator bigger or make the numerator smaller. To make the denominator bigger, we need to make $n$ larger. To make the numerator smaller we need to make $z_{\alpha/2}$ or $\sigma$ smaller. To make $z_{\alpha/2}$ smaller, we need to make the confidence level smaller. We cannot effect any change to $\sigma$. Thus our only options are to make the sample size larger or choose a smaller confidence level.

We may need to have a particular confidence level and margin of error so our focus is on the sample size.

Let $E$ represent the margin of error so we have

$$E = z_{\alpha/2}\,\sigma_{\bar{X}} = z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$

or, solving for $n$,

$$n = \left(\frac{z_{\alpha/2}\,\sigma}{E}\right)^2$$

Example 8.1.5.

In the last example, how large a sample must we choose if we want the 98% margin of error to be at most 1 ounce?

Solution.

For a 98% confidence level, we need $z_{.01} = 2.326$. We have $E = 1$ so we get

$$n = \left(\frac{2.326 \times 2.64}{1}\right)^2 = 37.7$$

Since $n$ is a count, it must be an integer. For sample size determination problems we will always round up. So we need 38 apples.
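The sample size formula, including the round-up rule, can be sketched as follows. The function name `sample_size_for_mean` is our own, and the critical value is computed with the standard library instead of being read from the table.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_mean(sigma, margin, confidence):
    """Smallest n for which the margin of error z_{alpha/2} * sigma / sqrt(n) is at most `margin`."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sigma / margin) ** 2)   # always round up, since n is a count

# Values from Example 8.1.5
n_needed = sample_size_for_mean(sigma=2.64, margin=1, confidence=0.98)
print(f"sample size needed: {n_needed}")
```

Rounding up rather than to the nearest integer guarantees that the margin of error does not exceed the target.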


8.1.2 Exercises

For confidence intervals, be sure to write a statement.

1. If $n = 35$, $\bar{X} = 26.4$, $\sigma = 7.32$.

Find

(a) The point estimate of µ

(b) The 95% margin of error for the point estimate

(c) The 95% confidence interval for µ

2. If $n = 56$, $\bar{X} = 135.6$, $\sigma = 26.87$, and confidence level = 95%.

Find

(a) The point estimate of µ

(b) The 95% margin of error for the point estimate

(c) The 95% confidence interval for µ

3. If $n = 135$, $\bar{X} = 64.5$, $\sigma = 26.45$, and confidence level = 90%.

Find

(a) The point estimate of $\mu$

(b) The 90% margin of error for the point estimate

(c) The 90% confidence interval for µ

4. If $n = 62$, $\bar{X} = 1.236$, $\sigma = .0397$, and confidence level = 98%.

Find

(a) The point estimate of µ

(b) The 98% margin of error for the point estimate

(c) The 98% confidence interval for µ

5. The speeds of cars on a highway vary but it is known that the standard deviation of the speeds of all cars is 6.54 mph. A recent random sample of 40 cars gave a mean speed of 65.4 mph.

(a) Find the point estimate of the average speed of all cars on the highway.

(b) Determine the 98% margin of error for this point estimate.

6. A store manager wants to know the average amount of time customers spend in the store. The standard deviation of the time customers spend in the store is 12.5 minutes. A random sample of 41 shoppers showed that they spent an average of 43.5 minutes in the store.

(a) Give the point estimate of the average time shoppers spend in the store.

(b) Determine the 90% margin of error for this point estimate.

7. A company that makes resistors for electronic circuits is investigating their resistance. A random sample of 40 resistors is taken and the mean resistance is 1056 Ω (Ohms). The standard deviation of all resistors is known to be 23.6 Ω.


(a) Give the point estimate of the average resistance in all resistors.

(b) Determine the 99% margin of error for this point estimate.

8. The light output, in lumens, for a new brand of light bulbs is being investigated. A random sample of 50 bulbs produced an average light output of 397 lumens. The standard deviation of all such bulbs is assumed to be 35 lumens.

(a) What is the point estimate of the mean output of light?

(b) Determine the 95% margin of error for this point estimate.

9. The amount of beer dispensed into 12-ounce bottles varies from bottle to bottle. Although the mean amount dispensed into bottles varies from time to time due to the vibrations in the machine, the standard deviation is always 0.235 ounces. A random sample of 35 bottles was recently selected and the mean amount of beer in the bottles was 12.16 ounces. Construct a 95% confidence interval for the mean amount of beer dispensed into all 12-ounce bottles.

10. The amount of water used by a particular dishwasher varies from cycle to cycle. It has been established that the standard deviation of the amount of water used is .264 gallons of water. The dishwasher is run for 34 cycles and the mean amount of water used per cycle is found to be 6.58 gallons. Construct a 95% confidence interval for the mean amount of water used in all cycles.

11. A new golf ball is being developed. As part of the testing, the ball is dropped from a standard height and allowed to bounce. The height of the bounce is measured. The average height of a random sample of 38 bounces is 38.9 inches. The standard deviation of all rebound heights is 1.35 inches. Construct a 90% confidence interval for the average rebound height of all balls.

12. At a large university, students at the end of the first semester are polled as to how much time they spent studying per week. For a sample of 85 students the average time spent studying was found to be 28.9 hours per week. The standard deviation of the times all students spent studying is 8.97 hours per week. Construct a 98% confidence interval for the average time all students spent studying per week.

13. A baseball pitcher’s fastball speeds vary from pitch to pitch. The speeds are known to follow a normal distribution with a standard deviation of 1.36 mph. A sample of 12 pitches gave an average speed of 94.6 mph. Construct a 90% confidence interval for the average speed of the pitcher’s fastball.

14. The almonds from a large orchard are advertised as being ‘large’. A consumer wants to know how large. The farmer assures the customer that the weights follow a normal distribution and that the standard deviation of the weight of all almonds is 0.035 grams. The customer takes a sample of 21 almonds and finds the average weight to be 1.261 grams. Construct a 90% confidence interval for the mean weight of all almonds from the orchard.

15. The biologist at a fish hatchery wants to determine the average length of all fish at the hatchery. The biologist takes a sample of 64 fish and finds the average length to be 28.9 cm. The standard deviation of all fish in the hatchery is known to be 3.64 cm. Construct a 98% confidence interval for the average length of all fish in the hatchery. If the interval is too wide, what options are there to make the interval narrower?


16. At a concert the promoter is trying to figure out the average age of attendees at the show. A random sample of 52 yielded an average of 17.4 years. The standard deviation of the ages of all concertgoers is 6.57 years. Find a 99% confidence interval for the average age of all concertgoers. If the interval is too wide, what options are there to make the interval narrower?

17. A pizza restaurant owner wants to know the average time it takes for an order to be filled. From past experience, the owner knows that the standard deviation of the times is 6.7 minutes. How large a sample must the owner take to ensure that the 90% margin of error for the average time is 2 minutes?

18. A dental hygienist estimates the standard deviation of the time it takes for a complete cleaning to be 5.67 minutes. What should the sample size be to ensure that the 98% margin of error for the estimate of the true average is 1 minute?

19. You wish to estimate the average amount of peanut oil in all 64-ounce bottles of your favorite brand. You know that the standard deviation of the amount of oil in all 64-ounce bottles is 0.67 ounces. How large a sample must you take to be sure that your estimate has a 95% chance of being within 0.1 ounces of the actual mean?

20. In the past the standard deviation of the amount of money spent on Valentine’s Day gifts by American teens was $12.54. If this is true for the current population, how large a sample must you take to estimate the mean amount spent by all American teens to within $1 at a 90% confidence level?


8.2 Confidence Interval for µ with Unknown σ

In the previous section we developed confidence intervals of $\mu$. Although there was nothing wrong with the process, it doesn’t have much applicability because of one major issue: we rarely know the population standard deviation, $\sigma$. This chapter is all about estimation, so if we don’t know the population standard deviation, let’s estimate it. The natural choice to estimate $\sigma$ is $s$. This means instead of having

$$z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$

we will use

$$\frac{\bar{X} - \mu}{s/\sqrt{n}}$$

We have a problem here. Although subtracting a constant from a normal random variable ($\bar{X}$) and dividing it by a constant doesn’t change the normality, we are dividing by $s$, which is a random variable. So the second fraction is not normally distributed. If $n$ is very large then we don’t have an issue with simply replacing $\sigma$ with $s$ and treating the second expression as a standard normal random variable. If $n$ is not very large, we have a problem. We need a new random variable.

8.2.1 The t–distribution

In the early 1900s William Gosset was working for Guinness Brewery and was looking at crop yields. He was forced to work with small samples so the methods of the day for estimation were not sufficient for his needs. He developed what is known as the t-distribution. He published his work under the pseudonym Student so this is often referred to as “Student’s t-distribution”. The t-distribution has one parameter, which is called the degrees of freedom, abbreviated df. Below are several graphs with different df’s.

[Figure: t-distribution curves for df = 1, df = 3, and df = 10]

The distribution looks very similar to the standard normal distribution, $z$. They are both bell-shaped with means of 0. The standard deviation of the standard normal distribution is 1. The standard deviation of the t-distribution is $\sigma_t = \sqrt{df/(df - 2)}$, which is defined for $df > 2$. Below are graphs of the t and z distributions. Note that the distribution of t is a little bit wider than the distribution of z.

[Figure: the t and z distributions plotted on the same axes]
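To get a feel for how much wider the t-distribution is, we can evaluate the standard deviation formula for a few values of df. This is a small sketch; the function name `t_std_dev` is our own, and the df values are chosen only for illustration.

```python
from math import sqrt

def t_std_dev(df):
    """Standard deviation of the t-distribution, sqrt(df / (df - 2)), for df > 2."""
    if df <= 2:
        raise ValueError("the standard deviation of t is only defined for df > 2")
    return sqrt(df / (df - 2))

for df in (3, 10, 30, 100, 1000):
    print(f"df = {df:4d}: sd = {t_std_dev(df):.4f}")
```

The values are always greater than 1 and shrink toward 1 as df grows, which matches the picture of t settling onto z.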

To find values of $t_\alpha$ we will use the t-table located in the appendix. To use the table we locate the df along the left column and the value of $\alpha$ along the top; where the column and row meet we get the desired value.


Example 8.2.1.

Find $t_{.05}$ with 6 degrees of freedom.

Solution.

Looking at the t-distribution table we see the following.

df .10 .05 .025 .01 .005 .0011 - - - - - -2 - - - - - -3 - - - - - -4 - - - - - -5 - - - - - -6 - 1.943 - - - -

So $t_{.05} = 1.943$. We can also write $t_{.05,6} = 1.943$, where we have included the degrees of freedom in the subscript.
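If a t-table is not at hand, the same value can be approximated numerically. The sketch below is not the method used in the text: it integrates the t density with Simpson's rule and then searches for the cutoff by bisection. All function names (`t_pdf`, `right_tail_area`, `t_critical`) are illustrative choices.

```python
import math

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def right_tail_area(t, df, steps=2000):
    """Area to the right of t (for t >= 0), using Simpson's rule on [0, t]."""
    if t == 0:
        return 0.5
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    # Half of the total area lies to the right of 0; subtract the part on [0, t].
    return 0.5 - total * h / 3

def t_critical(alpha, df):
    """Approximate t_alpha: the value with right-tail area alpha."""
    lo, hi = 0.0, 1000.0
    for _ in range(60):  # bisection: halve the bracket on each pass
        mid = (lo + hi) / 2
        if right_tail_area(mid, df) > alpha:
            lo = mid  # too much area to the right, so the cutoff is larger
        else:
            hi = mid
    return (lo + hi) / 2

# Compare with the table entry 1.943 for alpha = .05 and df = 6.
print(round(t_critical(0.05, 6), 3))
```

For classroom work the table (or a calculator) remains the simpler route; the sketch is only meant to show where the table entries come from.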

We can put this together with the previous section to construct a confidence interval. We obtain the following.

8.2.2 Confidence Intervals of µ using the t-Distribution

A (1 − α)100% confidence interval of µ is given by

$\bar{X} \pm t_{\alpha/2}\, s_{\bar{X}}$, where $s_{\bar{X}} = s/\sqrt{n}$ and df = n − 1.

The margin of error is given by $\pm t_{\alpha/2}\, s_{\bar{X}}$,

provided that X̄ is normally distributed.

The main difference between this confidence interval and the one in the previous section is whether or not we know σ. If we are looking at confidence intervals of µ, it is straightforward to determine whether we use z or t. If we know σ, use z. If we don’t know σ, use t. In most ‘real world situations’ the population standard deviation is unknown. The sample standard deviation is always going to be known.

Example 8.2.2.

After reading about the t-distribution, a statistics student becomes thirsty (for both beer and knowledge) and decides to determine the average amount of beer in all 12-ounce bottles of beer produced by the Guinness brewery. The student takes a sample of 32 12-ounce bottles and finds the average amount of beer to be 12.131 ounces with a standard deviation of .146 ounces. Construct a 95% confidence interval for the average amount of beer in all 12-ounce bottles.


Solution.

We are looking for a confidence interval for the ‘. . . average . . . in all . . . ’ or µ. Note that the ‘12’ in 12-ounce plays no role in the calculations in this problem. Furthermore, the mean and the standard deviation both come from the sample. This makes them X̄ and s, respectively, not µ and σ. We have:

X̄ = 12.131, s = .146, n = 32, and a confidence level of 95%.

Since n > 30, we know that X̄ will be approximately normally distributed. This means we can use the above formula.

We have 31 degrees of freedom (= 32 − 1), so $t_{.025,31} = 2.040$.

So our confidence interval is given by

$12.131 \pm 2.040 \times \frac{.146}{\sqrt{32}}$

or 12.078 to 12.184.

We are 95% confident that the average amount of beer dispensed in all 12-ounce bottles is between 12.078 and 12.184 ounces.
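The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch that assumes the critical value has already been read from the t-table; the function name `t_interval` is illustrative.

```python
import math

def t_interval(x_bar, s, n, t_crit):
    """Confidence interval x_bar ± t_crit * s / sqrt(n) from summary statistics."""
    margin = t_crit * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin

# Example 8.2.2: x_bar = 12.131, s = .146, n = 32.
# t_crit = 2.040 is the table value for alpha/2 = .025 with df = 31.
lower, upper = t_interval(12.131, 0.146, 32, 2.040)
print(round(lower, 3), round(upper, 3))
```

The same function applies to any problem in this section once the sample mean, sample standard deviation, sample size, and table value of t are known.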

Example 8.2.3.

The diameter of a piston for an engine varies from piston to piston. It is known that the diameters are approximately normally distributed. A sample of 8 pistons is taken and the mean diameter of the 8 pistons is 3.956 inches. The standard deviation of these 8 pistons is .049 inches. Construct a 99% confidence interval for the mean diameter of all pistons.

Solution.

We can use the formula for the confidence interval because the piston diameters are normally distributed, and therefore X̄ will be normally distributed.

We have:

X̄ = 3.956 inches, s = .049 inches, n = 8, and a confidence level of 99%.

With df = 8 − 1 = 7, we get $t_{.005,7} = 3.499$.

So our confidence interval is given by

$3.956 \pm 3.499 \times \frac{.049}{\sqrt{8}}$

This gives us an interval of 3.895 to 4.017.


We are 99% confident that the mean diameter of all pistons is between 3.895 and 4.017 inches.

Example 8.2.4.

A sample of 8 soccer balls coming off an assembly line was selected and the circumference of each was measured. The circumferences, in cm, are 68.7, 69.3, 69.8, 68.4, 69.1, 68.8, 68.7, 68.9. It is known that the circumferences are normally distributed.

1. Give the point estimate for the average circumference of all soccer balls coming off the assembly line.

2. Find the 98% margin of error for this point estimate.

3. Construct a 98% confidence interval for the average circumference of all soccer balls coming off the assembly line.

Solution.

For this problem we are not given any of the statistics we need. Instead we are given the actual data values. Since we need to calculate the standard deviation in order to calculate the confidence interval, we will need to use a calculator. Since we are going to use the calculator, we will use it to calculate the interval directly. To do so, on our TI-83 or 84 we use the following steps.

Enter the data in a list

Select STAT > TESTS > TInterval

For Inpt, select Data.

For List, put the list where your data is. (2nd>List>NAMES)

For Freq, we want 1.

For C-Level, we want 98 (or .98, the calculator allows either)

Highlight Calculate and hit ENTER.

Our calculator gives us the interval as well as X̄ and s.

1. Our point estimate of µ is X̄ = 68.96 cm.

2. The margin of error is $\pm t_{.01}\, s_{\bar{X}} = \pm 2.998 \times \frac{.434}{\sqrt{8}} = \pm .46$ cm.

3. The confidence interval, from the calculator, is 68.502 to 69.423. So, we are 98% confident that the average circumference of all soccer balls is between 68.50 and 69.42 cm.
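For readers without a TI calculator, the same steps can be sketched in Python using only the standard library. The critical value 2.998 is taken from the t-table (α/2 = .01, df = 7), so the endpoints can differ from the calculator's in the last decimal place; the variable names are illustrative.

```python
import math
import statistics

# Circumferences (cm) of the 8 sampled soccer balls from Example 8.2.4.
data = [68.7, 69.3, 69.8, 68.4, 69.1, 68.8, 68.7, 68.9]

n = len(data)
x_bar = statistics.mean(data)   # point estimate of mu
s = statistics.stdev(data)      # sample standard deviation (divides by n - 1)
t_crit = 2.998                  # table value for alpha/2 = .01 with df = 7

margin = t_crit * s / math.sqrt(n)
lower, upper = x_bar - margin, x_bar + margin

print(round(x_bar, 2), round(s, 3), round(margin, 2))
print(round(lower, 2), round(upper, 2))
```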


8.2.3 Exercises

1. Let X̄ = 23.7, s = 6.42, n = 32. Find

(a) The point estimate of µ

(b) The 95% margin of error for the point estimate

(c) The 95% confidence interval for the mean

2. Let X̄ = 13.6, s = 2.65, n = 12, X ∼ N. Find

(a) The point estimate of µ

(b) The 95% margin of error for the point estimate

(c) The 95% confidence interval for the mean

3. Let X̄ = 125.6, s = 26.8, n = 10, X ∼ N. Find

(a) The point estimate of µ

(b) The 90% margin of error for the point estimate

(c) The 90% confidence interval for the mean

4. Let X̄ = 13.6, s = 2.65, n = 12, X ∼ N. Find

(a) The point estimate of µ

(b) The 98% margin of error for the point estimate

(c) The 98% confidence interval for the mean

5. Having to download a lot of data for work, a worker is testing the time it takes to download a 1 GB file. The file is downloaded at several random times and it is found that for 39 downloads, it took an average of 7.26 minutes with a standard deviation of 1.36 minutes.

(a) Give the point estimate of the average time it takes to download a 1 GB file.

(b) Determine the 90% margin of error for this point estimate.

(c) Construct a 90% confidence interval for the average time it takes to download a 1 GB file.

6. A random sample of 32 dog owners showed that they spent an average of $486 on veterinarian care in the last year, with a standard deviation of $135.

(a) Give the point estimate of the mean amount of money spent last year on veterinarian care by all dog owners.

(b) Determine the 90% margin of error for this point estimate.

(c) Construct a 90% confidence interval for the mean amount of money spent last year on veterinarian care by all dog owners.


7. A teacher wants to know the average time kindergarten students spend playing electronic games per week. The teacher takes a random sample of 15 students and finds the average time spent playing electronic games was 15.6 hours per week with a standard deviation of 6.24 hours. Assuming the times students spend playing electronic games per week are normally distributed, construct a 95% confidence interval for the mean time spent playing electronic games by all kindergarteners.

8. A gas station owner needs to determine the average amount of gasoline dispensed when 1 gallon is ordered. The owner takes 10 buckets, sets the pump to dispense one gallon, and then measures the amount that has come out. The average amount of gas in the ten buckets turned out to be 1.023 gallons with a standard deviation of 0.106 gallons. Construct a 90% confidence interval for the mean amount of gas dispensed when one gallon is requested. Assume the amounts dispensed are approximately normally distributed.

9. A nursery sells seeds for home consumers. A sample of 7 envelopes of seeds was weighed and the weights were, in mg, 135, 165, 146, 133, 155, 149, and 141. It is reasonable to assume from past experience that the distribution of weights is approximately normal. Construct a 95% confidence interval for the mean weight of all envelopes of seeds packaged.

10. An agriculture student is looking into wheat yields in a large farming community. The student looks at 6 one-acre plots and determines the amount harvested for each acre. The amounts were, in bushels, 50.2, 49.8, 56.8, 56.3, 55.7, and 60.1. From past data, it is clear to the student that crop yields are approximately normally distributed. Construct a 90% confidence interval for the mean yield per acre for all farms in the community.

11. A drill sergeant has set up a new obstacle course for the incoming recruits. The sergeant takes a sample of 8 recruits and sends them on the obstacle course. The sergeant has seen times from several different obstacle courses and expects that the times will be approximately normally distributed. The times, in seconds, are 46.8, 53.8, 39.7, 44.1, 49.2, 39.8, 40.8, and 46.3. Construct a 95% confidence interval for the mean time it takes all recruits to run the course.

12. In a plant that assembles small motors, the workers put together the prefabricated parts. The management is planning on putting the assembly process on an assembly line, so they want to know the average time it takes to assemble a motor. A random sample of 6 workers is observed putting the motors together. From past experience with assembly lines, the management expects the times to be approximately normal. The six workers took 135, 145, 138, 148, 133, and 137 seconds to assemble the motor. Construct a 90% confidence interval for the average time it takes to assemble a motor by all workers.

13. While taking a particular medication, patients are subject to drowsiness. A random sample of 35 adults using the medication slept an average of 10.35 hours per night with a standard deviation of 2.64 hours per night. Construct a 98% confidence interval for the mean time all patients spend sleeping at night while using the medication.

14. After discussing the amount of TV Americans watch, a sociologist takes a random sample of 36 American adults and finds the average time spent watching TV per week was 20.4 hours with a standard deviation of 5.98 hours. Construct a 90% confidence interval for the average amount of time spent by all American adults watching TV.


15. At a large airport, a random sample of 37 passengers showed that they arrived an average of 38.4 minutes before their flights were scheduled to depart, with a standard deviation of 12.4 minutes. Construct a 98% confidence interval for the average time of arrival before a flight by all passengers.

16. As part of a science experiment, students are trying to grow sugar crystals at home. The students fill a jar with a highly saturated solution of sugar and water, lower a string into the water, cover, and wait. After several days, the string with all the crystals is removed, dried, and taken to school to weigh. The average weight of the crystals turned out to be 64.5 grams with a standard deviation of 8.42 grams. Construct a 95% confidence interval for the average weight of the crystals for all students that perform the experiment.

17. Each day at a deli, a tip jar is on the counter for customers to leave a tip. A random sample of 12 lunch hours is selected and the average amount of money in the tip jar was $135.30 with a standard deviation of $26.51. Assuming the amounts of money left in the tip jar each lunch hour are approximately normally distributed, construct a 90% confidence interval for the average amount of tips for all lunch hours.

18. A textile manufacturer has a quick-dry fabric that it is working on. A random sample of 12 shorts made with the fabric were soaked in water and then put on mannequins. The average time it took for the fabric to dry was 12.5 minutes with a standard deviation of 1.36 minutes. Construct a 98% confidence interval for the mean time it takes for shorts to dry. Assume the drying times of all shorts are normally distributed.

19. For a game show, contestants are put on the ‘Hot Seat’ and need to name 10 out of 15 celebrities by looking at their picture as quickly as they can. If the contestant doesn’t know one, they can pass to the next celeb. If the contestant can name 10 celebrities in 30 seconds, they win a prize. Fifteen contestants were put on the ‘Hot Seat’ and it took an average of 35.6 seconds with a standard deviation of 6.51 seconds.

(a) Construct a 95% confidence interval for the mean time it takes to name 10 celebrities.

(b) Notice that 30 seconds is not in the interval. Does that mean that the game is unfair?

(c) What can be done to make the interval narrower?

20. At an amusement park, there are signs that tell park-goers how long until they get on the ride from points on the way. A worker can see the sign that reads ‘30 minutes from here’ and decides to time some customers. In 32 randomly selected customers, the mean wait time was 31.6 minutes with a standard deviation of 3.56 minutes.

(a) Construct a 95% confidence interval for the mean time it takes all customers from the ‘30 minutes from here’ sign.

(b) Notice that 30 minutes is not in the interval. Does that mean that the sign is wrong?

(c) What can be done to make the interval narrower?


8.3 Confidence Intervals for p

In the previous sections we found confidence intervals for µ. In this section we continue our discussion of confidence intervals to include confidence intervals of the population proportion: p.

A pollster takes a sample to determine what percent of registered voters are in favor of a proposition on the current ballot. The pollster reports that 53% are in favor of the proposition. Not all voters were asked. The pollster is using a sample percentage to estimate the population percentage. Of course, when we are talking about the percentage here we are really talking about the proportion. We then just multiply by 100% to get a percentage.

Since we want to estimate p, which is really just the mean of a random variable,² we can use the section on confidence intervals of µ to construct confidence intervals. All we need to do is make some modifications. Recall that $\hat{p} \sim N\left(p, \sqrt{\frac{pq}{n}}\right)$ whenever p̂ is approximately normal. This happens when np > 5 and nq > 5.

We would expect the confidence interval for p to be $\hat{p} \pm z_{\alpha/2}\sqrt{\frac{pq}{n}}$.

We have a major problem with this. The confidence interval as stated requires us to know what p is. What we need to do is modify it by replacing p and q under the square root with the sample values p̂ and q̂. So we obtain the following:

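A (1 − α)100% confidence interval for p is given by

$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$

We can write this as $\hat{p} \pm z_{\alpha/2}\, s_{\hat{p}}$, where $s_{\hat{p}} = \sqrt{\frac{\hat{p}\hat{q}}{n}}$.

The point estimate of p is $\hat{p} = \frac{X}{n}$.

The Margin of Error is $\pm z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$,

provided that np > 5 and nq > 5 (q = 1 − p, and likewise q̂ = 1 − p̂).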

Notice in the requirements for the confidence interval we require np > 5 and nq > 5. We can’t check these. To do so would require us to know what p is. Since we are estimating p, the best we can do is check np̂ and nq̂. For the problems we will solve, np̂ and nq̂ will be a lot more than 5.

² Recall that $\mu_{\hat{p}} = p$.

Example 8.3.1.


A pollster takes a sample of 1000 voters and determines if the voters are in favor of a proposition on the ballot. 53% say they are in favor of the proposition. Determine

1. A point estimate of the percentage of all voters in favor of the proposition.

2. The 95% margin of error for this point estimate. Interpret the margin of error.

3. Construct a 95% confidence interval for the proportion of all voters in favor of the proposition.

Solution.

First note that we are given n = 1000, p̂ = .53 (53%), and a confidence level of 95%. Since q̂ = 1 − p̂, we get q̂ = .47. Also, although we are asked about the percentage, the formulas use the proportion, so we will do all of the work using the proportion and then rewrite as a percentage.

1. The point estimate of p is equal to p̂, so our point estimate is 53%.

2. Since np̂ = 530 and nq̂ = 470, we can reasonably conclude that p̂ ∼ N, which means we can continue. So we get margin of error $= \pm z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}} = \pm 1.960 \times \sqrt{\frac{.53 \times .47}{1000}} = \pm .031$. This margin of error tells us that the percentage of all voters in favor of the proposition, which we don’t know, is within 3.1 percentage points of our point estimate, 53%.

3. The confidence interval is given by point estimate ± margin of error, so we have .53 ± .031 or .499 to .561. We are 95% confident that the percentage of all voters who are in favor of the proposition is between 49.9% and 56.1%. A more common way to say this is: ‘We are 95% confident that 49.9% to 56.1% of all voters are in favor of the proposition.’

In this example, the pollster took a sample of 1000 voters. This is about the size of most common polls. It gets a reasonably small margin of error but is not too costly. The larger the sample size, the more we expect the poll to cost.

Example 8.3.2.

A large company is looking at the wait times on hold for its callers. A sample of 385 phone customers indicated that 162 of them waited on hold for more than 10 minutes while being repeatedly told ‘. . . your call is important to us . . . ’. Find a 98% confidence interval for the proportion of all callers who are on hold for more than 10 minutes.

Solution.

We are looking for a confidence interval for ‘. . . the proportion . . . ’. Unlike the last problem, in this problem we are not given the proportion (or percentage). We are given n = 385, X = 162, and a confidence level of 98%. The 10 minutes is not used in the calculation. It is the criterion by which we determine if we have a success.

$\hat{p} = \frac{162}{385} = .421$, so $\hat{q} = .579$

and we get


$.421 \pm 2.326 \sqrt{\frac{.421 \times .579}{385}}$

or .362 to .479.

We are 98% confident that the proportion of people who are on hold for more than 10 minutes is between .362 and .479.
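Both examples in this section can be reproduced with a short Python sketch. It uses `statistics.NormalDist` from the standard library to obtain $z_{\alpha/2}$ instead of a table, so z is 1.95996... rather than the rounded 1.960; the function name `proportion_interval` is illustrative.

```python
import math
from statistics import NormalDist

def proportion_interval(p_hat, n, confidence):
    """Large-sample confidence interval for a population proportion p."""
    q_hat = 1 - p_hat
    # The requirement np > 5 and nq > 5, checked with the sample estimates.
    if n * p_hat <= 5 or n * q_hat <= 5:
        raise ValueError("need n*p_hat > 5 and n*q_hat > 5")
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}
    margin = z * math.sqrt(p_hat * q_hat / n)
    return p_hat - margin, p_hat + margin

# Example 8.3.1: p_hat = .53, n = 1000, 95% confidence.
print([round(v, 3) for v in proportion_interval(0.53, 1000, 0.95)])

# Example 8.3.2: X = 162 successes out of n = 385, 98% confidence.
print([round(v, 3) for v in proportion_interval(162 / 385, 385, 0.98)])
```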

8.3.1 Determining the Sample Size for Estimates of p

Note that in this last problem, the interval is pretty wide. Just like with confidence intervals of µ, we can make the interval narrower by either increasing n or decreasing the confidence level. The best thing to do is to determine the sample size before taking the sample. Then we will be guaranteed the width will be small enough at the confidence level we desire.

The margin of error, E, is given by

$E = z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$

Solving for n we get

$n = \frac{z_{\alpha/2}^2\,\hat{p}\hat{q}}{E^2}$
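The algebra behind "solving for n" can be written out step by step. This is only a restatement of the formula for the margin of error, with p̂ and q̂ = 1 − p̂ as the sample estimates: square both sides, then isolate n.

```latex
\begin{align*}
E   &= z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}} \\
E^2 &= z_{\alpha/2}^2 \cdot \frac{\hat{p}\hat{q}}{n} \\
n   &= \frac{z_{\alpha/2}^2\,\hat{p}\hat{q}}{E^2}
\end{align*}
```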

We have a bit of a problem here. To determine the sample size so we can calculate p̂, we need p̂. We have two options here, which we shall look at in the next two examples.

Example 8.3.3.

In the last example, use the information given to determine how large a sample needs to be taken so that our point estimate of the proportion is within .04 of the actual value at a 98% confidence level.

Solution.

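From before we have p̂ = .421, q̂ = .579, and $z_{.01} = 2.326$. Our sample size is given by

$n = \frac{2.326^2 \times .421 \times .579}{.04^2} = 824.13$

(The value 824.13 uses the unrounded p̂ = 162/385; the rounded values shown give 824.25, which leads to the same answer.)

In sample size determination, we round up, so we have n = 825.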
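The sample size calculation, including the round-up rule, can be sketched in Python as follows. The function name `sample_size` is illustrative, and z = 2.326 is the table value used in the text.

```python
import math

def sample_size(p_hat, margin, z):
    """Sample size n = z^2 * p_hat * q_hat / E^2, always rounded up."""
    q_hat = 1 - p_hat
    return math.ceil(z ** 2 * p_hat * q_hat / margin ** 2)

# Example 8.3.3: p_hat = 162/385, E = .04, and z = 2.326 for 98% confidence.
print(sample_size(162 / 385, 0.04, 2.326))
```

Rounding up with `math.ceil` rather than rounding to the nearest integer ensures the margin of error is no larger than the value requested.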

Example 8.3.4.

The company in the last example has totally changed their phone protocol. Find the sample size that will give a margin of error of .04 with a 98% confidence level.


Solution.

Since the phone protocol has ‘totally changed’, we have no clue what to use for p̂. Let’s look at a table of values with different values of p̂ and see what the sample size is.

The first entry is given by $n = \frac{2.326^2 \times .1 \times .9}{.04^2} = 304.33$, which we round up to 305. The others are similarly calculated.

p̂     q̂     sample size
.1    .9    305
.2    .8    542
.3    .7    711
.4    .6    812
.5    .5    846

Why didn’t we take p̂ = .6, .7, . . .?

Note that the largest of these occurs when p̂ = q̂ = .5. The required sample size is 846 customers.

When we have no clue what p̂ might be, or we want the most conservative value for n, use p̂ = q̂ = .5 in the sample size determination formula.
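The table can be extended past .5 with a short loop to see why p̂ = q̂ = .5 is the conservative choice. This is an illustrative sketch using the same z = 2.326 and E = .04 as the example; the variable names are ours.

```python
import math

z = 2.326        # table value of z for 98% confidence
margin = 0.04    # desired margin of error E

# Required n for p_hat = .1, .2, ..., .9 (built from tenths to keep the keys tidy).
sizes = {}
for tenths in range(1, 10):
    p_hat = tenths / 10
    sizes[p_hat] = math.ceil(z ** 2 * p_hat * (1 - p_hat) / margin ** 2)

for p_hat, n in sizes.items():
    print(p_hat, n)
```

The sizes are symmetric about .5 (p̂ = .6 requires the same n as p̂ = .4, and so on) because p̂ and q̂ simply trade places, and the largest requirement occurs at p̂ = .5.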

8.3.2 Exercises

1. Let X = 456, n = 1250

(a) Find the point estimate for p.

(b) Find the 95% margin of error for the point estimate of p.

(c) Construct the 95% confidence interval for p.

(d) What can be done if we need a narrower interval?

(e) How large a sample should we take if we want the 95% margin of error to be 4%? Use your value of p̂ from above.

(f) How large a sample should we take if we want the 95% margin of error to be 4%? Do not use your value of p̂ from above.

2. Let X = 269, n = 895

(a) Find the point estimate for p.

(b) Find the 95% margin of error for the point estimate of p.

(c) Construct the 95% confidence interval for p.

(d) What can be done if we need a narrower interval?


(e) How large a sample should we take if we want the 95% margin of error to be 3%? Use your value of p̂ from above.

(f) How large a sample should we take if we want the 95% margin of error to be 3%? Do not use your value of p̂ from above.

3. Let p̂ = .74, n = 850

(a) Find the point estimate for p.

(b) Find the 90% margin of error for the point estimate of p.

(c) Construct the 90% confidence interval for p.

(d) What can be done if we need a narrower interval?

(e) How large a sample should we take if we want the 90% margin of error to be 2.5%? Use your value of p̂ from above.

(f) How large a sample should we take if we want the 90% margin of error to be 2.5%? Do not use your value of p̂ from above.

4. Let p̂ = .21, n = 1100

(a) Find the point estimate for p.

(b) Find the 98% margin of error for the point estimate of p.

(c) Construct the 98% confidence interval for p.

(d) What can be done if we need a narrower interval?

(e) How large a sample should we take if we want the 98% margin of error to be 5%? Use your value of p̂ from above.

(f) How large a sample should we take if we want the 98% margin of error to be 5%? Do not use your value of p̂ from above.

5. The percent turnout for elections varies from election to election. A random poll of 1050 eligible voters showed that 53.4% of eligible voters voted. Construct a 90% confidence interval for the percentage of eligible voters that voted in the last election.

6. At a large company, the management wants to know how employees feel about the current CEO. They take a sample of 350 employees and find that 68.5% are unsatisfied with the CEO’s performance. Construct a 95% confidence interval for the percentage of employees who are unsatisfied with the CEO’s performance.

7. One of the many side-effects of a drug that is being researched is nausea. Of 846 people using the drug, 168 reported having nausea. Find a 99% confidence interval for the percentage of all users of the drug that have nausea.

8. A call-in customer service center has a long wait. As a result, several people hang up before being able to speak to a representative. From a random sample of 561 calls, there were 53 hang-ups. Construct a 95% confidence interval for the percent of all callers that hang up before being able to speak to a representative.


9. According to pocketsense.com, 21% of paper tax returns have errors in them. Assume this is based on a sample of 890 paper tax returns. Construct a 90% confidence interval for the percentage of all paper tax returns that have errors in them.

10. You have just read that 20% of businesses fail within the first year. Further, it is stated that this is based on a sample of 560 businesses. Find a 98% confidence interval for the percent of all businesses that fail within the first year.

11. A college placement center is looking at how many of their college graduates get jobs in their field of study within 6 months of graduation. A sample of 653 graduates yielded 425 graduates that got jobs within 6 months of graduating. Construct a 90% confidence interval for the percentage of all college graduates at the college that get jobs within 6 months of graduating.

12. The CEO of a website that sells clothes is investigating the percent of returns. From 684 items purchased, 53 were returned. Find a 90% confidence interval for the percent of all items purchased that are returned.

13. A biologist for the Department of Fish and Game is inspecting caught salmon to determine if they are wild or hatchery raised. Of 564 salmon inspected, 124 were wild. Construct a 90% confidence interval for the percent of all caught salmon that are wild.

14. A poll of 1239 smokers showed that 894 always properly dispose of their cigarette butts. Construct a 98% confidence interval for the percent of all smokers that always properly dispose of their cigarette butts.

15. A police officer reported that ‘Of 156 observed vehicles, 82.7% were driving more than 10 mph above the speed limit’. Construct a 95% confidence interval for the percent of all vehicles that drive more than 10 mph above the speed limit.

16. A wedding planner has been suggesting to all clients that the couple create a wedding website. From a random sample of 438 weddings, 45.9% created their own wedding website. Construct a 90% confidence interval for the percent of all soon-to-be-married couples that create their own wedding website.

17. A Gallup poll reported that 73% of Americans say artificial intelligence will eliminate more jobs than it creates. Assume this is based on a sample of 1200 Americans.

(a) Find the point estimate for the percent of all Americans who feel that artificial intelligence will eliminate more jobs than it creates.

(b) Find the 90% margin of error for the point estimate.

18. A Gallup poll reported that 44% of Americans view abuse of prescription painkillers as a ‘crisis’ or ‘very serious problem’ in their local area. Assume this is based on a sample of 950 Americans.

(a) Find the point estimate for the percent of all Americans who view abuse of prescription painkillers as a ‘crisis’ or ‘very serious problem’ in their local area.

(b) Find the 98% margin of error for the point estimate.


19. A fitness guru has been touring a state to encourage people to exercise. One year after the guru has passed through the state, a sample of 806 adults was taken and 135 of them exercise regularly.

(a) Construct a 98% confidence interval for the percent of all residents in the state who exercise regularly.

(b) The governor of the state is also a fitness fanatic. The governor wants to know how large a sample to take next year to see the percent of all people who exercise regularly. The governor wants the 98% margin of error to be 3.5%. How large a sample should you tell the governor to take next year? Use the previous year’s percent to answer the question.

(c) In the year since the guru was in the state, the governor of another state has been pushing residents to exercise regularly. How large a sample should that governor take now? Since the governor is in a different state, we don’t expect the percent to be what it was before.

20. At a large university, the administration is considering requiring all incoming students to have all vaccines up to date. A recent sample of 564 current first year students had 204 with at least one vaccine not up to date.

(a) Construct a 95% confidence interval for the percent of all current first year students that have at least one vaccine not up to date.

(b) The administration wants the 95% margin of error for the percentage of next year’s incoming students who have at least one vaccination not up to date to be 2.5%. How large a sample should be selected? Use this year’s percentage to determine the sample size.

(c) Assume there has been a story that has people rethinking their vaccination choices. This means you have no idea what the percentage might be. How large a sample size is required if we want the 95% margin of error to be 2.5%?

21. A sales representative for a large company often flies for job related business. The company always uses the same airline. The representative notes that of 238 flights taken in the last year, 68 were at least 10 minutes late in taking off. Find a 90% confidence interval for the percentage of all flights that are at least 10 minutes late in taking off. Comment on problems with the information given in the problem and your conclusion.

22. At a bottling plant, the management wants estimates of the percentage of bottles that have more than 12 ounces of beer every day. An employee is asked to take a sample and determine how many have more than 12 ounces of beer. The employee instructs the forklift operator who is moving bottles as they come off the assembly line to drop the next palletful of beer off at their work station. The employee then determines that of 480 bottles, 401 have more than 12 ounces. Construct a 98% confidence interval for the percentage of all bottles that have more than 12 ounces of beer. Comment on issues you might have with the conclusions in the problem.


Chapter 9

Hypothesis Tests for the Mean and Proportion


9.1 Hypothesis Tests of µ, σ Known

You have just purchased a package of cheese labeled as 16-ounces. You pick it up. It feels a little light. You decide to test whether or not the labeling is correct. The idea is very straightforward: take a sample, calculate the sample mean, X̄, and if that value is a lot less than 16 ounces you will conclude that the labeling is incorrect.¹ Although this makes sense, what is ‘a lot’? Before we consider this problem, let us assume that the label is referring to the average amount of cheese in all packages, µ.²

Before we attack this problem, let’s get a few terms down. First of all, we are trying to conclude that the average is less than 16 ounces. If this is not the case, then the average is greater than or equal to 16 ounces. We are assuming that the labeling is correct, so we will assume that the average is 16 ounces (or more). We need evidence to prove to ourselves, and everybody interested, that the average is less than 16 ounces.

The Null Hypothesis is a statement about a population that is assumed to be true until we have sufficient evidence to the contrary.

It is denoted H0, read ‘H naught’.

The Alternative Hypothesis is a statement about a population which is true if and only if the null hypothesis is false.

It is denoted H1, read ‘H one’.

We can think of H0 and H1 as being complements.

Ideally, we would like to determine which hypothesis is true. We are making a decision. As soon as we make a decision we are either correct or we have made an incorrect decision. We want to distinguish the different errors that might be made: concluding H1 is true when it is false, and not concluding H1 is true when it is true. These are type I and type II errors, respectively.

A Type I error is the error made when we conclude the null hypothesis is false but it is actually true.

A Type II error is the error made when we don’t conclude the null hypothesis is false but it is actually false.

We summarize in the following table.

                              Actual Situation
Decision                H0 True             H0 False
Do not reject H0        Correct Decision    Type II Error
Reject H0               Type I Error        Correct Decision

¹ We aren’t going to complain if they give us more than 16 ounces.
² In reality, it is not the average. Each customer expects to get at least 16 ounces, but this is a good place to start.


Note that we never conclude that H0 is true. We assume it is true. Our goal is to conclude that H0 is false (same as concluding H1 is true). This is just like a criminal trial. The defendant is assumed to be innocent (H0) until there is enough evidence to convict them of the crime (reject H0). There is never a trial to prove someone is innocent. (Other than in the movies, TV, etc.)

We will make incorrect decisions. That is the nature of decision making. It does not mean that we made a mistake; the data lead us to our decision. In the criminal trial mentioned above, we strongly dislike the idea of convicting someone who is innocent of a crime (a type I error), so we want this to not happen very often. This gives us what is called the level of significance for our hypothesis test.

α = P (Type I error)

β = P (Type II error)

α is called the level of significance, or significance level.

1 − β is called the power of the test.

The power of the test is the probability of not making a type II error. Whereas the level of significance, α, is chosen, β usually is not chosen.³
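To make α, β, and the power concrete, here is a minimal Python sketch for a left-tailed z test. Every number in it is an assumption chosen for illustration: the null mean 16, σ = .13, and n = 40 anticipate the cheese example below, while the "true" mean of 15.95 is purely hypothetical, since β can only be computed once a specific alternative value of µ is supposed.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()  # standard normal distribution

mu0 = 16.0        # mean stated by the null hypothesis (illustrative)
sigma = 0.13      # population standard deviation, assumed known
n = 40            # sample size
alpha = 0.05      # chosen level of significance
mu_true = 15.95   # hypothetical true mean, needed only to compute beta

std_error = sigma / sqrt(n)

# Left-tailed test: H0 is rejected when the sample mean falls below this cutoff.
cutoff = mu0 + std_normal.inv_cdf(alpha) * std_error

# Power = probability of rejecting H0 when the true mean is mu_true.
power = NormalDist(mu_true, std_error).cdf(cutoff)
beta = 1 - power  # probability of a type II error for this mu_true

print(f"cutoff = {cutoff:.4f}, power = {power:.3f}, beta = {beta:.3f}")
```

Changing `mu_true` changes β and the power but leaves α alone, which is one way to see why α is chosen while β usually is not.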

Example 9.1.1.

Let’s get back to the original problem. You take a sample of 40 packages of cheese and find the average weight to be 15.92 ounces. You know that the population standard deviation is .13 ounces. Using a level of significance of 5%, can you conclude that the average amount of cheese is less than 16 ounces?

Solution.

First off, notice what we are trying to show: the average amount of cheese is less than 16 ounces (µ < 16). This is the alternative hypothesis, H1. The null hypothesis, H0, is the ‘opposite’ of this, or µ ≥ 16. It turns out that we will need a value of µ when we do our test so we will use µ = 16. So we have . . .

H0 : µ = 16
H1 : µ < 16

Since we are doing a test on µ we are looking at X̄ so we need to know what its distribution is. Since the sample size is greater than 30 we know that the distribution of X̄ is approximately normal.

Let’s look at the distribution of X̄. Note it is normal (n > 30) and µ = 16. (We assume the null hypothesis is true at the start of a hypothesis test.)

³ In more advanced statistics, α is chosen and n is determined based on an acceptable power of the test.


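[Figure: normal distribution of X̄ centered at 16, with the left tail of area α shaded; the boundary of the shaded region is labeled ‘a lot’.]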

Since we are trying to show µ is less than 16, we will conclude that it is less than 16 if X̄ is a lot less than 16. ‘A lot’ is determined by α. In the graph given, ‘a lot’ is indicated on the graph as well as a mean of 16. (We assume the null hypothesis is true until proven otherwise and the null hypothesis states that µ is 16.) We will conclude that the mean is less than 16 if X̄ is on the axis below the shaded region.

If the null hypothesis is true then the distribution of X̄ is given in the graph. If X̄ is also on the axis below the shaded region then we conclude, incorrectly, that µ is less than 16 (again, we are assuming the null hypothesis is true here). This would be a type I error. So the area in the left tail is α. We need to compare α with X̄. We can’t compare them directly. Since α is an area, or probability, we would need to convert X̄ into a probability. To do this we would need to calculate the z-score of X̄ then determine the area. To start with we will instead determine the values of z for both the value of X̄ and α. This is what is called the critical value approach.

In the graph, the area to the left of ‘a lot’ is .05 (α), so we need to find −z.05, and we get −z.05 = −1.645. This is what is called the critical value. In the graph, the rejection region is in the left tail so this is called a left-tailed test. Now let us calculate the z-score of X̄. This is called the test statistic.

The test statistic is the value of the z-score corresponding to the observed value of X̄.

$$z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{15.92 - 16}{.13/\sqrt{40}} = -3.892$$

Since we are assuming that the null hypothesis is true, we use 16 for µ. Looking at the graph, we see that −3.892 is in the left tail (it is less than the critical value −1.645), so we conclude that the average amount of cheese in all 16-ounce packages is less than 16 ounces.

In this problem, when we are talking about ‘16-ounce packages’ we are referring to what the label says, which may or may not be what is actually in the package.
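The arithmetic of this example can be checked with a short Python sketch. It uses only the standard library; the variable names are illustrative and not part of the text. Here `inv_cdf` plays the role of looking up the critical value in the z-table.

```python
from math import sqrt
from statistics import NormalDist

# Values from the cheese example.
x_bar, mu0, sigma, n, alpha = 15.92, 16.0, 0.13, 40, 0.05

# Left-tailed test: the critical value cuts off an area of alpha in the left tail.
critical_value = NormalDist().inv_cdf(alpha)        # about -1.645

# z-score of the observed sample mean, assuming H0 (mu = 16) is true.
test_statistic = (x_bar - mu0) / (sigma / sqrt(n))  # about -3.892

# Reject H0 when the test statistic lies in the left-tail rejection region.
reject_h0 = test_statistic < critical_value
print(round(critical_value, 3), round(test_statistic, 3), reject_h0)
```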

Steps for a Hypothesis Test Using the Critical Value Approach

1. State the null and alternative hypotheses.

2. Select the appropriate distribution and reason(s) why.

3. Determine the acceptance and rejection regions.

4. Calculate the test statistic.

5. Write out your conclusion.


Example 9.1.2.

The speeds of cars passing through a residential area vary from car to car. The posted speed limit is 25 mph. If there is sufficient evidence that the average speed is more than 5 mph above the posted speed limit an increased police presence will begin in the area. The local patrol officer takes a sample of 8 cars and finds the average speed of the cars to be 31.5 mph. Having been on the beat for years, the officer knows that the speeds of cars are approximately normally distributed with a population standard deviation of 3.64 mph. Using the 2.5% level of significance, determine if the police will be setting up an increased presence.

Solution.

We will expect to see an increased police presence if the average speed is over 30 mph (5 mph above the posted 25 mph). This is what our alternative hypothesis is. Putting this together with the information given we get

H1 : µ > 30
n = 8
X̄ = 31.5
σ = 3.64
α = 2.5%
X ∼ N

Following the steps for the hypothesis test we get

1. H0 : µ = 30
   H1 : µ > 30

2. Use z because X ∼ N and σ is known.

3. [Figure: standard normal curve with the right tail of area α = 2.5% shaded; the critical value is 1.960.]

4. $$z = \frac{31.5 - 30}{3.64/\sqrt{8}} = 1.166$$

5. Since the test statistic is not in the rejection region, we do not conclude the average speed is greater than 30 mph. Therefore there will be no increased police presence.


In this last example we had a right-tailed test. We will only conclude that µ is greater than 30 if X̄ is a lot more than 30. ‘More than 30’ is to the right. That is why our rejection region is to the right. We now will look at a two-tailed test.
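The three kinds of rejection region (left, right, and two tails) can be put side by side in one small Python function. This is only a sketch: the function name `z_test_critical` and its `tail` argument are hypothetical labels introduced here, and it assumes σ is known and X̄ is at least approximately normal.

```python
from math import sqrt
from statistics import NormalDist


def z_test_critical(x_bar, mu0, sigma, n, alpha, tail):
    """Critical value approach; tail is 'left', 'right', or 'two'.

    Returns the test statistic and whether H0 is rejected.
    """
    z = (x_bar - mu0) / (sigma / sqrt(n))
    std_normal = NormalDist()
    if tail == "left":
        reject = z < std_normal.inv_cdf(alpha)               # all of alpha in the left tail
    elif tail == "right":
        reject = z > std_normal.inv_cdf(1 - alpha)           # all of alpha in the right tail
    elif tail == "two":
        reject = abs(z) > std_normal.inv_cdf(1 - alpha / 2)  # alpha/2 in each tail
    else:
        raise ValueError("tail must be 'left', 'right', or 'two'")
    return z, reject


# Speeds-of-cars example: right-tailed test at the 2.5% level.
z, reject = z_test_critical(31.5, 30, 3.64, 8, 0.025, "right")
print(round(z, 3), reject)
```

Only the comparison against the critical value changes from one kind of test to the next; the test statistic is computed the same way in all three cases.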

Example 9.1.3.

After getting complaints and being fined for under-filling the 16-ounce packages of cheese, the company’s CEO directs the production line to hire a new quality control officer. It is agreed that the cheeses will be filled to an average of 16.1 ounces. This way, if the package is under-filled, there will most likely be at least 16 ounces. Although the customers don’t care if the packages are over-filled, the company realizes that this is product that is not being sold so they don’t want the packages over-filled or under-filled. A sample of 40 16-ounce packages will be collected each shift, weighed, and if there is enough evidence that the average amount of cheese is different from 16.1 ounces, the machine will be stopped and adjusted. A recent sample of 40 packages yielded an average of 16.04 ounces. The population standard deviation is still .13 ounces. Use a level of significance of 1% to determine if the machine will be stopped and adjusted. Describe any error that might be made (I or II) and describe the error, in terms of the problem.

Solution.

We have the following

H1 : µ ≠ 16.1
n = 40
X̄ = 16.04
σ = .13
α = .01

Since H1 : µ ≠ 16.1, we have H0 : µ = 16.1. We will reject the null hypothesis if X̄ is ‘a lot’ more than 16.1 or if it is ‘a lot’ less than 16.1, so we need to split α in two. Each tail contains α/2.

1. H0 : µ = 16.1
   H1 : µ ≠ 16.1

2. Use z because n > 30 and σ is known.

3. [Figure: standard normal curve with both tails shaded, each of area .005; the critical values are −2.576 and 2.576.]

4. $$z = \frac{16.04 - 16.1}{.13/\sqrt{40}} = -2.919$$


5. Since the test statistic is in the rejection region, we conclude the average amount of cheese in all 16-ounce packages is different from 16.1 ounces. Therefore we will stop and adjust the machine.

So, does the machine need to be adjusted? We don’t know. The machine could be working as desired but the data led us to adjust the machine. Again, we don’t know the reality of the situation, just what the data led us to conclude. If we made the wrong decision, it would be a type I error: concluding the null hypothesis to be false when it is true. This would mean adjusting the machine when it doesn’t need adjusting. In terms of the hypotheses, this would mean concluding the mean is different from 16.1 ounces when, in reality, the mean is not different from 16.1 ounces.
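A small simulation can illustrate how often such a type I error would occur. The sketch below assumes the machine really is filling to an average of 16.1 ounces (so H0 is true) and that X̄ is exactly normal with standard deviation σ/√n. The number of trials and the random seed are arbitrary choices made here.

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(1)  # fixed seed so the run is repeatable

mu0, sigma, n, alpha = 16.1, 0.13, 40, 0.01
std_error = sigma / sqrt(n)
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 2.576 for a two-tailed test

trials = 20_000
rejections = 0
for _ in range(trials):
    # Draw one sample mean from its sampling distribution under H0.
    x_bar = random.gauss(mu0, std_error)
    z = (x_bar - mu0) / std_error
    if abs(z) > z_crit:
        rejections += 1  # H0 is true here, so each rejection is a type I error

type_one_rate = rejections / trials
print(f"estimated type I error rate: {type_one_rate:.4f}")
```

The estimated rate should land close to α = .01: even when the machine is working as desired, about 1% of samples lead us to stop and adjust it.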

9.1.1 Exercises

Perform the test of hypotheses for the following using the critical value approach.

1. Let n = 39, X̄ = 16.5, σ = 3.64, H1 : µ > 16, α = .05

2. Let n = 54, X̄ = 32.54, σ = 6.54, H1 : µ > 32, α = .025

3. Let n = 33, X̄ = 1.59, σ = 0.23, H1 : µ < 1.6, α = .10

4. Let n = 106, X̄ = 58.4, σ = 6.54, H1 : µ < 60, α = .025

5. Let n = 67, X̄ = 1.35, σ = 0.165, H1 : µ ≠ 1.3, α = .01

6. Let n = 38, X̄ = 51.8, σ = 3.64, H1 : µ ≠ 50, α = .05

7. After reading a report from nerdwallet.com that states that the average US household credit card debt is $6,929, a financial planner takes a sample to test the claim. After taking a random sample of 50 US households, the planner finds the average debt to be $7,503. The standard deviation of credit card debt for all households is $2,608. Test if the average US credit card debt is different from $6,929. Use a 1% level of significance.

8. A large pizza chain that delivers claims that the average time for delivery is at most 30 minutes. After waiting for what seems forever for their pizza, a customer decides to investigate. The customer takes a random sample of 51 deliveries and finds the average time to be 30.54 minutes. The population standard deviation is known to be 2.68 minutes. Is there sufficient evidence that the company’s claim is false? Use a 5% level of significance.

9. A bottling plant distributes wine into bottles. The machine tends to need to be readjusted quite often. The supervisor takes a sample of 12 bottles each day and determines the average amount of wine. A hypothesis test will then be done using the alternative hypothesis that the mean is different from 64 ounces. A recent sample yielded an average of 64.21 ounces. Since the company is quite familiar with the machine, it is known that the amounts of wine are normally distributed with a standard deviation that is always .34 ounces.

(a) Use a 5% level of significance to determine if the machine needs adjusting.


(b) Retest using a 1% level of significance.

(c) Does the machine need to be adjusted?

10. A company packages frozen peas. The label states that there are 10 ounces in the bag. The company has been sued in the past for under-filling the bags so they choose to overfill the bags by an average of at least 0.2 ounces. The amount dispensed into each bag varies from bag to bag but the weights are approximately normal and the standard deviation is 0.054 ounces. A recent sample of 16 bags produced an average of 10.225 ounces.

(a) Use a 5% level of significance to determine if the company’s goals are not met.

(b) Retest using a 10% level of significance.

(c) Does the machine need to be adjusted?

11. After upgrading their scheduling protocols, an airline boasts that flights are no more than an average of 15 minutes late. A customer takes a random sample of 40 flights and determines the average time late to be 18.6 minutes. The standard deviation of the times that flights are late is 6.87 minutes. Test, using a 5% significance level, if the claim of the airline is false.

12. The mayor of a large city claims that the average amount recycled by households of the city is more than 90 pounds per week. An investigative reporter takes a sample of 52 households and finds the average amount recycled to be 92.4 pounds. The population standard deviation of the amount recycled each week is 16.8 pounds. Can you conclude that the mayor’s claim is true at the 5% level of significance?

13. On a stretch of highway known for speeding, the local sheriffs have increased their presence on the highway in hopes of getting drivers to slow down. After the increase in presence, the sheriff takes a sample of 62 cars and finds the average speed to be 63.5 mph. The standard deviation before the increased presence was 5.46 mph and it is reasonable that it hasn’t changed.

(a) Test, using the 5% level of significance, if the mean speed of all cars is greater than 60 mph.

(b) What type of error might you have made: I or II? Explain what this error would be, in terms of the problem.

14. Every day, a commuter drives back and forth to work. The average time it takes to get to work is 35.6 minutes. The commuter hears of an alternative route. On a random sample of 32 days the commuter takes the alternative route and the average time is 35.1 minutes. The standard deviation of all times to get to work by that route is estimated to be 2.64 minutes.

(a) Test, using the 5% level of significance, if the alternative route is, on average, faster than the original route.

(b) What type of error might you have made: I or II? Explain what this error would be, in terms of the problem.

15. A deli offers a foot-long sandwich. After frequenting the deli every day while working, a customer doesn’t feel the sandwiches measure up. To check, the customer orders a foot-long every day at lunch for 2 weeks (10 days). The average length of the sandwiches turned out to be 11.64 inches. Assume the lengths of the sandwiches are normally distributed with a standard deviation of 0.67 inches. Is there sufficient evidence that the average length of all foot-longs is less than 12 inches? Use a 5% level of significance.

16. A biologist is concerned about the effect a drought has had on the frogs in a local pond. Specifically, the biologist is concerned that the drought has reduced the food for the frogs and as a result the froglets might be undersized. At this point in their life cycle the froglets are expected to be an average of 6.0 cm long. A sample of 37 froglets was found to have an average length of 5.61 cm. The standard deviation for all lengths is 1.45 cm. Can you conclude, at a 1% significance level, that the average length of all froglets is less than 6.0 cm?

17. The maker of a tire claims that the average stopping distance under specific speed and road conditions is less than 80 feet. The public relations officer wants to make sure that the average is in fact less than 80 feet. Fifteen cars are driven to specifications and brakes applied. The average distance was 78.6 feet. The standard deviation of the lengths for all such experiments is 1.34 feet and the lengths are expected to be normally distributed. Is there sufficient evidence that the claim is true? Use a 1% level of significance.

18. When a patient is in pain in the hospital, it is routine for the nurse to ask the patient to rate their pain from 1 (no pain) to 10 (worst pain). A hypnotist is claiming that for patients with a rating of 8 or above, hypnosis will reduce the pain by more than 2 points, on average. A pain management physician plans to test the claim. A sample of 40 patients with pain is selected and the average reduction in pain is 2.1. The standard deviation is expected to be .50. Using a 5% significance level, test if the hypnotist’s claim is true.

19. A company that makes frozen french fries uses a specific type of potato. When a truckload of potatoes comes in, a sample of 50 potatoes is chosen and a hypothesis test is performed at a 5% significance level. If the average weight of the potatoes is significantly less than 16 ounces, the shipment is rejected. For the sample taken, the average weight was 14.35 ounces. The standard deviation of all weights is 1.26 ounces.

(a) Determine if the shipment will be rejected.

(b) As an alternative, the company will simply weigh the sample of 50 potatoes. If the weight of the sample is less than a certain amount, the shipment will be rejected. What is the weight below which the shipment will be rejected?

20. A company manufactures rebar, a long steel bar that forms the framework of concrete structures. The company wants the average length to be 10 feet. To achieve this, the quality control supervisor takes a sample of 10 bars each week to determine the average length of the bars and performs a hypothesis test with the alternative hypothesis that the average length is not equal to 10 feet using a 5% level of significance. If the null hypothesis is rejected then the machine is inspected. This week the supervisor determined the average length to be 10.023 feet. The lengths of the bars are normally distributed with a standard deviation of 0.056 feet.

(a) Determine if the machine will need to be inspected.


(b) The supervisor assigns an assistant to check the rebar. To make it easier, the supervisor tells the assistant ‘Once you have your sample, lay the bars end to end starting in the corner of the warehouse. When you have done this you will notice I have drawn two marks on the floor. If the end of the last rebar falls between the two marks, leave the machine alone. If not, inspect the machine.’ How far from the corner of the warehouse are the two marks?


9.2 Hypothesis Tests of µ using the p-Value

In the last section we attacked a hypothesis test using the critical value approach. In that section we compared the z-scores of X̄ and α. In the speeds of cars example we had

H1 : µ > 30
n = 8
X̄ = 31.5
σ = 3.64
α = 2.5%
X ∼ N

This gave a test statistic of z = 1.166. We compared this to the critical value 1.960 and concluded that we couldn’t reject the null hypothesis.

[Figure: standard normal curve with the right-tail rejection region of area α shaded beyond the critical value; the test statistic lies to the left of the critical value.]

Instead of comparing the test statistic z with zα, we can compare areas to the right (since it’s a right-tailed test). We get the following

[Figure: the same curve with the area to the right of the test statistic labeled as the p-value; this area contains the smaller right tail of area α.]

The area to the right of the test statistic is called the p-value (again, to the right because it is a right-tailed test). From our z-table we get P(z > 1.17) = .1210, so the p-value = .1210. Notice from the graph that the p-value is greater than α and we did not reject the null hypothesis.

The p-value of a test of hypothesis is the smallest value of α such that we reject the null hypothesis.

An alternative way to think about the p-value is it is the probability of obtaining an observed value (X̄ here) as far away from the mean as it was or further.

We can do our tests either using the values of z (critical value approach) or the probabilities (p-value approach).
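In symbols, the p-value depends on the direction of the alternative hypothesis. The notation z_obs for the observed test statistic and µ0 for the value of µ in the null hypothesis is introduced here only for this summary.

```latex
\[
p\text{-value} =
\begin{cases}
P(Z < z_{\text{obs}}) & \text{left-tailed test } (H_1 : \mu < \mu_0) \\
P(Z > z_{\text{obs}}) & \text{right-tailed test } (H_1 : \mu > \mu_0) \\
2\,P(Z > |z_{\text{obs}}|) & \text{two-tailed test } (H_1 : \mu \neq \mu_0)
\end{cases}
\]
```

In every case the null hypothesis is rejected when the p-value is less than α.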


Steps for a Hypothesis Test Using the p-value Approach

1. State the null and alternative hypotheses.

2. Select the appropriate distribution and reason(s) why.

3. Calculate the test statistic.

4. Determine the p-value.

5. Write out your conclusion.

The steps for both the critical value approach and the p-value approach are almost identical. The difference is in steps 3 and 4. Step 3 of the p-value approach is the same as step 4 of the critical value approach.

One of the advantages of using the p-value approach is it tells us how much evidence we have to reject the null hypothesis. In most scientific journals, when hypothesis tests are performed p-values are reported with the results. It lets the reader be the judge of whether or not there is sufficient evidence.

A comparison of methods follows:

observed value:   X̄
                  ↓
z:                test statistic      critical value   ← critical value approach
                  ↓                        ↑
probability:      p-value                  α           ← p-value approach

Example 9.2.1.

The average height of a species of tree is 17 inches after one year. A nursery operator gives a special fertilizer and measures the height after one year. The mean height of a sample of 55 trees is 17.4 inches. The population standard deviation is 1.36 inches. Test, using a 5% level of significance with the p-value approach, if the mean height of all trees is greater than 17 inches.

Solution.

We start by interpreting the question and we get

H1 : µ > 17
n = 55
X̄ = 17.4
σ = 1.36
α = 5%

1. H0 : µ = 17
   H1 : µ > 17


2. Use z because n > 30 and σ is known.

3. $$z = \frac{17.4 - 17}{1.36/\sqrt{55}} = 2.18$$

4. p-value = P (z > 2.18) = P (z < −2.18) = .0146

5. Since the p-value (.0146) is less than α (.05), we reject the null hypothesis. There is sufficient evidence that the average height of all trees after one year is greater than 17 inches.
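The p-value in step 4 can also be found with software instead of the z-table. A minimal Python sketch follows; the variable names are illustrative. Because software keeps the unrounded value of z, the result may differ from the table value in the last decimal place.

```python
from math import sqrt
from statistics import NormalDist

# Values from the tree-height example.
x_bar, mu0, sigma, n, alpha = 17.4, 17.0, 1.36, 55, 0.05

z = (x_bar - mu0) / (sigma / sqrt(n))

# Right-tailed test: the p-value is the area to the right of the test statistic.
p_value = 1 - NormalDist().cdf(z)

print(f"z = {z:.2f}, p-value = {p_value:.4f}, p-value < alpha: {p_value < alpha}")
```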

Example 9.2.2.

The amount of fat in 42 1-ounce servings of potato chips is measured and the average amount of fat is 9.3 grams. The standard deviation of the amount of fat in all 1-ounce servings is .83 grams. Use the p-value approach to test if the average amount of fat in all 1-ounce servings is different from 9.0 grams. Use a 1% level of significance.

Solution.

From the problem we get:

H1 : µ ≠ 9.0
n = 42
X̄ = 9.3
σ = .83
α = 1%

For our 5 steps we have:

1. H0 : µ = 9.0
   H1 : µ ≠ 9.0

2. Use z because n > 30 and σ is known.

3. $$z = \frac{9.3 - 9.0}{.83/\sqrt{42}} = 2.34$$

Since this is a two-tailed test, the area in the tail beyond the test statistic (in this case, the right tail, since z is positive) is one-half of the p-value. So the p-value is twice the area in one of the tails.

4. p-value = 2P (z > 2.34) = 2P (z < −2.34) = 2 × .0096 = .0192


5. Since the p-value is not less than α we do not reject the null hypothesis. There is not sufficient evidence that the average amount of fat is different from 9.0 grams per 1-ounce serving.
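The doubling of the tail area can be folded into a single helper that covers all three kinds of test. This is a sketch under the same assumptions as the text (σ known, X̄ approximately normal); the function name `z_test_p_value` and its `tail` argument are hypothetical. The call uses the values that appear in the calculation above (a sample mean of 9.3 and n = 42).

```python
from math import sqrt
from statistics import NormalDist


def z_test_p_value(x_bar, mu0, sigma, n, tail):
    """Return (z, p-value) for a z test; tail is 'left', 'right', or 'two'."""
    z = (x_bar - mu0) / (sigma / sqrt(n))
    left_area = NormalDist().cdf(z)  # area to the left of the test statistic
    if tail == "left":
        p_value = left_area
    elif tail == "right":
        p_value = 1 - left_area
    elif tail == "two":
        # Twice the smaller tail area, whichever side z falls on.
        p_value = 2 * min(left_area, 1 - left_area)
    else:
        raise ValueError("tail must be 'left', 'right', or 'two'")
    return z, p_value


alpha = 0.01
z, p_value = z_test_p_value(9.3, 9.0, 0.83, 42, "two")
print(f"z = {z:.2f}, p-value = {p_value:.4f}, p-value < alpha: {p_value < alpha}")
```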

The following is an excerpt from a publication put out by the Cleveland Clinic published in the Annals of Internal Medicine, Volume 125, No. 2, July 15, 1996. The study looked at the effect of Coldeeze on the common cold. Several patients were given a placebo and several others were given the medicine. The results were published in the Journal.

‘Results: The time to complete resolution of symptoms was significantly shorter in the zinc group than in the placebo group (median, 4.4 days compared with 7.6 days; P < 0.001). The zinc group had significantly fewer days with coughing (median, 2.0 days compared with 4.5 days; P = 0.04), headache (2.0 days and 3.0 days; P = 0.02), hoarseness (2.0 days and 3.0 days; P = 0.02), nasal congestion (4.0 days and 6.0 days; P = 0.002), nasal drainage (4.0 days and 7.0 days; P < 0.001), and sore throat (1.0 day and 3.0 days; P < 0.001). The groups did not differ significantly in the resolution of fever, muscle ache, scratchy throat, or sneezing.’⁴

You will note that there are p-values reported for most comparisons. The types of tests are not tests of µ but of the median. We can understand their conclusions from the p-value reported. For example, there is more evidence that the nasal drainage improves with the medicine (p-value < .001) than relief of headache (p-value = .02). We would interpret this last p-value by saying that the difference we saw in the times for relief of headache would happen 2% of the time just by chance if the median relief times are equal. You will notice that some of the p-values are reported as equalities and others as inequalities. Although technology will easily calculate the p-value for any test you give it, when the p-value is extremely small the accuracy of the calculation is dubious, at best.

You will also notice ‘The groups did not differ significantly in . . .’. This is a common way to state that there was not enough evidence to reject the null hypothesis; i.e., the groups are different, but not different enough to reject H0.

9.2.1 Exercises

Perform the test of hypothesis for the following problems. Use the p-value approach. If the p-value is less than 0.001, write the p-value as p < .001.

1. H1 : µ > 2.6, n = 63, X̄ = 2.91, σ = .64, α = .05

2. H1 : µ > 165, n = 55, X̄ = 173.5, σ = 16.6, α = .05

3. H1 : µ < 1536, n = 43, X̄ = 1506, σ = 294, α = .01

4. H1 : µ < 2.34, n = 77, X̄ = 2.31, σ = .134, α = .01

5. H1 : µ ≠ 5, n = 40, X̄ = 4.93, σ = .26, α = .05

6. H1 : µ ≠ 5, n = 42, X̄ = 4.93, σ = .26, α = .05

⁴ http://www.coldeeze.com/-/media/coldeezecom/pdf/cleveland-clinic-annals-of-internal-medicine.pdf?la=en accessed 6/13/18


7. The owner of an ambulance service claims that the average response time is less than 5 minutes. A random sample of 40 calls is selected and the average response time is 4.68 minutes. The standard deviation of all response times is 1.2 minutes. Is there sufficient evidence that the owner’s claim is true? Use a 5% significance level.

8. While talking to students, a statistics teacher makes the comment that the average time students at the teacher’s high school spend on homework is greater than 15 hours per week. One of the teacher’s students takes a sample of 39 students and finds the mean time spent on homework each week is 16.8 hours. The standard deviation of the times students spend on homework each week is 7.51 hours. Test, using a 5% significance level, if the teacher’s claim is true.

9. The USDA Economic Research Service reported in November, 2016, that American adults spend an average of 37 minutes a day preparing and serving food and cleaning up. Being skeptical, you take a sample of 35 American adults and find the mean time to prepare, serve and clean up is 30.64 minutes. A previous investigation found that the standard deviation of all such times is 19.58 minutes. Test if the mean time for all American adults to prepare, serve, and clean up food is different from the reported time. Use a 5% level of significance.

10. The nutrition label on a box of Girl Scout Thin Mints states that one serving⁵ has 160 calories. Suppose you take a random sample of 45 servings of cookies and find the average caloric content to be 165.2 calories. Upon further investigation you conclude the standard deviation of the caloric content of all servings is 5.6 calories. Test, using a 1% significance level, if the mean caloric content of all servings of Thin Mints is greater than the claim of the packaging.

11. The mean amount of cottage cheese in all 16-ounce containers is supposed to be 16.1 ounces. A random sample of 12 tubs is taken each day to check if the machine needs to be adjusted. A recent sample produced a mean of 16.19 ounces. Since this is done every day, it is known that the weights are approximately normally distributed with a standard deviation of .052 ounces.

(a) Test if the mean amount of cottage cheese in all 16-ounce containers is different from 16.1 ounces. Use a 5% level of significance.

(b) Why would the management want a two-tailed test here?

(c) Give a reason as to why they would want the mean to be 16.1 ounces instead of the label’s 16 ounces.

12. A game show producer is in the process of creating a new game show. In the game, contestants will need to answer ten trivia questions. The amount of money contestants win depends on how long it takes to answer the questions. The producer needs to make sure the average time is less than 20 seconds. In a random sample of 44 contestants, the average time to answer the questions was 18.6 seconds. The population standard deviation is assumed to be 3.5 seconds. Is there sufficient evidence that the mean time to answer 10 questions is less than 20 seconds? Use α = .05.

⁵ One serving is 4 cookies. To me, it is one sleeve.


13. An avid exerciser walks regularly. The walker claims that their average walking stride is 36 inches. You follow the walker. After the walker walks through a puddle you measure 31 strides before the water is gone. You find the average length to be 36.2 inches. You estimate the population standard deviation of the lengths to be .85 inches.

(a) Using a 5% level of significance, test if the average stride is different from 36 inches.

(b) Comment on statistical issues with this problem, other than knowing the standard deviation.

14. A craft brewer is experimenting with brewing techniques and is investigating the alcohol content of the beer. After the wort has fermented for the prescribed time in vats, the alcohol content is measured. In 36 100-ml samples taken from one vat, the mean amount of alcohol is 6.33 ml. From before, the standard deviation is known to be 0.31 ml per 100-ml sample.

(a) Is there sufficient evidence that the mean alcohol content per 100 ml of all beer produced at the brewery is greater than 6.20 ml? Use α = .05.

(b) Comment on statistical issues with this problem.

15. An orange juice maker buys oranges from a local grower. It is required that the average sugar content in a 100-gram sample of flesh be more than 9.5 grams. A sample of 35 oranges is taken, from which 100-gram samples are prepared. The mean amount of sugar is found to be 9.68 grams. The standard deviation is known to be 0.56 grams. Test if the average amount of sugar in all 100-gram samples is greater than 9.5 grams. Use a 5% significance level.

16. The new owner of an auto oil and lube shop has been trying to cut the time customers spend waiting at the shop. The owner takes a sample of 38 cars and finds the average time to be 13.6 minutes. The standard deviation from before was 2.64 minutes. Assume this hasn’t changed. Is there sufficient evidence that the mean time for an oil and lube is less than 15 minutes? Use a 2.5% significance level.

17. A medical director at a large clinic is monitoring the time doctors spend with patients. The director reads that the average time a doctor spends with a patient is 14.5 minutes. The director takes a sample of 40 office visits and finds the mean time is 16.2 minutes. The population standard deviation is known to be 2.31 minutes. Test, using a 5% significance level, if the mean of all visits is different from 14.5 minutes.

18. A brand of light bulbs claims that its light bulbs produce, on average, 340 lumens. The management requires daily tests of bulbs coming off the assembly line. A random sample of 12 bulbs produced a mean of 334 lumens. Since the bulbs are tested regularly, it is known that the distribution is approximately normal with a standard deviation of 16.4 lumens. Test, at a 5% level of significance, if the mean light production is different from 340 lumens.

19. At a large company, the mean BMI⁶ of all workers in the past was 29.1. To reduce medical insurance premiums, the company installs exercise equipment onsite. After a year, a random sample of 41 employees is selected and the mean BMI is found to be 27.6. The standard deviation of the BMIs of all employees is 4.6.

⁶ Body Mass Index. CNN reported the mean is 29.1 for men and 29.6 for women.


(a) Test if the mean BMI after the equipment is installed has gone down using a 5% level of significance.

(b) The management puts out a newsletter and reports that ‘...by installing the exercise equipment, our employees are healthier.’ Comment on the statement.

20. A pediatrician is testing claims that a drug reduces the birth weight of babies. The average weight of newborns is known to be 3.58 kg. A random sample of 35 women who took the drug while pregnant delivered babies weighing an average of 3.24 kg. The standard deviation of babies’ birth weights is known to be 0.57 kg.

(a) Test if the mean weight of all babies born to users of the drug is less than 3.58 kg using a 1% significance level.

(b) Can the pediatrician state that the use of the drug by women who are pregnant reduces the average birth weight?


9.3 Hypothesis Tests of µ, σ Unknown

In this part of the chapter we draw a parallel with the chapter on confidence intervals of µ. In that chapter we started with problems where the population standard deviation was known and we used the z-distribution. We then found confidence intervals of µ where σ was unknown, and we used the t-distribution to deal with that case. We will do the same thing in this section. The problems are going to be very similar to problems in the last two sections. The main difference is that we will use t instead of z.

The test statistic for tests of µ is given by

$$t = \frac{\bar{X} - \mu}{s/\sqrt{n}}$$

with df = n − 1, provided the sampling distribution of X̄ is approximately normal.

When doing hypothesis tests of µ, a key distinction we need to make is which standard deviation is given: s or σ? Just like with confidence intervals of µ, when we know σ, we use z. When we don’t know σ, we use t.

Example 9.3.1.

The manufacturer of steel rods wants the average breaking strength of the rods to be 1000 pounds. To test this, a sample of 35 steel rods is selected and the breaking strength of each rod is determined. In the sample, the mean breaking strength was 986 pounds with a standard deviation of 23.5 pounds. Is there sufficient evidence that the average breaking strength is less than 1000 pounds? Use a level of significance of 5%.

Solution.

From the wording of the problem, we can see that the standard deviation came from the sample. This means we have s, not σ. We have the following:

H1 : µ < 1000
n = 35
X̄ = 986
s = 23.5
α = 5%

We will first use the critical value approach.

1. H0 : µ = 1000
   H1 : µ < 1000


2. Use t because n > 30 and σ is unknown.

Since we are using the t-distribution we have 34 (=35-1) degrees of freedom.

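3. [Figure: t-distribution curve with the left-tail rejection region shaded; area = .05, critical value t = −1.691]

4. $t = \frac{986 - 1000}{23.5/\sqrt{35}} = -3.524$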

5. [Since the test statistic is in the rejection region,] we conclude the average breaking strength is less than 1000 pounds.
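The critical value approach of Example 9.3.1 can be sketched in Python as a check on the arithmetic. This is a minimal illustration, not part of the original text: the function name `one_sample_t` is an assumption, and the critical value is simply copied from the t-table as in step 3 rather than computed.

```python
import math

def one_sample_t(xbar, mu0, s, n):
    """Return the t test statistic and degrees of freedom for a test of mu."""
    standard_error = s / math.sqrt(n)
    return (xbar - mu0) / standard_error, n - 1

# Summary values from Example 9.3.1.
t_stat, df = one_sample_t(xbar=986, mu0=1000, s=23.5, n=35)

# Critical value read from the t-table (df = 34, left tail, area .05).
t_critical = -1.691

# Left-tailed test: reject H0 when the statistic falls below the critical value.
reject_h0 = t_stat < t_critical

print(f"t = {t_stat:.3f}, df = {df}, reject H0: {reject_h0}")
```

For a right-tailed test the comparison would flip to `t_stat > t_critical`, with a positive critical value.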

In the next example we will use the p-value approach.

Example 9.3.2.

A psychology instructor has read that teens spend an average of 14 hours every week looking at their phones. The instructor decides to take a sample and check it out. The instructor takes a sample of 50 teens and finds the average time teens spend looking at their phones to be 15.65 hours with a standard deviation of 4.23 hours. Use the p-value approach and a level of significance of 2.5% to test if the average time teens spend looking at their phones is more than 14 hours per week.

Solution.

We get

H1 : µ > 14
n = 50
X̄ = 15.65
s = 4.23
α = 2.5%

1. H0 : µ = 14
   H1 : µ > 14

2. Use t because n > 30 and σ is unknown.


3. $t = \frac{15.65 - 14}{4.23/\sqrt{50}} = 2.758$

Since we are using the t-distribution we have 49 (= 50 − 1) degrees of freedom. Since the t-distribution table works differently than the z-distribution table, we will need to find the p-value differently. Let’s look at the row of the t-table corresponding to 49 degrees of freedom, with our value of t placed in the row.

df    .10     .05     .025    .01     .005    p-value   .001
...   ...     ...     ...     ...     ...               ...
49    1.299   1.677   2.010   2.405   2.680   2.758     3.265

4. Our test statistic falls between the table entries for .005 and .001, so our p-value is between .005 and .001. We will write p < .005.

5. [Since the p-value is less than α,] there is sufficient evidence that the average amount of time teens spend looking at their phones is greater than 14 hours per week.
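The table only brackets the p-value. As a rough cross-check, the tail area can be approximated numerically. The sketch below uses only the Python standard library and integrates the t density with Simpson's rule; the helper names `t_pdf` and `t_right_tail` are assumptions made for this illustration, and a statistics package would normally be used instead.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_right_tail(t, df, steps=2000):
    """Approximate P(T > t) for t >= 0 using Simpson's rule on [0, t] and symmetry."""
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

# Summary values from Example 9.3.2 (right-tailed test).
xbar, mu0, s, n, alpha = 15.65, 14, 4.23, 50, 0.025
t_stat = (xbar - mu0) / (s / math.sqrt(n))
p_value = t_right_tail(t_stat, n - 1)

print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}, reject H0: {p_value < alpha}")
```

The computed p-value should land between the two table columns that bracket the test statistic, which is all the table method claims.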

Example 9.3.3.

The amount of active ingredient in a capsule of a type of drug is measured for 8 randomly selected pills. The amount of the drug, in mg, is measured to be:

246.3, 253.4, 255.6, 244.9, 261.0, 266.3, 258.4, 251.3

It is a reasonable assumption that the amount of active ingredient in all pills is approximately normally distributed. Test, using a 5% level of significance, if the average amount of active ingredient in all pills is different from 250 mg.

Solution.

Since we are going to need the sample standard deviation and all we have is data, we will use our calculator to do the calculations. As such, we will use the p-value approach.

We have:

H1 : µ ≠ 250
α = 5%
X ∼ N

1. H0 : µ = 250
   H1 : µ ≠ 250

2. Use t because X ∼ N and σ is unknown.

3. t = 1.813 (From calculator, see below.)


4. p-value = .1128 (From calculator, see below.)

5. There is not sufficient evidence that the average amount of active ingredient in all pills is different from 250 mg.

Calculator Instructions For Hypothesis Tests when given Data:

Enter the data in a list

Select STAT>TESTS >TTest

For Inpt, select Data.

For µ0, enter 250 (From H0).

For List, put the list where your data is. (2nd>List>NAMES)

For Freq, we want 1.

µ : select ≠ µ0 (Same as H1)

Highlight Calculate and hit ENTER.
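For readers without the calculator, the same two-tailed test on raw data can be sketched in Python using only the standard library. This is an illustrative alternative, not the author's method: the helpers `t_pdf` and `t_right_tail` are assumed names, and the tail area is approximated with Simpson's rule. The output should agree closely with the calculator values quoted in Example 9.3.3.

```python
import math
import statistics

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_right_tail(t, df, steps=2000):
    """Approximate P(T > t) for t >= 0 using Simpson's rule on [0, t] and symmetry."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

# Data and hypothesized mean from Example 9.3.3.
data = [246.3, 253.4, 255.6, 244.9, 261.0, 266.3, 258.4, 251.3]
mu0, alpha = 250, 0.05

n = len(data)
xbar = statistics.mean(data)
s = statistics.stdev(data)  # sample standard deviation (divides by n - 1)
t_stat = (xbar - mu0) / (s / math.sqrt(n))

# Two-tailed test: double the area beyond |t|.
p_value = 2 * t_right_tail(abs(t_stat), n - 1)

print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}, reject H0: {p_value < alpha}")
```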

9.3.1 Exercises

Perform hypothesis tests on the following.

1. H1 : µ > 10, n = 15, X ∼ N, X̄ = 10.6, s = 2.3, α = .05.

2. H1 : µ < 23, n = 21, X ∼ N, X̄ = 20.6, s = 6.7, α = .05.

3. H1 : µ ≠ 65, n = 19, X ∼ N, X̄ = 66.2, s = 7.85, α = .01.

4. H1 : µ ≠ 106, n = 20, X ∼ N, X̄ = 101.7, s = 15.2, α = .01.

5. H1 : µ ≠ 5, X ∼ N, α = .05, Data: 5.3, 4.2, 6.5, 5.8, 6.1, 4.9

6. H1 : µ ≠ 20, X ∼ N, α = .05, Data: 19.6, 22.8, 21.6, 18.5, 24.3, 25.6

7. For the following get an estimate of the p-value

(a) H1 : µ < 5.3, df = 5, t = −2.768

(b) H1 : µ > 12.03, df = 13, t = 2.035

(c) H1 : µ ≠ 18.4, df = 17, t = −3.751

8. For the following get an estimate of the p-value

(a) H1 : µ < 7.3, df = 15, t = −3.245

(b) H1 : µ > 135.8, df = 18, t = 2.035


(c) H1 : µ ≠ 29.5, df = 22, t = −2.566

9. An artificial sweetener manufacturer sells one-gram packets of its sweetener. A frequent user of the sweetener decides to do a test to see if the average amount of sweetener is less than the one gram promised. A random sample of 23 packets found that the average amount of sweetener was 0.985 grams with a standard deviation of .023 grams. Test, using a 5% significance level, if the average amount of sweetener is less than one gram. Assume the amounts in all packets are approximately normally distributed.

10. Your roommate claims that they take, on average, a shower that is less than 7 minutes long. Having gone without hot water after following them in the shower, you decide to check. You take a random sample of 10 days and find the mean time to be 7.94 minutes with a standard deviation of 1.26 minutes. You have observed the shower lengths seem to be approximately normal. Test, using a 5% significance level, if the mean length of your roommate’s showers is more than 7 minutes.

11. The students at a high school use digital media an average of 8 hours per day. The principal has started a program to encourage students to use less digital media. After the program has begun, a sample of 32 students used digital media an average of 7.2 hours with a standard deviation of 2.36 hours. Using a 5% significance level, test if the mean time students of the high school spend on digital media is less than 8 hours.

12. The average time it takes for the employees on an assembly line to fill an order is 2.48 minutes. The management has started playing music to try to get the employees to work faster. After the music has started playing, a sample of 34 employees is selected and the mean time to fill an order is 2.39 minutes with a standard deviation of 0.29 minutes. Using a 5% significance level, test if the mean time has decreased after the management began playing music.

13. The average weight of candies at a chocolate factory is supposed to be 35 grams. The weights of the candies are expected to be normally distributed. The weights of several chocolates, in grams, are determined to be the following:

34.2, 35.6, 36.1, 33.9, 35.8, 35.1, 33.6

Test, using a 1% significance level, if the mean weight of all chocolates is different from 35 grams.

14. An avocado farmer is working on a new variety of avocado. The farmer wants the mean fat content to be less than 15 grams of fat per 100 grams of avocado. It is reasonable to assume the fat content per 100 grams is normally distributed. Six avocados are selected and the amount, in grams, of fat from 100-gram samples is determined. The amounts are:

14.5, 13.1, 15.2, 14.9, 13.9, 15.6

Use a 5% level of significance to test if the mean amount of fat per 100 grams of avocado is less than 15 grams.

15. The mean time it takes to serve diners at a restaurant is being examined. A sample of 40 orders was taken and the average time to be served was 13.6 minutes with a standard deviation of 3.56 minutes. Test if the mean time to serve a diner is less than 15 minutes.


16. A paper towel manufacturer takes a sample of 31 paper towels and dips them in water to see how much water they will soak up. The average amount of water soaked up was 2.65 ounces with a standard deviation of 0.13 ounces. Test if the mean amount of water soaked up is more than 2.6 ounces. Use a 1% level of significance.

17. A refrigerator manufacturer brags that the average amount of energy used is less than 300 kWh per year. A random sample of 35 refrigerators is taken and the mean amount of energy used was 286 kWh with a standard deviation of 59.8 kWh. Test if the claim of the manufacturer is true. Use a 2.5% level of significance.

18. A random sample of 46 restaurant bills was examined at a restaurant. The average tip per $100 spent was $16.43 with a standard deviation of $4.98. Test if the average tip per $100 spent is less than $18 using a 2.5% level of significance.

19. An infection control nurse is investigating the time medical staff spend washing their hands. The CDC recommends at least 20 seconds. Eight medical staff are observed washing their hands before examining a patient and the times they spent washing their hands, in seconds, were

18.5, 23.5, 14.5, 13.6, 12.9, 20.5, 18.4, 13.7

Assuming the times that all medical staff spend washing their hands are normally distributed, test at the 5% level of significance if the mean time is less than the recommended time.

20. A phlebotomist at a blood drive is trying to speed up the blood drawing process. Currently, the average time is 15.8 minutes. The phlebotomist tries a new technique to extract blood and finds the following times, in minutes:

15.4, 13.4, 18.4, 15.6, 14.6, 16.3, 15.2

Is there sufficient evidence that the mean time with the new technique is faster than the previous technique? Assume the times are normally distributed and use a 1% significance level.


9.4 Hypothesis Tests of the Population Proportion

Just as we extended the idea of the confidence interval from µ to p, we can extend the idea of hypothesis testing from µ to p. We need to recall

$$\hat{p} \sim N\left(p, \sqrt{\frac{pq}{n}}\right)$$

whenever np > 5 and nq > 5.

The hypothesis tests of p will be done just like the hypothesis tests of µ, with minor changes. We illustrate with an example.

Example 9.4.1.

A report indicates that 20% of adults in the United States use tobacco products. A statistician feels that this is too low. The statistician takes a sample of 850 adults in the US and finds 260 of them use tobacco products. Use a 5% significance level to determine if the percent of US adults who use tobacco products is greater than 20%.

Solution.

Our test will consist of a test on the proportion, p, but we are going to interpret the test in terms of the percentage. We will use the critical value approach for this problem.

We have the following:

H1 : p > .20
n = 850
x = 260
α = .05

We will also continue to use the 5 steps introduced earlier.

1. H0 : p = .20
   H1 : p > .20

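In order for the hypothesis test to be valid, we need p̂ to be approximately normal. This occurs if np > 5 and nq > 5. This gives us our second step:

2. Use z because np > 5 and nq > 5. (Here np = 850 × .20 = 170 and nq = 850 × .80 = 680.) We are using the z-distribution and we have a right-tailed test (p > .20 is our alternative hypothesis), so the rejection region is in the right tail.

3. [Figure: standard normal curve with the right-tail rejection region shaded; area = .05, critical value z = 1.645]

Our test statistic is just the z-score of the observed value of the point estimator (p̂).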


4. $$z = \frac{\hat{p} - \mu_{\hat{p}}}{\sigma_{\hat{p}}} = \frac{\hat{p} - p}{\sqrt{pq/n}}$$

$$\hat{p} = \frac{x}{n} = \frac{260}{850} = .30588\ldots$$

$$z = \frac{.30588 - .20}{\sqrt{\frac{.2 \times .8}{850}}} = 7.717$$

The test statistic is clearly in the rejection region. Even without looking at the rejection region in step 3 we see that the test statistic is very large (more than 7 standard deviations away from the mean).

5. There is sufficient evidence that more than 20% of US adults use tobacco products.
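The steps of Example 9.4.1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `one_proportion_z` is hypothetical, the validity check mirrors the np > 5 and nq > 5 rule used in this section, and the critical value comes from the standard library's `statistics.NormalDist` rather than a printed table.

```python
import math
from statistics import NormalDist

def one_proportion_z(x, n, p0):
    """z statistic for a test of p, using the hypothesized p0 in the standard error."""
    q0 = 1 - p0
    if n * p0 <= 5 or n * q0 <= 5:
        raise ValueError("normal approximation requires np > 5 and nq > 5")
    p_hat = x / n
    return (p_hat - p0) / math.sqrt(p0 * q0 / n)

# Values from Example 9.4.1 (right-tailed test).
alpha = 0.05
z_stat = one_proportion_z(x=260, n=850, p0=0.20)

# Critical value with area alpha in the right tail.
z_critical = NormalDist().inv_cdf(1 - alpha)
reject_h0 = z_stat > z_critical

print(f"z = {z_stat:.3f}, critical value = {z_critical:.3f}, reject H0: {reject_h0}")
```

Note that the standard error is built from the hypothesized proportion p0, not from p̂, because the test statistic is computed assuming H0 is true.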

In the next example we will use the p-value approach.

Example 9.4.2.

According to Wikipedia, 56.5% of residents in New York City use public transportation to commute to work. Feeling this is not correct, you take a sample of 975 residents and find that 52.3% of them use public transportation to commute to work. Using a 1% significance level, test if the percentage of New York City residents who commute to work using public transit is different from 56.5%.

Solution.

We have the following:

H1 : p ≠ .565
n = 975
p̂ = .523
α = .01

1. H0 : p = .565
   H1 : p ≠ .565

2. Use z because np > 5 and nq > 5

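3. $z = \frac{.523 - .565}{\sqrt{\frac{.565 \times .435}{975}}} = -2.65$

4. Since this is a two-tailed test, the area in the tail is one-half the p-value.

[Figure: standard normal curve with the left tail below z = −2.65 shaded; shaded area = p-value/2 = .0040]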


The p-value = 2 × P(z < −2.65) = 2 × .0040 = .0080

5. We conclude that the percentage of New York City residents who commute to work using public transit is different from 56.5%.
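The two-tailed p-value of Example 9.4.2 can also be sketched in Python with the standard library's `statistics.NormalDist`. This is an illustration only; because the text rounds z to two decimals before using the table, the unrounded computation below may differ slightly from .0080 in the last digit.

```python
import math
from statistics import NormalDist

# Values from Example 9.4.2 (two-tailed test).
p_hat, p0, n, alpha = 0.523, 0.565, 975, 0.01

z_stat = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# Area in one tail beyond |z|, doubled for a two-tailed test.
p_value = 2 * NormalDist().cdf(-abs(z_stat))

print(f"z = {z_stat:.2f}, p-value = {p_value:.4f}, reject H0: {p_value < alpha}")
```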

9.4.1 Exercises

1. Test H1 : p > .32, if n = 1200, X = 402, α = .01

2. Test H1 : p > .59, if n = 1542, X = 1016, α = .025

3. Test H1 : p ≠ .21, if n = 1350, p̂ = .26, α = .02

4. Test H1 : p ≠ .39, if n = 1380, p̂ = .45, α = .05

5. A 2015 report indicated that 73% of households in the United States use the internet at home. A recent sample of 864 American households indicated that 692 of them use the internet at home. Test, using the 5% level of significance, if the percentage of households that use the internet at home has increased.

6. A US News and World Report article reported in June 2018 that 60% of blood donations come from people over 40. You want to test this. You sample 706 blood donations and find that 388 of them are from people over 40. Test at the 1% level if the percent reported in the article is false.

7. A dog lover has read that 83% of dogs in the US are spayed or neutered. The dog lover takes a random sample of 500 dogs and finds that 87.5% of them are spayed or neutered. Test, using a 5% level of significance, if the percent of dogs that are spayed or neutered is greater than 83%.

8. A Gallup poll reported that 67% of cat owners have given their cats Christmas gifts. You think this is a bit high. You take a sample of 260 cat owners and find that 60% of the cat owners polled give their cats Christmas gifts. Can you conclude at the 5% level of significance that the percent of cat owners who give their cats Christmas gifts is less than 67%?

9. The pass rate for candidates taking the NCLEX Exam is 85% for testers with an associate degree. Three hundred randomly selected candidates are put into a test-prep course prior to taking the exam. The pass rate for the 300 was 88%. Test at the 2.5% significance level if the success rate of those who take the test-prep course is greater than the given NCLEX rate.

10. Gallup International reported that 37% of Americans attend religious ceremonies weekly or near-weekly. A student of religious studies feels that the percent is higher than that. The student takes a sample of 450 Americans and finds that 40% attend religious ceremonies weekly or near-weekly. Test, using a 5% significance level, if the student’s feeling is correct.


11. A phone service company has had a bad customer service record in the past. Of all customers that called in, 39% said they had ‘terrible’ service. A new CEO has started a training program to train its customer service representatives. After the training, a sample of 368 customers is polled and 35.3% of customers said they had ‘terrible’ service. Using a 5% significance level, test if the percent of people who have ‘terrible’ service has decreased.

12. In an August 2018 report by Consumer Affairs, 38% of teen drivers text while driving. A CHP officer claims that the percent should be higher. You take a sample of 564 teen drivers and find that 41.7% of teen drivers text while driving. Test, using a 2.5% level of significance, if the officer’s claim is true.

13. surveys.pro reported that 51% of people listen to music every day. From a sample of 1350, 729 listen to music every day. Test, using a 1% significance level, if the percent of people who listen to music every day is different from 51%.

14. A software firm is planning a large upgrade on its software. The lead designer for the upgrade tells the team working on it that at least half of the users of the software will like the new upgrade. The designer takes a sample of 843 users and finds that 395 like the new upgrade. Using a 2.5% level of significance, test if the designer’s claim is false.

15. According to the CDC, only 12% of Americans eat the minimum amount of fruit recommended. A recent sample of 913 Americans showed that 80 of them eat the minimum amount of fruit. Test at the 1% level of significance if the percent of Americans that get the minimum amount of fruit is different than the percent reported by the CDC.

16. A financial advisor stated that 52% of all American adults own stocks. A random sample of 793 American adults reported that 452 owned stocks. Is there sufficient evidence that the percent of Americans who own stock is different from 52%? Use a 1% level of significance.

17. The Cleveland Clinic reported in February 2018 that only 11% of Americans know the correct pace for performing chest compressions while doing CPR.⁷ After a series of CPR PSAs, a CPR teacher takes a sample of 234 Americans and finds that 44 know the correct pace. Does it appear, at the 5% level of significance, that the PSAs have increased the percent of Americans who know the correct pace? Comment on issues you have with the problem. (Of a statistical nature, not that more people should know CPR.)

18. In the past, 45% of people have gotten flu shots, according to cdcfoundation.org. After a rough flu season, the following year a sample of 1604 Americans is polled and 798 got a flu shot. Test at the 2.5% significance level if the rough flu season has increased the percent of people who got the flu shot this year. There is a major problem with this problem. What is it?

19. In June 2017, the Josephson Institute Center for Youth Ethics reported that 59% of high school students admitted to cheating on a test in the last year. A high school math teacher feels that this percent is underestimating the percentage. The teacher polls 156 students taking the teacher’s math classes and finds 117 of them admit to cheating in the last year. Test at the 1% level of significance if the math teacher’s feeling is correct. Comment on any problems you have with your conclusion.

⁷ It is 100 to 120 bpm. Just sing ‘Stayin’ Alive’ by the Bee Gees.


20. A September 2018 report on the site drugfree.com stated that 20% of all high school students vape. The principal of a high school feels this reported percentage is too low. The principal takes a sample of 350 students from the high school the principal administers and finds that 91 vape. Test, using a 1% significance level, if the principal’s feeling is correct. Comment on any problems with your conclusion.


Chapter 10

Confidence Intervals and Hypothesis Tests for Two Population Data


10.1 Confidence Intervals and Hypothesis Tests of µ1 − µ2 with known σ’s

Up until now, we have constructed confidence intervals and done hypothesis tests for a single parameter, such as µ. We want to extend that to looking at the parameters for two populations. Specifically, we will look for a confidence interval for the difference of a parameter for two different populations. In this section this will be µ1 − µ2, where we are sampling from two different populations and the samples are independent. By independence we mean the samples are collected independently of one another: how the sample is selected from one population has no impact on how the sample from the second population is collected.

10.1.1 Confidence Intervals of µ1 − µ2 with known σ’s

Since we are interested in estimating µ1 − µ2, it is natural to use the estimators of µ1 and µ2 to estimate µ1 − µ2. Let’s illustrate with an example.

Example 10.1.1.

A sample of 35 women found that the mean height was 162.5 cm. A sample of 40 men found that the mean height was 176.8 cm. Assume the population standard deviations of women and men are, respectively, 6.15 cm and 6.75 cm.

1. What is the point estimate for the difference of the heights of women and men? Interpret the point estimate.

2. Find a 95% margin of error for the point estimate.

3. Find and interpret a 95% confidence interval for the difference of the means.

Solution.

Let us first identify what we have:

Women           Men
n1 = 35         n2 = 40
X̄1 = 162.5      X̄2 = 176.8
σ1 = 6.15       σ2 = 6.75

1. What is the average height of a woman? A man? Simply put, we don’t know. That is the reason we are looking at a point estimate. So what is your best guess as to the average height of all women? The average height of the women sampled: X̄1 = 162.5 cm. Similarly, our best guess for the average height of men is X̄2 = 176.8 cm. So what does µ1 − µ2 equal? Again, we don’t know, but our best guess is, using our guesses for the means, X̄1 − X̄2 = 162.5 − 176.8 = −14.3 cm. Since we got a negative number, that simply means that our first mean is less than our second mean. Finally, we conclude that the average height of a woman is 14.3 cm less than the average height of a man. Equivalently, we could say that men are, on average, 14.3 cm taller than women.

2. As soon as we have a point estimate, the question arises: ‘How close is this estimate to the actual value?’ The margin of error gives us an idea of the answer to this question. Before we proceed, let us look at the point estimates, margins of error, and confidence intervals we have done before.

Parameter    Point Estimate and Distribution            Margin of Error
µ            X̄ ∼ N(µ, σ_X̄)                              z_{α/2} σ_X̄
µ            X̄ ∼ N(µ, σ_X̄)                              t_{α/2} s_X̄
p            p̂ ∼ N(p, σ_p̂)                              z_{α/2} s_p̂
µ1 − µ2      X̄1 − X̄2 ∼ N(µ1 − µ2, σ_{X̄1−X̄2})            z_{α/2} σ_{X̄1−X̄2}
µ1 − µ2      X̄1 − X̄2 ∼ N(µ1 − µ2, σ_{X̄1−X̄2})            t_{α/2} s_{X̄1−X̄2}

The last two lines are logical extensions of what we have done before.

Recall from before that when we subtract independent normal random variables, we get a normal random variable with the means subtracted, and we add the variances of the two random variables to get the new variance. How do we know that X̄1 and X̄2 are normally distributed? Both sample sizes are greater than 30, so the sampling distributions are approximately normal:

$$\bar{X}_1 \sim N\left(\mu_1, \frac{\sigma_1}{\sqrt{n_1}}\right) \quad\text{and}\quad \bar{X}_2 \sim N\left(\mu_2, \frac{\sigma_2}{\sqrt{n_2}}\right)$$

The variance of X̄1 − X̄2 is

$$\sigma^2_{\bar{X}_1 - \bar{X}_2} = \sigma^2_{\bar{X}_1} + \sigma^2_{\bar{X}_2}$$

The standard deviation is then

$$\sigma_{\bar{X}_1 - \bar{X}_2} = \sqrt{\sigma^2_{\bar{X}_1} + \sigma^2_{\bar{X}_2}} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

Putting all the pieces together we have

$$\bar{X}_1 - \bar{X}_2 \sim N\left(\mu_1 - \mu_2, \sqrt{\sigma^2_{\bar{X}_1} + \sigma^2_{\bar{X}_2}}\right)$$

Our margin of error is given by

$$z_{.025}\sqrt{\sigma^2_{\bar{X}_1} + \sigma^2_{\bar{X}_2}} = 1.960\sqrt{\frac{6.15^2}{35} + \frac{6.75^2}{40}} = 2.92$$

Finally, our margin of error is 2.92 cm. This tells us that we are 95% certain that our point estimate is within 2.92 cm of the actual difference in heights.

3. Our confidence intervals when we use the t or z distribution are of the form: point estimate ± margin of error. So we have −14.3 ± 2.92, or −17.22 to −11.38. We are 95% confident that the mean height of a woman is 11.38 cm to 17.22 cm less than the mean height of a man. Compare this statement with the statement for the point estimate.

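The point estimate of µ1 − µ2 is X̄1 − X̄2.

If X̄1 and X̄2 are normally distributed then we also have:

The margin of error, E, of this point estimate is

$$E = z_{\alpha/2}\sqrt{\sigma^2_{\bar{X}_1} + \sigma^2_{\bar{X}_2}}$$

The confidence interval for µ1 − µ2 is

$$\left(\bar{X}_1 - \bar{X}_2\right) \pm z_{\alpha/2}\sqrt{\sigma^2_{\bar{X}_1} + \sigma^2_{\bar{X}_2}}$$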


Example 10.1.2.

At a bottling plant that fills beer bottles, there are two bottling machines: one on the east side of the building and one on the west side of the building. The machines are monitored regularly to check and see if the machines are working properly. Since the machines are monitored regularly, the management knows that the amounts dispensed into the bottles by both machines are normally distributed. Also, they know that the standard deviation of the amount of beer dispensed by the east side machine is 2.63 ml and the standard deviation of the machine on the west side is 3.21 ml. A recent sample of 12 bottles from the east side machine produced a mean amount of 502.1 ml and a sample of 11 bottles from the west side machine produced a mean amount of 501.6 ml.

1. Find a point estimate for the difference of the mean amounts of beer dispensed by the two machines.

2. Find the 90% margin of error for the point estimate.

3. Construct a 90% confidence interval for the difference of the mean amounts of beer dispensed by the two machines.

Solution.

Let us first identify what we have:

East Side       West Side
n1 = 12         n2 = 11
X̄1 = 502.1      X̄2 = 501.6
σ1 = 2.63       σ2 = 3.21
X1 ∼ N          X2 ∼ N

1. The point estimate is 502.1 − 501.6 = .5 ml. This tells us that the mean amount of beer dispensed by the east side machine is .5 ml more than the mean of the west side machine. This is an estimate; the actual amount may be different.

2. Notice that we have small sample sizes, unlike the last example. We need the X̄’s ∼ N. We have this since the X’s ∼ N. Since our confidence level is 90% we need z.05 = 1.645. So we have

$$E = \pm 1.645\sqrt{\frac{2.63^2}{12} + \frac{3.21^2}{11}} = \pm 2.02 \text{ ml}$$

3. Since we have the point estimate and the margin of error we have .5 ± 2.02, which is −1.52 to 2.52. The wording here is a bit odd since the endpoints are of different signs. Let us take apart our confidence interval.

We are 90% confident that $-1.52 < \mu_1 - \mu_2 < 2.52$.

If we add $\mu_2$ to all three sides of the inequality we get

$\mu_2 - 1.52 < \mu_1 < \mu_2 + 2.52$


Let’s put this in terms of the problem:

We are 90% confident that [the mean amount of beer dispensed by the east side machine] is between [1.52 ml less than the mean amount dispensed by the west side machine] and [2.52 ml more than the mean amount dispensed by the west side machine].

Inside the brackets, [ ], are the sides of the inequality. Read it without the brackets.
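The three parts of Example 10.1.2 can also be checked with a few lines of code. What follows is only a sketch, not part of the original solution: the function name `two_sample_z_interval` is made up for illustration, and it uses Python's standard-library `NormalDist` to look up $z_{\alpha/2}$ instead of the table.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z_interval(mean1, mean2, sigma1, sigma2, n1, n2, confidence):
    """Return (point estimate, margin of error, lower, upper) for mu1 - mu2.

    Hypothetical helper: assumes both population standard deviations are known
    and that the sampling distributions of the sample means are normal.
    """
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)            # z_{alpha/2}
    std_error = sqrt(sigma1**2 / n1 + sigma2**2 / n2)  # std. dev. of X1bar - X2bar
    point = mean1 - mean2
    margin = z * std_error
    return point, margin, point - margin, point + margin

# Numbers from the bottling plant example (east side first, west side second)
point, margin, lower, upper = two_sample_z_interval(502.1, 501.6, 2.63, 3.21, 12, 11, 0.90)
print(round(point, 2), round(margin, 2), round(lower, 2), round(upper, 2))
```

Up to rounding, the printed values should agree with the point estimate, margin of error, and interval endpoints found by hand above.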

10.1.2 Hypothesis Tests of µ1 − µ2 with known σ’s

Hypothesis tests of $\mu_1 - \mu_2$ are an extension of the hypothesis tests of $\mu$.

The test statistic for a hypothesis test of $\mu_1 - \mu_2$ is given by

\[ z = \frac{(\bar X_1 - \bar X_2) - (\mu_1 - \mu_2)}{\sqrt{\sigma^2_{\bar X_1} + \sigma^2_{\bar X_2}}} \]

provided the sampling distributions of $\bar X_1$ and $\bar X_2$ are approximately normally distributed.

Example 10.1.3.

The dean at a large university claims that the average SAT score of incoming first-generation students is greater than the average SAT score of non first-generation students. To test the claim, a sample of 35 first-generation students is selected and the mean SAT score is 1237. A random sample of 36 non first-generation students is taken and the mean is found to be 1195. The population standard deviations of the scores are 86 and 79 for the first-generation and non first-generation students, respectively. Using a 5% level of significance, test if the dean's claim is true.

Solution.

Let us organize what we have:

1st Gen                   non-1st Gen
$n_1 = 35$                $n_2 = 36$
$\bar X_1 = 1237$         $\bar X_2 = 1195$
$\sigma_1 = 86$           $\sigma_2 = 79$

$\alpha = 5\%$
$H_1: \mu_1 > \mu_2$

We will proceed with a hypothesis test using the same 5 steps we used before. We will proceed with the critical value approach.

1. $H_0: \mu_1 = \mu_2$
   $H_1: \mu_1 > \mu_2$

   or

   $H_0: \mu_1 - \mu_2 = 0$
   $H_1: \mu_1 - \mu_2 > 0$

Since this test is just an extension of a test of $\mu$, we use the same criteria to determine which distribution to use. Since $n_1 > 30$ and $n_2 > 30$ we know that $\bar X_1 - \bar X_2 \sim N$. We then need to know if the population standard deviations, $\sigma_1$ and $\sigma_2$, are known. Somehow, they are. So. . .


2. Use z because both $n$'s $> 30$ and the $\sigma$'s are known.

Since we are using z, we determine the values exactly like we did before. Recall that at the bottom of the t-table are values of z.

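3. [Figure: standard normal (z) curve with a right-tail rejection region of area .05 beginning at the critical value z = 1.645.]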

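4. Recall that when we are using the z-distribution, the test statistic is just a z-score.

\[ z = \frac{(\bar X_1 - \bar X_2) - (\mu_1 - \mu_2)}{\sigma_{\bar X_1 - \bar X_2}} \quad\text{where}\quad \sigma_{\bar X_1 - \bar X_2} = \sqrt{\sigma^2_{\bar X_1} + \sigma^2_{\bar X_2}} \]

\[ z = \frac{(\bar X_1 - \bar X_2) - (\mu_1 - \mu_2)}{\sqrt{\sigma^2_{\bar X_1} + \sigma^2_{\bar X_2}}} = \frac{(1237 - 1195) - (0)}{\sqrt{\frac{86^2}{35} + \frac{79^2}{36}}} = 2.141 \]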

5. Since the test statistic is in the rejection region (2.141 > 1.645), we conclude that the mean SAT score of all first-generation students at the university is greater than the mean SAT score of all non first-generation students.
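Steps 3 through 5 of the critical value approach can be sketched in code as well. This is only an illustration under the same assumptions as the example (known $\sigma$'s, right-tailed test); the helper name `right_tailed_z_test` is hypothetical, and the critical value comes from the standard library's `NormalDist` where the text reads it from the bottom of the t-table.

```python
from math import sqrt
from statistics import NormalDist

def right_tailed_z_test(mean1, mean2, sigma1, sigma2, n1, n2, alpha, null_diff=0.0):
    """Return (test statistic, critical value, reject H0?) for H1: mu1 - mu2 > null_diff."""
    std_error = sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    z = ((mean1 - mean2) - null_diff) / std_error
    critical = NormalDist().inv_cdf(1 - alpha)   # area alpha lies to the right
    return z, critical, z > critical

# Numbers from the SAT example (first-generation students first)
z, critical, reject = right_tailed_z_test(1237, 1195, 86, 79, 35, 36, 0.05)
print(round(z, 3), round(critical, 3), reject)
```

The `null_diff` argument is there so the same sketch works when the null hypothesis is that the means differ by a constant other than 0.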

In the last example, we had two different ways to write the null and alternative hypotheses. In the future, we will use the hypotheses with the means on one side of the equality. This will be convenient if the null hypothesis is not that the means are equal but that they differ by a constant.

Example 10.1.4.

The So Lo-So soup manufacturer claims that the average amount of sodium per serving of its chicken noodle soup is more than 30 mg less than its competitor's. A random sample of 12 servings of its soup was measured and the average amount of sodium was 326 mg. A random sample of 15 servings of the competitor's soup yielded an average of 381 mg sodium. The amounts of sodium per serving in the two brands are normally distributed with standard deviations of 31.6 mg and 35.8 mg sodium for the So Lo-So soup and its competitor, respectively. Test, using a 1% level of significance, if the manufacturer's claim is true.

Solution.

Let us organize what we have

So Lo-So                  Competitor
$n_1 = 12$                $n_2 = 15$
$\bar X_1 = 326$          $\bar X_2 = 381$
$\sigma_1 = 31.6$         $\sigma_2 = 35.8$
$X_1 \sim N$              $X_2 \sim N$


$\alpha = 1\%$
$H_1: \mu_1 - \mu_2 < -30$

In this example, unlike the last one, we have small sample sizes (< 30) but normality is given. Also, look at the alternative hypothesis. The wording is a little odd. To convince yourself that the inequality given is correct, try some numbers. If µ1 = 300 and µ2 = 350, is µ1 ‘more than 30 mg less than’ µ2? What about 320 and 340?
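If you would rather let a computer try the numbers, here is a tiny sketch. The function name `more_than_30_less` is made up for this check; it simply evaluates the inequality from the alternative hypothesis.

```python
def more_than_30_less(mu1, mu2):
    # "mu1 is more than 30 mg less than mu2" is written as mu1 - mu2 < -30
    return mu1 - mu2 < -30

for mu1, mu2 in [(300, 350), (320, 340)]:
    print(mu1, mu2, mu1 - mu2, more_than_30_less(mu1, mu2))
```

The first pair differs by −50, so the statement holds; the second pair differs by only −20, so it does not.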

On to the test. We will use the p-value approach here.

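1. $H_0: \mu_1 - \mu_2 = -30$
   $H_1: \mu_1 - \mu_2 < -30$

2. Use z because the $X$'s $\sim N$ and the $\sigma$'s are known.

3. \[ z = \frac{(\bar X_1 - \bar X_2) - (\mu_1 - \mu_2)}{\sqrt{\sigma^2_{\bar X_1} + \sigma^2_{\bar X_2}}} = \frac{(326 - 381) - (-30)}{\sqrt{\frac{31.6^2}{12} + \frac{35.8^2}{15}}} = -1.93 \]

4. Since we have a left-tailed test we have p-value $= P(z < -1.93) = 0.0268$.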

5. Since the p-value is not less than α (0.0268 > 0.01), we can't conclude that the mean sodium content of all So Lo-So chicken noodle soup is more than 30 mg less than their competitor's.
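The p-value approach with a nonzero hypothesized difference can be sketched the same way. This is an illustration only; the helper `left_tailed_p_value` is a hypothetical name, and the normal CDF comes from Python's standard library where the text uses the z-table.

```python
from math import sqrt
from statistics import NormalDist

def left_tailed_p_value(mean1, mean2, sigma1, sigma2, n1, n2, null_diff):
    """Return (z, p-value) for H1: mu1 - mu2 < null_diff with known sigmas."""
    std_error = sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    z = ((mean1 - mean2) - null_diff) / std_error
    return z, NormalDist().cdf(z)    # area to the left of z

# Numbers from the soup example (So Lo-So first, competitor second)
z, p_value = left_tailed_p_value(326, 381, 31.6, 35.8, 12, 15, null_diff=-30)
alpha = 0.01
print(round(z, 3), round(p_value, 4), p_value < alpha)
```

Because the code does not round z to two decimals before finding the area, its p-value can differ slightly in the last digits from the table value used above. The decision is the same either way.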

10.1.3 Exercises

For the following, find the point estimate and construct confidence intervals for $\mu_1 - \mu_2$.

1. $n_1 = 37$, $n_2 = 54$, $\bar X_1 = 64.9$, $\bar X_2 = 76.5$, $\sigma_1 = 6.24$, $\sigma_2 = 5.67$, 95% confidence level.

2. $n_1 = 62$, $n_2 = 84$, $\bar X_1 = 265.8$, $\bar X_2 = 231.5$, $\sigma_1 = 23.64$, $\sigma_2 = 18.94$, 98% confidence level.

3. $n_1 = 13$, $n_2 = 15$, $\bar X_1 = 2.364$, $\bar X_2 = 1.268$, $\sigma_1 = 0.364$, $\sigma_2 = 0.468$, $X_1 \sim N$, $X_2 \sim N$, 90% confidence level.

4. $n_1 = 9$, $n_2 = 11$, $\bar X_1 = 864.3$, $\bar X_2 = 764.5$, $\sigma_1 = 63.4$, $\sigma_2 = 77.6$, $X_1 \sim N$, $X_2 \sim N$, 99% confidence level.

Perform the following hypothesis tests. Use the critical value approach.

5. $H_1: \mu_1 - \mu_2 > 0$, $n_1 = 102$, $n_2 = 93$, $\bar X_1 = 2.545$, $\bar X_2 = 2.335$, $\sigma_1 = 0.431$, $\sigma_2 = 0.337$, 5% significance level.

6. $H_1: \mu_1 - \mu_2 > 0$, $n_1 = 31$, $n_2 = 32$, $\bar X_1 = 16.54$, $\bar X_2 = 14.66$, $\sigma_1 = 2.31$, $\sigma_2 = 2.01$, 5% significance level.

7. $H_1: \mu_1 - \mu_2 \ne 0$, $n_1 = 16$, $n_2 = 13$, $\bar X_1 = 567$, $\bar X_2 = 597$, $\sigma_1 = 63.5$, $\sigma_2 = 62.3$, $X_1 \sim N$, $X_2 \sim N$, 1% significance level.

8. $H_1: \mu_1 - \mu_2 < 0$, $n_1 = 39$, $n_2 = 33$, $\bar X_1 = 55.6$, $\bar X_2 = 56.4$, $\sigma_1 = 2.34$, $\sigma_2 = 1.56$, $X_1 \sim N$, $X_2 \sim N$, 5% significance level.


Perform the following hypothesis tests. Use the p-value approach.

9. $H_1: \mu_1 - \mu_2 < 2$, $n_1 = 44$, $n_2 = 37$, $\bar X_1 = 23.6$, $\bar X_2 = 27.9$, $\sigma_1 = 6.34$, $\sigma_2 = 5.99$, 5% significance level.

10. $H_1: \mu_1 - \mu_2 > 100$, $n_1 = 65$, $n_2 = 51$, $\bar X_1 = 2136.5$, $\bar X_2 = 1856.7$, $\sigma_1 = 364.5$, $\sigma_2 = 468.1$, 2.5% significance level.

11. $H_1: \mu_1 - \mu_2 \ne 10$, $n_1 = 10$, $n_2 = 8$, $\bar X_1 = 22.65$, $\bar X_2 = 16.54$, $\sigma_1 = 2.68$, $\sigma_2 = 4.65$, $X_1 \sim N$, $X_2 \sim N$, 5% significance level.

12. $H_1: \mu_1 - \mu_2 \ne 0$, $n_1 = 7$, $n_2 = 6$, $\bar X_1 = 12.3$, $\bar X_2 = 10.4$, $\sigma_1 = 1.26$, $\sigma_2 = 1.33$, $X_1 \sim N$, $X_2 \sim N$, 5% significance level.

For the following, write the statement for the confidence intervals and use the five steps for hypothesis tests.

13. A consumer advocate is investigating the times spent on hold waiting ‘for the next available service representative’ for a large call center. A random sample of 14 calls made in the evening yielded a mean wait time of 23.5 minutes. A random sample of 13 calls made during the daytime yielded a mean wait time of 36.4 minutes. The population distributions of the wait times in both the evening and daytime are normal. The standard deviation of the evening wait times is 6.97 minutes and the standard deviation of the daytime wait times is 9.75 minutes.

(a) What is the point estimate for the difference of the mean wait times?

(b) What is the 95% margin of error for the point estimate?

(c) Construct a 95% confidence interval for the difference of the wait times.

(d) The evening supervisor claims that the mean wait time in the evening is less than the mean wait time in the daytime. Test the claim using a 5% level of significance.

14. An auto battery manufacturer claims that their Super Start battery lasts longer than the Long Charge battery, their major competitor. A sample of 35 Super Start batteries was tested and the mean life was 56.3 months. A random sample of 40 Long Charge batteries had a mean life of 53.6 months. The standard deviation of the lives of all Super Start batteries is known to be 9.64 months and the standard deviation of the Long Charge battery is known to be 8.74 months.

(a) What is the point estimate for the difference of the average lives of the batteries?

(b) What is the 90% margin of error for the point estimate?

(c) Construct a 90% confidence interval for the difference of the means of the battery lives.

(d) Test the claim of the battery manufacturer at the 10% level of significance.

15. The dean of an engineering school feels that the mean starting salary for a biomedical engineer is greater than the mean for a civil engineer. A random sample of 40 biomedical engineers found a mean starting salary of $62,411. A random sample of 38 civil engineers found that the mean starting salary was $59,230. The population standard deviations for the two majors are $5,978 and $5,465, respectively.


(a) Give the point estimate for the difference of the mean starting salaries.

(b) Find the 90% margin of error for the point estimate.

(c) Construct a 90% confidence interval for the difference of the mean starting salaries for the two majors.

(d) Use a 5% level of significance to test the dean’s claim.

16. The principal at Old School High School states that the average time their students spend studying per week is greater than at their cross town rival, Uppity High School. To test this, the principal takes a sample of 38 students at Old School and finds the average to be 23.54 hours per week. A sample of 42 students at Uppity High School yields an average of 19.40 hours per week. The population standard deviations are known to be 6.87 hours and 8.46 hours, respectively.

(a) Give the point estimate for the difference of the average times spent studying at the two schools.

(b) Find the 95% margin of error for the point estimate.

(c) Construct a 95% confidence interval for the difference of the average times spent studying at the two schools.

(d) Use a 5% level of significance to test the principal’s claim.

17. As part of their goal to get people more fit, a large company is encouraging people to bike to work. A random sample of 15 people who ride their bike to work regularly is selected and their mean systolic blood pressure is 138.6 mm Hg. A random sample of 18 people that don't ride their bike to work is selected and their mean systolic blood pressure is 142.7 mm Hg. The standard deviation of all people that ride to work is 15.6 mm Hg and the standard deviation of all people that don't ride is 21.6 mm Hg. It is further known that the distributions of the blood pressures are approximately normal for both groups.

(a) Give the point estimate for the difference of the mean systolic blood pressures.

(b) Find the 95% margin of error for the point estimate.

(c) Construct a 95% confidence interval for the difference of the mean systolic blood pressures.

(d) Use a 5% level of significance to test if people that don't ride have a higher mean systolic blood pressure than people that ride to work.

(e) If there were any difference, could we conclude that the cause is riding a bike to work?

18. To reduce the speed of cars on a highway, the CHP has started a ‘slow down’ campaign. Before the campaign started, the average speed of a sample of 45 cars passing a certain point was found to be 76.5 mph. After the campaign, the average speed of 40 randomly selected cars passing the same point was found to be 73.5 mph. The standard deviations of the speeds of all cars before and after are known to be 3.64 and 3.26 mph, respectively.

(a) Give the point estimate for the difference of the average speeds of cars before and after the campaign began.


(b) Find the 99% margin of error for the point estimate.

(c) Construct a 99% confidence interval for the difference of the average speeds of the cars.

(d) Use a 5% level of significance to test if the average speed of all cars is less after the campaign began.

(e) If there were any change before and after the campaign began, can we state that the campaign was the reason the speeds changed?


10.2 Confidence Intervals and Hypothesis Tests of µ1 − µ2 with unknown σ’s

Although the author is fairly conservative with respect to the use of calculators, the calculations involved in this section are particularly nasty. We will present formulas but rely on our calculator for the actual calculations.

The last section, although useful, was not very realistic from a statistics standpoint. As mentioned before, it is rare that we will know what the population standard deviation is. We will now take away the population standard deviations.

Just like before, since we don't know the population standard deviations (σ1 and σ2), we will use the sample standard deviations (s1 and s2). Also, we will use the t distribution, not the z distribution. There are some details that need to be addressed.

We have two separate cases to address: σ1 = σ2 and σ1 ≠ σ2.¹ We begin with σ1 = σ2.

10.2.1 Confidence intervals for µ1 − µ2 with σ1 = σ2

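Recall that $\sigma_{\bar X_1 - \bar X_2} = \sqrt{\sigma^2_{\bar X_1} + \sigma^2_{\bar X_2}}$. The issue we have with this is what we put in for the $\sigma$'s, which we don't know and are expected to be equal.

We have

\[ \sigma_{\bar X_1 - \bar X_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \]

If the standard deviations are the same we can drop the subscripts and we get:

\[ \sigma_{\bar X_1 - \bar X_2} = \sqrt{\frac{\sigma^2}{n_1} + \frac{\sigma^2}{n_2}} \]

We can factor out the common $\sigma$ to get

\[ \sigma_{\bar X_1 - \bar X_2} = \sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \]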

Now we need to estimate the standard deviation. What we do is find a weighted average of the variances, then take the square root to get the standard deviation. We get

\[ s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{n_1 - 1}{n_1 + n_2 - 2}\,s_1^2 + \frac{n_2 - 1}{n_1 + n_2 - 2}\,s_2^2 \]

Notice the weights of the variances: in the numerators are the degrees of freedom of the individual estimates. In the denominator are the degrees of freedom for our new estimate.
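To see the weighted average at work, here is a small sketch. The function name `pooled_variance` and the sample values are made up purely for illustration; nothing here comes from an example in the text.

```python
from math import sqrt

def pooled_variance(s1, s2, n1, n2):
    """Weighted average of the two sample variances, weighted by degrees of freedom."""
    df1, df2 = n1 - 1, n2 - 1
    w1 = df1 / (df1 + df2)   # weight on s1^2
    w2 = df2 / (df1 + df2)   # weight on s2^2; w1 + w2 = 1
    return w1 * s1**2 + w2 * s2**2

# Hypothetical sample standard deviations and sample sizes
s1, s2, n1, n2 = 4.0, 6.0, 10, 20
var_p = pooled_variance(s1, s2, n1, n2)

# The same quantity written as a single fraction, as in the first form of the formula
single_fraction = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
print(var_p, single_fraction, sqrt(var_p))
```

Since the weights add to 1, the pooled variance always lands between the two sample variances, and it sits closer to the variance from the larger sample.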

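If we put it all together we get the following:

If $X_1 \sim N$ and $X_2 \sim N$ then a $(1 - \alpha)$ confidence interval for $\mu_1 - \mu_2$ is given by

\[ (\bar X_1 - \bar X_2) \pm t_{\alpha/2}\, s_{\bar X_1 - \bar X_2} \]

(This is the form we expect.)

\[ (\bar X_1 - \bar X_2) \pm t_{\alpha/2}\sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}\,\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \]

(The actual formula to use)

Degrees of freedom, $df = n_1 + n_2 - 2$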

¹ This looks like H0 and H1 for some test. We will look at this later.

Example 10.2.1.

A restaurant owner has two fast food restaurants that sell burgers. On 12 randomly selected days, the north store used an average of 268.4 pounds of hamburger with a standard deviation of 35.6 pounds. On 10 randomly selected days, the south store used an average of 234.6 pounds of hamburger with a standard deviation of 29.6 pounds. It is reasonable that the amounts of hamburger used per day are normally distributed with equal population standard deviations. Construct a 95% confidence interval for the difference of the means of the amounts of hamburger used at the two restaurants.

Solution.

Collecting what is given we have

North                     South
$n_1 = 12$                $n_2 = 10$
$\bar X_1 = 268.4$        $\bar X_2 = 234.6$
$s_1 = 35.6$              $s_2 = 29.6$

$X$'s $\sim N$

$\sigma_1 = \sigma_2$

Confidence level = 95%

Select STAT > TESTS > 2-SampTInt

For Inpt, select Stats.

Input the statistics.

For C-Level, we want 95 (or .95; the calculator allows either).

Pooled: Select No if σ1 ≠ σ2, Yes if σ1 = σ2. For this problem, Yes.

Highlight Calculate and hit ENTER.

We select 2-SampTInt: 2-Samp refers to two population means, T refers to the distribution used, and Int is for interval.

We get (4.2945, 63.306)

If we check the output, the calculator gives back the values put in. In addition it gives the degrees of freedom, 20, and if we scroll down, we get Sxp=33.035. . . . This is the pooled standard deviation discussed above.

We are 95% confident that 4.3 < µ1 − µ2 < 63.3, which gives µ2 + 4.3 < µ1 < µ2 + 63.3.

Finally, we get the following.

We are 95% confident that the average amount of hamburger used at the north store is between 4.3 and 63.3 pounds more than the average amount of hamburger used at the south store.
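For readers who want to see what the calculator is doing, the pooled interval can be sketched in code. This is an illustration, not the calculator's own routine: the function name `pooled_t_interval` is hypothetical, and since Python's standard library has no t-table, the critical value $t_{.025} = 2.086$ for df = 20 is typed in by hand from a standard t-table.

```python
from math import sqrt

def pooled_t_interval(mean1, mean2, s1, s2, n1, n2, t_crit):
    """Return (df, pooled std. dev., lower, upper), assuming sigma1 = sigma2.

    t_crit is t_{alpha/2} for df = n1 + n2 - 2, supplied by the caller.
    """
    df = n1 + n2 - 2
    s_pooled = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)
    margin = t_crit * s_pooled * sqrt(1 / n1 + 1 / n2)
    point = mean1 - mean2
    return df, s_pooled, point - margin, point + margin

# Numbers from the restaurant example (north store first)
df, s_pooled, lower, upper = pooled_t_interval(268.4, 234.6, 35.6, 29.6, 12, 10, t_crit=2.086)
print(df, round(s_pooled, 3), round(lower, 1), round(upper, 1))
```

The degrees of freedom and pooled standard deviation should match the calculator output (20 and Sxp), and the endpoints should agree with the calculator's interval after rounding to one decimal place.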


10.2.2 Confidence intervals for µ1 − µ2 with σ1 ≠ σ2

The good news here is that we don't have the issue with putting the two standard deviations together. The bad news: the degrees of freedom. Not to worry, we include them here but we will rely on technology to find any interval or do any test.

\[ df = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(\dfrac{s_1^2}{n_1}\right)^2}{n_1 - 1} + \dfrac{\left(\dfrac{s_2^2}{n_2}\right)^2}{n_2 - 1}} \]

We certainly don't expect an integer here. As such, we will round this number down to the nearest integer, if needed.
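The degrees of freedom formula is nasty by hand but short in code. The sketch below is only an illustration (the function name `unequal_sd_df` is made up); it evaluates the formula above with the sample statistics from the restaurant data in Example 10.2.1 and then rounds down.

```python
from math import floor

def unequal_sd_df(s1, s2, n1, n2):
    """Degrees of freedom for the two-sample t procedures when sigma1 != sigma2."""
    v1 = s1**2 / n1   # estimated variance of the first sample mean
    v2 = s2**2 / n2   # estimated variance of the second sample mean
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

# Sample statistics from the restaurant data: s1 = 35.6, n1 = 12, s2 = 29.6, n2 = 10
df = unequal_sd_df(35.6, 29.6, 12, 10)
print(round(df, 3), floor(df))
```

Rounding down matters only when we look up a critical value in a table; the calculator works with the unrounded value.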

Our calculator will deal with the particulars. Let us revisit the example from before.

Example 10.2.2.

A restaurant owner has two fast food restaurants that sell burgers. On 12 randomly selected days, the north store used an average of 268.4 pounds of hamburger with a standard deviation of 35.6 pounds. On 10 randomly selected days, the south store used an average of 234.6 pounds of hamburger with a standard deviation of 29.6 pounds. It is reasonable that the amounts of hamburger used per day are normally distributed with unequal population standard deviations. Construct a 95% confidence interval for the difference of the means of the amounts of hamburger used at the two restaurants.

Solution.

As before we collect what is given. We have

North                     South
$n_1 = 12$                $n_2 = 10$
$\bar X_1 = 268.4$        $\bar X_2 = 234.6$
$s_1 = 35.6$              $s_2 = 29.6$

$X$'s $\sim N$

$\sigma_1 \ne \sigma_2$

Confidence level = 95%

We proceed exactly as before. We select 2-SampTInt. We input the data, but now we select Pooled: No. (Since σ1 ≠ σ2, we don't ‘pool’ our estimates because there isn't a common σ as before.)

We get (4.8036, 62.796)

The output echoes what we put in plus it gives the degrees of freedom to be 19.999. In this case the degrees of freedom are very close to the degrees of freedom from before.

We are 95% confident that 4.8 < µ1 − µ2 < 62.8, which gives µ2 + 4.8 < µ1 < µ2 + 62.8.

Finally, we get the following.

We are 95% confident that the average amount of hamburger used at the north store is between 4.8 and 62.8 pounds more than the average amount of hamburger used at the south store.


If you compare with the example with equal standard deviations, the intervals are close.

10.2.3 Hypothesis Tests of µ1 − µ2 with unknown σ’s

Regardless of whether the standard deviations are equal or not, we will proceed with the five steps we used before. We have one slight modification for the critical value approach with unequal standard deviations.

Example 10.2.3.

A walk-in medical clinic has two physicians working: Dr. Slowpoke and Dr. Speedy. The nursing staff complains to the management that Dr. Slowpoke is too slow. The management takes a sample of 13 patients of Dr. Slowpoke's and finds the mean time to be 15.7 minutes with a standard deviation of 5.65 minutes. A sample of 16 patients of Dr. Speedy takes a mean time of 12.6 minutes with a standard deviation of 4.69 minutes. Assume the times spent with patients for both physicians are approximately normal with equal standard deviations. Test if the mean time of Dr. Slowpoke is greater than the mean time for Dr. Speedy. Use a 5% level of significance. Use the p-value approach.

Solution.

Let us organize what we have:

Dr. Slowpoke              Dr. Speedy
$n_1 = 13$                $n_2 = 16$
$\bar X_1 = 15.7$         $\bar X_2 = 12.6$
$s_1 = 5.65$              $s_2 = 4.69$
$X_1 \sim N$              $X_2 \sim N$

$H_1: \mu_1 > \mu_2$ or, equivalently, $H_1: \mu_1 - \mu_2 > 0$

$\sigma_1 = \sigma_2$

α = 5%

1. $H_0: \mu_1 - \mu_2 = 0$
   $H_1: \mu_1 - \mu_2 > 0$

2. Use t because the $X$'s $\sim N$ and the $\sigma$'s are unknown.

3. t = 1.616 (From calculator, see below)

4. p-value = 0.0589 (From calculator, see below)

5. (Since the p-value is not less than α) There is not sufficient evidence that the average time Dr. Slowpoke spends with patients is greater than the mean time Dr. Speedy spends with patients.

Select STAT > TESTS > 2-SampTTest

For Inpt, select Stats.


Input the statistics

Select the appropriate alternative hypothesis (> µ2 here)

Pooled: Select Yes.

Highlight Calculate and hit ENTER.
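The calculator hides the arithmetic, so here is a rough sketch of where the pooled test statistic and p-value in Example 10.2.3 come from. This is not how the calculator is implemented: the helper names are made up, and because Python's standard library has no t distribution, the tail area is approximated by applying Simpson's rule to the t density. Treat it as an approximation under those assumptions.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Approximate P(T > t) for t >= 0 using Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3   # half the area lies to the right of 0

def pooled_t_statistic(mean1, mean2, s1, s2, n1, n2, null_diff=0.0):
    """Return (t, df) for the pooled two-sample t test (sigma1 = sigma2 assumed)."""
    df = n1 + n2 - 2
    s_pooled = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)
    t = ((mean1 - mean2) - null_diff) / (s_pooled * sqrt(1 / n1 + 1 / n2))
    return t, df

# Numbers from the clinic example (Dr. Slowpoke first)
t, df = pooled_t_statistic(15.7, 12.6, 5.65, 4.69, 13, 16)
p_value = t_upper_tail(t, df)
print(round(t, 3), df, round(p_value, 4))
```

The test statistic should match step 3, and the approximate tail area should land very close to the calculator's p-value in step 4.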

Let us redo the problem with unequal standard deviations. We will look at it from a critical value approach this time.

Example 10.2.4.

A walk-in medical clinic has two physicians working: Dr. Slowpoke and Dr. Speedy. The nursing staff complains to the management that Dr. Slowpoke is too slow. The management takes a sample of 13 patients of Dr. Slowpoke's and finds the mean time to be 15.7 minutes with a standard deviation of 5.65 minutes. A sample of 16 patients of Dr. Speedy takes a mean time of 12.6 minutes with a standard deviation of 4.69 minutes. Assume the times spent with patients for both physicians are approximately normal with unequal standard deviations. Test if the mean time of Dr. Slowpoke is greater than the mean time for Dr. Speedy. Use a 5% level of significance.

Solution.

As before, let us organize what we have:

Dr. Slowpoke              Dr. Speedy
$n_1 = 13$                $n_2 = 16$
$\bar X_1 = 15.7$         $\bar X_2 = 12.6$
$s_1 = 5.65$              $s_2 = 4.69$
$X_1 \sim N$              $X_2 \sim N$

$H_1: \mu_1 > \mu_2$ or $H_1: \mu_1 - \mu_2 > 0$

$\sigma_1 \ne \sigma_2$

α = 5%

1. $H_0: \mu_1 - \mu_2 = 0$
   $H_1: \mu_1 - \mu_2 > 0$

2. Use t because the $X$'s $\sim N$ and the $\sigma$'s are unknown.

Input the data in 2-SampTTest (‘Test’ not ‘Int’). Be sure to pick the appropriate alternative hypothesis. Since we are not assuming the standard deviations are equal, we select Pooled: No. And, since the degrees of freedom are not easy to calculate, we will do steps 3 and 4 at the same time. (Although you can't tell once it is on paper.)

3. df = 23.344. . . (From calculator), which we round down to df = 23. We get the critical value t = 1.714.

[Figure: t curve with a right-tail rejection region of area .05 beginning at the critical value t = 1.714.]

4. t = 1.584 (From calculator)

5. (Our test statistic is not in the rejection region) There is not sufficient evidence that the average time Dr. Slowpoke spends with patients is greater than the mean time Dr. Speedy spends with patients.

Although the numbers aren’t the same for both cases, they are close.
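For comparison with the pooled case, the unpooled test statistic from Example 10.2.4 can be sketched as follows. The function name `unpooled_t_statistic` is hypothetical, and the critical value 1.714 (df = 23, area .05 in the right tail) is entered by hand from the t-table because the standard library cannot look it up.

```python
from math import sqrt

def unpooled_t_statistic(mean1, mean2, s1, s2, n1, n2, null_diff=0.0):
    """Test statistic when sigma1 != sigma2: each sample keeps its own variance."""
    std_error = sqrt(s1**2 / n1 + s2**2 / n2)
    return ((mean1 - mean2) - null_diff) / std_error

# Numbers from the clinic example (Dr. Slowpoke first)
t = unpooled_t_statistic(15.7, 12.6, 5.65, 4.69, 13, 16)
t_crit = 1.714          # from the t-table, df = 23, right-tail area .05
print(round(t, 3), t > t_crit)
```

The only change from the pooled version is the standard error: the two sample variances are no longer combined into one estimate.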

How do we proceed if we don't know if the population standard deviations are equal or not? We could perform a hypothesis test on the equality of the variances. Instead, we will simply be conservative about the conclusions we make. Specifically, we will calculate intervals both ways and take the wider interval. For hypothesis tests, we will do both ways and take the larger p-value. Since the degrees of freedom are different depending on the way we attack the problem, we will report the test statistics both ways but use just the larger p-value to make any decision.

Example 10.2.5.

A walk-in medical clinic has two physicians working: Dr. Slowpoke and Dr. Speedy. The nursing staff complains to the management that Dr. Slowpoke is too slow. The management takes a sample of 13 patients of Dr. Slowpoke's and finds the mean time to be 15.7 minutes with a standard deviation of 5.65 minutes. A sample of 16 patients of Dr. Speedy takes a mean time of 12.6 minutes with a standard deviation of 4.69 minutes. Assume the times spent with patients for both physicians are approximately normal. Test if the mean time of Dr. Slowpoke is greater than the mean time for Dr. Speedy. Use a 5% level of significance.

Solution.

Not this problem again!

As before, let us organize what we have. Notice that this time we don't know about the equality of the standard deviations.

Dr. Slowpoke              Dr. Speedy
$n_1 = 13$                $n_2 = 16$
$\bar X_1 = 15.7$         $\bar X_2 = 12.6$
$s_1 = 5.65$              $s_2 = 4.69$
$X_1 \sim N$              $X_2 \sim N$

$H_1: \mu_1 > \mu_2$ or $H_1: \mu_1 - \mu_2 > 0$

$\alpha = 5\%$


1. $H_0: \mu_1 - \mu_2 = 0$
   $H_1: \mu_1 - \mu_2 > 0$

2. Use t because the $X$'s $\sim N$ and the $\sigma$'s are unknown.

3. If not pooled, then t = 1.584. If pooled, then t = 1.616.

4. If not pooled, then p-value = .0633. If pooled, then p-value = .0589. We will use p-value = .0633.

5. (Since the p-value is not less than α) There is not sufficient evidence that the average time Dr. Slowpoke spends with patients is greater than the mean time Dr. Speedy spends with patients.
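The conservative rule (wider interval, larger p-value) is simple enough to write down in code. The sketch below uses hypothetical helper names and plugs in results already reported in this section: the two p-values from this example and the two calculator intervals from Examples 10.2.1 and 10.2.2.

```python
def conservative_p_value(p_pooled, p_unpooled):
    # The larger p-value makes it harder to reject H0
    return max(p_pooled, p_unpooled)

def wider_interval(interval_a, interval_b):
    # Pick the interval with the greater width (upper minus lower)
    return max(interval_a, interval_b, key=lambda iv: iv[1] - iv[0])

alpha = 0.05
p_value = conservative_p_value(0.0589, 0.0633)                 # clinic example
interval = wider_interval((4.2945, 63.306), (4.8036, 62.796))  # restaurant example
print(p_value, p_value < alpha, interval)
```

For the clinic data the unpooled p-value is the larger one, while for the restaurant data the pooled interval is the wider one, so the conservative choice is not always the same method.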

10.2.4 Exercises

1. A marine biologist is investigating the sizes of fish in two different large reefs. The biologist traps, measures, then releases 13 fish from the northern reef and finds the average length of the fish is 35.6 cm with a standard deviation of 6.45 cm. At the southern reef a sample of 18 fish is measured and the average length is found to be 29.8 cm with a standard deviation of 6.12 cm. The biologist is confident that the lengths of fish at the two reefs are approximately normal with equal population standard deviations.

(a) Construct a 95% confidence interval for the difference of the mean lengths of fish at the two reefs.

(b) Test, at a 5% level of significance, if the mean lengths of the fish at the two reefs are not equal.

(c) Redo the problem assuming the population standard deviations are not equal.

2. The manager of a business is planning on replacing its soon to be dead printer. There are two models of printer the manager is looking at purchasing: the Speedy-Ink printer and the Rapid-Scribe printer. A file is sent to each printer several times and the times are recorded. The file is sent 12 times to the Speedy-Ink printer and the average time is found to be 16.54 seconds with a standard deviation of 2.31 seconds. The file is sent to the Rapid-Scribe printer 10 times and the average time was 15.26 seconds with a standard deviation of 2.03 seconds. It is reasonable that the times to print the file on the two printers are approximately normally distributed with equal population standard deviations.

(a) Construct a 95% confidence interval for the difference of the mean times for the two printers.

(b) Test, at a 5% level of significance, if the mean time for the Speedy-Ink printer is more than the mean time of the Rapid-Scribe printer.

(c) Redo the problem assuming the population standard deviations are not equal.

3. A new post surgery wound therapy is being investigated by a surgeon. The surgeon uses the old therapy for 14 patients and finds the wound is completely healed after an average of 18.2 days with a standard deviation of 3.16 days. On 15 patients, the surgeon applies the new therapy and finds the mean time to complete healing to be 16.8 days with a standard deviation of 2.68 days.

(a) Construct a 98% confidence interval for the difference of the average healing times for the two therapies.

(b) Test, at a 1% level of significance, if the mean time for the new therapy is less than the mean time for the old therapy.

(c) Redo the problem assuming the population standard deviations are not equal.

4. Two different types of light bulbs each use 15 watts of power. The Eco-Bulb has an average illumination of 1023 lumens and a standard deviation of 136.5 lumens for a sample of 16 bulbs. For the Green-Plan bulb, the average is 986 lumens with a standard deviation of 103.2 lumens for 13 bulbs. The illuminations of the two types of bulbs are normally distributed with equal standard deviations.

(a) Construct a 90% confidence interval for the difference of the mean illumination of the two types of bulbs.

(b) Test, at a 10% level of significance if the mean illuminations of the two bulbs are different.

(c) Redo the problem assuming the population standard deviations are not equal.

5. The maker of duct tape is comparing two different formulations of the adhesive. The tape is applied to a standard surface and then pulled until the tape fails. A sample of 10 of the original formulation of adhesive had an average fail force of 3.26 pounds with a standard deviation of 0.26 pounds. A sample of 9 of the new formulation of adhesive had an average fail force of 3.46 pounds with a standard deviation of 0.22 pounds. The fail forces for the two different adhesives are approximately normally distributed.

(a) Construct a 95% confidence interval for the difference of the average fail forces for the two adhesive formulations.

(b) Test, at a 5% level of significance, if the average fail force of the new adhesive is greater than the average fail force for the original adhesive.

(c) Redo the problem assuming the population standard deviations are not equal.

6. The owner of two stores has conducted a survey of its customers to measure the customers' satisfaction. The customers rate the stores from 1 to 5, 5 being the best. A sample of 28 customers at the original store had a mean satisfaction of 4.15 with a standard deviation of 1.35. The new store had 21 customers rate the store an average of 4.06 with a standard deviation of 1.13. Assume the population standard deviations are approximately equal and the populations from which the samples come are approximately normal.

(a) Construct a 99% confidence interval for the difference of the mean satisfaction for the two stores.

(b) Test, at a 5% level of significance, if the mean satisfaction for the old store is greater than the mean for the new store.

(c) Redo the problem assuming the population standard deviations are not equal.


7. The Sleep More mattress company claims that people who use their mattress get a better night’s sleep. Sleeping Dead, their competitor, wants to test the claim. We will take a ‘better night’s sleep’ to mean that they sleep more, on average. A sample of 11 people slept on the Sleep More mattress for a month and the average amount of sleep they got was 7.46 hours per night with a standard deviation of 2.64 hours per night. A sample of 13 people slept on Sleeping Dead mattresses for a month and the average amount of sleep they got was 7.29 hours per night with a standard deviation of 2.34 hours per night. The times that people sleep each night with the mattresses are approximately normal with equal standard deviations.

(a) Construct a 90% confidence interval for the difference of the mean times that people sleep each night with the two mattresses.

(b) Test, at a 5% level of significance if the claim of the Sleep More manufacturer is true.

(c) Redo the problem assuming the population standard deviations are not equal.

8. A farmer grows almonds in two different orchards. At the valley orchard, a sample of 5 trees is randomly selected, carefully harvested, and the yield is measured. For the 5 trees a mean of 195 pounds per tree was harvested with a standard deviation of 36.5 pounds. At the foothill orchard, 7 trees were similarly selected and the mean harvest was 168 pounds with a standard deviation of 31.6 pounds. The farmer knows from past experience that trees within an orchard have yields that are approximately normally distributed. We will further assume the population standard deviations are approximately equal.

(a) Construct a 99% confidence interval for the difference of the mean yields for the two orchards.

(b) Test, at a 5% level of significance, if the mean yield for the valley orchard is greater than the mean for the foothill orchard.

(c) Redo the problem assuming the population standard deviations are not equal.

9. The amount of sodium in two different brands of soup is being investigated: LoSo and NoFlav soups. A random sample of 18 cans of LoSo soup is measured to have a mean of 236 mg with a standard deviation of 26.8 mg. A random sample of 15 cans of NoFlav soup was taken and the mean was found to be 201 mg with a standard deviation of 32.1 mg. Assume the distributions of sodium in the two brands of soup are approximately normal.

(a) Construct a 99% confidence interval for the difference of the mean sodium levels for the two soups.

(b) Test, at a 5% level of significance if the mean sodium levels in the two soups are different.

10. A farmer has maple trees that produce sap for maple syrup. The farmer randomly selects 20 trees of approximately the same size and divides them into two groups of ten. For the first group, the farmer sings to them each night for 30 minutes. The mean yield for these ten trees was 12.5 gallons of sap with a standard deviation of 2.35 gallons. For the trees that the farmer didn’t sing to, the mean yield was 13.5 gallons with a standard deviation of 2.64 gallons.

(a) Construct a 99% confidence interval for the difference of the mean yields for all trees in regards to singing.


(b) Test, at a 5% level of significance, if the mean sap production for trees that are sung to is greater than the mean of trees that aren’t sung to.

11. To measure the clarity of Lake Tahoe, a disk, called a Secchi disk, is lowered until it can no longer be seen. The Sacramento Bee reported that in 2017 the average depth for a sample of locations was 60.4 feet. In 2018 the average was 70.9 feet. Assume the sample standard deviations are 3.51 and 4.21 feet, respectively. Furthermore, assume that 24 measurements were made each year and the populations of depths where the disk can be seen are approximately normal.

(a) Construct a 99% confidence interval for the difference of the mean visibility for the two years.

(b) Test, at a 1% level of significance if the mean visibility improved from 2017 to 2018.

12. An avid golfer is checking out the driving distance of two different golf balls: the GoFar ball and the Zoom ball. A random sample of 12 GoFar balls is put on a machine that mimics a golfer’s swing and the balls go an average of 216.4 yards with a standard deviation of 6.96 yards. A random sample of 12 Zoom balls is similarly hit and they go an average of 229.8 yards with a standard deviation of 8.94 yards. Assume the populations of distances for the two types of balls are approximately normal.

(a) Construct a 95% confidence interval for the difference of the mean distance of the two golf balls.

(b) Test, at a 2.5% level of significance, if the Zoom golf ball goes further, on average, than the GoFar balls.


10.3 Confidence Intervals and Hypothesis Tests of µd

For the hypothesis tests and confidence intervals we have done for two populations, we need to make sure that our samples are independent of one another. In other words, what gets picked for the sample from one population has nothing to do with how the sample from the second population is picked. This section removes the independence requirement but does so in a prescribed way. If our samples are dependent, we want to exploit the dependence.

If we wished to determine the effectiveness of a weight loss program, we could simply take a sample of people who are about to begin the weight loss program, take an independent sample of people who are finishing the weight loss program, and then construct a confidence interval or do a hypothesis test of $\mu_1 - \mu_2$. DON’T. Although statistically valid, it is a terrible way to construct an interval. What we want to do instead is take a sample of people, send them through the diet, and then calculate the change in weight for each person. The only data values we will actually have are the weights of the individuals. We will construct the data set by subtracting the weights of the individuals to determine how much weight they lost. From there, we can proceed as we did with inferences on $\mu$.

As mentioned above, we wish to exploit the dependence. This exploitation results in a reduction in the variance. Consider the weight loss idea above. If we look at the standard deviations of people’s weights before and after the program, a reasonable value might be in the neighborhood of 50 pounds. Now consider the standard deviation of the weight losses. This will probably be only a few pounds. Since our standard deviation is much smaller when looking at the differences, our test will be much more sensitive to differences.

Before we proceed with an example, let us look at some new notation.

$d$ = the difference of the paired data values (often ‘before − after’ or vice versa).
$\bar{d}$ = the mean of the $d$’s in our sample.
$s_d$ = the sample standard deviation of the $d$’s in our sample.
$\sigma_d$ = the population standard deviation of all $d$’s.

Before          Now
$X$             $d$
$\bar{X}$       $\bar{d}$
$\mu$           $\mu_d$
$s$             $s_d$
$\sigma$        $\sigma_d$
$s_{\bar{X}}$   $s_{\bar{d}}$

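Our confidence intervals and hypothesis tests will use the same formulas as before with the appropriate statistics put in.

Instead of

$$\bar{X} \pm t_{\alpha/2}\frac{s}{\sqrt{n}} \quad \text{and} \quad t = \frac{\bar{X} - \mu}{s/\sqrt{n}}$$

we have

$$\bar{d} \pm t_{\alpha/2}\frac{s_d}{\sqrt{n}} \quad \text{and} \quad t = \frac{\bar{d} - \mu_d}{s_d/\sqrt{n}}$$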


In order for these formulas to be valid we need either the population to be normally distributed ($X \sim N$ or $d \sim N$) or $n > 30$.

In this section we will be dealing with data. As such we are being kind, sorta, and looking only at small data sets, which will require $d \sim N$.

10.3.1 Confidence Intervals of µd

Example 10.3.1.

A random sample of 6 people is put through a weight loss program. Their weights before and after are given in the table. Assume the population of paired differences is approximately normal.

1. What is the point estimate for the average weight loss by all people who use this diet?

2. Find the 95% margin of error for this estimate.

3. Construct a 95% confidence interval for the average weight loss for all people who go on this diet.

Weight before (kg)   102.3   84.8   76.7   123.9   94.0   138.4
Weight after (kg)    100.0   79.1   76.9   120.3   89.8   136.5

Solution.

Before we proceed we need to realize that the first value in each row, 102.3 and 100.0, are for the same person. The same is true for the next two, etc. This means our samples are paired. They are not independent.

To proceed with this we need to find the weight loss of each person. Simply subtract the after weight from the before weight.

Weight before, B (kg)   102.3   84.8   76.7   123.9   94.0   138.4
Weight after, A (kg)    100.0   79.1   76.9   120.3   89.8   136.5
d = B − A                 2.3    5.7   −0.2     3.6    4.2     1.9

Notice that we have subtracted before − after. We could have instead subtracted after − before. The only difference is that the signs on the $d$’s will all change; our interpretation will remain the same.
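The reduction in variance that comes from pairing can be checked numerically on this data. The following is a minimal Python sketch, offered as an illustration alongside the calculator workflow this text uses; the variable names are made up for the example.

```python
from statistics import stdev

# Weights from Example 10.3.1, in kg; position i in each list is the same person.
before = [102.3, 84.8, 76.7, 123.9, 94.0, 138.4]
after = [100.0, 79.1, 76.9, 120.3, 89.8, 136.5]

# Paired differences, d = before - after (a positive d is a weight loss).
# Rounding to one decimal removes floating-point noise from the subtraction.
d = [round(b - a, 1) for b, a in zip(before, after)]

print("d =", d)
print("s of before weights:", round(stdev(before), 2))
print("s of after weights: ", round(stdev(after), 2))
print("s of differences:   ", round(stdev(d), 2))
```

The standard deviation of the differences comes out far smaller than that of either row of weights, which is the effect described at the start of this section.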

Since we have data and we want to calculate the mean, standard deviation, etc., it is recommended that we use our calculator. To use the calculator, do the following:

• Go to your statistical editor (STAT > EDIT > 1:Edit)

• Remove data in L1 and L2, if necessary

• Input one sample in L1

• Input the other sample in L2

• Using the arrow keys, move the cursor to highlight L3

• Type in L1-L2 (or L2-L1)

• Use the data in L3 as needed: 1-Var Stats, TInterval, TTest

1. The point estimate of the average weight loss by all people ($\mu_d$) is simply $\bar{d}$. This is, from the calculator, 2.916. . . .

In words, people lose an average of 2.92 kg on this diet.

2. To get the margin of error, we need $t_{.025}$ with $df = 5$. From our table we get $t_{.025} = 2.571$.

The margin of error is

$$\pm t_{\alpha/2}\frac{s_d}{\sqrt{n}} = \pm 2.571 \times \frac{2.0488}{\sqrt{6}} = \pm 2.15$$

Including units, the margin of error is ±2.15 kg. This means that we are 95% sure that our estimate of the average weight loss of all people is within 2.15 kg of the actual value.

3. Once we have the point estimate and the margin of error, the confidence interval is very straightforward: point estimate ± margin of error. We could also use the TInterval on our calculator.

Our confidence interval is 2.92 ± 2.15, or 0.77 to 5.07. We are 95% confident that the average weight loss by all people who use this diet is between 0.77 kg and 5.07 kg.
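For readers who want to check the three answers without a calculator, here is a short Python sketch of the same computation. It is only an illustration: the critical value is copied from the t table as in the solution (Python's standard library has no t distribution), and the variable names are made up for the example.

```python
from math import sqrt
from statistics import mean, stdev

# Paired differences (before - after) from Example 10.3.1, in kg.
d = [2.3, 5.7, -0.2, 3.6, 4.2, 1.9]
n = len(d)

d_bar = mean(d)   # point estimate of mu_d
s_d = stdev(d)    # sample standard deviation of the differences

# Critical value t_.025 with df = n - 1 = 5, read from the t table.
t_crit = 2.571

margin = t_crit * s_d / sqrt(n)
lower, upper = d_bar - margin, d_bar + margin

print(f"point estimate:  {d_bar:.2f} kg")
print(f"margin of error: {margin:.2f} kg")
print(f"95% CI:          {lower:.2f} kg to {upper:.2f} kg")
```

Swapping in a different list of differences and the matching table value for `t_crit` handles the exercises at the end of this section in the same way.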

10.3.2 Hypothesis Tests of µd

Example 10.3.2.

Let us consider the weight loss program from above. Test, using a 5% significance level, if, on average, people who go through the program lose weight.

Solution.

What we are looking at is the average weight loss. As before, the losses are obtained from the differences. As above we will let $d$ = before − after. In this case, our alternative hypothesis is $H_1: \mu_d > 0$. If we had subtracted after − before, our alternative hypothesis would be $H_1: \mu_d < 0$ and our right-tailed test would become a left-tailed test. It does not matter which way we choose. We simply must be consistent with whatever we choose.

What we have is:

Data values
$H_1: \mu_d > 0$
$d \sim N$
$\alpha = .05$

We can use either the critical value or p-value approach. In this example we choose to use the p-value approach.


1. $H_0: \mu_d = 0$
   $H_1: \mu_d > 0$

2. Use $t$ because $d \sim N$ and $\sigma_d$ is unknown.

Since we have data we are going to use our calculator. Make sure the differences are in a list (probably L3) and run a TTest using data. Be sure to specify the correct list and make sure your alternative hypothesis is correctly chosen. See the bulleted list earlier in this section.

3. t=3.487 (From calculator)

4. p = .0088 (From calculator)

5. There is sufficient evidence that, on average, people who use this weight loss program lose weight.
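The test statistic reported by the calculator can be reproduced from the formula $t = (\bar{d} - \mu_d)/(s_d/\sqrt{n})$. The sketch below is an illustration only. Because Python's standard library has no t distribution, it uses the critical value approach instead of the p-value approach chosen above; the value 2.015 for $t_{.05}$ with $df = 5$ is taken from a standard t table and is an assumption in the sense that it does not appear in this example.

```python
from math import sqrt
from statistics import mean, stdev

# Paired differences (before - after) from Example 10.3.1, in kg.
d = [2.3, 5.7, -0.2, 3.6, 4.2, 1.9]
n = len(d)

mu_0 = 0  # value of mu_d under H0

# Test statistic for the paired t test.
t_stat = (mean(d) - mu_0) / (stdev(d) / sqrt(n))

# Right-tail critical value t_.05 with df = 5 (assumed, from a standard t table).
t_crit = 2.015

print(f"t = {t_stat:.3f}")
print("reject H0" if t_stat > t_crit else "do not reject H0")
```

Both approaches lead to the same decision here, since a test statistic beyond the critical value corresponds to a p-value below α.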

10.3.3 Exercises

1. Because of turning on a motorcycle, the front tire wears faster than the back tire. A random sample of 7 motorcycles is outfitted with new tires. After 10,000 miles, the tread is checked for wear. The amount of tread remaining is shown in the table. Assume the population of paired differences is approximately normally distributed.

Rear tread (inches)    .74  .64  .54  .61  .55  .60  .71
Front tread (inches)   .69  .60  .46  .57  .51  .49  .67

(a) What is the point estimate for the average amount more tread the front tire wears down than the rear tire?

(b) Find the 95% margin of error for this point estimate.

(c) Construct a 90% confidence interval for the average difference of the tread used between the two tires.

2. A physical therapist is investigating the differences in the biceps of dominant arms and non-dominant arms. It is expected that the dominant arm will be larger. A random sample of 8 adults was selected and the circumferences of their dominant and non-dominant biceps were measured. The data follow. Assume that the differences of the circumferences of all adults are normally distributed.

Dominant arm (mm)       235  326  319  294  354  374  291  354
Non-dominant arm (mm)   228  318  314  288  351  365  290  350

(a) What is the point estimate for the average amount the dominant arm’s circumference is greater than the non-dominant arm’s circumference?

(b) Find the 95% margin of error for this point estimate.


(c) Construct a 95% confidence interval for the average amount the dominant arm’s circumference is greater than the non-dominant arm’s circumference.

3. A hospital is planning on replacing their records computer system. They are examining two systems: EZ Patient and Patient Stat. They take a sample of 12 nurses and have the nurses use both systems to input a new patient. The time, in seconds, it takes to input is recorded.

EZ Patient     152  168  126  194  167  148
Patient Stat   148  138  102  186  165  141

EZ Patient     181  170  133  184  181  166
Patient Stat   175  167  134  191  169  154

Assume that the differences of times for all nurses are approximately normal.

(a) What is the point estimate for the mean amount of time the Patient Stat saves over the EZ Patient to input a new patient?

(b) Find the 95% margin of error for this point estimate.

(c) Construct a 95% confidence interval for the mean amount of time the Patient Stat saves over the EZ Patient to input a new patient.

4. A drug company is investigating the effectiveness of its drug in reducing high blood pressure. A random sample of 8 patients with high blood pressure is selected and each patient is given the drug for one month. The systolic blood pressure is measured before the patients begin and then at the end of the study. The results are given.

Before (mm Hg)   143  153  144  164  170  155  161  182
After (mm Hg)    152  142  136  152  157  140  152  163

Assume that the change in blood pressure of all adults who use this drug is approximately normal.

(a) What is the point estimate for the mean decrease in blood pressure for all people who use the drug?

(b) Find the 90% margin of error for this point estimate.

(c) Construct a 90% confidence interval for the mean decrease in blood pressure for all people who use the drug.

5. The manufacturer of two different golf balls is investigating how much further its premium ball travels than its economy ball when players use their driver. A machine tests it out on 8 randomly selected drivers. The distances, in yards, are given.

Premium   234  259  219  234  227  248  229  266
Economy   224  251  208  230  221  237  223  255

Assume that the increase in distance of all drivers is approximately normal.

(a) What is the point estimate for the mean increase in distance using the premium ball?

(b) Find the 90% margin of error for this point estimate.

(c) Construct a 90% confidence interval for the mean increase in distance using the premium ball.


6. As part of a sobriety exercise, 9 adults who feel that they are unimpaired by one drink are enrolled in an experiment. They are to perform a driving test in a simulator where a small child runs in front of the car. The time, in milliseconds, to react is recorded before they have any alcohol and then the time to perform with alcohol.

No Alcohol     295  325  319  284  288  276  306  318  304
With Alcohol   308  364  361  301  299  291  349  364  361

(a) What is the point estimate for the average increase in reaction time for people after ingesting alcohol?

(b) Give the 99% margin of error for this point estimate.

(c) Construct a 99% confidence interval for the average increase in reaction time for people after ingesting alcohol.

7. Every day a commuter takes the same route. The commuter hears about an alternate route. The commuter asks a friend, who drives just like our commuter, to drive the alternate route at the same time the commuter takes their traditional route. The commute times, in minutes, are reported for 6 days as:

Traditional route   36.5  35.4  33.1  38.9  31.5  30.8
Alternative route   33.4  34.0  32.9  38.4  31.6  28.6

Assuming the differences of the times are approximately normal, test if the alternative route takes less time, on average, than the traditional route. Use a 5% level of significance.

8. Six families are selected at random. From each family, one parent and one child are selected. The times, in hours, spent using smart media for one day for each pair are in the table. Assuming the population of paired differences is approximately normal, test if the times that parents and children spend using smart media are different.

Child    3.53  2.25  1.19  3.80  4.22  1.95
Parent   2.87  2.67  2.34  3.29  3.97  1.67

9. A large building has two heaters which are designed to equally share the work of heating the building. The amount of energy, in kWh, each heater uses is recorded for 8 days and shown in the table. Assume the population of paired differences is approximately normal. Test, using a 1% significance level, if the average energy used by the two heaters is not the same.

Heater A   126  135  146  128  155  205  196  157
Heater B   136  151  153  128  167  221  216  184

10. A gasoline additive is supposed to improve mileage. A consumer testing organization is investigating. A random sample of 7 cars is selected, driven on one tank of gas without the additive and then driven for one tank with the additive. The miles per gallon for the cars follow. It is reasonable that the change in mileage for cars using the additive will be approximately normal. Is there sufficient evidence that, on average, the additive improves mileage?

Without additive   22.4  19.2  34.3  21.7  30.1  36.3  21.9
With additive      23.5  19.8  35.6  22.8  30.7  36.4  22.8


11. A pianist is trying hypnosis to help improve speed while playing. The pianist plays 8 different pieces at as fast a pace as possible. The times are recorded and then the pianist is hypnotized and given a suggestion to help with speed. The pieces are again played as fast as possible and the times are recorded. Assume that the change in the times after hypnotism is approximately normal. Test at the 1% level of significance if the times after hypnotism, on average, are less than the times before hypnotism. The times, in seconds, follow.

Before   96.4  103.5  92.4  151.4  103.5  123.5  116.5  81.5
After    92.3  102.7  92.8  150.1  101.2  123.7  114.6  80.1

12. An SAT prep course promises that students who take their course will score more than 200 points better on their SAT after taking the course. A random sample of 8 students take the SAT, then take the course, and then take the SAT again. The results follow. Assuming the changes in scores are approximately normally distributed, test if the claim of the course is true. Use a 5% level of significance.

Before   1206  1035  1264  1326  1468  1210   956  1108
After    1450  1238  1495  1556  1590  1446  1359  1409

13. A nutritionist is investigating weight gain in adults. A random sample of 7 adults was selected. They were each fed an additional 1000 calories per day. After one month, the weights, in pounds, were taken. Test, using a 5% significance level, if the average weight gain is different from 5 pounds. Assume the weight gains are approximately normally distributed.

Before   164.1  129.6  216.8  167.4  154.3  184.5   94.4
After    167.2  133.8  218.9  170.2  159.2  193.6  100.5

14. The medical director is investigating the times it takes for patients to get an X-ray at two different facilities: one located in the north part of town and the other in the south part of town. Seven patients with different needs are sent to the two different X-ray facilities. The times required from the time the patient walks into the room until they walk out are recorded. The times, in minutes, are given in the table below. Using a 5% level of significance, test if the south facility saves patients, on average, more than one minute while getting an X-ray compared with patients using the north facility. Assume the differences of the times are approximately normal.

North Facility   18.6  15.9  22.6  13.5  10.9  6.4  9.7
South Facility   15.4  12.8  21.3  13.4   9.2  5.5  8.8


10.4 Confidence Intervals and Hypothesis Tests of p1 − p2

As with the section on inferences of $\mu_1 - \mu_2$ with the t-distribution, this section will rely on the calculator to do the calculations. We will discuss the derivations of the formulas but rely on the calculator to do most of the actual calculations.

10.4.1 Confidence intervals for p1 − p2

Example 10.4.1.

A random sample of 800 republicans showed that 463 are in favor of a proposition. A random sample of 820 democrats showed that 610 are in favor of the proposition. Find and interpret the point estimate for the difference in proportions of those groups in favor of the proposition.

Solution.

Let us organize what we have and what we want.

Republican     Democrat
$n_1 = 800$    $n_2 = 820$
$X_1 = 463$    $X_2 = 610$

Notice that we have $X$, not $\bar{X}$. The 463 and 610 are counts, not averages. There were 463 or 610 objects selected that fit a criterion.

Our goal here is to estimate $p_1 - p_2$. Since we don’t know the proportion, or percent, of all republicans and democrats in favor of the proposition, we will estimate it with $\hat{p}_1 - \hat{p}_2$.

Recall that $\hat{p} = \frac{X}{n}$. Adding subscripts we get

$$\hat{p}_1 = \frac{X_1}{n_1} = \frac{463}{800} = 0.57875$$

and

$$\hat{p}_2 = \frac{X_2}{n_2} = \frac{610}{820} = 0.74390$$

Yielding

$$\hat{p}_1 - \hat{p}_2 = 0.57875 - 0.74390 = -0.16515$$

Our point estimate is $\hat{p}_1 - \hat{p}_2 = -0.16515$, which we interpret by stating that the proportion of republicans in favor of the proposition is 0.165 less than the proportion of democrats in favor of the proposition.

We could have stated that the proportion of democrats in favor of the proposition is 0.165 morethan the proportion of republicans in favor of the proposition.

We also can write this in terms of percentages. We would say that the percentage of republicans in favor of the proposition is 16.5 percentage points less than the percentage of democrats in favor of the proposition. Note that we can’t say ‘. . . 16.5% fewer republicans . . .’. We could say this if the population sizes of republicans and democrats were the same ($N_1 = N_2$). This is not a realistic expectation when looking at two different populations. We are stuck with ‘percentage points’.
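The arithmetic for the point estimate is short enough to check by hand, but a small Python sketch makes the distinction between counts ($X$), sample proportions, and percentage points concrete. The variable names are illustrative only.

```python
# Counts and sample sizes from Example 10.4.1.
n1, x1 = 800, 463   # republicans sampled, number in favor
n2, x2 = 820, 610   # democrats sampled, number in favor

# Sample proportions: each is a count divided by its own sample size.
p1_hat = x1 / n1
p2_hat = x2 / n2

# Point estimate of p1 - p2.
diff = p1_hat - p2_hat

print(f"p1_hat = {p1_hat:.5f}")
print(f"p2_hat = {p2_hat:.5f}")
print(f"p1_hat - p2_hat = {diff:.5f}")
print(f"as percentage points: {100 * diff:.1f}")
```

A negative difference simply reflects the order of subtraction: the first group's sample proportion is the smaller one.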


Example 10.4.2.

Find the 95% margin of error of the difference of the two population proportions in the above problem.

Solution.

It is reasonable that the margin of error would be $\pm z_{\alpha/2}\,\sigma_{\hat{p}_1 - \hat{p}_2}$.

Since we are looking at inferences, we need to make sure certain assumptions are met.

If we have $np > 5$ and $nq > 5$, the sampling distribution of $\hat{p}$ will be approximately normal. (Recall $q = 1 - p$.)

If $n_1 p_1 > 5$ and $n_1 q_1 > 5$ we have

$$\hat{p}_1 \sim N\left(p_1, \sqrt{\frac{p_1 q_1}{n_1}}\right)$$

Also, if $n_2 p_2 > 5$ and $n_2 q_2 > 5$ we have

$$\hat{p}_2 \sim N\left(p_2, \sqrt{\frac{p_2 q_2}{n_2}}\right)$$

If both are true, that is, the $np$’s $> 5$ and the $nq$’s $> 5$, we can put them together to get

$$\hat{p}_1 - \hat{p}_2 \sim N\left(p_1 - p_2, \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}\right)$$

Since we don’t know what the population proportions are, we will substitute in the sample proportions for the margin of error. This yields

$$E = \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}$$

$z_{.025} = 1.960$ and we get

$$E = \pm 1.960\sqrt{\frac{.57875 \times .42125}{800} + \frac{.7439 \times .2561}{820}} = \pm 0.0454$$

Our margin of error is .045 or 4.5 percentage points.

Example 10.4.3.

Construct a 95% confidence interval for the above example.

Solution.

Most of the work is done in the previous solutions. Since our point estimate, $\hat{p}_1 - \hat{p}_2$, is approximately normally distributed, we have a confidence interval of the form $P \pm E$ where $P$ is the point estimate and $E$ is the margin of error. Or put all together we get

$$\hat{p}_1 - \hat{p}_2 \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}$$

$$P \pm E = -0.16515 \pm 0.04542 = -0.2106 \text{ to } -0.1197$$

For our statement we will expand on the statement from above. We are 95% confident that the percent of republicans in favor of the proposition is between 12.0 and 21.1 percentage points less than the percentage of democrats in favor of the proposition.

Our confidence interval for the difference of two population proportions then becomes:

A $(1 - \alpha)100\%$ confidence interval for $p_1 - p_2$ is given by

$$\hat{p}_1 - \hat{p}_2 \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}$$

provided $\hat{p}_1 \sim N$ and $\hat{p}_2 \sim N$.

Notice that we have replaced the population proportions with the sample proportions. If we know the population proportions, we simply calculate the difference, $p_1 - p_2$.
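The boxed formula translates directly into a few lines of code. The following Python sketch applies it to the republican and democrat samples from Example 10.4.1; the function name `two_proportion_interval` and its argument order are hypothetical choices made for this illustration, and the critical value is passed in from the z table.

```python
from math import sqrt

def two_proportion_interval(x1, n1, x2, n2, z_crit):
    """Return (estimate, margin, lower, upper) for a confidence interval of p1 - p2."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    q1_hat, q2_hat = 1 - p1_hat, 1 - p2_hat
    estimate = p1_hat - p2_hat
    # Unpooled standard error: each sample contributes its own p_hat * q_hat / n.
    std_error = sqrt(p1_hat * q1_hat / n1 + p2_hat * q2_hat / n2)
    margin = z_crit * std_error
    return estimate, margin, estimate - margin, estimate + margin

# 463 of 800 republicans and 610 of 820 democrats in favor; z_.025 = 1.960.
estimate, margin, lower, upper = two_proportion_interval(463, 800, 610, 820, z_crit=1.960)

print(f"point estimate:  {estimate:.5f}")
print(f"margin of error: {margin:.4f}")
print(f"95% CI:          {lower:.4f} to {upper:.4f}")
```

The same function can be reused for the exercises by changing the counts and using the critical value that matches the requested confidence level.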

10.4.2 Hypothesis Tests of p1 − p2

In order to conduct a hypothesis test of the proportions, we need to make sure that the $\hat{p}$’s $\sim N$. For this we need the $np$’s $> 5$ and the $nq$’s $> 5$. Although we can’t check this directly, it is sufficient to check that the $n\hat{p}$’s $\gg 5$² and the $n\hat{q}$’s $\gg 5$.

Recall that when we do a hypothesis test we assume the null hypothesis is true. For this section, our null hypothesis will always be $H_0: p_1 = p_2$.

This makes the test statistic a bit different than expected. We might expect our test statisticto be

$$z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}}$$

(What’s wrong with this?)

Since our null hypothesis states that the proportions are the same, we need to follow this assumption through until we have evidence that the assumption is false. But if you examine the denominator of the incorrect test statistic above you will notice that we have two different estimates for $p_1$ and $p_2$. We must use the same estimate. To do this we will ‘pool’ the estimates.

What the null hypothesis says is that the two populations are indistinguishable with respect to the characteristic in question. In the confidence interval example, it means that republicans and democrats are the same when it comes to the mentioned proposition. Let us examine this with an example.

Example 10.4.4.

² ‘$\gg$’ means ‘a lot more than’


Many adults read for pleasure. A random sample of 865 adults in their 50’s showed that 614 of them read for pleasure. Another sample of 930 adults in their 60’s showed that 620 of them read for pleasure. Test, using a 5% level of significance, if the percentage of adults in their 50’s who read for pleasure is greater than the percent of adults in their 60’s who read for pleasure.

Solution.

Let us organize what we have.

50’s           60’s
$n_1 = 865$    $n_2 = 930$
$X_1 = 614$    $X_2 = 620$

$H_1: p_1 > p_2$, equivalently, $H_1: p_1 - p_2 > 0$
$\alpha = 5\%$

1. $H_0: p_1 - p_2 = 0$
   $H_1: p_1 - p_2 > 0$

2. Use $z$ because the $n\hat{p}$’s $> 5$ and the $n\hat{q}$’s $> 5$.

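3. [Figure: standard normal ($z$) curve with a right-tail rejection region of area .05 beyond the critical value $z = 1.645$]

Input the data in 2-PropZTest.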

4. z = 1.971 (From calculator, see below)

5. There is sufficient evidence the percentage of adults in their 50’s who read for pleasure is greater than the percent of adults in their 60’s who read for pleasure.

Notice what your calculator gives you. It gives the following:

$\hat{p}_1 = .7098\ (= 614/865)$

$\hat{p}_2 = .6667\ (= 620/930)$

$\hat{p} = .6875\ (= (614 + 620)/(865 + 930))$

Calculating test statistic on tests of $p_1 - p_2$

• Select STAT>TESTS>2-PropZTest

• Enter the values of n1, X1, n2, and X2

• Select the appropriate alternative hypothesis

• Highlight Calculate and hit ENTER
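To see what ‘pooling’ does to the test statistic, the sketch below computes the usual pooled two-proportion statistic, $z = (\hat{p}_1 - \hat{p}_2)\big/\sqrt{\hat{p}\hat{q}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$ with $\hat{p} = (X_1 + X_2)/(n_1 + n_2)$, for the reading-for-pleasure data in Example 10.4.4. This is an illustration under the assumption that the calculator's 2-PropZTest uses this pooled form; the function names are hypothetical, and the normal CDF is built from `math.erf` so that no extra libraries are needed.

```python
from math import erf, sqrt

def pooled_two_proportion_z(x1, n1, x2, n2):
    """Return (z, pooled proportion) for a test of H0: p1 = p2."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    # Under H0 both samples estimate the same p, so combine them.
    p_pooled = (x1 + x2) / (n1 + n2)
    q_pooled = 1 - p_pooled
    std_error = sqrt(p_pooled * q_pooled * (1 / n1 + 1 / n2))
    return (p1_hat - p2_hat) / std_error, p_pooled

def normal_cdf(z):
    """Area to the left of z under the standard normal curve."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Example 10.4.4: 614 of 865 adults in their 50's, 620 of 930 in their 60's.
z, p_pooled = pooled_two_proportion_z(614, 865, 620, 930)
p_value = 1 - normal_cdf(z)   # right-tailed test, H1: p1 > p2

print(f"pooled proportion = {p_pooled:.4f}")
print(f"z = {z:.3f}")
print(f"p-value = {p_value:.4f}")
```

The pooled proportion printed here is the single shared estimate that replaces the two separate sample proportions in the denominator.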

Example 10.4.5.

From a sample of 952 randomly selected college-bound high school juniors, 31.0% are undecided what major they will pursue in college. A random sample of 1238 entering college freshmen had 35.2% with an undecided major. Test, using a 5% level of significance, if the percentages of entering college freshmen and high school juniors with undecided college majors are different.

Solution.

As before, let us organize what we have

HS Juniors           Coll. Freshmen
$n_1 = 952$          $n_2 = 1238$
$\hat{p}_1 = .310$   $\hat{p}_2 = .352$

$H_1: p_1 \neq p_2$ or, $H_1: p_1 - p_2 \neq 0$
$\alpha = 5\%$

1. $H_0: p_1 = p_2$
   $H_1: p_1 \neq p_2$

2. Use $z$ because the $n\hat{p}$’s $> 5$ and the $n\hat{q}$’s $> 5$.

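3. [Figure: standard normal ($z$) curve with two rejection regions of area .025 each, beyond the critical values $z = -1.960$ and $z = 1.960$]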

4. Our calculator requires us to have the $X$’s and $n$’s. Instead we have the $\hat{p}$’s and $n$’s. To get the $X$’s use the relationship

$$\hat{p} = \frac{X}{n} \quad \text{equivalently,} \quad X = n\hat{p}$$

We get

$X_1 = 952 \times .310 = 295$ (rounded from 295.12)

$X_2 = 1238 \times .352 = 436$ (rounded from 435.776)

$z = -2.081$ (From 2-PropZTest)


5. There is sufficient evidence that the percentages of entering college freshmen and high school juniors with undecided college majors are different.

If you examine the $\hat{p}$’s given by your calculator, we see $\hat{p}_1 = .30987\ldots$ and $\hat{p}_2 = .35218\ldots$ These are close to the original values given in the problem. When we are given percentages in these problems we are going to get rounded off values. You should check that what your calculator gives you rounds off to the given percent.

Also, if you prefer the p-value approach, the calculator gives a p-value of .0374, which is less than $\alpha$.
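The step of recovering whole-number counts from reported percentages, and the check that they round back to the given percents, can be sketched in Python as follows. This is an illustration only: it assumes the pooled form of the two-proportion $z$ statistic, and the helper `normal_cdf` is a hypothetical name for a normal CDF built from `math.erf`.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Area to the left of z under the standard normal curve."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Example 10.4.5: sample sizes and reported sample percentages.
n1, p1_given = 952, 0.310    # high school juniors
n2, p2_given = 1238, 0.352   # entering college freshmen

# Counts must be whole numbers, so round X = n * p_hat.
x1 = round(n1 * p1_given)
x2 = round(n2 * p2_given)

# Proportions implied by the rounded counts; these should round back to the given values.
p1_hat, p2_hat = x1 / n1, x2 / n2

# Pooled estimate under H0: p1 = p2, then the test statistic.
p_pooled = (x1 + x2) / (n1 + n2)
std_error = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
z = (p1_hat - p2_hat) / std_error

# Two-tailed p-value for H1: p1 != p2.
p_value = 2 * (1 - normal_cdf(abs(z)))

print(f"X1 = {x1}, X2 = {x2}")
print(f"p1_hat = {p1_hat:.5f}, p2_hat = {p2_hat:.5f}")
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

Doubling the one-tail area is what makes this a two-tailed p-value; for a one-tailed alternative the factor of 2 would be dropped.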

10.4.3 Exercises

1. A sample of 1205 voter-eligible people that did not graduate from high school indicated 32.2% voted in the 2016 presidential election. A sample of 1320 voter-eligible people that have a college degree showed that 62.1% voted in the 2016 presidential election.

(a) What is the point estimate for the difference of the percentages that voted in the 2016 presidential election for the two populations?

(b) Find the 95% margin of error for this point estimate.

(c) Construct a 95% confidence interval for the difference of the population percentages.

2. At an egg farm, the farmer is investigating the effectiveness of a new feed. Of 400 randomly selected hens given the old feed, 30.5% of the eggs were ‘large’ or larger. Of 400 randomly selected hens given the new feed, 38.25% of the eggs were ‘large’ or larger.

(a) What is the point estimate for the improvement in the percentage of eggs that are ‘large’ or larger with the new feed?

(b) Find the 95% margin of error for this point estimate.

(c) Construct a 95% confidence interval for the difference of the population percentages.

3. A pharmaceutical company is testing a skin cream product to reduce eczema. 1000 patients with eczema were randomly split into two groups: a treatment group and a placebo (control) group. Of 500 patients administered the skin cream with the medicine, 432 reported an improvement after a week of use. Of the remaining 500 patients given the skin cream without the medicine, 364 reported improvement after a week of use.

(a) What is the point estimate for the difference of the percentages reporting improvement using the medicated cream over the non-medicated cream?

(b) Find the 90% margin of error for this point estimate.

(c) Construct a 90% confidence interval for the difference of the population percentages.

4. In 2012, 1024 Americans were asked if they are fans of professional football. 688 of those people said they were. In 2017 the same question was asked of 1034 Americans. Of these, 582 said they were.

(a) What is the point estimate for the change in the percentage of Americans who are football fans?

(b) Find the 95% margin of error for this point estimate.

(c) Construct a 95% confidence interval for the difference of the population percentages.

5. Even if a person has had an immunization from a disease, there is still a chance that the person might contract the targeted disease. Of 1230 people that were given an immunization in the form of a shot, 165 contracted the disease. Of 1562 people that were given the immunization in the form of a pill, 95 ended up contracting the disease. Is there sufficient evidence, at the 5% level of significance, that the shot is a more effective form of immunization for this disease?

6. At a large hotel, guests are asked to rate the employees. Of a random sample of 564 people that used the spa, 431 would recommend the service to their friends. Of 842 people that used room service, 502 would recommend the service to their friends. Is there sufficient evidence that the percentages of guests that would recommend the services are different at a 1% level of significance?

7. A large municipality has just begun a single stream recycling program (all recyclables go in one container). Education about what goes in the bin and what doesn’t is an important part of the program. The management team has sent information about what goes in the bins and what doesn’t to several randomly selected households. A random sample of 1500 items put in the bins by the households that got the information found that 8.6% of the items were in the bins incorrectly. A random sample of 1200 items put in the bins by the households that didn’t get the information found that 13.5% of items were in the bins incorrectly. Using a 1% significance level, test if the percentage of incorrect materials in the recycling bins of households that receive the information is less than the percentage for households that don’t receive the information.

8. A drug that is used for treatment of cancer has the side effect of nausea. 2000 patients using the cancer drug are split into two groups: one receives a placebo and the other receives an additional drug to treat the nausea. 1000 of the patients were given a placebo and 43% reported severe nausea. 1000 patients given the anti-nausea drug had a report of 28% with severe nausea. Test, using a 5% level of significance, if patients who take the anti-nausea drug have a lower incidence of severe nausea than the patients who take a placebo.

9. At a large hospital patients were surveyed about their health care team washing their hands or using gel on their hands. A random sample of 800 patients were surveyed and 86% said the team washed or gelled their hands. A new patient protocol is put into place. A random sample of 900 patients were surveyed after the protocol was put into place and 95% reported that the team washed or gelled their hands. Test, using a 5% significance level, if the percentage of patients that report their health care team washes or gels their hands is greater after the new protocol is put into place.

10. A popular celebrity has begun a campaign to get teens to avoid all tobacco products. After one year, a sample of 1342 fans of the celebrity reported that 235 use tobacco products. A sample of 1354 fans who were unaware of the campaign reported 376 use tobacco products. Test if the percent of fans of the celebrity who use tobacco products is less than the percent of people who were unaware of the campaign. Use a 5% level of significance.


11. A psychologist is investigating left/right handedness and schizophrenia. A random sample of 830 people with psychosis had 325 left-handed persons. A random sample of 1036 people without psychosis had 120 left-handed persons. Using a 5% significance level, test if the percentage of left-handed people with psychosis is greater than the percent of left-handed persons without psychosis.

12. A veterinarian is doing a survey of political party and pet preference. A random sample of 1300 ‘dog-people’ showed that 53% were registered democrat. A random sample of 1100 ‘cat-people’ had 56% registered democrat. Test if the percent of ‘dog-people’ who are democrat is different from the percent of ‘cat-people’ who are democrat. Use a 5% level of significance.

13. A high school counselor is investigating what percent of students take summer classes. From a random sample of 820 juniors, 260 have summer classes. From a random sample of 762 seniors, 194 have summer classes. Test if the percentages of high school juniors and seniors that have summer classes are different. Use a 1% level of significance.

14. The maker of rechargeable cordless ear buds is testing its ear buds against a popular brand. A random sample of 431 people were given the new ear buds and 264 of them rated them as ‘excellent’. A random sample of 460 people were given the popular brand ear bud to use and 234 of them rated them as ‘excellent’. Test, using a 5% level of significance, if the percent of all people who would rate the new buds as ‘excellent’ is greater than the percent of all people who would rate the popular brand as ‘excellent’.

15. Should we abandon the electoral college for president and go to a popular vote instead? One of the concerns is candidates focusing on ‘swing’ states. 900 democrats were asked about this concern. 37% were ‘very concerned’. A random sample of 850 republicans had 20% ‘very concerned’. Test if the percentages of democrats and republicans ‘very concerned’ about this issue are different at a 1% level of significance.

16. According to a time.com report, only 74% of Americans know that the Earth revolves around the sun. Of residents of the European Union, only 66% knew this. Assume these are based on samples of 1000 Americans and 1200 European Union residents. Using a 2.5% level of significance, test if the percent of Americans who correctly know that the Earth revolves around the sun is greater than the percent of European Union residents who know the Earth revolves around the sun.


10.5 The F-Distribution

We have done inferential statistics on µ and µ1 − µ2. We have also done tests of p and p1 − p2, and tests on $\sigma^2$. We would now like to compare two variances. For the means and proportions, we could subtract the means or proportions and we had the same distribution: t or z. In comparing $\sigma_1^2$ and $\sigma_2^2$, we don’t subtract them; we need to divide them.

Specifically, if $\chi_1^2 \sim \chi^2(df_1)$ and $\chi_2^2 \sim \chi^2(df_2)$ are independent, then

$$\frac{\chi_1^2 / df_1}{\chi_2^2 / df_2} \sim F(df_1, df_2)$$

We will worry about the applications later. Right now, we will focus on the distribution. Some facts about the F-distribution:

The distribution is skewed right.
It has two parameters: degrees of freedom of the numerator and degrees of freedom of the denominator. This is usually written as an ordered pair: $(df_{num}, df_{denom})$.
To find the critical values of F we will use the tables.

Example 10.5.1.

Find $F_{.01}$ with df = (5, 7).

Solution.

First we need to find the table with the .01 area labeled on the graph. Second, we need to find the degrees of freedom of the numerator along the top row and the degrees of freedom of the denominator along the left column. Where the two meet we get $F_{.01}$ = 7.460.

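[Table excerpt: F table for right-tail area .01, with numerator df across the top row and denominator df down the left column. The column for 5 and the row for 7 meet at 7.460.]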
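If a table is not at hand, a critical value can be approximated numerically. The sketch below is one possible approach, not the method used in this text: it assumes both degrees of freedom are at least 2, uses the standard fact that if F has df = (d1, d2) then d1·F/(d1·F + d2) follows a Beta(d1/2, d2/2) distribution, and integrates that density with Simpson's rule. The function names are ours.

```python
from math import exp, lgamma


def beta_cdf(t, a, b, steps=2000):
    """Area under the Beta(a, b) density on [0, t], by Simpson's rule."""
    norm = exp(lgamma(a + b) - lgamma(a) - lgamma(b))

    def density(u):
        return norm * u ** (a - 1) * (1 - u) ** (b - 1)

    h = t / steps
    total = density(0.0) + density(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    return total * h / 3


def f_critical(alpha, df_num, df_denom):
    """F value with right-tail area alpha, for df = (df_num, df_denom)."""
    a, b = df_num / 2, df_denom / 2
    lo, hi = 0.0, 1.0
    # Bisection on t = df_num*F / (df_num*F + df_denom), which lies in (0, 1).
    for _ in range(50):
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    t = (lo + hi) / 2
    # Convert back from the Beta scale to the F scale.
    return df_denom * t / (df_num * (1 - t))


print(round(f_critical(0.01, 5, 7), 3))
```

The printed value can be compared with the table entry found in Example 10.5.1.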

Our focus here is to look at hypothesis tests. As mentioned above, $\dfrac{\chi_1^2 / df_1}{\chi_2^2 / df_2} \sim F(df_1, df_2)$. This will be the basis of our hypothesis tests.

From before we have that if X ∼ N then $\dfrac{(n-1)s^2}{\sigma^2} \sim \chi^2(n-1)$.
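Combining these two facts shows what the ratio looks like in terms of the sample variances. The following is a short derivation sketch, assuming the two samples are independent and writing $df_1 = n_1 - 1$ and $df_2 = n_2 - 1$:

```latex
\[
\frac{\chi_1^2 / df_1}{\chi_2^2 / df_2}
= \frac{\dfrac{(n_1-1)s_1^2}{\sigma_1^2} \Big/ (n_1-1)}
       {\dfrac{(n_2-1)s_2^2}{\sigma_2^2} \Big/ (n_2-1)}
= \frac{s_1^2 / \sigma_1^2}{s_2^2 / \sigma_2^2}
\sim F(n_1-1,\ n_2-1)
\]
```

When $\sigma_1 = \sigma_2$, the population variances cancel and the ratio reduces to $s_1^2 / s_2^2$.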

To perform a hypothesis test of the equality of two population variances we will use the following.


In performing a hypothesis test with null hypothesis H0 : $\sigma_1^2 = \sigma_2^2$ (or H0 : σ1 = σ2):

Reindex, if necessary, so that H1 : σ1 > σ2, or so that $s_1^2 > s_2^2$ if H1 : σ1 ≠ σ2.

The test statistic will be $F = \dfrac{s_1^2}{s_2^2}$ with df = (n1 − 1, n2 − 1). Each $\chi^2/df$ equals $s^2/\sigma^2$, and the $\sigma^2$'s cancel when H0 is true.

The critical value will be $F_{\alpha}$ for a right-tailed test.

The critical value will be $F_{\alpha/2}$ for a two-tailed test.

Provided X1 ∼ N and X2 ∼ N.

Let us put this in use with a test.

Example 10.5.2.

A company that sells caulking uses two plants to dispense the caulking into tubes. A random sample of 8 tubes from the east plant is selected and the standard deviation of the amount of caulk is 5.64 grams. A random sample of 9 tubes from the west plant is selected and the standard deviation of the amount of caulk is 4.26 grams. Is there sufficient evidence, at a .05 level of significance, that the standard deviation of the amount of caulk dispensed into the tubes at the east plant is greater than the standard deviation of the amount of caulking dispensed into the tubes at the west plant? It is known by the management that the distributions of the amount of caulking dispensed into all tubes at both plants are approximately normally distributed.

Solution.

Note that we are asked to test if the standard deviation at one plant is greater than at the other. Let us organize what we have:

East Plant      West Plant
n1 = 8          n2 = 9
s1 = 5.64       s2 = 4.26
X1 ∼ N          X2 ∼ N

α = 5%

H1 : σ1 > σ2

We will proceed with a hypothesis test using the same 5 steps we used before. We will proceed with the critical value approach.

1. H0 : σ1 = σ2

H1 : σ1 > σ2

2. Use F because X1 ∼ N and X2 ∼ N


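3. df = (7, 8) (n1 − 1 = 7, n2 − 1 = 8), so the critical value is $F_{.05}$ = 3.500.

[Figure: F curve with a right-tail rejection region of area α = 5% beginning at 3.500.]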

4. The test statistic is the ratio of the sample variances:

$$F = \frac{s_1^2}{s_2^2} = \frac{5.64^2}{4.26^2} = 1.753$$

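5. Since the test statistic is less than the critical value of 3.500, it is not in the rejection region. There is not sufficient evidence that the standard deviation of the amount of caulk dispensed at the east plant is greater than the standard deviation of the amount dispensed at the west plant.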
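The five steps of a right-tailed test can also be sketched in a few lines of Python. This is an illustrative sketch only: the function names are ours, the statistic is taken to be the ratio of the sample variances, and the critical value is assumed to have been read from the F table.

```python
def f_statistic(s1, s2):
    """Ratio of sample variances, F = s1^2 / s2^2."""
    return s1 ** 2 / s2 ** 2


def right_tailed_f_test(s1, n1, s2, n2, critical_value):
    """Return (F, df, reject) for H1: sigma1 > sigma2."""
    f = f_statistic(s1, s2)
    df = (n1 - 1, n2 - 1)
    # Reject H0 only when F lands in the right-tail rejection region.
    return f, df, f > critical_value


# East plant is sample 1, west plant is sample 2 (Example 10.5.2).
f, df, reject = right_tailed_f_test(5.64, 8, 4.26, 9, critical_value=3.500)
print(round(f, 3), df, reject)
```

Here `reject` being false corresponds to the conclusion that there is not sufficient evidence.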

Since the table only shows the right tail, we need to make sure that our test statistic is in the right tail. We can do this by labeling the samples so that s1 is the larger of the two sample standard deviations.

Example 10.5.3.

A manufacturer of cables is testing the breaking strength of the cables. The manufacturer uses two different processes to make the cable. A random sample of 10 cables using the traditional process was selected and the standard deviation of the breaking strengths was 16.5 pounds. A random sample of 8 cables using the newer process was selected and the standard deviation of the breaking strength was 26.4 pounds. From testing the strengths before, the manufacturer knows that the distribution of the breaking strengths of the two processes should be approximately normal. Use a 5% significance level to test if the standard deviations are different.

Solution.

Although we usually let s1 be the first standard deviation we get to when we read the problem, we need to have a test that uses the right tail. We will then let 1 represent the newer process and 2 represent the traditional process. This way, s1 > s2.

Let us organize what we have:

Newer Process   Older Process
n1 = 8          n2 = 10
s1 = 26.4       s2 = 16.5
X1 ∼ N          X2 ∼ N

α = 5%

H1 : σ1 ≠ σ2

We will proceed with a hypothesis test using the same 5 steps we used before. We will proceed with the critical value approach.


1. H0 : σ1 = σ2

H1 : σ1 ≠ σ2

2. Use F because X1 ∼ N and X2 ∼ N

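3. df = (7, 9) (n1 − 1 = 7, n2 − 1 = 9). This is a two-tailed test, so each tail has area α/2 = 2.5% and the right critical value is $F_{.025}$ = 4.197. We don’t need the value of the left critical value, labelled ‘Don’t care’ in the picture.³

[Figure: F curve with an area of 2.5% in each tail; the right critical value is 4.197 and the left critical value is labelled ‘Don’t care’.]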

4. The test statistic is the ratio of the sample variances:

$$F = \frac{s_1^2}{s_2^2} = \frac{26.4^2}{16.5^2} = 2.560$$

5. The test statistic is less than the right critical value, so it is not in the rejection region. There is not sufficient evidence that the standard deviations of the breaking strengths of the two processes are different.
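The relabeling step used in this example is easy to get wrong by hand, so here is a small sketch of it in Python. The function name `two_tailed_f_setup` is ours, and the critical value 4.197 is an assumed table reading of $F_{.025}$ with df = (7, 9).

```python
def two_tailed_f_setup(s_a, n_a, s_b, n_b):
    """Label the sample with the larger standard deviation as sample 1.

    Returns the test statistic F = s1^2 / s2^2 and df = (n1 - 1, n2 - 1).
    """
    if s_a >= s_b:
        s1, n1, s2, n2 = s_a, n_a, s_b, n_b
    else:
        s1, n1, s2, n2 = s_b, n_b, s_a, n_a
    f = s1 ** 2 / s2 ** 2
    return f, (n1 - 1, n2 - 1)


# Traditional process listed first, as in the problem statement.
f, df = two_tailed_f_setup(16.5, 10, 26.4, 8)

critical_value = 4.197  # assumed table value of F_.025 with df = (7, 9)
reject = f > critical_value
print(round(f, 3), df, reject)
```

Because the larger sample standard deviation always goes in the numerator, F is at least 1 and only the right critical value is needed.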

10.5.1 Exercises

Perform the following hypothesis tests.

1. n1 = 7, s1 = 23.5, n2 = 12, s2 = 18.5, α = .01, H1 : σ1 > σ2, X1 ∼ N , and X2 ∼ N

2. n1 = 12, s1 = 0.265, n2 = 13, s2 = 0.136, α = .025, H1 : σ1 > σ2, X1 ∼ N , and X2 ∼ N

3. n1 = 15, s1 = 1.26, n2 = 17, s2 = 1.99, α = .05, H1 : σ1 ≠ σ2, X1 ∼ N , and X2 ∼ N

4. n1 = 8, s1 = 4.26, n2 = 14, s2 = 7.05, α = .05, H1 : σ1 ≠ σ2, X1 ∼ N , and X2 ∼ N

5. n1 = 14, s1 = 0.456, n2 = 9, s2 = 0.636, α = .025, H1 : σ1 < σ2, X1 ∼ N , and X2 ∼ N

6. n1 = 16, s1 = 2.96, n2 = 12, s2 = 4.26, α = .01, H1 : σ1 < σ2, X1 ∼ N , and X2 ∼ N

7. A vintner is examining the sugar content of grapes at two of their vineyards. At the old vineyard, the standard deviation of the amount of sugar in a random sample of 10 one-cup samples is 2.12 grams. At the newer vineyard a random sample of 9 one-cup samples were taken and the standard deviation of the amount of sugar was 3.56 grams. The amount of sugar per cup of grapes is known to be approximately normally distributed for both vineyards. Test, using a 5% significance level, if the standard deviations of the amount of sugar per cup of grapes for the two vineyards are different.

³ If you do care, it is $1/F_{.025}$ with df = (9, 7), not (7, 9).


8. A chemist at a paint company is investigating the drying time for two different paints. A sample from 14 randomly chosen batches of their premium paint yielded a standard deviation of drying times of 15.68 minutes. Another sample from 15 randomly chosen batches of the economy paint yielded a standard deviation of drying times of 10.35 minutes. Test if the standard deviations of the drying times are not equal. It is reasonable to assume that the drying times for both types of paint are approximately normal. Use a 5% level of significance.

9. In order to keep an assembly line moving smoothly, the time taken at each station has to be consistent. At an assembly line, the management is investigating two different robot assisted technologies: Nomad and HAL. A random sample of 10 products are sent through the Nomad assembly line and the standard deviation of times was 35.6 seconds. The manufacturer assures us that the times to get the product through are normally distributed. Another sample of 15 using the HAL assembly line is taken and the standard deviation was found to be 56.3 seconds. The manufacturer of the HAL machine also assures us that the times to get the product through are normally distributed. Test, using a 5% level of significance, if the standard deviation of the times for the Nomad machine is less than the standard deviation of the HAL machine.

10. Two different companies manufacture resistors. The resistors vary and so a sample of 12 100-Ohm HighQual resistors are measured for resistance and the standard deviation of the resistance is 9.46 Ohms. A sample of 11 100-Ohm Cheepo resistors are taken and the standard deviation turns out to be 13.54 Ohms. Can you conclude that the standard deviation of the HighQual 100-Ohm resistors is less than the standard deviation of the Cheepo 100-Ohm resistors? Use a 5% level of significance, assuming the resistances of all 100-Ohm resistors are normally distributed.

11. Two internet providers are bragging about consistent speeds. A random sample of 15 1-GB files are downloaded using the WeBeFast internet provider and the standard deviation of the times is 54.8 seconds. A random sample of 10 1-GB files are downloaded using the FastBeUs internet provider and the standard deviation of the times was 93.2 seconds. Is there sufficient evidence that the standard deviation of the times for the WeBeFast internet provider is less than the standard deviation of the times for the FastBeUs internet provider? From downloading experience, we know that the times to download files from the two internet service providers are each approximately normally distributed. Use a 2.5% level of significance.

12. A company that cans sliced pineapple is investigating the consistency of the sizes of ripe pineapples at two growers: Sweet Pine Farms and Juicy Pine Farms. A random sample of 15 pineapples from the Sweet Pine Farms are selected and the standard deviation of the diameter is 23.6 mm. A sample of 13 pineapples from the Juicy Pine Farms is selected and the standard deviation of the diameters is found to be 39.1 mm. The diameters of the pineapples from both farms are approximately normal. Using a 1% significance level, test if the standard deviation of the diameters of the pineapples at Sweet Pine Farms is less than the standard deviation of the pineapples at Juicy Pine Farms.

13. While investigating thermometers, a chef investigates two different brands: SureTemp and AccuTemp thermometers. A random sample of 15 SureTemp thermometers are placed in a hot water bath and the temperatures are measured. From the temperatures obtained it is observed that the population is approximately normally distributed and the sample standard deviation is 3.54°F. A sample of 12 AccuTemp thermometers are also placed in the hot water bath and the standard deviation is observed to be 8.46°F. It was also noted that the population of temperatures is approximately normally distributed. Using a 5% significance level, test if the standard deviation of the temperatures that SureTemp thermometers report is less than the standard deviation of the AccuTemp thermometers.

14. The fat content at a company that makes ice cream is under investigation. The company has two different chefs preparing the ice cream. The morning chef creates 12 batches of ice cream. From each batch a 100-gram sample is obtained and the fat content is measured. The standard deviation of the amount of fat is found to be 5.64 grams. The other chef creates 8 batches of ice cream and the standard deviation is found to be 9.56 grams. In observing the amount of fat in samples, it is noted that the amount of fat in all 100-gram samples is approximately normal for both chefs. Is there sufficient evidence that the standard deviations of the amounts of fat by the two chefs are different? Test using a 5% significance level.

15. A gunsmith is comparing two rifles for accuracy. The gunsmith locks the gun in a vise, aims it at a bull’s-eye target at a distance, and then measures the distance from the center. A random sample of 6 shots with the long barrel rifle is taken and the standard deviation of the distances from the center is 3.65 cm. A random sample of 6 shots from a short barrel rifle is taken and the standard deviation is 7.51 cm. The gunsmith knows that the distances from the bull’s-eye for both rifles are approximately normally distributed. Use a 5% level of significance to test if the standard deviation of the distances from the bull’s-eye of the longer rifle is less than the standard deviation for the short barreled rifle.

16. A master piano tuner along with the apprentice are comparing tuning abilities. They each tune the A above middle C, which should be 440 Hz. The master tunes 10 pianos and the standard deviation of the frequencies of the tuned string is found to be 0.135 Hz. The apprentice tunes 9 pianos and the standard deviation is found to be 0.216 Hz. The distributions of the master and the apprentice are known to be approximately normal. Using a 5% significance level, test if the standard deviation of the master’s tunings is less than the standard deviation of the apprentice’s tunings.

17. A commuter has two different routes to take to work: the back road route and the freeway route. The average times on the two routes are about the same. Our commuter is interested in which route is more consistent. The commuter drives the back roads and finds the standard deviation of the times to be 1.23 minutes. The commuter drives the freeway route on 12 randomly selected days and the standard deviation of the times to get to work is 2.38 minutes. Using a 5% significance level, test if the standard deviation of the times on the back roads is less than the standard deviation for the freeway. The times are approximately normal for both routes.

18. A play is currently being put on in both New York and San Francisco. The writer of the play is interested in the running times of the play. On 15 randomly selected performances, the times are recorded from start to finish in New York. The times are noted to be approximately normal. Additionally the standard deviation is calculated to be 2.35 minutes. On 13 randomly selected performances in San Francisco, the times are also noted to be approximately normal and the standard deviation turned out to be 3.67 minutes. Can you conclude that the standard deviation of the times in New York is less than the standard deviation of the times in San Francisco? Use a 1% significance level.


Chapter 11

Inferential Statistics with Chi-Square Distribution


11.1 Chi-Square Distribution

We have already looked at the sampling distributions of $\bar{X}$ and $\hat{p}$. Our next task is to look at the sampling distribution of $s^2$. To do this requires us to look at a new distribution, the chi-squared distribution, written $\chi^2$ (‘ki’, rhymes with ‘my’). This distribution has one parameter: degrees of freedom.¹

If X is a random variable that follows a $\chi^2$ distribution, we write $X \sim \chi^2(df)$.

Here are some facts about the χ2 distribution:

It has one parameter: degrees of freedom, df.
The distribution is skewed right.
The mean is df.
The mode occurs at df − 2 (for df ≥ 2).
The variance is 2df.
If z is a standard normal random variable then $z^2 \sim \chi^2(1)$.
If $X_1 \sim \chi^2(df_1)$ and $X_2 \sim \chi^2(df_2)$ are independent, then $X_1 + X_2 \sim \chi^2(df_1 + df_2)$.
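The last two facts give a way to see the first few in action: adding up df squared standard normal values produces a $\chi^2(df)$ value. The simulation below is a rough sketch of that idea, with a hypothetical function name, an arbitrary seed, and an arbitrary number of trials; the sample mean and variance should land near df and 2df, but will not match them exactly.

```python
import random
from statistics import fmean, pvariance


def simulate_chi_square(df, trials=20000, seed=1):
    """Draw chi-square values as sums of df squared standard normals."""
    rng = random.Random(seed)
    return [
        sum(rng.gauss(0, 1) ** 2 for _ in range(df))
        for _ in range(trials)
    ]


draws = simulate_chi_square(df=5)
# Compare with the stated mean (df = 5) and variance (2df = 10).
print(fmean(draws), pvariance(draws))
```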

Let us consider the graph of the χ2 distribution. Several of them are below with different degreesof freedom.

[Figure: $\chi^2$ density curves for df = 3, df = 5, and df = 7, on a horizontal $\chi^2$ axis starting at 0.]

We can look up values the same way we looked up values of t.

Example 11.1.1.

Find $\chi^2_{.025,5}$.

Find $\chi^2$ such that the area to the left is .05 with 7 degrees of freedom.

Solution.

We look in the column heading of .025 and the row corresponding to 5 degrees of freedom and we get 12.833. We can write this as $\chi^2_{.025} = 12.833$, or $\chi^2_{.025,5} = 12.833$ if we want to emphasize the degrees of freedom.

For the second part, note that we are given the area to the left is .05. The table works with area to the right. Since .05 is to the left, the area to the right must be .95 (= 1 − .05), so what we are looking for is $\chi^2_{.95,7}$. From the table we get 2.167.
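The table lookups above can be cross-checked numerically. The following is a sketch under our own naming, not the method of this text: it evaluates the area to the left with the standard series for the regularized lower incomplete gamma function, then finds the value with the requested right-tail area by bisection.

```python
from math import exp, lgamma, log


def chi2_cdf(x, df):
    """Area to the left of x under the chi-square curve with df degrees
    of freedom (series for the regularized lower incomplete gamma)."""
    if x <= 0:
        return 0.0
    a, half_x = df / 2, x / 2
    term = 1 / a
    total = term
    for n in range(1, 1000):
        term *= half_x / (a + n)
        total += term
        if term < 1e-15 * total:
            break
    return total * exp(-half_x + a * log(half_x) - lgamma(a))


def chi2_critical(area_right, df):
    """Chi-square value with the given area to its right, by bisection."""
    lo, hi = 0.0, 10.0 * df + 100.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if 1 - chi2_cdf(mid, df) > area_right:
            lo = mid  # too much area to the right: move right
        else:
            hi = mid
    return (lo + hi) / 2


print(round(chi2_critical(0.025, 5), 3))
print(round(chi2_critical(0.95, 7), 3))
```

Note that, like the table, `chi2_critical` takes the area to the right, so the second part of the example is entered as .95.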

¹ It is no coincidence that this is the same as the parameter for the t-distribution.


11.1.1 Sampling Distribution of $s^2$

We wish to now turn our attention to the sampling distribution of the sample variance, $s^2$. If we start with a population distribution which is normally distributed, the sampling distribution of $s^2$ will be skewed right. Below are graphs of the population distribution along with the sampling distributions of $\bar{X}$ and $s^2$.

[Figure: Population Distribution, with µ marked on the X axis; Sampling Distributions, with µ marked on the $\bar{X}$ axis and $\sigma^2$ marked on the $s^2$ axis.]

Since $s^2$ is an unbiased estimator for $\sigma^2$, the mean of the distribution is $\sigma^2$. Note that the shape of the distribution looks like a $\chi^2$ distribution. It has the desired shape but the scale is wrong.

$s^2$ has a $\chi^2$ shape with a mean of $\sigma^2$.

If we divide $s^2$ by $\sigma^2$ we get $\dfrac{s^2}{\sigma^2}$.

$\dfrac{s^2}{\sigma^2}$ will have a distribution which has a $\chi^2$ shape with a mean of 1.

Finally, if we multiply by n − 1 we get $\dfrac{(n-1)s^2}{\sigma^2}$.

$\dfrac{(n-1)s^2}{\sigma^2}$ will have a distribution which has a $\chi^2$ shape with a mean of n − 1.

The degrees of freedom for the sample variance is n − 1, which is also the degrees of freedom for the above.

Putting it all together, we get

$$\frac{(n-1)s^2}{\sigma^2} \sim \chi^2(n-1)$$
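A quick simulation can make this result more concrete. The sketch below draws many samples from a normal population and computes $(n-1)s^2/\sigma^2$ for each; the population values (mean 500, standard deviation 1.5), the seed, and the number of trials are all made up for illustration. The average of the simulated values should be close to n − 1 and their variance close to 2(n − 1).

```python
import random
from statistics import fmean, pvariance, variance


def scaled_sample_variances(n, mu, sigma, trials=5000, seed=1):
    """Simulate (n - 1) s^2 / sigma^2 for samples of size n from N(mu, sigma)."""
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        # statistics.variance is the sample variance s^2 (divides by n - 1).
        values.append((n - 1) * variance(sample) / sigma ** 2)
    return values


values = scaled_sample_variances(n=12, mu=500, sigma=1.5)
# Compare with df = n - 1 = 11 and 2 * df = 22.
print(fmean(values), pvariance(values))
```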


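11.1.2 Confidence intervals of $\sigma^2$

To calculate a 95% confidence interval of $\sigma^2$ consider the following:

95% of the time $\chi^2_{.975} < \dfrac{(n-1)s^2}{\sigma^2} < \chi^2_{.025}$. See figure.

[Figure: $\chi^2$ curve with 95% of the area between $\chi^2_{.975}$ and $\chi^2_{.025}$ and 2.5% in each tail.]

Manipulating this inequality, we get that 95% of the time

$$\frac{(n-1)s^2}{\chi^2_{.025}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{.975}}$$

This is a 95% confidence interval for $\sigma^2$.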
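For readers who want to see the manipulation spelled out, here is one way to carry it out. Every quantity involved is positive, so taking reciprocals reverses the inequalities, and multiplying through by $(n-1)s^2$ keeps them in the same direction:

```latex
\begin{align*}
\chi^2_{.975} &< \frac{(n-1)s^2}{\sigma^2} < \chi^2_{.025} \\[4pt]
\frac{1}{\chi^2_{.025}} &< \frac{\sigma^2}{(n-1)s^2} < \frac{1}{\chi^2_{.975}}
  && \text{(take reciprocals; inequalities reverse)} \\[4pt]
\frac{(n-1)s^2}{\chi^2_{.025}} &< \sigma^2 < \frac{(n-1)s^2}{\chi^2_{.975}}
  && \text{(multiply by } (n-1)s^2 \text{)}
\end{align*}
```

This is why the larger critical value ends up in the lower limit of the interval.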

Example 11.1.2.

The amount of water dispensed into all 500 ml bottles has a distribution which is approximately normally distributed. A random sample of 12 bottles is taken and the variance of the amounts of water dispensed is 2.36 ml². Construct a 95% confidence interval for the variance of the amount of water dispensed into all 500 ml bottles.

Solution.

We are told that the distribution is normal. This is required for our inferences to be valid. We also obtain the following:

$s^2$ = 2.36
n = 12

We find $\chi^2_{.025} = 21.920$ and $\chi^2_{.975} = 3.816$, both with df = 11 (df = n − 1).

Our confidence interval is

$$\frac{(11)2.36}{21.920} < \sigma^2 < \frac{(11)2.36}{3.816}$$

which gives us

$$1.184 < \sigma^2 < 6.803$$

So we are 95% confident that the variance of the amount of water dispensed into all 500-ml bottles is between 1.18 and 6.80 ml².

If we want a confidence interval for the standard deviation, we simply take the square roots of the interval above.

From above, $1.184 < \sigma^2 < 6.803$. Taking square roots we get

$$\sqrt{1.184} < \sqrt{\sigma^2} < \sqrt{6.803}$$

or

$$1.09 < \sigma < 2.61$$

We are 95% confident that the standard deviation of the amount of water dispensed into all 500-ml bottles is between 1.09 and 2.61 ml.

We can construct a confidence interval for any confidence level with the following:

A (1 − α)100% confidence interval for $\sigma^2$ is given by

$$\frac{(n-1)s^2}{\chi^2_{\alpha/2}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}$$

A (1 − α)100% confidence interval for σ is given by

$$\sqrt{\frac{(n-1)s^2}{\chi^2_{\alpha/2}}} < \sigma < \sqrt{\frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}}$$

Provided the population distribution is approximately normal.
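These two formulas translate directly into code. The sketch below uses hypothetical function names and assumes the two critical values have already been looked up in the $\chi^2$ table; it is checked against the numbers from Example 11.1.2.

```python
from math import sqrt


def variance_interval(n, s_squared, chi2_upper, chi2_lower):
    """Confidence interval for sigma^2.

    chi2_upper is the table value with area alpha/2 to its right,
    chi2_lower the table value with area 1 - alpha/2 to its right.
    """
    numerator = (n - 1) * s_squared
    return numerator / chi2_upper, numerator / chi2_lower


def std_dev_interval(n, s_squared, chi2_upper, chi2_lower):
    """Confidence interval for sigma: square roots of the variance limits."""
    low, high = variance_interval(n, s_squared, chi2_upper, chi2_lower)
    return sqrt(low), sqrt(high)


# Example 11.1.2: n = 12, s^2 = 2.36, table values for df = 11.
var_low, var_high = variance_interval(12, 2.36, 21.920, 3.816)
sd_low, sd_high = std_dev_interval(12, 2.36, 21.920, 3.816)
print(round(var_low, 3), round(var_high, 3))
print(round(sd_low, 2), round(sd_high, 2))
```

If the problem gives s rather than $s^2$, square it before passing it in.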

Example 11.1.3.

The time a runner takes to run their daily course varies from day to day but it is known that the distribution of times is approximately normal. A sample of 15 times is selected and the standard deviation of the times is 3.15 minutes. Construct a 99% confidence interval for the standard deviation of the times to run the course.

Solution.

We are given the following: n = 15, s = 3.15, and X ∼ N, where X represents the time for a randomly selected run.

We find $\chi^2_{.005} = 31.319$ and $\chi^2_{.995} = 4.075$ with df = 14.

Note that unlike the previous example we are given s, not $s^2$, so we need to square it in the formula:

$$\frac{(14)3.15^2}{31.319} < \sigma^2 < \frac{(14)3.15^2}{4.075}$$

$$4.435 < \sigma^2 < 34.090$$

We want the confidence interval for σ, not $\sigma^2$, so we get:


$$\sqrt{4.435} < \sigma < \sqrt{34.090}$$

$$2.11 < \sigma < 5.84$$

We are 99% confident that the standard deviation of the times it takes the runner to run the course is between 2.11 and 5.84 minutes.

11.1.3 Hypothesis tests of $\sigma^2$

Since we know that for samples selected from a normally distributed population,

$$\frac{(n-1)s^2}{\sigma^2} \sim \chi^2(n-1)$$

this is the key to how we will proceed with a hypothesis test.

The test statistic for a test of σ or $\sigma^2$ is given by

$$\chi^2 = \frac{(n-1)s^2}{\sigma^2}$$

where $\sigma^2$ is the value from the null hypothesis and

df = n − 1

Provided X ∼ N.

Example 11.1.4.

The diameters of ball bearings are supposed to be 9.0 micrometers (µm). The diameters of ballbearings are known to the manufacturer to be approximately normal. A sample of 20 ball bearingsare collected at random and the standard deviation of the diameters is found to be 8.67 µm. Test,using a 5% significance level, if the standard deviation of the ball bearings is more than 9.0 µm

Solution.

From the problem we get the following:

H1 : σ > 9.0, n = 20, s = 8.67, α = .05

Using the same five steps for hypothesis tests that we had before (critical value approach) we have

1. H0 : σ = 9.0
   H1 : σ > 9.0

2. Use $\chi^2$ because X ∼ N.

3. We will reject the null hypothesis if the sample standard deviation is ’a lot’ more than 9.0. This corresponds to a right-tailed test so we need $\chi^2_{.05,19} = 30.144$.

[Figure: $\chi^2$ distribution with df = 19; the rejection region is the right tail beyond the critical value 30.144.]

4. $$\chi^2 = \frac{(n-1)s^2}{\sigma^2} = \frac{19 \times 8.67^2}{9^2} = 17.632$$

5. Since the test statistic is not in the rejection region, we do not conclude the standard deviation of the diameters of all ball bearings is greater than 9.0 µm.

Although the problem doesn’t ask, why do we expect the manufacturer to have the alternative hypothesis as H1 : σ > 9.0 and not something else?

As with hypothesis tests of µ, the test can be right, left, or two-tailed.

Example 11.1.5.

The manufacturer from above is considering the purchase of a new machine that produces ball bearings. In order to purchase the machine, the variance of the ball bearings must be less than 81 µm². An engineer from the company understands the process used to create them and so knows that the diameters of the ball bearings are approximately normal. A sample of 12 ball bearings is taken and the variance of the ball bearings is found to be 20.1 µm². Is there sufficient evidence at the 1% level of significance that the variance of the ball bearings is less than 81 µm²?

Solution.

From the problem we get the following:

H1 : $\sigma^2 < 81$, n = 12, $s^2 = 20.1$, α = .01

1. H0 : $\sigma^2 = 81$
   H1 : $\sigma^2 < 81$

2. Use $\chi^2$ because X ∼ N.

3. Since we want to know if the population variance is less than 81, this is a left-tailed test. We need $\chi^2_{.99,11} = 3.053$.

[Figure: $\chi^2$ distribution with df = 11; the rejection region is the left tail below the critical value 3.053.]

4. Note that we are given the variance so we don’t need to square the values given. Plugging in we get

$$\chi^2 = \frac{(n-1)s^2}{\sigma^2} = \frac{11 \times 20.1}{81} = 2.730$$

5. Since the test statistic is in the rejection region, we conclude the variance of the diameters of all ball bearings for the new machine is less than 81 µm².
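Steps 3 through 5 of the two ball bearing examples can be sketched in Python as follows. The function names are our own and the critical values are copied from the χ² table, so this is an illustration of the procedure rather than a replacement for it.

```python
def variance_test_statistic(n, sample_variance, null_variance):
    """Chi-square test statistic (n - 1) s^2 / sigma^2 for a test of a variance."""
    return (n - 1) * sample_variance / null_variance

def in_rejection_region(statistic, critical_value, tail):
    """Decide whether the statistic falls in a one-tailed rejection region."""
    if tail == "right":
        return statistic > critical_value
    if tail == "left":
        return statistic < critical_value
    raise ValueError("tail must be 'right' or 'left'")

# Example 11.1.4: s and sigma are given, so both are squared first.
stat_right = variance_test_statistic(20, 8.67 ** 2, 9.0 ** 2)
reject_right = in_rejection_region(stat_right, 30.144, "right")

# Example 11.1.5: the variances are given directly.
stat_left = variance_test_statistic(12, 20.1, 81)
reject_left = in_rejection_region(stat_left, 3.053, "left")

print(round(stat_right, 3), reject_right)
print(round(stat_left, 3), reject_left)
```

Keeping the tail as an explicit argument makes it harder to compare against the wrong side of the critical value.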

Although the tests of hypothesis for $\sigma^2$ above use the critical value approach, we can get an interval in which the p-value lies, just as we did when estimating p-values using the t-distribution. If we examine the $\chi^2$ table we see the following row at 11 degrees of freedom.

df     .995    .990    .975    .950    .900
...    ...     ...     ...     ...     ...
11     2.603   3.053   3.816   4.575   5.578

Looking at the table, we see that 2.603 < 2.730 < 3.053. These numbers correspond to $\chi^2_{.995} < \chi^2 < \chi^2_{.990}$. We want the area to the left of 2.730. We know that the area to the right of 2.730 is between .990 and .995. This tells us that the area to the left of 2.730 is between .005 and .010 so we have .005 < p-value < .010. This will be reported as p-value < .01.
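The table lookup just described can be written as a short Python sketch. The helper name `bracket_left_tail_p_value` is hypothetical, and the dictionary simply holds the df = 11 row of the table keyed by the area to the right.

```python
def bracket_left_tail_p_value(statistic, table_row):
    """Bracket the left-tail area for a chi-square statistic.

    table_row maps 'area to the right' -> critical value for one df.
    Returns (lower, upper) bounds on the area to the left, or None if
    the statistic does not fall between two tabled values.
    """
    # Order the tabled entries by critical value, smallest first.
    entries = sorted(table_row.items(), key=lambda item: item[1])
    for (area_a, crit_a), (area_b, crit_b) in zip(entries, entries[1:]):
        if crit_a < statistic < crit_b:
            # Area to the left is 1 minus the area to the right.
            return 1 - area_a, 1 - area_b
    return None

row_df_11 = {0.995: 2.603, 0.990: 3.053, 0.975: 3.816, 0.950: 4.575, 0.900: 5.578}
lower, upper = bracket_left_tail_p_value(2.730, row_df_11)
print(f"{lower:.3f} < p-value < {upper:.3f}")
```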

11.1.4 Exercises

1. If $s^2 = 5.8$, n = 20, X ∼ N, find a 95% confidence interval for the variance.

2. If $s^2 = 12.3$, n = 11, X ∼ N, construct a 90% confidence interval for the variance.

3. If s = 2.31, n = 8, X ∼ N , find a 95% confidence interval for the standard deviation.

4. If s = 1.03, n = 9, X ∼ N , create a 95% confidence interval for the standard deviation.

5. If $s^2 = 7.96$, n = 13, X ∼ N, α = .05. Test if H1 : $\sigma^2 \neq 12$.

6. If $s^2 = .82$, n = 18, X ∼ N, α = .02. Test if H1 : $\sigma^2 \neq 1.00$.

7. If $s^2 = 27.81$, n = 15, X ∼ N, α = .01. Test if H1 : $\sigma^2 > 20$.

8. If $s^2 = 189.5$, n = 6, X ∼ N, α = .05. Test if H1 : $\sigma^2 > 100$.

9. If s = 4.06, n = 13, X ∼ N , α = .05. Test if H1 : σ < 6.

10. If s = 8.66, n = 10, X ∼ N , α = .05. Test if H1 : σ < 12.5.

11. A drug manufacturer produces 350 mg capsules. The amount of medicine distributed into each capsule varies from capsule to capsule but the amount of drug in each capsule is approximately normally distributed. A recent sample of 12 capsules produces a standard deviation of 3.65 mg.

(a) Construct a 98% confidence interval for the variance of the amount of drug in the capsules.

(b) Construct a 98% confidence interval for the standard deviation of the amount of drug in the capsules.

12. A quality control engineer at a linen factory is examining the lengths of the sheets that are coming off the production line. The lengths are approximately normally distributed. A recent sample of 12 sheets showed the standard deviation of the lengths of the sheets to be 3.64 mm.

(a) Find a 95% confidence interval for the variance of the lengths of the sheets.

(b) Find a 95% confidence interval for the standard deviation of the lengths of the sheets.

13. While investigating different varieties of corn, a farmer wants to make sure that the heights don’t have a lot of variability. A sample of 12 month old corn plants yielded a standard deviation of the heights of 1.35 inches.

(a) Construct a 90% confidence interval for the variance of the heights.

(b) Construct a 90% confidence interval for the standard deviation of the heights.

14. The diameters of wires coming off an assembly line are checked by taking a sample of 10 sections of wire and measuring the diameter. The variance of the diameters is found to be 6.35 µm². Since the diameters have been measured for some time, it is known that they are approximately normally distributed.

(a) Find a 95% confidence interval for the variance of the diameters of all wires.

(b) Find a 95% confidence interval for the standard deviation of the diameters of all wires.

15. Before purchasing a machine that dispenses paint, a paint manufacturer wants a demonstration that the variance is less than 1.35 ounces². To do this they have 12 cans filled with paint and find the variance of the amount of paint dispensed is 0.93 ounces². Being familiar with dispensing machines, the paint manufacturer knows that the amounts of paint dispensed should be approximately normally distributed. Test, using a 5% level of significance, if the variance of the amount of paint dispensed is less than 1.35 ounces².

16. A violinist must be able to accurately play a given note by ear since there are no frets on the instrument. A violinist claims that they can very accurately play a 440 Hz note on the violin (A above middle C). Specifically, they claim that the standard deviation of the frequencies of their attempts to play the note is at most 0.5 Hz. You take a sample of 12 attempts and find the standard deviation to be 0.84 Hz. Assuming the frequencies of the notes played are normally distributed, test if the violinist’s claim is false. Use a level of significance of 5%.

17. When purchasers in a grocery store use a self serve checkout, they place the product in an area where the items are weighed. In order for this to work, the items need a small standard deviation of weight. For a specific checkout machine, the standard deviation of a particular item needs to be less than 2.5 grams. A random sample of 8 of the item produced a standard deviation of 1.67 grams. If the weights for the items are normally distributed, test if the standard deviation is less than 2.5 grams. Use a 10% level of significance.

18. A battery manufacturer is testing the life of the battery. A random sample of 18 batteries is to be taken and placed in an electronic toy that slowly drains the battery until it stops working. The lives of the batteries appear to be approximately normal. The sample yielded a variance of battery life of 6.97 minutes². Using a 5% level of significance, test if the variance of the life of the battery is less than 10 minutes².

19. A jogger runs the same course every day. Last year, the variance of the times to run their favorite course was 3.67 minutes². A recent sample of 15 runs produced a variance of 5.62 minutes². Test, using a 5% level of significance, if the variance is different from last year. Assume the times for the runs to be approximately normally distributed.

20. At an amusement park, food sales are a large part of the park’s income. The food coordinator is investigating the amount of money people spend on food. A random sample of 22 customers is found to have a variance of the amount spent on food of 26.4 dollars². Test, using a 1% level of significance, if the variance of the amount spent is different from 18 dollars².

21. A typist types at a consistently fast rate. The typist claims that the standard deviation of the number of words typed per minute is no more than 6 wpm. A sample of 15 one-minute displays found the standard deviation to be 9.64 wpm. Is there sufficient evidence that the claim is false? Assume the words typed in a minute are normally distributed and use a 5% significance level.

22. The manufacturer of a particular brand of light bulb claims that its bulbs all have about the same brightness. Specifically, they claim that the standard deviation of the brightness of their bulbs is at most 30 lumens. A consumer agency takes a sample of 16 bulbs and finds the standard deviation of the brightness to be 43.5 lumens. Can you conclude, at a 5% level of significance, that the manufacturer’s claim is false? Assume the brightness is approximately normally distributed.

11.2 Goodness of Fit Tests

We have looked at hypothesis tests of p and would like to extend the idea beyond a binomial experiment. An unfortunate aspect of a binomial experiment is that it only allows two different outcomes: success and failure. What if we have ‘success’, ‘failure’, and ‘other’? Or even more categories than that? We need to modify our binomial experiment. Instead of a binomial experiment, we want a multinomial experiment, which we define below.

A Multinomial Experiment is an experiment in which the following conditions are met:

The experiment consists of n identical trials.

There are k outcomes possible for each trial.

The probabilities of the outcomes do not change.

The trials are independent.

A simple example of a multinomial experiment is tossing a die 100 times. In this case, a trial is tossing the die one time. We are repeating this trial 100 times. Since our die contains 6 sides, k = 6. Since the die is not changing (unless you are throwing the die so hard the edges are chipping away) the probability of each outcome stays the same. Note: although we might expect the probabilities to be equal, they do not need to be for a multinomial experiment. Lastly, the trials are independent: if I just rolled a ‘4’, the probabilities on the next toss have not changed.
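To make the four conditions concrete, here is a small simulation of the die example in Python. It is only a sketch: `simulate_multinomial` is a name we made up, and the seed is an arbitrary choice so the run can be repeated.

```python
import random
from collections import Counter

def simulate_multinomial(n_trials, outcomes, probabilities, seed=None):
    """Run n identical, independent trials with fixed outcome probabilities."""
    rng = random.Random(seed)
    # Each draw uses the same weights, so the probabilities never change,
    # and draws do not depend on earlier draws (independence).
    draws = rng.choices(outcomes, weights=probabilities, k=n_trials)
    return Counter(draws)

faces = [1, 2, 3, 4, 5, 6]       # k = 6 possible outcomes per trial
fair = [1 / 6] * 6               # equal here, but they need not be
counts = simulate_multinomial(100, faces, fair, seed=2020)

for face in faces:
    print(face, counts[face])
```

Replacing `fair` with unequal weights still gives a multinomial experiment, since the conditions only require that the probabilities stay fixed from trial to trial.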

Example 11.2.1.

You want to test if a die is unfair. You roll the die 600 times and you obtain the following:

Outcome     1    2     3    4     5    6
Frequency   83   105   93   118   91   110

Using a 5% significance level, determine if the die is unfair.

Solution.

If the die is fair, we expect to get 100 each of one through six. (The die being ‘fair’ means the probability of a one is 1/6, a two is 1/6, etc.) Since this is an extension of a test of p we might want to calculate six different test statistics, one for each outcome, and then combine the test statistics somehow. This is precisely how we are going to proceed. It turns out that if we square all the test statistics and add them with weights we will get our test statistic. The formula we are going to use is what you get when you go through the algebra (something we aren’t going to worry about). On with the test.

We want to check if the die is unfair. That is what we are trying to show. This means our alternative hypothesis is the die is unfair and the null hypothesis is the die is fair. We can write (the 1/6 is not needed. Why?):

H0 : p1 = p2 = p3 = p4 = p5 = p6 = 1/6

This is fine, but then what is our alternative hypothesis? What is wrong with the following?

H1 : p1 ≠ p2 ≠ p3 ≠ p4 ≠ p5 ≠ p6 ≠ 1/6

For this reason and ease of writing we will typically write out what our hypotheses are. We have:

1. H0 : The die is fair
   H1 : The die is unfair

2. As mentioned above, we are squaring several standard normal random variables and combining them. This tells us that we want a Chi-Squared distribution. This will only be valid if we expect at least 5 in each class. This is the same as saying np > 5 and nq > 5 in a hypothesis test of p. So...

We will use the $\chi^2$ distribution because the expected values are greater than 5 (E′s > 5).

3. All Goodness of Fit tests are right-tailed (the reason is given later). We will have k − 1 degrees of freedom. In this case we have df = 5.

[Figure: $\chi^2$ distribution with df = 5; the rejection region is the right tail beyond the critical value 11.070.]

4. The formula we are going to use is given by

$$\chi^2 = \sum \frac{(E - O)^2}{E}$$

where E represents the Expected values of the outcomes and O represents the Observed values of the outcomes.

To get this we will extend our table from before

Outcome   Observed   Expected   O − E   (O − E)²   (O − E)²/E
1         83         100        -17     289        2.89
2         105        100        5       25         0.25
3         93         100        -7      49         0.49
4         118        100        18      324        3.24
5         91         100        -9      81         0.81
6         110        100        10      100        1.00
Sum                                                8.68

This sum we obtained is the test statistic so $\chi^2 = 8.68$.

5. Notice where the test statistic is. It is not in the rejection region. Therefore we do not conclude that the die is unfair.
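The test statistic for the die can be reproduced with a short Python sketch. The function name is our own choice, and the expected counts assume the null hypothesis of a fair die.

```python
def goodness_of_fit_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all outcomes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [83, 105, 93, 118, 91, 110]   # counts for faces 1 through 6
n = sum(observed)                         # total number of rolls
expected = [n / 6] * 6                    # a fair die: n * (1/6) for each face

chi_square = goodness_of_fit_statistic(observed, expected)
print(n, round(chi_square, 2))
```

Because the differences are squared, it makes no difference whether we compute E − O or O − E.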

Why are all goodness of fit tests right-tailed? Consider the graph of the $\chi^2$ distribution. The left tail begins at 0. In order for the test statistic to be 0 the Observed and the Expected would need to be the same. If we got 100 each of 1 through 6 on our die we would certainly not conclude the die was unfair. If the test statistic was very small then the Observed would need to be very close to the Expected. Again we would not reject the null hypothesis. As the match gets worse and worse, the test statistic gets greater and greater until we conclude that the test statistic is too large and we end up rejecting the null hypothesis. See diagram below.

[Figure: $\chi^2$ distribution curve beginning at 0, labeled from left to right: Perfect Match (at 0), Match Worsens, Match Terrible: Reject H0.]
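A quick numerical illustration of the same idea, using made-up (hypothetical) sets of 600 rolls rather than real data: the statistic is 0 for a perfect match and only grows as the match worsens.

```python
def goodness_of_fit_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all outcomes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

expected = [100] * 6                         # fair die, 600 rolls

# Hypothetical observed counts, each summing to 600.
perfect = [100, 100, 100, 100, 100, 100]     # perfect match
close = [98, 101, 102, 99, 100, 100]         # match slightly worse
terrible = [60, 140, 80, 120, 70, 130]       # match terrible

stat_perfect = goodness_of_fit_statistic(perfect, expected)
stat_close = goodness_of_fit_statistic(close, expected)
stat_terrible = goodness_of_fit_statistic(terrible, expected)

print(stat_perfect, stat_close, stat_terrible)
```

Since every term in the sum is a square divided by a positive number, the statistic can never be negative, so there is no left-hand rejection region to consider.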

Goodness of Fit Summary

The null and alternative hypotheses are typically stated in words.

The distribution used is the $\chi^2$ distribution with k − 1 degrees of freedom. In order to do this we need E′s > 5. We usually don’t know the expected values until we get to the calculation of the test statistic.

The test is always a right-tailed test.

The test statistic is given by $\chi^2 = \sum \frac{(E-O)^2}{E}$ where E = Number Expected, O = Number Observed.

The Goodness of Fit test is used when we are comparing the outcomes from a multinomial experiment with the expected values. To determine this, try to imagine the ‘trial’ and determine what kind of, and how many, outcomes we have.

Calculating the Test Statistic for a Goodness of Fit Test on the Calculator

STAT>EDIT>1:Edit

Clear the data in L1 and L2

Input the Observed data values in L1

Input the percentages expected in L2

Move the cursor using the arrow keys until L2 is highlighted. Type in L2 × n/100 and hit [ENTER]. (If you entered the decimal form of the percents in L2 then type in L2 × n.)

Move the cursor until L3 is highlighted. Type in (L1 − L2)²/L2 and hit [ENTER].

STAT>CALC>1-Var Stats L3

The test statistic is given by Σx

Example 11.2.2.

A popular candy advertised that its colored candies follow a particular distribution for eye appeal. Specifically, they state that 35% are red, 25% are blue, 20% are yellow, 15% are orange and 5% are brown. A random sample of several candies yielded 149 red, 89 blue, 84 yellow, 124 orange, and 50 brown. Test if the advertised percentages are incorrect. Use a 5% level of significance.

Solution.

We could write the null hypothesis as H0 : p1 = .35, p2 = .25, p3 = .20, p4 = .15, p5 = .05. Instead we will describe the distribution in words.

1. H0 : The distribution of colors given is correct
   H1 : The distribution of colors given is not correct

2. Use $\chi^2$ because E′s > 5.

3. There are k = 5 colors, so df = k − 1 = 4.

[Figure: $\chi^2$ distribution with df = 4; the rejection region is the right tail beyond the critical value 9.488.]

4. In order to do the calculation, we need to know how many candies were sampled. If we add up the counts we get 496 (see table below). Below is the partial table. The O and E columns and last column are in L1, L2, and L3, respectively, if using the procedure above. Note that the smallest expected value is 24.8 (= 5% of 496) which is greater than five.

Outcome   O     E       O − E    (O − E)²   (O − E)²/E
Red       149   173.6   -24.6    605.16     3.48...
Blue      89    124                         9.879...
Yellow    84    99.2
Orange    124   74.4
Brown     50    24.8
Sum       496                               74.367

We get $\chi^2 = 74.367$.

5. We conclude that the distribution of colors as advertised is not correct.
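The calculator procedure for this example has a direct analogue in Python, sketched below. The dictionaries play the role of L1 and L2, and the variable names are our own.

```python
# Advertised proportions (decimal form) and observed counts by color.
proportions = {"red": 0.35, "blue": 0.25, "yellow": 0.20, "orange": 0.15, "brown": 0.05}
observed = {"red": 149, "blue": 89, "yellow": 84, "orange": 124, "brown": 50}

n = sum(observed.values())                      # number of candies sampled

# Expected count for each color is n times its advertised proportion.
expected = {color: n * p for color, p in proportions.items()}
smallest_expected = min(expected.values())      # must exceed 5 for the test to be valid

# One (O - E)^2 / E term per color, then add them up.
terms = {c: (observed[c] - expected[c]) ** 2 / expected[c] for c in observed}
chi_square = sum(terms.values())

print(n, round(smallest_expected, 1), round(chi_square, 3))
```

Printing `terms` fills in the entries left blank in the partial table and shows which colors contribute most to the statistic.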

11.2.1 Exercises

1. Let H0 : p1 = p2 = p3 = p4, X1 = 215, X2 = 236, X3 = 285, X4 = 244, α = 1%. Test H1.

2. Let H0 : p1 = p2 = p3 = p4 = p5, X1 = 566, X2 = 496, X3 = 525, X4 = 501, X5 = 567, α = 2.5%. Test H1.

3. Let H0 : p1 = .23, p2 = .38, p3 = .49, X1 = 205, X2 = 395, X3 = 502, α = 5%. Test H1.

4. Let H0 : p1 = .35, p2 = .27, p3 = .18, p4 = .20, X1 = 305, X2 = 296, X3 = 236, X4 = 239, α = 1%. Test H1.

5. A polling company has taken what it describes as a random sample of voters. In the sample, there are 127 republicans, 238 democrats, and 49 from other parties. The population from which they drew their sample has 25.6% republicans, 62.3% democrats, and 12.1% from other parties. Test using a 5% level of significance if the sample taken was not a random sample.

6. According to Wikipedia, a report from DuPont showed that 24% of cars sold in North America are white, 16% are silver, 19% are black, 15% are gray, 7% are blue, 10% are red, and the rest are other colors. Assume these percentages reflect the current colors of cars in North America. While on vacation, a tourist decides to check if the percentage distribution of colors of cars in Paris is different from the percentage distribution of cars in North America. A random sample of cars in Paris had 81 white, 39 silver, 62 black, 34 gray, 23 blue, 23 red, and 20 were other colors. Using a 5% significance level, test if the distribution of car colors in Paris is different from the distribution of car colors in North America.

7. According to an online education blog, 14% of college students take all their courses online, 15% take some of their courses online, and 71% take none of their classes online. Assume this is true for the 2018-2019 school year. A random sample of 400 incoming students for the 2019-2020 school year had 40 students taking all their courses online, 96 taking some of their classes online, and 264 taking none of their classes online. Can you conclude that the percentage distribution of modes of student learning has changed from the 2018-2019 school year? Use a 5% level of significance.

8. At a large university system, 28% of students are freshmen, 22% sophomores, 20% juniors, 19% seniors, and 11% post graduate. A group of students have come together from one of the universities to perform a community service project. There were 85 freshmen, 55 sophomores, 49 juniors, 35 seniors, and 15 post graduate students. Test, using a 1% significance level, if the percentage distribution of students by class in the system is different from the distribution by class of students who would volunteer.

9. In a roulette wheel, there are 18 red slots, 18 black slots, and 2 green slots. Each slot is supposed to have the same probability of occurring. While watching a particular wheel for a while, you notice red shows up 432 times, black shows up 502 times, and green shows up 51 times. Can you conclude using a 1% level of significance that the wheel is not fair?

10. According to researchgate.net, the percentage of first-time blood donors in 2007 in the Atlanta area was 26.5% African American, 64.4% White, 3.8% Hispanic, 3.2% Asian, and 2.1% other. After an advertising campaign to encourage groups underrepresented in the past to donate blood, a random sample of 377 first-time donors is taken and there are 91 African American, 182 White, 42 Hispanic, 32 Asian, and 30 other. Test, using a 5% level of significance, if the percentage distribution of first-time blood donors has changed since 2007.

11. In Mendelian genetic theory, if both parents of offspring have a recessive and a dominant gene, then 50% of the offspring should also have both the recessive and dominant gene, 25% should have two dominant genes, and 25% should have two recessive genes. Several fruit flies, each with the dominant and recessive genes, produce several offspring. It is determined that 364 have both the recessive and dominant gene, 163 have two dominant genes, and 190 have two recessive genes. Test, using the 2.5% level of significance, if the distribution of offspring is significantly different from the predicted percentages.

12. According to a report by PC Magazine, in 2018 43.9% of cell phones sold were Apple phones, 26.9% were Samsung and 29.2% were from other vendors. A recent sample of 325 cell phones found that 156 were Apples, 94 were Samsungs, and 75 were other brands. Test, using a 1% significance level, if the percentage distribution of sales has changed from 2018.

13. In a large city, a police commissioner is investigating the relationship between the day of the week and the number of DUI arrests made. A random sample of 351 DUI arrests that were made on a non-weekend day were inspected and it was found that there were 71 on Mondays, 86 on Tuesdays, 91 on Wednesdays, and 103 on Thursdays. Test, using a 1% level of significance, if DUI arrests made on a non-weekend day are not uniformly distributed throughout those days.

14. A grocer is performing an experiment. In the aisle that contains the chips, the grocer puts the same type of chip on each of the four different shelves. The shelves are always kept well stocked and at the end of the week the grocer notes the number sold from each shelf. There were 136 bags sold on the top shelf, 163 on the second shelf, 106 on the third shelf, and 84 on the bottom shelf. Using a 5% level of significance, test if the distribution of the shelves from which the chips were sold are not all the same.

11.3 Tests of Independence and Homogeneity

Our next test using $\chi^2$ is a variation on the Goodness of Fit test. The tests of Independence and Homogeneity are essentially the same test. The only difference is whether you view your data as coming from several populations or from one population consisting of subgroups. For example, we can think of the population of all voters as one population where some are Democrats, others Republicans, etc. or we can think of it as a population of Democrats, a population of Republicans, etc. Neither way is ‘wrong’. In some problems, the implication is we are looking at one population, in others several. It doesn’t matter if you think about it the ‘other’ way.

Example 11.3.1.

The city manager has been accused of inappropriate finances. Five hundred adults from the city are polled as to whether or not the manager should step down and if they are registered voters. The results are in the following table. Determine if ‘Voter Status’ and ‘Opinion of stepping down’ are independent. Use a 5% level of significance.

               Leave   Stay
Registered     92      99
Unregistered   108     201

When we studied independence during our trek through probability, we looked at the independence of two events only: for example, in this case ‘Registered’ and ‘Stay’. For the tests of independence, we want to determine if the categories are not independent: ‘Voter Status’ and ‘Opinion’.

Solution.

We are trying to show that voter status and opinion are dependent (or not independent or related). Also α = .05.

Note that what we have are counts in the table. The 92 is a count of how many of those polled are registered voters and think the manager should leave. If we had the expected values we could compare using a Goodness of Fit test. This is exactly how we will proceed.

1. H0 : Voter status and opinion of adults in the city are independent.
   H1 : Voter status and opinion of adults in the city are dependent.

2. Since this is a variation on Goodness of Fit we have

Use $\chi^2$ because E′s > 5 (we need to check this)

3. Since the categories are interrelated, the degrees of freedom decrease. To determine the degrees of freedom, cover up a row and a column of numbers in the original table and count the number of data values left. In this case we get 1 degree of freedom. This is a right-tailed test because all Goodness of Fit tests are right-tailed.

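[Figure: $\chi^2$ distribution with df = 1; the rejection region is the right tail beyond the critical value 3.841.]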

4. The following is merely for the sake of completeness. This calculation is easily done on a calculator. After this example that is how we will do the calculation.

To calculate the test statistic we need to get the expected values. In order to get the expected values we will look at the totals of the columns and rows.

               Leave   Stay   Total
Registered     92      99     191
Unregistered   108     201    309
Total          200     300    500

Note that from the totals we get that 40% of respondents want the manager to leave (= 200/500 × 100%). Since there are 309 unregistered voters, we expect 40% of them to want the manager to leave. This gives us 123.6, which is in parentheses in the following table. Note that this assumes the categories are independent, which is what the null hypothesis states. We can similarly calculate the other expected values.

               Leave     Stay      Total
Registered     92        99        191
               (76.4)    (114.6)
Unregistered   108       201       309
               (123.6)   (185.4)
Total          200       300       500

The expected values are in parentheses below the observed values. The calculation is exactly as before (using the counts and the expected values) and we get $\chi^2 = 8.590$. Notice that the smallest expected value is 76.4, which is greater than 5 as we stated in step 2. If it had not been, we couldn’t make the following conclusion.

5. There is sufficient evidence that the voter status and opinion of adults in the city are dependent.
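The expected values and test statistic in this example can be reproduced with the following Python sketch. The function names are our own; each expected count is (row total × column total) / grand total, which is the same as applying the column percentage to each row total as we did above.

```python
def expected_counts(table):
    """Expected counts under independence for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in column_totals] for r in row_totals]

def chi_square_statistic(table):
    """Sum of (O - E)^2 / E over every cell of the table."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for observed_row, expected_row in zip(table, expected)
        for o, e in zip(observed_row, expected_row)
    )

# Rows: Registered, Unregistered. Columns: Leave, Stay.
observed = [[92, 99], [108, 201]]

expected = expected_counts(observed)
chi_square = chi_square_statistic(observed)

print([[round(e, 1) for e in row] for row in expected])
print(round(chi_square, 3))
```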

Example 11.3.2.

The owner of a business has three stores. Customer service is important to the owner. Several customers were polled about their experiences at the store. They were asked their level of satisfaction with the employees. The results are in the following table. Test, using a 1% significance level, if the distributions of the responses for the three stores are not all the same.

            Store A   Store B   Store C
Excellent   108       127       83
Very Good   72        103       20
Good        53        72        10
Poor        18        22        9

Solution.

Whereas the last one was independence (one population) this is homogeneity (three populations: store A customers etc.). We could have thought of the first example as two populations: registered and unregistered. Or we could think of this example as one population, some of whom happen to frequent store A etc.

1. H0 : The distributions of the responses for the three stores are all the same.
   H1 : The distributions of the responses for the three stores are not all the same.

2. $\chi^2$ because E′s > 5

3. α = .01, df = 6

[Figure: $\chi^2$ distribution with df = 6; the rejection region is the right tail beyond the critical value 16.812.]

4. We get $\chi^2 = 33.985$ with a p-value given as $6.77 \times 10^{-6}$ from our calculator. (See below.)

Since the test statistic is very far to the right of the critical value, we have very little faith in the reported p-value. Clearly it is very small but could easily be off by a factor of 10. The problem is the model breaks down when the test statistic is that far to the right. It is better to report the value as p-value < .001.

In the second step we stated that E′s > 5. Let us check this. Go to 2nd>MATRIX>EDIT>2:B. This is what your calculator has calculated the expected values to be. Look for the smallest value in the matrix. It is 8.5768. This means that what we have done is valid. We can now make a conclusion.

5. There is sufficient evidence that the distributions of the responses for the three stores are not all the same.

Test statistic for Tests of Independence/Homogeneity

2ND MATRIX (on the $x^{-1}$ button)

EDIT>1:A

Change the dimensions to match the dimensions of the matrix with the observed data (rows × columns).

Input the observed matrix

STAT>TESTS> χ2-Test

Leave observed and expected matrices as A and B

Use arrow keys to highlight Calculate and hit [ENTER]
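For readers without the calculator, the same quantities can be sketched in Python using the three-store table from Example 11.3.2. Here `observed` plays the role of matrix A and `expected` the role of matrix B; the names are our own, and the degrees of freedom are computed as (rows − 1)(columns − 1), which gives the same count as the cover-up rule.

```python
# Rows: Excellent, Very Good, Good, Poor. Columns: Store A, Store B, Store C.
observed = [
    [108, 127, 83],
    [72, 103, 20],
    [53, 72, 10],
    [18, 22, 9],
]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell: row total * column total / grand total.
expected = [[r * c / grand_total for c in column_totals] for r in row_totals]

# Covering one row and one column leaves (rows - 1) * (columns - 1) cells.
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

smallest_expected = min(min(row) for row in expected)

chi_square = sum(
    (o - e) ** 2 / e
    for observed_row, expected_row in zip(observed, expected)
    for o, e in zip(observed_row, expected_row)
)

print(degrees_of_freedom, round(smallest_expected, 4), round(chi_square, 3))
```

Checking `smallest_expected` before drawing a conclusion mirrors looking for the smallest entry of matrix B on the calculator.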

11.3.1 Exercises

1. Several people coming out of a theater after watching a particular movie were asked if they liked it and when the last time they saw a movie was. Test using a 5% level of significance if whether or not someone liked the movie and the last time they saw a movie are dependent.

              less than 6 months   6 months to a year   more than 1 year
Liked         138                  103                  24
Didn’t like   81                   90                   18

2. Several people were asked if they are smokers and also if at least one parent is a smoker. The results are in the table below. Test if being a smoker and having parents that are smokers are related. Use a 1% level of significance.

                      Smoker   Non-smoker
Parent Smokes         103      246
Parents Non-smokers   121      499

3. For a project, a safety student observes cars go by and notes if the driver is using their phone and how many people are in the car. The results follow. Test, using a 5% level of significance, if whether or not a person is on their phone and if they have passengers are related.

               Alone   Has passengers
On Phone       85      264
Not on Phone   136     159

4. The managers of two restaurants are discussing the party sizes at their restaurants. Several randomly chosen parties are selected and the size is noted at the two restaurants. One of the managers claims that the distribution of party sizes is the same for both restaurants. Test at a 5% significance level if the manager’s claim is false.

Page 283: Introductory Statisticsbkrein/Introductory Statistics...Statistics is the collection of methods used to collect, analyze, and interpret data and use to the data make decisions. Statistics

11.3. TESTS OF INDEPENDENCE AND HOMOGENEITY 277

Party Size Restaurant #1 Restaurant #21 12 152 67 723 51 434 84 705 52 38

6+ 31 19

5. Two college deans are discussing the demographics of their students when one of them states that the distribution of the status of students each semester is different at their school. A random sample of several students at each college is selected and the results are in the table below. Test if the dean’s claim is correct at the 2.5% level of significance.

            College A   College B
Part Time   79          88
Full Time   164         132

6. At a small, yet busy, coffee kiosk, some patrons leave a tip and others do not. A statistics savvy employee decides to check if the mode of payment and whether or not a tip was left are related. Several patrons were observed and the results are summarized in the table. Test if the mode of payment and tip tendencies are related. Use a 5% level of significance.

         Cash   Card   Other
No Tip   25     37     19
Tip      32     38     10

7. A frustrated phone/bank customer can’t decide if the ratings for ‘On Hold Bank’ and ‘Can You Wait Phone Company’ are different. Random samples of customers from each were selected and asked to rate their experience on the phone. The results are in the table. Can you conclude, using a 1% level of significance, that the distributions of customer responses by all customers at ‘On Hold Bank’ and ‘Can You Wait Phone Company’ are different?

Service     On Hold Bank   Can You Wait Phone
Excellent   35             46
Good        134            146
Poor        156            109
Terrible    35             18

8. Two competing airlines are trying to get more riders. Before doing this they need to determine the demographics of the customers. A random sample is selected and the customers indicate why they are flying. The results follow. Test, using a 5% significance level, if the distributions of the reasons people fly for the two companies are not the same.

Reason                  Airline A   Airline B
Business                310         260
Personal, leisure       483         361
Personal, non-leisure   204         106

9. A survey in 2015 asked fliers how they checked in for their flights. A recent survey asked the same question. Test, using a 1% level of significance, if the distribution of methods of checking in has changed since 2015.

Method                   2015   Current
Computer                 264    186
Mobile device            113    103
Kiosk at airport         53     29
Airport ticket counter   35     16

10. A sociologist is comparing the entertainment consumption of Gen Z versus Gen X. Several of these people were asked if they had subscriptions to different entertainment. Test if the distribution of entertainment consumption of Gen Zers is different from Gen Xers using a 5% level of significance.

Service           Gen X   Gen Z
Pay TV            228     268
Streaming Video   320     308
Streaming Music   232     188
Gaming Service    208     132

We can do certain tests more than one way. In the following problems, the tests can be performed as a test of independence or a test of proportions with the alternative hypothesis being the proportions are not equal.

11. Several college students were asked if they currently take a math class and if they feel they don’t get enough sleep. See table.

                   Enrolled   Not enrolled
Not enough sleep   126        253
Enough sleep       36         135

(a) Using the methods of this section, test if whether or not someone is enrolled in a math class and if they get enough sleep are independent. Use α = .05 with the p-value approach.

(b) In order to do the test, we need the expected values to be greater than 5. What is the minimum expected value?

(c) Do the test a second time. This time do it as a test of proportions. Let X1 be the number of Enrolled who don’t get enough sleep and n1 be the total number enrolled. Similarly define X2 and n2. Use the same level of significance and use the p-value approach.

(d) How do the p-values compare?

(e) What is the relationship between the test statistics?

12. Several fliers were asked if they are a member of a frequent flier program and if they are married. The results follow.

             Married   Not married
Member       137       235
Non member   61        82

(a) Using the methods of this section, test if whether or not someone is a member of a frequent flier program and if they are married are related. Use α = .05 with the p-value approach.

(b) In order to do the test, we need the expected values to be greater than 5. What is the minimum expected value?

(c) Do the test a second time. This time do it as a test of proportions. The populations are Married and Not Married; X for the two samples is the number who are members. Use the same level of significance and use the p-value approach.

(d) How do the p-values compare?

(e) What is the relationship between the test statistics?


Chapter 12

Analysis of Variance


12.1 Analysis of Variance

We continue our hypothesis tests with an extension of the two sample t test, which is itself an extension of the t test of µ. In hypothesis tests of µ we compared it to a fixed value, e.g. H0 : µ = 5 is a typical null hypothesis. When we extended this to a hypothesis test of two means we had the difference equal to a value, usually zero, i.e. H0 : µ1 − µ2 = 0. We now need to extend this to three or more means. This is a bit of a challenge. With one mean we can compare it to a value; with two we can subtract them and compare the difference to a number. What to do with three? There isn’t an obvious way to combine the three numbers into a single meaningful number. Not to worry, that is where analysis of variance comes in. Notice the name: analysis of variance. In this test we will essentially be testing the equality of two variances. A typical null hypothesis will be H0 : µ1 = µ2 = µ3.

The basic idea is simple: look at the means for the samples. If the X̄’s are not ‘close’ then we conclude the population means are not all equal. Simple. The devil is in the details here. Let’s look at three means located on a number line.

[Figure: the sample means X̄3, X̄2, and X̄1 marked on a number line]

Are the values close together or far apart? By now, we hope that you realize that there is not enough information to answer the question. Let us consider what might make the values of the X̄’s seem far apart. There are two things. The first is the variance of the populations involved: the larger the variance is, the larger the spread of the data will appear. The other is what we are trying to detect, that is, that the population means are not all the same. Before we discuss the test, let us get the assumptions required to do the test.

Assumptions required to do an ANOVA test.

1. The samples are chosen independently
2. The populations are normally distributed
3. The variances of the populations are equal

The first assumption we have had whenever we are taking more than one sample. Let us look at the last assumption. The variances are all equal. This is the key to the test. What we will do is estimate the common variance two ways. The first way is to treat the X̄’s as individual data values and determine the variance of them, $\sigma_{\bar{X}}^2$.¹ The second way is to pool the estimates of the population variances, $\sigma_p^2$. If the common variance is ‘large’, then the two estimates of the common variance will both be ‘large’ but when we look at the ratio, the ‘largeness’ will cancel. Likewise if the variance is ‘small’. If the first estimate is much larger than the second estimate then we will conclude that the null hypothesis is false. The entire test turns into a hypothesis test of variances with $H_1 : \sigma_{\bar{X}}^2 > \sigma_p^2$.

Since this is a test of variances we will need the F-distribution. The degrees of freedom are given by df = (k − 1, n − k) where k is the number of samples (populations from which your samples are taken) and n is the total number of data values.

¹ This requires weighting the X̄’s. We will not go into the calculation here.

The calculations required to do the test are a bit of a challenge, so we will focus on the computer output instead.

Example 12.1.1.

An engineer is testing the force required to break a block of concrete. Samples of the concrete will be cured and then force is applied until the concrete breaks. The engineer is comparing three different concrete mixtures. The forces at which the blocks break for the three different mixtures are normally distributed with equal standard deviations. The breaking forces for the samples, in pounds, are given below. Test, using a 1% significance level, if the mean forces required for the blocks to break using the three different mixtures are not all the same.

Mixture 1   Mixture 2   Mixture 3
1025        1253        1326
1135        1356        1243
1346        1425        1326
1225        1532        1524
1334        1402        1267

Solution.

We are asked to show if ‘the mean forces . . . are not all the same’. This translates to H0 : µ1 = µ2 = µ3 with an alternative hypothesis of H1 : At least one mean is different. Also, we have k = 3 (we have three different mixtures of concrete) and n = 15, the total number of blocks.

Following the steps for the hypothesis test we get

1. H0 : µ1 = µ2 = µ3
   H1 : At least one mean is different

2. Use F because X’s ∼ N and σ²’s are equal. Since we are determining if $\sigma_{\bar{X}}^2 > \sigma_p^2$, our ANOVA tests will always be right tailed. Also, df = (2, 12) so we have F.01 = 6.93.

3. [Figure: F-distribution curve with the right-tail rejection region of area α = 1% starting at F = 6.93]

   To determine our test statistic we will enter the data for the different mixtures in L1, L2, and L3. Then run the ANOVA test.

4. F = 3.11 . . . (from the calculator; see below)

5. Since the test statistic is not in the rejection region, we cannot conclude that the mean breaking forces for the three mixtures are not all the same.


Doing ANOVA Test on Calculator

Input data into separate lists

Select STAT>TESTS>ANOVA

Key in lists, separated by commas

ENTER.

Let us look at the output from our calculator. We get what is called the ANOVA table. The general form of the ANOVA table is

Source of   Degrees of   Sum of    Mean     Test
Variation   Freedom      Squares   Square   Statistic
Between     k − 1        SSB       MSB      F = MSB/MSE
Error       n − k        SSE       MSE
Total       n − 1        SST

Where

n is the total number of data values from all the samples
k is the number of samples
SSB is the sum of squares between the samples. Long calculation involved.
SSE is the sum of squares of the errors. Long calculation involved.
MSB is the mean sum of squares between the samples and MSB = SSB/(k − 1)
MSE is the mean sum of squares of the errors and MSE = SSE/(n − k)
F is the test statistic and F = MSB/MSE
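The quantities in the ANOVA table can be sketched in a few lines of Python using the data from Example 12.1.1. The text does not spell out the formulas for SSB and SSE; the ones below are the standard one-way ANOVA definitions, and the function name `anova_table` is our own, so treat this as an illustration rather than the calculator's exact procedure.

```python
def anova_table(samples):
    """One-way ANOVA quantities for a list of samples (sample sizes may differ)."""
    k = len(samples)                          # number of samples
    n = sum(len(s) for s in samples)          # total number of data values
    grand_mean = sum(sum(s) for s in samples) / n
    means = [sum(s) / len(s) for s in samples]

    # SSB: squared distance of each sample mean from the grand mean,
    # weighted by the size of that sample
    ssb = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, means))
    # SSE: squared distance of each data value from its own sample mean
    sse = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)

    msb = ssb / (k - 1)
    mse = sse / (n - k)
    return {"df": (k - 1, n - k), "SSB": ssb, "SSE": sse,
            "MSB": msb, "MSE": mse, "F": msb / mse}


# Breaking forces, in pounds, from Example 12.1.1
mixture_1 = [1025, 1135, 1346, 1225, 1334]
mixture_2 = [1253, 1356, 1425, 1532, 1402]
mixture_3 = [1326, 1243, 1326, 1524, 1267]

table = anova_table([mixture_1, mixture_2, mixture_3])
for name, value in table.items():
    print(name, value)
```

The printed values should agree, up to rounding, with the calculator output for this example.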

For the problem above our calculator gives us the ANOVA table broken up by source.

Source of   Degrees of   Sum of     Mean      Test
Variation   Freedom      Squares    Square    Statistic
Between     2            85371.6    42685.8   F = 3.1187
Error       12           164242     13686.8
Total       14           249613.6

In addition, the calculator also gives us the p-value (0.0812) as well as Sxp = 116.99.

Coming back to the original idea of the test, MSB and MSE are the two variances we discussed earlier. The variance using the X̄’s as data values is what is called MSB and the pooled variance is MSE. Sxp = 116.99 is the square root of this pooled variance (√13686.8 = 116.99). The test statistic tells us that the variance using the X̄’s is 3.11 times as big as the pooled variance. The cutoff for the given level of significance was 6.93 times the pooled variance.
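As a rough check on the p-value, the critical value, and Sxp, here is a minimal Python sketch. It relies on a closed form for the right-tail area of the F-distribution that holds only when the numerator degrees of freedom equal 2 (that is, k = 3 samples). That formula is a property of the F-distribution and is not derived in this text. The variable names are our own.

```python
import math

f_stat = 3.1187        # test statistic from the ANOVA table
df_error = 12          # n - k
mse = 13686.8          # mean square for error from the ANOVA table
alpha = 0.01

# For numerator df = 2 only:
# P(F > x) = (1 + 2x/df_error) ** (-df_error / 2)
p_value = (1 + 2 * f_stat / df_error) ** (-df_error / 2)

# Solving P(F > x) = alpha for x gives the critical value
critical_value = (df_error / 2) * (alpha ** (-2 / df_error) - 1)

pooled_sd = math.sqrt(mse)   # the calculator's Sxp

print(round(p_value, 4))
print(round(critical_value, 2))
print(round(pooled_sd, 2))
```

For other numbers of samples the F table or the calculator is still needed, since no such simple formula applies.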

A word about the ANOVA table. For what we are trying to achieve in this text the table is not very useful. It is presented here because it is almost always given and we want to expose the reader to it since some of you will move on and use it in further courses/careers.


Example 12.1.2.

A small restaurant chain has 3 stores. The owner wants to compare the wait times for tables during the lunch hour. It would be reasonable to assume that the wait times are approximately normal with equal standard deviations. The owner randomly selects several lunch hour patrons and finds the time it takes until they are seated. The results are in the table that follows. Test, using a 5% level of significance, if the mean wait times for the three restaurants are not all the same.

East Side   North Side   South Side
5.64        10.65        3.10
9.84        12.54        5.61
10.64       8.65         7.54
7.59        16.50        2.97
3.79        12.40        7.76
5.49                     2.26

Solution.

Since we are looking to see if ‘the mean times . . . not all equal’, we are looking at an ANOVA test.

Following the steps for the hypothesis test we get

1. H0 : µ1 = µ2 = µ3
   H1 : At least one mean is different

2. Use F because X’s ∼ N and σ²’s are equal

3. df = (2, 14) so we have F.05 = 3.74

   [Figure: F-distribution curve with the right-tail rejection region of area α = 5% starting at F = 3.74]

   To determine our test statistic we will enter the wait times for the three restaurants in L1, L2, and L3. Then run the ANOVA test.

4. F = 10.47

5. We conclude the average wait times for the three restaurants are not all the same.
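The test statistic for this example can be checked with a short Python sketch that also handles unequal sample sizes. It assumes the North Side column is the one with five values, which is consistent with n = 17 and df = (2, 14). The critical value 3.74 is taken from the solution above rather than computed, and the variable names are our own.

```python
from statistics import mean

# Wait times from Example 12.1.2 (assumed reading: North Side has five values)
east = [5.64, 9.84, 10.64, 7.59, 3.79, 5.49]
north = [10.65, 12.54, 8.65, 16.50, 12.40]
south = [3.10, 5.61, 7.54, 2.97, 7.76, 2.26]

samples = [east, north, south]
k = len(samples)
n = sum(len(s) for s in samples)
grand_mean = sum(sum(s) for s in samples) / n

# Each sample mean is weighted by its own sample size
msb = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in samples) / (k - 1)
mse = sum((x - mean(s)) ** 2 for s in samples for x in s) / (n - k)
f_stat = msb / mse

critical_value = 3.74            # F.05 with df = (2, 14), from the solution
reject_h0 = f_stat > critical_value

print((k - 1, n - k), round(f_stat, 2), reject_h0)
```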

12.1.1 Exercises

1. Suppose the samples below come from normally distributed populations with equal standard deviations. Test if the means are not equal at the 1% level of significance two ways.

Sample 1   Sample 2
56         75
48         71
59         48
37         88
42         91
59         63

55

(a) Using the two sample t test perform the test. Use the p-value approach.

(b) Use the ANOVA procedure to perform the test. Use the p-value approach.

(c) What similarities and differences are there in the results of the two tests?

2. Suppose the samples below come from normally distributed populations with equal standard deviations. Test if the means are not equal at the 1% level of significance two ways.

Sample 1   Sample 2
73         135
49         123
103        54
39         124
31         102
82         99

102

(a) Using the two sample t test perform the test. Use the p-value approach.

(b) Use the ANOVA procedure to perform the test. Use the p-value approach.

(c) What similarities and differences are there in the results of the two tests?

For the following problems use the critical value approach if the critical value is in the F-distribution table. If not, use the p-value approach using your calculator.

3. The Ace Test Preparation business offers a course to prepare students for the SAT. Several students sign up for the service and are split up into four groups randomly. Each group gets a different teacher for the course. The students report their math scores after they take the SAT. We can assume the scores for the teachers are approximately normally distributed with equal variances.

Teacher 1   Teacher 2   Teacher 3   Teacher 4
451         564         599         659
579         759         569         771
635         657         741         623
540         659         597         603
754         465         546         523
556         505         432         451

Test, using a 5% significance level if the mean scores for the four teachers are not all the same.


4. Avocados are high in potassium. An avocado farmer has three different orchards. For the three orchards, the farmer takes a sample of several avocados, mashes the avocados up and collects a 100 gram sample from each fruit selected. The potassium content of each fruit is then determined by a lab. The amounts of potassium in the samples for the three orchards are approximately normal with equal standard deviations. The amount of potassium, in mg, is given below.

Orchard 1   Orchard 2   Orchard 3
650         685         635
685         702         652
641         662         618
625         651         609
671         685         665

695 627

Test if the mean amounts of potassium in each orchard are not equal. Use a 5% level of significance.

5. A math instructor is looking at trying an experiment. The instructor has been assigned three statistics courses and will require extensive homework in the 8 am class, a moderate amount of homework in the 9 am class, and no homework but daily quizzes instead in the 10 am class. The students’ final exam scores for a sample of students are given in the table. Assuming that the standard deviations of the scores of all students with the three modes of instruction are equal and the populations are normal, determine if the mean scores from all students in the three modes are not all the same. Use a 5% level of significance.

8 am class   9 am class   10 am class
65           73           66
85           56           58
91           78           82
88           74           85
64           72           64
81           53           92
59           69           86
81           54           83
76           64           79
75           63           91

Discuss problems with the sampling techniques for this problem.

6. For a student’s science project, the student has decided to determine if music affects the growth of a plant. The student plants 24 seeds. For 8 of the plants, the plants are subjected to classical music. Another group of 8 are subjected to heavy metal, and the remaining 8 are in a music-free environment. The plants are measured after 4 weeks to see how tall the plants are. To the student the heights of the three groups of plants appear to be approximately normal with equal standard deviations. Test, at a 10% level of significance, if the mean heights of all such plants are not all the same for the three types of music.

Classical   Heavy Metal   No Music
24.5        28.6          30.8
23.6        25.9          21.8
28.4        26.4          26.5
22.6        22.4          25.4
21.4        20.8          26.8
19.8        26.4          28.4
27.6        23.4          26.9
22.4        20.7          27.8

7. A candle maker is experimenting with different compositions of wax (beeswax, paraffin, soy wax, etc.) and is interested in the average time a 10-inch taper will burn. The maker has 4 different mixtures under examination: the traditional mix, the organic mix, the eco-friendly mix, and the synthetic mix. The maker randomly selects 10 candles of each type, sets them all up in a room, lights them at the same time, and notes how long it takes until each candle burns itself out. Someone let the cat in. Now we don’t have as many as before. The times, in hours, the tapers lasted until they were extinguished are given below. Assume the times are approximately normal with equal variances.

Traditional   Organic   Eco-Friendly   Synthetic
10.0          9.5       10.8           9.1
12.5          10.8      10.6           9.8
13.6          11.7      9.5            9.6
14.8          12.6      9.9            11.4
9.8           11.5      11.4           10.8
9.6           12.4      10.6           11.6
8.5           10.5      10.1           12.4

12.6 12.3 10.6
13.5 10.2

Test at the 1% level of significance if the mean burning times for tapers made of the four waxes are not all the same.

8. A marine biologist is investigating total coliforms in the water where the biologist enjoys surfing. The total coliforms per 100 ml of water are given in the table. The biologist tests quite often so it is well known to the biologist that the distribution of coliforms on any given day will be approximately normal. It has been further observed that the variances are equal.

Steamer’s Lane   Manresa   Pleasure Pt.   The Hook
1156             1264      1367           1957
1264             1354      1689           2305
1354             1467      1579           1856
1256             1249      1687           2013
1467             1456      1397           1645
1679             1554      1675           2139

Test, at a 5% level of significance, if the mean coliform levels for the surfing spots are not all the same.


Chapter 13

Linear Regression


13.1 Descriptive Statistics using Linear Regression

In this chapter we will look at the relationship between two different variables. They are typically going to measure different things, for example, heights versus weights. They have different units, as will be the case for a lot of regression problems. Since we have two variables, we need to distinguish them. Just as when you looked at graphs in a previous course, we have x and y. The variable x is called the independent variable or explanatory variable. The variable y is the dependent variable. It ‘depends’ on x. And x ‘explains’ the changes in y.

As we continue to look at regression we need to distinguish an observational study from an experiment. In order for us to do an experiment we need to be able to change one variable and then see what happens to the other variable. For many situations, this is something we cannot do. For example, if you are looking at height vs weight, we can’t change someone’s height and then see what happens to their weight. Instead we observe several people’s heights and weights. We are looking at ‘correlation not causation’. Even if you think that one of the variables causes the other variable to change, the only way to really determine causation is with an experiment.

We will limit our attention to linear regression. Why linear? There are two good answers: first, it is a relatively easy model; second, a lot of relationships are linear on a small interval even if they are nonlinear globally.

We are going to make some assumptions up front and we will proceed from there. First of all, we will assume that the data has a general linear relationship given by y = A + Bx.¹ Most students’ experience with models of this form was assuming the model was exact. That is, for a given value of x you would plug the value into the equation and get y and this was the ‘answer’. We are now introducing variability into the situation. Consider the relationship between heights and weights of adult Americans. If we were given an equation of the form y = A + Bx we certainly wouldn’t expect that plugging a person’s height into the equation would give exactly the person’s weight. We would expect to be close but not exact. If we look at the weights of all Americans that were, say, 5′7′′, their weights would follow a distribution.

[Figure: the line y = A + Bx on x and y axes, with distributions of y drawn at several values of x]

In the diagram, the line y = A + Bx is indicated. A few distributions are shown for different values of x. Notice that the distributions are normal, the mean of each is on the line, and the standard deviations are all the same. We can put this in the following form:

¹ Some texts use y = α + βx.

y = A + Bx + ε

where

ε ∼ N(0, σ_ε)
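To get a feel for what the model says, here is a small simulation sketch in Python. The values of A, B, and σ below are hypothetical and chosen only for illustration, and the function name `simulate_y` is our own. The point is that people who share the same x still have different y values, and those values center on the line.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population values, for illustration only
A, B, sigma = -330.0, 7.3, 15.0


def simulate_y(x):
    """Draw one y from y = A + Bx + epsilon, with epsilon ~ N(0, sigma)."""
    epsilon = random.gauss(0, sigma)
    return A + B * x + epsilon


# Many individuals who all share the same x (say 67 inches)
weights = [simulate_y(67) for _ in range(10_000)]
mean_weight = sum(weights) / len(weights)

print(round(A + B * 67, 1))    # the point on the line
print(round(mean_weight, 1))   # the simulated mean lands close to the line
```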

13.1.1 Scatter Plots

Just as we looked at graphical displays when we first started looking at data, we will do the same here. Consider the picture in the next example. If we randomly select a person that is 5 feet tall, their weight would fall somewhere on the vertical line above 5 feet on the x-axis. Likewise for any other heights we happen to get. If we look at all of these points together we get a scatter plot. We will examine this in the next example.

Example 13.1.1.

A random sample of 8 recently recruited soldiers is selected after finishing boot camp. Their heights and weights are in the table below. Use the heights as the independent variable and draw a scatter plot. Discuss if the assumptions required are reasonable in this case.

Height, inches   60    68    73    66    64    69    71    63
Weight, pounds   105   137   195   159   134   184   201   134

Solution.

To graph the scatter plot we need to plot the ordered pairs on a pair of axes. Since this is a visual display we need to include information for the reader: scales, titles, etc.

When we examine the graph, we notice an overall linear trend. That is not to say the points lie on a line; clearly they do not. They do, however, follow a line, more or less. Also note that if we examine the scatter plot and imagine a line of ‘best fit’ passing through the data values, the points will be evenly distributed on either side of the line.

[Figure: scatter plot titled ‘Height and Weight of Boot Camp Graduates’; x-axis: Height, inches; y-axis: Weight, pounds]

13.1.2 Finding the Line of Best Fit

We would like to estimate the equation of the line that the data follow. Like estimates before, we will have point estimates.

The best fitting line is graphed along with the scatter plot below.

[Figure: scatter plot titled ‘Height and Weight of Boot Camp Graduates’ with the best fitting line drawn; x-axis: Height, inches; y-axis: Weight, pounds]


The term ‘best fit’ needs to be defined. After all, what does ‘best’ mean? Let us consider our example above. Clearly, there should be a relationship between height and weight. The taller a person is, the more they weigh. (Remember, we are talking trends here.) We would ultimately like to estimate someone’s weight simply by knowing their height.

In the next graph let us look at the distance from each point to the line.

[Figure: scatter plot titled ‘Height and Weight of Boot Camp Graduates’ with the best fitting line and a vertical line segment from each point to the line; x-axis: Height, inches; y-axis: Weight, pounds]

The lengths of the line segments represent what are called the residuals. If you look at the largest of the residuals you would describe this person as underweight. For this person’s height, 68 inches, the line predicts about 165 pounds but their actual weight is 137 pounds.

If we call the residuals e (our sample version of ε), then we want to find a and b so that Σe² is minimized.² This is why we often refer to the line as the ‘least squares’ regression line.

Example 13.1.2.

Find the equation of the regression line for the data in the previous example. Use the equation to predict the weight of a 65 inch tall recent boot camp graduate.

Solution.

As mentioned before, we will rely on our calculator to find this. On a TI-83/84, input the values into L1 and L2.

STAT>CALC>LinReg(a+bx) (option 8)

Note that we want option 8, not option 4, which is ax+b.

² This is a rather straightforward problem for someone who has taken calculus. We will rely on our calculator for the values.

Once we have selected the correct option, specify where your lists are: the x-list is in L1 and the y-list is in L2.

You should get the following output

LinReg
y=a+bx
a=-330.7946768
b=7.294676806

You may also have r² and r. More on that later.

Our equation is y = −330.79 + 7.295x

Note that this equation is only applicable to the population in question: all recent boot camp graduates.

Let us evaluate the equation at x = 65. y = −330.79 + 7.295 × 65 = 143.385. A reasonable way to round would be to the nearest pound so our estimate of the weight of a recent graduate that is 65 inches tall is 143 pounds. As mentioned before, this is a point estimate. Interval estimates are on their way.
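The calculator output can be reproduced with a short Python sketch. The text leaves the minimization to the calculator. The formulas b = Sxy/Sxx and a = ȳ − b·x̄ used below are the standard least-squares results and are not derived here. The helper name `predict` is our own.

```python
heights = [60, 68, 73, 66, 64, 69, 71, 63]          # x, inches
weights = [105, 137, 195, 159, 134, 184, 201, 134]  # y, pounds

n = len(heights)
x_bar = sum(heights) / n
y_bar = sum(weights) / n

# Standard least-squares formulas: b = Sxy / Sxx and a = y_bar - b * x_bar
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(heights, weights))
s_xx = sum((x - x_bar) ** 2 for x in heights)
b = s_xy / s_xx
a = y_bar - b * x_bar


def predict(x):
    """Point estimate of weight, in pounds, for a height of x inches."""
    return a + b * x


# Residual = actual weight minus predicted weight
residuals = [y - predict(x) for x, y in zip(heights, weights)]

print(round(a, 4), round(b, 4))
print(round(predict(65)))         # prediction to the nearest pound
print(round(min(residuals), 1))   # most negative residual (the 68 inch graduate)
```

Using the unrounded a and b gives a slightly different value at x = 65 than the hand calculation with rounded coefficients, but both round to the same whole pound.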

13.1.3 Interpreting a and b

If we recall from algebra, −330.79 represents the y-intercept and 7.295 represents the slope. We want to interpret these in terms of the problem. Recall that we get the y-intercept by setting x equal to 0. In this particular case, this means that a person that is 0 inches tall is expected to weigh about −331 pounds. This makes no sense. What this is telling us is that we are pushing our model way beyond reasonable values of x. In several situations, the y-intercept will have no meaning. Often, 0 is so far from the observed values of x that the intercept is not reliable as an estimate.

Now let’s look at 7.295. This is the slope. Remember that

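\[ \text{slope} = \frac{\text{rise}}{\text{run}} \]

If we set the run equal to one we get

\[ \text{slope} = \frac{\text{rise}}{\text{run}} \quad\Longrightarrow\quad 7.295 = \frac{\text{rise}}{1} \]

or

\[ 7.295 = \text{rise} \]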

What we get out of this is that if the height increases by one inch (run = 1) then we expect the weight to increase by 7.295 pounds. We are not suggesting that the boot camp graduates will grow. What is happening is that if we pick a graduate that is one inch taller than another graduate, we expect them to weigh 7.295 pounds more than the other graduate.

Example 13.1.3.

A backyard farmer loves to grow tomatoes. The farmer is trying to determine the relationship between yield and the amount of water the plants get. The farmer turns on the drip water system every day and times how long the water is on each plant. The data are in the following table.

Water times, minutes   4.8    5.2    6.1    6.8    7.5    7.3    7.1
Yield, pounds          50.2   51.8   56.4   66.5   60.1   73.2   66.1

Notice the water units are a bit odd. The farmer doesn’t have an easy way to determine the volume of water, which is what you really want, but the farmer can easily determine how long the water runs.

1. Is this an experiment or observational study?

2. Do you expect a linear relationship between the two variables?

3. Construct a scatter plot for the data.

4. Do the data follow a linear relationship?

5. Find the equation of the regression line.

6. Interpret a and b

7. Predict the yield if the water is run for 6.5 minutes each day.

Solution.

1. Since our farmer is changing the amount of water and observing the yield it would be considered an experiment. It could be better. To be better, the farmer should have a better way to measure the water.

2. A linear relationship would be unrealistic for all values of x. If you overwater, or underwater, the plants, they will die and produce no tomatoes. But if we have a fairly small range of values, we expect the relationship to be close enough to linear.

4.5 5 5.5 6 6.5 7 7.5

50

60

70

Water, minutes

Yield,pou

nds

Tomato Yield and Amount of Water


4. Yes. Although the data don’t fall exactly on a line, they follow a line fairly closely.

5. By running the regression analysis on our calculator as in the last example, we get

ŷ = 17.58 + 6.725x

6. The value of a, which is 17.58, represents the yield of tomatoes, in pounds, if no water is applied. (If you live in a hot region you will say this doesn’t make sense because the plant will die. In cooler climates, or where rainfall leaves ample moisture in the soil, the plants will grow without any water applied by the farmer.)

The value of b is the slope, which tells us that for each additional minute the water is run, we can expect to get an additional 6.725 pounds of tomatoes.

7. We expect

ŷ = 17.58 + 6.725 × 6.5 = 61.3

So if a plant is watered for 6.5 minutes we expect a yield of 61.3 pounds of tomatoes.
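The calculator hides the arithmetic behind the regression line. As a rough cross-check, the following Python sketch recomputes a and b for the tomato data with the usual least-squares formulas and then predicts the yield at 6.5 minutes. The function name `fit_line` and the variable names are illustrative assumptions, not anything from the text or the calculator.

```python
# Least-squares line for the tomato data in Example 13.1.3.
# fit_line and the variable names are hypothetical, chosen for illustration.
water = [4.8, 5.2, 6.1, 6.8, 7.5, 7.3, 7.1]            # x, minutes
yield_lb = [50.2, 51.8, 56.4, 66.5, 60.1, 73.2, 66.1]  # y, pounds


def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx          # slope
    a = y_bar - b * x_bar  # intercept
    return a, b


a, b = fit_line(water, yield_lb)
predicted = a + b * 6.5  # expected yield for 6.5 minutes of water
print(round(a, 2), round(b, 3), round(predicted, 1))
```

The printed values should agree with the rounded values in the solution (17.58, 6.725, and 61.3).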

Finally, for the descriptive statistics portion of regression we want a measure of how well the data fit the line. In the last example we noticed that, within the range of values, more water meant more tomatoes. If there were no relationship between the amount of water and the yield, our estimate for the yield would be the same regardless of how much water was applied. A big if, granted. If we look at the yields, it is easy to calculate the average yield for the plants. It is 60.6 pounds. So if you asked for the yield when water is applied for 5.4 minutes, the expected yield would be 60.6 pounds. 7.4 minutes? 60.6 pounds. 4.9? 60.6 pounds!

Consider the following table. (Values rounded.)

x     y      ŷ      e = y − ŷ   e² = (y − ŷ)²   e = y − ȳ   e² = (y − ȳ)²
4.8   50.2   49.9     0.35          0.12         -10.41        108.45
5.2   51.8   52.5    -0.74          0.55          -8.81         77.69
6.1   56.4   58.6    -2.20          4.83          -4.21         17.76
6.8   66.5   63.3     3.20         10.21           5.89         34.64
7.5   60.1   68.0    -7.91         62.59          -0.51          0.26
7.3   73.2   66.7     6.53         42.66          12.59        158.40
7.1   66.1   65.3     0.78          0.61           5.49         30.09
Total                             121.60                       427.31

Consider the first row. We had a plant receive 4.8 minutes of watering. The yield was 50.2 pounds. The model ŷ = 17.58 + 6.725x predicted 49.9 pounds of tomatoes. We had .35 pounds more than the model predicted (we have rounded the values in the table). This is what we get in the fourth column. If we don’t consider the amount of water applied, then our prediction is the mean, 60.6 pounds. We had 10.41 pounds less than this prediction. This is in the sixth column, labeled e = y − ȳ.

If we add up the e² values, using the model and not using the model, we obtain

Σe² = Σ(y − ŷ)² = 121.60 using the regression model


Σe² = Σ(y − ȳ)² = 427.31 not using the regression model

Think of the 427.31 as the total error. After we use the model, the error remaining is 121.60. If we subtract them, we see that the model removed 427.31 − 121.60 = 305.71 of the error. So what proportion of the error did we remove?

$$\frac{305.71}{427.31} = 0.7154$$

So we removed 71.54% of the error by applying the model. We would like to remove 100% but that would imply an exact relationship between the two variables. If we wanted to remove more of the error and get a better prediction, we could look at additional independent variables: amount of sunlight the plant gets, position in the garden, etc. This would lead us to multiple linear regression or even multiple non-linear regression. Subjects for a different textbook.
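The "error removed" calculation can be reproduced directly from the data. The Python sketch below is only an illustration with made-up variable names: it refits the line, sums the squared errors with and without the model, and takes the proportion removed.

```python
# Proportion of squared error removed by the regression line (tomato data).
# Variable names are hypothetical; the data are from Example 13.1.3.
water = [4.8, 5.2, 6.1, 6.8, 7.5, 7.3, 7.1]
yield_lb = [50.2, 51.8, 56.4, 66.5, 60.1, 73.2, 66.1]

n = len(water)
x_bar = sum(water) / n
y_bar = sum(yield_lb) / n

# Least-squares slope and intercept.
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(water, yield_lb))
     / sum((x - x_bar) ** 2 for x in water))
a = y_bar - b * x_bar

# Error remaining when the model is used: sum of (y - y_hat)^2.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(water, yield_lb))
# Total error when the mean is used for every prediction: sum of (y - y_bar)^2.
sst = sum((y - y_bar) ** 2 for y in yield_lb)

r_squared = (sst - sse) / sst
print(round(sse, 2), round(sst, 2), round(r_squared, 4))
```

Because the code works with unrounded values of a and b, its sums may differ from the table in the last decimal place.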

The Coefficient of Determination of a bivariate sample data set, denoted r², is the proportion of the error in predicting the dependent variable that is removed by using the regression model with the independent variable. The population Coefficient of Determination is denoted ρ².

The Correlation Coefficient of a bivariate sample data set is denoted r. Its magnitude is the square root of the coefficient of determination, and it has the same sign as b. The population correlation coefficient is denoted ρ and has the same sign as B.

One thing we need to be concerned about is causation versus correlation. We can have, and oftentimes do have, variables that are correlated but one does not cause the other. The following example illustrates this.

Example 13.1.4.

The divorce rate in Maine, per 1000 marriages, and the per capita consumption of margarine, in pounds, for several years are given in the table below.

Margarine      8.2  7.0  6.5  5.3  5.2  4.0  4.6  4.5  4.2  3.7
Divorce rate   5.0  4.7  4.6  4.4  4.3  4.1  4.2  4.2  4.2  4.1

1. Do you expect a linear relationship between the two variables?

2. Construct a scatter plot for the data with per capita margarine consumption as the independent variable, x.

3. Do the data follow a linear relationship?

4. Find the equation of the regression line.

5. Find the correlation coefficient.

6. Interpret a and b.

7. If the per capita margarine consumption is 5.5 pounds per person, estimate the divorce rate in Maine.


8. Comments

Solution.

1. We see no reason why there should be a relationship between the variables, so no.

2. [Figure: scatter plot titled "Divorce Rate in Maine and Margarine Consumption"; horizontal axis: Margarine Consumption per capita, pounds; vertical axis: Divorce rate, per 1000]

3. Looking at the data set we see a very strong positive linear relation.

4. From our calculator we get y = 3.309 + .201x

5. From our calculator we get r = .993

6. a = 3.309, which means that if no margarine were consumed, the divorce rate would be expected to be 3.3 per 1000. b = .201, which tells us that for each additional pound of margarine consumed per capita, the divorce rate is expected to increase by .2 per 1000.

7. y = 3.309 + .201× 5.5 = 4.4145 or we expect the divorce rate to be 4.4 per 1000 marriages.

8. Comments: our interpretation of a has little or no reliability. The value x = 0 is very far from the rest of the data set. From the problem, we see we have a very strong relationship between the two variables. If we feel the divorce rate in Maine is too high, should we outlaw margarine? Of course not. These data came from tylervigen.com/spurious-correlations. The website is plugging a book about spurious correlations. If you look at a lot of different data sets, you will eventually find a pair that has a very high correlation coefficient, just by chance.
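To see where the calculator's r = .993 comes from, here is a small Python sketch that computes the sample correlation coefficient for the margarine data from its usual definition. The variable names are illustrative assumptions; a high value here says nothing about causation, which is the point of the example.

```python
import math

# Sample correlation coefficient for the data in Example 13.1.4.
# Variable names are hypothetical, chosen for illustration.
margarine = [8.2, 7.0, 6.5, 5.3, 5.2, 4.0, 4.6, 4.5, 4.2, 3.7]  # x, pounds
divorce = [5.0, 4.7, 4.6, 4.4, 4.3, 4.1, 4.2, 4.2, 4.2, 4.1]    # y, per 1000

n = len(margarine)
x_bar = sum(margarine) / n
y_bar = sum(divorce) / n

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(margarine, divorce))
sxx = sum((x - x_bar) ** 2 for x in margarine)
syy = sum((y - y_bar) ** 2 for y in divorce)

r = sxy / math.sqrt(sxx * syy)  # has the same sign as the slope b = sxy / sxx
print(round(r, 3), round(r ** 2, 3))
```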

13.1.4 Exercises

1. In Greenville, the city is investigating trash bin and recycle bin use by the residents. A sample of several homes is taken and the weekly amounts of trash and recycling are noted. The amounts of trash and recycling are given.

Recycling, gallons   43  38  53  61  44  50  49  51
Trash, gallons       32  42  18  13  45  32  40  19


(a) Construct a scatter plot with amount of recycling as the independent variable. Do the data exhibit a linear relationship?

(b) Find the equation of the regression line.

(c) Interpret the values of a and b. If no reasonable interpretation, give a reason why.

(d) A resident is selected and it is determined that they recycle 45 gallons per week. What is the expected amount of trash they generate?

(e) Find and interpret the correlation coefficient and the coefficient of determination.

2. At a party where alcohol is consumed the guests have a breathalyzer and are having fun with it. One partygoer arrives at the party, drinks several shots in succession, immediately starts a stopwatch, and has no more alcohol. At several times the partygoer blows into the breathalyzer and records the BAC (blood alcohol content) along with how long it has been since they took the drinks.

Time, minutes   24     35     61     77     95     123    152
BAC             0.113  0.105  0.096  0.093  0.086  0.078  0.071

(a) Construct a scatter plot with time as the independent variable. Do the data exhibit a linear relationship?

(b) Find the equation of the regression line.

(c) Interpret the values of a and b. If no reasonable interpretation, give a reason why.

(d) What is the expected BAC after 100 minutes?

(e) How long after the drinks were taken will the BAC be 0.04? Comment on this result.

(f) Find and interpret the correlation coefficient and the coefficient of determination.

3. A reservoir is fed by several rivers and streams. A hydrologist is measuring the total annual rainfall at a location and the water in the reservoir on May 1st for several years.

Rainfall, inches           26.5  28.9  35.7  44.6  29.7  33.7  36.8  40.8  22.5  39.8
Water Storage, acre-feet   7700  7890  8070  8240  7630  7940  8250  8160  7440  8200

(a) Construct a scatter plot with rainfall as the independent variable. Do the data exhibit a linear relationship?

(b) Find the equation of the regression line.

(c) Interpret the values of a and b. If no reasonable interpretation, give a reason why.

(d) What is the expected water in the reservoir for an annual rainfall of 40.0 inches?

(e) If the reservoir has 8000 acre-feet, what is your estimate of the rainfall that year?

(f) Find and interpret the correlation coefficient and the coefficient of determination.

4. In Bakersfield, a city in the southern California Central Valley, it gets hot in the summer. Very hot. An energy consumer is comparing their energy use with the high temperature for the day. The high temperatures and energy usage for several days in the summer are given.


Temperature, F      95    105   106   98    98    99    101   94    105   100
Energy Usage, kWh   19.3  22.5  32    19.9  22.1  21.3  16.0  28.5  29.5  23.5

(a) Construct a scatter plot for the data given, with Temperature as the independent variable. Do the data exhibit a linear relationship?

(b) Find the equation of the regression line.

(c) Interpret the values of a and b. If no reasonable interpretation, give a reason why.

(d) Predict the energy usage for the household if the high temperature outside is 99 degrees.

(e) What value does the model predict for a high temperature of 78 degrees? Comment on the result.

(f) Find the correlation coefficient and the coefficient of determination. What do these tell us?

5. The largest part of the cost of a gallon of gasoline is the cost of crude oil.

Cost of crude, $/barrel      29    41    65    101   95    42
Cost of gasoline, $/gallon   1.73  1.95  3.21  3.28  3.25  2.24

(a) Use the price of crude as the independent variable and construct the scatter plot. Do the data exhibit a linear relationship?

(b) Find the equation of the regression line.

(c) Interpret the values of a and b. If no reasonable interpretation, give a reason why.

(d) An investor expects the price of crude to be $98 per barrel. What is the expected price of gasoline?

(e) What if the price of crude jumps to $200 a barrel? What about the price of gasoline then? Comment on your answer.

(f) Find and interpret the correlation coefficient and the coefficient of determination.

6. The lengths and weights of several newborn babies born at Memorial Hospital are observed. The results are given.

Length, cm       56   55   55  56   54   57   54   59   59   54   56
Weight, ounces   123  110  96  132  105  108  103  140  137  121  119

(a) Construct a scatter plot with length as the independent variable. Do the data exhibit a linear relationship?

(b) Find the equation of the regression line.

(c) Interpret the values of a and b. If no reasonable interpretation, give a reason why.

(d) The records for a newborn state the length as 58 cm but the weight is missing. What is your best guess for the weight?


(e) A newborn is finally asleep. You weigh the baby and find the weight is 120 ounces. You know that to measure the length would surely wake up the baby. What is your best guess for the length?

(f) Find and interpret the correlation coefficient and the coefficient of determination.

7. The volunteers of the Pelagic Shark Research Foundation collected data on bat rays in the Elkhorn Slough. The total length and disk width of several specimens were measured. The data follow.

Total Length, cm   28.0  38.0  34.5  25.0  33.0  29.5  30.0  34.0
Disk Width, cm     40.0  50.5  47.0  29.5  44.0  41.0  42.0  47.0

(a) Explain why you expect b to be positive.

(b) Construct a scatter plot with total length as the independent variable. Do the data exhibit a linear relationship?

(c) Find the equation of the regression line.

(d) Interpret the values of a and b. If no reasonable interpretation, give a reason why.

(e) You have agreed to assist in the measurement of the bat rays. You capture a bat ray and find the total length to be 31.0 cm. What do you expect the disk width to be?

(f) You have captured a bat ray. It is flopping around on the beach and leaves an imprint of the disk on the beach before escaping back into the water. You measure the disk width from the imprint in the sand and find it to be 43.5 cm. Estimate the total length.

(g) Find and interpret the correlation coefficient and the coefficient of determination.

8. The Sacramento and San Joaquin drainage areas are part of the water storage areas of the California Department of Water Resources. The storage of the areas is recorded on June 30th for the years 2013-2019. The water storage, in 1000’s of acre-feet, is given below.

Year          2013     2014    2015    2016     2017     2018     2019
Sacramento    11348.1  8273.9  8268.3  13026.6  13930    12596.7  15204.9
San Joaquin   6524.4   4948.9  4084.4  6330.8   10570.2  9279.9   10302.2

(a) Explain why you expect b to be positive.

(b) Construct a scatter plot with Sacramento as the independent variable. Do the data exhibit a linear relationship?

(c) Find the equation of the regression line.

(d) Interpret the values of a and b. If no reasonable interpretation, give a reason why.

(e) You have found a value in the past that indicates the storage in the Sacramento area was 11,000,000 acre-feet. Use the equation to predict the water storage in the San Joaquin area.

(f) Find and interpret the correlation coefficient and the coefficient of determination.

9. Below are four different data sets. For each data set, find the equation of the regression line and the correlation coefficient, and construct the scatter plot. Comment on your findings.

Data Set I


x   10    8     13    9     11    14    6     4     12     7     5
y   8.04  6.95  7.58  8.81  8.33  9.96  7.24  4.26  10.84  4.82  5.68

Data Set II

x   10    8     13    9     11    14    6     4     12    7     5
y   9.14  8.14  8.74  8.77  9.26  8.1   6.13  3.1   9.13  7.26  4.74

Data Set III

x   10    8     13     9     11    14    6     4     12    7     5
y   7.46  6.77  12.74  7.11  7.81  8.84  6.08  5.39  8.15  6.42  5.73

Data Set IV

x   8     8     8     8     8     8     8     19    8     8     8
y   6.58  5.76  7.71  8.84  8.47  7.04  5.25  12.5  5.56  7.91  6.89


13.2 Hypothesis Tests and Confidence Intervals for B

In the first part of our regression discussion, we focused solely on the descriptive portion of regression. We now turn our attention to the inferential statistics portion of regression. Specifically, we will discuss hypothesis tests and intervals.

When we began our journey through descriptive statistics we started with point estimates. We didn’t refer to them as such until much later, but that is what we did. We then constructed confidence intervals and performed hypothesis tests on the associated parameters. We do the same in this chapter. We have calculated r, ŷ, and b. We will now perform inferences on their population counterparts: ρ, y, and B, respectively.

As with the previous part of the chapter, we will limit the use of formulas and allow our calculators to do the work.

13.2.1 Hypothesis Tests of H0 : B = 0 and H0 : ρ = 0

Recall that b and r (and also B and ρ) have the same sign. For hypothesis tests where the hypothesized value is 0 we can do the tests together.

We now need to bring back our initial assumptions about the population:

For Linear Regression Inferences we need

y = A + Bx + ε

where ε ∼ N(0, σε)

We will use the t-distribution with

df = n− 2

Notice the degrees of freedom are different for these tests than for the tests of µ. In tests of µ we had a one dimensional problem so df = n − 1. In the linear regression model, we are looking at a two dimensional problem. This is why df = n − 2.

Let’s jump right in with an example:

Example 13.2.1.

A random sample of 8 recently recruited soldiers is selected after finishing boot camp. Their heights and weights are in the table below. Test whether the two variables have a positive correlation. Use a 5% level of significance.

Height, inches   60   68   73   66   64   69   71   63
Weight, pounds   105  137  195  159  134  184  201  134

Solution.


We have already observed that the assumptions are reasonable. Note that the problem is asking if the variables have a ‘positive correlation’. This is the same as saying ρ > 0. Note that we expect this to be true. As height increases the weight increases as well.

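We have the following: H1: ρ > 0 and α = .05.

1. H0: ρ = 0
   H1: ρ > 0

2. Use t because y = A + Bx + ε where ε ∼ N(0, σε) and σε is unknown

3. [Figure: t-distribution with df = 6; the rejection region is the right tail with area .05, beyond the critical value t = 1.943]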

4. We will use our calculator here (see below). We get t = 5.746 with a p-value of 0.0006

5. Since the test statistic is in the rejection region (or equivalently the p-value is less than α), we conclude that a person’s height and weight are positively correlated.

Hypothesis test of B and ρ on the Calculator

Enter the data in a list

Select STAT>TESTS >LinRegTTest

Specify the lists for x and y.

Leave Freq at 1

Select the alternative hypothesis (only necessary if using p-value approach)

Highlight Calculate and hit ENTER.
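The test statistic reported by LinRegTTest in Example 13.2.1 can be cross-checked by hand. The Python sketch below is an illustration, not the calculator's actual routine. It assumes the standard simple-regression formulas s_e = √(SSE/(n − 2)) and s_b = s_e/√Σ(x − x̄)², which the text does not state, and compares t = b/s_b with the critical value 1.943 from the t-table. All variable names are hypothetical.

```python
import math

# Test statistic for H0: B = 0 using the boot camp data (Example 13.2.1).
# Assumes the standard formulas for s_e and s_b; names are hypothetical.
heights = [60, 68, 73, 66, 64, 69, 71, 63]          # x, inches
weights = [105, 137, 195, 159, 134, 184, 201, 134]  # y, pounds

n = len(heights)
x_bar = sum(heights) / n
y_bar = sum(weights) / n
sxx = sum((x - x_bar) ** 2 for x in heights)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(heights, weights))

b = sxy / sxx          # sample slope
a = y_bar - b * x_bar  # sample intercept

sse = sum((y - (a + b * x)) ** 2 for x, y in zip(heights, weights))
s_e = math.sqrt(sse / (n - 2))  # standard deviation of the errors, df = n - 2
s_b = s_e / math.sqrt(sxx)      # standard error of the slope

t = b / s_b                     # test statistic when B0 = 0
critical_value = 1.943          # t.05 with df = 6, from the t-table
reject_h0 = t > critical_value
print(round(b, 3), round(t, 3), reject_h0)
```

The p-value is left to the calculator here, since the Python standard library has no t-distribution function.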

Example 13.2.2.

Recall our backyard farmer who loves to grow tomatoes. Determine, using a 1% significance level, if increasing watering time increases tomato yield, on average.


Water times, minutes   4.8   5.2   6.1   6.8   7.5   7.3   7.1
Yield, pounds          50.2  51.8  56.4  66.5  60.1  73.2  66.1

Solution.

If we examine the scatterplot, the assumptions aren’t as obvious as they were in the scatterplot for the height and weight. We will, however, proceed.

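We have the following: H1: B > 0 and α = .01.

1. H0: B = 0
   H1: B > 0

2. Use t because y = A + Bx + ε where ε ∼ N(0, σε) and σε is unknown

3. [Figure: t-distribution with df = 5; the rejection region is the right tail with area .01, beyond the critical value t = 3.365]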

4. On our calculator we get t = 3.546 with a p-value of 0.0082

5. We conclude that as the time the plants are watered increases, the yield of tomatoes increases.

A comment: since the assumptions required to do the inference here were not clearly met, the results here are not so black and white. More data is required from the farmer. Also, if the conclusion is correct, then it is only valid in the range of values for watering times that are given.

13.2.2 Hypothesis Tests of B when B0 ≠ 0

In the hypothesis test of the heights and weights, it was obvious going into the problem that they should be positively correlated. (Of course, as good statistics students, we need proof.) What if we have an idea about what B might be? When you did your LinRegTTest, there was no place to enter a value of B. We will need to tweak the results of the test to get an appropriate test statistic. We now need to look at the test statistic we used before.

$$t = \frac{b - B_0}{s_b}$$

We didn’t give the formula before now. This is what the calculator uses to do the test. We need to change the value of B0. The value the calculator uses is 0, so what the calculator calculates is t = b/sb. We can solve for sb and we get sb = b/t. We can plug this into the formula above to get our test statistic. To avoid confusion, let tc be the value the calculator gives for the test statistic. So we get

$$t = \frac{b - B_0}{b/t_c} \quad\text{or}\quad t = t_c\left(1 - B_0/b\right)$$

where t is the test statistic with null hypothesis H0: B = B0 and tc is the test statistic from the LinRegTTest.

Finding the Test Statistic when B0 ≠ 0

Use the formula

t = tc(1 − B0/b)

where t is the test statistic with null hypothesis H0: B = B0 and tc is the test statistic from the LinRegTTest.

Example 13.2.3.

We have heard that for each additional inch, the weight should increase by 5 pounds. We feel that this is not correct for all of our recent recruits. Use a 5% level of significance to test if the expected increase in weight for an increase of 1 inch in height is different from 5 pounds.

Height, inches   60   68   73   66   64   69   71   63
Weight, pounds   105  137  195  159  134  184  201  134

Solution.

The description is exactly how we would describe b (or B). Since we want to make a statement about all recruits, we are actually describing B. We have the following: H1: B ≠ 5 and α = .05.

1. H0: B = 5
   H1: B ≠ 5

2. Use t because y = A + Bx + ε where ε ∼ N(0, σε) and σε is unknown

3. [Figure: t-distribution with df = 6; two-tailed rejection region with area .025 in each tail, beyond the critical values t = −2.447 and t = 2.447]


4. From the calculator we get tc = 5.746... and b = 7.294...

So we get t = tc(1 − B0/b) = 5.746(1 − 5/7.294) = 1.808

5. Since the test statistic is not in the rejection region, there is not sufficient evidence that for each additional inch of height, the weight increases, on average, by something other than 5 pounds.
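The adjustment from the calculator's statistic to the statistic for B0 ≠ 0 is a one-line computation. The following Python sketch, with a hypothetical function name `adjusted_t`, applies t = tc(1 − B0/b) to the rounded values from Example 13.2.3 and checks that it agrees with (b − B0)/sb when sb = b/tc.

```python
# Test statistic for H0: B = B0, built from the LinRegTTest output.
# adjusted_t is a hypothetical helper name used only for illustration.
def adjusted_t(tc, b, b0):
    """Return t = tc * (1 - b0 / b), where tc is the calculator's t for B0 = 0."""
    return tc * (1 - b0 / b)


tc = 5.746  # test statistic reported by LinRegTTest (rounded)
b = 7.294   # sample slope reported by LinRegTTest (rounded)
b0 = 5.0    # hypothesized slope

t = adjusted_t(tc, b, b0)

# Same statistic written as (b - B0) / s_b with s_b = b / tc.
s_b = b / tc
t_direct = (b - b0) / s_b

print(round(t, 3), round(t_direct, 3))
```

With the rounded inputs the result can differ from the text's 1.808 in the third decimal place; the calculator's unrounded values give the text's figure.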

13.2.3 Confidence Intervals of B

Some calculators will calculate these directly, others will not. We will proceed in this section as if you have the latter. The confidence interval formula for B should look reasonable. It is similar in structure to most of our confidence intervals.

$$b \pm t_{\alpha/2}\, s_b$$

Using sb = b/tc from before we get

$$b \pm t_{\alpha/2}\,\frac{b}{t_c}$$

A (1 − α) × 100% Confidence Interval for B is given by

$$b\left(1 \pm \frac{t_{\alpha/2}}{t_c}\right)$$

where tc is the test statistic from the LinRegTTest.

Finding a Confidence Interval for B

Use the formula

$$b\left(1 \pm \frac{t_{\alpha/2}}{t_c}\right)$$

where tα/2 is from the t-table and tc is the test statistic from the LinRegTTest.

Example 13.2.4.

Construct a 95% confidence interval for the increase in weight for each additional inch of height.

Height, inches   60   68   73   66   64   69   71   63
Weight, pounds   105  137  195  159  134  184  201  134

Solution.


With 6 degrees of freedom we get t.025 = 2.447. After running the LinRegTTest we get tc = 5.746... and b = 7.294...

Our confidence interval of B is given by

7.294(1± 2.447/5.746)

or

4.188 to 10.401

Finally, we are 95% confident that for each additional inch of height in the recruits, weight goes up by 4.2 to 10.4 pounds, on average.
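The interval above can be reproduced with a few lines of Python. This is a sketch only: `slope_interval` is a hypothetical helper, and the inputs are the rounded values quoted in the example.

```python
# Confidence interval for B from the LinRegTTest output: b * (1 ± t_crit / tc).
# slope_interval is a hypothetical helper name used only for illustration.
def slope_interval(b, tc, t_crit):
    """Return (lower, upper) for the interval b * (1 ± t_crit / tc)."""
    ends = (b * (1 - t_crit / tc), b * (1 + t_crit / tc))
    return min(ends), max(ends)  # ordering also works when b is negative


b = 7.294       # sample slope (rounded)
tc = 5.746      # LinRegTTest test statistic (rounded)
t_crit = 2.447  # t.025 with df = 6, from the t-table

lower, upper = slope_interval(b, tc, t_crit)
print(round(lower, 2), round(upper, 2))
```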

13.2.4 Exercises

1. In Greenville, the city is investigating trash bin and recycle bin use by the residents. A sample of several homes is taken and the weekly amounts of trash and recycling are noted. The amounts of trash and recycling are given.

Recycling, gallons   43  38  53  61  44  50  49  51
Trash, gallons       32  42  18  13  45  32  40  19

(a) Using a 5% level of significance, test if the amount recycled and amount of trash arenegatively correlated.

(b) Construct a 95% confidence interval for the rate of change of the amount of trash with respect to the amount recycled.

2. At a party where alcohol is consumed the guests have a breathalyzer and are having fun with it. One partygoer arrives at the party, drinks several shots in succession, immediately starts a stopwatch, and has no more alcohol. At several times the partygoer blows into the breathalyzer and records the BAC (blood alcohol content) along with how long it has been since they took the drinks.

Time, minutes   24     35     61     77     95     123    152
BAC             0.113  0.105  0.096  0.093  0.086  0.078  0.071

(a) A report indicates that the BAC reduces by .0002 per hour. Test using a 5% level ofsignificance if the report is incorrect.

(b) Construct a 95% confidence interval for the slope of the regression line.

3. A reservoir is fed by several rivers and streams. A hydrologist is measuring the total annual rainfall at a location and the water in the reservoir on May 1st for several years.

Rainfall, inches           26.5  28.9  35.7  44.6  29.7  33.7  36.8  40.8  22.5  39.8
Water Storage, acre-feet   7700  7890  8070  8240  7630  7940  8250  8160  7440  8200


(a) A neighbor of the hydrologist claims that ‘For each additional inch of rain, the reservoir increases water storage by 25 acre-feet.’ Test if the neighbor’s statement is false.

(b) Construct a 90% confidence interval for the slope of the regression line.

4. In Bakersfield, a city in the southern California Central Valley, it gets hot in the summer. Very hot. An energy consumer is comparing their energy use with the high temperature for the day. The high temperatures and energy usage for several days in the summer are given.

Temperature, F      95    105   106   98    98    99    101   94    105   100
Energy Usage, kWh   19.3  22.5  32    19.9  22.1  21.3  16.0  28.5  29.5  23.5

(a) The consumer has read that for each additional degree in the daily high, the energy usage goes up by 1 kWh. Test at the 5% level of significance if the statement is false.

(b) Construct a confidence interval for the true slope of the regression line.

5. The largest part of the cost of a gallon of gasoline is the cost of crude oil.

Cost of crude, $/barrel      29    41    65    101   95    42
Cost of gasoline, $/gallon   1.73  1.95  3.21  3.28  3.25  2.24

(a) Test if the price of crude and the price of gasoline are positively correlated. Use a 5% level of significance.

(b) Construct a 90% confidence interval for the slope of the regression line.

6. The lengths and weights of several newborn babies born at Memorial Hospital are observed. The results are given.

Length, cm       56   55   55  56   54   57   54   59   59   54   56
Weight, ounces   123  110  96  132  105  108  103  140  137  121  119

(a) Test at the 5% level of significance if the length and weight of babies are positively correlated.

(b) Construct a 95% confidence interval for the slope of the regression line.

7. The volunteers of the Pelagic Shark Research Foundation collected data on bat rays in the Elkhorn Slough. The total length and disk width of several specimens were measured. The data follow.

Total Length, cm   28.0  38.0  34.5  25.0  33.0  29.5  30.0  34.0
Disk Width, cm     40.0  50.5  47.0  29.5  44.0  41.0  42.0  47.0

(a) Test, at the 1% level of significance if the slope of the regression line is different from 1.

(b) Construct a 99% confidence interval for the slope of the regression line.


8. The Sacramento and San Joaquin drainage areas are part of the water storage areas of the California Department of Water Resources. The storage of the areas is recorded on June 30th for the years 2013-2019. The water storage, in 1000’s of acre-feet, is given below.

Year          2013     2014    2015    2016     2017     2018     2019
Sacramento    11348.1  8273.9  8268.3  13026.6  13930    12596.7  15204.9
San Joaquin   6524.4   4948.9  4084.4  6330.8   10570.2  9279.9   10302.2

(a) Test, at the 1% level of significance if the slope of the regression line is less than 1.

(b) Construct a 99% confidence interval for the slope of the regression line.

9. Below are the four data sets from the last section. For which data set(s) would it be appropriate to use the methods of this section? Explain.

Data Set I

x   10    8     13    9     11    14    6     4     12     7     5
y   8.04  6.95  7.58  8.81  8.33  9.96  7.24  4.26  10.84  4.82  5.68

Data Set II

x   10    8     13    9     11    14    6     4     12    7     5
y   9.14  8.14  8.74  8.77  9.26  8.1   6.13  3.1   9.13  7.26  4.74

Data Set III

x   10    8     13     9     11    14    6     4     12    7     5
y   7.46  6.77  12.74  7.11  7.81  8.84  6.08  5.39  8.15  6.42  5.73

Data Set IV

x   8     8     8     8     8     8     8     19    8     8     8
y   6.58  5.76  7.71  8.84  8.47  7.04  5.25  12.5  5.56  7.91  6.89


13.3 Prediction Intervals and Confidence Intervals for µy|x

When we started our regression discussion one of the things we did was predict values of y for a given value of x. In the tomato example, we found that when we watered a plant for 6.5 minutes, we expected a yield of 61.3 pounds. So was that an estimate of the yield of a particular plant if we watered it for 6.5 minutes or an estimate of the average yield if water is applied for 6.5 minutes? Answer: both. What we would like to do now is create confidence intervals for both of these. These intervals are called prediction intervals and confidence intervals of µy|x, respectively. We will look at the formulas for both.

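A (1 − α) × 100% Prediction Interval for a given x is given by

$$(a + bx) \pm t_{\alpha/2}\, s_e \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{(n-1)s_x^2}}$$

A (1 − α) × 100% Confidence Interval for µy|x is given by

$$(a + bx) \pm t_{\alpha/2}\, s_e \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{(n-1)s_x^2}}$$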

Notice that the formulas are almost the same. The only difference is the 1 under the radical of the prediction interval. The formulas share the following: a, b, x, and n. They also share se, the sample standard deviation of the errors, and x̄ and sx, the mean and standard deviation of the independent variable, x. Notice the numerator of the last term under the radical: (x − x̄)². For values of x near the center of the data set, this is small. As the value of x moves further out from the mean, the entire radicand gets bigger and hence we get a wider interval. This is what we expect: for values of x that are far away from the data set, we have little confidence in the estimate. (If we poured a lot of water on the tomato plants, we wouldn’t expect the predicted value from the model to be very reliable.)
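To make the comparison of the two formulas concrete, here is a Python sketch that evaluates both intervals for the tomato data at x = 6.5. It is an illustration under stated assumptions: the function name `regression_intervals` is hypothetical, se is computed as √(SSE/(n − 2)), (n − 1)sx² is computed as Σ(x − x̄)², and the critical value 2.571 (t.025 with df = 5) is taken from a t-table for a 95% interval.

```python
import math

# Prediction interval and confidence interval for mu_y|x at a given x0.
# regression_intervals is a hypothetical name used only for illustration.
def regression_intervals(xs, ys, x0, t_crit):
    """Return ((pi_low, pi_high), (ci_low, ci_high)) at x = x0."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)  # equals (n - 1) * s_x^2
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(sse / (n - 2))           # standard deviation of the errors

    point = a + b * x0                       # point estimate
    shared = 1 / n + (x0 - x_bar) ** 2 / sxx
    e_pi = t_crit * s_e * math.sqrt(1 + shared)  # margin for one new y
    e_ci = t_crit * s_e * math.sqrt(shared)      # margin for the mean of y
    return (point - e_pi, point + e_pi), (point - e_ci, point + e_ci)


water = [4.8, 5.2, 6.1, 6.8, 7.5, 7.3, 7.1]
yield_lb = [50.2, 51.8, 56.4, 66.5, 60.1, 73.2, 66.1]

pi, ci = regression_intervals(water, yield_lb, 6.5, 2.571)
print([round(v, 1) for v in pi], [round(v, 1) for v in ci])
```

Both intervals are centered at the same point estimate; the extra 1 under the radical is what makes the prediction interval the wider of the two.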

Let’s let n, the number of pairs of data values get large. Very large. In fact, so large that thestatistics are essentially equal to the corresponding parameters. We then get

(A+Bx)± zα/2σe and (A+Bx), respectively

In the first case, the prediction interval has been reduced to a simple interval, and in the second, the confidence interval has been reduced to a single number. The latter is what we expect to see. This was what we observed when we first started with confidence intervals.

As we can see, the formula is rather large for each case. As such we will enter a simple program on our calculator (TI-83 or 84) for each. We will call the prediction interval program PI and the confidence interval for µy|x will be called CI.

PRGM>NEW>Create New (hit the number 1 or [ENTER])

Type in PI (You should be in ALPHA Lock mode.) [ENTER]

PRGM>I/O>Disp (hit the number, 3, or highlight and hit [ENTER])

2nd>A-LOCK>“ENTER X” (The space is located with the 0 key, the quotes above the + key) [ENTER]

PRGM>I/O>Input (hit the number, 1, or highlight and hit ENTER)

X [ENTER]

PRGM>I/O>Disp

2nd>A-LOCK>“ENTER T” [ENTER]

PRGM>I/O>Input

T [ENTER]

a+b*X → P (See below) [ENTER]

T*s*√(1 + 1/n + (X − x̄)²/((n − 1)Sx²)) → E (See below) [ENTER]

PRGM>I/O>Disp P-E, P+E [ENTER] (P for point estimate, E for margin of error)

2nd QUIT

Your program is now ready to use.

For the above you need the following:

For X and T, use the ALPHA button and then find the letter.

a is located at VARS>Statistics...>EQ>a (A is not the same as a here)

b is located at VARS>Statistics...>EQ>b

s is located at VARS>Statistics...>TEST>s (Scroll down to find)

n is located at VARS>Statistics...>XY>n

x̄ is located at VARS>Statistics...>XY>x̄

Sx is located at VARS>Statistics...>XY>Sx

→ is the STO► key on the keyboard (STO► is just above the ON button.)

Now to write the program for our confidence interval for $\mu_{y|x}$. This assumes you have already entered the PI program.

PRGM>NEW>Create New

CI [ENTER]

2nd>RCL>PRGM>EXEC>PI [ENTER]

This has copied the PI program into CI. Use your arrow keys to find the ‘1+’. Highlight the 1 and hit your delete button twice.

2nd>QUIT

Your programs are now ready to use.

IN ORDER TO USE THE PROGRAM YOU NEED TO RUN THE LinRegTTest FIRST!

You can ignore the output. Let us proceed with an example which utilizes our programs.

Using Programs PI and CI to find prediction intervals and confidence intervals for $\mu_{y|x}$.

Look up $t_{\alpha/2}$ in the t-table

Enter data into lists

Run LinRegTTest (STAT > TESTS > LinRegTTest)

CLEAR (Not required)

PRGM>EXEC>CI (for a confidence interval for $\mu_{y|x}$, PI for a prediction interval)

Enter the values of X and $t_{\alpha/2}$ as prompted
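For readers working without a TI calculator, the steps above can be sketched in Python. This is a minimal sketch under stated assumptions, not the author's program: the function names `fit_line` and `interval` are hypothetical, the regression quantities that LinRegTTest would store (a, b, s) are recomputed from the data, and the t value is still looked up by hand and passed in.

```python
import math
from statistics import mean, stdev

def fit_line(xs, ys):
    """Least-squares intercept a, slope b, and standard deviation of the
    errors s_e (computed with n - 2 degrees of freedom)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, math.sqrt(sse / (n - 2))

def interval(xs, ys, x0, t, prediction):
    """Mirror of the PI program (prediction=True) or CI program (False)."""
    a, b, s_e = fit_line(xs, ys)
    n = len(xs)
    x_bar, s_x = mean(xs), stdev(xs)
    radicand = 1 / n + (x0 - x_bar) ** 2 / ((n - 1) * s_x ** 2)
    if prediction:
        radicand += 1              # the extra 1 under the radical
    p = a + b * x0                 # P: point estimate
    e = t * s_e * math.sqrt(radicand)  # E: margin of error
    return p - e, p + e
```

Calling `interval(xs, ys, x0, t, prediction=True)` plays the role of running PI, and `prediction=False` plays the role of running CI.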

Example 13.3.1.

Construct a 95% prediction interval for the yield and a 95% confidence interval for the mean yield when a tomato plant is watered for 6.5 minutes each day.

Solution.

To refresh our memories, here is the data.

| Water times, minutes | 4.8 | 5.2 | 6.1 | 6.8 | 7.5 | 7.3 | 7.1 |
|---|---|---|---|---|---|---|---|
| Yield, pounds | 50.2 | 51.8 | 56.4 | 66.5 | 60.1 | 73.2 | 66.1 |

Notice in this case we are given the value of x as 6.5. With n = 7 data pairs, we have n − 2 = 5 degrees of freedom. From our table we get $t_{.025} = 2.571$.

Input the data in L1 (times) and L2 (yields) and run the LinRegTTest. (You don’t need to look at the output, you can simply hit CLEAR.)

Now hit PRGM>EXEC>PI (Hit [ENTER] or simply hit the number.)

[ENTER] (This starts the program)

It should be asking you for X, enter 6.5 [ENTER]

Now it should want T, enter the value we looked up, 2.571 [ENTER]

It should give you an interval of 47.723... to 74.849...

So, we are 95% confident that the yield of a randomly selected tomato plant which has 6.5 minutes of water applied each day will be between 47.7 and 74.8 pounds.

To get the confidence interval, repeat the above procedure but select CI instead of PI. The corresponding interval is 56.469... to 66.103...

So, we are 95% confident that the average yield for tomato plants that have 6.5 minutes of water applied each day is between 56.5 and 66.1 pounds.
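The arithmetic behind the two calculator results can be laid out step by step. The following is a sketch in Python rather than the calculator program; the variable names are chosen here for illustration, and the data and t value are those of the example.

```python
import math

# Data from the tomato example
times = [4.8, 5.2, 6.1, 6.8, 7.5, 7.3, 7.1]          # water times, minutes
yields = [50.2, 51.8, 56.4, 66.5, 60.1, 73.2, 66.1]  # yields, pounds
x0, t = 6.5, 2.571        # given x and the table value t_.025 with 5 df

n = len(times)
x_bar = sum(times) / n
y_bar = sum(yields) / n
sxx = sum((x - x_bar) ** 2 for x in times)            # (n - 1) * s_x^2
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(times, yields))

b = sxy / sxx                                          # slope
a = y_bar - b * x_bar                                  # intercept
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(times, yields))
s_e = math.sqrt(sse / (n - 2))          # standard deviation of the errors

p = a + b * x0                                         # point estimate
spread = 1 / n + (x0 - x_bar) ** 2 / sxx
e_pi = t * s_e * math.sqrt(1 + spread)  # margin of error, prediction interval
e_ci = t * s_e * math.sqrt(spread)      # margin of error, confidence interval

print(f"point estimate: {p:.1f}")
print(f"prediction interval: {p - e_pi:.2f} to {p + e_pi:.2f}")
print(f"confidence interval: {p - e_ci:.2f} to {p + e_ci:.2f}")
```

The two margins differ only through the extra 1 under the radical, so the prediction interval printed by the script is the wider one, and both intervals should agree with the calculator output to the precision shown.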

Look at the two intervals. Notice that the prediction interval is wider. This is because the prediction interval includes the variability of the individual, whereas the confidence interval does not. Also note that the sentence for the second interval clearly addresses the mean, whereas the sentence for the first interval makes no mention of the mean.

Example 13.3.2.

Construct a 90% prediction interval for the weight and a 90% confidence interval for the mean weight of recruits that are 65 inches tall.

Solution.

Here is the data from before.

| Height, inches | 60 | 68 | 73 | 66 | 64 | 69 | 71 | 63 |
|---|---|---|---|---|---|---|---|---|
| Weight, pounds | 105 | 137 | 195 | 159 | 134 | 184 | 201 | 134 |

With n = 8 data pairs, we have n − 2 = 6 degrees of freedom and $t_{.05} = 1.943$.

Enter the data into your calculator and run LinRegTTest.

Next run the PI program. Enter 65 and 1.943 when prompted. We get 113.0 to 173.7 pounds. Do the same for the CI program and we get 132.5 to 154.3 pounds.

We are 90% confident that a randomly selected recruit that is 65 inches tall will weigh between 113.0 and 173.7 pounds.

We are 90% confident that the average weight of a recruit that is 65 inches tall is between 132.5 and 154.3 pounds.

Below is a graph of the scatterplot of the heights and weights of our recruits along with the regression line (the darker one in the middle), the prediction interval curves (top and bottom curves), and the confidence interval curves for $\mu_{y|x}$ (the remaining two curves). Also, there is a vertical line at x = 65, the value we looked at in the last example. The prediction interval can be found by looking at the y coordinates of the points where the line intersects the top and bottom curves. You should check that this is correct. Similarly, we can get the confidence interval for $\mu_{y|x}$, and even the point estimate by looking at the line in the middle.
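The curves in the graph can also be traced numerically by evaluating both intervals over a range of heights. This is a sketch under assumptions: the function name `bands` and the grid of heights are choices made here, and the same table value 1.943 (90%, 6 degrees of freedom) is used at every height.

```python
import math

heights = [60, 68, 73, 66, 64, 69, 71, 63]            # inches
weights = [105, 137, 195, 159, 134, 184, 201, 134]    # pounds
T_CRIT = 1.943                       # t_.05 with 6 df, from the example

def bands(x0):
    """Return (ci_low, ci_high, pi_low, pi_high) at height x0."""
    n = len(heights)
    x_bar = sum(heights) / n
    y_bar = sum(weights) / n
    sxx = sum((x - x_bar) ** 2 for x in heights)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(heights, weights))
    b = sxy / sxx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(heights, weights))
    s_e = math.sqrt(sse / (n - 2))
    p = a + b * x0                                     # regression line
    spread = 1 / n + (x0 - x_bar) ** 2 / sxx
    e_ci = T_CRIT * s_e * math.sqrt(spread)
    e_pi = T_CRIT * s_e * math.sqrt(1 + spread)
    return p - e_ci, p + e_ci, p - e_pi, p + e_pi

# One row per height: the inner pair is the confidence band,
# the outer pair is the prediction band.
for h in range(58, 76, 2):
    ci_lo, ci_hi, pi_lo, pi_hi = bands(h)
    print(f"{h}: CI {ci_lo:6.1f} to {ci_hi:6.1f}   PI {pi_lo:6.1f} to {pi_hi:6.1f}")
```

Evaluating `bands(65)` gives the two intervals of the example, and the rows for heights far from the mean show both bands flaring outward.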

[Figure: Height and Weight of Boot Camp Graduates. Horizontal axis: Height, inches (58 to 74). Vertical axis: Weight, pounds (100 to 220).]

13.3.1 Exercises

1. In Greenville, the city is investigating trash bin and recycle bin use by its residents. A sample of several homes is taken and the amounts of trash and recycling are noted each week. The amounts of trash and recycling are given.

| Recycling, gallons | 43 | 38 | 53 | 61 | 44 | 50 | 49 | 51 |
|---|---|---|---|---|---|---|---|---|
| Trash, gallons | 32 | 42 | 18 | 13 | 45 | 32 | 40 | 19 |

(a) Construct a 95% prediction interval for the amount of trash collected for a randomly selected household that recycles 45 gallons.

(b) Construct a 95% confidence interval for the average amount of trash collected for households that recycle 45 gallons.

2. At a party where alcohol is consumed, the guests have a breathalyzer and are having fun with it. One partygoer arrives at the party, drinks several shots in succession, immediately starts a stopwatch, and has no more alcohol. At several times the partygoer blows into the breathalyzer and records the BAC (blood alcohol content) along with how long it has been since they took the drinks.

| Time, minutes | 24 | 35 | 61 | 77 | 95 | 123 | 152 |
|---|---|---|---|---|---|---|---|
| BAC | 0.113 | 0.105 | 0.096 | 0.093 | 0.086 | 0.078 | 0.071 |

(a) Construct a 95% prediction interval for the BAC after 1 hour.

(b) Construct a 95% confidence interval for the average BAC after 1 hour.

3. A reservoir is fed by several rivers and streams. A hydrologist is measuring the total annual rainfall at a location and the water in the reservoir on May 1st for several years.

| Rainfall, inches | 26.5 | 28.9 | 35.7 | 44.6 | 29.7 | 33.7 | 36.8 | 40.8 | 22.5 | 39.8 |
|---|---|---|---|---|---|---|---|---|---|---|
| Water Storage, acre-feet | 7700 | 7890 | 8070 | 8240 | 7630 | 7940 | 8250 | 8160 | 7440 | 8200 |

(a) Construct a 95% prediction interval for the water storage in a year that saw 30 inches of rain.

(b) Construct a 95% confidence interval for the average water storage for years that see 30 inches of rain.

4. In Bakersfield, a city in the southern part of California's Central Valley, it gets hot in the summer. Very hot. An energy consumer is comparing their energy use with the high temperature for the day. The high temperatures and energy usage for several days in the summer are given.

| Temperature, °F | 95 | 105 | 106 | 98 | 98 | 99 | 101 | 94 | 105 | 100 |
|---|---|---|---|---|---|---|---|---|---|---|
| Energy Usage, kWh | 19.3 | 22.5 | 32 | 19.9 | 22.1 | 21.3 | 16.0 | 28.5 | 29.5 | 23.5 |

(a) If on a randomly selected day the high temperature is 101°F, make a 99% prediction interval for the energy usage for the day.

(b) Find a 90% confidence interval for the mean energy usage for days when the high temperature is 101°F.

5. The largest part of the cost of a gallon of gasoline is the cost of crude oil.

| Cost of crude, $/barrel | 29 | 41 | 65 | 101 | 95 | 42 |
|---|---|---|---|---|---|---|
| Cost of gasoline, $/gallon | 1.73 | 1.95 | 3.21 | 3.28 | 3.25 | 2.24 |

(a) You have just read that the cost of crude is expected to be $80 per barrel when you leave to go on vacation. Construct a 95% prediction interval for the cost of a gallon of gasoline.

(b) Construct the corresponding 95% confidence interval for the average price of gasoline when oil is $80 per barrel.

6. The lengths and weights of several newborn babies born at Memorial Hospital are observed. The results are given.

| Length, cm | 56 | 55 | 55 | 56 | 54 | 57 | 54 | 59 | 59 | 54 | 56 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Weight, ounces | 123 | 110 | 96 | 132 | 105 | 108 | 103 | 140 | 137 | 121 | 119 |

(a) If a baby that is 58 cm long is selected, what is the 90% prediction interval for the baby's weight?

(b) Of all babies that are 58 cm long, what is the 90% confidence interval for the mean weight?

7. The volunteers of the Pelagic Shark Research Foundation collected data on bat rays in the Elkhorn Slough. The total length and disk width of several specimens were measured. The data follow.

| Total Length, cm | 28 | 38 | 34.5 | 25.0 | 33.0 | 29.5 | 30.0 | 34.0 |
|---|---|---|---|---|---|---|---|---|
| Disk Width, cm | 40.0 | 50.5 | 47.0 | 29.5 | 44.0 | 41.0 | 42.0 | 47.0 |

(a) For a bat ray with a total length of 32.0 cm, find the 95% prediction interval for the disk width.

(b) Of all bat rays with a total length of 32.0 cm, find the 95% confidence interval for the mean disk width.

8. The Sacramento and San Joaquin drainage areas are part of the water storage areas of the California Department of Water Resources. The storage of each area is recorded on June 30th for the years 2013-2019. The water storage, in 1000's of acre-feet, is given below.

| Year | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
|---|---|---|---|---|---|---|---|
| Sacramento | 11348.1 | 8273.9 | 8268.3 | 13026.6 | 13930 | 12596.7 | 15204.9 |
| San Joaquin | 6524.4 | 4948.9 | 4084.4 | 6330.8 | 10570.2 | 9279.9 | 10302.2 |

(a) For a year in which the storage in the Sacramento drainage area is 10,000,000 acre-feet, find a 95% prediction interval for the storage in the San Joaquin drainage area.

(b) For all years in which the storage in the Sacramento drainage area is 10,000,000 acre-feet, find a 95% confidence interval for the mean storage in the San Joaquin drainage area.

Appendix A

Tables

Table A.1: Standard Normal Distribution Table

| z | 0.00 | 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.06 | 0.07 | 0.08 | 0.09 |
|---|---|---|---|---|---|---|---|---|---|---|
| -3.4 | .0003 | .0003 | .0003 | .0003 | .0003 | .0003 | .0003 | .0003 | .0003 | .0002 |
| -3.3 | .0005 | .0005 | .0005 | .0004 | .0004 | .0004 | .0004 | .0004 | .0004 | .0003 |
| -3.2 | .0007 | .0007 | .0006 | .0006 | .0006 | .0006 | .0006 | .0005 | .0005 | .0005 |
| -3.1 | .0010 | .0009 | .0009 | .0009 | .0008 | .0008 | .0008 | .0008 | .0007 | .0007 |
| -3.0 | .0013 | .0013 | .0013 | .0012 | .0012 | .0011 | .0011 | .0011 | .0010 | .0010 |
| -2.9 | .0019 | .0018 | .0018 | .0017 | .0016 | .0016 | .0015 | .0015 | .0014 | .0014 |
| -2.8 | .0026 | .0025 | .0024 | .0023 | .0023 | .0022 | .0021 | .0021 | .0020 | .0019 |
| -2.7 | .0035 | .0034 | .0033 | .0032 | .0031 | .0030 | .0029 | .0028 | .0027 | .0026 |
| -2.6 | .0047 | .0045 | .0044 | .0043 | .0041 | .0040 | .0039 | .0038 | .0037 | .0036 |
| -2.5 | .0062 | .0060 | .0059 | .0057 | .0055 | .0054 | .0052 | .0051 | .0049 | .0048 |
| -2.4 | .0082 | .0080 | .0078 | .0075 | .0073 | .0071 | .0069 | .0068 | .0066 | .0064 |
| -2.3 | .0107 | .0104 | .0102 | .0099 | .0096 | .0094 | .0091 | .0089 | .0087 | .0084 |
| -2.2 | .0139 | .0136 | .0132 | .0129 | .0125 | .0122 | .0119 | .0116 | .0113 | .0110 |
| -2.1 | .0179 | .0174 | .0170 | .0166 | .0162 | .0158 | .0154 | .0150 | .0146 | .0143 |
| -2.0 | .0228 | .0222 | .0217 | .0212 | .0207 | .0202 | .0197 | .0192 | .0188 | .0183 |
| -1.9 | .0287 | .0281 | .0274 | .0268 | .0262 | .0256 | .0250 | .0244 | .0239 | .0233 |
| -1.8 | .0359 | .0351 | .0344 | .0336 | .0329 | .0322 | .0314 | .0307 | .0301 | .0294 |
| -1.7 | .0446 | .0436 | .0427 | .0418 | .0409 | .0401 | .0392 | .0384 | .0375 | .0367 |
| -1.6 | .0548 | .0537 | .0526 | .0516 | .0505 | .0495 | .0485 | .0475 | .0465 | .0455 |
| -1.5 | .0668 | .0655 | .0643 | .0630 | .0618 | .0606 | .0594 | .0582 | .0571 | .0559 |
| -1.4 | .0808 | .0793 | .0778 | .0764 | .0749 | .0735 | .0721 | .0708 | .0694 | .0681 |
| -1.3 | .0968 | .0951 | .0934 | .0918 | .0901 | .0885 | .0869 | .0853 | .0838 | .0823 |
| -1.2 | .1151 | .1131 | .1112 | .1093 | .1075 | .1056 | .1038 | .1020 | .1003 | .0985 |
| -1.1 | .1357 | .1335 | .1314 | .1292 | .1271 | .1251 | .1230 | .1210 | .1190 | .1170 |
| -1.0 | .1587 | .1562 | .1539 | .1515 | .1492 | .1469 | .1446 | .1423 | .1401 | .1379 |
| -0.9 | .1841 | .1814 | .1788 | .1762 | .1736 | .1711 | .1685 | .1660 | .1635 | .1611 |
| -0.8 | .2119 | .2090 | .2061 | .2033 | .2005 | .1977 | .1949 | .1922 | .1894 | .1867 |
| -0.7 | .2420 | .2389 | .2358 | .2327 | .2296 | .2266 | .2236 | .2206 | .2177 | .2148 |
| -0.6 | .2743 | .2709 | .2676 | .2643 | .2611 | .2578 | .2546 | .2514 | .2483 | .2451 |
| -0.5 | .3085 | .3050 | .3015 | .2981 | .2946 | .2912 | .2877 | .2843 | .2810 | .2776 |
| -0.4 | .3446 | .3409 | .3372 | .3336 | .3300 | .3264 | .3228 | .3192 | .3156 | .3121 |
| -0.3 | .3821 | .3783 | .3745 | .3707 | .3669 | .3632 | .3594 | .3557 | .3520 | .3483 |
| -0.2 | .4207 | .4168 | .4129 | .4090 | .4052 | .4013 | .3974 | .3936 | .3897 | .3859 |
| -0.1 | .4602 | .4562 | .4522 | .4483 | .4443 | .4404 | .4364 | .4325 | .4286 | .4247 |
| 0.0 | .5000 | .4960 | .4920 | .4880 | .4840 | .4801 | .4761 | .4721 | .4681 | .4641 |

Table A.2: Standard Normal Distribution Table

| z | 0.00 | 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.06 | 0.07 | 0.08 | 0.09 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.0 | .5000 | .5040 | .5080 | .5120 | .5160 | .5199 | .5239 | .5279 | .5319 | .5359 |
| 0.1 | .5398 | .5438 | .5478 | .5517 | .5557 | .5596 | .5636 | .5675 | .5714 | .5753 |
| 0.2 | .5793 | .5832 | .5871 | .5910 | .5948 | .5987 | .6026 | .6064 | .6103 | .6141 |
| 0.3 | .6179 | .6217 | .6255 | .6293 | .6331 | .6368 | .6406 | .6443 | .6480 | .6517 |
| 0.4 | .6554 | .6591 | .6628 | .6664 | .6700 | .6736 | .6772 | .6808 | .6844 | .6879 |
| 0.5 | .6915 | .6950 | .6985 | .7019 | .7054 | .7088 | .7123 | .7157 | .7190 | .7224 |
| 0.6 | .7257 | .7291 | .7324 | .7357 | .7389 | .7422 | .7454 | .7486 | .7517 | .7549 |
| 0.7 | .7580 | .7611 | .7642 | .7673 | .7704 | .7734 | .7764 | .7794 | .7823 | .7852 |
| 0.8 | .7881 | .7910 | .7939 | .7967 | .7995 | .8023 | .8051 | .8078 | .8106 | .8133 |
| 0.9 | .8159 | .8186 | .8212 | .8238 | .8264 | .8289 | .8315 | .8340 | .8365 | .8389 |
| 1.0 | .8413 | .8438 | .8461 | .8485 | .8508 | .8531 | .8554 | .8577 | .8599 | .8621 |
| 1.1 | .8643 | .8665 | .8686 | .8708 | .8729 | .8749 | .8770 | .8790 | .8810 | .8830 |
| 1.2 | .8849 | .8869 | .8888 | .8907 | .8925 | .8944 | .8962 | .8980 | .8997 | .9015 |
| 1.3 | .9032 | .9049 | .9066 | .9082 | .9099 | .9115 | .9131 | .9147 | .9162 | .9177 |
| 1.4 | .9192 | .9207 | .9222 | .9236 | .9251 | .9265 | .9279 | .9292 | .9306 | .9319 |
| 1.5 | .9332 | .9345 | .9357 | .9370 | .9382 | .9394 | .9406 | .9418 | .9429 | .9441 |
| 1.6 | .9452 | .9463 | .9474 | .9484 | .9495 | .9505 | .9515 | .9525 | .9535 | .9545 |
| 1.7 | .9554 | .9564 | .9573 | .9582 | .9591 | .9599 | .9608 | .9616 | .9625 | .9633 |
| 1.8 | .9641 | .9649 | .9656 | .9664 | .9671 | .9678 | .9686 | .9693 | .9699 | .9706 |
| 1.9 | .9713 | .9719 | .9726 | .9732 | .9738 | .9744 | .9750 | .9756 | .9761 | .9767 |
| 2.0 | .9772 | .9778 | .9783 | .9788 | .9793 | .9798 | .9803 | .9808 | .9812 | .9817 |
| 2.1 | .9821 | .9826 | .9830 | .9834 | .9838 | .9842 | .9846 | .9850 | .9854 | .9857 |
| 2.2 | .9861 | .9864 | .9868 | .9871 | .9875 | .9878 | .9881 | .9884 | .9887 | .9890 |
| 2.3 | .9893 | .9896 | .9898 | .9901 | .9904 | .9906 | .9909 | .9911 | .9913 | .9916 |
| 2.4 | .9918 | .9920 | .9922 | .9925 | .9927 | .9929 | .9931 | .9932 | .9934 | .9936 |
| 2.5 | .9938 | .9940 | .9941 | .9943 | .9945 | .9946 | .9948 | .9949 | .9951 | .9952 |
| 2.6 | .9953 | .9955 | .9956 | .9957 | .9959 | .9960 | .9961 | .9962 | .9963 | .9964 |
| 2.7 | .9965 | .9966 | .9967 | .9968 | .9969 | .9970 | .9971 | .9972 | .9973 | .9974 |
| 2.8 | .9974 | .9975 | .9976 | .9977 | .9977 | .9978 | .9979 | .9979 | .9980 | .9981 |
| 2.9 | .9981 | .9982 | .9982 | .9983 | .9984 | .9984 | .9985 | .9985 | .9986 | .9986 |
| 3.0 | .9987 | .9987 | .9987 | .9988 | .9988 | .9989 | .9989 | .9989 | .9990 | .9990 |
| 3.1 | .9990 | .9991 | .9991 | .9991 | .9992 | .9992 | .9992 | .9992 | .9993 | .9993 |
| 3.2 | .9993 | .9993 | .9994 | .9994 | .9994 | .9994 | .9994 | .9995 | .9995 | .9995 |
| 3.3 | .9995 | .9995 | .9995 | .9996 | .9996 | .9996 | .9996 | .9996 | .9996 | .9997 |
| 3.4 | .9997 | .9997 | .9997 | .9997 | .9997 | .9997 | .9997 | .9997 | .9997 | .9998 |

Table A.3: t-distribution

Area in the right tail

| df | 0.100 | 0.050 | 0.025 | 0.010 | 0.005 | 0.001 |
|---|---|---|---|---|---|---|
| 1 | 3.078 | 6.314 | 12.706 | 31.821 | 63.656 | 318.289 |
| 2 | 1.886 | 2.920 | 4.303 | 6.965 | 9.925 | 22.328 |
| 3 | 1.638 | 2.353 | 3.182 | 4.541 | 5.841 | 10.214 |
| 4 | 1.533 | 2.132 | 2.776 | 3.747 | 4.604 | 7.173 |
| 5 | 1.476 | 2.015 | 2.571 | 3.365 | 4.032 | 5.894 |
| 6 | 1.440 | 1.943 | 2.447 | 3.143 | 3.707 | 5.208 |
| 7 | 1.415 | 1.895 | 2.365 | 2.998 | 3.499 | 4.785 |
| 8 | 1.397 | 1.860 | 2.306 | 2.896 | 3.355 | 4.501 |
| 9 | 1.383 | 1.833 | 2.262 | 2.821 | 3.250 | 4.297 |
| 10 | 1.372 | 1.812 | 2.228 | 2.764 | 3.169 | 4.144 |
| 11 | 1.363 | 1.796 | 2.201 | 2.718 | 3.106 | 4.025 |
| 12 | 1.356 | 1.782 | 2.179 | 2.681 | 3.055 | 3.930 |
| 13 | 1.350 | 1.771 | 2.160 | 2.650 | 3.012 | 3.852 |
| 14 | 1.345 | 1.761 | 2.145 | 2.624 | 2.977 | 3.787 |
| 15 | 1.341 | 1.753 | 2.131 | 2.602 | 2.947 | 3.733 |
| 16 | 1.337 | 1.746 | 2.120 | 2.583 | 2.921 | 3.686 |
| 17 | 1.333 | 1.740 | 2.110 | 2.567 | 2.898 | 3.646 |
| 18 | 1.330 | 1.734 | 2.101 | 2.552 | 2.878 | 3.610 |
| 19 | 1.328 | 1.729 | 2.093 | 2.539 | 2.861 | 3.579 |
| 20 | 1.325 | 1.725 | 2.086 | 2.528 | 2.845 | 3.552 |
| 21 | 1.323 | 1.721 | 2.080 | 2.518 | 2.831 | 3.527 |
| 22 | 1.321 | 1.717 | 2.074 | 2.508 | 2.819 | 3.505 |
| 23 | 1.319 | 1.714 | 2.069 | 2.500 | 2.807 | 3.485 |
| 24 | 1.318 | 1.711 | 2.064 | 2.492 | 2.797 | 3.467 |
| 25 | 1.316 | 1.708 | 2.060 | 2.485 | 2.787 | 3.450 |
| 26 | 1.315 | 1.706 | 2.056 | 2.479 | 2.779 | 3.435 |
| 27 | 1.314 | 1.703 | 2.052 | 2.473 | 2.771 | 3.421 |
| 28 | 1.313 | 1.701 | 2.048 | 2.467 | 2.763 | 3.408 |
| 29 | 1.311 | 1.699 | 2.045 | 2.462 | 2.756 | 3.396 |
| 30 | 1.310 | 1.697 | 2.042 | 2.457 | 2.750 | 3.385 |
| 31 | 1.309 | 1.696 | 2.040 | 2.453 | 2.744 | 3.375 |
| 32 | 1.309 | 1.694 | 2.037 | 2.449 | 2.738 | 3.365 |
| 33 | 1.308 | 1.692 | 2.035 | 2.445 | 2.733 | 3.356 |
| 34 | 1.307 | 1.691 | 2.032 | 2.441 | 2.728 | 3.348 |
| 35 | 1.306 | 1.690 | 2.030 | 2.438 | 2.724 | 3.340 |
| 36 | 1.306 | 1.688 | 2.028 | 2.434 | 2.719 | 3.333 |
| 37 | 1.305 | 1.687 | 2.026 | 2.431 | 2.715 | 3.326 |
| 38 | 1.304 | 1.686 | 2.024 | 2.429 | 2.712 | 3.319 |
| 39 | 1.304 | 1.685 | 2.023 | 2.426 | 2.708 | 3.313 |
| 40 | 1.303 | 1.684 | 2.021 | 2.423 | 2.704 | 3.307 |
| z | 1.282 | 1.645 | 1.960 | 2.326 | 2.576 | 3.090 |

Table A.4: t-distribution

Area in the right tail

| df | 0.100 | 0.050 | 0.025 | 0.010 | 0.005 | 0.001 |
|---|---|---|---|---|---|---|
| 41 | 1.303 | 1.683 | 2.020 | 2.421 | 2.701 | 3.301 |
| 42 | 1.302 | 1.682 | 2.018 | 2.418 | 2.698 | 3.296 |
| 43 | 1.302 | 1.681 | 2.017 | 2.416 | 2.695 | 3.291 |
| 44 | 1.301 | 1.680 | 2.015 | 2.414 | 2.692 | 3.286 |
| 45 | 1.301 | 1.679 | 2.014 | 2.412 | 2.690 | 3.281 |
| 46 | 1.300 | 1.679 | 2.013 | 2.410 | 2.687 | 3.277 |
| 47 | 1.300 | 1.678 | 2.012 | 2.408 | 2.685 | 3.273 |
| 48 | 1.299 | 1.677 | 2.011 | 2.407 | 2.682 | 3.269 |
| 49 | 1.299 | 1.677 | 2.010 | 2.405 | 2.680 | 3.265 |
| 50 | 1.299 | 1.676 | 2.009 | 2.403 | 2.678 | 3.261 |
| 51 | 1.298 | 1.675 | 2.008 | 2.402 | 2.676 | 3.258 |
| 52 | 1.298 | 1.675 | 2.007 | 2.400 | 2.674 | 3.255 |
| 53 | 1.298 | 1.674 | 2.006 | 2.399 | 2.672 | 3.251 |
| 54 | 1.297 | 1.674 | 2.005 | 2.397 | 2.670 | 3.248 |
| 55 | 1.297 | 1.673 | 2.004 | 2.396 | 2.668 | 3.245 |
| 56 | 1.297 | 1.673 | 2.003 | 2.395 | 2.667 | 3.242 |
| 57 | 1.297 | 1.672 | 2.002 | 2.394 | 2.665 | 3.239 |
| 58 | 1.296 | 1.672 | 2.002 | 2.392 | 2.663 | 3.237 |
| 59 | 1.296 | 1.671 | 2.001 | 2.391 | 2.662 | 3.234 |
| 60 | 1.296 | 1.671 | 2.000 | 2.390 | 2.660 | 3.232 |
| 61 | 1.296 | 1.670 | 2.000 | 2.389 | 2.659 | 3.229 |
| 62 | 1.295 | 1.670 | 1.999 | 2.388 | 2.657 | 3.227 |
| 63 | 1.295 | 1.669 | 1.998 | 2.387 | 2.656 | 3.225 |
| 64 | 1.295 | 1.669 | 1.998 | 2.386 | 2.655 | 3.223 |
| 65 | 1.295 | 1.669 | 1.997 | 2.385 | 2.654 | 3.220 |
| 66 | 1.295 | 1.668 | 1.997 | 2.384 | 2.652 | 3.218 |
| 67 | 1.294 | 1.668 | 1.996 | 2.383 | 2.651 | 3.216 |
| 68 | 1.294 | 1.668 | 1.995 | 2.382 | 2.650 | 3.214 |
| 69 | 1.294 | 1.667 | 1.995 | 2.382 | 2.649 | 3.213 |
| 70 | 1.294 | 1.667 | 1.994 | 2.381 | 2.648 | 3.211 |
| 71 | 1.294 | 1.667 | 1.994 | 2.380 | 2.647 | 3.209 |
| 72 | 1.293 | 1.666 | 1.993 | 2.379 | 2.646 | 3.207 |
| 73 | 1.293 | 1.666 | 1.993 | 2.379 | 2.645 | 3.206 |
| 74 | 1.293 | 1.666 | 1.993 | 2.378 | 2.644 | 3.204 |
| 75 | 1.293 | 1.665 | 1.992 | 2.377 | 2.643 | 3.202 |
| 76 | 1.293 | 1.665 | 1.992 | 2.376 | 2.642 | 3.201 |
| 77 | 1.293 | 1.665 | 1.991 | 2.376 | 2.641 | 3.199 |
| 78 | 1.292 | 1.665 | 1.991 | 2.375 | 2.640 | 3.198 |
| 79 | 1.292 | 1.664 | 1.990 | 2.374 | 2.639 | 3.197 |
| ∞ | 1.282 | 1.645 | 1.960 | 2.326 | 2.576 | 3.090 |

Table A.5: $\chi^2$ distribution

Area in the right tail

| df | 0.995 | 0.99 | 0.975 | 0.95 | 0.05 | 0.025 | 0.01 | 0.005 |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.000 | 0.000 | 0.001 | 0.004 | 3.841 | 5.024 | 6.635 | 7.879 |
| 2 | 0.010 | 0.020 | 0.051 | 0.103 | 5.991 | 7.378 | 9.210 | 10.597 |
| 3 | 0.072 | 0.115 | 0.216 | 0.352 | 7.815 | 9.348 | 11.345 | 12.838 |
| 4 | 0.207 | 0.297 | 0.484 | 0.711 | 9.488 | 11.143 | 13.277 | 14.860 |
| 5 | 0.412 | 0.554 | 0.831 | 1.145 | 11.070 | 12.832 | 15.086 | 16.750 |
| 6 | 0.676 | 0.872 | 1.237 | 1.635 | 12.592 | 14.449 | 16.812 | 18.548 |
| 7 | 0.989 | 1.239 | 1.690 | 2.167 | 14.067 | 16.013 | 18.475 | 20.278 |
| 8 | 1.344 | 1.647 | 2.180 | 2.733 | 15.507 | 17.535 | 20.090 | 21.955 |
| 9 | 1.735 | 2.088 | 2.700 | 3.325 | 16.919 | 19.023 | 21.666 | 23.589 |
| 10 | 2.156 | 2.558 | 3.247 | 3.940 | 18.307 | 20.483 | 23.209 | 25.188 |
| 11 | 2.603 | 3.053 | 3.816 | 4.575 | 19.675 | 21.920 | 24.725 | 26.757 |
| 12 | 3.074 | 3.571 | 4.404 | 5.226 | 21.026 | 23.337 | 26.217 | 28.300 |
| 13 | 3.565 | 4.107 | 5.009 | 5.892 | 22.362 | 24.736 | 27.688 | 29.819 |
| 14 | 4.075 | 4.660 | 5.629 | 6.571 | 23.685 | 26.119 | 29.141 | 31.319 |
| 15 | 4.601 | 5.229 | 6.262 | 7.261 | 24.996 | 27.488 | 30.578 | 32.801 |
| 16 | 5.142 | 5.812 | 6.908 | 7.962 | 26.296 | 28.845 | 32.000 | 34.267 |
| 17 | 5.697 | 6.408 | 7.564 | 8.672 | 27.587 | 30.191 | 33.409 | 35.718 |
| 18 | 6.265 | 7.015 | 8.231 | 9.390 | 28.869 | 31.526 | 34.805 | 37.156 |
| 19 | 6.844 | 7.633 | 8.907 | 10.117 | 30.144 | 32.852 | 36.191 | 38.582 |
| 20 | 7.434 | 8.260 | 9.591 | 10.851 | 31.410 | 34.170 | 37.566 | 39.997 |
| 21 | 8.034 | 8.897 | 10.283 | 11.591 | 32.671 | 35.479 | 38.932 | 41.401 |
| 22 | 8.643 | 9.542 | 10.982 | 12.338 | 33.924 | 36.781 | 40.289 | 42.796 |
| 23 | 9.260 | 10.196 | 11.689 | 13.091 | 35.172 | 38.076 | 41.638 | 44.181 |
| 24 | 9.886 | 10.856 | 12.401 | 13.848 | 36.415 | 39.364 | 42.980 | 45.558 |
| 25 | 10.520 | 11.524 | 13.120 | 14.611 | 37.652 | 40.646 | 44.314 | 46.928 |
| 26 | 11.160 | 12.198 | 13.844 | 15.379 | 38.885 | 41.923 | 45.642 | 48.290 |
| 27 | 11.808 | 12.878 | 14.573 | 16.151 | 40.113 | 43.195 | 46.963 | 49.645 |
| 28 | 12.461 | 13.565 | 15.308 | 16.928 | 41.337 | 44.461 | 48.278 | 50.994 |
| 29 | 13.121 | 14.256 | 16.047 | 17.708 | 42.557 | 45.722 | 49.588 | 52.335 |
| 30 | 13.787 | 14.953 | 16.791 | 18.493 | 43.773 | 46.979 | 50.892 | 53.672 |
| 31 | 14.458 | 15.655 | 17.539 | 19.281 | 44.985 | 48.232 | 52.191 | 55.002 |
| 32 | 15.134 | 16.362 | 18.291 | 20.072 | 46.194 | 49.480 | 53.486 | 56.328 |
| 33 | 15.815 | 17.073 | 19.047 | 20.867 | 47.400 | 50.725 | 54.775 | 57.648 |
| 34 | 16.501 | 17.789 | 19.806 | 21.664 | 48.602 | 51.966 | 56.061 | 58.964 |
| 35 | 17.192 | 18.509 | 20.569 | 22.465 | 49.802 | 53.203 | 57.342 | 60.275 |
| 36 | 17.887 | 19.233 | 21.336 | 23.269 | 50.998 | 54.437 | 58.619 | 61.581 |
| 37 | 18.586 | 19.960 | 22.106 | 24.075 | 52.192 | 55.668 | 59.893 | 62.883 |
| 38 | 19.289 | 20.691 | 22.878 | 24.884 | 53.384 | 56.895 | 61.162 | 64.181 |
| 39 | 19.996 | 21.426 | 23.654 | 25.695 | 54.572 | 58.120 | 62.428 | 65.475 |
| 40 | 20.707 | 22.164 | 24.433 | 26.509 | 55.758 | 59.342 | 63.691 | 66.766 |

Table A.6: F -Distribution Table

α = 1%

Columns: degrees of freedom for the numerator. Rows: degrees of freedom for the denominator.

| Denominator df | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 4052.185 | 4999.3 | 5403.5 | 5624.3 | 5764.0 | 5859.0 | 5928.3 | 5981.0 | 6022.4 |
| 2 | 98.502 | 99.000 | 99.164 | 99.251 | 99.302 | 99.331 | 99.357 | 99.375 | 99.390 |
| 3 | 34.116 | 30.816 | 29.457 | 28.710 | 28.237 | 27.911 | 27.671 | 27.489 | 27.345 |
| 4 | 21.198 | 18.000 | 16.694 | 15.977 | 15.522 | 15.207 | 14.976 | 14.799 | 14.659 |
| 5 | 16.258 | 13.274 | 12.060 | 11.392 | 10.967 | 10.672 | 10.456 | 10.289 | 10.158 |
| 6 | 13.745 | 10.925 | 9.780 | 9.148 | 8.746 | 8.466 | 8.260 | 8.102 | 7.976 |
| 7 | 12.246 | 9.547 | 8.451 | 7.847 | 7.460 | 7.191 | 6.993 | 6.840 | 6.719 |
| 8 | 11.259 | 8.649 | 7.591 | 7.006 | 6.632 | 6.371 | 6.178 | 6.029 | 5.911 |
| 9 | 10.562 | 8.022 | 6.992 | 6.422 | 6.057 | 5.802 | 5.613 | 5.467 | 5.351 |
| 10 | 10.044 | 7.559 | 6.552 | 5.994 | 5.636 | 5.386 | 5.200 | 5.057 | 4.942 |
| 11 | 9.646 | 7.206 | 6.217 | 5.668 | 5.316 | 5.069 | 4.886 | 4.744 | 4.632 |
| 12 | 9.330 | 6.927 | 5.953 | 5.412 | 5.064 | 4.821 | 4.640 | 4.499 | 4.388 |
| 13 | 9.074 | 6.701 | 5.739 | 5.205 | 4.862 | 4.620 | 4.441 | 4.302 | 4.191 |
| 14 | 8.862 | 6.515 | 5.564 | 5.035 | 4.695 | 4.456 | 4.278 | 4.140 | 4.030 |
| 15 | 8.683 | 6.359 | 5.417 | 4.893 | 4.556 | 4.318 | 4.142 | 4.004 | 3.895 |
| 16 | 8.531 | 6.226 | 5.292 | 4.773 | 4.437 | 4.202 | 4.026 | 3.890 | 3.780 |
| 17 | 8.400 | 6.112 | 5.185 | 4.669 | 4.336 | 4.101 | 3.927 | 3.791 | 3.682 |
| 18 | 8.285 | 6.013 | 5.092 | 4.579 | 4.248 | 4.015 | 3.841 | 3.705 | 3.597 |
| 19 | 8.185 | 5.926 | 5.010 | 4.500 | 4.171 | 3.939 | 3.765 | 3.631 | 3.523 |
| 20 | 8.096 | 5.849 | 4.938 | 4.431 | 4.103 | 3.871 | 3.699 | 3.564 | 3.457 |
| 21 | 8.017 | 5.780 | 4.874 | 4.369 | 4.042 | 3.812 | 3.640 | 3.506 | 3.398 |
| 22 | 7.945 | 5.719 | 4.817 | 4.313 | 3.988 | 3.758 | 3.587 | 3.453 | 3.346 |
| 23 | 7.881 | 5.664 | 4.765 | 4.264 | 3.939 | 3.710 | 3.539 | 3.406 | 3.299 |
| 24 | 7.823 | 5.614 | 4.718 | 4.218 | 3.895 | 3.667 | 3.496 | 3.363 | 3.256 |
| 25 | 7.770 | 5.568 | 4.675 | 4.177 | 3.855 | 3.627 | 3.457 | 3.324 | 3.217 |
| 26 | 7.721 | 5.526 | 4.637 | 4.140 | 3.818 | 3.591 | 3.421 | 3.288 | 3.182 |
| 27 | 7.677 | 5.488 | 4.601 | 4.106 | 3.785 | 3.558 | 3.388 | 3.256 | 3.149 |
| 28 | 7.636 | 5.453 | 4.568 | 4.074 | 3.754 | 3.528 | 3.358 | 3.226 | 3.120 |
| 29 | 7.598 | 5.420 | 4.538 | 4.045 | 3.725 | 3.499 | 3.330 | 3.198 | 3.092 |
| 30 | 7.562 | 5.390 | 4.510 | 4.018 | 3.699 | 3.473 | 3.305 | 3.173 | 3.067 |
| 31 | 7.530 | 5.362 | 4.484 | 3.993 | 3.675 | 3.449 | 3.281 | 3.149 | 3.043 |
| 32 | 7.499 | 5.336 | 4.459 | 3.969 | 3.652 | 3.427 | 3.258 | 3.127 | 3.021 |
| 33 | 7.471 | 5.312 | 4.437 | 3.948 | 3.630 | 3.406 | 3.238 | 3.106 | 3.000 |
| 34 | 7.444 | 5.289 | 4.416 | 3.927 | 3.611 | 3.386 | 3.218 | 3.087 | 2.981 |
| 35 | 7.419 | 5.268 | 4.396 | 3.908 | 3.592 | 3.368 | 3.200 | 3.069 | 2.963 |
| 36 | 7.396 | 5.248 | 4.377 | 3.890 | 3.574 | 3.351 | 3.183 | 3.052 | 2.946 |
| 37 | 7.374 | 5.229 | 4.360 | 3.873 | 3.558 | 3.334 | 3.167 | 3.036 | 2.930 |
| 38 | 7.353 | 5.211 | 4.343 | 3.858 | 3.542 | 3.319 | 3.152 | 3.021 | 2.915 |
| 39 | 7.333 | 5.194 | 4.327 | 3.843 | 3.528 | 3.305 | 3.137 | 3.006 | 2.901 |
| 40 | 7.314 | 5.178 | 4.313 | 3.828 | 3.514 | 3.291 | 3.124 | 2.993 | 2.888 |

Table A.7: F -Distribution Table

α = 1%

Columns: degrees of freedom for the numerator. Rows: degrees of freedom for the denominator.

| Denominator df | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 6055.9 | 6083.4 | 6106.7 | 6125.8 | 6143.0 | 6157.0 | 6170.0 | 6181.2 | 6191.4 |
| 2 | 99.397 | 99.408 | 99.419 | 99.422 | 99.426 | 99.433 | 99.437 | 99.441 | 99.444 |
| 3 | 27.228 | 27.132 | 27.052 | 26.983 | 26.924 | 26.872 | 26.826 | 26.786 | 26.751 |
| 4 | 14.546 | 14.452 | 14.374 | 14.306 | 14.249 | 14.198 | 14.154 | 14.114 | 14.079 |
| 5 | 10.051 | 9.963 | 9.888 | 9.825 | 9.770 | 9.722 | 9.680 | 9.643 | 9.609 |
| 6 | 7.874 | 7.790 | 7.718 | 7.657 | 7.605 | 7.559 | 7.519 | 7.483 | 7.451 |
| 7 | 6.620 | 6.538 | 6.469 | 6.410 | 6.359 | 6.314 | 6.275 | 6.240 | 6.209 |
| 8 | 5.814 | 5.734 | 5.667 | 5.609 | 5.559 | 5.515 | 5.477 | 5.442 | 5.412 |
| 9 | 5.257 | 5.178 | 5.111 | 5.055 | 5.005 | 4.962 | 4.924 | 4.890 | 4.860 |
| 10 | 4.849 | 4.772 | 4.706 | 4.650 | 4.601 | 4.558 | 4.520 | 4.487 | 4.457 |
| 11 | 4.539 | 4.462 | 4.397 | 4.342 | 4.293 | 4.251 | 4.213 | 4.180 | 4.150 |
| 12 | 4.296 | 4.220 | 4.155 | 4.100 | 4.052 | 4.010 | 3.972 | 3.939 | 3.910 |
| 13 | 4.100 | 4.025 | 3.960 | 3.905 | 3.857 | 3.815 | 3.778 | 3.745 | 3.716 |
| 14 | 3.939 | 3.864 | 3.800 | 3.745 | 3.698 | 3.656 | 3.619 | 3.586 | 3.556 |
| 15 | 3.805 | 3.730 | 3.666 | 3.612 | 3.564 | 3.522 | 3.485 | 3.452 | 3.423 |
| 16 | 3.691 | 3.616 | 3.553 | 3.498 | 3.451 | 3.409 | 3.372 | 3.339 | 3.310 |
| 17 | 3.593 | 3.518 | 3.455 | 3.401 | 3.353 | 3.312 | 3.275 | 3.242 | 3.212 |
| 18 | 3.508 | 3.434 | 3.371 | 3.316 | 3.269 | 3.227 | 3.190 | 3.158 | 3.128 |
| 19 | 3.434 | 3.360 | 3.297 | 3.242 | 3.195 | 3.153 | 3.116 | 3.084 | 3.054 |
| 20 | 3.368 | 3.294 | 3.231 | 3.177 | 3.130 | 3.088 | 3.051 | 3.018 | 2.989 |
| 21 | 3.310 | 3.236 | 3.173 | 3.119 | 3.072 | 3.030 | 2.993 | 2.960 | 2.931 |
| 22 | 3.258 | 3.184 | 3.121 | 3.067 | 3.019 | 2.978 | 2.941 | 2.908 | 2.879 |
| 23 | 3.211 | 3.137 | 3.074 | 3.020 | 2.973 | 2.931 | 2.894 | 2.861 | 2.832 |
| 24 | 3.168 | 3.094 | 3.032 | 2.977 | 2.930 | 2.889 | 2.852 | 2.819 | 2.789 |
| 25 | 3.129 | 3.056 | 2.993 | 2.939 | 2.892 | 2.850 | 2.813 | 2.780 | 2.751 |
| 26 | 3.094 | 3.021 | 2.958 | 2.904 | 2.857 | 2.815 | 2.778 | 2.745 | 2.715 |
| 27 | 3.062 | 2.988 | 2.926 | 2.872 | 2.824 | 2.783 | 2.746 | 2.713 | 2.683 |
| 28 | 3.032 | 2.959 | 2.896 | 2.842 | 2.795 | 2.753 | 2.716 | 2.683 | 2.653 |
| 29 | 3.005 | 2.931 | 2.868 | 2.814 | 2.767 | 2.726 | 2.689 | 2.656 | 2.626 |
| 30 | 2.979 | 2.906 | 2.843 | 2.789 | 2.742 | 2.700 | 2.663 | 2.630 | 2.600 |
| 31 | 2.955 | 2.882 | 2.820 | 2.765 | 2.718 | 2.677 | 2.640 | 2.606 | 2.577 |
| 32 | 2.934 | 2.860 | 2.798 | 2.744 | 2.696 | 2.655 | 2.618 | 2.584 | 2.555 |
| 33 | 2.913 | 2.840 | 2.777 | 2.723 | 2.676 | 2.634 | 2.597 | 2.564 | 2.534 |
| 34 | 2.894 | 2.821 | 2.758 | 2.704 | 2.657 | 2.615 | 2.578 | 2.545 | 2.515 |
| 35 | 2.876 | 2.803 | 2.740 | 2.686 | 2.639 | 2.597 | 2.560 | 2.527 | 2.497 |
| 36 | 2.859 | 2.786 | 2.723 | 2.669 | 2.622 | 2.580 | 2.543 | 2.510 | 2.480 |
| 37 | 2.843 | 2.770 | 2.707 | 2.653 | 2.606 | 2.564 | 2.527 | 2.494 | 2.464 |
| 38 | 2.828 | 2.755 | 2.692 | 2.638 | 2.591 | 2.549 | 2.512 | 2.479 | 2.449 |
| 39 | 2.814 | 2.741 | 2.678 | 2.624 | 2.577 | 2.535 | 2.498 | 2.465 | 2.434 |
| 40 | 2.801 | 2.727 | 2.665 | 2.611 | 2.563 | 2.522 | 2.484 | 2.451 | 2.421 |

Table A.8: F -Distribution Table

α = 2.5%

Columns: degrees of freedom for the numerator. Rows: degrees of freedom for the denominator.

| Denominator df | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 647.793 | 799.482 | 864.151 | 899.599 | 921.835 | 937.114 | 948.203 | 956.643 | 963.279 |
| 2 | 38.506 | 39.000 | 39.166 | 39.248 | 39.298 | 39.331 | 39.356 | 39.373 | 39.387 |
| 3 | 17.443 | 16.044 | 15.439 | 15.101 | 14.885 | 14.735 | 14.624 | 14.540 | 14.473 |
| 4 | 12.218 | 10.649 | 9.979 | 9.604 | 9.364 | 9.197 | 9.074 | 8.980 | 8.905 |
| 5 | 10.007 | 8.434 | 7.764 | 7.388 | 7.146 | 6.978 | 6.853 | 6.757 | 6.681 |
| 6 | 8.813 | 7.260 | 6.599 | 6.227 | 5.988 | 5.820 | 5.695 | 5.600 | 5.523 |
| 7 | 8.073 | 6.542 | 5.890 | 5.523 | 5.285 | 5.119 | 4.995 | 4.899 | 4.823 |
| 8 | 7.571 | 6.059 | 5.416 | 5.053 | 4.817 | 4.652 | 4.529 | 4.433 | 4.357 |
| 9 | 7.209 | 5.715 | 5.078 | 4.718 | 4.484 | 4.320 | 4.197 | 4.102 | 4.026 |
| 10 | 6.937 | 5.456 | 4.826 | 4.468 | 4.236 | 4.072 | 3.950 | 3.855 | 3.779 |
| 11 | 6.724 | 5.256 | 4.630 | 4.275 | 4.044 | 3.881 | 3.759 | 3.664 | 3.588 |
| 12 | 6.554 | 5.096 | 4.474 | 4.121 | 3.891 | 3.728 | 3.607 | 3.512 | 3.436 |
| 13 | 6.414 | 4.965 | 4.347 | 3.996 | 3.767 | 3.604 | 3.483 | 3.388 | 3.312 |
| 14 | 6.298 | 4.857 | 4.242 | 3.892 | 3.663 | 3.501 | 3.380 | 3.285 | 3.209 |
| 15 | 6.200 | 4.765 | 4.153 | 3.804 | 3.576 | 3.415 | 3.293 | 3.199 | 3.123 |
| 16 | 6.115 | 4.687 | 4.077 | 3.729 | 3.502 | 3.341 | 3.219 | 3.125 | 3.049 |
| 17 | 6.042 | 4.619 | 4.011 | 3.665 | 3.438 | 3.277 | 3.156 | 3.061 | 2.985 |
| 18 | 5.978 | 4.560 | 3.954 | 3.608 | 3.382 | 3.221 | 3.100 | 3.005 | 2.929 |
| 19 | 5.922 | 4.508 | 3.903 | 3.559 | 3.333 | 3.172 | 3.051 | 2.956 | 2.880 |
| 20 | 5.871 | 4.461 | 3.859 | 3.515 | 3.289 | 3.128 | 3.007 | 2.913 | 2.837 |
| 21 | 5.827 | 4.420 | 3.819 | 3.475 | 3.250 | 3.090 | 2.969 | 2.874 | 2.798 |
| 22 | 5.786 | 4.383 | 3.783 | 3.440 | 3.215 | 3.055 | 2.934 | 2.839 | 2.763 |
| 23 | 5.750 | 4.349 | 3.750 | 3.408 | 3.183 | 3.023 | 2.902 | 2.808 | 2.731 |
| 24 | 5.717 | 4.319 | 3.721 | 3.379 | 3.155 | 2.995 | 2.874 | 2.779 | 2.703 |
| 25 | 5.686 | 4.291 | 3.694 | 3.353 | 3.129 | 2.969 | 2.848 | 2.753 | 2.677 |
| 26 | 5.659 | 4.265 | 3.670 | 3.329 | 3.105 | 2.945 | 2.824 | 2.729 | 2.653 |
| 27 | 5.633 | 4.242 | 3.647 | 3.307 | 3.083 | 2.923 | 2.802 | 2.707 | 2.631 |
| 28 | 5.610 | 4.221 | 3.626 | 3.286 | 3.063 | 2.903 | 2.782 | 2.687 | 2.611 |
| 29 | 5.588 | 4.201 | 3.607 | 3.267 | 3.044 | 2.884 | 2.763 | 2.669 | 2.592 |
| 30 | 5.568 | 4.182 | 3.589 | 3.250 | 3.026 | 2.867 | 2.746 | 2.651 | 2.575 |
| 31 | 5.549 | 4.165 | 3.573 | 3.234 | 3.010 | 2.851 | 2.730 | 2.635 | 2.558 |
| 32 | 5.531 | 4.149 | 3.557 | 3.218 | 2.995 | 2.836 | 2.715 | 2.620 | 2.543 |
| 33 | 5.515 | 4.134 | 3.543 | 3.204 | 2.981 | 2.822 | 2.701 | 2.606 | 2.529 |
| 34 | 5.499 | 4.120 | 3.529 | 3.191 | 2.968 | 2.808 | 2.688 | 2.593 | 2.516 |
| 35 | 5.485 | 4.106 | 3.517 | 3.179 | 2.956 | 2.796 | 2.676 | 2.581 | 2.504 |
| 36 | 5.471 | 4.094 | 3.505 | 3.167 | 2.944 | 2.785 | 2.664 | 2.569 | 2.492 |
| 37 | 5.458 | 4.082 | 3.493 | 3.156 | 2.933 | 2.774 | 2.653 | 2.558 | 2.481 |
| 38 | 5.446 | 4.071 | 3.483 | 3.145 | 2.923 | 2.763 | 2.643 | 2.548 | 2.471 |
| 39 | 5.435 | 4.061 | 3.473 | 3.135 | 2.913 | 2.754 | 2.633 | 2.538 | 2.461 |
| 40 | 5.424 | 4.051 | 3.463 | 3.126 | 2.904 | 2.744 | 2.624 | 2.529 | 2.452 |

Table A.9: F -Distribution Table

α = 2.5%

Columns: degrees of freedom for the numerator. Rows: degrees of freedom for the denominator.

| Denominator df | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 968.634 | 973.028 | 976.725 | 979.839 | 982.545 | 984.874 | 986.911 | 988.715 | 990.345 |
| 2 | 39.398 | 39.407 | 39.415 | 39.421 | 39.427 | 39.431 | 39.436 | 39.439 | 39.442 |
| 3 | 14.419 | 14.374 | 14.337 | 14.305 | 14.277 | 14.253 | 14.232 | 14.213 | 14.196 |
| 4 | 8.844 | 8.794 | 8.751 | 8.715 | 8.684 | 8.657 | 8.633 | 8.611 | 8.592 |
| 5 | 6.619 | 6.568 | 6.525 | 6.488 | 6.456 | 6.428 | 6.403 | 6.381 | 6.362 |
| 6 | 5.461 | 5.410 | 5.366 | 5.329 | 5.297 | 5.269 | 5.244 | 5.222 | 5.202 |
| 7 | 4.761 | 4.709 | 4.666 | 4.628 | 4.596 | 4.568 | 4.543 | 4.521 | 4.501 |
| 8 | 4.295 | 4.243 | 4.200 | 4.162 | 4.130 | 4.101 | 4.076 | 4.054 | 4.034 |
| 9 | 3.964 | 3.912 | 3.868 | 3.831 | 3.798 | 3.769 | 3.744 | 3.722 | 3.701 |
| 10 | 3.717 | 3.665 | 3.621 | 3.583 | 3.550 | 3.522 | 3.496 | 3.474 | 3.453 |
| 11 | 3.526 | 3.474 | 3.430 | 3.392 | 3.359 | 3.330 | 3.304 | 3.282 | 3.261 |
| 12 | 3.374 | 3.321 | 3.277 | 3.239 | 3.206 | 3.177 | 3.152 | 3.129 | 3.108 |
| 13 | 3.250 | 3.197 | 3.153 | 3.115 | 3.082 | 3.053 | 3.027 | 3.004 | 2.983 |
| 14 | 3.147 | 3.095 | 3.050 | 3.012 | 2.979 | 2.949 | 2.923 | 2.900 | 2.879 |
| 15 | 3.060 | 3.008 | 2.963 | 2.925 | 2.891 | 2.862 | 2.836 | 2.813 | 2.792 |
| 16 | 2.986 | 2.934 | 2.889 | 2.851 | 2.817 | 2.788 | 2.761 | 2.738 | 2.717 |
| 17 | 2.922 | 2.870 | 2.825 | 2.786 | 2.753 | 2.723 | 2.697 | 2.673 | 2.652 |
| 18 | 2.866 | 2.814 | 2.769 | 2.730 | 2.696 | 2.667 | 2.640 | 2.617 | 2.596 |
| 19 | 2.817 | 2.765 | 2.720 | 2.681 | 2.647 | 2.617 | 2.591 | 2.567 | 2.546 |
| 20 | 2.774 | 2.721 | 2.676 | 2.637 | 2.603 | 2.573 | 2.547 | 2.523 | 2.501 |
| 21 | 2.735 | 2.682 | 2.637 | 2.598 | 2.564 | 2.534 | 2.507 | 2.483 | 2.462 |
| 22 | 2.700 | 2.647 | 2.602 | 2.563 | 2.528 | 2.498 | 2.472 | 2.448 | 2.426 |
| 23 | 2.668 | 2.615 | 2.570 | 2.531 | 2.497 | 2.466 | 2.440 | 2.416 | 2.394 |
| 24 | 2.640 | 2.586 | 2.541 | 2.502 | 2.468 | 2.437 | 2.411 | 2.386 | 2.365 |
| 25 | 2.613 | 2.560 | 2.515 | 2.476 | 2.441 | 2.411 | 2.384 | 2.360 | 2.338 |
| 26 | 2.590 | 2.536 | 2.491 | 2.452 | 2.417 | 2.387 | 2.360 | 2.335 | 2.314 |
| 27 | 2.568 | 2.514 | 2.469 | 2.429 | 2.395 | 2.364 | 2.337 | 2.313 | 2.291 |
| 28 | 2.547 | 2.494 | 2.448 | 2.409 | 2.374 | 2.344 | 2.317 | 2.292 | 2.270 |
| 29 | 2.529 | 2.475 | 2.430 | 2.390 | 2.355 | 2.325 | 2.298 | 2.273 | 2.251 |
| 30 | 2.511 | 2.458 | 2.412 | 2.372 | 2.338 | 2.307 | 2.280 | 2.255 | 2.233 |
| 31 | 2.495 | 2.442 | 2.396 | 2.356 | 2.321 | 2.291 | 2.263 | 2.239 | 2.217 |
| 32 | 2.480 | 2.426 | 2.381 | 2.341 | 2.306 | 2.275 | 2.248 | 2.223 | 2.201 |
| 33 | 2.466 | 2.412 | 2.366 | 2.327 | 2.292 | 2.261 | 2.234 | 2.209 | 2.187 |
| 34 | 2.453 | 2.399 | 2.353 | 2.313 | 2.278 | 2.248 | 2.220 | 2.195 | 2.173 |
| 35 | 2.440 | 2.387 | 2.341 | 2.301 | 2.266 | 2.235 | 2.207 | 2.183 | 2.160 |
| 36 | 2.429 | 2.375 | 2.329 | 2.289 | 2.254 | 2.223 | 2.196 | 2.171 | 2.148 |
| 37 | 2.418 | 2.364 | 2.318 | 2.278 | 2.243 | 2.212 | 2.184 | 2.160 | 2.137 |
| 38 | 2.407 | 2.353 | 2.307 | 2.267 | 2.232 | 2.201 | 2.174 | 2.149 | 2.126 |
| 39 | 2.397 | 2.344 | 2.298 | 2.257 | 2.222 | 2.191 | 2.164 | 2.139 | 2.116 |
| 40 | 2.388 | 2.334 | 2.288 | 2.248 | 2.213 | 2.182 | 2.154 | 2.129 | 2.107 |


Table A.10: F-Distribution Table

α = 5% (right-tail area; table entries are the critical values F)

Columns: Degrees of Freedom for the Numerator
Rows: Degrees of Freedom for the Denominator

df 1 2 3 4 5 6 7 8 9
1 161.446 199.499 215.707 224.583 230.160 233.988 236.767 238.884 240.543
2 18.513 19.000 19.164 19.247 19.296 19.329 19.353 19.371 19.385
3 10.128 9.552 9.277 9.117 9.013 8.941 8.887 8.845 8.812
4 7.709 6.944 6.591 6.388 6.256 6.163 6.094 6.041 5.999
5 6.608 5.786 5.409 5.192 5.050 4.950 4.876 4.818 4.772
6 5.987 5.143 4.757 4.534 4.387 4.284 4.207 4.147 4.099
7 5.591 4.737 4.347 4.120 3.972 3.866 3.787 3.726 3.677
8 5.318 4.459 4.066 3.838 3.688 3.581 3.500 3.438 3.388
9 5.117 4.256 3.863 3.633 3.482 3.374 3.293 3.230 3.179
10 4.965 4.103 3.708 3.478 3.326 3.217 3.135 3.072 3.020
11 4.844 3.982 3.587 3.357 3.204 3.095 3.012 2.948 2.896
12 4.747 3.885 3.490 3.259 3.106 2.996 2.913 2.849 2.796
13 4.667 3.806 3.411 3.179 3.025 2.915 2.832 2.767 2.714
14 4.600 3.739 3.344 3.112 2.958 2.848 2.764 2.699 2.646
15 4.543 3.682 3.287 3.056 2.901 2.790 2.707 2.641 2.588
16 4.494 3.634 3.239 3.007 2.852 2.741 2.657 2.591 2.538
17 4.451 3.592 3.197 2.965 2.810 2.699 2.614 2.548 2.494
18 4.414 3.555 3.160 2.928 2.773 2.661 2.577 2.510 2.456
19 4.381 3.522 3.127 2.895 2.740 2.628 2.544 2.477 2.423
20 4.351 3.493 3.098 2.866 2.711 2.599 2.514 2.447 2.393
21 4.325 3.467 3.072 2.840 2.685 2.573 2.488 2.420 2.366
22 4.301 3.443 3.049 2.817 2.661 2.549 2.464 2.397 2.342
23 4.279 3.422 3.028 2.796 2.640 2.528 2.442 2.375 2.320
24 4.260 3.403 3.009 2.776 2.621 2.508 2.423 2.355 2.300
25 4.242 3.385 2.991 2.759 2.603 2.490 2.405 2.337 2.282
26 4.225 3.369 2.975 2.743 2.587 2.474 2.388 2.321 2.265
27 4.210 3.354 2.960 2.728 2.572 2.459 2.373 2.305 2.250
28 4.196 3.340 2.947 2.714 2.558 2.445 2.359 2.291 2.236
29 4.183 3.328 2.934 2.701 2.545 2.432 2.346 2.278 2.223
30 4.171 3.316 2.922 2.690 2.534 2.421 2.334 2.266 2.211
31 4.160 3.305 2.911 2.679 2.523 2.409 2.323 2.255 2.199
32 4.149 3.295 2.901 2.668 2.512 2.399 2.313 2.244 2.189
33 4.139 3.285 2.892 2.659 2.503 2.389 2.303 2.235 2.179
34 4.130 3.276 2.883 2.650 2.494 2.380 2.294 2.225 2.170
35 4.121 3.267 2.874 2.641 2.485 2.372 2.285 2.217 2.161
36 4.113 3.259 2.866 2.634 2.477 2.364 2.277 2.209 2.153
37 4.105 3.252 2.859 2.626 2.470 2.356 2.270 2.201 2.145
38 4.098 3.245 2.852 2.619 2.463 2.349 2.262 2.194 2.138
39 4.091 3.238 2.845 2.612 2.456 2.342 2.255 2.187 2.131
40 4.085 3.232 2.839 2.606 2.449 2.336 2.249 2.180 2.124


Table A.11: F-Distribution Table

α = 5% (right-tail area; table entries are the critical values F)

Columns: Degrees of Freedom for the Numerator
Rows: Degrees of Freedom for the Denominator

df 10 11 12 13 14 15 16 17 18
1 241.882 242.981 243.905 244.690 245.363 245.949 246.466 246.917 247.324
2 19.396 19.405 19.412 19.419 19.424 19.429 19.433 19.437 19.440
3 8.785 8.763 8.745 8.729 8.715 8.703 8.692 8.683 8.675
4 5.964 5.936 5.912 5.891 5.873 5.858 5.844 5.832 5.821
5 4.735 4.704 4.678 4.655 4.636 4.619 4.604 4.590 4.579
6 4.060 4.027 4.000 3.976 3.956 3.938 3.922 3.908 3.896
7 3.637 3.603 3.575 3.550 3.529 3.511 3.494 3.480 3.467
8 3.347 3.313 3.284 3.259 3.237 3.218 3.202 3.187 3.173
9 3.137 3.102 3.073 3.048 3.025 3.006 2.989 2.974 2.960
10 2.978 2.943 2.913 2.887 2.865 2.845 2.828 2.812 2.798
11 2.854 2.818 2.788 2.761 2.739 2.719 2.701 2.685 2.671
12 2.753 2.717 2.687 2.660 2.637 2.617 2.599 2.583 2.568
13 2.671 2.635 2.604 2.577 2.554 2.533 2.515 2.499 2.484
14 2.602 2.565 2.534 2.507 2.484 2.463 2.445 2.428 2.413
15 2.544 2.507 2.475 2.448 2.424 2.403 2.385 2.368 2.353
16 2.494 2.456 2.425 2.397 2.373 2.352 2.333 2.317 2.302
17 2.450 2.413 2.381 2.353 2.329 2.308 2.289 2.272 2.257
18 2.412 2.374 2.342 2.314 2.290 2.269 2.250 2.233 2.217
19 2.378 2.340 2.308 2.280 2.256 2.234 2.215 2.198 2.182
20 2.348 2.310 2.278 2.250 2.225 2.203 2.184 2.167 2.151
21 2.321 2.283 2.250 2.222 2.197 2.176 2.156 2.139 2.123
22 2.297 2.259 2.226 2.198 2.173 2.151 2.131 2.114 2.098
23 2.275 2.236 2.204 2.175 2.150 2.128 2.109 2.091 2.075
24 2.255 2.216 2.183 2.155 2.130 2.108 2.088 2.070 2.054
25 2.236 2.198 2.165 2.136 2.111 2.089 2.069 2.051 2.035
26 2.220 2.181 2.148 2.119 2.094 2.072 2.052 2.034 2.018
27 2.204 2.166 2.132 2.103 2.078 2.056 2.036 2.018 2.002
28 2.190 2.151 2.118 2.089 2.064 2.041 2.021 2.003 1.987
29 2.177 2.138 2.104 2.075 2.050 2.027 2.007 1.989 1.973
30 2.165 2.126 2.092 2.063 2.037 2.015 1.995 1.976 1.960
31 2.153 2.114 2.080 2.051 2.026 2.003 1.983 1.965 1.948
32 2.142 2.103 2.070 2.040 2.015 1.992 1.972 1.953 1.937
33 2.133 2.093 2.060 2.030 2.004 1.982 1.961 1.943 1.926
34 2.123 2.084 2.050 2.021 1.995 1.972 1.952 1.933 1.917
35 2.114 2.075 2.041 2.012 1.986 1.963 1.942 1.924 1.907
36 2.106 2.067 2.033 2.003 1.977 1.954 1.934 1.915 1.899
37 2.098 2.059 2.025 1.995 1.969 1.946 1.926 1.907 1.890
38 2.091 2.051 2.017 1.988 1.962 1.939 1.918 1.899 1.883
39 2.084 2.044 2.010 1.981 1.954 1.931 1.911 1.892 1.875
40 2.077 2.038 2.003 1.974 1.948 1.924 1.904 1.885 1.868
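
For readers who want to check an entry or need degrees of freedom outside the printed range, the critical values in the α = 5% F tables can be reproduced numerically. The following is a minimal sketch, not the method used to build these tables: it assumes the standard identity that the F cumulative distribution equals a regularized incomplete beta function, evaluates that function with a continued fraction, and finds the critical value by bisection. All function names (`f_cdf`, `f_critical`, and the helpers) are hypothetical and chosen only for illustration.

```python
import math


def _beta_continued_fraction(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300  # guards against division by zero
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    if abs(d) < tiny:
        d = tiny
    d = 1.0 / d
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even-numbered term of the continued fraction.
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        if abs(d) < tiny:
            d = tiny
        c = 1.0 + aa / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        h *= d * c
        # Odd-numbered term of the continued fraction.
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        if abs(d) < tiny:
            d = tiny
        c = 1.0 + aa / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def regularized_incomplete_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b) for 0 <= x <= 1."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log1p(-x))
    front = math.exp(log_front)
    # Use the symmetry I_x(a, b) = 1 - I_{1-x}(b, a) where it converges faster.
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_continued_fraction(a, b, x) / a
    return 1.0 - front * _beta_continued_fraction(b, a, 1.0 - x) / b


def f_cdf(x, df_num, df_den):
    """P(F <= x) for an F distribution with the given degrees of freedom."""
    if x <= 0.0:
        return 0.0
    z = df_num * x / (df_num * x + df_den)
    return regularized_incomplete_beta(df_num / 2.0, df_den / 2.0, z)


def f_critical(alpha, df_num, df_den):
    """Value F with right-tail area alpha, found by bisection on the CDF."""
    target = 1.0 - alpha
    lo, hi = 0.0, 1.0
    while f_cdf(hi, df_num, df_den) < target:  # expand until the root is bracketed
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if f_cdf(mid, df_num, df_den) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Compare with the table entry for numerator df = 1, denominator df = 10.
print(round(f_critical(0.05, 1, 10), 3))
```

Calling `f_critical(0.05, df_num, df_den)` with other degrees of freedom should agree with the printed entries up to rounding in the third decimal place. Bisection is slower than a dedicated quantile routine, but it needs only the CDF and is easy to verify by hand.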