measures of dispersion - sjcd.qualitycampus.comsjcd.qualitycampus.com/guides/com_000_01599.pdf ·...

1 Online Student Guide OpusWorks 2019, All Rights Reserved Measures of Dispersion

Post on 11-Oct-2020


Online Student Guide

OpusWorks 2019, All Rights Reserved

Measures of Dispersion


Table of Contents

LEARNING OBJECTIVES

INTRODUCTION
  CHARACTERISTICS OF DATA
  CENTRAL TENDENCY
  DISPERSION
  MEASURES OF DISPERSION
  RANGE
  RANGE & DOT PLOT

VARIANCE & STANDARD DEVIATION
  FORMULAS FOR VARIANCE & STANDARD DEVIATION
  VARIANCE
  CALCULATING THE SAMPLE VARIANCE
  VARIANCE & STANDARD DEVIATION
  SHORTCUT
  CYCLE TIME DATA

DOT PLOT

CHEBYSHEV'S RULE

EMPIRICAL RULE

Z SCORE
  Z SCORE FORMULA

PERCENTILE RANKING
  PERCENTILES & QUARTILES

BOX PLOTS
  COMPARISON OF CYCLE TIMES
  FIVE NUMBER SUMMARY

© 2019 by OpusWorks. All rights reserved. August, 2019 Terms of Use This guide can only be used by those with a paid license to the corresponding course in the e-Learning curriculum produced and distributed by OpusWorks. No part of this Student Guide may be altered, reproduced, stored, or transmitted in any form by any means without the prior written permission of OpusWorks. Trademarks All terms mentioned in this guide that are known to be trademarks or service marks have been appropriately capitalized. Comments Please address any questions or comments to your distributor or to OpusWorks at [email protected].


Learning Objectives

Upon completion of this course, students will be able to:
• Calculate measures of dispersion such as range, variance, and standard deviation
• Know how a change in dispersion will affect the shape of the histogram
• Know how a transformation made to the original data affects the standard deviation
• Explain how to estimate the percentage of measurements within a specified interval of the mean
• Calculate the Z score for a stated measurement

Introduction

Characteristics of Data

So far, we have studied two important characteristics of data: shape and central tendency. Now it is time to learn about another very important characteristic of data: dispersion. Let’s see how each of these characteristics relates to the data.

Let’s begin with shape. The frequency with which the data occur in each class interval determines the shape of the histogram. Skewed distributions are easily recognized because they lack symmetry. A histogram in which all the intervals have approximately the same frequency of measurements is characteristic of a uniform distribution. In many instances, data can be represented by a symmetrical bell-shaped distribution like the one shown here.

Central Tendency

The location of the center of the data, relative to a target value, can have a profound effect on the proportion of measurements that are within or outside a given interval. Three measures of central tendency (that is, the tendency of the data to cluster or center about a specific value) are the mean, median, and mode. As you have already learned, because of its statistical properties, the mean is the statistic most often used as the measure of central tendency or center of the data.


Dispersion

Just knowing the shape and location of the data is not enough to completely describe the data. As you can see here, both histograms are bell shaped with the same mean. The histogram representing Process A is fairly narrow with all its measurements very close to the mean. The histogram representing Process B has more spread. There is more dispersion or variability of the measurements from the mean than with Process A.

Measures of Dispersion

In this module, we’ll work with three measures of dispersion used to describe variability: range, variance, and standard deviation.

Range

The range is the easiest measurement of dispersion to calculate. It is the distance between the largest and smallest measurements. To calculate the range of the data, subtract the minimum value from the maximum value. The formula for calculating the range is the same for both samples and populations. The symbol "R" is used to represent the range.


Range & Dot Plot

Both of these sets of sample data have the same range "6" and the same mean "5." However, the distribution of scores around the mean for each set of data is quite different, as illustrated by the Dot Plot for each data set. By using only two data points to calculate the range, the dispersion of the remaining data points around the mean is ignored. This is a disadvantage of using the range as a measure of dispersion.
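The point can be checked with a short sketch. The two data sets below are hypothetical (the guide's own sets appear only in its Dot Plot figure); they were picked simply because both have a range of 6 and a mean of 5 while being spread differently around the mean.

```python
def data_range(values):
    """Range R = largest measurement minus smallest measurement."""
    return max(values) - min(values)


# Hypothetical sample data, not the sets from the guide's figure.
set_a = [2, 5, 5, 5, 8]   # most points sit on the mean
set_b = [2, 2, 5, 8, 8]   # most points sit at the extremes

for name, data in (("A", set_a), ("B", set_b)):
    mean = sum(data) / len(data)
    print(f"Set {name}: R = {data_range(data)}, mean = {mean}")
```

Both sets report the same R and the same mean even though their Dot Plots would look quite different, which is the limitation described above.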

Variance & Standard Deviation

As a measure of dispersion, the range shows the breadth of the data and is easy to calculate. Although it provides information about the dispersion of the data, its usefulness is somewhat limited because it uses only two of the data points. Other measures of variability use all of the data available. Two of the most common are the standard deviation and the variance. We will learn about these together because, if you calculate one value, you know the other: the standard deviation is the positive square root of the variance.


Formulas for Variance & Standard Deviation

To understand the concepts of variance and standard deviation better, we need only examine their formulas. While they may seem complicated at first, you will soon see that they are easier to use than you think. And besides, the calculations are usually done using a computer or calculator. Your task will be to interpret and apply the measures.

Variance

The variance measures the average squared difference between each data point and the mean. There are two formulas for variance. The first one is the population formula which is represented by a lowercase Greek letter sigma squared. It is used when you have the entire population or census of data. The other formula is represented by a lowercase “s” squared and is used when you only have a sample of all available data.

It is important to note that these formulas are slightly different from each other. The denominator of the population formula just has “n,” while the denominator of the sample formula is “n-1.” The “n-1” forces the sample formula to provide an unbiased estimate of the true population variance when working with data that have been sampled from a population.
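The formula images are not reproduced in this transcript. The standard forms that match the description above are sketched below. The symbols x_i (a data point), mu (population mean), and x-bar (sample mean) are conventional notation assumed here, not taken from the guide.

```latex
\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}
\qquad
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
\qquad
\sigma = \sqrt{\sigma^2}, \quad s = \sqrt{s^2}
```

The only difference between the two variance formulas is the mean used in the numerator and the denominator: "n" for a population, "n-1" for a sample.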

The mathematical proof for this concept is beyond the scope of this presentation. However, since we are usually working with a sample of data, we will work with the sample formula. Let’s look at an example of how to calculate a sample variance.

Calculating the Sample Variance

Consider the following sample data. First, we need to calculate the mean so we can compute the deviation from the mean for each data point. The mean is 4. For the first data point, the deviation from the mean is negative 3. Similarly, we compute the deviation from the mean for the remaining data points. To find the sum of the squared deviations from the mean, square each deviation from the mean, then compute the sum of these values.


The sum is 104. This is the value of the numerator in our formula. To complete the calculation of the variance, divide by the denominator of the formula, "n - 1," which is 4 for these five data points. The value is 26. This is the variance of the data. Using the variance, we compute the value of the standard deviation by taking the positive square root of 26. The value is approximately 5.10.
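The steps above can be traced in a few lines of Python. The guide's data table is not reproduced in this transcript, so the five values below are a hypothetical stand-in, chosen only because they match the stated mean of 4, the first deviation of negative 3, and the sum of squared deviations of 104.

```python
import math

# Hypothetical stand-in for the guide's sample (the original table is not shown).
data = [1, 1, 2, 3, 13]

n = len(data)
mean = sum(data) / n                              # 4.0
deviations = [x - mean for x in data]             # first deviation is -3.0
sum_sq_dev = sum(d ** 2 for d in deviations)      # numerator: 104.0
variance = sum_sq_dev / (n - 1)                   # sample variance: 26.0
std_dev = math.sqrt(variance)                     # positive square root

print(f"mean = {mean}, variance = {variance}, s = {std_dev:.2f}")
```

Any other sample with the same mean and the same sum of squared deviations would give the same variance and standard deviation.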

Variance & Standard Deviation

Most of the time, the calculation of the variance and standard deviation for a data set will be performed with the aid of a computer. However, a great deal can be learned by performing the calculation by hand, as we have shown here.

Note that some of these deviations from the mean are positive and some are negative. The positive deviations correspond to data values that are larger than the mean, while the negative deviations correspond to data values that are smaller than the mean. There wasn’t any reason to compute the sum of the deviations from the mean, because we know this sum is always 0. That is a property of the mean.

Squaring the deviations from the mean made the squared deviations non-negative. By adding up the squared deviations from the mean, each data point that is not equal to the mean will contribute to the sum in the numerator. Because the variance is squared, it is not in the same unit of measurement as the data. By taking the positive square root, the standard deviation will be expressed in the same unit of measurement as the data. For this reason, the standard deviation becomes one of the most useful measurements of variability.


Shortcut

There is an equivalent formula often used to compute the variance when using a calculator. This shortcut, or calculation formula, is a programmed function in most calculators. While it may look more complex, it is actually easier to use because it requires you to compute only three values, no matter how many measurements are in the data set. These three values are:
• "n," the number of data points
• The sum of the individual data points
• The sum of the squared values of each data point

Let’s look at our previous example to see how easy this calculation is. The number of data points in the sample is 5. Therefore, small "n" is 5. The sum of the measurements is 20. Just add up the 5 numbers. The sum of the squared measurements is 184. Square each data point and add them up. Substitute the three computed values into the formula. The variance is 26. The standard deviation "s" is the square root of 26, which is approximately 5.10. These are the same values previously computed.
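The shortcut formula itself is not reproduced in this transcript. A common computational form that uses exactly these three values is s² = (Σx² − (Σx)²/n) / (n − 1). The sketch below assumes that form and plugs in the totals stated above; the function name is illustrative.

```python
import math


def shortcut_variance(n, sum_x, sum_x_sq):
    """Sample variance from n, the sum of x, and the sum of x squared.

    Assumes the usual computational form:
    s^2 = (sum_x_sq - sum_x**2 / n) / (n - 1)
    """
    return (sum_x_sq - sum_x ** 2 / n) / (n - 1)


# Totals stated in the guide: n = 5, sum = 20, sum of squares = 184.
variance = shortcut_variance(5, 20, 184)   # (184 - 80) / 4
std_dev = math.sqrt(variance)

print(f"variance = {variance}, s = {std_dev:.2f}")
```

The intermediate value 184 − 80 = 104 is the same sum of squared deviations found with the definitional formula.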


Cycle Time Data

Remember this process cycle time data? This data set was large: 52 elements. It would be quite a task to compute the standard deviation for this data by hand. Fortunately, using a computer, the analysis is quick and easy. It’s the interpretation that is more difficult.

Using just the mean and the standard deviation of the data, we can estimate how the data might be spread out from the mean, regardless of the shape of the data. For example, it has been proven that, no matter what the shape of the data, at least 75% of the data lies within two standard deviations of the mean, and at least 88.9%, or almost 90%, lies within three standard deviations of the mean. Let’s see if these statements hold true for this data.

First, we need to compute the boundaries of the interval that contains all the values within two standard deviations of the mean. For this data, the mean is 31.808 and two standard deviations equal 10.666, so the lower boundary is 31.808 minus 10.666, which is 21.142. The upper boundary is 31.808 plus 10.666, which is 42.474.


Dot Plot

Using a Dot Plot, it is fairly easy to calculate the percentage of the data that lies within a given interval. Since the interval must contain a high percentage of the data, let’s identify the points that are not in the interval. There are two. Since there were 52 data points, 50 are inside the interval. 50 divided by 52 is approximately 96% of the data. Similarly, we can see that 100% of the data will lie within three standard deviations of the mean.
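The interval arithmetic for the cycle time data can be sketched as follows. The sketch assumes that 10.666 is two standard deviations (so s is 5.333), as the interval description implies, and it uses only the counts stated in the guide, since the 52 individual measurements are not reproduced here.

```python
mean = 31.808
two_sd = 10.666            # stated half-width of the two-standard-deviation interval
s = two_sd / 2             # assumed standard deviation: 5.333


def interval(center, sd, k):
    """Boundaries of the interval within k standard deviations of the center."""
    return center - k * sd, center + k * sd


low2, high2 = interval(mean, s, 2)   # about 21.142 to 42.474
low3, high3 = interval(mean, s, 3)   # about 15.809 to 47.807

total_points = 52
outside_two_sd = 2                   # counted from the Dot Plot in the guide
share_inside = (total_points - outside_two_sd) / total_points

print(f"2 sd: {low2:.3f} to {high2:.3f}")
print(f"3 sd: {low3:.3f} to {high3:.3f}")
print(f"inside 2 sd: {share_inside:.1%}")
```

The observed share of about 96% is comfortably above the 75% minimum that holds for data of any shape.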

Chebyshev's Rule

The rule just applied to the previous data is known as Chebyshev’s Rule. Chebyshev’s Rule applies to all data sets regardless of shape.

It works with both populations and samples to predict a lower boundary for the proportion of measurements that will lie within a specified number of standard deviations of the mean. In general, the rule says that regardless of shape, at least one minus one over k-squared of the measurements will lie in the interval between the mean "minus k-standard deviations" and the mean "plus k-standard deviations," as long as "k" is greater than or equal to one.


Let’s examine this in more detail for specific values of "k."

This table applies Chebyshev’s Rule for values of k equal to two and three. Of course, you could include other entries for k, like 1.4, 2.5, 3.2, etc. However, "k" must be greater than or equal to 1.
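The table is not reproduced in this transcript, but its entries follow directly from the rule. A minimal sketch that evaluates the bound 1 − 1/k² for a few values of k (the function name is illustrative):

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of measurements within k standard deviations of the mean."""
    if k < 1:
        raise ValueError("k must be greater than or equal to 1")
    return 1 - 1 / k ** 2


for k in (2, 2.5, 3):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.1%} of the measurements")
```

For k equal to two the bound is 75%, and for k equal to three it is 1 − 1/9, or about 88.9%. At k equal to one the bound is 0, so the rule only becomes informative for k above one.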

Empirical Rule

While Chebyshev’s Rule will work on all curves, the estimate is a minimum. With distributions like the bell-shaped curve shown here, the Empirical Rule will give you a better estimate of the proportion of measurements that are within a specified interval around the mean. The Empirical Rule requires a special type of curve, called a normal curve, which you will learn more about in another lesson. The Empirical Rule states that:
• Approximately 68% of the data lies within one standard deviation of the mean
• Approximately 95% lies within two standard deviations of the mean
• Almost all the data, 99.7%, lies within three standard deviations of the mean

To gain a better understanding of the Empirical Rule, let’s apply it to an example. A set of data has a bell-shaped distribution, shown here, with a mean of 20 and a standard deviation of 4. Approximately how much of the data will lie between 12 and 28? To answer this question, we need to know the number of standard deviation units the values of 12 and 28 are from the mean.


We can use a simple algebraic process to find the answer.
Step 1: Use the two values given, 12 and 28. Find their deviations from the mean.
Step 2: Since both deviations form a symmetrical interval about the mean, express the deviations in terms of standard deviations.
Step 3: Interpret the Empirical Rule relative to the number of standard deviation units computed in Step 2.

In this example, 12 and 28 define an interval of two standard deviations on each side of the mean. Approximately 95% of the measurements will be found in this interval.
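The three steps can be sketched in Python. The lookup table holds the Empirical Rule percentages; the function name and its error handling are illustrative choices, and the sketch only covers symmetric intervals of exactly one, two, or three standard deviations.

```python
# Empirical Rule: approximate share of bell-shaped data within k standard deviations.
EMPIRICAL_RULE = {1: 0.68, 2: 0.95, 3: 0.997}


def empirical_share(low, high, mean, sd):
    """Approximate share of the data between low and high for bell-shaped data."""
    # Steps 1 and 2: deviations from the mean, expressed in standard deviations.
    k_low = (mean - low) / sd
    k_high = (high - mean) / sd
    # Step 3: the rule applies only to a symmetric interval of 1, 2, or 3 sd.
    if k_low != k_high or k_high not in EMPIRICAL_RULE:
        raise ValueError("interval is not a symmetric 1, 2, or 3 sd interval")
    return EMPIRICAL_RULE[k_high]


share = empirical_share(12, 28, mean=20, sd=4)
print(f"About {share:.0%} of the data lies between 12 and 28")
```

Both 12 and 28 are 8 units from the mean, and 8 divided by 4 is two standard deviations, which is why the 95% entry is selected.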

Z Score

In solving the last example, we intuitively computed a value known as a Z-score. The Z-score expresses the distance between a measurement and the mean in terms of standard deviations. It is a measure of relative standing in the data set. It shows the extent that a particular value differs from the mean.

Data values that are smaller than the mean will have negative Z-scores, and data values larger than the mean will have positive Z-scores. By formula, the Z-score for the mean of a set of data is always zero. Z-scores may be computed for any data set and will provide a relative ranking of the data.


Z Score Formula

We can also use the Z-score formula to calculate the value of a specific observation with a given Z-score. For example, suppose a doctor indicated an individual’s cholesterol level had a Z-score of +1.5. This Z-score provides useful information but not the actual reading. However, if the mean of the distribution was 180 and the standard deviation was 50, then the actual cholesterol level could be computed.

Starting with the transformation formula for a Z-score, we use some elementary algebra to solve for X. Substitute the known values into the derived equation to solve for X. The cholesterol level is 255.
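The formula images are not shown in this transcript. The standard Z-score formula is Z = (X − mean) / standard deviation, and solving it for X gives X = mean + Z × standard deviation. A minimal sketch of both directions, with illustrative function names:

```python
def z_score(x, mean, sd):
    """Distance between x and the mean, in standard deviations."""
    return (x - mean) / sd


def value_from_z(z, mean, sd):
    """Solve the Z-score formula for X: X = mean + z * sd."""
    return mean + z * sd


# Cholesterol example: Z = +1.5, mean = 180, standard deviation = 50.
level = value_from_z(1.5, mean=180, sd=50)
print(f"cholesterol level = {level}")            # 180 + 1.5 * 50
print(f"check: Z = {z_score(level, 180, 50)}")   # recovers the original Z-score
```

Converting the result back with `z_score` is a quick way to confirm the algebra.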

Percentile Ranking

Up to this point, we have identified the percentage of data that lies within specific intervals on each side of the mean in a bell-shaped distribution. Now we will illustrate the concept of percentile ranking. In any distribution, the median is the 50th percentile. For a bell-shaped (normal) distribution like the one shown here, the mean is equal to the median, so the mean is also the 50th percentile. Using the Empirical Rule, the following statements can be made about this distribution:
• The 16th percentile has a Z-score of -1.
• The 84th percentile has a Z-score of +1.
• 84% of the measurements are less than the value that is one standard deviation above the mean.


Similarly, other percentile values can be found. You will learn how to find a certain percentile for any normal distribution.
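As a preview, the 16th and 84th percentile statements can be checked against an exact normal curve. This sketch uses `NormalDist` from Python's standard `statistics` module (available in Python 3.8 and later); the guide itself does not use Python, so treat this as one possible way to look up the values.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1, so values are Z-scores.
std_normal = NormalDist(mu=0, sigma=1)

share_below_minus_1 = std_normal.cdf(-1)   # share of data below Z = -1
share_below_plus_1 = std_normal.cdf(1)     # share of data below Z = +1
z_at_84th = std_normal.inv_cdf(0.84)       # Z-score of the 84th percentile

print(f"below Z = -1: {share_below_minus_1:.1%}")
print(f"below Z = +1: {share_below_plus_1:.1%}")
print(f"Z at 84th percentile: {z_at_84th:.2f}")
```

The exact shares round to 16% and 84%, which agrees with the Empirical Rule figures (50% minus 34%, and 50% plus 34%).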

Percentiles & Quartiles

To help explain the concept of percentiles, think of this dollar bill as representing the entire population or 100% of the data. We represent the 25th percentile with this quarter. To represent the 25th percentile, we use the symbol P with a subscript of 25. Likewise, the median or 50th percentile would be represented by 50 cents or two quarters, and the 75th percentile by three quarters. In statistics, these three percentiles are known as the quartiles of the data set, and we label them Q1, Q2, and Q3 respectively.

Box Plots

The Box Plot, also known as a box-and-whisker plot, is a graphical tool that is based upon the quartiles of the data.

The box in the Box Plot is a rectangular box drawn from the lower quartile to the upper quartile. The length of the box is called the interquartile range, or IQR. The sides of the box corresponding to the lower and upper quartiles are referred to as the “hinges.” There are two sets of “fences” in the box-and-whisker plot, but these are usually not printed. The “inner fences” lie a distance of 1.5 times the IQR outward from the lower and upper hinges. In a similar fashion, the “outer fences” lie a distance of 3 times the IQR outward from the hinges. These fences are used in identifying potential outliers in the data.


In a basic Box Plot, the whiskers extend from the hinges to the maximum and minimum points. In modified Box Plots, the whiskers extend to the most extreme data points inside the inner fences. Points outside the inner fences are usually marked with special characters. In most Box Plots, a line is drawn at the median or second quartile.
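The fence and whisker rules can be sketched as below. The cycle times are hypothetical, and the quartiles come from `statistics.quantiles` with its default method; statistical packages use several slightly different quartile conventions, so the hinges from other software may differ a little.

```python
from statistics import quantiles


def box_plot_fences(data):
    """Return the IQR plus the inner and outer fences for a data set."""
    q1, _median, q3 = quantiles(data, n=4)      # lower hinge, median, upper hinge
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)    # 1.5 x IQR beyond the hinges
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)        # 3 x IQR beyond the hinges
    return iqr, inner, outer


# Hypothetical cycle times, not data from the guide.
cycle_times = [22, 25, 27, 28, 30, 31, 33, 34, 36, 50]

iqr, inner, outer = box_plot_fences(cycle_times)
inside = [x for x in cycle_times if inner[0] <= x <= inner[1]]
potential_outliers = [x for x in cycle_times if not inner[0] <= x <= inner[1]]
whiskers = (min(inside), max(inside))           # modified Box Plot whiskers

print(f"IQR = {iqr}, inner fences = {inner}, outer fences = {outer}")
print(f"whiskers = {whiskers}, potential outliers = {potential_outliers}")
```

In this made-up sample the value 50 falls beyond the upper inner fence but inside the outer fence, so a modified Box Plot would stop the upper whisker at 36 and mark 50 with a special character.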

Comparison of Cycle Times

This graphic shows a comparison of the cycle times for three repair facilities that perform the same process. The data for each facility is illustrated with a Box Plot. A visual comparison of the IQR for each facility provides information about the dispersion in the data for each facility. Two features of the graphic are worth noting. First, the mean of each distribution was added to the graphic. This allows us to make a visual interpretation of the relationship of the median to the mean. The second important feature is the identification of a potential outlier by an asterisk. This point is identified as a potential outlier because it is beyond the inner fence and represents a rare occurrence. Extreme values in the data can have a profound effect on the estimate of the mean and variance of a data set.

Five Number Summary

In its simplest form, the Box Plot is a visual display of what is often referred to as the Five Number Summary of the data. The five number summary consists of the minimum, maximum and three quartile measurements. In this example, the output from the Descriptive Statistics command in Minitab was used to compute the Five Number Summary for this data set.
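The guide's example uses Minitab, and its output is not reproduced in this transcript. As an alternative sketch, the same five values can be computed with Python's standard `statistics` module. The data below are hypothetical, and the quartiles depend on the quartile method used, so results from another package may differ slightly.

```python
from statistics import quantiles


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a data set."""
    q1, median, q3 = quantiles(data, n=4)   # default quartile method
    return min(data), q1, median, q3, max(data)


# Hypothetical measurements, not the data set from the guide.
sample = [3, 5, 7, 8, 9, 11, 12]

minimum, q1, median, q3, maximum = five_number_summary(sample)
print(f"min = {minimum}, Q1 = {q1}, median = {median}, Q3 = {q3}, max = {maximum}")
```

These five values are exactly what a basic Box Plot draws: the whisker ends, the two hinges, and the median line.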