basic statistics

Upload: lokesh-gupta

Post on 14-Dec-2015

DESCRIPTION

Basic Statistics for Six Sigma

TRANSCRIPT

Page 1: Basic Statistics

Statistical Lingo

You may have come to this presentation because you really like statistics, but there’s also the possibility that you’d rather be somewhere else, like maybe playing golf at a fancy resort or something.

The irony is that the world of sports probably refers to statistics more than any other segment of our society.

Page 2: Basic Statistics

Virtually everyone has a pretty good understanding of what is meant by the word “average”.

A golfer who shot rounds of 78, 84, and 87 could compute her average, or what statisticians would call her “mean” (μ or x̄).

NOTE: In general, a Greek letter (μ) is used if an entire population’s data is being checked, but in the case of a sample, the regular letter (x̄) is used.

[Figure: number line from 75 to 90 showing the scores 78, 84, and 87]

Page 3: Basic Statistics

A golfer who shot rounds of 78, 84, and 87 could compute her average, or what statisticians would call her “mean” (μ or x̄), like this:

78 + 84 + 87 = 249
249 / 3 = 83
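The same arithmetic can be written as a short Python sketch. The scores come from the slide; the variable names are only illustrative.

```python
# Golf scores taken from the slide.
scores = [78, 84, 87]

total = sum(scores)          # 78 + 84 + 87
mean = total / len(scores)   # total divided by the number of scores

print(total, mean)
```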

Page 4: Basic Statistics

Each of these scores deviates from the average (83) by some amount.

Score   Deviation
78      -5
84      +1
87      +4

These deviations can be combined to calculate what is called a “standard” deviation.

Page 5: Basic Statistics

But if we want to calculate the “standard deviation” we can’t simply add them up – they’ll cancel each other out and we’ll get zero.

On the other hand, squaring the deviations will prevent that problem.

Score   Deviation   Squared deviation
78      -5          25
84      +1           1
87      +4          16

Page 6: Basic Statistics

Then we can add the squares up - this helps to get an estimate of how much variation is present. (The concept of adding up squares of differences like this is called “the sum of the squares”.)

25 + 1 + 16 = 42
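The deviation, squaring, and summing steps described so far can be checked with a small Python sketch. It uses the three scores from the slides; the variable names are assumptions made for illustration.

```python
scores = [78, 84, 87]
mean = sum(scores) / len(scores)

# Distance of each score from the mean (negative means below the mean).
deviations = [x - mean for x in scores]

# Squaring removes the signs so the deviations no longer cancel out.
squared = [d ** 2 for d in deviations]
sum_of_squares = sum(squared)

print(sum(deviations))   # the raw deviations add up to zero
print(sum_of_squares)    # the "sum of the squares"
```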

Page 7: Basic Statistics

Then we divide this sum by the number of scores in the list (N) minus 1. (This is because we only have a sample of all this person’s golf scores – if we had all of their golf scores we would simply divide by N.)

42 / (3 - 1) = 42 / 2 = 21

Page 8: Basic Statistics

If we just leave it like this, it’s called the variance (σ² or s²). If we take the square root (which brings the result back to the original units after we squared the deviations earlier) we’ll get the standard deviation (σ or s). (Also, we divide by 2 because it’s the number of data points in the sample minus 1.)

42 / 2 = 21 (variance)
√21 ≈ 4.6 (standard deviation)
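As a rough illustration of the difference between dividing by N - 1 (sample) and by N (population), here is a Python sketch using the same three scores. The cross-checks rely on Python's standard `statistics` module, which is not part of the original presentation.

```python
import math
import statistics

scores = [78, 84, 87]
n = len(scores)
mean = sum(scores) / n
sum_of_squares = sum((x - mean) ** 2 for x in scores)

# Sample formulas: divide by n - 1 because we only have a sample.
sample_variance = sum_of_squares / (n - 1)
sample_stdev = math.sqrt(sample_variance)

# Population formula: divide by n, as if these were all the scores ever played.
population_variance = sum_of_squares / n

# Cross-check the hand calculation against the standard library.
assert math.isclose(sample_variance, statistics.variance(scores))
assert math.isclose(sample_stdev, statistics.stdev(scores))
assert math.isclose(population_variance, statistics.pvariance(scores))

print(sample_variance, round(sample_stdev, 1), population_variance)
```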

Page 9: Basic Statistics

Another common term is the median. It’s the “middle value” of the data when the values are sorted, and it is insensitive to extreme values in the set.

Real estate folks might refer to a median income level for an area – it’s virtually unaffected by Bill Gates moving into (or out of) the neighborhood.
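A small Python sketch can make this concrete. The income figures below are made up purely for illustration; only the golf scores come from the slides.

```python
import statistics

# Hypothetical household incomes (invented values, not real data).
incomes = [48_000, 52_000, 55_000, 61_000, 70_000]

# The same neighborhood after one extremely wealthy household moves in.
with_billionaire = incomes + [1_000_000_000]

print(statistics.mean(incomes), statistics.median(incomes))
print(statistics.mean(with_billionaire), statistics.median(with_billionaire))

# Median of the golf scores from the earlier slides.
golf_median = statistics.median([78, 84, 87])
print(golf_median)
```

The mean jumps by several orders of magnitude while the median barely moves, which is why the median is often preferred for data with extreme values.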

Page 10: Basic Statistics

In a few short slides, we’ve covered a number of the most frequently used statistical terms.

Score   Deviation   Squared deviation
78      -5          25
84      +1           1
87      +4          16

mean (x̄): 249 / 3 = 83
median: 84
variance (s²): 42 / 2 = 21
standard deviation (s): √21 ≈ 4.6

Page 11: Basic Statistics

Of course, if you had to manually compute:

• an average
• a deviation for each data point
• a square of each deviation
• a sum of the squares
• a variance
• a standard deviation
• a median

every time you got some data, things could get crazy, especially if there’s a lot of data. Thankfully, we have Minitab.

Page 12: Basic Statistics

Getting Basic Stats From Minitab

1. Enter whatever data you want to analyze into a column in Minitab.

2. Click on Stat, then on Basic Statistics, then on Display Descriptive Statistics.

Page 13: Basic Statistics

3. In the box labeled Variable, indicate the column containing the data.

4. Click on the box labeled “Graphs”.

5. Check “Graphical summary”.

6. Click OK (this closes the Graphs dialog).

7. Click OK again to produce the summary.

Page 14: Basic Statistics

Minitab will provide a summary of the data that looks something like this. We’ll break this down in pieces to explain all the information displayed.

Page 15: Basic Statistics

If a set of data is normally distributed, it means that when it is plotted as a histogram it has a symmetric, bell-shaped distribution.

If data is normally distributed, it allows for a number of predictions and analytical methods that would otherwise not be valid. For example, the mean and standard deviation can be used to predict the odds of having values fall within certain ranges (like within specified tolerances).

[Figure: three example histograms labeled “Normal”, “Not Normal”, and “Not Normal”]

Does the data “fit” a normal distribution well enough to assume normality? Check the p-value shown in the summary: if p < 0.05, no; if p > 0.05, normality can reasonably be assumed.

Page 16: Basic Statistics

Mean: The average value of all the data points. (If calculated using a sample of data from a population it may be written x̄; if calculated using all the data in the population it may be written μ.)

StDev: The standard deviation of all the data points. It can be thought of as the “average distance that data points are from the mean” – the larger the standard deviation, the greater the variation. (If calculated using a sample of data from a population it’s usually written s; if calculated using all the data in the population it’s usually written σ.)

Page 17: Basic Statistics

Variance: Equal to the standard deviation squared. (Written s² for a sample or σ² for a population.)

Skewness: A measure of asymmetry – the further from zero, the more skewed the data. For example, if a distribution has a large tail at the upper end of its distribution, skewness will likely be positive. Typically, the skewness value will range from negative 3 to positive 3.

Kurtosis: A number reflecting how much the sample data resembles a normal distribution in shape. A very negative kurtosis indicates a distribution that is flatter than a normal distribution; a very positive kurtosis indicates a distribution that is more peaked than a normal distribution. The kurtosis value is approximately zero for a normal distribution.

N: The number of data points used in the creation of this summary (the sample size).
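To give a feel for the signs of skewness and kurtosis, here is a minimal Python sketch based on the plain moment definitions. This is an assumption made for illustration: statistical packages often apply small-sample corrections, so the values Minitab reports for the same data may differ somewhat. The function name and both data sets are hypothetical.

```python
def shape_measures(data):
    """Return (skewness, excess kurtosis) using simple moment formulas."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n   # third central moment
    m4 = sum((x - mean) ** 4 for x in data) / n   # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3            # zero for a normal distribution
    return skewness, excess_kurtosis

symmetric = [1, 2, 3, 4, 5]        # evenly spread, no tail
right_tailed = [1, 2, 3, 4, 20]    # one large value creates an upper tail

skew_sym, kurt_sym = shape_measures(symmetric)
skew_right, kurt_right = shape_measures(right_tailed)

print(skew_sym, kurt_sym)
print(skew_right, kurt_right)
```

The evenly spread data has zero skewness and a negative kurtosis (flatter than a bell curve), while the data with an upper tail has positive skewness.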

Page 18: Basic Statistics

Minimum: The lowest value data point in the sample.

1st Quartile: The value which 25% of the data points fall below.

Median: The value which 50% of the data points fall below.

3rd Quartile: The value which 75% of the data points fall below.

Maximum: The highest value data point in the sample.
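These five values can be computed with Python's standard library, as sketched below. The data set is hypothetical. Different tools interpolate quartiles in slightly different ways, so results from Minitab or a spreadsheet may not match exactly; this sketch uses the default "exclusive" method of `statistics.quantiles`.

```python
import statistics

# Hypothetical sample of 11 measurements (invented for illustration).
data = [62, 70, 71, 74, 75, 78, 80, 83, 85, 91, 99]

# n=4 splits the data into quarters and returns the three cut points.
q1, median, q3 = statistics.quantiles(data, n=4)

summary = {
    "Minimum": min(data),
    "1st Quartile": q1,
    "Median": median,
    "3rd Quartile": q3,
    "Maximum": max(data),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```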

Page 19: Basic Statistics

Confidence Intervals: Because we only gave Minitab a sample of data from a presumably larger population, it can only estimate what the entire population is like.

Minitab can help us to understand how good our estimates of things like the mean (Mu), the standard deviation (Sigma), and median are.

Minitab does this by calculating, for each of these parameters, an interval within which it is 95% confident the true value would fall if the whole population were measured.

Page 20: Basic Statistics

The vertical line partway through each of the red boxes is the calculated mean (top) and median (bottom) for the sample of data entered.

Around these points, Minitab calculates an interval within which it is 95% certain that the population mean and median actually reside.

For example, in the case of the top red bar, the vertical line in the middle of the red bar shows a mean of about 50.6.

While this is probably not the EXACT mean for the population, using the number of data points and the amount of variation they exhibited, it can be estimated with good confidence (95%) that the mean for the population falls somewhere between 48.9 and 52.3.
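The data behind the 48.9 to 52.3 interval is not given in the slides, so it cannot be reproduced here. As a sketch of the general idea, the example below computes a 95% interval for the mean of the three golf scores, assuming the usual t-based formula (mean ± t × s / √n). The critical value 4.303 is the standard two-sided 95% t-table value for 2 degrees of freedom and is hard-coded as an assumption.

```python
import math
import statistics

scores = [78, 84, 87]
n = len(scores)
mean = statistics.mean(scores)
s = statistics.stdev(scores)          # sample standard deviation (n - 1)

# Assumed: two-sided 95% critical value of Student's t for n - 1 = 2
# degrees of freedom, taken from a standard t-table.
T_CRIT_95_DF2 = 4.303

margin = T_CRIT_95_DF2 * s / math.sqrt(n)
lower, upper = mean - margin, mean + margin

print(f"95% interval for the mean: {lower:.1f} to {upper:.1f}")
```

With only three data points the interval is very wide, which illustrates why both the number of data points and the amount of variation matter.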

Page 21: Basic Statistics

Histogram of the data (with Minitab’s best estimate of the normal curve that fits the data best)

The “Box and Whisker plot” divides the data into “quarters”, marked by the 1st quartile, the median, and the 3rd quartile.

NOTE: Data points with values lower than Q1 - 1.5(Q3 - Q1) or greater than Q3 + 1.5(Q3 - Q1) are considered “outliers” and appear as individual dots.
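The outlier rule in the note can be sketched in Python as follows. The data set is hypothetical, and the quartiles use the default method of `statistics.quantiles`, which may differ slightly from the interpolation another tool uses.

```python
import statistics

# Hypothetical sample; 130 is deliberately far from the rest.
data = [62, 70, 71, 74, 75, 78, 80, 83, 85, 91, 130]

q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                      # interquartile range, Q3 - Q1

lower_fence = q1 - 1.5 * iqr       # points below this are outliers
upper_fence = q3 + 1.5 * iqr       # points above this are outliers

outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(lower_fence, upper_fence, outliers)
```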

Page 22: Basic Statistics

Once you have the basic stats, what’s next?

Given a normally distributed process with a mean = 83 & std dev = 4.6:

68% of the population will be captured within one standard deviation of the mean (about 78.5 to 87.5), 34% on each side.

95% of the population will be captured within two standard deviations of the mean (about 74 to 92); the second standard deviation adds 13.5% on each side.

99.73% of the population will be captured within three standard deviations of the mean (about 69.5 to 96.5); the third standard deviation adds 2.36% on each side.
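These percentages can be checked with Python's standard `statistics.NormalDist`, as in the sketch below. It assumes the process is exactly normal with the stated mean and standard deviation. The computed ranges differ slightly from the values printed on the slide (78.5, 87.5, and so on), which are spaced 4.5 apart rather than 4.6, and the exact two-sigma share is a little above the rounded 95%.

```python
from statistics import NormalDist

MEAN, STD_DEV = 83, 4.6
process = NormalDist(mu=MEAN, sigma=STD_DEV)

def share_within(k):
    """Fraction of the population within k standard deviations of the mean."""
    low, high = MEAN - k * STD_DEV, MEAN + k * STD_DEV
    return process.cdf(high) - process.cdf(low)

shares = {k: share_within(k) for k in (1, 2, 3)}

for k, share in shares.items():
    low, high = MEAN - k * STD_DEV, MEAN + k * STD_DEV
    print(f"within {k} std dev ({low:.1f} to {high:.1f}): {share:.2%}")
```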

Page 23: Basic Statistics

Note that the three items mentioned (shape, mean, and standard deviation) help to characterize the process [or the performance of a process].

It’s somewhat like when you ship a box for overnight delivery: the courier wants to know the length, width, height, and weight of the box. That information characterizes the box for them. In other words, they know what to expect when they come to get it.

Page 24: Basic Statistics

Understanding a process this well has some rather powerful implications.

For example, once you know the mean and standard deviation of a process that’s normally distributed, predicting the percentage of times something will fall above or below any given value (like a tolerance limit, for instance) is relatively easy.

In other words, we can tell how often the process will perform “properly”.

That’s the topic of another tool time: Process Capability.
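As a rough sketch of that kind of prediction, the Python example below estimates how often a normally distributed process with mean 83 and standard deviation 4.6 would exceed an upper limit. The limit of 90 is a hypothetical value chosen only for illustration; it does not come from the presentation.

```python
from statistics import NormalDist

process = NormalDist(mu=83, sigma=4.6)

# Hypothetical upper tolerance limit (an assumption, not from the slides).
upper_limit = 90

share_below = process.cdf(upper_limit)   # fraction at or below the limit
share_above = 1 - share_below            # fraction exceeding the limit

print(f"below limit: {share_below:.1%}, above limit: {share_above:.1%}")
```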