Chapter 6 - Random Sampling and Data Description Experience the joy of dealing with large quantities of data Chapter 6A

Upload: melina-robinson

Post on 11-Jan-2016

Page 1:

Chapter 6 - Random Sampling and Data Description

Experience the joy of dealing with large quantities of data

Chapter 6A

Page 2:

This Week in Prob/Stat

+ Bonus Material

Today’s Discussion

Page 3:

Descriptive Statistics

Distributions
• Histogram
• Cumulative frequency distribution
• Frequency distribution (continuous data)

Measures of Central Tendency (location)
• Mean
• Median
• Mode

Measures of Variability (dispersion)
• Variance (standard deviation)
• Range
• Quartiles
• Coefficient of Variation

Measures of Skewness

Measures of Kurtosis

Page 4:

Histograms

• Data are placed into class intervals, cells, or bins (synonyms).
• Continuous data: number of bins ≈ sqrt(n), where n is the number of observations, or use Sturges' rule.
• A histogram shows the relative frequency of the sample observations in each class.
• Histogram ~ probability density (or mass) function.
• By summing counts over successive bins you can construct a cumulative frequency plot.
• Cumulative frequency plot ~ empirical distribution function ~ cumulative distribution function.

Page 5:

A Discrete Example

Raw data: the number of accident claims received per day over the last 50 days by the Nofrills Insurance Co.

week  Mon  Tues  Wed  Thur  Fri
 1     4    3     1    1     4
 2     0    3     0    0     1
 3     5    2     0    2     0
 4     1    1     0    1     1
 5     3    4     2    3     3
 6     2    0     0    1     2
 7     1    4     4    5     2
 8     7    2     2    3     6
 9     3    3     1    4     1
10     2    1     1    2     2

Bin  Frequency  Cumulative %
0     8          16.00%
1    13          42.00%
2    11          64.00%
3     8          80.00%
4     6          92.00%
5     2          96.00%
6     1          98.00%
7     1         100.00%

[Figure: relative frequency histogram. x-axis: Number of claims (0 to 7); y-axis: Frequency (0 to 0.3).]
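The frequency and cumulative-percentage columns of the claims table can be reproduced from the raw counts. The following is a minimal Python sketch; the variable names (`claims`, `freq`, `cum_pct`) are chosen here for illustration and are not part of the original slides.

```python
from collections import Counter
from itertools import accumulate

# Daily claim counts for weeks 1 to 10 (Mon..Fri), copied from the raw-data table.
claims = [
    4, 3, 1, 1, 4,
    0, 3, 0, 0, 1,
    5, 2, 0, 2, 0,
    1, 1, 0, 1, 1,
    3, 4, 2, 3, 3,
    2, 0, 0, 1, 2,
    1, 4, 4, 5, 2,
    7, 2, 2, 3, 6,
    3, 3, 1, 4, 1,
    2, 1, 1, 2, 2,
]

counts = Counter(claims)
bins = list(range(min(claims), max(claims) + 1))   # one bin per integer value
freq = [counts[b] for b in bins]                   # absolute frequency per bin
rel_freq = [f / len(claims) for f in freq]         # heights of the histogram bars
cum_pct = [100 * c / len(claims) for c in accumulate(freq)]  # running total in %

for b, f, r, c in zip(bins, freq, rel_freq, cum_pct):
    print(f"{b:>3} {f:>9} {r:>8.2f} {c:>8.2f}%")
```

Summing the frequencies bin by bin with `accumulate` is the "succession of bins" construction of the cumulative frequency plot described on the Histograms slide.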

Page 6:

A Discrete Empirical Cumulative Frequency Distribution

Number of claims   Cumulative frequency
0 ≤ x < 1           16%
1 ≤ x < 2           42%
2 ≤ x < 3           64%
3 ≤ x < 4           80%
4 ≤ x < 5           92%
5 ≤ x < 6           96%
6 ≤ x < 7           98%
7 ≤ x < ∞          100%

Page 7:

A Discrete Empirical Cumulative Frequency Distribution Graph

[Figure: cumulative frequency graph. x-axis: Number of Claims (0 to 8); y-axis: Cumulative % (0% to 100%).]

Page 8:

A Continuous Data Example

Raw data: time in hours to repair or replace a failed transformer by the Dayton Power and Light Company

Industry standard is 2.5 hours

2.2  5.0  5.0  4.3  4.5  3.3
1.7  1.8  5.6  2.4  3.3  2.0
2.4  3.2  2.5  3.6  3.9  1.6
2.5  2.2  4.5  2.5  2.7  1.6
2.9  1.9  3.7  1.9  3.0  2.9
4.4  4.0  4.3  2.7  3.9

Data collection: 35 repairs performed between 01/01/07 and 06/30/07

Page 9:

Sturges' rule for grouping data

k = 1 + 3.3 log10(n)

where k = number of classes, n = sample size.

⌊x⌋ = integer part of x

For example,

n      k    √n
35     6     6
50     7     7
500    10   22
5000   13   71
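As a rough check of the two bin-count rules, the sketch below evaluates k = 1 + 3.3 log10(n) and sqrt(n) for the sample sizes in the example. The function name `sturges_k` is hypothetical. One observation, offered as an assumption about how the example was prepared: the tabulated values agree with rounding to the nearest integer.

```python
import math

def sturges_k(n: int) -> float:
    """Unrounded class count from Sturges' rule, k = 1 + 3.3 * log10(n)."""
    if n < 1:
        raise ValueError("sample size must be at least 1")
    return 1 + 3.3 * math.log10(n)

for n in (35, 50, 500, 5000):
    k = sturges_k(n)
    # Columns: n, unrounded k, k rounded to nearest, sqrt(n) rounded to nearest
    print(f"{n:>5} {k:>6.2f} {round(k):>3} {round(math.sqrt(n)):>3}")
```

For the 35 transformer repair times either rule suggests about 6 classes.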

Page 10:

A Histogram

Transformer repair times in hours. Data were generated from a lognormal distribution.

Bin          Frequency
x <= 1       0
1 < x <= 2   0.2
2 < x <= 3   0.34286
3 < x <= 4   0.22857
4 < x <= 5   0.17143
5 < x        0.05714

[Figure: histogram of relative frequency by bin; y-axis from 0 to 0.4.]

Page 11:

Frequency Polygon

[Figure: frequency polygon. x-axis: Repair time in hours (0 to 8); y-axis: relative frequency (0 to 0.4).]

Page 12:

Cumulative Frequency Distribution (ogive)

[Figure: ogive. x-axis: Repair Times (0 to 7); y-axis: Cumulative Frequency (0% to 120%).]

Page 13:

Measures of Central Tendency – i.e. averages

Seeking the middle ground

Page 14:

Types of Data

Nominal (also categorical or discrete) (e.g. group employees by job type)
• the only comparisons are equality and inequality
• no "less than" or "greater than" relations among the classifying names
• no operations such as addition or subtraction

Ordinal (e.g. rank colleges based on surveys and interviews)
• the numbers assigned to objects represent the rank order (1st, 2nd, 3rd, etc.) of the entities measured
• comparisons of greater and less can be made, in addition to equality and inequality

Interval (e.g. temperature, IQ measurements)
• have all the features of ordinal measurements
• equal differences between measurements represent equivalent intervals
• operations such as addition and subtraction are therefore meaningful

Ratio (e.g. group travel times into intervals)
• have all the features of interval measurements
• operations such as multiplication and division are therefore meaningful
• the zero value on a ratio scale is non-arbitrary

Page 15:

6-1 Numerical Summaries

Definition: Sample Mean

Page 16:

Characteristics of the mean
• most widely known and used average
• an artificial concept, since it may not coincide with any actual value
• affected by every value of every item
  • therefore uses all the information available in the sample
• highly influenced by extreme values
• can be computed directly from the raw data
  • e.g. the data do not need to be sorted, as they do for the median
• requires interval or ratio data
• lends itself better to algebraic analysis than other measures of central tendency
• has some desirable statistical properties
• answers the question, "if all the quantities had the same value, what would that value have to be in order to achieve the same total?"

Page 17:

Example 6-1

Page 18:

6-1 Numerical Summaries

Figure 6-1 The sample mean as a balance point for a system of weights.

Page 19:

Population Mean

For a finite population with N measurements, the mean is

$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$

The sample mean is a reasonable estimate of the population mean.

Page 20:

Sample Median

• The median is a measure of central tendency such that half of the values in a sample are below it and half are above it.
• If the number of observations is even, average the two central values.
• The sample median is less influenced by ‘outliers’ than the sample mean.
  • not affected by extreme values
  • affected by the number but not the value of extremes
• Widely used for skewed distributions, where the mean would be distorted by extreme values
  • e.g. economic data
• Can be used where the data are ranked but not measured quantitatively
• Unreliable if the data do not cluster at the center of the distribution

Page 21:

Order Statistics

Define

X(1) = Min {X1, X2, …, Xn}

X(2) = 2nd smallest {X1, X2, …, Xn}

X(i) = ith smallest {X1, X2, …, Xn}

X(n) = Max {X1, X2, …, Xn}

Therefore X(1) ≤ X(2) ≤ X(3) ≤ … ≤ X(n)

$$X_{med} = \begin{cases} X_{(k)} & \text{if } n = 2k - 1 \text{ is odd} \\ \dfrac{X_{(k)} + X_{(k+1)}}{2} & \text{if } n = 2k \text{ is even} \end{cases}$$
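The order-statistic definition of the median translates directly into code. The sketch below is one possible reading of it; the function name `median_from_order_stats` and the two small samples are hypothetical illustrations.

```python
def median_from_order_stats(values):
    """Median via order statistics.

    X(k) when n = 2k - 1 is odd; the average of X(k) and X(k+1) when n = 2k is even.
    """
    x = sorted(values)            # x[i - 1] holds the order statistic X(i)
    n = len(x)
    if n == 0:
        raise ValueError("median of an empty sample is undefined")
    k = (n + 1) // 2              # gives k for both n = 2k - 1 and n = 2k
    if n % 2 == 1:
        return x[k - 1]           # X(k)
    return (x[k - 1] + x[k]) / 2  # (X(k) + X(k+1)) / 2

odd_sample = [5, 1, 3]        # hypothetical sample, n = 3
even_sample = [4, 1, 3, 2]    # hypothetical sample, n = 4
print(median_from_order_stats(odd_sample), median_from_order_stats(even_sample))
```

Sorting is the only expensive step, which is why the median, unlike the mean, cannot be computed in a single pass over unsorted raw data with this approach.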

Page 22:

Median Repair Time

Raw data: time in hours to repair or replace a failed transformer by the Dayton Power and Light Company

Industry standard is 2.5 hours

2.2  5.0  5.0  4.3  4.5  3.3
1.7  1.8  5.6  2.4  3.3  2.0
2.4  3.2  2.5  3.6  3.9  1.6
2.5  2.2  4.5  2.5  2.7  1.6
2.9  1.9  3.7  1.9  3.0  2.9
4.4  4.0  4.3  2.7  3.9

Sorted data (observation number, repair time):

 1  1.6     13  2.5     25  3.9
 2  1.6     14  2.5     26  3.9
 3  1.7     15  2.7     27  4.0
 4  1.8     16  2.7     28  4.3
 5  1.9     17  2.9     29  4.3
 6  1.9     18  2.9     30  4.4
 7  2.0     19  3.0     31  4.5
 8  2.2     20  3.2     32  4.5
 9  2.2     21  3.3     33  5.0
10  2.4     22  3.3     34  5.0
11  2.4     23  3.6     35  5.6
12  2.5     24  3.7

Page 23:

Median of an even number of observations

observation  raw value  sorted value
 1            27.40       1.18
 2             9.08       9.08
 3           165.29       9.87
 4           214.85      15.01
 5            98.70      27.40
 6            76.07      48.74
 7             9.87      49.86
 8            77.96      59.79
 9            15.01      76.07
10            49.86      77.96
11             1.18      98.70
12           188.07     165.29
13           317.26     188.07
14            59.79     214.85
15           384.63     317.26
16            48.74     384.63

Average the middle two observations: (59.79 + 76.07) / 2 = 67.93

Page 24:

Good use of the median

Constructively Yours is a small privately owned and operated business that specializes in small residential construction and remodeling projects. In addition to the owner-president, the company employs 8 other workers.

position        annual salary
receptionist    $22,050
worker 1        $28,175
worker 2        $29,500
worker 3        $31,450
worker 4        $32,800
salesperson 1   $34,150
salesperson 2   $38,000
job foreman     $43,200
President       $230,000

mean            $54,369
median          $32,800

Warning: The above salary information is confidential and proprietary and should not be disclosed beyond its use in the classroom.
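To see how much a single extreme value moves the mean while leaving the median nearly unchanged, the salary figures can be run through Python's standard `statistics` module. This is a minimal sketch; the comparison without the president's salary is an illustration added here, not part of the slide.

```python
from statistics import mean, median

# Annual salaries from the Constructively Yours table, president listed last.
salaries = [22_050, 28_175, 29_500, 31_450, 32_800,
            34_150, 38_000, 43_200, 230_000]

with_president = (mean(salaries), median(salaries))
without_president = (mean(salaries[:-1]), median(salaries[:-1]))

print(f"all nine:       mean {with_president[0]:,.0f}  median {with_president[1]:,.0f}")
print(f"without owner:  mean {without_president[0]:,.0f}  median {without_president[1]:,.0f}")
```

Dropping the one extreme salary shifts the mean by more than $20,000 but the median by less than $1,000, which is the sense in which the median is affected by the number but not the value of extremes.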

Page 25:

Is the Median Representative?

The Makit Company is a small job shop that primarily employs machine operators and engineers.

position      annual salary
clerk         $18,400
Machinist 1   $28,175
Machinist 2   $29,500
Machinist 3   $31,450
Machinist 4   $32,800
Machinist 5   $34,150
Machinist 6   $34,200
Machinist 7   $35,500
Machinist 8   $36,100
Engineer 1    $68,500
Engineer 2    $78,230
Engineer 3    $85,400
Engineer 4    $90,100

mean          $46,347
median        $34,200

Warning: The above salary information is confidential and proprietary and should not be disclosed beyond its use in the classroom.

Page 26:

Mode

The most frequent value assumed by a random variable or occurring in a sample.

The term is applied both to probability distributions and to collections of data.

The mode is not necessarily unique, since the same maximum frequency may be attained at different values. The worst case is given by the uniform distributions in which all values are equally likely.

For example, the mode of the sample [1, 3, 6, 6, 6, 6, 7, 7, 12, 12, 17] is 6.

The mode of a discrete probability distribution is the value x at which its probability mass function takes its maximum value, i.e. the value that is most likely to be sampled.

The mode of a continuous probability distribution is the value x at which its probability density function attains its maximum value.

Not affected by extreme values.

Can be computed from nominal data.

Page 27:

Example – Sample Mode

Raw data: the number of accident claims received per day over the last 50 days by the Nofrills Insurance Co.

week  Mon  Tues  Wed  Thur  Fri
 1     4    3     1    1     4
 2     0    3     0    0     1
 3     5    2     0    2     0
 4     1    1     0    1     1
 5     3    4     2    3     3
 6     2    0     0    1     2
 7     1    4     4    5     2
 8     7    2     2    3     6
 9     3    3     1    4     1
10     2    1     1    2     2

Bin  Frequency
0     8
1    13
2    11
3     8
4     6
5     2
6     1
7     1

Mode = 1
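Because the mode need not be unique, a small helper that returns every value attaining the maximum frequency may be more useful than one that returns a single number. The sketch below assumes that design; the name `modes` is hypothetical.

```python
from collections import Counter

def modes(sample):
    """Return all values that attain the maximum frequency, in sorted order."""
    counts = Counter(sample)
    top = max(counts.values())    # raises ValueError on an empty sample
    return sorted(v for v, c in counts.items() if c == top)

# Bin frequencies from the claims table: number of claims -> number of days.
claim_frequencies = {0: 8, 1: 13, 2: 11, 3: 8, 4: 6, 5: 2, 6: 1, 7: 1}
claims = [v for v, c in claim_frequencies.items() for _ in range(c)]

print(modes(claims))                                    # the claims example
print(modes([1, 3, 6, 6, 6, 6, 7, 7, 12, 12, 17]))      # the sample from the Mode slide
print(modes(["clerk", "engineer", "clerk", "welder"]))  # nominal data also works
```

Only counting and equality comparisons are used, which is why the mode can be computed from nominal data.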

Page 28:

Geometric mean

$$\sqrt[n]{x_1 \, x_2 \cdots x_n}$$

• For positive data, the geometric mean is smaller than or equal to the arithmetic mean
  • the two means are equal if and only if all members of the data set are equal
• allows the definition of the arithmetic-geometric mean, a mixture of the two which always lies in between
• Used to determine "average factors"
  • If a stock rose 10% in the first year, 20% in the second year, and fell 15% in the third year, then compute the geometric mean of the factors 1.10, 1.20 and 0.85 as (1.10 × 1.20 × 0.85)^(1/3) = 1.0391... and conclude that the stock rose 3.91 percent per year, on average.
• answers the question, "if all the quantities had the same value, what would that value have to be in order to achieve the same product?"
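The stock example can be checked numerically. The sketch below computes the n-th root of the product through logarithms, a common way to avoid overflow for long series; that implementation choice and the function name `geometric_mean` are assumptions made here, not something the slide prescribes.

```python
import math

def geometric_mean(values):
    """n-th root of the product of the values; requires positive values."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs a non-empty sample of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))

factors = [1.10, 1.20, 0.85]            # yearly growth factors from the stock example
avg_factor = geometric_mean(factors)
avg_growth_pct = 100 * (avg_factor - 1)

print(f"average factor {avg_factor:.4f}, i.e. {avg_growth_pct:.2f}% per year")
```

Applying the average factor three times reproduces the same overall product, 1.10 × 1.20 × 0.85 = 1.122, which is the "same product" property stated above.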

Page 29:

Harmonic mean

$$\frac{n}{\dfrac{1}{x_1} + \dfrac{1}{x_2} + \cdots + \dfrac{1}{x_n}}$$

• is appropriate for situations when the average of rates is desired
  • if for half the distance of a trip you travel at 40 miles per hour and for the other half of the distance you travel at 60 miles per hour, then your average speed for the trip is given by the harmonic mean of 40 and 60, which is 48; that is, the total amount of time for the trip is the same as if you traveled the entire trip at 48 miles per hour.
  • If you had traveled for half the time at one speed and the other half at another, the arithmetic mean, in this case 50 miles per hour, would provide the correct average.
• In finance, used to calculate the average cost of shares purchased over a period of time
  • an investor purchases $1000 worth of stock every month for three months. If the spot prices at execution time are $8, $9, and $10, then the average price the investor paid is $8.926 per share.
  • However, if the investor purchased 1000 shares per month, the arithmetic mean would be used.
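Both harmonic-mean examples can be verified from first principles: total distance over total time, and total dollars over total shares. In the sketch below the half-trip distance of 120 miles is a hypothetical value (any positive distance gives the same average speed), and the function name is chosen for illustration.

```python
def harmonic_mean(values):
    """n divided by the sum of reciprocals; requires positive values."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("harmonic mean needs a non-empty sample of positive values")
    return len(values) / sum(1 / v for v in values)

# Trip: equal distances at 40 and 60 miles per hour.
half_distance = 120.0                               # hypothetical, in miles
total_time = half_distance / 40 + half_distance / 60
avg_speed = 2 * half_distance / total_time          # total distance / total time

# Investing: $1000 per month at $8, $9 and $10 per share.
prices = [8, 9, 10]
shares_bought = sum(1000 / p for p in prices)
avg_price = 3 * 1000 / shares_bought                # total dollars / total shares

print(avg_speed, harmonic_mean([40, 60]))
print(round(avg_price, 3), round(harmonic_mean(prices), 3))
```

In both cases the quantity held constant (distance, dollars) sits in the numerator of the rate, which is the situation where the harmonic mean rather than the arithmetic mean gives the correct average.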

Page 30:

Midrange and beyond

$$\text{midrange} = \frac{x_{min} + x_{max}}{2}$$

The midrange is highly sensitive to outliers and ignores all but two data points; therefore it is rarely used in statistical analysis.

While the mean of a set of values minimizes the sum of squares of deviations and the median minimizes the average absolute deviation, the midrange minimizes the maximum deviation.

For a given data set of positive values that are not all equal, the harmonic mean is always the least of the three, the arithmetic mean is always the greatest of the three, and the geometric mean is always in between.
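The ordering of the three means and the midrange formula can be checked on the transformer repair times. This sketch relies on `geometric_mean` and `harmonic_mean` from Python's standard `statistics` module (available from Python 3.8); the `midrange` helper is a hypothetical name.

```python
from statistics import mean, geometric_mean, harmonic_mean

def midrange(values):
    """Average of the smallest and largest observation."""
    return (min(values) + max(values)) / 2

# Transformer repair times in hours, from the continuous data example.
repair_times = [
    2.2, 5.0, 5.0, 4.3, 4.5, 3.3,
    1.7, 1.8, 5.6, 2.4, 3.3, 2.0,
    2.4, 3.2, 2.5, 3.6, 3.9, 1.6,
    2.5, 2.2, 4.5, 2.5, 2.7, 1.6,
    2.9, 1.9, 3.7, 1.9, 3.0, 2.9,
    4.4, 4.0, 4.3, 2.7, 3.9,
]

am = mean(repair_times)
gm = geometric_mean(repair_times)
hm = harmonic_mean(repair_times)

print(f"harmonic {hm:.3f} < geometric {gm:.3f} < arithmetic {am:.3f}")
print(f"midrange {midrange(repair_times):.2f}")
```

The midrange uses only the minimum (1.6) and maximum (5.6), so a single unusually long repair would move it by half the size of that change.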

Page 31:

Measures of Dispersion

The search for variability

Page 32:

Definition: Sample Variance

Page 33:

Figure 6-2

How Does the Sample Variance Measure Variability?

How the sample variance measures variability through the deviations $x_i - \bar{x}$.

Page 34:

Example 6-2

Page 35:

Table 6-1

Page 36:

Computational Form of s²

Page 37:

Population Variance

When the population is finite and consists of N values, we may define the population variance as

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$

The sample variance is a reasonable estimate of the population variance.

Page 38:

Homing in on the Sample Range

Page 39:

Example measures

Raw data: time in hours to repair or replace a failed transformer by the Dayton Power and Light Company

2.2  5.0  5.0  4.3  4.5  3.3
1.7  1.8  5.6  2.4  3.3  2.0
2.4  3.2  2.5  3.6  3.9  1.6
2.5  2.2  4.5  2.5  2.7  1.6
2.9  1.9  3.7  1.9  3.0  2.9
4.4  4.0  4.3  2.7  3.9

min       1.6
max       5.6
mean      3.1
median    2.9
std dev   1.10
range     4.0  (= 5.6 - 1.6)
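The summary measures for the repair times can be recomputed as follows. The sketch assumes the usual sample variance with divisor n - 1 and the usual shortcut (computational) form; the slides' own equations for these definitions did not survive in this transcript, so treat both formulas as assumptions.

```python
import math

# Transformer repair times in hours, from the raw-data table.
repair_times = [
    2.2, 5.0, 5.0, 4.3, 4.5, 3.3,
    1.7, 1.8, 5.6, 2.4, 3.3, 2.0,
    2.4, 3.2, 2.5, 3.6, 3.9, 1.6,
    2.5, 2.2, 4.5, 2.5, 2.7, 1.6,
    2.9, 1.9, 3.7, 1.9, 3.0, 2.9,
    4.4, 4.0, 4.3, 2.7, 3.9,
]

n = len(repair_times)
x_bar = sum(repair_times) / n

# Definitional form: squared deviations from the sample mean, divided by n - 1.
s2_definition = sum((x - x_bar) ** 2 for x in repair_times) / (n - 1)

# Computational form: sum of squares minus (sum)^2 / n, divided by n - 1.
s2_shortcut = (sum(x * x for x in repair_times)
               - sum(repair_times) ** 2 / n) / (n - 1)

s = math.sqrt(s2_definition)
sample_range = max(repair_times) - min(repair_times)

print(f"mean {x_bar:.1f}  std dev {s:.2f}  range {sample_range:.1f}")
```

The two variance forms agree algebraically; the shortcut form needs only running totals of x and x², but it can lose precision when the mean is large relative to the spread.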

Page 40:

Quartiles

A quartile is any of the three values which divide the sorted data set into four equal parts, so that each part represents 1/4th of the sampled population.

Thus:

first quartile (designated Q1) = lower quartile = cuts off lowest 25% of data = 25th percentile

second quartile (designated Q2) = median = cuts data set in half = 50th percentile

third quartile (designated Q3) = upper quartile = cuts off highest 25% of data, or lowest 75% = 75th percentile

The difference between the upper and lower quartiles is called the interquartile range.

Page 41:

More Data Features

When an ordered set of data is divided into four equal parts, the division points are called quartiles.

The first or lower quartile, q1, is a value that has approximately one-fourth (25%) of the observations below it and approximately 75% of the observations above.

The second quartile, q2, has approximately one-half (50%) of the observations below its value. The second quartile is exactly equal to the median.

The third or upper quartile, q3, has approximately three-fourths (75%) of the observations below its value.

As in the case of the median, the quartiles may not be unique.

Page 42:

6-2 Example of Data Features

The compressive strength data in Table 6-2 contains n = 80 observations. Minitab software calculates the first and third quartiles as the (n + 1)/4 and 3(n + 1)/4 ordered observations and interpolates as needed.

For example, (80 + 1)/4 = 20.25 and 3(80 + 1)/4 = 60.75.

Therefore, Minitab interpolates between the 20th and 21st ordered observations to obtain q1 = 143.50 and between the 60th and 61st observations to obtain q3 = 181.00.

Page 43:

Data Features

• The interquartile range is the difference between the upper and lower quartiles, and it is sometimes used as a measure of variability.

• In general, the 100kth percentile is a data value such that approximately 100k% of the observations are at or below this value and approximately 100(1 - k)% of them are above it.

Page 44:

Examples in Variability

Professor Higgins has experienced considerable variability in his driving time from home to the University. Help the good professor find a measure of his variability.

Driving times in minutes

Raw values: 44.0, 41.7, 34.6, 44.8, 21.8, 26.0, 28.9, 27.9, 38.9, 37.0, 23.3, 32.5, 45.9, 38.9, 42.4, 31.4

Sorted:

observation  value  Quartiles
 1           21.8
 2           23.3
 3           26.0
 4           27.9   Q1
 5           28.9
 6           31.4
 7           32.5
 8           34.6   Q2
 9           37.0
10           38.9
11           38.9
12           41.7   Q3
13           42.4
14           44.0
15           44.8
16           45.9

variance              62.1
std dev               7.88
range                 24.1
interquartile range   13.8
true median           35.8
mean                  35.0
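For comparison, the (n + 1)/4 interpolation rule described for Minitab can be applied to these sixteen driving times. The sketch below is one reading of that rule, with hypothetical function names. Its q1 and q3 need not match the table, which marks the 4th and 12th ordered observations as Q1 and Q3; its q2 is the average of the 8th and 9th observations.

```python
def ordered_observation(sorted_values, position):
    """Value at a 1-based, possibly fractional position, by linear interpolation."""
    n = len(sorted_values)
    position = min(max(position, 1), n)   # clamp so small samples stay in range
    lower = int(position)                 # whole part of the position
    frac = position - lower               # fractional part to interpolate over
    if lower >= n:
        return sorted_values[-1]
    below, above = sorted_values[lower - 1], sorted_values[lower]
    return below + frac * (above - below)

def quartiles(values):
    """q1, q2, q3 at positions (n + 1)/4, 2(n + 1)/4 and 3(n + 1)/4."""
    x = sorted(values)
    n = len(x)
    return tuple(ordered_observation(x, j * (n + 1) / 4) for j in (1, 2, 3))

driving_times = [44.0, 41.7, 34.6, 44.8, 21.8, 26.0, 28.9, 27.9,
                 38.9, 37.0, 23.3, 32.5, 45.9, 38.9, 42.4, 31.4]

q1, q2, q3 = quartiles(driving_times)
iqr = q3 - q1
print(f"q1 {q1:.3f}  q2 {q2:.3f}  q3 {q3:.3f}  IQR {iqr:.3f}")
```

With n = 16 the positions are 4.25, 8.5 and 12.75, so each quartile falls between two ordered observations. The standard library's `statistics.quantiles(driving_times, n=4, method="exclusive")` is expected to give the same three values.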

Page 45:

Coefficient of Variation

$$CV = 100 \, \frac{s}{\bar{X}}$$

Where should the Vary A. Schun Company direct its efforts to reduce the variability in its production lead-time?

Unit production times in minutes

Data source          mean    std dev   CV
Gear assembly
  Jeff                1.65   0.088      5.33
  Jerry               1.73   0.075      4.34
Housing assembly
  Judy                4.23   1.02      24.11
  Jared               5.67   0.99      17.46
  Julie               4.78   0.85      17.78
Final assembly
  Jane               34.56   2.45       7.09
  Jim                37.58   2.05       5.46
  John               32.1    2.11       6.57
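The CV column can be recomputed from the mean and standard deviation columns. A minimal sketch follows; the dictionary layout and the function name are illustrative choices, and the data are the operator figures from the table.

```python
def coefficient_of_variation(mean, std_dev):
    """CV as a percentage: 100 * s / x-bar. Undefined for a zero mean."""
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return 100 * std_dev / mean

# operator -> (mean, standard deviation) of unit production time in minutes
operators = {
    "Jeff": (1.65, 0.088), "Jerry": (1.73, 0.075),               # gear assembly
    "Judy": (4.23, 1.02), "Jared": (5.67, 0.99), "Julie": (4.78, 0.85),  # housing
    "Jane": (34.56, 2.45), "Jim": (37.58, 2.05), "John": (32.1, 2.11),   # final
}

cv = {name: coefficient_of_variation(m, s) for name, (m, s) in operators.items()}
most_variable = max(cv, key=cv.get)

for name, value in sorted(cv.items(), key=lambda item: -item[1]):
    print(f"{name:<6} {value:6.2f}")
print("largest relative variability:", most_variable)
```

Final assembly has the largest standard deviations in absolute terms, but relative to the mean the housing assembly operators vary the most, which is the comparison the CV is designed to make.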

Page 46:

Real Bonus Material

Descriptive Statistics for the Overachieving Student

Page 47:

Skewness and Kurtosis

Moments about the mean (for example, the variance is the second moment):

$$M_j = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^j}{n}, \quad j = 1, 2, 3, 4$$

$$\hat{\beta}_1 = \frac{M_3}{M_2^{3/2}}$$

$$\hat{\beta}_2 = \frac{M_4}{M_2^{2}}$$

β1 is skewness, based on the third moment about the mean.

β2 is kurtosis, based on the fourth moment about the mean.

Note how a power of the sample variance is used to ‘standardize’ the β1 and β2 estimates.
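The moment-based estimates can be sketched in a few lines. The code below assumes moments about the mean with divisor n, skewness as the third moment over the second moment to the power 3/2, and kurtosis as the fourth moment over the second moment squared; the function names and both samples are hypothetical.

```python
def central_moment(values, j):
    """j-th moment about the sample mean, with divisor n."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** j for x in values) / n

def skewness(values):
    """Third central moment standardized by the second moment to the power 3/2."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    """Fourth central moment standardized by the squared second moment."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

symmetric = [1, 2, 3, 4, 5]          # hypothetical, symmetric about 3
right_tailed = [1, 1, 2, 2, 3, 10]   # hypothetical, long tail toward large values

print(skewness(symmetric), kurtosis(symmetric))
print(skewness(right_tailed), kurtosis(right_tailed))
```

Dividing by powers of the second moment makes both measures unit-free, so rescaling the data (for example, hours to minutes) leaves them unchanged.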

Page 48:

Skewness
• Measures the direction and degree of departure from symmetry
• If the distribution is perfectly symmetrical, the measure of skewness will be zero
  • e.g. the normal distribution and the uniform (rectangular) distribution
• If the distribution is asymmetrical (i.e. skewed), the tail of the distribution extends in the direction of the positive numbers if the measure of skewness is positive, and in the direction of the negative numbers if it is negative

Both distributions have the same expectation and variance. The one on the left is positively skewed. The one on the right is negatively skewed.

Page 49:

Kurtosis
• The extent of peakedness in the distribution
• Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution (kurtosis = 3, mesokurtic).
• Data with high kurtosis (above 3, i.e. positive excess kurtosis) tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails.
• Data with low kurtosis (below 3, i.e. negative excess kurtosis) tend to have a flat top near the mean. A uniform distribution would be the extreme case.
• Higher kurtosis means more of the variance is due to infrequent extreme deviations, as opposed to frequent modestly-sized deviations.
• If a random variable’s kurtosis is greater than 3, it is said to be leptokurtic. If its kurtosis is less than 3, it is said to be platykurtic.

The distribution on the right has higher kurtosis than the one on the left. It is more peaked at the center, and it has fatter tails.