Chapter 6 - Random Sampling and Data Description

Experience the joy of dealing with large quantities of data

Chapter 6A

This Week in Prob/Stat

+ Bonus Material

Today’s Discussion

Descriptive Statistics

Distributions
• Histogram
• Cumulative frequency distribution
• Frequency distribution (continuous data)

Measures of Central Tendency (location)
• Mean
• Median
• Mode

Measures of Variability (dispersion)
• Variance (standard deviation)
• Range
• Quartiles
• Coefficient of Variation

Measures of Skewness

Measures of Kurtosis

Histograms

Data gets placed into class intervals, cells, or bins (synonyms).

Continuous data: number of bins ≈ sqrt(n), where n is the number of observations, or use Sturges' rule.

Histogram shows the relative frequency of the sample observations in each class.

Histogram ~ probability density (or mass) function.

By summing counts in the succession of bins you can construct a cumulative frequency plot.

Cumulative frequency plot ~ empirical distribution function ~ cumulative distribution function.

A Discrete Example

Raw data: the number of accident claims received per day over the last 50 days by the Nofrills Insurance Co.

week   Mon   Tues   Wed   Thur   Fri
1      4     3      1     1      4
2      0     3      0     0      1
3      5     2      0     2      0
4      1     1      0     1      1
5      3     4      2     3      3
6      2     0      0     1      2
7      1     4      4     5      2
8      7     2      2     3      6
9      3     3      1     4      1
10     2     1      1     2      2

Bin   Frequency   Cumulative %
0     8           16.00%
1     13          42.00%
2     11          64.00%
3     8           80.00%
4     6           92.00%
5     2           96.00%
6     1           98.00%
7     1           100.00%
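The frequency and cumulative-percent columns can be reproduced from the raw claims data. The following is a minimal Python sketch, not the method used to build the slide; the variable names (`claims`, `freq`, `cum_pct`) are chosen here for illustration.

```python
from collections import Counter
from itertools import accumulate

# Daily claim counts, week by week (Mon..Fri), copied from the raw-data table.
claims = [
    4, 3, 1, 1, 4,
    0, 3, 0, 0, 1,
    5, 2, 0, 2, 0,
    1, 1, 0, 1, 1,
    3, 4, 2, 3, 3,
    2, 0, 0, 1, 2,
    1, 4, 4, 5, 2,
    7, 2, 2, 3, 6,
    3, 3, 1, 4, 1,
    2, 1, 1, 2, 2,
]

counts = Counter(claims)
bins = list(range(min(claims), max(claims) + 1))   # one bin per integer value
freq = [counts[b] for b in bins]                   # absolute frequency per bin
rel_freq = [f / len(claims) for f in freq]         # heights of the histogram bars
cum_pct = [100 * c / len(claims) for c in accumulate(freq)]  # running total in %

for b, f, r, c in zip(bins, freq, rel_freq, cum_pct):
    print(f"{b:>3} {f:>9} {r:>8.2f} {c:>9.2f}%")
```

The relative frequencies are the bar heights of the histogram, and the running totals give the empirical cumulative frequency distribution.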

[Figure: histogram of the claims data; relative frequency (0 to 0.3) versus number of claims (0 to 7)]

A Discrete Empirical Cumulative Frequency Distribution

Number of claims   Cumulative frequency
0 ≤ x < 1          16%
1 ≤ x < 2          42%
2 ≤ x < 3          64%
3 ≤ x < 4          80%
4 ≤ x < 5          92%
5 ≤ x < 6          96%
6 ≤ x < 7          98%
7 ≤ x < ∞          100%

A Discrete Empirical Cumulative Frequency Distribution Graph

[Figure: cumulative % (0% to 100%) versus number of claims (0 to 8)]

A Continuous Data Example

Raw Data: Time to repair or replace in hours a failed transformer by the Dayton Power and Light Company

Industry standard is 2.5 hours

2.2   5.0   5.0   4.3   4.5   3.3
1.7   1.8   5.6   2.4   3.3   2.0
2.4   3.2   2.5   3.6   3.9   1.6
2.5   2.2   4.5   2.5   2.7   1.6
2.9   1.9   3.7   1.9   3.0   2.9
4.4   4.0   4.3   2.7   3.9

Data collection: 35 repairs performed between 01/01/07 and 06/30/07

Sturges' rule for grouping data

\[ k = 1 + 3.3 \log_{10} n \]

where k = number of classes, n = sample size, and [x] = integer part of x.

For example (values shown as whole numbers):

n      k    sqrt(n)
35     6    6
50     7    7
500    10   22
5000   13   71
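The example values for Sturges' rule and the square-root rule can be checked with a short Python sketch. One assumption is made here: the tabulated class counts agree with rounding to the nearest whole number (for n = 50 the formula gives about 6.6, tabulated as 7), so the hypothetical helper `round_half_up` is used rather than truncation.

```python
import math

def sturges_classes(n: int) -> float:
    """Unrounded class count from Sturges' rule, k = 1 + 3.3 * log10(n)."""
    return 1 + 3.3 * math.log10(n)

def round_half_up(x: float) -> int:
    """Round to the nearest whole number (assumed rounding convention)."""
    return math.floor(x + 0.5)

for n in (35, 50, 500, 5000):
    k = round_half_up(sturges_classes(n))      # Sturges' rule
    root = round_half_up(math.sqrt(n))         # square-root rule
    print(f"n={n:>5}  Sturges k={k:>3}  sqrt(n)={root:>3}")
```

The two rules agree for small samples and diverge quickly as n grows, which is why Sturges' rule is often preferred for large data sets.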

A Histogram

Data was generated from a lognormal distribution.

Transformer repair times in hours

Bin          Relative frequency
x <= 1       0
1 < x <= 2   0.2
2 < x <= 3   0.34286
3 < x <= 4   0.22857
4 < x <= 5   0.17143
5 < x        0.05714

[Figure: histogram of relative frequency (0 to 0.4) for the six bins above]

Frequency Polygon

[Figure: frequency polygon; relative frequency (0 to 0.4) versus repair time in hours (0 to 8)]

Cumulative Frequency Distribution (ogive)

[Figure: ogive; cumulative frequency (0% to 100%) versus repair time in hours (0 to 7)]

Measures of Central Tendency – i.e. averages

Seeking the middle ground

Types of Data

nominal (also categorical or discrete), e.g. group employees by job type
• the only comparisons are equality and inequality
• no "less than" or "greater than" relations among the classifying names
• no operations such as addition or subtraction

ordinal, e.g. rank colleges based on surveys and interviews
• the numbers assigned to objects represent the rank order (1st, 2nd, 3rd, etc.) of the entities measured
• comparisons of greater and less can be made, in addition to equality and inequality

interval, e.g. temperature, IQ measurements
• have all the features of ordinal measurements
• equal differences between measurements represent equivalent intervals
• operations such as addition and subtraction are therefore meaningful

ratio, e.g. group travel times into intervals
• have all the features of interval measurements
• operations such as multiplication and division are therefore meaningful
• the zero value on a ratio scale is non-arbitrary

6-1 Numerical Summaries

Definition: Sample Mean
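For reference, the usual definition of the sample mean is written out below. The notation (observations x_1, ..., x_n and the symbol x-bar) is assumed here, since the slide's own equation is not reproduced in the text.

```latex
\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i
```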

Characteristics of the mean
• most widely known and used average
• an artificial concept, since it may not coincide with any actual value
• affected by every value of every item; therefore uses all the information available in the sample
• highly influenced by extreme values
• can be computed directly from the raw data, e.g. does not need to be sorted as does the median
• requires interval or ratio data
• lends itself better to algebraic analysis than other measures of central tendency
• has some desirable statistical properties
• answers the question, "if all the quantities had the same value, what would that value have to be in order to achieve the same total?"

Example 6-1

6-1 Numerical Summaries

Figure 6-1 The sample mean as a balance point for a system of weights.

Population Mean

For a finite population with N measurements \( x_1, \dots, x_N \), the mean is \( \mu = \frac{1}{N} \sum_{i=1}^{N} x_i \).

The sample mean is a reasonable estimate of the population mean.

Sample Median

Median is a measure of central tendency such that half of the values in a sample are below it and half are above it.
• If the number of observations is even, then average the two central values.
• The sample median is less influenced by 'outliers' than the sample mean:
  - not affected by extreme values
  - affected by the number but not the value of extremes
• Widely used in skewed distributions where the mean would be distorted by extreme values, e.g. economic data.
• Can be used where the data is ranked but not measured quantitatively.
• Unreliable if the data do not cluster at the center of the distribution.

Order Statistics

Define

X(1) = Min {X1, X2, …, Xn}

X(2) = 2nd smallest {X1, X2, …, Xn}

X(i) = ith smallest {X1, X2, …, Xn}

X(n) = Max {X1, X2, …, Xn}

Therefore X(1) ≤ X(2) ≤ X(3) ≤ … ≤ X(n)

\[ X_{\text{med}} = \begin{cases} X_{(k)} & \text{if } n = 2k - 1 \text{ is odd} \\ \dfrac{X_{(k)} + X_{(k+1)}}{2} & \text{if } n = 2k \text{ is even} \end{cases} \]

Median Repair Time

Industry standard is 2.5 hours

Raw data (hours):

2.2   5.0   5.0   4.3   4.5   3.3
1.7   1.8   5.6   2.4   3.3   2.0
2.4   3.2   2.5   3.6   3.9   1.6
2.5   2.2   4.5   2.5   2.7   1.6
2.9   1.9   3.7   1.9   3.0   2.9
4.4   4.0   4.3   2.7   3.9

Sorted data (observation number: value):

 1: 1.6    2: 1.6    3: 1.7    4: 1.8    5: 1.9    6: 1.9    7: 2.0
 8: 2.2    9: 2.2   10: 2.4   11: 2.4   12: 2.5   13: 2.5   14: 2.5
15: 2.7   16: 2.7   17: 2.9   18: 2.9   19: 3.0   20: 3.2   21: 3.3
22: 3.3   23: 3.6   24: 3.7   25: 3.9   26: 3.9   27: 4.0   28: 4.3
29: 4.3   30: 4.4   31: 4.5   32: 4.5   33: 5.0   34: 5.0   35: 5.6

Median of an even number of observations

Raw data:

observation   value
1             27.40
2             9.08
3             165.29
4             214.85
5             98.70
6             76.07
7             9.87
8             77.96
9             15.01
10            49.86
11            1.18
12            188.07
13            317.26
14            59.79
15            384.63
16            48.74

Sorted:

observation   value
1             1.18
2             9.08
3             9.87
4             15.01
5             27.40
6             48.74
7             49.86
8             59.79
9             76.07
10            77.96
11            98.70
12            165.29
13            188.07
14            214.85
15            317.26
16            384.63

Average the middle two observations:

\[ \frac{59.79 + 76.07}{2} = 67.93 \]
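Both median cases (odd and even n) can be sketched in Python directly from the order-statistic definition. The function name `sample_median` is chosen for illustration; the two data sets are the repair times (n = 35) and the 16-observation example.

```python
def sample_median(data):
    """Median from order statistics: X(k) if n = 2k - 1, else (X(k) + X(k+1)) / 2."""
    ordered = sorted(data)              # X(1) <= X(2) <= ... <= X(n)
    n = len(ordered)
    if n % 2 == 1:
        k = (n + 1) // 2                # n = 2k - 1
        return ordered[k - 1]           # lists are 0-based, order statistics 1-based
    k = n // 2                          # n = 2k
    return (ordered[k - 1] + ordered[k]) / 2

repair_hours = [
    2.2, 5.0, 5.0, 4.3, 4.5, 3.3, 1.7, 1.8, 5.6, 2.4, 3.3, 2.0,
    2.4, 3.2, 2.5, 3.6, 3.9, 1.6, 2.5, 2.2, 4.5, 2.5, 2.7, 1.6,
    2.9, 1.9, 3.7, 1.9, 3.0, 2.9, 4.4, 4.0, 4.3, 2.7, 3.9,
]
sixteen_values = [
    27.40, 9.08, 165.29, 214.85, 98.70, 76.07, 9.87, 77.96,
    15.01, 49.86, 1.18, 188.07, 317.26, 59.79, 384.63, 48.74,
]

print(sample_median(repair_hours))              # 18th ordered observation
print(round(sample_median(sixteen_values), 2))  # mean of 8th and 9th ordered observations
```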

Good use of the median

Constructively Yours is a small privately owned and operated business that specializes in small residential construction and remodeling projects. In addition to the owner-president, the company employs 8 other workers.

position        annual salary
receptionist    $22,050
worker 1        $28,175
worker 2        $29,500
worker 3        $31,450
worker 4        $32,800
salesperson 1   $34,150
salesperson 2   $38,000
job foreman     $43,200
President       $230,000

mean            $54,369
median          $32,800

Warning: The above salary information is confidential and proprietary and should not be disclosed beyond its use in the classroom.

Is the Median Representative?

The Makit Company is a small job shop that primarily employs machine operators and engineers.

position      annual salary
clerk         $18,400
Machinist 1   $28,175
Machinist 2   $29,500
Machinist 3   $31,450
Machinist 4   $32,800
Machinist 5   $34,150
Machinist 6   $34,200
Machinist 7   $35,500
Machinist 8   $36,100
Engineer 1    $68,500
Engineer 2    $78,230
Engineer 3    $85,400
Engineer 4    $90,100

mean          $46,347
median        $34,200

Warning: The above salary information is confidential and proprietary and should not be disclosed beyond its use in the classroom.

Mode

The most frequent value assumed by a random variable or occurring in a sample.

The term is applied both to probability distributions and to collections of data.

The mode is not necessarily unique, since the same maximum frequency may be attained at different values. The worst case is a uniform distribution, in which all values are equally likely.

For example, the mode of the sample [1, 3, 6, 6, 6, 6, 7, 7, 12, 12, 17] is 6.

The mode of a discrete probability distribution is the value x at which its probability mass function takes its maximum value, that is, the value that is most likely to be sampled.

The mode of a continuous probability distribution is the value x at which its probability density function attains its maximum value.

• Not affected by extreme values
• Can be computed from nominal data

Example – Sample Mode

Raw data: the number of accident claims received per day over the last 50 days by the Nofrills Insurance Co.

week   Mon   Tues   Wed   Thur   Fri
1      4     3      1     1      4
2      0     3      0     0      1
3      5     2      0     2      0
4      1     1      0     1      1
5      3     4      2     3      3
6      2     0      0     1      2
7      1     4      4     5      2
8      7     2      2     3      6
9      3     3      1     4      1
10     2     1      1     2      2

Bin   Frequency
0     8
1     13
2     11
3     8
4     6
5     2
6     1
7     1

Mode = 1
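A short Python sketch of the sample mode, returning every value that attains the maximum frequency so that non-unique modes are visible. The function name `sample_modes` is illustrative; the claims sample is rebuilt from the frequency table rather than retyped.

```python
from collections import Counter

def sample_modes(values):
    """Return all values that occur with the highest frequency, in sorted order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

# Frequency table for the claims data (bin -> frequency), expanded to 50 observations.
claims_table = {0: 8, 1: 13, 2: 11, 3: 8, 4: 6, 5: 2, 6: 1, 7: 1}
claims = [value for value, count in claims_table.items() for _ in range(count)]

print(sample_modes(claims))                                   # claims data
print(sample_modes([1, 3, 6, 6, 6, 6, 7, 7, 12, 12, 17]))     # sample from the Mode slide
print(sample_modes(["a", "a", "b", "b", "c"]))                # nominal data, two modes
```

Because only counting and equality are needed, the same function works for nominal data.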

geometric mean

\[ \sqrt[n]{x_1 x_2 \cdots x_n} \]

• The geometric mean is smaller than or equal to the arithmetic mean; the two means are equal if and only if all members of the data set are equal.
• It allows the definition of the arithmetic-geometric mean, a mixture of the two which always lies in between.
• Used to determine "average factors": if a stock rose 10% in the first year, 20% in the second year, and fell 15% in the third year, then compute the geometric mean of the factors 1.10, 1.20 and 0.85 as (1.10 × 1.20 × 0.85)^(1/3) = 1.0391... and conclude that the stock rose 3.91 percent per year, on average.
• Answers the question, "if all the quantities had the same value, what would that value have to be in order to achieve the same product?"
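The stock example can be checked with a minimal Python sketch of the geometric mean. The function name `geometric_mean` is defined here for illustration and assumes all inputs are positive.

```python
import math

def geometric_mean(factors):
    """n-th root of the product of n positive numbers."""
    return math.prod(factors) ** (1 / len(factors))

growth_factors = [1.10, 1.20, 0.85]     # +10%, +20%, -15%
average_factor = geometric_mean(growth_factors)

print(f"average factor: {average_factor:.4f}")
print(f"average yearly change: {100 * (average_factor - 1):.2f}%")

# Applying the average factor three times reproduces the same overall product.
print(math.isclose(average_factor ** 3, math.prod(growth_factors)))
```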

harmonic mean

\[ \frac{n}{\dfrac{1}{x_1} + \dfrac{1}{x_2} + \cdots + \dfrac{1}{x_n}} \]

The harmonic mean is appropriate for situations when the average of rates is desired.
• If for half the distance of a trip you travel at 40 mph and for the other half of the distance you travel at 60 mph, then your average speed for the trip is given by the harmonic mean of 40 and 60, which is 48; that is, the total amount of time for the trip is the same as if you traveled the entire trip at 48 mph.
• If you had traveled for half the time at one speed and the other half at another, the arithmetic mean, in this case 50 mph, would provide the correct average.

In finance, the harmonic mean is used to calculate the average cost of shares purchased over a period of time.
• An investor purchases $1000 worth of stock every month for three months. If the spot prices at execution time are $8, $9, and $10, then the average price the investor paid is $8.926 per share.
• However, if the investor purchased 1000 shares per month, the arithmetic mean would be used.
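Both harmonic-mean examples can be verified with a few lines of Python. This is a sketch with an illustrative function name (`harmonic_mean`); the share-price case is also cross-checked by dividing total dollars spent by total shares bought.

```python
def harmonic_mean(values):
    """n divided by the sum of reciprocals; assumes all values are positive."""
    return len(values) / sum(1 / v for v in values)

# Equal distances at 40 mph and 60 mph.
print(round(harmonic_mean([40, 60]), 6))

# $1000 invested each month at prices of $8, $9 and $10 per share.
prices = [8, 9, 10]
monthly_budget = 1000
shares_bought = sum(monthly_budget / p for p in prices)
average_price = monthly_budget * len(prices) / shares_bought

print(round(harmonic_mean(prices), 3), round(average_price, 3))
```

The cross-check shows why the harmonic mean is the right average here: fixing the dollar amount per purchase makes the number of shares vary inversely with price.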

midrange and beyond

\[ \frac{x_{\min} + x_{\max}}{2} \]

The midrange is highly sensitive to outliers and ignores all but two data points; therefore it is rarely used in statistical analysis.

While the mean of a set of values minimizes the sum of squares of deviations and the median minimizes the average absolute deviation, the midrange minimizes the maximum deviation.

For a given data set of positive values that are not all equal, the harmonic mean is always the least of the three means (harmonic, geometric, arithmetic), the arithmetic mean is always the greatest, and the geometric mean is always in between.
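As a quick illustration, the ordering of the three means and the midrange can be computed for the two trip speeds used in the harmonic mean example. This is a sketch only; the function names are defined here and positive inputs are assumed.

```python
import math

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    return math.prod(values) ** (1 / len(values))

def harmonic_mean(values):
    return len(values) / sum(1 / v for v in values)

def midrange(values):
    """Halfway between the smallest and largest observation."""
    return (min(values) + max(values)) / 2

speeds = [40, 60]
hm, gm, am = harmonic_mean(speeds), geometric_mean(speeds), arithmetic_mean(speeds)

print(f"harmonic {hm:.2f} <= geometric {gm:.2f} <= arithmetic {am:.2f}")
print(f"midrange {midrange(speeds):.2f}")
```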

Measures of Dispersion

The search for variability

Definition: Sample Variance

How Does the Sample Variance Measure Variability?

Figure 6-2 How the sample variance measures variability through the deviations \( x_i - \bar{x} \).

Example 6-2

Table 6-1

Computational Form of s^2
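For reference, the usual definition of the sample variance and its computational form are written out below. The notation is assumed (n observations x_1, ..., x_n with sample mean x-bar), since the slide equations are not reproduced in the text. The second form follows by expanding the square and using the fact that the sum of the x_i equals n times x-bar.

```latex
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
    = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}
    = \frac{\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}
```

The sample standard deviation s is the positive square root of s^2.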

Population Variance

When the population is finite and consists of N values, we may define the population variance as \( \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 \).

The sample variance is a reasonable estimate of the population variance.

Homing in on the Sample Range

Example measures (transformer repair times in hours)

min       1.6
max       5.6
mean      3.1
median    2.9
std dev   1.10
range     4.0   (= 5.6 – 1.6)

2.2   5.0   5.0   4.3   4.5   3.3
1.7   1.8   5.6   2.4   3.3   2.0
2.4   3.2   2.5   3.6   3.9   1.6
2.5   2.2   4.5   2.5   2.7   1.6
2.9   1.9   3.7   1.9   3.0   2.9
4.4   4.0   4.3   2.7   3.9
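The example measures can be reproduced with Python's standard `statistics` module. This is a minimal sketch; `statistics.stdev` uses the n - 1 divisor, which is assumed to be the convention behind the tabulated standard deviation.

```python
import statistics

repair_hours = [
    2.2, 5.0, 5.0, 4.3, 4.5, 3.3, 1.7, 1.8, 5.6, 2.4, 3.3, 2.0,
    2.4, 3.2, 2.5, 3.6, 3.9, 1.6, 2.5, 2.2, 4.5, 2.5, 2.7, 1.6,
    2.9, 1.9, 3.7, 1.9, 3.0, 2.9, 4.4, 4.0, 4.3, 2.7, 3.9,
]

summary = {
    "min": min(repair_hours),
    "max": max(repair_hours),
    "mean": statistics.mean(repair_hours),
    "median": statistics.median(repair_hours),
    "std dev": statistics.stdev(repair_hours),       # sample std dev, divisor n - 1
    "range": max(repair_hours) - min(repair_hours),
}

for name, value in summary.items():
    print(f"{name:<8} {value:.2f}")
```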

Quartiles

A quartile is any of the three values which divide the sorted data set into four equal parts, so that each part represents 1/4th of the sampled population.

Thus:

first quartile (designated Q1) = lower quartile = cuts off lowest 25% of data = 25th percentile

second quartile (designated Q2) = median = cuts data set in half = 50th percentile

third quartile (designated Q3) = upper quartile = cuts off highest 25% of data, or lowest 75% = 75th percentile

The difference between the upper and lower quartiles is called the interquartile range.

When an ordered set of data is divided into four equal parts, the division points are called quartiles.

The first or lower quartile, q1 , is a value that has approximately one-fourth (25%) of the observations below it and approximately 75% of the observations above.

The second quartile, q2, has approximately one-half (50%) of the observations below its value. The second quartile is exactly equal to the median.

The third or upper quartile, q3, has approximately three-fourths (75%) of the observations below its value. As in the case of the median, the quartiles may not be unique.

More Data Features

The compressive strength data in Table 6-2 contains n = 80 observations. Minitab software calculates the first and third quartiles as the (n + 1)/4 and 3(n + 1)/4 ordered observations and interpolates as needed.

For example, (80 + 1)/4 = 20.25 and 3(80 + 1)/4 = 60.75.

Therefore, Minitab interpolates between the 20th and 21st ordered observation to obtain q1 = 143.50 and between the 60th and 61st observation to obtain q3 = 181.00.
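The (n + 1)/4 interpolation rule described above can be sketched in Python. This is an illustration of the rule as stated, not Minitab's code; the function names are hypothetical, and because Table 6-2 is not reproduced here, a stand-in data set of 80 values is used to show the positions 20.25 and 60.75.

```python
def interpolated_order_statistic(ordered, position):
    """Value at a 1-based, possibly fractional, position in sorted data."""
    n = len(ordered)
    if position <= 1:
        return ordered[0]
    if position >= n:
        return ordered[-1]
    lower = int(position)               # ordered observation just below the position
    fraction = position - lower         # how far to move toward the next observation
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)

def quartiles(data):
    """First and third quartiles at positions (n + 1)/4 and 3(n + 1)/4."""
    ordered = sorted(data)
    n = len(ordered)
    q1 = interpolated_order_statistic(ordered, (n + 1) / 4)
    q3 = interpolated_order_statistic(ordered, 3 * (n + 1) / 4)
    return q1, q3

stand_in = list(range(1, 81))           # hypothetical data: 1, 2, ..., 80
print(quartiles(stand_in))              # values equal the positions themselves
print(quartiles([10, 20, 30, 40]))      # positions 1.25 and 3.75
```

Other software may use different interpolation conventions, so quartiles from different packages need not agree exactly.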

6-2 Example of Data Features

• The interquartile range is the difference between the upper and lower quartiles, and it is sometimes used as a measure of variability.

• In general, the 100kth percentile is a data value such that approximately 100k% of the observations are at or below this value and approximately 100(1 - k)% of them are above it.

Data Features

Examples in Variability

Professor Higgins has experienced considerable variability in his driving time from home to the University. Help the good professor find a measure of his variability.

driving times in minutes (as recorded):

44.0   41.7   34.6   44.8   21.8   26.0   28.9   27.9
38.9   37.0   23.3   32.5   45.9   38.9   42.4   31.4

sorted:

observation   value   Quartiles
1             21.8
2             23.3
3             26.0
4             27.9    Q1
5             28.9
6             31.4
7             32.5
8             34.6    Q2
9             37.0
10            38.9
11            38.9
12            41.7    Q3
13            42.4
14            44.0
15            44.8
16            45.9

variance              62.1
std dev               7.88
range                 24.1
interquartile range   13.8
true median           35.8
mean                  35.0

Coefficient of Variation

\[ CV = \frac{100\, s}{\bar{X}} \]

unit production times in minutes

Data source          mean    std dev   CV
Gear assembly
  Jeff               1.65    0.088     5.33
  Jerry              1.73    0.075     4.34
Housing assembly
  Judy               4.23    1.02      24.11
  Jared              5.67    0.99      17.46
  Julie              4.78    0.85      17.78
Final assembly
  Jane               34.56   2.45      7.09
  Jim                37.58   2.05      5.46
  John               32.1    2.11      6.57

Where should the Vary A. Schun Company direct its efforts to reduce the variability in its production lead-time?
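The CV column can be recomputed from the tabulated means and standard deviations. This is a small Python sketch; the dictionary layout and the function name `coefficient_of_variation` are chosen here for illustration.

```python
def coefficient_of_variation(mean, std_dev):
    """CV = 100 * s / x-bar, the standard deviation as a percentage of the mean."""
    return 100 * std_dev / mean

# (mean, std dev) of unit production times in minutes, by station and operator.
production = {
    "Gear assembly": {"Jeff": (1.65, 0.088), "Jerry": (1.73, 0.075)},
    "Housing assembly": {"Judy": (4.23, 1.02), "Jared": (5.67, 0.99), "Julie": (4.78, 0.85)},
    "Final assembly": {"Jane": (34.56, 2.45), "Jim": (37.58, 2.05), "John": (32.1, 2.11)},
}

cv = {
    name: coefficient_of_variation(mean, std_dev)
    for operators in production.values()
    for name, (mean, std_dev) in operators.items()
}

for name, value in sorted(cv.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<6} {value:6.2f}")
```

Sorting by CV puts the three housing-assembly operators at the top, even though final assembly has the largest standard deviations in absolute terms.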

Real Bonus Material

Descriptive Statistics for the Overachieving Student

Skewness and Kurtosis

\[ M_j = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^j}{n}, \qquad j = 1, 2, 3, 4 \]

\[ \hat{\beta}_1 = \frac{M_3}{M_2^{3/2}}, \qquad \hat{\beta}_2 = \frac{M_4}{M_2^{2}} \]

The M_j are moments about the mean. For example, variance is the second.

β1 is skewness: third moment about the mean

β2 is kurtosis: the fourth moment about the mean

Note how a power of the sample variance is used to 'standardize' the β1 and β2 estimates.
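The moment-based estimates can be sketched in Python as follows, assuming the divisor-n moments M_j about the mean with skewness M3 / M2^(3/2) and kurtosis M4 / M2^2. The function names and the two small data sets are illustrative only.

```python
def central_moment(data, j):
    """j-th moment about the mean, with divisor n."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** j for x in data) / n

def skewness(data):
    """Third central moment standardized by the 3/2 power of the second."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def kurtosis(data):
    """Fourth central moment standardized by the square of the second."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 2, 10]

print(skewness(symmetric), kurtosis(symmetric))
print(skewness(right_tailed), kurtosis(right_tailed))
```

With this convention a normal distribution has kurtosis 3; statistical packages often apply small-sample corrections or report excess kurtosis (kurtosis minus 3), so their output may differ.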

Skewness
• Measures the direction and degree of departure from symmetry.
• If the distribution is perfectly symmetrical, the measure of skewness will be zero, e.g. the normal distribution and the uniform (rectangular) distribution.
• If the distribution is asymmetrical (i.e. skewed), the tail of the distribution extends in the direction of the positive numbers if the measure of skewness is positive, and in the direction of the negative numbers if it is negative.

Both distributions have the same expectation and variance. The one on the left is positively skewed. The one on the right is negatively skewed.

Kurtosis
• The extent of peakedness in the distribution.
• Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution (kurtosis = 3, mesokurtic).
• Data with high kurtosis (positive relative to the normal value of 3) tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails.
• Data with low kurtosis (negative relative to the normal value of 3) tend to have a flat top near the mean. A uniform distribution would be the extreme case.
• Higher kurtosis means more of the variance is due to infrequent extreme deviations, as opposed to frequent modestly-sized deviations.
• If a random variable's kurtosis is greater than 3, it is said to be leptokurtic. If its kurtosis is less than 3, it is said to be platykurtic.

The distribution on the right has higher kurtosis than the one on the left. It is more peaked at the center, and it has fatter tails.
