©2003 thomson/south-western 1 chapter 3 – data summary using descriptive measures slides prepared...

Chapter 3 – Data Summary Using Descriptive Measures

Introduction to Business Statistics, 6e
Kvanli, Pavur, Keeling

Slides prepared by Jeff Heyl, Lincoln University
©2003 South-Western/Thomson Learning™


Types of Descriptive Measures

Measures of central tendency
Measures of variation
Measures of position
Measures of shape


Measures of Central Tendency

The Mean
The Median
The Midrange
The Mode


The Mean

The mean is simply the average of the data.

Each value in the sample is represented by x.

Thus, to get the mean, simply add all the values in the sample and divide by the number of values in the sample (n).

A Sample Mean

$\bar{x} = \frac{\sum x}{n}$


The Population Mean

Each value in the population is represented by x.

Thus, to get the population mean ($\mu$), simply add all the values in the population and divide by the number of values in the population (N).

$\mu = \frac{\sum x}{N}$


The Accident Data Set

$\bar{x} = \frac{6 + 9 + 7 + 23 + 5}{5} = \frac{50}{5} = 10.0$

If we remove the last value from the data set, then

$\bar{x} = \frac{6 + 9 + 7 + 23}{4} = \frac{45}{4} = 11.25$
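A minimal Python sketch of the sample mean for the accident data. The variable names are illustrative and not part of the slides.

```python
from statistics import mean

# Accident data in the order shown on the slide.
accidents = [6, 9, 7, 23, 5]

# Sample mean: add all the values and divide by the number of values (n).
x_bar = sum(accidents) / len(accidents)

# The standard-library helper gives the same result.
x_bar_lib = mean(accidents)

# Removing the last value (5) changes the mean.
x_bar_without_last = mean(accidents[:-1])

print(x_bar, x_bar_lib, x_bar_without_last)
```

Dropping a single small value moves the mean from 10.0 to 11.25, which illustrates how sensitive the mean is to individual observations in a small sample.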


The Median

The median (Md) of a set of data is the value in the center of the data values when they are arranged from lowest to highest.


Accident Data

Ordered array: 5, 6, 7, 9, 23

The value that has an equal number of items to the right and left is the median.

If n is an odd number, Md is the center data value of the ordered data set:

Md = the $\frac{n + 1}{2}$-th ordered value

Md = 7


Even Numbered Data

Ordered array: 3, 8, 12, 14

The value that has an equal number of items to the right and left is the median.

If n is an even number, Md is the average of the two center values of the ordered data set.

Md = (8 + 12)/2 = 10
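The odd and even cases can be sketched in Python as follows. The function name `median` is an illustrative choice, and the sketch assumes a non-empty list of numbers.

```python
def median(values):
    """Return the median using the odd/even rule described above."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the (n + 1)/2-th ordered value, which is index mid (zero-based).
        return ordered[mid]
    # Even n: average of the two center values.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([6, 9, 7, 23, 5]))  # accident data, n = 5
print(median([3, 8, 12, 14]))    # even-numbered data, n = 4
```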


The Midrange

The midrange (Mr) provides an easy-to-grasp measure of central tendency.

$Mr = \frac{L + H}{2}$

where L is the lowest data value and H is the highest data value.


Accident Data

Ordered array: 5, 6, 7, 9, 23

$Mr = \frac{5 + 23}{2} = 14$

Note that the midrange is severely affected by outliers.

Compare Mr = 14 to $\bar{x} = 10$ and Md = 7.
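A short Python sketch of the midrange, showing how much a single outlier moves it. The helper name `midrange` is an assumption made for illustration.

```python
def midrange(values):
    """Midrange: average of the lowest (L) and highest (H) data values."""
    return (min(values) + max(values)) / 2


accidents = [5, 6, 7, 9, 23]

with_outlier = midrange(accidents)          # uses L = 5 and H = 23
without_outlier = midrange(accidents[:-1])  # drops the value 23

print(with_outlier, without_outlier)
```

Because only the two extreme values enter the formula, removing the value 23 changes the result from 14 to 7.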


The Mode

The mode (Mo) of a data set is the value that occurs more than once and the most often.

The mode is not always a measure of central tendency; this value need not occur in the center of the data.
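One way to sketch the mode in Python. Returning every tied value, and an empty list when no value repeats, are assumptions of this sketch, since the slides do not say how ties are handled. The second data set is hypothetical.

```python
from collections import Counter


def modes(values):
    """Return the value(s) occurring most often, or [] if nothing repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top < 2:
        # No value occurs more than once, so there is no mode.
        return []
    return sorted(v for v, c in counts.items() if c == top)


print(modes([5, 6, 7, 9, 23]))    # accident data: no repeated value
print(modes([3, 5, 5, 5, 8, 8]))  # hypothetical data: 5 occurs three times
```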


Bellaire College Example

Figure 3.2


Level of Measurement and Measure of Central Tendency

Summary of levels of measurement and appropriate measure of central tendency. A "Y" indicates this measure can be used with the corresponding level of measurement; a "-" indicates it cannot.

Table 3.1

                               Level of Measurement
Measure of Central Tendency    Nominal    Ordinal    Interval    Ratio
Mean                           -          -          Y           Y
Median                         -          Y          Y           Y
Midrange                       -          -          Y           Y
Mode                           Y          Y          Y           Y


Measures of Variation

Homogeneity refers to the degree of similarity within a set of data.

The more homogeneous a set of data is, the better the mean will represent a typical value.

Variation is the tendency of data values to scatter about the mean, $\bar{x}$.


Common Measures of Variation

Range
Variance
Standard Deviation
Coefficient of Variation


The Range

For the Accident data:

Range = H - L = 23 - 5 = 18

The range is a rather crude measure, but it is easy to calculate and contains valuable information in some situations.


The Variance and Standard Deviation

Both measures describe the variation of the values about the mean.

Data Value ($x$)    $x - \bar{x}$    $(x - \bar{x})^2$
5                   -5               25
6                   -4               16
7                   -3               9
9                   -1               1
23                  13               169

$\sum (x - \bar{x}) = 0$        $\sum (x - \bar{x})^2 = 220$


Sample Variance

$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

Using the accident data:

$s^2 = \frac{220}{5 - 1} = \frac{220}{4} = 55.0$


Sample Standard Deviation

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

Using the accident data:

$s = \sqrt{55.0} = 7.416$
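The deviation table and the two formulas can be reproduced with a short Python sketch. The variable names are illustrative; `statistics.variance` and `statistics.stdev` are the standard-library sample (n - 1) versions and are used only as a cross-check.

```python
from math import sqrt
from statistics import stdev, variance

accidents = [5, 6, 7, 9, 23]
n = len(accidents)
x_bar = sum(accidents) / n

# Deviations from the mean and their squares, as in the table.
deviations = [x - x_bar for x in accidents]
sum_sq = sum(d ** 2 for d in deviations)

# Sample variance divides by n - 1; the standard deviation is its square root.
s2 = sum_sq / (n - 1)
s = sqrt(s2)

print(sum(deviations), sum_sq, s2, round(s, 3))
print(variance(accidents), round(stdev(accidents), 3))
```

The deviations always sum to zero, which is why they are squared before being averaged.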


Population Variance and Standard Deviation

$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$

$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$


The Coefficient of Variation

The coefficient of variation (CV) is used to compare the variation of two or more data sets where the values of the data differ greatly.

$CV = \frac{s}{\bar{x}} \cdot 100$
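A small Python sketch of the CV, assuming the sample standard deviation is used. The second data set is hypothetical: it is the accident data multiplied by 1,000, chosen to show that the CV does not depend on the scale of the values.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV = (s / x-bar) * 100, using the sample standard deviation."""
    return stdev(values) / mean(values) * 100


accidents = [5, 6, 7, 9, 23]
scaled = [x * 1000 for x in accidents]  # hypothetical rescaled copy

cv_accidents = coefficient_of_variation(accidents)
cv_scaled = coefficient_of_variation(scaled)

print(round(cv_accidents, 2), round(cv_scaled, 2))
```

The two standard deviations differ by a factor of 1,000, yet the CVs agree, which is what makes the measure useful for comparing data sets of very different magnitude.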


Machined Parts Example



Measures of Position

Percentile (Quartile)
    Most common measure of position
    Quartiles are percentiles with the data divided into quarters

Z-Score
    The relative position of a data value expressed in terms of the number of standard deviations above or below the mean


Percentile Example

The 35th percentile ($P_{35}$) is that value such that at most 35% of the data values are less than $P_{35}$ and at most 65% of the data values are greater than $P_{35}$.


Aptitude Test Scores

22    44    56    68    78
25    44    57    68    78
28    46    59    69    80
31    48    60    71    82
34    49    61    72    83
35    51    63    72    85
39    53    63    74    88
39    53    63    75    90
40    55    65    75    92
42    55    66    76    96

Table 3.2: Ordered array of aptitude test scores for 50 applicants ($\bar{x} = 60.36$, $s = 18.61$). The array is ordered down each column, from left to right.


Percentile: Texon Industries Data

Number of data values, n = 50
Percentile, P = 35

$n \cdot \frac{P}{100} = 50 \cdot 0.35 = 17.5$

17.5 represents the position of the 35th percentile.


Percentile Location Rules

Rule 1: If $n \cdot P/100$ is not a counting number, round it up, and the Pth percentile will be the value in this position of the ordered data.

Rule 2: If $n \cdot P/100$ is a counting number, the Pth percentile is the average of the number in this location (of the ordered data) and the number in the next larger location.
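The two location rules can be sketched in Python as below. The function name `percentile` and the restriction to 0 < P < 100 are assumptions of this sketch; the data are the 50 ordered aptitude scores from Table 3.2. Note that statistical libraries often use different interpolation conventions, so their results may not match these rules.

```python
import math


def percentile(ordered, p):
    """P-th percentile of an already ordered list, following Rules 1 and 2."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    position = len(ordered) * p / 100
    if position != int(position):
        # Rule 1: not a counting number, so round up and take that value.
        return ordered[math.ceil(position) - 1]
    # Rule 2: a counting number, so average this value and the next one.
    k = int(position)
    return (ordered[k - 1] + ordered[k]) / 2


scores = [
    22, 25, 28, 31, 34, 35, 39, 39, 40, 42,
    44, 44, 46, 48, 49, 51, 53, 53, 55, 55,
    56, 57, 59, 60, 61, 63, 63, 63, 65, 66,
    68, 68, 69, 71, 72, 72, 74, 75, 75, 76,
    78, 78, 80, 82, 83, 85, 88, 90, 92, 96,
]

p35 = percentile(scores, 35)  # position 17.5, rounded up to the 18th value
p50 = percentile(scores, 50)  # position 25, average of 25th and 26th values
print(p35, p50)
```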


Aptitude Scores Example

Ms. Jensen received a score of 83 on the aptitude test. What is her percentile value?

83 is the 45th ordered value out of 50.
A guess of the percentile would be:

$P = \frac{45}{50} \cdot 100 = 90$

Examining the surrounding values clarifies the true percentile:

P     $(n \cdot P)/100$       Pth Percentile
88    50 · 0.88 = 44      (82 + 83)/2 = 82.5
89    50 · 0.89 = 44.5    45th value = 83
90    50 · 0.90 = 45      (83 + 85)/2 = 84

Example 3.5


Quartiles

Quartiles are merely particular percentiles that divide the data into quarters, namely:

$Q_1$ = 1st quartile = 25th percentile ($P_{25}$)
$Q_2$ = 2nd quartile = 50th percentile = median ($P_{50}$)
$Q_3$ = 3rd quartile = 75th percentile ($P_{75}$)


Quartile Example

Using the applicant data, the first quartile is:

$n \cdot \frac{P}{100} = (50)(0.25) = 12.5$

Rounded up, $Q_1$ = 13th ordered value = 46

Similarly, the third quartile is:

$n \cdot \frac{P}{100} = (50)(0.75) = 37.5$, rounded up to 38, and $Q_3$ = 38th ordered value = 75


Interquartile Range

The interquartile range (IQR) is the distance spanned by essentially the middle 50% of the data set.

IQR = $Q_3 - Q_1$

Using the applicant data, the IQR is:

IQR = 75 - 46 = 29


Z-Scores

A Z-score determines the relative position of any particular data value x and is based on the mean and standard deviation of the data set.

The Z-score expresses the number of standard deviations the value x is from the mean.

A negative Z-score implies that x is to the left of the mean, and a positive Z-score implies that x is to the right of the mean.


Z-Score Equation

$z = \frac{x - \bar{x}}{s}$

For a score of 83 from the aptitude data set,

$z = \frac{83 - 60.36}{18.61} = 1.22$

For a score of 35 from the aptitude data set,

$z = \frac{35 - 60.36}{18.61} = -1.36$
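A minimal Python sketch of the z-score calculation, using the summary statistics reported with Table 3.2 (mean 60.36, standard deviation 18.61). The function and constant names are illustrative.

```python
X_BAR = 60.36  # sample mean reported with Table 3.2
S = 18.61      # sample standard deviation reported with Table 3.2


def z_score(x, mean=X_BAR, sd=S):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


z_83 = z_score(83)
z_35 = z_score(35)
print(round(z_83, 2), round(z_35, 2))
```

A score of 83 lies a little more than one standard deviation above the mean, while a score of 35 lies well over one standard deviation below it.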


Standardizing Sample Data

The process of subtracting the mean and dividing by the standard deviation is referred to as standardizing the sample data.

The corresponding z-score is the standardized score.


Measures of Shape

Skewness
    Skewness measures the tendency of a distribution to stretch out in a particular direction

Kurtosis
    Kurtosis measures the peakedness of the distribution


Skewness

In a symmetrical distribution the mean, median, and mode would all be the same value and Sk = 0

A positive Sk number implies a shape which is skewed right and the mode < median < mean

In a data set with a negative Sk value the mean < median < mode


Skewness Calculation

Pearsonian coefficient of skewness

$Sk = \frac{3(\bar{x} - Md)}{s}$

Values of Sk will always fall between -3 and 3
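As a hedged illustration, the Pearsonian coefficient can be computed in Python from three summary statistics. The inputs below are the aptitude-score values given elsewhere in these slides (mean 60.36, median 62, standard deviation 18.61); the function name is an assumption.

```python
def pearson_skewness(mean, median, sd):
    """Pearsonian coefficient of skewness: Sk = 3 * (mean - median) / sd."""
    return 3 * (mean - median) / sd


sk = pearson_skewness(mean=60.36, median=62, sd=18.61)
print(round(sk, 2))
```

The result is slightly negative because the mean lies just below the median, indicating a mild left skew.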


Histogram of Symmetric Data

Figure 3.7: histogram of symmetric data (vertical axis: Frequency), with $\bar{x}$ = Md = Mo


Histogram with Right (Positive) Skew

Figure 3.8: histogram with Sk > 0 (vertical axis: Relative Frequency), marking the Mode (Mo), Median (Md), and Mean ($\bar{x}$)


Histogram with Left (Negative) Skew

Figure 3.9: histogram with Sk < 0 (vertical axis: Relative Frequency), marking the Mode (Mo), Median (Md), and Mean ($\bar{x}$)


Kurtosis

Kurtosis is a measure of the peakedness of a distribution

Large values occur when there is a high frequency of data near the mean and in the tails

The calculation is cumbersome and the measure is used infrequently


Chebyshev’s Inequality

1. At least 75% of the data values are between $\bar{x} - 2s$ and $\bar{x} + 2s$, or
   at least 75% of the data values have a z-score value between -2 and 2

2. At least 89% of the data values are between $\bar{x} - 3s$ and $\bar{x} + 3s$, or
   at least 89% of the data values have a z-score value between -3 and 3

3. In general, at least $(1 - 1/k^2) \times 100\%$ of the data values lie between $\bar{x} - ks$ and $\bar{x} + ks$ for any k > 1
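The general bound can be evaluated with a short Python sketch. The function name is illustrative, and the value k = 1.5 is a hypothetical extra case not taken from the slides.

```python
def chebyshev_lower_bound(k):
    """Minimum percentage of data values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound applies only for k > 1")
    return (1 - 1 / k ** 2) * 100


for k in (2, 3, 1.5):
    print(k, round(chebyshev_lower_bound(k), 1))
```

For k = 2 and k = 3 this gives 75% and about 89%, matching the first two statements. The bound holds for any data set, whatever its shape.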


Empirical Rule

Under the assumption of a bell-shaped population:

1. Approximately 68% of the data values lie between $\bar{x} - s$ and $\bar{x} + s$ (have z-scores between -1 and 1)

2. Approximately 95% of the data values lie between $\bar{x} - 2s$ and $\bar{x} + 2s$ (have z-scores between -2 and 2)

3. Approximately 99.7% of the data values lie between $\bar{x} - 3s$ and $\bar{x} + 3s$ (have z-scores between -3 and 3)


A Bell-Shaped (Normal) Population



Chebyshev’s Versus Empirical

Table 3.3

Between                               Actual Percentage      Chebyshev’s Inequality Percentage    Empirical Rule Percentage
$\bar{x} - s$ and $\bar{x} + s$       66% (33 out of 50)     n/a                                  ≈ 68%
$\bar{x} - 2s$ and $\bar{x} + 2s$     98% (49 out of 50)     ≥ 75%                                ≈ 95%
$\bar{x} - 3s$ and $\bar{x} + 3s$     100% (50 out of 50)    ≥ 89%                                ≈ 100%

Md = 62, Sk = -0.26
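The "actual" counts in the comparison can be checked with a Python sketch like the one below, assuming the 50 ordered aptitude scores from Table 3.2 are the data behind the table. The helper name `count_within` is illustrative.

```python
from statistics import mean, stdev

scores = [
    22, 25, 28, 31, 34, 35, 39, 39, 40, 42,
    44, 44, 46, 48, 49, 51, 53, 53, 55, 55,
    56, 57, 59, 60, 61, 63, 63, 63, 65, 66,
    68, 68, 69, 71, 72, 72, 74, 75, 75, 76,
    78, 78, 80, 82, 83, 85, 88, 90, 92, 96,
]

x_bar = mean(scores)
s = stdev(scores)  # sample standard deviation (n - 1 in the denominator)


def count_within(values, center, spread, k):
    """Count the values lying between center - k*spread and center + k*spread."""
    low, high = center - k * spread, center + k * spread
    return sum(low <= v <= high for v in values)


counts = {k: count_within(scores, x_bar, s, k) for k in (1, 2, 3)}
for k, inside in counts.items():
    print(k, inside, f"{100 * inside / len(scores):.0f}%")
```

The observed percentages satisfy the Chebyshev bounds and sit close to the Empirical Rule values, which is consistent with the small skewness reported for these data.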


Allied Manufacturing Example

Is the Empirical Rule applicable to this data?

Probably yes.

Histogram is approximately bell shaped.

$\bar{x} - 2s = 10.275$ and $\bar{x} + 2s = 10.3284$

96 of the 100 data values fall between these limits, closely approximating the 95% called for by the Empirical Rule


Grouped Data

Table 3.4

Class Number    Class (Age in years)    Frequency
1               20 and under 30         5
2               30 and under 40         14
3               40 and under 50         9
4               50 and under 60         6
5               60 and under 70         2
Total                                   36

When raw data are not available

Estimate $\bar{x}$ by assuming data values are equal to the midpoint of their class


Grouped Data

When raw data are not available

Estimate $\bar{x}$ by assuming data values are equal to the midpoint of their class

5 values at (20 + 30)/2 = 25
14 values at (30 + 40)/2 = 35
9 values at (40 + 50)/2 = 45
6 values at (50 + 60)/2 = 55
2 values at (60 + 70)/2 = 65

$\bar{x} = \frac{(5)(25) + (14)(35) + (9)(45) + (6)(55) + (2)(65)}{36} = \frac{1480}{36} = 41.1$


Grouped Data

When raw data are not available

Estimate $s^2$ by assuming data values are equal to the midpoint of their class and using the normal method

$s^2 = \frac{\sum (\text{each data value})^2 - \left(\sum \text{each data value}\right)^2 / n}{n - 1}$

$s^2 = \frac{65{,}100 - (1480)^2/36}{35} = 121.59$

$s = \sqrt{121.59} = 11.03$



Table 3.5

Summary of calculations

Class Number    Class              $f$    $m$    $f \cdot m$    $f \cdot m^2$
1               20 and under 30    5      25     125            3,125
2               30 and under 40    14     35     490            17,150
3               40 and under 50    9      45     405            18,225
4               50 and under 60    6      55     330            18,150
5               60 and under 70    2      65     130            8,450
Total                              36            1,480          65,100

$\sum f \cdot m = 1{,}480$        $\sum f \cdot m^2 = 65{,}100$
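The grouped-data estimates can be reproduced with the following Python sketch, which rebuilds the columns of the summary table from the class limits and frequencies. The tuple layout and variable names are assumptions made for illustration.

```python
from math import sqrt

# (lower limit, upper limit, frequency) for each age class.
classes = [(20, 30, 5), (30, 40, 14), (40, 50, 9), (50, 60, 6), (60, 70, 2)]

freqs = [f for _, _, f in classes]
midpoints = [(low + high) / 2 for low, high, _ in classes]
n = sum(freqs)

# Column totals: sum of f*m and sum of f*m^2.
sum_fm = sum(f * m for f, m in zip(freqs, midpoints))
sum_fm2 = sum(f * m ** 2 for f, m in zip(freqs, midpoints))

# Every value is treated as if it sat at its class midpoint.
x_bar = sum_fm / n
s2 = (sum_fm2 - sum_fm ** 2 / n) / (n - 1)
s = sqrt(s2)

print(n, sum_fm, sum_fm2)
print(round(x_bar, 1), round(s2, 2), round(s, 2))
```

These are estimates only: the true mean and variance of the raw ages could differ, because the values within each class are unknown.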


Box Plots

Box plots are graphical representations of data sets that illustrate the lowest data value (L), the first quartile ($Q_1$), the median ($Q_2$, Md), the third quartile ($Q_3$), the interquartile range (IQR), and the highest data value (H)


Box Plots

Given the aptitude test data:

L = 22
$Q_1$ = 46
$Q_2$ = Md = 62
$Q_3$ = 75
IQR = 75 - 46 = 29
H = 96

Box plot drawn on a horizontal axis from 20 to 100 (ticks every 10), marking L = 22, $Q_1$ = 46, Md = 62, $Q_3$ = 75, and H = 96
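The values needed to draw the box plot can be gathered with a Python sketch such as this one. It assumes the round-up and averaging percentile location rules from earlier in the chapter and the ordered aptitude scores from Table 3.2; the dictionary keys are illustrative labels.

```python
import math


def percentile(ordered, p):
    """Percentile by the location rules: round up, or average two neighbors."""
    position = len(ordered) * p / 100
    if position != int(position):
        return ordered[math.ceil(position) - 1]
    k = int(position)
    return (ordered[k - 1] + ordered[k]) / 2


scores = [
    22, 25, 28, 31, 34, 35, 39, 39, 40, 42,
    44, 44, 46, 48, 49, 51, 53, 53, 55, 55,
    56, 57, 59, 60, 61, 63, 63, 63, 65, 66,
    68, 68, 69, 71, 72, 72, 74, 75, 75, 76,
    78, 78, 80, 82, 83, 85, 88, 90, 92, 96,
]

summary = {
    "L": scores[0],                  # lowest value of the ordered data
    "Q1": percentile(scores, 25),
    "Md": percentile(scores, 50),
    "Q3": percentile(scores, 75),
    "H": scores[-1],                 # highest value of the ordered data
}
summary["IQR"] = summary["Q3"] - summary["Q1"]

print(summary)
```

Plotting libraries typically compute quartiles with their own interpolation methods, so a box plot they draw may place the box edges slightly differently from these values.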


Box Plots



Box Plots

Figure 3.16a


Box Plots

Figure 3.16b


Box Plots

Figure: Box Plots for Aptitude Scores (vertical axis: Aptitude Score, 20 to 100; horizontal axis: Sample 1 and Sample 2)
