chapter 3 descriptive statistics

Chapter 3: Descriptive Statistics, by Try Sothearith. Tel: 012 585 865 / 016555507. NU: Statistics for Manager

Upload: cuteapufriends

Post on 10-Nov-2015


DESCRIPTION

Statistics for management

TRANSCRIPT

  • Chapter 3: Descriptive Statistics, by Try Sothearith. Tel: 012 585 865 / 016555507. NU: Statistics for Manager

  • NU: Statistics for Manager. Outline: Numerical Description, Central Tendency, Dispersion, Percentiles, Skewness, Kurtosis, Correlation.

  • Numerical Description: Statistics are descriptive measures derived from a sample (n items). Parameters are descriptive measures derived from a population (N items).

  • Summary Measures (Describing Data Numerically): Central Tendency (Arithmetic Mean, Median, Mode, Geometric Mean); Percentiles (Quartiles, Quintiles, Deciles, Percentiles); Variation (Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation); Shape (Skewness).

  • Numerical Description: Three key characteristics of numerical data:

    Central Tendency: Where are the data values concentrated? What seem to be typical or central data values?

    Dispersion: How much variation is there in the data? How spread out are the data values? Are there unusual values?

    Shape: Are the data values distributed symmetrically? Skewed? Sharply peaked? Flat? Bimodal?

  • Central Tendency: The central tendency is the middle or typical value of a distribution. It can be assessed using a dot plot or histogram, or more precisely with numerical statistics.

  • Measures of Central Tendency (Overview): Arithmetic Mean; Median (midpoint of ranked values); Mode (most frequently observed value); Geometric Mean.

  • Central Tendency: Six Measures of Central Tendency

    Mean. Formula: \(\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i\). Excel: =AVERAGE(Data). Pro: Familiar and uses all the sample information. Con: Influenced by extreme values.

    Median. Formula: middle value in sorted array. Excel: =MEDIAN(Data). Pro: Robust when extreme data values exist. Con: Ignores extremes and can be affected by gaps in data values.

    Mode. Formula: most frequently occurring data value. Excel: =MODE(Data). Pro: Useful for attribute data or discrete data with a small range. Con: May not be unique, and is not helpful for continuous data.

    Midrange. Formula: \((x_{\min} + x_{\max})/2\). Excel: =0.5*(MIN(Data)+MAX(Data)). Pro: Easy to understand and calculate. Con: Influenced by extreme values and ignores most data values.

    Geometric mean (G). Formula: \(G = \sqrt[n]{x_1 x_2 \cdots x_n}\). Excel: =GEOMEAN(Data). Pro: Useful for growth rates and mitigates high extremes. Con: Less familiar and requires positive data.

    Trimmed mean. Formula: same as the mean, except omit the highest and lowest k% of data values (e.g., 5%). Excel: =TRIMMEAN(Data, Percent). Pro: Mitigates effects of extreme values. Con: Excludes some data values that could be relevant.

    Geometric mean growth rate (GR). Formula terms: Xn = end of period, X1 = beginning of period, n = number of periods. Pro: Useful for average growth rates over time. Con: Less familiar and requires positive data.

    Weighted mean. Pro: Familiar and uses all the sample information. Con: Influenced by extreme values.

  • Central Tendency: Arithmetic Mean. The arithmetic mean is the measure of central tendency that represents the average of all the values.

    In Excel, use the function =AVERAGE(Data), where Data is an array of data values.

    Population formula: \(\mu = \frac{1}{N}\sum_{i=1}^{N} x_i\). Sample formula: \(\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i\).

  • Central Tendency: Arithmetic Mean. For a selected sample of n = 5 car prices ($): 30,000, 43,000, 15,000, 50,000 and 35,000:

    \(\bar{x} = (30{,}000 + 43{,}000 + 15{,}000 + 50{,}000 + 35{,}000)/5 = 34{,}600\)
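
    The car-price calculation above can be checked with a minimal Python sketch. The variable names are illustrative only; the data are the five prices from the slide.

```python
from statistics import mean

# Sample of n = 5 car prices in dollars, as listed on the slide.
car_prices = [30_000, 43_000, 15_000, 50_000, 35_000]

# Arithmetic mean: sum of the values divided by the number of values.
sample_mean = mean(car_prices)
print(sample_mean)
```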

  • Arithmetic Mean: The arithmetic mean (mean) is the most common measure of central tendency. For a sample of size n, \(\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}\), where n is the sample size and the \(x_i\) are the observed values.

  • Arithmetic Mean (continued): The mean is the sum of the values divided by the number of values. It is affected by extreme values (outliers).

    [Figure: two dot plots on a 0 to 10 axis, one with mean = 3 and one with mean = 4.]

  • Weighted Mean: The weighted mean of a set of numbers X1, X2, ..., Xn, with corresponding weights w1, w2, ..., wn, is computed from the following formula: \(\bar{X}_w = \frac{w_1 X_1 + w_2 X_2 + \cdots + w_n X_n}{w_1 + w_2 + \cdots + w_n}\)

  • Weighted Mean, Example 4: During a one-hour period on a hot Saturday afternoon, cabana boy Chris served fifty drinks. He sold five drinks for $0.50, fifteen for $0.75, fifteen for $0.90, and fifteen for $1.10. Compute the weighted mean of the price of the drinks.
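
    One way to work Example 4 is the short Python sketch below. It treats the number of drinks sold at each price as the weight; the variable names are illustrative, not from the slides.

```python
# Prices charged and the number of drinks sold at each price (the weights).
prices = [0.50, 0.75, 0.90, 1.10]
drinks_sold = [5, 15, 15, 15]

# Weighted mean: sum of weight * value, divided by the sum of the weights.
total_revenue = sum(w * x for w, x in zip(drinks_sold, prices))
weighted_mean = total_revenue / sum(drinks_sold)
print(round(weighted_mean, 3))
```

    Under these inputs the weighted mean price works out to $0.875 per drink (43.75 / 50).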

  • Central Tendency: Characteristics of the Mean. The arithmetic mean is the most familiar average. It is affected by every sample item. It is the balancing point or fulcrum for the data.

  • Central Tendency: Characteristics of the Mean. Regardless of the shape of the distribution, the signed deviations of the data points from the mean always sum to zero: the total distance of the points below the mean equals the total distance of the points above it. Consider the following asymmetric distribution of quiz scores whose mean = 65.

  • Central Tendency: Geometric Mean. The geometric mean (G) is a multiplicative average: \(G = \sqrt[n]{x_1 x_2 \cdots x_n}\). For the J. D. Power quality data (n = 37), in Excel use =GEOMEAN(Array). The geometric mean tends to mitigate the effects of high outliers.

  • Central Tendency: Growth Rates. A variation on the geometric mean is used to find the average growth rate for a time series. For example, from 2002 to 2006, JetBlue Airlines revenues are:

    Year    Revenue (mil)
    2002    635
    2003    998
    2004    1265
    2005    1701
    2006    2363

  • Central Tendency: Growth Rates. The average growth rate is given by taking the geometric mean of the ratios of each year's revenue to the preceding year. Due to cancellations, only the first and last years are relevant: \(GR = \sqrt[4]{2363/635} - 1 = 1.3889 - 1 = 0.389\), or 38.9% per year. In Excel use =(2363/635)^(1/4)-1.
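
    The cancellation argument can be checked numerically. The sketch below (illustrative names, JetBlue figures from the slide) computes the growth rate both ways: from the first and last values only, and as the geometric mean of the year-over-year ratios.

```python
from math import prod

# JetBlue revenue (millions) by year, from the slide.
revenue = {2002: 635, 2003: 998, 2004: 1265, 2005: 1701, 2006: 2363}
values = list(revenue.values())
periods = len(values) - 1  # five observations span four yearly periods

# Shortcut: only the first and last values matter.
growth_rate = (values[-1] / values[0]) ** (1 / periods) - 1

# Long form: geometric mean of each year's ratio to the preceding year.
ratios = [later / earlier for earlier, later in zip(values, values[1:])]
growth_rate_from_ratios = prod(ratios) ** (1 / len(ratios)) - 1

print(f"{growth_rate:.1%}")
```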

  • Central Tendency: Median. The median (M) is the 50th percentile, or midpoint, of the sorted sample data: 50% of the observations lie below it and 50% above it. If n is odd, the median is the middle observation in the data array. If n is even, the median is the average of the middle two observations in the data array.

  • Central Tendency: Median. Consider the following n = 6 data values: 11 12 15 17 21 32. What is the median? For even n, the middle positions are n/2 = 6/2 = 3 and n/2 + 1 = 6/2 + 1 = 4, so M = (x3 + x4)/2 = (15 + 17)/2 = 16.

  • Central Tendency: Median (Figure 4.6)

  • Central Tendency: Median. Consider the following n = 7 data values: 12 23 23 25 27 34 41. What is the median? For odd n, the middle position is (n+1)/2 = (7+1)/2 = 8/2 = 4, so M = x4 = 25.
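
    The odd/even position rule can be written out as a short Python sketch. The helper name median_by_position is hypothetical; the standard library's statistics.median is used only as a cross-check.

```python
from statistics import median

def median_by_position(values):
    """Median via the position rule: middle item (odd n) or mean of the two middle items (even n)."""
    data = sorted(values)
    n = len(data)
    if n % 2 == 1:
        return data[(n + 1) // 2 - 1]          # position (n+1)/2, converted to a 0-based index
    return (data[n // 2 - 1] + data[n // 2]) / 2  # positions n/2 and n/2 + 1

even_sample = [11, 12, 15, 17, 21, 32]
odd_sample = [12, 23, 23, 25, 27, 34, 41]
print(median_by_position(even_sample), median_by_position(odd_sample))
```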

  • Central Tendency: Median. Use Excel's function =MEDIAN(Data), where Data is an array of data values. For the 37 vehicle quality ratings (odd n), the position of the median is (n+1)/2 = (37+1)/2 = 19. So, the median is x19 = 121. When there are several duplicate data values, the median does not provide a clean 50-50 split in the data.

  • Central Tendency: Characteristics of the Median. The median is insensitive to extreme data values. For example, consider the following quiz scores for 3 students:

    Tom's scores: 20, 40, 70, 75, 80. Mean = 57, Median = 70, Total = 285
    Jake's scores: 60, 65, 70, 90, 95. Mean = 76, Median = 70, Total = 380
    Mary's scores: 50, 65, 70, 75, 90. Mean = 70, Median = 70, Total = 350

    What does the median for each student tell you?

  • Central Tendency: Mode. The mode is the most frequently occurring data value. It is similar to the mean and median if data values occur often near the center of the sorted data. A data set may have multiple modes or no mode.

  • Central Tendency: Mode. For example, consider the following quiz scores for four students:

    Lee's scores: 60, 70, 70, 70, 80. Mean = 70, Median = 70, Mode = 70
    Pat's scores: 45, 45, 70, 90, 100. Mean = 70, Median = 70, Mode = 45
    Sam's scores: 50, 60, 70, 80, 90. Mean = 70, Median = 70, Mode = none
    Xiao's scores: 50, 50, 70, 90, 90. Mean = 70, Median = 70, Modes = 50, 90

    What does the mode for each student tell you?
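
    A small Python sketch of the four students' modes follows. It uses statistics.multimode, which returns every value tied for the highest frequency; note the assumption that a result listing all values (Sam's case, where nothing repeats) is read as "no mode" in the slide's sense.

```python
from statistics import multimode

scores = {
    "Lee": [60, 70, 70, 70, 80],
    "Pat": [45, 45, 70, 90, 100],
    "Sam": [50, 60, 70, 80, 90],
    "Xiao": [50, 50, 70, 90, 90],
}

# multimode returns all values tied for most frequent, in first-seen order.
modes = {name: multimode(values) for name, values in scores.items()}

for name, found in modes.items():
    # If every value occurs once, all values tie: treat that as "no mode".
    label = "none" if len(found) == len(scores[name]) else found
    print(name, label)
```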

  • Central Tendency: Mode. The mode is easy to define, but not easy to calculate in large samples. Use Excel's function =MODE(Array), which returns #N/A if there is no mode and returns the first mode found if the data are multimodal. The mode may be far from the middle of the distribution and not at all typical.

  • Central Tendency: Mode. The mode generally isn't useful for continuous data, since data values rarely repeat. It is best for attribute data or a discrete variable with a small range (e.g., Likert scale).

  • Central Tendency: Example: Price/Earnings Ratios and Mode. Consider the following P/E ratios for a random sample of 68 Standard & Poor's 500 stocks. What is the mode?

    7 8 8 10 10 10 10 12 13 13 13 13 13 13 13 14 14 14 15 15 15 15 15 16 16 16 17 18 18 18 18 19 19 19 19 19 20 20 20 21 21 21 22 22 23 23 23 24 25 26 26 26 26 27 29 29 30 31 34 36 37 40 41 45 48 55 68 91

  • Central Tendency: Example: Price/Earnings Ratios and Mode. Excel's descriptive statistics results are:

    Mean 22.7206; Median 19; Mode 13; Range 84; Minimum 7; Maximum 91; Sum 1545; Count 68

    The mode 13 occurs 7 times, but what does the dot plot show?

  • Central Tendency: Example: Price/Earnings Ratios and Mode. The dot plot shows local modes (a peak with valleys on either side) at 10, 13, 15, 19, 23, 26, 29. These multiple modes suggest that the mode is not a stable measure of central tendency.

  • Central Tendency: Mode. A bimodal distribution refers to the shape of the histogram rather than the mode of the raw data. It occurs when dissimilar populations are combined in one sample. For example,

  • Central Tendency: Skewness. Compare the mean and median, or look at the histogram, to determine the degree of skewness.

  • Central Tendency: Symptoms of Skewness

    Skewed left (negative skewness). Histogram appearance: long tail of histogram points left (a few low values but most data on right). Statistics: Mean < Median.

    Symmetric. Histogram appearance: tails of histogram are balanced (low/high values offset). Statistics: Mean ≈ Median.

    Skewed right (positive skewness). Histogram appearance: long tail of histogram points right (most data on left but a few high values). Statistics: Mean > Median.

  • Central Tendency: Skewness. For the sample of spending per customer at 74 Noodles &, the mean ($7.04) exceeds the median ($7.00). What does this suggest?

  • Central Tendency: Midrange. The midrange is the point halfway between the lowest and highest values of X: \((x_{\min} + x_{\max})/2\). It is easy to use but sensitive to extreme data values. For the J. D. Power quality data (n = 37), the midrange (147.5) is higher than the mean (134.51) or median (132).

  • Central Tendency: Trimmed Mean. To calculate the trimmed mean, first remove the highest and lowest k percent of the observations. For example, for the n = 68 P/E ratios, we want a 5 percent trimmed mean (i.e., k = .05). To determine how many observations to trim, multiply k × n = 0.05 × 68 = 3.4, which rounds down to 3 observations. So, we would remove the three smallest and three largest observations before averaging the remaining values.

  • Central Tendency: Trimmed Mean. Here is a summary of all the measures of central tendency for the n = 68 P/E values. The trimmed mean mitigates the effects of very high values, but still exceeds the median.

    Mean:            22.72   =AVERAGE(PERatio)
    Median:          19.00   =MEDIAN(PERatio)
    Mode:            13.00   =MODE(PERatio)
    Geometric Mean:  19.85   =GEOMEAN(PERatio)
    Midrange:        49.00   =(MIN(PERatio)+MAX(PERatio))/2
    5% Trim Mean:    21.10   =TRIMMEAN(PERatio,0.1)

    Excel's TRIMMEAN takes the total fraction to trim from both ends combined, so 0.1 removes 5% from each end.
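
    The summary can be reproduced with a Python sketch under a few stated assumptions: the 68 P/E values are rebuilt from value/count pairs read off the data slide, the trim count is rounded down as on the trimmed-mean slide, and all names are illustrative.

```python
from statistics import mean, median, mode, geometric_mean

# P/E value -> number of occurrences, read from the data slide (68 values in total).
pe_counts = {7: 1, 8: 2, 10: 4, 12: 1, 13: 7, 14: 3, 15: 5, 16: 3, 17: 1, 18: 4,
             19: 5, 20: 3, 21: 3, 22: 2, 23: 3, 24: 1, 25: 1, 26: 4, 27: 1, 29: 2,
             30: 1, 31: 1, 34: 1, 36: 1, 37: 1, 40: 1, 41: 1, 45: 1, 48: 1, 55: 1,
             68: 1, 91: 1}
pe = [value for value, count in pe_counts.items() for _ in range(count)]

# 5% trimmed mean: drop int(0.05 * n) observations from each end of the sorted data.
k = int(0.05 * len(pe))
trimmed = sorted(pe)[k:len(pe) - k]

summary = {
    "mean": mean(pe),
    "median": median(pe),
    "mode": mode(pe),
    "geometric_mean": geometric_mean(pe),
    "midrange": (min(pe) + max(pe)) / 2,
    "trimmed_mean_5pct": mean(trimmed),
}
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```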

  • Central Tendency: Trimmed Mean. The Federal Reserve uses a 16% trimmed mean to mitigate the effects of extremes in its analysis of the Consumer Price Index.

  • Dispersion: Measures of Variation. Variation is the spread of data points about the center of the distribution in a sample. Consider the following measures of dispersion:

    Range. Formula: \(x_{\max} - x_{\min}\). Excel: =MAX(Data)-MIN(Data). Pro: Easy to calculate. Con: Sensitive to extreme data values.

    Variance (\(s^2\)). Formula: \(s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}\). Excel: =VAR(Data). Pro: Plays a key role in mathematical statistics. Con: Non-intuitive meaning.

    Standard deviation (s). Formula: \(s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}\). Excel: =STDEV(Data). Pro: Most common measure. Uses same units as the raw data ($, etc.). Con: Non-intuitive meaning.

    Coefficient of variation (CV). Formula: \(CV = 100 \times s/\bar{x}\). Excel: None. Pro: Measures relative variation in percent, so data sets can be compared. Con: Requires non-negative data.

    Mean absolute deviation (MAD). Formula: \(MAD = \frac{\sum_{i=1}^{n}|x_i - \bar{x}|}{n}\). Excel: =AVEDEV(Data). Pro: Easy to understand. Con: Lacks nice theoretical properties.

  • Dispersion: Range. The range is the difference between the largest and smallest observation: Range = \(x_{\max} - x_{\min}\). For example, for the n = 68 P/E ratios, Range = 91 - 7 = 84.

  • Dispersion: Variance. The population variance (\(\sigma^2\)) is defined as the sum of squared deviations around the mean \(\mu\) divided by the population size N. For the sample variance (\(s^2\)), we divide by n - 1 instead of n; otherwise \(s^2\) would tend to underestimate the unknown population variance \(\sigma^2\).

  • Dispersion: Standard Deviation. The standard deviation is the square root of the variance. Its units of measure are the same as X. It explains how individual values in a data set vary from the mean.

  • Dispersion: Standard Deviation. Excel's built-in functions are:

    Variance: population =VARP(Array); sample =VAR(Array)
    Standard deviation: population =STDEVP(Array); sample =STDEV(Array)

  • Dispersion: Calculating a Standard Deviation. Consider the following five quiz scores for Stephanie. (Table 4.12)

  • Dispersion: Calculating a Standard Deviation. Now, calculate the sample standard deviation: \(s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}\). Somewhat easier, the two-sum formula can also be used: \(s = \sqrt{\frac{\sum x_i^2 - (\sum x_i)^2/n}{n-1}}\).

  • Dispersion: Calculating a Standard Deviation. The standard deviation is nonnegative because deviations around the mean are squared. When every observation is exactly equal to the mean, the standard deviation is zero. Standard deviations can be large or small, depending on the units of measure. Compare standard deviations only for data sets measured in the same units and only if the means do not differ substantially.
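
    Stephanie's scores (Table 4.12) are not reproduced in this transcript, so the sketch below substitutes Tom's five quiz scores from the median slide; that substitution and the variable names are assumptions. It computes the population and sample variance from the definition and cross-checks the two-sum formula.

```python
from math import sqrt
from statistics import mean, variance

scores = [20, 40, 70, 75, 80]  # Tom's quiz scores, used here as a stand-in data set
n = len(scores)
xbar = mean(scores)

# Sum of squared deviations around the mean.
sum_sq_dev = sum((x - xbar) ** 2 for x in scores)

population_variance = sum_sq_dev / n        # divide by N
sample_variance = sum_sq_dev / (n - 1)      # divide by n - 1
sample_sd = sqrt(sample_variance)

# Two-sum formula: needs only the sum of x and the sum of x squared.
two_sum_variance = (sum(x * x for x in scores) - sum(scores) ** 2 / n) / (n - 1)

print(population_variance, sample_variance, round(sample_sd, 2))
```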

  • Dispersion: Coefficient of Variation. Useful for comparing variables measured in different units or with different means. A unit-free measure of dispersion, expressed as a percent of the mean. Only appropriate for nonnegative data. It is undefined if the mean is zero or negative.

  • Relative dispersion: The coefficient of variation is the ratio of the standard deviation to the arithmetic mean, expressed as a percentage: \(CV = 100 \times s/\bar{x}\).

  • Dispersion: Coefficient of Variation. For example:

    Defect rates (n = 37): s = 22.89, \(\bar{x}\) = 125.38 gives CV = 100 × (22.89)/(125.38) = 18%
    ATM deposits (n = 100): s = 280.80, \(\bar{x}\) = 233.89 gives CV = 100 × (280.80)/(233.89) = 120%
    P/E ratios (n = 68): s = 14.08, \(\bar{x}\) = 22.72 gives CV = 100 × (14.08)/(22.72) = 62%
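
    The three CV figures can be recomputed from the quoted standard deviations and means. In this sketch the function name and the guard against a non-positive mean are illustrative choices, and 14.08 is taken as the P/E standard deviation.

```python
def coefficient_of_variation(std_dev, mean_value):
    """CV in percent; only meaningful when the mean is positive."""
    if mean_value <= 0:
        raise ValueError("CV is undefined for a zero or negative mean")
    return 100 * std_dev / mean_value

# (standard deviation, mean) pairs quoted on the slide.
examples = {
    "Defect rates": (22.89, 125.38),
    "ATM deposits": (280.80, 233.89),
    "P/E ratios": (14.08, 22.72),
}
cv = {name: round(coefficient_of_variation(s, m)) for name, (s, m) in examples.items()}
print(cv)
```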

  • Dispersion: Mean Absolute Deviation. The Mean Absolute Deviation (MAD) reveals the average distance from an individual data point to the mean (center of the distribution). It uses absolute values of the deviations around the mean. Excel's function is =AVEDEV(Array).
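
    As a brief illustration, the sketch below computes the MAD for Tom's quiz scores from the median slide (an arbitrary choice of data set; names are illustrative).

```python
scores = [20, 40, 70, 75, 80]  # Tom's quiz scores from the median slide
xbar = sum(scores) / len(scores)

# Mean absolute deviation: average of |x - mean| over all observations.
mad = sum(abs(x - xbar) for x in scores) / len(scores)
print(mad)
```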

  • Dispersion: Central Tendency vs. Dispersion (Manufacturing). Consider the histograms of hole diameters drilled in a steel plate during manufacturing. The desired distribution is outlined in red.

  • Dispersion: Central Tendency vs. Dispersion (Manufacturing). One machine has the desired mean (5 mm) but too much variation. Another has acceptable variation but a mean of less than 5 mm. Take frequent samples to monitor quality.

  • Dispersion: Central Tendency vs. Dispersion (Job Performance). Consider student ratings of four professors on eight teaching attributes (10-point scale).

  • Dispersion: Central Tendency vs. Dispersion (Job Performance). Jones and Wu have identical means but different standard deviations.

  • Dispersion: Central Tendency vs. Dispersion (Job Performance). Smith and Gopal have different means but identical standard deviations.

  • Dispersion: Central Tendency vs. Dispersion (Job Performance). A high mean (better rating) and low standard deviation (more consistency) is preferred. Which professor do you think is best?

  • Descriptive Statistics (Part 2), Chapter 3: Standardized Data; Percentiles, Quartiles and Box Plots; Grouped Data; Skewness and Kurtosis

    McGraw-Hill/Irwin Copyright 2009 by The McGraw-Hill Companies, Inc. All rights reserved.

  • Standardized Data: Chebyshev's Theorem. For any population with mean \(\mu\) and standard deviation \(\sigma\), the percentage of observations that lie within k standard deviations of the mean must be at least \(100[1 - 1/k^2]\). Developed by mathematicians Jules Bienaymé (1796-1878) and Pafnuty Chebyshev (1821-1894).

  • Standardized Data: Chebyshev's Theorem. For k = 2 standard deviations, \(100[1 - 1/2^2]\) = 75%. So, at least 75.0% will lie within \(\mu \pm 2\sigma\). For k = 3 standard deviations, \(100[1 - 1/3^2]\) = 88.9%. So, at least 88.9% will lie within \(\mu \pm 3\sigma\). Although applicable to any data set, these limits tend to be too wide to be useful.
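
    The bound is a one-line calculation. A minimal Python sketch follows; the function name is hypothetical, and the check for k > 1 reflects that the bound is zero or negative otherwise.

```python
def chebyshev_min_percent(k):
    """Minimum percentage of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 100 * (1 - 1 / k ** 2)

for k in (2, 3):
    print(k, round(chebyshev_min_percent(k), 1))
```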

  • Standardized Data: The Empirical Rule. The normal or Gaussian distribution was named for Karl Gauss (1777-1855). The normal distribution is symmetric and is also known as the bell-shaped curve. The Empirical Rule states that for data from a normal distribution, we expect that for

    k = 1, about 68.26% will lie within \(\mu \pm 1\sigma\)
    k = 2, about 95.44% will lie within \(\mu \pm 2\sigma\)
    k = 3, about 99.73% will lie within \(\mu \pm 3\sigma\)

  • Standardized Data: The Empirical Rule. Note: no upper bound is given. Data values outside \(\mu \pm 3\sigma\) are rare. Distance from the mean is measured in terms of the number of standard deviations.

  • Standardized Data: Example: Exam Scores. If 80 students take an exam, how many will score within 2 standard deviations of the mean? Assuming exam scores follow a normal distribution, the empirical rule states that about 95.44% will lie within \(\mu \pm 2\sigma\), so 95.44% × 80 ≈ 76 students will score within \(\pm 2\sigma\) of \(\mu\). How many students will score more than 2 standard deviations from the mean?
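
    The exam-score example can be sketched in Python using the standard normal distribution from the statistics module. The helper name share_within is hypothetical, and the exact normal shares differ from the slide's quoted percentages only in the last decimal place.

```python
from statistics import NormalDist

def share_within(k):
    """Share of a normal population lying within k standard deviations of the mean."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)

students = 80
within_2 = round(share_within(2) * students)  # expected count within 2 standard deviations
beyond_2 = students - within_2                # expected count farther than 2 standard deviations
print(within_2, beyond_2)
```

    Under the normality assumption, about 76 of the 80 students fall within two standard deviations, leaving about 4 beyond.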

  • Standardized Data: Unusual Observations. Unusual observations are those that lie beyond \(\mu \pm 2\sigma\). Outliers are observations that lie beyond \(\mu \pm 3\sigma\).

  • Standardized Data: Unusual Observations. For example, the P/E ratio data contains several large data values. Are they unusual or outliers?

    7 8 8 10 10 10 10 12 13 13 13 13 13 13 13 14 14 14 15 15 15 15 15 16 16 16 17 18 18 18 18 19 19 19 19 19 20 20 20 21 21 21 22 22 23 23 23 24 25 26 26 26 26 27 29 29 30 31 34 36 37 40 41 45 48 55 68 91

  • Standardized Data: The Empirical Rule. If the sample came from a normal distribution, then the Empirical Rule states:

    \(\bar{x} \pm 1s\) = 22.72 ± 1(14.08) = (8.6, 36.8)
    \(\bar{x} \pm 2s\) = 22.72 ± 2(14.08) = (-5.4, 50.9)
    \(\bar{x} \pm 3s\) = 22.72 ± 3(14.08) = (-19.5, 65.0)

  • Standardized Data: The Empirical Rule. Are there any unusual values or outliers?

    [Figure: the sorted P/E data, 7 8 . . . 48 55 68 91, shown against the mean 22.72.]

  • Standardized Data: Defining a Standardized Variable. A standardized variable (Z) redefines each observation in terms of the number of standard deviations from the mean. Standardization formula for a population: \(z_i = \frac{x_i - \mu}{\sigma}\). Standardization formula for a sample: \(z_i = \frac{x_i - \bar{x}}{s}\).

  • Standardized Data: Defining a Standardized Variable. \(z_i\) tells how far away the observation is from the mean. For example, for the P/E data, the first value is x1 = 7. The associated z value is \(z_1 = (7 - 22.72)/14.08 = -1.12\).

  • Standardized Data: Defining a Standardized Variable. A negative z value means the observation is below the mean. A positive z means the observation is above the mean. For x68 = 91, \(z_{68} = (91 - 22.72)/14.08 = 4.85\).

  • Standardized Data: Defining a Standardized Variable. Here are the standardized z values for the P/E data:

  • Standardized Data: Defining a Standardized Variable. In Excel, use =STANDARDIZE(Array, Mean, STDev) to calculate a standardized z value. MegaStat calculates standardized values as well as checks for outliers.
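
    A hedged Python sketch of the standardization step follows. It rebuilds the 68 P/E values from value/count pairs read off the data slide, uses the sample formula, and applies the 2 and 3 standard deviation cutoffs from the Unusual Observations slide; all names are illustrative.

```python
from statistics import mean, stdev

# P/E value -> number of occurrences, read from the data slide (68 values in total).
pe_counts = {7: 1, 8: 2, 10: 4, 12: 1, 13: 7, 14: 3, 15: 5, 16: 3, 17: 1, 18: 4,
             19: 5, 20: 3, 21: 3, 22: 2, 23: 3, 24: 1, 25: 1, 26: 4, 27: 1, 29: 2,
             30: 1, 31: 1, 34: 1, 36: 1, 37: 1, 40: 1, 41: 1, 45: 1, 48: 1, 55: 1,
             68: 1, 91: 1}
pe = [value for value, count in pe_counts.items() for _ in range(count)]

xbar, s = mean(pe), stdev(pe)           # sample mean and sample standard deviation
z = [(x - xbar) / s for x in pe]        # z_i = (x_i - xbar) / s

# Unusual: more than 2 but at most 3 standard deviations away; outlier: more than 3.
unusual = [x for x, zi in zip(pe, z) if 2 < abs(zi) <= 3]
outliers = [x for x, zi in zip(pe, z) if abs(zi) > 3]

print(round(z[0], 2), round(z[-1], 2), unusual, outliers)
```

    With these cutoffs, 55 is flagged as unusual and 68 and 91 as outliers.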

  • Standardized Data: Outliers. What do we do with outliers in a data set? If due to erroneous data, then discard. An outrageous observation (one completely outside of an expected range) is certainly invalid. Recognize unusual data points and outliers and their potential impact on your study. Research books and articles on how to handle outliers.

  • Standardized Data: Estimating Sigma. For a normal distribution, the range of values is \(6\sigma\) (from \(\mu - 3\sigma\) to \(\mu + 3\sigma\)). If you know the range R (high - low), you can estimate the standard deviation as \(\sigma = R/6\). This is useful for approximating the standard deviation when only R is known. The estimate depends on the assumption of normality.

  • Percentiles and Quartiles: Percentiles. Percentiles are data that have been divided into 100 groups. For example, you score in the 83rd percentile on a standardized test. That means that 83% of the test-takers scored below you. Deciles are data that have been divided into 10 groups. Quintiles are data that have been divided into 5 groups. Quartiles are data that have been divided into 4 groups.

  • Percentiles and Quartiles: Percentiles. Percentiles are used to establish benchmarks for comparison purposes (e.g., health care, manufacturing and banking industries use the 5th, 25th, 50th, 75th and 90th percentiles). Quartiles (25, 50, and 75 percent) are commonly used to assess financial performance and stock portfolios. Percentiles are used in employee merit evaluation and salary benchmarking.

  • Percentiles and Quartiles: Quartiles. Quartiles are scale points that divide the sorted data into four groups of approximately equal size. The three values that separate the four groups are called Q1, Q2, and Q3, respectively.

    Lower 25% | Q1 | Second 25% | Q2 | Third 25% | Q3 | Upper 25%

  • Percentiles and Quartiles: Quartiles. The second quartile Q2 is the median, an important indicator of central tendency. Q1 and Q3 measure dispersion, since the interquartile range Q3 - Q1 measures the degree of spread in the middle 50 percent of data values.

    Lower 50% | Q2 | Upper 50%

    Lower 25% | Q1 | Middle 50% | Q3 | Upper 25%

  • Percentiles and Quartiles: Quartiles. The first quartile Q1 is the median of the data values below Q2, and the third quartile Q3 is the median of the data values above Q2.

    Lower 25% | Q1 | Second 25% | Q2 | Third 25% | Q3 | Upper 25%

  • Percentiles and Quartiles: Quartiles. Depending on n, the quartiles Q1, Q2, and Q3 may be members of the data set or may lie between two of the sorted data values.

  • Percentiles and Quartiles: Quintiles. Quintiles are scale points that divide the sorted data into five groups of approximately equal size. The four values that separate the five groups are called Qn1, Qn2, Qn3, and Qn4, respectively.

    Lower 20% | Qn1 | Second 20% | Qn2 | Third 20% | Qn3 | Fourth 20% | Qn4 | Upper 20%

  • Percentiles and Quartiles: Deciles. Deciles are scale points that divide the sorted data into ten groups of approximately equal size. The nine values that separate the ten groups are called D1, D2, D3, ..., D9, respectively.

    Lower 10% | D1 | Second 10% | D2 | Third 10% | D3 | ... | D9 | Upper 10%

  • Percentiles and Quartiles: Method of Percentiles. For raw data sets, find any percentile using the method of percentiles. Step 1. Sort the observations from smallest to largest. Step 2. Compute Lp, the location of the p-th percentile: \(L_p = (n+1)\frac{p}{100}\).

  • Percentiles and Quartiles: Method of Percentiles. Step 3. Find Vp, the value at location Lp. Step 4. If Lp is not a whole number, interpolate: Vp = (value at the whole-number position below Lp) + (fractional part of Lp) × (distance from that value to the next value).

  • Percentiles and Quartiles: Example: P/E Ratios. Consider the following P/E ratios for 68 stocks in a portfolio. Use quartiles to define benchmarks for stocks that are low-priced (bottom quartile) or high-priced (top quartile).

    7 8 8 10 10 10 10 12 13 13 13 13 13 13 13 14 14 14 15 15 15 15 15 16 16 16 17 18 18 18 18 19 19 19 19 19 20 20 20 21 21 21 22 22 23 23 23 24 25 26 26 26 26 27 29 29 30 31 34 36 37 40 41 45 48 55 68 91

  • Percentiles and Quartiles: Method of Percentiles. Example: compute the 80th percentile. Step 1. Sort the observations from smallest to largest. Step 2. Compute the location: Lp = (68+1) × 80/100 = 55.2. Step 3. Vp = value at position 55 + 0.2 × (value at position 56 - value at position 55) = 29 + 0.2(29 - 29) = 29. So about 80% of the stocks have a P/E ratio below 29, and the other 20% have a P/E ratio of 29 or higher.
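
    The four steps can be sketched in Python as follows. The function name percentile and the clamping at the ends of the data are assumptions added for illustration; the location rule is the (n+1)p/100 rule from the slides, which is not necessarily the interpolation rule a given spreadsheet function uses.

```python
def percentile(values, p):
    """p-th percentile by the location rule L_p = (n + 1) * p / 100 with linear interpolation."""
    data = sorted(values)                 # Step 1: sort
    n = len(data)
    location = (n + 1) * p / 100          # Step 2: 1-based location
    lower = int(location)                 # whole-number position at or below the location
    fraction = location - lower
    if lower < 1:                         # location falls before the first observation
        return data[0]
    if lower >= n:                        # location falls at or after the last observation
        return data[-1]
    # Steps 3-4: value at the lower position plus a fraction of the gap to the next value.
    return data[lower - 1] + fraction * (data[lower] - data[lower - 1])

# P/E value -> number of occurrences, read from the data slide (68 values in total).
pe_counts = {7: 1, 8: 2, 10: 4, 12: 1, 13: 7, 14: 3, 15: 5, 16: 3, 17: 1, 18: 4,
             19: 5, 20: 3, 21: 3, 22: 2, 23: 3, 24: 1, 25: 1, 26: 4, 27: 1, 29: 2,
             30: 1, 31: 1, 34: 1, 36: 1, 37: 1, 40: 1, 41: 1, 45: 1, 48: 1, 55: 1,
             68: 1, 91: 1}
pe = [value for value, count in pe_counts.items() for _ in range(count)]

print(percentile(pe, 80), [percentile(pe, p) for p in (25, 50, 75)])
```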

  • Percentiles and Quartiles: Excel Quartiles. Use the Excel function =QUARTILE(Array, k) to return the kth quartile. Excel treats quartiles as a special case of percentiles. For example, to calculate Q3, use =QUARTILE(Array, 3) or =PERCENTILE(Array, 0.75).

  • Percentiles and Quartiles: Example: P/E Ratios and Quartiles. So, to summarize:

    Lower 25% of P/E Ratios | Q1 = 14 | Second 25% of P/E Ratios | Q2 = 19 | Third 25% of P/E Ratios | Q3 = 26 | Upper 25% of P/E Ratios

    These quartiles express central tendency and dispersion. What is the interquartile range? Because of clustering of identical data values, these quartiles do not provide clean cut points between groups of observations.

  • Percentiles and Quartiles: Caution. Quartiles generally resist outliers. However, quartiles do not provide clean cut points in the sorted data, especially in small samples with repeating data values. Although they have identical quartiles, these two data sets are not similar. The quartiles do not represent either data set well.

    Data set A: 1, 2, 4, 4, 8, 8, 8, 8. Q1 = 3, Q2 = 6, Q3 = 8
    Data set B: 0, 3, 3, 6, 6, 6, 10, 15. Q1 = 3, Q2 = 6, Q3 = 8
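
    The quartiles quoted for data sets A and B follow the median-of-halves definition given earlier (Q1 is the median of the values below Q2, Q3 the median of the values above it). A Python sketch of that definition follows; the function name is hypothetical, and excluding the middle observation from both halves when n is odd is one common convention, assumed here.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3), where Q1 and Q3 are the medians of the lower and upper halves."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # For odd n the middle observation is left out of both halves (an assumed convention).
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return median(lower), median(data), median(upper)

data_set_a = [1, 2, 4, 4, 8, 8, 8, 8]
data_set_b = [0, 3, 3, 6, 6, 6, 10, 15]
print(quartiles_by_halves(data_set_a), quartiles_by_halves(data_set_b))
```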

  • Correlation: Correlation Coefficient. The sample correlation coefficient is a statistic that describes the degree of linearity between paired observations on two quantitative variables X and Y.

  • Correlation: Correlation Coefficient. Its range is \(-1 \le r \le +1\). Excel's formula is =CORREL(Xdata, Ydata).
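
    As a hedged illustration, the sketch below computes the sample correlation coefficient from its usual definition, the sum of cross-products of deviations divided by the square root of the product of the two sums of squared deviations. The paired data are hypothetical and are not taken from the slides.

```python
from math import sqrt

def correlation(xs, ys):
    """Sample correlation coefficient r for paired observations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    cross = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    x_ss = sum((x - x_mean) ** 2 for x in xs)
    y_ss = sum((y - y_mean) ** 2 for y in ys)
    return cross / sqrt(x_ss * y_ss)

# Hypothetical paired observations, for illustration only.
x_data = [1, 2, 3, 4, 5]
y_data = [2, 4, 5, 4, 5]
r = correlation(x_data, y_data)
print(round(r, 4))
```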

  • Correlation: Correlation Coefficient. Illustration of Correlation Coefficients

  • Correlation: What is the nature of the relationship between square feet of shopping area and sales that is implied by the following correlation?

  • Grouped Data: Nature of Grouped Data. Although some information is lost, grouped data are easier to display than raw data. When bin limits are given, the mean and standard deviation can be estimated. Accuracy of grouped estimates depends on the number of bins, the distribution of data within bins, and the bin frequencies.

  • Grouped Data: Mean and Standard Deviation. Consider the frequency distribution for prices of Lipitor for three cities:

  • Grouped Data: Mean and Standard Deviation. Estimate the mean and standard deviation by \(\bar{x} = \frac{\sum_{j=1}^{k} f_j m_j}{n}\) and \(s = \sqrt{\frac{\sum_{j=1}^{k} f_j (m_j - \bar{x})^2}{n-1}}\), where \(m_j\) = class midpoint, \(f_j\) = class frequency, k = number of classes, and n = sample size. Note: don't round off too soon.
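
    The Lipitor frequency table is not reproduced in this transcript, so the sketch below uses a hypothetical set of bins purely to illustrate the midpoint-and-frequency calculation; the bin limits, frequencies, and names are all assumptions.

```python
from math import sqrt

# Hypothetical frequency distribution: ((lower limit, upper limit), frequency).
bins = [((60, 65), 3), ((65, 70), 7), ((70, 75), 8), ((75, 80), 2)]

n = sum(freq for _, freq in bins)                                     # sample size
midpoints = [((low + high) / 2, freq) for (low, high), freq in bins]  # (m_j, f_j) pairs

# Grouped mean: frequency-weighted average of the class midpoints.
grouped_mean = sum(f * m for m, f in midpoints) / n

# Grouped standard deviation: frequency-weighted squared deviations, divided by n - 1.
grouped_sd = sqrt(sum(f * (m - grouped_mean) ** 2 for m, f in midpoints) / (n - 1))

print(grouped_mean, round(grouped_sd, 4))
```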

  • Grouped Data: Accuracy Issues. Now estimate the coefficient of variation. How accurate are grouped estimates compared to ungrouped estimates? For the previous example, we can compare the grouped data statistics to the ungrouped data statistics.

  • Grouped Data: Accuracy Issues. Accuracy tends to improve as the number of bins increases. If the first or last class is open-ended, there will be no class midpoint (no mean can be estimated). Assume a lower limit of zero for the first class when the data are nonnegative. You may be able to assume an upper limit for some variables (e.g., age). Median and quartiles may be estimated even with open-ended classes.

  • Skewness and Kurtosis: Skewness. Generally, skewness may be indicated by looking at the sample histogram or by comparing the mean and median. This visual indicator is imprecise and does not take into consideration sample size n.

  • Skewness and Kurtosis: Skewness. Skewness is a unit-free statistic. The coefficient can be used to compare two samples measured in different units, or one sample with a known reference distribution (e.g., the symmetric normal distribution). Calculate the sample's skewness coefficient as: \(\text{Skewness} = \frac{n}{(n-1)(n-2)}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s}\right)^3\)

  • Skewness and Kurtosis: Skewness. In Excel, go to Tools | Data Analysis | Descriptive Statistics or use the function =SKEW(array).
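
    A minimal Python sketch of an adjusted sample skewness coefficient follows, written to match the formula Excel documents for SKEW: n / ((n-1)(n-2)) times the sum of cubed standardized deviations. The function name and the small test samples are hypothetical.

```python
from statistics import mean, stdev

def sample_skewness(values):
    """Adjusted sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s) ** 3)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three observations")
    xbar, s = mean(values), stdev(values)   # s is the sample standard deviation
    return n / ((n - 1) * (n - 2)) * sum(((x - xbar) / s) ** 3 for x in values)

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 10]
left_skewed = [1, 7, 8, 9, 10]
for sample in (symmetric, right_skewed, left_skewed):
    print(sample, round(sample_skewness(sample), 3))
```

    A symmetric sample gives a coefficient of about zero, a long right tail gives a positive value, and a long left tail gives a negative value.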

  • Skewness and Kurtosis: Skewness (Figure 4.36). Coefficients outside the range suggest the sample came from a non-normal population.

  • Skewness and Kurtosis: Kurtosis. Kurtosis is the relative length of the tails and the degree of concentration in the center. Consider three kurtosis prototype shapes, which differ in how heavy their tails are.

  • Skewness and Kurtosis: Kurtosis. A histogram is an unreliable guide to kurtosis, since scale and axis proportions may differ. Excel and MINITAB calculate kurtosis as: \(\text{Kurtosis} = \frac{n(n+1)}{(n-1)(n-2)(n-3)}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}\)
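
    The sketch below implements the sample excess kurtosis in the form Excel documents for KURT, so a normal population corresponds to a value near zero. The function name and the test sample are illustrative assumptions.

```python
from statistics import mean, stdev

def sample_excess_kurtosis(values):
    """Adjusted sample excess kurtosis (zero corresponds to a normal distribution)."""
    n = len(values)
    if n < 4:
        raise ValueError("need at least four observations")
    xbar, s = mean(values), stdev(values)   # s is the sample standard deviation
    fourth_power_sum = sum(((x - xbar) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth_power_sum - correction

flat_sample = [1, 2, 3, 4, 5]  # evenly spread values: flatter than a bell curve
print(round(sample_excess_kurtosis(flat_sample), 2))
```

    For the evenly spread sample the coefficient is negative (-1.2), which indicates lighter tails than a normal distribution.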

  • Skewness and Kurtosis: Kurtosis. Coefficients outside the range would suggest the sample differs from a normal population.
