measures of central tendency measures of dispersion

Measures of Central Tendency

Measures of Dispersion

Multiple Choice. 30 questions.

Lectures 4 and 5 (today). 10% of your course grade.

In this room. 45 minutes. 8:10 start. 8:55 end.

DO NOT MISS IT. THERE WILL BE NO MAKE-UPS.

SECOND QUIZ NEXT WEEK

The Frightening Power of Central Tendency (George Carlin, funniest guy ever)

REGRESSION TO THE MEAN:

Eventually, everything becomes mediocre.*

*(Late 16th century: from French médiocre, from Latin mediocris ‘of middle height or degree’.)

Measures of Central Tendency

The value that best represents the mid-point of a set of values, but which may not actually be found in the set of values themselves. Major types are:

• Means:
  - Arithmetic
  - Weighted/Grouped
  - Geometric
  - Harmonic
  - Trimmed
• Median
• Mode

Some are more robust than others…

What Does Being Robust Mean?

When a statistic is robust it means that deviations from the underlying assumptions of a data distribution do not affect the statistic’s ability to represent the data values that comprise a dataset’s distribution.

WHAT DOES THAT MEAN?

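That a sample’s statistics are a good representation of what’s happening in the population from which it came.

And, the larger a sample gets, the more closely its statistics approximate the population’s statistics.

This lecture calls this regression to the mean, or The Wisdom of the Crowd. (The convergence of a sample mean on the population mean as the sample grows is formally known as the law of large numbers.)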

[Figure: scatter of the guesses with the actual mean, the running mean, and a linear trend of the running mean. x-axis: Number of Guesses (1 to 49); y-axis: Number of Popcorn Kernels (0 to 10,000).]

The Wisdom of the Crowd: Regression to the Mean Experiment

You have a jar of popcorn kernels and ask people how many are in it. You get a range of answers, one guess for each person you ask.

There are 52 guesses (blue dots). The actual number of kernels is 5,524 (red line).

Note that the running mean of the guesses (green solid line) converges on, or regresses to, the actual mean (see the green dotted trend line).
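A running mean of this kind can be sketched in a few lines of Python. This is only an illustration: the guesses below are hypothetical stand-ins, not the 52 guesses from the experiment, and the function name is an assumption.

```python
from itertools import accumulate


def running_mean(values):
    """Return the mean of the first k values, for k = 1..len(values)."""
    totals = accumulate(values)  # cumulative sums
    return [total / count for count, total in enumerate(totals, start=1)]


# Hypothetical guesses for a jar that actually holds 5,524 kernels.
guesses = [3000, 9000, 4500, 7000, 5200, 6100, 4800, 5600]

for count, mean in enumerate(running_mean(guesses), start=1):
    print(f"after {count} guesses: running mean = {mean:,.0f}")
```

Each new guess moves the running mean less than the one before it, which is why the line settles down as guesses accumulate.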

So What Does Regression to the Mean, Mean?

What regression to the mean (more or less) means is that eventually, with a large enough sample, sample values for a variable will get closer and closer to the mean of all values for the variable.

Put another way, the further a given sample value is from the mean, the higher the probability that the next sample will be closer to the mean.

But to say this we need to have some assumptions about the sample’s view of the population’s reality.

These are called the sample’s underlying assumptions.

The Assumptions Underlying Why The Crowd’s View is a Good Approximation of The Population’s Reality

These are the underlying assumptions for The Crowd: data distributions are normally distributed (a.k.a. bell shaped).

This means that:
• They have no outliers.
• They have no gaps.
• They are not skewed (skewness).
• They are not peaked (kurtosis).
• They have no extreme values.
• They are not bi-modal (two peaks).
• They are not poly-modal (many peaks).
• Their measures of central tendency are equal (hmmm).

This is called being “robust”.

What Does Being Robust Mean?

Deviant distributions (distributions that deviate from the assumptions) happen because of non-normal attributes such as:

• extreme values
• bi-modality or poly-modality
• outliers
• gaps
• skewness
• kurtosis
• extreme differences between values

These, thankfully for Statistics, are rare occurrences, but we must always check for them.

What Does Being Robust Not Mean?

Robustness: ability to withstand assumption violation

Discrimination: ability to accurately represent a set of values

Statistical tools with high robustness usually have lower ability to discriminate.

Statistical tools with low robustness usually have higher ability to discriminate.

Being robust does not mean being better or more accurate; quite the opposite. It means that a robust statistical tool is better able to withstand assumption violation but less able to discriminate accurately.

Parametric and Non-parametric Tools

Parametric statistical tools usually have a low ability to withstand assumption violation.

Non-parametric statistical tools usually have a high ability to withstand assumption violation.

Non-parametric statistical tools usually have higher robustness but have a lower ability to discriminate.

Parametric statistical tools usually have lower robustness but have a higher ability to discriminate.

The assumptions underlying statistical distributions are called parameters.

Therefore…

The Effect of Extreme Values and Outliers

Ten people are sitting at a bar. Each earns $50K a year.

Their average is $50,000.

In walks Bill Gates and sits down. He earns $1,000,000,000 a year. Their average is now about $91 million.

The data and the averages are both correct but the result is ridiculous because an underlying assumption, that there are no extreme values, has been violated.

Moral: don’t hang around in bars
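The bar example is easy to check. The sketch below uses Python's standard `statistics` module with the slide's own figures; the variable names are illustrative.

```python
from statistics import mean, median

bar_patrons = [50_000] * 10                 # ten people earning $50K each
with_gates = bar_patrons + [1_000_000_000]  # the slide's figure for the newcomer

print(f"before: mean = {mean(bar_patrons):,.0f}, median = {median(bar_patrons):,.0f}")
print(f"after:  mean = {mean(with_gates):,.0f}, median = {median(with_gates):,.0f}")
```

The mean jumps to roughly $91 million while the median stays at $50,000, which is the robustness argument in miniature.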

n=9

                  Value  Rank  Incremental  Incremental  Difference  Difference
                               difference   difference   mean to     median to
                               each value   each rank    value       rank
                  1      1     1            1            -13         -4
                  2      2     1            1            -12         -3
                  3      3     1            1            -11         -2
                  4      4     1            1            -10         -1
                  5      5     1            1            -9          0
                  6      6     1            1            -8          1
                  7      7     1            1            -7          2
                  8      8     82           1            -6          3
                  90     9     na           na           76          4
Median            5      5     1            1            n/a         0
Arithmetic mean   14     5     11.13        1            0           0

Mean is almost 3 times the median.

Middle value

Extreme value

Why Median Income is a Robust Statistic

Ranks ‘neutralize’ the extreme value.

These are our values and where they lie on the distribution.

This is the median (5)

This is the mean (14)

This is the outlier – waaaaay out.

…and it drags the mean out with it.

Thus the median is the more ‘robust’ statistic because it is less affected by the extreme value; it represents the dataset more accurately.

n=9

                  Value  Rank  Incremental  Incremental  Difference  Difference
                               difference   difference   mean to     median to
                               each value   each rank    value       rank
                  1      1     1            1            -4          -4
                  2      2     1            1            -3          -3
                  3      3     1            1            -2          -2
                  4      4     1            1            -1          -1
                  5      5     1            1            0           0
                  6      6     1            1            1           1
                  7      7     1            1            2           2
                  8      8     1            1            3           3
                  9      9     na           na           4           4
Median            5      5     1            1            n/a         0
Arithmetic mean   5      5     1            1            0           0

Being Robust – Removing the Extreme Value

These are our values and where they lie on the distribution.

This is the median and the mean.

Outlier is gone.

…so the mean moves back to better represent the data values.

Now the mean is the better statistic because it is more accurate, since it uses arithmetic on every value to represent the dataset. The median is still the more robust statistic, but it does not discriminate as well.

Calculating: Mean, Median, Mode

Arithmetic Mean

Returns the arithmetic centre of the data distribution, such that the sum of all differences between data values and the mean equals zero.

The arithmetic mean is the arithmetic middle point of a set of values.

This means that the differences between any value x in a dataset and the mean of that dataset will sum to zero.

THIS DOES NOT MEAN THAT THE ARITHMETIC MEAN IS THE MID POINT OF THE DATASET’S DISTRIBUTION

BECAUSE THE ARITHMETIC MEAN IS STRONGLY INFLUENCED BY EXTREME VALUES.

Arithmetic lives here.

Returns the exact centre value of the dataset and hence the second quartile value. Half the values of the dataset will be above the median and half will be below.

The median is the middle case of a set of cases (records or rows).

THIS MEANS THAT THE MEDIAN IS THE EXACT CENTRAL POINT OF A DATASET’S DISTRIBUTION: 50% of values will be below the median and 50% will be above.

THIS MEANS THAT THE MEDIAN IS NOT AFFECTED BY EXTREME VALUES.

Median

Arithmetic does not live here.

Returns the most frequently occurring data value in the dataset. Sometimes reported as a label, when the “value” is nominal level data, e.g. religious or political affiliation.

CASE RELIGIOUS AFFILIATION

Person #1 Protestant


Person #3 Muslim

Person #4 Catholic


Person #6 Jewish

Person #7 Catholic

The “modality” of this sample would be Protestant because it is the most frequently occurring “value”.

Mode (or Modality)

An Example of How They All Work Together

You work for a computer manufacturer and are asked to do a quality control analysis, so you gather data on the number of faults your computers have.

These data show that your company has an average of 9.1 faults per 100 computers and that your competition has 9.1 faults per 100 as well.

In other words the average number of faults is about the same.

Should you report to your boss that your company is alright? It’s no worse than your competitors?

Perhaps not, because you are a smart statistician and you collected more than the bare bones dataset.

Percentage of Faults by Category

Number of Faults Per Unit   Your Company   Your Competitor
Zero                        30             16
One                         20             17
Two                         10             9
Three                       4              8
Four                        3              12
Five                        2              8
Six                         2              8
Seven                       1              8
Eight                       2              7
Nine                        0              3
Ten+                        26             4
TOTAL                       100            100
MEAN                        9.1            9.1
MEDIAN                      One            Three

Rather than just the very basic arithmetic mean, you collected these data.

They show what proportion of the machines had no faults, 1 fault, 2 faults etc.

The average proportion of faults stays at 9.1%, but the median shows that your company is doing much better, with 50% of machines having 1 or no faults, whereas your competitor’s median shows 50% of machines having up to 3 faults.

Your problem lies in the 26% of machines having 10 or more faults.

Further investigation shows that one of your assembly lines is the culprit, due to sloppy workers.
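The median category can be read off the table by accumulating the percentages until they reach 50%. The following sketch does that with the figures from the table; the function name and the "first category to reach 50%" rule are assumptions made for illustration.

```python
def median_category(labels, shares):
    """Return the first category at which the cumulative share reaches 50%."""
    cumulative = 0
    for label, share in zip(labels, shares):
        cumulative += share
        if cumulative >= 50:
            return label
    raise ValueError("shares never reach 50%")


labels = ["Zero", "One", "Two", "Three", "Four", "Five",
          "Six", "Seven", "Eight", "Nine", "Ten+"]
your_company = [30, 20, 10, 4, 3, 2, 2, 1, 2, 0, 26]
competitor = [16, 17, 9, 8, 12, 8, 8, 8, 7, 3, 4]

# The mean of the eleven category percentages is identical for both firms...
print(round(sum(your_company) / len(labels), 1), round(sum(competitor) / len(labels), 1))
# ...but the median categories differ.
print(median_category(labels, your_company), median_category(labels, competitor))
```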

[Figure: Percentage of Units with Specific Number of Faults. Series: Your Company, Competitor. x-axis: Number of Faults (Zero to Ten+); y-axis: Percentage of Faults (0 to 35).]

The Arithmetic Mean

Sample: $\bar{x} = \frac{\sum x}{n}$

Population: $\delta_x = \frac{\sum x}{N}$

where:
$\bar{x}$ = the arithmetic mean
$\sum$ = the sum of all x’s
x = a value in the dataset
n = the number of values (or cases) in the dataset

SAMPLE AND POPULATION SYMBOLOGY

• In formulas for a sample (such as the arithmetic mean), Latin letters are used, and a lower case ‘n’ is used for the number of cases.

• In formulas for a population (such as the arithmetic mean), Greek symbols and letters are used (here delta, although mu, μ, is the more common symbol for a population mean) and an upper case ‘N’ is used for the number of cases.

THE ARITHMETIC MEAN - DIFFERENCES SUM TO ZERO

e.g.
38.25 - 21 = 17.25
38.25 - 34 = 4.25
etc.

        Data Values   Differences from Mean
        21            17.25
        34            4.25
        45            -6.75
        56            -17.75
        54            -15.75
        43            -4.75
        32            6.25
        21            17.25
Sum     306           0
N       8
Mean    38.25
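The zero-sum property can be verified directly with the eight values from the table. A minimal sketch, with illustrative variable names:

```python
values = [21, 34, 45, 56, 54, 43, 32, 21]

mean = sum(values) / len(values)            # 306 / 8
differences = [mean - x for x in values]    # same sign convention as the table

print(mean)
print(differences)
print(sum(differences))
```

The positive and negative differences cancel exactly, which is what makes the mean the arithmetic balance point of the dataset.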

Remember this.

         Dataset #1    Dataset #2    Dataset #3
         $45,000.00    $80,000.00    $80,000.00
         $45,000.00    $45,000.00    $45,000.00
         $45,000.00    $45,000.00    $45,000.00
         $43,000.00    $43,000.00    $43,000.00
         $41,000.00    $41,000.00    $41,000.00
         $40,000.00    $40,000.00    $40,000.00
         $37,000.00    $37,000.00    $37,000.00
         $35,000.00    $35,000.00    $1.00
Sum      $331,000.00   $366,000.00   $331,001.00
n        8             8             8
Mean     $41,375.00    $45,750.00    $41,375.13
Median   $42,000.00    $42,000.00    $42,000.00
Mode     $45,000.00    $45,000.00    $45,000.00

THE ARITHMETIC MEAN - EFFECT OF EXTREME VALUES

Middle dataset is unbalanced. It has one extreme value that pulls the mean higher than all but that extreme value.

Note that median and mode are not affected.

An opposite extreme value balances the dataset again.

When an extreme value is present the median should be used and not the arithmetic mean because the distribution will be skewed.

BUT YOU ARE STILL LEFT WITH EXTREME VALUES. These will affect the deviation in the dataset (s and s²).


THE MEDIAN

The median is the middle point of a set of cases.

Half the values above… …and half below.

Since there is an even number of values you take the mean of the centre two values: $42,000.

If there were an odd number of values, then the single middle value would be the median.

Mean and Median: Points to Remember

The Mean is the arithmetic middle point of a set of values. It is calculated arithmetically from all data values. It is not the exact mid point of a set of values because it is strongly influenced by extreme values.

BECAUSE OF THIS THE ARITHMETIC MEAN IS NOT A ROBUST STATISTIC BUT IT IS A DISCRIMINATING ONE.

The Median is the middle point of a set of cases (records or rows). It is calculated by dividing the number of rows into two halves. Because it is the exact mid point of a set of cases it is not influenced by extreme values.

BECAUSE OF THIS THE MEDIAN IS A ROBUST STATISTIC BUT IT IS NOT A DISCRIMINATING ONE.


The Mode

The most frequently occurring data value (in this example $45,000 in each dataset).

Gaps and Outliers

The following histogram has outliers: there are three cities in the leftmost bar. This creates a gap where there are effectively no values.

Gap

Outliers

Using Measures of Central Tendency

Use the method that returns the most information about the centre of the dataset, usually the arithmetic mean. BUT…

With highly skewed (such as income) or non-unimodal datasets the median should be used.

Means and medians cannot be used with nominal level data; the mode can be used to describe the most frequently occurring label.

USING THE MODE IS CALLED MODAL ANALYSIS.

Other means to an end…

• Weighted mean: Useful when the ‘x’s have unequal weights as in grade calculations (e.g. tests worth 20%, labs worth 30%, etc).

• Grouped data mean: Useful when you only have data in categories, as with income classes. It is a special case of the weighted mean.

• Geometric mean: Useful when you have percentages, ratios, indexes or data covering several orders of magnitude.

• Harmonic mean: Useful when you have rates, as in calculating average speeds.

• Trimmed mean: Useful for removing outliers.

Weighted Mean

$\bar{x}_w = \frac{\sum x_i w_i}{\sum w_i}$

Where:
$\bar{x}_w$ = weighted mean
$x_i$ = data value
$w_i$ = weight of data value

The weighted mean is used when data values have weighting schemes, as with the grades in this course.

Weighted mean example #1

Component         Weight (wi)   Your Mark (xi)   Mark X Weight (xi * wi)
Lab Assignments   45%           86%              3870
Tests             15%           80%              1200
Final Exam        40%           80%              3200
Totals            100%          246              8270

The weighted mean method: 8270/100 = 82.7% (∑wi = 100)
The arithmetic mean method: 246/3 = 82% (n = 3)

Weighted mean example #2: Changing the weights

Component         Weight (wi)   Your Mark (xi)   Mark X Weight (xi * wi)
Lab Assignments   10%           86%              860
Tests             30%           80%              2400
Final Exam        60%           80%              4800
Totals            100%          246              8060

The weighted mean method: 8060/100 = 80.6% (∑wi = 100)
The arithmetic mean method: 246/3 = 82% (n = 3)
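Both grade examples can be reproduced with a short function. This is a sketch of the weighted mean formula as defined above; the function name is illustrative, and the weights are entered as percentages exactly as in the tables.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)


marks = [86, 80, 80]        # lab assignments, tests, final exam

example_1 = weighted_mean(marks, [45, 15, 40])
example_2 = weighted_mean(marks, [10, 30, 60])
unweighted = sum(marks) / len(marks)

print(example_1, example_2, unweighted)
```

Shifting weight away from the strongest component (the labs) lowers the weighted mean, while the plain arithmetic mean of the three marks does not change.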

Grouped data mean example

Population in Census Tract 12345.6

Income Classes     Frequency (f) (wi)   Class midpoints (CM) (xi)   CM X f (wi * xi)
$0-$10,000         22                   $5,000                      $110,000
$10,001-$20,000    56                   $15,000                     $840,000
$20,001-$30,000    81                   $25,000                     $2,025,000
$30,001-$40,000    45                   $35,000                     $1,575,000
$40,001-$50,000    23                   $45,000                     $1,035,000
$50,001-$60,000    15                   $55,000                     $825,000
>$60,000           7                    Excluded                    0
Totals             249                  $180,000                    $6,410,000

The weighted mean method: $6,410,000/249 = $25,742.97 (∑wi = 249)
The arithmetic mean method: $180,000/6 = $30,000 (six class midpoints)
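The grouped mean is the weighted mean with frequencies as weights and class midpoints as values. The sketch below reproduces the slide's figure; note that the slide divides by all 249 cases, including the 7 in the open-ended class that has no midpoint. Dividing by only the 242 cases that have a midpoint is shown as an alternative for comparison and is not part of the slide.

```python
frequencies = [22, 56, 81, 45, 23, 15]
midpoints = [5_000, 15_000, 25_000, 35_000, 45_000, 55_000]
excluded_count = 7   # the ">$60,000" class, which has no midpoint

weighted_sum = sum(f * m for f, m in zip(frequencies, midpoints))

# As on the slide: the total frequency includes the excluded class.
slide_mean = weighted_sum / (sum(frequencies) + excluded_count)

# Alternative (an assumption, not from the slide): only classes with a midpoint.
classes_only_mean = weighted_sum / sum(frequencies)

print(f"{weighted_sum:,} {slide_mean:,.2f} {classes_only_mean:,.2f}")
```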

Geometric mean

$GM = \sqrt[n]{\prod x} = \sqrt[n]{x_1 \cdot x_2 \cdots x_n}$

Where:
GM: geometric mean
x: data values
$\sqrt[n]{\ }$: nth root of the product of all x

The ∏ symbol is the upper case Greek letter pi and signifies the product of a set of multiplications.

Used extensively in biology and finance.

Geometric mean: use when your data:
• Are percentages, ratios, indexes or growth rates;
• Have an exponential distribution;
• Have a high value more than 3 times the low value;
• Cover several orders of magnitude.

Geometric mean: do not use when your data:
• Are already log scaled such as decibels or pH;
• Have a high value less than 3 times the low value.

Geometric Mean Example

Example using bacteria counts (they typically vary widely)

Water Sample #    Enteric Bacteria Count per ml
1                 6
2                 50
3                 9
4                 1200

Arithmetic mean   316.25
Geometric mean    42.43

$GM = \sqrt[4]{6 \times 50 \times 9 \times 1200} = 42.43$

Basically the data are log transformed.

Thus extreme values are tempered.
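The log transformation can be made explicit in code: the geometric mean is the exponential of the arithmetic mean of the logs. A minimal sketch using the four bacteria counts; the function name is illustrative.

```python
import math


def geometric_mean(values):
    """exp of the mean of the natural logs; defined for positive values only."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs strictly positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


counts = [6, 50, 9, 1200]

print(sum(counts) / len(counts))   # arithmetic mean
print(geometric_mean(counts))      # geometric mean
```

The single count of 1200 dominates the arithmetic mean but only nudges the geometric mean.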

Harmonic mean

$HM = \frac{n}{\sum \frac{1}{x}}$

Where:
HM: harmonic mean
1/x: reciprocals of data values
n: number of data values

Useful when you have rates per unit, such as distance per unit of time (speed), to average out.

Harmonic mean: example of transportation & speed

What’s the average speed for a delivery truck given these data:

Segment          Length (km)   Speed (kph)   Time taken
Outbound         180           60            180/60 = 3 hrs
Inbound          180           80            180/80 = 2.25 hrs
Total distance   360

Arithmetic mean speed = (60 kph + 80 kph)/2 = 70 kph.
But the time taken is 3 hrs + 2.25 hrs = 5.25 hrs.

Therefore the actual average speed = 360 km / 5.25 hrs = 68.57 kph.

This is what the harmonic mean does:
HM = 2/((1/60 kph) + (1/80 kph)) = 68.57 kph

The difference is small, but over more segments it can be significant.
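The truck example can be checked by computing the average speed two ways: total distance over total time, and the harmonic mean of the two speeds. The sketch assumes, as in the example, that both segments are the same length; the function name is illustrative.

```python
def harmonic_mean(values):
    """Number of values divided by the sum of their reciprocals."""
    return len(values) / sum(1 / v for v in values)


segment_km = 180
speeds_kph = [60, 80]

total_time = sum(segment_km / speed for speed in speeds_kph)   # 3 + 2.25 hours
actual_average = (segment_km * len(speeds_kph)) / total_time

print(round(sum(speeds_kph) / len(speeds_kph), 2))   # arithmetic mean of the speeds
print(round(actual_average, 2))
print(round(harmonic_mean(speeds_kph), 2))
```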

Trimmed Mean

This is easy: it is any mean where the outliers have been stripped or trimmed away.

Thus you would sort your data and drop the top and bottom 10% of your values. This is called a 10% trimmed mean.

You can drop whatever proportion of the dataset you wish, within reason.

         Dataset #1    Dataset #2    Dataset #3
         $45,000.00    $80,000.00    $80,000.00
         $45,000.00    $45,000.00    $45,000.00
         $45,000.00    $45,000.00    $45,000.00
         $43,000.00    $43,000.00    $43,000.00
         $41,000.00    $41,000.00    $41,000.00
         $40,000.00    $40,000.00    $40,000.00
         $37,000.00    $37,000.00    $37,000.00
         $35,000.00    $35,000.00    $1.00
Sum      $331,000.00   $366,000.00   $331,001.00
n        8             8             8
Mean     $41,375.00    $45,750.00    $41,375.13
Median   $42,000.00    $42,000.00    $42,000.00
Mode     $45,000.00    $45,000.00    $45,000.00

After trimming the extreme values:
Sum      -             $286,000.00   $251,000.00
n        -             7             6
Mean     $41,375.00    $40,857.14    $41,833.33
Median   $42,000.00    $41,000.00    $42,000.00
Mode     $45,000.00    $45,000.00    $45,000.00

Example of Trimmed Mean
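The trimmed figures in the table can be reproduced by sorting each dataset and dropping its extreme values before averaging. In this sketch the number of values dropped from each end is passed in explicitly (a hypothetical interface chosen to match the table, where Dataset #2 loses one high value and Dataset #3 loses one value from each end).

```python
def trimmed_mean(values, drop_low=0, drop_high=0):
    """Mean after removing the drop_low smallest and drop_high largest values."""
    ordered = sorted(values)
    kept = ordered[drop_low:len(ordered) - drop_high]
    if not kept:
        raise ValueError("nothing left after trimming")
    return sum(kept) / len(kept)


dataset_2 = [80_000, 45_000, 45_000, 43_000, 41_000, 40_000, 37_000, 35_000]
dataset_3 = [80_000, 45_000, 45_000, 43_000, 41_000, 40_000, 37_000, 1]

print(round(trimmed_mean(dataset_2, drop_high=1), 2))
print(round(trimmed_mean(dataset_3, drop_low=1, drop_high=1), 2))
```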

[Figure: histogram of GNI Per Capita with a Moving Average (period 2) and a Trend Line. x-axis: U.S. $ (5,000 to 50,000); y-axis: Frequency (0 to 90).]

Moving average trend lines produce a smoother ‘actual value’ trend line that is based on consecutive recalculations of an arithmetic mean of set size.

The weakness is that the line loses values, depending on what size you make the averaging group.

Calculating the Moving Average

Original Data Values   Moving Average, Period 2   Moving Average, Period 3
10                     (10+22)/2 = 16             (10+22+12)/3 = 14.67
22                     (22+12)/2 = 17             (22+12+14)/3 = 16
12                     (12+14)/2 = 13             (12+14+16)/3 = 14
14                     (14+16)/2 = 15             (14+16+20)/3 = 16.67
16                     (16+20)/2 = 18             (16+20+32)/3 = 22.67
20                     (20+32)/2 = 26
32

# of values to be plotted:
7                      6                          5
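The table can be generated with a sliding window. The sketch below uses the seven original values and shows why the line "loses values": a window of size p over n values yields only n - p + 1 averages. The function name is illustrative.

```python
def moving_average(values, period):
    """Mean of each run of `period` consecutive values."""
    if period < 1 or period > len(values):
        raise ValueError("period must be between 1 and len(values)")
    return [sum(values[i:i + period]) / period
            for i in range(len(values) - period + 1)]


data = [10, 22, 12, 14, 16, 20, 32]

for period in (1, 2, 3):
    averages = moving_average(data, period)
    print(period, len(averages), [round(a, 2) for a in averages])
```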

All Geography students are above average.

Measures of Dispersion

What is Dispersion?

Refers to the way in which quantitative data values are dispersed or spread out in a dataset.

The most powerful dispersion statistics calculate the quantitative spread of the data values around the arithmetic mean and are called measures of deviation.

The various measures of deviation calculate the arithmetic differences between each data value and the arithmetic mean of the dataset.

Why bother with measuring deviation?

Consider the following datasets:

Dataset A: 3+3+3+3+3
Dataset B: 1+1+1+2+10

First we calculate their arithmetic means using:

$\bar{x} = \frac{\sum x}{n}$

Dataset A: $\bar{x}$ = 3
Dataset B: $\bar{x}$ = 3

Are they the same? According to the mean they are.

Then we calculate their standard deviations using:

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

Dataset A: s = 0
Dataset B: s = 3.94

Same means, very different standard deviations. So are the datasets the same, or not?
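The two five-value datasets can be run through the sample standard deviation formula (n - 1 in the denominator) to confirm the figures. A minimal sketch with an illustrative function name:

```python
import math


def sample_std(values):
    """Sample standard deviation: sqrt of summed squared deviations over n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))


flat = [3, 3, 3, 3, 3]
spread = [1, 1, 1, 2, 10]

print(sum(flat) / len(flat), round(sample_std(flat), 2))
print(sum(spread) / len(spread), round(sample_std(spread), 2))
```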

Measures of Dispersion and Deviation

The Range (a measure of dispersion):
The range is the difference between the lowest value (called MIN) and the highest value (called MAX) in a dataset.

The Standard Deviation (a measure of deviation):
Measures the average difference between a data value and the arithmetic mean of all data values.

The Variance (a measure of deviation):
Squares the average difference between a data value and the arithmetic mean of the data set. Thus it is the standard deviation squared.

The Range (Range = MAX - MIN)

The Range

The range describes the span of your dataset, from the minimum value (MIN) to the maximum value (MAX) using:

Range = MAX – MIN

Used as a measure of data dispersion NOT deviation, because deviation implies a difference between your data values and something, e.g. the arithmetic mean.

The Range is used in finding histogram (or bar chart) classes.

         Dataset #1    Dataset #2    Dataset #3
         $45,000.00    $80,000.00    $80,000.00
         $45,000.00    $45,000.00    $45,000.00
         $45,000.00    $45,000.00    $45,000.00
         $43,000.00    $43,000.00    $43,000.00
         $41,000.00    $41,000.00    $41,000.00
         $40,000.00    $40,000.00    $40,000.00
         $37,000.00    $37,000.00    $37,000.00
         $35,000.00    $35,000.00    $1.00
Sum      $331,000.00   $366,000.00   $331,001.00
n        8             8             8
Mean     $41,375.00    $45,750.00    $41,375.13
Median   $42,000.00    $42,000.00    $42,000.00
Mode     $45,000.00    $45,000.00    $45,000.00
MAX      $45,000.00    $80,000.00    $80,000.00
MIN      $35,000.00    $35,000.00    $1.00
Range    $10,000.00    $45,000.00    $79,999.00

Even the range is telling us more about the data than just the central tendency measures do.

Compare dataset #1 with #3.

The Standard Deviation (s)

The Standard Deviation

The standard deviation measures the average difference between a data value and the arithmetic mean of all data values. It is given by:

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

Where:
s is the sample standard deviation
x is a value in the dataset
$\bar{x}$ is the arithmetic mean of the dataset
n is the number of values in the dataset

The standard deviation and the variance are related insofar as s is the square root of the variance (or the variance is s²). s is the most widely used measure of deviation, though it should always be used in conjunction with the variance.

Interpreting the Standard Deviation Formula

Subtract the arithmetic mean $\bar{x}$ from each data value x and sum the differences:

$\sum (x - \bar{x})$

But this returns a set of plus and minus differences that add to zero. So to remove the signs we square each difference and sum the squared differences …

$\sum (x - \bar{x})^2$

… then divide by n - 1 and take the square root to return to the magnitudes of the original values.

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

A reminder of the effect of squaring…

[Figure: the numbers 1 to 20 plotted with their squares (y-axis 0 to 450). The numbers form an arithmetic progression; the squares form a quadratic progression.]

#     #²
1     1
2     4
3     9
4     16
5     25
6     36
7     49
8     64
9     81
10    100
11    121
12    144
13    169
14    196
15    225
16    256
17    289
18    324
19    361
20    400

… it emphasizes higher values

x      x-mean   x-mean squared   sqrt of x-mean squared
1      -9.5     90.25            9.5
2      -8.5     72.25            8.5
3      -7.5     56.25            7.5
4      -6.5     42.25            6.5
5      -5.5     30.25            5.5
6      -4.5     20.25            4.5
7      -3.5     12.25            3.5
8      -2.5     6.25             2.5
9      -1.5     2.25             1.5
10     -0.5     0.25             0.5
11     0.5      0.25             0.5
12     1.5      2.25             1.5
13     2.5      6.25             2.5
14     3.5      12.25            3.5
15     4.5      20.25            4.5
16     5.5      30.25            5.5
17     6.5      42.25            6.5
18     7.5      56.25            7.5
19     8.5      72.25            8.5
20     9.5      90.25            9.5

Mean of x: 10.5   Sum of x-mean: 0.0

Why Squares and Roots?

The difference $x - \bar{x}$ produces negative numbers and a sum of zero, but…

… the square of a number is always positive, and…

… differences between squares increase more rapidly than differences between original numbers, so…

… taking the square root of the squared data values simply returns them to the original numbers, and also removes the sign.

[Figure: the numbers 1 to 20 and their squares (y-axis 0 to 450). Series: numbers, squares, num diff, square diffs.]

This is a list of numbers, x.

         Dataset #1    Dataset #2    Dataset #3
         $45,000.00    $80,000.00    $80,000.00
         $45,000.00    $45,000.00    $45,000.00
         $45,000.00    $45,000.00    $45,000.00
         $43,000.00    $43,000.00    $43,000.00
         $41,000.00    $41,000.00    $41,000.00
         $40,000.00    $40,000.00    $40,000.00
         $37,000.00    $37,000.00    $37,000.00
         $35,000.00    $35,000.00    $1.00
Sum      $331,000.00   $366,000.00   $331,001.00
n        8             8             8
Mean     $41,375.00    $45,750.00    $41,375.13
Median   $42,000.00    $42,000.00    $42,000.00
Mode     $45,000.00    $45,000.00    $45,000.00
MAX      $45,000.00    $80,000.00    $80,000.00
MIN      $35,000.00    $35,000.00    $1.00
Range    $10,000.00    $45,000.00    $79,999.00
s        $3,852.18     $14,290.36    $21,559.86

Low s means that the data are clustered around the mean (data are leptokurtic or ‘peaked’).

REMEMBER: s values do not indicate skewness. They do indicate kurtosis.

High s means that the data are spread out around the mean (data are platykurtic or ‘flat’).

Review Slide: Standard Deviation and the ‘Shape’ of Data

[Figure: three frequency curves centred on $\bar{x}$, labelled ‘Small’, ‘Normal’ and ‘Large’ standard deviation. y-axis: Frequency.]

This ‘peakedness’ of the distribution is called kurtosis. Use the kurtosis statistic to test for normality.

The Variance (s²)

The Variance

Squares the average difference between a data value and the arithmetic mean of the data set. It is given by:

$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

Where:
s² is the sample variance
x is a value in the dataset
$\bar{x}$ is the arithmetic mean of the dataset
n is the number of values in the dataset

Since it uses the arithmetic mean, it is subject to the same effect of extreme values, only much more so because of the effect of squaring.

Interpreting the Variance Formula

Subtract the arithmetic mean $\bar{x}$ from each data value x and sum the differences:

$\sum (x - \bar{x})$

But this returns a set of plus and minus differences that adds to zero.

So to remove the signs we square each difference thus:

$\sum (x - \bar{x})^2$

…and sum the squared differences, then divide by n - 1:

$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

Variance and SD Compared

By squaring the differences you remove the negative signs and exaggerate more extreme differences to make them more obvious for analysis.

$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

By taking the square root you return the differences to their original magnitude but the signs are removed so the differences no longer sum to zero.

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

In comparing the two, when s is small, the difference between the variance (s²) and s is smaller than if s is large. That’s what happens when you square numbers.

         Dataset #1       Dataset #2        Dataset #3
         $45,000.00       $80,000.00        $80,000.00
         $45,000.00       $45,000.00        $45,000.00
         $45,000.00       $45,000.00        $45,000.00
         $43,000.00       $43,000.00        $43,000.00
         $41,000.00       $41,000.00        $41,000.00
         $40,000.00       $40,000.00        $40,000.00
         $37,000.00       $37,000.00        $37,000.00
         $35,000.00       $35,000.00        $1.00
Sum      $331,000.00      $366,000.00       $331,001.00
n        8                8                 8
Mean     $41,375.00       $45,750.00        $41,375.13
Median   $42,000.00       $42,000.00        $42,000.00
Mode     $45,000.00       $45,000.00        $45,000.00
MAX      $45,000.00       $80,000.00        $80,000.00
MIN      $35,000.00       $35,000.00        $1.00
Range    $10,000.00       $45,000.00        $79,999.00
s        $3,852.18        $14,290.36        $21,559.86
s²       $14,839,285.71   $204,214,285.71   $464,827,464.41

Note that the highest s is 5.6 times the lowest whereas the highest s² is 31 times the lowest. This is the effect of squaring extreme values.

N and n-1

Why do the sample standard deviation and sample variance (in fact, sample anything) formulas have n-1 as the denominator?

$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

Because n-1 gives a more conservative estimate of deviation by increasing the standard deviation and variance values.

If you have a larger standard deviation or variance, you have a higher standard to pass in making your case. Why?

Because if you are testing to see if a data value is 1.96 s away from the mean of its dataset, then a larger s means the data value has to meet a stricter test, i.e. it has to lie further from the mean.

Sample versus population: n-1 versus N

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$ for a sample, with N in place of n - 1 for a population.

Sample     Value of numerator   Biased estimate of        Unbiased estimate of       Difference between
size (n)   in standard          population standard       population standard        biased and unbiased
           deviation formula    deviation (i.e.           deviation (dividing        estimates
                                dividing by N)            by n-1)
10         500                  √(500/10) = 7.07          √(500/(10-1)) = 7.45       .38 (5.0%)
100        500                  √(500/100) = 2.24         √(500/(100-1)) = 2.25      .01 (0.4%)
1000       500                  √(500/1000) = 0.7071      √(500/(1000-1)) = 0.7075   .0004 (0.056%)

Source: After Salkind, page 40.

Note:
1. With n-1 the standard deviation is higher.
2. The larger the sample, the smaller the effect of n-1.
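The table's square roots can be recomputed in a few lines. The sketch keeps the numerator fixed at 500, as the table does, and varies only the denominator; the function name is illustrative.

```python
import math


def std_estimates(numerator, n):
    """Return (dividing by N, dividing by n - 1) for a fixed sum of squared deviations."""
    return math.sqrt(numerator / n), math.sqrt(numerator / (n - 1))


for n in (10, 100, 1000):
    by_n, by_n_minus_1 = std_estimates(500, n)
    print(f"n={n:>4}: /N = {by_n:.4f}  /(n-1) = {by_n_minus_1:.4f}  "
          f"difference = {by_n_minus_1 - by_n:.4f}")
```

The n - 1 version is always the larger of the two, and the gap shrinks quickly as n grows.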

Interpreting Variance & Standard Deviation

s gives the average difference between each data value and the mean of a dataset, and s² squares it and so exaggerates it.

The larger the values of s and s², the more spread out the data values are and the larger the differences between them.

If s and s² are equal to zero then there are no differences between your data values.

The standard deviation and the variance each require an arithmetic mean to work, not the median or the mode. Therefore they require the same rigour as the mean and are sensitive to extreme values as well, especially the variance.

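The Coefficient of Variation (Cv)

Calculating the Coefficient Of Variation

The equation for the sample coefficient of variation is:

$C_v = \frac{s}{\bar{x}} \times 100$

And, for the population:

$C_v = \frac{\sigma}{\mu} \times 100$

where σ is the population standard deviation and μ is the population mean.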

Interpreting The Coefficient Of Variation

The coefficient of variation expresses the standard deviation as a percentage of the mean.

Allows easy comparison of standard deviations with one another.

By way of example:

Compare an s of $2,400 on a per capita average income of $55,000 against an s of $300 on a per capita average income of $2,000. How to interpret?

Here the coefficients of variation are 4.4% and 15%, indicating a much wider range of variability in the poorer nation; that is, a much wider gap between rich and poor.

Case in point: the coefficient of variation for global GNI is 108.9%! This indicates an extraordinary gap between rich and poor nations.
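The two income figures can be checked with a one-line function. A minimal sketch of the sample coefficient of variation (standard deviation as a percentage of the mean); the function name is illustrative.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return std_dev * 100 / mean


richer = coefficient_of_variation(2_400, 55_000)
poorer = coefficient_of_variation(300, 2_000)

print(f"{richer:.1f}% versus {poorer:.1f}%")
```

Although $2,400 is a far larger standard deviation than $300 in absolute terms, it is the smaller of the two relative to its mean.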

         Dataset #1       Dataset #2        Dataset #3
         $45,000.00       $80,000.00        $80,000.00
         $45,000.00       $45,000.00        $45,000.00
         $45,000.00       $45,000.00        $45,000.00
         $43,000.00       $43,000.00        $43,000.00
         $41,000.00       $41,000.00        $41,000.00
         $40,000.00       $40,000.00        $40,000.00
         $37,000.00       $37,000.00        $37,000.00
         $35,000.00       $35,000.00        $1.00
Sum      $331,000.00      $366,000.00       $331,001.00
n        8                8                 8
Mean     $41,375.00       $45,750.00        $41,375.13
Median   $42,000.00       $42,000.00        $42,000.00
Mode     $45,000.00       $45,000.00        $45,000.00
MAX      $45,000.00       $80,000.00        $80,000.00
MIN      $35,000.00       $35,000.00        $1.00
Range    $10,000.00       $45,000.00        $79,999.00
s        $3,852.18        $14,290.36        $21,559.86
s²       $14,839,285.71   $204,214,285.71   $464,827,464.41
Cv       9.31%            31.24%            52.11%

Note that the highest Cv is 5.6 times the lowest (52.11% / 9.31%), indicating that dataset #3 is considerably more variable than dataset #1. The effect of the two extreme values is evident.
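All of the summary rows in the table can be recomputed with Python's standard `statistics` module, whose `stdev` and `variance` functions use the n - 1 denominator. This is a sketch for checking the table, not the method used to build it; the `summarize` helper and its keys are illustrative names.

```python
from statistics import mean, median, mode, stdev, variance

datasets = {
    "Dataset #1": [45_000, 45_000, 45_000, 43_000, 41_000, 40_000, 37_000, 35_000],
    "Dataset #2": [80_000, 45_000, 45_000, 43_000, 41_000, 40_000, 37_000, 35_000],
    "Dataset #3": [80_000, 45_000, 45_000, 43_000, 41_000, 40_000, 37_000, 1],
}


def summarize(values):
    """Central tendency and dispersion measures for one dataset."""
    s = stdev(values)   # sample standard deviation (n - 1)
    return {
        "mean": mean(values),
        "median": median(values),
        "mode": mode(values),
        "range": max(values) - min(values),
        "s": s,
        "s2": variance(values),
        "cv": s * 100 / mean(values),
    }


summary = {name: summarize(values) for name, values in datasets.items()}
for name, stats in summary.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```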

Summary Stats So Far

Arithmetic mean and standard deviation are fundamental to statistics.

Form the heart of descriptive statistics.

Are the essential building blocks of all other statistical methods – look for them as

elements in future formulas.

Other measures of dispersion have their roles, are more robust, but not as powerful.

All Geography students are deviants.

All Geography students are above average deviants.

mg!
