measures of central tendency measures of dispersion

Measures of Central Tendency

Upload: angela-joseph

Post on 04-Jan-2016

TRANSCRIPT

Page 1: Measures of Central Tendency Measures of Dispersion

Measures of Central Tendency

Page 2: Measures of Central Tendency Measures of Dispersion

Measures of Dispersion

Page 3: Measures of Central Tendency Measures of Dispersion

Multiple Choice. 30 questions.

Lectures 4 and 5 (today). 10% of your course grade.

In this room. 45 minutes. 8:10 start. 8:55 end.

DO NOT MISS IT. THERE WILL BE NO MAKE-UPS.

SECOND QUIZ NEXT WEEK

Page 4: Measures of Central Tendency Measures of Dispersion

The Frightening Power of Central Tendency (George Carlin – funniest guy ever)

Page 5: Measures of Central Tendency Measures of Dispersion

REGRESSION TO THE MEAN:

Eventually, everything becomes mediocre.*

*(Late 16th century: from French médiocre, from Latin mediocris ‘of middle height or degree’.)

Page 6: Measures of Central Tendency Measures of Dispersion

Measures of Central Tendency

The value that best represents the mid-point of a set of values, but which may not actually be found in the set of values themselves. Major types are:

• Means:
  - Arithmetic
  - Weighted/Grouped
  - Geometric
  - Harmonic
  - Trimmed
• Median
• Mode

Some are more robust than others…

Page 7: Measures of Central Tendency Measures of Dispersion

What Does Being Robust Mean?

When a statistic is robust it means that deviations from the underlying assumptions of a data distribution do not affect the statistic’s ability to represent the data values that comprise a dataset’s distribution.

WHAT DOES THAT MEAN?

That a sample’s statistics are a good representation of what’s happening in the population from which it came.

And, the larger a sample gets, the closer its statistics approximate the population’s statistics.

This lecture calls that regression to the mean, or The Wisdom of the Crowd. (Strictly, the convergence of sample statistics on the population values as the sample grows is the law of large numbers.)

Page 8: Measures of Central Tendency Measures of Dispersion

The Wisdom of the Crowd: Regression to the Mean Experiment

[Figure: scatter plot of the guesses. x-axis: Number of Guesses; y-axis: Number of Popcorn Kernels (0 to 10,000); series: Guesses, Actual Mean, Running Mean, Linear (Running Mean).]

You have a jar of popcorn kernels and ask people how many are in it. You get a range of answers, one guess for each person you ask.

There are 52 guesses (blue dots). The actual number of kernels is 5,524 (red line).

Note that the running mean of the guesses (green solid line) converges on, or regresses to (green dotted line), the actual mean.
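The running mean plotted in the experiment can be sketched in a few lines of Python. The five guesses below are hypothetical (the slide does not list the 52 actual guesses), and `running_mean` is an illustrative helper name, not something from the lecture.

```python
from itertools import accumulate


def running_mean(values):
    """Return the mean of the first k values, for k = 1 .. len(values)."""
    totals = accumulate(values)  # cumulative sums: v1, v1+v2, ...
    return [total / k for k, total in enumerate(totals, start=1)]


# Hypothetical guesses; the real experiment had 52 of them.
guesses = [9000, 2500, 6100, 4800, 5200]

for k, value in enumerate(running_mean(guesses), start=1):
    print(f"after {k} guesses the running mean is {value:.1f}")
```

With these made-up guesses the running mean moves from 9000 to 5520 as guesses accumulate, which is the behaviour of the green line described above.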

Page 9: Measures of Central Tendency Measures of Dispersion

So What Does Regression to the Mean, Mean?

What regression to the mean (more or less) means is that eventually, with a large enough sample, sample values for a variable will get closer and

closer to the mean of all values for the variable.

Put another way, the further a given sample value is from the mean, the higher the probability that the

next sample will be closer to the mean.

But to say this we need to have some assumptions about the sample’s view of the population’s reality.

These are called the sample’s underlying assumptions.

Page 10: Measures of Central Tendency Measures of Dispersion

The Assumptions Underlying Why The Crowd’s View is a Good Approximation of The Population’s Reality

These are the underlying assumptions for The Crowd: data distributions are normally distributed (a.k.a. bell shaped).

This means that:
• They have no outliers.
• They have no gaps.
• They are not skewed (skewness).
• They are not peaked (kurtosis).
• They have no extreme values.
• They are not bi-modal (two peaks).
• They are not poly-modal (many peaks).
• Their measures of central tendency are equal (hmmm).

This is called being “robust”.

Page 11: Measures of Central Tendency Measures of Dispersion

What Does Being Robust Mean?

Deviant distributions (distributions that deviate from the assumptions) happen because of non-normal attributes such as:

• extreme values
• bi-modality or poly-modality
• outliers
• gaps
• skewness
• kurtosis
• extreme differences between values

These, thankfully for Statistics, are rare occurrences, but we must always check for them.

Page 12: Measures of Central Tendency Measures of Dispersion

What Does Being Robust Not Mean?

Robustness: ability to withstand assumption violation.

Discrimination: ability to accurately represent a set of values.

Statistical tools with high robustness usually have lower ability to discriminate.

Statistical tools with low robustness usually have higher ability to discriminate.

Being robust does not mean being better or more accurate – quite the opposite. It means that a robust statistical tool is better able to withstand assumption

violation but less able to discriminate accurately.

Page 13: Measures of Central Tendency Measures of Dispersion

Parametric and Non-parametric Tools

Parametric statistical tools usually have a low ability to withstand assumption violation.

Non-parametric statistical tools usually have a high ability to withstand assumption violation.

Non-parametric statistical tools usually have higher robustness but have a lower ability to discriminate.

Parametric statistical tools usually have lower robustness but have a higher ability to discriminate.

The assumptions underlying statistical distributions are called parameters.

Therefore…

Page 14: Measures of Central Tendency Measures of Dispersion

The Effect of Extreme Values and Outliers

Ten people are sitting at a bar. Each earns $50K a year.

Their average is $50,000.

In walks Bill Gates and sits down. He earns $1,000,000,000 a year. Their average is now about $91 million ($1,000,500,000 / 11).

The data and the averages are both correct but the result is ridiculous because an underlying assumption, that there are no extreme values, has been violated.

Moral: don’t hang around in bars
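The bar example is easy to check with Python's standard `statistics` module. This is only a sketch of the arithmetic above; the variable names are illustrative.

```python
from statistics import mean, median

# Ten patrons earning $50,000 each, as in the example.
patrons = [50_000] * 10

# One extreme value joins the dataset.
with_gates = patrons + [1_000_000_000]

print("before:", mean(patrons), median(patrons))
print("after: ", round(mean(with_gates)), median(with_gates))
```

The mean jumps to roughly $91 million while the median stays at $50,000, which is why the median is the more robust of the two.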

Page 15: Measures of Central Tendency Measures of Dispersion

Why Median Income is a Robust Statistic

n = 9

| Value | Rank | Incremental difference, each value | Incremental difference, each rank | Difference, mean to value | Difference, median to rank |
|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | -13 | -4 |
| 2 | 2 | 1 | 1 | -12 | -3 |
| 3 | 3 | 1 | 1 | -11 | -2 |
| 4 | 4 | 1 | 1 | -10 | -1 |
| 5 | 5 | 1 | 1 | -9 | 0 |
| 6 | 6 | 1 | 1 | -8 | 1 |
| 7 | 7 | 1 | 1 | -7 | 2 |
| 8 | 8 | 82 | 1 | -6 | 3 |
| 90 | 9 | na | na | 76 | 4 |
| Median: 5 | 5 | 1 | 1 | n/a | 0 |
| Arithmetic mean: 14 | 5 | 11.13 | 1 | 0 | 0 |

Middle value: 5. Extreme value: 90.

Mean is almost 3 times the median.

Ranks ‘neutralize’ the extreme value.

Page 16: Measures of Central Tendency Measures of Dispersion

These are our values and where they lie on the distribution.

This is the median (5)

This is the mean (14)

This is the outlier – waaaaay out.

…and it drags the mean out with it.

Thus the median is the more ‘robust’ statistic because it is less affected by the extreme value, so it represents the dataset more accurately.

Page 17: Measures of Central Tendency Measures of Dispersion

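Being Robust – Removing the Extreme Value

n = 9

| Value | Rank | Incremental difference, each value | Incremental difference, each rank | Difference, mean to value | Difference, median to rank |
|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | -4 | -4 |
| 2 | 2 | 1 | 1 | -3 | -3 |
| 3 | 3 | 1 | 1 | -2 | -2 |
| 4 | 4 | 1 | 1 | -1 | -1 |
| 5 | 5 | 1 | 1 | 0 | 0 |
| 6 | 6 | 1 | 1 | 1 | 1 |
| 7 | 7 | 1 | 1 | 2 | 2 |
| 8 | 8 | 1 | 1 | 3 | 3 |
| 9 | 9 | na | na | 4 | 4 |
| Median: 5 | 5 | 1 | 1 | n/a | 0 |
| Arithmetic mean: 5 | 5 | 1 | 1 | 0 | 0 |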

Page 18: Measures of Central Tendency Measures of Dispersion

These are our values and where they lie on the distribution.

This is the median and the mean.

Outlier is gone.

…so the mean moves back to better represent the data values.

Now the mean is the better statistic because it is more accurate, since it uses arithmetic to represent the dataset. The median is still the more robust statistic, but it does not discriminate as well.

Page 19: Measures of Central Tendency Measures of Dispersion

Calculating: Mean, Median, Mode

Page 20: Measures of Central Tendency Measures of Dispersion

Arithmetic Mean

Returns the arithmetic centre of the data distribution, such that the sum of all differences between data values and the mean equals zero.

The arithmetic mean is the arithmetic middle point of a set of values.

This means that the differences between each value x in a dataset and the mean of that dataset will sum to zero.

THIS DOES NOT MEAN THAT THE ARITHMETIC MEAN IS THE MID POINT OF THE DATASET’S DISTRIBUTION

BECAUSE THE ARITHMETIC MEAN IS STRONGLY INFLUENCED BY EXTREME VALUES.

Arithmetic lives here.

Page 21: Measures of Central Tendency Measures of Dispersion

Median

Returns the exact centre value of the dataset and hence the second quartile value. Half the values of the dataset will be above the median and half will be below.

The median is the middle case of a set of cases (records or rows).

THIS MEANS THAT THE MEDIAN IS THE EXACT CENTRAL POINT OF A DATASET’S DISTRIBUTION – 50% of values will be below the median and 50% will be above.

THIS MEANS THAT THE MEDIAN IS NOT AFFECTED BY EXTREME VALUES.

Arithmetic does not live here.

Page 22: Measures of Central Tendency Measures of Dispersion

Mode (or Modality)

Returns the most frequently occurring data value in the dataset. Sometimes reported as a label, when the “value” is nominal level data – e.g. religious or political affiliation.

| CASE | RELIGIOUS AFFILIATION |
|---|---|
| Person #1 | Protestant |
| Person #2 | Protestant |
| Person #3 | Muslim |
| Person #4 | Catholic |
| Person #5 | Protestant |
| Person #6 | Jewish |
| Person #7 | Catholic |

The “modality” of this sample would be Protestant because it is the most frequently occurring “value”.
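Because the mode only counts occurrences, it works on labels as well as numbers. A minimal sketch using the seven cases from this slide and the standard library's `Counter` (the variable names are illustrative):

```python
from collections import Counter

# The seven cases from the slide, in order.
affiliations = [
    "Protestant", "Protestant", "Muslim", "Catholic",
    "Protestant", "Jewish", "Catholic",
]

counts = Counter(affiliations)

# most_common(1) returns a list holding the single (label, count) pair
# with the highest count.
mode_label, mode_count = counts.most_common(1)[0]
print(mode_label, mode_count)
```

No arithmetic is performed on the labels, which is why the mode is the only one of the three measures usable with nominal data.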

Page 23: Measures of Central Tendency Measures of Dispersion

An Example of How They All Work Together

You work for a computer manufacturer and are asked to do a quality control analysis, so you gather data on the number of faults your computers have.

These data show that your company has an average of 9.1 faults per 100 computers and that your competition has 9.1

faults per 100 as well.

In other words the average number of faults is about the same.

Should you report to your boss that your company is alright? It’s no worse than your competitors?

Perhaps not, because you are a smart statistician and you collected more than the bare bones dataset.

Page 24: Measures of Central Tendency Measures of Dispersion

Percentage of Faults by Category

| Number of Faults Per Unit | Your Company | Your Competitor |
|---|---|---|
| Zero | 30 | 16 |
| One | 20 | 17 |
| Two | 10 | 9 |
| Three | 4 | 8 |
| Four | 3 | 12 |
| Five | 2 | 8 |
| Six | 2 | 8 |
| Seven | 1 | 8 |
| Eight | 2 | 7 |
| Nine | 0 | 3 |
| Ten+ | 26 | 4 |
| TOTAL | 100 | 100 |
| MEAN | 9.1 | 9.1 |
| MEDIAN | One | Three |

Rather than the very basic arithmetic mean data, you collected these.

They show what proportion of the machines had no faults, 1 fault, 2 faults, etc.

The average proportion of faults stays at 9.1%, but the median shows that your company is doing much better, with 50% of machines having 1 or no faults, whereas your competitor’s median shows 50% of machines having up to 3 faults.

Your problem lies in the 26% of machines having 10 or more faults.

Further investigation shows that one of your assembly lines is the culprit, due to sloppy workers.
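One way to read the MEDIAN row of the table is as the first fault category at which the cumulative percentage of machines reaches 50%. The sketch below assumes that reading; `median_category` is a hypothetical helper name. It also shows that the MEAN row is simply 100% spread over 11 categories, which is why it cannot tell the two companies apart.

```python
labels = ["Zero", "One", "Two", "Three", "Four", "Five",
          "Six", "Seven", "Eight", "Nine", "Ten+"]
company = [30, 20, 10, 4, 3, 2, 2, 1, 2, 0, 26]
competitor = [16, 17, 9, 8, 12, 8, 8, 8, 7, 3, 4]


def median_category(labels, percents):
    """Return the first category where the cumulative percentage reaches 50."""
    cumulative = 0
    for label, pct in zip(labels, percents):
        cumulative += pct
        if cumulative >= 50:
            return label
    raise ValueError("percentages sum to less than 50")


print(median_category(labels, company))     # company's median category
print(median_category(labels, competitor))  # competitor's median category

# The mean percentage per category is 100 / 11 for both columns.
print(round(sum(company) / len(company), 1))
print(round(sum(competitor) / len(competitor), 1))
```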

Page 25: Measures of Central Tendency Measures of Dispersion

[Figure: Percentage of Units with Specific Number of Faults. Bar chart; x-axis: Number of Faults (Zero to Ten+); y-axis: Percentage of Faults (0 to 35); series: Your Company, Competitor.]

Page 26: Measures of Central Tendency Measures of Dispersion

The Arithmetic Mean

$$\bar{x} = \frac{\sum x}{n}$$

where:
$\bar{x}$ = the arithmetic mean
$\sum$ = the sum of all x’s
x = a value in the dataset
n = the number of values (or cases) in the dataset

Page 27: Measures of Central Tendency Measures of Dispersion

SAMPLE AND POPULATION SYMBOLOGY

$$\mu = \frac{\sum x}{N}$$

• In formulas for a sample (such as the arithmetic mean), Latin letters are used, and a lower case ‘n’ is used for the number of cases.

• In formulas for a population (such as the arithmetic mean), Greek symbols and letters are used (here mu, μ, the conventional symbol for the population mean) and an upper case ‘N’ is used for the number of cases.

Page 28: Measures of Central Tendency Measures of Dispersion

THE ARITHMETIC MEAN - DIFFERENCES SUM TO ZERO

e.g. 38.25 – 21 = 17.25; 38.25 – 34 = 4.25; etc.

| Data Values | Differences from Mean |
|---|---|
| 21 | 17.25 |
| 34 | 4.25 |
| 45 | -6.75 |
| 56 | -17.75 |
| 54 | -15.75 |
| 43 | -4.75 |
| 32 | 6.25 |
| 21 | 17.25 |
| Sum: 306 | 0 |

N = 8

Mean = 38.25

Remember this.
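The zero-sum property can be checked directly with the eight values from this slide. A short sketch (variable names are illustrative):

```python
values = [21, 34, 45, 56, 54, 43, 32, 21]

mean_value = sum(values) / len(values)

# Differences are taken as (mean - value), matching the slide's table.
differences = [mean_value - x for x in values]

print(mean_value)
print(differences)
print(sum(differences))
```

The positive and negative differences cancel exactly, which is the reason the standard deviation later has to square them.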

Page 29: Measures of Central Tendency Measures of Dispersion

THE ARITHMETIC MEAN - EFFECT OF EXTREME VALUES

| | Dataset #1 | Dataset #2 | Dataset #3 |
|---|---|---|---|
| | $45,000.00 | $80,000.00 | $80,000.00 |
| | $45,000.00 | $45,000.00 | $45,000.00 |
| | $45,000.00 | $45,000.00 | $45,000.00 |
| | $43,000.00 | $43,000.00 | $43,000.00 |
| | $41,000.00 | $41,000.00 | $41,000.00 |
| | $40,000.00 | $40,000.00 | $40,000.00 |
| | $37,000.00 | $37,000.00 | $37,000.00 |
| | $35,000.00 | $35,000.00 | $1.00 |
| Sum | $331,000.00 | $366,000.00 | $331,001.00 |
| n | 8 | 8 | 8 |
| Mean | $41,375.00 | $45,750.00 | $41,375.13 |
| Median | $42,000.00 | $42,000.00 | $42,000.00 |
| Mode | $45,000.00 | $45,000.00 | $45,000.00 |

Middle dataset is unbalanced. It has one extreme value that pulls the mean higher than all but that extreme value. Note that median and mode are not affected.

An opposite extreme value balances the dataset again.

When an extreme value is present the median should be used and not the arithmetic mean, because the distribution will be skewed.

BUT YOU ARE STILL LEFT WITH EXTREME VALUES. These will affect the deviation in the dataset (s and s²).
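The three summary rows of the table can be reproduced with Python's `statistics` module. This is a sketch of the calculation, not the tool used to build the slides; the dictionary layout is an assumption made for illustration.

```python
from statistics import mean, median, mode

datasets = {
    "Dataset #1": [45000, 45000, 45000, 43000, 41000, 40000, 37000, 35000],
    "Dataset #2": [80000, 45000, 45000, 43000, 41000, 40000, 37000, 35000],
    "Dataset #3": [80000, 45000, 45000, 43000, 41000, 40000, 37000, 1],
}

# (mean, median, mode) for each dataset.
summary = {name: (mean(v), median(v), mode(v)) for name, v in datasets.items()}

for name, (m, med, mo) in summary.items():
    print(f"{name}: mean={m:,.2f} median={med:,.2f} mode={mo:,.2f}")
```

Only the mean changes between the datasets; the median and the mode stay at $42,000 and $45,000 throughout.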

Page 30: Measures of Central Tendency Measures of Dispersion

THE MEDIAN

The median is the middle point of a set of cases: half the values lie above it and half below.

Datasets #1 to #3 (the same eight values each as on the previous slide) have an even number of values, so you take the mean of the centre two values, $43,000 and $41,000, which gives $42,000.

If there were an odd number of values, then the single middle value would be the median.

Page 31: Measures of Central Tendency Measures of Dispersion

Mean and Median – Points to Remember

The Mean is the arithmetic middle point of a set of values. It is calculated arithmetically from all data values. It is not the exact mid point of a set of values because it is strongly influenced by extreme values.

BECAUSE OF THIS THE ARITHMETIC MEAN IS NOT A ROBUST STATISTIC BUT IT IS A DISCRIMINATING ONE.

The Median is the middle point of a set of cases (records or rows). It is calculated by dividing the number of rows into two halves. Because it is the exact mid point of a set of cases it is not influenced by extreme values.

BECAUSE OF THIS THE MEDIAN IS A ROBUST STATISTIC BUT IT IS NOT A DISCRIMINATING ONE.

Page 32: Measures of Central Tendency Measures of Dispersion

The Mode

The most frequently occurring data value (in this example $45,000 in each of Datasets #1 to #3).

Page 33: Measures of Central Tendency Measures of Dispersion

Gaps and Outliers

The following histogram has outliers: there are three cities in the leftmost bar. This creates a gap where there are effectively no values.

[Figure: histogram with the gap and the outliers labelled.]

Page 34: Measures of Central Tendency Measures of Dispersion

Using Measures of Central Tendency

Use the method that returns the most information about the centre of the dataset – usually the

arithmetic mean. BUT…

With highly skewed (such as income) or non-unimodal datasets the median should be used.

Means and medians cannot be used with nominal level data – the mode can be used to describe the most frequently occurring label.

USING THE MODE IS CALLED MODAL ANALYSIS.

Page 35: Measures of Central Tendency Measures of Dispersion

Other means to an end…

• Weighted mean: Useful when the ‘x’s have unequal weights as in grade calculations (e.g. tests worth 20%, labs worth 30%, etc).

• Grouped data mean: Useful when you only have data in categories, as with income classes – is a special case of the weighted mean.

• Geometric mean: Useful when you have percentages, ratios, indexes or data covering several orders of magnitude.

• Harmonic mean: Useful when you have rates as in calculating average speeds.

• Trimmed mean: Useful for removing outliers.

Page 36: Measures of Central Tendency Measures of Dispersion

Weighted Mean

$$\bar{x}_w = \frac{\sum x_i w_i}{\sum w_i}$$

Where:
$\bar{x}_w$ = weighted mean
$x_i$ = data value
$w_i$ = weight of data value

The weighted mean is used when data values have weighting schemes, as with the grades in this course.

Page 37: Measures of Central Tendency Measures of Dispersion

Weighted mean example #1

| Component | Weight (w_i) | Your Mark (x_i) | Mark × Weight (x_i * w_i) |
|---|---|---|---|
| Lab Assignments | 45% | 86% | 3870 |
| Tests | 15% | 80% | 1200 |
| Final Exam | 40% | 80% | 3200 |
| Totals | 100% | 246 | 8270 |

The weighted mean method: $\bar{x}_w = \frac{\sum x_i w_i}{\sum w_i}$ = 8270/100 = 82.7%

The arithmetic mean method (n = 3): 246/3 = 82%

Page 38: Measures of Central Tendency Measures of Dispersion

Weighted mean example #2: Changing the weights

| Component | Weight (w_i) | Your Mark (x_i) | Mark × Weight (x_i * w_i) |
|---|---|---|---|
| Lab Assignments | 10% | 86% | 860 |
| Tests | 30% | 80% | 2400 |
| Final Exam | 60% | 80% | 4800 |
| Totals | 100% | 246 | 8060 |

The weighted mean method: $\bar{x}_w = \frac{\sum x_i w_i}{\sum w_i}$ = 8060/100 = 80.6%

The arithmetic mean method (n = 3): 246/3 = 82%
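Both grade examples follow from the same formula, so a small function covers them. The sketch below uses the marks and weights from the two tables; `weighted_mean` is an illustrative name.

```python
def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(x * w for x, w in zip(values, weights)) / total_weight


marks = [86, 80, 80]  # labs, tests, final exam (percent)

example_1 = weighted_mean(marks, [45, 15, 40])
example_2 = weighted_mean(marks, [10, 30, 60])
unweighted = sum(marks) / len(marks)

print(example_1, example_2, unweighted)
```

The unweighted mean stays at 82 in both cases; only the weighted mean responds when the weight moves away from the component with the highest mark.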

Page 39: Measures of Central Tendency Measures of Dispersion

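Grouped data mean example

Population in Census Tract 12345.6

| Income Classes | Frequency (f), w_i | Class midpoints (CM), x_i | CM × f, w_i * x_i |
|---|---|---|---|
| $0-$10,000 | 22 | $5,000 | $110,000 |
| $10,001-$20,000 | 56 | $15,000 | $840,000 |
| $20,001-$30,000 | 81 | $25,000 | $2,025,000 |
| $30,001-$40,000 | 45 | $35,000 | $1,575,000 |
| $40,001-$50,000 | 23 | $45,000 | $1,035,000 |
| $50,001-$60,000 | 15 | $55,000 | $825,000 |
| >$60,000 | 7 | Excluded | 0 |
| Totals | 249 | $180,000 | $6,410,000 |

The weighted mean method: $\bar{x}_w = \frac{\sum x_i w_i}{\sum w_i}$ = $6,410,000/249 = $25,742.97

The arithmetic mean method (n = 6 class midpoints): $180,000/6 = $30,000

Note that the frequency total of 249 still counts the 7 cases in the excluded open-ended class, although they add nothing to the numerator. Dividing by the 242 included cases instead gives $26,487.60.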
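The grouped mean is a weighted mean with class midpoints as values and frequencies as weights. The sketch below uses the figures from the table and computes the result with both denominators: the 249 used on the slide, and the 242 cases that actually have a midpoint. The variable names are illustrative.

```python
# (class midpoint, frequency) for the six classes that have a midpoint.
classes = [
    (5_000, 22), (15_000, 56), (25_000, 81),
    (35_000, 45), (45_000, 23), (55_000, 15),
]
excluded_count = 7  # the open-ended >$60,000 class has no midpoint

total_income = sum(midpoint * freq for midpoint, freq in classes)
included_count = sum(freq for _, freq in classes)

# Denominator used on the slide: all 249 cases, including the excluded 7.
mean_as_on_slide = total_income / (included_count + excluded_count)

# Denominator restricted to the cases that contribute to the numerator.
mean_included_only = total_income / included_count

print(total_income, included_count)
print(round(mean_as_on_slide, 2), round(mean_included_only, 2))
```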

Page 40: Measures of Central Tendency Measures of Dispersion

Geometric mean

$$GM = \sqrt[n]{\prod x}$$

Where:
GM = geometric mean
x = data values
$\sqrt[n]{\ }$ = nth root of the product of all x

The ∏ symbol is the upper case Greek letter pi and signifies the product of a set of multiplications.

Used extensively in biology and finance.

Page 41: Measures of Central Tendency Measures of Dispersion

Geometric mean – use when your data:
• Are percentages, ratios, indexes or growth rates;
• Have an exponential distribution;
• Have high value more than 3 times the low value;
• Cover several orders of magnitude.

Geometric mean – do not use when your data:
• Are already log scaled such as decibels or pH;
• Have high value less than 3 times the low value.

Page 42: Measures of Central Tendency Measures of Dispersion

Geometric Mean Example

Example using bacteria counts (they typically vary widely)

| Water Sample # | Enteric Bacteria Count per ml |
|---|---|
| 1 | 6 |
| 2 | 50 |
| 3 | 9 |
| 4 | 1200 |
| Arithmetic mean | 316.25 |
| Geometric mean | 42.43 |

$$GM = \sqrt[4]{6 \times 50 \times 9 \times 1200} = 42.43$$

Basically the data are log transformed. Thus extreme values are tempered.
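The remark that the data are "log transformed" can be made concrete: the geometric mean is the exponential of the arithmetic mean of the logs. The sketch below applies that to the four bacteria counts; `geometric_mean` here is a hand-written illustration, and it assumes all values are strictly positive.

```python
import math


def geometric_mean(values):
    """exp of the mean of the natural logs; values must all be positive."""
    if not all(v > 0 for v in values):
        raise ValueError("geometric mean needs strictly positive values")
    mean_log = sum(math.log(v) for v in values) / len(values)
    return math.exp(mean_log)


counts = [6, 50, 9, 1200]

gm = geometric_mean(counts)
am = sum(counts) / len(counts)

print(round(gm, 2), am)
```

The single count of 1200 dominates the arithmetic mean (316.25) but has far less pull on the geometric mean (about 42.4).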

Page 43: Measures of Central Tendency Measures of Dispersion

Harmonic mean

Useful when you have rates per unit (such as distance per unit of time, i.e. speed) to average out.

$$HM = \frac{n}{\sum \frac{1}{x}}$$

Where:
HM = harmonic mean
1/x = reciprocals of data values
n = number of data values

Page 44: Measures of Central Tendency Measures of Dispersion

Harmonic mean: example of transportation & speed

What’s the average speed for a delivery truck given these data:

| Segment | Length (km) | Speed (kph) | Time taken |
|---|---|---|---|
| Outbound | 180 | 60 | 180/60 = 3 hrs |
| Inbound | 180 | 80 | 180/80 = 2.25 hrs |
| Total distance | 360 | | |

Arithmetic mean speed = (60 kph + 80 kph)/2 = 70 kph.
But the time taken is 3 hrs + 2.25 hrs = 5.25 hrs.

Therefore the actual average speed = 360 km / 5.25 hrs = 68.57 kph.

This is what the harmonic mean does: HM = 2/((1/60 kph) + (1/80 kph)) = 68.57 kph

Difference small, but over more segments it can be significant.
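The truck example can be verified both ways: from total distance over total time, and from the harmonic mean of the two speeds. The sketch assumes, as in the example, that both segments are the same length; with unequal segments a distance-weighted version would be needed. `harmonic_mean` is written out by hand for illustration.

```python
def harmonic_mean(rates):
    """Number of rates divided by the sum of their reciprocals."""
    return len(rates) / sum(1 / r for r in rates)


segment_km = 180
speeds_kph = [60, 80]

# Direct calculation: total distance over total time.
total_time_hrs = sum(segment_km / s for s in speeds_kph)
actual_average = (segment_km * len(speeds_kph)) / total_time_hrs

hm = harmonic_mean(speeds_kph)
am = sum(speeds_kph) / len(speeds_kph)

print(round(actual_average, 2), round(hm, 2), am)
```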

Page 45: Measures of Central Tendency Measures of Dispersion

Trimmed Mean

This is easy: it is any mean where the outliers have been stripped or trimmed away.

Thus you would sort your data and drop the top and bottom 10% of your values. This is called a

10% trimmed mean.

You can drop whatever proportion of the dataset you wish - within reason.

Page 46: Measures of Central Tendency Measures of Dispersion

Example of Trimmed Mean

Datasets #1 to #3 are the same eight values each as before. Trimming removes the extreme values: $80,000 from Dataset #2, and both $80,000 and $1 from Dataset #3.

| | Dataset #1 | Dataset #2 | Dataset #3 |
|---|---|---|---|
| Sum | $331,000.00 | $366,000.00 | $331,001.00 |
| n | 8 | 8 | 8 |
| Mean | $41,375.00 | $45,750.00 | $41,375.13 |
| Median | $42,000.00 | $42,000.00 | $42,000.00 |
| Mode | $45,000.00 | $45,000.00 | $45,000.00 |
| Trimmed sum | $331,000.00 | $286,000.00 | $251,000.00 |
| Trimmed n | 8 | 7 | 6 |
| Trimmed mean | $41,375.00 | $40,857.14 | $41,833.33 |
| Trimmed median | $42,000.00 | $41,000.00 | $42,000.00 |
| Trimmed mode | $45,000.00 | $45,000.00 | $45,000.00 |
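A symmetric trimmed mean, as described on the previous slide, can be sketched as follows. The function name and the choice to drop the same number of values from each end are assumptions for illustration; note that the Dataset #2 column above removes only the single high value, which is a one-sided trim and is not what this function does.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    if proportion >= 0.5 or 0 > proportion:
        raise ValueError("proportion must be at least 0 and below 0.5")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # values to drop per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


dataset_3 = [80000, 45000, 45000, 43000, 41000, 40000, 37000, 1]

# With 8 values, a 12.5% trim drops exactly one value from each end.
print(round(trimmed_mean(dataset_3, 0.125), 2))
print(trimmed_mean(dataset_3, 0))  # no trimming: the ordinary mean
```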

Page 47: Measures of Central Tendency Measures of Dispersion

[Figure: histogram of GNI Per Capita with a Moving Average (period 2) Trend Line. x-axis: U.S. $ (5,000 to 50,000); y-axis: Frequency (0 to 90).]

Moving average trend lines produce a smoother ‘actual

value’ trend line that is based on consecutive recalculations of an

arithmetic mean of set size.

The weakness is that the line loses values, depending on what

size you make the averaging group.

Page 48: Measures of Central Tendency Measures of Dispersion

Calculating the Moving Average

| Original Data Values | Moving Average, Period 2 | Moving Average, Period 3 |
|---|---|---|
| 10 | (10+22)/2 = 16 | (10+22+12)/3 = 14.67 |
| 22 | (22+12)/2 = 17 | (22+12+14)/3 = 16 |
| 12 | (12+14)/2 = 13 | (12+14+16)/3 = 14 |
| 14 | (14+16)/2 = 15 | (14+16+20)/3 = 16.67 |
| 16 | (16+20)/2 = 18 | (16+20+32)/3 = 22.67 |
| 20 | (20+32)/2 = 26 | |
| 32 | | |
| # of values to be plotted: 7 | 6 | 5 |
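The table can be reproduced with a short function that averages each consecutive window of the chosen period. `moving_average` is an illustrative name; the data are the seven values from the table.

```python
def moving_average(values, period):
    """Arithmetic mean of each consecutive window of `period` values."""
    if period not in range(1, len(values) + 1):
        raise ValueError("period must be between 1 and len(values)")
    windows = len(values) - period + 1
    return [sum(values[i:i + period]) / period for i in range(windows)]


data = [10, 22, 12, 14, 16, 20, 32]

period_2 = moving_average(data, 2)
period_3 = moving_average(data, 3)

print(period_2, len(period_2))
print([round(v, 2) for v in period_3], len(period_3))
```

Each increase in the period costs one plotted value, which is the weakness noted on the previous slide.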

Page 49: Measures of Central Tendency Measures of Dispersion

All Geography students are above average.

Page 50: Measures of Central Tendency Measures of Dispersion

Measures of Dispersion

Page 51: Measures of Central Tendency Measures of Dispersion

What is Dispersion?

Refers to the way in which quantitative data values are dispersed or spread out in a dataset.

The most powerful dispersion statistics calculate the quantitative spread of the data values around the arithmetic mean and are called

measures of deviation.

The various measures of deviation calculate the arithmetic differences between each data value

and the arithmetic mean of the dataset.

Page 52: Measures of Central Tendency Measures of Dispersion

Why bother with measuring deviation?

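Consider the following datasets:

First dataset: 3, 3, 3, 3, 3
Second dataset: 1, 1, 1, 2, 10

First we calculate their arithmetic means using:

$$\bar{x} = \frac{\sum x}{n}$$

First dataset: (3+3+3+3+3)/5 = 3. Second dataset: (1+1+1+2+10)/5 = 3.

Are they the same? According to the mean they are.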

Page 53: Measures of Central Tendency Measures of Dispersion

Then we calculate their standard deviations using:

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

First dataset: s = 0. Second dataset: s = 3.94.

Same means, very different standard deviations. So are the datasets the same – or not?
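The comparison of the two five-value datasets can be checked with `statistics.stdev`, which uses the sample (n − 1) formula. A short sketch with illustrative variable names:

```python
from statistics import mean, stdev

first = [3, 3, 3, 3, 3]
second = [1, 1, 1, 2, 10]

print(mean(first), mean(second))                        # identical means
print(round(stdev(first), 2), round(stdev(second), 2))  # very different spread
```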

Page 54: Measures of Central Tendency Measures of Dispersion

Measures of Dispersion and Deviation

The Range (a measure of dispersion):
The range is the difference between the lowest value (called MIN) and the highest value (called MAX) in a dataset.

The Standard Deviation (a measure of deviation):
Measures the average difference between a data value and the arithmetic mean of all data values.

The Variance (a measure of deviation):
Squares the average difference between a data value and the arithmetic mean of the data set. Thus it is the standard deviation squared.

Page 55: Measures of Central Tendency Measures of Dispersion

The Range (Range = MAX – MIN)

Page 56: Measures of Central Tendency Measures of Dispersion

The Range

The range describes the span of your dataset, from the minimum value (MIN) to the maximum value (MAX) using:

Range = MAX – MIN

Used as a measure of data dispersion NOT deviation, because deviation implies a difference between your data values and something, e.g. the arithmetic mean.

The Range is used in finding histogram (or bar chart) classes.

Page 57: Measures of Central Tendency Measures of Dispersion

Datasets #1 to #3 (the same eight values each as before):

| | Dataset #1 | Dataset #2 | Dataset #3 |
|---|---|---|---|
| Sum | $331,000.00 | $366,000.00 | $331,001.00 |
| n | 8 | 8 | 8 |
| Mean | $41,375.00 | $45,750.00 | $41,375.13 |
| Median | $42,000.00 | $42,000.00 | $42,000.00 |
| Mode | $45,000.00 | $45,000.00 | $45,000.00 |
| MAX | $45,000.00 | $80,000.00 | $80,000.00 |
| MIN | $35,000.00 | $35,000.00 | $1.00 |
| Range | $10,000.00 | $45,000.00 | $79,999.00 |

Even the range is telling us more about the data than just the

central tendency measures do.

Compare dataset #1 with #3.

Page 58: Measures of Central Tendency Measures of Dispersion

The Standard Deviation (s)

Page 59: Measures of Central Tendency Measures of Dispersion

The Standard Deviation

The standard deviation measures the average difference between a data value and the arithmetic mean of all data values. It is given by:

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

Where:
s is the sample standard deviation
x is a value in the dataset
$\bar{x}$ is the arithmetic mean of the dataset
n is the number of values in the dataset

The standard deviation and the variance are related insofar as s is the square root of the variance (or the variance is s²). s is the most widely used measure of deviation, though it should always be used in conjunction with the variance.

Page 60: Measures of Central Tendency Measures of Dispersion

Interpreting the Standard Deviation Formula

Subtract the arithmetic mean $\bar{x}$ from each data value x and sum the differences:

$$\sum (x - \bar{x})$$

But this returns a set of plus and minus differences that add to zero. So to remove the signs we square each difference and sum the squared differences …

$$\sum (x - \bar{x})^2$$

… then divide by n − 1 and take the square root to return to the magnitude of the original values:

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
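The steps of the formula can be followed one at a time in code. The sketch below uses the small dataset 1, 1, 1, 2, 10 from the earlier "Why bother with measuring deviation?" slide; the variable names are illustrative.

```python
import math

values = [1, 1, 1, 2, 10]
n = len(values)

# Step 1: the arithmetic mean.
mean_value = sum(values) / n

# Step 2: differences from the mean; these always sum to zero.
differences = [x - mean_value for x in values]

# Step 3: square each difference to remove the signs, then sum.
sum_of_squares = sum(d ** 2 for d in differences)

# Step 4: divide by n - 1 (this is the sample variance).
sample_variance = sum_of_squares / (n - 1)

# Step 5: take the square root to get back to the original units.
s = math.sqrt(sample_variance)

print(sum(differences), sum_of_squares, sample_variance, round(s, 2))
```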

Page 61: Measures of Central Tendency Measures of Dispersion

A reminder of the effect of squaring…

[Figure: line chart of the numbers 1 to 20 and their squares. y-axis: 0 to 450; series: numbers, squares.]

#:  1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
#²: 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361, 400

… it emphasizes higher values

The numbers: an arithmetic progression.

The squares: a quadratic progression, which climbs faster and faster.

Page 62: Measures of Central Tendency Measures of Dispersion

Why Squares and Roots?

This is a list of numbers, x.

| x | x − mean | (x − mean) squared | sqrt of (x − mean) squared |
|---|---|---|---|
| 1 | -9.5 | 90.25 | 9.5 |
| 2 | -8.5 | 72.25 | 8.5 |
| 3 | -7.5 | 56.25 | 7.5 |
| 4 | -6.5 | 42.25 | 6.5 |
| 5 | -5.5 | 30.25 | 5.5 |
| 6 | -4.5 | 20.25 | 4.5 |
| 7 | -3.5 | 12.25 | 3.5 |
| 8 | -2.5 | 6.25 | 2.5 |
| 9 | -1.5 | 2.25 | 1.5 |
| 10 | -0.5 | 0.25 | 0.5 |
| 11 | 0.5 | 0.25 | 0.5 |
| 12 | 1.5 | 2.25 | 1.5 |
| 13 | 2.5 | 6.25 | 2.5 |
| 14 | 3.5 | 12.25 | 3.5 |
| 15 | 4.5 | 20.25 | 4.5 |
| 16 | 5.5 | 30.25 | 5.5 |
| 17 | 6.5 | 42.25 | 6.5 |
| 18 | 7.5 | 56.25 | 7.5 |
| 19 | 8.5 | 72.25 | 8.5 |
| 20 | 9.5 | 90.25 | 9.5 |
| Mean: 10.5 | Sum: 0.0 | | |

The difference x − x̄ produces negative numbers and a sum of zero, but…

… the square of a number is always positive, and…

… differences between squares increase more rapidly than differences between original numbers, so…

… taking the square root of the squared differences simply returns them to their original magnitudes, and also removes the sign.

[Figure: chart of number against square for 1 to 20. y-axis: 0 to 450; series: numbers, squares, num diff, square diffs.]

Page 63: Measures of Central Tendency Measures of Dispersion

Datasets #1 to #3 (the same eight values each as before):

| | Dataset #1 | Dataset #2 | Dataset #3 |
|---|---|---|---|
| Sum | $331,000.00 | $366,000.00 | $331,001.00 |
| n | 8 | 8 | 8 |
| Mean | $41,375.00 | $45,750.00 | $41,375.13 |
| Median | $42,000.00 | $42,000.00 | $42,000.00 |
| Mode | $45,000.00 | $45,000.00 | $45,000.00 |
| MAX | $45,000.00 | $80,000.00 | $80,000.00 |
| MIN | $35,000.00 | $35,000.00 | $1.00 |
| Range | $10,000.00 | $45,000.00 | $79,999.00 |
| s | $3,852.18 | $14,290.36 | $21,559.86 |

Low s means that the data are clustered around the mean (data are leptokurtic or ‘peaked’).

High s means that the data are spread out around the mean (data are platykurtic or ‘flat’).

REMEMBER: s values do not indicate skewness. They do indicate kurtosis.

Page 64: Measures of Central Tendency Measures of Dispersion

Review Slide

Standard Deviation and the ‘Shape’ of Data

[Figure: three frequency curves centred on the mean, labelled ‘Small’ standard deviation, ‘Normal’ standard deviation and ‘Large’ standard deviation. y-axis: Frequency.]

This ‘peakedness’ of the distribution is called kurtosis. Use the kurtosis statistic to test for normality.

Page 65: Measures of Central Tendency Measures of Dispersion

The Variance (s²)

Page 66: Measures of Central Tendency Measures of Dispersion

The Variance

Squares the average difference between a data value and the arithmetic mean of the data set. It is given by:

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

Where:
s² is the sample variance
x is a value in the dataset
$\bar{x}$ is the arithmetic mean of the dataset
n is the number of values in the dataset

Since it uses the arithmetic mean, it is subject to the same effect of extreme values – except much more because of

the effect of squaring.

Page 67: Measures of Central Tendency Measures of Dispersion

Interpreting the Variance Formula

Subtract the arithmetic mean $\bar{x}$ from each data value x and sum the differences:

$$\sum (x - \bar{x})$$

But this returns a set of plus and minus differences that adds to zero.

So to remove the signs we square each difference and sum the squared differences:

$$\sum (x - \bar{x})^2$$

…and then divide by n − 1:

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

Page 68: Measures of Central Tendency Measures of Dispersion

Variance and SD Compared

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

By squaring the differences you remove the negative signs and exaggerate more extreme differences to make them more obvious for analysis.

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

By taking the square root you return the differences to their original magnitude but the signs are removed so the differences no longer sum to zero.

In comparing the two, when the s is small, the difference between the variance (s²) and the s is smaller than if the s is large – that’s what happens when you square numbers.

Page 69: Measures of Central Tendency Measures of Dispersion

Datasets #1 to #3 (the same eight values each as before):

| | Dataset #1 | Dataset #2 | Dataset #3 |
|---|---|---|---|
| Sum | $331,000.00 | $366,000.00 | $331,001.00 |
| n | 8 | 8 | 8 |
| Mean | $41,375.00 | $45,750.00 | $41,375.13 |
| Median | $42,000.00 | $42,000.00 | $42,000.00 |
| Mode | $45,000.00 | $45,000.00 | $45,000.00 |
| MAX | $45,000.00 | $80,000.00 | $80,000.00 |
| MIN | $35,000.00 | $35,000.00 | $1.00 |
| Range | $10,000.00 | $45,000.00 | $79,999.00 |
| s | $3,852.18 | $14,290.36 | $21,559.86 |
| s² | $14,839,285.71 | $204,214,285.71 | $464,827,464.41 |

Note that the highest s is 5.6 times the lowest whereas the highest s2 is 31 times the lowest – this is the effect of

squaring extreme values
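The two ratios quoted above can be recomputed from the raw values with `statistics.stdev` and `statistics.variance`, both of which use the n − 1 denominator. A sketch with illustrative names:

```python
from statistics import stdev, variance

datasets = {
    "Dataset #1": [45000, 45000, 45000, 43000, 41000, 40000, 37000, 35000],
    "Dataset #2": [80000, 45000, 45000, 43000, 41000, 40000, 37000, 35000],
    "Dataset #3": [80000, 45000, 45000, 43000, 41000, 40000, 37000, 1],
}

s_values = {name: stdev(v) for name, v in datasets.items()}
s2_values = {name: variance(v) for name, v in datasets.items()}

# Ratio of the largest to the smallest value of each statistic.
s_ratio = max(s_values.values()) / min(s_values.values())
s2_ratio = max(s2_values.values()) / min(s2_values.values())

print(round(s_ratio, 1), round(s2_ratio))
```

Because s² is the square of s, the variance ratio is the square of the standard deviation ratio, which is why it is so much larger.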

Page 70: Measures of Central Tendency Measures of Dispersion

N and n-1

Why do the sample standard deviation and sample variance (in fact, sample anything) formulas have n-1 as the denominator?

Because n-1 gives a more conservative estimate of deviation by increasing the standard deviation and variance values.

If you have a larger standard deviation or variance, you have a higher standard to pass in making your case. Why?

Because if you are testing to see if a data value is 1.96 s away from the mean of its dataset, then a larger s means the data value has to meet a stricter test – i.e. it has to lie further from the mean.

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \qquad s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

Page 71: Measures of Central Tendency Measures of Dispersion

Sample versus population – n-1 versus N

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

| Sample size (n) | Value of numerator in standard deviation formula | Biased estimate of population standard deviation (i.e. dividing by N) | Unbiased estimate of population standard deviation (dividing by n-1) | Difference between biased and unbiased estimates | Difference (%) |
|---|---|---|---|---|---|
| 10 | 500 | √(500/10) = 7.07 | √(500/(10-1)) = 7.45 | .38 | 5.0% |
| 100 | 500 | √(500/100) = 2.24 | √(500/(100-1)) = 2.25 | .01 | 0.4% |
| 1000 | 500 | √(500/1000) = 0.7071 | √(500/(1000-1)) = 0.7075 | .0004 | 0.056% |

Source: After Salkind, page 40.

Note:
1. With n-1 the standard deviation is higher.
2. The larger the sample, the smaller the effect of n-1.
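The first five columns of the table can be regenerated in a few lines, which makes the shrinking effect of n − 1 easy to see. The numerator of 500 and the three sample sizes come from the table; the row layout is an assumption for illustration.

```python
import math

numerator = 500  # sum of squared differences, held fixed as in the table

rows = []
for n in (10, 100, 1000):
    biased = math.sqrt(numerator / n)          # dividing by N
    unbiased = math.sqrt(numerator / (n - 1))  # dividing by n - 1
    rows.append((n, biased, unbiased, unbiased - biased))

for n, biased, unbiased, gap in rows:
    print(f"n={n:5d}  /N: {biased:.4f}  /(n-1): {unbiased:.4f}  gap: {gap:.4f}")
```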

Page 72: Measures of Central Tendency Measures of Dispersion

Interpreting Variance & Standard Deviation

s gives the average difference between each data value and the mean of a dataset, and s² squares it and so exaggerates it.

The larger the values of s and s², the more spread out the data values are and the larger the differences between them.

If s and s² are equal to zero then there are no differences between your data values.

The standard deviation and the variance each require an arithmetic mean to work, not the median or the

mode. Therefore they require the same rigour as the mean and are sensitive to extreme values as well,

especially the variance.

Page 73: Measures of Central Tendency Measures of Dispersion

The Coefficient of Variation (Cv)

Page 74: Measures of Central Tendency Measures of Dispersion

Calculating the Coefficient Of Variation

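The equation for the sample coefficient of variation is:

$$C_v = \frac{s}{\bar{x}} \times 100$$

And, for the population:

$$C_v = \frac{\sigma}{\mu} \times 100$$

where σ and μ are the population standard deviation and the population mean.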

Page 75: Measures of Central Tendency Measures of Dispersion

Interpreting The Coefficient Of Variation

The coefficient of variation expresses the standard deviation as a percentage

of the mean.

Allows easy comparison of standard deviations with one another.

Page 76: Measures of Central Tendency Measures of Dispersion

Interpreting The Coefficient Of Variation

By way of example:

Compare an s of $2,400 on a per capita average income of $55,000 against an s of $300 on a per capita average income of $2,000 – how to interpret?

Here the coefficients of variation are 4.4% and 15% indicating a much wider range of variability in the poorer nation – that is a much wider gap between

rich and poor.

Case in point: the coefficient of variation for global GNI is 108.9%! This indicates an extraordinary gap

between rich and poor nations.

Page 77: Measures of Central Tendency Measures of Dispersion

| | Dataset #1 | Dataset #2 | Dataset #3 |
|---|---|---|---|
| | $45,000.00 | $80,000.00 | $80,000.00 |
| | $45,000.00 | $45,000.00 | $45,000.00 |
| | $45,000.00 | $45,000.00 | $45,000.00 |
| | $43,000.00 | $43,000.00 | $43,000.00 |
| | $41,000.00 | $41,000.00 | $41,000.00 |
| | $40,000.00 | $40,000.00 | $40,000.00 |
| | $37,000.00 | $37,000.00 | $37,000.00 |
| | $35,000.00 | $35,000.00 | $1.00 |
| Sum | $331,000.00 | $366,000.00 | $331,001.00 |
| n | 8 | 8 | 8 |
| Mean | $41,375.00 | $45,750.00 | $41,375.13 |
| Median | $42,000.00 | $42,000.00 | $42,000.00 |
| Mode | $45,000.00 | $45,000.00 | $45,000.00 |
| MAX | $45,000.00 | $80,000.00 | $80,000.00 |
| MIN | $35,000.00 | $35,000.00 | $1.00 |
| Range | $10,000.00 | $45,000.00 | $79,999.00 |
| s | $3,852.18 | $14,290.36 | $21,559.86 |
| s² | $14,839,285.71 | $204,214,285.71 | $464,827,464.41 |
| Cv | 9.31% | 31.24% | 52.11% |

Note that the highest Cv is 5.6 times the lowest (52.11 / 9.31), indicating that dataset #3 is considerably more variable than dataset #1 – the effect of the two extreme values is evident.
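The Cv row, and the two-country comparison from the earlier slide, can be checked with a small helper. `coefficient_of_variation` is an illustrative name; it uses the sample standard deviation (n − 1).

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return stdev(values) / mean(values) * 100


dataset_1 = [45000, 45000, 45000, 43000, 41000, 40000, 37000, 35000]
dataset_2 = [80000, 45000, 45000, 43000, 41000, 40000, 37000, 35000]
dataset_3 = [80000, 45000, 45000, 43000, 41000, 40000, 37000, 1]

cv_1 = coefficient_of_variation(dataset_1)
cv_2 = coefficient_of_variation(dataset_2)
cv_3 = coefficient_of_variation(dataset_3)
print(round(cv_1, 2), round(cv_2, 2), round(cv_3, 2), round(cv_3 / cv_1, 1))

# The income comparison: Cv from a given s and mean.
rich_cv = 2400 / 55000 * 100
poor_cv = 300 / 2000 * 100
print(round(rich_cv, 1), round(poor_cv, 1))
```

Expressing each s relative to its own mean is what allows standard deviations measured on very different scales to be compared.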

Page 78: Measures of Central Tendency Measures of Dispersion

Summary Stats So Far

Arithmetic mean and standard deviation are fundamental to statistics.

Form the heart of descriptive statistics.

Are the essential building blocks of all other statistical methods – look for them as

elements in future formulas.

Other measures of dispersion have their roles, are more robust, but not as powerful.

Page 79: Measures of Central Tendency Measures of Dispersion

All Geography students are deviants.

Page 80: Measures of Central Tendency Measures of Dispersion

All Geography students are above average deviants.

mg!