statistic & information theory (csnb134) module 2 numerical data representation

26
STATISTIC & INFORMATION THEORY (CSNB134) MODULE 2 NUMERICAL DATA REPRESENTATION

Upload: phyllis-lee

Post on 16-Jan-2016

218 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: STATISTIC & INFORMATION THEORY (CSNB134) MODULE 2 NUMERICAL DATA REPRESENTATION

STATISTIC & INFORMATION THEORY

(CSNB134)

MODULE 2NUMERICAL DATA REPRESENTATION

Page 2: STATISTIC & INFORMATION THEORY (CSNB134) MODULE 2 NUMERICAL DATA REPRESENTATION

Recap: Module 1

In Module 1, we have learned several techniques to describe data by using graphs / charts.

Is it effective?Graphs / Charts are effective at giving the overall view of a situation / a populationHOWEVERGraphs / Charts cannot give precise information for inferential purposes (note: infer == to make conclusions)IN FACTGraphs / Charts may not be suitable for all cases (e.g. How to describe a student result for this semester?)

Page 3: STATISTIC & INFORMATION THEORY (CSNB134) MODULE 2 NUMERICAL DATA REPRESENTATION

Describing Data with Numerical Measures Numerical measures can be created for

both populations and samples.- A parameter is a numerical descriptive measure calculated for a population.- A statistic is a numerical descriptive measure calculated for a sample.

It is best to describe data by using both numerical and graphical representations whenever possible.

Page 4: STATISTIC & INFORMATION THEORY (CSNB134) MODULE 2 NUMERICAL DATA REPRESENTATION

Arithmetic Mean or Average The mean of a set of measurements is

the sum of the measurements divided by the total number of measurements (i.e. the average).

n

xx i

n

xx i

where n = number of measurements

∑ xi = sum of all measurements

Page 5: STATISTIC & INFORMATION THEORY (CSNB134) MODULE 2 NUMERICAL DATA REPRESENTATION

Example

The set: 2, 9, 11, 5, 6

n

xx i 6.6

5

33

5

651192

When do you often use mean? When the measures of overall population follows a normal distribution. E.g. height, weight, income etc.

Page 6: STATISTIC & INFORMATION THEORY (CSNB134) MODULE 2 NUMERICAL DATA REPRESENTATION

The median of a set of measurements is the middle measurement when the measurements are ranked from smallest to largest.

The position of the median is

Median

.5(n + 1)

once the measurements have been ordered.

Page 7: STATISTIC & INFORMATION THEORY (CSNB134) MODULE 2 NUMERICAL DATA REPRESENTATION

Example The set: 2, 4, 9, 8, 6, 5, 3 n = 7 Sort: 2, 3, 4, 5, 6, 8, 9 Position: .5(n + 1) = .5(7 + 1) = 4th

Median = 5 (i.e. 4th largest measurement

The set: 2, 4, 9, 8, 6, 5 n = 6 Sort: 2, 4, 5, 6, 8, 9 Position: .5(n + 1) = .5(6 + 1) = 3.5th

Median = (5 + 6)/2 = 5.5 — average of the 3rd and 4th measurements

Page 8: STATISTIC & INFORMATION THEORY (CSNB134) MODULE 2 NUMERICAL DATA REPRESENTATION

Mode

The mode is the measurement which occurs most frequently.

The set: 2, 4, 9, 8, 8, 5, 3 The mode is 8, which occurs twice

The set: 2, 2, 9, 8, 8, 5, 3 There are two modes which are 8 and 2

(bimodal) The set: 2, 4, 9, 8, 5, 3

There is no mode (each value is unique).

Page 9: STATISTIC & INFORMATION THEORY (CSNB134) MODULE 2 NUMERICAL DATA REPRESENTATION

Example

Mean?

Median?

Mode? (Highest peak)

Calculate the mean, median and mode for the number of quarts of milk purchased by the following 25 households:

0 0 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3 3 3 4 4 4 5

2.225

55

n

xx i

2m

2mode

Page 10: STATISTIC & INFORMATION THEORY (CSNB134) MODULE 2 NUMERICAL DATA REPRESENTATION

Measures of Center

A measure along the horizontal axis of the data distribution that locates the center of the distribution.

What do you use as a measure of centre? (a) Mean? (b) Median? (c) Mode?

Page 11: STATISTIC & INFORMATION THEORY (CSNB134) MODULE 2 NUMERICAL DATA REPRESENTATION

The mean is more easily affected by extremely large or small values than the median.

Extreme Values

The median is often used as a measure of center when the distribution is skewed.

Page 12: STATISTIC & INFORMATION THEORY (CSNB134) MODULE 2 NUMERICAL DATA REPRESENTATION

Extreme Values (cont.)

Skewed left: Mean < Median

Skewed right: Mean > Median

Symmetric: Mean = Median

Page 13: STATISTIC & INFORMATION THEORY (CSNB134) MODULE 2 NUMERICAL DATA REPRESENTATION

Skewed Right (Positively Skewed)

Skewed Right – long tail to the right A few high numbers pull the mean above

the median The set: The graph:

00.5

11.5

22.5

33.5

44.5

5

Frequency

1234

Num. Frequency

1 3

2 5

3 3

4 1

Mean = [1(3) + 2(5) + 3(3) + 4(1)] / 12 = 2.17

Median = 2

Mean > Median

Page 14: STATISTIC & INFORMATION THEORY (CSNB134) MODULE 2 NUMERICAL DATA REPRESENTATION

Skewed Left (Negatively Skewed)

Skewed Left – long tail to the left A few low numbers pull the mean below the

medianThe set: The graph:

00.5

11.5

22.5

33.5

44.5

5

Frequency

1234

Num. Frequency

1 1

2 3

3 5

4 3

Mean = [1(1) + 2(3) + 3(5) + 4(3)] / 12 = 2.83

Median = 3

Mean < Median

Page 15: STATISTIC & INFORMATION THEORY (CSNB134) MODULE 2 NUMERICAL DATA REPRESENTATION

Measures of Centre Vs. Variability

I was told that the average height of plants here is only 1 feet.

But this tree is 10 feet high!!! !#$&*^(&**

Often, measure of centre does not give the true picture. Need to know the measure of variability from the centre too….

Page 16: STATISTIC & INFORMATION THEORY (CSNB134) MODULE 2 NUMERICAL DATA REPRESENTATION

Measures of Variability

A measure along the horizontal axis of the data distribution that describes the spread of the distribution from the center.

Page 17: STATISTIC & INFORMATION THEORY (CSNB134) MODULE 2 NUMERICAL DATA REPRESENTATION

The Range

The range, R, of a set of n measurements is the difference between the largest and smallest measurements.

Example: A botanist records the number of petals on 5 flowers:

5, 12, 6, 8, 14 The range is

R = 14 – 5 = 9

Page 18: STATISTIC & INFORMATION THEORY (CSNB134) MODULE 2 NUMERICAL DATA REPRESENTATION

The Variance

The variance is measure of variability that uses all the measurements (as oppose to range R that uses only 2 measurements, maximum and minimum).

It measures the average deviation of the measurements from their mean.

Flower petals: 5, 12, 6, 8, 14

95

45x 9

5

45x

4 6 8 10 12 14

Page 19: STATISTIC & INFORMATION THEORY (CSNB134) MODULE 2 NUMERICAL DATA REPRESENTATION

The variance of a population of N measurements is the average of the squared deviations of the measurements about their mean m.

The Variance

The variance of a sample of n measurements is the sum of the squared deviations of the measurements about their mean, divided by (n – 1).

N

xi2

2 )(

N

xi2

2 )(

1

)( 22

n

xxs i

1

)( 22

n

xxs i

Page 20: STATISTIC & INFORMATION THEORY (CSNB134) MODULE 2 NUMERICAL DATA REPRESENTATION

In calculating the variance, we squared all of the deviations, and in doing so changed the scale of the measurements.

To return this measure of variability to the original units of measure, we calculate the standard deviation, the positive square root of the variance.

The Standard Deviation

2

2

:deviation standard Sample

:deviation standard Population

ss

2

2

:deviation standard Sample

:deviation standard Population

ss

Page 21: STATISTIC & INFORMATION THEORY (CSNB134) MODULE 2 NUMERICAL DATA REPRESENTATION

2 Ways to Calculate the Sample Variance

1

)( 22

n

xxs i

5 -4 16

12 3 9

6 -3 9

8 -1 1

14 5 25

Sum 45 0 60

Use the Definition Formula:ix xxi

2)( xxi

154

60

87.3152 ss

Page 22: STATISTIC & INFORMATION THEORY (CSNB134) MODULE 2 NUMERICAL DATA REPRESENTATION

2 Ways to Calculate the Sample Variance

1

)( 22

2

nnx

xs

ii

5 25

12 144

6 36

8 64

14 196

Sum 45 465

Use the Calculational Formula:

ix2ix

154

545

4652

87.3152 ss

Page 23: STATISTIC & INFORMATION THEORY (CSNB134) MODULE 2 NUMERICAL DATA REPRESENTATION

The value of s is ALWAYS positive. The larger the value of s2 or s, the larger

the variability of the data set. Why divide by n –1?

The sample standard deviation s is often used to estimate the population standard deviation s. Dividing by n –1 gives us a better estimate of s.

Some Notes

Page 24: STATISTIC & INFORMATION THEORY (CSNB134) MODULE 2 NUMERICAL DATA REPRESENTATION

1. Question: Find the mean, median and mode of:5, 7, 3, 5, 6, 8, 5, 6, 4, 6, 25

Solution: Note: First, arrange the data3, 4, 5, 5, 5, 6, 6, 6, 7, 8, 25

median = 6; mean = 80/11 = 7.27 ; modes = 5 and 6

2. Question: Eliminate the last observation x= 25 and then find the mean, median and mode. How do these values compare with those found using the full data set? Solution: median = 5.5; mean = 55/10 = 5.5; modes = 5 and 6. The mean is smaller.

3. Question: How do possible outliers (such as 25) affect these values? Solution: The mean is very much affected by the outlier, while the median and mode are not so.

Exercise 1

Page 25: STATISTIC & INFORMATION THEORY (CSNB134) MODULE 2 NUMERICAL DATA REPRESENTATION

Given the observations 7, 9, 10, 6, 8, 7, 8, 9, 8

calculate:1. the range

Solution : R = 10 – 6 = 4

2. the meanSolution : Mean = 72 / 9 = 8

3. the varianceSolution : Variance = [588 – (722/9)] / 8 = 12 / 8 = 1.5

4. the standard deviationSolution : Standard Deviation = √1.5 = 1.225

Exercise 2

Page 26: STATISTIC & INFORMATION THEORY (CSNB134) MODULE 2 NUMERICAL DATA REPRESENTATION

STATISTIC & INFORMATION THEORY

(CSNB134)

NUMERICAL DATA REPRESENTATION

--END--