summary statistics (1)

SUMMARY STATISTICS

July 2014

MEASURES OF CENTRAL TENDENCY

• We depend on volumes of data to make various strategic decisions in business.

• Dealing with large volumes of data comes with various challenges

• To the production foreman, detail is of the essence, but

• To top management, it is better to summarise the data for easy management, because the prime interest is in overall profitability

• Two items are always of utmost importance:

– Measure of central tendency (about the middle of the distribution)

– Measure of dispersion (about the centre)

Mean, Mode, Median and Geometric Mean

• The Mean

– It is the sum of the data divided by the number of items constituting the data

$$\text{Mean} = \bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

This is the same as

$$\bar{X} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$$

– If we have the following set of measurements:

x = 2, 4, 2, 3, 3, 5, 2 (so n = 7), the mean is calculated as follows:

Ẋ = (2+4+2+3+3+5+2)/7

Ẋ = 21/7

Ẋ = 3
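The same calculation can be sketched in a few lines of Python. The variable names are illustrative only and are not part of the original notes.

```python
# Measurements from the worked example.
values = [2, 4, 2, 3, 3, 5, 2]

# Mean = sum of the values divided by the number of values.
mean = sum(values) / len(values)

print(mean)  # 3.0
```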

The mode

• It is the most frequently occurring value in a set of measurements

• If we have the following set of measurements:

2, 4, 2, 3, 3, 5, 2, the mode is found as follows:

1. Arrange the values in order of magnitude

2, 2, 2, 3, 3, 4, 5

2. Determine the value that occurs most frequently: 2 occurs three times.

In the above, the mode is 2.

• There is no doubt that the mode is important but:

– It may not be unique: a set of values can have two or more modes

– It cannot be expressed algebraically hence very few statistical operations are developed around it.
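A minimal sketch of finding the mode by counting frequencies. The helper `modes` and the second data set are hypothetical, added only to show that the mode need not be unique.

```python
from collections import Counter


def modes(values):
    """Return every value that shares the highest frequency, in sorted order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


# Data from the worked example: 2 occurs three times.
print(modes([2, 4, 2, 3, 3, 5, 2]))  # [2]

# Hypothetical data set with two modes.
print(modes([1, 1, 2, 2, 3]))  # [1, 2]
```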

The median

• It is the middle value when the values are arranged in ascending order of magnitude, if the set contains an odd number of values

• or the arithmetic average of the middle two values if the set contains an even number of values.

Example

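• Calculate the median of the values below:

2, 4, 2, 3, 3, 5, 2

Solution:

Arrange the values in ascending order:

2, 2, 2, 3, 3, 4, 5

There are 7 values (an odd number), so the median is the 4th value:

Median = 3

With an even number of values, the two middle values are averaged instead, e.g. (3+3)/2 = 6/2 = 3.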

• The median divides the set of values into two halves: one containing values below the median and the other containing values above the median.

• We could also have quartiles (division into 4), deciles (division into tenths) and percentiles (division into hundredths).

• The disadvantage of the median is that it involves laborious arrangement of figures in their order of magnitude.
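The odd and even cases can be sketched as below. The function name `median` and the four-value data set are illustrative assumptions. In code, the sort is one call, so the labour of ordering the figures falls to the computer.

```python
def median(values):
    """Middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value exists.
        return ordered[mid]
    # Even count: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([2, 4, 2, 3, 3, 5, 2]))  # 3

# Hypothetical even-sized set: sorted it is 2, 2, 3, 4.
print(median([2, 4, 2, 3]))  # 2.5
```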

The Mean of Grouped Data

• Raw data may be presented in the form of a frequency table, as in the example below about the daily receipts of a shopping mall over 500 days.

| Daily receipts (₵) | Number of days |
|---|---|
| 0<100 | 10 |
| 100<200 | 30 |
| 200<300 | 50 |
| 300<400 | 80 |
| 400<500 | 100 |
| 500<600 | 85 |
| 600<700 | 75 |
| 700<800 | 40 |
| 800<900 | 25 |
| 900<1000 | 5 |

The formula:

If the r class intervals are numbered 1, …, r, and m_i and f_i are the midpoint of, and the number of measurements in, the ith interval respectively, the mean Ẋ is given by

$$\bar{X} = \frac{\sum_{i=1}^{r} f_i m_i}{\sum_{i=1}^{r} f_i}$$

Thus, if n is the total number of measurements, the grouped mean is

$$\bar{X} = \frac{\sum_{i=1}^{r} f_i m_i}{n}$$

| Daily receipts (₵) | Number of days (f_i) | Midpoint (m_i) | f_i m_i |
|---|---|---|---|
| 0<100 | 10 | 50 | 500 |
| 100<200 | 30 | 150 | 4,500 |
| 200<300 | 50 | 250 | 12,500 |
| 300<400 | 80 | 350 | 28,000 |
| 400<500 | 100 | 450 | 45,000 |
| 500<600 | 85 | 550 | 46,750 |
| 600<700 | 75 | 650 | 48,750 |
| 700<800 | 40 | 750 | 30,000 |
| 800<900 | 25 | 850 | 21,250 |
| 900<1000 | 5 | 950 | 4,750 |
| Total | 500 | | ∑ f_i m_i = 242,000 |

$$\bar{X} = \frac{\sum f_i m_i}{n}$$

Ẋ = 242,000/500

Ẋ = GHC 484.00
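A short sketch of the grouped-mean calculation, using the class boundaries and frequencies from the table. The midpoints are computed in code rather than typed in, and summing the f·m products directly gives a total of 242,000. The variable names are assumptions made for illustration.

```python
# (lower bound, upper bound, number of days) for each class in the table.
classes = [
    (0, 100, 10), (100, 200, 30), (200, 300, 50), (300, 400, 80),
    (400, 500, 100), (500, 600, 85), (600, 700, 75), (700, 800, 40),
    (800, 900, 25), (900, 1000, 5),
]

# n is the total number of measurements (days).
total_days = sum(f for _, _, f in classes)

# Sum of frequency x midpoint, where midpoint = (lower + upper) / 2.
weighted_sum = sum(f * (lo + hi) / 2 for lo, hi, f in classes)

grouped_mean = weighted_sum / total_days

print(total_days, weighted_sum, grouped_mean)  # 500 242000.0 484.0
```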

Dispersion

• A measure of centrality alone does not provide a sufficiently adequate summary of a set of values

• Consider the two sets of values below:

(a) -2, -1, 0, 0, 1, 2

(b) 0, 0, 0, 0, 0, 0

• In both sets, the mean, mode and median are all 0. The difference in character does not lie in their centrality but in their variation about the central value.

• In the measurement of dispersion, there are various statistics:

– Standard Deviation and Variance are of utmost importance

– The Range (difference between the highest and lowest measurements)

– Inter-quartile range (difference between the upper and lower quartiles).
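As a rough illustration, the two sets can be compared in Python: the central measures agree, while the range separates them. The helper `value_range` is a hypothetical name introduced here.

```python
from statistics import mean, median

set_a = [-2, -1, 0, 0, 1, 2]
set_b = [0, 0, 0, 0, 0, 0]


def value_range(values):
    """Range = highest measurement minus lowest measurement."""
    return max(values) - min(values)


for name, data in (("a", set_a), ("b", set_b)):
    # Mean and median match across the two sets; only the range differs.
    print(name, mean(data), median(data), value_range(data))
```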

Standard Deviation and Variance

• Finding the standard deviation (σ) of a set of values given as:

2, 4, 2, 3, 3, 5, 2

– Find the mean of the distribution

(2+4+2+3+3+5+2)/7 = 21/7 = 3

– Find the deviation of each measurement from the mean

-1, 1, -1, 0, 0, 2, -1

– Square the deviation of each measurement from the mean

1, 1, 1, 0, 0, 4, 1

– Sum the squared deviations

1+1+1+0+0+4+1 = 8

– Divide the sum of the squared deviations by the total number of measurements to give the variance (σ²) = 8/7 = 1.142857

– Take the square root of the variance to undo the squaring and return to the original units of measurement = √(8/7) or √1.142857 = 1.069045

Therefore, σ = 1.069045

NOTE: S.D = σ = √variance
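The steps listed above can be followed one by one in code. This is a sketch with illustrative variable names, dividing by the number of measurements as the worked example does.

```python
import math

values = [2, 4, 2, 3, 3, 5, 2]
n = len(values)

# Step 1: mean of the distribution.
mean = sum(values) / n

# Step 2: deviation of each measurement from the mean.
deviations = [x - mean for x in values]

# Step 3: square each deviation.
squared = [d ** 2 for d in deviations]

# Steps 4 and 5: sum the squares and divide by n to get the variance.
variance = sum(squared) / n

# Step 6: the standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print(round(variance, 6), round(std_dev, 6))  # 1.142857 1.069045
```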

Formula

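• This is given by the formula, where σ is the standard deviation:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{X})^2}{n}} \quad \text{OR} \quad \sigma = \sqrt{\frac{\sum_{i=1}^{n} x_i^2}{n} - \left(\frac{\sum_{i=1}^{n} x_i}{n}\right)^2}$$

where

$$\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Large sample:

$$S = \sqrt{\frac{\sum f(X - \bar{X})^2}{\sum f}}$$

Sample:

$$S = \sqrt{\frac{\sum f(X - \bar{X})^2}{n - 1}}$$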
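To see how the divisor matters, the sketch below compares the two forms on the seven values used earlier. It relies on `pstdev` (divides by n) and `stdev` (divides by n − 1) from Python's standard `statistics` module, and checks the shortcut form of the formula against the first. Treating the seven values as a sample is an assumption made purely for illustration.

```python
import math
from statistics import pstdev, stdev

values = [2, 4, 2, 3, 3, 5, 2]
n = len(values)

# Shortcut form: sqrt( sum(x^2)/n - (sum(x)/n)^2 ).
shortcut = math.sqrt(sum(x * x for x in values) / n - (sum(values) / n) ** 2)

population_sd = pstdev(values)  # divisor n
sample_sd = stdev(values)       # divisor n - 1

print(round(shortcut, 6), round(population_sd, 6), round(sample_sd, 6))
```

The n − 1 divisor always gives a slightly larger figure than the n divisor, and the gap narrows as the number of measurements grows.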

REGRESSION AND CORRELATION ANALYSIS

• Statistics involves the analysis of variables:

– Dependent variable

– Independent variable

• Regression analysis explains the exact form of the dependence of one variable on another

• Correlation analysis measures the degree of dependence of one variable on the other

• Both regression and correlation analysis study the form of association between a set of variables

• The power of these analyses lies in predicting the value of one variable from a known value of another.

• E.g. we can predict output levels given the man-hours, or sales volume given the amount of money spent on promotion.

Linear regression

Y = mX + c

This is a simple linear equation where X and Y are variables and m and c are constants.

Y = 4x + 6

Y = 3.5x + 7.2

Y = 13.8x + 76.1

are all examples of linear equations that can be represented graphically as straight lines, but the objective is to determine the line of Best Fit through a scatter diagram. The best-fit values of m and c are obtained by partial differentiation of the sum of squared deviations.

It is also possible to have non-linear relationships whose graphical representation is not in the form of a straight line.

Least Squares Regression

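• In least squares regression analysis, we are given two equations (the normal equations) to solve:

$$\sum y_i = nc + m\sum x_i \qquad (1)$$

$$\sum x_i y_i = c\sum x_i + m\sum x_i^2 \qquad (2)$$

Solving (1) and (2) simultaneously gives the values of m and c that minimise the sum of squared deviations:

$$m = \frac{n\sum x_i y_i - (\sum x_i)(\sum y_i)}{n\sum x_i^2 - (\sum x_i)^2} \qquad (3)$$

$$c = \frac{(\sum y_i)(\sum x_i^2) - (\sum x_i)(\sum x_i y_i)}{n\sum x_i^2 - (\sum x_i)^2} = \frac{1}{n}\left(\sum y_i - m\sum x_i\right) \qquad (4)$$

OR, in deviation form, y = (∑xy/∑x²)x, where x = X − Ẋ and y = Y − Ӯ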
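The closed-form solution for m and c can be sketched as a small function. The name `least_squares` and the four test points are hypothetical. The points lie exactly on Y = 2X + 1, so the expected slope and intercept are known in advance.

```python
def least_squares(xs, ys):
    """Return (m, c) for the best-fit line Y = mX + c through the points."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all X values are equal, so no unique line exists")

    m = (n * sum_xy - sum_x * sum_y) / denominator
    c = (sum_y - m * sum_x) / n
    return m, c


# Hypothetical points lying exactly on Y = 2X + 1.
m, c = least_squares([0, 1, 2, 3], [1, 3, 5, 7])
print(m, c)  # 2.0 1.0
```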

Example

• The output of A&B Co is given in Table 1. You are to provide the best fit using the least squares approach.

Table 1

| Week | Total Output (X), independent variable | Total Cost (Y), dependent variable |
|---|---|---|
| 1 | 2 | 11.2 |
| 2 | 3 | 15.6 |
| 3 | 5 | 20.3 |
| 4 | 4 | 20.8 |
| 5 | 1 | 7.8 |
| 6 | 3 | 10.6 |
| 7 | 2 | 12.3 |
| 8 | 4 | 21.5 |
| 9 | 5 | 22 |
| 10 | 6 | 27.6 |

SOLUTION

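| Week | Total Output (X) | Total Cost (Y) | XY | X² | Y² |
|---|---|---|---|---|---|
| 1 | 2 | 11.2 | 22.4 | 4 | 125.44 |
| 2 | 3 | 15.6 | 46.8 | 9 | 243.36 |
| 3 | 5 | 20.3 | 101.5 | 25 | 412.09 |
| 4 | 4 | 20.8 | 83.2 | 16 | 432.64 |
| 5 | 1 | 7.8 | 7.8 | 1 | 60.84 |
| 6 | 3 | 10.6 | 31.8 | 9 | 112.36 |
| 7 | 2 | 12.3 | 24.6 | 4 | 151.29 |
| 8 | 4 | 21.5 | 86 | 16 | 462.25 |
| 9 | 5 | 22 | 110 | 25 | 484 |
| 10 | 6 | 27.6 | 165.6 | 36 | 761.76 |
| TOTAL | 35 | 169.7 | 679.7 | 145 | 3,246.03 |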

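• Given the deviation form y = (∑xy/∑x²)x, where x = X − Ẋ and y = Y − Ӯ, with Ẋ = 35/10 = 3.5 and Ӯ = 169.7/10 = 16.97:

∑xy = ∑XY − nẊӮ = 679.7 − (10 × 3.5 × 16.97) = 85.75

∑x² = ∑X² − nẊ² = 145 − (10 × 3.5²) = 22.5

m = 85.75/22.5 = 3.811

Or

m = {(10 × 679.7) − (35 × 169.7)}/{(10 × 145) − (35)²} = 857.5/225 = 3.811

c = 1/10{169.7 − (3.811 × 35)} = 3.63

Y = 3.811X + 3.63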
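The worked example can be checked with a short script that rebuilds the column totals from the raw A&B Co data and then applies the formulas for m and c. The variable names are illustrative.

```python
# Weekly data for A&B Co: total output (X) and total cost (Y).
output = [2, 3, 5, 4, 1, 3, 2, 4, 5, 6]
cost = [11.2, 15.6, 20.3, 20.8, 7.8, 10.6, 12.3, 21.5, 22, 27.6]

n = len(output)
sum_x = sum(output)                                 # 35
sum_y = sum(cost)                                   # about 169.7
sum_xy = sum(x * y for x, y in zip(output, cost))   # about 679.7
sum_x2 = sum(x * x for x in output)                 # 145

# Slope and intercept of the least squares line Y = mX + c.
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
c = (sum_y - m * sum_x) / n

print(round(m, 3), round(c, 2))  # 3.811 3.63
```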
