SUMMARY STATISTICS
July 2014
MEASURES OF CENTRAL TENDENCY
• We depend on volumes of data to make various strategic decisions in business.
• Dealing with large volumes of data comes with various challenges.
• To the production foreman, detail is of the essence.
• To top management, it is better to summarise the data, because their prime interest is in overall profitability.
• Two items are always of utmost importance:
– Measure of central tendency (about the middle of the distribution)
– Measure of dispersion (the spread about the centre)
Mean, Mode, Median and Geometric Mean
• The Mean
– It is the sum of the data divided by the number of items constituting the data:
$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$ (the mean is written Ẋ in the worked examples)
– If we have the following set of measurements:
2, 4, 2, 3, 3, 5, 2, the mean is calculated as follows:
Ẋ = (2+4+2+3+3+5+2)/7
Ẋ = 21/7
Ẋ = 3
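As a quick check of the arithmetic above, here is a minimal Python sketch. The names `values` and `mean` are illustrative choices and do not come from the slides.

```python
# Data set used in the worked example above.
values = [2, 4, 2, 3, 3, 5, 2]


def mean(xs):
    """Arithmetic mean: the sum of the items divided by the number of items."""
    return sum(xs) / len(xs)


print(mean(values))  # 21 / 7 = 3.0
```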
The mode
• It is the most frequently occurring value in a set of measurements.
• If we have the following set of measurements:
2, 4, 2, 3, 3, 5, 2, the mode is found as follows:
1. Arrange the values in order of magnitude:
2, 2, 2, 3, 3, 4, 5
2. Determine the value that occurs most frequently: 2 (it occurs three times).
In the above, the mode is 2.
• There is no doubt that the mode is important but:
– It may not be unique: a set of values can have more than one mode
– It cannot be expressed algebraically hence very few statistical operations are developed around it.
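The counting procedure for the mode, and the point that it may not be unique, can be sketched in Python as follows. The helper name `modes` is an illustrative choice, and the second data set is hypothetical, made up only to show a tie.

```python
from collections import Counter

# Data set used in the worked example above.
values = [2, 4, 2, 3, 3, 5, 2]


def modes(xs):
    """Return every value that attains the highest frequency, in ascending order."""
    counts = Counter(xs)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


print(modes(values))  # [2]

# Hypothetical data (not from the slides): 1 and 4 both occur twice.
print(modes([1, 1, 4, 4, 6]))  # [1, 4]
```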
The median
• It is the middle value when the values are arranged in ascending order of magnitude, if the set contains an odd number of values,
• or the arithmetic average of the middle two values if the set contains an even number of values.
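Example
• Calculate the median of the values below:
2, 4, 2, 3, 3, 5, 2
Solution:
Arranged in ascending order: 2, 2, 2, 3, 3, 4, 5
There are seven values, so the median is the fourth value:
Median = 3
With an even number of values, the two middle values are averaged; for middle values 3 and 3 this gives (3+3)/2 = 6/2 = 3.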
• The median divides the set of values into two halves: one containing values below the median and the other containing values above the median.
• We could also have quartiles (division into 4), deciles (division into tenths) and percentiles (division into hundredths).
• The disadvantage of the median is that it involves laborious arrangement of figures in their order of magnitude.
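The odd and even cases of the median can be sketched in Python as below. The function name `median` is illustrative; the even-sized set is obtained simply by dropping the last value of the slide's data. The quartiles come from the standard library's `statistics.quantiles` (Python 3.8 or later), whose default interpolation method is only one of several conventions, so its cut points may differ slightly from a hand calculation.

```python
import statistics

# Data set used in the worked example above.
values = [2, 4, 2, 3, 3, 5, 2]


def median(xs):
    """Middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_case = median(values)        # seven values: the fourth sorted value
even_case = median(values[:-1])  # six values: average of the two middle values

# Three cut points that divide the data into four parts (quartiles).
quartiles = statistics.quantiles(values, n=4)

print(odd_case, even_case)
```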
The Mean of Grouped Data
• Raw data may be presented in the form of a frequency table, as in the example below about daily receipts of a shopping mall for 500 days.
Daily receipts (₵)    Number of days
0<100                 10
100<200               30
200<300               50
300<400               80
400<500               100
500<600               85
600<700               75
700<800               40
800<900               25
900<1000              5
The formula:
If the r class intervals are numbered i = 1,…,r, and mi and fi are the midpoint of, and the number of measurements in, the ith interval respectively, the mean Ẋ is given by:
$\bar{X} = \frac{\sum_{i=1}^{r} f_i m_i}{\sum_{i=1}^{r} f_i}$
Thus, if $n = \sum_{i=1}^{r} f_i$ is the total number of measurements, the group mean is
$\bar{X} = \frac{1}{n}\sum_{i=1}^{r} f_i m_i$
Daily receipts (₵)    Number of days (fi)    Midpoints (mi)    fi mi
0<100                 10                     50                500
100<200               30                     150               4,500
200<300               50                     250               12,500
300<400               80                     350               28,000
400<500               100                    450               45,000
500<600               85                     550               46,750
600<700               75                     650               48,750
700<800               40                     750               30,000
800<900               25                     850               21,250
900<1000              5                      950               4,750
Total                 n = 500                                  ∑ fi mi = 242,000
$\bar{X} = \frac{1}{n}\sum_{i=1}^{r} f_i m_i$
Ẋ = 242,000/500
Ẋ = GHC 484.00
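The grouped-mean calculation can be reproduced with a short Python sketch. The class limits and frequencies are taken from the table above; the variable names are illustrative. Summing the ten fi mi products this way gives 242,000 and a mean of 484.00.

```python
# (lower limit, upper limit, number of days) for each class in the table above.
classes = [
    (0, 100, 10),
    (100, 200, 30),
    (200, 300, 50),
    (300, 400, 80),
    (400, 500, 100),
    (500, 600, 85),
    (600, 700, 75),
    (700, 800, 40),
    (800, 900, 25),
    (900, 1000, 5),
]

# n: total number of measurements (sum of the frequencies).
total_days = sum(f for _, _, f in classes)

# Sum of f_i * m_i, where the midpoint m_i is halfway between the class limits.
weighted_sum = sum(f * (lower + upper) / 2 for lower, upper, f in classes)

grouped_mean = weighted_sum / total_days

print(total_days, weighted_sum, grouped_mean)  # 500 242000.0 484.0
```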
Dispersion
• A measure of centrality alone does not provide a sufficiently adequate summary of a set of values
• Consider the two sets of values below:
(a) -2, -1, 0, 0, 1, 2
(b) 0, 0, 0, 0, 0, 0
• In both sets, the mean, mode and median are all 0. The difference in character doesn't lie in their centrality but in their variation about the central value.
• In the measurement of dispersion, there are various statistics:
– Standard Deviation and Variance, which are of utmost importance
– The Range (difference between the highest and lowest measurements)
– Inter-quartile range (difference between the upper and lower quartiles).
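To see how the two sets differ in spread despite sharing the same centre, the range and inter-quartile range can be computed as in this sketch. The function names are illustrative, and the quartiles rely on the default method of `statistics.quantiles` (Python 3.8 or later), which is an assumption: other quartile conventions give slightly different values.

```python
import statistics

# The two sets of values from the slide above.
set_a = [-2, -1, 0, 0, 1, 2]
set_b = [0, 0, 0, 0, 0, 0]


def value_range(xs):
    """Difference between the highest and lowest measurements."""
    return max(xs) - min(xs)


def interquartile_range(xs):
    """Difference between the upper and lower quartiles."""
    lower_q, _, upper_q = statistics.quantiles(xs, n=4)
    return upper_q - lower_q


for name, data in (("a", set_a), ("b", set_b)):
    print(name, statistics.mean(data), value_range(data), interquartile_range(data))
```

Both sets report a mean of 0, but only set (a) has a non-zero range and inter-quartile range.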
Standard Deviation and Variance
• Finding the standard deviation (σ) of a set of values given as:
2, 4, 2, 3, 3, 5, 2
– Find the mean of the distribution
(2+4+2+3+3+5+2)/7 = 21/7 = 3
– Find the deviation of each measure from the mean
-1, 1, -1, 0, 0, 2, -1
– Square the deviation of each measure from the mean
1, 1, 1, 0, 0, 4, 1
– Sum the squared deviations
1+1+1+0+0+4+1 = 8
– Divide the sum of the squared deviations by the total number of measurements to give the variance (σ²) = 8/7 = 1.142857
– Take the square root of the variance to return to the original units of measurement: √(8/7) = √1.142857 = 1.069045
Therefore, σ = 1.069045
NOTE: S.D = σ = √(variance)
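The steps above can be followed one at a time in Python. This is a minimal sketch with illustrative variable names; it divides by n, as the worked example does.

```python
import math

# Data set used in the worked example above.
values = [2, 4, 2, 3, 3, 5, 2]
n = len(values)

# Step 1: mean of the distribution.
mean = sum(values) / n

# Step 2: deviation of each measure from the mean.
deviations = [x - mean for x in values]

# Step 3: squared deviations.
squared = [d ** 2 for d in deviations]

# Step 4: sum of the squared deviations.
sum_of_squares = sum(squared)

# Step 5: variance (divide by the number of measurements).
variance = sum_of_squares / n

# Step 6: standard deviation (square root of the variance).
std_dev = math.sqrt(variance)

print(sum_of_squares, round(variance, 6), round(std_dev, 6))
```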
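Formula
• This is given by the formula:
$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{X})^2}$ OR $\sigma = \sqrt{\frac{\sum_{i=1}^{n} x_i^2}{n} - \left(\frac{\sum_{i=1}^{n} x_i}{n}\right)^2}$
where $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Large sample: $S = \sqrt{\frac{\sum f(X - \bar{X})^2}{\sum f}}$
Sample: $S = \sqrt{\frac{\sum f(X - \bar{X})^2}{n - 1}}$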
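As a hedged illustration of these formulas, the sketch below checks on the earlier data set that the two divide-by-n forms agree, and shows how the n-1 divisor changes the result. Each value is treated as having frequency 1; the variable names are illustrative, and `statistics.pstdev` and `statistics.stdev` are the standard library's divide-by-n and divide-by-(n-1) versions.

```python
import math
import statistics

# Data set used in the earlier worked example.
values = [2, 4, 2, 3, 3, 5, 2]
n = len(values)
mean = sum(values) / n

# Definitional form: root of the mean squared deviation.
definitional = math.sqrt(sum((x - mean) ** 2 for x in values) / n)

# Shortcut form: root of (mean of squares minus square of the mean).
shortcut = math.sqrt(sum(x * x for x in values) / n - (sum(values) / n) ** 2)

# Sample form: divide by n - 1 instead of n.
sample_sd = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

print(round(definitional, 6), round(shortcut, 6), round(sample_sd, 6))
```

The n-1 version is always somewhat larger than the divide-by-n version, and the gap shrinks as n grows.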
REGRESSION AND CORRELATION ANALYSIS
• Statistics involves the analysis of variables:
– Dependent variable
– Independent variable
• Regression analysis concerns an explanation of the nature of the dependence of one variable on another.
• Correlation analysis measures the degree of dependence of one variable on the other.
• Both regression and correlation analysis study the association between a set of variables.
• The power of these analyses is prediction: estimating the value of one variable given the value of another.
• E.g. we can predict output levels given the man hours, or sales volume given the amount of money spent on promotion.
Linear regression
Y = mX + c
This is a simple linear equation where X and Y are variables and m and c are constants.
Y = 4x + 6
Y = 3.5x + 7.2
Y = 13.8x + 76.1
are all examples of linear equations that can be represented graphically as a straight line. The objective, however, is to determine the line of Best Fit on a scatter diagram, which is found through partial differentiation.
It is also possible to have non-linear relationships, whose graphical representation is not a straight line.
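Least Squares Regression
• In least squares regression analysis, we are given two equations (the normal equations) to solve for m and c:
$\sum y_i = nc + m\sum x_i$ ………………………………(1)
$\sum x_i y_i = c\sum x_i + m\sum x_i^2$ ……………………….(2)
Solving (1) and (2) simultaneously gives the values of m and c that minimise the sum of squared errors:
$m = \frac{n\sum x_i y_i - (\sum x_i)(\sum y_i)}{n\sum x_i^2 - (\sum x_i)^2}$ ………………………………(3)
$c = \frac{(\sum y_i)(\sum x_i^2) - (\sum x_i)(\sum x_i y_i)}{n\sum x_i^2 - (\sum x_i)^2} = \frac{1}{n}\left(\sum y_i - m\sum x_i\right)$ …………………………….(4)
OR, in deviation form, $y = \left(\frac{\sum xy}{\sum x^2}\right)x$, where $x = X - \bar{X}$ and $y = Y - \bar{Y}$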
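The closed-form expressions for the slope m and intercept c translate directly into code. The following is a minimal sketch; the function name `least_squares` is an illustrative choice, and the small data set in the usage lines is hypothetical, chosen to lie exactly on Y = 2X + 1 so the result is easy to verify by eye.

```python
def least_squares(xs, ys):
    """Return (m, c) of the best-fit line Y = mX + c by least squares."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    # Slope: (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x^2) - (sum(x))^2)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

    # Intercept: (sum(y) - m*sum(x)) / n
    c = (sum_y - m * sum_x) / n
    return m, c


# Hypothetical points (not from the slides) lying exactly on Y = 2X + 1.
m, c = least_squares([0, 1, 2, 3], [1, 3, 5, 7])
print(m, c)  # 2.0 1.0
```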
Example
• The output of A&B Co is given in table 1. You are to provide the best fit using the least squares approach.
Weeks    Total Output (X), independent variable    Total Cost (Y), dependent variable
1        2                                         11.2
2        3                                         15.6
3        5                                         20.3
4        4                                         20.8
5        1                                         7.8
6        3                                         10.6
7        2                                         12.3
8        4                                         21.5
9        5                                         22
10       6                                         27.6
SOLUTION
Weeks    Total Output (X)    Total Cost (Y)    XY       X2     Y2
1        2                   11.2              22.4     4      125.44
2        3                   15.6              46.8     9      243.36
3        5                   20.3              101.5    25     412.09
4        4                   20.8              83.2     16     432.64
5        1                   7.8               7.8      1      60.84
6        3                   10.6              31.8     9      112.36
7        2                   12.3              24.6     4      151.29
8        4                   21.5              86       16     462.25
9        5                   22                110      25     484
10       6                   27.6              165.6    36     761.76
TOTAL    35                  169.7             679.7    145    3,246.03
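• With n = 10, ∑X = 35, ∑Y = 169.7, ∑XY = 679.7 and ∑X2 = 145:
m = [(10 x 679.7) - (35 x 169.7)] / [(10 x 145) - (35)2] = 857.5/225 = 3.811
c = 1/10 {169.7 - (3.811 x 35)} = 3.63
Y = 3.811X + 3.63
• Equivalently, in deviation form with x = X - Ẋ and y = Y - Ӯ: ∑xy = 679.7 - (35 x 169.7)/10 = 85.75 and ∑x2 = 145 - (35)2/10 = 22.5, so m = 85.75/22.5 = 3.811.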
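The worked solution can be checked with a short Python sketch that recomputes the column totals from the A&B Co data and then applies the formulas for m and c. Variable names are illustrative, and the output level of 7 used in the last lines is a hypothetical input chosen only to show how the fitted line would be used for prediction.

```python
# Weekly data from the A&B Co table: total output (X) and total cost (Y).
output = [2, 3, 5, 4, 1, 3, 2, 4, 5, 6]
cost = [11.2, 15.6, 20.3, 20.8, 7.8, 10.6, 12.3, 21.5, 22, 27.6]

n = len(output)
sum_x = sum(output)
sum_y = sum(cost)
sum_xy = sum(x * y for x, y in zip(output, cost))
sum_x2 = sum(x * x for x in output)

# Slope and intercept of the least squares line Y = mX + c.
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
c = (sum_y - m * sum_x) / n

print(round(m, 3), round(c, 2))  # 3.811 3.63

# Hypothetical use: predicted total cost at an output level of 7 units.
predicted_cost = m * 7 + c
```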