introduction to business statistics qm 120 chapter 3 spring 2008 … 3-2.pdf · 2008-03-03 ·...
Post on 10-Jul-2020
9 Views
Preview:
TRANSCRIPT
QM-120, M. Zainal 1
DEPARTMENT OF QUANTITATIVE METHODS & INFORMATION SYSTEMS
Introduction to Business StatisticsQM 120
Chapter 3
Dr. Mohammad ZainalSpring 2008
Measures of central tendency for ungrouped data
Graphs are very helpful to describe the basic shape of a datadistribution; “a picture is worth a thousand words.” There arelimitations, however to the use of graphs.
2
One way to overcome graph problems is to use numericalmeasures, which can be calculated for either a sample or apopulation.
Numerical descriptive measures associated with a populationof measurements are called parameters; those computed from
l ll d i isample measurements are called statistics.
A measure of central tendency gives the center of a histogramor frequency distribution curve.
The measures are: mean, median, and mode.QM-120, M. Zainal
QM-120, M. Zainal 2
Measures of central tendency for ungrouped data
Data that give information on each member of the populationor sample individually are called ungrouped data, whereasgrouped data are presented in the form of a frequency
3
distribution table.MeanThe mean (average) is the most frequently used measure ofcentral tendency.
The mean for ungrouped data is obtained by dividing the sumof all values by the number if values in the data set Thusof all values by the number if values in the data set. Thus,
nx
x
Nx
∑
∑
=
=
: samplefor Mean
:populationfor Mean μ
QM-120, M. Zainal
Measures of central tendency for ungrouped data
Example: The following table gives the 2002 total payrolls of five MLBteams.
2002 total payroll ( illi f d ll )
MLB team
4
(millions of dollars)62Anaheim Angels93Atlanta Braves126New York Yankees75St. Louis Cardinals34Tampa Bay Devil Rays The Mean is
a balancing i t
34 62 9375 126
point
QM-120, M. Zainal
QM-120, M. Zainal 3
Measures of central tendency for ungrouped data
Sometimes a data set may contain a few very small or a fewvery large values. Such values are called outliers or extremevalues.
5
We should be very cautious when using the mean. It may notalways be the best measure of central tendency.
Example: The following table lists the 2000 populations (inthousands) of five Pacific states.
Population State 121262734215894 +++
Excluding California
California
1212Hawaii
627Alaska
3421Oregon
5894Washington
(thousands)State
33,872
An outlier
5.27884
121262734215894Mean =+++
=
2.90055
33,8721212 627 3421 5894 Mean =++++
=
Including California
QM-120, M. Zainal
Measures of central tendency for ungrouped data
Weighted Mean
Sometimes we may assign weight (importance) to each
observation before we calculate the mean.
6
A mean computed in this manner is refereed to as a weighted
mean and it is given by
h h l f b d h f∑∑=
i
ii
WxW
x
where xi is the value of observation i and Wi is weight for
observation i.
QM-120, M. Zainal
QM-120, M. Zainal 4
Measures of central tendency for ungrouped data
Example: Consider the following sample of four purchases of one stock in the KSE. Find the average cost of the stock.
Purchase Price Quantity
7
Q y
1 .300 5,000
2 .325 15,000
3 .350 10,000
4 .295 20,000
QM-120, M. Zainal
Measures of central tendency for ungrouped data
1n th
⎟⎞
⎜⎛ +
MedianThe median is the value of the middle term in a data set hasbeen ranked in increasing order.
8
setdatarankedainterm2
1ntheofValue Median ⎟⎠⎞
⎜⎝⎛ +
=
The value .5(n + 1) indicates the position in the ordered dataset.
If n is even, we choose a value halfway between the twomiddle observationsExample: Find the median for the following two sets ofExample: Find the median for the following two sets ofmeasurements
2, 9, 11, 5, 6 2, 9, 11, 5, 6, 27
QM-120, M. Zainal
QM-120, M. Zainal 5
Measures of central tendency for ungrouped data
ModeThe mode is the value that occurs with the highest frequencyin a data set.
9
Adata with each value occurring only once has no mode.
A data set with only one value occurring with highestfrequency has only one mode, it is said to be unimodal.
A data set with two values that occurs with the same (highest)frequency has two modes, it is said to be bimodal.
If more than two values in a data set occur with the same(highest) frequency, it is said to be multimodal.
QM-120, M. Zainal
Measures of central tendency for ungrouped data
Example: You are given 8 measurements: 3, 5, 4, 6, 12, 5, 6, 7.Find
a) The mean. b) The median. c) The mode.
10
Relationships among the mean, median, and Mode
Symmetric histograms when
Mean = Median = Mode
Right skewed histograms when
Mean > Median > Mode
Left skewed histograms when
Mean < Median < ModeQM-120, M. Zainal
QM-120, M. Zainal 6
Measures of dispersion for ungrouped data
RangeData sets may have same center but look different because of
the way the numbers are spread out from center.
11
Example:Company 1: 47 38 35 40 36 45 39Company 2: 70 33 18 52 27
Measure of variability can help us to create a mental picture ofthe spread of the data.The range for ungrouped dataThe range for ungrouped data
Range = Largest value – Smallest valueThe range, like the mean, is highly influenced by outliers.The range is based on two values only.
QM-120, M. Zainal
Measures of dispersion for ungrouped data
Variance and standard deviationThe standard deviation is the most used measure of dispersion.It tells us how closely the values of a data set are clustered
d th
12
around the mean.
In general, larger values of standard deviation indicate thatvalues of that data set are spread over a relatively larger rangearound the mean and vice versa.
( ) ( )22
2
22
2−− ∑∑∑∑ n
xx
Nx
x
Standard deviation is always non‐negative
1 :sample and, :Population 22
−==∑∑
nns
NNσ
22 and ss == σσ
QM-120, M. Zainal
QM-120, M. Zainal 7
Measures of dispersion for ungrouped data
Example: Find the standard deviation of the data set in table.
2002 payroll (millions of dollars)
MLB team
13
(millions of dollars)
62Anaheim Angels
93Atlanta Braves
126New York Yankees
75St. Louis Cardinals
34Tampa Bay Devil Rays
QM-120, M. Zainal
Measures of dispersion for ungrouped data
In some situations we may be interested in a descriptivestatistics that indicates how large is the standard deviationcompared to the mean.
14
It is very useful when comparing two different samples withdifferent means and standard deviations.
It is given by:
%100⎟⎟⎞
⎜⎜⎛
×=σCV %100⎟⎟
⎠⎜⎜⎝
×μ
CV
QM-120, M. Zainal
QM-120, M. Zainal 8
Mean, variance, and standard deviation for grouped data
Mean for grouped dataOnce we group the data, we no longer know the values of
individual observations.
15
Thus, we find an approximation for the sum of these values.
: samplefor Mean
:populationfor Mean
nmf
x
Nmf
∑
∑
=
=μ
class. a offrequency theis andmidpoint theis Where fm
QM-120, M. Zainal
Mean, variance, and standard deviation for grouped data
Variance and standard deviation for grouped data
( )2mf∑
16
( )
( )
1 : Sample
:Population
22
2
2
2
nnmf
fms
NNmf
fm
−
−=
−=
∑∑
∑∑σ
class.aoffrequencytheisandmidpoint the is Where fm
22 and ss == σσ
QM-120, M. Zainal
QM-120, M. Zainal 9
Mean, variance, and standard deviation for grouped data
Example: The table below gives the frequency distribution ofthe daily commuting times (in minutes) from home to CBA forall 25 students in QMIS 120. Calculate the mean and thestandard deviation of the daily commuting times.
17
standard deviation of the daily commuting times.
fDaily commuting time (min)40 to less than 10910 to less than 20620 to less than 30430 to less than 40240 to less than 50
25Total
QM-120, M. Zainal
Use of standard deviation
Z‐ScoresWe often are interested in the relative location of a data point xiwith respect to the mean (how far or how close).
18
The z‐scores (standardized value) can be used to find therelative location of a data point xi compared to the center of thedata.
It is given by
xxZ ii
−=
The z‐score can be interpreted as the number of standarddeviations xi is from the mean.
sZi
QM-120, M. Zainal
QM-120, M. Zainal 10
Use of standard deviation
Example: Find the z‐scores for the following data where themean is 44 and the standard deviation is 8.
19
Number of students in a class
4654424632
QM-120, M. Zainal
Use of standard deviation
Chebyshevʹs TheoremChebyshevʹs theorem allows you to understand how the value
of a standard deviation can be applied to any data set.
20
Chebyshevʹs theorem: The fraction of any data set lying withink standard deviations of the mean is at least
where k = a number greater than 1.
2
11k
−
This theorem applies to all data sets, which include a sample ora population.
Chebyshev’s theorem is very conservative but it can be appliedto a data set with any shape.
QM-120, M. Zainal
QM-120, M. Zainal 11
Use of standard deviation21
k = 1
k = 2 k = 3QM-120, M. Zainal
Use of standard deviation
Example: The average systolic blood pressure for 4000 womenwho were screened for high blood pressure was found to be 187with standard deviation of 22. Using Chebyshev’s theorem, find
22
at least what percentage of women in this group have a systolicblood pressure between 143 and 231.
QM-120, M. Zainal
QM-120, M. Zainal 12
Use of standard deviation
Empirical RuleThe empirical rule gives more precise information about a data set than the Chebyshevʹs Theorem, however it only applies to a data set that is bell‐shaped.
23
data set t at is be s aped
Theorem:1‐ 68% of the observations lie within one standard deviation of the mean.2‐ 95% of the observations lie within two standard deviations of the mean.3‐ 99.7% of the observations lie within three standard deviations of the mean.
QM-120, M. Zainal
Use of standard deviation
Example: The age distribution of a sample of 5000 persons isbell‐shaped with a mean of 40 years and a standard deviation of12 years. Determine the approximate percentage of people whoare 16 to 64 years old.
24
are 16 to 64 years old.
QM-120, M. Zainal
QM-120, M. Zainal 13
Use of standard deviation
Detecting outliersSometimes a data set may have one or more observation that isunusually small or large value.
25
This extreme value is called an outlier and cam be detectedusing the z‐score and the empirical rule for data with bell‐shapedistribution.
An experienced statistician may face the following situationsand need to take an action
Outlier Action
A data value that was incorrectly recorded Correct it before any further analysis
A data value that was incorrectly included Remove it before any further analysis
A data value that belongs to the data set and correctly recorded
Keep it !
QM-120, M. Zainal
Measure of position
Quartiles and interquartile rangeA measure of position which determines the rank of a singlevalue in relation to other values in a sample or population.
26
Quartiles are three measures that divide a ranked data set intofour equal parts
QM-120, M. Zainal
QM-120, M. Zainal 14
Measure of position
Calculating the quartilesThe second quartile is the median of a data set.
The first quartile, Q1, is the value of the middle term among
27
observations that are less than the median.
The third quartile, Q3, is the value of the middle term among observations that are greater than the median.
Interquartile range (IQR) is the difference between the third quartile and the first quartile for a data setquartile and the first quartile for a data set.
IQR = Interquartile range = Q3 – Q1
QM-120, M. Zainal
Measure of position
Example: For the following data75.3 82.2 85.8 88.7 94.1 102.1 79.0 97.1 104.2 119.3 81.3 77.1
Find
28
The values of the three quartiles.
The interquartile range.
Where does the 104.2 fall in relation to these quartiles.
QM-120, M. Zainal
QM-120, M. Zainal 15
Measure of position
Percentile and percentile rankPercentiles are the summery measures that divide a rankeddata set into 100 equal parts.
29
The kth percentile is denoted by Pk , where k is an integer in therange 1 to 99.
The approximate value of the kth percentile is
set data ranked ain term100
theof Value th
kknP ⎟
⎠⎞
⎜⎝⎛=
QM-120, M. Zainal
Measure of position
Example: For the following data75.3 82.2 85.8 88.7 94.1 102.1 79.0 97.1 104.2 119.3 81.3 77.1
Find the value of the 40th percentile.
30
Find the value of the 40 percentile.
Solution:
QM-120, M. Zainal
QM-120, M. Zainal 16
Box-and-whisker plot
Box and whisker plot gives a graphic presentation of datausing five measures:
Q1, Q2, Q3, smallest, and largest values*.
31
Can help to visualize the center, the spread, and the skewnessof a data set.
Can help in detecting outliers.
Very good tool of comparing more than a distribution.
Detecting an outlier:Lower fence: Q1 – 1.5(IQR)
Upper fence: Q3 + 1.5(IQR)
If a data point is larger than the upper fence or smaller than the lower fence, it is considered to be an outlier.
QM-120, M. Zainal
Box-and-whisker plot
To construct a box plotDraw a horizontal line representing the scale of the measurements.Calculate the median, the upper and lower quartiles, and the IQR for thedata set and mark them on the line.
32
data set and mark them on the line.Form a box just above the line with the right and left ends at Q1 and Q3.Draw a vertical line through the box at the location of the median.Mark any outliers with an asterisk (*) on the graph.Extend horizontal lines called “Whiskers” from the ends of the box to thesmallest and largest observation that are not outliers.
Q1 Q2 Q3Lower fence
Upper fence
**
QM-120, M. Zainal
QM-120, M. Zainal 17
Box-and-whisker plot
Example: Construct a box plot for the following data.340 300 470 340 320 290 260 330
33
QM-120, M. Zainal
Measures of association between two variables
So far we have studied numerical methods to describe datawith one variable.
Often decision makers are interested in the relationship
34
between two variables.
To do so, we will use descriptive measure that is calledcovariance.
Covariance assigns a numerical value to the linear relationshipbetween two variables (see scatter diagram). It is given by
( )( )( )( )
( )( )N
yxn
yyxx
yixi
ii
μμσ
−−=
−−−
=
∑
∑
xy
xy
: covariance Population
1S :covariance Sample
QM-120, M. Zainal
QM-120, M. Zainal 18
Measures of association between two variables
A big disadvantage of the covariance is that it depends on theunits of measurement for x and y.
For the same data set, we will have two different covariance
35
values depending on the units (i.e. height in meters orcentimeters will make a big difference).
Pearson’s correlation coefficient is a good remedy to that problemas it can go only from ‐1 to 1.
It is given by
,SS
Sr :t coefficienn correlatio Sample
:t coefficienn correlatio Population
yx
xyxy
yx
xyxy
=
=σσ
σσ
QM-120, M. Zainal
Measures of association between two variables
Example: A golfer is interested in investigating the relationship,if any, between driving distance and 18‐hole score.
36
277.6259.5269.1267 0
69717070
Average DrivingDistance (meters)
Average18‐Hole Score
267.0255.6272.9
707169
QM-120, M. Zainal
QM-120, M. Zainal 19
Measures of association between two variables
277.6 69
x y
37
277.6259.5269.1267.0255.6272.9
697170707169
QM-120, M. Zainal
Measures of association between two variables
Example: Find the covariance and Pearson’s correlation coefficient forthe following data
Week Number of commercials (x) Sales in $ (y)
1 2 50
38
1 2 50
2 5 57
3 1 41
4 3 54
5 4 54
6 1 38
7 5 63
8 3 48
9 4 59
10 2 46
QM-120, M. Zainal
top related