Seminar 15 | Tuesday, October 18, 2007 | Aliaksei Smalianchuk
Means and Variances
What happens to means and variances when data is manipulated?
Let’s check by manipulating data from the survey.
Data
Height in inches (HT)
Shoe size (Shoe)
Age (Age)
Additional Columns:
Height with a 1 inch heel (HeightPlus1)
Height in centimeters (2.5TimesHeight)
Sum of height and shoe size (HeightPlusShoe)
Sum of height and age (HeightPlusAge)
Statistics
Variable N Mean StDev
HT 444 66.928 3.938
Shoe 445 9.1056 1.9484
Age 444 20.371 2.912
HeightPlus1 444 67.928 3.938
2.5TimesHeight 444 167.32 9.84
HeightPlusShoe 444 76.035 5.693
HeightPlusAge 444 87.299 4.913
Observation 1
The mean of heel heights is one inch larger than the mean of heights
Why?
If the same constant is added to every value, the mean increases by that constant.
Observation 2
The standard deviation of heel heights equals the standard deviation of heights
Why?
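Standard deviation measures spread around the mean. Adding a constant shifts every value and the mean by the same amount, so the distances from the mean, and the shape of the distribution, do not change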
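Observations 1 and 2 can be checked on a small made-up data set. The sketch below uses hypothetical heights (not the survey data) and Python's standard `statistics` module: adding a 1 inch heel moves the mean by 1 and leaves the standard deviation alone.

```python
import statistics

# Hypothetical heights in inches (illustrative only, not the survey data).
heights = [62.0, 64.5, 66.0, 67.5, 70.0, 72.0]
heel = [h + 1 for h in heights]  # plays the role of HeightPlus1

# How far the mean moved, and how much the spread changed.
mean_shift = statistics.mean(heel) - statistics.mean(heights)
sd_before = statistics.stdev(heights)
sd_after = statistics.stdev(heel)

print("shift in mean:", round(mean_shift, 6))
print("change in standard deviation:", round(sd_after - sd_before, 6))
```

The same pattern shows in the table: HeightPlus1 has mean 67.928 (66.928 + 1) and the same StDev, 3.938, as HT.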
Observation 3
The standard deviation of heights in centimeters is 2.5 times the standard deviation of heights in inches
Why?
Multiplying every data value by a constant stretches the spread of the histogram by the same factor, so the properties that depend on the spread (like standard deviation) are multiplied by that factor too.
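A quick check of the scaling rule, again on hypothetical heights rather than the survey data. The sketch multiplies every value by 2.5 and compares the mean, standard deviation, and variance before and after. The variance ratio is included as an extra (it is not stated on the slide): since variance is the square of the standard deviation, it should scale by 2.5 squared.

```python
import statistics

# Hypothetical heights in inches (illustrative only).
heights_in = [62.0, 64.5, 66.0, 67.5, 70.0, 72.0]
factor = 2.5  # the slide's rounded inches-to-centimeters factor
heights_cm = [factor * h for h in heights_in]

# Ratio of each statistic after scaling to the same statistic before.
mean_ratio = statistics.mean(heights_cm) / statistics.mean(heights_in)
sd_ratio = statistics.stdev(heights_cm) / statistics.stdev(heights_in)
var_ratio = statistics.variance(heights_cm) / statistics.variance(heights_in)

print("mean ratio:", mean_ratio)
print("standard deviation ratio:", sd_ratio)
print("variance ratio:", var_ratio)
```

In the table, 2.5 × 66.928 = 167.32 and 2.5 × 3.938 = 9.845, which agrees with the reported 9.84 up to rounding.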
Observation 4
Mean of HeightPlusShoe = Mean of Height + Mean of Shoe
Observation 5
Mean of HeightPlusAge = Mean of Height + Mean of Age
Why?
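Since the sum of all the (height + age) values equals the sum of the heights plus the sum of the ages, dividing by N gives Mean(X + Y) = Mean(X) + Mean(Y)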
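Observations 4 and 5 say that the mean of a row-by-row sum equals the sum of the two means. A minimal sketch on hypothetical heights and ages (not the survey data) shows the two numbers agree:

```python
import statistics

# Hypothetical paired data (illustrative only): one height and one age per person.
heights = [62.0, 64.5, 66.0, 67.5, 70.0, 72.0]
ages = [19, 22, 20, 18, 25, 21]

# Build the new column row by row, like HeightPlusAge.
sums = [h + a for h, a in zip(heights, ages)]

mean_of_sum = statistics.mean(sums)
sum_of_means = statistics.mean(heights) + statistics.mean(ages)

print("mean of the sums:", mean_of_sum)
print("sum of the means:", sum_of_means)
```

In the table, 66.928 + 20.371 = 87.299 exactly. For shoe size, 66.928 + 9.1056 = 76.0336 against a reported 76.035; the small gap may come from Shoe having 445 values while the sum column has 444, though the slides do not say.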
Variances
Variance = σ² (the square of the standard deviation)
Variances apply to a probability distribution
Variance is a way to capture the degree of spread of a distribution
Variances
Variable Variance
HT 15.50784
Shoe 3.796263
Age 8.479744
HeightPlusShoe 32.41025
HeightPlusAge 24.13757
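The variances in this table are the squares of the standard deviations in the Statistics table. The short sketch below redoes that arithmetic with the StDev values copied from the table; the dictionary names are just for illustration.

```python
# Standard deviations copied from the Statistics table.
stdevs = {
    "HT": 3.938,
    "Shoe": 1.9484,
    "Age": 2.912,
    "HeightPlusShoe": 5.693,
    "HeightPlusAge": 4.913,
}

# Variance is the standard deviation squared.
variances = {name: sd ** 2 for name, sd in stdevs.items()}

for name, var in variances.items():
    print(f"{name:15s} {var:.6f}")
```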
Dependence
Are shoe sizes and heights dependent?
Are age and height dependent?
Let’s check using scatter plots
[Scatter plot: Height vs. Shoe Size]
[Scatter plot: Height vs. Age]
Back to variances
Variance of HeightPlusShoe is much greater than Var(Height) + Var(Shoe)
Variance of HeightPlusAge is very close to Var(Height) + Var(Age)
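The two comparisons can be made concrete with the numbers from the Variances table. The sketch below computes how far each variance of a sum is from the sum of the two variances; the helper name `excess` is only for illustration.

```python
# Variances copied from the Variances table.
var = {
    "HT": 15.50784,
    "Shoe": 3.796263,
    "Age": 8.479744,
    "HeightPlusShoe": 32.41025,
    "HeightPlusAge": 24.13757,
}

def excess(total, a, b):
    """Variance of the sum column minus the sum of the two variances."""
    return var[total] - (var[a] + var[b])

shoe_excess = excess("HeightPlusShoe", "HT", "Shoe")
age_excess = excess("HeightPlusAge", "HT", "Age")

print(f"HeightPlusShoe exceeds Var(HT) + Var(Shoe) by {shoe_excess:.3f}")
print(f"HeightPlusAge exceeds Var(HT) + Var(Age) by {age_excess:.3f}")
```

Var(HT) + Var(Shoe) is about 19.3 against 32.4 for HeightPlusShoe, while Var(HT) + Var(Age) is about 24.0 against 24.1 for HeightPlusAge.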
Why?
Can you see a difference between the relationships (Height vs. Shoe Size) and (Height vs. Age)?
Dependence
Adding two dependent data sets, where small values go with small values and large values go with large values, produces more extreme sums (small + small and large + large)
This makes the variance much larger.
Dependence
In the case of independent sets, values do not correspond by relative size (large values are as likely to be added to small values as to large ones)
The extremes do not line up, so the variance of the sum stays close to the sum of the variances
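The contrast between dependent and independent sums can be reproduced by simulation. The sketch below uses made-up data, not the survey: `x` stands in for height, `y_dep` is built to rise with `x` (as shoe size does), and `y_ind` is generated without looking at `x` (as age roughly behaves). All distribution parameters are assumptions chosen for illustration. The standard identity behind the result, which the slides do not state, is Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y), so the gap is twice the covariance.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable
n = 5000

# Hypothetical data: x plays the role of height.
x = [random.gauss(67, 4) for _ in range(n)]
# y_dep rises with x (like shoe size); y_ind ignores x (like age).
y_dep = [0.4 * (xi - 67) + random.gauss(9, 1) for xi in x]
y_ind = [random.gauss(20, 3) for _ in range(n)]

def compare(a, b):
    """Return (variance of the row-by-row sums, sum of the two variances)."""
    sums = [ai + bi for ai, bi in zip(a, b)]
    return statistics.variance(sums), statistics.variance(a) + statistics.variance(b)

dep_actual, dep_added = compare(x, y_dep)
ind_actual, ind_added = compare(x, y_ind)

print(f"dependent:   Var(sum) = {dep_actual:.2f}, sum of variances = {dep_added:.2f}")
print(f"independent: Var(sum) = {ind_actual:.2f}, sum of variances = {ind_added:.2f}")
```

In the dependent case the variance of the sum should come out well above the sum of the variances; in the independent case the two should nearly match.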
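Variance of sample mean
Mean = (X1 + X2 + … + Xn)/n
For independent X1, X2, …, Xn:
Variance[(X1 + X2 + … + Xn)/n] = (Variance[X1] + Variance[X2] + … + Variance[Xn])/n²
(dividing every value by n divides the standard deviation by n, and so divides the variance by n²)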
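For independent values that share a common variance σ², the variance of the sample mean works out to σ²/n: the sum of n variances is nσ², and dividing the sum by n divides its variance by n². The sketch below checks this by simulation; the population mean, `sigma`, `n`, and the number of trials are all assumptions picked for illustration.

```python
import random
import statistics

random.seed(1)   # fixed seed so the run is repeatable
sigma = 4.0      # hypothetical population standard deviation
n = 25           # sample size
trials = 4000    # number of independent samples drawn

# Draw many independent samples and record each sample's mean.
sample_means = []
for _ in range(trials):
    sample = [random.gauss(67, sigma) for _ in range(n)]
    sample_means.append(statistics.fmean(sample))

observed = statistics.variance(sample_means)
predicted = sigma ** 2 / n  # (n * sigma^2) / n^2

print(f"observed variance of the sample means: {observed:.3f}")
print(f"predicted sigma^2 / n:                 {predicted:.3f}")
```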
Dependence?
Would this work for dependent values of X1, X2 … Xn ?
Would the variance produced by this formula be larger or smaller than actual?
Sampling without replacement
Would the variance formula hold true?
Why?
Dependence
Adding the variances of dependent values of the kind seen above (large values paired with large values) will produce a smaller result than the actual variance, because adding such dependent data sets produces extremes, which increases the spread
Sampling without replacement on smaller populations (n < 10) will produce dependence
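The effect of sampling without replacement can be seen exactly on a tiny made-up population by listing every possible sample. The sketch below is an illustration under stated assumptions: the five population values and the sample size of 2 are hypothetical. It compares the variance of the sample mean with and without replacement against the independent-draws formula σ²/n. Note that the dependence here runs the opposite way from the height and shoe size case: once a large value is drawn it cannot be drawn again, so the actual variance comes out below the formula rather than above it.

```python
import itertools
import statistics

population = [60, 63, 66, 69, 72]  # hypothetical small population
n = 2                               # sample size

sigma2 = statistics.pvariance(population)
formula = sigma2 / n                # valid when the draws are independent

def var_of_sample_mean(samples):
    """Population variance of the sample mean over all equally likely samples."""
    means = [sum(s) / n for s in samples]
    return statistics.pvariance(means)

# Every ordered sample of size n, with and without repeated draws.
with_repl = var_of_sample_mean(itertools.product(population, repeat=n))
without_repl = var_of_sample_mean(itertools.permutations(population, n))

print("formula sigma^2 / n:  ", formula)
print("with replacement:     ", with_repl)
print("without replacement:  ", without_repl)
```

With replacement the draws are independent and the formula holds; without replacement it does not, and the gap shrinks as the population grows relative to the sample.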
The End
Extra Credit (Dr. Pfenning)
Use Minitab Calculator to create column “Birthyear”
Plot Earned vs. Birthyear, note relationship
Create column “EarnedPlusBirthyear”
Find sds of Earned, Birthyear, EarnedPlusBirthyear, square to variances
Compare variances
Explain results