it d ti tintroduction to descriptive...
TRANSCRIPT
I t d ti tIntroduction to DescriptiveDescriptive StatisticsStatistics17.871Spring 2012
Key measuresKey measuresDescribing data
Moment Non-mean based measure
Center Mean Mode, median
Spread Variance Range,Spread(standard deviation) Interquartile range
Skew Skewness --
Peaked Kurtosis --
Key distinctionKey distinctionPopulation vs. Sample Notation
Population vs SamplePopulation vs. SampleGreeks Romans
β bμ, σ, β s, b
MMean
n
xi
Xi 1
n
Variance, Standard Deviation ofVariance, Standard Deviation of a Population
n
ix 22
,)(i n1
,
n
ix 2)(
i
i
nx
1
)(
V i S D f S lVariance, S.D. of a Sample
sxni 2
2
,)( ni
1,
1
xni
2)( Degrees of freedom
sn
xi
i
1 1)(
Bi d tBinary data
1 timeof proportion1)( xXprobX
)1()1(2 xxsxxs xx
Example of this, using today’s NBC N /M i P ll i Mi hiNews/Marist Poll in Michigan
gen santorum = 1 if candidate==“Santorum”
replace santorum = 0 if candidate~=“Santorum”th d
Candidate Pct.Santorum 35Romney 37
the command summsantorum produces
Mean = 35
yPaul 13Gingrich 8[U t d [7] Mean .35
Var = .35(1-.35)=.2275 S.d. = . 4769696
[Unaccounted for]
[7]
N l di t ib ti lNormal distribution example
IQ SATFrequency
Height
“N k ” “No skew” “Zero skew” Symmetrical Symmetrical Mean = median = mode Value
22/)(1 22/)(
21)(
xexf
SkewnessSkewnessAsymmetrical distribution
Income Contribution to
did t
Frequency
candidates Populations of
countriescountries “Residual vote” rates
“Positive skew” “Right skew”
Value
Right skew
Distribution of the average $$ of dividends/tax return (in K’s)
1.5
1D
ensi
ty
Hyde County SD
.5D Hyde County, SD
Mitsubishi i-MiEV(which is supposed to be all electric)
0
0 5 10 15dividends_pc
6.0
8.1
.02
.04
.06
Den
sity
0
0 50 100 150var1
Fuel economy of cars for sale in the US
SkewnessSkewnessAsymmetrical distribution
GPA of MIT studentsFrequency
“Negative skew” “Left skew”
Value
Placement of Republican PartyPlacement of Republican Party on 100-point scale
.08
.06
ty.0
2.0
4D
ensi
t0
.
0 20 40 60 80 100place on ideological scale - republican partyplace on ideological scale republican party
SkSkewnessFrequency
Value
Placement of Republican PartyPlacement of Republican Party on 100-point scale
.1.0
6.0
8si
ty02
.04
Den
s0
.0
0 20 40 60 80 100place on ideological scale - democratic party
Mean = 26.8; median = 25; mode = 25
K t iKurtosisF
k > 3 leptokurticFrequency
k = 3 mesokurtic
k < 3 platykurtic
Value
.15
.05
.1D
ensi
ty0
0 20 40 60 80 100place on ideological scale - yourself
1
Mean s.d. Skew. Kurt.Self-placement
55.1 26.4 -0.14 2.21
.04
.06
.08
.1D
ensi
ty
Rep. pty. 26.8 21.2 0.87 3.59Dem. pty 74.7 21.8 -1.18 4.29
0.0
2
0 20 40 60 80 100place on ideological scale - democratic party
.06
.08
.1en
sity
0.0
2.0
4D
e
0 20 40 60 80 100place on ideological scale - republican party
Source: Cooperative Congressional Election Study, 2008
N l di t ib tiNormal distribution
Skewness = 0 Kurtosis = 3
22/)(1)( xexf )(
2)(
exf
M d b t th lMore words about the normal curve
The z-scoreor the“standardized score”standardized score
z x xzx x
Commands in STATA forCommands in STATA for univariate statistics summarize varname summarize varname detail summarize varname, detail histogram varname, bin() start() width()
density/fraction/frequency normaldensity/fraction/frequency normal graph box varnames tabulate
E l f Fl id tExample of Florida voters
Question: does the age of voters vary by race?
Combine Florida voter extract files, 2008
gen new_birth_date=date(birth_date,"MDY") gen birth_year=year(new_b)
2010 bi h gen age= 2010-birth_year
L k t di t ib ti f bi thLook at distribution of birth year.0
2.0
2501
.015
Den
sity
.005
.00
1850 1900 1950 2000birth_year
E l b ti dExplore age by voting mode
. table race if birth_year>1900,c(mean age)
----------------------race | mean(age)
----------+-----------1 | 45.612292 | 42 899162 | 42.899163 | 42.69524 | 45.097185 | 52.08628|6 | 44.773929 | 40.86704
----------------------3 = Black4 = Hispanic5 Whit5 = White
G h bi thGraph birth year
.03
.02
.01
Den
sity
0
0 20 40 60 80 100
. hist age if birth_year>1900
0 20 40 60 80 100age
(bin=71, start=9, width=1.3802817)
Divide into “bins” so that each barDivide into bins so that each bar represents 1 year
.015
.02
.01
Den
sity
.005
hi t if bi th 1900 idth(1)
0
0 20 40 60 80 100age
. hist age if birth_year>1900,width(1)
Add ti k t 10 i t lAdd ticks at 10-year intervalshistogram totalscore, width(1) xlabel(-.2 (.1) 1)
.015
.02
.01
.D
ensi
ty.0
050
20 30 40 50 60 70 80 90 100age
S perimpose the normal c r eSuperimpose the normal curve (with the same mean and s.d. as the empirical distribution)
i i i i l lhist age if birth_year>1900,wid(1) xlabel(20 (10) 100) normal
5.0
2.0
1.0
15D
ensi
ty.0
050
20 30 40 50 60 70 80 90 100age
. summ age if birth_year>1900,det
age-------------------------------------------------------------
Percentiles Smallest1% 18 91% 18 95% 21 1610% 24 16 Obs 1261211425% 34 16 Sum of Wgt. 12612114
50% 48 Mean 49.47549Largest Std. Dev. 19.01049
75% 63 10775% 63 10790% 77 107 Variance 361.398695% 83 107 Skewness .262949699% 91 107 Kurtosis 2.222442
Histograms by raceHistograms by racehist age if birth year>1900&race>=3&race<=5,wid(1) hist age if birth_year 1900&race 3&race 5,wid(1)
xlabel(20 (10) 100) normal by(race).0
1.0
2.0
3
3 4
3 = Black4 = Hispanic5 = White
0.0
2.0
3
20 30 40 50 60 70 80 90 100
5Den
sity
5 = White
0.0
1.
20 30 40 50 60 70 80 90 100
D it
age
Densitynormal age
Graphs by race
M i i ith hi tMain issues with histograms
Proper level of aggregation Non-regular data categories Non regular data categories
Draw the previous graph with a boxDraw the previous graph with a box plot
0
graph box age if birth_year>1900
}100 }1.5 x IQR
50ag
e
Upper quartileMedianLower quartile
} Inter-quartilerange
}0
q
Draw the box plots for the differentDraw the box plots for the different races
graph box age if birth_year>1900&race>=3&race<=5,by(race)
3 4
5010
0
3 = Black4 = Hispanic5 = White
010
0
5age
5 = White
050
Graphs by race
Draw the box plots for the different races using “over” optionggraph box age if birth_year>1900&race>=3&race<=5,over(race)
100
3 = Black4 = Hispanic5 = White
50ag
e
5 = White
0
3 4 5
A note about histograms withA note about histograms with unnatural categories
From the Current Population Survey (2000), Voter and Registration Survey
How long (have you/has name) lived at this address?
-9 No Response-3 Refused-2 Don't know-1 Not in universe1 Less than 1 month2 1-6 months3 7-11 months3 7 11 months4 1-2 years5 3-4 years6 5 years or longer
Solution, Step 1, pMap artificial category onto “natural” midpointnatural midpoint
-9 No Response missing-3 Refused missingg-2 Don't know missing-1 Not in universe missing1 Less than 1 month 1/24 = 0.0422 1 6 months 3 5/12 = 0 292 1-6 months 3.5/12 = 0.293 7-11 months 9/12 = 0.754 1-2 years 1.55 3-4 years 3.56 5 years or longer 10 (arbitrary)
recode live length (min/ 1 = )(1= 042)(2= 29)(3= 75)(4=1 5)(5=3 5)(6=10)recode live_length (min/-1 =.)(1=.042)(2=.29)(3=.75)(4=1.5)(5=3.5)(6=10)
G h f d d d tGraph of recoded datahistogram longevity, fraction
.557134
histogram longevity, fraction
.557134
Frac
tion
longevity0 1 2 3 4 5 6 7 8 9 10
0
D it l t f d tDensity plot of dataTotal area of last bar = 557Total area of last bar = .557Width of bar = 11 (arbitrary)Solve for: a = w h (or).557 = 11h => h = .051
longevity0 1 2 3 4 5 6 7 8 9 10
0
15
D it l t t l tDensity plot template
Category Fraction X-min X-max X-lengthHeight
(density)< 1 mo 0156 0 1/12 082 19*< 1 mo. .0156 0 1/12 .082 .19
1-6 mo. .0909 1/12 ½ .417 .22
7 11 mo 0430 ½ 1 500 097-11 mo. .0430 ½ 1 .500 .09
1-2 yr. .1529 1 2 1 .15
3 4 1404 2 4 2 073-4 yr. .1404 2 4 2 .07
5+ yr. .5571 4 15 11 .05
* = .0156/.082
Three words about pie charts:Three words about pie charts: don’t use them
S h t’ ith thSo, what’s wrong with them
For non-time series data, hard to get a comparison among groups; the eye is very p g g p ; y ybad in judging relative size of circle slices
For time series data hard to grasp cross- For time series, data, hard to grasp crosstime comparisons
Some words about graphicalSome words about graphical presentation Aspects of graphical integrity (following
Edward Tufte, Visual Display of , p yQuantitative Information)Main point should be readily apparentMain point should be readily apparentShow as much data as possibleWrite clear labels on the graphWrite clear labels on the graphShow data variation, not design variation
There is a difference between graphs inThere is a difference between graphs in research and publication
Publish OK
.01
.02
.03
3 4
0.0
2.0
3
20 30 40 50 60 70 80 90 100
5Den
sity
2.0
3
Black Hispanic
0.0
1
20 30 40 50 60 70 80 90 100
Densit
age 0.0
1.0
2
White Totalnsity
Densitynormal age
Graphs by race
.01
.02
.03
White Total
De
0
20 30 40 50 60 70 80 90 100 20 30 40 50 60 70 80 90 100Age
Graphs by race
Do not publish