it d ti tintroduction to descriptive...

43
It d ti t Introduction to Descriptive Descriptive Statistics Statistics 17.871 Spring 2012

Upload: others

Post on 14-Jul-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

I t d ti tIntroduction to DescriptiveDescriptive StatisticsStatistics17.871Spring 2012

Page 2: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

Key measuresKey measuresDescribing data

Moment Non-mean based measure

Center Mean Mode, median

Spread Variance Range,Spread(standard deviation) Interquartile range

Skew Skewness --

Peaked Kurtosis --

Page 3: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

Key distinctionKey distinctionPopulation vs. Sample Notation

Population vs SamplePopulation vs. SampleGreeks Romans

β bμ, σ, β s, b

Page 4: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

MMean

n

xi

Xi 1

n

Page 5: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

Variance, Standard Deviation ofVariance, Standard Deviation of a Population

n

ix 22

,)(i n1

,

n

ix 2)(

i

i

nx

1

)(

Page 6: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

V i S D f S lVariance, S.D. of a Sample

sxni 2

2

,)( ni

1,

1

xni

2)( Degrees of freedom

sn

xi

i

1 1)(

Page 7: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

Bi d tBinary data

1 timeof proportion1)( xXprobX

)1()1(2 xxsxxs xx

Page 8: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

Example of this, using today’s NBC N /M i P ll i Mi hiNews/Marist Poll in Michigan

gen santorum = 1 if candidate==“Santorum”

replace santorum = 0 if candidate~=“Santorum”th d

Candidate Pct.Santorum 35Romney 37

the command summsantorum produces

Mean = 35

yPaul 13Gingrich 8[U t d [7] Mean .35

Var = .35(1-.35)=.2275 S.d. = . 4769696

[Unaccounted for]

[7]

Page 9: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

N l di t ib ti lNormal distribution example

IQ SATFrequency

Height

“N k ” “No skew” “Zero skew” Symmetrical Symmetrical Mean = median = mode Value

22/)(1 22/)(

21)(

xexf

Page 10: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

SkewnessSkewnessAsymmetrical distribution

Income Contribution to

did t

Frequency

candidates Populations of

countriescountries “Residual vote” rates

“Positive skew” “Right skew”

Value

Right skew

Page 11: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

Distribution of the average $$ of dividends/tax return (in K’s)

1.5

1D

ensi

ty

Hyde County SD

.5D Hyde County, SD

Mitsubishi i-MiEV(which is supposed to be all electric)

0

0 5 10 15dividends_pc

6.0

8.1

.02

.04

.06

Den

sity

0

0 50 100 150var1

Fuel economy of cars for sale in the US

Page 12: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

SkewnessSkewnessAsymmetrical distribution

GPA of MIT studentsFrequency

“Negative skew” “Left skew”

Value

Page 13: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

Placement of Republican PartyPlacement of Republican Party on 100-point scale

.08

.06

ty.0

2.0

4D

ensi

t0

.

0 20 40 60 80 100place on ideological scale - republican partyplace on ideological scale republican party

Page 14: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

SkSkewnessFrequency

Value

Page 15: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

Placement of Republican PartyPlacement of Republican Party on 100-point scale

.1.0

6.0

8si

ty02

.04

Den

s0

.0

0 20 40 60 80 100place on ideological scale - democratic party

Mean = 26.8; median = 25; mode = 25

Page 16: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

K t iKurtosisF

k > 3 leptokurticFrequency

k = 3 mesokurtic

k < 3 platykurtic

Value

Page 17: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

.15

.05

.1D

ensi

ty0

0 20 40 60 80 100place on ideological scale - yourself

1

Mean s.d. Skew. Kurt.Self-placement

55.1 26.4 -0.14 2.21

.04

.06

.08

.1D

ensi

ty

Rep. pty. 26.8 21.2 0.87 3.59Dem. pty 74.7 21.8 -1.18 4.29

0.0

2

0 20 40 60 80 100place on ideological scale - democratic party

.06

.08

.1en

sity

0.0

2.0

4D

e

0 20 40 60 80 100place on ideological scale - republican party

Source: Cooperative Congressional Election Study, 2008

Page 18: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

N l di t ib tiNormal distribution

Skewness = 0 Kurtosis = 3

22/)(1)( xexf )(

2)(

exf

Page 19: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

M d b t th lMore words about the normal curve

Page 20: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

The z-scoreor the“standardized score”standardized score

z x xzx x

Page 21: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

Commands in STATA forCommands in STATA for univariate statistics summarize varname summarize varname detail summarize varname, detail histogram varname, bin() start() width()

density/fraction/frequency normaldensity/fraction/frequency normal graph box varnames tabulate

Page 22: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

E l f Fl id tExample of Florida voters

Question: does the age of voters vary by race?

Combine Florida voter extract files, 2008

gen new_birth_date=date(birth_date,"MDY") gen birth_year=year(new_b)

2010 bi h gen age= 2010-birth_year

Page 23: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

L k t di t ib ti f bi thLook at distribution of birth year.0

2.0

2501

.015

Den

sity

.005

.00

1850 1900 1950 2000birth_year

Page 24: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

E l b ti dExplore age by voting mode

. table race if birth_year>1900,c(mean age)

----------------------race | mean(age)

----------+-----------1 | 45.612292 | 42 899162 | 42.899163 | 42.69524 | 45.097185 | 52.08628|6 | 44.773929 | 40.86704

----------------------3 = Black4 = Hispanic5 Whit5 = White

Page 25: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

G h bi thGraph birth year

.03

.02

.01

Den

sity

0

0 20 40 60 80 100

. hist age if birth_year>1900

0 20 40 60 80 100age

(bin=71, start=9, width=1.3802817)

Page 26: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

Divide into “bins” so that each barDivide into bins so that each bar represents 1 year

.015

.02

.01

Den

sity

.005

hi t if bi th 1900 idth(1)

0

0 20 40 60 80 100age

. hist age if birth_year>1900,width(1)

Page 27: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

Add ti k t 10 i t lAdd ticks at 10-year intervalshistogram totalscore, width(1) xlabel(-.2 (.1) 1)

.015

.02

.01

.D

ensi

ty.0

050

20 30 40 50 60 70 80 90 100age

Page 28: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

S perimpose the normal c r eSuperimpose the normal curve (with the same mean and s.d. as the empirical distribution)

i i i i l lhist age if birth_year>1900,wid(1) xlabel(20 (10) 100) normal

5.0

2.0

1.0

15D

ensi

ty.0

050

20 30 40 50 60 70 80 90 100age

Page 29: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

. summ age if birth_year>1900,det

age-------------------------------------------------------------

Percentiles Smallest1% 18 91% 18 95% 21 1610% 24 16 Obs 1261211425% 34 16 Sum of Wgt. 12612114

50% 48 Mean 49.47549Largest Std. Dev. 19.01049

75% 63 10775% 63 10790% 77 107 Variance 361.398695% 83 107 Skewness .262949699% 91 107 Kurtosis 2.222442

Page 30: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

Histograms by raceHistograms by racehist age if birth year>1900&race>=3&race<=5,wid(1) hist age if birth_year 1900&race 3&race 5,wid(1)

xlabel(20 (10) 100) normal by(race).0

1.0

2.0

3

3 4

3 = Black4 = Hispanic5 = White

0.0

2.0

3

20 30 40 50 60 70 80 90 100

5Den

sity

5 = White

0.0

1.

20 30 40 50 60 70 80 90 100

D it

age

Densitynormal age

Graphs by race

Page 31: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

M i i ith hi tMain issues with histograms

Proper level of aggregation Non-regular data categories Non regular data categories

Page 32: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

Draw the previous graph with a boxDraw the previous graph with a box plot

0

graph box age if birth_year>1900

}100 }1.5 x IQR

50ag

e

Upper quartileMedianLower quartile

} Inter-quartilerange

}0

q

Page 33: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

Draw the box plots for the differentDraw the box plots for the different races

graph box age if birth_year>1900&race>=3&race<=5,by(race)

3 4

5010

0

3 = Black4 = Hispanic5 = White

010

0

5age

5 = White

050

Graphs by race

Page 34: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

Draw the box plots for the different races using “over” optionggraph box age if birth_year>1900&race>=3&race<=5,over(race)

100

3 = Black4 = Hispanic5 = White

50ag

e

5 = White

0

3 4 5

Page 35: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

A note about histograms withA note about histograms with unnatural categories

From the Current Population Survey (2000), Voter and Registration Survey

How long (have you/has name) lived at this address?

-9 No Response-3 Refused-2 Don't know-1 Not in universe1 Less than 1 month2 1-6 months3 7-11 months3 7 11 months4 1-2 years5 3-4 years6 5 years or longer

Page 36: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

Solution, Step 1, pMap artificial category onto “natural” midpointnatural midpoint

-9 No Response missing-3 Refused missingg-2 Don't know missing-1 Not in universe missing1 Less than 1 month 1/24 = 0.0422 1 6 months 3 5/12 = 0 292 1-6 months 3.5/12 = 0.293 7-11 months 9/12 = 0.754 1-2 years 1.55 3-4 years 3.56 5 years or longer 10 (arbitrary)

recode live length (min/ 1 = )(1= 042)(2= 29)(3= 75)(4=1 5)(5=3 5)(6=10)recode live_length (min/-1 =.)(1=.042)(2=.29)(3=.75)(4=1.5)(5=3.5)(6=10)

Page 37: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

G h f d d d tGraph of recoded datahistogram longevity, fraction

.557134

histogram longevity, fraction

.557134

Frac

tion

longevity0 1 2 3 4 5 6 7 8 9 10

0

Page 38: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

D it l t f d tDensity plot of dataTotal area of last bar = 557Total area of last bar = .557Width of bar = 11 (arbitrary)Solve for: a = w h (or).557 = 11h => h = .051

longevity0 1 2 3 4 5 6 7 8 9 10

0

15

Page 39: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

D it l t t l tDensity plot template

Category Fraction X-min X-max X-lengthHeight

(density)< 1 mo 0156 0 1/12 082 19*< 1 mo. .0156 0 1/12 .082 .19

1-6 mo. .0909 1/12 ½ .417 .22

7 11 mo 0430 ½ 1 500 097-11 mo. .0430 ½ 1 .500 .09

1-2 yr. .1529 1 2 1 .15

3 4 1404 2 4 2 073-4 yr. .1404 2 4 2 .07

5+ yr. .5571 4 15 11 .05

* = .0156/.082

Page 40: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

Three words about pie charts:Three words about pie charts: don’t use them

Page 41: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

S h t’ ith thSo, what’s wrong with them

For non-time series data, hard to get a comparison among groups; the eye is very p g g p ; y ybad in judging relative size of circle slices

For time series data hard to grasp cross- For time series, data, hard to grasp crosstime comparisons

Page 42: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

Some words about graphicalSome words about graphical presentation Aspects of graphical integrity (following

Edward Tufte, Visual Display of , p yQuantitative Information)Main point should be readily apparentMain point should be readily apparentShow as much data as possibleWrite clear labels on the graphWrite clear labels on the graphShow data variation, not design variation

Page 43: It d ti tIntroduction to Descriptive Statisticsweb.mit.edu/~17.871/www/2012/02descriptive_stats_2012.pdf · .15.05.1 Density 0 0 20 40 60 80 100 place on ideological scale - yourself

There is a difference between graphs inThere is a difference between graphs in research and publication

Publish OK

.01

.02

.03

3 4

0.0

2.0

3

20 30 40 50 60 70 80 90 100

5Den

sity

2.0

3

Black Hispanic

0.0

1

20 30 40 50 60 70 80 90 100

Densit

age 0.0

1.0

2

White Totalnsity

Densitynormal age

Graphs by race

.01

.02

.03

White Total

De

0

20 30 40 50 60 70 80 90 100 20 30 40 50 60 70 80 90 100Age

Graphs by race

Do not publish