Exploratory Data Analysis Hal Varian 20 March 2006

Page 1: Exploratory Data Analysis

Exploratory Data Analysis

Hal Varian, 20 March 2006

Page 2: Exploratory Data Analysis

What is EDA?

Goals
  Examine and summarize data
  Look for patterns and suggest hypotheses
  Provide guidance for more systematic analysis

Methods of analysis
  Primarily graphics and tables

Online reference
  http://www.itl.nist.gov/div898/handbook/eda/eda.htm
  http://www.math.yorku.ca/SCS/Courses/eda/

Page 3: Exploratory Data Analysis

Tools for EDA

We will use R = open source S
  Very widely used by statisticians
  Libraries for all sorts of things are available
  Download from
    cran.stat.ucla.edu
    http://www.r-project.org/

Recommend ESS (= Emacs Speaks Statistics) for interactive use

Windows interface is not bad

Page 4: Exploratory Data Analysis

Interactive R session

> library("foreign")
> dat <- read.spss("GSS93 subset.sav")
> attach(dat)
> summary(AGE)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   18.0    33.0    43.0    46.4    59.0    99.0
> hist(AGE)
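The GSS93 file is not distributed with these slides, so what follows is only a sketch of the same session on simulated stand-in data. The column name AGE comes from the slide; the simulated values, the seed, and the use of with() in place of attach() are assumptions made for illustration.

```r
# Simulated stand-in for the GSS93 subset (an assumption). The real file
# would be read with:
#   foreign::read.spss("GSS93 subset.sav", to.data.frame = TRUE)
set.seed(1)
n <- 1500
dat <- data.frame(AGE = sample(18:89, n, replace = TRUE))

# Mimic the survey's "missing" codes, which show up as ages above 90.
dat$AGE[sample(n, 5)] <- 99

# with() evaluates the expression inside the data frame without attach(),
# so AGE cannot be confused with a variable of the same name elsewhere.
age_summary <- with(dat, summary(AGE))
print(age_summary)

# hist() draws the histogram and also returns the bin counts it used.
h <- with(dat, hist(AGE))
print(h$counts)
```

A maximum of 99 in the summary is the first hint that some values are codes rather than ages, which is the kind of thing this first look is meant to catch.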

Page 5: Exploratory Data Analysis

Histogram of age

[Figure: "Histogram of AGE"; x-axis: AGE (20 to 100); y-axis: Frequency (0 to 200)]

Page 6: Exploratory Data Analysis

Recode missing data

AGE[AGE>90] <- NA
plot(density(AGE,na.rm=T))

#plot both together
hist(AGE,freq=F)
lines(density(AGE,na.rm=T))
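The recode-then-overlay steps can be tried without the survey file. This is a sketch on simulated ages; the sample, the seed, and the ten 99 codes are assumptions, and TRUE/FALSE are spelled out because T and F are ordinary variables that can be overwritten.

```r
# Simulated ages with a few 99 codes standing in for "missing" (assumption).
set.seed(2)
AGE <- c(sample(18:89, 1000, replace = TRUE), rep(99, 10))

# Recode the missing-value codes so they stop distorting the summaries.
AGE[AGE > 90] <- NA

# density() stops with an error on NA values unless na.rm = TRUE is given.
d <- density(AGE, na.rm = TRUE)

# freq = FALSE puts the histogram on the density scale (bar areas sum to 1),
# which is what lets the density curve share the same axes.
h <- hist(AGE, freq = FALSE)
lines(d)

# Total area of the histogram bars.
print(sum(h$density * diff(h$breaks)))
```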

Page 7: Exploratory Data Analysis

Density and density + hist

[Figure, left panel: density(x = AGE, na.rm = T); N = 1495, Bandwidth = 3.633; x-axis: 20 to 100; y-axis: Density (0.000 to 0.025)]
[Figure, right panel: "Histogram of AGE" with density curve; x-axis: AGE (20 to 80); y-axis: Density (0.000 to 0.025)]

Page 8: Exploratory Data Analysis

Boxplot

[Figure: boxplot with its parts labeled, top to bottom: Outlier; 1.5 x interquartile range; 3rd quartile; Median; 1st quartile; Smallest value. Axis: 20 to 100]
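The labeled parts of the boxplot correspond to numbers that base R will report directly. A minimal sketch with boxplot.stats() on a made-up ten-point sample (the data are an assumption chosen so that one point is clearly extreme):

```r
# A small made-up sample with one extreme value.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# boxplot.stats() returns the numbers a boxplot is drawn from.
b <- boxplot.stats(x, coef = 1.5)

# stats: lower whisker, lower hinge (roughly the 1st quartile), median,
# upper hinge (roughly the 3rd quartile), upper whisker.
print(b$stats)

# Whiskers stop at the most extreme data points lying within 1.5 box
# lengths of the box; anything beyond is reported separately as an outlier.
print(b$out)
```

Note that the whisker ends at an actual data value, not at the 1.5 x IQR fence itself.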

Page 9: Exploratory Data Analysis

Boxplot enhancements

Notches: confidence interval for median
Varwidth=T: width of box is proportional to sqrt(n)
Useful for comparisons

[Figure: boxplot with enhancements; axis: 20 to 100]
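For reference, the notch that R draws follows the formula documented in ?boxplot.stats: median +/- 1.58 x (box length) / sqrt(n), which that help page describes as giving roughly a 95% interval for comparing two medians. The sketch below checks the formula by hand on made-up data; the sample and the group sizes of 100 and 400 are assumptions.

```r
# Made-up sample for illustration.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)
b <- boxplot.stats(x)

# Notch = median +/- 1.58 * (box length) / sqrt(n).
box_length <- b$stats[4] - b$stats[2]
notch <- b$stats[3] + c(-1.58, 1.58) * box_length / sqrt(b$n)
print(rbind(by_hand = notch, from_R = b$conf))

# With varwidth = TRUE, box widths scale with sqrt(n): a group of 400
# observations gets a box twice as wide as a group of 100.
width_ratio <- sqrt(400) / sqrt(100)
print(width_ratio)
```

When the notches of two boxes do not overlap, that is informal evidence that the medians differ.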

Page 10: Exploratory Data Analysis

Comparing distributions

boxplot(AGE~RACE)
boxplot(AGE~RACE,notch=T,varwidth=T)

There doesn't seem to be a big difference in the age distributions

[Figure: boxplots of AGE by RACE (white, black, other); y-axis: 20 to 90]

Page 11: Exploratory Data Analysis

EDUC v RACE

boxplot(EDUC[EDUC<90]~RACE[EDUC<90],notch=T,varwidth=T)

[Figure: boxplots of EDUC by RACE (other, black, white); axis: 0 to 20]
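The filter EDUC<90 has to be repeated on both sides of the formula when the variables are attached. One alternative, sketched here on simulated data, is the subset argument of the formula interface to boxplot(). The data frame, group proportions, and 98/99 codes are assumptions standing in for the survey file.

```r
# Simulated stand-in data (assumption): years of schooling by group, with a
# few 98/99 codes standing in for "missing".
set.seed(3)
dat <- data.frame(
  RACE = factor(sample(c("white", "black", "other"), 600, replace = TRUE,
                       prob = c(0.80, 0.12, 0.08)),
                levels = c("white", "black", "other")),
  EDUC = pmin(20, pmax(0, round(rnorm(600, mean = 13, sd = 3))))
)
dat$EDUC[1:6] <- c(98, 99, 98, 99, 98, 99)

# subset = applies one filter to both variables, so the condition is
# written once instead of on each side of the formula.
b <- boxplot(EDUC ~ RACE, data = dat, subset = EDUC < 90,
             notch = TRUE, varwidth = TRUE)

# Group sizes behind the box widths.
print(setNames(b$n, b$names))
```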

Page 12: Exploratory Data Analysis

Violin plot

Combines density plot and boxplot
Good for weird-shaped distributions…

Page 13: Exploratory Data Analysis

Back to Back Histogram

library("Hmisc")
histbackback(EDUC[RACE=="black"],EDUC[RACE=="white"],probability=T)

[Figure: back-to-back histogram; left side: EDUC[RACE == "black"]; right side: EDUC[RACE == "white"]; horizontal axis: proportion (0.2 to 0.0 to 0.2); vertical axis: EDUC (2 to 18)]

Page 14: Exploratory Data Analysis

Two-way table

GT12 <- EDUC>12
temp <- table(GT12,RACE)

GT12    white black other
  FALSE   614   100    37
  TRUE    640    67    38

prop.table(temp,2)

GT12        white     black     other
  FALSE 0.4896332 0.5988024 0.4933333
  TRUE  0.5103668 0.4011976 0.5066667
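The proportions can be reproduced from the counts alone. This sketch rebuilds the table from the numbers on the slide (no survey file needed) and shows what the second argument of prop.table() does; the object names counts, col_props, and row_props are illustrative.

```r
# Counts copied from the two-way table on the slide (filled column by column).
counts <- matrix(c(614, 640,   # white: FALSE, TRUE
                   100,  67,   # black: FALSE, TRUE
                    37,  38),  # other: FALSE, TRUE
                 nrow = 2,
                 dimnames = list(GT12 = c("FALSE", "TRUE"),
                                 RACE = c("white", "black", "other")))
temp <- as.table(counts)

# margin = 2 divides each column by its column total: the share of each
# group with more than 12 years of education.
col_props <- prop.table(temp, margin = 2)
print(round(col_props, 4))

# margin = 1 would instead condition on the rows (education level).
row_props <- prop.table(temp, margin = 1)
print(round(row_props, 4))
```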

Page 15: Exploratory Data Analysis

Comparing distributions

qqplot = quantile-quantile plot
  Fraction of data less than k in x
  Fraction of data less than k in y

Shapes
  Straight line (y = x): same distribution
  Vertical intercepts differ: different mean
  Slopes differ: different variance

Reference distribution can be a theoretical distribution
  qqnorm: compare to standardized normal
  Skew to right: both tails below straight line
  Heavy tails: lower tail above, upper tail below line
  (These tail directions assume the data quantiles are on the horizontal axis. R's qqnorm puts sample quantiles on the vertical axis by default, which reverses them.)
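The intercept and slope readings can be checked numerically. The sketch below uses simulated normal samples (sample sizes, seed, and the helper name qq_line are assumptions) and fits a line through the Q-Q points. It then looks at a right-skewed sample with qqnorm(), which places sample quantiles on the vertical axis.

```r
set.seed(4)
x <- rnorm(2000)                      # reference sample: mean 0, sd 1
y_shift <- rnorm(2000, mean = 2)      # same spread, different mean
y_scale <- rnorm(2000, sd = 2)        # same mean, different spread

# plot.it = FALSE returns the matched quantiles instead of drawing them.
# qq_line is a hypothetical helper, not part of R.
qq_line <- function(x, y) {
  q <- qqplot(x, y, plot.it = FALSE)
  unname(coef(lm(q$y ~ q$x)))         # intercept and slope of the Q-Q points
}
shift_fit <- qq_line(x, y_shift)      # intercept moves, slope stays near 1
scale_fit <- qq_line(x, y_scale)      # slope changes, intercept stays near 0
print(rbind(shift = shift_fit, scale = scale_fit))

# Right-skewed sample against the normal, in qqnorm()'s orientation
# (theoretical quantiles on x, sample quantiles on y).
z <- rexp(2000)
q <- qqnorm(z, plot.it = FALSE)

# Reference line through the first and third quartiles, as qqline() draws.
slope <- unname(diff(quantile(z, c(0.25, 0.75))) / diff(qnorm(c(0.25, 0.75))))
intercept <- unname(quantile(z, 0.25)) - slope * qnorm(0.25)

# Vertical gap between each point and the line; positive means above it.
gap <- q$y - (intercept + slope * q$x)
tail_gaps <- gap[c(which.min(q$x), which.max(q$x))]
print(tail_gaps)
```

In this orientation both extreme points of the skewed sample sit above the reference line; swapping the axes flips the picture.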

Page 16: Exploratory Data Analysis

qqplot(x,y) examples

[Figure, four panels:
  qqplot of y v x: identical
  qqplot of y v x: Mean1=0, Mean2=2
  qqplot of y v x: σ1=1, σ2=2
  Normal Q-Q Plot (Theoretical Quantiles v Sample Quantiles): sample v N(0,1), with reference line]

Page 17: Exploratory Data Analysis

More qqnorm examples

[Figure, two panels: Skewed to right; Heavy tails]

Source: www.maths.murdoch.edu.au/units/statsnotes/samplestats/qqplot.html

Page 18: Exploratory Data Analysis

Pairs of variables

Is one variable related to another?

Scatterplot
  Basic: plot(x,y)
  Enhanced, from library("car"): scatterplot(x,y)

Scatterplot matrix
  Basic: pairs(data.frame(x,y,z))
  Enhanced: scatterplot.matrix(data.frame(x,y,z)) (named scatterplotMatrix in later versions of car)
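The basic versions need nothing beyond base R. A minimal sketch on simulated variables (x, y, z and their relationship are assumptions), with a lowess smooth added by hand as a rough stand-in for the kind of extras an enhanced scatterplot draws:

```r
# Simulated variables (assumption): y depends on x, z is unrelated noise.
set.seed(5)
x <- rnorm(200)
y <- 1 + 2 * x + rnorm(200)
z <- rnorm(200)

# Basic scatterplot, with a lowess smooth added to show the trend.
plot(x, y)
smooth <- lowess(x, y)
lines(smooth)

# Scatterplot matrix: every pair of variables in one display.
pairs(data.frame(x, y, z))

# A numeric companion to the pictures.
cors <- cor(data.frame(x, y, z))
print(round(cors, 2))
```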

Page 19: Exploratory Data Analysis

Basic and enhanced scatterplot

Page 20: Exploratory Data Analysis

Scatterplot matrix

Page 21: Exploratory Data Analysis

Labeling points in scatterplots

identify(x,y,labels="foo")
Color is also useful

[Figure: scatterplot of y v x with points 90, 98, 110, and 175 labeled; x-axis: -2 to 2; y-axis: -4 to 6]
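identify() waits for mouse clicks, so it only works in an interactive session. A non-interactive alternative is to pick the points by rule and label them with text(). The sketch below labels the four points farthest from a fitted line; the simulated data, the planted unusual points, and the "largest residuals" rule are all assumptions.

```r
# Simulated data (assumption) with four planted unusual points.
set.seed(6)
x <- rnorm(200)
y <- x + rnorm(200)
odd <- c(90, 98, 110, 175)
y[odd] <- x[odd] + c(8, -8, 8, -8)

# Pick the four points with the largest absolute residuals from a line fit.
fit <- lm(y ~ x)
far <- order(abs(resid(fit)), decreasing = TRUE)[1:4]

# Color the selected points and label them with their row numbers.
plot(x, y, col = ifelse(seq_along(x) %in% far, "red", "black"))
text(x[far], y[far], labels = far, pos = 3)
print(sort(far))
```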

Page 22: Exploratory Data Analysis

Cigarettes and taxes

Discussant on paper by Austan Goolsbee, “Playing with Fire”
Question: did Internet purchases of cigarettes affect state tobacco tax revenues?

Page 23: Exploratory Data Analysis

Cigarette Prices in 1990s

[Figure: "Price of cigarettes"; x-axis: 1990 to 2000; y-axis: 150 to 400]

Page 24: Exploratory Data Analysis

Internet usage

[Figure: "Internet usage"; x-axis: 1990 to 2000; y-axis: 0.0 to 0.5]

Page 25: Exploratory Data Analysis

Price elasticity of use/sales

Across all states and years
  Taxable sales elasticity: -0.802
  Use elasticity: -0.440

Sales are much more responsive to price than usage, suggesting that there is some cross-border trade (aka “buttlegging”)
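The slides do not say how these elasticities were estimated. One common approach, sketched here purely as an assumption, is a log-log regression, where the slope on log price is the elasticity. The data are simulated with a known elasticity of -0.8 so the estimate can be checked; they are not the cigarette data.

```r
# Simulated prices and quantities (assumption), generated with a true
# elasticity of -0.8.
set.seed(7)
price <- runif(500, min = 150, max = 400)
quantity <- exp(10 - 0.8 * log(price) + rnorm(500, sd = 0.1))

# In a log-log regression the slope is the elasticity: the percentage
# change in quantity associated with a 1% change in price.
fit <- lm(log(quantity) ~ log(price))
elasticity <- unname(coef(fit)["log(price)"])
print(round(elasticity, 3))
```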

Page 26: Exploratory Data Analysis

Use vs Sales in 2000

[Figure: scatterplot; x-axis: q.p[year == 2000] (40 to 160); y-axis: ciguse.p[year == 2000] (3 to 6); labeled points: DE, KY, NH, CA, UT]

Page 27: Exploratory Data Analysis

Reduced form

dp = log(p2001) - log(p1995)
dq = log(q2001) - log(q1995)
Regress dq/dp on Internet penetration in 2000
See next slide for result
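The mechanics of this reduced form can be sketched in a few lines. Everything below is simulated: the 50 "states", the price and sales levels, and the built-in link between penetration and elasticity are assumptions chosen only to show the steps, and the output says nothing about the real data.

```r
# Simulated state-level data (assumption).
set.seed(8)
n <- 50
i <- runif(n, 0.25, 0.45)                  # Internet penetration in 2000
p1995 <- runif(n, 150, 250)
p2001 <- p1995 * runif(n, 1.3, 1.8)
q1995 <- runif(n, 60, 140)

# Built-in assumption: sales respond more to price where penetration is higher.
true_elasticity <- -0.2 - 1.0 * i + rnorm(n, sd = 0.1)
q2001 <- q1995 * (p2001 / p1995)^true_elasticity

# Log changes in price and quantity between 1995 and 2001.
dp <- log(p2001) - log(p1995)
dq <- log(q2001) - log(q1995)

# dq/dp is each state's implied elasticity over the period; regress it on
# Internet penetration.
ratio <- dq / dp
fit <- lm(ratio ~ i)
print(coef(summary(fit)))
```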

Page 28: Exploratory Data Analysis

Elasticity v Internet penetration

[Figure: scatterplot; x-axis: i (0.25 to 0.45); y-axis: dq/dp (-0.8 to 0.2); labeled points: CA, DC, DE, MI, NH, NY, OK, WA]

Page 29: Exploratory Data Analysis

What is Internet providing?

It was always a good deal for some to buy cigarettes out-of-state (in high tax states)
Mail order has been around for a long time and is certainly cost-effective
Internet makes it easier to find merchants: just type into search engine
Internet is great at matching buyers and sellers

Page 30: Exploratory Data Analysis

Price of a match

Google doesn’t accept cigarette advertisements, but Overture does
Price for top listing: $1.20 per click
Avg price for click on Overture is 40 cents
Conversion rates might be 5%, so advertiser is paying $1.20 / 0.05 = $24 for introduction
But think of lifetime value…
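The $24 figure is the click price divided by the conversion rate. Spelled out, with the slide's numbers as inputs (the variable names are illustrative, and applying the same 5% rate to the average click price is an assumption):

```r
# Figures from the slide.
top_listing_cpc <- 1.20   # dollars per click for the top listing
average_cpc     <- 0.40   # dollars per click, Overture average
conversion_rate <- 0.05   # share of clicks that convert

# At a 5% conversion rate it takes 1 / 0.05 = 20 clicks on average to get
# one conversion, so cost per introduction = cost per click / conversion rate.
cost_per_introduction <- c(top = top_listing_cpc,
                           average = average_cpc) / conversion_rate
print(cost_per_introduction)
```

The advertiser comes out ahead whenever a customer's lifetime value exceeds that cost per introduction.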

Page 31: Exploratory Data Analysis

Value of a match

Google doesn’t accept cigarette advertisements, but Overture does
Price for top listing: $1.20 per click
Avg price for click on Overture is 40 cents
Conversion rates might be 5%, so advertiser is paying $24 for introduction
But think of lifetime value…

Page 32: Exploratory Data Analysis

Straightening out and scaling data

Find transform so that data looks linear, or normal, or fits on same scale
  Log10 (easier to interpret than log)
  Square root
  Reciprocal
  Box-Cox transform (x^r - 1)/r, which combines many of the above; the limit as r -> 0 is log
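The way Box-Cox "combines" the other transforms is easy to verify numerically. A minimal sketch, where the function name box_cox and the test values are assumptions:

```r
# Box-Cox transform as on the slide: (x^r - 1) / r, with log(x) used at
# r = 0, where the formula itself would divide by zero.
box_cox <- function(x, r) {
  stopifnot(all(x > 0))              # defined here for positive data only
  if (r == 0) log(x) else (x^r - 1) / r
}

x <- c(0.5, 1, 2, 10, 100)

# Each special case is a shifted and rescaled familiar transform.
identity_case   <- box_cox(x, 1)     # x - 1
sqrt_case       <- box_cox(x, 0.5)   # 2 * (sqrt(x) - 1)
reciprocal_case <- box_cox(x, -1)    # 1 - 1/x
log_case        <- box_cox(x, 0)     # log(x)

# The r = 0 case is the limit of the general formula as r shrinks:
# the gap between a tiny r and log(x) is close to zero.
limit_gap <- max(abs(box_cox(x, 1e-6) - log(x)))
print(limit_gap)
```

Because shifting and rescaling do not change the shape of a distribution, choosing r amounts to choosing among these transforms on a continuous scale.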

Page 33: Exploratory Data Analysis

City sizes: regular & log10

[Figure: "Histogram of log10(pop1980)"; x-axis: log10(pop1980) (3.5 to 7.0); y-axis: Density (0.0 to 0.8)]