exploratory data analysis

Post on 19-Jan-2016


Exploratory Data Analysis

Hal Varian, 20 March 2006

What is EDA?

Goals
  Examine and summarize data
  Look for patterns and suggest hypotheses
  Provide guidance for more systematic analysis

Methods of analysis
  Primarily graphics and tables

Online reference
  http://www.itl.nist.gov/div898/handbook/eda/eda.htm
  http://www.math.yorku.ca/SCS/Courses/eda/

Tools for EDA

We will use R = open source S
  Very widely used by statisticians
  Libraries for all sorts of things are available
  Download from cran.stat.ucla.edu or http://www.r-project.org/

Recommend ESS (= Emacs Speaks Statistics) for interactive use

Windows interface is not bad

Interactive R session

> library("foreign")

> dat <- read.spss("GSS93 subset.sav")

> attach(dat)

> summary(AGE)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   18.0    33.0    43.0    46.4    59.0    99.0

> hist(AGE)

[Figure: Histogram of AGE; horizontal axis AGE (20 to 100), vertical axis Frequency (0 to 200)]

Recode missing data

AGE[AGE>90] <- NA
plot(density(AGE,na.rm=T))

# plot both together
hist(AGE,freq=F)
lines(density(AGE,na.rm=T))

Density and density + hist

[Figure, left panel: density(x = AGE, na.rm = T), N = 1495, Bandwidth = 3.633; horizontal axis 20 to 100, vertical axis Density (0.000 to 0.025)]

[Figure, right panel: Histogram of AGE with density curve; horizontal axis AGE (20 to 80), vertical axis Density (0.000 to 0.025)]
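The recode-then-overlay steps above can be sketched on simulated data, since the GSS93 file is not included here. The vector `age`, the seed, and the use of 99 as a missing-value code are assumptions made for illustration only.

```r
# Simulated stand-in for AGE (hypothetical data, not the GSS93 values)
set.seed(1)
age <- c(round(runif(200, min = 18, max = 89)), rep(99, 5))  # 99 = assumed missing code

# Recode the sentinel to NA so it no longer distorts summaries
age[age > 90] <- NA
n_missing <- sum(is.na(age))

# Kernel density estimate on the non-missing values
d <- density(age, na.rm = TRUE)

# Histogram on the density scale, with the density curve drawn on top
hist(age, freq = FALSE, main = "Histogram of age", xlab = "age")
lines(d)
```

Setting `freq = FALSE` matters here: the histogram and the density curve only share a vertical scale when the bars are drawn as densities rather than counts.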

Boxplot

[Figure: annotated boxplot, axis from 20 to 100, with labels: outlier, 1.5 interquartile range, 3rd quartile, median, 1st quartile, smallest value]
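As a rough sketch of what those boxplot labels mean numerically, the quartiles, the 1.5 interquartile range fences, and the points flagged as outliers can be computed by hand. The sample `x` is simulated and the value 120 is planted as a hypothetical outlier. Note that `boxplot()` uses hinges, which can differ slightly from `quantile()` for some sample sizes.

```r
set.seed(2)
x <- c(rnorm(100, mean = 45, sd = 12), 120)  # hypothetical sample with one extreme value

q <- quantile(x, c(0.25, 0.5, 0.75))         # 1st quartile, median, 3rd quartile
iqr <- q[3] - q[1]                           # interquartile range
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr

# Points beyond the fences are the ones drawn individually as outliers
outliers <- x[x < lower_fence | x > upper_fence]

bp <- boxplot(x, main = "Boxplot")           # bp$out holds the points drawn as outliers
```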

Boxplot enhancements
  Notches: confidence interval for the median
  Varwidth=T: width of box is proportional to sqrt(n)
  Useful for comparisons

[Figure: boxplot with axis from 20 to 100]

Comparing distributions

boxplot(AGE~RACE)
boxplot(AGE~RACE,notch=T,varwidth=T)

There doesn’t seem to be a big difference in the age distribution across groups.

[Figure: boxplots of AGE (20 to 90) for white, black, other]
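A minimal sketch of a notched, variable-width comparison on simulated groups follows. The factor `g`, its group sizes, and the response `y` are hypothetical. In base R the notches extend to plus or minus 1.58 IQR/sqrt(n) around the median, so smaller groups get wider notches.

```r
set.seed(3)
# Hypothetical groups of unequal size with similar centers
g <- factor(rep(c("a", "b", "c"), times = c(300, 60, 30)))
y <- rnorm(length(g), mean = 45, sd = 15)

bp <- boxplot(y ~ g, notch = TRUE, varwidth = TRUE)

# bp$conf holds the notch limits (row 1 lower, row 2 upper), one column per group
notch_lo <- bp$conf[1, ]
notch_hi <- bp$conf[2, ]
notch_width <- notch_hi - notch_lo

# Two groups whose notches overlap show no strong evidence of different medians
overlap_ab <- notch_lo[1] <= notch_hi[2] && notch_lo[2] <= notch_hi[1]
```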

EDUC v RACE

boxplot(EDUC[EDUC<90]~RACE[EDUC<90],notch=T,varwidth=T)

[Figure: notched boxplots of EDUC (0 to 20) for other, black, white]

Violin plot
  Combines density plot and boxplot
  Good for oddly shaped distributions…
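The slide does not say which function drew its violin plot. As a rough base-R sketch of the idea, a kernel density can be mirrored around a vertical line and a narrow boxplot drawn on top. The bimodal sample and the 0.4 half-width are assumptions chosen for illustration.

```r
set.seed(4)
# Hypothetical bimodal sample, the kind of shape a boxplot alone would hide
x <- c(rnorm(150, mean = 10, sd = 1.5), rnorm(100, mean = 16, sd = 1))

d <- density(x)
half_width <- d$y / max(d$y) * 0.4   # scale the density to a half-width of at most 0.4

plot(NA, xlim = c(0.5, 1.5), ylim = range(d$x), xaxt = "n",
     xlab = "", ylab = "x", main = "Violin plot (base R sketch)")
# Mirror the density around the vertical line at 1
polygon(c(1 - half_width, rev(1 + half_width)), c(d$x, rev(d$x)), col = "grey85")
# Narrow boxplot drawn on top of the violin
boxplot(x, add = TRUE, at = 1, boxwex = 0.1, axes = FALSE)
```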

Back to Back Histogram

library("Hmisc")
histbackback(EDUC[RACE=="black"],EDUC[RACE=="white"],probability=T)

[Figure: back-to-back histogram with EDUC[RACE == "black"] on the left and EDUC[RACE == "white"] on the right; horizontal axis is proportion (0.2 to 0.0 to 0.2), vertical axis is EDUC (2 to 18)]

Two-way table

GT12 <- EDUC>12
temp <- table(GT12,RACE)

GT12    white black other
  FALSE   614   100    37
  TRUE    640    67    38

prop.table(temp,2)

GT12        white     black     other
  FALSE 0.4896332 0.5988024 0.4933333
  TRUE  0.5103668 0.4011976 0.5066667
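The column proportions can be checked directly from the counts shown above, without the original data file. This sketch rebuilds the table as a matrix (an assumption about layout only; the counts are the ones on the slide) and divides each column by its total, which is what `prop.table(temp, 2)` does.

```r
# Counts copied from the two-way table above (filled column by column)
temp <- matrix(c(614, 640, 100, 67, 37, 38), nrow = 2,
               dimnames = list(GT12 = c("FALSE", "TRUE"),
                               RACE = c("white", "black", "other")))

col_totals <- colSums(temp)                    # number of respondents in each group
col_props  <- prop.table(temp, 2)              # margin 2: each column sums to 1
manual     <- sweep(temp, 2, col_totals, "/")  # the same division done by hand
```

Using margin 2 answers "within each group, what share has more than 12 years of education?", which is the comparison the slide is after.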

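Comparing distributions

qqplot = quantile-quantile plot
  For each fraction p, plot the value below which a fraction p of x lies against the value below which a fraction p of y lies

Shapes
  Straight line: same distribution, up to location and scale
  Vertical intercepts differ: different mean
  Slopes differ: different variance

Reference distribution can be a theoretical distribution
  qqnorm: compare to standardized normal
  Skew to right: both tails below the straight line
  Heavy tails: lower tail above, upper tail below the line
  (These directions hold when the sample is on the horizontal axis. With R's qqnorm default, which puts the sample on the vertical axis, they reverse.)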
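The link between intercept, slope, mean, and variance can be sketched with simulated normal samples. All three samples and the seed are hypothetical. Fitting a straight line through the quantile pairs is only a convenient way to read off the intercept and slope, not something the slides prescribe.

```r
set.seed(5)
x <- rnorm(2000, mean = 0, sd = 1)
y_shift <- rnorm(2000, mean = 2, sd = 1)   # same spread, different mean
y_scale <- rnorm(2000, mean = 0, sd = 2)   # same mean, different spread

# plot.it = FALSE returns the sorted quantile pairs without drawing
qq_shift <- qqplot(x, y_shift, plot.it = FALSE)
qq_scale <- qqplot(x, y_scale, plot.it = FALSE)

# Fit a straight line through each set of quantile pairs
fit_shift <- coef(lm(qq_shift$y ~ qq_shift$x))  # intercept near 2, slope near 1
fit_scale <- coef(lm(qq_scale$y ~ qq_scale$x))  # intercept near 0, slope near 2

qqplot(x, y_scale, main = "Different variance")
abline(fit_scale)
```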

qqplot(x,y) examples

[Figure: four Q-Q plots. Panels: identical distributions; Mean1=0, Mean2=2; σ1=1, σ2=2; Normal Q-Q Plot (Theoretical Quantiles v Sample Quantiles) of a sample v N(0,1), with refline]

More qqnorm examples

[Figure: two normal Q-Q plots, one skewed to right and one with heavy tails]

Source: www.maths.murdoch.edu.au/units/statsnotes/samplestats/qqplot.html
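To see these two shapes in R, the sketch below measures how far the most extreme points sit from the reference line that `qqline()` draws through the quartiles. The samples and the helper `tail_gaps` are hypothetical. With R's default axes (theoretical quantiles horizontal, sample quantiles vertical), a right-skewed sample sits above the line at both ends, and a heavy-tailed sample sits below it on the left and above it on the right. Descriptions that put the data on the horizontal axis state the reverse.

```r
set.seed(6)
skewed <- rexp(1000)        # right-skewed sample
heavy  <- rt(1000, df = 3)  # heavy-tailed sample

# Vertical distance of the two extreme points from the line through the
# quartiles (the same line that qqline() draws by default)
tail_gaps <- function(y) {
  qq <- qqnorm(y, plot.it = FALSE)   # qq$x theoretical, qq$y sample
  slope <- diff(quantile(y, c(0.25, 0.75))) / diff(qnorm(c(0.25, 0.75)))
  intercept <- quantile(y, 0.25) - slope * qnorm(0.25)
  gap <- qq$y - (intercept + slope * qq$x)
  c(lower = unname(gap[which.min(qq$x)]), upper = unname(gap[which.max(qq$x)]))
}

gaps_skewed <- tail_gaps(skewed)  # both positive: both ends above the line
gaps_heavy  <- tail_gaps(heavy)   # lower negative, upper positive

qqnorm(skewed, main = "Skewed to right")
qqline(skewed)
```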

Pairs of variables

Is one variable related to another?

Scatterplot
  Basic: plot(x,y)
  Enhanced from library("car"): scatterplot(x,y)

Scatterplot matrix
  Basic: pairs(data.frame(x,y,z))
  Enhanced: scatterplot.matrix(data.frame(x,y,z)) (named scatterplotMatrix in current versions of car)
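A minimal sketch of the basic versions, using only base R so that no extra package is needed. The variables are simulated: `y` is built to depend on `x`, and `z` is unrelated noise, so the matrix should show one clear relationship and two shapeless clouds.

```r
set.seed(7)
# Hypothetical variables: y depends on x, z is unrelated noise
x <- rnorm(100)
y <- 2 * x + rnorm(100, sd = 0.5)
z <- rnorm(100)
df <- data.frame(x, y, z)

plot(x, y)        # single scatterplot
pairs(df)         # scatterplot matrix: one panel per pair of variables
cors <- cor(df)   # numeric companion to the picture
```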

Basic and enhanced scatterplot

Scatterplot matrix

Labeling points in scatterplots

identify(x,y,labels="foo")

Color is also useful

[Figure: scatterplot of y (-4 to 6) against x (-2 to 2), with points 90, 98, 110, and 175 labeled]
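`identify()` waits for mouse clicks, so it only works in an interactive session. For a script, one alternative is to choose the points by rule and label them with `text()`. The sketch below is an assumption about how one might do that, not the method used for the slide: the data are simulated, four points are deliberately shifted, and the points with the largest residuals from a straight-line fit are flagged.

```r
set.seed(8)
x <- rnorm(200)
y <- x + rnorm(200)
shifted <- c(90, 98, 110, 175)                 # hypothetical unusual observations
y[shifted] <- y[shifted] + c(8, -8, 8, -8)

# Flag the four points with the largest residuals from a straight-line fit
res <- resid(lm(y ~ x))
flagged <- order(abs(res), decreasing = TRUE)[1:4]

# Color the flagged points and label them by row number
plot(x, y, col = ifelse(seq_along(x) %in% flagged, "red", "black"))
text(x[flagged], y[flagged], labels = flagged, pos = 3)
```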

Cigarettes and taxes
  Discussant on paper by Austan Goolsbee, “Playing with Fire”
  Question: did Internet purchases of cigarettes affect state tobacco tax revenues?

Cigarette Prices in 1990s

[Figure: price of cigarettes by year, 1990 to 2000; vertical axis 150 to 400]

Internet usage

[Figure: Internet usage by year, 1990 to 2000; vertical axis 0.0 to 0.5]

Price elasticity of use/sales

Across all states and years
  Taxable sales elasticity: -0.802
  Use elasticity: -0.440

Sales are much more responsive to price than usage, suggesting that there is some cross border trade (aka “buttlegging”)
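The slides do not say how these elasticities were estimated. One common approach, shown here only as a sketch on simulated data, is to regress log quantity on log price and read the elasticity off the slope. The sample size, price level, noise, and the value -0.8 used to generate the data are all assumptions.

```r
set.seed(9)
n <- 500
log_p <- rnorm(n, mean = log(300), sd = 0.2)   # hypothetical log prices
true_elasticity <- -0.8                        # assumed value used to simulate the data
log_q <- 8 + true_elasticity * log_p + rnorm(n, sd = 0.1)

# In a log-log regression the slope is the price elasticity
fit <- lm(log_q ~ log_p)
elasticity_hat <- unname(coef(fit)["log_p"])
```

An elasticity of -0.8 means a 1% price increase goes with roughly a 0.8% fall in quantity.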

Use vs Sales in 2000

[Figure: scatterplot of ciguse.p[year == 2000] (3 to 6) against q.p[year == 2000] (40 to 160), with DE, KY, NH, CA, and UT labeled]

Reduced form
  dp = log(p2001) - log(p1995)
  dq = log(q2001) - log(q1995)
  Regress dq/dp on internet penetration in 2000
  See next slide for result

Elasticity v Internet penetration

[Figure: scatterplot of dq/dp (-0.8 to 0.2) against i (0.25 to 0.45), with CA, DC, DE, MI, NH, NY, OK, and WA labeled]
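The reduced-form calculation can be sketched as follows. Every number in the data frame is invented for illustration and none comes from the paper. The column names `p1995`, `p2001`, `q1995`, `q2001`, and `internet` are assumed stand-ins for state-level price, taxable sales, and Internet penetration.

```r
# Hypothetical state-level data; none of these numbers come from the paper
states <- data.frame(
  state    = c("A", "B", "C", "D", "E", "F"),
  p1995    = c(180, 200, 170, 210, 190, 175),
  p2001    = c(330, 390, 300, 420, 350, 310),
  q1995    = c(95, 80, 110, 70, 90, 100),
  q2001    = c(70, 52, 88, 41, 64, 76),
  internet = c(0.30, 0.38, 0.27, 0.44, 0.35, 0.29)
)

states$dp <- log(states$p2001) - log(states$p1995)   # change in log price
states$dq <- log(states$q2001) - log(states$q1995)   # change in log sales
states$ratio <- states$dq / states$dp                # state-level elasticity estimate

# Regress the elasticity estimate on Internet penetration
fit <- lm(ratio ~ internet, data = states)
plot(states$internet, states$ratio, xlab = "i", ylab = "dq/dp")
abline(fit)
```

A negative slope in this regression would mean that sales respond more strongly to price in states with higher Internet penetration.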

What is Internet providing?
  It was always a good deal for some to buy cigarettes out-of-state (in high tax states)
  Mail order has been around for a long time and is certainly cost-effective
  Internet makes it easier to find merchants: just type into a search engine
  Internet is great at matching buyers and sellers

Price of a match
  Google doesn’t accept cigarette advertisements, but Overture does
  Price for top listing: $1.20 per click
  Avg price for click on Overture is 40 cents
  Conversion rates might be 5%, so advertiser is paying $24 for introduction
  But think of lifetime value…
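The $24 figure follows from dividing the click price by the conversion rate. A short worked version, using only the numbers on the slide (the variable names are mine):

```r
cost_per_click  <- 1.20   # price for top listing, from the slide
conversion_rate <- 0.05   # "might be 5%", from the slide

clicks_per_customer <- 1 / conversion_rate               # about 20 clicks per sale
cost_per_customer   <- cost_per_click / conversion_rate  # about 24 dollars per new customer

# Same calculation at the 40-cent average click price
cost_at_average <- 0.40 / conversion_rate                # about 8 dollars per new customer
```

The advertiser comes out ahead whenever the lifetime profit from a new customer exceeds the cost of acquiring one.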


Straightening out and scaling data
  Find a transform so that the data look linear, or normal, or fit on the same scale
  Log10 (easier to interpret than log)
  Square root
  Reciprocal
  Box-Cox transform (x^r - 1)/r, which combines many of the above; the limit r = 0 is log
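A small sketch of the Box-Cox family shows how it covers the other transforms. The function name `box_cox` and the test values are my own. The r = 0 case is handled explicitly because the formula itself divides by zero there, while its limit as r shrinks is log(x).

```r
# Box-Cox transform: (x^r - 1) / r, with log(x) as the r = 0 case
box_cox <- function(x, r) {
  stopifnot(all(x > 0))   # defined here for positive data only
  if (r == 0) log(x) else (x^r - 1) / r
}

x <- c(1, 10, 100, 1000)
at_zero    <- box_cox(x, 0)      # natural log
near_zero  <- box_cox(x, 1e-6)   # approaches log(x) as r shrinks
sqrt_like  <- box_cox(x, 0.5)    # 2 * (sqrt(x) - 1): shifted, scaled square root
recip_like <- box_cox(x, -1)     # 1 - 1/x: shifted, sign-flipped reciprocal
```

Subtracting 1 and dividing by r does not change the shape of the transformed data. It only makes the family vary smoothly in r and keeps the transform increasing in x even for negative r.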

City sizes: regular & log10

[Figure: Histogram of log10(pop1980); horizontal axis log10(pop1980) (3.5 to 7.0), vertical axis Density (0.0 to 0.8)]
