Univariate EDA (Exploratory Data Analysis)

Upload: fay-bruce

Post on 26-Dec-2015


EDA

• John Tukey (1970s)

• data: two components
– smooth + rough
– patterned behaviour + random variation

• resistant measures/displays
– little influenced by changes in a small proportion of the total number of cases
– resistant to the effects of outliers
– emphasize smooth over rough components

• concepts apply to statistics and to graphical methods


Tree Ring dates (AD)

1255 1239 1162 1239 1240 1243 1241 1241 1271

• 9 dendrochronology dates

• what do they mean????

• usually helps to sort the data…
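
Sorting, and the idea of a resistant measure, can both be tried on these nine dates. The following is a minimal sketch in R, not part of the original slides; the variable names are assumptions chosen for illustration.

```r
# Tree-ring dates (AD) as listed on the slide
dates <- c(1255, 1239, 1162, 1239, 1240, 1243, 1241, 1241, 1271)

sorted_dates <- sort(dates)
print(sorted_dates)

# Compare a non-resistant summary (mean) with a resistant one (median),
# with and without the single early date (1162).
without_early <- dates[dates != 1162]
mean_shift   <- mean(without_early)   - mean(dates)
median_shift <- median(without_early) - median(dates)
print(c(mean_shift = mean_shift, median_shift = median_shift))
```

Dropping that one case moves the mean by about nine years and leaves the median where it was.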


Stem-and-Leaf Diagram

1162 1239 1239 1240 1241 1241 1243 1255 1271

11|62

12|39,39,40,41,41,43,55,71

• original values preserved

• no rounding, no loss of information…


can simplify in various ways…

11|6

12|44444467

– ‘leaves’ rounded to nearest decade

– ‘stem’ based on centuries


1162 1239 1239 1240 1241 1241 1243 1255 1271

116|2
117|
118|
119|
120|
121|
122|
123|99
124|0113
125|5
126|
127|1

‘stem’ based on decades…


highlights existence of gaps in the distribution of dates, groups of dates…


R

• stem()

• vu <- round(runif(25, 0, 50), 0); stem(vu)

• vn <- round(rnorm(25, 25, 10), 0); stem(vn)

• stem(vn, scale=2)
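
The same function can be pointed at the tree-ring dates. The sketch below is an illustration only: it calls stem() and then builds decade-based stems by hand, assuming stem = year %/% 10 and leaf = year %% 10. Unlike the slide, the hand-rolled version lists only the decades that contain dates.

```r
dates <- c(1255, 1239, 1162, 1239, 1240, 1243, 1241, 1241, 1271)

# Built-in display: stem() chooses the stem unit itself; 'scale' stretches the plot.
stem(dates)
stem(dates, scale = 2)

# Hand-rolled decade stems (illustrative helper code, not from the slides)
sorted_dates <- sort(dates)
leaves <- tapply(sorted_dates %% 10, sorted_dates %/% 10, paste, collapse = "")
decade_stems <- paste0(names(leaves), "|", leaves)
cat(decade_stems, sep = "\n")
```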


Back-to-back stem-and-leaf plot

rim diameter data (cm)

unit 1   unit 2
 12.6     16.2
 11.6     16.4
 16.3     13.8
 13.1     13.2
 12.1     11.3
 26.9     14
  9.7      9
 11.5     12.5
 14.8     15.6
 13.5     11.2
 12.4     12.2
 13.6     15.5
          11.7

unit 1 | stem | unit 2
     9 |  26  |
       |  25  |
       |  24  |
       |  23  |
       |  22  |
       |  21  |
       |  20  |
       |  19  |
       |  18  |
       |  17  |
     3 |  16  | 24
       |  15  | 56
     8 |  14  | 0
   651 |  13  | 28
   641 |  12  | 25
    65 |  11  | 237
       |  10  |
     7 |   9  | 0
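
Base R has no built-in back-to-back display, so the sketch below assembles one by hand. It is an illustration under assumptions: the helper `leaves_by_stem` is hypothetical, and the unpaired value 11.7 is assigned to unit 2 because that is where its leaf appears in the plot.

```r
# Rim diameters (cm); 11.7 is placed in unit 2 (assumption, inferred from the plot)
unit1 <- c(12.6, 11.6, 16.3, 13.1, 12.1, 26.9, 9.7, 11.5, 14.8, 13.5, 12.4, 13.6)
unit2 <- c(16.2, 16.4, 13.8, 13.2, 11.3, 14, 9, 12.5, 15.6, 11.2, 12.2, 15.5, 11.7)

# Hypothetical helper: collect the leaves (tenths) for each whole-cm stem
leaves_by_stem <- function(x, stems, reverse = FALSE) {
  tenths <- round(sort(x) * 10)        # integer tenths avoid floating-point surprises
  sapply(stems, function(s) {
    lv <- tenths[tenths %/% 10 == s] %% 10
    if (reverse) lv <- rev(lv)         # left-hand leaves read outward from the stem
    paste(lv, collapse = "")
  })
}

stems <- 26:9
left  <- leaves_by_stem(unit1, stems, reverse = TRUE)
right <- leaves_by_stem(unit2, stems)
plot_lines <- sprintf("%4s | %2d | %s", left, stems, right)
cat(plot_lines, sep = "\n")
```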


percentiles

• useful for constructing various kinds of EDA graphics

• don’t confuse percentile with percent or proportion

Note:

• frequency = count

• relative frequency = percent or proportion


percentiles

“the pth percentile of a distribution: number such that approximately p percent of the values in the distribution are equal or less than that number…”

• can be calculated for numbers that actually exist in the distribution, and interpolated for numbers that don’t…


percentiles

• sort the data so that x_1 is the smallest value, and x_n is the largest (where n = total number of cases)

• x_i is the p_i-th percentile of a dataset of n members, where:

$$p_i = \frac{100\,(i - 0.5)}{n}$$


original data:

5 1 9 3 14 9 7

sorted data:

i     1     2     3     4     5     6     7
x_i   1     3     5     7     9     9     14
p_i   (calculate, using equation [1], as shown below…)

$$p_i = \frac{100\,(i - 0.5)}{n}$$ [1]

p_1 = 100(1 - 0.5) / 7 = 7.1
p_2 = 100(2 - 0.5) / 7 = 21.4
p_3 = 100(3 - 0.5) / 7 = 35.7
p_4 = 100(4 - 0.5) / 7 = 50
etc…

i     1     2     3     4     5     6     7
x_i   1     3     5     7     9     9     14
p_i   7.1   21.4  35.7  50    64.3  78.6  92.9
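
Equation [1] is a one-line vector calculation in R. The sketch below is an illustration rather than part of the original slides; the variable names are assumptions.

```r
x <- c(5, 1, 9, 3, 14, 9, 7)        # original data from the slide

x_sorted <- sort(x)
n <- length(x_sorted)
i <- seq_len(n)
p <- 100 * (i - 0.5) / n            # equation [1]

print(data.frame(i = i, x = x_sorted, p = round(p, 1)))
```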

solving equation [1] for i:

$$p_i = \frac{100\,(i - 0.5)}{n} \quad\Longrightarrow\quad i = \frac{n\,p_i}{100} + 0.5$$

i     1     2     3     4     5     6     7
x_i   1     3     5     7     9     9     14
p_i   7.1   21.4  35.7  50    64.3  78.6  92.9

• which values fall at the 25th, 50th and 85th percentiles?

50th percentile: i = (7*50)/100 + 0.5 = 4, x_i = 7

25th percentile: i = (7*25)/100 + 0.5 = 2.25, 3 < x_i < 5

25th percentile: i = (7*25)/100 + 0.5 = 2.25, 3 < x_i < 5

if i is not an integer, then…

• k = integer part of i; f = fractional part of i

• x_int = interpolated value of x

$$x_{int} = (1 - f)\,x_k + f\,x_{k+1}$$

x_int = (1 - 0.25)*3 + 0.25*5 = 3.5


use R!!

• test<-c(1,3,5,7,9,9,14)

• quantile(test, .25, type=5)
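
For comparison, the interpolation rule can be written out by hand and checked against quantile(). This is a sketch under assumptions: `percentile_interp` is a hypothetical helper name, and clamping positions that fall outside 1..n is a choice made here, not something the slides specify.

```r
# Hypothetical helper: percentile by linear interpolation between sorted cases
percentile_interp <- function(x, p) {
  x <- sort(x)
  n <- length(x)
  i <- n * p / 100 + 0.5          # position, from equation [1] solved for i
  i <- min(max(i, 1), n)          # assumption: clamp to the observed range
  k <- floor(i)                   # integer part of i
  f <- i - k                      # fractional part of i
  if (f == 0) return(x[k])
  (1 - f) * x[k] + f * x[k + 1]
}

test <- c(1, 3, 5, 7, 9, 9, 14)
by_hand  <- percentile_interp(test, 25)
built_in <- quantile(test, 0.25, type = 5, names = FALSE)
print(c(by_hand = by_hand, built_in = built_in))
```

Both approaches agree on this batch; other values of `type` in quantile() use different position rules and can give different answers.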

“boxplot”

[Figure: boxplot drawn beside a listing of the data batch. Labelled parts: the 25th, 50th and 75th percentiles; the interquartile range (midspread); the lower and upper hinges; and the two inner fences, each 1.5 x midspread beyond its hinge.]
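
The parts of a boxplot can be computed directly. The sketch below is an illustration, not from the original slides; it reuses the tree-ring dates from earlier in the deck and relies on fivenum(), which returns Tukey's hinges.

```r
dates <- c(1255, 1239, 1162, 1239, 1240, 1243, 1241, 1241, 1271)

five <- fivenum(dates)              # min, lower hinge, median, upper hinge, max
lower_hinge <- five[2]
upper_hinge <- five[4]
midspread   <- upper_hinge - lower_hinge

# Inner fences: 1.5 x midspread beyond each hinge
inner_fences <- c(lower_hinge - 1.5 * midspread,
                  upper_hinge + 1.5 * midspread)

# Cases that fall outside the inner fences
outside <- sort(dates[dates < inner_fences[1] | dates > inner_fences[2]])

print(five)
print(inner_fences)
print(outside)
```

Passing the same vector to boxplot() draws the display; boxplot.stats(dates)$out reports the same outlying cases.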


Figure 6.25: Internal diversity of neighbourhoods used to define N-clusters, measured by the 'evenness' statistic H/Hmax on the basis of counts of various A-clusters, and broken down by N-cluster and phase. [Boxes encompass the midspread; lines inside boxes indicate the median, while whiskers show the range of cases that fall within 1.5-times the midspread, above or below the limits of the box.]


Cleveland, W. S. (1985) The Elements of Graphing Data.


Histograms

• divide a continuous variable into intervals called ‘bins’

• count the number of cases within each bin

• use bars to reflect counts

• intervals on the horizontal axis

• counts on the vertical axis
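
The binning and counting steps can be sketched in R with cut() and table(). This is an illustration only; the decade-wide bins and the use of the tree-ring dates are assumptions made for the example.

```r
dates <- c(1255, 1239, 1162, 1239, 1240, 1243, 1241, 1241, 1271)

# Divide the range into decade-wide bins, closed on the left: [1160,1170), ...
breaks <- seq(1160, 1280, by = 10)
bins   <- cut(dates, breaks = breaks, right = FALSE, dig.lab = 4)

# Count the cases in each bin (empty bins are kept)
counts <- table(bins)
print(counts)
```

Calling barplot(counts) would then draw the bars, with intervals on the horizontal axis and counts on the vertical axis.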

Histogram

[Figure: histogram drawn beside a listing of the data batch, with the “bins” marked; the vertical axes show counts and percent.]

Histograms

• useful for illustrating the shape of the distribution of a batch of numbers

• may be helpful for identifying modes and modal behaviour

[Figure: distribution with peaks annotated “mode”, “mode?” and “mode!”]

• the distribution is clearly bimodal

• may be multimodal…


important variables in histogram construction:

• bin width

• bin starting point
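
Both choices change the picture. The sketch below is an illustration, not from the original slides: it bins the tree-ring dates three ways with hist(), using plot = FALSE so that only the counts are returned. The specific widths and starting points are assumptions.

```r
dates <- c(1255, 1239, 1162, 1239, 1240, 1243, 1241, 1241, 1271)

# 10-year bins starting at 1160
h10 <- hist(dates, breaks = seq(1160, 1280, by = 10), right = FALSE, plot = FALSE)

# 20-year bins starting at 1160 (wider bins)
h20 <- hist(dates, breaks = seq(1160, 1280, by = 20), right = FALSE, plot = FALSE)

# 20-year bins starting at 1150 (same width, shifted starting point)
h20_shifted <- hist(dates, breaks = seq(1150, 1290, by = 20), right = FALSE, plot = FALSE)

print(h10$counts)
print(h20$counts)
print(h20_shifted$counts)
```

With the shifted start the cluster around 1240 lands in a single bin, whereas the unshifted 20-year bins split it across two.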


smoothing histograms

• may want to accentuate the ‘smooth’ in a data distribution…

• calculate “running averages” on bin counts

• level of smoothing is arbitrary…

bin counts:        1   3   5     2     4     2   0   1

running averages:  2   3   3.3   3.7   2.7   2   1   0.5
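
The running averages can be reproduced with a short function. This is a sketch under assumptions: `running_mean` is a hypothetical helper, the window is three bins wide, and the two end bins average over the two counts that are available.

```r
# Hypothetical helper: centred running mean with a shortened window at the ends
running_mean <- function(counts, half_width = 1) {
  n <- length(counts)
  sapply(seq_len(n), function(i) {
    window <- max(1, i - half_width):min(n, i + half_width)
    mean(counts[window])
  })
}

counts   <- c(1, 3, 5, 2, 4, 2, 0, 1)   # bin counts from the slide
smoothed <- running_mean(counts)
print(round(smoothed, 1))
```

Raising `half_width` widens the window and gives a smoother result, which is the arbitrary level of smoothing the slide refers to.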


histogram / barchart variations

• 3d

• stacked

• dual

• frequency polygon

• kernel density methods

dual barchart

[Figure: dual barchart of Count (0 to 80) by FAUNA (bear, caribou, muskox, seal, walrus), with SITE 1 and SITE 2 shown together.]

[Figure: paired barcharts of Count (0 to 70) by FAUNA (bear, caribou, muskox, seal, walrus), one panel for Site 1 and one for Site 2.]

‘mirror’ barchart

[Figure: mirror barchart by FAUNA (bear, caribou, muskox, seal, walrus) with two Count axes (0 to 80), one for SITE 1 and one for SITE 2.]

stacked barchart

[Figure: stacked barchart (vertical axis 0 to 80) of bear, caribou, muskox, seal and walrus, with Site 1 and Site 2 stacked in each bar.]

3d barchart

[Figure: 3d barchart (vertical axis 0 to 70) of bear, caribou, muskox, seal and walrus for Site 1 and Site 2.]


frequency polygon

kernel density model

[Figure: two panels titled “Histogram of vol”; horizontal axis vol (100 to 400), vertical axis Density (0.000 to 0.008).]


controlling kernel density plots…

• hd <- density(XX)

• hh <- hist(XX, plot = FALSE)

• maxD <- max(hd$y)

• maxH <- max(hh$density)

• Y <- c(0, max(c(maxD, maxH)))

• hist(XX, freq = FALSE, ylim = Y)

• lines(density(XX))
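
A runnable version of these steps needs some data for XX. The sketch below is an illustration only: XX is a simulated, hypothetical sample (the slides do not say what XX holds), and the `adjust` argument of density() is shown as one further way to control the smoothness of the curve.

```r
set.seed(42)                             # assumption: reproducible stand-in data
XX <- rnorm(200, mean = 250, sd = 60)    # hypothetical sample, not the slides' data

hd        <- density(XX)                 # default bandwidth
hd_narrow <- density(XX, adjust = 0.5)   # half the default bandwidth: a rougher curve
hh        <- hist(XX, plot = FALSE)      # histogram object only, nothing drawn

# Shared vertical range so neither the bars nor the curves are clipped
Y <- c(0, max(hd$y, hd_narrow$y, hh$density))

if (interactive()) {
  hist(XX, freq = FALSE, ylim = Y)       # density scale, not counts
  lines(hd)
  lines(hd_narrow, lty = 2)
}
```

The plotting calls are wrapped in interactive() only so the example can also run as a script without opening a graphics device.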

Dot Plot [R: dotchart()]

[Figure: dot plot of Count (0 to 8) against VAR00003 (values 1 to 10).]

[Figure: plot of Count (0 to 80) by FAUNA (bear, caribou, muskox, seal, walrus) for SITE 1 and SITE 2.]

Dot Histogram [R: stripchart()]

[Figure: three dot histograms of VAR00003 (values 1 to 10).]

method = “stack”

line plot

[Figure: line plots across the categories cooking/service, service and ritual.]

[Figure: plot of Count (0 to 80) by FAUNA (bear, caribou, muskox, seal, walrus) for SITE 1 and SITE 2.]

pie chart

[Figure: pie chart of five categories (bear, caribou, cat, elk, moose) with slices labelled 20%, 19%, 18%, 21% and 22%.]

[Figure: two pie charts, labelled 1 and 2, each divided among bear, caribou, cat, elk and moose.]

Cumulative Percent Graph

[Figure: two panels with vertical axes “percent” and “cumulative percent”, each scaled 10 to 100.]

Cumulative Percent Graph

[Figure: cumulative percent graph; vertical axis “cumulative percent”, 10 to 100.]

• some useful statistical measures (ordinal or ratio scale)

• can be misleading when used with nominal data

• good for comparing data sets

Percentages

        Sites
Types   A     B     C
1       5     5     5
2       45    0     30
3       5     48    5
4       5     5     5
5       5     5     5
6       5     5     5
7       20    5     35
8       5     22    5
9       5     5     5
Total   100   100   100

Cumulative Percents

        Sites
Types   A     B     C
1       5     5     5
2       50    5     35
3       55    53    40
4       60    58    45
5       65    63    50
6       70    68    55
7       90    73    90
8       95    95    95
9       100   100   100

[Figure: cumulative percent curves for sites A, B and C across types 1 to 9; vertical axis 0 to 120.]

[Figure: two cumulative percent graphs for sites A, B and C (vertical axis 0 to 120): one with the types in the order 1 2 3 4 5 6 7 8 9, the other in the order 1 5 3 4 2 6 7 8 9.]
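
The cumulative percents for sites A, B and C follow from the percentages by a running sum, and reordering the types shows why the graph can mislead with nominal data. The sketch below is an illustration, not from the original slides; the object names are assumptions, and the second ordering is the one shown on the final graph.

```r
# Percentages by type (rows) and site (columns), from the Percentages table
pct <- matrix(c( 5,  5,  5,
                45,  0, 30,
                 5, 48,  5,
                 5,  5,  5,
                 5,  5,  5,
                 5,  5,  5,
                20,  5, 35,
                 5, 22,  5,
                 5,  5,  5),
              ncol = 3, byrow = TRUE,
              dimnames = list(Type = as.character(1:9), Site = c("A", "B", "C")))

# Cumulative percent: running sum down each site's column
cum_pct <- apply(pct, 2, cumsum)

# Same data with types 2 and 5 swapped, as in the order 1 5 3 4 2 6 7 8 9
new_order     <- c(1, 5, 3, 4, 2, 6, 7, 8, 9)
cum_reordered <- apply(pct[new_order, ], 2, cumsum)

print(cum_pct)
print(cum_reordered)
```

Every curve still ends at 100, but its shape depends on the order of the types; when that order is arbitrary, as it is for nominal categories, so is the shape.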