chapter 1: introduction to data - quantoid · chapter 1: introduction to data openintro statistics,...

Chapter 1: Introduction to data

OpenIntro Statistics, 3rd Edition

Slides developed by Mine Cetinkaya-Rundel of OpenIntro. The slides may be copied, edited, and/or shared via the CC BY-SA license. Some images may be included under fair use guidelines (educational purposes).

Introduction

Goals of This Lecture

Understand how to visually convey properties of data.

• Dos and Don’ts for making graphs.

• What sorts of graphs are at our disposal and what are they good for?

• How to make all of these things (or most of these things) in R.

Examining numerical data

http://creativecommons.org/licenses/by-sa/3.0/us/

Scatterplot

Scatterplots are useful for visualizing the relationship between two numerical variables.

Do life expectancy and total fertility appear to be associated or independent?

Was the relationship the same throughout the years, or did it change?

http://www.gapminder.org/world

Dot plots

Useful for visualizing one numerical variable. Darker colors represent areas where there are more observations.

[Figure: dot plot of GPA; x-axis from 2.5 to 4.0]

How would you describe the distribution of GPAs in this data set? Make sure to say something about the center, shape, and spread of the distribution.

Dot plots & mean

[Figure: dot plot of GPA with the mean marked by a triangle; x-axis from 2.5 to 4.0]

• The mean, also called the average (marked with a triangle in the above plot), is one way to measure the center of a distribution of data.

• The mean GPA is 3.59.

Mean

• The sample mean, denoted as $\bar{x}$, can be calculated as

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n},$$

where $x_1, x_2, \cdots, x_n$ represent the $n$ observed values.

• The population mean is also computed the same way but is denoted as µ. It is often not possible to calculate µ since population data are rarely available.

• The sample mean is a sample statistic, and serves as a point estimate of the population mean. This estimate may not be perfect, but if the sample is good (representative of the population), it is usually a pretty good estimate.
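As a minimal sketch of the formula above, the sample mean can be computed by hand and compared with R's built-in mean(). The vector gpa is a small hypothetical sample made up for illustration; it is not the GPA data shown in the plots.

```r
# Hypothetical GPA values, for illustration only (not the survey data)
gpa <- c(3.2, 3.9, 3.6, 3.4, 3.8)

# Sample mean from the formula: sum of the observations divided by n
n <- length(gpa)
x_bar_manual <- sum(gpa) / n

# The built-in function gives the same value
x_bar <- mean(gpa)

print(c(manual = x_bar_manual, builtin = x_bar))
```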


Stacked dot plot

Higher bars represent areas where there are more observations, which makes it a little easier to judge the center and the shape of the distribution.

[Figure: stacked dot plot of GPA; x-axis from 2.6 to 4.0]

Histograms - Extracurricular hours

• Histograms provide a view of the data density. Higher bars represent where the data are relatively more common.

• Histograms are especially convenient for describing the shape of the data distribution.

• The chosen bin width can alter the story the histogram is telling.

[Figure: histogram of hours / week spent on extracurricular activities; x-axis from 0 to 70, y-axis from 0 to 150]

Bin width

Which one(s) of these histograms are useful? Which reveal too much about the data? Which hide too much?

[Figure: four histograms of the same data drawn with different bin widths]
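One way to explore the effect of bin width in R is to pass explicit bin edges to hist() through its breaks argument. The sketch below uses simulated right-skewed data (the vector hours is hypothetical, not the extracurricular survey data) and draws the same values with bins of width 10, 5, and 1.

```r
# Hypothetical right-skewed data standing in for hours per week
set.seed(1)
hours <- rexp(200, rate = 1 / 10)

# Upper edge of the last bin, rounded up to a multiple of 10
upper <- ceiling(max(hours) / 10) * 10

# Same data, three bin widths; breaks gives the bin edges explicitly
op <- par(mfrow = c(1, 3))
h_wide   <- hist(hours, breaks = seq(0, upper, by = 10), main = "Bin width 10")
h_medium <- hist(hours, breaks = seq(0, upper, by = 5),  main = "Bin width 5")
h_narrow <- hist(hours, breaks = seq(0, upper, by = 1),  main = "Bin width 1")
par(op)
```

Very wide bins tend to hide the shape, while very narrow bins tend to show noise rather than shape.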

Shape of a distribution: modality

Does the histogram have a single prominent peak (unimodal), several prominent peaks (bimodal/multimodal), or no apparent peaks (uniform)?

[Figure: four histograms illustrating different modalities]

Note: In order to determine modality, step back and imagine a smooth curve over the histogram. Imagine that the bars are wooden blocks and you drop a limp piece of spaghetti over them; the shape the spaghetti would take could be viewed as a smooth curve.

Shape of a distribution: skewness

Is the histogram right skewed, left skewed, or symmetric?

[Figure: three histograms illustrating different skewness]

Note: Histograms are said to be skewed to the side of the long tail.

Shape of a distribution: unusual observations

Are there any unusual observations or potential outliers?

[Figure: two histograms; x-axes from 0 to 20 and from 20 to 100]

Extracurricular activities

How would you describe the shape of the distribution of hours per week students spend on extracurricular activities?

[Figure: histogram of hours / week spent on extracurricular activities; x-axis from 0 to 70, y-axis from 0 to 150]

Commonly observed shapes of distributions

• modality: unimodal, bimodal, multimodal, uniform

• skewness: right skew, left skew, symmetric

Practice

Which of these variables do you expect to be uniformly distributed?

(a) weights of adult females

(b) salaries of a random sample of people from North Carolina

(c) house prices

(d) birthdays of classmates (day of the month)


Application activity: Shapes of distributions

Sketch the expected distributions of the following variables:

• number of piercings

• scores on an exam

• IQ scores

Come up with a concise way (1-2 sentences) to teach someone how to determine the expected distribution of any variable.

Are you typical?

http://www.youtube.com/watch?v=4B2xOvKFFz4

How useful are centers alone for conveying the true characteristics of a distribution?

Variance

Variance is roughly the average squared deviation from the mean.

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

• The sample mean is $\bar{x} = 6.71$, and the sample size is $n = 217$.

• The variance of amount of sleep students get per night can be calculated as:

$$s^2 = \frac{(5 - 6.71)^2 + (9 - 6.71)^2 + \cdots + (7 - 6.71)^2}{217 - 1} = 4.11 \text{ hours}^2$$

[Figure: histogram of hours of sleep / night; x-axis from 2 to 12, y-axis from 0 to 80]
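The variance formula can be checked step by step in R. The sketch below uses a short hypothetical vector, sleep, because the 217 survey responses are not reproduced on the slide; it computes the squared deviations by hand and compares the result with the built-in var(), which also divides by n - 1.

```r
# Hypothetical hours of sleep, for illustration only (not the 217 responses)
sleep <- c(5, 9, 7, 6, 8)

n <- length(sleep)
x_bar <- mean(sleep)

# Squared deviations from the mean, summed and divided by n - 1
s2_manual <- sum((sleep - x_bar)^2) / (n - 1)

# Built-in sample variance
s2 <- var(sleep)

print(c(manual = s2_manual, builtin = s2))
```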


Variance (cont.)

Why do we use the squared deviation in the calculation of variance?


Standard deviation

The standard deviation is the square root of the variance, and has the same units as the data.

$$s = \sqrt{s^2}$$

• The standard deviation of amount of sleep students get per night can be calculated as:

$$s = \sqrt{4.11} = 2.03 \text{ hours}$$

• We can see that all of the data are within 3 standard deviations of the mean.

[Figure: histogram of hours of sleep / night; x-axis from 2 to 12, y-axis from 0 to 80]
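Using only the summary values reported on the slides (variance 4.11 and mean 6.71), the standard deviation and the interval within 3 standard deviations of the mean can be sketched as follows. The variable names are chosen here for illustration.

```r
# Summary values taken from the slides
s2    <- 4.11   # sample variance, in hours^2
x_bar <- 6.71   # sample mean, in hours

# Standard deviation: square root of the variance, back in hours
s <- sqrt(s2)

# Interval covering 3 standard deviations on either side of the mean
lower <- x_bar - 3 * s
upper <- x_bar + 3 * s

print(round(c(s = s, lower = lower, upper = upper), 2))
```

The histogram's x-axis runs from 2 to 12, which lies inside this interval.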

Median

• The median is the value that splits the data in half when ordered in ascending order.

0, 1, 2, 3, 4 → median = 2

• If there are an even number of observations, then the median is the average of the two values in the middle.

0, 1, 2, 3, 4, 5 → $\frac{2 + 3}{2} = 2.5$

• Since the median is the midpoint of the data, 50% of the values are below it. Hence, it is also the 50th percentile.
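The two small examples above can be reproduced with R's median() function; the variable names are chosen here for illustration.

```r
# Odd number of observations: the middle value
m_odd <- median(c(0, 1, 2, 3, 4))

# Even number of observations: average of the two middle values
m_even <- median(c(0, 1, 2, 3, 4, 5))

print(c(odd = m_odd, even = m_even))
```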


Q1, Q3, and IQR

• The 25th percentile is also called the first quartile, Q1.

• The 50th percentile is also called the median.

• The 75th percentile is also called the third quartile, Q3.

• Between Q1 and Q3 is the middle 50% of the data. The range these data span is called the interquartile range, or the IQR.

IQR = Q3 − Q1
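A minimal sketch of these quantities in R, using a hypothetical vector x. Note that software packages differ slightly in how they interpolate quartiles; the values below come from R's default method in quantile().

```r
# Hypothetical data, for illustration only
x <- c(0, 1, 2, 3, 4, 5, 6, 7, 8)

# 25th, 50th, and 75th percentiles (Q1, median, Q3)
q <- quantile(x, probs = c(0.25, 0.50, 0.75))

# IQR = Q3 - Q1, by hand and with the built-in function
iqr_manual <- unname(q["75%"] - q["25%"])
iqr <- IQR(x)

print(q)
print(c(manual = iqr_manual, builtin = iqr))
```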


Box plot

The box in a box plot represents the middle 50% of the data, and the thick line in the box is the median.

[Figure: box plot of # of study hours / week; axis from 0 to 70]

Anatomy of a box plot

[Figure: annotated box plot of # of study hours / week (axis from 0 to 70), with labels: lower whisker, Q1 (first quartile), median, Q3 (third quartile), max whisker reach & upper whisker, suspected outliers]

Whiskers and outliers

• Whiskers of a box plot can extend up to 1.5 × IQR away from the quartiles.

max upper whisker reach = Q3 + 1.5 × IQR

max lower whisker reach = Q1 − 1.5 × IQR

IQR : 20 − 10 = 10

max upper whisker reach = 20 + 1.5 × 10 = 35

max lower whisker reach = 10 − 1.5 × 10 = −5

• A potential outlier is defined as an observation beyond the maximum reach of the whiskers. It is an observation that appears extreme relative to the rest of the data.
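The whisker arithmetic above (Q1 = 10, Q3 = 20) can be sketched in R, together with a check for potential outliers. The vector study_hours is hypothetical and only illustrates the flagging rule.

```r
# Quartiles from the worked example on the slide
q1 <- 10
q3 <- 20
iqr <- q3 - q1

# Maximum reach of the whiskers
upper_reach <- q3 + 1.5 * iqr
lower_reach <- q1 - 1.5 * iqr

# Hypothetical observations, for illustration only
study_hours <- c(2, 12, 18, 33, 36, 50)

# Potential outliers: observations beyond the maximum whisker reach
outliers <- study_hours[study_hours > upper_reach | study_hours < lower_reach]

print(c(lower = lower_reach, upper = upper_reach))
print(outliers)
```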


Outliers (cont.)

Why is it important to look for outliers?


Extreme observations

How would sample statistics such as mean, median, SD, and IQR of household income be affected if the largest value was replaced with $10 million? What if the smallest value was replaced with $10 million?

[Figure: dot plot of annual household income; x-axis from 0 to 1,000,000]

Robust statistics

[Figure: dot plot of annual household income; x-axis from 0 to 1,000,000]

                                   robust          not robust
scenario                        median   IQR      x̄       s
original data                   190K     200K     245K    226K
move largest to $10 million     190K     200K     309K    853K
move smallest to $10 million    200K     200K     316K    854K
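The same comparison can be sketched in R on a small hypothetical income vector (the survey incomes themselves are not listed on the slide). The helper summarize_income is a name made up for this illustration; it returns the four statistics so the original and modified data can be compared side by side.

```r
# Hypothetical annual household incomes, for illustration only
income <- c(20, 35, 50, 65, 80, 95, 110) * 1000

# Replace the largest value with $10 million
income_big <- income
income_big[which.max(income_big)] <- 1e7

# Hypothetical helper: the four summary statistics for one vector
summarize_income <- function(x) {
  c(median = median(x), IQR = IQR(x), mean = mean(x), sd = sd(x))
}

stats_original <- summarize_income(income)
stats_changed  <- summarize_income(income_big)

print(rbind(original = stats_original, largest_to_10M = stats_changed))
```

In this example the median and IQR do not move at all, while the mean and SD change substantially.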

Robust statistics

Median and IQR are more robust to skewness and outliers than mean and SD. Therefore,

• for skewed distributions it is often more helpful to use median and IQR to describe the center and spread

• for symmetric distributions it is often more helpful to use the mean and SD to describe the center and spread

If you would like to estimate the typical household income for a student, would you be more interested in the mean or median income?

Mean vs. median

• If the distribution is symmetric, center is often defined as the mean: mean ≈ median

[Figure: symmetric distribution with the mean and median marked]

• If the distribution is skewed or has extreme outliers, center is often defined as the median

  • Right-skewed: mean > median

  • Left-skewed: mean < median

[Figure: right-skewed and left-skewed distributions with the mean and median marked]

Practice

Which is most likely true for the distribution of percentage of time actually spent taking notes in class versus on Facebook, Twitter, etc.?

[Figure: histogram of % of time in class spent taking notes; x-axis from 0 to 100, y-axis from 0 to 50]

(a) mean > median

(b) mean < median

(c) mean ≈ median

(d) impossible to tell

Extremely skewed data

When data are extremely skewed, transforming them might make modeling easier. A common transformation is the log transformation.

The histogram on the left shows the distribution of number of basketball games attended by students. The histogram on the right shows the distribution of log of number of games attended.

[Figure: two histograms. Left: # of basketball games attended; x-axis from 0 to 70, y-axis from 0 to 150. Right: log of # of basketball games attended; x-axis from 0 to 4, y-axis from 0 to 40]

Pros and cons of transformations

• Skewed data are easier to model with when they are transformed because outliers tend to become far less prominent after an appropriate transformation.

# of games         70      50      25      · · ·
log(# of games)    4.25    3.91    3.22    · · ·

• However, results of an analysis might be difficult to interpret because the log of a measured variable is usually meaningless.
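The values in the table above are natural logarithms, which is what R's log() computes by default. A minimal sketch reproducing them:

```r
# Numbers of games from the table above
games <- c(70, 50, 25)

# Natural log transformation
log_games <- log(games)

print(round(log_games, 2))

# The largest value is 2.8 times the smallest before the transformation,
# but only about 1.3 times the smallest afterwards
print(c(raw_ratio = max(games) / min(games),
        log_ratio = max(log_games) / min(log_games)))
```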

What other variables would you expect to be extremely skewed?


Intensity maps

What patterns are apparent in the change in population between 2000 and 2010?

http://projects.nytimes.com/census/2010/map

Considering categorical data

Contingency tables

A table that summarizes data for two categorical variables is called a contingency table.

The contingency table below shows the distribution of students’ genders and whether or not they are looking for a spouse while in college.

              looking for spouse
gender        No      Yes     Total
Female        86      51      137
Male          52      18      70
Total         138     69      207

Bar plots

A bar plot is a common way to display a single categorical variable. A bar plot where proportions instead of frequencies are shown is called a relative frequency bar plot.

[Figure: two bar plots of gender (Female, Male). Left: frequencies, y-axis from 0 to 120. Right: relative frequencies, y-axis from 0.0 to 0.6]

How are bar plots different from histograms?

Choosing the appropriate proportion

Does there appear to be a relationship between gender and whether the student is looking for a spouse in college?

              looking for spouse
gender        No      Yes     Total
Female        86      51      137
Male          52      18      70
Total         138     69      207

To answer this question we examine the row proportions:

• Proportion of females looking for a spouse: 51/137 ≈ 0.37

• Proportion of males looking for a spouse: 18/70 ≈ 0.26
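The row proportions can be reproduced in base R by entering the counts from the contingency table as a matrix and calling prop.table() with margin = 1 (rows). The object names tab and row_props are chosen here for illustration.

```r
# Counts from the contingency table (rows: gender, columns: looking for spouse)
tab <- matrix(c(86, 51,
                52, 18),
              nrow = 2, byrow = TRUE,
              dimnames = list(gender = c("Female", "Male"),
                              looking_for_spouse = c("No", "Yes")))

# Add row and column totals for display
print(addmargins(tab))

# Row proportions: each row sums to 1
row_props <- prop.table(tab, margin = 1)

print(round(row_props, 2))
```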


Segmented bar and mosaic plots

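What are the differences between the three visualizations shown below?

[Figure: three plots of gender (Female, Male) by looking for spouse (No, Yes): a segmented bar plot with y-axis from 0 to 120, a segmented bar plot with y-axis from 0.0 to 1.0, and a mosaic plot]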

Pie charts

Can you tell which order encompasses the lowest percentage of mammal species?

[Figure: pie chart of mammal species by order. Legend: RODENTIA, CHIROPTERA, CARNIVORA, ARTIODACTYLA, PRIMATES, SORICOMORPHA, LAGOMORPHA, DIPROTODONTIA, DIDELPHIMORPHIA, CETACEA, DASYUROMORPHIA, AFROSORICIDA, ERINACEOMORPHA, SCANDENTIA, PERISSODACTYLA, HYRACOIDEA, PERAMELEMORPHIA, CINGULATA, PILOSA, MACROSCELIDEA, TUBULIDENTATA, PHOLIDOTA, MONOTREMATA, PAUCITUBERCULATA, SIRENIA, PROBOSCIDEA, DERMOPTERA, NOTORYCTEMORPHIA, MICROBIOTHERIA]

Data from http://www.bucknell.edu/msw3.

Side-by-side box plots

Does there appear to be a relationship between class year and number of clubs students are in?

[Figure: side-by-side box plots of number of clubs (axis from 0 to 8) by class year: First-year, Sophomore, Junior, Senior]

Implementation in R


Scatterplots in R

load("fl_survey.rda")

plot(college_GPA ~ high_sch_GPA, data=fl)

[Figure: scatterplot of college_GPA (y-axis from 2.6 to 4.0) against high_sch_GPA (x-axis from 2.0 to 4.0)]

Scatterplots in R II

plot(fl$high_sch_GPA, fl$college_GPA, pch=16, col=rgb(0,0,1,.2),
     xlab = "High School GPA", ylab="College GPA")

[Figure: scatterplot of College GPA (y-axis from 2.6 to 4.0) against High School GPA (x-axis from 2.0 to 4.0), drawn with semi-transparent filled points]

Bar Charts in R (raw data)

load("elec.rda")

par(mar=c(6,4,4,2))

barplot(elec$us_pct, names.arg = rownames(elec), las=2)

[Figure: bar plot with bars labeled Coal, Nat Gas, Nuclear, Hydrogen, Oth Renew, Petroleum; y-axis from 0 to 40]

Bar Charts in R (summary statistics)

require(mosaic)

load("heights.rda")

barplot(mean(HEIGHT ~ GENDER, data=heights),
        ylab = "Average Height (in inches)")

[Figure: bar plot of average height (in inches) for Male and Female; y-axis from 0 to 70]

Histograms in R

hist(heights$HEIGHT, xlab = "height",

main = "", col="gray")

[Figure: histogram of height; x-axis from 60 to 90, y-axis (Frequency) from 0 to 150]

Histograms by groups in R

library(lattice)

histogram(~HEIGHT | GENDER, data=heights,
          col="gray", nint=10)

[Figure: histograms of HEIGHT in two panels, Male and Female; x-axis from 60 to 90, y-axis (Density) from 0.00 to 0.15]

Histograms for discrete data in R

load("fl_survey.rda")

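histDiscrete <- function(x, ...){
  # Largest observed value; assumes x holds positive integer codes (1, 2, ...)
  l <- max(x, na.rm=TRUE)
  # Breaks at the half-integers so that each integer value gets its own bar
  b <- seq(.5, l+.5, by=1)
  hist(x, breaks=b, ...)
}

histDiscrete(fl$political_ideology, col="gray",
             main="", xlab="Political Ideology\n(1=liberal, 7=conservative)")

[Figure: histogram of Political Ideology (1=liberal, 7=conservative); x-axis from 1 to 7, y-axis (Frequency) from 0 to 20]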

Boxplots in R

boxplot(heights$HEIGHT, ylab = "height", col="gray")

[Figure: box plot of height; axis from 55 to 90, with one suspected outlier]

Boxplots by group in R

boxplot(heights$HEIGHT ~ heights$GENDER,
        ylab = "height", col="gray")

[Figure: side-by-side box plots of height for Male and Female; axis from 55 to 90]

Summary Statistics in R

favstats(~HEIGHT, data=heights)

##  min Q1 median Q3 max     mean       sd   n missing
##   56 64     67 70  91 67.09499 4.104804 379       0

sd(heights$HEIGHT, na.rm=T)

## [1] 4.104804

IQR(heights$HEIGHT, na.rm=T)

## [1] 6

range(heights$HEIGHT, na.rm=T)

## [1] 56 91


Cross-Tabulation: R

load("mids.rda")

tab <- tally( ~ mid | aclp_dem, data=mids)

tab

##         aclp_dem
## mid      Non-Democracy Democracy
##   No War           465       286
##   War              226       181

ptab <- tally(~ mid | aclp_dem, data=mids, format="proportion")

ptab

##         aclp_dem
## mid      Non-Democracy Democracy
##   No War     0.6729378 0.6124197
##   War        0.3270622 0.3875803

Cross-Tabulation: R, continued

library(gmodels)

CrossTable(mids$mid, mids$aclp_dem, chisq=F,

prop.chisq=F, prop.c=F, expected=F)

##

##

## Cell Contents

## |-------------------------|

## | N |

## | N / Row Total |

## | N / Table Total |

## |-------------------------|

##

##

## Total Observations in Table: 1158

##

##

## | mids$aclp_dem

## mids$mid | Non-Democracy | Democracy | Row Total |

## -------------|---------------|---------------|---------------|

## No War | 465 | 286 | 751 |

## | 0.619 | 0.381 | 0.649 |

## | 0.402 | 0.247 | |

## -------------|---------------|---------------|---------------|

## War | 226 | 181 | 407 |

## | 0.555 | 0.445 | 0.351 |

## | 0.195 | 0.156 | |

## -------------|---------------|---------------|---------------|

## Column Total | 691 | 467 | 1158 |

## -------------|---------------|---------------|---------------|

##

##


Side-by-side Barplots: R

barplot(ptab, beside=T)

legend("topright", c("War", "No War"),

fill=c("gray70", "gray30"), inset=.01)

[Figure: side-by-side bar plot of the proportions of War and No War for Non-Democracy and Democracy; y-axis from 0.0 to 0.6]

Stacked Barplots: R

barplot(ptab, beside=F, ylim=c(0,1.18))

legend("top", c("War", "No War"),

fill=c("gray70", "gray30"), inset=.01)

[Figure: stacked bar plot of the proportions of War and No War for Non-Democracy and Democracy; y-axis from 0.0 to 1.0]

Mosaic Plot

mosaicplot(ptab, color=TRUE, main="")

[Figure: mosaic plot of mid (No War, War) by aclp_dem (Non-Democracy, Democracy)]

Side-by-side Boxplots

boxplot(HEIGHT ~ GENDER, data=heights)

[Figure: side-by-side box plots of HEIGHT for Male and Female; axis from 55 to 90]
