chapter 1: introduction to data - quantoid · chapter 1: introduction to data openintro statistics,...
TRANSCRIPT
Chapter 1: Introduction to data
OpenIntro Statistics, 3rd Edition
Slides developed by Mine Cetinkaya-Rundel of OpenIntro.The slides may be copied, edited, and/or shared via the CC BY-SA license.Some images may be included under fair use guidelines (educational purposes).
Introduction
Goals of This Lecture
Understand how to visually convey properties of data.
• Dos and Don’ts for making graphs.
• What sorts of graphs are at our disposal and what are theygood for?
• How to make all of these things (or most of these things) in R.
1
Examining numerical data
Scatterplot
Scatterplots are useful for visualizing the relationship between twonumerical variables.
Do life expectancy and total fertility ap-pear to be associated or independent?
Was the relationship the same through-out the years, or did it change?
http:// www.gapminder.org/ world
2
Dot plots
Useful for visualizing one numerical variable. Darker colorsrepresent areas where there are more observations.
GPA
2.5 3.0 3.5 4.0
How would you describe the distribution of GPAs in this data set?Make sure to say something about the center, shape, and spread ofthe distribution.
3
Dot plots & mean
GPA
2.5 3.0 3.5 4.0
• The mean, also called the average (marked with a triangle inthe above plot), is one way to measure the center of adistribution of data.
• The mean GPA is 3.59.
4
Mean
• The sample mean, denoted as x, can be calculated as
x =x1 + x2 + · · ·+ xn
n,
where x1, x2, · · · , xn represent the n observed values.
• The population mean is also computed the same way but isdenoted as µ. It is often not possible to calculate µ sincepopulation data are rarely available.
• The sample mean is a sample statistic, and serves as a pointestimate of the population mean. This estimate may not beperfect, but if the sample is good (representative of thepopulation), it is usually a pretty good estimate.
5
Stacked dot plot
Higher bars represent areas where there are more observations,makes it a little easier to judge the center and the shape of thedistribution.
GPA
●● ● ●●●●
● ●●●
●●
●
●
●● ● ●●●
●●●
●
●●
●●
●●●
●
●●
●● ●
● ●
●
●
●●●
●●
●
●●
●●
●
●
●
●
●
●●●●
●●
●
●
●
●
●●
●
●●
●●
●●
●
●●
● ●
●
●
● ● ●●
●
●●
●●
●
●
●
●
●
●
●
●●
●
●●
●●
●
●●●
●●
●
●
●
●
●
●●
●
●●
●
●●
● ●
●●
●●
●
●
●●
●
●
●
● ●●●
●
●
●●
●
●
●
●●
● ●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●● ●
●
●●
●●
●
●
●●
●●
●●
● ●●
●
●
●●●
●●
●
●
2.6 2.8 3.0 3.2 3.4 3.6 3.8 4.0
6
Histograms - Extracurricular hours
• Histograms provide a view of the data density. Higher barsrepresent where the data are relatively more common.
• Histograms are especially convenient for describing the shapeof the data distribution.
• The chosen bin width can alter the story the histogram istelling.
Hours / week spent on extracurricular activities
0 10 20 30 40 50 60 70
0
50
100
150
7
Bin width
Which one(s) of these histograms are useful? Which reveal toomuch about the data? Which hide too much?
Hours / week spent on extracurricular activities
0 20 40 60 80 100
0
50
100
150
200
Hours / week spent on extracurricular activities
0 10 20 30 40 50 60 70
0
50
100
150
Hours / week spent on extracurricular activities
0 10 20 30 40 50 60 70
0
20
40
60
80
Hours / week spent on extracurricular activities
0 10 20 30 40 50 60 70
0
10
20
30
40
8
Shape of a distribution: modality
Does the histogram have a single prominent peak (unimodal),several prominent peaks (bimodal/multimodal), or no apparentpeaks (uniform)?
0 5 10 150
510
150 5 10 15 20
05
1015
0 5 10 15 20
05
1015
20
0 5 10 15 20
02
46
810
14
Note: In order to determine modality, step back and imagine a smooth curve over
the histogram – imagine that the bars are wooden blocks and you drop a limp
spaghetti over them, the shape the spaghetti would take could be viewed as a
smooth curve.9
Shape of a distribution: skewness
Is the histogram right skewed, left skewed, or symmetric?
0 2 4 6 8 10
05
1015
0 5 10 15 20 25
020
4060
0 20 40 60 80
05
1015
2025
30Note: Histograms are said to be skewed to the side of the long tail.
10
Shape of a distribution: unusual observations
Are there any unusual observations or potential outliers?
0 5 10 15 20
05
1015
2025
30
20 40 60 80 100
010
2030
40
11
Extracurricular activities
How would you describe the shape of the distribution of hours perweek students spend on extracurricular activities?
Hours / week spent on extracurricular activities
0 10 20 30 40 50 60 70
0
50
100
150
12
Commonly observed shapes of distributions
• modality
unimodal bimodal multimodaluniform
• skewness
right skew left skewsymmetric
13
Practice
Which of these variables do you expect to be uniformly distributed?
(a) weights of adult females
(b) salaries of a random sample of people from North Carolina
(c) house prices
(d) birthdays of classmates (day of the month)
14
Application activity: Shapes of distributions
Sketch the expected distributions of the following variables:
• number of piercings
• scores on an exam
• IQ scores
Come up with a concise way (1-2 sentences) to teach someone howto determine the expected distribution of any variable.
15
Are you typical?
http:// www.youtube.com/ watch?v=4B2xOvKFFz4
How useful are centers alone for conveying the true characteristicsof a distribution?
16
Variance
Variance is roughly the average squared deviation from the mean.
s2 =
∑ni=1(xi − x)2
n − 1
• The sample mean isx = 6.71, and the samplesize is n = 217.
• The variance of amount ofsleep students get per nightcan be calculated as: Hours of sleep / night
2 4 6 8 10 12
0
20
40
60
80
s2 =(5 − 6.71)2 + (9 − 6.71)2 + · · ·+ (7 − 6.71)2
217 − 1= 4.11 hours2
17
Variance (cont.)
Why do we use the squared deviation in the calculation of variance?
18
Standard deviation
The standard deviation is the square root of the variance, and hasthe same units as the data.s
s =√
s2
• The standard deviation ofamount of sleep studentsget per night can becalculated as:
s =√
4.11 = 2.03 hours
• We can see that all of thedata are within 3 standarddeviations of the mean.
Hours of sleep / night
2 4 6 8 10 12
0
20
40
60
80
19
Median
• The median is the value that splits the data in half whenordered in ascending order.
0, 1, 2, 3, 4
• If there are an even number of observations, then the medianis the average of the two values in the middle.
0, 1, 2, 3, 4, 5→2 + 3
2= 2.5
• Since the median is the midpoint of the data, 50% of thevalues are below it. Hence, it is also the 50th percentile.
20
Q1, Q3, and IQR
• The 25th percentile is also called the first quartile, Q1.
• The 50th percentile is also called the median.
• The 75th percentile is also called the third quartile, Q3.
• Between Q1 and Q3 is the middle 50% of the data. The rangethese data span is called the interquartile range, or the IQR.
IQR = Q3 − Q1
21
Box plot
The box in a box plot represents the middle 50% of the data, andthe thick line in the box is the median.
# of
stu
dy h
ours
/ w
eek
0
10
20
30
40
50
60
70
22
Anatomy of a box plot
# of
stu
dy h
ours
/ w
eek
0
10
20
30
40
50
60
70
lower whisker
Q1 (first quartile)
median
Q3 (third quartile)
max whisker reach& upper whisker
suspected outliers
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
23
Whiskers and outliers
• Whiskersof a box plot can extend up to 1.5×IQR away from the quartiles.
max upper whisker reach = Q3 + 1.5 × IQR
max lower whisker reach = Q1 − 1.5 × IQR
IQR : 20 − 10 = 10
max upper whisker reach = 20 + 1.5 × 10 = 35
max lower whisker reach = 10 − 1.5 × 10 = −5
• A potential outlier is defined as an observation beyond themaximum reach of the whiskers. It is an observation thatappears extreme relative to the rest of the data.
24
Outliers (cont.)
Why is it important to look for outliers?
25
Extreme observations
How would sample statistics such as mean, median, SD, and IQRof household income be affected if the largest value was replacedwith $10 million? What if the smallest value was replaced with $10million?
Annual Household Income
●● ●●●
●●
●●
●●●
●
●●
●●
●●
●
●●
●
●
●
● ●
●●●
● ●●
●●
●
●●
●
●●
●
●
●
●●
●
●●
●
●● ●
●
●●
● ●
●
● ●●
●●●
●
●
●●
●
●
●
●
●
●
●●●
●
●●●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
0e+00 2e+05 4e+05 6e+05 8e+05 1e+06
26
Robust statistics
Annual Household Income
●● ●●●
●●
●●
●●●
●
●●
●●
●●
●
●●
●
●
●
● ●
●●●
● ●●
●●
●
●●
●
●●
●
●
●
●●
●
●●
●
●● ●
●
●●
● ●
●
● ●●
●●●
●
●
●●
●
●
●
●
●
●
●●●
●
●●●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
0e+00 2e+05 4e+05 6e+05 8e+05 1e+06
robust not robustscenario median IQR x soriginal data 190K 200K 245K 226Kmove largest to $10 million 190K 200K 309K 853Kmove smallest to $10 million 200K 200K 316K 854K
27
Robust statistics
Median and IQR are more robust to skewness and outliers thanmean and SD. Therefore,
• for skewed distributions it is often more helpful to use medianand IQR to describe the center and spread
• for symmetric distributions it is often more helpful to use themean and SD to describe the center and spread
If you would like to estimate the typical household income for a stu-dent, would you be more interested in the mean or median income?
28
Mean vs. median
• If the distribution is symmetric, center is often defined as themean: mean ≈ median
Symmetric
meanmedian
• If the distribution is skewed or has extreme outliers, center isoften defined as the median• Right-skewed: mean > median• Left-skewed: mean < median
Right−skewed
meanmedian
Left−skewed
meanmedian
29
Practice
Which is most likely true for the distribution of percentage of time actually
spent taking notes in class versus on Facebook, Twitter, etc.?
% of time in class spent taking notes
0 20 40 60 80 100
0
10
20
30
40
50
(a) mean> median
(b) mean < median
(c) mean ≈ median
(d) impossible to tell
30
Extremely skewed data
When data are extremely skewed, transforming them might makemodeling easier. A common transformation is the logtransformation.
The histograms on the left shows the distribution of number ofbasketball games attended by students. The histogram on the rightshows the distribution of log of number of games attended.
# of basketball games attended
0 10 20 30 40 50 60 70
0
50
100
150
# of basketball games attended
0 1 2 3 4
0
10
20
30
40
31
Pros and cons of transformations
• Skewed data are easier to model with when they aretransformed because outliers tend to become far lessprominent after an appropriate transformation.
# of games 70 50 25 · · ·
log(# of games) 4.25 3.91 3.22 · · ·
• However, results of an analysis might be difficult to interpretbecause the log of a measured variable is usuallymeaningless.
What other variables would you expect to be extremely skewed?
32
Intensity maps
What patterns are apparent in the change in population between2000 and 2010?
http:// projects.nytimes.com/ census/ 2010/ map
33
Considering categorical data
Contingency tables
A table that summarizes data for two categorical variables is calleda contingency table.
The contingency table below shows the distribution of students’genders and whether or not they are looking for a spouse while incollege.
looking for spouseNo Yes Total
genderFemale 86 51 137Male 52 18 70Total 138 69 207
34
Bar plots
A bar plot is a common way to display a single categorical variable.A bar plot where proportions instead of frequencies are shown iscalled a relative frequency bar plot.
Female Male0
20
40
60
80
100
120
Female Male0.0
0.1
0.2
0.3
0.4
0.5
0.6
How are bar plots different than histograms?
35
Choosing the appropriate proportion
Does there appear to be a relationship between gender andwhether the student is looking for a spouse in college?
looking for spouseNo Yes Total
genderFemale 86 51 137Male 52 18 70Total 138 69 207
To answer this question we examine the row proportions:
• % Females looking for a spouse: 51/137 ≈ 0.37
• % Males looking for a spouse: 18/70 ≈ 0.26
36
Segmented bar and mosaic plots
What are the differences between the three visualizations shownbelow?
Female Male
YesNo
0
20
40
60
80
100
120
Female Male0.0
0.2
0.4
0.6
0.8
1.0 Female Male
No
Yes
37
Pie charts
Can you tell which order encompasses the lowest percentage ofmammal species?
RODENTIACHIROPTERACARNIVORAARTIODACTYLAPRIMATESSORICOMORPHALAGOMORPHADIPROTODONTIADIDELPHIMORPHIACETACEADASYUROMORPHIAAFROSORICIDAERINACEOMORPHASCANDENTIAPERISSODACTYLAHYRACOIDEAPERAMELEMORPHIACINGULATAPILOSAMACROSCELIDEATUBULIDENTATAPHOLIDOTAMONOTREMATAPAUCITUBERCULATASIRENIAPROBOSCIDEADERMOPTERANOTORYCTEMORPHIAMICROBIOTHERIA
Data from http:// www.bucknell.edu/ msw3.
38
Side-by-side box plots
Does there appear to be a relationship between class year andnumber of clubs students are in?
First−year Sophomore Junior Senior
0
2
4
6
8
●
●●
●
●
●●
●
●
●
●
●
●
●
●
39
Implementation in R
Scatterplots in R
load("fl_survey.rda")
plot(college_GPA ˜ high_sch_GPA, data=fl)
●●
●
●
● ●
●
●●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●●
●
●
●
●
●
●
●
●● ●
2.0 2.5 3.0 3.5 4.0
2.6
2.8
3.0
3.2
3.4
3.6
3.8
4.0
high_sch_GPA
colle
ge_G
PA
40
Scatterplots in R II
plot(fl$high_sch_GPA, fl$college_GPA, pch=16, col=rgb(0,0,1,.2),
xlab = "High School GPA", ylab="College GPA")
2.0 2.5 3.0 3.5 4.0
2.6
2.8
3.0
3.2
3.4
3.6
3.8
4.0
High School GPA
Col
lege
GPA
41
Bar Charts in R (raw data)
load("elec.rda")
par(mar=c(6,4,4,2))
barplot(elec$us_pct, names.arg = rownames(elec), las=2)
Coa
l
Nat
Gas
Nuc
lear
Hyd
roge
n
Oth
Ren
ew
Pet
role
um
0
10
20
30
40
42
Bar Charts in R (summary statistics)
require(mosaic)
load("heights.rda")
barplot(mean(HEIGHT ˜ GENDER, data=heights),
ylab = "Average Height (in inches)")
Male Female
Ave
rage
Hei
ght (
in in
ches
)
010
2030
4050
6070
43
Histograms in R
hist(heights$HEIGHT, xlab = "height",
main = "", col="gray")
height
Fre
quen
cy
60 70 80 90
050
100
150
44
Histograms by groups in R
library(lattice)
histogram(˜HEIGHT | GENDER, data=heights,
col="gray", nint=10)
HEIGHT
Den
sity
0.00
0.05
0.10
0.15
60 70 80 90
Male
60 70 80 90
Female
45
Histograms for discrete data in R
load("fl_survey.rda")
histDiscrete <- function(x, ...){
l <- max(x, na.rm=T)
b <- seq(.5, l+.5, by=1)
hist(x, breaks=b, ...)
}
histDiscrete(fl$political_ideology, col="gray",
main="", xlab="Political Ideology\n(1=liberal, 7=conservative)")
Political Ideology(1=liberal, 7=conservative)
Fre
quen
cy
1 2 3 4 5 6 7
05
1015
20
46
Boxplots in R
boxplot(heights$HEIGHT, ylab = "height", col="gray")
●
5560
6570
7580
8590
heig
ht
47
Boxplots by group in R
boxplot(heights$HEIGHT ˜ heights$GENDER,
ylab = "height", col="gray")
●
●
●
●
●●
●
●●
●
●●●
●
Male Female
5560
6570
7580
8590
heig
ht
48
Summary Statistics in R
favstats(˜HEIGHT, data=heights)
## min Q1 median Q3 max mean sd n missing
## 56 64 67 70 91 67.09499 4.104804 379 0
sd(heights$HEIGHT, na.rm=T)
## [1] 4.104804
IQR(heights$HEIGHT, na.rm=T)
## [1] 6
range(heights$HEIGHT, na.rm=T)
## [1] 56 91
49
Cross-Tabulation: R
load("mids.rda")
tab <- tally( ˜ mid | aclp_dem, data=mids)
tab
## aclp_dem
## mid Non-Democracy Democracy
## No War 465 286
## War 226 181
ptab <- tally(˜ mid | aclp_dem, data=mids, format="proportion")
ptab
## aclp_dem
## mid Non-Democracy Democracy
## No War 0.6729378 0.6124197
## War 0.3270622 0.3875803
50
Cross-Tabulation: R, continued
library(gmodels)
CrossTable(mids$mid, mids$aclp_dem, chisq=F,
prop.chisq=F, prop.c=F, expected=F)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 1158
##
##
## | mids$aclp_dem
## mids$mid | Non-Democracy | Democracy | Row Total |
## -------------|---------------|---------------|---------------|
## No War | 465 | 286 | 751 |
## | 0.619 | 0.381 | 0.649 |
## | 0.402 | 0.247 | |
## -------------|---------------|---------------|---------------|
## War | 226 | 181 | 407 |
## | 0.555 | 0.445 | 0.351 |
## | 0.195 | 0.156 | |
## -------------|---------------|---------------|---------------|
## Column Total | 691 | 467 | 1158 |
## -------------|---------------|---------------|---------------|
##
##
51
Side-by-side Barplots: R
barplot(ptab, beside=T)
legend("topright", c("War", "No War"),
fill=c("gray70", "gray30"), inset=.01)
Non−Democracy Democracy
0.0
0.1
0.2
0.3
0.4
0.5
0.6
WarNo War
52
Side-by-side Barplots: R
barplot(ptab, beside=F, ylim=c(0,1.18))
legend("top", c("War", "No War"),
fill=c("gray70", "gray30"), inset=.01)
Non−Democracy Democracy
0.0
0.2
0.4
0.6
0.8
1.0
WarNo War
53
Mosaic Plot
mosaicplot(ptab, color=TRUE, main="")
mid
aclp
_dem
No War War
Non
−D
emoc
racy
Dem
ocra
cy
54
Side-by-side Boxplots
boxplot(HEIGHT˜GENDER, data=heights)
●
●
●
●
●●
●
●●
●
●●●
●
Male Female
5560
6570
7580
8590
55