Unit 1: Introduction to Data
Lecture 2: Exploratory Data Analysis
Statistics 101
Nicole Dalzell
July 1, 2014
1 Announcements: Warm-Up and Data Basics; Exploring Data
2 Numerical Data: Relationship between two numerical variables
3 Distribution of one numerical variable: Describing distributions of numerical variables; Distribution Shapes
4 Descriptive Statistics: Center; Spread; Robust Statistics
5 Examples

Announcements
From now on, sit with your teams in class.
If there is someone from your team that you haven’t met yet, let me know.
If you weren’t able to log on to RStudio, and your name isn’t highlighted on the Google Doc, stop by after class.
Statistics 101 (Nicole Dalzell) U1 - L2: EDA July 1, 2014 2 / 63
Announcements Warm-Up and Data Basics
Review
Example Study: A researcher divides 250 cats (adults and kittens) into two rooms, with adult cats in one room and baby kittens in the other room. Within each room she erects a fence, randomly placing half the cats (or kittens) on each side of the fence. On one side of the fence she scatters a variety of cat toys. For 1 day, the researcher records the number of hours each cat spends sleeping.
What is the research question?
What are the explanatory and response variables?
Is this an Experimental or Observational study?
What are the controls and treatments?
What kind of structure was given to the study (e.g., blocking or clustering) and why?
Types of Variables Example
Still our cat example:
Cat   Age        Toys   # of Naps   Weight (lbs)
1     adult      1      3           8
2     juvenile   1      5           9
3     adult      0      2           10.5
4     adult      1      8           12.25
...   ...        ...    ...         ...
250   adult      0      5           11.67
What types of variables are these:
Age?
Toys?
# of Naps?
Weight?
Announcements Exploring Data
Population to sample
It is usually not feasible to collect information on the entire population due to high costs of data collection, so statisticians instead work with samples that are (hopefully) representative of the populations they come from.
[Figure: a sample drawn from a population.]
We try to understand certain features of the population as a whole using summary statistics and graphs based on these samples.
Exploratory analysis to inference
When you taste a spoonful of soup and decide it doesn’t taste salty enough, that’s exploratory analysis.
If you generalize and conclude that your soup needs salt, that’s an inference.
For your inference to be valid, the spoonful you tasted (the sample) needs to be representative of the entire pot (the population).
If your spoonful comes only from the surface and the salt is collected at the bottom of the pot, what you tasted is probably not representative of the whole pot.
If you first stir the soup thoroughly before you taste, your spoonful will more likely be representative of the whole pot.
Random assignment vs. random sampling
Random sampling, random assignment: causal conclusion, generalized to the whole population (ideal experiment).
Random sampling, no random assignment: no causal conclusion, correlation statement generalized to the whole population (most observational studies).
No random sampling, random assignment: causal conclusion, only for the sample (most experiments).
No random sampling, no random assignment: no causal conclusion, correlation statement only for the sample (bad observational studies).
Random sampling gives generalizability; no random sampling gives no generalizability.
Random assignment supports causation; no random assignment supports only correlation.
EDA Intro
EDA is important, because your analyses will depend on the trends and features of your data.
The distribution of a variable is a list of possible values the variable can take and how often it takes each of those values.
Distributions are critical to assessing the probability of events.
We often utilize descriptive statistics related to the center and spread of the data.
Plots are almost always useful for visualizing relationships and distributions in the data.
Example: Do {5,5,5,5,5,5,5,5,5} and {1,2,3,4,5,6,7,8,9} have the same distribution? Why or why not?
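The example question can be explored numerically. The following is a minimal Python sketch (the course itself works in R; the variable names are only illustrative) that tabulates how often each value occurs in the two sets and compares their centers and ranges.

```python
from collections import Counter
from statistics import mean

set_a = [5] * 9                # {5,5,5,5,5,5,5,5,5}
set_b = list(range(1, 10))     # {1,2,3,4,5,6,7,8,9}

# A distribution records each value together with how often it occurs.
dist_a = Counter(set_a)
dist_b = Counter(set_b)

print(dist_a)
print(dist_b)

# Both sets share the same mean, but their spread differs.
print(mean(set_a), mean(set_b))
print(max(set_a) - min(set_a), max(set_b) - min(set_b))
```

The two sets have the same center but different frequency tables, so they do not have the same distribution.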
Numerical Data Relationship between two numerical variables
Scatterplot
Scatterplots are useful for visualizing the relationship between two numerical variables.
Do life expectancy and total fertility appear to be associated or independent?
Was the relationship the same throughout the years, or did it change?
http://www.gapminder.org/world
Cars: ... vs. weight
From the cars data:
[Figure: two scatterplots from the cars data: miles per gallon (city rating) vs. weight (pounds), and price ($1000s) vs. weight (pounds).]
What do these scatterplots reveal about the data? How might they be useful?
Distribution of one numerical variable
Visualizing numerical variables
Histogram: Provides a view of the data density, and is especially convenient for describing the shape of the data distribution.
Box plot: Especially useful for displaying the median, quartiles, and unusual observations, as well as the IQR.
Intensity map: Useful for displaying the spatial distribution.
Dot plot: Useful when individual values are of interest.
Why visualize?
What does a response of 0 mean in this distribution?
[Figure: dot plot of the number of drinks it takes students to get drunk; axis from 0 to 12.]
Most likely that a student doesn’t drink. It would be preferable to recode these as NAs.
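Recoding zeros as missing values changes the summaries that follow. Here is a minimal Python sketch under stated assumptions: the responses are made up for illustration (they are not the class survey data), and `None` stands in for R's `NA`.

```python
from statistics import mean

# Hypothetical survey responses (not the course data): number of drinks.
drinks = [0, 3, 4, 0, 5, 6, 4, 8]

# Treat a response of 0 as "does not drink" and recode it as missing.
recoded = [None if d == 0 else d for d in drinks]

# Summaries then skip the missing values.
observed = [d for d in recoded if d is not None]

print(mean(drinks))    # zeros counted as real responses
print(mean(observed))  # only students who reported a positive number
```

Leaving the zeros in pulls the mean downward, even though a zero here does not measure the same thing as the other responses.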
Why visualize?
Describe the spatial distribution of race/ethnicity in the US.
http://demographics.coopercenter.org/DotMap/index.html
Why visualize?
And let’s take a closer look at Durham.
Why visualize?
[Figure: dot plot of weight, in ounces; axis from 0 to 4000.]
Do you see anything out of the ordinary?
Some people reported their weight in pounds.
Why visualize?
What type of variable is average number of hours of sleep per night? Is this reflected in the dot plot below? If not, what might be the reason?
[Figure: dot plot of average number of hours of sleep per night; axis from 4 to 9.]
Average number of hours of sleep per night is a continuous numerical variable. But responses are rounded, so there are only whole numbers and half hours in the data.
Stacked Dot Plot
Higher bars represent areas where there are more observations, which makes it a little easier to judge the center and the shape of the distribution.
[Figure: stacked dot plot of gpa; axis from 3.0 to 4.0.]
Histogram Construction
Order the data in ascending order:
2.900 2.910 3.000 3.100 3.100 3.150 3.150 3.150 3.200 3.200 3.250
3.294 3.300 3.300 3.330 3.350 3.350 3.400 3.400 3.400 3.400 3.400
3.400 3.400 3.410 3.450 3.460 3.500 3.500 3.500 3.500 3.550 3.560
3.600 3.600 3.600 3.600 3.610 3.630 3.650 3.680 3.700 3.700 3.700
3.700 3.700 3.700 3.750 3.750 3.750 3.750 3.785 3.790 3.800 3.800
3.800 3.800 3.800 3.840 3.840 3.840 3.860 3.868 3.900 3.900 3.900
3.900 3.925 3.925 3.970 3.970 4.000 4.000 4.000 4.000 4.300 4.300
Make a frequency table that records how many observations fall in each bin. Let’s use a bin width of 0.1:
GPA     2.9 to 3   3 to 3.1   3.1 to 3.2   3.2 to 3.3   · · ·   3.8 to 3.9   3.9 to 4
Count   3          2          5            4            · · ·   9            8
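The counting step can be sketched in code. This is a minimal Python illustration, not the course's R workflow. It assumes each bin includes its right endpoint (so 3.000 is counted in "2.9 to 3"), which is the convention that reproduces the counts in the table, and it works in thousandths of a GPA point to avoid floating-point edge cases. The helper name `bin_index` is hypothetical.

```python
from collections import Counter

gpa = [
    2.900, 2.910, 3.000, 3.100, 3.100, 3.150, 3.150, 3.150, 3.200, 3.200, 3.250,
    3.294, 3.300, 3.300, 3.330, 3.350, 3.350, 3.400, 3.400, 3.400, 3.400, 3.400,
    3.400, 3.400, 3.410, 3.450, 3.460, 3.500, 3.500, 3.500, 3.500, 3.550, 3.560,
    3.600, 3.600, 3.600, 3.600, 3.610, 3.630, 3.650, 3.680, 3.700, 3.700, 3.700,
    3.700, 3.700, 3.700, 3.750, 3.750, 3.750, 3.750, 3.785, 3.790, 3.800, 3.800,
    3.800, 3.800, 3.800, 3.840, 3.840, 3.840, 3.860, 3.868, 3.900, 3.900, 3.900,
    3.900, 3.925, 3.925, 3.970, 3.970, 4.000, 4.000, 4.000, 4.000, 4.300, 4.300,
]

LOWEST = 2900      # left edge of the first bin, in thousandths
BIN_WIDTH = 100    # a bin width of 0.1, in thousandths


def bin_index(value):
    """Return k such that value lies in (2.9 + 0.1(k-1), 2.9 + 0.1k]."""
    v = round(value * 1000)
    k = -(-(v - LOWEST) // BIN_WIDTH)   # ceiling division on integers
    return max(k, 1)                    # the lowest value joins the first bin


counts = Counter(bin_index(x) for x in gpa)

for k in sorted(counts):
    lo = (LOWEST + BIN_WIDTH * (k - 1)) / 1000
    hi = (LOWEST + BIN_WIDTH * k) / 1000
    print(f"{lo:.1f} to {hi:.1f}: {counts[k]}")
```

A histogram is then a bar chart of these counts, one bar per bin.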
Histograms
Higher bars represent areas where there are more observations. Histograms are preferable when the sample size is large, but they hide finer details like individual observations.
[Figure: histogram of gpa; x-axis from 3.0 to 4.0, frequency on the y-axis from 0 to 10.]
Bin Width
Which one(s) of these histograms are useful? Which reveal too much about the data? Which hide too much?
[Figure: four histograms of extracurricular hrs / week (frequency on the y-axis), each drawn with a different bin width.]
Density Curves
A Density Curve is a smoothed density histogram where the area under the curve is 1.
To draw a density curve from a histogram, simply connect the peaks of the histogram with a smooth line, and normalize the values of the y-axis such that the area under the curve is 1.
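The normalization step can be made concrete. In this minimal Python sketch, the bin counts and bin width are illustrative assumptions, and the smoothing itself is not shown. Each bar height is rescaled to count / (n × bin width), so that the bar areas sum to 1.

```python
import math

# Illustrative bin counts for equal-width bins (assumed values).
counts = [3, 2, 5, 4]
bin_width = 0.1

n = sum(counts)

# Density height of each bar: count / (n * bin width).
heights = [c / (n * bin_width) for c in counts]

# Total area = sum of (height * width) over all bars.
area = sum(h * bin_width for h in heights)

print(heights)
print(math.isclose(area, 1.0))
```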
Distribution of one numerical variable Describing distributions of numerical variables
Describing distributions of numerical variables
When describing distributions of numerical variables, always mention:
Shape: skewness, modality
Center: an estimate of a typical observation in the distribution (mean, median, mode, etc.)
Spread: measure of variability in the distribution (SD, IQR, range, etc.)
Unusual observations: observations that stand out from the rest of the data that may be suspected outliers
[Figure: three example distributions, each on an axis from −3 to 3.]
Distribution of one numerical variable Distribution Shapes
Describing Distributions
When describing distributions, make sure to talk about the shape, center, spread, and any unusual observations.
[Figure: three example distributions, each on an axis from −3 to 3.]
Shape
How would you describe the shape of this distribution?
[Figure: histogram of average number of hours spent on school work per day; x-axis from 2 to 10, counts from 0 to 30.]
Unimodal and right skewed.
Describing Your Pictures
Bell Shaped: Data is bell shaped if the majority of the data is clustered around the center value (mean), with very few data points lying far above or far below this value.
Right Skewed: Data is positively skewed if several unusually large data points create a long tail to the right.
Left Skewed: Data is negatively skewed if several unusually small data points create a long tail to the left.
Bimodal: Data is bimodal if it has two large clusters of data points.
Symmetric: Data is symmetric if the distribution looks like a mirror image around its center.
Uniformly Distributed: Data is evenly spread across all possible values.
Modality
The mode is defined as the most frequent observation in the data set.
Does the histogram have a single prominent peak (unimodal), several prominent peaks (bimodal/multimodal), or no apparent peaks (uniform)?
[Figure: four example histograms with different numbers of peaks.]
In order to determine modality, it’s easiest to step back and imagine a density curve over the histogram. Use the limp spaghetti method.
Commonly observed shapes of distributions
modality: unimodal, bimodal, multimodal, uniform
skewness: right skew, left skew, symmetric
Participation question
Which of these variables do you expect to be uniformly distributed?
(a) weights of adult females
(b) salaries of a random sample of people from North Carolina
(c) house prices
(d) birthdays of classmates (day of the month)
Skewness
Is the histogram right skewed, left skewed, or symmetric?
[Figure: three example histograms to classify as right skewed, left skewed, or symmetric.]
Histograms are said to be skewed to the side of the long tail.
Unusual Observations
Are there any unusual observations or potential outliers?
[Figure: two example histograms to inspect for unusual observations.]
Application exercise: Shapes of distributions
Below are two histograms. One corresponds to the age at which a sample of people applied for marriage licenses; the other corresponds to the last digit of a sample of social security numbers. Which graph is which, and why?
Match the following variables with the histograms and bar graphs given below. These data represent Sta 101 students at Duke.
(a) the height of students
(b) gender breakdown of students
(c) the time it took students to get to their first class of the day
(d) the number of hours of sleep students received last night
(e) whether or not students live off campus
(f) the number of piercings students have
Descriptive Statistics Center
Measures of Center
The Mean of a dataset is what we commonly refer to as the average.
The Median of a dataset is the middle value of your data. You find the median by ordering the data from smallest to largest, then finding the value with 50% of the data below it and 50% above it.
The Trimmed Mean is the mean calculated after removing a few of the very large and very small observations.
What is the advantage of using the Median instead of the Mean?
What is the advantage of using the Mean instead of the Median?
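The three measures of center can be compared on a small data set. This is a minimal Python sketch: the data values are made up for illustration, and `trimmed_mean` is a hypothetical helper that drops a fixed number of observations from each end, which is one common way to trim.

```python
from statistics import mean, median


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest observations."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must leave at least one observation")
    ordered = sorted(values)
    return mean(ordered[k:len(ordered) - k])


# Hypothetical data with one very large observation.
data = [2, 3, 3, 4, 5, 6, 100]

print(mean(data))             # pulled upward by the value 100
print(median(data))           # middle value of the ordered data
print(trimmed_mean(data, 1))  # mean of 3, 3, 4, 5, 6
```

The single large value moves the mean far from most of the data, while the median and the trimmed mean stay near the bulk of the observations.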
Mean
The sample mean, denoted as $\bar{x}$, can be calculated as
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\text{Sum of Data Points}}{\text{Number of Data Points}},$$
where $x_1, x_2, \cdots, x_n$ represent the n observed values.
The population mean is a parameter computed the same way but is denoted as µ. It is often not possible to calculate µ since population data is rarely available.
x̄ is an estimate of µ based on the observed data.
The sample mean is a sample statistic, or a point estimate of the population mean. This estimate may not be perfect, but if the sample is good (representative of the population) it is usually a good guess.
Median
The median is the value that splits the data in half when ordered in ascending order.
0, 1, 2, 3, 4 → median = 2
If there are an even number of observations, then the median is the average of the two values in the middle.
0, 1, 2, 3, 4, 5 → median = $\frac{2 + 3}{2} = 2.5$
Since the median is the midpoint of the data, 50% of the values are below it. Hence, it is also the 50th percentile.
Mean vs. Median
If the distribution is symmetric, the center is usually described by the mean.
Symmetric: mean ≈ median
If the distribution is skewed or has outliers, the center is usually described by the median.
Right-skewed: mean > median
Left-skewed: mean < median
[Figure: three density curves (right-skewed, left-skewed, symmetric) with the mean and median marked.]
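The ordering of the mean and the median under skew can be checked on toy data. In this minimal Python sketch, the three data sets are hypothetical and were chosen only to have the named shapes.

```python
from statistics import mean, median

# Hypothetical small data sets chosen to have the stated shapes.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]        # long tail to the right
left_skewed = [1, 17, 18, 18, 18, 19, 19, 20]   # long tail to the left
symmetric = [1, 2, 3, 4, 5]

for name, data in [
    ("right-skewed", right_skewed),
    ("left-skewed", left_skewed),
    ("symmetric", symmetric),
]:
    print(name, "mean:", mean(data), "median:", median(data))
```

In each skewed set the mean is pulled toward the long tail, while the median stays with the bulk of the data.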
Are you typical?
http://www.youtube.com/watch?v=4B2xOvKFFz4
How useful are centers alone for conveying the true characteristics of a distribution?
Descriptive Statistics Spread
Measures of Spread
The population Variance, σ², measures how far observations tend to fall from the mean, as an average squared deviation.
The population Standard Deviation, σ, is the square root of the variance.
The Interquartile Range (IQR) measures the spread of the middle 50% of your data, and is visually depicted in boxplots.
Deviation
The distance of an observation from the mean is its deviation: xi − x̄.
sort(d$sleep)
 [1] 1 1 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[30] 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
[59] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 7 7 7 7 7 7 7 7 8 9 9 9
mean(d$sleep)
[1] 4.6
$x_1 - \bar{x} = 1 - 4.6 = -3.6$
$x_2 - \bar{x} = 1 - 4.6 = -3.6$
$x_3 - \bar{x} = 2 - 4.6 = -2.6$
...
$x_{86} - \bar{x} = 9 - 4.6 = 4.4$
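The deviations above can be reproduced in code. This minimal Python sketch rebuilds the 86 sleep values from the sorted R output on this slide (the course data frame `d` itself is not available here) and computes each observation's deviation from the mean.

```python
from statistics import mean

# The 86 sleep values, rebuilt from the sorted output shown on the slide.
sleep = ([1] * 2 + [2] * 3 + [3] * 24 + [4] * 2 + [5] * 41
         + [6] * 2 + [7] * 8 + [8] * 1 + [9] * 3)

x_bar = mean(sleep)

# Deviation of each observation from the mean: x_i - x_bar.
deviations = [x - x_bar for x in sleep]

print(len(sleep), round(x_bar, 1))
print([round(d, 1) for d in deviations[:3]], round(deviations[-1], 1))

# Deviations from the mean cancel out, so their sum is (numerically) zero.
print(abs(sum(deviations)) < 1e-9)
```

Because the positive and negative deviations cancel, the plain average of the deviations is not a useful measure of spread.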
Variance
Variance, s²
Roughly the average squared deviation from the mean
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
Given that the average number of hours students sleep per night is 7.029, the variance of the amount of sleep students get per night can be calculated as:
$$s^2 = \frac{(7.5 - 7.029)^2 + (7 - 7.029)^2 + \cdots + (8 - 7.029)^2}{106 - 1} = 0.72$$
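The formula can be applied step by step. The 106 sleep values behind the 0.72 result are not listed in these slides, so this minimal Python sketch uses a different data set as an assumption: the 86 sorted values printed on the Deviation slide. It computes the sample variance directly from the formula, compares it with `statistics.variance`, and takes the square root to get the standard deviation.

```python
import math
from statistics import mean, variance

# The 86 values from the sorted output on the Deviation slide
# (not the 106-student data used for the 0.72 result).
sleep = ([1] * 2 + [2] * 3 + [3] * 24 + [4] * 2 + [5] * 41
         + [6] * 2 + [7] * 8 + [8] * 1 + [9] * 3)

n = len(sleep)
x_bar = mean(sleep)

# Sample variance: sum of squared deviations divided by n - 1.
s2 = sum((x - x_bar) ** 2 for x in sleep) / (n - 1)

# Standard deviation: square root of the variance, in the data's units.
s = math.sqrt(s2)

print(round(s2, 2), round(s, 2))
print(math.isclose(s2, variance(sleep)))
```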
Variance (cont.)
Why do we use the squared deviation in the calculation of variance?
To get rid of negatives, so that observations equally distant from the mean are weighted equally.
To weight larger deviations more heavily.
Application exercise: Variability
Order histograms A, B, and C from least to most variable. Explain your reasoning.
Between histograms D and E, which exhibits more variability? Explain your reasoning.
Variability vs. diversity
Which of the following sets of cars has a more diverse composition of colors?
Set 1: more diverse
Set 2: less diverse
Variability vs. diversity (cont.)
Which of the following sets of cars has more variable mileage?
Set 1: less variable
[Figure: dot plot of mileage for Set 1; axis from 10 to 60.]
Set 2: more variable
[Figure: dot plot of mileage for Set 2; axis from 10 to 60.]
Standard deviation
Standard deviation, s
Roughly the typical deviation from the mean. It is calculated as the square root of the variance, and has the same units as the data.
$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
The standard deviation of the amount of sleep students get per night can be calculated as:
$$s = \sqrt{0.72} = 0.85 \text{ hours}$$
Students sleep 7.029 hours on average, give or take 0.85 hours.
Standard Deviation
The standard deviation gives a rough estimate of the typical distance of a data value from the mean.
The larger the standard deviation, the more variability there is in the data and the more spread out the data are.
[Figure: two histograms on a common axis from −15 to 15: "Standard Deviation of 2", rnorm(100, 0, 2), and "Standard Deviation of 4", rnorm(100, 0, 4).]
Variability in Student Sleep
[Figure: dot plot of sleep, x̄ = 4.6, s = 1.66.]
69 out of 86 students (80%) are within 1 SD of the mean.
80 out of 86 students (93%) are within 2 SDs of the mean.
86 out of 86 students (100%) are within 3 SDs of the mean.
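These counts can be reproduced in code. This minimal Python sketch assumes the 86 sleep values are the ones printed in the sorted R output on the Deviation slide, and counts how many fall within 1, 2, and 3 sample standard deviations of the mean.

```python
from statistics import mean, stdev

# The 86 sleep values, rebuilt from the sorted output on the Deviation slide.
sleep = ([1] * 2 + [2] * 3 + [3] * 24 + [4] * 2 + [5] * 41
         + [6] * 2 + [7] * 8 + [8] * 1 + [9] * 3)

x_bar = mean(sleep)
s = stdev(sleep)   # sample standard deviation (n - 1 in the denominator)

# Count observations whose distance from the mean is at most k SDs.
within = {
    k: sum(1 for x in sleep if abs(x - x_bar) <= k * s)
    for k in (1, 2, 3)
}

for k, count in within.items():
    print(f"within {k} SD: {count} of {len(sleep)} ({count / len(sleep):.0%})")
```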
95% Rule
If a distribution of data is approximately symmetric and bell-shaped, about 95% of the data should fall within two standard deviations of the mean.
For a population, about 95% of the data will be between µ − 2σ and µ + 2σ.
http://rchsbowman.files.wordpress.com/2008/09/empirical-rule-3.jpg
Notation Recap
             mean   variance   SD
sample       x̄      s²         s
population   µ      σ²         σ
Do you see a trend in what types of letters are used for sample statistics vs. population parameters?
Latin letters for sample statistics, Greek letters for population parameters.
Z-Scores
Z-Score
The z-score for a data value, x, is
$$z = \frac{x - \bar{x}}{s}$$
For a population, x̄ is replaced with µ and s is replaced with σ.
Values farther from 0 are more extreme.
Z-Scores: Why?
A z-score puts values on a common scale
A z-score is the number of standard deviations a value falls from the mean
For bell-shaped distributions, about 95% of all z-scores fall between −2 and 2.
z-scores beyond −2 or 2 can be considered extreme
Z-Scores: Example
Which is better, (A) an ACT score of 28 or (B) a combined SAT score of 2100? Assume ACT and SAT scores have approximately bell-shaped distributions.
ACT: µ = 21, σ = 5
SAT: µ = 1500, σ = 325
ACT:
$$z = \frac{28 - 21}{5} = \frac{7}{5} = 1.4$$
SAT:
$$z = \frac{2100 - 1500}{325} = \frac{600}{325} \approx 1.85$$
[Figure: histogram of z-scores; x-axis from −3 to 3, frequency on the y-axis.]
Statistics 101 (Nicole Dalzell) U1 - L2: EDA July 1, 2014 51 / 63
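The ACT/SAT comparison can be checked in R. This is a small sketch; the helper name `z_score` is hypothetical and introduced only for this example, while the means and SDs are the ones given on the slide.

```r
# Population z-score: number of SDs a value lies from the mean
# (z_score is a hypothetical helper name, not a built-in function)
z_score <- function(x, mu, sigma) {
  (x - mu) / sigma
}

z_act <- z_score(28, mu = 21, sigma = 5)        # (28 - 21) / 5
z_sat <- z_score(2100, mu = 1500, sigma = 325)  # (2100 - 1500) / 325

round(c(ACT = z_act, SAT = z_sat), 2)

# The larger z-score is the stronger result relative to its own distribution
z_sat > z_act
```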
Other Measures of Location
The 25th percentile is also called the first quartile, Q1.
The 50th percentile is also called the median.
The 75th percentile is also called the third quartile, Q3.
summary(d$study_hours)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   3.00   10.00   15.00   17.42   20.00   40.00   13.00
Between Q1 and Q3 is the middle 50% of the data. The range these data span is called the interquartile range, or the IQR.
IQR = Q3 − Q1 = 20 − 10 = 10
Is the range or the IQR more robust to outliers?
The IQR: it ignores the smallest and largest 25% of the data, whereas the range is determined entirely by the two most extreme values.
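The quartiles, the IQR, and the range can be computed directly in R. The sketch below uses a small made-up vector `hours` (an assumption for illustration, not the class survey data), chosen so the quartiles come out to round numbers.

```r
# Hypothetical weekly study hours with two missing responses (not the class data)
hours <- c(3, 5, 10, 12, 15, 18, 20, 25, 40, NA, NA)

# Quartiles; na.rm = TRUE drops the missing values first
q <- quantile(hours, probs = c(0.25, 0.50, 0.75), na.rm = TRUE)
q

# IQR = Q3 - Q1; the built-in IQR() gives the same number
iqr_manual <- unname(q["75%"] - q["25%"])
iqr_builtin <- IQR(hours, na.rm = TRUE)

# Range = max - min, for comparison
rng <- diff(range(hours, na.rm = TRUE))

# Replace the largest value with an extreme one:
# the range changes a lot, the IQR does not change at all
hours_extreme <- hours
hours_extreme[which.max(hours_extreme)] <- 400
iqr_extreme <- IQR(hours_extreme, na.rm = TRUE)
rng_extreme <- diff(range(hours_extreme, na.rm = TRUE))
c(IQR = iqr_extreme, range = rng_extreme)
```

Note that `quantile()` supports several interpolation rules through its `type` argument, so other software may report slightly different quartiles for the same data.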
Participation question
Which of the following is false about the distribution of average number of hours students study daily?
[Figure: box plot of average number of hours students study daily; x-axis from 2 to 10]
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   3.000   4.000   3.821   5.000  10.000
(a) There are no students who don't study at all.
(b) 75% of the students study more than 5 hours daily, on average.
(c) 25% of the students study less than 3 hours, on average.
(d) IQR is 2 hours.
Box Plot
The box in a box plot represents the middle 50% of the data, and the thick line in the box is the median.
[Figure: horizontal box plot of # of study hours / week; x-axis from 10 to 40]
Anatomy of a Box Plot
[Figure: vertical box plot of # of study hours / week (axis from 0 to 40), with labels for: lower whisker, Q1 (first quartile), median, Q3 (third quartile), upper whisker, max whisker reach, suspected outliers]
Whiskers and Outliers
Whiskers of a box plot can extend up to 1.5 * IQR away from the quartiles.
max upper whisker reach: Q3 + 1.5 * IQR = 20 + 1.5 * 10 = 35
max lower whisker reach: Q1 − 1.5 * IQR = 10 − 1.5 * 10 = −5
An outlier is defined as an observation beyond the maximum reach of the whiskers. It is an observation that appears extreme relative to the rest of the data.
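The whisker-reach rule can be sketched in a few lines of R. The vector `hours` below is hypothetical (not the class data) and is chosen so that Q1 = 10 and Q3 = 20, matching the arithmetic above.

```r
# Hypothetical weekly study hours (not the class data)
hours <- c(3, 5, 10, 12, 15, 18, 20, 25, 40)

q1 <- unname(quantile(hours, 0.25))
q3 <- unname(quantile(hours, 0.75))
iqr <- q3 - q1

# Maximum reach of the whiskers
lower_reach <- q1 - 1.5 * iqr
upper_reach <- q3 + 1.5 * iqr

# Suspected outliers: observations beyond the maximum whisker reach
outliers <- hours[hours < lower_reach | hours > upper_reach]
outliers

# boxplot.stats() applies the same 1.5 * IQR rule (it uses hinges, which can
# differ slightly from quantile() in other samples)
bp_out <- boxplot.stats(hours)$out
bp_out
```

Here only the value 40 lies beyond the upper reach of 35, so it is the one suspected outlier; the lower reach of −5 is below every observation.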
Outliers (cont.)
Why is it important to look for outliers?
Identify extreme skew in the distribution.
Identify data collection and entry errors.
Provide insight into interesting features of the data.
Example: Visualizing
What does a response of 0 mean in this distribution?
[Figure: box plot of number of drinks it takes students to get drunk; x-axis from 0 to 12]
Most likely, it means that the student doesn't drink. It would be preferable to recode these responses as NAs.
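Recoding a placeholder value as missing takes one line in R. The sketch below uses a made-up vector `drinks`; the values and variable names are assumptions for illustration, not the survey responses.

```r
# Hypothetical survey responses: number of drinks (not the class data)
drinks <- c(4, 0, 6, 3, 0, 5, 8, 0, 2)

# Recode 0 ("doesn't drink") as missing rather than as a real count
drinks_clean <- drinks
drinks_clean[drinks_clean == 0] <- NA

# Summaries must now be told to skip missing values
mean_raw <- mean(drinks)                        # pulled down by the zeros
mean_clean <- mean(drinks_clean, na.rm = TRUE)  # average among students who drink
c(raw = mean_raw, recoded = mean_clean)
```

Keeping the original vector alongside the recoded copy makes it easy to check how many responses were affected.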
Descriptive Statistics Robust Statistics
Extreme observations
How would sample statistics such as mean, median, SD, and IQR of household income be affected if the largest value was replaced with $10 million? What if the smallest value was replaced with $10 million?
[Figure: dot plot of household income ($ thousands); x-axis from 0 to 1000]
Income Example
[Figure: dot plot of household income ($ thousands); x-axis from 0 to 1000]
                                robust            not robust
scenario                        median   IQR      x̄        s
original data                   165K     150K     211K     180K
move largest to $10 million     165K     150K     398K     1,422K
move smallest to $10 million    190K     163K     4,186K   1,424K
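The same experiment can be tried in R. This is a sketch on a small made-up income vector (an assumption for illustration, not the data behind the table), and `describe` is a hypothetical helper name.

```r
# Hypothetical household incomes in $ thousands (not the data on the slide)
income <- c(40, 60, 85, 100, 120, 150, 165, 200, 250, 300, 1000)

# Median and IQR (robust) alongside mean and SD (not robust)
# (describe is a hypothetical helper, not a built-in function)
describe <- function(x) {
  c(median = median(x), IQR = IQR(x), mean = mean(x), sd = sd(x))
}

# Scenario 1: move the largest value to $10 million (10,000 thousand)
largest_moved <- replace(income, which.max(income), 10000)

# Scenario 2: move the smallest value to $10 million
smallest_moved <- replace(income, which.min(income), 10000)

results <- rbind(original = describe(income),
                 largest_moved = describe(largest_moved),
                 smallest_moved = describe(smallest_moved))
round(results, 1)
```

In this made-up sample the median and IQR are unchanged when the largest value moves and shift only modestly when the smallest value moves, while the mean and SD change dramatically in both scenarios.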
Income Example (cont.)
Median and IQR are more robust to skewness and outliers than mean and SD. Therefore,
for skewed distributions it is more appropriate to use median and IQR to describe the center and spread
for symmetric distributions it is more appropriate to use the mean and SD to describe the center and spread
Robust/resistant or Not?
Robust/Resistant: A statistic is called resistant or robust if it is relatively unaffected by extreme values.
Measures of Center:
Mean (Not Robust)
Median (Robust)
Measures of Spread:
Standard Deviation (Not Robust)
IQR (Robust)
Range (Not Robust)
Most often, we use the mean and the standard deviation, because they are calculated from all the data values and so use all the available information.
Examples
Application exercise: Shapes of distributions
Below are two histograms. One corresponds to the age at which a sample of people applied for marriage licenses; the other corresponds to the last digit of a sample of social security numbers. Which graph is which, and why?
Match the following variables with the histograms and bar graphs given below. These data represent Sta 101 students at Duke.
(a) the height of students
(b) gender breakdown of students
(c) the time it took students to get to their first class of the day
(d) the number of hours of sleep students received last night
(e) whether or not students live off campus
(f) the number of piercings students have