Unit 1: Introduction to data
Lecture 2: Exploratory data analysis

Statistics 101
Nicole Dalzell
July 1, 2014



1 Announcements
  Warm-Up and Data Basics
  Exploring Data

2 Numerical Data
  Relationship between two numerical variables

3 Distribution of one numerical variable
  Describing distributions of numerical variables
  Distribution Shapes

4 Descriptive Statistics
  Center
  Spread
  Robust Statistics

5 Examples


Announcements

From now on, sit with your teams in class.

If there is someone from your team that you haven’t met yet, let me know.

If you weren’t able to log on to RStudio, and your name isn’t highlighted on the Google Doc, stop by after class.


Announcements Warm-Up and Data Basics

Review

Example Study: A researcher divides 250 cats (adults and kittens) into two rooms, with adult cats in one room and baby kittens in the other room. Within each room she erects a fence, randomly placing half the cats (or kittens) on each side of the fence. On one side of the fence she scatters a variety of cat toys. For 1 day, the researcher records the number of hours each cat spends sleeping.

What is the research question?

What are the explanatory and response variables?

Is this an Experimental or Observational study?

What are the controls and treatments?

What kind of structure was given to the study (e.g., blocking or clustering) and why?


Types of Variables Example

Still our cat example:

Cat   Age       Toys  # of Naps  Weight (lbs)
1     adult     1     3          8
2     juvenile  1     5          9
3     adult     0     2          10.5
4     adult     1     8          12.25
...   ...       ...   ...        ...
250   adult     0     5          11.67

What types of variables are these:

Age?

Toys?

# of Naps?

Weight?

Announcements Exploring Data

Population to sample

It is usually not feasible to collect information on the entire population due to high costs of data collection, so statisticians instead work with samples that are (hopefully) representative of the populations they come from.

[Figure: a sample drawn from a population]

We try to understand certain features of the population as a whole using summary statistics and graphs based on these samples.


Exploratory analysis to inference

When you taste a spoonful of soup and decide it doesn’t taste salty enough, that’s exploratory analysis.

If you generalize and conclude that your soup needs salt, that’s an inference.

For your inference to be valid, the spoonful you tasted (the sample) needs to be representative of the entire pot (the population).

If your spoonful comes only from the surface and the salt is collected at the bottom of the pot, what you tasted is probably not representative of the whole pot.

If you first stir the soup thoroughly before you taste, your spoonful will more likely be representative of the whole pot.


Random assignment vs. random sampling

                     Random assignment            No random assignment
Random sampling      Causal conclusion,           No causal conclusion,            Generalizability
                     generalized to the whole     correlation statement
                     population.                  generalized to the whole
                     (ideal experiment)           population.
                                                  (most observational studies)
No random sampling   Causal conclusion,           No causal conclusion,            No generalizability
                     only for the sample.         correlation statement only
                     (most experiments)           for the sample.
                                                  (bad observational studies)
                     Causation                    Correlation


EDA Intro

EDA is important, because your analyses will depend on the trends and features of your data.

The distribution of a variable is a list of possible values the variable can take and how often it takes each of those values.

Distributions are critical to assessing the probability of events.

We often utilize descriptive statistics related to the center and spread of the data.

Plots are almost always useful for visualizing relationships and distributions in the data.

Example: Do {5,5,5,5,5,5,5,5,5} and {1,2,3,4,5,6,7,8,9} have the same distribution? Why or why not?
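One way to explore the example is to tabulate how often each value occurs in each set. The sketch below uses Python rather than the R used in the course, which is an assumption made purely for illustration; the variable names are hypothetical.

```python
from collections import Counter
from statistics import mean, median

constant = [5, 5, 5, 5, 5, 5, 5, 5, 5]
spread_out = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# A distribution lists each value together with how often it occurs.
dist_constant = Counter(constant)
dist_spread = Counter(spread_out)

print(dist_constant)
print(dist_spread)

# The two sets share the same center ...
same_center = (mean(constant) == mean(spread_out)
               and median(constant) == median(spread_out))

# ... but the values and their frequencies differ.
same_distribution = dist_constant == dist_spread

print(same_center, same_distribution)
```

Comparing the two tables shows that a matching mean and median is not enough for two data sets to have the same distribution.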


Numerical Data Relationship between two numerical variables

Scatterplot

Scatterplots are useful for visualizing the relationship between two numerical variables.

Do life expectancy and total fertility appear to be associated or independent?

Was the relationship the same throughout the years, or did it change?

http://www.gapminder.org/world


Cars: ... vs. weight

From the cars data:

[Figure: two scatterplots from the cars data: miles per gallon (city rating) vs. weight (pounds), and price ($1000s) vs. weight (pounds)]

What do these scatterplots reveal about the data? How might they be useful?


Distribution of one numerical variable

Visualizing numerical variables

Histogram: Provides a view of the data density and is especially convenient for describing the shape of the data distribution.

Box plot: Especially useful for displaying the median, quartiles,unusual observations, as well as the IQR.

Intensity map: Useful for displaying the spatial distribution.

Dot plot: Useful when individual values are of interest.


Why visualize?

What does a response of 0 mean in this distribution?

[Figure: dot plot of the number of drinks it takes students to get drunk, axis from 0 to 12]

Most likely, it means that the student doesn’t drink. It would be preferable to recode these as NAs.
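The recoding step can be sketched as follows. This is a minimal illustration in Python (the course itself uses R), with made-up responses rather than the class data; `None` stands in for NA.

```python
from statistics import mean

# Hypothetical survey responses (not the class data).
drinks = [0, 3, 4, 0, 6, 5, 8, 0, 4]

# Recode 0 as missing; None plays the role of NA here.
recoded = [None if d == 0 else d for d in drinks]

# Summaries should skip the missing values.
observed = [d for d in recoded if d is not None]

mean_with_zeros = mean(drinks)
mean_without_zeros = mean(observed)

print(mean_with_zeros, mean_without_zeros)
```

Leaving the zeros in pulls the average down, even though a 0 here does not describe how many drinks it takes.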


Why visualize?

Describe the spatial distribution of race/ethnicity in the US.

http://demographics.coopercenter.org/DotMap/index.html


Why visualize?

And let’s take a closer look at Durham.


Why visualize?

[Figure: dot plot of weight, in ounces, axis from 0 to 4000]

Do you see anything out of the ordinary?

Some people reported their weight in pounds.


Why visualize?

What type of variable is average number of hours of sleep per night? Is this reflected in the dot plot below? If not, what might be the reason?

[Figure: dot plot of average number of hours of sleep per night, axis from 4 to 9]

Average number of hours of sleep per night is a continuous numerical variable. But responses are rounded, so there are only whole numbers and half hours in the data.
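The effect of rounding can be sketched in a few lines. The example is in Python (an assumption; the course uses R), and the sleep durations and the helper name `round_to_half_hour` are hypothetical.

```python
def round_to_half_hour(hours):
    """Round a duration in hours to the nearest half hour."""
    return round(hours * 2) / 2

# Hypothetical true sleep durations, in hours.
true_hours = [6.8, 7.26, 7.9, 5.4]

# What respondents might report after rounding.
reported = [round_to_half_hour(h) for h in true_hours]

print(reported)
```

The underlying values can be anything in a continuous range, but the reported values land only on whole and half hours.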


Stacked Dot Plot

Higher bars represent areas where there are more observations, which makes it a little easier to judge the center and the shape of the distribution.

[Figure: stacked dot plot of gpa, axis from 3.0 to 4.0]


Histogram Construction

Order the data in ascending order:

2.900 2.910 3.000 3.100 3.100 3.150 3.150 3.150 3.200 3.200 3.250
3.294 3.300 3.300 3.330 3.350 3.350 3.400 3.400 3.400 3.400 3.400
3.400 3.400 3.410 3.450 3.460 3.500 3.500 3.500 3.500 3.550 3.560
3.600 3.600 3.600 3.600 3.610 3.630 3.650 3.680 3.700 3.700 3.700
3.700 3.700 3.700 3.750 3.750 3.750 3.750 3.785 3.790 3.800 3.800
3.800 3.800 3.800 3.840 3.840 3.840 3.860 3.868 3.900 3.900 3.900
3.900 3.925 3.925 3.970 3.970 4.000 4.000 4.000 4.000 4.300 4.300

Make a frequency table by counting how many observations fall in each bin. Let’s use a bin width of 0.1:

GPA    2.9 to 3  3 to 3.1  3.1 to 3.2  3.2 to 3.3  · · ·  3.8 to 3.9  3.9 to 4
Count  3         2         5           4           · · ·  9           8
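The counting step can be reproduced with a short script. This is a sketch in Python (the course uses R), and it assumes each bin includes its upper edge, with the smallest value kept in the first bin; that convention is inferred from the counts in the table, not stated on the slide. The names `START`, `WIDTH`, and `bin_index` are hypothetical.

```python
from collections import Counter

gpa = [
    2.900, 2.910, 3.000, 3.100, 3.100, 3.150, 3.150, 3.150, 3.200, 3.200, 3.250,
    3.294, 3.300, 3.300, 3.330, 3.350, 3.350, 3.400, 3.400, 3.400, 3.400, 3.400,
    3.400, 3.400, 3.410, 3.450, 3.460, 3.500, 3.500, 3.500, 3.500, 3.550, 3.560,
    3.600, 3.600, 3.600, 3.600, 3.610, 3.630, 3.650, 3.680, 3.700, 3.700, 3.700,
    3.700, 3.700, 3.700, 3.750, 3.750, 3.750, 3.750, 3.785, 3.790, 3.800, 3.800,
    3.800, 3.800, 3.800, 3.840, 3.840, 3.840, 3.860, 3.868, 3.900, 3.900, 3.900,
    3.900, 3.925, 3.925, 3.970, 3.970, 4.000, 4.000, 4.000, 4.000, 4.300, 4.300,
]

# Work in thousandths of a GPA point to avoid floating-point edge effects.
START = 2900  # lower edge of the first bin (2.9)
WIDTH = 100   # bin width (0.1)

def bin_index(value):
    """Index of the right-closed bin (lo, hi] that holds value."""
    k = round(value * 1000)
    # Ceiling division, shifted so that (2.9, 3.0] is bin 0.
    idx = -(-(k - START) // WIDTH) - 1
    # The minimum (2.900) sits on the lowest edge; keep it in bin 0.
    return max(idx, 0)

counts = Counter(bin_index(v) for v in gpa)

for i in sorted(counts):
    lo = (START + i * WIDTH) / 1000
    hi = (START + (i + 1) * WIDTH) / 1000
    print(f"{lo:.1f} to {hi:.1f}: {counts[i]}")
```

With this convention the first four bins and the bins ending at 3.9 and 4 match the counts shown in the table.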


Histograms

Higher bars represent areas where there are more observations. Histograms are preferable when the sample size is large, but they hide finer details like individual observations.

[Figure: histogram of gpa, frequency (0 to 10) vs. gpa (3.0 to 4.0)]


Bin Width

Which one(s) of these histograms are useful? Which reveal too much about the data? Which hide too much?

[Figure: four histograms of extracurricular hrs / week (frequency on the y-axis), each drawn with a different bin width]


Density Curves

A Density Curve is a smoothed density histogram where the area under the curve is 1.

To draw a density curve from a histogram, connect the peaks of the histogram with a smooth line, and normalize the values of the y-axis such that the area under the curve is 1.
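The normalization step can be illustrated numerically. The sketch below, in Python (an assumption; the course uses R), rescales made-up bin counts so that the bars have total area 1. It covers only the rescaling of the y-axis, not the smoothing.

```python
# Hypothetical bin counts for a histogram with equal-width bins.
counts = [3, 7, 12, 6, 2]
bin_width = 0.5

n = sum(counts)

# Density height of each bar: count / (n * bin width).
densities = [c / (n * bin_width) for c in counts]

# Total area of the bars: sum of height * width.
total_area = sum(d * bin_width for d in densities)

print(densities)
print(total_area)
```

Dividing by both the number of observations and the bin width is what makes the total area, rather than the total height, equal to 1.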


Describing distributions of numerical variables

When describing distributions of numerical variables always mention:

Shape: skewness, modality

Center: an estimate of a typical observation in the distribution (mean, median, mode, etc.)

Spread: measure of variability in the distribution (SD, IQR, range, etc.)

Unusual observations: observations that stand out from the rest of the data that may be suspected outliers

[Figure: three example distributions, each on an axis from −3 to 3]


Distribution of one numerical variable Distribution Shapes

Describing Distributions

When describing distributions make sure to talk about the shape, center, spread, and, if any, unusual observations.

[Figure: three example distributions, each on an axis from −3 to 3]


Shape

How would you describe the shape of this distribution?

[Figure: histogram of average number of hours spent on school work per day, x-axis from 2 to 10]

Unimodal and right skewed.


Describing Your Pictures

Bell Shaped: Data is bell shaped if the majority of the data is clustered around the center value (mean) with very few data points lying either way above or way below this value.

Right Skewed: Data is positively skewed if you have several unusually large data points creating a long tail to the right.

Left Skewed: Data is negatively skewed if you have several unusually small data points creating a long tail to the left.

Bimodal: Data is bimodal if it has two large clusters of data points.

Symmetric: Data is symmetric if it looks like a mirror image around its center.

Uniformly Distributed: Data is evenly spread across all possible values.


Modality

The mode is defined as the most frequent observation in the data set.

Does the histogram have a single prominent peak (unimodal), several prominent peaks (bimodal/multimodal), or no apparent peaks (uniform)?

[Figure: four example histograms]

In order to determine modality, it’s easiest to step back and imagine a density curve over the histogram. Use the limp spaghetti method.

Commonly observed shapes of distributions

modality: unimodal, bimodal, multimodal, uniform

skewness: right skew, left skew, symmetric

Participation question

Which of these variables do you expect to be uniformly distributed?

(a) weights of adult females

(b) salaries of a random sample of people from North Carolina

(c) house prices

(d) birthdays of classmates (day of the month)


Skewness

Is the histogram right skewed, left skewed, or symmetric?

[Figure: three example histograms]

Histograms are said to be skewed to the side of the long tail.


Unusual Observations

Are there any unusual observations or potential outliers?

[Figure: two example histograms]

Application exercise: Shapes of distributions

Below are two histograms. One corresponds to the age at which a sample of people applied for marriage licenses; the other corresponds to the last digit of a sample of social security numbers. Which graph is which, and why?


Match the following variables with the histograms and bar graphs given below. These data represent Sta 101 students at Duke.

(a) the height of students

(b) gender breakdown of students

(c) the time it took students to get to their first class of the day

(d) the number of hours of sleep students received last night

(e) whether or not students live off campus

(f) the number of piercings students have


Descriptive Statistics Center

Measures of Center

The Mean of a dataset is what we commonly refer to as the average.

The Median of a dataset is the middle value of your data. You find the median of your data by ordering from smallest to largest, then finding the value with 50% of your data above it and 50% below it.

The Trimmed Mean is the calculation of the mean after removing a few of the very large and very small observations.

What is the advantage of using the Median instead of the Mean?

What is the advantage of using the Mean instead of the Median?
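To compare the three measures on the same data, here is a minimal sketch in Python (the course uses R). The data are made up, and `trimmed_mean` with its parameter `k` is a hypothetical helper: the slides say only that "a few" observations are removed, so trimming one value from each end is an assumption.

```python
from statistics import mean, median

def trimmed_mean(values, k=1):
    """Mean after dropping the k smallest and k largest observations."""
    if len(values) <= 2 * k:
        raise ValueError("not enough observations to trim")
    ordered = sorted(values)
    return mean(ordered[k:len(ordered) - k])

# Hypothetical data with one very large observation.
data = [2, 3, 3, 4, 5, 5, 6, 100]

print(mean(data))          # uses every value, including 100
print(median(data))        # depends only on the middle of the ordered data
print(trimmed_mean(data))  # drops 2 and 100 before averaging
```

Changing the value 100 to something more typical and rerunning the script is a quick way to see which measures move and which stay put.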


Mean

The sample mean, denoted as x̄, can be calculated as

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\text{Sum of Data Points}}{\text{Number of Data Points}},$$

where $x_1, x_2, \cdots, x_n$ represent the n observed values.

The population mean is a parameter computed the same way but is denoted as µ. It is often not possible to calculate µ since population data is rarely available.

x̄ is an estimate of µ based on the observed data.

The sample mean is a sample statistic, or a point estimate of the population mean. This estimate may not be perfect, but if the sample is good (representative of the population) it is usually a good guess.
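A minimal sketch of the formula in R, assuming a made-up sample `x` of five observations: the sum of the data points divided by the number of data points should agree with the built-in `mean()`.

```r
# Hypothetical sample of n = 5 observations.
x <- c(4, 8, 6, 5, 7)

n     <- length(x)     # number of data points
x_bar <- sum(x) / n    # sum of data points / number of data points

print(x_bar)
print(mean(x))         # built-in equivalent
```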

Median

The median is the value that splits the data in half when ordered in ascending order.

0, 1, 2, 3, 4 → 2

If there are an even number of observations, then the median is the average of the two values in the middle.

$$0, 1, 2, 3, 4, 5 \rightarrow \frac{2 + 3}{2} = 2.5$$

Since the median is the midpoint of the data, 50% of the values are below it. Hence, it is also the 50th percentile.
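The two cases above can be checked in R. This is only a sketch: the vectors are the slide's values entered in scrambled order to show that `median()` sorts for you, and the "by hand" calculation is one way to write out the even-n rule.

```r
odd_n  <- c(3, 0, 4, 1, 2)      # 5 observations
even_n <- c(5, 3, 0, 4, 1, 2)   # 6 observations

median_odd  <- median(odd_n)    # middle value of 0, 1, 2, 3, 4
median_even <- median(even_n)   # average of the two middle values

# The even case by hand: sort, then average the two middle values.
sorted  <- sort(even_n)
n       <- length(sorted)
by_hand <- (sorted[n / 2] + sorted[n / 2 + 1]) / 2

print(c(odd = median_odd, even = median_even, by_hand = by_hand))
```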

Mean vs. Median

If the distribution is symmetric, center is the mean
Symmetric: mean ≈ median

If the distribution is skewed or has outliers, center is the median
Right-skewed: mean > median
Left-skewed: mean < median

[Figure: three density curves (right-skewed, left-skewed, symmetric), each marking the positions of the mean and the median]
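To see the ordering of mean and median under skew, here is a small sketch with made-up values. `right_skewed` has a long right tail; `left_skewed` is its mirror image, built here as `21 - right_skewed` purely for illustration.

```r
# Hypothetical right-skewed data: most values are small, one is large.
right_skewed <- c(1, 2, 2, 3, 3, 3, 4, 20)

# Mirror image of the same data, giving a long left tail.
left_skewed <- 21 - right_skewed

centers <- rbind(
  right_skewed = c(mean = mean(right_skewed), median = median(right_skewed)),
  left_skewed  = c(mean = mean(left_skewed),  median = median(left_skewed))
)
print(centers)
```

The mean is pulled toward the tail in each case, while the median stays with the bulk of the data.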

Are you typical?

http://www.youtube.com/watch?v=4B2xOvKFFz4

How useful are centers alone for conveying the true characteristics of a distribution?

Descriptive Statistics Spread

Measures of Spread

The population Variance, σ², measures how far observations deviate from the mean: it is the average squared deviation from the mean.

The population Standard Deviation, σ, is the square root of the variance.

The Interquartile Range (IQR) measures the spread of the middle 50% of your data, and is visually depicted in Boxplots.

Deviation

The distance of an observation from the mean is its deviation: $x_i - \bar{x}$.

sort(d$sleep)
 [1] 1 1 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[30] 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
[59] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 7 7 7 7 7 7 7 7 8 9 9 9
mean(d$sleep)
[1] 4.6

$$\begin{aligned}
x_1 - \bar{x} &= 1 - 4.6 = -3.6 \\
x_2 - \bar{x} &= 1 - 4.6 = -3.6 \\
x_3 - \bar{x} &= 2 - 4.6 = -2.6 \\
&\;\;\vdots \\
x_{86} - \bar{x} &= 9 - 4.6 = 4.4
\end{aligned}$$
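The deviations can be computed for all observations at once. The sketch below does not use the course data frame `d`; instead it rebuilds a stand-in vector `sleep` from the counts visible in the sorted output on this slide (an assumption made so the example is self-contained).

```r
# Stand-in for sort(d$sleep), rebuilt from the printed output:
# two 1s, three 2s, 24 3s, two 4s, 41 5s, two 6s, eight 7s, one 8, three 9s.
sleep <- rep(1:9, times = c(2, 3, 24, 2, 41, 2, 8, 1, 3))

x_bar <- mean(sleep)        # about 4.6
dev   <- sleep - x_bar      # one deviation per observation

print(round(x_bar, 1))
print(round(head(dev, 3), 1))   # deviations of the three smallest values
print(round(tail(dev, 1), 1))   # deviation of the largest value
```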

Variance

Variance, s²

Roughly the average squared deviation from the mean

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Given that the average number of hours students sleep per night is 7.029, the variance of amount of sleep students get per night can be calculated as:

$$s^2 = \frac{(7.5 - 7.029)^2 + (7 - 7.029)^2 + \cdots + (8 - 7.029)^2}{106 - 1} = 0.72$$
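A short sketch of the same calculation in R, using a made-up vector `x` rather than the 106 student responses (which are not listed on the slide). The hand computation follows the formula term by term and is compared with the built-in `var()`, which also divides by n − 1.

```r
# Hypothetical sample; its mean is exactly 5.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

n           <- length(x)
x_bar       <- mean(x)
squared_dev <- (x - x_bar)^2              # squared deviation of each value
s2          <- sum(squared_dev) / (n - 1) # divide by n - 1, not n

print(s2)
print(var(x))   # built-in sample variance
```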

Variance (cont.)

Why do we use the squared deviation in the calculation of variance?

To get rid of negatives so that observations equally distant from the mean are weighed equally.

To weigh larger deviations more heavily
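Both reasons can be seen in a small numerical sketch. The vector `x` below is made up for illustration; absolute deviations are included only as a point of comparison and are not part of the variance formula.

```r
# Hypothetical data with mean exactly 5.
x   <- c(1, 4, 4, 5, 6, 10)
dev <- x - mean(x)              # -4 -1 -1  0  1  5

sum_dev    <- sum(dev)          # raw deviations cancel out
sum_sq_dev <- sum(dev^2)        # squared deviations do not

# Share of the total contributed by the most distant value (10):
share_abs <- abs(dev[6]) / sum(abs(dev))   # using absolute deviations
share_sq  <- dev[6]^2 / sum(dev^2)         # using squared deviations

print(c(sum_dev = sum_dev, sum_sq_dev = sum_sq_dev))
print(round(c(share_abs = share_abs, share_sq = share_sq), 3))
```

Without squaring, the deviations sum to zero; with squaring, the most distant observation accounts for a larger share of the total than it would under absolute deviations.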

Application exercise: Variability

Order histograms A, B, and C from least to most variable. Explain your reasoning.

Between histograms D and E, which exhibits more variability? Explain your reasoning.

Variability vs. diversity

Which of the following sets of cars has more diverse composition of colors?

Set 1: more diverse

Set 2: less diverse

Variability vs. diversity (cont.)

Which of the following sets of cars has more variable mileage?

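Set 1: less variable
[Figure: plot of mileage for Set 1, horizontal axis from 10 to 60, vertical axis from 0 to 3]

Set 2: more variable
[Figure: plot of mileage for Set 2, horizontal axis from 10 to 60, vertical axis from 0 to 3]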

Standard deviation

Standard deviation, s
Roughly the deviation around the mean, calculated as the square root of the variance, and has the same units as the data.

$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

The standard deviation of amount of sleep students get per night can be calculated as:

$$s = \sqrt{0.72} = 0.85 \text{ hours}$$

Students on average sleep 7.029 hours, give or take 0.85 hours.
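The step from variance to standard deviation is a single square root. The sketch below first repeats the slide's arithmetic, then checks on a made-up vector `x` (hypothetical hours of sleep) that R's `sd()` is the square root of `var()`.

```r
# Slide's numbers: variance of 0.72 (in squared hours).
s2 <- 0.72
s  <- sqrt(s2)          # standard deviation, back in hours
print(round(s, 2))

# Hypothetical sample: sd() should equal sqrt(var()).
x <- c(6.5, 7, 7.5, 8, 6)
print(sd(x))
print(sqrt(var(x)))
```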

Standard Deviation

The standard deviation gives a rough estimate of the typical distance of a data value from the mean.

The larger the standard deviation, the more variability there is in the data and the more spread out the data are.

[Figure: two frequency histograms on a common horizontal axis from −15 to 15. "Standard Deviation of 2" shows values from rnorm(100, 0, 2); "Standard Deviation of 4" shows values from rnorm(100, 0, 4).]
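A sketch of how two such histograms could be drawn. The calls to `rnorm()` match the ones named on the slide; the seed value, variable names, and side-by-side layout are assumptions, and a different seed will give slightly different pictures.

```r
set.seed(101)  # arbitrary seed so the sketch is reproducible

narrow <- rnorm(100, mean = 0, sd = 2)   # 100 draws, standard deviation 2
wide   <- rnorm(100, mean = 0, sd = 4)   # 100 draws, standard deviation 4

# Sample standard deviations should land near 2 and 4.
print(c(sd_narrow = sd(narrow), sd_wide = sd(wide)))

# Same horizontal axis for both, so the difference in spread is visible.
par(mfrow = c(1, 2))
hist(narrow, xlim = c(-15, 15), main = "Standard Deviation of 2")
hist(wide,   xlim = c(-15, 15), main = "Standard Deviation of 4")
```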

Variability in Student Sleep

sleep, x̄ = 4.6, s = 1.66

[Figure: dot plot of sleep, horizontal axis from 2 to 8]

69 out of 86 students (80%) are within 1 SD of the mean.

80 out of 86 students (93%) are within 2 SDs of the mean.

86 out of 86 students (100%) are within 3 SDs of the mean.
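These counts can be reproduced with a few lines of R. As a stand-in for `d$sleep`, the sketch rebuilds the 86 values from the sorted output printed on the Deviation slide; `within_k` is a helper name introduced here for illustration.

```r
# Stand-in for d$sleep, rebuilt from the sorted output shown earlier.
sleep <- rep(1:9, times = c(2, 3, 24, 2, 41, 2, 8, 1, 3))

x_bar <- mean(sleep)   # about 4.6
s     <- sd(sleep)     # about 1.66

# Number of observations within k standard deviations of the mean.
within_k <- function(k) sum(abs(sleep - x_bar) <= k * s)
counts   <- sapply(1:3, within_k)

print(counts)
print(round(100 * counts / length(sleep)))   # as percentages
```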

95% Rule

If a distribution of data is approximately symmetric and bell-shaped, about 95% of the data should fall within two standard deviations of the mean.

For a population, about 95% of the data will be between µ − 2σ and µ + 2σ.

http://rchsbowman.files.wordpress.com/2008/09/empirical-rule-3.jpg
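For an exactly normal (bell-shaped) distribution, the share within k standard deviations can be read off the normal distribution function. This is a sketch using `pnorm()`; real data that are only approximately bell-shaped will match these figures only roughly.

```r
# Probability that a normal value falls within k SDs of its mean.
share_within <- function(k) pnorm(k) - pnorm(-k)

shares <- sapply(1:3, share_within)
names(shares) <- c("1 SD", "2 SDs", "3 SDs")
print(round(shares, 4))
```

The middle entry is where the "about 95%" in the rule comes from.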

Notation Recap

             mean   variance   SD
sample       x̄      s²         s
population   µ      σ²         σ

Do you see a trend in what types of letters are used for sample statistics vs. population parameters?

Latin letters for sample statistics, Greek letters for population parameters.

Z-Scores

Z-Score
The z-score for a data value, x, is

$$z = \frac{x - \bar{x}}{s}$$

For a population, x̄ is replaced with µ and s is replaced with σ.

Values farther from 0 are more extreme.

Z-Scores: Why?

A z-score puts values on a common scale

A z-score is the number of standard deviations a value falls from the mean

For approximately bell-shaped distributions, about 95% of all z-scores fall between −2 and 2.

z-scores beyond -2 or 2 can be considered extreme

Z-Scores: Example

Which is better, (A) an ACT score of 28 or (B) a combined SAT score of 2100? Assume ACT and SAT scores have approximately bell-shaped distributions.

ACT: µ = 21, σ = 5

SAT: µ = 1500, σ = 325

ACT:

$$z = \frac{28 - 21}{5} = \frac{7}{5} = 1.4$$

SAT:

$$z = \frac{2100 - 1500}{325} = \frac{600}{325} = 1.85$$

[Figure: "Histogram of Z-Scores", horizontal axis "Z-Score" from −3 to 3, vertical axis "Frequency" from 0 to 300]
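The two z-scores can be computed with a one-line helper. `z_score` is a name introduced here for illustration; the means and standard deviations are the ones given on the slide.

```r
# z-score: number of standard deviations a value lies from the mean.
z_score <- function(x, mu, sigma) (x - mu) / sigma

z_act <- z_score(28,   mu = 21,   sigma = 5)
z_sat <- z_score(2100, mu = 1500, sigma = 325)

print(round(c(ACT = z_act, SAT = z_sat), 2))
```

On this common scale the SAT score lies farther above its mean than the ACT score does, which suggests it is the more impressive of the two.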

Other Measures of Location

The 25th percentile is also called the first quartile, Q1.

The 50th percentile is also called the median.

The 75th percentile is also called the third quartile, Q3.

summary(d$study_hours)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   3.00   10.00   15.00   17.42   20.00   40.00   13.00

Between Q1 and Q3 is the middle 50% of the data. The range these data span is called the interquartile range, or the IQR.

IQR = 20 − 10 = 10

Is the range or the IQR more robust to outliers?

IQR
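A sketch of quartiles, IQR, and range in R. The vector `study_hours` is made up (it is not the course data) and was chosen so that its quartiles match the summary above; note that R supports several quantile conventions and the default is used here.

```r
# Hypothetical study hours with Q1 = 10, median = 15, Q3 = 20.
study_hours <- c(3, 10, 10, 15, 20, 20, 40)

q <- quantile(study_hours, probs = c(0.25, 0.50, 0.75))
print(q)

iqr_original   <- IQR(study_hours)          # Q3 - Q1
range_original <- diff(range(study_hours))  # max - min

# Replace the largest value with an extreme one and recompute.
extreme       <- replace(study_hours, which.max(study_hours), 400)
iqr_extreme   <- IQR(extreme)
range_extreme <- diff(range(extreme))

print(c(IQR = iqr_original, range = range_original))
print(c(IQR = iqr_extreme,  range = range_extreme))
```

In this sketch the IQR does not move when the maximum is made extreme, while the range changes by the full amount.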

Participation question

Which of the following is false about the distribution of average number of hours students study daily?

[Figure: plot of the average number of hours students study daily, horizontal axis from 2 to 10]

 Min.   1st Qu.  Median  Mean   3rd Qu.  Max.
 1.000  3.000    4.000   3.821  5.000    10.000

(a) There are no students who don’t study at all.
(b) 75% of the students study more than 5 hours daily, on average.
(c) 25% of the students study less than 3 hours, on average.
(d) IQR is 2 hours.

Box Plot

The box in a box plot represents the middle 50% of the data, and the thick line in the box is the median.

[Figure: horizontal box plot of # of study hours / week, axis from 10 to 40]

Anatomy of a Box Plot

[Figure: vertical box plot of # of study hours / week, axis from 0 to 40, with labels for: lower whisker, Q1 (first quartile), median, Q3 (third quartile), upper whisker, max whisker reach, suspected outliers]

Whiskers and Outliers

Whiskers of a box plot can extend up to 1.5 × IQR away from the quartiles.

$$\begin{aligned}
\text{max upper whisker reach} &= Q_3 + 1.5 \times IQR = 20 + 1.5 \times 10 = 35 \\
\text{max lower whisker reach} &= Q_1 - 1.5 \times IQR = 10 - 1.5 \times 10 = -5
\end{aligned}$$

An outlier is defined as an observation beyond the maximum reach of the whiskers. It is an observation that appears extreme relative to the rest of the data.
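The whisker rule can be applied directly in R. The sketch below uses a made-up vector `study_hours` whose quartiles match the slide (Q1 = 10, Q3 = 20); the variable names are introduced here for illustration.

```r
# Hypothetical study hours with Q1 = 10 and Q3 = 20.
study_hours <- c(3, 10, 10, 15, 20, 20, 40)

q1  <- unname(quantile(study_hours, 0.25))
q3  <- unname(quantile(study_hours, 0.75))
iqr <- q3 - q1

upper_reach <- q3 + 1.5 * iqr   # max upper whisker reach
lower_reach <- q1 - 1.5 * iqr   # max lower whisker reach

# Observations beyond the maximum whisker reach are suspected outliers.
outliers <- study_hours[study_hours < lower_reach | study_hours > upper_reach]

print(c(lower_reach = lower_reach, upper_reach = upper_reach))
print(outliers)
```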

Outliers (cont.)

Why is it important to look for outliers?

Identify extreme skew in the distribution.

Identify data collection and entry errors.

Provide insight into interesting features of the data.

Example: Visualizing

What does a response of 0 mean in this distribution?

[Figure: plot of the number of drinks it takes students to get drunk, horizontal axis from 0 to 12]

Most likely that a student doesn’t drink. It would be preferable to recode these as NAs.
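One way such a recoding might look in R, sketched on a made-up vector `drinks` (not the survey data): zeros are replaced with `NA`, and summaries are then computed with `na.rm = TRUE`.

```r
# Hypothetical responses; 0 is assumed to mean "does not drink".
drinks <- c(0, 4, 6, 0, 5, 8, 3, 0, 7)

mean_with_zeros <- mean(drinks)          # zeros pull the mean down

# Recode 0 as missing, keeping the original vector untouched.
drinks_clean <- drinks
drinks_clean[drinks_clean == 0] <- NA

n_missing       <- sum(is.na(drinks_clean))
mean_drinkers   <- mean(drinks_clean, na.rm = TRUE)

print(c(with_zeros = mean_with_zeros, recoded = mean_drinkers))
print(n_missing)
```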

Descriptive Statistics Robust Statistics

Extreme observations

How would sample statistics such as mean, median, SD, and IQR of household income be affected if the largest value was replaced with $10 million? What if the smallest value was replaced with $10 million?

[Figure: dot plot of household income ($ thousands), horizontal axis from 0 to 1000]

Income Example

[Figure: dot plot of household income ($ thousands), horizontal axis from 0 to 1000]

                                 robust            not robust
scenario                         median   IQR      x̄        s
original data                    165K     150K     211K     180K
move largest to $10 million      165K     150K     398K     1,422K
move smallest to $10 million     190K     163K     4,186K   1,424K
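The same experiment can be sketched on a small made-up income vector (in $ thousands). The values below are hypothetical, not the data behind the table, and `describe` is a helper name introduced for illustration.

```r
# Hypothetical household incomes in $ thousands.
income <- c(50, 100, 150, 160, 165, 250, 300, 350, 400)

# The four statistics from the table, for any vector x.
describe <- function(x) {
  c(median = median(x), IQR = IQR(x), mean = mean(x), sd = sd(x))
}

# $10 million is 10000 in $ thousands.
move_largest  <- replace(income, which.max(income), 10000)
move_smallest <- replace(income, which.min(income), 10000)

result <- rbind(
  original      = describe(income),
  move_largest  = describe(move_largest),
  move_smallest = describe(move_smallest)
)
print(round(result, 1))
```

In this sketch the median and IQR are unchanged when the largest value moves and shift only modestly when the smallest value moves, while the mean and SD jump in both scenarios.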

Median and IQR are more robust to skewness and outliers than mean and SD. Therefore,

for skewed distributions it is more appropriate to use median and IQR to describe the center and spread

for symmetric distributions it is more appropriate to use the mean and SD to describe the center and spread

Robust/resistant or Not?

Robust/Resistant
A statistic is called resistant or robust if it is relatively unaffected by extreme values.

Measures of Center:

Mean (Not Robust)

Median (Robust)

Measures of Spread:

Standard Deviation (Not Robust)

IQR (Robust)

Range (Not Robust)

Most often, we use the mean and the standard deviation, because they are calculated based on all the data values, so use all the available information.

Examples

Application exercise: Shapes of distributions

Below are two histograms. One corresponds to the age at which a sample of people applied for marriage licenses; the other corresponds to the last digit of a sample of social security numbers. Which graph is which, and why?

Match the following variables with the histograms and bar graphs given below. These data represent Sta 101 students at Duke.

(a) the height of students

(b) gender breakdown of students

(c) the time it took students to get to their first class of the day

(d) the number of hours of sleep students received last night

(e) whether or not students live off campus

(f) the number of piercings students have
