Unit 1: Introduction to Data
Lecture 2: Exploratory Data Analysis

Statistics 101

Nicole Dalzell

July 1, 2014

Outline

1 Announcements
    Warm-Up and Data Basics
    Exploring Data

2 Numerical Data
    Relationship between two numerical variables

3 Distribution of one numerical variable
    Describing distributions of numerical variables
    Distribution Shapes

4 Descriptive Statistics
    Center
    Spread
    Robust Statistics

5 Examples


Announcements

From now on, sit with your teams in class.

If there is someone from your team that you haven’t met yet, let me know.

If you weren’t able to log on to RStudio, and your name isn’t highlighted on the Google Doc, stop by after class.

Statistics 101 (Nicole Dalzell) U1 - L2: EDA July 1, 2014 2 / 63

Announcements Warm-Up and Data Basics

Review

Example Study:
A researcher divides 250 cats (adults and kittens) into two rooms, with adult cats in one room and baby kittens in the other room. Within each room she erects a fence, randomly placing half the cats (or kittens) on each side of the fence. On one side of the fence she scatters a variety of cat toys. For 1 day, the researcher records the number of hours each cat spends sleeping.

What is the research question?

What are the explanatory and response variables?

Is this an Experimental or Observational study?

What are the controls and treatments?

What kind of structure was given to the study (e.g., blocking or clustering), and why?



Types of Variables Example

Still our cat example:

Cat    Age        Toys    # of Naps    Weight (lbs)
1      adult      1       3            8
2      juvenile   1       5            9
3      adult      0       2            10.5
4      adult      1       8            12.25
...    ...        ...     ...          ...
250    adult      0       5            11.67

What types of variables are these:

Age?

Toys?

# of Naps?

Weight?

Announcements Exploring Data

Population to sample

It is usually not feasible to collect information on the entire population due to high costs of data collection, so statisticians instead work with samples that are (hopefully) representative of the populations they come from.

population

sample

We try to understand certain features of the population as a whole using summary statistics and graphs based on these samples.



Exploratory analysis to inference

When you taste a spoonful of soup and decide it doesn’t taste salty enough, that’s exploratory analysis.

If you generalize and conclude that your soup needs salt, that’s an inference.

For your inference to be valid, the spoonful you tasted (the sample) needs to be representative of the entire pot (the population).

If your spoonful comes only from the surface and the salt is collected at the bottom of the pot, what you tasted is probably not representative of the whole pot.

If you first stir the soup thoroughly before you taste, your spoonful will more likely be representative of the whole pot.



Random assignment vs. random sampling

Random sampling, random assignment (ideal experiment): causal conclusion, generalized to the whole population.

Random sampling, no random assignment (most observational studies): no causal conclusion, correlation statement generalized to the whole population.

No random sampling, random assignment (most experiments): causal conclusion, only for the sample.

No random sampling, no random assignment (bad observational studies): no causal conclusion, correlation statement only for the sample.

Random assignment: causation. No random assignment: correlation.

Random sampling: generalizability. No random sampling: no generalizability.



EDA Intro

EDA is important, because your analyses will depend on the trends and features of your data.

The distribution of a variable is a list of possible values the variable can take and how often it takes each of those values.

Distributions are critical to assessing the probability of events.

We often utilize descriptive statistics related to the center and spread of the data.

Plots are almost always useful for visualizing relationships and distributions in the data.

Example:
Do {5,5,5,5,5,5,5,5,5} and {1,2,3,4,5,6,7,8,9} have the same distribution? Why or why not?
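One way to explore the example is to tabulate both sets in R. This is a minimal sketch; the names a and b are illustrative and do not come from the slides.

```r
# The two data sets from the example; the names a and b are illustrative.
a <- rep(5, 9)   # {5,5,5,5,5,5,5,5,5}
b <- 1:9         # {1,2,3,4,5,6,7,8,9}

# A distribution lists each value and how often it occurs.
table(a)
table(b)

# Both sets have the same mean, yet the tables above differ.
mean(a)
mean(b)
```

The tables show a single value repeated nine times versus nine distinct values, so the same center does not imply the same distribution.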




Numerical Data Relationship between two numerical variables

Scatterplot

Scatterplots are useful for visualizing the relationship between two numerical variables.

Do life expectancy and total fertility appear to be associated or independent?

Was the relationship the same throughout the years, or did it change?


http://www.gapminder.org/world


Cars: ... vs. weight

From the cars data:

[Figure: two scatterplots from the cars data: miles per gallon (city rating) vs. weight (pounds), and price ($1000s) vs. weight (pounds).]

What do these scatterplots reveal about the data? How might they be useful?


Distribution of one numerical variable








Visualizing numerical variables

Histogram: Provides a view of the data density, and is especially convenient for describing the shape of the data distribution.

Box plot: Especially useful for displaying the median, quartiles, unusual observations, as well as the IQR.

Intensity map: Useful for displaying the spatial distribution.

Dot plot: Useful when individual values are of interest.



Why visualize?

What does a response of 0 mean in this distribution?

[Figure: plot of the number of drinks it takes students to get drunk, axis from 0 to 12.]



A response of 0 most likely means that the student doesn’t drink. It would be preferable to recode these responses as NAs.
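The recoding step could look like the following sketch in R. The vector drinks holds hypothetical responses, and treating 0 as "does not drink" is the assumption discussed above.

```r
# Hypothetical survey responses; 0 is assumed to mean "does not drink".
drinks <- c(0, 4, 6, 0, 5, 8, 3, 0, 10)

# Recode the zeros as NA so they are treated as missing values.
drinks[drinks == 0] <- NA

# Count the recoded responses and summarize the remaining ones.
n_missing <- sum(is.na(drinks))
mean_drinks <- mean(drinks, na.rm = TRUE)
n_missing
mean_drinks
```

After recoding, summaries need na.rm = TRUE (or an equivalent option) so that the missing values are skipped rather than propagated.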



Why visualize?

Describe the spatial distribution of race/ethnicity in the US.



http://demographics.coopercenter.org/DotMap/index.html


Why visualize?

And let’s take a closer look at Durham.



Why visualize?

[Figure: dot plot of weight, in ounces, axis from 0 to 4000.]

Do you see anything out of the ordinary?




Some people reported their weight in pounds.



Why visualize?

What type of variable is average number of hours of sleep per night? Is this reflected in the dot plot below? If not, what might be the reason?

[Figure: dot plot of average number of hours of sleep per night, axis from 4 to 9.]



Average number of hours of sleep per night is a continuous numerical variable. But responses are rounded, so there are only whole numbers and half hours in the data.


Stacked Dot Plot

Higher bars represent areas where there are more observations, which makes it a little easier to judge the center and the shape of the distribution.

[Figure: stacked dot plot of gpa.]



Histogram Construction

Order the data in ascending order:

2.900 2.910 3.000 3.100 3.100 3.150 3.150 3.150 3.200 3.200 3.250
3.294 3.300 3.300 3.330 3.350 3.350 3.400 3.400 3.400 3.400 3.400
3.400 3.400 3.410 3.450 3.460 3.500 3.500 3.500 3.500 3.550 3.560
3.600 3.600 3.600 3.600 3.610 3.630 3.650 3.680 3.700 3.700 3.700
3.700 3.700 3.700 3.750 3.750 3.750 3.750 3.785 3.790 3.800 3.800
3.800 3.800 3.800 3.840 3.840 3.840 3.860 3.868 3.900 3.900 3.900
3.900 3.925 3.925 3.970 3.970 4.000 4.000 4.000 4.000 4.300 4.300

Make a frequency table by counting how many observations fall in each bin. Let’s use a bin width of 0.1:

GPA     2.9 to 3   3 to 3.1   3.1 to 3.2   3.2 to 3.3   · · ·   3.8 to 3.9   3.9 to 4
Count   3          2          5            4            · · ·   9            8
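The frequency table can be reproduced in R along these lines. This is a sketch, not necessarily how the slide was produced: it assumes bins that include their right endpoint (which matches the counts shown), and it bins in thousandths of a GPA point so that values such as 3.1 are not misplaced by floating-point rounding. The names gpa_milli, breaks, and bins are illustrative.

```r
# GPA values copied from the ordered list above (77 observations).
gpa <- c(
  2.900, 2.910, 3.000, 3.100, 3.100, 3.150, 3.150, 3.150, 3.200, 3.200, 3.250,
  3.294, 3.300, 3.300, 3.330, 3.350, 3.350, 3.400, 3.400, 3.400, 3.400, 3.400,
  3.400, 3.400, 3.410, 3.450, 3.460, 3.500, 3.500, 3.500, 3.500, 3.550, 3.560,
  3.600, 3.600, 3.600, 3.600, 3.610, 3.630, 3.650, 3.680, 3.700, 3.700, 3.700,
  3.700, 3.700, 3.700, 3.750, 3.750, 3.750, 3.750, 3.785, 3.790, 3.800, 3.800,
  3.800, 3.800, 3.800, 3.840, 3.840, 3.840, 3.860, 3.868, 3.900, 3.900, 3.900,
  3.900, 3.925, 3.925, 3.970, 3.970, 4.000, 4.000, 4.000, 4.000, 4.300, 4.300
)

# Work in thousandths so that bin edges are exact whole numbers.
gpa_milli <- round(gpa * 1000)
breaks <- seq(2900, 4300, by = 100)   # bin width of 0.1 GPA points

# Bins include their right endpoint; the lowest value 2.9 is kept
# in the first bin via include.lowest = TRUE.
bins <- cut(gpa_milli, breaks = breaks, right = TRUE, include.lowest = TRUE,
            labels = paste(head(breaks, -1) / 1000, "to",
                           tail(breaks, -1) / 1000))

# Frequency table: number of observations per bin.
counts <- table(bins)
counts
```

Calling hist(gpa, breaks = seq(2.9, 4.3, by = 0.1)) draws the corresponding histogram with the same right-closed convention.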



Exploring Histograms

Link


http://mih5.github.io/statapps/histogram/histogram.html


Histograms

Higher bars represent areas where there are more observations. Histograms are preferable when the sample size is large, but they hide finer details like individual observations.

[Figure: histogram of gpa (frequency).]



Bin Width

Which one(s) of these histograms are useful? Which reveal too much about the data? Which hide too much?

[Figure: four histograms of extracurricular hrs / week (frequency), drawn with different bin widths.]



Density Curves

A Density Curve is a smoothed density histogram where the area under the curve is 1.

To draw a density curve from a histogram, simply connect the peaks of the histogram with a smooth line, and normalize the values of the y-axis such that the area under the curve is 1.
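The normalization can be checked numerically on a density histogram. The sketch below uses a small hypothetical sample x; it relies on hist() reporting bar heights on the density scale, so that height times bin width summed over all bins equals 1.

```r
# Hypothetical sample, used only to illustrate the normalization.
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 7)

# Compute the histogram without drawing it.
h <- hist(x, plot = FALSE)

# Area of each bar = density height * bin width; the areas sum to 1.
area <- sum(h$density * diff(h$breaks))
area
```

A smooth curve can then be overlaid with lines(density(x)) on a plot drawn with hist(x, freq = FALSE).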


Distribution of one numerical variable Describing distributions of numerical variables

Describing distributions of numerical variables

When describing distributions of numerical variables, always mention:

Shape: skewness, modality

Center: an estimate of a typical observation in the distribution (mean, median, mode, etc.)

Spread: measure of variability in the distribution (SD, IQR, range, etc.)

Unusual observations: observations that stand out from the rest of the data that may be suspected outliers


Distribution of one numerical variable Distribution Shapes

Describing Distributions

When describing distributions, make sure to talk about the shape, center, spread, and, if any, unusual observations.



Shape

How would you describe the shape of this distribution?

[Figure: histogram of average number of hours spent on school work per day.]




Unimodal and right skewed.



Describing Your Pictures

Bell Shaped: Data is bell shaped if the majority of the data is clustered around the center value (mean), with very few data points lying either far above or far below this value.

Right Skewed: Data is positively skewed if you have several unusually large data points creating a long tail to the right.

Left Skewed: Data is negatively skewed if you have several unusually small data points creating a long tail to the left.

Bimodal: Data is bimodal if it has two large clusters of data points.

Symmetric: Data is symmetric if the distribution looks like a mirror image around its center.

Uniformly Distributed: Data is evenly spread across all possible values.



Modality

The mode is defined as the most frequent observation in the data set. Does the histogram have a single prominent peak (unimodal), several prominent peaks (bimodal/multimodal), or no apparent peaks (uniform)?

[Figure: four histograms illustrating different modalities.]

In order to determine modality, it’s easiest to step back and imagine a density curve over the histogram. Use the limp spaghetti method.


Commonly observed shapes of distributions

modality: unimodal, bimodal, multimodal, uniform

skewness: right skew, left skew, symmetric








Participation question

Which of these variables do you expect to be uniformly distributed?

(a) weights of adult females

(b) salaries of a random sample of people from North Carolina

(c) house prices

(d) birthdays of classmates (day of the month)



Skewness

Is the histogram right skewed, left skewed, or symmetric?

[Figure: three histograms with different skewness.]

Histograms are said to be skewed to the side of the long tail.



Unusual Observations

Are there any unusual observations or potential outliers?

[Figure: two histograms to inspect for unusual observations.]



Application exercise: Shapes of distributions

Below are two histograms. One corresponds to the age at which a sample of people applied for marriage licenses; the other corresponds to the last digit of a sample of social security numbers. Which graph is which, and why?

Match the following variables with the histograms and bar graphs given below. These data represent Sta 101 students at Duke.

(a) the height of students

(b) gender breakdown of students

(c) the time it took students to get to their first class of the day

(d) the number of hours of sleep students received last night

(e) whether or not students live off campus

(f) the number of piercings students have




Descriptive Statistics Center

Measures of Center

The Mean of a dataset is what we commonly refer to as the average.

The Median of a dataset is the middle value of your data. You find the median by ordering your data from smallest to largest, then finding the value with 50% of the data below it and 50% above it.

The Trimmed Mean is the mean calculated after removing a few of the very large and very small observations.

What is the advantage of using the Median instead of the Mean?

What is the advantage of using the Mean instead of the Median?






Mean

The sample mean, denoted as x̄, can be calculated as

\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\text{Sum of Data Points}}{\text{Number of Data Points}}, \]

where x_1, x_2, ..., x_n represent the n observed values.

The population mean is a parameter computed the same way but is denoted as µ. It is often not possible to calculate µ since population data is rarely available.

x̄ is an estimate of µ based on the observed data.

The sample mean is a sample statistic, or a point estimate of the population mean. This estimate may not be perfect, but if the sample is good (representative of the population) it is usually a good guess.
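The formula for the sample mean translates directly into R. The values in x below are hypothetical and serve only to show that the sum divided by the count agrees with the built-in mean().

```r
# Hypothetical observations x_1, ..., x_n.
x <- c(0, 1, 2, 3, 4, 5)
n <- length(x)

# Sample mean: sum of data points divided by number of data points.
x_bar <- sum(x) / n
x_bar

# The built-in function gives the same value.
same_as_builtin <- isTRUE(all.equal(x_bar, mean(x)))
same_as_builtin
```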



Median

The median is the value that splits the data in half when ordered in ascending order.

\[ 0, 1, 2, 3, 4 \rightarrow 2 \]

If there are an even number of observations, then the median is the average of the two values in the middle.

\[ 0, 1, 2, 3, 4, 5 \rightarrow \frac{2 + 3}{2} = 2.5 \]

Since the median is the midpoint of the data, 50% of the values are below it. Hence, it is also the 50th percentile.
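Both cases from this slide can be checked in R. The names odd_n and even_n are illustrative.

```r
# The two small data sets from the slide.
odd_n  <- c(0, 1, 2, 3, 4)       # odd number of observations
even_n <- c(0, 1, 2, 3, 4, 5)    # even number of observations

# Odd n: the middle value. Even n: average of the two middle values.
med_odd  <- median(odd_n)
med_even <- median(even_n)
med_odd
med_even

# The median is also the 50th percentile.
p50 <- unname(quantile(even_n, 0.5))
p50
```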



Mean vs. Median

If the distribution is symmetric, center is the mean
    Symmetric: mean ≈ median

If the distribution is skewed or has outliers, center is the median
    Right-skewed: mean > median
    Left-skewed: mean < median

[Figure: three distributions (right-skewed, left-skewed, symmetric) with the mean and median marked.]



Are you typical?

How useful are centers alone for conveying the true characteristics of a distribution?


http://www.youtube.com/watch?v=4B2xOvKFFz4

Descriptive Statistics Spread

Measures of Spread

The population Variance, σ², is roughly the average squared deviation of the observations from the mean.

The population Standard Deviation, σ, is the square root of the variance.

The Interquartile Range (IQR) measures the spread of the middle 50% of your data, and is visually depicted in box plots.



Deviation

The distance of an observation from the mean is its deviation: x_i − x̄.

sort(d$sleep)
 [1] 1 1 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[30] 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
[59] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 7 7 7 7 7 7 7 7 8 9 9 9
mean(d$sleep)
[1] 4.6

\[
\begin{aligned}
x_1 - \bar{x} &= 1 - 4.6 = -3.6 \\
x_2 - \bar{x} &= 1 - 4.6 = -3.6 \\
x_3 - \bar{x} &= 2 - 4.6 = -2.6 \\
&\;\;\vdots \\
x_{86} - \bar{x} &= 9 - 4.6 = 4.4
\end{aligned}
\]
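The deviations can be computed for all 86 observations at once. In this sketch the vector sleep is rebuilt from the sorted output above (the original data frame d is not available), and the unrounded mean is used, so the results differ slightly from the slide's values based on 4.6.

```r
# Sleep values rebuilt from the sorted output: value k appears times[k] times.
sleep <- rep(1:9, times = c(2, 3, 24, 2, 41, 2, 8, 1, 3))
n <- length(sleep)   # 86 observations

# Sample mean (about 4.6) and the deviation of every observation from it.
x_bar <- mean(sleep)
deviations <- sleep - x_bar

# First three deviations (for the values 1, 1, 2) and the last one (for 9).
head(deviations, 3)
tail(deviations, 1)

# Deviations from the mean always sum to (numerically) zero.
total <- sum(deviations)
total
```

Because the deviations cancel out, their plain average is not a useful measure of spread.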



Variance

Variance, s^2

Roughly the average squared deviation from the mean

\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]

Given that the average number of hours students sleep per night is 7.029, the variance of amount of sleep students get per night can be calculated as:

\[ s^2 = \frac{(7.5 - 7.029)^2 + (7 - 7.029)^2 + \cdots + (8 - 7.029)^2}{106 - 1} = 0.72 \]



Variance (cont.)

Why do we use the squared deviation in the calculation of variance?

To get rid of negatives so that observations equally distant from the mean are weighted equally.

To weight larger deviations more heavily.



Application exercise: Variability




Order histograms A, B, and C from least to most variable. Explain your reasoning.

Between histograms D and E, which exhibits more variability? Explain your reasoning.



Variability vs. diversity

Which of the following sets of cars has more diverse composition of colors?

Set 1:

Set 2:



Set 1: more diverse

Set 2: less diverse


Variability vs. diversity (cont.)

Which of the following sets of cars has more variable mileage?

Set 1:

Set 2:



Set 1: less variable

Set 2: more variable

[Figure: plots of mileage for Set 1 and Set 2 on a common axis from 10 to 60.]



Standard deviation

Standard deviation, s

Roughly the deviation around the mean, calculated as the square root of the variance; it has the same units as the data.

\[ s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]

The standard deviation of the amount of sleep students get per night can be calculated as:

\[ s = \sqrt{0.72} = 0.85 \text{ hours} \]

Students sleep 7.029 hours on average, give or take 0.85 hours.
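The variance and standard deviation formulas can be applied step by step in R. The 106 individual values behind s^2 = 0.72 are not listed in the slides, so this sketch instead assumes the 86 sleep values from the Deviation slide, rebuilt with rep().

```r
# Sleep values rebuilt from the sorted output on the Deviation slide
# (an assumption: this is not the 106-student data set used above).
sleep <- rep(1:9, times = c(2, 3, 24, 2, 41, 2, 8, 1, 3))
n <- length(sleep)
x_bar <- mean(sleep)

# Variance: sum of squared deviations divided by n - 1.
s2 <- sum((sleep - x_bar)^2) / (n - 1)

# Standard deviation: square root of the variance.
s <- sqrt(s2)
c(variance = s2, sd = s)

# The built-in functions use the same n - 1 denominator.
matches_builtin <- isTRUE(all.equal(s2, var(sleep))) &&
  isTRUE(all.equal(s, sd(sleep)))
matches_builtin
```

For these 86 values the standard deviation comes out at about 1.66 hours.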



Standard Deviation

The standard deviation gives a rough estimate of the typical distance of a data value from the mean.

The larger the standard deviation, the more variability there is in the data and the more spread out the data are.

[Figure: two frequency histograms on a common axis from −15 to 15: "Standard Deviation of 2", rnorm(100, 0, 2), and "Standard Deviation of 4", rnorm(100, 0, 4).]



Variability in Student Sleep

sleep, x̄ = 4.6, s_x = 1.66

[Figure: dot plot of sleep.]

69 out of 86 students (80%) are within 1 SD of the mean.

80 out of 86 students (93%) are within 2 SDs of the mean.

86 out of 86 students (100%) are within 3 SDs of the mean.
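These counts can be reproduced in R. The sketch assumes the 86 sleep values listed on the Deviation slide, rebuilt with rep(); within_k is a hypothetical helper name.

```r
# Sleep values rebuilt from the sorted output on the Deviation slide.
sleep <- rep(1:9, times = c(2, 3, 24, 2, 41, 2, 8, 1, 3))
x_bar <- mean(sleep)
s <- sd(sleep)

# Number of observations within k standard deviations of the mean.
within_k <- function(k) sum(abs(sleep - x_bar) <= k * s)

counts <- sapply(1:3, within_k)
counts

# The same counts as percentages of all observations.
round(100 * counts / length(sleep))
```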



95% Rule

95% Rule: If a distribution of data is approximately symmetric and bell-shaped, about 95% of the data should fall within two standard deviations of the mean.

For a population, about 95% of the data will be between µ − 2σ and µ + 2σ.

http://rchsbowman.files.wordpress.com/2008/09/empirical-rule-3.jpg


Notation Recap

             mean   variance   SD
sample       x̄      s²         s
population   µ      σ²         σ

Do you see a trend in what types of letters are used for sample statistics vs. population parameters?

Latin letters for sample statistics, Greek letters for population parameters.



Z-Scores

Z-Score
The z-score for a data value, x, is

\[ z = \frac{x - \bar{x}}{s} \]

For a population, x̄ is replaced with µ and s is replaced with σ.

Values farther from 0 are more extreme.



Z-Scores: Why?

A z-score puts values on a common scale.

A z-score is the number of standard deviations a value falls from the mean.

For bell-shaped distributions, about 95% of all z-scores fall between −2 and 2.

z-scores beyond −2 or 2 can be considered extreme.



Z-Scores: Example

Which is better, (A) an ACT score of 28 or (B) a combined SAT score of 2100? Assume ACT and SAT scores have approximately bell-shaped distributions.

ACT: µ = 21, σ = 5

SAT: µ = 1500, σ = 325

ACT:

\[ z = \frac{28 - 21}{5} = \frac{7}{5} = 1.4 \]

SAT:

\[ z = \frac{2100 - 1500}{325} = \frac{600}{325} = 1.85 \]
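The two z-scores can be computed with a small helper. The function name z_score is hypothetical; the means and standard deviations are the ones given in the example.

```r
# z-score: number of standard deviations a value lies from the mean.
z_score <- function(x, mu, sigma) (x - mu) / sigma

z_act <- z_score(28, mu = 21, sigma = 5)
z_sat <- z_score(2100, mu = 1500, sigma = 325)

# Both scores on the common z scale, rounded to two decimals.
round(c(ACT = z_act, SAT = z_sat), 2)

# The score with the larger z-score is further above its mean.
z_sat > z_act
```

On this common scale the SAT score lies further above its mean than the ACT score does.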

[Figure: histogram of z-scores (frequency), axis from −3 to 3.]



Other Measures of Location

The 25th percentile is also called the first quartile, Q1.

The 50th percentile is also called the median.

The 75th percentile is also called the third quartile, Q3.

summary(d$study_hours)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   3.00   10.00   15.00   17.42   20.00   40.00   13.00

Between Q1 and Q3 is the middle 50% of the data. The range these data span is called the interquartile range, or the IQR.

IQR = 20 − 10 = 10

Is the range or the IQR more robust to outliers?

IQR



Participation question

Which of the following is false about the distribution of average number of hours students study daily?

[Figure: box plot of average number of hours students study daily, axis from 2 to 10.]

Min.    1st Qu.   Median   Mean    3rd Qu.   Max.
1.000   3.000     4.000    3.821   5.000     10.000

(a) There are no students who don’t study at all.

(b) 75% of the students study more than 5 hours daily, on average.

(c) 25% of the students study less than 3 hours, on average.

(d) IQR is 2 hours.


Box Plot

The box in a box plot represents the middle 50% of the data, and the thick line in the box is the median.

[Figure: box plot of # of study hours / week, axis from 10 to 40.]



Anatomy of a Box Plot

[Figure: box plot of # of study hours / week (axis from 0 to 40), with these parts labeled: lower whisker, Q1 (first quartile), median, Q3 (third quartile), upper whisker, max whisker reach, suspected outliers.]



Whiskers and Outliers

Whiskers of a box plot can extend up to 1.5 * IQR away from the quartiles.

max upper whisker reach: Q3 + 1.5 * IQR = 20 + 1.5 * 10 = 35

max lower whisker reach: Q1 − 1.5 * IQR = 10 − 1.5 * 10 = −5

An outlier is defined as an observation beyond the maximum reach of the whiskers. It is an observation that appears extreme relative to the rest of the data.
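The whisker limits follow directly from the quartiles. The sketch below uses Q1 = 10 and Q3 = 20 from the study hours summary; the vector hours is a hypothetical handful of values used only to show how observations beyond the limits are flagged.

```r
# Quartiles taken from the study hours summary.
q1 <- 10
q3 <- 20
iqr <- q3 - q1

# Maximum reach of the whiskers.
upper_reach <- q3 + 1.5 * iqr
lower_reach <- q1 - 1.5 * iqr
c(lower = lower_reach, upper = upper_reach)

# Hypothetical observations: anything beyond the reach is a suspected outlier.
hours <- c(3, 10, 15, 20, 36, 40)
outliers <- hours[hours < lower_reach | hours > upper_reach]
outliers
```

With real data, boxplot.stats(x)$out applies a very similar rule, although R's box plot code uses hinges, which can differ slightly from the quartiles reported by summary().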



Outliers (cont.)

Why is it important to look for outliers?

Identify extreme skew in the distribution.

Identify data collection and entry errors.

Provide insight into interesting features of the data.



Example: Visualizing

[Figure: plot with axis from 0 to 12.]

A value of 0 most likely means that the student doesn’t drink. It would be preferable to recode these values as NAs.


Descriptive Statistics Robust Statistics

Extreme observations

How would sample statistics such as mean, median, SD, and IQR of household income be affected if the largest value was replaced with $10 million? What if the smallest value was replaced with $10 million?

[Figure: dot plot of household income ($ thousands), axis from 0 to 1000.]



Income Example

[Figure: dot plot of household income ($ thousands), axis from 0 to 1000.]

                                 robust            not robust
scenario                         median   IQR      x̄         s
original data                    165K     150K     211K      180K
move largest to $10 million      165K     150K     398K      1,422K
move smallest to $10 million     190K     163K     4,186K    1,424K
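The same comparison can be tried on a small example. The income values below are hypothetical (the household income data behind the table are not listed in the slides), and summarize is an illustrative helper name.

```r
# Hypothetical household incomes in $ thousands (not the data from the slide).
income <- c(20, 60, 110, 165, 200, 320, 1000)

# Center and spread, robust and non-robust versions.
summarize <- function(x) {
  c(median = median(x), IQR = IQR(x), mean = mean(x), sd = sd(x))
}

# Replace the largest value with $10 million (10000 in $ thousands).
largest_moved <- replace(income, which.max(income), 10000)

rbind(original = summarize(income),
      largest_moved = summarize(largest_moved))
```

In this sketch the median and IQR are identical in both rows, while the mean and SD change substantially.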



Income Example

Median and IQR are more robust to skewness and outliers than mean and SD. Therefore,

for skewed distributions it is more appropriate to use median and IQR to describe the center and spread

for symmetric distributions it is more appropriate to use the mean and SD to describe the center and spread



Robust/resistant or Not?

Robust/Resistant
A statistic is called resistant or robust if it is relatively unaffected by extreme values.

Measures of Center:

Mean (Not Robust)

Median (Robust)

Measures of Spread:

Standard Deviation (Not Robust)

IQR (Robust)

Range (Not Robust)

Most often, we use the mean and the standard deviation, because they are calculated based on all the data values, so they use all the available information.

Examples






Application exercise: Shapes of distributions

Below are two histograms. One corresponds to the age at which a sample of people applied for marriage licenses; the other corresponds to the last digit of a sample of social security numbers. Which graph is which, and why?

Match the following variables with the histograms and bar graphs given below. These data represent Sta 101 students at Duke.

(a) the height of students

(b) gender breakdown of students

(c) the time it took students to get to their first class of the day

(d) the number of hours of sleep students received last night

(e) whether or not students live off campus

(f) the number of piercings students have

