
Stat 31, Section 1, Last Time

• Distributions (how are data “spread out”?)

• Visual Display: Histograms

• Binwidth is critical

• Bivariate display: scatterplot

• Course Organization & Website: https://www.unc.edu/%7Emarron/UNCstat31-2005/Stat31sec1Home.html

Exploratory Data Analysis 4

“Time Plots”, i.e. “Time Series”:

Idea: when time structure is important,

plot variable as a function of time:

(Sketch: axes with “variable” on the vertical axis and “time” on the horizontal axis)

Often useful to “connect the dots”

Class Time Series Example

Monthly Airline Passenger Numbers: https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg5Done.xls

• Increasing Trend (long term growth, over years)

• Increasing Variation (appears proportional to trend)

• “Seasonal Effect” - 12 Month Cycle (Peak in summer, less in winter)

Airline Passengers Example

Interesting variation: log transformation

• Stabilizes variation

• Since log of product is sum (multiplicative effects become additive)

• Shows changing variation was proportional to trend

• Log10 is “most interpretable”

(log10(1000) = 3, …)

• Generally useful trick (there are others)
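The effect of the log transformation can be checked numerically. The sketch below does not use the class airline data: it assumes a made-up series whose trend doubles each "year" and whose seasonal pattern multiplies the trend. The names `seasonal`, `trend_by_year` and `yearly_ranges` are illustrative only.

```python
import math

# Hypothetical series (NOT the airline data): a trend that doubles each
# year, multiplied by a fixed 4-point seasonal pattern.
seasonal = [0.8, 1.0, 1.25, 1.0]
trend_by_year = [100, 200, 400]
series = [t * s for t in trend_by_year for s in seasonal]

def yearly_ranges(values, period):
    """Range (max - min) within each consecutive block of `period` values."""
    blocks = [values[i:i + period] for i in range(0, len(values), period)]
    return [max(b) - min(b) for b in blocks]

raw_ranges = yearly_ranges(series, 4)
log_ranges = yearly_ranges([math.log10(v) for v in series], 4)

print(raw_ranges)  # variation grows along with the trend
print(log_ranges)  # variation is (nearly) the same in every year
```

On the raw scale the within-year variation doubles along with the trend. On the log10 scale it stays constant, because the seasonal factor becomes an added term.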

Airline Passengers Example

A look under the hood: https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg5Raw.xls

• Use Chart Wizard

• Chart Type: Line (or could do XY)

• Use subtype for points & lines

• Use menu for first log10

• Although could just type it in

• Drag down to repeat for whole column

Time Series HW

HW: 1.36, 1.37

• Use EXCEL

Exploratory Data Analysis 5

Numerical Summaries of Quant. Variables:

Idea: Summarize distributional information

(“center”, “spread”, “skewed”)

In Text, Sec. 1.2

for data $x_1, x_2, \ldots, x_n$

(subscripts allow “indexing numbers” in list)

Numerical Summaries

A. “Centers” (note there are several)

1. “Mean” = Average = $\bar{x} = \frac{x_1 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$

• Greek letter “Sigma” ($\Sigma$), for “sum”

In EXCEL, use “AVERAGE” function

Numerical Summaries of Center

2. “Median” = Value in middle (of sorted list)

Unsorted e.g.: 3, 1, 27, 2, 0 (is 27 “in middle”? no)

Sorted e.g.: 0, 1, 2, 3, 27 (2 is a better “middle”!)

EXCEL: use function “MEDIAN”
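For readers working outside EXCEL, here is a minimal sketch of the same two "centers" for the five-number list on this slide. It uses Python's standard `statistics` module. The course itself uses EXCEL, so the choice of Python is an assumption made for illustration.

```python
import statistics

data = [3, 1, 27, 2, 0]            # the unsorted list from the slide

mean = sum(data) / len(data)       # (1/n) * (x_1 + ... + x_n)
median = statistics.median(data)   # middle value of the sorted list

print(sorted(data))  # [0, 1, 2, 3, 27]
print(mean)          # 6.6
print(median)        # 2
```

The mean (6.6) is larger than four of the five data values, while the median (2) sits in the middle of the sorted list.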

Difference Betw’n Mean & Median

Symmetric Distribution: Essentially no difference

Right Skewed:

(Sketch: right-skewed density; the median M splits it into 50% area and 50% area)

$\bar{x}$ bigger since “feels tails more strongly”


Outliers (unusual values):

Nice Web Example:

http://www.stat.sc.edu/~west/applets/box.html

• Mean feels outliers much more strongly

• Leaves “range of most of data”

• Good notion of “center”? (perhaps not)

• Median affected very minimally

• Robustness Terminology:

Median is “resistant to the effect of outliers”
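The applet's message can also be seen in a small numerical experiment. The sketch below assumes a made-up list with one value that is pushed farther and farther out. The helper name `centers_with_extreme` is hypothetical.

```python
import statistics

def centers_with_extreme(extreme):
    """Mean and median of a small list whose largest value is `extreme`."""
    data = [0, 1, 2, 3, extreme]
    return statistics.mean(data), statistics.median(data)

for extreme in [4, 27, 1000]:
    mean, median = centers_with_extreme(extreme)
    print(extreme, mean, median)
```

As the largest value moves from 4 to 1000, the mean is dragged along with it and ends up far from the other four values, while the median stays at 2.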


A more flexible web example:

http://www.ruf.rice.edu/~lane/stat_sim/descriptive/index.html

• Get various dist’ns, by manipulating bar heights

• See Mean, Median and more

• Similar for symmetric distributions

• Very different when skewed

• “Big Gap”, can make median jump a lot

• But mean is less sensitive (more “continuous”)
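The “Big Gap” effect can be sketched with two made-up lists that differ in a single point. The data are hypothetical and chosen only to make the gap obvious.

```python
import statistics

before = [0, 0, 0, 10, 10]    # three low values, two high values
after = [0, 0, 10, 10, 10]    # one point moved across the gap

for data in (before, after):
    print(statistics.median(data), statistics.mean(data))
```

Moving one point across the gap makes the median jump from 0 to 10, while the mean moves only from 4 to 6.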

Numerical Centerpoint HW

HW: 1.49 a (but make histograms), b

• Use EXCEL

Numerical Summaries (cont.)

B. “Spreads” (again there are several)

1. Range = biggest - smallest

(Sketch: the range spans from the smallest $x_i$ to the largest $x_i$)

Problems:

• Feels only “outliers”

• Not “bulk of data”

• Very non-resistant to outliers

Numerical Summaries of Spread

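2. Variance = $s^2 = \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

= “average squared distance to $\bar{x}$”

EXCEL: VAR

Drawback: units are wrong

e.g. for $x_i$ in feet, $s^2$ is in square feet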


3. Standard Deviation = $s = \sqrt{s^2}$

EXCEL: STDEV

• Scale is right

• But not resistant to outliers

• Will use quite a lot later

(for reasons described later)
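As a cross-check on the variance and standard deviation formulas, the sketch below computes both by hand and compares them with Python's `statistics.variance` and `statistics.stdev`, which use the same $n - 1$ divisor. The data are the small list from the median slide, reused here only for illustration.

```python
import math
import statistics

data = [3, 1, 27, 2, 0]
n = len(data)
xbar = sum(data) / n

# s^2 = (1 / (n - 1)) * sum of squared distances to the mean
variance = sum((x - xbar) ** 2 for x in data) / (n - 1)
# s = square root of s^2, so it is back in the original units
sd = math.sqrt(variance)

print(round(variance, 1), round(sd, 2))

# The standard library uses the same n - 1 divisor
print(math.isclose(variance, statistics.variance(data)))
print(math.isclose(sd, statistics.stdev(data)))
```

The single large value 27 contributes most of the sum of squares, which illustrates why neither measure is resistant to outliers.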

Interactive View of S. D.

Revisit flexible web example: http://www.ruf.rice.edu/~lane/stat_sim/descriptive/index.html

• Note SD range centered at mean

• Can put SD “right near middle” (densely packed data)

• Can put SD at “edges of data” (U shaped data)

• Can put SD “outside of data” (big spike + outlier)

• But generally “sensible measure of spread”

Variance – S. D. HW

HW: for both data sets in 1.49, find the:

i. Variance (698.9, 1079)

ii. Standard Deviation (26.4, 32.9)

• Use EXCEL


4. Interquartile Range = IQR

Based on “quartiles”, Q1 and Q3

(idea: show the points that are 25% & 75% of the way “through the data”)

25% 25% 25% 25%

Q1 Q2 = median Q3

IQR = Q3 – Q1
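A minimal sketch of the quartiles and IQR, on a made-up list of nine values. Quartile conventions differ between textbooks and software. The `method="inclusive"` option of Python's `statistics.quantiles` is used here on the assumption that it follows the same interpolation rule as EXCEL's QUARTILE. Other conventions can give slightly different answers.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # hypothetical data

# n=4 asks for the three cut points that split the data into quarters
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(q1, q2, q3)  # Q1, median, Q3
print(iqr)         # IQR = Q3 - Q1
```

For this list the cut points fall exactly on data values: Q1 = 3, median = 5, Q3 = 7, so IQR = 4.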

Quartiles Example

Revisit flexible web example: https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg6Done.xls

• Right skewness gives:

– Median < Mean

(mean “feels farther points more strongly”)

– Q1 near median

– Q3 quite far

(makes sense from histogram)

Quartiles Example

A look under the hood: https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg6Raw.xls

• Can compute as separate functions for each

• Or use:

Tools → Data Analysis → Descriptive Stats

• Which gives many other measures as well

• Use “k-th largest & smallest” to get quartiles

5 Number Summary

1. Minimum

2. Q1 - 1st Quartile

3. Median

4. Q3 - 3rd Quartile

5. Maximum

Summarize Information About:

a) Center - from 3

b) Spread - from 2 & 4 (maybe 1 & 5)

c) Skewness - from 2, 3 & 4

d) Outliers - from 1 & 5

5 Number Summary

How to Compute? https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg6Done.xls

• EXCEL function QUARTILE

• “One stop shopping”

• IQR seems to need explicit calculation
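The same “one stop shopping” can be sketched in Python. The function name `five_number_summary` is hypothetical, and the quartile convention (`method="inclusive"`) is an assumption chosen to mirror EXCEL's QUARTILE. The data are the small list from the median slide, not the stamps data.

```python
import statistics

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, med, q3, max(data)

summary = five_number_summary([3, 1, 27, 2, 0])
print(summary)

# As in EXCEL, the IQR needs one explicit subtraction: Q3 - Q1
iqr = summary[3] - summary[1]
print(iqr)
```

Here Q1 = 1, median = 2 and Q3 = 3 sit close together, while the maximum 27 is far away. That pattern points to right skewness or an outlier.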

Rule for Defining “Outliers”

Caution: There are many of these

Textbook version:

Above Q3 + 1.5 * IQR

Below Q1 – 1.5 * IQR

For stamps data: https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg6Done.xls

– No outliers at “low end”

– Some at “high end”
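The textbook rule translates directly into a short check. The sketch below uses made-up data, not the stamps data. The function name `iqr_outliers` is hypothetical, and the quartiles again use the `method="inclusive"` convention, so a different quartile convention could flag slightly different points.

```python
import statistics

def iqr_outliers(data):
    """Values above Q3 + 1.5 * IQR or below Q1 - 1.5 * IQR."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_cutoff = q1 - 1.5 * iqr
    high_cutoff = q3 + 1.5 * iqr
    return [x for x in data if x < low_cutoff or x > high_cutoff]

data = [1, 2, 3, 4, 5, 6, 7, 8, 30]   # hypothetical data
print(iqr_outliers(data))
```

For this list Q1 = 3 and Q3 = 7, so the cutoffs are -3 and 13. Only the value 30 is flagged, at the “high end”.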

Box Plot

• Additional Visual Display Device

• Again legacy from pencil & paper days

• Not supported in EXCEL

• We will skip

5 Number Sum. & Outliers HW

1.49 c, d

1.46 and add:

(d) How much does the mean change if you

omit Montana and Wyoming?
