Stat 31, Section 1, Last Time
• Distributions (how are data “spread out”?)
• Visual Display: Histograms
• Binwidth is critical
• Bivariate display: scatterplot
• Course Organization & Website: https://www.unc.edu/%7Emarron/UNCstat31-2005/Stat31sec1Home.html
Exploratory Data Analysis 4
“Time Plots”, i.e. “Time Series”:
Idea: when time structure is important,
plot variable as a function of time:
[Plot: variable (vertical axis) against time (horizontal axis)]
Often useful to “connect the dots”
Class Time Series Example
Monthly Airline Passenger Numbers: https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg5Done.xls
• Increasing Trend (long term growth, over years)
• Increasing Variation (appears proportional to trend)
• “Seasonal Effect” - 12 Month Cycle (Peak in summer, less in winter)
Airline Passengers Example
Interesting variation: log transformation
• Stabilizes variation
• Since log of product is sum
• Shows changing variation prop’l to trend
• Log10 is “most interpretable”
(log10(1000) = 3, …)
• Generally useful trick (there are others)
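A minimal Python sketch of why the log stabilizes variation. The series here is made up for illustration (a trend multiplied by a seasonal factor); it is not the airline data, and the names `trend` and `seasonal` are assumptions.

```python
import math

# Hypothetical series: the trend doubles each year and the
# seasonal factor swings 20% below and above the trend.
trend = [100, 200, 400, 800]
seasonal = [0.8, 1.2]  # low season, high season

raw_swings = []
log_swings = []
for t in trend:
    low, high = t * seasonal[0], t * seasonal[1]
    # On the raw scale the seasonal swing grows with the trend.
    raw_swings.append(high - low)
    # log10(t * f) = log10(t) + log10(f), so the swing no longer depends on t.
    log_swings.append(math.log10(high) - math.log10(low))

print(raw_swings)
print([round(s, 4) for s in log_swings])
```

On the log10 scale every seasonal swing has the same size, log10(1.2 / 0.8), because the product "trend times seasonal factor" becomes a sum.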
Airline Passengers Example
A look under the hood: https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg5Raw.xls
• Use Chart Wizard
• Chart Type: Line (or could do XY)
• Use subtype for points & lines
• Use menu for first log10
• Although could just type it in
• Drag down to repeat for whole column
Time Series HW
HW: 1.36, 1.37
• Use EXCEL
Exploratory Data Analysis 5
Numerical Summaries of Quant. Variables:
Idea: Summarize distributional information
(“center”, “spread”, “skewed”)
In Text, Sec. 1.2
for data $x_1, x_2, \ldots, x_n$
(subscripts allow “indexing numbers” in list)
Numerical Summaries
A. “Centers” (note there are several)
1. “Mean” = Average = $\bar{x} = \frac{x_1 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$
• Greek letter “Sigma” ($\Sigma$), for “sum”
In EXCEL, use “AVERAGE” function
Numerical Summaries of Center
2. “Median” = Value in middle (of sorted list)
Unsorted E.g:                 Sorted E.g:
   3                             0
   1                             1
  27  “in middle”? (no)          2  better “middle”!
   2                             3
   0                            27
EXCEL: use function “MEDIAN”
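For readers working outside EXCEL, the same two centers can be sketched in Python on the five-number list from this slide (the variable names are illustrative):

```python
import statistics

data = [3, 1, 27, 2, 0]           # the unsorted list from the slide

mean = sum(data) / len(data)      # (1/n) * sum of the x_i
median = statistics.median(data)  # sorts internally, takes the middle value

print(mean)    # 6.6
print(median)  # 2
```

The single large value 27 pulls the mean up to 6.6, while the median stays at 2.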
Difference Betw’n Mean & Median
Symmetric Distribution: Essentially no difference
Right Skewed:
[Sketch: right-skewed density, 50% area on each side of the median M]
$\bar{x}$ bigger than M since mean “feels tails more strongly”
Difference Betw’n Mean & Median
Outliers (unusual values):
Nice Web Example:
http://www.stat.sc.edu/~west/applets/box.html
• Mean feels outliers much more strongly
• Leaves “range of most of data”
• Good notion of “center”? (perhaps not)
• Median affected very minimally
• Robustness Terminology:
Median is “resistant to the effect of outliers”
Difference Betw’n Mean & Median
A more flexible web example:
http://www.ruf.rice.edu/~lane/stat_sim/descriptive/index.html
• Get various dist’ns, by manipulating bar heights
• See Mean, Median and more
• Similar for symmetric distributions
• Very different when skewed
• “Big Gap”, can make median jump a lot
• But mean is less sensitive (more “continuous”)
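The “Big Gap” behavior can be sketched with a small made-up data set (not taken from the web example): moving one point across the gap makes the median jump, while the mean shifts only a little.

```python
import statistics

# Hypothetical data with a big gap between the values 0 and 10.
before = [0, 0, 0, 10, 10]
after = [0, 0, 10, 10, 10]   # one point moved across the gap

median_before = statistics.median(before)
median_after = statistics.median(after)
mean_before = sum(before) / len(before)
mean_after = sum(after) / len(after)

print(median_before, mean_before)  # 0 4.0
print(median_after, mean_after)    # 10 6.0
```

Here the median jumps the full width of the gap (0 to 10), while the mean moves from 4 to 6.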
Numerical Centerpoint HW
HW: 1.49 a (but make histograms), b
• Use EXCEL
Numerical Summaries (cont.)
B. “Spreads” (again there are several)
1. Range = biggest - smallest = $\max_i x_i - \min_i x_i$
Problems:
• Feels only “outliers”
• Not “bulk of data”
• Very non-resistant to outliers
Numerical Summaries of Spread
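2. Variance = $s^2 = \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
= “average squared distance to $\bar{x}$”
EXCEL: VAR
Drawback: units are wrong
e.g. For $x_i$ in feet, $s^2$ is in square feet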
Numerical Summaries of Spread
3. Standard Deviation = $s = \sqrt{s^2}$
EXCEL: STDEV
• Scale is right
• But not resistant to outliers
• Will use quite a lot later
(for reasons described later)
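A short Python sketch of the three spread measures so far, applied to the small list 3, 1, 27, 2, 0 from the median slide. It assumes the divisor n - 1 in the variance, which is what EXCEL's VAR and STDEV use; the variable names are illustrative.

```python
import math

data = [3, 1, 27, 2, 0]
n = len(data)
xbar = sum(data) / n

data_range = max(data) - min(data)                  # biggest - smallest
s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)   # variance, divisor n - 1
s = math.sqrt(s2)                                   # standard deviation

print(data_range)
print(round(s2, 1))
print(round(s, 2))
```

The variance comes out in squared units of the data, and taking the square root puts the standard deviation back on the original scale.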
Interactive View of S. D.
Revisit flexible web example: http://www.ruf.rice.edu/~lane/stat_sim/descriptive/index.html
• Note SD range centered at mean
• Can put SD “right near middle” (densely packed data)
• Can put SD at “edges of data” (U shaped data)
• Can put SD “outside of data” (big spike + outlier)
• But generally “sensible measure of spread”
Variance – S. D. HW
HW: for both data sets in 1.49, find the:
i. Variance (698.9, 1079)
ii. Standard Deviation (26.4, 32.9)
• Use EXCEL
Numerical Summaries of Spread
4. Interquartile Range = IQR
Based on “quartiles”, Q1 and Q3
(idea: shows where are 25% & 75% “through the data”)
25% 25% 25% 25%
Q1 Q2 = median Q3
IQR = Q3 – Q1
Quartiles Example
Revisit flexible web example: https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg6Done.xls
• Right skewness gives:
– Median < Mean
(mean “feels farther points more strongly”)
– Q1 near median
– Q3 quite far
(makes sense from histogram)
Quartiles Example
A look under the hood: https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg6Raw.xls
• Can compute as separate functions for each
• Or use:
Tools → Data Analysis → Descriptive Stats
• Which gives many other measures as well
• Use “k-th largest & smallest” to get quartiles
5 Number Summary
1. Minimum
2. Q1 - 1st Quartile
3. Median
4. Q3 - 3rd Quartile
5. Maximum
Summarize Information About:
a) Center - from 3
b) Spread - from 2 & 4 (maybe 1 & 5)
c) Skewness - from 2, 3 & 4
d) Outliers - from 1 & 5
5 Number Summary
How to Compute? https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg6Done.xls
• EXCEL function QUARTILE
• “One stop shopping”
• IQR seems to need explicit calculation
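A minimal Python sketch of the 5 number summary and the IQR, using the small list 3, 1, 27, 2, 0 from the median slide rather than the spreadsheet data. It assumes the interpolating quartile convention (`method="inclusive"` in Python's `statistics.quantiles`); other conventions, such as taking the median of each half of the sorted list, can give different quartiles on short lists.

```python
import statistics

data = [3, 1, 27, 2, 0]

# quantiles() sorts the data and returns the three cut points Q1, Q2, Q3.
# method="inclusive" interpolates between sorted values (an assumed convention).
q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")

five_number = (min(data), q1, med, q3, max(data))
iqr = q3 - q1   # explicit calculation, as on the slide

print(five_number)  # (0, 1.0, 2.0, 3.0, 27)
print(iqr)          # 2.0
```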
Rule for Defining “Outliers”
Caution: There are many of these
Textbook version:
Above Q3 + 1.5 * IQR
Below Q1 – 1.5 * IQR
For stamps data: https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg6Done.xls
– No outliers at “low end”
– Some at “high end”
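The 1.5 * IQR rule can be sketched in Python as below. The helper name `outliers_1p5_iqr` is hypothetical, the quartiles use the interpolating (“inclusive”) convention as an assumption, and the sample is the small list from the median slide, not the stamps data.

```python
import statistics

def outliers_1p5_iqr(values):
    """Return the values above Q3 + 1.5 * IQR or below Q1 - 1.5 * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

sample = [3, 1, 27, 2, 0]   # illustrative list, not the stamps data
flagged = outliers_1p5_iqr(sample)
print(flagged)  # [27]
```

For this list Q1 = 1 and Q3 = 3, so the fences sit at -2 and 6, and only 27 is flagged, at the high end.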
Box Plot
• Additional Visual Display Device
• Again legacy from pencil & paper days
• Not supported in EXCEL
• We will skip
5 Number Sum. & Outliers HW
1.49 c, d
1.46 and add:
(d) How much does the mean change if you
omit Montana and Wyoming?