Descriptive Statistics
B.H. Robbins Scholars Series
May 27, 2010
Outline
Measurement
Random Variables and Distributions
  Commonly used distributions
Descriptive Statistics
Graphics
  Categorical Variables
  Continuous Variables
Measurement
Types of measurements
I Binary: two-category variable
  I yes/no, present/absent, 0/1
I Categorical (nominal, polytomous, discrete): at least 2 values that are not necessarily ordered
  I Race, eye color, etc.
I Ordinal: a categorical variable whose values are in a special order
  I Disease severity, Likert score, cancer stage, ASA status
I Count: a discrete value that (in theory) has no upper limit
  I Number of ER visits in a day, number of CABG surgeries in a year.
Types of measurements (cont.)
I Continuous: numerical variable that has many possible values that represent some underlying distribution.
  I Tend to have the most information (assuming not a lot of preprocessing)
I Turning continuous data into categories is risky: loss of information (i.e., lower power to detect effects, more people needed to have the same power).
  I Errors not reduced by categorization unless that's the only way to get someone to answer the question (e.g., with income)
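The information loss from categorizing can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are simulated (standard normal predictor, an assumed linear effect of 0.5 plus noise), the seed and sample size are arbitrary, and `pearson_r` is a helper written for this example. It compares the correlation with the outcome before and after splitting the predictor at its median.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(2010)   # fixed seed so the run is repeatable
n = 20_000                  # hypothetical sample size

# Simulated data: continuous predictor x, outcome y with an assumed linear effect.
x = [rng.gauss(0, 1) for _ in range(n)]
y = [0.5 * xi + rng.gauss(0, 1) for xi in x]

# Categorize x into two groups at (approximately) its sample median.
cutoff = sorted(x)[n // 2]
x_binary = [1 if xi > cutoff else 0 for xi in x]

r_continuous = pearson_r(x, y)
r_binary = pearson_r(x_binary, y)
print(f"continuous predictor:   r = {r_continuous:.3f}")
print(f"median-split predictor: r = {r_binary:.3f}")
```

In this setup the median-split predictor shows a weaker correlation with the outcome than the continuous one, which is one way the loss of power appears in practice.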
Random Variables and Distributions
Random Variables: X
I A potential measurement
I A quantity that may vary from subject to subject
I A variable whose values are random but whose probability distribution is known.
I A function that assigns a unique value to every outcome from an experiment.
I Once the random variable X is observed, it is a sample value (x).
Distributions of Random Variables (RV)
I Distribution of the RV, X, is a profile of its tendencies
I Depending upon the type of RV, the distribution may be completely characterized by one or a few parameters.
I Binary variable:
  I The probability of a 'yes' or 'present' or 1
  I In the sample this is a proportion
I K-category categorical variable (multinomial)
  I The probability that a randomly chosen subject will be from category 1, 2, ..., K
  I In a sample this is the proportion falling into each category
I Continuous variable:
1. Probability density: the value of x is on the x-axis, and the relative probability of observing values close to it is on the y-axis. In a sample this is a histogram.
2. Cumulative probability distribution: the y-axis is the probability that X ≤ x. This function only rises or stays flat. For a sample it is a cumulative histogram.
3. All percentiles of X
4. All moments of X (e.g., mean(X), mean(X²), mean(X³), ...)
5. If we know one of these, we can derive the others.
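The link between the cumulative distribution and the percentiles can be sketched for a sample. The example below is illustrative only: the sample values are made up, and `make_ecdf` and `percentile_from_ecdf` are helper names chosen for this sketch. It builds the sample version of the cumulative distribution and then reads a percentile back off it.

```python
from bisect import bisect_right

def make_ecdf(sample):
    """Return F, where F(x) is the fraction of sample values that are <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def ecdf(x):
        return bisect_right(ordered, x) / n

    return ecdf

def percentile_from_ecdf(sample, p):
    """Smallest observed value whose cumulative fraction is at least p (0 < p <= 1)."""
    ecdf = make_ecdf(sample)
    for value in sorted(sample):
        if ecdf(value) >= p:
            return value
    raise ValueError("p must be in (0, 1]")

sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.9]   # made-up values for illustration
F = make_ecdf(sample)

# Heights of the cumulative histogram at each sorted observation.
heights = [F(x) for x in sorted(sample)]
print(heights)
print(percentile_from_ecdf(sample, 0.5))
```

The list of heights never decreases, matching the statement that the cumulative distribution only rises or stays flat.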
I If the distribution is known, we may be able to make reasonable guesses about future observations
I It is much harder to guess a single value x than it is to estimate a mean from a distribution.
I At the very least, the distribution tells us the proportions of people we expect to see within each interval of interest
Commonly used distributions
Normal distribution
I Characterized by two parameters: mean (µ) and variance (σ²)
I Mean is a measure of central tendency
I Variance is the mean of the squared deviations from the mean.
I µ and σ² are completely independent of each other.
I Symmetric distribution
I Symmetric distribution
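A short sketch of how the two parameters act on the density, using `statistics.NormalDist` from the Python standard library. The helper `normal(mean, variance)` is a name introduced here; it exists only because `NormalDist` takes the standard deviation rather than the variance. The parameter values are chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

def normal(mean, variance):
    """Build a normal distribution from its mean and variance."""
    return NormalDist(mu=mean, sigma=sqrt(variance))

narrow = normal(2, 1)      # variance 1
wide = normal(2, 4)        # same mean, variance 4
shifted = normal(-2, 1)    # same variance as `narrow`, different mean

peak_narrow = narrow.pdf(2)     # density at the mean
peak_wide = wide.pdf(2)         # lower peak: a larger variance spreads the curve out
peak_shifted = shifted.pdf(-2)  # same peak height as `narrow`, moved to -2

# Symmetry: equal distances on either side of the mean have equal density.
left = wide.pdf(2 - 1.5)
right = wide.pdf(2 + 1.5)

print(f"peak, variance 1: {peak_narrow:.4f}")
print(f"peak, variance 4: {peak_wide:.4f}")
print(f"density 1.5 below and above the mean: {left:.4f}, {right:.4f}")
```

Changing the mean moves the curve without changing its shape, and changing the variance changes the spread without moving the center.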
[Figure: four density panels, Normal(−2,1), Normal(2,1), Normal(−2,4), and Normal(2,4); x-axis: x from −10 to 10; y-axis: Density]
Bernoulli distribution
I The random variable is binary
I Characterized by one parameter: probability (p)
I The mean is p and the variance is p(1 − p)
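The mean and variance can be checked directly from the definitions, since a Bernoulli variable takes only the values 0 and 1. The sketch below is illustrative; `bernoulli_mean_variance` is a helper name introduced here, and the values of p are arbitrary examples.

```python
def bernoulli_mean_variance(p):
    """Mean and variance of a 0/1 variable with P(X = 1) = p, from the definitions."""
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    outcomes = {0: 1 - p, 1: p}   # value -> probability
    mean = sum(x * prob for x, prob in outcomes.items())
    variance = sum((x - mean) ** 2 * prob for x, prob in outcomes.items())
    return mean, variance

results = {p: bernoulli_mean_variance(p) for p in (0.1, 0.25, 0.5, 0.9)}
for p, (mean, variance) in results.items():
    print(f"p = {p}: mean = {mean:.4f}, variance = {variance:.4f}, p(1 - p) = {p * (1 - p):.4f}")
```

Among these values the variance is largest at p = 0.5 and shrinks as p moves toward 0 or 1.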
[Figure: four bar-chart panels, Bernoulli(0.1), Bernoulli(0.25), Bernoulli(0.5), and Bernoulli(0.9); x-axis: No/Yes; y-axis: Proportion from 0 to 1]
Poisson distribution
I Characterized by a single parameter: mean (λ)
I Poisson distribution is commonly used in modeling a count per interval of time
I In a Poisson distribution, the variance is equal to the mean.
I Generally right skewed, but converges to a symmetric distribution as λ increases
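These properties can be checked numerically from the Poisson probability formula, P(X = k) = e^(−λ) λ^k / k!. The sketch below is an illustration under stated assumptions: the helper names are introduced here, and the infinite sum is truncated at `k_max`, which is adequate only for small λ such as the values tried.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with mean lam."""
    return exp(-lam) * lam ** k / factorial(k)

def poisson_summary(lam, k_max=100):
    """Mean, variance, and skewness from the pmf, truncating the sum at k_max."""
    ks = range(k_max + 1)
    probs = [poisson_pmf(k, lam) for k in ks]
    mean = sum(k * p for k, p in zip(ks, probs))
    variance = sum((k - mean) ** 2 * p for k, p in zip(ks, probs))
    third_moment = sum((k - mean) ** 3 * p for k, p in zip(ks, probs))
    skewness = third_moment / variance ** 1.5
    return mean, variance, skewness

summaries = {lam: poisson_summary(lam) for lam in (1, 2, 5, 10)}
for lam, (mean, variance, skewness) in summaries.items():
    print(f"lambda = {lam}: mean = {mean:.4f}, variance = {variance:.4f}, skewness = {skewness:.4f}")
```

The mean and variance agree for each λ, and the skewness is positive but shrinks as λ grows, consistent with the distribution becoming more symmetric.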
[Figure: four panels, Poisson(1), Poisson(2), Poisson(5), and Poisson(10); x-axis: x from 0 to 20; y-axis: Density]
Exponential distribution
I Characterized by a single parameter: rate (λ)
I It is used to describe the time until an event occurs (train arrives, light bulb burns out, a patient has an MI)
I The mean of an exponential is 1/λ and the variance is 1/λ²
I A relaxed version of this distribution is very commonly used in medical research for survival analysis.
I Right or positively skewed
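A quick simulation can be used to check the mean and variance formulas. This is a sketch only: the rate, seed, and number of draws are arbitrary choices made for the example, and `random.expovariate` from the Python standard library generates the waiting times. The median formula ln(2)/λ is a standard result included to show the right skew.

```python
import random
from math import log
from statistics import fmean

rate = 0.5                 # assumed rate (lambda), chosen for illustration
rng = random.Random(34)    # fixed seed so the run is repeatable
draws = [rng.expovariate(rate) for _ in range(100_000)]

sample_mean = fmean(draws)
sample_variance = sum((d - sample_mean) ** 2 for d in draws) / (len(draws) - 1)

# For an exponential distribution the median is ln(2) / lambda,
# which lies below the mean 1 / lambda: a sign of right skew.
theoretical_mean = 1 / rate
theoretical_variance = 1 / rate ** 2
theoretical_median = log(2) / rate

print(f"mean:     simulated {sample_mean:.3f}, theory {theoretical_mean:.3f}")
print(f"variance: simulated {sample_variance:.3f}, theory {theoretical_variance:.3f}")
print(f"median (theory): {theoretical_median:.3f}")
```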
[Figure: four density panels, Exponential(1), Exponential(0.5), Exponential(0.2), and Exponential(0.05); x-axis: x from 0 to 100; y-axis: Density]
Descriptive Statistics
Continuous Variables
I Let $x_1, x_2, \ldots, x_n$ denote sample values
I Measures of location or the 'center' of a sample
  I Arithmetic average: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
    I Population mean is the value that $\bar{x}$ converges to as $n \to \infty$
    I Highly influenced by a single value (which may be a bad thing if that value is not real)
  I Median: middle among all sorted values (i.e., half the sample is greater and half is smaller)
    I Usually very descriptive
    I Not affected by an outlying value
Continuous Variables: Measures of location
[-1.52 -1.36 -1.00 -0.93 -0.83 -0.59 -0.05 0.03 0.62 1.18 1.32]
I Sample average = -0.28
I Median = -0.59
[-1.52 -1.36 -1.00 -0.93 -0.83 -0.59 -0.05 0.03 0.62 11.8 13.2]
I Sample average = 1.76
I Median = -0.59
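The two summaries above can be reproduced with the Python standard library. The sketch below uses the values from this slide; the variable names are introduced here for illustration.

```python
from statistics import mean, median

original = [-1.52, -1.36, -1.00, -0.93, -0.83, -0.59, -0.05, 0.03, 0.62, 1.18, 1.32]
# Same data with the last two values replaced by 11.8 and 13.2.
with_outliers = original[:-2] + [11.8, 13.2]

for label, values in (("original", original), ("with outliers", with_outliers)):
    print(f"{label}: average = {mean(values):.2f}, median = {median(values):.2f}")
```

The average moves from −0.28 to 1.76 when two values change, while the median stays at −0.59.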
Continuous Variables (cont.)
I Quantiles or percentiles: can generally be used to describe central tendency, spread, symmetry, heavy-tailedness, etc.
I The pth sample quantile, x_p, is the value such that a fraction p of the observations falls at or below it.
I The pth population quantile is the value x such that Pr(X ≤ x) = p
I Sample median is the 50th percentile or 0.5 quantile
I Quartiles: (Q1, Q2, Q3) are the (25th, 50th, 75th) percentiles
I Quintiles: (Q1, Q2, Q3, Q4) are the (20th, 40th, 60th, 80th) percentiles
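Sample quantiles can be computed in several slightly different ways. The sketch below assumes one common convention (linear interpolation between neighboring sorted values); it is not necessarily the rule used to produce the numbers in these slides. `sample_quantile` is a helper name introduced here, and the data are the eleven values from the measures-of-location example.

```python
def sample_quantile(values, p):
    """Sample quantile by linear interpolation between sorted values (one convention)."""
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    ordered = sorted(values)
    position = (len(ordered) - 1) * p          # fractional index into the sorted data
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

data = [-1.52, -1.36, -1.00, -0.93, -0.83, -0.59, -0.05, 0.03, 0.62, 1.18, 1.32]

median_value = sample_quantile(data, 0.5)
quartiles = [sample_quantile(data, p) for p in (0.25, 0.50, 0.75)]
quintiles = [sample_quantile(data, p) for p in (0.2, 0.4, 0.6, 0.8)]

print("median:   ", round(median_value, 3))
print("quartiles:", [round(q, 3) for q in quartiles])
print("quintiles:", [round(q, 3) for q in quintiles])
```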
Continuous Variables (cont.)
I Measures of spread or variability
  I Inter-quartile range, (Q1, Q3): an interval that contains half of all subjects
    I Most people think these are meaningful for continuous distributions
  I Variance: the average of the squared deviations between each observed value and the overall average value:
    I $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
    I n − 1 is used because we must acknowledge that we have estimated, and do not know, the true value of µ, the population mean.
  I Standard deviation: the square root of the variance
    I A normal distribution can be defined in terms of the proportion of people falling within a SD of the mean
    I About 68% of the population falls within 1 SD of the mean, and approximately 95% falls within 2 SDs of the mean.
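The 68% and 95% figures for a normal distribution can be checked with `statistics.NormalDist` from the Python standard library. In this sketch `proportion_within` is a helper name introduced here, and the mean of 120 and SD of 15 are hypothetical values used only to show that the proportions do not depend on the parameters.

```python
from statistics import NormalDist

def proportion_within(k, mu=0.0, sigma=1.0):
    """Proportion of a normal population lying within k SDs of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

within_1sd = proportion_within(1)
within_2sd = proportion_within(2)
within_2sd_other = proportion_within(2, mu=120, sigma=15)   # hypothetical mean and SD

print(f"within 1 SD: {within_1sd:.4f}")
print(f"within 2 SD: {within_2sd:.4f}")
print(f"within 2 SD (mean 120, SD 15): {within_2sd_other:.4f}")
```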
Continuous Variables: Measures of scale or spread
[-1.52 -1.36 -1.00 -0.93 -0.83 -0.59 -0.05 0.03 0.62 1.18 1.32]
I Sample Standard Deviation = 0.98
I Interquartile range: (-0.965, 0.325)
[-1.52 -1.36 -1.00 -0.93 -0.83 -0.59 -0.05 0.03 0.62 11.8 13.2]
I Sample Standard Deviation = 5.36
I Interquartile range: (-0.965, 0.325)
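The numbers on this slide can be reproduced with a short script. This is a sketch: `sample_sd` and `quartile_interval` are helper names introduced here, and the quartiles use the `"inclusive"` method of `statistics.quantiles`, an assumption that happens to match the interval reported above.

```python
from math import sqrt
from statistics import mean, quantiles, stdev

def sample_sd(values):
    """Square root of the sum of squared deviations divided by n - 1."""
    xbar = mean(values)
    return sqrt(sum((x - xbar) ** 2 for x in values) / (len(values) - 1))

def quartile_interval(values):
    """(Q1, Q3) using linear interpolation that includes both end points."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q3

original = [-1.52, -1.36, -1.00, -0.93, -0.83, -0.59, -0.05, 0.03, 0.62, 1.18, 1.32]
with_outliers = original[:-2] + [11.8, 13.2]

for label, values in (("original", original), ("with outliers", with_outliers)):
    q1, q3 = quartile_interval(values)
    print(f"{label}: SD = {sample_sd(values):.2f}, quartiles = ({q1:.3f}, {q3:.3f})")
```

The hand-written `sample_sd` agrees with `statistics.stdev`. The SD grows from 0.98 to 5.36 when the two largest values change, while the quartile interval is unchanged.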
Continuous Variables (cont.)
I With highly skewed or asymmetric distributions, variance and SD may not be very useful summaries of spread
I If median length of stay at Vanderbilt Hospital is 3 days, then having a SD of 10 due to some very sick people is difficult to interpret.
I Range: may not be useful because it can only grow as n increases and it is dominated by a single extreme value.
I Coefficient of variation (standard deviation divided by the mean): few situations where this could be useful because it depends highly on what the value of the mean is (esp. if close to 0).
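Both cautions can be illustrated with the eleven values from the earlier examples. This is only a sketch: the helper names are introduced here, the order in which the values "arrive" is arbitrary, and the shift of +10 is a hypothetical change of origin used to show how strongly the coefficient of variation depends on the mean.

```python
from statistics import mean, stdev

def running_range(values):
    """Range (max minus min) after each successive observation."""
    low = high = values[0]
    ranges = []
    for v in values:
        low, high = min(low, v), max(high, v)
        ranges.append(high - low)
    return ranges

def coefficient_of_variation(values):
    """Standard deviation divided by the mean."""
    return stdev(values) / mean(values)

# The slide's values, listed in an arbitrary arrival order.
data = [0.03, -0.93, 1.18, -1.52, -0.05, 0.62, -1.36, -0.59, 1.32, -1.00, -0.83]

ranges = running_range(data)
cv_original = coefficient_of_variation(data)
cv_shifted = coefficient_of_variation([x + 10 for x in data])   # same spread, mean near 10

print("running range:", [round(r, 2) for r in ranges])
print(f"CV, mean near 0:  {cv_original:.2f}")
print(f"CV, mean near 10: {cv_shifted:.2f}")
```

The running range never decreases as observations are added. The coefficient of variation is negative and large in magnitude when the mean is near zero, yet small after the shift, even though the spread of the data is identical.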
Discrete Variables
I Categorical Variables
  I The proportion of observations falling within each category
I Count Variables
  I Best to report the average.
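A minimal sketch of both summaries follows. The eye colors and daily visit counts are made-up values for illustration, and `category_proportions` is a helper name introduced here.

```python
from collections import Counter
from statistics import mean

def category_proportions(labels):
    """Proportion of observations falling within each category."""
    counts = Counter(labels)
    total = len(labels)
    return {category: count / total for category, count in counts.items()}

# Made-up data for illustration only.
eye_color = ["brown", "blue", "brown", "green", "brown", "blue", "brown", "hazel"]
er_visits_per_day = [12, 9, 15, 11, 8, 14, 10]

proportions = category_proportions(eye_color)
average_visits = mean(er_visits_per_day)

print(proportions)
print(f"average visits per day: {average_visits:.2f}")
```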
Graphics
Categorical Variables
Chart Junk
I Chart junk (Tufte): non-data-ink or redundant data-ink
  I Not part of the minimum set of visuals needed to communicate the message understandably.
  I Elements of a display that are not necessary to understand the information contained in it
  I Can be distracting and can skew the depiction, making it difficult to understand (3-D pie and bar charts are examples)
I Ink:information ratio: the ratio of the amount of ink used to the amount of information portrayed
Categorical Variables
I Pie chart
  I High ink:information ratio
  I Can create optical illusions (perception depends on orientation vs the horizon)
  I With a lot of categories, difficult to label
I Bar chart
  I High ink:information ratio
  I Hard to depict uncertainty
  I Hard to interpret subcategories
  I Labels hard to read if bars are vertical
I Dot chart
  I Leads to the most accurate perception
  I Easy to show labels
  I Allows for multiple levels of categorization
  I Multiple categories within a single line of dots
  I Easy to show 2-sided error bars
  I Dot chart and figure 3
[Figure: dot chart of Y (axis from 0.0 to 0.8) by category, Category a through Category m, shown for North and South, with the sample size n listed for each row]
Continuous Variables
I Histograms show relative frequencies (requires binning of data)
  I Not optimal for comparing multiple distributions
I Cumulative distribution functions
  I Can read all quantiles directly off the plot
I Box plot: good way to compare many groups
[Figure: three CDF panels with y-axis Pr(X < x) from 0 to 1. Normal CDF: Normal(0, 1) and Normal(−1, 2), x from −4 to 4. Poisson CDF: Poisson(2) and Poisson(5), x from 0 to 20. Exponential CDF: Exponential(.5) and Exponential(.2), x from 0 to 20.]
Box and Whisker Plot
[Figure: box-and-whisker plots of Value for groups A through F; axis from −2 to 2]
Box Percentile Plot
[Figure: box-percentile plots of Value for groups A through F; axis from −2 to 2]
Decile Plot
[Figure: decile plots of Value for groups A through F; axis from −2 to 2]
Decile Plot: Exponential Distribution
[Figure: decile plots of Value for groups A through F, exponential data; axis from 0.5 to 2.5]
Summary
I Measurements
I Random variables and distributions
I Numerical descriptive statistics
I Graphical summaries
  I Categorical variables
  I Continuous variables