Descriptive Statistics

B.H. Robbins Scholars Series

May 27, 2010

Outline

- Measurement
- Random Variables and Distributions
  - Commonly used distributions
- Descriptive Statistics
- Graphics
  - Categorical Variables
  - Continuous Variables

Measurement

Types of measurements

- Binary: a two-category variable
  - yes/no, present/absent, 0/1
- Categorical (nominal, polytomous, discrete): at least 2 values that are not necessarily ordered
  - Race, eye color, etc.
- Ordinal: a categorical variable whose values follow a specific order
  - Disease severity, Likert score, cancer stage, ASA status
- Count: a discrete value that (in theory) has no upper limit
  - Number of ER visits in a day, number of CABG surgeries in a year

- Continuous: a numerical variable that can take many possible values, which represent some underlying distribution
  - Tends to carry the most information (assuming not a lot of preprocessing)
  - Turning continuous data into categories is risky: information is lost (i.e., lower power to detect effects, so more people are needed to reach the same power)
  - Measurement errors are not reduced by categorization, unless categories are the only way to get someone to answer the question (e.g., with income)
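The information loss from categorizing can be illustrated with a small simulation. This is only a sketch under assumed settings: the sample size, the true correlation of about 0.5, the random seed, and the helper name `pearson` are all choices made for illustration, not part of the course material. It compares how strongly an outcome correlates with a continuous predictor and with the same predictor split at its median.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(2010)  # fixed seed (arbitrary) so the run is repeatable
n = 5000
x = [rng.gauss(0, 1) for _ in range(n)]
# y depends linearly on x plus noise; the true correlation is about 0.5
y = [0.5 * xi + math.sqrt(0.75) * rng.gauss(0, 1) for xi in x]

# Dichotomize x at its sample median: 1 if above the cut, 0 otherwise
cut = sorted(x)[n // 2]
x_binary = [1 if xi > cut else 0 for xi in x]

r_continuous = pearson(x, y)
r_binary = pearson(x_binary, y)
print(f"continuous x:   r = {r_continuous:.2f}")
print(f"median-split x: r = {r_binary:.2f}")
```

The median-split predictor should show a visibly weaker correlation with the outcome, which is one way the lost information shows up as lower power.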

Random Variables and Distributions

Random Variables: X

- A potential measurement
- A quantity that may vary from subject to subject
- A variable whose values are random but whose probability distribution is known
- A function that assigns a unique value to every outcome of an experiment
- Once the random variable X is observed, the result is a sample value (x)

Distributions of Random Variables (RV)

- The distribution of the RV, X, is a profile of its tendencies
- Depending upon the type of RV, the distribution may be completely characterized by one or a few parameters
- Binary variable:
  - The probability of a 'yes' or 'present' or 1
  - In the sample this is a proportion
- K-category categorical variable (multinomial):
  - The probability that a randomly chosen subject will be from category 1, 2, ..., K
  - In a sample this is the proportion falling into each category
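As a small illustration of sample proportions estimating category probabilities, the sketch below tabulates a made-up sample. The eye-color values and the variable names are hypothetical and chosen only for the example.

```python
from collections import Counter

# Hypothetical sample of eye colors (made-up values for illustration)
sample = ["brown", "blue", "brown", "green", "brown", "blue", "brown", "hazel"]

counts = Counter(sample)
n = len(sample)
# Multinomial case: one sample proportion per category
proportions = {category: count / n for category, count in counts.items()}

for category, p_hat in sorted(proportions.items()):
    print(f"{category}: {p_hat:.3f}")

# Binary case: recode to 0/1, and the sample proportion is simply the mean
is_brown = [1 if color == "brown" else 0 for color in sample]
p_brown = sum(is_brown) / n
print(f"proportion brown = {p_brown:.3f}")
```

The proportions across all categories add up to 1, just as the category probabilities do in the population.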

- Continuous variable:
  1. Probability density: the value of x is on the x-axis, and the relative probability of observing values close to it is on the y-axis. In a sample this is a histogram.
  2. Cumulative probability distribution: the y-axis is the probability of $X \le x$. This function only rises or stays flat. For a sample it is a cumulative histogram.
  3. All percentiles of X
  4. All moments of X (e.g., mean($X$), mean($X^2$), mean($X^3$), ...)
  5. If we know one of these, we can derive the others.
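The sample versions of these descriptions can be sketched in a few lines. The data values and the helper name `ecdf` below are hypothetical. The function returns the sample cumulative distribution, which only rises or stays flat, and the last lines compute the first two sample moments.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return a function F where F(x) is the fraction of the sample <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def F(x):
        return bisect_right(ordered, x) / n

    return F


# Hypothetical measurements (made up for illustration)
data = [2.1, 3.4, 1.8, 5.0, 3.4, 4.2, 2.9, 3.8]
F = ecdf(data)

grid = [1.0, 2.0, 3.0, 3.4, 4.0, 5.0, 6.0]
values = [F(x) for x in grid]
for x, fx in zip(grid, values):
    print(f"F({x}) = {fx:.3f}")

# First two sample moments: mean(X) and mean(X^2)
n = len(data)
mean_x = sum(data) / n
mean_x2 = sum(v ** 2 for v in data) / n
print(f"mean(X) = {mean_x:.3f}, mean(X^2) = {mean_x2:.3f}")
```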

- If the distribution is known, we may be able to make reasonable guesses about future observations
- It is much harder to guess a single value x than it is to estimate the mean of a distribution
- At the very least, the distribution tells us the proportions of people we expect to see within each interval of interest

Commonly used distributions

Normal distribution

- Characterized by two parameters: mean ($\mu$) and variance ($\sigma^2$)
- The mean is a measure of central tendency
- The variance is the mean of the squared deviations from the mean
- $\mu$ and $\sigma^2$ are completely independent of each other: the value of one places no constraint on the other
- Symmetric distribution

[Figure: density curves in four panels, Normal(−2,1), Normal(2,1), Normal(−2,4), and Normal(2,4); x-axis: x from −10 to 10; y-axis: Density]

Bernoulli distribution

- The random variable is binary
- Characterized by one parameter: probability ($p$)
- The mean is $p$ and the variance is $p(1 - p)$
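The mean and variance can be checked directly from the two possible outcomes. This is a minimal sketch; the function name `bernoulli_mean_var` is hypothetical, and the values of p are the ones used in the Bernoulli panels of this deck.

```python
def bernoulli_mean_var(p):
    """Mean and variance of a 0/1 variable, computed from its two outcomes."""
    outcomes = {0: 1 - p, 1: p}  # value -> probability
    mean = sum(x * prob for x, prob in outcomes.items())
    var = sum((x - mean) ** 2 * prob for x, prob in outcomes.items())
    return mean, var


for p in (0.1, 0.25, 0.5, 0.9):
    mean, var = bernoulli_mean_var(p)
    print(f"p = {p}: mean = {mean:.4f}, variance = {var:.4f}, p(1-p) = {p * (1 - p):.4f}")
```

The variance is largest at p = 0.5 and shrinks as p moves toward 0 or 1.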

[Figure: bar charts in four panels, Bernoulli(0.1), Bernoulli(0.25), Bernoulli(0.5), and Bernoulli(0.9); x-axis: No and Yes; y-axis: Proportion from 0 to 1]

Poisson distribution

- Characterized by a single parameter: mean ($\lambda$)
- The Poisson distribution is commonly used to model a count per interval of time
- In a Poisson distribution, the variance is equal to the mean
- Generally right skewed, but it converges to a symmetric distribution as $\lambda$ increases
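The equality of mean and variance, and the fading skewness, can be checked numerically from the probability mass function. The sketch below assumes that truncating the sum at 100 is accurate enough for the small means used here. The function names are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def poisson_moments(lam, k_max=100):
    """Mean, variance and skewness from the pmf, truncated at k_max."""
    ks = range(k_max + 1)
    mean = sum(k * poisson_pmf(k, lam) for k in ks)
    var = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in ks)
    skew = sum((k - mean) ** 3 * poisson_pmf(k, lam) for k in ks) / var ** 1.5
    return mean, var, skew


for lam in (1, 2, 5, 10):
    mean, var, skew = poisson_moments(lam)
    print(f"lambda = {lam}: mean = {mean:.3f}, variance = {var:.3f}, skewness = {skew:.3f}")
```

The skewness stays positive (right skew) but shrinks as the mean grows, which matches the move toward a symmetric shape.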

[Figure: distributions in four panels, Poisson(1), Poisson(2), Poisson(5), and Poisson(10); x-axis: x from 0 to 20; y-axis: Density]

Exponential distribution

- Characterized by a single parameter: rate ($\lambda$)
- It is used to describe the time until an event occurs (a train arrives, a light bulb burns out, a patient has an MI)
- The mean of an exponential is $1/\lambda$ and the variance is $1/\lambda^2$
- A relaxed version of this distribution is very commonly used in medical research for survival analysis
- Right (positively) skewed
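A short sketch of these summaries follows, using the rates shown in the exponential panels of this deck. The function names are hypothetical. The median, ln(2)/rate, comes from solving 1 − exp(−rate·x) = 0.5, and it always sits below the mean, as expected for a right-skewed distribution.

```python
import math


def exponential_summary(rate):
    """Mean, variance and median of an exponential distribution with the given rate."""
    mean = 1 / rate
    variance = 1 / rate ** 2
    median = math.log(2) / rate  # solves 1 - exp(-rate * x) = 0.5
    return mean, variance, median


def exponential_cdf(x, rate):
    """Pr(X <= x) for x >= 0."""
    return 1 - math.exp(-rate * x)


for rate in (1, 0.5, 0.2, 0.05):
    mean, variance, median = exponential_summary(rate)
    print(f"rate = {rate}: mean = {mean:.2f}, variance = {variance:.2f}, median = {median:.2f}")
```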

[Figure: density curves in four panels, Exponential(1), Exponential(0.5), Exponential(0.2), and Exponential(0.05); x-axis: x from 0 to 100; y-axis: Density]

Descriptive Statistics

Continuous Variables

- Let $x_1, x_2, \ldots, x_n$ denote sample values
- Measures of location or the 'center' of a sample
  - Arithmetic average: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
    - The population mean is the value that $\bar{x}$ converges to as $n \to \infty$
    - Highly influenced by a single value (which may be a bad thing if that value is not real)
  - Median: the middle of all sorted values (i.e., half the sample is greater and half is smaller)
    - Usually very descriptive
    - Not affected by an outlying value

Continuous Variables: Measures of location

Original sample:

[-1.52 -1.36 -1.00 -0.93 -0.83 -0.59 -0.05 0.03 0.62 1.18 1.32]

- Sample average = -0.28
- Median = -0.59

Same sample with the two largest values multiplied by 10:

[-1.52 -1.36 -1.00 -0.93 -0.83 -0.59 -0.05 0.03 0.62 11.8 13.2]

- Sample average = 1.76
- Median = -0.59
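These figures can be reproduced with Python's standard `statistics` module. The variable names are arbitrary. The data are the two samples listed above.

```python
from statistics import mean, median

original = [-1.52, -1.36, -1.00, -0.93, -0.83, -0.59, -0.05, 0.03, 0.62, 1.18, 1.32]
# Same sample, but the two largest values are ten times larger
with_outliers = original[:-2] + [11.8, 13.2]

for label, sample in (("original", original), ("with outliers", with_outliers)):
    print(f"{label}: mean = {mean(sample):.2f}, median = {median(sample):.2f}")
```

The average moves from about -0.28 to about 1.76, while the median stays at -0.59 because the middle sorted value is unchanged.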

- Quantiles or percentiles: can generally be used to describe central tendency, spread, symmetry, heavy-tailedness, etc.
  - The pth sample quantile, $x_p$, is the value such that a fraction p of the observations fall below it
  - The pth population quantile is the value x such that $\Pr(X \le x) = p$
  - The sample median is the 50th percentile or 0.5 quantile
  - Quartiles: (Q1, Q2, Q3) are the (25th, 50th, 75th) percentiles
  - Quintiles: (Q1, Q2, Q3, Q4) are the (20th, 40th, 60th, 80th) percentiles
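As a quick illustration, `statistics.quantiles` returns the cut points that split a sample into n equal-probability groups: three cut points for quartiles and four for quintiles. The sample below (the integers 1 to 99) is made up so that the answers are easy to check by eye. Note that software packages use slightly different interpolation rules for sample quantiles, so results on small samples can differ between tools.

```python
from statistics import median, quantiles

# Hypothetical sample: the integers 1 through 99 (made up for illustration)
sample = list(range(1, 100))

quartiles = quantiles(sample, n=4)  # three cut points: Q1, Q2, Q3
quintiles = quantiles(sample, n=5)  # four cut points: Q1, Q2, Q3, Q4

print("quartiles:", quartiles)
print("quintiles:", quintiles)
print("median equals Q2 of the quartiles:", median(sample) == quartiles[1])
```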

- Measures of spread or variability
  - Interquartile range, (Q1, Q3): an interval that contains half of all subjects
    - Most people think these are meaningful for continuous distributions
  - Variance: the average of the squared deviations between each observed value and the overall average value:
    - $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
    - $n - 1$ is used because we must acknowledge that we have estimated, and do not know, the true value of the population mean $\mu$
  - Standard deviation: the square root of the variance
    - A normal distribution can be defined in terms of the proportion of people falling within a SD of the mean
    - Approximately 68% of the population falls within 1 SD of the mean, and approximately 95% falls within 2 SDs of the mean
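The 68% and 95% figures can be checked against the normal cumulative distribution using `statistics.NormalDist` from the standard library. The helper name `within_k_sd` is hypothetical.

```python
from statistics import NormalDist


def within_k_sd(k, mu=0.0, sigma=1.0):
    """Probability that a Normal(mu, sigma^2) value lies within k SDs of mu."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


for k in (1, 2):
    print(f"within {k} SD: {within_k_sd(k):.4f}")

# The proportion does not depend on which mean and SD are used
print(f"within 1 SD for mean -2, SD 2: {within_k_sd(1, mu=-2, sigma=2):.4f}")
```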

Continuous Variables: Measures of scale or spread

Original sample:

[-1.52 -1.36 -1.00 -0.93 -0.83 -0.59 -0.05 0.03 0.62 1.18 1.32]

- Sample standard deviation = 0.98
- Interquartile range: (-0.965, 0.325)

Same sample with the two largest values multiplied by 10:

[-1.52 -1.36 -1.00 -0.93 -0.83 -0.59 -0.05 0.03 0.62 11.8 13.2]

- Sample standard deviation = 5.36
- Interquartile range: (-0.965, 0.325)
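The spread figures above can be reproduced as follows. One assumption is involved: the slides do not say which quantile rule was used, and the `method="inclusive"` option of `statistics.quantiles` is chosen here because it gives the same (Q1, Q3) interval. The function name `spread_summary` is hypothetical.

```python
from statistics import quantiles, stdev

original = [-1.52, -1.36, -1.00, -0.93, -0.83, -0.59, -0.05, 0.03, 0.62, 1.18, 1.32]
# Same sample, but the two largest values are ten times larger
with_outliers = original[:-2] + [11.8, 13.2]


def spread_summary(sample):
    """Sample SD (n - 1 denominator) and the (Q1, Q3) interval."""
    # Assumption: the "inclusive" interpolation rule matches the slide values
    q1, _, q3 = quantiles(sample, n=4, method="inclusive")
    return stdev(sample), (q1, q3)


for label, sample in (("original", original), ("with outliers", with_outliers)):
    sd, (q1, q3) = spread_summary(sample)
    print(f"{label}: SD = {sd:.2f}, IQR = ({q1:.3f}, {q3:.3f})")
```

The standard deviation jumps from about 0.98 to about 5.36, while the quartiles do not move, because only the two largest values changed.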

- With highly skewed or asymmetric distributions, variance and SD may not be very useful summaries of spread
  - If the median length of stay at Vanderbilt Hospital is 3 days, then a SD of 10 caused by some very sick people is difficult to interpret
- Range: may not be useful because it can only grow as n increases and it is dominated by a single value
- Coefficient of variation (standard deviation divided by the mean): useful in few situations, because it depends strongly on the value of the mean (especially if the mean is close to 0)
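The sensitivity of the coefficient of variation to the mean can be seen with two made-up samples that have the same spread but different centers. The data and the function name are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(sample):
    """Sample standard deviation divided by the sample mean."""
    return stdev(sample) / mean(sample)


# Hypothetical data: the same spread around two different means
far_from_zero = [98.0, 99.0, 100.0, 101.0, 102.0]
near_zero = [v - 99.9 for v in far_from_zero]  # shift so the mean is about 0.1

cv_far = coefficient_of_variation(far_from_zero)
cv_near = coefficient_of_variation(near_zero)
print(f"SD far = {stdev(far_from_zero):.3f}, CV far = {cv_far:.4f}")
print(f"SD near = {stdev(near_zero):.3f}, CV near = {cv_near:.4f}")
```

The standard deviations agree, but the coefficient of variation grows by orders of magnitude once the mean is close to 0.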

Discrete Variables

- Categorical variables
  - The proportion of observations falling within each category
- Count variables
  - Best to report the average

Graphics

Categorical Variables

Chart Junk

- Chart junk (Tufte): non-data-ink or redundant data-ink
  - Not part of the minimum set of visuals needed to communicate the message understandably
  - Elements of a display that are not necessary to understand the information contained in it
  - Can be distracting and can skew the depiction, making it difficult to understand (3-D pie and bar charts are examples)
- Ink:information ratio: the ratio of the amount of ink used to the amount of information portrayed

Categorical Variables

- Pie chart
  - High ink:information ratio
  - Can create optical illusions (perception depends on orientation vs the horizon)
  - With a lot of categories, difficult to label
- Bar chart
  - High ink:information ratio
  - Hard to depict uncertainty
  - Hard to interpret subcategories
  - Labels hard to read if bars are vertical

- Dot chart
  - Leads to the most accurate perception
  - Easy to show labels
  - Allows for multiple levels of categorization
  - Multiple categories within a single line of dots
  - Easy to show 2-sided error bars
  - Dot chart and figure 3

[Figure: dot chart of Y (x-axis from 0.0 to 0.8) for Category a through Category m, in two panels labeled North and South, with the sample size n of each category listed in the margin]

Continuous Variables

- Histograms show relative frequencies (requires binning of data)
  - Not optimal for comparing multiple distributions
- Cumulative distribution functions
  - Can read all quantiles directly off the plot
- Box plot: good way to compare many groups

[Figure: cumulative distribution functions, Pr(X < x) against x, in three panels: Normal CDF (Normal(0, 1) and Normal(−1, 2), x from −4 to 4), Poisson CDF (Poisson(2) and Poisson(5), x from 0 to 20), and Exponential CDF (Exponential(.5) and Exponential(.2), x from 0 to 20)]

[Figure: Box and Whisker Plot of Value for groups A through F; value axis from −2 to 2]

[Figure: Box Percentile Plot of Value for groups A through F; value axis from −2 to 2]

[Figure: Decile Plot of Value for groups A through F; value axis from −2 to 2]

[Figure: Decile Plot: Exponential Distribution, showing Value for groups A through F; value axis from 0.5 to 2.5]

Summary

- Measurements
- Random variables and distributions
- Numerical descriptive statistics
- Graphical summaries
  - Categorical variables
  - Continuous variables