Chapter 3. Exploratory data analysis
Outline
Example problems and associated data sets
◮ Diseased trees
◮ Urban and rural ozone
◮ Comparing hospitals
Graphical methods
◮ Histograms
◮ Choice of bin size
◮ Empirical distribution function
◮ Scatterplot
◮ Visualising conditional distributions
◮ Historical note: Florence Nightingale
Summary statistics
◮ Sample mean
◮ Sample variance and standard deviation
◮ Sample quantiles
◮ Sample correlation
Data and Random variability
◮ Data are information that we have collected, but they also carry uncertainty, due to random variability in the characteristics of interest from one individual to another.
◮ In mathematical terms, the characteristics being measured are represented by random variables, e.g. X = age of an individual chosen at random.
◮ An individual in this context is a unit of study: actual people, towns, cars, test tubes...
Random variables and realisations
IMPORTANT DIFFERENCE BETWEEN:
Random variable ←→ realisation (or observation)
◮ A random variable is always written in UPPER CASE and is associated with a probability distribution, e.g. X = ozone level.
◮ An observation of a random variable is written in lower case and is just a number, e.g. x = the observed value.
Data set of size n:
X_1, ..., X_n : random variables
x_1, ..., x_n : particular realisations
Data analysis
The first stage in any analysis is to get to know the problem and the data. This usually involves a variety of graphical procedures to visualise the data, as well as the calculation of a few simple summary numbers, or summary statistics, that capture key features of the data and, we hope, reveal key features of the unknown underlying distribution.
The random variability in the data reflects the underlying distribution. Bear in mind, however, that because the sample size is finite, the data give only an approximation to the true underlying distribution and its features. We therefore also need to care about how good that approximation is.
Role of exploratory analysis
Finding errors and anomalies: missing data, outliers, changes of scale...
Suggesting a route for subsequent analysis: plots of the data give information on the location, scale and shape of the distribution and on relationships between variables.
Augmenting understanding of the applied problem: exploratory graphical tools sharpen the scientific questions being addressed.
Example problems and associated data sets
Ecology: diseased trees
Atmospheric chemistry: monitoring urban air pollution
Health: comparing hospitals
Diseased trees
How does the disease spread between trees, and what is the
probability that trees are infected by the disease?
Run length                                0    1    2    3    4    5
Number of runs (first 50 observations)   31   16    2    0    1    0
Number of runs (all 109 observations)    71   28    5    2    2    1
Table: Run lengths of diseased trees in an infected plantation.
Urban and rural ozone
Air quality monitoring:
◮ How, if at all, does the distribution of ozone measurements
vary between the urban and rural sites?
◮ How, if at all, is the distribution of ozone measurements
affected by season?
◮ How, if at all, does the presence of other pollutants affect the
levels of measured ozone?
Data: daily measurements of the maximum hourly mean concentration of O3 and NO2 (ppb):
x_1, ..., x_n : Leeds city centre ozone
y_1, ..., y_n : Ladybower reservoir ozone
We will look at observations from early summer (April to July inclusive) and winter (November to February inclusive).
Comparing hospitals
Number of successful operations at each hospital (out of ten)
Hospital 1 Hospital 2
9 5
What can we conclude about the relative performances of the two
hospitals?
Population and sample: example
In the ozone problem, we have data from a number of days during 1994-1998. However, interest is not solely in the levels of ozone on the days on which measurements were taken. The objective of a statistical analysis is to learn about the relationships between the variables, and to draw more general conclusions about levels of the variables on other, perhaps future, dates.
Exercise 3.1.1
For each of the problems that we are concentrating on in the course, state the population that we are trying to learn about:
Ozone: levels of ozone at the two locations, given the time of year and the level of NO2.
Diseased trees: all trees in the forest and, possibly, other locations where the climate, soil and trees have similar properties.
Hospital: other operations at the two hospitals.
Discrete or Continuous: Diseased trees
Exercise 3.1.2
For the diseased trees data set, define the variable of interest as X and find its possible range of values. Is the variable discrete or continuous?
◮ The variable of interest is X = length of the unbroken run of diseased trees neighbouring a diseased tree.
◮ Possible values are {0, 1, 2, ...}, so X is discrete.
Discrete or Continuous: Comparing hospitals
Exercise 3.1.3
For the hospital data set, define the variable of interest as X and find its possible range of values. Is the variable discrete or continuous?
◮ The variable of interest is
X = number of successful operations in the first hospital
Y = number of successful operations in the second hospital
◮ Possible values are
{0, 1, · · · , 10} for both variables.
Histograms
Histogram - shape of distribution
◮ Bins of equal width
◮ Number of observations in each bin
Figure: Histograms of run lengths of diseased trees in an infected plantation, for the partial (left) and full (right) data sets. Horizontal axis: run length; vertical axis: count.
Scaled histogram
We can rescale the vertical axis of a histogram so that the histogram has area 1. We do this by calculating the area of the original histogram, then dividing all the counts, or frequencies, by this amount. When the bins are all of equal width, the area is:
A = contribution of one individual × number of individuals
  = 1 × bin width × number of individuals
Exercise 3.2.1
Diseased trees. For the histogram of the full diseased trees data set, the bin width is 1, so A = 1 × 109 = 109 and the new maximum value on the y-axis is approximately 71 / (1 × 109) ≈ 0.65.
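The rescaling in this exercise can be checked with a few lines of Python. This is a minimal sketch rather than part of the original slides; the variable names are chosen here for illustration, and the counts are those of the full data set in the run-length table.

```python
# Counts of runs of length 0..5 in the full data set (from the run-length table).
counts = [71, 28, 5, 2, 2, 1]
bin_width = 1  # run lengths are integers, so each bin has width 1

# Area of the unscaled histogram: bin width x number of individuals.
area = bin_width * sum(counts)

# Dividing each count by the area gives bar heights whose total area is 1.
heights = [c / area for c in counts]

print(area)                    # 109
print(round(max(heights), 2))  # 0.65
```

With unit bin width the scaled heights are simply the observed proportions, which is why the vertical axis of the scaled plot can be read as a p.m.f.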
Figure: Histograms of run lengths of diseased trees in an infected plantation, partial (left plot) and full (right plot) data sets. Horizontal axis: run length; vertical axis: p.m.f.
Comparing histograms
Figure: Histograms of the summer daily maximum ozone levels (ppb) in Leeds city centre (left) and at Ladybower reservoir (right). Horizontal axis: daily max ozone; vertical axis: density.
Reducing variability
Figure: Differences of summer ozone daily maxima at the two sites. Panel title: Differences; horizontal axis: daily max ozone; vertical axis: density.
Effect of bin size
Figure: Histograms of density of the Leeds summer ozone data using different bin sizes. Horizontal axis: summer daily maxima; vertical axis: density.
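The effect of bin size can be explored numerically. The sketch below is an illustration only: the data are simulated (they are not the Leeds measurements), and the helper `density_histogram` and the three bin widths are hypothetical choices made here. It relies on NumPy's `numpy.histogram` with `density=True`, which scales the bars to total area 1.

```python
import numpy as np

# Hypothetical stand-in for ozone measurements: 200 simulated values.
rng = np.random.default_rng(0)
sample = rng.normal(loc=30.0, scale=10.0, size=200)

def density_histogram(data, bin_width):
    """Return (heights, edges) of an equal-width histogram scaled to area 1."""
    start = np.floor(data.min() / bin_width) * bin_width
    stop = np.ceil(data.max() / bin_width) * bin_width
    n_bins = max(int(round((stop - start) / bin_width)), 1)
    edges = start + bin_width * np.arange(n_bins + 1)
    heights, edges = np.histogram(data, bins=edges, density=True)
    return heights, edges

# Narrow bins give a spiky picture, wide bins hide the shape.
for width in (2.0, 10.0, 40.0):
    heights, edges = density_histogram(sample, width)
    area = np.sum(heights * np.diff(edges))
    print(f"bin width {width}: {len(heights)} bins, area = {area:.3f}")
```

Whatever the bin width, the area stays equal to 1; only the number of bars and the amount of detail change.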
Empirical distribution function
Recall the cumulative distribution function (c.d.f.) of a random variable X:
F(x) = P(X ≤ x)
How can we estimate this from a finite number of observations?
Let us assume that our variables X_1, ..., X_n are independent and identically distributed (i.i.d.) replicates of a random variable X which has cumulative distribution function F. We denote by x_1, ..., x_n the observed values of X_1, ..., X_n.
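The empirical cumulative distribution function (c.d.f.) is defined as
$$\tilde{F}(x) = \frac{1}{n}\,(\text{number of } x_i \le x) = \frac{1}{n}\sum_{i=1}^{n} I(x_i \le x),$$
where
$$I(x_i \le x) = \begin{cases} 1 & \text{if } x_i \le x, \\ 0 & \text{if } x_i > x. \end{cases}$$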
The empirical c.d.f. is a proper distribution function and has the following properties:
◮ F̃ (x) is a step function with jumps at the data points;
◮ F̃ (x) = 1 if x ≥ max(x1, . . . , xn);
◮ F̃ (x) = 0 if x < min(x1, . . . , xn).
Remark
◮ We have no reason to favor any particular observation, so we give each observation an equal weight 1/n. If some values are more likely than others, they simply appear more frequently in the data.
◮ Take the observed values and order them so that the smallest comes first. Label these ordered values x_(1), x_(2), ..., x_(n), so that x_(1) ≤ x_(2) ≤ ... ≤ x_(n). Then the k-th ordered point x_(k) is the (k/n)-th quantile.
Exercise 3.2.2
For observations {1, 2, 2, 3, 4}, find F̃(x) and sketch the plot.
x      0    0.5  1    1.5  2    2.5  3    3.5  4    4.5  5
F̃(x)   0    0    1/5  1/5  3/5  3/5  4/5  4/5  1    1    1
Figure: Step plot of F̃(x) against x, for x from 0 to 5.
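The table in this exercise can be reproduced directly from the definition of the empirical c.d.f. The following is a minimal sketch; the function name `ecdf` and the use of exact fractions are choices made here for illustration.

```python
from fractions import Fraction

def ecdf(data):
    """Return the empirical c.d.f. of `data` as a function of x."""
    n = len(data)

    def F(x):
        # Proportion of observations less than or equal to x.
        return Fraction(sum(1 for xi in data if xi <= x), n)

    return F

F = ecdf([1, 2, 2, 3, 4])
grid = [0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]
values = [F(x) for x in grid]
print([str(v) for v in values])
# ['0', '0', '1/5', '1/5', '3/5', '3/5', '4/5', '4/5', '1', '1', '1']
```

The jump of 2/5 at x = 2 comes from the two tied observations, each contributing weight 1/5.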
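Consider the n = 20 sorted observations:
16 22 23 23 26 27 27 28 29 30 32 32 32 33 34 34 35 35 36 45
At the i-th sorted data point the empirical c.d.f. reaches i/n: each observation contributes a jump of 1/n, and tied values combine into a single larger jump.
x      16    22    23    26    27    28    29
F̃(x)   1/20  2/20  4/20  5/20  7/20  8/20  9/20
x      30     32     33     34     35     36     45
F̃(x)   10/20  13/20  14/20  16/20  18/20  19/20  1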
Figure: Empirical c.d.f. for summer daily maxima ozone at the Leeds city centre site (two panels). Horizontal axis: summer daily maxima; vertical axis: F̃(x).
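The table of F̃ values for the 20 sorted observations can be rebuilt by accumulating counts over the distinct values. This is an illustrative sketch, not part of the slides; the names `obs` and `cum_count` are introduced here.

```python
from collections import Counter

obs = [16, 22, 23, 23, 26, 27, 27, 28, 29, 30,
       32, 32, 32, 33, 34, 34, 35, 35, 36, 45]
n = len(obs)

# Walk through the distinct values in increasing order, accumulating counts.
cum_count = {}
running = 0
for value, count in sorted(Counter(obs).items()):
    running += count           # a value observed `count` times jumps by count/n
    cum_count[value] = running

for value, k in cum_count.items():
    print(f"x = {value}: F(x) = {k}/{n}")
```

For instance, the value 32 occurs three times, so the empirical c.d.f. rises from 10/20 to 13/20 there.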
Scatterplot
When we have multivariate data, we have to look at dependence between variables.
Scatterplot: plots one variable against another.
Exercise 3.2.3
Ozone. We now turn to the effect of nitrogen dioxide (NO2) on ozone levels. We focus on the Leeds city centre ozone measurements as our response variable.
Figure: Scatterplots of Leeds city centre O3 values against NO2 for each season, summer (left) and winter (right). Horizontal axis: NO2; vertical axis: O3.
Visualising conditional distribution
As well as looking for dependence between variables, it can also be useful to identify situations in which variables appear to be independent. If two variables are independent, then the distribution of one variable will look the same regardless of the value of the other variable.
Conditional probabilities were introduced in Math104:
If A and B are two events then, as long as P(B) > 0, the conditional probability of A given B is written as P(A | B) and calculated from
P(A | B) = P(A ∩ B) / P(B).
We can look for more structure in our data, including the dependence of one variable on another, by examining conditional distributions of some subsets of our data.
Exercise 3.2.4
Ozone data. We will look at the following conditional histograms for Leeds city centre:
◮ daily maximum ozone levels in summer conditional on {NO2 ≤ 40};
◮ daily maximum ozone levels in summer conditional on {40 < NO2 ≤ 60};
◮ daily maximum ozone levels in winter conditional on {NO2 ≤ 40};
◮ daily maximum ozone levels in winter conditional on {40 < NO2 ≤ 60}.
Figure: Conditional histograms of ozone levels in Leeds city centre conditional on {NO2 ≤ 40} and {40 < NO2 ≤ 60} in summer (left) and winter (right). Panel titles: Summer Ozone | NO2 ≤ 40; Summer Ozone | 40 < NO2 ≤ 60; Winter Ozone | NO2 ≤ 40; Winter Ozone | 40 < NO2 ≤ 60. Horizontal axis: O3; vertical axis: density.
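Conditioning on an event such as {NO2 ≤ 40} amounts to keeping only the observations for which the event holds. The sketch below shows this with boolean masks in NumPy. The paired values are hypothetical, made up for illustration; they are not the Leeds measurements, and the common bin edges are an arbitrary choice.

```python
import numpy as np

# Hypothetical paired daily values (ppb); NOT the Leeds measurements.
no2 = np.array([25, 38, 42, 55, 61, 33, 47, 58, 40, 72])
o3 = np.array([35, 30, 28, 22, 18, 33, 27, 20, 31, 15])

# Events to condition on, expressed as boolean masks.
low = no2 <= 40
mid = (no2 > 40) & (no2 <= 60)

o3_given_low = o3[low]   # ozone values on days with NO2 <= 40
o3_given_mid = o3[mid]   # ozone values on days with 40 < NO2 <= 60

# Use the same bins for both subsets so the two histograms are comparable.
bins = np.arange(10, 45, 5)
dens_low, _ = np.histogram(o3_given_low, bins=bins, density=True)
dens_mid, _ = np.histogram(o3_given_mid, bins=bins, density=True)

print(o3_given_low, o3_given_mid)
```

If ozone and NO2 were independent, the two conditional histograms would look alike apart from sampling variability.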
Summary statistics
Numerical summaries of the data can
◮ facilitate the comparison of different variables;
◮ help us make clear statements about some aspects of the data.
Mathematical skills:
Recall the notation
$$\sum_{i=1}^{n} g(i) = g(1) + g(2) + \dots + g(n-1) + g(n)$$
for any positive integer value of n and any function g. In statistics we often have to do mathematics with sums of the form $\sum_{i=1}^{n} h(x_i)$. Using the above notation, this means
$$\sum_{i=1}^{n} h(x_i) = h(x_1) + h(x_2) + \dots + h(x_{n-1}) + h(x_n)$$
for any positive integer value of n, any function h, and any set of values x_1, ..., x_n.
The most common forms of this expression we will encounter are:
$$\sum_{i=1}^{n} x_i = x_1 + \dots + x_n \quad\text{and}\quad \sum_{i=1}^{n} x_i^2 = x_1^2 + \dots + x_n^2.$$
Let c ≠ 0 and d be real numbers and denote $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
Sample mean
Consider a random variable X of which we obtain n i.i.d. replicates X_1, ..., X_n. The sample mean of the observed values x_1, ..., x_n is defined as follows:
The sample mean of n observations x_1, ..., x_n is denoted by m(x) and obtained by adding all the x_i and dividing by n:
$$m(x) = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}.$$
This is an estimate of µ_X, the mean or expectation of X, and is viewed as a measure of center.
Sample variance
The sample variance of n observations x_1, ..., x_n is denoted s^2(x) and is given by:
$$s^2(x) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2.$$
Note the divisor n here instead of n − 1. The sample variance is an estimate of the variance of X, σ_X^2.
Sample standard deviation
Ideally, a measure of spread should have the same units as the original data. To obtain a measure with the correct units, we take the square root and define:
The sample standard deviation of n observations x_1, ..., x_n is denoted by s(x) and is given by:
$$s(x) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}.$$
The standard deviation of X, σ_X, is the square root of σ_X^2 = Var(X), and the sample standard deviation estimates this value from the data.
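The three definitions above translate directly into code. The sketch below is illustrative only: the function names are chosen here, and the data set {1, 2, 2, 3, 4} is borrowed from Exercise 3.2.2. It also compares the divisor-n variance with Python's `statistics.pvariance` (divisor n) and `statistics.variance` (divisor n − 1).

```python
import math
import statistics

def sample_mean(x):
    """m(x): add all the observations and divide by n."""
    return sum(x) / len(x)

def sample_variance(x):
    """s^2(x) with divisor n, as in the definition above (not n - 1)."""
    m = sample_mean(x)
    return sum((xi - m) ** 2 for xi in x) / len(x)

def sample_sd(x):
    """s(x): square root of the sample variance, in the units of the data."""
    return math.sqrt(sample_variance(x))

x = [1, 2, 2, 3, 4]
print(sample_mean(x), sample_variance(x), sample_sd(x))

# statistics.pvariance uses divisor n; statistics.variance uses n - 1.
print(statistics.pvariance(x), statistics.variance(x))
```

When using a library routine, it is worth checking which divisor it applies, since the two conventions give different values for small n.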
Exercise 3.3.1
Ozone data. We will calculate summary statistics of O3 to look more closely for differences between the locations and the seasons. There are four groups, arising from the two levels of each of the two nominal variables location and season. Numbers in parentheses are standard deviations.
Mean ozone concentration
          Leeds city       Ladybower
summer    31.78 (9.28)     43.63 (11.81)
winter    20.52 (10.77)    29.24 (8.40)
What conclusions do you draw from these summary statistics?
Sample quantiles
The median is the midpoint of the data, another measure of the center of the data.
Sample quantiles are calculated directly from the empirical c.d.f. To estimate the p-th quantile, we find the value x_p that satisfies
F̃(x_p) = p.
Exercise 3.3.2
Calculate the sample mean, sample standard deviation and sample quantiles (x_0.25, x_0.5, x_0.75) for each data set.
◮ Data A: 2, 4, 6, 8, 10
◮ Data B: 2, 4, 6, 8, 100
◮ Data C: 2, 4, 6, 8, 1000
Data                       x̄     s(x)     x_0.25  x_0.5  x_0.75
Data A: 2, 4, 6, 8, 10     6     2.83     4       6      8
Data B: 2, 4, 6, 8, 100    24    38.05    4       6      8
Data C: 2, 4, 6, 8, 1000   204   398.01   4       6      8
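The table for this exercise can be reproduced as follows. This is a sketch under one stated assumption: because F̃ is a step function, F̃(x_p) = p need not have an exact solution, so the hypothetical helper `sample_quantile` takes x_p to be the smallest ordered value with F̃(x) ≥ p. That convention matches the values in the table, but the slides do not spell it out.

```python
import math

def sample_quantile(x, p):
    """Smallest ordered value x_(k) with k/n >= p (assumed convention)."""
    ordered = sorted(x)
    n = len(ordered)
    k = max(math.ceil(n * p), 1)
    return ordered[k - 1]

def summary(x):
    """Return (mean, sd rounded to 2 d.p., [x_0.25, x_0.5, x_0.75])."""
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((xi - mean) ** 2 for xi in x) / n)  # divisor n
    quartiles = [sample_quantile(x, p) for p in (0.25, 0.5, 0.75)]
    return mean, round(sd, 2), quartiles

print(summary([2, 4, 6, 8, 10]))    # (6.0, 2.83, [4, 6, 8])
print(summary([2, 4, 6, 8, 100]))   # (24.0, 38.05, [4, 6, 8])
print(summary([2, 4, 6, 8, 1000]))  # (204.0, 398.01, [4, 6, 8])
```

Changing the largest observation moves the mean and standard deviation a great deal, while the three quantiles do not change at all.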
Using empirical c.d.f.
Figure: Empirical c.d.f. of the Leeds summer daily maxima, used to read off a sample quantile: for p = 0.6, x_p lies in the interval (33, 34).
Using sample quantiles
The sample median is an alternative measure of the center of the distribution.
Similarly, an alternative measure of the spread of the distribution is the range of the middle 50% of observations, called the interquartile range, x_0.75 − x_0.25.
Using the above example, compare (i) the sample mean and sample median, (ii) the sample standard deviation and sample interquartile range. When do you think measures based on sample quantiles are preferable?
The sample mean is sensitive to outliers and is greatly influenced by extreme points, whereas the sample median is hardly affected by them. Likewise, the sample standard deviation is greatly inflated by extreme points, whereas the sample interquartile range is stable.
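The comparison can be made concrete with Data A and Data B from Exercise 3.3.2. The sketch below uses NumPy; note that `numpy.percentile` interpolates linearly between order statistics by default, which is a different convention from reading quantiles off the empirical c.d.f., although the two agree for these five-point data sets. The helper name `robust_and_classical` is hypothetical.

```python
import numpy as np

data_a = np.array([2, 4, 6, 8, 10])
data_b = np.array([2, 4, 6, 8, 100])   # same data with one extreme point

def robust_and_classical(x):
    """Return classical (mean, sd) and quantile-based (median, IQR) summaries."""
    q25, q50, q75 = np.percentile(x, [25, 50, 75])
    return {
        "mean": float(x.mean()),
        "sd": float(x.std()),          # np.std uses divisor n by default
        "median": float(q50),
        "iqr": float(q75 - q25),
    }

summary_a = robust_and_classical(data_a)
summary_b = robust_and_classical(data_b)
print(summary_a)
print(summary_b)
```

The median and interquartile range are identical for the two data sets, while the mean and standard deviation are pulled strongly towards the extreme value 100.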
Boxplot
Exercise 3.3.3
Ozone data.
Figure: Sample quantiles for the ozone data (Leeds.O3 and Ladybower.O3) are summarized in boxplots, for summer (left) and winter (right).
Sample correlation
Suppose we have i.i.d. observations (x_1, y_1), ..., (x_n, y_n) of random variables X and Y.
First calculate:
◮ m(x) and s(x) for variable X;
◮ m(y) and s(y) for variable Y.
Next standardise x and y:
$$\frac{x_i - m(x)}{s(x)} \quad\text{and}\quad \frac{y_i - m(y)}{s(y)} \quad\text{for all } i = 1, \dots, n.$$
Then the sample correlation coefficient is the average of the product of these standardised values:
The sample correlation coefficient of n pairs of observations (x_1, y_1), ..., (x_n, y_n) is denoted by r(x, y) and is given by:
$$r(x, y) = \frac{1}{n}\sum_{i=1}^{n} \left(\frac{x_i - m(x)}{s(x)}\right)\left(\frac{y_i - m(y)}{s(y)}\right) \in [-1, 1].$$
The sample correlation coefficient is an estimate of the correlation between X and Y, denoted Corr(X, Y).
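The recipe above (standardise, multiply, average) can be sketched in a few lines. The function name `sample_correlation` and the five data pairs are hypothetical, chosen here only to illustrate the calculation; the result is compared with NumPy's `numpy.corrcoef`.

```python
import numpy as np

def sample_correlation(x, y):
    """Average product of standardised values, using divisor-n s(x) and s(y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    zx = (x - x.mean()) / x.std()   # np.std uses divisor n by default
    zy = (y - y.mean()) / y.std()
    return float(np.mean(zx * zy))

# Hypothetical paired observations, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

r = sample_correlation(x, y)
print(round(r, 4))                              # 0.8
print(round(float(np.corrcoef(x, y)[0, 1]), 4)) # 0.8
```

The two results agree because the divisor (n or n − 1) cancels between the numerator and the standard deviations, provided the same divisor is used throughout.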
Sample correlation coefficient
Figure: Six scatterplots (both axes from −4 to 4) illustrating different values of the correlation. Panels, in order: Corr(X, Y) = −0.9, Corr(X, Y) = −0.5, Corr(X, Y) = 0, Corr(X, Y) = 0.3, Corr(X, Y) = 0.5, Corr(X, Y) = 0.7.
Misuse of sample correlation coefficient
Figure: The sample correlation coefficient is not an appropriate measure of the strength of non-linear association. Four scatterplots (both axes from −3 to 3) with Corr(X, Y) = 0.16, Corr(X, Y) = 0.028, Corr(X, Y) = 0.51 and Corr(X, Y) = −0.83.
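A small constructed example makes the same point numerically. The data below are hypothetical: y is an exact function of x (y = x squared), yet the sample correlation is zero because the relationship is not linear.

```python
import numpy as np

# Hypothetical data: y is completely determined by x, but not linearly.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2

r = float(np.corrcoef(x, y)[0, 1])
print(abs(r) < 1e-12)   # True: the sample correlation is zero
```

A correlation near zero therefore rules out a linear association only; a scatterplot is still needed to check for other forms of dependence.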
Exercise 3.3.4
Ozone data. Calculate sample correlation coefficients between O3 and NO2 for the ozone data. There are four groups, arising from the two levels of each of the two nominal variables location and season.
          Leeds city    Ladybower reservoir
Summer    0.10          0.25
Winter    -0.24         -0.48