chapter 3. exploratory data analysisparkj1/math105/mainslide_ch3.pdf · exploratory data analysis....

Chapter 3. Exploratory data analysis

Outline

◮ Example problems and associated data sets
  ◮ Diseased trees
  ◮ Urban and rural ozone
  ◮ Comparing hospitals

◮ Graphical methods
  ◮ Histograms (choice of bin size)
  ◮ Empirical distribution function
  ◮ Scatterplot
  ◮ Visualising conditional distributions
  ◮ Historical note: Florence Nightingale

◮ Summary statistics
  ◮ Sample mean
  ◮ Sample variance and standard deviation
  ◮ Sample quantiles
  ◮ Sample correlation

Data and Random variability

◮ Data is information that we have collected, but it also carries uncertainty, due to random variability in the characteristics of interest from one individual to another.

◮ In mathematical terms, the characteristics being measured are represented by random variables, e.g. X = age of an individual chosen at random.

◮ An individual in this context is a unit of study: actual people, towns, cars, test tubes...

Random variables and realisations

IMPORTANT DIFFERENCE BETWEEN:

Random variable ←→ realisation (or observation)

◮ A random variable is always written in UPPER CASE and is associated with a probability distribution, e.g. X = ozone level.

◮ An observation of a random variable is written in lower case and is just a number, e.g. x = the observed value.

Data set of size n:

X1, · · · ,Xn random variables

x1, · · · , xn particular realizations

Data analysis

The first stage in any analysis is to get to know the problem and the data. This usually involves a variety of graphical procedures to try to visualise the data, as well as the calculation of a few simple summary numbers, or summary statistics, that capture key features of the data, which hopefully reveal key features of the unknown underlying distribution.

The random variability in the data reflects the underlying distribution. Bear in mind, however, that because the sample size is finite, the data give only an approximation to the true underlying distribution and its features. We therefore also need to care about how good the approximation is.

Role of exploratory analysis

Finding errors and anomalies: missing data, outliers, changes of scale...

Suggesting route of subsequent analysis: plots of data give information on location, scale and shape of the distribution and relationships between variables.

Augmenting understanding of applied problem: exploratory graphical tools sharpen the scientific questions being addressed.

Example problems and associated data sets

Ecological Diseased trees

Atmospheric Chemistry Monitoring urban air pollution

Health Comparing hospitals

Diseased trees

How does the disease spread between trees, and what is the probability that trees are infected by the disease?

Run length                                  0    1    2    3    4    5
Number of runs in first 50 observations    31   16    2    0    1    0
Number of runs in 109 observations         71   28    5    2    2    1

Table: Run lengths of diseased trees in an infected plantation.

Urban and rural ozone

Air quality monitoring:

◮ How, if at all, does the distribution of ozone measurements vary between the urban and rural sites?

◮ How, if at all, is the distribution of ozone measurements affected by season?

◮ How, if at all, does the presence of other pollutants affect the levels of measured ozone?

Data: daily measurements of the maximum hourly mean concentration of O3 and NO2 (ppb):

x1, · · · , xn Leeds city centre ozone

y1, · · · , yn Ladybower reservoir ozone

We will look at observations from early summer (April to July inclusive) and winter (November to February inclusive).

Comparing hospitals

Number of successful operations at each hospital (out of ten)

Hospital 1 Hospital 2

9 5

What can we conclude about the relative performances of the two hospitals?

Population and sample: example

In the ozone problem, we have data from a number of days during 1994-1998. However, interest is not solely in the levels of ozone on the days on which measurements were taken. The objective of a statistical analysis is to learn about the relationships between the variables, and to draw more general conclusions about levels of the variables on other, perhaps future, dates.

Exercise 3.1.1
For each of the problems that we are concentrating on in the course, state the populations that we are trying to learn about:

Ozone: Levels of ozone at the two locations given the time of year and the level of NO2.

Diseased trees: All trees in the forest and, possibly, other locations where the climate, soil and trees have similar properties.

Hospital: Other operations at the two hospitals.

Discrete or Continuous: Diseased trees

Exercise 3.1.2
For the diseased tree data set, define the variable of interest as X and find its possible range of values. Is the variable discrete or continuous?

◮ The variable of interest is X = length of an unbroken run of diseased trees among the neighboring trees of a diseased tree.

◮ Possible values are {0, 1, 2, ...}, so X is discrete.

Discrete or Continuous: Comparing hospitals

Exercise 3.1.3
For the hospital data set, define the variables of interest as X and Y and find their possible range of values. Are the variables discrete or continuous?

◮ The variables of interest are

X = number of successful operations in the first hospital

Y = number of successful operations in the second hospital

◮ Possible values are {0, 1, · · · , 10} for both variables, so both are discrete.

Histograms

Histogram - shape of distribution

◮ Bins of equal width

◮ Number of observations in each bin

[Figure panels: "Partial" (left) and "Full" (right); horizontal axis: run length; vertical axis: count.]

Figure: Histograms of run lengths of diseased trees in an infected plantation.

Scaled histogram

We can rescale the vertical axis of our histogram to ensure that the histogram has area 1. We do this by calculating the area of our original histogram, then dividing all the counts, or frequencies, by this amount. When the bins are all of equal width, the area is:

$$A = \text{contribution of one individual} \times \text{number of individuals} = 1 \times \text{bin width} \times \text{number of individuals}$$

Exercise 3.2.1
Diseased trees. For the histogram of the full diseased trees data set, $A = 1 \times 109$ and the new maximum value on the y axis is approximately

$$\frac{71}{1 \times 109} \approx 0.65$$
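The rescaling can be checked numerically. The following is a minimal sketch in Python using the run-length counts of the full data set from the table of diseased trees; the function name `scaled_heights` is made up for this illustration and is not part of the course material.

```python
# Run-length counts for the full diseased-trees data set (109 observations).
run_lengths = [0, 1, 2, 3, 4, 5]
counts = [71, 28, 5, 2, 2, 1]
bin_width = 1  # each bin covers exactly one run length


def scaled_heights(counts, bin_width):
    """Divide each count by the total histogram area so the bars have area 1."""
    area = bin_width * sum(counts)  # A = 1 x bin width x number of individuals
    return [c / area for c in counts]


heights = scaled_heights(counts, bin_width)
total_area = sum(h * bin_width for h in heights)

for length, height in zip(run_lengths, heights):
    print(f"run length {length}: height {height:.3f}")
print(f"total area = {total_area:.3f}")
```

The tallest bar has height 71/109, which agrees with the value of about 0.65 in the exercise, and the bar areas sum to 1.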

[Figure panels: "Partial" (left) and "Full" (right); horizontal axis: run length; vertical axis: p.m.f.]

Figure: Histograms of run lengths of diseased trees in an infected plantation, partial (left plot) and full (right plot) data sets.

Comparing histograms

[Figure panels: "Leeds city centre" (left) and "Ladybower Reservoir" (right); horizontal axis: daily max ozone; vertical axis: density.]

Figure: Histograms of the summer daily maximum ozone levels (ppb) in Leeds city centre and at Ladybower reservoir.

Reducing variability

[Figure panel: "Differences"; horizontal axis: daily max ozone, from −50 to 20; vertical axis: density.]

Figure: Differences of summer ozone daily maxima at the two sites.

Effect of bin size

[Figure panels: two histograms titled "Leeds"; horizontal axis: summer daily maxima; vertical axis: density.]

Figure: Histograms of density of summer ozone data using different bin sizes.

Empirical distribution function

Recall the cumulative distribution function (c.d.f.) of a random variable X:

$$F(x) = P(X \le x)$$

How can we estimate this from a finite number of observations?

Let us assume that our variables X1, ..., Xn are independent and identically distributed (i.i.d.) replicates of a random variable X which has cumulative distribution function F. We denote by x1, ..., xn the observed values of X1, ..., Xn.

The empirical cumulative distribution function (c.d.f.) is defined as

$$\tilde{F}(x) = \frac{1}{n}\,(\text{number of } x_i \le x) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}(x_i \le x)$$

where

$$\mathbb{I}(x_i \le x) = \begin{cases} 1 & \text{if } x_i \le x \\ 0 & \text{if } x_i > x \end{cases}$$

The empirical c.d.f. is a proper distribution function and has the following properties:

◮ $\tilde{F}(x)$ is a step function with jumps at the data points;

◮ $\tilde{F}(x) = 1$ if $x \ge \max(x_1, \ldots, x_n)$;

◮ $\tilde{F}(x) = 0$ if $x < \min(x_1, \ldots, x_n)$.

Remark

◮ We have no reason to favor any particular observation, so we give each observation an equal weight 1/n. If some values are more likely than others, they simply appear more frequently than others.

◮ Take the observed values and order them so that the smallest one comes first. Label these ordered values $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ so that $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$. Then the kth ordered point $x_{(k)}$ is the k/n th quantile.

Exercise 3.2.2
For observations {1, 2, 2, 3, 4}, find $\tilde{F}(x)$ and sketch the plot.

x        0    0.5   1     1.5   2     2.5   3     3.5   4    4.5   5
F̃(x)     0    0     1/5   1/5   3/5   3/5   4/5   4/5   1    1     1

[Figure: plot of F̃(x) against x, for x from 0 to 5.]
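The definition of the empirical c.d.f. translates directly into code. The following is a minimal sketch in Python that counts the observations less than or equal to x and divides by n; the helper name `ecdf` and the evaluation grid are choices made for this illustration.

```python
from fractions import Fraction


def ecdf(data):
    """Return the empirical c.d.f. of `data` as a function of x."""
    n = len(data)

    def F(x):
        # Proportion of observations less than or equal to x.
        return Fraction(sum(1 for xi in data if xi <= x), n)

    return F


F = ecdf([1, 2, 2, 3, 4])
grid = [0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]
values = [F(x) for x in grid]

for x, v in zip(grid, values):
    print(f"F({x}) = {v}")
```

Exact fractions are used so that the values can be compared directly with those tabulated for the observations {1, 2, 2, 3, 4}. The jump at x = 2 is 2/5 because that value occurs twice.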

Consider the following n = 20 sorted observations:

16 22 23 23 26 27 27 28 29 30 32 32 32 33 34 34 35 35 36 45

At each sorted data point the empirical c.d.f. jumps by 1/n for every observation equal to that value, so it reaches i/n at the ith ordered observation (taking the last of any tied values). Here n = 20.

x       16     22     23     26     27     28     29
F̃(x)    1/20   2/20   4/20   5/20   7/20   8/20   9/20

x       30      32      33      34      35      36      45
F̃(x)    10/20   13/20   14/20   16/20   18/20   19/20   1

[Figure panels: two plots of F̃(x) titled "Leeds"; horizontal axis: summer daily maxima (one panel from 15 to 45, the other from 0 to 80); vertical axis: F̃(x) from 0 to 1.]

Figure: Empirical c.d.f. for summer daily maxima ozone at the Leeds city centre site.

Scatterplot

When we have multivariate data, we have to look at dependence between variables.

Scatterplot: plots one variable against another.

Exercise 3.2.3
Ozone. We now turn to the effect of nitrogen dioxide (NO2) on ozone levels. We focus on the Leeds city centre ozone measurements as our response variable.

[Figure panels: "Summer" (left) and "Winter" (right); horizontal axis: NO2; vertical axis: O3.]

Figure: Scatterplots of Leeds city centre O3 values against NO2 for each season.

Visualising conditional distribution

As well as looking for dependence between variables, it can also be useful to identify situations in which variables appear to be independent. If two variables are independent, then the distribution of one variable will look the same regardless of the value of the other variable.

Conditional probabilities were introduced in Math104:

If A and B are two events then, as long as P(B) > 0, the conditional probability of A given B is written as P(A | B) and calculated from

P(A | B) = P(A ∩ B)/P(B).

We can look for more structure in our data, including the dependence of one variable on another, by examining conditional distributions of some subsets of our data.

Exercise 3.2.4
Ozone data. We will look at the following conditional histograms for Leeds city centre:

◮ daily maximum ozone levels in summer conditional on {NO2 ≤ 40};

◮ daily maximum ozone levels in summer conditional on {40 < NO2 ≤ 60};

◮ daily maximum ozone levels in winter conditional on {NO2 ≤ 40};

◮ daily maximum ozone levels in winter conditional on {40 < NO2 ≤ 60}.
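Conditioning on an event such as {NO2 ≤ 40} amounts to keeping only the observations for which the event holds and then summarising the ozone values that remain. The following is a minimal sketch in Python of that subsetting step. The data pairs are invented purely for illustration and are not the Leeds measurements, and the helper name `o3_given_no2` is hypothetical.

```python
# Hypothetical (O3, NO2) daily pairs in ppb, invented purely for illustration;
# they are NOT the Leeds measurements.
pairs = [(35, 22), (28, 45), (41, 18), (19, 58),
         (30, 39), (24, 51), (33, 40), (15, 72)]


def o3_given_no2(pairs, lower, upper):
    """Return the O3 values of the days on which lower < NO2 <= upper."""
    return [o3 for o3, no2 in pairs if lower < no2 <= upper]


low_no2 = o3_given_no2(pairs, float("-inf"), 40)  # condition {NO2 <= 40}
mid_no2 = o3_given_no2(pairs, 40, 60)             # condition {40 < NO2 <= 60}

# Empirical analogue of P(A | B) = P(A and B) / P(B),
# with A = {O3 > 30} and B = {NO2 <= 40}.
n = len(pairs)
p_b = len(low_no2) / n
p_a_and_b = sum(1 for o3 in low_no2 if o3 > 30) / n
p_a_given_b = p_a_and_b / p_b

print("O3 given NO2 <= 40:     ", low_no2)
print("O3 given 40 < NO2 <= 60:", mid_no2)
print("proportion with O3 > 30 given NO2 <= 40:", p_a_given_b)
```

A histogram drawn from each subset is then a conditional histogram. If the subsets give similar-looking histograms, that is informal evidence that ozone does not depend strongly on NO2 over that range.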

[Figure panels: "Summer Ozone | NO2 ≤ 40", "Summer Ozone | 40 < NO2 ≤ 60", "Winter Ozone | NO2 ≤ 40" and "Winter Ozone | 40 < NO2 ≤ 60"; horizontal axis: O3; vertical axis: density.]

Figure: Conditional histograms of ozone levels in Leeds city centre conditional on {NO2 ≤ 40} and {40 < NO2 ≤ 60} in summer (left) and winter (right).

Summary statistics

Numerical summaries of the data can

◮ facilitate the comparison of different variables;

◮ help us make clear statements about some aspects of the data.

Mathematical skills:

Recall the notation

$$\sum_{i=1}^{n} g(i) = g(1) + g(2) + \ldots + g(n-1) + g(n)$$

for any positive integer value of n and any function g. In statistics we often have to do mathematics with sums of the form $\sum_{i=1}^{n} h(x_i)$. Using the above notation, this means

$$\sum_{i=1}^{n} h(x_i) = h(x_1) + h(x_2) + \ldots + h(x_{n-1}) + h(x_n)$$

for any positive integer value of n, any function h, and any set of values x1, ..., xn.

The most common forms of this expression we will encounter are:

$$\sum_{i=1}^{n} x_i = x_1 + \ldots + x_n \quad \text{and} \quad \sum_{i=1}^{n} x_i^2 = x_1^2 + \ldots + x_n^2.$$

Let $c \ne 0$ and d be real numbers and denote $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.

Sample mean

Consider a random variable X of which we obtain n i.i.d. realisations X1, ..., Xn. The sample mean of the observed values x1, ..., xn is defined as follows:

The sample mean of n observations x1, ..., xn is denoted by m(x) and obtained by adding all the xi and dividing by n:

$$m(x) = \frac{\sum_{i=1}^{n} x_i}{n} = \bar{x}.$$

This is an estimate of $\mu_X$, the mean or expectation of X, and is viewed as a measure of center.

Sample variance

The sample variance of n observations x1, ..., xn is denoted $s^2(x)$ and is given by:

$$s^2(x) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2.$$

Note the divisor n here instead of n − 1. The sample variance is an estimate of the variance of X, $\sigma_X^2$.

Sample standard deviation

Ideally, the spread measure should have the same units as the original data. To obtain a measure with the correct units, we take the square root and define:

The sample standard deviation of n observations x1, ..., xn is denoted by s(x) and is given by:

$$s(x) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}.$$

The standard deviation of X, $\sigma_X$, is the square root of $\sigma_X^2 = \mathrm{Var}(X)$, and the sample standard deviation estimates this value from the data.
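The three definitions can be written out as a short Python sketch. The function names are made up for this illustration, and the data values are chosen only to make the arithmetic easy to follow by hand.

```python
from math import sqrt


def sample_mean(x):
    """m(x): add all the observations and divide by n."""
    return sum(x) / len(x)


def sample_variance(x):
    """s^2(x): average squared deviation from the mean, with divisor n."""
    m = sample_mean(x)
    return sum((xi - m) ** 2 for xi in x) / len(x)  # divisor n, not n - 1


def sample_sd(x):
    """s(x): square root of the sample variance, in the units of the data."""
    return sqrt(sample_variance(x))


data = [2, 4, 6, 8, 10]
print(sample_mean(data))          # 6.0
print(sample_variance(data))      # 8.0
print(round(sample_sd(data), 2))  # 2.83
```

For these values the squared deviations are 16, 4, 0, 4 and 16, so the variance is 40/5 = 8. In Python's standard library, `statistics.pvariance` and `statistics.pstdev` use the same divisor n, whereas `statistics.variance` and `statistics.stdev` divide by n − 1.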

Exercise 3.3.1
Ozone data. We will calculate summary statistics of O3 to look more closely for differences between the locations and the seasons. There are four groups, arising from the two levels of each of the two nominal variables location and season. Numbers in parentheses are standard deviations.

Mean ozone concentration
            Leeds city        Ladybower
summer      31.78 (9.28)      43.63 (11.81)
winter      20.52 (10.77)     29.24 (8.40)

What conclusions do you draw from these summary statistics?

Sample quantiles

The median is the midpoint of the data, another measure of the center of the data.

Sample quantiles are calculated directly from the empirical c.d.f. To estimate the pth quantile, we find the value $x_p$ that satisfies

$$\tilde{F}(x_p) = p.$$

Exercise 3.3.2
Calculate the sample mean, sample standard deviation and sample quantiles (x0.25, x0.5, x0.75) for each data set.

◮ Data A: 2, 4, 6, 8, 10

◮ Data B: 2, 4, 6, 8, 100

◮ Data C: 2, 4, 6, 8, 1000

Data                        x̄      s(x)      x0.25   x0.5   x0.75
Data A: 2, 4, 6, 8, 10      6      2.83      4       6      8
Data B: 2, 4, 6, 8, 100     24     38.05     4       6      8
Data C: 2, 4, 6, 8, 1000    204    398.01    4       6      8
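The table can be reproduced with a short Python sketch. Because the empirical c.d.f. is a step function, there is not always a value at which it equals p exactly. The sketch therefore takes the smallest observation whose empirical c.d.f. is at least p. This convention is an assumption made for the illustration (other conventions exist), and the helper names `sample_quantile` and `summary` are hypothetical.

```python
from math import ceil
from statistics import mean, pstdev  # pstdev uses divisor n


def sample_quantile(data, p):
    """Smallest observation x whose empirical c.d.f. is >= p, for 0 < p <= 1."""
    ordered = sorted(data)
    k = ceil(p * len(ordered))  # the k-th ordered value has c.d.f. >= k/n >= p
    return ordered[k - 1]


def summary(data):
    """Return (mean, standard deviation, x_0.25, x_0.5, x_0.75)."""
    quartiles = [sample_quantile(data, p) for p in (0.25, 0.5, 0.75)]
    return (mean(data), round(pstdev(data), 2), *quartiles)


datasets = {
    "A": [2, 4, 6, 8, 10],
    "B": [2, 4, 6, 8, 100],
    "C": [2, 4, 6, 8, 1000],
}
results = {name: summary(values) for name, values in datasets.items()}

for name, row in results.items():
    iqr = row[4] - row[2]  # interquartile range x_0.75 - x_0.25
    print(name, row, "IQR =", iqr)
```

Changing the largest observation from 10 to 1000 moves the mean from 6 to 204 and the standard deviation from 2.83 to 398.01, while the three quartiles, and hence the interquartile range of 4, stay the same.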

Using empirical c.d.f.

[Figure: empirical c.d.f. F̃(x) titled "Leeds"; horizontal axis: summer daily maxima, from 0 to 80; vertical axis: F̃(x) from 0 to 1. Annotation: p = 0.6, x(p) in interval (33, 34).]

Using sample quantiles

The sample median is an alternative measure of the center of the distribution. Similarly, an alternative measure of the spread of the distribution is the range of the middle 50% of observations, called the interquartile range, x0.75 − x0.25.

Using the above example, compare (i) sample mean and sample median, (ii) sample standard deviation and sample interquartile range. When do you think measures based on sample quantiles are preferable?

The sample mean is sensitive to outliers and is greatly influenced by extreme points, whereas the sample median is not affected by them. Likewise, the sample standard deviation is greatly inflated by extreme points, whereas the sample interquartile range is stable.

Boxplot

Exercise 3.3.3
Ozone data.

[Figure panels: "Summer" (left) and "Winter" (right), each showing boxplots of Leeds.O3 and Ladybower.O3 on a vertical axis from 0 to 100.]

Figure: Sample quantiles for the ozone data are summarized in boxplots.

Sample correlation

Consider random variables X and Y of which we have i.i.d. observations (x1, y1), ..., (xn, yn).

First calculate:

◮ m(x) and s(x) for variable X;

◮ m(y) and s(y) for variable Y.

Next standardise x and y:

$$\frac{x_i - m(x)}{s(x)} \quad \text{and} \quad \frac{y_i - m(y)}{s(y)} \quad \text{for all } i = 1, \ldots, n$$

Then the sample correlation coefficient is the average of the product of these standardised values:

The sample correlation coefficient of n pairs of observations (x1, y1), ..., (xn, yn) is denoted by r(x, y) and is given by:

$$r(x, y) = \frac{1}{n}\sum_{i=1}^{n} \left(\frac{x_i - m(x)}{s(x)}\right)\left(\frac{y_i - m(y)}{s(y)}\right) \in [-1, 1]$$

The sample correlation coefficient is an estimate of the correlation between X and Y, denoted Corr(X, Y).
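The calculation can be followed step by step in code. The following is a minimal sketch in Python that standardises each variable with the divisor-n standard deviation and averages the products. The function name and the small data sets are made up for this illustration.

```python
from math import sqrt


def sample_correlation(x, y):
    """Average product of the standardised values of x and y (divisor n)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = sqrt(sum((xi - mx) ** 2 for xi in x) / n)
    sy = sqrt(sum((yi - my) ** 2 for yi in y) / n)
    products = (((xi - mx) / sx) * ((yi - my) / sy) for xi, yi in zip(x, y))
    return sum(products) / n


x = [1, 2, 3, 4, 5]
r_increasing = sample_correlation(x, [2, 4, 6, 8, 10])  # exact increasing line
r_decreasing = sample_correlation(x, [10, 8, 6, 4, 2])  # exact decreasing line
r_noisy = sample_correlation(x, [2, 1, 4, 3, 5])        # increasing with scatter

print(round(r_increasing, 3), round(r_decreasing, 3), round(r_noisy, 3))
```

Points lying exactly on an increasing or decreasing straight line give values at the two ends of the interval [−1, 1]. For the third data set the sum of products of deviations is 8 and both sums of squared deviations are 10, so r = 0.8.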

Sample correlation coefficient

[Figure: six scatterplots, both axes running from −4 to 4, with panel titles:]

Corr(X, Y) = −0.9    Corr(X, Y) = −0.5    Corr(X, Y) = 0

Corr(X, Y) = 0.3     Corr(X, Y) = 0.5     Corr(X, Y) = 0.7

Misuse of sample correlation coefficient

[Figure: four scatterplots, both axes running from −3 to 3, with panel titles:]

Corr(X, Y) = 0.16    Corr(X, Y) = 0.028

Corr(X, Y) = 0.51    Corr(X, Y) = −0.83

Figure: The sample correlation coefficient is not an appropriate measure of the strength of non-linear association.
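A small constructed example makes the same point numerically. In the Python sketch below, y is completely determined by x through y = x², yet the sample correlation coefficient is zero because the relationship is not linear. The data are invented for this illustration and are not the data plotted in the figure. The coefficient is computed in the algebraically equivalent form: average product of deviations divided by s(x)s(y).

```python
from math import sqrt


def sample_correlation(x, y):
    """Sample correlation with divisor n, written as covariance / (s(x) s(y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = sqrt(sum((xi - mx) ** 2 for xi in x) / n)
    sy = sqrt(sum((yi - my) ** 2 for yi in y) / n)
    covariance = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    return covariance / (sx * sy)


# Invented data: y depends on x exactly, but through a quadratic curve.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

r_quadratic = sample_correlation(x, y)
print(f"r = {r_quadratic:.3f}")
```

The positive and negative products of deviations cancel because the x values are symmetric about zero, so a coefficient near zero here says nothing about how strongly y depends on x. A scatterplot should always be inspected alongside the coefficient.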

Exercise 3.3.4
Ozone data. Calculate sample correlation coefficients between O3 and NO2 for the ozone data. There are four groups, arising from the two levels of each of the two nominal variables location and season.

          Leeds city    Ladybower reservoir
Summer    0.10          0.25
Winter    -0.24         -0.48
