chapter 3. exploratory data analysisparkj1/math105/mainslide_ch3.pdf · exploratory data analysis....


Chapter 3. Exploratory data analysis

Outline

◮ Example problems and associated data sets
    ◮ Diseased trees
    ◮ Urban and rural ozone
    ◮ Comparing hospitals

◮ Graphical methods
    ◮ Histograms
        ◮ Choice of bin size
    ◮ Empirical distribution function
    ◮ Scatterplot
    ◮ Visualising conditional distributions
    ◮ Historical note – Florence Nightingale

◮ Summary statistics
    ◮ Sample mean
    ◮ Sample variance and standard deviation
    ◮ Sample quantiles
    ◮ Sample correlation

Data and Random variability

◮ Data are the information we have collected, but they also carry uncertainty, due to random variability in the characteristics of interest from one individual to another.

◮ In mathematical terms, the characteristics being measured are represented by random variables, e.g. X = age of an individual chosen at random.

◮ "Individual" in this context refers to a unit of study: actual people, towns, cars, test tubes...

Random variables and realisations

IMPORTANT DIFFERENCE BETWEEN:

Random variable ←→ realisation (or observation)

◮ A random variable is always written in UPPER CASE and is associated with a probability distribution, e.g. X = ozone level.

◮ Observations of random variables are written in lower case and are just numbers, e.g. x = observed value.

Data set of size n:

$X_1, \dots, X_n$ random variables

$x_1, \dots, x_n$ particular realisations

Data analysis

The first stage in any analysis is to get to know the problem and the data. This usually involves a variety of graphical procedures to visualise the data, as well as the calculation of a few simple summary numbers, or summary statistics, that capture key features of the data and, we hope, reveal key features of the unknown underlying distribution.

The random variability in the data reflects the underlying distribution. Bear in mind, however, that because the sample size is finite, the data give only an approximation to the true underlying distribution and its features. We therefore also need to care about how good that approximation is.

Role of exploratory analysis

Finding errors and anomalies: missing data, outliers, changes of scale...

Suggesting the route of subsequent analysis: plots of data give information on the location, scale and shape of the distribution and on relationships between variables.

Augmenting understanding of the applied problem: exploratory graphical tools sharpen the scientific questions being addressed.

Example problems and associated data sets

Ecological: Diseased trees

Atmospheric chemistry: Monitoring urban air pollution

Health: Comparing hospitals

Diseased trees

How does the disease spread between trees, and what is the probability that trees are infected by the disease?

Run length                                 0    1    2    3    4    5
Number of runs in first 50 observations   31   16    2    0    1    0
Number of runs in 109 observations        71   28    5    2    2    1

Table: Run lengths of diseased trees in an infected plantation.

Urban and rural ozone

Air quality monitoring:

◮ How, if at all, does the distribution of ozone measurements vary between the urban and rural sites?

◮ How, if at all, is the distribution of ozone measurements affected by season?

◮ How, if at all, does the presence of other pollutants affect the levels of measured ozone?

Data: daily measurements of the maximum hourly mean concentration of O3 and NO2 (ppb):

$x_1, \dots, x_n$ Leeds city centre ozone

$y_1, \dots, y_n$ Ladybower reservoir ozone

We will look at observations from early summer (April – July inclusive) and winter (November – February inclusive).

Comparing hospitals

Number of successful operations at each hospital (out of ten)

Hospital 1    Hospital 2
    9             5

What can we conclude about the relative performances of the two hospitals?

Population and sample: example

In the ozone problem, we have data from a number of days during 1994-1998. However, interest is not solely in the levels of ozone on the days on which measurements were taken. The objective of a statistical analysis is to learn about the relationships between the variables, and to draw more general conclusions about levels of the variables on other, perhaps future, dates.

Exercise 3.1.1
For each of the problems that we are concentrating on in the course, state the populations that we are trying to learn about:

Ozone: Levels of ozone at the two locations given the time of year and the level of NO2.

Diseased trees: All trees in the forest and, possibly, other locations where the climate, soil and trees have similar properties.

Hospital: Other operations at the two hospitals.

Discrete or Continuous: Diseased trees

Exercise 3.1.2
For the diseased trees data set, define the variable of interest as X and find the possible range of values. Is the variable discrete or continuous?

◮ The variable of interest is X = length of the unbroken run of diseased trees neighbouring a diseased tree.

◮ Possible values are {0, 1, 2, ...}, so X is discrete.

Discrete or Continuous: Comparing hospitals

Exercise 3.1.3
For the hospital data set, define the variable of interest as X and find the possible range of values. Is the variable discrete or continuous?

◮ The variables of interest are

X = number of successful operations in the first hospital

Y = number of successful operations in the second hospital

◮ Possible values are {0, 1, ..., 10} for both variables.


Histograms

Histogram - shape of distribution

◮ Bins of equal width

◮ Number of observations in each bin

Figure: Histograms of run lengths of diseased trees in an infected plantation. Panels: Partial and Full; horizontal axis: Run length; vertical axis: Count.

Scaled histogram

We can rescale the vertical axis of our histogram to ensure that the histogram has area 1. We do this by calculating the area of our original histogram, then dividing all the counts, or frequencies, by this amount. When the bins are all of equal width, the area is:

$$A = \text{contribution of one individual} \times \text{number of individuals} = 1 \times \text{bin width} \times \text{number of individuals}$$

Exercise 3.2.1
Diseased trees. For the histogram of the full data set of diseased trees, $A = 1 \times 109$ and the new maximum value on the y axis is approximately

$$\frac{71}{1 \times 109} \approx 0.65$$
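The rescaling in this exercise can be checked numerically. The following is only an illustrative sketch: it takes the full-data counts from the run-length table, assumes a bin width of 1 (one bin per run length), and all variable names are our own.

```python
# Counts of runs of length 0..5 in the full data set (from the run-length table).
counts = [71, 28, 5, 2, 2, 1]
bin_width = 1  # assumption: one bin per run length

n = sum(counts)                        # number of individuals
area = 1 * bin_width * n               # area of the unscaled histogram
heights = [c / area for c in counts]   # rescaled bar heights

# After rescaling, the bars should enclose a total area of 1.
total_area = sum(h * bin_width for h in heights)

print(n, round(max(heights), 2))  # 109 0.65
```

With unit bin width the rescaled heights are simply relative frequencies, which is why the vertical axis of the scaled plot can be read as a p.m.f.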

Figure: Histograms of run lengths of diseased trees in an infected plantation, partial (left plot) and full (right plot) data sets. Horizontal axis: Run length; vertical axis: p.m.f.

Comparing histograms

Figure: Histograms of the summer daily maximum ozone levels (ppb) in Leeds city centre and at Ladybower reservoir. Horizontal axis: Daily max ozone; vertical axis: Density.

Reducing variability

Figure: Differences of summer ozone daily maxima at the two sites. Horizontal axis: Daily max ozone (differences); vertical axis: Density.

Effect of bin size

Figure: Histograms of density of summer ozone data using different bin sizes. Both panels: Leeds; horizontal axis: Summer daily maxima; vertical axis: Density.
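To see how bin size changes a density histogram, one can recompute the bar heights with two different bin widths. The sketch below is an assumption-laden illustration, not the code used for the figure: `density_histogram` is a hypothetical helper, the bins are left-closed intervals, and the 20 values are those listed in the empirical c.d.f. example later in this chapter.

```python
def density_histogram(data, bin_width, start):
    """Return (left_edge, density) pairs for equal-width bins [edge, edge + width)."""
    n = len(data)
    n_bins = int((max(data) - start) // bin_width) + 1
    counts = [0] * n_bins
    for x in data:
        counts[int((x - start) // bin_width)] += 1
    # Dividing by n * bin_width makes the total area equal to 1.
    return [(start + k * bin_width, c / (n * bin_width))
            for k, c in enumerate(counts)]

leeds = [16, 22, 23, 23, 26, 27, 27, 28, 29, 30,
         32, 32, 32, 33, 34, 34, 35, 35, 36, 45]

coarse = density_histogram(leeds, bin_width=10, start=10)  # 4 wide bins
fine = density_histogram(leeds, bin_width=2, start=16)     # 15 narrow bins

for edge, density in coarse:
    print(edge, density)
```

Wide bins smooth away detail, while narrow bins leave many bars based on one or two observations; both versions still have total area 1.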

Empirical distribution function

Recall the cumulative distribution function (c.d.f.) of a random variable X:

$$F(x) = P(X \le x)$$

How can we estimate this from a finite number of observations?

Let us assume that our variables $X_1, \dots, X_n$ are independent and identically distributed (i.i.d.) replicates of a random variable X which has cumulative distribution function F. We denote by $x_1, \dots, x_n$ the observed values of $X_1, \dots, X_n$.

The empirical cumulative distribution function (c.d.f.) is defined as

$$\tilde{F}(x) = \frac{1}{n}\,(\text{number of } x_i \le x) = \frac{\sum_{i=1}^{n} \mathbb{I}(x_i \le x)}{n}$$

where

$$\mathbb{I}(x_i \le x) = \begin{cases} 1 & \text{if } x_i \le x \\ 0 & \text{if } x_i > x \end{cases}$$

The empirical c.d.f. is a proper distribution function and has the following properties:

◮ $\tilde{F}(x)$ is a step function with jumps at the data points;

◮ $\tilde{F}(x) = 1$ if $x \ge \max(x_1, \dots, x_n)$;

◮ $\tilde{F}(x) = 0$ if $x < \min(x_1, \dots, x_n)$.
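The definition and properties of the empirical c.d.f. can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `ecdf` and the sample values are hypothetical.

```python
def ecdf(observations):
    """Return a function x -> proportion of observations less than or equal to x."""
    data = list(observations)
    n = len(data)

    def f_tilde(x):
        # Average of the indicators I(x_i <= x).
        return sum(1 for xi in data if xi <= x) / n

    return f_tilde

sample = [3.1, 0.4, 2.2, 2.2, 5.0]  # hypothetical observations
F = ecdf(sample)

print(F(0.0), F(2.2), F(5.0))  # 0.0 0.6 1.0
```

The repeated value 2.2 produces a jump of 2/5 at that point, which is how more likely values receive more weight.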

Remark

◮ We have no reason to favour any particular observation, so we give each observation an equal weight 1/n. If some values are more likely than others, they simply appear more frequently than others.

◮ Take the observed values and order them so that the smallest one comes first. Label these ordered values $x_{(1)}, x_{(2)}, \dots, x_{(n)}$ so that

$$x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}.$$

Then the kth ordered point $x_{(k)}$ is the (k/n)th quantile.

Exercise 3.2.2
For observations {1, 2, 2, 3, 4}, find $\tilde{F}(x)$ and sketch the plot.

x                 0    0.5   1     1.5   2     2.5   3     3.5   4    4.5   5
$\tilde{F}(x)$    0    0     1/5   1/5   3/5   3/5   4/5   4/5   1    1     1

Figure: Plot of the step function $\tilde{F}(x)$ against x, for x from 0 to 5.

Sorted data: 16 22 23 23 26 27 27 28 29 30 32 32 32 33 34 34 35 35 36 45

Here n = 20. Each observation contributes a jump of 1/n, so at the i-th sorted data point the empirical c.d.f. reaches i/n (for tied values, the largest such i).

x                 16     22     23     26     27     28     29
$\tilde{F}(x)$    1/20   2/20   4/20   5/20   7/20   8/20   9/20

x                 30      32      33      34      35      36      45
$\tilde{F}(x)$    10/20   13/20   14/20   16/20   18/20   19/20   1
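The tabulated heights can be reproduced by accumulating counts over the distinct sorted values. This is a sketch under our own naming; `table` maps each distinct value to the cumulative count, so the height of the empirical c.d.f. is that count divided by n.

```python
from collections import Counter

data = [16, 22, 23, 23, 26, 27, 27, 28, 29, 30,
        32, 32, 32, 33, 34, 34, 35, 35, 36, 45]
n = len(data)

table = {}
cumulative = 0
for value, multiplicity in sorted(Counter(data).items()):
    cumulative += multiplicity   # a jump of multiplicity/n at this value
    table[value] = cumulative    # empirical c.d.f. at value is cumulative/n

for value, count in table.items():
    print(value, f"{count}/{n}")
```

Values that occur more than once, such as 23 or 32, give a jump that is a multiple of 1/n.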

Figure: Plot of $\tilde{F}(x)$ for the values above. Title: Leeds; horizontal axis: Summer daily maxima (15 to 45).

Figure: Empirical c.d.f. for summer daily maxima ozone at the Leeds city centre site. Horizontal axis: Summer daily maxima; vertical axis: $\tilde{F}(x)$.

Scatterplot

When we have multivariate data, we have to look at dependence between variables.

Scatterplot – plots one variable against another.

Exercise 3.2.3
Ozone. We now turn to the effect of nitrogen dioxide (NO2) on ozone levels. We focus on the Leeds city centre ozone measurements as our response variable.

Figure: Scatterplots of Leeds city centre O3 values against NO2 for each season. Panels: Summer and Winter; horizontal axis: NO2; vertical axis: O3.

Visualising conditional distributions

As well as looking for dependence between variables, it can also be useful to identify situations in which variables appear to be independent. If two variables are independent, then the distribution of one variable will look the same regardless of the value of the other variable.

Conditional probabilities were introduced in Math104:

If A and B are two events then, as long as P(B) > 0, the conditional probability of A given B is written as P(A | B) and calculated from

$$P(A \mid B) = P(A \cap B) / P(B).$$

We can look for more structure in our data, including the dependence of one variable on another, by examining conditional distributions of some subsets of our data.

Exercise 3.2.4
Ozone data. We will look at the following conditional histograms for Leeds city centre:

◮ daily maximum ozone levels in summer conditional on {NO2 ≤ 40};

◮ daily maximum ozone levels in summer conditional on {40 < NO2 ≤ 60};

◮ daily maximum ozone levels in winter conditional on {NO2 ≤ 40};

◮ daily maximum ozone levels in winter conditional on {40 < NO2 ≤ 60}.
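Conditioning on an event such as {40 < NO2 ≤ 60} amounts to keeping only the ozone values whose paired NO2 value satisfies the condition, then drawing a histogram of that subset. The sketch below shows the subsetting step only; the (NO2, O3) pairs are made up for illustration and `condition_on` is a hypothetical helper.

```python
# Hypothetical (NO2, O3) pairs in ppb; these are NOT the course data.
pairs = [(25, 38), (35, 33), (45, 27), (55, 24), (38, 30), (58, 21), (70, 15)]

def condition_on(pairs, lower, upper):
    """Return the O3 values whose paired NO2 satisfies lower < NO2 <= upper."""
    return [o3 for no2, o3 in pairs if lower < no2 <= upper]

low_no2 = condition_on(pairs, float("-inf"), 40)  # event {NO2 <= 40}
mid_no2 = condition_on(pairs, 40, 60)             # event {40 < NO2 <= 60}

print(low_no2)  # [38, 33, 30]
print(mid_no2)  # [27, 24, 21]
```

If the two subsets have similar histograms, the data give little evidence that ozone depends on NO2 over that range; clearly different histograms suggest dependence.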

Figure: Conditional histograms of ozone levels in Leeds city centre conditional on {NO2 ≤ 40} and {40 < NO2 ≤ 60} in summer (left) and winter (right). Panels: Summer Ozone | NO2 ≤ 40; Summer Ozone | 40 < NO2 ≤ 60; Winter Ozone | NO2 ≤ 40; Winter Ozone | 40 < NO2 ≤ 60. Horizontal axis: O3; vertical axis: Density.

Summary statistics

Numerical summaries of the data can

◮ facilitate the comparison of different variables;

◮ help us make clear statements about some aspects of the data.

Mathematical skills:

Recall the notation

$$\sum_{i=1}^{n} g(i) = g(1) + g(2) + \dots + g(n-1) + g(n)$$

for any positive integer value of n and any function g. In statistics we often have to do mathematics with sums of the form $\sum_{i=1}^{n} h(x_i)$. Using the above notation, this means

$$\sum_{i=1}^{n} h(x_i) = h(x_1) + h(x_2) + \dots + h(x_{n-1}) + h(x_n)$$

for any positive integer value of n, any function h, and any set of values $x_1, \dots, x_n$.

The most common forms of this expression we will encounter are:

$$\sum_{i=1}^{n} x_i = x_1 + \dots + x_n \quad \text{and} \quad \sum_{i=1}^{n} x_i^2 = x_1^2 + \dots + x_n^2.$$

Let $c \neq 0$ and $d$ be real numbers and denote $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.

Sample mean

Consider a random variable X of which we obtain n i.i.d. replicates $X_1, \dots, X_n$. The sample mean of the observed values $x_1, \dots, x_n$ is defined as follows:

The sample mean of n observations $x_1, \dots, x_n$ is denoted by m(x) and obtained by adding all the $x_i$ and dividing by n:

$$m(x) = \frac{\sum_{i=1}^{n} x_i}{n} = \bar{x}.$$

This is an estimate of $\mu_X$, the mean or expectation of X, and is viewed as a measure of center.

Sample variance

The sample variance of n observations $x_1, \dots, x_n$ is denoted $s^2(x)$ and is given by:

$$s^2(x) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2.$$

Note the divisor n here instead of n − 1. The sample variance is an estimate of the variance of X, $\sigma_X^2$.

Sample standard deviation

Ideally, the spread measure should have the same units as the original data. To obtain a measure with the correct units, we take the square root and define:

The sample standard deviation of n observations $x_1, \dots, x_n$ is denoted by s(x) and is given by:

$$s(x) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}.$$

The standard deviation of X, $\sigma_X$, is the square root of $\sigma_X^2 = \mathrm{Var}(X)$, and the sample standard deviation estimates this value from the data.
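The three definitions translate directly into code. The sketch below uses hypothetical data and our own function names; note that the divisor is n, as in the definitions above, which matches `statistics.pvariance` and `statistics.pstdev` in the Python standard library rather than `statistics.variance` (divisor n − 1).

```python
import math
import statistics

def sample_mean(xs):
    """m(x): sum of the observations divided by n."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """s^2(x): mean squared deviation from the sample mean (divisor n)."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def sample_sd(xs):
    """s(x): square root of the sample variance, in the units of the data."""
    return math.sqrt(sample_variance(xs))

xs = [2.0, 3.0, 5.0, 6.0]  # hypothetical observations

print(sample_mean(xs), sample_variance(xs))  # 4.0 2.5
print(statistics.variance(xs))               # larger, because the divisor is n - 1
```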

Exercise 3.3.1
Ozone data. We will calculate summary statistics of O3 to look more closely for differences between the locations and the seasons. There are four groups, arising from the two levels of each of the two nominal variables location and season. Numbers in parentheses are standard deviations.

Mean ozone concentration

            Leeds city       Ladybower
summer      31.78 (9.28)     43.63 (11.81)
winter      20.52 (10.77)    29.24 (8.40)

What conclusions do you draw from these summary statistics?

Sample quantiles

The median is the midpoint of the data, another measure of the center of the data.

Sample quantiles are calculated directly from the empirical c.d.f. To estimate the pth quantile, we find the value $x_p$ that satisfies

$$\tilde{F}(x_p) = p.$$

Exercise 3.3.2
Calculate the sample mean, sample standard deviation and sample quantiles ($x_{0.25}$, $x_{0.5}$, $x_{0.75}$) for each data set.

◮ Data A: 2, 4, 6, 8, 10

◮ Data B: 2, 4, 6, 8, 100

◮ Data C: 2, 4, 6, 8, 1000

Data                        $\bar{x}$   s(x)     $x_{0.25}$   $x_{0.5}$   $x_{0.75}$
Data A: 2, 4, 6, 8, 10      6           2.83     4            6           8
Data B: 2, 4, 6, 8, 100     24          38.05    4            6           8
Data C: 2, 4, 6, 8, 1000    204         398.01   4            6           8
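The table can be reproduced with a short script. This is a sketch under one stated assumption: because the empirical c.d.f. is a step function, an exact solution of F̃(x_p) = p need not exist, so the hypothetical `quantile` helper below takes the smallest data value at which the empirical c.d.f. is at least p. Other conventions (for example interpolation) exist and can give different answers.

```python
import math

def quantile(xs, p):
    """Smallest data value whose empirical c.d.f. is at least p (assumed convention)."""
    ordered = sorted(xs)
    k = max(1, math.ceil(p * len(ordered)))  # rank of the order statistic
    return ordered[k - 1]

def summary(xs):
    """Return (mean, sd rounded to 2 d.p., x_0.25, x_0.5, x_0.75)."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)  # divisor n
    return (mean, round(sd, 2),
            quantile(xs, 0.25), quantile(xs, 0.5), quantile(xs, 0.75))

datasets = {
    "A": [2, 4, 6, 8, 10],
    "B": [2, 4, 6, 8, 100],
    "C": [2, 4, 6, 8, 1000],
}

for name, xs in datasets.items():
    print(name, summary(xs))
```

Only the largest observation changes between the three data sets, yet the mean and standard deviation change dramatically while the quartiles and the median stay put.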

Using empirical c.d.f.

Figure: Empirical c.d.f. $\tilde{F}(x)$ of the Leeds summer daily maxima, with p = 0.6 marked on the vertical axis; x(p) lies in the interval (33, 34).

Using sample quantiles

The sample median is an alternative measure of the center of the distribution.

Similarly, an alternative measure of the spread of the distribution is the range of the middle 50% of observations, called the interquartile range, $x_{0.75} - x_{0.25}$.

Using the above example, compare (i) the sample mean and sample median, (ii) the sample standard deviation and sample interquartile range. When do you think measures based on sample quantiles are preferable?

The sample mean is sensitive to outliers and is greatly influenced by extreme points, whereas the sample median is not affected by them. Likewise, the sample standard deviation is greatly inflated by extreme points, whereas the sample interquartile range is stable.

Boxplot

Exercise 3.3.3
Ozone data.

Panels: Summer and Winter, each showing boxplots for Leeds.O3 and Ladybower.O3 on a vertical axis from 0 to 100.

Figure: Sample quantiles for ozone data are summarized in Boxplot. The

Sample correlation

Random variables X and Y of which we have i.i.d. observations $(x_1, y_1), \dots, (x_n, y_n)$.

First calculate:

◮ m(x) and s(x) for variable X;

◮ m(y) and s(y) for variable Y.

Next standardise x and y:

$$\frac{x_i - m(x)}{s(x)} \quad \text{and} \quad \frac{y_i - m(y)}{s(y)} \quad \text{for all } i = 1, \dots, n$$

Then the sample correlation coefficient is the average of the product of these standardised values:

The sample correlation coefficient of n pairs of observations $(x_1, y_1), \dots, (x_n, y_n)$ is denoted by r(x, y) and is given by:

$$r(x, y) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - m(x)}{s(x)} \right) \left( \frac{y_i - m(y)}{s(y)} \right) \in [-1, 1]$$

The sample correlation coefficient is an estimate of the correlation between X and Y, denoted Corr(X, Y).
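The two-step recipe (standardise, then average the products) can be sketched as follows. The function name and the data are our own illustrative choices, and the standard deviations use the divisor n, in line with the definition of s(x).

```python
import math

def sample_correlation(xs, ys):
    """r(x, y): average product of the standardised x and y values."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)  # divisor n
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return sum(((x - mx) / sx) * ((y - my) / sy)
               for x, y in zip(xs, ys)) / n

xs = [1.0, 2.0, 3.0, 4.0, 5.0]        # hypothetical observations
increasing = [2 * x + 1 for x in xs]  # exact increasing linear relationship
decreasing = [-x for x in xs]         # exact decreasing linear relationship

r_up = sample_correlation(xs, increasing)
r_down = sample_correlation(xs, decreasing)
print(round(r_up, 6), round(r_down, 6))  # 1.0 -1.0
```

Exact linear relationships attain the endpoints of [−1, 1]; scattered data fall strictly inside the interval.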

Sample correlation coefficient

Figure: Six scatterplots (both axes from −4 to 4) of data with Corr(X, Y) = −0.9, −0.5 and 0 (top row) and Corr(X, Y) = 0.3, 0.5 and 0.7 (bottom row).

Misuse of sample correlation coefficient

Figure: Sample correlation coefficient is not an appropriate measure of the strength of non-linear association. Four scatterplots (both axes from −3 to 3) with Corr(X, Y) = 0.16, 0.028, 0.51 and −0.83.
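A small constructed example makes the same point numerically. In the sketch below, y is an exact function of x (y = x²), yet the sample correlation is essentially zero because the relationship is not linear. The data and function name are our own, chosen purely for illustration.

```python
import math

def sample_correlation(xs, ys):
    """Average product of standardised values, with divisor n throughout."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return sum(((x - mx) / sx) * ((y - my) / sy)
               for x, y in zip(xs, ys)) / n

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]  # perfectly dependent on x, but not linearly

r = sample_correlation(xs, ys)
print(abs(r) < 1e-12)  # True: no linear association is detected
```

A correlation near zero therefore does not show that two variables are unrelated; a scatterplot should always be inspected alongside the coefficient.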

Exercise 3.3.4
Ozone data. Calculate sample correlation coefficients between O3 and NO2 for the ozone data. There are four groups, arising from the two levels of each of the two nominal variables location and season.

            Leeds city    Ladybower reservoir
Summer      0.10          0.25
Winter      -0.24         -0.48