modelling correlations using python

Modelling correlations with Python and SciPy

Eric Marsden <[email protected]>

Upload: eric-marsden

Post on 30-Jul-2015



Measuring linear correlation

▷ Linear correlation coefficient: a measure of the strength and direction of a linear association between two random variables

• also called the Pearson product-moment correlation coefficient

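▷ $\rho_{X,Y} = \dfrac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} = \dfrac{\mathbb{E}[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$

• $\mathbb{E}$ is the expectation operator

• cov means covariance

• $\mu_X$ is the expected value of random variable X

• $\sigma_X$ is the standard deviation of X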

▷ Python: scipy.stats.pearsonr(X, Y)

▷ Excel and Google Docs spreadsheet: CORREL
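As a rough check of the definition, the coefficient can be computed by hand and compared with the library functions. This is a minimal sketch on synthetic data: the sample size, seed and slope are arbitrary assumptions chosen for illustration, not values from the slides.

```python
import numpy
import scipy.stats

# Synthetic data for illustration only (parameters are arbitrary assumptions)
rng = numpy.random.default_rng(42)
X = rng.normal(10, 1, 500)
Y = 2 * X + rng.normal(0, 1, 500)

# Definition: rho = E[(X - mu_X)(Y - mu_Y)] / (sigma_X * sigma_Y)
cov_xy = numpy.mean((X - X.mean()) * (Y - Y.mean()))
rho_manual = cov_xy / (X.std() * Y.std())

# The same quantity from SciPy and NumPy
rho_scipy = scipy.stats.pearsonr(X, Y)[0]
rho_numpy = numpy.corrcoef(X, Y)[0, 1]

print(rho_manual, rho_scipy, rho_numpy)
```

The three values should agree to within floating-point error, since the normalisation factors cancel in the ratio.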


Measuring linear correlation

The linear correlation coefficient ρ quantifies the strength and direction of the joint movement of two random variables:

▷ sign of ρ determines the relative directions that the variables move in

▷ value determines strength of the relative movements (ranging from -1 to +1)

▷ ρ = 0.5: when one variable moves by one standard deviation, the other moves on average in the same direction by half a standard deviation

▷ ρ = 0: variables are uncorrelated

• does not imply that they are independent!
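To illustrate the last point, here is a small sketch in which Y is completely determined by X, yet the linear correlation coefficient is essentially zero. The choice of Y = X² on a symmetric grid is an assumption made for this example.

```python
import numpy
import scipy.stats

X = numpy.linspace(-1, 1, 201)   # values symmetric around zero
Y = X ** 2                       # Y depends on X, but not linearly

r = scipy.stats.pearsonr(X, Y)[0]
print(f"Pearson r = {r:.3f}")    # numerically indistinguishable from zero
```

The positive and negative halves of X contribute equal and opposite amounts to the covariance, so they cancel even though the dependence is perfect.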


Examples of correlations

Source: Wikipedia

correlation ≠ dependency


Online visualization: interpreting correlations

Try it out online: http://rpsychologist.com/d3/correlation/


Not all relations are linear!

▷ Example: Yerkes–Dodson law

• empirical relationship between level of arousal/stress and level of performance

▷ Performance initially increases with stress/arousal

▷ Beyond a certain level of stress, performance decreases

Source: http://en.wikipedia.org/wiki/YerkesDodson_law


Measuring correlation with NumPy

In [3]: import numpy

import matplotlib.pyplot as plt

import scipy.stats

In [4]: X = numpy.random.normal(10, 1, 100)

Y = X + numpy.random.normal(0, 0.3, 100)

plt.scatter(X, Y)

Out[4]: <matplotlib.collections.PathCollection at 0x7f7443e3c438>

In [5]: scipy.stats.pearsonr(X, Y)

Out[5]: (0.9560266103379802, 5.2241043747083435e-54)

Exercise: show that when the error in Y decreases, the correlation coefficient increases

Exercise: produce data and a plot with a negative correlation coefficient
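One possible way to approach both exercises is sketched below. The noise levels, sample size and seed are arbitrary assumptions, and other solutions are equally valid.

```python
import numpy
import scipy.stats

rng = numpy.random.default_rng(0)
X = rng.normal(10, 1, 1000)

# Exercise 1: shrink the noise added to Y and watch the coefficient grow
correlations = {}
for noise_sd in (1.0, 0.3, 0.1):
    Y = X + rng.normal(0, noise_sd, X.size)
    correlations[noise_sd] = scipy.stats.pearsonr(X, Y)[0]
    print(f"noise sd = {noise_sd}: r = {correlations[noise_sd]:.3f}")

# Exercise 2: a negative slope gives a negative coefficient
Y_neg = -X + rng.normal(0, 0.3, X.size)
r_neg = scipy.stats.pearsonr(X, Y_neg)[0]
print(f"negative relationship: r = {r_neg:.3f}")
```

Calling plt.scatter(X, Y_neg) followed by plt.show() would produce the plot asked for in the second exercise.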


Anscombe’s quartet

[Figure: scatter plots of the four datasets, panels I, II, III and IV]

Four datasets proposed by Francis Anscombe to illustrate the importance of graphing data rather than relying blindly on summary statistics

each dataset has the same correlation coefficient!
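The claim can be checked numerically. The sketch below uses the values commonly published for Anscombe's quartet, typed in by hand here rather than taken from the slides, so verify them against a reference copy before relying on them.

```python
import numpy
import scipy.stats

# Anscombe's quartet as commonly published (typed in by hand: verify before reuse)
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

results = {}
for name, (x, y) in quartet.items():
    results[name] = scipy.stats.pearsonr(x, y)[0]
    print(f"{name}: mean(y) = {numpy.mean(y):.2f}, r = {results[name]:.3f}")
```

The coefficients agree to about two decimal places even though the scatter plots look nothing alike, which is the point of the quartet.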


Plotting relationships between variables with matplotlib

▷ Scatterplot: use function plt.scatter

▷ Continuous plot or X-Y: function plt.plot

import matplotlib.pyplot as plt

import numpy

x = numpy.linspace(1, 15, 100)

plt.plot(x, numpy.sin(x))

plt.show()

[Figure: plot of sin(x) for x between 1 and 15]


Aside: polio caused by ice cream!

▷ Polio: an infectious disease causing paralysis, which primarily affects young children

▷ Largely eliminated today but was once a worldwide concern

▷ Late 1940s: public health experts in USA noted that the incidence of polio increased with the consumption of ice cream

▷ Some suspected that ice cream caused polio… sales plummeted

▷ Polio incidence increases in hot summer weather, which is also when ice cream consumption rises

▷ Correlation is not causation: there may be a hidden, underlying variable

• but it sure is a hint! [Edward Tufte]

More info: Freakonomics, Steven Levitt and Stephen J. Dubner


Aside: fire fighters and fire damage

▷ Statistical fact: the larger the number of fire-fighters attending the scene, the worse the damage!

▷ More fire fighters are sent to larger fires

▷ Larger fires lead to more damage

▷ Lurking (underlying) variable = fire size

▷ An instance of “Simpson’s paradox”
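The effect of a lurking variable can be reproduced with a small simulation. Everything below is synthetic: the variable names, coefficients and noise levels are assumptions invented for illustration, and removing the linear effect of fire size (a partial correlation) is one simple way of controlling for it, not a method taken from the slides.

```python
import numpy
import scipy.stats

rng = numpy.random.default_rng(1)
n = 2000

# Hypothetical model: fire size drives both quantities, which do not
# influence each other directly
fire_size = rng.uniform(1, 10, n)
firefighters = 3 * fire_size + rng.normal(0, 2, n)
damage = 10 * fire_size + rng.normal(0, 8, n)

# Raw correlation between firefighters and damage is strong
r_raw = scipy.stats.pearsonr(firefighters, damage)[0]

def residuals(y, x):
    """Part of y left over after a straight-line fit on x."""
    slope, intercept = numpy.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlation once the linear effect of fire size is removed from both
r_partial = scipy.stats.pearsonr(residuals(firefighters, fire_size),
                                 residuals(damage, fire_size))[0]

print(f"raw r = {r_raw:.3f}, controlling for fire size r = {r_partial:.3f}")
```

In this toy model the strong raw correlation largely disappears once fire size is accounted for.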


Aside: low birth weight babies of tobacco smoking mothers

▷ Statistical fact: low birth-weight children born to smoking mothers have a lower infant mortality rate than the low birth weight children of non-smokers

▷ In a given population, low birth weight babies have a significantly higher mortality rate than others

▷ Babies of smoking mothers are more likely to be of low birth weight than babies of non-smoking mothers

▷ Babies underweight because of smoking still have a lower mortality rate than children who have other, more severe, medical reasons why they are born underweight

▷ Lurking variable between smoking, birth weight and infant mortality

Source: Wilcox, A. (2001). On the importance — and the unimportance — of birthweight, International Journal of Epidemiology.30:1233–1241


Aside: cheese causes death by bedsheet strangulation

Note: real data!

Source: http://www.tylervigen.com/, with many more surprising correlations


Aside: correlation is not causation

Source: http://xkcd.com/552/ (CC BY-NC licence)


Directionality of effect problem

aggressive behaviour → watching violent films

watching violent films → aggressive behaviour

Do aggressive children prefer violent TV programmes, or do violent programmes promote violent behaviour?


Analyzing data: wind speed

▷ Import wind speed data for Toulouse airport

▷ Find the mean of the distribution

▷ Plot a histogram of the data

▷ Does the data seem to follow a normal distribution?

• use a Q-Q plot to check

▷ Check whether a Weibull distribution fits better

▷ Predict the highest wind speed expected in a 10-year interval
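The steps above could be sketched as follows. The Toulouse data is not reproduced here, so the example draws synthetic "daily wind speeds" from a Weibull distribution. The shape, scale, sample size and the simple quantile-based estimate of the 10-year level (which assumes independent daily observations) are all assumptions for illustration, not the author's analysis.

```python
import numpy
import scipy.stats

# Synthetic stand-in for daily wind speed observations (assumed parameters)
rng = numpy.random.default_rng(2013)
wind = 15 * rng.weibull(1.5, 2000)

print("mean wind speed:", wind.mean())

# Q-Q comparison against a normal distribution; r is the correlation
# between ordered data and theoretical quantiles
(_, _), (_, _, r_norm) = scipy.stats.probplot(wind, dist="norm")

# Fit a Weibull distribution with the location fixed at zero, then
# repeat the Q-Q comparison against the fitted distribution
shape, loc, scale = scipy.stats.weibull_min.fit(wind, floc=0)
(_, _), (_, _, r_weib) = scipy.stats.probplot(
    wind, dist=scipy.stats.weibull_min, sparams=(shape, loc, scale))

print(f"R2 normal = {r_norm**2:.4f}, R2 Weibull = {r_weib**2:.4f}")

# Level exceeded on average once in 10 years of daily observations
p_exceed = 1 / (10 * 365)
ten_year_level = scipy.stats.weibull_min.ppf(1 - p_exceed, shape,
                                             loc=loc, scale=scale)
print("estimated 10-year level:", ten_year_level)
```

Passing plot=plt to scipy.stats.probplot draws the corresponding Q-Q plot, and plt.hist(wind) gives the histogram.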

[Figure: histogram, "TLS wind speed in 2013"]

[Figure: "TLS wind speed qqnorm-plot", Ordered Values against Quantiles, R² = 0.9645]

[Figure: "TLS wind speed qqweibull plot", Ordered Values against Quantiles, R² = 0.9850]

Data downloaded from http://www.wunderground.com/history/airport/LFBO/


For more information

▷ SciPy lecture notes: https://scipy-lectures.github.io/

▷ Analysis of the “pay for performance” principle (correlation between a CEO’s pay and their job performance, as measured by the stock market): http://freakonometrics.hypotheses.org/15999

This presentation is distributed under the terms of the Creative Commons Attribution – Share Alike licence.


For more free course materials on risk engineering, visit http://risk-engineering.org/
