applied statistics ii

Applied Statistics Vincent JEANNIN – ESGF 4IFM

Q1 2012

Summary of the session (est. 4.5h) • R Step by Step • Reminders of last session • The Value at Risk • OLS & Exploration

R Step by Step

http://www.r-project.org/

Downloadable for free (open source)

Main screen

Menu: File / New Script

Step 1, upload your data

Excel CSV file easy to import

Path C:\Users\vin\Desktop

DATA<-read.csv(file="C:/Users/vin/Desktop/DataFile.csv",header=T)

Note: 4 columns with headers

Run your instruction(s)

You can call variables anytime you want

summary(DATA) shows a quick summary of the distribution of all variables

      SPX              SPXr                AMEXr            AMEX
 Min.   : 86.43   Min.   :-0.0666344   Min.   : 97.6   Min.   :-0.0883287
 1st Qu.: 95.70   1st Qu.:-0.0069082   1st Qu.:104.7   1st Qu.:-0.0094580
 Median :100.79   Median : 0.0010016   Median :108.8   Median : 0.0013007
 Mean   : 99.67   Mean   : 0.0001249   Mean   :109.4   Mean   : 0.0005891
 3rd Qu.:103.75   3rd Qu.: 0.0075235   3rd Qu.:114.1   3rd Qu.: 0.0102923
 Max.   :107.21   Max.   : 0.0474068   Max.   :123.5   Max.   : 0.0710967

summary(DATA$SPX) shows a quick summary of the distribution of one variable

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  86.43   95.70  100.80   99.67  103.80  107.20

Be careful with min(DATA) and max(DATA): they treat the whole of DATA as one variable

> min(DATA)
[1] -0.08832874
> max(DATA)
[1] 123.4793

Mean & SD

> sd(DATA)
       SPX       SPXr      AMEXr       AMEX
4.92763551 0.01468776 6.03035318 0.01915489
> mean(DATA)
         SPX         SPXr        AMEXr         AMEX
9.967046e+01 1.249283e-04 1.093951e+02 5.890780e-04
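The transcript above comes from a 2012 R session. As far as I know, recent R versions no longer accept a whole data frame in mean() (it returns NA with a warning) or in sd() (it raises an error), so a column-wise call is the safer route. A minimal sketch, using a small made-up data frame as a stand-in for DATA (the values are hypothetical, not the course data):

```r
# Hypothetical stand-in for DATA: two numeric columns with made-up values
DATA <- data.frame(SPX  = c(100, 101, 99, 102),
                   SPXr = c(0.010, -0.020, 0.030, 0.000))

# sapply() applies the function to each column and returns a named vector
col_means <- sapply(DATA, mean)
col_sds   <- sapply(DATA, sd)

# Extremes over the whole table, i.e. what min(DATA) and max(DATA) report
overall_min <- min(sapply(DATA, min))
overall_max <- max(sapply(DATA, max))

print(col_means)
print(col_sds)
```

colMeans(DATA) would be an alternative for the means when every column is numeric.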

A histogram is easy to draw

hist(DATA$SPXr, breaks=25, main="Distribution of SPXr", ylab="Freq", xlab="SPXr", col="blue")

Obvious excess kurtosis, obvious asymmetry

Skewness and kurtosis functions don’t exist directly in base R…

However, some VNPs (Very Nice Programmers) built and shared an add-in

Package Moments

Menu: Packages / Install Package(s)

• Choose whatever mirror (server) you want • Usually France (Toulouse) is very good as it’s a university server with all the packages available

Once installed, you can load the package with either of the following instructions:

require(moments)

library(moments)

New functions can now be used!

> require(moments)

> library(moments)

> skewness(DATA)

SPX SPXr AMEXr AMEX

-0.6358029 -0.4178701 0.1876994 -0.2453693

> kurtosis(DATA)

SPX SPXr AMEXr AMEX

2.411177 5.671254 2.078366 5.770583

By the way, you can store any result in a variable

> Kur<-kurtosis(DATA$SPXr)
> Kur
[1] 5.671254

Call the help! help(kurtosis)

The help page reminds you of the package, the syntax and the definition of the arguments

Let’s store a few values

SPMean<-mean(DATA$SPXr)

SPSD<-sd(DATA$SPXr) # package stats

Build a sequence, the x axis

x<-seq(from=SPMean-4*SPSD,to=SPMean+4*SPSD,length=500)

Build a normal density on these x

y1<-dnorm(x,mean=SPMean,sd=SPSD) # package stats

Display the histogram

hist(DATA$SPXr, breaks=25, main="S&P Returns / Normal Distribution", xlab="Returns", ylab="Occurrences", col="blue")

Display on top of it the normal density

lines(x,y1,type="l",lwd=3,col="red") # package graphics

Positive Excess Kurtosis & Negative Skew

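Let’s build a spread

Spd<-DATA$SPXr-DATA$AMEX

What is the mean?

Mean is linear: $E(aX + bY) = aE(X) + bE(Y)$

$E(X - Y) = E(X) - E(Y)$

Let’s verify (the result should be 0, up to rounding)

> mean(DATA$SPXr)-mean(DATA$AMEX)-mean(Spd)

What is the standard deviation?

Is standard deviation linear? NO! $\mathrm{Var}(aX + bY) = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y) + 2ab\,\mathrm{Cov}(X, Y)$

With $a = 1$ and $b = -1$: $\mathrm{Var}(X - Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) - 2\,\mathrm{Cov}(X, Y)$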

> (var(DATA$SPXr)+var(DATA$AMEX)-2*cov(DATA$SPXr,DATA$AMEX))^0.5

[1] 0.01019212

> sd(Spd)

[1] 0.01019212

Let’s show the implication in a proper manner

Let’s create a portfolio containing half of each stock

Portf<-0.5*DATA$SPXr+0.5*DATA$AMEX

plot(sd(DATA$SPXr),mean(DATA$SPXr),col="blue",ylim=c(0,0.0008),xlim=c(0.012,0.022),ylab="Return",xlab="Vol")

points(sd(DATA$AMEX),mean(DATA$AMEX),col="red")

points(sd(Portf),mean(Portf),col="green")

The efficient frontier

points(sd(0.1*DATA$SPXr+0.9*DATA$AMEX),mean(0.1*DATA$SPXr+0.9*DATA$AMEX),col="green")

plot(DATA$SPXr,DATA$AMEX) # x = S&P returns, y = AmEx returns, matching the regression below

abline(lm(DATA$AMEX~DATA$SPXr), col="blue")

LM stands for Linear Models

> lm(DATA$AMEX~DATA$SPXr)

Call:
lm(formula = DATA$AMEX ~ DATA$SPXr)

Coefficients:
(Intercept)    DATA$SPXr
  0.0004505    1.1096287

$y = 1.1096x + 0.045\%$

Will be used later for linear regression and hedging

Do you remember which distribution is the most platykurtic in nature?

Toss a coin: Head = Success = 1 / Tail = Failure = 0

> require(moments)

Loading required package: moments

> library(moments)

> toss<-rbinom(100,1,0.5)

> mean(toss)

[1] 0.52

> kurtosis(toss)

[1] 1.006410

> kurtosis(toss)-3

[1] -1.993590

> hist(toss, breaks=10, main="Tossing a coin 100 times", xlab="Result of the trial", ylab="Occurrence")

> sum(toss)

[1] 52

Let’s test the fairness

100 tosses… More would cause memory issues…

Density of r, the probability of head, given h heads and t tails out of N = h + t tosses (binomial likelihood):

$f(r \mid H = h, T = t) = \frac{(N + 1)!}{h!\,t!}\, r^h (1 - r)^t$

Let’s plot this density with $h = 52$, $t = 48$, $N = 100$

h<-52
t<-48
N<-100
r<-seq(0,1,length=500)
y<-(factorial(N+1)/(factorial(h)*factorial(t)))*r^h*(1-r)^t
plot(r,y,type="l",col="red",main="Probability density of r given 52 heads out of 100 flips")

If the probability between 45% and 55% is significant we’ll accept the fairness
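To put a number on "the probability between 45% and 55%", the density above can be integrated over that interval. The sketch below relies on the fact that $\frac{(N+1)!}{h!\,t!} r^h (1-r)^t$ is the Beta(h + 1, t + 1) density, so base R's pbeta() gives the integral directly. The helper name prob_between is hypothetical, and the threshold for "significant" is left to the reader.

```r
# Probability that r lies in [lower, upper] given h heads and t tails
prob_between <- function(h, t, lower = 0.45, upper = 0.55) {
  # (N+1)!/(h! t!) r^h (1-r)^t is the Beta(h + 1, t + 1) density,
  # so the integral over [lower, upper] is a difference of Beta CDFs
  pbeta(upper, h + 1, t + 1) - pbeta(lower, h + 1, t + 1)
}

p_fair  <- prob_between(52, 48)   # coin with 52 heads out of 100
p_trick <- prob_between(72, 28)   # coin with 72 heads out of 100

# Cross-check: integrate the density written with factorials numerically
N <- 100; h <- 52; t <- 48
dens <- function(r) (factorial(N + 1) / (factorial(h) * factorial(t))) * r^h * (1 - r)^t
p_check <- integrate(dens, 0.45, 0.55)$value

print(c(fair = p_fair, trick = p_trick, check = p_check))
```

Using pbeta() also avoids computing large factorials explicitly.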

What do you think?

What is the problem with this coin?

Toss it! Head = Success = 1 / Tail = Failure = 0

> require(moments)

Loading required package: moments

> library(moments)

> toss<-rbinom(100,1,0.7)

> mean(toss)

[1] 0.72

> kurtosis(toss)

[1] 1.960317

> kurtosis(toss)-3

[1] -1.039683

> hist(toss, breaks=10, main="Tossing a coin 100 times", xlab="Result of the trial", ylab="Occurrence")

> sum(toss)

[1] 72

Let’s test the fairness (assuming you don’t know it’s a trick coin)

100 tosses

h<-72
t<-28
N<-100
r<-seq(0.2,0.8,length=500)
y<-(factorial(N+1)/(factorial(h)*factorial(t)))*r^h*(1-r)^t
plot(r,y,type="l",col="red",main="Probability density of r given 72 heads out of 100 flips")

If the probability between 45% and 55% is significant we’ll accept the fairness

Obvious fake! (The coin was simulated with a probability of head of 0.7)

Trick coin!

Reminders of last session

Snapshot, 4 moments: Mean, Variance, Skewness, Kurtosis

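Normal Distribution

$P(X \le \mu) = 0.5$

$P(X \le \mu - \sigma) = 0.159$

$P(X \le \mu - 2\sigma) = 0.023$

$P(X \le \mu - 3\sigma) = 0.001$

$P(\mu - \sigma \le X \le \mu + \sigma) = 0.683$

$P(\mu - 2\sigma \le X \le \mu + 2\sigma) = 0.954$

$P(\mu - 3\sigma \le X \le \mu + 3\sigma) = 0.997$

$P(X \le \mu - 1.645\sigma) = 0.05$

$P(X \le \mu - 2.326\sigma) = 0.01$

Density: $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$

Notation: $N(\mu, \sigma)$

$P(X \le x) = \Phi(x) = \int_{-\infty}^{x} f(u)\,du$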

Let X~N(1,1.5). Find $P(X \le 4.75)$

$P(X \le 4.75) = P\left(Y \le \frac{4.75 - 1}{1.5}\right)$ with Y~N(0,1)

$P(Y \le 2.5) = ?$

Use the table!

$P(Y \le -2.5) = 0.0062$

$P(Y \le 2.5) = 1 - 0.0062 = 0.9938$

$P(X \le 4.75) = 0.9938$
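The table lookup can be checked with pnorm() from the stats package. A minimal sketch, reading the second parameter of N(1, 1.5) as a standard deviation, as the standardisation above does:

```r
# X ~ N(mean = 1, sd = 1.5): direct computation of P(X <= 4.75)
p_direct <- pnorm(4.75, mean = 1, sd = 1.5)

# Same probability after standardising to Y ~ N(0, 1)
z <- (4.75 - 1) / 1.5
p_std <- pnorm(z)

# Left tail used in the table lookup: P(Y <= -z) = 1 - P(Y <= z)
p_left <- pnorm(-z)

print(round(c(direct = p_direct, standardised = p_std, left_tail = p_left), 4))
```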

QQ Plot

> qqnorm(FCOJ$V1)

> qqline(FCOJ$V1)

Fat tails

Geometric Brownian Motion

Based on the Stochastic Differential Equation $dS_t = \mu S_t\,dt + \sigma S_t\,dW_t$

Discrete form: $dS_t = \mu S_t\,dt + \sigma S_t \sqrt{dt}\,\varepsilon$ with $\varepsilon \sim N(0,1)$

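$u = e^{\sigma\sqrt{t}}$

$d = \frac{1}{u} = e^{-\sigma\sqrt{t}}$

$S e^{rt} = pSu + (1 - p)Sd$, hence $e^{rt} = pu + (1 - p)d$

$p = \frac{e^{rt} - d}{u - d}$

$BV = \left(OpUp \cdot p + OpDown \cdot (1 - p)\right) e^{-rt}$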

Greeks Approximation – Taylor Development

$C(S + dS) \approx C + \Delta\,dS + \frac{1}{2}\gamma\,dS^2 + \frac{1}{6}\,Speed\,dS^3 + \frac{1}{24}\,Greek_{4th}\,dS^4 + \dots$

The Value at Risk

Estimate, with a specific confidence level (usually 95% or 99%), the worst possible loss. In other words, the point is to identify a particular point in the left tail of the distribution

3 Methods

• Historical • Parametric • Monte-Carlo

For now, we’ll focus on VaR on one linear asset… FCOJ is back!

Historical VaR

• No assumption about the distribution • Easy to implement and calculate • Sensitive to the length of the history • Sensitive to very extreme values

Let’s get back to our FCOJ time series, last price is 150 cents

If we work on returns, we’ve seen the use of the PERCENTILE Excel function

• 1% Percentile is -5.22%, 99% Historical Daily VaR is -7.83 cents • 5% Percentile is -3.34%, 95% Historical Daily VaR is -5.00 cents

Works as well on weekly, monthly, quarterly series
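In R, quantile() plays the role of Excel's PERCENTILE. The sketch below is illustrative only: the returns are simulated stand-ins (hypothetical, not the FCOJ series), and only the 150 cents spot price comes from the text.

```r
set.seed(1)

# Hypothetical daily returns standing in for the FCOJ return series
returns <- rnorm(1000, mean = 0.001, sd = 0.02)
spot <- 150  # last price in cents, as in the text

# 1% and 5% percentiles of the return distribution
pct <- quantile(returns, probs = c(0.01, 0.05))

# Historical daily VaR expressed in cents (negative numbers are losses)
hist_var <- spot * pct

print(pct)
print(hist_var)
```

With the real series, returns would be replaced by the column of observed returns, for example FCOJ$V1.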

It can also be worked with price variations instead of returns, but it is then sensitive to the price level! So beware of the bias.

• 1% Percentile in terms of price movement is -8.11 cents • 5% Percentile in terms of price movement is -4.14 cents

Parametric VaR

• Easy to implement and calculate • Assumes a particular shape of the distribution • Not really sensitive to fat tails

FCOJ Mean Return: 0.1364%

FCOJ SD: 2.1664%

We already know:

$P(X \le -1.645\sigma + \mu) = 0.05$

$P(X \le -2.326\sigma + \mu) = 0.01$

Therefore:

$P(X \le -3.43\%) = 0.05$, VaR 95% (-5.15 cents)

$P(X \le -4.90\%) = 0.01$, VaR 99% (-7.35 cents)

Very often you assume a 0 mean anyway, therefore:

$P(X \le -3.57\%) = 0.05$, VaR 95% (-5.36 cents)

$P(X \le -5.04\%) = 0.01$, VaR 99% (-7.56 cents)

At 99%, lower values than the historical VaR

Problem with leptokurtic distributions: the method does not capture the impact of fat tails
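The parametric figures can be reproduced with qnorm(), which returns the normal quantiles behind the 1.645 and 2.326 multipliers. A minimal sketch using the mean, standard deviation and spot price quoted above; the variable names are illustrative.

```r
mu    <- 0.001364   # FCOJ mean daily return (0.1364%)
sigma <- 0.021664   # FCOJ daily standard deviation (2.1664%)
spot  <- 150        # last price in cents

# Standard normal quantiles at 5% and 1% (about -1.645 and -2.326)
z <- qnorm(c(0.05, 0.01))

# Return thresholds and VaR in cents, with the mean and with a zero mean
var_with_mean <- spot * (mu + z * sigma)
var_zero_mean <- spot * (z * sigma)

print(round(rbind(with_mean = var_with_mean, zero_mean = var_zero_mean), 2))
```

Small differences from the figures on the slide can come from rounding the multipliers or the inputs.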

Monte Carlo VaR

Based on an assumption of a price process (for example GBM)

• Most efficient method when assets aren’t linear • Tough to implement • Assumes a particular shape of the distribution

A great number of random simulations of the price process builds a distribution from which the VaR is read

Let’s simulate 10,000 GBM paths of 252 steps and store the final result

library(sde)

FCOJ<-read.csv(file="C:/Users/Vinz/Desktop/FCOJStats.csv",header=FALSE,sep=",")

Drift<-mean(FCOJ$V1)
Volat<-sd(FCOJ$V1)
nbsim<-252
Spot<-150
Final<-rep(1,10000)
for(i in 1:10000){
Matr<-GBM(x=Spot,r=Drift, sigma=Volat,N=nbsim)
Final[i]<-Matr[nbsim+1]}
quantile(Final, 0.05)
quantile(Final, 0.01)

Don’t be fooled by the 252, we’re still making a daily simulation: what would you change in the code to make it yearly?
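For readers without the sde package, the same idea can be sketched in base R. This is not the GBM() call used above: it swaps in the closed-form terminal value of a geometric Brownian motion, $S_T = S_0 \exp\left((\mu - \sigma^2/2)T + \sigma\sqrt{T}\,Z\right)$, so no intermediate steps are simulated. The drift and volatility are the FCOJ figures quoted earlier in the session; the variable names and the seed are arbitrary.

```r
set.seed(42)

drift   <- 0.001364   # daily mean return quoted for FCOJ
volat   <- 0.021664   # daily standard deviation quoted for FCOJ
spot    <- 150        # last price in cents
horizon <- 1          # one day, in the same time unit as drift and volat
n_sims  <- 10000

# Terminal prices drawn from the exact lognormal solution of the GBM
z <- rnorm(n_sims)
final <- spot * exp((drift - 0.5 * volat^2) * horizon +
                    volat * sqrt(horizon) * z)

# 5% and 1% quantiles of the simulated prices, and the VaR in cents
q <- quantile(final, c(0.05, 0.01))
mc_var <- q - spot

print(q)
print(mc_var)
```

Changing horizon is one way to move from a daily to a longer-horizon simulation, provided drift and volat stay expressed per day.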

> quantile(Final, 0.05)
144.93
> quantile(Final, 0.01)
142.7941

• 95% Daily VaR is -5.07 cents • 99% Daily VaR is -7.21 cents

Let’s take off the drift

> quantile(Final, 0.05)
144.7583
> quantile(Final, 0.01)
142.6412

• 95% Daily VaR is -5.24 cents • 99% Daily VaR is -7.36 cents

Comparison

Which is the best?

Going forward on the VaR

All methods give different but consistent values

Easy? Yes but…

• We’ve involved one asset only • We’ve involved a linear asset

What about an option?

What about 2 assets?

Portfolio scale: what to look at to calculate the VaR?

Big question, is the VaR additive?

NO! Keywords for the future: covariance, correlation, diversification
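A quick numerical illustration of why VaR is not additive, under a normality assumption. All inputs below (volatilities, correlation, weights) are hypothetical and chosen only for illustration; the portfolio variance uses the same formula as $\mathrm{Var}(aX + bY)$ seen earlier in the session.

```r
# Hypothetical daily volatilities and correlation of two assets
sd_x <- 0.015
sd_y <- 0.019
rho  <- 0.85
w_x  <- 0.5   # portfolio weights
w_y  <- 0.5

z <- qnorm(0.05)   # 95% one-sided normal quantile (negative)

# Stand-alone 95% parametric VaRs of each position, zero mean, as positive losses
VaR_x <- -z * w_x * sd_x
VaR_y <- -z * w_y * sd_y

# Portfolio volatility includes the covariance term 2ab Cov(X, Y)
port_sd  <- sqrt((w_x * sd_x)^2 + (w_y * sd_y)^2 +
                 2 * w_x * w_y * rho * sd_x * sd_y)
VaR_port <- -z * port_sd

print(c(sum_of_VaRs = VaR_x + VaR_y, portfolio_VaR = VaR_port))
```

In this setting the two numbers coincide only when the correlation equals 1; any lower correlation gives a diversification benefit.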

Options: what to look at to calculate the VaR?

4 risk factors: • Underlying price • Interest rate • Volatility • Time

4 answers: • Delta/Gamma approximation knowing the distribution of the underlying • Rho approximation knowing the distribution of the underlying rate • Vega approximation knowing the distribution of implied volatility • Theta (time decay)

Yes, but… do the underlying price, rate and volatility vary independently?

Might be a bit more complicated than expected…

OLS & Exploration

Linear regression model

Minimize the sum of the squared vertical distances between the observations and the linear approximation

$y = f(x) = ax + b$

Residual ε

OLS: Ordinary Least Squares

Two parameters to estimate: • Intercept b (often written α) • Slope a (often written β)

Minimising residuals

$E = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \left(y_i - (a x_i + b)\right)^2$

When is E minimal?

When the partial derivatives with respect to a and b are 0

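$E = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \left(y_i - (a x_i + b)\right)^2 = \sum_{i=1}^{n} (y_i - a x_i - b)^2$

Quick high school reminder if necessary…

$(y_i - a x_i - b)^2 = y_i^2 - 2a x_i y_i - 2b y_i + a^2 x_i^2 + 2ab x_i + b^2$

Derivative with respect to a:

$\frac{\partial E}{\partial a} = \sum_{i=1}^{n} \left(-2 x_i y_i + 2a x_i^2 + 2b x_i\right) = 0$

$\sum_{i=1}^{n} \left(-x_i y_i + a x_i^2 + b x_i\right) = 0$

$a \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} x_i y_i$

Derivative with respect to b:

$\frac{\partial E}{\partial b} = \sum_{i=1}^{n} \left(-2 y_i + 2b + 2a x_i\right) = 0$

$\sum_{i=1}^{n} \left(-y_i + b + a x_i\right) = 0$

$a \sum_{i=1}^{n} x_i + nb = \sum_{i=1}^{n} y_i$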

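Leads easily to the intercept

From $\frac{\partial E}{\partial b} = 0$, with $\sum_{i=1}^{n} x_i = n\bar{x}$ and $\sum_{i=1}^{n} y_i = n\bar{y}$:

$a n \bar{x} + n b = n \bar{y}$

$a\bar{x} + b = \bar{y}$

The regression line goes through $(\bar{x}, \bar{y})$

The distance of this point to the line is indeed 0

$b = \bar{y} - a\bar{x}$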

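Substituting $b = \bar{y} - a\bar{x}$ into $y = ax + b$:

$y = ax + \bar{y} - a\bar{x}$

$y - \bar{y} = a(x - \bar{x})$

From $\frac{\partial E}{\partial a} = \sum_{i=1}^{n} \left(-2 x_i y_i + 2a x_i^2 + 2b x_i\right) = 0$:

$\sum_{i=1}^{n} x_i (y_i - a x_i - b) = 0$

$\sum_{i=1}^{n} x_i (y_i - a x_i - \bar{y} + a\bar{x}) = 0$

$\sum_{i=1}^{n} x_i \left((y_i - \bar{y}) - a(x_i - \bar{x})\right) = 0$

From $\frac{\partial E}{\partial b} = \sum_{i=1}^{n} \left(-2 y_i + 2b + 2a x_i\right) = 0$:

$\sum_{i=1}^{n} (y_i - b - a x_i) = 0$

$\sum_{i=1}^{n} (y_i - \bar{y} + a\bar{x} - a x_i) = 0$

$\sum_{i=1}^{n} \left((y_i - \bar{y}) - a(x_i - \bar{x})\right) = 0$

Multiplying by the constant $\bar{x}$:

$\sum_{i=1}^{n} \bar{x} \left((y_i - \bar{y}) - a(x_i - \bar{x})\right) = 0$

Subtracting this sum from the one obtained from $\frac{\partial E}{\partial a}$:

$\sum_{i=1}^{n} x_i \left((y_i - \bar{y}) - a(x_i - \bar{x})\right) - \sum_{i=1}^{n} \bar{x} \left((y_i - \bar{y}) - a(x_i - \bar{x})\right) = 0$

$\sum_{i=1}^{n} (x_i - \bar{x}) \left((y_i - \bar{y}) - a(x_i - \bar{x})\right) = 0$

$a = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$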

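Finally…

We have

$a = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$

The numerator is the covariance and the denominator the variance, both up to the same factor n

$a = \frac{Cov_{xy}}{\sigma_x^2}$

$b = \bar{y} - a\bar{x}$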

You can use Excel function INTERCEPT and SLOPE
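The closed-form slope and intercept can also be checked against R's lm(). A minimal sketch on simulated data; the data-generating line y = 2 + 0.5x plus noise is an arbitrary assumption made only for the check.

```r
set.seed(7)

# Hypothetical data: a noisy straight line
x <- rnorm(50)
y <- 2 + 0.5 * x + rnorm(50, sd = 0.3)

# Slope and intercept from the OLS formulas derived above
a <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b <- mean(y) - a * mean(x)

# Same estimates from lm(); coef() returns (Intercept), slope
fit <- coef(lm(y ~ x))

print(c(a = a, b = b))
print(fit)
```

The ratio cov(x, y) / var(x) gives the same slope, because the n - 1 denominators used by R cancel out.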

Calculate the Variances and Covariance of X{1,2,3,3,1,2} and Y{2,3,1,1,3,2}

You can use Excel function VAR.P, COVARIANCE.P and STDEV.P
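The same exercise can be done in R, with one caveat: var(), cov() and sd() divide by n - 1 (sample versions), whereas VAR.P, COVARIANCE.P and STDEV.P divide by n. A sketch that computes the population versions directly; the variable names are illustrative.

```r
X <- c(1, 2, 3, 3, 1, 2)
Y <- c(2, 3, 1, 1, 3, 2)
n <- length(X)

# Population (divide-by-n) variances and covariance, as in Excel's .P functions
var_p_x <- mean((X - mean(X))^2)
var_p_y <- mean((Y - mean(Y))^2)
cov_p   <- mean((X - mean(X)) * (Y - mean(Y)))

# R's built-ins use n - 1; rescaling by (n - 1) / n recovers the population values
var_p_x_check <- var(X) * (n - 1) / n
cov_p_check   <- cov(X, Y) * (n - 1) / n

print(c(var_x = var_p_x, var_y = var_p_y, cov_xy = cov_p))
```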

Let’s assess the quality of the regression

Let’s calculate the correlation coefficient (aka Pearson Product-Moment Correlation Coefficient, PPMCC):

$r = \frac{Cov_{xy}}{\sigma_x \sigma_y}$ Value between -1 and 1

$|r| = 1$ Perfect linear dependence

$r \approx 0$ No linear dependence

Gives an idea of the dispersion of the scatterplot

You can use Excel function CORREL

R=0.96

High quality

R=0.62

Poor quality

What is good quality?

Slightly discretionary…

$|r| \ge \frac{\sqrt{3}}{2} \approx 0.866$

It’s widely accepted as the threshold between acceptable and poor

The regression itself introduces a bias

Let’s introduce the coefficient of determination R-Squared

Total Dispersion = Dispersion Regression + Dispersion Residual

$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

$R^2 = \frac{\text{Dispersion Regression}}{\text{Total Dispersion}}$

In other words, the part of the total dispersion explained by the regression

You can use Excel function RSQ

In a simple linear regression with intercept, $R^2 = r^2$
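The decomposition and the identity $R^2 = r^2$ can be verified numerically. A minimal sketch on simulated data (the data-generating line is an arbitrary assumption); fitted() returns the values $\hat{y}_i$ on the regression line.

```r
set.seed(3)

# Hypothetical data: a noisy straight line
x <- rnorm(40)
y <- 1 + 2 * x + rnorm(40)

fit   <- lm(y ~ x)
y_hat <- fitted(fit)

# Total dispersion = dispersion of the regression + dispersion of the residuals
ss_total <- sum((y - mean(y))^2)
ss_reg   <- sum((y_hat - mean(y))^2)
ss_res   <- sum((y - y_hat)^2)

r_squared <- ss_reg / ss_total

print(c(total = ss_total, regression = ss_reg, residual = ss_res))
print(c(R2 = r_squared, r2 = cor(x, y)^2, lm_R2 = summary(fit)$r.squared))
```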

Is a good correlation coefficient and a good coefficient of determination enough to accept the regression?

Not necessarily!

Residuals need to carry no structure, in other words to be white noise!

$\bar{y} = 7.5$

$\bar{x} = 9$

$y = 3 + 0.5x$

$r = 0.82$

$R^2 = 0.67$

For every dataset of the quartet

Don’t get fooled by numbers!
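The figures above match Anscombe's quartet, which ships with R as the anscombe data frame (columns x1 to x4 and y1 to y4). Assuming that is the dataset shown on the slide, the sketch below recomputes the summary statistics for each of the four datasets.

```r
# anscombe is a built-in data frame from the datasets package
fits <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  co <- coef(lm(y ~ x))            # (Intercept), slope
  c(mean_x    = mean(x),
    mean_y    = mean(y),
    intercept = unname(co[1]),
    slope     = unname(co[2]),
    r         = cor(x, y),
    R2        = cor(x, y)^2)
})
colnames(fits) <- paste0("set", 1:4)

print(round(fits, 2))

# Plotting each pair is what reveals how different the four datasets are
# for (i in 1:4) plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
```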

Can you say at this stage which regression is the best?

Certainly not those on the right: you need a LINEAR dependence

Is linear regression useless in those cases?

Think about what you could do to the series

Polynomial transformation, log transformation,…

Otherwise, non-linear regressions, but that’s another story

First application on financial market

S&P / AmEx in 2011

$R_{AmEx} = 0.06\% + 1.1046 \times R_{S\&P}$

$r = \frac{Cov_{AmEx,S\&P}}{\sigma_{AmEx}\,\sigma_{S\&P}} = 0.8501$

$R^2 = r^2 = 0.7227$

Oops :-o

Is Excel wrong?

R-Squared has different calculation methods

Let’s accept the following regression then as the quality seems pretty good

How to use this?

• Forecasting? Not really… Both are random variables

• Hedging? Yes, but beware of basis risk and of the residuals…

Let’s have a try!

In theory, what is the daily result of the hedge? 𝑎

Hedging $1.0M of AmEx Stocks with $1.1046M of S&P

It would have been too easy… Great differences… Why?

Sensitivity to the size of the sample

Heteroscedasticity Basis Risk

The purpose was to see if the market has an effect on a particular stock

The dependence is obvious but residuals too volatile for any stable application

But beware!

We are looking for causation, not correlation!

Causation implies correlation

The converse is not true!

DON’T BE FOOLED BY PRETTY NUMBERS

Let’s prove this…

Perfect linear dependence

Excellent R-Squared

Residuals are a white noise

What’s the problem then?

Do you really think fresh lemon reduces car fatalities?

Conclusion

Normal Distribution
