
TRANSCRIPT

  • http://www.ub.edu/riskcenter

    Machine Learning & Data Science for Actuaries, with R

    Arthur Charpentier (Université de Rennes 1 & UQÀM)

    Universitat de Barcelona, April 2016.

    http://freakonometrics.hypotheses.org

    @freakonometrics 1

  • http://www.ub.edu/riskcenter

    Machine Learning & Data Science for Actuaries, with R

    Arthur Charpentier (Université de Rennes 1 & UQÀM)

    Professor, Economics Department, Univ. Rennes 1
    In charge of the Data Science for Actuaries program, IA
    Research Chair actinfo (Institut Louis Bachelier)
    (previously Actuarial Sciences at UQÀM & ENSAE Paristech, actuary in Hong Kong, IT & Stats FFSA)

    PhD in Statistics (KU Leuven), Fellow Institute of Actuaries
    MSc in Financial Mathematics (Paris Dauphine) & ENSAE
    Editor of the freakonometrics.hypotheses.org blog
    Editor of Computational Actuarial Science, CRC

    @freakonometrics 2

  • http://www.ub.edu/riskcenter

    Agenda

    0. Introduction, see slides

    1. Classification, $y \in \{0, 1\}$

    2. Regression Models, $y \in \mathbb{R}$

    3. Model Choice, Feature Selection, etc

    4. Data Visualisation & Maps

    @freakonometrics 3

    https://freakonometrics.hypotheses.org/48127

  • http://www.ub.edu/riskcenter

    Part 1. Classification, $y \in \{0, 1\}$

    @freakonometrics 4

  • http://www.ub.edu/riskcenter

    Classification?

    Example: fraud detection, automatic reading (classifying handwriting symbols), face recognition, accident occurrence, death, purchase of optional insurance cover, etc. Here $y_i \in \{0, 1\}$, $y_i \in \{-1, +1\}$ or $y_i \in \{\bullet, \bullet\}$.

    We look for a (good) predictive model here. There will be two steps:

    the score function, $s(\boldsymbol{x}) = \mathbb{P}(Y = 1|\boldsymbol{X} = \boldsymbol{x}) \in [0, 1]$

    the classification function $s(\boldsymbol{x}) \mapsto \widehat{Y} \in \{0, 1\}$.


    @freakonometrics 5

  • http://www.ub.edu/riskcenter

    Modeling a 0/1 random variable

    Myocardial infarction of patients admitted to the E.R.

    heart rate (FRCAR), heart index (INCAR), stroke index (INSYS), diastolic pressure (PRDIA), pulmonary arterial pressure (PAPUL), ventricular pressure (PVENT), lung resistance (REPUL), death or survival (PRONO)

    > myocarde = read.table("http://freakonometrics.free.fr/myocarde.csv",
    +   head=TRUE, sep=";")

    @freakonometrics 6

  • http://www.ub.edu/riskcenter

    Logistic Regression

    Assume that $\mathbb{P}(Y_i = 1) = \pi_i$, with

    $\mathrm{logit}(\pi_i) = \boldsymbol{x}_i^T\boldsymbol{\beta}$, where $\mathrm{logit}(\pi_i) = \log\left(\frac{\pi_i}{1-\pi_i}\right)$,

    or $\pi_i = \mathrm{logit}^{-1}(\boldsymbol{x}_i^T\boldsymbol{\beta}) = \frac{\exp[\boldsymbol{x}_i^T\boldsymbol{\beta}]}{1+\exp[\boldsymbol{x}_i^T\boldsymbol{\beta}]}$.

    The log-likelihood is

    $\log\mathcal{L}(\boldsymbol{\beta}) = \sum_{i=1}^n y_i\log(\pi_i) + (1-y_i)\log(1-\pi_i) = \sum_{i=1}^n y_i\log(\pi_i(\boldsymbol{\beta})) + (1-y_i)\log(1-\pi_i(\boldsymbol{\beta}))$

    and the first order conditions, solved numerically, are

    $\frac{\partial\log\mathcal{L}(\boldsymbol{\beta})}{\partial\beta_k} = \sum_{i=1}^n x_{k,i}\,[y_i - \pi_i(\boldsymbol{\beta})] = 0.$

    @freakonometrics 7

  • http://www.ub.edu/riskcenter

    Logistic Regression, Output (with R)

    > logistic <- glm(PRONO~., data=myocarde, family=binomial)
    > summary(logistic)

    Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
    (Intercept) -10.187642  11.895227  -0.856    0.392
    FRCAR         0.138178   0.114112   1.211    0.226
    INCAR        -5.862429   6.748785  -0.869    0.385
    INSYS         0.717084   0.561445   1.277    0.202
    PRDIA        -0.073668   0.291636  -0.253    0.801
    PAPUL         0.016757   0.341942   0.049    0.961
    PVENT        -0.106776   0.110550  -0.966    0.334
    REPUL        -0.003154   0.004891  -0.645    0.519

    (Dispersion parameter for binomial family taken to be 1)

    Number of Fisher Scoring iterations: 7

    @freakonometrics 8

  • http://www.ub.edu/riskcenter

    Logistic Regression, Output (with R)

    > library(VGAM)
    > mlogistic <- vglm(PRONO~., data=myocarde, family=multinomial)
    > summary(mlogistic)

    Coefficients:
                  Estimate Std. Error   z value
    (Intercept) 10.1876411 11.8941581  0.856525
    FRCAR       -0.1381781  0.1141056 -1.210967
    INCAR        5.8624289  6.7484319  0.868710
    INSYS       -0.7170840  0.5613961 -1.277323
    PRDIA        0.0736682  0.2916276  0.252610
    PAPUL       -0.0167565  0.3419255 -0.049006
    PVENT        0.1067760  0.1105456  0.965901
    REPUL        0.0031542  0.0048907  0.644939

    Name of linear predictor: log(mu[,1]/mu[,2])

    @freakonometrics 9

  • http://www.ub.edu/riskcenter

    Logistic (Multinomial) Regression

    In the Bernoulli case, $y \in \{0, 1\}$,

    $\mathbb{P}(Y=1) = \frac{e^{\boldsymbol{x}^T\boldsymbol{\beta}}}{1+e^{\boldsymbol{x}^T\boldsymbol{\beta}}} = \frac{p_1}{p_0+p_1} \propto p_1$ and $\mathbb{P}(Y=0) = \frac{1}{1+e^{\boldsymbol{x}^T\boldsymbol{\beta}}} = \frac{p_0}{p_0+p_1} \propto p_0$

    In the multinomial case, $y \in \{A, B, C\}$,

    $\mathbb{P}(X=A) = \frac{p_A}{p_A+p_B+p_C} \propto p_A$, i.e. $\mathbb{P}(X=A) = \frac{e^{\boldsymbol{x}^T\boldsymbol{\beta}_A}}{e^{\boldsymbol{x}^T\boldsymbol{\beta}_A}+e^{\boldsymbol{x}^T\boldsymbol{\beta}_B}+1}$

    $\mathbb{P}(X=B) = \frac{p_B}{p_A+p_B+p_C} \propto p_B$, i.e. $\mathbb{P}(X=B) = \frac{e^{\boldsymbol{x}^T\boldsymbol{\beta}_B}}{e^{\boldsymbol{x}^T\boldsymbol{\beta}_A}+e^{\boldsymbol{x}^T\boldsymbol{\beta}_B}+1}$

    $\mathbb{P}(X=C) = \frac{p_C}{p_A+p_B+p_C} \propto p_C$, i.e. $\mathbb{P}(X=C) = \frac{1}{e^{\boldsymbol{x}^T\boldsymbol{\beta}_A}+e^{\boldsymbol{x}^T\boldsymbol{\beta}_B}+1}$
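    A minimal sketch (not from the slides) of a three-class multinomial logit with VGAM, which the deck already uses; the three-level factor grp built from INSYS is purely illustrative.

    > library(VGAM)
    > grp = cut(myocarde$INSYS, quantile(myocarde$INSYS, c(0, 1/3, 2/3, 1)),
    +           labels=c("A","B","C"), include.lowest=TRUE)   # hypothetical 3-class outcome
    > myo3 = data.frame(myocarde, grp=grp)
    > mfit = vglm(grp ~ PVENT + REPUL, data=myo3, family=multinomial)
    > head(predict(mfit, type="response"))                    # one probability per class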

    @freakonometrics 10

  • http://www.ub.edu/riskcenter

    Logistic Regression, Numerical Issues

    The algorithm to compute $\widehat{\boldsymbol{\beta}}$ is

    1. start with some initial value $\boldsymbol{\beta}_0$

    2. define $\boldsymbol{\beta}_k = \boldsymbol{\beta}_{k-1} - H(\boldsymbol{\beta}_{k-1})^{-1}\,\nabla\log\mathcal{L}(\boldsymbol{\beta}_{k-1})$

    where $\nabla\log\mathcal{L}(\boldsymbol{\beta})$ is the gradient and $H(\boldsymbol{\beta})$ the Hessian matrix, also called Fisher's score.

    The generic term of the Hessian is

    $\frac{\partial^2\log\mathcal{L}(\boldsymbol{\beta})}{\partial\beta_k\,\partial\beta_\ell} = -\sum_{i=1}^n X_{k,i}X_{\ell,i}\,\pi_i(\boldsymbol{\beta})[1-\pi_i(\boldsymbol{\beta})]$

    Define $\boldsymbol{\Omega} = [\omega_{i,j}] = \mathrm{diag}(\widehat{\pi}_i(1-\widehat{\pi}_i))$ so that the gradient is written

    $\nabla\log\mathcal{L}(\boldsymbol{\beta}) = \frac{\partial\log\mathcal{L}(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}} = \boldsymbol{X}^T(\boldsymbol{y}-\boldsymbol{\pi})$

    @freakonometrics 11

  • http://www.ub.edu/riskcenter

    Logistic Regression, Numerical Issues

    and the Hessian

    $H(\boldsymbol{\beta}) = \frac{\partial^2\log\mathcal{L}(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}\,\partial\boldsymbol{\beta}^T} = -\boldsymbol{X}^T\boldsymbol{\Omega}\boldsymbol{X}$

    The Newton-Raphson (iteratively reweighted least squares) update is then

    $\boldsymbol{\beta}_k = (\boldsymbol{X}^T\boldsymbol{\Omega}\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{\Omega}\boldsymbol{Z}$ where $\boldsymbol{Z} = \boldsymbol{X}\boldsymbol{\beta}_{k-1} + \boldsymbol{\Omega}^{-1}(\boldsymbol{y}-\boldsymbol{\pi})$.

    From maximum likelihood properties, if $\widehat{\boldsymbol{\beta}} = \lim_{k\to\infty}\boldsymbol{\beta}_k$,

    $\sqrt{n}(\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}) \xrightarrow{\mathcal{L}} \mathcal{N}(\boldsymbol{0}, I(\boldsymbol{\beta})^{-1}).$

    From a numerical point of view, this asymptotic variance $I(\boldsymbol{\beta})^{-1}$ satisfies $I(\boldsymbol{\beta})^{-1} = -H(\boldsymbol{\beta})^{-1}$.

    @freakonometrics 12

  • http://www.ub.edu/riskcenter

    Logistic Regression, Numerical Issues

    > X = cbind(1, as.matrix(myocarde[,1:7]))
    > Y = myocarde$PRONO == "Survival"
    > beta = as.matrix(lm(Y~0+X)$coefficients, ncol=1)
    > for(s in 1:9){                               # Newton-Raphson iterations
    +   pi = exp(X%*%beta[,s])/(1+exp(X%*%beta[,s]))
    +   gradient = t(X)%*%(Y-pi)
    +   omega = matrix(0, nrow(X), nrow(X)); diag(omega) = (pi*(1-pi))
    +   Hessian = -t(X)%*%omega%*%X
    +   beta = cbind(beta, beta[,s]-solve(Hessian)%*%gradient)}
    > beta
    > -solve(Hessian)                              # asymptotic variance matrix
    > sqrt(-diag(solve(Hessian)))                  # standard errors

    @freakonometrics 13

  • http://www.ub.edu/riskcenter

    Predicted Probability

    Let $m(\boldsymbol{x}) = \mathbb{E}(Y|\boldsymbol{X}=\boldsymbol{x})$. With a logistic regression, we can get a prediction

    $\widehat{m}(\boldsymbol{x}) = \frac{\exp[\boldsymbol{x}^T\widehat{\boldsymbol{\beta}}]}{1+\exp[\boldsymbol{x}^T\widehat{\boldsymbol{\beta}}]}$

    > predict(logistic, type="response")[1:5]

    2 1 2 3 4 5

    3 0.6013894 0.1693769 0.3289560 0.8817594 0.1424219

    4 > predict ( mlogistic ,type=" response ")[1:5 ,]

    5 Death Survival

    6 1 0.3986106 0.6013894

    7 2 0.8306231 0.1693769

    8 3 0.6710440 0.3289560

    9 4 0.1182406 0.8817594

    10 5 0.8575781 0.1424219

    @freakonometrics 14

  • http://www.ub.edu/riskcenter

    Predicted Probability

    $\widehat{m}(\boldsymbol{x}) = \frac{\exp[\boldsymbol{x}^T\widehat{\boldsymbol{\beta}}]}{1+\exp[\boldsymbol{x}^T\widehat{\boldsymbol{\beta}}]} = \frac{\exp[\widehat{\beta}_0+\widehat{\beta}_1x_1+\cdots+\widehat{\beta}_kx_k]}{1+\exp[\widehat{\beta}_0+\widehat{\beta}_1x_1+\cdots+\widehat{\beta}_kx_k]}$

    use

    > predict(fit_glm, newdata = data, type="response")

    e.g.

    > GLM <- glm(PRONO ~ PVENT + REPUL, data=myocarde, family=binomial)   # reconstructed; the call was truncated in the transcript
    > pred_GLM = function(p,r){
    +   return(predict(GLM, newdata =
    +     data.frame(PVENT=p, REPUL=r), type="response"))}

    [Figure: predicted probability as a function of PVENT and REPUL]

    @freakonometrics 15

  • http://www.ub.edu/riskcenter

    Predictive Classifier

    To go from a score to a class:

    if $s(\boldsymbol{x}) > s$, then $\widehat{Y}(\boldsymbol{x}) = 1$, and if $s(\boldsymbol{x}) \leq s$, then $\widehat{Y}(\boldsymbol{x}) = 0$

    Plot the true positive rate $TP(s) = \mathbb{P}[\widehat{Y} = 1|Y = 1]$ against the false positive rate $FP(s) = \mathbb{P}[\widehat{Y} = 1|Y = 0]$ (the ROC curve)
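    A minimal sketch (not on the slide) of that curve, using the in-sample scores from the logistic fit above and a grid of thresholds; S and roc_point are names introduced here for illustration.

    > S = predict(logistic, type="response")
    > Y = (myocarde$PRONO == "Survival")
    > roc_point = function(s){
    +   c(FP = mean(S[!Y] > s), TP = mean(S[Y] > s))}   # rates at threshold s
    > roc = t(sapply(seq(0, 1, by=.01), roc_point))
    > plot(roc[,"FP"], roc[,"TP"], type="s", xlab="FP", ylab="TP")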

    @freakonometrics 16

  • http://www.ub.edu/riskcenter

    Predictive Classifier

    With a threshold (e.g. s = 50%) and the predicted probabilities, one can get a classifier and the confusion matrix

    > probabilities <- predict(logistic, type="response")
    > predictions <- c("Death", "Survival")[(probabilities > .5) + 1]
    > table(predictions, myocarde$PRONO)

    predictions Death Survival
      Death        25        3
      Survival      4       39

    @freakonometrics 17

  • http://www.ub.edu/riskcenter

    Visualization of a Classifier in Higher Dimension...

    [Figure: the 71 observations projected on the first two principal components, Dim 1 (54.26%) and Dim 2 (18.64%), colored Death/Survival, with the classifier's 0.5 level curve.]

    Point $\boldsymbol{z} = (z_1, z_2, 0, \cdots, 0) \mapsto \boldsymbol{x} = (x_1, x_2, \cdots, x_k)$.

    @freakonometrics 18

  • http://www.ub.edu/riskcenter

    ... but be careful about interpretation!

    > prediction = predict(logistic, type="response")

    Use a 25% probability threshold

    > table(prediction > .25, myocarde$PRONO)
            Death Survival
      FALSE    19        2
      TRUE     10       40

    or a 75% probability threshold

    > table(prediction > .75, myocarde$PRONO)
            Death Survival
      FALSE    27        9
      TRUE      2       33

    @freakonometrics 19

  • http://www.ub.edu/riskcenter

    Why a Logistic and not a Probit Regression?

    Bliss (1934) suggested a model such that

    $\mathbb{P}(Y=1|\boldsymbol{X}=\boldsymbol{x}) = H(\boldsymbol{x}^T\boldsymbol{\beta})$ where $H(\cdot) = \Phi(\cdot)$,

    the c.d.f. of the $\mathcal{N}(0,1)$ distribution. This is the probit model. This yields a latent model, $y_i = \boldsymbol{1}(y_i^\star > 0)$ where

    $y_i^\star = \boldsymbol{x}_i^T\boldsymbol{\beta} + \varepsilon_i$ is a non-observable score.

    In the logistic regression, we model the odds ratio,

    $\frac{\mathbb{P}(Y=1|\boldsymbol{X}=\boldsymbol{x})}{\mathbb{P}(Y\neq 1|\boldsymbol{X}=\boldsymbol{x})} = \exp[\boldsymbol{x}^T\boldsymbol{\beta}]$

    $\mathbb{P}(Y=1|\boldsymbol{X}=\boldsymbol{x}) = H(\boldsymbol{x}^T\boldsymbol{\beta})$ where $H(\cdot) = \frac{\exp[\cdot]}{1+\exp[\cdot]}$,

    which is the c.d.f. of the logistic variable, see Verhulst (1845).

    @freakonometrics 20

    http://www.sciencemag.org/content/79/2037/38http://gdz.sub.uni-goettingen.de/dms/load/img/?PPN=PPN129323640_0018&DMDID=dmdlog7

  • http://www.ub.edu/riskcenter

    k-Nearest Neighbors (a.k.a. k-NN)

    In pattern recognition, the k-Nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and regression. (Source: wikipedia).

    $\mathbb{E}[Y|\boldsymbol{X}=\boldsymbol{x}] \approx \frac{1}{k}\sum_{i:\,\|\boldsymbol{x}_i-\boldsymbol{x}\|\text{ small}} y_i$

    For k-Nearest Neighbors, the class is usually the majority vote of the k closest neighbors of x.

    > library(caret)
    > KNN <- knn3(PRONO ~ PVENT + REPUL, data=myocarde, k=10)   # k is illustrative; the original value was lost in extraction
    > pred_KNN = function(p,r){
    +   return(predict(KNN, newdata =
    +     data.frame(PVENT=p, REPUL=r), type="prob")[,2])}

    [Figure: k-NN predicted probability as a function of PVENT and REPUL]

    @freakonometrics 21

    https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm

  • http://www.ub.edu/riskcenter

    k-Nearest Neighbors

    Distance should not be sensitive to units: normalize by standard deviation (see the sketch below).
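    The R code on this slide was truncated in the transcript; a minimal sketch of that standardization, following the same knn3 pattern used later in the deck for the motor data, might look like this (sP, sR and k are illustrative names and values).

    > library(caret)
    > sP = sd(myocarde$PVENT)      # scale of PVENT
    > sR = sd(myocarde$REPUL)      # scale of REPUL
    > KNN_std = knn3(factor(PRONO) ~ I(PVENT/sP) + I(REPUL/sR),
    +                data=myocarde, k=10)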

  • http://www.ub.edu/riskcenter

    k-Nearest Neighbors and Curse of Dimensionality

    The higher the dimension, the larger the distance to the closest neighbor

    $\min_{i\in\{1,\cdots,n\}}\{d(\boldsymbol{a},\boldsymbol{x}_i)\}$, $\boldsymbol{x}_i\in\mathbb{R}^d$.

    e.g. x's drawn from $\mathcal{U}([0,1]^d)$ and $\boldsymbol{a} = \boldsymbol{0}$,

    [Figure: boxplots of the distance to the closest neighbor, by dimension (dim1-dim5), for n = 10 and n = 100]
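    A small simulation sketch (not from the slides) reproducing that experiment; min_dist is an illustrative helper.

    > min_dist = function(n, d){
    +   X = matrix(runif(n*d), n, d)            # n points uniform on [0,1]^d
    +   min(sqrt(rowSums(X^2)))}                # distance from a = 0 to the closest one
    > set.seed(1)
    > boxplot(sapply(1:5, function(d) replicate(1e3, min_dist(10, d))),
    +         names=paste0("dim", 1:5))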

    @freakonometrics 23

  • http://www.ub.edu/riskcenter

    Classification (and Regression) Trees, CART

    one of the predictive modelling approaches used in statistics, data mining and machine learning [...] In tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. (Source: wikipedia).

    > library(rpart)
    > cart <- rpart(PRONO~., data=myocarde)   # reconstructed; the call was truncated in the transcript
    > library(rpart.plot)
    > library(rattle)
    > prp(cart, type=2, extra=1)

    or

    > fancyRpartPlot(cart, sub="")

    @freakonometrics 24

    https://en.wikipedia.org/wiki/Decision_tree_learning

  • http://www.ub.edu/riskcenter

    Classification (and Regression) Trees, CART

    The impurity is a function $\varphi$ of the probability to have 1 at node N, i.e. $\mathbb{P}[Y=1|\text{node }N]$, and

    $I(N) = \varphi(\mathbb{P}[Y=1|\text{node }N])$

    where $\varphi$ is nonnegative ($\varphi \geq 0$), symmetric ($\varphi(p)=\varphi(1-p)$), with a minimum at 0 and 1 ($\varphi(0)=\varphi(1) < \varphi(p)$), e.g.

    Bayes error: $\varphi(p) = \min\{p, 1-p\}$

    cross-entropy: $\varphi(p) = -p\log(p) - (1-p)\log(1-p)$

    Gini index: $\varphi(p) = p(1-p)$

    Those functions are concave, with minimum at p = 0 and 1, and maximum at p = 1/2.
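    A quick sketch (not on the slide) comparing the three impurity functions on (0, 1):

    > p = seq(.001, .999, by=.001)
    > plot(p, pmin(p, 1-p), type="l", ylim=c(0, .7), ylab="impurity")   # Bayes error
    > lines(p, -p*log(p) - (1-p)*log(1-p), lty=2)                       # cross-entropy
    > lines(p, p*(1-p), lty=3)                                          # Gini index
    > legend("topright", c("Bayes", "entropy", "Gini"), lty=1:3)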

    @freakonometrics 25

  • http://www.ub.edu/riskcenter

    Classification (and Regression) Trees, CART

    To split N into two nodes $\{N_L, N_R\}$, consider

    $I(N_L, N_R) = \sum_{x\in\{L,R\}} \frac{n_x}{n}\, I(N_x)$

    e.g. the Gini index (used originally in CART, see Breiman et al. (1984))

    $\mathrm{gini}(N_L, N_R) = -\sum_{x\in\{L,R\}} \frac{n_x}{n} \sum_{y\in\{0,1\}} \frac{n_{x,y}}{n_x}\left(1-\frac{n_{x,y}}{n_x}\right)$

    and the cross-entropy (used in C4.5 and C5.0)

    $\mathrm{entropy}(N_L, N_R) = -\sum_{x\in\{L,R\}} \frac{n_x}{n} \sum_{y\in\{0,1\}} \frac{n_{x,y}}{n_x}\log\left(\frac{n_{x,y}}{n_x}\right)$
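    A brute-force sketch (not on the slide) of the split search on the myocarde data, scanning thresholds for one covariate; gini_split is an illustrative helper.

    > y = (myocarde$PRONO == "Survival")
    > gini_split = function(x, s){                       # -sum (n_x/n) sum_y p_{x,y}(1-p_{x,y})
    +   pL = mean(y[x <= s]); pR = mean(y[x > s])
    +   nL = sum(x <= s);     nR = sum(x > s)
    +   -(nL/(nL+nR))*2*pL*(1-pL) - (nR/(nL+nR))*2*pR*(1-pR)}
    > s = quantile(myocarde$INSYS, seq(.1, .9, by=.01))  # candidate thresholds
    > plot(s, sapply(s, function(t) gini_split(myocarde$INSYS, t)), type="l")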

    @freakonometrics 26

    https://books.google.ca/books?id=JwQx-WOmSyQC&hl=frhttps://en.wikipedia.org/wiki/C4.5_algorithm

  • http://www.ub.edu/riskcenter

    Classification (and Regression) Trees, CART

    [Figure: impurity $I(N_L, N_R)$ as a function of the split point s, for each candidate covariate (INCAR, INSYS, PRDIA, PAPUL, PVENT, REPUL), for the first split (top panels) and the second split (bottom panels).]

    $N_L: \{x_{i,j} \leq s\}$   $N_R: \{x_{i,j} > s\}$

    solve $\max_{j\in\{1,\cdots,k\},\,s}\{I(N_L, N_R)\}$

    @freakonometrics 27

  • http://www.ub.edu/riskcenter

    Pruning Trees

    One can grow a big tree, until leaves have a (preset) small number of observations, and then possibly go back and prune branches (or leaves) that do not improve gains on good classification sufficiently.

    Or we can decide, at each node, whether we split, or not.

    @freakonometrics 28

  • http://www.ub.edu/riskcenter

    Pruning Trees

    In trees, overfitting increases with the number of steps, and leaves. The drop in impurity at node N is defined as

    $\Delta I(N_L, N_R) = I(N) - I(N_L, N_R) = I(N) - \left(\frac{n_L}{n}I(N_L) + \frac{n_R}{n}I(N_R)\right)$

    > library(rpart)
    > CART <- rpart(PRONO ~ PVENT + REPUL, data=myocarde)   # reconstructed; the call was truncated in the transcript
    > pred_CART = function(p,r){
    +   return(predict(CART, newdata =
    +     data.frame(PVENT=p, REPUL=r))[,"Survival"])}

    [Figure: CART predicted probability as a function of PVENT and REPUL]

    We cut if $\Delta I(N_L, N_R)/I(N)$ (relative gain) exceeds cp (the complexity parameter, default 1%).

    @freakonometrics 29

  • http://www.ub.edu/riskcenter

    Pruning Trees

    > library(rpart)
    > CART <- rpart(PRONO ~ PVENT + REPUL, data=myocarde)   # reconstructed; the call was truncated in the transcript
    > pred_CART = function(p,r){
    +   return(predict(CART, newdata =
    +     data.frame(PVENT=p, REPUL=r))[,"Survival"])}

    [Figure: CART predicted probability as a function of PVENT and REPUL]

    See also

    > library(mvpart)
    > ?prune

    Define the misclassification rate of a tree, R(tree).

    @freakonometrics 30

  • http://www.ub.edu/riskcenter

    Pruning Trees

    Given a cost-complexity parameter cp (see the tuning parameter $\lambda$ in Ridge-Lasso), define a penalized $R(\cdot)$

    $R_{cp}(\text{tree}) = \underbrace{R(\text{tree})}_{\text{loss}} + \underbrace{cp\,\|\text{tree}\|}_{\text{complexity}}$

    If cp is small the optimal tree is large; if cp is large the optimal tree has no leaf, see Breiman et al. (1984).

    > cart <- rpart(PRONO~., data=myocarde)   # reconstructed; the call was truncated in the transcript
    > plotcp(cart)
    > prune(cart, cp=0.06)

    [Figure: plotcp output, cross-validated relative error against cp (Inf, 0.27, 0.06, 0.024, 0.013) and tree size (1, 2, 3, 7, 9)]

    @freakonometrics 31

    https://books.google.ca/books?id=JwQx-WOmSyQC&hl=fr

  • http://www.ub.edu/riskcenter

    Bagging

    Bootstrapped Aggregation (Bagging) is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification (Source: wikipedia).

    It is an ensemble method that creates multiple models of the same type from different sub-samples of the same dataset [bootstrap]. The predictions from each separate model are combined together to provide a superior result [aggregation].

    → it can be used on any kind of model, but it is interesting for trees, see Breiman (1996)

    Bootstrap can be used to define the concept of margin,

    $\text{margin}_i = \frac{1}{B}\sum_{b=1}^B \boldsymbol{1}(\widehat{y}_i = y_i) - \frac{1}{B}\sum_{b=1}^B \boldsymbol{1}(\widehat{y}_i \neq y_i)$

    Remark: the probability that the ith row is not selected is $(1-n^{-1})^n \approx e^{-1} \approx 36.8\%$, cf. training / validation samples (2/3 - 1/3).
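    A one-line sanity check (not on the slide) of that 36.8% figure by simulation:

    > n = 1e3
    > mean(replicate(1e3, !(1 %in% sample(1:n, n, replace=TRUE))))   # close to exp(-1), about 0.368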

    @freakonometrics 32

    https://en.wikipedia.org/wiki/Bootstrap_aggregatinghttp://statistics.berkeley.edu/sites/default/files/tech-reports/421.pdf

  • http://www.ub.edu/riskcenter

    Bagging Trees

    [Figure: bagged-trees predicted probability as a function of PVENT and REPUL]

    > n = nrow(myocarde)
    > margin = matrix(NA, 1e4, n)                        # reconstructed from a partially garbled slide
    > for(b in 1:1e4){
    +   idx = sample(1:n, size=n, replace=TRUE)          # bootstrap sample
    +   cart = rpart(PRONO~., data=myocarde[idx,])
    +   margin[b,] = (predict(cart, newdata=myocarde, type="prob")[,"Survival"] > .5) !=
    +     (myocarde$PRONO == "Survival")
    + }
    > apply(margin, 2, mean)

    @freakonometrics 33

  • http://www.ub.edu/riskcenter

    Bagging Trees

    Interesting because of the instability of CARTs (in terms of tree structure, not necessarily prediction)

    @freakonometrics 34

  • http://www.ub.edu/riskcenter

    Bagging and Variance, Bagging and Bias

    Assume that $y = m(\boldsymbol{x}) + \varepsilon$. The mean squared error over repeated random samples can be decomposed in three parts (Hastie et al. (2001))

    $\mathbb{E}[(Y - \widehat{m}(\boldsymbol{x}))^2] = \underbrace{\sigma^2}_{(1)} + \underbrace{\left[\mathbb{E}[\widehat{m}(\boldsymbol{x})] - m(\boldsymbol{x})\right]^2}_{(2)} + \underbrace{\mathbb{E}\left(\left[\widehat{m}(\boldsymbol{x}) - \mathbb{E}[\widehat{m}(\boldsymbol{x})]\right]^2\right)}_{(3)}$

    (1) reflects the variance of Y around $m(\boldsymbol{x})$

    (2) is the squared bias of $\widehat{m}(\boldsymbol{x})$

    (3) is the variance of $\widehat{m}(\boldsymbol{x})$

    → bias-variance tradeoff. Bootstrap can be used to reduce the bias, and the variance (but be careful of outliers).

    @freakonometrics 35

    http://statweb.stanford.edu/~tibs/ElemStatLearn/

  • http://www.ub.edu/riskcenter

    > library(ipred)
    > BAG <- bagging(PRONO ~ PVENT + REPUL, data=myocarde)   # reconstructed; the call was truncated in the transcript
    > pred_BAG = function(p,r){
    +   return(predict(BAG, newdata =
    +     data.frame(PVENT=p, REPUL=r), type="prob")[,2])}

    [Figure: bagging predicted probability as a function of PVENT and REPUL]

    @freakonometrics 36

  • http://www.ub.edu/riskcenter

    Random Forests

    Strictly speaking, when bootstrapping among observations, and aggregating, we use a bagging algorithm.

    In the random forest algorithm, we combine Breiman's bagging idea and the random selection of features, introduced independently by Ho (1995) and Amit & Geman (1997).

    > library(randomForest)
    > RF <- randomForest(PRONO ~ PVENT + REPUL, data=myocarde)   # reconstructed; the call was truncated in the transcript
    > pred_RF = function(p,r){
    +   return(predict(RF, newdata =
    +     data.frame(PVENT=p, REPUL=r), type="prob")[,2])}

    [Figure: random forest predicted probability as a function of PVENT and REPUL]

    @freakonometrics 37

    http://cm.bell-labs.com/cm/cs/who/tkh/papers/odt.pdfhttp://www.cis.jhu.edu/publications/papers_in_database/GEMAN/shape.pdfhttp://www.cis.jhu.edu/publications/papers_in_database/GEMAN/shape.pdf

  • http://www.ub.edu/riskcenter

    Random Forest

    At each node, select $\sqrt{k}$ covariates out of k (randomly).

    → can deal with small n, large k problems

    Random Forests are used not only for prediction, but also to assess variable importance (discussed later on).
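    A sketch (not on the slide) of that importance measure on the myocarde data, using randomForest's built-in importance():

    > library(randomForest)
    > RF_all = randomForest(as.factor(PRONO) ~ ., data=myocarde, importance=TRUE)
    > importance(RF_all)     # mean decrease in accuracy and in Gini, per covariate
    > varImpPlot(RF_all)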

    @freakonometrics 38

  • http://www.ub.edu/riskcenter

    Support Vector Machine

    SVMs were developed in the 90s based on previous work, from Vapnik & Lerner (1963), see Valiant (1984). Assume that points are linearly separable, i.e. there are $\boldsymbol{\omega}$ and b such that

    $Y = \begin{cases} +1 & \text{if } \boldsymbol{\omega}^T\boldsymbol{x} + b > 0 \\ -1 & \text{if } \boldsymbol{\omega}^T\boldsymbol{x} + b < 0 \end{cases}$

    Problem: infinite number of solutions, we need a good one, that separates the data, (somehow) far from the data.

    Concept: VC dimension. Let $\mathcal{H} = \{h : \mathbb{R}^d \mapsto \{-1,+1\}\}$. Then $\mathcal{H}$ is said to shatter a set of points X if all dichotomies can be achieved. E.g. with those three points, all configurations can be achieved.

    @freakonometrics 39

    http://www.cs.iastate.edu/~cs573x/vapnik-portraits1963.pdfhttp://www.cs.iastate.edu/~cs573x/vapnik-portraits1963.pdfhttps://people.mpi-inf.mpg.de/~mehlhorn/SeminarEvolvability/ValiantLearnable.pdf

  • http://www.ub.edu/riskcenter

    Support Vector Machine

    E.g. with those four points, several configurations cannot be achieved (with some linear separator, but they can with some quadratic one)

    @freakonometrics 40

  • http://www.ub.edu/riskcenter

    Support Vector Machine

    Vapnik's (VC) dimension is the size of the largest shattered subset of X.

    This dimension is interesting to get an upper bound on the probability of misclassification (with some complexity penalty, function of VC($\mathcal{H}$)).

    Now, in practice, where is the optimal hyperplane?

    The distance from $\boldsymbol{x}_0$ to the hyperplane $\boldsymbol{\omega}^T\boldsymbol{x} + b$ is

    $d(\boldsymbol{x}_0, H_{\omega,b}) = \frac{\boldsymbol{\omega}^T\boldsymbol{x}_0 + b}{\|\boldsymbol{\omega}\|}$

    and the optimal hyperplane (in the separable case) is

    $\underset{\boldsymbol{\omega},b}{\text{argmax}}\left\{\min_{i=1,\cdots,n} d(\boldsymbol{x}_i, H_{\omega,b})\right\}$

    @freakonometrics 41

  • http://www.ub.edu/riskcenter

    Support Vector Machine

    Define support vectors as observations such that

    $|\boldsymbol{\omega}^T\boldsymbol{x}_i + b| = 1$

    The margin is the distance between the hyperplanes defined by the support vectors.

    The distance from the support vectors to $H_{\omega,b}$ is $\|\boldsymbol{\omega}\|^{-1}$, and the margin is then $2\|\boldsymbol{\omega}\|^{-1}$.

    → the algorithm is to minimize the inverse of the margin s.t. $H_{\omega,b}$ separates the ±1 points, i.e.

    $\min\left\{\frac{1}{2}\boldsymbol{\omega}^T\boldsymbol{\omega}\right\}$ s.t. $Y_i(\boldsymbol{\omega}^T\boldsymbol{x}_i + b) \geq 1,\ \forall i.$

    @freakonometrics 42

  • http://www.ub.edu/riskcenter

    Support Vector Machine

    Problem difficult to solve: many inequality constraints (n)

    → solve the dual problem...

    In the primal space, the solution was

    $\boldsymbol{\omega} = \sum_i \alpha_i Y_i\boldsymbol{x}_i$ with $\sum_{i=1}^n \alpha_i Y_i = 0$.

    In the dual space, the problem becomes (hint: consider the Lagrangian)

    $\max\left\{\sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j Y_iY_j\boldsymbol{x}_i^T\boldsymbol{x}_j\right\}$ s.t. $\sum_{i=1}^n \alpha_i Y_i = 0$,

    which is usually written

    $\min_{\boldsymbol{\alpha}}\left\{\frac{1}{2}\boldsymbol{\alpha}^T Q\boldsymbol{\alpha} - \boldsymbol{1}^T\boldsymbol{\alpha}\right\}$ s.t. $\begin{cases} 0 \leq \alpha_i\ \forall i \\ \boldsymbol{\alpha}^T\boldsymbol{y} = 0 \end{cases}$

    where $Q = [Q_{i,j}]$ and $Q_{i,j} = y_iy_j\boldsymbol{x}_i^T\boldsymbol{x}_j$.

    @freakonometrics 43

  • http://www.ub.edu/riskcenter

    Support Vector Machine

    Now, what about the non-separable case?

    Here, we cannot have $y_i(\boldsymbol{\omega}^T\boldsymbol{x}_i + b) \geq 1$ $\forall i$.

    → introduce slack variables,

    $\begin{cases} \boldsymbol{\omega}^T\boldsymbol{x}_i + b \geq +1 - \xi_i & \text{when } y_i = +1 \\ \boldsymbol{\omega}^T\boldsymbol{x}_i + b \leq -1 + \xi_i & \text{when } y_i = -1 \end{cases}$

    where $\xi_i \geq 0$ $\forall i$. There is a classification error when $\xi_i > 1$.

    The idea is then to solve

    $\min\left\{\frac{1}{2}\boldsymbol{\omega}^T\boldsymbol{\omega} + C\,\boldsymbol{1}^T\boldsymbol{1}_{\boldsymbol{\xi}>1}\right\}$, instead of $\min\left\{\frac{1}{2}\boldsymbol{\omega}^T\boldsymbol{\omega}\right\}$

    @freakonometrics 44

  • http://www.ub.edu/riskcenter

    Support Vector Machines, with a Linear Kernel

    So far,

    $d(\boldsymbol{x}_0, H_{\omega,b}) = \min_{\boldsymbol{x}\in H_{\omega,b}}\{\|\boldsymbol{x}_0-\boldsymbol{x}\|_{\ell_2}\}$

    where $\|\cdot\|_{\ell_2}$ is the Euclidean ($\ell_2$) norm,

    $\|\boldsymbol{x}_0-\boldsymbol{x}\|_{\ell_2} = \sqrt{(\boldsymbol{x}_0-\boldsymbol{x})\cdot(\boldsymbol{x}_0-\boldsymbol{x})} = \sqrt{\boldsymbol{x}_0\cdot\boldsymbol{x}_0 - 2\,\boldsymbol{x}_0\cdot\boldsymbol{x} + \boldsymbol{x}\cdot\boldsymbol{x}}$

    > library(kernlab)
    > SVM2 <- ksvm(PRONO ~ PVENT + REPUL, data=myocarde,
    +              kernel="vanilladot", prob.model=TRUE)   # linear kernel; call reconstructed from a truncated line
    > pred_SVM2 = function(p,r){
    +   return(predict(SVM2, newdata =
    +     data.frame(PVENT=p, REPUL=r), type="probabilities")[,2])}

    [Figure: linear-kernel SVM predicted probability as a function of PVENT and REPUL]

    @freakonometrics 45

  • http://www.ub.edu/riskcenter

    Support Vector Machines, with a Non Linear Kernel

    More generally,

    $d(\boldsymbol{x}_0, H_{\omega,b}) = \min_{\boldsymbol{x}\in H_{\omega,b}}\{\|\boldsymbol{x}_0-\boldsymbol{x}\|_{k}\}$

    where $\|\cdot\|_k$ is some kernel-based norm,

    $\|\boldsymbol{x}_0-\boldsymbol{x}\|_{k} = \sqrt{k(\boldsymbol{x}_0,\boldsymbol{x}_0) - 2k(\boldsymbol{x}_0,\boldsymbol{x}) + k(\boldsymbol{x},\boldsymbol{x})}$

    > library(kernlab)
    > SVM2 <- ksvm(PRONO ~ PVENT + REPUL, data=myocarde,
    +              kernel="rbfdot", prob.model=TRUE)   # e.g. a radial kernel; call reconstructed from a truncated line
    > pred_SVM2 = function(p,r){
    +   return(predict(SVM2, newdata =
    +     data.frame(PVENT=p, REPUL=r), type="probabilities")[,2])}

    [Figure: non-linear-kernel SVM predicted probability as a function of PVENT and REPUL]

    @freakonometrics 46

  • http://www.ub.edu/riskcenter

    Heuristics on SVMs

    An interpretation is that the data aren't linearly separable in the original space, but might be separable after some kernel transformation,

    @freakonometrics 47

  • http://www.ub.edu/riskcenter

    Still Hungry ?

    There are still several (machine learning) techniques that can be used for classification

    Fisher's Linear or Quadratic Discrimination (closely related to logistic regression, and PCA), see Fisher (1936)

    $X|Y=0 \sim \mathcal{N}(\boldsymbol{\mu}_0,\boldsymbol{\Sigma}_0)$ and $X|Y=1 \sim \mathcal{N}(\boldsymbol{\mu}_1,\boldsymbol{\Sigma}_1)$

    @freakonometrics 48

    http://onlinelibrary.wiley.com/doi/10.1111/j.1469-1809.1936.tb02137.x/abstract;jsessionid=C4721283C0ECF3893F91BFEBBDF3D094.f03t02)

  • http://www.ub.edu/riskcenter

    Still Hungry ?

    Perceptron, or more generally Neural Networks: "In machine learning, neural networks are a family of statistical learning models inspired by biological neural networks and are used to estimate or approximate functions that can depend on a large number of inputs and are generally unknown." (wikipedia), see Rosenblatt (1957)

    Boosting (see next section)

    Naive Bayes: "In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features." (wikipedia), see Russell & Norvig (2003)

    See also the (great) package

    > library(caret)

    @freakonometrics 49

    https://en.wikipedia.org/wiki/Artificial_neural_networkhttp://www.researchgate.net/post/Where_to_find_the_original_paper_of_Rosenblatts_Perceptron_algorithm_The_perceptron_a_perceiving_and_recognizing_automaton_Rosenblatt_F_1957https://en.wikipedia.org/wiki/Naive_Bayes_classifierhttp://aima.cs.berkeley.edu/http://aima.cs.berkeley.edu/

  • http://www.ub.edu/riskcenter

    Difference in Differences

    In many applications (e.g. marketing), we do need two models to analyze the impact of a treatment. We need two groups, a control and a treatment group.

    Data: $\{(\boldsymbol{x}_i, y_i)\}$ with $y_i \in \{\bullet, \bullet\}$
    Data: $\{(\boldsymbol{x}_j, y_j)\}$ with $y_j \in \{\bullet, \bullet\}$

    See clinical trials, treatment vs. control group. E.g. a direct mail campaign in a bank:

                  Control   Promotion
    No Purchase   85.17%    61.60%
    Purchase      14.83%    38.40%

    overall uplift effect +23.57%, see Guelman et al. (2014) for more details.

    @freakonometrics 50

    http://www.sciencedirect.com/science/article/pii/S0167668714000717

  • http://www.ub.edu/riskcenter

    Application on Motor Insurance Claims

    Consider a (standard) logistic regression, on two covariates (age of the driver, and age of the camping-car)

    $\pi = \mathrm{logit}^{-1}(\beta_0 + \beta_1x_1 + \beta_2x_2)$

    > reg_glm = glm(nombre ~ ageconducteur + agevehicule,
    +               data=camping, family=binomial)

    [Figure: predicted probability (0.00-0.15) as a function of driver age (20-90) and vehicle age (0-40)]

    @freakonometrics 51

  • http://www.ub.edu/riskcenter

    Application on Motor Insurance Claims

    Consider a (standard) logistic regression, on two covariates (age of the driver, and age of the camping-car), smoothed with splines

    $\pi = \mathrm{logit}^{-1}(\beta_0 + s_1(x_1) + s_2(x_2))$

    > library(splines)
    > reg_add = glm(nombre ~ bs(ageconducteur) + bs(agevehicule),
    +               data=camping, family=binomial)

    [Figure: predicted probability (0.00-0.15) as a function of driver age (20-90) and vehicle age (0-40)]

    @freakonometrics 52

  • http://www.ub.edu/riskcenter

    Application on Motor Insurance Claims

    Consider a (standard) logistic regression, on two covariates (age of the driver, and age of the camping-car), smoothed with a bivariate spline

    $\pi = \mathrm{logit}^{-1}(\beta_0 + s(x_1, x_2))$

    > library(mgcv)
    > reg_gam = gam(nombre ~ s(ageconducteur, agevehicule),
    +               data=camping, family=binomial)

    [Figure: predicted probability (0.00-0.15) as a function of driver age (20-90) and vehicle age (0-40)]

    @freakonometrics 53

  • http://www.ub.edu/riskcenter

    Application on Motor Insurance Claims

    One can also use k-Nearest Neighbours (k-NN)

    > library(caret)
    > sc = sd(camping$ageconducteur)
    > sv = sd(camping$agevehicule)
    > knn = knn3((nombre==1) ~ I(ageconducteur/sc) + I(agevehicule/sv),
    +            data=camping, k=100)

    [Figure: predicted probability (0.00-0.15) as a function of driver age (20-90) and vehicle age (0-40)]

    (be careful about scaling problems)

    @freakonometrics 54

  • http://www.ub.edu/riskcenter

    Application on Motor Insurance Claims

    We can also use a tree

    > tree = rpart((nombre==1) ~ ageconducteur + agevehicule,
    +              data=camping, cp=7e-4)

    [Figure: predicted probability (0.00-0.15) as a function of driver age (20-90) and vehicle age (0-40), two tree-based panels]

    @freakonometrics 55

  • http://www.ub.edu/riskcenter

    Application on Motor Insurance Claims

    or bagging techniques (rather close to random forests)

    > library(ipred)
    > bag = bagging((nombre==1) ~ ageconducteur + agevehicule,
    +               data=camping)
    > library(randomForest)
    > rf = randomForest((nombre==1) ~ ageconducteur + agevehicule,
    +                   data=camping)

    [Figure: predicted probability (0.00-0.15) as a function of driver age (20-90) and vehicle age (0-40)]

    @freakonometrics 56

  • http://www.ub.edu/riskcenter

    Application on Motor Insurance Claims

    Boosting algorithms can also be considered (see next time)

    > library(dismo)
    > library(gbm)
    > fit <- gbm.step(data=camping, ...)   # call truncated in the transcript
    > predict(fit, type="response", n.trees=700)

    [Figure: boosted model, predicted probability (0.00-0.15) against driver age (20-90)]

    @freakonometrics 57

  • http://www.ub.edu/riskcenter

    Application on Motor Insurance Claims

    Boosting algorithms can also be considered (see next time)

    > library(dismo)
    > library(gbm)
    > fit <- gbm.step(data=camping, ...)   # call truncated in the transcript
    > predict(fit, type="response", n.trees=400)

    [Figure: predicted probability (0.00-0.15) as a function of driver age (20-90) and vehicle age (0-40)]

    @freakonometrics 58

  • http://www.ub.edu/riskcenter

    Part 2. Regression

    @freakonometrics 59

  • http://www.ub.edu/riskcenter

    Regression?

    In statistics, regression analysis is a statistical process for estimating the relationships among variables [...] In a narrower sense, regression may refer specifically to the estimation of continuous response variables, as opposed to the discrete response variables used in classification. (Source: wikipedia).

    Here regression is opposed to classification (as in the CART algorithm). y is either a continuous variable, $y \in \mathbb{R}$, or a counting variable, $y \in \mathbb{N}$.

    @freakonometrics 60

    https://en.wikipedia.org/wiki/Regression_analysis

  • http://www.ub.edu/riskcenter

    Regression? Parametrics, nonparametrics and machine learning

    In many cases in the econometric and actuarial literature we simply want a good fit for the conditional expectation, $\mathbb{E}[Y|\boldsymbol{X}=\boldsymbol{x}]$.

    Regression analysis estimates the conditional expectation of the dependent variable given the independent variables (Source: wikipedia).

    Example: a popular nonparametric technique, kernel based regression,

    $\widehat{m}(\boldsymbol{x}) = \frac{\sum_i Y_i\,K_h(\boldsymbol{X}_i-\boldsymbol{x})}{\sum_i K_h(\boldsymbol{X}_i-\boldsymbol{x})}$

    In the econometric literature, interest is on asymptotic normality properties and plug-in techniques.

    In machine learning, interest is on out-of-sample cross-validation algorithms.
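    A minimal sketch (not in the slides) of that Nadaraya-Watson estimator with a Gaussian kernel; the toy data, x0 and h are illustrative.

    > set.seed(1)
    > X = runif(200); Y = sin(2*pi*X) + rnorm(200, sd=.3)   # toy data
    > m_hat = function(x0, h=.05){
    +   w = dnorm((X - x0)/h)                               # kernel weights
    +   sum(w*Y)/sum(w)}
    > x = seq(0, 1, by=.01)
    > plot(X, Y); lines(x, sapply(x, m_hat), lwd=2)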

    @freakonometrics 61

    https://en.wikipedia.org/wiki/Regression_analysis

  • http://www.ub.edu/riskcenter

    Linear, Non-Linear and Generalized Linear

    Linear Model:

    $(Y|\boldsymbol{X}=\boldsymbol{x}) \sim \mathcal{N}(\mu_x, \sigma^2)$

    $\mathbb{E}[Y|\boldsymbol{X}=\boldsymbol{x}] = \mu_x = \boldsymbol{x}^T\boldsymbol{\beta}$

    > fit <- lm(y ~ x, data=df)   # illustrative call; the original was truncated in the transcript

  • http://www.ub.edu/riskcenter

    Linear, Non-Linear and Generalized Linear

    NonLinear / NonParametric Model:

    $(Y|\boldsymbol{X}=\boldsymbol{x}) \sim \mathcal{N}(\mu_x, \sigma^2)$

    $\mathbb{E}[Y|\boldsymbol{X}=\boldsymbol{x}] = \mu_x = m(\boldsymbol{x})$

    > fit <- loess(y ~ x, data=df)   # illustrative call; the original was truncated in the transcript

  • http://www.ub.edu/riskcenter

    Linear, Non-Linear and Generalized Linear

    Generalized Linear Model:

    $(Y|\boldsymbol{X}=\boldsymbol{x}) \sim \mathcal{L}(\theta_x, \varphi)$

    $\mathbb{E}[Y|\boldsymbol{X}=\boldsymbol{x}] = h^{-1}(\theta_x) = h^{-1}(\boldsymbol{x}^T\boldsymbol{\beta})$

    > fit <- glm(y ~ x, data=df, family=poisson(link="log"))   # illustrative call; the original was truncated in the transcript

  • http://www.ub.edu/riskcenter

    Linear Model

    Consider a linear regression model, $y_i = \boldsymbol{x}_i^T\boldsymbol{\beta} + \varepsilon_i$.

    $\boldsymbol{\beta}$ is estimated using ordinary least squares, $\widehat{\boldsymbol{\beta}} = [\boldsymbol{X}^T\boldsymbol{X}]^{-1}\boldsymbol{X}^T\boldsymbol{Y}$

    → best linear unbiased estimator

    Unbiased estimators are important in statistics because they have nice mathematical properties (see the Cramér-Rao lower bound).

    Looking for biased estimators (bias-variance tradeoff) becomes important in high dimension, see Burr & Fry (2005)

    @freakonometrics 65

    http://www.jstor.org/stable/25471022

  • http://www.ub.edu/riskcenter

    Linear Model and Loss Functions

    Consider a linear model, with some general loss function $\ell$, set $\ell(x,y) = R(x-y)$ and consider

    $\widehat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\text{argmin}}\left\{\sum_{i=1}^n \ell(y_i, \boldsymbol{x}_i^T\boldsymbol{\beta})\right\}$

    If R is differentiable, the first order condition would be

    $\sum_{i=1}^n R'\left(y_i - \boldsymbol{x}_i^T\boldsymbol{\beta}\right)\boldsymbol{x}_i^T = \boldsymbol{0},$

    i.e. $\sum_{i=1}^n \underbrace{\omega\left(y_i-\boldsymbol{x}_i^T\boldsymbol{\beta}\right)}_{\omega_i}\left(y_i-\boldsymbol{x}_i^T\boldsymbol{\beta}\right)\boldsymbol{x}_i^T = \boldsymbol{0}$ with $\omega(x) = \frac{R'(x)}{x}$,

    which is the first order condition of a weighted $\ell_2$ regression.

    @freakonometrics 66

  • http://www.ub.edu/riskcenter

    Linear Model and Loss Functions

    But weights are unknown: use an iterative algorithm, re-fitting by weighted least squares and updating the weights from the residuals (the loop shown on the slide was truncated in the transcript; see the sketch below).
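    A minimal sketch of that iteration, assuming an ℓ1-type loss so that ω(x) = 1/|x| (iteratively reweighted least squares for median regression); the toy data and the 100-step loop are illustrative.

    > set.seed(1)
    > x = runif(200); y = 1 + 2*x + rt(200, df=2)   # toy data with heavy tails
    > fit = lm(y ~ x)
    > e = residuals(fit)
    > for(i in 1:100){
    +   W = 1/pmax(abs(e), 1e-6)                    # ω(e) = R'(e)/e for R(x) = |x|
    +   fit = lm(y ~ x, weights=W)
    +   e = residuals(fit)
    + }
    > coefficients(fit)                             # close to a least-absolute-deviation fit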

  • http://www.ub.edu/riskcenter

    Bagging Linear Models

    > V = matrix(NA, 100, 251)
    > for(i in 1:100){
    +   ind = sample(1:n, size=n, replace=TRUE)
    +   ...   # rest of the bootstrap loop truncated in the transcript

  • http://www.ub.edu/riskcenter

    Regression Smoothers, natura non facit saltus

    In statistical learning procedures, a key role is played by basis functions. We will see that it is common to assume that

    $m(x) = \sum_{m=0}^{M} \beta_m h_m(x),$

    where $h_0$ is usually a constant function and the $h_m$'s are basis functions.

    For instance, $h_m(x) = x^m$ for a polynomial expansion with a single predictor, or $h_m(x) = (x - s_m)_+$ for some knots $s_m$'s (for linear splines, but one can consider quadratic or cubic ones).

    @freakonometrics 69

  • http://www.ub.edu/riskcenter

    Regression Smoothers: Polynomial Functions

    Stone-Weierstrass theorem: every continuous function defined on a closed interval [a, b] can be uniformly approximated as closely as desired by a polynomial function.

    > fit <- lm(Y ~ poly(X, ...), data=df)   # polynomial regression; the exact call was truncated in the transcript
    > predict(fit, newdata = data.frame(X=x))

    @freakonometrics 70

  • http://www.ub.edu/riskcenter

    Regression Smoothers: Spline Functions

    > fit <- lm(Y ~ bs(X, ...), data=df)   # spline basis (splines::bs); the exact call was truncated in the transcript
    > predict(fit, newdata = data.frame(X=x))

    @freakonometrics 71

  • http://www.ub.edu/riskcenter

    Regression Smoothers: Spline Functions

    > fit <- lm(Y ~ bs(X, ...), data=df)   # the exact call was truncated in the transcript
    > predict(fit, newdata = data.frame(X=x))

    see Generalized Additive Models.

    @freakonometrics 72

  • http://www.ub.edu/riskcenter

    Fixed Knots vs. Optimized Ones

    > library(freeknotsplines)
    > gen <- freelsgen(x, y, ...)              # optimized knot locations; exact call truncated in the transcript
    > fit <- lm(Y ~ bs(X, knots=...), data=df) # knots taken from gen; exact call truncated
    > predict(fit, newdata = data.frame(X=x))

    @freakonometrics 73

  • http://www.ub.edu/riskcenter

    Interpretation of Penalty

    Unbiased estimators are important in mathematical statistics, but are they the best estimators?

    Consider an i.i.d. sample $\{y_1,\cdots,y_n\}$ with distribution $\mathcal{N}(\mu,\sigma^2)$. Define $\widehat{\theta} = \alpha\overline{Y}$. What is the optimal $\alpha^\star$ to get the best estimator of $\mu$?

    bias: $\text{bias}(\widehat{\theta}) = \mathbb{E}(\widehat{\theta}) - \mu = (\alpha-1)\mu$

    variance: $\text{Var}(\widehat{\theta}) = \frac{\alpha^2\sigma^2}{n}$

    mse: $\text{mse}(\widehat{\theta}) = (\alpha-1)^2\mu^2 + \frac{\alpha^2\sigma^2}{n}$

    The optimal value is $\alpha^\star = \frac{\mu^2}{\mu^2 + \frac{\sigma^2}{n}} < 1$.

    @freakonometrics 74

  • http://www.ub.edu/riskcenter

    Linear Model

    Consider some linear model $y_i = \boldsymbol{x}_i^T\boldsymbol{\beta} + \varepsilon_i$ for all $i = 1,\cdots,n$.

    Assume that the $\varepsilon_i$'s are i.i.d. with $\mathbb{E}(\varepsilon) = 0$ (and finite variance). Write

    $\underbrace{\begin{pmatrix} y_1\\ \vdots\\ y_n\end{pmatrix}}_{\boldsymbol{y},\,n\times 1} = \underbrace{\begin{pmatrix} 1 & x_{1,1} & \cdots & x_{1,k}\\ \vdots & \vdots & \ddots & \vdots\\ 1 & x_{n,1} & \cdots & x_{n,k}\end{pmatrix}}_{\boldsymbol{X},\,n\times(k+1)} \underbrace{\begin{pmatrix}\beta_0\\ \beta_1\\ \vdots\\ \beta_k\end{pmatrix}}_{\boldsymbol{\beta},\,(k+1)\times 1} + \underbrace{\begin{pmatrix}\varepsilon_1\\ \vdots\\ \varepsilon_n\end{pmatrix}}_{\boldsymbol{\varepsilon},\,n\times 1}.$

    Assuming $\boldsymbol{\varepsilon}\sim\mathcal{N}(\boldsymbol{0},\sigma^2\mathbb{I})$, the maximum likelihood estimator of $\boldsymbol{\beta}$ is

    $\widehat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\text{argmin}}\{\|\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}\|_{\ell_2}\} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}$

    ... under the assumption that $\boldsymbol{X}^T\boldsymbol{X}$ is a full-rank matrix.

    What if $\boldsymbol{X}^T\boldsymbol{X}$ cannot be inverted? Then $\widehat{\boldsymbol{\beta}} = [\boldsymbol{X}^T\boldsymbol{X}]^{-1}\boldsymbol{X}^T\boldsymbol{y}$ does not exist, but $\widehat{\boldsymbol{\beta}}_\lambda = [\boldsymbol{X}^T\boldsymbol{X} + \lambda\mathbb{I}]^{-1}\boldsymbol{X}^T\boldsymbol{y}$ always exists if $\lambda > 0$.
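    A small sketch (not from the slides) of that regularized inverse on the myocarde covariates; the value of λ is illustrative.

    > X = cbind(1, as.matrix(myocarde[,1:7]))
    > Y = as.numeric(myocarde$PRONO == "Survival")
    > lambda = 1
    > beta_ridge = solve(t(X)%*%X + lambda*diag(ncol(X))) %*% t(X)%*%Y   # always invertible for lambda > 0
    > beta_ridge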

    @freakonometrics 75

  • http://www.ub.edu/riskcenter

    Ridge Regression

    The estimator $\widehat{\boldsymbol{\beta}}_\lambda = [\boldsymbol{X}^T\boldsymbol{X} + \lambda\mathbb{I}]^{-1}\boldsymbol{X}^T\boldsymbol{y}$ is the Ridge estimate, obtained as the solution of

    $\widehat{\boldsymbol{\beta}}_\lambda = \underset{\boldsymbol{\beta}}{\text{argmin}}\left\{\sum_{i=1}^n [y_i - \beta_0 - \boldsymbol{x}_i^T\boldsymbol{\beta}]^2 + \lambda\|\boldsymbol{\beta}\|_{\ell_2}^2\right\}$

    for some tuning parameter $\lambda$. One can also write

    $\widehat{\boldsymbol{\beta}}_\lambda = \underset{\boldsymbol{\beta};\,\|\boldsymbol{\beta}\|_{\ell_2}\leq s}{\text{argmin}}\{\|\boldsymbol{Y}-\boldsymbol{X}\boldsymbol{\beta}\|_{\ell_2}\}$

    Remark: note that we solve $\widehat{\boldsymbol{\beta}} = \text{argmin}\{\text{objective}(\boldsymbol{\beta})\}$ where

    $\text{objective}(\boldsymbol{\beta}) = \underbrace{\mathcal{L}(\boldsymbol{\beta})}_{\text{training loss}} + \underbrace{\mathcal{R}(\boldsymbol{\beta})}_{\text{regularization}}$

  • http://www.ub.edu/riskcenter

    Going further on sparsity issues

    In several applications, k can be (very) large, but a lot of features are just noise: $\beta_j = 0$ for many j's. Let s denote the number of relevant features, with $s \ll k$.

  • http://www.ub.edu/riskcenter

    Going further on sparsity issues

    Define $\|\boldsymbol{a}\|_{\ell_0} = \sum \boldsymbol{1}(|a_i| > 0)$. Here $\dim(\boldsymbol{\beta}) = s$.

    We wish we could solve

    $\widehat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta};\,\|\boldsymbol{\beta}\|_{\ell_0}\leq s}{\text{argmin}}\{\|\boldsymbol{Y}-\boldsymbol{X}\boldsymbol{\beta}\|_{\ell_2}\}$

    Problem: it is usually not possible to describe all possible constraints, since $\binom{k}{s}$ sets of coefficients should be considered here (with k (very) large).

    Idea: solve the dual problem

    $\widehat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta};\,\|\boldsymbol{Y}-\boldsymbol{X}\boldsymbol{\beta}\|_{\ell_2}\leq h}{\text{argmin}}\{\|\boldsymbol{\beta}\|_{\ell_0}\}$

    where we might convexify the $\ell_0$ norm, $\|\cdot\|_{\ell_0} \rightarrow \|\cdot\|_{\ell_1}$.

    @freakonometrics 78

  • http://www.ub.edu/riskcenter

    Regularization $\ell_0$, $\ell_1$ and $\ell_2$

    @freakonometrics 79

  • http://www.ub.edu/riskcenter

    Optimal LASSO Penalty

    Use cross validation, e.g. K-fold,

    $\widehat{\boldsymbol{\beta}}_{(-k)}(\lambda) = \underset{\boldsymbol{\beta}}{\text{argmin}}\left\{\sum_{i\notin I_k} [y_i - \boldsymbol{x}_i^T\boldsymbol{\beta}]^2 + \lambda\|\boldsymbol{\beta}\|\right\}$

    then compute the sum of the squared errors,

    $Q_k(\lambda) = \sum_{i\in I_k} [y_i - \boldsymbol{x}_i^T\widehat{\boldsymbol{\beta}}_{(-k)}(\lambda)]^2$

    and finally solve

    $\lambda^\star = \underset{\lambda}{\text{argmin}}\left\{\overline{Q}(\lambda) = \frac{1}{K}\sum_k Q_k(\lambda)\right\}$

    Note that this might overfit, so Hastie, Tibshirani & Friedman (2009) suggest the largest $\lambda$ such that

    $\overline{Q}(\lambda) \leq \overline{Q}(\lambda^\star) + \text{se}[\lambda^\star]$ with $\text{se}[\lambda]^2 = \frac{1}{K^2}\sum_{k=1}^K [Q_k(\lambda) - \overline{Q}(\lambda)]^2$
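    For reference (not on the slide), cv.glmnet in the glmnet package implements exactly this: lambda.min is λ⋆ and lambda.1se the one-standard-error choice. A tiny self-contained sketch on simulated data:

    > library(glmnet)
    > set.seed(1)
    > X = matrix(rnorm(100*10), 100, 10)
    > Y = X[,1] - 2*X[,2] + rnorm(100)
    > cvfit = cv.glmnet(X, Y)
    > c(cvfit$lambda.min, cvfit$lambda.1se)   # λ* and the 1-se rule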

    @freakonometrics 80

    http://statweb.stanford.edu/~tibs/ElemStatLearn/

  • http://www.ub.edu/riskcenter

    Going further on sparsity issues

    On $[-1,+1]^k$, the convex hull of $\|\cdot\|_{\ell_0}$ is $\|\cdot\|_{\ell_1}$. On $[-a,+a]^k$, the convex hull of $\|\cdot\|_{\ell_0}$ is $a^{-1}\|\cdot\|_{\ell_1}$. Hence,

    $\widehat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta};\,\|\boldsymbol{\beta}\|_{\ell_1}\leq s}{\text{argmin}}\{\|\boldsymbol{Y}-\boldsymbol{X}\boldsymbol{\beta}\|_{\ell_2}\}$

    is equivalent (Kuhn-Tucker theorem) to the Lagrangian optimization problem

    $\widehat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\text{argmin}}\{\|\boldsymbol{Y}-\boldsymbol{X}\boldsymbol{\beta}\|_{\ell_2} + \lambda\|\boldsymbol{\beta}\|_{\ell_1}\}$

    @freakonometrics 81

  • http://www.ub.edu/riskcenter

    LASSO Least Absolute Shrinkage and Selection Operator

    $\underset{\boldsymbol{\beta}}{\text{argmin}}\{\|\boldsymbol{Y}-\boldsymbol{X}\boldsymbol{\beta}\|_{\ell_2} + \lambda\|\boldsymbol{\beta}\|_{\ell_1}\}$

    is a convex problem (several algorithms), but not strictly convex (no unicity of the minimum). Nevertheless, predictions $\widehat{y} = \boldsymbol{x}^T\widehat{\boldsymbol{\beta}}$ are unique.

    → MM, minimize majorization, coordinate descent, see Hunter (2003).

    @freakonometrics 82

    http://sites.stat.psu.edu/~dhunter/papers/mmtutorial.pdf

  • http://www.ub.edu/riskcenter

    > freq = merge(contrat, nombre_RC)
    > freq = merge(freq, nombre_DO)
    > freq[,10] = as.factor(freq[,10])
    > mx = cbind(freq[,c(4,5,6)], freq[,9]=="D",
    +            freq[,3] %in% c("A","B","C"))
    > colnames(mx) = c(names(freq)[c(4,5,6)], "diesel", "zone")
    > for(i in 1:ncol(mx)) mx[,i] = (mx[,i]-mean(mx[,i]))/sd(mx[,i])
    > names(mx)
    [1] puissance agevehicule ageconducteur diesel zone
    > library(glmnet)
    > fit = glmnet(x=as.matrix(mx), y=freq[,11],
    +              offset=log(freq[,2]), family="poisson")
    > plot(fit, xvar="lambda", label=TRUE)

    LASSO, third party

    [Figure: LASSO coefficient paths against log(λ)]

    @freakonometrics 83

  • http://www.ub.edu/riskcenter

    LASSO, third party

    > plot(fit, label=TRUE)
    > cvfit = cv.glmnet(x=as.matrix(mx), y=freq[,11],
    +                   offset=log(freq[,2]), family="poisson")
    > plot(cvfit)
    > cvfit$lambda.min
    [1] 0.0002845703
    > log(cvfit$lambda.min)
    [1] -8.16453

    Cross validation curve + error bars

    [Figure: coefficient paths against the ℓ1 norm, and cross-validated Poisson deviance against log(λ)]

    @freakonometrics 84

  • http://www.ub.edu/riskcenter

    > freq = merge(contrat, nombre_RC)
    > freq = merge(freq, nombre_DO)
    > freq[,10] = as.factor(freq[,10])
    > mx = cbind(freq[,c(4,5,6)], freq[,9]=="D",
    +            freq[,3] %in% c("A","B","C"))
    > colnames(mx) = c(names(freq)[c(4,5,6)], "diesel", "zone")
    > for(i in 1:ncol(mx)) mx[,i] = (mx[,i]-mean(mx[,i]))/sd(mx[,i])
    > names(mx)
    [1] puissance agevehicule ageconducteur diesel zone
    > library(glmnet)
    > fit = glmnet(x=as.matrix(mx), y=freq[,12],
    +              offset=log(freq[,2]), family="poisson")
    > plot(fit, xvar="lambda", label=TRUE)

    LASSO, DO (material damage) claim frequency

    [Figure: LASSO coefficient paths against log(λ)]

    @freakonometrics 85

  • http://www.ub.edu/riskcenter

    LASSO, material

    > plot(fit, label=TRUE)
    > cvfit = cv.glmnet(x=as.matrix(mx), y=freq[,12],
    +                   offset=log(freq[,2]), family="poisson")
    > plot(cvfit)
    > cvfit$lambda.min
    [1] 0.0004744917
    > log(cvfit$lambda.min)
    [1] -7.653266

    Cross validation curve + error bars

    [Figure: coefficient paths against the ℓ1 norm, and cross-validated Poisson deviance against log(λ)]

    @freakonometrics 86

  • http://www.ub.edu/riskcenter

    Some thoughts about Tuning parameters

    Regularization is a key issue in machine learning, to avoid overfitting.

    In (traditional) econometrics, tuning parameters are based on plug-in methods: see Silverman's bandwidth rule in kernel density estimation,

    $h^\star = \left(\frac{4\widehat{\sigma}^5}{3n}\right)^{1/5} \approx 1.06\,\widehat{\sigma}\,n^{-1/5}.$

    In the machine learning literature, out-of-sample cross-validation methods are used for choosing the amount of regularization.
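    For reference (not on the slide), base R's bw.nrd implements this 1.06·σ̂·n^(−1/5) reference rule (with a robust scale estimate), and density() accepts it directly:

    > set.seed(1)
    > x = rnorm(1000)
    > bw.nrd(x)                           # close to the plug-in value below
    > 1.06 * sd(x) * length(x)^(-1/5)
    > plot(density(x, bw=bw.nrd(x)))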

    @freakonometrics 87

  • http://www.ub.edu/riskcenter

    Optimal LASSO Penalty

    Use cross validation, e.g. K-fold,

    $\widehat{\boldsymbol{\beta}}_{(-k)}(\lambda) = \underset{\boldsymbol{\beta}}{\text{argmin}}\left\{\sum_{i\notin I_k} [y_i - \boldsymbol{x}_i^T\boldsymbol{\beta}]^2 + \lambda\sum_{j}|\beta_j|\right\}$

    then compute the sum of the squared errors,

    $Q_k(\lambda) = \sum_{i\in I_k} [y_i - \boldsymbol{x}_i^T\widehat{\boldsymbol{\beta}}_{(-k)}(\lambda)]^2$

    and finally solve

    $\lambda^\star = \underset{\lambda}{\text{argmin}}\left\{\overline{Q}(\lambda) = \frac{1}{K}\sum_k Q_k(\lambda)\right\}$

    Note that this might overfit, so Hastie, Tibshirani & Friedman (2009) suggest the largest $\lambda$ such that

    $\overline{Q}(\lambda) \leq \overline{Q}(\lambda^\star) + \text{se}[\lambda^\star]$ with $\text{se}[\lambda]^2 = \frac{1}{K^2}\sum_{k=1}^K [Q_k(\lambda) - \overline{Q}(\lambda)]^2$

    @freakonometrics 88

    http://statweb.stanford.edu/~tibs/ElemStatLearn/

  • http://www.ub.edu/riskcenter

    Big Data, Oracle and Sparsity

    Assume that k is large, and that $\boldsymbol{\beta}\in\mathbb{R}^k$ can be partitioned as $\boldsymbol{\beta} = (\boldsymbol{\beta}_{\text{imp}}, \boldsymbol{\beta}_{\text{non-imp}})$, as well as the covariates $\boldsymbol{x} = (\boldsymbol{x}_{\text{imp}}, \boldsymbol{x}_{\text{non-imp}})$, with important and non-important variables, i.e. $\boldsymbol{\beta}_{\text{non-imp}} \approx \boldsymbol{0}$.

    Goal: achieve variable selection and make inference on $\boldsymbol{\beta}_{\text{imp}}$.

    Oracle property of high dimensional model selection and estimation, see Fan and Li (2001). Only the oracle knows which variables are important...

    If the sample size is large enough ($n \gg k_{\text{imp}}\left(1 + \log\frac{k}{k_{\text{imp}}}\right)$) we can do inference as if we knew which covariates were important: we can ignore the selection-of-covariates part, which is not relevant for the confidence intervals. This provides cover for ignoring the shrinkage and using regular standard errors, see Athey & Imbens (2015).

    @freakonometrics 89

    http://orfe.princeton.edu/~jqfan/papers/01/penlike.pdfhttp://orfe.princeton.edu/~jqfan/papers/01/penlike.pdfhttp://www.nber.org/econometrics_minicourse_2015/http://www.nber.org/econometrics_minicourse_2015/

  • http://www.ub.edu/riskcenter

    Why Shrinkage Regression Estimates ?

    Interesting for model selection (alternative to penalized criteria) and to get a good balance between bias and variance.

    In decision theory, an admissible decision rule is a rule for making a decision such that there is no other rule that is always better than it.

    When $k \geq 3$, ordinary least squares is not admissible, see the improvement by the James-Stein estimator.

    @freakonometrics 90

  • http://www.ub.edu/riskcenter

    Regularization and Scalability

    What if k is (extremely) large? "never trust ols with more than five regressors" (attributed to Zvi Griliches in Athey & Imbens (2015))

    Use regularization techniques, see Ridge, Lasso, or subset selection

    $\widehat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\text{argmin}}\left\{\sum_{i=1}^n [y_i - \beta_0 - \boldsymbol{x}_i^T\boldsymbol{\beta}]^2 + \lambda\|\boldsymbol{\beta}\|_{\ell_0}\right\}$ where $\|\boldsymbol{\beta}\|_{\ell_0} = \sum_k \boldsymbol{1}(\beta_k \neq 0).$

    @freakonometrics 91

    http://www.nber.org/econometrics_minicourse_2015/nber_slides11.pdf

  • http://www.ub.edu/riskcenter

    Penalization and Splines

    In order to get a sufficiently smooth model, why not penalize the sum of squares of errors,

    $\sum_{i=1}^n [y_i - m(\boldsymbol{x}_i)]^2 + \lambda\int [m''(t)]^2\,dt$

    for some tuning parameter $\lambda$. Consider some cubic spline basis, so that

    $m(x) = \sum_{j=1}^J \theta_j N_j(x)$

    then the optimal expression for m is obtained using

    $\widehat{\boldsymbol{\theta}} = [\boldsymbol{N}^T\boldsymbol{N} + \lambda\boldsymbol{\Omega}]^{-1}\boldsymbol{N}^T\boldsymbol{y}$

    where $\boldsymbol{N}_{i,j}$ is the matrix of $N_j(X_i)$'s and $\boldsymbol{\Omega}_{i,j} = \int N_i''(t)N_j''(t)\,dt$
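    A minimal sketch (not on the slide) of such a penalized smoother, using base R's smooth.spline, which picks λ by (generalized) cross validation; the toy data are illustrative.

    > set.seed(1)
    > x = runif(200); y = sin(2*pi*x) + rnorm(200, sd=.3)
    > fit = smooth.spline(x, y)          # cubic smoothing spline, λ chosen by GCV
    > plot(x, y); lines(fit, lwd=2)
    > fit$lambda                         # the selected tuning parameter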

    @freakonometrics 92

  • http://www.ub.edu/riskcenter

    Smoothing with Multiple Regressors

    Actually

    $\sum_{i=1}^n [y_i - m(\boldsymbol{x}_i)]^2 + \lambda\int [m''(\boldsymbol{t})]^2\,d\boldsymbol{t}$

    is based on some multivariate penalty functional, e.g.

    $\int [m''(\boldsymbol{t})]^2\,d\boldsymbol{t} = \int\left[\sum_i \left(\frac{\partial^2 m(\boldsymbol{t})}{\partial t_i^2}\right)^2 + 2\sum_{i,j}\left(\frac{\partial^2 m(\boldsymbol{t})}{\partial t_i\,\partial t_j}\right)^2\right] d\boldsymbol{t}$

    @freakonometrics 93

  • http://www.ub.edu/riskcenter

    Regression Trees

    The partitioning is sequential, one covariate at a time (see adaptive neighbor estimation).

    Start with $Q = \sum_{i=1}^n [y_i - \overline{y}]^2$

    For covariate k and threshold t, split the data according to $\{x_{i,k} \leq t\}$ (L) or $\{x_{i,k} > t\}$ (R). Compute

    $\overline{y}_L = \frac{\sum_{i,x_{i,k}\leq t} y_i}{\sum_{i,x_{i,k}\leq t} 1}$ and $\overline{y}_R = \frac{\sum_{i,x_{i,k}>t} y_i}{\sum_{i,x_{i,k}>t} 1}$

    and let

    $\widehat{m}_i^{(k,t)} = \begin{cases} \overline{y}_L & \text{if } x_{i,k}\leq t \\ \overline{y}_R & \text{if } x_{i,k}>t \end{cases}$

    @freakonometrics 94

  • http://www.ub.edu/riskcenter

    Regression Trees

    Then compute $(k^\star, t^\star) = \underset{k,t}{\text{argmin}}\left\{\sum_{i=1}^n [y_i - \widehat{m}_i^{(k,t)}]^2\right\}$, and partition the space into two subspaces, whether $x_{k^\star} \leq t^\star$, or not.

    Then repeat this procedure, and minimize

    $\sum_{i=1}^n [y_i - \widehat{m}_i]^2 + \lambda\,\#\{\text{leaves}\}$

    (cf. LASSO).

    One can also consider random forests with regression trees.

    @freakonometrics 95

  • http://www.ub.edu/riskcenter

    Local Regression

    1 > W

  • http://www.ub.edu/riskcenter

    Local Regression

    1 > W

  • http://www.ub.edu/riskcenter

    Local Regression : Nearest Neighbor

    1 > W

  • http://www.ub.edu/riskcenter

    Local Regression : Kernel Based Smoothing

    > library(KernSmooth)
    > W <- ...   # kernel weights; the call was truncated in the transcript
    > library(sp)

    @freakonometrics 99

  • http://www.ub.edu/riskcenter

    Local Regression : Kernel Based Smoothing

    > library(np)
    > fit <- npreg(Y ~ X, ...)   # kernel regression; the exact call was truncated in the transcript
    > predict(fit, newdata = data.frame(X=x))

    @freakonometrics 100

  • http://www.ub.edu/riskcenter

    From Linear to Generalized Linear Models

    The (Gaussian) Linear Model and the logistic regression have been extended to the wide class of the exponential family,

    $f(y|\theta,\varphi) = \exp\left(\frac{y\theta - b(\theta)}{a(\varphi)} + c(y,\varphi)\right),$

    where $a(\cdot)$, $b(\cdot)$ and $c(\cdot)$ are functions, $\theta$ is the natural (canonical) parameter and $\varphi$ is a nuisance parameter.

    The Gaussian distribution $\mathcal{N}(\mu,\sigma^2)$ belongs to this family, with

    $\theta = \mu = \mathbb{E}(Y)$, $\varphi = \sigma^2 = \text{Var}(Y)$, $a(\varphi)=\varphi$, $b(\theta)=\theta^2/2$

    @freakonometrics 101

  • http://www.ub.edu/riskcenter

    From Linear to Generalized Linear Models

    The Bernoulli distribution $\mathcal{B}(p)$ belongs to this family, with

    $\theta = \log\frac{p}{1-p} = g_\star(\mathbb{E}(Y))$, $a(\varphi)=1$, $b(\theta)=\log(1+\exp(\theta))$, and $\varphi = 1$,

    where $g_\star(\cdot)$ is some link function (here the logistic transformation): the canonical link.

    Canonical links are

    1 binomial (link = "logit")

    2 gaussian (link = " identity ")

    3 Gamma (link = " inverse ")

    4 inverse . gaussian (link = "1/mu ^2")

    5 poisson (link = "log")

    6 quasi (link = " identity ", variance = " constant ")

    7 quasibinomial (link = " logit ")

    8 quasipoisson (link = "log")

    @freakonometrics 102

  • http://www.ub.edu/riskcenter

    From Linear to Generalized Linear Models

    Observe that

    $\mu = \mathbb{E}(Y) = b'(\theta)$ and $\text{Var}(Y) = b''(\theta) = b''([b']^{-1}(\mu))$, the variance function $V(\mu)$.

    → distributions are characterized by this variance function, e.g. $V(\mu) = 1$ for the Gaussian family (homoscedastic models), $V(\mu) = \mu$ for the Poisson and $V(\mu) = \mu^2$ for the Gamma distribution, $V(\mu) = \mu^3$ for the inverse-Gaussian family.

    Note that $g_\star(\cdot) = [b']^{-1}(\cdot)$ is the canonical link.

    Tweedie (1984) suggested a power-type variance function $V(\mu) = \mu^\gamma$. When $\gamma \in [1, 2]$, then Y has a compound Poisson distribution with Gamma jumps.

    > library(tweedie)

    @freakonometrics 103

  • http://www.ub.edu/riskcenter

    From the Exponential Family to GLMs

    So far, there is no regression model. Assume that

    $f(y_i|\theta_i,\varphi) = \exp\left(\frac{y_i\theta_i - b(\theta_i)}{a(\varphi)} + c(y_i,\varphi)\right)$ where $\theta_i = g_\star^{-1}(g(\boldsymbol{x}_i^T\boldsymbol{\beta}))$

    so that the likelihood is

    $\mathcal{L}(\boldsymbol{\theta},\varphi|\boldsymbol{y}) = \prod_{i=1}^n f(y_i|\theta_i,\varphi) = \exp\left(\frac{\sum_{i=1}^n y_i\theta_i - \sum_{i=1}^n b(\theta_i)}{a(\varphi)} + \sum_{i=1}^n c(y_i,\varphi)\right).$

    To derive the first order conditions, observe that we can write

    $\frac{\partial\log\mathcal{L}(\boldsymbol{\theta},\varphi|y_i)}{\partial\beta_j} = \omega_{i,j}\,x_{i,j}\,[y_i - \mu_i]$

    for some $\omega_{i,j}$ (see e.g. Müller (2004)), which are simple when $g_\star = g$.

    @freakonometrics 104

    http://www.marlenemueller.de/publications/HandbookCS.pdf

  • http://www.ub.edu/riskcenter

    From the Exponential Family to GLMs

    The first order conditions can be written

    $\boldsymbol{X}^T\boldsymbol{W}^{-1}[\boldsymbol{y}-\boldsymbol{\mu}] = \boldsymbol{0}$

    which are the first order conditions of a weighted linear regression model.

    As for the logistic regression, $\boldsymbol{W}$ depends on the unknown $\boldsymbol{\beta}$'s: use an iterative algorithm.

    1. Set $\widehat{\boldsymbol{\mu}}_0 = \boldsymbol{y}$, $\widehat{\boldsymbol{\eta}}_0 = g(\widehat{\boldsymbol{\mu}}_0)$ and

    $\boldsymbol{z}_0 = \widehat{\boldsymbol{\eta}}_0 + (\boldsymbol{y}-\widehat{\boldsymbol{\mu}}_0)\,g'(\widehat{\boldsymbol{\mu}}_0).$

    Define $\boldsymbol{W}_0 = \text{diag}[g'(\widehat{\boldsymbol{\mu}}_0)^2\,\text{Var}(\widehat{\boldsymbol{y}})]$ and fit a (weighted) linear regression of $\boldsymbol{z}_0$ on $\boldsymbol{X}$, i.e.

    $\widehat{\boldsymbol{\beta}}_1 = [\boldsymbol{X}^T\boldsymbol{W}_0^{-1}\boldsymbol{X}]^{-1}\boldsymbol{X}^T\boldsymbol{W}_0^{-1}\boldsymbol{z}_0$

    2. Set $\widehat{\boldsymbol{\eta}}_k = \boldsymbol{X}\widehat{\boldsymbol{\beta}}_k$, $\widehat{\boldsymbol{\mu}}_k = g^{-1}(\widehat{\boldsymbol{\eta}}_k)$ and

    $\boldsymbol{z}_k = \widehat{\boldsymbol{\eta}}_k + (\boldsymbol{y}-\widehat{\boldsymbol{\mu}}_k)\,g'(\widehat{\boldsymbol{\mu}}_k).$

    @freakonometrics 105

  • http://www.ub.edu/riskcenter

    From the Exponential Family to GLMs

    Define $\boldsymbol{W}_k = \text{diag}[g'(\widehat{\boldsymbol{\mu}}_k)^2\,\text{Var}(\widehat{\boldsymbol{y}})]$ and fit a (weighted) linear regression of $\boldsymbol{z}_k$ on $\boldsymbol{X}$, i.e.

    $\widehat{\boldsymbol{\beta}}_{k+1} = [\boldsymbol{X}^T\boldsymbol{W}_k^{-1}\boldsymbol{X}]^{-1}\boldsymbol{X}^T\boldsymbol{W}_k^{-1}\boldsymbol{z}_k$

    and loop... until the changes in $\widehat{\boldsymbol{\beta}}_{k+1}$ are (sufficiently) small. Then set $\widehat{\boldsymbol{\beta}} = \widehat{\boldsymbol{\beta}}_\infty$.

    Under some technical conditions, we can prove that $\widehat{\boldsymbol{\beta}} \xrightarrow{\mathbb{P}} \boldsymbol{\beta}$ and

    $\sqrt{n}(\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}) \xrightarrow{\mathcal{L}} \mathcal{N}(\boldsymbol{0}, I(\boldsymbol{\beta})^{-1}),$

    where numerically $I(\boldsymbol{\beta}) = [\boldsymbol{X}^T\boldsymbol{W}^{-1}\boldsymbol{X}]$.
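    A compact sketch (not on the slide) of this iteration for a Poisson regression with log link, on simulated data; the variable names are illustrative.

    > set.seed(1)
    > x = runif(500); X = cbind(1, x)
    > y = rpois(500, lambda = exp(1 + 2*x))
    > mu = y + .1; eta = log(mu)                     # step 1: initial values
    > for(k in 1:10){
    +   z = eta + (y - mu)/mu                        # working response, g'(mu) = 1/mu
    +   W = diag(as.vector(1/mu))                    # W = diag[g'(mu)^2 Var(y)] = diag(1/mu)
    +   beta = solve(t(X)%*%solve(W)%*%X, t(X)%*%solve(W)%*%z)
    +   eta = X%*%beta; mu = exp(eta)                # update
    + }
    > cbind(IRLS=beta, glm=coef(glm(y ~ x, family=poisson)))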

    @freakonometrics 106

  • http://www.ub.edu/riskcenter

    From the Exponential Family to GLMs

    We estimate $\varphi$ (see linear regression estimation) by

    $\widehat{\varphi} = \frac{1}{n - \dim(\boldsymbol{X})}\sum_{i=1}^n \omega_{i,i}\,\frac{[y_i - \widehat{\mu}_i]^2}{\text{Var}(\widehat{\mu}_i)}$

    This asymptotic expression can be used to derive confidence intervals, or tests. But it might be a poor approximation when n is small. See the use of bootstrap in claims reserving.

    Those are theoretical results: in practice, the algorithm may fail to converge.

    @freakonometrics 107

  • http://www.ub.edu/riskcenter

    GLMs outside the Exponential Family?

    Actually, it is possible to consider more general distributions, see Yee (2014)

    1 > library (VGAM)

    2 > vglm(y ~ x, family = Makeham )

    3 > vglm(y ~ x, family = Gompertz )

    4 > vglm(y ~ x, family = Erlang )

    5 > vglm(y ~ x, family = Frechet )

    6 > vglm(y ~ x, family = pareto1 ( location =100) )

    Those functions can also be used for a multivariate response y

    @freakonometrics 108

    https://www.stat.auckland.ac.nz/~yee/VGAM/doc/VGAMrefcard.pdf

  • http://www.ub.edu/riskcenter

    GLM: Link and Distribution

    @freakonometrics 109

  • http://www.ub.edu/riskcenter

    GLM: Distribution?

    From a computational point of view, the Poisson regression is not (really) related to the Poisson distribution.

    Here we solve the first order conditions (or normal equations)

    $\sum_i [Y_i - \exp(\boldsymbol{X}_i^T\boldsymbol{\beta})]X_{i,j} = 0\quad\forall j$

    with unconstrained $\boldsymbol{\beta}$, using Fisher's scoring technique

    $\boldsymbol{\beta}_{k+1} = \boldsymbol{\beta}_k - \boldsymbol{H}_k^{-1}\nabla_k$

    where $\boldsymbol{H}_k = -\sum_i \exp(\boldsymbol{X}_i^T\boldsymbol{\beta}_k)\boldsymbol{X}_i\boldsymbol{X}_i^T$ and $\nabla_k = \sum_i \boldsymbol{X}_i^T[Y_i - \exp(\boldsymbol{X}_i^T\boldsymbol{\beta}_k)]$

    There is no assumption here that $Y \in \mathbb{N}$: it is possible to run a Poisson regression on non-integers.

    @freakonometrics 110

  • http://www.ub.edu/riskcenter

    The Exposure and (Annual) Claim Frequency

    In General Insurance, we should predict the yearly claims frequency. Let $N_i$ denote the number of claims over one year for contract i.

    We observe the contract only over a period of time $E_i$ (the exposure). Let $Y_i$ denote the observed number of claims over the period $[0, E_i]$.

    @freakonometrics 111

  • http://www.ub.edu/riskcenter

    The Exposure and (Annual) Claim Frequency

    Assuming that claims occurrence is driven by a Poisson process of intensity $\lambda$, if $N_i \sim \mathcal{P}(\lambda)$, then $Y_i \sim \mathcal{P}(\lambda\cdot E_i)$, where $\lambda$ is the annual frequency.

    $\mathcal{L}(\lambda,\boldsymbol{Y},\boldsymbol{E}) = \prod_{i=1}^n \frac{e^{-\lambda E_i}[\lambda E_i]^{Y_i}}{Y_i!}$

    The first order condition is

    $\frac{\partial}{\partial\lambda}\log\mathcal{L}(\lambda,\boldsymbol{Y},\boldsymbol{E}) = -\sum_{i=1}^n E_i + \frac{1}{\lambda}\sum_{i=1}^n Y_i = 0$

    for

    $\widehat{\lambda} = \frac{\sum_{i=1}^n Y_i}{\sum_{i=1}^n E_i} = \sum_{i=1}^n \omega_i\,\frac{Y_i}{E_i}$ where $\omega_i = \frac{E_i}{\sum_{i=1}^n E_i}$

    @freakonometrics 112

  • http://www.ub.edu/riskcenter

    The Exposure and (Annual) Claim Frequency

    Assume that $Y_i \sim \mathcal{P}(\lambda_i\cdot E_i)$ where $\lambda_i = \exp[\boldsymbol{X}_i^T\boldsymbol{\beta}]$.

    Here $\mathbb{E}(Y_i|\boldsymbol{X}_i) = \text{Var}(Y_i|\boldsymbol{X}_i) = \mu_i = \exp[\boldsymbol{X}_i^T\boldsymbol{\beta} + \log E_i]$.

    $\log\mathcal{L}(\boldsymbol{\beta};\boldsymbol{Y}) = \sum_{i=1}^n Y_i\,[\boldsymbol{X}_i^T\boldsymbol{\beta} + \log E_i] - \exp[\boldsymbol{X}_i^T\boldsymbol{\beta} + \log E_i] - \log(Y_i!)$

    > model <- glm(Y ~ X, offset=log(E), family=poisson)   # reconstructed; the exact call was truncated in the transcript
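    A tiny simulated check (not on the slide) that the offset formulation recovers the annual frequency:

    > set.seed(1)
    > E = runif(1e4)                                  # exposures in (0,1)
    > Y = rpois(1e4, lambda = 0.1*E)                  # true annual frequency 10%
    > sum(Y)/sum(E)                                   # λ-hat = sum(Y) / sum(E)
    > exp(coef(glm(Y ~ 1, offset=log(E), family=poisson)))   # same, via the offset GLM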

  • http://www.ub.edu/riskcenter

    Boosting

    Boosting is a machine learning ensemble meta-algorithm for primarily reducing bias, and also variance, in supervised learning, and a family of machine learning algorithms which convert weak learners to strong ones. (source: Wikipedia)

    The heuristic is simple: we consider an iterative process where we keep modeling the errors.

    Fit a model for y, $m_1(\cdot)$, from y and X, and compute the error, $\varepsilon_1 = y - m_1(\boldsymbol{X})$.

    Fit a model for $\varepsilon_1$, $m_2(\cdot)$, from $\varepsilon_1$ and X, and compute the error, $\varepsilon_2 = \varepsilon_1 - m_2(\boldsymbol{X})$, etc. Then set

    $m(\cdot) = \underbrace{m_1(\cdot)}_{\sim y} + \underbrace{m_2(\cdot)}_{\sim\varepsilon_1} + \underbrace{m_3(\cdot)}_{\sim\varepsilon_2} + \cdots + \underbrace{m_k(\cdot)}_{\sim\varepsilon_{k-1}}$
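    A bare-bones sketch (not on the slide) of that residual-fitting loop, using shallow rpart trees as weak learners and a shrinkage factor; all names and values are illustrative.

    > library(rpart)
    > set.seed(1)
    > x = runif(300); y = sin(2*pi*x) + rnorm(300, sd=.3)
    > db = data.frame(x=x, eps=y)
    > pred = rep(0, 300); nu = .1                       # shrinkage
    > for(j in 1:100){
    +   wl = rpart(eps ~ x, data=db, maxdepth=2)        # weak learner fitted on current residuals
    +   pred = pred + nu*predict(wl, newdata=db)
    +   db$eps = y - pred                               # update the residuals
    + }
    > plot(x, y); points(x, pred, pch=19, cex=.4)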

    @freakonometrics 114

    https://en.wikipedia.org/wiki/Boosting_(machine_learning)

  • http://www.ub.edu/riskcenter

    Boosting

    With (very) general notations, we want to solve

    $m^\star = \underset{m}{\text{argmin}}\{\mathbb{E}[\ell(Y, m(\boldsymbol{X}))]\}$

    for some loss function $\ell$.

    It is an iterative procedure: assume that at some step k we have an estimator $m_k(\boldsymbol{X})$. Why not construct a new model that might improve our model,

    $m_{k+1}(\boldsymbol{X}) = m_k(\boldsymbol{X}) + h(\boldsymbol{X}).$

    What could $h(\cdot)$ be?

    @freakonometrics 115

  • http://www.ub.edu/riskcenter

    Boosting

    In a perfect world, $h(\boldsymbol{X}) = y - m_k(\boldsymbol{X})$, which can be interpreted as a residual.

    Note that this residual is the gradient of $\frac{1}{2}[y - m(\boldsymbol{x})]^2$.

    A gradient descent is based on the Taylor expansion

    $\underbrace{f(\boldsymbol{x}_k)}_{\langle f,\boldsymbol{x}_k\rangle} \sim \underbrace{f(\boldsymbol{x}_{k-1})}_{\langle f,\boldsymbol{x}_{k-1}\rangle} + (\boldsymbol{x}_k-\boldsymbol{x}_{k-1})\,\underbrace{\nabla f(\boldsymbol{x}_{k-1})}_{\langle\nabla f,\boldsymbol{x}_{k-1}\rangle}$

    But here, it is different. We claim we can write

    $\underbrace{f_k(\boldsymbol{x})}_{\langle f_k,\boldsymbol{x}\rangle} \sim \underbrace{f_{k-1}(\boldsymbol{x})}_{\langle f_{k-1},\boldsymbol{x}\rangle} + (f_k - f_{k-1})\,\underbrace{?}_{\langle f_{k-1},\nabla\boldsymbol{x}\rangle}$

    where ? is interpreted as a gradient.

    @freakonometrics 116

  • http://www.ub.edu/riskcenter

    Boosting

    Here, $f_k$ is a $\mathbb{R}^d \rightarrow \mathbb{R}$ function, so the gradient should live in such a (big) functional space → we want to approximate that function by

    $m_k(\boldsymbol{x}) = m_{k-1}(\boldsymbol{x}) + \underset{f\in\mathcal{F}}{\text{argmin}}\left\{\sum_{i=1}^n \ell(Y_i, m_{k-1}(\boldsymbol{x}) + f(\boldsymbol{x}))\right\}$

    where $f\in\mathcal{F}$ means that we seek in a class of weak learner functions.

    If learners are too strong, the first loop leads to some fixed point, and there is no learning procedure; see linear regression $y = \boldsymbol{x}^T\boldsymbol{\beta} + \varepsilon$. Since $\varepsilon \perp \boldsymbol{x}$, we cannot learn from the residuals.

    @freakonometrics 117

  • http://www.ub.edu/riskcenter

    Boosting with some Shrinkage

    Consider here some quadratic loss function.

    In order to make sure that we learn weakly, we can use some shrinkage parameter $\nu$ (or a collection of parameters $\nu_j$) so that

    $\mathbb{E}[Y|\boldsymbol{X}=\boldsymbol{x}] = m(\boldsymbol{x}) \sim m_M(\boldsymbol{x}) = \sum_{j=1}^M \nu_j h_j(\boldsymbol{x})$

    The problem is always the same. At stage j, we should solve

    $\min_{h(\cdot)}\left\{\sum_{i=1}^n [\underbrace{y_i - m_{j-1}(\boldsymbol{x}_i)}_{\varepsilon_{i,j-1}} - h(\boldsymbol{x}_i)]^2\right\}$

  • http://www.ub.edu/riskcenter

    Boosting with some Shrinkage

The algorithm is then

start with some (simple) model y = h_1(x)

compute the residuals (including the shrinkage ν), ε_1 = y − ν h_1(x)

and at step j,

consider some (simple) model ε_j = h_j(x)

compute the residuals (including ν), ε_{j+1} = ε_j − ν h_j(x)

and loop. And set finally

$$\widehat{y}=\sum_{j=1}^M \nu\,h_j(x)$$
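A minimal sketch of this shrunk residual-fitting loop, using regression stumps from rpart as weak learners (the shrinkage ν = 0.1, the number of iterations M = 100 and the simulated data are all assumptions):

> library(rpart)
> set.seed(1)
> x <- runif(200) * 10 ; y <- sin(x) + rnorm(200) / 4
> db <- data.frame(x = x, y = y)
> nu <- 0.1 ; M <- 100
> eps <- db$y ; pred <- rep(0, nrow(db))
> for(j in 1:M){
+   h <- rpart(eps ~ x, data = data.frame(x = db$x, eps = eps),
+              control = rpart.control(maxdepth = 1))     # a stump
+   pred <- pred + nu * predict(h)
+   eps  <- eps  - nu * predict(h) }
> # pred now contains the boosted fit, i.e. the sum of the shrunk stumps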

    @freakonometrics 119

  • http://www.ub.edu/riskcenter

    Boosting with Piecewise Linear Spline Functions

    @freakonometrics 120

  • http://www.ub.edu/riskcenter

    Boosting with Trees (Stump Functions)

    @freakonometrics 121

  • http://www.ub.edu/riskcenter

    Boosting for Classification

Still seek m^⋆(·) = argmin{E[ℓ(Y, m(X))]}.

Here y ∈ {−1, +1}, and use ℓ(y, m(x)) = e^{−y·m(x)}: the AdaBoost algorithm.

Note that

$$\mathbb{P}[Y=+1|X=x]=\frac{1}{1+e^{-2m^\star(x)}}$$

cf. probit transform... It can be seen as an iteration on weights. At step k solve

$$\underset{h(\cdot)}{\text{argmin}}\;\sum_{i=1}^n\underbrace{e^{-y_i m_k(x_i)}}_{\omega_{i,k}}\,e^{-y_i h(x_i)}$$

    @freakonometrics 122

  • http://www.ub.edu/riskcenter

    Boosting for Classification

> library(dismo)      # gbm.step() is provided by the dismo package
> gbm.step(data = myocarde, gbm.x = 1:7, gbm.y = 8,
+          family = "bernoulli", tree.complexity = 5,
+          learning.rate = 0.01, bag.fraction = 0.5)

[Figure: holdout deviance against the number of trees (PRONO01, tree complexity 5, learning rate 0.01)]

    @freakonometrics 123

  • http://www.ub.edu/riskcenter

    Exponential distribution, deviance, loss function, residuals, etc

Gaussian distribution ↔ ℓ_2 loss function

Deviance is $\sum_{i=1}^n (y_i-m(x_i))^2$, with gradient $\nabla_i=y_i-m(x_i)$

Laplace distribution ↔ ℓ_1 loss function

Deviance is $\sum_{i=1}^n |y_i-m(x_i)|$, with gradient $\nabla_i=\text{sign}(y_i-m(x_i))$

    @freakonometrics 124

  • http://www.ub.edu/riskcenter

    Exponential distribution, deviance, loss function, residuals, etc

Bernoulli {−1, +1} distribution ↔ AdaBoost loss function

Deviance is $\sum_{i=1}^n e^{-y_i m(x_i)}$, with gradient $\nabla_i=-y_i\,e^{-y_i m(x_i)}$

Bernoulli {0, 1} distribution

Deviance is $2\sum_{i=1}^n\left[y_i\log\left(\frac{y_i}{m(x_i)}\right)+(1-y_i)\log\left(\frac{1-y_i}{1-m(x_i)}\right)\right]$, with gradient

$$\nabla_i=y_i-\frac{\exp[m(x_i)]}{1+\exp[m(x_i)]}$$

Poisson distribution

Deviance is $2\sum_{i=1}^n\left(y_i\log\left(\frac{y_i}{m(x_i)}\right)-[y_i-m(x_i)]\right)$, with gradient $\nabla_i=\frac{y_i-m(x_i)}{m(x_i)}$

    @freakonometrics 125

  • http://www.ub.edu/riskcenter

    Regularized GLM

In regularized GLMs, we introduce a penalty in the loss function (the deviance); see e.g. the ℓ_1 regularized logistic regression

$$\max_{\beta}\left\{\sum_{i=1}^n\left(y_i[\beta_0+x_i^T\beta]-\log[1+e^{\beta_0+x_i^T\beta}]\right)-\lambda\sum_{j=1}^k|\beta_j|\right\}$$

> library(glmnet)
> y <- myocarde$PRONO
> x <- as.matrix(myocarde[, 1:7])
> glm_ridge <- glmnet(x, y, alpha = 0, family = "binomial")   # the slide's call is truncated; arguments assumed
> plot(glm_ridge)

[Figure: coefficient paths against the L1 norm, for FRCAR, INCAR, INSYS, PRDIA, PAPUL, PVENT, REPUL]

    @freakonometrics 126

  • http://www.ub.edu/riskcenter

    Collective vs. Individual Model

Consider a Tweedie distribution, with variance function power p ∈ (1, 2), mean μ and scale (dispersion) parameter φ; then it is a compound Poisson model,

$$N\sim\mathcal{P}(\lambda)\ \text{ with }\ \lambda=\frac{\mu^{2-p}}{\varphi(2-p)}$$

$$Y_i\sim\mathcal{G}(\alpha,\beta)\ \text{ with shape }\ \alpha=\frac{2-p}{p-1}\ \text{ and rate }\ \beta=\frac{\mu^{1-p}}{\varphi(p-1)}$$

Conversely, consider a compound Poisson model N ∼ P(λ) and Y_i ∼ G(α, β):

the variance function power is $p=\frac{\alpha+2}{\alpha+1}$

the mean is $\mu=\frac{\lambda\alpha}{\beta}$

the scale parameter is $\varphi=\frac{\lambda^{1-p}\,[\alpha/\beta]^{2-p}}{2-p}$

It seems to be equivalent... but it's not.

    @freakonometrics 127

  • http://www.ub.edu/riskcenter

    Collective vs. Individual Model

In the context of regression,

$$N_i\sim\mathcal{P}(\lambda_i)\ \text{ with }\ \lambda_i=\exp[X_i^T\beta_A]$$

$$Y_{j,i}\sim\mathcal{G}(\mu_i,\varphi)\ \text{ with }\ \mu_i=\exp[X_i^T\beta_B]$$

Then S_i = Y_{1,i} + ⋯ + Y_{N_i,i} has a Tweedie distribution

the variance function power is $p=\frac{\varphi+2}{\varphi+1}$

the mean is $\lambda_i\,\mu_i$

the scale parameter is $\lambda_i^{-\frac{1}{\varphi+1}}\,\mu_i^{\frac{\varphi}{\varphi+1}}\,\frac{1+\varphi}{\varphi}$, which depends on i

There are 1 + 2 dim(X) degrees of freedom (two regressions, plus φ).

    @freakonometrics 128

  • http://www.ub.edu/riskcenter

    Collective vs. Individual Model

Note that the scale parameter should not depend on i. A Tweedie regression is

variance function power p ∈ (1, 2)

mean μ_i = exp[X_i^T β_Tweedie]

scale parameter φ

There are 2 + dim(X) degrees of freedom.

Note that one can easily boost a Tweedie model,
> library(TDboost)
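A Tweedie regression can also be fitted with a standard glm() using the tweedie family from the statmod package; a sketch, with hypothetical aggregate losses S, exposure E and covariates X1, X2 in a data frame db (the variance power 1.5 is only an example; it can be profiled with tweedie::tweedie.profile):

> library(statmod)
> fit <- glm(S ~ X1 + X2 + offset(log(E)),
+            family = tweedie(var.power = 1.5, link.power = 0),   # log link
+            data = db)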

    @freakonometrics 129

  • http://www.ub.edu/riskcenter

Part 3. Model Choice, Feature Selection, etc.

    @freakonometrics 130

  • http://www.ub.edu/riskcenter

    AIC, BIC

AIC and BIC are both driven by the maximum likelihood estimate and penalize useless parameters (to avoid overfitting)

$$AIC=-2\log[\text{likelihood}]+2k\quad\text{and}\quad BIC=-2\log[\text{likelihood}]+\log(n)\,k$$

AIC focuses on overfit, while BIC depends on n, so it might also avoid underfit (in small samples)

BIC penalizes complexity more than AIC does (as soon as log n > 2, i.e. n ≥ 8).

Minimizing AIC ≈ minimizing the leave-one-out cross-validation value, Stone (1977).

Minimizing BIC ≈ leave-k-out cross-validation, Shao (1997), with k = n[1 − (log n − 1)^{-1}]

    used in econometric stepwise procedures
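In R, both criteria are directly available, and step() switches from an AIC-based to a BIC-based search through its penalty argument k; a short sketch on the logistic model for the myocarde data:

> logistic <- glm(PRONO ~ ., data = myocarde, family = binomial)
> AIC(logistic)
> BIC(logistic)                              # equivalently AIC(logistic, k = log(nrow(myocarde)))
> step(logistic, k = 2)                      # AIC-based stepwise selection
> step(logistic, k = log(nrow(myocarde)))    # BIC-based stepwise selection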

    @freakonometrics 131

http://www.jstor.org/stable/2984877
http://www3.stat.sinica.edu.tw/statistica/oldpdf/A7n21.pdf

  • http://www.ub.edu/riskcenter

    Cross-Validation

Formally, the leave-one-out cross validation is based on

$$CV=\frac{1}{n}\sum_{i=1}^n \ell\big(y_i,\widehat{m}_{(i)}(x_i)\big)$$

where $\widehat{m}_{(i)}$ is obtained by fitting the model on the sample where observation i has been dropped, e.g.

$$CV=\frac{1}{n}\sum_{i=1}^n \big[y_i-\widehat{m}_{(i)}(x_i)\big]^2$$

The generalized cross-validation, for a quadratic loss function, is defined as

$$GCV=\frac{1}{n}\sum_{i=1}^n\left[\frac{y_i-\widehat{m}(x_i)}{1-\text{trace}(S)/n}\right]^2$$
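A brute-force leave-one-out loop takes only a few lines of R; a sketch for the quadratic loss, with a generic fitting call (here a plain linear model, on a hypothetical data frame db with response y):

> n <- nrow(db)
> errors <- rep(NA, n)
> for(i in 1:n){
+   fit_i <- lm(y ~ ., data = db[-i, ])                 # refit without observation i
+   errors[i] <- db$y[i] - predict(fit_i, newdata = db[i, ]) }
> CV <- mean(errors^2)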

    @freakonometrics 132

  • http://www.ub.edu/riskcenter

    Cross-Validation for kernel based local regression

Econometric approach: define $\widehat{m}(x)=\widehat{\beta}_0^{[x]}+\widehat{\beta}_1^{[x]}x$ with

$$(\widehat{\beta}_0^{[x]},\widehat{\beta}_1^{[x]})=\underset{(\beta_0,\beta_1)}{\text{argmin}}\left\{\sum_{i=1}^n \omega_{h^\star}^{[x]}(x_i)\,[y_i-(\beta_0+\beta_1 x_i)]^2\right\}$$

where h⋆ is given by some rule of thumb (see previous discussion).

[Figure: local regression fit on the simulated data, x from 0 to 10, y from -2 to 2]

    @freakonometrics 133

  • http://www.ub.edu/riskcenter

    Cross-Validation for kernel based local regression

    Bootstrap based approach

Use bootstrap samples, compute h⋆_b, and get the m̂_b(x)'s.

[Figures: bootstrapped local regression fits, and the distribution of the optimal bandwidths h⋆_b (roughly between 0.85 and 1.20)]

    @freakonometrics 134

  • http://www.ub.edu/riskcenter

    Cross-Validation for kernel based local regression

    Statistical learning approach (Cross Validation (leave-one-out))

    Given j {1, , n}, given h, solve

    ([(i),h]0 , [(i),h]1 ) = argmin

    (0,1)

    j 6=i

    (i)h [Yj (0 + 1xj)]

    2

    and compute m[h](i)(xi) =

    [(i),h]0 +

    [(i),h]1 xi. Define

    mse(h) =ni=1

    [yi m[h](i)(xi)]2

    and set h? = argmin{mse(h)}.

    Then compute m(x) = [x]0 + [x]1 x with

    ([x]0 , [x]1 ) = argmin

    (0,1)

    {ni=1

    [x]h? [yi (0 + 1xi)]

    2

    }
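A sketch of that bandwidth search, for a locally weighted linear fit with a Gaussian kernel; db is a hypothetical data frame with columns x and y, and the candidate grid hs is arbitrary:

> mse_h <- function(h){
+   e <- rep(NA, nrow(db))
+   for(i in 1:nrow(db)){
+     w <- dnorm(db$x[-i], mean = db$x[i], sd = h)       # kernel weights around x_i
+     fit <- lm(y ~ x, data = db[-i, ], weights = w)     # weighted local fit without obs. i
+     e[i] <- db$y[i] - predict(fit, newdata = db[i, ]) }
+   mean(e^2) }
> hs <- seq(.2, 3, by = .1)
> h_star <- hs[which.min(sapply(hs, mse_h))]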

    @freakonometrics 135

  • http://www.ub.edu/riskcenter

    Cross-Validation for kernel based local regression

    @freakonometrics 136

  • http://www.ub.edu/riskcenter

    Cross-Validation for kernel based local regression

    Statistical learning approach (Cross Validation (k-fold))

Given a fold I ⊂ {1, …, n} and a bandwidth h, solve

$$(\widehat{\beta}_0^{[(I),h]},\widehat{\beta}_1^{[(I),h]})=\underset{(\beta_0,\beta_1)}{\text{argmin}}\left\{\sum_{j\notin I}\omega_h^{(I)}(x_j)\,[y_j-(\beta_0+\beta_1 x_j)]^2\right\}$$

and compute $\widehat{m}_{(I)}^{[h]}(x_i)=\widehat{\beta}_0^{[(I),h]}+\widehat{\beta}_1^{[(I),h]}x_i$, for i ∈ I. Define

$$\text{mse}(h)=\sum_{I}\sum_{i\in I}\big[y_i-\widehat{m}_{(I)}^{[h]}(x_i)\big]^2$$

and set h⋆ = argmin{mse(h)}.

Then compute $\widehat{m}(x)=\widehat{\beta}_0^{[x]}+\widehat{\beta}_1^{[x]}x$ with

$$(\widehat{\beta}_0^{[x]},\widehat{\beta}_1^{[x]})=\underset{(\beta_0,\beta_1)}{\text{argmin}}\left\{\sum_{i=1}^n \omega_{h^\star}^{[x]}(x_i)\,[y_i-(\beta_0+\beta_1 x_i)]^2\right\}$$

    @freakonometrics 137

  • http://www.ub.edu/riskcenter

    Cross-Validation for kernel based local regression

    @freakonometrics 138

  • http://www.ub.edu/riskcenter

    Cross-Validation for Ridge & Lasso

> library(glmnet)
> y <- myocarde$PRONO
> x <- as.matrix(myocarde[, 1:7])
> cvfit <- cv.glmnet(x, y, alpha = 0, family = "binomial",
+                    type.measure = "deviance")      # ridge; the slide's calls are partly truncated, arguments assumed
> cvfit$lambda.min
[1] 0.0408752
> plot(cvfit)
> cvfit <- cv.glmnet(x, y, alpha = 1, family = "binomial",
+                    type.measure = "deviance")      # lasso
> cvfit$lambda.min
[1] 0.03315514
> plot(cvfit)

[Figures: cross-validated binomial deviance against log(λ); left, ridge (all 7 covariates kept along the path); right, lasso (the number of non-zero coefficients drops from 7 to 1 as λ grows)]

    @freakonometrics 139

  • http://www.ub.edu/riskcenter

    Variable Importance for Trees

Given some random forest with M trees, set

$$I(X_k)=\frac{1}{M}\sum_{m}\sum_{t}\frac{N_t}{N}\,\Delta i(t)$$

where the first sum is over all trees, and the second one is over all nodes t of tree m where the split is done based on variable X_k (N_t being the number of observations in node t, and Δi(t) the impurity decrease).

> library(randomForest)
> RF <- randomForest(PRONO ~ ., data = myocarde)
> varImpPlot(RF, main = "")
> importance(RF)
      MeanDecreaseGini
FRCAR         1.107222
INCAR         8.194572
INSYS         9.311138
PRDIA         2.614261
PAPUL         2.341335
PVENT         3.313113
REPUL         7.078838

[Figure: variable importance plot (MeanDecreaseGini), from most to least important: INSYS, INCAR, REPUL, PVENT, PRDIA, PAPUL, FRCAR]

    @freakonometrics 140

  • http://www.ub.edu/riskcenter

    Partial Response Plots

One can also compute Partial Response Plots,

$$x\mapsto\frac{1}{n}\sum_{i=1}^n \widehat{\mathbb{E}}\big[Y\mid X_k=x,\ \boldsymbol{X}_{i,(-k)}=\boldsymbol{x}_{i,(-k)}\big]$$

> importanceOrder <- order(-RF$importance)          # the slide's code is partly truncated here
> names <- rownames(RF$importance)[importanceOrder]
> for (name in names)
+   partialPlot(RF, myocarde, eval(name), col = "red", main = "", xlab = name)

    @freakonometrics 141

  • http://www.ub.edu/riskcenter

    Feature Selection

Use Mallows' Cp, from Mallows (1973), on all subsets of predictors, in a regression

$$C_p=\frac{1}{S^2}\sum_{i=1}^n[Y_i-\widehat{Y}_i]^2-n+2p,$$

> library(leaps)
> y <- as.numeric(myocarde$PRONO)        # the slide's construction of y and x is truncated; a plausible version
> x <- myocarde[, 1:7]
> selec <- leaps(x, y, method = "Cp")
> plot(selec$size - 1, selec$Cp)

    @freakonometrics 142

    http://www.stat.washington.edu/courses/stat527/s13/readings/technometrics1973.pdf

  • http://www.ub.edu/riskcenter

    Feature Selection

Use the random forest algorithm, removing some features at each iteration (the less relevant ones).

The algorithm uses shadow attributes (obtained from the existing features by shuffling their values).

> library(Boruta)
> B <- Boruta(PRONO ~ ., data = myocarde)     # the slide's call is truncated; default settings assumed
> plot(B)

[Figure: Boruta importance boxplots, variables ordered PVENT, PAPUL, PRDIA, REPUL, INCAR, INSYS]

@freakonometrics 143

  • http://www.ub.edu/riskcenter

    Feature Selection

    Use random forests, and variable importance plots

> library(varSelRF)                        # varSelRFBoot() is provided by the varSelRF package
> X <- as.matrix(myocarde[, 1:7])
> Y <- as.factor(myocarde$PRONO)
> library(randomForest)
> rf <- randomForest(X, Y, ntree = 200, importance = TRUE)   # the slide's calls are truncated; arguments assumed
> VB <- varSelRFBoot(X, Y)                 # arguments truncated on the slide; defaults assumed
> plot(VB)

[Figure: OOB error rate vs. number of variables in predictor]

    @freakonometrics 144

  • http://www.ub.edu/riskcenter

ROC (and beyond)

With predictions Ŷ in rows and the truth Y in columns (N_{ij} = #{Ŷ = i, Y = j}, with N_{i·} and N_{·j} the row and column totals, and the prevalence given by the column margins):

Ŷ = 0, Y = 0: true negatives N_{00};   Ŷ = 0, Y = 1: false negatives (type II) N_{01}
Ŷ = 1, Y = 0: false positives (type I) N_{10};   Ŷ = 1, Y = 1: true positives N_{11}

Row-wise (predicted class) rates:
negative predictive value NPV = N_{00}/N_{0·},   false omission rate FOR = N_{01}/N_{0·}
false discovery rate FDR = N_{10}/N_{1·},   positive predictive value (precision) PPV = N_{11}/N_{1·}

Column-wise (true class) rates:
true negative rate (specificity) TNR = N_{00}/N_{·0},   false positive rate (fall out) FPR = N_{10}/N_{·0}
false negative rate FNR = N_{01}/N_{·1},   true positive rate (sensitivity) TPR = N_{11}/N_{·1}

Likelihood ratios: LR+ = TPR/FPR, LR− = FNR/TNR, and the diagnostic odds ratio = LR+/LR−.
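The roc.curve() function used on the next slide is not a base R function (it is defined earlier in the deck); a minimal sketch of what such a helper might look like, computing FPR and TPR from a score S, labels Y and a threshold s:

> roc.curve <- function(Y, S, s){
+   Yhat <- (S > s)                            # classify as positive when the score exceeds s
+   FP <- sum((Yhat == 1) & (Y == 0)); TP <- sum((Yhat == 1) & (Y == 1))
+   FN <- sum((Yhat == 0) & (Y == 1)); TN <- sum((Yhat == 0) & (Y == 0))
+   c(FPR = FP / (FP + TN), TPR = TP / (TP + FN)) }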

    @freakonometrics 145

  • http://www.ub.edu/riskcenter

Comparing Classifiers: ROC Curves

> library(randomForest)
> fit <- randomForest(PRONO ~ ., data = train_myocarde)
> train_Y <- (train_myocarde$PRONO == "Survival")
> test_Y  <- (test_myocarde$PRONO == "Survival")
> train_S <- predict(fit, type = "prob", newdata = train_myocarde)[, 2]
> test_S  <- predict(fit, type = "prob", newdata = test_myocarde)[, 2]
> vp <- seq(0, 1, length = 101)
> roc_train <- t(Vectorize(function(u) roc.curve(train_Y, train_S, s = u))(vp))
> roc_test  <- t(Vectorize(function(u) roc.curve(test_Y, test_S, s = u))(vp))
> plot(roc_train, type = "b", col = "blue", xlim = 0:1, ylim = 0:1)

    @freakonometrics 146

  • http://www.ub.edu/riskcenter

    Comparing Classifiers: ROC Curves

The Area Under the Curve, AUC, can be interpreted as the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one, see Swets, Dawes & Monahan (2000).

Many other quantities can be computed, see

> library(hmeasure)
> HMeasure(Y, S)$metrics[, 1:5]
Class labels have been switched from (Death, Survival) to (0, 1)
               H      Gini       AUC      AUCH        KS
scores 0.7323154 0.8834154 0.9417077 0.9568966 0.8144499

with the H-measure (see hmeasure), Gini and AUC, as well as the area under the convex hull (AUCH).

    @freakonometrics 147

http://ist-socrates.berkeley.edu/~maccoun/LP_SwetsDawesMonahan2000.pdf
http://www.hmeasure.net/

  • http://www.ub.edu/riskcenter

Comparing Classifiers: ROC Curves

Consider our previous logistic regression (on heart attacks),

> logistic <- glm(PRONO ~ ., data = myocarde, family = binomial)   # partly truncated on the slide
> Y <- myocarde$PRONO
> S <- predict(logistic, type = "response")
> library(ROCR)
> pred <- prediction(S, Y)
> perf <- performance(pred, "tpr", "fpr")
> plot(perf)

[Figure: ROC curve, true positive rate against false positive rate]

    @freakonometrics 148

  • http://www.ub.edu/riskcenter

    Comparing Classifiers: ROC Curves

One can get confidence bands (obtained using bootstrap procedures),

> library(pROC)
> roc <- plot.roc(Y, S, main = "", percent = TRUE, ci = TRUE)    # the slide's calls are partly truncated; arguments assumed
> roc.se <- ci.se(roc, specificities = seq(0, 100, 5))
> plot(roc.se, type = "shape", col = "light blue")

[Figure: ROC curve with bootstrap confidence band, sensitivity (%) against specificity (%)]

see also the gains package, for Gains and Lift curves,
> library(gains)

    @freakonometrics 149

  • http://www.ub.edu/riskcenter

    Comparing Classifiers: Accuracy and Kappa

The Kappa statistic compares an Observed Accuracy with an Expected Accuracy (random chance), see Landis & Koch (1977).

Generic confusion table (predictions Ŷ in rows, truth Y in columns):

             Y = 0    Y = 1
  Ŷ = 0       TN       FN      TN+FN
  Ŷ = 1       FP       TP      FP+TP
            TN+FP    FN+TP       n

Observed and random (chance) confusion tables:

             Y = 0    Y = 1                     Y = 0    Y = 1
  Ŷ = 0        25        3    28      Ŷ = 0    11.44    16.56    28
  Ŷ = 1         4       39    43      Ŷ = 1    17.56    25.44    43
               29       42    71                  29       42    71

$$\text{total accuracy}=\frac{TP+TN}{n}\sim 90.14\%$$

$$\text{random accuracy}=\frac{[TN+FP]\cdot[TN+FN]+[FN+TP]\cdot[FP+TP]}{n^2}\sim 51.93\%$$

$$\kappa=\frac{\text{total accuracy}-\text{random accuracy}}{1-\text{random accuracy}}\sim 79.48\%$$
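With the observed table above, those quantities are easy to reproduce; a short sketch:

> TN <- 25; FN <- 3; FP <- 4; TP <- 39; n <- TN + FN + FP + TP
> total_acc  <- (TP + TN) / n                                        # 0.9014
> random_acc <- ((TN + FP)*(TN + FN) + (FN + TP)*(FP + TP)) / n^2    # 0.5193
> kappa <- (total_acc - random_acc) / (1 - random_acc)               # 0.7948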

    @freakonometrics 150

    http://www.jstor.org/pss/2529310

  • http://www.ub.edu/riskcenter

    Comparing Models on the myocarde Dataset

    @freakonometrics 151

  • http://www.ub.edu/riskcenter

    Comparing Models on the myocarde Dataset

    If we average over all training samples

[Figure: boxplots of performance over training samples, for loess, glm, aic, knn, svm, tree, bag, nn, boost, gbm and rf]

    @freakonometrics 152

  • http://www.ub.edu/riskcenter

    Gini and Lorenz Type Curves

Consider an ordered sample {y_1, …, y_n} of incomes, with y_1 ≤ y_2 ≤ ⋯ ≤ y_n; then the Lorenz curve is the set of points {F_i, L_i} with

$$F_i=\frac{i}{n}\quad\text{and}\quad L_i=\frac{\sum_{j=1}^i y_j}{\sum_{j=1}^n y_j}$$

> # the slide's code is truncated here; a minimal reconstruction
> L <- cumsum(sort(y)) / sum(y)
> plot((1:length(y)) / length(y), L, type = "s", xlab = "", ylab = "")

  • http://www.ub.edu/riskcenter

    Gini and Lorenz Type Curves

The theoretical curve, given a distribution F, is

$$u\mapsto L(u)=\frac{\int_{-\infty}^{F^{-1}(u)}t\,dF(t)}{\int_{-\infty}^{+\infty}t\,dF(t)}$$

see Gastwirth (1972).

One can also sort the incomes from high to low, y_1 ≥ y_2 ≥ ⋯ ≥ y_n,

> # truncated on the slide; a minimal reconstruction
> L <- cumsum(sort(y, decreasing = TRUE)) / sum(y)

  • http://www.ub.edu/riskcenter

    Gini and Lorenz Type Curves

We want to compare two regression models, m_1(·) and m_2(·), in the context of insurance pricing, see Frees, Meyers & Cummins (2014). We have observed losses y_i and premiums m̂(x_i). Consider a sample ordered by the model,

$$\widehat{m}(x_1)\geq\widehat{m}(x_2)\geq\cdots\geq\widehat{m}(x_n),$$

then plot {F_i, L_i} with

$$F_i=\frac{i}{n}\quad\text{and}\quad L_i=\frac{\sum_{j=1}^i y_j}{\sum_{j=1}^n y_j}$$

> # truncated on the slide; a minimal reconstruction, with losses y and premiums p = m(x)
> L <- cumsum(y[order(p, decreasing = TRUE)]) / sum(y)

  • http://www.ub.edu/riskcenter

    Gini and Lorenz Type Curves

    See Frees et al. (2010) or Tevet (2013).

    @freakonometrics 156

https://www.casact.org/education/spring/2010/handouts/C17-Frees2.pdf
http://www.casact.org/newsletter/index.cfm?fa=viewart&id=6540

  • http://www.ub.edu/riskcenter

    Model Selection, or Aggregation?

We have k models, m̂_1(x), …, m̂_k(x), for the same y-variable; they can be trees, svm, regressions, etc.

Instead of selecting the best model, why not consider

$$m^\star(x)=\sum_{\kappa=1}^k \omega_\kappa\,\widehat{m}_\kappa(x)$$

for some weights ω_κ.

New problem: solve

$$\min_{\omega_1,\ldots,\omega_k}\left\{\sum_{i=1}^n \ell\Big(y_i,\sum_{\kappa=1}^k \omega_\kappa \widehat{m}_\kappa(x_i)\Big)\right\}$$

Note that it might be interesting to regularize, by adding a Lasso-type penalty term, based on $\lambda\sum_{\kappa=1}^k|\omega_\kappa|$.
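For a quadratic loss, the weights can be obtained with a small numerical optimisation; a sketch, assuming a (hypothetical) vector y and a matrix M whose columns contain the predictions m̂_κ(x_i) of the k models, with non-negative weights summing to one imposed for simplicity via a softmax reparametrisation:

> obj <- function(theta){                       # softmax reparametrisation of the weights
+   omega <- exp(theta) / sum(exp(theta))
+   sum((y - M %*% omega)^2) }
> opt <- optim(rep(0, ncol(M)), obj)
> omega_hat <- exp(opt$par) / sum(exp(opt$par))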

    @freakonometrics 157

  • http://www.ub.edu/riskcenter

Part 4. Visualization and Maps

    @freakonometrics 158

  • http://www.ub.edu/riskcenter

    Projection(s)

    "When people thought the earth was flat, they were wrong. Whenpeople thought the earth was spherical, they were wrong. But ifyou think that thinking the earth is spherical is just as wrong asthinking the earth is flat, then your view is wronger than both ofthem put together." Isaac Asimov, cited in kjordahl.github.io

    Need some Spatial Reference Systems

    E.g. EPSG:4326 latitude, longitude in WGS-84 coordinate system

    E.g. EPSG:900913 and EPSG:3857: Google spherical Mercator

E.g. ESRI:102718: NAD 1983 State Plane New York Long Island

    @freakonometrics 159

    http://kjordahl.github.io/SciPy2013/

  • http://www.ub.edu/riskcenter

library(rgdal)
library(maptools)
Proj <- CRS("+proj=longlat +datum=WGS84")     # truncated on the slide; a plausible projection definition

  • http://www.ub.edu/riskcenter

library(sp)
library(rworldmap)
data(countriesLow)

crs.WGS84 <- CRS("+proj=longlat +ellps=WGS84 +datum=WGS84")   # truncated on the slide; assumed definition

  • http://www.ub.edu/riskcenter

    Projection(s)

    Source: explainxkcd.com

    @freakonometrics 162

    http://www.explainxkcd.com/wiki/index.php/977:_Map_Projections

  • http://www.ub.edu/riskcenter

    Maps and Polygons

> library(rgdal)
> ita1 <- readOGR(dsn = "italy", layer = "ITA_adm1")   # the slide's call is truncated; shapefile location assumed
> plot(ita1, col = "light green")

    @freakonometrics 163

  • http://www.ub.edu/riskcenter

    Maps and Polygons

> ita1@data[, "NAME_1"][1:6]
[1] Abruzzo  Apulia  Basilicata  Calabria
[5] Campania  Emilia-Romagna
> pos <- which(ita1@data[, "NAME_1"] == "Sicily")   # truncated on the slide; reconstruction assumed
> Poly_Sicily <- ita1[pos, ]
> plot(ita1, col = "light green")
> plot(Poly_Sicily, col = "yellow", add = TRUE)

    @freakonometrics 164

  • http://www.ub.edu/riskcenter

    Maps and Polygons

> plot(ita1, col = "light green")
> plot(Poly_Sicily, col = "yellow", add = TRUE)
> abline(v = 5:20, col = "light blue")
> abline(h = 35:50, col = "light blue")
> axis(1)
> axis(2)

[Figure: map of Italy with Sicily highlighted and a longitude/latitude grid]

    @freakonometrics 165

  • http://www.ub.edu/riskcenter

    Maps and Polygons

> pos <- which(ita1@data[, "NAME_1"] %in% c(...))   # indices of the southern regions; the list is truncated on the slide
> ita_south <- ita1[pos, ]
> ita_north <- ita1[-pos, ]
> plot(ita1)
> plot(ita_south, col = "red", add = TRUE)
> plot(ita_north, col = "blue", add = TRUE)

    @freakonometrics 166

  • http://www.ub.edu/riskcenter

    Maps and Polygons

> library(xlsx)
> data_codes <- read.xlsx("...", sheetIndex = 1)    # file name truncated on the slide
> names(data_codes)[1] <- "NAME_1"
> ita2 <- merge(ita1, data_codes, by = "NAME_1")    # assumed merge step (the slide is truncated)
> pos <- ...

  • http://www.ub.edu/riskcenter

    Maps and Polygons

Here we have only used colors, but it is also possible to merge the polygons together,

> library(rgeos)
> ita_s <- gUnionCascaded(ita_south)    # assumed calls; the slide is truncated
> ita_n <- gUnionCascaded(ita_north)
> plot(ita1)
> plot(ita_s, col = "red", add = TRUE)
> plot(ita_n, col = "blue", add = TRUE)

    @freakonometrics 168

  • http://www.ub.edu/riskcenter

    Maps and Polygons

On polygons, it is also possible to visualize centroids,

> plot(ita1, col = "light green")
> plot(Poly_Sicily, col = "yellow", add = TRUE)
> gCentroid(Poly_Sicily, byid = TRUE)
SpatialPoints:
          x        y
14 14.14668 37.58842
Coordinate Reference System (CRS) arguments:
+proj=longlat +ellps=WGS84
+datum=WGS84 +no_defs +towgs84=0,0,0
> points(gCentroid(Poly_Sicily, byid = TRUE), pch = 19, col = "red")

    @freakonometrics 169

  • http://www.ub.edu/riskcenter

    Maps and Polygons

or

> G <- as.data.frame(gCentroid(ita1, byid = TRUE))   # truncated on the slide; conversion to a data frame assumed
> plot(ita1, col = "light green")
> text(G$x, G$y, 1:20)

[Figure: map of Italy with region centroids labelled 1 to 20]

    @freakonometrics 170

  • http://www.ub.edu/riskcenter

    Maps and Polygons

Consider two trajectories, characterized by a series of knots (taken from the centroid list),

> c1 <- ...    # knot coordinates of the first trajectory (truncated on the slide)
> c2 <- ...    # knot coordinates of the second trajectory

  • http://www.ub.edu/riskcenter

> L1 <- SpatialLines(list(Lines(list(Line(c1)), "1")))   # assumed construction (the slide is truncated)
> L2 <- SpatialLines(list(Lines(list(Line(c2)), "2")))
> L1@proj4string <- ita1@proj4string
> L2@proj4string <- ita1@proj4string
> cross_road <- gIntersection(L1, L2)                    # assumed: intersection point(s) of the two routes
> plot(ita1, col = "light green")
> plot(L1, col = "red", lwd = 2, add = TRUE)
> plot(L2, col = "blue", lwd = 2, add = TRUE)
> plot(cross_road, pch = 19, cex = 1.5, add = TRUE)

    @freakonometrics 172

  • http://www.ub.edu/riskcenter

    Maps and Polygons

    To add elements on maps, consider, e.g.

plot(ita1, col = "light green")
grat <- ...        # remainder of the listing truncated on the slide

  • http://www.ub.edu/riskcenter

    Maps and Polygons

    To add elements on maps, consider, e.g.

sea <- ...         # listing truncated on the slide

  • http://www.ub.edu/riskcenter

    Maps, with R

There are 36,552 communes in France. Why not illustrate with some smaller datasets? Paris is split into 850 IRIS areas (Îlots Regroupés pour des Indicateurs Statistiques, see insee.fr).

library(RColorBrewer)
plotclr <- brewer.pal(7, "RdYlBu")     # truncated on the slide; palette assumed

  • http://www.ub.edu/riskcenter

    Maps, with R

To visualize all those polygons, use
> plot(paris)

[Figure: the 850 IRIS polygons of Paris]

    @freakonometrics 176

  • http://www.ub.edu/riskcenter

Maps, with R

Let us focus on one very specific polygon (the 100th),

> length(paris)
[1] 850
> poly_paris <- SpatialPolygons2PolySet(paris)   # conversion to a PolySet, assumed (the slide is truncated)
> sub_poly <- poly_paris[poly_paris$PID == 100, ]
> sub_poly
   PID SID POS        X        Y
1  100   1   1 2.332904 48.85836
2  100   1   2 2.333240 48.85826
3  100   1   3 2.331740 48.85644
...
18 100   1  18 2.332460 48.85850
19 100   1  19 2.332904 48.85836
> plot(sub_poly[, c("X", "Y")])
> polygon(sub_poly[, c("X", "Y")])

    @freakonometrics 177

  • http://www.ub.edu/riskcenter

    Maps, with R

Given some GPS coordinates, we want to know in which IRIS area the point is located.

Use the point.in.polygon() function (over all possible polygons... it can be long):

> library(sp)
> point <- c(2.33, 48.859)     # example GPS coordinates (the slide's values are truncated)
> point.in.polygon(point[1], point[2], sub_poly[, "X"], sub_poly[, "Y"])
[1] 1
> point_in_i <- function(i, point)
+   point.in.polygon(point[1], point[2], poly_paris[poly_paris$PID == i, "X"],
+                    poly_paris[poly_paris$PID == i, "Y"])
> where_is_point <- function(point)
+   which(Vectorize(function(i) point_in_i(i, point))(1:length(paris)) > 0)

    @freakonometrics 178

  • http://www.ub.edu/riskcenter

Maps, with R

> where_is_point(point)
[1] 100

> library(RColorBrewer)
> plotclr <- brewer.pal(7, "RdYlBu")    # assumed palette (truncated on the slide)
> vizualize_point <- ...                # plotting helper, truncated on the slide

  • http://www.ub.edu/riskcenter

    Maps, with R

simplify using some kind of grid, then use the previous function only on a few possible candidates:

> minmax <- function(i){
+   vect_xy <- poly_paris[poly_paris$PID == i, c("X", "Y")]
+   box_xy <- c(min(vect_xy[, "X"]), min(vect_xy[, "Y"]),
+               max(vect_xy[, "X"]), max(vect_xy[, "Y"]))
+   return(box_xy)}
> rect(minmax(100))

[Figure: the bounding box of polygon 100, drawn on the (X, Y) coordinates]

    @freakonometrics 180

  • http://www.ub.edu/riskcenter

Maps, with R

Then simply use

> is.box <- function(i, loc){
+   box_i <- minmax(i)
+   ((loc[1] >= box_i[1]) & (loc[1] <= box_i[3]) &
+    (loc[2] >= box_i[2]) & (loc[2] <= box_i[4]))}   # TRUE when loc falls inside the bounding box of polygon i

  • http://www.ub.edu/riskcenter

Maps, with R

Then, to get all possible candidates, use

> which.box <- function(loc)
+   which(Vectorize(function(i) is.box(i, loc))(1:length(paris)))   # truncated on the slide; reconstruction consistent with where_is_point()
> which.box(point)
[1]   1 100 101

see

> plot(sub_poly[, c("X", "Y")], col = "white")
> polygon(poly_paris[poly_paris$PID == 1, c("X", "Y")], col = plotclr[2])
> polygon(poly_paris[poly_paris$PID == 100, c("X", "Y")], col = plotclr[1])
> polygon(poly_paris[poly_paris$PID == 101, c("X", "Y")], col = plotclr[2])

[Figure: polygons 1, 100 and 101 around the point of interest]

    @freakonometrics 182

  • http://www.ub.edu/riskcenter

    Maps, with R

Finally use

> which.poly <- function(point){               # partly truncated on the slide; reconstruction
+   idx_box <- which.box(point)                # candidate polygons from the bounding boxes
+   idx_valid <- NULL
+   for(i in idx_box){
+     if(point_in_i(i, point) > 0) idx_valid <- c(idx_valid, i)}
+   return(idx_valid)}

to identify the IRIS polygon,

> which.poly(point)
[1] 100

    @freakonometrics 183

  • http://www.ub.edu/riskcenter

Maps, with R

> vizualize_point <- ...     # the function definition is truncated on the slide

  • http://www.ub.edu/riskcenter

Google Maps and Open Street Map

> library(ggmap)
> library(geosphere)
> (MAIF <- geocode("200 avenue Salvador Allende, 79000 Niort, France"))   # truncated on the slide; addresses taken from the mapdist() output below
> (Rennes <- geocode("Place Hoche, 35000 Rennes, France"))
> distHaversine(MAIF, Rennes, r = 6378.137)
[1] 217.9371

    @freakonometrics 185

    http://en.wikipedia.org/wiki/Haversine_formula

  • http://www.ub.edu/riskcenter

    (in km.) while the driving distance is

> mapdist(as.numeric(MAIF), as.numeric(Rennes), mode = "driving")
by using this function you are agreeing to the terms at:
http://code.google.com/apis/maps/documentation/distancematrix/

                                            from                                      to      m      km   miles seconds minutes    hours
1 200 Avenue Salvador Allende, 79000 Niort, France 10-11 Place Hoche, 35000 Rennes, France 257002 257.002 159.701    9591  159.85 2.664167

    @freakonometrics 186

  • http://www.ub.edu/riskcenter

    Visualizing Maps, via Google Maps

> Loc <- data.frame(rbind(MAIF, Rennes))
> library(ggmap)
> library(RgoogleMaps)
> CenterOfMap <- data.frame(lat = mean(Loc$lat), lon = mean(Loc$lon))   # truncated on the slide; reconstruction assumed
> W <- get_map(c(lon = CenterOfMap$lon, lat = CenterOfMap$lat), zoom = 8)
> WMap <- ggmap(W)
> WMap

or

> W <- get_map(c(lon = CenterOfMap$lon, lat = CenterOfMap$lat), zoom = 8, maptype = "toner", source = "stamen")   # alternative tile source, assumed
> ggmap(W)

[Figures: two ggmap renderings of the area between Niort and Rennes (latitude about 46.5 to 48, longitude about -2 to 0)]

    @freakonometrics 187

  • http://www.ub.edu/riskcenter

    Visualizing Maps, via Google Maps

From a technical point of view, those are ggplot2 maps, see the ggmap cheatsheet:

> W <- get_map("Singapore", zoom = 11)        # truncated on the slide; location assumed from the figure
> ggmap(W)

or

> W <- get_map("Singapore", zoom = 11, source = "osm")   # assumed alternative source
> ggmap(W)

[Figures: two ggmap renderings (longitude about 103.7 to 104.1, latitude about 1.2 to 1.5)]

    @freakonometrics 188

    https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/ggmap/ggmapCheatsheet.pdf

  • http://www.ub.edu/riskcenter

Visualizing Maps, via Google Maps

> Paris <- get_map("Paris")                                   # partly truncated on the slide; reconstruction assumed
> ParisMap <- ggmap(Paris)
> library(maptools)
> library(rgdal)
> paris <- readShapeSpatial("paris-cartelec.shp")
> proj4string(paris) <- CRS("+init=epsg:2154")                # assumed: Lambert-93 coordinates in the shapefile
> paris <- spTransform(paris, CRS("+proj=longlat +datum=WGS84"))
> ParisMapPlus <- ParisMap + geom_polygon(aes(x = long, y = lat, group = group),
+                                         data = fortify(paris), fill = NA, col = "blue")   # assumed overlay
> ParisMapPlus

[Figures: Google map of Paris (latitude about 48.81 to 48.90, longitude about 2.25 to 2.45), and the same map with the IRIS polygons overlaid]

    @freakonometrics 189

  • http://www.ub.edu/riskcenter

    OpenStreet Map

> library(OpenStreetMap)
> map <- openmap(upperLeft = c(...), lowerRight = c(...))   # bounding box truncated on the slide
> map <- openproj(map)
> plot(map)

    @freakonometrics 190

  • http://www.ub.edu/riskcenter

    OpenStreet Map

> library(OpenStreetMap)
> map <- openmap(upperLeft = c(...), lowerRight = c(...))   # bounding box truncated on the slide
> plot(map)

It is a standard R plot, so we can add points on that graph.

    @freakonometrics 191

  • http://www.ub.edu/riskcenter

    OpenStreet Map

> library(maptools)
> deaths <- readShapePoints("Cholera_Deaths.shp")   # truncated on the slide; shapefile name assumed
> head(deaths@coords)
  coords.x1 coords.x2
0  529308.7  181031.4
1  529312.2  181025.2
2  529314.4  181020.3
3  529317.4  181014.3
4  529320.7  181007.9
> points(deaths@coords, col = "red", pch = 19, cex = .7)

Cholera outbreak in London, 1854; dataset collected by John (not Jon) Snow.

    @freakonometrics 192

  • http://www.ub.edu/riskcenter

    OpenStreet Map

    1 >