A Generalized Approach to Ridge Regression

Kathryn Vasilaky

December 7, 2016

Postdoc @ Columbia University, Earth Institute; [email protected], www.kvasilaky.com
My Background

1. Applied economist with an interest in development economics, policy, and information communication
2. RCTs, lab experiments in the field, natural experiments, as well as broader methods in data science
3. Past work looked at how groups and social networks affect technology adoption and learning, particularly for females
4. Recent work combines the use of crowdsourced and sensor (IoT) data for prediction and feedback to larger information systems
5. As a complement to this applied work, I work on learning and improving applied statistical methodologies
Overview of talk

1. Part I
   1.1 Review Standard Ridge
   1.2 Derivation of Generalized Ridge results
2. Part II
   2.1 Test Generalized Ridge in a simulated case
   2.2 Test Generalized Ridge with real data and cross validation
   2.3 Reconstruct an image from projections
Motivation of Inverse Problems

- Inverse problems: compute information about some "interior" properties using "exterior" measurements.
- Inference: Covariates → Coefficients → Outcome
- Tomography: X-ray source → Object → X-ray dampening
- ML: Features → Effect sizes → Classifier/Prediction
Why is Regularization Used?

- OLS is BLUE when the covariate matrix (A) is full rank
- But when A is ill-conditioned (covariates correlated), estimators will be sensitive to noise
- Regularization methods are used to dampen the effects of this sensitivity to noise
- I present a generalization of the frequently used Ridge Regression:
  - Performs as well as or better than Standard Ridge
  - Allows for a more flexible weighting of singular values than Standard Ridge
  - Useful for data where covariates are correlated: large consumer data sets, health data
Reviewing the Basics of Gauss-Markov

Recall the Gauss-Markov estimator for the least squares problem:

$$y = Ax + e, \qquad \min_x \|Ax - y\|_2^2,$$

where $A$ is $n \times p$ with rank $p$. Then

$$\hat{x} = (A'A)^{-1}A'y, \qquad \mathrm{Var}(\hat{x}) = \sigma^2(A'A)^{-1},$$

and

$$\mathrm{MSE}(\hat{x}) = \sigma^2\,\mathrm{Tr}(A'A)^{-1} = \sigma^2\sum_i \frac{1}{\sigma_i^2},$$

where $\sigma_i^2$ is the $i$-th eigenvalue of $A'A$.

This will serve as a comparison later on.
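As a concrete reference point, here is a minimal numpy sketch of the Gauss-Markov estimator and the MSE formula above; the design matrix, noise level, and seed are illustrative assumptions, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative, well-conditioned design: n = 100 observations, p = 5 covariates.
n, p = 100, 5
A = rng.standard_normal((n, p))
x_true = np.ones(p)
sigma = 0.1
y = A @ x_true + sigma * rng.standard_normal(n)

# Gauss-Markov (OLS) estimator: x_hat = (A'A)^{-1} A'y
x_hat = np.linalg.solve(A.T @ A, A.T @ y)

# MSE(x_hat) = sigma^2 Tr((A'A)^{-1}) = sigma^2 * sum_i 1/sigma_i^2,
# where the sigma_i are the singular values of A (so sigma_i^2 are the
# eigenvalues of A'A).
s = np.linalg.svd(A, compute_uv=False)
print(x_hat)
print(sigma**2 * np.sum(1.0 / s**2))
```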
When is Gauss-Markov MSE Large?

- When $A'A$ is ill-conditioned (nearly rank-deficient), the OLS solution is sensitive to noise ($\tilde{y} = y + \varepsilon$)
- This can occur when covariates are highly correlated, the number of covariates exceeds the number of observations, or A is sparse
- OLS is still BLUE
  - But the standard errors of $\hat{x}$, and the MSE, will be large (the $\sigma_i$'s are small):

$$\mathrm{Var}(\hat{x}) = \sigma^2(A'A)^{-1}, \qquad \mathrm{MSE}(\hat{x}) = \sigma^2\,\mathrm{Tr}(A'A)^{-1} = \sigma^2\sum_i \frac{1}{\sigma_i^2}$$

- And $\hat{x}$ can deviate very far from the true solution $x$
Example: Noise is magnified if the covariate matrix is ill-conditioned

$$A = U\Sigma V' = \begin{bmatrix}1 & 0\\ 0 & 1\end{bmatrix}\begin{bmatrix}1 & 0\\ 0 & 10^{-6}\end{bmatrix}\begin{bmatrix}1 & 0\\ 0 & 1\end{bmatrix} \quad (1)$$

$$(A'A)^{-1}A'y = \begin{bmatrix}1 & 0\\ 0 & 10^{6}\end{bmatrix}\begin{bmatrix}1\\ 1\end{bmatrix} = \begin{bmatrix}1\\ 10^{6}\end{bmatrix} \quad \text{Noiseless solution} \quad (2)$$

$$(A'A)^{-1}A'(y+e) = \begin{bmatrix}1 & 0\\ 0 & 10^{6}\end{bmatrix}\begin{bmatrix}1\\ 1.1\end{bmatrix} = \begin{bmatrix}1\\ 10^{6}+10^{5}\end{bmatrix} \quad \text{Naive solution} \quad (3)$$
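The same magnification can be checked in a few lines of numpy, using exactly the matrices and perturbation on the slide:

```python
import numpy as np

A = np.diag([1.0, 1e-6])          # the slide's A = U Sigma V' with U = V = I
y = np.array([1.0, 1.0])          # noiseless observations
e = np.array([0.0, 0.1])          # a small perturbation of the second entry

pinv = np.linalg.inv(A.T @ A) @ A.T   # (A'A)^{-1} A' = diag(1, 1e6)

print(pinv @ y)        # noiseless solution: [1, 1e6]
print(pinv @ (y + e))  # naive solution: [1, 1.1e6]; 0.1 of noise became 1e5
```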
Regularization problems formulate a nearby problem, which has a more stable solution:

$$\min_x \left(\|Ax - y\|_2^2 + \lambda\|x\|_2^2\right), \quad \lambda > 0,$$

where we introduce the term $\lambda\|x\|_2^2$, perturbing the least-squares formulation.

- The estimate $\hat{x}$ is no longer unbiased; however, the MSE = Variance + Bias², may be smaller than BLUE's.

The regularization can be L1 (Lasso) or L2. The objective is to choose a $\lambda$ that brings $\hat{x}$ close to the noiseless solution.
The regularized least squares problem becomes:

$$\min_x \|Ax - y\|_2^2 + \lambda\|x - x_0\|_2^2$$

$$\hat{x} = (A'A + \lambda I_n)^{-1}A'y$$

$$\mathrm{MSE} = \sigma^2\sum_{i=1}^{n} \frac{\sigma_i^2}{(\sigma_i^2+\lambda)^2} + \lambda^2\sum_{i=1}^{n} \frac{\alpha_i^2}{(\sigma_i^2+\lambda)^2}$$

(Hoerl and Kennard, 1970)²

²Note: where $\alpha = V'x$ and $A'A = V\Sigma^2V'$.
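A short sketch of the ridge estimate and the Hoerl-Kennard MSE decomposition above; the function names are mine, and alpha follows the footnote's definition:

```python
import numpy as np

def ridge(A, y, lam):
    """Standard ridge estimate: (A'A + lam*I)^{-1} A'y."""
    p = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ y)

def ridge_mse(A, x_true, sigma, lam):
    """Hoerl-Kennard MSE: variance term plus squared-bias term."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    alpha = Vt @ x_true                                  # alpha = V'x
    variance = sigma**2 * np.sum(s**2 / (s**2 + lam)**2)
    bias_sq = lam**2 * np.sum(alpha**2 / (s**2 + lam)**2)
    return variance + bias_sq
```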
Other regularization methods include:

- Lasso, with an L1 penalty weighted by λ
- Truncated SVD
- Elastic Net, a weighted sum of L1 and L2 penalties
- "Generalized" Tikhonov
Introduction to Generalized Iterative Ridge

- Rather than computing one solution, I will compute a sequence of solutions that approximates the noiseless solution $(A'A)^{-1}A'y$ on its way to the naive solution $(A'A)^{-1}A'(y+e)$.
- Since the operator $\lambda(A'A + \lambda I)^{-1} = V\,\mathrm{Diag}\!\left[\frac{\lambda}{\sigma_i^2+\lambda}\right]V'$ has eigenvalues less than one, it is contracting.
- I show that repeated application of this operator converges to the naive solution.

The new normal equations with the added constraint become:

$$(A'A + \lambda I)x_1 = A'y + \lambda x_0$$
Introduction to Generalized Iterative Ridge
$$(A'A + \lambda I)x_1 = A'y + \lambda x_0$$

The solution for $x_1$ is:

$$x_\lambda^1 = (A'A + \lambda I)^{-1}A'y + \lambda(A'A + \lambda I)^{-1}x_0 \quad \text{(Standard Ridge when } x_0 = 0\text{)}$$

Then substituting $x_\lambda^1$ in for $x_0$, we obtain $x_\lambda^2$; iterating, with $x_\lambda^{k-1}$ in place of $x_0$, we obtain:

$$x_\lambda^k = \sum_{i=1}^{k}\lambda^{i-1}(A'A + \lambda I)^{-i}A'y \;+\; \lambda^k(A'A + \lambda I)^{-k}x_0,$$

where $\sum_{i=1}^{k}\lambda^{i-1}(A'A + \lambda I)^{-i}A'y$ is a geometric series in the contracting operator $\lambda(A'A + \lambda I)^{-1}$.
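The recursion translates directly into a loop; a minimal sketch, assuming x0 = 0 by default (the function name is mine):

```python
import numpy as np

def iterative_ridge(A, y, lam, k, x0=None):
    """k applications of x_j = (A'A + lam*I)^{-1} (A'y + lam * x_{j-1});
    k = 1 with x0 = 0 is standard ridge."""
    p = A.shape[1]
    M = A.T @ A + lam * np.eye(p)
    Aty = A.T @ y
    x = np.zeros(p) if x0 is None else np.asarray(x0, dtype=float)
    for _ in range(k):
        x = np.linalg.solve(M, Aty + lam * x)
    return x
```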
Generalized Iterative Solution, $x_\lambda^k$

Now we sum the geometric series using the singular value decomposition (SVD) of $A$: $A = U\Sigma V^T$, where $\Sigma = \mathrm{Diag}[\sigma_1, \ldots, \sigma_n, 0, \ldots, 0] \in \mathbb{R}^{m \times n}$ with $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n > 0$, and where $n$ is the rank of $A$. After $k$ iterations, the iterative solution is:

$$x_\lambda^k = V\,\mathrm{Diag}\!\left[\frac{1}{\sigma_1} - \frac{1}{\sigma_1}\left(\frac{\lambda}{\lambda+\sigma_1^2}\right)^k,\; \ldots,\; \frac{1}{\sigma_n} - \frac{1}{\sigma_n}\left(\frac{\lambda}{\lambda+\sigma_n^2}\right)^k,\; 0, \ldots, 0\right]U^T y$$

For practical ease we can set $x_0 = 0$. It does not affect the solution, as the last term, attached to $x_0$, is pulled quickly towards 0.
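The closed form suggests an SVD implementation; a sketch, again with x0 = 0:

```python
import numpy as np

def iterative_ridge_svd(A, y, lam, k):
    """Closed form of the k-step solution with x0 = 0:
    x = V Diag[(1/s_i) * (1 - (lam/(lam + s_i^2))**k)] U'y."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    filt = (1.0 - (lam / (lam + s**2))**k) / s
    return Vt.T @ (filt * (U.T @ y))
```

For λ > 0 this matches the loop version above, and a single SVD serves every (λ, k) pair, which is convenient for the grid search discussed later.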
The limits of Generalized Iterative Ridge (GIR)

The GIR solution tends towards the naive solution as $k \to \infty$: $(A'A)^{-1}A'y = V\Sigma^{-1}U'y$, or

$$V\,\mathrm{Diag}\!\left[\frac{1}{\sigma_1},\; \ldots,\; \frac{1}{\sigma_n},\; 0, \ldots, 0\right]U^T y$$
The limits of Generalized Iterative Ridge (GIR)

When $k = 1$, GIR collapses to Standard Ridge:

$$V\,\mathrm{Diag}\!\left[\frac{\sigma_1}{\lambda+\sigma_1^2},\; \ldots,\; \frac{\sigma_n}{\lambda+\sigma_n^2},\; 0, \ldots, 0\right]U^T y$$

The diagonal entries are the filters for the singular values, and it is clear that the iterative solution's filter is a generalization of the two extremes.
Summary, Generalized Iterative Ridge

- Falls between Standard Ridge and OLS
- The filter is more general than Standard Ridge's
Choosing k and λ using the Residual Error

The residual error $\|Ax^k - y\|$ has a simple expression as a function of $k$ and $\lambda$; we can choose its lowest point or its point of maximum curvature:

$$RE(\lambda, k) = \left\|\,\mathrm{Diag}\!\left[\left(\frac{\lambda}{\lambda+\sigma_1^2}\right)^k,\; \ldots,\; \left(\frac{\lambda}{\lambda+\sigma_n^2}\right)^k,\; 0, \ldots, 0\right]U'y\,\right\|$$
Choosing k and λ using the Residual Error
We can see that the residual error $\|Ax^k - y\|$ is convex and decreases as $k$ increases, for a given $\lambda$.

[Figure: residual error as a function of k, for a given λ. Choose k at the elbow.]
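A sketch of the residual-error curve and one way to pick k at its elbow; the maximum-curvature rule is my reading of "choose k at the elbow", not a procedure spelled out in the talk.

```python
import numpy as np

def residual_error(A, y, lam, k):
    """RE(lam, k) = || Diag[(lam/(lam + s_i^2))**k] U'y ||."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    return np.linalg.norm((lam / (lam + s**2))**k * (U.T @ y))

def elbow_k(A, y, lam, k_max=50):
    """Choose k near the elbow: the point of maximum discrete curvature
    (largest second difference) of the residual-error curve."""
    ks = np.arange(1, k_max + 1)
    re = np.array([residual_error(A, y, lam, k) for k in ks])
    curvature = re[:-2] - 2.0 * re[1:-1] + re[2:]
    return int(ks[1 + np.argmax(curvature)])
```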
Intuition for choosing k at the residual error's elbow

The residual error's convexity can be seen when we take the difference of $x_\lambda^k$ and the noiseless OLS solution $x$:

- The first term (S1), the iteration error, declines monotonically as $k \to \infty$.
- The second term (S2), the noise term, increases monotonically with $k$ (towards the OLS residual).

$$\text{(S1)}\quad -\,V\,\mathrm{Diag}\!\left[\frac{1}{\sigma_1}\left(\frac{\lambda}{\lambda+\sigma_1^2}\right)^k,\; \ldots,\; \frac{1}{\sigma_n}\left(\frac{\lambda}{\lambda+\sigma_n^2}\right)^k,\; 0, \ldots, 0\right]U^T y$$

$$\text{(S2)}\quad V\,\mathrm{Diag}\!\left[\frac{1}{\sigma_1} - \frac{1}{\sigma_1}\left(\frac{\lambda}{\lambda+\sigma_1^2}\right)^k,\; \ldots,\; \frac{1}{\sigma_n} - \frac{1}{\sigma_n}\left(\frac{\lambda}{\lambda+\sigma_n^2}\right)^k,\; 0, \ldots, 0\right]U^T e$$
Summary, Choosing k and λ

- Grid search over λ
- Choose k at the elbow
Comparing the Filters of Standard Ridge and GIR

- A key contribution of the iterative solution is in the filters.
- The filters dampen the effects of small singular values.

BLUE: $\dfrac{1}{\sigma_i}$

Standard Ridge: $\dfrac{1}{\sigma_i}\left(1 - \dfrac{\lambda}{\lambda+\sigma_i^2}\right) = \dfrac{1}{\sigma_i}\cdot\dfrac{\sigma_i^2}{\lambda+\sigma_i^2}$

Iterative Ridge: $\dfrac{1}{\sigma_i}\left(1 - \left(\dfrac{\lambda}{\lambda+\sigma_i^2}\right)^k\right)$

What is the extra advantage of the k iteration parameter in the filter?
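To see the answer numerically, compare the filter factors (the multiplier on 1/σᵢ) across methods; the singular-value grid and the choices λ = 0.1, k = 20 are illustrative:

```python
import numpy as np

s = np.array([10.0, 1.0, 0.1, 0.01])     # illustrative singular values
lam, k = 0.1, 20

f_blue = np.ones_like(s)                 # BLUE: no damping at all
f_ridge = 1.0 - lam / (lam + s**2)       # = s^2 / (lam + s^2)
f_iter = 1.0 - (lam / (lam + s**2))**k   # k relaxes the damping

print(np.column_stack([s, f_blue, f_ridge, f_iter]))
```

For mid-sized singular values the iterative filter stays near one while standard ridge already damps heavily, which is the flexibility the next slide describes.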
Iterative Ridge introduces additional flexibility in the weighting of each covariate

- With Standard Ridge/Tikhonov, even medium-valued σ's are heavily penalized (for λ = 0.1)
- Notice that Standard Ridge with λ = 0.0001 and Iterative Ridge with λ = 0.01, k = 20 have similar filter shapes; however, a small λ amplifies the noise.
Comparing the MSEs for BLUE, Standard Ridge, and GIR
Last but not least, the generalized MSE for Iterative Ridge exposes the variance and the bias:³

MSE, OLS:

$$\sigma^2\sum_i \frac{1}{\sigma_i^2}$$

MSE, Ridge:

$$\sigma^2\sum_{i=1}^{n} \frac{\sigma_i^2}{(\sigma_i^2+\lambda)^2} \;+\; \lambda^2\sum_{i=1}^{n} \frac{\alpha_i^2}{(\sigma_i^2+\lambda)^2}$$

MSE, Iterative Ridge:

$$\sigma^2\sum_{i=1}^{n} \frac{1}{\sigma_i^2}\left(1 - \left(\frac{\lambda}{\lambda+\sigma_i^2}\right)^k\right)^2 \;+\; \lambda^{2k}\sum_{i=1}^{n} \alpha_i^2\left(\frac{1}{\sigma_i^2+\lambda}\right)^{2k}$$

³Note: where $\alpha = V'x$ and $A'A = V\Sigma^2V'$.
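The iterative-ridge MSE can be evaluated straight from the SVD; a sketch assuming a simulation setting where the true x and noise level are known (k = 1 recovers the standard-ridge expression):

```python
import numpy as np

def iterative_ridge_mse(A, x_true, sigma, lam, k):
    """MSE of the k-step solution: with r_i = lam/(lam + s_i^2),
    variance = sigma^2 * sum_i (1 - r_i^k)^2 / s_i^2 and
    bias^2   = sum_i alpha_i^2 * r_i^(2k)
             = lam^(2k) * sum_i alpha_i^2 / (s_i^2 + lam)^(2k)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    alpha = Vt @ x_true                     # alpha = V'x
    r = lam / (lam + s**2)
    variance = sigma**2 * np.sum((1.0 - r**k)**2 / s**2)
    bias_sq = np.sum(alpha**2 * r**(2 * k))
    return variance + bias_sq
```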
Part II
How do we know we can do better with Generalized Ridge?

- Provide a small simulated example with known x, noisy y, and ill-conditioned A
- Provide a real data example where we test our out-of-sample predictions
- Provide an image example where we recover an image
Example 1: Simulated Data
Let's take the Golub matrix, an ill-conditioned matrix:

$$\begin{bmatrix} 1 & 3 & 11 & 0 & -11 & -15 \\ 18 & 55 & 209 & 15 & -198 & -277 \\ -23 & -33 & 144 & 532 & 259 & 82 \\ 9 & 55 & 405 & 437 & -100 & -285 \\ 3 & -4 & -111 & -180 & 39 & 219 \\ -13 & -9 & 202 & 346 & 401 & 253 \end{bmatrix}$$

The singular values are:

s = [9.545e+02, 7.240e+02, 1.767e+02, 7.301e+01, 3.536e-01, 3.169e-10]

So the matrix's condition number is 9.545e+02 / 3.169e-10, or about 3 × 10¹² (3 trillion).
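This is easy to verify; a numpy sketch with the matrix copied from the slide:

```python
import numpy as np

A = np.array([
    [  1,   3,   11,    0,  -11,  -15],
    [ 18,  55,  209,   15, -198, -277],
    [-23, -33,  144,  532,  259,   82],
    [  9,  55,  405,  437, -100, -285],
    [  3,  -4, -111, -180,   39,  219],
    [-13,  -9,  202,  346,  401,  253],
], dtype=float)

s = np.linalg.svd(A, compute_uv=False)
print(s)              # [9.545e+02, ..., 3.169e-10]
print(s[0] / s[-1])   # condition number, on the order of 3e12
```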
Framework of Simulated Problem

Suppose we know the true $x = [1, 1, 1, 1, 1, 1]$.

So we also know the true $y = Ax$.

We add a small normally distributed error to $y$, with s.d. 0.1.

Now we would like to recover the known $x$ in the presence of noise.
Recall, the noiseless solution for $x = [1, 1, 1, 1, 1, 1]$ is:

$$V\,\mathrm{Diag}\!\left[\frac{1}{\sigma_1},\; \ldots,\; \frac{1}{\sigma_n},\; 0, \ldots, 0\right]U^T y$$

And the naive solution, $(A'A)^{-1}A'(y+e)$, is:

$$V\,\mathrm{Diag}\!\left[\frac{1}{\sigma_1},\; \ldots,\; \frac{1}{\sigma_n},\; 0, \ldots, 0\right]U^T (y+e)$$
Solutions to Simulated Problem

$\hat{x}$ with OLS, Ridge, and Iterative Ridge:

OLS with noise: [3.08e+08, -1.44e+08, 1.12e+07, 1.36e+06, -9.11e+04, -1.49e+04], with $\|\hat{x} - x\|$ = 340,863,251!

Ridge, λ = 0.1: [0.067, 0.24, 1.25, 0.89, 0.88, 1.054], with $\|\hat{x} - x\|$ = 0.92

Iterative Ridge, λ = 0.1, k = 5: [0.08, 0.29, 1.23, 0.89, 0.89, 1.05], with $\|\hat{x} - x\|$ = 0.68

λ was chosen through a grid search, and k via the elbow algorithm.
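A self-contained sketch of the whole simulation; since the noise draw here is my own, expect norms close to, but not exactly matching, the slide's.

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([
    [  1,   3,   11,    0,  -11,  -15],
    [ 18,  55,  209,   15, -198, -277],
    [-23, -33,  144,  532,  259,   82],
    [  9,  55,  405,  437, -100, -285],
    [  3,  -4, -111, -180,   39,  219],
    [-13,  -9,  202,  346,  401,  253],
], dtype=float)
x_true = np.ones(6)
y = A @ x_true + 0.1 * rng.standard_normal(6)   # noisy observations

def iterative_ridge(A, y, lam, k):
    M = A.T @ A + lam * np.eye(A.shape[1])
    x = np.zeros(A.shape[1])
    for _ in range(k):
        x = np.linalg.solve(M, A.T @ y + lam * x)
    return x

x_ols = np.linalg.pinv(A) @ y                   # naive (A'A)^{-1}A'(y+e)
for name, x_hat in [("OLS", x_ols),
                    ("Ridge (k=1)", iterative_ridge(A, y, 0.1, 1)),
                    ("Iterative (k=5)", iterative_ridge(A, y, 0.1, 5))]:
    print(f"{name:16s} ||x_hat - x|| = {np.linalg.norm(x_hat - x_true):.3g}")
```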
Example 2: Real Data
- Using the Body Fat data set (Carnegie Mellon Statistics Data Library)
- Predict body fat based on 17 variables
- Covariates include neck, chest, abdomen, hip, thigh, and wrist circumference, and hence are correlated
- The data are ill-conditioned, with σ₁₇ = 0.000027
- We split the data into training [202 × 17] and test [50 × 17] sets
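A sketch of the out-of-sample evaluation; loading the Body Fat data is omitted (the splits follow the slide), and the MSPE convention here is the mean squared prediction residual; see the footnote on the next slide for the talk's definition.

```python
import numpy as np

def mspe(A_train, y_train, A_test, y_test, lam, k):
    """Fit k-step iterative ridge on the training split and report the
    mean squared prediction error on the test split. (The talk's footnote
    defines MSPE via the norm of predicted-minus-true y; swap in
    np.linalg.norm on the residual if you want that convention.)"""
    p = A_train.shape[1]
    M = A_train.T @ A_train + lam * np.eye(p)
    x = np.zeros(p)
    for _ in range(k):
        x = np.linalg.solve(M, A_train.T @ y_train + lam * x)
    return np.mean((A_test @ x - y_test)**2)
```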
Results

- BLUE, unperturbed: MSPE = 0.7740
- With added error, N(0,1)⁴: MSPE = 2.47, roughly a threefold increase⁵
- Ridge, λ = 100, k = 1: MSPE = 2.16, about a 15% improvement over perturbed BLUE
- Iterative Ridge, k = 2: MSPE = 1.73, about a 30% improvement over perturbed BLUE
- Iterative Ridge, λ = 300, k = 5: MSPE = 1.72
- Iterative Ridge, λ = 500, k = 7: MSPE = 1.71

⁴std(y) = 7.7509, so e with std = 1 is not excessive.
⁵Mean Squared Predicted Error is the norm of the difference between predicted and true y.
Example 3: Image Data
Example of a 64 × 64 pixel image, where A is sparse:
[Figure: image reconstructions with BLUE; Ridge, λ = 0.0034 (computed by a method known as the L-curve); and Iterative Ridge, λ = 1, k = 20.]
Summary

- Developed a generalization of ridge regression
- The additional parameter k provides more flexibility in balancing bias and noise
- Iterative Ridge performs better when there are a number of small singular values, A is sparse, and y is noisy
Future Work

- Improve on the grid search for λ and k
- Prove conditions under which Generalized Ridge does better for any given λ
- Apply to "big" data problems: consumer data (FB); weather & sensor data
- Overall, this research is converging towards using larger data sets for applied problems