A Generalized Approach to Ridge Regression

Kathryn Vasilaky

December 7, 2016

Postdoc @ Columbia University, Earth Institute; [email protected], www.kvasilaky.com
My Background

1. Applied economist with an interest in development economics, policy, and information communication
2. RCTs, lab experiments in the field, natural experiments, as well as broader methods in data science
3. Past work looked at how groups and social networks affect technology adoption and learning, particularly for females
4. Recent work combines the use of crowdsourced and sensor (IoT) data for prediction and feedback to larger information systems
5. As a complement to this applied work, I work on learning and improving applied statistical methodologies
Overview of talk

1. Part I
   1.1 Review Standard Ridge
   1.2 Derivation of Generalized Ridge results
2. Part II
   2.1 Test Generalized Ridge in a simulated case
   2.2 Test Generalized Ridge with real data and cross validation
   2.3 Reconstruct an image from projections
Motivation of Inverse Problems

- Inverse problems: compute information about some "interior" properties using "exterior" measurements.
- Inference: Covariates → Coefficients → Outcome
- Tomography: X-ray source → Object → X-ray dampening
- ML: Features → Effect sizes → Classifier/Prediction
Why is Regularization Used?

- OLS is BLUE when the covariate matrix (A) is full rank
- But when A is ill-conditioned (covariates correlated), estimators will be sensitive to noise
- Regularization methods are used to dampen the effects of this sensitivity to noise
- I present a generalization of the frequently used Ridge Regression:
  - Performs as well as or better than Standard Ridge
  - Allows for a more flexible weighting of singular values than Standard Ridge
  - Useful for data where covariates are correlated: large consumer data sets, health data
Reviewing the Basics of Gauss-Markov

Recall the Gauss-Markov estimator for the least squares problem:

$$y = Ax + e, \qquad \min_x \|Ax - y\|_2^2,$$

where $A$ is $n \times p$ with rank $p$. Then

$$\hat{x} = (A'A)^{-1}A'y, \qquad \mathrm{Var}(\hat{x}) = \sigma^2(A'A)^{-1},$$

and

$$\mathrm{MSE}(\hat{x}) = \sigma^2\,\mathrm{Tr}(A'A)^{-1} = \sigma^2\sum_i \frac{1}{\sigma_i^2},$$

where $\sigma_i^2$ is the $i$-th eigenvalue of $A'A$.

This will serve as a comparison later on.
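As a concrete reference point, here is a minimal numpy sketch of the Gauss-Markov estimator and the MSE formula above; the design matrix, noise level, and seed are illustrative assumptions, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative, well-conditioned design: n = 100 observations, p = 5 covariates.
n, p = 100, 5
A = rng.standard_normal((n, p))
x_true = np.ones(p)
sigma = 0.1
y = A @ x_true + sigma * rng.standard_normal(n)

# Gauss-Markov (OLS) estimator: x_hat = (A'A)^{-1} A'y
x_hat = np.linalg.solve(A.T @ A, A.T @ y)

# MSE(x_hat) = sigma^2 Tr((A'A)^{-1}) = sigma^2 * sum_i 1/sigma_i^2,
# where the sigma_i are the singular values of A (so sigma_i^2 are the
# eigenvalues of A'A).
s = np.linalg.svd(A, compute_uv=False)
print(x_hat)
print(sigma**2 * np.sum(1.0 / s**2))
```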
When is Gauss-Markov MSE Large?

- When $A'A$ is ill-conditioned (nearly rank-deficient), the OLS solution is sensitive to noise ($\tilde{y} = y + \varepsilon$)
- This can occur when covariates are highly correlated, the number of covariates exceeds the number of observations, or A is sparse
- OLS is still BLUE
  - But the standard errors of $\hat{x}$, and the MSE, will be large (the $\sigma_i$'s are small):

$$\mathrm{Var}(\hat{x}) = \sigma^2(A'A)^{-1}, \qquad \mathrm{MSE}(\hat{x}) = \sigma^2\,\mathrm{Tr}(A'A)^{-1} = \sigma^2\sum_i \frac{1}{\sigma_i^2}$$

- And $\hat{x}$ can deviate very far from the true solution $x$
Example: Noise is magnified if the covariate matrix is ill-conditioned

$$A = U\Sigma V' = \begin{bmatrix}1 & 0\\ 0 & 1\end{bmatrix}\begin{bmatrix}1 & 0\\ 0 & 10^{-6}\end{bmatrix}\begin{bmatrix}1 & 0\\ 0 & 1\end{bmatrix} \quad (1)$$

$$(A'A)^{-1}A'y = \begin{bmatrix}1 & 0\\ 0 & 10^{6}\end{bmatrix}\begin{bmatrix}1\\ 1\end{bmatrix} = \begin{bmatrix}1\\ 10^{6}\end{bmatrix} \quad \text{Noiseless solution} \quad (2)$$

$$(A'A)^{-1}A'(y+e) = \begin{bmatrix}1 & 0\\ 0 & 10^{6}\end{bmatrix}\begin{bmatrix}1\\ 1.1\end{bmatrix} = \begin{bmatrix}1\\ 10^{6}+10^{5}\end{bmatrix} \quad \text{Naive solution} \quad (3)$$
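The same magnification can be checked in a few lines of numpy, using exactly the matrices and perturbation on the slide:

```python
import numpy as np

A = np.diag([1.0, 1e-6])          # the slide's A = U Sigma V' with U = V = I
y = np.array([1.0, 1.0])          # noiseless observations
e = np.array([0.0, 0.1])          # a small perturbation of the second entry

pinv = np.linalg.inv(A.T @ A) @ A.T   # (A'A)^{-1} A' = diag(1, 1e6)

print(pinv @ y)        # noiseless solution: [1, 1e6]
print(pinv @ (y + e))  # naive solution: [1, 1.1e6]; 0.1 of noise became 1e5
```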
Regularization problems formulate a nearby problem, which has a more stable solution:

$$\min_x \left(\|Ax - y\|_2^2 + \lambda\|x\|_2^2\right), \quad \lambda > 0,$$

where we introduce the term $\lambda\|x\|_2^2$, perturbing the least-squares formulation.

- The estimate $\hat{x}$ is no longer unbiased; however, the MSE = Variance + Bias², may be smaller than BLUE's.

The regularization can be L1 (Lasso) or L2. The objective is to choose a $\lambda$ that brings $\hat{x}$ close to the noiseless solution.
The regularized least squares problem becomes:

$$\min_x \|Ax - y\|_2^2 + \lambda\|x - x_0\|_2^2$$

$$\hat{x} = (A'A + \lambda I_n)^{-1}A'y$$

$$\mathrm{MSE} = \sigma^2\sum_{i=1}^{n} \frac{\sigma_i^2}{(\sigma_i^2+\lambda)^2} + \lambda^2\sum_{i=1}^{n} \frac{\alpha_i^2}{(\sigma_i^2+\lambda)^2}$$

(Hoerl and Kennard, 1970)²

²Note: where $\alpha = V'x$ and $A'A = V\Sigma^2V'$.
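A short sketch of the ridge estimate and the Hoerl-Kennard MSE decomposition above; the function names are mine, and alpha follows the footnote's definition:

```python
import numpy as np

def ridge(A, y, lam):
    """Standard ridge estimate: (A'A + lam*I)^{-1} A'y."""
    p = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ y)

def ridge_mse(A, x_true, sigma, lam):
    """Hoerl-Kennard MSE: variance term plus squared-bias term."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    alpha = Vt @ x_true                                  # alpha = V'x
    variance = sigma**2 * np.sum(s**2 / (s**2 + lam)**2)
    bias_sq = lam**2 * np.sum(alpha**2 / (s**2 + lam)**2)
    return variance + bias_sq
```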
Other regularization methods include:

- Lasso, with an L1 penalty weighted by λ
- Truncated SVD
- Elastic Net, a weighted sum of L1 and L2 penalties
- "Generalized" Tikhonov
Introduction to Generalized Iterative Ridge

- Rather than computing one solution, I will compute a sequence of solutions that approximates the noiseless solution $(A'A)^{-1}A'y$ on its way to the naive solution $(A'A)^{-1}A'(y+e)$.
- Since the operator $\lambda(A'A + \lambda I)^{-1} = V\,\mathrm{Diag}\!\left[\frac{\lambda}{\sigma_i^2+\lambda}\right]V'$ has eigenvalues less than one, it is contracting.
- I show that repeated application of this operator converges to the naive solution.

The new normal equations with the added constraint become:

$$(A'A + \lambda I)x_1 = A'y + \lambda x_0$$
Introduction to Generalized Iterative Ridge
$$(A'A + \lambda I)x_1 = A'y + \lambda x_0$$

The solution for $x_1$ is:

$$x_\lambda^1 = (A'A + \lambda I)^{-1}A'y + \lambda(A'A + \lambda I)^{-1}x_0 \quad \text{(Standard Ridge when } x_0 = 0\text{)}$$

Then substituting $x_\lambda^1$ in for $x_0$, we obtain $x_\lambda^2$; iterating, with $x_\lambda^{k-1}$ in place of $x_0$, we obtain:

$$x_\lambda^k = \sum_{i=1}^{k}\lambda^{i-1}(A'A + \lambda I)^{-i}A'y \;+\; \lambda^k(A'A + \lambda I)^{-k}x_0,$$

where $\sum_{i=1}^{k}\lambda^{i-1}(A'A + \lambda I)^{-i}A'y$ is a geometric series in the contracting operator $\lambda(A'A + \lambda I)^{-1}$.
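The recursion translates directly into a loop; a minimal sketch, assuming x0 = 0 by default (the function name is mine):

```python
import numpy as np

def iterative_ridge(A, y, lam, k, x0=None):
    """k applications of x_j = (A'A + lam*I)^{-1} (A'y + lam * x_{j-1});
    k = 1 with x0 = 0 is standard ridge."""
    p = A.shape[1]
    M = A.T @ A + lam * np.eye(p)
    Aty = A.T @ y
    x = np.zeros(p) if x0 is None else np.asarray(x0, dtype=float)
    for _ in range(k):
        x = np.linalg.solve(M, Aty + lam * x)
    return x
```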
Generalized Iterative Solution, $x_\lambda^k$

Now we sum the geometric series using the singular value decomposition (SVD) of $A$: $A = U\Sigma V^T$, where $\Sigma = \mathrm{Diag}[\sigma_1, \ldots, \sigma_n, 0, \ldots, 0] \in \mathbb{R}^{m \times n}$ with $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n > 0$, and where $n$ is the rank of $A$. After $k$ iterations, the iterative solution is:

$$x_\lambda^k = V\,\mathrm{Diag}\!\left[\frac{1}{\sigma_1} - \frac{1}{\sigma_1}\left(\frac{\lambda}{\lambda+\sigma_1^2}\right)^k,\; \ldots,\; \frac{1}{\sigma_n} - \frac{1}{\sigma_n}\left(\frac{\lambda}{\lambda+\sigma_n^2}\right)^k,\; 0, \ldots, 0\right]U^T y$$

For practical ease we can set $x_0 = 0$. It does not affect the solution, as the last term, attached to $x_0$, is pulled quickly towards 0.
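The closed form suggests an SVD implementation; a sketch, again with x0 = 0:

```python
import numpy as np

def iterative_ridge_svd(A, y, lam, k):
    """Closed form of the k-step solution with x0 = 0:
    x = V Diag[(1/s_i) * (1 - (lam/(lam + s_i^2))**k)] U'y."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    filt = (1.0 - (lam / (lam + s**2))**k) / s
    return Vt.T @ (filt * (U.T @ y))
```

For λ > 0 this matches the loop version above, and a single SVD serves every (λ, k) pair, which is convenient for the grid search discussed later.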
The limits of Generalized Iterative Ridge (GIR)

The GIR solution tends towards the naive solution as $k \to \infty$: $(A'A)^{-1}A'y = V\Sigma^{-1}U'y$, or

$$V\,\mathrm{Diag}\!\left[\frac{1}{\sigma_1},\; \ldots,\; \frac{1}{\sigma_n},\; 0, \ldots, 0\right]U^T y$$
The limits of Generalized Iterative Ridge (GIR)

When $k = 1$, GIR collapses to Standard Ridge:

$$V\,\mathrm{Diag}\!\left[\frac{\sigma_1}{\lambda+\sigma_1^2},\; \ldots,\; \frac{\sigma_n}{\lambda+\sigma_n^2},\; 0, \ldots, 0\right]U^T y$$

The diagonal entries are the filters for the singular values, and it is clear that the iterative solution's filter is a generalization of the two extremes.
Summary, Generalized Iterative Ridge

- Falls between Standard Ridge and OLS
- The filter is more general than Standard Ridge's
Choosing k and λ using the Residual Error

The residual error $\|Ax^k - y\|$ has a simple expression as a function of $k$ and $\lambda$; we can choose its lowest point or its point of maximum curvature:

$$RE(\lambda, k) = \left\|\,\mathrm{Diag}\!\left[\left(\frac{\lambda}{\lambda+\sigma_1^2}\right)^k,\; \ldots,\; \left(\frac{\lambda}{\lambda+\sigma_n^2}\right)^k,\; 0, \ldots, 0\right]U'y\,\right\|$$
Choosing k and λ using the Residual Error
We can see that the residual error $\|Ax^k - y\|$ is convex and decreases as $k$ increases, for a given $\lambda$.

[Figure: residual error as a function of k, for a given λ. Choose k at the elbow.]
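A sketch of the residual-error curve and one way to pick k at its elbow; the maximum-curvature rule is my reading of "choose k at the elbow", not a procedure spelled out in the talk.

```python
import numpy as np

def residual_error(A, y, lam, k):
    """RE(lam, k) = || Diag[(lam/(lam + s_i^2))**k] U'y ||."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    return np.linalg.norm((lam / (lam + s**2))**k * (U.T @ y))

def elbow_k(A, y, lam, k_max=50):
    """Choose k near the elbow: the point of maximum discrete curvature
    (largest second difference) of the residual-error curve."""
    ks = np.arange(1, k_max + 1)
    re = np.array([residual_error(A, y, lam, k) for k in ks])
    curvature = re[:-2] - 2.0 * re[1:-1] + re[2:]
    return int(ks[1 + np.argmax(curvature)])
```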
Intuition for choosing k at the residual error's elbow

The residual error's convexity can be seen when we take the difference of $x_\lambda^k$ and the noiseless OLS solution $x$:

- The first term (S1), the iteration error, declines monotonically as $k \to \infty$.
- The second term (S2), the noise term, increases monotonically with $k$ (towards the OLS residual).

$$\text{(S1)}\quad -\,V\,\mathrm{Diag}\!\left[\frac{1}{\sigma_1}\left(\frac{\lambda}{\lambda+\sigma_1^2}\right)^k,\; \ldots,\; \frac{1}{\sigma_n}\left(\frac{\lambda}{\lambda+\sigma_n^2}\right)^k,\; 0, \ldots, 0\right]U^T y$$

$$\text{(S2)}\quad V\,\mathrm{Diag}\!\left[\frac{1}{\sigma_1} - \frac{1}{\sigma_1}\left(\frac{\lambda}{\lambda+\sigma_1^2}\right)^k,\; \ldots,\; \frac{1}{\sigma_n} - \frac{1}{\sigma_n}\left(\frac{\lambda}{\lambda+\sigma_n^2}\right)^k,\; 0, \ldots, 0\right]U^T e$$
Summary, Choosing k and λ

- Grid search over λ
- Choose k at the elbow
Comparing the Filters of Standard Ridge and GIR

- A key contribution of the iterative solution is in the filters.
- The filters dampen the effects of small singular values.

BLUE: $\dfrac{1}{\sigma_i}$

Standard Ridge: $\dfrac{1}{\sigma_i}\left(1 - \dfrac{\lambda}{\lambda+\sigma_i^2}\right) = \dfrac{1}{\sigma_i}\cdot\dfrac{\sigma_i^2}{\lambda+\sigma_i^2}$

Iterative Ridge: $\dfrac{1}{\sigma_i}\left(1 - \left(\dfrac{\lambda}{\lambda+\sigma_i^2}\right)^k\right)$

What is the extra advantage of the k iteration parameter in the filter?
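To see the answer numerically, compare the filter factors (the multiplier on 1/σᵢ) across methods; the singular-value grid and the choices λ = 0.1, k = 20 are illustrative:

```python
import numpy as np

s = np.array([10.0, 1.0, 0.1, 0.01])     # illustrative singular values
lam, k = 0.1, 20

f_blue = np.ones_like(s)                 # BLUE: no damping at all
f_ridge = 1.0 - lam / (lam + s**2)       # = s^2 / (lam + s^2)
f_iter = 1.0 - (lam / (lam + s**2))**k   # k relaxes the damping

print(np.column_stack([s, f_blue, f_ridge, f_iter]))
```

For mid-sized singular values the iterative filter stays near one while standard ridge already damps heavily, which is the flexibility the next slide describes.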
Iterative Ridge introduces additional flexibility in the weighting of each covariate

- With Standard Ridge/Tikhonov, even medium-valued σ's are heavily penalized (for λ = 0.1)
- Notice that Standard Ridge with λ = 0.0001 and Iterative Ridge with λ = 0.01, k = 20 have similar filter shapes; however, a small λ amplifies the noise.
Comparing the MSEs for BLUE, Standard Ridge, and GIR
Last but not least, the generalized MSE for Iterative Ridge exposes the variance and the bias:³

MSE, OLS:

$$\sigma^2\sum_i \frac{1}{\sigma_i^2}$$

MSE, Ridge:

$$\sigma^2\sum_{i=1}^{n} \frac{\sigma_i^2}{(\sigma_i^2+\lambda)^2} \;+\; \lambda^2\sum_{i=1}^{n} \frac{\alpha_i^2}{(\sigma_i^2+\lambda)^2}$$

MSE, Iterative Ridge:

$$\sigma^2\sum_{i=1}^{n} \frac{1}{\sigma_i^2}\left(1 - \left(\frac{\lambda}{\lambda+\sigma_i^2}\right)^k\right)^2 \;+\; \lambda^{2k}\sum_{i=1}^{n} \alpha_i^2\left(\frac{1}{\sigma_i^2+\lambda}\right)^{2k}$$

³Note: where $\alpha = V'x$ and $A'A = V\Sigma^2V'$.
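The iterative-ridge MSE can be evaluated straight from the SVD; a sketch assuming a simulation setting where the true x and noise level are known (k = 1 recovers the standard-ridge expression):

```python
import numpy as np

def iterative_ridge_mse(A, x_true, sigma, lam, k):
    """MSE of the k-step solution: with r_i = lam/(lam + s_i^2),
    variance = sigma^2 * sum_i (1 - r_i^k)^2 / s_i^2 and
    bias^2   = sum_i alpha_i^2 * r_i^(2k)
             = lam^(2k) * sum_i alpha_i^2 / (s_i^2 + lam)^(2k)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    alpha = Vt @ x_true                     # alpha = V'x
    r = lam / (lam + s**2)
    variance = sigma**2 * np.sum((1.0 - r**k)**2 / s**2)
    bias_sq = np.sum(alpha**2 * r**(2 * k))
    return variance + bias_sq
```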
Part II
How do we know we can do better with Generalized Ridge?

- Provide a small simulated example with known x, noisy y, and ill-conditioned A
- Provide a real data example where we test our out-of-sample predictions
- Provide an image example where we recover an image
Example 1: Simulated Data
Let's take the Golub matrix, an ill-conditioned matrix:

$$\begin{bmatrix} 1 & 3 & 11 & 0 & -11 & -15 \\ 18 & 55 & 209 & 15 & -198 & -277 \\ -23 & -33 & 144 & 532 & 259 & 82 \\ 9 & 55 & 405 & 437 & -100 & -285 \\ 3 & -4 & -111 & -180 & 39 & 219 \\ -13 & -9 & 202 & 346 & 401 & 253 \end{bmatrix}$$

The singular values are:

s = [9.545e+02, 7.240e+02, 1.767e+02, 7.301e+01, 3.536e-01, 3.169e-10]

So the matrix's condition number is 9.545e+02 / 3.169e-10, or about 3 × 10¹² (3 trillion).
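This is easy to verify; a numpy sketch with the matrix copied from the slide:

```python
import numpy as np

A = np.array([
    [  1,   3,   11,    0,  -11,  -15],
    [ 18,  55,  209,   15, -198, -277],
    [-23, -33,  144,  532,  259,   82],
    [  9,  55,  405,  437, -100, -285],
    [  3,  -4, -111, -180,   39,  219],
    [-13,  -9,  202,  346,  401,  253],
], dtype=float)

s = np.linalg.svd(A, compute_uv=False)
print(s)              # [9.545e+02, ..., 3.169e-10]
print(s[0] / s[-1])   # condition number, on the order of 3e12
```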
Framework of Simulated Problem

Suppose we know the true $x = [1, 1, 1, 1, 1, 1]$.

So we also know the true $y = Ax$.

We add a small normally distributed error to $y$, with s.d. 0.1.

Now we would like to recover the known $x$ in the presence of noise.
Recall, the noiseless solution for $x = [1, 1, 1, 1, 1, 1]$ is:

$$V\,\mathrm{Diag}\!\left[\frac{1}{\sigma_1},\; \ldots,\; \frac{1}{\sigma_n},\; 0, \ldots, 0\right]U^T y$$

And the naive solution, $(A'A)^{-1}A'(y+e)$, is:

$$V\,\mathrm{Diag}\!\left[\frac{1}{\sigma_1},\; \ldots,\; \frac{1}{\sigma_n},\; 0, \ldots, 0\right]U^T (y+e)$$
Solutions to Simulated Problem

$\hat{x}$ with OLS, Ridge, and Iterative Ridge:

OLS with noise: [3.08e+08, -1.44e+08, 1.12e+07, 1.36e+06, -9.11e+04, -1.49e+04], with $\|\hat{x} - x\|$ = 340,863,251!

Ridge, λ = 0.1: [0.067, 0.24, 1.25, 0.89, 0.88, 1.054], with $\|\hat{x} - x\|$ = 0.92

Iterative Ridge, λ = 0.1, k = 5: [0.08, 0.29, 1.23, 0.89, 0.89, 1.05], with $\|\hat{x} - x\|$ = 0.68

λ was chosen through a grid search, and k via the elbow algorithm.
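A self-contained sketch of the whole simulation; since the noise draw here is my own, expect norms close to, but not exactly matching, the slide's.

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([
    [  1,   3,   11,    0,  -11,  -15],
    [ 18,  55,  209,   15, -198, -277],
    [-23, -33,  144,  532,  259,   82],
    [  9,  55,  405,  437, -100, -285],
    [  3,  -4, -111, -180,   39,  219],
    [-13,  -9,  202,  346,  401,  253],
], dtype=float)
x_true = np.ones(6)
y = A @ x_true + 0.1 * rng.standard_normal(6)   # noisy observations

def iterative_ridge(A, y, lam, k):
    M = A.T @ A + lam * np.eye(A.shape[1])
    x = np.zeros(A.shape[1])
    for _ in range(k):
        x = np.linalg.solve(M, A.T @ y + lam * x)
    return x

x_ols = np.linalg.pinv(A) @ y                   # naive (A'A)^{-1}A'(y+e)
for name, x_hat in [("OLS", x_ols),
                    ("Ridge (k=1)", iterative_ridge(A, y, 0.1, 1)),
                    ("Iterative (k=5)", iterative_ridge(A, y, 0.1, 5))]:
    print(f"{name:16s} ||x_hat - x|| = {np.linalg.norm(x_hat - x_true):.3g}")
```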
Example 2: Real Data
- Using the Body Fat data set (Carnegie Mellon Statistics Data Library)
- Predict body fat based on 17 variables
- Covariates include neck, chest, abdomen, hip, thigh, and wrist circumference, and hence are correlated
- The data are ill-conditioned, with σ₁₇ = 0.000027
- We split the data into training [202 × 17] and test [50 × 17] sets
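A sketch of the out-of-sample evaluation; loading the Body Fat data is omitted (the splits follow the slide), and the MSPE convention here is the mean squared prediction residual; see the footnote on the next slide for the talk's definition.

```python
import numpy as np

def mspe(A_train, y_train, A_test, y_test, lam, k):
    """Fit k-step iterative ridge on the training split and report the
    mean squared prediction error on the test split. (The talk's footnote
    defines MSPE via the norm of predicted-minus-true y; swap in
    np.linalg.norm on the residual if you want that convention.)"""
    p = A_train.shape[1]
    M = A_train.T @ A_train + lam * np.eye(p)
    x = np.zeros(p)
    for _ in range(k):
        x = np.linalg.solve(M, A_train.T @ y_train + lam * x)
    return np.mean((A_test @ x - y_test)**2)
```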
Results

- BLUE, unperturbed: MSPE = 0.7740
- With added error, N(0,1)⁴: MSPE = 2.47, roughly a threefold increase⁵
- Ridge, λ = 100, k = 1: MSPE = 2.16, about a 15% improvement over perturbed BLUE
- Iterative Ridge, k = 2: MSPE = 1.73, about a 30% improvement over perturbed BLUE
- Iterative Ridge, λ = 300, k = 5: MSPE = 1.72
- Iterative Ridge, λ = 500, k = 7: MSPE = 1.71

⁴std(y) = 7.7509, so e with std = 1 is not excessive.
⁵Mean Squared Predicted Error is the norm of the difference between predicted and true y.
Example 3: Image Data
Example of a 64 × 64 pixel image, where A is sparse:
[Figure: image reconstructions with BLUE; Ridge, λ = 0.0034 (computed by a method known as the L-curve); and Iterative Ridge, λ = 1, k = 20.]
Summary

- Developed a generalization of ridge regression
- The additional parameter k provides more flexibility in balancing bias and noise
- Iterative Ridge performs better when there are a number of small singular values, A is sparse, and y is noisy
Future Work

- Improve on the grid search for λ and k
- Prove conditions under which Generalized Ridge does better for any given λ
- Apply to "big" data problems: consumer data (FB); weather & sensor data
- Overall, this research is converging towards using larger data sets for applied problems