
Basic Kernel Methods


The Essentials of Data Analytics and Machine Learning
[A guide for anyone who wants to learn practical machine learning using R]

Author: Dr. Mike Ashcroft

Editor: Ali Syed

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/.

© 2016 Dr. Michael Ashcroft and Persontyle Limited


BASIC KERNEL METHODS

In this module we discuss the theory behind kernel methods, centered around the famous ‘kernel trick’, and examine two basic examples of kernel methods: kernel regression, a non-linear regression method which ‘kernelizes’ linear regression techniques (see module 7), and kernel PCA, a non-linear feature transformation technique which kernelizes ordinary PCA (see module 5). The most famous application of the kernel trick is the support vector machine, which we discuss separately in module 17.

Module 12


The order of topics is as follows: we will first provide a basic overview of the kernel trick. We then list some common kernels, followed by an example application of the kernel trick in kernel regression. We then provide the background theory behind the kernel trick, which delves into quite complex mathematics and may be skipped by the intimidated reader. Finally, we examine kernel PCA.

The Kernel Trick

The popularity and importance of kernel methods is based on the so-called kernel trick. Stated simply, by utilizing positive definite kernels we can work as though we had projected our data into a high dimensional space while avoiding explicitly performing this projection. The key to doing so is the kernel matrix, K, whose elements, given a positive definite kernel function k and dataset X, are:

K_ij = k(x_i, x_j)

This kernel matrix, K, is equivalent to the result of projecting our data into some higher dimensional space and performing the inner product operation on all pairs of projected data points in that space (hence its alternative name, the Gram matrix). The particular space projected into depends on the kernel function used. It is common to refer to this implicit space as the feature space, as opposed to the input space of the original features.
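As a concrete illustration, here is a minimal R sketch (not from the text) that builds the kernel matrix of a small dataset under a Gaussian kernel (defined later in this module); the names gaussKernel and gramMatrix are illustrative only:

# A minimal sketch: the kernel (Gram) matrix of a dataset X under a Gaussian
# kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
gaussKernel = function(sigma) function(x, y) exp(-sum((x - y)^2) / (2 * sigma^2))

gramMatrix = function(X, k)
  outer(1:nrow(X), 1:nrow(X), Vectorize(function(i, j) k(X[i, ], X[j, ])))

X = as.matrix(iris[1:5, 1:4])      # any small numeric matrix will do
K = gramMatrix(X, gaussKernel(1))  # 5 x 5, symmetric, with ones on the diagonal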

The upshot of this is that any algorithm that can be expressed solely in terms of the inner product of data points can work with a kernel matrix instead, and by doing so will function as though we projected the data into the kernel-dependent feature space. The most famous example of such kernelization is the transformation of the linear support vector classifier to the non-linear support vector machine (see module 17).

Justification for this is sometimes made by appeal to Vapnik–Chervonenkis (VC) theory, which tells us that mappings which take us from the input space into a higher dimensional space often provide us with greater classification power. This is also the sentiment of Cover’s famous separability theorem:

A complex pattern-classification problem, cast in a high-dimensional space nonlinearly, is more likely to be linearly separable than in a low-dimensional space, provided that the space is not densely populated. (Cover, 1965)

More generally, the practice can be justified by noting that this is the approach that many successful algorithms, both classification and regression, take. Nonetheless, it is unclear whether simply projecting into an arbitrary higher dimensional space is likely to be beneficial. Since the feature space depends on the kernel, this means that care needs to be taken when choosing kernels. Further, many kernels are parameterized, and such parameters will themselves need to be optimized as hyper-parameters.

Common Kernels

If the kernel used is important, what kernels are commonly used? There are quite a few common kernels, and the list we include should not be considered exhaustive. It should be noted that not all of these ‘kernels’ are actually kernel functions as defined in module 11. In particular, they may not integrate to unity. Nonetheless they work as required for the kernel trick.

A kernel, k, is stationary if k(x, y) depends only on the distance between x and y, not on their values. If a kernel is not stationary, scaling the variables will give different results. It is suggested that non-stationary kernels be used only with scaled and centered data.

A kernel is conditionally positive definite if certain values of its parameters can result in the kernel failing to be positive definite (as required for the kernel trick). Care should be taken when choosing parameters for conditionally positive definite kernels.

1. Linear kernel

Equation: k(x, y) = xᵀy + c

Parameters: c

Comments: To use a linear kernel simply means to work with the (potentially shifted) inner product in the input space. If the optional parameter c is 0, this simply recovers the basic algorithms. Obviously, using a linear kernel will not introduce non-linearity into otherwise linear algorithms.


2. Polynomial kernel

Equation: k(x, y) = (α xᵀy + β)^γ

Parameters: α, β and γ. Often it is parameterized only by γ, with α = 1 and β = 0.

Comments: The polynomial kernel is non-stationary, so should normally be used on normalized data.

3. Gaussian kernel

Equation: k(x, y) = exp(−‖x − y‖² / (2σ²))

Parameters: σ

Comments: Care should be taken in fitting the σ parameter. If it is overestimated, the projection becomes almost linear and the process loses its non-linear power. If it is underestimated, the result is very sensitive to noise and the chance of overfitting is high.

4. Exponential & Laplacian kernels

Equation: k(x, y) = exp(−‖x − y‖ / (2σ²)) and k(x, y) = exp(−‖x − y‖ / σ)

Parameters: σ

Comments: Note that these two forms are equivalent up to a reparameterization of σ, and differ from the Gaussian only in that the norm in the exponent is not squared. The two forms do differ, though, in their sensitivity to changes in the σ parameter, so a linear grid search over σ will behave quite differently in the two cases.

5. Hyperbolic Tangent (or Sigmoid) kernel

Equation: k(x, y) = tanh(α xᵀy + β)

Parameters: α and β. A common value for α is 1/N, where N is the size of the training data.

Comments: The popularity of this kernel comes from the fact that when used with a support vector machine (see module 17) the result is equivalent to a two-layer perceptron neural network (see module 16). The hyperbolic tangent kernel is only conditionally positive definite.


6. Rational Quadratic kernel

Equation: k(x, y) = 1 − ‖x − y‖² / (‖x − y‖² + c)

Parameters: c

Comments: The rational quadratic kernel is often used as a replacement for the Gaussian when using the Gaussian is too computationally expensive.

7. Circular and Spherical kernels

Equation: k(x, y) = (2/π) [ arccos(−‖x − y‖/σ) − (‖x − y‖/σ) √(1 − (‖x − y‖/σ)²) ], if ‖x − y‖ < σ, otherwise 0;

and

k(x, y) = 1 − (3/2)(‖x − y‖/σ) + (1/2)(‖x − y‖/σ)³, if ‖x − y‖ < σ, otherwise 0.

Parameters: σ

Comments: The circular and spherical kernels are popular in geostatistics. The circular kernel is positive definite in two dimensions, and the spherical kernel is positive definite in three dimensions.

8. Cauchy kernel

Equation: k(x, y) = 1 / (1 + ‖x − y‖²/σ²)

Parameters: σ

Comments: The Cauchy kernel is fat-tailed (like the Cauchy distribution), leading to a high degree of long-range influence between data points.

9. Log kernel

Equation: k(x, y) = −log(‖x − y‖^d + 1)

Parameters: d


Comments: The log kernel is popular in image analysis. It is conditionally positive definite.

Kernel Regression

The recipe for kernel regression is extremely simple. Given data {⟨x_i, y_i⟩}, we simply need to:

1. Select a positive definite kernel, k.
2. Calculate the kernel matrix K of k on the data.
3. Calculate α̂ = (K + λI)⁻¹y

You now have the model ŷ(x) = Σ_{i=1}^N α̂_i k(x, x_i). When k is simply the inner product, this is (the dual form of) ridge regression, with the regularization parameter λ from step 3. When k is a genuine kernel function, this model is equivalent to performing ridge regression in the implicit feature space associated with the kernel k.
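Read literally, the recipe is just one kernel matrix and one linear solve. A minimal R sketch (hypothetical names, not the author's code; it assumes a kernel function k, an input matrix X whose rows are the x_i, a response vector y and a penalty lambda) is:

# Step 2: kernel matrix over the training rows of X
K = outer(1:nrow(X), 1:nrow(X), Vectorize(function(i, j) k(X[i, ], X[j, ])))
# Step 3: dual coefficients alpha-hat = (K + lambda I)^-1 y
alphaHat = solve(K + lambda * diag(nrow(K)), y)
# The fitted model: y-hat(x) = sum_i alphaHat_i * k(x, x_i)
predictNew = function(xnew) sum(alphaHat * apply(X, 1, function(xi) k(xnew, xi)))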

Easy! So let's try a manual example. We will use the exponential kernel, and will create a ‘constructor’ function that we can use to make exponential kernel functions for different values of σ:

eDist = function(x1, x2) sqrt(sum((x1 - x2)^2))
makeExpKernel = function(sigma) function(x1, x2) exp(-eDist(x1, x2) / (2 * sigma^2))

We will work with the swiss dataset from the datasets package.

> library(datasets)
> data(swiss)
> head(swiss)
             Fertility Agriculture Examination Education Catholic Infant.Mortality
Courtelary        80.2        17.0          15        12     9.96             22.2
Delemont          83.1        45.1           6         9    84.84             22.2
Franches-Mnt      92.5        39.7           5         5    93.40             20.2
Moutier           85.8        36.5          12         7    33.77             20.3
Neuveville        76.9        43.5          17        15     5.16             20.6
Porrentruy        76.1        35.3           9         7    90.57             26.6

We will try to estimate fertility from the remaining variables.

To get some idea of the performance of our kernel regression model, let's first see how ordinary OLS regression performs under ABO (i.e. leave-one-out) cross-validation:

> sum(sapply(1:nrow(swiss),function(i){
+   model=lm(Fertility~.,swiss[-i,])
+   (swiss[i,1]-predict(model,swiss[i,]))^2
+ }))/nrow(swiss)
[1] 59.88621


Now we create two functions. The first creates an S3 kernel regression model object, fitting the parameters from data. The second provides the prediction function for this new class:

krr=function(X,Y,lambda,k) {
  xS=scale(X)
  yS=scale(Y)
  K=ker(xS,f=k)
  A=getA(lambda,K,yS)
  out=list(X=X,Y=Y,xS=xS,yS=yS,lambda=lambda,K=K,k=k,A=A)
  class(out)<-"krr"
  return(out)
}

predict.krr=function(model,X) {
  if (is.null(nrow(X))) X=matrix(X,nrow=1)
  xS=(X-attr(model$xS,"scaled:center"))/attr(model$xS,"scaled:scale")
  K_=ker(xS,t(model$xS),model$k)
  yS=K_%*%model$A
  yS*attr(model$yS,"scaled:scale")+attr(model$yS,"scaled:center")
}
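Note that krr and predict.krr call two helper functions, ker (which builds a kernel matrix) and getA (which computes the dual coefficients), that are not shown in the text. A minimal sketch consistent with how they are called, assumed rather than taken from the original, is:

# Assumed helpers, not shown in the original text.
# ker evaluates f on every row of X against every column of Y
# (by default Y = t(X), giving the usual square kernel matrix).
ker = function(X, Y = t(X), f)
  t(apply(X, 1, function(r) apply(Y, 2, function(c) f(r, c))))
# getA computes the dual ridge coefficients (K + lambda I)^-1 y.
getA = function(lambda, K, y) solve(K + lambda * diag(nrow(K)), y)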

Note that we let the user pass a kernel function when creating the model, as well as a lambda value for the ridge regression regularization penalty. Note also that we scale and center our data, and hence have to perform the same transformation on new data, and have to undo the transformation on the predictions. We will also create an ABO cross-validation function for our kernel regression models:

abo_krr = function(X,Y,lambda,f)
  sum(sapply(1:nrow(X),function(i)
    (Y[i]-predict(krr(X[-i,],Y[-i],lambda,f),X[i,]))^2))/nrow(X)

Again, the user passes a kernel function and lambda value.

We are going to need to optimize two hyper-parameters: the ridge regression penalty, lambda, and the exponential kernel function parameter, sigma. So let us create a function which we can use to perform grid search on these hyper-parameters:

fitKrr=function(X,Y,kGen,Lambda,Sigma)
  outer(Lambda,Sigma,function(lambda,sigma)
    sapply(1:length(lambda),function(i)
      abo_krr(X,Y,lambda[i],kGen(sigma[i]))))

We can now pass sequences of lambda and sigma values to this fitKrr function and obtain ABO cross-validation results for all combinations of these hyperparameters. We have included a kGen argument in the fitKrr function that would allow us to generalize to other kernels if we wanted to. Here we will always pass the makeExpKernel function as the kGen argument. We begin the grid search:

> fitKrr(swiss[2:6],swiss[,1],makeExpKernel,seq(.1,1,.1),c(1,seq(1,10,1)))
          [,1]     [,2]     [,3]     [,4]      [,5]      [,6]      [,7]      [,8]      [,9]     [,10]     [,11]
 [1,] 66.76466 66.76466 56.97639 57.77361  60.47269  63.96155  67.92341  72.18661  76.62098  81.12003  85.59706
 [2,] 68.11137 68.11137 59.57889 62.51358  67.46776  73.25840  79.42440  85.66105  91.75151  97.54879 102.96319
 [3,] 69.69506 69.69506 62.28342 66.80953  73.44344  80.80948  88.28681  95.49160 102.19623 108.28837 113.73564
 [4,] 71.34992 71.34992 64.86890 70.69514  78.64261  87.11289  95.36965 103.01243 109.85828 115.86466 121.06963
 [5,] 73.00610 73.00610 67.30953 74.23665  83.22593  92.47101 101.17073 108.95645 115.71902 121.49348 126.38156
 [6,] 74.63170 74.63170 69.61139 77.48582  87.30578  97.08785 106.01167 113.77247 120.34554 125.83850 130.40469
 [7,] 76.21196 76.21196 71.78608 80.48269  90.96544 101.10965 110.11305 117.75315 124.08961 129.29298 133.55648
 [8,] 77.74050 77.74050 73.84520 83.25898  94.26909 104.64537 113.63217 121.09791 127.18109 132.10473 136.09193
 [9,] 79.21524 79.21524 75.79922 85.84047  97.26758 107.77840 116.68456 123.94746 129.77655 134.43755 138.17550
[10,] 80.63637 80.63637 77.65731 88.24849 100.00205 110.57387 119.35704 126.40391 131.98625 136.40401 139.91797

The optimal value, 56.97639, lies on the edge of the grid, at the smallest λ we tried. So we both shift the λ values searched over and zoom in further:

> fitKrr(swiss[2:6],swiss[,1],makeExpKernel,seq(.01,.1,.01),c(1,seq(2,6,.5)))
          [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]     [,9]    [,10]
 [1,] 66.13120 56.45151 55.48076 54.93691 54.59060 54.36986 54.24720 54.20872 54.24486 54.34754
 [2,] 66.15055 56.18133 55.18351 54.70519 54.51038 54.51177 54.66417 54.93755 55.30943 55.76237
 [3,] 66.18665 56.06348 55.13604 54.80970 54.82496 55.07240 55.49040 56.03865 56.68927 57.42261
 [4,] 66.23736 56.04919 55.23592 55.08531 55.31111 55.78538 56.43647 57.21991 58.10689 59.07826
 [5,] 66.30090 56.10804 55.42805 55.45490 55.87807 56.55671 57.41418 58.40522 59.50169 60.68511
 [6,] 66.37575 56.22001 55.68014 55.87820 56.48373 57.34773 58.39184 59.57134 60.85885 62.23608
 [7,] 66.46060 56.37132 55.97219 56.33266 57.10683 58.14112 59.35712 60.71098 62.17562 63.73231
 [8,] 66.55432 56.55224 56.29126 56.80497 57.73603 58.92868 60.30520 61.82233 63.45277 65.17682
 [9,] 66.65596 56.75568 56.62869 57.28700 58.36507 59.70649 61.23435 62.90556 64.69207 66.57284
[10,] 66.76466 56.97639 56.97855 57.77361 58.99040 60.47269 62.14421 63.96155 65.89549 67.92341

We could continue, but will stop here. We see that we are capable of creating kernel regression models using the exponential kernel that achieve approximately 54.25 ABO CV MSE, which is clearly a significant improvement over the OLS model. The optimized hyperparameters that achieved this result were λ = .01 and σ = 4.5. Note, though, how poorly many of the alternative models in the grid search performed.

We have been unable to find generic kernel regression packages. There are some functions available in the gelnet package, but these appear to be little more than parameter holders, requiring you to implement all computations manually. This apparent absence is misleading, as we will see that some of the powerful techniques we examine later are in fact kernel regression techniques with specialized kernels.

The Theory

The simplicity of the recipe for kernel regression is matched by the complexity of the theory behind it. We provide an overview of this theory here.

Firstly, we state that a positive definite kernel permits the construction of a normed Hilbert space. Take a kernel K(x, x′). There is a corresponding function space H_K generated by the linear span of {K(·, x′), x′ ∈ R^p}.

Let K have the eigen-expansion:


K(x, x′) = Σ_{i=1}^∞ γ_i φ_i(x) φ_i(x′)

with eigenvalues γ_i ≥ 0 and Σ_{i=1}^∞ γ_i² < ∞.

K induces a norm on H_K:

‖f‖_{H_K} := √( Σ_{i=1}^∞ c_i² / γ_i )

where the c_i are the coefficients of f in the eigenfunction expansion below.

Elements of H_K have an expansion in terms of the eigenfunctions above:

f(x) = Σ_{i=1}^∞ c_i φ_i(x)

with the constraint that:

‖f‖²_{H_K} < ∞

โ€–๐‘“โ€–๐ป๐‘˜ induces an inner product on ๐ป๐พ, โŸจโˆ™,โˆ™โŸฉ๐ป๐พ

, with important properties:

โŸจ๐พ(โˆ™, ๐‘ฅโ€ฒ), ๐‘“โŸฉ๐ป๐พ= ๐‘“(๐‘ฅโ€ฒ)

โŸจ๐พ(โˆ™, ๐‘ฅ), ๐พ(โˆ™, ๐‘ฅโ€ฒ)โŸฉ๐ป๐พ= ๐พ(๐‘ฅ, ๐‘ฅโ€ฒ)

We now explain the ‘kernel trick’. We note that a general class of regularization problems has the form:

min_{f ∈ H} [ Σ_{i=1}^N L(y_i, f(x_i)) + λ J(f) ]

Let the space we search over be H_K. We define:

J(f) = ‖f‖²_{H_K}


This makes our problem one of generalized ridge regression, where components associated with large eigenvalues in the eigen-expansion above are penalized less. We can rewrite our regularization problem as:

min_{f ∈ H_K} [ Σ_{i=1}^N L(y_i, f(x_i)) + λ ‖f‖²_{H_K} ] = min_{{c_j}} [ Σ_{i=1}^N L( y_i, Σ_{j=1}^∞ c_j φ_j(x_i) ) + λ Σ_{j=1}^∞ c_j² / γ_j ]

It has been proven (this is the representer theorem) that the solution has the form:

f(x) = Σ_{i=1}^N α_i K(x, x_i)

It has also been proven that there is a relationship between this functional form and the penalty functional such that:

f(x) = Σ_{i=1}^N α_i K(x, x_i)  ⇒  J(f) = Σ_{i=1}^N Σ_{j=1}^N α_i α_j K(x_i, x_j)

Finally, it has been proven that the optimal values for the coefficients are in this case:

α̂ = argmin_α [ L(y, Kα) + λ αᵀKα ]

where K is the kernel matrix with K_ij = K(x_i, x_j).

When the loss function is squared error we have a closed-form solution:

α̂ = argmin_α [ (y − Kα)ᵀ(y − Kα) + λ αᵀKα ]

α̂ = (K + λI)⁻¹ y

f̂(x) = Σ_{i=1}^N α̂_i K(x, x_i)

Note that if K were simply the matrix of dot products, this would be the dual form of ridge regression. Where K is built from a kernel function, it is equivalent to the dual form of ridge regression in the feature space H_K, taking advantage of the fact that inner products there are equal to the kernel function. This is ‘the kernel trick’.


The feature space H_K is determined by the kernel. The induced feature spaces can be very rich, so the resulting models can be very flexible. For example, the kernel K(x, y) = (⟨x, y⟩ + 1)^d for x, y ∈ R^p produces a feature space spanned by the polynomials in R^p of total degree d. So we can find the optimal solution to penalized polynomial regression using the kernel trick. This is efficient, as we do not have to perform an explicit projection into the polynomial space.
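To make the implicit feature space concrete, here is a small numerical check in R (a sketch, not from the text): for x, y ∈ R², the degree-2 polynomial kernel (⟨x, y⟩ + 1)² equals an ordinary inner product after an explicit six-dimensional feature map.

# Explicit degree-2 feature map for points in R^2 (illustrative only)
phi = function(x) c(1, sqrt(2) * x[1], sqrt(2) * x[2], x[1]^2, x[2]^2, sqrt(2) * x[1] * x[2])
x = c(1, 2); y = c(3, -1)
(sum(x * y) + 1)^2    # kernel evaluated in the input space: 4
sum(phi(x) * phi(y))  # inner product in the explicit feature space: also 4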

Kernel PCA

Recall that PCA discovers the eigenvectors of the covariance matrix of scaled and centered data. These are the principal components, and they are ordered, each specifying the direction of maximal variance of the data subject to the constraint that it is orthogonal to the principal components before it. Projecting onto the first n principal components is therefore a powerful means of feature transformation. Recall also the singular value decomposition (SVD) of a matrix M:

๐‘€ = ๐‘ˆฮฃ๐‘‰๐‘‡

Where:

๐‘€ is an ๐‘š ร— ๐‘› that is the object of our analysis.

๐‘ˆ is an ๐‘š ร— ๐‘š matrix, whose columns are the left singular vectors of ๐‘€.

ฮฃ is an ๐‘š ร— ๐‘› non-negative diagonal matrix, containing the singular values of ๐‘€.

๐‘‰๐‘‡ is an ๐‘› ร— ๐‘› matrix, whose rows are the right singular vectors of ๐‘€.

The following facts are important to know:

The right singular vectors of M are the eigenvectors of MᵀM (and so are the principal components of M).

The left singular vectors of M are the eigenvectors of MMᵀ.

The non-zero singular values are the square roots of the non-zero eigenvalues of MᵀM (equivalently, of MMᵀ).

UΣ is the projection of M onto its principal components.

Substituting the SVD, MMᵀ = UΣVᵀVΣᵀUᵀ = UΣ²Uᵀ.

Principal components are usually computed via the SVD.
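These facts are easy to verify numerically; the following R sketch (not from the text) checks them on the iris measurements:

X = scale(as.matrix(iris[, 1:4]))  # centered and scaled data
s = svd(X)                         # X = U %*% diag(d) %*% t(V)
pc = prcomp(X)
# Projection onto the principal components two ways (equal up to column signs):
head(s$u %*% diag(s$d))
head(pc$x)
# Non-zero singular values are the square roots of the eigenvalues of t(X) %*% X
# (should be TRUE up to numerical tolerance):
all.equal(s$d^2, eigen(crossprod(X))$values)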


Now let X′ be the (implicit) projection of our data X into the feature space, so that the kernel matrix is the matrix of inner products of the projected data:

K = X′X′ᵀ

Writing the SVD of the projected data as X′ = UΣVᵀ gives:

K = UΣVᵀVΣᵀUᵀ = UΣ²Uᵀ

We can use this to calculate UΣ from K: it is simply the matrix of eigenvectors of K multiplied by the square roots of the eigenvalues of K. Remember that UΣ is the projection of X′ onto its principal components, and we have obtained it from the kernel matrix of X. So we have managed to calculate this while explicitly computing neither X′ nor the principal components of X′!

Unfortunately, regardless of whether X is centered, X′ will generally not be. We can overcome this by working with the doubly centered kernel matrix, K̃:

K̃ = CKC

where C = I − (1/n)11ᵀ is an n × n matrix and 1 is a column vector of n ones. Multiplying a matrix on the left by C subtracts the mean of each column from that column; multiplying on the right by C subtracts the mean of each row from that row. Since K is symmetric, CKC therefore removes both row and column means.
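A quick numerical check of the double centering (a sketch, not from the text):

n = 4
C = diag(n) - matrix(1, n, n) / n            # I - (1/n) 1 1^T
M = crossprod(matrix(rnorm(n * n), n, n))    # a symmetric, kernel-like matrix
Mc = C %*% M %*% C                           # doubly centered
round(rowMeans(Mc), 12); round(colMeans(Mc), 12)  # both (near) zero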

The above derivation is theoretically satisfying, but it is not actually how kernel PCA is usually calculated, since performing the eigendecomposition of K̃ is impractical for large datasets. We will see, though, that the results of the kernel PCA implementation in the kernlab package agree, up to scaling constants, with the procedure given above.

We will proceed with a manual and package-based example concurrently. This will allow us to borrow some functionality from the package. We will be using the kpca function in the kernlab package. We will use the iris dataset from the datasets package.

> library(kernlab)
> d=datasets::iris
> head(d)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> kp=kpca(~.,d[,1:4],features=2)
> plot(rotated(kp),col=d[,5])


Note the unusual formula as the first argument. The object returned by kpca is an S4 object (see module 2). For practical use, the most important information in this object is the rotated matrix, which contains the projection of our original data onto the kernel principal components. We access it using the rotated function, as in the code above. We see that we have projected onto two kernel principal components, corresponding to our specification of the features parameter in the call to kpca. Note that the kpca function provides a number of different kernels. We used the default kernel, a Gaussian radial basis function (see module 13), which itself has a variance parameter with a default value of 0.2.

For a manual implementation, we ‘borrow’ the kernel function used in the kpca call. (We could make our own, but hey, why bother?) It can be accessed using the kernelf function. We can now manually generate the kernel matrix:

> K=apply(d[,1:4],1,function(r)apply(t(d[,1:4]),2,function(c)kernelf(kp)(r,c)))

Now we center the kernel matrix:

> l=nrow(K)
> ones=rep(1,l)
> C=diag(l)-1/l*(ones%*%t(ones))
> K_=C%*%K%*%C

We find the eigenvectors and eigenvalues, and can then compute the projection onto the first two kernel principal components:

> e=eigen(K_)
> est=e$vectors[,1:2] %*% diag(sqrt(e$values[1:2]))
> plot(est,col=d[,5])

We see that our results match those of the kpca function up to scaling constants.


Exercises

1. Select a dataset from the datasets package or elsewhere that is suitable for regression. It should have continuous input variables, and there should be at least two input variables. Try to find one that will benefit from a non-linear regression model. Fit a kernel regression model, optimizing the hyperparameters: both the ridge regression regularization parameter and the parameters required for your choice(s) of kernel(s).

2. Using the same dataset, perform kernel PCA on the input variables. Select an appropriate number of kernel principal components and refit your kernel regression model using these.