Murphy's Machine Learning: 14. Kernel


Page 1: Murpy's Machine Learning:14. Kernel

Kernel Jungkyu Lee

Daum Search Quality Team

Page 2: Murpy's Machine Learning:14. Kernel

14.1 Introduction

• how do we represent a text document or protein sequence, which can be of variable length?

• One approach to such problems is to define a generative model for the data, and use the inferred latent representation and/or the parameters of the model as features, and then to plug these features into standard methods.

• For example, in Chapter 28, we discuss deep learning, which is essentially an unsupervised way to learn good feature representations.

• Another approach is to assume that we have some way of measuring the similarity between objects, that doesn’t require preprocessing them into feature vector format.

• For example, when comparing strings, we can compute the edit distance between them.

• In this chapter, we will discuss several kinds of kernel functions

• How do we represent features whose length is not fixed?

• Latent representation: e.g., LDA represents each document by its topic distribution

• Kernel: represent features only through the similarity between pairs of features

• e.g., comparing two strings via edit distance

• Support Vector Machine

• one view: turning kernelized ridge regression into a sparse linear machine

• another view: a maximum margin classifier

Page 3: Murpy's Machine Learning:14. Kernel

14.2 Kernel functions

• We define a kernel function to be a real-valued function of two arguments, κ(x, x′) ∈ ℝ, for x, x′ ∈ X.

• typically symmetric: κ(x, x′) = κ(x′, x)

• and non-negative: κ(x, x′) ≥ 0

• so it can be interpreted as a measure of similarity

Page 4: Murpy's Machine Learning:14. Kernel

14.2.1 RBF kernels

• squared exponential kernel (SE kernel) or Gaussian kernel: κ(x, x′) = exp(−½ (x − x′)ᵀ Σ⁻¹ (x − x′))

• If Σ is diagonal, this can be written as κ(x, x′) = exp(−½ Σ_j (xj − x′j)² / σj²)

• We can interpret the σj as defining the characteristic length scale of dimension j.

If Σ is spherical (isotropic), we get the radial basis function or RBF kernel, κ(x, x′) = exp(−||x − x′||² / (2σ²))

Here σ² is known as the bandwidth.

Page 5: Murpy's Machine Learning:14. Kernel

• As σ grows, the kernel value stays fairly large even when the two feature vectors are far apart.

• When σ is small, the kernel similarity drops off sharply as soon as the distance increases.
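A minimal numpy sketch (not from the slides) of the isotropic RBF kernel above, illustrating how the bandwidth σ controls the decay; the function name and the test points are purely illustrative.

import numpy as np

def rbf_kernel(x, x_prime, sigma):
    """Isotropic RBF (Gaussian) kernel: exp(-||x - x'||^2 / (2 sigma^2))."""
    d2 = np.sum((np.asarray(x) - np.asarray(x_prime)) ** 2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

x = np.array([0.0, 0.0])
for dist in [0.5, 1.0, 2.0, 4.0]:
    x_prime = np.array([dist, 0.0])
    # small sigma: similarity collapses quickly; large sigma: stays sizeable
    print(dist, rbf_kernel(x, x_prime, sigma=0.5), rbf_kernel(x, x_prime, sigma=3.0))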

Page 6: Murpy's Machine Learning:14. Kernel

14.2.2 Kernels for comparing documents

• If we use a bag of words representation, where xij is the number of times word j occurs in document i, we can use the cosine similarity, κ(xi, xi′) = xiᵀ xi′ / (||xi||₂ ||xi′||₂)

• Unfortunately, this simple method does not work very well: frequent but uninformative words (stop words) dominate the similarity, which is why a TF-IDF weighting is used instead.
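A rough sketch of the TF-IDF cosine-similarity kernel described here, assuming the weighting tf = log(1 + count) and idf = log(N / (1 + document frequency)); the toy count matrix and function names are made up for illustration.

import numpy as np

def tfidf(counts):
    """counts: (num_docs, vocab_size) bag-of-words matrix -> TF-IDF weights,
    using tf = log(1 + count) and idf = log(N / (1 + doc_freq))."""
    tf = np.log(1.0 + counts)
    idf = np.log(counts.shape[0] / (1.0 + (counts > 0).sum(axis=0)))
    return tf * idf

def cosine_kernel(a, b):
    """kappa(x, x') = x.x' / (||x|| ||x'||)."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# word 2 appears in 3 of 4 documents (stop-word-like), so its idf is zero here
counts = np.array([[3, 0, 1, 0, 0],
                   [2, 1, 0, 1, 0],
                   [0, 0, 4, 0, 2],
                   [0, 1, 3, 0, 1]], dtype=float)
X = tfidf(counts)
print(cosine_kernel(X[0], X[1]), cosine_kernel(X[0], X[2]))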

Page 7: Murpy's Machine Learning:14. Kernel

14.2.3 Mercer (positive definite) kernels

• In general, if the kernel is Mercer, then there exists a function φ mapping x ∈ X to ℝᴰ such that κ(x, x′) = φ(x)ᵀ φ(x′)

• For example, consider the (non-stationary) polynomial kernel κ(x, x′) = (γ xᵀ x′ + r)ᴹ, where r > 0

Page 8: Murpy's Machine Learning:14. Kernel

14.2.3 Mercer (positive definite) kernels

• Gram matrix: the N × N matrix K with entries Kij = κ(xi, xj)

• A Mercer (positive definite) kernel is one whose Gram matrix is positive definite for any set of inputs

• It can be shown (Schoelkopf and Smola 2002) that the Gaussian kernel is a Mercer kernel as is the cosine similarity kernel (Sahami and Heilman 2006)

• Mercer's theorem

• If the Gram matrix is positive definite, we can compute an eigenvector decomposition of it as K = Uᵀ Λ U, where Λ is a diagonal matrix of eigenvalues λi > 0. An element of K is then kij = (Λ^(1/2) U:,i)ᵀ (Λ^(1/2) U:,j), so if we define φ(xi) = Λ^(1/2) U:,i, we get kij = φ(xi)ᵀ φ(xj): the kernel entries are inner products of feature vectors derived from the eigenvectors.
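A small numpy check (not from the slides), under the assumption of an RBF kernel with σ = 1 on random inputs, that the resulting Gram matrix is positive (semi)definite, as Mercer's theorem requires.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

# Gram matrix of a Gaussian (RBF) kernel with sigma = 1 on a random input set
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)

# Mercer / positive-definiteness check: eigenvalues should be >= 0 up to rounding
print(np.linalg.eigvalsh(K).min() > -1e-10)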

Page 9: Murpy's Machine Learning:14. Kernel

14.2.4 Linear kernels

• Deriving the feature vector implied by a kernel is in general quite difficult, and only possible if the kernel is Mercer.

• However, deriving a kernel from a feature vector is easy: we just use κ(x, x′) = φ(x)ᵀ φ(x′)

• If φ(x) = x, we get the linear kernel, defined by κ(x, x′) = xᵀ x′

• This is useful if the original data is already high dimensional, and if the original features are individually informative, e.g., a bag of words representation where the vocabulary size is large, or the expression level of many genes.

• In such a case, the decision boundary is likely to be representable as a linear combination of the original features, so it is not necessary to work in some other feature space.

• Of course, not all high dimensional problems are linearly separable. For example, images are high dimensional, but individual pixels are not very informative, so image classification typically requires non-linear kernels (see e.g., Sec-tion 14.2.7).

Page 10: Murpy's Machine Learning:14. Kernel

14.2.5 Matern kernels

• The Matern kernel, which is commonly used in Gaussian process regression (see Section 15.2), has the form κ(r) = (2^(1−ν) / Γ(ν)) (√(2ν) r / ℓ)^ν Kν(√(2ν) r / ℓ)

• where r = ||x − x′||, ν > 0, ℓ > 0, and Kν is a modified Bessel function.

• As ν → ∞, this approaches the SE kernel. If ν = 1/2, the kernel simplifies to κ(r) = exp(−r/ℓ)
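A hedged sketch of the Matern kernel above using scipy's modified Bessel function; clipping r away from zero and the fixed ℓ = 1 are implementation choices, not from the text. The ν = 1/2 case is compared against exp(−r/ℓ).

import numpy as np
from scipy.special import gamma, kv

def matern_kernel(r, nu, ell=1.0):
    """Matern kernel: (2^(1-nu)/Gamma(nu)) * (sqrt(2 nu) r/ell)^nu * K_nu(sqrt(2 nu) r/ell)."""
    r = np.maximum(np.asarray(r, dtype=float), 1e-12)   # keep K_nu finite at r = 0
    z = np.sqrt(2 * nu) * r / ell
    return (2 ** (1 - nu) / gamma(nu)) * z ** nu * kv(nu, z)

print(matern_kernel([0.0, 0.5, 1.0], nu=0.5))     # nu = 1/2 should reduce to exp(-r/ell)
print(np.exp(-np.array([1e-12, 0.5, 1.0])))       # reference values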

Page 11: Murpy's Machine Learning:14. Kernel

14.2.6 String kernels

• The real power of kernels arises when the inputs are structured objects.

• As an example, we now describe one way of comparing two variable length strings using a string kernel.

• Consider two strings x and x′, of lengths D and D′, each defined over the alphabet A

• A={A, R, N, D, C, E, Q, G, H, I, L, K, M, F, P, S, T, W, Y, V}

• Now let φs(x) denote the number of times that substring s appears in string x

• More formally and more generally, let us say that s is a substring of x if we can write x = usv for some (possibly empty) strings u, s and v.

• We then define the kernel as κ(x, x′) = Σ_{s∈A*} ws φs(x) φs(x′), where ws ≥ 0.

The kernel between strings x and x′ grows as they share more substrings.
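A naive sketch of such a string kernel; the uniform weights ws = 1 and the cap on substring length (3) are assumptions made here for brevity (the text allows arbitrary weights, and practical implementations use suffix trees).

from collections import Counter

def substring_counts(x, max_len=3):
    """Count every contiguous substring of x up to length max_len."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(x) - n + 1):
            counts[x[i:i + n]] += 1
    return counts

def string_kernel(x, y, max_len=3):
    """Unweighted string kernel: sum over shared substrings s of phi_s(x)*phi_s(y)."""
    cx, cy = substring_counts(x, max_len), substring_counts(y, max_len)
    return sum(cx[s] * cy[s] for s in cx if s in cy)

print(string_kernel("ARNDCE", "RNDQGE"))   # counts shared substrings: R, N, D, E, RN, ND, RND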

Page 12: Murpy's Machine Learning:14. Kernel

14.2.7 Pyramid match kernels

• Extract interest points from the image and represent them as a bag of words (e.g., SIFT descriptors)

• Partition the space into a multi-level grid of increasingly coarse cells and measure how well the two point sets match (how often do points land in the same grid cell?)

• This is a Mercer kernel.

Page 13: Murpy's Machine Learning:14. Kernel

14.3 Using kernels inside GLMs

• We define a kernel machine to be a GLM where the input feature vector has the form φ(x) = [κ(x, μ1), ..., κ(x, μK)], where the μk are a set of K centroids

• If κ is an RBF kernel, this is called an RBF network. We discuss ways to choose the μk parameters below.

• We will call Equation 14.28 a kernelised feature vector. Note that in this approach, the kernel need not be a Mercer kernel.

• We can use the kernelized feature vector for logistic regression by defining p(y|x, θ) = Ber(y | sigm(wᵀ φ(x)))

• This provides a simple way to define a non-linear decision boundary.

• As an example, consider the data coming from the exclusive or or xor function.
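A toy sketch (not from the slides) of such a kernel machine: RBF features centred on the training points, fed to a hand-rolled logistic regression on the XOR data mentioned above. The learning rate, σ, and iteration count are arbitrary choices.

import numpy as np

def rbf_features(X, centroids, sigma=0.5):
    """Kernelised feature vector: phi(x) = [k(x, mu_1), ..., k(x, mu_K)]."""
    d2 = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

# XOR data: not linearly separable in the original space
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])

Phi = rbf_features(X, X)                 # every training point used as a centroid

# plain gradient descent for logistic regression on the kernelised features
w = np.zeros(Phi.shape[1])
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-Phi @ w))
    w -= 0.5 * Phi.T @ (p - y) / len(y)

print((1.0 / (1.0 + np.exp(-Phi @ w)) > 0.5).astype(int))   # expected: [0 1 1 0]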

Page 14: Murpy's Machine Learning:14. Kernel

14.3 Using kernels inside GLMs

• The kernelized version of linear regression

Page 15: Murpy's Machine Learning:14. Kernel

14.3.2 L1VMs, RVMs, and other sparse vector machines

• The main issue with kernel machines is: how do we choose the centroids μk?

• If the input is low-dimensional Euclidean space, we can uniformly tile the space occupied by the data with prototypes, as we did in Figure 14.2(c). (= divide the space into a uniform grid)

• However, this approach breaks down in higher numbers of dimensions because of the curse of dimensionality: as the dimension grows, the number of cells required explodes.

• A simpler approach is to make each example xi be a prototype, so we get φ(x) = [κ(x, x1), ..., κ(x, xN)]

• We now have D = N, so we have as many parameters as data points. However, we can use any of the sparsity-promoting priors for w discussed in Chapter 13 to efficiently select a subset of the training exemplars. (This turns the sparse-w machinery of the Chapter 13 sparse linear models into a way of choosing a subset of the training examples: if wi = 0, then xi is not needed for prediction.)

• We call this a sparse vector machine

Page 16: Murpy's Machine Learning:14. Kernel

• A vector machine that penalizes w with ℓ1 regularization is called the L1VM.

• One that penalizes w with ℓ2 regularization is called the L2VM (which is not actually sparse).

• A model that obtains sparsity via automatic relevance determination (ARD) / sparse Bayesian learning (SBL) (Section 13.7) is called the relevance vector machine or RVM (Tipping 2001).

• Another very popular approach to creating a sparse kernel machine is to use a support vector machine or SVM. Rather than using a sparsity-promoting prior, it essentially modifies the likelihood term, which is rather unnatural from a Bayesian point of view. Nevertheless, the effect is similar, as we will see.
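An illustrative sketch of an L1VM-style fit: L1-penalised least squares on RBF features centred at every training point, solved with a basic proximal-gradient (ISTA) loop. The penalty λ, the bandwidth σ = 0.5, and the synthetic data are assumptions; Chapter 13 discusses proper solvers.

import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)

# kernelised design matrix: every training point is a prototype
Phi = np.exp(-(X - X.T) ** 2 / (2 * 0.5 ** 2))

# ISTA (proximal gradient) for  0.5*||Phi w - y||^2 + lam*||w||_1
w = np.zeros(50)
lam, step = 0.1, 1.0 / np.linalg.norm(Phi, 2) ** 2
for _ in range(5000):
    w = w - step * Phi.T @ (Phi @ w - y)
    w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)   # soft threshold

print(np.sum(np.abs(w) > 1e-6), "of", len(w), "training points kept as prototypes")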

Page 17: Murpy's Machine Learning:14. Kernel
Page 18: Murpy's Machine Learning:14. Kernel

14.4 The kernel trick

• Rather than working with x itself or with a high-dimensional mapping of x, we work with the kernel directly.

• Replace every inner product <x, x′> with κ(x, x′). When something like linear regression is combined with a basis expansion, we would otherwise have to manipulate training examples mapped into a high-dimensional space, making training and prediction expensive; the kernel trick is a device for avoiding that cost, and it even allows implicitly infinite-dimensional mappings (RBF = Gaussian kernel).

• Alternatively, for structured objects whose features are awkward to handle directly, using a kernel (a similarity between objects) lets existing algorithms be applied to structured objects as well.

• This is called the kernel trick.

Page 19: Murpy's Machine Learning:14. Kernel

14.4.1 Kernelized nearest neighbor classification

• Recall that in a 1NN classifier (Section 1.4.2), we just need to compute the Euclidean distance of a test vector to all the training points, find the closest one, and look up its label.

• This can be kernelized by observing that ||xi − xi′||² = ⟨xi, xi⟩ + ⟨xi′, xi′⟩ − 2⟨xi, xi′⟩ = κ(xi, xi) + κ(xi′, xi′) − 2κ(xi, xi′)

• This allows us to apply the nearest neighbor classifier to structured data objects.

• For example, using the document kernel from earlier, we can classify documents with KNN (classify a test document by the k documents whose TF-IDF vectors are most similar).

The similarity is computed using only the kernel.
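A sketch of kernelized 1NN using only kernel evaluations, via the identity above; the helper name and the toy linear-kernel example are illustrative only.

import numpy as np

def kernelized_1nn(kernel, X_train, y_train, X_test):
    """1NN using only kernel evaluations, via
    ||phi(x) - phi(x_i)||^2 = k(x, x) + k(x_i, x_i) - 2*k(x, x_i)."""
    k_tt = np.array([kernel(x, x) for x in X_test])
    k_ii = np.array([kernel(xi, xi) for xi in X_train])
    k_ti = np.array([[kernel(x, xi) for xi in X_train] for x in X_test])
    d2 = k_tt[:, None] + k_ii[None, :] - 2 * k_ti
    return y_train[np.argmin(d2, axis=1)]

# usage with a linear kernel on toy vectors (purely illustrative)
lin = lambda a, b: float(np.dot(a, b))
X_train = np.array([[0.0, 1.0], [1.0, 0.0]])
y_train = np.array([0, 1])
print(kernelized_1nn(lin, X_train, y_train, np.array([[0.1, 0.9]])))   # -> [0]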

Page 20: Murpy's Machine Learning:14. Kernel

14.4.2 Kernelized K-medoids clustering

• Just as K-means uses the mean of the features in each cluster as the centre, K-medoids uses the medoid (middle element) of each cluster; since we do not work with the feature vectors themselves, we cannot compute a mean.

• E-step: assign each data point xi to a cluster (the one with the nearest centre), as in K-means. M-step: the new centre of cluster k is the member of cluster k whose total distance to the other members is smallest.

• For example, with a string kernel we can cluster similar strings, i.e., strings that share many common substrings.
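A rough sketch of kernelized K-medoids driven purely by a Gram matrix; the function name, the random initialisation, the fixed iteration count, and the empty-cluster guard are implementation choices, not from the text.

import numpy as np

def kernel_kmedoids(K, n_clusters, n_iter=20, seed=0):
    """K-medoids driven purely by a Gram matrix K (no feature vectors needed).
    Squared distances come from d2(i, j) = K[i,i] + K[j,j] - 2*K[i,j]."""
    d2 = np.diag(K)[:, None] + np.diag(K)[None, :] - 2 * K
    rng = np.random.default_rng(seed)
    medoids = rng.choice(K.shape[0], size=n_clusters, replace=False)
    for _ in range(n_iter):
        assign = np.argmin(d2[:, medoids], axis=1)     # E-step: nearest medoid
        for c in range(n_clusters):                    # M-step: member with smallest
            members = np.flatnonzero(assign == c)      # total distance to the others
            if members.size:
                medoids[c] = members[np.argmin(d2[np.ix_(members, members)].sum(axis=1))]
    return assign, medoids

# usage: run on the RBF Gram matrix of 30 synthetic 2-D points
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(-2, 0.5, (15, 2)), rng.normal(2, 0.5, (15, 2))])
sq = np.sum((pts[:, None] - pts[None, :]) ** 2, axis=-1)
assign, medoids = kernel_kmedoids(np.exp(-sq / 2.0), n_clusters=2)
print(assign, medoids)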

Page 21: Murpy's Machine Learning:14. Kernel

14.4.3 Kernelized ridge regression

• Applying the kernel trick to distance-based methods was straightforward.

• It is not so obvious how to apply it to parametric models such as ridge regression.

• 14.4.3.1 The primal problem

• Let x ∈ ℝᴰ be some feature vector, and X be the corresponding N × D design matrix

• Minimize J(w) = (y − Xw)ᵀ(y − Xw) + λ||w||²

• The optimal solution is given by w = (XᵀX + λI_D)⁻¹ Xᵀ y, which takes O(D³) time to compute.

• 14.4.3.2 The dual problem

• Using the matrix inversion lemma (Equation 4.107), we can rewrite the ridge estimate as w = Xᵀ(XXᵀ + λI_N)⁻¹ y

• which takes O(N³ + N²D) time to compute. This can be advantageous if D is large.

Page 22: Murpy's Machine Learning:14. Kernel

14.4.3 Kernelized ridge regression

• we can partially kernelize this, by replacing XXᵀ with the Gram matrix K.

• But what about the leading XT term?

• Let us define the following dual variables: α ≜ (K + λI_N)⁻¹ y

• Then we can rewrite the primal variables as follows: w = Xᵀα = Σi αi xi

• This tells us that the solution vector is just a linear sum of the N training vectors

• When we plug this in at test time to compute the predictive mean, we get f̂(x) = wᵀx = Σi αi xiᵀx = Σi αi κ(x, xi)

• What is kernelized ridge regression good for? If the training examples have, say, 10^8 dimensions, ordinary ridge regression needs O(D³) time for the matrix inversion, whereas kernel ridge regression costs O(N³).

• However, the prediction cost grows from O(D) to O(ND). We therefore want to make α sparse, as in the SVM, so that prediction uses only the support vectors.
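A short sketch of kernelized ridge regression as just described: α = (K + λI_N)⁻¹y and f(x∗) = Σi αi κ(x∗, xi). The RBF kernel, λ, and synthetic data are chosen only for illustration.

import numpy as np

def rbf_gram(A, B, sigma=1.0):
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

lam = 0.1
K = rbf_gram(X, X)                                     # N x N Gram matrix
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # dual variables alpha

X_test = rng.normal(size=(5, 2))
y_hat = rbf_gram(X_test, X) @ alpha                    # f(x*) = sum_i alpha_i k(x*, x_i)
print(y_hat)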

Page 23: Murpy's Machine Learning:14. Kernel

14.5 Support vector machines (SVMs)

• Consider the ℓ2-regularized empirical risk function J(w, λ) = Σi L(yi, ŷi) + λ||w||². If L is quadratic loss, this is equivalent to ridge regression.

• The kernel trick then gives kernelized ridge regression, which is kernelized but not sparse.

• If we instead use the hinge loss (classification) or the epsilon-insensitive loss (regression), the solution becomes sparse, and prediction depends only on a subset of the training data (the support vectors).

• This combination of the kernel trick plus a modified loss function is known as a support vector machine or SVM

Page 24: Murpy's Machine Learning:14. Kernel

14.5.1 SVMs for regression

• The problem with kernelized ridge regression is that the solution vector w depends on all the training inputs.

• We now seek a method to produce a sparse estimate

• epsilon insensitive loss function: L_ε(y, ŷ) = 0 if |y − ŷ| < ε, and |y − ŷ| − ε otherwise

• This means that any point lying inside an ε-tube around the prediction is not penalized, as in Figure 14.10.

If the prediction error is within ε, no penalty is applied; beyond that range the penalty is the error minus ε.

Page 25: Murpy's Machine Learning:14. Kernel

• This objective is convex and unconstrained, but not differentiable, because of the absolute value function in the loss term.

• One popular approach is to formulate the problem as a constrained optimization problem.

• In particular, we introduce slack variables to represent the degree to which each point lies outside the tube

We grow ξ until the constraints can be satisfied, and pay a penalty proportional to how much we grew it.

Page 26: Murpy's Machine Learning:14. Kernel

• One can show (see e.g., (Schoelkopf and Smola 2002)) that the optimal solution has the form ŵ = Σi αi xi, where α is sparse (because of the ε-insensitive loss); the xi with αi ≠ 0 are the support vectors

• Once the model is trained, we can then make predictions using ŷ(x) = ŵ0 + Σi αi κ(xi, x)

Page 27: Murpy's Machine Learning:14. Kernel

14.5.2 SVMs for classification

• 14.5.2.1 Hinge loss: L_hinge(y, η) = max(0, 1 − yη) = (1 − yη)₊, where η = f(x)

If yη > 1 there is no penalty: e.g., f(x) > 1 with y = 1, or f(x) < −1 with y = −1; in other words, no penalty when the point is correctly classified with margin.

Page 28: Murpy's Machine Learning:14. Kernel

• The overall objective has the form min_{w,w0} ½||w||² + C Σi (1 − yi f(xi))₊

• Once again, this is non-differentiable, because of the max term. However, by introducing slack variables ξi, one can show that this is equivalent to solving min_{w,w0,ξ} ½||w||² + C Σi ξi subject to ξi ≥ 0 and yi(wᵀxi + w0) ≥ 1 − ξi

• Standard QP solvers take O(N³) time.

• However, specialized algorithms, which avoid the use of generic QP solvers, have been developed for this problem, such as the sequential minimal optimization or SMO algorithm (Platt 1998). In practice this can take O(N²) time.

• In such settings, it is common to use linear SVMs, which take O(N) time to train (Joachims 2006; Bottou et al. 2007).

We relax the constraints by growing ξ until they can be satisfied, and pay a penalty proportional to the slack.
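A sketch of the soft-margin objective above, trained with plain sub-gradient descent on a linear kernel (not SMO or a generic QP solver, which the text actually discusses); the synthetic blobs, C, and the learning-rate schedule are arbitrary.

import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1.5, 1.0, (30, 2)), rng.normal(1.5, 1.0, (30, 2))])
y = np.hstack([-np.ones(30), np.ones(30)])

# sub-gradient descent on  0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w.x_i + b))
C, w, b = 1.0, np.zeros(2), 0.0
for t in range(1, 2001):
    lr = 1.0 / t
    margins = y * (X @ w + b)
    viol = margins < 1                      # points on or inside the margin
    w -= lr * (w - C * (y[viol][:, None] * X[viol]).sum(axis=0))
    b -= lr * (-C * y[viol].sum())

print("training accuracy:", np.mean(np.sign(X @ w + b) == y))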

Page 29: Murpy's Machine Learning:14. Kernel

• One can show that the solution has the form ŵ = Σi αi xi

• where αi = λiyi and where α is sparse (because of the hinge loss).

• The xi for which αi > 0 are called support vectors; these are points which are either incorrectly classified, or are classified correctly but are on or inside the margin (we discuss margins below).

• At test time, prediction is done using ŷ(x) = sgn(ŵ0 + Σi αi κ(xi, x))

• This takes O(sD) time to compute, where s ≤ N is the number of support vectors.

• This depends on the sparsity level, and hence on the regularizer C.

The prediction is a weighted sum over the training examples' α values; only the support vectors (αi > 0) contribute.

Page 30: Murpy's Machine Learning:14. Kernel

14.5.2.2 The large margin principle

• In this section, we derive Equation 14.58 from a completely different perspective.

• The goal is to find a discriminant function f(x) that is linear in the feature space induced by the kernel.

x can be written as its projection x⊥ onto the decision surface plus r units along the direction of w: x = x⊥ + r w/||w||.

Page 31: Murpy's Machine Learning:14. Kernel

14.5.2.2 The large margin principle

• We would like to make this distance r = f(x)/||w|| as large as possible, for reasons illustrated in Figure 14.11.

• intuitively, the best one to pick is the one that maximizes the margin, i.e., the perpendicular distance to the closest point

• In addition, we want to ensure each point is on the correct side of the boundary, hence we want f(xi) yi > 0.

• So our objective becomes max_{w,w0} min_i yi(wᵀxi + w0) / ||w||

• Note that by rescaling the parameters using w → kw and w0 → kw0, we do not change the distance of any point to the boundary, since the k factor cancels out when we divide by ||w||.

• Therefore let us define the scale factor such that yi fi = 1 for the point that is closest to the decision boundary (i.e., rescale so that yi(wᵀxi + w0) = 1 for that point).

• We therefore want to optimize min_{w,w0} ½||w||² subject to yi(wᵀxi + w0) ≥ 1, i = 1, ..., N

• The constraint says that we want all points to be on the correct side of the decision boundary with a margin of at least 1.

large margin classifier.

Find the w for which the margin of the closest training example is largest.

Page 32: Murpy's Machine Learning:14. Kernel

soft margin constraints

• If the data is not linearly separable (even after using the kernel trick), there will be no feasible solution in which yifi ≥ 1 for all i.

• We replace the hard constraints that yi fi ≥ 0 with the soft margin constraints that yi fi ≥ 1 − ξi. The new objective becomes min_{w,w0,ξ} ½||w||² + C Σi ξi subject to ξi ≥ 0 and yi(wᵀxi + w0) ≥ 1 − ξi

• We therefore introduce slack variables ξi ≥ 0 such that ξi = 0 if the point is on or inside the correct margin boundary, and ξi = |yi −fi| otherwise

• If 0 < ξi ≤ 1, the point lies inside the margin, but on the correct side of the decision boundary.

• If ξi > 1, the point lies on the wrong side of the decision boundary.

We relax the constraints by growing ξ until they can be satisfied, and pay a penalty proportional to the slack.

Page 33: Murpy's Machine Learning:14. Kernel

14.5.2.3 Probabilistic output

• An SVM classifier produces a hard labeling, ŷ(x) = sign(f(x)).

• However, we often want a measure of confidence in our prediction.

• One heuristic approach is to interpret f(x) as the log-odds ratio, log(p(y=1|x)/p(y=0|x)), and convert it into a probability using p(y = 1|x) = sigm(a f(x) + b),

• where a, b can be estimated by maximum likelihood on a separate validation set.

• However, the resulting probabilities are not particularly well calibrated
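A sketch of the heuristic above (often called Platt scaling): fit p(y=1|x) = sigm(a·f(x) + b) by maximum likelihood with a few gradient steps. The validation scores here are made up, and the function name and sign convention are choices of this sketch.

import numpy as np

def fit_platt(f_val, y_val, n_iter=1000, lr=0.1):
    """Fit p(y=1|x) = sigm(a*f(x) + b) by maximum likelihood on validation scores."""
    a, b = 1.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(a * f_val + b)))
        grad = p - y_val                    # d(neg log-lik)/d(a*f + b)
        a -= lr * np.mean(grad * f_val)
        b -= lr * np.mean(grad)
    return a, b

# usage: f_val are SVM decision values on a held-out set, y_val in {0, 1}
f_val = np.array([-2.0, -0.5, 0.3, 1.5])
y_val = np.array([0.0, 0.0, 1.0, 1.0])
a, b = fit_platt(f_val, y_val)
print(1.0 / (1.0 + np.exp(-(a * 0.3 + b))))    # probability for a decision value of 0.3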

Page 34: Murpy's Machine Learning:14. Kernel

14.5.2.4 SVMs for multi-class classification

• In Section 8.3.7, we saw how we could “upgrade” a binary logistic regression model to the multiclass case, by re-placing the sigmoid function with the softmax, and the Bernoulli distribution with the multinomial.

• Upgrading an SVM to the multi-class case is not so easy, since the outputs are not on a calibrated scale and hence are hard to compare to each other

• The obvious approach is to use a one-versus-the-rest approach (also called one-vs-all), in which we train C binary classifiers, fc(x), where the data from class c is treated as positive, and the data from all the other classes is treated as negative.

• However, this can result in regions of input space which are ambiguously labeled, as shown in Figure 14.14(a).

• Another approach is to use the one-versus-one or OVO approach, also called all pairs, in which we train C(C−1)/2 classifiers to discriminate all pairs fc,c′.

OVO trains C(C−1)/2 classifiers; OVA trains C classifiers. Both can leave regions of input space that are ambiguously labeled (Figure 14.14).

Page 35: Murpy's Machine Learning:14. Kernel

14.5.3 Choosing C

• Typically C is chosen by cross-validation.

• Note, however, that C interacts quite strongly with the kernel parameters.

• To choose C efficiently, one can develop a path following algorithm in the spirit of LARS (Section 13.3.4).

• The basic idea is to start with λ large, so that the margin 1/||w(λ)|| is wide, and hence all points are inside of it and have αi = 1.

• By slowly decreasing λ, a small set of points will move from inside the margin to outside, and their αi values will change from 1 to 0, as they cease to be support vectors.

Page 36: Murpy's Machine Learning:14. Kernel

14.5.4 Summary of key points

• Summarizing the above discussion, we recognize that SVM classifiers involve three key ingredients: the kernel trick, sparsity, and the large margin principle.

• The kernel trick prevents underfitting; sparsity and the large margin principle prevent overfitting.

Page 37: Murpy's Machine Learning:14. Kernel

14.6 Comparison of discriminative kernel methods

• We have mentioned several different methods for classification and regression based on kernels, which we summarize in Table 14.1.

• GPs and L2VM are generally the slowest, taking O(N³) time

• if speed matters, use an RVM,

• if well-calibrated probabilistic output matters (e.g., for active learning or control problems), use a GP.

To “tune” the kernel parameters, use efficient gradient-based optimizers to maximize the marginal likelihood. (Table 14.1 also notes, for each method, whether it is probabilistic and whether the kernel must be a Mercer kernel.)

Page 38: Murpy's Machine Learning:14. Kernel

14.7 Kernels for building generative models

• There is a different kind of kernel known as a smoothing kernel which can be used to create non-parametric density estimates.

• This can be used for unsupervised density estimation, p(x), as well as for creating generative models for classification and regression by making models of the form p(y, x).

• 14.7.1 Smoothing kernels

• A smoothing kernel is a function of one argument which satisfies the following properties: ∫κ(x)dx = 1, ∫xκ(x)dx = 0, ∫x²κ(x)dx > 0

• A simple example is the Gaussian kernel, κ(x) = (2π)^(−1/2) exp(−x²/2)

• We can control the width of the kernel by introducing a bandwidth parameter h: κ_h(x) ≜ (1/h) κ(x/h)

Page 39: Murpy's Machine Learning:14. Kernel

14.7.2 Kernel density estimation (KDE) aka Parzen window density estimator

• Recall Gaussian mixture model

• However, it requires specifying the number K and locations μk of the clusters

• As the number of Gaussians grows we can represent essentially any density on ℝᴰ; in the limit, each training example gets its own Gaussian.

• We can generalize the approach by writing p̂(x) = (1/N) Σi κ_h(x − xi)

• simple non-parametric density model: no model fitting is required

• The disadvantage is that the model takes a lot of memory to store, and a lot of time to evaluate. (The training examples themselves are the model, and all of them are used in every evaluation.)

The density at a query point x is determined by the N Gaussians, one per training example xi; the Gaussians of the xi closest to x contribute the most.

The contribution grows as x − xi approaches 0.

Page 40: Murpy's Machine Learning:14. Kernel

• Each data example xi carries a Gaussian centred on itself (N Gaussians in total); the density at x is obtained by evaluating how far x is from each of the N Gaussians and summing the contributions.

In the figure, this x is close to three of the Gaussians, so its density is high; increasing the variance of each Gaussian makes the estimate smoother.
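A minimal sketch of a Parzen-window / KDE estimate with a Gaussian smoothing kernel, p̂(x) = (1/N) Σi κ_h(x − xi); the 1-D data, query points, and bandwidth h are illustrative.

import numpy as np

def gaussian_kde(x_query, data, h):
    """Parzen-window estimate: average of N Gaussians of width h centred on the data."""
    z = (x_query[:, None] - data[None, :]) / h
    return np.mean(np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi), axis=1) / h

data = np.array([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2])
print(gaussian_kde(np.linspace(-5, 10, 5), data, h=1.0))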

Page 41: Murpy's Machine Learning:14. Kernel

14.7.4 Kernel regression

The prediction is ŷ(x) = Σi wi(x) yi, where wi(x) = κ_h(x − xi) / Σj κ_h(x − xj). Since wi(x) is larger when x is close to xi, the prediction mostly follows the y values of the nearby training points.
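A minimal Nadaraya-Watson kernel-regression sketch matching the weighting just described, using a Gaussian smoothing kernel; the data and bandwidth are illustrative.

import numpy as np

def kernel_regression(x_query, x_train, y_train, h):
    """Nadaraya-Watson: y_hat(x) = sum_i w_i(x) y_i, w_i(x) proportional to kappa_h(x - x_i)."""
    z = (x_query[:, None] - x_train[None, :]) / h
    w = np.exp(-0.5 * z ** 2)
    w /= w.sum(axis=1, keepdims=True)       # normalise so the weights sum to 1
    return w @ y_train

x_train = np.linspace(0, 6, 30)
y_train = np.sin(x_train)
print(kernel_regression(np.array([1.0, 3.0]), x_train, y_train, h=0.3))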

Page 42: Murpy's Machine Learning:14. Kernel

14.7.5 Locally weighted regression

• If we define κ_h(x − xi) = κ(x, xi), we can rewrite the prediction made by kernel regression as follows

• Note that κ(x, xi) need not be a smoothing kernel. If it is not, we no longer need the normalization term, so we can just write

• We blend N local functions κ(x∗, xi)

• Whereas linear regression uses the same w for all of the data, LWR fits a separate model at each test point x∗, predicting y with a test-point-specific weight vector β(x∗), fit as follows

• The solution is the weighted least squares estimate β(x∗) = (Xᵀ W(x∗) X)⁻¹ Xᵀ W(x∗) y, where W(x∗) = diag(κ(x∗, x1), ..., κ(x∗, xN)) and X is the design matrix

The weights β(x∗) for the current query x∗ are learned by minimizing the kernel-weighted error over all the training points (the kernel makes errors on points close to x∗ count for more).
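A sketch of locally weighted regression: at each query x∗ a separate kernel-weighted least-squares line is fit. The Gaussian weighting, the bandwidth, and the linear basis [1, x] for this 1-D toy example are assumptions of this sketch.

import numpy as np

def locally_weighted_regression(x_star, x_train, y_train, h):
    """Fit a separate kernel-weighted least-squares line beta(x*) at every query point."""
    preds = []
    for xs in np.atleast_1d(x_star):
        w = np.exp(-0.5 * ((xs - x_train) / h) ** 2)       # kernel weights around x*
        Phi = np.column_stack([np.ones_like(x_train), x_train])
        W = np.diag(w)
        beta = np.linalg.solve(Phi.T @ W @ Phi, Phi.T @ W @ y_train)
        preds.append(beta[0] + beta[1] * xs)
    return np.array(preds)

x_train = np.linspace(0, 6, 40)
y_train = np.sin(x_train)
print(locally_weighted_regression([1.0, 3.0], x_train, y_train, h=0.5))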

Page 43: Murpy's Machine Learning:14. Kernel