
Page 1: Overview of Kernel Methods (Part 2)

Overview of Kernel Methods (Part 2)

Steve Vincent, November 17, 2004

Page 2: Overview of Kernel Methods (Part 2)

Overview

Kernel methods offer a modular framework. In a first step, a dataset is processed into a kernel matrix. Data can be of various types, and also of heterogeneous types. In a second step, a variety of kernel algorithms can be used to analyze the data, using only the information contained in the kernel matrix.

Page 3: Overview of Kernel Methods (Part 2)

What will be covered today

- PCA vs. Kernel PCA: algorithm, example, comparison
- Text-related kernel methods: bag of words, semantic kernels, string kernels, tree kernels

Page 4: Overview of Kernel Methods (Part 2)

PCA algorithm

1. Subtract the mean from all the data points
2. Compute the covariance matrix $S = \frac{1}{N}\sum_{n=1}^{N} x_n x_n^T$
3. Diagonalize S to get its eigenvalues and eigenvectors
4. Retain the c eigenvectors corresponding to the c largest eigenvalues such that $\sum_{j=1}^{c}\lambda_j \,/\, \sum_{j=1}^{N}\lambda_j$ equals the desired variance to be captured
5. Project the data points onto the retained eigenvectors: $a_{ik} = e_k^T x_i$, for $i = 1, \ldots, N$
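
A minimal NumPy sketch of these five steps (an illustration, not the author's MATLAB code; the function name, the 0.95 default, and the returned values are assumptions made for this example):

import numpy as np

def pca(X, variance_to_capture=0.95):
    """X is an N x d data matrix, one row per point; returns projections and retained eigenvectors."""
    Xc = X - X.mean(axis=0)                      # 1. subtract the mean
    S = (Xc.T @ Xc) / Xc.shape[0]                # 2. covariance matrix S = (1/N) sum_n x_n x_n^T
    eigvals, eigvecs = np.linalg.eigh(S)         # 3. diagonalize S
    order = np.argsort(eigvals)[::-1]            #    sort eigenvalues in decreasing order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()   # 4. cumulative fraction of variance captured
    c = int(np.searchsorted(ratio, variance_to_capture)) + 1
    E = eigvecs[:, :c]
    return Xc @ E, E                             # 5. project the data points onto the eigenvectors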

Page 5: Overview of Kernel Methods (Part 2)

Kernel PCA algorithm (1)

1. Given N data points in d dimensions, let X = {x1 | x2 | ... | xN}, where each column represents one data point
2. Subtract the mean from all the data points
3. Choose an appropriate kernel k
4. Form the NxN Gram matrix K, with $K_{ij} = k(x_i, x_j)$
5. Form the modified Gram matrix $\tilde{K} = \left(I - \frac{1}{N}\mathbf{1}_{N\times N}\right) K \left(I - \frac{1}{N}\mathbf{1}_{N\times N}\right)^T$, where $\mathbf{1}_{N\times N}$ is an NxN matrix with all entries equal to 1

Page 6: Overview of Kernel Methods (Part 2)

Kernel PCA algorithm (2)

6. Diagonalize $\tilde{K}$ to get its eigenvalues $\lambda_n$ and its eigenvectors $a^n$
7. Normalize the eigenvectors: $a^n \leftarrow a^n / \sqrt{\lambda_n}$
8. Retain the c eigenvectors corresponding to the c largest eigenvalues such that $\sum_{j=1}^{c}\lambda_j \,/\, \sum_{j=1}^{N}\lambda_j$ equals the desired variance to be captured
9. Project the data points onto the eigenvectors: $y = a^T \left(I - \frac{1}{N}\mathbf{1}_{N\times N}\right) \left[\,k(x_1, x), \ldots, k(x_N, x)\,\right]^T$
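
A minimal NumPy sketch of steps 5 through 9 for the training points, assuming the Gram matrix K has already been formed (the function and variable names are illustrative):

import numpy as np

def kernel_pca(K, c):
    """Kernel PCA from an N x N Gram matrix K; returns the training data projected onto c components."""
    N = K.shape[0]
    one_n = np.ones((N, N)) / N
    # step 5: centred Gram matrix, algebraically equal to (I - 1/N 1)(K)(I - 1/N 1)
    K_tilde = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    eigvals, eigvecs = np.linalg.eigh(K_tilde)              # step 6: eigenvalues and eigenvectors
    order = np.argsort(eigvals)[::-1][:c]                   # keep the c largest eigenvalues
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    alphas = eigvecs / np.sqrt(eigvals)                     # step 7: normalize the eigenvectors
    return K_tilde @ alphas                                 # step 9: projections of the training points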

Page 7: Overview of Kernel Methods (Part 2)

Data Mining Problem

Data Source:
- Computational Intelligence and Learning Cluster Challenge 2000, http://www.wi.leidenuniv.nl/~putten/library/cc2000/index.html
- Supplied by Sentient Machine Research, http://www.smr.nl

Problem Definition:
- Given data which incorporates both socio-economic and various insurance policy ownership attributes, can we derive models which help in determining factors or attributes that may influence or signify individuals who purchase a caravan insurance policy?

Page 8: Overview of Kernel Methods (Part 2)

Data Selection

- 5,822 records for training; 4,000 records for evaluation; 86 attributes
- Attributes 1 through 43: socio-demographic data derived from zip code areas
- Attributes 44 through 85: product ownership for customers
- Attribute 86: purchased caravan insurance

Page 9: Overview of Kernel Methods (Part 2)

Principal Components Analysis (PCA)

Data Transformation and Reduction [from MATLAB]

# Attributes   % Variance     # Attributes   % Variance
25             73.29          55             98.20
30             80.66          60             98.86
35             86.53          65             99.35
40             91.25          70             99.65
45             94.94          75             99.83
50             97.26          80             99.96

Page 10: Overview of Kernel Methods (Part 2)

Relative Performance

- PCA run time: 6.138; Kernel PCA run time: 5.668
- Used the Radial Basis Function kernel: $K(u, v) = \exp\left(-\frac{\|u - v\|^2}{2\sigma^2}\right)$
- MATLAB code for the PCA and Kernel PCA algorithms can be supplied if needed
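
A small NumPy sketch of this RBF Gram matrix (sigma is a hypothetical bandwidth parameter; the slide does not state the value used). Its output can be fed to the kernel_pca sketch above:

import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) for the rows of an N x d matrix X."""
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T   # pairwise squared distances
    return np.exp(-sq_dists / (2 * sigma**2))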

Page 11: Overview of Kernel Methods (Part 2)

Modeling, Test and Evaluation

Manually Reduced Dataset, Naive Bayes: overall 82.79% correctly classified
          a      b
   a   3155    544    (14.71% false positive)
   b    132     98    (42.61% of class b correctly classified)

PCA Reduced Dataset, Naive Bayes: overall 88.45% correctly classified
          a      b
   a   3507    255    (6.77% false positive)
   b    207     31    (13.03% of class b correctly classified)

Kernel PCA Reduced Dataset, Naive Bayes: overall 82.22% correctly classified
          a      b
   a   3238    541    (14.3% false positive)
   b    175     74    (29.7% of class b correctly classified)

Legend: a = no, b = yes

Page 12: Overview of Kernel Methods (Part 2)

Overall Results

- KPCA and PCA had similar time performance
- KPCA gives results much closer to those of the manually reduced dataset

Future Work:
- Examine other kernels
- Vary the parameters of the kernels
- Use other data mining algorithms

Page 13: Overview of Kernel Methods (Part 2)

'Bag of words' kernels (1)

- A document is seen as a vector d, indexed by all the elements of a (controlled) dictionary; each entry is equal to the number of occurrences of the corresponding term.
- A training corpus is therefore represented by a term-document matrix, noted D = [d1 d2 ... dm-1 dm]
- From this basic representation, we will apply a sequence of successive embeddings, resulting in a global (valid) kernel with all the desired properties

Page 14: Overview of Kernel Methods (Part 2)

BOW kernels (2)

Properties:
- All order information is lost (syntactical relationships, local context, ...)
- The feature space has dimension N (the size of the dictionary)
- Similarity is basically defined by $k(d_1, d_2) = d_1 \cdot d_2 = d_1^t d_2$, or, normalized (cosine similarity):
  $\hat{k}(d_1, d_2) = \frac{k(d_1, d_2)}{\sqrt{k(d_1, d_1)\,k(d_2, d_2)}}$
- Efficiency is provided by sparsity (and a sparse dot-product algorithm): O(|d1| + |d2|)
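
A minimal Python sketch of this kernel and its cosine-normalized form, assuming whitespace tokenization stands in for the controlled dictionary (function names are illustrative, and non-empty documents are assumed):

from collections import Counter
import math

def bow_kernel(d1, d2):
    """k(d1, d2) = <phi(d1), phi(d2)>, where phi(d) counts term occurrences."""
    c1, c2 = Counter(d1.lower().split()), Counter(d2.lower().split())
    small, large = (c1, c2) if len(c1) <= len(c2) else (c2, c1)
    # Sparse dot product: only iterate over terms present in the smaller document
    return sum(count * large[term] for term, count in small.items())

def normalized_bow_kernel(d1, d2):
    """Cosine similarity: k(d1, d2) / sqrt(k(d1, d1) * k(d2, d2))."""
    return bow_kernel(d1, d2) / math.sqrt(bow_kernel(d1, d1) * bow_kernel(d2, d2))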

Page 15: Overview of Kernel Methods (Part 2)

Latent concept kernels

Basic idea: documents, represented over the term space (size t), are mapped into a latent concept space of much smaller size k << t, and the kernel K(d1, d2) is computed there.

[Diagram: documents (size d) -> terms (size t) -> concept space (size k << t); K(d1, d2) = ?]

Page 16: Overview of Kernel Methods (Part 2)

Semantic Kernels (1)

- $\tilde{k}(d_1, d_2) = \phi(d_1)\, S S'\, \phi(d_2)'$, where S is the semantic matrix
- S can be defined as S = RP, where
  - R is a diagonal matrix giving the term weightings or relevances
  - P is the proximity matrix defining the semantic spreading between the different terms of the corpus
- The measure of inverse document frequency for a term t is given by $w(t) = \ln\left(\frac{\ell}{df(t)}\right)$, where $\ell$ = number of documents and df(t) = number of documents containing term t
- The matrix R is diagonal with entries $R_{tt} = w(t)$

Page 17: Overview of Kernel Methods (Part 2)

Semantic Kernels (2)

- For the term-weighting matrix R, the associated kernel is
  $\tilde{k}(d_1, d_2) = \phi(d_1)\, R R'\, \phi(d_2)'$
- For the proximity matrix P, the associated kernel is
  $\tilde{k}(d_1, d_2) = \phi(d_1)\, P P'\, \phi(d_2)' = \sum_{i,j} \phi(d_1)_i\, Q_{ij}\, \phi(d_2)_j$
  where Q = PP' encodes the amount of semantic relation between terms i and j.
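
A small NumPy sketch of these formulas, assuming documents are given as term-count row vectors and using the idf weighting from the previous slide (function names are illustrative):

import numpy as np

def idf_matrix(D):
    """R: diagonal term-weighting matrix with R_tt = ln(l / df(t)), from a term-document matrix D (terms x docs).
    Assumes every term occurs in at least one document, so that df(t) > 0."""
    l = D.shape[1]
    df = np.count_nonzero(D > 0, axis=1)
    return np.diag(np.log(l / df))

def semantic_kernel(phi_d1, phi_d2, R, P):
    """k~(d1, d2) = phi(d1) S S' phi(d2)', with semantic matrix S = R P."""
    S = R @ P
    return phi_d1 @ S @ S.T @ phi_d2

With P = np.eye(number_of_terms) as a placeholder proximity matrix, this reduces to the idf-weighted kernel of the first formula.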

Page 18: Overview of Kernel Methods (Part 2)

Semantic Kernels (3)

- The most natural method of incorporating semantic information is to infer the relatedness of terms from an external source of domain knowledge
- Example: the WordNet ontology
- Semantic distance: path length in the hierarchical tree, or information content

Page 19: Overview of Kernel Methods (Part 2)

Latent Semantic Kernels (LSK) / Latent Semantic Indexing (LSI)

- Singular Value Decomposition (SVD): $D = U \Sigma V'$, where $\Sigma$ is a diagonal matrix of the same dimensions as D, and U and V are unitary matrices whose columns are the eigenvectors of DD' and D'D respectively
- LSI projects the documents into the space spanned by the first k columns of U, using the new k-dimensional vectors for subsequent processing:
  $\phi(d) \mapsto \phi(d)\, U_k$
  where $U_k$ is the matrix containing the first k columns of U

Page 20: Overview of Kernel Methods (Part 2)

Latent Semantic Kernels (LSK) / Latent Semantic Indexing (LSI)

- The new kernel becomes that of kernel PCA:
  $\tilde{k}(d_1, d_2) = \phi(d_1)\, U_k U_k'\, \phi(d_2)'$
- LSK is implemented by projecting onto the features
  $\left(\phi(d)\, U_k\right)_i = \lambda_i^{-1/2} \sum_{j=1}^{\ell} (v_i)_j\, k(d_j, d)$
  where k is the base kernel, and $\lambda_i$, $v_i$ are eigenvalue-eigenvector pairs of the kernel matrix
- The LSK can be represented with the proximity matrix $P = U_k U_k'$
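
A NumPy sketch of the LSI projection and the resulting kernel, following the slide's convention of a term-document matrix D with documents as columns (function names are illustrative):

import numpy as np

def lsi_fit(D, k):
    """D: term-document matrix (terms x documents). Returns U_k, the first k left singular vectors."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    return U[:, :k]

def latent_semantic_kernel(d1, d2, Uk):
    """k~(d1, d2) = phi(d1) U_k U_k' phi(d2)', with phi(d) a term-count vector."""
    return (d1 @ Uk) @ (Uk.T @ d2)

# LSI projection of a single document vector d onto the k-dimensional latent space:
# d_latent = d @ Uk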

Page 21: Overview of Kernel Methods (Part 2)

String and Sequence

- An alphabet $\Sigma$ is a finite set of $|\Sigma|$ symbols
- A string $s = s_1 \ldots s_{|s|}$ is any finite sequence of symbols from $\Sigma$, including the empty sequence
- We denote by $\Sigma^n$ the set of all finite strings of length n
- String matching implies contiguity; sequence matching only implies order

Page 22: Overview of Kernel Methods (Part 2)

p-spectrum kernel (1)

- Features of s = the p-spectrum of s = the histogram of all (contiguous) substrings of length p
- The feature space is indexed by all elements of $\Sigma^p$
- $\phi^p_u(s)$ = number of occurrences of u in s
- The associated kernel is defined as
  $k_p(s, t) = \langle \phi^p(s), \phi^p(t) \rangle = \sum_{u \in \Sigma^p} \phi^p_u(s)\, \phi^p_u(t)$

Page 23: Overview of Kernel Methods (Part 2)

p-spectrum kernel example

- Example: 3-spectrum kernel, with s = "statistics" and t = "computation"
- The two strings contain the following substrings of length 3:
  "sta", "tat", "ati", "tis", "ist", "sti", "tic", "ics"
  "com", "omp", "mpu", "put", "uta", "tat", "ati", "tio", "ion"
- The common substrings are "tat" and "ati", so the inner product is k(s, t) = 2
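
A short Python sketch of the p-spectrum kernel that reproduces this example (the function name is illustrative):

from collections import Counter

def p_spectrum_kernel(s, t, p):
    """Inner product of the p-gram histograms of s and t."""
    phi_s = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    phi_t = Counter(t[i:i + p] for i in range(len(t) - p + 1))
    return sum(count * phi_t[u] for u, count in phi_s.items())

print(p_spectrum_kernel("statistics", "computation", 3))   # prints 2 ("tat" and "ati")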

Page 24: Overview of Kernel Methods (Part 2)

p-spectrum kernels recursion

- The k-suffix kernel is defined by
  $k^S_k(s, t) = \begin{cases} 1 & \text{if } s = s_1 u,\ t = t_1 u \text{ for some } u \in \Sigma^k \\ 0 & \text{otherwise} \end{cases}$
- The p-spectrum kernel can then be evaluated as
  $k_p(s, t) = \sum_{i=1}^{|s|-p+1} \sum_{j=1}^{|t|-p+1} k^S_p\big(s(i : i+p),\, t(j : j+p)\big)$
  in O(p |s| |t|) operations
- The evaluation of one row of the table for the p-suffix kernel corresponds to performing a search in the string t for the p-suffix of a prefix of s

Page 25: Overview of Kernel Methods (Part 2)

All-subsequences kernels

- The feature mapping is defined by all contiguous or non-contiguous subsequences of a string
- The feature space is indexed by all elements of $\Sigma^* = \{\varepsilon\} \cup \Sigma \cup \Sigma^2 \cup \Sigma^3 \cup \ldots$
- $\phi_u(s)$ = number of occurrences of u as a (non-contiguous) subsequence of s
- Explicit computation rapidly becomes infeasible (exponential in |s| even with a sparse representation)

Page 26: Overview of Kernel Methods (Part 2)

Recursive implementation

- Consider the addition of one extra symbol a to s: common subsequences of (sa, t) are either in s, or must end with the symbol a (in both sa and t)
- Mathematically:
  $k(\varepsilon, t) = 1$
  $k(sa, t) = k(s, t) + \sum_{j :\, t_j = a} k\big(s,\, t(1 : j-1)\big)$
- This gives a complexity of O(|s| |t|^2)
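
A direct Python transcription of this recursion over prefixes, memoized; this is the naive O(|s||t|^2) version rather than the faster table evaluation (the function name is illustrative):

from functools import lru_cache

def all_subsequences_kernel(s, t):
    """Number of common (possibly non-contiguous) subsequences of s and t, including the empty one."""
    @lru_cache(maxsize=None)
    def k(i, j):
        # k(i, j): kernel between the prefixes s[:i] and t[:j]
        if i == 0:
            return 1                     # base case: only the empty subsequence
        a = s[i - 1]                     # the extra symbol appended to s[:i-1]
        # either a is not used, or the common subsequence ends with a matched at a position p of t[:j]
        return k(i - 1, j) + sum(k(i - 1, p) for p in range(j) if t[p] == a)
    return k(len(s), len(t))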

Page 27: Overview of Kernel Methods (Part 2)

Fixed-length subsequence kernels

- The feature space is indexed by all elements of $\Sigma^p$
- $\phi_u(s)$ = number of occurrences of the p-gram u as a (non-contiguous) subsequence of s
- Recursive implementation (creates a series of p tables):
  $k_0(s, t) = 1$
  $k_p(\varepsilon, t) = k_p(s, \varepsilon) = 0 \quad \text{for } p \geq 1$
  $k_p(sa, t) = k_p(s, t) + \sum_{j :\, t_j = a} k_{p-1}\big(s,\, t(1 : j-1)\big)$
- Complexity: O(p |s| |t|), but we get the k-length subsequence kernels (k <= p) for free, so it is easy to compute $k(s, t) = \sum_l a_l\, k_l(s, t)$
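
The same recursion in Python, memoized over prefix lengths (an illustrative sketch; the O(p|s||t|) table version is not spelled out here):

from functools import lru_cache

def fixed_length_subsequence_kernel(s, t, p):
    """Number of length-p (non-contiguous) subsequences common to s and t, counted with multiplicity."""
    @lru_cache(maxsize=None)
    def k(q, i, j):
        # common subsequences of length q between the prefixes s[:i] and t[:j]
        if q == 0:
            return 1
        if i == 0 or j == 0:
            return 0
        a = s[i - 1]
        return k(q, i - 1, j) + sum(k(q - 1, i - 1, m) for m in range(j) if t[m] == a)
    return k(p, len(s), len(t))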

Page 28: Overview of Kernel Methods (Part 2)

Gap-weighted subsequence kernels (1)

- The feature space is indexed by all elements of $\Sigma^p$
- $\phi_u(s)$ = sum of the weights of the occurrences of the p-gram u as a (non-contiguous) subsequence of s, the weight being length-penalizing: $\lambda^{\text{length}(u)}$ [NB: the length includes both matching symbols and gaps, i.e. the span of the occurrence]
- Example (1): the string "gon" occurs as a subsequence of the strings "gone", "going" and "galleon", but we consider the first occurrence as more important since it is contiguous, while the final occurrence is the weakest of all three

Page 29: Overview of Kernel Methods (Part 2)

Gap-weighted subsequence kernels (2)

- Example (2):
  D1: ATCGTAGACTGTC
  D2: GACTATGC
  $\phi_{CAT}(D_1) = 2\lambda^8 + 2\lambda^{10}$ and $\phi_{CAT}(D_2) = \lambda^4$,
  so the CAT term of the kernel is $k(D_1, D_2)_{CAT} = 2\lambda^{12} + 2\lambda^{14}$
- Naturally built as a dot product -> a valid kernel
- For an alphabet of size 80, there are 512,000 trigrams
- For an alphabet of size 26, there are about 12 x 10^6 5-grams
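
A brute-force Python check of this example: it enumerates every occurrence of a p-gram as a subsequence and records the penalized span, returning the feature as a polynomial in lambda ({exponent: count}). The exponential enumeration is only for illustration; the next slide points to the efficient dynamic-programming route.

from itertools import combinations
from collections import Counter

def gap_weighted_feature(s, u):
    """phi_u(s) as a polynomial in lambda: maps span length -> number of occurrences with that span."""
    phi = Counter()
    for idx in combinations(range(len(s)), len(u)):
        if all(s[i] == ch for i, ch in zip(idx, u)):
            phi[idx[-1] - idx[0] + 1] += 1       # exponent = span of the occurrence, gaps included
    return phi

print(gap_weighted_feature("ATCGTAGACTGTC", "CAT"))   # Counter({8: 2, 10: 2})  i.e. 2*l^8 + 2*l^10
print(gap_weighted_feature("GACTATGC", "CAT"))        # Counter({4: 1})         i.e. l^4
# Multiplying the two polynomials gives the CAT contribution to k(D1, D2): 2*l^12 + 2*l^14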

Page 30: Overview of Kernel Methods (Part 2)

Gap-weighted subsequence kernels (3)

- Hard to perform the explicit expansion and dot product
- Efficient recursive formulation (dynamic-programming type), whose complexity is O(k |D1| |D2|)

Page 31: Overview of Kernel Methods (Part 2)

Word Sequence Kernels (1)

- Here "words" are considered as the symbols
- Meaningful symbols -> more relevant matching
- Linguistic preprocessing can be applied to improve performance
- Shorter sequence sizes -> improved computation time
- But increased sparsity (documents are more "orthogonal")
- Motivation: the noisy stemming hypothesis (important N-grams approximate stems), confirmed experimentally in a categorization task

Page 32: Overview of Kernel Methods (Part 2)

Word Sequence Kernels (2)

- Link between Word Sequence Kernels and other methods:
  - For k = 1, WSK is equivalent to the basic "Bag of Words" approach
  - For $\lambda$ = 1, there is a close relation to the polynomial kernel of degree k, but WSK takes order into account
- Extensions of WSK:
  - Symbol-dependent decay factors (a way to introduce the IDF concept, dependence on the POS, stop words)
  - Different decay factors for gaps and matches (e.g. $\lambda_{noun} < \lambda_{adj}$ for gaps; $\lambda_{noun} > \lambda_{adj}$ for matches)
  - Soft matching of symbols (e.g. based on a thesaurus, or on a dictionary if we want cross-lingual kernels)

Page 33: Overview of Kernel Methods (Part 2)

Tree Kernels

- Applications: categorization [one doc = one tree], parsing (disambiguation) [one doc = multiple trees]
- Tree kernels constitute a particular case of more general kernels defined on discrete structures (convolution kernels). Intuitively, the philosophy is to split the structured objects into parts, to define a kernel on the "atoms", and a way to recursively combine the kernels over parts to get the kernel over the whole.
- Feature space definition: one feature for each possible proper subtree in the training data; the feature value = number of occurrences
- A subtree is defined as any part of the tree which includes more than one node, with the restriction that no "partial" rule productions are allowed.

Page 34: Overview of Kernel Methods (Part 2)

Trees in Text: example

[Figure: a parse tree for "John loves Mary" (S -> NP VP; NP -> John; VP -> V N; V -> loves; N -> Mary), together with a few among the many subtrees of this tree, e.g. VP -> V N, VP -> V(loves) N, VP -> V(loves) N(Mary), N -> Mary]

Page 35: Overview of Kernel Methods (Part 2)

Tree Kernels: algorithm

- Kernel = dot product in this high-dimensional feature space
- Once again, there is an efficient recursive algorithm (in polynomial time, not exponential!)
- Basically, it compares the productions of all possible pairs of nodes (n1, n2), with n1 in T1 and n2 in T2; if the production is the same, the number of common subtrees rooted at both n1 and n2 is computed recursively, considering the number of common subtrees rooted at the common children
- Formally, let $k_{co\text{-}rooted}(n_1, n_2)$ = number of common subtrees rooted at both n1 and n2; then
  $k(T_1, T_2) = \sum_{n_1 \in T_1} \sum_{n_2 \in T_2} k_{co\text{-}rooted}(n_1, n_2)$

Page 36: Overview of Kernel Methods (Part 2)

All sub-tree kernel

- $k_{co\text{-}rooted}(n_1, n_2) = 0$ if n1 or n2 is a leaf
- $k_{co\text{-}rooted}(n_1, n_2) = 0$ if n1 and n2 have different productions or, if labeled, different labels
- Else $k_{co\text{-}rooted}(n_1, n_2) = \prod_{i \in \text{children}} \big(1 + k_{co\text{-}rooted}(ch_i(n_1), ch_i(n_2))\big)$
- "Production" is left intentionally ambiguous, to cover both unlabelled and labeled trees
- Complexity is O(|T1| . |T2|)
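
A compact Python sketch of this recursion on toy parse trees (the (label, children) tuple representation and the example tree are illustrative assumptions):

def co_rooted(n1, n2):
    """k_co-rooted(n1, n2): number of common subtrees rooted at both nodes.
    A node is a (label, [children]) pair; its production is its label plus the labels of its children."""
    label1, kids1 = n1
    label2, kids2 = n2
    if not kids1 or not kids2:                      # n1 or n2 is a leaf
        return 0
    if (label1, [k[0] for k in kids1]) != (label2, [k[0] for k in kids2]):
        return 0                                    # different production (or label)
    result = 1
    for c1, c2 in zip(kids1, kids2):                # product over children of (1 + k_co-rooted)
        result *= 1 + co_rooted(c1, c2)
    return result

def tree_kernel(t1, t2):
    """k(T1, T2) = sum of k_co-rooted over all pairs of nodes."""
    def nodes(t):
        yield t
        for child in t[1]:
            yield from nodes(child)
    return sum(co_rooted(a, b) for a in nodes(t1) for b in nodes(t2))

# Toy parse tree for "John loves Mary"
T = ("S", [("NP", [("John", [])]),
           ("VP", [("V", [("loves", [])]), ("N", [("Mary", [])])])])
print(tree_kernel(T, T))   # counts all pairs of identical proper subtrees of T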

Page 37: Overview of Kernel Methods (Part 2)

References

- J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, 2004 (Chapters 10 and 11)
- J. Tian, "PCA/Kernel PCA for Image Denoising", September 16, 2004
- T. Gartner, "A Survey of Kernels for Structured Data", ACM SIGKDD Explorations Newsletter, July 2003
- N. Cristianini, "Latent Semantic Kernels", Proceedings of ICML-01, 18th International Conference on Machine Learning, 2001