Transductive Inference for Text Classification using Support Vector Machines



Page 1: Transductive Inference for Text Classification using Support Vector Machines

Transductive Inference for Text Classification using Support Vector Machines

Thorsten Joachims, Universität Dortmund, LS VIII, 44221 Dortmund, Germany

[email protected]

Presented by Yueng-Tien Lo

Page 2: Transductive Inference for Text Classification using Support Vector Machines

Outline

• Introduction
• Text Classification
• Transductive Support Vector Machines
• What Makes TSVMs Especially Well Suited for Text Classification
• Conclusions and Outlook

Page 3: Transductive Inference for Text Classification using Support Vector Machines

Introduction

• Over recent years, text classification has become one of the key techniques for organizing online information, e.g. filtering spam from people's email or learning users' news-reading preferences.

• The work presented here tackles the problem of learning from small training samples by taking a transductive, instead of an inductive approach.

Page 4: Transductive Inference for Text Classification using Support Vector Machines

Introduction

• In the inductive setting the learner tries to induce a decision function which has a low error rate on the whole distribution of examples for the particular learning task.

• In many situations, however, we do not care about the particular decision function, but rather about classifying a given set of examples (i.e. a test set) with as few errors as possible.

Page 5: Transductive Inference for Text Classification using Support Vector Machines

Introduction

• Relevance Feedback: The user is interested in a good classification of the test set into those documents relevant or irrelevant to the query.
• Netnews Filtering: Given the few training examples the user labeled on previous days, he or she wants today's most interesting articles.
• Reorganizing a document collection.

Page 6: Transductive Inference for Text Classification using Support Vector Machines

Text Classification

• The goal of text classification is the automatic assignment of documents to a fixed number of semantic categories.

• Each category is treated as a separate binary classification problem; each such problem answers the question of whether or not a document should be assigned to that category.

Page 7: Transductive Inference for Text Classification using Support Vector Machines

• Documents, which typically are strings of characters, have to be transformed into a representation suitable for the learning algorithm and the classification task.
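
The slides do not show code for this step; as an illustration only, here is a minimal Python sketch of one standard choice, a TF-IDF weighted bag-of-words representation, using scikit-learn's TfidfVectorizer as a stand-in (the toy documents and all names are hypothetical):

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (hypothetical); each (stemmed) word becomes one feature.
docs = [
    "pepper and salt",
    "pepper and physics",
    "salt and pepper",
]

vectorizer = TfidfVectorizer()        # tokenizes, counts words, applies TF-IDF weighting
X = vectorizer.fit_transform(docs)    # sparse matrix: one row per document

print(vectorizer.get_feature_names_out())  # the learned vocabulary (features)
print(X.shape)                             # (number of documents, vocabulary size)
print(X.nnz)                               # total non-zero entries: document vectors are sparse

The resulting high-dimensional, sparse vectors are the kind of input the following sections assume.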

Pages 8-11: Transductive Support Vector Machines

Page 12: Transductive Inference for Text Classification using Support Vector Machines

Transductive Support Vector Machines

• For a learning task P(x, y) = P(y|x) P(x), the learner L is given a hypothesis space H of functions h: X → {-1, +1} and an i.i.d. sample S_train of n training examples

  (x_1, y_1), (x_2, y_2), ..., (x_n, y_n)   (2)

• Each training example consists of a document vector x and a binary label y ∈ {-1, +1}. In contrast to the inductive setting, the learner is also given an i.i.d. sample S_test of k test examples

  x*_1, x*_2, ..., x*_k   (3)

  from the same distribution.

Page 13: Transductive Inference for Text Classification using Support Vector Machines

Transductive Support Vector Machines

• The transductive learner L aims to select a function h_L = L(S_train, S_test) from H, using S_train and S_test, so that the expected number of erroneous predictions

  R(L) = ∫ (1/k) Σ_{i=1}^{k} Θ(h_L(x*_i), y*_i) dP(x_1, y_1) ··· dP(x*_k, y*_k)

  on the test examples is minimized.

• Θ(a, b) is zero if a = b; otherwise it is one.

Page 14: Transductive Inference for Text Classification using Support Vector Machines

Transductive Support Vector Machines

• Bounds on the relative uniform deviation of training error

  R_train(h) = (1/n) Σ_{i=1}^{n} Θ(h(x_i), y_i)   (4)

  and test error (computed with the true labels y*_i of the test examples)

  R_test(h) = (1/k) Σ_{i=1}^{k} Θ(h(x*_i), y*_i)   (5)

• With probability 1 - η:

  R_test(h) ≤ R_train(h) + Ω(n, k, d, η)   (6)

• where the confidence interval Ω depends on the number of training examples n, the number of test examples k, and the VC-Dimension d of H.

Page 15: Transductive Inference for Text Classification using Support Vector Machines

Transductive Support Vector Machines

• What information do we get from studying the test sample (3), and how can we use it?

• The training and the test sample split the hypothesis space H into a finite number of equivalence classes H'.

• This reduces the learning problem from finding a function in the possibly infinite set H to finding one of the finitely many equivalence classes in H'.

• We can use these equivalence classes to build a structure of increasing VC-Dimension

  H'_1 ⊂ H'_2 ⊂ ... ⊂ H'

  for structural risk minimization.

Page 16: Transductive Inference for Text Classification using Support Vector Machines

Transductive Support Vector Machines

• Vapnik shows that with the size of the margin we can control the maximum number of equivalence classes (i.e. the VC-Dimension).

Page 17: Transductive Inference for Text Classification using Support Vector Machines

Theorem 1

• Consider hyperplanes h(x) = sign{w·x + b} as hypothesis space H. If the attribute vectors of a training sample (2) and a test sample (3) are contained in a ball of diameter D, then there are at most

  N ≤ exp( d [ ln((n + k)/d) + 1 ] ),   d = min( D²/a², n + k ) + 1

  equivalence classes which contain a separating hyperplane with

  ∀ i = 1..n: |w·x_i + b| ≥ 1   and   ∀ j = 1..k: |w·x*_j + b| ≥ 1,

  where a = 1/||w|| is the margin.

Page 18: Transductive Inference for Text Classification using Support Vector Machines

Transductive Support Vector Machines

• Note that the VC-Dimension does not necessarily depend on the number of features, but can be much lower than the dimensionality of the space.

• Structural risk minimization tells us that we get the smallest bound on the test error if we select the equivalence class from the structure element H'_i which minimizes (6).

Page 19: Transductive Inference for Text Classification using Support Vector Machines

Transductive SVM (lin. sep. case) Optimization Problem 1

• Minimize over (y*_1, ..., y*_k, w, b):

  (1/2) ||w||²

  subject to:
  ∀ i = 1..n:  y_i [w·x_i + b] ≥ 1
  ∀ j = 1..k:  y*_j [w·x*_j + b] ≥ 1

Page 20: Transductive Inference for Text Classification using Support Vector Machines

Transductive SVM (non-sep. case) Optimization Problem 2

• Minimize over (y*_1, ..., y*_k, w, b, ξ_1, ..., ξ_n, ξ*_1, ..., ξ*_k):

  (1/2) ||w||² + C Σ_{i=1}^{n} ξ_i + C* Σ_{j=1}^{k} ξ*_j

  subject to:
  ∀ i = 1..n:  y_i [w·x_i + b] ≥ 1 - ξ_i
  ∀ j = 1..k:  y*_j [w·x*_j + b] ≥ 1 - ξ*_j
  ∀ i = 1..n:  ξ_i ≥ 0
  ∀ j = 1..k:  ξ*_j ≥ 0

• C and C* are parameters set by the user.
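
To make the objective concrete, the following small Python sketch (added for illustration; not from the slides) evaluates the OP2 objective for a fixed hyperplane (w, b) and a fixed labeling of the test examples, using the fact that for fixed (w, b) the optimal slacks are the hinge losses. All function and variable names are hypothetical.

import numpy as np

def tsvm_objective(w, b, X_train, y_train, X_test, y_test_assigned, C, C_star):
    # For a fixed hyperplane, the smallest feasible slacks are the hinge losses
    # xi = max(0, 1 - y * (w . x + b)).
    xi = np.maximum(0.0, 1.0 - y_train * (X_train @ w + b))              # training slacks
    xi_star = np.maximum(0.0, 1.0 - y_test_assigned * (X_test @ w + b))  # test slacks
    return 0.5 * np.dot(w, w) + C * xi.sum() + C_star * xi_star.sum()

This is the quantity that the label-switching algorithm described later tries to drive down.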

Page 21: Transductive Inference for Text Classification using Support Vector Machines

What Makes TSVMs Especially Well Suited for Text Classification

• The text classification task is characterized by a special set of properties:
  – High-dimensional input space: each (stemmed) word is a feature.
  – Document vectors are sparse: for each document, the corresponding document vector contains few entries that are not zero.
  – Few irrelevant features: experiments in [Joachims, 1998] suggest that most words are relevant.

Page 22: Transductive Inference for Text Classification using Support Vector Machines

What Makes TSVMs Especially Well Suited for Text Classification

• But how can TSVMs be any better?

• When asking the search engine Altavista about all documents containing the words pepper and salt, it returns 327,180 web pages.

• When asking for the documents with the words pepper and physics, we get only 4,220 hits, although physics is a more popular word on the web than salt.

• And it is this co-occurrence information that TSVMs exploit as prior knowledge about the learning task.

Page 23: Transductive Inference for Text Classification using Support Vector Machines

What Makes TSVMs Especially Well Suited for Text Classification

• Imagine document D1 was given as a training example for class A and document D6 was given as a training example for class B.

• How should we classify documents D2 to D4 (the test set)?

Page 24: Transductive Inference for Text Classification using Support Vector Machines

What Makes TSVMs Especially Well Suited for Text Classification

• The reason we choose this classification of the test data over the others stems from our prior knowledge about the properties of text and common text classification tasks.

Page 25: Transductive Inference for Text Classification using Support Vector Machines

What Makes TSVMs Especially Well Suited for Text Classification

• Note again that we got to this classification by studying the location of the test examples, which is not possible for an inductive learner.

• We see that the maximum margin bias reflects our prior knowledge about text classification well.

• By analyzing the test set, we can exploit this prior knowledge for learning.

Page 26: Transductive Inference for Text Classification using Support Vector Machines

Solving the Optimization Problem

• Training a transductive SVM means solving the (partly) combinatorial optimization problem OP2.

• The key idea of the algorithm is that it begins with a labeling of the test data based on the classification of an inductive SVM.

• Then it improves the solution by switching the labels of test examples so that the objective function decreases.

Page 27: Transductive Inference for Text Classification using Support Vector Machines

Solving the Optimization Problem

• It starts with training an inductive SVM on the training data and classifying the test data accordingly.

• Then it uniformly increases the influence of the test examples by incrementing the cost-factors C*_- and C*_+ up to the user-defined value of C* (loop 1).

• While the criterion in the condition of loop 2 identifies two examples for which changing the class labels leads to a decrease in the current objective function, these examples are switched.

Page 28: Transductive Inference for Text Classification using Support Vector Machines

Algorithm for training Transductive Support Vector Machines

• Input:
  – training examples (x_1, y_1), (x_2, y_2), ..., (x_n, y_n)
  – test examples x*_1, x*_2, ..., x*_k
• Parameters:
  – C, C*: parameters from OP(2)
  – num_+: number of test examples to be assigned to class +
• Output: predicted labels of the test examples y*_1, y*_2, ..., y*_k

(w, b, ξ, _) := solve_svm_qp([(x_1, y_1) ... (x_n, y_n)], [], C, 0, 0);
Classify the test examples using <w, b>. The num_+ test examples with the highest value of w·x*_j + b are assigned to the class +1 (y*_j := 1); the remaining test examples are assigned to class -1 (y*_j := -1).
C*_- := 10^-5;
C*_+ := 10^-5 · num_+ / (k - num_+);

Page 29: Transductive Inference for Text Classification using Support Vector Machines

Algorithm for training Transductive Support Vector Machines

while ((C*_- < C*) || (C*_+ < C*)) {                                  // loop 1
    (w, b, ξ, ξ*) := solve_svm_qp([(x_1, y_1) ... (x_n, y_n)], [(x*_1, y*_1) ... (x*_k, y*_k)], C, C*_-, C*_+);
    while (∃ m, l: (y*_m · y*_l < 0) & (ξ*_m > 0) & (ξ*_l > 0) & (ξ*_m + ξ*_l > 2)) {   // loop 2
        y*_m := -y*_m;                      // switch the labels of a positive and a
        y*_l := -y*_l;                      // negative test example and retrain
        (w, b, ξ, ξ*) := solve_svm_qp([(x_1, y_1) ... (x_n, y_n)], [(x*_1, y*_1) ... (x*_k, y*_k)], C, C*_-, C*_+);
    }
    C*_- := min(2 · C*_-, C*);
    C*_+ := min(2 · C*_+, C*);
}
return (y*_1, ..., y*_k);
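
As an illustration of this loop, here is a minimal Python sketch (not the author's implementation) that mirrors the structure of Algorithm 1 under the assumption that a solve_svm_qp-style helper is available; one possible approximation of that helper is sketched after Optimization Problem 3 below. All names are hypothetical, and dense NumPy feature matrices are assumed.

import numpy as np

def train_tsvm(X, y, X_test, C, C_star, num_pos, solve_svm_qp):
    # X, y: labeled training data (y in {-1, +1}); X_test: unlabeled test data.
    # solve_svm_qp(X, y, X_test, y_star, C, C_minus, C_plus) -> (w, b, xi, xi_star)
    k = len(X_test)

    # Step 1: inductive SVM on the labeled data only.
    w, b, _, _ = solve_svm_qp(X, y, None, None, C, 0.0, 0.0)

    # Step 2: assign the num_pos highest-scoring test examples to class +1.
    y_star = -np.ones(k)
    y_star[np.argsort(-(X_test @ w + b))[:num_pos]] = 1.0

    # Step 3: loop 1 - start with small test costs and double them up to C_star.
    C_minus = 1e-5
    C_plus = 1e-5 * num_pos / max(k - num_pos, 1)   # guard against num_pos == k
    while C_minus < C_star or C_plus < C_star:
        w, b, _, xi_star = solve_svm_qp(X, y, X_test, y_star, C, C_minus, C_plus)
        while True:  # loop 2 - switch label pairs that reduce the objective
            pos = [j for j in range(k) if y_star[j] > 0 and xi_star[j] > 0]
            neg = [j for j in range(k) if y_star[j] < 0 and xi_star[j] > 0]
            pair = next(((m, l) for m in pos for l in neg
                         if xi_star[m] + xi_star[l] > 2), None)
            if pair is None:
                break
            m, l = pair
            y_star[m], y_star[l] = -y_star[m], -y_star[l]
            w, b, _, xi_star = solve_svm_qp(X, y, X_test, y_star, C, C_minus, C_plus)
        C_minus = min(2 * C_minus, C_star)
        C_plus = min(2 * C_plus, C_star)
    return y_star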

Page 30: Transductive Inference for Text Classification using Support Vector Machines

Inductive SVM (primal) Optimization Problem 3

• The function solve_svm_qp refers to quadratic programs of the following type.

• Minimize over (w, b, ξ, ξ*):

  (1/2) ||w||² + C Σ_{i=1}^{n} ξ_i + C*_- Σ_{j: y*_j = -1} ξ*_j + C*_+ Σ_{j: y*_j = +1} ξ*_j

  subject to:
  ∀ i = 1..n:  y_i [w·x_i + b] ≥ 1 - ξ_i
  ∀ j = 1..k:  y*_j [w·x*_j + b] ≥ 1 - ξ*_j
  ∀ i = 1..n:  ξ_i ≥ 0
  ∀ j = 1..k:  ξ*_j ≥ 0
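
The slides leave solve_svm_qp abstract. As an assumption-laden sketch (not the author's code), OP3 can be approximated with an off-the-shelf weighted SVM: scikit-learn's SVC rescales the cost C per example via sample_weight, so the factors C, C*_- and C*_+ can be encoded as per-sample weights, and the slacks can be recovered afterwards as hinge losses of the fitted hyperplane. Dense NumPy matrices and the signature used in the sketch above are assumed.

import numpy as np
from sklearn.svm import SVC

def solve_svm_qp(X, y, X_test, y_star, C, C_minus, C_plus):
    if X_test is None:
        # Inductive case: only the labeled training examples take part.
        X_all, y_all = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
        weights = np.full(len(y_all), C)
    else:
        # Stack training and (currently labeled) test examples; each example
        # gets its own cost factor: C, or C*_- / C*_+ depending on its label.
        X_all = np.vstack([X, X_test])
        y_all = np.concatenate([y, y_star])
        weights = np.concatenate([np.full(len(y), C),
                                  np.where(np.asarray(y_star) > 0, C_plus, C_minus)])

    clf = SVC(kernel="linear", C=1.0)        # effective per-sample cost = 1.0 * weight
    clf.fit(X_all, y_all, sample_weight=weights)

    w, b = clf.coef_.ravel(), float(clf.intercept_[0])
    slack = np.maximum(0.0, 1.0 - y_all * (X_all @ w + b))   # hinge losses as slacks
    return w, b, slack[:len(y)], slack[len(y):]

Together with the loop sketched above, this is enough to run the whole procedure end to end on a small, dense feature matrix (e.g. a densified TF-IDF matrix).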

Page 31: Transductive Inference for Text Classification using Support Vector Machines

Theorem 2

• Algorithm 1 converges in a finite number of steps.

• The condition y*_m · y*_l < 0 in loop 2 requires that the examples to be switched have different class labels.

Page 32: Transductive Inference for Text Classification using Support Vector Machines

Theorem 2

• Let y*_m = 1 (and hence y*_l = -1), so that we can write the value of the objective function before the switch as

  J_1 = (1/2) ||w||² + C Σ_{i=1}^{n} ξ_i + C*_- Σ_{j: y*_j = -1} ξ*_j + C*_+ Σ_{j: y*_j = +1} ξ*_j

• After switching y*_m and y*_l, the old hyperplane (w, b) together with the slacks ξ'_m = max(2 - ξ*_m, 0) for example m and ξ'_l = max(2 - ξ*_l, 0) for example l (all other variables unchanged) is a feasible point of the new optimization problem. The optimal value J_2 after the switch therefore satisfies

  J_2 ≤ J_1 - C*_+ ξ*_m - C*_- ξ*_l + C*_+ ξ'_l + C*_- ξ'_m < J_1

Page 33: Transductive Inference for Text Classification using Support Vector Machines

Theorem 2

• The inequality holds due to the selection criterion in loop 2, since ξ*_m > ξ'_l = max(2 - ξ*_l, 0) and ξ*_l > ξ'_m = max(2 - ξ*_m, 0).

• This means that loop 2 is exited after a finite number of iterations, since there is only a finite number of permutations of the test examples.

• Loop 1 also terminates after a finite number of iterations, since C*_- and C*_+ are bounded by C*.

Page 34: Transductive Inference for Text Classification using Support Vector Machines

Conclusions and Outlook

• By taking a transductive instead of an inductive approach, the test set can be used as an additional source of information about margins.

• On all data sets the transductive approach showed improvements over the currently best performing method, most substantially for small training samples and large test sets.