Text Classification: Vector Space Based and Linear Classifier
Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schütze


Page 1: Text Classification Vector Space Based and Linear Classifier

Text Classification: Vector Space Based and Linear Classifier

Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schütze


Page 2: Text Classification Vector Space Based and Linear Classifier

Recall: Vector Space Representation

• Each document is a vector, one component (term weight) for each term (= word).

• Normally normalize vectors to unit length (a quick sketch follows this list).
• High-dimensional vector space:
  – Terms are axes
  – 10,000+ dimensions, or even 100,000+
  – Docs are vectors in this space
• How can we do classification in this space?
  – Recall that Naïve Bayes classification does not make use of the term weights.
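To make the unit-length point concrete, here is a minimal sketch in Python, assuming plain NumPy arrays as tf-idf vectors (the values are made up for illustration):

```python
import numpy as np

# A toy tf-idf vector for one document (illustrative values only).
doc = np.array([0.0, 2.3, 0.0, 1.1, 0.7])

# Normalize to unit (L2) length, so dot products between documents
# become cosine similarities.
unit_doc = doc / np.linalg.norm(doc)

print(np.linalg.norm(unit_doc))  # -> 1.0
```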


Page 3: Text Classification Vector Space Based and Linear Classifier

Classification Using Vector Spaces

• As before, the training set is a set of documents, each labeled with its class (e.g., topic)

• In vector space based representation, this set corresponds to a labeled set of points (or, equivalently, vectors) in the vector space

• Premise 1: Documents in the same class form a contiguous region of space

• Premise 2: Documents from different classes don’t overlap (much)

• We define surfaces to delineate classes in the space


Page 4: Text Classification Vector Space Based and Linear Classifier

Documents in a Vector Space

[Figure: documents plotted in the vector space, clustered into Government, Science, and Arts regions]

Page 5: Text Classification Vector Space Based and Linear Classifier

Test Document of what class?

[Figure: the same vector space with an unlabeled test document falling among the Government, Science, and Arts regions]

Page 6: Text Classification Vector Space Based and Linear Classifier

Test Document = Government

[Figure: the test document lands in the Government region and is classified accordingly]

The main question is how to find good separators between the classes.

Page 7: Text Classification Vector Space Based and Linear Classifier

Using Rocchio for text classification

• Use standard TF-IDF weighted vectors to represent text documents.

• For training documents in each category, compute a prototype vector by summing the vectors of the training documents in the category.
  – Prototype = centroid of members of class

• Assign test documents to the category with the closest prototype vector based on cosine similarity.


Page 8: Text Classification Vector Space Based and Linear Classifier

Definition of centroid

The centroid of class c is defined as

$$\vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)$$

where $D_c$ is the set of all documents that belong to class c and $\vec{v}(d)$ is the vector space representation of d.

• Note that the centroid will in general not be a unit vector even when the inputs are unit vectors.
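To make Rocchio concrete, here is a minimal sketch in Python. It assumes documents are already tf-idf NumPy vectors, and the function names are ours, not from the book:

```python
import numpy as np

def train_rocchio(doc_vectors, labels):
    """Compute one prototype per class: the centroid of its member vectors."""
    centroids = {}
    for c in set(labels):
        members = np.array([v for v, y in zip(doc_vectors, labels) if y == c])
        centroids[c] = members.mean(axis=0)  # mu(c) = (1/|Dc|) * sum of v(d)
    return centroids

def classify_rocchio(centroids, d):
    """Assign d to the class whose centroid is most cosine-similar to d."""
    def cos(u, v):
        return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return max(centroids, key=lambda c: cos(centroids[c], d))
```

Training is a single averaging pass over the training vectors, which is why a later slide calls Rocchio cheap to train and test.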

Page 9: Text Classification Vector Space Based and Linear Classifier

Illustration of Rocchio Text Categorization

Note: Centroid vectors are illustrated by directions only.

Page 10: Text Classification Vector Space Based and Linear Classifier

Rocchio Properties 

• Forms a simple generalization of the examples in each class (a prototype).

• Prototype vector does not need to be averaged or otherwise normalized for length since cosine similarity is insensitive to vector length.

• Classification is based on similarity to class prototypes.


Page 11: Text Classification Vector Space Based and Linear Classifier

Rocchio Anomaly   

• Prototype models have problems with polymorphic (disjunctive) categories.


Page 12: Text Classification Vector Space Based and Linear Classifier

Rocchio classification


Page 13: Text Classification Vector Space Based and Linear Classifier

Rocchio classification

• Rocchio forms a simple representation for each class: the centroid/prototype 

• Classification is based on similarity to / distance from the prototype/centroid

• It does not guarantee that classifications are consistent with the given training data

• It is little used outside text classification, but within text classification it has been used quite effectively

• Again, it is cheap to train and to apply to test documents


Page 14: Text Classification Vector Space Based and Linear Classifier

k Nearest Neighbor Classification

• kNN = k Nearest Neighbor

• To classify document d into class c:
  – Define the k-neighborhood N as the k nearest neighbors of d
  – Count the number i of documents in N that belong to c
  – Estimate P(c|d) as i/k (worked example below)
  – Choose as class argmax_c P(c|d)  [= the majority class]
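As a made-up illustration with k = 6: if 4 of the 6 nearest neighbors of d are Government, 1 is Science, and 1 is Arts, then

$$\hat{P}(\text{Government}\mid d) = 4/6, \qquad \hat{P}(\text{Science}\mid d) = \hat{P}(\text{Arts}\mid d) = 1/6,$$

so d is assigned to Government, the majority class (compare the 6NN example on the next page).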


Page 15: Text Classification Vector Space Based and Linear Classifier

Example: k=6 (6NN)

[Figure: a test document with its 6 nearest neighbors, drawn from the Government, Science, and Arts classes]

Page 16: Text Classification Vector Space Based and Linear Classifier

Nearest-Neighbor Learning Algorithm

• Learning is just storing the representations of the training examples in D.

• Testing instance x (under 1NN):
  – Compute similarity between x and all examples in D.
  – Assign x the category of the most similar example in D.

• Does not explicitly compute a generalization or category prototypes.

• Also called:
  – Case-based learning
  – Memory-based learning
  – Lazy learning

• Rationale of kNN: contiguity hypothesis


Page 17: Text Classification Vector Space Based and Linear Classifier

k Nearest Neighbor

• Using only the closest example (1NN) to determine the class is subject to errors due to:
  – A single atypical example.
  – Noise (i.e., an error) in the category label of a single training example.

• A more robust alternative is to find the k most-similar examples and return the majority category of these k examples.


Page 18: Text Classification Vector Space Based and Linear Classifier

kNN decision boundaries

[Figure: kNN decision boundaries between the Government, Science, and Arts regions]

Boundaries are in principle arbitrary surfaces, but usually polyhedra.

kNN gives locally defined decision boundaries between classes: far-away points do not influence each classification decision (unlike in Naïve Bayes, Rocchio, etc.).

Page 19: Text Classification Vector Space Based and Linear Classifier

Similarity Metrics

• Nearest neighbor method depends on a similarity (or distance) metric.

• For text, cosine similarity of tf.idf weighted vectors is typically most effective.
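For reference, the cosine similarity of two term-weight vectors is

$$\cos(\vec{u},\vec{v}) = \frac{\vec{u}\cdot\vec{v}}{|\vec{u}|\,|\vec{v}|} = \frac{\sum_{i} u_i v_i}{\sqrt{\sum_{i} u_i^2}\,\sqrt{\sum_{i} v_i^2}},$$

which reduces to a plain dot product when the vectors are normalized to unit length, as recommended earlier.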


Page 20: Text Classification Vector Space Based and Linear Classifier

Illustration of 3 Nearest Neighbor for Text Vector Space


Page 21: Text Classification Vector Space Based and Linear Classifier

3 Nearest Neighbor Comparison

• Nearest Neighbor tends to handle polymorphic categories better. 


Page 22: Text Classification Vector Space Based and Linear Classifier

k Nearest Neighbor

• The value of k is typically odd to avoid ties; 3 and 5 are most common, but larger values between 50 and 100 are also used.

• Alternatively, we can select k that gives best results on a held‐out portion of the training set.


Page 23: Text Classification Vector Space Based and Linear Classifier

kNN Algorithm: Training (Preprocessing) and Testing

TRAIN-KNN(C, D)
1  D′ ← PREPROCESS(D)
2  k ← SELECT-K(C, D′)
3  return D′, k

APPLY-KNN(C, D′, k, d)
1  S_k ← COMPUTENEARESTNEIGHBORS(D′, k, d)
2  for each c_j ∈ C
3    do p_j ← |S_k ∩ D_j| / k
4  return argmax_j p_j

Here p_j is an estimate for P(c_j | d), and D_j denotes the set of all documents in class c_j.
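A minimal Python rendering of APPLY-KNN, assuming unit-length tf-idf NumPy vectors so that cosine similarity is a plain dot product (the names are ours):

```python
import numpy as np
from collections import Counter

def apply_knn(train_vecs, train_labels, k, d):
    """Classify d by majority vote among its k nearest training documents.

    train_vecs: 2-D array, one unit-length tf-idf vector per row.
    train_labels: list of class labels, one per row.
    d: unit-length tf-idf vector of the test document.
    """
    sims = train_vecs @ d              # cosine similarities via dot products
    nearest = np.argsort(sims)[-k:]    # indices of the k most similar docs
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]  # argmax_c of the i/k estimates
```

SELECT-K could then be a loop over candidate values of k, keeping the one that scores best on a held-out part of the training set, as the previous page suggests.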


Page 24: Text Classification Vector Space Based and Linear Classifier

kNN: Discussion

• No feature selection necessary
• Scales well with large number of classes
  – Don't need to train n classifiers for n classes
• Classes can influence each other
  – Small changes to one class can have a ripple effect
• Can avoid training if preferred
• May be more expensive at test time


Page 25: Text Classification Vector Space Based and Linear Classifier

Linear Classifiers


Page 26: Text Classification Vector Space Based and Linear Classifier

Linear Classifiers

• Consider two-class problems
  – Deciding between two classes, perhaps government and non-government

• One‐versus‐rest classification

• How do we define (and find) the separating surface?

• How do we decide which region a test doc is in?


Page 27: Text Classification Vector Space Based and Linear Classifier

Separation by Hyperplanes

• A strong high-bias assumption is linear separability:
  – In 2 dimensions, can separate classes by a line
  – In higher dimensions, need hyperplanes

• In 2 dimensions, the separator can be expressed as ax + by = c


Page 28: Text Classification Vector Space Based and Linear Classifier

Which Hyperplane?

In general, lots of possible solutions for a, b, c.


Page 29: Text Classification Vector Space Based and Linear Classifier

Which Hyperplane?

• Lots of possible solutions for a, b, c.
• Some methods find an optimal separating hyperplane [according to some criterion of expected goodness]
• Which points should influence optimality?
  – All points
    • Linear regression
    • Naïve Bayes
  – Only "difficult points" close to the decision boundary
    • Support vector machines

Page 30: Text Classification Vector Space Based and Linear Classifier

High‐Dimensional Linear Classifier

• For general linear classifiers, assign a document d with M features d = (d_1, …, d_M) to one class if:

$$\sum_{i=1}^{M} w_i d_i \geq 0$$

Otherwise, assign it to the other class.


Page 31: Text Classification Vector Space Based and Linear Classifier

Applying Linear Classifier

APPLYLINEARCLASSIFIER(w, θ, d)
1  score ← Σ_{i=1}^{M} w_i d_i
2  if score > θ
3    then return 1
4    else return 0
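The same procedure as a minimal Python sketch, with the weight vector, threshold, and document vector as plain NumPy arrays and a float:

```python
import numpy as np

def apply_linear_classifier(w, theta, d):
    """Return 1 if the weighted feature sum exceeds the threshold, else 0."""
    score = np.dot(w, d)  # score = sum_i w_i * d_i
    return 1 if score > theta else 0
```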


Page 32: Text Classification Vector Space Based and Linear Classifier

Linear classifier: Example

• Class: "interest" (as in interest rate)
• Example features of a linear classifier, as weight-term pairs (w_i, t_i):

  w_i    t_i             w_i     t_i
  0.70   prime           -0.71   dlrs
  0.67   rate            -0.35   world
  0.63   interest        -0.33   sees
  0.60   rates           -0.25   year
  0.46   discount        -0.24   group
  0.43   bundesbank      -0.24   dlr

• To classify, find the dot product of the feature vector and the weights (worked example below).
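For instance, take a hypothetical test document whose only weighted features are rate, discount, and dlrs, each with value 1 (our illustration, not from the slides). Its score is 0.67 + 0.46 − 0.71 = 0.42 > 0, so a zero-threshold classifier assigns it to "interest".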


Page 33: Text Classification Vector Space Based and Linear Classifier

Linear Classifiers

• Many common text classifiers are linear classifiers:
  – Naïve Bayes
  – Perceptron
  – Rocchio
  – Logistic regression
  – Support vector machines (with linear kernel)
  – Linear regression

• Despite this similarity, there are noticeable performance differences between them.


Page 34: Text Classification Vector Space Based and Linear Classifier

Two‐class Rocchio as a linear classifier

• To show Rocchio is a linear classifier, consider that a vector d is on the decision boundary if it has equal distance to the two class centroids. This gives a linear classifier that assigns d to class c_1 if

$$\sum_{i=1}^{M} w_i d_i \geq \theta, \qquad \text{where } \vec{w} = \vec{\mu}(c_1) - \vec{\mu}(c_2)$$

and

$$\theta = 0.5 \cdot \left( |\vec{\mu}(c_1)|^2 - |\vec{\mu}(c_2)|^2 \right)$$
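The linear form drops out of the equal-distance condition directly (a short derivation for completeness):

$$|\vec{d} - \vec{\mu}(c_1)|^2 = |\vec{d} - \vec{\mu}(c_2)|^2$$

$$-2\,\vec{d}\cdot\vec{\mu}(c_1) + |\vec{\mu}(c_1)|^2 = -2\,\vec{d}\cdot\vec{\mu}(c_2) + |\vec{\mu}(c_2)|^2$$

$$\vec{d}\cdot\left(\vec{\mu}(c_1) - \vec{\mu}(c_2)\right) = 0.5\left(|\vec{\mu}(c_1)|^2 - |\vec{\mu}(c_2)|^2\right)$$

The $|\vec{d}|^2$ terms cancel in the second step, leaving exactly $\sum_i w_i d_i = \theta$ with the $\vec{w}$ and $\theta$ above.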


Page 35: Text Classification Vector Space Based and Linear Classifier

Linear Problem

• A linear problem: the underlying distributions of the two classes are separated by a line.

• This separating line is called the class boundary.
  – It is the "true" boundary, and we distinguish it from the decision boundary that the learning method computes


Page 36: Text Classification Vector Space Based and Linear Classifier

Linear Problem

• A linear problem: classifying whether a Web page is a Chinese Web page (solid circles)

• The class boundary is represented by the short-dashed line
• There are three noise documents

Page 37: Text Classification Vector Space Based and Linear Classifier

Linear Problem

• If there exists a hyperplane that perfectly separates the two classes, then we call the two classes linearly separable.

• If linear separability holds, there are infinitely many linear separators.
  – We need a criterion for selecting among all decision hyperplanes.


Page 38: Text Classification Vector Space Based and Linear Classifier

Nonlinear Problem

• If a problem is nonlinear and its class boundaries cannot be approximated well with linear hyperplanes, then nonlinear classifiers are better.

• An example of a nonlinear classifier is kNN.

Page 39: Text Classification Vector Space Based and Linear Classifier

More Than Two Classes

• Any-of or multivalue classification
  – Classes are independent of each other.
  – A document can belong to 0, 1, or more than 1 classes.
  – Quite common for documents


Page 40: Text Classification Vector Space Based and Linear Classifier

Set of Binary Classifiers: Any‐of

• Build a separator between each class and its complementary set (docs from all other classes).

• Decompose into n binary classification problems, one per class (see the sketch after this list).

• Given test doc, evaluate it for membership in each class.

• Apply decision criterion of classifiers independently

• Though maybe you could do better by considering dependencies between categories
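A minimal any-of sketch, reusing the per-class linear setup from the earlier pages (a dict of hypothetical (w, theta) pairs, one binary classifier per class):

```python
import numpy as np

def classify_any_of(classifiers, d):
    """Return every class whose one-vs-rest classifier accepts document d.

    classifiers: dict mapping class name -> (w, theta), the weight vector
    and threshold of that class-versus-complement classifier.
    """
    return [c for c, (w, theta) in classifiers.items() if np.dot(w, d) > theta]
```

Because the decision criteria are applied independently, the returned list may be empty, contain one class, or contain several.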


Page 41: Text Classification Vector Space Based and Linear Classifier

More Than Two Classes

• One-of or multinomial or polytomous classification
  – Classes are mutually exclusive.
  – Each document belongs to exactly one class
  – E.g., digit recognition is polytomous classification
    • Digits are mutually exclusive


Page 42: Text Classification Vector Space Based and Linear Classifier

Set of Binary Classifiers: One‐of

• Build a separator between each class and its complementary set (docs from all other classes).

• Given test doc, evaluate it for membership in each class.

• Assign the document to the class with:
  – maximum score
  – maximum confidence
  – maximum probability
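A matching one-of sketch under the same assumed per-class (w, theta) setup; since classes are mutually exclusive, we return the single class with the maximum score:

```python
import numpy as np

def classify_one_of(classifiers, d):
    """Assign d to the one class whose separator gives the highest score."""
    return max(classifiers, key=lambda c: np.dot(classifiers[c][0], d))
```

Using raw scores is one option; calibrated confidences or probabilities, as listed above, would swap in for the dot product.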

