Information Retrieval: Search Engine Technology (5&6)
http://tangra.si.umich.edu/clair/ir09
Prof. Dragomir R. Radev ([email protected])
Final projects
• Two formats:
  – A software system that performs a specific search-engine-related task. We will create a web page with all such code and make it available to the IR community.
  – A research experiment documented in the form of a paper. Look at the proceedings of the SIGIR, WWW, or ACL conferences for a sample format. I will encourage the authors of the most successful papers to consider submitting them to one of the IR-related conferences.
• Deliverables:
  – System (code + documentation + examples) or Paper (+ code, data)
  – Poster (to be presented in class)
  – Web page that describes the project
SET/IR – W/S 2009
9. Text classification: Naïve Bayesian classifiers, decision trees
Introduction
• Text classification: assigning documents to predefined categories (topics, languages, users)
• Given a set of classes C and an input x, determine the class of x in C
• Hierarchical vs. flat
• Overlapping (soft) vs. non-overlapping (hard)
Introduction
• Ideas: manual classification using rules, e.g.,
  – Columbia AND University → Education
  – Columbia AND “South Carolina” → Geography
• Popular techniques: generative (knn, Naïve Bayes) vs. discriminative (SVM, regression)
• Generative: model the joint probability p(x,y) and use Bayesian prediction to compute p(y|x)
• Discriminative: model p(y|x) directly.
Bayes formula
)(
)|()()|(
Ap
BApBpABp
Full probability
Example (performance enhancing drug)
• Drug(D) with values y/n• Test(T) with values +/-• P(D=y) = 0.001• P(T=+|D=y)=0.8• P(T=+|D=n)=0.01• Given: athlete tests positive• P(D=y|T=+)=
P(T=+|D=y)P(D=y) / (P(T=+|D=y)P(D=y)+P(T=+|D=n)P(D=n)=(0.8x0.001)/(0.8x0.001+0.01x0.999)=0.074
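The computation above can be sketched directly; all numbers come from the slide, and P(T=+) is obtained by full probability.

```python
# Bayes rule for the doping-test example:
# P(D=y | T=+) = P(T=+|D=y) P(D=y) / P(T=+)
p_d = 0.001             # prior: athlete uses the drug
p_pos_given_d = 0.8     # test sensitivity, P(T=+|D=y)
p_pos_given_not = 0.01  # false-positive rate, P(T=+|D=n)

# full probability: P(T=+) = P(T=+|D=y)P(D=y) + P(T=+|D=n)P(D=n)
p_pos = p_pos_given_d * p_d + p_pos_given_not * (1 - p_d)
posterior = p_pos_given_d * p_d / p_pos
print(round(posterior, 3))  # 0.074
```

Despite the 80% sensitivity, the posterior is only about 7%, because the prior P(D=y) is so small.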
Naïve Bayesian classifiers
• Naïve Bayesian classifier
• Assuming statistical independence
• Features = words (or phrases) typically
Bayes rule applied to a document d and features F1, …, Fk:

P(d∈C | F1, F2, …, Fk) = P(F1, F2, …, Fk | d∈C) P(d∈C) / P(F1, F2, …, Fk)

Assuming statistical independence of the features:

P(d∈C | F1, F2, …, Fk) = P(d∈C) ∏ⱼ₌₁..k P(Fj | d∈C) / ∏ⱼ₌₁..k P(Fj)
Example
• p(well) = 0.9, p(cold) = 0.05, p(allergy) = 0.05
  – p(sneeze|well) = 0.1    p(sneeze|cold) = 0.9    p(sneeze|allergy) = 0.9
  – p(cough|well) = 0.1     p(cough|cold) = 0.8     p(cough|allergy) = 0.7
  – p(fever|well) = 0.01    p(fever|cold) = 0.7     p(fever|allergy) = 0.4
Example from Ray Mooney
Example (cont’d)
• Features: sneeze, cough, no fever
• P(well|e) = (.9) × (.1)(.1)(.99) / p(e) = 0.0089/p(e)
• P(cold|e) = (.05) × (.9)(.8)(.3) / p(e) = 0.01/p(e)
• P(allergy|e) = (.05) × (.9)(.7)(.6) / p(e) = 0.019/p(e)
• p(e) = 0.0089 + 0.01 + 0.019 = 0.0379
• P(well|e) = .23
• P(cold|e) = .26
• P(allergy|e) = .50
Example from Ray Mooney
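The sneeze/cough/fever example can be computed as a small sketch; priors and conditionals are the slide's numbers, and absent symptoms contribute (1 − p).

```python
# Naive Bayes on the sneeze/cough/fever example: for each class compute the
# unnormalized score P(c) * prod_j P(F_j = f_j | c), then normalize by p(e).
priors = {"well": 0.9, "cold": 0.05, "allergy": 0.05}
cond = {  # P(symptom present | class)
    "well":    {"sneeze": 0.1, "cough": 0.1, "fever": 0.01},
    "cold":    {"sneeze": 0.9, "cough": 0.8, "fever": 0.7},
    "allergy": {"sneeze": 0.9, "cough": 0.7, "fever": 0.4},
}
evidence = {"sneeze": True, "cough": True, "fever": False}

scores = {}
for c, prior in priors.items():
    s = prior
    for feat, present in evidence.items():
        p = cond[c][feat]
        s *= p if present else (1 - p)  # P(no fever|c) = 1 - P(fever|c)
    scores[c] = s

z = sum(scores.values())  # p(e), by full probability
posterior = {c: s / z for c, s in scores.items()}
# allergy gets the highest posterior (~0.49 without intermediate rounding)
```

The slide's 0.23/0.26/0.50 come from rounding the numerators first; the exact posteriors are about 0.23/0.28/0.49, with the same ranking.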
Issues with NB
• Where do we get the values P(Cᵢ)? Use maximum likelihood estimation: P(Cᵢ) = Nᵢ/N
• Same for the conditionals: these are based on a multinomial generator, and the MLE estimator is P(Fⱼ|Cᵢ) = Tⱼᵢ / Σⱼ′ Tⱼ′ᵢ
• Smoothing is needed – why? (a feature never observed with a class would get probability 0 and zero out the whole product)
• Laplace smoothing: P(Fⱼ|Cᵢ) = (Tⱼᵢ + 1) / (Σⱼ′ Tⱼ′ᵢ + |V|)
• Implementation: how to avoid floating point underflow? (sum log probabilities instead of multiplying probabilities)
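Both fixes (Laplace smoothing and log-space scoring) can be sketched together. The term counts, vocabulary, and priors below are made-up toy numbers, not from the slides.

```python
import math

# Laplace-smoothed multinomial NB with log-space scoring.
# T[c][t] = count of term t in training documents of class c (toy counts).
T = {"spam": {"funds": 10, "transfer": 7, "meeting": 0},
     "ham":  {"funds": 1,  "transfer": 1, "meeting": 12}}
vocab = ["funds", "transfer", "meeting"]
prior = {"spam": 0.4, "ham": 0.6}

def log_score(c, doc):
    total = sum(T[c][t] for t in vocab)
    s = math.log(prior[c])
    for t in doc:
        # (T_jc + 1) / (sum_j' T_j'c + |V|): no term gets probability 0,
        # and summing logs avoids underflow on long documents
        s += math.log((T[c][t] + 1) / (total + len(vocab)))
    return s

doc = ["funds", "transfer", "funds"]
best = max(prior, key=lambda c: log_score(c, doc))
print(best)  # spam
```

Note that "meeting" has count 0 in the spam class: without the +1 smoothing, any document containing it would get spam probability exactly 0.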
Spam recognition

Return-Path: <[email protected]>
X-Sieve: CMU Sieve 2.2
From: "Ibrahim Galadima" <[email protected]>
Reply-To: [email protected]
To: [email protected]
Date: Tue, 14 Jan 2003 21:06:26 -0800
Subject: Gooday
DEAR SIR
FUNDS FOR INVESTMENTS
THIS LETTER MAY COME TO YOU AS A SURPRISE SINCE I HAD NO PREVIOUS CORRESPONDENCE WITH YOU

I AM THE CHAIRMAN TENDER BOARD OF INDEPENDENT NATIONAL ELECTORAL COMMISSION INEC I GOT YOUR CONTACT IN THE COURSE OF MY SEARCH FOR A RELIABLE PERSON WITH WHOM TO HANDLE A VERY CONFIDENTIAL TRANSACTION INVOLVING THE ! TRANSFER OF FUND VALUED AT TWENTY ONE MILLION SIX HUNDRED THOUSAND UNITED STATES DOLLARS US$20M TO A SAFE FOREIGN ACCOUNT

THE ABOVE FUND IN QUESTION IS NOT CONNECTED WITH ARMS, DRUGS OR MONEY LAUNDERING IT IS A PRODUCT OF OVER INVOICED CONTRACT AWARDED IN 1999 BY INEC TO A
SpamAssassin
• http://spamassassin.apache.org/
• http://spamassassin.apache.org/tests_3_1_x.html
Feature selection: The χ² test
• For a term t:
• C = class, Iₜ = indicator for term t
• Testing for independence: P(C=0, Iₜ=0) should equal P(C=0) P(Iₜ=0)
  – P(C=0) = (k00 + k01)/n
  – P(C=1) = 1 − P(C=0) = (k10 + k11)/n
  – P(Iₜ=0) = (k00 + k10)/n
  – P(Iₜ=1) = 1 − P(Iₜ=0) = (k01 + k11)/n

        Iₜ=0   Iₜ=1
C=0     k00    k01
C=1     k10    k11
Feature selection: The χ² test

• High values of χ² indicate lower belief in independence.
• In practice, compute χ² for all words and pick the top k among them.
Χ² = n (k11 k00 − k10 k01)² / ((k11 + k10)(k01 + k00)(k11 + k01)(k10 + k00))
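The χ² statistic for one term can be computed directly from its 2×2 contingency table; the counts below are made-up toy numbers for a term that occurs mostly in class-1 documents.

```python
# chi-square statistic for a 2x2 class/feature contingency table:
# X^2 = n (k11 k00 - k10 k01)^2 / ((k11+k10)(k01+k00)(k11+k01)(k10+k00))
def chi2(k00, k01, k10, k11):
    n = k00 + k01 + k10 + k11
    num = n * (k11 * k00 - k10 * k01) ** 2
    den = (k11 + k10) * (k01 + k00) * (k11 + k01) * (k10 + k00)
    return num / den

# toy counts: the term is strongly associated with class 1
print(chi2(k00=80, k01=20, k10=10, k11=90))  # high value: reject independence
# a term whose presence is independent of the class scores 0
print(chi2(k00=50, k01=50, k10=50, k11=50))
```

Ranking the vocabulary by this value and keeping the top k terms is the feature-selection recipe from the slide.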
Feature selection: mutual information
• No document length scaling is needed
• Documents are assumed to be generated according to the multinomial model
• Measures amount of information: if the distribution is the same as the background distribution, then MI=0
• X = word; Y = class
MI(X;Y) = Σₓ Σᵧ P(x,y) log [ P(x,y) / (P(x) P(y)) ]
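A small sketch of the MI formula, using two made-up joint distributions: one where word and class are independent (MI = 0, as the slide notes) and one where they are perfectly dependent.

```python
import math

# MI(X;Y) = sum_x sum_y P(x,y) * log[ P(x,y) / (P(x)P(y)) ], in bits
def mutual_information(joint):
    # joint: dict mapping (x, y) -> P(x, y); marginals are derived from it
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    mi = 0.0
    for (x, y), p in joint.items():
        if p > 0:
            mi += p * math.log2(p / (px[x] * py[y]))
    return mi

# word independent of class: every P(x,y) equals P(x)P(y), so MI = 0
indep = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
# word perfectly predicts class: MI = 1 bit
dep = {(0, 0): 0.5, (1, 1): 0.5}
print(mutual_information(indep), mutual_information(dep))  # 0.0 1.0
```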
Well-known datasets
• 20 Newsgroups
  – http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/
• Reuters-21578
  – http://www.daviddlewis.com/resources/testcollections/reuters21578/
  – Cats: grain, acquisitions, corn, crude, wheat, trade…
• WebKB
  – http://www-2.cs.cmu.edu/~webkb/
  – course, student, faculty, staff, project, dept, other
  – NB performance (2000): P = 26, 43, 18, 6, 13, 2, 94; R = 83, 75, 77, 9, 73, 100, 35
Evaluation of text classification
• Microaveraging – uses the pooled contingency table across all classes
• Macroaveraging – averages the per-class measures
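The two averages can disagree noticeably when classes are unbalanced. A sketch with made-up per-class true-positive/false-positive counts:

```python
# Micro- vs macro-averaged precision from per-class counts (toy numbers).
# Macro: average the per-class precisions; micro: pool TP and FP first.
classes = {"grain": {"tp": 20, "fp": 10},   # common class, decent precision
           "crude": {"tp": 1,  "fp": 9}}    # rare class, poor precision

macro = sum(c["tp"] / (c["tp"] + c["fp"]) for c in classes.values()) / len(classes)

tp = sum(c["tp"] for c in classes.values())
fp = sum(c["fp"] for c in classes.values())
micro = tp / (tp + fp)

print(round(macro, 3), round(micro, 3))  # macro punishes the rare class more
```

Microaveraging is dominated by the large classes; macroaveraging weights every class equally, so the rare class's poor precision drags it down.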
Vector space classification
[Figure: documents from topic1 and topic2 plotted in a two-dimensional space (x1, x2)]
Decision surfaces
[Figure: a decision surface separating topic1 from topic2 in (x1, x2) space]
Decision trees
[Figure: axis-parallel decision-tree regions separating topic1 from topic2 in (x1, x2) space]
Classification using decision trees
• Expected information needed to classify a sample:
• I(s1, s2, …, sm) = − Σᵢ pᵢ log₂(pᵢ), where pᵢ = sᵢ/s
• s = data samples; sᵢ = samples in class i
• m = number of classes
RID Age Income student credit buys?
1 <= 30 High No Fair No
2 <= 30 High No Excellent No
3 31 .. 40 High No Fair Yes
4 > 40 Medium No Fair Yes
5 > 40 Low Yes Fair Yes
6 > 40 Low Yes Excellent No
7 31 .. 40 Low Yes Excellent Yes
8 <= 30 Medium No Fair No
9 <= 30 Low Yes Fair Yes
10 > 40 Medium Yes Fair Yes
11 <= 30 Medium Yes Excellent Yes
12 31 .. 40 Medium No Excellent Yes
13 31 .. 40 High Yes Fair Yes
14 > 40 Medium No Excellent No
Decision tree induction
• I(s1, s2) = I(9, 5) = −9/14 log₂(9/14) − 5/14 log₂(5/14) = 0.940
Entropy and information gain
• E(A) = Σⱼ [(s1j + … + smj)/s] · I(s1j, …, smj)

Entropy = expected information based on the partitioning into subsets by A

Gain(A) = I(s1, s2, …, sm) − E(A)
Entropy
• Age <= 30: s11 = 2, s21 = 3, I(s11, s21) = 0.971
• Age in 31..40: s12 = 4, s22 = 0, I(s12, s22) = 0
• Age > 40: s13 = 3, s23 = 2, I(s13, s23) = 0.971
Entropy (cont’d)
• E(age) = 5/14 I(s11, s21) + 4/14 I(s12, s22) + 5/14 I(s13, s23) = 0.694
• Gain (age) = I (s1,s2) – E(age) = 0.246
• Gain (income) = 0.029, Gain (student) = 0.151, Gain (credit) = 0.048
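The gains above can be recomputed from the table; the code below transcribes the 14 rows and implements the I/E/Gain definitions from the slides.

```python
import math

# rows of the buys? table: (age, income, student, credit, buys)
rows = [
    ("<=30", "High", "No", "Fair", "No"),    ("<=30", "High", "No", "Excellent", "No"),
    ("31..40", "High", "No", "Fair", "Yes"), (">40", "Medium", "No", "Fair", "Yes"),
    (">40", "Low", "Yes", "Fair", "Yes"),    (">40", "Low", "Yes", "Excellent", "No"),
    ("31..40", "Low", "Yes", "Excellent", "Yes"), ("<=30", "Medium", "No", "Fair", "No"),
    ("<=30", "Low", "Yes", "Fair", "Yes"),   (">40", "Medium", "Yes", "Fair", "Yes"),
    ("<=30", "Medium", "Yes", "Excellent", "Yes"), ("31..40", "Medium", "No", "Excellent", "Yes"),
    ("31..40", "High", "Yes", "Fair", "Yes"), (">40", "Medium", "No", "Excellent", "No"),
]

def entropy(labels):
    # I(s1, s2) = -sum p_i log2 p_i over the Yes/No counts
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count("Yes"), labels.count("No")) if c)

def gain(attr_idx):
    # Gain(A) = I(s1, s2) - E(A), E(A) weighting each subset by its size
    labels = [r[-1] for r in rows]
    e = 0.0
    for v in set(r[attr_idx] for r in rows):
        subset = [r[-1] for r in rows if r[attr_idx] == v]
        e += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - e

print([round(gain(i), 3) for i in range(4)])  # age, income, student, credit
```

Age has the largest gain (≈ 0.247 exactly; 0.246 on the slide from rounded intermediates), so it becomes the root of the tree.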
Final decision tree

age?
├─ <= 30 → student?
│    ├─ no → no
│    └─ yes → yes
├─ 31..40 → yes
└─ > 40 → credit?
     ├─ excellent → no
     └─ fair → yes
Other techniques
• Bayesian classifiers
• X: age <=30, income = medium, student = yes, credit = fair
• P(yes) = 9/14 = 0.643
• P(no) = 5/14 = 0.357
Example
• P(age <= 30 | yes) = 2/9 = 0.222
  P(age <= 30 | no) = 3/5 = 0.600
  P(income = medium | yes) = 4/9 = 0.444
  P(income = medium | no) = 2/5 = 0.400
  P(student = yes | yes) = 6/9 = 0.667
  P(student = yes | no) = 1/5 = 0.200
  P(credit = fair | yes) = 6/9 = 0.667
  P(credit = fair | no) = 2/5 = 0.400
Example (cont’d)
• P(X|yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
• P(X|no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
• P(X|yes) P(yes) = 0.044 × 0.643 = 0.028
• P(X|no) P(no) = 0.019 × 0.357 = 0.007
• Answer: yes, since 0.028 > 0.007
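The decision can be checked with the slide's numbers; since both hypotheses share the same P(X), comparing the unnormalized products is enough.

```python
# Naive Bayes decision for X = (age<=30, income=medium, student=yes, credit=fair),
# using the conditional estimates computed from the table above.
p_x_yes = 0.222 * 0.444 * 0.667 * 0.667   # P(X|yes)
p_x_no = 0.600 * 0.400 * 0.200 * 0.400    # P(X|no)
score_yes = p_x_yes * 9 / 14              # P(X|yes) P(yes)
score_no = p_x_no * 5 / 14                # P(X|no) P(no)
print("yes" if score_yes > score_no else "no")  # yes
```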
SET/IR – W/S 2009
10. Linear classifiers, kernel methods, support vector machines
Linear boundary
[Figure: a linear boundary between topic1 and topic2 in (x1, x2) space]
Vector space classifiers
• Using centroids
• Boundary = line that is equidistant from two centroids
Generative models: knn
• Assign each element to the closest cluster
• K-nearest neighbors
• Very easy to program
• Tessellation; nonlinearity
• Issues: choosing k, b?
• Demo: http://www-2.cs.cmu.edu/~zhuxj/courseproject/knndemo/KNN.html

score(c, dq) = b + Σ_{d ∈ kNN(dq), d ∈ c} s(dq, d)
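A minimal sketch of this scoring rule, with b = 0, cosine similarity as s, and made-up two-dimensional training vectors:

```python
import math

# kNN scoring: score(c, d_q) = b + sum of similarities of the k nearest
# neighbors of d_q that belong to class c; predict the argmax class.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def knn_classify(query, labeled, k=3, b=0.0):
    # labeled: list of (vector, class); keep the k most similar training docs
    neighbors = sorted(labeled, key=lambda dv: cosine(query, dv[0]), reverse=True)[:k]
    scores = {}
    for vec, cls in neighbors:
        scores[cls] = scores.get(cls, b) + cosine(query, vec)
    return max(scores, key=scores.get)

train = [((1.0, 0.1), "topic1"), ((0.9, 0.2), "topic1"), ((0.1, 1.0), "topic2")]
print(knn_classify((1.0, 0.0), train))  # topic1
```

Note there is no training phase at all: the "model" is the labeled collection itself, which is why the method is so easy to program.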
Linear separators
• Two-dimensional line: w1x1 + w2x2 = b is the linear separator
  w1x1 + w2x2 > b for the positive class
• In n-dimensional spaces: wᵀx = b
Example 1
[Figure: a separating hyperplane with weight vector w between topic1 and topic2 in (x1, x2) space]
Example 2
• Classifier for “interest” in Reuters-21578
• b = 0
• If the document is “rate discount dlrs world”, its score will be
  0.67·1 + 0.46·1 + (−0.71)·1 + (−0.35)·1 = 0.07 > 0
Example from MSR
  wi   xi            wi    xi
0.70   prime       −0.71   dlrs
0.67   rate        −0.35   world
0.63   interest    −0.33   sees
0.60   rates       −0.25   year
0.46   discount    −0.24   group
0.43   bundesbank  −0.24   dlr
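Scoring with this weight vector is just a sum over the document's tokens; the weights below are the ones in the table, with terms outside the table contributing 0.

```python
# Scoring a document with the "interest" linear classifier (b = 0):
# score(d) = sum of w_i for each token x_i of d, classify positive if > b.
weights = {"prime": 0.70, "rate": 0.67, "interest": 0.63, "rates": 0.60,
           "discount": 0.46, "bundesbank": 0.43,
           "dlrs": -0.71, "world": -0.35, "sees": -0.33,
           "year": -0.25, "group": -0.24, "dlr": -0.24}
b = 0.0

def classify(doc):
    score = sum(weights.get(tok, 0.0) for tok in doc.split())
    return score, score > b

score, positive = classify("rate discount dlrs world")
print(round(score, 2), positive)  # 0.07 True
```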
Example: perceptron algorithm

Input: S = ((x1, y1), …, (xn, yn)), yi ∈ {−1, +1}

Algorithm:
  w0 = 0; k = 0
  REPEAT
    FOR i = 1 TO n
      IF yi (wk · xi) ≤ 0   // mistake: take a step
        wk+1 = wk + yi xi
        k = k + 1
      END IF
    END FOR
  UNTIL no mistakes are made

Output: wk
[Slide from Chris Bishop]
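A direct implementation of the pseudocode above, run on a made-up linearly separable toy set (the last coordinate of each point acts as a bias input):

```python
# Perceptron: on a mistake (y_i * w.x_i <= 0), update w <- w + y_i * x_i;
# stop once a full pass over the data makes no mistakes.
def perceptron(samples, epochs=100):
    w = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        mistakes = 0
        for x, y in samples:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:  # converged: all points on the correct side
            break
    return w

# toy linearly separable data; last coordinate = constant 1 (bias term)
S = [((1.0, 1.0, 1.0), 1), ((2.0, 0.5, 1.0), 1),
     ((-1.0, -1.0, 1.0), -1), ((-0.5, -2.0, 1.0), -1)]
w = perceptron(S)  # some separating w, here after a single update
```

On separable data the algorithm provably converges, but it finds *some* separator, not the maximum-margin one, which previews the motivation for SVMs below.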
Linear classifiers
• What is the major shortcoming of a perceptron?
• How to determine the dimensionality of the separator?
  – Bias-variance tradeoff (example)
• How to deal with multiple classes?
  – Any-of: build a binary classifier for each class
  – One-of: harder (J hyperplanes do not divide R^M into J regions); instead, use class complements and scoring
Support vector machines
• Introduced by Vapnik in the early 90s.
Issues with SVM
• Soft margins (inseparability)
• Kernels – non-linearity
The kernel idea
[Figure: data before and after mapping to a higher-dimensional space]
Example: a mapping from R² to R³ (a higher-dimensional space):

(x1, x2) → (z1, z2, z3) = (x1², √2·x1x2, x2²)
The kernel trick
Φ(x)ᵀΦ(x') = (x1², √2·x1x2, x2²)(x1'², √2·x1'x2', x2'²)ᵀ = (xᵀx')² = k(x, x')

Polynomial kernel: k(x, x') = (⟨x, x'⟩ + c)^d
Sigmoid kernel: k(x, x') = tanh(α⟨x, x'⟩ + c)
RBF kernel: k(x, x') = exp(−‖x − x'‖² / (2σ²))
Many other kernels are useful for IR:e.g., string kernels, subsequence kernels, tree kernels, etc.
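The identity Φ(x)ᵀΦ(x') = (xᵀx')² can be checked numerically: the kernel computes the same value as the explicit mapping, without ever constructing the 3-dimensional vectors.

```python
import math

# Kernel trick for the quadratic map Phi(x1, x2) = (x1^2, sqrt(2) x1 x2, x2^2):
# the dot product in feature space equals (x . x')^2 in input space.
def phi(x):
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, xp = (1.0, 2.0), (3.0, 0.5)
explicit = dot(phi(x), phi(xp))  # map to R^3, then take the dot product
kernel = dot(x, xp) ** 2         # stay in R^2: just square the dot product
print(kernel)  # 16.0, identical to the explicit computation
```

This is why kernelized SVMs scale: for, say, a degree-d polynomial kernel over a large vocabulary, the implicit feature space is enormous, yet each kernel evaluation costs only one inner product in the original space.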
SVM (Cont’d)
• Evaluation:
  – SVM > knn > decision tree > NB
• Implementation:
  – Quadratic optimization
  – Use a toolkit (e.g., Thorsten Joachims’s SVMlight)
Semi-supervised learning
• EM
• Co-training
• Graph-based
Exploiting Hyperlinks – Co-training
• Each document instance has two alternate views (Blum and Mitchell 1998)
  – terms in the document, x1
  – terms in the hyperlinks that point to the document, x2
• Each view is sufficient to determine the class of the instance
  – The labeling function that classifies examples is the same applied to x1 or x2
  – x1 and x2 are conditionally independent, given the class
[Slide from Pierre Baldi]
Co-training Algorithm
• Labeled data are used to infer two Naïve Bayes classifiers, one for each view
• Each classifier will
  – examine unlabeled data
  – pick the most confidently predicted positive and negative examples
  – add these to the labeled examples
• Classifiers are then retrained on the augmented set of labeled examples
[Slide from Pierre Baldi]
Conclusion
• SVMs are widely considered to be the best method for text classification (see papers by Sebastiani, Cristianini, Joachims), e.g., 86% accuracy on Reuters.
• NB also good in many circumstances
Readings
• MRS18
• MRS17, MRS19