Information Retrieval: Search Engine Technology (5&6)
http://tangra.si.umich.edu/clair/ir09
Prof. Dragomir R. Radev ([email protected])
Final projects
• Two formats:
  – A software system that performs a specific search-engine-related task. We will create a web page with all such code and make it available to the IR community.
  – A research experiment documented in the form of a paper. Look at the proceedings of the SIGIR, WWW, or ACL conferences for a sample format. I will encourage the authors of the most successful papers to consider submitting them to one of the IR-related conferences.
• Deliverables:
  – System (code + documentation + examples) or Paper (+ code, data)
  – Poster (to be presented in class)
  – Web page that describes the project
SET/IR – W/S 2009
9. Text classification: Naïve Bayesian classifiers, decision trees
Introduction
• Text classification: assigning documents to predefined categories (topics, languages, users)
• Given a set of classes C and an input x, determine the class of x in C
• Hierarchical vs. flat
• Overlapping (soft) vs. non-overlapping (hard)
Introduction
• Ideas: manual classification using rules, e.g.,
  – Columbia AND University → Education
  – Columbia AND “South Carolina” → Geography
• Popular techniques: generative (knn, Naïve Bayes) vs. discriminative (SVM, regression)
• Generative: model the joint probability p(x,y) and use Bayesian prediction to compute p(y|x)
• Discriminative: model p(y|x) directly.
Bayes formula
)(
)|()()|(
Ap
BApBpABp
Full probability
Example (performance enhancing drug)
• Drug(D) with values y/n• Test(T) with values +/-• P(D=y) = 0.001• P(T=+|D=y)=0.8• P(T=+|D=n)=0.01• Given: athlete tests positive• P(D=y|T=+)=
P(T=+|D=y)P(D=y) / (P(T=+|D=y)P(D=y)+P(T=+|D=n)P(D=n)=(0.8x0.001)/(0.8x0.001+0.01x0.999)=0.074
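The computation above can be sketched directly; all numbers come from the slide, and P(T=+) is obtained by full probability.

```python
# Bayes rule for the doping-test example:
# P(D=y | T=+) = P(T=+|D=y) P(D=y) / P(T=+)
p_d = 0.001             # prior: athlete uses the drug
p_pos_given_d = 0.8     # test sensitivity, P(T=+|D=y)
p_pos_given_not = 0.01  # false-positive rate, P(T=+|D=n)

# full probability: P(T=+) = P(T=+|D=y)P(D=y) + P(T=+|D=n)P(D=n)
p_pos = p_pos_given_d * p_d + p_pos_given_not * (1 - p_d)
posterior = p_pos_given_d * p_d / p_pos
print(round(posterior, 3))  # 0.074
```

Despite the 80% sensitivity, the posterior is only about 7%, because the prior P(D=y) is so small.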
Naïve Bayesian classifiers
• Naïve Bayesian classifier
• Assuming statistical independence
• Features = words (or phrases) typically
Bayes rule applied to a document d and features F1, …, Fk:

P(d∈C | F1, F2, …, Fk) = P(F1, F2, …, Fk | d∈C) P(d∈C) / P(F1, F2, …, Fk)

Assuming statistical independence of the features:

P(d∈C | F1, F2, …, Fk) = P(d∈C) ∏ⱼ₌₁..k P(Fj | d∈C) / ∏ⱼ₌₁..k P(Fj)
Example
• p(well) = 0.9, p(cold) = 0.05, p(allergy) = 0.05
  – p(sneeze|well) = 0.1    p(sneeze|cold) = 0.9    p(sneeze|allergy) = 0.9
  – p(cough|well) = 0.1     p(cough|cold) = 0.8     p(cough|allergy) = 0.7
  – p(fever|well) = 0.01    p(fever|cold) = 0.7     p(fever|allergy) = 0.4
Example from Ray Mooney
Example (cont’d)
• Features: sneeze, cough, no fever
• P(well|e) = (.9) × (.1)(.1)(.99) / p(e) = 0.0089/p(e)
• P(cold|e) = (.05) × (.9)(.8)(.3) / p(e) = 0.01/p(e)
• P(allergy|e) = (.05) × (.9)(.7)(.6) / p(e) = 0.019/p(e)
• p(e) = 0.0089 + 0.01 + 0.019 = 0.0379
• P(well|e) = .23
• P(cold|e) = .26
• P(allergy|e) = .50
Example from Ray Mooney
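The sneeze/cough/fever example can be computed as a small sketch; priors and conditionals are the slide's numbers, and absent symptoms contribute (1 − p).

```python
# Naive Bayes on the sneeze/cough/fever example: for each class compute the
# unnormalized score P(c) * prod_j P(F_j = f_j | c), then normalize by p(e).
priors = {"well": 0.9, "cold": 0.05, "allergy": 0.05}
cond = {  # P(symptom present | class)
    "well":    {"sneeze": 0.1, "cough": 0.1, "fever": 0.01},
    "cold":    {"sneeze": 0.9, "cough": 0.8, "fever": 0.7},
    "allergy": {"sneeze": 0.9, "cough": 0.7, "fever": 0.4},
}
evidence = {"sneeze": True, "cough": True, "fever": False}

scores = {}
for c, prior in priors.items():
    s = prior
    for feat, present in evidence.items():
        p = cond[c][feat]
        s *= p if present else (1 - p)  # P(no fever|c) = 1 - P(fever|c)
    scores[c] = s

z = sum(scores.values())  # p(e), by full probability
posterior = {c: s / z for c, s in scores.items()}
# allergy gets the highest posterior (~0.49 without intermediate rounding)
```

The slide's 0.23/0.26/0.50 come from rounding the numerators first; the exact posteriors are about 0.23/0.28/0.49, with the same ranking.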
Issues with NB
• Where do we get the values P(Cᵢ)? Use maximum likelihood estimation: P(Cᵢ) = Nᵢ/N
• Same for the conditionals: these are based on a multinomial generator, and the MLE estimator is P(Fⱼ|Cᵢ) = Tⱼᵢ / Σⱼ′ Tⱼ′ᵢ
• Smoothing is needed – why? (a feature never observed with a class would get probability 0 and zero out the whole product)
• Laplace smoothing: P(Fⱼ|Cᵢ) = (Tⱼᵢ + 1) / (Σⱼ′ Tⱼ′ᵢ + |V|)
• Implementation: how to avoid floating point underflow? (sum log probabilities instead of multiplying probabilities)
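Both fixes (Laplace smoothing and log-space scoring) can be sketched together. The term counts, vocabulary, and priors below are made-up toy numbers, not from the slides.

```python
import math

# Laplace-smoothed multinomial NB with log-space scoring.
# T[c][t] = count of term t in training documents of class c (toy counts).
T = {"spam": {"funds": 10, "transfer": 7, "meeting": 0},
     "ham":  {"funds": 1,  "transfer": 1, "meeting": 12}}
vocab = ["funds", "transfer", "meeting"]
prior = {"spam": 0.4, "ham": 0.6}

def log_score(c, doc):
    total = sum(T[c][t] for t in vocab)
    s = math.log(prior[c])
    for t in doc:
        # (T_jc + 1) / (sum_j' T_j'c + |V|): no term gets probability 0,
        # and summing logs avoids underflow on long documents
        s += math.log((T[c][t] + 1) / (total + len(vocab)))
    return s

doc = ["funds", "transfer", "funds"]
best = max(prior, key=lambda c: log_score(c, doc))
print(best)  # spam
```

Note that "meeting" has count 0 in the spam class: without the +1 smoothing, any document containing it would get spam probability exactly 0.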
Spam recognition

Return-Path: <[email protected]>
X-Sieve: CMU Sieve 2.2
From: "Ibrahim Galadima" <[email protected]>
Reply-To: [email protected]
To: [email protected]
Date: Tue, 14 Jan 2003 21:06:26 -0800
Subject: Gooday
DEAR SIR
FUNDS FOR INVESTMENTS
THIS LETTER MAY COME TO YOU AS A SURPRISE SINCE I HAD NO PREVIOUS CORRESPONDENCE WITH YOU

I AM THE CHAIRMAN TENDER BOARD OF INDEPENDENT NATIONAL ELECTORAL COMMISSION INEC I GOT YOUR CONTACT IN THE COURSE OF MY SEARCH FOR A RELIABLE PERSON WITH WHOM TO HANDLE A VERY CONFIDENTIAL TRANSACTION INVOLVING THE ! TRANSFER OF FUND VALUED AT TWENTY ONE MILLION SIX HUNDRED THOUSAND UNITED STATES DOLLARS US$20M TO A SAFE FOREIGN ACCOUNT

THE ABOVE FUND IN QUESTION IS NOT CONNECTED WITH ARMS, DRUGS OR MONEY LAUNDERING IT IS A PRODUCT OF OVER INVOICED CONTRACT AWARDED IN 1999 BY INEC TO A
SpamAssassin
• http://spamassassin.apache.org/
• http://spamassassin.apache.org/tests_3_1_x.html
Feature selection: The χ² test
• For a term t:
• C = class, Iₜ = indicator for term t
• Testing for independence: P(C=0, Iₜ=0) should equal P(C=0) P(Iₜ=0)
  – P(C=0) = (k00 + k01)/n
  – P(C=1) = 1 − P(C=0) = (k10 + k11)/n
  – P(Iₜ=0) = (k00 + k10)/n
  – P(Iₜ=1) = 1 − P(Iₜ=0) = (k01 + k11)/n

        Iₜ=0   Iₜ=1
C=0     k00    k01
C=1     k10    k11
Feature selection: The χ² test

• High values of χ² indicate lower belief in independence.
• In practice, compute χ² for all words and pick the top k among them.
Χ² = n (k11 k00 − k10 k01)² / ((k11 + k10)(k01 + k00)(k11 + k01)(k10 + k00))
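The χ² statistic for one term can be computed directly from its 2×2 contingency table; the counts below are made-up toy numbers for a term that occurs mostly in class-1 documents.

```python
# chi-square statistic for a 2x2 class/feature contingency table:
# X^2 = n (k11 k00 - k10 k01)^2 / ((k11+k10)(k01+k00)(k11+k01)(k10+k00))
def chi2(k00, k01, k10, k11):
    n = k00 + k01 + k10 + k11
    num = n * (k11 * k00 - k10 * k01) ** 2
    den = (k11 + k10) * (k01 + k00) * (k11 + k01) * (k10 + k00)
    return num / den

# toy counts: the term is strongly associated with class 1
print(chi2(k00=80, k01=20, k10=10, k11=90))  # high value: reject independence
# a term whose presence is independent of the class scores 0
print(chi2(k00=50, k01=50, k10=50, k11=50))
```

Ranking the vocabulary by this value and keeping the top k terms is the feature-selection recipe from the slide.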
Feature selection: mutual information
• No document length scaling is needed
• Documents are assumed to be generated according to the multinomial model
• Measures amount of information: if the distribution is the same as the background distribution, then MI=0
• X = word; Y = class
MI(X;Y) = Σₓ Σᵧ P(x,y) log [ P(x,y) / (P(x) P(y)) ]
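A small sketch of the MI formula, using two made-up joint distributions: one where word and class are independent (MI = 0, as the slide notes) and one where they are perfectly dependent.

```python
import math

# MI(X;Y) = sum_x sum_y P(x,y) * log[ P(x,y) / (P(x)P(y)) ], in bits
def mutual_information(joint):
    # joint: dict mapping (x, y) -> P(x, y); marginals are derived from it
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    mi = 0.0
    for (x, y), p in joint.items():
        if p > 0:
            mi += p * math.log2(p / (px[x] * py[y]))
    return mi

# word independent of class: every P(x,y) equals P(x)P(y), so MI = 0
indep = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
# word perfectly predicts class: MI = 1 bit
dep = {(0, 0): 0.5, (1, 1): 0.5}
print(mutual_information(indep), mutual_information(dep))  # 0.0 1.0
```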
Well-known datasets
• 20 Newsgroups
  – http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/
• Reuters-21578
  – http://www.daviddlewis.com/resources/testcollections/reuters21578/
  – Cats: grain, acquisitions, corn, crude, wheat, trade…
• WebKB
  – http://www-2.cs.cmu.edu/~webkb/
  – course, student, faculty, staff, project, dept, other
  – NB performance (2000): P = 26, 43, 18, 6, 13, 2, 94; R = 83, 75, 77, 9, 73, 100, 35
Evaluation of text classification
• Microaveraging – uses the pooled contingency table across all classes
• Macroaveraging – averages the per-class measures
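The two averages can disagree noticeably when classes are unbalanced. A sketch with made-up per-class true-positive/false-positive counts:

```python
# Micro- vs macro-averaged precision from per-class counts (toy numbers).
# Macro: average the per-class precisions; micro: pool TP and FP first.
classes = {"grain": {"tp": 20, "fp": 10},   # common class, decent precision
           "crude": {"tp": 1,  "fp": 9}}    # rare class, poor precision

macro = sum(c["tp"] / (c["tp"] + c["fp"]) for c in classes.values()) / len(classes)

tp = sum(c["tp"] for c in classes.values())
fp = sum(c["fp"] for c in classes.values())
micro = tp / (tp + fp)

print(round(macro, 3), round(micro, 3))  # macro punishes the rare class more
```

Microaveraging is dominated by the large classes; macroaveraging weights every class equally, so the rare class's poor precision drags it down.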
Vector space classification
[Figure: documents from topic1 and topic2 plotted in a two-dimensional space (x1, x2)]
Decision surfaces
[Figure: a decision surface separating topic1 from topic2 in (x1, x2) space]
Decision trees
[Figure: axis-parallel decision-tree regions separating topic1 from topic2 in (x1, x2) space]
Classification using decision trees
• Expected information needed to classify a sample:
• I(s1, s2, …, sm) = − Σᵢ pᵢ log₂(pᵢ), where pᵢ = sᵢ/s
• s = data samples; sᵢ = samples in class i
• m = number of classes
RID Age Income student credit buys?
1 <= 30 High No Fair No
2 <= 30 High No Excellent No
3 31 .. 40 High No Fair Yes
4 > 40 Medium No Fair Yes
5 > 40 Low Yes Fair Yes
6 > 40 Low Yes Excellent No
7 31 .. 40 Low Yes Excellent Yes
8 <= 30 Medium No Fair No
9 <= 30 Low Yes Fair Yes
10 > 40 Medium Yes Fair Yes
11 <= 30 Medium Yes Excellent Yes
12 31 .. 40 Medium No Excellent Yes
13 31 .. 40 High Yes Fair Yes
14 > 40 Medium No Excellent No
Decision tree induction
• I(s1, s2) = I(9, 5) = −9/14 log₂(9/14) − 5/14 log₂(5/14) = 0.940
Entropy and information gain
• E(A) = Σⱼ [(s1j + … + smj)/s] · I(s1j, …, smj)

Entropy = expected information based on the partitioning into subsets by A

Gain(A) = I(s1, s2, …, sm) − E(A)
Entropy
• Age <= 30: s11 = 2, s21 = 3, I(s11, s21) = 0.971
• Age in 31..40: s12 = 4, s22 = 0, I(s12, s22) = 0
• Age > 40: s13 = 3, s23 = 2, I(s13, s23) = 0.971
Entropy (cont’d)
• E(age) = 5/14 I(s11, s21) + 4/14 I(s12, s22) + 5/14 I(s13, s23) = 0.694
• Gain (age) = I (s1,s2) – E(age) = 0.246
• Gain (income) = 0.029, Gain (student) = 0.151, Gain (credit) = 0.048
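The gains above can be recomputed from the table; the code below transcribes the 14 rows and implements the I/E/Gain definitions from the slides.

```python
import math

# rows of the buys? table: (age, income, student, credit, buys)
rows = [
    ("<=30", "High", "No", "Fair", "No"),    ("<=30", "High", "No", "Excellent", "No"),
    ("31..40", "High", "No", "Fair", "Yes"), (">40", "Medium", "No", "Fair", "Yes"),
    (">40", "Low", "Yes", "Fair", "Yes"),    (">40", "Low", "Yes", "Excellent", "No"),
    ("31..40", "Low", "Yes", "Excellent", "Yes"), ("<=30", "Medium", "No", "Fair", "No"),
    ("<=30", "Low", "Yes", "Fair", "Yes"),   (">40", "Medium", "Yes", "Fair", "Yes"),
    ("<=30", "Medium", "Yes", "Excellent", "Yes"), ("31..40", "Medium", "No", "Excellent", "Yes"),
    ("31..40", "High", "Yes", "Fair", "Yes"), (">40", "Medium", "No", "Excellent", "No"),
]

def entropy(labels):
    # I(s1, s2) = -sum p_i log2 p_i over the Yes/No counts
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count("Yes"), labels.count("No")) if c)

def gain(attr_idx):
    # Gain(A) = I(s1, s2) - E(A), E(A) weighting each subset by its size
    labels = [r[-1] for r in rows]
    e = 0.0
    for v in set(r[attr_idx] for r in rows):
        subset = [r[-1] for r in rows if r[attr_idx] == v]
        e += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - e

print([round(gain(i), 3) for i in range(4)])  # age, income, student, credit
```

Age has the largest gain (≈ 0.247 exactly; 0.246 on the slide from rounded intermediates), so it becomes the root of the tree.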
Final decision tree

age?
├─ <= 30 → student?
│    ├─ no → no
│    └─ yes → yes
├─ 31..40 → yes
└─ > 40 → credit?
     ├─ excellent → no
     └─ fair → yes
Other techniques
• Bayesian classifiers
• X: age <=30, income = medium, student = yes, credit = fair
• P(yes) = 9/14 = 0.643
• P(no) = 5/14 = 0.357
Example
• P(age <= 30 | yes) = 2/9 = 0.222
  P(age <= 30 | no) = 3/5 = 0.600
  P(income = medium | yes) = 4/9 = 0.444
  P(income = medium | no) = 2/5 = 0.400
  P(student = yes | yes) = 6/9 = 0.667
  P(student = yes | no) = 1/5 = 0.200
  P(credit = fair | yes) = 6/9 = 0.667
  P(credit = fair | no) = 2/5 = 0.400
Example (cont’d)
• P(X|yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
• P(X|no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
• P(X|yes) P(yes) = 0.044 × 0.643 = 0.028
• P(X|no) P(no) = 0.019 × 0.357 = 0.007
• Answer: yes, since 0.028 > 0.007
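The decision can be checked with the slide's numbers; since both hypotheses share the same P(X), comparing the unnormalized products is enough.

```python
# Naive Bayes decision for X = (age<=30, income=medium, student=yes, credit=fair),
# using the conditional estimates computed from the table above.
p_x_yes = 0.222 * 0.444 * 0.667 * 0.667   # P(X|yes)
p_x_no = 0.600 * 0.400 * 0.200 * 0.400    # P(X|no)
score_yes = p_x_yes * 9 / 14              # P(X|yes) P(yes)
score_no = p_x_no * 5 / 14                # P(X|no) P(no)
print("yes" if score_yes > score_no else "no")  # yes
```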
SET/IR – W/S 2009
10. Linear classifiers, kernel methods, support vector machines
Linear boundary
[Figure: a linear boundary between topic1 and topic2 in (x1, x2) space]
Vector space classifiers
• Using centroids
• Boundary = line that is equidistant from two centroids
Generative models: knn
• Assign each element to the closest cluster
• K-nearest neighbors
• Very easy to program
• Tessellation; nonlinearity
• Issues: choosing k, b?
• Demo: http://www-2.cs.cmu.edu/~zhuxj/courseproject/knndemo/KNN.html

score(c, dq) = b + Σ_{d ∈ kNN(dq), d ∈ c} s(dq, d)
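A minimal sketch of this scoring rule, with b = 0, cosine similarity as s, and made-up two-dimensional training vectors:

```python
import math

# kNN scoring: score(c, d_q) = b + sum of similarities of the k nearest
# neighbors of d_q that belong to class c; predict the argmax class.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def knn_classify(query, labeled, k=3, b=0.0):
    # labeled: list of (vector, class); keep the k most similar training docs
    neighbors = sorted(labeled, key=lambda dv: cosine(query, dv[0]), reverse=True)[:k]
    scores = {}
    for vec, cls in neighbors:
        scores[cls] = scores.get(cls, b) + cosine(query, vec)
    return max(scores, key=scores.get)

train = [((1.0, 0.1), "topic1"), ((0.9, 0.2), "topic1"), ((0.1, 1.0), "topic2")]
print(knn_classify((1.0, 0.0), train))  # topic1
```

Note there is no training phase at all: the "model" is the labeled collection itself, which is why the method is so easy to program.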
Linear separators
• Two-dimensional line: w1x1 + w2x2 = b is the linear separator
  w1x1 + w2x2 > b for the positive class
• In n-dimensional spaces: wᵀx = b
Example 1
[Figure: a separating hyperplane with weight vector w between topic1 and topic2 in (x1, x2) space]
Example 2
• Classifier for “interest” in Reuters-21578
• b = 0
• If the document is “rate discount dlrs world”, its score will be
  0.67·1 + 0.46·1 + (−0.71)·1 + (−0.35)·1 = 0.07 > 0
Example from MSR
  wi   xi            wi    xi
0.70   prime       −0.71   dlrs
0.67   rate        −0.35   world
0.63   interest    −0.33   sees
0.60   rates       −0.25   year
0.46   discount    −0.24   group
0.43   bundesbank  −0.24   dlr
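Scoring with this weight vector is just a sum over the document's tokens; the weights below are the ones in the table, with terms outside the table contributing 0.

```python
# Scoring a document with the "interest" linear classifier (b = 0):
# score(d) = sum of w_i for each token x_i of d, classify positive if > b.
weights = {"prime": 0.70, "rate": 0.67, "interest": 0.63, "rates": 0.60,
           "discount": 0.46, "bundesbank": 0.43,
           "dlrs": -0.71, "world": -0.35, "sees": -0.33,
           "year": -0.25, "group": -0.24, "dlr": -0.24}
b = 0.0

def classify(doc):
    score = sum(weights.get(tok, 0.0) for tok in doc.split())
    return score, score > b

score, positive = classify("rate discount dlrs world")
print(round(score, 2), positive)  # 0.07 True
```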
Example: perceptron algorithm

Input: S = ((x1, y1), …, (xn, yn)), yi ∈ {−1, +1}

Algorithm:
  w0 = 0; k = 0
  REPEAT
    FOR i = 1 TO n
      IF yi (wk · xi) ≤ 0   // mistake: take a step
        wk+1 = wk + yi xi
        k = k + 1
      END IF
    END FOR
  UNTIL no mistakes are made

Output: wk
[Slide from Chris Bishop]
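A direct implementation of the pseudocode above, run on a made-up linearly separable toy set (the last coordinate of each point acts as a bias input):

```python
# Perceptron: on a mistake (y_i * w.x_i <= 0), update w <- w + y_i * x_i;
# stop once a full pass over the data makes no mistakes.
def perceptron(samples, epochs=100):
    w = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        mistakes = 0
        for x, y in samples:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:  # converged: all points on the correct side
            break
    return w

# toy linearly separable data; last coordinate = constant 1 (bias term)
S = [((1.0, 1.0, 1.0), 1), ((2.0, 0.5, 1.0), 1),
     ((-1.0, -1.0, 1.0), -1), ((-0.5, -2.0, 1.0), -1)]
w = perceptron(S)  # some separating w, here after a single update
```

On separable data the algorithm provably converges, but it finds *some* separator, not the maximum-margin one, which previews the motivation for SVMs below.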
Linear classifiers
• What is the major shortcoming of a perceptron?
• How to determine the dimensionality of the separator?
  – Bias-variance tradeoff (example)
• How to deal with multiple classes?
  – Any-of: build a binary classifier for each class
  – One-of: harder (J hyperplanes do not divide R^M into J regions); instead, use class complements and scoring
Support vector machines
• Introduced by Vapnik in the early 90s.
Issues with SVM
• Soft margins (inseparability)
• Kernels – non-linearity
The kernel idea
[Figure: data before and after mapping to a higher-dimensional space]
Example: a mapping from R² to R³ (a higher-dimensional space):

(x1, x2) → (z1, z2, z3) = (x1², √2·x1x2, x2²)
The kernel trick
Φ(x)ᵀΦ(x') = (x1², √2·x1x2, x2²)(x1'², √2·x1'x2', x2'²)ᵀ = (xᵀx')² = k(x, x')

Polynomial kernel: k(x, x') = (⟨x, x'⟩ + c)^d
Sigmoid kernel: k(x, x') = tanh(α⟨x, x'⟩ + c)
RBF kernel: k(x, x') = exp(−‖x − x'‖² / (2σ²))
Many other kernels are useful for IR:e.g., string kernels, subsequence kernels, tree kernels, etc.
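The identity Φ(x)ᵀΦ(x') = (xᵀx')² can be checked numerically: the kernel computes the same value as the explicit mapping, without ever constructing the 3-dimensional vectors.

```python
import math

# Kernel trick for the quadratic map Phi(x1, x2) = (x1^2, sqrt(2) x1 x2, x2^2):
# the dot product in feature space equals (x . x')^2 in input space.
def phi(x):
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, xp = (1.0, 2.0), (3.0, 0.5)
explicit = dot(phi(x), phi(xp))  # map to R^3, then take the dot product
kernel = dot(x, xp) ** 2         # stay in R^2: just square the dot product
print(kernel)  # 16.0, identical to the explicit computation
```

This is why kernelized SVMs scale: for, say, a degree-d polynomial kernel over a large vocabulary, the implicit feature space is enormous, yet each kernel evaluation costs only one inner product in the original space.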
SVM (Cont’d)
• Evaluation:
  – SVM > knn > decision tree > NB
• Implementation:
  – Quadratic optimization
  – Use a toolkit (e.g., Thorsten Joachims’s SVMlight)
Semi-supervised learning
• EM
• Co-training
• Graph-based
Exploiting Hyperlinks – Co-training
• Each document instance has two alternate views (Blum and Mitchell 1998)
  – terms in the document, x1
  – terms in the hyperlinks that point to the document, x2
• Each view is sufficient to determine the class of the instance
  – The labeling function that classifies examples is the same applied to x1 or x2
  – x1 and x2 are conditionally independent, given the class
[Slide from Pierre Baldi]
Co-training Algorithm
• Labeled data are used to infer two Naïve Bayes classifiers, one for each view
• Each classifier will
  – examine unlabeled data
  – pick the most confidently predicted positive and negative examples
  – add these to the labeled examples
• Classifiers are then retrained on the augmented set of labeled examples
[Slide from Pierre Baldi]
Conclusion
• SVMs are widely considered to be the best method for text classification (see papers by Sebastiani, Cristianini, Joachims), e.g., 86% accuracy on Reuters.
• NB also good in many circumstances
Readings
• MRS18
• MRS17, MRS19