TRANSCRIPT
Naïve Bayes Classification
Debapriyo Majumdar Data Mining – Fall 2014
Indian Statistical Institute Kolkata
August 14, 2014
Bayes’ Theorem
§ Thomas Bayes (1701–1761)
§ Simple form of Bayes' Theorem, for two random variables C and X:

P(C | X) = P(X | C) × P(C) / P(X)

– P(C | X): posterior probability
– P(X | C): likelihood
– P(C): class prior probability
– P(X): predictor prior probability, or evidence
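As a quick numeric check of the theorem, consider a toy spam-filtering setting (all numbers below are made up for illustration): a 0.2 prior probability of spam, and a word that appears in 60% of spam messages and 5% of non-spam messages.

```python
# Bayes' theorem on a toy spam example (all numbers are illustrative assumptions).
p_spam = 0.2                 # P(C = spam): class prior
p_word_given_spam = 0.60     # P(X | C = spam): likelihood
p_word_given_ham = 0.05      # P(X | C = ham)

# P(X): predictor prior (evidence), by the law of total probability
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# P(C | X) = P(X | C) * P(C) / P(X)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(p_spam_given_word)  # 0.75
```

Observing the word raises the probability of spam from the prior 0.2 to a posterior of 0.12 / 0.16 = 0.75.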
Probability Model
§ Probability model: for a target class variable C that depends on the features X1, …, Xn:

P(C | X1, …, Xn) = P(C) × P(X1, …, Xn | C) / P(X1, …, Xn)

§ The values of the features are given, so the denominator is effectively constant
§ Goal: calculating probabilities for the possible values of C
§ We are interested in the numerator: P(C) × P(X1, …, Xn | C)
Probability Model
§ The conditional probability is equivalent to the joint probability, since P(A, B) = P(A) × P(B | A):

P(C) × P(X1, …, Xn | C) = P(C, X1, …, Xn)

§ Applying the chain rule for joint probability repeatedly:

P(C) × P(X1, …, Xn | C)
= P(C) × P(X1 | C) × P(X2, …, Xn | C, X1)
= P(C) × P(X1 | C) × P(X2 | C, X1) × P(X3, …, Xn | C, X1, X2)
…
= P(C) × P(X1 | C) × P(X2 | C, X1) × … × P(Xn | C, X1, …, Xn−1)
Strong Independence Assumption (Naïve)
§ Assume the features X1, …, Xn are conditionally independent given C
– Given C, occurrence of Xi does not influence the occurrence of Xj, for i ≠ j:

P(Xi | C, Xj) = P(Xi | C)

§ Similarly, P(Xi | C, Xj, …, Xk) = P(Xi | C)
§ Hence:

P(C) × P(X1 | C) × P(X2 | C, X1) × … × P(Xn | C, X1, …, Xn−1)
= P(C) × P(X1 | C) × P(X2 | C) × … × P(Xn | C)
Naïve Bayes Probability Model
P(C | X1, …, Xn) = P(C) × P(X1 | C) × P(X2 | C) × … × P(Xn | C) / P(X1, …, Xn)

– Left-hand side: class posterior probability
– Denominator: known values, effectively constant
Classifier based on Naïve Bayes
§ Decision rule: pick the hypothesis (value c of C) that has the highest probability
– Maximum A-Posteriori (MAP) decision rule:

argmax_c P(C = c) × ∏_{i=1}^{n} P(Xi = xi | C = c)

– P(C = c): approximated from relative frequencies in the training set
– P(Xi = xi | C = c): approximated from frequencies in the training set
– The values xi of the features are known for the new observation
Text Classification with Naïve Bayes
Example of Naïve Bayes. Reference: The IR Book by Raghavan et al., Chapter 6
The Text Classification Problem
§ Set of class labels / tags / categories: C
§ Training set: a set D of documents with labels, <d, c> ∈ D × C
– Example: a document and a class label, <Beijing joins the World Trade Organization, China>
§ Set of all terms: V
§ Given d ∈ D′, a set of new documents, the goal is to find a class of the document, c(d)
Multinomial Naïve Bayes Model
§ Probability of a document d being in class c:

P(c | d) ∝ P(c) × ∏_{k=1}^{n_d} P(t_k | c)

where P(t_k | c) is the probability of a term t_k occurring in a document of class c. Intuitively:
• P(t_k | c) ~ how much evidence t_k contributes to the class c
• P(c) ~ prior probability of a document being labeled with class c
Multinomial Naïve Bayes Model
§ The expression P(c) × ∏_{k=1}^{n_d} P(t_k | c) has many probabilities
§ Multiplying them may result in a floating point underflow
– Add the logarithms of the probabilities instead of multiplying the original probability values:

c_map = argmax_{c ∈ C} [ log P(c) + ∑_{k=1}^{n_d} log P(t_k | c) ]

– log P(t_k | c): term weight of t_k in class c
§ To do: estimating these probabilities
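The underflow problem is easy to reproduce: a direct product of a few hundred small probabilities collapses to 0.0 in double-precision floating point, while the equivalent sum of logarithms stays perfectly usable. A minimal illustration (the probability values are arbitrary):

```python
import math

probs = [1e-5] * 100          # 100 small term probabilities

product = 1.0
for p in probs:
    product *= p              # 1e-500 is below the smallest double: underflows to 0.0

log_sum = sum(math.log(p) for p in probs)  # stays finite: 100 * log(1e-5)

print(product)    # 0.0 after underflow
print(log_sum)    # about -1151.3
```

Since log is monotonic, the argmax over classes is unchanged by working in log space.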
Maximum Likelihood Estimate
§ Based on relative frequencies
§ Class prior probability:

P(c) = N_c / N

where N_c is the number of documents labeled as class c in the training data, and N is the total number of documents in the training data.
§ Estimate P(t | c) as the relative frequency of t occurring in documents labeled as class c:

P(t | c) = T_ct / ∑_{t′ ∈ V} T_ct′

where T_ct is the total number of occurrences of t in documents of class c, and the denominator is the total number of occurrences of all terms in documents of class c.
Handling Rare Events
§ What if a term t did not occur in documents belonging to class c in the training data?
– Quite common: terms are sparse in documents
§ Problem: P(t | c) becomes zero, so the whole expression becomes zero
§ Use add-one or Laplace smoothing:

P(t | c) = (T_ct + 1) / ∑_{t′ ∈ V} (T_ct′ + 1) = (T_ct + 1) / (∑_{t′ ∈ V} T_ct′ + |V|)
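The effect of add-one smoothing can be sketched with a few hypothetical term counts for one class: the raw maximum likelihood estimate gives an unseen term probability 0, which wipes out the whole product, while the smoothed estimate keeps it small but nonzero.

```python
# Hypothetical term counts T_ct for one class; "goa" was never seen in this class.
counts = {"indian": 5, "delhi": 1, "taj": 1, "mahal": 1, "goa": 0}
vocab_size = len(counts)      # |V|
total = sum(counts.values())  # sum over t' of T_ct'

def p_mle(term):
    # Raw relative frequency: T_ct / sum_t' T_ct'
    return counts[term] / total

def p_laplace(term):
    # Add-one smoothing: (T_ct + 1) / (sum_t' T_ct' + |V|)
    return (counts[term] + 1) / (total + vocab_size)

print(p_mle("goa"))      # 0.0, zeroes out any product it appears in
print(p_laplace("goa"))  # 1/13, small but nonzero
```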
Example
Training documents and their class labels:
– "Indian Delhi Indian" → India
– "Indian Indian Taj Mahal" → India
– "Indian Goa" → India
– "London Indian Embassy" → UK
Classify: "Indian Embassy London Indian Indian"
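The example above can be worked through in code. Assuming the intended grouping of the slide's training data is three India documents and one UK document, a multinomial Naïve Bayes with add-one smoothing classifies the test document as India:

```python
import math
from collections import Counter

# Training data as read off the slide (the document/label grouping is assumed).
train = [
    ("India", "Indian Delhi Indian"),
    ("India", "Indian Indian Taj Mahal"),
    ("India", "Indian Goa"),
    ("UK",    "London Indian Embassy"),
]
test_doc = "Indian Embassy London Indian Indian"

vocab = {t for _, d in train for t in d.split()}
classes = {c for c, _ in train}

def log_posterior(c, doc):
    docs_c = [d for cl, d in train if cl == c]
    prior = len(docs_c) / len(train)                  # P(c) = Nc / N
    term_counts = Counter(t for d in docs_c for t in d.split())
    total = sum(term_counts.values())                 # sum over t' of T_ct'
    score = math.log(prior)
    for t in doc.split():
        # Add-one smoothing: (T_ct + 1) / (total + |V|)
        score += math.log((term_counts[t] + 1) / (total + len(vocab)))
    return score

pred = max(classes, key=lambda c: log_posterior(c, test_doc))
print(pred)  # India
```

This matches the hand computation: P(India | d) ∝ 3/4 × (3/8)³ × (1/16)² ≈ 1.5 × 10⁻⁴, versus P(UK | d) ∝ 1/4 × (1/5)⁵ = 8 × 10⁻⁵.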
Bernoulli Naïve Bayes Model
§ Binary indicator of occurrence instead of frequency
§ Estimate P(t | c) as the fraction of documents in c containing the term t
§ Models absence of terms explicitly:

P(C | X1, …, Xn) ∝ P(C) × ∏_{i=1}^{n} [ Xi P(ti | C) + (1 − Xi)(1 − P(ti | C)) ]

where Xi = 1 if ti is present in the document and 0 otherwise; the factor (1 − Xi)(1 − P(ti | C)) models the absence of terms.
§ Question: what is the difference between Multinomial with frequencies truncated to 1, and Bernoulli Naïve Bayes?
Example
Training documents and their class labels:
– "Indian Delhi Indian" → India
– "Indian Indian Taj Mahal" → India
– "Indian Goa" → India
– "London Indian Embassy" → UK
Classify: "Indian Embassy London Indian Indian"
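On the same training data (same assumed grouping as before), the Bernoulli model gives a different answer. Here P(t | c) is the smoothed fraction of class-c documents containing t, taken as (N_ct + 1) / (N_c + 2), the usual add-one variant for a two-outcome event; absent vocabulary terms contribute the factor 1 − P(t | c):

```python
import math

train = [
    ("India", "Indian Delhi Indian"),
    ("India", "Indian Indian Taj Mahal"),
    ("India", "Indian Goa"),
    ("UK",    "London Indian Embassy"),
]
test_doc = "Indian Embassy London Indian Indian"

vocab = {t for _, d in train for t in d.split()}
classes = {c for c, _ in train}

def log_posterior(c, doc):
    docs_c = [set(d.split()) for cl, d in train if cl == c]
    prior = len(docs_c) / len(train)
    present = set(doc.split())          # only presence matters, not frequency
    score = math.log(prior)
    for t in vocab:                     # every vocabulary term contributes
        # Smoothed document fraction: (#docs in c containing t + 1) / (Nc + 2)
        p = (sum(t in d for d in docs_c) + 1) / (len(docs_c) + 2)
        score += math.log(p if t in present else 1 - p)
    return score

pred = max(classes, key=lambda c: log_posterior(c, test_doc))
print(pred)  # UK
```

The repeated occurrences of "Indian" in the test document are ignored, and the four absent terms (Delhi, Taj, Mahal, Goa) count against the India class, so the Bernoulli model flips the decision to UK: P(UK | d) ∝ 1/4 × (2/3)⁷ ≈ 0.0146 versus P(India | d) ∝ 3/4 × 4/5 × 2/5 × 1/5 × (3/5)⁴ ≈ 0.0062.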
Naïve Bayes as a Generative Model
§ The probability model:

P(c | d) = P(c) × P(d | c) / P(d)

§ Multinomial model:

P(d | c) = P(t1, …, t_{n_d} | c)

over the terms as they occur in d (excluding other terms), where Xi is the random variable for position i in the document
– Takes values as terms of the vocabulary
[Illustration: class UK generates X1 = London, X2 = India, X3 = Embassy]
§ Positional independence assumption → bag of words model:

P(X_{k1} = t | c) = P(X_{k2} = t | c)
Naïve Bayes as a Generative Model
§ The probability model:

P(c | d) = P(c) × P(d | c) / P(d)

§ Bernoulli model:

P(d | c) = P(e1, …, e_{|V|} | c)

over all terms in the vocabulary
– P(Ui = 1 | c) is the probability that term ti will occur in any position in a document of class c
[Illustration: class UK generates U_London = 1, U_India = 1, U_Embassy = 1, U_TajMahal = 0, U_Delhi = 0, U_Goa = 0]
Multinomial vs Bernoulli

| | Multinomial | Bernoulli |
| --- | --- | --- |
| Event model | Generation of token | Generation of document |
| Multiple occurrences | Matter | Do not matter |
| Length of documents | Better for larger documents | Better for shorter documents |
| #Features | Can handle more | Works best with fewer |
On Naïve Bayes
§ Text classification
– Spam filtering (email classification) [Sahami et al. 1998]
– Adopted by many commercial spam filters: SpamAssassin, SpamBayes, CRM114, …
§ Simple: the conditional independence assumption is very strong (naïve)
– Naïve Bayes may not estimate the probabilities accurately in many cases, but it often ends up classifying correctly
– It is difficult to understand the dependencies between features in real problems