1 categorization/classification given: –a description of an instance, x x, where x is the...
TRANSCRIPT
![Page 1: 1 Categorization/Classification Given: –A description of an instance, x X, where X is the instance language or instance space. –A fixed set of classes:](https://reader035.vdocuments.mx/reader035/viewer/2022062518/56649ddf5503460f94ad8fe0/html5/thumbnails/1.jpg)
Categorization/Classification
• Given:
– A description of an instance, x ∈ X, where X is the instance language or instance space.
– A fixed set of classes: C = {c1, c2, …, cJ}
• Determine:
– The category of x: c(x) ∈ C, where c(x) is a classification function whose domain is X and whose range is C.
• We want to know how to build classification functions (“classifiers”).
More Text Classification Examples
Many search engine functionalities use classification.
Assign labels to each document or web page:
• Labels are most often topics such as Yahoo categories
e.g., "finance," "sports," "news>world>asia>business"
• Labels may be opinion on a person/product
e.g., “like”, “hate”, “neutral”
• Labels may be domain-specific
e.g., “interesting-to-me” : “not-interesting-to-me”
e.g., “contains adult language” : “doesn’t”
Classification Methods
• Manual classification
– Used by Yahoo! (originally)
– Very accurate when job is done by experts
– Consistent when the problem size and team are small
– Difficult and expensive to scale
• Means we need automatic classification methods for big problems
Classification Methods
• Supervised learning of a document-label assignment function
– Many systems partly rely on machine learning (MSN, Verity, Yahoo!, …)
• k-Nearest Neighbors (simple, powerful)
• Naive Bayes (simple, common method)
• … plus many other methods
• No free lunch: requires hand-classified training data
• Note that many commercial systems use a mixture of methods
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
– The set of tuples used for model construction is the training set
– The model is represented as classification rules, decision trees, or mathematical formulae
Classification—A Two-Step Process
• Model usage: for classifying future or unknown objects
– Estimate accuracy of the model
• The known label of each test sample is compared with the classified result from the model
• Accuracy rate is the percentage of test set samples that are correctly classified by the model
• The test set must be independent of the training set; otherwise the accuracy estimate is optimistically biased and over-fitting goes undetected
– If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
Process (1): Model Construction
Training Data → Classification Algorithms → Classifier (Model)
Training Data:
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
Classifier (Model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Process (2): Using the Model in Prediction
Testing Data → Classifier:
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
Unseen Data → Classifier:
(Jeff, Professor, 4)
Tenured?
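The two process slides can be sketched in a few lines: apply the rule learned from the training data to the testing data, measure accuracy, then classify the unseen tuple. This is a minimal illustration; the function and variable names are assumptions, only the rule and the rows come from the slides.

```python
# Rule from the model-construction slide:
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
def predict_tenured(rank, years):
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

# Testing data from the slide: (name, rank, years, known label).
test_set = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

def accuracy(rows):
    # Fraction of test samples whose predicted label matches the known label.
    correct = sum(predict_tenured(rank, years) == label
                  for _, rank, years, label in rows)
    return correct / len(rows)

test_accuracy = accuracy(test_set)           # Merlisa is the one miss
jeff = predict_tenured("Professor", 4)       # the unseen tuple
print(test_accuracy, jeff)
```

Whether 3 out of 4 correct counts as "acceptable" is a judgment the slides leave open.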
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
– Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
– New data is classified based on the training set
• Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
The goal of the course
• Study supervised learning, especially for text and hypertext documents
• Text
– Has a very large number of potential features, of which many are irrelevant.
• If the vector space model is used, each term is a potential feature.
– The number of distinct class labels is much larger than in structured learning scenarios.
Topics included in the course
• Evaluating text classifiers
• Classifiers
– NN learners
– Bayesian learners
– Hypertext classification
• Feature selection methods
Evaluating text classifiers
• Accuracy
– The ability to predict the correct class labels
– This is based on comparing the classifier-assigned labels with human-assigned labels
• Speed
– time to construct the model (training time)
– time to use the model (classification/prediction time)
• Simplicity, speed, and scalability for document insertion, deletion and modification
• Scalability: efficiency in disk-resident databases
• Interpretability
– understanding and insight provided by the model
Benchmarks
• Reuters
– Labeled documents: 10700
– Number of terms: 30000
– Number of categories: 135
• 20NG
– Labeled documents: 18800
– Number of terms: 94000
– Number of categories: 20
• WebKB
– Labeled documents: 8300
– Number of categories: 7
Measures of accuracy
• Each document is associated with a subset of classes
– To avoid searching over the power set of class labels, many systems create a two-class problem for every class
• Two-way ensemble or one-vs.-rest technique
– Ensemble classifiers are evaluated on the basis of recall and precision
Classifier Accuracy Measures
(rows: true class; columns: guessed class)
 (guess) ~C1 (guess) C1
(true) ~C1 True negative False positive
(true) C1 False negative True positive
A combined measure: F
• Combined measure that assesses precision/recall tradeoff is F measure (weighted harmonic mean):
$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}$
• People usually use balanced F1 measure
– i.e., with β = 1 or α = ½ (β² = (1 − α)/α)
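As a minimal sketch, precision, recall and F can be computed from the confusion-matrix cells. The standard definitions P = TP/(TP + FP) and R = TP/(TP + FN) are assumed here because the slides do not spell them out, and the counts in the usage lines are hypothetical.

```python
def precision_recall(tp, fp, fn):
    # Precision: fraction of guessed-C1 documents that are truly C1.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of truly-C1 documents that were guessed C1.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def f_measure(precision, recall, beta=1.0):
    # F = (beta^2 + 1) P R / (beta^2 P + R); beta = 1 gives the balanced F1.
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

# Hypothetical counts, for illustration only.
p, r = precision_recall(tp=40, fp=10, fn=40)
f1 = f_measure(p, r)
print(p, r, round(f1, 4))
```

Values of beta above 1 weight recall more heavily; values below 1 weight precision more heavily.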
F: Example
• precision?
• recall?
• F1?
F: Why harmonic mean?
• The simple (arithmetic) mean is 50% for “return-everything” search engine, which is too high.
• Desideratum: Punish really bad performance on either precision or recall.
– Taking the minimum achieves this.
– But minimum is not smooth and hard to weight.
– F (harmonic mean) is a kind of smooth minimum.
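A quick numeric check of the argument, assuming a hypothetical "return-everything" engine with recall 1.0 and a precision of 0.0001 (the precision value is made up for illustration):

```python
def arithmetic_mean(p, r):
    return (p + r) / 2

def harmonic_mean(p, r):
    # Balanced F1; defined as 0 when both inputs are 0.
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Hypothetical engine that returns every document: perfect recall, tiny precision.
precision, recall = 0.0001, 1.0

arith = arithmetic_mean(precision, recall)
harm = harmonic_mean(precision, recall)
low = min(precision, recall)
print(arith, harm, low)
```

The arithmetic mean stays near 0.5, while the harmonic mean lands close to the minimum, which is the "smooth minimum" behaviour the slide describes.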
Nearest Neighbor Learner
• Basic idea
– Similar documents are expected to be assigned the same class label.
• Vector space model and cosine measure for similarity let us formalize the idea.
The k-Nearest Neighbor Algorithm
• All instances correspond to points in the n-D space
• The nearest neighbors are defined in terms of cosine similarity
• k-NN returns the most common value among the k training examples nearest to xq
• Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples
[Figure: query point xq among training examples labeled + and −, with the 1-NN Voronoi decision surface]
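The k-NN idea can be sketched with sparse term-weight vectors and cosine similarity. The tiny training set, the term weights and all names below are hypothetical; this is an illustration under those assumptions, not a tuned implementation.

```python
import math
from collections import Counter

def cosine(u, v):
    # u, v: dicts mapping term -> weight (sparse document vectors).
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def knn_classify(query, training, k=3):
    # Keep the k training examples most similar to the query...
    neighbors = sorted(training, key=lambda ex: cosine(query, ex[0]), reverse=True)[:k]
    # ...and return the most common label among them.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical labeled documents.
training = [
    ({"goal": 2, "match": 1}, "sports"),
    ({"team": 1, "goal": 1}, "sports"),
    ({"stock": 2, "market": 1}, "finance"),
    ({"bank": 2, "market": 1}, "finance"),
]
query = {"goal": 1, "team": 1, "market": 1}
print(knn_classify(query, training, k=3))
```

Sorting the whole training set costs O(n log n) per query; that is fine for a sketch but an inverted index is the usual choice at scale.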
Discussion on the k-NN Algorithm
• Distance-weighted nearest neighbor algorithm
– Weight the contribution of each of the k neighbors according to their distance to the query xq
• Give greater weight to closer neighbors: $w \equiv \frac{1}{d(x_q, x_i)^2}$
• Robust to noisy data by averaging k-nearest neighbors
• Curse of dimensionality: distance between neighbors could be dominated by irrelevant attributes
– To overcome it, eliminate the least relevant attributes
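A small sketch of the distance-weighted vote, with each of the k neighbors contributing w = 1/d(xq, xi)². Euclidean distance and the three training points are assumptions made for the illustration; the slide does not fix the distance function d.

```python
import math
from collections import defaultdict

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def weighted_knn(query, training, k=3):
    nearest = sorted(training, key=lambda ex: euclidean(query, ex[0]))[:k]
    scores = defaultdict(float)
    for point, label in nearest:
        d = euclidean(query, point)
        if d == 0.0:
            return label              # exact match: its weight would be infinite
        scores[label] += 1.0 / d ** 2  # closer neighbors get greater weight
    return max(scores, key=scores.get)

# Hypothetical points: one close "+" and two distant "-" examples.
training = [((0.0, 0.0), "+"), ((3.0, 0.0), "-"), ((0.0, 3.0), "-")]
query = (0.5, 0.5)
print(weighted_knn(query, training, k=3))
```

With k = 3 an unweighted majority vote would pick "-" here (two votes to one); the weighting lets the single close neighbor win.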
Bayesian Methods
• Learning and classification methods based on probability theory.
• Bayes theorem plays a critical role in probabilistic learning and classification.
• Build a generative model that approximates how data is produced
• Uses prior probability of each category given no information about an item.
• Categorization produces a posterior probability distribution over the possible categories given a description of an item.
Bayes’ Rule
$P(C, D) = P(C \mid D)\,P(D) = P(D \mid C)\,P(C)$
$P(C \mid D) = \frac{P(D \mid C)\,P(C)}{P(D)}$
Naive Bayes Classifiers
Task: Classify a new instance D based on a tuple of attribute values $D = \langle x_1, x_2, \ldots, x_n \rangle$ into one of the classes $c_j \in C$
$c_{MAP} = \underset{c_j \in C}{\operatorname{argmax}}\; P(c_j \mid x_1, x_2, \ldots, x_n)$
$= \underset{c_j \in C}{\operatorname{argmax}}\; \frac{P(x_1, x_2, \ldots, x_n \mid c_j)\,P(c_j)}{P(x_1, x_2, \ldots, x_n)}$
$= \underset{c_j \in C}{\operatorname{argmax}}\; P(x_1, x_2, \ldots, x_n \mid c_j)\,P(c_j)$
Naïve Bayes Assumption
• P(cj)
– Can be estimated from the frequency of classes in the training examples.
• P(x1,x2,…,xn|cj)
– O(|X|^n • |C|) parameters
– Could only be estimated if a very, very large number of training examples was available.
Naïve Bayes Conditional Independence Assumption:
• Assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(xi|cj).
The Naïve Bayes Classifier
[Figure: class node Flu with arrows to five features X1, …, X5: fever, sinus, cough, runny nose, muscle-ache]
• Conditional Independence Assumption: features detect term presence and are independent of each other given the class:
$P(X_1, \ldots, X_5 \mid C) = P(X_1 \mid C) \cdot P(X_2 \mid C) \cdots P(X_5 \mid C)$
• This model is appropriate for binary variables
– Multivariate Bernoulli model
Learning the Model
[Figure: class node C with arrows to features X1, …, X6]
• First attempt: maximum likelihood estimates
– simply use the frequencies in the data
$\hat{P}(c_j) = \frac{N(C = c_j)}{N}$
$\hat{P}(x_i \mid c_j) = \frac{N(X_i = x_i, C = c_j)}{N(C = c_j)}$
Naïve Bayesian Classifier: Training Dataset
Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’
Data sample X = (age <=30, Income = medium, Student = yes, Credit_rating = Fair)
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
An Example
• P(Ci):
P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14 = 0.357
• Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
• X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(X|Ci):
P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci):
P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
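The hand computation above can be reproduced by counting over the 14 training rows. This is a minimal sketch using the unsmoothed maximum likelihood estimates from the slides; the function names and the ASCII spelling "31..40" of the age bucket are assumptions.

```python
from collections import Counter, defaultdict

# Training dataset from the slide: (age, income, student, credit_rating, buys_computer).
rows = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def train(rows):
    class_counts = Counter(row[-1] for row in rows)
    feature_counts = defaultdict(int)   # (attribute index, value, class) -> count
    for *features, label in rows:
        for i, value in enumerate(features):
            feature_counts[(i, value, label)] += 1
    return class_counts, feature_counts

def score(x, label, class_counts, feature_counts, n):
    # P(c) * product over attributes of P(x_i | c), all as raw frequencies.
    s = class_counts[label] / n
    for i, value in enumerate(x):
        s *= feature_counts[(i, value, label)] / class_counts[label]
    return s

class_counts, feature_counts = train(rows)
x = ("<=30", "medium", "yes", "fair")
scores = {c: score(x, c, class_counts, feature_counts, len(rows)) for c in class_counts}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```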
Problem with Max Likelihood
[Figure: class node Flu with arrows to five features X1, …, X5: fever, sinus, cough, runny nose, muscle-ache]
$P(X_1, \ldots, X_5 \mid C) = P(X_1 \mid C) \cdot P(X_2 \mid C) \cdots P(X_5 \mid C)$
• What if we have seen no training cases where patient had no flu and muscle aches?
$\hat{P}(X_5 = t \mid C = nf) = \frac{N(X_5 = t, C = nf)}{N(C = nf)} = 0$
• Zero probabilities cannot be conditioned away, no matter the other evidence!
Smoothing to Avoid Overfitting
• The estimate is 0 because of sparseness
– The training data are never large enough to represent the frequency of rare events adequately
• To eliminate zeros, we use add-one or Laplace smoothing
$\hat{P}(x_i \mid c_j) = \frac{N(X_i = x_i, C = c_j) + 1}{N(C = c_j) + k}$
where k = # of terms in the vocabulary
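A short sketch contrasting the maximum likelihood estimate with the add-one estimate. The per-word counts below are hypothetical, chosen so that they add up to the 8 tokens and 6-word vocabulary that appear in the multinomial example later in the deck.

```python
def mle(count_xc, count_c):
    # Plain relative frequency; 0 whenever the event was never observed.
    return count_xc / count_c

def laplace(count_xc, count_c, k):
    # Add-one smoothing: (N(x, c) + 1) / (N(c) + k).
    return (count_xc + 1) / (count_c + k)

# Hypothetical word counts for one class: 8 tokens over a 6-word vocabulary.
counts = {"Chinese": 5, "Beijing": 1, "Shanghai": 1, "Macao": 1, "Tokyo": 0, "Japan": 0}
n = sum(counts.values())
k = len(counts)

unsmoothed = {w: mle(c, n) for w, c in counts.items()}
smoothed = {w: laplace(c, n, k) for w, c in counts.items()}
print(unsmoothed["Tokyo"], smoothed["Tokyo"], sum(smoothed.values()))
```

The smoothed estimates still sum to 1 over the vocabulary, and no word keeps a probability of exactly zero.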
Underflow Prevention: log space
• Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.
• Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.
• Class with highest final un-normalized log probability score is still the most probable.
$c_{NB} = \underset{c_j \in C}{\operatorname{argmax}} \left[ \log P(c_j) + \sum_{i \in positions} \log P(x_i \mid c_j) \right]$
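The underflow problem is easy to reproduce with double-precision floats. In this sketch the document length (100 positions) and the per-position probabilities are hypothetical.

```python
import math

def raw_score(prior, likelihoods):
    # Direct product of probabilities; can underflow to 0.0.
    s = prior
    for p in likelihoods:
        s *= p
    return s

def log_score(prior, likelihoods):
    # Sum of logs; stays finite for the same inputs.
    return math.log(prior) + sum(math.log(p) for p in likelihoods)

# Hypothetical document with 100 positions under two classes.
class_a = [1e-5] * 100
class_b = [2e-5] * 100

raw_a, raw_b = raw_score(0.5, class_a), raw_score(0.5, class_b)
log_a, log_b = log_score(0.5, class_a), log_score(0.5, class_b)
print(raw_a, raw_b, log_a, log_b)
```

Both raw products collapse to 0.0, so the argmax is undefined, while the log scores still rank class_b above class_a.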
Two Models
• Model 1: Multivariate Bernoulli
– One feature Xw for each word in dictionary
– Xw = true in document d if w appears in d
– Naive Bayes assumption:
• Given the document’s topic, appearance of one word in the document tells us nothing about chances that another word appears
Example
Doc Chinese Beijing Shanghai Macao Tokyo Japan Class label
D1 1 1 0 0 0 0 yes
D2 1 0 1 0 0 0 yes
D3 1 0 0 1 0 0 yes
D4 1 0 0 0 1 1 no
D5 1 0 0 0 1 1 ?
Text classification example (multivariate Bernoulli model)
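The worked numbers for this slide did not survive in the transcript, so the following is only a sketch of how the multivariate Bernoulli model could be applied to the binary table of documents D1 to D5. It assumes add-one smoothing with k = 2 (a binary feature has two possible values); that choice and all function names are assumptions, not taken from the slides.

```python
vocab = ["Chinese", "Beijing", "Shanghai", "Macao", "Tokyo", "Japan"]

# Binary term-presence vectors D1..D4 from the example table, with class labels.
train_docs = [
    ([1, 1, 0, 0, 0, 0], "yes"),
    ([1, 0, 1, 0, 0, 0], "yes"),
    ([1, 0, 0, 1, 0, 0], "yes"),
    ([1, 0, 0, 0, 1, 1], "no"),
]
d5 = [1, 0, 0, 0, 1, 1]

def train_bernoulli(docs):
    priors, cond = {}, {}
    for c in {label for _, label in docs}:
        in_class = [x for x, label in docs if label == c]
        priors[c] = len(in_class) / len(docs)
        # P(X_w = true | c): smoothed fraction of class-c documents containing w.
        # Assumption: add-one smoothing with k = 2 for a binary feature.
        cond[c] = [(sum(x[i] for x in in_class) + 1) / (len(in_class) + 2)
                   for i in range(len(vocab))]
    return priors, cond

def score(x, c, priors, cond):
    s = priors[c]
    for present, p in zip(x, cond[c]):
        # Absent words contribute (1 - p): the Bernoulli model uses every feature.
        s *= p if present else (1.0 - p)
    return s

priors, cond = train_bernoulli(train_docs)
scores = {c: score(d5, c, priors, cond) for c in priors}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Under these assumptions D5 is assigned class "no", which is unsurprising given that its presence vector is identical to D4's.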
Two Models
• Model 2: Multinomial = Class conditional unigram
– One feature Xi for each word pos in document
• feature’s values are all words in dictionary
– Value of Xi is the word in position i
– Naïve Bayes assumption:
• Given the document’s topic, word in one position in the document tells us nothing about words in other positions
– Second assumption:
• Word appearance does not depend on position: $P(X_i = w \mid c) = P(X_j = w \mid c)$ for all positions i, j, word w, and class c
Parameter estimation
• Multivariate Bernoulli model:
$\hat{P}(X_w = t \mid c_j)$ = fraction of documents of topic cj in which word w appears
• Multinomial model:
$\hat{P}(X_i = w \mid c_j)$ = fraction of times in which word w appears across all documents of topic cj
– Can create a mega-document for topic j by concatenating all documents in this topic
– Use frequency of w in mega-document
Naïve Bayes: Learning
• From training corpus, extract Vocabulary
• Calculate required P(cj) and P(xk | cj) terms
– For each cj in C do
• docsj ← subset of documents for which the target class is cj
• $P(c_j) \leftarrow \frac{|docs_j|}{|\text{total number of documents}|}$
• Textj ← single document containing all docsj
• n ← total number of word occurrences in Textj
• for each word xk in Vocabulary
– nk ← number of occurrences of xk in Textj
– $P(x_k \mid c_j) \leftarrow \frac{n_k + 1}{n + |Vocabulary|}$
Naïve Bayes: Classifying
• positions ← all word positions in current document which contain tokens found in Vocabulary
• Return cNB, where
$c_{NB} = \underset{c_j \in C}{\operatorname{argmax}}\; P(c_j) \prod_{i \in positions} P(x_i \mid c_j)$
Text classification example (multinomial model)
• The counts here are token counts, not the binary presence values of the example table: “Chinese” accounts for 5 of the 8 tokens in class c and occurs 3 times in d5
• P(c) = 3/4, P(~c) = 1/4
• P(Chinese|c) = (5+1)/(8+6) = 3/7
• P(Tokyo|c) = P(Japan|c) = (0+1)/(8+6) = 1/14
• P(Chinese|~c) = (1+1)/(3+6) = 2/9
• P(Tokyo|~c) = P(Japan|~c) = (1+1)/(3+6) = 2/9
• d5
– c: (3/4) × (3/7)^3 × (1/14) × (1/14) ≈ 0.0003
– ~c: (1/4) × (2/9)^3 × (2/9) × (2/9) ≈ 0.0001
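The learning and classifying procedures can be sketched end to end on this example. The slides list only the resulting counts, so the token sequences below are an assumption chosen to reproduce them (8 tokens in class c, 5 of them "Chinese"; 3 tokens in ~c; "Chinese" three times in d5). The label "not_c" stands for ~c.

```python
import math
from collections import Counter

# Assumed token sequences, picked to match the counts used on the slide.
training = [
    ("Chinese Beijing Chinese".split(), "c"),
    ("Chinese Chinese Shanghai".split(), "c"),
    ("Chinese Macao".split(), "c"),
    ("Tokyo Japan Chinese".split(), "not_c"),
]
d5 = "Chinese Chinese Chinese Tokyo Japan".split()

def train_multinomial(docs):
    vocab = {w for tokens, _ in docs for w in tokens}
    priors, cond = {}, {}
    for c in {label for _, label in docs}:
        class_docs = [tokens for tokens, label in docs if label == c]
        priors[c] = len(class_docs) / len(docs)
        # "Mega-document": concatenate every document of the class and count words.
        counts = Counter(w for tokens in class_docs for w in tokens)
        n = sum(counts.values())
        cond[c] = {w: (counts[w] + 1) / (n + len(vocab)) for w in vocab}
    return vocab, priors, cond

def classify(tokens, vocab, priors, cond):
    # Only positions holding known vocabulary words take part in the score.
    positions = [w for w in tokens if w in vocab]
    log_scores = {c: math.log(priors[c]) + sum(math.log(cond[c][w]) for w in positions)
                  for c in priors}
    return max(log_scores, key=log_scores.get), log_scores

vocab, priors, cond = train_multinomial(training)
prediction, log_scores = classify(d5, vocab, priors, cond)
print(prediction, {c: math.exp(s) for c, s in log_scores.items()})
```

The scores are accumulated in log space and exponentiated only for display, so the printed values can be compared with the roughly 0.0003 and 0.0001 on the slide.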
Stochastic Language Models
• Models probability of generating strings (each word in turn) in the language (commonly all strings over ∑). E.g., unigram model
Model M
0.2 the
0.1 a
0.01 man
0.01 woman
0.03 said
0.02 likes
…
s = the man likes the woman
0.2 0.01 0.02 0.2 0.01 (multiply)
P(s | M) = 0.00000008
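The multiplication on this slide as a minimal sketch; the dictionary holds only the entries listed for Model M, and the function name is an assumption.

```python
import math

# Unigram probabilities listed on the slide for Model M (the "…" entries are omitted).
model_m = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01, "said": 0.03, "likes": 0.02}

def sentence_probability(sentence, model):
    # Unigram model: multiply the probability of each word in turn.
    # A word missing from the model raises KeyError in this sketch.
    return math.prod(model[w] for w in sentence.split())

p = sentence_probability("the man likes the woman", model_m)
print(p)
```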
Stochastic Language Models
• Model probability of generating any string
word Model M1 Model M2
the 0.2 0.2
class 0.01 0.0001
sayst 0.0001 0.03
pleaseth 0.0001 0.02
yon 0.0001 0.1
maiden 0.0005 0.01
woman 0.01 0.0001
s = the class pleaseth yon maiden
M1: 0.2 0.01 0.0001 0.0001 0.0005
M2: 0.2 0.0001 0.02 0.1 0.01
P(s|M2) > P(s|M1)
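Choosing between the two models for a string works the same way a Naive Bayes classifier chooses between classes. A small sketch using log probabilities; the function name and the dictionary layout are assumptions, the numbers are from the slide.

```python
import math

m1 = {"the": 0.2, "class": 0.01, "sayst": 0.0001, "pleaseth": 0.0001,
      "yon": 0.0001, "maiden": 0.0005, "woman": 0.01}
m2 = {"the": 0.2, "class": 0.0001, "sayst": 0.03, "pleaseth": 0.02,
      "yon": 0.1, "maiden": 0.01, "woman": 0.0001}

def log_prob(sentence, model):
    # Sum of log unigram probabilities; equivalent to the product on the slide.
    return sum(math.log(model[w]) for w in sentence.split())

s = "the class pleaseth yon maiden"
models = {"M1": m1, "M2": m2}
best = max(models, key=lambda name: log_prob(s, models[name]))
print(best, math.exp(log_prob(s, m1)), math.exp(log_prob(s, m2)))
```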
Unigram and higher-order models
• Chain rule: P(w1 w2 w3 w4) = P(w1) P(w2 | w1) P(w3 | w1 w2) P(w4 | w1 w2 w3)
• Unigram Language Models: P(w1) P(w2) P(w3) P(w4)
– Easy. Effective!
• Bigram (generally, n-gram) Language Models: P(w1) P(w2 | w1) P(w3 | w2) P(w4 | w3)
WebKB Experiment (1998)
• Classify webpages from CS departments into:
– student, faculty, course, project
• Train on ~5,000 hand-labeled web pages
– Cornell, Washington, U.Texas, Wisconsin
• Crawl and classify a new site (CMU)
• Results:
 Student Faculty Person Project Course Departmt
Extracted 180 66 246 99 28 1
Correct 130 28 194 72 25 1
Accuracy 72% 42% 79% 73% 89% 100%
NB Model Comparison: WebKB
Classification
• Multinomial vs Multivariate Bernoulli
• Multinomial model is almost always more effective in text applications!
Naive Bayes is Not So Naive
• Naïve Bayes: First and Second place in KDD-CUP 97 competition, among 16 (then) state of the art algorithms
– Goal: Financial services industry direct mail response prediction model: Predict if the recipient of mail will actually respond to the advertisement (750,000 records).
• Robust to Irrelevant Features
– Irrelevant features cancel each other without affecting results
– Decision Trees, by contrast, can suffer heavily from this.
• Very good in domains with many equally important features
– Decision Trees suffer from fragmentation in such cases, especially if there is little data
• A good dependable baseline for text classification (but not the best)!
• Optimal if the Independence Assumptions hold: If assumed independence is correct, then it is the Bayes Optimal Classifier for the problem
• Very Fast: Learning with one pass of counting over the data; testing linear in the number of attributes, and document collection size
• Low Storage requirements
Hypertext classification
• Search engines assign heuristic weights to terms that occur in specific HTML tags
• Paying special attention to tags can help with supervised learning as well
Hypertext classification
• It is important to distinguish between the two occurrences of the word “surfing”
– resume.publication.title.surfing
– resume.hobbies.item.surfing
• Relations provide a uniform way to codify hypertextual features.
– Ex: contains-text(resume.hobbies.item, wind-surfing)
– Ex: links-to(source, destination)
Rule Induction
• The outer loop learns new rules one at a time, removing positive examples covered by any rule generated thus far.
– When a new empty rule is initialized, its free variables can be bound in all possible ways
• The inner loop adds conjunctive literals to the new rule until no negative example is covered by the new rule.
– A heuristic is to pick a literal that rapidly increases the ratio of surviving positive to negative bindings.
Feature Selection: Why?
• Text collections have a large number of features
– 10,000 – 1,000,000 unique words … and more
• May make using a particular classifier feasible
– Some classifiers can’t deal with 100,000s of features
• Reduces training time
– Training time for some methods is quadratic or worse in the number of features
• Can improve generalization (performance)
– Eliminates noise features
– Avoids overfitting
Feature selection: how?
• An easy one
– Ignoring terms that are “too frequent” or “too rare” according to empirically chosen threshold.
• General idea:
– Hypothesis testing statistics:
• Are we confident that the value of one categorical variable is associated with the value of another?
• Chi-square test
χ² statistic (CHI)
• χ² is interested in (fo – fe)²/fe summed over all table entries: is the observed number what you’d expect given the marginals?
Observed fo, with expected fe in parentheses:
 Term = jaguar Term ≠ jaguar
Class = auto 2 (0.25) 500 (502)
Class ≠ auto 3 (4.75) 9500 (9498)
$\chi^2(j, a) = \sum (O - E)^2 / E = (2 - .25)^2/.25 + (3 - 4.75)^2/4.75 + (500 - 502)^2/502 + (9500 - 9498)^2/9498 = 12.9 \quad (p < .001)$
• The null hypothesis is rejected with confidence .999, since 12.9 > 10.83 (the value for .999 confidence).
χ² statistic (CHI)
There is a simpler formula for 2x2 χ²:
$\chi^2(t, c) = \frac{N (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}$
A = #(t,c) C = #(¬t,c)
B = #(t,¬c) D = #(¬t,¬c)
N = A + B + C + D
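As a sketch, the χ² value for the jaguar/auto table can be computed both from expected counts and with the standard 2x2 shortcut N(AD − CB)² / ((A + C)(B + D)(A + B)(C + D)). The shortcut is stated here from the standard result, since the formula itself is missing from the transcript; the function names are assumptions.

```python
def chi_square_from_expected(a, b, c, d):
    # a = #(t, c), b = #(t, not c), c = #(not t, c), d = #(not t, not c)
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]      # term present / term absent
    col_totals = [a + c, b + d]      # in class / not in class
    total = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            total += (observed[i][j] - expected) ** 2 / expected
    return total

def chi_square_2x2(a, b, c, d):
    # Closed form for a 2x2 table.
    n = a + b + c + d
    return n * (a * d - c * b) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))

# Observed counts from the jaguar/auto slide.
A, B, C, D = 2, 3, 500, 9500
long_form = chi_square_from_expected(A, B, C, D)
short_form = chi_square_2x2(A, B, C, D)
print(round(long_form, 3), round(short_form, 3))
```

With unrounded expected counts both forms come out slightly below the 12.9 on the slide, which was computed from expected values rounded to two or three digits; either way the value exceeds the 10.83 threshold.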
Feature Selection
• Chi-square
– Statistical foundation
– May select very slightly informative frequent terms that are not very useful for classification