Text Classification, Spring 2018, COSC 6336: Natural Language Processing
TRANSCRIPT
Text Classification
COSC 6336: Natural Language Processing
Spring 2018
Some content on these slides was adapted from J&M and from Wei Xu
Class Announcements
★ Midterm exam is now on March 21st
★ Paper presentations are removed from the syllabus
★ Practical 1 is posted:
○ System submission is due March 7th
○ Report is due March 9th
Today’s lecture
★ Text Classification
○ Example applications
○ Task definition
★ Classical approaches to Text Classification
○ Naive Bayes
○ MaxEnt
What do these books have in common?
Other tasks that can be solved as TC
★ Sentiment classification
★ Native language identification
★ Author Profiling
★ Depression detection
★ Cyberbullying detection
★ ….
Formal definition of the TC task
★ Input:
○ a document d
○ a fixed set of classes C = {c1, c2, …, cJ}
★ Output: a predicted class c ∈ C
Methods for TC tasks
★ Rule-based approaches
★ Machine Learning algorithms
○ Naive Bayes
○ Support Vector Machines
○ Logistic Regression
○ And now deep learning approaches
Naive Bayes for Text Classification
★ Simple approach
★ Based on the bag-of-words representation
Bag of words
The first reference to Bag of Words is attributed to a 1954 paper by Zellig Harris
Naive Bayes
Probabilistic classifier (eq. 1):

$$\hat{c} = \arg\max_{c \in C} P(c \mid d) \quad (1)$$

According to Bayes' rule (eq. 2):

$$P(c \mid d) = \frac{P(d \mid c)\, P(c)}{P(d)} \quad (2)$$

Replacing eq. 2 into eq. 1:

$$\hat{c} = \arg\max_{c \in C} \frac{P(d \mid c)\, P(c)}{P(d)}$$

Dropping the denominator (P(d) is the same for every class):

$$\hat{c} = \arg\max_{c \in C} P(d \mid c)\, P(c)$$
Naive Bayes
A document d is represented as a set of features $f_1, f_2, \ldots, f_n$:

$$\hat{c} = \arg\max_{c \in C} P(f_1, f_2, \ldots, f_n \mid c)\, P(c)$$
How many parameters do we need to learn in this model?
Naive Bayes Assumptions
1. Position doesn’t matter
2. Naive Bayes assumption: the probabilities $P(f_i \mid c)$ are independent given the class c, and thus we can multiply them:

$$P(f_1, \ldots, f_n \mid c) = P(f_1 \mid c) \cdot P(f_2 \mid c) \cdots P(f_n \mid c)$$

This leads us to:

$$c_{NB} = \arg\max_{c \in C} P(c) \prod_{f \in F} P(f \mid c)$$
Naive Bayes in Practice
We consider word positions:

$$c_{NB} = \arg\max_{c \in C} P(c) \prod_{i \in \text{positions}} P(w_i \mid c)$$

We also do everything in log space (to avoid floating-point underflow):

$$c_{NB} = \arg\max_{c \in C} \log P(c) + \sum_{i \in \text{positions}} \log P(w_i \mid c)$$
Naive Bayes: Training
How do we compute $\hat{P}(c)$ and $\hat{P}(w_i \mid c)$? With maximum likelihood estimates from the training counts:

$$\hat{P}(c) = \frac{N_c}{N_{doc}} \qquad \hat{P}(w_i \mid c) = \frac{count(w_i, c)}{\sum_{w \in V} count(w, c)}$$
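Putting training and prediction together, here is a minimal sketch in plain Python (`train_nb` and `predict_nb` are illustrative names; add-one Laplace smoothing is used in the counts so that unseen words don't produce log(0)):

```python
from collections import Counter
import math

def train_nb(docs, labels):
    """Estimate log-priors and log-likelihoods from labeled token lists."""
    classes = set(labels)
    vocab = {w for doc in docs for w in doc}
    logprior, loglik = {}, {}
    for c in classes:
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        logprior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d)
        total = sum(counts.values())
        # Add-one (Laplace) smoothing: unseen words get a small probability.
        loglik[c] = {w: math.log((counts[w] + 1) / (total + len(vocab)))
                     for w in vocab}
    return logprior, loglik, vocab

def predict_nb(doc, logprior, loglik, vocab):
    """Pick the class maximizing log P(c) + sum log P(w|c); skip OOV words."""
    scores = {c: logprior[c] + sum(loglik[c][w] for w in doc if w in vocab)
              for c in logprior}
    return max(scores, key=scores.get)
```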
Is Naive Bayes a good option for TC?
Evaluation in TC
Confusion table:

| Predicted \ Gold | True | False |
|---|---|---|
| True | TP = true positives | FP = false positives |
| False | FN = false negatives | TN = true negatives |

Accuracy = (TP + TN) / (TP + TN + FN + FP)
Evaluation in TC: Issues with Accuracy?
Suppose we want to learn to classify each message in a web forum as “extremely negative”. We have collected gold-standard data:
★ 990 instances are labeled as negative
★ 10 instances are labeled as positive
★ The test data has 100 instances (99 negative and 1 positive)
★ A dumb classifier can get 99% accuracy by always predicting “negative”!
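The slide's point can be reproduced in a few lines (a hypothetical majority-class baseline on the 99/1 test split):

```python
# 99 negative and 1 positive test instance; always predict "negative".
gold = ["neg"] * 99 + ["pos"]
pred = ["neg"] * 100

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Recall on the "pos" class: of the actual positives, how many did we find?
tp = sum(g == p == "pos" for g, p in zip(gold, pred))
recall_pos = tp / gold.count("pos")

print(accuracy)    # 0.99: looks great...
print(recall_pos)  # 0.0: ...but we never find a single positive instance
```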
More Sensible Metrics: Precision, Recall and F-measure

P = TP / (TP + FP)

R = TP / (TP + FN)

F1 = 2PR / (P + R)

(F1 is the harmonic mean of precision and recall; TP, FP, and FN are the confusion-table counts.)
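These three metrics can be computed directly from the confusion-table counts; a small sketch (the function name is illustrative, and the zero guards are an assumption for empty denominators):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-table counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```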
What about Multi-class Problems?
● Multi-class: |C| > 2
● P, R, and F-measure are defined for a single class
● We assume classes are mutually exclusive
● We use per-class evaluation metrics:

$$P_i = \frac{TP_i}{TP_i + FP_i} \qquad R_i = \frac{TP_i}{TP_i + FN_i}$$
Micro vs Macro Average
★ Macro average: measure performance per class, then average the per-class scores
★ Micro average: pool the predictions for all classes, compute overall TP, FP, FN, and TN, then compute the metric once
★ Weighted average: compute performance per label, then average, with each label’s score weighted by its support (number of true instances)
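A minimal sketch of the macro/micro distinction for precision (hypothetical helper names; per-class counts are one-vs-rest):

```python
def per_class_counts(gold, pred, c):
    """One-vs-rest TP/FP/FN counts for a single class c."""
    tp = sum(g == c and p == c for g, p in zip(gold, pred))
    fp = sum(g != c and p == c for g, p in zip(gold, pred))
    fn = sum(g == c and p != c for g, p in zip(gold, pred))
    return tp, fp, fn

def macro_micro_precision(gold, pred):
    """Macro averages per-class precisions; micro pools the counts first."""
    classes = sorted(set(gold))
    counts = [per_class_counts(gold, pred, c) for c in classes]
    # Macro: every class counts equally, regardless of its size.
    macro = sum(tp / (tp + fp) if tp + fp else 0.0
                for tp, fp, _ in counts) / len(classes)
    # Micro: sum TP and FP over all classes, then compute one precision.
    tp_sum = sum(tp for tp, _, _ in counts)
    fp_sum = sum(fp for _, fp, _ in counts)
    micro = tp_sum / (tp_sum + fp_sum)
    return macro, micro
```

Note that with mutually exclusive single-label classes, micro-averaged precision equals accuracy: the pooled TP are exactly the correct predictions and the pooled FP the incorrect ones.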
Example
Train/Test Data Separation
Does this model look familiar?
Logistic Regression (MaxEnt)
Disadvantages of NB
★ Correlated features
○ Their evidence is overestimated (effectively counted more than once)
★ This will hurt classifier performance
Logistic Regression
★ No independence assumption
★ Feature weights are learned simultaneously
Logistic Regression
Computes p(y|x) directly from the training data:

$$p(y \mid x) \propto \exp\Big(\sum_{i} w_i f_i(x, y)\Big)$$
Logistic Regression
Computes p(y|x) directly from the training data:

$$p(y \mid x) \propto \exp\Big(\sum_{i} w_i f_i(x, y)\Big)$$

But since this is not a proper probability function, we need to normalize over all possible classes:

$$p(y \mid x) = \frac{\exp\big(\sum_{i} w_i f_i(x, y)\big)}{\sum_{y' \in Y} \exp\big(\sum_{i} w_i f_i(x, y')\big)}$$
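The normalization step can be sketched as a softmax over per-class scores (`softmax_scores` and the per-class feature vectors are illustrative assumptions, not the slide's own code):

```python
import math

def softmax_scores(weights, feats_per_class):
    """Turn exp(w . f(x, y)) scores into proper probabilities p(y|x).
    feats_per_class maps each class y to a toy feature vector f(x, y)."""
    raw = {y: math.exp(sum(w * f for w, f in zip(weights, feats)))
           for y, feats in feats_per_class.items()}
    z = sum(raw.values())          # the normalization constant
    return {y: v / z for y, v in raw.items()}
```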
![Page 30: Text Classification Spring 2018 COSC 6336: Natural](https://reader030.vdocuments.mx/reader030/viewer/2022012610/619ce6c6eb6e9a77ac6e756e/html5/thumbnails/30.jpg)
Logistic Regression
★ We then compute the probability p(y|x) for every class y
★ We choose the class with the highest probability
Features in Text Classification
★ We can have indicator functions of the form f_i(x, y): 1 if a given condition on the document x and the class y holds, and 0 otherwise
Features in Text Classification
★ Or other, real-valued functions:
An Example
Learning in Logistic Regression
Learning in Logistic Regression
What does training look like?
We need to find the right weights.
How?
By using the Conditional Maximum Likelihood principle:
choose the parameters w that maximize the log probability of the y labels in the training data:

$$\hat{w} = \arg\max_{w} \sum_{j} \log P(y^{(j)} \mid x^{(j)})$$
Learning in Logistic Regression
So the objective function L(w) we are maximizing is:

$$L(w) = \sum_{j} \log P(y^{(j)} \mid x^{(j)})$$
Learning in Logistic Regression
★ We use optimization methods like Stochastic Gradient Ascent to find the weights that maximize L(w)
★ We start with all weights at 0 and move in the direction of the gradient (the partial derivative L′(w))
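A single stochastic gradient-ascent step can be sketched as follows (`sgd_ascent_step` is an illustrative name; the gradient used is the standard observed-minus-expected feature values for this model):

```python
import math

def sgd_ascent_step(w, feats, x, gold_y, classes, lr=0.1):
    """One stochastic gradient-ascent step on a single training example.
    feats(x, y) returns the feature vector f(x, y)."""
    # Model probabilities p(y|x) under the current weights (softmax).
    raw = {y: math.exp(sum(wi * fi for wi, fi in zip(w, feats(x, y))))
           for y in classes}
    z = sum(raw.values())
    p = {y: v / z for y, v in raw.items()}
    # Gradient: observed feature values minus their expectation under the model.
    observed = feats(x, gold_y)
    expected = [sum(p[y] * feats(x, y)[i] for y in classes)
                for i in range(len(w))]
    # Move uphill (ascent), since we are maximizing the log likelihood.
    return [wi + lr * (o - e) for wi, o, e in zip(w, observed, expected)]
```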
Learning in Logistic Regression
★ The derivative turns out to be easy: observed feature values minus the feature values expected under the model:

$$\frac{\partial L}{\partial w_i} = \sum_{j} \Big( f_i(x^{(j)}, y^{(j)}) - \sum_{y'} P(y' \mid x^{(j)})\, f_i(x^{(j)}, y') \Big)$$
Regularization
★ Overfitting: when a model “memorizes” the training data and fails to generalize
★ To prevent overfitting we penalize large weights:

$$\hat{w} = \arg\max_{w} \sum_{j} \log P(y^{(j)} \mid x^{(j)}) - \alpha R(w)$$

★ Regularization forms:
○ L2 norm (Euclidean): $R(w) = \sum_i w_i^2$
○ L1 norm (Manhattan): $R(w) = \sum_i |w_i|$
In Summary
★ NB is a generative model
○ Highly correlated features can lead to bad results
★ Logistic Regression is a discriminative model
○ Weights for features are calculated simultaneously
★ NB seems to fare better on small data sets
★ Both NB and Logistic Regression are linear classifiers
★ Regularization helps alleviate overfitting