Using Asymmetric Distributions to Improve Text Classifier Probability Estimates
Paul N. Bennett, Computer Science Dept., Carnegie Mellon University
SIGIR 2003
Abstract
Text classifiers that give probability estimates are more readily applicable in a variety of scenarios.
The quality of those estimates is crucial.
Review: a variety of standard approaches to converting scores (and poor probability estimates) from a text classifier into high-quality estimates.
Cont’d
New models: motivated by the intuition that the empirical score distributions for the “extremely irrelevant”, “hard to discriminate”, and “obviously relevant” documents are often significantly different.
Problem Definition & Approach
Differences from earlier approaches:
– Asymmetric parametric models suitable for use when little training data is available
– Explicitly analyze the quality of probability estimates and provide significance tests
– Target text classifier outputs, where a majority of the previous literature targeted the output of search engines
Problem Definition
Cont’d
There are two general types of parametric approaches:
– Fit the posterior function directly, i.e., one function estimator performs a direct mapping of the score s(d) to the probability P(+|s(d)).
– Break the problem down as shown in the gray box: an estimator for each of the class-conditional densities (p(s|+) and p(s|-)) is produced, then Bayes’ rule and the class priors are used to obtain the estimate for P(+|s(d)).
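The second, density-based approach can be sketched as follows. This is a minimal illustration, not the paper's fitted models: the Gaussian class-conditional densities and the prior value are illustrative assumptions.

```python
import math

def gaussian_pdf(s, mu, sigma):
    """Density of a Gaussian with mean mu and std dev sigma at score s."""
    return math.exp(-((s - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def posterior_positive(s, p_pos=0.3):
    """P(+|s) via Bayes' rule from class-conditional score densities.
    Illustrative choices: p(s|+) ~ N(1, 1), p(s|-) ~ N(-1, 1), P(+) = p_pos."""
    lik_pos = gaussian_pdf(s, mu=1.0, sigma=1.0)
    lik_neg = gaussian_pdf(s, mu=-1.0, sigma=1.0)
    evidence = lik_pos * p_pos + lik_neg * (1 - p_pos)
    return lik_pos * p_pos / evidence
```

At a score midway between the two class modes, the posterior falls back to the class prior; far to the right it approaches 1.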
Motivation for Asymmetric Distributions
Using standard Gaussians fails to capitalize on a basic characteristic commonly seen in empirical score distributions.
Intuitively, the area between the modes corresponds to the hard examples, which are difficult for this classifier to distinguish, while areas outside the modes are the extreme examples that are usually easily distinguished
Cont’d
Cont’d
Ideally, there will exist scores θ- and θ+ such that all examples with score greater than θ+ are relevant and all examples with score less than θ- are irrelevant.
The distance |θ- - θ+| corresponds to the margin in some classifiers, and an attempt is often made to maximize this quantity.
Because text classifiers have training data to use to separate the classes, the final behavior of the score distributions is primarily a factor of the amount of training data and the consequent separation in the classes achieved.
Cont’d
Practically, some examples will fall between θ- and θ+, and it is often important to estimate the probabilities of these examples well (since they correspond to the “hard” examples).
Justifications can be given for why you may find both more and fewer examples between θ- and θ+ than outside them, but there are few empirical reasons to believe that the distributions should be symmetric.
A natural first candidate for an asymmetric distribution is to generalize a common symmetric distribution, e.g. the Laplace or the Gaussian
Asymmetric Laplace
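The slide's formula is not recoverable from this capture. As a sketch, an asymmetric Laplace with mode θ and separate left/right exponential decay rates β and γ can be written as:

```python
import math

def asym_laplace_pdf(x, theta, beta, gamma):
    """Asymmetric Laplace density: mode at theta, decay rate beta on the
    left of the mode and gamma on the right; it reduces to the ordinary
    Laplace when beta == gamma.  The factor beta*gamma/(beta+gamma)
    normalizes the two exponential tails to total mass 1."""
    norm = beta * gamma / (beta + gamma)
    if x <= theta:
        return norm * math.exp(-beta * (theta - x))
    return norm * math.exp(-gamma * (x - theta))
```

The density peaks at θ with height βγ/(β+γ), so a small β (relative to γ) gives a heavy left tail, matching the skewed score histograms the talk motivates.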
Asymmetric Gaussian
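Analogously (again a sketch, since the slide's formula is lost), an asymmetric Gaussian keeps a single mode θ but uses different standard deviations σl and σr on the two sides:

```python
import math

def asym_gaussian_pdf(x, theta, sigma_l, sigma_r):
    """Asymmetric Gaussian density: mode theta, std dev sigma_l to the left
    and sigma_r to the right.  The half-Gaussians carry mass
    sigma_l/(sigma_l+sigma_r) and sigma_r/(sigma_l+sigma_r), so the
    shared normalizer is 2 / (sqrt(2*pi) * (sigma_l + sigma_r))."""
    norm = 2.0 / (math.sqrt(2 * math.pi) * (sigma_l + sigma_r))
    sigma = sigma_l if x <= theta else sigma_r
    return norm * math.exp(-((x - theta) ** 2) / (2 * sigma ** 2))
```

Unlike the asymmetric Laplace, the tails here decay quadratically in the exponent, which matters for the goodness-of-fit comparison later in the talk.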
Gaussian vs. Asymmetric Gaussian
Parameter Estimation
Two choices:
– (1) Use numerical estimation to estimate all three parameters at once.
– (2) Fix the value of θ, estimate the other two given our choice of θ, then consider alternate values of θ.
Because of the simplicity of analysis in the latter alternative, we choose this method.
Asymmetric Laplace MLEs
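The slide's estimator formulas are not recoverable from this capture. As a sketch of the fix-θ strategy above: for a fixed θ, setting the log-likelihood derivatives to zero gives closed forms β = n/(D_l + √(D_l·D_r)) and γ = n/(D_r + √(D_l·D_r)), where D_l and D_r are the summed absolute deviations left and right of θ; scanning θ over the observed scores then picks the best candidate. The skipping of degenerate candidates is an implementation choice here, not necessarily the paper's.

```python
import math

def fit_asym_laplace(scores):
    """Fit (theta, beta, gamma) by scanning candidate modes over the
    observed scores; given theta, beta and gamma have closed-form MLEs."""
    n = len(scores)
    best = None
    for theta in scores:
        d_l = sum(theta - x for x in scores if x <= theta)  # left deviations
        d_r = sum(x - theta for x in scores if x > theta)   # right deviations
        if d_l == 0 or d_r == 0:
            continue  # degenerate candidate: all mass on one side
        root = math.sqrt(d_l * d_r)
        beta, gamma = n / (d_l + root), n / (d_r + root)
        # log-likelihood: n*log(beta*gamma/(beta+gamma)) - beta*D_l - gamma*D_r
        loglik = n * math.log(beta * gamma / (beta + gamma)) - beta * d_l - gamma * d_r
        if best is None or loglik > best[0]:
            best = (loglik, theta, beta, gamma)
    return best[1:]  # (theta, beta, gamma)
```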
Asymmetric Gaussian MLEs
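For the asymmetric Gaussian the same fix-θ trick works (again a sketch derived by zeroing the log-likelihood derivatives, not the slide's exact notation): with S_l and S_r the summed squared deviations on each side of θ, the stationarity conditions give σl² = (S_l + (S_l²·S_r)^(1/3))/n and symmetrically for σr.

```python
import math

def asym_gaussian_mles(scores, theta):
    """Closed-form MLEs of the two std devs for a fixed mode theta."""
    n = len(scores)
    s_l = sum((x - theta) ** 2 for x in scores if x <= theta)  # left squared devs
    s_r = sum((x - theta) ** 2 for x in scores if x > theta)   # right squared devs
    sigma_l = math.sqrt((s_l + (s_l ** 2 * s_r) ** (1.0 / 3.0)) / n)
    sigma_r = math.sqrt((s_r + (s_r ** 2 * s_l) ** (1.0 / 3.0)) / n)
    return sigma_l, sigma_r
```

On data symmetric about θ the two estimates coincide, as expected.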
Methods Compared
Gaussians
Asymmetric Gaussians
Laplace Distributions
Asymmetric Laplace Distributions
Logistic Regression
Logistic Regression with Noisy Class Labels
Data
MSN Web Directory
– A large collection of heterogeneous web pages that have been hierarchically classified.
– 13 categories used, train/test = 50078/10024
Reuters
– The Reuters 21578 corpus.
– 135 classes, train/test = 9603/3299
TREC-AP
– A collection of AP news stories from 1988 to 1990.
– 20 categories, train/test = 142791/66992
Performance Measures
Log-loss
– For a document d with class c(d) ∈ {+, −}, log-loss is defined as
  δ(c(d), +) log P(+|d) + δ(c(d), −) log P(−|d),
  where δ(a, b) = 1 if a = b and 0 otherwise.
Squared error
– δ(c(d), +)(1 − P(+|d))² + δ(c(d), −)(1 − P(−|d))²
Error
– How the methods would perform if a false positive were penalized the same as a false negative.
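The per-document measures can be sketched directly (a minimal rendering; the "+"/"-" string encoding and the 0.5 threshold for Error are illustrative choices):

```python
import math

def log_loss(c, p_pos):
    """Log-loss for one document: log of the probability assigned to
    the true class c, where p_pos = P(+|d)."""
    return math.log(p_pos) if c == "+" else math.log(1 - p_pos)

def squared_error(c, p_pos):
    """Squared error: (1 - probability assigned to the true class)^2."""
    p_true = p_pos if c == "+" else 1 - p_pos
    return (1 - p_true) ** 2

def error(c, p_pos):
    """0/1 error with a 0.5 threshold, so a false positive is
    penalized the same as a false negative."""
    predicted = "+" if p_pos >= 0.5 else "-"
    return int(predicted != c)
```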
Results & Discussion
Cont’d
A. Laplace, LR+Noise, and LogReg quite clearly outperform the other methods.
LR+Noise and LogReg tend to perform slightly better than A. Laplace at some tasks with respect to log-loss and squared error.
However, A. Laplace always produces the fewest errors across all the tasks.
Goodness of Fit – naive Bayes
Cont’d -- SVM
LogOdds vs. s(d) – naive Bayes
Cont’d -- SVM
Gaussian vs. Laplace
The asymmetric Gaussian tends to place the mode more accurately than a symmetric Gaussian.
However, the asymmetric Gaussian distributes too much mass to the outside tails while failing to fit around the mode accurately enough.
The A. Gaussian is penalized quite heavily when outliers are present.
Cont’d
The asymmetric Laplace places much more emphasis around the mode.
Even in cases where the test distribution differs from the training distribution, A. Laplace still yields a solution that gives a better fit than LogReg.