Statistical Methods for Text Mining

David Madigan, Rutgers University & DIMACS
www.stat.rutgers.edu/~madigan

David D. Lewis
www.daviddlewis.com

joint work with Alex Genkin, Vladimir Menkov, Aynur Dayanik, Dmitriy Fradkin
Statistical Analysis of Text
• Statistical text analysis has a long history in literary analysis and in solving disputed-authorship problems
• The first (?) example is due to Thomas C. Mendenhall in 1887
Mendenhall
• Mendenhall was Professor of Physics at Ohio State and at the University of Tokyo, Superintendent of the U.S. Coast and Geodetic Survey, and later President of Worcester Polytechnic Institute

Mendenhall Glacier, Juneau, Alaska
χ² = 127.2, df = 12
• Used Naïve Bayes with Poisson and Negative Binomial models
• Out-of-sample predictive performance
Today
• Statistical methods routinely used for textual analyses of all kinds
• Machine translation, part-of-speech tagging, information extraction, question-answering, text categorization, etc.
• Not reported in the statistical literature (no statisticians?)
Outline
• Part-of-Speech Tagging, Entity Recognition
• Text categorization
• Logistic regression and friends
• The richness of Bayesian regularization
• Sparseness-inducing priors
• Word-specific priors: stop words, IDF, domain knowledge, etc.
• Polytomous logistic regression
Part-of-Speech Tagging
• Assign grammatical tags to words
• Basic task in the analysis of natural language data
• Phrase identification, entity extraction, etc.
• Ambiguity: “tag” could be a noun or a verb
• “a tag is a part-of-speech label” – context resolves the ambiguity
The Penn Treebank POS Tag Set
POS Tagging Process
Berlin Chen
POS Tagging Algorithms
• Rule-based taggers: large numbers of hand-crafted rules
• Probabilistic taggers: use a tagged corpus to train some sort of model, e.g., an HMM

  tag1 → tag2 → tag3    (hidden tag sequence)
   ↓      ↓      ↓
  word1  word2  word3   (observed words)

• Clever tricks for reducing the number of parameters (aka priors)
some details… Charniak et al., 1993, achieved 95% accuracy on the Brown Corpus with:

  p(tag i | word j) = (number of times word j appears with tag i) / (number of times word j appears)

  p(tag i | previously unseen word) = (number of times a word that had never been seen with tag i gets tag i) / (number of such occurrences in total)

plus a modification that uses word suffixes
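A minimal sketch of these relative-frequency estimates on a toy tagged corpus (the `corpus` below is invented for illustration; Charniak et al.'s tagger adds smoothing and the suffix modification):

```python
from collections import Counter

# Toy tagged corpus: (word, tag) pairs.
corpus = [("the", "DT"), ("tag", "NN"), ("is", "VBZ"),
          ("a", "DT"), ("label", "NN"), ("tag", "VB")]

word_tag = Counter(corpus)               # counts of (word, tag) pairs
word = Counter(w for w, _ in corpus)     # counts of words

def p_tag_given_word(tag, w):
    """number of times word w appears with the tag, over the count of w"""
    return word_tag[(w, tag)] / word[w]

print(p_tag_given_word("NN", "tag"))     # 0.5: "tag" is NN half the time
```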
Recent Developments
• Toutanova et al., 2003, use a dependency network and richer feature set
• Log-linear model for t_i | t_{−i}, w
• Model included, for example, features for whether the word contains a number, uppercase characters, a hyphen, etc.
• Regularization of the estimation process is critical
• 96.6% accuracy on the Penn corpus
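As a hedged illustration of such word-shape features (the function and feature names below are invented, not Toutanova et al.'s actual feature set):

```python
import re

def word_features(w):
    # Simple surface features of the kind the model can include.
    return {
        "contains_number": bool(re.search(r"\d", w)),
        "contains_upper":  any(c.isupper() for c in w),
        "contains_hyphen": "-" in w,
        "suffix_3":        w[-3:],
    }

print(word_features("Mid-1990s"))
```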
Named-Entity Classification
• “Mrs. Frank” is a person
• “Steptoe and Johnson” is a company
• “Honduras” is a location
• etc.
• Bikel et al. (1998) from BBN: “Nymble” statistical approach using HMMs
• “Name classes”: Not-A-Name, Person, Location, etc.
• Smoothing for sparse training data + word features
• Training = 100,000 words from WSJ
• Accuracy = 93%
• 450,000 words: same accuracy

  nc1 → nc2 → nc3    (hidden name classes)
   ↓     ↓     ↓
  word1 word2 word3  (observed words)

  P(w_i | w_{i−1}, nc_i)                                  if nc_i = nc_{i−1}
  P(nc_i | nc_{i−1}, w_{i−1}) · P(w_i | nc_i, nc_{i−1})   if nc_i ≠ nc_{i−1}
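A rough sketch of this word-generation step (the probability tables below are hypothetical toy values, not BBN's trained model):

```python
# Toy conditional probability tables (hypothetical values).
bigram     = {("Mrs.", "PERSON"): {"Frank": 0.6}}
transition = {("NOT-A-NAME", "met"): {"PERSON": 0.3}}
first_word = {("PERSON", "NOT-A-NAME"): {"Mrs.": 0.4}}

def p_word(w_i, w_prev, nc_i, nc_prev):
    if nc_i == nc_prev:
        # Same name class: condition the word on the previous word.
        return bigram.get((w_prev, nc_i), {}).get(w_i, 0.0)
    # Class change: class transition times first-word probability.
    return (transition.get((nc_prev, w_prev), {}).get(nc_i, 0.0)
            * first_word.get((nc_i, nc_prev), {}).get(w_i, 0.0))

print(p_word("Frank", "Mrs.", "PERSON", "PERSON"))    # 0.6
print(p_word("Mrs.", "met", "PERSON", "NOT-A-NAME"))  # 0.12
```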
training-development-test
Text Categorization
• Automatic assignment of documents with respect to a manually defined set of categories
• Applications: automated indexing, spam filtering, content filters, medical coding, CRM, essay grading
• Dominant technology is supervised machine learning: manually classify some documents, then learn a classification rule from them (possibly with manual intervention)
Terminology, etc.
• Binary versus Multi-Class
• Single-Label versus Multi-Label
• Document representation via “bag of words”: d = (w_1, …, w_N), N ≈ 10⁴–10⁵
• w_i’s might be 0/1, counts, or weights (e.g., tf-idf, LSI)
• Phrases, syntactic information, synonyms, NLP, etc.?
• Stopwords, stemming
Test Collections
• Reuters-21578: 9603 training documents, 3299 test documents, 90 categories, ~multi-label
• New Reuters (RCV1): 800,000 documents
• Medline: 11,000,000 documents; MeSH headings
• TREC conferences and collections
• Newsgroups, WebKB
Reuters Evaluation

• Binary classifiers:

| | true 0 | true 1 |
|---|---|---|
| predict 0 | a | b |
| predict 1 | c | d |

  recall = d/(b+d) (“sensitivity”)
  precision = d/(c+d) (“predictive value positive”)
  F1 measure: harmonic mean of precision and recall

• Multiple binary classifiers: e.g., with two test documents and two categories, cat 1 gets precision 1.0 and recall 1.0 while cat 2 gets precision 0.5 and recall 1.0, so macro-precision = (1.0 + 0.5)/2 = 0.75, but micro-averaged precision = 2/3 (2 correct out of 3 positive predictions overall)
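A small sketch of these measures on the two-category example above (the 0/1 label vectors are hypothetical, chosen to reproduce the slide's numbers):

```python
def prf(true, pred):
    # Per-category precision, recall, F1 from 0/1 label vectors.
    tp = sum(t and p for t, p in zip(true, pred))
    precision = tp / sum(pred) if sum(pred) else 0.0
    recall = tp / sum(true) if sum(true) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical labels for two docs under cat 1 and cat 2.
cat1_true, cat1_pred = [1, 0], [1, 0]
cat2_true, cat2_pred = [0, 1], [1, 1]

p1, r1, _ = prf(cat1_true, cat1_pred)        # p = 1.0, r = 1.0
p2, r2, _ = prf(cat2_true, cat2_pred)        # p = 0.5, r = 1.0
print("macro-precision:", (p1 + p2) / 2)     # 0.75
print("micro-precision:", (1 + 1) / (1 + 2)) # 2/3
```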
Reuters Results

| Model | F1 |
|---|---|
| AdaBoost.MH | 0.86 |
| SVM | 0.84–0.87 |
| k-NN | 0.82–0.86 |
| Neural Net | 0.84 |
| “Naïve Bayes” | 0.72–0.78 |
| Rocchio | 0.62–0.76 |
Naïve Bayes
• Naïve Bayes for document classification dates back to the early 1960s
• The NB model assumes features are conditionally independent given the class
• Estimation is simple; scales well
• Empirical performance usually not bad
• High bias-low variance (Friedman, 1997; Domingos & Pazzani, 1997)

[Graphical model: class node X0 with arrows to feature nodes X1, X2, …, Xp]
Poisson NB
• Natural extension of the binary model to word frequencies: X_cj ~ Poisson(λ_cj)
• ML-equivalent to the multinomial model with Poisson-distributed document length
• Bayesian equivalence requires constraints on conjugate priors (Poisson NB has 2p hyper-parameters per class; Multinomial-Poisson has p+2)

[Graphical model: class node X0 with arrows to count nodes X_c1, X_c2, …, X_cp]
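A minimal Poisson NB sketch under the form on the slide (each word count X_cj ~ Poisson(λ_cj) independently given the class; the smoothing and toy data are my own choices):

```python
import math
from collections import defaultdict

def train(docs, labels):
    """docs: word-count dicts; returns smoothed per-class Poisson rates."""
    rates, n = defaultdict(dict), defaultdict(int)
    for d, c in zip(docs, labels):
        n[c] += 1
        for w, cnt in d.items():
            rates[c][w] = rates[c].get(w, 0.0) + cnt
    for c in rates:
        for w in rates[c]:
            rates[c][w] = (rates[c][w] + 1) / (n[c] + 1)
    return rates

def log_score(d, lam, vocab):
    # log P(d | c) = sum_j [x_j log(lam_j) - lam_j - log(x_j!)]
    return sum(d.get(w, 0) * math.log(lam.get(w, 1e-6)) - lam.get(w, 1e-6)
               - math.lgamma(d.get(w, 0) + 1) for w in vocab)

docs = [{"goal": 3, "ball": 1}, {"stock": 2, "price": 2}]
rates = train(docs, ["sport", "finance"])
vocab = {"goal", "ball", "stock", "price"}
print(max(rates, key=lambda c: log_score({"goal": 2}, rates[c], vocab)))
```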
Poisson NB - Reuters

| Model | μPrecision | μRecall |
|---|---|---|
| SVM | 0.89 | 0.84 |
| Multinomial | 0.78 | 0.76 |
| Poisson NB | 0.67 | 0.66 |
| Multinomial + logspline | 0.79 | 0.76 |
| Multinomial + negative bin. | 0.78 | 0.75 |
| Negative Binomial NB | 0.77 | 0.76 |

over-dispersion

Different story for the FAA dataset
AdaBoost.MH
• Multiclass, multi-label
• At each iteration, learns a simple score-producing classifier on weighted training data and then updates the weights
• Final decision averages over the classifiers

Example (doc 1, classes A–D):

| | A | B | C | D |
|---|---|---|---|---|
| data | +1 | +1 | −1 | −1 |
| initial weights | 0.25 | 0.25 | 0.25 | 0.25 |
| score from simple classifier | 2 | −2 | −1 | 0.1 |
| revised weights | 0.02 | 0.82 | 0.04 | 0.12 |
AdaBoost.MH (Schapire and Singer, 2000)
AdaBoost.MH’s weak learner is a stump
two words!
AdaBoost.MH Comments
• Software implementation: BoosTexter
• Some theoretical support in terms of bounds on generalization error
• 3 days of CPU time for Reuters with 10,000 boosting iterations
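A rough sketch of boosting with one-word stumps for a single binary category (illustrative of the flavor only; not Schapire and Singer's exact multi-label algorithm, nor BoosTexter):

```python
import numpy as np

def boost(X, y, rounds=10):
    """X: 0/1 document-term matrix; y: +/-1 labels for one category."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)
    stumps = []
    for _ in range(rounds):
        # Pick the word whose presence best predicts y under weights w.
        errs = np.array([np.sum(w * (np.where(X[:, j] == 1, 1, -1) != y))
                         for j in range(d)])
        j = int(np.argmin(errs))
        err = max(errs[j], 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        h = np.where(X[:, j] == 1, 1, -1)
        w *= np.exp(-alpha * y * h)     # up-weight mistakes
        w /= w.sum()
        stumps.append((j, alpha))
    return stumps

def predict(X, stumps):
    score = sum(a * np.where(X[:, j] == 1, 1, -1) for j, a in stumps)
    return np.sign(score)

X = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
y = np.array([1, 1, -1, -1])
print(predict(X, boost(X, y)))   # recovers y on this toy data
```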
Document Representation
• Documents usually represented as a “bag of words”: x_i = (x_{i1}, …, x_{ij}, …, x_{id})
• x_{ij}’s might be 0/1, counts, or weights (e.g., tf-idf, LSI)
• Many text processing choices: stopwords, stemming, phrases, synonyms, NLP, etc.
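A minimal bag-of-words / tf-idf sketch (the standard tf·idf formulation; a real system would add the processing choices above):

```python
import math
from collections import Counter

docs = [d.split() for d in
        ["the cat sat", "the dog sat", "the dog barked"]]
vocab = sorted({w for d in docs for w in d})
N = len(docs)
df = Counter(w for d in docs for w in set(d))   # document frequency

def tfidf(doc):
    tf = Counter(doc)
    return [tf[w] * math.log(N / df[w]) for w in vocab]

for d in docs:
    print([round(v, 2) for v in tfidf(d)])   # "the" always scores 0
```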
Classifier Representation
• For instance, a linear classifier:

  IF Σ_j β_j x_{ij} > 0, THEN y_i = +1, ELSE y_i = −1

• x_{ij}’s derived from the text of the document
• y_i indicates whether to put the document in the category
• β_j are parameters chosen to give good classification effectiveness
Logistic Regression Model
• Linear model for the log odds of category membership:

  ln[ P(y_i = +1 | x_i) / P(y_i = −1 | x_i) ] = βᵀx_i = Σ_j β_j x_{ij}

• Equivalent to:

  P(y_i = +1 | x_i) = exp(βᵀx_i) / (1 + exp(βᵀx_i))

• Conditional probability model
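A tiny sketch of the model: the same β gives both the log odds (linear in x) and the membership probability (the weights and counts below are hypothetical):

```python
import math

beta = {"goal": 1.2, "stock": -0.8}   # hypothetical weights
x = {"goal": 2, "stock": 1}           # word counts for one document

log_odds = sum(beta[w] * x.get(w, 0) for w in beta)
p = math.exp(log_odds) / (1 + math.exp(log_odds))
print(log_odds, round(p, 3))          # 1.6, ~0.832
```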
Logistic Regression as a Linear Classifier
• If the estimated probability of category membership is greater than p, assign the document to the category:

  IF Σ_j β_j x_{ij} > ln[ p / (1 − p) ], THEN y_i = +1

• Choose p to optimize the expected value of your effectiveness measure (may need a different form of test)
• Can change the measure without changing the model
Maximum Likelihood Training
• Choose parameters (β_j’s) that maximize the probability (likelihood) of the class labels (y_i’s) given the documents (x_i’s):

  β̂ = arg max_β Σ_i [ y_i βᵀx_i − ln(1 + exp(βᵀx_i)) ]

• Maximizing the (log-)likelihood can be viewed as minimizing a loss function
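A bare-bones sketch of maximizing this log-likelihood by gradient ascent, with y coded 0/1 (real implementations use better optimizers, and the slides below add regularization):

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, iters=2000):
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ beta))   # P(y = 1 | x)
        beta += lr * X.T @ (y - p)        # gradient of the log-likelihood
    return beta

# Toy data: first column is an intercept.
X = np.array([[1, 2, 0], [1, 0, 1], [1, 3, 0], [1, 0, 2]], float)
y = np.array([1, 0, 1, 0], float)
beta = fit_logistic(X, y)
print((1 / (1 + np.exp(-X @ beta))).round(2))  # near [1, 0, 1, 0]
```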
Hastie, Tibshirani & Friedman
Shrinkage Methods
► Subset selection is a discrete process – individual variables are either in or out. Combinatorial nightmare.
► Subset selection can also have high variance – a different dataset from the same source can result in a totally different model
► Shrinkage methods allow a variable to be partly included in the model. That is, the variable is included but with a shrunken coefficient
► Elegant way to tackle over-fitting
Ridge Regression

  β̂_ridge = arg min_β Σ_{i=1}^N ( y_i − β_0 − Σ_{j=1}^p x_{ij} β_j )²   subject to   Σ_{j=1}^p β_j² ≤ s

Equivalently:

  β̂_ridge = arg min_β { Σ_{i=1}^N ( y_i − β_0 − Σ_{j=1}^p x_{ij} β_j )² + λ Σ_{j=1}^p β_j² }

This leads to:

  β̂_ridge = (XᵀX + λI)⁻¹ Xᵀy   (works even when XᵀX is singular)

Choose λ by cross-validation.
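The closed form above, checked numerically on toy data (note that a real fit would usually leave the intercept unpenalized; this sketch penalizes everything for brevity):

```python
import numpy as np

def ridge(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = X @ np.array([1.0, 0.5, 0.0, -0.5, 2.0]) + rng.normal(scale=0.1, size=20)
for lam in (0.0, 1.0, 10.0):
    print(lam, ridge(X, y, lam).round(2))  # coefficients shrink as lam grows
```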
[Figure: “Posterior Modes with Varying Hyperparameter - Gaussian.” Posterior mode (−0.10 to 0.10) versus the hyperparameter tau (0 to 0.3) for eight coefficients: intercept, npreg, glu, bp, skin, bmi/100, ped, age/100.]
Ridge Regression = Bayesian MAP Regression
► Suppose we believe each β_j is a small value near 0
► Encode this belief as separate Gaussian probability distributions over values of β_j
► Choosing the maximum a posteriori value of the β gives the same result as ridge logistic regression

same as ridge, with:

  y_i ~ N(x_iᵀβ, σ²)
  β_j ~ N(0, τ²)
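A sketch of the equivalence for the logistic case: MAP estimation with a N(0, τ²) prior on each β_j is gradient ascent on log-likelihood plus log-prior, i.e., L2-regularized (ridge) logistic regression (the data and settings below are toy choices):

```python
import numpy as np

def map_logistic(X, y, tau=1.0, lr=0.01, iters=5000):
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ beta))
        grad = X.T @ (y - p) - beta / tau**2   # log-likelihood + log-prior
        beta += lr * grad
    return beta

X = np.array([[1, 2, 0], [1, 0, 1], [1, 3, 0], [1, 0, 2]], float)
y = np.array([1, 0, 1, 0], float)
print(map_logistic(X, y, tau=10.0).round(2))  # weak prior: near the ML fit
print(map_logistic(X, y, tau=0.2).round(2))   # strong prior: shrunk toward 0
```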
Least Absolute Shrinkage & Selection Operator (LASSO)

  β̂_lasso = arg min_β Σ_{i=1}^N ( y_i − β_0 − Σ_{j=1}^p x_{ij} β_j )²   subject to   Σ_{j=1}^p |β_j| ≤ s

A quadratic programming algorithm is needed to solve for the parameter estimates.

More generally:

  β̃ = arg min_β Σ_{i=1}^N ( y_i − β_0 − Σ_{j=1}^p x_{ij} β_j )² + λ Σ_{j=1}^p |β_j|^q

q=0: variable selection; q=1: lasso; q=2: ridge. Learn q?
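A quick sketch of the qualitative difference (assuming scikit-learn is available; penalty strengths are arbitrary): the L1 penalty zeroes coefficients, the L2 penalty only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

print(Lasso(alpha=0.1).fit(X, y).coef_.round(2))   # mostly exact zeros
print(Ridge(alpha=10.0).fit(X, y).coef_.round(2))  # small but non-zero
```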
[Figure: “Posterior Modes with Varying Hyperparameter - Laplace.” Posterior mode (−0.10 to 0.10) versus the hyperparameter lambda (120 down to 0) for the same eight coefficients: intercept, npreg, glu, bp, skin, bmi/100, ped, age/100.]
Ridge & LASSO - Theory
► Lasso estimates are consistent
► But the lasso does not have the “oracle property.” That is, it does not deliver the correct model with probability 1
► Fan & Li’s SCAD penalty function has the oracle property
LARS
► New geometrical insights into the lasso and “Stagewise”
► Leads to a highly efficient lasso algorithm for linear regression
LARS
► Start with all coefficients b_j = 0
► Find the predictor x_j most correlated with y
► Increase b_j in the direction of the sign of its correlation with y. Take residuals r = y − ŷ along the way. Stop when some other predictor x_k has as much correlation with r as x_j has
► Increase (b_j, b_k) in their joint least squares direction until some other predictor x_m has as much correlation with the residual r
► Continue until all predictors are in the model (a usage sketch follows below)
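A usage sketch (assuming scikit-learn is available, rather than hand-coding the update rules above):

```python
import numpy as np
from sklearn.linear_model import Lars

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 2] - 1.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

lars = Lars(n_nonzero_coefs=5).fit(X, y)
# Columns of coef_path_ are the coefficient vectors after each step:
# predictors enter the model one at a time.
print(lars.coef_path_.round(2))
```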
Zhang & Oles Results
• Reuters-21578 collection
• Ridge logistic regression plus feature selection

| Model | F1 |
|---|---|
| Naïve Bayes | 0.852 |
| Ridge Logistic Regression + FS | 0.914 |
| SVM | 0.911 |
Bayes!
• MAP logistic regression with a Gaussian prior gives state-of-the-art text classification effectiveness
• But the Bayesian framework is more flexible than SVMs for combining knowledge with data:
  – Feature selection
  – Stopwords, IDF
  – Domain knowledge
  – Number of classes
• (and kernels.)
Data Sets
• ModApte subset of Reuters-21578: 90 categories; 9603 training docs; 18978 features
• Reuters RCV1-v2: 103 categories; 23149 training docs; 47152 features
• OHSUMED heart disease categories: 77 categories; 83944 training docs; 122076 features
• Cosine-normalized TF×IDF weights
Dense vs. Sparse Models (Macroaveraged F1, Preliminary)

| | ModApte | RCV1-v2 | OHSUMED |
|---|---|---|---|
| Lasso | 52.03 | 56.54 | 51.30 |
| Ridge | 39.71 | 51.40 | 42.99 |
| Ridge/500 | 38.82 | 46.27 | 36.93 |
| Ridge/50 | 45.80 | 41.61 | 42.59 |
| Ridge/5 | 46.20 | 28.54 | 41.33 |
| SVM | 53.75 | 57.23 | 50.58 |
[Figure: ModApte (90 categories). Per-category log(Number of Errors + 1) plus jitter, for ridge, lasso, and SVM.]
[Figure: RCV1-v2 (103 categories). Per-category log(Number of Errors + 1) plus jitter, for ridge, lasso, and SVM.]
[Figure: OHSUMED (77 categories). Per-category log(Number of Errors + 1) plus jitter, for ridge, lasso, and SVM.]
[Figure: Histograms of the number of features with non-zero posterior mode per category: ModApte (21,989 features; x-axis 0–500), RCV1 (47,152 features; x-axis 0–1500), OHSUMED (122,076 features; x-axis 0–1000). y-axis: number of categories.]
Bayesian Unsupervised Feature Selection and Weighting
• Stopwords: low-content words that typically are discarded
  – Give them a prior with mean 0 and low variance
• Inverse document frequency (IDF) weighting
  – Rare words are more likely to be content indicators
  – Make the variance of the prior inversely proportional to frequency in the collection
• Experiments in progress (a sketch of such priors follows below)
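A sketch of the weighting idea (my own construction for illustration, not the authors' exact scheme): zero-mean Gaussian priors whose variance is small for frequent, stopword-like terms and larger for rare terms.

```python
import math
from collections import Counter

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "dog", "ran"]]
N = len(docs)
df = Counter(w for d in docs for w in set(d))   # document frequency

base_var = 1.0
prior = {w: (0.0, base_var * math.log(1 + N / df[w]))   # (mean, variance)
         for w in df}
print(prior["the"])   # frequent: small variance, coefficient stays near 0
print(prior["cat"])   # rare: larger variance, coefficient freer to move
```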
Bayesian Use of Domain Knowledge
• Often believe that certain words are positively or negatively associated with a category
• Prior mean can encode strength of positive or negative association
• Prior variance encodes confidence
First Experiments
• 27 RCV1-v2 Region categories
• CIA World Factbook entry for the country
  – Give content words higher mean and/or variance
• Only 10 training examples per category
  – Shows off prior knowledge
  – Limited data is often the case in applications
Results (Preliminary)

| | Macro F1 | ROC |
|---|---|---|
| Gaussian w/ standard prior | 0.242 | 87.2 |
| Gaussian w/ DK prior #1 | 0.608 | 91.2 |
| Gaussian w/ DK prior #2 | 0.542 | 90.0 |
Polytomous Logistic Regression
• Logistic regression trivially generalizes to 1-of-k problems
  – Cleaner than SVMs, error-correcting codes, etc.
• Laplace prior particularly cool here:
  – Suppose 99 classes and a word that predicts class 17
  – The word gets used 100 times if you build 100 separate models, or if you use polytomous with a Gaussian prior
  – With a Laplace prior and polytomous, it's used only once
• Experiments in progress, particularly on author identification
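A compact polytomous (softmax) logistic regression sketch trained by gradient ascent; the Laplace-prior sparsity discussed above is imitated here with a simple L1 soft-thresholding step (toy data and settings are my own):

```python
import numpy as np

def fit_softmax(X, y, k, lam=0.01, lr=0.1, iters=1000):
    n, d = X.shape
    B = np.zeros((d, k))
    Y = np.eye(k)[y]                                  # one-hot labels
    for _ in range(iters):
        Z = X @ B
        P = np.exp(Z - Z.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)             # softmax probabilities
        B += lr * X.T @ (Y - P) / n                   # log-likelihood step
        B = np.sign(B) * np.maximum(np.abs(B) - lr * lam, 0)  # L1 step
    return B

X = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]], float)
y = np.array([0, 1, 2, 0])
B = fit_softmax(X, y, k=3)
print((X @ B).argmax(axis=1))   # class predictions on the training docs
```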
1-of-K Sample Results: brittany-l

| Feature Set | % errors | Number of Features |
|---|---|---|
| “Argamon” function words, raw tf | 74.8 | 380 |
| POS | 75.1 | 44 |
| 1suff | 64.2 | 121 |
| 1suff*POS | 50.9 | 554 |
| 2suff | 40.6 | 1849 |
| 2suff*POS | 34.9 | 3655 |
| 3suff | 28.7 | 8676 |
| 3suff*POS | 27.9 | 12976 |
| 3suff+POS+3suff*POS+Argamon | 27.6 | 22057 |
| All words | 23.9 | 52492 |

89 authors with at least 50 postings. 10,076 training documents, 3,322 test documents. BMR-Laplace classification, default hyperparameter.
4.6 million parameters
Future
• Choose exact number of features desired
• Faster training algorithm for polytomous
  – Currently using cyclic coordinate descent
• Hierarchical models
  – Sharing strength among categories
  – Hierarchical relationships among features
• Stemming, thesaurus classes, phrases, etc.
Text Categorization Summary
• Conditional probability models (logistic, probit, etc.)
• As powerful as other discriminative models (SVM, boosting, etc.)
• Bayesian framework provides much richer ability to insert task knowledge
• Code: http://stat.rutgers.edu/~madigan/BBR
• Polytomous, domain-specific priors soon
The Last Slide
• Statistical methods for text mining work well on certain types of problems
• Many problems remain unsolved:
  – Which financial news stories are likely to impact the market?
  – Where did soccer originate?
  – Attribution
Approximate Online Sparse Bayes
Shooting algorithm (Fu, 1998)