Text Categorization


  • Text Categorization

  • Categorization
    Problem: Given a universe of objects and a pre-defined set of classes, or categories, assign each object to its correct class.
    Examples:

  • Overview
    - Definition of Text Categorization
    - Techniques: Decision Trees, Maximum Entropy Modeling, k-Nearest Neighbor Classification

  • Text categorization
    - Classification (= categorization): the task of assigning objects to classes or categories.
    - Text categorization: the task of classifying the topic or theme of a document.

  • Statistical classification
    - Training set of objects
    - Data representation model
    - Model class
    - Training procedure
    - Evaluation

  • Training set of objects
    A set of objects, each labeled by one or more classes. Example from Reuters:

    Date: 5-MAR-1987 09:22:57.75   Topics: earn   Places: usa

    NORD RESOURCES CORP 4TH QTR NET
    DAYTON, Ohio, March 5 -
    Shr 19 cts vs 13 cts
    Net 2,656,000 vs 1,712,000
    Revs 15.4 mln vs 9,443,000
    Avg shrs 14.1 mln vs 12.6 mln
    Shr 98 cts vs 77 cts
    Net 13.8 mln vs 8,928,000
    Revs 58.8 mln vs 48.5 mln
    Avg shrs 14.0 mln vs 11.6 mln
    NOTE: Shr figures adjusted for 3-for-2 split paid Feb 6, 1987.
    Reuter

  • Data Representation Model
    - The training set is encoded via a data representation model.
    - Typically, each object in the training set is represented by a pair (x, c), where:
      x: a vector of measurements
      c: a class label

  • Data Representation
    For text categorization: use words that are frequent in earnings documents. The 20 most representative words include: vs, mln, cts, loss, &, 000, profit, dlrs, pct, etc.

    Each document j is represented as a vector of weights s_ij, one per word, computed from tf_ij and l_j, where tf_ij is the number of occurrences of word i in document j and l_j is the length of document j.

    Representation of the example article (word : weight):
    vs : 5      mln : 5     cts : 3     ; : 3      & : 3
    000 : 4     loss : 0    ' : 0       " : 0      3 : 4
    profit : 0  dlrs : 3    1 : 2       pct : 0    is : 0
    s : 0       that : 0    net : 3     lt : 2     at : 0
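    A minimal Python sketch of a log-scaled term weight of this kind. The exact weighting used on the slide is not shown; the round(10 * (1 + log tf) / (1 + log l)) form and the token counts below are assumptions for illustration only:

```python
import math

def term_weight(tf: int, doc_len: int) -> int:
    """Log-scaled integer weight for one term in one document.

    tf      : number of occurrences of the word in the document (tf_ij)
    doc_len : length of the document in tokens (l_j)
    """
    if tf == 0:
        return 0
    return round(10 * (1 + math.log(tf)) / (1 + math.log(doc_len)))

# Hypothetical counts for a short earnings article of roughly 70 tokens:
for word, tf in [("vs", 8), ("cts", 4), ("net", 3), ("profit", 0)]:
    print(word, term_weight(tf, 70))
```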

  • Model Class and Training Procedure
    - Model class: a parameterized family of classifiers.
      e.g. a model class for binary classification: g(x) = w·x + w0; if g(x) > 0, choose class c1, else c2 (a minimal sketch follows below).
    - Training procedure: an algorithm to select one classifier from this family, i.e. to select proper parameter values (e.g. w, w0).
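    A minimal sketch of this model class; the weights w, w0 and the toy feature vector are illustrative, not taken from the slides:

```python
import numpy as np

def g(x: np.ndarray, w: np.ndarray, w0: float) -> float:
    """Linear discriminant g(x) = w·x + w0 for binary classification."""
    return float(w @ x + w0)

def classify(x, w, w0) -> str:
    # Choose class c1 if g(x) > 0, otherwise c2.
    return "c1" if g(x, w, w0) > 0 else "c2"

# Toy document vector over two features, with illustrative parameters:
x = np.array([5.0, 3.0])
w = np.array([0.4, -0.2])
print(classify(x, w, w0=-1.0))  # -> "c1", since 0.4*5 - 0.2*3 - 1.0 = 0.4 > 0
```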

  • Evaluation
    - Borrowing from IR, NLP systems are evaluated by precision, recall, etc.
    - Example: for text categorization, given a set of documents of which a subset is in a particular category (say, earnings), the system classifies some other subset of the documents as belonging to the earnings category.
    - The results of the system are compared with the actual results as follows:

  • Evaluation measures
    With tp, fp, fn, tn the cells of the contingency table (system decision vs. actual class):
    - Precision = tp / (tp + fp)
    - Recall = tp / (tp + fn)
    - Accuracy = (tp + tn) / (tp + fp + fn + tn)
    - Error = (fp + fn) / (tp + fp + fn + tn)
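    A small sketch of these four measures computed from a contingency table; the tp/fp/fn/tn counts below are hypothetical:

```python
def precision(tp, fp):            # fraction of predicted-positive items that are correct
    return tp / (tp + fp)

def recall(tp, fn):               # fraction of actual-positive items that are found
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):     # fraction of all decisions that are correct
    return (tp + tn) / (tp + tn + fp + fn)

def error(tp, tn, fp, fn):        # fraction of all decisions that are wrong
    return (fp + fn) / (tp + tn + fp + fn)

# Hypothetical contingency table for the "earnings" category:
tp, fp, fn, tn = 90, 10, 30, 870
print(precision(tp, fp), recall(tp, fn),
      accuracy(tp, tn, fp, fn), error(tp, tn, fp, fn))
```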

  • Evaluation of text categorization
    - Macro-averaging: compute an evaluation measure for each category's contingency table separately and average over categories; gives equal weight to each category.
      macro-averaged precision = (1/C) Σ_i tp_i / (tp_i + fp_i), averaging over the C categories
    - Micro-averaging: make a single contingency table for all categories by summing the scores in each cell, then compute the evaluation measure for the whole table; gives equal weight to each object.
      micro-averaged precision = Σ_i tp_i / Σ_i (tp_i + fp_i)
    A sketch contrasting the two is given below.
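    A sketch contrasting the two averages on hypothetical per-category (tp, fp) counts:

```python
def macro_micro_precision(tables):
    """tables: list of per-category (tp, fp) counts from the contingency tables."""
    # Macro-averaging: average the per-category precisions (equal weight per category).
    macro = sum(tp / (tp + fp) for tp, fp in tables) / len(tables)
    # Micro-averaging: pool the cells first, then compute one precision (equal weight per object).
    tp_sum = sum(tp for tp, _ in tables)
    fp_sum = sum(fp for _, fp in tables)
    micro = tp_sum / (tp_sum + fp_sum)
    return macro, micro

# Hypothetical counts for three categories (per-category precisions 0.5, 0.9, 0.1):
print(macro_micro_precision([(10, 10), (90, 10), (1, 9)]))
```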

  • Classification Techniques
    - Naïve Bayes
    - Decision Trees
    - Maximum Entropy Modeling
    - Support Vector Machines
    - k-Nearest Neighbor

  • Bayesian Classifiers

  • Bayesian Methods
    - Learning and classification methods based on probability theory.
    - Bayes' theorem plays a critical role in probabilistic learning and classification.
    - Build a generative model that approximates how data is produced.
    - Uses the prior probability of each category given no information about an item.
    - Categorization produces a posterior probability distribution over the possible categories given a description of an item.

  • Bayes' Rule
    P(c | x) = P(x | c) P(c) / P(x)
    where P(c) is the prior probability and P(c | x) is the posterior probability.

  • Maximum a posteriori Hypothesis
    h_MAP = argmax_h P(h | D) = argmax_h P(D | h) P(h) / P(D) = argmax_h P(D | h) P(h)

  • Maximum likelihood Hypothesis
    If all hypotheses are a priori equally likely, we need only consider the P(D | h) term:
    h_ML = argmax_h P(D | h)

  • Naïve Bayes Classifiers
    Task: classify a new instance based on a tuple of attribute values (x1, x2, ..., xn):
    c_MAP = argmax_{cj in C} P(cj | x1, x2, ..., xn)

  • Naïve Bayes Classifier: Assumptions
    - P(cj) can be estimated from the frequency of classes in the training examples.
    - P(x1, x2, ..., xn | cj) has O(|X|^n · |C|) parameters and could only be estimated if a very, very large number of training examples was available.
    - Conditional Independence Assumption: assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities.

  • The Naïve Bayes Classifier
    Conditional Independence Assumption: features are independent of each other given the class:
    P(x1, x2, ..., xn | cj) = P(x1 | cj) · P(x2 | cj) · ... · P(xn | cj)

  • Learning the Model
    Common practice: maximum likelihood, i.e. simply use the frequencies in the data:
    P(cj) = N(C = cj) / N
    P(xi | cj) = N(Xi = xi, C = cj) / N(C = cj)

  • Problem with Max Likelihood
    What if we have seen no training cases where the patient had no flu and muscle aches? Then the maximum likelihood estimate of that conditional probability is zero.

    Zero probabilities cannot be conditioned away, no matter the other evidence!

  • Smoothing to Avoid Overfitting
    Add-one (Laplace) smoothing:
    P(xi,k | cj) = (N(Xi = xi,k, C = cj) + 1) / (N(C = cj) + k), where k is the number of values of Xi.
    Somewhat more subtle version (m-estimate):
    P(xi,k | cj) = (N(Xi = xi,k, C = cj) + m·p_i,k) / (N(C = cj) + m), where p_i,k is the overall fraction in the data where Xi = xi,k and m controls the extent of smoothing.
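    A small sketch of both smoothed estimators; the function names and example counts are illustrative:

```python
def laplace_estimate(n_xc: int, n_c: int, k: int) -> float:
    """Add-one (Laplace) smoothing: k is the number of values attribute X_i can take."""
    return (n_xc + 1) / (n_c + k)

def m_estimate(n_xc: int, n_c: int, p: float, m: float) -> float:
    """m-estimate: p is the overall fraction of the data with X_i = x_i,k,
    and m controls the extent of smoothing (m virtual samples distributed as p)."""
    return (n_xc + m * p) / (n_c + m)

# No flu-with-muscle-aches cases observed: the estimate is small but never zero.
print(laplace_estimate(0, 50, k=2))        # 1/52
print(m_estimate(0, 50, p=0.3, m=10))      # 3/60
```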

  • Text Classification Using Naïve Bayes: Basic method
    - Attributes are text positions, values are words.
    - The Naive Bayes assumption is clearly violated. Example?
    - Still too many possibilities:
      - assume that classification is independent of the positions of the words
      - use the same parameters for each position

  • Text Classification Algorithms: Learning
    From the training corpus, extract Vocabulary.
    Calculate the required P(cj) and P(xk | cj) terms:
    For each cj in C do
      docsj <- subset of documents for which the target class is cj
      P(cj) <- |docsj| / |total number of documents|
      Textj <- single document containing all docsj
      for each word xk in Vocabulary
        nk <- number of occurrences of xk in Textj
        P(xk | cj) <- (nk + 1) / (n + |Vocabulary|), where n is the total number of word positions in Textj

  • Text Classification Algorithms: Classifying
    positions <- all word positions in the current document which contain tokens found in Vocabulary
    Return cNB, where
    cNB = argmax_{cj in C} P(cj) · Π_{i in positions} P(xi | cj)
    (A runnable sketch of both steps follows below.)
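    A runnable sketch of the learning and classifying steps above, assuming the multinomial model with add-one smoothing and summing log probabilities (see the underflow-prevention slide below); the tiny corpus is made up:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, class_label). Returns priors, conditional probs, vocabulary."""
    vocab = {w for tokens, _ in docs for w in tokens}
    class_docs = defaultdict(list)
    for tokens, c in docs:
        class_docs[c].append(tokens)
    priors, cond = {}, {}
    for c, doc_list in class_docs.items():
        priors[c] = len(doc_list) / len(docs)
        text_c = [w for tokens in doc_list for w in tokens]   # "mega-document" for class c
        counts, n = Counter(text_c), len(text_c)
        # Add-one smoothing over the vocabulary.
        cond[c] = {w: (counts[w] + 1) / (n + len(vocab)) for w in vocab}
    return priors, cond, vocab

def classify_nb(tokens, priors, cond, vocab):
    """Return argmax_c log P(c) + sum_i log P(x_i | c), summing logs to avoid underflow."""
    best, best_score = None, float("-inf")
    for c in priors:
        score = math.log(priors[c]) + sum(
            math.log(cond[c][w]) for w in tokens if w in vocab)
        if score > best_score:
            best, best_score = c, score
    return best

# Tiny illustrative corpus (labels and words are made up):
docs = [("net profit vs loss".split(), "earn"),
        ("shr cts vs mln net".split(), "earn"),
        ("election vote party".split(), "politics")]
priors, cond, vocab = train_nb(docs)
print(classify_nb("profit net mln".split(), priors, cond, vocab))  # -> "earn"
```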

  • Naive Bayes Time Complexity
    - Training time: O(|D|·Ld + |C|·|V|), where Ld is the average length of a document in D.
      - Assumes V and all Di, ni, and nij are pre-computed in O(|D|·Ld) time during one pass through all of the data.
      - Generally just O(|D|·Ld), since usually |C|·|V| < |D|·Ld.
    - Test time: O(|C|·Lt), where Lt is the average length of a test document.
    - Very efficient overall, linearly proportional to the time needed to just read in all the data.

  • Underflow Prevention
    - Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.
    - Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.
    - The class with the highest final un-normalized log probability score is still the most probable.

  • Naïve Bayes Posterior Probabilities
    - Classification results of naïve Bayes (the class with maximum posterior probability) are usually fairly accurate.
    - However, due to the inadequacy of the conditional independence assumption, the actual posterior-probability numerical estimates are not: output probabilities are generally very close to 0 or 1.

  • Two Models
    Model 1: Multivariate binomial
    - One feature Xw for each word in the dictionary.
    - Xw = true in document d if w appears in d.
    - Naive Bayes assumption: given the document's topic, the appearance of one word in the document tells us nothing about the chances that another word appears.

  • Two Models
    Model 2: Multinomial
    - One feature Xi for each word position in the document; the feature's values are all words in the dictionary.
    - The value of Xi is the word in position i.
    - Naïve Bayes assumption: given the document's topic, the word in one position in the document tells us nothing about the value of words in other positions.
    - Second assumption: word appearance does not depend on position, i.e. P(Xi = w | c) = P(Xj = w | c) for all positions i, j, words w, and classes c.

  • Parameter estimation
    - Binomial model: P(Xw = true | cj) = fraction of documents of topic cj in which word w appears.
    - Multinomial model: P(Xi = w | cj) = fraction of times in which word w appears across all documents of topic cj.
      Create a mega-document for topic j by concatenating all documents in this topic, and use the frequency of w in the mega-document.

  • Feature selection via Mutual Information
    - We might not want to use all words, but just reliable, good discriminators.
    - In the training set, choose the k words which best discriminate the categories.
    - One way is in terms of Mutual Information: for each word w and each category c,
      I(w, c) = Σ_{e_w in {0,1}} Σ_{e_c in {0,1}} P(e_w, e_c) log [ P(e_w, e_c) / (P(e_w) P(e_c)) ]
    A sketch of this computation over document counts is given below.
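    A sketch of this computation from document counts, where e_w and e_c are indicators for word presence and category membership; the counts in the example are hypothetical:

```python
import math

def mutual_information(n11, n10, n01, n00):
    """MI between word presence (w / not w) and category membership (c / not c).

    n11: #docs containing w and in c      n10: #docs containing w, not in c
    n01: #docs without w, in c            n00: #docs without w, not in c
    """
    n = n11 + n10 + n01 + n00
    mi = 0.0
    # Each term: P(e_w, e_c) * log2( P(e_w, e_c) / (P(e_w) * P(e_c)) ).
    for n_ec, n_w, n_c in [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]:
        if n_ec > 0:
            mi += (n_ec / n) * math.log2(n * n_ec / (n_w * n_c))
    return mi

# Hypothetical counts for one (word, category) pair:
print(mutual_information(n11=80, n10=200, n01=20, n00=9700))
```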

  • Feature selection via MI (2)
    - For each category we build a list of the k most discriminating terms.
    - For example (on 20 Newsgroups):
      sci.electronics: circuit, voltage, amp, ground, copy, battery, electronics, cooling, ...
      rec.autos: car, cars, engine, ford, dealer, mustang, oil, collision, autos, tires, toyota, ...
    - Greedy: does not account for correlations between terms.
    - In general, feature selection is necessary for binomial NB, but not for multinomial NB.

  • Evaluating Categorization
    - Evaluation must be done on test data that are independent of the training data (usually a disjoint set of instances).
    - Classification accuracy: c/n, where n is the total number of test instances and c is the number of test instances correctly classified by the system.
    - Results can vary based on sampling error due to different training and test sets.
    - Average results over multiple training and test sets (splits of the overall data) for the best results.

  • Example: AutoYahoo!
    Classify 13,589 Yahoo! webpages in the Science subtree into 95 different topics (hierarchy depth 2).

  • Example: WebKB (CMU)
    Classify webpages from CS departments into: student, faculty, course, project.

  • WebKB Experiment
    - Train on ~5,000 hand-labeled web pages (Cornell, Washington, U. Texas, Wisconsin).
    - Crawl and classify a new site (CMU).
    - Results:

      Class       Student  Faculty  Person  Project  Course  Depart.
      Extracted       180       66     246       99      28        1
      Correct         130       28     194       72      25        1
      Accuracy        72%      42%     79%      73%     89%     100%

  • NB Model Comparison

  • Sample Learning Curve (Yahoo Science Data)

  • Importance of Conditional Independence
    Assume a domain with 20 binary (true/false) attributes A1, ..., A20, and two classes c1 and c2. Goal: for any case A = (A1, ..., A20), estimate P(A, ci).

    A) No independence assumptions:
    - Computation of 2^21 parameters (one for each combination of values)!
    - The training database will not be so large!
    - Huge memory requirements / processing time.
    - Error prone (small-sample error).

    B) Strongest conditional independence assumptions (all attributes independent given the class) = Naive Bayes:
    - P(A, ci) = P(ci) · P(A1 | ci) · P(A2 | ci) · ... · P(A20 | ci)
    - Computation of 20·2·2 = 80 parameters.
    - Space and time efficient.
    - Robust estimations.
    - What if the conditional independence assumptions do not hold?

    C) More relaxed independence assumptions:
    - Tradeoff between A) and B).

  • Conditions for Optimality of Naïve Bayes
    - Fact: sometimes NB performs well even if the conditional independence assumptions are badly violated.
    - Questions: WHY? And WHEN?
    - Hint: classification is about predicting the correct class label and NOT about accurately estimating probabilities.
    - Answer: assume two classes c1 and c2, and a new case A arrives. NB will classify A to c1 if P(A, c1) > P(A, c2). Despite the big error in estimating the probabilities, the classification is still correct.

    Correct estimation implies accurate prediction, but NOT: accurate prediction implies correct estimation.

  • Naïve Bayes is Not So Naïve
    - Naïve Bayes took first and second place in the KDD-CUP 97 competition, among 16 (then) state-of-the-art algorithms. Goal: a financial services industry direct mail response prediction model, predicting whether the recipient of mail will actually respond to the advertisement (750,000 records).
    - Robust to irrelevant features: irrelevant features cancel each other without affecting results. Decision trees and nearest-neighbor methods, in contrast, can heavily suffer from this.
    - Very good in domains with many equally important features. Decision trees suffer from fragmentation in such cases, especially with little data.
    - A good dependable baseline for text classification (but not the best)!
    - Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes optimal classifier for the problem.
    - Very fast: learning with one pass over the data; testing is linear in the number of attributes and the document collection size.
    - Low storage requirements.
    - Handles missing values.

  • Interpretability of Naïve Bayes
    (From R. Kohavi, Silicon Graphics MineSet Evidence Visualizer)

  • Naïve Bayes Drawbacks
    - Doesn't model higher-order interactions.
      Typical example: chess end games, where each move completely changes the context for the next move. C4.5: 99.5% accuracy; NB: 87% accuracy.
      What if you have BOTH high-order interactions AND few training data?
    - Doesn't model features that do not contribute equally to distinguishing the classes.
      If only a few features mostly determine the class, additional features usually decrease the accuracy, because NB gives the same weight to all features.

  • Decision Trees

  • Decision Trees
    Example: decide whether to assign documents to the category "earnings".
    - Node 1: 7681 articles, P(c|n1) = 0.300, split on cts at value 2
      - cts < 2 -> Node 2: 5977 articles, P(c|n2) = 0.116, split on net at value 1
        - net < 1 -> Node 3: 5436 articles, P(c|n3) = 0.050
        - net >= 1 -> Node 4: 541 articles, P(c|n4) = 0.649
      - cts >= 2 -> Node 5: 1704 articles, P(c|n5) = 0.943, split on vs at value 2
        - vs < 2 -> Node 6: 301 articles, P(c|n6) = 0.694
        - vs >= 2 -> Node 7: 1403 articles, P(c|n7) = 0.996

  • Decision Trees - Training procedure (1)
    - Growing a tree with training data:
      - splitting criterion: for finding the feature and its value on which to split, e.g. maximum information gain
      - stopping criterion: determines when to stop splitting, e.g. all elements at a node have the same category
    - Pruning it back to a reasonable size:
      - to avoid overfitting the training set (e.g. dlrs and pct occurring in just one document)
      - to optimize performance

  • Maximum Information Gain
    Information gain: G(a) = H(t) - H(t|a) = H(t) - (pL·H(tL) + pR·H(tR)), where:
    - a is the attribute we split on
    - t is the distribution of the node we split
    - pL and pR are the proportions of elements passed on to the left and right nodes
    - tL and tR are the distributions of the left and right nodes
    Choose the attribute which maximizes the information gain.

    Example (logs base 2):
    H(n1) = -0.3 log2(0.3) - 0.7 log2(0.7) = 0.881
    H(n2) = 0.518, H(n5) = 0.315
    H(n1) - H(n1|cts) = 0.881 - (5977/7681)·0.518 - (1704/7681)·0.315 = 0.408
    (A small sketch reproducing these numbers follows below.)
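    A small sketch that reproduces these numbers, using binary entropy in bits:

```python
import math

def entropy(p: float) -> float:
    """Binary entropy in bits (base-2 log, which matches the 0.881 on the slide)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(h_parent, n_left, h_left, n_right, h_right):
    """IG = H(t) - (p_L * H(t_L) + p_R * H(t_R))."""
    n = n_left + n_right
    return h_parent - (n_left / n) * h_left - (n_right / n) * h_right

# Splitting node 1 (7681 articles, P(c|n1) = 0.30) on "cts" into nodes 2 and 5:
print(entropy(0.300))                                 # ~ 0.881
print(information_gain(entropy(0.300),
                       5977, entropy(0.116),          # node 2
                       1704, entropy(0.943)))         # node 5 -> ~ 0.408
```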

  • Decision Trees - Pruning (1)
    - At each step, drop the node considered least helpful.
    - Find the best tree using validation on a validation set (a portion of the training data held out from training).
    - Find the best tree using cross-validation:
      I. Determine the optimal tree size:
        1. Divide the training data into N partitions.
        2. Grow using N-1 partitions, and prune using the held-out partition.
        3. Repeat step 2 N times.
        4. Take the average pruned tree size as the optimal tree size.
      II. Train on the total training data, and prune back to the optimal size.

  • Decision Trees - Pruning (2)
    Effect of pruning on accuracy: optimal performance on the test set is reached when 951 nodes are pruned.

  • Decision Trees - Summary
    - Useful for non-trivial classification tasks (for simple problems, use simpler methods).
    - Tend to split the training set into smaller and smaller subsets, which may lead to poor generalization:
      - not enough data for reliable prediction
      - accidental regularities
    - Volatile: a very different model can result from slightly different data.
    - Can be interpreted easily:
      - easy to trace the path
      - easy to debug one's code
      - easy to understand a new domain

  • Maximum Entropy Modeling

  • Maximum Entropy Modeling
    - The model with maximum entropy of all the models that satisfy the constraints: the desire is to preserve as much uncertainty as possible.
    - Model class: loglinear model.
    - Training procedure: generalized iterative scaling.

  • Maximum Entropy Modeling (2)
    Model class: loglinear model
    p(x, c) = (1/Z) · Π_{i=1..K} α_i^{f_i(x, c)}
    where α_i is the weight for the i-th feature f_i and Z is a normalizing constant.

    Class of a new document x: compute p(x, 0) and p(x, 1), and choose the class label with the greater probability.

  • Maximum Entropy Modeling (3)
    Training procedure: generalized iterative scaling (GIS).
    The maximum entropy distribution p* satisfies the constraints E_{p*} f_i = E_{p~} f_i, where E_{p~} f_i is the expected value of f_i under the empirical distribution of the training data.

    Algorithm:
    1. Initialize α_i^(1). Compute E_{p~} f_i. Set n = 1.
    2. Compute p^(n)(x, c) for each training example.
    3. Compute E_{p^(n)} f_i.
    4. Update α_i^(n+1).
    5. If converged, stop; otherwise set n = n + 1 and go to step 2.

  • Maximum Entropy Modeling (4)
    Define a (K+1)-th feature for the constraint that the sum of the f_i is equal to a constant C:
    f_{K+1}(x, c) = C - Σ_{i=1..K} f_i(x, c)

    The expected value of f_i under p is defined as
    E_p f_i = Σ_{x,c} p(x, c) f_i(x, c)

    The expected value under the empirical distribution is computed as
    E_{p~} f_i = (1/N) Σ_{j=1..N} f_i(x_j, c_j)

    The expected value under p is approximately computed as
    E_p f_i ≈ (1/N) Σ_{j=1..N} Σ_c p(c | x_j) f_i(x_j, c)

  • GIS Algorithm (full)
    Initialize {α_i^(1)}, then iterate the steps above until the α_i converge; the update is
    α_i^(n+1) = α_i^(n) · ( E_{p~} f_i / E_{p^(n)} f_i )^{1/C}
    (A minimal sketch is given below.)
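    A minimal sketch of GIS for a conditional loglinear model, assuming binary-valued features plus the correction feature f_{K+1}; the toy data and feature functions are illustrative, not the slide's implementation:

```python
import math

def gis(train, feats, classes, n_iter=100):
    """Generalized iterative scaling for a conditional loglinear model.

    train   : list of (x, c) training pairs
    feats   : list of feature functions f_i(x, c) returning non-negative values
    classes : list of class labels
    A correction feature f_{K+1} = C - sum_i f_i is appended so that every
    (x, c) pair has the same feature sum C.  Returns the weights alpha_i.
    """
    C = max(sum(f(x, c) for f in feats) for x, _ in train for c in classes)
    all_feats = feats + [lambda x, c, C=C: C - sum(f(x, c) for f in feats)]
    K, N = len(all_feats), len(train)

    # Empirical expectations E_p~[f_i], floored to keep all weights strictly positive.
    emp = [max(sum(f(x, c) for x, c in train) / N, 1e-10) for f in all_feats]

    alpha = [1.0] * K
    for _ in range(n_iter):
        # Model expectations E_p[f_i] ~ (1/N) sum_j sum_c p(c|x_j) f_i(x_j, c).
        model = [0.0] * K
        for x, _ in train:
            scores = {c: math.prod(alpha[i] ** all_feats[i](x, c) for i in range(K))
                      for c in classes}
            Z = sum(scores.values())
            for c in classes:
                p = scores[c] / Z
                for i in range(K):
                    model[i] += p * all_feats[i](x, c) / N
        # GIS update: alpha_i <- alpha_i * (E_p~[f_i] / E_p[f_i])^(1/C).
        alpha = [a * (e / m) ** (1 / C) if m > 0 else a
                 for a, e, m in zip(alpha, emp, model)]
    return alpha

# Toy example with two binary features (data and features are purely illustrative):
train = [({"vs": 1, "profit": 1}, "earn"), ({"vote": 1}, "other")]
feats = [lambda x, c: x.get("vs", 0) * (c == "earn"),
         lambda x, c: x.get("vote", 0) * (c == "other")]
print(gis(train, feats, classes=["earn", "other"], n_iter=50))
```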

  • Maximum Entropy Modeling (6)
    Application to text categorization: trained on 9603 articles, 500 iterations; test result: 88.6% accuracy.

  • Maximum Entropy Modeling (7)
    Shortcomings of MEM:
    - restricted to binary features
    - low performance in some situations
    - computationally expensive: slow convergence
    - the lack of smoothing can cause problems
    Strengths of MEM:
    - can specify all possible relevant information
    - complex features can be defined
    - can use heterogeneous features and weighted features
    - an integrated framework for feature selection and classification
    - a very large number of features can be reduced to a manageable size during the training procedure

  • Vector Space Classifiers

  • Vector Space Representation
    - Each document is a vector, with one component for each term (= word).
    - Normalize to unit length.
    - Properties of the vector space:
      - terms are axes
      - n docs live in this space
      - even with stemming, there may be 10,000+ dimensions, or even 1,000,000+

  • Classification Using Vector Spaces
    - Each training doc is a point (vector), labeled by its class.
    - Similarity hypothesis: docs of the same class form a contiguous region of space. Or: similar documents are usually in the same class.
    - Define surfaces to delineate classes in space.

  • Classes in a Vector Space
    (Figure: Government, Science, and Arts regions. Is the similarity hypothesis true in general?)

  • Given a Test Document
    - Figure out which region it lies in.
    - Assign the corresponding class.

  • Test Document = Government
    (Figure: a test point in the Government region, among the Government, Science, and Arts regions.)

  • Binary Classification
    - Consider 2-class problems.
    - How do we define (and find) the separating surface?
    - How do we test which region a test doc is in?

  • Separation by Hyperplanes
    - Assume linear separability for now:
      - in 2 dimensions, we can separate by a line
      - in higher dimensions, we need hyperplanes
    - Can find a separating hyperplane by linear programming (or, e.g., a perceptron); the separator can be expressed as ax + by = c.

  • Linear Programming / Perceptron
    Find a, b, c such that
    ax + by >= c for red points
    ax + by <= c for blue points.
    (A minimal perceptron sketch follows below.)
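    A minimal perceptron sketch for this 2-D formulation; the points and labels are illustrative ("red" = +1, "blue" = -1):

```python
import numpy as np

def perceptron(points, labels, epochs=100, lr=1.0):
    """Find (a, b, c) with a*x + b*y >= c for the +1 ("red") points and
    a*x + b*y <= c for the -1 ("blue") points, assuming the data are separable."""
    w = np.zeros(2)          # (a, b)
    c = 0.0
    for _ in range(epochs):
        updated = False
        for p, y in zip(points, labels):
            if y * (w @ p - c) <= 0:        # point on the wrong side (or on the line)
                w += lr * y * p             # nudge the separator toward the point
                c -= lr * y
                updated = True
        if not updated:                      # converged: all points correctly separated
            break
    return w[0], w[1], c

# Illustrative 2-D points: +1 = "red", -1 = "blue".
pts = np.array([[2.0, 3.0], [3.0, 3.5], [0.5, 1.0], [1.0, 0.5]])
lbl = [+1, +1, -1, -1]
print(perceptron(pts, lbl))
```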

  • Relationship to Naïve Bayes?
    Find a, b, c such that ax + by >= c for red points and ax + by <= c for blue points.

  • Linear Classifiers
    - Many common text classifiers are linear classifiers.
    - Despite this similarity, there are large performance differences.
    - For separable problems, there is an infinite number of separating hyperplanes. Which one do you choose?
    - What to do for non-separable problems?

  • Which Hyperplane?
    In general, lots of possible solutions for a, b, c.

  • Which Hyperplane?
    - Lots of possible solutions for a, b, c.
    - Some methods find a separating hyperplane, but not the optimal one (e.g., the perceptron).
    - Most methods find an optimal separating hyperplane. Which points should influence optimality?
      - All points: linear regression, naïve Bayes.
      - Only difficult points close to the decision boundary: support vector machines, logistic regression (kind of).

  • Hyperplane: Example
    Class: interest (as in interest rate). Example features of a linear classifier (SVM):

      wi      ti             wi      ti
      0.70    prime         -0.71    dlrs
      0.67    rate          -0.35    world
      0.63    interest      -0.33    sees
      0.60    rates         -0.25    year
      0.46    discount      -0.24    group
      0.43    bundesbank    -0.24    dlr

  • More Than Two Classes
    - One-of classification: each document belongs to exactly one class. How do we compose separating surfaces into regions?
    - Any-of or multiclass classification: for n classes, decompose into n binary problems.
    - Vector space classifiers for one-of classification:
      - use a set of binary classifiers
      - centroid classification
      - k nearest neighbor classification

  • Composing Surfaces: Issues???

  • Set of Binary Classifiers
    - Build a separator between each class and its complementary set (docs from all other classes).
    - Given a test doc, evaluate it for membership in each class.
    - For one-of classification, declare membership in the class with:
      - maximum score
      - maximum confidence
      - maximum probability
    - Why is this different from multiclass classification?

  • Negative Examples
    Formulate as above, except that negative examples for a class are added to its complementary set.
    (Figure: positive examples vs. negative examples.)

  • Centroid Classification
    - Given the training docs for a class, compute their centroid.
    - Now we have a centroid for each class.
    - Given a query doc, assign it to the class whose centroid is nearest.
    - Compare to Rocchio. (A minimal sketch follows below.)
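    A minimal sketch of centroid classification with cosine similarity on unit-length vectors; the toy term vectors and labels are made up:

```python
import numpy as np

def train_centroids(vectors, labels):
    """Average the (unit-length) training vectors of each class into one centroid."""
    centroids = {}
    for c in set(labels):
        members = np.array([v for v, l in zip(vectors, labels) if l == c])
        centroids[c] = members.mean(axis=0)
    return centroids

def classify(doc, centroids):
    """Assign the document to the class whose centroid is nearest (highest cosine)."""
    doc = doc / np.linalg.norm(doc)
    return max(centroids,
               key=lambda c: doc @ (centroids[c] / np.linalg.norm(centroids[c])))

# Toy 3-term vectors (terms and counts are made up), normalized to unit length:
def unit(v): return np.array(v, dtype=float) / np.linalg.norm(v)
train = [unit([3, 0, 1]), unit([2, 1, 0]), unit([0, 4, 1]), unit([1, 3, 0])]
labels = ["Government", "Government", "Science", "Science"]
print(classify(unit([2, 0, 1]), train_centroids(train, labels)))   # -> "Government"
```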

  • Example
    (Figure: centroids of the Government, Science, and Arts classes.)

  • k-Nearest Neighbor

  • k Nearest Neighbor Classification
    To classify document d into class c:
    - Define the k-neighborhood N as the k nearest neighbors of d.
    - Count the number of documents l in N that belong to c.
    - Estimate P(c|d) as l/k. (A minimal sketch follows below.)
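    A minimal sketch of this estimate, using cosine similarity on unit-length vectors as the similarity measure; the toy vectors and labels are made up:

```python
import numpy as np
from collections import Counter

def knn_posterior(doc, train_vectors, train_labels, k=6):
    """Estimate P(c|d) as (number of the k nearest neighbors labeled c) / k."""
    doc = doc / np.linalg.norm(doc)
    sims = [(doc @ (v / np.linalg.norm(v)), c)
            for v, c in zip(train_vectors, train_labels)]
    neighbors = [c for _, c in sorted(sims, reverse=True)[:k]]
    counts = Counter(neighbors)
    return {c: counts[c] / k for c in counts}

# Toy 2-term document vectors with made-up labels:
train = [np.array(v, float) for v in [[1, 0], [0.9, 0.1], [0.8, 0.3],
                                      [0.1, 1], [0.2, 0.9], [0.3, 0.8]]]
labels = ["Science", "Science", "Science", "Arts", "Arts", "Arts"]
print(knn_posterior(np.array([0.7, 0.4]), train, labels, k=3))  # -> {'Science': 1.0}
```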

  • Example: k = 6 (6NN)
    (Figure: Government, Science, and Arts points around a test document. What is P(science | test document)?)

  • Cover and Hart 1967
    - Asymptotically, the error rate of 1-nearest-neighbor classification is less than twice the Bayes rate.
    - Assume the query point coincides with a training point: both the query point and the training point contribute error, hence up to 2 times the Bayes rate.
    - In particular, the asymptotic error rate is 0 if the Bayes rate is 0.

  • kNN Classification: Discussion
    1NN for earnings: 95.3% accuracy.
    kNN classification:
    - no training
    - needs an effective similarity measure (cosine, Euclidean distance, Value Difference Metric); performance is very dependent on the right similarity metric
    - is computationally expensive
    - is a robust and conceptually simple method

  • kNN vs. Regression
    - Bias/variance tradeoff: variance corresponds to capacity.
    - kNN has high variance and low bias; regression has low variance and high bias.
    - Consider: is an object a tree? (Burges)
      - Too much capacity/variance, low bias: a botanist who memorizes will always say no to a new object (e.g., a different number of leaves).
      - Not enough capacity/variance, high bias: a lazy botanist says yes if the object is green.

  • kNN: Discussion
    - Classification time is linear in the size of the training set.
    - No feature selection necessary.
    - Scales well with a large number of classes: we don't need to train n classifiers for n classes.
    - Classes can influence each other: small changes to one class can have a ripple effect.
    - Scores can be hard to convert to probabilities.
    - No training necessary. (Actually: not true. Why?)

  • Number of Neighbors

  • Support Vector Machines

  • Recall: Which Hyperplane?
    - In general, lots of possible solutions for a, b, c.
    - The Support Vector Machine (SVM) finds an optimal solution.

  • Support Vector Machine (SVM)
    - SVMs maximize the margin around the separating hyperplane.
    - The decision function is fully specified by a subset of the training samples, the support vectors.
    - Quadratic programming problem.
    - Text classification method du jour.

  • Maximum Margin: Formalization
    - w: hyperplane normal
    - x_i: data point i
    - y_i: class of data point i (+1 or -1)

    Constrained optimization formalization:
    (1) y_i (w·x_i + b) >= 1 for all i
    (2) maximize the margin: 2/||w||

  • Quadratic Programming
    One can show that the hyperplane w with maximum margin is:
    w = Σ_i a_i y_i x_i
    where:
    - a_i: Lagrange multipliers
    - x_i: data point i
    - y_i: class of data point i (+1 or -1)
    and the a_i are the solution to maximizing
    L = Σ_i a_i - (1/2) Σ_{i,j} a_i a_j y_i y_j (x_i·x_j), subject to a_i >= 0 and Σ_i a_i y_i = 0.
    Most a_i will be zero.

  • Building an SVM Classifier
    - Now we know how to build a separator for two linearly separable classes.
    - What about classes whose exemplary docs are not linearly separable?

  • Not Linearly Separable
    Find a line that penalizes points on the wrong side.

  • Penalizing Bad Points
    Define a distance for each point with respect to the separator ax + by = c:
    - (ax + by) - c for red points
    - c - (ax + by) for blue points.
    Negative for bad points.

  • Solve Quadratic Program
    - The solution gives the separator between the two classes: the choice of a, b.
    - Given a new point (x, y), we can score its proximity to each class by evaluating ax + by.
    - Set a confidence threshold (e.g., 3, 5, 7). (A usage sketch with an off-the-shelf SVM follows below.)
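    A usage sketch with scikit-learn's SVC (an external library, not mentioned on the slides): the library solves the quadratic program, and decision_function plays the role of the ax + by score compared against a threshold. The points are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 3.0], [3.0, 3.5], [0.5, 1.0], [1.0, 0.5]])   # toy 2-D points
y = np.array([1, 1, -1, -1])                                      # red = +1, blue = -1

clf = SVC(kernel="linear", C=10.0)   # soft margin; C penalizes points on the wrong side
clf.fit(X, y)

new_point = np.array([[2.5, 2.0]])
score = clf.decision_function(new_point)[0]   # signed proximity to the separator
print(score, "red" if score > 0 else "blue")
```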

  • Performance of SVM
    - SVMs are seen as the best-performing method by many.
    - The statistical significance of most results is not clear.
    - There are many methods that perform about as well as SVMs. Example: regularized regression (Zhang & Oles).
    - Example of a comparison study: Yang & Liu.

  • Yang&Liu: SVM vs Other Methods

  • Yang&Liu: Statistical Significance

  • Yang&Liu: Small Classes

  • Results for Kernels (Joachims)

  • SVM: Summary
    - SVMs have optimal or close to optimal performance.
    - Kernels are an elegant and efficient way to map data into a better representation.
    - SVMs can be expensive to train (quadratic programming).
    - If efficient training is important, and slightly suboptimal performance is OK, don't use SVMs.
    - For text, the linear kernel is common, so most SVMs are linear classifiers (like many others), but they find a (close to) optimal separating hyperplane.

  • SVM: Summary (cont.)
    - Model parameters are based on a small subset of the training data (the support vectors).
    - Based on structural risk minimization.
    - Supports kernels.

  • Resources
    - Manning and Schütze. Foundations of Statistical Natural Language Processing, Chapter 16. MIT Press.
    - Trevor Hastie, Robert Tibshirani and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag, New York.
    - Christopher J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery (1998).
    - R. M. Tong, L. A. Appelbaum, V. N. Askman, J. F. Cunningham. Conceptual Information Retrieval using RUBRIC. Proc. ACM SIGIR, 247-253 (1987).
    - S. T. Dumais. Using SVMs for text categorization. IEEE Intelligent Systems, 13(4), Jul/Aug 1998.
    - Yiming Yang, S. Slattery and R. Ghani. A study of approaches to hypertext categorization. Journal of Intelligent Information Systems, 18(2), March 2002.
    - Yiming Yang, Xin Liu. A re-examination of text categorization methods. 22nd Annual International SIGIR (1999).
    - Tong Zhang, Frank J. Oles. Text Categorization Based on Regularized Linear Classification Methods. Information Retrieval 4(1): 5-31 (2001).

    Bias/variance in terms of the resulting classifier given a randomly selected training set; why it is a tradeoff; when to choose a low-bias method, when to choose a low-variance method.