
Thumbs up? Sentiment Classification using Machine Learning Techniques

Bo Pang and Lillian Lee
Department of Computer Science
Cornell University, Ithaca, NY 14853 USA
{pabo,llee}@cs.cornell.edu

Shivakumar Vaithyanathan
IBM Almaden Research Center
650 Harry Rd., San Jose, CA 95120 USA
[email protected]

Abstract

We consider the problem of classifying documents not by topic, but by overall sentiment, e.g., determining whether a review is positive or negative. Using movie reviews as data, we find that standard machine learning techniques definitively outperform human-produced baselines. However, the three machine learning methods we employed (Naive Bayes, maximum entropy classification, and support vector machines) do not perform as well on sentiment classification as on traditional topic-based categorization. We conclude by examining factors that make the sentiment classification problem more challenging.

Publication info: Proceedings of EMNLP 2002, pp. 79–86.

1 Introduction

Today, very large amounts of information are available in on-line documents. As part of the effort to better organize this information for users, researchers have been actively investigating the problem of automatic text categorization.

The bulk of such work has focused on topical categorization, attempting to sort documents according to their subject matter (e.g., sports vs. politics). However, recent years have seen rapid growth in on-line discussion groups and review sites (e.g., the New York Times' Books web page) where a crucial characteristic of the posted articles is their sentiment, or overall opinion towards the subject matter — for example, whether a product review is positive or negative. Labeling these articles with their sentiment would provide succinct summaries to readers; indeed, these labels are part of the appeal and value-add of such sites as www.rottentomatoes.com, which both labels movie reviews that do not contain explicit rating indicators and normalizes the different rating schemes that individual reviewers use. Sentiment classification would also be helpful in business intelligence applications (e.g., MindfulEye's Lexant system [1]) and recommender systems (e.g., Terveen et al. (1997), Tatemura (2000)), where user input and feedback could be quickly summarized; indeed, in general, free-form survey responses given in natural language format could be processed using sentiment categorization. Moreover, there are also potential applications to message filtering; for example, one might be able to use sentiment information to recognize and discard "flames" (Spertus, 1997).

In this paper, we examine the effectiveness of applying machine learning techniques to the sentiment classification problem. A challenging aspect of this problem that seems to distinguish it from traditional topic-based classification is that while topics are often identifiable by keywords alone, sentiment can be expressed in a more subtle manner. For example, the sentence "How could anyone sit through this movie?" contains no single word that is obviously negative. (See Section 7 for more examples.) Thus, sentiment seems to require more understanding than the usual topic-based classification. So, apart from presenting our results obtained via machine learning techniques, we also analyze the problem to gain a better understanding of how difficult it is.

2 Previous Work

This section briefly surveys previous work on non-topic-based text categorization.

One area of research concentrates on classifying documents according to their source or source style, with statistically-detected stylistic variation (Biber, 1988) serving as an important cue. Examples include author, publisher (e.g., the New York Times vs. The Daily News), native-language background, and "brow" (e.g., high-brow vs. "popular", or low-brow) (Mosteller and Wallace, 1984; Argamon-Engelson et al., 1998; Tomokiyo and Jones, 2001; Kessler et al., 1997).

[1] http://www.mindfuleye.com/about/lexant.htm

Another, more related area of research is that of determining the genre of texts; subjective genres, such as "editorial", are often one of the possible categories (Karlgren and Cutting, 1994; Kessler et al., 1997; Finn et al., 2002). Other work explicitly attempts to find features indicating that subjective language is being used (Hatzivassiloglou and Wiebe, 2000; Wiebe et al., 2001). But, while techniques for genre categorization and subjectivity detection can help us recognize documents that express an opinion, they do not address our specific classification task of determining what that opinion actually is.

Most previous research on sentiment-based classification has been at least partially knowledge-based. Some of this work focuses on classifying the semantic orientation of individual words or phrases, using linguistic heuristics or a pre-selected set of seed words (Hatzivassiloglou and McKeown, 1997; Turney and Littman, 2002). Past work on sentiment-based categorization of entire documents has often involved either the use of models inspired by cognitive linguistics (Hearst, 1992; Sack, 1994) or the manual or semi-manual construction of discriminant-word lexicons (Huettner and Subasic, 2000; Das and Chen, 2001; Tong, 2001). Interestingly, our baseline experiments, described in Section 4, show that humans may not always have the best intuition for choosing discriminating words.

Turney's (2002) work on classification of reviews is perhaps the closest to ours [2]. He applied a specific unsupervised learning technique based on the mutual information between document phrases and the words "excellent" and "poor", where the mutual information is computed using statistics gathered by a search engine. In contrast, we utilize several completely prior-knowledge-free supervised machine learning methods, with the goal of understanding the inherent difficulty of the task.

3 The Movie-Review Domain

For our experiments, we chose to work with movie reviews. This domain is experimentally convenient because there are large on-line collections of such reviews, and because reviewers often summarize their overall sentiment with a machine-extractable rating indicator, such as a number of stars; hence, we did not need to hand-label the data for supervised learning or evaluation purposes. We also note that Turney (2002) found movie reviews to be the most difficult of several domains for sentiment classification, reporting an accuracy of 65.83% on a 120-document set (random-choice performance: 50%). But we stress that the machine learning methods and features we use are not specific to movie reviews, and should be easily applicable to other domains as long as sufficient training data exists.

[2] Indeed, although our choice of title was completely independent of his, our selections were eerily similar.

Our data source was the Internet Movie Database (IMDb) archive of the rec.arts.movies.reviews newsgroup [3]. We selected only reviews where the author rating was expressed either with stars or some numerical value (other conventions varied too widely to allow for automatic processing). Ratings were automatically extracted and converted into one of three categories: positive, negative, or neutral. For the work described in this paper, we concentrated only on discriminating between positive and negative sentiment. To avoid domination of the corpus by a small number of prolific reviewers, we imposed a limit of fewer than 20 reviews per author per sentiment category, yielding a corpus of 752 negative and 1301 positive reviews, with a total of 144 reviewers represented. This dataset will be available on-line at http://www.cs.cornell.edu/people/pabo/movie-review-data/ (the URL contains hyphens only around the word "review").

4 A Closer Look At the Problem

Intuitions seem to differ as to the difficulty of the sentiment detection problem. An expert on using machine learning for text categorization predicted relatively low performance for automatic methods. On the other hand, it seems that distinguishing positive from negative reviews is relatively easy for humans, especially in comparison to the standard text categorization problem, where topics can be closely related. One might also suspect that there are certain words people tend to use to express strong sentiments, so that it might suffice to simply produce a list of such words by introspection and rely on them alone to classify the texts.

To test this latter hypothesis, we asked two graduate students in computer science to (independently) choose good indicator words for positive and negative sentiments in movie reviews. Their selections, shown in Figure 1, seem intuitively plausible. We then converted their responses into simple decision procedures that essentially count the number of the proposed positive and negative words in a given document. We applied these procedures to uniformly-distributed data, so that the random-choice baseline result would be 50%.

[3] http://reviews.imdb.com/Reviews/


Proposed word lists                                                 Accuracy   Ties
Human 1   positive: dazzling, brilliant, phenomenal, excellent,       58%      75%
                    fantastic
          negative: suck, terrible, awful, unwatchable, hideous
Human 2   positive: gripping, mesmerizing, riveting, spectacular,     64%      39%
                    cool, awesome, thrilling, badass, excellent,
                    moving, exciting
          negative: bad, cliched, sucks, boring, stupid, slow

Figure 1: Baseline results for human word lists. Data: 700 positive and 700 negative reviews.

Proposed word lists                                                 Accuracy   Ties
Human 3   positive: love, wonderful, best, great, superb, still,      69%      16%
+ stats             beautiful
          negative: bad, worst, stupid, waste, boring, ?, !

Figure 2: Results for baseline using introspection and simple statistics of the data (including test data).

As shown in Figure 1, the accuracies — percentage of documents classified correctly — for the human-based classifiers were 58% and 64%, respectively [4]. Note that the tie rates — percentage of documents where the two sentiments were rated equally likely — are quite high [5] (we chose a tie-breaking policy that maximized the accuracy of the baselines).

While the tie rates suggest that the brevity of the human-produced lists is a factor in the relatively poor performance results, it is not the case that size alone necessarily limits accuracy. Based on a very preliminary examination of frequency counts in the entire corpus (including test data) plus introspection, we created a list of seven positive and seven negative words (including punctuation), shown in Figure 2. As that figure indicates, using these words raised the accuracy to 69%. Also, although this third list is of comparable length to the other two, it has a much lower tie rate of 16%. We further observe that some of the items in this third list, such as "?" or "still", would probably not have been proposed as possible candidates merely through introspection, although upon reflection one sees their merit (the question mark tends to occur in sentences like "What was the director thinking?"; "still" appears in sentences like "Still, though, it was worth seeing").

We conclude from these preliminary experiments that it is worthwhile to explore corpus-based techniques, rather than relying on prior intuitions, to select good indicator features and to perform sentiment classification in general. These experiments also provide us with baselines for experimental comparison; in particular, the third baseline of 69% might actually be considered somewhat difficult to beat, since it was achieved by examination of the test data (although our examination was rather cursory; we do not claim that our list was the optimal set of fourteen words).

[4] Later experiments using these words as features for machine learning methods did not yield better results.

[5] This is largely due to 0-0 ties.
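To make the baseline concrete, the following is a minimal sketch of such a counting procedure in Python, using Human 1's lists from Figure 1; the function name and the tie_break parameter are illustrative, since the paper says only that a tie-breaking policy was chosen to maximize baseline accuracy.

    # Word-counting baseline: pick the sentiment whose proposed words
    # occur more often; fall back to a fixed tie-breaking choice.
    POSITIVE = {"dazzling", "brilliant", "phenomenal", "excellent", "fantastic"}
    NEGATIVE = {"suck", "terrible", "awful", "unwatchable", "hideous"}

    def word_list_classify(tokens, tie_break="positive"):
        pos = sum(tok in POSITIVE for tok in tokens)
        neg = sum(tok in NEGATIVE for tok in tokens)
        if pos != neg:
            return "positive" if pos > neg else "negative"
        return tie_break  # ties (often 0-0 with such short lists) are frequent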

5 Machine Learning Methods

Our aim in this work was to examine whether it suffices to treat sentiment classification simply as a special case of topic-based categorization (with the two "topics" being positive sentiment and negative sentiment), or whether special sentiment-categorization methods need to be developed. We experimented with three standard algorithms: Naive Bayes classification, maximum entropy classification, and support vector machines. The philosophies behind these three algorithms are quite different, but each has been shown to be effective in previous text categorization studies.

To implement these machine learning algorithms on our document data, we used the following standard bag-of-features framework. Let {f_1, ..., f_m} be a predefined set of m features that can appear in a document; examples include the word "still" or the bigram "really stinks". Let n_i(d) be the number of times f_i occurs in document d. Then, each document d is represented by the document vector \vec{d} := (n_1(d), n_2(d), ..., n_m(d)).
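A minimal sketch of this representation, assuming documents arrive as token lists and the m features were fixed in advance (the names are ours, not the paper's):

    # Map a tokenized document to its count vector (n_1(d), ..., n_m(d)).
    from collections import Counter

    def doc_vector(tokens, feature_index):
        """feature_index maps each feature f_i to its position i."""
        counts = Counter(tokens)          # n_i(d) for unigram-style features
        vec = [0] * len(feature_index)
        for f, n in counts.items():
            if f in feature_index:
                vec[feature_index[f]] = n
        return vec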

5.1 Naive Bayes

One approach to text classification is to assign to a given document d the class c* = arg max_c P(c | d). We derive the Naive Bayes (NB) classifier by first observing that by Bayes' rule,

P(c | d) = \frac{P(c)\, P(d | c)}{P(d)},

where P(d) plays no role in selecting c*. To estimate the term P(d | c), Naive Bayes decomposes it by assuming the f_i's are conditionally independent given d's class:

P_{NB}(c | d) := \frac{P(c) \left( \prod_{i=1}^{m} P(f_i | c)^{n_i(d)} \right)}{P(d)}.

Our training method consists of relative-frequency estimation of P(c) and P(f_i | c), using add-one smoothing.

Despite its simplicity and the fact that its conditional independence assumption clearly does not hold in real-world situations, Naive Bayes-based text categorization still tends to perform surprisingly well (Lewis, 1998); indeed, Domingos and Pazzani (1997) show that Naive Bayes is optimal for certain problem classes with highly dependent features. On the other hand, more sophisticated algorithms might (and often do) yield better results; we examine two such algorithms next.
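The following is a minimal sketch of the classifier just described: relative-frequency estimation of P(c) and P(f_i | c) with add-one smoothing, classifying by the (log of the) numerator of P_NB. Class and variable names are illustrative, not the paper's implementation.

    import math
    from collections import Counter

    class NaiveBayes:
        def fit(self, docs, labels):          # docs: lists of feature tokens
            self.classes = sorted(set(labels))
            self.vocab = {f for doc in docs for f in doc}
            self.prior = Counter(labels)
            self.counts = {c: Counter() for c in self.classes}
            for doc, c in zip(docs, labels):
                self.counts[c].update(doc)
            self.totals = {c: sum(self.counts[c].values()) for c in self.classes}

        def predict(self, doc):
            n_docs, V = sum(self.prior.values()), len(self.vocab)
            def score(c):   # log P(c) + sum_i n_i(d) log P(f_i | c)
                s = math.log(self.prior[c] / n_docs)
                for f, n in Counter(doc).items():
                    s += n * math.log((self.counts[c][f] + 1) / (self.totals[c] + V))
                return s
            return max(self.classes, key=score)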

5.2 Maximum Entropy

Maximum entropy classification (MaxEnt, or ME, for short) is an alternative technique which has proven effective in a number of natural language processing applications (Berger et al., 1996). Nigam et al. (1999) show that it sometimes, but not always, outperforms Naive Bayes at standard text classification. Its estimate of P(c | d) takes the following exponential form:

P_{ME}(c | d) := \frac{1}{Z(d)} \exp\left( \sum_i \lambda_{i,c} F_{i,c}(d, c) \right),

where Z(d) is a normalization function. F_{i,c} is a feature/class function for feature f_i and class c, defined as follows [6]:

F_{i,c}(d, c') := \begin{cases} 1, & n_i(d) > 0 \text{ and } c' = c \\ 0, & \text{otherwise.} \end{cases}

For instance, a particular feature/class function might fire if and only if the bigram "still hate" appears and the document's sentiment is hypothesized to be negative [7]. Importantly, unlike Naive Bayes, MaxEnt makes no assumptions about the relationships between features, and so might potentially perform better when conditional independence assumptions are not met.

The λ_{i,c}'s are feature-weight parameters; inspection of the definition of P_{ME} shows that a large λ_{i,c} means that f_i is considered a strong indicator for class c. The parameter values are set so as to maximize the entropy of the induced distribution (hence the classifier's name) subject to the constraint that the expected values of the feature/class functions with respect to the model are equal to their expected values with respect to the training data: the underlying philosophy is that we should choose the model making the fewest assumptions about the data while still remaining consistent with it, which makes intuitive sense. We use ten iterations of the improved iterative scaling algorithm (Della Pietra et al., 1997) for parameter training (this was a sufficient number of iterations for convergence of training-data accuracy), together with a Gaussian prior to prevent overfitting (Chen and Rosenfeld, 2000).

[6] We use a restricted definition of feature/class functions so that MaxEnt relies on the same sort of feature information as Naive Bayes.

[7] The dependence on class is necessary for parameter induction. See Nigam et al. (1999) for additional motivation.
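A sketch of the resulting decision rule under the binary feature/class functions above, assuming the weights λ_{i,c} have already been trained (the improved-iterative-scaling step itself is omitted); the dictionary-based representation is an assumption of ours:

    import math

    def maxent_probs(present_features, classes, lam):
        """present_features: set of f_i with n_i(d) > 0; lam: (f, c) -> weight."""
        # F_{i,c} fires iff f_i occurs in d and the hypothesized class is c,
        # so the exponent reduces to a sum of weights over present features.
        scores = {c: sum(lam.get((f, c), 0.0) for f in present_features)
                  for c in classes}
        z = sum(math.exp(s) for s in scores.values())   # Z(d)
        return {c: math.exp(s) / z for c, s in scores.items()}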

5.3 Support Vector Machines

Support vector machines (SVMs) have been shown to be highly effective at traditional text categorization, generally outperforming Naive Bayes (Joachims, 1998). They are large-margin, rather than probabilistic, classifiers, in contrast to Naive Bayes and MaxEnt. In the two-category case, the basic idea behind the training procedure is to find a hyperplane, represented by vector \vec{w}, that not only separates the document vectors in one class from those in the other, but for which the separation, or margin, is as large as possible. This search corresponds to a constrained optimization problem; letting c_j ∈ {1, −1} (corresponding to positive and negative) be the correct class of document d_j, the solution can be written as

\vec{w} := \sum_j \alpha_j c_j \vec{d}_j, \quad \alpha_j \ge 0,

where the α_j's are obtained by solving a dual optimization problem. Those \vec{d}_j such that α_j is greater than zero are called support vectors, since they are the only document vectors contributing to \vec{w}. Classification of test instances consists simply of determining which side of \vec{w}'s hyperplane they fall on.

We used Joachims's (1999) SVM^light package [8] for training and testing, with all parameters set to their default values, after first length-normalizing the document vectors, as is standard (neglecting to normalize generally hurt performance slightly).
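As a rough modern stand-in for this setup (an assumption of ours, not the paper's tooling), a sketch using scikit-learn's LinearSVC on length-normalized presence vectors:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.preprocessing import normalize
    from sklearn.svm import LinearSVC

    texts = ["a dazzling , excellent film", "boring and stupid"]  # toy data
    labels = [1, -1]                             # 1 = positive, -1 = negative

    vec = CountVectorizer(binary=True)           # feature presence, not frequency
    X = normalize(vec.fit_transform(texts))      # length-normalize document vectors
    clf = LinearSVC().fit(X, labels)
    print(clf.predict(normalize(vec.transform(["an excellent , moving story"]))))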

6 Evaluation

6.1 Experimental Set-up

We used documents from the movie-review corpus described in Section 3. To create a data set with uniform class distribution (studying the effect of skewed class distributions was out of the scope of this study), we randomly selected 700 positive-sentiment and 700 negative-sentiment documents.

[8] http://svmlight.joachims.org


Features               # of features   frequency or presence?   NB      ME      SVM
(1) unigrams               16165       freq.                    78.7*   N/A     72.8
(2) unigrams               16165       pres.                    81.0    80.4    82.9*
(3) unigrams+bigrams       32330       pres.                    80.6    80.8    82.7*
(4) bigrams                16165       pres.                    77.3    77.4*   77.1
(5) unigrams+POS           16695       pres.                    81.5    80.4    81.9*
(6) adjectives              2633       pres.                    77.0    77.7*   75.1
(7) top 2633 unigrams       2633       pres.                    80.3    81.0    81.4*
(8) unigrams+position      22430       pres.                    81.0    80.1    81.6*

Figure 3: Average three-fold cross-validation accuracies, in percent. Asterisk: best performance for a given setting (row). Recall that our baseline results ranged from 50% to 69%.

We then divided this data into three equal-sized folds, maintaining balanced class distributions in each fold. (We did not use a larger number of folds due to the slowness of the MaxEnt training procedure.) All results reported below, as well as the baseline results from Section 4, are the average three-fold cross-validation results on this data (of course, the baseline algorithms had no parameters to tune).
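One plausible way to build such balanced folds follows; the paper does not spell out the splitting procedure, so this sketch is an assumption.

    import random

    def balanced_folds(pos_docs, neg_docs, k=3, seed=0):
        rng = random.Random(seed)
        rng.shuffle(pos_docs)
        rng.shuffle(neg_docs)
        # fold i takes every k-th document from each class, keeping the
        # class distribution balanced within each fold
        return [(pos_docs[i::k], neg_docs[i::k]) for i in range(k)]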

To prepare the documents, we automatically removed the rating indicators and extracted the textual information from the original HTML document format, treating punctuation as separate lexical items. No stemming or stoplists were used.

One unconventional step we took was to attempt to model the potentially important contextual effect of negation: clearly "good" and "not very good" indicate opposite sentiment orientations. Adapting a technique of Das and Chen (2001), we added the tag NOT to every word between a negation word ("not", "isn't", "didn't", etc.) and the first punctuation mark following the negation word. (Preliminary experiments indicate that removing the negation tag had a negligible, but on average slightly harmful, effect on performance.)
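A minimal sketch of this tagging step; the negation list and the NOT_ spelling are illustrative rather than the paper's exact conventions.

    import re

    NEGATIONS = {"not", "isn't", "didn't", "no", "never"}   # assumed list
    PUNCT = re.compile(r"^[.,!?;:]$")

    def add_not_tags(tokens):
        # prefix NOT_ to every token between a negation word and the
        # first punctuation mark that follows it
        tagged, negating = [], False
        for tok in tokens:
            if PUNCT.match(tok):
                negating = False
                tagged.append(tok)
            elif negating:
                tagged.append("NOT_" + tok)
            else:
                tagged.append(tok)
                if tok.lower() in NEGATIONS:
                    negating = True
        return tagged

    # add_not_tags("this is not very good .".split())
    # -> ['this', 'is', 'not', 'NOT_very', 'NOT_good', '.']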

For this study, we focused on features based on unigrams (with negation tagging) and bigrams. Because training MaxEnt is expensive in the number of features, we limited consideration to (1) the 16165 unigrams appearing at least four times in our 1400-document corpus (lower count cutoffs did not yield significantly different results), and (2) the 16165 bigrams occurring most often in the same data (the selected bigrams all occurred at least seven times). Note that we did not add negation tags to the bigrams, since we consider bigrams (and n-grams in general) to be an orthogonal way to incorporate context.

6.2 Results

Initial unigram results   The classification accuracies resulting from using only unigrams as features are shown in line (1) of Figure 3. As a whole, the machine learning algorithms clearly surpass the random-choice baseline of 50%. They also handily beat our two human-selected-unigram baselines of 58% and 64%, and, furthermore, perform well in comparison to the 69% baseline achieved via limited access to the test-data statistics, although the improvement in the case of SVMs is not so large.

On the other hand, in topic-based classification, all three classifiers have been reported to use bag-of-unigram features to achieve accuracies of 90% and above for particular categories (Joachims, 1998; Nigam et al., 1999) [9] — and such results are for settings with more than two classes. This provides suggestive evidence that sentiment categorization is more difficult than topic classification, which corresponds to the intuitions of the text categorization expert mentioned above [10]. Nonetheless, we still wanted to investigate ways to improve our sentiment categorization results; these experiments are reported below.

[9] Joachims (1998) used stemming and stoplists; in some of their experiments, Nigam et al. (1999), like us, did not.

[10] We could not perform the natural experiment of attempting topic-based categorization on our data because the only obvious topics would be the film being reviewed; unfortunately, in our data, the maximum number of reviews per movie is 27, too small for meaningful results.

Feature frequency vs. presence   Recall that we represent each document d by a feature-count vector (n_1(d), ..., n_m(d)). However, the definition of the MaxEnt feature/class functions F_{i,c} only reflects the presence or absence of a feature, rather than directly incorporating feature frequency. In order to investigate whether reliance on frequency information could account for the higher accuracies of Naive Bayes and SVMs, we binarized the document vectors, setting n_i(d) to 1 if and only if feature f_i appears in d, and reran Naive Bayes and SVM^light on these new vectors [11].
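The binarization itself is a one-line transformation of the document vectors from Section 5 (a sketch; the function name is ours):

    def binarize(vec):
        # clip each count n_i(d) to 1, recording only whether f_i appears
        return [1 if n > 0 else 0 for n in vec]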

As can be seen from line (2) of Figure 3, better performance (much better performance for SVMs) is achieved by accounting only for feature presence, not feature frequency. Interestingly, this is in direct opposition to the observations of McCallum and Nigam (1998) with respect to Naive Bayes topic classification. We speculate that this indicates a difference between sentiment and topic categorization — perhaps due to topic being conveyed mostly by particular content words that tend to be repeated — but this remains to be verified. In any event, as a result of this finding, we did not incorporate frequency information into Naive Bayes and SVMs in any of the following experiments.

Bigrams   In addition to looking specifically for negation words in the context of a word, we also studied the use of bigrams to capture more context in general. Note that bigrams and unigrams are surely not conditionally independent, meaning that the feature set they comprise violates Naive Bayes' conditional-independence assumptions; on the other hand, recall that this does not imply that Naive Bayes will necessarily do poorly (Domingos and Pazzani, 1997).

Line (3) of the results table shows that bigram information does not improve performance beyond that of unigram presence, although adding in the bigrams does not seriously impact the results, even for Naive Bayes. This does not rule out the possibility that bigram presence is as useful a feature as unigram presence; in fact, Pedersen (2001) found that bigrams alone can be effective features for word sense disambiguation. However, comparing line (4) to line (2) shows that relying just on bigrams causes accuracy to decline by as much as 5.8 percentage points. Hence, if context is in fact important, as our intuitions suggest, bigrams are not effective at capturing it in our setting.

[11] Alternatively, we could have tried integrating frequency information into MaxEnt. However, feature/class functions are traditionally defined as binary (Berger et al., 1996); hence, explicitly incorporating frequencies would require different functions for each count (or count bin), making training impractical. But cf. Nigam et al. (1999).

Parts of speech   We also experimented with appending POS tags to every word via Oliver Mason's Qtag program [12]. This serves as a crude form of word sense disambiguation (Wilks and Stevenson, 1998): for example, it would distinguish the different usages of "love" in "I love this movie" (indicating sentiment orientation) versus "This is a love story" (neutral with respect to sentiment). However, the effect of this information seems to be a wash: as depicted in line (5) of Figure 3, the accuracy improves slightly for Naive Bayes but declines for SVMs, and the performance of MaxEnt is unchanged.

Since adjectives have been a focus of previous work in sentiment detection (Hatzivassiloglou and Wiebe, 2000; Turney, 2002) [13], we looked at the performance of using adjectives alone. Intuitively, we might expect that adjectives carry a great deal of information regarding a document's sentiment; indeed, the human-produced lists from Section 4 contain almost no other parts of speech. Yet, the results, shown in line (6) of Figure 3, are relatively poor: the 2633 adjectives provide less useful information than unigram presence. Indeed, line (7) shows that simply using the 2633 most frequent unigrams is a better choice, yielding performance comparable to that of using (the presence of) all 16165 (line (2)). This may imply that applying explicit feature-selection algorithms on unigrams could improve performance.

Position   An additional intuition we had was that the position of a word in the text might make a difference: movie reviews, in particular, might begin with an overall sentiment statement, proceed with a plot discussion, and conclude by summarizing the author's views. As a rough approximation to determining this kind of structure, we tagged each word according to whether it appeared in the first quarter, last quarter, or middle half of the document [14]. The results (line (8)) did not differ greatly from using unigrams alone, but more refined notions of position might be more successful.
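A sketch of one plausible implementation of this positional tagging; the tag names are ours, not the paper's.

    def add_position_tags(tokens):
        n = len(tokens)
        tagged = []
        for i, tok in enumerate(tokens):
            if i < n / 4:
                tagged.append(tok + "_FIRST")   # first quarter
            elif i >= 3 * n / 4:
                tagged.append(tok + "_LAST")    # last quarter
            else:
                tagged.append(tok + "_MID")     # middle half
        return tagged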

7 Discussion

The results produced via machine learning techniques are quite good in comparison to the human-generated baselines discussed in Section 4. In terms of relative performance, Naive Bayes tends to do the worst and SVMs tend to do the best, although the differences are not very large.

[12] http://www.english.bham.ac.uk/staff/oliver/software/tagger/index.htm

[13] Turney's (2002) unsupervised algorithm uses bigrams containing an adjective or an adverb.

[14] We tried a few other settings, e.g., first third vs. last third vs. middle third, and found them to be less effective.


On the other hand, we were not able to achieve accuracies on the sentiment classification problem comparable to those reported for standard topic-based categorization, despite the several different types of features we tried. Unigram presence information turned out to be the most effective; in fact, none of the alternative features we employed provided consistently better performance once unigram presence was incorporated. Interestingly, though, the superiority of presence information in comparison to frequency information in our setting contradicts previous observations made in topic-classification work (McCallum and Nigam, 1998).

What accounts for these two differences — difficulty and types of information proving useful — between topic and sentiment classification, and how might we improve the latter? To answer these questions, we examined the data further. (All examples below are drawn from the full 2053-document corpus.)

As it turns out, a common phenomenon in the documents was a kind of "thwarted expectations" narrative, where the author sets up a deliberate contrast to earlier discussion: for example, "This film should be brilliant. It sounds like a great plot, the actors are first grade, and the supporting cast is good as well, and Stallone is attempting to deliver a good performance. However, it can't hold up" or "I hate the Spice Girls. ...[3 things the author hates about them]... Why I saw this movie is a really, really, really long story, but I did, and one would think I'd despise every minute of it. But... Okay, I'm really ashamed of it, but I enjoyed it. I mean, I admit it's a really awful movie ...the ninth floor of hell...The plot is such a mess that it's terrible. But I loved it." [15]

In these examples, a human would easily detect the true sentiment of the review, but bag-of-features classifiers would presumably find these instances difficult, since there are many words indicative of the opposite sentiment to that of the entire review. Fundamentally, it seems that some form of discourse analysis is necessary (using more sophisticated techniques than our positional feature mentioned above), or at least some way of determining the focus of each sentence, so that one can decide when the author is talking about the film itself. (Turney (2002) makes a similar point, noting that for reviews, "the whole is not necessarily the sum of the parts".) Furthermore, it seems likely that this thwarted-expectations rhetorical device will appear in many types of texts (e.g., editorials) devoted to expressing an overall opinion about some topic. Hence, we believe that an important next step is the identification of features indicating whether sentences are on-topic (which is a kind of co-reference problem); we look forward to addressing this challenge in future work.

[15] This phenomenon is related to another common theme, that of "a good actor trapped in a bad movie": "AN AMERICAN WEREWOLF IN PARIS is a failed attempt... Julie Delpy is far too good for this movie. She imbues Serafine with spirit, spunk, and humanity. This isn't necessarily a good thing, since it prevents us from relaxing and enjoying AN AMERICAN WEREWOLF IN PARIS as a completely mindless, campy entertainment experience. Delpy's injection of class into an otherwise classless production raises the specter of what this film could have been with a better script and a better cast ... She was radiant, charismatic, and effective ...."

Acknowledgments

We thank Joshua Goodman, Thorsten Joachims, Jon Kleinberg, Vikas Krishna, John Lafferty, Jussi Myllymaki, Phoebe Sengers, Richard Tong, Peter Turney, and the anonymous reviewers for many valuable comments and helpful suggestions, and Hubie Chen and Tony Faradjian for participating in our baseline experiments. Portions of this work were done while the first author was visiting IBM Almaden. This paper is based upon work supported in part by the National Science Foundation under ITR/IM grant IIS-0081334. Any opinions, findings, and conclusions or recommendations expressed above are those of the authors and do not necessarily reflect the views of the National Science Foundation.

References

Shlomo Argamon-Engelson, Moshe Koppel, and Galit Avneri. 1998. Style-based text categorization: What newspaper am I reading? In Proc. of the AAAI Workshop on Text Categorization, pages 1–4.

Adam L. Berger, Stephen A. Della Pietra, and Vincent J. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71.

Douglas Biber. 1988. Variation across Speech and Writing. Cambridge University Press.

Stanley Chen and Ronald Rosenfeld. 2000. A survey of smoothing techniques for ME models. IEEE Trans. Speech and Audio Processing, 8(1):37–50.

Sanjiv Das and Mike Chen. 2001. Yahoo! for Amazon: Extracting market sentiment from stock message boards. In Proc. of the 8th Asia Pacific Finance Association Annual Conference (APFA 2001).

Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. 1997. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380–393.

Pedro Domingos and Michael J. Pazzani. 1997. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2-3):103–130.

Aidan Finn, Nicholas Kushmerick, and Barry Smyth. 2002. Genre classification and domain transfer for information filtering. In Proc. of the European Colloquium on Information Retrieval Research, pages 353–362, Glasgow.

Vasileios Hatzivassiloglou and Kathleen McKeown. 1997. Predicting the semantic orientation of adjectives. In Proc. of the 35th ACL/8th EACL, pages 174–181.

Vasileios Hatzivassiloglou and Janyce Wiebe. 2000. Effects of adjective orientation and gradability on sentence subjectivity. In Proc. of COLING.

Marti Hearst. 1992. Direction-based text interpretation as an information access refinement. In Paul Jacobs, editor, Text-Based Intelligent Systems. Lawrence Erlbaum Associates.

Alison Huettner and Pero Subasic. 2000. Fuzzy typing for document management. In ACL 2000 Companion Volume: Tutorial Abstracts and Demonstration Notes, pages 26–27.

Thorsten Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. In Proc. of the European Conference on Machine Learning (ECML), pages 137–142.

Thorsten Joachims. 1999. Making large-scale SVM learning practical. In Bernhard Scholkopf and Alexander Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 44–56. MIT Press.

Jussi Karlgren and Douglass Cutting. 1994. Recognizing text genres with simple metrics using discriminant analysis. In Proc. of COLING.

Brett Kessler, Geoffrey Nunberg, and Hinrich Schutze. 1997. Automatic detection of text genre. In Proc. of the 35th ACL/8th EACL, pages 32–38.

David D. Lewis. 1998. Naive (Bayes) at forty: The independence assumption in information retrieval. In Proc. of the European Conference on Machine Learning (ECML), pages 4–15. Invited talk.

Andrew McCallum and Kamal Nigam. 1998. A comparison of event models for Naive Bayes text classification. In Proc. of the AAAI-98 Workshop on Learning for Text Categorization, pages 41–48.

Frederick Mosteller and David L. Wallace. 1984. Applied Bayesian and Classical Inference: The Case of the Federalist Papers. Springer-Verlag.

Kamal Nigam, John Lafferty, and Andrew McCallum. 1999. Using maximum entropy for text classification. In Proc. of the IJCAI-99 Workshop on Machine Learning for Information Filtering, pages 61–67.

Ted Pedersen. 2001. A decision tree of bigrams is an accurate predictor of word sense. In Proc. of the Second NAACL, pages 79–86.

Warren Sack. 1994. On the computation of point of view. In Proc. of the Twelfth AAAI, page 1488. Student abstract.

Ellen Spertus. 1997. Smokey: Automatic recognition of hostile messages. In Proc. of Innovative Applications of Artificial Intelligence (IAAI), pages 1058–1065.

Junichi Tatemura. 2000. Virtual reviewers for collaborative exploration of movie reviews. In Proc. of the 5th International Conference on Intelligent User Interfaces, pages 272–275.

Loren Terveen, Will Hill, Brian Amento, David McDonald, and Josh Creter. 1997. PHOAKS: A system for sharing recommendations. Communications of the ACM, 40(3):59–62.

Laura Mayfield Tomokiyo and Rosie Jones. 2001. You're not from round here, are you? Naive Bayes detection of non-native utterance text. In Proc. of the Second NAACL, pages 239–246.

Richard M. Tong. 2001. An operational system for detecting and tracking opinions in on-line discussion. Workshop note, SIGIR 2001 Workshop on Operational Text Classification.

Peter D. Turney and Michael L. Littman. 2002. Unsupervised learning of semantic orientation from a hundred-billion-word corpus. Technical Report EGB-1094, National Research Council Canada.

Peter Turney. 2002. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proc. of the ACL.

Janyce M. Wiebe, Theresa Wilson, and Matthew Bell. 2001. Identifying collocations for recognizing opinions. In Proc. of the ACL/EACL Workshop on Collocation.

Yorick Wilks and Mark Stevenson. 1998. The grammar of sense: Using part-of-speech tags as a first step in semantic disambiguation. Journal of Natural Language Engineering, 4(2):135–144.