
844 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 15, NO. 2, APRIL 2014

Web-Based Traffic Sentiment Analysis: Methods and Applications

Jianping Cao, Ke Zeng, Hui Wang, Member, IEEE, Jiajun Cheng, Fengcai Qiao, Ding Wen, Senior Member, IEEE, and Yanqing Gao, Member, IEEE

Abstract—With the boom of social media, sentiment analysis has developed rapidly in recent years. However, only a few studies have focused on the field of transportation, and these fail to meet the stringent requirements of safety, efficiency, and information exchange of intelligent transportation systems (ITSs). We propose traffic sentiment analysis (TSA) as a new tool to tackle this problem, which provides a new perspective for modern ITSs. Methods and models in TSA are proposed in this paper, and the advantages and disadvantages of rule- and learning-based approaches are analyzed based on web data. Practically, we applied the rule-based approach to deal with real problems, presented an architectural design, constructed related bases, demonstrated the process, and discussed the online data collection. Two cases were studied to demonstrate the efficiency of our method: the “yellow light rule” and “fuel price” in China. Our work will help the development of TSA and its applications.

Index Terms—Rule base, sentiment analysis, sentiment base, Web-based.

I. INTRODUCTION

TRANSPORTATION systems essentially serve the people, but modern intelligent transportation systems (ITSs) have failed to take public opinion into account. For the completeness of the ITS space, it is necessary to collect and analyze public wisdom and opinions. With the remarkable advancement of Web 2.0 in the last decade, communication platforms, such as blogs, wikis, online forums, and social-networking groups, have become a rich data-mining source for the detection of public opinions [1]–[4]. Therefore, we propose traffic sentiment analysis (TSA) for processing traffic information from websites. By taking human affect into consideration, TSA will enrich the performance of the current ITS space.

TSA is a subfield of sentiment analysis that deals specifically with traffic issues. Because sentiment analysis is sensitive to the field in which it is applied [5], it is necessary to discuss TSA problems and construct TSA systems specifically. TSA treats traffic problems from a new angle and supplements the capabilities of current ITSs. Fig. 1 illustrates the modules of an ITS and shows that TSA plays the role of sensing, computing, and supporting decision making in ITSs.

Manuscript received July 10, 2013; revised September 5, 2013 and October 22, 2013; accepted October 30, 2013. Date of publication December 13, 2013; date of current version March 28, 2014. This work was supported in part by Grant NNSFC 60872053 and in part by Grant 60902091. The Associate Editor for this paper was L. Li.

J. Cao, H. Wang, J. Cheng, F. Qiao, and D. Wen are with the National University of Defense Technology, Changsha 410073, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).

K. Zeng is with the School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an 710049, China (e-mail: [email protected]).

Y. Gao is with the State Key Laboratory of Intelligent Control and Management for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TITS.2013.2291241

Fig. 1. TSA system plays the role of sensing, computing, and supporting the decision making as a background platform of intelligent transportation space.

The functions of the TSA system are as follows. 1) Investigation: Collecting public opinion through the TSA system is more economical and efficient than a public poll. 2) Evaluation: The computational output of the TSA system can be used to evaluate the performance of traffic services and policies. 3) Prediction: The TSA system can be further developed to predict the trends of some social events. For example, to predict whether a cancelled flight would bring chaos, we can use TSA systems to analyze the emotions that passengers express in their posts on Twitter or Weibo. In addition, specific parts of the TSA system can be viewed as another form of “social sensors” [6], [7]. Compared with traditional sensor systems, it can detect the situation from a new, human-centered perspective. The TSA system is independent of current systems, which is particularly useful in an emergency when other systems have failed. For example, in 2009, the volcanic ash from Iceland caused the malfunction of many cameras in several European countries.

In this paper, by constructing a specific TSA system, we address the issues and methods in this field and present two cases to demonstrate the value of this research. Our contributions are as follows. 1) We propose TSA to view traffic problems from a new perspective. 2) We discuss the main issues of applying TSA to web data. 3) We address the key problems of TSA, including the design of the architecture, the improved rule-based approach, and the construction of related bases.

1524-9050 © 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


The paper is arranged as follows. The next section reviews the development of sentiment analysis in recent years. Section III discusses the main problems in TSA. Section IV presents the TSA process and architecture. Section V describes the data model and the methods of data collection. Section VI illustrates two cases of TSA. Finally, Section VII summarizes the study.

II. RELATED RESEARCH

This section presents the development of sentiment analysis in recent years. Since the first study in this area, which focused on the analysis of the semantic orientation of adjectives [8], techniques of sentiment analysis have been extensively used in text filtering, tracking of public opinion, and customer relationship management [4], [9]–[11]. Sentiment analysis mines affective information from data and recognizes the sentiment polarity contained in the information (e.g., happy or sad, approve or disapprove, and agree or disagree).

Previous studies have been classified according to different standards [5], [12]. Following Zhang et al., we discuss previous studies by their level of granularity, type of analytical technique, and language [12].

1) Level of Granularity: Previous studies address sentiment analysis at different levels of granularity, from the document level to the sentence level. For example, Pang et al. classified the sentiments of articles by adopting a standard bag-of-features framework, whose features are unigrams and bigrams of words [13]. Turney proposed an unsupervised learning algorithm known as pointwise mutual information and information retrieval (PMI-IR) to predict the semantic orientation of an article by calculating the similarity of its contained phrases to two reference words: “excellent” and “poor” [14]. Several recent studies have also considered the spread, density, and intensity of polar lexical terms to improve the performance of sentiment classification [15].
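A common statement of the PMI-IR score is given below as an illustrative restatement, not as a quotation from [14]. Here, $\mathrm{hits}(\cdot)$ is assumed to denote the number of documents a search engine returns for a query, and NEAR is assumed to be a proximity operator of that engine.

```latex
\mathrm{SO}(\mathit{phrase})
  = \mathrm{PMI}(\mathit{phrase}, \text{``excellent''})
  - \mathrm{PMI}(\mathit{phrase}, \text{``poor''})
  = \log_2 \frac{\mathrm{hits}(\mathit{phrase}\ \mathrm{NEAR}\ \text{``excellent''})\,
                 \mathrm{hits}(\text{``poor''})}
                {\mathrm{hits}(\mathit{phrase}\ \mathrm{NEAR}\ \text{``poor''})\,
                 \mathrm{hits}(\text{``excellent''})}
```

Under this formulation, a phrase with a positive score leans toward “excellent,” and an article is rated by the average score of its phrases.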

2) Type of Analytical Technique: Existing approaches to sentiment analysis can be categorized into rule- and learning-based approaches. Rule-based approaches often require an expert-defined dictionary of subjective words; they predict the polarity of a sentence or document by analyzing the patterns in which such words occur in the text [16]. For example, Wiebe et al. provided a lexicon of subjectivity clues, such as verbs, adjectives, and nouns, annotated with their polarity (i.e., positive, negative, or neutral) and strength (i.e., strong or weak) [17]. However, this lexicon defines only the original polarity of a word, and the actual polarity of a word may be modified by its context in a sentence. Several approaches that consider the context of words have been proposed to determine the sentiment orientation of words. Yuen et al. proposed an approach to deriving the semantic polarity of words on the basis of morphemes [18]. Knowledge sources, such as WordNet, have also been used to measure the semantic polarity of adjectives [19].

As to learning-based approaches, Hu and Liu [20] developed an approach to extracting opinion features from product reviews based on linguistic patterns called class sequential rules, which can be mined from a set of labeled training sequences of words and part-of-speech tags. Pang et al. [13] represented reviews as a bag of unigram/bigram features and applied three machine-learning methods to predict their sentiment. However, they found that machine-learning algorithms did not perform as well on sentiment classification as on traditional topic categorization tasks. In addition, learning-based sentiment classification requires sufficiently large training data sets with manually labeled positive and negative examples, which are often very costly and time consuming to produce [14].

3) Language: Most sentiment analysis studies have focused on the English language and achieved remarkable success in numerous applications. By contrast, Chinese sentiment analysis has not been sufficiently investigated [22]. The unique linguistic characteristics of the Chinese language pose several technical challenges for Chinese sentiment analysis. The primary challenge is that written Chinese does not separate words with spaces. Therefore, word segmentation is often required as an additional step in Chinese language processing [21]. In addition, the Chinese language contains various adverbs, whose use can lead to subtlety and ambiguity in sentences. English mainly uses suffixes to form comparatives and superlatives (-er and -est, respectively), whereas Chinese uses various degree adverbs, such as those meaning “more” and “most.” Thus, determining the sentiment polarity of Chinese sentences presents greater difficulty, particularly when multiple adverbs and subjectivity clues appear in one sentence. Moreover, considering the differences of contexts and the ambiguity of the Chinese language itself, a document that contains several positive words may convey a strongly negative tone, and vice versa.

III. ISSUES IN TSA

The primary problem of TSA is the selection of sentiment analysis approaches for Web-based data. Since the performance of both the rule- and learning-based approaches depends on the data to some extent, the features of web data should be identified first. The data discussed in this paper are gathered from online forums, blogs, and Weibo (Twitter-like websites in China); the properties of these data can be described as follows.

1) The lengths of the texts vary. Some texts contain thousands of words, whereas others consist of only one sentence that can be as short as one word. Weibo limits the number of Chinese characters to 140, but other social media do not have similar restrictions.

2) The stylistic features of the texts are diverse. Given that Web 2.0 provides an equal platform for each user, the words used for communication are not regulated, and texts do not follow specific norms. In addition, different users express themselves in various ways, hence the different features of expressions on the Web.

3) New Internet expressions frequently emerge. Over time, the same sentiment may be expressed in different ways. In extreme cases, the same word may carry a different sentiment polarity after certain public events.


Both the rule- and learning-based approaches can be applied in TSA. In the present study, we compare the advantages and disadvantages of using the two approaches on Web-based data. The advantage of the learning-based approach is that it does not need expert knowledge to build the related bases; instead, the classifier is simply trained without considering the context. However, because the sizes of the texts vary, the sparseness of the feature vectors varies across texts if the classifier is trained on them directly, so texts of very different lengths cannot be meaningfully compared. The texts should therefore first be categorized according to their sizes, and the document- and sentence-level texts should then be trained separately. In addition, as Turney [14] indicated, sufficiently large training data sets with positive and negative examples are required, which are often costly and time consuming to build. Moreover, the classification standard for each level of text must be carefully learned and investigated. As previously mentioned, expressions vary with the users and the time of publication. Therefore, the training data set can hardly cover the characteristics of the entire data set, which makes it difficult to ensure that the training data are representative.

As to the rule-based approach, the disadvantage is that the sentiment polarity results cannot be as precise as expected if the context of the texts is not considered. Nevertheless, for handling Chinese web data, this type of approach has the following advantages. First, the precision of the rule-based approach is independent of the sizes of the texts. Second, the syntax of a given language is basic and static despite the differences in the stylistic features of various users; the thought process and word choice basically remain unchanged. Therefore, the rules of the rule-based approach are relatively static. Finally, although new sentiment words rapidly emerge and the sentiment of some words may change over time, the rule-based approach can be easily extended by simply updating the sentiment lexicon. In this paper, we adopt the rule-based approach for the sentiment analysis of Chinese texts to illustrate the key issues of TSA.

IV. ARCHITECTURE AND PROCESSES

A. TSA Architecture

Previous studies on Chinese texts have devoted considerable effort to architectural design. Che et al. designed the architecture of the language technology platform (LTP), an integrated Chinese processing platform that includes a suite of high-performance natural language processing (NLP) modules and relevant corpora. They achieved good results in several relevant evaluations, particularly for the syntactic and semantic parsing modules [22]. Li et al. designed the architecture of a sentiment analysis application in the financial domain on the basis of morphemes [23].

A rule-based approach is adopted here to address the distinct challenges posed by the Chinese data set. Fig. 2 illustrates the architecture of TSA, which follows the processing pipeline. Its main components are 1) web data collection, 2) preprocessing, 3) extraction of subjects and objects, 4) extraction of sentiment properties, 5) sentiment calculation and classification, 6) evaluation or applications, and 7) feedback, which improves the construction of the sentiment, rule, and TSA object bases.

Fig. 2. Architecture of the rule-based TSA. The “evaluation” in the process part (middle block) denotes the evaluation of the algorithm, and the other in the bottom block denotes the evaluation of traffic situation.

Data collection: To address the problem, we gathered data from several websites, such as Sina Weibo, Tencent Weibo, Tianya, and autohome (the upper block in Fig. 2), so that the conclusions are based on public opinion or, at least, represent part of it [24]. The details of data collection are discussed in Section V.

Preprocessing: As previously mentioned, Chinese documents require additional processing because written Chinese does not separate words with spaces. Preprocessing includes the following steps: 1) segmentation of the text, 2) labeling of the words, and 3) replacement of synonymous expressions. The first two steps are performed by a Chinese segmentation tool; we employ the Chinese Lexical Analysis System 3 launched by the Chinese Academy of Sciences, Beijing, China, in 2011 [25]. In social media, various expressions denote the same meaning. For example, many users write “d,” which stands for a Chinese character meaning “support,” to express agreement with others. Therefore, the replacement of synonymous expressions (step 3) is necessary to reduce the complexity and increase the precision of the subsequent processes.
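The replacement step (step 3) can be pictured as a table lookup over the segmented tokens. The sketch below is only an illustration under stated assumptions: the table `SLANG_TO_CANONICAL` and the function `normalize_tokens` are hypothetical names, and the English word "support" stands in for the canonical Chinese word.

```python
# Hypothetical normalization table: web shorthand -> canonical expression.
# "support" is an English stand-in for the canonical Chinese word.
SLANG_TO_CANONICAL = {
    "d": "support",
}


def normalize_tokens(tokens, table=SLANG_TO_CANONICAL):
    """Replace every token that has a canonical form; keep the others as-is."""
    return [table.get(token.lower(), token) for token in tokens]


print(normalize_tokens(["d", "new", "rule"]))
```

Keeping the table separate from the code means that new Internet expressions can be handled by adding entries rather than changing the procedure.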

Word segmentation optimization: To avoid unnecessary disturbances and improve precision, preprocessing should be conducted according to the material and the demands of the algorithms [21]. In practice, however, the results of Chinese word segmentation often fall short of expectations, and in some cases this step may even reduce the precision. For example, the abbreviated name of one of the two Chinese oil giants is segmented incorrectly, although it is in fact a single company name.

Therefore, it is necessary to improve the performance of Chinese segmentation. In this paper, we propose constructing a “sentiment base” for the application of TSA. In practice, the “sentiment base” consists of the TSA sentiment base and HowNet (see Section IV-B).
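To illustrate why a domain lexicon helps segmentation, the sketch below uses forward maximum matching, a classic dictionary-based baseline. This is a swapped-in technique chosen for illustration, not the segmenter used in this paper, and the function name and the Latin-letter strings are hypothetical stand-ins for Chinese text.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy dictionary segmentation: take the longest lexicon entry at each position."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Without the domain term "abc", the name is broken apart.
print(forward_max_match("abcd", {"ab"}))
# After adding "abc" to the lexicon, it is kept as one token.
print(forward_max_match("abcd", {"ab", "abc"}))
```

In this toy setting, adding a company abbreviation to the lexicon is enough to keep it whole, which is the effect the domain bases are meant to achieve.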

Extraction of subjects and objects: Subjects and objects are mainly extracted by context mining and document analysis [26], [27]. In TSA, appropriate context-mining models should be designed according to the data sets and resources at hand. Context mining should obtain results as efficiently as possible to provide the necessary background knowledge for the subsequent steps. In practice, context mining includes conversation extraction and coreference analysis. Conversation extraction refers to handling conversational markers in the text, such as citations and “@” mentions. Coreference analysis refers to identifying the object that other words stand for. For example, an address in Sina Weibo is usually represented by a hyperlink.
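A minimal sketch of conversation extraction is shown below. It assumes that quoted reposts are chained with a `//@user:` marker and that user names consist of word characters and hyphens; both are assumptions about the platform's text format rather than details given in this paper, and the function names are hypothetical.

```python
import re

# Assumed formats: "@name" marks a mention, "//@name:" starts a quoted repost.
MENTION = re.compile(r"@([\w-]+)")
REPOST_BOUNDARY = re.compile(r"//(?=@)")


def split_conversation(text):
    """Split a post into the author's comment and the quoted segments."""
    return [part.strip() for part in REPOST_BOUNDARY.split(text) if part.strip()]


def extract_mentions(text):
    """Return the user names mentioned with '@' in the text."""
    return MENTION.findall(text)


post = "Agree! //@li_ming: too strict //@wang: new rule today http://t.cn/x"
print(split_conversation(post))
print(extract_mentions(post))
```

The lookahead in `REPOST_BOUNDARY` keeps the `//` inside URLs from being treated as a repost boundary.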

The second approach to extracting subjects and objects is text analysis, which extracts opinion-oriented information from the plain text. Riloff and Wiebe et al. [26], [28], [29] proposed a method that focuses mainly on extraction patterns, and we applied it to address the problems in Section VI.

Extraction of properties: The extraction of properties is based on the sentiment, modifier, and rule bases. We applied the three-step strategy proposed by Zhang et al. [12]. Here, we focus on how these bases are updated, which is the key point in practice. Since the topics and fashionable terms discussed online change quickly, the rule and object bases need to be updated over time. The rule base is relatively stable because the grammar of a language is relatively static; in this paper, we update it semiautomatically. With regard to the object base, given that topics change quickly, we should summarize the related topics and objects, as well as their attributes and components.

Evaluation: Before application, the approach should be evaluated against a scientifically constructed standard data set [30], [31]. The efficiency and precision of the algorithms are tested in this step. If the test performance is lower than expected, the appropriate words are identified and the related bases are updated accordingly.
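As a sketch of how predictions could be scored against a standard data set, the function below computes precision for one class, together with recall and the F1 score. The latter two are standard companions that we add as an assumption, since the text above names only efficiency and precision; the function name and labels are hypothetical.

```python
def precision_recall_f1(gold, predicted, target="positive"):
    """Score predictions for one class against gold labels of equal length."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == target and p == target for g, p in pairs)  # correct hits
    fp = sum(g != target and p == target for g, p in pairs)  # false alarms
    fn = sum(g == target and p != target for g, p in pairs)  # misses
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


gold = ["positive", "positive", "negative", "negative"]
pred = ["positive", "negative", "positive", "negative"]
print(precision_recall_f1(gold, pred))
```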

B. Related Bases

The fundamental work of the rule-based approach is to build the related bases. In this paper, we propose constructing the sentiment, modifier, object, and rule bases.

Sentiment base: The sentiment base consists of two tightly connected modules, i.e., the sentiment lexicon and the sentiment polarity of its words. With no available sentiment lexicon in the traffic domain, our primary task is to establish the sentiment lexicon [32]. First, we define the positive and negative seed sets as $Seed_{p0}$ and $Seed_{n0}$, respectively. They are then fed into the LTP constructed by Harbin Institute of Technology, Harbin, China [22]. The two seed sets are extended by finding the synonyms and antonyms of the seeds in the LTP. The new words are added to obtain $Seed_{p1}$ and $Seed_{n1}$. These two sets become the new input to the LTP, and the procedure is iterated $k$ times until the final sets $Seed_{pk}$ and $Seed_{nk}$ are constructed. However, the results are not as complete as expected in practice, and we added the sentiment words from the data set released by China National Knowledge Infrastructure (cnki.com) for completion [33]. Considering that several words have special meanings in the traffic area, such as the terms for “overload” and “U-turn,” we manually add the specific sentiment words of the traffic area. Finally, we construct a positive and a negative sentiment word lexicon with 4893 and 5416 words, respectively.
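The iterative seed expansion can be sketched as follows. The thesaurus lookups are modeled as plain dictionaries (`synonyms`, `antonyms`), which is a hypothetical stand-in for the LTP queries; the function name and the English toy words are also assumptions made for illustration.

```python
def expand_seeds(pos_seeds, neg_seeds, synonyms, antonyms, k):
    """Grow the positive and negative seed sets for at most k rounds.

    synonyms / antonyms map a word to an iterable of related words
    (a hypothetical stand-in for a thesaurus lookup).
    """
    pos, neg = set(pos_seeds), set(neg_seeds)
    for _ in range(k):
        new_pos, new_neg = set(pos), set(neg)
        for word in pos:
            new_pos.update(synonyms.get(word, ()))  # synonyms keep the polarity
            new_neg.update(antonyms.get(word, ()))  # antonyms flip the polarity
        for word in neg:
            new_neg.update(synonyms.get(word, ()))
            new_pos.update(antonyms.get(word, ()))
        if new_pos == pos and new_neg == neg:       # nothing new was found
            break
        pos, neg = new_pos, new_neg
    return pos, neg


synonyms = {"good": ["fine"], "fine": ["nice"], "bad": ["poor"]}
antonyms = {"good": ["bad"]}
print(expand_seeds({"good"}, {"bad"}, synonyms, antonyms, k=2))
```

The sketch does not resolve words that end up in both sets; in practice such conflicts would need manual review.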

Assume that the sentiment polarity of a word is determined by its morphemes. If the morphemes of a word appear in the positive lexicon more frequently than they do in the negative lexicon, the word is positive; otherwise, the word is negative. To measure the positive and negative tendencies of a morpheme $c_i$, we assign it positive and negative weights as follows:

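$$\mathrm{WeightP}_{c_i} = \frac{fp_{c_i} \Big/ \sum_{j=1}^{n} fp_{c_j}}{fp_{c_i} \Big/ \sum_{j=1}^{n} fp_{c_j} + fn_{c_i} \Big/ \sum_{j=1}^{n} fn_{c_j}} \tag{1}$$

$$\mathrm{WeightN}_{c_i} = \frac{fn_{c_i} \Big/ \sum_{j=1}^{n} fn_{c_j}}{fn_{c_i} \Big/ \sum_{j=1}^{n} fn_{c_j} + fp_{c_i} \Big/ \sum_{j=1}^{n} fp_{c_j}} \tag{2}$$

$$S_{c_i} = \mathrm{WeightP}_{c_i} - \mathrm{WeightN}_{c_i}. \tag{3}$$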

In formulas (1) and (2), $fp_{c_i}$ and $fn_{c_i}$ denote how often the morpheme $c_i$ appears in the positive and negative lexicons, respectively. In formula (3), the polarity $S_{c_i}$ depends on the morpheme $c_i$, and the absolute value of $S_{c_i}$ is the degree of tendency of $c_i$. The sentiment polarity of a word $w$ is calculated as follows. Scan the positive and negative word lexicons; if $w$ appears in the positive word lexicon, $S_w = 1$; if it appears in the negative word lexicon, $S_w = -1$. Otherwise, the sentiment polarity is computed from its morphemes by

$$S_w = \frac{1}{p} \sum_{j=1}^{p} S_{c_j} \tag{4}$$

where $S_w$ represents the sentiment polarity of the word $w$, which consists of the morphemes $c_1, c_2, \ldots, c_p$. If $S_w > 0$, the sentiment polarity of the word is positive; otherwise, the sentiment polarity of the word is negative. If the value obtained is close to zero, the word can be considered neutral.
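Formulas (1)-(4) can be sketched in a few lines. The sketch assumes that each character is one morpheme and that the frequency of a morpheme is the number of times it occurs across the words of a lexicon; Latin letters stand in for Chinese characters, and the function names are hypothetical.

```python
from collections import Counter


def morpheme_scores(pos_lexicon, neg_lexicon):
    """Return {morpheme: S_c} following formulas (1)-(3)."""
    fp = Counter(ch for word in pos_lexicon for ch in word)
    fn = Counter(ch for word in neg_lexicon for ch in word)
    total_p, total_n = sum(fp.values()), sum(fn.values())
    scores = {}
    for c in set(fp) | set(fn):
        rel_p = fp[c] / total_p if total_p else 0.0  # relative positive frequency
        rel_n = fn[c] / total_n if total_n else 0.0  # relative negative frequency
        weight_p = rel_p / (rel_p + rel_n)           # formula (1)
        weight_n = rel_n / (rel_p + rel_n)           # formula (2)
        scores[c] = weight_p - weight_n              # formula (3)
    return scores


def word_polarity(word, pos_lexicon, neg_lexicon, scores):
    """Lexicon lookup first; otherwise average the morpheme scores, formula (4)."""
    if word in pos_lexicon:
        return 1.0
    if word in neg_lexicon:
        return -1.0
    if not word:
        return 0.0
    # Morphemes never seen in either lexicon are treated as neutral (0).
    return sum(scores.get(c, 0.0) for c in word) / len(word)


pos, neg = {"ab", "ac"}, {"bd"}
scores = morpheme_scores(pos, neg)
print(word_polarity("cb", pos, neg, scores))
```

Every morpheme in the score table occurs in at least one lexicon, so the denominators in formulas (1) and (2) are never zero.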

Modifier base: In accordance with the previous assumptions, the original sentiment of a sentence is determined by its sentiment words. In addition, the sentiment is modified by adverbs. Negation adverbs reverse the sentiment polarity (e.g., “fast” is positive, but it becomes negative if preceded by the word “not”). Similarly, degree adverbs, which either strengthen or weaken the intensity of the sentiment polarity, must be considered as well. Sentence structure also affects the sentiment polarity value of a sentence; a complex sentence is modified by its relational schema. Table I shows the relationship between the classification of degree adverbs and the word correction value.

TABLE I
CLASSIFICATION OF DEGREE ADVERBS

In practice, we handle negation and degree adverbs by constructing an adverb lexicon that contains them. Following [34], different polarity modification grades are assigned to each word according to the level of the adverb (see Table I).

Some Chinese adverbs are ambiguous in how they modify sentiment and must be handled case by case. For example, one such adverb may be interpreted in two different ways (“very” and “over”). When it is used as “very” (e.g., in the phrase meaning “very good”), it is a degree adverb. However, when it is used as “over” (e.g., in the word meaning “overload”), it is a negative adverb and is assigned a grade of −1.

Semantic rule base: The construction of semantic rules is critically important in the rule-based approach because the sentiment of a Chinese text depends on the locations and collocations of sentiment words [35]. A semantic rule of sentiment is a pattern of sentiment words (S) and their modifiers, namely, negative words (N) and degree words (D); we refer to it as the SND pattern. Among the three factors, S is considered the most important. Therefore, we first select S from the sentence. The corresponding N and D are then located around S, and the SND model is established.

Each sentiment word has its own modifiers. The main relationships between a sentiment word and its modifiers depend on their locations and classes in the sentence. We observed the locations of words and the classes of sentiment words and modifiers in 10 000 posts selected from the Web. We found that sentiment words and their modifiers should be in the same sentence. Moreover, the distance between a sentiment word and its degree word should be less than five Chinese characters. Applying our understanding of Chinese sentence structure, we summarize the main rules governing the order and word classes of sentiment words and their modifiers, as shown in Table II. In the N+S rule, multiple negation is common in Chinese: an even number of negative words is equivalent to no negation, and an odd number of negative words is equivalent to one negative word.

In the N+D+S rule, N is the modifier of D, and N+D is the modifier of the sentiment word (S). Therefore, the characteristics of N+D+S are the same as those of D+S. However, in the D+N+S rule, the negative word (N) is the modifier of the sentiment word (S), and the degree word (D) is the modifier of N+S. S usually represents a verb or a noun.
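The pattern names above can be assigned by inspecting the tokens that precede a sentiment word. The sketch below is a simplification under stated assumptions: it looks only at the two tokens directly before S, ignores multiple negation and the five-character distance limit, and uses English words as stand-ins for Chinese tokens. The function name is hypothetical.

```python
def match_snd(tokens, i, negations, degrees):
    """Name the SND pattern formed by the sentiment word tokens[i]
    and up to two modifiers placed directly before it."""
    prev1 = tokens[i - 1] if i >= 1 else None
    prev2 = tokens[i - 2] if i >= 2 else None
    if prev1 in negations:
        # A degree word before the negation modifies the whole N+S unit.
        return "D+N+S" if prev2 in degrees else "N+S"
    if prev1 in degrees:
        # A negation before the degree word modifies the degree word.
        return "N+D+S" if prev2 in negations else "D+S"
    return "S"


negations, degrees = {"not"}, {"very"}
print(match_snd(["not", "very", "safe"], 2, negations, degrees))
print(match_snd(["very", "not", "safe"], 2, negations, degrees))
```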

In this paper, we use a sentiment polarity score to express the sentiment of a text. The score is calculated according to the rules defined by the sentiment patterns. Every sentiment word in our dictionary is assigned a predetermined value. In HowNet [36], the degree words are categorized into six intensity levels, and each degree word is assigned a value according to its intensity level. Let $p$ be the sentiment polarity score of an SND pattern, $p_s$ the score of the sentiment word S, and $p_d$ the value of the degree word D. The formulas for calculating the sentiment value are listed in Table II.

For example, the score of the word meaning “safe” is 2, and the value of the degree word meaning “very” is 2.1. Thus, the value of the phrase “very safe” is 4.2, according to the D+S rule. Similarly, the rules for the phrases “not safe,” “not very safe,” and “not safe at all” are N+S, N+D+S, and D+N+S, respectively, and their sentiment polarity values are −2, −1.4, and −4.2, respectively.
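The worked example can be checked with a small scoring function. The formulas below are inferred only from the example values (4.2, −2, and −4.2) and are an assumption, not a copy of Table II. The N+D+S rule is deliberately left out because the single value −1.4 does not determine its formula. The function name is hypothetical.

```python
def snd_score(pattern, ps, pd=1.0):
    """Score an SND pattern from the sentiment score ps and degree value pd.

    The formulas are inferred from the worked example in the text only.
    """
    if pattern == "S":
        return ps
    if pattern == "D+S":
        return pd * ps      # "very safe": 2.1 * 2
    if pattern == "N+S":
        return -ps          # "not safe": -2
    if pattern == "D+N+S":
        return -pd * ps     # "not safe at all": -(2.1 * 2)
    raise ValueError(f"no formula available for pattern {pattern!r}")


for name in ("D+S", "N+S", "D+N+S"):
    print(name, snd_score(name, ps=2, pd=2.1))
```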

Traffic domain noun base: Numerous specific nouns in the traffic domain may be ignored in document segmentation, which consequently affects the sentiment analysis. Thus, we construct the noun base, which includes auto brands, professional nouns in the traffic domain, features and aspects of public traffic, and words commonly used in public discussion. We collected the data from related websites, such as autohome.com and auto.sina.com. We finally obtained 1732 nouns, ranging from auto brands to professional terms in traffic.

C. TSA Process

Text sentiment calculation can be categorized into three levels, namely, the word, sentence, and document levels. The calculation of the sentiment polarity of words is a basic step in the construction of the sentiment word base. In practice, we treat words and phrases as another form of sentence. Therefore, text processing includes two main parts: the polarity calculation of sentence-level text and that of document-level text.

Fig. 3 shows the overall process of the proposed approach. The method includes two major steps, i.e., sentence sentiment analysis and document sentiment aggregation. Considering the subtlety of Chinese expression, we first decompose a document into its constituent sentences and determine the sentiment polarity of each sentence. In contrast to early document-level analytical approaches [14], [37], we regard sentences as atomic units for semantic analysis. The polarity scores of all the sentences are subsequently combined to compute the overall polarity of the entire document.

The sentiment polarity of a sentence is denoted as $p_s$. To determine $p_s$, we extract the SND patterns in the text and calculate the sentiment polarity score of each pattern. We then calculate the polarity of each sentence $s_i$ according to the rules defined in the rule bases.

TABLE II
EXAMPLES OF RULE BASES

Fig. 3. Illustration of rule-based sentiment analysis.

The most important thematic sentences are usually placed in the most prominent positions, such as the title, the first sentence, and the last sentence, for emphasis. Therefore, in calculating the overall polarity of a document, the location of each sentiment sentence should be considered. In practice, the importance of a sentence to a document can be represented by its weight in the overall polarity computation. The weights of thematic sentences should be greater than those of the other sentences in a document. We formalize the problem as follows. Given a text $t$ containing sentences $\{s_1, \ldots, s_n\}$ as input, the system must calculate the polarity score $p_{s_i}$ of each sentence $s_i$ and determine the overall sentiment polarity $P_t$ of the text from the weighted sentence scores, where $w_i$ is the weight of sentence $s_i$. If $P_t > 0$, the document shows a positive sentiment; otherwise, the document shows a negative sentiment. Table III shows the entire algorithm of the process.

TABLE III
RULE-BASED SENTIMENT ANALYSIS ALGORITHM
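A minimal sketch of the document-level aggregation is given below. It assumes that the overall polarity is a plain weighted sum of the sentence scores and that the first and last sentences receive larger weights; the weight values 2.0 and 1.5 and all function names are illustrative assumptions, not values taken from Table III.

```python
def position_weights(n, first=2.0, last=1.5, other=1.0):
    """Illustrative weights: the first and last sentences count more."""
    weights = [other] * n
    if n:
        weights[-1] = last
        weights[0] = first  # a single sentence gets the `first` weight
    return weights


def document_polarity(scores, weights):
    """Weighted sum of sentence polarity scores (assumed form of P_t)."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(w * p for w, p in zip(weights, scores))


def classify(pt):
    """Positive if P_t > 0; otherwise negative, as stated in the text."""
    return "positive" if pt > 0 else "negative"


scores = [2.0, -2.0, -1.0]
pt = document_polarity(scores, position_weights(len(scores)))
print(pt, classify(pt))
```

With equal weights the same three sentences would sum to −1.0, so the example shows how emphasizing the opening sentence can change the document-level label.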

V. DATA COLLECTION

Data collection is an essential step in TSA because it is not known in advance whether a document is topically relevant to an opinion-oriented query [5]. This section addresses several basic rules on this problem.


Information regarding traffic on the Web can be classified into three categories. The first category consists of news, expert commentaries, announcements, etc., from traffic websites. The second includes posts in the transportation sections of forums. These forums provide a platform through which users can exchange information about social topics, such as traffic congestion and transportation policies. The third includes real-time information about traffic in microblogs, which can be found on social media, such as weibo.com. The sentiment polarity of the first category is not easily distinguished, but its content is true and meaningful. The sentiment polarity of the second category is clear, and a discussion on certain events or topics is usually highly valuable for tracking public opinion. The third category, which includes real-time traffic information, may not have a fixed topic but is often tied to a specific location. Such information is significant for obtaining real-time information about travelers and creating a backup sensor network system.

Data from specific websites, such as the first and third categories of information, can be collected through open application programming interfaces or corresponding crawlers. However, collecting a data set on a specific topic is more difficult. In most forums, the information-publishing platform is divided into a series of boards containing various categories or topics. In a predefined subject board, the topics are designed for specific events, providing a relatively better framework for the readers and commenters. Nevertheless, the categorization is too simple and indistinct for analysis and research for the following reasons: 1) not all topics can be mapped to a single board; 2) the contents of a post are not strictly related to the topic of its board; and 3) a forum board often contains more than one topic.

Therefore, to precisely collect a topic thread and gather its information into one post, we first design a special crawler that uses depth retrieval. Traffic-related terms are adopted to build the key ontological vocabulary, which is used to query the built-in search engine of the website. Thus, the web pages most closely related to the topic are obtained first. We then design a customized data wrapper for a given platform to extract the metadata, including the user ID, timestamp, post message, and the properties of the cited user. This method allows the collection of highly accurate data on a specific topic from the websites.
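The data wrapper and the vocabulary filter can be sketched as follows. The record type `Post`, the raw dictionary keys, and the function names are hypothetical; the sketch assumes that the crawler yields one dictionary per post with an ISO-formatted timestamp, which is not specified in this paper.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Post:
    """Metadata kept for each collected post."""
    user_id: str
    timestamp: datetime
    message: str
    cited_user: Optional[str] = None


def wrap_post(raw):
    """Convert one raw record (hypothetical crawler output) into a Post."""
    return Post(
        user_id=str(raw["user_id"]),
        timestamp=datetime.fromisoformat(raw["timestamp"]),
        message=raw["message"].strip(),
        cited_user=raw.get("cited_user"),
    )


def filter_by_vocabulary(posts, vocabulary):
    """Keep the posts whose message contains at least one traffic term."""
    return [p for p in posts if any(term in p.message for term in vocabulary)]


raw = {"user_id": 42, "timestamp": "2013-01-01T08:30:00",
       "message": "  The yellow light rule is too strict  "}
posts = [wrap_post(raw)]
print(filter_by_vocabulary(posts, {"yellow light"}))
```

A separate wrapper per platform would only need to change `wrap_post`, since the rest of the pipeline works on `Post` records.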

VI. CASE STUDIES

This section presents two cases on the “yellow light rule” and “fuel price” from a Chinese website. Both are controversial traffic-related topics in China.

A. Data Set

Our data set is collected from tianya.cn, one of the biggest online communities in China, which contains comprehensive discussions on various topics. We collected related content by the methods proposed in Section V.

Case 1: “Yellow light rule.” A new traffic law that took effect on January 1, 2013 has drawn considerable attention online. The law was called “the strictest rules of traffic in China” by netizens. Under the new rules, running a yellow light would be equivalent to running a red light, entailing a deduction of 6 points from drivers. In addition, accumulating a total of 12 points results in the suspension of a driver’s license. These rules are heavily discussed online.

Case 2: “Fuel price.” This topic has been controversial in China for years. The price decision policy in China follows a distinct and complex process. The fuel price is related to the international fuel price and tax. Discussions on this topic generally focus on the rise and fall of the fuel price and on the pricing policy. Table IV shows the background information of the two cases.

TABLE IV. BACKGROUND INFORMATION ON THE TWO CASES

TABLE V. ILLUSTRATION OF MANUAL ANNOTATION (%)

TABLE VI. PERFORMANCE EVALUATION OF THE PREDICTION

B. Experiments

We applied our own approach to the two cases and compared it with Ku’s algorithm [38].

1) Standard Data Set: To establish a gold-standard data set, we chose three individuals to tag the sentiment of the texts. Table V shows the percentage of identical marks between each pair of annotators and the percentage of texts on which all three annotators agreed.

The result reveals that the three annotators showed no marked disagreement regarding the texts. Thus, we select the data set to which all the annotators assigned the same marks. The experimental data set finally contains 547 positive messages and 5937 negative messages on topic 1 and 516 positive messages and 7418 negative messages on topic 2.
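The agreement check and the selection of unanimously labeled texts can be sketched as follows. The annotator names, label strings, and function names are illustrative assumptions, and the toy labels are not data from this study.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def pairwise_overlap(labels: Dict[str, List[str]]) -> Dict[Tuple[str, str], float]:
    """Percentage of texts on which each pair of annotators gave the same mark."""
    result = {}
    for a, b in combinations(sorted(labels), 2):
        same = sum(x == y for x, y in zip(labels[a], labels[b]))
        result[(a, b)] = 100.0 * same / len(labels[a])
    return result


def unanimous(labels: Dict[str, List[str]]) -> List[Tuple[int, str]]:
    """(index, mark) for every text on which all annotators agree."""
    columns = zip(*labels.values())
    return [(i, col[0]) for i, col in enumerate(columns) if len(set(col)) == 1]


# Toy marks from three hypothetical annotators on four texts.
toy = {
    "A": ["pos", "neg", "neg", "pos"],
    "B": ["pos", "neg", "pos", "pos"],
    "C": ["pos", "neg", "neg", "neg"],
}
gold = unanimous(toy)  # only the texts kept for the experimental data set
```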

2) Evaluation Method (Sentiment Polarity Evaluation): In the sentiment polarity evaluation, we applied the confusion matrix to tackle the evaluation problem. We select accuracy, recall, and precision as indicators to evaluate the performance of both algorithms. These measures are defined in Table VI [39]. Note that “TP,” “FP,” “FN,” and “TN” denote “true positive,” “false positive,” “false negative,” and “true negative” in the prediction, respectively.


TABLE VII. COMPARISON OF THE TWO ALGORITHMS (%)

The overall accuracy of the classification results is calculated as follows:

$$\mathrm{Accuracy} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN} + \mathrm{TN}}. \tag{5}$$

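The recall and precision rates of positive polarity in the prediction are computed as follows:

$$\mathrm{Recall}(P) = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad \mathrm{Precision}(P) = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}. \tag{6}$$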

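The recall and precision rates of negative polarity in the prediction are derived as follows:

$$\mathrm{Recall}(N) = \frac{\mathrm{TN}}{\mathrm{FP} + \mathrm{TN}}, \qquad \mathrm{Precision}(N) = \frac{\mathrm{TN}}{\mathrm{FN} + \mathrm{TN}}. \tag{7}$$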
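As a minimal sketch, the indicators in (5)–(7) can be computed from gold and predicted labels as follows. The class and function names, the label string `"pos"`, and the convention of returning 0.0 for an empty denominator are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Confusion:
    tp: int  # gold positive, predicted positive
    fp: int  # gold negative, predicted positive
    fn: int  # gold positive, predicted negative
    tn: int  # gold negative, predicted negative


def _ratio(num: int, den: int) -> float:
    """num/den, or 0.0 when the denominator is zero (assumed convention)."""
    return num / den if den else 0.0


def confusion(gold: Sequence[str], pred: Sequence[str], positive: str = "pos") -> Confusion:
    """Count the four confusion-matrix cells for a two-class problem."""
    tp = fp = fn = tn = 0
    for g, p in zip(gold, pred):
        if p == positive:
            if g == positive:
                tp += 1
            else:
                fp += 1
        elif g == positive:
            fn += 1
        else:
            tn += 1
    return Confusion(tp, fp, fn, tn)


def metrics(c: Confusion) -> Dict[str, float]:
    return {
        "accuracy": _ratio(c.tp + c.tn, c.tp + c.fp + c.fn + c.tn),  # (5)
        "recall_P": _ratio(c.tp, c.tp + c.fn),                       # (6)
        "precision_P": _ratio(c.tp, c.tp + c.fp),                    # (6)
        "recall_N": _ratio(c.tn, c.fp + c.tn),                       # (7)
        "precision_N": _ratio(c.tn, c.fn + c.tn),                    # (7)
    }
```

Per-class recall and precision matter here because the data set is heavily skewed toward negative messages. On topic 1 (547 positive, 5937 negative), a classifier that marks every message as negative would already reach an accuracy of 5937/6484, about 91.6%, while its positive recall would be zero.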

Evaluation of sentiment intensity: The sentiment intensity is evaluated by

$$D\text{-value} = \frac{\left|\sum_{i=1}^{N} (\mathrm{TP}_i - \mathrm{FP}_i)\right|}{N} \tag{8}$$

where D-value denotes the difference between the result and the experts’ standard. A smaller D-value indicates that the results are closer to the experts’ standard. $N$ is the total number of texts, $\mathrm{TP}_i$ denotes the sentiment intensity of the $i$th text calculated by the algorithms, and $\mathrm{FP}_i$ represents the sentiment intensity of the $i$th text given by the experts. (Unlike TP and FP in (5)–(7), these two symbols are intensity values, not counts.) Formula (8) can be expressed as

$$D\text{-value} = \left|\mathrm{mean}_f - \mathrm{mean}_a\right| \tag{9}$$

where $\mathrm{mean}_f$ represents the mean sentiment intensity of the texts calculated by the algorithms, and $\mathrm{mean}_a$ denotes the mean intensity assigned by the experts.
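The equivalence of (8) and (9) can be checked numerically with a short sketch. The function names and the toy intensity values are assumptions for illustration, not scores from this study.

```python
import math
from typing import Sequence


def d_value(algo: Sequence[float], expert: Sequence[float]) -> float:
    """Formula (8): absolute value of the summed signed differences, divided by N."""
    if not algo or len(algo) != len(expert):
        raise ValueError("both score lists must be non-empty and of equal length")
    return abs(sum(a - e for a, e in zip(algo, expert))) / len(algo)


def d_value_from_means(algo: Sequence[float], expert: Sequence[float]) -> float:
    """Formula (9): absolute difference between the two mean intensities."""
    return abs(sum(algo) / len(algo) - sum(expert) / len(expert))


# Toy intensities for three texts (hypothetical values).
algo_scores = [2.0, -1.0, 3.0]
expert_scores = [1.0, -2.0, 1.0]
assert math.isclose(d_value(algo_scores, expert_scores),
                    d_value_from_means(algo_scores, expert_scores))
```

Because the differences are summed with their signs before the absolute value is taken, over- and underestimates on individual texts can cancel. The D-value therefore measures the bias of the mean intensity, not the per-text error.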

C. Results

By using our proposed algorithm and Ku’s algorithm to process our data set, we obtain the results shown in Table VII.

According to Table VII, our algorithm exhibits higher accuracy than Ku’s algorithm, by 16.6% and 8.94% in the two cases, respectively. This finding indicates increases in both positive and negative accuracy rates, which are attributed to the suitability of our rules for the traffic-related data set. The proposed algorithm improves on Ku’s algorithm to a certain extent. The recall and precision rates of negative texts increase more significantly because, in most negative texts, the polarity of the sentence is changed by a complex sentence pattern, and this point is extensively considered in our algorithm.

TABLE VIII. COMPARISON OF TWO ALGORITHMS

Table VIII presents the experts’ evaluation of sentiment intensity and the scores calculated by the two algorithms on the same texts. A comparison of the results reveals the following.

1) The precision of the sentiment intensity of the proposed algorithm is higher than that of Ku’s algorithm. In the two cases, the average sentiment intensity (mean_f) of Ku’s algorithm is higher. The reason is that, during affective computing, Ku’s algorithm disregards the modification of the sentiment that widely exists in the text.

2) The D-value of Ku’s algorithm is greater than that of the proposed algorithm. By definition, our sentiment intensity is closer to the experts’ evaluation. This finding suggests that our proposed affective computing model is more consistent with the mode of human understanding.

D. Discussions

An in-depth understanding of the rule-based approach is needed, e.g., whether a noun that could represent the sentiment of the texts exists. As emphasized in previous studies, the data set contains several subjective texts that cannot be easily analyzed by the rules. The most typical phenomenon is the ironic sentence. For instance, in posts regarding fuel prices, one thread was titled “the fuel price will rise,” to which one user replied, “go to sell the car.” Such a reply apparently carries an ironic tone; thus, all annotators manually labeled the reply as “negative.” However, given that the computer cannot detect any word expressing a negative sentiment in the given text, the methods cannot recognize the sentiment polarity. Therefore, numerous problems remain unsolved.
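The failure on ironic replies can be reproduced with a toy lexicon matcher. This sketch is not the algorithm proposed in this paper: the two tiny word lists and the function name are hypothetical, and it only shows why a text without any listed sentiment word cannot be assigned a polarity.

```python
import re

# Hypothetical miniature lexicons, for illustration only.
POSITIVE = {"good", "cheap", "fair", "happy"}
NEGATIVE = {"bad", "expensive", "unfair", "angry"}


def lexicon_polarity(text: str) -> str:
    """Label a text by counting lexicon hits; no hits means no decision."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "undetermined"


# The ironic reply contains no sentiment word, so it stays undetermined,
# although all annotators labeled it as negative.
reply_label = lexicon_polarity("go to sell the car")
```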

Given the limitations of the existing lexicons, an improved lexicon should be developed, which requires long-term and arduous effort. We propose the construction of ITSs under the architecture of artificial, computational, and parallel (ACP) methods, with the TSA system as one of the data sources.

VII. CONCLUSION

Due to the domain dependence of sentiment analysis, we have proposed Web-based TSA to analyze traffic problems in a more humanized way. To the best of our knowledge, this is the first attempt to apply sentiment analysis to the area of traffic. The study of TSA will provide a new perspective on traffic problems.

To summarize the main content of this paper, our work consists of the following five parts: 1) designing the application architecture of TSA; 2) constructing the related bases for the TSA system; 3) comparing the advantages and disadvantages of both rule- and learning-based approaches based on the characteristics of web data; 4) proposing an algorithm for the sentiment polarity calculation based on the rule-based approach; and 5) taking into consideration the modifying relationships of sentence patterns and locations in the sentiment polarity calculations.

The task of integrating the TSA system into existing ITSs is also critically important and needs further research. We suggest using the policy evaluation part to support the decision making of managers and treating the evaluation results related to specific locations as sensor information. The key to implementation is jointly accommodating the traveler’s best interest and a reasonable workload. Since TSA is still in its infancy, we anticipate that more techniques will be developed for the joint operation of ITSs with the TSA system in the future.

REFERENCES

[1] F. Y. Wang, “Social computing: Concepts, contents, and methods,” Int. J. Intell. Control Syst., vol. 9, no. 2, pp. 91–96, 2004.

[2] F.-Y. Wang, R. Lu, and D. Zeng, “Artificial intelligence in China,” IEEE Intell. Syst., vol. 23, no. 6, pp. 24–25, Nov./Dec. 2008.

[3] S.-M. Kim and E. Hovy, “Extracting opinions, opinion holders, and topics expressed in online news media text,” in Proc. Workshop Sentiment Subj. Text, 2006, pp. 1–8.

[4] B. Liu, M. Hu, and J. Cheng, “Opinion observer: Analyzing and comparing opinions on the Web,” in Proc. 14th Int. Conf. World Wide Web, 2005, pp. 342–351.

[5] B. Pang and L. Lee, “Opinion mining and sentiment analysis,” Found. Trends Inf. Retrieval, vol. 2, no. 1/2, pp. 1–135, Jan. 2008.

[6] F.-Y. Wang, “Agent-based control for networked traffic management systems,” IEEE Intell. Syst., vol. 20, no. 5, pp. 92–96, Sep./Oct. 2005.

[7] S.-M. Kim and E. Hovy, “Determining the sentiment of opinions,” in Proc. 20th Int. Conf. Comput. Linguist., 2004, pp. 1367–1373.

[8] V. Hatzivassiloglou and K. R. McKeown, “Predicting the semantic orientation of adjectives,” in Proc. 8th Conf. Eur. Chapter Assoc. Comput. Linguist., 1997, pp. 174–181.

[9] T. Nasukawa and J. Yi, “Sentiment analysis: Capturing favorability using natural language processing,” in Proc. 2nd Int. Conf. Knowl. Capture, 2003, pp. 70–77.

[10] A.-M. Popescu and O. Etzioni, “Extracting product features and opinions from reviews,” in Natural Language Processing and Text Mining. New York, NY, USA: Springer-Verlag, 2007, pp. 9–28.

[11] E. Riloff, S. Patwardhan, and J. Wiebe, “Feature subsumption for opinion analysis,” in Proc. Conf. Empirical Methods Natural Lang. Process., 2006, pp. 440–448.

[12] C. L. Zhang, D. Zeng, J. X. Li, F. Y. Wang, and W. L. Zuo, “Sentiment analysis of Chinese documents: From sentence to document level,” J. Amer. Soc. Inf. Sci. Technol., vol. 60, no. 12, pp. 2474–2487, Dec. 2009.

[13] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up?: Sentiment classification using machine learning techniques,” in Proc. ACL Conf. Empirical Methods Natural Lang. Process., 2002, vol. 10, pp. 79–86.

[14] P. D. Turney, “Thumbs up or thumbs down?: Semantic orientation applied to unsupervised classification of reviews,” in Proc. 40th Annu. Meet. Assoc. Comput. Linguist., 2002, pp. 417–424.

[15] B. K. Tsou, R. W. Yuen, O. Y. Kwong, T. La, and W. L. Wong, “Polarity classification of celebrity coverage in the Chinese press,” in Proc. Int. Conf. Intell. Anal., 2005, pp. 137–142.

[16] K. Bloom, N. Garg, and S. Argamon, “Extracting appraisal expressions,” in Proc. HLT-NAACL, 2007, pp. 308–315.

[17] J. Wiebe, T. Wilson, R. Bruce, M. Bell, and M. Martin, “Learning subjective language,” Comput. Linguist., vol. 30, no. 3, pp. 277–308, Sep. 2004.

[18] R. W. Yuen, T. Y. Chan, T. B. Lai, O. Kwong, and B. K. T’sou, “Morpheme-based derivation of bipolar semantic orientation of Chinese words,” in Proc. 20th Int. Conf. Comput. Linguist., 2004, pp. 1008–1014.

[19] J. Kamps, M. Marx, R. J. Mokken, and M. De Rijke, “Using WordNet to measure semantic orientations of adjectives,” in Proc. Int. Conf. Lang. Resourc. Eval., 2004, pp. 1115–1118.

[20] M. Hu and B. Liu, “Opinion feature extraction using class sequential rules,” presented at the AAAI Spring Symposium Computational Approaches Analyzing Weblogs, Palo Alto, CA, USA, 2006, Paper AAAI-CAAW-06.

[21] D. Zeng, D. Wei, M. Chau, and F. Wang, “Chinese word segmentation for terrorism-related contents,” in Intelligence and Security Informatics. New York, NY, USA: Springer-Verlag, 2008, pp. 1–13.

[22] W. Che, Z. Li, and T. Liu, “LTP: A Chinese language technology platform,” in Proc. 23rd Int. Conf. Comput. Linguist., Demo., 2010, pp. 13–16.

[23] G. Li, C. Wan, H. Bian, L. Yang, and M. Zhong, “Emotional detection of text in the financial domain based-morpheme,” J. Comput. Res. Develop., vol. 48, no. z2, pp. 54–59, 2011.

[24] Y. Choi, C. Cardie, E. Riloff, and S. Patwardhan, “Identifying sources of opinions with conditional random fields and extraction patterns,” in Proc. Conf. Human Lang. Technol. Empirical Methods Natural Lang. Process., 2005, pp. 355–362.

[25] “ICTCLAS,” 2011. [Online]. Available: http://ictclas.nlpir.org/

[26] E. Riloff, J. Wiebe, and W. Phillips, “Exploiting subjectivity classification to improve information extraction,” in Proc. Nat. Conf. Artif. Intell., 2005, pp. 1106–1111.

[27] K. Dave, S. Lawrence, and D. M. Pennock, “Mining the peanut gallery: Opinion extraction and semantic classification of product reviews,” in Proc. 12th Int. Conf. World Wide Web, 2003, pp. 519–528.

[28] E. Riloff, J. Wiebe, and T. Wilson, “Learning subjective nouns using extraction pattern bootstrapping,” in Proc. 7th Conf. Nat. Lang. Learn. HLT-NAACL, 2003, vol. 4, pp. 25–32.

[29] E. Riloff and J. Wiebe, “Learning extraction patterns for subjective expressions,” in Proc. Conf. Empirical Methods Natural Lang. Process., 2003, pp. 105–112.

[30] C. Whitelaw, N. Garg, and S. Argamon, “Using appraisal groups for sentiment analysis,” in Proc. 14th ACM Int. Conf. Inf. Knowl. Manage., 2005, pp. 625–631.

[31] N. Kobayashi, K. Inui, Y. Matsumoto, K. Tateishi, and T. Fukushima, “Collecting evaluative expressions for opinion extraction,” in Natural Language Processing – IJCNLP 2004, K. Y. Su, J. Tsujii, J. H. Lee, and O. Y. Kwong, Eds. Berlin, Germany: Springer-Verlag, 2005, pp. 596–605.

[32] D. Rao and D. Ravichandran, “Semi-supervised polarity lexicon induction,” in Proc. 12th Conf. Eur. Chapter Assoc. Comput. Linguist., 2009, pp. 675–682.

[33] “CNKI,” 2007. [Online]. Available: http://www.keenage.com/download/sentiment.rar

[34] Y. Guo and Y. Zhou, “Chinese text orientation analysis based on phrase,” in Proc. Int. Conf. NLP-KE, 2009, pp. 1–6.

[35] J. Wiebe, T. Wilson, and M. Bell, “Identifying collocations for recognizing opinions,” in Proc. ACL Workshop Colloc., Comput. Extract., Anal., Exp., 2001, pp. 24–31.

[36] “HowNet.” [Online]. Available: www.keenage.com

[37] Q. Ye, W. Shi, and Y. Li, “Sentiment classification for movie reviews in Chinese by improved semantic oriented approach,” in Proc. 39th Annu. HICSS, 2006, pp. 1–5.

[38] L.-W. Ku, Y.-T. Liang, and H.-H. Chen, “Opinion extraction, summarization and tracking in news and blog corpora,” in Proc. AAAI Spring Symp., Comput. Approaches Anal. Weblogs, 2006, pp. 100–107.

[39] J. Diederich, A. Al-Ajmi, and P. Yellowlees, “E-x-ray: Data mining and mental health,” Appl. Soft Comput., vol. 7, no. 3, pp. 923–928, Jun. 2007.

Jianping Cao is currently working toward the Ph.D. degree in the School of Information System and Management, National University of Defense Technology, Changsha, China.

His major interests include social computing, sentiment analysis, and parallel management theory.

Ke Zeng is currently working toward the Ph.D. degree in the School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an, China.

His major interests include social computing, link prediction, opinion mining, and parallel management theory.


Hui Wang (M’08) received the Ph.D. degree in system engineering from the National University of Defense Technology, Changsha, China, in 2005.

He is currently a Professor with the Research Center of Computational Experiments and Parallel System Technology, College of Information System and Management, National University of Defense Technology. His research interests include multimedia intelligence analysis and data mining.

Jiajun Cheng received the B.S. degree in system engineering in 2012 from the National University of Defense Technology, Changsha, China, where he is currently working toward the M.S. degree in system engineering.

His research interests include sentiment analysis and machine learning.

Fengcai Qiao received the B.S. degree in system engineering in 2011 from the National University of Defense Technology, Changsha, China, where he is currently working toward the M.S. degree in system engineering.

His research interests include image categorization and image retrieval.

Ding Wen (M’95–SM’99) is a Professor with the National University of Defense Technology, Changsha, China, where he is a Senior Advisor of the Research Center for Military Computational Experiments and Parallel Systems Technology. His main research interests include behavioral operation management, human resource management, management information systems, and intelligent systems. He has published extensively and received numerous awards for his work in those areas.

Yanqing Gao (M’08) is currently with the State Key Laboratory of Intelligent Control and Management for Complex Systems, Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing, China.
