Comparative experiments using supervised learning and machine translation for multilingual sentiment analysis

Alexandra Balahur a,*, Marco Turchi b,1

a European Commission Joint Research Centre, IPSC, GlobeSec, OPTIMA, Via E. Fermi 2749, Ispra, Italy
b Fondazione Bruno Kessler-IRST, Via Sommarive 18, Povo, Trento, Italy

Received 27 August 2012; received in revised form 25 February 2013; accepted 27 March 2013

Abstract

Sentiment analysis is the natural language processing task dealing with sentiment detection and classification from texts. In recent years, due to the growth in the quantity and fast spreading of user-generated contents online and the impact such information has on events, people and companies worldwide, this task has been approached by an important body of research in the field. Although different methods have been proposed for distinct types of text, the research community has concentrated less on developing methods for languages other than English. In this context, the present work studies the possibility of employing machine translation systems and supervised methods to build models able to detect and classify sentiment in languages for which fewer or no resources are available for this task compared to English, stressing the impact of translation quality on the sentiment classification performance. Our extensive evaluation scenarios show that machine translation systems are approaching a good level of maturity and that, in combination with appropriate machine learning algorithms and carefully chosen features, they can be used to build sentiment analysis systems that obtain performances comparable to the ones obtained for English.

© 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.csl.2013.03.004

Keywords: Multilingual sentiment analysis; Opinion mining; Machine translation; Supervised learning

This paper has been recommended for acceptance by Prof. R.K. Moore.
* Corresponding author. Tel.: +39 0332 785808. E-mail addresses: [email protected] (A. Balahur), [email protected] (M. Turchi).
1 Work developed while working at the European Commission Joint Research Centre, Ispra, Italy.



1. Introduction

Together with the increase in the access to technology and the Internet, the recent years have shown a steady growth of the volume of user-generated contents on the Web. The diversity of topics covered by this data (also containing expressions of subjectivity) in the new textual types such as blogs, fora, microblogs, has been proven to be of tremendous value to a whole range of applications, in Economics, Social Science, Political Science, Marketing, to mention just a few. Notwithstanding these proven advantages, the high quantity of user-generated contents makes this information hard to access and employ without the use of automatic mechanisms. This issue motivated the rapid and steady growth in interest from the natural language processing (NLP) community to develop computational methods to analyze subjectivity and sentiment in text. Additionally, apart from the research on sentiment analysis in the context of user-generated contents, studies have also focused on developing methods for sentiment analysis in newspaper articles.



This task is especially relevant to the online reputation management of public figures and organizations and to monitoring the reaction to the events described in mainstream media. As such, different methods have been proposed to deal with these phenomena for the distinct types of text and domains, reaching satisfactory levels of performance for English. Nevertheless, for certain applications, such as news monitoring, the information in languages other than English is also highly relevant and cannot be disregarded. Additionally, systems dealing with sentiment analysis in the context of monitoring must be reliable and perform at levels similar to those of the systems implemented for English.

Although the most direct solution to these issues of multilingual sentiment analysis would be the use of machine translation systems, researchers in sentiment analysis have been reluctant to use such technologies due to the low performance they used to have. However, in the past years, the performance of machine translation systems has steadily improved. Public or open access solutions (e.g. Google Translate,2 Bing Translator3) offer more and more accurate translations for frequently used languages.

Bearing these thoughts in mind, in this article we study the manner in which sentiment analysis can be done for languages other than English, using machine translation. In particular, we study this issue in three languages – French, German and Spanish – using three different machine translation systems – Google Translate, Bing Translator and Moses (Koehn et al., 2007) – and different machine learning models.

We employ these systems to obtain training and test data for these three languages and subsequently extract different features that we employ to build different machine learning models using Support Vector Machines Sequential Minimal Optimization – SVM SMO (Platt, 1999). We additionally employ meta-classifiers to test the possibility of minimizing the impact of noise (incorrect translations) in the obtained data. To have a more precise measure of the impact of translation quality on this task, we create Gold Standard sets for each of the three languages, by translating the data with the Yahoo translation system4 and subsequently manually correcting the output.

Our experiments show that machine translation systems are reaching a reasonable level of maturity so as to be employed for multilingual sentiment analysis and that, for some languages (for which the translation quality is high enough), the performance that can be attained is similar to that of systems implemented for English, in terms of weighted F-measure.

2. Related work

The work presented herein is related to two different directions of research in NLP: multilingual sentiment analysis and the use of machine translation for multi- and cross-lingual tasks in NLP. The contributions in these two research directions that are relevant to the present research are presented in the following subsections.

2.1. Multilingual sentiment analysis

Most of the research in subjectivity and sentiment analysis was done for English. However, there were some authors who developed methods for the mapping of subjectivity lexicons to other languages. To this aim, Kim and Hovy (2006) use a machine translation system and subsequently employ a subjectivity analysis system that was developed for English to create subjectivity analysis resources in other languages. Ahmad et al. (2007) use the topical distributions in different languages to detect important sentiment phrases in a multilingual setting, starting from the idea that words with a lower frequency are more representative of the topic and searching for sentiment-related terms around those. Inui and Yamamoto (2011) employ machine translation and, subsequently, sentence filtering to eliminate the noise obtained in the translation process, based on the idea that sentences that are translations of each other should contain sentiment-bearing words that have the same polarity. Mihalcea et al. (2007) propose a method to learn multilingual subjective language via cross-language projections. They use the Opinion Finder lexicon (Wilson et al., 2005) and use


two bilingual English-Romanian dictionaries to translate the words in the lexicon. Another approach was proposed by Banea et al. (2008b). To this aim, the authors perform three different experiments – translating the annotations of the MPQA corpus, using the automatically translated entries in the Opinion Finder lexicon and, the third, validating

2 http://translate.google.it/.
3 http://www.microsofttranslator.com/.
4 http://www.babelfish.com/.


the data by reversing the direction of translation. In a further approach, Banea et al. (2008a) apply bootstrapping to build a subjectivity lexicon for Romanian, starting with a set of 60 words which they translate and subsequently filter using a measure of similarity to the original words, based on latent semantic analysis (LSA) (Deerwester et al., 1990) scores. Yet another approach to mapping subjectivity lexica to other languages is proposed by Xiaojun (2009), who uses co-training to classify un-annotated Chinese reviews using a corpus of annotated English reviews. He first translates the English reviews into Chinese and subsequently back to English. He then performs co-training using all generated corpora. Kim et al. (2010) create a number of systems consisting of different subsystems, each classifying the subjectivity of texts in a different language. They translate a corpus annotated for subjectivity analysis (MPQA), the subjectivity clues (Opinion Finder) lexicon, and re-train a Naive Bayes classifier that is implemented in the Opinion Finder system using the newly generated resources for all the languages considered. Banea et al. (2010) translate the MPQA corpus into five other languages (some with a similar etymology, others with a very different structure). Subsequently, they expand the feature space used in a Naive Bayes classifier using the same data translated to 2 or 3 other languages. Another type of approach was proposed by Bader et al. (2011), who use latent semantic indexing as a manner to bridge between the concepts in different languages. Finally, Steinberger et al. (2011a,b) create sentiment dictionaries in other languages using a method called "triangulation". They translate the data, in parallel, from English and Spanish to other languages and obtain dictionaries from the intersection of these two translations.

2.2. Using machine translation for multi- and cross-lingual tasks in NLP

Machine translation was long avoided in other natural language processing tasks due to the poor quality of translated texts, but recent advances in machine translation have motivated new attempts. In Information Retrieval, Savoy and Dolamic (2009) proposed a comparison between Web searches using monolingual and translated queries. On average, the results show a drop in performance of around 15% when translated queries are used. For some language pairs, the average result obtained is around 10% lower than that of a monolingual search, while for other pairs the retrieval performance is clearly lower. In cross-language document summarization, Wan et al. (2010) and Boudin et al. (2010) combined the MT quality score with the informativeness score of each sentence in a set of documents to automatically produce summaries in a target language using source language texts. In the work by Steinberger and Turchi (2012), different ways of using translated data for multi-document summary evaluation were tested: firstly, summaries were translated into the target language and compared against human-produced summaries; secondly, original documents were translated and then summarized; finally, human-produced summaries were translated and used to evaluate summaries in the target language. Their results show that the use of translated summaries or models does not alter the overall system ranking much, but a drop in the ROUGE score (Lin, 2004)5 is evident, and it strongly depends on the translation performance. In the work by Wan et al. (2010), each sentence of the source document is ranked according to both scores, the summary is extracted and then the selected sentences are translated to the target language. Differently, in the work by Boudin et al. (2010), sentences are first translated, then ranked and selected. Both approaches enhance the readability of the generated summaries without degrading their content. In the context of ranking translated documents, in the research by Turchi et al. (2012c), sentences within a document were ranked according to their informativeness and translation quality, and this ranking was used to assign a global score to each document for the ranking of groups of documents. This required different evaluation strategies from those used in the text summarization field. Finally, systems employing MT for multi- and cross-lingual tasks were developed in the context of evaluations conducted within the Cross-Language Evaluation Forum (CLEF),6 the Text REtrieval Conference (TREC)7


and the NII Test Collection for IR Systems Project (NTCIR)8 competitions. The results obtained by the systems in these evaluations have shown that the use of MT for multi- and cross-lingual tasks is a promising direction for research.

5 The ROUGE score is the measure currently employed to evaluate the performance of summarization systems.
6 http://clef.isti.cnr.it/.
7 http://trec.nist.gov/.
8 http://research.nii.ac.jp/ntcir/index-en.html.


3. Motivation and contribution

In a formal manner, we can define the sentiment classification performance, scp, as a function of four factors: the feature set, fs, the feature representation, fr, the learning algorithm, l, and the experimental design, ed (e.g. data split): scp = fn(fs, fr, l, ed). On the one hand, by choosing the optimal parameters for each of the factors, in the case of correct training data from a language, we can obtain the maximum performance for sentiment classification on that specific data, for that language. We will denote this maximum performance by scp_max. On the other hand, when machine translated data is used for training and a human-produced translation of the Gold Standard for testing, the sentiment classification performance, scp_tG, is negatively affected by the translation error ε_tr, such that scp_tG = scp_max − ε_tr. In the case of perfect translations of the training data (i.e. human-produced translations), ε_tr → 0 and scp_tG → scp_max.
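Written out as display math, the decomposition above (in the paper's own notation) reads:

```latex
scp = fn(fs,\, fr,\, l,\, ed), \qquad
scp_{tG} = scp_{max} - \varepsilon_{tr}, \qquad
\varepsilon_{tr} \to 0 \;\Rightarrow\; scp_{tG} \to scp_{max}
```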

Our main contribution in this article is the comparative study of multilingual sentiment analysis performance in different target languages, using (a) distinct feature sets produced by employing different machine translation systems; (b) feature representations; (c) learning algorithms; and (d) noise removal mechanisms (in this case, meta-classifiers). The objective is to better understand the impact of the translation error on the multilingual sentiment classification performance.

Although, as we have seen in Section 2, a few attempts were made to build systems that deal with sentiment analysis in a multilingual setting, they mostly employed bilingual dictionaries and used lexicon-based approaches. The very few that employed supervised learning using translated data have concentrated only on the issue of sentiment classification and have disregarded the impact of the translation quality and the difference that the use of distinct translation systems can make under these settings. Additionally, research in this area has concentrated only on the application of one set of features, one representation and one or two classifiers, whereas the distinct characteristics of translated data (when compared to the original data) may imply that other features could be more appropriate.

Another characteristic of previous approaches to multilingual sentiment analysis is that they have usually employed only simple machine learning algorithms. No attempt has been made to study the possibility of enhancing the performance of the classification through the removal of noise in the data. To this aim, in the present research, we study the impact of using meta-classifiers for noise removal.

We employ three different systems – Bing Translator, Google Translate and Moses – to translate data from English to three languages – French, German and Spanish. We manually create a Gold Standard test set for all the languages used, on the one hand, to measure the translation quality and, on the other hand, to test the performance of sentiment classification on translated (noisy) versus correct data. These corrected test sets allow us to have a more precise measure of the impact of translation quality on the sentiment classification task.

The lack of manually translated training data for each of the target languages and the large cost of manually producing it do not allow us to compute the maximum sentiment classification performance, scp_max, in all the desired languages using training and testing Gold Standard data. In fact, most of the related approaches mentioned in Section 2 only employ corrected test data to measure their performance. Supported by the results of Banea et al. (2008a,b, 2010) and Mihalcea et al. (2007), where the sentiment classification performance in English is generally better than the performance in the other languages, we use the English performance as a reference.

Another contribution this article brings is the study of different types of features that can be employed to build machine learning models for the sentiment task. Further on, apart from studying different features that can be used to represent the training data, we also study the use of meta-classifiers to minimize the effect of noise in the data.

We employ Yahoo Translate to translate the test data into the same three languages and manually correct the output, thus obtaining a Gold Standard used, on the one hand, to measure the translation quality and, on the other hand, to test the performance of sentiment classification on translated (noisy) versus correct data. Using these correct translations, we can have a more precise measure of the impact of translation quality on the sentiment classification task.9

Our comparative results show, on the one hand, that machine translation can be reliably used for multilingual sentiment analysis and, on the other hand, what the main characteristics of the data must be for such approaches to be successfully employed.

9 Yahoo Translate was used because, at a first inspection, it produced the least correct translations. Using it, although having to perform many manual corrections in the data, we can avoid translation bias – i.e. the use of specific words by a human, if they were to translate the texts manually from scratch.


4. Dataset presentation and analysis

For our experiments, we employed the data provided for English in the NTCIR 8 Multilingual Opinion Analysis Task (MOAT).10 In this task, the organizers provided the participants with a set of 20 topics (questions) and a set of documents in which sentences relevant to these questions could be found, taken from the New York Times Text (2002–2005) corpus. The documents were given in two different forms, which had to be used correspondingly, depending on the task in which the teams participated. The first variant contained the documents split into sentences (6165 in total) and had to be used for the tasks of opinionatedness, relevance and answerness. In the second form, the sentences were also split into opinion units (6223 in total) for the opinion polarity and the opinion holder and target tasks. For each of the sentences, the participants had to provide judgements on the opinionatedness (whether they contained opinions) and relevance (whether they are relevant to the topic). For the task of polarity classification, the participants had to employ the dataset containing the sentences that were also split into opinion units (i.e. one sentence could contain two or more opinions, on two or more different targets or from two or more different opinion holders).

For our experiments, we employed the latter representation. From this set, we randomly chose 600 opinion units to serve as test set. The rest of the opinion units were employed as training set. Subsequently, we employed the Google Translate, Bing Translator and Moses systems to translate, on the one hand, the training set and, on the other hand, the test set, to French, German and Spanish. Additionally, we employed the Yahoo system (whose performance was the lowest in our initial experiments) to translate only the test set into these three languages. Further on, this translation was corrected manually by a person, for all the languages. This corrected data serves as Gold Standard.11 Most of these sentences, however, contained no opinion (were neutral). Due to the fact that the neutral examples are the majority and can produce a large bias when classifying the polarity of the sentences, we eliminated these examples and employed only the positive and negative sentences in both the training and the test sets. After this elimination, the training set contains 943 examples (333 positive and 610 negative) and the test set and Gold Standard contain 357 examples (107 positive and 250 negative). Although it would be possible to estimate the upper bound for each of the systems using a Gold Standard for each of the training sets as well, at this point we considered the scenario that is closer to real situations, in which the issue is the inexistence of training data for a specific language. The process is illustrated in Fig. 1.
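For concreteness, the preparation just described can be sketched as follows (a minimal Python sketch; the field names and label values are our own invention, not the actual NTCIR-8 MOAT file format):

```python
import random

# Minimal sketch of the data preparation described above. The dictionary
# keys and label values are hypothetical; the real MOAT files differ.
opinion_units = [
    {"text": "Sentence one ...", "polarity": "POS"},
    {"text": "Sentence two ...", "polarity": "NEU"},
    {"text": "Sentence three ...", "polarity": "NEG"},
    # ... 6223 opinion units in the full dataset
]

random.seed(0)
random.shuffle(opinion_units)
test, train = opinion_units[:600], opinion_units[600:]  # 600 units held out

# Neutral units are the majority and would bias polarity classification,
# so they are removed from both sets, as done in the paper.
train = [u for u in train if u["polarity"] != "NEU"]
test = [u for u in test if u["polarity"] != "NEU"]
```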

5. Machine translation

During the 1990s, the research community on machine translation proposed a new approach that made use of statistical tools based on a noisy channel model originally developed for speech recognition (Brown et al., 1994). In its simplest form, statistical machine translation (SMT) can be formulated as follows. Given a source sentence written in a foreign language f, the Bayes rule is applied to reformulate the probability of translating f into a sentence e written in a target language:

$$e_{best} = \arg\max_{e} p(e|f) = \arg\max_{e} p(f|e)\, p_{LM}(e)$$

where p(f|e) is the probability of translating e to f and p_LM(e) is the probability of producing a fluent sentence e. For a full description of the model see Koehn et al. (2003).

The noisy channel model was extended in different directions. In this work, we analyse the most popular class of SMT systems: phrase-based statistical machine translation (PBSMT). It is an extension of the noisy channel model using phrases rather than words. A source sentence f is segmented into a sequence of I phrases f_I = {f_1, f_2, . . ., f_I} and the same is done for the target sentence e, where the notion of phrase is not related to any grammatical assumption; a phrase is an n-gram. The best translation e_best of f is obtained by:

$$e_{best} = \arg\max_{e} p(e|f) = \arg\max_{e} \prod_{i=1}^{I} \phi(f_i|e_i)^{\lambda_{\phi}}\, d(a_i - b_{i-1})^{\lambda_{d}} \prod_{i=1}^{|e|} p_{LM}(e_i|e_1 \ldots e_{i-1})^{\lambda_{LM}}$$

10 http://research.nii.ac.jp/ntcir/ntcir-ws8/permission/ntcir8xinhua-nyt-moat.html.
11 Please note that each sentence may contain more than one opinion unit. In order to ensure a contextual translation, we translated the whole sentences, not the opinion units separately. In the end, we eliminated duplicates of sentences (due to the fact that they contained multiple opinion units), resulting in around 400 sentences in the test and Gold Standard sets and 5700 sentences in the training set.


Fig. 1. The process employed to translate the training and test data and to create the Gold Standards.

where φ(f_i|e_i) is the probability of translating a phrase e_i into a phrase f_i. d(a_i − b_{i−1}) is the distance-based reordering model that drives the system to penalise significant reorderings of words during translation, while allowing some flexibility. In the reordering model, a_i denotes the start position of the source phrase that is translated into the ith target phrase, and b_{i−1} denotes the end position of the source phrase translated into the (i − 1)th target phrase. p_LM(e_i|e_1 . . . e_{i−1}) is the language model probability, which is based on the Markov chain assumption; it assigns a higher probability to fluent/grammatical sentences. λ_φ, λ_LM and λ_d are used to give a different weight to each element. For more details see Koehn et al. (2003).
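To make the roles of the three components concrete, the score of one candidate translation under this model can be sketched as below. All numbers (phrase probabilities, LM probabilities, positions, weights) and the exact shape of d are hypothetical; a real decoder also searches over all segmentations and reorderings:

```python
import math

# Sketch: score of ONE candidate translation under the phrase-based model.
lambda_phi, lambda_d, lambda_lm = 1.0, 0.5, 1.0

# (phi(f_i|e_i), start a_i, end b_i) for each of I = 2 source phrases
phrases = [(0.4, 0, 1), (0.3, 2, 3)]
lm_probs = [0.1, 0.2, 0.5, 0.4]  # p_LM(e_i | e_1 ... e_{i-1}) per target word

def d(x):
    # one common choice of distance-based reordering penalty:
    # no penalty for monotone order (x = 1), exponential decay otherwise
    return math.exp(-abs(x - 1))

score, b_prev = 1.0, -1
for phi, a_i, b_i in phrases:
    score *= (phi ** lambda_phi) * (d(a_i - b_prev) ** lambda_d)
    b_prev = b_i
for p in lm_probs:
    score *= p ** lambda_lm

print(score)  # the decoder keeps the candidate with the highest score
```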

Three different SMT systems were used to translate the human annotated sentences: two existing online services, Google Translate and Bing Translator,12 and an instance of the open source phrase-based statistical machine translation toolkit Moses (Koehn et al., 2007).

To train our models based on Moses we used the freely available corpora Europarl (Koehn, 2005), JRC-Acquis (Steinberger et al., 2006), Opus (Tiedemann, 2009) and News Corpus (Callison-Burch et al., 2009). This results in 2.7 million sentence pairs for English–French, 3.8 million for English–German and 4.1 million for English–Spanish. All the models are optimized by running the MERT algorithm (Och, 2003) on the development part of the News Corpus. The translated sentences are recased and detokenized (for more details on the system, please see Turchi et al. (2012a)).

The performance of an SMT system is automatically evaluated by comparing the output of the system against human-produced translations (references). The BLEU score (Papineni et al., 2001) is based on n-gram precision, that is, the fraction of n-grams of the target sentences that occur in the references. This quantity is affected by the fact that the same part of a reference sentence can be matched more than once by n-grams in the target sentence, which implies that n-gram precision can produce misleading results. To avoid this situation, the BLEU score uses a modified n-gram precision that does not allow the same part of the reference sentence to be used twice. Modified n-gram precision also penalises target sentences that are longer than their references, but it is not enough to enforce the proper length of the translation. To solve this, a brevity penalty factor has been introduced to give a better score to those target sentences that reflect the reference sentence length. The BLEU score is the product of the geometric average of the modified n-gram precisions, with n-grams up to length N, and the brevity penalty; it ranges between 0 and 1, and a larger value identifies a better translation. The BLEU score strongly correlates with other automatic metrics used in machine translation (Turchi et al., 2012) and, although other measures correlate better with human judgements (e.g. AMBER; Chen et al., 2012), nowadays it is the most used score in the evaluation of translated texts.
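As an illustration, a minimal sentence-level implementation of the score (our own simplified sketch; production evaluations use corpus-level counts and smoothing, e.g. via the NLTK or sacrebleu packages) could be:

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of the modified
    (clipped) n-gram precisions for n = 1..max_n, times the brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        r_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # clipping: each part of the reference may be matched at most once
        clipped = sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())
        total = max(1, sum(c_ngrams.values()))
        log_prec += math.log(max(clipped, 1e-9) / total) / max_n
    # brevity penalty: punish candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(1, len(cand)))
    return bp * math.exp(log_prec)

print(bleu("the cat sat on the mat", "the cat is on the mat"))
```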

12 http://translate.google.com/ and http://www.microsofttranslator.com/.


6. Sentiment analysis

In the field of sentiment analysis, most work has concentrated on creating and evaluating methods, tools and resources to discover whether a specific "target" or "object" (person, product, organization, event, etc.) is "regarded" in a positive or negative manner by a specific "holder" or "source" (i.e. a person, an organization, a community, people in general, etc.). This task has been given many names, from opinion mining, to sentiment analysis, review mining, attitude analysis, appraisal extraction and many others.

The issue of extracting and classifying sentiment in text has been approached using different methods, depending on the type of text, the domain and the language considered. Broadly speaking, the methods employed can be classified into unsupervised (knowledge-based), supervised and semi-supervised methods. The first usually employ lexica or dictionaries of words with associated polarities (and values – e.g. 1, −1) and a set of rules to compute the final result. The second category of approaches employ statistical methods to learn classification models from training data, based on which the test data is then classified. Finally, semi-supervised methods employ knowledge-based approaches to classify an initial set of examples, after which they use different machine learning methods to bootstrap new training examples, which they subsequently use with supervised methods.

The main issue with the first approach is that obtaining large-enough lexica to deal with the variability of language is very expensive (if it is done manually) and generally not reliable (if it is done automatically). Additionally, the main problem of such approaches is that words outside contexts are highly ambiguous. Semi-supervised approaches, on the other hand, highly depend on the performance of the initial set of examples that is classified. If we are to employ machine translation, the errors in translating this small initial set would have a high negative impact on the subsequently learned examples. The challenge of using statistical methods is that they require training data (e.g. annotated corpora) and that this data must be reliable (i.e. not contain mistakes or "noise"). However, the larger this dataset is, the less influence the translation errors have.

Since we want to study whether machine translation can be employed to perform sentiment analysis for different languages, we employed statistical methods in our experiments. More specifically, we used Support Vector Machines Sequential Minimal Optimization (SVM SMO), since the literature in the field has confirmed it as the most appropriate machine learning algorithm for this task (Pang and Lee, 2008).

In the case of statistical methods, the most important aspect to take into consideration is the manner in which texts are represented – i.e. the features that are extracted from them. For our experiments, we represented the sentences based on the unigrams and the bigrams found in the training data. Although there is an ongoing debate on whether bigrams are useful in the context of sentiment classification, we considered that the quality of the translation can also be best quantified in the process by using these features (because they give us a measure of the translation correctness, both regarding words and word order). Higher level n-grams, on the other hand, would only produce sparser feature vectors, due to the high language variability and the mistakes in the translation.

7. Experiments

In order to test the performance of sentiment classification when using translated data, we employed supervised learning using different features:

• In the first approach, we represented, for each of the languages and translation systems, the sentences as vectors whose features marked the presence/absence (boolean) of the unigrams contained in the corresponding training set (e.g. we obtained the unigrams in all the sentences of the training set produced by translating the English training data to Spanish using Google Translate, and subsequently represented each sentence in this training set, as well as in the test set produced by translating the English test data to Spanish using Google Translate, by marking the presence of the unigram features).

• In the second approach, we represented the training and test sets in the same manner as described above, with the difference that the features were computed not as the presence of the unigrams, but as the tf-idf score of that unigram.
• In the third approach, we represented, for each of the languages and translation systems, the sentences as vectors whose features marked the presence/absence of the unigrams and bigrams contained in the corresponding training set.


• In the fourth approach, we represented the training and test sets as in the previous point, with the difference that the features were computed not as the presence of the unigrams and bigrams, but as the tf-idf score of the unigrams and bigrams, respectively.
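The four representations can be reproduced, for instance, with scikit-learn's vectorizers (a minimal sketch on toy sentences of our own; the paper itself used Weka, whose tokenization and defaults differ):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy training sentences standing in for one translated training set.
train = ["la pelicula fue excelente", "la pelicula fue muy mala"]

representations = {
    "unigram (boolean)": CountVectorizer(binary=True),
    "unigram tf-idf": TfidfVectorizer(),
    "uni + bigrams (boolean)": CountVectorizer(binary=True, ngram_range=(1, 2)),
    "uni + bigrams tf-idf": TfidfVectorizer(ngram_range=(1, 2)),
}
for name, vectorizer in representations.items():
    X = vectorizer.fit_transform(train)  # features come from the training data only
    print(name, X.shape)
```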

In our experiments, we also studied the possibility of employing the sentiment-bearing words in the sentences to be classified as features for the machine learning algorithm. In order to do this, we employed the SentiWordNet (Esuli and Sebastiani, 2006), General Inquirer (Stone et al., 1966) and WordNet Affect (Strapparava and Valitutti, 2004) dictionaries for English and the multilingual dictionaries created by Steinberger et al. (2012a). The main problem of this approach was, however, that very few features were found, for a small number of the sentences to be classified, on the one hand because affect is not expressed in these sentences using lexical clues and, on the other hand, because the dictionaries we had at our disposal for languages other than English were not very large (around 1500 words). For this reason, we will not report these results.

Table 1 presents the number of unigram and bigram features employed for each of the languages, per translation system (N.B. the features are extracted from the training data).

Table 1
Features employed for representing the sentences in the training and test sets.

Language   SMT                     Nr. of unigrams   Nr. of bigrams
English    –                       5498              15,981
French     Bing                    7441              17,870
French     Google                  7540              18,448
French     Moses                   6938              18,814
French     Bing + Google + Moses   9082              40,977
German     Bing                    7817              16,216
German     Google                  7900              16,078
German     Moses                   7429              16,078
German     Bing + Google + Moses   9371              36,556
Spanish    Bing                    7388              17,579
Spanish    Google                  7803              18,895
Spanish    Moses                   7528              18,354
Spanish    Bing + Google + Moses   8993              39,034

Subsequently, we performed two sets of experiments:

• In the first set of experiments, we trained an SVM SMO classifier on the training data obtained for each language, with each of the three machine translation systems, separately (i.e. we generated a model for each of the languages considered, for each of the machine translation systems employed), using the four types of aforementioned features. Subsequently, we tested the models thus obtained on the corresponding test set (e.g. training on the Spanish training set obtained using Google Translate and testing on the Spanish test set obtained using Google Translate) and on the Gold Standard for the corresponding language (e.g. training on the Spanish training set obtained using Google Translate and testing on the Spanish Gold Standard). Additionally, in order to study the manner in which the noise in the training data can be removed, we employed two meta-classifiers – Bagging (Breiman, 1996) (with varying sizes of the bag and SVM SMO as classifier) and AdaBoost (Freund and Schapire, 1995) – but the best results were obtained using Bagging.
• In the second set of experiments, we combined the translated data from all three machine translation systems for the same language and created separate models based on the four types of features we extracted from this data (e.g. we created a Spanish training model using the unigrams and bigrams present in the training sets generated by the translation of the training set to Spanish by Google Translate, Bing Translator and Moses). We subsequently tested the performance of the sentiment classification using the Gold Standard for the respective language, represented using the features of the corresponding model built on the training data.


Table 2
Results obtained for English using the different representations.

Feature representation   Test set   SMO     AdaBoost M1   Bagging
Unigram                  GS         0.683   0.682         0.687
Unigram tf-idf           GS         0.651   0.667         0.681
Uni + bigrams            GS         0.685   0.685         0.686
Uni + bigrams tf-idf     GS         0.669   0.673         0.687

Table 3
Comparative evaluation of the feature representations using the SVM algorithm. If the F-score of x is larger than the F-score of y, then the count of x > y is increased by 1. In case the absolute value of the F-score of x minus the F-score of y is smaller than 0.005, x and y are considered equal.

                                        To German   To French   To Spanish
Uni + bigrams > unigram                 1           2           3
Unigram tf-idf > unigram                1           3           2
Uni + bigrams tf-idf > uni + bigrams    0           2           3
Uni + bigrams tf-idf > unigram tf-idf   2           0           3

Average BLEU score over the 3 systems   0.202       0.248       0.318


All the learning algorithms have been run using the default settings proposed in the machine learning library Weka.13 While this choice may affect the overall performance on the data in the given settings, preventing the classifier from reaching the best F-score, it allows us to perform a fair comparison across different algorithms, feature representations and languages. For instance, given two settings with two different feature representations and the parameters optimized, e.g. via cross-validation on the training set or feature selection, it would subsequently be difficult to understand whether a large difference in performance between the two settings is due to better features or simply to better parameters for the classifier. Using the same parameters guarantees that a feature representation is more discriminative than another.
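A rough modern analogue of this setup (our sketch in scikit-learn rather than Weka, with synthetic features standing in for the boolean/tf-idf n-gram vectors) is:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.svm import SVC

# Synthetic two-class data standing in for the n-gram feature vectors.
X, y = make_classification(n_samples=200, n_features=50, random_state=0)

# Bagging around a default-parameter linear SVM (a stand-in for Weka's SMO).
bagging = BaggingClassifier(SVC(kernel="linear"), n_estimators=10, random_state=0)
# AdaBoost with its default weak learners (decision stumps) for comparison.
boosting = AdaBoostClassifier(n_estimators=10, random_state=0)

for meta in (bagging, boosting):
    meta.fit(X, y)
    print(type(meta).__name__, meta.score(X, y))
```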

The results of the experiments (in terms of weighted F-score, per language) are presented in Tables 2 and 4–6; the results for the second set of experiments are presented in Table 7.

8. Results and discussion

Generally speaking, from our experiments using SVM, we could see that incorrect translations imply an increase in the number of features, more sparseness and more difficulty in identifying a hyperplane which separates the positive and negative examples in the training phase. Therefore, a low quality of the translation leads to a drop in performance, as the features extracted are not informative enough to allow the classifier to learn. We can consider that the results obtained on the test sets are representative of the translation quality of the training sets.

8.1. Experiment 1

From Tables 2 and 4–6, we can see that there is a small difference between the performances of the sentiment analysis system using the English and the translated data, respectively. In the worst case, there is a maximum drop of 11.8% using SMO, 11.5% using AdaBoost and 8% using Bagging. Ideally, to better measure this drop we would have had to use Gold Standard training data for each language. As mentioned in Section 4, the creation of the Gold Standard is a very difficult and time consuming task. We are considering the manual translation of the training data into French, German and Spanish for future work.

13 http://www.cs.waikato.ac.nz/ml/weka/.


Table 4
Results obtained for German using the different feature representations.

Feature representation   SMT         Test set   SMO     AdaBoost M1   Bagging   BLEU score
Unigram                  Bing        GS         0.655   0.620         0.658
                                     Tr         0.655   0.625         0.666     0.227
Unigram                  Google T.   GS         0.640   0.622         0.655
                                     Tr         0.695   0.645         0.693     0.209
Unigram                  Moses       GS         0.649   0.641         0.675
                                     Tr         0.666   0.654         0.661     0.170
Unigram tf-idf           Bing        GS         0.627   0.628         0.640
                                     Tr         0.654   0.625         0.673     0.227
Unigram tf-idf           Google T.   GS         0.626   0.598         0.643
                                     Tr         0.667   0.627         0.693     0.209
Unigram tf-idf           Moses       GS         0.654   0.646         0.659
                                     Tr         0.664   0.660         0.673     0.170
Uni + bigrams            Bing        GS         0.641   0.631         0.648
                                     Tr         0.658   0.636         0.662     0.227
Uni + bigrams            Google T.   GS         0.646   0.623         0.674
                                     Tr         0.687   0.645         0.661     0.209
Uni + bigrams            Moses       GS         0.644   0.644         0.676
                                     Tr         0.667   0.667         0.674     0.170
Uni + bigrams tf-idf     Bing        GS         0.644   0.633         0.663
                                     Tr         0.655   0.644         0.647     0.227
Uni + bigrams tf-idf     Google T.   GS         0.638   0.606         0.654
                                     Tr         0.663   0.645         0.680     0.209
Uni + bigrams tf-idf     Moses       GS         0.645   0.645         0.655
                                     Tr         0.663   0.663         0.682     0.170

Table 5
Results obtained for Spanish using the different feature representations.

Feature representation   SMT         Test set   SMO     AdaBoost M1   Bagging   BLEU score
Unigram                  Bing        GS         0.627   0.620         0.633
                                     Tr         0.634   0.629         0.618     0.316
Unigram                  Google T.   GS         0.635   0.635         0.659
                                     Tr         0.630   0.630         0.665     0.341
Unigram                  Moses       GS         0.644   0.644         0.639
                                     Tr         0.675   0.675         0.676     0.298
Unigram tf-idf           Bing        GS         0.659   0.649         0.655
                                     Tr         0.622   0.637         0.646     0.316
Unigram tf-idf           Google T.   GS         0.652   0.652         0.673
                                     Tr         0.624   0.624         0.637     0.341
Unigram tf-idf           Moses       GS         0.646   0.646         0.660
                                     Tr         0.677   0.677         0.676     0.298
Uni + bigrams            Bing        GS         0.656   0.658         0.646
                                     Tr         0.633   0.633         0.633     0.316
Uni + bigrams            Google T.   GS         0.653   0.653         0.665
                                     Tr         0.636   0.667         0.665     0.341
Uni + bigrams            Moses       GS         0.664   0.664         0.671
                                     Tr         0.649   0.649         0.663     0.298
Uni + bigrams tf-idf     Bing        GS         0.672   0.676         0.665
                                     Tr         0.624   0.651         0.632     0.316
Uni + bigrams tf-idf     Google T.   GS         0.665   0.665         0.684
                                     Tr         0.632   0.632         0.649     0.341
Uni + bigrams tf-idf     Moses       GS         0.683   0.673         0.668
                                     Tr         0.684   0.677         0.685     0.298


Table 6
Results obtained for French using the different feature representations.

Feature representation   SMT         Test set   SMO     AdaBoost M1   Bagging   BLEU score
Unigram                  Bing        GS         0.604   0.634         0.644
                                     Tr         0.649   0.654         0.657     0.243
Unigram                  Google T.   GS         0.628   0.628         0.638
                                     Tr         0.652   0.652         0.679     0.274
Unigram                  Moses       GS         0.646   0.666         0.642
                                     Tr         0.663   0.657         0.660     0.227
Unigram tf-idf           Bing        GS         0.646   0.641         0.645
                                     Tr         0.652   0.661         0.664     0.243
Unigram tf-idf           Google T.   GS         0.635   0.635         0.645
                                     Tr         0.672   0.672         0.680     0.274
Unigram tf-idf           Moses       GS         0.656   0.635         0.653
                                     Tr         0.686   0.646         0.671     0.227
Uni + bigrams            Bing        GS         0.644   0.645         0.664
                                     Tr         0.644   0.649         0.652     0.243
Uni + bigrams            Google T.   GS         0.640   0.640         0.659
                                     Tr         0.652   0.652         0.678     0.274
Uni + bigrams            Moses       GS         0.633   0.633         0.645
                                     Tr         0.666   0.666         0.674     0.227
Uni + bigrams tf-idf     Bing        GS         0.645   0.658         0.661
                                     Tr         0.650   0.659         0.677     0.243
Uni + bigrams tf-idf     Google T.   GS         0.630   0.630         0.642
                                     Tr         0.666   0.666         0.685     0.274
Uni + bigrams tf-idf     Moses       GS         0.653   0.653         0.648
                                     Tr         0.664   0.664         0.687     0.227

Nonetheless, the scenario considered was aimed at studying the use of MT for SA in the real-life scenario, in which there is little or no annotated data for the language on which SA is done. As expected, the performance of the classification is much higher for data obtained using the same translator than on the Gold Standard. This is true because the same incorrect translations are repeated in both sets and therefore the learning is not influenced by these mistakes.


In the following part of this section, we discuss the results from three points of view: the feature representation, the learning algorithm, and the languages and translation systems.

Table 7
For each language, each classifier has been trained merging the translated data coming from different SMT systems, and tested using the Gold Standard.

Feature representation   Classifier   To German   To Spanish   To French
Unigram                  SMO          0.565a      0.587        0.609
                         AdaBoost     0.563       0.599        0.575
                         Bagging      0.563a      0.598        0.578
Unigram tf-idf           SMO          0.658       0.657        0.626
                         AdaBoost     0.640       0.646        0.634
                         Bagging      0.665       0.666        0.635
Uni + bigrams            SMO          0.565a      0.419        0.250
                         AdaBoost     0.563a      0.494        0.255
                         Bagging      0.563a      0.511        0.230
Uni + bigrams tf-idf     SMO          0.672       0.691        0.664
                         AdaBoost     0.672       0.684        0.658
                         Bagging      0.675       0.665        0.669

a Classifier is not able to discriminate between positive and negative classes, and assigns most of the test points to one class, and zero to the other.


8.1.1. Feature representation
The noise in the data comes from two sources – namely, incorrect translations or inappropriate features. In our experiments, we want to understand which feature representation is more robust to the noise, gives the best performance, and under which conditions. We summarize the results in Table 3. For the SVM algorithm and for each language, we check how many times the use of a certain representation leads to a higher F-score than another when testing on the Gold Standard, for each translation system. To guarantee a reasonable difference between the two strategies, if the absolute value of the F-score difference between two values is smaller than 0.005, we consider the two representations equal. For each comparison, x > y, the count can range from 0 to 3: 0 means that y is always better than x, while 3 means that x always performs better than y. E.g., for the uni + bigrams > unigram comparison for German in Table 4, we check the following values: 0.641 < 0.655 and |0.641 − 0.655| > 0.005, so we do not increase the count; 0.646 > 0.64 and |0.646 − 0.64| > 0.005, so we increase the count by 1.
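The counting rule can be made explicit with a small sketch, re-using the German Gold Standard SMO values quoted above from Table 4:

```python
# Counting rule behind Table 3: representation x "wins" over y only when
# its F-score is higher and the difference is at least 0.005.
def wins(f_x, f_y, eps=0.005):
    if abs(f_x - f_y) < eps:
        return 0  # treated as equal
    return 1 if f_x > f_y else 0

# uni + bigrams vs unigram, German Gold Standard, SMO (Bing, Google, Moses)
pairs = [(0.641, 0.655), (0.646, 0.640), (0.644, 0.649)]
print(sum(wins(x, y) for x, y in pairs))  # -> 1, the count reported in Table 3
```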

In the first row of Table 3, it is evident that the uni + bigrams representation needs better translations to provide benefits in sentiment classification compared to the unigram representation. In the presence of bad translations, the use of bigrams has a multiplicative effect on the noise: one untranslated or misplaced word in the target text affects two bigrams (features). Furthermore, if this effect is systematically present in all the training data, the bigram representation will generate a lot of features which are not discriminative and are even harmful.

The comparison between the frequency and the presence/absence representations (uni tf-idf > uni and uni + bigrams tf-idf > uni + bigrams) shows that the frequency approaches are less suitable for noisy data. This can be explained taking into account the nature of the data that we are using. The MOAT data contains questions coming from twenty different topics. From a manual evaluation of the data, we noticed that wrong translations are consistent inside each topic, i.e. within a subset of the questions. This is a type of situation where representing a feature using tf-idf makes a bigger difference than representing it in a boolean manner. In such a case, the meaning of the tf-idf score corresponds to the fact that the feature (in our case, n-gram) does not appear in all the documents, and where it appears it is frequent. For this reason, the tf-idf representation gives importance to features which are not present in the Gold Standard. This phenomenon is attenuated by increasing the quality of the translated data.
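For reference, the discussion assumes the common textbook form of tf-idf (the paper does not spell out its exact variant):

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{N}{\mathrm{df}(t)}
```

where tf(t, d) is the frequency of n-gram t in sentence d, N is the number of training sentences and df(t) is the number of sentences containing t: the weight is high exactly when a feature is frequent in a few documents and absent from the rest.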

As mentioned in the survey by Pang and Lee (2008), there is no clear evidence about the benefits of using unigrams or unigrams plus bigrams with the frequency representation.

Summarizing, the most reliable representation in the case of low translation quality is the boolean unigram one. This type of representation is better able to deal with mistranslated words. However, when the quality of the translation is higher, the frequency-based feature representation outperforms the boolean one.

8.1.2. Learning algorithms
Bagging, by reducing the variance of the estimated models, produces a positive effect on the performance, increasing the F-score compared to the learning process and features without Bagging. These improvements are larger using the German data (more than one F-score percentage point when testing on the Gold Standard), because the poor quality of the translations increases the variance in the data. For the same reason, Bagging is quite effective when unigrams and bigrams (frequency and presence/absence approaches) are used to represent the data.

AdaBoost confirms its sensitivity to noisy data, producing in general worse performance than the other two algorithms. Although it shows substantial improvements on the English data for the frequency representation, on the translated data, except for a few cases, it does not bring real benefits. In this work we paired Bagging and AdaBoost with SMO, but we are interested in running experiments using classifiers such as Naive Bayes or neural networks.

8.1.3. Languages and translation systems
Comparing the performance language by language on the Gold Standard, the best results are obtained for Spanish, for which in most of the cases the best F-score reaches 0.66, with a maximum of 0.684. For the other two languages there is more variation in the performance and it is difficult to distinguish which language obtains the best results.

For the same language and feature representation, although the gap between the best and the worst translation systems is quite large in terms of BLEU scores, it is interesting to note that there is no correlation between the BLEU score values and the classification performance at system level. The BLEU score is computed taking into account the co-occurrences of phrases ranging from one to four words between the translated and the Gold Standard data. In the presence of changes in the translations from one system to another, these can produce a reasonable variation in the BLEU score by modifying the


ount of the large phrases. Vice versa, the use of the vector space model representation in the classification problem,hich is based on the assumption of independence of the terms, makes the impact of translation changes less critical

ompared to the BLEU score computation.When translated data is used for training and testing, in most of the cases the sentiment classification performance

btained using the Moses data results in the best performance, while it has the smallest translation performanceccording to the BLEU score. It is evident that in this case inaccurate translations which appears systematically in theraining and test data creates a positive effect in the classification task.

The amount of parallel data used to train Moses clearly may affect the classification performance. Turchi et al.2012) showed that doubling the MT training size there is a constant increment in translation performance, this meanshat having more data it is possible to reach better translation performance and indirectly we believe better classificationccuracy. In case of less resourced languages (<1/1.5 million sentence pairs), the commercial translation engines cane used in support of the SA, but, even in this case, it is not guarantee an acceptable level of translation. We are alsonterested in better understanding the relation between the BLEU score and the classification performance, and for thisurpose we will run experiments using Moses trained on different sizes of the training data.
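For reference, a BLEU score over 1- to 4-grams of this kind can be computed as follows (a sketch with NLTK, which is our tooling assumption; the paper does not state which BLEU implementation it used, and the token lists are hypothetical):

    from nltk.translate.bleu_score import SmoothingFunction, corpus_bleu

    # One (tokenized) reference translation per hypothesis; the hypothesis
    # is a system translation of the same sentence.
    references = [[["laws", "are", "put", "aside", "by", "feelings"]]]
    hypotheses = [["laws", "are", "set", "aside", "by", "feelings"]]

    # Uniform weights over the 1- to 4-gram precisions, as in standard BLEU;
    # smoothing avoids a zero score when a higher-order n-gram never matches.
    score = corpus_bleu(references, hypotheses,
                        weights=(0.25, 0.25, 0.25, 0.25),
                        smoothing_function=SmoothingFunction().method1)
    print(round(score, 3))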

8.2. Experiment 2

Looking at the results in Table 7, we can see that adding all the translated training data together makes the features in the representation more sparse and increases the noise level in the training data, creating harmful effects in terms of classification performance: the classifier can lose its discriminative capability. This is not the case when using tf-idf on unigrams and on unigrams plus bigrams, in which case the combination of the data improves the classification, as this type of feature counteracts sparsity in the data.

At language level, the results clearly depend on the translation performance. Only for Spanish (for which we have the highest BLEU score) is each classifier able to properly learn from the training data and correctly assign the test samples. For the other languages, the translated data are so noisy that either (a) the classifier is not able to properly learn the correct information for the positive and the negative classes, which results in the assignment of most of the test points to one class and zero to the other, or (b) there is a significant drop in performance (e.g. for French), but the classifier is still able to assign the test points to both classes. This differs from the results of the previous experiments, where the relation between the translation and the classification performance is not so evident. When all the translated data coming from the different SMT systems are merged together, the level of noise in the training data is larger than when only the translations from one SMT system are used. This means that the learning algorithm can properly learn from the data only when the general quality of the translated texts is acceptable, whereas in the experiment in the previous section it is able to cope with the smaller amount of noise coming from the translations of a single SMT system. This is also evident in the results in Table 7, where the differences in performance across languages are larger than in the experiments in Section 8.1.

The results confirm the capability of Bagging to reduce the model variance and increase the classification performance, in particular for those feature representations which include the tf-idf term weight, and for the Spanish language. In both cases, performances are comparable to, and for some configurations even better than, those we obtained using each dataset independently.

In the case of Spanish, the combined F-score is slightly better than the English result. This shows that different systems can translate the same input sentence differently, capturing more linguistic variations in the target language. For example, for the English sentence "They were old enough to remember how badly things can go when frenzy is the order of the day and laws are put aside by feelings.", the three translation systems proposed quite different translations, differing both in the quality of the translated words and in the degree to which syntax rules are obeyed and the logic and meaning of the sentence are maintained. The respective translations are: "Sie waren alt genug, um Denken Sie daran, wie schlecht Dinge gehen können, wenn Rausch an der Tagesordnung ist, und Gesetze sind durch Gefühle beiseite gefegt." (translation to German by Bing); "Sie waren alt genug, sich daran zu erinnern, wie schlimm es gehen kann, wenn der Rausch der Reihenfolge des Tages, und Gesetze sind außer von Gefühlen mitgerissen." (translation to German by Google); "Sie waren alt genug, sich daran zu erinnern, wie dringend es gehen kann, wenn eine Intoxikation ist an der Tagesordnung, und die Gesetze von Gefühlen." (translation to German by Moses). Overall, we can see that by adding together the features extracted from the combination of these three translations, we obtain a better approximation of the correct translation than when using each of the translation systems separately: words that are mistranslated by one system are correctly translated by the other two and, at the same time, the correct translation is reinforced.
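The data-combination strategy itself is simple; a minimal sketch follows (the toy sentences and labels are hypothetical stand-ins for the translated MOAT data):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC

    # The same labeled training sentences, translated by three systems
    # (toy examples; the experiments use Bing, Google Translate and Moses).
    bing_texts = ["sehr gutes produkt", "wirklich schlechter film"]
    google_texts = ["sehr gutes erzeugnis", "ein wirklich schlechter film"]
    moses_texts = ["gutes produkt wirklich", "schlechter film"]
    labels = [1, 0]  # the sentiment labels are shared across translations

    # Union of the three translated corpora: a word mistranslated by one
    # system is often rendered correctly by the others, so the correct
    # features are reinforced in the merged training set.
    merged_texts = bing_texts + google_texts + moses_texts
    merged_labels = labels * 3

    vec = CountVectorizer(binary=True)
    clf = LinearSVC().fit(vec.fit_transform(merged_texts), merged_labels)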

9. Conclusions and future work

In this work we propose an extensive evaluation of the use of translated data in the context of sentiment analysis. Our findings show that SMT systems have reached a reasonable level of maturity and can produce sufficiently reliable training data for languages other than English. The gap in classification performance between systems trained on English and on translated data is generally small, with a maximum of 12% in favor of the source language data.

Working with translated data implies an increased number of features, and more sparseness and noise in the data points used in the classification task. To limit these problems, we test three different classification approaches, using different types of features and classifiers, showing that using unigrams or tf-idf on unigrams as features, and/or Bagging as a meta-classifier, has a positive impact on the results. Furthermore, when the translation quality is good, we noticed that the union of the same training data translated by various systems can help the classifiers to learn different linguistic aspects of the same data.

The proposed approach clearly depends on the availability of translation engines for the required languages. Although commercial engines are able to translate from and into a large number of languages, they cannot be used to freely translate large amounts of data (usually not more than a certain number of characters). On the other hand, the parallel corpora needed for training the open source SMT systems cover only the most used languages, and their sizes are not comparable to the datasets used to train commercial engines. These aspects may limit the use of MT in support of other natural language processing areas, in particular those focused on less resourced languages.

In future work, we plan to investigate different document representations; in particular, we believe that projecting our documents into a space where the features belong to a sentiment lexicon (in conjunction with the types of features we have already employed) and including syntactic information can reduce the impact of the translation errors.

Acknowledgments

The authors would like to thank Ivano Azzini, from the BriLeMa Artificial Intelligence Studies, for the advice and support on using meta-classifiers. We would also like to thank the reviewers for their useful comments and suggestions, which helped to improve the clarity and completeness of the article.

Appendix A. Precision and recall evaluation

In the previous sections, all the results are expressed in terms of F-measure to evaluate the accuracy of our experiments. For completeness, and to better evaluate the findings of this work, we present the classification performance in terms of weighted precision (Tables A.1, A.3, A.5, A.7 and A.9) and recall (Tables A.2, A.4, A.6, A.8 and A.10). In all the experiments, precision and recall are well balanced, showing the capability of the classification algorithm to identify more relevant sentiment labels than irrelevant ones and to return most of the relevant labels.
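The weighted scores reported in the tables below average the per-class precision and recall, weighting each class by its number of gold instances. A short sketch of how such scores can be obtained (using scikit-learn, which is a tooling assumption on our part, with hypothetical labels):

    from sklearn.metrics import precision_recall_fscore_support

    y_true = [1, 1, 1, 0, 0, 1]  # hypothetical gold sentiment labels
    y_pred = [1, 0, 1, 0, 1, 1]  # hypothetical classifier predictions

    # average='weighted': compute precision/recall per class, then average
    # them weighted by class support (the number of gold instances).
    p, r, f, _ = precision_recall_fscore_support(y_true, y_pred,
                                                 average="weighted")
    print(f"precision={p:.3f} recall={r:.3f} f-score={f:.3f}")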

In general, all the findings of the paper are also confirmed when analyzing the performance in terms of precision and recall. An exception is given by the results in Table A.9, where the highest score for the translation into French is obtained by the uni + bigrams representation instead of the uni + bigrams tf-idf, as shown in Table 7. Cross-checking the precision and recall performance, it is evident that the uni + bigrams representation has a very high precision but a very small recall. This confirms that the uni + bigrams tf-idf representation produces the best results overall, and that precision or recall alone is not sufficient for a complete analysis.

Table A.1
Results obtained for English using the different representations. Results are reported in terms of weighted precision.

Feature representation Test set SMO AdaBoost M1 Bagging

Unigram GS 0.689 0.693 0.697
Unigram tf-idf GS 0.655 0.681 0.684
Uni + bigrams GS 0.681 0.681 0.683
Uni + bigrams tf-idf GS 0.665 0.678 0.682


Table A.2
Results obtained for English using the different representations. Results are reported in terms of weighted recall.

Feature representation Test set SMO AdaBoost M1 Bagging

Unigram GS 0.678 0.675 0.681
Unigram tf-idf GS 0.647 0.658 0.678
Uni + bigrams GS 0.692 0.692 0.689
Uni + bigrams tf-idf GS 0.678 0.669 0.695

Table A.3
Results obtained for German using the different feature representations. Results are reported in terms of weighted precision.

Feature representation SMT Test set SMO AdaBoost M1 Bagging BLEU score

Unigram Bing GS 0.654 0.648 0.659
Unigram Bing Tr 0.653 0.64 0.666 0.227
Unigram Google T. GS 0.647 0.648 0.651
Unigram Google T. Tr 0.692 0.652 0.692 0.209
Unigram Moses GS 0.645 0.651 0.67
Unigram Moses Tr 0.667 0.663 0.659 0.17
Unigram tf-idf Bing GS 0.628 0.653 0.641
Unigram tf-idf Bing Tr 0.648 0.64 0.669 0.227
Unigram tf-idf Google T. GS 0.63 0.623 0.639
Unigram tf-idf Google T. Tr 0.667 0.642 0.689 0.209
Unigram tf-idf Moses GS 0.649 0.654 0.656
Unigram tf-idf Moses Tr 0.663 0.669 0.672 0.17
Uni + bigrams Bing GS 0.639 0.633 0.64
Uni + bigrams Bing Tr 0.653 0.636 0.657 0.227
Uni + bigrams Google T. GS 0.644 0.635 0.668
Uni + bigrams Google T. Tr 0.684 0.648 0.659 0.209
Uni + bigrams Moses GS 0.642 0.644 0.67
Uni + bigrams Moses Tr 0.667 0.667 0.669 0.17
Uni + bigrams tf-idf Bing GS 0.641 0.639 0.656
Uni + bigrams tf-idf Bing Tr 0.65 0.645 0.64 0.227
Uni + bigrams tf-idf Google T. GS 0.635 0.625 0.646
Uni + bigrams tf-idf Google T. Tr 0.66 0.637 0.678 0.209
Uni + bigrams tf-idf Moses GS 0.639 0.639 0.649
Uni + bigrams tf-idf Moses Tr 0.66 0.66 0.676 0.17

Table A.4
Results obtained for German using the different feature representations. Results are reported in terms of weighted recall.

Feature representation SMT Test set SMO AdaBoost M1 Bagging BLEU score

Unigram Bing GS 0.656 0.609 0.656
Unigram Bing Tr 0.657 0.616 0.666 0.227
Unigram Google T. GS 0.634 0.609 0.662
Unigram Google T. Tr 0.698 0.64 0.692 0.209
Unigram Moses GS 0.654 0.634 0.682
Unigram Moses Tr 0.665 0.648 0.662 0.17
Unigram tf-idf Bing GS 0.626 0.615 0.64
Unigram tf-idf Bing Tr 0.663 0.616 0.68 0.227
Unigram tf-idf Google T. GS 0.623 0.584 0.648
Unigram tf-idf Google T. Tr 0.668 0.617 0.701 0.209
Unigram tf-idf Moses GS 0.662 0.64 0.662
Unigram tf-idf Moses Tr 0.665 0.654 0.673 0.17
Uni + bigrams Bing GS 0.642 0.628 0.662
Uni + bigrams Bing Tr 0.663 0.635 0.669 0.227


Uni + bigrams Google T. GS 0.648 0.615 0.684
Uni + bigrams Google T. Tr 0.69 0.642 0.662 0.209
Uni + bigrams Moses GS 0.645 0.644 0.687
Uni + bigrams Moses Tr 0.667 0.667 0.682 0.17
Uni + bigrams tf-idf Bing GS 0.648 0.628 0.679
Uni + bigrams tf-idf Bing Tr 0.663 0.643 0.657 0.227
Uni + bigrams tf-idf Google T. GS 0.642 0.595 0.668
Uni + bigrams tf-idf Google T. Tr 0.668 0.637 0.682 0.209
Uni + bigrams tf-idf Moses GS 0.654 0.654 0.665
Uni + bigrams tf-idf Moses Tr 0.668 0.668 0.693 0.17

Table A.5
Results obtained for Spanish using the different feature representations. Results are reported in terms of weighted precision.

Feature representation SMT Test set SMO AdaBoost M1 Bagging BLEU score

Unigram Bing GS 0.622 0.614 0.626
Unigram Bing Tr 0.629 0.623 0.611 0.316
Unigram Google T. GS 0.634 0.634 0.654
Unigram Google T. Tr 0.63 0.63 0.661 0.341
Unigram Moses GS 0.636 0.636 0.632
Unigram Moses Tr 0.674 0.674 0.675 0.298
Unigram tf-idf Bing GS 0.654 0.642 0.648
Unigram tf-idf Bing Tr 0.617 0.629 0.641 0.316
Unigram tf-idf Google T. GS 0.651 0.651 0.669
Unigram tf-idf Google T. Tr 0.623 0.623 0.638 0.341
Unigram tf-idf Moses GS 0.638 0.638 0.653
Unigram tf-idf Moses Tr 0.678 0.678 0.673 0.298
Uni + bigrams Bing GS 0.648 0.65 0.638
Uni + bigrams Bing Tr 0.625 0.625 0.627 0.316
Uni + bigrams Google T. GS 0.646 0.646 0.658
Uni + bigrams Google T. Tr 0.632 0.632 0.663 0.341
Uni + bigrams Moses GS 0.656 0.656 0.664
Uni + bigrams Moses Tr 0.65 0.65 0.665 0.298
Uni + bigrams tf-idf Bing GS 0.665 0.67 0.658
Uni + bigrams tf-idf Bing Tr 0.615 0.648 0.623 0.316
Uni + bigrams tf-idf Google T. GS 0.658 0.658 0.679
Uni + bigrams tf-idf Google T. Tr 0.628 0.628 0.644 0.341
Uni + bigrams tf-idf Moses GS 0.677 0.677 0.661
Uni + bigrams tf-idf Moses Tr 0.685 0.676 0.685 0.298

Table A.6
Results obtained for Spanish using the different feature representations. Results are reported in terms of weighted recall.

Feature representation SMT Test set SMO AdaBoost M1 Bagging BLEU score

Unigram Bing GS 0.633 0.627 0.641
Unigram Bing Tr 0.64 0.637 0.626 0.316
Unigram Google T. GS 0.633 0.633 0.667
Unigram Google T. Tr 0.63 0.63 0.669 0.341
Unigram Moses GS 0.655 0.655 0.65
Unigram Moses Tr 0.676 0.676 0.682 0.298
Unigram tf-idf Bing GS 0.667 0.661 0.667
Unigram tf-idf Bing Tr 0.628 0.651 0.654 0.316


Unigram tf-idf Google T. GS 0.653 0.653 0.669
Unigram tf-idf Google T. Tr 0.625 0.625 0.636 0.341
Unigram tf-idf Moses GS 0.658 0.658 0.672
Unigram tf-idf Moses Tr 0.676 0.676 0.679 0.298
Uni + bigrams Bing GS 0.672 0.675 0.658
Uni + bigrams Bing Tr 0.645 0.645 0.64 0.316
Uni + bigrams Google T. GS 0.664 0.664 0.675
Uni + bigrams Google T. Tr 0.641 0.641 0.672 0.341
Uni + bigrams Moses GS 0.681 0.681 0.686
Uni + bigrams Moses Tr 0.648 0.648 0.662 0.298
Uni + bigrams tf-idf Bing GS 0.689 0.689 0.681
Uni + bigrams tf-idf Bing Tr 0.64 0.654 0.645 0.316
Uni + bigrams tf-idf Google T. GS 0.675 0.675 0.695
Uni + bigrams tf-idf Google T. Tr 0.636 0.636 0.655 0.341
Uni + bigrams tf-idf Moses GS 0.697 0.697 0.686
Uni + bigrams tf-idf Moses Tr 0.683 0.679 0.684 0.298

Table A.7
Results obtained for French using the different feature representations. Results are reported in terms of weighted precision.

Feature representation SMT Test set SMO AdaBoost M1 Bagging BLEU score

Unigram Bing GS 0.613 0.636 0.648
Unigram Bing Tr 0.642 0.648 0.65 0.243
Unigram Google T. GS 0.626 0.626 0.63
Unigram Google T. Tr 0.65 0.65 0.676 0.274
Unigram Moses GS 0.649 0.668 0.646
Unigram Moses Tr 0.662 0.657 0.662 0.227
Unigram tf-idf Bing GS 0.649 0.638 0.644
Unigram tf-idf Bing Tr 0.645 0.654 0.657 0.243
Unigram tf-idf Google T. GS 0.632 0.632 0.637
Unigram tf-idf Google T. Tr 0.672 0.672 0.677 0.274
Unigram tf-idf Moses GS 0.653 0.649 0.653
Unigram tf-idf Moses Tr 0.682 0.647 0.67 0.227
Uni + bigrams Bing GS 0.644 0.644 0.66
Uni + bigrams Bing Tr 0.634 0.642 0.644 0.243
Uni + bigrams Google T. GS 0.636 0.636 0.652
Uni + bigrams Google T. Tr 0.646 0.646 0.672 0.274
Uni + bigrams Moses GS 0.643 0.643 0.66
Uni + bigrams Moses Tr 0.663 0.663 0.672 0.227
Uni + bigrams tf-idf Bing GS 0.639 0.651 0.653
Uni + bigrams tf-idf Bing Tr 0.642 0.652 0.675 0.243
Uni + bigrams tf-idf Google T. GS 0.624 0.624 0.634
Uni + bigrams tf-idf Google T. Tr 0.66 0.66 0.68 0.274
Uni + bigrams tf-idf Moses GS 0.653 0.653 0.648
Uni + bigrams tf-idf Moses Tr 0.658 0.658 0.682 0.227

Table A.8
Results obtained for French using the different feature representations. Results are reported in terms of weighted recall.

Feature representation SMT Test set SMO AdaBoost M1 Bagging BLEU score

Unigram Bing GS 0.598 0.632 0.641
Unigram Bing Tr 0.659 0.662 0.668 0.243
Unigram Google T. GS 0.629 0.629 0.632
Unigram Google T. Tr 0.655 0.655 0.684 0.274


Unigram Moses GS 0.644 0.664 0.638
Unigram Moses Tr 0.665 0.656 0.659 0.227
Unigram tf-idf Bing GS 0.644 0.644 0.647
Unigram tf-idf Bing Tr 0.665 0.676 0.682 0.243
Unigram tf-idf Google T. GS 0.638 0.638 0.658
Unigram tf-idf Google T. Tr 0.672 0.672 0.684 0.274
Unigram tf-idf Moses GS 0.658 0.626 0.652
Unigram tf-idf Moses Tr 0.69 0.645 0.673 0.227
Uni + bigrams Bing GS 0.644 0.647 0.67
Uni + bigrams Bing Tr 0.654 0.659 0.67 0.243
Uni + bigrams Google T. GS 0.644 0.644 0.672
Uni + bigrams Google T. Tr 0.661 0.661 0.693 0.274
Uni + bigrams Moses GS 0.626 0.626 0.635
Uni + bigrams Moses Tr 0.67 0.67 0.676 0.227
Uni + bigrams tf-idf Bing GS 0.645 0.658 0.661
Uni + bigrams tf-idf Bing Tr 0.668 0.679 0.704 0.243
Uni + bigrams tf-idf Google T. GS 0.638 0.638 0.67
Uni + bigrams tf-idf Google T. Tr 0.678 0.678 0.701 0.274
Uni + bigrams tf-idf Moses GS 0.652 0.652 0.649
Uni + bigrams tf-idf Moses Tr 0.673 0.673 0.696 0.227

Table A.9
For each language, each classifier has been trained by merging the translated data coming from the different SMT systems, and tested on the Gold Standard. Results are reported in terms of weighted precision.

Feature representation To German To Spanish To French

Unigram SMO 0.483a 0.569 0.616
Unigram AdaBoost 0.483 0.582 0.545
Unigram Bagging 0.483a 0.587 0.596
Unigram tf-idf SMO 0.657 0.653 0.629
Unigram tf-idf AdaBoost 0.649 0.647 0.64
Unigram tf-idf Bagging 0.659 0.66 0.642
Uni + bigrams SMO 0.483a 0.616 0.73
Uni + bigrams AdaBoost 0.483a 0.649 0.733
Uni + bigrams Bagging 0.483a 0.556 0.752
Uni + bigrams tf-idf SMO 0.666 0.686 0.658
Uni + bigrams tf-idf AdaBoost 0.666 0.678 0.651
Uni + bigrams tf-idf Bagging 0.669 0.658 0.663

a Classifier is not able to discriminate between positive and negative classes, and assigns most of the test points to one class, and zero to the other.

Table A.10
For each language, each classifier has been trained by merging the translated data coming from the different SMT systems, and tested on the Gold Standard. Results are reported in terms of weighted recall.

Feature representation To German To Spanish To French

Unigram SMO 0.679a 0.622 0.67
Unigram AdaBoost 0.676 0.639 0.672
Unigram Bagging 0.676a 0.616 0.566
Unigram tf-idf SMO 0.659 0.661 0.624
Unigram tf-idf AdaBoost 0.634 0.644 0.629
Unigram tf-idf Bagging 0.668 0.675 0.629


Uni + bigrams SMO 0.679a 0.429 0.353
Uni + bigrams AdaBoost 0.676a 0.487 0.356
Uni + bigrams Bagging 0.676a 0.49 0.345
Uni + bigrams tf-idf SMO 0.684 0.703 0.675
Uni + bigrams tf-idf AdaBoost 0.684 0.697 0.67


Uni + bigrams tf-idf Bagging 0.687 0.675 0.678

a Classifier is not able to discriminate between positive and negative classes, and assigns most of the test points to one class, and zero to the other.


References

Ahmad, K., Cheng, D., Almas, Y., 2007. Multi-lingual sentiment analysis of financial news streams. In: Proceedings of the Second Workshop on Computational Approaches to Arabic Script-based Languages, pp. 1-12.

Bader, B., Kegelmeyer, W., Chew, P., 2011. Multilingual sentiment analysis using latent semantic indexing and machine learning. In: IEEE 11th International Conference on Data Mining Workshops (ICDMW), pp. 45-52.

Banea, C., Mihalcea, R., Wiebe, J., 2008a. A bootstrapping method for building subjectivity lexicons for languages with scarce resources. In: Proceedings of the Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco.

Banea, C., Mihalcea, R., Wiebe, J., 2010. Multilingual subjectivity: are more languages better? In: Proceedings of the International Conference on Computational Linguistics (COLING 2010), Beijing, China, pp. 28-36.

Banea, C., Mihalcea, R., Wiebe, J., Hassan, S., 2008b. Multilingual subjectivity analysis using machine translation. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2008), Honolulu, Hawaii, pp. 127-135.

Boudin, F., Huet, S., Torres-Moreno, J., Torres-Moreno, J., 2010. A graph-based approach to cross-language multi-document summarization. Research Journal on Computer Science and Computer Engineering with Applications (Polibits) 43, 113-118.

Breiman, L., 1996. Bagging predictors. Machine Learning 24 (2), 123-140.

Brown, P.F., Della Pietra, S.A., Della Pietra, V.J., Mercer, R.L., 1994. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics 19, 263-311.

Callison-Burch, C., Koehn, P., Monz, C., Schroeder, J., 2009. Findings of the 2009 workshop on statistical machine translation. In: Proceedings of the Fourth Workshop on Statistical Machine Translation, Athens, Greece, pp. 1-28.

Chen, B., Kuhn, R., Foster, G., 2012. Improving AMBER, an MT evaluation metric. In: NAACL 2012 Workshop on Statistical Machine Translation (WMT-2012), pp. 59-63.

Deerwester, S., Dumais, S., Furnas, G.W., Landauer, T.K., Harshman, R., 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41 (6), 391-407.

Esuli, A., Sebastiani, F., 2006. SentiWordNet: a publicly available lexical resource for opinion mining. In: Proceedings of the 5th Conference on Language Resources and Evaluation (LREC'06), pp. 417-422.

Freund, Y., Schapire, R., 1995. A decision-theoretic generalization of on-line learning and an application to boosting. In: Computational Learning Theory. Springer, pp. 23-37.

Inui, T., Yamamoto, M., 2011. Applying sentiment-oriented sentence filtering to multilingual review classification. In: Proceedings of the Workshop on Sentiment Analysis where AI Meets Psychology (SAAIP), IJCNLP 2011, pp. 51-58.

Kim, J., Li, J.-J., Lee, J.-H., 2010. Evaluating multilanguage-comparability of subjectivity analysis systems. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 595-602.

Kim, S.-M., Hovy, E., 2006. Automatic identification of pro and con reasons in online reviews. In: Proceedings of the COLING/ACL Main Conference Poster Sessions, pp. 483-486.

Koehn, P., 2005. Europarl: a parallel corpus for statistical machine translation. In: Proceedings of the Machine Translation Summit X, Phuket, Thailand, pp. 79-86.

Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., Herbst, E., 2007. Moses: open source toolkit for statistical machine translation. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 177-180.

Koehn, P., Och, F.J., Marcu, D., 2003. Statistical phrase-based translation. In: Proceedings of the North American Chapter of the Association for Computational Linguistics, pp. 48-54.

Lin, C.Y., 2004. ROUGE: a package for automatic evaluation of summaries, pp. 25-26.

Mihalcea, R., Banea, C., Wiebe, J., 2007. Learning multilingual subjective language via cross-lingual projections. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, Prague, Czech Republic, pp. 976-983.

Och, F.J., 2003. Minimum error rate training in statistical machine translation. In: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan, pp. 160-167.


Pang, B., Lee, L., 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2 (1-2), 1-135.

Papineni, K., Roukos, S., Ward, T., Zhu, W.J., 2001. BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311-318.

Platt, J.C., 1999. Fast Training of Support Vector Machines Using Sequential Minimal Optimization. MIT Press, Cambridge, MA, USA, pp. 185-208. http://dl.acm.org/citation.cfm?id=299094.299105

Savoy, J., Dolamic, L., 2009. How effective is Google's translation service in search? Communications of the ACM 52 (10), 139-143.

Steinberger, J., Lenkova, P., Ebrahim, M., Ehrman, M., Hurriyetoglu, A., Kabadjov, M., Steinberger, R., Tanev, H., Zavarella, V., Vazquez, S., 2011a. Creating sentiment dictionaries via triangulation. In: Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis, Portland, Oregon.

Steinberger, J., Lenkova, P., Kabadjov, M., Steinberger, R., van der Goot, E., 2011b. Multilingual entity-centered sentiment analysis evaluated by parallel corpora. In: Proceedings of the Conference on Recent Advances in Natural Language Processing (RANLP), Hissar, Bulgaria.

Steinberger, J., Turchi, M., 2012. Machine translation for multilingual summary content evaluation. In: Proceedings of the Workshop on Evaluation Metrics and System Comparison for Automatic Summarization. Association for Computational Linguistics, Montréal, Canada, pp. 19-27. http://www.aclweb.org/anthology/W12-2603

Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufis, D., Varga, D., 2006. The JRC-Acquis: a multilingual aligned parallel corpus with 20+ languages. In: Proceedings of the 5th International Conference on Language Resources and Evaluation, Genoa, Italy, pp. 2142-2147.

Stone, P.J., Dunphy, D.C., Smith, M.S., Ogilvie, D.M., 1966. The General Inquirer: A Computer Approach to Content Analysis. MIT Press, Cambridge, Massachusetts, USA. http://www.webuse.umd.edu:9090/

Strapparava, C., Valitutti, A., 2004. WordNet-Affect: an affective extension of WordNet. In: Proceedings of the 4th International Conference on Language Resources and Evaluation, pp. 1083-1086.

Tiedemann, J., 2009. News from OPUS - a collection of multilingual parallel corpora with tools and interfaces. In: Recent Advances in Natural Language Processing V: Selected Papers from RANLP.

Turchi, M., Atkinson, M., Wilcox, A., Crawley, B., Bucci, S., Steinberger, R., Van der Goot, E., 2012a. ONTS: "Optima" news translation system. In: Proceedings of EACL, p. 25.

Turchi, M., Goutte, C., De Bie, T., Cristianini, N., 2012. Learning to translate: a statistical and computational analysis. Advances in Artificial Intelligence 2012, 15.

Turchi, M., Specia, L., Steinberger, J., 2012c. Relevance ranking for translated texts. In: Proceedings of the 16th Annual Conference of the European Association for Machine Translation (EAMT-2012), Trento, Italy, pp. 153-160.

Wan, X., Li, H., Xiao, J., 2010. Cross-language document summarization based on machine translation quality prediction. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 917-926.

Wilson, T., Wiebe, J., Hoffmann, P., 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In: Proceedings of HLT-EMNLP, Vancouver, Canada, pp. 347-354.

Xiaojun, W., 2009. Co-training for cross-lingual sentiment classification. In: Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics, pp. 235-243.