ia-regional-radio - social network for radio recommendation. · ia-regional-radio - social network...

IA-Regional-Radio - social network for radiorecommendation.

Grzegorz Dziczkowski1, Lamine Bougueroua2, and KatarzynaWegrzyn-Wolska2

1 Ecole des Mines de Paris, 77305 Fontainebleau, France,2 Ecole Superieur d’Ingenieurs en Informatique et Genie des Telecommunicatiom,

77-215 Avon-Fontainebleau Cedex, France,(grzegorz.dziczkowski lamine.bougueroua katarzyna.wolska)@esigetel.fr

Summary. This chapter describes the functions of a system proposed for the musichit recommendation from social network data base. This system carries out theautomatic collection, evaluation and rating of music reviewers, the possibility forlisteners to rate musical hits and recommendations deduced from auditor’s profiles inthe form of regional internet radio. First, the system searches and retrieves probablemusic reviews from the Internet. Subsequently, the system carries out an evaluationand rating of those reviews. From this list of music hits the system directly allowsnotation from our application. Finally the system automatically create the recordlist diffused each day depended form the region, the year season, day hours and ageof listeners. Our system uses linguistics and statistic methods for classifying musicopinions and data mining techniques for recommendation part needed for recordedlist creation. The principal task is the creation of popular intelligent radio adaptiveon auditor’s age and region - IA-Regional-Radio.

1 Introduction and issue

Social networking sites are a global phenomenon. For the hundreds of mil-lions of people worldwide who belong to sites like MySpace, Facebook andYouTube, social interaction in cyberspace has become an indispensable partof their lives.

Nowadays the internet is an essential tool for the exchange of informationon a personal and professional level. The web offers us a world of prodigiousinformation and has evolved from simple sets of static information to serviceswhich are more and more complex. With the growth of the Web, internet ra-dio, recommendation tools and e-commerce has become popular. Many web-sites offer on line services or sales and propose object ratings to their users,for music, films, and products for example. With globalization, the productchoice is too much diversified, therefore, users are not aware of the availabilityof products. For this reason, prediction engines were developed to offer the

2 Dziczkowski, Bougueroua, Wegrzyn-Wolska

user alternative products. Generally, the influences of others is important foropinion making.

Prediction engines’ algorithms are based on the experience and opinionof other users. In order to do those engines, we need to have an extremelylarge user profile base. In our case - internet radio, products furnish by oursystem are musical hits. In this case, auditors are not supposed to make thevalidation process of listened musical hit. The need of such system is justifiedby increasing value of radio auditors. For example, in France there are 42 mil-lions of auditors each day, it represent 8 people out of 10 older then 13 yearsold [19]. The value of internet radio listener drew over 1.2 additional millionsof listeners aged 13 and over last year [Mediametrie 126 000 Radio 2007-2008][19].

The general objective of our system is to furnish the intelligent internetradio which needs to be programmed first but which interacts with the tasteof listeners from the same region and of the same age. The vote of the listen-ers will directly affect the single rotation frequency. Another advantage of oursystem is that the prediction are not made for each auditor but for the groupof auditors from same region and same age. Like this, we present also musicalhits which have not been discovered yet for singular listener. In addition, ourmethod is more useful for new auditors which didn’t evaluate too many mu-sical hits. The time at which auditors connect to the Internet to listen to themusic is arbitrary. It is possible that a musical hit is very appreciated in themorning but not at night. Therefore, we look at voting time to understandbetter the tastes of users.

Most of radios are using automation radio software. In fact, it gives thepossibility to make playlists, to play all type of song (jungle, advertisement,music, interviews, live, ...) and that with just a simple computer. In Francefor example, we have a radio totally automated, ”Chante France”. There areno speakers, only a computer which play songs. Some products exist in radioautomation software. These products are used by associative or personal ra-dios. There is also a little list of professional products used by national radios.

The player Zarasoft has no database, there is no possibility to programand regroup onto categories. But there is the possibility to program events atspecific time. It detects silence at the end of track. It is every time necessary toadd manually the playlist, or to select a directory, and zararadio selects ran-domly the songs to play, but there is no possibility to show the time remainingfor introduction. We can describe this software as semi-automatically.

Zradio is a freeware, developed by a local radio RMZ at Poitier before2003. It has the possibility to play Jingles, advertisements and to be programin a format by using a database. Winamp is a simple free media player de-veloped since 1997. Contrary to Zararadio, it can not select randomly a songby directory. There is no automation because we need to manually programall the song we want to hear. Radugo exists since before October 2001. Weneed to manually add all songs. There is a possibility to create list think,log files automatically generate, passwords protection, scheduled events and

IA-Regional-Radio - social network for radio recommendation. 3

silence detection.The player DRS2006 includes automatics playlist editor. With the plug-

in broadcast, it’s possible to directly broadcast the stream on the Internet.It includes a database, a playlist editor, a studio Radio and studio DJ, andsome tolls about database. Easyradio exists since 1998. It’s the most com-plete software in radio automation for little radios. It’s simple of use, it hasthe possibility to define some time formats, and the playlist are generatedautomatically. This software is very complete, and is recommended for DJs,radios, expositions. The editor of Sam Broadcaster has been created in 1999.It seems to be complete software. It integrates a web broadcaster. We canmake some graphs with the statistical number of listeners. It gives an htmloutput that allows the update of the song play, for example, for a website.

All softwares have its own advantages and disadvantages. But all the ad-vanced solutions do not include an automatic update of the song frequencybased on the vote of the listeners.System presented in this chapter searches and retrieves probable music re-views from the Internet. Subsequently, the system carries out an evaluationand rating of those reviews. From this list of musical hits the system directlyallows notation from our application. Finally the system automatically createthe record list diffused each day depended form the region, the year sea-son, day hours and age of listeners. Our system uses linguistics and statisticmethods for classifying music opinions and data mining techniques for rec-ommendation part needed for recorded list creation. The principal task is thecreation of popular intelligent radio adaptive on auditor’s age and region -IA-Regional-Radio.

2 General system architecture

In this section, we make a description of the architecture. Our objective isto make internet radio which interacts with the taste of listeners. Principalmodules of our architecture are [Figure 1]: research and collect of reviewerson Internet, attribution of a mark for each review and storage of interestinginformation in database. We also store mark collected from internet radio web-site. The most important part of our modular architecture system is Opinionmarking module described in section 6. The recommendation system is basedon information from opinion marking module to understand better tastes ofusers.


Reviewers

… …

Notation on site web

RecommendationSystem : region, age,

sex,…

Application

Automatic Play list

Web RadioData Base

Spiders

Opinion marking module(Classifier)

Fig. 1. Application architecture.

The last module is used to generate lists of songs depending on severalcriteria : region and age of listener, recommendation results, hours of broad-casting.

The first part will be to make a database and an interface to add songs onit. The field in database will be essentially: title of song, artist name, category,frequency, listener age, listener region and some others information [Figure 2].

Fig. 2. Data base.

The next part is to collect information about musical hits on internet(youtube, ...). In fact, we use spider which is the automated and method-ological traverse and index of web pages for subsequent search purposes. Theapplication has a web site to receive request of web listeners, which result ina note or comment on musical hits.

All relevant information are, then, recovered either directly from the ap-plication, or by the spiders. Finally, informations are stored in a database.After collecting the reviews, we will assign notes by using the classifiers. Theclassifiers provide ratings from users’ feelings. The classifier uses three differ-ent methods for assigning a mark to the reviews. Those methods are basedon different approaches of corpus classification. The notes of the classificationwill be stored in a database.

The recommendation system, relying on the notes in the database andregion, sex and age of the user, determine the most appropriate musical hits


to broadcast. Each day, the application auto-generate playlists by using theplaylist scheme generator model [Figure 3]. It consists to select the sound toplay by respecting the restrictions and the playlist scheme according to thesong frequency in the database.

Fig. 3. Web radio Application - server.

The web radio is composed of two parts. The first part is the studio playerinterface. It represents the server side. It displays the list during playback andthe remaining time for start playing musical hits.

The second part is the web interface. It represents the listener interface. Itallows listener to view the jacket and title name being circulated, the last fivetitles available, the vote for the title currently broadcast and the possibilityto obtain additional information on a title.

3 Text Mining

The text categorization TC is now applicable in many different contexts.Document indexation is based on a lexicon, a document filtration, an au-tomatic generation of metadata, a suppression of words ambiguity, the set-tlement of hierarchical catalogues of Web resources, and, in general, everyapplication, which needs document organization or selective processing anddocument adaptation [16], [23].

The Machine Learning ML describes a general inductive process, whichautomatically constructs a text classifier via the learning, from a series of pre-classified documents or from characteristics of interest categories. Text MiningTM is a set of informatics processing, which consists of extracting knowledgein terms of innovative criteria or in terms of similarities in texts produced byhuman beings for human beings. A field using TC, ML or TM techniques is,in particular, the field of sentimental analysis [4], known as Opinion Mining[5]. The research in this field covers different subjects, in particular the learn-ing of words’ or expressions’ semantic orientation, the sentimental analysis


of documents and opinions and attitudes analysis regarding some subjects orproducts [18], [15].

3.1 Representation of documentary corpus

Texts in natural language can not be directly interpreted by a classifier or byclassification algorithms. The first linguistic units representing the sense arewords’ lemma. The recognition of those linguistic units requires carrying outa linguistic preprocessing of the text’s words. The number of words charac-terizing a document corpus can be really wide. Therefore, it is necessary toconserve a subgroup of those words. This filtering relies at the root on wordsoccurrence frequencies in the corpus.

Other approaches are using not words but group of words, eventually sen-tences such as linguistic units, describing the sense. Thanks to this approach,we have an order relationship between words and words’ co-occurrences. Theinconvenient is that the frequency of group of words apparition can not offerreliable statisticals because the great number of combinations between wordscreates frequencies which are too wick to be exploited.

Another approach to represent the documentary corpus is the utilizationof the n-grams technique [26]. Those methods are independent from the lan-guage, however neither the segmentation in linguistic units, nor pre-treatmentsas filtration and lemmatization are necessary.

If we are using words such as linguistic unit, we notice that different wordshave common sense or are simply another form of conjugation. Therefore, aprocessing named stemming has to be carried out. It is a processing, whichproceed at a morphologic analysis of the text [24]. The processing, which needsa more complex analysis than stemming, is lemmatization, which is based on alexicon. A lexicon is a set of lemmas with which we can refer to the dictionary.Lemmatization needs to carry out in addition a syntax analysis in order toresolved ambiguities. Therefore, it conducts a morphosyntactic analysis [25].

The role of textual representation is represented mathematically in a waythat we can carry out the analytic processing, meanwhile, conserving at amaximum the semantic one. The indexation process itself consists in conduct-ing a simple complete inventory of all corpus lemmas. The next step is theselection process of the lemma, which will constitute linguistic units of thefield or vector space dimension of the representation of documentary corpus.

3.2 Classification techniques

The classification procedure is automatically generated from a set of exam-ples. An example consists in a description of a case with the correspondingclassification. We dispose, for example, a database of patients’ symptoms withthe status of their respective health state, as well as the medical diagnostic oftheir sickness. The training system must then, from this set of examples, ex-tract a classification procedure, which will, with a view of patients’ symptoms,


establish a medical diagnostic. It is a matter of inducing a general classifica-tion procedure taken from examples. The problem is therefore an inductiveproblem. It is a matter of extracting a general rule from observed data.

Bayes’ Classifier

The probabilistic classifier interpret the function CSVi(dj) in terms of P (ci|dj),which represents the probability that a document is represented by a vectordj =< w1, j, ..., w|T |j > of terms, which belongs to ci, and determine thisprobability by using the Bayes’ theorem, defined by:

P (ci|dj) =P (ci)P (dj|ci)

P (dj). (1)

Where P (dj) is the probability that a document choose at random, has thevector dj for its representation; and P (ci) is the probability that a documentchoose at random, belongs to ci. The probability estimation P (ci|dj) is prob-lematic, since the vector number dj possible is too high. For this reason, it iscommon to make the hypothesis that all vector coordinates are statisticallyindependent. Therefore:

P (dj|ci) = Π|T |k=1P (wkj |ci)). (2)

Probabilistic classifier, which are using this hypothesis are named NaveBayes’ classifier and find their usage in most of probabilistic approaches inthe field of text categorization [29], [17].

Calculation of a classifier by the SVM method

Support Vector Machine methods has been introduce by Joachims [13], [14],[6], [35]. The geometrical SVM method can be considered as an attempt to findout between surfaces σ1, σ2, ... of a dimension space |T |, what is separatingexamples of positive training from negative ones. The set of training is definedby a set of vectors associated to the belonging category: (X1, y1), ..., (Xu, yu),Xj ∈ Rn, yj ∈ {+1,−1} with:

• yj represents the belonging category. In a problem with two categories;the first one correspond to a positive answer (yj = +1) and the secondone correspond to a negative answer (yj = -1)

• Xj represents the vector of the text number j of the training set.

The SVM method distinguish vectors of positive category from those of nega-tive category by a hyperplane defined by the following equation: W ⊗X+ b =0,W ∈ Rn, b ∈ R.

Generally, such a hyperplane is not unique. The SVM method determinesthe optimal hyperplane by maximizing the margin: the margin is the distancebetween vectors labeled positively and those labeled negatively.


Calculation of a classifier by the tree decision method

A text classifier based on the tree decision method is a tree of intern node,which are marked by terms, branches getting out of node are tests on terms,and the leaves are marked by categories [20]. This classifier classify documentof the test dj by testing recursively the weight of intern node of the vector dj

, until a leaf is reach. The knot’s label is then attributed to dj . Most of thoseclassifiers use a binary document representation and therefore are created bybinary trees.

A method to conduct the training of a decision tree for the category ciconsist in verifying if every training example have the same label (ci or c̄i).If not, we will select a term tk, and we will break down the training set indocument categories, which have the same value for tk. Finally, we create sub-trees until each leaf of the tree generated this way contain training examplesattributed at the same category ci, which is then choose as the leaf label. Themost important stage is the choice of the term of tk to carry out the partition[20].

Neural Network

A text classifier based on neural network NN is a unit network, where entryunits represent terms, exit units represent the category or interest categories,and the weight of the side relating units represents dependency relationships.In order to classify a document of test dj, its weights wkj are loaded in entryunits; the activation of those units is propagating through the network, and thevalue of the exit unit determines the classification decision. A typical trainingway of neural network is retro propagation, which consist in retro propagatingthe error done by a neuron at its synapses and to related neurons. For neuralnetworks, we are usually using retro propagation of the error gradient [3],which consist in correcting error according to the importance of the elements,which have participated to the realization of those errors.

4 Sentiments Analysis

4.1 The complexity of opinion marking

In order to determine the complexity of opinion marking, we are going to takean example of a review. The example is:

”Yeah, Beautiful girl. I’ve only met 2 people in real life and 1 personon the net who hates this musical hit. My favorite song ever!”

As we have noticed, the review is composed of three phrases, which have op-posite polarity. Even though, we can easily deduct that the first sentence isthe movie title, Beautiful girl, we will have two subjective phrases but hard


to mark correctly. The last phrase is rather easy to mark: ”My favorite songever!”. However, there is a problem for the marking of the phrase: I’ve onlymet... who hates this musical hit, because a statistical study shows us thatthe polarity is negative for this phrase but in fact the polarity is positive andwith high intensity.

Sentiments can often be express in a subtle manner, which creates a dif-ficulty in the identification of the document units when considering themseparately. If we consider a phrase, which indication a strong opinion, it ishard to associate this opinion with keywords or expressions in this phrase. Ingeneral, sentiments and subjectivity are highly sensitive to the context anddependent of the field.

Moreover, on Internet, everyone is using its own vocabulary, which addsdifficulties to the task; even though it is in the same field. Furthermore, it isvery hard to correctly allocate the weight of phrases in the review.

It is not yet possible to find out an ideal case of sentiment marking in atext written by different users because it does not follow a rule and it is im-possible to schedule every possible case. Moreover, frequently the same phrasecan be considered as positive for one person and negative for another one.

4.2 Detection of subjective phrases

For many applications, we have to decide if a document contains subjectiveor objective data and to identify which parts of the document are subjectivein order to be able, then, to process only the subjective part.

Hatzivassiloglou and Wiebe [12] have demonstrated the phrases’ orienta-tion based on the adjectives’ orientation. The objective was to establish if agiven phrase is subjective or not by evaluating adjectives of this phrases [34],[1]. Wiebe et al. [31] present a complete study of the recognition of subjectivityby using different indications and characteristics (the results comparison byusing adjectives, adverbs and verbs in taking in account the syntax’ structurelike for example the words’ location).

Another approach made by Wilson et al. [33] has proposed an opinionclassification according to their intensity (the opinion strength) and accord-ing to other subjective elements. When other researches have been made onthe distinction between subjectivity and objectivity or on the difference be-tween positive and negative phrases, Wilson et al. have classified the opinionand emotion strength expressed in individual clauses. The strength is knownas neutral when it corresponds to the absence of opinion and subjectivity.

Recent works consider as well relationship between the ambiguity in thewords sense and in the subjectivity [32]. The subjectivity detection can alsobe done thanks to classification techniques.

4.3 The opinion polarity and intensity

The classification of the opinion polarity consists in a document classificationbetween positive and negative status. A value called semantic orientation has


been created in order to demonstrate words’ polarity. It varies between twovalues: positive and negative and can have different intensity level. Thereare several calculation methods of the words semantic orientation. Generally,the semantic orientation method of the associations SO-A is calculated asa measure of positive words association less the measure of negative wordsassociation:

SO −A(word) = Σp∈PA(word, p)−Σn∈NA(word, n) (3)

Where A(word,pword) is the association of studied word with the positiveword (equivalent negative).

If the sum is positive, the word is oriented positively, and if the sum isnegative, the orientation is negative. The sum absolute value indicates theorientation intensity. In order to calculate the association measure betweenwords - A, there are several possibilities. One of them is called The PointwiseMutual Information - SO-PMI (proposed by Church and Hanks).

PMI(mot1,mot2) = log2p(word1&word2)p(word1)p(word1)

(4)

The p(word1&word2) defines the probability that the two words coexist to-gether.

Another possibility to analyze the statistical relationship between wordsin the corpus is the utilization of the technique: the Singular Value Decom-position (SVD).

4.4 Different approaches for the sentiment analysis

Turney’s Approach

The semantic orientation of words has been elaborated first of all for the ad-jectives [11], [30]. The works on the subjectivity detection have revealed a highcorrelation between the adjective presence and the subjectivity of phrases [12].This observation has often been considered as the proof that some adjectivesare good sentiment indicators. A certain number of approaches based on theadjectives presence or polarity have been created in order to deduct the textsubjectivity or polarity. One of the first approaches has been proposed byTurney [27] and can be presented in four stages:

• First of all, there is a need to make phrase segmentation (part-of-speech)• Then, we are putting together adjectives and adverbs in series of two words• We apply afterwards SO-PMI in order to calculate the semantic orientation

of each detected series,• Finally, we carry out a review classification as positive or negative by

calculating the average of all find orientation.


Results obtained by this approach are different compared to the field: forcars= 84%, for banking documents= 80% and for cinematographic reviews=65%. The fact that adjectives are good opinion preachers is not diminishingthe other words signification. Pang et al. [21], in the polarity study of cine-matographic criteria, have demonstrated that using only adjectives as char-acteristics gives result less relevant than using the same number of unigrams.

Pang’s Approach

Pang and Lee [22] are proposing another approach for the polarity classifi-cation of cinematographic reviews. The approach is composed of two stagesFigure 4. The first goal is to detect the document’s parts, which are subjec-tive. Then, they are using the same statistical classifier to detect the polarityonly on subjective fragments detected previously. Instead of doing the subjec-tivity classification for each phrase separately, they admit that they can see acertain degree of continuity in the phrases subjectivity - a writer generally isnot changing often between the fact to be subjective or objective. They givepreferences in order to have proximity phrases, which have the same level ofsubjectivity. Every phrase in the document is then labeled as subjective orobjective in the process of collective classification.

Document of n

sentences :

Sentence 1

Sentence 2

Sentence 3

Sentence 4

.

.

.

Sentence n

Subjective

sentences?

Yes

No

No

Yes

.

.

.

Yes

Extraction of m

sentences (m <= n)

Sentence 1

Sentence 4

.

.

.

Sentence n

Polarity

classification + / -

Subjectivity classificationPolarity classification

Fig. 4. Pang’s approach - Utilisation of the same classification technique for thedetection of subjectivity and afterwards of phrases polarity labeled as subjective.

5 Linguistic Analysis

Sentiment detection and marking can also be carried out by NLP techniques -Natural Language Processing. Information extraction consist to identify pre-cise data of a text in natural language and to represent it in a structuredform [23]. It is a documentary research, which aim to find back in the corpusa set of relevant documents regarding a question [28]. It consists in buildingup automatically a data bank from texts written in natural language. It is


not a matter of giving unprocessed text to use, but to give precise answers toquestions that have been ask by a formula or database padding.

The extraction requires specialized lexicon and grammar. The adjustmentof such resources is a long and tiresome task, which needs most of the time anexpertise of the tackled field and knowledge in computing linguistic . Amongthis knowledge, we can mention filtration techniques of documents categoriza-tion and data extraction.

Systems of text understanding have been conceived, for most of them, asgeneric system of understanding, but it has been reveled not much usablein reel applications. Understanding is seen as a transduction, which trans-form a linear structure. It means that the text (i.e. the linear structure) istransformed in an intermediary logical-conceptual representation. The finalobjective is then to create inferences on those representations in order to con-duct different processing, for example, answering questions.

In order to understand the whole text, there is a need to carry out the syn-tactic and the semantic analysis. The syntactic analysis is the largest possiblebecause of the ambiguity problem. The semantic analysis aim to produce astructure representing, the most reliable possible, the entire sentence, with itsnuances and its complexity, and then to integrate all produced structures ina textual structure. At the end, we obtain a logical-conceptual representationof the text. The semantic representation varies from one system to another.

This has driven an important number of researchers to describe naturallanguages in the same way as formal languages. Maurice Gross to precede,with his LADL team, the exhaustive examination of simple French phrases,in order to dispose of reliable and calculated data, on which it will be possibleto conduct meticulous scientific experiences. To reach this result, each verbhas been study so as to test if it verify or not syntactical proprieties as thefact to admit a completive proposition as a subject emplacement. We will seethat we can not describe French with general rules. The same situation appliesto all other languages. Results of this research have been encoded in matricescalled lexical-grammar tables. This table shows a precise description of thesyntactic behavior of each French word. The aim is to use all the resourcesof the lexical- grammar tables in order to obtain a system able to analyzeany simple phrase structure. The sense minimal unit, according to MauriceGross, is the sentence not the word. The principle is therefore to study thetransformation that simple sentence can have. Simple sentences have been in-dexed via their verbs. For a verb, we can have several different usages. Thanksto syntactical proprieties, we can distinguish the usage of a verb. There areno verbs, which possess exactly the same syntactical behavior. We can notexpress, therefore, general rules, which could explain the language.

The text corpuses are represented by automates; in which each path cor-respond to a lexical analysis. Linguistic phenomena are represented by localgrammar, which is translated in automates in a final stage in order to be easilyconfronted to the text corpuses. A local grammar [10] is a representation byautomate of linguistic structures, difficult to formalize in lexical- grammar ta-


bles or in dictionaries. Local grammars represented by graphs, are describingelements that are part of the same syntactic or semantic field.

Linguistic descriptions describe in local grammar form are used for a hugevariety of automatic applied processing on the text corpus. Thus, differentmethods of lexical disambiguates have been developed in order to carry outgrammatical constraints describe with the help of this type of graph.

6 Opinion marking module

Our system possesses a modular architecture. Its principal tasks are the fol-lowing: research and collect of reviews on Internet, attribution of a mark foreach review and presentation of the findings. Each task is done by a specializedmodule [Figure 5].

Fig. 5. General architecture of the system (the three principal modules and cine-matographic reviews marking)

First of all, for the opinion marking part, we have developed three differentmethods for the attribution of a mark to a review:

• The group behavior classifier [section 6.1]• The statistical classifier [section 6.2]• The linguistic classifier [section 6.3]

Those measures are based on different approaches of document classification.Secondly, we have developed, for each method, a classifier , which assign sep-arately the mark [9], [7]. We have obtained, therefore, three marks for eachreview, which can be different. We have used, finally, another classifier , whichassign the final mark for the review, based only on the three marks attributedpreviously in the classification process [8]. For the calculation of the final


mark, we have used the values of the three marks previously attributed andtheir probabilities.

On a research point of view, the most important part of the system con-ceived is the opinion marking module.

6.1 The group behavior classifier

In this section, we present the classifier used for the opinion marking. Thegeneral approach is based on the verification that reviews, having the sameassociated mark, have common characteristics. Then, we determine a reviewsbehavior, for those having the same mark. We determine therefore, the gen-eral behavior of each review group (5 groups corresponding to five differentopinion marks). We have a huge number of reviews already marked. We havegathered together all the reviews according to their mark. We obtain, then,five different groups of music marks. Afterwards, we have tried to determinetypical characteristics of each group. We have defined all parameters, whichcan characterize the group behavior such as:

• Characteristic words,• Characteristic expressions,• The phrase length,• The opinion size,• The frequency of several words repetition,• The negation,• The number of punctuation signs ( !, ;), ?)

The choice of criteria that we have kept for the analysis of the group behaviorhas been done in an empirical way. Fist of all, by analyzing the reviews cor-pus, we have defined criteria that seems interesting and that could determinegroup behavior. Then, we have tested those criteria on a training base con-taining a thousand of reviews. If results showed differences between groups,we considered those criteria as valid criteria for our research work. In this ap-proach, we present the statistical study on linguistic data. The training basehas been used for the review analysis, of those having the same mark, in or-der to find characteristics, which determine the behavior of each group. Eachapproach used in our research is based on different characteristics, in ordernot to repeat them in the classification process. However, we have borrowedsemantic classes from the linguistic approach for the creation of the words listcharacteristics. The utilization of those data is different in those two groups.After having select criteria that characterize mark groups, we have analyzedthe corpus in order to obtain statistical results. Results show huge differencesbetween the characteristics of those groups. The creation of the global behav-ior of each groups, enable to determine the group in which a new review is.We have calculated for new reviews, the distance between its characteristicsand those of the groups.


6.2 The statistical classifier

In this section, we propose a general approach used in the sentiment analysis.We use this method to compare results of our approaches with the sametraining base. The way to carry out a classification is to find a characteristic ofeach category and to associate a belonging function. Among known methods,we can mention Bayes’ classifiers and the SVM method. We have obtainedbetter results for the classifier of Nave Bayes, we are going therefore to basedourselves on this classifier . In our research work, we have used this classifierfirst of all to determine the subjectivity or objectivity of phrases, then in orderto attribute a mark to subjective phrases of the review. The general processneeds the preparation of training base for two classifiers to attribute a mark.The intermediate stages are the followings:

• Preprocessing and lemmatization,• Vectorization and calculation of complete index,• Constitution of training base for each classifier ,• Reduction of the index dedicate to the classifier ,• Addition of synonyms,• Classification of texts

We are using, for the attribution of a mark to the sentiment of the review viaa statistical approach, two classifiers: a first one to filter the objective and thesubjective phrases and a second one to mark the review. The marking is doneonly on subjective phrases. Those classifiers rely on a vectorial representationof the text of the training base. This vectorial representation needs in a firsttime a linguistic preprocessing for the segmentation of the phrase, for thelemmatization and for the suppression of all words, which has no impact onthe sense of the document. This preprocessing has been carried out for thelinguistic classifier .

We carry out the preprocessing thanks to the application Unitex. We arealready disposing of linguistic resources prepared for this task as, for example,the grammar of the phrase segmentation or dictionaries. Then, we take offterm with no sense, such as defined or undefined articles or prepositions. Wecan conduct this task because those grammatical elements have a low impacton the text sense as, for example, on the opinion described in reviews, contraryto adverbs, which give a high contribution to value judgment. Afterwards,on a training corpus, we calculate the dimension of the vectorial space ofthe text representation in order to carry out all lemma enumeration - theentire index. Each document is then represented by a vector, which containsthe number of occurrences of each lemma present in the document. Everydocument of the training base is represented by a vector, which dimensioncorresponds to the whole index and components are occurrences frequenciesof the index units in the document. Therefore, at this stage of the process,texts are seen as a set of phrases. Now, each phrase is labeled according to theconstruction of classifiers (the subjective classifier and the marking classifier).


Labels correspond to subjective phrases (PS) or objective ones (PO) and theestimating mark attributed to those phrases (N from 1 to 5). A phrase j ofthe document i is marked as followed:

VDiPj = (fDiPj1, ...fDiPjk, ...fDiPj |D|, PS/PO,N) (5)

Where fDiPjk represents the occurrences number of the lemma k in thephrase j of the document i. The stage of the labeling was based on the reviews’marks of the training base and subjective phrases have been labeled manually.This is how we have built the set of training necessary to the determinationof classifiers of subjectivity and of sentiment marking.

The last stage of the vectorial representation of the document corpus is thereduction of the entire index dedicate to the classifier. The reduction of thecomplete index consists in eliminating from the vectorial space of the trainingbase, vectors, which have many components always null. This task enablesus to eliminate the noise in the classifier calculation [2]. We have used themethod of mutual information associated to each vectorial space dimension.In our works, we have used two classifiers: the classification based on Bayes’model and the classification using SVM. The two methods have been testedand the best results (F-score) have been obtained by the Bayes’ classifiers. Itis, as a result, Bayes’ classifier who was used in the system. In the process ofthe statistical classification, we have at first classified subjective phrases andthen we have attributed a mark.

Interesting phrases to carry out the opinion marking are subjective phrasesbecause there are the only ones which contains the author point of view. Forthis reason, we have first of all carried out the filtration of subjective phrases.The diagram, which represents those tasks, is shown in the Figure 6.

Fig. 6. Subjectivity classification - the classification steps

The process presented enables to filter only subjective phrases, those ex-pressing an opinion. The different stages are as follow:

• The preprocessing consists in carrying out the phrase segmentation, thelemmatization and the elimination in our research of words without sense.

• The vectorization consists in putting all phrases in the form of vector ofoccurrences and to reduce the complete index.


• The addition of synonym consists to add terms (synonyms) in the vectorof occurrences thanks to the linguistic analysis.

• The subjectivity classification consists in gathering together phrases insubjective or objective phrases. The classification is based on Bayes’ the-orem. For the rest of the classification (marking), we keep only subjectivephrases.

After carrying out the subjectivity classification, we only keep subjectivephrases. We conduct a classification in order to be able to attribute a mark tothose phrases of each analyzed review. The diagram representing those tasksis presented in Figure 7. The process presented enables to attribute a markto phrases classified in the subjective phrases. The marking varies between 1to 5. The stages are the following ones:

• The vectorization and the reduction of the complete index dedicated tothe classification of the marking

• The addition of synonyms• The marking classification, which consists in putting together phrases ac-

cording to the sentiment intensity. Marks are between 1 and 5.

At this stage of the process, we obtain marks associated to every subjectivephrase. The global mark of a review of the statistical classification is thearithmetical average of all the phrases of this review.

Fig. 7. Subjectivity classification - the classification steps

6.3 The linguistic classifier

We carry out reviews marking on a scale going from 1 to 5. We have cre-ated for the linguistic approach a grammar rule for each of those groups. Thisgrammar is based on reviews’ analysis of the training base, which containsapproximately 2000 phrases for each mark (the same database than for theother classifiers). For this part, we have used a linguistic processing, whichdemand specialized lexicon and grammar. The development of those resourcesis a long and tiresome task, which generally needs an expertise in the field


and knowledge linguistic information processing such as filtration techniques,documents categorization and data extraction. This part of the system hasbeen developed with the application Unitex. We are using a linguistic ana-lyzer Unitex to conduct a preprocessing and a lemmatization of words andfinally, for the most important part of our research work, the constructionof complex local grammar. The example of application is shown on Figure 8We have introduced, in order to fragment words in different opinion inten-sity level, words semantic categories, which are associated to words and showpolarity and intensity. We have used, in order to associate words semanticcategories, a subjective dictionary named General Inquirer Dictionary.

The principal goal of the linguistic classifier is the attribution of a markaccording to sentiment described in the review. The marking is done phraseby phrase. The reviews’ study of the training base has been carried out in theaim of creating grammar rules for each mark (in this case, the mark is between1 and 5). Five grammars has been therefore create, one for each mark. Eachgrammar contains a huge number of rules taken from local grammar. For eachgrammar, more than thirty local grammars have been created. The analysisis done phrase by phrase to attribute a mark to a new review in order to finda rule (from our rules base) corresponding to the studied phrase. At the endof this processing, we obtain phase of the new studied review with matchinggrammar rules. The final mark of this classification is the average of markscorresponding to general grammars.

The construction of local grammar has been carried out manually viaphrases analysis of the reviews having the same associated mark. Local gram-mar can not be too general because this tends to add ambiguity to results.However, if the grammar is too specific and complex, the use of this grammar isindeterminate because silence grows in a significant way. Grammars have beencreated to detect the opinion polarity and intensity in a phrase thanks to thelocal grammars form, which constitute a general grammar for each markinggroup. Research works are based only on local grammars form. Other charac-teristics purely statistical like words or characteristic expressions, phrase size,words frequency, words repetition, the number of punctuation signs and soon, are not taken into account. Of course, characteristic words are in dictio-naries with semantic categories and in local grammar, but this approach is alinguistic processing (grammar is necessary) not a statistical one (like the twoother classifiers).

The creation of local grammar is a tiresome task. Grammars used in oursystem have been created in an empirical way. We have carried out in the fol-lowing way: first of all, we have constructed general grammars, then we addeda complexity level to the linguistic analysis and we have made tests. Afterthose tests, we have repeated the process (addition of a complexity level). Foreach level, we have conducted tests and calculated the F-score. The final resultof grammars rules forms have been chose in order to obtain the best result ofF-score. Unfortunately, we can not be sure of the fact that our choice is themost coherent one. We have taken into account the fact that each classifier


presented in our system should have its own criteria and characteristics. Itis important to mention that the linguistic classifier provide the best results.We can observe, in particular, that the precision parameter is better than theone obtained by using other approaches.

This part of the system has been conducted with the text analyzer Uni-tex. Unitex enables to process in real time texts of several mega-bytes forthe indexation of morphosyntactic patterns, the research of hard or semi-hardexpressions, the production of concordance and the statistical study of results.

Fig. 8. Example of linguistics resources

6.4 The final classifier

Until now, we have presented three different methods to attribute a mark toa review. Thus, we obtain three different estimations (one for each classifier).The marking is carried out each time in a different way. Marks are thereforenot always the same. As we are obtaining three different marks, another prob-lem consists in conducting the final marking in order to attribute only onemark to the review. We need a final classification to obtain the final mark,which will be retransmitted to our radio. We have observed that, if we arecalculating the final average obtained by the three classifiers, results are lessefficient than those obtain by the linguistic classifier.

We have also observed that often a classifier in specific situations givesbest results, whereas in other circumstances, it would be another one. Forexample [Figure 9], we have observed that often when the first classifier givesa mark equal to 2 and the last two ones give a mark of 1, the correct results is


2. As a consequence, the first classifier is determinant in this case. By imple-mentation of neural networks for this stage and by taking into considerationeach probability for each score for each classifier we improved our results for3 to 7% depending on the class.

Fig. 9. Final classification- the marks behavior shows the presence of a determinantclassifier in some situations

We are using, for this reason, a final classifier. For this classification weare applying a neural network. The choice of this classifier is justified by thepresence of a wide reviews base, already annotated, which will be useful forthe training base. Moreover, it is easy to implement those data, for it to beused in the training base. The classifier takes into account only the probabilityof the mark of each classifier. No other characteristics are taken in consider-ation. This choice is acceptable because we think that we have used all otherpossible characteristics in the marking process (by using the three classifiersmentioned previously) and we do not wish to repeat those characteristics inthe classifications. Furthermore, the utilization of a characteristic of an opin-ion marking classifier in the final classification can influence the choice of thisclassifier.

For the entries of the final classifier, we have used marks of the previousclassifiers. The marks of each classifier represented by the belonging proba-bility of one of the five marks categories. For example, the linguistic classifierattributes the mark in the following way: the probability that the mark is:

• equal to 5 is p5=0,6• equal to 4 is p4=0,2• equal to 3 is p3=0,1• equal to 2 is p2=0,1• equal to 1 is p1=0

We have used a neural network to determine the correlation between marksobtained by the three classifiers. We are using the neural network of multilayerperceptron (PMC) with the algorithm of retro propagation of gradient.


7 Results

We have observed for the base of cinematographic reviews that we obtainthe best result with the linguistic classifier (especially for the precision). Theworst results are those of the statistical classifier of Nave Bayes (the playbackis correct but the precision is too low). This is demonstrating that it is nec-essary to carry out a deep linguistic analysis. We have observed that the bestresults find for the three approaches were those expressing extreme opinions[Figure 10].

Knowing the principle that it is an obligation to dispose of grammars morecomplex, we have demonstrate that the linguistic classifier gives better resultsthan the statistical or the group behavior ones [Table 1].

Table 1. Classifiers results

mark 5 mark 4 mark 3 mark 2 mark 1

Linguistic classifier 85 % 77.6 % 72.9 % 69.6 % 77.8 %Group behavior classifier 73.8 % 71 % 70.8 % 66.1 % 68.3 %

Statistic classifier 70 % 70.7 % 66.1 % 63.3 % 73 %

Final classifier 83.1 % 81.2 % 74.5 % 72.2 % 81.4 %

0

10

20

30

40

50

60

70

80

90

mark = 5 mark = 4 mark = 3 mark = 2 mark =1

Linguistic classifier Statistic classifier

Group behavior classifier Final classifier

Fig. 10. Final classification- the marks behavior shows the presence of a determinantclassifier in some situations

8 Conclusion

In this chapter we presented a entire system based on social network for radioapplication. Our radio will update a playlist depended on the votes of audi-tors. Users can express their selves via forums, blogs, or directly by adding a


mark to the song. In the goal of understanding the opinions written in naturallanguage, an Opinion Mining knowledge was necessary to implement. For thisreason, we presented in this chapter new approaches to automatically detectopinion from the text. The two classifications (group conduct and linguistic)have been proposed by us. Then, we have compared our approaches with theapproach generally used in this field (the statistical classification, which isbased on Nave Bayes’ classifiers).

After carrying out tests, we can observe that we have succeeded to imple-ment a first innovative method based on a linguistic classifier . The resultsobtained after this classification give us satisfaction. We can, therefore, con-clude that the linguistic analysis, which is deeper, is an important researchpath in the field of Sentiment Analysis.

Despite the fact that the linguistic classifier enables to obtain the best re-sults, its utilization can not be universal. Its application to a new field requiresthe creation of a new linguistic resource base and it is necessary to carry outthe deep linguistic analysis again. Those processing are unavoidable becausethe language is highly dependent of the field.

References

1. Beineke, P., Hastie, T., & Vaithyanathan, S. 2004. Exploring sentiment summa-rization. In Proceedings of the AAAI Spring Symposium on Exploring Attitudeand Affect in Text, AAAI technical report SS-04-07, 2004.

2. Cover, T.M., & Thomas, J.A. 1991. Elements of Information Theory. JohnWiley & sons, NY, 1991 (ISBN 0-471-06259-6).

3. Dagan, I., Karov, Y., & Roth, D. 1997. Mistakedriven learning in text catego-rization. 2nd Conference on Empirical Methods in Natural Language Process-ing, EMNLP-97, 55-63.

4. Das, S., & Chen, M. 2001. Yahoo! for Amazon : Extracting Market Sentimentfrom Stock Message Boards. In Proceedings of the 8th Asia Pacific FinanceAssociation Annual Conference (APFA).

5. Dave, K., Lawrence, S., & M., Pennock D. 2003. Mining the Peanut Gallery :Opinion Extraction and Semantic Classification of Product Reviews. In Pro-ceedings of WWW, 519-528.

6. Drucker, H., Vapnik, V., & Wu, D. 1999. Automatic text categorization and itsapplications to text retrieval. IEEE Trans. Neural Netw., 10, 10481054.

7. Dziczkowski, G., & Wegrzyn-Wolska, K. 2007b. Rcss - rating critics supportsystem purpose built for movies recommendation. In : Advances in IntelligentWeb Mastering. Springer.

8. Dziczkowski, G., & Wegrzyn-Wolska, K. 2008a. An autonomous system de-signed for automatic detection and rating of film. Extraction and linguisticanalysis of sentiments. In Proceedings of IEEE/WIC/ACM inter. conferenceon Web intelligence and intelligent agent technology, Sydney, 847–850.

9. Dziczkowski, G., & Wegrzyn-Wolska, K. 2008b. Tool of the intelligence eco-nomic : Recognition function of reviews critics. In ICSOFT 2008 Proceedings.INSTICC Press, 218–223.


10. Gross, M. 1997. The construction of local grammars. Finite-State LanguaugaProcessing, MIT Press, 329-354.

11. Hatzivassiloglou, V., & McKeown, K. 1997. Predicting the Semantic Orientationof Adjectives. In Proc. of the Joint ACL/EACL Conference, 174-181.

12. Hatzivassiloglou, V., & Wiebe, J. 2000. Effects of Adjective Orientation andGradability on Sentence Subjectivity. In Proceedings of the International Con-ference on Computational Linguistics (COLING).

13. Joachims, T. 1998. Text categorization with support vector machines : learningwith many relevant features. 10th European Conference on Machine Learning,ECML-98, 137-142.

14. Joachims, T. 1999. Transductive inference for text classification using supportvector machines. 16th Inter. Conf. on Machine Learning, ICML-99, 200-209.

15. Joachims, T., & Sebastiani, F. 2002. Guest editors introduction to the specialissue on automated text categorization. J. Intell. Inform. Syst., 18, 103-105.

16. Knight, K. 1999. Mining online tex. Commun. ACM 42, 1, 58-61.17. Lewis, D. D., & Gale, W. A. 1994. A sequential algorithm for training text

classifiers. 17th ACM International Conference on Research and Developmentin Information Retrieval, SIGIR-94, 3-12.

18. Lewis, D. D., & Haues, P. J. 1994. Guest editorial for the special issue on textcategorization. In : ACM Trans. Inform. Syst.

19. Mediametrie, www.mediametrie.fr/new.php?rubrique = rad&newsid = 229.2008.

20. Mitchell, T.M. 1996. Machine Learning. McGraw Hill.21. Pang, B., Lee, L., & Vaithyanathan, S. 2002. Thumbs up ? Sentiment Classi-

fication using Machine Learning Techniques. In Proceedings of the Conferenceon Empirical Methods in Natural Language Processing (EMNLP), 79-86.

22. Pang, B., & Lee, L. 2004. A Sentimental Education : Sentiment Analysis UsingSubjectivity Summarization Based on Minimum Cuts. In Proceedings of theAssociation for Computational Linguistics (ACL), 271-278.

23. Pazienza, M. T. 1997. Information Extraction. In : Lecture Notes in ComputerScience Vol. 1299.

24. Porter, W. A. 1980. Synthesis of polynomic systems. j-SIAM-J-MATH-ANA,11, 308-315.

25. Schmid, H. 1994. Part-of-Speech Tagging with Neural Network. 15th conferenceon Computational linguistics, 1, 172-176.

26. Shannon, C. 1948. A mathematical theory of communication. Bell System Tech-nical Journal, 27, Bell System Technical Journal.

27. Turney, P. 2002. Thumbs Up or Thumbs Down ? Semantic Orientation Appliedto Unsupervised Classification of Reviews. In Proceedings of the Associationfor Computational Linguistics (ACL), 417-424.

28. Voorhess, E.M. 1999. Natural language processing and information retrieval.Information extraction, toward scalable, adaptable systems, Lecture Notes inComputer Science, 32-48.

29. Wang, Y., Hodges, J., & Tang, B. 2005. Classification of Web Documents usingNaive Bayes Method. In Proceedings of the 15th IEEE International Conferenceon Tools with Artificial Intelligence, 560-564.

30. Whitelaw, C., N., Garg, & Argamon, S. 2005. Using appraisal groups for sen-timent analysis. In Proceedings of the ACM SIGIR Conference on Informationand Knowledge Management (CIKM), 625-631.


31. Wiebe, J.M., Wilson, T., Bruce, R., Bell, M., & Martin, M. 2004. LearningSubjective Language. Computational Linguistics, 30(3), 277-308.

32. Wiebe, J, & Mihalcea, R. 2006. Word Sense And Subjectivity. In Proceedingsof the 21st Conference on Computational Linguistics / Association for Compu-tational Linguistics (COLING/ACL), 1065–1072.

33. Wilson, t., Wiebe, j., & Hwa, r. Just How Mad Are You? Finding Strong andWeak Opinion Clauses. In Proc. of AAAI, 761-769. Ext. version in Computa-tional Intelligence 22(2, Special Issue on Sentiment Analysis),73-99, 2006.

34. Wilson, T., Wiebe, J., & Hoffmann, P. 2005. Recognizing Contextual Polarityin Phrase-Level Sentiment Analysis. In Proceedings of the Human LanguageTechnology Conference and the Conference on Empirical Methods in NaturalLanguage Processing (HLT/EMNLP), 347-354.

35. Yang, Y., & Liu, X. 1999. A re-examination of text categorization methods.22nd ACM International Conference on Research and Development in Infor-mation Retrieval, SIGIR-99, 42-49.

Index

classifier, 13–16, 18–22

linguistic, 6, 12–15, 17, 18, 21

Machine Learning, 5

Opinion Mining, 5, 22

semantic, 5, 9, 10, 12, 18social network, 1, 21

Text Mining, 5