automatic term extraction of dynamically updated text collections for sentiment classification into...

Post on 11-Jun-2015

Category:

Science

DESCRIPTION

This work presents an automatic term extraction approach for building a constantly updated vocabulary. The prepared dictionary is used for sentiment classification into three classes (positive, neutral, negative). The classification results are described, and the accuracy of methods based on different weighting schemes is compared. The work also shows how the computational complexity of generating representations for N dynamic documents depends on the weighting scheme used.

TRANSCRIPT

Automatic term extraction of dynamically updated text collections for sentiment classification into three classes

Yuliya Rubtsova

The A.P. Ershov Institute of Informatics Systems (IIS)

Applied problems which can be solved with sentiment classification

analysis of consumer reviews of commercial products for businesses;

recommender systems;

human-machine interfaces that adapt a computer system's behavior to the person's current emotional state;

psychological and medical diagnosis;

safety control by analyzing the behavior of mass gatherings;

assistance in carrying out investigative measures.

Most common sentiment analysis approaches

Supervised machine learning

Dictionaries and rules

Combined method

Existing corpora

Corpora of reviews that contain user ratings

Each belongs to a single subject domain (movie reviews, book reviews, gadget reviews)

News corpora (few emotional texts)

Filtration

Texts containing both positive and negative emotions;

Uninformative tweets (fewer than 40 characters long);

Copied texts and retweets.

Corpus of short texts consists of

114 991 – positive texts

111 923 – negative texts

107 990 – neutral texts

Corpus of short texts

Collection type | Number of words | Number of unique words

Positive messages | 1 559 176 | 150 720

Negative messages | 1 445 517 | 191 677

Neutral messages | 1 852 995 | 105 239

Distribution of unique terms depending on the number of tweets

Uniformity of the collections used

Word frequency distribution

Most common approaches to N-gram extraction

Manually, using a thesaurus.

Term extraction based on the significance of the term for a collection

Data set characteristics

The entire data set is known

The entire data set is available

The entire data set is static (it cannot change during calculation)

When a new document is added, the document frequency of many terms must be updated and all previously generated term weights need recalibration. For N documents in a data stream, the computational complexity is O(N²).
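The quadratic cost can be illustrated with a minimal sketch. It assumes a naive streaming scheme in which every stored document is re-weighted each time a new document arrives; the function name `count_reweightings` and the `tf * log(N / df)` form of the weight are assumptions made for illustration, not taken from the slides.

```python
import math
from collections import Counter

def count_reweightings(stream):
    """Count how many documents are re-weighted while a stream is consumed.

    Assumes naive TF-IDF maintenance: after each arrival, N and the document
    frequencies change, so every stored document gets fresh weights.
    """
    doc_freq = Counter()   # term -> number of documents containing it
    stored = []            # per-document term counts seen so far
    reweighted = 0
    for tokens in stream:
        stored.append(Counter(tokens))
        doc_freq.update(set(tokens))
        n_docs = len(stored)
        for counts in stored:
            # Assumed weighting: tf * log(N / df), recomputed from scratch.
            _weights = {t: tf * math.log(n_docs / doc_freq[t])
                        for t, tf in counts.items()}
            reweighted += 1
    return reweighted

stream = [["good", "film"], ["bad", "film"], ["good", "day"], ["film", "today"]]
print(count_reweightings(stream))  # 1 + 2 + 3 + 4 = 10 re-weightings for N = 4
```

For N documents the count is N(N+1)/2, which grows as O(N²).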

Human speech is constantly changing => there is a need to update emotional dictionaries

Change in vocabulary and topics discussed

[Figure: Percentage of references to the Olympic theme among all posts. February: 12.00%, August: 0.50%]

[Figure: Percentage of references to the vacation theme among all posts. February: 0.06%, August: 0.12%]

[Figure: Percentage of posts using the term “Sebyashka” (Russian for "selfie"). February: 0.00%, August: 0.02%]

Filtration

Punctuation: commas, colons, quotation marks (exclamation marks, question marks and ellipses were retained);

References to significant personalities and events;

Proper names;

Numerals;

All links were replaced with the word "Link" and treated as a single term;

Runs of dots were replaced with an ellipsis.
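Part of this filtration can be sketched with regular expressions. The sketch below is only an approximation under stated assumptions: the patterns and the function name `filter_text` are hypothetical, and the removal of proper names and of references to personalities and events is left out because it would need a name lexicon that the slides do not describe.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # assumed notion of a "link"
DOTS_RE = re.compile(r"\.{2,}")                 # two or more consecutive dots
NUMERAL_RE = re.compile(r"\b\d+(?:[.,]\d+)?\b") # integers and simple decimals
DROP_PUNCT_RE = re.compile(r'[,:"«»“”]')        # commas, colons, quotation marks
SPACE_RE = re.compile(r"\s+")

def filter_text(text):
    """Apply the punctuation, numeral and link rules to one message."""
    text = URL_RE.sub(" Link ", text)      # links first: they contain ':' and '.'
    text = DOTS_RE.sub("…", text)          # runs of dots become one ellipsis
    text = NUMERAL_RE.sub(" ", text)       # drop numerals
    text = DROP_PUNCT_RE.sub(" ", text)    # '!', '?' and '…' are kept
    return SPACE_RE.sub(" ", text).strip()

print(filter_text('Wow, look: "great" movie... http://t.co/abc 2 times!'))
# Wow look great movie… Link times!
```

Links are replaced before any other rule runs, since a URL would otherwise be broken up by the colon and dot handling.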

TF-ICF

C – the number of categories,

cf – the number of categories in which the weighted term is found

TF-IDF

tf – the frequency of the term in the collection (positive or negative tweets),

T – the total number of messages in the collections,

– the number of messages in the positive and negative collections that contain the term
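The formulas themselves did not survive in this transcript, so the following is only a sketch built from the definitions above. It assumes the common forms tf · log(T / df) for TF-IDF and tf · log(C / cf) for TF-ICF; the function names, the logarithmic form, and the toy collections are assumptions and may differ from what the slides showed.

```python
import math

def term_frequency(term, messages):
    """tf: occurrences of the term in one collection of tokenized messages."""
    return sum(tokens.count(term) for tokens in messages)

def tf_idf(term, name, collections):
    """Assumed form: tf * log(T / number of messages containing the term)."""
    tf = term_frequency(term, collections[name])
    all_messages = [m for msgs in collections.values() for m in msgs]
    total = len(all_messages)                            # T
    containing = sum(term in m for m in all_messages)
    return tf * math.log(total / containing) if containing else 0.0

def tf_icf(term, name, collections):
    """Assumed form: tf * log(C / cf)."""
    tf = term_frequency(term, collections[name])
    n_categories = len(collections)                      # C
    cf = sum(any(term in m for m in msgs) for msgs in collections.values())
    return tf * math.log(n_categories / cf) if cf else 0.0

# Toy data, invented for illustration only.
collections = {
    "positive": [["good", "film"], ["good", "day"]],
    "negative": [["bad", "film"]],
    "neutral": [["film", "today"]],
}
print(tf_idf("good", "positive", collections))  # 2 * log(4 / 2)
print(tf_icf("good", "positive", collections))  # 2 * log(3 / 1)
print(tf_icf("film", "positive", collections))  # 0.0: "film" occurs in every category
```

Under these assumptions the ICF factor depends only on the number of categories that contain the term, not on the number of messages, so adding a message rarely changes it.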

Experiments

Corpus of News texts consists of

46 339 – positive news

46 337 – negative news

46 340 – neutral news

ROMIP mixed collection consists of

543 – positive blog texts

236 – negative blog texts

103 – neutral blog texts

Reviews of books, movies, or digital cameras from blogs

Short text collection

Metric | TF-IDF | TF-ICF

Accuracy (%) | 95.5981 | 95.0664

Precision | 0.958092631 | 0.953112184

Recall | 0.955204837 | 0.94984672

F-Measure | 0.956646554 | 0.95147665

News collection

Metric | TF-IDF | TF-ICF

Accuracy (%) | 69.8619 | 58.1397

Precision | 0.709246342 | 0.61278022

Recall | 0.698624505 | 0.581402868

F-Measure | 0.703895355 | 0.596679322

ROMIP collection

Metric | TF-IDF | TF-ICF

Accuracy (%) | 53.9773 | 57.9545

Precision | 0.561341047 | 0.558902611

Recall | 0.5311636 | 0.535790598

F-Measure | 0.545835539 | 0.547102625
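The F-measure values reported above are consistent with the usual harmonic mean of precision and recall, F = 2PR / (P + R). The short check below recomputes two of the TF-IDF values from the reported precision and recall; the function name `f_measure` is an assumption, and the slides do not say how the averaging over the three classes was done.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall pairs copied from the TF-IDF columns above.
print(f_measure(0.561341047, 0.5311636))    # reported F-measure: 0.545835539
print(f_measure(0.958092631, 0.955204837))  # reported F-measure: 0.956646554
```

Both recomputed values agree with the reported ones up to rounding of the inputs.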

Results

Experimental results in terms of F-measure (%)

Collection | TF-IDF | TF-ICF

Short texts | 95.66 | 95.15

News | 70.39 | 59.68

ROMIP | 54.58 | 54.71

The program module makes it possible to

dynamically update the unigram dictionary and recalculate term weights depending on the collection a term belongs to;

take into account lexical changes in speech over time;

investigate new terms entering the active vocabulary.
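How such a module might keep a unigram dictionary up to date can be sketched as follows. This is not the author's implementation: the class `DynamicTermDictionary`, its methods, and the tf · log(C / cf) weight are assumptions chosen to show that adding a message only touches the counters of one collection, while weights are derived on demand.

```python
import math
from collections import Counter

class DynamicTermDictionary:
    """Unigram dictionary whose weights follow newly added messages."""

    def __init__(self, categories):
        # One term counter per collection (for example positive/negative/neutral).
        self.term_counts = {c: Counter() for c in categories}

    def add_message(self, category, tokens):
        # Only the counters of one collection change; nothing is re-weighted here.
        self.term_counts[category].update(tokens)

    def weight(self, term, category):
        # Assumed TF-ICF form: tf * log(C / cf), computed when requested.
        tf = self.term_counts[category][term]
        cf = sum(1 for counts in self.term_counts.values() if counts[term] > 0)
        if tf == 0 or cf == 0:
            return 0.0
        return tf * math.log(len(self.term_counts) / cf)

dictionary = DynamicTermDictionary(["positive", "negative", "neutral"])
dictionary.add_message("positive", ["selfie", "cool"])
print(dictionary.weight("selfie", "positive"))  # log(3): the term is in one collection
dictionary.add_message("negative", ["selfie", "bad"])
print(dictionary.weight("selfie", "positive"))  # log(3 / 2): now in two collections
```

A term that appears for the first time simply gets a new counter entry, so new vocabulary can be inspected by looking at which keys were added.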

Thank you!

Yuliya Rubtsova

yu.rubtsova@gmail.com

study.mokoron.com

Presentation: http://www.slideshare.net/mokoron
