Audio and Text based Multimodal Sentiment Analysis using Features

Extracted from Selective Regions and Deep Neural Networks

Thesis submitted in partial fulfillment

of the requirements for the degree of

Master of Science by Research

in

Computer Science and Engineering

by

Harika Abburi

201450880

[email protected]

International Institute of Information Technology

Hyderabad - 500 032, INDIA

June 2017


Copyright © Harika Abburi, 2017

All Rights Reserved


To My Family Members


Acknowledgments

I owe deep gratitude to everyone who has contributed to the completion of this thesis.

Foremost, I would like to express my deep gratitude to my advisors Dr. Suryakanth V Gangashetty

and Dr. Manish Shrivastava for their guidance and encouragement throughout my research work. I am very fortunate to have advisors who gave me the freedom to explore on my own, while their guidance helped me recover whenever my steps faltered.

I owe special thanks to Prof. B. Yegnanarayana for sharing his immense knowledge on speech. I am

grateful to Dr. Radhika Mamidi for her constant support and encouragement. I thank Dr. Kishore S

Prahallad and Dr. Vinay Kumar Mital for guiding me in the beginning of my research work.

I am grateful to KNRK Raju and Akhil Akkireddy for their valuable feedback and comments. I thank VVV Raju, Ramakrishna sir, Harikrishna, Ravi Kumar sir and Mounika for their valuable feedback and discussions, and for always helping me in every situation. I thank all my lab members Gangamohan, Sudarshan, Sivanand, Nivedita, Vishala, Ravi, Aneeja, Bhanu Teja, Harsha, Anandswaroop, Sirisha, Bhavya and Ayushi for helping me to complete my thesis work.

Most importantly, none of this would have been possible without the love and patience of my family

members. I thank them for their support and strength throughout. I would like to personally thank

my husband who guided me through the difficult situations in my life and never let me fail. I thank him

immensely for that.

- Harika Abburi


Abstract

Sentiment analysis has emerged as a field that has attracted a significant amount of attention over the last decade. It is the study of analyzing people's reviews, songs and attitudes from different types of data and classifying them as positive, negative or neutral. The recent advancement of social media, an enormous and ever-growing source, has led people to share their views through various modalities such as audio, text and video. This source of information makes it possible to automatically infer the sentiment embedded in different types of data such as reviews and songs.

In this thesis, an improved multimodal approach to detect the sentiment of product reviews and songs based on their multimodal nature (audio and text) is proposed. The basic goal is to classify the input data as carrying either positive or negative sentiment. The databases used in this study are Spanish product reviews, Hindi product reviews and Telugu songs. Most of the existing systems for audio or speech based sentiment analysis use conventional audio features extracted from the entire signal, but these are not domain-specific features for extracting sentiment. In this work, instead of extracting features from the entire signal, specific regions of the audio signal are identified and experiments are performed on these regions by extracting the relevant features.

For the songs data, experiments are performed over each entire song as well as over its beginning and ending regions. For all these cases, Gaussian Mixture Model (GMM) and Support Vector Machine (SVM) classifiers are built using prosody, temporal, spectral, tempo and chroma features. Experimental results show that the rate of detecting the sentiment of a song is higher at the beginning of the song than at its ending region or over the entire duration of the song. This is because the instruments and vocals which convey the sentiment in the beginning part of the song may or may not be sustained throughout the song.

For the reviews data, these experiments could not be performed because the sentiment may not be present at the beginning or ending regions of an utterance. Instead, the stressed and normal regions are identified using the strength of excitation. From the stressed regions, Mel Frequency Cepstral Coefficient (MFCC) features are extracted and a GMM classifier is built. Further, experiments are performed by extracting prosody (energy, pitch and duration) and relative prosody features from both regions and from the entire audio signal, and a GMM classifier is built. From the results, it is observed that the performance at the specific regions is better than over the entire signal. It is also observed that relative prosody features extracted from both regions detect the sentiment more accurately than the prosody and MFCC features, because the natural variations present in the prosody features are reduced in the relative prosody features.


Recently, neural networks have achieved good success in sentiment classification. In this work, different deep learning architectures such as the Deep Neural Network (DNN) and the Deep Neural Network with Attention Mechanism (DNNAM) are also explored. Here the stressed-regions approach fails because of the limited training data, since DNN performance depends on the amount of training data: the more training data, the more accurate the model. Experiments are therefore performed on combinations of frames, which gives better performance because a single frame does not carry the sentiment. The MFCC features considered are 13-dimensional, 65-dimensional and 130-dimensional feature vectors. From the studies, it is observed that the DNNAM classifier gives better results than the DNN, because the DNN approach is frame based whereas the DNNAM approach performs utterance-level classification, thereby efficiently making use of the context.

For text based sentiment analysis, transcriptions are produced manually from the audio signal. For song classification, SVM and Naive-Bayes classifiers are built using textual features computed as Doc2Vec vectors. As with the audio, experiments are performed at the beginning, at the ending and over the entire song. The studies show that the beginning of a song gives higher accuracy in detecting the sentiment than the ending region or the entire song. As similar experiments could not be carried out on the reviews data, the entire document is taken as input to extract the sentiment. Support Vector Machine (SVM) and Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) classifiers are used to develop a sentiment model with textual features computed by Doc2Vec and Word2Vec. From the experimental studies, it is observed that the LSTM-RNN outperforms the SVM because the LSTM-RNN is able to memorize long temporal context.

Finally, both modalities, audio and text, are combined to extract the sentiment. The final hypothesis is chosen based on the highest average probability of the classifiers. It is observed from the studies that the simultaneous use of these two modalities helps to create a better sentiment analysis model for detecting whether the given input carries positive or negative sentiment.

Keywords: Sentiment Analysis, Multimodal Classification, Text features, Audio features, Lyric features, Stressed regions, Normal regions, Relative prosody features, Mel frequency cepstral coefficients, Doc2Vec, Word2Vec, Gaussian Mixture Models (GMM), Support Vector Machine (SVM), Naive-Bayes (NB), Deep Neural Networks (DNN), Deep Neural Network Attention Mechanism (DNNAM), Long Short-Term Memory Recurrent Neural Network (LSTM-RNN).


Contents

1 INTRODUCTION TO SENTIMENT ANALYSIS
    1.1 Motivation
    1.2 Issues Addressed in this Thesis
    1.3 Applications of Sentiment Analysis
    1.4 Organization of the Thesis

2 REVIEW OF APPROACHES TO SENTIMENT ANALYSIS
    2.1 Audio-based Sentiment Analysis
    2.2 Text-based Sentiment Analysis
    2.3 Multimodal Sentiment Analysis

3 MULTIMODAL SENTIMENT ANALYSIS USING SELECTIVE REGIONS
    3.1 Databases used in this Study
        3.1.1 Spanish database of opinion utterances
        3.1.2 Hindi database of reviews utterances
        3.1.3 Telugu songs
    3.2 Detection of Sentiment in Selective Regions of Songs
        3.2.1 Sentiment analysis using audio features
        3.2.2 Sentiment analysis using lyric features
        3.2.3 Multimodal sentiment analysis
    3.3 Detection of Sentiment from Selective Regions of Reviews
        3.3.1 Sentiment analysis using speech features
            3.3.1.1 Detecting stressed and normal regions
            3.3.1.2 Computation of strength of excitation using ZFF method
            3.3.1.3 Detecting stressed and normal regions of an audio signal
            3.3.1.4 Features extracted at stressed and normal regions
            3.3.1.5 Sentiment analysis experimental setup and results
        3.3.2 Sentiment analysis using text features
        3.3.3 Multimodal sentiment analysis
    3.4 Summary and Conclusions

4 MULTIMODAL SENTIMENT ANALYSIS USING DEEP NEURAL NETWORKS
    4.1 Sentiment Analysis using Audio Features
        4.1.1 Deep neural network
        4.1.2 Deep neural network attention mechanism
        4.1.3 Gaussian mixture models
    4.2 Sentiment Analysis using Text Features
        4.2.1 Support vector machine
        4.2.2 LSTM-RNN model
    4.3 Multimodal Sentiment Analysis
    4.4 Summary and Conclusions

5 SUMMARY AND CONCLUSIONS

Bibliography


List of Tables

3.1 Sentiment classification performance (in %) of songs using a 37-dimension feature vector.
3.2 Sentiment classification performance (in %) of songs using lyric features.
3.3 Multimodal sentiment classification performance (in %) with lyric and audio features.
3.4 Sentiment classification performance (in %) with different features.
3.5 Sentiment classification performance (in %) using text features.
3.6 Sentiment classification performance (in %) for different models on Spanish and Hindi datasets.
4.1 Performance (in %) of sentiment analysis using deep neural network.
4.2 Performance (in %) of sentiment analysis using deep neural network attention mechanism.
4.3 Performance (in %) of sentiment analysis using different classifiers.
4.4 Performance (in %) of sentiment analysis using text features.
4.5 Performance (in %) of multimodal sentiment analysis.

List of Figures

3.1 Block diagram of multimodal sentiment analysis of songs.
3.2 An approach to detect stressed significant regions of the Spanish speech utterance "pero igual con las lavadas": (a) input speech signal, (b) ZFF signal, (c) strength of excitation at each epoch, (d) strength of excitation mean-smoothed using a frame size of 20 ms, and (e) detected stressed significant regions of the utterance.
3.3 Block diagram to extract relative prosody features.
3.4 Histogram of (a) maximum pitch, (b) relative maximum pitch for positive and negative sentiment audio files.
3.5 Scatter plot of maximum pitch vs maximum energy for positive and negative sentiment audio files.
4.1 Block diagram of sentiment analysis using deep neural network.
4.2 Deep neural network attention model [48].
4.3 LSTM network.
4.4 Block diagram of multimodal sentiment analysis.

List of Abbreviations

ASR Automatic Speech Recognition

DNN Deep Neural Network

DNNAM Deep Neural Network Attention Mechanism

GMM Gaussian Mixture Models

HMM Hidden Markov Model

LPCC Linear Predictive Cepstral Coefficients

LSTM-RNN Long Short-Term Memory Recurrent Neural Network

MFCC Mel Frequency Cepstral Coefficients

NB Naive-Bayes

RNN Recurrent Neural Network

SoE Strength of Excitation

SVM Support Vector Machine

ZFF Zero Frequency Filter


VSM Vector Space Model


Chapter 1

INTRODUCTION TO SENTIMENT ANALYSIS

Sentiment analysis is the systematic study of identifying and extracting information present in source materials using natural language processing, computational linguistics, and text analytics. It focuses on the automatic identification of opinions, evaluations and judgments, capturing the behavior, thoughts, views, mood, attitude, likes and dislikes of an individual from user-generated web content. Sentiments are considered the manifestation of human feelings and emotions. The field of computer science helps in analyzing and predicting the hidden information stored in different modalities like audio, video and text. This hidden information provides valuable insights into users' intentions, tastes and preferences.

User-generated content is an important source of information for mining the sentiment/opinion of people on different products and services. To obtain factual and subjective information on companies and products, analysts are turning to the web. The world wide web plays an important role in gathering public opinion, which is useful in making business-related decisions. Extracting the sentiment from these public opinions is a major task. Industries spend a large share of their revenue on business intelligence to read the minds of the general public and interpret what they think about their products. Sentiment analysis tries to mine information from various kinds of data such as reviews, songs, news and blogs, and classify them as positive, negative or neutral.

Much of the work to date on sentiment analysis has focused on textual data, and the available datasets and resources are restricted to text-based sentiment analysis. With the advent of social media, people now extensively use videos, images and audio on platforms such as YouTube to express their opinions. Audio reviews represent a growing source of consumer information that has gained increasing interest from companies, researchers and consumers. Compared to traditional text reviews, audio reviews provide a more natural experience, as they allow the viewer to better sense the reviewer's emotions, beliefs and intentions through richer channels such as intonation. Thus, it is crucial to mine opinions and identify sentiments from these diverse modalities.

Modalities other than text can often provide clues to the expressed sentiment. The joint use of modalities brings significant advantages over text alone, including linguistic disambiguation (audio features can help disambiguate linguistic meaning), relief from the linguistic sparsity problem (audio features bring additional sentiment information) and grounding (the audio modality enhances the connection to real-world environments). Here, we address the task of multimodal sentiment analysis. We experiment with several linguistic and audio features extracted from specific regions and show that the joint use of these two modalities significantly improves the classification accuracy compared to using one modality at a time.

1.1 Motivation

Research in sentiment analysis is growing rapidly and attracting the attention of academia and industry alike. Understanding emotions, analyzing situations and grasping the sentiments associated with a context come naturally to human beings; how efficiently a machine can be trained to exhibit the same ability is an important question to be explored and answered. Sentiment analysis provides an effective mechanism for understanding an individual's attitude, behavior, likes and dislikes. Advances in both signal processing and AI have led to the development of intelligent systems that aim to detect and process affective information contained in multimodal sources. The majority of such state-of-the-art frameworks, however, rely on processing a single modality, i.e., text, audio, or video. Further, these systems are known to exhibit limitations in robustness, accuracy and overall performance, which greatly restrict their usefulness in real-world applications. The aim of multimodal data fusion is to increase the accuracy and reliability of estimates. The textual, audio and visual modalities are exploited to enable effective extraction of the semantic and affective information conveyed during communication.

With the significant increase in the popularity of social media like Facebook and YouTube, many users upload their opinions on products in video format. Mining opinions on various products therefore becomes a useful resource to guide and help people in making choices and decisions. Mining sentiments and subjective information helps in providing products and services in a personalized fashion, according to individuals' tastes and likings. The study of sentiment analysis also provides insight into how human beings perceive and use text and audio to express their feelings and emotions.

Hence, when a person expresses an opinion with strong vocal modulation, the audio data may contain most of the clues for opinion mining rather than the text alone. A generic model therefore needs to be developed which can adapt itself to any user and give consistent results. These multi-dimensional aspects motivated taking up multimodal sentiment analysis as the current research work.

1.2 Issues Addressed in this Thesis

In this research, multimodal sentiment analysis is performed on selective regions of an input, and deep neural network approaches to sentiment analysis are explored. Previous research on multimodal sentiment analysis extracts conventional features from the entire signal, but these are not problem-specific features for extracting sentiment. In this work, instead of extracting features from the entire signal, specific regions of the audio signal are identified and experiments are performed on these regions using relevant features. To extract the sentiment from the songs data, specific regions such as the beginning and ending of a song are identified. For the reviews, stressed significant and normal regions of the audio input are identified using the Zero Frequency Filtered (ZFF) signal, and sentiment analysis is performed on these specific regions.

Sentiment analysis is performed using new features, namely relative prosody features, which are extracted from the stressed and normal regions of an audio signal. This representation outperforms the prosody and Mel Frequency Cepstral Coefficient (MFCC) features extracted from both regions.

A song sentiment classification model is developed for different regions of songs in the Telugu language.

Deep learning architectures (the deep neural network and the deep neural network with attention mechanism) are explored using Mel Frequency Cepstral Coefficient feature vectors of different dimensions (13-dimensional, 65-dimensional (5 frames combined) and 130-dimensional (10 frames combined)) to extract the sentiment present in the audio input.
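As an illustration of this frame stacking, the sketch below shows one way consecutive 13-dimensional MFCC frames might be concatenated into 65- or 130-dimensional vectors; the non-overlapping grouping with trailing frames dropped is an assumption for illustration, not a detail stated in the thesis.

    import numpy as np

    def stack_frames(mfcc, context):
        # mfcc    : array of shape (num_frames, 13)
        # context : 5 gives 65-dimensional vectors, 10 gives 130-dimensional vectors.
        # Non-overlapping groups are assumed; leftover frames at the end are dropped.
        usable = (mfcc.shape[0] // context) * context
        return mfcc[:usable].reshape(-1, context * mfcc.shape[1])

    # Example: 100 frames of 13-dim MFCCs -> 10 stacked 130-dim vectors.
    frames = np.random.randn(100, 13)
    print(stack_frames(frames, 10).shape)   # (10, 130)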

A Hindi product reviews database and a Telugu songs database are collected for this work. In each case, 100 files are used, of which 50 carry positive and 50 carry negative sentiment. The audio files are downloaded from YouTube, and transcriptions of the reviews are done manually.

1.3 Applications of Sentiment Analysis

Sentiment analysis has been widely used for understanding the nature of a context. A few areas where sentiment analysis can be applied are:

(a) Businesses and organizations: Much of business strategy is guided by the response from customers. Companies aim to satisfy the needs and demands of users, so their strategic moves are driven by public opinions and views. With the world connected through technology, events have a global impact: an issue or failure in one part of the world affects the other parts of the globe. It therefore becomes quite important to steer products and services according to the public viewpoint, and businesses invest large amounts of money to gauge this sentiment.

(b) Individual product analysis and decision making: With the help of sentiment analysis it has become easier to analyze different products and make choices accordingly. This kind of analysis also helps in selecting a product based on its feature specifications, and it makes comparing two products much easier. Decision making is an integral part of our lives, ranging from which products to buy and which restaurant to visit, to which bank or insurance policy to choose and which investments to make. Sentiment analysis can be used to decide and select from the available options based on the general opinions expressed by other users.

(c) Ad placement: Ads can be placed in user-generated content: an ad for a product when someone praises it, or an ad from a competitor when someone criticizes it. Whenever someone expresses an opinion about a product, analyzing that review tells us whether it is positive or negative.

(d) Recommendation systems: Most of the websites we visit have a built-in recommendation system to assist us, ranging from sites related to books, online media, entertainment, music and film to other forms of art. These systems use our personal information, previous history, likes and dislikes and our friends' information to make suggestions.

(e) Opinion retrieval/search: Providing search over the opinions given by users.

(f) Designing and building innovative products: With products exposed to tough competition and open to criticism through public reviews and opinions, sentiment analysis leads to better analysis of products in terms of usability and human friendliness. It creates an environment for better and more innovative products.

(g) Computing customer satisfaction metrics: The ratio of positive to negative tweets about a product gives an idea of how happy customers are with it.

(h) Identifying detractors and promoters: Sentiment analysis can be used in customer service to spot dissatisfaction or problems with products. It can also be used to find people who are happy with your products or services, whose experiences can then be used to promote them.

There are further applications such as question answering, text summarization and mining opinions from product reviews.

1.4 Organization of the Thesis

The organization of the thesis is as follows:

• In Chapter 2, we review various approaches to multimodal sentiment analysis. This chapter discusses the related work on sentiment analysis for different modalities such as audio and text.


• Chapter 3 discusses the importance of specific regions of the audio input. We also present the different features used to detect the sentiment in these regions.

• Chapter 4 describes the different deep neural network architectures developed to build a sentiment model.

• Finally, we summarize the contributions of this research work and discuss some directions for further work in Chapter 5.


Chapter 2

REVIEW OF APPROACHES TO SENTIMENT ANALYSIS

Sentiment analysis plays a key role in the classification of many kinds of data, such as online songs [1] [43], online reviews [12] [54] and Twitter data [2] [26]. In this chapter we provide a brief overview of related work on multimodal sentiment analysis, focusing first on audio-based and then on text-based sentiment analysis. Each modality carries some special information which helps to detect the sentiment; hence all kinds of data are useful for better sentiment classification. For any modality, audio or text, sentiment can be extracted using machine-learning based sentiment classification techniques [32].

2.1 Audio-based Sentiment Analysis

The study of the relationship between emotional content and audio signals is a mature field. Researchers have extended the success of Mel Frequency Cepstral Coefficient (MFCC) features in the speech recognition community to explore their use in music modeling [29]. MFCCs are currently a staple of audio processing and are commonly used in music information retrieval applications such as genre classification [55], since they offer a quantifiable method for comparing the timbral texture of songs. Timbre has been used with some success to classify the emotional content of songs [28]; however, class inconsistencies have proven to be a difficult challenge, causing substantial misclassification between edge cases. Timbre and chroma have also been used to generate songs that evoke particular emotions [6]. These feature vectors have commonly been classified using support vector machines (SVM) and naive Bayes classifiers. Instead of using MFCC and chroma features separately, work has been done on combining both features, which gives better performance: chroma features are less informative for classes such as artist, but contain information which is independent of the spectral features [10]. For this reason, in our work we combine both features, along with some other features, for the song database.

To detect sentiment from natural audio streams, a sentiment detection system has been developed based on Maximum Entropy modeling and part-of-speech tagging, with transcripts of the audio streams obtained using Automatic Speech Recognition (ASR) [24]. This approach shows that it is possible to automatically detect sentiment in natural spontaneous audio with good accuracy. Another method for audio sentiment detection is based on keyword spotting rather than ASR [23]; experiments show that it outperforms the traditional ASR approach with a 12 percent increase in classification accuracy. Audio features like pitch, intensity and loudness have been extracted using the OpenEAR software and an SVM classifier built to detect the sentiment [45]. Similarly, audio features automatically extracted from the audio track of each video clip using OpenEAR have been used to build a Hidden Markov Model (HMM) classifier [37]. In our study, instead of extracting all the features from the entire input using tools like OpenEAR/OpenSMILE, only specific relevant features like MFCC, prosody and relative prosody, extracted from the stressed and normal regions of the input, are used.

Since prosody has previously been used for emotion recognition in speech, it has also been tried for sentiment analysis [31]. Speech features are generally derived from characteristics of the vocal tract, the excitation and the prosody. Prosody parameters extracted at segmental, sub-segmental and supra-segmental levels are used for emotion recognition in [49]. The relations between various prosody parameters have been explored to develop emotion recognition systems [25]. In the literature, MFCC and Linear Predictive Cepstral Coefficients (LPCC) are the major spectral features used for emotion recognition [56]. Various acoustic cues such as energy of excitation (EoE), loudness, strength of excitation (SoE), instantaneous F0 and their combinations have been explored to study the emotion-discriminating capabilities of the excitation signal [16]. Cues such as F0 and SoE are also used for detecting paralinguistic sounds such as cries, shouts and laughter [34] [35] [36]. The strength of excitation of an audio input is found using the Zero-Frequency Filtering (ZFF) method. Regions where the strength of excitation deviates by more than 30% from the mean strength of excitation are referred to as emotionally significant regions [57]. From these emotionally significant regions, MFCC features are extracted and tested with a Gaussian Mixture Model (GMM) classifier, and a significant improvement in system performance is observed.
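To make the region-selection idea concrete, the sketch below marks epochs whose mean-smoothed strength of excitation deviates by more than 30% from the global mean. It assumes the per-epoch SoE values have already been obtained from a ZFF analysis (not implemented here), and the smoothing window stands in for the 20 ms mean smoothing described for Figure 3.2; it is an illustrative reading of the rule in [57], not the thesis's exact implementation.

    import numpy as np

    def significant_regions(soe, win=20):
        # soe : per-epoch strength-of-excitation values, assumed to come from a
        #       ZFF analysis that is not shown here.
        # win : smoothing window length in epochs, a stand-in for the 20 ms
        #       mean smoothing used in the thesis.
        kernel = np.ones(win) / win
        smoothed = np.convolve(soe, kernel, mode="same")   # mean-smoothed SoE
        mean_soe = smoothed.mean()
        deviation = np.abs(smoothed - mean_soe) / mean_soe
        return deviation > 0.30   # True where SoE deviates by more than 30%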

To detect the sentiment of input data, current research is moving from SVM, HMM and GMM classifiers towards deep architectures, which have the ability to discover multiple levels of features from data. Deep Neural Networks (DNN) are applied in many areas such as speech recognition, language recognition and sentiment analysis. A significant improvement in performance is observed when deep neural networks replace traditional GMMs in acoustic modeling for speech recognition [17]. In [30] [51], a single DNN acoustic model is trained for both language recognition and speaker recognition tasks, and a significant improvement in performance is observed. The drawback of the DNN system is that a decision is taken at every frame with a fixed context and then assigned to the entire utterance, so it cannot memorize long temporal context. To overcome this problem, a feed-forward deep neural network with an attention mechanism [48] has been proposed to address long-range dependency problems. This attention mechanism is used for a language recognition task in [38], where it performs better than the DNN. In this work, we use this architecture for sentiment analysis and refer to it as the Deep Neural Network Attention Mechanism (DNNAM). The attention mechanism can be parallelized because it is a feed-forward network with no recurrent connections, and it is able to memorize long temporal context. This architecture classifies the whole utterance, whereas in the DNN a frame-level decision is taken and all the decisions are then combined to extract the sentiment.
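As a rough illustration of the idea (not the exact DNNAM architecture of [48]), the sketch below shows attention-style pooling: a small feed-forward scorer weights each frame, and the utterance-level representation is the weighted sum, so a single decision is made per utterance rather than per frame. All layer sizes here are illustrative placeholders.

    import torch
    import torch.nn as nn

    class AttentionPooling(nn.Module):
        # Weighted average of frame-level features using learned attention weights.
        def __init__(self, feat_dim, hidden_dim=64):
            super().__init__()
            self.scorer = nn.Sequential(
                nn.Linear(feat_dim, hidden_dim), nn.Tanh(), nn.Linear(hidden_dim, 1)
            )

        def forward(self, frames):                # frames: (batch, time, feat_dim)
            weights = torch.softmax(self.scorer(frames), dim=1)   # (batch, time, 1)
            return (weights * frames).sum(dim=1)                  # (batch, feat_dim)

    # Pool 13-dim MFCC frames into one vector per utterance and classify it.
    pool = AttentionPooling(13)
    classifier = nn.Linear(13, 2)                 # positive vs. negative sentiment
    logits = classifier(pool(torch.randn(4, 300, 13)))   # -> shape (4, 2)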

2.2 Text-based Sentiment Analysis

A sentiment vector space model has been used for song sentiment classification [60]. Experiments cover two approaches, knowledge-based and machine learning. In the knowledge-based approach, HowNet [8] is used to detect the sentiment words and to locate the sentiment units within the song lyrics. In the machine-learning approach, the SVM algorithm is implemented on the Vector Space Model (VSM) and the sentiment Vector Space Model (s-VSM) respectively. Experiments show that the s-VSM gives better results than the VSM and the knowledge-based approach. Previous work also includes sentiment analysis for mining topics from songs based on their moods [52]: the input lyrics files are represented with a WordNet graph and the sentiment of each song is mined using a Hidden Markov Model (HMM). Based on single adjective words available for the audio dataset USPOP (hit pop songs), a new dataset was derived from last.fm tags [19]; K-means clustering was applied to it to create a meaningful cluster-based set of high-level mood categories for music mood classification. This set was not adopted by others because its mood categories were seen as an oversimplification of the domain. The authors of [21] demonstrated the usefulness of text features in music mood classification over 18 mood categories derived from user tags, and showed that these text features outperform audio features in categories where samples are sparser. An unsupervised method to classify music by mood is proposed in [42], where a fuzzy c-means classifier performs the automatic mood classification.

Movie review mining using machine learning and semantic orientation is implemented in [5]. Based on document features, the semantic orientation approach classifies an input review as negative or positive, while machine learning techniques are used to investigate the effectiveness of classifying documents by overall sentiment. Another work on movie reviews uses SVM, Naive Bayes and Maximum Entropy classifiers [41]. Eight different types of features are extracted, such as unigrams, bigrams and combinations of both. These features are tested on the three classifiers; among them, the SVM with binary unigram features gives the highest rate of detecting the sentiment. Using only unigrams and bigrams can be somewhat misleading because subjective information is missing. In [40], each sentence of a review is labeled as either objective or subjective, and the objective sentences are discarded by finding minimum cuts in graphs, which prevents the classifier from considering misleading text. On this data, SVM and Naive Bayes classifiers are implemented to develop a model, and a small improvement is observed with this method. A sentiment analyzer has been developed to find all the references to a subject and the sentiment polarity of each reference in online documents [63]. A sentiment lexicon and a sentiment pattern database are used for extraction and association; expressions about specific items are classified, and manually developed patterns are used to determine polarity.


Sentiment has been extracted from tweets using opinion words, namely combinations of adjectives with verbs and adverbs [26]. Adjectives and negation words are taken into account to calculate the polarity of the whole phrase; a corpus-based method is used to find the semantic orientation of adjectives and a dictionary-based method for verbs and adverbs. Two different Naive Bayes classifiers that use a polarity lexicon to classify tweets as positive or negative are used to detect the polarity of English tweets [15] and are treated as baselines. Features like lemmas, multiwords, a polarity lexicon and valence shifters are used. The training set of tweets is obtained from SemEval 2014, with additional annotated tweets from external sources. Experiments show that performance is best when the binary strategy is used with multiwords and valence shifters. An approach to analyzing the sentiment of short Chinese texts is presented in [61]: using the word2vec tool, sentiment dictionaries from NTU and HowNet are extended, the feature weights of words that appear in the sentiment dictionary and of words adjacent to sentiment words are enhanced, and the model is implemented with an SVM classifier. Work on sentiment analysis of online news articles is presented in [13]: using the Machine Learning for Language Toolkit (MALLET), six text-classification algorithms are compared, namely Naive Bayes, Maximum Entropy, a decision tree rule base, a decision tree with the C4.5 algorithm, the Winnow algorithm and the Balanced Winnow algorithm. Experimental results show that the Naive Bayes classifier performs best.

The most commonly used classification techniques such as SVM, Maximum Entropy and Naive Bayes are based on the bag-of-words model, in which the sequence of words is ignored. This can be inefficient for extracting the sentiment from the input, because the word order affects the sentiment it carries. To overcome this problem, many studies have employed deep learning for sentiment analysis. A deep neural network architecture that jointly uses word-level, character-level and sentence-level representations to perform sentiment analysis has been proposed [9], and the experimental results show improved performance. A DNN has also been applied to language modeling [3], where it outperforms the n-gram model. A DNN can have any number of hidden layers and any number of nodes per layer, with weights connecting them, and it can learn more complex models as more layers are added. However, a simple feed-forward neural network does not become more accurate simply by adding layers, because training becomes ineffective when there are many layers [7], and it cannot capture temporal context accurately. To better capture temporal context, the Recurrent Neural Network (RNN) has been proposed [33] and applied to language modeling for speech recognition, where it outperforms the n-gram technique. The advantage of the RNN is that it uses previous state information to compute its current state, which resembles the role of context in most natural languages. However, a simple RNN has difficulty passing information across long sequences. A solution to this issue is the Long Short-Term Memory (LSTM), an RNN with additional long-term memory [18]. In this thesis, a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) classifier is employed to extract the sentiment from the input data.
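A minimal sketch of such an LSTM-RNN text classifier is given below; the vocabulary and layer sizes are illustrative placeholders rather than settings from this thesis, and the embedding layer could instead be initialised with pretrained Word2Vec vectors.

    import torch
    import torch.nn as nn

    class LSTMSentiment(nn.Module):
        # LSTM over a sequence of word embeddings; the final hidden state
        # summarises the whole review and is mapped to two sentiment classes.
        def __init__(self, vocab_size=10000, embed_dim=100, hidden_dim=64):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, 2)      # positive vs. negative

        def forward(self, token_ids):                # token_ids: (batch, seq_len)
            _, (h_n, _) = self.lstm(self.embed(token_ids))
            return self.out(h_n[-1])                 # use the final hidden state

    model = LSTMSentiment()
    logits = model(torch.randint(0, 10000, (4, 50)))   # 4 reviews of 50 tokens each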


2.3 Multimodal Sentiment Analysis

Instead of using only lyrics or only audio, research has also been done on combinations of both domains. Surveys of multimodal sentiment analysis and its methods are given in [14] [53].

In [20], mood classification in music digital libraries is performed by combining lyric and audio features. It is concluded that complementing audio with lyrics can reduce the number of training samples required to achieve the same or better performance than single-source systems. Music sentiment classification using both lyrics and audio is presented in [64]. For the lyric sentiment classification task, the CHI approach (a feature selection method in text categorization) and an improved difference-based CHI approach were developed to extract discriminative affective words from the lyrics text; the difference-based CHI approach gives better results than the plain CHI approach. For the audio sentiment classification task, features like chroma and spectral features are used to build an SVM classifier. Experiments show that fusing the data sources helps to improve music sentiment classification. In [22], [58], music is retrieved based on both lyrics and melody information: keyword spotting is used for the lyrics, and MFCC and pitch features are extracted for the melody. Experiments show that combining both modalities increases performance.

The joint use of multiple modalities, combining video, audio and text features, has been explored for classifying the polarity of opinions in online videos [37] [47]. Both feature-level and decision-level fusion methods are used to merge the affective information extracted from the modalities, and an improvement in classification is reported when combining modalities rather than using a single one. Multimodal sentiment analysis can thus be seen as an intelligent opinion mining approach for identifying and understanding the sentiment present in reviews; in [62], audio and video signals are used to extract the sentiment and thereby overcome the drawbacks of traditional text-only sentiment analysis. In [59], the authors introduce the Institute for Creative Technologies Multimodal Movie Opinion (ICT-MMMO) database of personal movie reviews collected from YouTube and ExpoTV. It consists of English clips with sentiment annotations from one or two coders, and the feature basis is formed from audio, video and textual features. Based on the textual movie review corpus, different levels of domain dependence are considered, in-domain analysis and cross-domain analysis, showing that cross-corpus training works sufficiently well. The authors of [44] introduce the MOUD database, consisting of Spanish videos, and explore the effect of using different combinations of text, speech and video features on classification; they also examine the correlation between visual and acoustic features, and further confirm their findings on another set of English videos. From the results, it is observed that the joint use of the three modalities brings significant improvement. To determine the sentiment polarity of an input, [46] extracts features from three modalities; a convolutional neural network used to extract the text features shows significant improvement in detecting the sentiment of a review.

In this work, a method that combines text and audio (speech) features is explored to detect the sentiment of online product reviews and online songs. So far, little research has been done on multimodal sentiment analysis of online reviews and songs in Indian languages. Our proposed system is developed for Telugu, Hindi and Spanish databases. In the literature, several features are extracted from the whole input using the OpenEAR/OpenSMILE tool and a sentiment model is built using different classifiers; in our work, by contrast, experiments are performed on selective regions.

Experiments on the whole song do not show good accuracy in detecting the sentiment, because the whole song may or may not carry the same attribute, happy (positive) or sad (negative), throughout. The beginning and ending parts of the song are more likely to carry its main attribute; hence, experiments are done on different parts of the song to extract the sentiment. For reviews, instead of extracting all the features with a toolkit, MFCC, prosody and relative prosody features are extracted at the stressed significant and normal regions of the audio input, and sentiment analysis systems are built using SVM and GMM classifiers. Experiments are also performed using deep neural architectures such as the DNN and DNNAM. To detect the sentiment of a text input, textual features computed as Doc2Vec and Word2Vec vectors are used to build SVM and LSTM-RNN classifiers respectively. The final sentiment is extracted by combining the audio and text modalities.
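A small sketch of the text branch is shown below: document vectors inferred with gensim's Doc2Vec feed an SVM classifier. The toy reviews, vector size and training epochs are illustrative placeholders, not the settings used in this thesis.

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from sklearn.svm import SVC

    # Toy tokenised reviews standing in for the transcribed data; 1 = positive, 0 = negative.
    texts = [["great", "phone", "loved", "it"], ["very", "bad", "battery", "life"]]
    labels = [1, 0]

    docs = [TaggedDocument(words, [i]) for i, words in enumerate(texts)]
    d2v = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)   # tiny sizes for the toy data

    X = [d2v.infer_vector(words) for words in texts]              # one vector per document
    clf = SVC(kernel="rbf", probability=True).fit(X, labels)
    print(clf.predict([d2v.infer_vector(["battery", "bad"])]))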

In the next chapter we describe how multimodal sentiment analysis is performed on selective regions of an input.


Chapter 3

MULTIMODAL SENTIMENT ANALYSIS USING SELECTIVE

REGIONS

So far, most work on sentiment analysis has been done on the entire signal by extracting several features. In this chapter, instead of using the entire signal, specific regions of the audio signal are identified and studies are performed on these regions to detect the sentiment associated with them. We present a method to detect the specific regions of audio signals, and we show how the combination of the audio and text modalities improves the performance in detecting the sentiment. Multimodal sentiment analysis is presented on the Spanish reviews, Hindi reviews and Telugu songs databases.

3.1 Databases used in this Study

This section describes the Spanish, Hindi and Telugu databases used in this study. The Spanish database is publicly available, while the Hindi and Telugu databases were collected to carry out this study.

3.1.1 Spanish database of opinion utterances

This database, the Multimodal Opinion Utterances Dataset (MOUD), is obtained from [44]. It consists of a variety of product reviews on movies, perfumes and books collected from the publicly available source YouTube. Because a variety of product reviews is used, the database has some degree of generality within the broad domain of product reviews. The database has 100 videos in total, of which 42 are positive, 36 are negative and 22 are neutral. For our study, only the negative and positive sentiments are considered; 80% of them are used for training and 20% for testing. Only 30-second opinion segments are taken from each video, after removing titles and advertisements. For text based sentiment classification, transcription and sentiment annotation were performed manually. The average number of words in each input is around 50. Annotators were provided with both modalities, i.e. audio and transcribed text, to correctly determine the opinion of the reviews.


3.1.2 Hindi database of reviews utterances

This database is collected from YouTube, a publicly available source. As Hindi is a resource-scarce language, not much work has been done on sentiment analysis for it. The dataset includes reviews of phones, lotions and shampoos, so the database has some degree of generality within the broad domain of product reviews. The two basic sentiments present in the database are positive and negative. A total of 110 product reviews were collected, from which 100 inputs were retained based on inter-annotator agreement; 80% are used for training and the remaining 20% for testing. Both modalities, audio and text, were provided to the annotators to determine the exact opinion of each input, and based on inter-annotator agreement 50 positive and 50 negative inputs were chosen. Each audio input is in .wav format with 16-bit resolution, 16000 Hz sampling frequency and a mono channel. Transcription and sentiment annotation were performed manually for text based sentiment classification. The average length of each input is 30 seconds and the average number of words per input is about 40.

3.1.3 Telugu songs

This database is collected from YouTube, a publicly available source. A total of 150 Telugu movie songs, along with the lyrics of each song, were taken. The two basic sentiments present in the database are happy and sad: joyful, thrilled, powerful, etc. are treated as happy, and ignored, depressed, worried, etc. are treated as sad. As our native language is Telugu, the work is implemented on Telugu songs, which have no special characteristics compared to songs in other languages. Telugu songs are one of the popular categories of Indian songs and appear in Tollywood movies; most people from the southern part of India listen to them. The songs include a variety of instruments along with the vocals, and the main challenge here is this diversity of instruments and vocals. The average length of each song is 3.5 minutes and the average number of words in the lyrics of each song is around 300. The database was annotated for the sentiments happy and sad by three people, who were provided with both modalities, text and audio, to correctly determine the sentiment of a song. Based on inter-annotator agreement, 50 happy songs and 50 sad songs were selected, because some songs seemed happy or sad to one annotator and neutral to another; so only 100 songs out of 150 are used. Inter-annotator agreement is a measure of how often two or more annotators make the same annotation decision for a certain category. Of the selected songs, 40% are used for training and 60% for testing.

3.2 Detection of Sentiment in Selective Regions of Songs

The entire song may or may not carry the same sentiment, happy (positive) or sad (negative), throughout. So, in this work, experiments are performed not only on the entire song but also on different regions, namely the beginning and the end of a song.


3.2.1 Sentiment analysis using audio features

This section describes the process of developing sentiment models for different regions by extracting the audio features of a song. These features are then used to build classifiers for the positive or negative sentiment of a song. Each song underwent a preprocessing step of conversion from mp3 to wave file (.wav format) with 16-bit resolution, 16000 Hz sampling frequency and a mono channel.

As the entire song may or may not carry the same sentiment, three studies are performed in this work: on the beginning 30 seconds of a song, on the ending 30 seconds of a song and on the whole song. The openEAR/openSMILE toolkit [11] is used to extract a set of audio features such as Mel-frequency Cepstral Coefficients (MFCC), chroma, prosody, temporal, spectral, harmonic and tempo features from the wave file. As prosody has been used before for emotion recognition in speech, it has also been applied successfully, via the openEAR toolkit, to sentiment analysis [31]. Brief details about the audio features used in this work are given below; a small feature-extraction sketch follows the list.

1. Prosodic features are those aspects of speech which go beyond phonemes and deal with the au-

ditory qualities of sound. In spoken communication, we use and interpret these features without

really thinking about them. These features include intensity, loudness and pitch that describe the

speech signal.

2. Temporal features also called as time domain features, are the energy of signal, zero crossing rate.

3. Spectral features also called as frequency domain features which are extracted by converting the

time domain signal into frequency domain using the Fourier Transform. It include features like

fundamental frequency, spectral centroid, spectral flux, spectral roll-off, spectral kurtosis, spectral

skewness. These features can be used to identify the notes, pitch, rhythm, and melody.

4. Mel-frequency cepstral coefficients (a 13-dimensional feature vector) are calculated based on the short-time Fourier transform (STFT). First, the log-amplitude of the magnitude spectrum is taken, followed by grouping and smoothing the fast Fourier transform (FFT) bins according to the perceptually motivated Mel-frequency scaling. The frequency bands are equally spaced on the mel scale, which approximates the human auditory system's response more closely.

5. Chroma features (a 12-dimensional feature vector) are among the most popular features in music and closely relate to pitch classes. A pitch class is a set of pitches that share the same chroma. The entire spectrum is projected onto 12 bins representing the 12 semitones (chroma) of the musical octave. These features are extensively used for chord and key recognition and for segmentation.

6. Harmonic tempo is the rate at which the chords change in the musical composition in relation to

the rate of notes.
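For illustration, a minimal Python sketch of such frame-level feature extraction is given below. It uses librosa as a stand-in for the openEAR/openSMILE toolkit actually used in this work (an assumption made only for the sketch), and covers just the MFCC and chroma parts of the 37-dimensional vector described next.

    # Sketch: frame-level MFCC + chroma extraction with librosa (illustrative only;
    # the thesis experiments use openEAR/openSMILE for the full 37-dimensional vector).
    import librosa
    import numpy as np

    def frame_level_features(wav_path, sr=16000, frame_ms=25, shift_ms=10):
        y, sr = librosa.load(wav_path, sr=sr, mono=True)
        n_fft = int(sr * frame_ms / 1000)      # 400 samples = 25 ms frame
        hop = int(sr * shift_ms / 1000)        # 160 samples = 10 ms shift
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
        chroma = librosa.feature.chroma_stft(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
        return np.vstack([mfcc, chroma]).T     # one row of features per frame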

By combining all these features, a 37-dimensional feature vector is extracted at the frame level. During feature extraction, a frame size of 25 ms and a frame shift of 10 ms are used.


Table 3.1 Sentiment classification performance (in %) of songs using the 37-dimensional feature vector.

    Region                  SVM    GMM    SVM+GMM
    Entire song             52.8   54.9   69.7
    Beginning of the song   55.8   73.5   88.3
    Ending of the song      64.7   61.7   82.35

Classifiers such as Support Vector Machines (SVM), Gaussian Mixture Models (GMM), and a combination of both are developed using these features. GMMs are known for capturing the distribution of the features, while SVMs are known for capturing discriminative information; hence the two models are combined to improve the performance in detecting the sentiment of a song. In this work, 64 mixtures for the GMM models and a Gaussian kernel for the SVM models are determined empirically.
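The two classifiers are combined at the score level, following the scheme in Figure 3.1 (hypothesizing based on the highest average probability). A minimal sketch of such fusion, assuming scikit-learn models and frame-level features as in the sketch above, is:

    # Sketch: score-level fusion of SVM and GMM for one test song (scikit-learn assumed).
    # svm is an SVC trained with probability=True; gmm_happy/gmm_sad are per-class
    # GaussianMixture models; class order [happy, sad] is assumed (check svm.classes_).
    import numpy as np

    def fuse_and_classify(song_feats, svm, gmm_happy, gmm_sad):
        p_svm = svm.predict_proba(song_feats).mean(axis=0)          # frame-averaged posterior
        ll = np.array([gmm_happy.score(song_feats), gmm_sad.score(song_feats)])
        p_gmm = np.exp(ll - ll.max())
        p_gmm = p_gmm / p_gmm.sum()                                 # normalized GMM evidence
        p_avg = (p_svm + p_gmm) / 2.0                               # highest average probability
        return "happy" if p_avg[0] > p_avg[1] else "sad"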

The 37-dimensional feature vector is given as input to the classifiers to detect the sentiment. From Table 3.1 it is observed that the entire song gives lower performance in detecting the sentiment compared to the beginning and ending of a song. This is because the whole song carries a mixture of attributes (happy and sad), which blurs the sentiment; using only part of the song therefore increases the performance. The performance for the beginning of a song is better than for the ending because the vocals and instruments present at the beginning may or may not sustain throughout the song. It is also observed that the combination of evidence from both classifiers gives better performance for the beginning of the song than for the ending. The combined performance for the beginning, ending and whole song is 88.3%, 82.3% and 69.7% respectively.

3.2.2 Sentiment analysis using lyric features

This section describes the process of extracting textual features from the lyrics of a song at different regions. These features are then used to build a classifier for positive or negative sentiment. All the lyrics are collected from the Internet. In the preprocessing step, stanza headings such as "pallavi" and "charanam" are removed, because these headings are common to every song and do not act as features for detecting sentiment. A line that is to be repeated is marked "x2" in the original lyrics, so the "x2" marker is removed and the corresponding line is included twice. For each song in the database, a 300-dimensional feature vector is generated using the Doc2vec model [27]. As there are 100 files, 100 feature vectors are generated, one per song. For evaluating accuracy, each song is manually annotated with a tag, happy or sad.

The Doc2Vec model is used for associating arbitrary documents with labels. Doc2vec extends the word2vec algorithm to unsupervised learning of continuous representations for larger blocks of text such as sentences, paragraphs or entire documents; that is, Doc2vec learns to correlate labels and words rather than only words with other words. The document vector tries to grasp the semantic meaning of all the words in the context by being present in each and every context, so that it finally captures the semantic meaning of all the words in the context on which it is trained.


Table 3.2 Sentiment classification performance (in %) of songs using lyric features.

    Region                 SVM    NB     SVM+NB
    Entire song            60.6   52.3   70.2
    Beginning of a song    67.5   57.3   75.7
    Ending of a song       64.4   55.8   72.4

In the word2vec architecture, the two algorithms used are continuous bag of words and skip-gram. Word2vec takes a text corpus as input and outputs a set of vectors: feature vectors for the words in that corpus. In the doc2vec architecture, the two algorithms are distributed memory and distributed bag of words. In distributed memory, in addition to the word vectors there is a document vector that keeps track of the whole document; in distributed bag of words there are no word vectors, just a document vector which is trained to predict the context.

All songs are given as input to doc2vec, which generates a single vector representing the meaning of each document. This vector is then used as input to supervised machine learning algorithms such as SVM and Naive-Bayes (NB) to associate documents with labels. SVM, NB, and a combination of the two classifiers are trained with the vectors generated from doc2vec for positive or negative sentiment classification. Given a test song, the trained models classify it as either happy or sad. Three experiments are done on each song: the beginning 30 seconds, the last 30 seconds, and the whole song. A small training sketch is given below.
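The sketch assumes gensim for Doc2Vec and scikit-learn for the classifiers; the hyperparameters shown are illustrative, not the thesis settings.

    # Sketch: Doc2Vec lyric vectors fed to SVM and Naive-Bayes classifiers.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from sklearn.svm import SVC
    from sklearn.naive_bayes import GaussianNB

    def train_lyric_models(lyrics, labels):
        # lyrics: list of token lists (one per song); labels: "happy"/"sad" per song.
        docs = [TaggedDocument(words=toks, tags=[str(i)]) for i, toks in enumerate(lyrics)]
        d2v = Doc2Vec(docs, vector_size=300, min_count=1, epochs=40)
        X = [d2v.dv[str(i)] for i in range(len(lyrics))]
        svm = SVC(probability=True).fit(X, labels)
        nb = GaussianNB().fit(X, labels)
        return d2v, svm, nb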

GMMs require more data for training than Naive-Bayes and SVM, but for the textual part there are few feature vectors (only one per song using doc2vec), whereas for audio there are many, since frame-level features are extracted for each song. So GMM and SVM are used as acoustic models, whereas Naive-Bayes and SVM are used for the linguistic features.

Features are extracted using the Doc2Vec model: given an input document, the model produces a fixed-length feature vector. From Table 3.2, it is observed that the combination of both classifiers gives better performance for the beginning of the song than for the ending or the whole song. Features extracted from the whole song give lower accuracy because of the ambiguity present in the larger amount of text. The performance at the beginning of a song is better than at the ending because the words at the beginning tend to carry a strong sentiment. Using lyric features, the combined performance for the beginning, ending and whole song is 75.7%, 72.4% and 70.2% respectively.

3.2.3 Multimodal sentiment analysis

The main advantage of audio over textual data is the availability of vocal modulation. In textual data, the only available source of information is the words and their dependencies.


[Figure: block diagram. Songs → Preprocessing → Lyrics File / Audio File → Extracting Textual Lyric Features / Extracting Audio Features → SVM + Naive Bayes / SVM + GMM → hypothesizing based on the highest average probability of the classifiers → Hypothesized Sentiment (Happy or Sad)]

Figure 3.1 Block diagram of multimodal sentiment analysis of songs.

Table 3.3 Multimodal sentiment classification performance (in %) with lyric and audio features.

    Region                 Lyric   Audio   Lyric+Audio
    Entire song            70.2    69.7    75.8
    Beginning of a song    75.7    88.3    91.2
    Ending of a song       72.4    82.3    85.6

This may sometimes be insufficient to convey the exact sentiment of the song. Audio data, in contrast, contain multiple streams, acoustic as well as linguistic. The simultaneous use of these two modalities helps to create a better sentiment analysis system for detecting whether a song is happy or sad. The sequence of steps in the proposed approach is presented in Figure 3.1. Table 3.3 presents the accuracy obtained by combining lyric and audio features. To handle the similarity between sentiment classes, the decisions from the classification models trained on the different modalities are combined. Combining both modalities improves the performance by 3 to 5%.

From these studies it is observed that the performance is better at the beginning than at the ending or for the whole song. These studies cannot, however, be carried over to reviews, because when someone expresses an opinion the sentiment is not necessarily confined to the beginning or the ending of the utterance. So, to detect the sentiment of reviews, stressed and normal regions are identified in the input and the experiments are performed on these regions.


3.3 Detection of Sentiment from Selective Regions of Reviews

This section describes the multimodal sentiment analysis performance on the review datasets.

3.3.1 Sentiment analysis using speech features

The whole audio input may not carry a single positive or negative sentiment, because human beings cannot sustain the same sentiment for an entire utterance; the utterance may contain positive, neutral or negative portions. So stressed and normal regions are detected within the voiced segments to extract the sentiment of the audio. MFCC, prosody and relative prosody features are extracted from the stressed and normal regions of the audio input, and these features are then used to build a classifier for positive or negative sentiment. Each audio input is in .wav format with 16-bit resolution, 16000 Hz sampling frequency and a mono channel. The process of finding stressed and normal regions in an audio input is described in the following subsections.

3.3.1.1 Detecting stressed and normal regions

Stressed regions are detected based on the strength of excitation of the audio signal using the Zero-Frequency Filter (ZFF) method [39]. The strength of excitation is defined as the slope of the ZFF signal at an epoch. Zero-frequency filtering of audio signals provides useful information about the excitation source, such as epoch locations, fundamental frequency (F0) and strength of excitation. This analysis is useful for detecting voiced/unvoiced regions as well as stressed/normal regions. The computation of the strength of excitation using the ZFF method is discussed in the following subsection [39].

3.3.1.2 Computation of strength of excitation using ZFF method

• Time-varying low frequency bias is removed from the signal by pre-emphasizing the audio signal

using a difference operation.

x(k) = s(k)− s(k − 1). (3.1)

• Then, the pre-emphasized audio signal is passed through a cascade of two ideal digital resonators at 0 Hz, i.e.,

y(k) = \sum_{p=1}^{4} a_p y(k-p) + x(k), (3.2)

where a_1 = +4, a_2 = -6, a_3 = +4, and a_4 = -1. The above operation can be realized by passing the signal x(k) through a digital filter given by

H(z) = \frac{1}{(1 - z^{-1})^{4}}. (3.3)


• The above step introduces an exponential trend in y(k), which is removed as follows:

zff(k) = y(k) - \bar{y}(k), (3.4)

where

\bar{y}(k) = \frac{1}{2N+1} \sum_{m=-N}^{N} y(k+m). (3.5)

Here a window of size 2N+1 is used to compute the local mean; typically this is the average pitch period computed over a longer segment of speech.

• The trend-removed signal zff(k) is referred to as the zero-frequency filtered signal (ZFF signal). The positive zero crossings of the ZFF signal correspond to epochs, the instants of significant excitation.

• The strength of excitation at each epoch is given by

SoE = |zff((l_e)_p + 1) - zff((l_e)_p - 1)|, (3.6)

where p = 1, 2, ..., M, M is the total number of epochs, (l_e)_p is the location of the p-th epoch, and SoE is the strength of excitation. (A numerical sketch of these steps follows this list.)
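As a reference, a minimal NumPy sketch of the listed steps is given below; the local-mean window (taken here as roughly an average pitch period) and the repeated trend removal are common choices assumed for illustration.

    # Sketch: ZFF signal, epochs and strength of excitation (NumPy only).
    import numpy as np

    def zff_soe(s, sr=16000, mean_win_ms=10.0):
        x = np.diff(s, prepend=s[0])                  # x(k) = s(k) - s(k-1)
        y = x.astype(np.float64)
        for _ in range(4):                            # two 0-Hz resonators = 1/(1 - z^-1)^4
            y = np.cumsum(y)
        N = int(sr * mean_win_ms / 1000)
        kernel = np.ones(2 * N + 1) / (2 * N + 1)
        for _ in range(3):                            # remove the exponential trend
            y = y - np.convolve(y, kernel, mode="same")
        zc = np.where((y[:-1] < 0) & (y[1:] >= 0))[0] + 1   # positive zero crossings = epochs
        epochs = zc[(zc > 0) & (zc < len(y) - 1)]
        soe = np.abs(y[epochs + 1] - y[epochs - 1])         # slope of ZFF signal at each epoch
        return y, epochs, soe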

3.3.1.3 Detecting stressed and normal regions of an audio signal

For each audio signal, voiced and unvoiced regions are identified using the ZFF signal. The strength of excitation at each epoch is computed for the entire signal. Regions where the epochs are equally spaced and the strength of excitation is high are treated as voiced regions; in unvoiced regions the epochs are irregularly spaced and the strength of excitation is low. So a simple threshold on the strength of excitation separates the voiced regions from the unvoiced regions.

Within the voiced regions, normal and stressed regions are identified based on the strength of excitation. The strength of excitation is computed at each epoch and held at that value until the next epoch, giving a step signal which is subjected to 20 ms mean smoothing to generate a smooth contour representing the strength of excitation. From this contour, the average strength of excitation is computed and referred to as SoE_avg. Regions where the strength of excitation deviates by more than 30% above or below SoE_avg are considered stressed significant regions, and the remaining part of the audio is identified as normal regions. A short sketch of this thresholding is given below.
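The sketch continues from the ZFF sketch above and treats the 30% rule as a symmetric band around the average, which is an interpretation assumed here.

    # Sketch: stressed vs. normal regions from the smoothed strength-of-excitation contour.
    import numpy as np

    def stressed_mask(epochs, soe, n_samples, sr=16000, smooth_ms=20, dev=0.30):
        contour = np.zeros(n_samples)
        for i, e in enumerate(epochs):                       # hold SoE until the next epoch
            end = epochs[i + 1] if i + 1 < len(epochs) else n_samples
            contour[e:end] = soe[i]
        win = int(sr * smooth_ms / 1000)                     # 20 ms mean smoothing
        contour = np.convolve(contour, np.ones(win) / win, mode="same")
        soe_avg = contour[contour > 0].mean()
        return np.abs(contour - soe_avg) > dev * soe_avg     # True = stressed, False = normal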

The approach for detecting stressed significant and normal regions of an utterance is illustrated in Fig. 3.2. The Spanish speech utterance “pero igual con las lavadas” is shown in Fig. 3.2(a), and Fig. 3.2(b) shows the zero-frequency filtered signal. The strength of excitation computed at every epoch using the algorithm described in Section 3.3.1.2 is shown in Fig. 3.2(c), and its 20 ms mean-smoothed version is shown in Fig. 3.2(d).


Figure 3.2 An approach to detect stressed significant regions of the Spanish speech utterance “pero igual con las lavadas”. (a) Input speech signal, (b) ZFF signal, (c) strength of excitation at each epoch, (d) strength of excitation mean smoothed using a frame size of 20 ms, and (e) detected stressed significant regions of the utterance.

The detected stressed significant and normal regions of the utterance, marked in red, are shown in Fig. 3.2(e).

3.3.1.4 Features extracted at stressed and normal regions

After detecting the stressed and normal regions, a sentiment model is built by extracting MFCC, prosody and relative prosody features from these regions. Features such as energy, pitch and duration, extracted at the suprasegmental level, are called prosody features. In this study, the prosody features considered are mean pitch, maximum pitch, minimum pitch, mean energy, maximum energy, minimum energy and the duration ratio. Relative prosody features are derived from both regions. The MFCC features form a 13-dimensional feature vector.


Figure 3.3 shows the block diagram for extracting relative prosody features from an audio signal. After identifying the normal and stressed regions, prosody features are extracted from both regions separately. Pitch is a perceptual quantity of sound; in general, the fundamental frequency (F0) is treated as the pitch. In the present study, pitch is computed from the ZFF signal: the positive zero crossings of the ZFF signal are hypothesized as epochs, the time difference between two consecutive epochs is the fundamental period (T0), and the inverse of T0 gives F0. The pitch and energy parameters are computed over the voiced regions, and the duration ratio is the ratio of voiced duration to total duration. After extracting the prosody features from both regions, the relative prosody features are calculated as the difference between the stressed and normal prosody features. The three statistics (mean, maximum and minimum) of pitch and of energy, together with the duration ratio, form a 7-dimensional relative prosody feature vector; a small sketch of this computation follows.
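The sketch below illustrates how the 7-dimensional relative prosody vector can be assembled; the inputs (pitch and energy tracks over voiced frames, voiced and total durations) are placeholders for the ZFF-based computations described above.

    # Sketch: 7-dimensional relative prosody vector = stressed minus normal statistics.
    import numpy as np

    def prosody_vector(pitch, energy, voiced_dur, total_dur):
        # Mean/max/min of pitch and energy over voiced frames, plus the duration ratio.
        return np.array([pitch.mean(), pitch.max(), pitch.min(),
                         energy.mean(), energy.max(), energy.min(),
                         voiced_dur / total_dur])

    def relative_prosody(stressed_vec, normal_vec):
        # Both arguments are 7-dimensional vectors from prosody_vector().
        return stressed_vec - normal_vec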

Figure 3.3 Block diagram to extract relative prosody features.

The advantage of relative prosody features over prosody features is illustrated in Figures 3.4 and 3.5. To plot these figures, 30 positive and 30 negative sentiment audio signals are used. Figure 3.4 shows the histograms of the maximum pitch value and the relative maximum pitch value for positive and negative sentiment audio signals. In Figure 3.4(a) it can be observed that the maximum pitch values for positive (blue) and negative (red) sentiment audio signals overlap. In contrast, the relative maximum pitch


values of positive and negative sentiment audio signals are concentrated in different bins, as shown in Figure 3.4(b).

Figure 3.4 Histograms of (a) maximum pitch and (b) relative maximum pitch for positive and negative sentiment audio files.

Figure 3.5 shows the scatter plot of maximum pitch vs. maximum energy. In the figure, red indicates negative and green indicates positive sentiment. Values plotted with 'o' and '⋄' are prosody features, and values plotted with '+' and square markers are relative prosody features. It can be observed that the overlap is higher for the prosody features than for the relative prosody features.

From Figures 3.4 and 3.5 it can be observed that relative prosody features provide more sentiment-specific discrimination than prosody features. Prosody features play a key role in how human beings perceive sentiment, but their natural variations limit their direct use. In this work, these natural variations are minimized by using relative prosody features.


[Figure: scatter plot, maximum pitch (y-axis) vs. maximum energy (x-axis); legend: Positive Prosody, Negative Prosody, Positive Relative Prosody, Negative Relative Prosody]

Figure 3.5 Scatter plot of maximum pitch vs. maximum energy for positive and negative sentiment audio files.

3.3.1.5 Sentiment analysis experimental setup and results

Experiments are performed on the Spanish and Hindi datasets. For each experiment, 5-fold cross-validation is run on the entire dataset. Two experimental setups are considered.

In the first setup, a sentiment model is built using MFCC features (a 13-dimensional feature vector) and a GMM classifier. GMMs are built with the 13-dimensional MFCC vectors extracted from the whole input and, separately, with the MFCC features extracted from the stressed regions of the audio signal. Different numbers of mixture components (8, 16, 32 and 64) are considered for building the GMMs; the results obtained with 32 mixtures are given in Table 3.4.

In the second setup, a sentiment model is built using prosody features and an SVM classifier. Linear-kernel SVMs, which are binary classifiers, are used: the linear kernel separates the negative examples from the positive examples by finding the optimal hyperplane with maximum margin. The SVM classifier is built using the following three feature sets: (1) prosody features (a 7-dimensional vector) extracted from the entire audio signal; (2) prosody features extracted from the stressed (7-dimensional) and normal (7-dimensional) regions, giving a 14-dimensional vector; and (3) relative prosody features (stressed minus normal), which are 7-dimensional. Sentiment models are built using these three feature sets.


Table 3.4 Sentiment classification performance (in %) with different features.

    Features                                                  Spanish   Hindi
    Pitch and intensity [44]                                  46.7      –
    MFCC (entire signal, 13-dimensional)                      58.3      53.6
    MFCC (stressed regions, 13-dimensional)                   67.8      65.2
    Prosody (entire signal, 7-dimensional)                    41.7      41.7
    Prosody (stressed and normal regions, 14-dimensional)     75.0      67.8
    Relative prosody (stressed minus normal, 7-dimensional)   83.3      75.0

These prosody features are not tested on the GMM classifier because the data may not be sufficient, as only one feature vector is extracted per utterance.

In the literature, a large number of audio features are often extracted using toolkits and classifiers are built using all of them, which might create confusion in identifying the sentiment of a signal [44]. Instead of using all such features, in our work only suitable features, namely prosody and MFCC, are considered. Prosody features are extracted at the suprasegmental level and MFCC features at the segmental level. Since prosody features are suprasegmental, they may detect the sentiment of a signal more accurately than MFCC features, which are computed for each frame, because an individual frame may not carry the sentiment.

Human beings depend on suprasegmental information such as prosody to perceive sentiment, but prosody features are not very dependable on their own because of the natural variations present in them. In the proposed approach, these natural variations are reduced by considering the relative prosody features, as can be observed from Table 3.4.

Relative prosody features are obtained by taking the difference between the prosody features in the stressed and normal regions. MFCC features are tested with the GMM classifier, while prosody and relative prosody features are tested with the SVM classifier; prosody and relative prosody features are not tested with the GMM classifier because of the limited data. Table 3.4 shows the sentiment classification performance with the different features. It is observed that features extracted from the whole signal perform worse than features extracted from the stressed and normal regions. It is also observed that with relative prosody features the rate of detecting the sentiment for the Spanish and Hindi databases is 83.3% and 75.0%, which outperforms the MFCC extracted from stressed regions by 15.5% and 9.8%, and the prosody features extracted from both regions by 8.3% and 7.2%, respectively. For the Spanish database, the proposed method outperforms [44] by 36.6%.

3.3.2 Sentiment analysis using text features

This section describes the process of extracting text features corresponding to an audio signal. These features are then used to build a classifier that identifies positive or negative sentiment. In a preprocessing step, each


audio input is transcribed manually and sentiment annotations are assigned. For each audio input, a 300-dimensional feature vector is extracted, a size that was found to give better results.

Many machine learning algorithms require the input to be represented as a fixed-length feature vector. The Doc2Vec model, an extension of the word2vec model, is used for associating documents with labels. Doc2vec extends word2vec to unsupervised learning of continuous representations for larger blocks of text such as sentences, paragraphs or whole documents; that is, Doc2vec learns to correlate labels and words rather than only words with other words. In the word2vec architecture, the two algorithms used are skip-gram and continuous bag of words, and in the doc2vec architecture the corresponding algorithms are distributed bag of words and distributed memory. All transcribed inputs are given to doc2vec, which generates a single vector representing the meaning of each document. This vector is used as input to a supervised machine learning algorithm to associate documents with labels. Text-based sentiment analysis can be viewed as a text classification task, which SVM handles well. The SVM classifier is trained with the vectors generated from doc2vec and the corresponding sentiment tags (positive/negative). Given a test input, the trained model classifies it as either positive or negative.

Table 3.5 Sentiment classification performance (in %) using text features.

    Classifier   Spanish   Hindi
    SVM          65.5      63.4

From Table 3.5, it is observed that the rate of detecting the sentiment from the Spanish and Hindi data using text features is 65.5% and 63.4%, respectively. Compared with [44], our system performs better at detecting sentiment using text features.

3.3.3 Multimodal sentiment analysis

The advantage of analysing audio, compared to its textual counterpart, is the availability of vocal modulation. Textual data contain only the words and their dependencies, which may be insufficient to convey the exact sentiment of the input, whereas audio data contain multiple streams, linguistic as well as acoustic. The final decision is hypothesized based on the highest average probability of the classifiers across both modalities. The simultaneous use of these two modalities helps to create a better sentiment analysis model for deciding whether a given test input carries positive or negative sentiment.

Table 3.6 presents the accuracy obtained by combining text and audio features using the proposed method. For the audio model, the performance on the Spanish reviews is improved by 36.5%, and for the text model by about 1%. By combining the audio and text models, the rate of detecting the sentiment of a Spanish audio input is 87.2%, an improvement of about 20% over [44]. The multimodal accuracy on the Hindi database is 81.4%.


Table 3.6 Sentiment classification performance (in %) for different models on the Spanish and Hindi datasets.

    Modality     Spanish [44]   Spanish (Proposed)   Hindi (Proposed)
    Audio        46.75          83.3                 75.0
    Text         64.94          65.5                 63.4
    Audio+Text   67.42          87.2                 81.4

3.4 Summary and Conclusions

In this chapter, we have proposed an approach to extract the sentiment of an input using both audio and text modalities. Instead of focusing on the entire input, experiments are performed on selective regions, which gives good accuracy because the entire input may contain positive, negative and neutral portions. For songs, experiments are performed on the beginning, the ending and the whole song; for reviews, stressed significant and normal regions are identified based on the strength of excitation.

Audio features such as chroma and spectral features, extracted at the beginning, the ending and from the whole song, are used to build the song sentiment classifiers (SVM and GMM). The beginning of a song outperforms the ending and the whole song in detecting sentiment, because the instruments and vocals that convey the sentiment at the beginning may or may not sustain throughout the song. Features such as MFCC, prosody (pitch, energy and duration ratio) and relative prosody extracted from the stressed and normal regions are used to detect the sentiment of an audio signal. Stressed regions of an input are detected based on the strength of excitation using the ZFF-based method. These features are then used to build sentiment models with GMM and SVM classifiers. From our experiments, it is observed that features extracted from selective regions of an input perform better than those from the entire input. The relative prosody features extracted from both regions detect the sentiment at a higher rate than MFCC and prosody features, because the natural variations present in the prosody features are reduced by the relative prosody features. The text features generated using Doc2Vec are used to build the SVM classifier. It is also observed that combining the audio and text modalities improves the rate of detecting the sentiment of an input.

In the next chapter, we describe how multimodal sentiment analysis is developed using deep neural networks.


Chapter 4

MULTIMODAL SENTIMENT ANALYSIS USING DEEP NEURAL NETWORKS

Recently, neural networks have achieved good success in sentiment classification. In this chapter, a neural network approach is presented to detect the sentiment of an online product review based on its multimodal nature (audio and text). We also present how different deep neural network classifiers are developed to detect the sentiment of reviews. For each audio input, Mel-frequency cepstral coefficient (MFCC) features are extracted. These features are used to develop sentiment models using Deep Neural Network (DNN) and Deep Neural Network with Attention Mechanism (DNNAM) classifiers. From the results, it is observed that the DNNAM classifier gives better results than the DNN, because the DNN approach is frame based whereas the DNNAM approach performs utterance-level classification, thereby making efficient use of the context. Further, textual features are extracted from the transcript of the audio input using the Doc2vec and Word2vec models. Support Vector Machine (SVM) and Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) classifiers are used to develop a sentiment model with the textual features. From the results, it is observed that the LSTM-RNN outperforms the SVM because the LSTM-RNN is able to memorize long temporal context. It is also observed that combining the audio and text features leads to a significant improvement in detecting the sentiment of online product reviews.

4.1 Sentiment Analysis using Audio Features

This section describes the process of extracting the audio features from the input and developing the sentiment model. The extracted features are used to build a classifier that labels the input as positive or negative sentiment. MFCC features are extracted from each input, and sentiment models are developed using Gaussian Mixture Model (GMM), DNN and DNNAM classifiers.


[Figure: MFCC input frames (1 frame, 5 frames combined, or 10 frames combined) → selection of feature dimension (13, 65 or 130) → DNN → sentiment]

Figure 4.1 Block diagram of sentiment analysis using a deep neural network.

4.1.1 Deep neural network

A deep neural network (DNN) is a neural network with multiple hidden layers of nodes between the input and output layers. These hidden layers perform feature identification and processing in a series of stages, with successive layers learning higher-level features. DNN performance depends on the amount of training data: the more training data, the more accurate the model. Each DNN is trained for 30 epochs with different numbers of layers and different numbers of nodes per layer; in our work, up to four deep layers are explored. All DNNs are trained with the Adam optimizer, an adaptive learning-rate method. The MFCC features considered in this study are 13-dimensional, 65-dimensional and 130-dimensional: five 13-dimensional MFCC frames are concatenated to get the 65-dimensional vectors and ten frames to get the 130-dimensional vectors. A single frame does not carry the sentiment, so experiments are done on combinations of frames, which results in better performance. Based on the input dimension, the input layer has 13, 65 or 130 linear nodes. The output layer is a softmax layer with 2 nodes because there are 2 classes. During testing, the node with the maximum value determines the claimed class, and only 5 seconds of data are considered for each input. The block diagram of sentiment analysis using a deep neural network is shown in Figure 4.1, and a minimal model sketch follows.
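The sketch below is a minimal Keras version of this frame-level DNN; the hidden-layer width is an assumption, while the input dimension (13, 65 or 130), the two-node softmax output, the Adam optimizer and the 30 epochs follow the text.

    # Sketch: frame-level DNN sentiment classifier (Keras); hidden sizes are illustrative.
    from tensorflow import keras

    def build_dnn(input_dim=65, hidden_layers=4, hidden_units=256):
        model = keras.Sequential([keras.layers.Input(shape=(input_dim,))])
        for _ in range(hidden_layers):
            model.add(keras.layers.Dense(hidden_units, activation="relu"))
        model.add(keras.layers.Dense(2, activation="softmax"))   # 2 classes: positive/negative
        model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model

    # X: (n_frames, 65) concatenated MFCC frames; y: 0/1 labels inherited from each utterance.
    # model = build_dnn(); model.fit(X, y, epochs=30, batch_size=256)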


Table 4.1 Performance (in %) of sentiment analysis using the deep neural network.

    Features     Hindi (2 layers)   Hindi (4 layers)   Spanish (2 layers)   Spanish (4 layers)
    13-D MFCC    41.6               58.3               50.0                 61.2
    65-D MFCC    58.3               75.0               66.7                 72.0
    130-D MFCC   58.3               66.7               61.2                 66.7

From Table 4.1, it is observed that the DNN with the 65-dimensional feature vector performs better than with the 130-dimensional one, because with 130-dimensional vectors there are too few feature vectors to train the DNN. It is also observed that performance is better with four layers. The rate of detecting the sentiment with the DNN classifier on the Hindi and Spanish reviews is 75.0% and 72.0%, respectively.

A DNN can have any number of hidden layers and any number of nodes in each layer, but a simple feedforward network cannot achieve good performance just by adding more layers, because training becomes less effective as the depth grows. Even though we combine frames, experiment with different numbers of layers and observe good performance, the drawback of the DNN system is that a decision is taken at each frame with a fixed context, while the label actually belongs to the entire utterance; the DNN cannot memorize long temporal context. To overcome these problems, the DNNAM architecture is proposed to capture the temporal context better and to perform utterance-wise sentiment classification.

4.1.2 Deep neural network attention mechanism

A DNNAM is a simple DNN augmented with an attention mechanism. The advantage of the DNNAM is that it can take the whole utterance into account and is also parallelizable, because the network is strictly feedforward. It is used for classifying the entire utterance rather than taking a frame-level decision as in the DNN. In addition, the attention mechanism focuses on the input feature frames that are most important for extracting the sentiment. In the DNNAM architecture, the attention is computed from the input feature vectors. The block diagram of the deep neural network attention model is shown in Figure 4.2.

Given an input sequence Y = {Y_1, Y_2, ..., Y_T}, a hidden layer sequence H = {h_1, h_2, ..., h_T} is computed by the regular DNN, and attention is computed on these hidden features. Using a single-layer perceptron, the attention mechanism a(h_t) shown in Figure 4.2 is computed, and then a softmax operation normalizes the values between zero and one.

H = [h_1 h_2 ... h_T] (4.1)

γ = tanh(W_a H + b_a) (4.2)

α = softmax(γ) (4.3)


Figure 4.2 Deep neural network attention model [48].

In the above equations, tanh is the hyperbolic tangent function and W_a, b_a are the parameters of the attention mechanism. These parameters are optimized along with the other parameters of the network using the backpropagation algorithm, and α is referred to as the attention vector. The vectors in the hidden state sequence h_t are fed into the attention mechanism a(h_t) to produce the probability vector α.

The context vector c shown in Figure 4.2 is computed as the weighted average of H with weights α:

c = Hα (4.4)

Then, the output vector is computed by transforming c with the output layer weights U, followed by a softmax operation:

Z = softmax(Uc + b_o) (4.5)

where b_o is the output layer bias. From this equation it follows that a single decision vector Z is predicted for the entire input utterance Y. A small sketch of this attention computation is given below.
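The sketch covers the forward pass of equations (4.1)-(4.5) only, in NumPy; the sizes and random initialization are illustrative assumptions.

    # Sketch: forward pass of the attention mechanism over DNN hidden states (NumPy).
    import numpy as np

    def softmax(v):
        e = np.exp(v - v.max())
        return e / e.sum()

    def attention_forward(H, Wa, ba, U, bo):
        # H: (d, T) hidden states h_1..h_T; Wa: (1, d) attention weights; U: (2, d) output weights.
        gamma = np.tanh(Wa @ H + ba)          # eq. (4.2), shape (1, T)
        alpha = softmax(gamma.ravel())        # eq. (4.3), attention vector over T frames
        c = H @ alpha                         # eq. (4.4), context vector of size d
        return softmax(U @ c + bo)            # eq. (4.5), class probabilities

    # Example with illustrative sizes: d = 64 hidden units, T = 200 frames.
    d, T = 64, 200
    Z = attention_forward(np.random.randn(d, T), np.random.randn(1, d), 0.0,
                          np.random.randn(2, d), np.zeros(2))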

The architecture can have any number of layers before or after the attention model; the number of layers is counted as the hidden layers before the context vector plus one additional output layer. The architecture is trained for 30 epochs with different numbers of layers and different numbers of nodes per layer; in our work, up to four deep layers are explored. All networks are trained with the Adam optimizer, an adaptive learning-rate method.


Table 4.2 Performance (in %) of sentiment analysis using the deep neural network attention mechanism.

    Features     Hindi (2 layers)   Hindi (4 layers)   Spanish (2 layers)   Spanish (4 layers)
    13-D MFCC    50.0               62.5               56.2                 68.7
    65-D MFCC    68.7               81.2               72.0                 77.4
    130-D MFCC   62.5               68.7               61.2                 75.0

Hyper-parameters are adjusted using a validation set. As for the DNN, 13-dimensional, 65-dimensional and 130-dimensional MFCC features are considered. Since a single frame cannot carry the sentiment, frames are concatenated, and combining frames is observed to improve the performance. The input layer has 13, 65 or 130 linear units, and the output layer is a softmax layer with 2 nodes because the databases contain 2 classes. During testing, the node with the maximum score determines the claimed class, and only 5 seconds of data are used for each input.

From Table 4.2, it is observed that the DNNAM with the 65-dimensional feature vector performs better than with the 130-dimensional one, because concatenating more frames leaves fewer feature vectors, which are not sufficient to train the network. It is also observed that the performance is higher with four layers. The rate of detecting the sentiment with the DNNAM classifier on the Hindi and Spanish reviews is 81.2% and 77.4%, respectively. Comparing Tables 4.1 and 4.2, the DNNAM with four layers outperforms the DNN because the hidden layers before the attention mechanism capture context and integrate information before the final decision is taken.

4.1.3 Gaussian mixture models

GMMs are well known to capture the distribution of data in the feature space. A Gaussian mixture density is a sum of M weighted component densities [50], given by

p(x_k | λ) = \sum_{r=1}^{M} w_r K_r(x_k), (4.6)

where x_k is an N-dimensional input vector, K_r(x_k), r = 1, ..., M, are the component densities and w_r, r = 1, ..., M, are the mixture weights. The product of a component Gaussian with its mixture weight, K_r(x_k) w_r, is the weighted component density, and the Gaussian mixture density is the sum of these weighted component densities. The accuracy in capturing the true distribution of the data depends on parameters such as the dimension of the feature vectors, the number of feature vectors, and the number of mixture components. In this work, the Expectation-Maximization (EM) algorithm is used to train the GMM models on the audio features. A short sketch of GMM-based classification follows.
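The sketch assumes scikit-learn and one GMM per class trained by EM; the 64-mixture setting follows the text, the rest is illustrative.

    # Sketch: per-class GMMs trained with EM (scikit-learn) and a likelihood-based decision.
    from sklearn.mixture import GaussianMixture

    def train_gmms(pos_feats, neg_feats, n_mix=64):
        # pos_feats / neg_feats: (n_frames, 13) MFCC matrices pooled over the training inputs.
        gmm_pos = GaussianMixture(n_components=n_mix, covariance_type="diag").fit(pos_feats)
        gmm_neg = GaussianMixture(n_components=n_mix, covariance_type="diag").fit(neg_feats)
        return gmm_pos, gmm_neg

    def classify(feats, gmm_pos, gmm_neg):
        # Average log-likelihood over the frames of one test input (e.g. 5 seconds).
        return "positive" if gmm_pos.score(feats) > gmm_neg.score(feats) else "negative"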


Table 4.3 Performance (in %) of sentiment analysis using different classifiers.

    Classifier            Hindi   Spanish
    GMM (64 mixtures)     58.3    66.4
    DNN (four layers)     75.0    72.0
    DNNAM (four layers)   81.2    77.4

The MFCC features extracted from each input audio signal are given as input to the GMM. The GMM is tested with different numbers of test cases and different numbers of mixtures (16, 32 and 64), and here too 5 seconds of data are used for testing. From Table 4.3, it is observed that the DNNAM with four hidden layers outperforms both the DNN with four hidden layers and the GMM with 64 mixtures.

4.2 Sentiment Analysis using Text Features

This section describes the process of developing the sentiment model using text features of the input. These features are used to build a classifier for positive or negative sentiment. In a preprocessing step, each audio input is manually transcribed and sentiment annotations are assigned manually.

Many machine learning algorithms require the input to be represented as a fixed-length feature vector, and Word2Vec and Doc2Vec are two models that produce such representations. Word2Vec takes data from a corpus and produces a vector for each word. These vectors represent individual words rather than the entire document, and similar words are placed near each other in the vector space; the length of the word vector is adjustable and independent of the size of the dictionary. In the Word2Vec architecture, the two algorithms used are continuous bag of words and skip-gram: continuous bag of words predicts the current word from a window of surrounding context words, whereas the continuous skip-gram model uses the current word to predict the surrounding window of context words. The Doc2Vec model, on the other hand, does not only represent words; it learns, in an unsupervised way, continuous representations for larger pieces of text such as sentences, paragraphs and whole documents. In the Doc2Vec architecture, the algorithms used are distributed memory and distributed bag of words. Distributed memory randomly initializes a paragraph vector for each document and predicts the next word using the context words and the paragraph vector; the context window slides across the document while the paragraph vector stays fixed. Distributed bag of words, in contrast, uses only paragraph vectors and no word vectors: it takes a window of words in a paragraph and randomly samples which one to predict from the paragraph vector (ignoring word ordering).

Two types of sentiment models are built for text classification: an SVM model and an LSTM-RNN model.


4.2.1 Support vector machine

The support vector machine classifier is intended to solve two-class classification problems. The main aim is to find linear separators in the search space that best separate the different classes. The basic principle of a support vector machine is that input vectors which are not linearly separable are transformed to a higher-dimensional feature space, where an optimal linear hyperplane is constructed to separate the classes. SVM classification is well suited to sparse input. An SVM [4] is a two-class classifier constructed from sums of kernel functions.

The SVM classifier is trained with the vectors generated from Doc2vec and the corresponding sentiment tags, positive or negative. Doc2Vec generates a single vector for each manually transcribed input, representing the meaning of the document; a 300-dimensional feature vector is generated from each text input, which gave better results. To associate documents with labels, this vector is used as input to a supervised machine learning algorithm such as SVM. The problem with SVM is that it does not model sequence information; it just uses the given features directly. To address this, experiments are also performed with an LSTM-RNN model.

4.2.2 LSTM-RNN model

A recurrent neural network (RNN) is a network of neurons with feedback connections. It can learn sequence-processing tasks that are not learnable by traditional machine learning methods such as feedforward networks and SVMs, which have no internal state at all. In spite of its advantages, the RNN suffers from the vanishing gradient problem, which is overcome by the LSTM-RNN. The LSTM-RNN prevents the backpropagated errors from vanishing, so the errors can flow backwards through any number of virtual layers unfolded in space. The LSTM-RNN model automatically learns a flexible history length and builds an abstracted feature representation. Compared to the SVM model, the LSTM-RNN has more parameters to be learned during training. The block diagram of the LSTM is shown in Figure 4.3.

Figure 4.3 LSTM Network.

The LSTM-RNN model [18] is a recurrent neural network with multiple hidden layers and a special memory unit. In the LSTM-RNN, information is stored in two ways: long-term memory as weights, which are modified by backpropagation, and short-term memory as the activations of the neurons, which capture the recent history. The LSTM-RNN can capture long dependencies in a sequence by introducing a memory unit and a gate mechanism that decides how to utilize and update the information kept in the memory cell. From Figure 4.3, each memory cell c_t has three gate units: the input gate i_t, the output gate o_t and the forget gate f_t. The input gate controls the flow of input activations into the memory cell, and the output gate controls the flow of cell activations into the rest of the network. The forget gate scales the internal state of the cell before adding it back as input through the cell's self-recurrent connection, thereby adaptively forgetting or resetting the cell memory. In addition, the LSTM-RNN architecture contains peephole connections from its internal cells to the gates in the same cell, to learn the precise timing of the outputs. An LSTM-RNN computes a mapping from an input sequence to an output sequence by calculating the following network unit activations:

i_t = σ(W_{ix} x_t + W_{im} m_{t-1} + W_{ic} c_{t-1} + b_i) (4.7)

f_t = σ(W_{fx} x_t + W_{fm} m_{t-1} + W_{fc} c_{t-1} + b_f) (4.8)

c_t = f_t ⊙ c_{t-1} + i_t ⊙ g(W_{cx} x_t + W_{cm} m_{t-1} + b_c) (4.9)

o_t = σ(W_{ox} x_t + W_{om} m_{t-1} + W_{oc} c_t + b_o) (4.10)

m_t = o_t ⊙ h(c_t) (4.11)

y_t = Φ(W_{ym} m_t + b_y) (4.12)

where the W terms denote weight matrices (e.g. W_{ix} is the matrix of weights from the input to the input gate), the b terms denote bias vectors (b_i is the input gate bias vector), σ is the logistic sigmoid function, and i, f, o and c are respectively the input gate, forget gate, output gate and cell activation vectors, all of which are the same size as the cell output activation vector m. ⊙ is the element-wise product of vectors, g and h are the cell input and cell output activation functions (generally tanh), and Φ is the network output activation function, which is softmax.


Table 4.4 Performance (in %) of sentiment analysis using text features.

    Classifier   Hindi   Spanish
    SVM          65.5    63.6
    LSTM-RNN     72.4    68.3


In this work, an LSTM network with 32 units and a single output neuron with a softmax activation function is used to make 0/1 predictions for the two classes, positive and negative. Because this is a binary classification problem, a log loss function is used and the network is optimized with the Adam optimizer. The model is fit over 100 epochs with a batch size of 256. Word2vec, rather than Doc2Vec, is used to obtain the initial word vectors for the LSTM-RNN: Doc2Vec already provides a summary vector for the whole sequence (a vector representing the sentence or document), whereas the RNN is more useful when it operates on sequences, so using Doc2Vec to train the RNN would offer no additional advantage. A minimal sketch of this text model is given below.
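The sketch uses gensim and Keras; the vocabulary handling, sequence length and embedding wiring are assumptions, while the 32 LSTM units, single output neuron, log loss, Adam optimizer, 100 epochs and batch size of 256 follow the text (a sigmoid is used here on the single output neuron, the usual pairing with binary log loss).

    # Sketch: Word2Vec-initialized LSTM text classifier (gensim + Keras); sizes partly assumed.
    import numpy as np
    from gensim.models import Word2Vec
    from tensorflow import keras

    def build_text_lstm(sentences, emb_dim=100, max_len=200):
        # sentences: list of token lists from the manual transcriptions.
        w2v = Word2Vec(sentences, vector_size=emb_dim, min_count=1)
        vocab = {w: i + 1 for i, w in enumerate(w2v.wv.index_to_key)}   # index 0 = padding
        emb = np.zeros((len(vocab) + 1, emb_dim))
        for w, i in vocab.items():
            emb[i] = w2v.wv[w]
        model = keras.Sequential([
            keras.layers.Input(shape=(max_len,), dtype="int32"),
            keras.layers.Embedding(len(vocab) + 1, emb_dim, trainable=False,
                                   embeddings_initializer=keras.initializers.Constant(emb)),
            keras.layers.LSTM(32),
            keras.layers.Dense(1, activation="sigmoid"),   # single output neuron, 0/1 prediction
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
        return model, vocab

    # X: padded word-index sequences of shape (n_inputs, max_len); y: 0/1 labels.
    # model.fit(X, y, epochs=100, batch_size=256)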

From Table 4.4, it is observed that for both databases the rate of detecting the sentiment of an input is higher with the LSTM-RNN than with the SVM.

4.3 Multimodal Sentiment Analysis

As sentiment classification with textual or audio features alone has its limitations, in our work we carry out multimodal sentiment analysis. Social media are a huge, largely untapped source of user opinion on various products and services. Multimodality entails the use of multiple media such as audio and video, in addition to text, to enhance the accuracy of sentiment analyzers.

The main advantage of analysing audio opinions, compared to their textual counterparts, is the availability of speech cues. Speech plays a nontrivial role in emotion recognition: it conveys emotional information through the explicit (linguistic) message as well as the implicit (paralinguistic) message. In textual opinions, the only source of information consists of the words and their dependencies, which may sometimes prove insufficient to convey the exact sentiment of the user, whereas audio opinions naturally contain both acoustic and linguistic data streams, and the vocal modulations in the recorded response help determine the tone of the speaker. So the simultaneous use of these two modalities helps to create a better sentiment analysis model, and our experiments also confirm that it better detects whether a given test input carries positive or negative sentiment. The block diagram of multimodal sentiment analysis is shown in Figure 4.4.


Figure 4.4 Block diagram of multimodal sentiment analysis.

Table 4.5 Performance (in %) of multimodal sentiment analysis.

    Modality       Hindi   Spanish
    Audio          81.2    77.4
    Text           72.4    68.3
    Audio + Text   85.6    82.3

From Table 4.5, it is observed that by combining the two modalities, text and audio, the rate of detecting the sentiment of product reviews is improved significantly in both databases compared to a single modality.

4.4 Summary and Conclusions

In this chapter, we proposed an approach to detect the sentiment of a given input using both audio and text information. MFCC features are extracted from the audio and sentiment models are built using DNNAM, DNN and GMM classifiers. The DNNAM and DNN are tested with different numbers of layers and nodes, whereas the GMM is tested with different numbers of mixture components and test cases. From our experiments, it is observed that the DNNAM classifier with 65-dimensional MFCC features detects the sentiment of an input more accurately than with the other two dimensions, and the DNNAM outperforms the DNN and GMM classifiers because the attention mechanism captures the context present in the review. For text, features generated using Doc2Vec and Word2Vec are used to build


the models using the SVM and LSTM-RNN classifiers. From the experimental studies, it is observed that the LSTM-RNN outperforms the SVM classifier because the SVM does not model sequence information. It is also observed that combining the audio and text modalities significantly improves the rate of detecting the sentiment.


Chapter 5

SUMMARY AND CONCLUSIONS

Sentiment analysis is a challenging field with many obstacles, as it involves natural language processing. It has become quite popular and has led to building better products, understanding users' opinions, and executing and managing business decisions. It has a wide variety of applications that benefit from its results, such as news analytics, marketing and question answering. The challenge in this field is to develop the machine's ability to understand texts as human readers do. Obtaining important insights from opinions expressed on the Internet, especially from social media and blogs, is vital for many companies and institutions, whether in terms of product feedback, public mood, or investors' opinions. Before buying a product, people also rely on reviews and opinions to make decisions.

The scarcity of resources is one of the biggest challenges in multimodal sentiment analysis for Indian languages. In this work, we build a Telugu songs dataset and a Hindi product reviews dataset. Based on inter-annotator agreement, the Telugu songs dataset has 110 songs (55 positive and 55 negative) and the Hindi product review dataset has 100 reviews (50 positive and 50 negative). For the songs, the lyrics are collected from the Internet; for the reviews database, the transcription and the positive/negative sentiment annotations are performed manually. We also work on a Spanish product review database which is available online.

While developing multimodal sentiment analysis, instead of taking the entire input and extracting a large number of features from a toolkit, we identify selective regions of the input and perform the experiments on those regions by extracting specific features.

For both the lyrics and the audio of the Telugu songs, experiments are performed on the beginning, the ending and the whole song. Experimental results show that the rate of detecting the sentiment of a song is higher for the beginning and ending than for the whole song, because the whole song carries a mix of attributes (happy and sad) that blurs the sentiment; using part of the song therefore increases the performance. The performance at the beginning is higher than at the ending because


the instruments and vocals which convey the sentiment at the beginning of the song may or may not sustain throughout the song. It is also observed that combining the audio and text modalities significantly improves the rate of detecting the sentiment.

The beginning and ending experiments do not work for the reviews, because in a review the opinion is not necessarily expressed at the beginning or at the ending. So, in order to extract the sentiment of an audio review, an algorithm is developed to detect the stressed and normal regions based on the strength of excitation, estimated with a zero-frequency filtering (ZFF) based method. MFCC, prosody and relative prosody features extracted from these regions are used to build sentiment models with GMM and SVM classifiers. From our experiments, it is observed that the performance on the selective regions is better than on the whole input. The relative prosody features extracted from the stressed and normal regions perform better than the MFCC and prosody features, because the natural variations present in the prosody features are reduced when they are expressed relative to the normal regions.
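The sketch below illustrates one way to obtain the strength of excitation from a ZFF-style signal and to mark the strongest epochs as stressed. It is a simplified reading of ZFF-based epoch extraction, not the exact algorithm used in the thesis; the 10 ms mean-removal window and the 30% threshold are assumptions.

```python
import numpy as np
from scipy.signal import lfilter

def stressed_regions(x, fs, mean_win_ms=10.0, top_fraction=0.3):
    """Simplified ZFF-style sketch: epochs at zero crossings of the trend-removed
    zero-frequency filtered signal; strength of excitation from its local slope."""
    s = np.diff(x.astype(np.float64), prepend=x[0])     # remove DC / low-frequency bias
    y = lfilter([1.0], [1.0, -2.0, 1.0], s)             # zero-frequency resonator
    y = lfilter([1.0], [1.0, -2.0, 1.0], y)             # cascaded (applied twice)
    win = int(mean_win_ms * 1e-3 * fs) | 1              # odd-length mean-removal window
    kernel = np.ones(win) / win
    for _ in range(3):                                  # repeated local-mean subtraction
        y = y - np.convolve(y, kernel, mode="same")
    epochs = np.where((y[:-1] < 0) & (y[1:] >= 0))[0]   # negative-to-positive crossings
    soe = np.abs(y[epochs + 1] - y[epochs])             # slope ~ strength of excitation
    if soe.size == 0:
        return epochs, epochs                           # no epochs found
    thresh = np.quantile(soe, 1.0 - top_fraction)       # top fraction marked as "stressed"
    return epochs[soe >= thresh], epochs[soe < thresh]  # (stressed, normal) epoch locations
```

The frames around the returned stressed and normal epochs are the regions from which the MFCC, prosody and relative prosody features would then be computed.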

For text, features generated using Doc2Vec are used to build sentiment models with an SVM classifier. It is also observed that combining the two modalities further improves the rate of detecting the sentiment.
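For the text side, a minimal Doc2Vec-plus-SVM sketch is shown below using gensim and scikit-learn; the toy reviews, the vector size and the training epochs are placeholders rather than the thesis configuration.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.svm import SVC

# Toy labelled reviews (placeholders), tokenised by whitespace.
reviews = [("this phone is really good", 1),
           ("battery life is very poor", 0),
           ("excellent camera and display", 1),
           ("worst purchase i ever made", 0)]

docs = [TaggedDocument(words=text.split(), tags=[i])
        for i, (text, _) in enumerate(reviews)]

d2v = Doc2Vec(vector_size=100, min_count=1, epochs=40)   # illustrative settings
d2v.build_vocab(docs)
d2v.train(docs, total_examples=d2v.corpus_count, epochs=d2v.epochs)

X = [d2v.dv[i] for i in range(len(reviews))]             # one vector per review (gensim 4.x)
y = [label for _, label in reviews]
clf = SVC(kernel="rbf").fit(X, y)

print(clf.predict([d2v.infer_vector("good battery and screen".split())]))
```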

As current research is moving towards deep architectures, in this work we also build sentiment models using deep neural network classifiers. The selective-regions concept does not work well with the deep neural networks because of the limited data, so we instead experiment with MFCC feature vectors of different dimensions: 13, 65 and 130. MFCC features are extracted from the audio, and sentiment models are built using DNNAM (a DNN with an attention mechanism), DNN and GMM classifiers. The DNNAM and DNN are tested with different numbers of layers and nodes, whereas the GMM is tested with different numbers of mixture components and different test cases. From our experiments, it is observed that the DNNAM classifier with 65-dimensional MFCC features detects the sentiment of an input more accurately than with the other dimensions, because a single frame cannot carry the sentiment. The DNNAM classifier also outperforms the DNN and GMM classifiers, because the attention mechanism captures the context present in the review.
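A compact sketch of such an attention-pooling DNN over frame-level features is given below in PyTorch; the layer sizes, the 65-dimensional input and the two output classes are illustrative assumptions, not the exact DNNAM architecture of the thesis.

```python
import torch
import torch.nn as nn

class FrameAttentionDNN(nn.Module):
    """Sketch of a frame-level DNN with attention pooling over MFCC frames."""
    def __init__(self, feat_dim=65, hidden=256, n_classes=2):
        super().__init__()
        self.frame_net = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, hidden), nn.ReLU())
        self.att_score = nn.Linear(hidden, 1)            # one scalar score per frame
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, x):                                # x: (batch, frames, feat_dim)
        h = self.frame_net(x)                            # per-frame hidden vectors
        alpha = torch.softmax(self.att_score(h), dim=1)  # attention weights over frames
        context = (alpha * h).sum(dim=1)                 # weighted sum -> utterance vector
        return self.classifier(context)                  # class logits

model = FrameAttentionDNN()
logits = model(torch.randn(4, 200, 65))                  # dummy batch: 4 inputs, 200 frames
```

The attention weights let frames that carry the sentiment dominate the pooled representation, which is the intuition behind the DNNAM result above.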

For text, features generated using Doc2Vec and Word2Vec are used to build models with SVM and LSTM-RNN classifiers. From the experimental results, it is observed that the LSTM-RNN outperforms the SVM classifier because the SVM does not model sequence information. It is also observed that combining the two modalities, audio and text, significantly improves the rate of detecting the sentiment.
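To make the sequence-modelling point concrete, a minimal LSTM-RNN text classifier over word-index sequences is sketched below; the vocabulary size, embedding dimension and hidden size are placeholders, and in practice the embedding layer could be initialised from pre-trained Word2Vec vectors.

```python
import torch
import torch.nn as nn

class LSTMSentiment(nn.Module):
    """Sketch of an LSTM-RNN sentiment classifier over word-index sequences."""
    def __init__(self, vocab_size=10000, emb_dim=100, hidden=128, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # could load Word2Vec weights here
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, tokens):                           # tokens: (batch, seq_len)
        e = self.embed(tokens)                           # (batch, seq_len, emb_dim)
        _, (h_n, _) = self.lstm(e)                       # final hidden state
        return self.out(h_n[-1])                         # class logits

model = LSTMSentiment()
logits = model(torch.randint(0, 10000, (4, 20)))         # dummy batch of 4 reviews
```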

The following are some directions for future work:

1. The databases need to be enlarged. In our work, the selective-regions concept is not applied to the deep neural network classifiers because of the limited data; the performance of deep neural networks depends on the amount of training data, and the more training data there is, the better the accuracy.

2. In this work, we focus only on two modalities, text and audio. The performance can be improved by combining these two modalities with the video modality.


3. The accuracy of text sentiment analysis can be improved by using rule-based and linguistic approaches.


RELATED PUBLICATIONS

PUBLISHED:

• Harika Abburi, Eswar Sai Akhil Akkireddy, Suryakanth V Gangashetty, and Radhika Mamidi.

“Multimodal Sentiment Analysis of Telugu Songs”. In Proc. 4th Workshop on Sentiment Analysis

where AI meets Psychology (SAAIP) co-located with 25th International Joint Conference on

Artificial Intelligence (IJCAI), Hilton, New York City, USA, pages 48-52, 2016.

• Harika Abburi, Manish Shrivastava, and Suryakanth V Gangashetty. “Improved Multimodal Sentiment Detection using Stressed Regions of Audio”. In Proc. International Conference TENCON, Marina Bay Sands, Singapore, pages 2834-2837, 2016.

• Harika Abburi, Manish Shrivastava, and Suryakanth V Gangashetty. “Multimodal Sentiment

Analysis using Deep Neural Network”. In Proc. Fourth International Conference on Mining

Intelligence and Knowledge Exploration (MIKE), Mexico, pages 58-65, 2016.

COMMUNICATED:

• Harika Abburi, K N R K Raju Alluri, Anil Kumar Vuppala, Manish Shrivastava, and Suryakanth

V Gangashetty. “Sentiment Analysis using Relative Prosody Features”. Submitted to Tenth In-

ternational Conference on Contemporary Computing (IC3), Noida, India, 2017.

