semantic web based sentiment engine


Upload: james-dellinger

Post on 27-Jan-2015

DESCRIPTION

Imperfect look at possible applications of Web Based Sentiment Engine MECB 2012.

TRANSCRIPT

Page 1: Semantic Web Based Sentiment Engine

Semantic Web Based Sentiment Engine

A system to determine online sentiment on current affairs for the purpose of analysis and prediction

CA652A

11210889 52595354

CA652A


ABSTRACT

Sentiment analysis involves classifying opinions expressed in text as "positive", "negative" or "neutral". Its purpose is to help extract valuable information and insight from large amounts of unstructured data. The proposed system will determine online sentiment on current affairs for the purpose of analysis and prediction. For the sentiment analysis itself, a clustering-based approach, a recent advancement in this area, is recommended. Various APIs will assist in extracting other data such as location and time. We recommend evaluating the system first with the Pang et al. movie review data sets, to validate basic functionality, and then with real-life data from the 2008 US presidential race, to evaluate the full functionality of the system. Multiple industries, from marketing companies to hotels, are identified as potential users, which adds to the commercialisation potential of the system.


A report submitted to Dublin City University, School of Computing for module

CA652: Information Access, 2011/2012.

We hereby certify that the work presented and the material contained herein is

my/our own except where explicitly stated references to other material are made

Student Numbers

52595354

11210889


TABLE OF CONTENTS

Abstract
Introduction
Concept Overview
Constraints and Limitations
Functional Description
Sentiment Search Functions
Techniques
Time Parameter Based Search
Geographical Extraction Based
Social Sentiment Extraction Based Data
Graphical Data Generation Tools
Pros & Cons of Proposed System
Evaluation Plan
Stage One Testing - Validation
Stage Two Testing – Functionality Testing
Stage Three Testing – Real Life Data
Commercialisation Potential
Conclusion and Further Research Opportunities
References


Table of Figures

Figure 1 - Sentiment Analysis framework
Figure 2 - Cluster Method Accuracy/Efficiency
Figure 3 - Graphical Representation of content
Figure 4 - Basic Validation Testing Results
Figure 5 - Two Topic Validation Testing
Figure 6 - Sample Test Output (Obama)
Figure 7 - Sample Test Data (McCain)


INTRODUCTION

The ‘media’ as we now conceptualise it has changed dramatically. With the internet, people have an opportunity to ‘weigh in’ on events by providing their opinions and feedback in real time through blogs, forums, social networks and commenting systems on news websites. There is a growing interest in measuring sentiment, which can be attributed to the dramatic increase in the volume of digitised information.

“An increasing number of studies in political communication focus on the “sentiment” or “tone” of news content, political speeches, or advertisements” (Young, L, & Soroka, S 2012).

This report discusses the concept of developing a Semantic Web based sentiment engine that will be able to analyse public sentiment on current issues, from politics to reality TV shows. By tracking popular opinion through social media channels and leveraging research in the area of sentiment analysis, accurate predictions could be made possible for events ranging from presidential elections to the X-Factor competition.

CONCEPT OVERVIEW

This proposed system is not a standard sentiment engine that returns static data; it offers increased functionality to assist with data interpretation. End users can customise their search, filter the returned data by multiple parameters and view graphical representations of the results.

CONSTRAINTS AND LIMITATIONS

The limitations of this concept are not due to technological constraints but to the volatility of public opinion, which cannot be remedied or corrected by technology.

Another limitation is the scope of the opinion being captured. Users of social media and participants in online forums are statistically younger. The lack of opinion from older age groups could greatly affect the accuracy of the data, as it would not be entirely representative. This imbalance would particularly affect political topics, since older groups are statistically more likely to vote.

FUNCTIONAL DESCRIPTION

SENTIMENT SEARCH FUNCTIONS

• Users can enter multiple search terms for the purpose of data comparison. Other features would be utilised to improve the analysis returns.
• Multiple Search Parameters
  o Time Frame Defined Search – Data retrieved can be limited to a specific time frame.
  o Geographical Location Based Search – Search data retrieved can be filtered by the location of users.
  o Narrow Search Scope – Select websites to exclude, or restrict the search to a small number of websites.
• Graphical representations of the data are generated.
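The search parameters listed above could be captured in a single request object. The following is a minimal sketch only: the class name `SentimentQuery`, its field names and the site names are hypothetical and are not part of the proposed system's design.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class SentimentQuery:
    """Hypothetical container for one search request (all names are illustrative)."""
    terms: List[str]                      # two or more terms allow comparison
    start: Optional[datetime] = None      # time frame defined search (lower bound)
    end: Optional[datetime] = None        # time frame defined search (upper bound)
    location: Optional[str] = None        # geographical location based search
    include_sites: List[str] = field(default_factory=list)  # restrict to these sites
    exclude_sites: List[str] = field(default_factory=list)  # never search these sites

    def allows_site(self, site: str) -> bool:
        """Apply the narrow-search-scope rules to a single site name."""
        if site in self.exclude_sites:
            return False
        # An empty include list means "search everywhere not excluded".
        return not self.include_sites or site in self.include_sites


query = SentimentQuery(terms=["Obama", "McCain"], location="Ohio",
                       exclude_sites=["example.org"])
print(query.allows_site("example.com"), query.allows_site("example.org"))
```

Keeping the filters on one object means the time, location and site restrictions can be combined or left empty independently.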

TECHNIQUES

Sentiment Analysis Techniques

There is much research in the area of sentiment analysis, with the primary objective of finding a technique that requires no trade-off between speed and accuracy. Several new and emerging techniques were reviewed to identify the best fit for this system.

• Proximity-Based Approach (Hasan, S, & Adjeroh, D 2011)
  o This proposed method uses proximity-based features to determine sentiment: proximity distribution, mutual information between proximity types, and proximity patterns.


• Based on Annotation (Shukla, A 2011)
  o This proposed method counts all the annotations present and calculates sentiment scores for all annotations, including comments, to determine sentiment.
• Sentence-level Lexical Based Semantic Orientation (Khan, A et al, 2011)
  o This proposed method uses SentiWordNet to calculate the semantic ‘score’ of sentences it has classified as subjective from reviews and blog comments.
• Machine Learning approach to contextual information (Yang, C et al, 2008)
  o This proposed method differentiates itself from others by taking context into account when determining the sentiment category. Its primary focus and test data sets have been blog posts. Figure 1 below shows the framework employed.

FIGURE 1 - SENTIMENT ANALYSIS FRAMEWORK

• Clustering-Based Sentiment Analysis Approach (Li, G, & Liu, F 2012)

The method deemed most appropriate for the proposed system is based on an article published in the Journal of Information Science in April 2012, which outlines the clustering-based sentiment analysis approach. It proposed that by applying a “TF-IDF weighting method, a voting mechanism and importing term scores, an acceptable and stable clustering result can be obtained” (Li, G, & Liu, F 2012). The evaluation results were the most impressive of all techniques reviewed as part of this research. The method appears to perform well in terms of both accuracy and efficiency with no need for human participation, as can be seen from Figure 2.

FIGURE 2 - CLUSTER METHOD ACCURACY/EFFICIENCY

Apart from its accuracy and efficiency, this technique was deemed the most suitable because it can be applied to any data set. The other techniques researched were developed for particular data types, such as customer reviews or blogs, and their evaluations suggest they do not perform as well outside of these data types.
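To make the TF-IDF weighting step more concrete, the sketch below weights four invented review snippets and assigns each one to the nearer of two seed documents by cosine similarity. It is a deliberately simplified illustration and not the Li & Liu method: the voting mechanism and imported term scores are omitted, and the nearest-seed assignment, function names and sample texts are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * math.log(n_docs / df[t])
                        for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def assign_to_seeds(vectors, seeds=(0, 1)):
    """Single-pass assignment: each vector joins its most similar seed document."""
    return [max(range(len(seeds)), key=lambda k: cosine(v, vectors[seeds[k]]))
            for v in vectors]


# Invented sample texts; the first two act as the cluster seeds.
reviews = [
    "great film great acting loved it",
    "terrible film awful acting hated it",
    "loved the great soundtrack",
    "awful plot hated the ending",
]
labels = assign_to_seeds(tfidf_vectors(reviews))
print(labels)
```

A full clustering method would iterate and recompute cluster centres; the single pass here is only meant to show how TF-IDF weights let documents with similar opinion vocabulary group together without human labelling.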

TIME PARAMETER BASED SEARCH

This sentiment engine would make use of the adaptable Librato API libraries to allow sentiment returns to be time sensitive. This would let a user evaluate how sentiment changes over time or what sentiment was during specific time periods.
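As an illustration of a time-sensitive return, the sketch below filters already-classified posts to a time window and counts sentiment labels per day. It uses only the Python standard library and does not call the Librato API; the function name, the `(timestamp, label)` input format and the sample posts are assumptions made for this example.

```python
from collections import Counter, defaultdict
from datetime import datetime


def sentiment_by_day(posts, start, end):
    """Count sentiment labels per calendar day for posts with start <= time < end."""
    daily = defaultdict(Counter)
    for timestamp, label in posts:
        if start <= timestamp < end:
            daily[timestamp.date()][label] += 1
    return dict(daily)


# Invented sample data: (timestamp, sentiment label) pairs.
posts = [
    (datetime(2008, 11, 3, 9, 30), "positive"),
    (datetime(2008, 11, 3, 18, 5), "negative"),
    (datetime(2008, 11, 4, 7, 45), "positive"),
    (datetime(2008, 11, 6, 12, 0), "neutral"),  # falls outside the window below
]
trend = sentiment_by_day(posts, datetime(2008, 11, 3), datetime(2008, 11, 5))
for day in sorted(trend):
    print(day, dict(trend[day]))
```

The same grouping could be done per week or per month by changing the key used for each post.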

GEOGRAPHICAL EXTRACTION BASED

Adding a geographical element would be a unique feature, allowing sentiment results to be mapped. Location data would preferably be pulled from the Twitter API, as it gives access to the location given in a Twitter profile. Some comment systems used by news websites, such as that of the Irish Times, request a location before a comment is posted. The Facebook API allows access to a user's location if the user's privacy settings permit it. OAuth would be used to allow users of the sentiment engine to explore the opinions of their friends and networked associates and see where these fit on the sentiment scales. Other free-to-use location APIs may also be needed.
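Once a location has been obtained for each item, mapping sentiment reduces to grouping by place. The sketch below assumes the location strings have already been retrieved and makes no calls to the Twitter or Facebook APIs; the record layout, the normalisation rule and the `"unknown"` bucket are assumptions made for this example.

```python
from collections import Counter, defaultdict


def sentiment_by_location(records, unknown="unknown"):
    """Group sentiment counts by a normalised location string."""
    totals = defaultdict(Counter)
    for record in records:
        # Profile locations are free text, so trim and normalise the case;
        # records with no location at all go into a separate bucket.
        place = (record.get("location") or "").strip().title() or unknown
        totals[place][record["sentiment"]] += 1
    return dict(totals)


# Invented sample records.
records = [
    {"location": "dublin", "sentiment": "positive"},
    {"location": " Dublin ", "sentiment": "negative"},
    {"location": None, "sentiment": "positive"},
    {"location": "Cork", "sentiment": "neutral"},
]
by_place = sentiment_by_location(records)
for place, counts in sorted(by_place.items()):
    print(place, dict(counts))
```

Keeping the unlocated items in their own bucket makes it visible how much of the data cannot be placed on a map.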


SOCIAL SENTIMENT EXTRACTION BASED DATA

The content used to create a matrix of information within which to evaluate sentiment via FLP would likely include, but not be limited to: Twitter; Disqus; Livefyre; IntenseDebate; Drupal comments; WordPress comments; other blog posts; scraped open Facebook and fan page comments; the Facebook comment system; text comments; G+ posts; Slideshare.net; Pinterest pins; Google News articles; comments on various bookmarking sites such as fark.com and reddit; and other language-relevant wire news services.

GRAPHICAL DATA GENERATION TOOLS

Graphical representations of the data are generated. The results could be rendered as web-based Flash objects or in a way that is compliant with the evolving HTML5 standards. iOS 5 compliance is also needed for results to be useful on mobile devices and tablets, given the animosity between Apple and Adobe over Flash. These reports would be exportable to Crystal Reports.

FIGURE 3 - GRAPHICAL REPRESENTATION OF CONTENT

PROS & CONS OF PROPOSED SYSTEM

The primary argument for sentiment engines built on the Semantic Web and linked data is the new information and insight that can be gleaned from them. Knowing relative and positional sentiment can be useful in many analytical or informational arbitrage situations.

[Figure 3: bar chart of the number of Positive, Neutral and Negative items (axis 0 to 1600) for Candidate A and Candidate B]


In terms of the cons, the primary concern would be data quality. Problems with data quality are a huge issue and can skew any resulting analysis. The extent of the data quality problem has often been discovered by information activists working in the open data movement.

Secondly, privacy concerns may at times be an issue, as the system must stay within the spirit and letter of the relevant data privacy laws of the regulatory regime it operates under. This can be tricky given the interconnected nature of the web.

Lastly, inaccuracies in the data, and its being organised in “short sets” rather than deeper data, may create false sentiments. Is there enough data being looked at to create a realistic positive or negative sentiment? Some analysis may need additional parsing to tease out, for example, initial heated emotional responses from the rational morning-after response.

EVALUATION PLAN

STAGE ONE TESTING - VALIDATION

The evaluation plan would begin with simple software validation. The first test case would validate the fundamental functionality of the system: its ability to differentiate between sentiments. The data set to be used is the movie review data from the Pang et al. experiments [1]. Movie review data is widely regarded as the most challenging data for sentiment engines to analyse. This can be attributed to the fact that a positive review may contain descriptions of gory or violent scenes, and equally a negative review could contain descriptions of light-hearted, pleasant scenes. For additional testing, other data sets could be used for each iteration of this dynamic testing stage.

[1] Pang B, Lee L, Vaithyanathan S. Thumbs up? Sentiment classification using machine learning techniques. In: Conference on empirical methods in natural language processing (EMNLP). Philadelphia, Pennsylvania, USA, 2002, p. 79.
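Stage one amounts to comparing the engine's output with the known labels of a reference data set. A minimal sketch of that comparison follows; the function name and the five sample labels are invented for illustration and are not results from the Pang et al. data.

```python
from collections import Counter


def evaluate(predicted, expected):
    """Return (accuracy, confusion counts keyed by (expected, predicted))."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("need two non-empty label lists of equal length")
    confusion = Counter(zip(expected, predicted))
    correct = sum(n for (gold, guess), n in confusion.items() if gold == guess)
    return correct / len(expected), confusion


# Invented labels standing in for a labelled review set and the engine's output.
expected = ["positive", "positive", "negative", "negative", "positive"]
predicted = ["positive", "negative", "negative", "negative", "positive"]
accuracy, confusion = evaluate(predicted, expected)
print(f"accuracy: {accuracy:.2f}")
print("positive reviews marked negative:", confusion[("positive", "negative")])
```

The confusion counts show which kind of error dominates, which matters here because positive reviews that describe violent scenes are the expected failure case.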

FIGURE 4 - BASIC VALIDATION TESTING RESULTS

[Figure 4: pie chart with segments of 20%, 41% and 39%; legend: Neutral, Positive, Negative]

STAGE TWO TESTING – FUNCTIONALITY TESTING

The second stage of testing would validate the multiple input functionality, ensuring that data can be retrieved for two or more search terms and that the terms can be accurately differentiated. The test case would build on the first stage of testing, with added content regarding a second movie.

FIGURE 5 - TWO TOPIC VALIDATION TESTING

[Figure 5: two pie charts. Schindler's List: segments of 20%, 41% and 39%. The Usual Suspects: segments of 21%, 59% and 20%. Legend for both: Neutral, Positive, Negative]

STAGE THREE TESTING – REAL LIFE DATA

The final stage of the evaluation plan would be to perform testing using previous high-profile events as the test cases, such as the US Presidential Election of 2008 and the X-Factor competition from previous years. This validation is more complex, as it will span the entire internet and not just the staging website.

The testing would be performed over different time intervals: days, weeks, months and the entire duration of the event. In the case of political elections, these time periods could be chosen to coincide with official opinion polls, for example Gallup and Rasmussen in the US or Red C for Irish events.

The geographical sentiment analysis function would be tested to gauge the accuracy of the location results. In the case of the US Presidential Election, the final voting percentages for each candidate per state would give an accurate basis for comparison.

SAMPLE EVALUATION TEST CASE

The test takes the ten states where each candidate won by the largest percentage majority and graphs the percentage of votes each candidate received alongside the percentage of positive, negative and neutral data regarding that candidate. In a fully evaluated system one would expect a close correlation between the positive data and the candidate's percentage of votes, and also a correlation between the negative or neutral data and the other candidate’s percentage of votes, as per the sample charts below for Obama and McCain respectively.
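The expected relationship can be checked numerically with a correlation coefficient. The sketch below computes Pearson's r between a candidate's vote share and the share of positive data per state; Pearson's r is one reasonable choice rather than something the plan prescribes, and the five pairs of figures are hypothetical, not real election or sentiment data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-state figures for one candidate (percentages).
vote_share = [72.0, 68.0, 65.0, 62.0, 60.0]
positive_share = [58.0, 55.0, 50.0, 49.0, 47.0]
r = pearson(vote_share, positive_share)
print(f"r = {r:.3f}")
```

A value of r close to 1 would support the expectation stated above; a value near 0 would suggest the positive data does not track the vote.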

FIGURE 6 - SAMPLE TEST OUTPUT (OBAMA)

[Figure 6: chart titled “Obama’s Data”, axis 0 to 90; series: Obama's Percentage of Votes, McCain's Percentage of Votes, Positive %, Negative %, Neutral %]


FIGURE 7 - SAMPLE TEST DATA (MCCAIN)

[Figure 7: chart titled “McCain’s Data”, axis 0 to 70; series: McCain's Percentage of Votes, Obama's Percentage of Votes, Positive %, Negative %, Neutral %]

COMMERCIALISATION POTENTIAL

In an era where both businesses and individuals are moving further and further towards data-driven decisions, sentiment engine products have a range of commercial potential.

Some companies have already begun commercialising Semantic Web applications. One example is IBM's licensing of its WebFountain internet analytics engine to Factiva and Thomson Reuters (2003) for clients interested in corporate reputational data.

There is also a market among people doing market research who cannot afford Enterprise Resource Planning (ERP) add-ons such as SAP Business Objects, SAS or LexisNexis Analytics, and for whom the currently available crop of free semantic sentiment engine tools is insufficient, too niche or unscalable (Basu, 2010). Semantic Web products are becoming important in internal and external business informatics.

However, information arbitrage is not merely for professional market traders. This system would likely be offered as software as a service (SaaS) on the web. It could be sold on a freemium basis, by monthly subscription or by yearly licence, depending on the implementation.

Primary clients would depend on the sentiments needing to be parsed and the proprietary and public data sets used within the sentiment engine. Examples include: corporate media; the content publishing industry; PR firms; polling; market research firms; trading platforms; political parties; elections; government agencies; security services; and bookmakers deciding odds on novelty bets such as reality TV shows and politics.

CONCLUSION AND FURTHER RESEARCH OPPORTUNITIES

Where does the Semantic Web lead, exactly? We don’t really know, but opening up the segregated data silos and making sense of deeper, dark ‘big data’ in pursuit of the benefits of a deeper rooted “hyperdata” would be a nice path. The road will be long, but it may improve our day-to-day lives immensely.

"Many applications and services claim to be "semantic" in one manner or another, but that does not mean they are "Semantic Web." Semantic applications include any applications that can make sense of meaning, particularly in language such as unstructured text, or structured data in some cases. By this definition, all search engines today are somewhat "semantic" but few would qualify as "Semantic Web" apps." (Spivack, 2007)

How we get from the early steps of Web 3.0 to this deeper data web will be a long process. It will provide countless benefits, many of which we may not even perceive today. Sentiment engines are merely one way to get the public and the developer community interested in and excited about all the other benefits that this open data future could hold. For that reason sentiment engines will remain an important component in the near-term future, as “big data” holds much of the future promise of bringing about the “web of things” and making sense and use of it.


REFERENCES

Abbasi, A, Chen, H, & Salem, A 2008, 'Sentiment Analysis in Multiple Languages: Feature Selection for Opinion Classification in Web Forums', ACM Transactions On Information Systems, 26, 3, pp. 1-34, Computers & Applied Sciences Complete, viewed 4 May 2012.

Basu, Saikat 2010. 10 Web Tools To Try Out Sentiment Search & Feel the Pulse. Make Use Of [Online] 30 April. http://www.makeuseof.com/tag/10-web-tools-sentiment-search-feel-pulse/ [Accessed 1 May 2012]

Bergman, Mike 2010. I Have Yet to Metadata I Didn’t Like. AI3 [Online] 16 August. http://www.mkbergman.com/902/i-have-yet-to-metadata-i-didnt-like/ [Accessed 1 May 2012]

Bollen, J, Mao, Huina, & Zeng, Xiao-Jun 2011. Twitter mood predicts the stock market. Journal of Computational Science, 2(1), pp. 1-8. Available from: http://arxiv.org/abs/1010.3003

Cai, K, Spangler, S, Ying, C, & Li, Z 2010, 'Leveraging sentiment analysis for topic detection', Web Intelligence & Agent Systems, 8, 3, pp. 291-302, Academic Search Complete, viewed 20 April 2012.

Dalton, Jeff 2007. Caffè Java Open Source NLP and Text Mining tools. Jeff's Search Engine Caffé [Online] 16 March. http://www.searchenginecaffe.com/2007/03/java-open-source-text-mining-and.html [Accessed 1 May 2012]

Hamouda, A, Marei, M, & Rohaim, M 2011, 'Building Machine Learning Based Senti-word Lexicon for Sentiment Analysis', Journal Of Advances In Information Technology, 2, 4, pp. 199-203, Library, Information Science & Technology Abstracts with Full Text, viewed 1 May 2012.

Hasan, S, & Adjeroh, D 2011, 'Detecting Human Sentiment from Text using a Proximity-Based Approach', Journal Of Digital Information Management, 9, 5, pp. 206-212, Library, Information Science & Technology Abstracts with Full Text, viewed 7 May 2012.

Kang, H, Yoo, S, & Han, D 2012, 'Senti-lexicon and improved Naïve Bayes algorithms for sentiment analysis of restaurant reviews', Expert Systems With Applications, 39, 5, pp. 6000-6010, Academic Search Complete, viewed 10 April 2012.

Lévy, Pierre CRC, FRSC 2007. Elements of Semantic Engineering. I3 workshop / WWW Consortium Conference / Banff 2007. Available from: http://www.ieml.org/text/semantic_space.pdf

Li, G, & Liu, F 2012, 'Application of a clustering method on sentiment analysis', Journal Of Information Science, 38, 2, pp. 127-139, Business Source Complete, viewed 21 April 2012.

Pang B, Lee L, Vaithyanathan S. Thumbs up? Sentiment classification using machine learning techniques. In: Conference on empirical methods in natural language processing (EMNLP). Philadelphia, Pennsylvania, USA, 2002, p. 79.

Shukla, A 2011, 'Sentiment Analysis of Document Based on Annotation', International Journal Of Web & Semantic Technology, 2, 4, pp. 91-103, Computers & Applied Sciences Complete, viewed 6 May 2012.

Spivack, Nova 2007. The Semantic Web, Collective Intelligence and Hyperdata. novaspivack.typepad.com [Online] 18 September. http://novaspivack.typepad.com/nova_spivacks_weblog/2007/09/hyperdata.html [Accessed 1 May 2012]

Vishwanath, J, & Aishwarya, S 2011, 'User Suggestions Extraction from customer Reviews: A Sentiment Analysis approach', International Journal On Computer Science & Engineering, 3, 3, pp. 1203-1206, Academic Search Complete, viewed 1 May 2012.

Yang, C, Lin, K, & Chen, H 2008, 'Sentiment Analysis in Weblog Using Contextual Information: A Machine Learning Approach', International Journal Of Computer Processing Of Languages, 21, 4, pp. 331-345, Academic Search Complete, viewed 27 April 2012.

Young, L, & Soroka, S 2012, 'Affective News: The Automated Coding of Sentiment in Political Texts', Political Communication, 29, 2, pp. 205-231, Academic Search Complete, viewed 10 May 2012.