sentiment analysis bigdata project

Data Sentiment Analysis

CHAPTER 1

INTRODUCTION

The amount of raw data stored in corporate databases is exploding. In today's fiercely competitive business environment, companies need to rapidly turn these raw data into significant insights. Data mining, or knowledge discovery, is the computer-assisted process of digging through and analyzing enormous sets of data and then extracting the meaning of the data. Data mining tools predict behaviors and future trends, allowing businesses to make proactive, knowledge-driven decisions.

The analysis of large volumes of data with data mining methods is generally regarded as a field for specialists. They create more or less complex analysis processes, often with shockingly expensive software solutions, for example to predict the imminent handing in of notices or the sales figures of a product. The economic benefit is obvious, and so it was thought for a long time that the use of data mining software products was necessarily associated with high software license costs and with the support often needed due to the complexity of the subject matter.

At the latest since the open source software RapidMiner was developed, nobody could seriously doubt that software solutions for data mining do not have to be expensive or difficult to use.

There are different open source data mining tools, such as Weka, Orange, Rattle and KNIME. In this group RapidMiner stands out with its efficient performance. Today RapidMiner is the worldwide leading open source data mining solution due to the combination of its leading-edge technologies and its functional range. Applications of RapidMiner cover a wide range of real-world data mining tasks.



1.1 The Tool

RapidMiner is licensed under the GNU Affero General Public License version 3 and is currently available in version 5.3. RapidMiner contains more than 500 operators altogether for all tasks of professional data analysis, i.e. operators for input and output as well as data processing (ETL), modeling and other aspects of data mining. Methods of text mining, web mining, automatic sentiment analysis from Internet discussion forums (sentiment analysis, opinion mining) as well as time series analysis and prediction are also available to the analyst.

In addition, RapidMiner contains more than 20 methods for visualizing high-dimensional data and models. Moreover, all learning methods and weighting factors of the Weka Toolbox have been completely and smoothly integrated into RapidMiner, meaning that the complete range of functions of Weka, which is equally widespread in research at the moment, also joins the already enormous range of functions of RapidMiner.



CHAPTER 2

2.2 Installation

Download the appropriate installation package for your operating system and install RapidMiner according to the instructions on the website. All usual Windows versions are supported as well as Macintosh, Linux or UNIX systems. The download is available from http://www.rapid-i.com.

2.3 Perspectives and views

When you open RapidMiner you will be welcomed into the so-called Welcome Perspective. The upper section shows typical actions which you as an analyst will perform frequently after starting RapidMiner. Here are the details of these:

1. New: Starts a new analysis process. First you must define a location and a name within the process and data repository and then you will be able to start designing a new process.

2. Open Recent: Opens the process which is selected in the list below the actions. Alternatively, you can also open this process by double-clicking inside the list. Either way, RapidMiner will then automatically switch to the Design Perspective.

3. Open: Opens the repository browser and allows you to select a process to be opened within the process Design Perspective.

4. Open Template: Shows a selection of different pre-defined analysis processes, which can be configured in a few clicks.

5. Online Tutorial: Starts a tutorial which can be used directly within RapidMiner and gives an introduction to some data mining concepts using a selection of analysis processes.



Figure 1: Welcome Perspective of RapidMiner.

We will find an icon for each perspective within the right-hand area of the toolbar:

Figure 2: Toolbar Icons for Perspectives



The icons shown here take you to the following perspectives:

1. Design Perspective: This is the central RapidMiner perspective where all analysis processes are created and managed.

2. Result Perspective: If a process supplies results in the form of data or models then RapidMiner takes you to this Result Perspective, where you can look at several results at the same time.

3. Welcome Perspective: The Welcome Perspective already described above, which RapidMiner welcomes you with after starting the program.

You can switch to the desired perspective by clicking inside the toolbar or alternatively via the menu entry "View" - "Perspectives" followed by the selection of the target perspective. RapidMiner may also ask you automatically whether to switch to another perspective if this seems a good idea, e.g. to the Result Perspective on completing an analysis process.

Design Perspective

Since the Design Perspective is the central working environment of RapidMiner, we will discuss all parts of the Design Perspective separately in the following and discuss the fundamental functionalities of the associated views. There are two very central views in this area, at least in the standard setting.



Figure 3: Design Perspective of RapidMiner

Operators View

All work steps (operators) available in RapidMiner are presented in groups here and can therefore be included in the current process. You can navigate within the groups in a simple manner and browse in the operators provided to your heart's desire. If RapidMiner has been extended with one of the available extensions, then the additional operators can also be found here.



Without extensions you will find at least the following groups of operators in the tree structure:

1. Process Control: Operators such as loops or conditional branches which can control the process flow.

2. Utility: Auxiliary operators which, alongside the operator "Subprocess" for grouping subprocesses, also contain the important macro operators as well as the operators for logging.

3. Repository Access: Contains the two operators for read and write access in repositories.

4. Import: Contains a large number of operators for reading data and objects from external formats such as files, databases etc.

5. Export: Contains a large number of operators for writing data and objects into external formats such as files, databases etc.

6. Data Transformation: Probably the most important group in the analysis in terms of size and relevance. All operators for transforming both data and meta data are located here.

7. Modeling: Contains the actual data mining methods such as classification methods, regression methods, clustering, weightings, methods for association rules, correlation and similarity analyses as well as operators for applying the generated models to new data sets.

8. Evaluation: Operators for computing the quality of a model and thus its expected performance on new data, e.g. cross-validation, bootstrapping etc.

You can select operators within the Operators View and add them in the desired place in the process by simply dragging and dropping.

Repositories View



The repository is a central component of RapidMiner which was introduced in Version 5. It is used to manage and structure your analysis processes into projects, and at the same time it serves as a source of both data and the associated meta data.

Process View

The Process View shows the individual steps within the analysis process as well as their interconnections.

Inserting Operators

You can insert new operators into the process in different ways. Here are the details of the different ways:

1. Via drag & drop from the Operators View as described above,

2. Via double click on an operator in the Operators View,

3. Via dialog which is opened by means of the first icon in the toolbar of the Process View,

4. Via dialog which is opened by means of the menu entry "Edit" - "New Operator..." (CTRL-I),

5. Via context menu in a free area of the white process area and there via the submenu "New Operator" and the selection of an operator.

Parameters View



Numerous operators require one or several parameters to be specified in order to function correctly. For example, operators that read data from files require the file path to be specified. Note that some parameters are only shown when other parameters have a certain value. For example, an absolute number of desired examples can only be specified for the operator "Sampling" when "absolute" has been selected as the type of sampling.

Help and Comment View

Each time you select an operator in the Operators View or in the Process View, the help window within the Help View shows a description of this operator. These descriptions include

1. A short synopsis which summarizes the function of the operator in one or a few sentences,

2. A detailed description of the functionality of the operator,

3. A list of all parameters including a short description of the parameter, the default value (if available), the indication as to whether this parameter is an expert parameter as well as an indication of parameter dependencies.

Comment View

Unlike Help, the Comment View is not dedicated to pre-defined descriptions but rather to your own comments on individual steps of the process. Simply select an operator and write any text on it in the comment field. This will then be saved together with your process definition and can be useful for tracing individual steps in the design later on.

Problems and Log View


A further central element and valuable source of help during the design of your analysis processes is the Problems View. Any warnings and error messages are clearly indicated in a table here. In the first column, named "Message", you will find a short summary of the problem. The last column, named "Location", shows you the place where the problem arises in the form of the operator name and the name of the input port concerned. A considerable innovation of RapidMiner 5, however, is the possibility of also suggesting solutions for such problems and of implementing them directly. These solution methods are called quick fixes. The second column gives an overview of such possible solutions, either directly as text if there is only one possible solution or as an indication of how many different possibilities exist to solve the problem.

Log View

During the design, and in particular during the execution of processes, numerous messages are written at the same time and can provide information, particularly in the event of an error, as to how the error can be eliminated by a changed process design. You can copy the text within the Log View as usual and process it further in other applications. You can also save the text in a file, delete the entire contents or search through the text using the actions in the toolbar.



CHAPTER 3

SYSTEM REQUIREMENTS

Hardware Requirements:

Processor : Pentium 4

Memory Size : 1 GB RAM

Storage : 80 GB Hard Disk

Display : EGA/VGA Color Monitor, 600x800 Pixels Resolution, High Color (16 Bit)

Keyboard : Any with minimum required Keys

Software Requirements:

Operating System : Windows XP and above, Linux, Mac

Java : Java SE 1.6 and above



CHAPTER 4

Data Sentiment Analysis with RapidMiner

Sentiment analysis, or opinion mining, is an application of text analytics to identify and extract subjective information in source materials.

A basic task in sentiment analysis is classifying an expressed opinion in a document, a sentence or an entity feature as positive or negative.

The example presented here takes a list of movies and classifies the reviews of each movie as Positive or Negative. The quality of the classification is reported as precision and recall. Precision is the probability that a (randomly selected) retrieved document is relevant. Recall is the probability that a (randomly selected) relevant document is retrieved in a search. In other words, high recall means that an algorithm returned most of the relevant results, and high precision means that an algorithm returned more relevant results than irrelevant ones.
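The two measures can be illustrated with a minimal Python sketch. It is not part of the RapidMiner process; the function name `precision_recall` and the five example labels are assumptions made up for illustration, with "Positive" treated as the relevant class.

```python
def precision_recall(actual, predicted, positive="Positive"):
    """Return (precision, recall) for the class named by `positive`."""
    pairs = list(zip(actual, predicted))
    # True positives: labeled positive and predicted positive.
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    # False positives: predicted positive although labeled otherwise.
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    # False negatives: labeled positive but predicted otherwise.
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels for five reviews (illustration only).
actual = ["Positive", "Positive", "Positive", "Negative", "Negative"]
predicted = ["Positive", "Positive", "Negative", "Positive", "Negative"]

precision, recall = precision_recall(actual, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the three reviews predicted as Positive are really positive, and two of the three truly positive reviews were found, so both measures are 2/3.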

At first, both positive and negative reviews of a certain movie are taken. All of the words are stemmed into root words. Then the words are stored under their polarity (positive or negative). Both a word vector (wordlist) and a model are created. Then the required list of movies is given as input. The model compares every word of the reviews for the given movies with the words stored earlier under each polarity. A movie review is classified according to the polarity under which the majority of its words occur.

For example, for Django Unchained, the reviews are compared with the vector wordlist created at the beginning. The highest number of words falls under positive polarity, so the outcome is Positive. The same applies to a Negative outcome.
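The majority idea described above can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the RapidMiner process itself, which trains a model on word vectors; the function names and the two tiny training reviews are hypothetical.

```python
from collections import Counter


def build_wordlists(labeled_reviews):
    """Map each polarity to a Counter of the words seen in its reviews."""
    wordlists = {}
    for polarity, text in labeled_reviews:
        wordlists.setdefault(polarity, Counter()).update(text.lower().split())
    return wordlists


def classify(review, wordlists):
    """Return the polarity matching the most words of the review, plus all scores."""
    words = review.lower().split()
    scores = {
        polarity: sum(1 for word in words if word in counts)
        for polarity, counts in wordlists.items()
    }
    # On a tie, the polarity that was stored first wins.
    return max(scores, key=scores.get), scores


# Hypothetical training reviews (illustration only).
training = [
    ("Positive", "brilliant gripping superb acting"),
    ("Negative", "dull boring weak plot"),
]
wordlists = build_wordlists(training)

label, scores = classify("superb acting but weak ending", wordlists)
print(label, scores)
```

In this toy input two words of the review occur under positive polarity and one under negative polarity, so the review is labeled Positive.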



The first step in implementing this analysis is processing the documents from data, i.e. extracting the positive and negative reviews of a movie and storing them under different polarities.


The model is shown in Figure 1.

Figure 1.



In the Process Document operator, click on Edit List on the right. Load the positive and negative reviews under the class names "Positive" and "Negative" respectively, as shown in Figure 2.

Figure 2.



Inside the Process Document operator, nested operations take place: tokenizing the words, filtering the stop words, stemming the words into root words and filtering the tokens to a length between 4 and 25 characters, as shown in Figure 3.
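The four nested steps can be sketched in Python as follows. This only mirrors the order of the steps; the stop word list is a tiny made-up sample and `naive_stem` is a toy suffix stripper standing in for a real stemming algorithm, so the output will differ from what RapidMiner produces.

```python
import re

# Tiny illustrative stop word list (assumption; not RapidMiner's list).
STOP_WORDS = {"the", "a", "an", "and", "is", "was", "of", "in", "it", "this", "with"}


def naive_stem(token):
    """Toy suffix stripper (assumption), standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, min_len=4, max_len=25):
    """Tokenize, remove stop words, stem, then filter tokens by length."""
    tokens = re.findall(r"[a-z]+", text.lower())         # 1. tokenize on non-letters
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 2. filter stop words
    tokens = [naive_stem(t) for t in tokens]             # 3. stem to root words
    return [t for t in tokens if min_len <= len(t) <= max_len]  # 4. filter by length


print(preprocess("The acting was amazing and the ending surprised us"))
```

Because the length filter runs after stemming, short stems such as "act" and "end" are dropped in this sketch, leaving only the longer stems.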

Figure 3.



Then two operators are used, Store and Validation, as shown in Figure 1. The Store operator is used to output the word vector to a file and directory of our choosing. The Validation operator performs cross-validation, which is a standard way to assess the accuracy and validity of a statistical model. Our data set is divided into two parts, a training set and a test set. The model is trained on the training set only and its accuracy is evaluated on the test set. This is repeated n times.

Double-click on the Validation operator. There will be two panels, Training and Testing. Under the Training panel, a linear Support Vector Machine (SVM) is used, which is a popular classifier whose decision function is a linear combination of all the input variables. In order to test the model, we use the Apply Model operator to apply the model trained on the training set to our test set. To measure the model accuracy we use the Performance operator.
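The repeated train/test procedure can be sketched in plain Python. This is a minimal illustration of k-fold cross-validation, not the Validation operator's implementation; the function names are hypothetical, the number of folds is assumed to be no larger than the number of examples, and the stand-in learner simply predicts the most frequent training label instead of training an SVM.

```python
from collections import Counter


def k_fold_indices(n_examples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_examples))
    for fold in range(k):
        test = indices[fold::k]  # every k-th example belongs to this fold
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


def cross_validate(examples, labels, k, train_fn, predict_fn):
    """Train on k-1 folds, test on the remaining fold, return the mean accuracy."""
    accuracies = []
    for train, test in k_fold_indices(len(examples), k):
        model = train_fn([examples[i] for i in train], [labels[i] for i in train])
        correct = sum(predict_fn(model, examples[i]) == labels[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k


# Stand-in learner (assumption): always predicts the most frequent training label.
def train_majority(train_examples, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, example):
    return model


# Hypothetical data: six reviews, five positive and one negative.
examples = [f"review {i}" for i in range(6)]
labels = ["Positive"] * 5 + ["Negative"]

accuracy = cross_validate(examples, labels, 3, train_majority, predict_majority)
print(f"mean accuracy over 3 folds: {accuracy:.3f}")
```

Each example is used for testing exactly once, and the model evaluated on a fold never sees that fold during training, which is the property that makes the accuracy estimate meaningful.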

The operations under Validation are shown in Figure 4.

Figure 4.



Then run the model. The resulting class recall and class precision percentages are shown in Figure 5. The model and the vector wordlist are stored in a Repository.

Figure 5.



Then retrieve both the model and the vector wordlist from the Repository in which you stored them earlier. Connect the output of the Retrieve operator for the wordlist to the Process Document operator, as shown in Figure 6. The operations inside Process Document are the same as those shown in Figure 3.

Figure 6.



Then click on the Process Document operator and click Edit List on the right. This time I have added a list of 5 movie reviews from the Rotten Tomatoes website, stored in a directory. Assign the class name Unlabeled, as shown in Figure 7.

Figure 7.



The Apply Model operator takes a model from a Retrieve operator and unlabeled data from the Process Document operator as input and outputs the labeled data at the lab port, so connect that to the res (results) port. The result is shown below. For Les Miserables, the model is 86.4% confident that the review is positive and 13.6% confident that it is negative, because the reviews match more words of the wordlist under positive polarity than under negative polarity.
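How a predicted label follows from the per-class confidences can be shown with a short Python sketch. The function name `predict_label` is hypothetical; the two confidence values are the ones reported for Les Miserables above.

```python
def predict_label(confidences):
    """Return the class with the highest confidence, plus that confidence."""
    label = max(confidences, key=confidences.get)
    return label, confidences[label]


# Confidences reported in the text for Les Miserables.
les_miserables = {"Positive": 0.864, "Negative": 0.136}

label, confidence = predict_label(les_miserables)
print(f"{label} ({confidence:.1%})")  # prints "Positive (86.4%)"
```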

Figure 8.



CHAPTER 5

COMPARISON

Tools compared : KNIME, RapidMiner, Weka, TANAGRA, Orange

Procedure : Partitioning of dataset into training and testing sets

Pass (but limited partitioning methods)

Pass (but limited partitioning methods)

Procedure : Descriptor scaling

KNIME : Pass

RapidMiner : Pass

Weka : Fail (cannot save parameters for scaling to apply to future datasets)

TANAGRA : Fail (cannot save parameters for scaling to apply to future datasets)

Orange : Fail (no scaling methods)

Procedure : Descriptor selection

KNIME : Fail (no wrapper methods)

RapidMiner : Pass

Weka : Pass (but is not part of KnowledgeFlow)

TANAGRA : Fail (wrapper methods valid for logistic regression only)

Orange : Fail (no wrapper methods)

Procedure : Parameter optimization of machine learning/statistical methods

KNIME : Fail (not automatic)

RapidMiner : Pass

Weka : Fail (not automatic)

TANAGRA : Fail (not automatic)

Orange : Fail (not automatic)

Procedure : Model validation using cross-validation and/or independent validation set

KNIME : Pass (but limited error measurement methods)

RapidMiner : Pass

Weka : Pass (but cannot save model so have to rebuild model for every future dataset)

TANAGRA : Fail (cannot validate independent validation set)

Orange : Pass (but cannot save model so have to rebuild model for every future dataset)

Table 1.



CHAPTER 6

ADVANTAGES AND DISADVANTAGES

Advantages

The free version has adequate resources for a small business to avoid the big-name options

It is a quality tool, given its ranking among the other commercial products

The GUI is very user friendly

The GUI is used to create data mining operators in XML files

XML standardization is great for utilizing various data sources

Ease of use and available tutorials

Works on any operating system

Disadvantages

Some options are not available in the free product, but you can upgrade

Possibly less customer service available for the free version

There can be some restrictions on customized use

Beginners may face some difficulty in understanding the tool
