sentiment analysis bigdata project


Upload: pramodbdvt

Post on 02-Oct-2015

DESCRIPTION

A simple illustration of the use of Big Data and data mining with the RapidMiner tool.

TRANSCRIPT

  • Data Sentiment Analysis

    CHAPTER 1

    INTRODUCTION

    The amount of raw data stored in corporate databases is exploding. In today's fiercely competitive business environment, companies need to rapidly turn these raw data into significant insights. Data mining, or knowledge discovery, is the computer-assisted process of digging through and analyzing enormous sets of data and then extracting the meaning of the data. Data mining tools predict behaviors and future trends, allowing businesses to make proactive, knowledge-driven decisions.

    The analysis of large volumes of data with data mining methods is generally regarded as a field for specialists. These specialists build more or less complex analysis processes, often with very expensive software, to predict, for example, imminent notices of cancellation or the sales figures of a product. The economic benefit is obvious, so for a long time it was assumed that using data mining software inevitably meant high license costs and, because of the complexity of the subject matter, costly support.

    The arrival of the open-source software RapidMiner showed that software solutions for data mining do not have to be expensive or difficult to use.

    Several data mining tools are available as open source, such as Weka, Orange, Rattle and KNIME. Within this group, RapidMiner stands out for its efficient performance. Today RapidMiner is the worldwide leading open-source data mining solution due to the combination of its leading-edge technologies and its functional range. Applications of RapidMiner cover a wide range of real-world data mining tasks.

    1.1 The Tool

    RapidMiner is licensed under the GNU Affero General Public License version 3 and is currently available in version 5.3. It contains more than 500 operators for all tasks of professional data analysis, i.e. operators for input and output as well as data processing (ETL), modeling and other aspects of data mining. Methods for text mining, web mining, automatic sentiment analysis of Internet discussion forums (opinion mining) as well as time series analysis and prediction are also available to the analyst.

    In addition, RapidMiner contains more than 20 methods for visualizing high-dimensional data and models. Moreover, all learning methods and weighting factors of the Weka toolbox have been fully integrated into RapidMiner, so the complete range of functions of Weka, which is equally widespread in research at the moment, is added to the already enormous range of functions of RapidMiner.

    CHAPTER 2

    2.2 Installation

    Download the appropriate installation package for your operating system and install RapidMiner according to the instructions on the website. All usual Windows versions are supported as well as Macintosh, Linux and UNIX systems. The download is available from http://www.rapid-i.com.

    2.3 Perspectives and views

    When you open RapidMiner, you are greeted by the so-called Welcome Perspective. The upper section shows typical actions which you as an analyst will perform frequently after starting RapidMiner:

    1. New: Starts a new analysis process. First you must define a location and a name within the process and data repository; then you can start designing a new process.
    2. Open Recent: Opens the process which is selected in the list below the actions. Alternatively, you can open this process by double-clicking it in the list. Either way, RapidMiner then automatically switches to the Design Perspective.
    3. Open: Opens the repository browser and allows you to select a process to be opened within the Design Perspective.
    4. Open Template: Shows a selection of pre-defined analysis processes, which can be configured in a few clicks.
    5. Online Tutorial: Starts a tutorial which can be used directly within RapidMiner and gives an introduction to some data mining concepts using a selection of analysis processes.

    Figure 1: Welcome Perspective of RapidMiner.

    You will find an icon for each perspective in the right-hand area of the toolbar:

    Figure 2: Toolbar Icons for Perspectives

    The icons shown here take you to the following perspectives:

    1. Design Perspective: This is the central RapidMiner perspective, where all analysis processes are created and managed.
    2. Result Perspective: If a process supplies results in the form of data or models, RapidMiner takes you to the Result Perspective, where you can look at several results at the same time.
    3. Welcome Perspective: The perspective described above, which RapidMiner shows after the program starts.

    You can switch to the desired perspective by clicking its icon in the toolbar or, alternatively, via the menu entry "View" - "Perspectives" followed by the selection of the target perspective. RapidMiner may also ask you automatically whether to switch to another perspective when this seems a good idea, e.g. to the Result Perspective on completing an analysis process.

    Design Perspective

    Since the Design Perspective is the central working environment of RapidMiner, the following sections discuss each of its parts separately, along with the fundamental functionality of the associated views. There are two very central views in this area, at least in the standard setting.

    Figure 3: Design Perspective of RapidMiner

    Operators View

    All work steps (operators) available in RapidMiner are presented in groups here and can be included in the current process. You can navigate within the groups and browse the operators provided as you wish. If RapidMiner has been extended with one of the available extensions, the additional operators can also be found here.

    Without extensions you will find at least the following groups of operators in the tree structure:

    1. Process Control: Operators such as loops or conditional branches which can control the process flow.
    2. Utility: Auxiliary operators which, alongside the operator "Subprocess" for grouping subprocesses, also contain the important macro operators as well as the operators for logging.
    3. Repository Access: Contains the two operators for read and write access in repositories.
    4. Import: Contains a large number of operators for reading data and objects from external formats such as files, databases etc.
    5. Export: Contains a large number of operators for writing data and objects into external formats such as files, databases etc.
    6. Data Transformation: Probably the most important group in the analysis in terms of size and relevance. All operators for transforming both data and meta data are located here.
    7. Modeling: Contains the actual data mining methods, such as classification methods, regression methods, clustering, weightings, methods for association rules, correlation and similarity analyses, as well as operators for applying the generated models to new data sets.
    8. Evaluation: Operators for computing the quality of a model and thus its expected performance on new data, e.g. cross-validation, bootstrapping etc.

    You can select operators within the Operators View and add them at the desired place in the process by simply dragging and dropping.

    Repositories View

    The repository is a central component of RapidMiner which was introduced in Version 5. It serves to manage and structure your analysis processes into projects and, at the same time, acts as a source of both data and the associated meta data.

    Process View

    The Process View shows the individual steps within the analysis process as well as their interconnections.

    Inserting Operators

    You can insert new operators into the process in different ways:

    1. Via drag and drop from the Operators View, as described above,
    2. Via double-click on an operator in the Operators View,
    3. Via the dialog which is opened by the first icon in the toolbar of the Process View,
    4. Via the dialog which is opened by the menu entry "Edit" - "New Operator..." (CTRL-I),
    5. Via the context menu in a free area of the white process area, using the submenu "New Operator" and selecting an operator.

    Parameters View

    Numerous operators require one or several parameters to be set in order to function correctly. For example, operators that read data from files require the file path. Note that some parameters are only shown when other parameters have a certain value. For example, an absolute number of desired examples can only be set for the operator "sampling" when "absolute" has been selected as the type of sampling.

    Help and Comment View

    Each time you select an operator in the Operators View or in the Process View, the help window within the Help View shows a description of this operator. These descriptions include:

    1. A short synopsis which summarizes the function of the operator in one or a few sentences,
    2. A detailed description of the functionality of the operator,
    3. A list of all parameters, including a short description of each parameter, the default value (if available), an indication as to whether the parameter is an expert parameter, and an indication of parameter dependencies.

    Comment View

    Unlike Help, the Comment View is not dedicated to pre-defined descriptions but rather to your own comments on individual steps of the process. Simply select an operator and write any text about it in the comment field. The comment is saved together with your process definition and can be useful for tracing individual steps in the design later on.

    Problems and Log View

    A further central element and valuable source of help during the design of your analysis processes is the Problems View. Any warnings and error messages are clearly listed in a table here. The first column, named "Message", gives a short summary of the problem. The last column, named "Location", shows where the problem arises, in the form of the operator name and the name of the input port concerned. A considerable innovation of RapidMiner 5, however, is that it can also suggest solutions for such problems and implement them directly. These solutions are called quick fixes. The second column gives an overview of the possible solutions, either directly as text if there is only one, or as an indication of how many different ways exist to solve the problem.

    Log View

    During the design, and in particular during the execution of processes, numerous messages are written to the log. Particularly in the event of an error, they can indicate how the error can be eliminated by changing the process design. You can copy the text within the Log View as usual and process it further in other applications. You can also save the text to a file, delete the entire contents or search through the text using the actions in the toolbar.

    CHAPTER 3

    SYSTEM REQUIREMENTS

    Hardware Requirements:

    Processor : Pentium 4
    Memory Size : 1 GB RAM
    Storage : 80 GB Hard Disk
    Display : EGA/VGA Color Monitor, 600x800 Pixels Resolution, High Color (16 Bit)
    Keyboard : Any with minimum required Keys

    Software Requirements:

    Operating System : Windows XP and above, Linux, Mac
    Java : Java SE 1.6 and above

    CHAPTER 4

    Data Sentiment Analysis with RapidMiner

    Sentiment analysis, or opinion mining, is an application of text analytics that identifies and extracts subjective information from source materials. A basic task in sentiment analysis is classifying the opinion expressed in a document, a sentence or an entity feature as positive or negative.

    The example presented here takes a list of movies and labels the reviews of each one as Positive or Negative. The resulting model is evaluated using precision and recall. Precision is the probability that a (randomly selected) retrieved document is relevant. Recall is the probability that a (randomly selected) relevant document is retrieved in a search. In other words, high recall means that an algorithm returned most of the relevant results, while high precision means that an algorithm returned more relevant results than irrelevant ones.
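
The two measures defined above can be computed directly from a list of true labels and a list of predicted labels. The following is a minimal Python sketch, not RapidMiner's own implementation; the function name and the sample labels are made up for illustration, and "Positive" is assumed to be the class of interest.

```python
def precision_recall(actual, predicted, positive="Positive"):
    """Return (precision, recall) for the class named by `positive`."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == positive and p == positive)
    false_pos = sum(1 for a, p in pairs if a != positive and p == positive)
    false_neg = sum(1 for a, p in pairs if a == positive and p != positive)
    # Guard against division by zero when nothing was predicted or nothing is relevant.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Illustrative labels only, not data from the project.
actual = ["Positive", "Positive", "Negative", "Negative", "Positive"]
predicted = ["Positive", "Negative", "Negative", "Positive", "Positive"]
precision, recall = precision_recall(actual, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the three reviews predicted as Positive really are positive, and two of the three truly positive reviews were found, so both values are 2/3.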

    First, both positive and negative reviews of a movie are collected. All words are stemmed to their root forms and stored under their polarity (positive or negative). A vector wordlist and a model are then created. Next, the list of movies to be classified is given as input. The model compares every word in the reviews of those movies with the words stored earlier under each polarity, and the review is classified according to the polarity under which the majority of its words fall.

    For example, the reviews of Django Unchained are compared with the vector wordlist created at the beginning. Most of their words fall under the positive polarity, so the outcome is Positive. A Negative outcome is reached in the same way.
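
The majority rule described above can be sketched in a few lines of Python. This mirrors only the simplified description in the text; the model actually trained later in this chapter is a linear SVM, and the wordlist, tokens and function name below are hypothetical.

```python
from collections import Counter


def classify_by_polarity(tokens, polarity_wordlist):
    """Label a review by the polarity under which most of its known words fall."""
    votes = Counter(polarity_wordlist[t] for t in tokens if t in polarity_wordlist)
    if not votes:
        return None  # no known word, so no decision is possible
    label, _ = votes.most_common(1)[0]
    return label


# Hypothetical stemmed wordlist with one polarity per word.
wordlist = {
    "brilliant": "Positive",
    "grip": "Positive",
    "dull": "Negative",
    "bore": "Negative",
}
review_tokens = ["brilliant", "grip", "stori", "dull"]
outcome = classify_by_polarity(review_tokens, wordlist)
print(outcome)
```

Words that are missing from the wordlist (here "stori") simply do not vote. How ties should be resolved is not specified in the text, so this sketch leaves that to the counting order.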

    The first step in implementing this analysis is processing the documents, i.e. extracting the positive and negative reviews of a movie and storing them under different polarities.

    The model is shown in Figure 1.

    Figure 1.

    Under the Process Document operator, click Edit List on the right. Load the positive and negative reviews under the class names "Positive" and "Negative", as shown in Figure 2.

    Figure 2.

    Nested operations take place inside the Process Document operator: tokenizing the words, filtering out stop words, stemming the words to their root forms, and keeping only tokens between 4 and 25 characters long, as shown in Figure 3.

    Figure 3.
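
The four nested steps can be approximated outside RapidMiner as follows. This is a rough Python sketch under stated assumptions: the stop word list is a tiny hypothetical sample, and `naive_stem` is a crude suffix stripper standing in for a real stemming algorithm, so its output will differ from what the RapidMiner operators produce.

```python
import re

# Hypothetical, deliberately tiny stop word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "is", "was", "of", "to", "in", "it", "this"}

# Crude suffix stripping, used here only as a stand-in for a real stemmer.
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, min_len=4, max_len=25):
    tokens = re.findall(r"[a-z]+", text.lower())                # 1. tokenize on non-letters
    tokens = [t for t in tokens if t not in STOP_WORDS]          # 2. filter stop words
    tokens = [naive_stem(t) for t in tokens]                     # 3. stem to a root form
    return [t for t in tokens if min_len <= len(t) <= max_len]   # 4. keep 4 to 25 characters


tokens = preprocess("The acting was amazing and the ending is surprisingly moving")
print(tokens)
```

Because the length filter runs after stemming, as in the order given above, short stems such as "act" and "end" are dropped from the result.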


    Two further operators are then used, Store and Validation, as shown in Figure 1. The Store operator writes the word vector to a file and directory of our choosing. The Validation operator performs cross-validation, a standard way to assess the accuracy and validity of a statistical model. The data set is divided into two parts, a training set and a test set. The model is trained on the training set only and its accuracy is evaluated on the test set. This is repeated n times.
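
The repeated train/test split can be illustrated with a small Python sketch of k-fold partitioning. It is only an assumption about how the n repetitions are formed: the function name is hypothetical, the folds are taken in order without shuffling or stratification, and RapidMiner's Validation operator offers its own sampling options that are not reproduced here.

```python
def k_fold_splits(n_examples, k):
    """Yield (train_indices, test_indices) for each of k consecutive folds."""
    indices = list(range(n_examples))
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


splits = list(k_fold_splits(10, 3))
for train, test in splits:
    print("train:", train, "test:", test)
```

Each example lands in the test set exactly once, so every review is used for evaluation without ever being scored by a model that was trained on it.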

    Double-click the Validation operator. There are two panels, Training and Testing. In the Training panel, a linear Support Vector Machine (SVM) is used, a popular classifier whose decision function is a linear combination of all the input variables. To test the model, we use the Apply Model operator to apply the trained model to the test set. To measure the accuracy of the model, we use the Performance operator.

    The operations under Validation are shown in Figure 4.

    Figure 4.
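
To make the phrase "linear combination of all the input variables" concrete, the following Python sketch scores one review vector with an already trained linear model. The weights, bias and input vector are invented for illustration, and the training step that an SVM uses to find them is not shown.

```python
def linear_decision(weights, bias, features):
    """Return w . x + b, a linear combination of the input variables."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def predict(weights, bias, features):
    """Map the sign of the decision value to a class label."""
    return "Positive" if linear_decision(weights, bias, features) >= 0 else "Negative"


weights = [1.5, -2.0, 0.5]       # hypothetical learned weights, one per wordlist entry
bias = -0.25                     # hypothetical learned bias
review_vector = [1.0, 0.0, 1.0]  # hypothetical word vector of one review

score = linear_decision(weights, bias, review_vector)
label = predict(weights, bias, review_vector)
print(score, label)
```

Words with positive weights push a review toward Positive and words with negative weights push it toward Negative; the sign of the sum decides the class.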


    Then run the model. The resulting class recall and precision percentages are shown in Figure 5. The model and the vector wordlist are stored in a repository.

    Figure 5.

    Next, retrieve both the model and the vector wordlist stored in the repository earlier. Connect the output of the retrieved wordlist to the Process Document operator, as shown in Figure 6. The operations under the Process Document operator are the same as those shown in Figure 3.

    Figure 6.

    Then click the Process Document operator and click Edit List on the right. This time I have added a list of 5 movie reviews from the Rotten Tomatoes website, stored in a directory. Assign the class name Unlabeled, as shown in Figure 7.

    Figure 7.

    The Apply Model operator takes the model from a Retrieve operator and the unlabeled data from the Process Document operator as input and delivers the labeled data at its lab port, so connect that port to the res (results) port. The result is shown in Figure 8. For Les Miserables, for example, the model reports 86.4% confidence that the reviews are positive and 13.6% that they are negative, because the reviews match the wordlist under positive polarity more strongly than the one under negative polarity.

    Figure 8.
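
Reading the result comes down to picking the class with the higher confidence. The short Python sketch below uses the two confidence values quoted above for Les Miserables; the function name is hypothetical, and how RapidMiner derives the confidences from the model is not shown.

```python
def predicted_label(confidences):
    """Return the class name with the highest confidence."""
    return max(confidences, key=confidences.get)


# Confidence values quoted in the text for Les Miserables.
confidences = {"Positive": 0.864, "Negative": 0.136}
label = predicted_label(confidences)
print(label)
```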


    CHAPTER 5

    COMPARISON

    | Procedure | KNIME | RapidMiner | Weka | TANAGRA | Orange |
    |---|---|---|---|---|---|
    | Partitioning of dataset into training and testing sets | Pass (but limited partitioning methods) | Pass (but limited partitioning methods) | Pass (but limited partitioning methods) | Pass (but limited partitioning methods) | Pass (but limited partitioning methods) |
    | Descriptor scaling | Pass | Pass | Fail (cannot save parameters for scaling to apply to future datasets) | Fail (cannot save parameters for scaling to apply to future datasets) | Fail (no scaling methods) |
    | Descriptor selection | Fail (no wrapper methods) | Pass | Pass (but is not part of KnowledgeFlow) | Fail (wrapper methods valid for logistic regression only) | Fail (no wrapper methods) |
    | Parameter optimization of machine learning/statistical methods | Fail (not automatic) | Pass | Fail (not automatic) | Fail (not automatic) | Fail (not automatic) |
    | Model validation using cross-validation and/or independent validation set | Pass (but limited error measurement methods) | Pass | Pass (but cannot save model so have to rebuild model for every future dataset) | Fail (cannot validate independent validation set) | Pass (but cannot save model so have to rebuild model for every future dataset) |

    Table 1.

    CHAPTER 6

    ADVANTAGES AND DISADVANTAGES

    Advantages

    • The free version offers enough functionality for a small business to avoid the big-name commercial options.
    • It is a quality tool, given its ranking among the commercial products.
    • The GUI is very user friendly and is used to define data mining operators that are stored in XML files.
    • XML standardization is great for utilizing various data sources.
    • It is easy to use, and tutorials are available.
    • It works on any operating system.

    Disadvantages

    • Some options are not available in the free product, although you can upgrade.
    • Less customer service may be available for the free version.
    • There can be some restrictions on customized use.
    • Beginners may have some difficulty understanding the tool.