RapidMiner Tutorial

Overview: RapidMiner is an open-source environment for data mining and machine learning. It can be used to extract meaning from a dataset. There are hundreds of machine learning operators to choose from, along with helpful pre- and post-processing operators, descriptive graphic visualizations, and many other features. The environment has a steep learning curve, especially for someone without a background in data mining. This is an example-based tutorial that works through some common data mining tasks in RapidMiner.

Classification with Cross Validation (Two Classes)

Summary: In this experiment, we will import a dataset and train a support vector machine model. We will use cross validation to evaluate the accuracy of the learned model. Cross validation works by using part of the data to train the model and the rest of the dataset to test the accuracy of the trained model. In this case, we will divide the dataset into 10 parts and train and test once for each part. For example, if we have 100 data vectors, we train the model with the last ninety and then test the accuracy of the trained model with the first ten; we then repeat this for the other nine partitions. This dataset is for breast cancer prediction. It is a binomial classification where each subject is either malignant or benign.
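If it helps to see the scheme in code, here is a minimal sketch of 10-fold cross validation in Python with scikit-learn. It is only an illustration, not part of RapidMiner, and it uses the breast cancer dataset bundled with scikit-learn (the diagnostic variant, not the exact file we import below) purely as a stand-in.

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    # Stand-in data: scikit-learn's bundled breast cancer set, not our CSV.
    X, y = load_breast_cancer(return_X_y=True)

    scores = cross_val_score(SVC(), X, y, cv=10)   # ten train/test rounds
    print(scores.mean(), scores.std())             # average accuracy and spread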

Import the data: Datasets found online come in a variety of formats. I like to start by using MS Excel to import the data. Below are the first 9 lines from the “breast-cancer-wisconsin” data file, which is our dataset. According to the accompanying NAMES file, the attributes from left to right are the id number, clump thickness, uniformity of cell size, …, and classification (2 for benign, 4 for malignant). You can see that the first five examples are benign and the 6th is malignant.

1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
1015425,3,1,1,1,2,2,3,1,1,2
1016277,6,8,8,1,3,4,3,7,1,2
1017023,4,1,1,3,2,1,3,1,1,2
1017122,8,10,10,8,7,10,9,7,1,4
1018099,1,1,1,1,2,10,3,1,1,2
1018561,2,1,2,1,2,1,3,1,1,2
1033078,2,1,1,1,2,1,1,1,5,2

1. Click the Data tab in MS Excel.


2. Click From Text on the Get External Data tab.

3. Browse to the “breast-cancer-wisconsin” data file.

4. Text Import Wizard

a. Select Delimited and click Next.
b. Select Comma and click Next.
c. Finish importing with defaults.

5. Insert a new row at the top and type in the data headings that are listed in the NAMES file.

6. Save this file in your RapidMiner workspace as a .csv file.
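If you would rather skip Excel, steps 1-6 can also be done with a short pandas script. This is only a sketch: the file name is the one I downloaded, and the column names are my own shorthand for the headings in the NAMES file, so adjust both to match your files.

    import pandas as pd

    # Shorthand column names based on the NAMES file; use your own headings.
    columns = ["id", "clump_thickness", "cell_size_uniformity",
               "cell_shape_uniformity", "marginal_adhesion",
               "epithelial_cell_size", "bare_nuclei", "bland_chromatin",
               "normal_nucleoli", "mitoses", "classification"]

    df = pd.read_csv("breast-cancer-wisconsin.data", header=None, names=columns)
    df.to_csv("breast-cancer-wisconsin.csv", index=False)  # save with a header row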


7. Open a new project in RapidMiner. The operator tree will have only the Root node.

8. Right click on the root node, select New Operator, IO, Examples, ExampleSource.

9. Click on ExampleSource and then click on Start Configuration Wizard.

10. Import the .csv file using the first row for column names.


11. This wizard assigns attribute value types according to what it thinks they are; sometimes you will need to change these. Here, the only one to change is Classification, which should be binomial.

12. Next, specify the special attributes. All of them are preselected as attribute. Change the id column to id, and change Classification to label.

13. Navigate to your breast cancer workspace and save the description file.

Congratulations, the dataset is now in RapidMiner’s learning environment.


Cross Validation: 1. Right click on Root, New Operator, Validation, XValidation.

2. Right click on XValidation, Operator Info. Use this catalog to find out how and why an operator is used. In this case, XValidation requires two inner operators – training and testing. We will train a support vector machine model and then test this model.

3. Training
a. Right click on XValidation, New Operator, Learner, Supervised, Functions, LibSVMLearner. LibSVMLearner takes an ExampleSet as input and produces a Model as output. This is exactly what we need for the first (training) operator for XValidation.


4. Testing
a. Right click on XValidation, New Operator, OperatorChain.
b. Right click on OperatorChain, New Operator, ModelApplier.
c. Right click on OperatorChain, New Operator, Validation, BinomialClassificationPerformance. This takes an ExampleSet as input and gives a PerformanceVector as output. Again, this is required by the XValidation operator. Under the parameters tab, select some criteria on which to measure performance.
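To make the roles of the two inner operators concrete, here is the same structure written out by hand in scikit-learn terms. X is assumed to be a NumPy array of the nine numeric attributes and y an array of the 2/4 labels; this is only a sketch of the idea, not what RapidMiner runs internally.

    import numpy as np
    from sklearn.metrics import precision_score
    from sklearn.model_selection import KFold
    from sklearn.svm import SVC

    def xvalidate(X, y):
        precisions = []
        for train_idx, test_idx in KFold(n_splits=10).split(X):
            model = SVC().fit(X[train_idx], y[train_idx])    # training operator
            pred = model.predict(X[test_idx])                # ModelApplier
            precisions.append(precision_score(y[test_idx], pred,
                                              pos_label=4))  # performance measure
        return np.mean(precisions)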


Pre-Processing: 1. Run the experiment by clicking the blue arrow. In the results view, notice that the precision was about 34%. The model simply predicted 4 every time. The problem was that the dataset had some “?” symbols where values were unavailable or unknown. RapidMiner did not know how to process this since it was expecting an integer.

2. Go back to the edit mode. You can do this by clicking the notepad icon in the upper right corner or by pressing F9.

3. Right click on Root, New Operator, Preprocessing, Attributes, Filter, AttributeFilter.

4. In the operator tree, click and drag AttributeFilter right below ExampleSource. In other words, we want to import our dataset and then perform some preprocessing on it before we send it through the cross validation process.

5. Click on AttributeFilter and select the parameter numeric_value_filter. In the parameter_string field, type >= 0 with a space between = and 0. (A pandas sketch of the same cleanup appears after step 6.)


6. Click the blue run arrow again and it should work with about 94% precision.
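For reference, the cleanup in step 5 can be sketched in pandas as well: read “?” as a missing value and keep only complete rows. Dropping the incomplete rows is my assumption of what the filter achieves here, and the column name follows the hypothetical shorthand from the earlier import sketch.

    import pandas as pd

    df = pd.read_csv("breast-cancer-wisconsin.csv", na_values="?")
    df = df.dropna()                                   # discard rows with unknown values
    df["bare_nuclei"] = df["bare_nuclei"].astype(int)  # restore the integer type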

Save the process: This is a good idea so that you can continue to go back and make changes to your experiment.

Replace Operator: How does the Support Vector Machine compare with a Decision Tree? Simply right click on LibSVMLearner, Replace Operator, Learner, Supervised, Function, DecisionTree. Run the experiment and investigate the results.
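The same comparison can be sketched in scikit-learn terms: keep the cross validation fixed and swap only the learner. File and column names follow the earlier hypothetical sketches.

    import pandas as pd
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    df = pd.read_csv("breast-cancer-wisconsin.csv", na_values="?").dropna()
    X = df.drop(columns=["id", "classification"])
    y = df["classification"]

    # Same 10-fold setup, two different learners.
    for learner in (SVC(), DecisionTreeClassifier()):
        scores = cross_val_score(learner, X, y, cv=10)
        print(type(learner).__name__, round(scores.mean(), 3))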

Multi-Class Classification with Cross Validation

Summary: In this experiment, we will import a dataset that tries to predict the age of an abalone from its physical measurements. The age can range from 1 to 29 years. We will change the prediction to three ranges: 1-8 years, 9-10 years, and 11-29 years. Since there are three classes, we will not be able to use the SVM as it is. We will do some preprocessing so that it can handle three classes.

Import the dataset: These steps are the same as in the last experiment. However, we need to change the last column (Age) to young, middleAge, and old. I used find and replace on that column in Excel for each number; a pandas sketch of the same recoding follows the sample data below. I’m sure there is an easier way to do this with pre-processing in RapidMiner, but I was not able to find it. Notice that the sex is not numeric. It includes the characters M, F, or I (Male, Female, Infant).


When importing, these are nominal attributes. Finally, the Age should be selected as polynomial instead of binomial like last time. Of course Age is the label and all others are attributes.

M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7
I,0.425,0.3,0.095,0.3515,0.141,0.0775,0.12,8
F,0.53,0.415,0.15,0.7775,0.237,0.1415,0.33,20
F,0.545,0.425,0.125,0.768,0.294,0.1495,0.26,16
M,0.475,0.37,0.125,0.5095,0.2165,0.1125,0.165,9

Part of the dataset.
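Here is the pandas sketch of the Age recoding mentioned above. The column names are hypothetical, and the bins follow the ranges from the summary (1-8 = young, 9-10 = middleAge, 11-29 = old).

    import pandas as pd

    cols = ["sex", "length", "diameter", "height", "whole_weight",
            "shucked_weight", "viscera_weight", "shell_weight", "age"]
    df = pd.read_csv("abalone.data", header=None, names=cols)

    # Recode the numeric age into the three ranges used in this experiment.
    df["age"] = pd.cut(df["age"], bins=[0, 8, 10, 29],
                       labels=["young", "middleAge", "old"])
    df.to_csv("abalone.csv", index=False)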

Cross Validation: Set this experiment up like the breast cancer experiment (without the AttributeFilter). However, we have two problems.

1. Not all attributes are numeric. The Sex is M, F, I.
2. The classification includes 3 classes instead of the expected 2.

Pre-processing non-numeric attributes: 1. Right click on Root, New Operator, Preprocessing, Attributes, Filter, Nominal2Numeric.

2. In the operator tree, click and drag Nominal2Numeric directly below ExampleSource.

3. Now RapidMiner will map M, F, and I to numeric values. This work is transparent to the user, but it is necessary if we want to use the SVM learner.

Pre-processing 3 classes instead of 2: 1. We need to wrap the LibSVMLearner in a Binary2MultiClassLearner operator.

2. Right click on XValidation, New Operator, Learner, Supervised, Meta, Binary2MultiClassLearner.

3. In the Operator Tree, click and drag Binary2MultiClassLearner directly below XValidation. It should be above LibSVMLearner.

4. Click and drag LibSVMLearner on top of Binary2MultiClassLearner. Now LibSVMLearner is an inner operator for Binary2MultiClassLearner. (A scikit-learn sketch of these two preprocessing ideas follows this list.)
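Here is the promised scikit-learn analogy for the two preprocessing ideas: map the nominal sex codes to numbers (the Nominal2Numeric step) and wrap the binary SVM so it can cover three classes (the Binary2MultiClassLearner step). It is only an analogy; RapidMiner's operator may decompose the problem differently, and the file and column names come from the earlier hypothetical sketch.

    import pandas as pd
    from sklearn.model_selection import cross_val_score
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import SVC

    df = pd.read_csv("abalone.csv")
    df["sex"] = df["sex"].map({"M": 0, "F": 1, "I": 2})   # nominal -> numeric

    X, y = df.drop(columns="age"), df["age"]
    model = OneVsRestClassifier(SVC())                    # one binary SVM per class
    print(cross_val_score(model, X, y, cv=10).mean())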


Change LibSVMLearner Parameters: 1. Click on LibSVMLearner. Click on the parameters tab and change kernel_type to poly. Run the experiment to get about 60% precision. Investigate the effects of changing the kernel and the degree. You can also compare the SVM to other learners by replacing that operator.
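In scikit-learn terms, the equivalent parameter change looks like the snippet below; the parameter names here are scikit-learn's, not the LibSVM keys shown in RapidMiner.

    from sklearn.svm import SVC

    # A polynomial kernel with an adjustable degree; try other kernels and degrees too.
    model = SVC(kernel="poly", degree=3)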

Multi-Class Classification with Feature Selection and Support Vector Machines

Summary: Sometimes a vector has attributes that do not benefit the training of the learner. In that case it is best to simplify the model with feature selection. In this experiment, we will classify wine into three categories using both forward and backward feature selection. Feature selection simplifies the model and often leads to more accurate predictions. We will also briefly explore some of the graphics capabilities.
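For readers who want to see the search itself, here is a sketch of forward selection wrapped around a cross-validated SVM in scikit-learn terms. It assumes the UCI wine data, a copy of which ships with scikit-learn (which may not be the exact file used here); set direction="backward" for backward elimination.

    from sklearn.datasets import load_wine
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.svm import SVC

    X, y = load_wine(return_X_y=True)

    # Greedy forward search, scored by 10-fold cross validation.
    selector = SequentialFeatureSelector(SVC(), direction="forward", cv=10)
    selector.fit(X, y)
    print(selector.get_support())   # mask of the attributes the search kept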


1. Create a cross validation experiment with the wine dataset. This is the same as the abalone experiment before.

2. Run the experiment. The model’s performance is about 72% ± 13% accuracy.


3. Now wrap the XValidation in a feature selection operator. Right click on Root, New Operator, Preprocessing, Attributes, Selection, FeatureSelection. Then click and drag it below ExampleSource and above XValidation. Finally, minimize XValidation and drag it on top of FeatureSelection.


4. Now run the experiment again with forward selection. We now get an accuracy of 76% ± 8%. Also take note of the features that were selected. Only J, K, and M are being used for this learner because of forward selection.


5. Try running it with backward selection. Our accuracy shot up to 94% ± 5%. Here the backward selection gives us a different set of features. The only ones NOT being used are E, F, and M.

6. Try plotting some of the data with the plot view setting. Here is where it pays off to label the attributes in your dataset rather than using arbitrary names like A, B, C.

Datasets were taken from the following online source.

Hettich, S. and Bay, S. D. (1999). The UCI KDD Archive [http://kdd.ics.uci.edu]. Irvine, CA: University of California, Department of Information and Computer Science.