Department, University of Nizwa · Data Mining (WEKA) Lab Manual
Table of Contents
S.No Week Topic
1. Week-1 How to Create Attribute Relation File Format (.arff)
2. Week-2 How to Create CSV (Comma-Separated Values) File
3. Week-3 Data Pre-Processing in Weka
4. Week-4 Data Discretization in Weka
5. Week-5 & 6 Association Rule Mining in Weka
6. Week-7 & 8 Classification via Decision Tree in Weka
7. Week-9 & 10 K-Means Clustering in Weka
8. Week-11 & 12 Using Visualization in Weka
9. Week-13 & 14 Using the Command Line
Attribute-Relation File Format (ARFF)
An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of
instances sharing a set of attributes. ARFF files were developed by the Machine Learning Project
at the Department of Computer Science of The University of Waikato for use with the Weka
machine learning software. This document describes the version of ARFF used with Weka
versions 3.2 to 3.3; this is an extension of the ARFF format as described in the data mining book
written by Ian H. Witten and Eibe Frank (the new additions are string attributes, date attributes,
and sparse instances).
Overview
ARFF files have two distinct sections. The first section is the Header information, which is
followed by the Data information.
The Header of the ARFF file contains the name of the relation, a list of the attributes (the
columns in the data), and their types. An example header on the standard IRIS dataset looks like
this:
% 1. Title: Iris Plants Database
%
% 2. Sources:
% (a) Creator: R.A. Fisher
% (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
% (c) Date: July, 1988
%
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
The Data section of the ARFF file looks like the following:
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
Lines that begin with a % are comments.
The @RELATION, @ATTRIBUTE and @DATA declarations are case insensitive.
Example:
@relation LCCvsLCSH
@attribute LCC string
@attribute LCSH string
@data
AG5, 'Encyclopedias and dictionaries.;Twentieth century.'
AS262, 'Science -- Soviet Union -- History.'
AE5, 'Encyclopedias and dictionaries.'
AS281, 'Astronomy, Assyro-Babylonian.;Moon -- Phases.'
AS281, 'Astronomy, Assyro-Babylonian.;Moon -- Tables.'
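To make the Header/Data structure concrete, the following is a minimal Python sketch of a reader for small, dense ARFF files like the two shown above. It is not WEKA's own loader: the function name parse_arff is made up for illustration, and the sketch assumes unquoted attribute names, no sparse instances, and single-quoted strings in the data section.

```python
import csv

def parse_arff(text):
    """Parse a small dense ARFF document into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and % comments
        if in_data:
            # Data rows are comma separated; strings may be single-quoted.
            reader = csv.reader([line], quotechar="'", skipinitialspace=True)
            rows.append(next(reader))
            continue
        keyword = line.split(None, 1)[0].lower()  # declarations are case insensitive
        if keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            _, name, attr_type = line.split(None, 2)
            attributes.append((name, attr_type))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows

sample = """% comment line
@relation LCCvsLCSH
@attribute LCC string
@attribute LCSH string
@data
AG5, 'Encyclopedias and dictionaries.;Twentieth century.'
AS262, 'Science -- Soviet Union -- History.'
"""

relation, attributes, rows = parse_arff(sample)
print(relation)
print(attributes)
print(rows[0])
```

Because the keyword is lower-cased before comparison, the same function accepts both the @RELATION style of the iris header and the @relation style of the second example.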
HOW TO CREATE CSV FILE
A comma-separated values (CSV) file is a text file whose fields are separated by commas,
although the name is also used for files that use another separator character. Creating a CSV file
is as simple as creating any text file, so it can be done in any text editor; in practice, CSV files are
often created in a spreadsheet program such as Microsoft Excel or OpenOffice Calc. Below are the
steps for creating a CSV file using a text editor such as Notepad and using Microsoft Excel.
1. Notepad (or any text editor)
2. Microsoft Excel
Notepad (or any text editor)
To create a CSV file in a text editor, open a new document in a text editor such as Notepad. Then
type the data you wish the file to contain, separating each field or column of data with a
comma and each row with a new line.
Title1,Title2,Title3
one,two,three
example1,example2,example3
As an example, if you were to create a text file with the above data and save it as a CSV file,
each comma would start a new column and each new line would start a new row. Therefore, the
above data, if opened in a spreadsheet program such as Microsoft Excel, would create a table
similar to the example below.
Title1 Title2 Title3
one two three
example1 example2 example3
If the data you're planning to use in your CSV file already contains commas, such as an address, it's
easier to use a different delimiter to separate the values. For example, in the CSV file below
we are creating names and addresses for labels that will be printed, and we separate
each name and address label with a tilde character. Alternatively, a better solution would be to
have the address, city, and state in their own columns.
Name
Street Address
City, State ZIP Code
~Mr John Smith
123 Fake address
Salt Lake City, Utah 89110
~Mrs Jane Doe
586 Another fake
Delta, Utah 84624
~Bill White
123 N Fake Street
St Anthony, Idaho 83445
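When a CSV file is written by a program rather than by hand, embedded commas can also be handled by quoting. The sketch below uses Python's standard csv module to show both options: default quoting, and a tilde used as the field delimiter. Note that it uses the tilde between fields on one line, which is a simpler layout than the multi-line label example above; the variable names are illustrative only.

```python
import csv
import io

rows = [
    ["Name", "Street Address", "City, State ZIP Code"],
    ["Mr John Smith", "123 Fake address", "Salt Lake City, Utah 89110"],
]

# Option 1: keep the comma delimiter; fields containing commas get quoted.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
quoted_text = buffer.getvalue()

# Option 2: use a tilde as the delimiter, so commas need no special handling.
tilde_buffer = io.StringIO()
csv.writer(tilde_buffer, delimiter="~", lineterminator="\n").writerows(rows)
tilde_text = tilde_buffer.getvalue()

# Reading the quoted version back recovers the original fields.
parsed = list(csv.reader(io.StringIO(quoted_text)))

print(quoted_text)
print(tilde_text)
```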
Microsoft Excel
Open Microsoft Excel and the file you wish to save as a CSV file. For example, below is
the data contained in our example Excel worksheet.
Item Cost Sold Profit
Keyboard $10.00 $16.00 $6.00
Monitor $80.00 $120.00 $40.00
Mouse $5.00 $7.00 $2.00
Total $48.00
Once open, click File, choose the Save As option, and in the "Save as type" list select the CSV
(Comma delimited) (*.csv) option.
Once saved, if you were to open the CSV file in a text editor, such as Notepad, the CSV file
should resemble the below example.
Item,Cost,Sold,Profit
Keyboard,$10.00,$16.00,$6.00
Monitor,$80.00,$120.00,$40.00
Mouse,$5.00,$7.00,$2.00
,,Total,$48.00
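As a quick sanity check of an exported file, the CSV text above can be read back and its arithmetic verified. The following is a small Python sketch using only the standard library; the helper name money is made up, and it assumes every amount is written with a leading dollar sign as in the example.

```python
import csv
import io
from decimal import Decimal

exported = """Item,Cost,Sold,Profit
Keyboard,$10.00,$16.00,$6.00
Monitor,$80.00,$120.00,$40.00
Mouse,$5.00,$7.00,$2.00
,,Total,$48.00
"""

def money(text):
    """Convert a string such as '$10.00' into a Decimal amount."""
    return Decimal(text.lstrip("$"))

rows = list(csv.DictReader(io.StringIO(exported)))
items = [row for row in rows if row["Item"]]            # the three product rows
total_row = next(row for row in rows if row["Sold"] == "Total")

# Each profit should equal sold minus cost, and the total should be their sum.
profits_consistent = all(
    money(row["Sold"]) - money(row["Cost"]) == money(row["Profit"]) for row in items
)
profit_sum = sum(money(row["Profit"]) for row in items)

print(profits_consistent, profit_sum, money(total_row["Profit"]))
```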
Data Preprocessing in WEKA
This exercise illustrates some of the basic data preprocessing operations that can be performed
using WEKA. The sample data set used for this example, unless otherwise indicated, is the "bank
data" available in comma-separated format (bank-data.csv).
The data contains the following fields:
id            a unique identification number
age           age of customer in years (numeric)
sex           MALE / FEMALE
region        inner_city / rural / suburban / town
income        income of customer (numeric)
married       is the customer married (YES/NO)
children      number of children (numeric)
car           does the customer own a car (YES/NO)
save_acct     does the customer have a savings account (YES/NO)
current_acct  does the customer have a current account (YES/NO)
mortgage      does the customer have a mortgage (YES/NO)
pep           did the customer buy a PEP (Personal Equity Plan) after the last mailing (YES/NO)
Loading the Data
In addition to the native ARFF data file format, WEKA has the capability to read in ".csv"
format files. This is fortunate since many databases or spreadsheet applications can save or
export data into flat files in this format. As can be seen in the sample data file, the first row
contains the attribute names (separated by commas) followed by each data row with attribute
values listed in the same order (also separated by commas). In fact, once loaded into WEKA, the
data set can be saved into ARFF format. If, however, you are interested in converting a ".csv" file
into WEKA's native ARFF using the command line, this can be accomplished using the
following command:
java weka.core.converters.CSVLoader filename.csv > filename.arff
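To illustrate what such a conversion involves, here is a rough Python sketch that turns CSV text into ARFF text. It is not WEKA's CSVLoader and does not reproduce its rules: it simply assumes that a column is numeric when every value parses as a number and nominal otherwise, and it ignores missing values and quoting. The function names and the two sample rows are invented for the example.

```python
import csv
import io

def is_number(value):
    """Return True when the string can be read as a floating-point number."""
    try:
        float(value)
        return True
    except ValueError:
        return False

def csv_to_arff(csv_text, relation):
    """Build ARFF text from CSV text whose first row holds the attribute names."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for col, name in enumerate(header):
        values = [row[col] for row in data]
        if all(is_number(v) for v in values):
            lines.append("@attribute %s numeric" % name)
        else:
            # Nominal attribute: list the distinct values that occur in the column.
            lines.append("@attribute %s {%s}" % (name, ",".join(sorted(set(values)))))
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"

# Two made-up rows using a few of the bank-data field names.
sample_csv = "age,sex,income,pep\n48,FEMALE,17546.0,YES\n40,MALE,30085.1,NO\n"
arff_text = csv_to_arff(sample_csv, "bank-sample")
print(arff_text)
```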
In this example, we load the data set into WEKA, perform a series of operations using WEKA's
attribute and discretization filters, and then perform association rule mining on the resulting data
set. While all of these operations can be performed from the command line, we use the GUI
interface for WEKA Explorer.
Initially (in the Preprocess tab) click "open" and navigate to the directory containing the data file
(.csv or .arff). In this case we will open the above data file. This is shown in Figure p1.
Since the data is not in ARFF format, a dialog box will prompt you to use the converter, as in
Figure p2. You can click on the "Use Converter" button, and click OK in the next dialog box that
appears (see Figure p3).
Once the data is loaded, WEKA will recognize the attributes and during the scan of the data will
compute some basic statistics on each attribute. The left panel in Figure p4 shows the list of
recognized attributes, while the top panels indicate the names of the base relation (or table) and
the current working relation (which are the same initially).
Clicking on any attribute in the left panel will show the basic statistics on that attribute. For
categorical attributes, the frequency for each attribute value is shown, while for continuous
attributes we can obtain min, max, mean, standard deviation, etc. As an example, see Figures p5
and p6 below which show the results of selecting the "age" and "married" attributes,
respectively.
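The per-attribute statistics described above can be sketched in a few lines of Python. This is only an illustration of the idea, not WEKA's code: the function name summarize and the sample values are invented, and the sample standard deviation is used as an assumption.

```python
import statistics
from collections import Counter

def summarize(values):
    """Return basic statistics for one attribute given its values as strings."""
    try:
        numbers = [float(v) for v in values]
    except ValueError:
        # Not all values are numbers: treat as categorical and count each value.
        return {"type": "nominal", "counts": dict(Counter(values))}
    return {
        "type": "numeric",
        "min": min(numbers),
        "max": max(numbers),
        "mean": statistics.mean(numbers),
        "stdev": statistics.stdev(numbers),  # sample standard deviation
    }

# Made-up values for a continuous and a categorical attribute.
ages = ["48", "40", "51", "23"]
married = ["NO", "YES", "YES", "YES"]

age_summary = summarize(ages)
married_summary = summarize(married)
print(age_summary)
print(married_summary)
```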
Note that the visualization in the right bottom panel is a form of cross-tabulation across two
attributes. For example, in Figure p6 above, the default visualization panel cross-tabulates
"married" with the "pep" attribute (by default the second attribute is the last column of the data
file). You can select another attribute using the drop down list.
Selecting or Filtering Attributes
In our sample data file, each record is uniquely identified by a customer id (the "id" attribute).
We need to remove this attribute before the data mining step. We can do this by using the
Attribute filters in WEKA. In the "Filter" panel, click on the "Choose" button. This will show a
popup window with a list of available filters. Scroll down the list and select the
"weka.filters.unsupervised.attribute.Remove" filter as shown in Figure p7.
Next, click on the text box immediately to the right of the "Choose" button. In the resulting dialog
box enter the index of the attribute to be filtered out (this can be a range or a list separated by
commas). In this case, we enter 1, which is the index of the "id" attribute (see the left panel).
Make sure that the "invertSelection" option is set to false (otherwise everything except attribute 1
will be filtered out). Then click "OK" (see Figure p8). Now, in the filter box you will see
"Remove -R 1" (see Figure p9).
Click the "Apply" button to apply this filter to the data. This will remove the "id" attribute and
create a new working relation (whose name now includes the details of the filter that was
applied). The result is depicted in Figure p10:
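The effect of the Remove filter, including the meaning of the "invertSelection" option, can be mimicked on a small table with the Python sketch below. The function name remove_attributes and the sample rows are hypothetical; as in WEKA, attribute indices are 1-based.

```python
def remove_attributes(header, rows, indices, invert_selection=False):
    """Drop the attributes at the 1-based indices, or keep only them if inverted."""
    selected = {i - 1 for i in indices}
    if invert_selection:
        keep = [c for c in range(len(header)) if c in selected]
    else:
        keep = [c for c in range(len(header)) if c not in selected]
    new_header = [header[c] for c in keep]
    new_rows = [[row[c] for c in keep] for row in rows]
    return new_header, new_rows

# Made-up records with a subset of the bank-data attributes.
header = ["id", "age", "sex", "pep"]
rows = [["ID001", "48", "FEMALE", "YES"], ["ID002", "40", "MALE", "NO"]]

# Same idea as "Remove -R 1": drop the first attribute.
new_header, new_rows = remove_attributes(header, rows, [1])

# With the selection inverted, only attribute 1 survives.
only_id, _ = remove_attributes(header, rows, [1], invert_selection=True)

print(new_header, new_rows)
print(only_id)
```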
It is possible now to apply additional filters to the new working relation. In this example,
however, we will save our intermediate results as separate data files and treat each step as a
separate WEKA session. To save the new working relation as an ARFF file, click on the "Save" button
in the top panel. Here, as shown in the "save" dialog box (see Figure p11), we will save the new
relation in the file "bank-data-R1.arff".
Figure p12 shows the top portion of the newly generated ARFF file (in TextPad).
Note that in the new data set, the "id" attribute and all the corresponding values in the records
have been removed. Also, note that Weka has automatically determined the correct types and
values associated with the attributes, as listed in the Attributes section of the ARFF file.
Discretization
Some techniques, such as association rule mining, can only be performed on categorical data.
This requires performing discretization on numeric or continuous attributes. There are 3 such
attributes in this data set: "age", "income", and "children". In the case of the "children" attribute
the only possible values are 0, 1, 2, and 3. In this case, we have opted for keeping all of
these values in the data. This means we can simply discretize by removing the keyword
"numeric" as the type for the "children" attribute in the ARFF file, and replacing it with the set of
discrete values. We do this directly in our text editor as seen in Figure p13. In this case, we have
saved the resulting relation in a separate file "bank-data2.arff".
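The same manual edit can be scripted. The Python sketch below rewrites a single attribute declaration from numeric to a set of discrete values and leaves the rest of the file untouched; the function name make_nominal and the tiny sample file are invented, and the sketch assumes the declaration sits on one line with an unquoted attribute name.

```python
import re

def make_nominal(arff_text, attribute, labels):
    """Replace '@attribute <attribute> numeric' with a nominal declaration."""
    pattern = re.compile(
        r"^(@attribute\s+%s\s+)numeric[ \t]*$" % re.escape(attribute),
        re.IGNORECASE | re.MULTILINE,
    )
    replacement = r"\g<1>{%s}" % ",".join(labels)
    return pattern.sub(replacement, arff_text, count=1)

# A made-up fragment of an ARFF file.
sample = (
    "@relation bank\n"
    "@attribute age numeric\n"
    "@attribute children numeric\n"
    "@data\n"
    "48,1\n"
)

edited = make_nominal(sample, "children", ["0", "1", "2", "3"])
print(edited)
```

The data rows need no change, since the values 0 to 3 are written the same way whether the attribute is numeric or nominal.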
We will rely on WEKA to perform discretization on the "age" and "income" attributes. In this
example, we divide each of these into 3 bins (intervals). The WEKA discretization filter can
divide the ranges blindly, or use various statistical techniques to automatically determine the
best way of partitioning the data. In this case, we will perform simple binning.
First we will load our filtered data set into WEKA by opening the file "bank-data2.arff". The
"open" dialog box is depicted in Figure p14.
If we select the "children" attribute in this new data set, we see that it is now a categorical
attribute with four possible discrete values. This is depicted in Figure p15.
Now, once again we activate the Filter dialog box, but this time, we will select
"weka.filters.unsupervised.attribute.Discretize" from the list (see Figure p16).
Next, to change the defaults for this filter, click on the box immediately to the right of the
"Choose" button. This will open the Discretize Filter dialog box. We enter the index of the
attributes to be discretized. In this case we enter 1, corresponding to the attribute "age". We also enter
3 as the number of bins (note that it is possible to discretize more than one attribute at the same
time by using a list of attribute indices). Since we are doing simple binning, all of the other
available options are set to "false". The dialog box is depicted in Figure p17.
Click "Apply" in the Filter panel. This will result in a new working relation with the selected
attribute partitioned into 3 bins (see Figure p18). To examine the results, we save the new
working relation in the file "bank-data3.arff" as depicted in Figure p19.
Let us now examine the new data set using our text editor (in this case, TextPad). The top portion
of the data is shown in Figure p19. You can observe that WEKA has assigned its own labels to
each of the value ranges for the discretized attribute. For example, the lower range in the "age"
attribute is labeled "(-inf-34.333333]" (enclosed in single quotes and escape characters), while
the middle range is labeled "(34.333333-50.666667]", and so on. These labels now also appear in
the data records where the original age value was in the corresponding range.
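Simple (equal-width) binning splits the range between the smallest and largest value into intervals of the same width. The Python sketch below shows the arithmetic. It assumes a minimum age of 18 and a maximum of 67, values chosen only because they reproduce the boundaries quoted above; the function names and the ")" that closes the last label are likewise assumptions, not WEKA's code.

```python
def equal_width_cut_points(minimum, maximum, bins):
    """Return the interior boundaries that split [minimum, maximum] into equal bins."""
    width = (maximum - minimum) / bins
    return [minimum + width * i for i in range(1, bins)]

def bin_labels(cut_points):
    """Build range labels in the style shown above, e.g. (-inf-34.333333]."""
    edges = ["-inf"] + ["%.6f" % c for c in cut_points] + ["inf"]
    labels = ["(%s-%s]" % pair for pair in zip(edges, edges[1:])]
    labels[-1] = labels[-1][:-1] + ")"  # assumed: the open-ended last bin ends with ")"
    return labels

def assign_bin(value, cut_points):
    """Return the 0-based index of the bin that contains the value."""
    for index, cut in enumerate(cut_points):
        if value <= cut:
            return index
    return len(cut_points)

cuts = equal_width_cut_points(18, 67, 3)   # assumed minimum and maximum age
labels = bin_labels(cuts)
print(labels)
print(labels[assign_bin(35, cuts)])
```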
Next, we apply the same process to discretize the "income" attribute into 3 bins. Again, Weka
automatically performs the binning and replaces the values in the "income" column with the
appropriate automatically generated labels. We save the new file into "bank-data3.arff",
replacing the older version.
Clearly, the WEKA labels, while readable, leave much to be desired as far as naming
conventions go. We will thus use the global search/replace functions in TextPad to replace these
labels with more succinct and readable ones. Fortunately, TextPad has a powerful regular
expression pattern matching capability which allows us to do this efficiently. The TextPad
search/replace dialog box is used to replace the age label "(-inf-34.333333]" with the label "0_34".
Note that the "regular expression" option is selected. In the "Find what" box we have entered the
full label '\'(-inf-34.333333]\'' (including the backslashes and single quotes). Furthermore, the
backslashes are escaped with another backslash so that in regular expression pattern matching
they are treated as literals, resulting in '\\'(-inf-34.333333]\\''. In the "Replace with" box we enter
"0_34".
Now we click on the "Replace All" button to replace all instances of the old patterns with the
new one. The result of this operation is depicted in Figure p20.
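The same search and replace can be done outside TextPad. The Python sketch below is one possible way: it builds the quoted label as it appears in the file and lets re.escape handle every special character (the backslashes as well as the parenthesis and bracket). The function name relabel and the sample lines are made up for illustration.

```python
import re

def relabel(arff_text, old_label, new_label):
    """Replace a WEKA-generated range label, with its quotes, by a short label."""
    quoted = r"'\'" + old_label + r"\''"      # the label as written in the ARFF file
    return re.sub(re.escape(quoted), lambda match: new_label, arff_text)

# A made-up fragment containing the label in the header and in one record.
sample = "\n".join([
    r"@attribute age {'\'(-inf-34.333333]\'','\'(34.333333-50.666667]\''}",
    "@data",
    r"'\'(-inf-34.333333]\'',FEMALE",
])

result = relabel(sample, "(-inf-34.333333]", "0_34")
print(result)
```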
Note that the new label now appears in place of the old one both in the attribute section of the
ARFF file as well as in the relevant data records. We repeat this manual re-labeling process with
all of the WEKA-assigned labels for the "age" and the "income" attributes. Figure p21 shows the
final result of the transformation and the newly assigned labels for these attribute values.
We now also change the relation name in the ARFF file to "bank-data-final" and save the file as
"bank-data-final.arff".
Association Rule Mining with WEKA
This exercise illustrates some of the basic elements of association rule mining using WEKA. The
sample data set used for this example, unless otherwise indicated, is the "bank data" described in
(Data Preprocessing in WEKA). In this case, our starting point is the discretized data obtained
after performing the preprocessing tasks. Figure a1 shows the WEKA explorer interface after
opening this data file ("bank-data-final.arff").
Clicking on the "Associate" tab will bring up the interface for the association rule algorithms.
The Apriori algorithm, which we will use, is the default algorithm selected. However, in order to
change the parameters for this run (e.g., support, confidence, etc.) we click on the text box
immediately to the right of the "Choose" button. Note that this box, at any given time, shows the
specific command-line arguments that are to be used for the algorithm. The dialog box for
changing the parameters is depicted in Figure a2. Here, you can specify various parameters
associated with Apriori. Click on the "More" button to see the synopsis for the different
parameters.
WEKA allows the resulting rules to be sorted according to different metrics such as confidence,
leverage, and lift. In this example, we have selected lift as the criterion. Furthermore, we have
entered 1.5 as the minimum value for lift. Lift (or improvement) is computed as the confidence of the
rule divided by the support of the right-hand side (RHS). In a simplified form, given a rule L =>
R, lift is the ratio of the probability that L and R occur together to the product of the two
individual probabilities for L and R, i.e.,
lift = Pr(L,R) / (Pr(L).Pr(R)).
If this value is 1, then L and R, are independent. The higher this value, the more likely that the
existence of L and R together in a transaction is not just a random occurrence, but because of
some relationship between them.
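A small worked example may help. The Python sketch below computes lift on four made-up transactions, once from the probabilities and once as the confidence divided by the support of the RHS, to show that the two descriptions agree. The function names and the transactions are invented for the illustration.

```python
def probability(transactions, items):
    """Fraction of transactions that contain every item in items."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)

def lift(transactions, lhs, rhs):
    """lift = Pr(L,R) / (Pr(L) * Pr(R))."""
    p_both = probability(transactions, set(lhs) | set(rhs))
    return p_both / (probability(transactions, lhs) * probability(transactions, rhs))

# Four made-up transactions over two attributes.
transactions = [
    {"married=YES", "pep=NO"},
    {"married=YES", "pep=NO"},
    {"married=NO", "pep=YES"},
    {"married=NO", "pep=NO"},
]

lhs, rhs = ["married=NO"], ["pep=YES"]
value = lift(transactions, lhs, rhs)

# Equivalent form: confidence of the rule divided by the support of the RHS.
confidence = probability(transactions, lhs + rhs) / probability(transactions, lhs)
value_from_confidence = confidence / probability(transactions, rhs)

print(value, value_from_confidence)
```

Here Pr(L) = 0.5, Pr(R) = 0.25 and Pr(L,R) = 0.25, so the lift is 0.25 / 0.125 = 2, which is above 1 and therefore suggests a positive association in this toy data.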
Here we also change the default number of rules (10) to 100; this indicates that the program will
report no more than the top 100 rules (in this case sorted according to their lift values). The
upper bound for minimum support is set to 1.0 (100%) and the lower bound to 0.1 (10%).
Apriori in WEKA starts with the upper bound support and incrementally decreases support (by
delta increments which by default is set to 0.05 or 5%). The algorithm halts when either the
specified number of rules are generated, or the lower bound for min. support is reached. The
significance testing option is only applicable in the case of confidence and is by default not used
(-1.0).
The final selection of parameters for our current run is depicted in Figure a3:
Once the parameters have been set, the commandline text box will show the new command line.
We now click on start to run the program. This results in a set of rules as depicted in Figure a4.
The panel on the left ("Result list") now shows an item indicating the algorithm that was run and
the time of the run. You can perform multiple runs in the same session, each time with different
parameters. Each run will appear as an item in the Result list panel. Clicking on one of the
results in this list will bring up the details of the run, including the discovered rules in the right
panel. In addition, right-clicking on the result set allows us to save the result buffer into a
separate file. In this case, we save the output in the file bank-data-ar1.txt. A portion of this file is
depicted in Figure a5:
Note that the rules were discovered based on the specified threshold values for support and lift.
For each rule, the frequency counts for the LHS and RHS are given, as well as the
values for confidence, lift, leverage, and conviction. Note that leverage and lift measure similar
things, except that leverage measures the difference between the probability of co-occurrence of
L and R (see above example) and the product of the independent probabilities of L and R, i.e.,
leverage = Pr(L,R) - Pr(L).Pr(R).
In other words, leverage measures the proportion of additional cases covered by both L and R
above those expected if L and R were independent of each other. Thus, for leverage, values
above 0 are desirable, whereas for lift, we want to see values greater than 1. Finally, conviction
is similar to lift, but it measures the effect of the right-hand side not being true. It also inverts the
ratio. So, conviction is measured as:
conviction = Pr(L).Pr(not R) / Pr(L,not R).
Thus, conviction, in contrast to lift, is not symmetric (and also has no upper bound).
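The two measures can be computed on a toy example as follows. This Python sketch takes leverage as Pr(L,R) - Pr(L).Pr(R) and conviction in its usual form, Pr(L).Pr(not R) / Pr(L,not R); the function names and the four transactions are invented, and conviction is reported as infinite when the rule never fails.

```python
def probability(transactions, predicate):
    """Fraction of transactions for which the predicate holds."""
    return sum(1 for t in transactions if predicate(t)) / len(transactions)

def leverage_and_conviction(transactions, lhs, rhs):
    lhs, rhs = set(lhs), set(rhs)
    p_l = probability(transactions, lambda t: lhs <= t)
    p_r = probability(transactions, lambda t: rhs <= t)
    p_lr = probability(transactions, lambda t: lhs <= t and rhs <= t)
    p_l_not_r = probability(transactions, lambda t: lhs <= t and not (rhs <= t))
    leverage = p_lr - p_l * p_r
    # Conviction is unbounded: it is infinite when L never occurs without R.
    conviction = float("inf") if p_l_not_r == 0 else p_l * (1 - p_r) / p_l_not_r
    return leverage, conviction

# Four made-up transactions over two attributes.
transactions = [
    {"married=YES", "pep=NO"},
    {"married=YES", "pep=NO"},
    {"married=NO", "pep=YES"},
    {"married=NO", "pep=NO"},
]

leverage, conviction = leverage_and_conviction(transactions, ["married=NO"], ["pep=YES"])
print(leverage, conviction)
```

With Pr(L) = 0.5, Pr(R) = 0.25, Pr(L,R) = 0.25 and Pr(L,not R) = 0.25, the leverage is 0.25 - 0.125 = 0.125 and the conviction is 0.375 / 0.25 = 1.5.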
In most cases, it is sufficient to focus on a combination of support, confidence, and either lift or
leverage to quantitatively measure the "quality" of the rule. However, the real value of a rule, in
terms of usefulness and actionability, is subjective and depends heavily on the particular domain
and business objectives.
Using the Command Line
In general, using WEKA from the command line provides more flexibility than using the GUI
version (we will discuss this more in the context of classification). In the case of association
rules, the GUI version does not provide the ability to save the frequent itemsets (independently
of the generated rules). We can do this using the command line. If we look at the output of the
association rule mining from the above example (the file bank-data-ar1.txt), the actual command
line options are given under the "Run information" at the top. In the example, this command line
is:
weka.associations.Apriori -N 100 -T 1 -C 1.5 -D 0.05 -U 1.0 -M 0.1 -S -1.0
We can use this directly using the "Simple CLI" interface.
In the main WEKA interface, click "Simple CLI" button to start the command line interface. The
main command for generating the rules as we did above is:
java weka.associations.Apriori options -t directory-path\bank-data-final.arff
where the word options is replaced with the command line options, which for the above example
are:
-N 100 -T 1 -C 1.5 -D 0.05 -U 1.0 -M 0.1 -S -1.0
The additional "-t directory-path\bank-data-final.arff" option tells WEKA to use the file
"bank-data-final.arff" as the input file (located in the specified directory). This command will produce
exactly the same output as the previous GUI example. However, we can add an additional option
("-I") which results in the generation of all frequent itemsets:
java weka.associations.Apriori options -I -t directory-path\bank-data-final.arff
This command as it is used in the SimpleCLI interface is depicted in Figure a6:
When ready, press enter to run the program with the indicated options. The result of this
command will be displayed in the top panel of the Simple CLI interface. Here, the results have
been saved into the file bank-data-ar2.txt. You will notice that before the rules, the output includes
itemsets of various sizes generated at different iterations of the Apriori algorithm (in this case, L1
through L5) along with the support count for each itemset. In the case of L1, these are simply the
individual items (attribute-value pairs) that meet the minimum support threshold.
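The level-by-level generation of frequent itemsets (L1, L2, and so on) can be sketched in Python as below. This is a textbook-style illustration of the Apriori idea, not WEKA's implementation: the function name, the fixed minimum support, and the five toy transactions are all assumptions made for the example.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return a list of levels; level k maps each frequent k-itemset to its count."""
    total = len(transactions)
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    levels = []
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n / total >= min_support}
        if not frequent:
            break
        levels.append(frequent)
        size = len(levels) + 1
        # Join step: unions of two frequent itemsets that contain exactly `size` items.
        joined = {a | b for a, b in combinations(frequent, 2) if len(a | b) == size}
        # Prune step: every subset with one item fewer must itself be frequent.
        candidates = [
            c for c in joined
            if all(frozenset(s) in frequent for s in combinations(c, size - 1))
        ]
    return levels

# Five made-up transactions.
transactions = [
    {"car=YES", "married=YES", "pep=NO"},
    {"car=YES", "married=YES"},
    {"car=YES", "pep=NO"},
    {"married=YES", "pep=NO"},
    {"car=YES", "married=YES", "pep=NO"},
]

levels = frequent_itemsets(transactions, min_support=0.6)
for k, level in enumerate(levels, start=1):
    print("L%d:" % k, {tuple(sorted(s)): n for s, n in level.items()})
```

In this toy data every single item and every pair occurs in at least 3 of the 5 transactions, but the triple occurs in only 2, so the output stops at L2.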
Classification via Decision Trees in WEKA
This example illustrates the use of C4.5 (J48) classifier in WEKA. The sample data set used for
this example, unless otherwise indicated, is the bank data available in comma-separated format
(bank-data.csv). This document assumes that appropriate data preprocessing has been performed.
In this case the ID field has been removed. Since the C4.5 algorithm can handle numeric attributes, there
is no need to discretize any of the attributes. For the purposes of this example, however, the
"Children" attribute has been converted into a categorical attribute with values "YES" or "NO".
WEKA has implementations of numerous classification and prediction algorithms. The basic
ideas behind using all of these are similar. In this example we will use the modified version of
the bank data to classify new instances using the C4.5 algorithm (note that the C4.5 is
implemented in WEKA by the classifier class: weka.classifiers.trees.J48). The modified (and
smaller) version of the bank data can be found in the file "bank.arff" and the new unclassified
instances are in the file "bank-new.arff".
As usual, we begin by loading the data into WEKA, as seen in Figure 1:
Figure: 1
Next, we select the "Classify" tab and click the "Choose" button to select the J48 classifier, as
depicted in Figures 2-a and 2-b. Note that J48 (an implementation of the C4.5 algorithm) does not
require discretization of numeric attributes, in contrast to the ID3 algorithm from which C4.5 has
evolved.
Figure 2-a Figure 2-b
Now, we can specify the various parameters. These can be specified by clicking in the text box
to the right of the "Choose" button. In this example we accept the default
values. The default version does perform some pruning (using the subtree raising approach), but
does not perform reduced-error pruning. The selected parameters are depicted in Figure 3.
Figure 3
Under the "Test options" in the main panel we select 10-fold cross-validation as our evaluation
approach. Since we do not have a separate evaluation data set, this is necessary to get a reasonable
idea of the accuracy of the generated model. We now click "Start" to generate the model. The ASCII
version of the tree as well as evaluation statistics will appear in the right panel when the model
construction is completed (see Figure 4).
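To show what 10-fold cross-validation does with the data, the Python sketch below splits instance indices into ten folds and, for each fold, pairs it (as test set) with the remaining nine (as training set). It is a plain, non-stratified illustration and not WEKA's own procedure; the function name, the seed, and the figure of 600 instances are assumptions made for the example.

```python
import random

def k_fold_indices(n_instances, k=10, seed=1):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)           # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]      # k folds of (almost) equal size
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(600, k=10))           # 600 instances is an assumed size
for train, test in splits[:2]:
    print(len(train), len(test))
```

Each instance is used for testing exactly once and for training nine times; the reported accuracy is then based on the predictions made for the held-out folds.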
Figure 4
We can view this information in a separate window by right clicking the last result set (inside the
"Result list" panel on the left) and selecting "View in separate window" from the pop-up menu.
These steps and the resulting window containing the classification results are depicted in Figures
5-a and 5-b.
Figure 5-a Figure 5-b
Note that the classification accuracy of our model is only about 69%. This may indicate that we
may need to do more work (either in preprocessing or in selecting the correct parameters for
classification), before building another model. In this example, however, we will continue with
this model despite its inaccuracy.
WEKA also lets us view a graphical rendition of the classification tree. This can be done by
right clicking the last result set (as before) and selecting "Visualize tree" from the pop-up menu.
The tree for this example is depicted in Figure 6. Note that by resizing the window and selecting
various menu items from inside the tree view (using the right mouse button), we can adjust the
tree view to make it more readable.
Figure 6
We will now use our model to classify the new instances. A portion of the new instances ARFF
file is depicted in Figure 7. Note that the attribute section is identical to the training data (the bank
data we used for building our model). However, in the data section, the value of the "pep"
attribute is "?" (or unknown).
Figure 7
In the main panel, under "Test options" click the "Supplied test set" radio button, and then click
the "Set..." button. This will pop up a window which allows you to open the file containing test
instances, as in Figures 8-a and 8-b.
Figure 8-a Figure 8-b
In this case, we open the file "bank-new.arff" and upon returning to the main window, we click
the "Start" button. This once again generates the model from our training data, but this time it
applies the model to the new unclassified instances in the "bank-new.arff" file in order to predict
the value of the "pep" attribute. The result is depicted in Figure 9. Note that the summary of the
results in the right panel does not show any statistics. This is because in our test instances the
value of the class attribute ("pep") was left as "?", thus WEKA has no actual values to which it
can compare the predicted values of new instances.
Figure 9
Of course, in this example we are interested in knowing how our model managed to classify the
new instances. To do so we need to create a file containing all the new instances along with their
predicted class value resulting from the application of the model. Doing this is much simpler
using the command line version of WEKA classifier application. However, it is possible to do so
in the GUI version using an "indirect" approach, as follows.
First, right-click the most recent result set in the left "Result list" panel. In the resulting pop-up
window select the menu item "Visualize classifier errors". This brings up a separate window
containing a two-dimensional graph. These steps and the resulting window are shown in Figures
9 and 10.
Figure 10
For now, we are not interested in what this graph represents. Rather, we would like to "save" the
classification results from which the graph is generated. In the new window, we click on the
"Save" button and save the result as the file: "bank-predicted.arff", as shown in Figure 11.
Figure 11
This file contains a copy of the new instances along with an additional column for the predicted
value of "pep". The top portion of the file can be seen in Figure 12.
Figure 12
Note that two attributes have been added to the original new instances data: "Instance_number"
and "predictedpep". These correspond to new columns in the data portion. The "predictedpep"
value for each new instance is the last value before "?", which is the actual "pep" class value. For
example, the predicted value of the "pep" attribute for instance 0 is "YES" according to our
model, while the predicted class value for instance 4 is "NO".
Using the Command Line (Recommended)
While the GUI version of WEKA is nice for visualizing the results and setting the parameters
using forms, when it comes to building a classification (or prediction) model and then applying
it to new instances, the most direct and flexible approach is to use the command line. In fact, you
can use the GUI to create the list of parameters (for example in case of the J48 class) and then
use those parameters in the command line.
In the main WEKA interface, click "Simple CLI" button to start the command line interface. The
main command for generating the classification model as we did above is:
java weka.classifiers.trees.J48 -C 0.25 -M 2 -t directory-path\bank.arff -d directory-path\bank.model
The options -C 0.25 and -M 2 in the above command are the same options that we selected for
J48 classifier in the previous GUI example (see Figure 3). The -t option in the command
specifies that the next string is the full directory path to the training file (in this case "bank.arff").
In the above command directory-path should be replaced with the full directory path where the
training file resides. Finally, the -d option specifies the name (and location) where the model will
be stored. After executing this command inside the "Simple CLI" interface, you should see the
tree and stats about the model in the top window (See Figure 13).
Figure 13
Based on the above command, our classification model has been stored in the file "bank.model"
and placed in the directory we specified. We can now apply this model to the new instances. The
advantage of building a model and storing it is that it can be applied at any time to different sets
of unclassified instances. The command for doing so is:
java weka.classifiers.trees.J48 -p 9 -l directory-path\bank.model -T directory-path\bank-new.arff
In the above command, the option -p 9 indicates that we want to predict a value for attribute
number 9 (which is "pep"). The -l option specifies the directory path and name of the model file
(this is what was created in the previous step). Finally, the -T option specifies the name (and
path) of the test data. In our example, the test data is our new instances file "bank-new.arff".
This command results in a 4-column output similar to the following:
0 YES 0.75 ?
1 NO 0.7272727272727273 ?
2 YES 0.95 ?
3 YES 0.8813559322033898 ?
4 NO 0.8421052631578947 ?
The first column is the instance number assigned to the new instances in "bank-new.arff" by
WEKA. The 2nd column is the predicted value of the "pep" attribute for the new instance. The
3rd column is the confidence the model assigns to that prediction. Finally, the 4th column is
the actual "pep" value in the test data (in this case, we did not have a value for "pep" in
"bank-new.arff", thus this value is "?"). For example, in the above output, the predicted value of "pep"
for instance 2 is "YES" with a confidence of 95%. A portion of the final result is depicted in
Figure 14.
Figure 14
The above output is preferable to the output derived from the GUI version of WEKA. First,
this is a more direct approach which allows us to save the classification model. This model can
be applied to new instances later without having to regenerate the model. Secondly (and more
importantly), in contrast to the final output of the GUI version, in this case we have an independent
confidence value for each of the new instances. This means that we can focus only
on those predictions in which we are more confident. For example, in the above output, we
could filter out any instance whose predicted value has a confidence of less than 85%.
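The filtering step described above can be done outside WEKA with a short script. The following is a minimal Python sketch, assuming the prediction output has been copied into a string in the 4-column layout shown earlier. The function names and the 0.85 threshold are illustrative choices, not part of WEKA.

```python
# Sample lines copied from the prediction output shown above.
raw_output = """\
0 YES 0.75 ?
1 NO 0.7272727272727273 ?
2 YES 0.95 ?
3 YES 0.8813559322033898 ?
4 NO 0.8421052631578947 ?
"""

def parse_predictions(text):
    """Split each line into (instance, predicted, confidence, actual)."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        inst, predicted, confidence, actual = parts
        rows.append((int(inst), predicted, float(confidence), actual))
    return rows

def confident_only(rows, threshold=0.85):
    """Keep only predictions whose confidence is at least the threshold."""
    return [row for row in rows if row[2] >= threshold]

predictions = parse_predictions(raw_output)
kept = confident_only(predictions)
for inst, predicted, confidence, _ in kept:
    print(inst, predicted, round(confidence, 3))
```

With the five sample lines, only instances 2 and 3 pass the 85% threshold.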
K-Means Clustering in WEKA
This example illustrates the use of k-means clustering with WEKA. The sample data set used for
this example is based on the "bank data" available in comma-separated format (bank-data.csv).
This document assumes that appropriate data preprocessing has been performed. In this case a
version of the initial data set has been created in which the ID field has been removed and the
"children" attribute has been converted to categorical (this conversion, however, is not necessary for
clustering).
The resulting data file is "bank.arff" and includes 600 instances. As an illustration of performing
clustering in WEKA, we will use its implementation of the K-means algorithm to cluster the
customers in this bank data set, and to characterize the resulting customer segments. Figure 1
shows the main WEKA Explorer interface with the data file loaded.
Figure 1
Some implementations of K-means only allow numerical values for attributes. In that case, it
may be necessary to convert the data set into the standard spreadsheet format and convert
categorical attributes to binary. It may also be necessary to normalize the values of attributes that
are measured on substantially different scales (e.g., "age" and "income"). While WEKA provides
filters to accomplish all of these preprocessing tasks, they are not necessary for clustering in
WEKA. This is because WEKA's SimpleKMeans algorithm automatically handles a mixture of
categorical and numerical attributes. Furthermore, the algorithm automatically normalizes
numerical attributes when doing distance computations. The WEKA SimpleKMeans algorithm
uses the Euclidean distance measure to compute distances between instances and clusters.
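To make the idea of a normalized distance over mixed attributes concrete, here is a minimal Python sketch. It assumes min-max scaling for numerical attributes and a 0/1 mismatch for categorical attributes. This is one common convention and WEKA's implementation may differ in its details. The three customer records and the helper names are hypothetical.

```python
import math

def min_max_ranges(instances, numeric_keys):
    """Record the (min, max) of every numerical attribute."""
    ranges = {}
    for key in numeric_keys:
        values = [inst[key] for inst in instances]
        ranges[key] = (min(values), max(values))
    return ranges

def mixed_distance(a, b, numeric_ranges):
    """Euclidean distance with scaled numeric and 0/1 categorical differences."""
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            lo, hi = numeric_ranges[key]
            span = hi - lo
            # Scale the difference to [0, 1] so "age" and "income" are comparable.
            diff = 0.0 if span == 0 else (a[key] - b[key]) / span
        else:
            # Categorical attribute: 0 if the values match, 1 otherwise.
            diff = 0.0 if a[key] == b[key] else 1.0
        total += diff * diff
    return math.sqrt(total)

# Hypothetical records, not taken from bank.arff.
customers = [
    {"age": 20, "income": 10000.0, "sex": "FEMALE"},
    {"age": 40, "income": 30000.0, "sex": "MALE"},
    {"age": 60, "income": 50000.0, "sex": "FEMALE"},
]
ranges = min_max_ranges(customers, ["age", "income"])
d01 = mixed_distance(customers[0], customers[1], ranges)
d02 = mixed_distance(customers[0], customers[2], ranges)
print(d01, d02)
```

Without the scaling step, the "income" differences (in the tens of thousands) would swamp both "age" and the categorical mismatch.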
To perform clustering, select the "Cluster" tab in the Explorer and click on the "Choose" button.
This results in a drop down list of available clustering algorithms. In this case we select
"SimpleKMeans". Next, click on the text box to the right of the "Choose" button to get the
pop-up window shown in Figure 2, for editing the clustering parameters.
Figure 2
In the pop-up window we enter 6 as the number of clusters (instead of the default value of 2)
and we leave the value of "seed" as is. The seed value is used in generating a random number
which is, in turn, used for making the initial assignment of instances to clusters. Note that, in
general, K-means is quite sensitive to how clusters are initially assigned. Thus, it is often
necessary to try different seed values and evaluate the results.
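The role of the seed can be illustrated with a small stand-alone k-means on one-dimensional data. This Python sketch assumes the initial centers are chosen by drawing k instances at random using the seed. WEKA's own initialization may differ, and the "ages" list and the function name are hypothetical.

```python
import random

def kmeans_1d(values, k, seed, iterations=20):
    """Run a basic k-means on numbers; return (sorted centers, squared error)."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)  # the seed decides the starting centers
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    sse = sum(min((v - c) ** 2 for c in centers) for v in values)
    return sorted(centers), sse

# Hypothetical ages, not taken from bank.arff.
ages = [18, 21, 23, 35, 38, 41, 60, 63, 67]
for seed in (1, 2, 3, 10):
    centers, sse = kmeans_1d(ages, k=3, seed=seed)
    print(seed, centers, round(sse, 2))
```

Running the loop over several seeds and comparing the squared error is one simple way to "try different values and evaluate the results". The same seed always reproduces the same clustering.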
Once the options have been specified, we can run the clustering algorithm. Here we make sure
that in the "Cluster Mode" panel, the "Use training set" option is selected, and we click "Start".
We can right click the result set in the "Result list" panel and view the results of clustering in a
separate window. This process and the resulting window are shown in Figures 3 and 4.
Figure 3
Figure 4
The result window shows the centroid of each cluster as well as statistics on the number and
percentage of instances assigned to different clusters. Cluster centroids are the mean vectors for
each cluster (so, each dimension value in the centroid represents the mean value for that
dimension in the cluster). Thus, centroids can be used to characterize the clusters. For example,
the centroid for cluster 1 shows that this is a segment of cases representing young to middle-aged
(approx. 38) females living in the inner city with an average income of approx. $28,500, who are
married with one child, etc. Furthermore, this group has, on average, said YES to the PEP
product.
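As a rough illustration of how such a centroid can be derived, the following Python sketch takes the mean of each numerical attribute and, as an assumption for categorical attributes, the most frequent value (the mode). The three cluster members are hypothetical records chosen to resemble the segment described above. They are not taken from bank.arff.

```python
from collections import Counter
from statistics import mean

def centroid(instances):
    """Mean for numerical attributes, most frequent value for categorical ones."""
    result = {}
    for key in instances[0]:
        column = [inst[key] for inst in instances]
        if all(isinstance(v, (int, float)) for v in column):
            result[key] = mean(column)
        else:
            result[key] = Counter(column).most_common(1)[0][0]
    return result

# Hypothetical members of one cluster.
cluster_members = [
    {"age": 36, "sex": "FEMALE", "income": 27000.0, "pep": "YES"},
    {"age": 40, "sex": "FEMALE", "income": 30000.0, "pep": "YES"},
    {"age": 38, "sex": "MALE", "income": 28500.0, "pep": "NO"},
]
c = centroid(cluster_members)
print(c)
```

Reading the result attribute by attribute gives the kind of segment description used in the text: average age, dominant sex, average income, and the majority response to the PEP product.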
Another way of understanding the characteristics of each cluster is through visualization. We can
do this by right-clicking the result set on the left "Result list" panel and selecting "Visualize
cluster assignments". This pops up the visualization window as shown in Figure 5.
Figure 5
You can choose the cluster number and any of the other attributes for each of the three different
dimensions available (x-axis, y-axis, and color). Different combinations of choices will result in
a visual rendering of different relationships within each cluster. In the above example, we have
chosen the cluster number as the x-axis, the instance number (assigned by WEKA) as the y-axis,
and the "sex" attribute as the color dimension. This will result in a visualization of the
distribution of males and females in each cluster. For instance, you can note that clusters 2 and 3
are dominated by males, while clusters 4 and 5 are dominated by females. In this case, by
changing the color dimension to other attributes, we can see their distribution within each of the
clusters.
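The same distribution can also be checked numerically by counting attribute values per cluster. The Python sketch below uses a small hypothetical list of (cluster, sex) pairs. In practice these pairs would come from the saved clustering result, and the helper name is illustrative.

```python
from collections import Counter

# Hypothetical (cluster, sex) pairs, not taken from the actual result.
assignments = [
    ("cluster2", "MALE"), ("cluster2", "MALE"), ("cluster2", "FEMALE"),
    ("cluster4", "FEMALE"), ("cluster4", "FEMALE"), ("cluster4", "MALE"),
]

# Count how often each (cluster, sex) combination occurs.
counts = Counter(assignments)

def dominant_value(cluster):
    """Return the most frequent attribute value within one cluster."""
    within = {value: n for (c, value), n in counts.items() if c == cluster}
    return max(within, key=within.get)

for cluster in sorted({c for c, _ in assignments}):
    print(cluster, dominant_value(cluster))
```

Replacing "sex" with another attribute in the pairs gives the counterpart of changing the color dimension in the visualization window.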
Finally, we may be interested in saving the resulting data set, which includes each instance along
with its assigned cluster. To do so, we click the "Save" button in the visualization window and
save the result as the file "bank-kmeans.arff". The top portion of this file is depicted in Figure 6.
Figure 6
Note that in addition to the "instance_number" attribute, WEKA has also added a "Cluster"
attribute to the original data set. In the data portion, each instance now has its assigned cluster as
the last attribute value. By doing some simple manipulation of this data set, we can easily convert
it to a more usable form for additional analysis or processing. For example, here we have
converted this data set into a comma-separated format and sorted the result by cluster.
Furthermore, we have added the ID field from the original data set (before sorting). The results
of these steps can be seen in the file "bank-kmeans.csv".
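The conversion and sorting steps can be scripted. The following is a minimal Python sketch, assuming a simple dense ARFF file with no quoted values or commas inside values. The sample content, the attribute spellings and the function names are hypothetical. The real "bank-kmeans.arff" contains more attributes.

```python
import csv
import io

# Hypothetical excerpt in the shape of a saved clustering result.
sample_arff = """\
@relation bank_clustered
@attribute Instance_number numeric
@attribute age numeric
@attribute sex {FEMALE,MALE}
@attribute Cluster {cluster0,cluster1,cluster2}
@data
0,48,FEMALE,cluster1
1,40,MALE,cluster2
2,51,FEMALE,cluster0
3,23,FEMALE,cluster1
"""

def arff_to_rows(text):
    """Return (attribute names, data rows) from a simple dense ARFF text."""
    names, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if in_data:
            rows.append(line.split(","))
        elif lower.startswith("@attribute"):
            names.append(line.split()[1])
        elif lower.startswith("@data"):
            in_data = True
    return names, rows

def to_sorted_csv(names, rows, sort_by="Cluster"):
    """Write a header plus the rows sorted by the given attribute."""
    idx = names.index(sort_by)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(names)
    writer.writerows(sorted(rows, key=lambda r: r[idx]))
    return out.getvalue()

names, rows = arff_to_rows(sample_arff)
csv_text = to_sorted_csv(names, rows)
print(csv_text)
```

To work on a real file, the sample string would be replaced by the contents of the saved ARFF file, and the returned text written to a ".csv" file.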