Experimental Data Processing
Data mining (Ch. 9 - Ch. 10)
Using Weka
Major: Interdisciplinary program of the integrated biotechnology
Graduate school of bio- & information technology
Youngil Lim (N110), Lab. FACS
Phone: +82 31 670 5200 (secretary), +82 31 670 5207 (direct)
Fax: +82 31 670 5445, mobile phone: +82 10 7665 5207
Email: [email protected], homepage: http://facs.maru.net
Overview of this lecture
- Machine learning = the acquisition of structural descriptions automatically or semi-automatically (similar to how the brain develops by repeating experiences)
- Weka is written in Java (an object-oriented programming language); Java is free on any OS, and its calculation is 2-3 times slower than C, C++, and Fortran
- The Java compiler produces byte-code, and the Java virtual machine translates the byte-code into machine code
[Diagram] Information (data, database) → data mining (extraction of useful information) → knowledge (understanding, application, prediction). The input is the subject of Ch. 2 and the output of Ch. 3; the relationships and structural patterns are found by modeling, with machine learning as the technical tool.
Outline of this lecturePart I. Machine learning tools and techniques
- Level 1: Ch 1. Applications, common problems
Ch 2. Input, concepts, instances and attributes
Ch 3. Output, knowledge representation
- Level 2: Ch 4. Numerical algorithms, the basic methods
- Level 3: Ch 5-6 (advanced topics)
Part II. Weka manual (ftp://facs/lim/lecture_related/weka3.4.exe)
- Level 1: Ch 9. Introduction of Weka
Ch 10. Explorer
- Level 2: Ch 11-15 (advanced options in Weka)
However, you need to read those chapters to write a paper on data mining.
Ch. 9. Introduction to Weka
- No single ML scheme is appropriate for all DM problems
- DM is an experimental science.
- Weka is a collection of state-of-the-art ML algorithms
- Weka includes:
1) input data preparation (ARFF)
2) evaluation of various learning algorithms
3) input data visualization
4) visualization of ML results
Introduction
- Weka workbench includes methods for DM
1) regression (numerical prediction)
2) classification
3) clustering
4) association rule
5) attribute-selection
9.1 What’s in Weka
- classifier: learning methods (or algorithms)
- object editor: adjustment of tunable parameters of the classifier
- filter: tools for data preparation (filtering algorithm)
- 4 graphical user interfaces (GUI) of Weka
1) Explorer (for small/medium data size): main GUI Ch. 10
2) Knowledge flow (for large data sets): design of configurations for streamed data processing and incremental learning. Ch. 11
3) Experimenter: automatic running of classifier and filter with different parameter settings, parallel computing. Ch. 12
4) command-line interface in JAVA. Ch. 13
Terminology and components
9.2 How do you use it?
Weka GUI chooser
Ch. 10. The Explorer
10.1 Getting started
10.2 Exploring the explorer
10.3 Filtering algorithms
10.4 Learning algorithms
10.5 Meta-learning algorithms
10.6 Clustering algorithms
10.7 Association-rule learners
10.8 Attribute selection
This lecture does not cover Ch. 10.6 and Ch. 10.7.
Ch. 10. The Explorer
Build a decision tree from the data:
• Prepare the data (comma-separated value format)
• Fire up Weka
• Load the data
• Select a decision tree construction method
• Build a tree
• Interpret the output
Procedure of using Explorer
- ARFF (attribute-relation file format) by default; the data section is in comma-separated value format
- tags:
1) @relation (data title)
2) @attribute (variables)
3) @data (instances)
10.1 Getting started
(1) Preparing the data
Open <weather.arff> using MS Word or another text editor.
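For reference, the beginning of weather.arff looks like this (abridged; the full file ships with Weka):

    @relation weather

    @attribute outlook {sunny, overcast, rainy}
    @attribute temperature real
    @attribute humidity real
    @attribute windy {TRUE, FALSE}
    @attribute play {yes, no}

    @data
    sunny,85,85,FALSE,no
    sunny,80,90,TRUE,no
    overcast,83,86,FALSE,yes
    ...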
10.1.1 Loading the data into the Explorer
10.1.2 Building a decision tree
Class attribute (dependent variable)
10.1.3 Test options
The result of applying the chosen classifier will be tested according to the options that are set by clicking in the Test options box. There are four test modes:
1) Use training set. The classifier is evaluated on how well it predicts the class of the instances it was trained on.
2) Supplied test set. The classifier is evaluated on how well it predicts the class of a set of instances loaded from a file. Clicking the Set... button brings up a dialog allowing you to choose the file to test on.
3) Cross-validation. The classifier is evaluated by cross-validation, using the number of folds that are entered in the Folds text field.
4) Percentage split. The classifier is evaluated on how well it predicts a certain percentage of the data which is held out for testing. The amount of data held out depends on the value entered in the % field.
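As a minimal Java sketch of test mode 4, percentage split (the class name, file name, and 66% split are illustrative):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class SplitDemo {
        public static void main(String[] args) throws Exception {
            Instances data = new Instances(new BufferedReader(new FileReader("weather.arff")));
            data.setClassIndex(data.numAttributes() - 1);  // class = last attribute
            data.randomize(new java.util.Random(1));       // shuffle before splitting
            int trainSize = (int) Math.round(data.numInstances() * 0.66);
            Instances train = new Instances(data, 0, trainSize);  // 66% for training
            Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);
            J48 tree = new J48();
            tree.buildClassifier(train);
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(tree, test);                // evaluate on the held-out data
            System.out.println(eval.toSummaryString());
        }
    }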
10.1.4 Cross-validation
- Cross-validation (repeated holdout; see the code sketch below the diagram):
1) fold: the number of partitions of the data
2) 10-fold cross-validation is generally used for a single, fixed dataset
3) divide the data randomly into 10 parts
4) 9 parts are used for training and 1 part for testing
5) measure its error rate
6) repeat the cross-validation 10 times on different training sets
7) the overall error is the average of the 10 error rates
[Diagram] The dataset is split into a training set (9/10 of the data) and a validation (testing) set (1/10 of the data); new data form the prediction set. Random sampling with stratification is required!
Witten & Frank (2005), Data Mining, pp. 149-151
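The same 10-fold cross-validation can be run from Java code; a minimal sketch (class and file names are illustrative):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class CrossValidationDemo {
        public static void main(String[] args) throws Exception {
            Instances data = new Instances(new BufferedReader(new FileReader("weather.arff")));
            data.setClassIndex(data.numAttributes() - 1);  // class = last attribute
            Evaluation eval = new Evaluation(data);
            // 10-fold cross-validation (stratified for a nominal class);
            // the Random seed controls how the folds are shuffled
            eval.crossValidateModel(new J48(), data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }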
10.1.5 Examining the output
10.1.6 Doing it again
10.1.7 Working with models
Exercise: analyze the iris dataset!
- Load iris data into Weka
- Find the classification rule
- Visualize the decision tree
- Visualize threshold curve
10.1.8 When things go wrong
To see the error message, click the Log button.
What's going on? (check the memory available)
10.2 Exploring the Explorer
- There are six tabs:
1) Preprocess: choose the dataset and modify it
2) Classify: train learning schemes that perform classification or regression, and evaluate them
3) Cluster: learn clusters for the dataset
4) Associate: learn association rules for the data and evaluate them
5) Select attributes: select the most relevant aspects of the dataset
6) Visualize: view different two-dimensional plots of the data and interact with them
10.2 Exploring the Explorer
- The bird (the weka) dances when Weka is active.
- The number beside the bird shows how many concurrent processes are running.
- The bird sits when Weka is inactive.
- If the bird is standing but stops moving, it is sick: something has gone wrong, and the Explorer should be restarted.
10.2.1 Converting files to ARFF
- 3 file converters to ARFF:
1) .csv (comma-separated values) → .arff
2) .names and .data (C4.5's native format) → .arff
3) .bsi (binary serialized instances) → .arff
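The CSV converter, for example, can also be invoked from the command line, assuming weka.jar is on the classpath (data.csv is a placeholder name); it writes the ARFF version to standard output:

    java weka.core.converters.CSVLoader data.csv > data.arff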
10.2.2 Using filters
Unsupervised attribute filter
10.2.3 Training and testing learning schemes
Open file: cpu.with.vendor.arff
10.2.3 Training and testing learning schemes
Classifiers>trees>M5P
10.2.3 Training and testing learning schemes
How many leaves? How many nodes?
10.2.3 Training and testing learning schemes
It gives a single linear regression model rather than the two linear models of trees>M5P.
Classifiers>functions>LinearRegression
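A minimal Java sketch that builds both models on the CPU data for comparison (class and file names are illustrative; package paths as in recent Weka releases):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.classifiers.functions.LinearRegression;
    import weka.classifiers.trees.M5P;
    import weka.core.Instances;

    public class CpuModels {
        public static void main(String[] args) throws Exception {
            Instances data = new Instances(new BufferedReader(new FileReader("cpu.with.vendor.arff")));
            data.setClassIndex(data.numAttributes() - 1);  // class = performance

            M5P m5p = new M5P();                           // model tree: linear models at the leaves
            m5p.buildClassifier(data);
            System.out.println(m5p);

            LinearRegression lr = new LinearRegression();  // one global linear model
            lr.buildClassifier(data);
            System.out.println(lr);
        }
    }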
10.2.4 Visualizing error
Which is better, M5P or linear regression?
10.2.5 Do it yourself: the user classifier
- Open data>segment-challenge.arff
- Segment the visual image data into classes (grass, sky, cement …)
Classifiers>Trees>UserClassifier
data>segment-test.arff
10.2.5 Do it yourself: the user classifier
The goal is to find a combination that separates the classes as clearly as possible.
Change the X and Y axes!
10.2.5 Do it yourself: the user classifier
Specify a region in the graph:
1. Select instance
2. Rectangle
3. Polygon
4. Polyline
- Clear: clear the selection
- Save: save the instances in the current tree node as an ARFF file
10.2.5 Do it yourself: the user classifier
Accept the tree (right-click on any blank space).
Building trees manually is very tedious
Correctly classified instances: 40%; incorrectly classified: 60%
10.2.6 Using a metalearner
A metalearner takes a simple classifier and turns it into a more powerful learner.
Boosting decision stumps up to 10 times
Adaptive boosting
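A minimal Java sketch of this configuration, AdaBoostM1 boosting a decision stump up to 10 times (class and file names are illustrative):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.classifiers.meta.AdaBoostM1;
    import weka.classifiers.trees.DecisionStump;
    import weka.core.Instances;

    public class BoostingDemo {
        public static void main(String[] args) throws Exception {
            Instances data = new Instances(new BufferedReader(new FileReader("weather.arff")));
            data.setClassIndex(data.numAttributes() - 1);
            AdaBoostM1 booster = new AdaBoostM1();
            booster.setClassifier(new DecisionStump());  // weak one-level tree as the base learner
            booster.setNumIterations(10);                // boost the stump up to 10 times
            booster.buildClassifier(data);
            System.out.println(booster);
        }
    }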
10.2.6 Using a metalearner
10.2.7 Clustering and association rules
We skip clustering and association rules, so Sections 10.6 and 10.7 are also skipped.
In Ch. 4, Sections 4.5 and 4.8 are also skipped.
10.2.8 Attribute selection
We will learn more in Ch. 10.8.
10.2.9 Visualization of data
2D scatter plots of every pair of attributes
10.3 Filtering algorithms
- Filtering of the data (= attributes + instances)
- All filters transform the input dataset in some way.
- Two kinds of filter:
1) supervised (Section 7.2): to be used carefully
2) unsupervised
- Each kind is further divided into attribute filters and instance filters:
1) attribute filter: works on the attributes of the data
2) instance filter: works on the instances of the data
- See Section 7.3:
1) PCA (principal component analysis)
2) random projections
10.3.1 Adding and removing attributes
- Add: insert a new attribute whose values are all empty.
- Copy: copy existing attributes and their values.
- Remove: the same as the <remove> tab.
- RemoveType: remove all attributes of a given type, such as nominal, numeric, string, or date.
- AddCluster: apply a clustering algorithm to the data before filtering it (see Section 10.6).
- AddExpression: create a new attribute by applying a mathematical function to numeric attributes, e.g., a1^2*a5/log(4*a7).
- NumericTransform: performs an arbitrary transformation by applying a given Java function to selected numeric attributes.
- Normalize: scale all numeric values to lie between 0 and 1 (see the sketch after this list).
- Standardize: transform all numeric values to have zero mean and unit variance.
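A minimal Java sketch of applying an unsupervised attribute filter (file and class names are illustrative); the same pattern works for the other filters listed above:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Normalize;

    public class FilterDemo {
        public static void main(String[] args) throws Exception {
            Instances data = new Instances(new BufferedReader(new FileReader("cpu.arff")));
            Normalize norm = new Normalize();   // scales numeric values into [0, 1]
            norm.setInputFormat(data);          // must be called before filtering
            Instances normalized = Filter.useFilter(data, norm);
            System.out.println(normalized.toSummaryString());
        }
    }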
10.3.2 Changing values
- SwapValues: just change the position of two values of a nominal attribute (it does not affect learning at all).
- MergeTwoValues: merge values of a nominal attribute into a single category.
- ReplaceMissingValues: replace each missing value with the mean (numeric attributes) or the mode (nominal attributes).
1) if a class is set, missing values of that attribute are not replaced.
10.3.3 Conversions
- Discretize: change a numeric attribute into a nominal one (Section 7.2); see the sketch after this list.
1) equal-width binning
2) equal-frequency binning
- PKIDiscretize: discretize numeric attributes using equal-frequency binning, where the number of bins is the square root of the number of values; e.g., 83 instances without missing values are binned into 9 bins.
- MakeIndicator: convert a nominal attribute into a binary indicator attribute; this is necessary when a numeric attribute is required by an ML scheme.
- NominalToBinary: transform all multivalued nominal attributes in a dataset into binary ones (a k-valued attribute → k binary attributes).
- NumericToBinary: convert all numeric attributes into binary ones (if the numeric value is 0, the binary value is 0; otherwise, it is 1).
- FirstOrder: take differences between two attribute values.
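A sketch of Discretize from Java, assuming an Instances object has been loaded as in the earlier sketches (the option values are illustrative):

    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Discretize;

    public class DiscretizeDemo {
        // returns a copy of 'data' with numeric attributes binned into nominal ones
        static Instances discretize(Instances data) throws Exception {
            Discretize disc = new Discretize();
            disc.setBins(10);                 // 10 bins; equal-width by default
            disc.setUseEqualFrequency(true);  // switch to equal-frequency binning
            disc.setInputFormat(data);
            return Filter.useFilter(data, disc);
        }
    }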
10.3.4 String conversions
- StringToNominal: convert a string attribute to a nominal one.
- StringToWordVector: convert a string attribute into a set of attributes representing word occurrences.
10.3.5 Time series
For time-series data,
- TimeSeriesTranslate: replace attribute values in the current instance with the equivalent attribute values of some previous (or future) instance.
- TimeSeriesDelta: replace attribute values in the current instance with the difference between the current value and the equivalent attribute value of some previous (or future) instance.
10.3.6 Randomizing
These filters change values of the data.
- AddNoise: introduces noise into the data; it takes a nominal attribute and changes a given percentage of its values to other values.
- Obfuscate: rename the attributes and anonymize the data.
- RandomProjection: see Section 7.3.
10.3.7 Unsupervised instance filters
- Attribute filters: affect all values of an attribute (a column of the data)
- Instance filters: affect all values of an instance (a row of the data)
10.3.8 Randomizing and subsampling
- Randomize: the order of the instances is randomized.
- Normalize: all numeric attributes are treated as a vector and normalized to a given length.
- Resample: produces a random sample by sampling with replacement.
- RemoveFolds: first splits the data into a given number of cross-validation folds and then reduces it to just one of them. If a random-number seed is provided, the dataset is shuffled before the subset is extracted.
- RemovePercentage: removes a given percentage of instances.
- RemoveRange: removes a certain range of instance numbers.
- RemoveWithValues: removes all instances that have certain values, or values above or below a certain threshold.
10.3.9 Sparse instances
- NonSparseToSparse: convert instances to the sparse representation.
- SparseToNonSparse: convert sparse instances back to the standard representation.
10.3.10 Supervised filters
- Supervised filters are affected by the class attribute.
- There are two categories of supervised filters:
1) attribute
2) instance
- You need to be careful with them because they are not really preprocessing operations.
- Discretize: see Section 7.2.
- NominalToBinary: see Section 6.5.
- ClassOrder: changes the ordering of the class values.
- Resample: like the unsupervised instance filter, except that it maintains the class distribution in the subsample.
…..
10.4 Learning algorithms
- There are 7 categories of classifiers:
1) Bayesian: document classification (e.g., Google search)
2) Trees: decision trees, divide-and-conquer (stump, node, leaf, model tree)
3) Rules: covering approach (excluding instances); a decision tree can be converted into a set of logical expressions
4) Functions: linear and nonlinear models
5) Lazy (instance-based learning): distance functions
6) Metalearning algorithms: more powerful learners
7) Miscellaneous: diverse others
- Ch. 4 and Ch. 6 cover these algorithms.
10.4.2 Decision trees
- J48 (see Sections 6.1 and 6.2): a reimplementation of C4.5, the outcome of over 20 years of work by Quinlan (1993).
Options (set via the object editor; see the sketch below):
- confidence threshold for pruning
- minimum number of instances permissible at a leaf
- size of the pruning set: the data is divided equally into that number of parts and the last one is used for pruning
This algorithm is valid for a nominal class attribute.
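These options correspond to J48's parameters; a Java sketch (the values shown are illustrative):

    import weka.classifiers.trees.J48;

    public class J48Options {
        static J48 configure() {
            J48 tree = new J48();
            tree.setConfidenceFactor(0.25f);    // confidence threshold for pruning
            tree.setMinNumObj(2);               // minimum number of instances at a leaf
            tree.setReducedErrorPruning(true);  // prune on a separate held-out part of the data
            tree.setNumFolds(3);                // split into 3 parts; the last is used for pruning
            return tree;
        }
    }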
10.4.2 Decision trees
- Id3 (see Ch. 4): basic divide-and-conquer decision tree algorithm; it handles only nominal attributes.
- DecisionStump: designed for use with boosting methods; it builds one-level binary decision trees.
- RandomTree: constructs a tree that considers a given number of random features at each node, performing no pruning.
- RandomForest (see Section 7.5): constructs random forests by bagging ensembles of random trees.
- REPTree (see Section 6.2): builds a decision or regression tree using information gain/variance reduction.
- NBTree: a hybrid between decision trees and Naïve Bayes.
- M5P (see Section 6.5; Quinlan, 1992): model tree learner with linear models at the leaves.
- LMT (see Section 7.5): builds logistic model trees for a nominal class.
- ADTree (see Section 7.5): alternating decision trees using boosting.
10.4.3 Classification rules
- ConjunctiveRule
- DecisionTable
- JRip
- M5Rules
- NNge
- OneR
- PART
- Prism
- Ridor
- ZeroR: predicts the test data's majority class (if nominal) or average value (if numeric).
10.4.4 Functions
- Bayesian methods have a simple mathematical formulation.
- Decision trees and rules can embed linear regression models.
- Functions give us more complicated mathematical models.
We focus on
1) linear regression model
2) Support vector machine algorithm
3) Neural network
10.4.4 Functions
- SimpleLinearRegression: a linear regression model based on a single attribute.
- LinearRegression: a linear regression model over all numeric attributes.
- LeastMedSq: implements least-median-squared linear regression, utilizing the existing Weka LinearRegression class to form predictions (the solution has the smallest median squared error).
- SMO (see Section 6.3): implements the sequential minimal optimization algorithm for training a support vector classifier.
- SMOreg: support vector machine algorithm for a numeric class attribute.
- Linear model: a truly linear relationship between the attributes.
- Support vector machine: a linear model for nonlinear class boundaries.
10.4.4 Functions
- VotedPerceptron (see Section 6.3): the voted perceptron algorithm; globally replaces all missing values and transforms nominal attributes into binary ones.
- Winnow (see Section 4.6): modifies the basic perceptron to use multiplicative updates.
- PaceRegression: builds linear regression models using the new technique of pace regression (Wang & Witten, 2002).
- SimpleLogistic (see Section 4.6): builds logistic regression models.
- RBFNetwork (see Section 6.3): implements a Gaussian radial basis function network, one kind of linear regression model.
10.4.5 Functions: artificial neural network (ANN)
Najjar, Y.M., I.A. Basheer and M.N. Hajmeer (1997), Computational neural networks for predictive microbiology: I. methodology, Int. Journal of Food Microbiology, 34, 27-49.
10.4.5 Functions: neural networks

[Diagram] A feed-forward network: the inputs p1, p2, ..., pr are combined through the weights w1,1 ... w1,r and the bias b1 into n1, which the transfer function f(n) turns into the hidden-layer activation a1; the output layer combines a1 through the weight w2,1 and the bias b2 into n2 and the output a2. Each such unit is a neuron.

a1 = f_sigmoid(w1 p + b1)
a2 = f_purelin(w2 a1 + b2)

The purely linear output layer corresponds to a linear regression model.
10.4.5 Functions: neural networks

a1 = f_sigmoid(w1 p + b1), where f_sigmoid(n) = 1/(1 + e^(-n)) rises from 0 to 1 (a = 0.5 at n = 0)
a2 = f_purelin(w2 a1 + b2), where f_purelin(n) = n is a straight line

Worked example for one neuron, with p = [1, 1, 2, 3], w = [0, 0.2, -0.3, 0.5], b = -2.5:

n1 = sum_{i=1..4} w_{1,i} p_i + b1 = 0(1) + 0.2(1) - 0.3(2) + 0.5(3) - 2.5 = -1.4
a1 = 1/(1 + e^(-n1)) = 1/(1 + e^(1.4)) = 0.1978

Exercise: with p = [1, 0, 2], w = [0.2, 0.6, 0.2], b = 0.5, a1 = ?
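A one-neuron forward pass in Java, a minimal sketch of the computation above (it prints the answer to the exercise, so try it by hand first):

    public class Neuron {
        // logistic sigmoid transfer function: a = 1/(1 + e^(-n))
        static double sigmoid(double n) {
            return 1.0 / (1.0 + Math.exp(-n));
        }

        // one neuron: n = w . p + b, a = f_sigmoid(n)
        static double activate(double[] w, double[] p, double b) {
            double n = b;
            for (int i = 0; i < w.length; i++) n += w[i] * p[i];
            return sigmoid(n);
        }

        public static void main(String[] args) {
            // exercise values from the slide
            double[] w = {0.2, 0.6, 0.2};
            double[] p = {1, 0, 2};
            System.out.println(activate(w, p, 0.5));
        }
    }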
10.4.5 Functions: neural networks
- MultilayerPerceptron (see Section 6.3): a nonlinear algorithm; it trains by back-propagation and validates (calculates) by feed-forward.
Set GUI to True to see ANN
Adds and connects up hidden layers in the network.
Filtering the data
The amount the weights are updated.
Momentum applied to the weights during updating.
10.4.5 Functions: neural networks
- MultilayerPerceptron (continued)
Define the structure and number of nodes of the hidden layers: e.g., "4,5" gives two hidden layers with 4 and 5 neurons, respectively.
The number of epochs to train through
The percentage size of the validation set
Used to terminate validation testing. The value here dictates how many times in a row the validation set error can get worse before training is terminated.
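These options can also be set programmatically; a sketch (the option values are illustrative):

    import weka.classifiers.functions.MultilayerPerceptron;

    public class MlpOptions {
        static MultilayerPerceptron configure() {
            MultilayerPerceptron mlp = new MultilayerPerceptron();
            mlp.setGUI(true);               // show the network GUI during training
            mlp.setHiddenLayers("4,5");     // two hidden layers with 4 and 5 neurons
            mlp.setLearningRate(0.3);       // amount the weights are updated
            mlp.setMomentum(0.2);           // momentum applied during updating
            mlp.setTrainingTime(500);       // number of epochs to train through
            mlp.setValidationSetSize(10);   // percentage size of the validation set
            mlp.setValidationThreshold(20); // consecutive worsenings allowed before stopping
            return mlp;
        }
    }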
10.4.5 Functions: neural networks - CPU problem
Using two hidden layers with 4 and 5 neurons, find the correlation coefficient of this ANN.
Options: 10-fold cross-validation, 500 epochs. Open file: CPU.arff

10.4.5 Functions: neural networks - CPU problem
Classify: Functions>MultilayerPerceptron, with 10-fold cross-validation.

10.4.5 Functions: neural networks - CPU problem
Set the options.

10.4.5 Functions: neural networks - CPU problem
After Start in the main Weka window, you can see the ANN GUI.
1. Click "Accept"
2. Click "Start"
3. Repeat until the 10-fold run finishes

10.4.5 Functions: neural networks - CPU problem
You can see the cross-validation counter increase.
The bird will dance until the 10-fold validation is over.

10.4.5 Functions: neural networks - CPU problem
After the 10-fold validation, the results appear in the main Weka window.
You can now analyze the results.
10.5 Meta-learning algorithms
For powerful usage of ML, please read this section!
We skip Sections 10.6 and 10.7.
10.8 Attribute selection
It works by selecting one of the attribute evaluators and, at the same time, one of the search methods.
10.8.1 Attribute subset evaluators
Subset evaluators take a subset of attributes and return a numeric measure that guides the search.
- CfsSubsetEval (see Section 7.1): finds attributes that are highly correlated with the class but have low inter-correlation among themselves; one of the filter methods.
- ConsistencySubsetEval: one of the filter methods.
- ClassifierSubsetEval: one of the wrapper methods; it uses a classifier to evaluate sets of attributes on the training data.
10.8.2 Single-attribute evaluators
Single-attribute evaluators should be used with the Ranker search method.
- InfoGainAttributeEval: it evaluates attributes by measuring their information gain w.r.t. the class.
- ChiSquaredAttributeEval: it evaluates attributes by computing the chi-squared statistic w.r.t. the class.
- SVMAttributeEval: evaluation with a linear support vector machine.
- PrincipalComponents (see section 7.3): PCA
10.8.3 Search methods
- Search methods are optimization algorithms used to find a good subset.
- Subset evaluators are the objective functions to be optimized (see the sketch below).
- GreedyStepwise: greedy forward (or backward) search through the space of attribute subsets.
- GeneticSearch: search using a simple genetic algorithm.
- RankSearch: ranks attributes with a single-attribute evaluator and then evaluates subsets of increasing size.
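A minimal Java sketch combining an evaluator and a search method (class and file names are illustrative):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.CfsSubsetEval;
    import weka.attributeSelection.GreedyStepwise;
    import weka.core.Instances;

    public class AttrSelectDemo {
        public static void main(String[] args) throws Exception {
            Instances data = new Instances(new BufferedReader(new FileReader("weather.arff")));
            data.setClassIndex(data.numAttributes() - 1);
            AttributeSelection sel = new AttributeSelection();
            sel.setEvaluator(new CfsSubsetEval());  // subset evaluator = objective function
            sel.setSearch(new GreedyStepwise());    // search method = optimizer
            sel.SelectAttributes(data);             // run the selection
            System.out.println(sel.toResultsString());
        }
    }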