Experimental Data Processing
Data mining (Ch. 9 - Ch. 10)
Using Weka
Major: Interdisciplinary program of the integrated biotechnology
Graduate school of bio- & information technology
Youngil Lim (N110), Lab. FACS
Phone: +82 31 670 5200 (secretary), +82 31 670 5207 (direct)
Fax: +82 31 670 5445, mobile phone: +82 10 7665 5207
Email: [email protected], homepage: http://facs.maru.net
Overview of this lecture
- Machine learning = the acquisition of structural descriptions automatically or semi-automatically (similar to how the brain develops by repeating experiences)
- Weka is written in Java (an object-oriented programming language); Java is free on any OS, and its calculation is 2-3 times slower than C, C++, and Fortran
- The Java compiler produces byte-code, and the Java virtual machine translates the byte-code into machine code
[Diagram] Information (data, database) → data mining (extraction of useful information) → knowledge (understanding, application, prediction). The input is the subject of Ch. 2 and the output of Ch. 3; the relationships and structural patterns are found by modeling, with machine learning as the technical tool.
Outline of this lecturePart I. Machine learning tools and techniques
- Level 1: Ch 1. Applications, common problems
Ch 2. Input, concepts, instances and attributes
Ch 3. Output, knowledge representation
- Level 2: Ch 4. Numerical algorithms, the basic methods
- Level 3: Ch 5-6 (advanced topics)
Part II. Weka manual (ftp://facs/lim/lecture_related/weka3.4.exe)
- Level 1: Ch 9. Introduction of Weka
Ch 10. Explorer
- Level 2: Ch 11-15 (advanced options in Weka)
However, you need to read those chapters to write a paper on data mining.
Ch. 9. Introduction to Weka
- No single ML scheme is appropriate for all DM problems
- DM is an experimental science.
- Weka is a collection of state-of-the-art ML algorithms
- Weka includes:
1) input data preparation (ARFF)
2) evaluation of various learning algorithms
3) input data visualization
4) visualization of ML results
Introduction
- Weka workbench includes methods for DM
1) regression (numerical prediction)
2) classification
3) clustering
4) association rule
5) attribute-selection
9.1 What’s in Weka
- classifier: learning methods (or algorithms)
- object editor: adjustment of tunable parameters of the classifier
- filter: tools for data preparation (filtering algorithm)
- 4 graphical user interfaces (GUI) of Weka
1) Explorer (for small/medium data size): main GUI Ch. 10
2) Knowledge flow (for large data sets): design of configurations for streamed data processing and incremental learning. Ch. 11
3) Experimenter: automatic running of classifier and filter with different parameter settings, parallel computing. Ch. 12
4) command-line interface in JAVA. Ch. 13
Terminology and components
9.2 How do you use it?
Weka GUI chooser
Ch. 10. The Explorer
10.1 Getting started
10.2 Exploring the explorer
10.3 Filtering algorithms
10.4 Learning algorithms
10.5 Meta-learning algorithms
10.6 Clustering algorithms
10.7 Association-rule learners
10.8 Attribute selection
This lecture does not cover Ch. 10.6 and Ch. 10.7.
Ch. 10. The Explorer
Build a decision tree from the data:
• Prepare the data (comma-separated value format)
• Fire up Weka
• Load the data
• Select a decision tree construction method
• Build a tree
• Interpret the output
Procedure of using Explorer
- ARFF (attribute-relation file format) by default; the data section is in comma-separated value format
- tags:
1) @relation (data title)
2) @attribute (variables)
3) @data (instances)
10.1 Getting started
(1) Preparing the data
Open <weather.arff> using MS Word or another text editor.
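For reference, the beginning of weather.arff looks like this (abridged; the full file ships with Weka):

    @relation weather

    @attribute outlook {sunny, overcast, rainy}
    @attribute temperature real
    @attribute humidity real
    @attribute windy {TRUE, FALSE}
    @attribute play {yes, no}

    @data
    sunny,85,85,FALSE,no
    sunny,80,90,TRUE,no
    overcast,83,86,FALSE,yes
    ...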
10.1.1 Loading the data into the Explorer
10.1.2 Building a decision tree
Class attribute (dependent variable)
10.1.3 Test options
The result of applying the chosen classifier will be tested according to the options that are set by clicking in the Test options box. There are four test modes:
1) Use training set. The classifier is evaluated on how well it predicts the class of the instances it was trained on.
2) Supplied test set. The classifier is evaluated on how well it predicts the class of a set of instances loaded from a file. Clicking the Set... button brings up a dialog allowing you to choose the file to test on.
3) Cross-validation. The classifier is evaluated by cross-validation, using the number of folds that are entered in the Folds text field.
4) Percentage split. The classifier is evaluated on how well it predicts a certain percentage of the data which is held out for testing. The amount of data held out depends on the value entered in the % field.
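As a minimal Java sketch of test mode 4, percentage split (the class name, file name, and 66% split are illustrative):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class SplitDemo {
        public static void main(String[] args) throws Exception {
            Instances data = new Instances(new BufferedReader(new FileReader("weather.arff")));
            data.setClassIndex(data.numAttributes() - 1);  // class = last attribute
            data.randomize(new java.util.Random(1));       // shuffle before splitting
            int trainSize = (int) Math.round(data.numInstances() * 0.66);
            Instances train = new Instances(data, 0, trainSize);  // 66% for training
            Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);
            J48 tree = new J48();
            tree.buildClassifier(train);
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(tree, test);                // evaluate on the held-out data
            System.out.println(eval.toSummaryString());
        }
    }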
10.1.4 Cross-validation
- Cross-validation (repeated holdout; see the code sketch below the diagram):
1) fold: the number of partitions of the data
2) 10-fold cross-validation is generally used for a single, fixed dataset
3) divide the data randomly into 10 parts
4) 9 parts are used for training and 1 part for testing
5) measure its error rate
6) repeat the cross-validation 10 times on different training sets
7) the overall error is the average of the 10 error rates
[Diagram] The dataset is split into a training set (9/10 of the data) and a validation (testing) set (1/10 of the data); new data form the prediction set. Random sampling with stratification is required!
Witten & Frank (2005), Data Mining, pp. 149-151
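The same 10-fold cross-validation can be run from Java code; a minimal sketch (class and file names are illustrative):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class CrossValidationDemo {
        public static void main(String[] args) throws Exception {
            Instances data = new Instances(new BufferedReader(new FileReader("weather.arff")));
            data.setClassIndex(data.numAttributes() - 1);  // class = last attribute
            Evaluation eval = new Evaluation(data);
            // 10-fold cross-validation (stratified for a nominal class);
            // the Random seed controls how the folds are shuffled
            eval.crossValidateModel(new J48(), data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }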
10.1.5 Examining the output
10.1.6 Doing it again
10.1.7 Working with models
Exercise: analyze the iris dataset!
- Load iris data into Weka
- Find the classification rule
- Visualize the decision tree
- Visualize threshold curve
10.1.8 When things go wrong
To see the error message, click the Log button.
What's going on? (check the memory available)
10.2 Exploring the Explorer
- There are six tabs:
1) Preprocess: choose the dataset and modify it
2) Classify: train learning schemes that perform classification or regression, and evaluate them
3) Cluster: learn clusters for the dataset
4) Associate: learn association rules for the data and evaluate them
5) Select attributes: select the most relevant aspects of the dataset
6) Visualize: view different two-dimensional plots of the data and interact with them
10.2 Exploring the Explorer
- The bird (the weka) dances when Weka is active.
- The number beside the bird shows how many concurrent processes are running.
- The bird sits when Weka is inactive.
- If the bird is standing but stops moving, it is sick: something has gone wrong, and the Explorer should be restarted.
10.2.1 Converting files to ARFF
- 3 file converters to ARFF:
1) .csv (comma-separated values) → .arff
2) .names and .data (C4.5's native format) → .arff
3) .bsi (binary serialized instances) → .arff
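The CSV converter, for example, can also be invoked from the command line, assuming weka.jar is on the classpath (data.csv is a placeholder name); it writes the ARFF version to standard output:

    java weka.core.converters.CSVLoader data.csv > data.arff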
10.2.2 Using filters
Unsupervised attribute filter
10.2.3 Training and testing learning schemes
Open file: cpu.with.vendor.arff
10.2.3 Training and testing learning schemes
Classifiers>trees>M5P
10.2.3 Training and testing learning schemes
How many leaves? How many nodes?
10.2.3 Training and testing learning schemes
It gives a single linear regression model rather than the two linear models of trees>M5P.
Classifiers>functions>LinearRegression
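A minimal Java sketch that builds both models on the CPU data for comparison (class and file names are illustrative; package paths as in recent Weka releases):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.classifiers.functions.LinearRegression;
    import weka.classifiers.trees.M5P;
    import weka.core.Instances;

    public class CpuModels {
        public static void main(String[] args) throws Exception {
            Instances data = new Instances(new BufferedReader(new FileReader("cpu.with.vendor.arff")));
            data.setClassIndex(data.numAttributes() - 1);  // class = performance

            M5P m5p = new M5P();                           // model tree: linear models at the leaves
            m5p.buildClassifier(data);
            System.out.println(m5p);

            LinearRegression lr = new LinearRegression();  // one global linear model
            lr.buildClassifier(data);
            System.out.println(lr);
        }
    }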
10.2.4 Visualizing error
Which is better, M5P or linear regression?
10.2.5 Do it yourself: the user classifier
- Open data>segment-challenge.arff
- Segment the visual image data into classes (grass, sky, cement …)
Classifiers>Trees>UserClassifier
data>segment-test.arff
10.2.5 Do it yourself: the user classifier
The goal is to find a combination that separates the classes as clearly as possible.
Change the X and Y axes!
10.2.5 Do it yourself: the user classifier
Specify a region in the graph:
1. Select instance
2. Rectangle
3. Polygon
4. Polyline
- Clear: clear the selection
- Save: save the instances in the current tree node as an ARFF file
10.2.5 Do it yourself: the user classifier
Accept the tree (right-click on any blank space).
Building trees manually is very tedious
Correctly classified instances: 40%; incorrectly classified: 60%
10.2.6 Using a metalearner
A metalearner takes a simple classifier and turns it into a more powerful learner.
Boosting decision stumps up to 10 times
Adaptive boosting
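A minimal Java sketch of this configuration, AdaBoostM1 boosting a decision stump up to 10 times (class and file names are illustrative):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.classifiers.meta.AdaBoostM1;
    import weka.classifiers.trees.DecisionStump;
    import weka.core.Instances;

    public class BoostingDemo {
        public static void main(String[] args) throws Exception {
            Instances data = new Instances(new BufferedReader(new FileReader("weather.arff")));
            data.setClassIndex(data.numAttributes() - 1);
            AdaBoostM1 booster = new AdaBoostM1();
            booster.setClassifier(new DecisionStump());  // weak one-level tree as the base learner
            booster.setNumIterations(10);                // boost the stump up to 10 times
            booster.buildClassifier(data);
            System.out.println(booster);
        }
    }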
10.2.6 Using a metalearner
10.2.7 Clustering and association rules
We skip clustering and association rules, so Sections 10.6 and 10.7 are also skipped.
In Ch. 4, Sections 4.5 and 4.8 are also skipped.
10.2.8 Attribute selection
We will learn more in Ch. 10.8.
10.2.9 Visualization of data
2D scatter plots of every pair of attributes
10.3 Filtering algorithms
- Filtering of the data (= attributes + instances)
- All filters transform the input dataset in some way.
- Two kinds of filter:
1) supervised (Section 7.2): to be used carefully
2) unsupervised
- Each kind is further divided into attribute filters and instance filters:
1) attribute filter: works on the attributes of the data
2) instance filter: works on the instances of the data
- See Section 7.3:
1) PCA (principal component analysis)
2) random projections
10.3.1 Adding and removing attributes
- Add: insert a new attribute whose values are all empty.
- Copy: copy existing attributes and their values.
- Remove: the same as the <remove> tab.
- RemoveType: remove all attributes of a given type, such as nominal, numeric, string, or date.
- AddCluster: apply a clustering algorithm to the data before filtering it (see Section 10.6).
- AddExpression: create a new attribute by applying a mathematical function to numeric attributes, e.g., a1^2*a5/log(4*a7).
- NumericTransform: performs an arbitrary transformation by applying a given Java function to selected numeric attributes.
- Normalize: scale all numeric values to lie between 0 and 1 (see the sketch after this list).
- Standardize: transform all numeric values to have zero mean and unit variance.
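A minimal Java sketch of applying an unsupervised attribute filter (file and class names are illustrative); the same pattern works for the other filters listed above:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Normalize;

    public class FilterDemo {
        public static void main(String[] args) throws Exception {
            Instances data = new Instances(new BufferedReader(new FileReader("cpu.arff")));
            Normalize norm = new Normalize();   // scales numeric values into [0, 1]
            norm.setInputFormat(data);          // must be called before filtering
            Instances normalized = Filter.useFilter(data, norm);
            System.out.println(normalized.toSummaryString());
        }
    }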
10.3.2 Changing values
- SwapValues: just change the position of two values of a nominal attribute (it does not affect learning at all).
- MergeTwoValues: merge values of a nominal attribute into a single category.
- ReplaceMissingValues: replace each missing value with the mean (numeric attributes) or the mode (nominal attributes).
1) if a class is set, missing values of that attribute are not replaced.
10.3.3 Conversions
- Discretize: change a numeric attribute into a nominal one (Section 7.2); see the sketch after this list.
1) equal-width binning
2) equal-frequency binning
- PKIDiscretize: discretize numeric attributes using equal-frequency binning, where the number of bins is the square root of the number of values; e.g., 83 instances without missing values are binned into 9 bins.
- MakeIndicator: convert a nominal attribute into a binary indicator attribute; this is necessary when a numeric attribute is required by an ML scheme.
- NominalToBinary: transform all multivalued nominal attributes in a dataset into binary ones (a k-valued attribute → k binary attributes).
- NumericToBinary: convert all numeric attributes into binary ones (if the numeric value is 0, the binary value is 0; otherwise, it is 1).
- FirstOrder: take differences between two attribute values.
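A sketch of Discretize from Java, assuming an Instances object has been loaded as in the earlier sketches (the option values are illustrative):

    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Discretize;

    public class DiscretizeDemo {
        // returns a copy of 'data' with numeric attributes binned into nominal ones
        static Instances discretize(Instances data) throws Exception {
            Discretize disc = new Discretize();
            disc.setBins(10);                 // 10 bins; equal-width by default
            disc.setUseEqualFrequency(true);  // switch to equal-frequency binning
            disc.setInputFormat(data);
            return Filter.useFilter(data, disc);
        }
    }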
10.3.4 String conversions
- StringToNominal: convert a string attribute to a nominal one.
- StringToWordVector: convert a string attribute into a set of attributes representing word occurrences.
10.3.5 Time series
For time-series data,
- TimeSeriesTranslate: replace attribute values in the current instance with the equivalent attribute values of some previous (or future) instance.
- TimeSeriesDelta: replace attribute values in the current instance with the difference between the current value and the equivalent attribute value of some previous (or future) instance.
10.3.6 Randomizing
These filters change values of the data.
- AddNoise: introduces noise into the data; it takes a nominal attribute and changes a given percentage of its values to other values.
- Obfuscate: rename the attributes and anonymize the data.
- RandomProjection: see Section 7.3.
10.3.7 Unsupervised instance filters
- Attribute filters: affect all values of an attribute (a column of the data)
- Instance filters: affect all values of an instance (a row of the data)
10.3.8 Randomizing and subsampling
- Randomize: the order of the instances is randomized.
- Normalize: all numeric attributes are treated as a vector and normalized to a given length.
- Resample: produces a random sample by sampling with replacement.
- RemoveFolds: first splits the data into a given number of cross-validation folds and then reduces it to just one of them. If a random-number seed is provided, the dataset is shuffled before the subset is extracted.
- RemovePercentage: removes a given percentage of instances.
- RemoveRange: removes a certain range of instance numbers.
- RemoveWithValues: removes all instances that have certain values, or values above or below a certain threshold.
10.3.9 Sparse instances
- NonSparseToSparse: convert instances to the sparse representation.
- SparseToNonSparse: convert sparse instances back to the standard representation.
10.3.10 Supervised filters
- Supervised filters are affected by the class attribute.
- There are two categories of supervised filters:
1) attribute
2) instance
- You need to be careful with them because they are not really preprocessing operations.
- Discretize: see Section 7.2.
- NominalToBinary: see Section 6.5.
- ClassOrder: changes the ordering of the class values.
- Resample: like the unsupervised instance filter, except that it maintains the class distribution in the subsample.
…..
10.4 Learning algorithms
- There are 7 categories of classifiers:
1) Bayesian: document classification (e.g., Google search)
2) Trees: decision trees, divide-and-conquer (stump, node, leaf, model tree)
3) Rules: covering approach (excluding instances); a decision tree can be converted into a set of logical expressions
4) Functions: linear and nonlinear models
5) Lazy (instance-based learning): distance functions
6) Metalearning algorithms: more powerful learners
7) Miscellaneous: diverse others
- Ch. 4 and Ch. 6 cover these algorithms.
10.4.2 Decision trees
- J48 (see Sections 6.1 and 6.2): a reimplementation of C4.5, the outcome of over 20 years of work by Quinlan (1993).
Options (set via the object editor; see the sketch below):
- confidence threshold for pruning
- minimum number of instances permissible at a leaf
- size of the pruning set: the data is divided equally into that number of parts and the last one is used for pruning
This algorithm is valid for a nominal class attribute.
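These options correspond to J48's parameters; a Java sketch (the values shown are illustrative):

    import weka.classifiers.trees.J48;

    public class J48Options {
        static J48 configure() {
            J48 tree = new J48();
            tree.setConfidenceFactor(0.25f);    // confidence threshold for pruning
            tree.setMinNumObj(2);               // minimum number of instances at a leaf
            tree.setReducedErrorPruning(true);  // prune on a separate held-out part of the data
            tree.setNumFolds(3);                // split into 3 parts; the last is used for pruning
            return tree;
        }
    }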
10.4.2 Decision trees
- Id3 (see Ch. 4): basic divide-and-conquer decision tree algorithm; it handles only nominal attributes.
- DecisionStump: designed for use with boosting methods; it builds one-level binary decision trees.
- RandomTree: constructs a tree that considers a given number of random features at each node, performing no pruning.
- RandomForest (see Section 7.5): constructs random forests by bagging ensembles of random trees.
- REPTree (see Section 6.2): builds a decision or regression tree using information gain/variance reduction.
- NBTree: a hybrid between decision trees and Naïve Bayes.
- M5P (see Section 6.5; Quinlan, 1992): model tree learner with linear models at the leaves.
- LMT (see Section 7.5): builds logistic model trees for a nominal class.
- ADTree (see Section 7.5): alternating decision trees using boosting.
10.4.3 Classification rules
- ConjunctiveRule
- DecisionTable
- JRip
- M5Rules
- NNge
- OneR
- PART
- Prism
- Ridor
- ZeroR: predicts the test data's majority class (if nominal) or average value (if numeric).
10.4.4 Functions
- Bayesian methods have a simple mathematical formulation.
- Decision trees and rules can embed linear regression models.
- Functions give us more complicated mathematical models.
We focus on
1) linear regression model
2) Support vector machine algorithm
3) Neural network
10.4.4 Functions
- SimpleLinearRegression: a linear regression model based on a single attribute.
- LinearRegression: a linear regression model over all numeric attributes.
- LeastMedSq: implements least-median-squared linear regression, utilizing the existing Weka LinearRegression class to form predictions (the solution has the smallest median squared error).
- SMO (see Section 6.3): implements the sequential minimal optimization algorithm for training a support vector classifier.
- SMOreg: support vector machine algorithm for a numeric class attribute.
- Linear model: a truly linear relationship between the attributes.
- Support vector machine: a linear model for nonlinear class boundaries.
10.4.4 Functions
- VotedPerceptron (see Section 6.3): the voted perceptron algorithm; globally replaces all missing values and transforms nominal attributes into binary ones.
- Winnow (see Section 4.6): modifies the basic perceptron to use multiplicative updates.
- PaceRegression: builds linear regression models using the new technique of pace regression (Wang & Witten, 2002).
- SimpleLogistic (see Section 4.6): builds logistic regression models.
- RBFNetwork (see Section 6.3): implements a Gaussian radial basis function network, one kind of linear regression model.
10.4.5 Functions: artificial neural network (ANN)
Najjar, Y.M., I.A. Basheer and M.N. Hajmeer (1997), Computational neural networks for predictive microbiology: I. methodology, Int. Journal of Food Microbiology, 34, 27-49.
10.4.5 Functions: neural networks

[Diagram] A feed-forward network: the inputs p1, p2, ..., pr are combined through the weights w1,1 ... w1,r and the bias b1 into n1, which the transfer function f(n) turns into the hidden-layer activation a1; the output layer combines a1 through the weight w2,1 and the bias b2 into n2 and the output a2. Each such unit is a neuron.

a1 = f_sigmoid(w1 p + b1)
a2 = f_purelin(w2 a1 + b2)

The purely linear output layer corresponds to a linear regression model.
10.4.5 Functions: neural networks

a1 = f_sigmoid(w1 p + b1), where f_sigmoid(n) = 1/(1 + e^(-n)) rises from 0 to 1 (a = 0.5 at n = 0)
a2 = f_purelin(w2 a1 + b2), where f_purelin(n) = n is a straight line

Worked example for one neuron, with p = [1, 1, 2, 3], w = [0, 0.2, -0.3, 0.5], b = -2.5:

n1 = sum_{i=1..4} w_{1,i} p_i + b1 = 0(1) + 0.2(1) - 0.3(2) + 0.5(3) - 2.5 = -1.4
a1 = 1/(1 + e^(-n1)) = 1/(1 + e^(1.4)) = 0.1978

Exercise: with p = [1, 0, 2], w = [0.2, 0.6, 0.2], b = 0.5, a1 = ?
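A one-neuron forward pass in Java, a minimal sketch of the computation above (it prints the answer to the exercise, so try it by hand first):

    public class Neuron {
        // logistic sigmoid transfer function: a = 1/(1 + e^(-n))
        static double sigmoid(double n) {
            return 1.0 / (1.0 + Math.exp(-n));
        }

        // one neuron: n = w . p + b, a = f_sigmoid(n)
        static double activate(double[] w, double[] p, double b) {
            double n = b;
            for (int i = 0; i < w.length; i++) n += w[i] * p[i];
            return sigmoid(n);
        }

        public static void main(String[] args) {
            // exercise values from the slide
            double[] w = {0.2, 0.6, 0.2};
            double[] p = {1, 0, 2};
            System.out.println(activate(w, p, 0.5));
        }
    }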
10.4.5 Functions: neural networks
- MultilayerPerceptron (see Section 6.3): a nonlinear algorithm; it trains by back-propagation and validates (calculates) by feed-forward.
Set GUI to True to see ANN
Adds and connects up hidden layers in the network.
Filtering the data
The amount the weights are updated.
Momentum applied to the weights during updating.
10.4.5 Functions: neural networks
- MultilayerPerceptron (continued)
Define the structure and number of nodes of the hidden layers: e.g., "4,5" gives two hidden layers with 4 and 5 neurons, respectively.
The number of epochs to train through
The percentage size of the validation set
Used to terminate validation testing. The value here dictates how many times in a row the validation set error can get worse before training is terminated.
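These options can also be set programmatically; a sketch (the option values are illustrative):

    import weka.classifiers.functions.MultilayerPerceptron;

    public class MlpOptions {
        static MultilayerPerceptron configure() {
            MultilayerPerceptron mlp = new MultilayerPerceptron();
            mlp.setGUI(true);               // show the network GUI during training
            mlp.setHiddenLayers("4,5");     // two hidden layers with 4 and 5 neurons
            mlp.setLearningRate(0.3);       // amount the weights are updated
            mlp.setMomentum(0.2);           // momentum applied during updating
            mlp.setTrainingTime(500);       // number of epochs to train through
            mlp.setValidationSetSize(10);   // percentage size of the validation set
            mlp.setValidationThreshold(20); // consecutive worsenings allowed before stopping
            return mlp;
        }
    }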
10.4.5 Functions: neural networks - CPU problem
Using two hidden layers with 4 and 5 neurons, find the correlation coefficient of this ANN.
Options: 10-fold cross-validation, 500 epochs. Open file: CPU.arff

10.4.5 Functions: neural networks - CPU problem
Classify: Functions>MultilayerPerceptron, with 10-fold cross-validation.

10.4.5 Functions: neural networks - CPU problem
Set the options.

10.4.5 Functions: neural networks - CPU problem
After Start in the main Weka window, you can see the ANN GUI.
1. Click "Accept"
2. Click "Start"
3. Repeat until the 10-fold run finishes

10.4.5 Functions: neural networks - CPU problem
You can see the cross-validation counter increase.
The bird will dance until the 10-fold validation is over.

10.4.5 Functions: neural networks - CPU problem
After the 10-fold validation, the results appear in the main Weka window.
You can now analyze the results.
10.5 Meta-learning algorithms
For powerful usage of ML, please read this section!
We skip Sections 10.6 and 10.7.
10.8 Attribute selection
It works by selecting one of the attribute evaluators and, at the same time, one of the search methods.
10.8.1 Attribute subset evaluators
Subset evaluators take a subset of attributes and return a numeric measure that guides the search.
- CfsSubsetEval (see Section 7.1): finds attributes that are highly correlated with the class but have low inter-correlation among themselves; one of the filter methods.
- ConsistencySubsetEval: one of the filter methods.
- ClassifierSubsetEval: one of the wrapper methods; it uses a classifier to evaluate sets of attributes on the training data.
10.8.2 Single-attribute evaluators
Single-attribute evaluators should be used with the Ranker search method.
- InfoGainAttributeEval: it evaluates attributes by measuring their information gain w.r.t. the class.
- ChiSquaredAttributeEval: it evaluates attributes by computing the chi-squared statistic w.r.t. the class.
- SVMAttributeEval: evaluation with a linear support vector machine.
- PrincipalComponents (see section 7.3): PCA
10.8.3 Search methods
- Search methods are optimization algorithms used to find a good subset.
- Subset evaluators are the objective functions to be optimized (see the sketch below).
- GreedyStepwise: greedy forward (or backward) search through the space of attribute subsets.
- GeneticSearch: search using a simple genetic algorithm.
- RankSearch: ranks attributes with a single-attribute evaluator and then evaluates subsets of increasing size.
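A minimal Java sketch combining an evaluator and a search method (class and file names are illustrative):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.CfsSubsetEval;
    import weka.attributeSelection.GreedyStepwise;
    import weka.core.Instances;

    public class AttrSelectDemo {
        public static void main(String[] args) throws Exception {
            Instances data = new Instances(new BufferedReader(new FileReader("weather.arff")));
            data.setClassIndex(data.numAttributes() - 1);
            AttributeSelection sel = new AttributeSelection();
            sel.setEvaluator(new CfsSubsetEval());  // subset evaluator = objective function
            sel.setSearch(new GreedyStepwise());    // search method = optimizer
            sel.SelectAttributes(data);             // run the selection
            System.out.println(sel.toResultsString());
        }
    }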