WEKA and Machine Learning Algorithms
Algorithm Types
• Classification (supervised)
Given -> A set of classified examples (“instances”)
Produce -> A way of classifying new examples
Instances: described by fixed set of features (“attributes”)
Classes: discrete (“classification”) or continuous (“regression”)
Interested in:
– Results? (classifying new instances)
– Model? (how the decision is made)
• Clustering (unsupervised)
There are no classes
• Association rules
Look for rules that relate features to other features
Classification
Clustering
Clustering
• It is expected that similarity among members of a cluster should be high and similarity among objects of different clusters should be low.
• The objectives of clustering
– knowing which data objects belong to which cluster
– understanding common characteristics of the members of a specific cluster
Clustering vs Classification
• There is some similarity between clustering and classification.
• Both classification and clustering are about assigning appropriate class or cluster labels to data records. However, clustering differs from classification in two aspects.
– First, in clustering, there are no pre-defined classes. This means that the number of classes or clusters and the class or cluster label of each data record are not known before the operation.
– Second, clustering is about grouping data rather than developing a classification model. Therefore, there is no distinction between data records and examples. The entire data population is used as input to the clustering process.
Association Mining
Overfitting
• Memorization vs generalization
• To fix, use
– Training data: to form rules
– Validation data: to decide on best rule
– Test data: to determine system performance
• Cross-validation
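As an illustration of cross-validation, the following Python sketch splits instance indices into k folds so that every instance is held out for testing exactly once. The function name `k_fold_indices` is hypothetical and not part of WEKA, and the sketch does no shuffling or stratification; it is only meant to show the idea.

```python
def k_fold_indices(n_instances, k):
    """Yield (train, test) index lists for k-fold cross-validation."""
    if not 2 <= k <= n_instances:
        raise ValueError("k must be between 2 and the number of instances")
    indices = list(range(n_instances))
    for fold in range(k):
        # Every k-th instance, starting at `fold`, is held out for testing.
        test = indices[fold::k]
        held_out = set(test)
        # All remaining instances are used for training in this round.
        train = [i for i in indices if i not in held_out]
        yield train, test


if __name__ == "__main__":
    for train, test in k_fold_indices(10, 5):
        print("train:", train, "test:", test)
```

A model would be trained and evaluated once per fold, and the k test results averaged to estimate performance.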
Baseline Experiments
• In order to evaluate the efficiency of the classifiers used in experiments, we use baselines:
– Majority-based random classification (Kappa=0)
– Class-distribution-based random classification (Kappa=0)
• The Kappa statistic is used as a measure to assess the improvement of a classifier’s accuracy over a predictor employing chance as its guide.
• P0 is the accuracy of the classifier and Pc is the expected accuracy that can be achieved by a randomly guessing classifier on the same data set. The Kappa statistic has a range between -1 and 1, where -1 is total disagreement (i.e., total misclassification) and 1 is perfect agreement (i.e., a 100% accurate classification).
• A Kappa score over 0.4 indicates a reasonable agreement beyond chance.
$$\kappa = \frac{P_0 - P_c}{1 - P_c}$$
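To make the Kappa statistic concrete, here is a small Python sketch that computes kappa = (P0 - Pc) / (1 - Pc) from a confusion matrix. It assumes Pc is estimated from the row and column totals, as in Cohen's kappa; the function name `kappa` and the example matrices are illustrative only.

```python
def kappa(confusion):
    """Kappa from a square confusion matrix (rows = true, columns = predicted)."""
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    n = len(confusion)
    # P0: observed accuracy, the fraction of instances on the diagonal.
    p0 = sum(confusion[i][i] for i in range(n)) / total
    # Pc: accuracy expected by chance, from the row and column marginals.
    pc = sum(
        (sum(confusion[i]) / total) * (sum(row[i] for row in confusion) / total)
        for i in range(n)
    )
    if pc == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p0 - pc) / (1 - pc)


if __name__ == "__main__":
    print(kappa([[40, 10], [5, 45]]))  # a classifier that beats chance
    print(kappa([[50, 0], [50, 0]]))   # majority baseline: always predicts class 0
```

Under these assumptions the majority baseline scores a kappa of 0, which matches the baseline values listed above.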
Data Mining Process
WEKA: the software
• Machine learning/data mining software written in Java (distributed under the GNU General Public License)
• Used for research, education, and applications
• Complements “Data Mining” by Witten & Frank
• Main features:
– Comprehensive set of data pre-processing tools, learning algorithms and evaluation methods
– Graphical user interfaces (incl. data visualization)
– Environment for comparing learning algorithms
Weka’s Role in the Big Picture
Input
• Raw data
Data Mining by Weka
• Pre-processing
• Classification
• Regression
• Clustering
• Association Rules
• Visualization
Output
• Result
WEKA: Terminology
Some synonyms/explanations for the terms used by WEKA:
Attribute: feature
Relation: collection of examples
Instance: collection in use
Class: category
@relation heart-disease-simplified
@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
WEKA only deals with “flat” files
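To show how such a flat file maps onto attributes and instances, here is a minimal Python sketch of a reader for the ARFF subset used in the example: numeric and nominal attributes, comma-separated rows, and "?" for missing values. The function `parse_arff` is a hypothetical helper, not WEKA's own loader, and it ignores quoting, dates, strings, and sparse data.

```python
def parse_arff(text):
    """Parse a simple ARFF string into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                if value == "?":
                    row[name] = None          # missing value
                elif kind == "numeric":
                    row[name] = float(value)
                else:
                    row[name] = value         # nominal value kept as text
            rows.append(row)
            continue
        lower = line.lower()
        if lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                # Nominal attribute: keep the list of allowed values.
                kind = [v.strip() for v in spec.strip("{}").split(",")]
            else:
                kind = spec.lower()
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows
```

Each row comes back as a dictionary keyed by attribute name, with `None` standing in for a missing value such as the unknown cholesterol reading in the fourth instance.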
Explorer: pre-processing the data
• Data can be imported from a file in various formats: ARFF, CSV, C4.5, binary
• Data can also be read from a URL or from an SQL database (using JDBC)
• Pre-processing tools in WEKA are called “filters”
• WEKA contains filters for:
– Discretization, normalization, resampling, attribute selection, transforming and combining attributes, …
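As a rough idea of what two of these filters do, the Python sketch below rescales a numeric attribute to the range [0, 1] and discretizes it into equal-width bins. The function names are hypothetical and the logic is a simplified stand-in, not WEKA's filter implementation.

```python
def normalize(values):
    """Rescale numeric values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant attribute: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def discretize(values, bins):
    """Map numeric values to equal-width bin indices 0 .. bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), bins - 1) for v in values]


if __name__ == "__main__":
    ages = [63, 67, 67, 38]
    print(normalize(ages))
    print(discretize(ages, 3))
```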
04/19/23
Explorer: building “classifiers”
• Classifiers in WEKA are models for predicting nominal or numeric quantities
• Implemented learning schemes include:
– Decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, Bayes’ nets, …
• “Meta”-classifiers include:
– Bagging, boosting, stacking, error-correcting output codes, locally weighted learning, …
Classifiers - Workflow
Labeled Data -> Learning Algorithm -> Classifier
Unlabeled Data -> Classifier -> Predictions
Evaluation
• Accuracy
– Percentage of predictions that are correct
– Problematic for some disproportional data sets
• Precision
– Percentage of positive predictions that are correct
• Recall (Sensitivity)
– Percentage of positive-labeled samples predicted as positive
• Specificity
– Percentage of negative-labeled samples predicted as negative
Confusion matrix
Contains information about the actual and the predicted classification.

            predicted
            –     +
true   –    a     b
       +    c     d

All measures can be derived from it:
accuracy: (a+d)/(a+b+c+d)
recall: d/(c+d) => R
precision: d/(b+d) => P
F-measure: 2PR/(P+R)
false positive (FP) rate: b/(a+b)
true negative (TN) rate: a/(a+b)
false negative (FN) rate: c/(c+d)
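The measures derived from the confusion matrix can be computed directly from the four counts. The Python sketch below uses the same letters as the matrix (a = true negatives, b = false positives, c = false negatives, d = true positives); the function name `confusion_measures` is hypothetical, and the sketch assumes every denominator is non-zero.

```python
def confusion_measures(a, b, c, d):
    """Measures from a 2x2 confusion matrix.

    a = true negatives, b = false positives,
    c = false negatives, d = true positives.
    """
    recall = d / (c + d)      # share of actual positives that were found
    precision = d / (b + d)   # share of positive predictions that were right
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "recall": recall,
        "precision": precision,
        "f_measure": 2 * precision * recall / (precision + recall),
        "fp_rate": b / (a + b),
        "tn_rate": a / (a + b),
        "fn_rate": c / (c + d),
    }


if __name__ == "__main__":
    for name, value in confusion_measures(a=50, b=10, c=5, d=35).items():
        print(f"{name}: {value:.3f}")
```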
Explorer: clustering data
• WEKA contains “clusterers” for finding groups of similar instances in a dataset
• Implemented schemes are:
– k-Means, EM, Cobweb, X-means, FarthestFirst
• Clusters can be visualized and compared to “true” clusters (if given)
• Evaluation is based on log-likelihood if the clustering scheme produces a probability distribution
Explorer: finding associations
• WEKA contains an implementation of the Apriori algorithm for learning association rules
– Works only with discrete data
• Can identify statistical dependencies between groups of attributes:
– milk, butter => bread, eggs (with confidence 0.9 and support 2000)
• Apriori can compute all rules that have a given minimum support and exceed a given confidence
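To illustrate what support and confidence mean for a rule such as "milk, butter => bread, eggs", the Python sketch below counts them over a list of transactions. It is not the Apriori algorithm itself, only the scoring of a single rule; the function name and the toy baskets are made up for the example, and support is reported as a count, as on the slide.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Return (support count, confidence) of the rule antecedent => consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    # Transactions containing the left-hand side of the rule.
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    # Transactions containing both sides of the rule.
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return n_both, confidence


if __name__ == "__main__":
    baskets = [
        {"milk", "butter", "bread", "eggs"},
        {"milk", "butter", "bread", "eggs"},
        {"milk", "butter", "bread"},
        {"milk", "eggs"},
        {"bread", "eggs"},
    ]
    print(support_and_confidence(baskets, {"milk", "butter"}, {"bread", "eggs"}))
```

Apriori would keep only the rules whose support and confidence reach the chosen minimum thresholds.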
Explorer: attribute selection
• Panel that can be used to investigate which (subsets of) attributes are the most predictive ones
• Attribute selection methods contain two parts:
– A search method: best-first, forward selection, random, exhaustive, genetic algorithm, ranking
– An evaluation method: correlation-based, wrapper, information gain, chi-squared, …
• Very flexible: WEKA allows (almost) arbitrary combinations of these two
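As one example of an evaluation method, information gain scores a nominal attribute by how much it reduces the entropy of the class. The Python sketch below is a simplified illustration, not WEKA's evaluator; the function names are hypothetical and it assumes non-empty, equally long lists of attribute values and class labels.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in class entropy after splitting on a nominal attribute."""
    total = len(labels)
    remainder = 0.0
    for value in set(values):
        # Class labels of the instances that share this attribute value.
        subset = [lab for v, lab in zip(values, labels) if v == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


if __name__ == "__main__":
    labels = ["present", "present", "not_present", "not_present"]
    print(information_gain(["yes", "yes", "no", "no"], labels))  # fully predictive
    print(information_gain(["yes", "no", "yes", "no"], labels))  # uninformative
```

Ranking attributes by this score and keeping the top few corresponds to pairing the "ranking" search method with the "information gain" evaluation method.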
Explorer: data visualization
• Visualization very useful in practice: e.g. helps to determine difficulty of the learning problem
• WEKA can visualize single attributes (1-d) and pairs of attributes (2-d)
– To do: rotating 3-d visualizations (Xgobi-style)
• Color-coded class values
• “Jitter” option to deal with nominal attributes (and to detect “hidden” data points)
• “Zoom-in” function
Performing experiments
• Experimenter makes it easy to compare the performance of different learning schemes
• For classification and regression problems
• Results can be written into file or database
• Evaluation options: cross-validation, learning curve, hold-out
• Can also iterate over different parameter settings
• Significance-testing built in!
The Knowledge Flow GUI
• New graphical user interface for WEKA
• Java-Beans-based interface for setting up and running machine learning experiments
• Data sources, classifiers, etc. are beans and can be connected graphically
• Data “flows” through components: e.g., “data source” -> “filter” -> “classifier” -> “evaluator”
• Layouts can be saved and loaded again later
Beyond the GUI
• How to reproduce experiments with the command-line/API
– GUI, API, and command-line all rely on the same set of Java classes
– Generally easy to determine what classes and parameters were used in the GUI.
– Tree displays in Weka reflect its Java class hierarchy.
> java -cp ~galley/weka/weka.jar weka.classifiers.trees.J48 -C 0.25 -M 2 -t <train_arff> -T <test_arff>
Important command-line parameters
> java -cp ~galley/weka/weka.jar weka.classifiers.<classifier_name> [classifier_options] [options]
where options are:
• Create/load/save a classification model:
-t <file> : training set
-l <file> : load model file
-d <file> : save model file
• Testing:
-x <N> : N-fold cross validation
-T <file> : test set
-p <S> : print predictions + attribute selection S
Problem with Running Weka
Problem : Out of memory for large data set
Solution : java -Xmx1000m -jar weka.jar