Byoung-Hee Kim
Biointelligence Laboratory, Seoul National University
http://bi.snu.ac.kr
Machine Learning with Weka
Cognitive Technology Tutorial, Korea Cognitive Science and Technology Association - Human Cognitive Sensor Big Data Analytics (CogAnalytics)
October 17-18 (Fri-Sat), 2014, Seoul National University
Tutorial Outline
Analytics & Machine Learning
Classification and Prediction with Weka
The Various Interfaces of Weka
Clustering
Data Preprocessing and Visualization
Self-Practice
More Information on Weka and Related Software
(C) 2014, SNU Biointelligence Lab, http://bi.snu.ac.kr/
ANALYTICS & MACHINE LEARNING
Part I
Analytics
Analytics (analysis solutions): the process of discovering and communicating meaningful patterns in data
Used as a suffix in many fields, e.g., Text Analytics, Social Analytics, Business Analytics
Refers to the overall methodology encompassing data analysis
CogAnalytics: Cognitive Science & Engineering, Predictive Analytics / Advanced Analytics, Machine Learning
Hype Cycle of Emerging Technologies 2010, Gartner
Analytics as Mainstream Technology
Human Learning & Machine Learning
• Gather data
• Preprocessing, feature selection
• Combinations of features and correct labels
R. Elwell and R. Polikar, “Incremental learning of concept drift in nonstationary environments,” IEEE Trans. Neural Netw., vol. 22, no. 10, pp. 1517–31, Oct. 2011.
Scaffolding: tutoring theory to enhance human learning
(passive) Supervised Learning
• Knowledge becomes available
• Complexity reduction (refining the knowledge representation)
• Combinations of experiences and consequences
The Process of Data Analysis
Define the question
Prepare the data: define the ideal data set, determine what data you can access, obtain the data, clean the data
Exploratory data analysis - today's practice: clustering / data visualization
Statistical prediction/modeling - today's practice: classification / prediction
Interpret results - today's practice: various evaluation metrics and methods, model selection
Challenge every step and result (challenge results)
Synthesize/write up results
Create reproducible code
J. Leek, Data Analysis – Structure of a Data Analysis, Lecture at Coursera, 2013
Choosing a Dataset by the Goal of the Analysis
Descriptive: a whole population
Exploratory: a random sample with many variables measured
Inferential: the right population, randomly sampled
Predictive: a training and a test data set from the same population
Causal: data from a randomized study
Mechanistic: data about all components of the system
J. Leek, Data Analysis – Structure of a Data Analysis, Lecture at Coursera, 2013
Example: an Automatic Photo-Based Classifier
What do we want to classify?
Which measurements do we use for classification?
Collect and analyze measurement data
'Train' the automatic classifier, test it, and release it
Salmon Sea Bass
Rockfish
Gender classification
Fish species classification
Emotion classification
Photo source: http://jamja.tistory.com/1705
Solution: Machine Learning - Supervised Learning - Classification
What to classify: the class label
Which measurements to classify by: features (attributes), variables
Collecting and organizing measurements: dataset, data matrix
Train - test - release: training dataset, test dataset, classification model, evaluation and model selection
(Figure: a dataset is a data matrix X whose rows are instances and whose columns are features x1, x2, x3, …, together with a label vector y; the pairs (X, y) are split into a training set and a test set.)
The Role of a Classifier
(Figure: decision boundaries in the x1-x2 feature plane, for binary classification and for multi-class classification.)
A. Ng, Machine Learning, Lecture at Coursera, 2013
Terminology
Features (attributes, variables): the individual measurable properties of the phenomena being observed. Choosing discriminating and independent features is key to any pattern recognition algorithm succeeding at classification
Training set / test set: the training set is a set of examples used for learning, that is, to fit the parameters (e.g., weights) of the classifier; the test set is a set of examples used only to assess the performance (generalization) of a fully specified classifier
CLASSIFICATION AND PREDICTION USING WEKA
Part Ⅱ
Introduction to Weka
A collection of standard machine learning algorithms and a data mining tool. Main features of Weka:
Data pre-processing, feature selection, clustering, visualization, classification, regression, time-series forecasting, association rule learning
Software characteristics: free and open-source (GNU General Public License); the basis of major analytics software such as RapidMiner and MOA; implemented in Java and runs on many platforms
Download: search Google for "Weka" (first result), or http://www.cs.waikato.ac.nz/ml/weka/
Weka (bird): http://www.arkive.org/weka/gallirallus-australis/video-au00.html
The Interfaces of Weka
Explorer: run each analysis step interactively and inspect the results; usually the first interface to try
KnowledgeFlow: compose the main modules of a data-processing pipeline as a visual graph and run experiments
Experimenter: batch-process classification and regression analyses and compare the results - configure various algorithms and parameters, analyze many dataset-algorithm combinations at once, statistically compare models, run large-scale statistical experiments
Simple CLI: a script console that controls the other interfaces; every Weka feature can be invoked from the command line
Other main tools
Structure of the Weka Practice Sessions
Solve a classification problem right away - problem: iris species classification (self-study: spam filtering, diabetes prediction, handwritten digit recognition); goal: predictive; key steps: statistical prediction/modeling, interpreting results; tools: Weka's Explorer and Experimenter
Preliminary work to understand and solve the classification problem better - goal: exploratory; key step: exploratory data analysis; main tasks: data preprocessing, evaluating and selecting attributes by their contribution to classification, clustering, visualization; tool: Weka's Explorer
Practice: Iris Classification
Iris virginica, Iris versicolor, Iris setosa
Features used for classification
Practice: Iris Classification
Define the features (attributes): sepal length, sepal width, petal length, petal width. Class label: the three iris subspecies - setosa, versicolor, and virginica
Collect samples and build the dataset: 50 samples per subspecies (collected in 1935); a data table of 150 samples (instances) x 5 attributes. Sir R. Fisher applied a linear discriminant model to this data in his 1936 paper
Training: choose a classification algorithm and set its parameters. We practice with three algorithms (neural networks, decision trees, and SVMs), using default parameter settings for each
Evaluate the results and select a model: inspect various evaluation metrics, then compare trained models (algorithm + parameter setting) and select one based on them
Classification Algorithm: Neural Networks
MLP (multilayer perceptron): a representative classification algorithm that is very widely used in practice. In Weka: classifiers - functions - MultilayerPerceptron
Figure from Andrew Ng’s Machine Learning Lecture Notes, on Coursera, 2013-1
Classification Algorithm: Decision Trees
J48 (the Java implementation of C4.5)
The trained model yields classification rules in the form of a tree. In Weka: classifiers - trees - J48
Classification Algorithm: Support Vector Machines
SMO (sequential minimal optimization) for training SVMs
A representative kernel-machine-based classification algorithm. In Weka: classifiers - functions - SMO
Practice: the Iris Dataset
Just open "iris.arff" in the 'data' folder
Weka Data Format (.ARFF)
Header:
@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
Data (CSV format):
@DATA
5.1, 3.5, 1.4, 0.2, Iris-setosa
4.9, 3.0, 1.4, 0.2, Iris-setosa
4.7, 3.2, 1.3, 0.2, Iris-setosa
…
7.0, 3.2, 4.7, 1.4, Iris-versicolor
6.4, 3.2, 4.5, 1.5, Iris-versicolor
6.9, 3.1, 4.9, 1.5, Iris-versicolor
…
Note: create a CSV file with Excel, then simply add a header to obtain a file in ARFF format
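That CSV-plus-header recipe can also be scripted. The following is a minimal sketch (not part of the tutorial: the helper name and the in-memory handling of rows are our assumptions; real use would read the CSV from disk and write the result to a .arff file):

```python
# Sketch: wrap iris-style CSV rows in an ARFF header.
# The attribute names and class values mirror the iris.arff example above;
# any other dataset would need its own header lines.

def csv_to_arff(relation, numeric_attrs, class_name, class_values, csv_rows):
    """Return the text of an ARFF file built from CSV-formatted rows."""
    lines = ["@RELATION " + relation]
    for attr in numeric_attrs:
        lines.append("@ATTRIBUTE %s REAL" % attr)
    lines.append("@ATTRIBUTE %s {%s}" % (class_name, ",".join(class_values)))
    lines.append("@DATA")
    lines.extend(csv_rows)
    return "\n".join(lines) + "\n"

arff_text = csv_to_arff(
    "iris",
    ["sepallength", "sepalwidth", "petallength", "petalwidth"],
    "class",
    ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
    ["5.1,3.5,1.4,0.2,Iris-setosa", "7.0,3.2,4.7,1.4,Iris-versicolor"],
)
print(arff_text)
```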
Exploratory Data Analysis in the Preprocess Tab
Inspect the data (current relation), remove attributes, get basic per-attribute statistics (selected attribute), visualize the class-label distribution over all attributes (Visualize All), and preprocess with a 'Filter' (explained in Part V)
Running a Classifier in Weka - a Neural Network Example
• Load a file containing the training data by clicking the 'Open file' button; ARFF and CSV formats are readable
• Click the 'Classify' tab, click the 'Choose' button, and select weka - functions - MultilayerPerceptron
• Click 'MultilayerPerceptron' to set the MLP parameters, set the test options, and click 'Start' to begin learning
Setting the Parameters of a Classification Algorithm
Parameter setting is like tuning a car: it takes experience or trial and error, and depending on the settings the same algorithm can deliver anywhere from the worst to the best performance
Main neural network parameters (MultilayerPerceptron in Weka) - structure: hiddenLayers; training: learningRate, momentum, trainingTime (epochs), seed
Main decision tree parameters (J48 in Weka) - unpruned, numFolds, minNumObj; directly affecting tree size: confidenceFactor, pruning
Main SVM parameters (SMO in Weka) - kernel: the choice of kernel and its kernel-specific parameters; optimization: c (complexity parameter)
Test Options and Classifier Output
There are various metrics for evaluation
Setting the data set used for evaluation
Evaluation Method - Cross Validation
K-fold cross validation: the data set is randomly divided into k subsets; one subset is used as the test set and the other k-1 subsets are put together to form the training set. This is repeated k times so that each subset serves as the test set once.
(Figure: with 6 folds D1-D6 of 30 instances each, every round trains on five folds and tests on the remaining one; each of D1-D6 takes a turn as the test fold.)
Error = (1/k) * Σ_{i=1..k} Error_i
Example - 6-fold cross validation: split 180 instances into 6 equal parts, then run training/evaluation 6 times and average the measured performance
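K-fold cross-validation (what Weka runs when you pick "Cross-validation" in Test options) can be sketched as follows. The 1-nearest-neighbour stand-in classifier and the toy data are our assumptions, added only to make the loop runnable:

```python
# Sketch of k-fold cross-validation with averaged per-fold error.

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k (nearly) equal folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_val_error(X, y, k, classify):
    folds = k_fold_indices(len(X), k)
    errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        wrong = sum(
            classify([X[i] for i in train_idx], [y[i] for i in train_idx], X[j]) != y[j]
            for j in test_idx
        )
        errors.append(wrong / len(test_idx))
    return sum(errors) / k  # Error = (1/k) * sum of per-fold errors

def nn1(train_X, train_y, query):
    """Stand-in classifier: label of the nearest training point (1-D)."""
    dists = [abs(x - query) for x in train_X]
    return train_y[dists.index(min(dists))]

X = [0.1, 0.2, 0.3, 0.9, 1.0, 1.1]
y = ["a", "a", "a", "b", "b", "b"]
print(cross_val_error(X, y, 3, nn1))  # well-separated toy data -> 0.0
```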
Classifier Output
Run information
Classifier model (built on the full training set)
Evaluation results: general summary, detailed accuracy by class, confusion matrix
The output depends on the classifier
How to Evaluate the Performance? (1/2)
Usually, build a ‘Confusion Matrix’ on the test data set
Evaluation metrics: accuracy (percent correct), precision / recall, and various others (F-measure, Kappa score, etc.)
For fair evaluation, the cross-validation scheme is used
How to Evaluate the Performance? (2/2)
Confusion Matrix (binary class case)

                        Real positive        Real negative
Prediction positive     TP                   FP                    (all with a positive test)
Prediction negative     FN                   TN                    (all with a negative test)
                        (all with disease)   (all without disease) (everyone)
Accuracy = (TP + TN) / (TP + FP + TN + FN)
Recall = TP / (TP + FN)
Precision = TP / (TP + FP)
As recall goes up, precision tends to go down, and vice versa.
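A quick sketch of these three metrics computed from confusion-matrix counts; the counts below are hypothetical, not from any dataset in the tutorial:

```python
# Accuracy, precision, and recall from binary confusion-matrix counts.

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

# Hypothetical counts for a binary test on 100 cases
tp, fp, fn, tn = 40, 10, 5, 45
print(accuracy(tp, fp, fn, tn))  # 0.85
print(precision(tp, fp))         # 0.8
print(recall(tp, fn))            # 40/45, about 0.889
```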
THE VARIOUS INTERFACES OF WEKA
Part Ⅲ
Using Experimenter in Weka
Tool for ‘Batch’ experiments
• Click 'New'
• Set the experiment type and iteration control
• Set the datasets and algorithms
• Select the 'Run' tab and click 'Start'
• When it has finished successfully, click the 'Analyse' tab and see the summary
Usages of the Experimenter
Model selection for classification/regression - various approaches: repeated training/test set split, repeated cross-validation (cf. double cross-validation), averaging
Comparison between models/algorithms - paired t-test on various metrics: accuracy, RMSE, etc.
Batch and/or distributed processing - load/save experiment settings (http://weka.wikispaces.com/Remote+Experiment); multi-core support: utilize all the cores of a multi-core machine
Experimenter Practice (screenshot walkthrough)
KnowledgeFlow for Analysis Process Design
(cf. the 'Process Flow Diagram' of SAS® Enterprise Miner; similar interfaces: KNIME, RapidMiner)
KnowledgeFlow: Example Usage
Decision tree (J48)
Command Line Interface (CLI)
Example command and result:
java weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a -t "C:\Program Files\Weka-3-7\data\iris.arff"
You can easily build command-line scripts for various batch experiments
Refer to Ch. 1 of WekaManual-3-*-*.pdf for further information
Package Manager
Various features can be added to the Explorer
CLUSTERING
Part Ⅳ
Motivating Questions for Clustering
What are the natural groupings in a set of data?
(Example: the same set of people can be grouped as school employees vs. the Simpson family, or as males vs. females.)
The way of grouping is not unique
Issues in Clustering
How should one measure the similarity between samples? (similarity measure)
How many clusters should there be? (number of clusters)
How should one evaluate a partitioning of a set of samples into clusters? (criterion function) E.g., high intra-class similarity and low inter-class similarity
Purposes of Clustering
Quick review of data
Clustering before classification: checking whether the given instances form well-separated clusters
Discovery: the possible number of class labels, subclasses, groups of features, modules (feature/instance combinations)
k-means Clustering
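The k-means procedure alternates an assignment step and a centroid-update step until the clusters settle. Below is a toy sketch of that loop (Lloyd's algorithm), not Weka's SimpleKMeans; the sample points and initial centroids are made up:

```python
# Minimal k-means sketch on 2-D points.

def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in centroids]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Update step: move each centroid to the mean of its cluster
        centroids = [
            tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
cents, clusters = kmeans(pts, [(0, 0), (10, 10)])
print(cents)  # two centroids, one near the origin and one near (9.3, 9.3)
```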
Hierarchical Clustering
Resulting dendrograms with the original data matrix
Self-organizing Map (SOM)
World poverty map (http://www.cis.hut.fi/research/som-research/worldmap.html)
K-means in Weka
• Load a file containing the training data by clicking the 'Open file' button; ARFF and CSV formats are readable
• Click the 'Cluster' tab, click the 'Choose' button, and select weka - clusterers - SimpleKMeans
• Click 'SimpleKMeans', set distanceFunction and the other parameters, check 'Classes to clusters evaluation', and click 'Start'
Evaluation of Clustering Results
Contingency table: used when we have labels for the instances; match the resulting clusters against the labels
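A sketch of that matching: build the cluster-vs-label contingency table, then score it with purity (purity is one common choice of score; the slide does not name a specific metric, and the toy assignments below are made up):

```python
# Contingency table of cluster/label co-occurrences, plus purity.
from collections import Counter

def contingency(cluster_ids, labels):
    """Count how often each (cluster, label) pair occurs."""
    return Counter(zip(cluster_ids, labels))

def purity(cluster_ids, labels):
    """Fraction of instances covered by each cluster's majority label."""
    table = contingency(cluster_ids, labels)
    best = Counter()
    for (c, _lab), n in table.items():
        best[c] = max(best[c], n)
    return sum(best.values()) / len(labels)

clusters = [0, 0, 0, 1, 1, 1]
labels = ["a", "a", "b", "b", "b", "b"]
print(purity(clusters, labels))  # (2 + 3) / 6
```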
Categories of Clustering Algorithms
Hierarchical clustering: agglomerative / divisive; concept clustering
Partitional clustering: k-means, fuzzy c-means, graph-theoretic methods, spectral clustering
Subspace clustering: co-clustering, biclustering
DATA PREPROCESSING AND VISUALIZATION
Part Ⅴ
Data Preprocessing with Filter in Weka
Attribute filters: selection, discretization
Instance filters: re-sampling, selecting specified folds
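As an illustration of one attribute filter, here is a minimal sketch of unsupervised equal-width discretization (the idea behind Weka's filters.unsupervised.attribute.Discretize); the bin count and the sample values are our assumptions:

```python
# Equal-width discretization: map numeric values to bin indices.

def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index 0..n_bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant attrs
    # Clamp to the last bin so the maximum value does not overflow
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

petal_lengths = [1.4, 1.3, 4.7, 4.5, 6.0, 5.9]
print(equal_width_bins(petal_lengths, 3))  # [0, 0, 2, 2, 2, 2]
```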
Visualization
Descriptive analysis: scatter plots and correlation analysis among features
Visualization
Unsupervised learning results: dimension reduction, clustering
Dendrogram (hierarchical clustering)
cluster assignments
Visualization
Classification: visualizing the trained model (tree, graph, decision boundary), overall performance evaluation (ROC curve / threshold curve, cost curve), and classifier errors
Examples: trees.J48; bayes.BayesNet with TAN
SELF-PRACTICE
Part Ⅵ
Additional Datasets for Practice
Three datasets in the 'dataset' folder - try to obtain the best classification accuracy on each
Spam classification - training/test sets: spamTrain.csv, spamTest.csv; Vocab.txt: list of vocabulary
Diagnosis of diabetes for Pima Indians - diabetes.csv
Handwritten digit recognition - handwritten_digit.csv
Dataset #1: Spam Classification
Description: many email services today provide spam filters that can classify emails into spam and non-spam. You will train a classifier to decide whether a given email x is spam or non-spam
Configuration of the data set: 1899 terms used to detect spam; every term is binary, indicating whether the term is present or not, giving 1899 binary attributes; binary class label; 4000 emails in the training set and 1000 in the test set
(Preprocessing and normalization steps were applied to the dataset)
Dataset #2: Pima Indians Diabetes
Description: Pima Indians have the highest prevalence of diabetes in the world. We will build classification models that diagnose whether a patient shows signs of diabetes. http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
Configuration of the data set: 768 instances, 8 attributes (age, number of times pregnant, results of medical tests/analysis), all numeric (integer or real-valued); a discretized version is also provided
Class label = 1 (positive example), interpreted as "tested positive for diabetes": 268 instances
Class label = 0 (negative example): 500 instances
Dataset #3: Handwritten Digits (MNIST)
Description: the MNIST database of handwritten digits contains digits written by office workers and students. We will build a recognition model based on classifiers, using a reduced subset of MNIST. http://yann.lecun.com/exdb/mnist/
Configuration of the data set: for this practice we use a subset of MNIST; the full set contains 60,000 training and 10,000 test samples, of which 5,000 examples are used here
Attributes: gray-level pixel values of a 20x20 image, i.e., 400 floating-point attributes (grayscale intensity)
Class attribute: 1-10, representing digits 1 to 9, with 10 standing for 0
MORE INFORMATION ON WEKA AND RELATED SOFTWARE
More Information on Weka
Current version (October 2014): stable 3.6.11, developer 3.7.11
Collections of datasets in Weka (ARFF) format: http://www.cs.waikato.ac.nz/ml/weka/datasets.html (datasets from the UCI repository, the UCI KDD repository, and more)
Weka References
Weka Wiki: http://weka.wikispaces.com/ (the Primer is a good starting point)
Weka online documentation: http://www.cs.waikato.ac.nz/ml/weka/index_documentation.html
Textbook: Ian H. Witten, Eibe Frank, Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques (Third Edition), Morgan Kaufmann, Jan. 2011.
Articles: "Data mining with WEKA", Parts 1-3, IBM Technical Library; "Building a prediction program with Weka", a serial in the Korean monthly Microsoftware (July-September 2009 issues); blog posts, MS Live
Other ML Open Source S/W’s
RapidMiner: the most-used analytics software in the 2013 and 2014 KDnuggets polls; its basic analysis interface is in the style of Weka's KnowledgeFlow; provides an integrated environment for ML, data mining, and predictive analytics. http://rapidminer.com/
MOA (Massive Online Analysis): a project closely related to WEKA; an open-source framework for data stream mining. http://moa.cms.waikato.ac.nz/
KNIME (Konstanz Information Miner): built around a modular data-pipelining concept. http://www.knime.org/
Other ML Open Source S/W’s
Mahout: http://mahout.apache.org/ - an Apache project producing free implementations of distributed or otherwise scalable machine learning algorithms (classification, clustering, collaborative filtering, frequent itemset mining). Book: Mahout in Action
MLOSS: http://mloss.org/ - a forum for open-source software in machine learning; http://jmlr.org/mloss/ - the JMLR Machine Learning Open Source Software (MLOSS) track