Machine Learning with Weka - Seoul National University · 2015-11-23 · Data analysis (data...

Byoung-Hee Kim (김병희)

Biointelligence Laboratory Seoul National University

http://bi.snu.ac.kr

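Machine Learning with Weka

Korean Cognitive Science and Technology Association, Cognitive Technology Tutorial: Big Data Analysis Technology for Human Cognitive Sensors (CogAnalytics)

October 17-18, 2014 (Fri-Sat), Seoul National University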

Tutorial Outline

Analytics & Machine Learning

Classification and Prediction with Weka

Interfaces of Weka

Clustering

Data Preprocessing and Visualization

Self-Practice

More Information on Weka and Related Software

(C) 2014, SNU Biointelligence Lab, http://bi.snu.ac.kr/

ANALYTICS & MACHINE LEARNING

Part I


Analytics

Analytics (analysis solutions): the process of discovering and communicating meaningful patterns in data

Used as a suffix in many fields, e.g., Text Analytics, Social Analytics, Business Analytics

Refers to the overall methodology that encompasses data analysis

CogAnalytics: Cognitive Science & Engineering; Predictive Analytics, Advanced Analytics; Machine Learning


Hype Cycle of Emerging Technologies 2010, Gartner

Analytics as Mainstream Technology


Human Learning & Machine Learning


(passive) Supervised Learning

• Gather data
• Preprocessing and feature selection
• Feature-correct label combinations

Scaffolding: tutoring theory to enhance human learning

• Knowledge becomes available (knowledge accumulates)
• Complexity reduction (the knowledge representation is simplified and refined)
• Experience-consequence combinations are provided

R. Elwell and R. Polikar, “Incremental learning of concept drift in nonstationary environments,” IEEE Trans. Neural Netw., vol. 22, no. 10, pp. 1517–31, Oct. 2011.

The Process of Data Analysis

• Define the question
• Prepare the dataset: define the ideal data set, determine what data you can access, obtain the data, clean the data
• Exploratory data analysis (today's practice: clustering / data visualization)
• Statistical prediction/modeling (today's practice: classification / prediction)
• Interpret results (today's practice: evaluation metrics and methods, model selection)
• Challenge results: question and check every step and result
• Synthesize/write up results
• Create reproducible code

J. Leek, Data Analysis – Structure of a Data Analysis, Lecture at Coursera, 2013

Choosing a Dataset by Analysis Goal

• Descriptive: a whole population
• Exploratory: a random sample with many variables measured
• Inferential: the right population, randomly sampled
• Predictive: a training and test data set from the same population
• Causal: data from a randomized study
• Mechanistic: data about all components of the system

J. Leek, Data Analysis – Structure of a Data Analysis, Lecture at Coursera, 2013

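Example: Photo-Based Automatic Classifier

• What should be classified?
• Which measurements should the decision be based on?
• Collect and analyze the measurement data
• 'Train' the automatic classifier, test it, release it

[Figure: example tasks. Fish species classification (salmon, sea bass, rockfish), gender classification, emotion classification]

Photo source: http://jamja.tistory.com/1705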

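Solution: Machine Learning - Supervised Learning - Classification

• What should be classified? Class label (y)
• Which measurements should the decision be based on? Features (attributes), variables (𝑥1, 𝑥2, 𝑥3, …)
• Collect and organize the measurement data: dataset, data matrix (𝑋, 𝑦)
• Train the automatic classifier, test it, release it: training dataset, test dataset, classification model, evaluation & model selection

[Figure: data matrix 𝑿 (rows: instances, columns: features) with label vector 𝒚, split into a training set and a test set]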
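The mapping above (instances as rows, features as columns, one label per instance, then a training/test split) can be sketched in a few lines of Python. This is only an illustration: the toy values, the `train_test_split` helper, and the 33% test ratio are assumptions and not part of the tutorial.

```python
import random

# Hypothetical toy data: each row of X is one instance, each column one feature.
X = [[5.1, 3.5], [4.9, 3.0], [7.0, 3.2], [6.4, 3.2], [6.3, 3.3], [5.8, 2.7]]
# y holds one class label per instance (same order as the rows of X).
y = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]


def train_test_split(X, y, test_ratio=0.33, seed=0):
    """Shuffle instance indices and split (X, y) into training and test sets."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(X) * test_ratio))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, y_train, X_test, y_test


X_train, y_train, X_test, y_test = train_test_split(X, y)
print(len(X_train), "training instances,", len(X_test), "test instances")
```

The classifier is fit on the training rows only; the test rows are held back to estimate how well the model generalizes.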

The Role of Classification Methods

[Figure: binary classification and multi-class classification, each plotted over the features x1 and x2]

A. Ng, Machine Learning, Lecture at Coursera, 2013

Terminology

Features (attributes, or variables)
• Features are the individual measurable properties of the phenomena being observed
• Choosing discriminating and independent features is key to any pattern recognition algorithm being successful in classification

Training set / Test set
• Training set: a set of examples used for learning, that is, to fit the parameters [i.e., weights] of the classifier
• Test set: a set of examples used only to assess the performance [generalization] of a fully specified classifier


Classification and Prediction using Weka

Part Ⅱ


Introduction to Weka

A collection of representative machine learning algorithms and a data mining tool

Main features of Weka
• Data pre-processing, feature selection
• Clustering, visualization
• Classification, regression, time series forecasting
• Association rule learning

Software characteristics
• Free and open source software (GNU General Public License)
• Served as the basis for other major analytics software: RapidMiner, MOA
• Implemented in Java; runs on many platforms

Download
• Search Google for "Weka" and follow the first result: http://www.cs.waikato.ac.nz/ml/weka/

Weka (bird): http://www.arkive.org/weka/gallirallus-australis/video-au00.html






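Interfaces of Weka

Explorer: run analysis tasks one step at a time and inspect the results. Usually the first interface to launch.

KnowledgeFlow: lay out the main modules of a data processing pipeline as a graph and run experiments on it.

Experimenter: batch processing of classification and regression, with comparative analysis of the results.
- Set up multiple algorithms and parameter settings
- Analyze many dataset-algorithm combinations in one run
- Statistical comparison between models
- Large-scale statistical experiments

Simple CLI: a command window for scripts that control the other interfaces. Every Weka function can be run as a command.

Other major tools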

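Structure of the Weka Practice

Solving a classification problem right away
• Problem: iris species classification (self-study: spam filtering, diabetes prediction, handwriting recognition)
• Goal: predictive
• Key steps: statistical prediction/modeling, interpreting results
• Tools: Weka Explorer, Experimenter

Preparatory work for understanding and solving classification problems better
• Goal: exploratory
• Key step: exploratory data analysis
• Main tasks: data preprocessing; evaluating and selecting features by their contribution to classification; data clustering; data visualization
• Tool: Weka Explorer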


Practice: Iris Classification


Iris virginica Iris versicolor Iris setosa

Features used for classification

http://en.wikipedia.org/wiki/File:Kosaciec_szczecinkowaty_Iris_setosa.jpg

http://en.wikipedia.org/wiki/File:Iris_versicolor_3.jpg

http://en.wikipedia.org/wiki/File:Iris_virginica.jpg

http://en.wikipedia.org/wiki/File:Petal-sepal.jpg

Practice: Iris Classification

Define features (attributes)
• Sepal length, sepal width, petal length, petal width
• Class label: three iris species (setosa, versicolor, and virginica)

Collect samples and build the dataset
• 50 samples collected per species (1935)
• Data table: 150 samples (or instances) × 5 attributes
• Sir R. Fisher applied a linear discriminant model to this data in his 1936 paper

Learning: choose classification algorithms and set their parameters
• Three classification algorithms are used in this practice: neural network, decision tree, SVM
• Default parameter settings are used for each algorithm

Evaluate the learned models and select one
• Check a range of evaluation metrics
• Compare and select learned models (algorithm + parameter setting) based on those metrics


http://en.wikipedia.org/wiki/Linear_discriminant_analysis


Classification Algorithm: Neural Networks

MLP (Multilayer Perceptron)
• A representative classification algorithm that is very widely used in practice
• Where to find it in Weka: classifiers - functions - MultilayerPerceptron

Figure from Andrew Ng’s Machine Learning Lecture Notes, on Coursera, 2013-1

Classification Algorithm: Decision Trees

J48 (a Java implementation of C4.5)
• The learned model yields classification rules in the form of a tree
• Where to find it in Weka: classifiers - trees - J48


Classification Algorithm: Support Vector Machines

SMO (sequential minimal optimization) for training SVM
• A representative classification algorithm based on kernel machines
• Where to find it in Weka: classifiers - functions - SMO


Practice: The Iris Dataset

Just open “iris.arff” in the ‘data’ folder


Weka Data Format (.ARFF)

Header:

@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA

Data (CSV format):

5.1, 3.5, 1.4, 0.2, Iris-setosa
4.9, 3.0, 1.4, 0.2, Iris-setosa
4.7, 3.2, 1.3, 0.2, Iris-setosa
…
7.0, 3.2, 4.7, 1.4, Iris-versicolor
6.4, 3.2, 4.5, 1.5, Iris-versicolor
6.9, 3.1, 4.9, 1.5, Iris-versicolor
…

Note: an ARFF file is easy to produce by creating a CSV file in Excel and then adding only the header.
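The "CSV plus header" recipe can also be scripted. The following Python sketch is one possible way to do it, under stated assumptions: the `csv_to_arff` helper is hypothetical, every column except the class column is treated as numeric (REAL), and the class column is treated as nominal.

```python
import csv
import io


def csv_to_arff(csv_text, relation, class_column):
    """Convert CSV text with a header row into ARFF text.

    Assumption: every column except `class_column` is numeric (REAL),
    and `class_column` is nominal.
    """
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    header, data = rows[0], rows[1:]
    class_idx = header.index(class_column)
    # Nominal values are listed in order of first appearance.
    class_values = list(dict.fromkeys(row[class_idx] for row in data))

    lines = [f"@RELATION {relation}", ""]
    for i, name in enumerate(header):
        if i == class_idx:
            lines.append(f"@ATTRIBUTE {name} {{{','.join(class_values)}}}")
        else:
            lines.append(f"@ATTRIBUTE {name} REAL")
    lines += ["", "@DATA"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines)


csv_text = (
    "sepallength,sepalwidth,petallength,petalwidth,class\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
)
arff = csv_to_arff(csv_text, "iris", "class")
print(arff)
```

Real data may need more care (missing values, quoted strings, date attributes); this sketch covers only the simple numeric-plus-class case shown on the slide.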


Exploratory Data Analysis in the Preprocess Tab

• Dataset summary (Current relation)
• Removing attributes (Remove)
• Basic statistics for each attribute (Selected attribute)
• Visualizing the class label distribution over all attributes (Visualize All)
• Preprocessing with ‘Filter’: explained in Part V


Running a Classification in Weka: Neural Network Example

• Load a file that contains the training data by clicking the ‘Open file’ button
• ‘ARFF’ or ‘CSV’ formats are readable
• Click the ‘Classify’ tab
• Click the ‘Choose’ button
• Select ‘weka - classifiers - functions - MultilayerPerceptron’
• Click ‘MultilayerPerceptron’
• Set parameters for MLP
• Set parameters for Test
• Click ‘Start’ for learning



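Parameter Settings for Classification Algorithms

Parameter setting = car tuning
• Requires a lot of experience or trial and error
• Depending on the parameter settings, the same algorithm can show anything from the worst to the best performance

Main parameters of the neural network (MultilayerPerceptron in Weka)
• Structure: hiddenLayers
• Learning process: learningRate, momentum, trainingTime (epochs), seed

Main parameters of the decision tree (J48 in Weka)
• unpruned, numFolds, minNumObj
• Direct influence on tree size: confidenceFactor, pruning, etc.

Main parameters of the Support Vector Machine (SMO in Weka)
• Kernel: choice of kernel, plus additional kernel-specific parameters
• Optimization: c (complexity parameter)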


Test Options and Classifier Output


There are various metrics for evaluation

Setting the data set used for evaluation



Evaluation Method - Cross Validation

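K-fold Cross Validation
• The data set is randomly divided into k subsets.
• In each round, one of the k subsets is used as the ‘test set’ and the other k-1 subsets are put together to form a ‘training set’.
• The overall error is the average of the k per-round errors:

$$\mathrm{Error} = \frac{1}{k} \sum_{i=1}^{k} \mathrm{Error}_i$$

[Figure: 180 instances split into six subsets D1 to D6 of 30 instances each; in each round a different subset (D6, then D5, ..., then D1) is held out as the test set and the other five form the training set]

Example: in 6-fold cross validation, 180 instances are split into six equal parts, training and evaluation are run six times, and the average performance is reported.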
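The procedure can be sketched in Python as follows. This is a minimal illustration and not Weka's implementation: `k_fold_indices`, `cross_validate`, and the majority-class "classifier" used as the error function are all assumptions made for the example, as are the hypothetical labels.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Randomly partition the indices 0..n-1 into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(n, k, error_fn, seed=0):
    """Return Error = (1/k) * sum of Error_i over the k held-out folds."""
    folds = k_fold_indices(n, k, seed)
    errors = []
    for i, test_idx in enumerate(folds):
        # All folds except the i-th one form the training set.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        errors.append(error_fn(train_idx, test_idx))
    return sum(errors) / k


# Hypothetical labels for 180 instances (120 of class "a", 60 of class "b").
labels = ["a"] * 120 + ["b"] * 60


def majority_error(train_idx, test_idx):
    """Baseline: predict the most frequent training label for every test instance."""
    majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
    return sum(labels[i] != majority for i in test_idx) / len(test_idx)


folds = k_fold_indices(len(labels), 6)
mean_error = cross_validate(len(labels), 6, majority_error)
print([len(f) for f in folds], round(mean_error, 3))
```

Any learner can be plugged in through `error_fn`, as long as it trains only on `train_idx` and reports its error on `test_idx`.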


Classifier Output

• Run information
• Classifier model (full training set)
• Evaluation results: general summary, detailed accuracy by class, confusion matrix


The output depends on the classifier


How to Evaluate the Performance? (1/2)

Usually, build a ‘Confusion Matrix’ on the test data set

Evaluation metrics
• Accuracy (percent correct)
• Precision / Recall
• Various other metrics: F-measure, Kappa score, etc.

For fair evaluation, the ‘cross-validation’ scheme is used



How to Evaluate the Performance? (2/2)

Confusion Matrix (binary class case)

                      Real positive       Real negative
Predicted positive    TP                  FP                    All with positive test
Predicted negative    FN                  TN                    All with negative test
                      All with disease    All without disease   Everyone

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}$$

As recall ↑, precision ↓; conversely, as recall ↓, precision ↑
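To make the three formulas concrete, here is a small Python sketch that counts TP, FP, FN, and TN from paired actual/predicted labels and then applies the definitions. The helper names and the toy label lists are assumptions for illustration only.

```python
def confusion_counts(actual, predicted, positive):
    """Count TP, FP, FN, TN for a binary problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, really positive
        elif p == positive:
            fp += 1  # predicted positive, really negative
        elif a == positive:
            fn += 1  # predicted negative, really positive
        else:
            tn += 1  # predicted negative, really negative
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Accuracy, precision, and recall as defined on the slide."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical test-set labels (1 = positive, 0 = negative).
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]

tp, fp, fn, tn = confusion_counts(actual, predicted, positive=1)
accuracy, precision, recall = metrics(tp, fp, fn, tn)
print(tp, fp, fn, tn, accuracy, precision, recall)
```

Returning 0.0 when a denominator is empty is a convention chosen for this sketch; other tools may report such cases differently.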


Interfaces of Weka

Part Ⅲ


Using Experimenter in Weka

Tool for ‘Batch’ experiments

• Click ‘New’
• Set experiment type/iteration control
• Set datasets / algorithms
• Select the ‘Run’ tab and click ‘Start’
• If it has finished successfully, click the ‘Analyse’ tab and see the summary


Usages of the Experimenter

Model selection for classification/regression
• Various approaches: repeated training/test set split, repeated cross-validation (cf. double cross-validation), averaging

Comparison between models / algorithms
• Paired t-test
• On various metrics: accuracies / RMSE / etc.

Batch and/or distributed processing
• Load/save experiment settings
• Remote experiments: http://weka.wikispaces.com/Remote+Experiment
• Multi-core support: utilize all the cores on a multi-core machine
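As a rough illustration of the paired t-test mentioned above, the following Python sketch computes the t statistic from per-run accuracies of two algorithms evaluated on the same splits. The accuracy values are hypothetical, and this is the plain textbook statistic; Weka's Experimenter may apply a corrected variant, so its numbers need not match this sketch.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Plain paired t statistic: mean difference over its standard error."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (std_diff / math.sqrt(len(diffs)))


# Hypothetical accuracies of two algorithms on the same four data splits.
acc_a = [0.90, 0.92, 0.94, 0.96]
acc_b = [0.88, 0.91, 0.92, 0.93]

t = paired_t_statistic(acc_a, acc_b)
print(round(t, 3), "with", len(acc_a) - 1, "degrees of freedom")
```

The statistic is then compared against a t distribution with n - 1 degrees of freedom to decide whether the difference is significant.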

Experimenter Practice


KnowledgeFlow for Analysis Process Design


(‘Process Flow Diagram’ of SAS® Enterprise Miner )


KNIME

RapidMiner

KnowledgeFlow: Example Usage

Decision tree (J48)


Command Line Interface (CLI)

Example command and result

java weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a -t "C:\Program Files\Weka-3-7\data\iris.arff"

You can easily build command-line scripts for various batch experiments

Refer to Ch. 1 of WekaManual-3-*-*.pdf for further information
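A batch script of this kind could be generated from Python, for example to sweep the learning rate. In the sketch below, the flags `-L`, `-M`, `-N`, and `-t` are taken from the example command; the `build_mlp_command` helper, the `weka.jar` classpath, and the relative `iris.arff` path are assumptions that depend on the local installation. The script only prints the commands and does not run them.

```python
def build_mlp_command(data_path, learning_rate=0.3, momentum=0.2, epochs=500,
                      classpath="weka.jar"):
    """Build the argument list for one MultilayerPerceptron run.

    Assumption: `classpath` points at the local Weka jar file.
    """
    return [
        "java", "-cp", classpath,
        "weka.classifiers.functions.MultilayerPerceptron",
        "-L", str(learning_rate),  # learning rate
        "-M", str(momentum),       # momentum
        "-N", str(epochs),         # number of training epochs
        "-t", data_path,           # training file
    ]


# One command per learning rate; the other settings keep the values from the slide.
commands = [build_mlp_command("iris.arff", learning_rate=lr) for lr in (0.1, 0.3, 0.5)]
for cmd in commands:
    print(" ".join(cmd))
```

With Weka installed, each argument list could be passed to `subprocess.run(cmd, check=True)` to execute the experiment.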

Package Manager

Lets you add a variety of features to the Explorer


Clustering

Part Ⅳ


Motivating Questions for Clustering

What are the natural groupings in a set of data?

[Figure: the same characters grouped either as school employees vs. Simpson's family, or as males vs. females]

The way of grouping is not unique

Issues in Clustering

How should one measure the similarity between samples? (similarity measure)

How many clusters are there? (number of clusters)

How should one evaluate a partitioning of a set of samples into clusters? (criterion function) E.g., high intra-class similarity and low inter-class similarity


Purposes of Clustering

Quick review of data

Clustering before classification
• Checking whether the given instances form well-separated clusters

Discovery
• Possible number of class labels
• Subclasses
• Groups of features
• Modules (feature / instance combinations)



k-means Clustering



Hierarchical Clustering


Resulting dendrograms with the original data matrix

Self-organizing Map (SOM)


World poverty map (http://www.cis.hut.fi/research/som-research/worldmap.html)





K-means in Weka

• Load a file that contains the training data by clicking the ‘Open file’ button
• ‘ARFF’ or ‘CSV’ formats are readable
• Click the ‘Cluster’ tab
• Click the ‘Choose’ button
• Select ‘weka - clusterers - SimpleKMeans’
• Click ‘SimpleKMeans’
• Set distanceFunction
• Set other parameters
• Check ‘Classes to cluster ~’
• Click ‘Start’
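For intuition about what a k-means clusterer computes, here is a minimal Python sketch of the standard assign-then-update loop (Lloyd's algorithm) with Euclidean distance. It is not Weka's SimpleKMeans implementation: the `k_means` helper, the caller-supplied starting centroids, and the toy points are assumptions made for the example.

```python
import math


def k_means(points, centroids, max_iter=100):
    """Lloyd's algorithm with Euclidean distance and caller-supplied start centroids."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assignments


# Hypothetical 2-D points forming two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, assignments = k_means(points, centroids=[(0, 0), (10, 10)])
print(centroids, assignments)
```

The choice of distance function and of the starting centroids both affect the result, which is why those settings are exposed as parameters in the GUI.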


Evaluation of Clustering Results

Contingency Table
• Used when we have labels for instances
• Matching resulting clusters with labels
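A contingency table of clusters against known labels can be sketched in Python as below. The majority-label matching used here is a simple assumption for illustration; Weka's own classes-to-clusters evaluation may match clusters to classes differently. The helper names and the toy data are hypothetical.

```python
from collections import Counter


def contingency_table(cluster_ids, labels):
    """Count how often each (cluster, label) pair occurs."""
    return Counter(zip(cluster_ids, labels))


def cluster_to_label(table):
    """Map each cluster to its most frequent label (simple majority matching)."""
    best = {}
    for (cluster, label), count in table.items():
        if cluster not in best or count > best[cluster][1]:
            best[cluster] = (label, count)
    return {cluster: label for cluster, (label, _) in best.items()}


# Hypothetical clustering result and true labels for seven instances.
cluster_ids = [0, 0, 0, 1, 1, 1, 1]
labels = ["setosa", "setosa", "versicolor",
          "versicolor", "versicolor", "virginica", "versicolor"]

table = contingency_table(cluster_ids, labels)
mapping = cluster_to_label(table)
matched = sum(table[(c, lab)] for c, lab in mapping.items())
error_rate = 1 - matched / len(labels)
print(dict(table), mapping, round(error_rate, 3))
```

Instances whose label differs from the label matched to their cluster count as incorrectly clustered.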


Categories of Clustering Algorithms

Hierarchical clustering
• Agglomerative / divisive
• Concept clustering

Partitional clustering
• K-means clustering
• Fuzzy c-means clustering
• Graph-theoretic clustering methods

Spectral clustering

Subspace clustering
• Co-clustering, biclustering


Data Preprocessing & Visualization

Part Ⅴ


Data Preprocessing with Filter in Weka

Attribute filters: selection, discretization

Instance filters: re-sampling, selecting specified folds


Visualization

Descriptive analysis
• Scatter plot & correlation analysis among features

Unsupervised learning results
• Dimension reduction
• Clustering: dendrogram (hierarchical clustering), cluster assignments

Learned classification models
• Tree, graph
• Boundary
• Overall model performance: ROC curve (threshold curve), cost curve
• Classifier errors

Examples: trees.J48; bayes.BayesNet with TAN

Self-Practice

Part Ⅵ


Additional Datasets for Practice

Three datasets in the ‘dataset’ folder. Try to get the best classification accuracy you can on each one.

Spam classification
• Training/test set: spamTrain.csv, spamTest.csv
• Vocab.txt: list of vocabulary

Diagnosis of diabetes for Pima Indians
• diabetes.csv

Handwritten digit recognition
• handwritten_digit.csv


Dataset #1: Spam Classification

Description
• Many email services today provide spam filters that are able to classify emails into spam and non-spam email
• You will be training a classifier to classify whether a given email, x, is spam or non-spam

Configuration of the data set
• 1899 terms are checked to detect spam
• All terms are binary: each one records whether the term occurs in the email or not
• 1899 binary attributes
• Binary class label
• 4000 emails in the training set
• 1000 emails in the test set


Preprocessing and Normalization steps which were applied to the dataset
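The binary term attributes described above can be illustrated with a short Python sketch. The five-word vocabulary, the tokenization rule, and the `email_to_binary_features` helper are assumptions for illustration; the actual dataset uses the 1899 terms in Vocab.txt and its own preprocessing steps.

```python
import re


def email_to_binary_features(email_text, vocabulary):
    """Return one 0/1 attribute per vocabulary term: 1 if the term occurs in the email."""
    tokens = set(re.findall(r"[a-z0-9]+", email_text.lower()))
    return [1 if term in tokens else 0 for term in vocabulary]


# Hypothetical five-term vocabulary standing in for the 1899 terms.
vocabulary = ["buy", "click", "meeting", "now", "report"]
features = email_to_binary_features("Click here to BUY now!", vocabulary)
print(features)
```

Each email becomes a fixed-length vector, so emails of any length map onto the same set of attributes.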

Dataset #2: Pima Indians Diabetes

Description
• Pima Indians have one of the highest reported prevalences of diabetes in the world
• We will build classification models that diagnose whether a patient shows signs of diabetes
• http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes

Configuration of the data set
• 768 instances
• 8 attributes: age, number of times pregnant, results of medical tests/analysis
• All attributes are numeric (integer or real-valued); a discretized set will also be provided
• Class label = 1 (positive example), interpreted as "tested positive for diabetes": 268 instances
• Class label = 0 (negative example): 500 instances

http://upload.wikimedia.org/wikipedia/commons/6/6f/Pima.jpg

Dataset #3: Handwritten Digits (MNIST)

Description
• The MNIST database of handwritten digits contains digits written by office workers and students
• We will build a recognition model based on classifiers with the reduced set of MNIST
• http://yann.lecun.com/exdb/mnist/

Configuration of the data set
• For our practice, we use a subset of the MNIST set
• The full MNIST set contains 60,000 training and 10,000 test samples
• 5,000 examples are used for this practice

Attributes
• Pixel values in gray level in a 20x20 image
• 400 attributes (floating point-valued: grayscale intensity)

Class attribute: 1~10, which represent the digits 1 to 9, with 10 standing for 0
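When inspecting individual rows of this data set, two small helpers are handy: one that turns the class value back into a digit and one that reshapes the 400 attributes into a 20x20 grid. The Python sketch below is an illustration only; the helper names are hypothetical, and the row-by-row pixel order is an assumption (if the file stores pixels column by column, the grid has to be transposed).

```python
def decode_label(class_value):
    """Map class values 1..10 to digits, where 10 stands for the digit 0."""
    return class_value % 10


def to_image(pixels, size=20):
    """Reshape a flat list of size*size pixel values into `size` rows.

    Assumption: pixels are stored row by row.
    """
    if len(pixels) != size * size:
        raise ValueError(f"expected {size * size} pixel values, got {len(pixels)}")
    return [pixels[r * size:(r + 1) * size] for r in range(size)]


# Hypothetical instance: 400 placeholder pixel values and the class value 10.
pixels = list(range(400))
image = to_image(pixels)
digit = decode_label(10)
print(len(image), len(image[0]), digit)
```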



More Information on Weka and Related Software


More Information on Weka

Current version (October, 2014)
• Stable version: 3.6.11
• Developer version: 3.7.11

Collections of datasets in Weka (ARFF) format
• http://www.cs.waikato.ac.nz/ml/weka/datasets.html
• Datasets from UCI repository
• Datasets from UCI KDD repository
• …

Weka References

Weka Wiki: http://weka.wikispaces.com/
• Primer (good starting point): http://weka.wikispaces.com/Primer

Weka online documentation: http://www.cs.waikato.ac.nz/ml/weka/index_documentation.html

Textbook
• Ian H. Witten, Eibe Frank, Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques (Third Edition), Morgan Kaufmann, Jan. 2011.

Articles
• Data mining with WEKA, Part 1, Part 2, Part 3, in IBM Technical Library: http://www.ibm.com/developerworks/opensource/library/os-weka1/index.html
• Building a Prediction Program with Weka, serialized in Monthly Maso (July, August, and September 2009 issues)
  Blog: http://freesearch.pe.kr/archives/tag/weka
  MS Live: https://skydrive.live.com/?cid=d97db53e9d36cde4&id=D97DB53E9D36CDE4!163

Other ML Open Source Software

RapidMiner
• The most used analytics software in 2013 & 2014 (KDnuggets Poll)
• Basic interface for analyses is in the style of KnowledgeFlow in Weka
• Provides an integrated environment for ML, DM, and predictive analytics
• http://rapidminer.com/

MOA (Massive Online Analysis)
• Closely related project to the WEKA project
• Open source framework for data stream mining
• http://moa.cms.waikato.ac.nz/

KNIME
• Konstanz Information Miner
• Modular data pipelining concept
• http://www.knime.org/

Other ML Open Source Software

Mahout
• http://mahout.apache.org/
• Apache project to produce free implementations of distributed or otherwise scalable machine learning algorithms
• Classification, clustering, collaborative filtering, and frequent itemset mining
• Book: Mahout in Action, http://www.amazon.com/Mahout-Action-Sean-Owen/dp/1935182684/ref=sr_1_1?s=books&ie=UTF8&qid=1378757481&sr=1-1&keywords=Mahout+in+Action

MLOSS
• http://mloss.org/ : forum for open source software in machine learning
• http://jmlr.org/mloss/ : JMLR Machine Learning Open Source Software (MLOSS)
