weka a data mining tool – part 2 - uh.edusmiertsc/4397cis/weka_data-mining_tool_part2.pdf ·...

Post on 18-Mar-2018

WEKA
A Data Mining Tool – Part 2

By Susan L. Miertschin

Preparing Data for Data Mining

Data cleaning, data cleansing
Gather up the data and make sure it is all consistently in the same format
Can be an arduous, time-consuming process
Modern tools help

ARFF Format

Standard way of representing datasets that consist of independent, unordered instances
Does not involve relationships among instances or attributes

ARFF Syntax Rules

% indicates a comment

@relation name_of_relation
Names the relation

@attribute name_of_attribute {list of categorical value possibilities}
OR
@attribute name_of_attribute numeric

@data
Indicates that the comma-delimited rows that follow are the data
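The syntax rules above can be illustrated with a small reader. This is a minimal sketch in Python, not WEKA's own parser: the function name parse_arff and the sample relation, attribute names, and data rows are made up for illustration, and the sketch assumes values contain no spaces, quotes, or embedded commas.

```python
def parse_arff(text):
    """Read the ARFF subset described on the slide.

    Returns (relation, attributes, data), where attributes is a list of
    (name, "numeric") or (name, [categorical values]) pairs.
    """
    relation = None
    attributes = []
    data = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # blank line or % comment
        if in_data:
            # Every row after @data is one comma-delimited instance.
            data.append([v.strip() for v in line.split(",")])
            continue
        lower = line.lower()
        if lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                values = [v.strip() for v in spec.strip("{}").split(",")]
                attributes.append((name, values))
            else:
                attributes.append((name, spec.lower()))
        elif lower == "@data":
            in_data = True
    return relation, attributes, data


# Hypothetical sample file contents (not the course dataset).
SAMPLE = """\
% Made-up example
@relation demo
@attribute income_range {20-30K, 30-40K, 40-50K}
@attribute age numeric
@data
30-40K,45
20-30K,40
"""

relation, attributes, data = parse_arff(SAMPLE)
print(relation, attributes, data)
```

Real ARFF files can also contain quoted values, missing values, and other attribute types, which this sketch does not handle.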

WEKA Uses Data Files in ARFF Format

Can also convert from .csv or .txt to .arff
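To show roughly what a .csv to .arff conversion involves, here is a minimal Python sketch. It is not how WEKA performs the conversion; the function name csv_to_arff, the type-inference rule (a column is numeric only if every value parses as a number), and the sample column names are assumptions made for illustration.

```python
import csv
import io


def _is_number(text):
    """Return True if the text parses as a floating-point number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation):
    """Convert CSV text with a header row into ARFF text.

    Assumes no missing values and no spaces or commas inside values.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = [row for row in reader if row]
    out = [f"@relation {relation}", ""]
    for i, name in enumerate(header):
        column = [row[i] for row in rows]
        if all(_is_number(v) for v in column):
            out.append(f"@attribute {name} numeric")
        else:
            # Categorical attribute: list each distinct value once.
            values = sorted(set(column))
            out.append(f"@attribute {name} {{{','.join(values)}}}")
    out += ["", "@data"]
    out += [",".join(row) for row in rows]
    return "\n".join(out) + "\n"


# Hypothetical two-column example.
print(csv_to_arff("IncomeRange,Age\n30-40K,45\n20-30K,40\n", "demo"))
```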

Start WEKA

WEKA Main Interface

Credit Card Promotion Example

Download CreditCardPromotion.xlsx
Save the sheet named CreditCardPromotion as CreditCardPromotion.csv
Open CreditCardPromotion.csv in WEKA

Open List of File Types

Open the File You Saved as .csv

WEKA Opens .csv

Save in .arff Format from WEKA

Following the Example in Ch.04 that Uses ESX

Results Vary Depending on Algorithm Used

Categorical Data Concepts

Domain Predictability

A is a categorical attribute
E.g., Income Range
Possible values of A are {V1, V2, V3, …, Vn}
E.g., 20-30K, 30-40K, 40-50K, etc.
Domain predictability of Vk is the percent of instances that have Vk as the value for attribute A
E.g., 5 out of 15 instances are in the category 30-40K
Domain predictability of 30-40K is 33.3%
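The definition above is a simple relative frequency, which can be computed directly from a column of values. A minimal Python sketch follows; the function name domain_predictability is made up, and the sample column is invented except for the slide's figure of 5 instances out of 15 in the 30-40K category.

```python
from collections import Counter


def domain_predictability(values):
    """Map each attribute value to the fraction of instances that have it."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


# 15 hypothetical instances; only the 5-of-15 count for 30-40K
# comes from the slide, the other counts are made up.
income_range = ["20-30K"] * 4 + ["30-40K"] * 5 + ["40-50K"] * 6

result = domain_predictability(income_range)
print(round(result["30-40K"] * 100, 1))  # percent of instances in 30-40K
```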

Domain Predictability

Domain Predictability

If the domain predictability of one value is quite high, e.g., 98%, then the attribute is not useful for classification or clustering
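One way to apply this rule is to flag every attribute whose most common value reaches a chosen domain predictability. The Python sketch below is an illustration only: the function name dominated_attributes, the attribute names, and the data are hypothetical, and the 0.98 threshold simply mirrors the slide's 98% example rather than any recommended cutoff.

```python
from collections import Counter


def dominated_attributes(table, threshold=0.98):
    """Return names of attributes where one value covers >= threshold of instances.

    table maps an attribute name to the list of its values, one per instance.
    """
    flagged = []
    for name, values in table.items():
        top_count = Counter(values).most_common(1)[0][1]
        if top_count / len(values) >= threshold:
            flagged.append(name)
    return flagged


# Hypothetical data: 100 instances, two categorical attributes.
table = {
    "has_card": ["yes"] * 99 + ["no"],
    "sex": ["male"] * 50 + ["female"] * 50,
}
print(dominated_attributes(table))
```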

Class Predictability

A is a categorical attribute
e.g., Income Range
Possible values of A are {V1, V2, V3, …, Vn}
e.g., 20-30K, 30-40K, 40-50K, etc.
Class predictability of Vk is the percent of instances in a particular class that have Vk as the value for attribute A
Cannot see how to get this from WEKA??
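Class predictability can be computed directly from the data outside WEKA. The following is a minimal Python sketch under stated assumptions: instances are dictionaries, and the function name class_predictability, the field names, and the sample rows are all hypothetical.

```python
def class_predictability(instances, attribute, value, class_attribute, class_label):
    """Fraction of the instances in class_label whose attribute equals value."""
    in_class = [row for row in instances if row[class_attribute] == class_label]
    if not in_class:
        raise ValueError("no instances in the requested class")
    matching = sum(1 for row in in_class if row[attribute] == value)
    return matching / len(in_class)


# Hypothetical instances; "cluster" stands in for the class attribute.
instances = [
    {"income_range": "30-40K", "cluster": 0},
    {"income_range": "30-40K", "cluster": 0},
    {"income_range": "20-30K", "cluster": 0},
    {"income_range": "30-40K", "cluster": 1},
]
print(class_predictability(instances, "income_range", "30-40K", "cluster", 0))
```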

Class Predictiveness

Probability that an instance resides in a specified class given the instance has the value for the chosen attribute
A is a categorical attribute
e.g., Income Range
Possible values of A are {V1, V2, V3, …, Vn}
e.g., 20-30K, 30-40K, 40-50K, etc.
Attribute-value predictiveness for Vk is the probability an instance is in class C given the value for attribute A is Vk
E.g., probability an instance is in class 0 given income range of the instance is 30-40K
Cannot see how to get this from WEKA??
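Predictiveness is the conditional probability in the other direction, so it can be estimated from joint counts of (attribute value, class) pairs. The Python sketch below is one possible way to do this outside WEKA; the function name class_predictiveness and the sample pairs are made up for illustration.

```python
from collections import Counter


def class_predictiveness(pairs, value, class_label):
    """Estimate P(class = class_label | attribute = value).

    pairs is an iterable of (attribute_value, class_label) tuples,
    one per instance.
    """
    joint = Counter(pairs)
    with_value = sum(count for (v, _), count in joint.items() if v == value)
    if with_value == 0:
        raise ValueError("no instances have the requested attribute value")
    return joint[(value, class_label)] / with_value


# Hypothetical (income range, class) pairs.
pairs = [("30-40K", 0), ("30-40K", 0), ("30-40K", 1), ("20-30K", 1)]
print(class_predictiveness(pairs, "30-40K", 0))
```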

Interpretations of Class Predictiveness and Predictability

If an attribute value has predictability and predictiveness of 1.0 for a class, then the attribute value is necessary and sufficient for membership in the class
If an attribute value has predictability < 1 and predictiveness = 1, then the attribute value is sufficient but not necessary for membership in the class
All instances with the value are in the class, but there are others in the class that have a different value
If an attribute value has predictability = 1 and predictiveness < 1, then the attribute value is necessary but not sufficient for membership in the class
All instances in the class have the same value for the attribute being considered, but some instances outside the class also have the same attribute value
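The three cases above amount to a small decision rule on the two scores. Here is a minimal Python sketch; the function name interpret and the tolerance parameter are assumptions, and the fourth outcome (both scores below 1) is not on the slide but is the remaining logical case.

```python
def interpret(predictability, predictiveness, tol=1e-9):
    """Describe an attribute value's relation to class membership.

    Both scores are fractions in [0, 1] for one attribute value and one class.
    """
    necessary = abs(predictability - 1.0) <= tol   # every class member has the value
    sufficient = abs(predictiveness - 1.0) <= tol  # everything with the value is in the class
    if necessary and sufficient:
        return "necessary and sufficient"
    if sufficient:
        return "sufficient but not necessary"
    if necessary:
        return "necessary but not sufficient"
    return "neither necessary nor sufficient"  # case added here, not on the slide


print(interpret(1.0, 1.0))
print(interpret(0.6, 1.0))
print(interpret(1.0, 0.6))
```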

Section 4.5 A Six-Step Approach for Supervised Learning

Use WEKA’s decision tree and rule-based classifiers J48, PART
