A/W & Dr. Chen, Data Mining

Chapter 4
An Excel-based Data Mining Tool (iData Analyzer)

Jason C. H. Chen, Ph.D.
Professor of MIS
School of Business Administration
Gonzaga University
Spokane, WA 99223
[email protected]



Objectives

• This chapter introduces the iData Analyzer (iDA) and shows how to use two of the learner models contained in the iDA data mining software.

• Section 4.1 gives an overview of the iDA model for knowledge discovery.

• Section 4.2 introduces ESX, an exemplar-based data mining tool capable of both supervised learning and unsupervised clustering.

• The chapter also covers how to represent datasets and how to use ESX to perform unsupervised clustering and build supervised learner models.


4.1 The iData Analyzer

• iDA provides support for the business or technical analyst by offering a visual learning environment, an integrated tool set, and data mining process support.

• iDA consists of the following components:
– Preprocessor
– Heuristic agent (for large datasets)
– ESX
– Neural Network
– Rule Maker
– Report Generator

See p. 107 and Appendix A-2 for installation instructions.


Limitations

• The commercial version of iDA is bounded by the size of a single MS Excel spreadsheet, i.e., up to 65,536 rows and 256 columns.

• The iDA input format uses the first three rows of a spreadsheet to house information about individual attributes.
– Up to 65,533 data instances in attribute-value format can be mined.
– The student version allows a maximum of 7,000 data instances (i.e., 7,003 rows).

After completing the installation, if the security setting is high, change it to medium and click OK.
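The instance limits follow from subtracting the three header rows from the sheet's row limit. A minimal Python check of that arithmetic is sketched here; the constant and function names are illustrative and not part of iDA.

```python
HEADER_ROWS = 3  # the first three rows hold information about the attributes


def max_instances(total_rows: int, header_rows: int = HEADER_ROWS) -> int:
    """Number of data instances that fit below the header rows."""
    return total_rows - header_rows


def rows_needed(instances: int, header_rows: int = HEADER_ROWS) -> int:
    """Total spreadsheet rows used by a given number of instances."""
    return instances + header_rows


print(max_instances(65_536))  # commercial version limit
print(rows_needed(7_000))     # student version limit, in rows
```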


[Figure 4.1 – The iDA system architecture. Flowchart with the nodes Data, Interface, PreProcessor, Heuristic Agent, Large Dataset, Mining Technique, ESX, Neural Networks, Generate Rules, RuleMaker, Rules, Report Generator, Excel Sheets, and Explanation; decision branches are labeled Yes/No.]


4.2 ESX: A Multipurpose Tool for Data Mining

• ESX can help create target data, find irregularities in data, perform data mining, and offer insight about the practical value of discovered knowledge.

• Features of the ESX learner model:
– It supports both supervised learning and unsupervised clustering.
– It supports an automated method for dealing with missing attribute values.
– It does not make statistical assumptions about the nature of the data to be processed.
– It can point out inconsistencies and unusual values in data.


[Figure 4.3 – An ESX concept hierarchy. Three levels: the root level (Root), the concept level (C1, C2, ..., Cn), and the instance level (I11, I12, ..., I1j under C1; I21, I22, ..., I2k under C2; In1, In2, ..., Inl under Cn).]


4.3 iDAV Format for Data Mining

First row: attribute names (see Table 4.1)
Second row: attribute type (C: categorical; R: real-valued)
Third row: attribute usage (see Table 4.2 below)

Character  Usage
I          The attribute is used as an input attribute.
U          The attribute is not used (categorical attributes with several unique values are of little predictive value).
D          The attribute is not used for classification or clustering, but attribute value summary information is displayed in all output reports.
O          The attribute is used as an output attribute. For supervised learning with ESX, exactly one categorical attribute is selected as the output attribute.

Table 4.2 – Values for Attribute Usage


Income Range  Magazine Promo  Watch Promo  Life Ins Promo  Credit Card Ins.  Sex     Age
C             C               C            C               C                 C       R
I             I               I            I               I                 I       I
40-50,000     Yes             No           No              No                Male    45
30-40,000     Yes             Yes          Yes             No                Female  40
40-50,000     No              No           No              No                Male    42
30-40,000     Yes             Yes          Yes             Yes               Male    43
50-60,000     Yes             No           Yes             No                Female  38
20-30,000     No              No           No              No                Female  55
30-40,000     Yes             No           Yes             Yes               Male    35
20-30,000     No              Yes          No              No                Male    27
30-40,000     Yes             No           No              No                Male    43
30-40,000     Yes             Yes          Yes             No                Female  41
40-50,000     No              Yes          Yes             No                Female  43
20-30,000     No              Yes          Yes             No                Male    29
50-60,000     Yes             Yes          Yes             No                Female  39
40-50,000     No              Yes          No              No                Male    55
20-30,000     No              No           Yes             Yes               Female  19

Table 4.1 – Credit Card Promotion Database: iDAV Format
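To make the three-header-row layout concrete, here is a minimal Python sketch that checks a table against the type codes (C, R) and usage codes (I, U, D, O) described in this section. The `check_idav` helper is hypothetical and is not part of the iDA software; it only mirrors the rules stated on the slides.

```python
# Hypothetical validator for the three-header-row layout; not an iDA function.
VALID_TYPES = {"C", "R"}
VALID_USAGE = {"I", "U", "D", "O"}


def check_idav(rows):
    """Return a list of problems found in an iDAV-style table (empty if none)."""
    if len(rows) < 4:
        return ["need three header rows plus at least one instance"]
    names, types, usage = rows[0], rows[1], rows[2]
    width = len(names)
    if len(types) != width or len(usage) != width:
        return ["header rows differ in length"]
    problems = []
    for name, t, u in zip(names, types, usage):
        if t not in VALID_TYPES:
            problems.append(f"{name}: unknown type code {t!r}")
        if u not in VALID_USAGE:
            problems.append(f"{name}: unknown usage code {u!r}")
    if usage.count("O") > 1:
        problems.append("more than one output attribute")
    # Spreadsheet row numbers start at 1, so the first instance is row 4.
    for i, row in enumerate(rows[3:], start=4):
        if len(row) != width:
            problems.append(f"row {i}: expected {width} values, got {len(row)}")
            continue
        for name, t, value in zip(names, types, row):
            if t == "R":
                try:
                    float(value)
                except ValueError:
                    problems.append(f"row {i}: {name} is not numeric: {value!r}")
    return problems


table = [
    ["Income Range", "Magazine Promo", "Watch Promo", "Life Ins Promo",
     "Credit Card Ins.", "Sex", "Age"],
    ["C", "C", "C", "C", "C", "C", "R"],
    ["I", "I", "I", "I", "I", "I", "I"],
    ["40-50,000", "Yes", "No", "No", "No", "Male", "45"],
    ["30-40,000", "Yes", "Yes", "Yes", "No", "Female", "40"],
]
print(check_idav(table))
```

An empty list means the header codes are recognized and every real-valued column parses as a number.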


4.4 A Five-step Approach for Unsupervised Clustering

Step 1: Enter the Data to be Mined

Step 2: Perform a Data Mining Session

Step 3: Read and Interpret Summary Results

Step 4: Read and Interpret Individual Class Results

Step 5: Visualize Individual Class Rules


Step 1: Enter The Data To Be Mined


Step 2: Perform A Data Mining Session


Figure 4.5 – Unsupervised settings for ESX (#4, p. 116)

Value for instance similarity:
• A value closer to 100 encourages the formation of new clusters.
• A value closer to 0 favors new instances entering existing clusters.

The real-valued tolerance setting helps determine the similarity criteria for real-valued attributes. A setting of 1.0 is usually appropriate.
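The effect of the instance similarity setting can be illustrated with a generic threshold-based ("leader") clustering sketch in Python. This is not the ESX algorithm, whose internals the slides do not describe; the similarity function (percentage of matching attribute values) and the toy data are assumptions chosen only to show that a higher threshold produces more clusters.

```python
def similarity(a, b):
    """Percentage (0-100) of attribute positions where a and b agree."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def leader_cluster(instances, threshold):
    """Put each instance in the cluster whose leader it matches best,
    or start a new cluster when no leader reaches the threshold."""
    leaders, clusters = [], []
    for inst in instances:
        scores = [similarity(inst, lead) for lead in leaders]
        if scores and max(scores) >= threshold:
            clusters[scores.index(max(scores))].append(inst)
        else:
            leaders.append(inst)
            clusters.append([inst])
    return clusters


# Toy categorical instances (illustrative values only).
data = [
    ("Yes", "No", "No", "Male"),
    ("Yes", "Yes", "Yes", "Female"),
    ("No", "No", "No", "Male"),
    ("Yes", "Yes", "Yes", "Male"),
    ("Yes", "No", "Yes", "Female"),
    ("No", "No", "No", "Female"),
]
for t in (25, 65, 100):
    print(t, len(leader_cluster(data, t)))  # 1, 3, and 6 clusters respectively
```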


#6: A message box indicates that eight clusters were formed.

This tells us the data has been successfully mined.


#6, #7 (p. 116): As a general rule, an unsupervised clustering of more than five or six clusters is likely to be less than optimal.


#8 and #9: Repeat steps 1-4. For step 5, set the similarity value to 55.


Re-rule feature

• Minimum rule correctness (50-100): if set to 80, the rules generated must have an error rate less than or equal to 20%.
• Minimum rule coverage (10-100): if set to 10, RuleMaker will generate rules that cover 10% or more of the instances in each class.
• Attribute significance (start with 80-90): values close to 100 allow RuleMaker to consider only those attribute values most highly predictive of class membership for rule generation.
• Covering set rules: RuleMaker will generate a set of best-defining rules for each class.
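The correctness and coverage thresholds act as filters on candidate rules. The Python sketch below shows that filtering idea only; the `Rule` class and `keep_rules` function are hypothetical, and RuleMaker's actual procedure is not described in these slides. The two candidate rules and their scores are the ones quoted elsewhere in this chapter.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    text: str
    accuracy: float  # percent of instances matching the rule that are in the class
    coverage: float  # percent of class instances that the rule covers


def keep_rules(rules, min_correctness=80, min_coverage=10):
    """Keep rules that meet both thresholds (both on a 0-100 scale)."""
    return [r for r in rules
            if r.accuracy >= min_correctness and r.coverage >= min_coverage]


candidates = [
    Rule("IF Sex = Female & 19 <= Age <= 43 THEN Life Ins Promo = Yes", 100.00, 66.67),
    Rule("IF Life Ins Promo = Yes THEN Class = 3", 77.78, 100.00),
]

# With minimum correctness 80, the 77.78% rule is rejected (error rate above 20%).
print([r.text for r in keep_rules(candidates, min_correctness=80)])
# Lowering the threshold to 50 admits both rules.
print(len(keep_rules(candidates, min_correctness=50)))
```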


#10 (p.117) Set minimum rule coverage at 30


A Production Rule for the Credit Card Promotion Database

IF Sex = Female & 19 <= Age <= 43
THEN Life Insurance Promotion = Yes

Rule Accuracy: 100.00%
Rule Coverage: 66.67%

Question: Can we assume that two-thirds of all females in the specified age range will take advantage of the promotion?

• Rule accuracy is a between-class measure.
• Rule coverage is a within-class measure.
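The two scores can be recomputed from the Sex, Age, and Life Ins Promo columns of Table 4.1. The Python sketch below assumes the rule was scored on all 15 instances, which reproduces the reported figures. It also suggests the answer to the question: coverage is measured against the instances in the class (those who accepted the promotion), not against all females in the age range, so the two-thirds figure does not describe that group.

```python
# (sex, age, life_ins_promo) for the 15 instances of Table 4.1, in table order
rows = [
    ("Male", 45, "No"), ("Female", 40, "Yes"), ("Male", 42, "No"),
    ("Male", 43, "Yes"), ("Female", 38, "Yes"), ("Female", 55, "No"),
    ("Male", 35, "Yes"), ("Male", 27, "No"), ("Male", 43, "No"),
    ("Female", 41, "Yes"), ("Female", 43, "Yes"), ("Male", 29, "Yes"),
    ("Female", 39, "Yes"), ("Male", 55, "No"), ("Female", 19, "Yes"),
]


def antecedent(sex, age):
    """IF Sex = Female & 19 <= Age <= 43"""
    return sex == "Female" and 19 <= age <= 43


covered = [r for r in rows if antecedent(r[0], r[1])]   # instances the rule fires on
correct = [r for r in covered if r[2] == "Yes"]         # ...that are in the class
in_class = [r for r in rows if r[2] == "Yes"]           # all class members

accuracy = 100 * len(correct) / len(covered)    # 6 of 6 matching instances
coverage = 100 * len(correct) / len(in_class)   # 6 of 9 class members
print(f"accuracy {accuracy:.2f}%  coverage {coverage:.2f}%")
```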


Output Reports: Unsupervised Clustering

• RES SUM: This sheet contains summary statistics about attribute values and offers several heuristics to help us determine the quality of a data mining session.

• RES CLS: This sheet has information about the clusters formed as a result of an unsupervised mining session.

• RUL TYP: Instances are listed by their cluster number. The typicality of instance i is the average similarity of i to the other members of its cluster.

• RES RUL: The production rules generated for each cluster are contained in this sheet.


Figure 4.7 Rules for the credit card promotion database


Step 3: Read and Interpret Summary Results (p.117)

(Sheet1 RES SUM)

• Class Resemblance Scores (RES)

• Domain Resemblance Score

• Domain Predictability


Step 3: Read and Interpret Summary Results (p.119)

Similarity value (within the class)

In general, the within-class RES scores should be higher than the domain RES score. This should hold for most of the classes.

Instances of Class 1 have the best within-class fit.


Figure 4.9 - Step 3: Read and Interpret Summary Results (cont.)


Step 3: Read and Interpret Summary Results (cont.)

Figure 4.9 – Statistics for numerical attributes and common categorical attribute values


Step 4: Read and Interpret Individual Class Results (p. 121)

(Sheet1 RES CLS)

• Typicality is defined as the average similarity of an instance to all other members of its cluster or class.

• Class predictability is a within-class measure.
– The percent of class instances having a particular value for a categorical attribute.

• Class predictiveness is a between-class measure.
– The probability that an instance resides in a specified class, given that the instance has the value for the chosen attribute.
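The difference between predictability and predictiveness is the denominator: class members for the first, instances with the attribute value for the second. The Python sketch below illustrates this on the Sex and Life Ins Promo columns of Table 4.1, treating the Life Ins Promo value as the class label purely for illustration; the function names are hypothetical and the numbers are not taken from an iDA report.

```python
# (sex, life_ins_promo) for the 15 instances of Table 4.1, in table order
rows = [
    ("Male", "No"), ("Female", "Yes"), ("Male", "No"), ("Male", "Yes"),
    ("Female", "Yes"), ("Female", "No"), ("Male", "Yes"), ("Male", "No"),
    ("Male", "No"), ("Female", "Yes"), ("Female", "Yes"), ("Male", "Yes"),
    ("Female", "Yes"), ("Male", "No"), ("Female", "Yes"),
]


def predictability(rows, cls, value):
    """Within-class: share of instances in class `cls` having `value`."""
    members = [sex for sex, label in rows if label == cls]
    return members.count(value) / len(members)


def predictiveness(rows, cls, value):
    """Between-class: share of instances having `value` that are in class `cls`."""
    labels = [label for sex, label in rows if sex == value]
    return labels.count(cls) / len(labels)


# Sex = Female with respect to class Yes:
print(predictability(rows, "Yes", "Female"))   # 6 of the 9 class members
print(predictiveness(rows, "Yes", "Female"))   # 6 of the 7 females
```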


Figure 4.10 – Class 3 Summary Results (figure annotations: within-class, between-class)


Figure 4.11 – Necessary and sufficient attribute values for Class 3


Step 5: Visualize Individual Class Rules

IF Life Ins Promo = Yes
THEN Class = 3
Rule accuracy: 77.78%
Rule coverage: 100.00%


4.5 A Six-Step Approach for Supervised Learning

• Step 1: Choose an Output Attribute
– Launch a fresh life insurance promotion
• Step 2: Perform the Mining Session
• Step 3: Read and Interpret Summary Results
• Step 4: Read and Interpret Test Set Results
• Step 5: Read and Interpret Class Results
• Step 6: Visualize and Interpret Class Rules


Step 2: Perform the Mining Session

O: output; D: Display-Only

Filename: CreditCardPromotion-supervised.xls


Step 2 (#4): Select the number of instances for training and a real-valued tolerance setting (p. 127)


Step 3 – Read and Interpret Summary Results

Domain statistics for categorical attributes tell us that 80% of the training instances represent individuals without credit card insurance.


Step 3 – Read and Interpret Summary Results (cont.)


Step 4 - Read and Interpret Test Set Results


Step 5 - Read and Interpret Results for Individual Classes (p.130)


Sheet1 RUL TYP

In class Yes (Life Ins. Promo), the proportion of instances with Credit Card Ins = Yes is 40% (2/5).
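Both supervised-session figures quoted in this section (80% without credit card insurance, and 2/5 within class Yes) can be reproduced from Table 4.1 under one assumption: that the training set is the first 10 rows of the table. The slides do not state the split, so treat the Python sketch below as a consistency check rather than a description of the session.

```python
# (life_ins_promo, credit_card_ins) for the 15 rows of Table 4.1, in table order
rows = [
    ("No", "No"), ("Yes", "No"), ("No", "No"), ("Yes", "Yes"), ("Yes", "No"),
    ("No", "No"), ("Yes", "Yes"), ("No", "No"), ("No", "No"), ("Yes", "No"),
    ("Yes", "No"), ("Yes", "No"), ("Yes", "No"), ("No", "No"), ("Yes", "Yes"),
]

train = rows[:10]  # ASSUMPTION: first 10 rows are the training instances

# Domain statistic: share of training instances without credit card insurance
no_cc_share = sum(cc == "No" for _, cc in train) / len(train)

# Class predictability of Credit Card Ins = Yes within class Life Ins Promo = Yes
class_yes = [cc for promo, cc in train if promo == "Yes"]
cc_yes_in_class = class_yes.count("Yes") / len(class_yes)

print(f"{no_cc_share:.0%} without credit card insurance")
print(f"{class_yes.count('Yes')}/{len(class_yes)} = {cc_yes_in_class:.0%} in class Yes")
```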


Step 6 – Visualize and Interpret Class Rules (p.130)

Re-rule feature

• Minimum rule correctness (50-100): if set to 80, the rules generated must have an error rate less than or equal to 20%.
• Minimum rule coverage (10-100): if set to 10, RuleMaker will generate rules that cover 10% or more of the instances in each class.
• Attribute significance (start with 80-90): values close to 100 allow RuleMaker to consider only those attribute values most highly predictive of class membership for rule generation.
• Covering set rules: RuleMaker will generate a set of best-defining rules for each class.


4.6 Techniques for Generating Rules

1. Define the scope of the rules.

2. Choose the instances.

3. Set the minimum rule correctness.

4. Define the minimum rule coverage.

5. Choose an attribute significance value.


4.7 Instance Typicality

Typicality Scores

• Identify prototypical and outlier instances.

• Select a best set of training instances.

• Used to compute individual instance classification confidence scores.
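Typicality, defined earlier as the average similarity of an instance to the other members of its cluster, can be sketched in a few lines of Python. The similarity measure used here (fraction of matching categorical values) is an assumption for illustration; the slides do not give the measure ESX uses, and the cluster below is made-up data.

```python
def similarity(a, b):
    """Fraction (0-1) of attribute positions where a and b agree (assumed measure)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def typicality(cluster):
    """Average similarity of each instance to all other members of its cluster."""
    scores = []
    for i, inst in enumerate(cluster):
        others = [o for j, o in enumerate(cluster) if j != i]
        scores.append(sum(similarity(inst, o) for o in others) / len(others))
    return scores


# Illustrative cluster of categorical instances (requires at least two members).
cluster = [
    ("Yes", "Yes", "Yes", "Female"),
    ("Yes", "Yes", "No", "Female"),
    ("Yes", "Yes", "Yes", "Male"),
    ("No", "No", "No", "Male"),
    ("Yes", "Yes", "Yes", "Female"),
]
scores = typicality(cluster)
for inst, s in sorted(zip(cluster, scores), key=lambda p: -p[1]):
    print(f"{s:.4f}  {inst}")
```

The highest-scoring instances are candidates for prototypes, and the lowest-scoring one (here the fourth instance) is a candidate outlier.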


Figure 4.13 Instance Typicality


4.8 Special Considerations and Features

• Avoid Mining Delays

• The Quick Mine Feature
– Supervised learning with more than 2,000 training set instances: you are asked whether to use the quick mine feature.
– Unsupervised clustering with more than 2,000 data instances: ESX is given a random selection of 500 instances.

• Erroneous and Missing Data
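The quick mine behavior described above for unsupervised clustering amounts to random subsampling once the data exceeds a size limit. A minimal Python sketch of that idea follows; the function name, parameters, and the use of `random.Random` are illustrative assumptions, not the iDA implementation.

```python
import random


def quick_mine_sample(instances, limit=2000, sample_size=500, seed=None):
    """Return all instances when there are at most `limit` of them;
    otherwise return a random selection of `sample_size` instances."""
    instances = list(instances)
    if len(instances) <= limit:
        return instances
    return random.Random(seed).sample(instances, sample_size)


small = quick_mine_sample(range(1500))
large = quick_mine_sample(range(3000), seed=0)
print(len(small), len(large))  # 1500 500
```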


Homework

• Use ESX (and iDA) to perform a supervised data mining session using the CardiologyCategorical.xls data file.

• Save the output file as CardiologyCategorical-supervised.xls

• Lab #4 (p. 141)

• Turn in:
– 1. Spreadsheet file (CardiologyCategorical-supervised.xls) that contains the outcome of the data mining session
– 2. Word file that includes (and explains) answers to all questions (a. through n.)