data mining in excel using xl miner

1

Data Mining in Excel Using XLMiner™

Nitin R. Patel, Cytel Software and M.I.T. Sloan

2

Contact Info

• XLMiner is distributed by Resampling Stats, Inc.

• www.xlminer.net

• Contact Peter Bruce: [email protected]

• 703-522-2713


3

What is XLMiner?

• XLMiner is an affordable, easy-to-use tool for business analysts, consultants and business students to:
– learn strengths and weaknesses of data mining methods,
– prototype large scale data mining applications,
– implement medium scale data mining applications.

• More generally, XLMiner is a tool for data analysis in Excel that uses classical and modern, computationally-intensive techniques.

4

Available Data Mining Software

• Application-specific: aimed at providing solutions to end-users for common tasks (e.g. Unica for Customer Relationship Management, Urban Science for location and distribution)

• Technique-specific: focused on a few data mining methods (e.g. CART from Salford Associates, Neural Nets from HNC Software)

5

TECHNIQUE-SPECIFIC PRODUCTS (Source: Elder Research)

Algorithms (table columns): Class. & Regr. Trees, Linear Regression, Multilayer Neural Net, K-Nearest Neighbors, Radial Basis Fns., Naïve Bayes, Rule Induction, Logistic Regression, Time Series, Sequential Rules, K-Means, Association Rules, Kohonen Net

Product          Algorithms marked (x)
CART (Salford)   1
Cognos           1
NeuroShell       3
WizWhy           1
See5             2

6

Available Data Mining Software

• Horizontal products: designed for data mining analysts (e.g. SAS Enterprise Miner, SPSS Clementine, IBM Intelligent Miner, NCR Teraminer, Splus Insightful Miner, Darwin/Oracle)
– Powerful, comprehensive, easy-to-use; but…
– Need substantial learning effort
– Expensive

7

HORIZONTAL PRODUCTS (Source: Elder Research)

Algorithms (table columns): Class. & Regr. Trees, Linear Regression, Multilayer Neural Net, K-Nearest Neighbors, Radial Basis Fns., Naïve Bayes, Rule Induction, Logistic Regression, Time Series, Sequential Rules, K-Means, Association Rules, Kohonen Net

Product                   Algorithms marked (x)
Enterprise Miner (SAS)    9
Clementine (SPSS)         7
Intelligent Miner (IBM)   8
MineSet (SGI)             5
Darwin (Oracle)           4
PRW (Unica)               6

8

Desiderata for Data Mining and Modern Data Analysis Software

• Easy-to-use
– Data import (e.g. cross-platform, various databases)
– Data handling (e.g. data partitioning, scoring)
– Invoking and experimenting with procedures

• Comprehensive Range of Procedures:
– Statistics (e.g. Regression, Multivariate procedures)
– Machine learning (e.g. Neural Nets, Classification Trees)
– Database (e.g. Association Rules)

9

XLMiner is Unique

• Low cost,

• Comprehensive set of data mining models and algorithms that includes statistical, machine learning and database methods,

• Based on prototype used in three years of MBA courses on data mining at Sloan School, M.I.T.

• Focus on business applications: Book of lecture notes and cases in preparation (first draft available for examination).

10

Why Data Mining in Excel?

• Leverage familiarity of MBA students, managers and business analysts with interface and functionality of Excel to provide them with hands-on experience in data mining.

11

Advantages

• Low learning hurdle

• Promotes understanding of strengths and weaknesses of different data mining techniques and processes

• Enables interactive analysis of data (important in early stages of model building)

• Facilitates incorporation of domain knowledge (often key to successful applications) by empowering end-users to participate actively in data mining projects

• Enables pre-processing of data and post-processing of results using Excel functions, reporting in Word, presentations in PowerPoint

12

Advantages (cont.)

• Supports communication between data miners and end-users

• Supports smooth transition from prototyping to custom solution development (VB and VBA)

• Emphasizes openness
– enables integration with other analytic software for optimization (Solver), simulation (Crystal Ball), numerical methods;
– interface modifications (e.g. custom forms and outputs)
– solution-specific routines (VBA)

• Examples:
– Boston Celtics: analysis of player statistics
– Clustering for improving forecasts, optimizing price markdowns.

13

Size Limitations

• An Excel spreadsheet cannot exceed 65,536 rows (64K). If data records are stored as rows in a single spreadsheet, this is the largest data set that can be accommodated. The number of variables cannot exceed 256 (the number of columns).

• These limits do not apply to deployment of model to score large databases.

• If Excel is used as a view-port into a database such as Access, MS SQL Server, Oracle or SAS, these limits do not apply.

14

Sampling

• Practical Data Mining Methodologies such as SEMMA (SAS) and CRISP-DM (SPSS and European Industry Standard) recommend working with a sample (typically 10,000 random cases) in the model and algorithm selection phase. This facilitates interactive development of data mining models.
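As an illustration of the sampling step, the following is a minimal sketch of drawing a simple random sample of 10,000 cases without replacement. It is not XLMiner's own sampling routine; the function name, the seed, and the stand-in population are hypothetical.

```python
import random


def simple_random_sample(records, sample_size=10_000, seed=12345):
    """Draw up to sample_size records without replacement.

    The seed value is an arbitrary, hypothetical choice that makes the
    sample reproducible between runs.
    """
    rng = random.Random(seed)
    if len(records) <= sample_size:
        return list(records)
    return rng.sample(records, sample_size)


# Stand-in for the row identifiers of a large database table (hypothetical).
population = list(range(250_000))
sample = simple_random_sample(population)
print(len(sample))
```

Fixing the seed keeps the model-selection work repeatable: the same sample is drawn each time the analysis is rerun.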

15

XLMiner

• Free 30-day trial version: limit is 200 records per partition.

• Education version: limit is 2,000 records per partition, so the maximum size for a data set is 6,000 records.

• Standard version (currently in beta test; will be available by end of August):
– Up to 60,000 records obtained by drawing samples from large databases in accordance with SAS's SEMMA (Sample, Explore, Modify, Model, Assess) methodology.
– Training data restricted to 10,000 records.
– Sampling from and scoring to Access databases (later SQL Server, Oracle, SAS).

16

Data Mining Procedures in XLMiner

• Partitioning data sets (into Training, Validation, and Test data sets)

• Scoring of training, validation, test and other data

• Prediction (of a continuous variable)

• Classification

• Data reduction and exploration

• Affinity

• Utilities: Sampling, graphics, missing data, binning, creation of dummy variables

17

Prediction

• Multiple Linear Regression with subset selection, residual analysis, and collinearity diagnostics.

• K-Nearest Neighbors

• Regression Tree

• Neural Net

18

Classification

• Logistic Regression with subset selection, residual analysis, and collinearity diagnostics

• Discriminant Analysis

• K-Nearest Neighbors

• Classification Tree

• Naïve Bayes

• Neural Networks

19

Data Reduction and Exploration

• Principal Components

• K-Means Clustering

• Hierarchical Clustering

20

Affinity

• Association Rules (Market Basket Analysis)

21

Partitioning

Aim: To construct training, validation, and test data sets from Boston Housing data

23

Boston Housing Data

CRIM     ZN  INDUS  CHAS  NOX    RM     AGE   DIS   RAD  TAX  PTRATIO  B    LSTAT  MEDV
0.00632  18  2.31   0     0.538  6.575  65.2  4.09  1    296  15.3     397  4.98   24
0.02731  0   7.07   0     0.469  6.421  78.9  4.97  2    242  17.8     397  9.14   21.6
0.02729  0   7.07   0     0.469  7.185  61.1  4.97  2    242  17.8     393  4.03   34.7
0.03237  0   2.18   0     0.458  6.998  45.8  6.06  3    222  18.7     395  2.94   33.4
0.06905  0   2.18   0     0.458  7.147  54.2  6.06  3    222  18.7     397  5.33   36.2
0.02985  0   2.18   0     0.458  6.43   58.7  6.06  3    222  18.7     394  5.21   28.7
0.08829  13  7.87   0     0.524  6.012  66.6  5.56  5    311  15.2     396  12.43  22.9
0.14455  13  7.87   0     0.524  6.172  96.1  5.95  5    311  15.2     397  19.15  27.1
0.21124  13  7.87   0     0.524  5.631  100   6.08  5    311  15.2     387  29.93  16.5
0.17004  13  7.87   0     0.524  6.004  85.9  6.59  5    311  15.2     387  17.1   18.9
0.22489  13  7.87   0     0.524  6.377  94.3  6.35  5    311  15.2     393  20.45  15
0.11747  13  7.87   0     0.524  6.009  82.9  6.23  5    311  15.2     397  13.27  18.9
0.09378  13  7.87   0     0.524  5.889  39    5.45  5    311  15.2     391  15.71  21.7
0.62976  0   8.14   0     0.538  5.949  61.8  4.71  4    307  21       397  8.26   20.4

24

XLMiner : Data Partition Sheet

Date: 29-Jul-2003 13:50:09 (Ver: 1.2.0.1)

Data
Data source           housing!$A$2:$O$507
Partitioning Method   Randomly chosen
Random Seed           81801
# training rows       253
# validation rows     152
# test rows           101

Selected variables: CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT

Training Data
Row Id.  CRIM     ZN    INDUS  CHAS  NOX    RM     AGE   DIS     RAD  TAX  PTRATIO  B       LSTAT
1        0.00632  18    2.31   0     0.538  6.575  65.2  4.09    1    296  15.3     396.9   4.98
2        0.02731  0     7.07   0     0.469  6.421  78.9  4.9671  2    242  17.8     396.9   9.14
5        0.06905  0     2.18   0     0.458  7.147  54.2  6.0622  3    222  18.7     396.9   5.33
6        0.02985  0     2.18   0     0.458  6.43   58.7  6.0622  3    222  18.7     394.12  5.21
7        0.08829  12.5  7.87   0     0.524  6.012  66.6  5.5605  5    311  15.2     395.6   12.43
8        0.14455  12.5  7.87   0     0.524  6.172  96.1  5.9505  5    311  15.2     396.9   19.15
10       0.17004  12.5  7.87   0     0.524  6.004  85.9  6.5921  5    311  15.2     386.71  17.1
12       0.11747  12.5  7.87   0     0.524  6.009  82.9  6.2267  5    311  15.2     396.9   13.27
14       0.62976  0     8.14   0     0.538  5.949  61.8  4.7075  4    307  21       396.9   8.26

Validation Data
Row Id.  CRIM     ZN    INDUS  CHAS  NOX    RM     AGE   DIS     RAD  TAX  PTRATIO  B       LSTAT
3        0.02729  0     7.07   0     0.469  7.185  61.1  4.9671  2    242  17.8     392.83  4.03
9        0.21124  12.5  7.87   0     0.524  5.631  100   6.0821  5    311  15.2     386.63  29.93
13       0.09378  12.5  7.87   0     0.524  5.889  39    5.4509  5    311  15.2     390.5   15.71

Test Data
Row Id.  CRIM     ZN    INDUS  CHAS  NOX    RM     AGE   DIS     RAD  TAX  PTRATIO  B       LSTAT
4        0.03237  0     2.18   0     0.458  6.998  45.8  6.0622  3    222  18.7     394.63  2.94
11       0.22489  12.5  7.87   0     0.524  6.377  94.3  6.3467  5    311  15.2     392.52  20.45
17       1.05393  0     8.14   0     0.538  5.935  29.3  4.4986  4    307  21       386.85  6.58
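The random partitioning shown on this sheet can be sketched as follows. This is a minimal illustration, not XLMiner's algorithm: the function name is hypothetical, and although the seed and partition sizes are taken from the sheet, Python's random generator differs from XLMiner's, so the row assignments will not reproduce the ones listed.

```python
import random


def partition_rows(row_ids, n_train, n_valid, seed):
    """Randomly split row ids into training, validation and test sets."""
    rng = random.Random(seed)
    shuffled = list(row_ids)
    rng.shuffle(shuffled)
    train = sorted(shuffled[:n_train])
    valid = sorted(shuffled[n_train:n_train + n_valid])
    test = sorted(shuffled[n_train + n_valid:])  # whatever remains
    return train, valid, test


# 506 records; 253 training and 152 validation rows leave 101 test rows.
train, valid, test = partition_rows(range(1, 507), 253, 152, seed=81801)
print(len(train), len(valid), len(test))
```

Each row lands in exactly one partition, so the validation and test sets stay untouched by model fitting.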

25

Prediction

Multiple Linear Regression using subset selection

Aim: To estimate median residential property value for a census tract

26

The Regression Model

Input variables  Coefficient  Std. Error  p-value  SS
Constant term    32.677       7.444       0.000    128852
CRIM             -0.094       0.049       0.054    3566
ZN               0.055        0.020       0.007    2550
INDUS            0.030        0.091       0.742    1529
CHAS             2.836        1.199       0.019    645
NOX              -15.889      5.463       0.004    143
RM               3.872        0.597       0.000    4697
AGE              0.007        0.019       0.728    0
DIS              -1.405       0.292       0.000    938
RAD              0.358        0.097       0.000    1
TAX              -0.013       0.005       0.019    174
PTRATIO          -0.934       0.208       0.000    620
B                0.014        0.004       0.000    502
LSTAT            -0.582       0.073       0.000    1623

Residual df          239
Multiple R-squared   0.738
Std. Dev. estimate   5.025
Residual SS          6036

Summary Reports

Data scored       Total sum of squared errors  RMS Error  Average Error
Training Data     6036                         4.884      0.000
Validation Data   2848                         4.329      0.066
Test Data         2392                         4.866      -1.019

# Records training 253
# Records validation 152
# Records test 101
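The RMS Error column in the summary reports can be checked from the total sum of squared errors and the record counts. The sketch below assumes RMS error is the square root of the mean squared error; the function name is hypothetical. Because the reported sums of squares are rounded to whole numbers, the results agree with the slide only to within rounding.

```python
import math


def rms_error(total_squared_error, n_records):
    """Root-mean-square error from a total sum of squared errors."""
    return math.sqrt(total_squared_error / n_records)


# (total sum of squared errors, number of records) from the summary reports
reports = {
    "training": (6036, 253),
    "validation": (2848, 152),
    "test": (2392, 101),
}
rms = {name: rms_error(sse, n) for name, (sse, n) in reports.items()}
for name, value in rms.items():
    print(f"{name}: {value:.3f}")

# The Std. Dev. estimate divides by the residual degrees of freedom instead.
std_dev_estimate = math.sqrt(6036 / 239)
print(f"std. dev. estimate: {std_dev_estimate:.3f}")
```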

27

Subset selection (exhaustive enumeration)

Subset size  RSS         Cp        R-Squared  Adjusted R-Squared  Prob    Models (Constant present in all models)
2            19472.3789  362.7529  0.5441     0.5432              0.0000  Constant LSTAT
3            15439.3086  185.6474  0.6386     0.6371              0.0000  Constant RM LSTAT
4            13727.9863  111.6489  0.6786     0.6767              0.0000  Constant RM PTRATIO LSTAT
5            13228.9072  91.4852   0.6903     0.6878              0.0000  Constant RM DIS PTRATIO LSTAT
6            12469.3447  59.7537   0.7081     0.7052              0.0000  Constant NOX RM DIS PTRATIO LSTAT
7            12141.0723  47.1754   0.7158     0.7123              0.0000  Constant CHAS NOX RM DIS PTRATIO LSTAT

28

The Regression Model

Predictor (Indep. Var.)  Coefficient  Std. Error  p-value  SS
Constant                 42.8367      7.1766      0.0000   126430.6016
NOX                      -21.7852     4.6042      0.0000   3404.4565
RM                       3.7503       0.6177      0.0000   6583.3579
DIS                      -1.4072      0.2535      0.0000   211.6853
PTRATIO                  -1.0086      0.1747      0.0000   1453.9551
LSTAT                    -0.5907      0.0696      0.0000   2060.2676

Residual df          247.0000
Multiple R-squared   0.6601
Std. Dev. Estimate   5.3467
Residual SS          7061.1646

XLMiner : Multiple Linear Regression - Prediction of Validation Data

Data range   Data_Partition1!$C$273:$P$424
MaxAbsErr    20.33
RMSErr       4.9355
AvMEDV       22.9645
%RMSErr      21.5%
AvAbsErr     3.57

SqErr      Predicted Value  Actual Value  NOX     RM      DIS     PTRATIO  LSTAT    AbsErr
0.8439637  22.0187          21.1          0.4640  5.8560  4.4290  18.6000  13.0000  0.92
0.2196854  32.8687          32.4          0.4470  6.7580  4.0776  17.6000  3.5300   0.47
0.2137043  25.4623          25            0.4890  6.1820  3.9454  18.6000  9.4700   0.46
6.6637521  31.0814          28.5          0.4110  6.8610  5.1167  19.2000  3.3300   2.58
4.0947798  22.4236          20.4          0.5470  5.8720  2.4775  17.8000  15.3700  2.02
18.224484  24.5690          20.3          0.5440  5.9720  3.1025  18.4000  9.9700   4.27
0.3253246  23.4704          22.9          0.5240  6.0120  5.5605  15.2000  12.4300  0.57
51.86411   14.6983          21.9          0.7180  4.9630  1.7523  20.2000  14.0000  7.20
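The per-record error columns and the percentage RMS error on this sheet follow from simple arithmetic. The sketch below assumes SqErr is the squared difference between predicted and actual value, AbsErr is its absolute value, and %RMSErr is RMSErr divided by the average MEDV; the helper names are hypothetical. Predicted values on the slide are rounded to four decimals, so the squared errors match only approximately.

```python
def record_errors(predicted, actual):
    """Return (squared error, absolute error) for one scored record."""
    residual = predicted - actual
    return residual ** 2, abs(residual)


# (predicted value, actual value) for the first three records on the sheet
scored = [(22.0187, 21.1), (32.8687, 32.4), (25.4623, 25.0)]
errors = [record_errors(p, a) for p, a in scored]
for sq_err, abs_err in errors:
    print(f"SqErr={sq_err:.4f}  AbsErr={abs_err:.2f}")

# Percentage RMS error relative to the average actual value.
pct_rms_err = 100 * 4.9355 / 22.9645
print(f"%RMSErr={pct_rms_err:.1f}%")
```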

29

%AvAbsErr = 15.6%

AbsErr  Freq
0       0
2       61
4       40
6       25
8       10
10      9
12      2
14      3
16      0
18      0
20      1
22      1

[Figure: histogram of AbsErr (0 to 22) against Frequency in Validation Dataset (0 to 70).]

30

Prediction

K-Nearest Neighbors

Aim: To estimate median residential property value for a census tract

31

XLMiner : K-Nearest Neighbors Prediction

Data
Source data worksheet                        Data_Partition1
Training data used for building the model    Data_Partition1!$C$19:$Q$322
Validation data                              Data_Partition1!$C$323:$Q$524
# cases in the training data set             304
# cases in the validation data set           202
Normalization                                TRUE
# nearest neighbors (k)                      1

Variables
Input variables    NOX RM DIS PTRATIO LSTAT
Output variable    MEDV

32

Parameters/Options
# Nearest neighbors   1

Summary Reports

Data scored       Total sum of squared errors  RMS Error  Average Error
Training Data     0                            0          0
Validation Data   3314                         4.669      0.805
Test Data         3895                         6.210      -0.450

Timings
Overall (secs)   3.00

# Records training 253
# Records validation 152
# Records test 101

33

Validation Data prediction details

Row Id.  Predicted Value  Actual Value  Residual  Actual #Nearest Neighbors  CRIM     ZN    INDUS  CHAS  NOX
3        28.70            34.70         6.00      1                          0.02729  0     7.07   0     0.469
9        14.40            16.50         2.10      1                          0.21124  12.5  7.87   0     0.524
13       22.90            21.70         -1.20     1                          0.09378  12.5  7.87   0     0.524
15       19.60            18.20         -1.40     1                          0.63796  0     8.14   0     0.538
16       20.40            19.90         -0.50     1                          0.62739  0     8.14   0     0.538
20       20.40            18.20         -2.20     1                          0.7258   0     8.14   0     0.538
25       16.60            15.60         -1.00     1                          0.75026  0     8.14   0     0.538
29       19.60            18.40         -1.20     1                          0.77299  0     8.14   0     0.538
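A minimal sketch of k-nearest-neighbors prediction with normalized inputs follows. It is not XLMiner's implementation: it assumes that normalization means z-scoring each input with the training mean and standard deviation, that distance is Euclidean, and that the prediction is the average output of the k nearest training records. The function names are hypothetical, and the four training records are taken from the Boston Housing slide purely for illustration.

```python
import math


def standardize(train_rows, rows):
    """Z-score rows using the column means and std. devs. of train_rows."""
    columns = list(zip(*train_rows))
    means = [sum(col) / len(col) for col in columns]
    # Fall back to 1.0 when a column is constant, to avoid dividing by zero.
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0
        for col, m in zip(columns, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_predict(train_x, train_y, query_x, k=1):
    """Average the outputs of the k nearest training records."""
    z_train = standardize(train_x, train_x)
    z_query = standardize(train_x, query_x)
    predictions = []
    for q in z_query:
        ranked = sorted(range(len(z_train)), key=lambda i: euclidean(q, z_train[i]))
        predictions.append(sum(train_y[i] for i in ranked[:k]) / k)
    return predictions


# Inputs: NOX, RM, DIS, PTRATIO, LSTAT; output: MEDV (illustrative subset).
train_x = [
    [0.538, 6.575, 4.09, 15.3, 4.98],
    [0.469, 6.421, 4.97, 17.8, 9.14],
    [0.458, 7.147, 6.06, 18.7, 5.33],
    [0.458, 6.430, 6.06, 18.7, 5.21],
]
train_y = [24.0, 21.6, 36.2, 28.7]
query = [0.469, 7.185, 4.97, 17.8, 4.03]
print(knn_predict(train_x, train_y, [query], k=1))
```

With k = 1, scoring the training data against itself returns each record's own output, which is why the training summary report shows zero error while the validation and test errors are not zero.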

34

Classification

Classification Tree

Aim: To classify census tracts into high and low residential property value classes

35

Boston Housing Data

CRIM     ZN  INDUS  CHAS  NOX   RM     AGE   DIS   RAD  TAX  PTRATIO  B       LSTAT  MEDV  HIGHCLASS
0.00632  18  2.31   0     0.54  6.575  65.2  4.09  1    296  15.3     396.9   4.98   24    0
0.02731  0   7.07   0     0.47  6.421  78.9  4.97  2    242  17.8     396.9   9.14   21.6  0
0.02729  0   7.07   0     0.47  7.185  61.1  4.97  2    242  17.8     392.83  4.03   34.7  1
0.03237  0   2.18   0     0.46  6.998  45.8  6.06  3    222  18.7     394.63  2.94   33.4  1
0.06905  0   2.18   0     0.46  7.147  54.2  6.06  3    222  18.7     396.9   5.33   36.2  1
0.02985  0   2.18   0     0.46  6.43   58.7  6.06  3    222  18.7     394.12  5.21   28.7  0
0.08829  13  7.87   0     0.52  6.012  66.6  5.56  5    311  15.2     395.6   12.43  22.9  0
0.14455  13  7.87   0     0.52  6.172  96.1  5.95  5    311  15.2     396.9   19.15  27.1  0
0.21124  13  7.87   0     0.52  5.631  100   6.08  5    311  15.2     386.63  29.93  16.5  0
0.17004  13  7.87   0     0.52  6.004  85.9  6.59  5    311  15.2     386.71  17.1   18.9  0
0.22489  13  7.87   0     0.52  6.377  94.3  6.35  5    311  15.2     392.52  20.45  15    0
0.11747  13  7.87   0     0.52  6.009  82.9  6.23  5    311  15.2     396.9   13.27  18.9  0
0.09378  13  7.87   0     0.52  5.889  39    5.45  5    311  15.2     390.5   15.71  21.7  0

36

Growing the Tree

Training Log
#Nodes  Error
0       13.82
1       3.45
2       2.97
3       0.67
4       0.65
5       0.56
6       0.2
7       0.14
8       0.06
9       0.05
10      0.05
11      0.04
12      0.02
13      0.01
14      0.01
15      0

Validation Misclassification Summary

Classification Confusion Matrix
                Predicted Class
Actual Class    0      1
0               152    6
1               8      36

Error Report
Class    # Cases  # Errors  % Error
0        158      6         3.80
1        44       8         18.18
Overall  202      14        6.93
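The error report can be derived directly from the confusion matrix. The sketch below assumes rows are actual classes and columns are predicted classes, which is consistent with the case counts in the report (152 + 6 = 158 actual class 0 cases); the function name is hypothetical.

```python
def error_report(confusion):
    """Per-class and overall (cases, errors, % error) from a confusion matrix.

    confusion maps actual class -> {predicted class: count}.
    """
    report = {}
    total_cases = total_errors = 0
    for actual, predicted_counts in confusion.items():
        cases = sum(predicted_counts.values())
        errors = cases - predicted_counts.get(actual, 0)
        report[actual] = (cases, errors, 100 * errors / cases)
        total_cases += cases
        total_errors += errors
    report["Overall"] = (total_cases, total_errors, 100 * total_errors / total_cases)
    return report


# Validation confusion matrix for the classification tree.
confusion = {0: {0: 152, 1: 6}, 1: {0: 8, 1: 36}}
report = error_report(confusion)
for label, (cases, errors, pct) in report.items():
    print(label, cases, errors, f"{pct:.2f}")
```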

37

XLMiner : Classification Tree - Prune Log

# Decision Nodes  Error
15                0.0792
14                0.0644
13                0.0644
12                0.0644
11                0.0644
10                0.0644   <-- Minimum Error Prune (Std. Err. 0.0172708)
9                 0.0743
8                 0.0743
7                 0.0743
6                 0.0693
5                 0.0693
4                 0.0693
3                 0.0693   <-- Best Prune
2                 0.099
1                 0.2079
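One way to read the prune log is sketched below. It assumes, as in the usual CART convention and not as something stated on the slide, that the minimum error tree is the smallest tree attaining the lowest validation error, and that the best pruned tree is the smallest tree whose error lies within one standard error of that minimum. The function names are hypothetical.

```python
# Validation error by number of decision nodes, from the prune log.
prune_log = {
    15: 0.0792, 14: 0.0644, 13: 0.0644, 12: 0.0644, 11: 0.0644,
    10: 0.0644, 9: 0.0743, 8: 0.0743, 7: 0.0743, 6: 0.0693,
    5: 0.0693, 4: 0.0693, 3: 0.0693, 2: 0.099, 1: 0.2079,
}
STD_ERR = 0.0172708  # standard error reported next to the minimum error


def minimum_error_tree(log):
    """Smallest tree that attains the lowest validation error."""
    lowest = min(log.values())
    return min(nodes for nodes, err in log.items() if err == lowest)


def best_pruned_tree(log, std_err):
    """Smallest tree within one standard error of the lowest error."""
    threshold = min(log.values()) + std_err
    return min(nodes for nodes, err in log.items() if err <= threshold)


print(minimum_error_tree(prune_log), best_pruned_tree(prune_log, STD_ERR))
```

Under these assumptions the rule picks trees with 10 and 3 decision nodes, which matches the sizes of the minimum error tree and best pruned tree shown on the following slides.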

38

Classification Tree : Full Tree

[Figure: full classification tree with 15 decision nodes, rooted at an RM split at 6.5505 (228 and 76 cases in the two branches); deeper splits use DIS, RM, CRIM, LSTAT, PTRATIO, ZN and TAX.]

39

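Classification Tree : Best Pruned Tree

[Figure: best pruned tree with 3 decision nodes.]

RM split at 6.5505
– left branch: 136 cases, class 0 (67.3 %)
– right branch: 66 cases, RM split at 6.791
    – left branch: 16 cases, class 0 (7.92 %)
    – right branch: 50 cases, PTRATIO split at 19.45
        – left branch: 44 cases, class 1 (21.7 %)
        – right branch: 6 cases, class 0 (2.97 %)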

40

Classification Tree : Minimum Error Tree

[Figure: minimum error tree with 10 decision nodes, rooted at an RM split at 6.5505 (136 and 66 cases in the two branches); deeper splits use DIS, RM, CRIM, LSTAT, PTRATIO and TAX.]

41

Classification

Neural Network

Aim: To classify census tracts into high and low residential property value classes

42

XLMiner : Neural Network Classification

Epochs Information
Number of Epochs      30
Accumulated Trials    9120

Class  Trials
0      7860
1      1260

Architecture
Number of hidden layers           1
Hidden Layer 1, # Nodes           25
Step size for gradient descent    0.1000
Weight change momentum            0.6000
Weight decay                      0.0000
Cost Function                     Squared Error
Hidden layer sigmoid              Standard
Output layer sigmoid              Standard

43


Training data (304 cases)

Cut off Prob.Val. for Success (Updatable): 0.5

Classification Confusion Matrix
                Predicted Class
Actual Class    1      0
1               40     11
0               4      249

Error Report
Class    # Cases  # Errors  % Error
1        51       11        21.57
0        253      4         1.58
Overall  304      15        4.93

Validation data (202 cases)

Cut off Prob.Val. for Success (Updatable): 0.5

Classification Confusion Matrix
                Predicted Class
Actual Class    1      0
1               26     7
0               1      168

Error Report
Class    # Cases  # Errors  % Error
1        33       7         21.21
0        169      1         0.59
Overall  202      8         3.96

44

Lift chart (validation dataset)

[Figure: cumulative HIGHV when sorted using predicted values, compared with cumulative HIGHV using average; x-axis: # cases (0 to 300); y-axis: Cumulative (0 to 35).]

Decile-wise lift chart (validation dataset)

[Figure: decile mean / global mean (0 to 7) for deciles 1 to 10.]
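The quantities plotted in the two lift charts can be sketched as follows. This is an illustration under stated assumptions rather than XLMiner's code: cases are sorted by predicted value in descending order, the cumulative curve is a running total of the actual outcome, and decile-wise lift is the mean outcome in each bin divided by the overall mean. The function names and the ten toy records are hypothetical.

```python
def rank_by_score(actual, scores):
    """Actual outcomes ordered from highest to lowest predicted score."""
    pairs = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    return [a for _, a in pairs]


def cumulative_gains(actual, scores):
    """Running total of actual outcomes in score order (lift chart curve)."""
    running, total = [], 0
    for a in rank_by_score(actual, scores):
        total += a
        running.append(total)
    return running


def decile_lift(actual, scores, n_bins=10):
    """Bin mean divided by global mean; needs len(actual) >= n_bins."""
    ranked = rank_by_score(actual, scores)
    n = len(ranked)
    global_mean = sum(ranked) / n
    lifts = []
    for b in range(n_bins):
        chunk = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        lifts.append((sum(chunk) / len(chunk)) / global_mean)
    return lifts


# Toy data (hypothetical): 1 = high-value tract, scores = predicted probability.
actual = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
gains = cumulative_gains(actual, scores)
halves = decile_lift(actual, scores, n_bins=2)
print(gains, halves)
```

A model that ranks cases well puts most positive cases in the first bins, so the early lift values sit well above 1 and the cumulative curve rises faster than the average line.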

45

Data Reduction and Exploration

Hierarchical Clustering

Aim: To cluster electric utilities into similar groups

46

Utilities Data

Name      seq#  x1    x2    x3   x4    x5    x6     x7    x8
Arizona   1     1.06  9.2   151  54.4  1.6   9077   0     0.628
Boston    2     0.89  10.3  202  57.9  2.2   5088   25.3  1.555
Central   3     1.43  15.4  113  53    3.4   9212   0     1.058
Common    4     1.02  11.2  168  56    0.3   6423   34.3  0.7
Consolid  5     1.49  8.8   192  51.2  1     3300   15.6  2.044
Florida   6     1.32  13.5  111  60    -2.2  11127  22.5  1.241
Hawaiian  7     1.22  12.2  175  67.6  2.2   7642   0     1.652
Idaho     8     1.1   9.2   245  57    3.3   13082  0     0.309
Kentucky  9     1.34  13    168  60.4  7.2   8406   0     0.862
Madison   10    1.12  12.4  197  53    2.7   6455   39.2  0.623
Nevada    11    0.75  7.5   173  51.5  6.5   17441  0     0.768
NewEngla  12    1.13  10.9  178  62    3.7   6154   0     1.897
Northern  13    1.15  12.7  199  53.7  6.4   7179   50.2  0.527
Oklahoma  14    1.09  12    96   49.8  1.4   9673   0     0.588
Pacific   15    0.96  7.6   164  62.2  -0.1  6468   0.9   1.4
Puget     16    1.16  9.9   252  56    9.2   15991  0     0.62
SanDiego  17    0.76  6.4   136  61.9  9     5714   8.3   1.92
Southern  18    1.05  12.6  150  56.7  2.7   10140  0     1.108
Texas     19    1.16  11.7  104  54    -2.1  13507  0     0.636
Wisconsi  20    1.2   11.8  148  59.9  3.5   7287   41.1  0.702
United    21    1.04  8.6   204  61    3.5   6650   0     2.116
Virginia  22    1.07  9.3   174  54.3  5.9   10093  26.6  1.306

47

Dendrogram (Data range: T12-5!$M$3:$U$24, Method: Single linkage)

[Figure: dendrogram; y-axis: Distance (0 to 4); leaf order: 1 18 14 19 9 2 4 10 13 20 7 12 21 15 22 6 3 8 16 17 11 5.]

48

Predicted Clusters

Cluster id.  x1    x2    x3   x4    x5    x6     x7    x8
1            1.06  9.2   151  54.4  1.6   9077   0     0.628
1            0.89  10.3  202  57.9  2.2   5088   25.3  1.555
1            1.43  15.4  113  53    3.4   9212   0     1.058
1            1.02  11.2  168  56    0.3   6423   34.3  0.7
2            1.49  8.8   192  51.2  1     3300   15.6  2.044
1            1.32  13.5  111  60    -2.2  11127  22.5  1.241
1            1.22  12.2  175  67.6  2.2   7642   0     1.652
1            1.1   9.2   245  57    3.3   13082  0     0.309
1            1.34  13    168  60.4  7.2   8406   0     0.862
1            1.12  12.4  197  53    2.7   6455   39.2  0.623
3            0.75  7.5   173  51.5  6.5   17441  0     0.768
1            1.13  10.9  178  62    3.7   6154   0     1.897
1            1.15  12.7  199  53.7  6.4   7179   50.2  0.527
1            1.09  12    96   49.8  1.4   9673   0     0.588
1            0.96  7.6   164  62.2  -0.1  6468   0.9   1.4
1            1.16  9.9   252  56    9.2   15991  0     0.62
4            0.76  6.4   136  61.9  9     5714   8.3   1.92
1            1.05  12.6  150  56.7  2.7   10140  0     1.108
1            1.16  11.7  104  54    -2.1  13507  0     0.636
1            1.2   11.8  148  59.9  3.5   7287   41.1  0.702
1            1.04  8.6   204  61    3.5   6650   0     2.116
1            1.07  9.3   174  54.3  5.9   10093  26.6  1.306

49

Dendrogram (Data range: T12-5!$M$3:$U$24, Method: Complete linkage)

[Figure: dendrogram; y-axis: Distance (0 to 7); leaf order: 1 18 14 19 6 3 9 2 22 4 20 10 13 5 7 12 21 15 17 8 16 11.]

50

Predicted Clusters

Cluster id.  x1    x2    x3   x4    x5    x6     x7    x8
1            1.06  9.2   151  54.4  1.6   9077   0     0.628
2            0.89  10.3  202  57.9  2.2   5088   25.3  1.555
1            1.43  15.4  113  53    3.4   9212   0     1.058
2            1.02  11.2  168  56    0.3   6423   34.3  0.7
2            1.49  8.8   192  51.2  1     3300   15.6  2.044
1            1.32  13.5  111  60    -2.2  11127  22.5  1.241
3            1.22  12.2  175  67.6  2.2   7642   0     1.652
4            1.1   9.2   245  57    3.3   13082  0     0.309
1            1.34  13    168  60.4  7.2   8406   0     0.862
2            1.12  12.4  197  53    2.7   6455   39.2  0.623
4            0.75  7.5   173  51.5  6.5   17441  0     0.768
3            1.13  10.9  178  62    3.7   6154   0     1.897
2            1.15  12.7  199  53.7  6.4   7179   50.2  0.527
1            1.09  12    96   49.8  1.4   9673   0     0.588
3            0.96  7.6   164  62.2  -0.1  6468   0.9   1.4
4            1.16  9.9   252  56    9.2   15991  0     0.62
3            0.76  6.4   136  61.9  9     5714   8.3   1.92
1            1.05  12.6  150  56.7  2.7   10140  0     1.108
1            1.16  11.7  104  54    -2.1  13507  0     0.636
2            1.2   11.8  148  59.9  3.5   7287   41.1  0.702
3            1.04  8.6   204  61    3.5   6650   0     2.116
2            1.07  9.3   174  54.3  5.9   10093  26.6  1.306

51

Predicted Clusters (sorted)

Cluster id.  x1    x2    x3   x4    x5    x6     x7    x8
1            1.06  9.2   151  54.4  1.6   9077   0     0.628
1            1.43  15.4  113  53    3.4   9212   0     1.058
1            1.32  13.5  111  60    -2.2  11127  22.5  1.241
1            1.34  13    168  60.4  7.2   8406   0     0.862
1            1.09  12    96   49.8  1.4   9673   0     0.588
1            1.05  12.6  150  56.7  2.7   10140  0     1.108
1            1.16  11.7  104  54    -2.1  13507  0     0.636
2            0.89  10.3  202  57.9  2.2   5088   25.3  1.555
2            1.02  11.2  168  56    0.3   6423   34.3  0.7
2            1.49  8.8   192  51.2  1     3300   15.6  2.044
2            1.12  12.4  197  53    2.7   6455   39.2  0.623
2            1.15  12.7  199  53.7  6.4   7179   50.2  0.527
2            1.2   11.8  148  59.9  3.5   7287   41.1  0.702
2            1.07  9.3   174  54.3  5.9   10093  26.6  1.306
3            1.22  12.2  175  67.6  2.2   7642   0     1.652
3            1.13  10.9  178  62    3.7   6154   0     1.897
3            0.96  7.6   164  62.2  -0.1  6468   0.9   1.4
3            0.76  6.4   136  61.9  9     5714   8.3   1.92
3            1.04  8.6   204  61    3.5   6650   0     2.116
4            1.1   9.2   245  57    3.3   13082  0     0.309
4            0.75  7.5   173  51.5  6.5   17441  0     0.768
4            1.16  9.9   252  56    9.2   15991  0     0.62

Means      x1    x2    x3   x4    x5   x6     x7    x8
Cluster 1  1.21  12.5  128  55.5  1.7  10163  3.2   0.874
Cluster 2  1.13  10.9  183  55.1  3.1  6546   33.2  1.065
Cluster 3  1.02  9.1   171  62.9  3.7  6526   1.8   1.797
Cluster 4  1.00  8.9   223  54.8  6.3  15505  0.0   0.566
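The cluster means in the table above are simply column averages over the records assigned to each cluster. The sketch below recomputes them for the smallest cluster (cluster 4); the function name is hypothetical, and the rows are copied from the sorted table.

```python
def cluster_means(rows):
    """Average each variable over rows grouped by cluster id.

    rows is a list of (cluster_id, [x1, ..., x8]) pairs.
    """
    grouped = {}
    for cluster_id, values in rows:
        grouped.setdefault(cluster_id, []).append(values)
    return {
        cluster_id: [sum(col) / len(col) for col in zip(*members)]
        for cluster_id, members in grouped.items()
    }


# Records assigned to cluster 4 under complete linkage.
cluster_4_rows = [
    (4, [1.10, 9.2, 245, 57.0, 3.3, 13082, 0, 0.309]),
    (4, [0.75, 7.5, 173, 51.5, 6.5, 17441, 0, 0.768]),
    (4, [1.16, 9.9, 252, 56.0, 9.2, 15991, 0, 0.620]),
]
means = cluster_means(cluster_4_rows)
print([round(m, 3) for m in means[4]])
```

Comparing such means across clusters is what makes the groups interpretable; for example, cluster 4 stands out through its high x6 average.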

52

Affinity

Association Rules (Market Basket Analysis)

Aim: to identify types of books that are likely to be bought by customers based on past purchases of books

53

ChildBks  YouthBks  CookBks  DoItYBks  RefBks  ArtBks  GeogBks  ItalCook  ItalAtlas  ItalArt  Florence
0         1         0        1         0       0       1        0         0          0        0
1         0         0        0         0       0       0        0         0          0        0
0         0         0        0         0       0       0        0         0          0        0
1         1         1        0         1       0       1        0         0          0        0
0         0         1        0         0       0       1        0         0          0        0
1         0         0        0         0       1       0        0         0          0        1
0         1         0        0         0       0       0        0         0          0        0
0         1         0        0         1       0       0        0         0          0        0
1         0         0        1         0       0       0        0         0          0        0
1         1         1        0         0       0       1        0         0          0        0
0         0         0        0         0       0       0        0         0          0        0
0         0         1        0         0       0       1        0         0          0        0
1         0         0        0         0       1       0        0         0          0        1
1         1         0        1         1       1       0        0         1          1        0
1         1         1        0         0       0       0        0         0          0        0
1         1         1        0         0       0       1        0         0          0        0
0         0         1        0         0       0       0        0         0          0        0
0         0         1        0         0       0       0        0         0          0        0
1         1         1        1         1       1       1        0         0          0        0
1         1         1        0         0       1       0        0         0          0        1

2000 customers

54

XLMiner : Association Rules

Data
Input Data      Sheet1!$A$1:$K$2001
Data Format     Binary Matrix
Min. Support    200
Min. Conf. %    70
# Rules         19

Rule #  Conf. %  Antecedent (a)       Consequent (c)  Support(a)  Support(c)  Support(a U c)  Lift Ratio
1       100      ItalCook          => CookBks         227         862         227             2.32
2       82.19    DoItYBks, ArtBks  => CookBks         247         862         203             1.91
3       81.89    DoItYBks, GeogBks => CookBks         265         862         217             1.90
4       80.33    CookBks, RefBks   => ChildBks        305         846         245             1.90
5       80       ArtBks, GeogBks   => ChildBks        255         846         204             1.89
6       81.18    ArtBks, GeogBks   => CookBks         255         862         207             1.88
7       79.63    YouthBks, CookBks => ChildBks        324         846         258             1.88
8       80.86    ChildBks, RefBks  => CookBks         303         862         245             1.88
9       78.87    DoItYBks, GeogBks => ChildBks        265         846         209             1.86
10      79.35    ChildBks, DoItYBks => CookBks        368         862         292             1.84
11      77.87    CookBks, DoItYBks => ChildBks        375         846         292             1.84
12      77.66    CookBks, GeogBks  => ChildBks        385         846         299             1.84
13      78.18    ChildBks, YouthBks => CookBks        330         862         258             1.81
14      77.85    ChildBks, ArtBks  => CookBks         325         862         253             1.81
15      75.75    CookBks, ArtBks   => ChildBks        334         846         253             1.79
16      76.67    ChildBks, GeogBks => CookBks         390         862         299             1.78
17      70.65    GeogBks           => ChildBks        552         846         390             1.67
18      70.63    RefBks            => ChildBks        429         846         303             1.67
19      71.1     RefBks            => CookBks         429         862         305             1.65
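The confidence and lift ratio columns can be recomputed from the support counts. The sketch below uses the standard definitions: confidence is Support(a U c) divided by Support(a), and lift ratio is confidence divided by the consequent's share of all transactions. The function name is hypothetical, and the total of 2,000 transactions is taken from the customer count on the data slide.

```python
def rule_metrics(support_a, support_c, support_ac, n_transactions):
    """Return (confidence in percent, lift ratio) for a rule a => c."""
    confidence = support_ac / support_a
    benchmark = support_c / n_transactions  # P(c) with no antecedent
    return 100 * confidence, confidence / benchmark


N_CUSTOMERS = 2000

# Rule 1: ItalCook => CookBks
conf_1, lift_1 = rule_metrics(227, 862, 227, N_CUSTOMERS)
# Rule 2: DoItYBks, ArtBks => CookBks
conf_2, lift_2 = rule_metrics(247, 862, 203, N_CUSTOMERS)
# Rule 17: GeogBks => ChildBks
conf_17, lift_17 = rule_metrics(552, 846, 390, N_CUSTOMERS)

print(f"{conf_1:.2f} {lift_1:.2f}")
print(f"{conf_2:.2f} {lift_2:.2f}")
print(f"{conf_17:.2f} {lift_17:.2f}")
```

A lift ratio above 1 means the antecedent makes the consequent more likely than it is among customers in general, which is why the rules are ranked by it.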

55

Some Utilities

• Sampling from worksheets and databases

• Database scoring

• Graphics

• Binning

56

Simple Random Sampling

57

Stratified Random Sampling

58

Scoring to databases and worksheets

59

Binning continuous variables

60

Missing Data

61

Graphics: Boston Housing data

[Figure: box plot of AGE; y-axis: Y Values (0 to 120).]

[Figure: histogram of AGE (0 to 100 in steps of 10); y-axis: Frequency (0 to 180).]

62

[Figure: box plot of RM; y-axis: Y Values (0 to 10).]

[Figure: histogram of RM (3 to 9.6 in steps of 0.6); y-axis: Frequency (0 to 250).]

63

Matrix Plot

[Figure: matrix plot of TAX, AGE and RM.]

High tax towns have fewer rooms on average?

64

[Figure: box plot of RM by Binned_TAX (bins 1 to 5); y-axis: Y Values (0 to 10).]

65

Future Extensions

• Cross Validation

• Bootstrap, Bagging and Boosting

• Error-based clustering

• Time Series and Sequences

• Support Vector Machines

• Collaborative Filtering

66

In Conclusion

• XLMiner is a modern tool-belt for data mining. It is an affordable, easy-to-use tool for consultants, MBAs and business analysts to learn, create and deploy data mining methods.

• More generally, XLMiner is a tool for data analysis in Excel that uses classical and modern, computationally intensive techniques.
