Page 1: Data Mining Research

Data Mining Research

David L. Olson
University of Nebraska

Page 2: Data Mining Research

Data Mining Research

• Business Applications
  – Credit scoring
  – Customer classification
  – Fraud detection
  – Human resource management
• Algorithms
• Database related
  – Data warehouse products claim internal data mining
• Text mining
• Data Mining Process

Page 3: Data Mining Research

Personal (with others)

• Business Applications
  – Introduction to Business Data Mining with Yong Shi [2006]
  – Qing Cao - RFM
• Algorithms
  – Advanced Data Mining Techniques with Dursun Delen [2008]
  – Moshkovich & Mechitov – Ordinal scales in trees
  – Data set balancing
• Database related
  – encyclopedia
• Text mining
  – Web log ethics
• Data Mining Process
  – Ton Stam, Dursun Delen

Page 4: Data Mining Research

RFM
with Qing Cao, Ching Gu, Donhee Lee

• Recency
  – Time since customer made last purchase
• Frequency
  – Number of purchases this customer made over time frame
• Monetary
  – Average purchase amount (or total)
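The three measures can be computed directly from an order history. The following is a minimal sketch, not the study's own code: the function name `rfm`, the tuple layout, the cutoff date, and the sample orders are all hypothetical, and Monetary is taken as the average purchase amount.

```python
from collections import defaultdict
from datetime import date

def rfm(orders, as_of):
    """Return {customer: (recency_days, frequency, average_amount)}.

    orders is an iterable of (customer, order_date, amount) tuples
    (a hypothetical layout chosen for this sketch).
    """
    dates = defaultdict(list)
    amounts = defaultdict(list)
    for customer, order_date, amount in orders:
        dates[customer].append(order_date)
        amounts[customer].append(amount)

    result = {}
    for customer in dates:
        recency = (as_of - max(dates[customer])).days    # days since last purchase
        frequency = len(dates[customer])                 # number of purchases
        monetary = sum(amounts[customer]) / frequency    # average purchase amount
        result[customer] = (recency, frequency, monetary)
    return result

# Hypothetical sample data, for illustration only.
orders = [
    ("A", date(2002, 11, 1), 100.0),
    ("A", date(2002, 12, 1), 50.0),
    ("B", date(2001, 12, 31), 200.0),
]
scores = rfm(orders, as_of=date(2002, 12, 31))
print(scores["A"])  # (30, 2, 75.0)
```

Replacing the average with `sum(amounts[customer])` gives the "total" variant of Monetary.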

Page 5: Data Mining Research

Variants

• F & M highly correlated
  – Bult & Wansbeek [1995] Journal of Marketing Science
• Value = M/R
  – Yang (2004) Journal of Targeting, Measurement and Analysis for Marketing

Page 6: Data Mining Research

Limitations

• Other attributes may be important
  – Product variation
  – Customer age
  – Customer income
  – Customer lifestyle
• Still, RFM widely used
  – Works well if response rate is high

Page 7: Data Mining Research

Data

• Meat retailer in Nebraska
• 64,180 purchase orders (mail)
• 10,000 individual customers
• Oct 11, 1998 to Oct 3, 2003
• ORDER DATE
• ORDER AMOUNT
• PRESENCE OF PROMOTION

Page 8: Data Mining Research

Data

• Nebraska food products firm
• 64,180 individual purchase orders (by mail)
• 10,000 individual customers
• 11 Oct 1998 to 3 Oct 2003
• Data:
  – Order date
  – Order amount (price)
  – Whether or not promotion involved

Page 9: Data Mining Research

Treatment

• Used 5,000 observations to build model
  – To the end of 2002
• Used another 5,000 for testing
  – 2003

Page 10: Data Mining Research

Correlations
* - 0.01 significance; ** - 0.05 significance; *** - 0.001 significance

                   R            F            M
R                  1.0
F                 -0.371        1.0
M                 -0.278        0.749        1.0
Response 2003      0.209 ***    0.135 ***    0.133 ***
Response 2003 $   -0.100 ***    0.534 ***    0.751 ***

Page 11: Data Mining Research

Data

Factor  Min  Max     Group1   Group2      Group3      Group4       Group5
R       1    1542    1233+    925-1232    617-924     309-616      1-308
Count                548      209         297         464          3482
F       1    56      1-11     12-22       23-33       34-44        45+
Count                4503     431         49          13           4
M       20   15199   0-3040   3041-6060   6061-9080   9081-12100   12100+
Count                4900     81          17          1            1
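The group boundaries in this table can be applied with a sorted-list lookup. This is a minimal sketch using the boundaries as printed; the function names are hypothetical. Note that R is reversed: the most recent customers (smallest R) fall in Group5. A value of exactly 12100 for M is placed in Group4 here, since the table lists it in both Group4 and Group5.

```python
from bisect import bisect_left

# Upper bounds of the first four ranges, copied from the table.
R_UPPER = [308, 616, 924, 1232]
F_UPPER = [11, 22, 33, 44]
M_UPPER = [3040, 6060, 9080, 12100]

def r_group(recency):
    # Smallest recency (most recent purchase) maps to Group5.
    return 5 - bisect_left(R_UPPER, recency)

def f_group(frequency):
    # Largest frequency maps to Group5.
    return 1 + bisect_left(F_UPPER, frequency)

def m_group(monetary):
    # Largest monetary value maps to Group5.
    return 1 + bisect_left(M_UPPER, monetary)

def rfm_cell(recency, frequency, monetary):
    """Return the (R, F, M) group triple for one customer."""
    return (r_group(recency), f_group(frequency), m_group(monetary))

print(rfm_cell(100, 5, 250))  # (5, 1, 1)
```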

Page 12: Data Mining Research

Count by RFM Cell

RF   R            F         M1     M2   M3   M4   M5
55   R 1-308      F 45+     1      2    1    0    0
54                F 34-44   3      6    4    0    0
53                F 23-33   22     23   4    0    0
52                F 12-22   355    36   5    1    1
51                F 1-11    3003   12   3    0    0
45   R 309-616    F 45+     0      0    0    0    0
44                F 34-44   0      0    0    0    0
43                F 23-33   0      0    0    0    0
42                F 12-22   29     1    0    0    0
41                F 1-11    433    1    0    0    0
35   R 617-924    F 45+     0      0    0    0    0
34                F 34-44   0      0    0    0    0
33                F 23-33   0      0    0    0    0
32                F 12-22   3      0    0    0    0
31                F 1-11    294    0    0    0    0
25   R 925-1232   F 45+     0      0    0    0    0
24                F 34-44   0      0    0    0    0
23                F 23-33   0      0    0    0    0
22                F 12-22   0      0    0    0    0
21                F 1-11    209    0    0    0    0
15   R 1233+      F 45+     0      0    0    0    0
14                F 34-44   0      0    0    0    0
13                F 23-33   0      0    0    0    0
12                F 12-22   0      0    0    0    0
11                F 1-11    548    0    0    0    0

Page 13: Data Mining Research

Basic Model Coincidence Matrix
Correct 0.6076

          Actual 0   Actual 1   Totals
Model 0   872        1949       2821
Model 1   13         2166       2179
Totals    885        4115       5000
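The "Correct" figure is the share of cases on the diagonal of the coincidence matrix. A short sketch that reproduces it from the four cell counts on this slide (the function name `accuracy` is just a label for this example):

```python
def accuracy(matrix):
    """Fraction of cases where the model's class equals the actual class.

    matrix[i][j] = number of cases the model assigned to class i
    whose actual class was j.
    """
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Cell counts from the slide: rows are Model 0 / Model 1,
# columns are Actual 0 / Actual 1.
basic = [[872, 1949],
         [13, 2166]]
print(round(accuracy(basic), 4))  # 0.6076
```

That is (872 + 2166) / 5000 = 3038 / 5000.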

Page 14: Data Mining Research

BALANCE CELLS

• Adjusted boundaries of 5 x 5 x 5 matrix
• Can’t get all to equal average of 8
  – Lumpy (due to ties)
  – Ranged from 4 to 11
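One way to see why ties make balanced groups lumpy is to cut a variable at its quantiles while keeping equal values together. The sketch below illustrates that idea only; it is not necessarily the procedure used in the study, and `balanced_groups` is a hypothetical name.

```python
def balanced_groups(values, n_groups=5):
    """Assign group numbers 1..n_groups with near-equal counts.

    Equal values always land in the same group, so heavy ties
    produce uneven ("lumpy") or even empty groups.
    """
    ordered = sorted(values)
    n = len(ordered)
    # Largest value allowed in each of the first n_groups - 1 groups.
    cuts = [ordered[(n * k) // n_groups - 1] for k in range(1, n_groups)]
    # A value's group is one more than the number of cut points it exceeds.
    return [1 + sum(v > c for c in cuts) for v in values]

distinct = balanced_groups(list(range(1, 11)))
tied = balanced_groups([1, 1, 1, 1, 2, 3, 4, 5, 6, 7])
print(distinct)  # [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
print(tied)      # [1, 1, 1, 1, 3, 3, 4, 4, 5, 5]
```

With ten distinct values every group gets two members. With four tied values at the bottom, group 1 holds four members and group 2 is empty.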

Page 15: Data Mining Research

Balanced Cell Densities
Correct 0.8380

RF   R           F       M1   M2   M3   M4   M5
55   R 1-22      F 9+    43   41   41   41   42
54               F 6-8   43   44   43   43   44
53               F 4-5   57   64   63   61   63
52               F = 3   30   31   35   34   34
51               F 1-2   26   25   25   21   24
45   R 23-48     F 9+    41   41   41   41   41
44               F 6-8   34   38   36   38   37
43               F 4-5   58   56   57   57   57
42               F = 3   40   36   37   38   37
41               F 1-2   19   20   18   22   20
35   R 49-151    F 9+    63   64   62   63   63
34               F 6-8   49   50   50   50   50
33               F 4-5   43   43   44   45   44
32               F = 3   29   21   28   27   28
31               F 1-2   19   23   19   18   21
25   R 152-672   F 9+    38   38   38   38   38
24               F 6-8   50   49   51   50   50
23               F 4-5   51   51   51   51   51
22               F = 3   32   33   31   32   33
21               F 1-2   29   25   33   26   30
15   R 673+      F 9+    16   15   15   15   15
14               F 6-8   15   16   15   16   15
13               F 4-5   30   30   30   30   30
12               F = 3   59   70   63   64   64
11               F 1-2   67   73   93   76   69

Page 16: Data Mining Research

Alternatives

• LIFT
  – Sort groups by best response
  – Apply your marketing budget to the most profitable (until you run out of budget)
  – LIFT is the gain obtained above par (random)
• VALUE FUNCTION (Yang, 2004)
  – Throw out F (correlated with M)
  – Use ratio of M/R
• Logistic Regression
• Decision Tree
• Neural Network
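The LIFT steps above can be sketched as follows. This uses one common definition of lift, the cumulative response rate of the targeted groups divided by the overall (random) response rate; the slides do not say which formula was used, and the group counts below are hypothetical.

```python
def cumulative_lift(groups):
    """groups: list of (n_customers, n_responders) pairs.

    Sorts groups by response rate (best first) and returns, for each
    cutoff, the cumulative response rate divided by the overall rate.
    """
    ranked = sorted(groups, key=lambda g: g[1] / g[0], reverse=True)
    total_n = sum(n for n, _ in ranked)
    total_resp = sum(r for _, r in ranked)
    baseline = total_resp / total_n  # "par": response rate of a random mailing

    lifts = []
    seen_n = seen_resp = 0
    for n, resp in ranked:
        seen_n += n
        seen_resp += resp
        lifts.append((seen_resp / seen_n) / baseline)
    return lifts

# Hypothetical groups: (customers, responders).
groups = [(100, 20), (100, 80), (100, 50)]
lifts = cumulative_lift(groups)
print([round(x, 2) for x in lifts])  # [1.6, 1.3, 1.0]
```

Lift always ends at 1.0 once every group is included, because mailing everyone is the same as mailing at random.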

Page 17: Data Mining Research

LIFT: Equal Groups

Page 18: Data Mining Research

V Value by Cell

Cell     Min      n      0     1      %       Avg$
1        0        424    3     421    0.993   94.42
2        0.0464   286    2     284    0.993   101.26
3        0.108    281    1     280    0.996   107.18
4        0.195    352    0     352    1.000   108.25
5        0.376    303    2     301    0.993   119.99
6        0.72     285    9     276    0.968   136.13
7        1.252    292    57    235    0.805   127.05
8        1.952    319    120   199    0.624   98.31
9        2.73     293    101   192    0.655   101.01
10       3.74     229    102   127    0.555   101.69
11       4.972    231    87    144    0.623   102.33
12       6.524    254    86    168    0.661   107.12
13       8.6      218    75    143    0.656   119.83
14       11.08    216    71    145    0.671   119.99
15       14.34    191    49    142    0.743   122.55
16       18.35    207    46    161    0.778   157.82
17       24.15    166    30    136    0.819   159.17
18       32.87    148    17    131    0.885   220.18
19       48.2     175    16    159    0.909   284.74
20       92       130    11    119    0.915   424.69
Totals            5000   885   4115   0.823   131.33
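A customer can be placed in one of these 20 cells by computing V = M/R and looking it up against the "Min" column. This is a sketch under one assumption not stated on the slide: each Min is treated as an inclusive lower bound. The function names are hypothetical.

```python
from bisect import bisect_right

# "Min" column of the table: lower bound of V for cells 1..20.
CELL_MIN = [0, 0.0464, 0.108, 0.195, 0.376, 0.72, 1.252, 1.952, 2.73, 3.74,
            4.972, 6.524, 8.6, 11.08, 14.34, 18.35, 24.15, 32.87, 48.2, 92]

def value(monetary, recency):
    """Value function V = M / R (recency must be positive)."""
    return monetary / recency

def cell(v):
    """Cell number (1..20) whose lower bound is the largest one not above v."""
    return bisect_right(CELL_MIN, v)

# Hypothetical customer: average purchase of 150, last order 30 days ago.
v = value(150, 30)
print(v, cell(v))  # 5.0 11
```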

Page 19: Data Mining Research

V Model Lift

Page 20: Data Mining Research

Models

• Regression:
  -0.4775 + 0.00853 R + 0.1675 F + 0.00213 M
  Test data: Correct 0.8230
• Decision Tree
  IF R ≤ 82
    AND R ≤ 32: YES (1567 right, 198 wrong)
    ELSE R > 32
      AND F ≤ 3
        AND M ≤ 296: NO (285 right, 91 wrong)
        ELSE M > 296: YES (28 right, 9 wrong)
      ELSE F > 3: YES (729 right, 110 wrong)
  ELSE R > 82: YES (2391 right, 3 wrong)
  Test data: Correct 0.8678
• Neural Network
  Test data: Correct 0.8674
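The decision tree rules on this slide translate directly into a small function, which is part of why trees are easy to apply. The sketch below follows the thresholds as printed; the function name is hypothetical.

```python
def tree_predicts_response(r, f, m):
    """Apply the slide's decision tree: True means YES (predicted to respond)."""
    if r > 82:
        return True       # R > 82: YES
    if r <= 32:
        return True       # R <= 32: YES
    # From here on, 32 < R <= 82.
    if f > 3:
        return True       # F > 3: YES
    return m > 296        # F <= 3: YES only if M > 296, otherwise NO

print(tree_predicts_response(50, 2, 100))  # False
print(tree_predicts_response(50, 2, 300))  # True
```

Only one leaf predicts NO: customers with R between 33 and 82, F of 3 or less, and M of 296 or less.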

Page 21: Data Mining Research

COMPARISONS

Model                   Test Accuracy         Benefits               Drawbacks
RFM                     0.6076                Simplest data          Uneven cell densities
Degenerate (all 1)      0.8230
Balanced cell sizes     0.7156                Better statistically   More data manipulation
Balanced cell sizes $   0.8380                Better statistically
Value function          0.8180                Condense to one IV     Less information
Logistic regression     0.8230 (degenerate)   Additional IVs         Formula hard to apply
Decision tree           0.8678                Easy to interpret
Neural network          0.8674                Fit nonlinear data     Hard to apply model