
13 June 2013 | Virtual Business Analytics Chapter

A Best Practices Framework for Data Mining

Mark Tabladillo, Ph.D., Data Mining Scientist

Artus Krohn-Grimberghe, Ph.D., Consultant and Assistant Professor

About MarkTab

Training and Consulting with http://marktab.com

Data Mining Resources and Blog at http://marktab.net

Ph.D. – Industrial Engineering, Georgia Tech

Training and consulting internationally across many industries – SAS and Microsoft

Contributed to peer-reviewed research and legislation

◦ Mentoring doctoral dissertations at the accredited University of Phoenix

Presenter


About Artus

Assistant Professor for Analytic Information Systems and Business Intelligence

PhD in computer science

Research: data mining for e-commerce and mobile business

Consultant


Section One: DATA MINING FOUNDATION

Definition 1 (Informal): Data mining is the automated or semi-automated process of discovering patterns in data.

Definition 2: Data mining is a process using

1. Exploratory Data Analysis: statistical and visual data analysis techniques.

Forming a hypothesis

2. Data Modeling & Predictions: describe data using probability distributions and Machine Learning algorithms (“model”).

Fitting a hypothesis

3. Statistical Learning Theory: model selection, model evaluation

Data Mining Visualized

Target: attribute we are interested in.

Input: data available for our predictions.

Function f: describes the relationship between target and input. Regrettably, f is unknown and unknowable.

[Figure: Input → f( ) → Target]

Data Mining Visualized

[Figure: Real world (unknown): Input → f( ) → Target. Data mining model: Input → Hypothesis h( ) → Target.]

Need to find “good” h. h is your DM “algorithm”.

Input data has to be appropriate. Select and transform as needed.

Correct modeling of target is crucial
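The relationship between the unknown f and a fitted hypothesis h can be illustrated with a small sketch. Everything here is an assumption made for illustration: `hidden_f` stands in for a real-world function that would never be available in practice, the noise level is arbitrary, and ordinary least squares on a straight line is only one possible choice of h.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for a hypothesis h(x) = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def hidden_f(x):
    # Hypothetical stand-in for the real-world f; in practice f is never observed.
    return 3.0 * x + 2.0


rng = random.Random(0)  # fixed seed so the sketch is repeatable
inputs = [i / 10 for i in range(100)]
# We only ever see noisy targets, never f itself.
targets = [hidden_f(x) + rng.gauss(0, 0.5) for x in inputs]

slope, intercept = fit_line(inputs, targets)


def h(x):
    """The fitted hypothesis: our approximation of the unknown f."""
    return slope * x + intercept


print(f"h(x) = {slope:.2f} * x + {intercept:.2f}")
```

The fitted coefficients land near, but not exactly on, the values inside `hidden_f`, which is the practical meaning of a “good” h.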


Top 10 Expectations

BEST PRACTICE: LEARN FROM EXPERIENCE

Expectation Ten

• Marketing: People can start data mining in 10 minutes…

• More Scientific: Better models come from days, weeks or months of iterative improvement.

Expectation Nine

• Data miners can provide provably good models with little or zero knowledge of the specific industry…

• Knowing the industry and organizational goals helps orient the questions, modeling, and analysis.

Expectation Eight

• Open source software can provide quality results worthy of peer-reviewed literature…

• Commercial software with years-long service options is required for enterprise scale.

Expectation Seven

• We can learn a lot from the current data warehouses, cubes, and big data…

• We can improve our modeling by creating new data collection strategies.

Expectation Six

• People can build data mining models with little or zero data cleaning…

• Better results happen when we organize and rearrange data for best success.

Expectation Five

• Data mining can provide answers to problems…

• Most of the time we only get detailed insights into larger problems, and sometimes we uncover more problems than we started with.

Expectation Four

• A little data mining knowledge can provide an organization with a competitive edge…

• The edge grows along with experience and better study of the methodology and mathematics.

Expectation Three

• Individual professionals can deliver excellent predictive analysis…

• Small teams working together can help quickly and efficiently conquer some of the most difficult analytic challenges.

Expectation Two

• Numbers speak for themselves and can influence better decision making…

• Leadership strategy helps teams deliver results in the best way given the current culture.

Expectation One

• A lot of data mining best practices and strategies can be communicated in an hour or a day…

• The best commitment is ongoing education on both data mining and machine learning technology.

Section Two: ANALYZING AND PREPARING DATA

Best practice: study individual attributes

Histograms and frequencies (discrete)

Kernel density estimates

Cumulative distribution function

Rank-order plots and lift charts

Summary statistics (continuous)

Box-and-whisker plots
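Several of the single-attribute techniques listed above reduce to a few lines of code. The sketch below uses only the Python standard library and computes the numbers behind a frequency table, a box-and-whisker plot, and a cumulative distribution function. The `regions` and `ages` attributes are hypothetical sample data, and plotting is left out on purpose.

```python
import statistics
from collections import Counter


def frequencies(values):
    """Frequency table for a discrete attribute."""
    return Counter(values)


def five_number_summary(values):
    """Minimum, quartiles, and maximum: the numbers a box-and-whisker plot draws."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)


def empirical_cdf(values):
    """Sorted (value, fraction of observations <= value) pairs."""
    counts = Counter(values)
    n = len(values)
    running = 0
    cdf = []
    for v in sorted(counts):
        running += counts[v]
        cdf.append((v, running / n))
    return cdf


# Hypothetical sample attributes, for illustration only.
regions = ["north", "south", "north", "east", "north"]
ages = [23, 25, 25, 31, 35, 35, 35, 42, 58]

region_counts = frequencies(regions)
age_summary = five_number_summary(ages)
age_cdf = empirical_cdf(ages)
age_mean = statistics.mean(ages)
age_stdev = statistics.stdev(ages)

print(region_counts.most_common())
print(age_summary)
```

Note that `statistics.quantiles` uses its default "exclusive" method here; other tools may compute quartiles slightly differently, so box plots from different packages need not match exactly.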


Best practice: study combinations

Pivot tables

Scatter plots

Logarithmic plots

Naïve Bayes

Correlation matrices

False-Color plots

Scatter-Plot matrix

Co-plot
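Two of the combination techniques above, pivot tables and correlation matrices, can be sketched without any third-party library. The column names and values below are hypothetical, and the Pearson coefficient is written out by hand so the arithmetic stays visible.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric columns.

    Raises ZeroDivisionError if either column is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict of pairwise correlations between named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


def pivot_count(rows, row_key, col_key):
    """Count records for each combination of two discrete attributes."""
    table = defaultdict(lambda: defaultdict(int))
    for row in rows:
        table[row[row_key]][row[col_key]] += 1
    return {r: dict(cols) for r, cols in table.items()}


# Hypothetical data, for illustration only.
rows = [
    {"region": "north", "churned": "yes"},
    {"region": "north", "churned": "no"},
    {"region": "south", "churned": "no"},
    {"region": "north", "churned": "no"},
]
columns = {
    "visits": [1, 2, 3, 4, 5],
    "spend": [12.0, 19.0, 33.0, 38.0, 52.0],
    "complaints": [4, 4, 2, 1, 0],
}

pivot = pivot_count(rows, "region", "churned")
matrix = correlation_matrix(columns)
print(pivot)
print(round(matrix["visits"]["spend"], 3))
```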

Section Three: MACHINE LEARNING ALGORITHMS

How to Choose an Algorithm

Choosing an algorithm or series of algorithms is an art

One algorithm could perform different tasks

Be willing to experiment with algorithms and algorithm parameters
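Experimenting with algorithm parameters can be as simple as scoring each setting on data held out from training. The sketch below is one possible way to do that, assuming a hand-written k-nearest-neighbours classifier and synthetic two-class data. The helper names (`knn_predict`, `holdout_accuracy`, `make_data`) and the candidate values of k are hypothetical.

```python
import random
from collections import Counter


def knn_predict(train, point, k):
    """Classify `point` by majority vote among its k nearest training rows."""
    nearest = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], point)),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def holdout_accuracy(train, holdout, k):
    """Share of held-out rows that the k-NN rule classifies correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in holdout)
    return hits / len(holdout)


def make_data(n, rng):
    """Hypothetical two-class data: two overlapping Gaussian blobs."""
    rows = []
    for _ in range(n):
        label = rng.choice(["a", "b"])
        center = 0.0 if label == "a" else 2.0
        rows.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))
    return rows


rng = random.Random(42)  # fixed seed so the experiment is repeatable
data = make_data(300, rng)
train, holdout = data[:200], data[200:]

# Try several parameter settings and keep the one that scores best on held-out rows.
results = {k: holdout_accuracy(train, holdout, k) for k in (1, 5, 15)}
best_k = max(results, key=results.get)
print(results, "best k:", best_k)
```

The same loop works for comparing different algorithms rather than different parameters: only the prediction function changes, while the held-out rows stay fixed so the scores remain comparable.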


Algorithms for Data Mining Tasks (1 of 2)

Algorithm Name | Description

Microsoft Time Series | Analyzes time-related data by using a linear decision tree. Patterns can be used to predict future values in the time series.

Microsoft Decision Trees | Makes predictions based on the relationships between columns in the dataset, and models the relationships as a tree-like series of splits on specific values. Supports the prediction of both discrete and continuous attributes.

Microsoft Linear Regression | If there is a linear dependency between the target variable and the variables being examined, finds the most efficient relationship between the target and its inputs. Supports prediction of continuous attributes.

Microsoft Clustering | Identifies relationships in a dataset that you might not logically derive through casual observation. Uses iterative techniques to group records into clusters that contain similar characteristics.
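The phrase “iterative techniques to group records into clusters” can be made concrete with a generic k-means sketch on a single numeric attribute. This is not the Microsoft Clustering implementation; it is a minimal illustration of the iterate-assign-update idea, and the starting centers and sample values are hypothetical.

```python
def assign(values, centers):
    """Put each value in the group of its nearest center."""
    groups = [[] for _ in centers]
    for v in values:
        nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
        groups[nearest].append(v)
    return groups


def kmeans_1d(values, centers, iterations=20):
    """Generic k-means on one attribute, starting from the given centers."""
    centers = list(centers)
    for _ in range(iterations):
        groups = assign(values, centers)
        # Move each center to the mean of its group; keep it if the group is empty.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:  # no center moved: the grouping is stable
            break
        centers = updated
    return centers, assign(values, centers)


# Hypothetical attribute with two visibly separate groups.
values = [1.0, 1.2, 0.8, 9.0, 9.5, 10.5]
centers, groups = kmeans_1d(values, centers=[0.0, 5.0])
print(centers, groups)
```

Results depend on the starting centers, which is one reason to rerun a clustering with different settings before trusting the groups.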

Algorithms for Data Mining Tasks (2 of 2)

Algorithm Name | Description

Microsoft Naïve Bayes | Finds the probability of the relationship between all input and predictable columns. This algorithm is useful for quickly generating mining models to discover relationships. Supports only discrete or discretized attributes. Treats all input attributes as independent.

Microsoft Logistic Regression | Analyzes the factors that contribute to an outcome, where the outcome is restricted to two values, usually the occurrence or non-occurrence of an event. Supports the prediction of both discrete and continuous attributes.

Microsoft Neural Network | Analyzes complex input data or business problems for which a significant quantity of training data is available but for which rules cannot be easily derived by using other algorithms. Can predict multiple attributes. Can be used to classify discrete attributes and regression of continuous attributes.

Microsoft Association Rules | Builds rules that describe which items are likely to appear together in a transaction.

Microsoft Sequence Clustering | Identifies clusters of similarly ordered events in a sequence. Provides a combination of sequence analysis and clustering.
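The independence assumption noted for Naïve Bayes is easiest to see in code: the score for each class is the class prior multiplied by one factor per input attribute. The sketch below is a generic Naïve Bayes for discrete attributes, not the Microsoft implementation. The add-one smoothing, the function names, and the weather-style sample rows are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, label) -> value counts
    values_seen = defaultdict(set)       # attribute index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen


def predict(model, row):
    """Return normalized class probabilities for one row of discrete attributes."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # class prior
        for i, value in enumerate(row):
            # Independence assumption: one factor per attribute, simply multiplied.
            # Add-one smoothing (an assumption here) avoids zero probabilities.
            seen = value_counts[(i, label)][value]
            score *= (seen + 1) / (count + len(values_seen[i]))
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}


# Hypothetical training data: (outlook, temperature) -> play?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]

model = train_naive_bayes(rows, labels)
probs = predict(model, ("rainy", "cool"))
print(probs)
```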

Best practice: Document your science

Describe the business problem

Determine how to measure success (including baseline)

Document what was learned during data preparation and analysis

Justify the algorithms used during the investigation

List the assumptions that were made
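Measuring success against a baseline can be sketched as follows. The baseline used here, always predicting the most common class seen in training, is one common choice and an assumption on our part. The labels and the model predictions are hypothetical.

```python
from collections import Counter


def accuracy(predicted, actual):
    """Share of predictions that match the actual labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def majority_baseline(train_labels, n):
    """Predict the most common training label for every one of n cases."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * n


# Hypothetical labels and model output, for illustration only.
train_labels = ["no", "no", "yes"]
actual = ["no", "no", "no", "yes"]
model_predictions = ["no", "no", "no", "yes"]

baseline_score = accuracy(majority_baseline(train_labels, len(actual)), actual)
model_score = accuracy(model_predictions, actual)
improvement = model_score - baseline_score
print(baseline_score, model_score, improvement)
```

Recording the baseline next to the model score keeps the documented result honest: a high accuracy means little if always guessing the majority class does nearly as well.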

Section Four: ACHIEVING BUSINESS VALUE

Leadership challenges

Build on organizational communications

Consider redoing analysis

Find results champions

Celebrate the results


Best practice: prepare the next cycle

Note strengths, weaknesses, opportunities, risks

Build consensus on model expiration dates

Encourage and improve the process

Create insight into new future data collection


Conclusion

Best Practices Framework

Provide a data mining foundation

Prepare the data

Evaluate machine learning output

Plan to move toward actionable decisions


Resources

Free Win x64 Python libs: http://www.lfd.uci.edu/~gohlke/pythonlibs/

Commercial Python: http://www.enthought.com/products/epd.php

R Tutorial: http://www.burns-stat.com/pages/Tutor/R_inferno.pdf

SQL Server Analysis Services Data Mining: http://technet.microsoft.com/en-us/sqlserver/cc510301.aspx

Data Mining Portal: http://marktab.net

Data Mining Team Portal: http://sqlserverdatamining.com

Books: “Data Mining with SQL Server 2008”, “Data Mining for Business Intelligence”, “Practical Time Series Forecasting”
