13 June 2013 | Virtual Business Analytics Chapter
A Best Practices Framework for Data Mining
Mark Tabladillo, Ph.D., Data Mining Scientist
Artus Krohn-Grimberghe, Ph.D., Consultant and Assistant Professor
About MarkTab
Training and Consulting with http://marktab.com
Data Mining Resources and Blog at http://marktab.net
Ph.D. – Industrial Engineering, Georgia Tech
Training and consulting internationally across many industries – SAS and Microsoft
Contributed to peer-reviewed research and legislation
◦ Mentoring doctoral dissertations at the accredited University of Phoenix
Presenter
About Artus
Assistant Professor for Analytic Information Systems and Business Intelligence
PhD in computer science
Research: data mining for e-commerce and mobile business
Consultant
Section One: DATA MINING FOUNDATION
Definition 1 (Informal): Data mining is the automated or semi-automated process of discovering patterns in data.
Definition 2: Data mining is a process using
1. Exploratory Data Analysis: statistical and visual data analysis techniques (forming a hypothesis).
2. Data Modeling & Predictions: describe data using probability distributions and machine learning algorithms (the “model”) (fitting a hypothesis).
3. Statistical Learning Theory: model selection, model evaluation.
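The three steps of Definition 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the spend and sales figures are made up, the straight-line model is one arbitrary choice of hypothesis, and the helper names (fit_line, mean_abs_error) are hypothetical rather than taken from the talk.

```python
import statistics

# Hypothetical toy data: advertising spend (input) and sales (target).
spend = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
sales = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]

# 1. Exploratory data analysis: summary statistics help form a hypothesis.
summary = {
    "mean_spend": statistics.mean(spend),
    "mean_sales": statistics.mean(sales),
    "stdev_sales": statistics.stdev(sales),
}

# 2. Data modeling: fit a straight line by ordinary least squares
#    on a training subset (the first six records).
def fit_line(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

slope, intercept = fit_line(spend[:6], sales[:6])

# 3. Model evaluation: mean absolute error on the two held-out records.
def mean_abs_error(xs, ys):
    return statistics.mean(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys))

holdout_error = mean_abs_error(spend[6:], sales[6:])
print(summary, slope, intercept, holdout_error)
```

Holding out records that the fit never saw is what lets step 3 say something about new data rather than about the training set.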
Data Mining Visualized
Target: attribute we are interested in.
Input: data available for our predictions.
Function f: describes the relationship between target and input. Regrettably, f is unknown and unknowable.
[Diagram: in the real world, Input passes through the unknown function f( ) to give Target; in the data mining model, Input passes through the hypothesis h( ) to give Target.]
Need to find a “good” h. h is your data mining “algorithm”.
Input data has to be appropriate: select and transform as needed.
Correct modeling of target is crucial
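One way to make the search for a “good” h concrete is to compare candidate hypotheses on records they were not fitted to. The sketch below assumes made-up samples and two arbitrary candidates (a constant and a straight line); the names h_constant, h_linear and squared_error are hypothetical.

```python
# The real-world f is never observed directly; only (input, target) samples are.
# Hypothetical samples, split into training and validation records.
train = [(1, 3.1), (2, 4.9), (3, 7.2), (4, 8.8), (5, 11.1)]
valid = [(6, 13.0), (7, 14.8)]

def h_constant(data):
    """Candidate hypothesis: always predict the training mean of the target."""
    mean_y = sum(y for _, y in data) / len(data)
    return lambda x: mean_y

def h_linear(data):
    """Candidate hypothesis: least-squares straight line."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    slope = sum((x - mx) * (y - my) for x, y in data) / sum((x - mx) ** 2 for x, _ in data)
    return lambda x: my + slope * (x - mx)

def squared_error(h, data):
    """Mean squared difference between predictions and observed targets."""
    return sum((h(x) - y) ** 2 for x, y in data) / len(data)

candidates = {"constant": h_constant(train), "linear": h_linear(train)}
scores = {name: squared_error(h, valid) for name, h in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores)
```

Neither candidate is f; the comparison only says which h is less wrong on the validation records at hand.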
Top 10 Expectations
BEST PRACTICE: LEARN FROM EXPERIENCE
Expectation Ten
Marketing: People can start data mining in 10 minutes…
More Scientific: Better models come from days, weeks or months of iterative improvement.
Expectation Nine
Marketing: Data miners can provide provably good models with little or zero knowledge of the specific industry…
More Scientific: Knowing the industry and organizational goals helps orient the questions, modeling, and analysis.
Expectation Eight
Marketing: Open source software can provide quality results worthy of peer-reviewed literature…
More Scientific: Commercial software with years-long service options is required for enterprise scale.
Expectation Seven
Marketing: We can learn a lot from the current data warehouses, cubes, and big data…
More Scientific: We can improve our modeling by creating new data collection strategies.
Expectation Six
Marketing: People can build data mining models with little or zero data cleaning…
More Scientific: Better results happen when we organize and rearrange data for best success.
Expectation Five
Marketing: Data mining can provide answers to problems…
More Scientific: Most times we only get detailed insights toward larger problems, and sometimes we uncover more problems than we started with.
Expectation Four
Marketing: A little data mining knowledge can provide an organization with a competitive edge…
More Scientific: The edge grows along with experience and better study of the methodology and mathematics.
Expectation Three
Marketing: Individual professionals can deliver excellent predictive analysis…
More Scientific: Small teams working together can help quickly and efficiently conquer some of the most difficult analytic challenges.
Expectation Two
Marketing: Numbers speak for themselves and can influence better decision making…
More Scientific: Leadership strategy helps teams deliver results in the best way given the current culture.
Expectation One
Marketing: A lot of data mining best practices and strategies can be communicated in an hour or a day…
More Scientific: The best commitment is ongoing education on both data mining and machine learning technology.
Section Two: ANALYZING AND PREPARING DATA
Best practice: study individual attributes
Histograms and frequencies (discrete)
Kernel density estimates
Cumulative distribution function
Rank-order plots and lift charts
Summary statistics (continuous)
Box-and-whisker plots
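Several of the single-attribute views listed above reduce to small computations, which the following sketch shows using only the Python standard library. The attribute names and values are invented for illustration, and the helpers ecdf and histogram are hypothetical; plotting is left out.

```python
import statistics
from collections import Counter

# Hypothetical attribute values for illustration.
region = ["east", "west", "east", "north", "east", "west"]      # discrete
order_value = [12.0, 15.5, 9.0, 30.0, 14.0, 11.5, 13.0, 95.0]   # continuous

# Frequencies for a discrete attribute.
frequencies = Counter(region)

# Summary statistics for a continuous attribute.
summary = {
    "mean": statistics.mean(order_value),
    "median": statistics.median(order_value),
    "stdev": statistics.stdev(order_value),
}

# Quartiles, the basis of a box-and-whisker plot.
q1, q2, q3 = statistics.quantiles(order_value, n=4)

# Empirical cumulative distribution function: share of values <= x.
def ecdf(values, x):
    return sum(v <= x for v in values) / len(values)

# Equal-width histogram counts.
def histogram(values, bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        index = min(int((v - lo) / width), bins - 1)  # put the maximum in the last bin
        counts[index] += 1
    return counts

print(frequencies, summary, (q1, q2, q3), ecdf(order_value, 15.5), histogram(order_value, 4))
```

In this made-up sample the single large order pulls the mean well above the median, which is the kind of skew a histogram or box plot makes visible at a glance.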
Best practice: study combinations
Pivot tables
Scatter plots
Logarithmic plots
Naïve Bayes
Correlation matrices
False-Color plots
Scatter-Plot matrix
Co-plot
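Two of the combination views, the pivot table and the correlation matrix, can be sketched without any plotting library. The records below are hypothetical, and pearson is a hand-written helper assumed for this example, not a function named in the talk.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical records: (region, product, units, revenue).
records = [
    ("east", "A", 10, 100.0),
    ("east", "B", 4, 60.0),
    ("west", "A", 7, 75.0),
    ("west", "B", 9, 130.0),
    ("east", "A", 3, 32.0),
]

# Pivot table: total units by region (rows) and product (columns).
pivot = defaultdict(lambda: defaultdict(int))
for region, product, units, _ in records:
    pivot[region][product] += units

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

# Correlation matrix over the numeric attributes.
columns = {
    "units": [r[2] for r in records],
    "revenue": [r[3] for r in records],
}
corr = {a: {b: pearson(columns[a], columns[b]) for b in columns} for a in columns}
print(dict(pivot["east"]), corr)
```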
Section Three: MACHINE LEARNING ALGORITHMS
How to Choose an Algorithm
Choosing an algorithm or series of algorithms is an art
One algorithm could perform different tasks
Be willing to experiment with algorithms and algorithm parameters
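Experimenting with algorithm parameters can be as simple as scoring the same algorithm under several settings. The sketch below assumes a tiny invented data set and uses a nearest-neighbor classifier purely as a stand-in; the choice of algorithm, the candidate values of k, and the function names are all assumptions for illustration.

```python
from collections import Counter

# Hypothetical labeled records: (input value, class label).
data = [(1.0, "low"), (1.5, "low"), (2.0, "low"), (2.4, "low"), (2.6, "high"),
        (3.0, "low"), (3.5, "high"), (4.0, "high"), (4.5, "high"), (5.0, "high")]

def knn_predict(train, x, k):
    """Majority label among the k training records nearest to x."""
    nearest = sorted(train, key=lambda record: abs(record[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def leave_one_out_accuracy(records, k):
    """Predict each record from all the others and report the hit rate."""
    hits = 0
    for i, (x, label) in enumerate(records):
        rest = records[:i] + records[i + 1:]
        hits += knn_predict(rest, x, k) == label
    return hits / len(records)

# Try several parameter values and keep the evidence for each.
results = {k: leave_one_out_accuracy(data, k) for k in (1, 3, 5)}
best_k = max(results, key=results.get)
print(results, best_k)
```

Keeping the full results dictionary, not just the winner, makes it easier to justify the final choice later.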
Algorithms for Data Mining Tasks (1 of 2)
Algorithm Name: Description
Microsoft Time Series: Analyzes time-related data by using a linear decision tree. Patterns can be used to predict future values in the time series.
Microsoft Decision Trees: Makes predictions based on the relationships between columns in the dataset, and models the relationships as a tree-like series of splits on specific values. Supports the prediction of both discrete and continuous attributes.
Microsoft Linear Regression: If there is a linear dependency between the target variable and the variables being examined, finds the most efficient relationship between the target and its inputs. Supports prediction of continuous attributes.
Microsoft Clustering: Identifies relationships in a dataset that you might not logically derive through casual observation. Uses iterative techniques to group records into clusters that contain similar characteristics.
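To illustrate what "iterative techniques to group records into clusters" can look like, here is a minimal one-dimensional k-means sketch. It is a generic textbook procedure written for this example, not the Microsoft Clustering implementation; the ages, the starting centers, and the function name k_means_1d are assumptions.

```python
def k_means_1d(values, centers, iterations=10):
    """Alternate between assigning values to the nearest center and re-centering."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center when a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical customer ages with two visible groups.
ages = [21, 23, 24, 22, 58, 61, 60, 63]
centers, clusters = k_means_1d(ages, centers=[20.0, 30.0])
print(centers, clusters)
```

The result depends on the starting centers and the number of clusters, which is one reason to experiment with parameters rather than trust a single run.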
Algorithms for Data Mining Tasks (2 of 2)
Algorithm Name: Description
Microsoft Naïve Bayes: Finds the probability of the relationship between all input and predictable columns. This algorithm is useful for quickly generating mining models to discover relationships. Supports only discrete or discretized attributes. Treats all input attributes as independent.
Microsoft Logistic Regression: Analyzes the factors that contribute to an outcome, where the outcome is restricted to two values, usually the occurrence or non-occurrence of an event. Supports the prediction of both discrete and continuous attributes.
Microsoft Neural Network: Analyzes complex input data or business problems for which a significant quantity of training data is available but for which rules cannot be easily derived by using other algorithms. Can predict multiple attributes. Can be used for classification of discrete attributes and regression of continuous attributes.
Microsoft Association Rules: Builds rules that describe which items are likely to appear together in a transaction.
Microsoft Sequence Clustering: Identifies clusters of similarly ordered events in a sequence. Provides a combination of sequence analysis and clustering.
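The idea behind rules about "which items are likely to appear together in a transaction" is commonly expressed with two measures, support and confidence. The sketch below computes both for one rule over invented baskets; it is a generic illustration, not the Microsoft Association Rules algorithm, and rule_metrics is a hypothetical helper.

```python
from itertools import combinations
from collections import Counter

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]

# How often each item and each unordered pair of items occurs.
item_counts = Counter(item for t in transactions for item in t)
pair_counts = Counter(
    pair for t in transactions for pair in combinations(sorted(t), 2)
)

def rule_metrics(antecedent, consequent):
    """Support and confidence for the rule antecedent -> consequent."""
    both = pair_counts[tuple(sorted((antecedent, consequent)))]
    support = both / len(transactions)           # share of baskets with both items
    confidence = both / item_counts[antecedent]  # share of antecedent baskets with the consequent
    return support, confidence

support, confidence = rule_metrics("bread", "milk")
print(support, confidence)
```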
Best practice: Document your science
Describe the business problem
Determine how to measure success (including baseline)
Document what was learned during data preparation and analysis
Justify the algorithms used during the investigation
List the assumptions that were made
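Measuring success against a baseline can be sketched as follows. The outcomes and predictions are invented, accuracy is only one possible success measure, and for brevity the majority class is taken from the same records being scored, whereas in practice it would come from the training data.

```python
from collections import Counter

# Hypothetical held-out outcomes and the predictions of some model under review.
actual    = ["stay", "stay", "churn", "stay", "churn", "stay", "stay", "churn", "stay", "stay"]
predicted = ["stay", "stay", "churn", "stay", "stay", "stay", "churn", "churn", "stay", "stay"]

def accuracy(truth, guesses):
    """Share of records where the guess matches the truth."""
    return sum(t == g for t, g in zip(truth, guesses)) / len(truth)

# Baseline: always predict the most frequent class (simplified, see the note above).
majority_class = Counter(actual).most_common(1)[0][0]
baseline_accuracy = accuracy(actual, [majority_class] * len(actual))
model_accuracy = accuracy(actual, predicted)

# Record the improvement over baseline alongside the raw score.
improvement = model_accuracy - baseline_accuracy
print(majority_class, baseline_accuracy, model_accuracy, improvement)
```

A model accuracy of 0.8 looks less impressive once the documentation shows that always guessing "stay" already reaches 0.7.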
Section Four: ACHIEVING BUSINESS VALUE
Leadership challenges
Build on organizational communications
Consider redoing analysis
Find results champions
Celebrate the results
Best practice: prepare the next cycle
Note strengths, weaknesses, opportunities, risks
Build consensus on model expiration dates
Encourage and improve the process
Create insight into new future data collection
Conclusion
Best Practices Framework
Provide a data mining foundation
Prepare the data
Evaluate machine learning output
Plan to move toward actionable decisions
Resources
http://www.lfd.uci.edu/~gohlke/pythonlibs/ Free Win x64 Python libs
http://www.enthought.com/products/epd.php Commercial Python
http://www.burns-stat.com/pages/Tutor/R_inferno.pdf R Tutorial
http://technet.microsoft.com/en-us/sqlserver/cc510301.aspx SQL Server Analysis Services Data Mining
http://marktab.net Data Mining Portal
http://sqlserverdatamining.com Data Mining Team Portal
Books: “Data Mining with SQL Server 2008”, “Data Mining for Business Intelligence”, “Practical Time Series Forecasting”