TRANSCRIPT
CRISP-DM
Agile Approach to Data Mining Projects
Michał Łopuszyński
Warsaw Data Science Meetup, 2016.06.07
About me
• I work at ICM UW
• Our group = Applied Data Analysis Lab
  • Supercomputing centre, weather forecast, virtual library, open science platform, visualization solutions, ...
• Involved in modelling and data analysis projects from cosmology, medicine, bioinformatics, quantum chemistry, biophysics, fluid dynamics, materials science, social network analysis, ...
• Automatic information extraction from PDFs
• Text-mining in scientific literature
• Variety of application projects (analysis of court judgments, aviation, deploying solutions on the big data stack Spark/Hadoop, trainings)
About me
adalab.icm.edu.pl
What is CRISP-DM?
• Cross Industry Standard Process for Data Mining
• Developed in 1996 by big players in data analysis: SPSS, Teradata, Daimler, OHRA, NCR
• I follow "CRISP-DM 1.0 Step-by-step data mining guide"
[Diagram: the CRISP-DM process cycle around DATA – Business Understanding → Data Understanding → Data Preparation → Modelling → Evaluation → Deployment]
• Most popular methodology for data-centric projects
  • See KDnuggets Polls (2007, 2014)
  • Runner-up: SEMMA
• I find it agile
  • Introduces almost no overhead
  • Emphasizes adaptive transitions between project phases
Business Understanding
• Determine business objectives
• Assess situation – resources (data!), risks, costs & benefits
• Determine data mining goals – ideally with quantitative success criteria
• Develop project plan – estimate timeline, budget, but also tools and techniques
[Diagram: the CRISP-DM process cycle]
Business Understanding
• Difficult!
• Often, you have to enter a new field
• You have to explain data science limitations to non-experts:
  • No, performance will not be 100%
  • We need much more data to train an accurate model
  • For tomorrow, it is impossible
Source: http://xkcd.com/1425
Business Understanding – my DOs and DON'Ts
• Have a lot of patience for vaguely defined problems
• Do not waste your time on ill-defined, unrealistic projects
• Learn to concretize or even reduce the scope of the initial idea:
  • Data sample
  • Real-life use cases
  • Quantitative success metrics
Data Understanding
• Collect initial data
• Describe data – persist results
• Explore data – persist results
• Verify data quality – carefully document problems and issues found!
[Diagram: the CRISP-DM process cycle]
Data Understanding – Validate Everything

<judgement id="...">
  <date>3013-12-04 00:00:00.0 CET</date>
  <publicationDate>2014-07-23 02:52:17.0 CEST</publicationDate>
  <courtId>15250000</courtId>
  <departmentId>503</departmentId>
  <chairman>Małgorzata ...</chairman>
  <judges>
    <judge>Małgorzata ...</judge>
  </judges>
  ...
</judgement>

<judgement id="...">
  <date>2012-10-01 00:00:00.0 CEST</date>
  <publicationDate>2014-12-31 18:15:05.0 CET</publicationDate>
  <courtId>15450500</courtId>
  <departmentId>6027</departmentId>
  <judges>
    <judge>Piotr ...</judge>
    <judge>wskazał</judge>
    <judge>czego wymaga art. 17a ust. 2 ustawy</judge>
    ...
  </judges>
</judgement>

• Note the anomalies: a judgement dated 3013, and <judge> entries containing sentence fragments instead of names
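Anomalies like these (an impossible year, non-name strings inside <judge>) can be caught with simple automated checks. A minimal sketch using only the Python standard library; the field names follow the XML above, while the sample data, bounds, and the name regex are illustrative assumptions:

```python
import re
import xml.etree.ElementTree as ET

# A tiny sample mimicking the court-judgement records above.
XML = """<judgements>
  <judgement id="j1">
    <date>3013-12-04 00:00:00.0 CET</date>
    <judges><judge>Piotr Kowalski</judge><judge>wskazał</judge></judges>
  </judgement>
</judgements>"""

def validate(judgement):
    """Return a list of human-readable problems found in one record."""
    problems = []
    # Sanity-check the year: court judgements cannot come from the future.
    year = int(judgement.findtext("date")[:4])
    if not 1900 <= year <= 2016:
        problems.append(f"implausible year: {year}")
    # A judge entry should look like capitalized name words, not a sentence.
    for judge in judgement.iter("judge"):
        if not re.fullmatch(r"([A-ZŁŚŻ][\w-]+\s*)+", judge.text):
            problems.append(f"suspicious judge entry: {judge.text!r}")
    return problems

root = ET.fromstring(XML)
for j in root:
    print(j.get("id"), validate(j))
```

Cheap rules like these will not catch everything, but they flag both anomalies shown on the slide.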
Data Understanding – Spot Anomalies
• Histogram of a certain smooth quantity measured using "precise equipment"
• Explanation – effect of the human interface between the precise equipment & the database
Data Understanding – Spot Anomalies
• Secondary school examination (Matura) score distribution from Poland
• Exploratory data analysis can reveal imperfections of the conducted experiment
Source: CKE Materials, Matura 2012
Data Understanding – my DOs and DON'Ts
• Do not trust data quality estimates provided by your customer
• Verify, as far as you can, that your data is correct, complete, coherent, deduplicated, representative, independent, up-to-date, stationary
• Understand anomalies and outliers
• Do not economize on this phase
  • The earlier you discover issues with your data, the better (yes, your data will have issues!)
  • Data understanding leads to domain understanding; it will pay off in the modelling phase
• Investigate what sort of processing was applied to the raw data
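The checklist above (correct, complete, deduplicated, up-to-date, ...) translates directly into code. A hypothetical stdlib-only sketch; the record layout, sample rows, and thresholds are all invented for illustration:

```python
from collections import Counter
from datetime import date

# Invented sample records; imagine rows loaded from a customer database.
records = [
    {"id": 1, "score": 78, "updated": date(2016, 1, 10)},
    {"id": 2, "score": None, "updated": date(2016, 2, 3)},   # incomplete
    {"id": 3, "score": 141, "updated": date(2016, 2, 4)},    # out of range
    {"id": 3, "score": 141, "updated": date(2016, 2, 4)},    # duplicate
    {"id": 4, "score": 55, "updated": date(2009, 5, 1)},     # stale
]

def quality_report(rows):
    """Run a few checks from the checklist and count violations."""
    report = Counter()
    seen = set()
    for r in rows:
        key = tuple(sorted(r.items()))
        if key in seen:
            report["duplicate"] += 1           # deduplicated?
            continue
        seen.add(key)
        if r["score"] is None:
            report["missing_score"] += 1       # complete?
        elif not 0 <= r["score"] <= 100:
            report["score_out_of_range"] += 1  # correct?
        if r["updated"].year < 2015:
            report["stale"] += 1               # up-to-date?
    return dict(report)

print(quality_report(records))
# → {'missing_score': 1, 'score_out_of_range': 1, 'duplicate': 1, 'stale': 1}
```

Persisting such a report for every incoming dataset is one way to "document problems and issues found", as the Data Understanding slide recommends.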
Data Preparation
• Select data
• Clean data
• Construct data – generate derived attributes
• Integrate data – merge information from different sources
• Format data – convert to format convenient for modelling
[Diagram: the CRISP-DM process cycle]
Data Preparation
• Tedious!
• Use workflow tools to document, automate & parallelize data prep.
  • Make, Drake
  • Oozie, Azkaban, Luigi, Airflow, ...
[Diagram: example workflow DAG with nodes such as metadata-jsonl, projects-from-iis-jsonl, projects-from-infspace-jsonl, metadata-extracted-jsonl, classification-jsonl, data-aux/class-riffle, data-aux/metad-riffle, data-aux/priis-json, data-aux/prinf-json, data-clean/joind-jsonl, stat/basic, stat/basic-fp7, stat/collab]
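What Make, Luigi, Airflow and friends share is a declared dependency graph of data-prep steps, executed in order and only where needed. A toy stdlib sketch of that core idea; the step names and graph are invented, and a real tool would add caching, scheduling, and parallelism:

```python
# Toy make-style runner: each step lists the steps it depends on,
# and run() executes every prerequisite exactly once, in order.
steps = {
    "metadata-jsonl": [],
    "projects-jsonl": [],
    "data-clean/joined-jsonl": ["metadata-jsonl", "projects-jsonl"],
    "stat/basic": ["data-clean/joined-jsonl"],
}

def run(target, done=None, log=None):
    """Depth-first: run dependencies first, skip what is already done."""
    done = set() if done is None else done
    log = [] if log is None else log
    if target in done:
        return log
    for dep in steps[target]:
        run(dep, done, log)
    log.append(target)  # a real workflow tool would execute the step here
    done.add(target)
    return log

print(run("stat/basic"))
# → ['metadata-jsonl', 'projects-jsonl', 'data-clean/joined-jsonl', 'stat/basic']
```

Even this toy version documents the pipeline and makes it reproducible, which is the point of the slide's recommendation.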
Data Preparation
• Data understanding and preparation will usually consume half or more of your project time!
[Chart: "What % of time in your data mining project(s) is spent on data cleaning and preparation?" – percentage of responses vs. percentage of time. Source: KDNuggets Poll 2003]
[Chart: time spent on different modeling steps. Source: M.A. Munson, A Study on the Importance of and Time Spent on Different Modeling Steps, ACM SIGKDD Explorations Newsletter 13, 65-71 (2011)]
Data Preparation – my DOs and DON'Ts
• Prepare your customer for the fact that data understanding and preparation take a considerable amount of time
• Automate this phase as far as possible
• Use workflow tools to help you with the above
• When merging multiple sources, track the provenance of your data
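Provenance tracking can be as simple as stamping each field with its origin while merging. A hypothetical stdlib sketch; the source names, fields, and merge policy (last writer wins) are invented for illustration:

```python
def merge_with_provenance(sources):
    """Merge records keyed by id, remembering which source supplied each field."""
    merged = {}
    for source_name, rows in sources.items():
        for row in rows:
            rec = merged.setdefault(row["id"], {"_provenance": {}})
            for field, value in row.items():
                if field == "id":
                    continue
                rec[field] = value
                rec["_provenance"][field] = source_name  # last writer wins
    return merged

# Two invented sources describing the same entities.
crm = [{"id": 1, "name": "ACME", "city": "Warsaw"}]
billing = [{"id": 1, "revenue": 120_000}]

out = merge_with_provenance({"crm": crm, "billing": billing})
print(out[1]["_provenance"])
# → {'name': 'crm', 'city': 'crm', 'revenue': 'billing'}
```

When a merged value later looks wrong, the provenance map tells you which upstream source to investigate.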
Modelling
• Select modelling technique – assumptions, measure of accuracy
• Generate test design
• Build model – feature eng., optimize model parameters
• Assess model – iterate the above
[Diagram: the CRISP-DM process cycle]
Modelling – Tooling Selection
• Where will your model be deployed?
• Do you need to distribute your computations? (avoid!)
• Should I use a general purpose language?
  • Breadth = performance, lots of general purpose libraries and tooling, easy creation of web services
• Should I use a data analysis language?
  • Depth = easy data manipulation, latest models and statistical techniques available
• Can I afford a prototype?
[Chart: languages plotted by breadth (quality of general purpose tooling) vs. depth (quality of data analysis tooling): C++, Java, C#, Scala, Clojure, F#, Python, R, Matlab, Mathematica]
Modelling – my DOs and DON'Ts
• Develop your model with deployment conditions in mind
• Allocate time for hyperparameter optimization
• Whenever possible, peek inside your model and consult it with a domain expert
• Assess feature importance
• Run your model on simulated data
• Be creative with your features (feature engineering)
  • Esp. from textual data or time series you can generate a lot of standard features
• Make conscious decisions about missing data (NAs) and outliers (regression!)
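"Allocate time for hyperparameter optimization" in its simplest form means a grid search scored on a held-out validation set. A toy stdlib sketch; the one-parameter threshold "model", the data, and the grid are invented, and a real project would reach for a library routine such as scikit-learn's GridSearchCV:

```python
# Toy setup: a one-parameter "model" that predicts the positive class
# iff x exceeds a threshold; the threshold is our hyperparameter.
validation = [(x, int(x > 40)) for x in range(3, 100, 11)]

def accuracy(threshold, data):
    """Fraction of points whose predicted class matches the label."""
    return sum(int(x > threshold) == y for x, y in data) / len(data)

# Grid search: score every candidate threshold on the validation set.
grid = range(0, 100, 5)
best = max(grid, key=lambda t: accuracy(t, validation))
print(best, accuracy(best, validation))  # → 40 1.0
```

The same loop generalizes to several hyperparameters (a Cartesian product of grids), which is exactly where budgeting time for the search starts to matter.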
Evaluation
• Evaluate results – business success criteria fulfilled?
• Review process
• Determine next steps – to deploy or not to deploy?
[Diagram: the CRISP-DM process cycle]
Evaluation – my DOs and DON'Ts
• Work with the performance criteria dictated by your customer's business model
• Assess not only performance, but also practical aspects related to deployment, for example:
  • Training and prediction speed
  • Robustness and maintainability (tooling, dependence on other subsystems, library vs. homegrown code)
• Watch out for data leakage, for example:
  • Time series – mixing past and future
  • Meaningful identifiers
  • Other nasty ways of artificially introducing extra information not available in production
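The time-series leak above comes from splitting randomly, which lets the model "train on the future". A minimal sketch of the safe alternative, a chronological split; the series and the split fraction are invented:

```python
import random

# An invented time-ordered series of (timestamp, value) observations.
series = [(t, t * 2) for t in range(100)]

def chronological_split(data, train_frac=0.8):
    """Split so that every training point precedes every test point."""
    cut = int(len(data) * train_frac)
    return data[:cut], data[cut:]

train, test = chronological_split(series)
assert max(t for t, _ in train) < min(t for t, _ in test)  # no future leaks in

# The tempting-but-wrong version: a random shuffle mixes past and future,
# so the model is evaluated on points surrounded by its own training data.
shuffled = random.sample(series, len(series))
leaky_train, leaky_test = shuffled[:80], shuffled[80:]
```

A model evaluated on the leaky split will usually look better than it will ever perform in production, which is precisely the failure mode this slide warns about.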
Deployment
• Plan deployment
• Plan monitoring and maintenance
• Produce final report
• Review project – collect lessons learned!
[Diagram: the CRISP-DM process cycle]
Deployment – my DOs and DON'Ts
• Read this paper for excellent insights!
Thank you!
Questions?
@lopusz