Lluis Belanche + Alfredo Vellido
Intelligent Data Analysis and Data Mining or …
Data Analysis and Knowledge Discovery a.k.a. Data Mining II
DATA MINING as a methodology (from previous session …)
CRISP: a DM methodology
• CRoss-Industry Standard Process for Data Mining: a methodology that is neutral with respect to industry, tool and application (free & non-proprietary)
• Authors: Pete Chapman, Randy Kerber (NCR); Julian Clinton, Thomas Khabaza, Colin Shearer (SPSS); Thomas Reinartz, Rüdiger Wirth (DaimlerChrysler)
• CRISP-DM was conceived in 1996
• DaimlerChrysler: leaders in industrial application; SPSS: leaders in product development (Clementine, 1994); NCR: owners of large (huge!) databases (Teradata)
• Financed by the EU. Version 1.0 officially released in 1999
IDADM
CRISP: Hierarchic structure of the methodology
CRISP: The virtuous loop of methodology phases
CRISP: Phases: Problem understanding
Tasks and their outputs:
• DETERMINE PROBLEM GOAL: BACKGROUND; PROBLEM GOALS; SUCCESS CRITERIA
• ASSESS SITUATION: INVENTORY OF RESOURCES; REQUIREMENTS, ASSUMPTIONS, LIMITATIONS; RISKS, CONTINGENCIES; TERMINOLOGY; COSTS & BENEFITS
• DETERMINE DM GOALS: DM GOALS; DM SUCCESS CRITERIA
• PRODUCE PROJECT PLAN: PROJECT PLAN; INITIAL SELECTION OF TOOLS
Phases: PROBLEM UNDERSTANDING ► DATA UNDERSTANDING ► DATA PREPARATION ► MODELLING ► EVALUATION ► IMPLEMENTATION
DM application areas (’10‐>’11)
end of last session wrap‐up
CRISP: Phases: Data understanding
Tasks and their outputs:
• OBTAIN INITIAL DATA: INITIAL DATA REPORT
• DESCRIBE DATA: DATA DESCRIPTION REPORT
• EXPLORE DATA: DATA EXPLORATION REPORT
• VERIFY DATA QUALITY: DATA QUALITY REPORT
METROFANG: a real story about data understanding (1)
http://www.secadolodos.com/73027_es/METROFANG‐(Barcelona‐Espa%25C3%25B1a)/
METROFANG: a real story about data understanding (2)
[Plot: inlet flow (caudal entrada); x-axis: 1 to 17671; y-axis: 0 to 350]
[Plot: motor torque, Dryer A (Par motor Secador A); x-axis: 1 to 17671; y-axis: 0 to 140]
Missing data
Stationarity
Outliers
Time Series
Weekend?
FORUM???
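The issues annotated on the plots (missing data, outliers, a possible weekend effect) can also be checked numerically. The following is a minimal Python sketch, not the analysis used in the METROFANG project: the readings are hypothetical toy values, and the function names and the 3.5 threshold on the modified z-score are illustrative assumptions.

```python
from datetime import date, timedelta
from statistics import median


def missing_ratio(values):
    """Fraction of readings that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def mad_outliers(values, threshold=3.5):
    """Indices whose modified z-score, based on the median absolute
    deviation (MAD), exceeds the threshold."""
    present = [v for v in values if v is not None]
    med = median(present)
    mad = median([abs(v - med) for v in present])
    if mad == 0:
        return []  # no spread: this rule cannot rank anything
    return [i for i, v in enumerate(values)
            if v is not None and 0.6745 * abs(v - med) / mad > threshold]


def weekday_weekend_gap(days, values):
    """Mean weekday reading minus mean weekend reading, skipping gaps."""
    week = [v for d, v in zip(days, values)
            if v is not None and d.weekday() < 5]
    weekend = [v for d, v in zip(days, values)
               if v is not None and d.weekday() >= 5]
    return sum(week) / len(week) - sum(weekend) / len(weekend)


# Hypothetical daily readings starting on Monday 2024-01-01 (toy data).
days = [date(2024, 1, 1) + timedelta(days=i) for i in range(14)]
values = [100, 104, None, 97, 102, 91, 88, 101, 96, 900, 103, 99, 90, 89]

outliers = mad_outliers(values)
# Drop flagged readings before comparing weekdays with weekends.
cleaned = [None if i in outliers else v for i, v in enumerate(values)]
gap = weekday_weekend_gap(days, cleaned)
print(missing_ratio(values), outliers, gap)
```

The median-based rule is used here because a single extreme reading inflates the mean and standard deviation, which can hide the very outlier being looked for.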
Storing data (’07)
CRISP: Phases: Data preparation
Tasks and their outputs:
• DATA SELECTION: ARGUMENTS FOR SELECTION
• DATA CLEANING: DATA CLEANING REPORT
• CONSTRUCT DATA: DERIVED VARIABLES; GENERATED OBSERVATIONS
• INTEGRATE DATA: INTEGRATED DATA
• DATA FORMATTING: DATA WITH NEW FORMAT
Is data preparation that important?
Common data types analyzed …(’07)
Compared to the 2005 KDnuggets Poll on “Types of data you analyzed/mined in last 12 months”, the biggest increase was in anonymized data (perhaps an indicator of the increasing importance of privacy issues).
Common data types analyzed …(’09)
Compared to the 2005 KDnuggets Poll on “Types of data you analyzed/mined in last 12 months”, the biggest increase was in anonymized data (perhaps an indicator of the increasing importance of privacy issues).
Comparing with 2008, the top 5 categories are unchanged.
Common data types analyzed …(‐>’12)
How large is it? … (’06 ‐> ‘09)
How large is it? … (’09 ‐> ‘13)
The “Big Data” Challenge
How large is it? … (’09 ‐> ‘13)
Some fun facts:
• Google processes over 20 PB worth of data every day.
• Back in December 2007, YouTube generated 27 PB of traffic.
• The CERN Large Hadron Collider (LHC) generates about 20 PB of usable data per year.
• The volume of global annual data traffic is expected to exceed 60,000 PB in 2016, up from 8,000 PB in 2011.
• In the next decade, astronomers expect to be processing 10 PB of data every hour from the Square Kilometre Array (SKA) telescope ► one exabyte every four days.
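The figures in this list can be cross-checked with a few lines of arithmetic. This is a rough Python sketch that assumes decimal (SI) prefixes, 1 EB = 1000 PB; the function and variable names are illustrative.

```python
PB_PER_EB = 1000  # assumption: decimal (SI) prefixes


def days_to_accumulate(target_pb, rate_pb_per_hour):
    """Days needed to accumulate target_pb at a constant hourly rate."""
    return target_pb / (rate_pb_per_hour * 24)


# SKA: 10 PB per hour is 240 PB per day, so 1 EB takes a little over 4 days.
ska_days = days_to_accumulate(PB_PER_EB, 10)

# Global traffic: 8,000 PB (2011) to 60,000 PB (2016) is a 7.5-fold increase,
# which corresponds to roughly 50% compound growth per year over 5 years.
growth_factor = 60_000 / 8_000
annual_growth = growth_factor ** (1 / 5)

print(round(ska_days, 2), growth_factor, round(annual_growth, 2))
```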
Data manipulation tools …(’08)
Data “manipulation” tools …(‐>’12)
Data “manipulation” tools …(‐>’13)
CRISP: Phases: Modelling
Tasks and their outputs:
• SELECT MODELLING TECHNIQUE: SELECTED TECHNIQUE
• CREATE TEST DESIGN: TEST DESIGN
• BUILD MODEL: PARAMETER SELECTION; MODEL; MODEL DESCRIPTION
• VALIDATE MODEL: MODEL VALIDATION
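The modelling tasks (test design, model building, validation) can be illustrated with a small example. This is a minimal Python sketch under stated assumptions: CRISP does not prescribe any particular technique, so the hold-out split, the 1-nearest-neighbour classifier and the toy data are all illustrative choices.

```python
import random


def holdout_split(rows, test_fraction=0.3, seed=0):
    """Test design: shuffle once, then reserve a fraction for testing."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def predict_1nn(train, x):
    """Model: label of the training row whose feature is closest to x."""
    nearest = min(train, key=lambda row: abs(row[0] - x))
    return nearest[1]


def accuracy(train, test):
    """Validation: share of held-out rows that are predicted correctly."""
    hits = sum(predict_1nn(train, x) == y for x, y in test)
    return hits / len(test)


# Hypothetical one-feature data set with two well separated classes.
rows = ([(float(i), "low") for i in range(10)]
        + [(float(i), "high") for i in range(20, 30)])

train, test = holdout_split(rows)
score = accuracy(train, test)
print(len(train), len(test), score)
```

Fixing the random seed keeps the split reproducible, so that the validation result can be reported alongside the test design that produced it.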
CRISP: A typology of DM problems
(for each PROBLEM: DESCRIPTION, EXAMPLES, TECHNIQUES)

DATA SUMMARY and DESCRIPTION
• Description: Compact and aggregated data description. Exploratory analysis
• Examples: Almost any problem includes some elements of data description
• Techniques: ERPs, statistics, OLAP, EIS, control dashboards

SEGMENTATION
• Description: Finding data groups (unsupervised). Segmentation / clustering / classification
• Examples: Market segmentation, shopping basket analysis
• Techniques: Clustering, NNs (SOM, GTM), visualization

CONCEPTUAL DESCRIPTION
• Description: Accessible and useful description of concepts / classes / groups. Knowledge comes first, then precision. Linked to classification / segmentation
• Examples: Description of customer groups according to loyalty. Rule-based segment profiling: if SEX=male and age>45 then CUST=loyal
• Techniques: Rule Induction, Conceptual Clustering

CLASSIFICATION
• Description: Assumes that different items can be assigned to one of a closed set of categories (supervised)
• Examples: Bankruptcy prediction, credit scoring
• Techniques: Discriminant Analysis, Rule Induction, Decision Trees, NNs, Case-Based Reasoning, GAs

PREDICTION (REGRESSION, FORECASTING)
• Description: Continuous dependent variable. Given the values of the predictive variables, predict its value (supervised)
• Examples: Markets, company profit prediction, market share forecasting
• Techniques: Regression Analysis, Regression Trees, NNs, Box-Jenkins, GAs

DEPENDENCY ANALYSIS
• Description: Looking for dependencies between variables (supervised or unsupervised). Often combined with segmentation
• Examples: Basket analysis. Ex.: 30% of those who bought peanuts also bought beer …
• Techniques: Correlation Analysis, Association Rules, Bayesian Networks, Inductive Logic Programming
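A statement such as "30% of those who bought peanuts also bought beer" is the confidence of the association rule peanuts ► beer. The following is a minimal Python sketch of how support and confidence are computed; the baskets are hypothetical toy data chosen so that the rule comes out at 30%, and the function names are illustrative.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Among baskets containing the antecedent, the fraction that also
    contain the consequent."""
    base = support(baskets, antecedent)
    if base == 0:
        return 0.0  # the antecedent never occurs
    joint = support(baskets, set(antecedent) | set(consequent))
    return joint / base


# Hypothetical shopping baskets (toy data): 20 in total.
baskets = (
    [{"peanuts", "beer"}] * 3
    + [{"peanuts", "bread"}] * 7
    + [{"beer", "chips"}] * 5
    + [{"milk"}] * 5
)

rule_support = support(baskets, {"peanuts", "beer"})
rule_confidence = confidence(baskets, {"peanuts"}, {"beer"})
print(rule_support, rule_confidence)
```

Support tells how often the whole rule applies across all baskets, while confidence only looks at the baskets that already contain the antecedent, so a rule can have high confidence and still be rare.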
CRISP: Selection of techniques
UNIVERSE OF TECHNIQUES (Defined by tools)
► TECHNIQUES SUITED TO A PROBLEM
► POLITICAL REQUIREMENTS (Business, executive)
► LIMITATIONS (Data types, knowledge; money, time, human resources)
► SELECTED TOOL(S)