data mining iiavellido/teaching/13-14/... · market segmentation, shopping basket analysis...

Lluis Belanche + Alfredo Vellido Intelligent Data Analysis and Data Mining or … Data Analysis and Knowledge Discovery a.k.a. Data Mining II


Lluis Belanche + Alfredo Vellido

Intelligent Data Analysis and Data Mining
or …
Data Analysis and Knowledge Discovery
a.k.a. Data Mining II


DATA MINING as a methodology (from previous session …) 


CRISP: a DM methodology

• CRoss-Industry Standard Process for Data Mining: a methodology that is neutral with respect to industry, tool and application (free & non-proprietary)
• Authors: Pete Chapman, Randy Kerber (NCR); Julian Clinton, Thomas Khabaza, Colin Shearer (SPSS); Thomas Reinartz, Rüdiger Wirth (DaimlerChrysler)
• CRISP-DM was conceived in 1996
• DaimlerChrysler: leaders in industrial application; SPSS: leaders in product development (Clementine, 1994); NCR: owners of large (huge!) databases (Teradata)
• Financed by the EU. Version 1.0 released officially in 1999

IDADM


CRISP: Hierarchic structure of the methodology


CRISP: The virtuous loop of methodology phases


CRISP: Phases: Problem understanding

[Diagram: the CRISP phases in sequence: PROBLEM UNDERSTANDING, DATA UNDERSTANDING, DATA PREPARATION, MODELLING, EVALUATION, IMPLEMENTATION]

Tasks of this phase and their outputs:
• DETERMINE PROBLEM GOAL: background; problem goals; success criteria
• ASSESS SITUATION: inventory of resources; requirements, assumptions and limitations; risks and contingencies; terminology; costs & benefits
• DETERMINE DM GOALS: DM goals; DM success criteria
• PRODUCE PROJECT PLAN: project plan; initial selection of tools


DM application areas (’10 -> ’11)


end of last session wrap‐up


CRISP: Phases: Data understanding

Tasks of this phase and their outputs:
• OBTAIN INITIAL DATA: initial data report
• DATA DESCRIPTION: data descriptive report
• DATA EXPLORATION: data exploration report
• DATA QUALITY VERIFICATION: data quality report


METROFANG: a real story about data understanding (1)

http://www.secadolodos.com/73027_es/METROFANG‐(Barcelona‐Espa%25C3%25B1a)/


METROFANG: a real story about data understanding (2)

[Figure: two time-series plots over roughly 17,700 samples: "caudal entrada" (inlet flow, vertical axis 0 to 350) and "Par motor Secador A" (motor torque, dryer A, vertical axis 0 to 140)]

Annotations on the plots:
• Missing data
• Stationarity
• Outliers
• Time series
• Weekend?
• FORUM???
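The issues flagged on these plots (missing data, outliers, a possible weekend effect) can be screened for with very little code. What follows is a minimal sketch using only the Python standard library. The readings are invented for illustration and are not METROFANG data; the `None` encoding for missing values, the function name `screen_series` and the 3.5 cut-off on the modified z-score are all assumptions.

```python
from statistics import median

def screen_series(values, weekdays, threshold=3.5):
    """Basic data-understanding checks for one sensor series.

    values   : readings in time order; None marks a missing value (assumed encoding)
    weekdays : day of week for each reading, 0 = Monday ... 6 = Sunday
    """
    present = [(v, d) for v, d in zip(values, weekdays) if v is not None]
    n_missing = len(values) - len(present)

    # Robust outlier rule: modified z-score built on the median absolute deviation
    observed = [v for v, _ in present]
    med = median(observed)
    mad = median([abs(v - med) for v in observed])
    outliers = [v for v in observed
                if mad > 0 and 0.6745 * abs(v - med) / mad > threshold]

    # Possible weekend effect: compare typical weekday and weekend levels
    weekday_vals = [v for v, d in present if d < 5]
    weekend_vals = [v for v, d in present if d >= 5]
    return {
        "n_missing": n_missing,
        "outliers": outliers,
        "weekday_median": median(weekday_vals),
        "weekend_median": median(weekend_vals),
    }

# Two invented weeks of daily readings: one gap, one spike, lower weekend values
values = [100, 102, 98, 101, 99, 92, 91, 103, None, 97, 340, 100, 93, 90]
weekdays = [0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6]
report = screen_series(values, weekdays)
print(report)
```

Medians are used instead of means so that a single spike does not distort either the outlier rule or the weekday/weekend comparison.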


Storing data (’07)


CRISP: Phases: Data preparation

Tasks of this phase and their outputs:
• DATA SELECTION: arguments for selection
• DATA CLEANING: data cleaning report
• RECONSTRUCT DATA: derived variables; generated observations
• INTEGRATE DATA: integrated data
• DATA FORMATTING: data with new format
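As a rough illustration of how the data preparation tasks fit together, here is a minimal sketch in standard-library Python. The two toy tables, their field names (`customer_id`, `sex`, `age`, `phone`, `amount`) and the function name `prepare` are hypothetical; a real project would document each decision in the corresponding report.

```python
def prepare(customers, purchases):
    """Select, clean, construct, integrate and format toy customer records."""
    # Integration (step 1): aggregate the second table per customer
    spend = {}
    for p in purchases:
        spend[p["customer_id"]] = spend.get(p["customer_id"], 0.0) + p["amount"]

    prepared = []
    for c in customers:
        # Cleaning: drop records whose age is missing
        if c.get("age") is None:
            continue
        prepared.append({
            # Selection: only the fields needed for the analysis are kept
            "customer_id": c["customer_id"],
            # Formatting: normalise the category label
            "sex": c["sex"].strip().lower(),
            "age": c["age"],
            # Construction: a derived variable
            "over_45": c["age"] > 45,
            # Integration (step 2): attach the aggregated spend
            "total_spend": round(spend.get(c["customer_id"], 0.0), 2),
        })
    return prepared

customers = [
    {"customer_id": 1, "sex": " Male", "age": 52, "phone": "n/a"},
    {"customer_id": 2, "sex": "female", "age": None, "phone": "n/a"},
    {"customer_id": 3, "sex": "FEMALE ", "age": 31, "phone": "n/a"},
]
purchases = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 5.5},
    {"customer_id": 3, "amount": 12.0},
]
rows = prepare(customers, purchases)
print(rows)
```

Dropping incomplete records is only one possible cleaning policy; imputing the missing value is an equally valid choice, depending on the problem.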


Is data preparation that important?


Common data types analyzed …(’07)

Compared to the 2005 KDnuggets Poll on “Types of data you analyzed/mined in last 12 months”, the biggest increase was in anonymized data (perhaps an indicator of the increasing importance of privacy issues).


Common data types analyzed …(’09)

Compared to the 2005 KDnuggets Poll on “Types of data you analyzed/mined in last 12 months”, the biggest increase was in anonymized data (perhaps an indicator of the increasing importance of privacy issues).

Comparing with 2008, the top 5 categories are unchanged.


Common data types analyzed … (-> ’12)


How large is it? … (’06 -> ’09)


How large is it? … (’09 -> ’13)

The “Big Data” Challenge


How large is it? … (’09 -> ’13)

Some fun facts:
• Google processes over 20 PB worth of data every day.
• Back in December 2007, YouTube generated 27 PB of traffic.
• The CERN Large Hadron Collider (LHC) generates about 20 PB of usable data per year.
• The volume of global annual data traffic is expected to exceed 60,000 PB in 2016, up from 8,000 PB in 2011.
• In the next decade, astronomers expect to be processing 10 PB of data every hour from the Square Kilometre Array (SKA) telescope ► one exabyte every four days.
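Two of these figures can be checked with simple arithmetic. The sketch below assumes decimal prefixes (1 EB = 1000 PB) and treats the traffic figures as the endpoints of a five-year compound growth; the variable names are illustrative.

```python
PB_PER_EB = 1000  # assumption: decimal prefixes, 1 exabyte = 1000 petabytes

# SKA: 10 PB per hour, accumulated over four days
ska_pb_per_hour = 10
ska_pb_four_days = ska_pb_per_hour * 24 * 4        # petabytes in 96 hours
ska_eb_four_days = ska_pb_four_days / PB_PER_EB    # the same volume in exabytes

# Global annual traffic: 8,000 PB in 2011, 60,000 PB expected in 2016
growth_factor = 60_000 / 8_000                     # overall growth over five years
annual_growth = growth_factor ** (1 / 5) - 1       # implied compound annual growth rate

print(f"SKA over four days: {ska_pb_four_days} PB = {ska_eb_four_days} EB")
print(f"Implied annual traffic growth: {annual_growth:.1%}")
```

The SKA volume comes to 960 PB, which is where the "one exabyte every four days" approximation comes from, and the traffic forecast implies growth of close to 50% per year.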


Data manipulation tools …(’08)


Data “manipulation” tools …(‐>’12)


Data “manipulation” tools … (-> ’13)


CRISP: Phases: Modelling

Tasks of this phase and their outputs:
• SELECT MODELING TECHNIQUE: selected technique
• CREATE TEST DESIGN: test design
• BUILD MODEL: parameter selection; model; model description
• VALIDATE MODEL: model validation
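The sequence test design, model building and validation can be sketched end to end on toy data. The example below is a minimal illustration in standard-library Python, not a recommended technique: the nearest-centroid classifier, the 30% hold-out fraction, the fixed seed and the invented one-feature data set are all assumptions.

```python
import random
from statistics import mean

def split(records, test_fraction=0.3, seed=0):
    """Test design: shuffle once with a fixed seed, then hold out a test set."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def build_model(train):
    """Build model: the fitted parameters are the per-class means (centroids)."""
    by_class = {}
    for x, label in train:
        by_class.setdefault(label, []).append(x)
    return {label: mean(xs) for label, xs in by_class.items()}

def predict(model, x):
    """Assign x to the class whose centroid is closest."""
    return min(model, key=lambda label: abs(x - model[label]))

def validate(model, test):
    """Validate model: accuracy on records that were not used for fitting."""
    hits = sum(predict(model, x) == label for x, label in test)
    return hits / len(test)

# Invented data: one numeric feature, two well-separated classes
records = ([(x, "low") for x in (1.0, 1.2, 0.8, 1.1, 0.9)]
           + [(x, "high") for x in (5.0, 5.3, 4.8, 5.1, 4.9)])
train, test = split(records)
model = build_model(train)
accuracy = validate(model, test)
print(model, accuracy)
```

Fixing the split before any model is built is the point of the test design step: the validation figure is only meaningful if the held-out records played no part in fitting.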


CRISP: A typology of DM problems

Each problem type is listed with its description, examples and techniques.

DATA SUMMARY and DESCRIPTION
• Description: Compact and aggregated data description. Exploratory analysis.
• Examples: Almost any problem includes some elements of data description.
• Techniques: ERPs, statistics, OLAP, EIS, control dashboards.

SEGMENTATION
• Description: Finding data groups (unsupervised). Segmentation / clustering / classification.
• Examples: Market segmentation, shopping basket analysis.
• Techniques: Clustering, NNs (SOM, GTM), visualization.

CONCEPTUAL DESCRIPTION
• Description: Accessible and useful description of concepts / classes / groups. Knowledge comes first, then precision. Linked to classification / segmentation.
• Examples: Description of customer groups according to loyalty. Rule-based segment profiling: if SEX=male and age>45 then CUST=loyal.
• Techniques: Rule induction, conceptual clustering.

CLASSIFICATION
• Description: Assumes that different items can be assigned to one of a closed set of categories (supervised).
• Examples: Bankruptcy prediction, credit scoring.
• Techniques: Discriminant analysis, rule induction, decision trees, NNs, case-based reasoning, GAs.

PREDICTION (REGRESSION, FORECASTING)
• Description: Continuous dependent variable. Given the values of the predictive variables, predict its value (supervised).
• Examples: Markets, company benefit prediction, market share forecasting.
• Techniques: Regression analysis, regression trees, NNs, Box-Jenkins, GAs.

DEPENDENCY ANALYSIS
• Description: Looking for dependencies between variables (supervised or unsupervised). Often combined with segmentation.
• Examples: Basket analysis. Ex.: 30% of those who bought peanuts also bought beer …
• Techniques: Correlation analysis, association rules, Bayesian networks, inductive logic programming.
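The basket analysis example ("30% of those who bought peanuts also bought beer") is a statement about the confidence of the association rule peanuts -> beer. The following minimal sketch shows how support and confidence are computed for such a rule; the twenty baskets are invented so that the confidence comes out at 30%, and the function name `rule_stats` is illustrative.

```python
def rule_stats(baskets, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent.

    support    = share of all baskets containing both items
    confidence = share of baskets with the antecedent that also hold the consequent
    """
    n_antecedent = sum(1 for b in baskets if antecedent in b)
    n_both = sum(1 for b in baskets if antecedent in b and consequent in b)
    support = n_both / len(baskets)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Invented toy data: 10 baskets with peanuts (3 of them also with beer), 10 without
baskets = (
    [{"peanuts", "beer"}] * 3
    + [{"peanuts"}] * 7
    + [{"milk"}] * 6
    + [{"beer"}] * 4
)
support, confidence = rule_stats(baskets, "peanuts", "beer")
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

Confidence alone can mislead: beer appears in 7 of the 20 toy baskets (35%), so here buying peanuts does not make beer more likely than it already is. Association rule miners usually report both measures, often together with lift.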


CRISP: Selection of techniques

UNIVERSE OF TECHNIQUES
TECHNIQUES SUITED TO A PROBLEM
POLITICAL REQUIREMENTS (business, executive)
LIMITATIONS: data types, knowledge
SELECTED TOOL(S)
Money, time, human resources
(Defined by tools)
