DMDW Lesson 05 + 06 + 07 - Data Mining Applied
State-recognized university of applied sciences (Staatlich anerkannte Fachhochschule)
Author I: Dipl.-Inf. (FH) Johannes Hoppe
Author II: M.Sc. Johannes Hofmeister
Author III: Prof. Dr. Dieter Homeister
Dates: 01.04.2011, 08.04.2011, 15.04.2011
Study and take off. (Studieren und durchstarten.)
Data Mining Applied
01 Applications of Data Mining
› Database Marketing
› Time-series prediction, detecting "trends"
› Detection (of whatever is detectable)
› Probability Estimation
› Information Compression
› Sensitivity Analysis
Database Marketing (1/2)
Response modeling
› Model for the response of specific customers.
› Systematic selection of (existing and potential) customers.
› Advertisements and promotions based on these results (see CRM).
› Visualization: a "lift chart" shows how successful the selection should be (later topic: DM validation).
Lift Chart Example
“For contacting 10% of customers, using no model we should get 10% of responders and using the given model we should get 30% of responders.”
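Read as a lift value, the quoted numbers give 30% / 10% = 3 at a contact depth of 10%. The following Python lines are a minimal sketch of that calculation. The customer scores and response flags are made-up illustration data, and `captured_share` and `lift` are hypothetical helper names, not SSAS functions.

```python
def captured_share(scored_customers, contact_fraction):
    """Share of all responders found among the top-scored customers.

    scored_customers: list of (model_score, responded) pairs.
    contact_fraction: fraction of customers contacted, e.g. 0.2 for 20%.
    """
    ranked = sorted(scored_customers, key=lambda pair: pair[0], reverse=True)
    n_contacted = int(round(len(ranked) * contact_fraction))
    responders_total = sum(1 for _, responded in ranked if responded)
    responders_reached = sum(1 for _, responded in ranked[:n_contacted] if responded)
    return responders_reached / responders_total


def lift(scored_customers, contact_fraction):
    """Lift = share of responders reached / share of customers contacted."""
    return captured_share(scored_customers, contact_fraction) / contact_fraction


# Hypothetical data: 10 customers, 4 of them responded.
customers = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True), (0.60, False),
    (0.50, False), (0.40, True), (0.30, False), (0.20, False), (0.10, False),
]

print(captured_share(customers, 0.2))  # share of responders in the top 20%
print(lift(customers, 0.2))            # values above 1 beat random selection
```

A lift of 1 corresponds to the "no model" line of the chart.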
Database Marketing (2/2)
Cross selling: Selling additional products to existing customers
› Question: Which customer might buy which other product?
› Uses historical purchase data
› Uses credit card information, lifestyle data, demographic data, etc.
› Other possible information: Did the customer request special information? How did the customer hear of the company?
› Results for direct marketing, mailing lists, direct advertising (Amazon)
› Amazon: "Customers who bought this item also bought" and "personalized recommendations"
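A very simple way to approximate "customers who bought this item also bought" is to count which items share a basket with the item in question. This is only a sketch with hypothetical purchase data and a hypothetical function name `also_bought`. It does not describe how Amazon actually computes recommendations.

```python
from collections import Counter


def also_bought(baskets, item, top_n=2):
    """Return the items that most often appear in the same basket as `item`."""
    counts = Counter()
    for basket in baskets:
        if item in basket:
            # Count every other item of this basket once.
            counts.update(set(basket) - {item})
    return [name for name, _ in counts.most_common(top_n)]


# Hypothetical historical purchase data, one list per customer basket.
baskets = [
    ["camera", "sd_card", "tripod"],
    ["camera", "sd_card"],
    ["camera", "sd_card", "bag"],
    ["camera", "tripod"],
    ["book", "pen"],
]

print(also_bought(baskets, "camera"))
```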
Time-series prediction
› Time series: stock prices, market shares, …
› Extrapolation of future values
› Detection of newly arising trends, such as customers moving to other products
› Own experience: German print magazines
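Extrapolation can be illustrated with the simplest possible model, a straight line fitted by least squares. This is a stand-in for illustration only, not the MS Time Series algorithm. The monthly figures and the function names `fit_line` and `extrapolate` are assumptions.

```python
def fit_line(values):
    """Least-squares line through (0, values[0]), (1, values[1]), ..."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def extrapolate(values, steps_ahead):
    """Predict the value `steps_ahead` periods after the last observation."""
    slope, intercept = fit_line(values)
    return intercept + slope * (len(values) - 1 + steps_ahead)


# Hypothetical monthly sales figures with a steady upward trend.
monthly_sales = [100, 110, 120, 130]

print(fit_line(monthly_sales))        # slope and intercept of the trend
print(extrapolate(monthly_sales, 1))  # forecast for the next month
```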
Detection
Identification of the existence or occurrence of a condition
Fraud detection:
› Identifying patterns/criteria to detect credit card fraud
› Estimating creditworthiness (e.g. the German Schufa)
› Prediction of mail orders that will not be paid
Intrusion detection (in computer networks)
› Find patterns that indicate when an attack is made on a network
› e.g. clustering: small clusters are of high interest, they point to unusual cases.
› Definition of classes may be useful: e.g. harmless, possibly harmful, harmful, immediately close LAN
Typical difficulties
› Needs knowledge
› DM costs
› Cost of missing a fraud
› Cost of false positives (e.g. falsely accusing someone of fraud, company image problems)
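The trade-off between the cost of missing a fraud and the cost of false positives can be made concrete by summing both costs for different alarm thresholds. The sketch below assumes a model that outputs a fraud score per case. The scores, the cost figures (50 per false alarm, 500 per missed fraud) and the function names are hypothetical.

```python
def total_cost(cases, threshold, cost_false_positive, cost_missed_fraud):
    """Sum the cost of false alarms and missed frauds for one threshold.

    cases: list of (fraud_score, is_fraud) pairs; a case is flagged
    when its score is at or above the threshold.
    """
    cost = 0
    for score, is_fraud in cases:
        flagged = score >= threshold
        if flagged and not is_fraud:
            cost += cost_false_positive   # innocent customer accused
        elif not flagged and is_fraud:
            cost += cost_missed_fraud     # fraud slipped through
    return cost


def cheapest_threshold(cases, cost_false_positive, cost_missed_fraud):
    """Try every observed score as threshold and return the cheapest one."""
    candidates = sorted({score for score, _ in cases})
    return min(
        candidates,
        key=lambda t: total_cost(cases, t, cost_false_positive, cost_missed_fraud),
    )


# Hypothetical scored cases: (fraud score, was it really fraud?)
cases = [(0.9, True), (0.8, False), (0.7, True),
         (0.4, False), (0.3, True), (0.1, False)]

best = cheapest_threshold(cases, cost_false_positive=50, cost_missed_fraud=500)
print(best, total_cost(cases, best, 50, 500))
```

With a high cost per missed fraud, the cheapest threshold tends to be low, which means accepting more false positives.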
Probability Estimation
› Approximate the likelihood of an event given an observation
› e.g. to classify a potential customer into an A, B, C range before doing any business
Information Compression
› Can be viewed as a special type of estimation problem.
› For a given set of data, estimate the key components that can be used to construct the data.
Sensitivity Analysis
› Understand how changes in one variable affect others.
› Identify the sensitivity of one variable to another (find out if dependencies exist).
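One simple first check for a dependency between two variables is the Pearson correlation coefficient. It only captures linear relationships, so it is a rough sketch of sensitivity analysis and not a complete method. The price and demand figures are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: +1 / -1 = perfect linear dependency, 0 = none."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical observations: demand drops as the price rises.
price = [1, 2, 3, 4, 5]
demand = [10, 8, 6, 4, 2]
shoe_size = [3, 1, 4, 1, 5]  # hypothetical, only loosely related variable

print(pearson(price, demand))     # strong negative dependency
print(pearson(price, shoe_size))  # much weaker dependency
```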
02 Data Mining Algorithms
› Different algorithms, different uses
› Algorithms can be combined
› The choice of algorithm depends on what you want to do
› Not every algorithm is suited to every task
Algorithms in SSAS: Groups
› Classification algorithms
› Regression algorithms
› Association algorithms
› Segmentation algorithms
› Sequence analysis algorithms
› Plug-In algorithms
Classification algorithms
› Predict discrete attributes
› Based on empirical values (past experience)
› Algorithms in SSAS:
  › Naive Bayes
  › Decision Trees
  › Neural Networks
Regression algorithms
› Predict continuous attributes
› Otherwise work like classification algorithms
› Algorithms in SSAS:
  › Linear Regression (line)
  › Logistic Regression (curve)
  › MS Time Series
Association algorithms
› Predict likely combinations
› Find elements that occur in combination
› Algorithms in SSAS:
  › MS Association Algorithm (Apriori)
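Association algorithms in the Apriori family rate item combinations by their support and rate rules by their confidence. The sketch below computes only these two measures for one rule. It is not the full Apriori search and not the SSAS implementation, and the baskets are hypothetical.

```python
def support(baskets, items):
    """Fraction of baskets that contain all of the given items."""
    wanted = set(items)
    return sum(1 for basket in baskets if wanted <= set(basket)) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that also
    contain the consequent (rule: antecedent -> consequent)."""
    left = set(antecedent)
    both = left | set(consequent)
    with_left = sum(1 for basket in baskets if left <= set(basket))
    with_both = sum(1 for basket in baskets if both <= set(basket))
    return with_both / with_left


# Hypothetical shopping baskets.
baskets = [
    ["bread", "butter"],
    ["bread", "butter", "milk"],
    ["bread", "milk"],
    ["milk"],
    ["bread", "butter", "jam"],
]

print(support(baskets, ["bread", "butter"]))       # how common is the pair?
print(confidence(baskets, ["bread"], ["butter"]))  # bread -> butter
```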
Segmentation algorithms
› Also called "clustering algorithms"
› Group data with similar properties
› Algorithms in SSAS:
  › MS Clustering Algorithms (e.g. K-Means)
Sequence analysis algorithms
› …are clustering algorithms
› Consider the order (the sequence of values) while clustering
› Do not group by similar properties
› Group by similar sequences
› Algorithms in SSAS:
  › MS Sequence Clustering
Plug-In algorithms
› .NET wrapper for COM objects
› Use ANY algorithm
› Provided as an assembly
  › (possible workshop to create one)
03 Repetition - Datatypes, Contenttypes
Applying an Algorithm
› Datatypes
› Contenttypes
Datatypes
› Define the structure of the values
› Available datatypes:
  › Text
  › Long
  › Boolean
  › Double
  › Date
Contenttypes
› Define the behaviour of values
  › Discrete
  › Continuous
  › Discretized
  › Key
  › Key Sequence
  › Key Time
  › Ordered
  › Cyclical
Contenttype: Discrete
› Fixed set of values
› Example:
  › Commute Distance: 1-2, 2-5, 5-10
  › Region: Pacific, Northern America, Europe
  › Name: … … …
› Boolean values are always discrete
› Text is most likely discrete
Contenttype: Continuous
› Unlimited set of values
› Infinitely many values possible
› Example:
  › Income
  › Age
› The difference between Continuous and Discrete is the most important one
Contenttype: Discretized
› Continuous values converted into discrete values
› Examples:
  › Income to categories: A, B, C, …
  › Age to groups: 0-20, 21-30, 31-40, …
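Discretizing a continuous value amounts to looking up the bucket it falls into. The sketch below uses the age groups from the example. The final "41+" bucket and the function name `discretize_age` are assumptions, because the example list is open-ended.

```python
from bisect import bisect_left

# Upper bounds of the groups 0-20, 21-30, 31-40 from the example.
UPPER_BOUNDS = [20, 30, 40]
# "41+" is a hypothetical catch-all group; the example ends with "…".
LABELS = ["0-20", "21-30", "31-40", "41+"]


def discretize_age(age):
    """Map a continuous age to its discrete group label."""
    return LABELS[bisect_left(UPPER_BOUNDS, age)]


for age in (18, 20, 21, 35, 40, 67):
    print(age, "->", discretize_age(age))
```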
Contenttype: Key
› Key
  › Uniquely identifies a row
› Key Sequence (sequence clustering models)
  › Series of events
  › Sorted
› Key Time (time series models)
  › Identifies values on a time scale
Contenttype: Ordered
› Discrete values that have a sorting order
› No distances visible
› No relations visible
› Example: "One Star" to "Five Stars"
Contenttype: Cyclical
› Discrete values that have a cyclical sorting order
› Example:
  › Weekdays: Monday, Tuesday, …, Sunday, Monday, … (1, 2, 3, …, 7, 1, …)
  › Months: Jan, Feb, Mar, …, Dec, Jan, … (1, 2, 3, …, 12, 1, …)
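The wrap-around of a cyclical attribute can be sketched with modular arithmetic: after the last value the sequence starts again, and the distance between two values is measured around the circle. The helper names `next_value` and `cyclic_distance` are hypothetical.

```python
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def next_value(values, current, steps=1):
    """Move `steps` positions forward, wrapping around at the end."""
    index = values.index(current)
    return values[(index + steps) % len(values)]


def cyclic_distance(values, a, b):
    """Shortest number of steps between a and b in either direction."""
    direct = abs(values.index(a) - values.index(b))
    return min(direct, len(values) - direct)


print(next_value(WEEKDAYS, "Sunday"))                 # wraps to the start
print(cyclic_distance(WEEKDAYS, "Monday", "Sunday"))  # neighbours on the cycle
```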
Available Combinations
Datatype   Contenttype
Text       Discrete, Discretized, Sequence
Long       Continuous, Cyclical, Discrete, Discretized, Key Sequence, Key Time, Ordered, Sequence
Boolean    Discrete
Double     Continuous, Cyclical, Discrete, Discretized, Key Sequence, Key Time, Ordered, Sequence, Time
Date       Continuous, Discrete, Discretized, Key Time
04 Data Mining Algorithms - Decision Trees
In General
› Also known as: Classification Trees
› Goal: Sequentially partition data
› Can detect non-linear relationships
› Machine learning technique
  › Separate the data into a training set and a test set
  › The training set is used to create the model based on certain criteria
  › The test set is used to verify the model
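Separating the data into a training set and a test set can be sketched as a random split. The 30% test share, the fixed seed and the function name `train_test_split` are assumptions chosen for illustration. SSAS has its own settings for holding out test data.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows and split them into (training set, test set)."""
    rng = random.Random(seed)       # fixed seed makes the split repeatable
    shuffled = list(rows)           # copy, so the input stays untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical case ids standing in for customer records.
cases = list(range(10))
training_set, test_set = train_test_split(cases, test_fraction=0.3, seed=42)

print(len(training_set), len(test_set))
```

The model is built only from `training_set`; `test_set` stays unseen until the model is verified.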
Tree for response of a mailing action
Root: 2.6% response rate (total: 10,000 persons)
  Male: 3.2% (total: 4,677)
    Income > $30,000: 3.6%
    Income < $30,000: 2.3%
  Female: 2.1% (total: 5,323)
    Age < 40: 3.2%
    Age > 40: 3.8%
Using the Trained Tree
Example: the management decides to mail only to groups with a response rate > 3.5%.
Trained tree, groups with response rate > 3.5%:
› Males: income > $30,000
› Females: age 40+
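Applying the management rule means keeping only the leaves of the tree whose response rate exceeds the threshold. The sketch below uses the leaf rates from the mailing tree and assumes that the income split belongs to the male branch and the age split to the female branch. The function name `segments_to_mail` is hypothetical.

```python
# Leaves of the mailing tree with their response rates.
leaves = [
    {"segment": "Male, income > $30,000", "response_rate": 0.036},
    {"segment": "Male, income < $30,000", "response_rate": 0.023},
    {"segment": "Female, age < 40", "response_rate": 0.032},
    {"segment": "Female, age > 40", "response_rate": 0.038},
]


def segments_to_mail(leaves, min_response_rate):
    """Keep only segments whose response rate exceeds the threshold."""
    return [leaf["segment"] for leaf in leaves
            if leaf["response_rate"] > min_response_rate]


selected = segments_to_mail(leaves, min_response_rate=0.035)
print(selected)
```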
› Pros
  › Very flexible, white-box model
  › KISS: Keep it simple, stupid!
  › Little preparation and few resources needed
› Cons
  › Can be tuned to death
  › Takes a long time to build
  › Requires wisely selected training data! False training data yields false results
  › A big tree might require disk swapping (computation might be difficult if it does not fit into main memory)
Project: “DMDW Mining Test”
Project: “DMDW Mining Test” (explanation of one node)
Project: “DMDW Mining Test” (shows connections; more useful if there are more predictable values)
Project: “DMDW Mining Test” (Generic Content Tree Viewer; DMX (Data Mining Extensions))
References
References for Decision Trees
› Olivia Parr Rud et al.: Data Mining Cookbook - Modeling Data for Marketing, Risk, and Customer Relationship Management, Wiley, 2001
› David A. Grossman, Ophir Frieder: Introduction to Data Mining, Illinois Institute of Technology, 2005
› Andrew W. Moore: Decision Trees, Carnegie Mellon University, http://www.autonlab.org/tutorials/dtree16.pdf
› Nong Ye (ed.): The Handbook of Data Mining, Lawrence Erlbaum Associates, 2003
› Sushmita Mitra, Tinku Acharya: Data Mining - Multimedia, Soft Computing and Bioinformatics, Wiley, 2003
› http://en.wikipedia.org/wiki/Classification_tree
05 Data Mining Algorithms - Clustering
Clustering
› Segmentation algorithm
› Find homogeneous groups within a set
› Find similar variables for different cases
› Identify new relationships that were unclear before (heuristics)
  › e.g. "A person who rides a bike to work doesn't live far from their workplace" (this is not obvious)
[Figure: diagram with the labels "Independent Variables", "classify", "Homogeneous Subsets", "identify" and "Description of class"; the two stages are 1. Clustering and 2. Classification]
1. Clustering
› Reduces data to classes of equal types
› Become friends with the data
› Iterative algorithm
  › Clustering
  › Validate
  › Classify
  › Apply
http://msdn.microsoft.com/en-us/library/ms174879.aspx
2. Classification
› Create a description of a group
› Give it a "name"
› Also: Characterization
Process
› Start with random values
› Rerunning will create different sets and different groups
› A different clustering technique / algorithm will create different groups
› Rerun on the same dataset, reseed
› Experts evaluate the found classes for plausibility
› Good classes are used for predictions
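The "start with random values, rerun, reseed" step can be sketched with a tiny one-dimensional K-Means that is run with several seeds. This is a plain textbook K-Means under simplifying assumptions (one attribute, fixed number of iterations), not the MS Clustering implementation. The data points are hypothetical and deliberately well separated, so every seed ends at the same centers. On less clear-cut data the runs may disagree, which is what the expert then has to evaluate.

```python
import random


def kmeans_1d(points, k, seed, iterations=20):
    """Plain K-Means on one attribute; returns the sorted cluster centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)   # random start values
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center.
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its points; keep it if empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)


# Hypothetical values of one attribute, forming two obvious groups.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

for seed in range(5):
    print(seed, kmeans_1d(points, k=2, seed=seed))
```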
[Figure: process flow: 1. Clustering, then Evaluate / Check, then "Good?", then 2. Classify, then Apply (Predict)]
MS Clustering Algorithm
› Combination of two algorithms
› K-Means: hard
  › A data point can be in only one cluster
› Expectation Maximization: soft
  › A data point has different combinations
  › A data point can belong to different clusters
  › A probability is calculated
Source: http://msdn.microsoft.com/en-us/library/cc280445.aspx
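The difference between hard and soft membership can be sketched for a single data point and two cluster centers. The soft version below assumes Gaussian-shaped clusters with equal spread and equal size, which is a simplification of what Expectation Maximization estimates. The function names, the centers and the `spread` parameter are hypothetical.

```python
from math import exp


def hard_assignment(point, centers):
    """K-Means style: index of the single nearest cluster center."""
    return min(range(len(centers)), key=lambda i: abs(point - centers[i]))


def soft_assignment(point, centers, spread=1.0):
    """EM style: membership probability for every cluster.

    Assumes Gaussian clusters with the same spread and the same size.
    """
    weights = [exp(-((point - c) ** 2) / (2 * spread ** 2)) for c in centers]
    total = sum(weights)
    return [w / total for w in weights]


centers = [2.0, 8.0]  # hypothetical cluster centers

print(hard_assignment(4.0, centers))  # exactly one cluster
print(soft_assignment(4.0, centers))  # one probability per cluster
print(soft_assignment(5.0, centers))  # point halfway between both centers
```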
› Pros
  › No predictable variable to choose
  › Trains itself without much effort
  › Easy to configure
› "Cons"
  › Interpretation is everything
  › Good eye needed
  › Expert has to check for plausibility
Project: “DMDW Mining Test” (strongest relations only, number of matching cases for Region Europe)
Project: “DMDW Mining Test” (good to know: continuous attributes are shown by their arithmetic average)
Project: “DMDW Mining Test” (comparing two clusters)
THANK YOU FOR YOUR ATTENTION