dmdw lesson 05 + 06 + 07 - data mining applied

State-recognized University of Applied Sciences (staatlich anerkannte Fachhochschule)

Author I: Dip.-Inf. (FH) Johannes Hoppe
Author II: M.Sc. Johannes Hofmeister
Author III: Prof. Dr. Dieter Homeister
Date: 01.04.2011, 08.04.2011, 15.04.2011


Data Mining Applied


Applications of Data Mining

01




› Database Marketing
› Time-series prediction, detecting "trends"
› Detection (of whatever is detectable)
› Probability Estimation
› Information compression
› Sensitivity Analysis


Database Marketing (1/2)

Response modeling

› Model the response of specific customers.
› Systematic selection of (existing and potential) customers.
› Advertisements and promotions based on these results (CRM).
› Visualization: a "lift chart" shows how successful the selection should be (later topic: DM validation).

Lift Chart Example


“For contacting 10% of customers, using no model we should get 10% of responders and using the given model we should get 30% of responders.”
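The numbers in the quote can be reproduced with a small calculation. The sketch below is illustrative only: the customer list, the scores and the function name `lift_at` are assumptions, chosen so that the top 10% of the ranking contains 3 of the 10 responders.

```python
def lift_at(scored_customers, fraction):
    """Return (share of all responders reached, lift) for the top `fraction` by score.

    scored_customers: list of (model_score, responded) pairs, responded is 0 or 1.
    """
    ranked = sorted(scored_customers, key=lambda pair: pair[0], reverse=True)
    n_contacted = round(len(ranked) * fraction)
    total_responders = sum(label for _, label in ranked)
    reached = sum(label for _, label in ranked[:n_contacted])
    share = reached / total_responders
    # Without a model, contacting `fraction` of customers reaches `fraction` of responders.
    return share, share / fraction


# Hypothetical data: 100 customers with descending scores, 10 of them responders.
customers = [(1.0 - i / 100, 0) for i in range(100)]
for idx in (0, 4, 9, 20, 35, 50, 61, 72, 88, 95):
    customers[idx] = (customers[idx][0], 1)

share, lift = lift_at(customers, 0.10)
print(f"responders reached: {share:.0%}, lift: {lift:.1f}")
```

A lift of about 3 at the 10% mark corresponds to the statement on the slide: 30% of responders with the model instead of 10% without it.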



Cross selling: Selling additional products to existing customers

› Question: Which customer might buy which other product?
› Uses historical purchase data
› Uses credit card information, lifestyle data, demographic data, etc.
› Other possible information: Did the customer request specific information? How did the customer hear of the company?



Cross selling: Selling additional products to existing customers

› Results for direct marketing, mailing lists, direct advertising (Amazon)

› Amazon: "Customers who bought this item also bought" and "personalized recommendations"

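A very simple form of "customers who bought this item also bought" is to count co-occurrences in historical baskets. This is only a sketch of the idea, not the method Amazon uses; the baskets and the function name `also_bought` are made up for illustration.

```python
from collections import Counter


def also_bought(baskets, item, top_n=3):
    """Count which other items appear most often in baskets that contain `item`."""
    counts = Counter()
    for basket in baskets:
        if item in basket:
            counts.update(other for other in basket if other != item)
    return counts.most_common(top_n)


# Hypothetical purchase history, one set of items per customer.
baskets = [
    {"camera", "memory card", "tripod"},
    {"camera", "memory card"},
    {"camera", "bag"},
    {"book"},
    {"memory card", "bag"},
]

print(also_bought(baskets, "camera", top_n=1))
```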


Time-series prediction

› Time series: stock prices, market shares, …
› Extrapolation of future values
› Detection of newly arising trends, such as customers moving to other products
› Own experience: German print magazines
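Extrapolation can be illustrated with the simplest possible model, a straight trend line fitted by least squares. Real time-series algorithms are more elaborate; the sales figures below are hypothetical.

```python
def linear_trend(values):
    """Fit y = intercept + slope * t by least squares, with t = 0, 1, 2, ..."""
    n = len(values)
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    cov = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
    var = sum((t - t_mean) ** 2 for t in range(n))
    slope = cov / var
    return y_mean - slope * t_mean, slope


def extrapolate(values, steps):
    """Continue the fitted trend line for `steps` future periods."""
    intercept, slope = linear_trend(values)
    n = len(values)
    return [intercept + slope * (n + k) for k in range(steps)]


# Hypothetical copies sold per period (in thousands), falling steadily.
sales = [100, 98, 96, 94, 92]
forecast = extrapolate(sales, 2)
print(forecast)
```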


Detection

Identification of the existence or occurrence of a condition

Fraud detection:
› Identifying patterns/criteria to detect credit card fraud
› Estimating creditworthiness (e.g. the German Schufa)
› Predicting mail orders that will not be paid



Intrusion detection (in computer networks)
› Find patterns that indicate when an attack is made on a network
› e.g. clustering: small clusters are of high interest because they point to unusual cases.
› Defining classes may be useful: e.g. harmless, possibly harmful, harmful, immediately close LAN



Typical difficulties
› Needs knowledge
› DM costs
› Cost of missing a fraud
› Cost of false positives (e.g. falsely accusing someone of fraud, company image problems)
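The trade-off between missed frauds and false positives can be made concrete by attaching a cost to each kind of error and comparing alarm thresholds. All scores, labels and cost figures below are invented for illustration.

```python
def total_cost(cases, threshold, cost_missed_fraud, cost_false_alarm):
    """Total cost when every case with score >= threshold is flagged as fraud."""
    cost = 0
    for score, is_fraud in cases:
        flagged = score >= threshold
        if is_fraud and not flagged:
            cost += cost_missed_fraud   # fraud that slipped through
        elif flagged and not is_fraud:
            cost += cost_false_alarm    # honest customer falsely accused
    return cost


# Hypothetical (fraud score, actually fraud?) pairs.
cases = [(0.9, True), (0.8, False), (0.7, True), (0.4, False), (0.3, True), (0.1, False)]
thresholds = [0.2, 0.5, 0.75, 0.95]

best = min(thresholds, key=lambda th: total_cost(cases, th, 500, 100))
print(best, total_cost(cases, best, 500, 100))
```

With a missed fraud assumed to be five times as expensive as a false alarm, the lowest threshold wins here; other cost ratios would give a different answer.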


Probability Estimation

› Approximate the likelihood of an event given an observation
› e.g. to classify a potential customer into an A, B, C range before any business takes place


Information Compression

› Can be viewed as a special type of estimation problem.
› For a given set of data, estimate the key components that can be used to construct the data.


Sensitivity Analysis

› Understand how changes in one variable affect others.
› Identify the sensitivity of one variable to another (find out whether dependencies exist).
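One common way to probe such dependencies is to change one input at a time by a small amount and observe the change in the output. The sketch below assumes a hypothetical `revenue` model; the function `sensitivities` and all values are illustrative.

```python
def sensitivities(model, inputs, delta=0.01):
    """One-at-a-time sensitivity: change in output per unit change of each input."""
    base = model(**inputs)
    result = {}
    for name, value in inputs.items():
        bumped = dict(inputs, **{name: value + delta})
        result[name] = (model(**bumped) - base) / delta
    return result


def revenue(price, units, discount):
    # Hypothetical model used only to demonstrate the technique.
    return price * units * (1 - discount)


s = sensitivities(revenue, {"price": 10.0, "units": 100.0, "discount": 0.1})
for name, value in s.items():
    print(f"{name}: {value:.1f}")
```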

Data Mining Algorithms

02




› Different algorithms, different uses
› Algorithms can be combined
› The algorithm depends on what you want to do
› Not every algorithm is suited for what you want to do


Algorithms in SSAS: Groups

› Classification algorithms
› Regression algorithms
› Association algorithms
› Segmentation algorithms
› Sequence analysis algorithms
› Plug-In algorithms


Classification algorithms

› Predict discrete attributes
› Based on experience values
› Algorithms in SSAS:
  › Naive Bayes
  › Decision Trees
  › Neural Networks


Regression algorithms

› Predict continuous attributes
› Otherwise the same idea as classification algorithms
› Algorithms in SSAS:
  › Linear Regression (Line)
  › Logistic Regression (Curve)
  › MS Time Series
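The "line" and "curve" labels refer to the shape of the fitted function. A minimal sketch of the two shapes follows; the parameter names `intercept` and `slope` are generic, not SSAS settings.

```python
import math


def linear(x, intercept, slope):
    """Linear regression predicts along a straight line; the output is unbounded."""
    return intercept + slope * x


def logistic(x, intercept, slope):
    """Logistic regression bends the same line into an S-curve between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))


for x in (-4, 0, 4):
    print(x, linear(x, 0.0, 1.0), round(logistic(x, 0.0, 1.0), 3))
```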


Association algorithms

› Predict likely combinations
› Find elements that occur in combination
› Algorithms in SSAS:
  › MS Association Algorithm (Apriori)
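The core idea of Apriori is that a combination can only be frequent if all of its parts are frequent, so rare single items are discarded before pairs are counted. The sketch below covers only the step from single items to pairs and is not the SSAS implementation; the transactions and the `min_support` value are assumptions.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return {pair: support} for item pairs whose support is at least min_support."""
    n = len(transactions)
    # Pass 1: keep only items that are frequent on their own.
    item_counts = Counter(item for t in transactions for item in set(t))
    frequent_items = {item for item, c in item_counts.items() if c / n >= min_support}
    # Pass 2: count only pairs built from frequent items.
    pair_counts = Counter()
    for t in transactions:
        candidates = sorted(set(t) & frequent_items)
        pair_counts.update(combinations(candidates, 2))
    return {pair: c / n for pair, c in pair_counts.items() if c / n >= min_support}


# Hypothetical shopping baskets.
transactions = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "milk"],
    ["butter", "milk"],
    ["bread", "butter", "jam"],
]

pairs = frequent_pairs(transactions, min_support=0.5)
print(pairs)
```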


Segmentation algorithms

› Also called "clustering algorithms"
› Group data with similar properties
› Algorithms in SSAS:
  › MS Clustering Algorithms (e.g. K-Means)


Sequence analysis algorithms

› …are clustering algorithms
› Consider the order (the sequence of values) while clustering
› Do not group by similar properties
› Group by similar sequences
› Algorithms in SSAS:
  › MS Sequence Clustering


Plug-In algorithms

› .NET wrapper for COM objects
› Use ANY algorithm
› Provided as an assembly
› (possible workshop to create one)

Recap - Datatypes, Contenttypes

03


Applying an Algorithm

› Datatypes
› Contenttypes


Datatypes

› Define the structure of the values
› Available datatypes:
  › Text
  › Long
  › Boolean
  › Double
  › Date


Contenttypes

› Define the behaviour of values
  › Discrete
  › Continuous
  › Discretized
  › Key
  › Key Sequence
  › Key Time
  › Ordered
  › Cyclical


Contenttype: Discrete

› Fixed set of values
› Examples:
  › Commute Distance: 1-2, 2-5, 5-10
  › Region: Pacific, Northern America, Europe
  › Name: … … …
› Boolean values are always discrete
› Text is most likely discrete


Contenttype: Continuous

› Unlimited set of values
› Infinitely many values possible
› Examples:
  › Income
  › Age
› The distinction between continuous and discrete is the most important one


Contenttype: Discretized

› Continuous values converted into discrete values
› Examples:
  › Income to categories: A, B, C, …
  › Age to groups: 0-20, 21-30, 31-40, …
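Discretization of the age example can be sketched with a lookup over bucket boundaries. The first three groups are those from the slide; the final "41+" bucket is an assumption, since the slide's list continues with "…".

```python
import bisect

AGE_UPPER_BOUNDS = [20, 30, 40]                  # inclusive upper bound of each group
AGE_LABELS = ["0-20", "21-30", "31-40", "41+"]   # "41+" is assumed, not from the slide


def discretize_age(age):
    """Map a continuous age to its discrete group label."""
    # bisect_left finds the first bound that is >= age, i.e. the group the age falls into.
    return AGE_LABELS[bisect.bisect_left(AGE_UPPER_BOUNDS, age)]


for age in (20, 21, 35, 67):
    print(age, discretize_age(age))
```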


Contenttype: Key

› Key
  › Uniquely identifies a row
› Key Sequence (sequence clustering models)
  › Series of events
  › Sorted
› Key Time (time series models)
  › Identifies values on a time scale


Contenttype: Ordered

› Discrete values that have a sorting order
› No distances visible
› No relations visible
› Example: "One Star" to "Five Stars"


Contenttype: Cyclical

› Discrete values that have a cyclical sorting order
› Examples:
  › Weekdays: Monday, Tuesday, …, Sunday, Monday, … (1, 2, 3, …, 7, 1, …)
  › Months: Jan, Feb, Mar, …, Dec, Jan, … (1, 2, 3, …, 12, 1, …)
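The practical consequence of a cyclical order is that the last value is next to the first one. One way to express this is a wrap-around distance; the helper name `cyclic_distance` is made up for this illustration.

```python
def cyclic_distance(a, b, period):
    """Shortest distance between two positions on a cycle of length `period`."""
    diff = abs(a - b) % period
    return min(diff, period - diff)


print(cyclic_distance(7, 1, period=7))    # Sunday (7) to Monday (1)
print(cyclic_distance(12, 1, period=12))  # December (12) to January (1)
print(cyclic_distance(1, 4, period=7))    # Monday (1) to Thursday (4)
```

In all three cases a plain numeric difference would either overstate or match the cyclic one; only the wrap-around version treats Sunday and Monday as neighbours.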

Available Combinations

Datatype   Contenttype
Text       Discrete, Discretized, Sequence
Long       Continuous, Cyclical, Discrete, Discretized, Key Sequence, Key Time, Ordered, Sequence
Boolean    Discrete
Double     Continuous, Cyclical, Discrete, Discretized, Key Sequence, Key Time, Ordered, Sequence, Time
Date       Continuous, Discrete, Discretized, Key Time

Data Mining Algorithms - Decision Trees

04


Applied Data Mining - Decision Trees



In General

› Also known as: Classification Trees
› Goal: sequentially partition the data
› Can detect non-linear relationships
› Machine learning technique
  › Separate the data into a training set and a test set
  › The training set is used to create the model based on certain criteria
  › The test set is used to verify the model
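The separation into training and test set is usually done at random. A minimal sketch follows; the 30% test share and the seed are arbitrary assumptions, and SSAS has its own settings for this.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle the rows reproducibly and cut them into a training and a test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train, test = train_test_split(range(10))
print(len(train), len(test))
```

The model is built from `train` only; `test` stays untouched until the model is verified.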


Tree for response of a mailing action

All persons: 2.6% response rate (total: 10,000)
    Male: 3.2% (total: 4,677)
        Income > $30,000: 3.6%
        Income < $30,000: 2.3%
    Female: 2.1% (total: 5,323)
        Age < 40: 3.2%
        Age > 40: 3.8%

Using the Trained Tree

Example: the management decides to mail only to groups with a response rate > 3.5%.

Trained tree (groups with response rate > 3.5%):
› Males: income > $30,000
› Females: age 40+
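Applying the management rule to the tree amounts to filtering its leaves by response rate. The sketch below uses the response rates from the mailing tree; placing the income split under "male" and the age split under "female" is an assumption inferred from the trained tree.

```python
# Leaves of the mailing tree as (group description, response rate in percent).
# Which split belongs to which gender is inferred, not stated explicitly.
LEAVES = [
    ({"gender": "male", "income": "> $30,000"}, 3.6),
    ({"gender": "male", "income": "< $30,000"}, 2.3),
    ({"gender": "female", "age": "< 40"}, 3.2),
    ({"gender": "female", "age": "> 40"}, 3.8),
]


def groups_to_mail(leaves, min_rate):
    """Keep only the groups whose response rate exceeds the management threshold."""
    return [group for group, rate in leaves if rate > min_rate]


selected = groups_to_mail(LEAVES, min_rate=3.5)
for group in selected:
    print(group)
```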

› Pros
  › Very flexible, white-box model
  › KISS: Keep it simple, stupid!
  › Little preparation and few resources needed
› Cons
  › Can be tuned endlessly
  › Takes a long time to build
  › Requires wisely selected training data!
    › Bad training data yields bad results
    › A big tree might require disk swapping (computation might be difficult if it does not fit into main memory)

Project: “DMDW Mining Test”

Project: “DMDW Mining Test” (explanation of one node)

Project: “DMDW Mining Test” (shows connections; more useful if there are more predictable values)

Project: “DMDW Mining Test” (Generic Content Tree Viewer; DMX (Data Mining Extensions))

References

References for Decision Trees

Olivia Parr Rud et al.: Data Mining Cookbook - Modeling Data for Marketing, Risk, and Customer Relationship Management, Wiley, 2001

David A. Grossman, Ophir Frieder: Introduction to Data Mining, Illinois Institute of Technology, 2005

Andrew W. Moore: Decision Trees, Carnegie Mellon University, http://www.autonlab.org/tutorials/dtree16.pdf

Nong Ye (ed.): The Handbook of Data Mining, Lawrence Erlbaum Associates, 2003

Sushmita Mitra, Tinku Acharya: Data Mining - Multimedia, Soft Computing and Bioinformatics, Wiley, 2003

http://en.wikipedia.org/wiki/Classification_tree

Data Mining Algorithms - Clustering

05


Clustering

› Segmentation algorithm
› Find homogeneous groups within a set
› Find similar variables for different cases
› Identify new relationships that were unclear before (heuristics)
  › e.g. "A person who rides a bike to work doesn't live far from their workplace" (this is not obvious)

[Figure: independent variables are classified into homogeneous subsets (1. Clustering); a description of each class is then identified (2. Classification).]

Clustering

1. Clustering

› Reduces data to classes of equal types
› Become friends with the data
› Iterative algorithm
  › Clustering
  › Validate
  › Classify
  › Apply

http://msdn.microsoft.com/en-us/library/ms174879.aspx


2. Classification

› Create a description of a group
› Give it a "name"
› Also called: characterization

Process

› Start with random values
› Reuse will create different sets and different groups
› A different clustering technique / algorithm will create different groups
› Reuse on the same dataset, reseed
› Experts evaluate the found classes for plausibility
› Good classes are used for predictions

[Figure: 1. Clustering → Evaluate, check → Good? → 2. Classify → Apply (predict)]

Clustering

MS Clustering Algorithm

› Combination of two algorithms
› K-Means: hard clustering
  › A data point can belong to only one cluster
› Expectation Maximization: soft clustering
  › A data point can belong to several clusters
  › A probability is calculated for each combination of data point and cluster

Source: http://msdn.microsoft.com/en-us/library/cc280445.aspx
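The difference between hard and soft assignment can be shown on one-dimensional data. This is a simplified sketch and not the Microsoft implementation: `kmeans_1d` is plain K-Means with fixed starting centroids, and `soft_memberships` only mimics the probability step of Expectation Maximization with an assumed common `spread` for all clusters.

```python
import math


def kmeans_1d(points, centroids, iterations=10):
    """Hard clustering: each point is assigned to exactly one (the nearest) centroid."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its points (keep it if the cluster is empty).
        centroids = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters


def soft_memberships(point, centroids, spread=1.0):
    """Soft clustering: a membership probability for every cluster (Gaussian weights)."""
    weights = [math.exp(-((point - c) ** 2) / (2 * spread ** 2)) for c in centroids]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical data with two obvious groups.
points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 10.0])
print(centroids, clusters)
print(soft_memberships(5.0, centroids))   # a point halfway between both clusters
print(soft_memberships(2.0, centroids))   # a point close to the first cluster
```

K-Means would have to put the point 5.0 into one of the two clusters, while the soft variant reports that it belongs to both with equal probability.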

Clustering

› Pros
  › No predictable variable to choose
  › Trains itself without much effort
  › Easy to configure
› "Cons"
  › Interpretation is everything
  › A good eye is needed
  › An expert has to check for plausibility

Project: “DMDW Mining Test” (strongest relations only; number of matching cases for Region Europe)

Project: “DMDW Mining Test” (good to know: continuous attributes are shown by their arithmetic mean)

Project: “DMDW Mining Test” (comparing two clusters)

THANK YOU FOR YOUR ATTENTION
