DMDW Lesson 05 + 06 + 07 - Data Mining Applied

STAATLICH ANERKANNTE FACHHOCHSCHULE
Author I: Dip.-Inf. (FH) Johannes Hoppe
Author II: M.Sc. Johannes Hofmeister
Author III: Prof. Dr. Dieter Homeister
Date: 01.04.2011, 08.04.2011, 15.04.2011
STUDIEREN UND DURCHSTARTEN.

Page 1: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Page 2: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Data Mining Applied

Page 3: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Applications of Data Mining

01

Page 5: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Applications of Data Mining

› Database Marketing
› Time-series prediction, detecting "trends"
› Detection (of whatever is detectable)
› Probability Estimation
› Information compression
› Sensitivity Analysis

Page 6: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Applications of Data Mining

Database Marketing (1/2)

Response modeling

› Model the response of specific customers.
› Systematic selection of (existing and potential) customers.
› Advertisements and promotions based on these results (see CRM).
› Visualization: a "lift chart" shows how successful the selection is expected to be (later topic: DM validation).

Page 7: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Lift Chart Example

“For contacting 10% of customers, using no model we should get 10% of responders and using the given model we should get 30% of responders.”
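
The quoted figures can be turned into a single lift value. The following is a minimal sketch, assuming lift is defined as the share of all responders reached divided by the share of customers contacted; the function name `lift` is chosen for illustration only.

```python
def lift(responders_reached_share: float, contacted_share: float) -> float:
    """Lift = share of all responders reached / share of customers contacted."""
    if contacted_share <= 0:
        raise ValueError("contacted_share must be positive")
    return responders_reached_share / contacted_share


# Figures from the slide: 10% of customers are contacted.
baseline_lift = lift(0.10, 0.10)  # no model: 10% of responders
model_lift = lift(0.30, 0.10)     # given model: 30% of responders

print(f"baseline lift: {baseline_lift:.1f}, model lift: {model_lift:.1f}")
```

A lift of about 3 means the model finds roughly three times as many responders as a random selection of the same size.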

Page 8: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Applications of Data Mining

Database Marketing (2/2)

Cross selling: Selling additional products to existing customers

› Question: Which customer might buy which other product?
› Uses historical purchase data
› Uses credit card information, lifestyle data, demographic data, etc.
› Other possible information: Did the customer request specific information? How did the customer hear of the company?

Page 9: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Applications of Data Mining

Database Marketing (2/2)

Cross selling: Selling additional products to existing customers

› Results for direct marketing, mailing lists, direct advertising (Amazon)
› Amazon: "Customers who bought this item also bought" and "personalized recommendations"
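
The "also bought" idea can be illustrated by counting which products share a basket with a given item. This is only a sketch on hypothetical purchase data; the basket contents and the helper `also_bought` are made up for illustration and do not describe how Amazon computes recommendations.

```python
from collections import Counter

# Hypothetical purchase history: one set of products per customer.
baskets = [
    {"book", "lamp"},
    {"book", "lamp", "pen"},
    {"book", "pen"},
    {"lamp"},
]


def also_bought(baskets, item):
    """Count how often each other product shares a basket with `item`."""
    counts = Counter()
    for basket in baskets:
        if item in basket:
            counts.update(basket - {item})
    return counts


print(also_bought(baskets, "book").most_common())
```

Products with the highest counts are natural cross-selling candidates for customers who already own the item.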

Page 10: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Applications of Data Mining

Time-series prediction

› Time series: stock prices, market shares, …
› Extrapolation of future values
› Detection of newly arising trends, such as customers moving to other products
› Own experience: German print magazines
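
Extrapolation of future values can be sketched with a plain least-squares trend line. This is an illustration only, not the MS Time Series algorithm; the market-share numbers and the function names are hypothetical.

```python
def fit_line(values):
    """Least-squares line through (0, v0), (1, v1), ...; returns (slope, intercept)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_t = (n - 1) / 2
    mean_v = sum(values) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values))
    var = sum((t - mean_t) ** 2 for t in range(n))
    slope = cov / var
    return slope, mean_v - slope * mean_t


def extrapolate(values, steps_ahead):
    """Predict the value `steps_ahead` periods after the last observation."""
    slope, intercept = fit_line(values)
    t = len(values) - 1 + steps_ahead
    return slope * t + intercept


market_share = [20.0, 21.0, 22.0, 23.0]  # hypothetical percentages per quarter
forecast = extrapolate(market_share, steps_ahead=2)
print(forecast)
```

A straight line only captures a steady trend; seasonal or newly arising patterns need richer models.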

Page 11: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Applications of Data Mining

Detection
Identification of the existence or occurrence of a condition

Fraud detection:
› Identifying patterns/criteria to detect credit card fraud
› Estimating creditworthiness (e.g. the German Schufa)
› Predicting mail orders that will not be paid

Page 12: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Applications of Data Mining

Detection
Identification of the existence or occurrence of a condition

Intrusion detection (in computer networks)
› Find patterns that indicate when an attack is made on a network
› e.g. clustering: small clusters are of high interest because they point to unusual cases.
› Defining classes may be useful: e.g. harmless, possibly harmful, harmful, immediately close LAN

Page 13: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Applications of Data Mining

Detection
Identification of the existence or occurrence of a condition

Typical difficulties
› Needs knowledge
› DM costs
› Cost of missing a fraud
› Cost of false positives (e.g. falsely accusing someone of fraud, company image problems)
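
The trade-off between missed frauds and false positives can be made explicit by attaching a cost to each kind of error. The sketch below uses made-up counts and costs for two hypothetical detection models; none of the numbers come from the lecture.

```python
def total_cost(missed_frauds, false_alarms, cost_per_miss, cost_per_false_alarm):
    """Total error cost of a detection model on one evaluation period."""
    return missed_frauds * cost_per_miss + false_alarms * cost_per_false_alarm


# Hypothetical costs: a missed fraud is far more expensive than a false alarm.
COST_PER_MISS = 1000
COST_PER_FALSE_ALARM = 20

strict = total_cost(5, 200, COST_PER_MISS, COST_PER_FALSE_ALARM)   # flags a lot
lenient = total_cost(20, 10, COST_PER_MISS, COST_PER_FALSE_ALARM)  # flags little

print(f"strict model: {strict}, lenient model: {lenient}")
```

With these assumed costs the strict model is cheaper despite its many false alarms; a higher cost per false alarm (for example image damage) can reverse the ranking.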

Page 14: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Applications of Data Mining

Probability Estimation

› Approximate the likelihood of an event given an observation
› e.g. to classify a potential customer into an A, B, C range before doing any business

Page 15: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Applications of Data Mining

Information Compression

› Can be viewed as a special type of estimation problem.
› For a given set of data, estimate the key components that can be used to construct the data.

Page 16: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Applications of Data Mining

Sensitivity Analysis

› Understand how changes in one variable affect others.
› Identify the sensitivity of one variable to another (find out if dependencies exist).
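
One simple way to probe such a dependency is to change a single input slightly and observe the change in the output. The sketch below assumes a hypothetical `revenue` model and a finite-difference helper named `sensitivity`; both are illustrations, not part of any SSAS API.

```python
def sensitivity(model, inputs, name, delta=1e-3):
    """Approximate change in model output per unit change of one input."""
    bumped = dict(inputs)
    bumped[name] += delta
    return (model(**bumped) - model(**inputs)) / delta


def revenue(price, units):
    """Hypothetical model: revenue as price times units sold."""
    return price * units


base = {"price": 10.0, "units": 50.0}
price_sensitivity = sensitivity(revenue, base, "price")
units_sensitivity = sensitivity(revenue, base, "units")
print(price_sensitivity, units_sensitivity)
```

A sensitivity near zero suggests that the output does not depend on that input around the chosen operating point.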

Page 17: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Data Mining Algorithms

02

Page 18: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Data Mining Algorithms

› Different algorithms, different uses
› Algorithms can be combined
› The choice of algorithm depends on what you want to do
› Not every algorithm is suited for what you want to do

Page 19: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Data Mining Algorithms

Algorithms in SSAS: Groups

› Classification algorithms
› Regression algorithms
› Association algorithms
› Segmentation algorithms
› Sequence analysis algorithms
› Plug-In algorithms

Page 20: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Data Mining Algorithms

Classification algorithms

› Predict discrete attributes
› Based on empirical values (past experience)
› Algorithms in SSAS:
  › Naive Bayes
  › Decision Trees
  › Neural Networks

Page 21: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Data Mining Algorithms

Regression algorithms

› Predict continuous attributes
› Otherwise work the same way as classification algorithms
› Algorithms in SSAS:
  › Linear Regression (Line)
  › Logistic Regression (Curve)
  › MS Time Series

Page 22: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Data Mining Algorithms

Association algorithms

› Predict likely combinations
› Find elements that occur in combination
› Algorithms in SSAS:
  › MS Association Algorithm (Apriori)
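
Apriori-style association mining rests on two measures, support and confidence. The sketch below computes both for one rule on hypothetical baskets; it shows the measures only, not the candidate-generation steps of Apriori or the SSAS implementation.

```python
def support(baskets, items):
    """Fraction of baskets that contain all of `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    base = support(baskets, antecedent)
    if base == 0:
        raise ValueError("antecedent never occurs")
    return support(baskets, set(antecedent) | set(consequent)) / base


# Hypothetical transactions.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"jam"},
]

rule_support = support(baskets, {"bread", "butter"})
rule_confidence = confidence(baskets, {"bread"}, {"butter"})
print(rule_support, rule_confidence)
```

Rules are usually kept only if both values exceed thresholds chosen by the analyst.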

Page 23: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Data Mining Algorithms

Segmentation algorithms

› Also called "Clustering algorithms"
› Group data with similar properties
› Algorithms in SSAS:
  › MS Clustering Algorithms (e.g. K-Means)

Page 24: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Data Mining Algorithms

Sequence analysis algorithms

› …are clustering algorithms
› Consider the order (the sequence of values) while clustering
› Do not group by similar properties
› Group by similar sequences
› Algorithms in SSAS:
  › MS Sequence Clustering

Page 25: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Data Mining Algorithms

Plug-In algorithms

› .NET Wrapper for COM objects
› Use ANY algorithm
› Provided as an assembly
  › (possible workshop to create one)

Page 26: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Repetition - Datatypes, Contenttypes

03

Page 27: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Repetition - Datatypes, Contenttypes

Applying an Algorithm

› Datatypes
› Contenttypes

Page 28: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Repetition - Datatypes, Contenttypes

Datatypes

› Define the structure of the values
› Available datatypes:
  › Text
  › Long
  › Boolean
  › Double
  › Date

Page 29: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Repetition - Datatypes, Contenttypes

Contenttypes

› Define the behaviour of values
› Discrete
› Continuous
› Discretized
› Key
› Key Sequence
› Key Time
› Ordered
› Cyclical

Page 30: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Repetition - Datatypes, Contenttypes

Contenttype: Discrete

› Fixed set of values
› Example:
  › Commute Distance: 1-2, 2-5, 5-10
  › Region: Pacific, Northern America, Europe
  › Name: … … …
› Boolean values are always discrete
› Text is most likely discrete

Page 31: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Repetition - Datatypes, Contenttypes

Contenttype: Continuous

› Unlimited set of values
› Infinite items possible
› Example:
  › Income
  › Age
› The difference between Continuous and Discrete is the most important one

Page 32: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Repetition - Datatypes, Contenttypes

Contenttype: Discretized

› Continuous values converted into discrete values
› Examples:
  › Income to categories: A, B, C, …
  › Age to groups: 0-20, 21-30, 31-40, …
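
The age example can be written as a small binning function. This is a sketch assuming integer ages; the group boundaries follow the slide, while the open-ended label "41+" and the name `age_group` are assumptions added for illustration.

```python
from bisect import bisect_left

# Inclusive upper ends of the groups 0-20, 21-30, 31-40 from the slide.
UPPER_BOUNDS = [20, 30, 40]
# "41+" is a hypothetical catch-all; the slide only continues with "…".
LABELS = ["0-20", "21-30", "31-40", "41+"]


def age_group(age: int) -> str:
    """Map a continuous (integer) age to its discrete group label."""
    if age < 0:
        raise ValueError("age must not be negative")
    return LABELS[bisect_left(UPPER_BOUNDS, age)]


for age in (18, 21, 40, 67):
    print(age, "->", age_group(age))
```

After discretization, algorithms that only accept discrete inputs can use the attribute.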

Page 33: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Repetition - Datatypes, Contenttypes

Contenttype: Key

› Key
  › Uniquely identifies a row
› Key Sequence (sequence clustering models)
  › Series of events
  › Sorted
› Key Time (time series models)
  › Identifies values on a time scale

Page 34: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Repetition - Datatypes, Contenttypes

Contenttype: Ordered

› Discrete values that have a sorting order
› No distances visible
› No relations visible
› Example: "One Star" to "Five Stars"

Page 35: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Repetition - Datatypes, Contenttypes

Contenttype: Cyclical

› Discrete values that have a cyclical sorting order
› Example:
  › Weekdays: Monday, Tuesday, …, Sunday, Monday, … (1, 2, 3, …, 7, 1, …)
  › Months: Jan, Feb, Mar, …, Dec, Jan, … (1, 2, 3, …, 12, 1, …)
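
Cyclical order means that the successor of the last value is the first value again, which is plain modular arithmetic. A minimal sketch with hypothetical helper names `next_value` and `cyclic_distance`:

```python
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def next_value(values, current, steps=1):
    """Value that follows `current` after `steps` positions, wrapping around."""
    index = values.index(current)
    return values[(index + steps) % len(values)]


def cyclic_distance(values, a, b):
    """Shortest number of steps between two values on the cycle."""
    direct = abs(values.index(a) - values.index(b))
    return min(direct, len(values) - direct)


print(next_value(WEEKDAYS, "Sunday"))
print(cyclic_distance(WEEKDAYS, "Sunday", "Monday"))
```

On a merely ordered scale Sunday and Monday would be six steps apart; on the cycle they are neighbours.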

Page 36: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Available Combinations

Datatype | Contenttype
Text     | Discrete, Discretized, Sequence
Long     | Continuous, Cyclical, Discrete, Discretized, Key Sequence, Key Time, Ordered, Sequence
Boolean  | Discrete
Double   | Continuous, Cyclical, Discrete, Discretized, Key Sequence, Key Time, Ordered, Sequence, Time
Date     | Continuous, Discrete, Discretized, Key Time

Page 37: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Data Mining Algorithms - Decision Trees

04

Page 38: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Applied Data Mining - Decision Trees

Page 39: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Applied Data Mining - Decision Trees

In General

› Also known as: Classification Trees
› Goal: Sequentially partition data
› Can detect non-linear relationships
› Machine learning technique
  › Separate the data into a training set and a test set
  › The training set is used to create the model based on certain criteria
  › The test set is used to verify the model
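
The separation into training and test data can be sketched as a random split. The 30% test share, the fixed seed, and the function name `train_test_split` are assumptions for illustration; they are not taken from the lecture.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle the rows and split them into (training set, test set)."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


cases = list(range(10))  # stand-in for ten customer records
train, test = train_test_split(cases)
print(len(train), "training cases,", len(test), "test cases")
```

The model is built only from `train`; `test` stays untouched until the model is verified.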

Page 40: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Applied Data Mining - Decision Trees

Tree for response of a mailing action

› Root: 2.6 % response rate (Total: 10,000 persons)
› First split:
  › Male: 3.2 % (Total: 4,677)
  › Female: 2.1 % (Total: 5,323)
› Further splits:
  › Income > $30,000: 3.6 %
  › Age < 40: 3.2 %
  › Income < $30,000: 2.3 %
  › Age > 40: 3.8 %
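
The first split can be checked against the root: the size-weighted average of the male and female response rates should reproduce the overall rate. A small sketch using the figures from the slide (the function name is illustrative):

```python
# Segment sizes and response rates in percent, from the tree slide.
segments = {"Male": (4677, 3.2), "Female": (5323, 2.1)}


def overall_rate(segments):
    """Size-weighted average of the segment response rates (percent)."""
    total = sum(size for size, _ in segments.values())
    weighted = sum(size * rate for size, rate in segments.values())
    return weighted / total


rate = overall_rate(segments)
print(f"{rate:.2f} %")
```

Rounded to one decimal place this matches the 2.6 % stated at the root of the tree.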

Page 41: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Applied Data Mining - Decision Trees

Using the Trained Tree

Trained Tree
› Males: income > $30,000
› Females: age 40+
› Response rate: > 3.5 %

Example: The management decides to mail only to groups with a response rate > 3.5 %.
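
The management rule amounts to filtering the leaves of the tree by a threshold. In this sketch the leaf rates are read from the tree slide; assigning the income split to the male branch and the age split to the female branch is an assumption inferred from the "Trained Tree" labels.

```python
# Leaf response rates in percent, as read from the tree slide.
# Assumption: income split under "Male", age split under "Female".
leaf_rates = {
    ("Male", "Income > $30,000"): 3.6,
    ("Male", "Income < $30,000"): 2.3,
    ("Female", "Age < 40"): 3.2,
    ("Female", "Age > 40"): 3.8,
}


def groups_to_mail(rates, threshold):
    """Return the groups whose response rate exceeds the threshold."""
    return [group for group, rate in rates.items() if rate > threshold]


selected = groups_to_mail(leaf_rates, threshold=3.5)
for gender, condition in selected:
    print(gender, condition)
```

Raising or lowering the threshold trades mailing volume against the expected response rate.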

Page 42: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Applied Data Mining - Decision Trees

› Pros
  › Very flexible, white-box model
  › KISS: Keep it simple, stupid!
  › Little preparation and few resources needed
› Cons
  › Can be tuned to death
  › Long time to build
  › Requires wisely selected training data! False training data yields false results
  › A big tree might require disk swapping (computation might be difficult if the tree does not fit into main memory)

Page 43: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Project: “DMDW Mining Test”

Page 44: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Project: “DMDW Mining Test” (explanation of one node)

Page 45: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Project: “DMDW Mining Test” (shows connections; more useful if there are more predictable values)

Page 46: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Project: “DMDW Mining Test” (Generic Content Tree Viewer; DMX (Data Mining Extensions))

Page 47: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

References

References for Decision Trees

› Olivia Parr Rud et al.: Data Mining Cookbook - Modeling Data for Marketing, Risk, and Customer Relationship Management, Wiley, 2001
› David A. Grossman, Ophir Frieder: Introduction to Data Mining, Illinois Institute of Technology, 2005
› Andrew W. Moore: Decision Trees, Carnegie Mellon University, http://www.autonlab.org/tutorials/dtree16.pdf
› Nong Ye (ed.): The Handbook of Data Mining, Lawrence Erlbaum Associates, 2003
› Sushmita Mitra, Tinku Acharya: Data Mining - Multimedia, Soft Computing and Bioinformatics, Wiley, 2003
› http://en.wikipedia.org/wiki/Classification_tree

Page 48: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Data Mining Algorithms - Clustering

05

Page 49: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Data Mining Algorithms - Clustering

Page 50: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Data Mining Algorithms - Clustering

Clustering

› Segmentation algorithm
› Find homogeneous groups within a set
› Find similar variables for different cases
› Identify new relationships that were unclear before (heuristics)
  › e.g. "A person who rides a bike to work doesn't live far from their workplace" (this is not obvious)

Page 51: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

[Figure: diagram with the labels "Independent Variables", "classify", "Homogeneous Subsets", "identify", "Description of class"]

Page 52: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

[Figure: same diagram as on the previous page ("Independent Variables", "classify", "Homogeneous Subsets", "identify", "Description of class"), with the two steps labelled "1. Clustering" and "2. Classification"]

Page 53: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Clustering

1. Clustering

› Reduces data to classes of equal types
› Become friends with the data
› Iterative algorithm:
  › Clustering
  › Validate
  › Classify
  › Apply

http://msdn.microsoft.com/en-us/library/ms174879.aspx

Page 54: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Data Mining Algorithms - Clustering

2. Classification

› Create a description of a group
› Give it a "name"
› Also: Characterization

Page 55: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Process

› Start with random values
› Rerunning will create different sets and different groups
› A different clustering technique / algorithm will create different groups
› Rerun on the same dataset, reseed
› Experts evaluate the found classes and their plausibility
› Good classes are used for predictions

Flow: 1. Clustering; Evaluate, Check; Good?; 2. Classify; Apply (Predict)

Page 56: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Clustering

MS Clustering Algorithm

› Combination of two algorithms
› K-Means: hard!
  › A data point can be in only one cluster
› Expectation Maximization: soft
  › A data point has different combinations
  › A data point belongs to different clusters
  › A probability is calculated

Source: http://msdn.microsoft.com/en-us/library/cc280445.aspx
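
The difference between hard and soft assignment can be shown on a single one-dimensional data point. This is a sketch, not the SSAS implementation: the cluster centers are hypothetical, and the soft assignment assumes Gaussian clusters with equal weight and equal spread `sigma`, which corresponds to one expectation step of EM under those assumptions.

```python
import math


def hard_assign(x, centers):
    """K-Means style: index of the single nearest cluster center."""
    return min(range(len(centers)), key=lambda k: abs(x - centers[k]))


def soft_assign(x, centers, sigma=1.0):
    """EM style: membership probability of x for every cluster."""
    weights = [math.exp(-((x - c) ** 2) / (2 * sigma ** 2)) for c in centers]
    total = sum(weights)
    return [w / total for w in weights]


centers = [0.0, 4.0]  # hypothetical cluster centers
point = 1.5

cluster = hard_assign(point, centers)
memberships = soft_assign(point, centers)
print("hard:", cluster)
print("soft:", [round(p, 3) for p in memberships])
```

The hard assignment discards the fact that the point also has some affinity to the second cluster; the soft assignment keeps it as a probability.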

Page 57: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Clustering

› Pros
  › No predictable variable to choose
  › Trains itself without much effort
  › Easy to configure
› "Cons"
  › Interpretation is everything
  › Good eye needed
  › Expert has to check for plausibility

Page 58: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Project: “DMDW Mining Test” (strongest relations only, number of matching cases for Region Europe)

Page 59: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Project: “DMDW Mining Test” (good to know: continuous attributes are shown by their arithmetic mean)

Page 60: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

Project: “DMDW Mining Test” (comparing two clusters)

Page 61: DMDW Lesson 05 + 06 + 07 - Data Mining Applied

THANK YOU FOR YOUR ATTENTION