dmdw lesson 05 + 06 + 07 - data mining applied

Post on 20-Dec-2014


Data Mining Applied

Staatlich anerkannte Fachhochschule (state-recognized university of applied sciences)

Author I: Dipl.-Inf. (FH) Johannes Hoppe
Author II: M.Sc. Johannes Hofmeister
Author III: Prof. Dr. Dieter Homeister
Dates: 01.04.2011, 08.04.2011, 15.04.2011

01 Applications of Data Mining

› Database Marketing
› Time-series prediction, detecting "trends"
› Detection (of whatever is detectable)
› Probability Estimation
› Information Compression
› Sensitivity Analysis


Database Marketing (1/2)

Response modeling

› Model the response of specific customers.
› Systematic selection of (existing and potential) customers.
› Advertisements and promotions based on these results (see CRM).
› Visualization: a "lift chart" shows how successful the selection should be (later topic: DM validation).

Lift Chart Example

“For contacting 10% of customers, using no model we should get 10% of responders and using the given model we should get 30% of responders.”
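The lift behind this statement is the ratio of the two shares. A minimal Python sketch of that arithmetic, using only the numbers from the quote; the function name `lift` is an assumption made for illustration:

```python
def lift(share_of_responders: float, share_contacted: float) -> float:
    """Lift = share of all responders reached / share of customers contacted."""
    if share_contacted <= 0:
        raise ValueError("share_contacted must be positive")
    return share_of_responders / share_contacted


# Numbers from the quote: 10% of the customers are contacted.
baseline = lift(0.10, 0.10)    # no model: 10% of responders reached
with_model = lift(0.30, 0.10)  # given model: 30% of responders reached

print(f"lift without model: {baseline:.1f}")
print(f"lift with model:    {with_model:.1f}")
```

A lift of about 3 at the 10% mark means the model finds roughly three times as many responders as a random selection of the same size.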


Database Marketing (2/2)

Cross selling: Selling additional products to existing customers

› Question: Which customer might buy which other product?
› Uses historical purchase data
› Uses credit card information, lifestyle data, demographic data, etc.
› Other possible information: Did the customer request specific information? How did the customer hear of the company?


› Results for direct marketing, mailing lists, direct advertising (Amazon)

› Amazon: "Customers who bought this item also bought" and "personalized recommendations"


Time-series prediction

› Time series: stock prices, market shares, …
› Extrapolation of future values
› Detection of newly arising trends, such as customers moving to other products
› Own experience: German print magazines


Detection
Identification of the existence or occurrence of a condition

Fraud detection:
› Identifying patterns/criteria to detect credit card fraud
› Estimating creditworthiness (see the German Schufa)
› Predicting mail orders that will not be paid


Intrusion detection (in computer networks)
› Find patterns that indicate when an attack is made on a network
› E.g. clustering: small clusters are of high interest because they point to unusual cases
› Defining classes may be useful, e.g. harmless, possibly harmful, harmful, immediately close LAN


Typical difficulties
› Needs knowledge
› DM costs
› Cost of missing a fraud
› Cost of false positives (e.g. falsely accusing someone of fraud, company image problems)
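The three cost items above can be weighed against each other in a simple sum. The following Python sketch is only an illustration: every count and price in it is invented, and the function name `total_cost` is hypothetical.

```python
def total_cost(missed_frauds: int, false_positives: int,
               cost_per_missed_fraud: int, cost_per_false_positive: int,
               dm_cost: int) -> int:
    """Sum of DM costs, cost of missed frauds and cost of false positives."""
    return (dm_cost
            + missed_frauds * cost_per_missed_fraud
            + false_positives * cost_per_false_positive)


# Two hypothetical detector settings (all numbers are made up).
# A strict setting misses few frauds but raises many false alarms.
strict = total_cost(missed_frauds=5, false_positives=200,
                    cost_per_missed_fraud=500, cost_per_false_positive=20,
                    dm_cost=10_000)
# A lenient setting raises few false alarms but misses more frauds.
lenient = total_cost(missed_frauds=40, false_positives=10,
                     cost_per_missed_fraud=500, cost_per_false_positive=20,
                     dm_cost=10_000)

print("strict:", strict, "lenient:", lenient)
```

Which setting is cheaper depends entirely on the assumed prices; a high cost per false positive (for example, image damage) can reverse the result.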


Probability Estimation

› Approximate the likelihood of an event given an observation
› E.g. classify a potential customer into an A, B, C range before any business is done
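One simple way to approximate such a likelihood is the relative frequency of the event among past cases with the same observation. The Python sketch below assumes a hypothetical purchase history and invented A/B/C thresholds (0.5 and 0.2); none of these values come from the slides.

```python
# Hypothetical history: (observed customer segment, did the customer buy?)
history = [
    ("student", False), ("student", False), ("student", True), ("student", False),
    ("engineer", True), ("engineer", True), ("engineer", False), ("engineer", True),
    ("retiree", False), ("retiree", False), ("retiree", False), ("retiree", False),
]


def estimate_probability(observation, records):
    """Relative frequency of the event among records with the same observation."""
    outcomes = [bought for segment, bought in records if segment == observation]
    if not outcomes:
        return None  # no evidence for this observation
    return sum(outcomes) / len(outcomes)


def abc_class(probability, a_min=0.5, b_min=0.2):
    """Map an estimated probability to A, B or C (thresholds are illustrative)."""
    if probability >= a_min:
        return "A"
    if probability >= b_min:
        return "B"
    return "C"


for segment in ("engineer", "student", "retiree"):
    p = estimate_probability(segment, history)
    print(segment, p, abc_class(p))
```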


Information Compression

› Can be viewed as a special type of estimation problem
› For a given set of data, estimate the key components that can be used to construct the data


Sensitivity Analysis

› Understand how changes in one variable affect others
› Identify the sensitivity of one variable to another (find out whether dependencies exist)
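A common way to probe such dependencies is to change one input at a time and watch the output. The Python sketch below shows this one-at-a-time approach on a hypothetical model; the function `spending` and its coefficients are invented for illustration only.

```python
def sensitivity(model, inputs, name, delta=1.0):
    """Change in the model output per unit change of one input (finite difference)."""
    shifted = dict(inputs)
    shifted[name] = inputs[name] + delta
    return (model(**shifted) - model(**inputs)) / delta


def spending(income, age):
    # Hypothetical model: spending depends on income but not on age.
    return 0.5 * income + 10.0


base = {"income": 30_000.0, "age": 40.0}
income_effect = sensitivity(spending, base, "income")
age_effect = sensitivity(spending, base, "age")

print("income:", income_effect)  # non-zero: a dependency exists
print("age:", age_effect)        # zero: no dependency in this model
```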

02 Data Mining Algorithms

› Different algorithms, different uses
› Algorithms can be combined
› The choice of algorithm depends on what you want to do
› Not every algorithm is suited to every task


Algorithms in SSAS: Groups

› Classification algorithms
› Regression algorithms
› Association algorithms
› Segmentation algorithms
› Sequence analysis algorithms
› Plug-In algorithms


Classification algorithms

› Predict discrete attributes
› Based on experience values
› Algorithms in SSAS:
  › Naive Bayes
  › Decision Trees
  › Neural Networks


Regression algorithms

› Predict continuous attributes
› Otherwise the same as classification algorithms
› Algorithms in SSAS:
  › Linear Regression (line)
  › Logistic Regression (curve)
  › MS Time Series


Association algorithms

› Predict likely combinations
› Find elements that occur in combination
› Algorithms in SSAS:
  › MS Association Algorithm (Apriori)
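Finding "elements that occur in combination" starts with counting how often items appear together. The Python sketch below shows only this pair-counting step with a minimum support threshold; it is not the full Apriori candidate generation and not the SSAS implementation. The baskets and the name `frequent_pairs` are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]


def frequent_pairs(transactions, min_support):
    """Return item pairs whose support (share of baskets) reaches min_support."""
    counts = Counter()
    for items in transactions:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    total = len(transactions)
    return {pair: count / total
            for pair, count in counts.items()
            if count / total >= min_support}


pairs = frequent_pairs(baskets, min_support=0.6)
print(pairs)
```

With these baskets only bread and butter appear together in at least 60% of the cases.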


Segmentation algorithms

› Also called "clustering algorithms"
› Group data with similar properties
› Algorithms in SSAS:
  › MS Clustering Algorithm (e.g. K-Means)


Sequence analysis algorithms

› … are clustering algorithms
› Consider the order, i.e. the sequence of values, while clustering
› Do not group by similar properties
› Group by similar sequences
› Algorithms in SSAS:
  › MS Sequence Clustering


Plug-In algorithms

› .NET wrapper for COM objects
› Use ANY algorithm
› Provided as an assembly
  › (possible workshop to create one)

03 Repetition - Datatypes, Contenttypes

Applying an Algorithm

› Datatypes
› Contenttypes


Datatypes

› Define the structure of the values
› Available datatypes:
  › Text
  › Long
  › Boolean
  › Double
  › Date


Contenttypes

› Define the behaviour of values
  › Discrete
  › Continuous
  › Discretized
  › Key
  › Key Sequence
  › Key Time
  › Ordered
  › Cyclical


Contenttype: Discrete

› Fixed set of values
› Examples:
  › Commute Distance: 1-2, 2-5, 5-10
  › Region: Pacific, Northern America, Europe
  › Name: … … …
› Boolean values are always discrete
› Text is most likely discrete


Contenttype: Continuous

› Unlimited set of values
› Infinitely many values possible
› Examples:
  › Income
  › Age
› The difference between Continuous and Discrete is the most important one


Contenttype: Discretized

› Continuous values converted into discrete values
› Examples:
  › Income to categories: A, B, C, …
  › Age to groups: 0-20, 21-30, 31-40, …
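The age example can be sketched in Python as a lookup against the group boundaries. The groups 0-20, 21-30 and 31-40 are from the slide; the catch-all label "41+" is an assumption, since the slide leaves the remaining groups open ("…").

```python
import bisect

# Inclusive upper bounds of the groups named on the slide.
AGE_UPPER_BOUNDS = [20, 30, 40]
# "41+" is an assumed catch-all; the slide does not name the later groups.
AGE_LABELS = ["0-20", "21-30", "31-40", "41+"]


def discretize_age(age: int) -> str:
    """Convert a continuous age into one of the discrete age groups."""
    if age < 0:
        raise ValueError("age must not be negative")
    # bisect_left finds the first bound that is >= age.
    return AGE_LABELS[bisect.bisect_left(AGE_UPPER_BOUNDS, age)]


for age in (18, 20, 21, 40, 67):
    print(age, "->", discretize_age(age))
```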


Contenttype: Key

› Key
  › Uniquely identifies a row
› Key Sequence (sequence clustering models)
  › Series of events
  › Sorted
› Key Time (time series models)
  › Identifies values on a time scale


Contenttype: Ordered

› Discrete values that have a sort order
› No distances visible
› No relations visible
› Example: "One Star" to "Five Stars"


Contenttype: Cyclical

› Discrete values that have a cyclical sort order
› Examples:
  › Weekdays: Monday, Tuesday, …, Sunday, Monday, … (1, 2, 3, …, 7, 1, …)
  › Months: Jan, Feb, Mar, …, Dec, Jan, … (1, 2, 3, …, 12, 1, …)
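The wrap-around in these examples is modular arithmetic. A minimal Python sketch, assuming values are numbered from 1 as on the slide; the helper names `next_value` and `cyclic_distance` are hypothetical:

```python
def next_value(value: int, period: int, steps: int = 1) -> int:
    """Advance a 1-based cyclical value; after `period` comes 1 again."""
    return (value - 1 + steps) % period + 1


def cyclic_distance(a: int, b: int, period: int) -> int:
    """Shortest number of steps between two values on the cycle."""
    d = abs(a - b) % period
    return min(d, period - d)


print(next_value(7, period=7))           # weekday after Sunday (7)
print(next_value(12, period=12))         # month after December (12)
print(cyclic_distance(1, 7, period=7))   # Monday and Sunday are neighbours
```

This is also what separates a cyclical value from an ordered one: in a plain ordering, 1 and 7 are the two values furthest apart.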


Available Combinations

Datatype   Contenttype
Text       Discrete, Discretized, Sequence
Long       Continuous, Cyclical, Discrete, Discretized, Key Sequence, Key Time, Ordered, Sequence
Boolean    Discrete
Double     Continuous, Cyclical, Discrete, Discretized, Key Sequence, Key Time, Ordered, Sequence, Time
Date       Continuous, Discrete, Discretized, Key Time

04 Data Mining Algorithms - Decision Trees

In General

› Also known as: classification trees
› Goal: sequentially partition the data
› Can detect non-linear relationships
› Machine learning technique
  › Separate the data into a training set and a test set
  › The training set is used to build the model based on certain criteria
  › The test set is used to verify the model
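The separation into a training set and a test set can be sketched in plain Python as a shuffled split. The 30% test share, the seed, and the function name `train_test_split` are assumptions for illustration; the slides do not prescribe them.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle the cases and split them into (training set, test set)."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    shuffled = list(rows)      # copy, so the caller's data stays untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


cases = list(range(10))        # stand-in for ten customer records
train, test = train_test_split(cases)

print("training set:", train)  # used to build the model
print("test set:    ", test)   # held back to verify the model
```

The two sets must not overlap; otherwise the verification would be run on cases the model has already seen.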


Figure: Tree for the response to a mailing campaign

Total: 2.6% response rate (10,000 persons)
  Male: 3.2% (4,677 persons)
    Income > $30,000: 3.6%
    Income < $30,000: 2.3%
  Female: 2.1% (5,323 persons)
    Age < 40: 3.2%
    Age > 40: 3.8%


Using the Trained Tree

Trained tree: groups with a response rate > 3.5%
› Males: income > $30,000
› Females: age 40+

Example: management decides to mail only to groups with a response rate > 3.5%.
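Applying this rule means looking up each customer's leaf in the tree and keeping those above the threshold. The Python sketch below uses the leaf rates from the mailing tree; that men are split by income and women by age is inferred from the trained-tree slide, and the customer records and the function name `leaf_response_rate` are invented for illustration.

```python
THRESHOLD = 0.035  # management rule: mail only to groups above 3.5%


def leaf_response_rate(sex: str, income: float, age: int) -> float:
    """Response rate of the tree leaf a customer falls into.

    The slide gives no rule for income == 30,000 or age == 40;
    this sketch assigns those cases to the lower branch.
    """
    if sex == "male":
        return 0.036 if income > 30_000 else 0.023
    return 0.038 if age > 40 else 0.032


# Hypothetical customer records.
customers = [
    {"id": 1, "sex": "male", "income": 45_000, "age": 35},
    {"id": 2, "sex": "male", "income": 22_000, "age": 50},
    {"id": 3, "sex": "female", "income": 60_000, "age": 52},
    {"id": 4, "sex": "female", "income": 28_000, "age": 31},
]

mailing_list = [
    c["id"] for c in customers
    if leaf_response_rate(c["sex"], c["income"], c["age"]) > THRESHOLD
]
print(mailing_list)
```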


› Pros
  › Very flexible, white-box model
  › KISS: Keep it simple, stupid!
  › Little preparation and few resources needed
› Cons
  › Can be tuned to death
  › Takes a long time to build
  › Requires wisely selected training data!
    › False training yields false results
    › A big tree might require disk swapping (computation might be difficult if the tree does not fit into main memory)

Project: “DMDW Mining Test” (explanation of one node)

Project: “DMDW Mining Test” (shows connections; more useful if there are more predictable values)

Project: “DMDW Mining Test” (Generic Content Tree Viewer; DMX: Data Mining Extensions)


05 Data Mining Algorithms - Clustering

Clustering

› Segmentation algorithm
› Find homogeneous groups within a set
› Find similar variables for different cases
› Identify new relationships that were unclear before (heuristics)
  › E.g. "A person who rides a bike to work doesn't live far from his workplace" (this is not obvious)

[Figure: two-step process. 1. Clustering: classify the cases by their independent variables into homogeneous subsets. 2. Classification: identify a description of each class.]

1. Clustering

› Reduces data to classes of equal types
› Become friends with the data
› Iterative algorithm:
  › Clustering
  › Validate
  › Classify
  › Apply

http://msdn.microsoft.com/en-us/library/ms174879.aspx


2. Classification

› Create a description of a group
› Give it a "name"
› Also: characterization


Process

› Start with random values
› Each rerun will create different sets and different groups
› A different clustering technique / algorithm will create different groups
› Rerun on the same dataset with a new seed
› An expert evaluates the classes found and their plausibility
› Good classes are used for predictions

[Figure: 1. Clustering → Evaluate, check → Good? → 2. Classify → Apply (predict)]

MS Clustering Algorithm

› Combination of two algorithms
› K-Means: hard clustering
  › A data point can be in only one cluster
› Expectation Maximization: soft clustering
  › A data point belongs to different clusters
  › A probability is calculated for each combination of data point and cluster

Source: http://msdn.microsoft.com/en-us/library/cc280445.aspx
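The difference between hard and soft membership can be sketched in Python for one-dimensional points. This shows only the assignment step, not a full K-Means or Expectation Maximization run and not the SSAS implementation; the cluster centers, the equal-variance Gaussian weighting, and the function names are assumptions for illustration.

```python
import math


def hard_assign(x, centers):
    """K-Means style: the point belongs to exactly one cluster, the nearest."""
    return min(range(len(centers)), key=lambda k: abs(x - centers[k]))


def soft_assign(x, centers, sigma=1.0):
    """EM style: one membership probability per cluster.

    Assumes Gaussian clusters with equal spread `sigma` and equal weight.
    """
    weights = [math.exp(-((x - c) ** 2) / (2 * sigma ** 2)) for c in centers]
    total = sum(weights)
    return [w / total for w in weights]


centers = [0.0, 4.0]   # hypothetical cluster centers
point = 1.0

cluster = hard_assign(point, centers)
probabilities = soft_assign(point, centers)

print("hard:", cluster)
print("soft:", [round(p, 3) for p in probabilities])
```

The hard assignment returns a single cluster index, while the soft assignment keeps a small, non-zero probability for the more distant cluster.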


› Pros
  › No predictable variable to choose
  › Trains itself without much effort
  › Easy to configure
› "Cons"
  › Interpretation is everything
  › A good eye is needed
  › An expert has to check for plausibility

Project: “DMDW Mining Test” (strongest relations only; number of matching cases for Region Europe)

Project: “DMDW Mining Test” (good to know: continuous attributes are shown by their arithmetic average)

Project: “DMDW Mining Test” (comparing two clusters)

Thank you for your attention