Data Mining and Decision Trees

Prof. Sin-Min Lee

Department of Computer Science

Evolution of Database Technology

• 1960s: – Data collection, database creation, IMS and network DBMS

• 1970s: – Relational data model, relational DBMS implementation

• 1980s: – RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)

• 1990s–2000s: – Data mining and data warehousing, multimedia databases, and Web databases

What Is Data Mining?

• Data mining (knowledge discovery in databases):
– Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases

• Alternative names and their “inside stories”:
– Data mining: a misnomer?
– Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

• What is not data mining?
– (Deductive) query processing.
– Expert systems or small ML/statistical programs

Why Data Mining? Potential Applications

• Database analysis and decision support
– Market analysis and management
  • Target marketing, customer relation management, market basket analysis, cross selling, market segmentation
– Risk analysis and management
  • Forecasting, customer retention, improved underwriting, quality control, competitive analysis
– Fraud detection and management

• Other Applications
– Text mining (news group, email, documents) and Web analysis.
– Intelligent query answering

Market Analysis and Management (1)

• Where are the data sources for analysis?
– Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies

• Target marketing
– Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.

• Determine customer purchasing patterns over time
– Conversion of a single to a joint bank account: marriage, etc.

• Cross-market analysis
– Associations/co-relations between product sales
– Prediction based on the association information

Market Analysis and Management (2)

• Customer profiling
– data mining can tell you what types of customers buy what products (clustering or classification)

• Identifying customer requirements
– identifying the best products for different customers
– use prediction to find what factors will attract new customers

• Provides summary information
– various multidimensional summary reports
– statistical summary information (data central tendency and variation)

Corporate Analysis and Risk Management

• Finance planning and asset evaluation
– cash flow analysis and prediction
– contingent claim analysis to evaluate assets
– cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)

• Resource planning:
– summarize and compare the resources and spending

• Competition:
– monitor competitors and market directions
– group customers into classes and a class-based pricing procedure
– set pricing strategy in a highly competitive market

Fraud Detection and Management (1)

• Applications
– widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.

• Approach
– use historical data to build models of fraudulent behavior and use data mining to help identify similar instances

• Examples
– auto insurance: detect a group of people who stage accidents to collect on insurance
– money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
– medical insurance: detect professional patients and ring of doctors and ring of references

Fraud Detection and Management (2)

• Detecting inappropriate medical treatment
– Australian Health Insurance Commission identifies that in many cases blanket screening tests were requested (save Australian $1m/yr).

• Detecting telephone fraud
– Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
– British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.

• Retail
– Analysts estimate that 38% of retail shrink is due to dishonest employees.

Other Applications

• Sports
– IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat

• Astronomy
– JPL and the Palomar Observatory discovered 22 quasars with the help of data mining

• Internet Web Surf-Aid
– IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.

Data Mining: A KDD Process

– Data mining: the core of knowledge discovery process.

[Figure: the KDD process. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]

Steps of a KDD Process

• Learning the application domain:
– relevant prior knowledge and goals of application

• Creating a target data set: data selection

• Data cleaning and preprocessing: (may take 60% of effort!)

• Data reduction and transformation:
– Find useful features, dimensionality/variable reduction, invariant representation.

• Choosing functions of data mining
– summarization, classification, regression, association, clustering.

• Choosing the mining algorithm(s)

• Data mining: search for patterns of interest

• Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.

• Use of discovered knowledge

Area 1: Risk Analysis

• Insurance companies and banks use data mining for risk analysis.

• An insurance company searches its own insurants and claims databases for relationships between personal characteristics and claim behavior.

Area 1: Risk Analysis (continued)

• The company is especially interested in the characteristics of insurants whose claim behavior deviates strongly from the norm.

• With data mining, these so-called risk profiles can be discovered, and the company can use this information to adapt its premium policy.

Area 2: Direct Marketing

• Data mining can also be used to discover the relationship between one’s personal characteristics, e.g. age, gender, hometown, and the probability that one will respond to a mailing.

• Such relationships can be used to select those customers from the mailing database that have the highest probability of responding to a mailing.

• This allows the company to mail its prospects selectively, thus maximizing the response.

• For example:

1. Company X sends a mailing to a number of prospects.

2. The response is 2%.

What Data Mining can do

• Enables companies to determine relationships among “internal” and “external” factors.

• Predicts cross-sell opportunities and makes recommendations.

• Segments markets and personalizes communications.

• Predicts outcomes of future situations.

The Process of Data Mining

• There are 3 main steps in the Data Mining process:
– Preparation: data is selected from the warehouse and “cleansed”.
– Processing: algorithms are used to process the data. This step uses modeling to make predictions.
– Analysis: output is evaluated.

Reasons for growing popularity

• Growing data volume: an enormous amount of existing and newly appearing data requires processing.

• Limitations of human analysis: humans lack objectivity when analyzing dependencies in data.

• Low cost of machine learning: the data mining process costs less than hiring highly trained professionals to analyze data.

Data Mining Techniques

• Association rules: discover interesting associations between attributes contained in a database.

• Clustering: finds appropriate groupings of elements in a data set.

• Sequential patterns: looks for patterns where one event leads to a later event.

• Classification: assigns people or things to groups by recognizing patterns.
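To make the association-rule technique above more concrete, here is a minimal sketch that measures how "interesting" a rule is using the usual support and confidence measures. The transactions, the thresholds, and the function names (`support`, `confidence`, `rules`) are all hypothetical and chosen only for illustration; they do not come from the slides.

```python
from itertools import combinations

# Hypothetical market-basket transactions (illustrative only, not from the slides).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) = support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def rules(transactions, min_support=0.4, min_confidence=0.7):
    """Enumerate single-item rules lhs -> rhs that pass both thresholds."""
    items = sorted(set().union(*transactions))
    found = []
    for a, b in combinations(items, 2):
        for lhs, rhs in ((a, b), (b, a)):
            s = support({lhs, rhs}, transactions)
            if s < min_support:
                continue
            c = confidence({lhs}, {rhs}, transactions)
            if c >= min_confidence:
                found.append((lhs, rhs, s, c))
    return found

for lhs, rhs, s, c in rules(transactions):
    print(f"{lhs} -> {rhs}: support={s:.2f}, confidence={c:.2f}")
```

On this toy data only the bread/butter pair passes both thresholds. Real association miners such as Apriori prune the search instead of enumerating every pair; this sketch only shows the two measures.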

Applications of Data Mining

• Data Mining is applied in the following areas:
– Prediction of the Stock Market: predicting the future trends.
– Bankruptcy prediction: prediction based on computer generated rules, using models
– Foreign Exchange Market: Data Mining is used to identify trading rules.
– Fraud Detection: construction of algorithms and models that will help recognize a variety of fraud patterns.

Results of Data Mining Include:

• Forecasting what may happen in the future

• Classifying people or things into groups by recognizing patterns

• Clustering people or things into groups based on their attributes

• Associating what events are likely to occur together

• Sequencing what events are likely to lead to later events

Data mining is not

•Brute-force crunching of bulk data

•“Blind” application of algorithms

•Going to find relationships where none exist

•Presenting data in different ways

•A database intensive task

•A difficult to understand technology requiring an advanced degree in computer science

What data mining has done for...

The US Internal Revenue Service needed to improve customer service and...
scheduled its workforce to provide faster, more accurate answers to questions.

The US Drug Enforcement Agency needed to be more effective in their drug “busts” and...
analyzed suspects’ cell phone usage to focus investigations.

HSBC needed to cross-sell more effectively by identifying profiles that would be interested in higher-yielding investments and...
reduced direct mail costs by 30% while garnering 95% of the campaign’s revenue.

Data Mining process model -DM

Search in State Spaces

Decision Trees

•A decision tree is a special case of a state-space graph.

•It is a rooted tree in which each internal node corresponds to a decision, with a subtree at these nodes for each possible outcome of the decision.

•Decision trees can be used to model problems in which a series of decisions leads to a solution.

•The possible solutions of the problem correspond to the paths from the root to the leaves of the decision tree.

Decision Trees

•Example: The n-queens problem

•How can we place n queens on an n×n chessboard so that no two queens can capture each other?

[Figure: chessboard showing a queen Q with the squares it can reach marked x]

A queen can move any number of squares horizontally, vertically, and diagonally.

Here, the possible target squares of the queen Q are marked with an x.

•Let us consider the 4-queens problem.

•Question: How many possible configurations of 4×4 chessboards containing 4 queens are there?

•Answer: There are 16!/(12!·4!) = (13·14·15·16)/(2·3·4) = 13·7·5·4 = 1820 possible configurations.

•Shall we simply try them out one by one until we encounter a solution?

•No, it is generally useful to think about a search problem more carefully and discover constraints on the problem’s solutions.

•Such constraints can dramatically reduce the size of the relevant state space.
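The count quoted above can be checked directly. The short sketch below computes the binomial coefficient and then, as an illustration of "trying them out one by one", enumerates every placement of 4 queens on the 16 squares. The helper name `attacks` and the (row, column) encoding are assumptions made for this example.

```python
from itertools import combinations
from math import comb

# Closed form: choose 4 of the 16 squares for the queens.
print(comb(16, 4))

def attacks(a, b):
    """True if queens on squares a and b (row, column) can capture each other."""
    (r1, c1), (r2, c2) = a, b
    return r1 == r2 or c1 == c2 or abs(r1 - r2) == abs(c1 - c2)

# Brute force: every way to put 4 queens on a 4x4 board.
squares = [(r, c) for r in range(4) for c in range(4)]
placements = list(combinations(squares, 4))

# Keep only the placements in which no pair of queens attacks each other.
solutions = [p for p in placements
             if not any(attacks(a, b) for a, b in combinations(p, 2))]

print(len(placements), len(solutions))
```

Restricting the search to one queen per column, for instance, leaves only 4**4 = 256 candidate boards instead of 1820.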

Obviously, in any solution of the n-queens problem, there must be exactly one queen in each column of the board.

Otherwise, the two queens in the same column could capture each other.

Therefore, we can describe the solution of this problem as a sequence of n decisions:

Decision 1: Place a queen in the first column.

Decision 2: Place a queen in the second column.

...

Decision n: Place a queen in the n-th column.

Backtracking in Decision Trees

[Figure: backtracking search tree for the 4-queens problem. Levels from top to bottom: empty board, place 1st queen, place 2nd queen, place 3rd queen, place 4th queen]
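Backtracking over the column-by-column decisions can be sketched as a depth-first search. This is a minimal illustration under the one-queen-per-column formulation, not code from the slides; the function name `solve_n_queens` and the encoding of a board as a list of row indices are assumptions.

```python
def solve_n_queens(n):
    """Return all solutions; each solution gives the row chosen for columns 0..n-1."""
    solutions = []
    rows = []  # rows[c] is the row of the queen already placed in column c

    def safe(row):
        """Can a queen go in `row` of the next free column?"""
        col = len(rows)
        return all(r != row and abs(r - row) != col - c
                   for c, r in enumerate(rows))

    def place():
        if len(rows) == n:          # all n decisions made: a leaf that is a solution
            solutions.append(tuple(rows))
            return
        for row in range(n):        # one branch per possible outcome of the decision
            if safe(row):
                rows.append(row)    # take the decision for this column
                place()             # descend into the subtree
                rows.pop()          # backtrack and try the next row

    place()
    return solutions

print(solve_n_queens(4))
```

Each level of the recursion corresponds to one decision in the tree; a branch is abandoned as soon as no safe row exists for the next column.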

Neural Network

• Many inputs and a single output

• Trained on signal and background sample

• Well understood and mostly accepted in HEP

Decision Tree

• Many inputs and a single output

• Trained on signal and background sample

• Used mostly in life sciences & business

Decision tree: Basic Algorithm

• Initialize top node to all examples

• While impure leaves available
– select next impure leaf L
– find splitting attribute A with maximal information gain
– for each value of A add child to L
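The basic algorithm above can be sketched in a few lines. This version is an assumption-laden illustration: it uses recursion instead of an explicit loop over impure leaves, represents the tree as nested dictionaries, and the names `entropy`, `info_gain`, and `build_tree` are hypothetical. The training rows are the 14-row weather table used in these slides.

```python
from collections import Counter
from math import log2

COLUMNS = ["outlook", "temperature", "humidity", "windy", "play"]
DATA = """sunny hot high FALSE no
sunny hot high TRUE no
overcast hot high FALSE yes
rainy mild high FALSE yes
rainy cool normal FALSE yes
rainy cool normal TRUE no
overcast cool normal TRUE yes
sunny mild high FALSE no
sunny cool normal FALSE yes
rainy mild normal FALSE yes
sunny mild normal TRUE yes
overcast mild high TRUE yes
overcast hot normal FALSE yes
rainy mild high TRUE no"""
rows = [dict(zip(COLUMNS, line.split())) for line in DATA.splitlines()]

def entropy(labels):
    """Impurity of a list of class labels, in bits."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    """Entropy of the node minus the weighted entropy of its children."""
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[target] for r in rows]) - remainder

def build_tree(rows, attrs, target):
    labels = [r[target] for r in rows]
    # Pure leaf, or no attribute left to split on: return the majority class.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Splitting attribute with maximal information gain.
    best = max(attrs, key=lambda a: info_gain(rows, a, target))
    rest = [a for a in attrs if a != best]
    # One child per observed value of the chosen attribute.
    return {best: {v: build_tree([r for r in rows if r[best] == v], rest, target)
                   for v in sorted({r[best] for r in rows})}}

tree = build_tree(rows, COLUMNS[:-1], "play")
print(tree)
```

Because every recursive call works on an in-memory list of rows, the sketch shares the limitation of simple depth-first construction: the whole data set has to fit in memory.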

Decision tree: Find good split

• Sufficient statistics to compute info gain: count matrix

outlook   temperature  humidity  windy  play
sunny     hot          high      FALSE  no
sunny     hot          high      TRUE   no
overcast  hot          high      FALSE  yes
rainy     mild         high      FALSE  yes
rainy     cool         normal    FALSE  yes
rainy     cool         normal    TRUE   no
overcast  cool         normal    TRUE   yes
sunny     mild         high      FALSE  no
sunny     cool         normal    FALSE  yes
rainy     mild         normal    FALSE  yes
sunny     mild         normal    TRUE   yes
overcast  mild         high      TRUE   yes
overcast  hot          normal    FALSE  yes
rainy     mild         high      TRUE   no

outlook      play  don't play
sunny        2     3
overcast     4     0
rainy        3     2
gain: 0.25 bits

humidity     play  don't play
high         3     4
normal       6     1
gain: 0.15 bits

temperature  play  don't play
hot          2     2
mild         4     2
cool         3     1
gain: 0.03 bits

windy        play  don't play
FALSE        6     2
TRUE         3     3
gain: 0.05 bits
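Because the count matrix is a sufficient statistic, the information gain of each candidate split can be recomputed from the counts alone, without going back to the 14 rows. The sketch below assumes the standard entropy-based definition of information gain; the function names and the `COUNTS` dictionary layout are illustrative choices.

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(count_matrix):
    """count_matrix maps each attribute value to (play, don't play) counts."""
    class_totals = [sum(col) for col in zip(*count_matrix.values())]
    n = sum(class_totals)
    # Weighted average entropy of the children produced by the split.
    remainder = sum(sum(row) / n * entropy(row) for row in count_matrix.values())
    return entropy(class_totals) - remainder

# Count matrices copied from the slide.
COUNTS = {
    "outlook": {"sunny": (2, 3), "overcast": (4, 0), "rainy": (3, 2)},
    "humidity": {"high": (3, 4), "normal": (6, 1)},
    "temperature": {"hot": (2, 2), "mild": (4, 2), "cool": (3, 1)},
    "windy": {"FALSE": (6, 2), "TRUE": (3, 3)},
}

gains = {attr: info_gain(matrix) for attr, matrix in COUNTS.items()}
for attr, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{attr}: {gain:.3f} bits")
```

The attribute with the largest gain, outlook, is the one the basic algorithm would choose for the root split.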

Decision trees

• Simple depth-first construction

• Needs entire data to fit in memory

• Unsuitable for large data sets

• Need to “scale up”

Decision Trees

Planning Tool

Decision Trees

• Enable a business to quantify decision making

• Useful when the outcomes are uncertain

• Places a numerical value on likely or potential outcomes

• Allows comparison of different possible decisions to be made

Decision Trees

• Limitations:
– How accurate is the data used in the construction of the tree?
– How reliable are the estimates of the probabilities?
– Data may be historical: does this data relate to real time?
– Necessity of factoring in the qualitative factors: human resources, motivation, reaction, relations with suppliers and other stakeholders

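The Process

[Figure: decision tree for the outlet decision. Decision (square): "Expand by opening new outlet" or "Maintain current status" (£0). Chance node (circle) after expanding: economic growth rises, probability 0.7, expected outcome £300,000; economic growth declines, probability 0.3, expected outcome -£500,000]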

A square denotes the point where a decision is made. In this example, a business is contemplating opening a new outlet. The uncertainty is the state of the economy: if the economy continues to grow healthily, the option is estimated to yield profits of £300,000. However, if the economy fails to grow as expected, the potential loss is estimated at £500,000.

There is also the option to do nothing and maintain the status quo. This would have an outcome of £0.

The circle denotes the point where different outcomes could occur. The estimates of the probability and the knowledge of the expected outcome allow the firm to make a calculation of the likely return. In this example it is:

Economic growth rises: 0.7 x £300,000 = £210,000

Economic growth declines: 0.3 x -£500,000 = -£150,000

The calculation would suggest it is wise to go ahead with the decision (a net ‘benefit’ figure of +£60,000).

The Process

[Figure: the same decision tree with revised probabilities. Economic growth rises, probability 0.5, expected outcome £300,000; economic growth declines, probability 0.5, expected outcome -£500,000; maintain current status, £0]

Look what happens however if the probabilities change. If the firm is unsure of the potential for growth, it might estimate it at 50:50. In this case the outcomes will be:

Economic growth rises: 0.5 x £300,000 = £150,000

Economic growth declines: 0.5 x -£500,000 = -£250,000

In this instance, the net benefit is -£100,000 – the decision looks less favourable!
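The expected-value arithmetic for both sets of probabilities can be reproduced with a small helper. This is a sketch under the slide's figures (£300,000 profit, £500,000 loss, £0 for doing nothing); the function names and the dictionary layout are hypothetical.

```python
def expected_value(outcomes):
    """outcomes: list of (probability, payoff) pairs for one chance node."""
    if abs(sum(p for p, _ in outcomes) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * payoff for p, payoff in outcomes)

def outlet_decision(p_growth):
    """The two options at the decision node, given the probability of growth."""
    return {
        "expand": [(p_growth, 300_000), (1 - p_growth, -500_000)],
        "maintain": [(1.0, 0)],
    }

def best_option(options):
    """Pick the option whose chance node has the highest expected value."""
    return max(options, key=lambda name: expected_value(options[name]))

for p in (0.7, 0.5):
    options = outlet_decision(p)
    net = expected_value(options["expand"])
    print(f"p(growth)={p}: expand is worth £{net:,.0f}, choose {best_option(options)}")
```

With a growth probability of 0.7 the expansion has a positive expected value and is preferred; at 0.5 the expected value is negative, so maintaining the current status scores higher.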

Advantages

Disadvantages

[Figure: trained decision tree, labelled (Binned Likelihood Fit) and (Limit)]

Decision Trees from Data Base

Ex Num  Att Size  Att Colour  Att Shape  Concept Satisfied
1       med       blue        brick      yes
2       small     red         wedge      no
3       small     red         sphere     yes
4       large     red         wedge      no
5       large     green       pillar     yes
6       large     red         pillar     no
7       large     green       sphere     yes

Choose target: Concept satisfied

Use all attributes except Ex Num

Rules from Tree

IF (SIZE = large AND ((SHAPE = wedge) OR (SHAPE = pillar AND COLOUR = red)))
OR (SIZE = small AND SHAPE = wedge)
THEN NO

IF (SIZE = large AND ((SHAPE = pillar AND COLOUR = green) OR (SHAPE = sphere)))
OR (SIZE = small AND SHAPE = sphere)
OR (SIZE = medium)
THEN YES

Disjunctive Normal Form - DNF

IF (SIZE = medium)
OR (SIZE = small AND SHAPE = sphere)
OR (SIZE = large AND SHAPE = sphere)
OR (SIZE = large AND SHAPE = pillar AND COLOUR = green)
THEN CONCEPT = satisfied

ELSE CONCEPT = not satisfied
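The DNF rule can be written as a single boolean expression and checked against the seven examples in the table. This is a minimal sketch; the function name `satisfied` and the tuple layout of `EXAMPLES` are assumptions, and the code accepts both "med" (as in the table) and "medium" (as in the rule).

```python
# (Ex Num, Size, Colour, Shape, Concept Satisfied), copied from the table.
EXAMPLES = [
    (1, "med", "blue", "brick", "yes"),
    (2, "small", "red", "wedge", "no"),
    (3, "small", "red", "sphere", "yes"),
    (4, "large", "red", "wedge", "no"),
    (5, "large", "green", "pillar", "yes"),
    (6, "large", "red", "pillar", "no"),
    (7, "large", "green", "sphere", "yes"),
]

def satisfied(size, colour, shape):
    """One disjunct per line, each a conjunction of attribute tests (DNF)."""
    return (
        size in ("med", "medium")
        or (size == "small" and shape == "sphere")
        or (size == "large" and shape == "sphere")
        or (size == "large" and shape == "pillar" and colour == "green")
    )

# Compare the rule's prediction with the recorded concept for every example.
for num, size, colour, shape, concept in EXAMPLES:
    predicted = "yes" if satisfied(size, colour, shape) else "no"
    print(num, predicted, "ok" if predicted == concept else "MISMATCH")
```

Each disjunct corresponds to one root-to-leaf path of the tree that ends in a "satisfied" leaf, which is why a decision tree can always be rewritten in this form.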
