Amer Kanj

Data Mining

For Business Professionals

Contents

• Data Mining Overview

• Types of Data Mining

• Why use Data Mining

• How do we Mine Data

• Models of Data Mining

Data Mining Overview

Data Mining deals with large volumes of data stored in a DBMS

It is the process of analyzing large databases to find useful patterns

Data Mining is the process of automating information discovery

It automates the process of discovering useful trends and patterns

Data Mining Overview (Cont)

The fundamental assumption of Data Mining is that large data may contain recurring hidden patterns

A Data Mining tool does not require any assumptions

It tries to discover relationships and hidden patterns that may not always be obvious

Types of Data Mining

Business professionals look for Data Mining approaches that meet their needs. They require Data Mining to:

• Be understandable

• Have good performance

• Be accurate

They define three fundamental approaches to Data Mining:

• Classification Studies

• Clustering Studies

• Visualization Studies

Classification Studies

Classification studies = Supervised learning

Very common in the business world.

A telecommunication company’s analyst wants to:

• Understand why some customers remain loyal while others leave

• Predict which customers the company is likely to lose to competitors

Classification Studies (cont)

So he can:

• Construct a model derived from historical data of loyal customers versus customers who have left

A good model enables him to better understand his customers and to predict which customers will stay and which will leave

A study will identify an overall goal and the data to be used

Classification Rules

Classification rules help assign new objects to a set of classes

Given a new automobile insurance applicant, should he/she be classified as low risk, medium risk, or high risk?

Classification rules for the above example could use a variety of knowledge, such as the educational level, salary, and age of the applicant

∀ person p, p.degree = masters and p.income > 75,000 ⇒ p.credit = excellent

∀ person p, p.degree = bachelors and (p.income >= 25,000 and p.income <= 75,000) ⇒ p.credit = good

Classification rules can be shown compactly as a decision tree
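The two credit rules above can be read as branches of a small decision tree. A minimal sketch follows; the function name and the "unknown" fallback are assumptions, since the slide does not say what happens when neither rule applies.

```python
def classify_credit(degree: str, income: float) -> str:
    """Apply the two example rules in order, like branches of a decision tree."""
    if degree == "masters" and income > 75_000:
        return "excellent"
    if degree == "bachelors" and 25_000 <= income <= 75_000:
        return "good"
    # Assumption: the slide gives no default class, so report "unknown".
    return "unknown"


print(classify_credit("masters", 90_000))
print(classify_credit("bachelors", 40_000))
print(classify_credit("bachelors", 10_000))
```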

Clustering Studies

Clustering Studies = Unsupervised Learning

A method of grouping rows of data that share similar trends and patterns

We have no dependent variable

Clustering can also be based on historical patterns, but the outcome (loyal or lost) is not supplied with the training data

Clustering techniques try to look for similarities within a data set and group similar rows together into clusters or segments
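As one possible illustration of grouping similar rows without an outcome column, here is a small k-means sketch. K-means is a swapped-in technique (the slides do not name a clustering algorithm), and the customer rows, the choice of two clusters, and the seeding with the first k rows are all assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(rows, k, iterations=10):
    """Group numeric rows into k clusters; returns one cluster index per row."""
    # Assumption: seed the centroids with the first k rows (keeps the run deterministic).
    centroids = [tuple(r) for r in rows[:k]]
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assign each row to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(row, centroids[c])) for row in rows]
        # Move each centroid to the mean of the rows assigned to it.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


# Hypothetical customers: (income in thousands, number of children).
customers = [(120, 1), (30, 3), (115, 0), (35, 2), (125, 1), (28, 3)]
labels = k_means(customers, k=2)
print(labels)
```

No "loyal" or "lost" column is supplied; the segments emerge only from the similarity of the rows. In practice the columns would usually be rescaled first so that income does not dominate the distance.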

Customers are clustered into four segments (Clusters 1 to 4):

Income: High, Children: 1, Car: Luxury

Income: High, Children: 0, Car: Compact

Income: Medium, Children: 2, Car: Sedan

Income: Medium, Children: 3, Car: Truck

Visualization

It is simply the graphical presentation of data

Microsoft Excel, for example, has graphing and mapping capabilities

Representing data graphically often brings out points that you would not normally see

Why use Data Mining

Direct Marketing

Trend Analysis

Fraud Detection

Forecasting in Financial Markets

Direct Marketing

The ability to predict who is most likely or most desirable to buy a certain product can save companies immense amounts in marketing expenditures

Trend Analysis

Understanding trends in the marketplace is a strategic advantage, because it is useful in reducing costs and time to market

Fraud Detection

Data Mining techniques can model which insurance claims, cellular phone calls, or credit card purchases are likely to be fraudulent

Forecasting in Financial Markets

Data mining is used extensively to model financial markets

How Do We Mine Data

There are five steps to Data Mining:

Data Preparation

Defining a study

Reading the data and building a model

Understanding the model

Prediction

Data Preparation

Data preparation is considered the heart of the Data Mining process

Data usually accumulates in transactional databases, where the actual records of transactions are stored

Data preparation requires that the data from distributed databases be pooled together and cleansed of redundant, inconsistent, incomplete, irrelevant, and otherwise inappropriate data

Data Preparation (Cont)

Data Cleaning:

A column containing a list of soft drinks may have the values “Pepsi”, “Pepsi Cola”, and “Cola”

The values refer to the same drink, but are not known to the computer as the same
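A minimal sketch of this kind of cleaning maps each spelling to one canonical value. The synonym table and the `clean_drink` helper are hypothetical; a real data set may need a much larger table or fuzzy matching.

```python
# Assumption: a hand-built synonym table covering the slide's three spellings.
CANONICAL = {"pepsi": "Pepsi", "pepsi cola": "Pepsi", "cola": "Pepsi"}


def clean_drink(value: str) -> str:
    """Normalize case and whitespace, then map known synonyms to one name."""
    key = " ".join(value.lower().split())
    # Unknown values pass through unchanged apart from trimming.
    return CANONICAL.get(key, value.strip())


raw = ["Pepsi", "Pepsi Cola", " cola ", "Sprite"]
cleaned = [clean_drink(v) for v in raw]
print(cleaned)
```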

Missing Values:

Some Data Mining approaches require rows of data to be complete in order to mine the data

If too many values are missing in a data set, it becomes hard to gather any useful information from the data or to make predictions from it
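One simple way to act on this is to measure how much of each column is missing and keep only the complete rows. The sketch below uses made-up rows and represents a missing value as `None`; both are assumptions for illustration.

```python
# Hypothetical customer rows; None marks a missing value.
rows = [
    {"customer": "A", "income": 52000, "children": 2},
    {"customer": "B", "income": None, "children": 1},
    {"customer": "C", "income": 61000, "children": None},
    {"customer": "D", "income": 48000, "children": 0},
]


def missing_fraction(rows, field):
    """Share of rows in which the given field is missing."""
    return sum(r[field] is None for r in rows) / len(rows)


# Keep only rows with every field present (0 is a real value, not a missing one).
complete_rows = [r for r in rows if all(v is not None for v in r.values())]

print(missing_fraction(rows, "income"))
print([r["customer"] for r in complete_rows])
```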

Data Preparation (Cont)

Data Derivation:

If columns called maximum$-2002 and maximum$-2003 describe the dollars spent in 2002 and 2003, then an interesting derivation is $-difference, which is the change in the amount of money spent between 2002 and 2003
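The derived column is a plain subtraction per row. In this sketch the field names `spent_2002`, `spent_2003`, and `difference` stand in for the slide's column names, and the amounts are invented.

```python
# Hypothetical rows; field names stand in for maximum$-2002 and maximum$-2003.
customers = [
    {"name": "A", "spent_2002": 1200.0, "spent_2003": 1500.0},
    {"name": "B", "spent_2002": 900.0, "spent_2003": 650.0},
]

for c in customers:
    # Derived field: change in spending from 2002 to 2003 (negative means a drop).
    c["difference"] = c["spent_2003"] - c["spent_2002"]

print([(c["name"], c["difference"]) for c in customers])
```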

Merging Data:

Data is usually stored in the form of tables

Merging data in a relational system can be achieved in a number of ways:

1. Merging tables through a view (Query Tools)

2. An SQL statement, or

3. An export of data into a flat file
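The first two options can be sketched with Python's built-in `sqlite3` module: a view that joins two tables, then an SQL statement that reads from it. The table names, columns, and rows are assumptions made up for this example.

```python
import sqlite3

# In-memory database with two hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE purchases (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO purchases VALUES (1, 20.0), (1, 5.0), (2, 12.5);

    -- Option 1: merge the tables through a view.
    CREATE VIEW customer_totals AS
        SELECT c.name AS name, SUM(p.amount) AS total
        FROM customers c JOIN purchases p ON p.customer_id = c.id
        GROUP BY c.id, c.name;
""")

# Option 2: an SQL statement, here reading the merged rows from the view.
merged = conn.execute(
    "SELECT name, total FROM customer_totals ORDER BY name"
).fetchall()
conn.close()
print(merged)
```

The merged rows could then be written to a flat file (option 3) for a mining tool that cannot read the database directly.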

Defining a Study

Defining a study differs between Supervised (Classification) and Unsupervised (Clustering) learning

For Supervised learning:

• Involves articulating a goal

• Specifying the data fields that are used in the study

For Unsupervised learning:

• The goal is to group similar types of data, which is used in many activities, or

• To identify exceptions in a data set, which is useful in discovering fraudulent or incorrect data

Read the data and build a Model

A data mining product reads a data set and constructs a model

A model will summarize large amounts of data by accumulating indicators

Such indicators include:

• Frequencies: show how often a certain value occurs

• Weights (or impacts): indicate how well some inputs indicate the occurrence of an output

• Conjunctions: sometimes inputs have more weight together than apart

• Differentiation: indicates how much more important an input criterion is to one outcome than to another
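To make the first two indicators concrete, the sketch below counts how often each input value occurs and how often it occurs together with the "lost" outcome. The data is invented, and using the per-plan lost rate as a stand-in for a weight is an assumption; actual tools may compute weights differently.

```python
from collections import Counter

# Hypothetical history of (plan type, outcome) pairs.
history = [
    ("contract", "loyal"), ("prepaid", "lost"), ("contract", "loyal"),
    ("prepaid", "loyal"), ("prepaid", "lost"), ("contract", "lost"),
]

# Frequencies: how often each input value occurs.
value_counts = Counter(plan for plan, _ in history)

# Joint counts of (input, outcome), used for a crude weight.
pair_counts = Counter(history)

# Assumed stand-in for a weight: share of each plan's customers that were lost.
lost_rate = {
    plan: pair_counts[(plan, "lost")] / value_counts[plan] for plan in value_counts
}

print(value_counts)
print(lost_rate)
```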

Understanding the Model

Model understanding takes different forms based on the type of model used to represent the data

We will discuss Data Mining Models later…

Prediction

Prediction is the process of choosing the best possible outcomes based on historical data

Predictive data mining methods fall into three broad categories:

• Mathematical methods

• Logic methods

• Distance methods

Prediction (Cont)

Mathematical methods:

• Linear math solution

• Non-linear math solution

Logic methods:

• Quite different from what math methods produce

• Logic methods often produce tree-like solutions

• The best-known logic solutions are decision trees and decision rules

Prediction (Cont)

Distance methods:

A representative sample of cases is kept on file

These cases will be used as a benchmark for classifying new cases

Features of the new case are measured against features of the benchmark cases for proximity
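A nearest-neighbor lookup is one common instance of a distance method. In this sketch the benchmark cases, their two features (age and years as a customer), and the use of squared Euclidean distance are assumptions for illustration.

```python
def nearest_case(new_case, benchmark):
    """Return the label of the stored benchmark case closest to new_case."""
    def sq_dist(a, b):
        # Squared Euclidean distance; enough for ranking by proximity.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(benchmark, key=lambda case: sq_dist(new_case, case[0]))
    return label


# Hypothetical benchmark cases kept on file: ((age, years as customer), outcome).
benchmark = [
    ((25, 1), "lost"),
    ((60, 12), "loyal"),
    ((30, 2), "lost"),
    ((55, 10), "loyal"),
]

print(nearest_case((58, 11), benchmark))
print(nearest_case((27, 1), benchmark))
```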

Prediction (Cont)

Here are a few interesting predictive capabilities:

Understanding why a prediction is made: some models will provide the reasons why a prediction is made

Margin of victory: if the best-case prediction has a score of 100 and the challenger prediction has a score of 50, then the margin of victory is 50%. If the prediction has a score of 100 and the challenger has 99, then the margin of victory would be 1%. Generally, the higher the margin of victory, the more likely the prediction is to be true
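Both margin-of-victory examples are consistent with dividing the gap between the two scores by the best score. That formula is inferred from the two examples rather than stated in the slides, so treat this sketch as an assumption.

```python
def margin_of_victory(best: float, challenger: float) -> float:
    """Gap between the top two scores as a fraction of the best score (inferred formula)."""
    if best <= 0:
        raise ValueError("best score must be positive")
    return (best - challenger) / best


print(f"{margin_of_victory(100, 50):.0%}")
print(f"{margin_of_victory(100, 99):.0%}")
```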

Prediction (Cont)

Scenario playing: some prediction models have the ability to change parameters to see how predictions change

Understanding prediction affinities: this means holding two variables constant and seeing what the other predictions would look like

Data Mining Models

Decision Trees

Genetic Algorithms

Neural Nets

Agent Network Technology

Hybrid Models

Statistics

Data Mining Models (Cont)

Decision Trees:

• Create a tree-like structure to describe a data set

• The greatest benefit of decision tree approaches is their understandability

Genetic Algorithms:

• A method of combinatorial optimization based on processes in biological evolution

Data Mining Models (Cont)

Neural Nets:

• Used extensively in the business world as predictive models

• Widely used in the financial market to model fraud in credit cards and monetary transactions

Agent Network Technology:

• This modeling method treats all data elements as agents that are connected to each other in a significant way

Data Mining Models (Cont)

Hybrid Models:

• Vendor tools that make use of more than one approach are referred to as hybrid systems

• Being a hybrid system does not always imply that the tool uses a hybrid algorithm

• For example, Thinking Machines, with their Darwin product, makes use of several different mining algorithms. While the algorithms themselves are not hybrid, the product uses the algorithms in combination

Data Mining Models (Cont)

Statistics:

Used to create a model of data sets

Uses probability, data analysis, and statistical inference

Thank You For Listening

Qs…QS…Qs