Data Mining
Classification and Prediction
- The Course
(Overview figure: data sources (DS) feed a staging database (DP) and a data warehouse (DW), which in turn support OLAP and data mining (DM) tasks: association, classification, clustering. Legend: DS = Data source, DW = Data warehouse, DM = Data Mining, DP = Staging Database.)
Chapter Objectives
Learn basic techniques for data classification and prediction.
Realize the difference between the following classifications of data:
– supervised classification – prediction
– unsupervised classification
Chapter Outline
What is classification and prediction of data?
How do we classify data by decision tree induction?
What are neural networks and how can they classify?
What is Bayesian classification?
Are there other classification techniques?
How do we predict continuous values?
What is Classification?
The goal of data classification is to organize and categorize data in distinct classes.
– A model is first created based on the data distribution.
– The model is then used to classify new data.
– Given the model, a class can be predicted for new data.
Classification = prediction for discrete and nominal values
What is Prediction?
The goal of prediction is to forecast or deduce the value of an attribute based on values of other attributes.
– A model is first created based on the data distribution.
– The model is then used to predict future or unknown values
In Data Mining
– If forecasting a discrete value → Classification
– If forecasting a continuous value → Prediction
Supervised and Unsupervised
Supervised Classification = Classification
– We know the class labels and the number of classes
Unsupervised Classification = Clustering
– We do not know the class labels and may not know the number of classes
Preparing Data Before Classification
Data transformation:
– Discretization of continuous data
– Normalization to [-1..1] or [0..1]
Data Cleaning:
– Smoothing to reduce noise
Relevance Analysis:
– Feature selection to eliminate irrelevant attributes
Application
Credit approval
Target marketing
Medical diagnosis
Defective parts identification in manufacturing
Crime zoning
Treatment effectiveness analysis
Etc
Classification is a 3-step process
1. Model construction (Learning):
• Each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the class label.
• The set of all tuples used for construction of the model is called training set.
– The model is represented in the following forms:
• Classification rules, (IF-THEN statements),
• Decision tree
• Mathematical formulae
1. Classification Process (Learning)
Training Data:
Name   Income  Age       Credit rating (class)
Samir  Low     <30       bad
Ahmed  Medium  [30..40]  good
Salah  High    <30       good
Ali    Medium  >40       good
Sami   Low     [30..40]  good
Emad   Medium  <30       bad

A classification method builds the classification model from this training data, represented for example as:
IF Income = 'High' OR Age > 30 THEN Class = 'Good'
OR as a decision tree, OR as a mathematical formula.
Classification is a 3-step process
2. Model Evaluation (Accuracy):
– Estimate accuracy rate of the model based on a test set.
– The known label of test sample is compared with the classified result from the model.
– Accuracy rate is the percentage of test set samples that are correctly classified by the model.
– Test set is independent of training set otherwise over-fitting will occur
2. Classification Process (Accuracy Evaluation)

Test data (known class) compared with the model's output:
Name   Income  Age       Credit rating (class)  Model prediction
Naser  Low     <30       Bad                    Bad
Lutfi  Medium  <30       Bad                    good
Adel   High    >40       good                   good
Fahd   Medium  [30..40]  good                   good

Accuracy = 75% (three of the four test tuples are classified correctly)
Classification is a three-step process
3. Model Use (Classification):
– The model is used to classify unseen objects.
• Give a class label to a new tuple
• Predict the value of an actual attribute
3. Classification Process (Use)

Name   Income  Age   Credit rating
Adham  Low     <30   ?   (to be assigned by the classification model)
Classification Methods
Decision Tree Induction
Neural Networks
Bayesian Classification
Association-Based Classification
K-Nearest Neighbour
Case-Based Reasoning
Genetic Algorithms
Rough Set Theory
Fuzzy Sets
Etc.
Evaluating Classification Methods
Predictive accuracy
– Ability of the model to correctly predict the class label
Speed and scalability
– Time to construct the model
– Time to use the model
Robustness
– Handling noise and missing values
Scalability
– Efficiency in large databases (data that is not memory resident)
Interpretability
– The level of understanding and insight provided by the model
Chapter Outline
What is classification and prediction of data?
How do we classify data by decision tree induction?
What are neural networks and how can they classify?
What is Bayesian classification?
Are there other classification techniques?
How do we predict continuous values?
Decision Tree
What is a Decision Tree?
A decision tree is a flow-chart-like tree structure.
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
• All tuples in branch have the same value for the tested attribute.
Leaf node represents class label or class label distribution
Sample Decision Tree
(Figure: customers plotted by Income (2000–10000) against Age (20–80); a single test on Income separates them.)
Income < 6K → No (fair customers); Income >= 6K → Yes (excellent customers)
Sample Decision Tree
(Figure: the same customers, now separated by two tests.)
Income < 6k → NO
Income >= 6k → test Age: Age < 50 → Yes; Age >= 50 → NO
Sample Decision Tree
Outlook   Temp  Humidity  Windy  Play?
sunny     hot   high      FALSE  No
sunny     hot   high      TRUE   No
overcast  hot   high      FALSE  Yes
rainy     mild  high      FALSE  Yes
rainy     cool  normal    FALSE  Yes
rainy     cool  normal    TRUE   No
overcast  cool  normal    TRUE   Yes
sunny     mild  high      FALSE  No
sunny     cool  normal    FALSE  Yes
rainy     mild  normal    FALSE  Yes
sunny     mild  normal    TRUE   Yes
overcast  mild  high      TRUE   Yes
overcast  hot   normal    FALSE  Yes
rainy     mild  high      TRUE   No
http://www-lmmb.ncifcrf.gov/~toms/paper/primer/latex/index.html
http://directory.google.com/Top/Science/Math/Applications/Information_Theory/Papers/
Decision-Tree Classification Methods
The basic top-down decision tree generation approach usually consists of two phases:
1. Tree construction
• At the start, all the training examples are at the root.
• Examples are partitioned recursively based on selected attributes.
2. Tree pruning
• Aims at removing tree branches that may reflect noise in the training data and lead to errors when classifying test data, thereby improving classification accuracy.
How to Specify Test Condition?
Depends on attribute types:
– Nominal
– Ordinal
– Continuous
Depends on the number of ways to split:
– 2-way split
– Multi-way split
Splitting Based on Nominal Attributes
Multi-way split: Use as many partitions as distinct values.
  CarType → {Family}, {Sports}, {Luxury}
Binary split: Divides values into two subsets; need to find the optimal partitioning.
  CarType → {Family, Luxury} vs {Sports}, OR {Sports, Luxury} vs {Family}
Splitting Based on Ordinal Attributes
Multi-way split: Use as many partitions as distinct values.
  Size → {Small}, {Medium}, {Large}
Binary split: Divides values into two subsets; need to find the optimal partitioning.
  Size → {Small, Medium} vs {Large}, OR {Medium, Large} vs {Small}
What about this split? Size → {Small, Large} vs {Medium} (it does not respect the order of the values)
Splitting Based on Continuous Attributes
Different ways of handling:
– Discretization to form an ordinal categorical attribute
  • Static – discretize once at the beginning
  • Dynamic – ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering
– Binary decision: (A < v) or (A >= v)
  • consider all possible splits and find the best cut
  • can be more compute intensive
Splitting Based on Continuous Attributes
(i) Binary split: Taxable Income > 80K?  Yes / No
(ii) Multi-way split: Taxable Income in  < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K
Tree Induction
Greedy strategy:
– Split the records based on an attribute test that optimizes a certain criterion.
Issues:
– Determine how to split the records
  • How to specify the attribute test condition?
  • How to determine the best split?
– Determine when to stop splitting
How to determine the Best Split
(Example: starting from all customers, do we split on Income (<10k vs >=10k) or on Age (young vs old) to separate good customers from fair customers?)
How to determine the Best Split
Greedy approach:
– Nodes with a homogeneous class distribution are preferred
Need a measure of node impurity:
– 50% red / 50% green: high degree of impurity
– 75% red / 25% green: lower degree of impurity
– 100% red / 0% green: pure (minimum impurity)
Measures of Node Impurity
Information Gain – uses entropy
Gain Ratio – uses information gain and SplitInfo
Gini Index – used only for binary splits
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm):
– Tree is constructed in a top-down recursive divide-and-conquer manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized in advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping partitioning:
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
– There are no samples left
Classification Algorithms
ID3 – uses information gain
C4.5 – uses gain ratio
CART – uses the Gini index
Entropy: Used by ID3
Entropy measures the impurity of S, where S is a set of examples, p is the proportion of positive examples, and q is the proportion of negative examples.
Entropy(S) = - p log2 p - q log2 q
ID3 example (play vs don't play):
pno = 5/14
pyes = 9/14
Impurity = - pyes log2 pyes - pno log2 pno
         = - 9/14 log2 9/14 - 5/14 log2 5/14
         = 0.94 bits
Training data: the 14-tuple weather dataset shown above. The entropy at the root (9 play, 5 don't play) is 0.94 bits — the amount of information required to specify the class of an example given that it reaches the node.

Candidate splits at the root, with class counts (play / don't play), the weighted entropy after the split, and the information gain:
outlook:     sunny 2/3, overcast 4/0, rainy 3/2  →  0.97×5/14 + 0.0×4/14 + 0.97×5/14 = 0.69 bits, gain = 0.25 bits
temperature: hot 2/2, mild 4/2, cool 3/1         →  1.0×4/14 + 0.92×6/14 + 0.81×4/14 = 0.91 bits, gain = 0.03 bits
humidity:    high 3/4, normal 6/1                →  0.98×7/14 + 0.59×7/14 = 0.79 bits, gain = 0.15 bits
windy:       false 6/2, true 3/3                 →  0.81×8/14 + 1.0×6/14 = 0.89 bits, gain = 0.05 bits

Outlook gives the maximal information gain, so ID3 splits the root on outlook (sunny / overcast / rainy).
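To make the computation concrete, here is a minimal Python sketch (my own illustration, not from the slides) that reproduces the root-node numbers above for the weather data; the helper names entropy and info_gain are hypothetical.

from collections import Counter
from math import log2

# Weather training data: (outlook, temperature, humidity, windy, play)
data = [
    ("sunny","hot","high",False,"no"),     ("sunny","hot","high",True,"no"),
    ("overcast","hot","high",False,"yes"), ("rainy","mild","high",False,"yes"),
    ("rainy","cool","normal",False,"yes"), ("rainy","cool","normal",True,"no"),
    ("overcast","cool","normal",True,"yes"),("sunny","mild","high",False,"no"),
    ("sunny","cool","normal",False,"yes"), ("rainy","mild","normal",False,"yes"),
    ("sunny","mild","normal",True,"yes"),  ("overcast","mild","high",True,"yes"),
    ("overcast","hot","normal",False,"yes"),("rainy","mild","high",True,"no"),
]
attrs = {"outlook": 0, "temperature": 1, "humidity": 2, "windy": 3}

def entropy(rows):
    """Entropy(S) = - sum_c p_c * log2(p_c) over the class labels in rows."""
    counts = Counter(r[-1] for r in rows)
    total = len(rows)
    return -sum((c / total) * log2(c / total) for c in counts.values() if c > 0)

def info_gain(rows, attr):
    """Gain(A) = Entropy(parent) - weighted entropy of the partitions on A."""
    i = attrs[attr]
    total = len(rows)
    weighted = 0.0
    for v in set(r[i] for r in rows):
        subset = [r for r in rows if r[i] == v]
        weighted += len(subset) / total * entropy(subset)
    return entropy(rows) - weighted

print(f"entropy(root) = {entropy(data):.2f} bits")          # 0.94
for a in attrs:                                              # outlook has the largest gain (~0.25)
    print(f"gain({a}) = {info_gain(data, a):.2f} bits")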
ID3 example, continued: the sunny branch (5 tuples — sunny/hot/high/FALSE/no, sunny/hot/high/TRUE/no, sunny/mild/high/FALSE/no, sunny/cool/normal/FALSE/yes, sunny/mild/normal/TRUE/yes), entropy = 0.97 bits.
Candidate splits within the sunny branch:
humidity:    high → 0.0 bits (3/5 of the tuples), normal → 0.0 bits (2/5)  →  weighted 0.0 bits, gain = 0.97 bits
temperature: weighted 0.40 bits, gain = 0.57 bits
windy:       weighted 0.95 bits, gain = 0.02 bits
Humidity gives the maximal information gain, so the sunny branch is split on humidity.
ID3 example, continued: the rainy branch (5 tuples — rainy/mild/high/FALSE/yes, rainy/cool/normal/FALSE/yes, rainy/cool/normal/TRUE/no, rainy/mild/normal/FALSE/yes, rainy/mild/high/TRUE/no), entropy = 0.97 bits. (The sunny branch has already been split on humidity: high / normal.)
Candidate splits within the rainy branch:
temperature: weighted 0.95 bits, gain = 0.02 bits
humidity:    weighted 0.95 bits, gain = 0.02 bits
windy:       false → 0.0 bits (3/5), true → 0.0 bits (2/5)  →  weighted 0.0 bits, gain = 0.97 bits
Windy gives the maximal information gain, so the rainy branch is split on windy.
The final ID3 tree for the weather data:
outlook = sunny    → humidity = high → No;  humidity = normal → Yes
outlook = overcast → Yes
outlook = rainy    → windy = false → Yes;  windy = true → No
C4.5
Information gain measure is biased towards attributes with a large number of values
C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalization to information gain)
– GainRatio(A) = Gain(A)/SplitInfo(A)
Ex.
– gain_ratio(income) = 0.029/0.926 = 0.031
The attribute with the maximum gain ratio is selected as the splitting attribute
SplitInfo_A(D) = - Σ_{j=1..v} (|Dj| / |D|) × log2(|Dj| / |D|)

SplitInfo_income(D) = - (5/14) log2(5/14) - (4/14) log2(4/14) - (5/14) log2(5/14) = 0.926
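A minimal sketch (my own helper names; the partition sizes in the example call are just illustrative, not taken from the slide) of how SplitInfo and gain ratio would be computed:

from math import log2

def split_info(partition_sizes):
    """SplitInfo_A(D) = - sum_j (|Dj|/|D|) * log2(|Dj|/|D|)."""
    total = sum(partition_sizes)
    return -sum(n / total * log2(n / total) for n in partition_sizes if n > 0)

def gain_ratio(gain, partition_sizes):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D)."""
    return gain / split_info(partition_sizes)

# Illustrative call: an attribute splitting 14 tuples into partitions of sizes 4, 6 and 4
print(round(split_info([4, 6, 4]), 3))
print(round(gain_ratio(0.029, [4, 6, 4]), 3))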
CART
If a data set D contains examples from n classes, the gini index, gini(D), is defined as
  gini(D) = 1 - Σ_{j=1..n} pj²
where pj is the relative frequency of class j in D.
If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as
  gini_A(D) = (|D1|/|D|) gini(D1) + (|D2|/|D|) gini(D2)
Reduction in impurity:
  Δgini(A) = gini(D) - gini_A(D)
The attribute that provides the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute).
CART
Ex. D has 9 tuples in buys_computer = "yes" and 5 in "no":
  gini(D) = 1 - (9/14)² - (5/14)² = 0.459
Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}:
  gini_{income ∈ {low,medium}}(D) = (10/14) gini(D1) + (4/14) gini(D2)
but gini_{medium,high} is 0.30 and thus the best, since it is the lowest.
All attributes are assumed continuous-valued.
May need other tools, e.g., clustering, to get the possible split values.
Can be modified for categorical attributes.
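A quick sketch of the Gini computation (my own illustration; the class counts inside the two income partitions are made up, since the slide only gives the partition sizes):

def gini(class_counts):
    """gini(D) = 1 - sum_j p_j^2, where p_j is the relative frequency of class j in D."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def gini_split(partitions):
    """Size-weighted Gini of the partitions D1, D2 induced by a binary split."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)

# D: 9 "yes" and 5 "no" tuples
print(round(gini([9, 5]), 3))                      # 0.459, as in the example above
# Hypothetical income split D1 = {low, medium} (10 tuples), D2 = {high} (4 tuples);
# the per-partition class counts below are illustrative only
print(round(gini_split([[7, 3], [2, 2]]), 3))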
Comparing Attribute Selection Measures
The three measures, in general, return good results but
– Information gain:
• biased towards multivalued attributes
– Gain ratio:
• tends to prefer unbalanced splits in which one partition is much smaller than the others
– Gini index:
• biased to multivalued attributes
• has difficulty when # of classes is large
• tends to favor tests that result in equal-sized partitions and purity in both partitions
Other Attribute Selection Measures
CHAID: a popular decision tree algorithm; its measure is based on the χ2 test for independence
C-SEP: performs better than information gain and the gini index in certain cases
G-statistic: has a close approximation to the χ2 distribution
MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):
– The best tree is the one that requires the fewest number of bits to both (1) encode the tree, and (2) encode the exceptions to the tree
Multivariate splits (partition based on multiple variable combinations)
– CART: finds multivariate splits based on a linear combination of attributes
Which attribute selection measure is the best?
– Most give good results; none is significantly superior to the others
Underfitting and Overfitting
Underfitting: when the model is too simple, both training and test errors are large
Overfitting: when the model is too complex, training error keeps decreasing while test error starts to increase
Overfitting due to Noise
Decision boundary is distorted by noise point
Underfitting due to Insufficient Examples
Lack of data points in the lower half of the diagram makes it difficult to correctly predict the class labels in that region.
- An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task.
Two approaches to avoid Overfitting
Prepruning:
– Halt tree construction early—do not split a node if this would result in the goodness measure falling below a threshold
– Difficult to choose an appropriate threshold
Postpruning:
– Remove branches from a “fully grown” tree—get a sequence of progressively pruned trees
– Use a set of data different from the training data to decide which is the “best pruned tree”
Scalable Decision Tree Induction Methods
ID3, C4.5, and CART are not efficient when the training set doesn't fit the available memory. Instead, the following algorithms are used:
– SLIQ
  • Builds an index for each attribute; only the class list and the current attribute list reside in memory
– SPRINT
  • Constructs an attribute-list data structure
– RainForest
  • Builds an AVC-list (attribute, value, class label)
– BOAT
  • Uses bootstrapping to create several small samples
BOAT
BOAT (Bootstrapped Optimistic Algorithm for Tree Construction)
– Uses a statistical technique called bootstrapping to create several smaller samples (subsets), each of which fits in memory
– Each subset is used to create a tree, resulting in several trees
– These trees are examined and used to construct a new tree T'
  • It turns out that T' is very close to the tree that would be generated using the whole data set together
– Adv: requires only two scans of the DB; it is an incremental algorithm
Why decision tree induction in data mining?
Relatively faster learning speed (than other classification methods)
Convertible to simple and easy to understand classification rules
Comparable classification accuracy with other methods
Converting Tree to Rules
R1: IF (Outlook=Sunny) AND (Humidity=High) THEN Play=No
R2: IF (Outlook=Sunny) AND (Humidity=Normal) THEN Play=Yes
R3: IF (Outlook=Overcast) THEN Play=Yes
R4: IF (Outlook=Rain) AND (Wind=Strong) THEN Play=No
R5: IF (Outlook=Rain) AND (Wind=Weak) THEN Play=Yes

(Tree: Outlook = Sunny → Humidity (High → No, Normal → Yes); Outlook = Overcast → Yes; Outlook = Rain → Wind (Strong → No, Weak → Yes).)
Decision trees: The Weka tool
@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
http://www.cs.waikato.ac.nz/ml/weka/
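The same weather data can also be fed to a library decision-tree learner. A minimal sketch (my own, assuming scikit-learn and pandas are installed; the attributes are one-hot encoded because they are nominal):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# The 14 weather tuples from the ARFF file above
rows = [r.split(",") for r in """sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no""".splitlines()]
df = pd.DataFrame(rows, columns=["outlook", "temperature", "humidity", "windy", "play"])

X = pd.get_dummies(df.drop(columns="play"))      # one-hot encode the nominal attributes
y = df["play"]

tree = DecisionTreeClassifier(criterion="entropy")   # entropy-based splitting
tree.fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))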
Bayesian Classifier
Thomas Bayes (1702-1761)
Basic Statistics
Assume:
• D = all students, |D| = 100
• X = ICS students, |X| = 10
• C = SWE students, |C| = 20
with 4 students in both X and C (6 in X only, 16 in C only, 74 in neither).
P(X) = 10/100, P(C) = 20/100, P(X,C) = 4/100
P(X,C) = P(C|X)*P(X) = P(X|C)*P(C)
P(X|C) = P(X,C)/P(C) = 4/20
P(C|X) = P(X,C)/P(X) = 4/10
Bayesian Classifier – Basic Equation
P(C|X) = P(X|C) P(C) / P(X)
– P(C|X): class posterior probability
– P(C): class prior probability
– P(X|C): descriptor posterior probability
– P(X): descriptor prior probability
(Follows from P(X,C) = P(C|X)*P(X) = P(X|C)*P(C).)
Naive Bayesian Classifier
Independence assumption about the descriptors x1, ..., xn:
P(C1|X) = P(x1|C1) P(x2|C1) P(x3|C1) ... P(xn|C1) × P(C1) / P(X)
P(C2|X) = P(x1|C2) P(x2|C2) P(x3|C2) ... P(xn|C2) × P(C2) / P(X)
...
P(Cm|X) = P(x1|Cm) P(x2|Cm) P(x3|Cm) ... P(xn|Cm) × P(Cm) / P(X)
Training Data
The weather dataset (14 tuples: Outlook, Temp, Humidity, Windy, Play?), as shown above.
P(yes) = 9/14, P(no) = 5/14
Bayesian Classifier – Probabilities for the weather data

Frequency tables (No | Yes):
Outlook:  Sunny 3 | 2    Overcast 0 | 4    Rainy 2 | 3
Temp.:    Hot 2 | 2      Mild 2 | 4        Cool 1 | 3
Humidity: High 4 | 3     Normal 1 | 6
Windy:    False 2 | 6    True 3 | 3

Likelihood tables (No | Yes):
Outlook:  Sunny 3/5 | 2/9    Overcast 0/5 | 4/9    Rainy 2/5 | 3/9
Temp.:    Hot 2/5 | 2/9      Mild 2/5 | 4/9        Cool 1/5 | 3/9
Humidity: High 4/5 | 3/9     Normal 1/5 | 6/9
Windy:    False 2/5 | 6/9    True 3/5 | 3/9
Bayesian Classifier – Predicting a new day
New tuple X:  Outlook = sunny, Temp. = cool, Humidity = high, Windy = true, Play = ?

P(yes|X) = p(sunny|yes) x p(cool|yes) x p(high|yes) x p(true|yes) x p(yes)
         = 2/9 x 3/9 x 3/9 x 3/9 x 9/14 = 0.0053  =>  0.0053/(0.0053+0.0206) = 0.205
P(no|X)  = p(sunny|no) x p(cool|no) x p(high|no) x p(true|no) x p(no)
         = 3/5 x 1/5 x 4/5 x 3/5 x 5/14 = 0.0206  =>  0.0206/(0.0053+0.0206) = 0.795
So the new day is classified as Play = no.
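The hand computation above is easy to reproduce in code. A minimal sketch (my own, reusing the likelihood tables shown earlier) for the new day (sunny, cool, high, true):

# Likelihood tables from the weather data: value -> P(value | class)
likelihood = {
    "yes": {"sunny": 2/9, "cool": 3/9, "high": 3/9, "true": 3/9},
    "no":  {"sunny": 3/5, "cool": 1/5, "high": 4/5, "true": 3/5},
}
prior = {"yes": 9/14, "no": 5/14}
new_day = ["sunny", "cool", "high", "true"]

score = {}
for c in ("yes", "no"):
    p = prior[c]
    for v in new_day:                 # independence assumption: multiply the likelihoods
        p *= likelihood[c][v]
    score[c] = p

total = sum(score.values())
for c, p in score.items():
    print(c, round(p, 4), round(p / total, 3))   # 0.0053 / 0.0206  ->  0.205 / 0.795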
Bayesian Classifier – zero-frequency problem
What if a descriptor value doesn't occur with every class value?
  e.g., P(outlook=overcast | No) = 0
Remedy: add 1 to the count for every descriptor–class combination (Laplace estimator).

Frequency tables with the Laplace estimator (No | Yes):
Outlook:  Sunny 3+1 | 2+1    Overcast 0+1 | 4+1    Rainy 2+1 | 3+1
Temp.:    Hot 2+1 | 2+1      Mild 2+1 | 4+1        Cool 1+1 | 3+1
Humidity: High 4+1 | 3+1     Normal 1+1 | 6+1
Windy:    False 2+1 | 6+1    True 3+1 | 3+1
Bayesian Classifier – General Equation
P(Ck|X) = P(X|Ck) P(Ck) / P(X)
where P(X|Ck) is the likelihood.

Bayesian Classifier – Dealing with numeric attributes
For a continuous variable, the likelihood is modeled with a normal (Gaussian) density:
P(x|C) = (1 / sqrt(2πσ²)) × exp( -(x - μ)² / (2σ²) )
Naïve Bayesian Classifier: Comments
Advantages
– Easy to implement
– Good results obtained in most of the cases
Disadvantages
– Assumption: class conditional independence, therefore loss of accuracy
– Practically, dependencies exist among variables
  • E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
  • Dependencies among these cannot be modeled by the Naïve Bayesian Classifier
How to deal with these dependencies?– Bayesian Belief Networks
Bayesian Belief Networks
A Bayesian belief network allows a subset of the variables to be conditionally independent.
It is a graphical model of causal relationships:
– Represents dependency among the variables
– Gives a specification of the joint probability distribution
Example graph with nodes X, Y, Z, P:
– Nodes: random variables;  Links: dependency
– X and Y are the parents of Z, and Y is the parent of P
– There is no dependency between Z and P
– The graph has no loops or cycles
Bayesian Belief Network: An Example
(Example network over six variables: FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea. LungCancer has the parents FamilyHistory (FH) and Smoker (S).)

Conditional probability table (CPT) for LungCancer (LC):
           (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC           0.8        0.5        0.7        0.1
~LC          0.2        0.5        0.3        0.9
Bayesian Belief Networks
The conditional probability table (CPT) for a variable (e.g., LungCancer) shows the conditional probability for each possible combination of values of its parents.
Derivation of the probability of a particular combination of values of X from the CPTs:
P(x1, ..., xn) = Π_{i=1..n} P(xi | Parents(Yi))
Training Bayesian Networks
Several scenarios:
– Given both the network structure and all variables observable: learn only the CPTs
– Network structure known, some hidden variables: gradient descent (greedy hill-climbing) method, analogous to neural network learning
– Network structure unknown, all variables observable: search through the model space to reconstruct network topology
– Unknown structure, all hidden variables: No good algorithms known for this purpose.
Support Vector Machines
Support Vector Machines
Find a linear hyperplane (decision boundary) that will separate the data
Support Vector Machines
One Possible Solution
B1
Support Vector Machines
Another possible solution
B2
Support Vector Machines
Other possible solutions
B2
Support Vector Machines
Which one is better? B1 or B2? How do you define better?
B1
B2
Support Vector Machines
Find the hyperplane that maximizes the margin => B1 is better than B2
(Figure: B1 with margin boundaries b11 and b12, B2 with margin boundaries b21 and b22; the margin is the distance between a hyperplane's two boundaries, and the training points lying on those boundaries are the support vectors.)
Support Vector Machines
B1
b11
b12
Support Vectors
Support Vector Machines
B1
b11
b12
0 bxw
1 bxw
1 bxw
1bxw if1
1bxw if1)(
xf 2||||
2 Margin
w
Finding the Decision Boundary
Let {x1, ..., xn} be our data set and let yi ∈ {1, -1} be the class label of xi.
The decision boundary should classify all points correctly
The decision boundary can be found by solving the following constrained optimization problem
This is a constrained optimization problem. Solving it is beyond our course
Support Vector Machines
We want to maximize:
  Margin = 2 / ||w||²
– Which is equivalent to minimizing:
  L(w) = ||w||² / 2
– But subject to the following constraints:
  f(xi) = 1 if w·xi + b >= 1;   f(xi) = -1 if w·xi + b <= -1
• This is a constrained optimization problem
– Numerical approaches solve it (e.g., quadratic programming)
Classifying new Tuples
The decision boundary is determined only by the support vectors.
Let tj (j = 1, ..., s) be the indices of the s support vectors.
For testing with a new data point z:
– Compute the weighted sum over the support vectors (with the learned coefficients α and bias b),
  f(z) = Σ_{j=1..s} α_{tj} y_{tj} (x_{tj} · z) + b,
  and classify z as class 1 if the sum is positive, and class 2 otherwise.
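As a concrete illustration (my own sketch, not from the slides, assuming scikit-learn is installed), SVC exposes the learned support vectors and classifies new tuples from exactly this kind of weighted sum:

import numpy as np
from sklearn.svm import SVC

# Toy 2-D linearly separable data (made up for illustration)
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)     # very large C ~ hard margin
clf.fit(X, y)

print(clf.support_vectors_)            # only these points determine the boundary
z = np.array([[3.0, 3.5]])
print(clf.decision_function(z))        # the sign of this sum gives the class
print(clf.predict(z))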
Support Vector Machines
What if the training set is not linearly separable?
Slack variables ξi can be added to allow misclassification of difficult or noisy examples; the resulting margin is called a soft margin.
Support Vector Machines
What if the problem is not linearly separable?
– Introduce slack variables ξi
  • Need to minimize:
      L(w) = ||w||²/2 + C Σ_{i=1..N} ξi^k
  • Subject to:
      f(xi) = 1 if w·xi + b >= 1 - ξi;   f(xi) = -1 if w·xi + b <= -1 + ξi
Nonlinear Support Vector Machines
What if decision boundary is not linear?
Non-linear SVMs
Datasets that are linearly separable with some noise work out great.
But what are we going to do if the dataset is just too hard?
How about mapping the data to a higher-dimensional space, e.g., x → (x, x²)?
Non-linear SVMs: Feature spaces
General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:
Φ: x → φ(x)
Prediction
Linear Regression
What Is Prediction?
(Numerical) prediction is similar to classification
– construct a model
– use model to predict continuous or ordered value for a given input
Prediction is different from classification
– Classification refers to predict categorical class label
– Prediction models continuous-valued functions
Major method for prediction: regression
– model the relationship between one or more predictor variables and a response variable
Prediction
(Figure: training data plotted with the predictor attribute X on the x-axis and the response attribute Y on the y-axis; regression fits a line through these points.)
Types of Correlation
Positive correlation Negative correlation No correlation
Regression Analysis
Simple linear regression
Multiple regression
Non-linear regression
Other regression methods:
– generalized linear model
– Poisson regression
– log-linear models
– regression trees
Simple Linear Regression
Describes the linear relationship between a predictor variable X, plotted on the x-axis, and a response variable Y, plotted on the y-axis:
  Y = β0 + β1 X
Simple Linear Regression
Fitting data to a linear model:
  Yi = β0 + β1 Xi + εi
where β0 is the intercept, β1 is the slope, and εi are the residuals.
Simple Linear Regression
How do we fit data to a linear model? With the least squares method.

Least Squares Regression
  Model line:                     Ŷ = β0 + β1 X
  Residual (ε) =                  Y - Ŷ
  Sum of squares of residuals =   Σ (Y - Ŷ)²
We must find the values of β0 and β1 that minimise Σ (Y - Ŷ)².
Linear Regression
A model line y = w0 + w1 x acquired by using the method of least squares to estimate the best-fitting straight line has:
  w1 = Σ_{i=1..|D|} (xi - x̄)(yi - ȳ)  /  Σ_{i=1..|D|} (xi - x̄)²
  w0 = ȳ - w1 x̄
where x̄ and ȳ are the means of the xi and yi in the training data D.
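The two estimates above translate directly into code. A minimal sketch (toy data of my own):

# Least-squares estimates for y = w0 + w1 * x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]        # toy predictor values
ys = [2.1, 3.9, 6.2, 8.0, 9.8]        # toy responses

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
     sum((x - x_bar) ** 2 for x in xs)
w0 = y_bar - w1 * x_bar

print(round(w0, 3), round(w1, 3))      # intercept and slope of the fitted line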
Multiple Linear Regression
Multiple linear regression: involves more than one predictor variable.
The linear model with a single predictor variable X can easily be extended to two or more predictor variables:
  Y = β0 + β1 X1 + β2 X2 + ... + βp Xp
– Solvable by an extension of the least squares method or using software such as SAS or S-Plus.
Nonlinear Regression
Some nonlinear models can be modeled by a polynomial function.
A polynomial regression model can be transformed into a linear regression model. For example,
  y = w0 + w1 x + w2 x² + w3 x³
is convertible to linear form with the new variables x2 = x², x3 = x³:
  y = w0 + w1 x + w2 x2 + w3 x3
Other functions, such as the power function, can also be transformed to a linear model.
Some models are intractably nonlinear:
– it is still possible to obtain least-squares estimates through extensive calculation on more complex formulae.
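A small sketch of the transformation trick (toy data of my own, numpy assumed available): build x² and x³ as new columns and solve an ordinary linear least-squares problem.

import numpy as np

x = np.linspace(-2, 2, 21)                       # toy inputs
y = 1.0 + 0.5 * x - 2.0 * x**2 + 0.3 * x**3      # toy cubic response (no noise)

# Design matrix with the new variables x2 = x^2 and x3 = x^3
A = np.column_stack([np.ones_like(x), x, x**2, x**3])
w, *_ = np.linalg.lstsq(A, y, rcond=None)        # ordinary linear least squares
print(np.round(w, 3))                            # recovers w0 .. w3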
Artificial Neural Networks (ANN)
What is an ANN?
An ANN is a data structure that supposedly simulates the behavior of neurons in a biological brain.
An ANN is composed of layers of interconnected units.
Messages are passed along the connections from one unit to the other.
Messages can change based on the weight of the connection and the value in the node.
General Structure of ANN
(Figure: an input layer (x1 ... x5), a hidden layer, and an output layer producing y. Each unit applies a function f to the weighted sum of its inputs x0 ... xn with weights w0 ... wn.)
ANN
(Black box with inputs X1, X2, X3 and output Y.)

X1  X2  X3  |  Y
1   0   0   |  0
1   0   1   |  1
1   1   0   |  1
1   1   1   |  1
0   0   1   |  0
0   1   0   |  0
0   1   1   |  1
0   0   0   |  0

Output Y is 1 if at least two of the three inputs are equal to 1.
ANN
The same truth table is realized by a single output node with input weights 0.3, 0.3, 0.3 and threshold t = 0.4:
  Y = I(0.3 X1 + 0.3 X2 + 0.3 X3 - 0.4 > 0)
  where I(z) = 1 if z is true and 0 otherwise.
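The unit above is a one-line function. A minimal sketch (my own) reproducing the truth table:

def perceptron(x1, x2, x3, w=(0.3, 0.3, 0.3), t=0.4):
    """Y = I(0.3*X1 + 0.3*X2 + 0.3*X3 - 0.4 > 0)."""
    s = w[0] * x1 + w[1] * x2 + w[2] * x3
    return 1 if s - t > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        for x3 in (0, 1):
            print(x1, x2, x3, "->", perceptron(x1, x2, x3))  # 1 iff at least two inputs are 1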
Artificial Neural Networks
The model is an assembly of inter-connected nodes and weighted links.
The output node sums up its input values according to the weights of its links.
The sum is compared against a threshold t.
(Figure: inputs X1, X2, X3 with weights w1, w2, w3 feeding an output node with threshold t, producing Y.)

Perceptron Model
  Y = I( Σ_i wi Xi - t )     or     Y = sign( Σ_i wi Xi - t )
Neural Networks
Advantages
– prediction accuracy is generally high.
– robust, works when training examples contain errors.
– output may be discrete, real-valued, or a vector of several discrete or real-valued attributes.
– fast evaluation of the learned target function.
Criticism
– long training time.
– difficult to understand the learned function (weights).
– not easy to incorporate domain knowledge.
Learning Algorithms
Back propagation for classification
Kohonen feature maps for clustering
Recurrent back propagation for classification
Radial basis function for classification
Adaptive resonance theory
Probabilistic neural networks
Major Steps for Back Propagation Network
Constructing a network
– input data representation
– selection of number of layers, number of nodes in each layer.
Training the network using training data
Pruning the network
Interpret the results
A Multi-Layer Feed-Forward Neural Network
(Figure: input layer x1 ... x5, hidden layer, output layer producing y; wij is the weight on the connection from unit i to unit j.)
Each hidden or output unit j computes its net input from the outputs Oi of the previous layer (θj is the unit's bias):
  Ij = Σ_i wij Oi + θj
and produces its output with a sigmoid activation:
  Oj = 1 / (1 + e^(-Ij))
How does a Multi-Layer Neural Network work?
The inputs to the network correspond to the attributes measured for each training tuple.
Inputs are fed simultaneously into the units making up the input layer.
They are then weighted and fed simultaneously to a hidden layer.
The number of hidden layers is arbitrary, although usually only one is used.
The weighted outputs of the last hidden layer are input to the units making up the output layer, which emits the network's prediction.
The network is feed-forward in that none of the weights cycles back to an input unit or to an output unit of a previous layer.
From a statistical point of view, networks perform nonlinear regression: given enough hidden units and enough training samples, they can closely approximate any function.
Defining a Network Topology
First decide the network topology: the number of units in the input layer, the number of hidden layers (if > 1), the number of units in each hidden layer, and the number of units in the output layer.
Normalize the input values for each attribute measured in the training tuples to [0.0, 1.0].
Use one input unit per domain value.
For the output, if classifying into more than two classes, one output unit per class is used.
Once a network has been trained, if its accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights.
Backpropagation
Iteratively process a set of training tuples and compare the network's prediction with the actual known target value.
For each training tuple, the weights are modified to minimize the mean squared error between the network's prediction and the actual target value.
Modifications are made in the "backwards" direction: from the output layer, through each hidden layer, down to the first hidden layer, hence "backpropagation".
Steps:
– Initialize weights (to small random numbers) and biases in the network
– Propagate the inputs forward (by applying the activation function)
– Backpropagate the error (by updating weights and biases)
– Terminating condition (when the error is very small, etc.)
Backpropagation
For an output unit j (Tj is the correct target value, Oj the generated value):
  Errj = Oj (1 - Oj) (Tj - Oj)
For a hidden unit j (the error is propagated back from the units k of the next layer):
  Errj = Oj (1 - Oj) Σ_k Errk wjk
Weight and bias updates (l is the learning rate):
  wij = wij + (l) Errj Oi
  θj = θj + (l) Errj
Network Pruning
Fully connected network will be hard to articulate
n input nodes, h hidden nodes and m output nodes lead to h(m+n) links (weights)
Pruning: Remove some of the links without affecting classification accuracy of the network.
Other Classification Methods
Associative classification: association-rule based (condSet → class).
Genetic algorithms: an initial population of encoded rules is changed by mutation and cross-over, based on the survival of the most accurate ones (survival of the fittest).
K-nearest neighbor classifier: learning by analogy.
Case-based reasoning: similarity with other cases.
Rough set theory: approximation to equivalence classes.
Fuzzy sets: based on fuzzy logic (truth values between 0 and 1).
Lazy Learners
Lazy vs. Eager Learning
– Lazy learning (e.g., instance-based learning): simply stores the training data (or does only minor processing) and waits until it is given a test tuple
– Eager learning (the methods discussed above): given a training set, constructs a classification model before receiving new (e.g., test) data to classify
Lazy learners spend less time in training but more time in predicting.
Lazy Learner: Instance-Based Methods
Instance-based learning:
– Store training examples and delay the processing (“lazy evaluation”) until a new instance must be classified
Typical approaches
– k-nearest neighbor approach
• Instances represented as points in a Euclidean space.
– Case-based reasoning
• Uses symbolic representations and knowledge-based inference
Nearest Neighbor Classifiers
Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck.
Given a test record: compute the distance to the training records, then choose k of the "nearest" records.

Instance-Based Classifiers
(Figure: a set of stored cases with attributes Atr1 ... AtrN and class labels A, B, C, and an unseen case with the same attributes.)
• Store the training records
• Use the training records to predict the class label of unseen cases
Definition of Nearest Neighbor
(Figure: the neighborhoods around a test point x for (a) 1-nearest neighbor, (b) 2-nearest neighbor, and (c) 3-nearest neighbor.)
The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.
The k-Nearest Neighbor Algorithm
All instances correspond to points in the n-dimensional space.
The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2).
The target function could be discrete- or real-valued.
For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to xq.
Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.
Nearest-Neighbor Classifiers
Requires three things
– The set of stored records
– Distance Metric to compute distance between records
– The value of k, the number of nearest neighbors to retrieve
To classify an unknown record:
– Compute distance to other training records
– Identify k nearest neighbors
– Use class labels of nearest neighbors to determine the class label of unknown record (e.g., by taking majority vote)
Unknown record
Nearest Neighbor Classification
Compute the distance between two points:
– Euclidean distance:  d(p, q) = sqrt( Σ_i (pi - qi)² )
Determine the class from the nearest neighbor list:
– take the majority vote of class labels among the k nearest neighbors
– or weigh the vote according to distance
  • weight factor, w = 1/d²
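A minimal k-NN sketch (toy data of my own) combining the Euclidean distance and the majority vote just described:

from collections import Counter
from math import sqrt

def euclidean(p, q):
    """d(p, q) = sqrt(sum_i (p_i - q_i)^2)."""
    return sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(train, x, k=3):
    """Majority vote among the k training records nearest to x."""
    neighbors = sorted(train, key=lambda rec: euclidean(rec[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# (point, class) training records -- toy values
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((3.0, 3.2), "B"),
         ((3.1, 2.9), "B"), ((0.9, 1.3), "A"), ((2.8, 3.0), "B")]
print(knn_predict(train, (2.9, 3.1), k=3))   # -> "B"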
Nearest Neighbor Classification…
Scaling issues
– Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
– Example:
• height of a person may vary from 1.5m to 1.8m
• weight of a person may vary from 90lb to 300lb
• income of a person may vary from $10K to $1M
Nearest Neighbor Classification…
Choosing the value of k:
– If k is too small, the classifier is sensitive to noise points
– If k is too large, the neighborhood may include points from other classes
Metrics for Performance Evaluation
Focus on the predictive capability of a model
– Rather than how fast it takes to classify or build models, scalability, etc.
Confusion Matrix:

                           PREDICTED CLASS
                           Class=Yes    Class=No
ACTUAL CLASS  Class=Yes    a (TP)       b (FN)
              Class=No     c (FP)       d (TN)

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
Metrics for Performance Evaluation…
Most widely-used metric (using the confusion matrix counts above):
  Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
  Error Rate = 1 - Accuracy
Limitation of Accuracy
Consider a 2-class problem
– Number of Class 0 examples = 9990
– Number of Class 1 examples = 10
If model predicts everything to be class 0, accuracy is 9990/10000 = 99.9 %
– Accuracy is misleading because model does not detect any class 1 example
Alternative Classifier Accuracy Measures
accuracy = sensitivity * pos/(pos + neg) + specificity * neg/(pos + neg)
– sensitivity = tp/pos /* true positive recognition rate */
– specificity = tn/neg /* true negative recognition rate */
precision = tp/(tp + fp)
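These measures follow directly from the confusion matrix counts. A minimal sketch (the counts are made up for illustration):

def classification_metrics(tp, fn, fp, tn):
    pos, neg = tp + fn, tn + fp
    sensitivity = tp / pos                 # true positive recognition rate
    specificity = tn / neg                 # true negative recognition rate
    precision   = tp / (tp + fp)
    accuracy    = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)
    return accuracy, sensitivity, specificity, precision

# Illustrative counts: a = TP, b = FN, c = FP, d = TN
print(classification_metrics(tp=90, fn=10, fp=40, tn=860))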
Predictor Error Measures
Test error (generalization error): the average loss over the test set. With actual values yi, predicted values yi', and mean actual value ȳ over the d test tuples:
– Mean absolute error:      (1/d) Σ_{i=1..d} |yi - yi'|
– Mean squared error:       (1/d) Σ_{i=1..d} (yi - yi')²
– Relative absolute error:  Σ_{i=1..d} |yi - yi'|  /  Σ_{i=1..d} |yi - ȳ|
– Relative squared error:   Σ_{i=1..d} (yi - yi')²  /  Σ_{i=1..d} (yi - ȳ)²
– The mean squared error exaggerates the presence of outliers. Popularly used are the (square) root mean squared error and, similarly, the root relative squared error.
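A minimal sketch (toy values of my own) of the four predictor error measures above:

from math import sqrt

def error_measures(y_true, y_pred):
    d = len(y_true)
    y_bar = sum(y_true) / d
    mae = sum(abs(y - yp) for y, yp in zip(y_true, y_pred)) / d
    mse = sum((y - yp) ** 2 for y, yp in zip(y_true, y_pred)) / d
    rae = sum(abs(y - yp) for y, yp in zip(y_true, y_pred)) / \
          sum(abs(y - y_bar) for y in y_true)
    rse = sum((y - yp) ** 2 for y, yp in zip(y_true, y_pred)) / \
          sum((y - y_bar) ** 2 for y in y_true)
    # returns MAE, MSE, RMSE, relative absolute error, relative squared error
    return mae, mse, sqrt(mse), rae, rse

# Toy actual vs. predicted values
print(error_measures([3.0, 5.0, 2.0, 7.0], [2.5, 5.5, 2.0, 6.0]))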
Evaluating Accuracy
Holdout method
– Given data is randomly partitioned into two independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
– Random sampling: a variation of holdout
• Repeat holdout k times, accuracy = avg. of the accuracies obtained
Cross-validation (k-fold, where k = 10 is most popular)
– Randomly partition the data into k mutually exclusive subsets, each approximately equal size
– At i-th iteration, use Di as test set and others as training set
Evaluating Accuracy: Bootstrap
– Works well with small data sets
– Samples the given training tuples uniformly with replacement
There are several bootstrap methods; a common one is the .632 bootstrap:
– Suppose we are given a data set of d tuples. The data set is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data will end up in the bootstrap sample, and the remaining 36.8% will form the test set (since (1 - 1/d)^d ≈ e^(-1) = 0.368).
– Repeat the sampling procedure k times; the overall accuracy of the model is:
  Acc(M) = Σ_{i=1..k} ( 0.632 × Acc(Mi)_test_set + 0.368 × Acc(Mi)_train_set )
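The .632 combination at the end is a one-liner in code. A minimal sketch (the per-round accuracy values are placeholders; here the combination is averaged over the k rounds):

def bootstrap_632_accuracy(test_accs, train_accs):
    """0.632 * test accuracy + 0.368 * train accuracy, averaged over the k rounds."""
    k = len(test_accs)
    return sum(0.632 * te + 0.368 * tr for te, tr in zip(test_accs, train_accs)) / k

# Placeholder per-round accuracies for k = 3 bootstrap samples
print(bootstrap_632_accuracy([0.81, 0.79, 0.83], [0.95, 0.93, 0.96]))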
Ensemble Methods
Construct a set of classifiers from the training data
Predict class label of previously unseen records by aggregating predictions made by multiple classifiers
– Use a combination of models to increase accuracy
– Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an improved model M*
Popular ensemble methods
– Bagging
• averaging the prediction over a collection of classifiers
– Boosting
• weighted vote with a collection of classifiers
General Idea
(Figure: Step 1 — create multiple data sets D1, D2, ..., Dt-1, Dt from the original training data D; Step 2 — build multiple classifiers C1, C2, ..., Ct-1, Ct; Step 3 — combine the classifiers into C*.)
Bagging: Bootstrap Aggregation
Analogy: Diagnosis based on multiple doctors’ majority vote
Training
– Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., a bootstrap sample)
– A classifier model Mi is learned for each training set Di
Classification: classify an unknown sample X
– Each classifier Mi returns its class prediction
– The bagged classifier M* counts the votes and assigns the class with the most votes to X
Prediction: can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple
Bagging: Bootstrap Aggregation — Accuracy
– Often significantly better than a single classifier derived from D
– For noisy data: not considerably worse, more robust
– Proven improved accuracy in prediction
Boosting Analogy: Consult several doctors, based on a combination of
weighted diagnoses—weight assigned based on the previous diagnosis accuracy
How boosting works?
– Weights are assigned to each training tuple
– A series of k classifiers is iteratively learned
– After a classifier Mi is learned, the weights are updated to
allow the subsequent classifier, Mi+1, to pay more attention to
the training tuples that were misclassified by Mi
– The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy
Boosting
The boosting algorithm can be extended for the prediction of continuous values
Comparing with bagging: boosting tends to achieve greater accuracy, but it also risks overfitting the model to misclassified data
Boosting: AdaBoost
Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd).
Initially, all the tuple weights are set the same (1/d).
Generate k classifiers in k rounds. At round i:
– Tuples from D are sampled (with replacement) to form a training set Di of the same size
– Each tuple's chance of being selected is based on its weight
– A classification model Mi is derived from Di
– Its error rate is calculated using Di as a test set
– If a tuple is misclassified, its weight is increased; otherwise it is decreased
Error rate: err(Xj) is the misclassification error of tuple Xj. Classifier Mi's error rate is the sum of the weights of the misclassified tuples:
  error(Mi) = Σ_j wj × err(Xj)
The weight of classifier Mi's vote is:
  log( (1 - error(Mi)) / error(Mi) )
Summary
Classification vs. prediction
Eager learners:
– Decision tree
– Bayesian
– Support Vector Machines (SVM)
– Neural Networks
– Linear regression
Lazy learners:
– K-Nearest Neighbor (KNN)
Performance (accuracy) evaluation:
– Holdout
– Cross validation
– Bootstrap
Ensemble methods:
– Bagging
– Boosting
END