DECISION TREE, SOFTMAX REGRESSION AND ENSEMBLE METHODS IN MACHINE LEARNING - Abhishek Vijayvargia


DESCRIPTION

Decision Tree, Softmax Regression and Ensemble Methods in Machine Learning: their use and practice.

TRANSCRIPT

Page 1:

DECISION TREE, SOFTMAX REGRESSION AND ENSEMBLE METHODS IN MACHINE LEARNING

- Abhishek Vijayvargia

Page 2:

WHAT IS MACHINE LEARNING

Formal Approach

Field of study that gives computers the ability to learn without being explicitly programmed.

Informal Approach

Page 3:

MACHINE LEARNING

Supervised Learning

Supervised learning is the machine learning task of inferring a function from labeled training data; in other words, function approximation.

Unsupervised Learning

Trying to find hidden structure in unlabeled data.

Examples given to the learner are unlabeled, so there is no error or reward signal to evaluate a potential solution.


Reinforcement Learning

Learning by interacting with an environment

Page 4:

SUPERVISED LEARNING

Classification

Output variable takes class labels.

Ex. Predicting whether an email is spam or ham.

Regression

Output variable is numeric or continuous.

Ex. Predicting temperature.

Page 5:

DECISION TREES

Is this restaurant good? (YES / NO)

Page 6:

DECISION TREES

What are the factors that decide whether a restaurant is good for you or not?

Type: Italian, South Indian, French

Atmosphere: Casual, Fancy

How many people are inside? (e.g., 10 < people < 30)

Cost

Weather outside: Rainy, Sunny, Cloudy

Hungry: Yes/No

Page 7:

DECISION TREE

[Figure: example decision tree. The root tests Hungry (True/False); deeper nodes test Rainy, People > 10, Type (French / South Indian), and Cost (More / Less), with each path ending in a YES or NO leaf.]

Page 8:

DECISION TREE LEARNING

Pick the best attribute.

Make a decision tree node containing that attribute.

For each value of the attribute, create a descendant of the node.

Sort the training examples to the leaves.

Iterate on the subsets using the remaining attributes. (A minimal sketch of this loop follows below.)
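A minimal Python sketch of these five steps, in the style of ID3. Nothing here is from the slides: the dict-per-example data layout and the helper names are assumptions, and the attribute score is the information gain defined on the next slides.

```python
from collections import Counter
from math import log2

def entropy(labels):
    # H(S) = -sum p(x) log2 p(x); defined formally on the next slides
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    # expected reduction in entropy from splitting on attr
    gain = entropy(labels)
    for value in {row[attr] for row in rows}:
        subset = [l for row, l in zip(rows, labels) if row[attr] == value]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

def id3(rows, labels, attrs):
    if len(set(labels)) == 1:                  # pure node: stop
        return labels[0]
    if not attrs:                              # no attributes left: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))  # 1. pick best attribute
    node = {best: {}}                          # 2. node containing that attribute
    for value in {row[best] for row in rows}:  # 3. one descendant per value
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node[best][value] = id3([rows[i] for i in idx],           # 4. sort examples down
                                [labels[i] for i in idx],
                                [a for a in attrs if a != best])  # 5. iterate on subsets
    return node
```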

Page 9:

DECISION TREE : PICK BEST ATTRIBUTE

[Figure: three candidate attribute splits (Graph 1, Graph 2, Graph 3) of the same set of + and − examples into True/False branches. Some splits separate the classes cleanly (all + down one branch, all − down the other); others leave each branch as mixed as the parent set.]

Page 10:

DECISION TREE : PICK BEST ATTRIBUTE

Select the attribute that gives MAXIMUM information gain.

Gain measures how well a given attribute separates the training examples into the target classes.

Entropy is a measure of the amount of uncertainty in the (data) set:

$$H(S) = -\sum_{x \in X} p(x) \log_2 p(x)$$

S: the current data set for which entropy is calculated.

X: the set of classes in S.

p(x): the proportion of the number of elements in class x to the number of elements in set S.
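A quick numeric check of this formula (a sketch, not from the slides; the 9-positive / 5-negative set is the classic textbook example):

```python
import numpy as np

def entropy(labels):
    """H(S) = -sum over classes x of p(x) * log2(p(x))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

entropy(["+"] * 9 + ["-"] * 5)   # ~0.940 bits: a fairly mixed set
entropy(["+"] * 7)               # 0.0 bits: a pure set has no uncertainty
```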

Page 11:

DECISION TREE : INFORMATION GAIN

Information gain IG(A) is the measure of the difference in entropy from before to after the set S is split on an attribute A.

In other words, it measures how much uncertainty in S was reduced by splitting S on attribute A:

$$IG(A, S) = H(S) - \sum_{t \in T} p(t)\, H(t)$$

H(S): entropy of set S.

T: the subsets created from splitting S by attribute A, such that $S = \bigcup_{t \in T} t$.

p(t): the proportion of the number of elements in t to the number of elements in set S.
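A sketch of the gain computation under these definitions. The perfect-split and useless-split examples are hypothetical, chosen to show the two extremes:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(subsets):
    """IG(A, S) = H(S) - sum over t in T of p(t) * H(t),
    where `subsets` is the list T produced by splitting S on attribute A."""
    parent = [label for t in subsets for label in t]   # S is the union of the t's
    n = len(parent)
    return entropy(parent) - sum(len(t) / n * entropy(t) for t in subsets)

information_gain([["+"] * 4, ["-"] * 4])             # 1.0 bit: a perfect split
information_gain([["+", "-"] * 2, ["+", "-"] * 2])   # 0.0: split tells us nothing
```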

Page 12:

DECISION TREE ALGORITHM : BIAS

Restriction bias: the hypothesis space is all possible decision trees.

Preference bias: which trees does the algorithm prefer?

Good splits at the TOP

Correct over incorrect

Shorter trees

Page 13:

DECISION TREE : CONTINUOUS ATTRIBUTE

Branch on every possible value? Impractical for a continuous attribute.

Include only the ages seen in the training set? Useless when we encounter an age not present in the training set.

Better: represent the attribute in the form of a range, e.g. 20 <= Age < 30. (A sketch for choosing such split points follows below.)
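One common way to get such ranges is to place candidate thresholds midway between consecutive distinct values of the attribute, then branch on binary tests like Age < 24.5. A small sketch; the ages are made up:

```python
import numpy as np

def candidate_thresholds(values):
    """Midpoints between consecutive distinct values; each midpoint t
    yields the binary test (attribute < t)."""
    v = np.unique(values)             # sorted distinct values
    return (v[:-1] + v[1:]) / 2.0

ages = np.array([18, 22, 22, 27, 35])
candidate_thresholds(ages)            # array([20. , 24.5, 31. ])
```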

Page 14:

DECISION TREE : CONTINUOUS ATTRIBUTE

Does it make sense to repeat an attribute along a path in the tree?

[Figure: a path that tests attribute B, then A, then A again.]

For a discrete attribute, no: once tested, its value is fixed along the path. For a continuous attribute, yes, as long as each repetition tests a different threshold.

Page 15:

DECISION TREE : WHEN DO WE STOP?

Everything classified correctly? (Careful: with noisy data, the same example can appear twice with two different answers.)

No more attributes? (Not a good criterion for continuous attributes, which offer infinitely many possible splits.)

Pruning (cut the tree back to avoid overfitting).

Page 16:

SOFTMAX REGRESSION

Softmax regression (or multinomial logistic regression) is a classification method that generalizes logistic regression to multiclass problems (i.e., with more than two possible discrete outcomes).

It is used to predict the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables (which may be real-valued, binary-valued, categorical-valued, etc.).

Page 17:

LOGISTIC REGRESSION

Logistic regression refers specifically to the problem in which the dependent variable is binary (only two categories).

Since the output variable $y \in \{0, 1\}$, it seems natural to choose the Bernoulli family of distributions to model the conditional distribution of y given x.

The logistic function (which always takes on values between zero and one):

$$F(t) = \frac{1}{1 + e^{-t}}, \qquad t = \theta^T x, \quad \text{so} \quad h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$
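A one-function sketch of this logistic function; the parameter values are hypothetical:

```python
import numpy as np

def sigmoid(t):
    """F(t) = 1 / (1 + e^(-t)); always in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

theta = np.array([0.5, -1.2])   # hypothetical learned parameters
x = np.array([2.0, 1.0])
sigmoid(theta @ x)              # P(y = 1 | x; theta), with t = theta^T x
```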

Page 18:

SOFTMAX REGRESSION

Used in classification problems in which the response variable y can take on any one of k values, $y \in \{1, 2, \ldots, k\}$.

Ex. Classify emails into three classes: {Primary, Social, Promotions}.

The response variable is still discrete, but can take more than two values.

To derive a Generalized Linear Model for multinomial data, we begin by expressing the multinomial as an exponential family distribution.

Page 19:

SOFTMAX REGRESSION

To parameterize a multinomial over k possible outcomes, we could use k parameters $\phi_1, \ldots, \phi_k$ specifying the probability of each outcome.

These parameters are redundant because $\sum_{i=1}^{k} \phi_i = 1$. So we keep $\phi_i = p(y = i; \phi)$ for $i = 1, \ldots, k-1$ and set $p(y = k; \phi) = 1 - \sum_{i=1}^{k-1} \phi_i$.

The indicator function 1{·} takes the value 1 if its argument is true, and 0 otherwise:

1{True} = 1, 1{False} = 0.

Page 20:

SOFTMAX REGRESSION

The multinomial is a member of the exponential family:

$$p(y; \phi) = \phi_1^{1\{y=1\}}\, \phi_2^{1\{y=2\}} \cdots \phi_k^{1\{y=k\}} = \phi_1^{1\{y=1\}}\, \phi_2^{1\{y=2\}} \cdots \phi_k^{1 - \sum_{i=1}^{k-1} 1\{y=i\}} = b(y) \exp\!\left(\omega^T T(y) - a(\omega)\right)$$

where

$$\omega = \begin{bmatrix} \log(\phi_1/\phi_k) \\ \log(\phi_2/\phi_k) \\ \vdots \\ \log(\phi_{k-1}/\phi_k) \end{bmatrix}, \qquad a(\omega) = -\log \phi_k, \qquad b(y) = 1,$$

and $T(y) \in \mathbb{R}^{k-1}$ is the vector of indicators with $(T(y))_i = 1\{y = i\}$.

Page 21:

SOFTMAX REGRESSION

The link function is given as

$$\omega_i = \log \frac{\phi_i}{\phi_k}$$

To invert the link function and derive the response function:

$$e^{\omega_i} = \frac{\phi_i}{\phi_k}, \qquad \phi_k\, e^{\omega_i} = \phi_i, \qquad \phi_k \sum_{i=1}^{k} e^{\omega_i} = \sum_{i=1}^{k} \phi_i = 1$$

Page 22:

SOFTMAX REGRESSION

So we get

$$\phi_k = \frac{1}{\sum_{i=1}^{k} e^{\omega_i}}$$

and we can substitute it back into the equation to give the response function

$$\phi_i = \frac{e^{\omega_i}}{\sum_{j=1}^{k} e^{\omega_j}}$$

The conditional distribution of y given x is then, with $\omega_i = \theta_i^T x$,

$$p(y = i \mid x; \theta) = \phi_i = \frac{e^{\omega_i}}{\sum_{j=1}^{k} e^{\omega_j}} = \frac{e^{\theta_i^T x}}{\sum_{j=1}^{k} e^{\theta_j^T x}}$$
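This response function is exactly the softmax. A small numeric sketch, with made-up values of $\omega$:

```python
import numpy as np

def softmax(omega):
    """phi_i = e^(omega_i) / sum_j e^(omega_j) -- the response function above."""
    e = np.exp(omega - omega.max())   # subtracting the max avoids overflow
    return e / e.sum()

omega = np.array([2.0, 1.0, 0.1])     # hypothetical omega_i = theta_i^T x, k = 3
phi = softmax(omega)                  # p(y = i | x; theta) for each i
phi.sum()                             # 1.0: a valid distribution over the k outcomes
```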

Page 23:

SOFTMAX REGRESSION

Softmax regression is a generalization of logistic regression.

Our hypothesis will output

$$h_\theta(x) = \begin{bmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_k \end{bmatrix}$$

In other words, our hypothesis will output the estimated probability $p(y = i \mid x; \theta)$ for every value of $i = 1, \ldots, k$. (A numeric sketch follows below.)
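A sketch of the hypothesis output, plus a check of the claim that logistic regression is the k = 2 special case. All parameter values are hypothetical:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.array([[0.2, -0.5],   # hypothetical theta_i, one row per class (k = 3)
                  [1.0,  0.3],
                  [-0.7, 0.8]])
x = np.array([1.5, 2.0])
h = softmax(theta @ x)           # h_theta(x) = [phi_1, phi_2, phi_3]
h.argmax() + 1                   # predicted class: the i with the highest phi_i

# k = 2 special case: softmax reduces to the logistic function of slide 17,
# with t = (theta_1 - theta_2)^T x
theta2 = np.array([[0.5, -1.2], [0.0, 0.0]])
np.isclose(softmax(theta2 @ x)[0],
           1 / (1 + np.exp(-(theta2[0] - theta2[1]) @ x)))   # True
```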

Page 24:

ENSEMBLE LEARNING

Ensemble learning uses multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.

It is primarily used to improve the prediction performance of a model, or to reduce the likelihood of an unfortunate selection of a poor one.

Page 25:

HOW GOOD ARE ENSEMBLES?

Let’s look at the Netflix Prize competition…

Page 26:

NETFLIX PRIZE : STARTED IN OCT 2006

A supervised learning task:

The training data is a set of users and the ratings (1, 2, 3, 4, or 5 stars) those users have given to movies.

Construct a classifier that, given a user and an unrated movie, correctly classifies that movie as either 1, 2, 3, 4, or 5 stars.

$1 million prize for a 10% improvement over Netflix's current movie recommender/classifier.

Page 27:

NETFLIX PRIZE : LEADER BOARD

Page 28:

ENSEMBLE LEARNING : GENERAL IDEA

Page 29:

ENSEMBLE LEARNING : BAGGING

Given:

A training set S of N examples.

A class of learning models (decision tree, NB, SVM, RF, etc.).

Training:

At each iteration i, a training set S_i of N tuples is sampled with replacement from S.

A classifier model M_i is learned for each training set S_i.

Classification (classify an unknown sample x):

Each classifier M_i returns its class prediction.

The bagged classifier M* counts the votes and assigns the class with the most votes. (A minimal sketch follows below.)
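A minimal sketch of this train-then-vote loop, assuming numpy arrays X (N×d) and y (N) and using scikit-learn's decision tree as the base model; any unstable learner from the list above would do:

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_models=25, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)   # S_i: N tuples sampled with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))  # learn M_i
    return models

def bagging_predict(models, x):
    votes = [m.predict(x.reshape(1, -1))[0] for m in models]  # each M_i votes
    return Counter(votes).most_common(1)[0][0]                # M*: majority class
```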

Page 30:

ENSEMBLE LEARNING : BAGGING

Bagging reduces variance by voting/averaging.

Can help a lot when data is noisy.

If the learning algorithm is unstable, bagging almost always improves performance.

Page 31:

ENSEMBLE LEARNING : RANDOM FORESTS

The Random Forests algorithm grows many classification trees.

To classify a new object from an input vector, put the input vector down each of the trees in the forest.

Each tree gives a classification, and we say the tree "votes" for that class.

The forest chooses the classification having the most votes (over all the trees in the forest).

Page 32:

ENSEMBLE LEARNING : RANDOM FORESTS

Each tree is grown as follows:

If the number of cases in the training set is N, sample N cases at random - but with replacement, from the original data. This sample will be the training set for growing the tree.

If there are M input variables, a number m<<M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant during the forest growing.

Each tree is grown to the largest extent possible. There is no pruning. (These three rules map directly onto library parameters; see the sketch below.)
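For instance, a sketch with scikit-learn's RandomForestClassifier; X_train and y_train are assumed to exist:

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=100,        # grow many trees
    bootstrap=True,          # each tree: N cases sampled with replacement
    max_features="sqrt",     # m = sqrt(M) variables tried at each node, m << M
    max_depth=None,          # grow to the largest extent possible: no pruning
)
# clf.fit(X_train, y_train); clf.predict(X_test)  # forest = majority vote
```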

Page 33:

FEATURES OF RANDOM FORESTS

It is among the most accurate of current algorithms.

It runs efficiently on large databases.

It can handle thousands of input variables without variable deletion.

It gives estimates of which variables are important in the classification.

It is an effective method for estimating missing data, and it maintains accuracy when a large proportion of the data are missing.

Generated forests can be saved for future use on other data.

Page 34:

ENSEMBLE LEARNING : BOOSTING

Create a sequence of classifiers, giving higher influence to more accurate classifiers.

At each iteration, make the currently misclassified examples more important (they get larger weight in the construction of the next classifier).

Then combine the classifiers by weighted vote (with weights given by classifier accuracy).

Page 35:

ENSEMBLE LEARNING : BOOSTING

Suppose there are just 7 training examples {1,2,3,4,5,6,7}.

Initially each example has a 1/7 (≈ 0.143) probability of being sampled.

The 1st round of boosting samples (with replacement) 7 examples {3,5,5,4,6,7,3} and builds a classifier from them.

Suppose examples {2,3,4,6,7} are correctly predicted by this classifier and examples {1,5} are wrongly predicted:

The weights of examples {1,5} are increased.

The weights of examples {2,3,4,6,7} are decreased.

The 2nd round of boosting again takes 7 examples, but now examples {1,5} are more likely to be sampled.

And so on, until some convergence criterion is met. (An AdaBoost-style sketch of this loop follows below.)
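A sketch of this reweighting loop in the style of AdaBoost. It is an illustration, not the slides' exact procedure: weighted fitting of decision stumps stands in for the weighted resampling described above (both are standard variants), and labels are assumed to be in {-1, +1}:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=10):
    """y in {-1, +1}. Returns the weak learners and their vote weights."""
    n = len(X)
    w = np.full(n, 1.0 / n)                    # start uniform, e.g. 1/7 each
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)       # weighted fit ~ weighted sampling
        pred = stump.predict(X)
        err = w[pred != y].sum()               # weighted error rate (w sums to 1)
        if err >= 0.5:                         # no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))  # accuracy -> influence
        w *= np.exp(-alpha * y * pred)         # mistakes up-weighted, rest down
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    votes = sum(a * s.predict(X) for s, a in zip(stumps, alphas))  # weighted vote
    return np.sign(votes)
```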

Page 36:

ENSEMBLE LEARNING : BOOSTING

Weights models according to performance.

Encourages each new model to become an "expert" on instances misclassified by earlier models.

Combines weak learners to generate a strong learner.

Page 37:

ENSEMBLE LEARNING

The Netflix Prize's first-place winner used gradient boosted decision trees:

http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf

Page 38:

THANK YOU FOR YOUR ATTENTION

Page 39:

Ask questions to narrow down the possibilities

Informatica building example

Mango machine learning

Cannot look at all trees