data analytics and machine learning

Data Analytics & Machine Learning MCS4102 Assignment 3.2 - Decision Trees U.V Vandebona

Upload: upekha-vandebona

Post on 17-Feb-2017


Category:

Education



TRANSCRIPT

Page 1: Data Analytics and Machine Learning

Data Analytics & Machine Learning MCS4102

Assignment 3.2 - Decision Trees

U.V Vandebona

Page 2: Data Analytics and Machine Learning

No.  Outlook   Temp.  Humidity  Windy  Class
1    Sunny     Hot    High      FALSE  Don't Play
2    Sunny     Hot    High      TRUE   Don't Play
3    Overcast  Hot    High      FALSE  Play
4    Rainy     Mild   High      FALSE  Play
5    Rainy     Cool   Normal    FALSE  Play
6    Rainy     Cool   Normal    TRUE   Don't Play
7    Overcast  Cool   Normal    TRUE   Play
8    Sunny     Mild   High      FALSE  Don't Play
9    Sunny     Cool   Normal    FALSE  Play
10   Rainy     Mild   Normal    FALSE  Play
11   Sunny     Mild   Normal    TRUE   Play
12   Overcast  Mild   High      TRUE   Play
13   Overcast  Hot    Normal    FALSE  Play
14   Rainy     Mild   High      TRUE   Don't Play
15   Sunny     Mild   Normal    TRUE   Play
16   Overcast  Mild   High      TRUE   Play
17   Overcast  Hot    Normal    FALSE  Play
18   Rainy     Mild   High      TRUE   Don't Play

Page 3: Data Analytics and Machine Learning

Play : 12    Don't Play : 6

Outlook Gain : 0.251629167

Outlook   Play  Don't Play  E        Total Rec.
Sunny     3     3           1.00000  6
Overcast  6     0           0        6
Rainy     3     3           1.00000  6
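The Outlook figures can be reproduced from the class counts alone. The following is a minimal sketch, assuming base-2 Shannon entropy and the usual definition of information gain (entropy of all records minus the size-weighted entropy of the branches); the function names `entropy` and `information_gain` are illustrative and do not come from the slides.

```python
from math import log2

def entropy(counts):
    """Base-2 Shannon entropy of a sequence of class counts."""
    total = sum(counts)
    # Empty classes contribute nothing (0 * log 0 is taken as 0).
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent, branches):
    """Parent entropy minus the size-weighted entropy of each branch."""
    total = sum(parent)
    remainder = sum(sum(b) / total * entropy(b) for b in branches)
    return entropy(parent) - remainder

# (Play, Don't Play) counts read off the slide.
all_records = (12, 6)
outlook = {"Sunny": (3, 3), "Overcast": (6, 0), "Rainy": (3, 3)}

branch_entropy = {value: entropy(c) for value, c in outlook.items()}
outlook_gain = information_gain(all_records, outlook.values())

print(branch_entropy)
print(f"Outlook gain: {outlook_gain:.6f}")
```

The Overcast branch is pure (all Play), so it adds nothing to the weighted sum; only the two evenly mixed branches reduce the gain.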

Play : 12    Don't Play : 6

Temp. Gain : 0.009155391

Temp.  Play  Don't Play  E        Total Rec.
Hot    3     2           0.97095  5
Mild   6     3           0.91830  9
Cool   3     1           0.81128  4

Page 4: Data Analytics and Machine Learning

Play : 12    Don't Play : 6

Humidity Gain : 0.171128637

Humidity  Play  Don't Play  E        Total Rec.
High      4     5           0.99108  9
Normal    8     1           0.50326  9

Play : 12    Don't Play : 6

Windy Gain : 0.040655551

Windy  Play  Don't Play  E        Total Rec.
FALSE  7     2           0.76420  9
TRUE   5     4           0.99108  9
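The four gains can also be computed directly from the 18 records, which makes the choice of root attribute explicit. This is a sketch under stated assumptions: the records are transcribed from the table on Page 2 (record 2's truncated class is read as "Don't Play", which matches the 12 / 6 totals), and the helper names are illustrative.

```python
from collections import Counter, defaultdict
from math import log2

ATTRIBUTES = ("Outlook", "Temp", "Humidity", "Windy")

# The 18 training records from the table; the last field is the class.
RECORDS = [
    ("Sunny", "Hot", "High", False, "Don't Play"),
    ("Sunny", "Hot", "High", True, "Don't Play"),
    ("Overcast", "Hot", "High", False, "Play"),
    ("Rainy", "Mild", "High", False, "Play"),
    ("Rainy", "Cool", "Normal", False, "Play"),
    ("Rainy", "Cool", "Normal", True, "Don't Play"),
    ("Overcast", "Cool", "Normal", True, "Play"),
    ("Sunny", "Mild", "High", False, "Don't Play"),
    ("Sunny", "Cool", "Normal", False, "Play"),
    ("Rainy", "Mild", "Normal", False, "Play"),
    ("Sunny", "Mild", "Normal", True, "Play"),
    ("Overcast", "Mild", "High", True, "Play"),
    ("Overcast", "Hot", "Normal", False, "Play"),
    ("Rainy", "Mild", "High", True, "Don't Play"),
    ("Sunny", "Mild", "Normal", True, "Play"),
    ("Overcast", "Mild", "High", True, "Play"),
    ("Overcast", "Hot", "Normal", False, "Play"),
    ("Rainy", "Mild", "High", True, "Don't Play"),
]

def entropy(labels):
    """Base-2 Shannon entropy of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(records, column):
    """Entropy of all records minus the weighted entropy after splitting on `column`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[column]].append(record[-1])
    remainder = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return entropy([record[-1] for record in records]) - remainder

gains = {name: information_gain(RECORDS, i) for i, name in enumerate(ATTRIBUTES)}
root = max(gains, key=gains.get)

for name, value in gains.items():
    print(f"{name:9s} {value:.9f}")
print("split on:", root)
```

Outlook has the largest gain of the four, so it becomes the root of the tree.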

Page 5: Data Analytics and Machine Learning

Outlook
    Sunny: ?
    Overcast: [Play]
    Rainy: ?

Page 6: Data Analytics and Machine Learning

Sunny branch (Outlook = Sunny)

Play : 3    Don't Play : 3

Temp. Gain : 0.540852083

Temp.  Play  Don't Play  E        Total Rec.
Hot    0     2           0        2
Mild   2     1           0.91830  3
Cool   1     0           0        1

Humidity Gain : 1.00000

Humidity  Play  Don't Play  E  Total Rec.
High      0     3           0  3
Normal    3     0           0  3

Windy Gain : 0.08170

Windy  Play  Don't Play  E        Total Rec.
FALSE  1     2           0.91830  3
TRUE   2     1           0.91830  3
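The same calculation is repeated on one branch at a time, using only the records that reached it and only the attributes not yet used. The sketch below assumes the six Outlook = Sunny records (Nos. 1, 2, 8, 9, 11 and 15 in the table on Page 2); the helper names are illustrative.

```python
from collections import Counter, defaultdict
from math import log2

ATTRIBUTES = ("Temp", "Humidity", "Windy")  # Outlook is already used at the root

# Records 1, 2, 8, 9, 11 and 15: the Outlook = Sunny subset, class last.
SUNNY = [
    ("Hot", "High", False, "Don't Play"),
    ("Hot", "High", True, "Don't Play"),
    ("Mild", "High", False, "Don't Play"),
    ("Cool", "Normal", False, "Play"),
    ("Mild", "Normal", True, "Play"),
    ("Mild", "Normal", True, "Play"),
]

def entropy(labels):
    """Base-2 Shannon entropy of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(records, column):
    """Subset entropy minus the weighted entropy after splitting on `column`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[column]].append(record[-1])
    remainder = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return entropy([record[-1] for record in records]) - remainder

sunny_gains = {name: information_gain(SUNNY, i) for i, name in enumerate(ATTRIBUTES)}
next_split = max(sunny_gains, key=sunny_gains.get)
print(sunny_gains, "->", next_split)
```

A gain equal to the subset's full entropy (1.0 here) means the split leaves only pure leaves, so no further splitting is needed below that node.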

Page 7: Data Analytics and Machine Learning

Outlook
    Sunny: Humidity
        High: [Don’t Play]
        Normal: [Play]
    Overcast: [Play]
    Rainy: ?

Page 8: Data Analytics and Machine Learning

Rainy branch (Outlook = Rainy)

Play : 3    Don't Play : 3

Temp. Gain : 0.00000

Temp.  Play  Don't Play  E        Total Rec.
Hot    0     0           0        0
Mild   2     2           1.00000  4
Cool   1     1           1.00000  2

Humidity Gain : 0.08170

Humidity  Play  Don't Play  E        Total Rec.
High      1     2           0.91830  3
Normal    2     1           0.91830  3

Windy Gain : 1.00000

Windy  Play  Don't Play  E  Total Rec.
FALSE  3     0           0  3
TRUE   0     3           0  3

Page 9: Data Analytics and Machine Learning

Final Decision Tree

Outlook
    Sunny: Humidity
        High: [Don’t Play]
        Normal: [Play]
    Overcast: [Play]
    Rainy: Windy
        False: [Play]
        True: [Don’t Play]
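Once built, the tree is just a lookup structure. As an illustration, the final tree can be written as nested dictionaries and walked from the root to a leaf; the representation, the `classify` helper and the use of "Rainy" (as spelled in the data table) for the third Outlook branch are assumptions made for this sketch.

```python
# Final tree as nested dicts: {attribute: {value: subtree or class label}}.
TREE = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "Don't Play", "Normal": "Play"}},
        "Overcast": "Play",
        "Rainy": {"Windy": {False: "Play", True: "Don't Play"}},
    }
}

def classify(tree, record):
    """Follow the branch matching the record until a leaf (a string) is reached."""
    node = tree
    while isinstance(node, dict):
        attribute, branches = next(iter(node.items()))
        node = branches[record[attribute]]
    return node

# Records 1, 7 and 14 from the table, paired with their recorded class.
examples = [
    ({"Outlook": "Sunny", "Temp": "Hot", "Humidity": "High", "Windy": False}, "Don't Play"),
    ({"Outlook": "Overcast", "Temp": "Cool", "Humidity": "Normal", "Windy": True}, "Play"),
    ({"Outlook": "Rainy", "Temp": "Mild", "Humidity": "High", "Windy": True}, "Don't Play"),
]

predictions = [classify(TREE, record) for record, _ in examples]
for (record, expected), predicted in zip(examples, predictions):
    print(record["Outlook"], "->", predicted, "(table says:", expected + ")")
```

Temp. never appears in the tree, so it is carried in each record but never consulted.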

Page 10: Data Analytics and Machine Learning

Previous Decision Tree with 14 Records

The 18-record data set yields the same decision tree as the previous 14-record data set. The attributes that already had high information gain now have even higher values.

Page 11: Data Analytics and Machine Learning

Data Analytics & Machine Learning MCS4102

Assignment 3.1
Bayesian Learning Techniques - Naïve Bayes Algorithm

U.V Vandebona (MCS/2013/072) Index No : 13440722

Page 12: Data Analytics and Machine Learning

Naïve Bayes Algorithm for Twitter Text Analysis

Twitter analysis aims to detect the class a tweet belongs to. For example, if the classes are positive and negative:

› “Have a nice day!” The algorithm should report that this is a positive message.

› “I had a bad day” The algorithm should report that this is a negative message.

Page 13: Data Analytics and Machine Learning

Classification Task

From a machine learning point of view, this is a classification task, and naive Bayes is an algorithm well suited to this kind of task.

The naive Bayes algorithm uses probabilities to decide which class best matches a given input text.

Page 14: Data Analytics and Machine Learning

Training

The classification decision is based on a model obtained after the training process.

Model training is done by analyzing the relationship between the words in the training tweets and their classification categories.

Page 15: Data Analytics and Machine Learning

Training Set

Each tweet to be classified contains words denoted Wi (i = 1..n). For each word Wi, the following probabilities (P) can be extracted from the training data set:

› P(Wi given Positive) = (The number of positive tweets with Wi) / (The number of positive tweets)

› P(Wi given Negative) = (The number of negative tweets with Wi) / (The number of negative tweets)
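The two ratios can be estimated by counting, for each class, how many tweets contain each word. The sketch below uses a tiny, hypothetical training set and a simple whitespace tokenizer; neither the data nor the `word_likelihoods` helper comes from the slides.

```python
from collections import Counter

# Hypothetical labelled tweets, for illustration only.
training = [
    ("have a nice day", "Positive"),
    ("what a nice surprise", "Positive"),
    ("i had a bad day", "Negative"),
]

def word_likelihoods(tweets, label):
    """P(W given label) = tweets of that label containing W / tweets of that label."""
    # A set per tweet, so a word repeated inside one tweet is counted once.
    docs = [set(text.split()) for text, tweet_label in tweets if tweet_label == label]
    containing = Counter(word for words in docs for word in words)
    return {word: n / len(docs) for word, n in containing.items()}

positive = word_likelihoods(training, "Positive")
negative = word_likelihoods(training, "Negative")

print("P(nice given Positive) =", positive["nice"])
print("P(day given Positive)  =", positive["day"])
print("P(bad given Negative)  =", negative["bad"])
```

Words that never occur with a class are simply absent from that class's dictionary, which corresponds to a probability of 0 under the formula above.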

Page 16: Data Analytics and Machine Learning

Test Set

For the entire test set we will have:

› P(Positive) = (The number of positive tweets) / (The total number of tweets)

› P(Negative) = (The number of negative tweets) / (The total number of tweets)

Page 17: Data Analytics and Machine Learning

Calculation

To calculate the probability of a tweet being positive or negative, given the words it contains:

› P(Positive given tweet) = P(Tweet given Positive) x P(Positive) / P(Tweet)

› P(Negative given tweet) = P(Tweet given Negative) x P(Negative) / P(Tweet)

Page 18: Data Analytics and Machine Learning

Calculation

P(Tweet) is the same for both classes, so it can be dropped when the two are compared. Each text is also taken to be present once in the training set, and the words are treated as independent given the class (the naive assumption):

› P(Positive given tweet) ∝ P(Tweet given Positive) x P(Positive) = P(W1 given Positive) x P(W2 given Positive) x … x P(Wn given Positive) x P(Positive)

› P(Negative given tweet) ∝ P(Tweet given Negative) x P(Negative) = P(W1 given Negative) x P(W2 given Negative) x … x P(Wn given Negative) x P(Negative)
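A short numeric sketch may help show why the P(Tweet) term does not affect the decision: it divides both class scores by the same number. The word probabilities and priors below are hypothetical values chosen for illustration, and `class_score` is an illustrative name.

```python
from math import prod

def class_score(words, likelihoods, prior):
    """P(W1 given C) x ... x P(Wn given C) x P(C), before dividing by P(Tweet)."""
    return prod(likelihoods[w] for w in words) * prior

# Hypothetical per-word probabilities for a two-word tweet.
p_word_given_positive = {"nice": 0.6, "day": 0.5}
p_word_given_negative = {"nice": 0.1, "day": 0.5}
words = ["nice", "day"]

score_positive = class_score(words, p_word_given_positive, prior=0.5)
score_negative = class_score(words, p_word_given_negative, prior=0.5)

# With two classes, P(Tweet) is the sum of the two scores; dividing by it
# turns the scores into probabilities without changing which one is larger.
p_tweet = score_positive + score_negative
posterior_positive = score_positive / p_tweet
posterior_negative = score_negative / p_tweet

print(score_positive, score_negative)
print(posterior_positive, posterior_negative)
```

Because only the ordering matters for classification, the unnormalized scores are enough to pick a class.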

Page 19: Data Analytics and Machine Learning

Calculation

Finally, P(Positive given tweet) and P(Negative given tweet) are compared; the class with the higher probability decides whether the tweet is positive or negative.