
CS513-Data Mining
Lecture 5: Algorithms

Waheed Noor

Computer Science and Information Technology, University of Balochistan, Quetta, Pakistan

April 2016


Outline

1 Homework
2 Statistical Modeling
3 Classification Rules: Inferring Rudimentary Rules
4 Decision Trees
5 Association Rules
6 Quiz
7 Linear Models: Linear Regression
8 Clustering: k-means Clustering
9 Readings and References


Homework

Q1: What is the probability that the red box was picked, given that you have picked an orange?
Q2: What is the probability of picking an apple from the red box?


Statistical Modeling

Statistical modeling encodes the contribution of each observation in the training data in the form of probabilities.
The decision is then made for the class with the highest probability given the test example.
Feeding the algorithm more information, contributed by more variables, can be beneficial.

Example: If $X = [X_1, X_2, \ldots, X_M]$ is an $M$-dimensional feature vector and $Y \in \{0, 1\}$ is a binary class variable, then for a given test example $x_i = (x_{i1}, x_{i2}, \ldots, x_{iM})$, if $P(Y = 0 \mid X = x_i) = 0.28$ and $P(Y = 1 \mid X = x_i) = 0.72$, the predicted class is $Y = 1$ since it has the maximum probability.


Statistical Modeling

For simplicity, we can assume that the attributes are equally important and independent.
This method is then called naive Bayes, since it is based on Bayes' rule and naively assumes independence.
More precisely, the attributes are assumed independent given the class, i.e., $P(X_1, X_2 \mid Y) = P(X_1 \mid Y)\,P(X_2 \mid Y)$.
In real-world datasets, however, the attributes are neither equally important nor independent, i.e., the assumption does not hold.
Even so, this technique is still able to achieve good results with a simple model.


Statistical Modeling

Using Bayes' rule for the conditional (posterior) probability:

$$P(Y = y_i \mid X = x_i) = \frac{P(X = x_i \mid Y = y_i)\,P(Y = y_i)}{P(X = x_i)}.$$

Using the rules of probability, we can re-write the denominator:

$$P(Y = y_i \mid X = x_i) = \frac{P(X = x_i \mid Y = y_i)\,P(Y = y_i)}{\sum_{Y} P(X = x_i \mid Y)\,P(Y)}.$$

Under the naive assumption and with a binary class, this becomes:

$$P(Y = y_i \mid X = x_i) = \frac{\prod_{m=1}^{M} P(X_m = x_{mi} \mid Y = y_i)\,P(Y = y_i)}{\prod_{m=1}^{M} P(X_m = x_{mi} \mid Y = 0)\,P(Y = 0) + \prod_{m=1}^{M} P(X_m = x_{mi} \mid Y = 1)\,P(Y = 1)}.$$
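To make the formula concrete, here is a minimal sketch (not from the slides) of a categorical naive Bayes classifier; the tiny dataset, attribute values, and function names are illustrative assumptions:

from collections import defaultdict

def train_naive_bayes(instances, labels):
    """Estimate the counts needed for P(Y) and P(X_m = v | Y)."""
    class_counts = defaultdict(int)
    value_counts = defaultdict(int)   # key: (class, attribute index, value)
    for x, y in zip(instances, labels):
        class_counts[y] += 1
        for m, v in enumerate(x):
            value_counts[(y, m, v)] += 1
    return class_counts, value_counts

def predict(x, class_counts, value_counts):
    """Return the class with the highest (unnormalised) posterior score."""
    n = sum(class_counts.values())
    scores = {}
    for y, cy in class_counts.items():
        score = cy / n                                # P(Y = y)
        for m, v in enumerate(x):
            score *= value_counts[(y, m, v)] / cy     # P(X_m = v | Y = y)
        scores[y] = score
    return max(scores, key=scores.get)

# Illustrative toy data: each instance is (outlook, windy).
X = [("sunny", "false"), ("sunny", "true"), ("rainy", "false"), ("overcast", "false")]
Y = ["no", "no", "yes", "yes"]
class_counts, value_counts = train_naive_bayes(X, Y)
print(predict(("sunny", "false"), class_counts, value_counts))   # -> "no"

Note that the denominator of the formula is omitted here: it is the same for both classes, so comparing the numerators is enough.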


An Example

Figure: Summary table of the above data

Figure: Test example


Class Activity


Discussion: Statistical Modeling

Naive Bayes is simple and easy to implement, yet it still achieves good results in practice.
Things can go wrong if an attribute value never appears together with a particular class.
For example, if outlook = sunny never appears with play = yes, then P(Outlook = sunny | Play = yes) = 0, and because of the product all the probabilities become zero, which is not good.
Treatment: one possible way is to weight the probabilities, i.e., add 1 to the numerator and 3 to the denominator (one for each of the three possible outlook values).
This method is called the Laplace estimator, after the great eighteenth-century French mathematician Pierre Laplace.
Missing values are not a problem, since they simply do not appear in the frequency counts.
For numeric attributes, we just need to assume a suitable distribution and estimate its parameters.
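A minimal sketch of how the Laplace estimator changes the conditional probability estimate (the counts below follow the hypothetical zero-count example above):

def laplace_estimate(value_count, class_count, n_values, alpha=1):
    """P(X = v | Y = y) with Laplace smoothing: add alpha for each of the
    n_values possible attribute values, hence alpha * n_values in the denominator."""
    return (value_count + alpha) / (class_count + alpha * n_values)

# outlook = sunny assumed never observed with play = yes (0 out of 9),
# and outlook has 3 possible values (sunny, overcast, rainy).
print(laplace_estimate(0, 9, 3))   # 1/12 instead of 0
print(laplace_estimate(4, 9, 3))   # (4 + 1) / 12 for a non-zero count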


1R

Definition (Holte, 1993): 1R, or 1-rule, is a simple classification-rule algorithm that generates a one-level decision tree, expressed as a set of rules that all test one particular attribute.

It is simple and efficient.
It often achieves high accuracy, which may reflect the fact that the structure underlying many real-world datasets is quite rudimentary.
Often a single attribute is enough to determine the class of an instance accurately.


How 1R Works

Select an attribute and generate a branch for each of its possible values, i.e., test the different values.
For each branch, assign the class that occurs most often in the training data.
Repeat for all input attributes.
Break ties by picking a class at random.
Each attribute generates a set of rules, one rule for each value of the attribute.
Select the rule set with the highest accuracy, i.e., the smallest error rate.

Class Activity: You should be able to write pseudocode for this algorithm.
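For comparison with your own pseudocode, one possible minimal sketch (assuming instances stored as dictionaries; all names and the toy data are illustrative):

from collections import Counter, defaultdict

def one_r(instances, labels, attributes):
    """Return (best_attribute, rules): rules maps each value of that
    attribute to the majority class among instances having that value."""
    best = None
    for attr in attributes:
        # Count class frequencies for every value of this attribute.
        counts = defaultdict(Counter)
        for x, y in zip(instances, labels):
            counts[x[attr]][y] += 1
        # One rule per value: predict the most frequent class
        # (ties here fall to Counter order rather than a random pick).
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best[0], best[1]

# Illustrative toy data.
X = [{"outlook": "sunny", "windy": "false"},
     {"outlook": "sunny", "windy": "true"},
     {"outlook": "overcast", "windy": "false"},
     {"outlook": "rainy", "windy": "true"}]
y = ["no", "no", "yes", "yes"]
print(one_r(X, y, ["outlook", "windy"]))   # outlook wins with zero errors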


An Example

Figure: Weather data


1R-Rule for Weather Data

Figure: 1R rules for the weather data

Homework: Identify the possible sets of classification rules for the IRIS data and determine the 1R rule.


Divide-and-Conquer: Decision Trees

Construction of a decision tree is a recursive process.
First, an attribute is selected as the root node and a branch is made for each of its possible values.
The process is then repeated recursively for each branch, using only those instances that actually reach the branch.
The process terminates for a branch when all instances reaching it have the same classification.
This means the leaves of the tree carry the class labels.


Decision Trees

We need some measure to determine the best splitting attribute among many candidates.
The measure used for decision trees, often called a measure of purity, is information, measured in bits.
The bits associated with a node of the tree represent the amount of information needed to classify a new instance, given the instances reaching that node.
The information value is generally calculated using entropy, defined as
$$\mathrm{entropy}(p_1, p_2, \ldots, p_n) = -p_1 \log p_1 - p_2 \log p_2 - \cdots - p_n \log p_n,$$
where each $p_j$ is the fraction of instances of class $j$ among the instances reaching the node.


Explanation by Example

Consider again the weather data:


Step 1: Consider each attribute as a root and calculate information.

Figure: Tree stumps for the weather data

Consider outlook for example:

1. info([2,3]) = entropy(2/5, 3/5) = -(2/5) log(2/5) - (3/5) log(3/5) = 0.971 bits. Similarly, info([4,0]) = 0 and info([3,2]) = 0.971 bits.

2. Now take the average, weighted by the fraction of examples reaching each branch: info([2,3], [4,0], [3,2]) = (5/14)(0.971) + (4/14)(0) + (5/14)(0.971) = 0.693 bits.

3. Finally, calculate the information gain: gain(outlook) = info([9,5]) - info([2,3], [4,0], [3,2]) = 0.940 - 0.693 = 0.247 bits.
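To make the arithmetic concrete, a small sketch (not part of the slides) that reproduces these numbers for the outlook attribute:

import math

def info(counts):
    """Entropy, in bits, of a list of class counts, e.g. info([2, 3])."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def weighted_info(branch_counts):
    """Average entropy of the branches, weighted by the fraction of
    instances reaching each branch, e.g. info([2,3], [4,0], [3,2])."""
    total = sum(sum(b) for b in branch_counts)
    return sum(sum(b) / total * info(b) for b in branch_counts)

branches = [[2, 3], [4, 0], [3, 2]]             # outlook = sunny / overcast / rainy
print(info([2, 3]))                             # ~0.971 bits
print(weighted_info(branches))                  # ~0.693 bits
print(info([9, 5]) - weighted_info(branches))   # gain(outlook) ~0.247 bits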


Explanation by Example

The information gain for the outlook attribute, 0.247 bits, corresponds to the information value of creating a branch on outlook.
Repeating the procedure for each attribute gives us:
gain(outlook) = 0.247 bits
gain(temperature) = 0.029 bits
gain(humidity) = 0.152 bits
gain(windy) = 0.048 bits
We select as the root of the tree the attribute responsible for the largest information gain, i.e., outlook in our case.


Explanation by Example

Step 2: Continue Step 1 recursively for each branch, excluding the attribute selected in Step 1, since it would not add anything new.

The information gain for the remaining three attributes when outlook is sunny:
gain(temperature) = 0.571 bits
gain(humidity) = 0.971 bits
gain(windy) = 0.020 bits
Obviously, we select humidity for this branch.


Explanation by Example

Step 3: Continue the recursive procedure for growing the tree until either:

all leaf nodes are pure, i.e., every instance they contain belongs to the same class, or
no further split is possible.

Figure: Final decision tree for the weather data


Discussion

The information measure needs to have the following characteristics:

1. When the count of any one class is zero, the information value is zero.
2. When the counts of all classes are equal, the information value reaches its maximum.
3. The information must obey the multistage property, e.g., info([2,3,4]) = info([2,7]) + (7/9) info([3,4]).

Fortunately, the entropy measure satisfies all three characteristics.
Attributes with many branches will always be preferred by this criterion, which in some cases may not be good for prediction.
One remedy to this problem is to use the gain ratio rather than the information gain.


Discussion

Assume we add an ID attribute to the weather data; it produces 14 branches, one per instance. The gain ratio is then calculated as follows:

1. Calculate the intrinsic information value of the split: info([1,1,...,1]) = 14 × (−1/14 × log 1/14) = 3.807 bits.
2. Calculate the gain ratio: gain ratio = information gain / intrinsic information value = 0.940 / 3.807 = 0.247.

As we can see, the value is reduced to a great extent; however, in comparison with the other four attributes, the ID attribute still has the higher value.

This remaining issue can be fixed by choosing the attribute that maximizes the gain ratio, provided its information gain is at least as great as the average information gain of all attributes under consideration.
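A small sketch of this gain-ratio computation, with the info() helper repeated from the earlier snippet so the fragment is self-contained:

import math

def info(counts):
    """Entropy, in bits, of a list of class counts (same helper as above)."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

intrinsic = info([1] * 14)       # split information of the ID attribute, ~3.807 bits
gain_id = info([9, 5])           # each ID branch is pure, so the gain is the full ~0.940 bits
print(gain_id / intrinsic)       # gain ratio ~0.247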


Discussion

The divide-and-conquer approach to tree induction (also called top-down induction of decision trees) was mainly developed by J. Ross Quinlan of the University of Sydney, Australia, in 1986.
Over the years the technique has been refined by many researchers, but mainly by Quinlan himself.
The method based on information gain is called ID3, whereas the use of the gain ratio is one of the improvements that make it robust in practice.
A series of further improvements resulted in the C4.5 algorithm, which includes methods for dealing with numeric attributes, missing values, and noisy data, and for generating rules from trees.


Pruning

Readings & Homework: Read about and understand the different decision tree pruning techniques, and prepare a report of one or two pages on them.


Association Rules

Definition: Association rules, unlike classification rules, can predict more than one variable, i.e., the prediction is not limited to the class variable.

Like classification rules, they can also be generated by a divide-and-conquer rule-induction procedure, run for each possible expression that can occur on the right-hand side of a rule.
That is, the rule-induction process is executed for every possible combination of attributes and every possible combination of values.
In this way, an enormous number of rules would be created.


Item Sets

The procedure above is simply infeasible; we need to prune it down based on coverage (support) and accuracy (confidence).
To make it work, ignore the left- and right-hand sides of the rules and concentrate on the attribute-value pairs that have a pre-defined minimum coverage.
The resulting set of pairs is called an item set, and an individual attribute-value pair is called an item.


Constructing Association Rules

After identifying all item sets with the pre-specified minimum coverage, we can generate rules as follows:

Convert each item set into a rule, or set of rules, with at least the pre-specified minimum accuracy.
Some item sets produce only one rule and some produce many.
For example, consider the three-item set (with coverage 4): humidity = normal, windy = false, play = yes.
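A minimal sketch of turning one item set into candidate rules and keeping those that meet a minimum accuracy (confidence); the tiny dataset and all names are illustrative assumptions, not the weather data from the slides:

from itertools import combinations

# Illustrative toy dataset of attribute-value records.
data = [
    {"humidity": "normal", "windy": "false", "play": "yes"},
    {"humidity": "normal", "windy": "false", "play": "yes"},
    {"humidity": "normal", "windy": "true",  "play": "yes"},
    {"humidity": "high",   "windy": "false", "play": "no"},
]

def count(items):
    """Number of instances matching every (attribute, value) pair in items."""
    return sum(all(row.get(a) == v for a, v in items) for row in data)

def rules_from_itemset(itemset, min_confidence=1.0):
    """Enumerate antecedent -> consequent rules from one item set and keep
    those whose confidence = count(itemset) / count(antecedent) is high enough."""
    support = count(itemset)
    rules = []
    for r in range(1, len(itemset)):
        for antecedent in combinations(itemset, r):
            confidence = support / count(antecedent)
            if confidence >= min_confidence:
                consequent = tuple(p for p in itemset if p not in antecedent)
                rules.append((antecedent, consequent, confidence))
    return rules

itemset = [("humidity", "normal"), ("windy", "false"), ("play", "yes")]
for rule in rules_from_itemset(itemset):
    print(rule)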


Numeric Prediction: Linear Regression

So far we have seen classification and rules based on categorical/nominal attributes; numeric attributes can be used with these methods if discretization is applied first.

Definition (Linear Regression): If the outcome/class is numeric and all the input attributes are also numeric, linear regression is a natural choice. In linear regression, the class is expressed as a linear combination of the attributes with a set of weights/coefficients:

$$Y = w_0 X_0 + w_1 X_1 + w_2 X_2 + \cdots + w_K X_K,$$

where $Y$ is the class; $w_k$, for $k = 1, 2, \ldots, K$, is the weight/coefficient associated with the $k$-th attribute; and $X_k$ represents the $k$-th attribute. The attribute $X_0$ is a constant equal to 1, and $w_0$ is its weight.


Linear Regression: How it works

The weights are calculated from the training data.
The least-squares method is used to estimate the best set of weights.
The least-squares method first forms the sum of squared differences between the predicted and the actual class values over all training instances.
This sum is then minimized to obtain the best set of estimated weights from the training data.


Linear Regression: How it is done

For each instance $i$ we can predict its class by $\hat{y}_i = \sum_{k=0}^{K} w_k x_{ik}$. Since we are interested in the difference between the actual class $y_i$ and the predicted class $\hat{y}_i$:

$$y_i - \hat{y}_i = y_i - \sum_{k=0}^{K} w_k x_{ik},$$

and the sum of squares of this difference over all $n$ instances gives us:

$$\sum_{i=1}^{n} \left( y_i - \sum_{k=0}^{K} w_k x_{ik} \right)^2.$$

Therefore, we have to minimize this expression with respect to the weights.
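A minimal sketch of this least-squares estimation using NumPy; the toy data are made up for illustration:

import numpy as np

# Toy training data: n = 4 instances, K = 2 numeric attributes.
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0]])
y = np.array([3.1, 2.4, 4.6, 7.1])

# Add the constant attribute X0 = 1 so that w0 acts as the intercept.
X1 = np.column_stack([np.ones(len(X)), X])

# Minimize sum_i (y_i - sum_k w_k x_ik)^2 with respect to the weights.
w, residuals, rank, _ = np.linalg.lstsq(X1, y, rcond=None)
print("weights:", w)
print("predictions:", X1 @ w)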


Linear Regression: Discussion

Linear regression is a simple yet very useful technique for numeric prediction.
It has been widely used in statistics for a long time.
It suffers if there is evidence of a nonlinear relationship in the training data.
It is still very useful as a building block for more complex learning techniques.


Clustering

Clustering is a form of unsupervised learning.
Unlike classification, in clustering problems the class attribute/variable is not given in the data set, nor do we need to predict a class.
Rather, we are interested in identifying the natural groups (clusters) the data instances fall into, i.e., we try to capture some similarity between instances.
The grouping reflects the underlying mechanism, often unknown, of the problem domain from which the data are generated.
Similarity between data points is often captured by a distance function that measures the distance of a point from the centers of the possible clusters.
The clustering algorithm then tries to minimize this distance function, so that each point is assigned to the cluster whose center is closest and the clusters are far away from each other.


Types of Clustering

Previously we have seen different types of clustering, such as:

Hard/Exclusive: each instance belongs to only one cluster.
Soft/Overlapping: each instance may belong to many clusters.
Probabilistic: each instance belongs to each cluster with a certain probability.
Hierarchical: instances are grouped coarsely at the top level, and the groups are refined further all the way down to the leaves.

Obviously, the choice depends on the problem, or on the underlying clustering mechanism if it is known.


An Example: Clustering Documents

Assume a document is represented by a vector of words, $(x_1, x_2, \ldots, x_K)$, where $x_k = 1$ if the $k$-th word occurs in the document.
Documents with a similar set of words may be about the same topic.
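A tiny sketch of this binary word-vector representation (the two example documents are made up):

# Illustrative sketch: binary word-vector representation of documents.
docs = ["data mining finds patterns", "mining algorithms find patterns in data"]
vocabulary = sorted({w for d in docs for w in d.split()})

def to_vector(doc):
    """x_k = 1 if the k-th vocabulary word occurs in the document, else 0."""
    words = set(doc.split())
    return [1 if w in words else 0 for w in vocabulary]

for d in docs:
    print(to_vector(d))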

Some Issues in Clustering

Clustering high-dimensional data is difficult, since the distances between data points in a high-dimensional space may all be about the same (the curse of dimensionality).
Small data sets are easy to cluster, whereas clustering large datasets can become computationally intractable.


Euclidean Distance

Definition: The distance between two points is the length of the path connecting them in Euclidean space (in Euclidean space, a set of points is expressible in terms of distances and angles). The Euclidean distance between two points $(x_1, y_1)$ and $(x_2, y_2)$ in two-dimensional space ($\mathbb{R}^2$) is given by the Pythagorean theorem:

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}.$$

The corresponding formula for the distance between points $p$ and $q$ in $n$-dimensional space ($\mathbb{R}^n$) is:

$$d = \lVert p - q \rVert = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}.$$
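A one-function sketch of the $n$-dimensional formula (the example points are arbitrary):

import numpy as np

def euclidean(p, q):
    """Euclidean distance ||p - q|| between two n-dimensional points."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((q - p) ** 2))

print(euclidean([0, 0], [3, 4]))        # 5.0
print(euclidean([1, 2, 3], [4, 6, 3]))  # 5.0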


k-means Clustering

Definition: The k-means algorithm (MacQueen, 1967) is an iterative algorithm for assigning data points to a predefined number of clusters, based on the minimum Euclidean distance between the points and the cluster centroids.

The assignment of data points to their closest centroids is found by minimizing (optimizing) the following objective function:

$$\sum_{j=1}^{J} \sum_{i=1}^{n_j} \lVert x_i^{(j)} - c_j \rVert^2,$$

where $\lVert x_i^{(j)} - c_j \rVert^2$ is the squared distance between a data point $x_i^{(j)}$ assigned to the $j$-th cluster and the centroid $c_j$ of that cluster. In this objective function we are actually minimizing the within-cluster sum of squared distances.


k-means: The Algorithm

Initialize: Select J random points (they may be data points) as the initial centroids.
Step 1: Assign each data point to its closest centroid.
Step 2: Replace the j-th centroid with the average (mean) of the data points assigned to the j-th cluster.
Convergence: If the centroids do not change, stop; otherwise repeat from Step 1.
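A minimal NumPy sketch of these steps (random initialization, assignment, update, convergence test); the toy data and the choice J = 2 are illustrative:

import numpy as np

def k_means(X, J, max_iter=100, seed=0):
    """Cluster the rows of X into J clusters; return (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    # Initialize: pick J distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=J, replace=False)]
    for _ in range(max_iter):
        # Step 1: assign each point to its closest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # Step 2: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centroids[j]
            for j in range(J)
        ])
        # Convergence: stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assign

# Toy 2-D data forming two loose groups.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
centroids, assign = k_means(X, J=2)
print(centroids)
print(assign)

Note how the empty-cluster case is handled explicitly, which connects to the re-initialization advice in the discussion below.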


Discussion

The result of the k-means algorithm depends on the initialization; it is better to try several different starting points to avoid sub-optimal results.
Selecting the number of clusters optimally depends on domain knowledge, or we can use model-selection measures such as MDL, AIC, or BIC; we will see some of them later.
It is possible for a cluster to become empty; you should take care of this in the implementation and try re-initialization.


Important Readings

– Read some literature on ID3 and C4.5 and their applications.
– Covering algorithms: classification rule construction and their relation to decision trees.
– Logistic regression and its applications; try it in Weka and discuss your findings.
– Instance-based learning and its algorithms; discuss how you used Weka to apply these algorithms.
– Read about and understand other distance measures, such as Mahalanobis and Manhattan.


References

Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.

Holte, R. C. (1993). Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11, 63-91.

MacQueen, J. B. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability (pp. 281-297). Berkeley: University of California Press.

Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Francisco, CA: Morgan Kaufmann Publishers Inc.

Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques (2nd ed.). San Francisco, CA: Morgan Kaufmann.
