
CS513-Data Mining
Lecture 5: Algorithms

Waheed Noor

Computer Science and Information Technology, University of Balochistan, Quetta, Pakistan

April 2016


Outline

1 Homework
2 Statistical Modeling
3 Classification Rules: Inferring Rudimentary Rules
4 Decision Trees
5 Association Rules
6 Quiz
7 Linear Models: Linear Regression
8 Clustering: k-means Clustering
9 Readings and References


Homework

Q1: What is the probability that the red box was picked, given that you have picked an orange?
Q2: What is the probability of picking an apple from the red box?


Statistical Modeling

Statistical modeling encodes the contribution of each observation in the training data in the form of probabilities.
The decision is then made for the class with the highest probability given the test example.
Feeding the algorithm more information, contributed by more variables, can be beneficial.

Example: If $X = [X_1, X_2, \ldots, X_M]$ is an $M$-dimensional feature vector and $Y \in \{0, 1\}$ is a binary class variable, then for a given test example $x_i = (x_{i1}, x_{i2}, \ldots, x_{iM})$, if $P(Y = 0 \mid X = x_i) = 0.28$ and $P(Y = 1 \mid X = x_i) = 0.72$, the predicted class is $Y = 1$ since it has the maximum probability.


Statistical Modeling

For simplicity, we can assume that the attributes are equally important and independent.
This method is then called naive Bayes, since it is based on Bayes' rule and naively assumes independence.
More precisely, the attributes are assumed independent given the class, i.e., $P(X_1, X_2 \mid Y) = P(X_1 \mid Y)\,P(X_2 \mid Y)$.
In real-world datasets, however, the attributes are neither equally important nor independent, i.e., the assumption does not hold.
Even so, this technique is still able to achieve good results with a simple model.


Statistical Modeling

Using Bayes' rule for the conditional (posterior) probability:

$$P(Y = y_i \mid X = x_i) = \frac{P(X = x_i \mid Y = y_i)\,P(Y = y_i)}{P(X = x_i)}.$$

Using the rules of probability, we can re-write the denominator:

$$P(Y = y_i \mid X = x_i) = \frac{P(X = x_i \mid Y = y_i)\,P(Y = y_i)}{\sum_{Y} P(X = x_i \mid Y)\,P(Y)}.$$

Under the naive assumption and with a binary class, this becomes:

$$P(Y = y_i \mid X = x_i) = \frac{\prod_{m=1}^{M} P(X_m = x_{mi} \mid Y = y_i)\,P(Y = y_i)}{\prod_{m=1}^{M} P(X_m = x_{mi} \mid Y = 0)\,P(Y = 0) + \prod_{m=1}^{M} P(X_m = x_{mi} \mid Y = 1)\,P(Y = 1)}.$$
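To make the formula concrete, here is a minimal sketch (not from the slides) of a categorical naive Bayes classifier; the tiny dataset, attribute values, and function names are illustrative assumptions:

from collections import defaultdict

def train_naive_bayes(instances, labels):
    """Estimate the counts needed for P(Y) and P(X_m = v | Y)."""
    class_counts = defaultdict(int)
    value_counts = defaultdict(int)   # key: (class, attribute index, value)
    for x, y in zip(instances, labels):
        class_counts[y] += 1
        for m, v in enumerate(x):
            value_counts[(y, m, v)] += 1
    return class_counts, value_counts

def predict(x, class_counts, value_counts):
    """Return the class with the highest (unnormalised) posterior score."""
    n = sum(class_counts.values())
    scores = {}
    for y, cy in class_counts.items():
        score = cy / n                                # P(Y = y)
        for m, v in enumerate(x):
            score *= value_counts[(y, m, v)] / cy     # P(X_m = v | Y = y)
        scores[y] = score
    return max(scores, key=scores.get)

# Illustrative toy data: each instance is (outlook, windy).
X = [("sunny", "false"), ("sunny", "true"), ("rainy", "false"), ("overcast", "false")]
Y = ["no", "no", "yes", "yes"]
class_counts, value_counts = train_naive_bayes(X, Y)
print(predict(("sunny", "false"), class_counts, value_counts))   # -> "no"

Note that the denominator of the formula is omitted here: it is the same for both classes, so comparing the numerators is enough.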


An Example

Figure: Summary table of the above data

Figure: Test example


Class Activity


Discussion: Statistical Modeling

Naive Bayes is simple and easy to implement, yet it still achieves good results in practice.
Things can go wrong if an attribute value never appears together with a particular class.
For example, if outlook = sunny never appears with play = yes, then P(Outlook = sunny | Play = yes) = 0, and because of the product all the probabilities become zero, which is not good.
Treatment: one possible way is to weight the probabilities, i.e., add 1 to the numerator and 3 to the denominator (one for each of the three possible outlook values).
This method is called the Laplace estimator, after the great eighteenth-century French mathematician Pierre Laplace.
Missing values are not a problem, since they simply do not appear in the frequency counts.
For numeric attributes, we just need to assume a suitable distribution and estimate its parameters.
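A minimal sketch of how the Laplace estimator changes the conditional probability estimate (the counts below follow the hypothetical zero-count example above):

def laplace_estimate(value_count, class_count, n_values, alpha=1):
    """P(X = v | Y = y) with Laplace smoothing: add alpha for each of the
    n_values possible attribute values, hence alpha * n_values in the denominator."""
    return (value_count + alpha) / (class_count + alpha * n_values)

# outlook = sunny assumed never observed with play = yes (0 out of 9),
# and outlook has 3 possible values (sunny, overcast, rainy).
print(laplace_estimate(0, 9, 3))   # 1/12 instead of 0
print(laplace_estimate(4, 9, 3))   # (4 + 1) / 12 for a non-zero count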


1R

Definition (Holte, 1993): 1R, or 1-rule, is a simple classification-rule algorithm that generates a one-level decision tree, expressed as a set of rules that all test one particular attribute.

It is simple and efficient.
It often achieves high accuracy, which may reflect the fact that the structure underlying many real-world datasets is quite rudimentary.
Often a single attribute is enough to determine the class of an instance accurately.


How 1R Works

Select an attribute and generate a branch for each of its possible values, i.e., test the different values.
For each branch, assign the class that occurs most often in the training data.
Repeat for all input attributes.
Break ties by picking a class at random.
Each attribute generates a set of rules, one rule for each value of the attribute.
Select the rule set with the highest accuracy, i.e., the smallest error rate.

Class Activity: You should be able to write pseudocode for this algorithm.
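For comparison with your own pseudocode, one possible minimal sketch (assuming instances stored as dictionaries; all names and the toy data are illustrative):

from collections import Counter, defaultdict

def one_r(instances, labels, attributes):
    """Return (best_attribute, rules): rules maps each value of that
    attribute to the majority class among instances having that value."""
    best = None
    for attr in attributes:
        # Count class frequencies for every value of this attribute.
        counts = defaultdict(Counter)
        for x, y in zip(instances, labels):
            counts[x[attr]][y] += 1
        # One rule per value: predict the most frequent class
        # (ties here fall to Counter order rather than a random pick).
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best[0], best[1]

# Illustrative toy data.
X = [{"outlook": "sunny", "windy": "false"},
     {"outlook": "sunny", "windy": "true"},
     {"outlook": "overcast", "windy": "false"},
     {"outlook": "rainy", "windy": "true"}]
y = ["no", "no", "yes", "yes"]
print(one_r(X, y, ["outlook", "windy"]))   # outlook wins with zero errors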


An Example

Figure: Weather data


1R-Rule for Weather Data

Figure: 1R rules for the weather data

Homework: Identify the possible sets of classification rules for the IRIS data and determine the 1R rule.


Divide-and-Conquer: Decision Trees

Construction of a decision tree is a recursive process.
First, an attribute is selected as the root node and a branch is made for each of its possible values.
The process is then repeated recursively for each branch, using only those instances that actually reach the branch.
The process terminates for a branch when all instances reaching it have the same classification.
This means the leaves of the tree carry the class labels.


Decision Trees

We need some measure to determine the best splitting attribute among many candidates.
The measure used for decision trees, often called a measure of purity, is information, measured in bits.
The bits associated with a node of the tree represent the amount of information needed to classify a new instance, given the instances reaching that node.
The information value is generally calculated using entropy, defined as
$$\mathrm{entropy}(p_1, p_2, \ldots, p_n) = -p_1 \log p_1 - p_2 \log p_2 - \cdots - p_n \log p_n,$$
where each $p_j$ is the fraction of instances of class $j$ among the instances reaching the node.


Explanation by Example

Consider again the weather data:


Step 1: Consider each attribute as a root and calculate information.

Figure: Tree stumps for the weather data

Consider outlook for example:

1. info([2,3]) = entropy(2/5, 3/5) = -(2/5) log(2/5) - (3/5) log(3/5) = 0.971 bits. Similarly, info([4,0]) = 0 and info([3,2]) = 0.971 bits.

2. Now take the average, weighted by the fraction of examples reaching each branch: info([2,3], [4,0], [3,2]) = (5/14)(0.971) + (4/14)(0) + (5/14)(0.971) = 0.693 bits.

3. Finally, calculate the information gain: gain(outlook) = info([9,5]) - info([2,3], [4,0], [3,2]) = 0.940 - 0.693 = 0.247 bits.
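To make the arithmetic concrete, a small sketch (not part of the slides) that reproduces these numbers for the outlook attribute:

import math

def info(counts):
    """Entropy, in bits, of a list of class counts, e.g. info([2, 3])."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def weighted_info(branch_counts):
    """Average entropy of the branches, weighted by the fraction of
    instances reaching each branch, e.g. info([2,3], [4,0], [3,2])."""
    total = sum(sum(b) for b in branch_counts)
    return sum(sum(b) / total * info(b) for b in branch_counts)

branches = [[2, 3], [4, 0], [3, 2]]             # outlook = sunny / overcast / rainy
print(info([2, 3]))                             # ~0.971 bits
print(weighted_info(branches))                  # ~0.693 bits
print(info([9, 5]) - weighted_info(branches))   # gain(outlook) ~0.247 bits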


Explanation by Example

The information gain for the outlook attribute, 0.247 bits, corresponds to the information value of creating a branch on outlook.
Repeating the procedure for each attribute gives us:
gain(outlook) = 0.247 bits
gain(temperature) = 0.029 bits
gain(humidity) = 0.152 bits
gain(windy) = 0.048 bits
We select as the root of the tree the attribute responsible for the largest information gain, i.e., outlook in our case.


Explanation by Example

Step 2: Continue Step 1 recursively for each branch, excluding the attribute selected in Step 1, since it would not add anything new.

The information gain for the remaining three attributes when outlook is sunny:
gain(temperature) = 0.571 bits
gain(humidity) = 0.971 bits
gain(windy) = 0.020 bits
Obviously, we select humidity for this branch.


Explanation by Example

Step 3: Continue the recursive procedure for growing the tree until either:

all leaf nodes are pure, i.e., every instance they contain belongs to the same class, or
no further split is possible.

Figure: Final decision tree for the weather data


Discussion

The information measure needs to have the following characteristics:

1. When the count of any one class is zero, the information value is zero.
2. When the counts of all classes are equal, the information value reaches its maximum.
3. The information must obey the multistage property, e.g., info([2,3,4]) = info([2,7]) + (7/9) info([3,4]).

Fortunately, the entropy measure satisfies all three characteristics.
Attributes with many branches will always be preferred by this criterion, which in some cases may not be good for prediction.
One remedy to this problem is to use the gain ratio rather than the information gain.


Discussion

Assume we add an ID attribute to the weather data; it produces 14 branches, one per instance. The gain ratio is then calculated as follows:

1. Calculate the intrinsic information value of the split: info([1,1,...,1]) = 14 × (−1/14 × log 1/14) = 3.807 bits.
2. Calculate the gain ratio: gain ratio = information gain / intrinsic information value = 0.940 / 3.807 = 0.247.

As we can see, the value is reduced to a great extent; however, in comparison with the other four attributes, the ID attribute still has the higher value.

This remaining issue can be fixed by choosing the attribute that maximizes the gain ratio, provided its information gain is at least as great as the average information gain of all attributes under consideration.
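A small sketch of this gain-ratio computation, with the info() helper repeated from the earlier snippet so the fragment is self-contained:

import math

def info(counts):
    """Entropy, in bits, of a list of class counts (same helper as above)."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

intrinsic = info([1] * 14)       # split information of the ID attribute, ~3.807 bits
gain_id = info([9, 5])           # each ID branch is pure, so the gain is the full ~0.940 bits
print(gain_id / intrinsic)       # gain ratio ~0.247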


Discussion

The divide-and-conquer approach to tree induction (also called top-down induction of decision trees) was mainly developed by J. Ross Quinlan of the University of Sydney, Australia, in 1986.
Over the years the technique has been refined by many researchers, but mainly by Quinlan himself.
The method based on information gain is called ID3, whereas the use of the gain ratio is one of the improvements that make it robust in practice.
A series of further improvements resulted in the C4.5 algorithm, which includes methods for dealing with numeric attributes, missing values, and noisy data, and for generating rules from trees.


Pruning

Readings & Homework: Read about and understand the different decision tree pruning techniques, and prepare a report of one or two pages on them.


Association Rules

Definition: Association rules, unlike classification rules, can predict more than one variable, i.e., the prediction is not limited to the class variable.

Like classification rules, they can also be generated by a divide-and-conquer rule-induction procedure, run for each possible expression that can occur on the right-hand side of a rule.
That is, the rule-induction process is executed for every possible combination of attributes and every possible combination of values.
In this way, an enormous number of rules would be created.


Item Sets

The procedure above is simply infeasible; we need to prune it down based on coverage (support) and accuracy (confidence).
To make it work, ignore the left- and right-hand sides of the rules and concentrate on the attribute-value pairs that have a pre-defined minimum coverage.
The resulting set of pairs is called an item set, and an individual attribute-value pair is called an item.


Constructing Association Rules

After identifying all item sets with the pre-specified minimum coverage, we can generate rules as follows:

Convert each item set into a rule, or set of rules, with at least the pre-specified minimum accuracy.
Some item sets produce only one rule and some produce many.
For example, consider the three-item set (with coverage 4): humidity = normal, windy = false, play = yes.
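A minimal sketch of turning one item set into candidate rules and keeping those that meet a minimum accuracy (confidence); the tiny dataset and all names are illustrative assumptions, not the weather data from the slides:

from itertools import combinations

# Illustrative toy dataset of attribute-value records.
data = [
    {"humidity": "normal", "windy": "false", "play": "yes"},
    {"humidity": "normal", "windy": "false", "play": "yes"},
    {"humidity": "normal", "windy": "true",  "play": "yes"},
    {"humidity": "high",   "windy": "false", "play": "no"},
]

def count(items):
    """Number of instances matching every (attribute, value) pair in items."""
    return sum(all(row.get(a) == v for a, v in items) for row in data)

def rules_from_itemset(itemset, min_confidence=1.0):
    """Enumerate antecedent -> consequent rules from one item set and keep
    those whose confidence = count(itemset) / count(antecedent) is high enough."""
    support = count(itemset)
    rules = []
    for r in range(1, len(itemset)):
        for antecedent in combinations(itemset, r):
            confidence = support / count(antecedent)
            if confidence >= min_confidence:
                consequent = tuple(p for p in itemset if p not in antecedent)
                rules.append((antecedent, consequent, confidence))
    return rules

itemset = [("humidity", "normal"), ("windy", "false"), ("play", "yes")]
for rule in rules_from_itemset(itemset):
    print(rule)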


Numeric Prediction: Linear Regression

So far we have seen classification and rules based on categorical/nominal attributes; numeric attributes can be used with these methods if discretization is applied first.

Definition (Linear Regression): If the outcome/class is numeric and all the input attributes are also numeric, linear regression is a natural choice. In linear regression, the class is expressed as a linear combination of the attributes with a set of weights/coefficients:

$$Y = w_0 X_0 + w_1 X_1 + w_2 X_2 + \cdots + w_K X_K,$$

where $Y$ is the class; $w_k$, for $k = 1, 2, \ldots, K$, is the weight/coefficient associated with the $k$-th attribute; and $X_k$ represents the $k$-th attribute. The attribute $X_0$ is a constant equal to 1, and $w_0$ is its weight.


Linear Regression: How it works

The weights are calculated from the training data.
The least-squares method is used to estimate the best set of weights.
The least-squares method first forms the sum of squared differences between the predicted and the actual class values over all training instances.
This sum is then minimized to obtain the best set of estimated weights from the training data.


Linear Regression: How it is done

For each instance $i$ we can predict its class by $\hat{y}_i = \sum_{k=0}^{K} w_k x_{ik}$. Since we are interested in the difference between the actual class $y_i$ and the predicted class $\hat{y}_i$:

$$y_i - \hat{y}_i = y_i - \sum_{k=0}^{K} w_k x_{ik},$$

and the sum of squares of this difference over all $n$ instances gives us:

$$\sum_{i=1}^{n} \left( y_i - \sum_{k=0}^{K} w_k x_{ik} \right)^2.$$

Therefore, we have to minimize this expression with respect to the weights.
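A minimal sketch of this least-squares estimation using NumPy; the toy data are made up for illustration:

import numpy as np

# Toy training data: n = 4 instances, K = 2 numeric attributes.
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0]])
y = np.array([3.1, 2.4, 4.6, 7.1])

# Add the constant attribute X0 = 1 so that w0 acts as the intercept.
X1 = np.column_stack([np.ones(len(X)), X])

# Minimize sum_i (y_i - sum_k w_k x_ik)^2 with respect to the weights.
w, residuals, rank, _ = np.linalg.lstsq(X1, y, rcond=None)
print("weights:", w)
print("predictions:", X1 @ w)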


Linear Regression: Discussion

Linear regression is a simple yet very useful technique for numeric prediction.
It has been widely used in statistics for a long time.
It suffers if there is evidence of a nonlinear relationship in the training data.
It is still very useful as a building block for more complex learning techniques.


Clustering

Clustering is a form of unsupervised learning.
Unlike classification, in clustering problems the class attribute/variable is not given in the data set, nor do we need to predict a class.
Rather, we are interested in identifying the natural groups (clusters) the data instances fall into, i.e., we try to capture some similarity between instances.
The grouping reflects the underlying mechanism, often unknown, of the problem domain from which the data are generated.
Similarity between data points is often captured by a distance function that measures the distance of a point from the centers of the possible clusters.
The clustering algorithm then tries to minimize this distance function, so that each point is assigned to the cluster whose center is closest and the clusters are far away from each other.


Types of Clustering

Previously we have seen different types of clustering, such as:

Hard/Exclusive: each instance belongs to only one cluster.
Soft/Overlapping: each instance may belong to many clusters.
Probabilistic: each instance belongs to each cluster with a certain probability.
Hierarchical: instances are grouped coarsely at the top level, and the groups are refined further all the way down to the leaves.

Obviously, the choice depends on the problem, or on the underlying clustering mechanism if it is known.


An Example: Clustering Documents

Assume a document is represented by a vector of words, $(x_1, x_2, \ldots, x_K)$, where $x_k = 1$ if the $k$-th word occurs in the document.
Documents with a similar set of words may be about the same topic.
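A tiny sketch of this binary word-vector representation (the two example documents are made up):

# Illustrative sketch: binary word-vector representation of documents.
docs = ["data mining finds patterns", "mining algorithms find patterns in data"]
vocabulary = sorted({w for d in docs for w in d.split()})

def to_vector(doc):
    """x_k = 1 if the k-th vocabulary word occurs in the document, else 0."""
    words = set(doc.split())
    return [1 if w in words else 0 for w in vocabulary]

for d in docs:
    print(to_vector(d))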

Some Issues in Clustering

Clustering high-dimensional data is difficult, since the distances between data points in a high-dimensional space may all be about the same (the curse of dimensionality).
Small data sets are easy to cluster, whereas clustering large datasets can become computationally intractable.


Euclidean Distance

Definition: The distance between two points is the length of the path connecting them in Euclidean space (in Euclidean space, a set of points is expressible in terms of distances and angles). The Euclidean distance between two points $(x_1, y_1)$ and $(x_2, y_2)$ in two-dimensional space ($\mathbb{R}^2$) is given by the Pythagorean theorem:

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}.$$

The corresponding formula for the distance between points $p$ and $q$ in $n$-dimensional space ($\mathbb{R}^n$) is:

$$d = \lVert p - q \rVert = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}.$$
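A one-function sketch of the $n$-dimensional formula (the example points are arbitrary):

import numpy as np

def euclidean(p, q):
    """Euclidean distance ||p - q|| between two n-dimensional points."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((q - p) ** 2))

print(euclidean([0, 0], [3, 4]))        # 5.0
print(euclidean([1, 2, 3], [4, 6, 3]))  # 5.0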


k-means Clustering

Definition: The k-means algorithm (MacQueen, 1967) is an iterative algorithm for assigning data points to a predefined number of clusters, based on the minimum Euclidean distance between the points and the cluster centroids.

The assignment of data points to their closest centroids is found by minimizing (optimizing) the following objective function:

$$\sum_{j=1}^{J} \sum_{i=1}^{n_j} \lVert x_i^{(j)} - c_j \rVert^2,$$

where $\lVert x_i^{(j)} - c_j \rVert^2$ is the squared distance between a data point $x_i^{(j)}$ assigned to the $j$-th cluster and the centroid $c_j$ of that cluster. In this objective function we are actually minimizing the within-cluster sum of squared distances.


k-means: The Algorithm

Initialize: Select J random points (they may be data points) as the initial centroids.
Step 1: Assign each data point to its closest centroid.
Step 2: Replace the j-th centroid with the average (mean) of the data points assigned to the j-th cluster.
Convergence: If the centroids do not change, stop; otherwise repeat from Step 1.
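A minimal NumPy sketch of these steps (random initialization, assignment, update, convergence test); the toy data and the choice J = 2 are illustrative:

import numpy as np

def k_means(X, J, max_iter=100, seed=0):
    """Cluster the rows of X into J clusters; return (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    # Initialize: pick J distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=J, replace=False)]
    for _ in range(max_iter):
        # Step 1: assign each point to its closest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # Step 2: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centroids[j]
            for j in range(J)
        ])
        # Convergence: stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assign

# Toy 2-D data forming two loose groups.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
centroids, assign = k_means(X, J=2)
print(centroids)
print(assign)

Note how the empty-cluster case is handled explicitly, which connects to the re-initialization advice in the discussion below.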


Discussion

The result of the k-means algorithm depends on the initialization; it is better to try several different starting points to avoid sub-optimal results.
Selecting the number of clusters optimally depends on domain knowledge, or we can use model-selection measures such as MDL, AIC, or BIC; we will see some of them later.
It is possible for a cluster to become empty; you should take care of this in the implementation and try re-initialization.


Important Readings

– Read some literature on ID3 and C4.5 and their applications.
– Covering algorithms: classification rule construction and their relation to decision trees.
– Logistic regression and its applications; try it in Weka and discuss your findings.
– Instance-based learning and its algorithms; discuss how you used Weka to apply these algorithms.
– Read about and understand other distance measures, such as Mahalanobis and Manhattan.


References

Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.

Holte, R. C. (1993). Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11, 63-91.

MacQueen, J. B. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability (pp. 281-297). Berkeley: University of California Press.

Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Francisco, CA: Morgan Kaufmann Publishers Inc.

Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques (2nd ed.). San Francisco, CA: Morgan Kaufmann.
