

P. Perner (Ed.): MLDM 2013, LNAI 7988, pp. 466–480, 2013. © Springer-Verlag Berlin Heidelberg 2013

When Classification becomes a Problem: Using Branch-and-Bound to Improve Classification Efficiency

Armand Prieditis and Moontae Lee

Neustar Labs, Mountain View, CA 94041

[email protected]

Abstract. In a typical machine learning classification task there are two phases: training and prediction. This paper focuses on improving the efficiency of the prediction phase. When the number of classes is low, linear search among the classes is an efficient way to find the most likely class. However, when the number of classes is high, linear search is inefficient. For example, some applications such as geolocation or time-based classification might require millions of subclasses to fit the data. Specifically, this paper describes a branch-and-bound method to search for the most likely class where the training examples can be partitioned into thousands of subclasses. To get some idea of the performance of branch-and-bound classification, we generated a synthetic set of random trees comprising billions of classes and evaluated branch-and-bound classification. Our results show that branch-and-bound classification is effective when the number of classes is large. Specifically, branch-and-bound improves search efficiency logarithmically.

Keywords: Branch-and-bound, Bayesian Models, Classification.

1 Introduction and Motivation

In a typical machine learning classification task there are two phases: training and prediction. When the number of classes is low, linear search among the classes is an efficient way to find the most likely class. For example, consider a Bayesian classifier, where the task is to find the most likely class given an input example. More specifically, the task is to find a class c given input x such that P(c|x) is maximized. Using Bayes' theorem and removing the normalizing constant, the task is to find a class c such that P(x|c)P(c) is maximized. When the number of classes is high, however, linear search is inefficient. This paper focuses on improving the efficiency of the prediction phase when the number of classes is high (i.e., when the training data can be partitioned into thousands of subclasses) by using branch-and-bound. This paper presents preliminary results in artificial domains where we can directly control the number of subclasses; we are currently running experiments in several natural domains.

The motivation for this work is threefold. First, some applications might require millions of subclasses to fit the data. In particular, applications involving geo-location and time can have an enormous number of subclasses. Note that the subclasses do not have to be explicitly labeled in the data; they can be discovered in the training data with methods such as Expectation-Maximization. Such methods are beyond the scope of this paper. Second, fast prediction time is important in domains where a real-time response is needed. For example, a customer at a website might not be willing to wait more than a second to receive a targeted advertisement whose selection is based on a prediction. Finally, and in general, the more subclasses, the higher the accuracy. Of course, too many subclasses can result in overfitting.

The rest of the paper is organized as follows. Section 2 presents our approach to branch-and-bound classification. Section 3 describes how to derive a lower-bound for branch-and-bound classification. Section 4 describes how to use the lower-bound function for pruning in branch-and-bound classification. Section 5 describes the particular learning method we used to build trees. Section 6 presents results of applying our approach. Section 7 describes a non-probabilistic approach to branch-and-bound pruning. Section 8 discusses related work. Finally, Section 9 discusses the conclusions of this work and outlines several promising directions for future work.

2 Branch and Bound in Tree-Structured Classification

Simple models in machine learning are typically computationally efficient. For example, a probabilistic model where each of the attributes is independent of the others conditioned on the class is easy to compute. However, such a model, which is called the "Naïve Bayes" model, is sometimes inaccurate. If the training data could be grouped (i.e., clustered) into subclasses, the results are likely to be more accurate. Of course, the number of subclasses depends on the desired fit of the model to the training data: too many subclasses and the model can overfit the training data, resulting in a low training error rate but a high test error rate; too few and the model has both a high training error rate and a high test error rate.

The classical method of finding subclasses is to assume that the data includes a column with k values, which correspond to each of the k subclasses and are missing for every row in the training data. (A row in the training data corresponds to an example pattern and a column corresponds to a feature of the rows.) The learning task is to find the probability of each of the k subclasses for each row in the training data. Once the parameters associated with each subclass are determined, each row can then be “clustered” into one of the k subclasses based on the parameters. Many different methods can be used to fit those subclasses including the k-means algorithm [1] and the EM algorithm [2].

In order to exploit branch-and-bound’s efficiency, the classes described in this paper are assumed to be organized as a tree, which represents the training data at various levels of generality. For purposes of this paper, how the tree is built is unimportant; what is important is that there exists a method to find a most likely leaf node at a particular level. However, for reproducibility, we will describe the particular tree-building method that we used later.


The most important aspect of the tree is that there exists a way to evaluate the degree of fit between an input test case and a node. At prediction time, branch-and-bound finds a node in the tree that it believes to be the most likely node in the tree. Other methods are possible for determining the degree of fit; we chose likelihood because our particular tree is a probabilistic one. Branch-and-bound then returns the prediction associated with that node. In general, likelihood (or degree of fit) is correlated with the accuracy of the prediction generated from a test case. Indeed, this is one of the cornerstone assumptions in most of machine learning: that nearness of inputs correlates with nearness of outputs. Note that if the learned model were not structured as a tree, branch-and-bound would not make sense. For example, if the k subclasses were not in a hierarchy, the pruning described here would not make sense.

To understand how each leaf node in the tree corresponds to a subclass, consider a binary tree. For example, in a binary tree, the left child of the root might have the subclass label "Root/Left" and the right child "Root/Right." Assuming a tree with two levels, there are four subclasses, which correspond to each of the leaf nodes and which can be identified by the following labels: "Root/Left/Left," "Root/Left/Right," "Root/Right/Left," and "Root/Right/Right."

Note that the indexing scheme does not have to explicitly name each node using such hierarchical names. For example, in a binary tree with the root node having index 0, a node having index i can have as its left child the index 2i+1 and as its right child the index 2i+2. For ease of retrieval, we chose this method of numbering the tree's nodes. Other numbering schemes are possible.
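As an illustrative sketch of this arithmetic indexing (ours; the helper names are not from the paper), the children, parent, and depth of any node can be computed without storing child pointers:

    def children(i, k=2):
        # For a complete k-ary tree rooted at index 0, the children of node i
        # occupy indices k*i + 1 through k*i + k (2i+1 and 2i+2 when k=2).
        return [k * i + c for c in range(1, k + 1)]

    def parent(i, k=2):
        # Every node except the root (index 0) has parent floor((i - 1) / k).
        return None if i == 0 else (i - 1) // k

    def depth(i, k=2):
        # Depth of node i in the complete k-ary tree; the root has depth 0.
        d = 0
        while i != 0:
            i = (i - 1) // k
            d += 1
        return d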

We mentioned that each node is associated with a set of training examples. When we say "associated" we do not necessarily mean that any training examples are stored at a node. Instead, each node contains information that summarizes the training examples at that node. Specifically, the nodes in the tree are probabilistic, and each node is associated with:

• The probability p(t|s) of a node t given its parent node s: the number of training examples at the node relative to the number of training examples at the parent of the node.

• A mean vector µ: the average value of each column at the node.

• A variance vector σ²: the variance of each column at the node. Note that the variance typically decreases along a path.

• A min vector: the minimum value of each column at the node. Note that the min value is non-decreasing along a path.

• A max vector: the maximum value of each column at the node. Note that the max value is non-increasing along a path.

For simplicity, we used a Naïve Bayes model to compute likelihood. Under the Naïve Bayes model, the columns are assumed to be conditionally independent from each other given the subclass. Other models are possible such as a multivariate normal; we chose the Naïve Bayes model because it has a prediction-time complexity that is linear in the number of columns.

Using the Naïve Bayes assumption and assuming a Gaussian distribution, the likelihood at a node for a particular length-j vector x is defined as:

p(x; \mu, \sigma) = \prod_{i=1}^{j} \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left(-\frac{(x_i - \mu_i)^2}{2\sigma_i^2}\right)    (1)

More specifically, to find the most likely node in the tree we need to find the node that maximizes the path probability to the node multiplied by the likelihood of the input at the node (the p function above). The path probability is the product of the probability of each child given its parent, for every child node along the path. Note that the path probability to a node is the same as the probability of the subclass associated with the node.

Since we are interested in finding the most likely node, the p function can be simplified by taking the log and removing constants that are the same for any node:

nlp(x; \mu, \sigma) = \sum_{i=1}^{j} \left[ \log \sigma_i + \frac{(x_i - \mu_i)^2}{2\sigma_i^2} \right]    (2)

The term “nlp” means negative of the log of the likelihood. This form is simpler to compute because it involves a sum rather than a product. Using the nlp function, the task of branch-and-bound is to find a leaf node in the tree that minimizes the nlp function plus the sum of the path costs, which are derived from the negative of the log of the likelihood of each node given the parent node, over all the ancestor nodes in the path to the node. Once the most likely node is found, its most likely output value is returned. Here again, a variety of methods exist to produce the most likely output. Under the Naïve Bayes assumption and a Gaussian distribution, the most likely output at a node is the node’s mean output, which is simply that part of the mean vector µ that pertains to the output.
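As a concrete sketch (our own illustrative code, with an assumed node layout, not the authors' implementation), the nlp score of equation (2) and the total score of a leaf can be computed as follows:

    import math

    def nlp(x, mu, sigma):
        # Negative log-likelihood of input x at a node under the Gaussian
        # Naive Bayes model, with the constant sqrt(2*pi) terms dropped
        # (equation (2)).
        return sum(math.log(s) + (xi - m) ** 2 / (2.0 * s * s)
                   for xi, m, s in zip(x, mu, sigma))

    def leaf_score(x, leaf, nodes):
        # Score of a leaf = nlp at the leaf plus the sum of -log p(t|s) over
        # its ancestors. `nodes` is an assumed dict: index -> {'mu': [...],
        # 'sigma': [...], 'p': probability of the node given its parent}.
        cost = nlp(x, nodes[leaf]['mu'], nodes[leaf]['sigma'])
        i = leaf
        while i != 0:                  # walk up to the root (index 0)
            cost += -math.log(nodes[i]['p'])
            i = (i - 1) // 2           # parent under the binary indexing scheme
        return cost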

Note that the nlp function expresses a tradeoff between getting close to the mean of a leaf node (as weighted by the inverse of the standard deviation) and finding a leaf node with a low standard deviation.

3 Deriving a Lower-Bound

A lower-bound function is important in branch-and-bound algorithms because it guarantees that a potential least-cost solution is never pruned. It is possible to derive several different lower-bounds for the nlp function; here we describe a few that we derived. For a more general theory of how to derive lower-bound functions in search, see [3]. One simple lower-bound is based on bounding boxes for each column. For example, if the value x for a particular column is outside of the bounding box formed by the min and the max for that particular column, we know that the distance from x to the closer of the min or the max for that column is a lower-bound on the distance to the mean for that column. Therefore, we can substitute the closer of the min or the max for the mean in that case. If x is within the bounding box of the min and the max, the bound for that particular column is 0.
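As a small illustrative sketch (ours, not the authors' code), the per-column bounding-box distance can be computed by clipping the input value to the node's [min, max] interval:

    def box_distance(x_i, col_min, col_max):
        # Distance from the input value to the column's [min, max] interval:
        # zero inside the interval, otherwise the gap to the nearer endpoint.
        # This is a lower bound on |x_i - mu_i| for the mean of any node whose
        # interval is contained in [col_min, col_max].
        if x_i < col_min:
            return col_min - x_i
        if x_i > col_max:
            return x_i - col_max
        return 0.0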


Another lower-bound is the minimum, over the leaf nodes below a node, of the sum of the negative of the log of p(t|s) along the path from the node to the leaf, where s is the parent and t the child. For some applications, the combination of these lower-bounds might be tight. However, for the bounding-box style of lower-bound, when x is within the bounding box of the min and the max, no pruning occurs (at least none based on that particular column).

A better lower-bound is based on the nlp function itself. Note that when we say "lower bound," we mean that the function is a lower bound on the value returned by the most likely leaf node below that node. A lower-bound on the log σ_i term is easy to obtain simply by adding 1 to the standard deviation of each node, so that log σ_i ≥ 0; therefore, 0 is a lower-bound on that term. That leaves the accuracy burden of the lower-bound l on the following expression:

l(x; \mu, \sigma) = \sum_{i=1}^{j} \frac{(x_i - \mu_i)^2}{2\sigma_i^2}    (3)

One way to guarantee that this expression produces a lower-bound is to ensure that the standard deviation along a path does not increase for each particular column. How reasonable is this assumption? Suppose that a node is associated with the following data set: {-5,-1,1,5}. The standard deviation at that node is 4.16333. Suppose further that we split this data set into two subsets. Taking into account mirror images, there are exactly three potential splits:

Split    Child 1    Child 2    Std. Dev. Child 1    Std. Dev. Child 2
1        {-5,-1}    {1,5}      2.82843              2.82843
2        {-5,1}     {-1,5}     4.24264              4.24264
3        {-5,5}     {-1,1}     7.07107              1.41421

In at least two of these splits, at least one child has a standard deviation that increases from the root node. In particular, the second split results in two children whose standard deviations both increase. For the third split, the standard deviation increases for one child and decreases for the other. For the first split, the standard deviation decreases for both children. Intuitively, this split groups the negative values together and the positive values together, which makes sense from a clustering point of view.

It may be that any reasonable machine learning algorithm will attempt to reduce the “disorder” at a node by choosing those splits that reduce the standard deviation for a column [4]. Therefore, any reasonable machine learning algorithm would prefer the first split, which does reduce the standard deviation. However, with multiple columns it may be that one column’s standard deviation goes up while another column’s standard deviation goes down. So, it is not reasonable to assume that the standard deviation will be reduced along a path for each particular column because the machine learning algorithm might attempt to “average” out the reduction of “disorder” among all the columns.


We believe that any reasonable machine learning algorithm will split a node into a set of child nodes (while building the tree) such that the function (3) is reduced from the node to any child. We haven’t proven this yet, but we believe that any node can be split into a set of nodes that is guaranteed to reduce this function from the node to any child. For want of a better name, let us call this function the “disorder” function. Intuitively, this is what a machine learning algorithm is supposed to do: reduce the “disorder” for the purposes of classification. We also believe that even if a reasonable machine learning algorithm can’t always reduce the “disorder,” it will come close. More specifically, the above expression can be viewed as the average “disorder” at a node. (Technically, the average is simply the above expression divided by j. Since j is a constant for all nodes, it doesn’t matter if we divide by j or not.)

4 Using the Lower-Bound Function for Pruning

The general branch-and-bound strategy is well-known; here we describe how to apply it in the context of prediction. More specifically, a lower-bound function can be plugged into standard branch-and-bound algorithms to search the tree effectively. There are dozens of different branch-and-bound algorithms, developed as early as the 1960s [5]. The objective of each is to compute the function f(s) below:

f(s) = \begin{cases} t(s) & \text{if } T(s) \\ \min\{\, c(s,t) + f(t) \mid t \in d(s) \,\} & \text{otherwise} \end{cases}    (4)

Here, s corresponds to a node (or state), t(s) is a value for a leaf node s, T(s) is true if s is a leaf node; false otherwise, c(s,t) is the cost of reaching node t from node s, and d(s) returns the set of child nodes for node s. Note that the arc cost function c(s,t) is the negative of the log of the p(t|s), where s is the parent and t is the child.

The basic idea behind a branch-and-bound algorithm is that if a lower-bound l(s) for a node s in the tree exceeds an upper-bound u(t) for some other node t, then the first node can be pruned. Typically, this is accomplished by recording the minimum upper-bound seen among all the paths explored so far, where a path corresponds to a sequence of states generated by the child function d(s). For example, one can obtain an initial upper-bound for a node by randomly searching the tree below the node and returning the cost of the path (i.e., the sum of the c(s,t) values along the path) plus the leaf node value. When the upper bound is the same as the lower bound, the procedure halts; at that point the upper-bound = lower-bound = f(s).

However, this method is inefficient as it will evaluate nodes whose path cost + lower-bound estimate exceeds f(s). That is, this type of search can go "deeper" than the leaf node associated with f(s). A more efficient method is to evaluate only those nodes whose path cost + lower-bound estimate does not exceed f(s). One way to compute the f(s) function with this idea is the classic A* algorithm [6]. However, this algorithm requires exponential space. Instead, we chose the IDA* algorithm [7], a well-known search algorithm, because it only requires linear space. This algorithm trades off memory for computation by repeating a series of ever deeper searches until it finds a solution. At any depth of search, it prunes off all those nodes whose cost so far + underestimated remaining cost exceeds the current bound.

The IDA* algorithm is shown below. As input it takes in an index or pointer to a node.

Algorithm 1.

    def f(s):
        bound = 0
        newbound = g(s, bound)
        while newbound > bound:
            bound = newbound
            newbound = g(s, bound)
        return newbound

    def g(s, bound):
        if T(s):
            return t(s)
        if l(s) > bound:
            return l(s)
        return min(g(t, bound - c(s, t)) + c(s, t) for t in d(s))

The IDA* algorithm comprises two loops. The outer loop (embodied by the f function) incrementally increases the bound until a solution is found within the bound. The inner loop does the heavy lifting: it prunes all nodes whose cost so far plus estimate exceeds the bound. Once the most likely leaf node is found, its most likely output value (and node) can be returned. For simplicity, the algorithm shown above only returns the value of the most likely leaf and not the node.
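To make the interface assumed by Algorithm 1 concrete, the following self-contained sketch (ours; the toy tree and its numbers are hypothetical) wires the five callbacks T, t, l, c, and d to a tiny probabilistic tree and runs the same iterative-deepening loop:

    import math

    # A tiny hand-built probabilistic tree (hypothetical numbers, purely for
    # illustration). Each node: p = p(node | parent), mean and sigma vectors,
    # and a list of child indices (empty for leaves).
    tree = {
        0: dict(p=1.0, mu=[0.0], sigma=[4.0], children=[1, 2]),
        1: dict(p=0.5, mu=[-3.0], sigma=[2.0], children=[]),
        2: dict(p=0.5, mu=[3.0], sigma=[2.0], children=[]),
    }

    def nlp(x, mu, sigma):
        return sum(math.log(s) + (xi - m) ** 2 / (2.0 * s * s)
                   for xi, m, s in zip(x, mu, sigma))

    def ida_star(x, root=0):
        # Callbacks of Algorithm 1, closed over the query x.
        T = lambda s: not tree[s]['children']                  # leaf test
        t = lambda s: nlp(x, tree[s]['mu'], tree[s]['sigma'])  # leaf value
        l = lambda s: 0.0                  # trivial (but valid) lower bound
        c = lambda s, u: -math.log(tree[u]['p'])               # arc cost
        d = lambda s: tree[s]['children']                      # expansion

        def g(s, bound):
            if T(s):
                return t(s)
            if l(s) > bound:
                return l(s)
            return min(g(u, bound - c(s, u)) + c(s, u) for u in d(s))

        bound, newbound = 0.0, g(root, 0.0)
        while newbound > bound:
            bound = newbound
            newbound = g(root, bound)
        return newbound

    print(ida_star([2.5]))   # cost of the most likely leaf for query x = [2.5]

With the trivial lower bound l(s) = 0 the sketch prunes nothing; plugging in one of the bounds from Sections 3 and 4 is what makes the search efficient.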

5 The Particular Method that We Used to Build the Trees

We mentioned that the particular method we used to build the tree is unimportant for purposes of branch-and-bound classification. For reproducibility, here we describe the particular method that we used to build the tree. More generally, the training procedure that builds the tree should be able to split the training data into k subclasses using any standard clustering method. Specifically, we used an EM-like procedure assuming two subclasses per node. Of course, the number of subclasses depends on the desired fit of the model to the training data: too many subclasses and the model can overfit the training data, resulting in a low training error rate but a high test error rate; too few and the model has both a high training error rate and a high test error rate. Two subclasses is the smallest number of subclasses with which a tree can be built. Moreover, a binary tree can "simulate" multiple subclasses by repeated division of each node into finer and finer subclasses.

The EM-like procedure that we used to build the tree assumes that the data includes a column with k values, which correspond to each of the k subclasses.


This column's data is missing for every row in the training data. We built the tree by, at each split, finding k mean vectors, k variance vectors, and k subclass probabilities such that the probability of the data given the parameters is maximized under the assumed probability model. In our case, we used k=2, but k can be any integer.

Once these parameters (mean, variance, and subclass probabilities) are determined, each row can then be “clustered” into one of the k subclasses based on the parameters and the process can be recursively repeated on each of the subclasses. By “recursively repeated,” we mean that each row is assigned to one of the k subclasses and the process repeats recursively for each of the rows associated with each of the k subclasses.

Given a set of rows in a data set and a set of k subclasses, each of which is characterized by such a probability distribution function and a prior probability (i.e., frequency), the task is to find an assignment of each row to a single class.

This task is accomplished by finding the subclass that minimizes the nlp function plus the negative of the log of the subclass probability. This functional form tells us that there is a tradeoff between finding a subclass whose mean is close to the row (as weighted by the inverse of the standard deviation), finding a subclass with a low standard deviation (due to the log of the standard deviation term), and finding a subclass with a high subclass probability. We used a "hard" assignment tree-building procedure: once such a subclass is found, the assignment of the row to the subclass is "hard" in that each example is assigned to only one subclass, the most likely one. In contrast, in traditional EM, each row has a probability of belonging to each class.
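A minimal sketch of this hard-assignment step (our illustration; the parameter layout is an assumption): each row goes to the subclass that minimizes the nlp plus the negative log of the subclass probability.

    import math

    def assign_rows(rows, subclasses):
        # subclasses: list of dicts with 'mu', 'sigma', and 'prior' for the k
        # candidate subclasses. Returns, for each row, the index of the single
        # subclass minimizing nlp(row) - log(prior) (hard assignment, unlike
        # the soft responsibilities of traditional EM).
        def score(x, c):
            return (-math.log(c['prior'])
                    + sum(math.log(s) + (xi - m) ** 2 / (2.0 * s * s)
                          for xi, m, s in zip(x, c['mu'], c['sigma'])))
        return [min(range(len(subclasses)),
                    key=lambda j: score(x, subclasses[j]))
                for x in rows]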

Other tree-building procedures exist in machine learning. For example, decision trees build a structure by splitting on attribute values or ranges. However, such resulting structures may not be amenable to pruning.

6 Summary of Results with Random Trees

With only a handful of subclasses, linear search for the most likely subclass will outperform branch-and-bound because there is little to structure into a tree. Thus the branch-and-bound algorithm is not meant for small data sets that yield only a few subclasses. Therefore, to get some idea of how well branch-and-bound classification works, it is essential to test it on large trees, ones stemming from billions of training examples. Unfortunately, data to build such trees is not easily available, nor is it readily shareable.

As a result, we built a random set of data selected from up to 60 billion mixtures (each mixture corresponds to a subclass) and then ran our algorithm and compared it to the results for linear search.

The trees we used for the experiments here all have a uniform branching factor of k, where the k sibling nodes represent a split of the data into k subclasses. Each node of this tree is associated with a set of training examples and the children of a node represent a partition of those training examples from the parent. That is, the training examples at the parent node correspond to the union of those at the child nodes. In general, the trees do not have to have a uniform branching factor but it made experiment generation simpler.

We actually built the tree on the fly in such a way that the random tree was implicitly generated depth-first and exactly re-created each time the tree was traversed. Note that our comparison is not one in terms of accuracy, because exactly the same answer will be returned by both branch-and-bound and linear search. We are only interested in the number of nodes visited as the yardstick for comparison for both algorithms.

First, we implicitly create every path cost between a parent and a child node (i.e., the negative log probability of the child node given the parent node) based on the parent index by the following process. Assuming the branching factor is denoted by B, the system:

1) Determines the random seed based on the parent node index and the unique prime number given for path cost generation.

2) Samples B real values uniformly in [0, 1] and multiplies by the irrational Knuth coefficient to avoid getting periodic fractions.

3) Takes only the fractional parts of the real values and normalizes them so that the sum will be 1.

Fig. 1. Tree Depth vs. Log Number of Nodes Visited for Three Trees

As a random seed deterministically generates the pseudo-random numbers, we can always recreate any path cost in the tree based only on the parent index without explicitly storing all of the nodes in memory. This is an important point for testing prediction algorithms on extremely large training data: we have developed a method that does not require storing the training data. Instead, we generate the tree on the fly and recreate it as needed for purposes of prediction. Note that if we initialize the unique prime number given for path cost generation to a different value, it guarantees that a different random tree will result, without loss of generality.
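The following sketch shows how such re-creatable path costs might be implemented (our reconstruction of the idea rather than the authors' code; the particular prime, the integer sampling, and the use of Python's random module are assumptions):

    import math
    import random

    PATH_COST_PRIME = 7919               # assumed: any fixed prime plays this role
    KNUTH = (math.sqrt(5) - 1) / 2       # irrational Knuth multiplier, ~0.618

    def child_probabilities(parent_index, B):
        # Deterministically re-creatable probabilities p(child | parent): seed a
        # PRNG from the parent index and a fixed prime, draw B values, multiply
        # by the irrational coefficient, keep only the fractional parts, and
        # normalize them so that they sum to 1.
        rng = random.Random(parent_index * PATH_COST_PRIME)
        raw = [math.modf(rng.randrange(1, 10**9) * KNUTH)[0] for _ in range(B)]
        total = sum(raw)
        return [r / total for r in raw]

    def path_cost(parent_index, child_slot, B):
        # Arc cost -log p(child | parent); identical every time it is recomputed,
        # so the random tree never has to be stored explicitly.
        return -math.log(child_probabilities(parent_index, B)[child_slot])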


Second, we implicitly generate the range of each column of each node given the column range of its parent node by the following process. The system:

1) Determines the random seed based on the parent node index and unique prime number for division generation.

2) Uniformly samples B-1 points within the parent's min-max range. Clearly these B-1 points can again be recreated, as they are sampled deterministically based on the seed generated in 1).

3) If the current node is the ith child of its parent node, the ith subdivision becomes its min-max range.

Though we could not generate the min-max range for each node in one shot, we can create it by following the label path starting from the root. For example, in a binary tree, a label path such as "LEFT/RIGHT/RIGHT/LEFT" uniquely indicates the range of the given node in an efficient manner.

Finally, the system generates the mean and variance vectors for each node given its range as follows:

1) Determine the random seed for mean and variance based on the current node index and unique prime numbers for mean and variance respectively.

2) Uniformly sample one point between its min-max range based on the mean seed.

3) Compute a simple variance by averaging the squared distances between the min and the sampled mean, and between the max and the sampled mean.

4) Add a small normalized Gaussian noise based on the variance seed.

5) Add 1 to the variance computed from 4).

Here, adding 1 in step 5) prevents the variance from being close to 0, which would cause numerical problems.
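A corresponding sketch for generating a node's mean and variance from its min-max range for a single column (again our reconstruction; the prime seeds and the noise scale are assumptions):

    import random

    MEAN_PRIME, VAR_PRIME = 104729, 1299709   # assumed distinct primes

    def node_mean_variance(node_index, col_min, col_max):
        # Steps 1-5 for one column: seed separately for mean and variance,
        # sample the mean uniformly inside [min, max], set the variance to the
        # average squared distance from the mean to the two endpoints, perturb
        # it with a little Gaussian noise (kept positive here), and add 1 to
        # keep it away from zero.
        mean_rng = random.Random(node_index * MEAN_PRIME)
        var_rng = random.Random(node_index * VAR_PRIME)
        mean = mean_rng.uniform(col_min, col_max)
        variance = ((col_min - mean) ** 2 + (col_max - mean) ** 2) / 2.0
        variance += abs(var_rng.gauss(0.0, 0.01 * (variance + 1.0)))
        return mean, variance + 1.0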

Put together, these steps enable the creation of implicit random trees that simulate billions of mixture distributions, each with its own mixture proportion, min-max ranges, mean vector, and variance vector as a function of the node index and a random seed. It is important to understand that it is not possible to explicitly store all of these parameters for billions of nodes. We believe that such implicit random trees are a good vehicle for evaluating machine learning algorithms on enormous training sets with millions of subclasses.

Figure 1 shows the results of these experiments. As the figure shows, Branch-and-Bound search significantly beats linear search. In particular, for binary trees, it shows nearly two orders of magnitude speedup. In fact, as depth x increases the speedup is e^(1.37x) for a binary tree: this means the deeper the tree, the more effective the Branch-and-Bound. This means that with Branch-and-Bound search it is possible to do prediction with trees that are roughly 1.37 times deeper than for exhaustive (linear) search. This puts certain prediction problems within reach that were not possible before. Note that the comparison to linear search is not a strawman: this is the technique that is typically used in prediction and that fails with a large number of subclasses.


7 Branch-and-Bound Classification for Non-probabilistic Models

Probabilistic models are attractive because they can automatically weight and normalize each attribute based on the variance. One non-probabilistic approach that we consider here is k-means clustering [1], a popular clustering method. In k-means clustering, a set of k initial means is generated, one per subclass. Each row is assigned to the subclass whose mean minimizes the sum of squared distances over the attributes. Once the assignments are complete, the means are recomputed based on the assignments of rows to subclasses and the procedure repeats. Many different methods can be used to terminate this procedure, but one popular method is to terminate when the sum of the squared distances does not significantly change between iterations.

Note that k-means can be applied hierarchically: each node can again be split into k subclasses. This hierarchy is essential for the type of branch-and-bound pruning described here because it produces a tree. Note also that k-means does not suffer from an exponential branching factor as the number of attributes increases. This is because k-means combines multiple attributes by considering their sum.

To understand how to prune with k-means trees, consider that along any path the extrema (min and max) of an attribute tighten: the min increases and the max decreases. This means that the sum of the squared distances to the nearest extrema cannot decrease along a path. And that means that this sum can be used for pruning as a lower-bound in branch-and-bound algorithms such as IDA*. The downside of k-means trees is that they treat each attribute as being equal (i.e., the distance measure is the same for each attribute). The attributes can be normalized by dividing each attribute's deviation from the mean by the variance (just as in the nlp function). However, this normalization does not solve the problem of equal weighting.
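A sketch of this k-means-tree lower bound (ours; the node's min and max vectors are assumed to be stored as in Section 2): the squared Euclidean distance from the query to the node's bounding box can be fed to IDA* as l(s).

    def box_lower_bound(x, mins, maxs):
        # Squared Euclidean distance from query x to the node's bounding box
        # (per-column [min, max] intervals). Because child boxes nest inside
        # their parent's box, this value never decreases down a path, so it is
        # a valid lower bound on the k-means distance to any leaf below the node.
        total = 0.0
        for xi, lo, hi in zip(x, mins, maxs):
            if xi < lo:
                total += (lo - xi) ** 2
            elif xi > hi:
                total += (xi - hi) ** 2
        return total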

Note that clustering and classification are two different tasks: clustering involves grouping and classification involves prediction. However, clustering algorithms can be used to find subclasses for classification tasks.

8 Related Work

The Nearest Neighbor (NN) method is one of the oldest machine learning methods [8]. Instead of learning a classifier through a training process, this method finds the nearest training example to an input query and then returns the output associated with that example. With a large number of training examples, this method often outperforms other machine learning methods such as neural nets, Bayesian nets, and decision trees, especially when the outputs are not strongly determined by a single mathematical function of the inputs but only weakly correlated with them.

One straightforward but inefficient way to find the nearest training example is through linear search: compare the input to every training example and evaluate the output based on the nearest. The average time complexity of the linear approach is n/2 for n training examples. Linear search is instantly incremental because no processing is required to add a new training example. However, this approach repeats the entire search process again for each input query, becoming highly inefficient as n grows large.

The training examples can be organized into an approximately balanced tree called a kd-tree for faster prediction through branch-and-bound methods [9]. More specifically, the tree can be built in O(n log n) time through a linear-time median-finding algorithm. For example, one way is to sort the data on one dimension and then split the data at the median for that dimension. The process can then be repeated on each of the two parts. The fundamental concept in all of these ways is to split the data approximately in half for a particular dimension and then repeat the process on the two parts for the next dimension. Once all the dimensions have been split, this process repeats again from the first dimension. For k dimensions, this yields a tree with branching factor 2^k. For example, for two dimensions, the data is cut into two sets for the first dimension and each of those sets into two more sets, thus yielding a quad tree.
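A minimal sketch of this construction (ours, not taken from [9]); for brevity it uses the common variant that splits one dimension per level, yielding a binary tree rather than a 2^k branching factor, and sorts at each level instead of using a linear-time median-finding algorithm:

    def build_kd_tree(points, depth=0):
        # points: list of equal-length tuples. Split at the median of one
        # dimension, cycling through the dimensions by depth, and recurse;
        # the result is an approximately balanced tree of nested boxes.
        if len(points) <= 1:
            return {'point': points[0] if points else None,
                    'left': None, 'right': None}
        axis = depth % len(points[0])
        pts = sorted(points, key=lambda p: p[axis])
        mid = len(pts) // 2
        return {'point': pts[mid],
                'axis': axis,
                'left': build_kd_tree(pts[:mid], depth + 1),
                'right': build_kd_tree(pts[mid + 1:], depth + 1)}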

Assuming that n = B^d, where B is the average branching factor and d is the depth of the tree, the search for an input query can be completed in O(B^(d/k)) time, where k is a constant greater than 1. Although this method is not instantly incremental, standard tree-rebalancing methods attempt to keep the tree balanced while adding new training examples efficiently (i.e., in O(log n) time).

One shortcoming of the kd-tree is that it does not scale up as the number of input dimensions grows: the tree's branching factor is 2^k, where k is the number of input dimensions (i.e., the number of features or columns). This is because kd-trees are created by splitting on one dimension at a time. Therefore, the search complexity for a nearest neighbor in high-dimensional spaces can be worse than that of linear search.

Once the kd tree has been built by whatever method, various branch-and-bound methods can be applied to find the nearest neighbor. These methods use lower-bounds that can be derived from bounding hyper-rectangles associated with each node. That is, the distance from a particular input to the bounding hyper-rectangle is a lower-bound on the distance to any training example within the hyper-rectangle. This property is true for any node in the tree. For example, in two dimensions, there are nine cases for input location relative to a bounding rectangle, each having their own distance formula: within the rectangle (lower-bound is zero in that case), directly below, directly above, directly right, directly left, above-left, above-right, below-right, or below-left. The distance to this rectangle from an input is a lower-bound on the distance to any training example within the rectangle. This lower-bound property can be exploited by various branch-and-bound methods to dramatically reduce the search space for a nearest neighbor by pruning off branches of the tree that are guaranteed to be farther away than a given reasonable upper-bound.

Moore describes an EM-based variant of the standard kd tree-building procedure [10]. This variant uses bounding boxes to determine how deep to build the tree—that is, it can prune off parts of the tree at training time, thus helping to control over-fitting at prediction time. Note that the system described by Moore is focused on tree-building rather than prediction with a particular input, which is the focus of this paper. For example the EM-based variant described by Moore does not search the tree for a particular input as described here.


Other tree-building procedures exist in machine learning. For example, decision trees build a structure by splitting on attribute values or ranges. For continuous values, the procedure is remarkably similar to that of building kd-trees: choose a dimension, determine a splitting point, and then split the data. However, not all such trees are amenable to the branch-and-bound approach because it is not easy to derive lower-bounds for them.

The key insight with kd-trees and other learned structures is that more time spent during training to build easily searchable structures can pay off in less time spent at prediction. The purpose of this paper is to explore the application of those insights in a probabilistic context. Part of the question that drove this research is whether these methods are applicable in a probabilistic context and, if so, under what conditions.

Even without a kd-tree, nearest neighbor (NN) methods suffer from other shortcomings. First, they do not generalize a classifier from the training examples. That is, NN methods are likely to overfit the data if the training set happens to contain an example relatively close to an input query. Conversely, NN has no good way to predict if none of the training examples is close to a query, or if multiple training examples exist in the neighborhood close to a query.

Intuitively, adding a simple training process such as averaging multiple nearest neighbors might improve accuracy. However, it is difficult to choose an appropriate k (the number of training examples used for a prediction) and to define 'nearness' appropriately so that it balances across multiple dimensions. (Note that when k=n, this method returns the mean of all training examples as the output.)

Second, and for the same reasons, nearest neighbor methods are sensitive to outliers. For example, the single nearest example from the training set could be an outlier, while the k-nearest neighbors might suggest a different output, smoothed away from that outlier. The problem is that it is hard to determine a proper k, and ideally k should be varied for each prediction. Third, the standard distance metric in nearest neighbor weights each input dimension equally, thus potentially exaggerating the importance of some insignificant dimensions. Locally weighted linear regression can help [11] in the sense that it generalizes based on relatively important examples, but determining the neighborhood of locality is still difficult and is not feasible without structurally differentiating the weight of each dimension from the beginning. Finally, nearest neighbor methods cannot easily handle missing data, which frequently occurs in real-world data.

The probabilistic method in this paper addresses these shortcomings in a statistically sound framework.

9 Conclusions and Future Work

This paper described a new method for efficient classification when the number of classes is large. We showed its application in the context of a probabilistic model. The results show that the deeper the tree, the more efficient this algorithm gets as compared to linear search. With Branch-and-Bound search it is possible to do prediction with trees that are roughly 1.37 times deeper than for linear search.


This puts certain prediction problems within reach that were not possible before. We believe that algorithms such as branch-and-bound classification might form the cornerstone of handling enormous data sets, which give rise to an enormous number of subclasses.

We are currently exploring several promising avenues of future work. First, it is possible to apply lazy over-fitting control in the context of trees as described here. That is, it is possible to apply over-fitting control methods after training rather than during training. For example, the most likely leaf node might not be the most likely node in the entire tree. While exhaustively searching for the most likely node in the tree is not feasible, it might be possible to search a local neighborhood of the path leading to the most likely leaf node. The idea is that it might be more likely to find that node off the path than anywhere else. We are currently evaluating that hypothesis. It may also be that the most likely node at a shallower depth is more likely than one at a deeper depth. This suggests the following iterative method of overfitting control: find the most likely node at level 1, continue iterating by level, replacing the most likely node if the likelihood is higher than that of the one being replaced, and stop when the likelihood starts to decrease (this suggests overfitting).

Second, methods such as sampling might be useful for searching the resulting tree instead of pruning it. We are currently evaluating such methods.

Third, it may be possible to derive better heuristics so that we can get a better lower-bound on the log of the standard deviation. It may also be possible to get more accurate heuristics through learning or caching searches in the tree.

Finally, the artificial dataset on which we ran our experiments could be biased towards our strategy. To rule that out, we are currently running experiments on two natural domains: the Reuters-RCV1 text classification dataset [12] and the Gene Ontology dataset [13], both of which are large and with tree-structured classes.

References

1. Moore, A.: K-Means and Hierarchical Clustering (2001), tutorial at http://www.autonlab.org/tutorials/kmeans11.pdf

2. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 1–38 (1977)

3. Prieditis, A.E.: Machine discovery of effective admissible heuristics. Machine Learning 12(1-3), 117–141 (1993)

4. Nowozin, S.: Improved information gain estimates for decision tree induction. arXiv preprint arXiv:1206.4620 (2012)

5. Land, A.H., Doig, A.G.: An automatic method of solving discrete programming problems. Econometrica 28(3), 497–520 (1960)

6. Nilsson, N.: Principles of Artificial Intelligence. Tioga Publishing Company (1980)

7. Korf, R.: Depth-first Iterative-Deepening: An Optimal Admissible Tree Search. Artificial Intelligence 27, 97–109 (1985)

8. Samet, H.: Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann (2006)

9. Bentley, J.L.: Multidimensional binary search trees used for associative searching. Communications of the ACM 18(9), 509–517 (1975)


10. Moore, A.: Very Fast EM-based Mixture Model Clustering Using Multiresolution kd-trees. In: Advances in Neural Information Processing Systems. Morgan Kaufmann (1999)

11. Moore, A., Schneider, J., et al.: Efficient Locally Weighted Polynomial Regression Predictions. In: Fourteenth International Conference on Machine Learning. Morgan Kaufmann (1997)

12. Lewis, D.D., Yang, Y., Rose, T., Li, F.: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research 5, 361–397 (2004), http://www.jmlr.org/papers/volume5/lewis04a/lewis04a.pdf

13. Gene Ontology dataset, http://www.geneontology.org/