
Algorithmica (1998) 22: 53–75 © 1998 Springer-Verlag New York Inc.

PAC Learning Intersections of Halfspaces with Membership Queries1

S. Kwek2 and L. Pitt3

Abstract. A randomized learning algorithm POLLY is presented that efficiently learns intersections of s halfspaces in n dimensions, in time polynomial in both s and n. The learning protocol is the PAC (probably approximately correct) model of Valiant, augmented with membership queries. In particular, POLLY receives a set S of m = poly(n, s, 1/ε, 1/δ) randomly generated points from an arbitrary distribution over the unit hypercube, and is told exactly which points are contained in, and which points are not contained in, the convex polyhedron P defined by the halfspaces. POLLY may also obtain the same information about points of its own choosing. It is shown that after poly(n, s, 1/ε, 1/δ, log(1/d)) time, the probability that POLLY fails to output a collection of s halfspaces with classification error at most ε is at most δ. Here, d is the minimum distance between the boundary of the target and those examples in S that are not lying on the boundary. The parameter log(1/d) can be bounded by the number of bits needed to encode the coefficients of the bounding hyperplanes and the coordinates of the sampled examples S. Moreover, POLLY can be extended to learn unions of k disjoint polyhedra with each polyhedron having at most s facets, in time poly(n, k, s, 1/ε, 1/δ, log(1/d), 1/γ), where γ is the minimum distance between any two distinct polyhedra.

Key Words. PAC learning, Membership queries, Intersections of halfspaces, Unions of polyhedra, Occam algorithm.

1. Introduction. A centrally investigated problem within machine learning and computational learning theory is that of inductive concept learning. A concept c to be learned (called the target concept) is a subset of an example space, or instance space X. Elements x ∈ c ⊆ X are called positive examples of c, while elements x ∈ X − c are called negative examples. The general task is to determine c given positive and negative examples of c.

If nothing about the target concept c may be assumed, then the problem is too unconstrained, and many negative results show that learning is intractable based on both complexity-theoretic and cryptographic assumptions. Consequently, a reasonable and widely adopted assumption is that c resides in some a priori known concept class C ⊂ 2^X. Besides attempting to characterize more generally those concept classes that are efficiently learnable, much research focuses on determining whether or not efficient learning algorithms exist for particular concept classes. Specific classes that have been widely investigated include certain subclasses of boolean formulae, various types of formal grammars and automata, and a variety of geometric regions.

1 This work was done when the first author was a graduate student at the University of Illinois at Urbana-Champaign, and was supported in part by NSF Grant IRI-9014840 and NASA Grant NAG 1-613. The second author was supported in part by NSF Grant IRI-9014840.
2 Department of Computer Science, Washington University, St. Louis, MO 63130, USA. [email protected].
3 Computer Science Department, University of Illinois, Urbana, IL 61801, USA. [email protected].

Received February 5, 1997; revised July 1, 1997. Communicated by P. Auer and W. Maass.


We investigate the learnability of geometric regions formed by the intersection (dually, the union) of halfspaces in n-dimensional Euclidean space. We show that such regions can be learned in time that depends only polynomially on both the dimension and the number of bounding hyperplanes (and also on several other natural parameters), in a standard learning model: Valiant's PAC (probably approximately correct) model, augmented with membership queries. Furthermore, our algorithm can be extended to learn unions of k disjoint polyhedra.

Halfspaces (linear threshold functions) are of interest for a number of reasons. They are a fundamental building block of a variety of representations of more complex functions, and they have played a central role in machine learning. Early work on perceptron learning was in essence an investigation into the learnability of separating hyperplanes in Euclidean space. More recently, with the introduction of algorithms that perform well empirically at learning more complicated multilayered neural nets, there has been considerable interest in both the expressiveness and learnability of various architectures based on linear threshold gates and their continuous approximations.

Besides representing fundamental geometric objects (convex polyhedra), intersections of halfspaces are quite interesting from the learning perspective because they are a continuous-valued relaxation of boolean functions that have short representations as either a DNF or CNF formula. Using a standard representation of boolean functions as polynomials over the field {0,1}, CNF formulas may be represented by conjunctions of linear inequalities, hence intersections of halfspaces, in the unit hypercube. Consequently, it is known that a learning algorithm in the PAC learning model that runs in time polynomial in both the dimension and the number of bounding hyperplanes (as our algorithm does), but that does not need to rely additionally on membership queries (as, alas, our algorithm does), could be used to learn CNF formulas in the PAC setting in polynomial time. Whether or not CNF (dually, DNF) formulas can be efficiently learned has remained one of the more challenging open problems in the field since the introduction of the PAC model by Valiant in 1984.
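To make the reduction concrete (this worked example is ours, not the paper's), a single clause over 0/1-valued variables translates directly into one halfspace:

```latex
% A clause becomes a linear inequality over the vertices of the hypercube:
(x_1 \lor \bar{x}_2 \lor x_3)
  \;\iff\; x_1 + (1 - x_2) + x_3 \ge 1
  \;\iff\; x_1 - x_2 + x_3 \ge 0,
  \qquad x_i \in \{0, 1\}.
% A CNF formula, being a conjunction of such clauses, is then an
% intersection of one halfspace per clause, restricted to the vertices.
```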

Finally, the learnability of intersections of halfspaces, hence convex polyhedra, may be of interest from the vantage point of computational geometry. Motivated by robotics problems, algorithms have been sought that construct models of an unknown geometric object using different types of "probes" (e.g., the "finger" probe or the "x-ray" probe) that correspond to sensory input that might be available to a robot [1]. Our membership queries are somewhat weaker than the finger probe used in such work, but we demand only approximate, as opposed to exact, identification of the object.

In a closely related work by Baum [2], an algorithm is presented that learns the intersection of s halfspaces in n dimensions in time polynomial in s and n, in a nonstandard PAC-inspired learning model designed "to rule out with high confidence pathological configurations with several planes arbitrarily close to one another or large fractions of the measure right on top of decision boundaries." In particular, the algorithm is not required to work for all possible intersections of halfspaces and all possible distributions of points in R^n: His model precludes adversaries that choose a distribution–halfspace intersection pair that is particularly difficult (e.g., where much of the distribution is centered around points that lie very close to the bounding hyperplanes or their intersections). Our algorithm works for all distributions and for all possible intersections of halfspaces to be learned.


The idea behind Baum's algorithm is to interpolate each bounding hyperplane by finding n points that are close to it. When the distribution focuses on many points that lie near corners or edges of the bounding polyhedron (such distributions are in essence ruled out by Baum's model), the interpolation technique encounters difficulties. Our algorithm uses a different method: Instead of finding points that are "on" a bounding halfspace and interpolating, we try to determine which negative sample points lie outside of the same bounding hyperplane. Then we use linear programming to separate all of these negative sample points from the positive examples.

In order to state our main theorem, we first review the PAC learning model with membership queries. In Valiant's distribution-free, or PAC, learning model [3], the learner's goal is to infer an unknown target concept c chosen from some known concept class C. Typically, C is decomposed into a family of subclasses C_n, where n represents a measure of the complexity of each example (in our case n is the Euclidean dimension, and C_n consists of intersections of halfspaces in R^n). To obtain information about c, the learning algorithm is provided access to labeled (positive and negative) examples of c, drawn randomly according to some unknown probability distribution D over the instance space X (in our case X = R^n). The learner is also given as inputs 0 < ε, δ < 1, and an upper bound s on the representational "size" size(c) of c (in our case s is the number of halfspaces defining c). The requirement that the algorithm be given s may be dropped, as standard techniques allow the inference of the appropriate parameters at only a polynomial increase in cost, and a relaxation of polynomial running time to expected polynomial running time.

The learner's goal is to output, with probability at least 1 − δ, the description of a concept c′ that has probability at most ε of disagreeing with c on a randomly drawn example from D (thus, c′ has error at most ε). If such a learning algorithm A exists (that is, an algorithm A meeting the goal for any n ≥ 1, any target concept c ∈ C_n, any target distribution D, any ε, δ > 0, and any s ≥ size(c)), we say that C is PAC-learnable. We say that a PAC learning algorithm is a polynomial-time (or efficient) algorithm if the number of examples drawn and computation time are polynomial in n, s, 1/ε, 1/δ, and perhaps other natural parameters.

The PAC model may be augmented to allow membership queries. A membership query proposed by a learning algorithm is a point x ∈ X, and the response obtained from an oracle or teacher is "yes" or "no", depending on whether or not x ∈ c. The PAC model (and a somewhat more demanding model of exact learning using equivalence queries), both with and without membership queries, is a well-motivated and widely used model in the investigation of learning phenomena. The central result presented in this paper is the following.

THEOREM 1. Suppose the target concept is an intersection of s halfspaces in [0,1]^n, defining a polytope P. Let ε, δ > 0, and let S be an initially obtained random sample of

m = max{(2/ε) log(4/δ), (16(n+1)s log(3s)/ε) log(13/ε)}

examples. Let LP(m, n) be the time needed to solve a linear programming problem with m constraints and n unknowns. Let d denote the minimum distance between the boundary of P and any example in S that does not lie on the boundary. Then in time

O((n log(1/d) + log(2/δ))(m(log(s/δ) + n log(mn/d)) + s·LP(m, n)))

and with sample complexity m (excluding membership queries), the probability that POLLY (Figure 3) fails to output a collection of s halfspaces with classification error at most ε, is at most δ.

Comments and Extensions.

• Assume that the representation of a rational number is of the form a₁ ··· a_l . b₁ ··· b_r (some number l of bits to the left of the decimal point, and some number r of bits to the right). Let B denote the length of the longest representation among the coefficients of the bounding hyperplanes and the coordinates of the examples in S. The parameter d is at least Ω((2^B √n)^{−1}) (see Lemma 8). Thus, the time complexity of POLLY is also polynomial in the length of the representation of the target concept and the initial sample S. We choose to bound the time complexity in terms of log(1/d) instead of B, thus allowing for the possibility that the coefficients of the bounding hyperplanes are irrational numbers (and hence cannot be represented with a finite number of bits).

• There are several known polynomial-time algorithms [4]–[8] for solving the linear programming problem. Among these algorithms, when m is much larger than n (as in our case), Vaidya's algorithm [8] has the best worst-case running time of O((mn² + m^{1.5}n)L). Here, L is the number of bits needed to encode the linear program, which in our case is bounded by the total number of bits needed to specify all the coordinates of the examples in S.

• The statement in Theorem 1 assumes that the cost of an arithmetic operation is constant. If the cost is not constant, then the time complexity of our algorithm increases by a factor of B.

• A simple extension of the algorithm can learn the union of k disjoint polyhedra with at most s facets each, in time polynomial in n, k, s, 1/ε, 1/δ, log(1/d), and 1/γ, where γ is the minimum distance between any pair of polyhedra, and between any sampled negative example and a bounding hyperplane. More details are given in Section 5.

• Without loss of generality, we can extend the algorithm to learn intersections of halfspaces not bounded by the unit cube by simply projecting the example space into the unit cube (i.e., scaling the coordinates of the sampled examples; see the sketch after this list).
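A minimal sketch of that rescaling step (our code; the function name and the choice to rescale by the sample's bounding box are assumptions, not a prescription from the paper):

```python
import numpy as np

def scale_to_unit_cube(points):
    """Affinely map a sample into [0,1]^n coordinate by coordinate.
    The returned (offset, span) must also be applied to every membership
    query, so queries are posed in the original example space."""
    pts = np.asarray(points, dtype=float)
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard zero-width coordinates
    return (pts - lo) / span, (lo, span)
```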

After briefly reviewing previous work in Section 2, we give the high-level intuition behind the algorithm and proof in Section 3, and then proceed with the technical details in Section 4. Section 5 generalizes Theorem 1 to learning unions of disjoint polyhedra.

2. Background and Related Work. The problem of learning halfspaces is one of the most extensively studied topics in computational learning theory. Most works consider restrictions on either the number of halfspaces or the dimension.

Learning a Single Halfspace. Learning a single halfspace corresponds to training a simple perceptron, for which efficient algorithms are known in the PAC model [9] and in the exact model using equivalence queries [10]. In both cases the algorithms work without membership queries. Littlestone [11] gives an efficient algorithm WINNOW for learning a single halfspace over the boolean domain in the more demanding "mistake bound model." Bultman and Maass [12] present an algorithm that learns a single halfspace in the discretized domain Z_m² = {1, ..., m}² using Θ(log m) membership queries in time O((log m)^{O(1)}). Shevchenko [13] investigates the learning of a single halfspace in Z_m^n = {1, ..., m}^n using O((log m)^{(n−1)⌈n/2⌉+n}) membership queries in poly(log m) time. Recently, Blum et al. [14] learn a single halfspace in the presence of random classification noise using a simple greedy method to find a weak hypothesis, and then apply a boosting technique to achieve the desired accuracy.

Learning Two Halfspaces. Blum and Rivest [15] show that finding an intersection of two halfspaces that is consistent with a sample of labeled points from the boolean domain, if it exists, is NP-complete. Baum [16] presents an algorithm that learns intersections of two halfspaces from examples and membership queries, or from examples alone if the distribution obeys a symmetry condition. This result has been extended by Blum et al. to learning intersections of two (not necessarily homogeneous) halfspaces where the membership queries on points that are within distance d of the bounding hyperplanes are unreliable and the distribution has weight 0 within d of the boundary [17].

Learning Three or More Halfspaces. In a closely related work by Baum [16], an algorithm is presented to learn convex polyhedra in a PAC-inspired model augmented with membership queries. The model he used is constructed to exclude difficult situations (see Section 1 for more details) which are handled by our algorithm.

If the probability distribution of the instances is uniform, Blum and Kannan [18] present a polynomial-time algorithm for learning the intersection of a constant number of halfspaces. When membership queries are allowed, it has been shown that in the boolean domain, exact learning from equivalence and membership queries remains NP-hard if the algorithm is required to find (as ours does) an intersection with the same number of halfspaces k, for any fixed k ≥ 3 [19], [20].

Mostly negative results have been obtained for learning in the case that the number of halfspaces is not held constant, and the learning algorithm is allowed to output a hypothesis containing more halfspaces than the target concept. By using simple prediction-preserving reductions it can be shown that PAC-learnability of the class of convex polyhedra without membership queries implies PAC-learnability of DNF formulas [21], [11], which is one of the more challenging open problems in learning theory. Long and Warmuth [22] show that learning convex polyhedra in the continuous domain is as hard as learning polynomially sized circuits, assuming that the chosen representation (whose size dictates a parameter in which the algorithm must be polynomial) is the list of vertices, instead of bounding hyperplanes, of the polytope. It follows that if one-way functions exist, then learning convex polyhedra is intractable in this model.

Learning in Constant Dimension. For constant dimension, Baum [23] presents an Occam algorithm that learns intersections of k halfspaces in R^d, for fixed d, in time O(((kd²/ε) log²(1/ε))^{(d+3)/2}). An algorithm for learning convex polytopes in the discretized domain of constant dimension, using stronger queries such as superset or disjointness queries, has been given [24]. Also in discrete domains, Auer et al. [25] generalize Littlestone's WINNOW algorithm [11] to the class of depth-two linear threshold circuits with constant fan-in at the input gates, and, as a consequence, obtain a noise-tolerant mistake-bound algorithm for learning (the class of circuits containing) the class of convex polyhedra in fixed dimension. In the continuous domain, Bshouty et al. [26] present a noise-tolerant PAC algorithm for learning the class of arbitrary boolean functions of s halfspaces in constant dimension (see also [27] for earlier work). For the discrete domain, Ben-David et al. [28] give an exact learning algorithm for the same class using equivalence queries only.

Recently the more difficult related problem of finding a convex polygon (in the plane) of s sides that misclassifies the fewest number of points in a finite sample has been studied [29]–[31]. An O(n^{6k})-time algorithm for finding one polygon in the class of convex s-gons that minimizes the classification error on a sample of labeled points is given in [31]. This result implies that convex polygons are learnable in the PAC model with random classification noise [32] and in the "agnostic" PAC model [33].

Learning Boxes. There have been a number of papers on learning the simpler class of concepts formed from halfspaces whose bounding hyperplanes are parallel to the coordinate axes [34]–[40]. Blumer et al. [9] show that choosing the smallest bounding box covering the set of positive examples in a given sample yields a PAC learning algorithm that is polynomial in both the number of hyperplanes and the dimension. Chen and Maass [35] present an efficient on-line mistake-bound algorithm that learns a single box in R^d, while Auer [34] investigates the case where there is noise in the mistake bound model. Ameur [40] improves the space complexity of Chen and Maass' algorithm at the cost of increased time complexity. Maass and Warmuth [39] introduce the notion of virtual weights for learning boxes with a lower mistake bound than Chen and Maass' algorithm, but the hypothesis is a linear threshold function rather than a box. Dobkin et al. [38] give an O(n^{2d−2} log n)-time algorithm for minimizing disagreement for a single box in R^d.
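For contrast with the machinery required for general halfspaces, the box algorithm of Blumer et al. mentioned above is essentially two lines; a sketch (our code, hypothetical names):

```python
import numpy as np

def smallest_box(positives):
    """Hypothesis of Blumer et al.'s PAC algorithm for boxes: the
    smallest axis-parallel box covering all positive examples."""
    pos = np.asarray(positives, dtype=float)
    return pos.min(axis=0), pos.max(axis=0)  # lower and upper corners

def in_box(x, lo, hi):
    """Predict positive iff x lies inside the learned box."""
    x = np.asarray(x, dtype=float)
    return bool(np.all(lo <= x) and np.all(x <= hi))
```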

3. The Forest. Recall that the VC-dimension of a concept class C, denoted by VC-dim(C), is the size of the largest sample that can be labeled in all possible ways by concepts in C. The following theorem, due to Blumer et al. [9], gives a sufficient condition for learning in the PAC model.

THEOREM 2 [9]. Suppose a concept class C has finite VC-dimension. Suppose also that there is an algorithm A such that, for all possible choices of target concept f ∈ C, given any sample S of examples labeled according to f, with size of S at least

max{(2/ε) log(2/δ), (8·VC-dim(C)/ε) log(13/ε)},

A outputs a hypothesis in C that classifies S the same way as f. Then A learns C in the PAC model.

Because the class of intersections of s halfspaces has VC-dimension at most 2(n+1)s log(3s) [9], by the above theorem, any algorithm that obtains from distribution D a sample S of size at least

m = max{(2/ε) log(4/δ), (16(n+1)s log(3s)/ε) log(13/ε)},

and outputs a consistent intersection of s halfspaces, achieves the PAC criterion of learning with accuracy bound ε and confidence bound δ/2. This result holds even if the algorithm uses membership queries to construct a hypothesis consistent with S. Moreover, the hypothesis found need only classify the points of S correctly, and need not classify correctly any additional points that it may have generated and posed as membership queries in the process of obtaining its hypothesis. Algorithm POLLY (Figure 3) uses this approach to find the hyperplanes that bound the convex POLYhedron defined by the intersection of halfspaces.
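Evaluating this bound is mechanical; a sketch (ours), assuming base-2 logarithms, which the bound's statement does not pin down:

```python
import math

def sample_size(n, s, eps, delta):
    """m = max{(2/eps) log(4/delta),
               (16 (n+1) s log(3s) / eps) log(13/eps)}."""
    term1 = (2.0 / eps) * math.log2(4.0 / delta)
    term2 = (16.0 * (n + 1) * s * math.log2(3.0 * s) / eps) \
            * math.log2(13.0 / eps)
    return math.ceil(max(term1, term2))

# For instance, sample_size(10, 4, 0.1, 0.05) is the number of random
# labeled examples drawn before any membership queries are posed.
```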

We know that every positive example is a positive example for each halfspace of the intersection. Each negative example is a negative example for (hence is "outside of") at least one of the halfspaces. The problem is a "credit assignment" (actually "blame assignment") problem as discussed in this context by Baum [23]: If we can figure out, for each negative example x, at least one of the s halfspaces that do not contain x, then we can learn each halfspace separately by using any polynomial-time linear programming algorithm to find a hyperplane separating the positive examples from those negative examples that are blamed on a particular hyperplane. In order for Theorem 2 (or any of its extensions) to apply, we must not find an intersection of "too many" halfspaces. In particular, we cannot simply find for each negative example a hyperplane separating that negative example from all positives. While the resulting intersection would correctly classify each point in S, the number of hyperplanes could be as large as the number of negative sample points. The generalizations of Theorem 2 (pertaining to "Occam algorithms") may only be applied when the consistent hypothesis produced by the learning algorithm has a representational size that is significantly smaller than the sample size: there must be some amount of "compression" of the sample data. POLLY will actually produce no more hyperplanes than the number in the target. This is the task found NP-hard when the domain is restricted to the boolean hypercube in the PAC model without membership queries [15], and in the more demanding exact learning model with both membership and equivalence queries [19], [20].
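The separation step itself is an ordinary linear-programming feasibility problem. The following sketch (our code, not the paper's; it calls scipy rather than the interior-point methods cited in Section 1) searches for w, b with w·x ≥ b on the positives and w·x ≤ b − 1 on the blamed negatives; fixing the margin at 1 loses no generality, since any strictly separating (w, b) can be rescaled:

```python
import numpy as np
from scipy.optimize import linprog

def separating_halfspace(positives, negatives):
    """Return (w, b) with w.x >= b for every positive example and
    w.x <= b - 1 for every blamed negative example, found by a pure
    feasibility LP; returns None if no strictly separating pair exists."""
    pos = np.asarray(positives, dtype=float)
    neg = np.asarray(negatives, dtype=float)
    n = pos.shape[1]
    # Unknowns: w (n entries) followed by b. In A_ub @ z <= b_ub form:
    #   -w.p + b <= 0     for each positive p   (i.e., w.p >= b)
    #    w.q - b <= -1    for each negative q   (i.e., w.q <= b - 1)
    A_ub = np.vstack([np.hstack([-pos, np.ones((len(pos), 1))]),
                      np.hstack([neg, -np.ones((len(neg), 1))])])
    b_ub = np.concatenate([np.zeros(len(pos)), -np.ones(len(neg))])
    res = linprog(c=np.zeros(n + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (n + 1))
    return (res.x[:n], res.x[n]) if res.success else None
```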

Let S+ and S− denote the positive and negative examples of S, respectively. To understand the method that POLLY uses to find a collection of at most s hyperplanes whose intersection contains all of S+ but none of S−, for simplicity we consider the two-dimensional case.

Each negative example a ∈ S− is "outside" of at least one line containing an edge of P. For each negative example, POLLY chooses one of these lines to blame in the following manner: Choose any point c ∈ P. The segment ac connecting a negative example a with the point c must intersect one of the sides of the polygon. Let facet(a) denote the side of the polygon that the segment ac intersects. (In n dimensions, facet(a) is an (n−1)-dimensional surface bounding the polyhedron.) We write that a "sees_c" facet(a) to indicate that the direct line of sight from a to c first intersects facet(a) (refer to Figure 1).

Fig. 1. Each negative example x sees_c some facet (denoted facet(x)) that is determined by the intersection of the line xc and the polytope. Note that while facet(a) ≠ facet(b), b is nonetheless "outside" of the line h(a) containing facet(a).

POLLY picks an initial negative example a and attempts to determine which other negative examples b see_c facet(a). POLLY employs a simple test, called the midpoint test (explained soon), to determine whether or not facet(b) = facet(a). Ideally, the midpoint test should be able to determine, for each pair of negative examples a and b, whether or not a and b see_c the same facet. In this (ideal) case, the midpoint test can be used to partition the negative examples, and we can separately run (in parallel if we wish) s different halfspace-learning algorithms: The bounding line h(a) coincident with facet(a) will be learned when a halfspace-learning algorithm is presented with positive examples S+ (all on the inside of P, hence "inside" h(a)), and with negative examples a together with all other negative examples that see_c facet(a), hence are all outside h(a).

Unfortunately, the midpoint test is not powerful enough to partition the negative examples so nicely. However, it has the following property: If facet(b) = facet(a), then the midpoint test is guaranteed to say "yes". On the other hand, if facet(b) ≠ facet(a), then the only time the midpoint test says "yes" is if b is on the same side of h(a) as is a, as shown in Figure 1. Consequently, the set of points presented (as inequalities to be satisfied) to a linear programming algorithm will be linearly separable, and a separating line will be found.

Note however that the midpoint test no longer partitions the negative examples: it is possible that midpoint-test(p, q) = "yes" = midpoint-test(q, r), but midpoint-test(p, r) = "no". Thus, instead of attempting to find such a partition, POLLY chooses a negative example a ∈ S−, computes same-side(a) = {b ∈ S− : midpoint-test(a, b) = "yes"}, and runs a linear programming algorithm to find a line separating S+ and same-side(a). POLLY then removes a and all points in same-side(a) from S−, and iterates the procedure by choosing some remaining a ∈ S−. After at most s iterations, up to s separating lines will be found, and each element of S− will be outside of at least one line, while all points of S+ will be inside of every line found.
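A sketch of this greedy covering loop (our code; `member` is the membership oracle, `surrogate` maps each negative example to the surrogate point a′ introduced below, and `separating_halfspace` is the LP routine sketched earlier):

```python
def cover_negatives(surrogate, member, positives, separating_halfspace):
    """Repeatedly pick a negative a, collect every remaining negative b
    passing the midpoint test against a, and learn one halfspace
    separating that group from the positives.

    surrogate: dict mapping each negative example (a tuple) to its
               surrogate point a' lying just outside the target polytope.
    member:    membership oracle; member(x) is True iff x is in the target.
    """
    remaining = set(surrogate)
    halfspaces = []
    while remaining:
        a = next(iter(remaining))
        a_prime = surrogate[a]
        # same-side(a): negatives b whose midpoint with a (via the
        # surrogate points) is itself a negative point.
        group = {b for b in remaining
                 if not member(tuple((u + v) / 2
                               for u, v in zip(a_prime, surrogate[b])))}
        h = separating_halfspace(positives, [list(b) for b in group])
        if h is None:
            raise RuntimeError("not separable: shrink the guess for d2")
        halfspaces.append(h)
        remaining -= group
    return halfspaces
```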

The Midpoint Test. So what is the midpoint test? Suppose that a and b are extremely close to P and see_c different facets. Then it is easy to see that the midpoint between them will be in P. On the other hand, if a and b see_c the same facet, then no matter how close they are to P, the midpoint will be outside of P.


Fig. 2. Surrogate points a′ and b′ that are close to P. If a′ and b′ are sufficiently far from the vertices, then the midpoint of a′ and b′ will be inside of P. The distance between a′′ (similarly, b′′) and the vertex of the polytope is at least d₁, and the distance between a′ and a′′ (similarly, b′ and b′′) is at most d₂.

Since the sample S in general will not contain points a and b that are close enough to P to determine whether or not a and b see_c the same facet, POLLY instead tests the midpoint between surrogate points a′ and b′ that are closer to c and lie along the lines ac and bc, respectively. In particular, using a binary search and membership queries, POLLY finds for each negative example a a point a′ between a and c such that a′ is outside of P, and the distance between a′ and the point of intersection a′′ of ac and facet(a) is at most some small amount, say d₂ (see Figure 2). The midpoint test applied to a and b is that of posing a membership query on the midpoint of a′ and b′. If a and b see_c the same facet, then both a′ and b′ will be on the same side of h(a), hence so will their midpoint. If a and b do not see_c the same facet, we will show that the midpoint will either be in P (and POLLY will then exclude b when trying to learn the line h(a) incident with facet(a)), or else b will be on the same side of h(a) anyway, as discussed above.
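The binary search for a surrogate point is elementary; a sketch (our code, hypothetical names):

```python
import math

def find_surrogate(a, c, member, d2):
    """Walk the segment from the negative point a to the interior point c
    by bisection, returning a point a' outside P whose distance to the
    crossing point a'' of the segment with the boundary is below d2."""
    a = list(map(float, a))
    c = list(map(float, c))
    lo, hi = 0.0, 1.0             # fractions of the way from a to c
    seg = math.dist(a, c)         # total segment length
    while seg * (hi - lo) >= d2:  # bracketing interval still too wide
        mid = (lo + hi) / 2
        point = [u + mid * (v - u) for u, v in zip(a, c)]
        if member(point):
            hi = mid              # inside P: crossing lies in [lo, mid]
        else:
            lo = mid              # outside P: crossing lies in [mid, hi]
    return [u + lo * (v - u) for u, v in zip(a, c)]  # outside, within d2
```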

The reader may now have the following concern: Isn't it the case that if ac or bc (or both) intersect P very close to a vertex (or, in the worst case, at a vertex), then the midpoint of a′ and b′ will not be inside of P, unless d₂ is set very small? This indeed is the case: the closer to a vertex the intersection is, the smaller d₂ must be. Consequently, POLLY will choose a point c such that the intersections of ac and bc with P will occur some minimum distance d₁ from any vertex of P, and will choose d₂ small enough (depending on d₁) so that the midpoint test works as promised. (The reader might wonder what happens when the angle of the vertex gets very shallow. When this happens, we do not need to worry, since b will then end up being outside h(a) anyway.)

This leaves only the question of how to choose such a c. If all the sampled positive examples fall in the same hyperplane, then we can reduce the learning problem to one of lower dimension. Otherwise, we find a ball (explained soon) that is contained in the target polytope and choose the point c randomly within this ball. We show that, for a suitable choice of d₁, there is a probability of at least a half that c will be such that, simultaneously for each a ∈ S−, the point of intersection a′′ of ac and P will be at least d₁ from any vertex of P.


4. The Trees. Before presenting the proof of Theorem 1, we review some elementary geometric concepts. A hyperplane and a halfspace in R^n are the sets of points satisfying some linear equation a₁x₁ + ··· + aₙxₙ = b and some linear inequality a₁x₁ + ··· + aₙxₙ ≥ b, respectively, where not all the aᵢ's are simultaneously zero. A (convex) polyhedron is an intersection of a finite set of halfspaces. The boundary of a polyhedron consists of convex polyhedra of lower dimension called faces. A k-face denotes a k-dimensional face. The (n−1)-faces of an n-dimensional polyhedron are called facets. For example, the facets of a three-dimensional polyhedron are plane polygons, and the 1-faces and 0-faces are its edges and vertices, respectively. A set of points in R^n is coplanar if they all lie on some hyperplane of dimension smaller than n. The line through a pair of points a and b, as well as the segment of that line falling between a and b (inclusive), is denoted ab; which is meant will be clear from context. The length of the segment is denoted |ab|.

Let S = S+ ∪ S− be the sample of positive and negative examples of size m drawn by POLLY. Let P denote the convex polyhedron formed by the intersection of the s closed⁴ halfspaces that are to be learned. We assume that the instance space is the unit hypercube; hence the distance between any two points in the instance space is at most √n.

If the set of positive sample points S+ is coplanar, which we can determine in polynomial time, then we can reduce the learning problem to one of lower dimension. Thus, without loss of generality, we assume S+ is not coplanar. Let p_{S+} denote the mean of S+, and let B_{S+} denote the set {p_{S+} + 〈α₁(l/1000), ..., αₙ(l/1000)〉 : −500 ≤ αᵢ ≤ 500} of discrete points, where each αᵢ is an integer and

l = 2d/(m√n).

The choice of the grid size of 1000 is completely arbitrary and does not affect the correctness or efficiency of our algorithm. It could be any integer greater than 1. The next lemma claims that B_{S+} is contained in P.
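Drawing the candidate point c is then a uniform choice from this grid; a sketch (our code; note that d is not known to POLLY, so in the algorithm itself a current guess stands in for it):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, for a reproducible sketch

def draw_center_candidate(positives, m, d):
    """Draw c uniformly from B_{S+}: the mean p_{S+} of the positive
    sample, displaced by alpha_i * (l/1000) in each coordinate, where
    each alpha_i is an integer in [-500, 500] and l = 2d/(m*sqrt(n))."""
    pos = np.asarray(positives, dtype=float)
    n = pos.shape[1]
    p = pos.mean(axis=0)                     # p_{S+}
    l = 2.0 * d / (m * np.sqrt(n))
    alpha = rng.integers(-500, 501, size=n)  # endpoints inclusive
    return p + alpha * (l / 1000.0)
```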

LEMMA 3. Suppose S+ is not coplanar. Then B_{S+} is properly contained in P.

PROOF. The distance of the mean p_{S+} from any bounding hyperplane hᵢ is simply the mean of the distances of the positive sample points from hᵢ. Since S+ is not coplanar, one of these distances is nonzero, and hence is at least d. Thus, p_{S+} is at least d/m away from the boundary. Now, the points in B_{S+} are contained in the n-dimensional square box with side length l and center p_{S+}. The points in this box that are furthest away from p_{S+} are its vertices. The distance between a vertex and p_{S+} is √(n(l/2)²) = d/m. Hence the box (and thus all points in B_{S+}) is in P.

DEFINITION 1. A point c ∈ P is a d₁-center if and only if the following two conditions are satisfied:

1. For all a ∈ S−, the distance between the point of intersection a′′ of ac and P, and any (n−2)-face of P, exceeds d₁. (For example, in three dimensions, a′′ is at least d₁ from any edge of P.)
2. The distance between c and each facet of P is greater than d₁.

4 Points on the boundary are considered to be in the target concept.


We show that if we set d₁ small enough (but not too small), then the probability of drawing a point uniformly from B_{S+} that happens to be a d₁-center is at least 1/2. Intuitively, if d₁ is close to zero, then the region in B_{S+} that contains the "non-d₁-center" points has negligible volume when compared with the volume of B_{S+}. However, in order to use this result later, we need to ensure that d₁ is not too small. The proof of the following technical lemma appears in the Appendix.

LEMMA 4. If d₁ ≤ (n/(2dms²))(d²/(n^{1.5}m))^n, then the probability of drawing a point uniformly from B_{S+} that is a d₁-center is at least 1/2.

Figure 3 lists the pseudocode of our algorithm for learning intersections of s halfspaces. Suppose that c is a d₁-center, and that d₂ is sufficiently small. For each a ∈ S−, let h(a) denote the hyperplane containing facet(a), the facet of P that intersects the line segment ac at a′′. Let H(a) denote the halfspace in the target concept with boundary h(a) (see Figure 4). Note that a ∉ H(a) but c ∈ H(a). Step 6 of POLLY, by asking membership queries on points between a and c, does a binary search to find a pair of points a′ ∉ P and c′ ∈ P such that a′ and c′ lie on ac, with |a′c′| < d₂. Consequently, |a′a′′| < d₂, where a′′ is the point of intersection of ac and P. POLLY then picks a point a in S− (step 9), and applies the midpoint test to find the set of points same-side(a) = {b ∈ S− : (a′ + b′)/2 is a negative example} (step 10). In step 11, POLLY uses linear programming to find a halfspace H_a that contains all of S+ but none of same-side(a). To prove that POLLY is correct, assuming that a sufficiently small value of d₂ is chosen, we must show that such a halfspace always exists, i.e., that the points of same-side(a) are linearly separable from S+. Further, we must also show that negative examples b that see_c the same facet that a does will be included in same-side(a). These two facts are given in the following lemma:

LEMMA 5 (the Midpoint Lemma). Let c be a d₁-center, where

d₁ = (n/(2dms²))(d²/(n^{1.5}m))^n  and  d₂ = (d₁/(12√n))·d  (≤ (√n/(24ms²))(d²/(n^{1.5}m))^n).

If the guess d̂₂ in step 2 of POLLY is set to less than d₂, then, after each execution of step 10 of POLLY,

1. same-side(a) contains all elements b remaining in S− for which facet(b) = facet(a) (and which are thus outside of the hyperplane h(a)), and
2. if same-side(a) contains b ∈ S−, then b is outside of the hyperplane h(a).

The proof of Lemma 5 appears after the proof of Theorem 1, which shows that, with high probability, POLLY learns intersections of s halfspaces efficiently.

PROOF OF THEOREM 1. Let d be the smallest distance between a negative example of S obtained by POLLY and a bounding hyperplane of P. Let d₁ = (n/(2dms²))(d²/(n^{1.5}m))^n. POLLY does not know the value of d, and hence cannot directly set its guess d̂₂ to a value less than d₂ so as to satisfy the hypothesis of the midpoint lemma. Instead, POLLY makes an initial guess


POLLY(n, s, ε, δ)
1   Draw a sample S of

        m = max{(4/ε) log(4/δ), (16(n+1)s log(3s)/ε) log(26/ε)}

    examples from distribution D and let S+ and S− be the positive and
    negative examples, respectively, of S.
2   d̂₂ ← (√n/(24ms²))(1/(n^{1.5}m))^n   (d̂₂ is our guess of the desired value for d₂)
3   p_{S+} ← mean of S+
4   while not done do
5   try again:
        reset S− ← negative examples in S;
        draw a positive example c uniformly from B_{S+}
6       for each a ∈ S−, use binary search and membership queries to find a′
        on ac but not in P such that dist(a′, P) < d̂₂
7       H ← ∅
8       while S− ≠ ∅ do
9           pick a point a in S−
10          same-side(a) ← {b ∈ S− : member((a′ + b′)/2) = "no"}
11          use linear programming to find a halfspace H_a containing S+ but
            not same-side(a)
12          if (no such halfspace is found) or (|H| = s) then
13              d̂₂ ← d̂₂/2
14              goto 4
15          else
16              H ← H ∪ {H_a}
17              S− ← S− − same-side(a)
18          end if
19      end while
20  end while
21  return H

Fig. 3. Algorithm POLLY.
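Assembled from the sketches given earlier, the main loop of Figure 3 looks roughly as follows (our code; it assumes those helper sketches are in scope, and, as noted above, the current guess stands in for the unknown d):

```python
def polly(S_pos, S_neg, member, s, d2_guess):
    """Outer loop of Figure 3: halve the guess for d2 until the greedy
    cover succeeds using at most s halfspaces."""
    m = len(S_pos) + len(S_neg)
    d2 = d2_guess
    while True:
        c = draw_center_candidate(S_pos, m, d2)                # step 5
        surrogate = {tuple(a): tuple(find_surrogate(a, c, member, d2))
                     for a in S_neg}                           # step 6
        try:
            H = cover_negatives(surrogate, member, S_pos,
                                separating_halfspace)          # steps 8-19
            if len(H) <= s:
                return H                                       # step 21
        except RuntimeError:
            pass                                               # step 12 failed
        d2 /= 2                                                # steps 13-14
```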

d̂₂ = (√n/(24ms²))(1/(n^{1.5}m))^n. Note that if d^{2n} ≥ 1, then the initial guess d̂₂ already satisfies the required bound. Otherwise, the initial guess d̂₂ is only larger than the desired value of d₂ by a factor of d^{−2n}. If POLLY is unable to find a consistent intersection of s hyperplanes, then in step 13 the value of d̂₂ is halved and the algorithm is started again with this new value of d̂₂. After at most 2n log(1/d) executions of step 13, and at each subsequent iteration of the main loop, d̂₂ will satisfy the bound in the hypothesis of the midpoint lemma.

Moreover, by Lemma 4, after d̂₂ becomes sufficiently small, the probability that the point c obtained at each execution of step 5 of POLLY is a d₁-center is at least 1/2. Hence, the probability that a d₁-center is not obtained within the next log(2/δ) executions of step 5 after d̂₂ is sufficiently small is at most δ/2. Hence, with probability at least 1 − δ/2,


Fig. 4. The two-dimensional affine plane P_{a,b,c} spanned by the points a, b, and c. If the midpoint w of a′ and b′ is not in the polytope, then it must either be on the line segment a′e or on e′b′. Notice also that b′ is in the halfspace H(a) if and only if a′ is in the halfspace H(b).

within a total of at most t = 2n log(1/d) + log(2/δ) iterations of the loop of step 4, a d₁-center will be obtained and d̂₂ will satisfy the bound in the hypothesis of the midpoint lemma. Using the midpoint lemma together with a simple induction, it can be shown that POLLY will output a collection H of halfspaces whose intersection correctly classifies all points in S. By Theorem 2, the probability that H has actual error exceeding ε on the distribution D from which S was sampled is at most δ/2. Thus, the probability that POLLY outputs some collection H whose error is at most ε within t iterations is at least 1 − δ.

We analyze the running time of each of the first t iterations of the main loop of steps 4–20 (after which, by the argument above, with high probability POLLY will have halted). The two dominating steps in the main loop are steps 6 and 11. Step 6 is executed at most m times for each of the t executions of the main loop until POLLY is likely to succeed. Since the distance between a and c is at most √n, and d̂₂ ≥ 2^{−t}(√n/(24ms²))(1/(n^{1.5}m))^n, it can be shown easily that each of these m binary searches takes time O(t + log s + n log mn). Step 11 is executed at most s times, and in each iteration a linear programming algorithm is run with an input of at most m constraints and n unknowns, taking time LP(m, n), for a total time of O(s·LP(m, n)). Thus, the total time for each iteration of the main loop is dominated by

O(m(t + log s + n log mn) + s·LP(m, n)) = O(m(log(s/δ) + n log(mn/d)) + s·LP(m, n)).

PROOF OF LEMMA 5. (1) If facet(b) = facet(a), then both a and b are on the outside of h(a). Further, the entire segments aa′ and bb′ are on the outside of h(a), since ac and bc first cross h(a) at a′′ and b′′, respectively. Thus, the midpoint between a′ and b′ is also on the outside of h(a), hence is a negative example, and will be placed in same-side(a).

(2) Let c be a d₁-center, and let d₂ ≤ (d₁/(12√n))d. We proceed by contradiction. We show that if b is inside h(a), then the midpoint (a′ + b′)/2 is inside P (a positive example), hence b would not have been placed in same-side(a).

Let b be inside h(a), that is, b ∈ H(a). Since c is also in H(a), b′ ∈ H(a) also (see Figure 4). Let P_{a,b,c} be the two-dimensional affine plane spanned by a, b, and c. The target concept restricted to P_{a,b,c} is a convex polygon p₁ ··· p_r. Without loss of generality, suppose that the line segment ac intersects the side facet(a) = p₁p₂.

Since a′ and b′ are on opposite sides of h(a), a′b′ must intersect h(a) at some point e. Since a′′ is on the line segment p₁p₂, either p₁ or p₂ is on the same side of a′a′′ as e. Without loss of generality we assume that e and p₂ are on the same side of the line a′a′′.

The proofs of the following two claims can be found in the Appendix.

CLAIM 6. If b′ ∈ H(a) and e = a′b′ ∩ h(a), then |ea′′| ≤ d₁/3.

CLAIM 7. b′ ∈ H(a) if and only if a′ ∈ H(b).

Notice that, by Claim 7, a′ ∈ H(b). Together with Claim 6 (switching the roles of a and b), we have |e′b′′| ≤ d₁/3, where e′ = a′b′ ∩ h(b). Let w be the midpoint of a′ and b′. To show that w is inside P, we argue that w ∉ a′e. (The argument that w ∉ e′b′ is symmetric.) Suppose to the contrary that w ∈ a′e (see Figure 5). Then |a′e| ≤ |a′a′′| + |a′′e| < d₂ + d₁/3 < d₁/2. Let h be the hyperplane that passes through b′ and is parallel to h(a). Since w ∈ a′e, |eb′| < |a′e| < d₁. This means that dist(b′, h(a)) < d₁, which implies dist(h, h(a)) < d₁. Hence b cannot be in the strip between h and h(a). However, this means that b is to the right of h. This implies that c is to the left of h, since b′ is on cb. Now, since c is in H(a), c has to be between h and h(a), which implies dist(c, h(a)) < d₁. Thus, c violates part 2 of Definition 1, and is not a d₁-center.

Instead of bounding the time complexity of POLLY by log(1/d), we can bound it by the bit complexity B using the following lemma.

LEMMA 8. The minimum distance d between the boundary of P and those sampled examples in S that do not lie on the boundary is at least (2^B √n)^{−1}.

PROOF. Let s = 〈α₁, ..., αₙ〉 be an arbitrary element of S not lying on the boundary of P. If s is labeled positive, then, by our assumption, it does not lie on any bounding hyperplane. If s is labeled negative, then there is some bounding hyperplane on which it does not lie. In either case, s does not lie on some bounding hyperplane hᵢ: ∑_{i=1}^n aᵢxᵢ + a_{n+1} = 0, and we have

dist(s, hᵢ) = |a₁α₁ + ··· + aₙαₙ + a_{n+1}| / √(a₁² + ··· + aₙ²).


Fig. 5. If the midpoint w of a′ and b′ is on a′e, then dist(h(a), h) < d₁, and neither b nor c can be in the strip bounded by the hyperplanes h(a) and h. This contradicts the fact that b′ is between b and c on the line bc.

Without loss of generality, we can assume the aᵢ's and the αᵢ's are of integer values with bit complexity B. Thus, the denominator is at most 2^B √n while the numerator is at least 1.

5. Learning Disjoint Unions of Polyhedra. POLLY can be extended easily to learn unions of k disjoint polyhedra with each polyhedron having at most s facets, in time polynomial in both s and k. However, here we require that the minimum separation between any pair of distinct polyhedra is at least some distance γ, and the time complexity scales polynomially with 1/γ, and not log(1/γ), as we might desire. The main idea is to partition S+ such that two sampled positive examples are in the same partition if and only if they are in the same polyhedron.

To determine whether two sampled positive examples p₁ and p₂ are in the same polyhedron, we simply "march" from p₁ to p₂ along the line segment p₁p₂, taking a small step of γ at a time. If p₁ and p₂ are not in the same polyhedron, then, due to the γ separation, there must be a segment in p₁p₂ with length greater than γ such that the entire segment does not intersect any of the polyhedra. Thus, in this case, we must encounter at least one point on this "negative" segment along our march. Conversely, if p₁ and p₂ are in the same polyhedron, then, by convexity, p₁p₂ lies entirely in the same


Fig. 6. The surrogate point a′ is too far from the point a′′. To find the right surrogate point, we march along the line segment ac from c to a, moving a distance of γ at a time until we encounter a negative point ā. With ā, we then perform the binary search between ā and c to find the correct surrogate point.

polyhedron as p₁ and p₂. In this case we only encounter positive points along our march. Therefore, we can conclude that we encounter only positive points during our march if and only if both positive examples p₁ and p₂ are in the same polyhedron.
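A sketch of this marching test (our code; `member` is the membership oracle):

```python
import math

def same_polyhedron(p1, p2, member, gamma):
    """March from p1 to p2 in steps of gamma: the two positive examples
    lie in the same (convex) polyhedron iff every point met is positive."""
    length = math.dist(p1, p2)
    if length == 0.0:
        return True
    steps = int(length / gamma) + 1
    for i in range(steps + 1):
        t = min(1.0, i * gamma / length)   # include the endpoint p2
        point = [u + t * (v - u) for u, v in zip(p1, p2)]
        if not member(point):
            return False
    return True
```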

With the partitioning of S+, we can then apply POLLY to learn the polyhedron containing the points in each partition individually. However, there is one minor problem. Recall that in POLLY, we need to find a surrogate point a′ for each sampled negative example a using binary search. In this case the point a′ found may turn out to be some distance greater than d₂ away from the point of intersection a′′ of the polyhedron that we are trying to learn and ac, where c is the candidate d₁-center (see Figure 6).

The solution here is simply to march from the point c toward a, moving a distance of γ in each step until we first encounter a negative point ā. Due to the γ separation, we can deduce that there is only one bounding hyperplane between ā and c (see Figure 6), namely the bounding hyperplane facet(a) of the polyhedron that we are learning, which intersects the line segment ac. Thus, we can then apply the binary search between ā and c to find the correct surrogate point.

Since we are running k copies of POLLY, to bound the probability of failing to find a consistent hypothesis from above by δ/2, we need each copy of POLLY to fail with probability at most δ/2k (so that the probability that some copy of POLLY fails is at most δ/2). This can be done by iterating each copy of POLLY n log(1/d) + log(2k/δ) times.

Another question that we need to answer is how large the sample size must be. The VC-dimension of a concept formed by the union of up to k concepts, each with VC-dimension D, is bounded by 2Dk log(3k) [9]. Therefore, the class of unions of k disjoint polyhedra with each polyhedron having at most s facets has VC-dimension at most 4(n+1)sk log(3s) log(3k). By Theorem 2, we only need to ensure that the sample size is at least

m = max{(2/ε) log(4/δ), (32(n+1)sk log(3s) log(3k)/ε) log(13/ε)}.

The time complexity of our new algorithm has an additional initial cost of O(mk/γ) for partitioning the sampled positive examples. The running time for each iteration also increases by O(m√n/γ) due to the need for locating the intermediate point ā for each sampled negative point a. Thus, we have the following result.


THEOREM 9. Suppose the target concept is a disjoint union of k convex polyhedra, where each polyhedron is an intersection of at most s halfspaces in [0,1]^n and has a distance of at least γ from the other polyhedra. Let ε, δ > 0, and let S be an initially obtained random sample of

m = max{(2/ε) log(4/δ), (32(n+1)sk log(3s) log(3k)/ε) log(13/ε)}

labeled examples. Let LP(m, n) be the time needed to solve a linear programming problem with m constraints and n unknowns. Denote by d the minimum distance between the boundary of P and those sampled examples in S that do not lie on the boundary. Then in time

O(mk/γ + (n log(1/d) + log(2k/δ))(m(√n/γ + n log(1/d) + log(2ks/δ)) + ks·LP(m, n)))

and with sample complexity m (excluding membership queries), the probability that POLLY (Figure 3, modified as discussed) fails to output k collections of at most s halfspaces each, with classification error at most ε, is at most δ.

Notice that if some of the polyhedra in the target concept are very close to each other, then the parameter 1/γ could be exponential in the bit complexity B. Ideally, one would like to bound the time complexity (and hence the number of membership queries) by log(1/γ) instead of 1/γ. We conclude our presentation by posing the following open problem: Are unions of polyhedra PAC-learnable with membership queries in time polynomial in n, k, s, 1/ε, 1/δ, log(1/d), and log(1/γ)?

Acknowledgments. We thank Herbert Edelsbrunner for helpful conversations regarding this work, and Avrim Blum for pointing out Baum's paper [2]. We also thank Eric Baum for helpful clarifications and the COLT '96 committee for valuable comments. We appreciate the useful suggestions made by an anonymous referee.

Appendix

PROOF OF LEMMA 4. Before presenting the details of the proof of Lemma 4, we illustrate the intuition behind the proof by looking at the two-dimensional case.

The Two-Dimensional Case. Let a be an arbitrary point in S− and let e be a vertex of the target polygon (see Figure 7) visible from a. Let β denote a point on an edge incident to e that is a distance d₁ away from e. The points in B_{S+} that fall in the cone formed by the two rays extending from a toward the directions to e and β, respectively, violate part 1 of Definition 1 of being a d₁-center. If we can show the number of these "bad" points is bounded by a function dependent on d₁ (and the other parameters), then we can select a small enough d₁ such that the union of the sets of bad points, over all possible choices of a and e, has size smaller than half of |B_{S+}|. The latter implies that the probability that a point drawn uniformly from B_{S+} is a d₁-center is at least a half.


Fig. 7. The points in △apq ∩ B_{S+}, the shaded region, cannot be d₁-centers. The triangle △apq is contained in the parallelogram bapq.

To obtain such a bound, we look at the triangle △apq (see Figure 7) formed by the rays and the horizontal supporting line of the polygon formed by the intersection of the cone and the target polygon. This triangle is contained in the parallelogram bapq. Since the height of the parallelogram is at most 1, the number of bad points in the parallelogram is at most

(|pq|/(l/1000) + 1)(1/(l/1000) + 1).

Observing that the length of the base pq is proportional to d₁ gives us the desired bound. The number of points in B_{S+} that violate part 2 of Definition 1 of being a d₁-center can be obtained in a similar but easier manner.

The General Case. The idea of the proof for the case of arbitrary dimension is the same as in the two-dimensional case. Let a be an arbitrary point in S− (refer to Figure 8 for an illustration in three dimensions). Let F be a facet of P containing some (n−2)-face E of P. Let F′ = {p ∈ F : dist(p, E) < d₁}. Since the distance between any two points x and y in F′ is at most √n, the projections of x and y onto the (n−2)-dimensional hyperplane h(E) that contains E must be at most √n apart. Hence, the projection of F′ onto h(E) is contained in an (n−2)-dimensional hyperrectangle with all sides of length at most √n. Thus, F′ is contained in some (n−1)-dimensional hyperrectangle B of side lengths √n × ··· × √n × d₁. Let y be a point in P such that the line segment ya intersects F′, and such that all other points p in P for which pa intersects F′ either lie on or above the plane H that is parallel to B and passes through y. Do a perspective projection of B from a onto H, and denote the resulting box by B′. Observe that if b is a point in P such that ab intersects F′ (hence b is above H), then b is in the "pyramid" obtained by taking the convex hull of B′ and a.


Fig. 8. The region F′ is bounded by a hyperrectangle B. The hyperrectangle B′ is obtained by doing a perspective projection of B on the hyperplane H.

Suppose the side lengths of B′ are s₁ × ··· × s_{n−1}. Define a coordinate system for H with an arbitrary vertex of B′ as the origin o and orthogonal basis v₁, ..., v_{n−1}, where vᵢ is a unit vector with direction along an edge in B′ going out from o, so that B′ = {o + ∑_{i=1}^{n−1} λᵢvᵢ : 0 ≤ λᵢ ≤ sᵢ}, and suppose that y = 〈µ₁, ..., µ_{n−1}〉 under this coordinate system. Let z = ya ∩ F′, and let pᵢ = 〈µ₁, ..., µ_{i−1}, 0, µ_{i+1}, ..., µ_{n−1}〉 and qᵢ = 〈µ₁, ..., µ_{i−1}, sᵢ, µ_{i+1}, ..., µ_{n−1}〉. Further, let p′ᵢ = pᵢa ∩ B and q′ᵢ = qᵢa ∩ B.

Since |ay| < √n and, by definition, |az| > d, we have

|pᵢqᵢ|/|p′ᵢq′ᵢ| = |ay|/|az| < √n/d.

Since sᵢ = |pᵢqᵢ|, we have sᵢ < (√n/d)|p′ᵢq′ᵢ|. Together with the fact that the side lengths of B are at most √n × ··· × √n × d₁, we conclude that the side lengths of B′ are at most n/d × ··· × n/d × (√n/d)d₁.

The pyramid defined by the convex hull of B′ and a is contained in some hyperparallelepiped with base B′ whose top face contains the point a. Hence, the number of points in B_{S+} that are in the pyramid is at most

((n/d)(1/(l/1000)) + 1)^{n−2} · ((√n/d)d₁(1/(l/1000)) + 1) · (1/(l/1000) + 1),


which can be bounded above by

2^n (n/d)^{n−2} (√n/d) d₁ (1/(l/1000))^n.

There are at most m choices of the point a. There are at most (s choose 2) many (n−2)-faces, each giving rise to two pyramids, one on each side of the (n−2)-face. Hence, the number of points in B_{S+} that violate part 1 of Definition 1 is at most

m(s² − s) 2^n (n/d)^{n−1} d₁ (1/(l/1000))^n.

There are s facets, each with "surface area" at most (√n)^{n−1}. Using a similar argument as above, it can be shown that the number of points in B_{S+} that are d₁-close to the boundary of P (i.e., which violate part 2 of Definition 1) is at most s 2^n d₁ (√n)^{n−1} (l/1000)^{−n}. Since this is much smaller than m s² 2^n d₁ (n/d)^{n−1} (l/1000)^{−n}, the total number of points in B_{S+} that are not d₁-centers is at most

m s² 2^n d₁ (n/d)^{n−1} (l/1000)^{−n}.

Therefore, the probability that a point drawn uniformly from B_{S+} is not a d₁-center is at most

m s² d₁ (d/n) (2n/(ld))^n,

which is at most 1/2 because, by hypothesis,

d₁ ≤ (n/(2ms²d))(d²/(n^{1.5}m))^n = (1/2)[(n/(ms²d))(ld/(2n))^n].

PROOF OF CLAIM 6. Suppose to the contrary that |a′′e| > d₁/3. Let α be the perpendicular projection of a′ onto the line a′′e. Consider the following two cases (see Figure 9).

Case 1: ∠ea′′a′ ≥ π/2. Let b_x be the projection of b onto the line a′b′. Note that |αa′| < |a′a′′| < d₂. Define

β = b′ if ∠eb′b ≤ π/2, and β = b_x if ∠eb′b > π/2.

Let f be the point of intersection of the line a′′e and the line that passes through β and is perpendicular to a′e. By our choice of β, b ∈ △eβf (it is above a′b′ because a′b′ separates c from b, and it is below h(a) by assumption) and thus dist(b, h(a)) ≤ dist(β, h(a)) ≤ |βf|. Hence it suffices to show that |βf| < d, which contradicts the definition of d.


Fig. 9. The length of the segment a′′e has to be smaller than d₁/3, for otherwise the distance between b and the hyperplane h(a) is smaller than d, a contradiction.

Note that $\triangle e\alpha a'$ is similar to $\triangle e\beta f$, and thus

$$\frac{|\alpha a'|}{|\alpha e|} = \frac{|\beta f|}{|\beta e|}.$$

Since $|\alpha e| > |a''e| > d_1/3$ and $|\alpha a'| \le |a'a''| < d_2$, we have

$$\frac{|\beta f|}{|\beta e|} = \frac{|\alpha a'|}{|\alpha e|} < \frac{3 d_2}{d_1}.$$

Recall that the distance between any pair of points in the unit hypercube is at most $\sqrt{n}$. Although $b_x$ is not necessarily in the unit cube, we can bound its distance from $b'$ by $\sqrt{n}$, because in $\triangle b'b_x b$ we have $|b'b_x| \le |b'b| \le \sqrt{n}$. Now,

$$|\beta e| \le |b_x b'| + |b'e| \le 2\sqrt{n}.$$

Therefore,

$$\frac{|\beta f|}{2\sqrt{n}} \le \frac{|\beta f|}{|\beta e|} < \frac{3 d_2}{d_1},$$

and, since $d_2 \le (d_1/12\sqrt{n})\,d$,

$$|\beta f| < \frac{6 d_1 d \sqrt{n}}{12 d_1 \sqrt{n}} \le \frac{d}{2},$$

a contradiction, since this implies that the distance between $b$ and $h(a)$ is smaller than $d$.
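The Case 1 chain of inequalities can be replayed numerically (an aside; values assumed, with $d_2$ set to its largest allowed value $(d_1/12\sqrt{n})\,d$):

```python
import math

n, d, d1 = 4, 0.1, 1e-2                  # assumed illustrative values
d2 = (d1 / (12 * math.sqrt(n))) * d      # d_2 at its largest allowed value

beta_e = 2 * math.sqrt(n)                # |beta e| <= 2 sqrt(n)
ratio = 3 * d2 / d1                      # |beta f| / |beta e| < 3 d_2 / d_1
beta_f = beta_e * ratio                  # so |beta f| < 6 sqrt(n) d_2 / d_1
print(beta_f, d / 2)                     # both 0.05: the chain tops out at d/2
print(math.isclose(beta_f, d / 2) or beta_f < d / 2)  # True
```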

Case 2: $\angle ea''a' < \pi/2$. Since $|a''a'| < d_2$, we have $|a''\alpha| \le |a''a'| < d_2 < d_1/6$, which implies that $|\alpha e| \ge |a''e| - |a''\alpha| > d_1/3 - d_1/6 = d_1/6$. We also have $|a'\alpha| < |a'a''| < d_2$. Now apply the same argument as in Case 1, obtaining the contradiction $|\beta f| < (d/2) \times 2 = d$.


PROOF OF CLAIM 7. Suppose $b' \in H(a)$. Then, by Claim 6, the point $e = h(a) \cap a'b'$ is on $a''p_2$. Thus, $e$ is in $P$ and hence in $H(b)$. The point $a'$ has to be in $H(b)$, since otherwise neither $a'$ nor $b'$ is in $H(b)$, which implies that $e$ is not in $H(b)$, a contradiction.
