Journal of Computer and System Sciences 52, 453-470 (1996)

Computing the Maximum Bichromatic Discrepancy, with Applications to Computer Graphics and Machine Learning

David P. Dobkin* and Dimitrios Gunopulos*

Department of Computer Science, Princeton University, 35 Olden Street, Princeton, New Jersey 08540

and

Wolfgang Maass

Institute for Theoretical Computer Science, Technische Universitaet Graz, Klosterwiesgasse 32/2, A-8010 Graz, Austria

Received November 20, 1995

Computing the maximum bichromatic discrepancy is an interesting theoretical problem with important applications in computational learning theory, computational geometry and computer graphics. In this paper we give algorithms to compute the maximum bichromatic discrepancy for simple geometric ranges, including rectangles and halfspaces. In addition, we give extensions to other discrepancy problems. © 1996 Academic Press, Inc.

1. INTRODUCTION

The main theme of this paper is to present efficient algorithms that solve the problem of computing the maximum bichromatic discrepancy for axis-oriented rectangles. This problem arises naturally in different areas of computer science, such as computational learning theory, computational geometry and computer graphics ([28], [10]), and has applications in all these areas.

In computational learning theory, the problem of agnostic PAC-learning with simple geometric hypotheses can be reduced to the problem of computing the maximum bichromatic discrepancy for simple geometric ranges. In computational geometry, efficient computation of the discrepancy of a two-colored point set is useful for the construction of ε-approximations of point sets. Finally, in computer graphics, the maximum numerical discrepancy of a point set is a good measure of how well a sampling pattern captures details in a picture.

In the next three parts of the introduction we give the background and present three views of the same algorithmic problem from the perspectives of learning theory (the minimizing disagreement problem), computational geometry (the bichromatic discrepancy) and computer graphics (the numerical discrepancy). The subject of Section 2 is an O(n^2 log n) algorithm that computes the maximum bichromatic discrepancy in two dimensions. Here, we first prove that the minimizing disagreement problem and the computation of the maximum bichromatic discrepancy are equivalent. Then we develop an algorithm to compute the maximum discrepancy, working first in one and then in two dimensions. In Section 3 we modify this algorithm to compute the maximum numerical discrepancy. In Section 4 we give an approximation algorithm, and we extend our results to higher dimensions.

Finally, we note here that we are using the RAM model of computation ([20], [37]) throughout this paper.

1.1. Agnostic PAC-Learning and the Minimizing Disagreement Problem

One goal of computational learning theory is to provide tools for the design and analysis of learning algorithms that provide satisfactory solutions for real-world learning problems. There are many experimental results regarding the performance of various heuristic learning algorithms on a number of "benchmark" datasets for real-world classification problems ([32], [46], [45], [47], [5], [19]), but the set of learning algorithms that are examined in these applications is virtually disjoint from the set of learning algorithms that are traditionally considered in computational learning theory.

Haussler in [16] provided an important link between theoretical and applied machine learning when he introduced a variation of Valiant's ([42]) well-known model for Probably Approximately Correct learning ("PAC-learning"). The new model, agnostic PAC-learning, provides an adequate format for the generic formulation of real-life classification problems.


* The research work of these authors was supported by NSF Grant CCR93-01254 and the Geometry Center.

E-mail addresses: dpd@cs.princeton.edu, dg@cs.princeton.edu, maass@igi.tu-graz.ac.at.


The shortcoming of the original PAC-learning model of Valiant is that it relies on the assumption that the labels of the training examples arise from a target concept with an a priori known, specific, simple structure, an assumption rarely met in practice. Valiant later extended the basic model to include the possibility that some labels come from noise rather than from the target concept. However, so far positive learning results in this model have been proven only under the assumption that the noise has a specific structure ([1], [24]) or that the percentage of noisy labels is very small in comparison to the desired error bound ε of the learner.

In the agnostic PAC-learning model of Haussler we assume that the learner gets a sequence of training examples, each consisting of an instance x ∈ X and an outcome (we will also call it a label) y ∈ Y. The sets X and Y are arbitrary sets, called the instance space and the outcome space respectively. For most of this paper we will set Y = {0, 1}. The examples are generated according to an arbitrary distribution D on X × Y that is unknown to the learner. In a real-world application, D may simply reflect the distribution of data as they occur in nature (possibly including contradictions, i.e., for some x ∈ X both (x, 0) and (x, 1) may occur as examples), without assuming that the labels are generated by any rule. The learner is also given a set H of hypotheses. Each hypothesis H ∈ H is a function from X to {0, 1}, and we will also call it a decision rule. The true error Error_D(H) of a hypothesis H is

Error_D(H) = E_{(x, b) ~ D}[ |H(x) − b| ]

or, in other words, the probability that H fails to predict the label b of an example (x, b) drawn according to D.

The goal of the learner is to compute, for given parameters ε, δ > 0, a hypothesis H* ∈ H whose true error Error_D(H*) is, with probability at least 1 − δ, not larger than ε + inf_{H ∈ H} Error_D(H).

To achieve this goal, the learner is allowed to specify a minimum size m(ε, δ) for the training sequence. So the input for the computation of H* by the learner is a training sequence T of at least m(ε, δ) examples drawn from X × {0, 1} according to D. This input T may be atypical for the actual distribution D, so we allow the learner to fail with probability at most δ.

A learner which can carry out this task for any distribution D over X × {0, 1} is called an efficient agnostic PAC-learner for hypothesis class H if its sample bound m(ε, δ) and its number of computation steps can be bounded by a polynomial in the parameters involved (in particular in 1/ε and 1/δ).

For a given training sequence T = {(x_i, b_i) : 1 ≤ i ≤ n} and a hypothesis H, we define the empirical error

Error_T(H) = |{ i : ((x_i, b_i) ∈ T) ∧ (H(x_i) ≠ b_i) }| / n

that is, the number of positive (labeled 1) examples outside of H plus the number of negative (labeled 0) examples inside H, divided by the size of the training sequence.
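As a concrete reading of this definition, here is a minimal Python sketch that computes Error_T(H) for a hypothesis given as a 0/1-valued function; the rectangle helper and the tiny training sequence are hypothetical, for illustration only.

    # Empirical error Error_T(H), transcribed from the definition above.
    def empirical_error(training_sequence, hypothesis):
        """training_sequence: list of (x, b) pairs with labels b in {0, 1}."""
        mistakes = sum(1 for x, b in training_sequence if hypothesis(x) != b)
        return mistakes / len(training_sequence)

    # Example: a rectangular hypothesis on 2-d instances (hypothetical helper).
    def rect_hypothesis(xmin, xmax, ymin, ymax):
        return lambda p: 1 if xmin <= p[0] <= xmax and ymin <= p[1] <= ymax else 0

    T = [((0.2, 0.3), 1), ((0.5, 0.5), 1), ((0.8, 0.9), 0)]
    print(empirical_error(T, rect_hypothesis(0.0, 0.6, 0.0, 0.6)))  # 0.0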

The empirical error measures how well a hypothesis predicts the labels of the given training sequence. The following two uniform convergence results (first given by Haussler, [16]) provide a connection between the required size of the training sequence and the difference between the true and empirical errors. They say that we can bound this difference (with high probability) if we pick a large enough training sequence at random. Thus, if we have a large enough random training sequence, we can compute the empirical error of any H ∈ H and obtain in this way a good approximation of the true error. The first result is applicable when the set H is finite, and the second when it has finite VC-dimension.

1. Theorem 1 in [16] states that for any sample T with at least m(ε, δ) = (ln |H| + ln(2/δ)) / (2ε^2) examples drawn with regard to some arbitrary distribution D over X × {0, 1}, the following holds with probability at least 1 − δ:

∀H ∈ H : |Error_T(H) − Error_D(H)| ≤ ε

2. A recent result by Talagrand ([40], which slightly improves [16]) implies that under some rather harmless measurability conditions, the same claim holds for

m(ε, δ) = VCdim(H) (ln K + ln ln 2K + ln(1/ε) + ln(1/δ)) / (2ε^2)

where VCdim(H) is the VC-dimension of H, and K is some absolute constant that is conjectured to be not larger than 1000.
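To get a feeling for these bounds, the following small sketch evaluates the finite-class bound of item 1 as reconstructed above; the chosen numbers are arbitrary examples, not values from the paper.

    import math

    # m(eps, delta) = (ln|H| + ln(2/delta)) / (2 eps^2), finite-class case.
    def sample_bound(num_hypotheses, eps, delta):
        return math.ceil((math.log(num_hypotheses) + math.log(2.0 / delta))
                         / (2.0 * eps ** 2))

    # E.g. |H| = 10^6, eps = 0.05, delta = 0.01 needs a few thousand examples.
    print(sample_bound(10 ** 6, 0.05, 0.01))  # 3823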

These results show that in order to prove a positive result for efficient agnostic PAC-learning with a specific hypothesis class H ⊆ 2^X of bounded VC-dimension, it suffices to design an efficient algorithm for a related finite optimization problem, for which an efficient solution is also very desirable from the point of view of applied machine learning: the minimizing disagreement problem for H. This is the problem of computing, for any given finite training sequence T of labeled points, some hypothesis H ∈ H whose empirical error is minimal among all hypotheses in H. An algorithm that solves the minimizing disagreement problem for H is, together with the bounds for the minimum number of training examples given above, an agnostic PAC-learner for hypothesis class H. By the same reasoning, an efficient ε_1-approximation algorithm for the minimizing disagreement problem for H can produce a hypothesis H whose true error is at most ε_1 + ε larger than the optimum.


It should be pointed out that Kearns et al. [25] have shown that for any hypothesis class H, the existence of an efficient algorithm for the minimizing disagreement problem for H is in fact also a necessary condition for efficient agnostic PAC-learning with hypothesis class H. There are cases where the minimizing disagreement problem, and therefore efficient agnostic PAC-learning, is very hard. For example, it is NP-hard to solve the problem for the class of monomials ([25]) and for halfspaces in arbitrary dimension ([18]).

In this paper we look at the minimizing disagreement problem when the class of hypotheses is the set R of axis-aligned, but otherwise arbitrary, rectangles. A rectangular hypothesis R ∈ R defines a natural function from X to {0, 1}: R(x) = 1 if x lies in the interior of R, and R(x) = 0 otherwise. The problem is, given a labeled training sequence T (in two dimensions), to find the rectangular hypothesis R that minimizes the empirical error (Fig. 1):

min_{R ∈ R} Error_T(R)

In the following sections we give an O(n^2 log n) solution to this problem. It was previously studied by Lubinsky, who gives a cubic algorithm for it ([27]). It is easy to show that the VC-dimension of R is finite (see §1.3), and therefore such an algorithm gives an efficient agnostic PAC-learner for rectangular hypotheses. We also consider related minimizing disagreement problems, in particular for the union of two disjoint rectangles, the complement of rectangles and unions of rectangles, and halfspaces in low dimensions.

Simple hypothesis classes of this type have turned out to be quite interesting from the point of view of applied machine learning.

FIG. 1. The hypothesis A minimizes the empirical error for the point set shown (circles are labeled 1, and squares are labeled 0; A fails to predict the filled point).

Weiss et al. ([46], [45], [47]) have shown through experiments that for many of the standard benchmark datasets a short rule that depends on only two of the attributes, and which is a Boolean combination of expressions of the form "a_j > c" or "a_j = c", provides the best available prediction rule. For example, it is reported by Weiss and Kulikowski ([47]) that for their appendicitis dataset the complement of a rectangle (in 2 of the 8 attributes of the particular dataset) is the prediction rule that performs best. Finally, we would like to point out that optimal hypotheses of a simple type like the ones considered in this paper have the additional advantage that they provide a human user with valuable heuristic insight into the structure of a real-world learning problem. [9] presents extensions of the algorithms we present here to more complicated and expressive hypothesis classes.

Our goal is to contribute tools for the design of algorithms that compute optimal prediction rules of this kind. Auer, Holte and Maass ([2]) present a fast agnostic PAC-learning algorithm for the important hypothesis class of depth-2 decision trees, and evaluate the algorithm's performance on standard benchmark datasets. This study shows that one can get competitive experimental results on standard benchmark datasets with agnostic PAC-learning algorithms.

1.2. Bichromatic Discrepancy and ε-Approximations

Many new results in computational geometry have used probabilistic techniques and algorithms. Haussler and Welzl ([17]) gave a very useful abstract framework for their development and analysis, the concept of set systems with bounded (finite) VC-dimension.

A set system is a pair (S, R), where S is a set of points and R is a set of subsets (we will call them ranges) of S. For a set Y ⊆ S, we call the set system (Y, {R ∩ Y : R ∈ R}) the subspace induced by Y. We say that Y is shattered by R if, in the subspace induced by Y, every possible subset of Y is a range (in other words, if |{R ∩ Y : R ∈ R}| = 2^{|Y|}). The Vapnik-Chervonenkis dimension, or VC-dimension, of the set system (S, R) is the maximum cardinality of a shattered subset of S ([44]).

A subset A ⊆ S is an ε-approximation for the set system (S, R) if | |A ∩ R|/|A| − |R|/|S| | ≤ ε for all R ∈ R. A subset N ⊆ S is an ε-net for (S, R) if N ∩ R ≠ ∅ for any R ∈ R with |R|/|S| > ε. Remarkably, if (S, R) has finite VC-dimension, it has ε-approximations and ε-nets with sizes independent of |S|, as the following result shows:

Theorem [17]. Let d be fixed and let (S, R) be a set system of VC-dimension d. Then for every ε > 0, there exists an ε-net for (S, R) of size O((1/ε) log(1/ε)).

Set systems with finite VC-dimension occur naturally in geometry and in learning theory. Let us consider, for example, the set system we primarily examine in this paper.


The set S is a finite set of two-dimensional points (S ⊂ [0, 1]^2), and for A ⊆ S, A is a range (A ∈ R) if and only if there exists an axis-aligned rectangle that contains exactly the points in A. It is easy to see that the VC-dimension of (S, R) is at least 4: if we pick a subset Y of 4 points arranged in a diamond, Y is shattered by R. Suppose Y has 5 or more points. Take the smallest enclosing axis-aligned rectangle and, for each of its edges, take exactly one point that touches it. This subset has at most 4 points and cannot be a range in the subspace induced by Y, because any axis-aligned rectangle that contains this set of points contains all points of Y. So the VC-dimension of our set system cannot be more than 4 (or 2d for d-dimensional points).
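The lower-bound half of this argument is easy to check mechanically. The following sketch verifies that the bounding box of every nonempty subset of a four-point diamond contains exactly that subset (the empty set is realized by a rectangle placed away from all points):

    from itertools import combinations

    # Four points arranged in a diamond are shattered by axis-aligned
    # rectangles: each subset's bounding box excludes the other points.
    diamond = [(0.5, 0.9), (0.5, 0.1), (0.1, 0.5), (0.9, 0.5)]

    def bbox_contains(subset, p):
        xs = [q[0] for q in subset]; ys = [q[1] for q in subset]
        return min(xs) <= p[0] <= max(xs) and min(ys) <= p[1] <= max(ys)

    shattered = all(
        all(bbox_contains(subset, p) == (p in subset) for p in diamond)
        for k in range(1, 5) for subset in combinations(diamond, k)
    )
    print(shattered)  # True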

Because of their properties, ε-nets have been used in many geometric algorithms and applications. Their use is also instrumental in the derandomization of divide-and-conquer algorithms ([4]). The derandomization of randomized algorithms is a general problem whose study gives us a better understanding of the importance of randomness as a computational resource; it also produces algorithms with guaranteed worst-case performance. The only known way to deterministically and efficiently compute ε-nets is via ε-approximations ([29]). Recently Matousek et al. ([29]) gave a strong connection between the discrepancy of a set system and the deterministic construction of ε-approximations and ε-nets.

Let us define the bichromatic discrepancy first. Let (S, R) be a set system and let χ : S → ℝ be a mapping. For a set R ∈ R, let

Δ(R) = Σ_{x ∈ R ∩ S} χ(x)

be the bichromatic discrepancy of R. We define the maximum bichromatic discrepancy of χ on (S, R) by

Max Δ(S, χ, R) = max_{R ∈ R} |Δ(R)|

Usually χ is a mapping to {−1, +1}, and is called a coloring of S (Fig. 2); this is where the name bichromatic comes from.
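The definition translates directly into a brute-force reference implementation. The sketch below evaluates Max Δ(S, χ, R) for axis-aligned rectangles by trying all candidate corner coordinates; it is far slower than the algorithms of Section 2, but useful for checking them on small inputs:

    from itertools import product

    # Max Delta(S, chi, R) over axis-aligned rectangles, by brute force.
    def max_bichromatic_discrepancy(points, chi):
        xs = sorted({p[0] for p in points})
        ys = sorted({p[1] for p in points})
        best = 0.0
        for x1, x2 in product(xs, xs):
            for y1, y2 in product(ys, ys):
                if x1 > x2 or y1 > y2:
                    continue
                s = sum(chi[p] for p in points
                        if x1 <= p[0] <= x2 and y1 <= p[1] <= y2)
                best = max(best, abs(s))
        return best

    S = [(0.1, 0.2), (0.4, 0.4), (0.5, 0.8), (0.9, 0.3)]
    chi = {S[0]: +1, S[1]: +1, S[2]: -1, S[3]: +1}
    print(max_bichromatic_discrepancy(S, chi))  # 3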

The discrepancy of (S, R) is

disc(S, R) = min_{χ : S → {−1, +1}} Max Δ(S, χ, R)

Matousek et al. ([29]) also show that a set system with discrepancy δ has a (2δ/|S|)-approximation A, with |A| = ⌈|S|/2⌉. Obviously A is also a (2δ/|S|)-net. Furthermore, this approximation A can be constructed from the coloring that has discrepancy δ.

FIG. 2. The rectangle A maximizes the bichromatic discrepancy for the point set shown (circles are mapped to +1, and squares are mapped to −1; the filled points are the ones in the interior).

Therefore an algorithm that computes the maximum bichromatic discrepancy of a coloring χ also gives an accurate bound on the quality of the resulting ε-approximation. In Section 2.1 we also show that such an algorithm can be used to solve the minimizing disagreement problem.

1.3. Numerical Discrepancy and Sampling Patterns in Graphics

The importance of the sampling technique in computer graphics is clearly demonstrated by the following examples.

The synthesis of realistic images of scenes is one of the most important applications of computer graphics. Often the method of choice to produce the image of a computer-modeled scene is ray tracing. In its basic form, we find the value of each pixel by sampling the radiance at a set of points in the area of the pixel: for each sample we cast a ray into the modeled scene and try to trace it to the light sources. The value of the pixel is then computed from a combination, typically an average, of the values of the samples. In distributed ray tracing we sample in higher dimensions to produce other effects, for example motion blur or depth of field. The idea in ray tracing is to find an approximate solution of the rendering equation, an integral equation derived by Kajiya ([21]). Instead of solving the continuous problem, we solve an approximate discrete version by using a set of samples. The error in this approximation, and consequently the quality of the picture, depends on how well the sampling point set approximates the area function. Therefore the error depends both on the size and on the quality of the sampling point set.

One of the most general ways of attacking aliasing is supersampling. In this approach we sample the picture at a rate much higher than the pixel rate, so there are many samples (called supersamples) in each pixel; to compute the pixel values we resample by averaging the supersamples within each pixel area. If we place the sample points in a uniform pattern we get very visible aliasing artifacts, like Moire patterns. Therefore a stochastic or semi-random point set is preferable, because it produces less prominent random noise.

A number of different approaches have been proposed to find good-quality sampling patterns. Mitchell ([31], see also [11]) describes an extensive search for good patterns, and shows that their use is important in practice. For example, one approach is to use sampling patterns with a high-frequency spectrum. These patterns drive aliasing noise to higher frequencies, where it is less visible and can be reduced with supersampling. One promising approach is the application of the theory of discrepancy, or irregularities of distribution, introduced by Beck and Chen ([3]) and first applied to computer graphics by Shirley ([39]) and Niederreiter ([34]). Discrepancy theory focuses on the problem of approximating one measure (typically a continuous one) with another (typically a discrete one). Its main application is in quasi-Monte Carlo numerical integration, where we use a point set in order to apply finite-element techniques ([33], see also [48], and [36] for an application in finance).

In our graphics applications we want to approximate the area function (or a weighted area function), and our objective is to find a sampling pattern that provides good coverage for all kinds of images. Assume that we have a sample set S of points in [0, 1]^d. Let F be a family of regions in [0, 1]^d. For any region R in F, let μ(R) be the Euclidean measure of R ∩ [0, 1]^d (the area of R), and let μ_S(R) be the discrete measure |R ∩ S|/|S| (the fraction of S in R). Then the numerical discrepancy of R with respect to S is

D_S(R) = |μ(R) − μ_S(R)|

and the maximum numerical discrepancy of S with respect to the family F is

Max D(S, F) = max_{R ∈ F} D_S(R)
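As a concrete reading of these definitions, the following sketch computes D_S(R) for a single axis-aligned rectangle in the unit square; the sample set is an arbitrary illustration:

    # Numerical discrepancy D_S(R) of R = [x1, x2] x [y1, y2] in [0, 1]^2.
    def numerical_discrepancy(points, x1, x2, y1, y2):
        area = (x2 - x1) * (y2 - y1)                    # mu(R)
        inside = sum(1 for (x, y) in points
                     if x1 <= x <= x2 and y1 <= y <= y2)
        return abs(area - inside / len(points))         # |mu(R) - mu_S(R)|

    S = [(0.25, 0.25), (0.75, 0.25), (0.25, 0.75), (0.75, 0.75)]
    print(numerical_discrepancy(S, 0.0, 0.5, 0.0, 0.5))  # |0.25 - 0.25| = 0.0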

Discrepancy is a geometric data-structures problem of broader interest; the problems that arise typically require the introduction of new techniques or extensions of existing ones ([7], [11], [8]). Intuitively, discrepancy is a good quality measure for sampling point sets because it provides a direct measurement of how well a given pattern estimates certain simple integral types. Ideally we would like to compute the maximum numerical discrepancy for the most general model, where the set of regions F is the family of convex polygons. While this problem cannot be solved with reasonable complexity, other families provide useful approximations. Such families include halfspaces, stripes and axis-aligned left-anchored rectangles.

FIG. 3. The rectangle A maximizes the numerical discrepancy for the point set shown (the filled points are the ones inside A).

The family of regions that we will consider in the following sections is the set R of d-dimensional axis-aligned boxes (rectangles) in [0, 1]^d. We will concentrate on the two-dimensional case (Fig. 3). This model, the rectangle discrepancy, is a good approximation to the real problem of interest, because in two dimensions it gives a good measure of how well a sample pattern, when applied to the area of one or more pixels, captures small details in the picture. Results by Koksma ([26]) and Traub and Wozniakowski ([41]), which show that the error in evaluating an integral (under bounded-variation conditions) is proportional to the discrepancy of the sampling point set, further justify its use.

A lot of work has been done on producing point sets with low discrepancy in this model (see [3]), and in fact almost optimal sequences (Hammersley points, [15]) are known. Algorithms that compute the exact or approximate maximum numerical discrepancy of point sets are useful, however, to compare point sets, to find patterns with very low discrepancy, and to produce point sets when other properties (for example a random distribution) are also important.

2. COMPUTING THE MAXIMUM BICHROMATIC DISCREPANCY

2.1. The Maximum Bichromatic Discrepancy and the Minimizing Disagreement Problem

In this section we concentrate on the problem of computing the maximum bichromatic discrepancy

Max Δ(S, χ, R) = max_{R ∈ R} |Δ(R)| = max_{R ∈ R} | Σ_{x ∈ R ∩ S} χ(x) |

for a given set S, a given weight function χ : S → ℝ and the set R of axis-aligned rectangles, and of finding a rectangle R with Δ(R) = Max Δ(S, χ, R). For a given rectangle R ∈ R and a point x ∈ S, we write x ∈ R if and only if the point x is in the interior or on the boundary of R. To simplify the definition of Max Δ(S, χ, R) (and avoid the absolute value), we define the following function:

Max Δ'(S, χ, R) = max_{R ∈ R} Δ(R) = max_{R ∈ R} Σ_{x ∈ R ∩ S} χ(x)

An algorithm that computes Max Δ'(S, χ, R) can be used to compute Max Δ(S, χ, R) too: it finds a rectangle with the maximum positive discrepancy, and, if we reverse the signs of the weights of the points, the same algorithm finds a rectangle with the maximum negative discrepancy. One of the two rectangles must then maximize |Δ(R)|. The following theorem establishes the connection between Max Δ' and the minimizing disagreement problem.

Theorem 1. Solving the minimizing disagreement problem for rectangle hypotheses on a training sequence T is equivalent to the problem of finding the rectangle that maximizes Δ for the associated set of points S and the mapping χ : S → {−1, +1} which is defined by the labels in T.

Proof. Suppose we are given a sequence T of labeled examples. Let S be the set of points we get if we remove the labels from the examples in T. We use the labels to construct the coloring χ: an example labeled 1 is mapped to +1 (a red point), and an example labeled 0 is mapped to −1 (a blue point). The rectangle R that maximizes Δ(R) = Σ_{x ∈ R ∩ S} χ(x) maximizes the number of red points inside R minus the number of blue points inside R, so it minimizes the number of blue points inside R minus the number of red points inside R. But the number of red points inside R equals the total number of red points (a constant) minus the number of red points outside R, so the rectangle that maximizes Δ(R) minimizes the number of blue points inside R plus the number of red points outside R. This count is exactly n · Error_T(R). The other direction is similar. ∎
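The reduction is easy to check numerically. With χ = +1 for label 1 and −1 for label 0, the proof shows that n · Error_T(R) equals the (constant) number of positive examples minus Δ(R); the following sketch verifies this identity on an arbitrary toy input:

    # Check: n * Error_T(R) == (#positive examples) - Delta(R).
    def check(T, in_rect):
        pos = sum(1 for _, b in T if b == 1)
        delta = sum((+1 if b == 1 else -1) for x, b in T if in_rect(x))
        errors = sum(1 for x, b in T
                     if (in_rect(x) and b == 0) or (not in_rect(x) and b == 1))
        return errors == pos - delta

    T = [((0.2, 0.2), 1), ((0.4, 0.5), 0), ((0.8, 0.8), 1)]
    print(check(T, lambda p: p[0] <= 0.5 and p[1] <= 0.6))  # True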

The bichromatic discrepancy provides an approximation for the numerical discrepancy model with the use of a sufficiently fine grid of blue points. It is the discrete equivalent of the numerical discrepancy, since we measure how well one point set approximates another. In fact, some upper-bound results for the numerical discrepancy given in [3] were obtained via theorems for the bichromatic discrepancy. So, as we will see in Section 3, an algorithm for the bichromatic discrepancy is a good foundation for solving the numerical discrepancy case as well.

We begin our investigation with a look at the one-dimensional case, to build intuition and fix ideas.

2.2. The 1-d Case

In one dimension, axis-oriented rectangles become intervals that have both endpoints in the unit interval, and the input set S is a set of n distinct numbers between 0 and 1. The algorithm we develop computes Max Δ'(S, χ, R) and finds the interval that maximizes Δ.

We consider the static case first, where both the point set S and the mapping χ are fixed and known beforehand. We assume that the point set is sorted. This assumption is reasonable if the points are the results of experiments performed by the learner in monotone order. If this is not the case, sorting the points is an O(n log n) preprocessing step of the algorithm.

First we show that we have only a finite number of intervals to consider.

Lemma 1. Max Δ'(S, χ, R) is attained by an interval whose endpoints are points of S.

Proof. If an interval has a free endpoint, we can move it until it meets the closest point inside, without changing the discrepancy of the interval. With the exception of the trivial case of no red points, the interval that maximizes Δ cannot be empty. ∎

From this lemma we know that there are only O(n^2) essentially different intervals to consider. The most naive algorithm would compute the discrepancy of each of them, finding the maximum bichromatic discrepancy in O(n^3) time. Using an incremental algorithm (sketched below) we can compute all the discrepancies in O(n^2) time. We then show that a splitting strategy improves this further.
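For reference, this is the incremental O(n^2) computation: for every left endpoint we sweep the right endpoint once, maintaining a running sum of colors, so each interval costs O(1):

    # Incremental O(n^2) computation of Max Delta' over intervals.
    def max_interval_discrepancy_quadratic(xs, chi):
        """xs: sorted points; chi: list of +1/-1 colors. Returns max Delta'."""
        n, best = len(xs), 0
        best_interval = None
        for i in range(n):
            running = 0
            for j in range(i, n):
                running += chi[j]           # extend the interval by one point
                if running > best:
                    best, best_interval = running, (xs[i], xs[j])
        return best, best_interval

    xs  = [0.1, 0.2, 0.35, 0.5, 0.7, 0.9]
    chi = [+1,  -1,  +1,   +1,  -1,  +1]
    print(max_interval_discrepancy_quadratic(xs, chi))  # (2, (0.1, 0.5))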

We first need another simple lemma.

Lemma 2. If 0 ≤ l ≤ m ≤ r ≤ 1 and m ∉ S, then Δ([l, r]) = Δ([l, m]) + Δ([m, r]).

Proof. Since m is not a point of S, each point of S in [l, r] lies in exactly one of [l, m] and [m, r]. So Σ_{x ∈ S ∩ [l, r]} χ(x) = Σ_{x ∈ S ∩ [l, m]} χ(x) + Σ_{x ∈ S ∩ [m, r]} χ(x), and the lemma follows. ∎

The following lemma shows that the two endpoints of the interval that maximizes Δ over all intervals containing m can be found independently of each other.

Lemma 3. Let A = [l, r] be an interval, and take a point m ∈ A with m ∉ S. Assume that [x_i, x_j] maximizes Δ over all subintervals of A that contain the point m. Then [x_i, m] maximizes Δ over all subintervals of A that have m as their right endpoint.

Proof. Suppose that there is an interval [y, m] (with y ∈ A) such that Δ([y, m]) > Δ([x_i, m]). From the hypothesis we know that Δ([y, x_j]) ≤ Δ([x_i, x_j]). From Lemma 2 we have Δ([x_i, x_j]) = Δ([x_i, m]) + Δ([m, x_j]) and Δ([y, x_j]) = Δ([y, m]) + Δ([m, x_j]). From these equalities it follows that Δ([y, m]) ≤ Δ([x_i, m]), which is a contradiction.

Similarly we can show that [m, x_j] must maximize Δ over all subintervals whose left endpoint is m. ∎


The next lemma shows that we can find the interval that maximizes Δ over all intervals with a given point m as their right (or left) boundary if we split [l, r] into two parts and solve two subproblems.

Lemma 4. Assume that we split the interval A = [l, r] into A_l = [l, m] and A_r = [m, r], with m ∉ S. Also assume that [x_i, r] maximizes Δ over all subintervals of A_r that have r as their right endpoint, and that [x_j, m] maximizes Δ over all subintervals of A_l that have m as their right endpoint. Then either [x_i, r] or [x_j, r] maximizes Δ over all subintervals of A that have r as their right endpoint.

Proof. Suppose there is a subinterval [y, r] of A such that Δ([y, r]) > Δ([x_i, r]) and Δ([y, r]) > Δ([x_j, r]). Obviously y cannot be in A_r, since in that case [y, r] would be contained in A_r and Δ([y, r]) would be at most Δ([x_i, r]). So y has to be in A_l. Applying Lemma 2 to [y, r] and [x_j, r] (both contain m), we get Δ([y, m]) > Δ([x_j, m]), a contradiction. ∎

This suggests that we can compute the maximum discrepancy while building a tree of intervals in a hierarchical fashion. We partition [0, 1] into intervals that we call regions. For each region A we compute three maxima: A_max is the interval that maximizes Δ over all subintervals of A (by definition, [0, 1]_max is the maximum we are looking for); A_left is the interval that maximizes Δ over all subintervals of A that share A's left endpoint; and A_right is the interval that maximizes Δ over all subintervals of A that share A's right endpoint.

Lemma 5. Assume L and R are adjacent non-overlapping regions that are merged to form LR. Then LR_max, LR_left and LR_right can be computed in constant time from R_max, R_left, R_right, L_max, L_left and L_right.

Proof. From Lemma 3 we know that LR_max is either L_max or R_max or L_right ∪ R_left, so we can compute it in constant time. From Lemma 4 we know that LR_left is either L_left or L ∪ R_left, and LR_right is either R_right or L_right ∪ R. It follows that we can perform a merge operation in constant time. ∎

The algorithm to compute [0, 1]_max halves the number of regions in each step.

Algorithm 1.
1. Partition [0, 1] into ⌈n/2⌉ non-overlapping regions, each properly containing at most 2 points of S, so that the boundaries of the regions are not points of S. Find the three maxima for each of these regions.
2. Merge consecutive even and odd regions to produce half the number of new regions. In each merge operation find the three new maxima for the new region.
3. If only [0, 1] is left, output [0, 1]_max; otherwise go back to step 2.

Theorem 2. We can compute the maximum bichromatic discrepancy of a sorted one-dimensional point set S for any given coloring, and find an interval that maximizes the bichromatic discrepancy, in linear time and space.

Proof. Algorithm 1 computes [0, 1]_max in linear time and space. Its correctness follows from Lemmas 3 and 4. The running time of the first step is Θ(n), since there are O(n) intervals, each containing a constant number of points of S. Each merge operation takes O(1) time and, since we halve the number of intervals in each iteration, there are Θ(n) merges in total. In the same way we can find [0, 1]_min (the interval that minimizes Δ), and so compute the maximum bichromatic discrepancy. ∎
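A minimal sketch of the merging scheme behind Algorithm 1, written recursively rather than bottom-up (the merges performed are the same). Each call returns the block's total color sum together with the three maxima of Lemma 5; only the discrepancy values are returned, though the maximizing intervals can be carried along in the same way:

    def max_delta(chi, lo, hi):
        # Returns (total, best, left, right) for the colors chi[lo:hi]:
        # total = sum of all colors in the block,
        # best  = max Delta over nonempty subintervals of the block,
        # left  = max Delta over subintervals anchored at the left end,
        # right = max Delta over subintervals anchored at the right end.
        if hi - lo == 1:
            c = chi[lo]
            return c, c, c, c
        mid = (lo + hi) // 2
        lt, lb, ll, lr = max_delta(chi, lo, mid)
        rt, rb, rl, rr = max_delta(chi, mid, hi)
        return (lt + rt,
                max(lb, rb, lr + rl),   # Lemma 3: best interval crossing mid
                max(ll, lt + rl),       # Lemma 4: left-anchored
                max(rr, rt + lr))       # Lemma 4 (mirrored): right-anchored

    chi = [+1, -1, +1, +1, -1, +1]
    print(max_delta(chi, 0, len(chi))[1])   # 2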

A corollary of Theorems 1 and 2 is that we can solve the minimizing disagreement problem for arbitrary interval hypotheses in linear time.

2.3. The Dynamic 1-d Case

Algorithm 1 can be easily modified to be dynamic with respect to insertions and deletions of points.

We explicitly construct a binary tree by subdividing the intervals, with the leaves corresponding to intervals that contain a constant number of points of S. Each node corresponds to a region, and in each node we keep the largest bichromatic discrepancy recorded in this region, along with the other two maxima; this information requires constant size per node. The total size is thus linear.

To delete a point, we follow the path down to the leaf that contains the deleted point, and then retrace the path up to the root, recomputing the stored values in each visited node. The previous lemmata show that this can be done in constant time per node. Insertion is similar.

If we use a balanced binary tree (for example a red-black tree), each update costs O(log n) time (where n is the current number of points), plus the time required to keep the tree balanced. Each tree rotation is a local operation, however, and can be performed in constant time, so the total time of each update is O(log n).

Theorem 3. We can recompute the maximum bichromatic discrepancy of a colored one-dimensional point set S after an update, and find an interval that maximizes the bichromatic discrepancy, in O(log n) time and linear space.

Proof. From the discussion above. ∎

2.4. The 2-d Case

In two dimensions the input is a set S of n points in the unit square [0, 1]^2 and a mapping χ : S → ℝ. To make the presentation simpler we assume that all x and y coordinates are distinct; this is not an important restriction, however.

The 2-d case builds on the one-dimensional algorithm. To solve the problem we combine a sweep along the y axis with an application of the dynamic algorithm of §2.3 on the x axis. Let us begin with some simple but important lemmata.

The following lemma (like Lemma 1) shows that we have to search among a finite number of rectangles.

Lemma 6. There exists a rectangle that maximizes Δ such that each of its edges passes through a point of S.

Proof. Let I_max = [(x_min, y_min), (x_max, y_min), (x_max, y_max), (x_min, y_max)] be a rectangle that maximizes Δ. Suppose that it has an edge e that does not pass through a point. We can then freely move e along the axis perpendicular to it until it meets the closest point inside, without changing the discrepancy of the rectangle. ∎

It follows that there are at most O(n^4) candidates to consider in the computation of the maximum discrepancy, and, with a smart data structure ([30], [38]), we can search through this range in O(n^4 log n) time. We use a plane-sweep technique to improve upon this.

Let Y_coord be the set of all y coordinates of points in S. There are O(|Y_coord|^2) = O(n^2) different pairs of values for the y coordinates of I_max. For each such pair we are going to find the rectangle that maximizes Δ among all rectangles with these y coordinates. The following lemma shows that for a given pair of y coordinates the problem can be reduced to a one-dimensional problem.

Lemma 7. The problem of finding the rectangle that maximizes Δ among all the rectangles with given y coordinates is equivalent to the problem of finding the interval that maximizes Δ for a set of one-dimensional points of at most the same cardinality.

Proof. Assume that the fixed y coordinates are y_b, y_t, with y_b < y_t (Fig. 4).

FIG. 4. The y-coordinates of the rectangle are fixed.

Let I_m = [(x_i, y_b), (x_j, y_b), (x_j, y_t), (x_i, y_t)] be the rectangle that maximizes Δ over all x_i, x_j with x_i < x_j. By definition, Δ(I_m) = Σ_{x ∈ S ∩ I_m} χ(x), and S ∩ I_m = {(x_k, y_k) ∈ S : x_k ∈ [x_i, x_j] and y_k ∈ [y_b, y_t]}. Obviously only the points with y coordinates between y_b and y_t have to be considered in the computation of I_m.

Now consider the set S_[y_b, y_t] = {x_k : (x_k, y_k) ∈ S and y_k ∈ [y_b, y_t]} (Fig. 5). Let A_max = [x_l, x_r] be the interval that maximizes Δ for the set S_[y_b, y_t] and the restriction of the original χ to the new set. Then take the rectangle I' defined by A_max and [y_b, y_t] (that is, I' = [(x_l, y_b), (x_r, y_b), (x_r, y_t), (x_l, y_t)]). Since I_m is maximum, we have Δ(I_m) ≥ Δ(I'). But from the construction of S_[y_b, y_t] we have Δ([x_i, x_j]) = Δ(I_m) and Δ(I') = Δ(A_max). Finally, since A_max = [x_l, x_r] is a maximum, Δ([x_i, x_j]) ≤ Δ([x_l, x_r]). It follows that Δ([x_i, x_j]) = Δ([x_l, x_r]). ∎

Lemma 7 shows that we can easily compute the maximum rectangle when the y coordinates are fixed by a pair of points. There are O(n^2) such pairs in total, and we can search through all of them using the dynamic linear algorithm of §2.3.

The following algorithm finds the rectangle that maximizes Δ for an input set S ⊂ [0, 1]^2 and a coloring χ.

Algorithm 2.
1. Compute Y_coord and sort it.
2. For each element y_i of Y_coord do:
(a) Find the set S_i = {(x_j, y_j) ∈ S : y_i ≤ y_j} and sort it on the y coordinates.
(b) While S_i is not empty, do:
(i) Remove from S_i the point (x_j, y_j) with the smallest y coordinate, and insert x_j into the dynamic one-dimensional structure.
(ii) Use the dynamic linear algorithm to find the rectangle that maximizes Δ over all rectangles that have y coordinates between y_i and y_j.
(iii) If the new value is larger than the largest seen so far, record the rectangle.
3. Return the best rectangle found.

Theorem 4. We can compute the maximum bichromatic discrepancy of a set S (|S| = n) with an arbitrary coloring χ, and find an axis-aligned rectangle that maximizes the bichromatic discrepancy, in O(n^2 log n) time and O(n) space.

Proof. The outlined algorithm finds the maximum of Δ; its correctness follows from Lemmas 6 and 7. The algorithm considers all pairs of points, and for each pair finds the rectangle that maximizes Δ while containing the lower point. In the inner loop of the algorithm we use the dynamic algorithm of §2.3 for the left and the right region separately.

FIG. 5. The equivalent one-dimensional problem.

For the running time, we analyze each step. Step 1 takes O(n log n) time, and step 2 is executed O(n) times. Step 2(a) takes O(n log n) time, and the loop of step 2(b) is executed O(n) times. The dynamic linear algorithm takes O(log n) time per insertion and O(log n) time per query; each iteration of the loop performs one insertion (step 2(b)(i)) and two queries (step 2(b)(ii)), so it takes O(log n) time. Steps 2(b)(iii) and 3 take constant time.

The total time is O(n^2 log n). At every moment we only have to maintain the data structure of the dynamic linear algorithm, so the space used is linear. ∎
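A compact illustration of the sweep behind Theorem 4, simplified by solving each induced one-dimensional problem from scratch with a maximum-subarray (Kadane) scan instead of maintaining the dynamic structure of §2.3; this costs O(n^3) rather than O(n^2 log n), but makes the structure of the algorithm plain:

    # Simplified O(n^3) sweep (illustrative names, not the paper's code).
    def max_rect_discrepancy_cubic(points, chi):
        pts = sorted(points, key=lambda p: p[1])       # sweep order: by y
        by_x = sorted(points)                          # fixed x order
        best = 0
        for i in range(len(pts)):                      # bottom boundary
            active = set()
            for j in range(i, len(pts)):               # raise top boundary
                active.add(pts[j])
                colors = [chi[p] for p in by_x if p in active]
                for sign in (+1, -1):                  # handle |Delta|
                    run = 0
                    for c in colors:
                        run = max(sign * c, run + sign * c)
                        best = max(best, run)
        return best

    S   = [(0.1, 0.2), (0.4, 0.4), (0.5, 0.8), (0.9, 0.3)]
    chi = {S[0]: +1, S[1]: +1, S[2]: -1, S[3]: +1}
    print(max_rect_discrepancy_cubic(S, chi))          # 3, as the brute force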

A corollary of Theorems 1 and 4 is that we can solve the minimizing disagreement problem for rectangular hypotheses in O(n^2 log n) time and linear space.

3. COMPUTING THE MAXIMUM NUMERICAL DISCREPANCY

Here we extend the results of Section 2 to the numerical discrepancy problem. Again we consider the low-dimensional cases separately.

3.1. The Static 1-D Case

The input is a set S of n distinct points with coordinates 0 ≤ x_1 < ... < x_n ≤ 1. The family R is the set of intervals in [0, 1]. Following the definition of §1.2, the numerical discrepancy of a given interval is the difference between its length and the fraction of the points of S that it contains.

The problem is to find the interval that maximizes the numerical discrepancy.

We first give some convenient notation. Define the function Count : R → ℕ, where Count([l, r]) is the number of points inside the interval [l, r], endpoints inclusive. We will also apply Count to open intervals; Count((l, r)) gives the number of points in the open interval (l, r). For an interval A, if x_i is the point in A closest to its left border and x_j is the point in A closest to its right border, then Count(A) = j − i + 1. Given this, the numerical discrepancy D of an interval A = [l, r] can be expressed as

D(A) = |(r − l) − Count(A)/n|

This definition of the discrepancy gives a different expression depending on whether (r − l) or Count(A)/n is greater, and to simplify matters we define two functions that operate on intervals. The first function, D_s, computes the discrepancy of an interval when the fraction of the points in it is smaller than the length of the interval. In this case, if the endpoints of the interval are also in S, the discrepancy clearly increases if we consider the equivalent open interval, and so we have

D_s([l, r]) = (r − l) − Count((l, r))/n

The second function, D_l, computes the discrepancy of an interval when the fraction of the points in the interval is larger than its length. In this case all points, endpoints included, are counted, so

D_l([l, r]) = Count([l, r])/n − (r − l)

Clearly, the interval that maximizes the discrepancy must also maximize one of D_s or D_l.

Again it is easy to show that we only have to consider a finite number of intervals.

Lemma 8. The numerical discrepancy of a one-dimensional point set S can be maximized only by an interval with endpoints in S ∪ {0, 1}.

Proof. If an interval has a free endpoint, we can both increase and decrease its length without changing its intersection with S, and one of the two operations has to increase the discrepancy. ∎

The static one-dimensional case is exactly analogous to the bichromatic discrepancy: we can follow the same approach and develop a linear-time algorithm that computes the maximum numerical discrepancy of a one-dimensional point set.

We only state the main lemmata.

Lemma 9. If 0 ≤ l ≤ m ≤ r ≤ 1 and m ∉ S, then D_l([l, r]) = D_l([l, m]) + D_l([m, r]).

Lemma 10. Let m ∈ [l, r], m ∉ S, and assume that [x_i, x_j] maximizes D_l over all subintervals of [l, r] that contain m. Then [x_i, m] maximizes D_l over all such subintervals that have m as their right endpoint, and [m, x_j] maximizes D_l over all such subintervals that have m as their left endpoint.

Lemma 11. Assume that we split [l, r] into [l, m] and [m, r], with m ∉ S. Also assume that [x_i, r] maximizes D_l over all intervals in [m, r] that have r as their right endpoint, and that [x_j, m] maximizes D_l over all intervals that have m as their right endpoint. Then either [x_i, r] or [x_j, r] maximizes D_l over all intervals that have r as their right endpoint.

Theorem 5. We can compute the maximum numerical discrepancy of a sorted point set S on a line, and find an interval that maximizes the numerical discrepancy, in linear time and space.

Proof. Algorithm 1, suitably modified, finds the interval that maximizes D_l. Its correctness follows from Lemmas 10 and 11. The running time of the first step is Θ(n), since there are O(n) intervals of constant size each. Each merge operation takes O(1) time and, since we halve the number of intervals in each iteration, there are Θ(n) merges in total. Similarly we can maximize D_s, and so compute the maximum numerical discrepancy. ∎
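Using Lemma 8, the maximum numerical discrepancy of a one-dimensional set can also be checked by brute force over the candidate endpoints S ∪ {0, 1}; this O(n^2) baseline is a convenient correctness check for the linear-time algorithm:

    from bisect import bisect_left, bisect_right

    # O(n^2) brute force over candidate endpoints S U {0, 1} (Lemma 8).
    def max_numerical_discrepancy_1d(xs):
        n = len(xs)
        cands = sorted(set(xs) | {0.0, 1.0})
        best = 0.0
        for i, l in enumerate(cands):
            for r in cands[i:]:
                closed = bisect_right(xs, r) - bisect_left(xs, l)   # Count([l,r])
                opened = bisect_left(xs, r) - bisect_right(xs, l)   # Count((l,r))
                best = max(best,
                           closed / n - (r - l),                    # D_l
                           (r - l) - max(opened, 0) / n)            # D_s
        return best

    print(max_numerical_discrepancy_1d([0.1, 0.24, 0.4, 0.6, 0.7, 0.8]))  # 0.3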

3.2. The Dynamic 1-D Case

The dynamic case is considerably more difficult, because when we insert or delete a point the cardinality n of S changes. As a consequence, the function D_l changes, and we have to compute new maxima for every region. The following example shows that the insertion of a new point in one region can change the maximal intervals of other regions.

Let S = {0.1, 0.24, 0.4, 0.6, 0.7, 0.8}, and assume we have two regions, A = [0, 0.5] and B = [0.5, 1]. We can see that A_max = [0.1, 0.4], with D_l([0.1, 0.4]) = 3/6 − 0.3 = 0.2 and D_l([0.1, 0.24]) = 2/6 − 0.14 ≈ 0.193.

But if we insert the point 0.9 into B, we have D_l([0.1, 0.4]) = 3/7 − 0.3 ≈ 0.129 and D_l([0.1, 0.24]) = 2/7 − 0.14 ≈ 0.146. Now A_max is [0.1, 0.24] (Fig. 6).

FIG. 6. The insertion of a point can change many maximal intervals.
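The arithmetic of this example can be reproduced directly from the definition of D_l:

    # D_l([l, r]) = Count([l, r])/n - (r - l), applied to the example above.
    def D_l(xs, l, r):
        return sum(l <= x <= r for x in xs) / len(xs) - (r - l)

    S = [0.1, 0.24, 0.4, 0.6, 0.7, 0.8]
    print(D_l(S, 0.1, 0.4), D_l(S, 0.1, 0.24))   # ~0.2    ~0.193
    S.append(0.9)                                # insert a point into B
    print(D_l(S, 0.1, 0.4), D_l(S, 0.1, 0.24))   # ~0.129  ~0.146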

An approach based on separate regions is still valuable, however: when a point is inserted (or deleted), only one region is directly affected. For the rest of the section we assume that n is the maximum cardinality of S over the entire sequence of updates.

First the interval [0, 1] is divided into r regions, each containing O(n/r) points. We do not allow points of S to become endpoints of regions.

Consider such a region A = [a_l, a_r], and assume that the points that lie inside A are x_min, ..., x_max. Recall (§2.2) that A_max is the subinterval that maximizes D_l, A_left is the subinterval that maximizes D_l and shares A's left endpoint, and A_right is the subinterval that maximizes D_l and shares A's right endpoint. In order to compute A_max, A_left and A_right dynamically, we use the techniques of the following theorem by Dobkin and Eppstein.

Theorem [8]. We can insert or delete points from a set S ⊂ [0, 1], and recompute the halfspace discrepancy after each update, in time O(log^2 n) per update and O(n log n) space.

This algorithm allows us to dynamically compute A_left and A_right, and we briefly outline the technique here. To compute A_left we have to find the point x_i that maximizes D_l([a_l, x_i]) = (i − min + 1)/n − (x_i − a_l) over all points in A. This is a linear function of (x_i, i). So we construct the convex hull of the O(n/r) points {(x_i, i) : x_i ∈ A}, and then do a binary search on the convex hull to find the point that maximizes (i − min + 1)/n − (x_i − a_l). If we use a dynamic algorithm to maintain the convex hull structure, such as the Overmars and van Leeuwen algorithm ([35]), which requires O(log^2(n/r)) time per update, we can insert or delete points and find the new maximum of the function in O(log^2(n/r)) time for the update plus O(log(n/r)) time for the binary search. The computation of A_right is symmetrical; there we find the point x_j that maximizes (max − j + 1)/n − (a_r − x_j).

The computation of A_max is a little more complicated. The endpoints of A_max must be points of S, so in fact we want to find the pair of points that maximizes the function D_l([x_i, x_j]) = (j − i + 1)/n − (x_j − x_i) over all pairs of points in A.

One way is to view (j − i + 1)/n − (x_j − x_i) as a linear function of the two-dimensional points (l(i, j), q(i, j)), with l(i, j) = x_j − x_i and q(i, j) = j − i + 1. In other words, the first coordinate is the length of the interval [x_i, x_j], and the second coordinate, the difference of the ranks, is the number of points of S that lie in that interval. To compute A_max we construct the convex hull of the O((n/r)^2) points {(l(i, j), q(i, j)) : x_i, x_j ∈ A, x_i < x_j}, and perform a binary search on it to find the point that maximizes (j − i + 1)/n − (x_j − x_i). The query time is then O(log(n/r)).
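A static sketch of this hull trick: the objective is linear in the lifted points (x_i, i), so its maximum is attained at a vertex of their convex hull. Here the hull is built once with Andrew's monotone chain and scanned linearly; the dynamic algorithm of [35] instead maintains the hull under updates and locates the maximum by binary search:

    def hull(points):
        # Andrew's monotone chain; returns lower + upper hull vertices.
        pts = sorted(points)
        def half(ps):
            h = []
            for p in ps:
                while len(h) >= 2 and (
                    (h[-1][0]-h[-2][0])*(p[1]-h[-2][1])
                    - (h[-1][1]-h[-2][1])*(p[0]-h[-2][0])) <= 0:
                    h.pop()
                h.append(p)
            return h
        return half(pts) + half(pts[::-1])

    def a_left_value(region_xs, a_l, lo_rank, n):
        # Maximize (i - lo_rank + 1)/n - (x - a_l) over the lifted points.
        lifted = [(x, lo_rank + k) for k, x in enumerate(region_xs)]
        return max((i - lo_rank + 1) / n - (x - a_l)
                   for x, i in hull(lifted))

    # Region A = [0.0, 0.5] of the running example, ranks 1..3 of n = 6.
    print(a_left_value([0.1, 0.24, 0.4], 0.0, 1, 6))   # ~0.1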

So for each region A we keep two convex hull structures, using O((n/r)^2 log(n/r)) space. From these we extract the four points of S that define A_max, A_left and A_right. Let T be the set that includes exactly these four points from each region; T contains 4r points, which change with each update. The following lemma shows that we can find the new T in O(r log(n/r) + (n/r)^2 log^2(n/r)) time.

Lemma 12. The set T can be recomputed in O(r log(n/r) + (n/r)^2 log^2(n/r)) time after each update.

Proof. With an update only one region is directly affected. The convex hulls are maintained using a dynamic algorithm ([35]) that requires O(log^2(n/r)) time per update. For the insertion of a new point x_k of S into the region A, we have to make a total of O(n/r) insertions into the convex hull structures, because for each x_i in A a new point (|x_i − x_k|, |i − k|) is defined. In addition, the second coordinate of up to (n/r)^2 convex hull points may change, because the insertion of a new point changes the rank of up to n/r points. The cost of the structure update is then O((n/r)^2 log^2(n/r)).

Now we can find T by doing three binary searches for each region, with a total cost of O(r log(n/r) + (n/r)^2 log^2(n/r)). ∎


The following lemma shows that we can find the interval that maximizes D_l from T.

Lemma 13. An interval in [0, 1] that maximizes D_l must have its endpoints in T.

Proof. If this maximum interval I lies in one region, then its endpoints are in T. So suppose its left and right endpoints are in regions A and B respectively (Fig. 7).

Let AB = A ∪ C ∪ B be the region that starts at the left endpoint of A and ends at the right endpoint of B (C = AB \ A \ B). Then I = (I ∩ A) ∪ C ∪ (I ∩ B), and D_l(I) = D_l(I ∩ A) + D_l(C) + D_l(I ∩ B). Lemma 11 shows that D_l(I ∩ A) is maximized when I ∩ A = A_right, and similarly D_l(I ∩ B) is maximized when I ∩ B = B_left. By the construction of T, it includes the endpoints of A_right and B_left, and the lemma follows. ∎

FIG. 7. The maximum interval I intersects regions A and B, and contains C.

So we can then use the algorithm for the static case and find the maximum in O(|T|) = O(r) time.

Theorem 6. We can insert or delete points from a point set S on a line, recompute the maximum numerical discrepancy after each update, and find an interval that maximizes the numerical discrepancy, in O(n^{2/3} log^2 n) time per update and O(n^{4/3} log n) space.

Proof. We can compute T in O(r log(n/r) + (n/r)^2 log^2(n/r)) time after each update (Lemma 12). Once the new maxima are found, we can use the linear-time algorithm for sorted inputs and find the updated maximum of D_l for S in an additional O(r) time. The same procedure is applied to D_s, and we obtain the maximum numerical discrepancy from the larger of the two maxima. If r = n^{2/3}, the total update cost is O(n^{2/3} log^2 n).

This cost does not include the time required to split a region if after an insertion it contains more than 2n^{1/3} points, or to merge two consecutive regions into one if after a deletion they both contain fewer than n^{1/3} points. Each such operation costs O(n^{2/3} log^2 n), because it requires a complete rebuild of the convex hull data structures; so the maintenance of the regions does not raise the asymptotic complexity of an update.

The space requirement for each region is O((n/r)^2 log(n/r)) = O(n^{2/3} log n), so the total space used is O(n^{4/3} log n). ∎

Finally we note that the dynamic algorithm also works when the weight of each point is not uniformly 1 but is assigned by a weight function w : S → ℝ^+. For this we have to change the index of every point: for the i-th point, instead of using i as its index, we use the sum of its own weight and the weights of the i − 1 points to its left.

3.3. The 2-D Case

In two dimensions the input set S is a set of n points in the unit square [0, 1]². We assume that the points are distinct and that no two of them have the same y coordinate. This assumption makes the derivation easier but is not essential for the proof of correctness or the running time analysis of the algorithm. As in §3.2, we give some definitions first. The function In : R → ℕ gives the number of points inside a rectangle, and the function Area : R → ℝ⁺ gives the area of a rectangle. The numerical discrepancy D of a rectangle R ⊆ [0, 1]² is given by D(R) = |Area(R) − In(R)/n|. The objective is to find the rectangle that maximizes D.

Since the definition of D is somewhat awkward, we also define the following two functions: D_i(R) = Area(R) − In(R)/n and D_o(R) = In(R)/n − Area(R).

The rectangle that maximizes D must maximize at least one of D_o and D_i. In this section we give an algorithm that finds the rectangle in R that maximizes D_o, but the same approach can also find the maximum of D_i.

Our approach to the problem is based on the algorithm we developed for the bichromatic discrepancy. Again it is clear that only O(n⁴) rectangles have to be considered, as the following lemma shows.

Lemma 14. Let I_max = [(x_min, y_min), (x_max, y_min), (x_max, y_max), (x_min, y_max)] be the rectangle that maximizes D_o. Then each edge of I_max must pass through a point of S.

Proof. Suppose that this is not the case for some edge e. We can then freely move e along the axis perpendicular to it without changing the number of points that are inside I_max. One direction of movement increases the area of the rectangle and the other decreases it, so a movement in the second direction results in a rectangle I' with D_o(I') > D_o(I_max). Furthermore, the direction that decreases the area cannot be blocked, even if e lies on the boundary of the unit square. ∎
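To make the resulting search space concrete, here is a brute-force sketch (ours, for testing only) that enumerates candidate rectangles whose edges pass through point coordinates; 0 and 1 are added as a safety net so that rectangles clipped by the unit square, which can maximize the absolute discrepancy D through D_i, are covered as well:

```python
from itertools import combinations

def max_numerical_discrepancy_bruteforce(points):
    """Exhaustively check the O(n^4) candidate rectangles of Lemma 14.
    O(n^5) time overall -- a correctness check, not the paper's algorithm."""
    n = len(points)
    xs = sorted({x for x, _ in points} | {0.0, 1.0})
    ys = sorted({y for _, y in points} | {0.0, 1.0})
    best, best_rect = 0.0, None
    for x1, x2 in combinations(xs, 2):
        for y1, y2 in combinations(ys, 2):
            area = (x2 - x1) * (y2 - y1)
            inside = sum(1 for x, y in points
                         if x1 <= x <= x2 and y1 <= y <= y2)
            d = abs(area - inside / n)
            if d > best:
                best, best_rect = d, (x1, y1, x2, y2)
    return best, best_rect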

A sweeping technique that searches through all O(n²) possible pairs of y coordinates for the best rectangle is again the technique of choice. The following lemma shows that the technique we used in §2.4 carries over.

Lemma 15. The problem of finding the rectangle that maximizes D_o among all the rectangles with given y coordinates is equivalent to the problem of finding the interval that maximizes D_l for a set of one-dimensional points.

Proof. Assume that the fixed y coordinates are y_b and y_t, with y_b < y_t. We want to find the rectangle I = [(x_i, y_b), (x_j, y_b), (x_j, y_t), (x_i, y_t)] that maximizes D_o over all


i, j (x_i < x_j). By definition, D_o(I) = In(I)/n − Area(I), and Area(I) = (x_j − x_i)(y_t − y_b). Since (y_t − y_b) = Δy is a positive constant, we can equivalently maximize the function

D'_o(I) = In(I)/(n Δy) − (x_j − x_i).

We also know that In(I) = |{(x_k, y_k) | x_k ∈ [x_i, x_j] and y_k ∈ [y_b, y_t]}|, which shows that to find the maximum rectangle with y coordinates y_b and y_t we only have to consider the points with y coordinates between y_b and y_t.

Again we take the set S_{[y_b, y_t]} = {x_k | (x_k, y_k) ∈ S and y_k ∈ [y_b, y_t]}. From the construction of S_{[y_b, y_t]} it follows that Count([x_i, x_j]) = In([(x_i, y_b), (x_j, y_b), (x_j, y_t), (x_i, y_t)]). Recall that the definition of D_l is D_l([x_i, x_j]) = Count([x_i, x_j])/|S_{[y_b, y_t]}| − (x_j − x_i) (§3.1). We can, however, modify the algorithm we gave in §3.1 to use the constant n Δy instead of |S_{[y_b, y_t]}|. The modified algorithm maximizes the function D_{l,[y_b, y_t]}([x_l, x_r]) = Count([x_l, x_r])/(n Δy) − (x_r − x_l). Let us assume that the maximum interval is [x_l, x_r]. From it we get the rectangle I_m = [(x_l, y_b), (x_r, y_b), (x_r, y_t), (x_l, y_t)], which maximizes D'_o and consequently D_o. ∎
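A direct quadratic implementation of this reduction might look as follows (names are ours); the paper's algorithm replaces the double loop with the linear-time routine of §3.1:

```python
def best_interval_fixed_ys(points, n, yb, yt):
    """For fixed yb < yt, maximize D'_o = Count([xi, xj])/(n*dy) - (xj - xi)
    over intervals whose endpoints are x-coordinates of points with
    y in [yb, yt] (Lemma 15).  Quadratic sketch for illustration."""
    dy = yt - yb
    xs = sorted(x for x, y in points if yb <= y <= yt)
    best, best_iv = float("-inf"), None
    for i in range(len(xs)):
        for j in range(i, len(xs)):
            value = (j - i + 1) / (n * dy) - (xs[j] - xs[i])
            if value > best:
                best, best_iv = value, (xs[i], xs[j])
    return best, best_iv   # D_o of the rectangle is dy * best
```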

Lemma 15 shows that we can easily compute the maximum rectangle when the y coordinates are fixed by a pair of points. However, the performance of the dynamic one-dimensional algorithm does not allow for a fast two-dimensional algorithm. The following two lemmata provide an additional idea that leads to a faster algorithm.

Lemma 16. Assume that for two given points (x_b, y_b) and (x_t, y_t) of S (with 0 ≤ y_b < y_t ≤ 1), the rectangle I maximizes D_o over all the rectangles with these y coordinates. Then, for I to maximize D_o over all rectangles, it has to include the point (x_b, y_b).

Proof. By the initial assumption, no two points in S have the same y coordinate. If (x_b, y_b) ∉ I, then the lower horizontal edge of I does not pass through a point and, by Lemma 14, I cannot maximize D_o. ∎

It follows from Lemma 16 that we do not have to find the best rectangle for every pair of points. Instead, for each pair, we find the best rectangle that contains the lower point of the pair. For a given pair of points the latter maximum can be smaller than the former, but the maximum rectangle overall has to include both points of the pair.

The following lemma shows that this problem is equivalent to a simpler one-dimensional problem.

Lemma 17. Suppose that we are given a pair of points that define the y coordinates, and we want to find the rectangle with these y coordinates that maximizes D_o and passes through the lower point. This problem is equivalent to the problem of finding the halfspace discrepancy for two sets of one-dimensional points.

FIG. 8. The y-coordinates of the rectangle are fixed.

Proof. Assume that the given points are (x_b, y_b) and (x_t, y_t), with y_b < y_t. We want to find the rectangle I_m = [(x_i, y_b), (x_j, y_b), (x_j, y_t), (x_i, y_t)], with x_i ≤ x_b ≤ x_j, that maximizes D_o (Figs. 8 and 9). Following the proof of Lemma 15, we take the set S_{[y_b, y_t]} = {x_k | (x_k, y_k) ∈ S and y_k ∈ [y_b, y_t]}. We also know that we can equivalently maximize the function Count([x_i, x_j])/(n Δy) − (x_j − x_i), over all x_i ≤ x_b ≤ x_j.

We divide [0, 1] into three regions, [0, r_1], [r_1, r_2], [r_2, 1], so that r_1 < x_b < r_2 and, for every x_i ∈ S_{[y_b, y_t]}, either x_i < r_1 or x_i > r_2. In other words, only x_b is in the region [r_1, r_2]. This partitioning divides S_{[y_b, y_t]} into three disjoint sets: the points to the left of x_b (i.e., S_{[y_b, y_t]} ∩ [0, r_1]), {x_b}, and the points to the right (i.e., S_{[y_b, y_t]} ∩ [r_2, 1]). We sort the two sets and rename the points according to their rank after the sorting: S_{[y_b, y_t],l} = {z_1, ..., z_{|S_{[y_b, y_t]} ∩ [0, r_1]|}} and S_{[y_b, y_t],r} = {z_1, ..., z_{|S_{[y_b, y_t]} ∩ [r_2, 1]|}}.

Suppose that the maximum interval is [x_l, x_r]. Since it must contain x_b, it can have at most one endpoint in each of [0, r_1] and [r_2, 1]. From Lemma 14 it follows that both x_l and x_r must be points of S_{[y_b, y_t]}. So x_r is either in S_{[y_b, y_t],r} or equal to x_b. Assume that the first is true. If we apply Lemma 10, we see that [r_2, x_r] has to maximize the function Count([r_2, x_j])/(n Δy) − (x_j − r_2), over all [r_2, x_j] with x_j ∈ S_{[y_b, y_t]} ∩ [r_2, 1]. Equivalently, it must maximize the function i/(n Δy) − (z_i − r_2), over all z_i ∈ S_{[y_b, y_t],r}. This is a linear function of the two-dimensional points {(z_i, i) | z_i ∈ S_{[y_b, y_t],r}}. The maximum of this function can be computed with the technique of [8] that we gave in §3.2; the only modification is that we maximize a different linear function.

FIG. 9. The two equivalent one-dimensional problems.


FIG. 10. The two-dimensional algorithm: (a) the input point set; (b) a pair of points is chosen; (c) the 1D problem is shown; (d) the maxima are computed from the convex hulls of the lifted points; (e) the best rectangles (that include the lower point and have these y coordinates) are shown; (f) the rectangle that maximizes the numerical discrepancy for this point set.

This gives us the only two points in S_{[y_b, y_t]} that can form the right endpoint of the maximum interval. Similarly we can find the two possible choices for the left endpoint, and from them we find the maximum interval in constant time. ∎
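The lifting step used in this proof can be sketched as follows (names ours); a plain linear scan stands in here for the dynamic convex hull query of [8], [35], which exploits the fact that the maximizer of a linear function over a point set always lies on its convex hull:

```python
def argmax_linear(lifted, a, b):
    """Return the point (u, v) maximizing a*u + b*v over a point set.
    The maximizer lies on the convex hull, so a dynamic hull ([35])
    answers this in O(log^2 n); the scan below is for illustration."""
    return max(lifted, key=lambda p: a * p[0] + b * p[1])

# For the right region: lifted points (z_i, i) and the objective
# i/(n*dy) - (z_i - r2), i.e. a = -1.0 and b = 1.0/(n*dy); the
# additive constant r2 does not change the argmax.
```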

The following algorithm finds the rectangle that maximizes D_o for the input set S ⊆ [0, 1]². The algorithm uses a modified version of the dynamic linear halfspace discrepancy algorithm given in [8] as a subroutine (Fig. 10).

Algorithm 3. 1. Compute Y_coord, the set of y coordinates of the points, and sort it.

2. For each element y_i of Y_coord do:

(a) Find the set S_i = {(x_j, y_j) | (x_j, y_j) ∈ S and y_i ≤ y_j} and sort it by y coordinate.

(b) Partition [0, 1] into three regions so that the middle one contains only x_i among all the points of S_i.

(c) While S_i is not empty, do:

(i) Remove from S_i the point (x_j, y_j) with the smallest y coordinate, and insert x_j into the appropriate (left or right) region.

(ii) Use the modified linear algorithm to find the rectangle that maximizes D_o over all rectangles that have y coordinates between y_i and y_j and include the point (x_i, y_i).

(iii) If the new value is larger than the largest seen so far, record the rectangle.

3. Return the best rectangle found.

Theorem 7. We can compute the maximum numerical

discrepancy of a point set S ⊆ [0, 1]² with n points, and find an axis-aligned rectangle that maximizes the numerical discrepancy, in O(n² log² n) time and O(n log n) space.

Proof. The correctness of the algorithm follows from Lemmas 15, 16, and 17. The algorithm considers all pairs of points. For each pair it finds the rectangle that maximizes D_o while containing the lower point. In the inner loop of the algorithm we use the modified dynamic algorithm of [8] for the left and the right regions separately. We find the maximum of D_i in a similar way, and from there the maximum discrepancy.

For the running time, we analyze each step. Step 1 takes O(n log n) time, and step 2 is executed O(n) times. Step 2.a also takes O(n log n) time, and step 2.b takes linear time. The inner loop, step 2.c, is executed O(|S_i|) = O(n) times. We only perform insertions to the convex hulls, and we can maintain them at a cost of O(log² n) time per insertion ([35]). Consequently, the dynamic linear algorithm takes O(log² n) time for insertions and O(log² n) time for queries. In step 2.c.ii we do one insertion and two


queries, so this step takes O(log² n) time. Steps 2.c.i and 2.c.iii are executed in constant time. Finally, step 3 also takes constant time.

The total time is O(n² log² n). At any point we have to maintain the data structures for two dynamic convex hulls and two sorted arrays of linear size. The space used is therefore O(n log n). ∎

So far we have implicitly assumed that each point has weight 1. Algorithm 3 can also be used for points with arbitrary non-negative weights. Since all discrepancy computations are done in the subroutine that calls the one-dimensional algorithm, only this part has to be modified.
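Building on the sketch after Lemma 15, the overall sweep of Algorithm 3 can be outlined as follows; this simplified version (names ours) shows the structure of the search over pairs of y coordinates, not the stated O(n² log² n) bound:

```python
def max_Do_bruteforce(points):
    """Skeleton of the two-dimensional search, with the fast subroutine
    replaced by best_interval_fixed_ys defined earlier."""
    n = len(points)
    best, best_rect = float("-inf"), None
    ys = sorted({y for _, y in points})
    for a in range(len(ys)):
        for b in range(a + 1, len(ys)):
            yb, yt = ys[a], ys[b]
            value, iv = best_interval_fixed_ys(points, n, yb, yt)
            if iv is not None and value * (yt - yb) > best:
                best = value * (yt - yb)          # convert D'_o back to D_o
                best_rect = (iv[0], yb, iv[1], yt)
    return best, best_rect
```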

4. EXTENSIONS

In this section we give an approximate 2-D algorithm for computing the maximum bichromatic discrepancy and the maximum numerical discrepancy for axis-oriented rectangles. We also examine different sets of geometric regions, in particular halfspaces and unions of disjoint rectangles.

4.1. An Approximation Algorithm

It is in many cases desirable (especially when the input sets are very large) to trade off some of the quality of the solution to reduce the computation time. As we have already seen in the introduction, it is sufficient to find hypotheses with almost optimal empirical error. Approximation algorithms are also useful from the point of view of applied machine learning.

In this section we develop algorithms that find an approximate solution to the two-dimensional problems. The basic idea is to divide the unit square into a number of rectangles, replace the points in each rectangle with a single point of the appropriate weight, and use the algorithms of Sections 2 and 3 to compute the maximum discrepancy of the new, smaller point set S_r and a new weight function w : S_r → ℝ.

Let us first consider the bichromatic discrepancy. For a given S, χ, and approximation factor 1/r, the first step is the partitioning of the unit square into O(r²) regions. We partition the unit square recursively, dividing each existing rectangle with more than 3n/r² points into two rectangles so that each of the new rectangles contains at most half of the original points plus n/r² points. We divide along the x and the y axes alternately. If we cannot do that for some rectangle, then at least n/r² of the points of this rectangle must have the same x or the same y coordinate. In this case we create a 0-width rectangle, and we subdivide it along its length so that each resulting 0-width rectangle contains Θ(n/r²) points.

Lemma 18. This process is completed in O(n log n) time, and in the end we have Θ(r²) rectangles, each containing Θ(n/r²) points.

Proof. The running time comes from the fact that we first have to sort the points by x and y coordinates. Since after each split the number of points in each rectangle is approximately halved, after 2 log r iterations each rectangle has at most n/2^{2 log r} + (n/r²)(1 + 1/2 + ⋯ + 1/2^{2 log r − 1}) = O(n/r²) points. Finally, since we stop subdividing a rectangle when it has Θ(n/r²) points, we end up with Θ(r²) rectangles. ∎

The next step, after the partition of the unit square, is the reduction of the size of the input set from n to O(r²). For each rectangle R, we compute the weight of the points that lie in it, w(R) = Σ_{x ∈ S ∩ R} χ(x), and replace all the points with a point at the lower left corner of the rectangle that has weight w(R). If a point of S lies on a boundary between two rectangles, we assign it to only one rectangle, ensuring that the sum of all the weights is equal to Σ_{x ∈ S} χ(x). This step also takes O(n log r) time.
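For concreteness, a simplified version of this compression step, in which a uniform r × r grid stands in for the adaptive partition described above (all names are ours):

```python
from collections import defaultdict

def compress(points, chi, r):
    """Accumulate the colors chi(x) of all points in a cell onto the
    cell's lower left corner; returns the reduced weighted point set."""
    cell_weight = defaultdict(float)
    for (x, y) in points:
        i = min(int(x * r), r - 1)      # column index in [0, r-1]
        j = min(int(y * r), r - 1)      # row index in [0, r-1]
        cell_weight[(i / r, j / r)] += chi[(x, y)]
    return list(cell_weight.items())    # [(corner, accumulated weight)]
```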

Lemma 19. After the partition of the unit square into rectangles that contain Θ(n/r²) points, a line segment parallel to one of the axes can intersect Θ(r) rectangles.

Proof. Assume the line is horizontal; the vertical case is symmetric. From the construction of the partition, the number of non-0-width rectangles that intersect a fixed horizontal line doubles every second iteration, so that line intersects Θ(2^{log r}) = Θ(r) of the initial rectangles. For each one, it might intersect at most one vertical 0-width rectangle. A horizontal line segment can contain many horizontal 0-width rectangles, but can intersect at most two, one with each endpoint. ∎

Theorem 8. Given a point set S and a coloring χ, we can compute in O(n log n + r⁴ log r) time an approximation of the maximum bichromatic discrepancy within O(n/r) of the optimum.

Proof. We apply Algorithm 2, suitably modified to work with arbitrary weights, to the set S_r. Assume that the rectangle I_m maximizes Δ for S and χ. From the construction of S_r, each edge of I_m can intersect at most Θ(r) rectangles. Let S_r ∩ I_m be the set of points of S_r that are inside I_m, and take I' to be the minimum rectangle that contains all of them. The difference between Σ_{x ∈ S ∩ I_m} χ(x) and Σ_{x ∈ S_r ∩ I'} w(x) is at most the weight of the rectangles intersected by I_m. There are Θ(r) of them, and each has weight Θ(n/r²), so Δ_{S_r, w}(I') differs by at most Θ(n/r) from Δ_{S, χ}(I_m). Similarly, if the rectangle I'_m maximizes the discrepancy for S_r and w, there exists a rectangle I with Δ_{S, χ}(I) that differs by at most Θ(n/r) from Δ_{S_r, w}(I'_m). Therefore |Max Δ(S, χ, R) − Max Δ(S_r, w, R)| = O(n/r). ∎

When we consider the numerical discrepancy, we have to add another step to the construction of the set S_r, to account for the area of the rectangles. After we do the subdivision, we have O(r²) rectangles, and the weight of each is at most O(n/r²). If the area of any of these rectangles is larger than 4r⁻², we divide it in half along its shortest edge, and continue to do so until the area of all rectangles is at most 4r⁻².


The whole operation is performed in O(n log n) time and, as the following lemma shows, we end up with Θ(r²) rectangles.

Lemma 20. After this process, the number of the rectangles that partition the unit square is Θ(r²).

Proof. The number of the initial rectangles is Θ(r²). Every rectangle that is produced by the breaking up of a large initial one has an area of at least r⁻², and therefore there are at most r² additional rectangles produced. ∎

Lemma 21. The discrepancy of each one of these rectangles is O(r⁻²).

Proof. The discrepancy of any such rectangle R is |Area(R) − In(R)/n|. But we have 0 ≤ Area(R) ≤ 4r⁻² for every rectangle. We also have 0 ≤ In(R) ≤ n/r² ⟹ 0 ≤ In(R)/n ≤ r⁻². It follows that |Area(R) − In(R)/n| ≤ 4r⁻². ∎

After the subdivision, we can reduce the size of the input set from n to O(r²). For each rectangle R, we find the number In(R) of points that lie inside it, and replace them with a point at the lower left corner of the rectangle that has weight In(R) (Fig. 11). Again we have to make sure to count each point only once, so that the sum of all the weights is n. This step takes O(n log r) time.

The final step is the application of the two-dimensional algorithm to the new point set. As we noted in the previous sections, the algorithm can be modified to run on weighted points without affecting its asymptotic performance, so this step takes O(r⁴ log² r) time.

Lemma 22. After the partition of the unit square into rectangles with area O(r⁻²), each containing Θ(n/r²) points, a line parallel to one of the axes can intersect Θ(r) rectangles.

FIG. 11. A sample input set (square points), the subdivision of the unit square into rectangles with bounded numerical discrepancy (here r = 2), and the input to the approximation algorithm (round points; the darker points have higher weight).

Proof. Assume the line is horizontal; the vertical case is symmetric. From the construction of the partition, the number of rectangles that intersect a fixed horizontal line doubles every second iteration, so that line intersects Θ(2^{log r}) = Θ(r) of the initial rectangles. Some of these are replaced by a number of smaller rectangles, which are now intersected by the horizontal line. There are two kinds of these rectangles, depending on when their intersected edges were created.

The rectangles whose two intersected edges belong to the same initial rectangle cannot number more than Θ(r), because each one takes the place of one initial rectangle.

There are also the rectangles for which at least one of the intersected edges was created during the partitioning of the large initial rectangles. But such edges divide the largest dimension of the initial rectangles, so each of these rectangles is at least √2/(2r) wide, and there cannot be more than √2 r of them intersecting the horizontal line.

The total number of intersected rectangles is then between r and (1 + √2) r. ∎

Theorem 9. Given a point set S ⊆ [0, 1]², we can compute in O(n log n + r⁴ log² r) time an approximation of the maximum numerical discrepancy within O(1/r) of the optimum.

Proof. This follows from the discussion above and the proof of Theorem 8. ∎

We note here that both of these algorithms can be easily modified to work for point sets with arbitrary weights (for the bichromatic discrepancy) or arbitrary positive weights (for the numerical discrepancy).

4.2. Higher Dimensions

Both Algorithm 2 and Algorithm 3 can be easily extended to handle d dimensions. The running times are O(n^{2d−2} log n) for the bichromatic discrepancy and O(n^{2d−2} log² n) for the numerical discrepancy, and the space requirement is linear.

The main idea of the extension is to project the d-dimensional points to 2 dimensions. Clearly the box with the maximum (numerical or bichromatic) discrepancy must have a point on each hyperplane of its boundary. So the algorithm examines every possible pair of points as a pair of boundaries in each of the first d−2 dimensions. For each of the O(n^{2(d−2)}) combinations of boundaries, it projects all points that are inside the boundaries onto the remaining 2 dimensions, and applies the two-dimensional algorithm to the set of the projections.
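A sketch of this projection loop, under our own naming, where solve_2d is a stand-in for the planar algorithm (Algorithm 2 or 3) and boundary values are drawn from point coordinates only:

```python
from itertools import combinations, product

def max_discrepancy_dD(points, solve_2d):
    """Fix a pair of boundary values in each of the first d-2 coordinates,
    keep the points between them, and hand the last two coordinates of
    those points to the planar algorithm solve_2d."""
    d = len(points[0])
    axis_pairs = [list(combinations(sorted({p[k] for p in points}), 2))
                  for k in range(d - 2)]
    best = float("-inf")
    for bounds in product(*axis_pairs):
        inside = [p for p in points
                  if all(lo <= p[k] <= hi
                         for k, (lo, hi) in enumerate(bounds))]
        if inside:
            best = max(best, solve_2d([(p[-2], p[-1]) for p in inside]))
    return best
```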

4.3. Union of Two Rectangles

Here we consider a natural extension of the set of axis-aligned rectangles: the set R₂ of unions of two disjoint axis-aligned rectangles.


Theorem 10. We can compute the range in R₂ that maximizes the bichromatic discrepancy of a given point set S with an arbitrary given coloring χ in O(n² log n) time. We can also compute the range in R₂ that maximizes the numerical discrepancy of a given point set S in O(n² log² n) time.

Proof. Let us consider the case of the bichromatic discrepancy first; the other case is similar. The two rectangles of the optimum range are axis-aligned and disjoint, and are therefore separated by either a horizontal or a vertical line. Consider the case of the horizontal line. There are n possible positions for such a line, the y coordinates of the input points. For each such line we compute the optimum rectangle above it and the optimum rectangle below it. We can find the best rectangle for all n lines with one run of Algorithm 2: each time we find a new optimum rectangle, we record it at the line that passes right through the top of the rectangle. After we have found the optimum rectangle that touches each line, we can find in linear time the optimum rectangle that touches or lies below each line. Similarly we find the optimum rectangle that lies above each line. Then we form the n pairs and find the optimum one. The case of vertical lines is the same, but we have to modify the algorithm to perform a horizontal sweep. ∎
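The pairing step can be illustrated as follows, assuming the signed form of the bichromatic discrepancy so that the values of the two disjoint rectangles add; the inputs are hypothetical per-line optima as produced by a run of Algorithm 2:

```python
def best_union_of_two(line_vals_below, line_vals_above):
    """line_vals_below[i]: best signed discrepancy of a rectangle whose
    top edge touches candidate line i; line_vals_above[i]: best for a
    bottom edge on line i.  Prefix/suffix maxima pair all lines in
    linear time."""
    n = len(line_vals_below)
    below_up_to, run = [], float("-inf")
    for v in line_vals_below:            # best at or below each line
        run = max(run, v)
        below_up_to.append(run)
    above_from = [float("-inf")] * n     # best at or above each line
    run = float("-inf")
    for i in range(n - 1, -1, -1):
        run = max(run, line_vals_above[i])
        above_from[i] = run
    return max(b + a for b, a in zip(below_up_to, above_from))
```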

4.4. Union of K Intervals

Let us now consider an extension of the set of intervals in [0, 1]: the set of unions of K disjoint intervals in [0, 1]. Let us call this set R_{2K}. K is assumed to be a small constant. The algorithms we developed in Section 2 are easily modified to handle the new set of hypotheses.

Theorem 11. We can compute the union of K disjoint intervals that maximizes the bichromatic discrepancy of a given point set S (S ⊆ [0, 1], |S| = n) with an arbitrary given coloring χ in O(K²n + n log n) time. We can also compute the new union of K disjoint intervals that maximizes the bichromatic discrepancy after inserting or deleting a point in O(K² log n) time.

Proof. The proof follows the proofs of Theorems 2 and 3. Let us assume that the optimal union of K intervals for a given set S and weight function χ is M = [x₁, x₂] ∪ ⋯ ∪ [x_{2K−1}, x_{2K}], with 0 ≤ x₁ < x₂ < ⋯ < x_{2K−1} < x_{2K} ≤ 1.

Let us now split the interval [0, 1] into the two subintervals [0, m] and [m, 1].

If m is inside one of the intervals of M, let that interval be [x_{2i−1}, x_{2i}]; i has to be between 1 and K. We define

[0, m]_{right,i} = [x₁, x₂] ∪ ⋯ ∪ [x_{2i−1}, m]

and

[m, 1]_{left,K−i+1} = [m, x_{2i}] ∪ ⋯ ∪ [x_{2K−1}, x_{2K}]

[0, m]_{right,i} must maximize the discrepancy over all unions of i intervals in [0, m] that have their rightmost endpoint at m, and similarly [m, 1]_{left,K−i+1} must maximize the discrepancy over all unions of K−i+1 intervals in [m, 1] that have their leftmost endpoint at m.

If m is between two intervals of M, let us assume these two intervals are the j-th and the (j+1)-th. Since m can also be to the left of the leftmost interval of M or to the right of the rightmost interval, j has to be between 0 and K. We define

[0, m]_{max,j} = [x₁, x₂] ∪ ⋯ ∪ [x_{2j−1}, x_{2j}]

and

[m, 1]_{max,K−j} = [x_{2j+1}, x_{2j+2}] ∪ ⋯ ∪ [x_{2K−1}, x_{2K}]

[0, m]_{max,j} must maximize the discrepancy over all unions of j intervals in [0, m], and similarly [m, 1]_{max,K−j} must maximize the discrepancy over all unions of K−j intervals in [m, 1].

It follows that there are 2K+1 ways in which m can split the optimal union M. So, if we know [0, m]_{right,i}, [0, m]_{left,i}, [0, m]_{max,i} and [m, 1]_{right,i}, [m, 1]_{left,i}, [m, 1]_{max,i} for each i between 1 and K, we can compute M = [0, 1]_{max,K} in O(K) time.

We compute M = [0, 1]_{max,K} using a binary tree of regions. For each region R we have to compute 3K optimal unions of intervals R_{right,i}, R_{left,i}, and R_{max,i} (for 1 ≤ i ≤ K). Each can be computed in O(K) time from the optimal unions of R's two subregions. The tree can be built in O(n log n + K²n) time. By keeping the tree balanced, only O(log n) regions are affected when we insert or delete a point, so the running time per update is O(K² log n). ∎

4.5. Halfspaces

Dobkin and Eppstein [8] looked at the problem of computing the maximum numerical discrepancy for the set of halfspaces. They gave an algorithm that finds the hyperplane that maximizes the numerical discrepancy of a point set S ⊆ [0, 1]^d (|S| = n) in O(n^d) time. We note here that this algorithm can be easily modified to solve the corresponding minimizing disagreement problem for hyperplane hypotheses in the same time bound.

4.6. Weighted Area Functions

Algorithm 3 can be used to find sampling patterns that approximate measures other than the Euclidean area. Such point sets can be useful in many cases. For example, we may know that some parts of an image have more detail, so we can sample those parts at a higher rate. Also, point sets with more weight in the middle can be used for low-pass filtering. In a particular case that we


FIG. 12. Sampling sets with small maximum numerical discrepancy. The rectangles shown maximize the numerical discrepancy: (a) Euclidean area; (b) weighted area.

examined (Fig. 12b), the area of a rectangle A was given by the function

WArea(A) = ∫_A sinc(x − 0.5) sinc(y − 0.5) dx dy,

where sinc(x) = sin(x)/x.
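Such a weighted area is easy to evaluate numerically; a minimal sketch using a midpoint rule (the grid resolution m and the function name are our choices, standing in for whatever quadrature an implementation would use):

```python
import math

def warea(x1, y1, x2, y2, m=200):
    """Approximate the weighted area of [x1,x2] x [y1,y2] under the
    density sinc(x-0.5)*sinc(y-0.5) on an m x m midpoint grid."""
    sinc = lambda t: math.sin(t) / t if t != 0.0 else 1.0
    hx, hy = (x2 - x1) / m, (y2 - y1) / m
    total = 0.0
    for i in range(m):
        x = x1 + (i + 0.5) * hx
        for j in range(m):
            y = y1 + (j + 0.5) * hy
            total += sinc(x - 0.5) * sinc(y - 0.5)
    return total * hx * hy
```

Since the integrand is separable, one could also compute the two one-dimensional integrals and multiply them, which is faster for fine grids.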

CONCLUSIONS AND PROBLEMS

In this paper we examine two closely related algorithmic problems that arise in different areas of computer science.

The algorithms we present for computing the maximum discrepancy have direct applications in machine learning and computer graphics, and their near-quadratic running time allows them to be useful with the large data sets common in many real-life problems. We also implemented a simple version of the two-dimensional algorithm to compute the maximum numerical discrepancy, and used it to find sampling patterns with very low rectangle numerical discrepancy for ray tracing (Fig. 10, Fig. 12a). We use a probabilistic procedure that starts with a good pattern of low maximum discrepancy (e.g., Zaremba points) and randomly picks and replaces points so that the maximum numerical discrepancy decreases. This implementation provided the basis for a visualization of the way the algorithm works ([10]). For this purpose we used GASP ([12]), a system for scientific visualization. Figures 10 and 12 are produced from our visualization.

Finally, we present a number of interesting open questions related to discrepancy.

1. What is the lower bound for an algorithm that computes the maximum bichromatic or numerical discrepancy for axis-aligned rectangles in the d-dimensional case (for d > 1)? We conjecture that Ω(n²) is a lower bound for both problems in the 2-D case.

2. A related problem is that of finding a fast dynamic algorithm for the d-dimensional case (d > 1). None is known so far that asymptotically improves on the simple approach of rerunning the static algorithm after each update.

3. Extend these algorithms to arbitrarily oriented boxes and to different families of regions. One problem here is that the search space we have to explore increases with the complexity of the regions; for example, there are O(n⁵) possible solutions when we consider rectangles with arbitrary orientation. A fuller treatment of this subject will appear in [14].

4. There is also the problem of finding better approximation algorithms, for both the static and the dynamic cases.

5. Finally, there are problems in different areas that can be recast as variants of discrepancy problems. For example, an algorithm by Goldberg ([13]) that orients polygonal parts has to solve the following problem: given a set of points in the unit interval and a number k, find the smallest interval that contains k points. This is a discrepancy problem in disguise: stated differently, it asks for the interval of minimum discrepancy that contains a given number of points. Advanced algorithms and related data structures developed in discrepancy theory could offer solutions for this and similar problems.
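For this particular subproblem a direct solution is already simple; the sketch below (ours) uses the observation that a smallest interval containing k points must have points at both of its endpoints:

```python
def smallest_interval_with_k(points, k):
    """Slide a window of k consecutive sorted points; O(n log n) overall.
    Assumes 1 <= k <= len(points)."""
    xs = sorted(points)
    best = min(range(len(xs) - k + 1),
               key=lambda i: xs[i + k - 1] - xs[i])
    return xs[best], xs[best + k - 1]
```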

ACKNOWLEDGMENT

The authors would like to thank Bernard Chazelle and David Haussler for helpful communications regarding the research for this paper, and two anonymous referees for their suggestions.


REFERENCES

1. D. Angluin and P. Laird, Learning from noisy examples, Machine Learning 2 (1988), 343–370.

2. P. Auer, R. Holte, and W. Maass, Theory and applications of agnostic PAC-learning with small decision trees, in "Proc. 12th International Machine Learning Conference, 1995," pp. 21–29.

3. J. Beck and W. W. L. Chen, "Irregularities of Distribution," Cambridge Univ. Press, Cambridge, 1987.

4. H. Bronnimann, B. Chazelle, and J. Matousek, Product range spaces, sensitive sampling, and derandomization, in "Proc. 34th Ann. IEEE Symposium on Foundations of Computer Science, 1993," pp. 143–155.

5. W. Buntine and T. Niblett, A further comparison of splitting rules for decision-tree induction, Machine Learning 8 (1992), 75–82.

6. B. Chazelle, Geometric discrepancy revisited, in "34th Symposium on Foundations of Computer Science, 1993."

7. M. de Berg, "Computing Half-Plane and Strip Discrepancy of Planar Point Sets," manuscript, 1993.

8. D. Dobkin and D. Eppstein, Computing the discrepancy, in "Proceedings of the Ninth Annual Symposium on Computational Geometry, 1993," pp. 47–52.

9. D. Dobkin and D. Gunopulos, Concept learning with geometric hypotheses, in "8th ACM Conference on Learning Theory, 1995."

10. D. Dobkin and D. Gunopulos, Computing the rectangle discrepancy, in "3rd Annual Video Review of Computational Geometry, 1994," pp. 385–386.

11. D. Dobkin and D. Mitchell, Random-edge discrepancy of supersampling patterns, in "Graphics Interface '93," 1993.

12. D. Dobkin and A. Tal, GASP: A system to facilitate animating geometric algorithms, in "Third Annual Video Review of Computational Geometry, 1994," pp. 388–389.

13. K. Y. Goldberg, Orienting polygonal parts without sensors, Algorithmica 10 (1993), 201–226.

14. D. Gunopulos, "Computing the Discrepancy," Ph.D. thesis, Princeton University Department of Computer Science, 1995.

15. J. H. Halton, A retrospective and prospective survey of the Monte Carlo method, SIAM Rev. 12 (1970), 1.

16. D. Haussler, Decision theoretic generalizations of the PAC model for neural nets and other applications, Inform. and Comput. 100 (1992), 78–150.

17. D. Haussler and E. Welzl, ε-nets and simplex range queries, Discrete Comput. Geom. 2 (1987), 127–151.

18. K. U. Hoeffgen, H. U. Simon, and K. S. Van Horn, Robust trainability of single neurons, preprint, 1993.

19. R. C. Holte, Very simple classification rules perform well on most commonly used datasets, Machine Learning 11 (1993), 63–91.

20. J. E. Hopcroft and J. D. Ullman, "Introduction to Automata Theory, Languages and Computation," Addison-Wesley, Reading, MA, 1979.

21. J. T. Kajiya, The rendering equation, Comput. Graphics 20 (1986), 143–150.

22. M. Kearns, Efficient noise-tolerant learning from statistical queries, in "Proc. of the 25th ACM Symposium on the Theory of Computing, 1993," pp. 392–401.

23. M. Kearns and M. Li, Learning in the presence of malicious errors, SIAM J. Comput. 22 (1993), 807–837.

24. M. Kearns and R. E. Schapire, Efficient distribution-free learning of probabilistic concepts, in "Proc. of the 31st Annual IEEE Symp. on Foundations of Computer Science, 1990," pp. 382–391.

25. M. Kearns, R. E. Schapire, and L. M. Sellie, Toward efficient agnostic learning, in "Proc. of the 5th ACM Workshop on Computational Learning Theory, 1992," pp. 341–352.

26. J. F. Koksma, Een algemeene stelling uit de theorie der gelijkmatige verdeeling modulo 1, Mathematica B (Zutphen) 11 (1942–1943), 7–11.

27. D. Lubinsky, "Bivariate Splits and Consistent Split Criteria in Dichotomous Classification Trees," Ph.D. thesis, Rutgers Univ., 1994.

28. W. Maass, Efficient agnostic PAC-learning with simple hypotheses, in "Proc. of the 7th Annual ACM Conference on Computational Learning Theory, 1994," pp. 71–75.

29. J. Matousek, E. Welzl, and L. Wernisch, Discrepancy and ε-approximations for bounded VC-dimension, in "Proc. of the 25th ACM Symposium on the Theory of Computing, 1993," pp. 424–430.

30. K. Mehlhorn, "Multi-dimensional Searching and Computational Geometry," Springer-Verlag, New York/Berlin, 1984.

31. D. Mitchell, Ph.D. thesis, Princeton University Department of Computer Science, to appear.

32. J. Mingers, An empirical comparison of pruning methods for decision tree induction, Machine Learning 4 (1989), 227–243.

33. H. Niederreiter, Quasi-Monte Carlo methods and pseudo-random numbers, Bull. Amer. Math. Soc. 84, No. 6 (1978), 957–1041.

34. H. Niederreiter, Quasirandom sampling in computer graphics, in "Proc. 3rd International Seminar on Digital Image Processing in Medicine, Riga, Latvia, 1992," pp. 29–33.

35. M. Overmars and J. van Leeuwen, Maintenance of configurations in the plane, J. Comput. System Sci. 23 (1981), 166–204.

36. S. Paskov, "Computing High Dimensional Integrals with Applications to Finance," Tech. Rep. CUCS-023-94, Columbia University.

37. F. Preparata, An optimal real time algorithm for planar convex hulls, Comm. ACM 22 (1979).

38. F. Preparata and M. I. Shamos, "Computational Geometry," Springer-Verlag, New York/Berlin, 1985.

39. P. Shirley, Discrepancy as a quality measure for sample distributions, in "Proceedings of Eurographics '91," pp. 183–193.

40. M. Talagrand, Sharper bounds for empirical processes, Ann. Probab., to appear.

41. J. F. Traub and H. Wozniakowski, Recent progress in information-based complexity, Bull. EATCS 51 (1993).

42. L. G. Valiant, A theory of the learnable, Comm. ACM 27 (1984), 1134–1142.

43. L. G. Valiant, Learning disjunctions of conjunctions, in "Proc. of the 9th International Joint Conference on Artificial Intelligence, 1985," pp. 560–566.

44. V. N. Vapnik and A. Ya. Chervonenkis, On the uniform convergence of relative frequencies of events to their probabilities, Theory Probab. Appl. 16 (1971), 264–280.

45. S. M. Weiss, R. Galen, and P. V. Tadepalli, Maximizing the predictive value of production rules, Artif. Intell. 45 (1990), 47–71.

46. S. M. Weiss and I. Kapouleas, An empirical comparison of pattern recognition, neural nets, and machine learning classification methods, in "Proc. of the 11th International Joint Conference on Artificial Intelligence, 1990," pp. 781–787, Morgan Kaufmann.

47. S. M. Weiss and C. A. Kulikowski, "Computer Systems That Learn," Morgan Kaufmann, San Mateo, CA, 1991.

48. G. Wasilkowski and H. Woźniakowski, Explicit cost bounds of algorithms for multivariate tensor product problems, J. Complexity 11 (1995), 1–56.

49. H. Wozniakowski, Average case complexity of multivariate integration, Bull. Amer. Math. Soc. 24 (1991).
