
Artificial Intelligence and Mathematics, Rutgers University: rutcor.rutgers.edu/~amai/aimath04/AcceptedPapers/aimath-I.pdf

Eighth International Symposium on

Artificial Intelligence and Mathematics

January 4-6, 2004

Fort Lauderdale, Florida


Organizing Committee

General Chair: Martin Golumbic (University of Haifa, Israel)

Program Co-Chairs: Fahiem Bacchus (University of Toronto, Canada) and Peter van Beek (University of Waterloo, Canada)

Conference Chair: Frederick Hoffman (Florida Atlantic University, USA)

Publicity Chair: George Katsirelos (University of Toronto, Canada)

Program Committee

Franz Baader (TU Dresden, Germany), Peter Bartlett (UC Berkeley, USA), Endre Boros (Rutgers, USA), Adnan Darwiche (UCLA, USA), Rina Dechter (UC Irvine, USA), Boi Faltings (EPFL, Switzerland), Ronen Feldman (Bar-Ilan University, Israel), John Franco (U. of Cincinnati, USA), Hector Geffner (UPF, Spain), Ian Gent (U. of St. Andrews, UK), Enrico Giunchiglia (U. of Genova, Italy), Robert Givan (Purdue, USA), Joerg Hoffmann (Albert-Ludwigs, Germany), Holger Hoos (UBC, Canada), Peter Jonsson (U. of Linkoeping, Sweden), Henry Kautz (U. of Washington, USA), Sven Koenig (Georgia Tech, USA), Richard Korf (UCLA, USA), Gerhard Lakemeyer (RWTH Aachen, Germany), Omid Madani (U. of Alberta, Canada), Heikki Mannila (U. of Helsinki, Finland), Anil Nerode (Cornell, USA), Ronald Parr (Duke, USA), Pascal Poupart (U. of Toronto, Canada), Jeff Rosenschein (Hebrew University, Israel), Sam Roweis (U. of Toronto, Canada), Dale Schuurmans (U. of Alberta, Canada), Allen Van Gelder (UCSC, USA), Toby Walsh (4C, Ireland)

Sponsors

The Symposium is partially supported by the Annals of Mathematics and Artificial Intelligence and by Florida Atlantic University. Additional support is provided by the Caesarea Edmond Benjamin de Rothschild Institute at the University of Haifa and by the Florida-Israel Institute.


Contents

Production Inference, Nonmonotonicity and Abduction Alexander Bochman

Spines of Random Constraint Satisfaction Problems: Definition and Impact on Computational Complexity

Stefan Boettcher, Gabriel Istrate, and Allon G. Percus

Interval-Based Multicriteria Decision Making Martine Ceberio and François Modave

Using Logic Programs to Reason about Infinite Sets Douglas Cenzer, V. Wiktor Marek, and Jeffrey B. Remmel

The Expressive Rate of Constraints Hubie Chen

Using the Central Limit Theorem for Belief Network Learning Ian Davidson and Minoo Aminian

Approximate Probabilistic Constraints and Risk-Sensitive Optimization Criteria in Markov Decision Processes

Dmitri A. Dolgov and Edmund H. Durfee

Generalized Opinion Pooling Ashutosh Garg, T.S. Jayram, Shivakumar Vaithyanathan, and Huaiyu Zhu

A Framework for Sequential Planning in Multi-Agent Settings Piotr J. Gmytrasiewicz and Prashant Doshi

Heuristics for a Brokering Set Packing Problem Y. Guo, A. Lim, B. Rodrigues, and Y. Zhu

Combining Symmetry Breaking with Other Constraints: Lexicographic Ordering with Sums

Brahim Hnich, Zeynep Kiziltan, and Toby Walsh

A Simple Yet Effective Heuristic Framework for Optimization Problems Gaofeng Huang and Andrew Lim

Biased Minimax Probability Machine for Medical Diagnosis Kaizhu Huang, Haiqin Yang, Irwin King, Michael R. Lyu, and Laiwan Chan

Combining Cardinal Direction Relations and Relative Orientation Relations in QSR Amar Isli

Unrestricted vs Restricted Cut in a Tableau Method for Boolean Circuits Matti Järvisalo, Tommi A. Junttila, and Ilkka Niemelä

Parameter Reusing in Learning Latent Class Models Gytis Karčiauskas, Finn Jensen, and Tomás Kočka

New Look-Ahead Schemes for Constraint Satisfaction Kalev Kask, Rina Dechter, and Vibhav Gogate

Learning via Finitely Many Queries Andrew C. Lee


Modeling and Reasoning with Star Calculus Debasis Mitra

Analysis of Greedy Robot-Navigation Methods Apurva Mugdal, Craig Tovey, and Sven Koenig

Symmetry Breaking in Constraint Satisfaction with Graph-Isomorphism: Comma-Free Codes

Justin Pearson

Deductive Algorithmic Knowledge Riccardo Pucella

Inferring Implicit Preferences from Negotiation Actions Angelo Restificar and Peter Haddawy

Improving Exact Algorithms for MAX-2-SAT Haiou Shen and Hantao Zhang

Explicit Manifold Representations for Value-Function Approximation in Reinforcement Learning

William D. Smart

Warped Landscapes and Random Acts of SAT Solving Dave A. D. Tompkins and Holger H. Hoos

Using Automatic Case Splits and Efficient CNF Translation to Guide a SAT-Solver When Formally Verifying Out-of-Order Processors

Miroslav N. Velev

Multi-Agent Dialogue Protocols Christopher D. Walton

Bayesian Model Averaging Across Model Spaces via Compact Encoding Ke Yin and Ian Davidson

Crane Scheduling with Spatial Constraints: Mathematical Model and Solving Approaches

Yi Zhu and Andrew Lim

Papers from the Special Session on Intelligent Text Processing
Organizer: Shlomo Argamon, Illinois Institute of Technology, USA

Effective Use of Phrases in Language Modeling to Improve Information Retrieval Maojin Jiang, Eric Jenson, Steve Beitzel, and Shlomo Argamon

Text Categorization for Authorship Verification Moshe Koppel, Jonathan Schler, and Dror Mughaz

Mapping Dependencies Trees: An Application to Question Answering Vasin Punyakanok, Dan Roth, and Wen-tau Yih

A Linear Programming Formulation for Global Inference in Natural Language Tasks Dan Roth and Wen-tau Yih


Production Inference, Nonmonotonicity and Abduction

Alexander Bochman

Computer Science Department, Holon Academic Institute of Technology, Israel
e-mail: [email protected]

Abstract

We introduce a general formalism of production inference relations that possess both a standard monotonic semantics and a natural nonmonotonic semantics. The resulting nonmonotonic system is shown to provide a syntax-independent representation of abductive reasoning.

Abduction is reasoning from facts to their possible explanations that is now widely used in many areas of AI, including diagnosis, truth maintenance, knowledge assimilation, database updates and logic programming. In this study we are going to show that this kind of reasoning can be given a formal, syntax-independent representation in terms of production inference relations that constitute a particular formalization of input-output logics [MdT00]. Among other things, such a representation will clarify the relation between abduction and nonmonotonic reasoning, as well as show the expressive capabilities of production inference as a general-purpose nonmonotonic formalism.

We will assume that our basic language is a classical propositional language with the usual connectives and constants ∧, ∨, ¬, →, t, f. ⊨ will denote classical entailment, and Th the associated provability operator.

1 Production Inference Relations

We begin with the following general notion of production inference.



Definition 1.1. A production inference relation is a binary relation ⇒ on the set of classical propositions satisfying the following conditions:

(Strengthening) If A ⊨ B and B⇒C, then A⇒C;

(Weakening) If A⇒B and B ⊨ C, then A⇒C;

(And) If A⇒B and A⇒C, then A⇒B ∧ C;

(Truth) t⇒t;

(Falsity) f⇒f.

A distinctive feature of production inference relations is that reflexivity A⇒A does not hold. It is this ‘omission’, however, that determines their representation capabilities in describing nonmonotonicity and abduction.

In what follows, conditionals A⇒B will be called production rules. We extend such rules to rules having arbitrary sets of propositions in premises as follows: for a set u of propositions, we define u⇒A as holding when

⋀a ⇒ A

for some finite a ⊆ u. C(u) will denote the set of propositions ‘produced’ by u, that is, C(u) = {A | u⇒A}. The production operator C will play much the same role as the derivability operator for consequence relations.
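The produced-set operator C can be illustrated with a small brute-force sketch. This is a toy approximation under a strong assumption: rules relate plain atoms and classical entailment is ignored, so Strengthening and Weakening are not modeled; only the finite-premise clause ⋀a ⇒ A is. The rule names are invented for illustration.

```python
# Toy encoding (hypothetical): each production rule is a pair
# (premise set of atoms, produced atom). Classical entailment is
# deliberately ignored; this only illustrates C(u) = {A | u => A}.
RULES = [({"rain"}, "wet"), ({"wet", "cold"}, "ice")]

def produced(u, rules=RULES):
    """Atoms A such that the conjunction of some finite a ⊆ u produces A."""
    u = set(u)
    return {head for body, head in rules if body <= u}

assert produced({"rain"}) == {"wet"}
assert produced({"rain", "cold", "wet"}) == {"wet", "ice"}
```

Note that `produced` is monotone in u, mirroring the monotonicity of C noted later in the paper.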

1.1 Kinds of production inference

A classification of the main kinds of production inference relevant for our study is based on the validity of the following additional postulates:

(Cut) If A⇒B and A ∧ B⇒C, then A⇒C.

(Or) If A⇒C and B⇒C, then A ∨ B⇒C.

(Weak Deduction) If A⇒B, then t⇒(A→B).

A production inference relation will be called regular if it satisfies Cut; basic if it satisfies Or¹; causal if it is both basic and regular; and quasi-classical if it is causal and satisfies Weak Deduction.

The rule Cut allows for a reuse of produced propositions as premises in further productions. Any regular production relation will be transitive.

¹Basic production relations correspond to basic input-output logics from [MdT00], while regular productions correspond to input-output logics with reusable output.



The rule Or allows for reasoning by cases, and hence basic production relations can already be seen as systems of objective production inference, namely as systems of reasoning about complete worlds (see below).

Causal production relations have been introduced in [Boc03]; they have been shown to provide a complete characterization for the reasoning with causal theories from [MT97].

Finally, quasi-classical production relations will be shown below to characterize ‘classical’ abductive reasoning.

1.2 The Monotonic Semantics

A semantic interpretation of production relations is based on pairs of deductively closed theories called bimodels. By the ‘input-output’ understanding of productions, a bimodel represents an initial state (input) and a possible final state (output) of a production process.

Definition 1.2. A pair of classically consistent deductively closed sets of propositions will be called a bimodel. A set of bimodels will be called a production semantics.

Note that a production semantics can also be viewed as a binary relation on the set of deductive theories.

Definition 1.3. A production rule A⇒B will be said to be valid in a production semantics B if, for any bimodel (u, v) from B, A ∈ u only if B ∈ v.

Then the following completeness result can be shown:

Theorem 1.1. A relation on the set of propositions is a production inference relation if and only if it is determined by a production semantics.

This representation result serves as a basis for semantic characterizations of the different kinds of production relations described above.

The semantics of regular production relations can be obtained by considering only bimodels (u, v) such that v ⊆ u. We will call such bimodels (and the corresponding semantics) inclusive ones.

Theorem 1.2. ⇒ is a regular production relation iff it is generated by an inclusive production semantics.



The semantics for the three other kinds of production is obtained by restricting the set of bimodels to bimodels of the form (α, β), where α, β are worlds. The corresponding production semantics can be seen as a relational possible worlds model W = (W, B), where W is a set of worlds with an accessibility relation B. Validity of productions can now be defined as follows:

Definition 1.4. A rule A⇒B is valid in a possible worlds model (W, B) if, for any α, β ∈ W such that αBβ, if A holds in α, then B holds in β.

A possible worlds model (W, B) will be called reflexive if αBα for any world α, and quasi-reflexive if αBβ implies αBα, for any α, β ∈ W.

Theorem 1.3. A production inference relation is

• basic if and only if it has a possible worlds model.

• causal iff it has a quasi-reflexive possible worlds model.

• quasi-classical iff it has a reflexive possible worlds model.

1.3 The Nonmonotonic Semantics

The fact that the production operator C is not reflexive creates an important distinction between theories of a production inference relation.

Definition 1.5. A set u of propositions is a theory of a production relation if it is deductively closed and C(u) ⊆ u. A theory is exact if u = C(u).

A theory of a production relation is closed with respect to its production rules, while an exact theory describes an informational state in which every proposition is also produced, or explained, by other propositions accepted in this state. Accordingly, restricting our universe of discourse to exact theories amounts to imposing a kind of explanatory closure assumption on admissible states. This suggests the following notion:

Definition 1.6. A (general) nonmonotonic semantics of a production inference relation is the set of all its exact theories.

The above nonmonotonic semantics is indeed nonmonotonic, since adding new rules to the production relation may lead to a nonmonotonic change of the associated semantics, and thereby to a nonmonotonic change in the derived information. This happens even though production rules themselves are monotonic, since they satisfy Strengthening the Antecedent.

Exact theories are precisely the fixed points of the production operator C. Since the latter operator is monotonic and continuous, exact theories (and hence the nonmonotonic semantics) always exist.
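Because C is monotone and continuous, its least fixed point can be reached by iterating C from the empty set. A minimal self-contained sketch, under the same strong simplification as before: rules over plain atoms, no classical closure, and illustrative rule names.

```python
# Toy atomic rules: (premise set, produced atom). An empty premise set
# plays the role of a rule t => a that fires unconditionally.
RULES = [(set(), "a"), ({"a"}, "b"), ({"b"}, "c"), ({"d"}, "e")]

def produce(u, rules):
    return {head for body, head in rules if body <= u}

def least_exact_theory(rules):
    """Iterate u := C(u) from the empty set until u = C(u); monotonicity
    and continuity of C guarantee convergence to the least fixed point."""
    u = set()
    while (nxt := produce(u, rules)) != u:
        u = nxt
    return u

# "e" never enters the least exact theory: "d" is never produced,
# so the rule {"d"} => "e" never fires.
assert least_exact_theory(RULES) == {"a", "b", "c"}
```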

Since a basic production inference is already world-based, it naturally sanctions the following strengthening of the general nonmonotonic semantics.

Definition 1.7. An objective nonmonotonic semantics of a (basic) production inference relation is the set of all its exact worlds.

As has already been shown in [MT97], the above semantics is representable as the set of all worlds (interpretations) that satisfy a certain classical completion of the set of production rules.

2 Abduction versus Production

An abductive framework can be defined as a pair A = (Cn, A), where Cn is a consequence relation that subsumes classical entailment², while A is a distinguished set of propositions that play the role of abducibles, or explanations, for other propositions. A proposition B is explainable in an abductive framework A if there exists a consistent set of abducibles a ⊆ A such that B ∈ Cn(a).

It turns out that explanatory relations in an abductive framework can be captured by considering only theories that are generated by the abducibles.

Definition 2.1. The abductive semantics AS of an abductive framework A is the set of theories {Cn(a) | a ⊆ A}.

The information embodied in the abductive semantics can be made explicit by considering the following generated Scott consequence relation:

b ⊩A c ≡ (∀u ∈ AS)(b ⊆ u → c ∩ u ≠ ∅)

b ⊩A c holds if and only if any set of abducibles that explains b also explains at least one proposition from c.³ This consequence relation is an extension of Cn that describes not only ‘forward’ explanatory relations, but also abductive inferences from propositions to their explanations. For example, if C and D are the only abducibles that imply A in an abductive framework, then we will have A ⊩A C, D. Speaking more generally, the above abductive consequence relation describes the explanatory closure, or completion, of an abductive framework, and thereby allows us to capture the abductive process by deductive means (see [CDT91, Kon92]).

²Such consequence relations are called supraclassical.
³A Tarski consequence relation of this kind has been used for the same purposes in [LU97].
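The generated Scott consequence can be computed by brute force over the abductive semantics. A hedged toy sketch: abducibles are atoms, Cn is closure under a few Horn rules (a simple stand-in for a supraclassical consequence relation), and the diagnostic atom names are invented for illustration.

```python
from itertools import chain, combinations

# Hypothetical framework: two abducibles and three Horn rules.
ABDUCIBLES = {"flu", "cold"}
HORN = [({"flu"}, "fever"), ({"cold"}, "cough"), ({"flu"}, "cough")]

def cn(base):
    """Closure of `base` under the Horn rules (stand-in for Cn)."""
    s = set(base)
    changed = True
    while changed:
        changed = False
        for body, head in HORN:
            if body <= s and head not in s:
                s.add(head)
                changed = True
    return frozenset(s)

def abductive_semantics():
    """All theories Cn(a) for a ⊆ ABDUCIBLES."""
    subsets = chain.from_iterable(
        combinations(sorted(ABDUCIBLES), r) for r in range(len(ABDUCIBLES) + 1))
    return {cn(a) for a in subsets}

def scott_entails(b, c):
    """b ||- c iff every theory Cn(a) containing b meets c."""
    return all(set(c) & u for u in abductive_semantics() if set(b) <= u)

# "fever" is only explained by theories containing flu, which also contain cough:
assert scott_entails({"fever"}, {"flu"})
assert scott_entails({"fever"}, {"cough"})
# "cough" can be explained by cold alone, so it does not force flu:
assert not scott_entails({"cough"}, {"flu"})
```

This mirrors the ‘backward’ reading in the text: the consequence relation infers explanations (flu) from observations (fever).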

The following definition arises from viewing explanation as a kind of production inference.

Definition 2.2. A production inference relation associated with an abductive framework A is a production relation ⇒A determined by all bimodels of the form (u, Cn(u ∩ A)), where u is a consistent theory of Cn.

Since the above production semantics is inclusive, the associated production relation will always be regular. Moreover, the following result shows how it is related to the source abductive framework.

Theorem 2.1. The abductive semantics of an abductive framework coincides with the nonmonotonic semantics of its associated production relation.

We assume below that the set A is closed with respect to conjunctions, that is, if A and B are abducibles, so is A ∧ B. To deal with limit cases, we assume also that t and f are abducibles. Then it turns out that the above production relation admits a very simple syntactic characterization, namely

B⇒AC iff (∃A ∈ A)(A ∈ Cn(B) & C ∈ Cn(A))

Note that A⇒AA holds if and only if A is Cn-equivalent to an abducible from A. Accordingly, we will say that A is an abducible of a production inference relation ⇒ if A⇒A. The set of such abducibles is closed with respect to conjunctions. Now, production relations associated with abductive frameworks satisfy the following characteristic property:

(Abduction) If B⇒C, then B⇒A⇒C, for some abducible A.

Regular production relations satisfying Abduction will be called abductive production relations. For such relations, the production process always goes through abducibles. The next theorem shows that they are precisely the production inference relations that are generated by abductive frameworks.

Theorem 2.2. A production relation is abductive if and only if it is generated by an abductive framework.



Due to the above results, abductive production relations can be seen as a faithful logical representation of abductive reasoning. Notice that abductive production relations provide in this sense a syntax-independent description of abduction: the set of abducibles is determined as a set of propositions having a certain logical property, namely reflexivity.

The abductive subrelation. Any regular production relation ⇒ includes an important abductive subrelation defined as follows:

A⇒a B ≡ (∃C)(A⇒C & C⇒C & C⇒B)

⇒a is the greatest abductive relation included in ⇒, and in many natural cases it produces the same nonmonotonic semantics (see [Boc03]).

If A⇒ is the set of abducibles of ⇒, and Cn⇒ the least supraclassical consequence relation including ⇒, then the following result can be shown:

Lemma 2.3. If ⇒ is a regular production relation, then its abductive subrelation ⇒a is generated by the abductive framework (Cn⇒, A⇒).

2.1 Causal and classical abductive inference

Abductive frameworks corresponding to causal production relations are described in the next definition.

Definition 2.3. An abductive framework A = (Cn, A) will be called A-disjunctive if A is closed with respect to disjunctions, and Cn satisfies the following conditions, for any abducibles A, A1 ∈ A, and arbitrary B, C:

• If A ∈ Cn(B) and A ∈ Cn(C), then A ∈ Cn(B ∨ C);

• If B ∈ Cn(A) and B ∈ Cn(A1), then B ∈ Cn(A ∨ A1).⁴

Theorem 2.4. An abductive production relation is causal if and only if it is generated by an A-disjunctive abductive framework.

As we already mentioned, the objective nonmonotonic semantics of such production relations is obtainable by forming a classical completion of the set of production rules (cf. [CDT91]).

⁴This rule corresponds to the rule Ab-Or in [LU97].



An abductive framework will be called classical if Cn is a classical consequence relation (that is, it is supraclassical and satisfies the deduction theorem). Such a framework is reducible to a pair (Σ, A), where Σ is a domain theory (with the implicit assumption that the background logic is classical). Our last result provides a production counterpart of such frameworks.

Theorem 2.5. An abductive production relation is quasi-classical if and only if it is generated by a classical abductive framework.

An interesting negative consequence of the above result is that classical abductive frameworks are already inadequate for reasoning with the causal theories of McCain and Turner; the latter is captured, however, by the broader class of A-disjunctive abductive frameworks.

References

[Boc03] A. Bochman. A logic for causal reasoning. In G. Gottlob and T. Walsh, editors, Proceedings Int. Joint Conference on Artificial Intelligence, IJCAI'03, pages 141–146, Acapulco, 2003. Morgan Kaufmann.

[CDT91] L. Console, D. Theseider Dupre, and P. Torasso. On the relationship between abduction and deduction. Journal of Logic and Computation, 1:661–690, 1991.

[Kon92] K. Konolige. Abduction versus closure in causal theories. Artificial Intelligence, 53:255–272, 1992.

[LU97] J. Lobo and C. Uzcategui. Abductive consequence relations. Artificial Intelligence, 89:149–171, 1997.

[MdT00] D. Makinson and L. van der Torre. Input/Output logics. Journal of Philosophical Logic, 29:383–408, 2000.

[MT97] N. McCain and H. Turner. Causal theories of action and change. In Proceedings AAAI-97, pages 460–465, 1997.



Spines of Random Constraint Satisfaction Problems: Definition and Impact on Computational Complexity

Stefan Boettcher∗, Gabriel Istrate and Allon G. Percus†

1 Introduction

The major promise of phase transitions in combinatorial problems was to shed light on the “practical” algorithmic complexity of combinatorial problems. A possible connection has been highlighted by the results (based on experimental evidence and nonrigorous arguments from statistical mechanics) of Monasson et al. [1, 2]. Studying a version of random satisfiability that “interpolates” between 2-SAT and 3-SAT, they concluded that the order of the phase transition, combinatorially expressed by continuity of an order parameter called the backbone, might have algorithmic implications for the complexity of the important class of Davis-Putnam-Logemann-Loveland (DPLL) algorithms [3]. A discontinuous or first-order transition was symptomatic of exponential complexity, whereas a continuous or second-order transition correlated with polynomial complexity.

It is well understood by now that this connection is limited. For instance, k-XOR-SAT is a problem believed, based on arguments from statistical mechanics [4], to have a first-order phase transition. But it is easily solved by a polynomial algorithm, Gaussian elimination.

One way to clarify the connection between phase transitions and computational complexity is to formalize the underlying intuition in a purely combinatorial way, devoid of any physics considerations. First-order phase transitions amount to a discontinuity in the (normalized) size of the backbone. For random k-SAT [5], and more specifically for the optimization problem MAX-k-SAT, the backbone has a combinatorial interpretation: it is the set of literals that are “frozen” (assume the same value) in all optimal assignments. Intuitively, a large backbone size has implications for the complexity of finding such assignments: all literals in the backbone require well-defined values in order to satisfy the formula, but an algorithm assigning variables in an iterative fashion has very few ways to know what the “right” values to assign are. In the case of a first-order phase transition, the backbone of formulas just above the transition contains with high probability a fraction of the literals that is bounded away from zero. DPLL algorithms would then misassign a variable having Ω(n) height in the tree representing the behavior of the algorithm, forcing it to backtrack on the given variable. Assuming the algorithm cannot significantly “reduce” the size of the explored portion of this tree, a first-order phase transition would then w.h.p. imply a 2^Ω(n) lower bound for the running time of DPLL on random instances located slightly above the transition.

There exists, however, a significant flaw in the heuristic argument above: the backbone is defined with respect to optimal solutions, and would seem to imply that it is difficult to find solutions to, e.g., MAX-k-SAT using algorithms that assign variables iteratively. But why should the continuity/discontinuity of the backbone be the relevant predictor for the complexity of the (often easier) decision problem, which is what DPLL algorithms try to solve anyway?

[email protected], Emory University, Atlanta, GA 30322
†istrate,[email protected], Los Alamos National Laboratory, Los Alamos, NM 87545. Correspondence to: Gabriel Istrate.


Fortunately, it turns out that the intuition of the previous argument also holds for a different order parameter, a “weaker” version of the backbone called the spine, introduced in [6] in order to prove that random 2-SAT has a second-order phase transition. Unlike the backbone, the spine is defined in terms of the decision problem, hence it could conceivably have a larger impact on the complexity of these problems. Of course, the same caveat applies as for the backbone: we are referring to complexity with respect to classes of algorithms weaker than polynomial time computations, and that in particular are not strong enough to capture the polynomial time Gaussian elimination algorithm for k-XOR-SAT.

We aim in this paper to provide evidence that the behavior of the spine, rather than the backbone, impacts the complexity of the underlying decision problem. To accomplish this:

1. We discuss the proper definition of the backbone/spine for random CSP.

2. We formally establish a simple connection between a discontinuity in the (relative size of the) spine at the threshold and the resolution complexity of random satisfiability problems. In a nutshell, a necessary and sufficient condition for the existence of a discontinuity in the spine is the existence of an Ω(n) lower bound (w.h.p.) on the size of minimally unsatisfiable subformulas of a random (unsatisfiable) formula. But standard methods from proof complexity [7] imply that (in conjunction with the expansion of the formula hypergraph, independent of the precise definition of the problem at hand) in all cases where we can prove the existence of a first-order phase transition, such problems have a 2^Ω(n) lower bound on their resolution complexity (and hence on the complexity of DPLL algorithms as well [3]). In contrast, we show (Theorem 1) that, for any generalized satisfiability problem, a second-order phase transition implies, for every α > 0, an O(2^(α·n)) upper bound on the resolution complexity of their random instances (in the region where most formulas are unsatisfiable).

3. We give a sufficient condition (Theorem 2) for the existence of a discontinuous jump in the size of the spine. We then show (Theorem 3) that all problems whose constraints have no implicates of size at most two satisfy this condition. Qualitatively, our results suggest that all satisfiability problems with a second-order phase transition in the spine are “like 2-SAT”.

4. Finally, we present some experimental results that attempt to clarify the issue of whether the backbone and the spine can behave differently at the phase transition. The Graph Bipartition problem is one case where this seems to happen. In contrast, the backbone and spine of random 3-coloring seem to have similar behavior.

A note on the significance of our results: a discontinuity in the spine is weaker than a first-order phase transition (i.e., a discontinuity in the size of the backbone). Also, unlike the backbone, the spine has no physical interpretation. But this is not our intention: we have seen that the argument connecting the backbone size and the complexity of decision problems is problematic. What we rigorously show (with no physics considerations in mind) is that the intuitive argument holds for the spine order parameter. Moreover, the last section of the paper presents experimental work suggesting that the backbone and the spine can behave differently.

2 Preliminaries

Throughout the paper we will assume familiarity with the general concepts of phase transitions in combinatorial problems (see e.g., [8]), random structures [9], and proof complexity [10]. Some papers whose concepts and methods we use in detail (and with which we assume greater familiarity) include [11], [12], [7]. We will use the model of random constraint satisfaction from Molloy [13]:


Definition 1 Let D = {0, 1, . . . , t − 1}, t ≥ 2, be a fixed set. Consider all 2^(t^k) − 1 possible nonempty constraints (relations) on k variables X1, . . . , Xk with values taken from D. Let C be a nonempty set of such constraints.

A random formula from CSPn,m(C) is specified by the following procedure:

• n is the number of variables.

• m is the number of constraints; they are chosen by the following procedure: first select, uniformly at random and with replacement, m hyperedges of the complete k-uniform hypergraph on n variables.

• for each hyperedge, choose a random ordering of the variables involved, then choose a random constraint from C and apply it to the list of (ordered) variables.

We use the notation SAT(C) (instead of CSP(C)) when t = 2. Also, for Φ an instance of CSP(C) we denote by optC(Φ) the smallest number of constraints left unsatisfied by any assignment.
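The sampling procedure of Definition 1 can be sketched directly. This is one reading of the model, under the assumption that each constraint is represented as a set of allowed value tuples; all names and the example constraint set are illustrative.

```python
import random
from itertools import combinations

def random_csp(n, m, k, C, seed=None):
    """Sample a formula from CSP_{n,m}(C): m hyperedges drawn uniformly
    with replacement from the complete k-uniform hypergraph on n variables,
    each given a random variable ordering and a constraint drawn from C."""
    rng = random.Random(seed)
    edges = list(combinations(range(n), k))  # complete k-uniform hypergraph
    formula = []
    for _ in range(m):
        scope = list(rng.choice(edges))      # uniform, with replacement
        rng.shuffle(scope)                   # random ordering of the variables
        relation = rng.choice(C)             # a set of allowed value tuples
        formula.append((tuple(scope), relation))
    return formula

# Example: SAT(C) with t = 2; the single constraint forbidding (0, 0)
# behaves like the 2-clause (x1 ∨ x2).
C_or = [{(0, 1), (1, 0), (1, 1)}]
phi = random_csp(n=5, m=8, k=2, C=C_or, seed=0)
```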

Just as in random graphs [9], under fairly liberal conditions one can use the constant probability model instead of the counting model from the previous definition. The interesting range of the parameter m is when the ratio m/n is a constant c, the constraint density (details are left for the final version of the paper). The original investigation of the order of the phase transition in k-SAT used an order parameter called the backbone,

B(Φ) = {x ∈ Lit | ∃λ ∈ {0, 1} : ∀Ξ ∈ MAXSAT(Φ), Ξ(x) = λ}, (1)

or more precisely the backbone fraction f, the fraction of the n variables that belong to B(Φ).

Bollobás et al. [6] have investigated the order of the phase transition in k-SAT (for k = 2) under a different order parameter, a “monotonic version” of the backbone called the spine.

S(Φ) = {x ∈ Lit | ∃Ξ ⊆ Φ : Ξ ∈ k−SAT, Ξ ∧ x ∉ k−SAT}. (2)

They showed that random 2-SAT has a continuous (second-order) phase transition: the size of the spine, normalized by dividing it by the number of variables, approaches zero (as n → ∞) for c < c2−SAT = 1, and is continuous at c = c2−SAT. By contrast, nonrigorous arguments from statistical mechanics [5] imply that for 3-SAT the backbone jumps discontinuously from zero to positive values at the transition point c = c3−SAT (a first-order phase transition).
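For a small formula both order parameters can be computed by brute force straight from definitions (1) and (2): enumerate all assignments for the backbone, and all subformulas Ξ ⊆ Φ for the spine. A toy sketch over a hand-picked 2-SAT instance (the encoding, with a positive integer i standing for literal x_i and −i for ¬x_i, is our own):

```python
from itertools import product, combinations

# Toy 2-SAT instance: (x1 ∨ x2)(¬x1 ∨ x2)(¬x2 ∨ x3).
PHI = [(1, 2), (-1, 2), (-2, 3)]
VARS = [1, 2, 3]

def satisfies(assignment, clause):
    return any(assignment[abs(l)] == (l > 0) for l in clause)

def n_unsat(assignment, formula):
    return sum(not satisfies(assignment, c) for c in formula)

assignments = [dict(zip(VARS, bits))
               for bits in product([False, True], repeat=len(VARS))]
optimum = min(n_unsat(a, PHI) for a in assignments)
maxsat = [a for a in assignments if n_unsat(a, PHI) == optimum]

# Backbone, definition (1): literals frozen in all optimal assignments.
backbone = {l for v in VARS for l in (v, -v)
            if all(a[v] == (l > 0) for a in maxsat)}

def satisfiable(formula):
    return any(n_unsat(a, formula) == 0 for a in assignments)

subformulas = [list(s) for r in range(len(PHI) + 1)
               for s in combinations(PHI, r)]

# Spine, definition (2): literals x with a satisfiable Ξ ⊆ Φ such that
# Ξ together with the unit clause (x) is unsatisfiable.
spine = {l for v in VARS for l in (v, -v)
         if any(satisfiable(s) and not satisfiable(s + [(l,)])
                for s in subformulas)}

# Here the formula forces x2 and x3 true, so backbone == {2, 3}
# while the spine collects the literals ruled out: spine == {-2, -3}.
```

The exponential enumeration over subformulas is exactly what makes the spine a purely combinatorial, decision-problem-based quantity rather than something efficiently computable.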

3 How to define the backbone/spine for random CSP (and beyond)

We would like to extend the concepts of backbone and spine to general constraint satisfaction problems. Certain differences between the case of random k-SAT and the general case force us to employ an alternative definition of the backbone/spine. The most obvious is that formula (2) involves negations of variables, unlike Molloy’s model. Also, these definitions are inadequate for problems whose solution space presents a relabelling symmetry, such as the case of graph coloring, where the set of (optimal) colorings is closed under permutations of the colors. Due to this symmetry, no variable can be frozen in this way.

The new definitions have to retain as many of the properties of the backbone/spine as possible. In particular, the new definitions must give rise to order parameters, i.e., quantities that are zero up to the critical value of the control parameter (in our case the constraint density c) and positive above it. The formal statement that we wish to extend to CSP(C) is presented next for the spine:

Page 16: Artificial Intelligence and Mathematics - Rutgers Universityrutcor.rutgers.edu/~amai/aimath04/AcceptedPapers/aimath-I.pdf · Artificial Intelligence and Mathematics January 4-6, 2004

Lemma 3.1 Let c be an arbitrary constant value for the constraint density function.

1. If c < lim_{n→∞} c_{k-SAT}(n) then lim_{n→∞} |S(Φ)|/n = 0.

2. If for some c there exists δ > 0 such that w.h.p. (as n → ∞) |S(Φ)|/n > δ, then lim_{n→∞} Prob[Φ ∈ SAT] = 0, that is, c > lim_{n→∞} c_{k-SAT}(n).

Our solution is to define the backbone/spine of a random instance of CSP(C) slightly differently.

Definition 2

B(Φ) = {x ∈ Var | ∃ C ∈ C : x ∈ C, opt_C(Φ ∪ {C}) > opt_C(Φ)},
S(Φ) = {x ∈ Var | ∃ C ∈ C and Ξ ⊆ Φ : x ∈ C, Ξ ∈ CSP, Ξ ∧ C ∉ CSP}.

For k-CNF formulas whose (original) backbone/spine contains at least three literals, a variable x is in the (new version of the) backbone/spine if and only if either x or x̄ was present in the old version. In particular, the new definition does not change the order of the phase transition of random k-SAT.

Alternatively, in studying 3-colorability of random graphs G = (V, E), Culberson and Gent [14] define

S(G) = {(x, y) ∈ V² | ∃ H ⊆ G : H ∈ 3-COL, H ∪ {(x, y)} ∉ 3-COL},

so one may define the relative backbone and spine sizes in terms of constraints rather than variables.

Definition 3

B(Φ) = {C ∈ C | opt_C(Φ ∪ {C}) > opt_C(Φ)},
S(Φ) = {C ∈ C | ∃ Ξ ⊆ Φ : Ξ ∈ CSP, Ξ ∧ C ∉ CSP}.

Since we are looking at a combinatorial definition, with no physics considerations in mind, the only principled way to choose between the two types of order parameters (one based on variables, the other based on constraints) is to look at the class of algorithms we are concerned with. In the case of random constraint satisfaction problems (and DPLL algorithms) it is variables that get assigned values, so Definition 2 is preferred. On the other hand, we will see an example in a later section (the case of graph bipartitioning) where it makes more sense to use Definition 3.

Finally, note that if the backbone/spine of an instance of CSP(C) (in the sense of Definition 2) has size u, then the backbone/spine (in the sense of Definition 3) has size O(u^k). It follows readily that the discontinuity of the second version of the backbone/spine implies the discontinuity of the corresponding first version. This will notably be the case in our experimental study of 3-COL.

4 Spine discontinuity and resolution complexity of random CSP

Definition 4 Let C be such that SAT(C) has a sharp threshold. Problem SAT(C) has a discontinuous spine if there exists η > 0 such that for every sequence m = m(n) we have

lim_{n→∞} Prob_{m=m(n)}[Φ ∈ SAT] = 0 ⇒ lim_{n→∞} Prob_{m=m(n)}[|S(Φ)|/n ≥ η] = 1.   (3)

If, on the other hand, for every ε > 0 there exists m_ε(n) with

lim_{n→∞} Prob_{m=m_ε(n)}[Φ ∈ SAT] = 0 and lim_{n→∞} Prob_{m=m_ε(n)}[|S(Φ)|/n ≥ ε] = 0,   (4)

we say that SAT(C) has a continuous spine.

Page 17: Artificial Intelligence and Mathematics - Rutgers Universityrutcor.rutgers.edu/~amai/aimath04/AcceptedPapers/aimath-I.pdf · Artificial Intelligence and Mathematics January 4-6, 2004

Claim 1 Let Φ be minimally unsatisfiable, and let x be a literal that appears in Φ. Then x ∈ S(Φ).

Corollary 1 k-SAT, k ≥ 3, has a discontinuous spine.

The resolution complexity of an instance Φ of SAT(C) is defined as the resolution complexity of the formula obtained by converting each constraint of Φ to CNF form. A simple observation is that a continuous spine has implications for resolution complexity:

Theorem 1 Let C be a set of constraints such that SAT(C) has a continuous spine. Then for every value of the constraint density c > lim_{n→∞} c_{SAT(C)}(n), and every α > 0, random formulas of constraint density c have w.h.p. resolution complexity O(2^{k·α·n}).

Definition 5 For a formula F define c*(F) = max{ |Constraints(G)| / |Var(G)| : ∅ ≠ G ⊆ F }.
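As a concrete reading of Definition 5, the following Python sketch (our own illustration, exponential in the number of constraints and thus only for tiny formulas) computes c*(F) from the constraint scopes alone:

```python
import itertools

def c_star(scopes):
    """Definition 5: c*(F) = max over nonempty G ⊆ F of
    |Constraints(G)| / |Var(G)|, by brute force over subformulas."""
    scopes = [frozenset(s) for s in scopes]
    best = 0.0
    for r in range(1, len(scopes) + 1):
        for G in itertools.combinations(scopes, r):
            best = max(best, len(G) / len(frozenset().union(*G)))
    return best

# A triangle of binary constraints: the densest subformula is the whole
# triangle, with 3 constraints on 3 variables.
print(c_star([{0, 1}, {1, 2}, {0, 2}]))  # → 1.0
```

Note that c* depends only on the formula hypergraph, not on which relations are applied; this is exactly why it interacts with hypergraph sparseness in the proofs below.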

The next result gives a sufficient condition for a generalized satisfiability problem to have a discontinuous spine. Interestingly, it is one of the conditions studied in [13].

Theorem 2 Let C be such that SAT(C) has a sharp threshold. If there exists ε > 0 such that for every minimally unsatisfiable formula F it holds that c*(F) > (1 + ε)/(k − 1), then SAT(C) has a discontinuous spine.

One can give an explicitly defined class of satisfiability problems for which the previous result applies:

Theorem 3 Let C be such that SAT(C) has a sharp threshold. If no constraint C ∈ C has an implicate of length at most 2, then

1. for every minimally unsatisfiable formula F, c*(F) ≥ 2/(2k − 3). Therefore SAT(C) satisfies the conditions of the previous theorem, i.e., it has a discontinuous spine.

2. Moreover, SAT(C) also has 2^{Ω(n)} resolution complexity¹.

The condition in the theorem is violated (as expected) by random 2-SAT, as well as by the random version of the NP-complete problem 1-in-k SAT. This problem can be represented as CSP(C), for C a set of 2^k constraints (corresponding to all ways to negate some of the variables), and has a rigorously determined “2-SAT-like” location of the transition point [17]. But the formula C(x1, x2, . . . , x_{k−1}, x_k) ∧ C(x_k, x_{k+1}, . . . , x_{2k−2}, x1) ∧ C(x1, x_{2k−1}, . . . , x_{3k−3}, x_k) ∧ C(x_k, x_{3k−2}, . . . , x_{4k−4}, x1), where C is the constraint “1-in-k”, is minimally unsatisfiable but has clause/variable ratio 1/(k − 1) and implicates x1 ∨ x_k and x̄1 ∨ x̄_k.

4.1 Threshold location and discontinuous spines

Molloy [13] has shown (Theorem 3 in his paper) that the condition that turned out to be sufficient for the existence of a phase transition has implications for the location of the threshold. A natural question therefore arises: is it possible to read the order of the phase transition from the location of the threshold?

We cannot answer this question in full. However, we have already seen two problems that do not satisfy our sufficient condition for a discontinuous spine: random 2-SAT, for which the transition has been proven to be of second order [6], and random 1-in-k SAT, for which we believe a similar result holds [17]. Molloy’s result does not provide the correct location of the threshold for these two problems. It is, however, striking that for both problems the actual location of the threshold is twice the value given by Theorem 3 in [13], at clause/variable ratio 2/(k(k − 1)). We therefore give the following observation, a variant of Molloy’s result (proved in the journal version of the paper).

¹This result subsumes some of the results in [15]. Related results have been given independently in [16].

Page 18: Artificial Intelligence and Mathematics - Rutgers Universityrutcor.rutgers.edu/~amai/aimath04/AcceptedPapers/aimath-I.pdf · Artificial Intelligence and Mathematics January 4-6, 2004

Proposition 1 Modify the random model from Definition 1 to:

• allow application of constraints to both variables and negated variables.

• only allow constraints that cannot be made equal via negation of some of their variables.

Denote by CSP_neg(C) the new random model. Let C be such that for every minimally unsatisfiable formula F whose constraints are drawn from C, the ratio of constraints to variables of F is at least (1 + ε)/(k − 1), for some ε > 0. Then there is a constant δ > 0 such that CSP_neg(C) is a.s. satisfiable for m ≤ (1 + δ) · 2n/(k(k − 1)).

5 Beyond random satisfiability: comparing the behavior of the backbone and spine

In this section we investigate empirically the continuity of the backbone for two graph problems, random three-coloring (3-COL) and graph bipartitioning (GBP).

We consider a large number of instances of random graphs, of sizes up to n = 1024 and over a range of mean degree values near the threshold. For each instance we determine the backbone fraction f.

Culberson and Gent [14] have shown experimentally that the 3-COL spine (as defined in Definition 3) exhibits a discontinuous transition. To be consistent with this study (and since it leads to a stronger result anyhow) we use the backbone from the same definition. We employ a rapid heuristic called extremal optimization [18] that, based on testbed comparisons with an exact algorithm, yields an excellent approximation of f around the critical region. Figure 1 shows f as a function of mean degree. Above the threshold, for 3-COL (Fig. 1a), f does not appear to vanish, suggesting a discontinuous large-n backbone. Culberson and Gent have speculated that at the 3-COL threshold, although their spine is discontinuous, the backbone might be continuous. Our numerical results suggest otherwise.

We next study graph bipartitioning (GBP). This problem cannot, strictly speaking, be cast in the setup of random constraint satisfaction problems from Definition 1, since not every partition of the vertices of G is allowed. It can be cast to a variant of this model (with variables associated to nodes, values associated to each partition, and the constraint “x = y” associated to the edge between the corresponding vertices), but we must add the additional requirement that all satisfying assignments contain an equal number of ones and zeros. Thus the complexity-theoretic observations of Section 4 do not automatically apply to it. We can, however, give a “DPLL-like” class of algorithms for GBP that assigns vertices (variables) in pairs, one to each partition. This class of algorithms motivates investigating the backbone/spine under the model in Definition 3. It is easy to see that the spine of a GBP instance contains all edges belonging to a connected component of size larger than n/2. Since the phase transition in GBP takes place where the giant component becomes larger than n/2, GBP has a discontinuous spine. The backbone (Fig. 1b), on the other hand, appears to remain continuous, vanishing at large n on both sides of the threshold.
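The spine lower bound just described is easy to compute: a union-find pass collects every edge lying in a connected component of more than n/2 vertices. The sketch below is our own illustration of this observation, not code from the paper.

```python
from collections import Counter

def gbp_spine_edges(n, edges):
    """Edges whose connected component has size > n/2; by the argument
    in the text, all of them belong to the GBP spine (Definition 3)."""
    parent = list(range(n))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for u, v in edges:
        parent[find(u)] = find(v)
    size = Counter(find(u) for u in range(n))
    return [(u, v) for u, v in edges if size[find(u)] > n / 2]

# A 4-vertex path plus a separate edge, on 6 vertices: only the path's
# component (size 4 > 3) contributes spine edges.
print(gbp_spine_edges(6, [(0, 1), (1, 2), (2, 3), (4, 5)]))
# → [(0, 1), (1, 2), (2, 3)]
```

At mean degree above 2 ln 2 the giant component of a random graph exceeds n/2 w.h.p., so this lower bound alone already yields a linear-size spine, which is the discontinuity claimed in the text.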

One obvious question raised by the previous result is whether the discontinuity of the spine in GBP really has computational implications for the complexity of deciding whether a perfect bipartition exists. After all, unlike 3-COL, GBP can easily be solved in polynomial time by dynamic programming. This situation, however, is similar to that of XOR-SAT, where a polynomial algorithm exists but the complexity of resolution proofs/DPLL algorithms is exponential.

The class of “DPLL-like” algorithms outlined for GBP can no longer be simulated in the straightforward manner by resolution proofs; however, it can be simulated using proof systems Res(k) that are extensions of resolution [19]. Some of the hardness results for resolution extend to these more powerful


Figure 1: Plot of the estimated backbone fraction (a) for 3-COL (n = 32 to 512) and (b) for the GBP (n = 32 to 1024), on random graphs, as a function of mean degree α. For 3-COL, the systematic error based on benchmark comparisons with random graphs is negligible compared to the statistical error bars; for the GBP, f is found by exact enumeration. The thresholds α ≈ 4.73 for 3-COL and α = 2 ln 2 for the GBP are shown by dashed lines.

proof systems, and in [20] we investigate the extent to which the results of this paper apply to this class of proof systems. These preliminary results imply that, indeed, the discontinuity of the spine does have computational implications for GBP.

6 Discussion

We have shown that the existence of a discontinuous spine in a random satisfiability problem is often correlated with a 2^{Ω(n)} peak in the complexity of resolution/DPLL algorithms at the transition point. The underlying reason is that the two phenomena (the jump in the order parameter and the resolution complexity lower bound) have common causes.

The example of random k-XOR-SAT shows that a general connection between first-order phase transitions and the complexity of the decision problems is hopeless: Ricci-Tersenghi et al. [4] have presented a non-rigorous argument using the replica method that shows that this problem has a first-order phase transition, and we can formally show (as a direct consequence of Theorem 3) the following weaker result:

Proposition 2 Random k-XOR-SAT, k ≥ 3, has a discontinuous spine.

However, the experimental evidence in the previous section suggests that the backbone and the spine do not always behave in the same way. Therefore our results (and the work in progress mentioned above) suggest that the continuity/discontinuity of the spine, rather than the backbone, is a predictor for the complexity of the restricted classes of decision algorithms that can be simulated by “resolution-like” proof systems.

References

[1] R. Monasson, R. Zecchina, S. Kirkpatrick, B. Selman, and L. Troyansky. Determining computational complexity from characteristic phase transitions. Nature, 400(8):133–137, 1999.


[2] R. Monasson, R. Zecchina, S. Kirkpatrick, B. Selman, and L. Troyansky. (2 + p)-SAT: Relation of typical-case complexity to the nature of the phase transition. Random Structures and Algorithms, 15(3–4):414–435, 1999.

[3] P. Beame, R. Karp, T. Pitassi, and M. Saks. The efficiency of resolution and Davis-Putnam procedures. SIAM Journal on Computing, 31(4):1048–1075, 2002.

[4] F. Ricci-Tersenghi, M. Weigt, and R. Zecchina. Simplest random k-satisfiability problem. Physical Review E, 63:026702, 2001.

[5] R. Monasson and R. Zecchina. Statistical mechanics of the random k-SAT model. Physical ReviewE, 56:1357, 1997.

[6] B. Bollobas, C. Borgs, J. T. Chayes, J. H. Kim, and D. B. Wilson. The scaling window of the 2-SAT transition. Random Structures and Algorithms, 18(3):201–256, 2001.

[7] E. Ben-Sasson and A. Wigderson. Short Proofs are Narrow: Resolution Made Simple. Journal of the ACM, 48(2), 2001.

[8] O. Martin, R. Monasson, and R. Zecchina. Statistical mechanics methods and phase transitions in combinatorial optimization problems. Theoretical Computer Science, 265(1-2):3–67, 2001.

[9] B. Bollobas. Random Graphs. Academic Press, 1985.

[10] P. Beame and T. Pitassi. Propositional proof complexity: Past, present and future. In Current Trends in Theoretical Computer Science, pages 42–70. 2001.

[11] E. Friedgut. Necessary and sufficient conditions for sharp thresholds of graph properties, and the k-SAT problem. With an appendix by J. Bourgain. Journal of the A.M.S., 12:1017–1054, 1999.

[12] V. Chvatal and E. Szemeredi. Many hard examples for resolution. Journal of the ACM, 35(4):759–768, 1988.

[13] M. Molloy. Models for random constraint satisfaction problems. In Proceedings of the 32nd ACM Symposium on Theory of Computing, 2002.

[14] J. Culberson and I. Gent. Frozen development in graph coloring. Theoretical Computer Science,265(1-2):227–264, 2001.

[15] D. Mitchell. Resolution complexity of random constraints. In Eighth International Conference on Principles and Practice of Constraint Programming, 2002.

[16] M. Molloy and M. Salavatipour. The resolution complexity of random constraint satisfaction problems. In Proceedings of the 44th Symposium on Foundations of Computer Science, 2003.

[17] D. Achlioptas, A. Chtcherba, G. Istrate, and C. Moore. The phase transition in random 1-in-k SATand NAE 3-SAT. In Proceedings of the 13th ACM-SIAM Symposium on Discrete Algorithms, 2001.

[18] S. Boettcher and A. Percus. Nature’s way of optimizing. Artificial Intelligence, 119:275–286, 2000.

[19] J. Krajicek. On the weak pigeonhole principle. Fundamenta Mathematicae, 170(1–3):123–140, 2001.

[20] G. Istrate. Descriptive complexity and first-order phase transitions. (in progress).


Appendix

6.1 Proof of Claim 1

Let C be a clause that contains x. By the minimal unsatisfiability of Φ, Φ \ {C} ∈ SAT. On the other hand, (Φ \ {C}) ∧ x ∉ SAT, since a satisfying assignment that makes x true would also satisfy C, and hence Φ. Thus x ∈ S(Φ).

6.2 Proof of Corollary 1

To show that 3-SAT has a discontinuous spine it is enough to show that a random unsatisfiable formula contains w.h.p. a minimally unsatisfiable subformula containing a linear number of literals. A way to accomplish this is by using the two ingredients of the Chvatal-Szemeredi proof [12] that random 3-SAT has exponential resolution size w.h.p.

6.3 Proof of Theorem 1

Proof. By the analog of Claim 1 for the general case, if SAT(C) has a continuous spine then for every c > c_{SAT(C)} and for every α > 0, minimally unsatisfiable subformulas of a random formula Φ with constraint density c have w.h.p. size at most α · n. Consider the backtrack tree of the natural DPLL algorithm (that tries to satisfy clauses one at a time) on such a minimally unsatisfiable subformula F. By the usual correspondence between DPLL trees and resolution complexity (e.g., [3]) it yields a resolution proof of the unsatisfiability of Φ having size at most 2^{k·α·n+1}.

6.4 Proof of Theorem 2

Proof. We first recall the following concept from [12]:

Definition 6 Let x, y > 0. A k-uniform hypergraph with n vertices is (x, y)-sparse if every set of s ≤ xn vertices contains at most ys edges.

We also recall Lemma 1 from the same paper.

Lemma 6.1 Let k, c > 0 and y be such that (k − 1)y > 1. Then w.h.p. the random k-uniform hypergraph with n vertices and cn edges is (x, y)-sparse, where ε = y − 1/(k − 1) and x = (1/(2e) · (y/(ce))^y)^{1/ε}.

The critical observation is that the existence of a minimally unsatisfiable formula of size xn and with c*(F) > (1 + ε)/(k − 1) implies that the k-uniform hypergraph associated to the given formula is not (x, y)-sparse, for y = (1 + ε)/(k − 1). But, according to Lemma 6.1, w.h.p. a random k-uniform hypergraph with cn edges is (x0, y)-sparse, for x0 = (1/(2e) · (y/(ce))^y)^{1/ε} (a direct application of Lemma 1 in their paper). We infer that any formula with less than x0 · n/K constraints is satisfiable, therefore the same is true for any formula with x0 · n/K clauses picked up from the clausal representation of constraints in Φ.

The second condition (expansion of the formula hypergraph) can be proved similarly.


6.5 Proof of Theorem 3

Proof.

1. For any real r ≥ 1, formula F and set of clauses G ⊆ F, define the r-deficiency of G, δ_r(G) = r|Clauses(G)| − 2|Vars(G)|. Also define

δ*_r(F) = max{δ_r(G) : ∅ ≠ G ⊆ F}.   (5)

We claim that for any minimally unsatisfiable F, δ*_{2k−3}(F) ≥ 0. Indeed, assume this was not true. Then there exists such an F such that:

δ_{2k−3}(G) ≤ −1 for all ∅ ≠ G ⊆ F.   (6)

Proposition 3 Let F be a formula for which condition (6) holds. Then there exists an ordering C1, . . . , C_{|F|} of the constraints in F such that each constraint C_i contains at least k − 2 variables that appear in no C_j, j < i.

Proof. Denote by v_i the number of variables that appear in exactly i constraints of F. We have Σ_{i≥1} i · v_i = k · |Constraints(F)|, therefore 2|Var(F)| − v1 ≤ k · |Constraints(F)|. This can be rewritten as v1 ≥ 2|Var(F)| − k|Constraints(F)| > |Constraints(F)| · (2k − 3 − k) = (k − 3) · |Constraints(F)| (we have used the upper bound on c*(F)). Therefore there exists at least one constraint in F with at least k − 2 variables that are free in F. We set C_{|F|} = C and apply this argument recursively to F \ {C}.

Call the k − 2 new variables of C_i free in C_i. Call the other two variables bound in C_i. Let us now show that F cannot be minimally unsatisfiable. Construct a satisfying assignment for F incrementally: consider constraint C_j; at most two of the variables in C_j are bound for C_j. Since C has no implicates of size at most two, one can set the remaining variables in a way that satisfies C_j. This yields a satisfying assignment for F, a contradiction with our assumption that F was minimally unsatisfiable.

Therefore δ*_{2k−3}(F) ≥ 0, a statement equivalent to our conclusion.

2. To prove the resolution complexity lower bound we use the size-width connection for resolution complexity obtained in [7]: we prove that there exists η > 0 such that w.h.p. random instances of SAT(C) having constraint density c have resolution width at least η · n. We use the same strategy as in [7]:

(a) prove that w.h.p. minimally unsatisfiable subformulas are “large”, and

(b) any clause implied by a satisfiable formula of “intermediate” size contains w.h.p. “many” literals.

Indeed, define for an unsatisfiable formula Φ and (possibly empty) clause C

µ(C) = min{|Ξ| : Ξ ⊆ Φ, Ξ ⊨ C}.

Claim 2 There exists η1 > 0 such that for any c > 0, w.h.p. µ(∅) ≥ η1 · n (where Φ is a random instance of SAT(C) having constraint density c).


Proof. In the proof of Theorem 2 we have shown that there exists η0 > 0 such that w.h.p. any unsatisfiable subformula of a given formula has at least η0 · n constraints. Therefore any formula made of clauses in the CNF representation of constraints in Φ, and which has less than η0 · n clauses, is satisfiable, and the claim follows by taking η1 = η0.

The only (slightly) nontrivial step of the proof, which critically uses the fact that constraints in C do not have implicates of length at most two, is to prove that clause implicates of subformulas of “medium” size have “many” variables. Formally:

Claim 3 There exist d > 0 and η2 > 0 such that w.h.p., for every clause C such that (d/(2k)) · n < µ(C) ≤ (d/k) · n, we have |C| ≥ η2 · n.

Proof. Take ε > 0. It is easy to see that if c*(F) < 2/(2k − 3 + ε) then w.h.p. for every subformula G of F, at least (ε/3) · |Constraints(G)| of the constraints have at least k − 2 private variables: indeed, since c*(G) < 2/(2k − 3 + ε), by a reasoning similar to the one we made previously, v1(G) ≥ (k − 3 + ε)|Constraints(G)|. Since constraints in G have arity k, at least (ε/3) · |Constraints(G)| of them have at least k − 2 “private” variables.

Choose y = 2/(2k − 3 + ε) in Lemma 6.1, for ε > 0 a small enough constant. Since the problem has a sharp threshold in the region where the number of clauses is linear, d = inf{x(y, c) : c ≥ c_{SAT(C)}} > 0. W.h.p. all subformulas of Φ having size less than (d/k) · n have a formula hypergraph that is (x, y)-sparse, and therefore fall under the scope of the previous argument. Let Ξ be a subformula of Φ, having minimal size, such that Ξ ⊨ C. We claim:

Claim 4 For every constraint D of Ξ with k − 2 “private” variables (i.e., variables that do not appear in any other constraint of Ξ), at least one of these “private” variables appears in C.

Indeed, suppose there exists a constraint D of Ξ such that none of its private variables appears in C. Because of the minimality of Ξ there exists an assignment F that satisfies Ξ \ {D} but does not satisfy D or C. Since D has no implicates of size at most two, there exists an assignment G, differing from F only on the private variables of D, that satisfies Ξ. But since C does not contain any of the private variables of D, F coincides with G on the variables in C. The conclusion is that G does not satisfy C, which contradicts the fact that Ξ ⊨ C.

The proof of Claim 3 (and of item 2 of Theorem 3) follows: since for any clause K of one of the original constraints µ(K) = 1, since µ(∅) ≥ η1 · n, and since w.l.o.g. 0 < d < η1 (otherwise replace d with the smaller value), there exists a clause C such that

µ(C) ∈ [d/(2k) · n, d/k · n].   (7)

Indeed, let C′ be a clause in the resolution refutation of Φ minimal with the property that µ(C′) > (d/k) · n. Then at least one clause C of the two involved in deriving C′ satisfies equation (7).

By the previous claim, C contains at least one “private” variable from each constraint of Ξ. Therefore |C| ≥ η2 · n, with η2 = (d/(2k)) · ε.

Page 24: Artificial Intelligence and Mathematics - Rutgers Universityrutcor.rutgers.edu/~amai/aimath04/AcceptedPapers/aimath-I.pdf · Artificial Intelligence and Mathematics January 4-6, 2004

! " #$ %&!('

)(*,+.-0/21436573!893!+:/2;<*=1?>&@4+*A1BC ;D/2E)F;G>H*JIK3LM1?/NID3!+:E./N-PO;=QSR3T?*=EU*,-MV WYX*=E.;?Z[5U\^]M3_4-J`

aAbDbc `[LM1?/NID3!+:E./N-POdID3D`V WeX#*AE:;4Z4RgfhZ9iAjDjDk=lnm boa4p l

3!q*=/2Wsr tnq C 3!893!+:/2;uZ4Qvqw;x>H*JIK3y=z C E!`|-:3!_Y`~3J>?

AP[#~H|e|YP&P .¢¡N££~¥¤S¦e£M~0¤§~©¨ª

¨s.eª~£U|Uv£7¤¢~£S«v~S§¥¬®­©¨¨¦v#v¨s~¨§o.§~¦¥.7©¦§¥|¨sv¥v~£¨|¥.¯~°±vª§¥²³«v J¨~´¨§~§¥­¬=Gµ¨s.v¦¨s¶|:¤§·¡¸ª£¹4º¦~0°v§n| v£H£v¥¤»¨sH¤¼?Ke¦§¥~£~¥.§!§¥v~¤²

½ ¾o¿À?Á?ÂÄÃÅÆÀ?ÇÈÂ7¿ÉÊ¶Ë ÌÍ|ÎsÏÑÐÈÒÓÏ|ÎsÔPÒÓÏÖÕ×ÔÈÐPÏ»ØÙϻڢÊËgÕ0ÛÏÑÊÜMÝßÞà¼áÞ¶âã9Ô ÕÏÑËäÕ0ÎYÚ¢ÒÙ×ÔÈÒÙÏ»ÊÜ7Ë ÌÍ|ÎsÏÑ×Ï»Ë7ÔÈÊØÙϻڢÊÕÍ=Õ0ÍÑÎsÔPÒÓÊ®Õ:ÎsÏÑåÔÈØPæç ÎÓÒsÕ×JÏÑÎsÏÑڢʮÕÍ9ÕèèJÒÓÚÕÐéhê¸Ú¢Ò ÎÓéϻصèÒÙÚ¢ëÍÑÔÈËìÏÑØ ÎsÚíÌØÙÔ¶ÕØÓÏÑ˱èJÍ»Ôgã9ÔÈÏÑÜ¢éÎsÔÈ×îØÓÌJËìÎsé®Õ:αÔPÊØÓÌJÒÓÔÈØ7Í»ÚãÐÈÚ¢Ë7èÍ»Ôï!ÏÑÎðîÕÊ×<ÔÕØÙÔ¶ÚêSÌJØÓÔÕÊJ×ñãéÔPÒÓÔÔÈÕÐéwãòÔPϻܢéÎgÒÓÔÈèJÒÓÔÈØÙÔÈÊÎsØÎÓéÔݸØÓÌëóÙÔÈÐPÎÓÏÑåÔâ±ÏÑ˱è,Ú¢ÒÙÎsÕÊÐPÔÜ¢ÏÑåÔÈÊëðUÕ«×JÔÈÐÈÏÑØÓÏÑÚ¢ÊM˱ÕÛ0ÔÈÒ¼ÎsÚ7Õ«è®ÕÒvÎsÏ»ÐPÌÍÖÕ0Ò¼ÐÈÒÙÏÑÎsÔPÒÓÏÑÚ¢ÊDæáÔPØÓèÏ|ÎsÔYÏÑÎÓØ[ØÙÏ»Ë7èÍ»ÏÑÐÈÏ|Îð¢ôÎÓéÏ»Ø9ÕèèÒÙÚÕÐé±ØÓÌJõAÔÈÒÙؼÕ#˱Õ.óÙÚ¢ÒH×ÒÓÕãë®ÕÐÛ7ÕØ9ã9ÔYÐÕÊØÙéÚãÎsé®Õ:Î9ÌJØÓÏ»ÊJÜ Õ

ãòÔPϻܢéÎsÔP×UØÓÌËöÎsÚÔPå0ÕÍ»ÌÕ0ÎsÔYèÒÓÔê¸ÔÈÒÙÔÈÊÐPÔÈØÚå¢ÔÈÒ¼ÕØÓÔμÚêoË ÌJÍÑÎsÏÑ×Ï»Ë7ÔÈÊJØÓÏ»ÚÊ®ÕÍÕ0ÍÑÎsÔPÒÓÊ®Õ:ÎsÏÑåÔÈØ9Ï»ØòÔP÷ÌÏÑå0ÕÍÑÔÈÊÎÎsÚ7ÕØÓØÙÌË7Ï»ÊܵÎséJÔÏÑÊ×ÔPèAÔPÊ×ÔÈÊJÐÈÔÚêÐPÒÓÏ|ÎsÔÈÒÙÏÖÕJæøxÚ7èÒÓÔå¢ÔÈÊÎYÎséÏÑØèÒÓÚëÍ»ÔPËMô!ÊÚ¢Ê!ùÕ××Ï|ÎsÏÑåÔÕèèÒÙÚÕÐéÔPØãòÔPÒÓÔØÓÌÜ¢ÜÔÈØÙÎÓÔÈ×DæòÞMÚ¢ÒÙÔSØÓè,ÔÈÐPÏÑú®ÐÈÕÍ»Í|ð¢ôÊÚÊ!ù

Õ××Ï|ÎsÏÑåÔË7ÔÕØÙÌÒÓÔPØÕÊ×UÏ»ÊÎÓÔÈÜ¢ÒÓÕͻؼÐÈÕÊUë,ÔSÌØÓÔP×ÄÎÓÚµÒÓÔPèÒÓÔPØÓÔÈÊÎeèÒÓÔê¸ÔÈÒÙÔÈÊÐPÔÈØÈæ9ûeÊÎÓÏ»Í=ÒÙÔÈÐPÔÈÊÎsÍ|ð¢ôÎséJÏ»ØòãòÕØ×Ú¢ÊÔUÏÑÊÕÒsÕ0ÎÓéÔÈÒUüýþ®ÿãòÕðæYÚãòÔå¢ÔÈÒPô¼ÕÊhÕ.ïJÏÑڢ˱Õ0ÎsÏÕ0ÎÓϻڢÊíÚêYË ÌÍ|ÎsÏÑÐÈÒÓÏ|ÎsÔPÒÓÏÖÕU×ÔPÐÈÏÑØÓÏ»ÚÊËgÕÛÏÑÊÜÌØÓÏÑÊܵÎséÔ#àòéÚ÷ÌÔPÎeÏ»ÊÎsÔPÜ¢ÒsÕ0ÍHÝ2Õµè®ÕÒvÎsÏÑÐÈÌÍ»ÕÒ¼ÐÕ0ØÓÔÚêGÊÚ¢Ê!ùÕ××Ï|ÎsÏÑåÔSÏ»ÊÎsÔPÜ¢ÒsÕ0Í â[ãòÕØèÒÓÚåÏ»×JÔÈ׶ÏÑÊßæ6Ô[ÕÒÓÔ?ÊJÚ.ãÏ»ÊÎsÔPÒÓÔPØÙÎsÔP×ÏÑÊSèÒÓÚåÏ»×ÏÑÊÜYÕèÒsÕÐÎsÏÑÐÕÍ¢ØÙڢͻÌ!ÎsÏ»ÚÊ©ê¸Ú¢ÒKØÓÌJÐéèÒÙÚ¢ëÍ»ÔP˱ØKÌJØÓÏ»ÊJÜÎÓéÔ9àòéÚ÷ÌJÔPÎÏ»ÊÎsÔPÜ¢ÒsÕÍ2æ ç Ê ÏÑÊéÔPÒÓÔÈÊÎèÒÙÚ¢ëÍ»ÔPË Úê[ÊÚÊ!ùÕ×J×ÏÑÎÓÏÑå¢ÔµË7ÔÕ0ØÓÌÒÙÔÈØϻةÎÓéÔÈÏÑÒSÔï!è,Ú¢ÊÔÈÊÎÓÏÖÕÍ4ÐPÚ¢ØÙÎÈæ eÚã9ÔPå¢ÔPÒÈôãòÔãϻͻÍ=ØÓéÚã ÎséÕ0ÎeÎÓéÔÊÚÎsÏÑÚ¢ÊÚê.ùÕ××Ï|ÎsÏÑåÔS˱ÔÈÕØÓÌJÒÓÔÈØ ÝNØÙÔÈÔÖâÕÍ»ÍÑÚ.ãØÌØYÎsÚ7Í»ÏÑ˱Ï|ÎòÎséJÏ»ØYÐÈÚ¢ØvÎeÎÓÚÕ ÝâæòÔÈØÙÏ»×ÔPØÈô®ÕÐÈÚ¢Êå¢ÔPÊÏ»ÔPÊÎÒÓÔÈèJÒÓÔÈØÙÔÈÊÎÕ:ÎsÏ»ÚÊUÚêDÎséJÔSàòéÚ÷ÌÔPÎÏÑÊÎsÔÈÜÒsÕÍ®ãæ ÒÈæ Îæ?ÎsÚ#Ï»Ë7èAÚ¢ÒvÎÕÊJÐÈÔYÕÊ×Ï»ÊÎsÔPÒsÕÐÎsÏ»ÚÊgÏ»ÊJ×Ï»ÐPÔÈØÝ Îsé®Õ0Î9ãÏ»ÍÑÍ,ë,Ô×Ôú®ÊÔP×Äê¸ÌJÒÙÎséJÔÈÒâ?ãÏ»ÍÑÍ=ÕÍÑÍ»ÚãÌØ9ÎsÚ ÔïJèJÒÓÔÈØÙØÎséÔàòéJÚ!÷ÌÔÎÏ»ÊÎsÔPÜ¢ÒsÕ0ÍÏ»ÊÄÎsÔÈÒÙ˱ØÚêÐÈÚ¢Ë7èÍÑÔÈË7ÔÈÊÎÕÒvð¢ônÒÓÔP×ÌÊ×ÕÊÎÕÊJ׶ϻÊJ×ÔÈè,ÔÈÊ×JÔÈÊÎÐPÒÓÏÑÎÓÔÈÒÙÏÖÕ ãéÏ»ÐéMÏ»ØYÕ«Ê®Õ0ÎsÌJÒsÕÍoÔïÎÓÔÈÊØÙϻڢÊÚêGÎÓéÔSãòÔPϻܢéÎÓÔÈ׶ØÙÌËæÉÊuèÒsÕ0ÐPÎsÏÑÐÕÍSèJÒÓÚ¢ëÍÑÔÈË7ØÈôYãòÔ Ú¢ÊÍ|ðÒÓÔP÷ÌJÏ»ÒÓÔíÎséJÔ ×JÔÈÐÈÏÑØÓÏÑÚ¢Êu˱ÕÛ0ÔÈÒUÎsÚîèÒÓÚåÏ»×JÔ6ÏÑ˱è,Ú¢ÒÙÎsÕÊÐPÔÕÊ×

Ï»ÊÎsÔPÒsÕÐÎsÏ»ÚÊ ÏÑÊ×ÏÑÐÈÔÈØSãéÏÑÐéÕ0ÒÓÔ«ØÓÌ7ÐPÏ»ÔÈÊÎÎsÚ×ÔPúÊÔ±èJÒÓÔPê¸ÔPÒÓÔPÊÐÈÔPØÚå¢ÔPÒÎÓéÔgÕ0ÍÑÎsÔPÒÓÊ®Õ:ÎsÏÑåÔÈØÕØÍ»ÚÊÜÄÕØãòÔ#ÕØÓØÙÌË7ÔÎÓéÔ#Ë7ÔÕ0ØÓÌÒÙÔÎÓÚ±ë,Ô.ùÕ×J×ÏÑÎÓÏÑå¢Ô0æYÚãòÔå¢ÔÈÒPô=Ï|ÎÏ»ØÌÊÍÑÏ»ÛÔPÍÑðÎÓé®Õ0ÎÎséJÔ#×ÔPÐÈÏÑØÓÏ»ÚÊM˱ÕÛÔPÒÐÕÊÜ¢ÏÑåÔèJÒÓÔÈÐPÏ»ØÙÔå0ÕÍÑÌÔÈصê¸Ú¢ÒµÎséJÔÈØÓÔÏ»ÊJ×Ï»ÐPÔÈØÈæeÔå¢ÔPÒÙÎséJÔÈÍ»ÔPØÓØPô¼ÎséÏÑØ7ÏÑØ«ÊÚÎ±Õ Ë±Õ.óÙÚ¢ÒµèJÒÓÚ¢ëÍÑÔÈË Õ0Ø«ã9ÔÐÕÊÒÓÔÕ0ØÓڢʮÕ0ëÍÑðUÔï!èAÔPÐPÎÎséÔ×ÔPÐÈÏÑØÓÏ»ÚÊM˱ÕÛÔPÒ¼ÎsÚ«ë,Ô#Õ0ëÍ»ÔÎÓÚ7ÜÏÑå¢ÔÏÑÊÎÓÔÈÒvåÕ0Í»ØÚêxå0ÕÍ»ÌÔPØÈæøéJÔÈÒÓÔê¸Ú¢ÒÓÔ0ôDÎséÔ ÕÏ»Ë Úê?ÎÓéÏ»Øeè®Õè,ÔÈÒÏÑØeÎÓÚèÒÓÔPØÓÔPÊÎÕàòéÚ!÷ÌÔÎÏ»ÊÎsÔPÜ¢ÒsÕ0ÍxëÕØÓÔP× ÚÊÏ»ÊÎsÔPÒÙå0ÕÍÑØeÎséÕ0Î

ãϻͻÍ!ÕÍÑÍ»ÚãÌØ4ÎÓÚÔï!èÒÙÔÈØÓØHÏÑÊÎÓÔÈÒvåÕ0Í»Ø4ÚênèJÒÓÔPê¸ÔPÒÓÔPÊÐÈÔPØHê¸ÚÒ4Ë ÌÍ|ÎsÏ»×JÏ»Ë7ÔÈÊØÙϻڢʮÕ0ÍÕÍÑÎÓÔÈÒÙÊ®Õ0ÎsÏ|å¢ÔPØÝNÌÊJ×ÔÈÒ?ÒÓÕ0ÎséÔPÒãòÔÈÕÛÄÕØÓØÙÌË7èJÎsÏÑÚ¢ÊØâæ?øéÏ»ØãÏÑÍ»ÍDÕ0ͻͻÚã^ÌØÎsÚ7é®Õå¢ÔÕ«ØÓÏÑ˱èÍÑÔôð¢ÔPÎeÕÐPÐÈÌÒÓÕ0ÎsÔË7Ú!×ÔPÍDÚêGèÒÓÔê¸ÔÈÒÙÔÈÊÐPÔÈØÈæ

Page 25: Artificial Intelligence and Mathematics - Rutgers Universityrutcor.rutgers.edu/~amai/aimath04/AcceptedPapers/aimath-I.pdf · Artificial Intelligence and Mathematics January 4-6, 2004

ÉÊ ÎséÔ7ú®ÒÓØvÎèÕÒÙÎÚê¼ÎÓéÏ»Øè®Õ0èAÔPÒÈôoãòÔ±ÒÓÔÈÐÈÕÍ»Í?ÎÓéÔ±ÔPØÓØÙÔÈÊÎsÏ»ÕÍ»ØÚêÞ¶à¼áÞ ÕÊ× ÊÚ¢Ê!ùÕ××Ï|ÎsÏ|å¢Ô«Ï»ÊÎsÔùÜ¢ÒsÕ0ÎÓϻڢÊ=æøéÔPÊDô9ã9ÔÄèÒÓÔPØÓÔÈÊαϻÊÎsÔPÒÙå0ÕÍÑØ«ÕÊ×ÎséÔPϻҵڢè,ÔÈÒÓÕ0ÎsÏÑÚ¢ÊØPô9Õ0Ê×hÔPå¢ÔPÊÎsÌ®ÕÍÑÍÑðô[ã9ÔMØÙéÚãäéÚãÎsÚÐÈÚ¢Ë ëJÏ»ÊÔÎÓéÔÈØÙÔÎã9Ú7ÎÓéÔÈÚÒÓÏ»ÔPØÎsÚ«Ú¢ëJÎsÕÏ»ÊUÏÑÊÎÓÔÈÒvåÕ0ÍKÚêèJÒÓÔPê¸ÔPÒÓÔPÊÐÈÔPØ©Ï»ÊÕ7Þà¼áÞ ØÓÔÎÓÎsÏÑÊÜæ! "ìÂ7¿$#&% ÃÃÇÀ?Ç('*) + ),%.-4ÅÁ/)0- Ç¿ 1 24351687:9 ; <0=?>A@CBED&@?>FGD&@:HJI0FGBG@:K@:LNMPOJH/QR@:MTSUoÔÎÌJØ#ÐPÚ¢ÊØÓÏÑ×ÔÈÒ#ÕÄØÓÔPÎV WXVY,Z\[[[Z]V ^Jæ±ÉÊÕË ÌÍÑÎÓÏ»ÐPÒÓÏÑÎÓÔÈÒÙÏÖÕg×ÔPÐÈÏÑØÓÏ»ÚÊ6˱ÕÛÏ»ÊÜÄèÒÙÚ¢ëÍÑÔÈËôoÎÓéÔØÓÔPÎ V ÒÓÔPèÒÓÔPØÓÔÈÊÎÓرÎséJÔØÙÔPαÚê©ÕÍÑÎÓÔÈÒÓÊÕ0ÎsÏ|å¢ÔÈØPôN_a`5b dc [[[ c feMϻصÎÓéÔÄØÓÔPÎ7ÚêeÐÈÒÙÏÑÎÓÔÈÒÓÏ»ÕÚ¢Ò7Õ0ÎÓÎÓÒÓÏÑëÌJÎsÔPØÕÊ×íÎséÔ«ØÓÔÎgV*h[ÒÙÔÈèÒÙÔÈØÓÔPÊÎsØ#ÎÓéÔ«ØÓÔÎ#Ú0ê9å0ÕÍÑÌÔÈØê¸Ú¢ÒÎÓéÔ7Õ0ÎÓÎÓÒÓÏÑëÌJÎsÔjiÙæ«ÉÊ6Ü¢ÔPÊÔÈÒÓÕÍßôoÕU×ÔPÐÈÏÑØÓÏ»ÚÊ6˱ÕÛÔPÒé®ÕØÔPÊÚ¢ÌÜéÏ»ÊJê¸ÚÒÓ˱Õ0ÎsÏÑÚ¢ÊÎsÚMÚ¢ÒÓ×ÔPÒ#å0ÕÍÑÌÔÈØÚêÕ0ÎÙÎsÒÙÏ»ëÌJÎÓÔÈØ#Ï»Ê6Õ¶ØÓÔÎkV h ôxÎséÔPÒÓÔPê¸ÚÒÓÔôGãòÔgãÏÑÍ»ÍHÕØÙØÓÌË7ÔÎsé®Õ0μÔÕÐéØÓÔÎRV hKÏÑØÔÈÊ×JÚ.ã9ÔÈ×ÄãÏÑÎÓéÕãòÔÈÕÛÚ¢ÒÙ×ÔÈÒ0lmhæHûeÊJ×ÔÈÒÕµÒsÕ:ÎséÔPÒòãòÔÈÕÛUÕØÙØÓÌË7èJÎÓϻڢÊ6ÝNÊÕ˱ÔPÍÑðÚ¢ÒÓ×ÔPÒvù ØÓÔÈèÕÒsÕëJϻͻÏ|ÎðJâôê¸Ú¢ÒÕÍÑÍnipoq_nô!ÎÓéÔÈÒÙÔÔï!Ï»ØÙÎÓØÕ è®ÕÒvÎsÏÖÕ0Í=ÌJÎÓϻͻÏ|Îðµê¸ÌÊÐÎsÏ»ÚÊr&htsdV hu É vØÙÌÐéÎÓé®Õ0Îws

xzy h c| hGoV h9ô y htlmh hz~ rhÝ y h¸ârhÝ hÖâ Ý âÉÊhÞà¼áÞ6ôoãòÔÕÏ»Ë Õ0Îú®Ê×JÏ»ÊÜÕÄãòÔÈÕÛ Ú¢ÒÓ×ÔPÒl Úå¢ÔÈÒV ÎséÕ0Î ÏÑØqÙÐPÚ¢ÊØÙÏ»ØÙÎÓÔÈÊÎ|MãÏÑÎsé6ÎséÔ±è®ÕÒÙÎÓÏÖÕÍÚ¢ÒÓ×ÔPÒÓØPôÎsé®Õ0ÎYÏÑØÈôJã9ÔÕÒÓÔÍ»ÚÚ¢ÛÏ»ÊJÜê¸Ú¢ÒÕÊMÕܢܢÒÙÔÈÜÕ:ÎsÏ»ÚÊUÚ¢èAÔPÒsÕ0ÎÓÚ¢ÒsÉ v ^ u É v ØÓÌÐéÎsé®Õ:Îs

xzy c| oqV ô y l ~ 6ÝrEYÝ y YÓâ c [[[ c r^JÝ y ^¢âÓâ6ÝrYPÝ YÙâ c [[[ c r^JÝ ^âÙâ Ý¢âÉʶÎséÔ#ØÓÔP÷ÌJÔÈÍßô,ãòÔãÏ»ÍÑÍK×ÔÈÊÚ0ÎsÔÎÓéÔÜ¢ÍÑÚ¢ë®ÕÍKÌ!ÎsÏ»ÍÑÏÑÎð±ê¸ÌÊÐPÎÓϻڢÊMëðrGÝ y â`6ÝrYPÝ y YÓâ c [[[ c r^JÝ y ^âÓâ¼ê¸Ú¢Òy oVæp[ðgÐPÚ¢ÊØÓÏÑØÙÎÓÔÈÊÎôã9ÔS˱ÔÈÕÊgÎÓé®Õ0μÎÓéÔÐéÚ¢ÏÑÐÈÔÚ0êxÎÓéÔSÕÜÜ¢ÒÓÔPÜÕ0ÎsÏÑÚ¢ÊgÚèAÔPÒsÕ0ÎÓÚ¢ÒòØÙéÚ¢ÌÍÑ×UÒÓÔ(ÔÈÐPÎÎÓéÔèÒÓÔê¸ÔÈÒÓÔPÊÐÈÔPØ©ÚêxÎséJÔ×JÔÈÐÈÏÑØÓÏÑÚ¢ÊÄËgÕÛ0ÔÈÒPônÕÊ×UÎséÔPÒÓÔê¸Ú¢ÒÓÔØÓÚ¢Ë7ÔS×ÔPÜ¢ÒÓÔPÔÚ0ê4ØÓÌJëóÙÔÈÐÎsÏÑåÏ|Îð¢æç å¢ÔPÒÙðÊ®Õ0ÎÓÌÒsÕÍGÕÊ×MØÓÏÑ˱èÍÑÔÕèèÒÙÚÕÐéMê¸ÚÒeØÙÌÐéíÕ±èÒÙÚ¢ëÍÑÔÈË Ï»ØÎsÚ±ÌØÓÔÕ±ØÙÏ»Ë7èÍ»Ôã9ÔÈÏÑÜ¢éÎsÔÈ×ØÙÌËæ

øéÔ©×ÔPÐÈÏÑØÓÏ»ÚÊg˱ÕÛÔPÒòÏ»Ø9ÕØÙÛÔÈ×ÎÓÚèÒÙÚå!ÏÑ×Ôeã9ÔÈÏÑÜ¢éÎsØGhto c nÎÓé®Õ0Î9ÒÓÔ(ÔÈÐPÎÓØòÎséJÔ©Ï»Ë7è,Ú¢ÒÙÎsÕÊÐÈÔÚêDÔÕÐéÐÈÒÓÏ|ÎsÔPÒÓÏ»ÚÊÄÕÊ×ØÓÌÐéÎsé®Õ0Î ^hY Ehn` æ?øéÔÌJÎÓϻͻÏ|Îð«ê¸ÌÊÐPÎÓϻڢÊÄÏÑؼÎséÔPʶ×Ôú®ÊÔÈ×Mëð

xzy oV ôrGÝ y âp` ^ hY EhrhÝ y h¸â ÝâáÔÈØÓèJÏÑÎsÔÕ0ÊÄÕ0ÎÓÎÓÒsÕÐÎsÏÑåÔSØÓÏÑ˱èÍÑÏ»ÐPÏÑÎð7ÕÊ×ÄÍÑÚ.ã^ÐPڢ˱èJÍ»Ôï!Ï|Îð¢ôJÎÓéÏ»ØÕ0èèÒÓÚ¢ÕÐéÄØÓÌJõAÔÈÒÙØ©Õ Ë±Õ.óÙÚ¢Ò¼×JÒsÕãë®ÕÐÛnæÔÐÈÕÊ«ØÓéÚã<ÎÓé®Õ0Î?ÌJØÓÏ»ÊJÜÕÊ7Õ××Ï|ÎsÏÑåÔÕÜ¢ÜÒÓÔÈÜ¢Õ0ÎsÏÑÚ¢Ê Ú¢è,ÔÈÒÓÕ0ÎsÚ¢ÒGØÓÌÐégÕØ?Õã9ÔÈÏ»ÜéÎÓÔÈ׫ØÓÌJË Ï»Ø4ÔP÷ÌÏÑå0ÕÍÑÔÈÊÎÎsÚÕØÙØÓÌË7Ï»ÊÜÕ0Í»Í!ÎÓéÔÕ0ÎÙÎsÒÓÏÑëÌJÎÓÔÈØ4ÏÑÊ×ÔÈè,ÔÈÊJ×ÔÈÊΩÝÖâæÉÊ«èÒÓÕÐPÎÓÏ»ÐPÔô¢ÎséJÏ»Ø4ÏÑØ4ÊÚÎHÒÙÔÕÍÑÏ»ØÙÎÓÏ»Ð9ÕÊ×µÎséJÔÈÒÓÔê¸Ú¢ÒÓÔ0ôãòÔÊÔÈÔP×ÎÓÚµÎsÌÒÙÊÎÓÚ7ÊJÚ¢Ê!ùÕ0××ÏÑÎÓÏÑåÔÕèèÒÙÚÕÐéÔPØÈæ

2.2 Non-additive measures and integrals

For the sake of our applications, we restrict ourselves to the finite case. However, these definitions can be extended to infinite sets (see [3] for a detailed presentation of fuzzy integration). A non-additive integral is a sort of very general averaging operator that can take into account the behavior of a decision maker or expert (as we will see a bit further). To define non-additive integrals, we need a set of values of importance, this set being the values of the non-additive measure which the non-additive integral is computed from. That is, we need a value of importance for each subset of attributes. In the following definition, P(N) represents the power set of N.

Definition 1. Let N be the set of attributes (or any set in a general setting). A set function μ : P(N) → [0, 1] is called a non-additive measure (or fuzzy measure) if it satisfies the three following axioms:


(1) μ(∅) = 0: the empty set has no importance;
(2) μ(N) = 1: the maximal set has maximal importance;
(3) μ(A) ≤ μ(B) if A, B ⊆ N and A ⊆ B: a newly added criterion cannot make the importance of a coalition (a set of criteria) diminish.

Therefore, in our problem where card(N) = n, we need a value for every element of P(N), that is, 2^n values. Considering that the values of the empty set and of the maximal set are fixed, we actually need 2^n − 2 values or coefficients to define a non-additive measure. So, there is clearly a trade-off between complexity and accuracy. However, we will see that we can reduce the complexity significantly in order to guarantee that non-additive measures are usable in practical applications.

A non-additive integral is a sort of weighted mean taking into account the importance of every coalition of criteria.

Definition 2. Let μ be a non-additive measure on (N, P(N)) and f : N → R+ an application. The Choquet integral of f w.r.t. μ is defined by:

(C) ∫ f dμ = Σ_{i=1}^n (f(σ(i)) − f(σ(i−1))) μ(A_{σ(i)})

where σ is a permutation of the indices chosen so that f(σ(1)) ≤ . . . ≤ f(σ(n)), A_{σ(i)} = {σ(i), . . . , σ(n)}, and f(σ(0)) = 0 by convention. When there is no risk of confusion, we will write (i) for σ(i). It is easy to see that the Choquet integral is a Lebesgue integral up to a reordering of the indices. Actually, if the non-additive measure μ is additive, then the Choquet integral reduces to a Lebesgue integral.

2.3 Representation of preferences

We are now able to present how non-additive measures can be used in lieu of the weighted sum and other more traditional aggregation operators in a multicriteria decision making framework. It was shown that under rather general assumptions over the set of alternatives V, and over the weak orders ⪰_i, there exists a unique non-additive measure μ over N such that

∀x, y ∈ V, x ⪰ y ⟺ U(x) ≥ U(y) (4)

where

U(y) = Σ_{i=1}^n (u_{(i)}(y_{(i)}) − u_{(i−1)}(y_{(i−1)})) μ(A_{(i)}) (5)

which is simply the aggregation of the monodimensional utility functions using the Choquet integral w.r.t. μ. Besides, we can show that most aggregation operators can be represented by a Choquet integral (see [2]). This makes the Choquet integral a very general and powerful tool to represent preferences in an MCDM setting. However, we are still facing two crucial problems. First, the proof of the above result is not constructive. Second, as we have said before, evaluating a non-additive measure requires 2^n values. We are going to see that we can overcome these difficulties and that using non-additive measures (coupled with intervals) offers a nice solution to multicriteria decision making problems.

Let us start with a couple of definitions that will allow us to limit the complexity to an acceptable level. The global importance of a criterion is given by evaluating what this criterion brings to every coalition it does not belong to, and averaging this input. This is given by the Shapley value or index of importance (see [2]).

Definition 3. Let μ be a non-additive measure over N. The Shapley value of index i is defined by:

v(i) = Σ_{A ⊆ N\{i}} γ(A) (μ(A ∪ {i}) − μ(A))

with γ(A) = ((n − |A| − 1)! |A|!) / n!, where |A| denotes the cardinal of A.

The Shapley value can be extended to degree two, in order to define the indices of interactions (between attributes).

Definition 4. Let μ be a non-additive measure over N. The interaction index between i and j is defined by:

I(i, j) = Σ_{A ⊆ N\{i,j}} ξ(A) (μ(A ∪ {i, j}) − μ(A ∪ {i}) − μ(A ∪ {j}) + μ(A))

with ξ(A) = ((n − |A| − 2)! |A|!) / (n − 1)!.

The interaction indices belong to the interval [−1, 1] and:
- I(i, j) > 0 if the attributes i and j are complementary;
- I(i, j) < 0 if the attributes i and j are redundant;
- I(i, j) = 0 if the attributes i and j are independent.

Interactions of higher orders can also be defined, however we will restrict ourselves to

second order interactions which offer a good trade-off between accuracy and complexity. To do so, we define the notion of 2-additive measure.

Definition 5. A non-additive measure μ is called 2-additive if all its interaction indices of order three or larger are null and at least one interaction index of degree two is not null.

In this particular case of 2-additive measures, we can show that:

Theorem 1. Let μ be a 2-additive measure. Then the Choquet integral can be computed by:

(C) ∫ f dμ = Σ_{I(i,j)>0} min(f(i), f(j)) I(i, j) + Σ_{I(i,j)<0} max(f(i), f(j)) |I(i, j)| + Σ_{i=1}^n f(i) (v(i) − (1/2) Σ_{j≠i} |I(i, j)|) (6)
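Theorem 1 can be checked on a small numeric example. The sketch below is ours (the encoding of the measure as a dictionary on frozensets of criterion indices is an illustrative choice): it computes the Choquet integral both from its definition and from the 2-additive decomposition in terms of Shapley values and interaction indices, and the two agree for a 2-additive measure.

```python
from itertools import combinations

def choquet(f, mu):
    """Choquet integral of f w.r.t. mu, from the definition: sort the criteria
    by increasing score and weight each increment by mu of the upper level set."""
    order = sorted(range(len(f)), key=lambda i: f[i])
    total, prev = 0.0, 0.0
    for k, i in enumerate(order):
        total += (f[i] - prev) * mu[frozenset(order[k:])]
        prev = f[i]
    return total

def choquet_2additive(f, v, I):
    """Choquet integral from Shapley values v and pairwise interaction
    indices I (a dict keyed by frozenset pairs), as in Theorem 1."""
    n = len(f)
    total = 0.0
    for i, j in combinations(range(n), 2):
        ij = I[frozenset({i, j})]
        if ij > 0:
            total += min(f[i], f[j]) * ij       # conjunctive (complementary) part
        elif ij < 0:
            total += max(f[i], f[j]) * abs(ij)  # disjunctive (redundant) part
    for i in range(n):
        others = sum(abs(I[frozenset({i, j})]) for j in range(n) if j != i)
        total += f[i] * (v[i] - 0.5 * others)
    return total
```

For n = 2 with μ({0}) = μ({1}) = 0.4 and μ({0, 1}) = 1, the Shapley values are (0.5, 0.5) and I(0, 1) = 0.2; both computations give 0.44 on the scores (0.2, 0.8).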


Note that this expression justifies the above interpretation of interaction indices, as a positive interaction index corresponds to a conjunction (complementary) and a negative interaction index corresponds to a disjunction (redundant).

In the weighted sum case, we assume that the decision maker can provide us with the weights she/he puts on each criterion. However, we know that this model is inaccurate when trying to deal with dependencies. We could use a Choquet integral instead, as we have seen that it is a convenient and precise tool to model preferences. However, the complexity is very high. Therefore, in order to combine the best of the two worlds, we can ask the decision maker to give the Shapley values, as well as the interaction indices, and then use the reconstruction theorem to obtain the aggregation operator, which is a Choquet integral w.r.t. a 2-additive measure. Of course, we have to assume the measure to be 2-additive to use the theorem. However, this is not a serious limitation as the importance and the 2-order interaction are enough to give a thorough semantic interpretation of the results.

Nevertheless, such an approach raises another problem. How can we expect the decision maker to give a precise value for the importance and interaction indices? In order to overcome this hurdle, we introduce the concept of interval and see how it can be used efficiently to derive "intervals of preferences".

3 Interval Arithmetic

Interval Arithmetic (IA) is an arithmetic over sets of real numbers called intervals. IA has been proposed by Ramon E. Moore [6] in the late sixties in order to model uncertainty, and to tackle rounding errors of numerical computations. For a complete presentation of IA, we refer the reader to [1].

Definition 6 (Real interval). A real interval is a closed and connected set of real numbers. Every real interval x is denoted by [x̲, x̄], where its bounds are defined by x̲ = inf x and x̄ = sup x. In order to represent the real line with closed sets, R is compactified in the obvious way with the infinities {−∞, +∞}. The set of real intervals is denoted I.

Given a subset S of R, the convex hull of S is the real interval hull(S) = [inf S, sup S]. The width of a real interval x is the real number w(x) = x̄ − x̲. Given two real intervals x and y, x is said to be tighter than y if w(x) ≤ w(y). Elements of I^n define boxes. Given (x_1, . . . , x_n) ∈ I^n, the corresponding box is the Cartesian product of intervals x = x_1 × . . . × x_n. By misuse of notation, the same symbol is used for vectors and boxes. The above-mentioned notions are straightforwardly extended to boxes.

IA operations are set theoretic extensions of the corresponding real operations. Given x, y ∈ I and an operation ◦ ∈ {+, −, ×, ÷}, we have x ◦ y = hull({x ◦ y | (x, y) ∈ x × y}). Due to properties of monotonicity, these operations can be implemented by real computations over the bounds of intervals. For instance, given two intervals x = [a, b] and y = [c, d], we have x + y = [a + c, b + d].

3.1 Interval extensions

IA is designed to represent outer approximations of real quantities. The range of a real function f over a domain D, denoted f(D), can be computed by interval extensions.
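The bound computations just described can be sketched as follows; tuples (lo, hi) stand for intervals here, and this is our own minimal illustration (no outward rounding, not a full IA library):

```python
def iv_add(x, y):
    # [a, b] + [c, d] = [a + c, b + d]
    return (x[0] + y[0], x[1] + y[1])

def iv_sub(x, y):
    # [a, b] - [c, d] = [a - d, b - c]
    return (x[0] - y[1], x[1] - y[0])

def iv_mul(x, y):
    # take the min and max over the four products of the bounds
    p = (x[0] * y[0], x[0] * y[1], x[1] * y[0], x[1] * y[1])
    return (min(p), max(p))
```

For instance, iv_add((2.0, 3.0), (4.0, 6.0)) gives (6.0, 9.0). Note also that iv_sub((1.0, 2.0), (1.0, 2.0)) gives (-1.0, 1.0) rather than a degenerate zero interval; this behavior is the dependency problem discussed further on.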


Definition 7 (Interval extension). An interval extension of a real function f : D ⊆ R^n → R is a function g : I^n → I such that

∀x ∈ I^n, {f(x) | x ∈ D ∩ x} ⊆ g(x).

This inclusion formula is called the Fundamental Theorem of IA.

This definition implies the existence of infinitely many interval extensions of a given real function. In particular, the weakest and tightest extensions are respectively defined by: x ↦ [−∞, +∞] and x ↦ hull(f(D ∩ x)).

The most common extension is known as the natural extension. Natural extensions are obtained from the expressions of real functions, and are inclusion monotonic¹, which means that given a real function f, its natural extension, denoted g, and two intervals x and y such that x ⊆ y, then g(x) ⊆ g(y). Since natural extensions are defined by the syntax of real expressions, two equivalent expressions of a given real function f generally lead to different natural interval extensions. In Figure 1, we see that both interval functions define interval extensions of f. However, one function is clearly better.
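The effect can be reproduced numerically. Below is our own sketch of two natural extensions of the equivalent real expressions x(1 − x) and x − x·x; on [0, 1] the true range is [0, 0.25], and the factored expression yields the tighter enclosure:

```python
def iv_sub(x, y):
    return (x[0] - y[1], x[1] - y[0])

def iv_mul(x, y):
    p = (x[0] * y[0], x[0] * y[1], x[1] * y[0], x[1] * y[1])
    return (min(p), max(p))

def ext_factored(x):
    """Natural extension of the expression x * (1 - x)."""
    return iv_mul(x, iv_sub((1.0, 1.0), x))

def ext_expanded(x):
    """Natural extension of the equivalent expression x - x * x."""
    return iv_sub(x, iv_mul(x, x))
```

On x = (0.0, 1.0), ext_factored returns (0.0, 1.0) while ext_expanded returns (-1.0, 1.0): both are valid extensions of the same real function, but the syntactic form changes the tightness.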

Figure 1: Natural interval evaluations of two expressions of a real function f.

The overestimation problem, known as the dependency problem of IA, is due to the decorrelation of the occurrences of a variable during interval evaluation. For instance, given x = [a, b] with a ≠ b, we have x − x = [a − b, b − a] ≠ [0, 0]. An important result is Moore's theorem, known as the theorem of single occurrence.

Theorem 2 (Moore [6]). Let f be a real function f : D ⊆ R^n → R such that f(x_1, . . . , x_n) ↦ E(x_1, . . . , x_n), where E is a symbolic expression interpreted by f. If each x_i occurs only once in E, then

∀x ∈ I^n, f(D ∩ x) = g(x)

where g is the natural extension obtained from E.

Let us remark that interval computations are performed on computers, where real numbers are simulated by floating-point numbers. As a result, real intervals are simulated by real intervals whose bounds are floating-point numbers, called floating-point intervals. The set of such intervals is denoted I_F. The main difference between I and I_F is that computations over floating-point numbers need to be rounded. Floating-Point IA corresponds to Real IA where all intermediate results of interval computations are outward rounded as follows: [a, b] ∈ I ↦ [a⁻, b⁺] ∈ I_F, where a⁻ (resp. b⁺) is the largest (smallest) floating-point number smaller (greater) than or equal to a (b).

¹This property follows from the monotonicity of interval operations.


For expressions with single occurrences of variables, we commonly say that Moore's theorem is valid except on rounding. The overestimation due to rounding errors has to be distinguished from the overestimation that also happens on I. In the following, the set I_F will be simply denoted by I, and its elements will be called intervals.

4 Intervals of Preferences

As we have seen before, to define preferences over alternatives, the user is required to provide importance and interaction indices, but is more likely to establish intervals of values than precise values. In this section, we explain how such interval information can be integrated in the scheme of computation of the Choquet integral, by extending its definition to Interval Arithmetic.

Since the user is no longer asked for precise values of the indices v(i) and I(i, j), but for intervals, we consider intervals of values of these indices, and we respectively denote them v_i and I_ij, i, j ∈ {1, . . . , n}. As a consequence, the formula for the computation of the Choquet integral is now given by:

(IA) ∫ f dμ = Σ_{I_ij>0} min(f(i), f(j)) I_ij + Σ_{I_ij<0} max(f(i), f(j)) |I_ij| + Σ_{i=1}^n f(i) (v_i − (1/2) Σ_{j≠i} |I_ij|) (7)

where the annotation (IA) means that the interpretation of this formula is performed using IA. As a consequence, the value of the integral is an interval, which we hope is the tightest one regarding the interval information provided by the user. However, using IA means that overestimation of the range of real functions may occur, due to the above-mentioned dependency problem of IA. In particular, in the case of Equation (7), every interval variable I_ij occurs twice, with different monotonicities (once positively, once negatively), which inevitably leads to overestimating the expected range of values. Therefore, the right part of the formula is rewritten so as to obtain single occurrences only:

(IA) ∫ f dμ = Σ_{I_ij>0} (min(f(i), f(j)) − (1/2)(f(i) + f(j))) I_ij + Σ_{I_ij<0} (max(f(i), f(j)) − (1/2)(f(i) + f(j))) |I_ij| + Σ_{i=1}^n f(i) v_i (8)

This formula contains only single occurrences of interval variables, which is a guarantee to obtain the exact range of possible values, given the intervals of preferences of the user.

Two alternatives are then compared w.r.t. the corresponding interval values of their interval Choquet integral:

(IA) ∫ f dμ ≻ (IA) ∫ g dμ

This is interpreted as: the alternative f is preferred to the alternative g. It is worth noting that if the decision maker gives precise values for the importance and interaction indices, then the interval-based Choquet integral restricts to a standard Choquet integral and the intervals of preferences are real valued numbers.

²We will make the assumption, which is not restrictive, that the decision maker cannot give an interval whose interior contains 0, which would be contradictory information.
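A sketch (ours) of the single-occurrence interval form: f holds real scores, v interval Shapley values as (lo, hi) pairs, and I sign-definite interval interaction indices keyed by index pairs (i, j) with i < j; all names are illustrative.

```python
def scale(c, x):
    """Multiply the interval x = (lo, hi) by the real scalar c."""
    a, b = c * x[0], c * x[1]
    return (min(a, b), max(a, b))

def interval_choquet(f, v, I):
    lo = hi = 0.0
    for (i, j), (a, b) in I.items():
        if a > 0:   # complementary pair: coefficient (min - mean) <= 0
            c = min(f[i], f[j]) - 0.5 * (f[i] + f[j])
            s = scale(c, (a, b))
        else:       # redundant pair (b < 0): use |I_ij| = (-b, -a)
            c = max(f[i], f[j]) - 0.5 * (f[i] + f[j])
            s = scale(c, (-b, -a))
        lo, hi = lo + s[0], hi + s[1]
    for fi, vi in zip(f, v):
        s = scale(fi, vi)
        lo, hi = lo + s[0], hi + s[1]
    return (lo, hi)
```

With degenerate intervals the computation collapses to the real 2-additive value; widening the interaction index from the point value 0.2 to (0.1, 0.3) turns the result 0.44 into the interval (0.41, 0.47) for the scores (0.2, 0.8) and Shapley intervals (0.5, 0.5).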


However, we should also emphasize the fact that the above case is an ideal case where the intervals of preferences do not intersect and the preferences are clear. It may happen that:

(IA) ∫ f dμ ∩ (IA) ∫ g dμ ≠ ∅

In such a case, we need to define a degree of preference corresponding to the intersection of the intervals. We could use a trivial solution, which is to look at the upper bounds and give preference to the highest upper bound, which corresponds to an optimistic behavior, or to look at the lower bounds and give preference to the highest lower bound, which then corresponds to a pessimistic behavior. However, many other solutions between the very optimistic case and the very pessimistic case are possible. It is our feeling that we need to look simultaneously at the upper and lower bounds as well as the width of the intervals. Indeed, in many situations, the decision maker will exhibit some sort of aversion of risk and will want to have intervals as tight as possible, that is, restrict the degree of uncertainty. The definition of a degree of preference and the semantics attached to it is part of our future research.

5 Conclusion

In this paper, we have presented a simple computation scheme, combining the Choquet integral (in the 2-additive case) with interval arithmetic, that allows us to give intervals of preferences over multidimensional alternatives. The approach is very attractive as it reflects more accurately what we can really expect from a decision maker, yet remains simple and still allows us to represent dependencies between attributes, which is not possible with more traditional approaches such as the weighted sum. In the case where the intervals of preferences are disjoint, the order of alternatives is clearly

established. However, it is not as trivial in the (more probable) case where the intervals have an intersection. In this case, some more research is needed to give a consistent ordering of the preferences when the size of the intervals tends to 0, the limit case where there is no uncertainty on the values of importance and interaction.

References

[1] G. Alefeld and J. Herzberger. Introduction to Interval Computations. Academic Press, New York, 1983. Translated from the German Einführung in die Intervallrechnung.

[2] M. Grabisch, T. Murofushi, and M. Sugeno, editors. Fuzzy Measures and Integrals: Theory and Applications. Physica-Verlag, Heidelberg.

[3] M. Grabisch, H. T. Nguyen, and E. A. Walker. Fundamentals of Uncertainty Calculi with Applications to Fuzzy Inference. Kluwer Academic Publishers, Dordrecht, 1995.

[4] … and M. Grabisch. … In 8th Int. Conf. on the Foundations and Applications of Decision under Risk and Uncertainty (FUR VIII), Mons, Belgium, 1997.

[5] … and M. Grabisch. … the Choquet integral … In Proc. Int. Conf. on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU), Paris, France, July 1998.

[6] R. E. Moore. Interval Analysis. Prentice-Hall, Englewood Cliffs, N.J., 1966.


Using logic programs to reason about infinite sets

D. Cenzer∗ V.W. Marek† J.B. Remmel‡

Abstract

Using the ideas from current investigations in Knowledge Representation we study the use of a class of logic programs for reasoning about infinite sets. Those programs assert codes for various infinite sets. Depending on the form of atoms allowed in the bodies of clauses we obtain a variety of completeness results for various classes of arithmetic sets of integers.

1 Introduction

The motivation for this paper comes out of recent progress in logical foundations of Artificial Intelligence and, in particular, the area of Knowledge Representation. In the past few years, there has been significant progress in the theory and practice of Logic Programming. In particular, a whole new area called Answer Set Programming (ASP) has arisen which can be viewed as a fusion of Logic Programming with Stable Model Semantics (SLP) and satisfiability (SAT). Answer Set Programming has emerged as both a theoretical and practical basis for the development of a new generation of systems that are solidly grounded in the theory of Computer Science and capable of handling practical search problems arising in applications. The new generation of ASP systems such as smodels, dlv, and ASSAT [NSS99, EL+98, LZ02], which use both the native techniques of Logic Programming and the technology developed in SAT [MM+01, GN02], carry a lot of promise. Moreover, new types of constraints are introduced that allow for a more compact representation of problems. In such systems, the task of the programmer becomes easier because of the effort spent by the back-end processing engines.

The main motivation of this paper is to develop some extensions of the current ASP formalism that allow one to reason about infinite sets. The key idea is to use recursion-theoretic techniques to reason about various types of indices of finite, recursive and r.e. sets. In particular, we develop a new extension of Logic Programming, called Extended Set Based (ESB) Logic Programming, which allows constraints expressed in terms of such indices. We shall also briefly analyze the complexity of the question of when such an ESB program has a recursive model.

We will introduce these new types of constraints below. However, first it will be good to recall the basic definitions of answer set programming and some of its recent extensions such as cardinality constraint logic programming [NSS99] and set constraint logic programming [MR03].

A logic programming clause is a construct of the form

C = p ← q1, . . . , qm, not r1, . . . , not rn

∗Department of Mathematics, University of Florida, [email protected]. Corresponding author
†Department of Computer Science, University of Kentucky
‡Department of Mathematics, University of California, San Diego


where p, q1, . . . , qm, r1, . . . , rn are atoms. A logic program is a set of logic programming clauses. The atoms q1, . . . , qm, not r1, . . . , not rn form the body of C and the atom p is its head. A model of a clause C is a set of atoms M such that whenever M satisfies the body of C, then M also satisfies the head of C. The clauses C where n = 0 are called Horn clauses. A program entirely composed of Horn clauses is called a Horn program, and a Horn program always has a least model. It is the intended semantics of such a program. For programs with bodies containing the negation operator not, we will use the stable model semantics. Following [GL88], we define a stable model of the program as follows. Assume M is a collection of atoms. The Gelfond-Lifschitz reduct of P by M is a Horn program arising from P by eliminating those clauses in P which contain not r with r ∈ M. In the remaining clauses, we drop all negative literals from the body. The resulting program GLM(P) is a Horn program. We call M a stable model of P if M is the least model of GLM(P). In the case of a Horn program, there is a unique stable model, namely, the least model of P.
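The construction just described can be sketched directly. The clause encoding below (a triple of head, positive body, negative body) is our own illustrative choice, not notation from the paper:

```python
# A clause p <- q1,...,qm, not r1,...,not rn is encoded as (p, pos, neg),
# with pos = {q1,...,qm} and neg = {r1,...,rn}.

def least_model(horn):
    """Least model of a Horn program given as (head, body) pairs."""
    m, changed = set(), True
    while changed:
        changed = False
        for head, body in horn:
            if body <= m and head not in m:
                m.add(head)
                changed = True
    return m

def gl_reduct(program, m):
    """Drop clauses whose negative body meets m; strip negation elsewhere."""
    return [(p, pos) for (p, pos, neg) in program if not (neg & m)]

def is_stable(program, m):
    """m is a stable model iff it is the least model of the reduct."""
    return least_model(gl_reduct(program, m)) == m
```

For the program {p ← not q; q ← not p}, both {p} and {q} are stable models, while the empty set is not.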

It is the general consensus of the Knowledge Representation community that stable models are the intended models of logic programs. Once such a consensus emerged, it was natural for both theoreticians and logicians to study various complexity issues associated with stable models of logic programs. There has been an extensive effort by the community to investigate both the theoretical issues associated with stable models and the practical algorithms for processing.

The most general case of stable models of logic programs, those allowing for function symbols in atoms, turned out to be very hard. Recall that Horn programs have a least model which is r.e. in the (code for the) program. Starting with [AB90] and continuing with [BMS95] and [MNR94], a number of results showed that the stable models of logic programs that allow function symbols could be exceedingly complex, even in the case when there is a unique stable model. In particular, Marek, Nerode and Remmel [MNR94] showed that there exist finite predicate logic programs which have stable models but which have no hyperarithmetic stable model. While these results may at first glance appear negative, they had a positive effect in the long run since they forced researchers and designers to limit themselves to cases where programs can be actually processed. The effect was that processing programs (called solvers) had to focus on finite programs that do not admit function symbols.

The designers of the solvers have also focused on the issues of both improving processing of the logic programs (i.e. searching for a stable model) and improving the use of logic programs as a programming language. The latter task consists of extending the constructs available to the programmer to make programming easier and more readable. Various researchers discovered that it was possible to introduce meaningful extensions to the logic programming syntax and yet have such extensions be processed in a manner which is entirely analogous to the processing currently employed in case of logic programs proper. For example, let ω denote the set of natural numbers. Then a cardinality constraint atom is a constraint of the form kXl where X is a finite set of atoms and k and l are elements of ω ∪ {∞} such that k ≤ |X| ≤ l. The meaning of such an atom is that a putative model M satisfies kXl, written M |= kXl, if and only if k ≤ |M ∩ X| ≤ l. These atoms and related weight constraint atoms, where we have some weight function wt on X and a model M satisfies a constraint l ≤ X ≤ u if and only if l ≤ Σ_{p∈M∩X} wt(p) ≤ u, are special cases of more general set constraint atoms of the form 〈X,F〉 where F ⊆ 2^X. Here we say M satisfies 〈X,F〉 if and only if M ∩ X ∈ F. Many types of constraints can be expressed in the form 〈X,F〉. For instance, various constraints used by the SQL query language can be so represented. Set constraints have been introduced and investigated in [MR03].
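The three satisfaction relations just defined translate directly into set operations; the helper names below are ours:

```python
def sat_card(m, k, x, l):
    """M |= kXl  iff  k <= |M ∩ X| <= l."""
    return k <= len(m & x) <= l

def sat_weight(m, l, x, u, wt):
    """M |= l <= X <= u  iff  l <= sum of wt(p) over p in M ∩ X <= u."""
    return l <= sum(wt[p] for p in m & x) <= u

def sat_set(m, x, fam):
    """M |= <X, F>  iff  M ∩ X is a member of F."""
    return frozenset(m & x) in fam
```

For M = {1, 2, 5} and X = {1, 2, 3}, the constraint 1X2 is satisfied since |M ∩ X| = 2, and ⟨X, F⟩ holds exactly when {1, 2} ∈ F.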

Formally a set-constraint clause (or sc-clause) is an expression of the form

〈X,F〉 ← 〈X1,F1〉, . . . , 〈Xm,Fm〉


It is easy to see that ordinary logic programming can be reduced to set-constraint programming. That is, the meaning of an atom a is the same as that of the set constraint 〈{a}, {{a}}〉 and the meaning of not a is the same as 〈{a}, {∅}〉. Our definition of stable model is an extension of the version of the Gelfond-Lifschitz transform introduced by Niemela, Simons and Soininen [NSS99] for cardinality constraint programs and we call it the NSS transform. Again the process of constructing the model is based on some form of "Horn" programs, reduction, and least fixed points of the one-step provability operators for Horn programs. First, a family F of subsets of X is upper closed if Y ⊆ Z ⊆ X and Y ∈ F implies Z ∈ F. We will call an sc-clause Horn if

1. the head of that clause is a single atom (recall that atoms are represented as set constraints), and

2. whenever 〈Xi,Fi〉 appears in the body, then Fi is an upper closed family of subsets of Xi.

A set-constraint Horn program P is a set-constraint program which consists entirely of Horn clauses. There is a natural one-step provability operator associated with an sc-Horn program P, TP : 2^X → 2^X,

where X is the underlying set of atoms of the program, defined by: TP(S) equals the set of all p such that there is a clause

C = p ← 〈X1,F1〉, . . . , 〈Xm,Fm〉 ∈ P

such that S satisfies the body of C. It is easy to see that our definitions ensure that TP is a monotone operator and hence each sc-Horn program has a least model. It can be computed in a manner analogous to the computation of the least model of a definite Horn program as TP ↑ω (∅). The NSS transform NSSM(P) of the set-constraint program P for a given set of atoms M is defined as follows. First eliminate all clauses with bodies not satisfied by M. Next, for each remaining clause 〈X,F〉 ← 〈X1,F1〉, . . . , 〈Xm,Fm〉 and each p ∈ M ∩ X, put the clause p ← 〈X1, F̄1〉, . . . , 〈Xm, F̄m〉 into NSSM(P). Here F̄i is the least family G containing Fi and closed upwards. Clearly the resulting program NSSM(P) is an sc-Horn program and hence has a least model MNSSM(P). M is a stable model of P if M = MNSSM(P). It can be shown [NSS99] that this construction corresponds to the same notion of Gelfond-Lifschitz stable models when we restrict ourselves to ordinary logic programs.
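The fixed-point computation for sc-Horn programs can be sketched as follows, with a clause encoded as (head_atom, [(X1, F1), ...]) and each family given as a set of frozensets; the encoding is ours, and each family is assumed upper closed so that the operator is monotone:

```python
def t_p(program, s):
    """One application of the one-step provability operator T_P to s."""
    out = set()
    for head, body in program:
        if all(frozenset(s & x) in fam for x, fam in body):
            out.add(head)
    return out

def least_model_sc(program):
    """Iterate T_P from the empty set up to its least fixed point."""
    s = set()
    while True:
        nxt = t_p(program, s)
        if nxt == s:
            return s
        s = nxt
```

For the program {a ←; b ← ⟨{a}, {{a}}⟩}, the iteration yields {a} after one step and {a, b} at the fixed point.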

In this paper we would like to use the mechanism of set constraints to reason about infinite sets. But how could this be done? Certainly we cannot write down an entire infinite set. Nevertheless, there are means to encode infinite sets by finite means, for instance, various types of indices or codes.

So here is the idea. Assume that we have some particular coding scheme for some family of subsets of a set X. Let F be a finite family of such codes. We will write Fe for the set with code e. Then we can write two types of constraints. A constraint 〈X,F〉⊆ has the meaning that a putative set of integers M satisfies 〈X,F〉⊆ if and only if M ∩ X ⊇ Fe for some e ∈ F. Similarly, we shall also consider constraints of the form 〈X,F〉= where we say that M satisfies 〈X,F〉= if and only if M ∩ X = Fe for some e ∈ F. Observe that constraints of the form 〈X,F〉⊆ behave like atoms p in that they are preserved as the set grows, while constraints of the form 〈X,F〉= behave more like constraints not p in that they are not always preserved as the set grows. Now, it is clear that once we introduce these types of constraint schemes, we can consider various coding schemes for the set of indices. For example, in this paper, we will consider three such schemes: explicit indices of finite sets, recursive indices of recursive sets, and r.e. indices of r.e. sets.

We shall then define an extended set-based clause C to be a clause of the form

〈X,A〉∗ ← 〈Y1,B1〉⊆, . . . , 〈Yk,Bk〉⊆, 〈Z1, C1〉=, . . . , 〈Zl, Cl〉=. (1)

where ∗ is either = or ⊆, and we define an extended set-based (ESB) program P to be a set of extended set-based clauses.

Page 35: Artificial Intelligence and Mathematics - Rutgers Universityrutcor.rutgers.edu/~amai/aimath04/AcceptedPapers/aimath-I.pdf · Artificial Intelligence and Mathematics January 4-6, 2004

2 ESB Constraints, Clauses and Programs

In this section, we shall give the formal definitions of ESB constraints, clauses, and programs, and define the analogue of Horn programs and stable models for ESB programs. To describe our constraints, we first need to describe three different types of indices for subsets of the natural numbers.

(1) Explicit indices of finite sets. For each finite set F ⊆ ω, we shall define the explicit index of F as follows. The explicit index of the empty set is 0, and the explicit index of {x1 < . . . < xm} is 2^{x1} + · · · + 2^{xm}. We shall let Fn denote the finite set whose index is n.

(2) Recursive indices of recursive sets. Let φ0, φ1, . . . be an effective list of all partial recursive functions. By a recursive index of a recursive set R, we mean an e such that φe is the characteristic function of R. If φe is a total 0,1-valued function, then Re will denote the set {x ∈ ω : φe(x) = 1}.

(3) R.e. indices of r.e. sets. By an r.e. index of an r.e. set W, we mean an e such that W equals the domain of φe, that is, We = {x ∈ ω : φe(x) converges}.

No matter what type of indices we use, we shall always consider two types of constraints based on X and a set of indices F, namely 〈X,F〉= and 〈X,F〉⊆. For any subset M ⊆ ω, we shall say that M is a model of 〈X,F〉=, written M |= 〈X,F〉=, if there exists an e ∈ F such that M ∩ X equals the set with index e. Similarly, we shall say that M is a model of 〈X,F〉⊆, written M |= 〈X,F〉⊆, if there exists an e ∈ F such that M ∩ X contains the set with index e.
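The explicit index is just the encoding of a finite set by the binary number whose 1-bits sit at the set's elements. A quick sketch (the function names are ours, not the paper's):

```python
def explicit_index(s):
    """Explicit index of a finite set of naturals: 0 for the empty set,
    otherwise 2^{x_1} + ... + 2^{x_m} for s = {x_1 < ... < x_m}."""
    return sum(2 ** x for x in s)

def finite_set(n):
    """The inverse map F_n: the finite set whose explicit index is n,
    i.e. the positions of the 1-bits in the binary expansion of n."""
    return {i for i in range(n.bit_length()) if (n >> i) & 1}
```

For instance, the set {0, 2, 3} has explicit index 1 + 4 + 8 = 13, and the two maps are mutually inverse.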

Fix some recursive pairing function [ , ] : ω × ω → ω. For any sequence a1, . . . , an with n ≥ 2, we define the code c(a1, . . . , an) by the usual inductive procedure of setting c(a1, a2) = [a1, a2] and c(a1, . . . , an) = [a1, c(a2, . . . , an)] for n ≥ 3. The explicit index of the sequence ~s = (a1, . . . , an), ind(a1, . . . , an), is defined by induction: if n = 2, then ind(a1, a2) = [2, [a1, a2]], and if n ≥ 3, then ind(a1, . . . , an) = [n, c(a1, . . . , an)]. In this paper, we shall consider three different types of constraints.

(A) Finite constraints. Here we assume that we are given an explicit index x of a finite set X and a finite family F of explicit indices of finite subsets of X. Throughout this paper we shall identify the finite constraints 〈X,F〉= and 〈X,F〉⊆ with their codes, ind(0, 0, x, n) and ind(0, 1, x, n) respectively, where F = Fn. Here the first coordinate 0 tells us that the constraint is finite, the second coordinate is 0 or 1 depending on whether the constraint is 〈X,F〉= or 〈X,F〉⊆, and the third and fourth coordinates code X and F respectively.

(B) Recursive constraints. Here we assume that we are given a recursive index x of a recursive set X and a finite family R of recursive indices of recursive subsets of X. Again we shall identify the recursive constraints 〈X,R〉= and 〈X,R〉⊆ with their codes, ind(1, 0, x, n) and ind(1, 1, x, n) respectively, where R = Fn. Here the first coordinate 1 tells us that the constraint is recursive, the second coordinate is 0 or 1 depending on whether the constraint is 〈X,R〉= or 〈X,R〉⊆, and the third and fourth coordinates code X and R respectively.

(C) R.e. constraints. Here we are given an r.e. index x of an r.e. set X and a finite family W of r.e. indices of r.e. subsets of X. Again we shall identify the r.e. constraints 〈X,W〉= and 〈X,W〉⊆ with their codes, ind(2, 0, x, n) and ind(2, 1, x, n) respectively, where W = Fn. Here the first coordinate 2 tells us that the constraint is r.e., the second coordinate is 0 or 1 depending on whether the constraint is 〈X,W〉= or 〈X,W〉⊆, and the third and fourth coordinates code X and W respectively.
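As a concrete instance of this coding, here is a sketch that uses the Cantor pairing function for the recursive pairing [ , ]; the paper only fixes some recursive pairing, so this particular choice is our assumption:

```python
def pair(a, b):
    """A recursive pairing function [a, b]: here, the Cantor pairing."""
    return (a + b) * (a + b + 1) // 2 + b

def seq_code(seq):
    """c(a1,...,an): right-nested pairing, with c(a1,a2) = [a1,a2]."""
    if len(seq) == 2:
        return pair(seq[0], seq[1])
    return pair(seq[0], seq_code(seq[1:]))

def ind(*seq):
    """Explicit index of a sequence: ind(a1,...,an) = [n, c(a1,...,an)]."""
    assert len(seq) >= 2
    return pair(len(seq), seq_code(list(seq)))
```

A constraint code such as ind(0, 1, x, n) is then a single natural number from which the type tag, the ∗-flag, and the codes of X and F can all be recovered.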

Next we define an extended set-based clause C to be a clause of the form

〈X,A〉∗ ← 〈Y1,B1〉⊆, . . . , 〈Yk,Bk〉⊆, 〈Z1, C1〉=, . . . , 〈Zl, Cl〉=. (2)

where ∗ is either = or ⊆. We shall refer to 〈X,A〉∗ as the head of C, written head(C), and to 〈Y1,B1〉⊆, . . . , 〈Yk,Bk〉⊆, 〈Z1, C1〉=, . . . , 〈Zl, Cl〉= as the body of C, written body(C). Here either k or l may be 0. M is said to be a model of C if either M fails to model some constraint in body(C) or M |= head(C).


Again we shall talk about three different types of clauses.
(a) finite clauses: These are clauses in which all of the constraints are finite constraints.
(b) recursive clauses: These are clauses where all the constraints appearing in the clause are finite or recursive constraints and at least one constraint is a recursive constraint.
(c) r.e. clauses: These are clauses where all the constraints appearing in the clause are finite, recursive, or r.e. constraints and there is at least one r.e. constraint.

An extended set-based (ESB) program P is a set of clauses of the form of (2). We say that an ESB program P is recursive if the set of codes of the clauses in P forms a recursive set, where the code of a clause C of the form of (1) is ind(c, e1, . . . , ek, f1, . . . , fl), where c is the code of 〈X,A〉∗, ei is the code of 〈Yi,Bi〉⊆ for i = 1, . . . , k, and fj is the code of 〈Zj, Cj〉= for j = 1, . . . , l.

Given a program P, we let Fin(P) (respectively Rec(P), RE(P)) denote the set of all finite (respectively recursive, r.e.) clauses in P. It is easy to see from our coding of clauses that if P is a recursive ESB program, then Fin(P), Rec(P), and RE(P) are also recursive ESB programs.

We will say that a program P is recursive with finite constraints if P is a recursive program such that P = Fin(P). Similarly, we say that a program P is recursive with recursive constraints if P = Fin(P) ∪ Rec(P) and Rec(P) ≠ ∅, and a program P is recursive with r.e. constraints if P is a recursive program and RE(P) ≠ ∅. Finally, we say that P is weakly finite with recursive constraints if P is a recursive program with recursive constraints and the set of heads of clauses in Rec(P) is finite, and that P is weakly finite with r.e. constraints if P is a recursive program with r.e. constraints and the set of heads of clauses in Rec(P) ∪ RE(P) is finite.

Next we define the analogue of Horn programs for ESB programs. A Horn program P is a set of clauses of the form

〈X,A〉⊆ ← 〈Y1,B1〉⊆, . . . , 〈Yk,Bk〉⊆. (3)

where A is a singleton. We define the one-step provability operator TP : 2ω → 2ω so that for any S ⊆ ω, TP(S) is the union of the set of all De such that there exists a clause C ∈ P with S |= body(C), head(C) = 〈X,A〉⊆, and A = {e}, where De = Fe if C is a finite clause, De = Re if C is a recursive clause, and De = We if C is an r.e. clause. It is easy to see that TP is a monotone operator and hence there is a least fixed point, which we denote by MP. Moreover, it is easy to check that MP is a model of P.

If P is an ESB Horn program in which the body of every clause consists of finite constraints, then one can easily show that we reach the least fixed point of TP in ω steps, that is, MP = TP ↑ω (∅). However, if we allow clauses whose bodies contain either recursive or r.e. constraints of the form 〈X,G〉∗ where X is infinite and ∗ is either = or ⊆, then we can no longer guarantee that we reach the least fixed point of TP in ω steps. To this end, consider the following example.

Example 2.1 Let en be the explicit index of the set {n} for all n ≥ 0, let w be a recursive index of ω, and let f be a recursive index of the set of even numbers E. Consider the following program.

〈{0}, {e0}〉⊆ ←

〈{2x+ 2}, {e2x+2}〉⊆ ← 〈{2x}, {e2x}〉⊆ (for every even number 2x)

〈ω, {w}〉⊆ ← 〈E, {f}〉⊆

Clearly ω is the least model of P, but it takes ω + 1 steps to reach the fixed point. That is, it is easy to check that TP ↑ω = E and that TP ↑ω+1 = ω.
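The transfinite behaviour can be illustrated by a small simulation. The sketch below (our code, with ω truncated to the finite stand-in {0, . . . , 49}) iterates the one-step operator: every finite stage contains only even numbers, and the clause with head ω fires only once all of E is present, mirroring TP ↑ω = E and TP ↑ω+1 = ω:

```python
N = 50                          # truncation of ω, for simulation only
UNIVERSE = set(range(N))
EVENS = set(range(0, N, 2))     # the set E (truncated)

def step(S):
    """One application of T_P for the program of Example 2.1."""
    T = {0}                                               # fact for 0
    T |= {x + 2 for x in S if x % 2 == 0 and x + 2 < N}   # 2x gives 2x+2
    if EVENS <= S:                 # body <E,{f}>⊆ needs ALL of E at once
        T |= UNIVERSE              # head <ω,{w}>⊆ fires
    return T

S, stages = set(), []
while (T := step(S)) != S:
    stages.append(T)
    S = T
```

Every stage but the last is a set of evens; only the final stage, reached after E has been completed, is the whole (truncated) universe.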

Theorem 2.1 (a) If P is a recursive ESB Horn program with finite constraints, then the least fixed point of the one-step provability operator TP is r.e.


(b) If P is a weakly finite ESB Horn program with recursive constraints such that Fin(P) is recursive, then the least fixed point of the one-step provability operator TP is r.e.

(c) If P is a weakly finite program with r.e. constraints such that Fin(P) is recursive, then the least fixed point of the one-step provability operator TP is r.e.

Proof. Part (a) is essentially the usual proof that the least fixed point of a recursive Horn program is r.e.

For part (b), we note that we can construct the least fixed point as follows.

Step 1. First take Fin(P) and construct its least fixed point, which we will call U0. Since Fin(P) is recursive, U0 is r.e. Next consider the set T1 = U0 ∪ S0, where S0 is the union of the set of all Re such that there exists a clause C ∈ Rec(P) with U0 |= body(C) and head(C) = 〈X,R〉⊆ where R = {e}. Even though we cannot find S0 recursively, our hypothesis ensures that S0 is a finite union of recursive sets and hence is a recursive set. Thus T1 is an r.e. set. Now if S0 = ∅, then stop; T1 is the least fixed point of P. Otherwise go on to step 2.

Step n + 1. Consider the set Un = TFin(P) ↑ω (Tn). It is easy to see that since Tn is r.e., Un is r.e. Next consider the set Tn+1 = Un ∪ Sn, where Sn is the union of the set of all Re such that there exists a clause C ∈ Rec(P) with Un |= body(C) and head(C) = 〈X,R〉⊆ where R = {e}. Even though we cannot find Sn recursively, our hypothesis ensures that Sn is a finite union of recursive sets and hence is a recursive set. Thus Tn+1 is an r.e. set. Now if Sn = ∅, then stop; Tn+1 is the least fixed point of P. Otherwise go on to step n + 2.

Since the set of all head(C) with C ∈ Rec(P) is finite, it easily follows that this process must stop after a finite number of steps, and hence the least model of P is r.e.

The proof of part (c) is similar to the proof of part (b).

We note that there is an alternative way to obtain the least model of a weakly finite ESB program with recursive or r.e. constraints. Namely, if P is a recursive weakly finite program with either recursive or r.e. constraints, let H(P) denote the set of all head(C) where C ∈ Rec(P) ∪ RE(P). By definition, H(P) is a finite set consisting of constraints of the form CX,e = 〈X,A〉⊆ where A = {e}. In such a situation, we let SCX,e = Re if e is a recursive index and SCX,e = We if e is an r.e. index. If S ⊆ H(P), then let US = ⋃CX,e∈S SCX,e. Then the least model M of P is of the form

M = TFin(P) ↑ω (US)

for some S ⊆ H(P).

The hypothesis that P is a weakly finite Horn program with recursive or r.e. constraints is absolutely necessary for the proof of Theorem 2.1, as our next example will show.

Example 2.2 Suppose that we are given a sequence of pairwise disjoint infinite recursive sets Y, X0, A0, X1, A1, . . .. Let Y = {y0 < y1 < . . .}, Xe = {x0,e < x1,e < . . .} for each e ∈ ω, and Ae = {a0,e < a1,e < . . .} for each e ∈ ω. For all k ≥ 0, we shall let Xe,≥k = {xk,e < xk+1,e < · · · }.

Given an atom a, the finite set constraint 〈{a}, {n}〉⊆ where Fn = {a} is satisfied by a model M iff a ∈ M, so that the set constraint 〈{a}, {n}〉⊆ acts like an atom in a normal logic program. Thus in the following program, we shall abbreviate the finite set constraint 〈{a}, {n}〉⊆ by the atom a.

Let We^s denote the finite set of elements z less than or equal to s such that φe(z) converges in s or fewer steps. Now consider the following program.


(1) xn,e ← a[n,s],e for all n, s such that n ∈ We^s − We^{s−1}

(2) a[n,s],e ← for all n, s such that n ∈ We^s − We^{s−1}

(3) ye ← 〈Xe, {nk}〉⊆ for all k ≥ 0, where nk is a recursive index of Xe,≥k.

It is now easy to see that P is a recursive ESB Horn program such that the least model M of P equals {ye : We is cofinite} ∪ {xn,e : n ∈ We}. However, it is known [So87] that the set {e : We is cofinite} is a Σ^0_3-complete set, so that M is Σ^0_3-complete and hence is certainly not r.e.

Finally, we can define the analogue of a stable model for ESB programs. Given a model M and an ESB program P, we define the analogue of the GL-transform by saying that GLM(C), where C ∈ P is a clause of the form (1), is nil if M does not satisfy the body of C and is

〈X,F ′〉⊆ ← 〈Y1,B1〉⊆, . . . , 〈Yk,Bk〉⊆, 〈Z1, C1〉⊆, . . . , 〈Zl, Cl〉⊆. (4)

if M does satisfy the body of C, where F ′ = {e} and e is an explicit (recursive, r.e.) index of X ∩ M if 〈X,A〉∗ is a finite (recursive, r.e.) constraint. Then GLM(P) = {GLM(C) : C ∈ P} will be a Horn program. We then say that M is a stable model of P if M is a model of P and M equals the least model of GLM(P).

3 Complexity of least models of ESB Horn programs

There are many complexity issues that need to be explored with respect to ESB programs. In this section, we shall content ourselves with stating a few results about the complexity of ESB Horn programs. The arithmetic hierarchy is defined as usual, so that the recursive sets are both Σ^0_0 and Π^0_0, a Σ^0_{n+1} set is obtained by existential number quantification over a Π^0_n set, and a Π^0_{n+1} set is the complement of a Σ^0_{n+1} set. In particular, a set of natural numbers is Σ^0_1 if and only if it is an r.e. set.

Theorem 3.1 For any arithmetic set A, there is a recursive ESB Horn program PA whose least fixed point is of the same degree as A.

Theorem 3.2 Let M be a recursive set.

(a) If P is a recursive ESB Horn program with finite constraints, then the predicate ‘M is the least model of P’ is Π^0_2.

(b) If P is a recursive weakly finite ESB Horn program with recursive constraints, then the predicate ‘M is the least model of P’ is Σ^0_3.

(c) If P is a recursive weakly finite ESB Horn program with r.e. constraints, then the predicate ‘M is the least model of P’ is Σ^0_4 and in fact a Boolean combination of Σ^0_3 predicates.

It follows from Theorem 3.2 that the predicate that a recursive weakly finite ESB Horn program P with recursive constraints has a recursive least model is Σ^0_3. In fact, we can prove the following.

Theorem 3.3 Let P be a recursive weakly finite ESB Horn program with recursive constraints. Then the predicate ‘P has a recursive stable model’ is Σ^0_3-complete.


It turns out that there is a difference between recursive weakly finite ESB programs with r.e. constraints and recursive weakly finite ESB programs with recursive constraints. To give a more refined analysis for recursive weakly finite programs with r.e. constraints, let us say that a recursive Horn program P with r.e. constraints is n-weakly finite if there are n different heads of clauses from P − Fin(P). Next we define the difference of two Σ^0_3 sets, i.e. the conjunction of a Σ^0_3 set with a Π^0_3 set, to be a 1-Σ^0_3 set. Then by induction, we say that a set C is n-Σ^0_3 if C = A − B where A is Σ^0_3 and B is an (n − 1)-Σ^0_3 set. We can then prove the following.

Theorem 3.4 Let M be a recursive set and let P be a recursive n-weakly-finite ESB Horn program with r.e. constraints. Then the predicate ‘M is the least model of P’ is a (2^{n+1} − 1)-Σ^0_3 predicate.

Finally, we can give a completeness result for the case n = 1.

Theorem 3.5 If P is a recursive 1-weakly-finite ESB Horn program with r.e. constraints, then the predicate ‘P has a recursive stable model’ is 3-Σ^0_3-complete.

References

[AB90] K. Apt and H.A. Blair. Arithmetical classification of perfect models of stratified programs. Fundamenta Informaticae, 12:1–17, 1990.

[BMS95] H.A. Blair, V.W. Marek, and J. Schlipf. The expressiveness of locally stratified programs. Annals of Mathematics and Artificial Intelligence, 15(2):209–229, 1995.

[EL+98] T. Eiter, N. Leone, C. Mateis, G. Pfeifer, and F. Scarcello. The KR system dlv: Progress report, comparisons, and benchmarks. In Proceedings of the Sixth International Conference on Principles of Knowledge Representation and Reasoning (KR-98), pages 406–417, 1998.

[GL88] M. Gelfond and V. Lifschitz. The stable semantics for logic programs. In R. Kowalski and K. Bowen, editors, ICLP88, pages 1070–1080, 1988.

[GN02] E. Goldberg and Y. Novikov. BerkMin: a fast and robust SAT-solver. DATE-2002, pages 142–149, 2002.

[LZ02] F. Lin and Y. Zhao. ASSAT: Computing answer sets of a logic program by SAT solvers. AAAI-2002, pages 112–117, 2002.

[MNR94] W. Marek, A. Nerode, and J.B. Remmel. The stable models of predicate logic programs. Journal of Logic Programming, 21(3):129–154, 1994.

[MR03] V.W. Marek and J.B. Remmel. Set constraints in logic programming. LPNMR03, accepted for publication.

[MM+01] M.W. Moskewicz, C.F. Madigan, Y. Zhao, L. Zhang, and S. Malik. Chaff: engineering an efficient SAT solver. SAT 2001, 2001.

[NSS99] I. Niemela, P. Simons, and T. Soininen. Stable model semantics of weight constraint rules. LPNMR99, Springer Lecture Notes in Computer Science 1730, pages 317–331, 1999.

[So87] R.I. Soare. Recursively Enumerable Sets and Degrees. Springer-Verlag, 1987.


The Expressive Rate of Constraints

Hubie Chen

Abstract

In reasoning tasks involving logical formulas, high expressiveness is desirable, although it often leads to high computational complexity. We study a simple measure of expressiveness: the number of formulas expressible by a language, up to semantic equivalence. We prove a dichotomy theorem on constraint languages regarding this measure.

1 Introduction

In reasoning tasks involving logical formulas, there is a trade-off between computational complexity and the expressiveness of the language in which the formulas are specified. Consequently, much effort has been directed towards identifying language restrictions which yield tractable reasoning. In many cases, these language restrictions, which are typically defined syntactically, impose semantic restrictions on the sets of models that can be expressed. For instance, it is known that a class of propositional models is expressible as a Horn formula if and only if it is closed under the coordinate-wise boolean AND operation.
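This closure property is easy to check mechanically. The sketch below (our illustration, encoding CNF clauses as lists of signed 1-based variable indices) verifies that the models of a small Horn formula are AND-closed, while those of the non-Horn clause x1 ∨ x2 are not:

```python
from itertools import product

def models(cnf, n):
    """All assignments in {0,1}^n satisfying a CNF; each clause is a
    list of literals +i / -i referring to variable i (1-based)."""
    return {
        t for t in product((0, 1), repeat=n)
        if all(any(t[abs(l) - 1] == (1 if l > 0 else 0) for l in c)
               for c in cnf)
    }

def closed_under_and(ms):
    """Is a set of models closed under coordinate-wise boolean AND?"""
    return all(tuple(a & b for a, b in zip(s, t)) in ms
               for s in ms for t in ms)

horn = [[-1, 2], [-2, -3]]   # (x1 -> x2) and not(x2 and x3): Horn
non_horn = [[1, 2]]          # x1 or x2: two positive literals, not Horn
```

For the non-Horn clause, the models (0,1) and (1,0) have coordinate-wise AND (0,0), which is not a model, so the class is not AND-closed.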

A basic question thus arises: how expressive are various languages for specifying knowledge? That is, how rich are the classes of models that can be expressed by different languages? Among different languages yielding tractable reasoning, those that are most expressive will be of the greatest utility in many contexts. Consider the following two examples, from the area of knowledge compilation (see [2] for a survey). If one wants to preprocess a knowledge base specified in an intractable language into one specified in a tractable language, in order to perform rapid on-line query processing, the most expressive tractable languages are those that are most likely to be applicable. Or, if one wants to approximate a knowledge base specified in an intractable language by one (or more) knowledge bases specified in a tractable language, the most expressive tractable languages are those most likely to yield close approximations.

In this paper, we consider the most simple and fundamental measure of expressiveness that we can conceive of: we grade languages according to the number of different formulas that they can express, up to semantic equivalence. That is, we consider the number of different classes of models expressible by a language. We study expressiveness in the context of constraint languages, sets of relations which can be used to specify constraint networks. The notion of constraint language has been used heavily to identify tractable cases of the constraint satisfaction problem (CSP) and

Department of Computer Science, Cornell University, Ithaca, NY 14853, USA. Email:[email protected].



variants thereof [3], and can be used to describe restricted classes of propositional formulas such as Horn formulas and 2-SAT formulas.

The precise notion of expressibility we consider is asymptotic in nature: for a constraint language Γ, we consider the number of relations of arity n expressible by Γ, as n tends to infinity. On a finite domain D, the number of possible tuples of an n-ary relation is |D|^n, and so the number of different n-ary relations is equal to 2^{|D|^n}, which upper bounds this count by definition. Our main result is a dichotomy theorem on constraint languages over a two-element domain, showing that all such constraint languages are either polynomially expressive – that is, such that the number of expressible relations of arity n is polynomial in n – or exponentially expressive – that is, such that this number grows exponentially in n. It is worth emphasizing that the existence of such a dichotomy cannot be taken for granted in light of the many functions having growth rates that are intermediate between those that we identify – functions that are neither polynomial nor exponential (according to our definitions).1

In order to study the expressivity of constraint languages, we make use of the algebraic viewpoint on constraint languages that has been used to study the complexity of the CSP [7, 4]. We demonstrate a number of sufficient conditions for showing polynomial or exponential expressivity that apply to constraint languages over finite domains of all sizes. To establish our dichotomy theorem, we use these sufficient conditions in conjunction with Post’s classification theorem on function classes.

We remark that our dichotomy theorem has an alternative, rather mathematical formulation in terms of algebras. Namely, the theorem can be stated as a dichotomy theorem characterizing, for each algebra A with two elements, the asymptotics of the number of subalgebras of the n-fold direct product of A. Precisely, our dichotomy theorem implies that the number of subalgebras of the n-fold direct product is always polynomial or exponential in n (for two-element algebras A).

2 Preliminaries

We use the notation [n] as shorthand for the set {1, . . . , n}. We use ti to denote the ith coordinate of a tuple t.

2.1 Constraints

Definition 2.1 A relation of arity k over D is a subset of Dk. A constraint over D is an expression of the form R(v) where R is a relation over D and v is a tuple of variables with the same arity as R.

As usual, we consider an arity k constraint R(v1, . . . , vk) to be satisfied by an assignment f defined on the variables in (v1, . . . , vk) if (f(v1), . . . , f(vk)) ∈ R. Note that in this paper, we will only be concerned with constraints over finite domains of nontrivial size, that is, of size greater than or equal to two.

1It has similarly been pointed out that the existence of complexity dichotomy theorems such as Schaefer’s theorem [8] is not completely unsurprising in light of Ladner’s theorem, which implies the existence of sets of intermediate complexity between P and NP, under the assumption that P does not equal NP.



Definition 2.2 A constraint language is a set of relations, all of which have the same domain. A constraint language is said to be finite if it contains a finite number of relations.

Definition 2.3 When Γ is a constraint language, a Γ-formula is defined to be an expression of the form

∃y1 . . . ∃ym φ(x1, . . . , xn, y1, . . . , ym)

where φ is a finite conjunction of constraints, each of which has relation from Γ and variables from the set {x1, . . . , xn, y1, . . . , ym}. The variables x1, . . . , xn are said to be the free variables of the formula.

By =D, we denote the equality relation on the set D. The CSP over a particular constraint language can be defined in the following way.

Definition 2.4 Let Γ be a finite constraint language over a finite domain D. The CSP decision problem CSP(Γ) is to decide, given as input a Γ-formula with no free variables, whether or not the formula is true.

2.2 Expressivity

We formalize the class of relations expressible by a constraint language in the following way.

Definition 2.5 We say that an arity n relation R is expressible by the constraint language Γ if there exists a Γ-formula ψ with free variables x1, . . . , xn such that R is equivalent to ψ, by which we mean that an assignment f satisfies ψ if and only if (f(x1), . . . , f(xn)) ∈ R. We let 〈Γ〉 denote the set of all relations expressible by Γ.

Example 2.6 Define Γ to be the finite constraint language over the domain {0, 1} consisting of one relation of arity three and two relations of arity one. Every relation induced by a Horn clause can be expressed by Γ. In fact, it can be shown that for any relation R, the formula R(x1, . . . , xn) is equivalent to a Horn formula without existentially quantified variables if and only if R is in 〈Γ〉.

Definition 2.7 When Γ is a constraint language, let fΓ(n) denote the number of relations in 〈Γ〉 of arity n.

Observe that when the relations in Γ are over a finite set D, fΓ(n) is bounded above by 2^{|D|^n}. Our interest is in differentiating between two modes of growth in the functions fΓ(n), formalized in the following definitions.

Definition 2.8 We say that a function f (assumed to have the natural numbers as both domain and range) is polynomial in n if f(n) is O(n^k) for some k; we say that f is exponential in n if f(n) is Ω(2^{εn}) for some ε > 0.

Definition 2.9 A constraint language Γ is polynomially expressive if the number of relations of arity n expressible by Γ is polynomial in n, and is exponentially expressive if this number is exponential in n.



2.3 Algebra

In this section, we describe the dual algebraic viewpoint on constraint languages that we will use to study expressivity. This viewpoint has been used to study CSP complexity; we refer the reader to [7, 4] for more information.

Definition 2.10 When D is a set and k is a natural number, a mapping f : Dk → D is called an operation on D of rank k.

We let OD denote the set of all operations on D of finite rank, and let RD denote the set of all relations on D of finite arity.

Definition 2.11 A rank k operation f preserves an arity n relation R if for all tuples t1, . . . , tk in R, the tuple obtained by applying f coordinate-wise to t1, . . . , tk is also in R.
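Preservation is directly checkable by brute force on small relations. The sketch below (our code) confirms that coordinate-wise AND preserves the arity-three relation {0,1}^3 \ {(1,1,0)}, the relation of the Horn clause x ∧ y → z, while OR does not:

```python
from itertools import product

def preserves(f, k, R):
    """Does the rank-k operation f preserve the relation R?  Apply f
    coordinate-wise to every k-tuple of tuples of R and test that the
    result stays in R."""
    return all(
        tuple(f(*coords) for coords in zip(*ts)) in R
        for ts in product(R, repeat=k)
    )

horn3 = set(product((0, 1), repeat=3)) - {(1, 1, 0)}   # x and y -> z
AND = lambda a, b: a & b
OR = lambda a, b: a | b
```

OR fails because (1,0,0) and (0,1,0) are both in the relation, yet their coordinate-wise OR is the excluded tuple (1,1,0).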

Definition 2.12 For a set of operations F, define

Inv(F ) = {R : R is a relation on D of finite arity that is preserved by every operation in F}.

For a constraint language Γ, define

Pol(Γ) = {f : f is an operation on D of finite rank that preserves every relation in Γ}.

The Inv and Pol operators, in fact, form a Galois connection [6]. These operators are connected to the class of relations expressible by a constraint language in the following way.

Theorem 2.13 [4] For any set of relations Γ, the set of relations expressible by Γ is exactly Inv(Pol(Γ)).

From this theorem, Jeavons [4] demonstrated that the complexity of CSP(Γ) is an invariant of Pol(Γ) – that is, any two constraint languages Γ1, Γ2 such that Pol(Γ1) = Pol(Γ2) give rise to problems CSP(Γ1), CSP(Γ2) having exactly the same complexity. Similarly, we can deduce from this theorem that the expressive rate of a constraint language is an invariant of Pol(Γ).

Corollary 2.14 For any sets of relations Γ1, Γ2, if Pol(Γ1) = Pol(Γ2), then Γ1 and Γ2 express exactly the same number of relations of arity n (for all n).

In other words, the expressivity of Γ depends only on Pol(Γ). Indeed, our program of classifying the constraint languages according to their expressivity can be rephrased as a program of classifying the various sets of operations Pol(Γ) according to their expressivity, which we define in the following way.

Definition 2.15 When F is a set of operations, we say that F is polynomially expressive (exponentially expressive) if Inv(F ) is polynomially expressive (exponentially expressive).



The key feature of this definition is the following proposition.

Proposition 2.16 A constraint language Γ is polynomially expressive (exponentially expressive) if and only if Pol(Γ) is polynomially expressive (exponentially expressive).

The approach that we take in this paper is to prove results on the expressiveness of constraint languages by studying the sets of operations having the form Pol(Γ). Fortunately, such sets of operations have a particular structure: they are clones.

Definition 2.17 A set C of operations on D is a clone on D if the following two conditions hold.

1. C contains all projections, that is, operations of the form π(x1, . . . , xk) = xi (with 1 ≤ i ≤ k).

2. C is closed under composition, where the composition of a rank k operation f and rank m operations g1, . . . , gk is defined to be the rank m operation taking (x1, . . . , xm) to f(g1(x1, . . . , xm), . . . , gk(x1, . . . , xm)).

Proposition 2.18 For every constraint language Γ, the set of operations Pol(Γ) is a clone.
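The two clone conditions translate directly into code; a small sketch (with names of our own choosing) of projections and composition:

```python
def projection(k, i):
    """The rank-k projection onto the i-th coordinate (1-based)."""
    return lambda *args: args[i - 1]

def compose(f, gs):
    """Compose a rank-len(gs) operation f with operations gs, all of a
    common rank m, yielding the rank-m operation
    (x1,...,xm) -> f(g1(x1,...,xm), ..., gk(x1,...,xm))."""
    return lambda *xs: f(*(g(*xs) for g in gs))
```

For example, composing binary AND with the first and third rank-3 projections yields the ternary operation (x, y, z) -> x ∧ z, which therefore lies in any clone containing AND.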

We remark that studying the expressiveness of constraint languages is equivalent to studying an asymptotic question concerning algebras. The number of subalgebras of the n-fold direct product of an algebra (D, F ) is exactly the number of arity n relations in Inv(F ). Moreover, the expressive rate of a constraint language Γ is the number of subalgebras of the n-fold direct product of the algebra (D, Pol(Γ)).

Proposition 2.19 Let (D, F ) be an algebra, that is, a domain D paired with a set of operations F on D. A relation R ⊆ Dn is the universe of a subalgebra of the n-fold direct product of (D, F ) if and only if R ∈ Inv(F ).

Proposition 2.20 Let Γ be a constraint language over D. A relation R ⊆ Dn is expressible by Γ if and only if R is the universe of a subalgebra of the n-fold direct product of (D, Pol(Γ)).

3 Classification tools

In this section, we establish a number of conditions under which the expressivity of a set of operations can be deduced.

Definition 3.1 An operation f : Dm → D is a-implicative (for a ∈ D) if there exists i ∈ [m] such that for all (d1, . . . , dm) ∈ Dm, if di = a, then f(d1, . . . , dm) = a. A set of operations is a-implicative (for a ∈ D) if every operation in it is a-implicative.

Theorem 3.2 Suppose that F is an a-implicative set of operations for some a ∈ D. Then, F is exponentially expressive.



Proof. Fix b ∈ D with b ≠ a. For every subset S ⊆ [n], let tS denote the n-tuple whose ith coordinate is a if i ∈ S and b otherwise, and define RS to be the smallest relation in Inv(F ) containing tS. We claim that every tuple of RS carries the value a on all coordinates in S. Observe that when f is an a-implicative operation with respect to coordinate i, the coordinates at which f(u1, . . . , um) is equal to a form a superset of the coordinates at which ui is equal to a. Thus, in closing {tS} under the operations of F, only tuples having a on all coordinates of S are ever added, and the claim follows by induction. It follows that the map taking S to RS is injective: tS belongs to RS, and if RS = RS′ then tS ∈ RS′ forces S ⊇ S′, and symmetrically S′ ⊇ S. Hence Inv(F ) contains at least 2^n relations of arity n, and F is exponentially expressive.

As an example application of this theorem, we can show that any semilattice operation is exponentially expressive. Recall that a semilattice operation is a binary operation that is associative, commutative, and idempotent.

Corollary 3.3 If f : D^2 → D is a semilattice operation over a finite set D, then {f} is exponentially expressive.

Proof. It is known that for every finite semilattice (D, f), there is an element a ∈ D such that f(a, x) = a for all x ∈ D. Hence, f is a-implicative, and exponential expressivity of {f} follows from Theorem 3.2.
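As a concrete check of the corollary, the following sketch verifies the semilattice axioms for max over a small finite chain (the domain {0, 1, 2} is an illustrative choice) together with the absorbing-element property used in the proof, under the reading of Definition 3.1 in which fixing one coordinate to b forces the output to b:

```python
from itertools import product

D = [0, 1, 2]  # an illustrative finite chain

def f(x, y):
    return max(x, y)  # the join operation of the chain

# Semilattice axioms: associativity, commutativity, idempotence.
assert all(f(f(x, y), z) == f(x, f(y, z)) for x, y, z in product(D, repeat=3))
assert all(f(x, y) == f(y, x) for x, y in product(D, repeat=2))
assert all(f(x, x) == x for x in D)

# Absorbing element a = max(D): f(a, x) = a for all x, so f is a-implicative.
a = max(D)
assert all(f(a, x) == a for x in D)
print("max is a semilattice operation with absorbing element", a)
```

The same check works verbatim for any finite total order, since max is its join.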

Definition 3.4 An operation f : D^k → D is essentially unary if there exists i ∈ {1, ..., k} and a unary operation g : D → D such that for all (a_1, ..., a_k) ∈ D^k, f(a_1, ..., a_k) = g(a_i). A set of operations F is essentially unary if every operation in F is essentially unary.

Theorem 3.5 Suppose that F is an essentially unary set of operations. Then, F is exponentially expressive.

Proof. Let G be the set of all unary operations corresponding to the operations in F, and enlarge G if necessary so that it contains the identity and is closed under composition. It suffices to show that G is exponentially expressive. For every n-tuple t ∈ D^n, define the orbit of t to be the set of n-tuples {g(t) : g ∈ G}, where g is applied coordinatewise. Select a set T ⊆ D^n so that the set {orbit(t) : t ∈ T} contains all (and only) maximal sets from the set system {orbit(t) : t ∈ D^n}, and such that for all distinct tuples t, t' ∈ T, orbit(t) ≠ orbit(t'). In other words, for every maximal orbit O in the set system, T contains exactly one tuple t such that orbit(t) = O.

Observe that every tuple is contained in some maximal orbit, that is, for every u ∈ D^n, there exists t ∈ T such that u ∈ orbit(t). Since the size of each maximal orbit is bounded above by |G|, it follows that |T| ≥ |D|^n / |G|. Now, we claim that for any subset S ⊆ T, the smallest relation R_S ∈ Inv(G) containing S has the property that R_S ∩ T = S. This is because R_S = ∪_{t ∈ S} orbit(t), so if there were a tuple t' ∈ (R_S ∩ T) \ S, there would exist t ∈ S such that t' ∈ orbit(t), implying that orbit(t') would be either non-maximal or equal to orbit(t) for some t ∈ S distinct from t', a contradiction to the definition of T.

Therefore, the mapping taking S to R_S is injective, and the number of n-ary relations in Inv(G) is bounded below by 2^|T|, which in turn is bounded below by 2^(|D|^n / |G|).

Definition 3.6 An operation f : D^k → D of rank k ≥ 3 is a near unanimity operation if for all (a_1, ..., a_k) ∈ D^k, if there exists x ∈ D and i ∈ {1, ..., k} such that a_j = x for all j ≠ i, then f(a_1, ..., a_k) = x.



In other words, f is a near unanimity operation if for all x, y ∈ D, f(y, x, x, ..., x) = f(x, y, x, ..., x) = ··· = f(x, x, ..., x, y) = x.
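To illustrate the definition, the Boolean majority operation is the standard rank-3 near unanimity operation; a small exhaustive check in Python (the encoding over {0, 1} is illustrative):

```python
from itertools import product

D = [0, 1]

def maj(x, y, z):
    # Boolean majority: the value taken by at least two of the arguments.
    return (x & y) | (y & z) | (x & z)

# Near unanimity: whenever all arguments but at most one equal x, output x.
for x, odd, pos in product(D, D, range(3)):
    args = [x, x, x]
    args[pos] = odd  # perturb a single coordinate
    assert maj(*args) == x
print("maj is a rank-3 near unanimity operation")
```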

Theorem 3.7 If F is a set of operations containing a near unanimity operation, then F is polynomially expressive.

Proof. By the characterization in [5] of relations closed under a near-unanimity operation of rank k, every relation in Inv(F) can be expressed by a formula with no existentially quantified variables and containing only relations of arity less than or equal to k − 1. On n variables, the number of such formulas can be bounded above by 2 raised to a polynomial in n. It follows that F is polynomially expressive, since every relation in Inv(F) is described by such a formula.

Definition 3.8 An operation f : D^3 → D is a coset operation if it is of the form f(x, y, z) = x ∗ g(y) ∗ z, where ∗ is a binary operation and g is a unary operation such that (D, ∗, g) is a group with multiplication ∗ and inverse g.

Theorem 3.9 If F is a set of operations containing a coset operation, then F is polynomially expressive.

Proof. Suppose that R is an arity-n relation in Inv(F). It is straightforward to verify that R is a coset of a subgroup of the group D^n. We show that any subgroup H of D^n can be generated by a set of size at most log_2 |H|, from which it follows that any R ∈ Inv(F) of arity n has a polynomial (in n) size description, and hence that F is polynomially expressive.

If H has no nontrivial proper subgroups, then H is generated by a single element, and the claim is obvious. Otherwise, let K be a maximal proper subgroup of H. The subgroup H is generated by any element of H \ K combined with any set of generators for K. By Lagrange's theorem, |K| ≤ |H| / 2, and so by induction K can be generated by a set of size at most log_2 |H| − 1.

4 Dichotomy Theorem

Theorem 4.1 For every constraint language Γ over a domain of size two, Γ is either polynomially expressive or exponentially expressive.

Before proving this theorem, we fix some notation and conventions. Let D be the domain {0, 1}. Let c_0 and c_1 denote the unary constant operations equal to 0 and 1, respectively. Let ¬ denote the usual NOT operation, and let ∨, ∧, and ⊕ denote the usual binary operations OR, AND, and XOR (over {0, 1}). Let IMP denote the binary operation defined by IMP(x, y) = ¬x ∨ y. Let p be the ternary operation defined by p(x, y, z) = x ⊕ y ⊕ z, and let maj be the majority operation over {0, 1}, that is, the ternary operation defined by maj(x, y, z) = (x ∧ y) ∨ (y ∧ z) ∨ (x ∧ z). Let h be the k-ary threshold operation (for k ≥ 3) that returns 1 if and only if at least k − 1 of its arguments equal 1; both h and its dual are near unanimity operations.

When F is a set of operations, we use Clo(F) to denote the clone generated by F. For an operation f of rank k, let dual(f) denote the dual of f, that is, the operation defined by dual(f)(x_1, ..., x_k) = ¬f(¬x_1, ..., ¬x_k). The dual of a clone C is the set of operations {dual(f) : f ∈ C}, which is easily verified to be a clone.
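A quick sanity check of the dual operation over {0, 1} (a sketch; the function names are illustrative):

```python
from itertools import product

def dual(f):
    # dual(f)(x1, ..., xk) = NOT f(NOT x1, ..., NOT xk), with NOT x = 1 - x.
    return lambda *xs: 1 - f(*(1 - x for x in xs))

AND = lambda x, y: x & y
OR = lambda x, y: x | y

pairs = list(product([0, 1], repeat=2))
assert all(dual(AND)(x, y) == OR(x, y) for x, y in pairs)       # dual(AND) = OR
assert all(dual(dual(OR))(x, y) == OR(x, y) for x, y in pairs)  # dualling twice is the identity
print("dual(AND) = OR and dual(dual(OR)) = OR")
```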



Proof. We use Post's classification theorem to demonstrate that every clone C over the domain {0, 1} is either polynomially expressive or exponentially expressive, from which the theorem follows by appeal to Proposition 2.16 and Proposition 2.18. We adopt the notation of [1], to which we refer the reader for a nice description of Post's theorem. Assume that C is a clone; the following cases are exhaustive, up to duality [1].

Case C ⊆ Clo(∧): When C ⊆ Clo(∧), Corollary 3.3 implies the exponential expressivity of C; a similar argument applies to C in the case that C ⊆ Clo(∨).

Case C ⊆ Clo(¬): Every operation in Clo(¬), and hence in C, is essentially unary, from which the exponential expressivity of C follows from Theorem 3.5.

Case C ⊆ Clo(IMP): Every operation in Clo(IMP), and hence in C, is 1-implicative, from which the exponential expressivity of C follows from Theorem 3.2.

Case C ⊇ Clo(dual(h)): The operation dual(h) is a near unanimity operation, from which the polynomial expressivity of C follows from Theorem 3.7.

Case C ⊇ Clo(p): The ternary operation p(x, y, z) = x ⊕ y ⊕ z is a coset operation, from which the polynomial expressivity of C follows from Theorem 3.9.

Case C ⊇ Clo(maj): The majority operation maj is a near unanimity operation, from which the polynomial expressivity of C follows from Theorem 3.7.

Note that throughout, we have implicitly used the fact that for sets of operations F and F', F ⊆ F' implies that Inv(F') ⊆ Inv(F).

It can be deduced from the proof of Theorem 4.1, in conjunction with Schaefer's theorem, that for every finite constraint language Γ over a two-element domain, if Γ is polynomially expressive, then the problem CSP(Γ) is tractable, that is, decidable in polynomial time. We conjecture that this phenomenon also occurs in larger domains; that is, for all constraint languages over a finite domain, polynomial expressivity implies tractability.

References

[1] E. Böhler and H. Vollmer. Boolean functions and Post's lattice with applications to complexity theory. Lecture Notes for Logic and Interaction, 2002.

[2] Marco Cadoli and Francesco M. Donini. A survey on knowledge compilation. AI Communications, 10(3-4):137–150, 1997.

[3] Nadia Creignou, Sanjeev Khanna, and Madhu Sudan. Complexity Classification of Boolean Constraint Satisfaction Problems. SIAM Monographs on Discrete Mathematics and Applications. Society for Industrial and Applied Mathematics, 2001.

[4] Peter Jeavons. On the algebraic structure of combinatorial problems. Theoretical Computer Science, 200:185–204, 1998.


[5] Peter Jeavons, David Cohen, and Martin Cooper. Constraints, consistency, and closure. Artificial Intelligence, 101(1-2):251–265, 1998.

[6] Peter Jeavons, David Cohen, and Justin Pearson. Constraints and universal algebra. Annals of Mathematics and Artificial Intelligence, 24(1-4):51–67, 1998.

[7] P. G. Jeavons, D. A. Cohen, and M. Gyssens. Closure properties of constraints. Journal of the ACM, 44:527–548, 1997.

[8] Thomas J. Schaefer. The complexity of satisfiability problems. In Proceedings of the ACM Symposium on Theory of Computing (STOC), pages 216–226, 1978.



Using The Central Limit Theorem for Belief Network Learning

Ian Davidson1, Minoo Aminian1

1Computer Science Dept, SUNY Albany

Albany, NY, USA, 12222. [email protected]

Abstract. Learning the parameters (conditional and marginal probabilities) from a data set is a common method of building a belief network. Consider the situation where we have a known graph structure and many complete (no missing values), same-sized data sets randomly selected from the population. For each data set we learn the network parameters using only that data set. In such a situation, how will the parameters learnt differ from data set to data set? In this paper we show that the parameter estimates across the data sets converge to a Gaussian distribution with a mean equal to the population (true) parameters. This result is obtained by a straightforward application of the central limit theorem to belief networks. We empirically verify the central tendency of the learnt parameters and show that the parameters' variance can be accurately estimated by Efron's bootstrap sampling approach. Learning multiple networks from bootstrap samples allows the calculation of each parameter's expected value (as per standard belief networks) and also its second moment, the variance. Having the expectation and variance of each parameter has many potential applications. In this paper we discuss initial attempts to explore their use to obtain confidence intervals over belief network inferences and the use of Chebychev's bound to determine if it is worth gathering more data for learning.

Introduction

Bayesian belief networks are powerful inference tools that have found many practical applications [1]. A belief network is a model of a real world situation with each node (N1… Nm) in the network/graph (G) representing an event. An edge/connection (e1 … el) between nodes is a directional link showing that an event influences or causes the value of another. As each node value can be modeled as a random variable (X1 … Xm), the graph is effectively a compact representation of the joint probability distribution over all combinations of node values. Figure 1 shows a belief network modeling the situation that relates smoking, cancer and other relevant information. Building belief networks is a challenging task that can be accomplished by knowledge engineering or automatically producing the model from data, a process commonly known as learning [2].

Learning graphical models from a training data set is a popular approach where domain expertise is lacking. The training data contains the occurrence (or otherwise) of the events for many situations, with each situation termed a training example, instance or observation. For example, the parameters for the Cancer network shown in Figure 1 could be learnt from many patient records that contained the occurrence or not of the events. The four commonly occurring situations of learning belief networks [3] are different combinations of a) learning with complete (or incomplete) data and b) learning with known (or unknown) graph structure. Learning with complete data indicates that the training data contains no missing values, while incomplete data indicates that for some training examples, pieces of information about the world situation were unknown. With known graph structure the learning process will attempt to estimate the parameters of the random variables, while with unknown structure the graph structure is also learnt. There exists a variety of learning algorithms for each of these situations [1][3]. In this paper we shall focus on the situation where the data is complete and the graph structure is specified. In this simplest situation there is no uncertainty due to differing graph structure, latent variables or missing values. This allows us to focus on the uncertainty due to variations in the data. In future work we hope to generalize our findings to other learning situations. Without loss of generality and for clarity we shall focus on Boolean networks.

Learning the parameters of the graph from the training data D of s records involves counting the occurrences of each node value (for parentless nodes) across all records to calculate the marginal probabilities, or of combinations of values (for nodes with parents), and normalizing to derive the conditional probabilities. Now consider having many randomly drawn data sets (D1 … Dt), all of size s, from which we learn models θ1 … θt. Each model places a probability distribution over the event space, which is all combinations of node values. How will these parameters vary from model to model? We shall answer this question by considering a probability distribution over the parameters for each node. We shall call these random variables Q1 … Qm. Note that Qi may be a block of random variables, as the number of parameters per node can vary depending on its number of parents. In a Boolean network, for a parentless node, Qi is a single random variable whose value


represents the node parameter P(Xi = TRUE). For nodes that have parents in a Boolean network, Qi consists of 2^(#Parents) random variables representing the parameters P(Xi = TRUE | Condition_1) … P(Xi = TRUE | Condition_2^(#Parents)), with the different conditions being the various combinations of TRUE and FALSE values of the parents.

What can we say of these continuous random variables that represent the parameter estimates learnt from different random samples? In this paper we show that these continuous random variables form a Gaussian distribution that centers on the population values (the parameters that would be learnt from the entire population of data) and whose variance is a function of the sample size. We show that Qi ~ Gaussian(pi*, pi*(1 − pi*)/(ks)), where k is a constant, pi* is the relevant parameter of the generating mechanism that produced the data, and s is the sample size. The standard deviation of this random variable approaches zero and the sample mean converges asymptotically to the population value as s increases. We can now make two estimates for each parameter in the network: its expected value and its variance.

In most learning situations the actual probability distribution over the learnt parameters due to uncertainty/variability in the observed data set is unknown. Sometimes generalities are known, such as that decision trees are unstable learners or that iterative algorithms that minimize the distortion (vector quantization error) are sensitive to initial parameters. In this paper we make use of the probability distribution over the learnt parameters for two applications. These should be treated as initial attempts to make use of the mean and variance of each parameter, and future work will refine these applications. Firstly, we use the variance to determine if gathering more data will necessarily change the mean value of the parameters by at least some user-specified value. This has applications when collecting data is time consuming or expensive. We then illustrate how to create confidence intervals associated with each inference made from the network.

We begin this paper by describing the learning of belief networks; readers familiar with this topic may skip this section. The applicability of the central limit theorem to belief network learning with complete data is described next. We then empirically verify the central tendency of the distribution of the learnt parameters by sampling the data to learn from, without replacement, from a large collection of data (generated by a Gibbs sampler). Having such a large pool of data is an unusual luxury, and we next illustrate approximating the variance associated with the learnt parameters using Efron's bootstrap sampling (sampling with replacement) [5]. We next discuss how to use the estimated variance for our two proposed applications. Finally we conclude our work and describe future directions.

Learning Belief Networks

The general situation of learning a belief network can be considered as consisting of estimating all the parameters (marginal and conditional probabilities), θ, for the nodes and discovering the structure of the directed acyclic graph, G. Let θi = qi1 … qim be the parameters of the m-node graph learnt from data set i. Note that the corresponding upper-case notation indicates the random variables for these parameters. As the network is Boolean, we only need the probability of the node taking one value given its conditions (if any). In later sections we shall refer to the graph and parameters together as the model of the data, H = (θ, G).

This paper focuses on learning belief networks in the presence of complete data and a given graph structure. From a data set Di = di1, …, dis, with dil = ail1 … ailm, of s records, we find either the marginal or conditional probabilities for each node in the network after applying a Laplace correction. The term ailj refers to the j th node's value in the l th instance within the i th data set. For complete data, known graph structure and a Boolean network the calculation for node j is simply:

If Nj has no parents,

    qij = ( 1 + Σl [ailj = T] ) / ( s + 2 )    ( 1 )

If Nj has a parent, Parents(Nj) = Nk,

    qijT = ( 1 + Σl [ailj = T and ailk = T] ) / ( 2 + Σl [ailk = T] )
    qijF = ( 1 + Σl [ailj = T and ailk = F] ) / ( 2 + Σl [ailk = F] )    ( 2 )

where [·] equals 1 when the enclosed condition holds and 0 otherwise.

Note that the subscript T or F refers to the value of the node's parent; the parameter q always refers to the probability of the node taking on the value T. For example, qijF refers to the probability, learnt from the i th data set, that the j th node will be T given that its parent is F. If a node has more than one parent then the expression for the parameter calculations involves similar calculations over all combinations of the parent node values.
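The Laplace-corrected counts of equations ( 1 ) and ( 2 ) can be sketched in Python; the record layout and node names below are hypothetical, and only a single parent per node is handled, as in the equations:

```python
def learn_parameters(records, parent_of):
    """Laplace-corrected estimates of P(node = T), as in Eqs. (1) and (2).

    records   : list of dicts mapping node name -> True/False (one dict per record)
    parent_of : node name -> name of its single parent, or None if parentless
    """
    s = len(records)
    params = {}
    for node, parent in parent_of.items():
        if parent is None:
            # Eq. (1): Laplace-corrected marginal probability.
            params[node] = (1 + sum(r[node] for r in records)) / (s + 2)
        else:
            params[node] = {}
            for pv in (True, False):
                matching = [r for r in records if r[parent] == pv]
                # Eq. (2): Laplace-corrected conditional probability.
                params[node][pv] = (1 + sum(r[node] for r in matching)) / (2 + len(matching))
    return params

# Four hypothetical records for a toy two-node network Smoking -> Cancer.
data = [{"Smoking": True, "Cancer": True},
        {"Smoking": True, "Cancer": False},
        {"Smoking": False, "Cancer": False},
        {"Smoking": False, "Cancer": False}]
p = learn_parameters(data, {"Smoking": None, "Cancer": "Smoking"})
assert p["Smoking"] == 0.5         # (1 + 2) / (4 + 2)
assert p["Cancer"][True] == 0.5    # (1 + 1) / (2 + 2)
assert p["Cancer"][False] == 0.25  # (1 + 0) / (2 + 2)
```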


[Diagram of the network over the nodes Age, Gender, Exposure, Smoking, Cancer, Tumor and S. Calcium.]

Figure 1. The Cancer Belief Network.

[Diagram of the network over the nodes Visit, Smoking, TB, Cancer, Bronchitis, TB or Cancer, XRay Result and Dyspnea.]

Figure 2. The Asia Belief Network

The Asymptotic Distribution of Learnt Parameters for Belief Networks

Consider as before a number of random samples of training data (D1 … Dt), each of size s. We divide our graph of nodes into two subsets: those with no ancestors (Parentless) and those that have ancestors and are hence children (Children). We initially consider binary belief networks, whose nodes can only take on the two values T (TRUE) and F (FALSE). For each training data set Dk we learn a model θk as described earlier.

Under quite general conditions (finite variance) the parameters of the nodes that belong to Parentless will be Gaussian distributed due to the central limit theorem [4]. We can formally show this by making use of a common result from the central limit theorem, the Gaussian approximation to the binomial distribution [4].

A similar line of argument to the one that follows could have been established to show the central tendency around each and every joint probability value P(N1 … Nm), but since the belief network is a compact representation of the joint distribution, we show the central tendency around the parameters of the network.

Situation #1: Nj ∈ Parentless. Let pj be the proportion of times in the entire population (all possible data) that the j th node (a parentless node) is TRUE. If our sampling process (creating the training set) is random, then when we add/sample a record, the chance that the j th node's value is TRUE can be modeled as the result of a Bernoulli trial (a Bernoulli trial can have only one of two outcomes). As the random sampling process generates independent and identically distributed (IID) data, the probability that n of the s records in the sample will have TRUE for the j th node is given by a binomial distribution:

    P(n, s, pj) = [ s! / ( n! (s − n)! ) ] · pj^n (1 − pj)^(s−n)    ( 3 )

Applying the Taylor series expansion to equation ( 3 ) with the logarithm function, for its dampening (smoothness) properties, and an expansion point n = pj·s yields the famous Gaussian distribution expression:

    P(n, s, pj) = 1 / ( σ √(2π) ) · e^( −(n − s·pj)² / (2σ²) ),  where σ² = s·pj(1 − pj)    ( 4 )

If the sample size is s then the number of occurrences of Xj = T in any randomly selected data set will be given by a Gaussian whose mean is s·pj and whose variance is s·pj(1 − pj), as shown in equation ( 4 ).

Therefore, Qj ~ Gaussian(pj, pj(1 − pj)/s) in the limit as t → ∞ and for pj > ε, where pj is the proportion of times Xj = T in the population.
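The claim of Situation #1 can be checked empirically; the following sketch (with illustrative values p = 0.3, s = 1000 and t = 500 data sets) learns a parentless node's parameter from repeated random samples and compares the spread of the estimates against the predicted variance p(1 − p)/s:

```python
import random

random.seed(0)
p, s, t = 0.3, 1000, 500  # population parameter, sample size, number of data sets

# Learn the parameter of a parentless node from each of t random data sets.
estimates = [sum(random.random() < p for _ in range(s)) / s for _ in range(t)]

mean = sum(estimates) / t
var = sum((e - mean) ** 2 for e in estimates) / t

# The estimates should center on p with variance close to p(1 - p)/s.
assert abs(mean - p) < 0.01
assert abs(var - p * (1 - p) / s) < 1e-4
print(f"mean={mean:.4f}  var={var:.6f}  predicted var={p * (1 - p) / s:.6f}")
```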

Situation #2: Ni ∈ Children and Parents(Ni) = Nj … Nk. A similar argument to the above holds, except that there are now as many Bernoulli experiments as there are combinations of parent values. In total there will be 2^k Bernoulli experiments. Therefore, Qi ~ Gaussian(pi,j…k, pi,j…k(1 − pi,j…k)/(s·ck)), where ck is the proportion of times that the condition of interest (the parent value combination) occurred, so that s·ck is the number of records matching that condition.

A valid question to ask is how big pj (or pi,j…k) and s (or s·ck) should be to obtain reliable estimates. The general consensus in empirical statistics is that if t > 30 and s·pj > 5 then estimates made from the central limit theorem are reliable [4]. This is equivalent to the situation of learning from thirty samples, where each parental condition occurs at least fifty times and no marginal or conditional probability estimate is less than 0.10. While


these conditions may seem prohibitive, we found in practice for the standard belief networks tried (Asia, Cancer and Alarm) that the distribution of parameters was typically Gaussian. Note that one training instance can match multiple conditions. Table 1 shows, for the Cancer data set, that from 249 training data sets each of 1000 training instances, the average of the learnt parameters passed a hypothesis test at the 95% confidence level when compared to the population (true) parameters. Similar results were obtained for the Asia and Alarm data sets.

Node        n    Mean    Hypothesised   Difference between means   95% CI
Gender      249  0.4874  0.5130         -0.0256                    -0.0313 to -0.0199
Age         249  0.3022  0.3030         -0.0008                    -0.0070 to 0.0055
Ex.to.Tox.  249  0.4639  0.4190         0.0449                     0.0342 to 0.0556
Smoking     249  0.4670  0.4850         -0.0180                    -0.0322 to -0.0039
Cancer      249  0.2977  0.3090         -0.0113                    -0.0227 to 0.0000
S.C.        249  0.3210  0.3170         0.0040                     -0.0025 to 0.0106
Lung Tmr    249  0.2454  0.2390         0.0064                     -0.0023 to 0.0151

Table 1. For the Cancer data set, hypothesis testing of the mean of the learnt parameter against the population (true) value of the parameter. In all cases the difference between the true value and the mean of the learnt parameters lay within the 95% confidence interval. Only a subset of the hypothesis tests is presented for space reasons.

Using Bootstrapping to Estimate the Variance Associated With Parameters

Having the ability to generate random independent samples from the population is a luxury that is not always available. What is typically available is a single data set D. The bootstrap sampling approach [5] can be used to sample from D to approximate the situation of sampling from the population. Bootstrap sampling involves drawing r samples (B1 … Br), each of size equal to the original data set, by sampling with replacement from D. Efron showed that in the limit the variance amongst the bootstrap sample means approaches the variance amongst independently drawn sample means. We use a visual representation of the difference between the probability distribution over the learnt parameter values (249 bootstrap samples, each of 1000 instances) and a Gaussian distribution with a mean and standard deviation calculated from the learnt parameter values, as shown in Figure 3. As expected, the most complicated node (Smoking) had the greatest deviation from the normal distribution.
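A minimal sketch of the bootstrap procedure for one parentless Boolean node (the data set and the number of resamples r = 200 are illustrative):

```python
import random

random.seed(1)

def bootstrap_variance(data, r=200):
    """Variance of the learnt parameter across r bootstrap resamples of data."""
    s = len(data)
    estimates = []
    for _ in range(r):
        resample = [random.choice(data) for _ in range(s)]  # sample WITH replacement
        estimates.append(sum(resample) / s)                 # learnt marginal parameter
    m = sum(estimates) / r
    return sum((e - m) ** 2 for e in estimates) / (r - 1)

# A single available data set D of s = 1000 records for one parentless Boolean node.
D = [1] * 300 + [0] * 700
v = bootstrap_variance(D)
assert abs(v - 0.3 * 0.7 / 1000) < 1e-4  # approximates p(1 - p)/s for p ≈ 0.3
print(f"bootstrap variance = {v:.6f}")
```

The same resampling loop, applied once per network parameter, yields the per-parameter variances used in the remainder of the paper.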


[Normal quantile plots for the learnt parameters of the nodes Gender, Ex.to.Tox., Age, Smoking, Cancer, S.C. and Lung Tmr.]

Figure 3. For the Cancer data set the difference between the distribution over the learnt parameter values and a Gaussian distribution with the mean and the standard deviation of the learnt parameter values. The straight line indicates the Gaussian distribution. The x-axis is the parameter value while the distance between the lines is the Kullback-Leibler distance between the distributions at that point.


Learning and Using the Variance of Learnt Parameters

We now describe an approach to learn the mean and variance of the belief network parameters. Firstly, the parameters of a belief network are learnt from the available data using the traditional learning approach described earlier. Secondly, bootstrap sampling is performed to estimate the variance associated with each learnt parameter. Estimating the variance involves taking a number of bootstrap samples of the available data, learning the parameters of the belief network from each, and measuring the variance across these learnt parameters. Therefore, for each parameter in the belief network, we now have its expected value and its variance. Estimates of the variance of the parameters could have been obtained from the expressions reported in Situation #1 and Situation #2 earlier. However, in practice we found that these were not as accurate as calculating variances from a large number (more than 100) of bootstrap samples. This was because, from our limited training data, the learnt parameter was often different from the true value.

In this paper we propose two uses of these additional parameters, which we describe next: determining if more data is needed for learning, and deriving confidence intervals over inferences.

Determining the Need for More Data

A central problem with learning is determining how much data is needed (how big s should be). The computational learning theory literature [6] addresses this problem, unfortunately predominantly for learning situations where no noise exists, that is, where the Bayes error of the best classifier would be 0. Though there has been much interesting work studying the behavior of iterative algorithms [7][8], this work is typically for a fixed data set. The field of active learning attempts to determine what type of data to add to the initial training data set ([9] (the third approach in the paper), [10] and [11]). However, to our knowledge this work does not address the question of when to stop adding data.

We now discuss the use of Chebychev's bound to determine if more data is needed for learning. Others [12] have used similar bounds (Chernoff's bound) to produce lower-bound results for other pattern recognition problems. Since both the expectation and the variance of the parameters of interest are known, we can make use of the Chebychev inequality.

The Chebychev inequality allows the definition of the sample size, s, required to obtain a parameter estimate p̂ within an error ε of the true value, from a distribution with a mean of p and a standard deviation of σ, as shown below.

    P[ |p̂ − p| > ε ] < σ² / ( s·ε² )    ( 5 )

This expression can be interpreted as an upper bound on the chance that the error is larger than ε. In turn we can upper bound the right-hand side of this expression by δ, which can be considered the maximum chance/risk we are willing to take that our estimate and the true value differ by more than ε. Rearranging these terms to solve for the sample size yields:

    s ≥ σ² / ( ε²·δ )    ( 6 )

Typical applications of the Chebychev inequality use σ² = p(1 − p) and so produce a bound that requires knowing the very parameter one is trying to estimate [13]. However, we can use our empirically (through bootstrapping) obtained variance and circumvent this problem.

The question of how much data is needed is now answered with respect to how much error (ε) in our estimates we are willing to tolerate and the chance (δ) that this error will be exceeded. We can use equation ( 6 ) to determine if more data is needed, if we are willing to tolerate learning parameters that with chance δ differ from the true parameters by more than ε. The constants δ and ε are set by the user and effectively define the user's notion of how "close" is close enough. For each node in the belief network we compute the number of instances of each "type" required. We then compare this against the number of actual instances of this type in the data set. If for every node the required number is less than the actual number, then no more data needs to be added. By instance "type", we mean instances with a specific combination of parent node values. There are as many instance "types" as there are required marginal or conditional probability table entries. For example, for the Cancer network (Figure 1) the Exposure node generates two types, one each for when its parent's (Age) value is T or F; the values of the remaining nodes are not specified.
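Equation ( 6 ) is a one-line computation; with the illustrative bootstrap-estimated variance below, ε = 0.05 and δ = 0.05 give a requirement of about 100 instances, in line with the experiment reported below:

```python
def required_sample_size(variance, epsilon, delta):
    # Eq. (6): s >= sigma^2 / (epsilon^2 * delta).
    return variance / (epsilon ** 2 * delta)

sigma2 = 0.0125  # hypothetical bootstrap-estimated variance for one parameter
s_needed = required_sample_size(sigma2, epsilon=0.05, delta=0.05)
assert abs(s_needed - 100) < 1e-6  # about 100 instances of this type required
print(f"instances of this type required: {s_needed:.0f}")
```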


Applying this test multiple times for the Cancer network found that after approximately 100 training instances all nodes in the network passed the prescribed test, as shown in Figure 4.

[Chart: minimum required number of instances for actual training set sizes of 25, 50, 75, 100, 500 and 1000.]

Figure 4. For the Cancer belief network, the actual training set size versus the minimum expected number of instances. The calculated variance over the parameters provides a lower bound on the expected number of instances calculated from ( 6 ) with ε = 0.05 and δ = 0.05.

To test the appropriateness of this stopping criterion we need to empirically determine when adding new instances to the training set adds no information. This involves measuring the expectation (over the joint distribution) of the code word lengths for each combination of node values. We wish to add instances to the training set only if the instances reduce the mean amount of information needed to specify a combination of node values. As the training data size increases, the parameters learnt stabilize and better approximate the chances of occurrence in the population. Formally, the expected information is:

$$\text{Average \# bits} = -\sum_{i=1}^{2^m} H^*(E_i)\,\log_2 \theta(E_i) \qquad (7)$$

where Ei is a particular combination of node values, of which there are 2^m for an m-binary-node network, H*(·) is the probability distribution of the generating mechanism, and θ(·) is the learnt distribution.
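Equation ( 7 ) is the cross-entropy between the generating distribution H* and the learnt distribution θ, measured in bits; a minimal sketch over a hypothetical two-node joint distribution:

```python
import math
from itertools import product

def average_bits(true_dist, learnt_dist):
    # Equation (7): -sum_i H*(E_i) * log2(theta(E_i)), i.e. the expected code
    # word length when events drawn from H* are coded using theta.
    return -sum(p * math.log2(learnt_dist[e])
                for e, p in true_dist.items() if p > 0)

# hypothetical joint distributions over the 2^m value combinations (m = 2)
events = list(product([True, False], repeat=2))
true_dist = dict(zip(events, [0.4, 0.1, 0.2, 0.3]))
learnt_dist = dict(zip(events, [0.35, 0.15, 0.25, 0.25]))

print(average_bits(true_dist, learnt_dist))  # cross-entropy in bits
print(average_bits(true_dist, true_dist))    # entropy of H*: the lower limit
```

As the learnt parameters approach the true ones, the average code length falls toward the entropy of H*, which is why the curve in Figure 5 flattens once enough data has been seen.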

Figure 5 shows the average code word lengths against training set size for the Cancer belief network. We see that in the interval [97, 145] adding new instances does not significantly decrease the mean information. We note, and expect, that our estimate of 100 instances (calculated from equation ( 6 )) is towards the start of this interval, as we tolerate an error of 0.05 and a chance of success of 0.95.

[Figure 5 plot: title "Average Number of Bits To Specify an Event in Joint Distribution"; x-axis "Training Set Size" (1 to 481); y-axis "Mean Information" (5 to 12 bits).]

Figure 5. For the cancer belief network, the training set size versus the average number of bits (code word length) to specify an event in the joint distribution.


Confidence Intervals Over Inferences

Typically with belief networks an inference involves calculating a point estimate of P(Ni=T | E), where E is some set of node values considered as evidence. For small networks exact inference is possible, but since exact inference is NP-hard, for larger networks approximate inference techniques such as Gibbs sampling are used [2]. However, if the variance of each parameter is known, we can create a confidence interval associated with P(Ni=T | E). We now describe a simple approach to achieve this.

For a belief network parameter with value p and standard deviation σ, we know with 95% confidence that its true value is in the range p ± 1.96σ. We can now create three versions of the belief network: one where all network parameters are set to p (the expected case), another where the network parameters are set to p − 1.96σ (lower bound), and another where they are set to p + 1.96σ (upper bound). Point estimation is performed in each of the three networks using whatever inference technique is applicable, and the three values are then brought together to return an expected, lower and upper bound on the inference, providing a 95% confidence interval.
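A minimal sketch of the three-network construction, using a hypothetical two-node Age → Cancer network with assumed parameter means and standard deviations (exact inference here is just enumeration):

```python
def p_cancer(params):
    # P(Cancer=T) by enumeration in the toy Age -> Cancer network.
    p_age = params["P(Age=T)"]
    return (p_age * params["P(Cancer=T|Age=T)"]
            + (1 - p_age) * params["P(Cancer=T|Age=F)"])

def shifted(means, stds, shift):
    # Shift every parameter by `shift` standard deviations, clamped to [0, 1].
    return {k: min(1.0, max(0.0, means[k] + shift * stds[k])) for k in means}

# hypothetical parameter means and standard deviations per CPT entry
means = {"P(Age=T)": 0.4, "P(Cancer=T|Age=T)": 0.5, "P(Cancer=T|Age=F)": 0.1}
stds = {"P(Age=T)": 0.02, "P(Cancer=T|Age=T)": 0.03, "P(Cancer=T|Age=F)": 0.03}

z = 1.96  # 95% confidence
lower = p_cancer(shifted(means, stds, -z))      # lower-bound network
expected = p_cancer(shifted(means, stds, 0.0))  # expected-case network
upper = p_cancer(shifted(means, stds, +z))      # upper-bound network
print(lower, expected, upper)
```

In this toy network shifting every parameter in the same direction happens to bracket the query; for a general network the same three-network recipe is applied with whatever inference routine is available.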

Though it may be thought that the differences among these networks will be trivial and will not produce greatly varying inferences, recent work [14] has shown that even small changes in network parameters can result in large changes in the inference/query probabilities.

Figure 6 shows the large variation among the upper-bound, expected and lower-bound values of the query P(Cancer=TRUE). It is not until 1000 training instances are available that all three values fall within the interval [0.25, 0.35]. However, our prior results for computing how much data is needed apply only to the expected case, and we find that for 100 training instances and more the expected case is within the tolerated error (ε = 0.05) of the true value (0.3). Our experimental reports are for multiple experiments, but we note that if they were performed only once, our approach would have a chance of failure of 0.05, as δ = 0.05.

[Figure 6 plot: title "Confidence Interval vs. Training Set Size"; x-axis "Training Set Size" (25 to 1000); y-axis "Probability of Cancer Being True" (0 to 0.6); series: Lower Bound, Expected Case, Upper Bound.]

Figure 6. For the cancer belief network, the training set size against the upper bound, expected and lower bound values for the query P(Cancer=TRUE). Note that after 100 instances the expected case is always within the specified error (0.05) of the true value (0.3).

Our work on computing confidence intervals is related to work on computing error bars for belief networks [15]. The work of Van Allen et al. investigates the distribution of the response that a belief network will return to a query, and shows that this distribution is asymptotically normal. They derive expressions for the mean and asymptotic variance of this distribution, and provide error bars for the inference empirically. Their work assumes the network parameters are drawn from a Dirichlet distribution, and they derive an expression that approximates the standard deviation of a query. The distribution for this query is shown to be approximately normal. An approximate credible interval for the mean of this query distribution is obtained using a first-order Taylor expansion of the query based on mean values of the graph parameters. Our work differs in that we do not assume this mathematically convenient prior and instead propose an empirical way of computing the standard deviation of the parameters.


Future Work and Conclusion

Our work posed the question: given many independently drawn random samples from the population, if we learn the belief network parameter estimates from each, how will the learnt parameters differ from sample to sample? We have shown that, for complete data and known graph structure, the learnt parameters will be asymptotically distributed around the generating mechanism's (true) parameters. This result follows from the central limit theorem and has many potential applications in belief network learning. It means each node in the belief network is described by the parameters' means and variances. The variance measures the uncertainty in a parameter estimate due to variability in the data.

However, having many random samples from which to estimate the variance is a great luxury. By using bootstrap sampling we can estimate the variance by sampling from a single data set with replacement. Having a probability distribution over the learnt parameters due to observed data uncertainty has many potential applications, of which we describe two. We show how to create upper and lower bounds on the probabilities of inferences by creating three networks that represent the expected case, the lower bound and the upper bound. And using the mean and variance associated with each parameter allows us to determine whether more data is required so as to guarantee, with a certain probability, that the difference between the expected and actual parameter means is within a predetermined error.
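A minimal sketch of the bootstrap variance estimate for a single parameter (the probability that a binary node is true), run on hypothetical data:

```python
import random
import statistics

def bootstrap_std(data, estimator, n_boot=2000, seed=0):
    # Resample the data set with replacement n_boot times and measure how
    # much the re-learnt parameter varies across the resamples.
    rng = random.Random(seed)
    estimates = [estimator([rng.choice(data) for _ in data])
                 for _ in range(n_boot)]
    return statistics.stdev(estimates)

# hypothetical complete data for one binary node (1 = True)
data = [1] * 30 + [0] * 70
p_hat = sum(data) / len(data)                    # learnt parameter: 0.3
sigma = bootstrap_std(data, lambda d: sum(d) / len(d))
print(p_hat, sigma)  # sigma should be close to sqrt(p(1-p)/n) ~ 0.046
```

The resulting (p_hat, sigma) pair is exactly what the three-network confidence-interval construction and the stopping criterion consume.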

We have limited the results in this paper to the simplest belief network learning situation, to ensure that the uncertainty in the parameters is due only to data variability. We intend next to extend our results to the situation where EM and SEM are used to learn from incomplete data and unknown graph structure.

References

[1] F. Jensen, An Introduction to Bayesian Networks, Springer Verlag, 1996.

[2] J. Pearl, Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann, 1988.

[3] N. Friedman, The Bayesian structural EM algorithm, in G. F. Cooper & S. Moral, eds., Proc. Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI '98), 1998.

[4] D. Devore, Statistics: The Exploration and Analysis of Data, 3rd Ed., Wadsworth Publishing, 1997.

[5] B. Efron, Bootstrap methods: another look at the jackknife, Annals of Statistics 7:1-26, 1979.

[6] T. Mitchell, Machine Learning, McGraw Hill, 1997.

[7] L. Bottou and Y. Bengio, Convergence properties of the k-means algorithm, in G. Tesauro and D. Touretzky, eds., Advances in Neural Information Processing Systems, volume 7, pages 585-592, MIT Press, Cambridge, MA, 1995.

[8] R. Salakhutdinov, S. T. Roweis and Z. Ghahramani, Optimization with EM and Expectation-Conjugate-Gradient, International Conference on Machine Learning 20 (ICML-03), pp. 672-679.

[9] D. MacKay, Information-based objective functions for active data selection, Neural Computation 4(4):590-604, 1992.

[10] S. Tong and D. Koller, Active Learning for Parameter Estimation in Bayesian Networks, NIPS 2000.

[11] D. Cohn, L. Atlas and R. Ladner, Improving Generalization with Active Learning, Machine Learning 15(2):201-221, 1994.

[12] H. Mannila, H. Toivonen and I. Verkamo, Efficient algorithms for discovering association rules, in AAAI Workshop on Knowledge Discovery in Databases, July 1994.

[13] M. Pradhan and P. Dagum, Optimal Monte Carlo estimation of belief network inference, in Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence, pages 446-453, Portland, Oregon, 1996.

[14] H. Chan and A. Darwiche, When do Numbers Really Matter?, JAIR 17 (2002) 265-287.

[15] T. Van Allen, R. Greiner and P. Hopper, Bayesian Error-Bars for Belief Net Inference, The Seventeenth Conference on Uncertainty in Artificial Intelligence (UAI 2001), Seattle.


Approximate Probabilistic Constraints and Risk-Sensitive Optimization Criteria in Markov Decision Processes

Dmitri A. Dolgov and Edmund H. Durfee

Department of Electrical Engineering and Computer Science

University of Michigan

Ann Arbor, MI 48109

ddolgov,[email protected]

Abstract

The majority of the work in the area of Markov decision processes has focused on expected values of rewards in the objective function and expected costs in the constraints. Although several methods have been proposed to model risk-sensitive utility functions and constraints, they are only applicable to certain classes of utility functions and allow limited expressiveness in the constraints. We propose a construction that extends the standard linear programming formulation of MDPs by augmenting it with additional optimization variables, which allows us to compute the higher order moments of the total costs (and/or reward). This greatly increases the expressive power of the model, and supports reasoning about the probability distributions of the total costs (reward). Consequently, this allows us to formulate more interesting constraints and to model a wide range of utility functions. In particular, in this work we show how to formulate the constraint that bounds the probability of the total incurred costs falling within a given range. Constraints of that type arise, for example, when one needs to bound the probability of overutilizing a consumable resource. Our construction, which greatly increases the expressive power of our model, unfortunately comes at the cost of significantly increasing the size and the complexity of the optimization program. On the other hand, it allows one to choose how many higher order moments of the costs (and/or reward) are modeled as a means of balancing accuracy against computational effort.

1 Introduction

Markov processes are widely used to model stochastic environments, due to their expressiveness and analytical tractability. In particular, unconstrained (e.g. [3, 4, 8, 15]) as well as constrained (e.g. [1, 2]) Markov decision processes have gained significant popularity in the AI and OR communities as tools for devising optimal policies under uncertainty. The vast majority of the work in the area has focused on optimization criteria and constraints that are based on the expected values of the rewards and costs. However, such risk-neutral approaches are not always applicable and expressive enough,1 thus precipitating the need for extending the MDP framework to model risk-sensitive utility functions and constraints.

Several approaches [9, 12, 13] to modeling risk-sensitive utility functions have been proposed that work by transforming risk-sensitive problems into equivalent risk-neutral problems, which can then be solved by dynamic programming. However, this transformation only works for a certain class of utility functions. Namely, this has been done for exponential utility functions that are characteristic of agents that have "constant local risk aversion" [14] or obey the "delta property" [9], which says that a decision maker's risk sensitivity is independent of his current wealth. This approximation has a number of very nice analytical properties, but is generally considered somewhat unrealistic [9]. Our work attempts to address this issue via approximate modeling of a more general class of utility functions.

As with utility functions, there has been a significant amount of work on risk-sensitive constraints in MDPs. These methods typically work by constraining, or including in the objective function, the variance of the costs [6, 7, 10, 18, 19], or by reasoning about sample-path costs in the case of per-unit-time problem formulations [2, 16, 17]. However, in some cases, reasoning about the variance only is also not expressive enough (see [5] for a more detailed discussion).

We propose a method that allows explicit reasoning about the probability distributions of the total reward in the objective function and the distribution of costs in constraints, thus allowing us to represent a wide class of interesting optimization criteria and constraints. In this work, we describe a method for handling probabilistic constraints, but the approach is also directly applicable to risk-sensitive objective functions. We focus on transient (or episodic) Markov processes [11] and

1 As pointed out, for example, by Ross and Chen in the telecommunication domain [16].


base our approach on the standard occupancy-measure linear programming formulation of constrained Markov decision processes (CMDPs). We augment the classical program with additional optimization variables, which allows us to compute the higher order moments of the total incurred costs for stationary Markov policies. This enables us to reason about the probability distributions of the total costs, and consequently, to express more interesting constraints, such as bounding the probability that the total costs fall within a given range (or exceed a threshold).

It is important to note that, in general, arbitrary utility functions and arbitrary constraints do not obey the Markov property, which means that stationary Markov (history-independent) policies are not guaranteed to be optimal under such utility functions and constraints. However, given the practical difficulty of implementing non-stationary history-dependent policies, in this work we limit the search to the class of stationary Markov policies, i.e. we are interested in finding the best policy that satisfies the given constraints and maximizes the given utility function, among the class of stationary history-independent policies.

2 Preliminaries

We formulate our optimization problem as a stationary, discrete-time, fully-observable constrained Markov decision process. In this section, we review some well-known facts from the theory of standard [3, 15] and constrained [1] fully-observable Markov decision processes.

A standard constrained Markov decision process (CMDP) can be defined as a tuple $\langle S, A, p, \alpha, r, c \rangle$, where $S$ is a finite set of states, $A$ is a finite set of actions, $p : S \times A \times S \to [0,1]$ defines the transition function ($p(\sigma|s,a)$ is the probability that the agent will go to state $\sigma$ if it executes action $a$ in state $s$), $\alpha : S \to [0,1]$ is the initial probability distribution over the state space, $r : S \times A \to \mathbb{R}$ defines the reward function (the agent receives a reward of $r(s,a)$ for executing action $a$ in state $s$), and $c : S \to \mathbb{R}$ are the costs.2

A solution to a CMDP is a policy that prescribes a procedure for selecting actions, typically maximizing some measure of performance (based on the rewards $r$) while satisfying constraints (based on the costs $c$). A stationary Markov policy can be described as a mapping of states to probability distributions over actions: $\pi : S \to P(A)$. We address the problem of finding optimal stationary Markov policies under probabilistic constraints (defined in section 3).

For a Markov system, the initial probability distribution $\alpha$, the transition probabilities, and the policy $\pi$ together completely determine the evolution of the system in a stochastic sense:

$$x_{t+1} = P_\pi\, x_t, \qquad x_0 = \alpha \tag{1}$$

where $x_t$ is the probability distribution of the system at time $t$, and $P_\pi$ is the probability transition matrix induced by the policy ($[P_\pi]_{\sigma s} = \sum_a \pi(a|s)\, p(\sigma|s,a)$).

In this work we focus our attention on discrete-time transient (or episodic) problems, where there is no predefined number of time steps that the agent spends in the system, but it cannot stay there forever. Given a finite state space, this assumption implies that there have to exist some state-action pairs for which $\sum_{\sigma} p(\sigma|s,a) < 1$.

For most of the analysis in this paper we use the expected total reward as the policy evaluation criterion: $U = E\big[\sum_t r(s_t, a_t)\big]$, which, for a transient system with bounded rewards, converges to a finite value.

A standard CMDP, where constraints are imposed on the expected total costs and the expected total reward is being maximized, can be formulated as the following linear program [1, 15]:

$$\max_{x \ge 0} \sum_{s,a} r(s,a)\, x(s,a) \quad \text{s.t.} \quad \sum_a x(\sigma,a) - \sum_{s,a} p(\sigma|s,a)\, x(s,a) = \alpha(\sigma)\ \ \forall \sigma, \qquad \sum_s c(s) \sum_a x(s,a) \le \hat{c} \tag{2}$$

where $\hat{c}$ is the upper bound on the expected total incurred cost, and the optimization variables $x(s,a)$ are called the occupancy measure of a policy and can be interpreted as the expected total number of times action $a$ is executed in state $s$.

3 Problem Description

Given a standard constrained MDP model, we would like to find a policy that maximizes the total expected reward, while satisfying a constraint on the probability that the cost exceeds a given upper bound, i.e. $P(C \ge \hat{c}) \le \beta$,

2 The costs are said to be incurred for visiting states rather than executing actions, but all results can be trivially extended to the latter case. Also note that most interesting problems involve several costs, and while we make the simplification that there is only one cost incurred for visiting a state, this is done purely for notational brevity. All results presented in this work are directly applicable (and most useful) for problems with several costs.


where $C$ is the total cost. In other words, we need to solve the following optimization problem:

(3)

where the optimization variables $x(s,a)$ have the standard interpretation of the expected total number of times action $a$ is executed in state $s$. If $P(C \ge \hat{c})$ could be expressed as a simple function of $x$, the problem would be solved, as one could simply plug the expression into the above program. Unfortunately, things are not so simple, and the above dependency is significantly more complex.

A simple linear approximation to the above program (when costs are non-negative) can be obtained by using the Markov inequality:

$$P(C \ge \hat{c}) \le \frac{E[C]}{\hat{c}} \tag{4}$$

which allows one to bound $P(C \ge \hat{c})$ by a linear function of the occupancy measure $x$. Our investigation [5] of this approximation showed, unsurprisingly, that this linear approximation is computationally cheap but usually leads to suboptimal policies, because the Markov inequality provides a very rough upper bound on the probability that the total cost exceeds a given limit $\hat{c}$. The purpose of this work is to improve this approximation.

To this end, we are going to come up with a system of coordinates such that the constraint can be expressed as a simple function of them, so that the expression can be plugged into the optimization program (eq. 3). However, one has to note that there is only a limited number of free parameters in the system, so not all optimization variables are going to be independent, and additional constraints might have to be imposed.

As mentioned earlier, the method presented in this paper works for more general problems than (eq. 3), which we use as an interesting example to illustrate the approach. Section 7 comments on other problems that this method applies to.

4 Calculating the Probability of Exceeding Cost Bounds

To find the probability of exceeding the cost bounds, it would be very helpful to know the probability density function (pdf) $f_C(s)$ of the total cost.3 Then, the probability of exceeding the cost bounds could be expressed simply as $P(C \ge \hat{c}) = \int_{\hat{c}}^{\infty} f_C(s)\, ds$.

Unfortunately, $f_C$ is not easily available. However, it is a well-known fact that under some conditions the moments of a random variable completely specify its distribution.4 The $k$th moment of a random variable is defined as the expected value of its $k$th power: $m_k = E[C^k]$.

One way to compute the pdf $f_C$, given the moments, is via an inverse Legendre transform.5 Indeed, the Legendre polynomials $P_n(s)$ form a complete orthogonal set on the interval $[-1, 1]$:

$$\int_{-1}^{1} P_n(s)\, P_{n'}(s)\, ds = \frac{2}{2n+1}\, \delta_{nn'}.$$

Therefore, a function on that interval can be approximated as a weighted sum of Legendre polynomials: $f_C(s) \approx \sum_n c_n P_n(s)$, where $P_n$ is the $n$th Legendre polynomial, and $c_n$ is a constant coefficient, obtained by multiplying the polynomials by $f_C$, integrating over $[-1, 1]$, and using the orthogonality condition:

$$c_n = \frac{2n+1}{2} \int_{-1}^{1} f_C(s)\, P_n(s)\, ds \tag{5}$$

Realizing that $\int_{-1}^{1} f_C(s) P_n(s)\, ds = E[P_n(C)]$ is just a linear combination of several moments, we get:

$$c_n = \frac{2n+1}{2} \sum_k a_{nk}\, m_k \tag{6}$$

where $a_{nk}$ is the coefficient of $s^k$ in $P_n(s)$, and in the second summation the index $k$ runs over all powers of $s$ present in $P_n$. Therefore, for an $f_C$ approximated this way, we can express the probability that $C$ is greater than some $\hat{c}$ as a linear function of the moments:

$$P(C \ge \hat{c}) = \sum_k b_k\, m_k \tag{7}$$

3 Note that, in general, for an MDP with finite state and action spaces, the total costs have a discrete distribution. However, we make no assumptions about the continuity of the pdf, and our analysis carries through for both continuous and discrete density functions; in the latter case, the pdf can be represented as a sum of Dirac delta functions.

4 This is true when the power series of the moments that specifies the characteristic function converges, which holds in our case due to the transient nature of the Markov process and the fact that costs are finite.

5 A more common and natural way involves inverting the characteristic function via a Fourier transform, but that method does not work for this problem.


where $b_k = \sum_n \frac{2n+1}{2}\, a_{nk} \int_{\hat{c}}^{1} P_n(s)\, ds$, in which the index $n$ runs over all polynomials that include the $k$th power of $s$.
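A minimal sketch of equations (5)-(7) using the first four Legendre polynomials hardcoded as power series; the moments passed in are assumed to be those of the normalized cost on [-1, 1]:

```python
# coefficients a_{nk} of s^k in the Legendre polynomials P_0..P_3
LEGENDRE = [
    [1.0],                  # P0(s) = 1
    [0.0, 1.0],             # P1(s) = s
    [-0.5, 0.0, 1.5],       # P2(s) = (3s^2 - 1)/2
    [0.0, -1.5, 0.0, 2.5],  # P3(s) = (5s^3 - 3s)/2
]

def legendre_coeffs(moments):
    # Equation (6): c_n = (2n+1)/2 * sum_k a_{nk} m_k.
    return [(2 * n + 1) / 2 * sum(a * moments[k] for k, a in enumerate(poly))
            for n, poly in enumerate(LEGENDRE)]

def tail_probability(moments, c_hat):
    # Equation (7): P(C > c_hat) ~ sum_n c_n * integral_{c_hat}^{1} P_n(s) ds.
    def poly_integral(poly, lo, hi):
        anti = lambda s: sum(a * s ** (k + 1) / (k + 1)
                             for k, a in enumerate(poly))
        return anti(hi) - anti(lo)
    return sum(c * poly_integral(poly, c_hat, 1.0)
               for c, poly in zip(legendre_coeffs(moments), LEGENDRE))

# sanity check: the uniform density on [-1, 1] has moments 1, 0, 1/3, 0,
# and its tail beyond 0 is exactly 0.5
print(tail_probability([1.0, 0.0, 1 / 3, 0.0], 0.0))
```

The uniform density is recovered exactly here because its Legendre expansion terminates at n = 0; for a general cost distribution the truncation at four polynomials introduces the approximation error discussed in section 7.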

Therefore, if we normalize $C$ to be in the interval $[-1, 1]$ (section 6 discusses in more detail the necessary transformation for a transient system), we can use the above method to express $P(C \ge \hat{c})$ as a linear function of the moments $m_k$. Now, if we could come up with a system of coordinates such that the moments could be expressed through them, we might be able to formulate a manageable approximation to (eq. 3). However, it is important to note that unless we use an infinite number of moments, the resulting program will be an approximation to the original one.

5 Computing the Moments

As mentioned in the previous section, the properties of the pdf of the total cost are not immediately obvious, as the total cost is a sum of a random number of dependent random variables. We do, however, know how the system evolves with time, i.e. given the initial probability distribution, a policy, and the corresponding transition probabilities over states, we know the probability that the system is in state $s$ at time $t$: it is simply $[(P_\pi)^t \alpha]_s$, where $P_\pi$ is the probability transition matrix induced by the policy. In other words, we know the probability distribution for the random variables $X_s^t$, where $X_s^t = 1$ if state $s$ is visited at time $t$, and 0 otherwise. Let us also define for every state $s$ a random variable $N_s = \sum_t X_s^t$ that specifies the total number of times state $s$ is visited. Then, the moments $m_k$ of the total cost can be expressed as linear functions of the cross-moments (the expected values of the products $N_{s_1} \cdots N_{s_k}$) as follows:

$$m_k = E\Big[\Big(\sum_s c(s)\, N_s\Big)^{k}\Big] = \sum_{s_1, \dots, s_k} c(s_1) \cdots c(s_k)\, E[N_{s_1} \cdots N_{s_k}] \tag{8}$$

Let us now compute the first moments $E[N_s]$. Recalling that $X_s^t$ is a binary variable and, therefore, its mean equals the probability that $X_s^t$ is 1:6

$$E[N_s] = \sum_t E[X_s^t] = \sum_t [(P_\pi)^t \alpha]_s = [(I - P_\pi)^{-1} \alpha]_s \tag{9}$$

where $I$ is the identity matrix, and $\sum_t (P_\pi)^t = (I - P_\pi)^{-1}$ holds, because $(P_\pi)^t \to 0$ for our transient system. Multiplying by $(I - P_\pi)$, we get:

$$(I - P_\pi)\, E[N] = \alpha \tag{10}$$

Note that the above is exactly the "conservation of probability" constraint in (eq. 3). Indeed, since $E[N_s] = \sum_a x(s,a)$ and $\sum_s [P_\pi]_{\sigma s}\, E[N_s] = \sum_{s,a} p(\sigma|s,a)\, x(s,a)$, the two are identical. Let us now compute the second moments in a similar fashion:

$$E[N_s N_\sigma] = \sum_{t,\tau} E[X_s^t X_\sigma^\tau] \tag{11}$$

Once again recalling that $X_s^t$ are binary variables, and since the system can only be in one state at a particular time, the mean of their product is:

$$E[X_s^t X_\sigma^\tau] = \begin{cases} P(X_s^t, X_\sigma^\tau) & \text{if } t \ne \tau \\ \delta_{s\sigma}\, P(X_s^t) & \text{if } t = \tau \end{cases} \tag{12}$$

6 Hereafter, for binary variables, we use $P(X)$ as a shorthand for $P(X = 1)$, and $P(X, Y)$ for $P(X = 1, Y = 1)$.


where $P(X_s^t, X_\sigma^\tau)$ is the probability that state $s$ is visited at time $t$ and state $\sigma$ is visited at time $\tau$. Also, since the system is Markovian, for $\tau > t$, we have:

$$P(X_s^t, X_\sigma^\tau) = P(X_s^t)\, [(P_\pi)^{\tau - t}]_{\sigma s} \tag{13}$$

Substituting, we obtain:

(14)

Unfortunately, as can be seen from the above expression, the second moments cannot be tied to the first moments via a linear function. Therefore, we cannot use the moments as the optimization variables directly. Instead, we are going to work with the following asymmetric terms, where the order of the indexes corresponds to a temporal ordering of the terms in the sums:

(15)

We will refer to the above terms as the asymmetric moments (although they do not correspond to moments of any real variables). Clearly, all of the $k$th order asymmetric moments can be expressed as linear functions of the asymmetric moments of order $k-1$ by moving the extra term to the left-hand side. For example, for the second moments this step can be easily done by rewriting (eq. 15) in matrix form:

(16)

where the first matrix is the matrix of second order asymmetric moments, and the second is a diagonal matrix with the values of the first order moments on the diagonal. Multiplying on the right by the appropriate inverse, we obtain the second order asymmetric moments. Similarly, for the other moments we get:

(17)

Furthermore, it can be seen that the true moments of order $k$ can be expressed as linear functions of the asymmetric moments of orders 1 through $k$:

(18)

Indeed, for the first moments, there is only one state index and therefore the first asymmetric moment equals the true moment. For the second moments, both orderings of the indexes include the equal-time term, and thus we have to subtract it once. The expressions for other moments are obtained in a similar manner.


The last step that remains in formulating the optimization program is to substitute the transition probabilities and to define the actual optimization variables. This is where we hit a problem that breaks the linearity of the program. Recall that for the standard CMDP, the optimization variables are defined as $x(s,a)$ and have the interpretation of the expected total number of times action $a$ is executed in state $s$. As mentioned earlier, this allows one to express the first-order constraint from (eq. 17) as a linear function of $x$. Indeed, since $E[N_s] = \sum_a x(s,a)$, the first moments are simply linear in $x$, and we have for the first order constraint:

(19)

Unfortunately, the same trick does not work for the higher order constraints. If we were to define similar variables for the higher-order moments, we could, of course, rewrite (eq. 17) as linear functions of these variables. However, by doing this, we would introduce too many free parameters into the program. To retain the original desired interpretation of the variables, we would also have to add constraints to ensure that the policy implied by the higher-order variables is the same as the one computed from the first-order $x$. Clearly, these new constraints would be quadratic:

(20)

Hence, it appears that there is no easy way to avoid quadratic expressions in the constraints on the moments:

(21)

We are therefore left with an optimization program in $x$ and the asymmetric moments that has: a linear objective function, a constraint on the probability of the total cost exceeding a given threshold, which is linear in the moments (eq. 7), which are in turn linear in the asymmetric moments (eq. 18), and a system of quadratic constraints that synchronizes the moments (eq. 21).

6 An Example

As an example of the use of the method presented in the previous sections, let us consider a toy problem for which we can analytically compute the distribution of the total cost, and formulate a constrained optimization program for it using the first three moments of the total cost. In this section we present a more careful derivation of the optimization program, paying more attention to some steps that were only briefly described in section 4. We also present a preliminary empirical analysis that shows how closely our model approximates the true cumulative probability distribution function of the total cost. The purpose of the latter is to serve as a rough indication of the accuracy of our approach, which we cannot yet directly report on, as at the time of writing we do not have the optimization implemented.

Consider the problem depicted in figure 1(a). In this problem, there are two states, one of which (state 2) is a sink state. If the agent starts in state 1, the total received reward is the same as the total incurred cost, and both equal the total number of visits to state 1. The obvious optimal policy for the unconstrained problem is to always execute action 1 in state 1. If we set an upper bound on the cost, the unconstrained optimal policy has about a 30% chance of exceeding that bound. As we decrease the acceptable probability of exceeding the cost bound, the policies should become more conservative, i.e. they should prescribe higher probabilities of executing action 2 in state 1.
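A Monte Carlo sketch of the toy problem under the always-a1 policy (the cost-bound value is elided in this extraction; c_hat = 11 below is an assumed value, chosen because P(C > 11) = 0.9^11 ≈ 0.31 matches the "about 30%" figure):

```python
import random

def simulate_total_cost(p_stay=0.9, trials=100_000, seed=1):
    # Total cost = number of visits to state 1 before the a1 self-loop
    # (probability 0.9) finally sends the agent to the sink state 2.
    rng = random.Random(seed)
    costs = []
    for _ in range(trials):
        n = 1                         # the agent starts in state 1
        while rng.random() < p_stay:  # stay in state 1 with probability 0.9
            n += 1
        costs.append(n)
    return costs

costs = simulate_total_cost()
mean_cost = sum(costs) / len(costs)                    # analytically 1/0.1 = 10
c_hat = 11                                             # assumed cost bound
p_exceed = sum(c > c_hat for c in costs) / len(costs)  # exact value: 0.9**11
print(mean_cost, p_exceed)
```

The geometric total cost makes the analytic tail available for comparison, which is exactly what makes this toy problem a useful accuracy check for the moment-based approximation.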

In order to apply the Legendre approximation from section 4, one has to ensure that the total cost is in the range $[-1, 1]$. Clearly, this is not generally the case for our problem. Therefore, to satisfy this condition, we apply a transformation


[Figure 1 plots: (a) two-state chain with costs c1 = 1, c2 = 0; action a1: p1_11 = 0.9 (r = 1), p1_12 = 0.1 (r = 0); action a2: p2_12 = 1.0 (r = 0); (b) cdf of the total cost, "real & Legendre 3"; (c) "Error Legendre 3".]

Figure 1: (a) – simple problem with two states and two actions; (b) – the actual cdf of the total cost and the cdf computed from a third-degree Legendre approximation of the pdf of the cost; (c) – relative error of the approximation in (b).

with a truncation point (a reasonable approximation, as the remaining tail mass is small).7 Figure 1(b) shows the resulting cumulative distribution function for the unconstrained optimal policy and the cdf computed from a third-degree approximation of the pdf. Figure 1(c) shows the relative error of the third-degree approximation and serves to show that we can expect a reasonably good approximation of the cdf using just the first three moments. Note that the pdf for this problem is discrete (and thus harder to approximate with continuous Legendre polynomials), but we can still get a reasonably good approximation of the cdf, which is what we really care about, as our goal is to estimate the probability of exceeding the cost bound. Let us now compute a third-order approximation for an arbitrary policy by computing the coefficients (eq. 5):

(22)

Notice that here we have to use the moments of the transformed (normalized) cost. However, the constraints in (eq. 21) operate on the moments of the original cost, and since our optimization variables are going to be the latter, we have to be able to express the former via the latter. This can be easily done by solving the following linear system of equations:

(23)

Now, the probability of exceeding the cost bounds is computed as in (eq. 7).

7 Conclusions

In this paper we have introduced a method for approximate reasoning about the probability distributions of rewards and costs in Markov decision processes. The main three sources of approximation error in our method are: 1) the use of Legendre polynomials to approximate the true pdf, 2) the use of a finite number of moments (the more moments are used, the better the approximation), and 3) truncation of the costs at some upper bound (the lower the mass of the remaining cost tail, the better the approximation).

We demonstrated the approach on a specific problem that bounds the probability that the total cost exceeds a given upper bound. However, it is easy to see that the method allows one to model a wide range of risk-sensitive objective functions and constraints. Indeed, one can just as easily approximate the distribution of the total (normalized) reward and express the expected value of the utility function as (similarly to (eq. 7)):

(24)

where  is the utility of getting reward , and  is the distribution of the total

reward. Then, the approximation methods of sections 4 and 5 are directly applicable. Of course, the above relies on the fact

7For a transient system with bounded rewards, . Thus one can always compute or estimate a reasonable  for which  is arbitrarily small.


that there either exists a natural upper bound on the largest possible utility value of a policy, or that there exists an upper bound such that the weighted tail of the utility distribution is sufficiently small.

Even though our construction yields more complex optimization programs than the standard constrained MDP approach, it is more expressive than the standard risk-neutral CMDP techniques, because our formulation allows one to reason about the probability distributions instead of the expected values of the total cost and rewards. Our ongoing efforts in extending this work include several directions, such as looking at ways of efficiently encoding and implementing the optimization program, a more careful investigation of the complexity and convergence properties of the model (as more moments are used), exploring heuristics for choosing an appropriate number of moments, and a formal analysis of the properties of the problem and the corresponding solutions.

Acknowledgments This work was supported by DARPA and the Air Force Research Laboratory under contract F30602-00-C-0017 as a subcontractor through Honeywell Laboratories, and also by Honeywell Laboratories through the “Industrial Partners of CSE” program at the University of Michigan. The authors thank the anonymous reviewers for their comments.

References

[1] E. Altman. Constrained Markov Decision Processes. Chapman and Hall/CRC, 1999.

[2] E. Altman and A. Shwartz. Adaptive control of constrained Markov chains: Criteria and policies. Annals of Operations Research, special issue on Markov Decision Processes, 28:101–134, 1991.

[3] C. Boutilier, T. Dean, and S. Hanks. Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11:1–94, 1999.

[4] T. Dean and M. Wellman. Planning and Control. Morgan Kaufmann, 1991.

[5] D. A. Dolgov and E. H. Durfee. Approximating optimal policies for agents with limited execution resources. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, pages 1107–1112, 2003.

[6] J. A. Filar and L. C. M. Kallenberg. Variance-penalized Markov decision processes. Mathematics of Operations Research, 14:147–161, 1989.

[7] J. A. Filar and H. M. Lee. Gain/variability tradeoffs in undiscounted Markov decision processes. In Proceedings of the 24th IEEE Conference on Decision and Control, pages 1106–1112, 1985.

[8] R. Howard. Dynamic Programming and Markov Processes. MIT Press, Cambridge, 1960.

[9] R. Howard and J. Matheson. Risk-sensitive Markov decision processes. Management Science, 18(7):356–369, 1972.

[10] Y. Huang and L. Kallenberg. On finding optimal policies for Markov decision chains. Mathematics of Operations Research, 19:434–448, 1994.

[11] L. Kallenberg. Linear Programming and Finite Markovian Control Problems. Math. Centrum, Amsterdam, 1983.

[12] S. Koenig and R. G. Simmons. Risk-sensitive planning with probabilistic decision graphs. In J. Doyle, E. Sandewall, and P. Torasso, editors, KR'94: Principles of Knowledge Representation and Reasoning, pages 363–373. Morgan Kaufmann, San Francisco, California, 1994.

[13] S. Marcus, E. Fernandez-Gaucherand, D. Hernandez-Hernandez, S. Colaruppi, and P. Fard. Risk-sensitive Markov decision processes, 1997.

[14] J. Pratt. Risk aversion in the small and in the large. Econometrica, 32(1-2):122–136, 1964.

[15] M. L. Puterman. Markov Decision Processes. John Wiley & Sons, New York, 1994.

[16] K. Ross and B. Chen. Optimal scheduling of interactive and non-interactive traffic in telecommunication systems. IEEE Transactions on Automatic Control, 33:261–267, 1988.

[17] K. Ross and R. Varadarajan. Multichain Markov decision processes with a sample path constraint: A decomposition approach. Mathematics of Operations Research, 16:195–207, 1991.

[18] M. Sobel. Maximal mean/standard deviation ratio in undiscounted MDP. OR Letters, 4:157–188, 1985.

[19] D. J. White. A mathematical programming approach to a problem in variance penalised Markov decision processes. OR Spectrum, 15:225–230, 1994.


Generalized Opinion Pooling

Ashutosh Garg, T. S. Jayram, Shivakumar Vaithyanathan, Huaiyu Zhu
IBM Almaden Research Center, San Jose, CA 95120, USA

Abstract

In this paper we analyze the problem of opinion pooling. We introduce a divergence minimization framework to solve the standard opinion pooling problem. Our results show that various existing pooling mechanisms, such as LinOp and LogOp, are special cases of this framework. The framework is then extended to address the problem of generalized opinion pooling. We show that this framework satisfies various desiderata, and we give an EM algorithm for solving the problem. Finally, we present results on synthetic and real-world data; the results obtained are encouraging.

1 Introduction

The recent explosion of the Web has resulted in the availability of valuable customer feedback. Ranging from movies to various products, such feedback is often available in the form of explicit user ratings. Alternatively, such ratings can also be extracted from opinions expressed in text; several recent efforts in statistical NLP address extracting such opinions from text [4]. The distributed nature of the Internet implies that information regarding users' feedback and opinions is often available from multiple sources. Further, individual experts possessing relevant information may use different models to make predictions or to generate estimates while expressing opinions. To base inference on all available information, it is necessary to combine the information from all these different experts. In this paper we consider the problem of aggregating information from multiple experts. Typically, opinions are represented in terms of probability distributions, and the aim is to arrive at a single probability distribution that represents the consensus behavior. This is accomplished using a pooling or consensus operator. Studied formally under the name of opinion pooling, this problem has primarily been addressed under an axiomatic framework. In such approaches a consensus operator is chosen to satisfy a required set of axioms.

This paper tackles a problem that is more complex than the conventional opinion pooling problem. Each expert opinion is characterized by some dimensions, and a consensus opinion might be desired across any subset of these dimensions. Further, various simple desiderata are defined to be required of the consensus opinions. Moving away from the traditional axiomatic approach, a model-based solution is proposed to tackle the problems of consistency and sparsity introduced by this generalization. A formal analysis of the model-based consensus results in a derivation of the conditions under which the desiderata are satisfied.

The remainder of the paper begins by first revisiting the problem of conventional opinion pooling. Section 2 motivates and formally introduces the generalized opinion pooling (GOP) problem, followed by a discussion of the various desiderata (Section 2.1). In order to motivate our model-based approach, in Section 3 we first cast conventional opinion pooling as a divergence minimization problem. Here it is shown that current, popular aggregation operators arise as solutions to special cases of this formulation. In Section 4, we extend this optimization framework to the GOP problem, where we propose a model-based


solution. Section 5 provides an empirical study of the proposed model using opinions collected from the Web. Section 6 describes the results of our experiments, followed by a discussion.

2 Preliminaries and Problem Definition

Opinions about products and services can be expressed in several different ways: as ratings on a scale, or as preferences expressed via a probability distribution, e.g., over High and Low1. The premise of this paper is that a decision maker (DM) would be interested in an aggregation of opinions from different sources. Consider, for example, the following query: “What is the opinion of the Thinkpad T30 as expressed at different sources?”. Assuming two sources, the answer to this query is some consensus of the Thinkpad T30 reviews at these sources. Note that the scales over which ratings are provided often vary across sources and therefore need to be normalized somehow. Also, other opinions may not be representable on a rating scale2. Due to these issues it is preferable to express opinions as a probability distribution over preference values. A detailed discussion of the conversion of scale ratings to probability distributions is beyond the scope of this paper; however, the simple model used in this paper is described in a later section.

The following notation will be used throughout the paper. Capital letters X, Y will be used to refer to random variables, and the corresponding small letters x, y will denote the particular instantiations (values taken) by these. PX(X = x) will refer to the probability that a random variable X takes on value x. When the context is clear, we will denote this quantity simply by P(X = x) or P(x). Given a set of empirical distributions P̂1, P̂2, . . . , P̂N (we will reserve the “hat” notation exclusively for empirical distributions), we will refer to P̂i, for each i, as the distribution of the opinion random variable S given by expert ei.

The (conventional) opinion pooling problem can be stated as follows:

Definition 1 (Opinion Pooling). Experts e1, e2, . . . , eN provide opinions P̂1, P̂2, . . . , P̂N, respectively, about a particular topic. The opinion pooling problem is to provide a consensus opinion P about that topic. In other words, we seek a pooling operator F such that P = F(P̂1, P̂2, . . . , P̂N).

We now introduce a more elaborate framework where the relationships between the experts are captured while respecting certain constraints. The experts expressing opinions are characterized using various dimensions of interest. Assume that there are m dimensions D1, D2, . . . , Dm of interest. Suppose there are N experts e1, e2, . . . , eN who provide opinions P̂1, P̂2, . . . , P̂N, respectively. Each expert ei is associated with some assignment of (legal) values to an arbitrary subset of dimensions; Ci is called the characteristic of expert ei. Let T denote the topic variable about which opinions can be expressed.

Given such empirical distributions from several experts, the DM may request opinions about topics across arbitrary characteristics. Note that the desired characteristic need not agree with the characteristic of any of the experts. In addition to this reporting problem, the DM may wish to analyze the relationships across different characteristics. To address such issues while ensuring consistent answers, we propose a framework in which the consensus opinion for all characteristics is obtained via a single distribution P such that the conditional probability distribution P(S|T = t, C) is well-defined for every topic t and every characteristic C. Furthermore, the DM may also wish to impose additional constraints that need to be satisfied. These constraints can be incorporated into the framework by placing suitable restrictions on P. In Section 5, such constraints are naturally expressed using a statistical model.

1At Epinions.com, reviews of products are expressed as ratings on a scale of 1–5.
2As an example, consider a study being conducted on the likelihood of a customer coming back, where the responses are either Likely, Not Likely, and Undecided. Such responses are not easily expressed on a scale.


Definition 2 (Generalized Opinion Pooling). Suppose there are N experts e1, e2, . . . , eN who provide opinions P̂1, P̂2, . . . , P̂N, respectively. Let Ci and ti denote the characteristic and topic of expert ei. The generalized opinion pooling problem (GOP) is to find a distribution P that can be conditioned on every topic and every characteristic, subject to the constraints imposed by the DM. In other words, we seek a pooling operator F such that P = F(P̂1, . . . , P̂N), and that P(S|T = t, C) is well defined for every topic t and every characteristic C.

Note that the distribution P can potentially contain a larger set of random variables apart from the dimensions, topics, and opinion; e.g., it may contain latent variables3. The solution P provides opinions for all distinct characteristics, thus addressing the apparent problem of sparsity, i.e., that empirical distributions may not be available for all characteristics or may need to be estimated with very little data. Moreover, the imposition of a single joint distribution ensures that reporting is consistent across all characteristics.
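As a toy illustration of this point (our own example with invented numbers and names, not from the paper), a single joint table over (S, T, G) answers a reporting query for any characteristic, or for no characteristic at all, by conditioning:

```python
# Toy joint distribution P(s, t, g) over opinion S, topic T, and source G.
# All values and names below are invented for illustration.
joint = {
    ("High", "t30", "epinions"): 0.30, ("Low", "t30", "epinions"): 0.10,
    ("High", "t30", "zdnet"): 0.25,    ("Low", "t30", "zdnet"): 0.35,
}

def cond_s(topic, source=None):
    """P(S | T=topic, G=source); source=None marginalizes over sources."""
    sel = {k: v for k, v in joint.items()
           if k[1] == topic and (source is None or k[2] == source)}
    z = sum(sel.values())                 # normalization for the condition
    out = {}
    for (s, _, _), v in sel.items():
        out[s] = out.get(s, 0.0) + v / z
    return out

print(cond_s("t30", "epinions"))   # ≈ {'High': 0.75, 'Low': 0.25}
print(cond_s("t30"))               # consensus marginalized across sources
```

Because every query is answered from the same joint table, the reported opinions are automatically consistent across characteristics, which is the point of the single-distribution formulation.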

2.1 Desiderata for Pooling Operators

In the literature on opinion pooling, there has been considerable study of the many properties satisfied by the various pooling operators [2]. For the generalized opinion pooling problem, particularly in the context of business and market intelligence, the opinions are distributions over some set of preference values. For this domain, we have identified three simple and natural properties that are desired of any solution:

1. Unanimity: If all the experts agree on the opinion of a topic, then the aggregated opinion agrees with the experts.

2. Boundedness: The aggregated opinion is bounded by the extremes of the expert opinions.

3. Monotonicity: When a certain expert changes his opinion in a particular direction with all other expert opinions remaining unchanged, the aggregated opinion changes in the direction of this expert.

3 Opinion Pooling

In order to motivate our approach to GOP, we first present a simple but powerful framework for the conventional opinion pooling problem. We will show that the popular operators LinOp and LogOp arise as special cases of this formulation. Later, we extend this in a natural way to the GOP problem.

The basic intuition is that in any solution to the opinion pooling problem, we expect the aggregate distribution to be as “close” as possible to the individual experts. To formalize this, we will consider distance measures between distributions and cast conventional opinion pooling as a minimization problem. To the best of our knowledge, this formulation and the associated derivations have not appeared in the literature.

Let D(P,Q) denote a divergence measure between probability distributions P and Q, where D satisfies (1) D(P,Q) ≥ 0 and (2) D(P,Q) = 0 if and only if P = Q. We are given n expert distributions P̂i and their respective non-negative weights wi, which sum to one. The goal is to obtain an aggregate distribution P via the following minimization problem:

P = argmin_Q ∑_i wi D(P̂i, Q)      (1)

The choice of weights is governed by various criteria [3]. W.l.o.g., in the absence of any knowledge, all experts will be assumed equal. Therefore all wi are equal, and they are hence ignored in the remainder of the paper.

3Operators such as LinOp dramatically restrict the constraints that can be imposed on the solution [5].
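Equation 1 can also be checked numerically. The following sketch (our own, with hypothetical expert opinions) grid-searches over candidate binary distributions Q and confirms that ∑_i KL(P̂i, Q) is minimized at the arithmetic mean of the experts, i.e., at the LinOp solution:

```python
import math

# Numerical sanity check of Equation 1 (our own sketch, not from the paper):
# over a grid of candidate binary distributions Q = [q, 1-q], the objective
# sum_i KL(P_i, Q) is minimized at the arithmetic mean of the experts.

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

experts = [[0.7, 0.3], [0.5, 0.5], [0.9, 0.1]]   # hypothetical expert opinions

best_q, best_val = None, float("inf")
for k in range(1, 1000):
    q = [k / 1000, 1 - k / 1000]
    val = sum(kl(p, q) for p in experts)
    if val < best_val:
        best_q, best_val = q, val

print(best_q)   # [0.7, 0.3], the arithmetic mean of the three experts
```

The first-order condition of the objective forces q to equal the mean of the expert probabilities, which is why the grid search lands exactly on it.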


Divergence D(P,Q)                                       Consensus opinion P(s)

Dγ(P,Q) = (1 − ∑_s P(s)^γ Q(s)^{1−γ}) / (γ(1−γ))        (1/Z) (∑_i wi [P̂i(s)]^γ)^{1/γ}
DKL(P,Q) = ∑_s P(s) log(P(s)/Q(s))                      ∑_i wi P̂i(s)
DKL(Q,P) = ∑_s Q(s) log(Q(s)/P(s))                      (1/Z) ∏_i [P̂i(s)]^{wi}
L2(P,Q) = ∑_s (P(s) − Q(s))^2                           ∑_i wi P̂i(s)
χ2(P,Q) = ∑_s (P(s) − Q(s))^2 / Q(s)                    (1/Z) √(∑_i wi [P̂i(s)]^2)
χ2(Q,P) = ∑_s (P(s) − Q(s))^2 / P(s)                    (1/Z) (∑_i wi / P̂i(s))^{−1}

Figure 1: Different divergences and the corresponding consensus pooling operators. The quantity Z denotes the normalization constant.

Table 1 gives a summary of the different divergences and the consensus distributions that arise by solving the associated minimization problems. The derivations use standard analytical methods and are omitted in the interest of space. Two interesting cases are:

1. LinOp: F is called LinOp if P can be expressed as a linear combination of the empirical distributions. Choosing either the KL-distance or the L2 norm as the divergence measure in Equation 1 leads to this solution for the consensus distribution.

2. LogOp: F is called LogOp if P is the weighted geometric mean of the empirical distributions under consideration. The choice of the reverse KL-distance (see Figure 1) leads to LogOp as the consensus distribution.
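Both closed forms are easy to state in code; the sketch below (our own, with uniform weights as assumed above, and function names that are ours) implements the two operators:

```python
import math

# Closed-form pooling operators with uniform weights (a sketch, not the
# authors' code): LinOp is the arithmetic mean, LogOp the normalized
# geometric mean of the expert distributions.

def lin_op(dists):
    """LinOp: arithmetic mean of the expert distributions."""
    n = len(dists)
    return [sum(d[s] for d in dists) / n for s in range(len(dists[0]))]

def log_op(dists):
    """LogOp: normalized geometric mean of the expert distributions."""
    n = len(dists)
    g = [math.prod(d[s] for d in dists) ** (1.0 / n)
         for s in range(len(dists[0]))]
    z = sum(g)                      # the normalization constant Z
    return [v / z for v in g]

p1, p2 = [0.7, 0.3], [0.5, 0.5]
print(lin_op([p1, p2]))             # [0.6, 0.4], up to float rounding
print(log_op([p1, p2]))             # geometric mean, renormalized
```

On identical inputs both operators simply return the common distribution, which is consistent with the per-value unanimity property discussed next for LinOp.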

Having a closed-form solution allows us to directly evaluate the different divergence measures via the desiderata stated earlier. First, we observe that LinOp satisfies unanimity (for any fixed s, if P̂i(s) = c for all i, then P(S = s) = c) and boundedness (for every s, mini P̂i(s) ≤ P(s) ≤ maxi P̂i(s)), both of which follow easily from its definition. LinOp also satisfies a strong monotonicity property: suppose expert ei changes his opinion P̂i to Q̂i and suppose that all other experts' opinions are unchanged. Let P and Q be the LinOp solutions before and after ei's opinion has changed. Then for every s, Q(s) > P(s) (respectively, <, =) if and only if Q̂i(s) > P̂i(s) (respectively, <, =).

For the pooling operators arising from the other divergences, it is possible to construct easy counterexamples showing that none of them satisfy unanimity or boundedness. However, they all satisfy a weak form of monotonicity. This is shown below for the case when the divergence measure is Dγ. For the other divergence measures, a similar result can be shown using the same technique.

Theorem 3. Suppose expert ei changes his opinion P̂i to Q̂i such that P̂i(s) < Q̂i(s) for some s, while P̂i(s′) ≥ Q̂i(s′) for every s′ ≠ s.4 Suppose that all the other experts' opinions are unchanged, i.e., Q̂j = P̂j for all j ≠ i. If P and Q are the solutions using Dγ as a divergence before and after expert ei's opinion has changed, then Q(s) > P(s).

Proof. Define P̄(x) = (∑_i [P̂i(x)]^γ)^{1/γ} and Q̄(x) = (∑_i [Q̂i(x)]^γ)^{1/γ} for all x. Since P̂i(s) < Q̂i(s), we have Q̄(s) = P̄(s) + εs for some εs > 0. Similarly, for every s′ ≠ s we have P̂i(s′) ≥ Q̂i(s′), implying

4A dual situation is when the opinion for s decreases while the opinion for the remaining s′ ≠ s is non-increasing; this can be handled similarly.


[Diagram omitted: nodes S, T, A, B, F (Speed), and G (Source).]

Figure 2: Bayesian network for Generalized opinion pooling.

Q̄(s′) = P̄(s′) − εs′ for some εs′ ≥ 0. Moreover, εs′ is strictly greater than 0 for at least one s′ ≠ s, because P̂i(s) < Q̂i(s) implies that P̂i(s′) > Q̂i(s′) for at least one s′ ≠ s. Therefore, ∑_{s′≠s} εs′ > 0.

From Table 1, we note that P(x) = P̄(x)/Z and Q(x) = Q̄(x)/Z′ for all x, where Z = ∑_x P̄(x) and Z′ = ∑_x Q̄(x) denote the normalization constants for P and Q, respectively. From the previous paragraph, it follows that Z′ = Z + εs − ∑_{s′≠s} εs′. Now, since Z = ∑_x P̄(x) ≥ P̄(s), we have

Q(s) = Q̄(s)/Z′ = (P̄(s) + εs) / (Z + εs − ∑_{s′≠s} εs′) > (P̄(s) + εs) / (Z + εs) ≥ P̄(s)/Z = P(s).

4 A Model Based Solution to Generalized Opinion Pooling

The previous section presented a divergence minimization framework for opinion pooling, together with an analysis of the conditions under which the various desiderata are satisfied. In this section, the framework is extended to the GOP problem. Recall that the solution to the GOP problem is a single distribution P such that the consensus opinion for every characteristic and every topic can be obtained as a conditional probability distribution, subject to the constraints imposed by the DM. Let 𝒫 denote the feasible set of solutions such that the constraints imposed by the DM are satisfied. For each expert ei, let Pi denote the distribution P conditioned on the topic and the characteristic of that expert. In other words, if the topic and characteristic of ei are ti and Ci respectively, then Pi(s) = P(s|T = ti, Ci). A natural generalization of the optimization approach considered for conventional opinion pooling (Equation 1) is to require that the empirical distribution P̂i of expert ei be close to the distribution Pi, for each i:

minimize ∑_i wi D(P̂i, Pi)    such that    P ∈ 𝒫      (2)

We now address whether this solution to the GOP problem satisfies the desiderata described in Section 2.1. We will show that under suitable conditions, the minimization problem of Equation 2 indeed satisfies unanimity, boundedness, and monotonicity. First, to prove a monotonicity result for Dγ, we consider the following setup, where we have two sets of empirical distributions as inputs to the minimization problem of Equation 2. We show that the difference between the two empirical distributions is positively correlated with the difference between their corresponding minima.


Lemma 4. Let P̂1, . . . , P̂n and Q̂1, . . . , Q̂n be two sets of empirical distributions, and suppose P and Q are the corresponding distributions obtained by solving Equation 2 using the divergence Dγ. Then,

∑_i ∑_s [(P̂i(s))^γ − (Q̂i(s))^γ] · [(Pi(s))^{1−γ} − (Qi(s))^{1−γ}] ≥ 0.

Proof. Let D = Dγ. Since P is a minimum for P̂1, . . . , P̂n, we have ∑_i D(P̂i, Pi) ≤ ∑_i D(P̂i, Qi), and similarly ∑_i D(Q̂i, Qi) ≤ ∑_i D(Q̂i, Pi). Adding the two inequalities gives ∑_i [D(P̂i, Pi) + D(Q̂i, Qi) − D(P̂i, Qi) − D(Q̂i, Pi)] ≤ 0. Substituting the definition of D proves the lemma.

Theorem 5. Suppose expert ei changes his opinion P̂i to Q̂i such that P̂i ≠ Q̂i, and all other experts' opinions are unchanged, i.e., Q̂j = P̂j for j ≠ i. Let P and Q be the solutions obtained via Dγ before and after expert ei's opinion has changed. Let A = {s : Q̂i(s) > P̂i(s)} and B = {s : Q̂i(s) < P̂i(s)}. Then, either for at least one s ∈ A we have Qi(s) ≥ Pi(s), or for at least one s ∈ B we have Qi(s) ≤ Pi(s).

Proof. Note that the sets A and B are nonempty because P̂i ≠ Q̂i. Suppose the theorem does not hold, i.e., Qi(s) < Pi(s) for all s ∈ A and Qi(s) > Pi(s) for all s ∈ B. It follows that the LHS of the inequality of Lemma 4 is strictly negative, a contradiction.

For unanimity and boundedness we have the following result for the divergence DKL, which assumes that the class of distributions 𝒫 satisfies certain conditions.

Theorem 6. Let P be the distribution obtained by solving Equation 2 via DKL. Then it satisfies the following conditions: (1) Unanimity: for any fixed s, if P̂i(s) = c for all i, then Pi(S = s) = c. (2) Boundedness: for every s, mini P̂i(s) ≤ Pi(s) ≤ maxi P̂i(s).

Proof. We provide a constructive proof of the above theorem. It shows that under certain existential conditions, the unanimity condition is satisfied when the KL distance is used as the divergence measure.

Define P′ such that, for all i, P′i(s) = c and, for all s′ ≠ s, P′i(s′) = ((1 − c)/(1 − Pi(s))) · Pi(s′). We assume Pi(s) ≠ 1, as otherwise the original KL divergence would be infinite when Pi(s) = 1 and c ≠ 1.

Now, if P′ ∈ 𝒫 (this is the existential condition), then one can show that ∑_i DKL(P̂i, Pi) > ∑_i DKL(P̂i, P′i). This proves the unanimity condition. A similar argument can be used to prove the boundedness result.

5 Bayesian Network Aggregation

The details of the statistical model describing Pi in Equation 2 were conveniently ignored in the previous section. Recall from the previous section that the distribution of interest is the joint distribution over the random variables. Moreover, the constraints of the DM, represented as conditional dependencies between the random variables, must also be modeled. A convenient representation is a Bayesian Network (BN) that captures, intuitively, the essential aspects of the problem. By varying the conditional independence relationships modeled, and by the incorporation of hidden variables, the BN allows for a rich class of constraints that the DM would like to impose. Once the parameters of the BN are learned, it can then be queried by the DM to obtain aggregated opinions of interest. However, the complexity of the problem (learning and inference) will depend on the particular choice of network structure.


5.1 Description Of the Model

We illustrate the Bayesian network approach using a simple example. Assume that the DM is interested in opinions about laptops expressed at multiple sources. Therefore, the topic T equals laptops t, and the characteristic includes the dimension source G, which takes on values g ∈ G. To adequately explain the power of the BN, we will assume another dimension, Speed (processor speed) F, which takes on values f ∈ F. User ratings can be interpreted as empirical distributions P̂(s|t, g, f). A BN instantiation of this example, given below, sheds more light on the learning problem. The dependency structure of the BN is assumed to be defined by a domain expert and represents the constraints of the underlying problem. Figure 3(a) shows the BN under consideration5. Besides the dimensions topic T, source G, and speed F, there are latent variables A and B, which capture the behavioral similarities exhibited by populations across the different characteristics while tackling sparsity.

Let Θ denote the set of all parameters of the network, i.e., the (conditional) probability tables associated with all the nodes of the network. It is assumed that the probabilities P(G|Θ) and P(F|Θ) (the prior probabilities for the individual characteristics) are known; e.g., a simple estimate is the percentage of data available for each of these variables. The remaining parameters are to be learned using the available empirical distributions. These empirical distributions are over opinions for different topics conditioned on different dimensions. In particular, assume that the following empirical distributions were observed: P̂(S|t, gi, fi) for i = 1, ..., N, where N is the number of experts (empirical distributions observed) and (gi, fi) are their corresponding characteristics. The parameter learning problem for the Bayesian network can be cast as the following optimization problem.

Θ = argmin_Θ ∑_i DKL(P̂(S|t, gi, fi), P(S|t, gi, fi, Θ))

such that    P(S|t, gi, fi, Θ) = ∑_{a,b} P(S|a, b, t, Θ) P(a|gi, Θ) P(b|fi, Θ)      (3)

Simply stated, this objective function attempts to minimize the divergence between the learned conditional probability distribution and the observed conditional probability distribution. Since the parameters of the BN are estimated from available empirical distributions, the objective function above is different from the usual maximum likelihood (ML) learning of Bayesian networks. However, an EM algorithm [1] can still be derived to obtain the estimates of Θ.

Expanding Equation 3 and ignoring the constant term, the optimization problem is given as

Θ = argmax_Θ ∑_{S,i} P̂(S|t, gi, fi) log P(S|t, gi, fi, Θ)

The EM algorithm for the above objective function can now be written. Let Θk denote the estimate of Θ at the k-th step of the algorithm. We have:

E Step: Compute Q(a, b|S, t, gi, fi) = P(a, b|S, t, gi, fi, Θk)

M Step: Maximize ∑_{S,t,i,a,b} Q(a, b|S, t, gi, fi) P̂(S|t, gi, fi) log [P(S|a, b, t, Θ) P(a|gi, Θ) P(b|fi, Θ)]

Imposing appropriate constraints leads to the following update equation

P(S|t, a, b, Θk+1) = [∑_i P(a, b|S, t, gi, fi, Θk) P̂(S|t, gi, fi)] / [∑_S ∑_i P(a, b|S, t, gi, fi, Θk) P̂(S|t, gi, fi)]

The update equations for the other parameters can be obtained in a similar fashion.

5The analysis of model-based approaches to opinion pooling presented in Section 4 does not depend on the structure of the BN.
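A stripped-down instance of this EM update can be sketched as follows (our own toy, not the authors' implementation): a single latent behavior variable A and one dimension G, i.e., the model P(S|g) = ∑_a P(S|a) P(a|g) fitted to the empirical P̂(S|g_i).

```python
# Toy EM for fitting P(S|g) = sum_a P(S|a) P(a|g) to empirical opinions
# (a sketch under simplified assumptions: one latent variable, one dimension).

def em_step(phat, p_s_a, p_a_g):
    """One E+M step. Shapes: phat[i][s], p_s_a[a][s], p_a_g[i][a]."""
    nA, nS, nG = len(p_s_a), len(p_s_a[0]), len(phat)
    new_sa = [[1e-12] * nS for _ in range(nA)]   # smoothed accumulators
    new_ag = [[1e-12] * nA for _ in range(nG)]
    for i in range(nG):
        for s in range(nS):
            # E step: responsibilities Q(a | s, g_i) under current parameters.
            joint = [p_s_a[a][s] * p_a_g[i][a] for a in range(nA)]
            z = sum(joint) or 1.0
            for a in range(nA):
                w = phat[i][s] * joint[a] / z    # weighted by P̂(s|g_i)
                new_sa[a][s] += w                # M step: accumulate counts
                new_ag[i][a] += w
    norm = lambda v: [x / sum(v) for x in v]
    return [norm(r) for r in new_sa], [norm(r) for r in new_ag]

# Two sources with opposite opinions, two latent behaviors.
phat = [[0.8, 0.2], [0.2, 0.8]]
p_s_a = [[0.6, 0.4], [0.4, 0.6]]                 # init P(S|a)
p_a_g = [[0.5, 0.5], [0.5, 0.5]]                 # init P(a|g_i)
for _ in range(300):
    p_s_a, p_a_g = em_step(phat, p_s_a, p_a_g)

recon = [[sum(p_s_a[a][s] * p_a_g[i][a] for a in range(2)) for s in range(2)]
         for i in range(2)]
print(recon)   # each row should be close to the corresponding row of phat
```

Each iteration weights the usual EM sufficient statistics by the empirical distributions rather than by raw data counts, mirroring how the objective above differs from standard ML learning of a BN.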


6 Experiments

In this section we describe some experiments to validate the approach presented in this paper. There are two main tasks that one intends to solve by opinion pooling: “reporting” and “analysis” of data. In practice the obtained opinions, in the form of empirical distributions, are often sparse (the space of possible sets of characteristics is huge, and we may not have data for all possible combinations), and that makes the use of a model-based approach both necessary and interesting. The evaluation of the presented approach is primarily divided into two categories: a) robustness to data sparsity, and b) the ability of the model to capture behavioral similarities across dimensions. We give results of experiments both on synthetic data, as it provides a more controlled environment allowing for better study, and on real data, using opinions gathered from the Web. The model is validated against LinOp.

6.1 Synthetic Data

The BN model from which data was sampled is a joint distribution over 4 random variables S, A, G, T with the factorization P(S, A, G, T) = P(T) P(G) P(A|G) P(S|A, T). The synthetic data was generated for a single topic (T) from 10 geographical locations, i.e., G can take on one of 10 different values. The opinion of the topic is represented by the random variable S, which also (confusingly) can take on 10 different values (1 − 10). The data was generated to reflect three hidden behaviors: optimistic, pessimistic, and unbiased. For optimistic behavior there is a greater probability mass over higher values of the opinion, while the converse is true for pessimistic behavior. A uniformly distributed probability mass reflects an unbiased behavior. Fig. 3(a) shows the distribution of the behaviors from which the synthetic data was sampled. A total of 10 experts was assumed (i.e., a total of 10 opinions), one from each geographical location. Moreover, each geographical location is associated with one of the behaviors. Specifically, three geographical regions were assumed to have an optimistic behavior, 3 regions pessimistic, and the remaining 4 unbiased. For each expert, 1000 data points (i.e., 1000 opinion values) were sampled from the appropriate behavior (based on geography). The empirical distribution of these 1000 points was taken to be the expert's opinion.

Learning was accomplished using the EM algorithm (cf. Section 5) with all parameters initialized randomly. To test the sensitivity of the algorithm (against overfitting), the latent variable A was run with a cardinality of 4 (recall that the ground truth has cardinality 3, the distinct behaviors). Fig. 3(b) shows the learned mixture coefficients P(a|g). Note that the algorithm did learn the existence of three main behaviors, indicated by the overlap between classes 3 and 4 on the right side. Upon examination, P(S|a = 3) and P(S|a = 4) (not shown in the interest of space) were found to be very similar.

To test robustness to sparsity, opinions (empirical distributions) were generated for 2 distinct topics (1 & 2), 10 geographical locations, and identical behavior (the same as in the previous setting). The learning algorithm was allowed only a portion of the opinions for parameter estimation. Specifically, for Topic 1, opinions from all geographic locations were used, while for Topic 2, opinions from only 5 geographical locations were used. Fig. 3(c) shows the learned distributions of opinion (averaged over all locations) for each of the two topics. Note that for Topic 1 the results of LinOp and the model-based approach are identical. However, for Topic 2 there is a difference in performance between the two approaches. The model-based approach generalizes significantly better than LinOp. This is evident from the resulting distribution for the model-based approach being closer to the one of Topic 1 (whose behavior is identical to Topic 2).


Page 74: Artificial Intelligence and Mathematics - Rutgers Universityrutcor.rutgers.edu/~amai/aimath04/AcceptedPapers/aimath-I.pdf · Artificial Intelligence and Mathematics January 4-6, 2004

[Figure 3 appears here: panel (a) plots probability against opinion values 1-10 for the optimistic, pessimistic and unbiased behaviors; panel (b) plots the mixture weights against the 10 geographical locations; panel (c) plots the learned opinion probabilities, with curves for Topic 1 and Topic 2 learned using the BN and using LinOp.]

Figure 3: (a) Optimistic, pessimistic and unbiased behaviors. (b) Plot of mixture coefficients P(a|g). (c) Results of the sparsity experiment.

Table 1: Prediction of queries not present in the training set.

            Source     Brand   Model   Speed     P(S|Query)   P(S|Query)
  Query 1   Epinions   HP              266MHz    [0.9 0.1]    [0.89 0.11]
  Query 2   ZDnet      Sony    Vaio    667MHz    [0.8 0.2]    [0.81 0.19]

6.2 Real Data

The second set of experiments was conducted on real data consisting of opinions about different laptops collected from several sources on the Web, namely Epinions, Cnet, ZDnet, and Ciao. Each laptop in reality is described by several dimensions (possibly tens). To make the experiments manageable, only company name, model and processor speed are considered here. A total of 2180 opinions, P(·), with 108 distinct characteristics, were collected from the different sources. The structure of the BN was chosen based on expert knowledge (details omitted in the interest of space).

Each opinion is expressed as a rating over a scale of either 1-5 or 1-7. These ratings were converted into a distribution over the space High and Low assuming the following simple probability model. Each rating was converted into a corresponding percentage. This is interpreted as the probability that a random reader will classify the corresponding review as High. Note that more complicated probability models can be used to convert ratings into more complex probability distributions.
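One plausible reading of this conversion is sketched below; the rating-to-percentage mapping (rating divided by the scale maximum) is an assumption, since the text does not spell the mapping out.

```python
def rating_to_distribution(rating, scale_max):
    """Map a 1..scale_max rating to a distribution over {High, Low}.
    The percentage rating/scale_max is read as the probability that a
    random reader labels the review High (assumed mapping)."""
    p_high = rating / scale_max
    return {"High": p_high, "Low": 1.0 - p_high}

d5 = rating_to_distribution(4, 5)   # a 4-out-of-5 rating
d7 = rating_to_distribution(7, 7)   # a 7-out-of-7 rating
```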

To evaluate model robustness to data sparsity, the dataset was divided into a 70-30 training/test split. For each characteristic, ground truth was defined by applying LinOp over all opinions (ignoring the split) sharing this characteristic. The BN was learned using 70% of the data. For comparison, a LinOp-based consensus opinion was obtained for each characteristic using the appropriate opinions from the training split. The average KL distance between the ground truth and the model-based approach was 0.0302, whereas the KL distance between ground truth and LinOp was 0.0439. The average was taken over all possible values of characteristics. This suggests that there is indeed information to be learnt from other opinions when providing an aggregate opinion.

Sometimes the queries of the DM may involve characteristics for which opinions may not be available in the training set. To test the predictive ability of the model, the BN was tested on opinions whose characteristics are not contained in the training set. Note that LinOp cannot provide an answer in such cases. Table 1 shows the results of this experiment.



Table 2: KL divergence between all pairs of P(A|Source).

             Epinions   Cnet     ZDnet    Ciao
  Epinions   -          0.3425   0.4030   0.466
  Cnet       -          -        0.111    0.3757
  ZDnet      -          -        -        0.0867

Table 2 shows the symmetric version6 of the KL divergences between all pairs of P(A|Source). The divergence between sources Ciao and ZDnet is the lowest, and they are both based in the UK while the remaining two operate out of the US. This interesting, albeit anecdotal, observation might be interpreted as sources exhibiting behavioral similarities.
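The symmetrized divergence used in Table 2, defined in footnote 6 as KL(p, q) + KL(q, p), can be computed for discrete distributions as in the following sketch:

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    """The symmetrized divergence of footnote 6: KL(p, q) + KL(q, p)."""
    return kl(p, q) + kl(q, p)

p, q = [0.7, 0.3], [0.4, 0.6]
d = symmetric_kl(p, q)
```

Unlike plain KL divergence, this quantity is the same in both argument orders, which is why Table 2 only needs its upper triangle.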

7 Summary

In this paper we introduced a generalized opinion pooling framework for synthesizing unstructured data, with an application to business intelligence reporting. The opinion pooling problem is cast in the form of a constrained divergence minimization problem. In contrast to conventional opinion pooling, where a single consensus opinion is sought from a collection of expert opinions, our framework allows the consensus opinion to take into account varying characteristics of the experts. The degree to which the differing characteristics are taken into account can be controlled by the constraints. Under reasonable conditions several desiderata are satisfied. The constraints can be implemented by statistical models such as Bayesian networks. We explained the training of such networks from empirical data. Finally, we presented experiments validating our approach using both synthetic data and real data.

References

[1] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 1977.

[2] C. Genest and J. V. Zidek. Combining probability distributions: A critique and an annotated bibliography (avec discussion). Statistical Science, 1:114–148, 1986.

[3] P. Maynard-Reid II and U. Chajewska. Aggregating learned probabilistic beliefs. In UAI, pages 354–361, 2001.

[4] B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2002.

[5] D. M. Pennock and M. P. Wellman. Graphical representations of consensus belief. In Proc. of the 15th Conf. on Uncertainty in Artificial Intelligence (UAI-99), pages 531–540, 1999.

6 The symmetric version of the KL divergence between two distributions p and q is given as KL(p, q) + KL(q, p).



A Framework for Sequential Planning in Multi-Agent Settings

Piotr J. Gmytrasiewicz and Prashant Doshi∗

Department of Computer Science, University of Illinois at Chicago

piotr,[email protected]

Abstract

This paper extends the framework of partially observable Markov decision processes (POMDPs) to multi-agent settings by incorporating the notion of agent models, or types defined in games of incomplete information, and by using Bayesian update over models during repeated interactions. We allow agents to have beliefs over physical states of the environment and over models of other agents, which could include their belief states. Our approach complements a more traditional approach in stochastic games that uses equilibria as a solution paradigm. Our work seeks to avoid some of the drawbacks of game-theoretic equilibria, which may be non-unique and unable to capture off-equilibrium behaviors. Our approach does so at the cost of having to represent, process and continually revise models of other agents. Agents' beliefs are, in general, arbitrarily nested, and optimal solutions to decision making problems are only asymptotically computable. However, approximate belief updates and approximately optimal plans are computable.

1 Introduction

We develop a framework for sequential rationality of autonomous agents interacting with other agents within a common, and possibly uncertain, environment. We use the normative paradigm of decision-theoretic planning under uncertainty as represented by partially observable Markov decision processes (POMDPs) [6, 16, 27]. We make this framework applicable to agents interacting with other agents by allowing them to have beliefs not only about the physical environment, but also about the other agents; i.e., their abilities, sensing capabilities, beliefs, preferences, and intentions.

The formalism of Markov decision processes has been extended to multiple agents before, giving rise to stochastic games or Markov games [10, 25]. Traditionally, the solution concept used for stochastic games is that of Nash equilibria. Some recent work in AI follows that tradition [5, 14, 17, 18]. However, while Nash equilibria are useful for describing a multi-agent system when, and if, it has reached a stable state, this solution concept is not sufficient as a general control paradigm. The main reasons are that there may be multiple equilibria with no clear way to choose among them (non-uniqueness), and the fact that equilibria do not specify actions in cases in which agents believe that other agents may not act according to their equilibrium strategies (incompleteness) [4, 15].

Other extensions of POMDPs to multiple agents appeared in [3, 30]. They have been called decentralized POMDPs (DEC-POMDPs), and are related to decentralized control problems [24]. The DEC-POMDP framework assumes that the agents are fully cooperative, i.e., they have a common reward function and form a team. Furthermore, it is assumed that the optimal joint policy is computed centrally and then distributed among the agents, which makes it a variant of multibody planning [27].

∗This research is supported by the National Science Foundation CAREER award IRI-9702132, and NSF award IRI-0119270.


Our formalism, called interactive POMDPs (I-POMDPs), is applicable to autonomous agents with possibly conflicting objectives, operating in partially observable environments, who locally compute what actions they should execute to optimize their preferences given what they believe. We are motivated by a subjective approach to probability in games [15], and we combine POMDPs, Bayesian games, work on interactive belief systems [1, 12, 19], and related work [4, 8]. The unique aspect of I-POMDPs is that they prescribe action based on an agent's beliefs about other agents and about their expected behaviors. This generalizes and complements the traditional equilibrium approach [15]: if the agent believes that others will act according to an equilibrium, then it also chooses to act out its own part of this equilibrium. However, if the agent believes that others will diverge from equilibrium behavior, then the agent can still optimize. This approach, also called the decision-theoretic approach to game theory [21], is capable of avoiding the difficulties of non-uniqueness and incompleteness of the traditional equilibrium approach, but at the cost of processing and maintaining possibly infinitely nested interactive belief systems [1, 2] (also called knowledge structures [28]). As a result, only approximate belief updates and approximately optimal solutions to optimal planning problems are computable in general.

Our approach follows the tradition of the knowledge-based paradigm of agent design, according to which it is useful for agents to represent and reason with the important elements of their environment to allow them to function efficiently. We extend this paradigm to other agents by relying on models of other agents to predict their actions, and on optimal choices of the agent's own behaviors given these predictions. The two main elements, first of predicting actions of others, and second of choosing one's own action, are both handled from the standpoint of Bayesian decision theory. The descriptive power of decision-theoretic rationality, which describes actions of rational individuals, is used to predict actions of other agents. The prescriptive aspect of decision-theoretic rationality, which dictates that the optimal actions chosen are the ones that maximize expected utility, is used by the agent to select its own action.

2 Background: Partially Observable Markov Decision Processes

A partially observable Markov decision process (POMDP) [6, 13, 16, 20] of an agent i is defined as

POMDPi = 〈S,Ai, Ti,Ωi, Oi, Ri〉 (1)

where: S is a set of possible states of the environment (defined as the reality external to agent i), Ai is a set of actions agent i can execute, Ti is a transition function, Ti : S × Ai × S → [0, 1], which describes the results of agent i's actions, Ωi is the set of observations that agent i can make, Oi is the agent's observation function, Oi : Ωi × S × Ai → [0, 1], which specifies the probabilities of observations if the agent executes various actions that result in different states, and Ri is the reward function representing agent i's preferences, Ri : S × Ai → R.
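The tuple of Eq. 1 maps directly onto a data structure. A minimal sketch in Python; the field names and the toy two-state instance are illustrative, not from the paper:

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Hashable

# A direct transcription of the tuple <S, Ai, Ti, Omega_i, Oi, Ri> from
# Eq. 1; the function types mirror the signatures given in the text.
@dataclass(frozen=True)
class POMDP:
    states: FrozenSet[Hashable]                          # S
    actions: FrozenSet[Hashable]                         # A_i
    T: Callable[[Hashable, Hashable, Hashable], float]   # T_i(s, a, s') -> [0, 1]
    observations: FrozenSet[Hashable]                    # Omega_i
    O: Callable[[Hashable, Hashable, Hashable], float]   # O_i(o, s, a) -> [0, 1]
    R: Callable[[Hashable, Hashable], float]             # R_i(s, a)

# A trivial two-state example for illustration (tiger-style numbers assumed)
tiger = POMDP(
    states=frozenset({"left", "right"}),
    actions=frozenset({"listen"}),
    T=lambda s, a, s2: 1.0 if s == s2 else 0.0,          # listening keeps the state
    observations=frozenset({"hear-left", "hear-right"}),
    O=lambda o, s, a: 0.85 if o == "hear-" + s else 0.15,
    R=lambda s, a: -1.0,                                 # small cost for listening
)
```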

The belief update and the optimal solutions (depending on an optimality criterion) to POMDPs aredescribed in the literature [6, 13, 16, 20].

2.1 Optimality Criteria

The agent's optimality criterion, OCi, is needed to specify how rewards acquired over time are handled. Commonly used criteria include:

• A finite horizon criterion, in which the agent maximizes the expected value of the sum of the first T rewards: E(∑_{t=0}^{T} r_t), where r_t is the reward obtained at time t and T is the length of the horizon. We will denote this criterion as fh_T.


• An infinite horizon criterion with discounting, according to which the agent maximizes E(∑_{t=0}^{∞} γ^t r_t), where 0 < γ < 1 is a discount factor. We will denote this criterion as ih_γ.

• An infinite horizon criterion with averaging, according to which the agent maximizes the average reward per time step. We will denote this as ih_AV.

In what follows, we concentrate on the infinite horizon criterion with discounting, but our approach can be easily adapted to other criteria.
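For a single reward trajectory, the three criteria can be evaluated as in the sketch below; the expectations in the definitions are over trajectories, which a full implementation would sample or enumerate.

```python
def finite_horizon_return(rewards):
    """fh_T: sum of the first T rewards of a single trajectory."""
    return sum(rewards)

def discounted_return(rewards, gamma):
    """ih_gamma: sum_t gamma^t * r_t, with 0 < gamma < 1."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def average_return(rewards):
    """ih_AV: average reward per time step."""
    return sum(rewards) / len(rewards)

rs = [1.0, 1.0, 1.0]
fh = finite_horizon_return(rs)      # 3.0
ih = discounted_return(rs, 0.5)     # 1 + 0.5 + 0.25 = 1.75
av = average_return(rs)             # 1.0
```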

2.2 Agent Types and Frames

The POMDP definition above includes parameters that permit us to compute an agent's optimal behavior, conditioned on its beliefs. Let us collect these implementation-independent factors into a construct we call an agent i's type.

Definition 1. (Type) A type of an agent i is θi = 〈bi, Ai, Ωi, Ti, Oi, Ri, OCi〉, where bi is agent i's state of belief (an element of ∆(S)), OCi is its optimality criterion, and the rest of the elements are as defined before. Let Θi be the set of agent i's types.

Given a type, θi, and under the assumption that the agent is Bayesian-rational, the set of the agent's optimal actions will be denoted as OPT(θi). In the next section, we generalize the notion of type to situations which include interactions with other agents; it then coincides with the notion of type used in Bayesian games [12, 10].

It is convenient to define the notion of a frame, θ̂i, of agent i:

Definition 2. (Frame) A frame of an agent i is θ̂i = 〈Ai, Ωi, Ti, Oi, Ri, OCi〉. Let Θ̂i be the set of agent i's frames.

For brevity one can write a type as consisting of an agent's belief together with its frame: θi = 〈bi, θ̂i〉.

3 Interactive POMDPs

As we mentioned, our intention is to generalize POMDPs to handle the presence of other agents. We do this by including descriptions of other agents (their types, for example), as well as physical aspects of the agent's own description (i.e., its own frame) in the state space. For simplicity of presentation we consider an agent i that is interacting with one other agent j.

Definition 3. (I-POMDP) An interactive POMDP of agent i, I-POMDPi, is:

I-POMDPi = 〈ISi, A, Ti,Ωi, Oi, Ri〉 (2)

where:

• ISi is a set of interactive states defined as ISi = S × Mj, where S is the set of states of the physical environment, and Mj is the set of possible models of agent j. Models of other agents are included in the state space to allow an agent to have beliefs over them.

Models of agents include factors relevant to the agents' behavior. Analogously to states of the world, models of agents are intended to be rich enough to allow informed prediction about behavior. Agent i maintains its belief about the interactive state as a probability distribution over ISi. Let us note that the states are subjective; the reality external to each agent, say i, is different in that it includes agents other than i, in this case agent j.

A general model of an agent j is a function mj : Hj → ∆(Aj), i.e., a mapping from j's observable histories to probabilistic predictions of j's behavior. Let Mj be the set of all models of j that are computable. One example of a model is obtained during the fictitious play considered in game theory [9, 22]. In a fictitious play model, probabilities of the agent's future actions are estimated as frequencies of actions observed in the past. Another simple version is a no-information model [11], according to which actions are independent of history and are assumed to occur with uniform probabilities, 1/|Aj| each.
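Both of these subintentional models can be sketched in a few lines. The add-one smoothing in the fictitious-play estimate is an assumption added here so that unseen actions keep nonzero mass; plain fictitious play uses the raw frequencies.

```python
from collections import Counter

def fictitious_play_model(history, action_set):
    """Fictitious play: estimate Pr(a_j) from the empirical frequencies of
    j's past actions. Add-one smoothing is an assumption added here so
    that unseen actions keep nonzero mass."""
    counts = Counter(history)
    total = len(history) + len(action_set)
    return {a: (counts[a] + 1) / total for a in action_set}

def no_information_model(action_set):
    """No-information model: uniform 1/|A_j|, independent of history."""
    return {a: 1.0 / len(action_set) for a in action_set}

actions = ["open-left", "open-right", "listen"]
m_fp = fictitious_play_model(["listen", "listen", "open-left"], actions)
m_ni = no_information_model(actions)
```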

Perhaps the most interesting model is the intentional model, defined to be the agent j's type, θj = 〈bj, A, Ωj, Tj, Oj, Rj, OCj〉, together with the assumption that agent j is Bayesian rational. Agent j's belief is a probability distribution over the states of the world, the models of agent i, and frames of itself: bj ∈ ∆(S × Mi). The notion of an intentional model, or type, we use here coincides with the notion of type in game theory, where it is defined as consisting of all of agent i's private information relevant to its decision making [12, 10]. In particular, if agents' beliefs are private information, then their types involve possibly infinitely nested beliefs over others' types and their beliefs about others. They have been called knowledge-belief structures and type hierarchies in game theory [1, 2, 7, 19], and are related to recursive model structures in our prior work [11].1

• A = Ai × Aj is the set of joint moves of all agents.

• Ti is a transition function, Ti : ISi × A × ISi → [0, 1], which describes the results of the agents' actions. Actions can change the physical state, as well as the models of other agents and the agent's own frame, for example by changing the observation function of one or both agents. One can model communicative actions as changing the beliefs of the agents directly, but it may be more appropriate to model communication as an action that can be observed by others and thus can change their beliefs indirectly.

One can make the following assumption about Ti:

Definition 4. (Belief Non-manipulability (BNM)) Agents' actions do not change the agents' beliefs directly. Formally, for all s, bj, θ̂j, s′, b′j, θ̂′j we have:

Ti((s, 〈bj, θ̂j〉), a, (s′, 〈b′j, θ̂′j〉)) > 0 only if bj = b′j.        (3)

Belief non-manipulability is justified in usual cases of interacting autonomous agents. Autonomy precludes direct "mind control" and implies that agents' belief states can be changed only indirectly, typically by changing the environment in a way observable to them.2 As we mentioned, the agents' actions are still allowed to change, say, their observation capabilities and their preferences.3

• Ωi is defined as before, in the POMDP model.

• Oi is an observation function, Oi : Ωi × ISi × A → [0, 1].

One can make the following assumption about the observation function.

1 Implicit in the definition of nested beliefs is the assumption of coherency [7].
2 In other words, agents' beliefs do change, just like in POMDPs, but as a result of belief update after an observation, not as a direct result of any of the agents' actions.
3 One can strengthen the notion of autonomous agents by postulating that their preferences are non-manipulable as well, but we do not go into this here for simplicity.


Definition 5. (Belief Non-observability (BNO)) Agents cannot observe others' beliefs directly. Formally, for all o, s, bj, θ̂j, and b′j, we have:

Oi(o, (s, 〈bj, θ̂j〉), a) = Oi(o, (s, 〈b′j, θ̂j〉), a).        (4)

The BNO assumption does not imply that the other elements of the agent's type are fully observable. For example, it is unlikely that an agent's reward function would be directly observable by other agents, but we do not get into this issue here for simplicity.

• Ri is defined as Ri : ISi × A → R. We allow the agent to have preferences over physical states and models of all agents, but usually only the physical state will matter.

3.1 Belief Update in I-POMDPs

We will show that, as in POMDPs, an agent's beliefs over its interactive states are sufficient statistics, i.e., they fully summarize the agent's observable history. Further, we need to show how beliefs are updated after the agent's action and observation, and how the solution is defined. There are two differences that complicate belief update when compared to POMDPs. First, since the state of the physical environment depends on the actions performed by both agents, the prediction of how the physical state changes has to be made based on the predicted actions of the other agent. The probabilities of the other's actions are obtained based on their models. Thus, unlike in Bayesian and stochastic games, we do not assume that actions are fully observable by other agents. Rather, agents can attempt to infer what actions other agents have performed by sensing their results on the environment.

Second, changes in the models of the other agents have to be included in the update. Some of these changes may be directly attributed to the agents' actions,4 but, more importantly, the update of the other agent's beliefs due to its new observation has to be included. In other words, the agent has to update its beliefs based on what it anticipates that the other agent observes and how it updates. As could be expected, the update of the possibly infinitely nested belief over the other's types is, in general, only asymptotically computable.

Proposition 1. (Sufficiency) In an interactive POMDP of agent i, i's current belief, i.e., the probability distribution over the set S × Mj, is a sufficient statistic for the past history of i's observations.

The next proposition defines agent i's belief update function, b_i^t(is^t) = Pr(is^t | o_i^t, a_i^{t−1}, b_i^{t−1}), where is^t ∈ ISi is an interactive state. We will use the belief state estimation function, SE_θi, as an abbreviation for belief updates for individual states, so that b_i^t = SE_θi(b_i^{t−1}, a_i^{t−1}, o_i^t). τ_θi(b_i^{t−1}, a_i^{t−1}, o_i^t, b_i^t) will stand for Pr(b_i^t | b_i^{t−1}, a_i^{t−1}, o_i^t). We will also define the set of type-dependent optimal actions of an agent, OPT(θi).

Proposition 2. (Belief Update) Under the BNM and BNO assumptions, the belief update function for an interactive POMDP 〈ISi, A, Ti, Ωi, Oi, Ri〉 is:

b_i^t(is^t) = β ∑_{is^{t−1}} b_i^{t−1}(is^{t−1}) ∑_{a_j^{t−1}} Pr(a_j^{t−1} | m_j^{t−1}) O_i(is^t, a^{t−1}, o_i^t)
              × ∑_{o_j^t} τ_{θ_j^t}(b_j^{t−1}, a_j^{t−1}, o_j^t, b_j^t) T_i(is^{t−1}, a^{t−1}, is^t) O_j(is_j^t, a^{t−1}, o_j^t)        (5)

where is^t = (s^t, m_j^t), is_j^t = (s^t, m_i^t), b_j^{t−1} and b_j^t are the belief elements of m_j^{t−1} and m_j^t, respectively, if these models are intentional, β is a normalizing constant, O_j is the observation function in m_j^t (assuming it is an intentional model), and Pr(a_j^{t−1} | m_j^{t−1}) is the probability of the other agent's action given its model, i.e., the probability that a_j^{t−1} is Bayesian rational for that type of agent. If the model is the other agent's type, θj, then this probability is equal to 1/|OPT(θj)| if a_j^{t−1} ∈ OPT(θj), and equal to zero otherwise. We define OPT below. If the model of agent j is not intentional, then the probabilities of actions are given directly by the model, and the summation over o_j^t, the τ_{θ_j} term, and the last line of Eq. 5 drop out.

4 For example, actions may change the agents' observation capabilities directly.

Proof of Propositions 1 and 2. We start with Proposition 2, by applying Bayes' theorem:

b_i^t(is^t) = Pr(is^t | o_i^t, a_i^{t−1}, b_i^{t−1})
            = Pr(is^t, o_i^t, a_i^{t−1}, b_i^{t−1}) / Pr(o_i^t, a_i^{t−1}, b_i^{t−1})
            = [Pr(is^t, o_i^t | a_i^{t−1}, b_i^{t−1}) Pr(a_i^{t−1}, b_i^{t−1})] / [Pr(o_i^t | a_i^{t−1}, b_i^{t−1}) Pr(a_i^{t−1}, b_i^{t−1})]
            = Pr(is^t, o_i^t | a_i^{t−1}, b_i^{t−1}) / Pr(o_i^t | a_i^{t−1}, b_i^{t−1})
            = β ∑_{is^{t−1}} Pr(is^t, o_i^t | a_i^{t−1}, is^{t−1}) b_i^{t−1}(is^{t−1})
            = β ∑_{is^{t−1}} ∑_{a_j^{t−1}} Pr(is^t, o_i^t | a_i^{t−1}, a_j^{t−1}, is^{t−1}) Pr(a_j^{t−1} | a_i^{t−1}, is^{t−1}) b_i^{t−1}(is^{t−1})
            = β ∑_{is^{t−1}} ∑_{a_j^{t−1}} Pr(is^t, o_i^t | a^{t−1}, is^{t−1}) Pr(a_j^{t−1} | is^{t−1}) b_i^{t−1}(is^{t−1})
            = β ∑_{is^{t−1}} b_i^{t−1}(is^{t−1}) ∑_{a_j^{t−1}} Pr(a_j^{t−1} | m_j^{t−1}) Pr(o_i^t | is^t, a^{t−1}, is^{t−1}) Pr(is^t | a^{t−1}, is^{t−1})
            = β ∑_{is^{t−1}} b_i^{t−1}(is^{t−1}) ∑_{a_j^{t−1}} Pr(a_j^{t−1} | m_j^{t−1}) Pr(o_i^t | is^t, a^{t−1}) Pr(is^t | a^{t−1}, is^{t−1})
            = β ∑_{is^{t−1}} b_i^{t−1}(is^{t−1}) ∑_{a_j^{t−1}} Pr(a_j^{t−1} | m_j^{t−1}) O_i(is^t, a^{t−1}, o_i^t) Pr(is^t | a^{t−1}, is^{t−1})        (6)

Pr(is^t | a^{t−1}, is^{t−1}) = ∑_{o_j^t} Pr(is^t | a^{t−1}, is^{t−1}, o_j^t) [term 1] · Pr(o_j^t | a^{t−1}, is^{t−1}) [term 2]        (7)

In order to simplify term 1, Pr(is^t | a^{t−1}, is^{t−1}, o_j^t), let us substitute the interactive state is^t with its components, is^t = (s^t, m_j^t), and, if m_j^t is intentional, with (s^t, b_j^t, θ̂_j^t):

Pr(is^t | a^{t−1}, is^{t−1}, o_j^t) = Pr(s^t, b_j^t, θ̂_j^t | a^{t−1}, is^{t−1}, o_j^t)
                                    = Pr(b_j^t | s^t, θ̂_j^t, a^{t−1}, is^{t−1}, o_j^t) Pr(s^t, θ̂_j^t | a^{t−1}, is^{t−1}, o_j^t)

In the above equation, the first term on the right-hand side is 1 if agent j's belief update, SE_θj(b_j^{t−1}, a_j^{t−1}, o_j^t), generates a belief state equal to b_j^t. This term drops out if m_j^t is not an intentional model. The action pair a^{t−1} may change the physical state, agent j's model, and agent i's model. The second term on the right-hand side captures this transition. We utilize the BNM assumption to replace the second term with the transition function:

Pr(is^t | a^{t−1}, is^{t−1}, o_j^t) = τ_{θ_j^t}(b_j^{t−1}, a_j^{t−1}, o_j^t, b_j^t) T_i(is^{t−1}, a^{t−1}, is^t)

In order to evaluate term 2 of Eq. 7, we introduce an intermediate state is^{t−1/2}. The intermediate state results after the action pair a^{t−1} but before the observations are perceived.

Pr(o_j^t | a^{t−1}, is^{t−1}) = ∑_{is^{t−1/2}} Pr(o_j^t | a^{t−1}, is^{t−1}, is^{t−1/2}) Pr(is^{t−1/2} | a^{t−1}, is^{t−1})
                              = ∑_{is^{t−1/2}} Pr(o_j^t | a^{t−1}, is^{t−1/2}) Pr(is^{t−1/2} | a^{t−1}, is^{t−1})

In the first term of the above equation, the BNO assumption makes it possible to replace is^{t−1/2} with is^t:

Pr(o_j^t | a^{t−1}, is^{t−1}) = ∑_{is^{t−1/2}} Pr(o_j^t | a^{t−1}, is^t) Pr(is^{t−1/2} | a^{t−1}, is^{t−1})
                              = Pr(o_j^t | a^{t−1}, is^t) ∑_{is^{t−1/2}} Pr(is^{t−1/2} | a^{t−1}, is^{t−1})
                              = Pr(o_j^t | a^{t−1}, is^t)
                              = O_j(is_j^t, a^{t−1}, o_j^t)


We now replace the summand of Eq. 7 with the evaluated expressions:

Pr(is^t | a^{t−1}, is^{t−1}) = ∑_{o_j^t} τ_{θ_j^t}(b_j^{t−1}, a_j^{t−1}, o_j^t, b_j^t) T_i(is^{t−1}, a^{t−1}, is^t) O_j(is_j^t, a^{t−1}, o_j^t)        (8)

The final equation for our belief update (Eq. 5) results from substituting Eq. 8 into Eq. 6. Since Proposition 2 expresses the belief b_i^t(is^t) in terms of parameters from the previous time step only, Proposition 1 holds as well.

Intuitively, Proposition 1 holds since, as in POMDPs [29], all of the dynamic elements (state transitions and observations) of the model depend on the previous state, not on prior observations and actions.

Proposition 1 and Eq. 5 have a lot in common with belief update in POMDPs, as should be expected. Both depend on agent i's observation and transition functions. However, since agent i's observations also depend on agent j's actions, the probabilities of the various actions of j have to be included (in the first line of Eq. 5). Further, since the update of agent j's model depends on what j observes, the probabilities of the various observations have to be included (in the last line of Eq. 5). The update of j's beliefs is included by adding the τ_θj term.

If none of the models m_j are intentional, the belief update in I-POMDPs reduces to a form similar to POMDPs. If some models are intentional, the belief could be infinitely nested and the belief update can be calculated only asymptotically. The belief update can easily be generalized to settings where more than one other agent co-exists with agent i.

3.2 Value Function and Solutions in I-POMDPs

Analogously to POMDPs, each belief state in an I-POMDP has an associated value reflecting the maximum payoff the agent can expect in this belief state:

U(θi) = max_{a_i∈A_i} { ∑_{is} b_i(is) ER_i(is, a_i) + γ ∑_{o_i∈Ω_i} Pr(o_i | a_i, b_i) U(SE_θi(b_i, a_i, o_i)) }        (9)

where ER_i(is, a_i) = ∑_{a_j} R_i(is, a_i, a_j) Pr(a_j | m_j) (since is = (s, m_j)).

Agent i's optimal action, a*, for the case of the infinite horizon criterion with discounting, is an element of the set of optimal actions for the belief state, OPT(θi), which is defined as:

OPT(θi) = argmax_{a_i∈A_i} { ∑_{is} b_i(is) ER_i(is, a_i) + γ ∑_{o_i∈Ω_i} Pr(o_i | a_i, b_i) U(SE_θi(b_i, a_i, o_i)) }        (10)

Since the beliefs could be infinitely nested, approximations that involve terminating the nesting of beliefs, for example with a no-information model, involve solving a finite number of traditional POMDPs, and their complexity is PSPACE-hard [26]. Including more information residing in deeper levels of nesting results in better approximations. We investigated error bounds of such approximations experimentally in [23] in myopic settings; formal proofs of the error bounds for longer time horizons remain a subject of future work.

4 Conclusions

This paper proposes a decision-theoretic approach to game theory as a paradigm for designing agents that are able to intelligently interact and coordinate actions with other agents in multi-agent environments. We defined a general multi-agent version of partially observable Markov decision processes, called interactive POMDPs, and illustrated its assumptions, some basic properties and a solution method.


This line of work opens a wide area of fertile future research that integrates frameworks for sequential planning, like POMDPs, with elements of game theory and inductive learning.

References

[1] Robert J. Aumann. Interactive epistemology I: Knowledge. International Journal of Game Theory, 28:263–300, 1999.

[2] Robert J. Aumann and Aviad Heifetz. Handbook of Game Theory with Economic Applications, volume 3.Elsevier Science, 2002.

[3] Daniel S. Bernstein, Robert Givan, Neil Immerman, and Shlomo Zilberstein. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4):819–840, 2002.

[4] Ken Binmore. Essays on Foundations of Game Theory. Pitman, 1982.

[5] Craig Boutilier. Sequential optimality and coordination in multiagent systems. In Sixteenth International Joint Conference on Artificial Intelligence, pages 478–485, 1999.

[6] Craig Boutilier, Thomas Dean, and Steve Hanks. Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11:1–94, 1999.

[7] Adam Brandenburger and Eddie Dekel. Hierarchies of beliefs and common knowledge. Journal of Economic Theory, 59:189–198, 1993.

[8] Ronald Fagin, Joseph Halpern, Yoram Moses, and Moshe Vardi. Reasoning about Knowledge. MIT Press,1995.

[9] Drew Fudenberg and David K Levine. The Theory of Learning in Games. MIT Press, 1998.

[10] Drew Fudenberg and Jean Tirole. Game Theory. MIT Press, 1991.

[11] Piotr Gmytrasiewicz and Edmund Durfee. Rational coordination in multi-agent environments. AutonomousAgents and Multiagent Systems Journal, 3(4):319–350, 2000.

[12] John C. Harsanyi. Games with incomplete information played by 'Bayesian' players. Management Science, 14(3):159–182, 1967.

[13] Milos Hauskrecht. Value-function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research, 13:33–94, 2000.

[14] J. Hu and M. P. Wellman. Multiagent reinforcement learning: Theoretical framework and an algorithm. InFifteenth International Conference on Machine Learning, pages 242–250, 1998.

[15] Joseph Kadane and Patrick Larkey. Subjective probability and the theory of games. Management Science,28(2):113–120, 1982.

[16] Leslie Kaelbling, Michael Littman, and Anthony Cassandra. Planning and acting partially observablestochastic domains. Artificial Intelligence, 2, 1998.

[17] Daphne Koller and Brian Milch. Multi-agent influence diagrams for representing and solving games. InSeventeenth International Joint Conference on Artificial Intelligence, pages 1027–1034, August 2001.

[18] Michael Littman. Markov games as a framework for multiagent reinforcement learning. In InternationalConference on Machine Learning, 1994.

[19] J.F. Mertens and S. Zamir. Formulation of bayesian analysis for games with incomplete information. In-ternational Journal of Game Theory, 14:1–29, 1985.

[20] George Monahan. A survey of partially observable markov decision processes. theory, models and algo-rithms. Management Science, pages 1–16, 1982.

[21] Roger B. Myerson. Game Theory: Analysis of Conflict. Harvard University Press, 1991.

Page 84: Artificial Intelligence and Mathematics - Rutgers Universityrutcor.rutgers.edu/~amai/aimath04/AcceptedPapers/aimath-I.pdf · Artificial Intelligence and Mathematics January 4-6, 2004


Heuristics for a Brokering Set Packing Problem

Y. Guo1, A. Lim2, B. Rodrigues3, Y. Zhu2

1 Dept of Computer Science, National University of Singapore, Science Drive 2, Singapore 117543
[email protected]

2 Dept of IEEM, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong
iealim,[email protected]

3 School of Business, Singapore Management University, 469 Bukit Timah Road, Singapore 259756
[email protected]

Abstract

In this paper we formulate the combinatorial auction brokering problem as a Set Packing Problem (SPP) and apply a simulated annealing heuristic to solve it. Experimental results are compared with a recent heuristic algorithm and with the CPLEX Integer Programming solver, where we found that the simulated annealing approach outperforms the previous heuristic method and CPLEX, obtaining results within smaller timescales.

1 Introduction

In [5], a model was proposed for the brokering problem as a set packing problem (SPP). The problem was motivated by the need to provide an intelligent agent-based framework that supports fourth-party logistics operations. Combinatorial optimization problems arising in bidding and auctions have been extensively studied in the literature [1][2]. Recently, combinatorial auction theory has become a subject of interest. De Vries and Vohra provide an excellent survey of such problems, including the SPP and two other relevant problems, the set partitioning and set covering problems [7]. While there are previous papers on approaches to the set partitioning and covering problems, the set packing problem has received relatively little analysis [3][4][6].

In this work, we study the broker model, which can be described as follows. Assume there are n bids and m jobs, and each bid covers a number of jobs, resulting in a profit to the supplier; wj (j ∈ {1, ..., n}) is the profit if bid j is selected, and [aij]m×n is an m-row, n-column 0-1 matrix, where aij = 1 if job i is included in bid j. Further, the decision variables xj = 1 if the supplier selects bid j, and 0 otherwise. An integer programming (IP) model for the brokering problem as an SPP is then:

    Maximize     ∑_{j∈N} wj xj                      (1)

    subject to   ∑_{j∈N} aij xj ≤ 1,   i ∈ M       (2)

                 xj ∈ {0, 1},   j ∈ N              (3)

where N = {1, . . . , n} and M = {1, . . . , m}. The first set of constraints (2) ensures that each row is covered by at most one column, and the integrality constraints (3) ensure that xj = 1 iff column j of the matrix is in the solution.

In [5], the SPP is directly related to bid allocation: rows represent submitted jobs, columns represent bids for a subset of rows, and the problem is to find a set of bids with maximum weight which covers each job at most once. The weights were derived from the three factors of pricing, preference and fairness, and indicate how good a bid is. Although the SPP is well known, to the best of our knowledge we have not found any application of heuristics to this problem, except in [5], where an Iterative Greedy (IG) approach was proposed. The IG method was applied to problems of up to 200 jobs and 100 bids. In this paper, we develop a simulated annealing heuristic embedded with a greedy search which guides the main algorithm. From extensive experiments, we found that this approach gives better results in less time than the IG approach. When applied to large test cases with up to 1500 jobs and 1500 bids, the new method gives good results compared with those obtained by CPLEX, and in much less time.

The paper is organized as follows: in the next section we propose a new simulated annealing heuristic embedded with greedy local search to solve the SPP and discuss the solution techniques in detail. Detailed experimental results are presented in Section 3, where the results of our method are compared with the CPLEX solver and the previous heuristic method. Section 4 concludes the paper.

2 Simulated Annealing with Greedy Search

Simulated Annealing (SA) is a meta-heuristic that differs from the traditional hill-climbing method in that it accepts, with a certain probability, local moves which may decrease the current objective value. It comprises two major components: a local search and a temperature cooling schedule. We developed a hybrid meta-heuristic which uses an SA approach with a greedy search for selecting local moves.

2.1 The SA Framework

Our heuristic algorithm is illustrated in Algorithm 1, where the m × n matrix a denotes the job-bid 0-1 matrix with aij = 1 if bid j includes job i (1 ≤ i ≤ m, 1 ≤ j ≤ n). Before the program embarks on the heuristic search, we preprocess the input data to eliminate redundant calculations in the heuristic. The preprocessing first calculates a 0-1 matrix C = [cij]n×n, where cij = 1 if bid i and bid j both contain some job k (1 ≤ i ≤ n, 1 ≤ j ≤ n, 1 ≤ k ≤ m); bid i and bid j are then said to "conflict" with each other. If there exists any bid i (1 ≤ i ≤ n) such that cij = 0 for every j (1 ≤ j ≤ n), we include bid i in the final solution and do not consider such bids in the heuristic search.
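The preprocessing step can be sketched as follows (our own rendering, not the authors' code; the instance is invented). It builds the conflict matrix C from the job-bid matrix a and extracts the bids that conflict with nothing, which can be included unconditionally:

```python
def preprocess(a):
    """Build the n x n bid-conflict matrix C from the m x n job-bid
    matrix a: C[i][j] = 1 iff bids i and j both contain some job k.
    Also return the bids that conflict with no other bid; they can be
    added to the final solution and excluded from the search."""
    m, n = len(a), len(a[0])
    jobs_of = [{i for i in range(m) if a[i][j] == 1} for j in range(n)]
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if jobs_of[i] & jobs_of[j]:      # share at least one job
                C[i][j] = C[j][i] = 1
    free_bids = [j for j in range(n) if not any(C[j])]
    return C, free_bids

# Invented instance: 4 jobs x 5 bids; bid 4 covers job 3 alone.
a = [[1, 0, 1, 0, 0],
     [1, 1, 0, 0, 0],
     [0, 0, 1, 1, 0],
     [0, 0, 0, 0, 1]]
C, free_bids = preprocess(a)   # free_bids == [4]
```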

Here, we combine the greedy local search with the SA search: the greedy method by itself easily gets stuck in local optima, but when it is used to guide the SA search, the results turn out much better than the IG method in [5] or than SA alone. The extent to which the greedy



Algorithm 1 Simulated Annealing Framework

S ← ∅; best_value ← 0; Temperature ← Tmax; Iter ← 0
preprocess()
while Iter < Max_Iter and Temperature > T_Terminate do
    with probability p1:
        S ← SA_Localsearch(S); Cool(Temperature)
    with probability (1 − p1):
        S ← Greedy_Localsearch(S)
    if value(S) > best_value then
        best_value ← value(S)
    end if
    Iter ← Iter + 1
end while

Figure 1: SA exchange 2 Example

local search is embedded within the SA search is an important factor, and in our implementation this is expressed by the probability p1.

2.2 SA Local Search and Cooling Schedule

Unlike hill climbing, the SA local search can accept moves that lower the current objective value in order to escape from possible local optima. The SA local search algorithm is sketched in Algorithm 2, where C4 is a constant.

We make use of 2 types of neighborhood moves: SA_exchange_1 and SA_exchange_2. These select 1 or 2 bids respectively to add to the current solution S, and then remove any bids in S that conflict with the newly added bids. The algorithm for SA_exchange_2() is illustrated by Algorithm 3; see also Fig. 1. In this example, we have 7 bids in all, with 4 already selected in Stemp. SA_exchange_2 selects 2 non-conflicting bids 5 and 7 from outside Stemp to add to it, and removes bid 3 from Stemp because it conflicts with the newly added bids. For the cooling scheme, we set the initial temperature Tmax to a large value with the cooling schedule Temperature = C5 × Temperature, where C5 is a constant slightly smaller than 1.

2.3 Greedy Local Search

The greedy local search is used with the SA local search to improve performance. In a greedy local search, the determination of the greedy value is directly related to the performance of



Algorithm 2 SA local search

Stemp ← S
with probability p2:
    SA_exchange_1(Stemp)
with probability (1 − p2):
    SA_exchange_2(Stemp)
δ ← value(Stemp) − value(S)
if δ > 0 then
    return Stemp
else    {from S to Stemp is a downhill move, δ ≤ 0}
    T ← C4 × Temperature
    p ← e^(δ/T)
    with probability p:
        S ← Stemp
    return S
end if

Algorithm 3 SA_exchange_2(Stemp)

if at least 2 non-conflicting bids are not in Stemp then
    randomly select bid i and bid j where i, j ∉ Stemp and cij = 0
    remove bid k from Stemp if cik = 1 or cjk = 1
    Stemp ← Stemp ∪ {i, j}
end if
return Stemp
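A direct Python rendering of this move is given below (our own sketch, not the authors' implementation; bid indices are 0-based here, unlike the 1-based figure, and C is the conflict matrix from preprocessing):

```python
import random

def sa_exchange_2(S, C, n):
    """SA_exchange_2: add two randomly chosen, mutually non-conflicting
    bids from outside S, then drop every bid of S that conflicts with
    either newcomer. C is the n x n 0-1 conflict matrix."""
    outside = [j for j in range(n) if j not in S]
    pairs = [(i, j) for i in outside for j in outside
             if i < j and C[i][j] == 0]
    if not pairs:
        return set(S)          # fewer than 2 compatible bids available
    i, j = random.choice(pairs)
    kept = {k for k in S if C[i][k] == 0 and C[j][k] == 0}
    return kept | {i, j}
```

SA_exchange_1 is the same with a single bid. Note that the move can shrink as well as grow the solution, which is what lets the SA search leave local optima.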

the search. In the SPP, a natural approach is to use the profit of a bid as its greedy value. However, this ignores the fact that the more jobs a bid contains, the more likely it is to conflict with other bids and constrain the bids that the supplier can select, which makes solutions inferior. In our greedy local search, we therefore assign a penalty cost to each bid, and take the profit less the penalty cost as the greedy value.

2.3.1 Relative Penalty Cost

In order to derive the relative penalty cost, we first introduce an absolute penalty cost |penalty| given by:

    |penalty|_i = ∑_{j=1}^{n} C1 · profit_j · c_ij,    1 ≤ i ≤ n,

where [cij]n×n is the n-row, n-column 0-1 matrix with cij = 1 iff ∃k ∈ M such that aik = 1 and ajk = 1. When bid i conflicts with a high-profit bid j, selecting bid i and not bid j may be detrimental; thus, we assign each bid a penalty cost according to the profits accrued from all other conflicting bids, as above. Here, C1 is a constant.

We can now define a relative penalty:

    penalty_i = ∑_{j=1}^{n} (C2 · profit_j − C3 · |penalty|_j) · c_ij,    1 ≤ i ≤ n.
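Both penalty definitions translate directly into code. This is our own sketch; the constants C1, C2, C3 are tuning parameters whose concrete values the paper does not report:

```python
def absolute_penalties(profit, C, C1):
    """|penalty|_i = sum_j C1 * profit_j * c_ij."""
    n = len(profit)
    return [sum(C1 * profit[j] * C[i][j] for j in range(n))
            for i in range(n)]

def relative_penalties(profit, C, C1, C2, C3):
    """penalty_i = sum_j (C2 * profit_j - C3 * |penalty|_j) * c_ij."""
    n = len(profit)
    abs_pen = absolute_penalties(profit, C, C1)
    return [sum((C2 * profit[j] - C3 * abs_pen[j]) * C[i][j]
                for j in range(n))
            for i in range(n)]
```

Both are O(n^2) given the conflict matrix, so they can be recomputed cheaply relative to the search itself.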



Figure 2: Greedy Search Example

The main advantage of a relative penalty over an absolute penalty is that, when bid j has a large absolute penalty and we are therefore unlikely to pick it in the greedy process, the other bids that conflict with bid j need not incur its full absolute penalty, since we subtract C3 · |penalty|_j. Here, C2 and C3 are both constants.

2.3.2 Greedy Selection

The greedy local search chooses, among the bids that do not conflict with any of the bids already selected in S, one with the largest (profit − penalty) value. As in Fig. 2, suppose we have 6 bids and the current selection is S = {1, 3, 4}, where there are conflicts between bids 1 and 5 and between bids 2 and 6. After the greedy local search is applied, the resulting set is S = {1, 3, 4, 6}.
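One step of this greedy selection can be sketched as follows (our own code; profit and penalty are per-bid lists, C the conflict matrix, and the example numbers are invented):

```python
def greedy_step(S, profit, penalty, C):
    """Among bids compatible with everything already in S, add the one
    with the largest (profit - penalty) greedy value; return S
    unchanged if no compatible bid exists."""
    n = len(profit)
    candidates = [j for j in range(n)
                  if j not in S and all(C[j][k] == 0 for k in S)]
    if not candidates:
        return S
    best = max(candidates, key=lambda j: profit[j] - penalty[j])
    return S | {best}

profit = [10, 6, 5, 4]
penalty = [3, 1, 2, 1]     # e.g. relative penalties (invented values)
C = [[0, 1, 1, 0], [1, 0, 0, 0], [1, 0, 0, 1], [0, 0, 1, 0]]
S = greedy_step({0}, profit, penalty, C)   # only bid 3 is compatible
```

Iterating this step until it makes no further change yields the pure greedy search; in the hybrid heuristic it is interleaved with the SA moves instead.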

3 Experimental results

In this section, we present experimental results and compare them with the IG method in [5] and with ILOG CPLEX 7.0.1 using the SPP IP model described in this paper. We generated test cases of various sizes following the mechanisms described in [5]. These involve restricting the number of jobs a bidder can bid for, the minimum and maximum profit each job can provide, etc. In [5] the maximum test case size is m = 200 and n = 100; in this work, we use values up to m = 1500 and n = 1500. For comparison with our SA with greedy search (SAG), we tested 4 variations of the IG method in [5]; in our experiments, the 4 variations do not differ significantly from each other in results. For the comparisons, we use the second variation of the IG method, a deterministic divergent search with deterministic partial cover, described in [5]. All SAG and IG instances were run on a machine with a PIII-800 dual CPU and 1GB memory; all CPLEX instances were run on a local machine with a PIV-1.7G CPU and 384MB memory.

3.1 Optimality Comparisons of SAG and IG with CPLEX

We randomly generated 10 small test cases following the generation mechanism introduced in [5], for which CPLEX can provide optimal results in reasonable time, and were thus able to compare the performance of SAG and IG on small instances. The results are given in Table 1. The time spent on these 10 instances is less than 10 seconds for both the SAG and IG methods, and less than 30 seconds for CPLEX.



M    N    SAG        IG         δIG      CPLEX
100  100  19495.02†  17400.77   10.74%   19495.02
100  100  20543.42†  20435.92   0.52%    20543.42
100  100  18556.59†  16993.77   8.42%    18556.59
100  100  20690.86†  19590.13   5.32%    20690.86
100  100  18485.96†  18485.96†  0        18485.96
200  200  26284.21†  23920.52   8.99%    26284.21
200  200  28473.75†  26142.23   8.19%    28473.75
200  200  27894.38†  27894.38†  0        27894.38
200  200  30748.64†  30748.64†  0        30748.64
200  200  27558.78†  25636.82   6.97%    27558.78

δIG measures the difference between the IG result and the optimal solution.
† indicates that the optimal solution was obtained.

Table 1: Optimality Comparison for Small Instances

M     N     # instances  SAG best  IG best  µSAG       t1      µIG       t2      δ
1000  500   100          94        1        74750.66   33.88   70295.46  31.73   6.34%
500   1000  100          91        8        65874.39   77.06   62732.89  143.62  5.01%
1000  1000  100          94        5        85205.58   88.70   80256.87  188.29  6.17%
1000  1500  100          100       0        84422.91   151.50  80404.98  286.84  5.00%
1500  1500  100          93        1        103426.68  192.93  98255.77  314.04  5.26%

δ measures the percentage by which the SAG result outperforms the IG result.

Table 2: Experimental Results on Large Instances

From Table 1, we see that SAG finds optimal solutions for all 10 small instances, while IG finds 3 out of 10. We conclude that SAG works better than IG on small instances.

3.2 Comparisons between SAG and IG Heuristics on Large Instances

In order to compare the performance of the SAG and IG heuristics in real-world situations, we generated several sets of large SPP test cases and applied SAG and IG to solve them. 500 test cases are grouped into 5 sets of up to m = 1500 and n = 1500, as in Table 2, where µSAG and µIG are the arithmetic averages over the 100 instances in each group for the SAG and IG methods respectively, and t1 and t2 are the arithmetic average times in seconds over the 100 instances in each group for the SAG and IG methods.

From Table 2, we see that the SAG method consistently gives a 5 to 6 percent improvement in results. In each 100-instance group, SAG wins over IG on more than 90 instances, which means SAG consistently outperforms the previous IG method. In addition, the time spent by SAG is about half that of IG under the same conditions, excluding the first set of test cases, which has a relatively small size; there both SAG and IG run in about 30 seconds, and the SAG results are more than 6% better. The 500 instances are available at http://logistics.ust.hk/~zhuyi/instance.zip to serve as our benchmark for the SPP.



M     N     SAG        t1      CPLEX     δCPLEX   IG         t2      δIG
1000  1000  87138.76   87.64   77977.06  10.51%   80712.11   143.29  7.37%
1000  1000  82914.83   88.89   81398.69  1.83%    77132.22   130.24  6.97%
1000  1000  85536.72   86.63   77995.93  8.81%    81958.07   134.54  4.18%
1000  1000  83254.88   88.82   74232.00  10.84%   74157.05   141.06  10.93%
1000  1000  87177.65   89.92   80210.71  7.99%    81097.49   141.13  6.97%
1500  1500  108008.96  193.11  98482.81  8.82%    99975.09   306.34  7.44%
1500  1500  100897.01  190.46  99377.82  1.51%    93638.39   322.40  7.19%
1500  1500  104749.69  192.50  94153.32  10.12%   100033.68  329.04  4.50%
1500  1500  101049.25  193.05  91548.39  9.40%    94955.79   321.58  5.99%
1500  1500  100658.78  192.08  97390.90  3.25%    96419.52   318.41  4.21%

δIG and δCPLEX measure the difference of the IG and CPLEX results from the SAG result.

Table 3: Comparison among SAG, CPLEX and IG on Large Instances

3.3 Comparisons between SAG and CPLEX on Large Instances

We randomly selected 10 instances from Section 3.2 with sizes m = 1000, n = 1000 and m = 1500, n = 1500. We gave CPLEX a 3600-second time limit per instance (hence the CPLEX solutions may not be optimal) and compared the results. The statistics are found in Table 3.

From Table 3 we see that CPLEX cannot find the optimal solution when the instance size is large. The SAG heuristic, on the other hand, provides results which are noticeably better than both CPLEX and the IG heuristic, with higher time efficiency. We also note that IG does not always provide better results than CPLEX, whereas the SAG method does.

4 Conclusion

In this paper, we modelled the combinatorial auction brokering problem as an NP-complete set packing problem and developed a hybrid meta-heuristic method to solve it. Although almost all previous approaches to the SPP have used IP techniques, we implemented a simulated annealing heuristic which provides good results in less time. Compared with the single other paper which uses heuristics for the SPP, our algorithm fares better. Similarly, compared with the IP solutions obtained from CPLEX, our heuristic performs better. We also established a benchmark set for future research on the SPP.

References

[1] A. Anderson, M. Tenhunen, and F. Ygge. Integer programming for combinatorial auction winner determination. In Proceedings of the Fourth International Conference on Multi-Agent Systems (ICMAS00), pages 39–46, 2000.

[2] M. Aourid and B. Kaminska. Neural networks for the set covering problem: An application to the test vector compaction. In IEEE International Conference on Neural Networks Conference Proceedings, volume 7, pages 4645–4649, 1994.



[3] E. Balas and A. Ho. Set covering algorithms using cutting planes, heuristics, and subgradient optimization: A computational study. Mathematical Programming, 12:37–60, 1980.

[4] E. Balas and M. W. Padberg. Set partitioning: A survey. SIAM Review, 18:710–760, 1976.

[5] Hoong Chuin Lau and Yam Guan Goh. An intelligent brokering system to support multi-agent web-based 4th-party logistics. In Proceedings of the Fourteenth International Conference on Tools with Artificial Intelligence, pages 10–11, 2002.

[6] M. W. Padberg. On the facial structure of set packing polyhedra. Mathematical Programming, 5:199–215, 1973.

[7] Sven de Vries and Rakesh V. Vohra. Combinatorial auctions: A survey. INFORMS Journal on Computing, 15:284–309, 2003.



Combining Symmetry Breaking with Other Constraints: Lexicographic Ordering with Sums

Brahim Hnich1, Zeynep Kiziltan2, and Toby Walsh1

1 Cork Constraint Computation Center, University College Cork, Ireland. brahim, [email protected]

2 Department of Information Science, Uppsala University, Sweden. [email protected]

Abstract. We introduce a new global constraint which combines the lexicographic ordering constraint with two sum constraints. Lexicographic ordering constraints are frequently used to break symmetry, whilst sum constraints occur in many problems involving capacity or partitioning. Our results show that this global constraint is useful when there is a very large space to explore, such as when the problem is unsatisfiable, or when the search strategy is poor or conflicts with the symmetry breaking constraints. By studying in detail when combining lexicographic ordering with other constraints is useful, we propose a new heuristic for deciding when to combine constraints together.

1 Introduction

Global constraints specify patterns that reoccur in many problems. For example, we often have row and column symmetry on a 2-d matrix of decision variables and can post lexicographic ordering constraints on the rows and columns to break much of this symmetry [4]. There are, however, only a limited number of common constraints like the lexicographic ordering constraint which repeatedly occur in problems. New global constraints are therefore likely to be increasingly more specialized. An alternative strategy for developing global constraints that might be useful in a wide range of problems is to identify constraints that often occur together, and to develop efficient constraint propagation algorithms for their combination. In this paper, we explore this strategy.

We introduce a new global constraint on 0/1 variables that combines the lexicographic ordering constraint with two sum constraints. Sum and lexicographic ordering constraints frequently occur together in problems involving capacity or partitioning that are modelled with symmetric matrices of decision variables. Examples are the ternary Steiner problem, the balanced incomplete block design problem, the rack configuration problem, social golfers, etc. Our results show that this new constraint is most useful when there is a very large space to explore, such as when the problem is unsatisfiable, or when the branching heuristics are poor or conflict with the symmetry breaking constraints. The

(This research is supported by Science Foundation Ireland. We thank the other members of the 4C lab, the APES research group, Alan Frisch and Chris Jefferson.)


combined constraint gives additional pruning, and this can, for example, help compensate for the branching heuristic trying to push the search in a different direction to the symmetry breaking constraints. Combining constraints is a step towards tackling one of the most common criticisms of using symmetry breaking constraints: by increasing the amount of propagation, we can partly resolve the conflict between the branching heuristic and the symmetry breaking constraints. Finally, by studying in detail when combining lexicographic ordering with other constraints is useful, we propose a new heuristic for deciding when to combine constraints together. The heuristic suggests that the combination should be likely to prune a significant number of shared variables.

2 Preliminaries

A constraint satisfaction problem (CSP) is a set of variables, each with a finite domain of values, and a set of constraints that specify allowed values for subsets of variables. A solution to a CSP is an assignment of values to the variables satisfying the constraints. To find such solutions, constraint solvers often explore the space of partial assignments enforcing a local consistency like generalized arc-consistency (GAC). A constraint is GAC iff, when a variable in the constraint is assigned any of its values, compatible values exist for all the other variables in the constraint. For totally ordered domains, like integers, another level of consistency is bounds-consistency (BC). A constraint is bounds consistent iff, when a variable in the constraint is assigned its maximum or minimum value, there exist compatible values for all the other variables in the constraint. If a constraint c is BC or GAC, then we write BC(c) or GAC(c) respectively.

In this paper, we are interested in the lexicographic ordering of vectors of variables in the presence of sum constraints on the vectors. We denote a vector of n finite integer variables as X = 〈X0, . . . , Xn−1〉, and a vector of n ground values as x = 〈x0, . . . , xn−1〉. The sub-vector of x with start index a and last index b inclusive is denoted by x_{a→b}. The domain of a finite integer variable V is denoted by D(V), and the minimum and maximum elements in this domain by min(D(V)) and max(D(V)).

Given two vectors X and Y of variables, we write a lexicographic ordering constraint as X ≤lex Y and a strict lexicographic ordering constraint as X <lex Y. X ≤lex Y ensures that: X0 ≤ Y0; X1 ≤ Y1 when X0 = Y0; X2 ≤ Y2 when X0 = Y0 and X1 = Y1; . . . ; Xn−1 ≤ Yn−1 when X0 = Y0, X1 = Y1, . . . , and Xn−2 = Yn−2. X <lex Y ensures that X ≤lex Y, and Xn−1 < Yn−1 when X0 = Y0, X1 = Y1, . . . , and Xn−2 = Yn−2. We write LexLeqAndSum(X, Y, Sx, Sy) for the constraint which ensures that X ≤lex Y, ∑i Xi = Sx, and ∑i Yi = Sy. Similarly, we write LexLessAndSum(X, Y, Sx, Sy) for X <lex Y, ∑i Xi = Sx, and ∑i Yi = Sy. We denote the dual cases as LexGeqAndSum(X, Y, Sx, Sy) and LexGreaterAndSum(X, Y, Sx, Sy). We assume that the variables being ordered are disjoint and not repeated. We also assume that Sx and Sy are ground, and discuss the case when they are bounded variables in Section 5.
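On ground (fully assigned) vectors, the conjunction that LexLeqAndSum enforces is easy to state. This small checker is our own sketch, not part of the paper's propagation algorithm; it exploits the fact that Python's tuple comparison is itself lexicographic:

```python
def lex_leq(x, y):
    """x <=_lex y for equal-length sequences."""
    return tuple(x) <= tuple(y)

def lex_leq_and_sum(x, y, sx, sy):
    """Is a ground assignment a solution of LexLeqAndSum(X, Y, Sx, Sy)?"""
    return lex_leq(x, y) and sum(x) == sx and sum(y) == sy
```

The point of the algorithm in Section 4 is, of course, much stronger than ground checking: it prunes domain values so that every remaining value participates in some such solution (GAC).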


3 A worked example

We consider the special (but nevertheless useful) case of vectors of 0/1 variables. Generalizing the algorithm to non-Boolean variables remains a significant challenge, as it will involve solving subset sum problems. Fortunately, many of the applications of our algorithm merely require 0/1 variables. To maintain GAC on LexLeqAndSum(X, Y, Sx, Sy), we minimize X lexicographically with respect to the sum and identify which positions in Y support 0 or 1. We then maximize Y lexicographically with respect to the sum and identify which positions in X support 0 or 1. Since there are two values and two vectors to consider, the algorithm has 4 steps.

In each step, we maintain a pair of lexicographically minimal and maximal ground vectors sx = 〈sx0, . . . , sxn−1〉 and sy = 〈sy0, . . . , syn−1〉. To avoid repeatedly traversing the vectors, we maintain a pointer α such that for all i < α we have sxi = syi and sxα ≠ syα. That is, α is the most significant index where sx and sy differ. Additionally, we may need to know whether sx_{α+1→n−1} and sy_{α+1→n−1} are lexicographically ordered. Therefore, we introduce a boolean flag γ whose value is true iff sx_{α+1→n−1} ≤lex sy_{α+1→n−1}.

Consider the vectors

X = 〈{0, 1}, {0, 1}, 0, 0, {0, 1}, {0, 1}, 0, 0〉
Y = 〈{0, 1}, {0, 1}, 0, 1, {0, 1}, {0, 1}, {0, 1}, {0, 1}〉

(here {0, 1} denotes the domain of a still-unassigned 0/1 variable)

and the constraints X ≤lex Y, ∑i Xi = 3, and ∑i Yi = 2. Each of these constraints is GAC, and thus no pruning is possible. Our algorithm that maintains GAC on LexLeqAndSum(X, Y, 3, 2) starts with step 1, in which we have

sx = 〈0, 0, 0, 0, 1, 1, 0, 0〉
sy = 〈1, 0, 0, 1, 0, 0, 0, 0〉
      ↑ α

where sx = min{x | ∑i xi = 2 ∧ x ∈ X} and sy = max{y | ∑i yi = 2 ∧ y ∈ Y}. We check where we can place one more 1 in sx to make the sum 3 as required without disturbing sx ≤lex sy. We have α = 0 and γ = true. We can safely place a 1 to the right of α, as this does not affect sx ≤lex sy. Since γ is true, placing a 1 at α also does not affect the order of the vectors. Therefore, all the 1s in X have support.

In step 2 we have

sx = 〈1, 1, 0, 0, 1, 1, 0, 0〉
sy = 〈1, 0, 0, 1, 0, 0, 0, 0〉
         ↑ α

where sx = min{x | ∑i xi = 4 ∧ x ∈ X} and sy is as before. We check where we can place one more 0 in sx to make the sum 3 as required and obtain sx ≤lex sy. We have α = 1 and γ = true. Placing a 0 to the left of α makes sx smaller than sy. Since γ is true, placing a 0 at α also makes sx smaller than sy. However, placing a 0 to the right of α orders the vectors lexicographically the wrong way around. Hence, we remove 0 from the domains of the variables of X to the right of α. The vector X is now 〈{0, 1}, {0, 1}, 0, 0, 1, 1, 0, 0〉.

Page 96: Artificial Intelligence and Mathematics - Rutgers Universityrutcor.rutgers.edu/~amai/aimath04/AcceptedPapers/aimath-I.pdf · Artificial Intelligence and Mathematics January 4-6, 2004

In step 3 we have

sx = 〈0, 1, 0, 0, 1, 1, 0, 0〉
sy = 〈1, 1, 0, 1, 0, 0, 0, 0〉
      ↑ α

where sx = min{x | ∑i xi = 3 ∧ x ∈ X} and sy = max{y | ∑i yi = 3 ∧ y ∈ Y}. We check where we can place one more 0 in sy to make the sum 2 as required without disturbing sx ≤lex sy. We have α = 0 and γ = true. We can safely place a 0 to the right of α, as this does not affect sx ≤lex sy. Since γ is true, placing a 0 at α also does not affect the order of the vectors. Therefore, all the 0s in Y have support.

Finally, in step 4 we have

sx = 〈0, 1, 0, 0, 1, 1, 0, 0〉
sy = 〈0, 0, 0, 1, 0, 0, 0, 0〉
         ↑ α

where sx is as before, and sy = max{y | ∑i yi = 1 ∧ y ∈ Y}. We check where we can place one more 1 in sy to make the sum 2 as required and obtain sx ≤lex sy. We have α = 1 and γ = true. Placing a 1 to the left of α makes sx ≤lex sy and so is safe. Since γ is true, we can also safely place a 1 at α. However, placing a 1 to the right of α makes sx >lex sy. Hence, we remove 1 from the domains of the variables of Y to the right of α. The algorithm now terminates with domains that are GAC:

X = 〈{0, 1}, {0, 1}, 0, 0, 1, 1, 0, 0〉
Y = 〈{0, 1}, {0, 1}, 0, 1, 0, 0, 0, 0〉

4 Algorithm

The algorithm first establishes BC on the sum constraints. Note that for 0/1 variables, BC is equivalent to GAC. If no failure is encountered, we continue with 4 pruning steps. In step 1, we are concerned with support for the 1s in X, as in the worked example. In step 2, we are concerned with support for the 0s in X. Step 3 is very similar to step 1, except we identify support for the 0s in Y. Step 4 is very similar to step 2, except we identify support for the 1s in Y. None of the prunings require any recursive calls back to the algorithm. The algorithm runs in time linear in the length of the vectors and is correct. For reasons of space, the details of the algorithm and the proofs are given in [5].
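The workhorse of each step, computing the lexicographically smallest (or largest) ground vector with a given sum in the current domains, is a simple greedy computation for 0/1 domains. The sketch below is our own (domains are Python sets, with {0, 1} for an unassigned variable), not the authors' implementation:

```python
def lex_min_with_sum(domains, s):
    """Lexicographically smallest 0/1 vector x with x[i] in domains[i]
    and sum(x) == s, or None if impossible. Greedy: take the forced 1s,
    then put the remaining 1s in the rightmost free {0, 1} positions."""
    x = [1 if d == {1} else 0 for d in domains]
    need = s - sum(x)                    # extra 1s still to place
    free = [i for i, d in enumerate(domains) if d == {0, 1}]
    if need < 0 or need > len(free):
        return None                      # the sum is unreachable
    for i in free[len(free) - need:]:
        x[i] = 1
    return x

def lex_max_with_sum(domains, s):
    """Mirror image: the extra 1s go into the leftmost free positions."""
    x = [1 if d == {1} else 0 for d in domains]
    need = s - sum(x)
    free = [i for i, d in enumerate(domains) if d == {0, 1}]
    if need < 0 or need > len(free):
        return None
    for i in free[:need]:
        x[i] = 1
    return x
```

For instance, with domains whose free positions are 0, 1, 4 and 5 and every other position fixed to 0, lex_min_with_sum with sum 2 returns 〈0, 0, 0, 0, 1, 1, 0, 0〉, the vector sx of step 1 of the worked example.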

The algorithm can easily be modified for LexLessAndSum(X, Y, Sx, Sy) (the strict ordering constraint). To do so, we need to disallow equality between the vectors. This requires just two modifications to the algorithm. First, we change the definition of γ: the flag γ is true iff sx_{α+1→n−1} <lex sy_{α+1→n−1}. Second, we fail if we have min{x | ∑i xi = Sx ∧ x ∈ X} ≥lex max{y | ∑i yi = Sy ∧ y ∈ Y}.

We can also deal with sums that are not ground but bounded. Assume we have lx ≤ Sx ≤ ux and ly ≤ Sy ≤ uy. We now need to find support first for the values in the domains of the vectors, and second for the values in the ranges lx..ux and ly..uy. In the first part, we can run our algorithm LexLeqAndSum with ∑i Xi = lx and ∑i Yi = uy. In the second part, we tighten the upper bound of Sx with respect to the upper bound of Sy so that max{x | ∑i xi = ux ∧ x ∈ X} ≤lex max{y | ∑i yi = uy ∧ y ∈ Y}. The support for the upper bound of Sx is also the support for all the other values in the domain of Sx. Similarly, we tighten the lower bound of Sy with respect to the lower bound of Sx so that min{x | ∑i xi = lx ∧ x ∈ X} ≤lex min{y | ∑i yi = ly ∧ y ∈ Y}. The support for the lower bound of Sy is also the support for all the other values in the domain of Sy. The values in the vectors are supported by lx and uy. The prunings of the second part tighten only ux and ly. Hence the prunings performed in the second part do not require any calls back to the first part. It is easy to show that the modified algorithm is correct and runs in linear time.

Finally, we can extend the algorithm to detect constraint entailment. A constraint is entailed when any assignment of values to its variables satisfies the constraint. Detecting entailment does not change the worst-case complexity, but is very useful for avoiding unnecessary work. For this purpose, we can maintain a flag entailed, which is set to true whenever the constraint LexLeqAndSum is entailed; the algorithm directly returns on future calls if entailed is set to true. The constraint is entailed when we have max{x | ∑i xi = Sx ∧ x ∈ X} ≤lex min{y | ∑i yi = Sy ∧ y ∈ Y}.

5 Experimental Results

We performed a wide range of experiments to test when this combination of constraints is useful in practice. For reasons of space, we only show the results for BIBDs. In the following table, the results for finding the first solution or proving that none exists are shown, where "-" means no result is obtained in 1 hour (3600 secs) using ILOG Solver 5.3 on a 1GHz Pentium III processor with 256 MB RAM under Windows XP.

Balanced Incomplete Block Designs (BIBD). BIBD generation is a standard combinatorial problem from design theory with applications in cryptography and experimental design. A BIBD is a set V of v ≥ 2 elements and a collection of b > 0 subsets of V, such that each subset consists of exactly k elements (v > k > 0), each element appears in exactly r subsets (r > 0), and each pair of elements appears simultaneously in exactly λ subsets (λ > 0).

A BIBD can be specified as a constraint program by a 0/1 matrix of b columns and v rows, with constraints enforcing exactly r ones per row, k ones per column, and a scalar product of λ between any pair of distinct rows. This matrix model has row and column symmetry [3], and both the rows and the columns are now also constrained by sum constraints. Hence, we can impose our new global constraint on both the rows and the columns.

Instantiating the matrix along its rows from top to bottom and exploring the domain of each variable in increasing order works extremely well with the symmetry breaking constraints. All the instances of [7] are solved within a few seconds. Bigger instances such as 〈15, 21, 7, 5, 2〉 and 〈22, 22, 7, 7, 2〉 are solved in


 Problem                 No symmetry breaking    >lex R ≥lex C           LexGreaterAndSum R, LexGeqAndSum C
 #  〈v, b, r, k, λ〉     Failures    Time (sec.)  Failures    Time (sec.)  Failures    Time (sec.)
 1  6,20,10,3,4          8,944       0.7          916         0.2          327         0.1
 2  7,21,9,3,3           7,438       0.7          20,182      5.3          5,289       2.1
 3  6,30,15,3,6          1,893,458   192.3        10,618      3.7          1,493       1
 4  7,28,12,3,4          229,241     26.1         801,290     330.7        52,927      27
 5  9,24,8,3,2           6,841       1.1          2,338,067   1115.9       617,707     524.3
 6  6,40,20,3,8          -           >1hr         117,126     67.5         4,734       4.4
 7  7,35,15,3,5          7,814,878   1444.4       -           >1hr         382,173     311.2
 8  7,42,18,3,6          -           >1hr         -           >1hr         2,176,006   2,603.7

Table 1. BIBDs: row-wise labelling.

less than a minute. With this search strategy, we observe no difference between the inference of our algorithm and its decomposition into separate lexicographic ordering and sum constraints.

To see a difference, we need either a poor branching heuristic or a large search space (e.g., an unsatisfiable problem). Instead of exploring the rows from top to bottom, if we explore them from bottom to top then the problem becomes very difficult to solve in the presence of the symmetry breaking constraints; even small instances become hard to solve within an hour. We can make the problem more difficult to solve by choosing one row from the top and then one row from the bottom, and so on. Table 1 shows how the search tree is affected. We make a number of observations about these results. First, imposing the symmetry breaking constraints significantly reduces the size of the search tree and the time to solve the problem compared to no symmetry breaking. Moreover, the additional inference performed by our algorithm gives much smaller search trees in much shorter run-times; see entries 1, 3, and 6. Second, the lexicographic ordering constraints and the search strategy clash, resulting in bigger search trees. However, the extra inference of our algorithm is able to compensate for this. This suggests that even if the ordering imposed by symmetry breaking constraints conflicts with the search strategy, more inference incorporated into the symmetry breaking constraints can significantly reduce the size of the search tree; see entries 2, 4, and 7. Third, increased inference scales up better, and recovers from mistakes much quicker; see entry 5. Finally, the problem can sometimes only be solved, when using a poor search strategy, by imposing our new global constraint; see entry 8.

6 Lexicographic Ordering with Other Constraints

We obtained similar results with other problems like ternary Steiner, rack configuration, steel mill slab design and the social golfers problem. In each case, the combined constraint was only useful when the symmetry breaking conflicted with the branching heuristic, the branching heuristic was poor, or there was a very large search space to explore. Why is this so?


Katsirelos and Bacchus have proposed a simple heuristic for combining constraints together [6]. The heuristic suggests grouping constraints together if they share many variables in common. This heuristic would suggest that combining lexicographical ordering and sum constraints would be very useful, as they intersect on many variables. However, this ignores how the constraints are propagated. The lexicographical ordering constraint only prunes at one position, α (α points to the most significant index of X and Y where Xi and Yi are not ground and equal). If the vectors are already ordered at this position then any future assignments are irrelevant. Of course, α can move to the right, but on average it moves only one position for each assignment. Hence, the lexicographic ordering constraint interacts on average with one variable from each of the sum constraints. Such interaction is of limited value because the constraints are already communicating with each other via the domain of that variable. This explains why combining lexicographical ordering and sum constraints is only of value on problems where there is a lot of search and even a small amount of extra inference may save exploring large failed subtrees.

A similar argument holds for combining lexicographic ordering constraints with other constraints. For example, Carlsson and Beldiceanu have introduced a new global constraint, called lex chain, which combines together a chain of lexicographic ordering constraints [1]. When we have a matrix, say with row symmetry, we can now post a single lexicographic ordering constraint on all the m vectors corresponding to the rows, as opposed to posting m − 1 of them. In theory, such a constraint can give more propagation. However, our experiments on BIBDs indicate no gain over posting lexicographic ordering constraints between the adjacent vectors. In Table 2, we report the results of solving BIBDs using SICStus Prolog 3.10.1. We either post lexicographic ordering or anti-lexicographic ordering constraints on the rows and columns, and instantiate the matrix from top to bottom, exploring the domains in ascending order. The lexicographic ordering constraints are posted using the lex chain constraint of Carlsson and Beldiceanu, which is available in SICStus Prolog 3.10.1. This constraint is either posted once for all the symmetric rows/columns, or between adjacent rows/columns. In all the cases, we observe no benefit in combining a chain of lexicographic ordering constraints.

The interaction between the constraints is again very restricted. Each of them is concerned only with a pair of variables, and it interacts with its neighbour either at this position or at a position above its α where the variable is already ground. This argument suggests a new heuristic for combining constraints: the combination should be likely to prune a significant number of shared variables.

7 Conclusion

We have introduced a new global constraint on 0/1 variables which combines a lexicographical ordering constraint with sum constraints. Lexicographic ordering constraints are frequently used to break symmetry, whilst sum constraints occur in many problems involving capacity or partitioning. Our results showed that


 〈v, b, r, k, λ〉   No symmetry   <lex R ≤lex C                        >lex R ≥lex C
                   breaking      lex chain         lex chain          lex chain         lex chain
                                 〈X0, ..., Xm−1〉  〈Xi, Xi+1〉        〈X0, ..., Xm−1〉  〈Xi, Xi+1〉
                   Backtracks    Backtracks        Backtracks         Backtracks        Backtracks
 6,20,10,3,4       5,201         84                84                 706               706
 7,21,9,3,3        1,488         130               130                72                72
 6,30,15,3,6       540,039       217               217                9,216             9,216
 7,28,12,3,4       23,160        216               216                183               183
 9,24,8,3,2        -             1,473             1,473              79                79
 6,40,20,3,8       -             449               449                51,576            51,576
 7,35,15,3,5       9,429,447     326               326                395               395
 7,42,18,3,6       5,975,823     460               460                756               756

Table 2. BIBD: lex chain(〈X0, . . . , Xm−1〉) vs lex chain(〈Xi, Xi+1〉) for all 0 ≤ i < m − 1 with row-wise labelling.

this global constraint is useful when there is a very large space to explore, such as when the problem is unsatisfiable, or when the branching heuristic is poor or conflicts with the symmetry breaking constraints. However, our combined constraint did not compensate in full for a poor branching heuristic; overall, it was better to use a good branching heuristic. Finally, by studying in detail when combining lexicographical ordering with other constraints is useful, we proposed a new heuristic for deciding when to combine constraints together.

References

1. M. Carlsson and N. Beldiceanu. Arc-consistency for a chain of lexicographic ordering constraints. Technical Report T2002-18, SICS, 2002.

2. M. Carlsson and N. Beldiceanu. Revisiting the lexicographic ordering constraint. Technical Report T2002-17, SICS, 2002.

3. P. Flener, A. Frisch, B. Hnich, Z. Kiziltan, I. Miguel, J. Pearson, and T. Walsh. Breaking row and column symmetry in matrix models. In P. van Hentenryck, editor, Proceedings of 8th CP (CP-2002), pages 462–476. Springer, 2002.

4. A. Frisch, B. Hnich, Z. Kiziltan, I. Miguel, and T. Walsh. Global constraints for lexicographic orderings. In P. van Hentenryck, editor, Proceedings of 8th CP (CP-2002), pages 93–108. Springer, 2002.

5. Z. Kiziltan. Symmetry Breaking Ordering Constraints. PhD thesis, Uppsala University. Due to be submitted late 2003.

6. G. Katsirelos and F. Bacchus. GAC on conjunctions of constraints. In T. Walsh, editor, Proceedings of 7th CP (CP-2001), pages 610–614. Springer, 2001.

7. P. Meseguer and C. Torras. Exploiting symmetries within constraint satisfaction search. Artificial Intelligence, 129(1-2):133–163, 2001.


A Simple Yet Effective Heuristic Framework for Optimization Problems

Gaofeng Huang and Andrew Lim
Dept of Industrial Engineering & Engineering Management
Hong Kong University of Science & Technology
Clear Water Bay, Kowloon, Hong Kong
gfhuang, [email protected]

In this paper, we propose a simple yet effective heuristic framework called Fragmental Optimization (FO). In FO, there are two tightly coupled elements: Fragment Selection and Optimization. We formulate the FO technique and apply it to the 2-machine bicriteria flowshop scheduling problem and the 3-Index Assignment Problem. We conduct extensive experiments on standard benchmark instances for these problems. The experimental results show that our methods are superior to the previous best methods for the two problems. As the two problems are quite different, this suggests that our method is sufficiently general and can be adapted to solve other optimization problems effectively.

1 Overview

Search is one of the basic techniques in Artificial Intelligence. However, most real-world optimization problems are still intractably hard because of their large search spaces. Many general heuristic search methods have thus been developed to find competitive solutions within a reasonable amount of time. These techniques include Simulated Annealing (SA), Tabu Search (TS), the Genetic Algorithm (GA), Ant Colony Optimization (ACO), "Squeaky Wheel" Optimization (SWO), the Greedy Randomized Adaptive Search Procedure (GRASP), etc.

The purpose of this paper is to present another effective heuristic framework termed Fragmental Optimization (FO). In its simplest form, FO is an iterative improvement algorithm, utilizing the basic principle of "easy things first". Since it is often computationally infeasible to optimize the whole solution, FO tries to achieve the relatively easier goal of optimizing a portion or fragment of the entire problem iteratively.

Figure 1: Solution, Representation and Fragment [figure: within the solution space, a solution has a representation S; a "fragment" F (a partial solution) splits S into F and S\F]

As shown in Figure 1, within the solution space, each solution can have a unique representation S. A fragment, F, is defined as a small portion of S, i.e. a partial solution. As a result, a solution S is divided into two parts: F and S\F. If we leave S\F unchanged, we may be able to optimize F such that the overall objective function value is improved. This idea can be formulated as:

optimize F | (S\F ) is fixed (1)

Algorithm 1 illustrates the general framework of FO. When we apply this framework to a specific problem, two issues need to be considered: (1) how to select the fragment (Fragment Selection), and (2) how to optimize the fragment effectively (Optimization).

To demonstrate the core concept, consider a simple queuing problem as an example. Suppose n persons are waiting to be serviced, and the service processing time for person i is ti. The objective



Algorithm 1 Fragmental Optimization framework
1: while not Termination Criteria do
2:   select a fragment from the whole solution space
3:   optimize this fragment subject to the other part of the solution being fixed
4: end while

is to queue these n persons (i.e. arrange the order of service) such that the average waiting time is minimized. A problem instance is shown in Figure 2. There are 6 persons with processing times t = (20, 50, 20, 10, 10, 30). The initial solution takes 78 units waiting time. In each iteration, we randomly select 3 consecutive persons as a fragment and try to optimize the fragment. So in iteration 1, by optimizing the randomly selected fragment (shaded), the overall cost decreases to 62. After iteration 2, the cost is down to 50. After 4 iterations, FO reaches the optimal solution of cost 44.

Figure 2: Simple Example (processing times t = (20, 50, 20, 10, 10, 30); the shaded fragment of 3 consecutive persons is re-ordered in each iteration)

initial:     average waiting time = (0 + 20 + 70 + 90 + 100 + 110)/5 = 78
iteration 1: average waiting time = (0 + 20 + 30 + 50 + 100 + 110)/5 = 62
iteration 2: average waiting time = (0 + 20 + 30 + 50 + 60 + 90)/5 = 50
iteration 3: average waiting time = (0 + 20 + 30 + 40 + 60 + 90)/5 = 48
iteration 4: average waiting time = (0 + 10 + 20 + 40 + 60 + 90)/5 = 44

Intuitively, this models how humans solve real-world problems: look at a small fragment and try to solve that small fragment well in order to improve the overall result.
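The queuing example above can be sketched in a few lines. This is our own illustrative code (a deterministic left-to-right sweep over the fragments rather than the random fragment selection of Figure 2), not the authors' implementation:

```python
import itertools

def total_wait(order, t):
    # each person waits for the processing of everyone queued before them
    wait = elapsed = 0
    for i in order:
        wait += elapsed
        elapsed += t[i]
    return wait

def fragmental_optimization(t, L=3):
    """Repeatedly re-order each window of L consecutive positions
    (by exhaustive enumeration) while the rest of the queue stays fixed."""
    order = list(range(len(t)))
    improved = True
    while improved:
        improved = False
        for s in range(len(order) - L + 1):
            head, frag, tail = order[:s], order[s:s + L], order[s + L:]
            best = min(itertools.permutations(frag),
                       key=lambda f: total_wait(head + list(f) + tail, t))
            if list(best) != frag:
                order[s:s + L] = best
                improved = True
    return order

t = (20, 50, 20, 10, 10, 30)
order = fragmental_optimization(t)
# total waiting time 220, i.e. the optimal cost 44 of Figure 2 (= 220/5)
```

On this instance the sweep converges to the shortest-processing-time order, matching the optimum reached after 4 random iterations in Figure 2.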

The rest of this paper is organized as follows. In the next 2 sections, we illustrate the Fragmental Optimization technique in greater detail by applying it successfully to 2 problems: the 2-machine bicriteria flowshop scheduling problem and the 3-Index Assignment Problem. Our experimental results are very encouraging, as we managed to improve the results on benchmark instances for these problems. In the last section, we present our conclusions.

2 2-machine bicriteria flowshop scheduling problem

The 2-machine flowshop scheduling problem F2 may be defined as follows. We have n jobs and 2 machines M1 and M2. Each job must be processed first on machine M1, then on machine M2. All the n jobs are available in the beginning. Once a job is started on a machine, it should not be interrupted ("no preemption"). The processing time of job i on machine M1 is ai, and on machine M2 is bi. Let Ci,j denote the completion time of job i on machine Mj; a job's completion time, Ci, is defined as the completion time of job i on machine M2, that is, Ci = Ci,2.

Extensive research was done optimizing one single criterion [9]. But recently, more interest has been given to multiple criteria scheduling. The 2-machine bicriteria flowshop problem we discuss in this section, F2||(∑Ci/Cmax), is to find a feasible schedule that first minimizes the maximum completion time (makespan) Cmax = max Ci and then minimizes the total completion time ∑Ci.

Chen and Bulfin (1994) proved that F2||(∑Ci/Cmax) is NP-hard in the strong sense [8]. Rajendran developed a branch-and-bound algorithm and two heuristics which can solve this problem with no more than 10 and 25 jobs respectively [7]. Neppalli et al. (1996) applied GA to solve this problem [6]. Gupta et al. (1999) developed a tabu search algorithm for this problem [5], and they also proposed 9 polynomial heuristic algorithms in [2]. Later a comparative study of several well-known local search heuristics was done by Gupta et al. [1]. Using a Lagrangian relaxation to develop the lower bound, T'Kindt et al. (2001) implemented an efficient branch-and-bound algorithm [4], which can solve this problem with



Figure 3: No waiting case: t ≥ ai. Figure 4: Waiting case: t < ai.
Figure 5: Fragment Selection (a sliding window F over the job sequence).

35 jobs. He also used Ant Colony Optimization (SACO) to solve this problem, which "yields better results than existing heuristics" [3].

In this section, we present our Fragmental Optimization (FO) heuristic for this problem, which combines dynamic programming and a local search strategy. Experimental results show that our FO algorithm outperforms existing heuristics and provides solutions very close to the optimal.

2.1 Johnson’s Algorithm

The problem F2||Cmax is solved by Johnson's algorithm in O(n lg n) time [10].

Algorithm 2 JOHNSON'S ALGORITHM
1: X ← {i | ai ≤ bi};
2: Y ← {j | aj > bj};
3: sort X as a partial sequence by non-decreasing ai;
4: sort Y as a partial sequence by non-increasing bj;
5: Johnson's permutation J ← append sequence Y after X.
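A direct transcription of Algorithm 2, together with the makespan recurrence that follows, might look as below. This is a sketch under our own naming, not the authors' code:

```python
def johnson(a, b):
    """Johnson's rule for F2||Cmax: jobs with a[i] <= b[i] sorted by
    non-decreasing a[i], followed by jobs with a[i] > b[i] sorted by
    non-increasing b[i]."""
    x = sorted((i for i in range(len(a)) if a[i] <= b[i]), key=lambda i: a[i])
    y = sorted((i for i in range(len(a)) if a[i] > b[i]), key=lambda i: -b[i])
    return x + y

def makespan(seq, a, b):
    # recurrence (2): C_J(k) = max(C_J(k-1), sum of a's so far) + b_J(k)
    c1 = c = 0
    for i in seq:
        c1 += a[i]
        c = max(c, c1) + b[i]
    return c
```

Since Johnson's rule is optimal for F2||Cmax, `makespan(johnson(a, b), a, b)` equals the minimum makespan over all permutations.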

The completion time of job J(k) and the makespan can be computed as:

CJ(k),1 = ∑i=1..k aJ(i)

CJ(k) = 0 for k = 0;  CJ(k) = max{CJ(k−1), CJ(k),1} + bJ(k) for k = 1, 2, ..., n    (2)

C∗max = Cmax(J) = CJ(n)

Although the job sequence that comes from Johnson's algorithm provides a feasible solution to problem F2||(∑Ci/Cmax), for most of the instances the solutions provided are "bad" [1]. However, Johnson's algorithm is useful in checking whether a partial scheduling sequence is feasible. Suppose we have a partial scheduling sequence π′ = π(1), π(2), ..., π(m) (m < n), and let π̄′ = {1, 2, ..., n} − π′ be the set of unallocated jobs. We can apply Johnson's algorithm on π̄′ and get Jπ̄′. If the makespan Cmax(π′Jπ̄′) > Cmax(J), there can be no complete feasible schedule with π′ as its partial schedule, which means that the partial scheduling sequence π′ is already infeasible [2].

2.2 Dynamic Programming

For the 2-machine flowshop problems with any regular criterion, the set of permutation schedules contains at least one optimal solution [9]. This means that the job sequence on machines M1 and M2 is exactly the same. Now we allocate jobs one by one. Let S be the set of unallocated jobs, while t denotes the current free time span of machine M1. As the two cases in Figures 3 & 4 show, the optimal solution of the subproblem (Cmax, ∑Ci) = G(t, S) can be determined recursively by:

G(t, S) = min over i ∈ S of [(pi, pi × |S|) + G(t′, S − {i})]    (3)

        = min over i ∈ S of:  (bi, bi × |S|) + G(t − ai + bi, S − {i})            if t ≥ ai
                              (ai + bi − t, (ai + bi − t) × |S|) + G(bi, S − {i})  if t < ai    (4)

with the initial condition G(t, ∅) = (0, 0). Finally, by tracing back G(0, {1, 2, ..., n}), we can find the optimal solution for the whole problem. However, this algorithm takes O(T × 2^n × n) time and O(T × 2^n) space, which is exponential and not practical.
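For tiny instances, recurrence (3)-(4) can be implemented directly with memoization. The following sketch is our own code, comparing the (Cmax, ∑Ci) pairs lexicographically; it is only meant to make the recursion concrete:

```python
from functools import lru_cache

def bicriteria_dp(a, b):
    """G(t, S): lexicographic minimum of (Cmax, sum Ci) over all schedules of
    the unallocated job set S, where t is the span by which machine M2 is
    still busy past machine M1's frontier."""
    @lru_cache(maxsize=None)
    def g(t, s):
        if not s:
            return (0, 0)                      # initial condition G(t, {}) = (0, 0)
        best = None
        for i in s:
            if t >= a[i]:                      # no-waiting case: job i queues behind M2
                inc, t2 = b[i], t - a[i] + b[i]
            else:                              # waiting case: M2 idles until job i arrives
                inc, t2 = a[i] + b[i] - t, b[i]
            cmax, csum = g(t2, s - {i})
            # inc contributes to the completion times of all |S| remaining jobs
            cand = (inc + cmax, inc * len(s) + csum)
            if best is None or cand < best:
                best = cand
        return best

    return g(0, frozenset(range(len(a))))
```

Lexicographic tuple comparison realizes "first minimize Cmax, then ∑Ci", and the exponential state space (t, S) is exactly the O(T × 2^n) of the text.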



2.3 Fragmental Optimization

Since the time and space complexity grows exponentially, the basic idea of our FO heuristic is to apply Dynamic Programming not to the whole solution (length n) but to a small solution fragment, such that the time and space needed are acceptable. In this problem, a fragment is defined as a sub-sequence of L successive jobs (our experiments set L = 10), while the optimization is done by DP.

As illustrated in Figure 5, our FO heuristic starts with a feasible solution, which means that the makespan is already minimized. The FO algorithm then maintains a sliding window (windowL, windowR) of width L and only considers the jobs inside the window. In each iteration, it selects the sub-sequence of L successive jobs and tries to improve the fragment F by applying DP while the minimized makespan is maintained (checked by Johnson's algorithm). Therefore, in one iteration of FO, L jobs will be re-ordered, which takes O(T × 2^L × L) time. After each iteration, FO moves the sliding window forward. This process repeats until no further improvement can be obtained.

2.4 Experimental Results

We conduct extensive experiments to demonstrate the efficiency of our FO heuristic. For each fixed n, the number of jobs involved, 100 instances are generated randomly. The processing times ai and bi are both integers and evenly distributed in (0, 100). All the algorithms are implemented in C/C++ and run on a Pentium III 800MHz PC with 128M memory.

2.4.1 Problems with small size

We experiment with sizes n = 10 to 17, for 800 instances in total. For each instance, we first use the Branch-and-Bound algorithm to find the optimal solution, and then we apply the FO heuristic to that instance to see whether the solution FO provides is optimal or not.

Table 1 lists the result statistics for all these 800 instances. #optimal shows for how many instances the FO heuristic can provide the optimal solution, while T is the running time for one instance. (The performance of the SACO heuristic [3] is also shown in the table for comparison.) As we can see, Branch-and-Bound is extremely slow, while the FO heuristic is rather effective because it can find the optimal solution for most of the instances, especially when n is small.

Table 1: Statistics for small size problems (800 instances)

                Branch-and-Bound           FO                              SACO
 N  #instance  Tmin  Tavg    Tmax       Tmin  Tavg  Tmax  #optimal      Tmin  Tavg  Tmax  #optimal
10  100        0.00  0.04    0.23       0.20  0.24  0.30  100           0.00  0.10  1.00  100
11  100        0.00  0.07    0.81       0.22  0.30  0.41  100           0.00  0.20  1.00  100
12  100        0.00  0.28    5.20       0.27  0.37  0.44  100           0.00  0.21  1.00  100
13  100        0.01  1.83    24.23      0.39  0.47  0.71  100           0.00  0.24  1.00  98
14  100        0.01  9.66    184.44     0.45  0.54  0.63  100           0.00  0.30  1.00  98
15  100        0.00  66.35   960.46     0.48  0.73  2.72  100           0.00  0.24  1.00  90
16  100        0.01  94.05   1286.75    0.61  0.86  5.25  100           0.00  0.45  1.00  84
17  100        0.01  258.86  6931.5     0.59  0.97  7.24  98            0.00  0.22  1.00  80

2.4.2 Problems with large size

We experiment with sizes n = 20, 30, 50, 75, 100, 150, 200, for a total of 700 instances. We compare the FO heuristic with other remarkable heuristics: INSERT [2], UB2(EXCHANGE) [4], and especially SACO¹, which was shown to "yield better results than other existing heuristics" for this problem [3], to see which algorithm offers the best solution for each instance.

Statistics are shown in Table 2. #best shows the number of instances for which a particular algorithm provides the best solution. δ denotes the deviation of a particular algorithm's solution from the best solution. Tavg shows the average time for solving one instance. As we can see, except for two instances, our FO heuristic always offers the best solution among the 4 algorithms. Therefore, the δavg of the other algorithms actually means the average deviation between that algorithm and our FO heuristic. Looking at the δavg of SACO, we can see that δavg tends to increase with the problem size n. Consequently, as n grows, our FO tends to provide a better solution than SACO. What is more important, as the problem size n grows, the increase in the average time of FO is less than the increase in

¹The authors would like to give special thanks to Vincent T'Kindt for providing the executable SACO program, which really helped to compare the efficiency of the algorithms.



Table 2: Statistics for large size problems (700 instances)

                INSERT                UB2                   SACO                           FO
 N  #instance  δavg   δmax   #best   δavg   δmax   #best   δavg   δmax   Tavg    #best   δavg   δmax   Tavg   #best
 20  100       0.92%  5.17%  6       0.58%  3.61%  17      0.05%  0.54%  0.41    64      0.00%  0.07%  1.19   99
 30  100       0.97%  3.25%  0       0.65%  2.90%  2       0.20%  0.98%  1.04    14      0.00%  0.01%  2.02   99
 50  100       0.96%  2.43%  0       0.58%  2.43%  0       0.36%  0.99%  4.29    0       0.00%  0.00%  4.63   100
 75  100       0.85%  1.94%  0       0.54%  1.93%  0       0.44%  0.93%  13.50   0       0.00%  0.00%  9.16   100
100  100       0.80%  1.67%  0       0.53%  1.67%  0       0.46%  0.97%  30.66   0       0.00%  0.00%  14.69  100
150  100       0.70%  2.11%  0       0.50%  2.11%  0       0.49%  1.21%  89.63   0       0.00%  0.00%  30.01  100
200  100       0.59%  1.31%  0       0.43%  1.31%  0       0.50%  1.10%  201.13  0       0.00%  0.00%  53.67  100

the average time of SACO. For the instances with n = 200, our FO can provide solutions in approximately 1 minute, while SACO is 4 times slower.

Hence, the FO heuristic outperforms SACO both in time and in solution quality.

2.4.3 Comparison with LBv

We also compare the FO heuristic with the lower bound LBv (Lower Bound by Lagrangian Relaxation) [4] to see the absolute quality of our solutions. We define the absolute deviation ∆ of a solution π as ∆(π) = (∑Ci(π) − LBv)/LBv, which means the deviation between that solution and the optimal solution is at most ∆(π). Statistics are shown in Figure 6.

Figure 6: Comparison between FO and LBv

  N  #instance   FO ∆avg   FO ∆max
 10  100         9.92%     32.57%
 15  100         7.14%     26.64%
 20  100         6.44%     23.78%
 30  100         4.36%     12.63%
 50  100         3.03%     8.63%
 75  100         2.13%     5.40%
100  100         1.81%     4.82%
150  100         1.17%     3.11%
200  100         0.89%     2.22%

LBv is a coarse lower bound, especially when n is small. As we can see, when n = 10, 15, the solutions that FO provides are already optimal, but ∆avg is still about 10%.

However, when n becomes larger our FO algorithm can provide solutions that are very close to the lower bound, which means that the deviation between the FO solution and the optimal solution is small. Moreover, ∆avg decreases as the problem size n grows. This means that the larger n is, the closer the solution our FO provides is to the optimal solution. For the instances with n = 200, our FO offers solutions that deviate by only about 1% from the optimal.

3 3-Index Assignment Problem

The Three-Index Assignment Problem (AP3), also known as the 3-Dimensional Assignment Problem, can be formulated as:

Instance:  a cost matrix C = {ci,j,k}n×n×n
Solution:  two permutations (p, q), p, q ∈ πN
Objective: to minimize C(p, q) = ∑i=1..n ci,p(i),q(i)

where πN denotes the set of all permutations on the set of integers N = {1, 2, ..., n}. AP3 is NP-hard, since the 3-D Matching Problem, which is one of the basic NP-hard problems, is a special case of AP3. Both exact and heuristic algorithms have been proposed to solve AP3 [11, 12, 13, 14]. Among these, Balas and Saltzman (1991) [14] proposed the MAX-REGRET and



VARIABLE DEPTH INTERCHANGE heuristics. Crama and Spieksma (1992) [13] studied a special case of AP3 by restricting the costs of edges in any triangle to obey the triangle inequality. Burkard et al. (1996) [12] focused on AP3 with decomposable cost coefficients. However, even for these two special cases, AP3 is still NP-hard. Recently, Aiex et al. (2003) [11] applied GRASP with Path Relinking to AP3 and obtained better results than all other existing heuristics.

In this section, we first design a Fragmental Optimization algorithm for AP3. We then hybridize FO with a Genetic Algorithm (GA) to demonstrate the hybridization between FO and classical heuristic methods. Experiments indicate that GA benefits from this hybridization. We test our algorithm on three classes of standard benchmarks and report the computational results.

3.1 Reduce 3D to 2D

It is quite obvious that AP3 is a straightforward extension of the classical two-dimensional assignment problem (AP2) defined below:

Instance:  a matrix D = {di,j}n×n (bipartite graph)
Solution:  q = (q1, q2, ..., qn), q ∈ πN
Objective: to minimize D(q) = ∑i=1..n di,q(i)

Although AP3 is NP-hard, it is well known that AP2 can be solved by an efficient implementation of the Hungarian algorithm in O(n³) time [15]. Here we consider how to make use of this result. Given an initial solution (p, q) for AP3, let

di,j = ci,p(i),j,  ∀i, j ∈ {1, 2, ..., n}    (5)

so that, for the fixed p, we get minq∈πN ∑i=1..n ci,p(i),q(i) = minq∈πN ∑i=1..n di,q(i).

Consequently, if we fix p, the optimization of q becomes an AP2 problem, and vice versa. Therefore, our idea is to optimize one permutation subject to the other permutation being fixed. To illustrate this, we use an example (instance bs 4 5.dat from the Balas and Saltzman Dataset, see Section 3.4.1). As shown in Figure 7, a random initial assignment costs 177. Figure 8 shows the optimization of q by applying the Hungarian Method. The objective function decreases from 177 to 72 (the dotted lines show the original assignment, while the new assignment is shown by bold lines).

Figure 7: random initial for bs 4 5.dat

1

2

3

4

1

2

3

4

1

2

3

4

Obj= 177

1, 2, 3, 4p=(1, 4, 2, 3)q=(2, 4, 1, 3)

cost=(98 5 62 12)=177

Figure 8: optimizing permutation q (the "fragment"). i = (1, 2, 3, 4), p = (1, 4, 2, 3), q = (1, 2, 4, 3) (*); cost = (39, 18, 3, 12); Obj = 72.

Figure 9: optimizing permutation p. i = (1, 2, 3, 4), p = (3, 4, 2, 1) (*), q = (1, 2, 4, 3); cost = (1, 18, 3, 14); Obj = 36.

Figure 10: optimizing the index permutation. Index = (1, 3, 4, 2) (*) with p' = (3, 4, 2, 1), q' = (1, 2, 4, 3); equivalently i = (1, 2, 3, 4), p = (3, 1, 4, 2), q = (1, 3, 2, 4); cost = (1, 4, 2, 10); Obj = 17.

3.2 Fragmental Optimization

As shown in Figure 8, we construct a bipartite graph based on Equation (5). Symmetrically, there are two other ways to construct such a bipartite graph:

$$\text{let } d_{i,j} = c_{i,j,q(i)}, \quad \forall i, j \in \{1, 2, \ldots, n\}, \quad (6)$$
$$\text{or let } d_{i,j} = c_{j,p(i),q(i)}, \quad \forall i, j \in \{1, 2, \ldots, n\}. \quad (7)$$


Figure 9 illustrates Equation (6), while Figure 10 corresponds to Equation (7). Using the above, we define three ways to select a fragment, which correspond to the three parts of a solution to AP3, namely permutation p, permutation q, and the index permutation. Our FO heuristic iteratively optimizes each fragment, using the Hungarian method as the optimization method, until no further improvement can be achieved.
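The FO loop just described can be sketched as follows (a minimal sketch, assuming the cost tensor c is a NumPy array; SciPy's `linear_sum_assignment` stands in for the Hungarian method, and the function names are ours):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian method, O(n^3)

def ap3_cost(c, p, q):
    """Objective of an AP3 solution (p, q): sum_i c[i, p(i), q(i)]."""
    return sum(c[i, p[i], q[i]] for i in range(len(p)))

def fragmental_optimization(c, p, q):
    """Cycle through the three fragment types (Eqs. 5-7) until no
    improvement.  Each step keeps the rest of the solution fixed and
    re-optimizes one fragment as a 2-D assignment problem."""
    n = c.shape[0]
    p, q = list(p), list(q)
    best = ap3_cost(c, p, q)
    while True:
        # Fragment "q" (Eq. 5): fix p, set d[i][j] = c[i, p(i), j].
        d = np.array([[c[i, p[i], j] for j in range(n)] for i in range(n)])
        q = list(linear_sum_assignment(d)[1])
        # Fragment "p" (Eq. 6): fix q, set d[i][j] = c[i, j, q(i)].
        d = np.array([[c[i, j, q[i]] for j in range(n)] for i in range(n)])
        p = list(linear_sum_assignment(d)[1])
        # Index fragment (Eq. 7): d[i][j] = c[j, p(i), q(i)] decides which
        # row j each pair (p(i), q(i)) should be matched with.
        d = np.array([[c[j, p[i], q[i]] for j in range(n)] for i in range(n)])
        sigma = linear_sum_assignment(d)[1]
        new_p, new_q = [0] * n, [0] * n
        for i in range(n):
            new_p[sigma[i]], new_q[sigma[i]] = p[i], q[i]
        p, q = new_p, new_q
        cur = ap3_cost(c, p, q)
        if cur >= best:          # every step is non-increasing, so equality
            return p, q, cur     # means a full cycle brought no improvement
        best = cur
```

Each Hungarian solve admits the current assignment as a feasible point, so the objective is monotonically non-increasing and the loop terminates.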

3.3 Hybridization with a Genetic Algorithm

The Genetic Algorithm (GA) has been shown to be a competitive technique for solving general combinatorial optimization problems. However, it is still possible to incorporate problem-specific knowledge into a GA so that the results can be further improved. The hybridization of FO with the Genetic Algorithm reflects this idea.

Figure 11: Structure of FOGA. A generation produces new individuals through the Crossover Operator; each new individual passes through Fragmental Optimization into the Candidates Pool; Selection and Replacement then form the New Generation.

A fairly standard GA structure is adopted in our implementation. The initial "generation" is randomly generated. The "Crossover Operator" randomly chooses two individuals (as parents) according to their fitness values and generates one new individual. This newly generated individual "mutates" with a very small probability. All newly generated individuals are put into a candidates pool, whose size is normally twice the parent population. The algorithm then applies the "survival of the fittest" principle: the top half of the individuals in the candidates pool, ranked by fitness, are selected to form the next "generation". The GA terminates once some termination criterion is satisfied.

In our algorithm, we did not implement a mutation operator because it was not very useful; others have reported similar experiences with mutation in GAs [16, 17]. Instead, we replace the mutation operator with our Fragmental Optimization module: when a new individual is generated by the crossover operator, we apply FO to improve its quality before putting it into the candidates pool. In this way, we hybridize FO with the Genetic Algorithm. Figure 11 illustrates the structure of our hybridization (FOGA).
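A compact sketch of this FOGA loop, under stated assumptions: order crossover (OX) is one plausible permutation operator (the paper does not name the one it uses), fitness-based parent selection is simplified to uniform sampling, and `local_improve` is a placeholder for the FO module, i.e. any callable mapping a candidate `(p, q)` to an improved one:

```python
import random

def order_crossover(p1, p2):
    """Order crossover (OX) for permutations: copy a slice of p1,
    fill the remaining positions in the order the genes appear in p2."""
    n = len(p1)
    a, b = sorted(random.sample(range(n), 2))
    child = [None] * n
    child[a:b] = p1[a:b]
    fill = [g for g in p2 if g not in child[a:b]]
    for i in range(n):
        if child[i] is None:
            child[i] = fill.pop(0)
    return child

def foga(cost, n, local_improve, pop_size=10, generations=20, seed=0):
    """GA with FO in place of mutation: children are locally improved
    before entering the candidates pool (about twice the parent
    population); the fitter half survives into the next generation."""
    random.seed(seed)
    pop = [(random.sample(range(n), n), random.sample(range(n), n))
           for _ in range(pop_size)]
    for _ in range(generations):
        pool = list(pop)
        while len(pool) < 2 * pop_size:
            (p1, q1), (p2, q2) = random.sample(pop, 2)
            child = (order_crossover(p1, p2), order_crossover(q1, q2))
            pool.append(local_improve(child))  # FO replaces mutation here
        pool.sort(key=lambda s: cost(*s))      # "survival of the fittest"
        pop = pool[:pop_size]
    return pop[0]
```

Plugging the FO procedure of Section 3.2 in as `local_improve` gives FOGA; passing an identity function recovers a plain GA without mutation.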

Preliminary experiments were conducted to tune our algorithm, because the parameter setting can influence the performance of a genetic algorithm substantially. Due to space limitations, we do not report the details here. We set the population size to 100.

Figure 12 compares pure GA, a multi-round FO, and FOGA on instance bs_26_1.dat from the Balas and Saltzman dataset (see Section 3.4.1). We observe the following:

1. Pure GA performs rather badly, even when more time is given.

2. multiFO is a multi-round FO algorithm: in each round, it starts from a random initial permutation and uses FO to improve the solution. multiFO is able to find relatively good solutions in a short time, which reflects the effectiveness of our FO algorithm. Unfortunately, even when much more time is given, this algorithm cannot improve the best solution; for instance bs_26_1.dat, no better solution is found after 1 second.

3. FOGA shows the power of hybridizing the Genetic Algorithm with Fragmental Optimization: it is capable of finding very competitive solutions.


Figure 12: Comparison between GA, multiFO and FOGA on bs_26_1.dat (objective value versus time in seconds, 0-4 s).

Table 3: Balas and Saltzman Dataset (12×5 instances). GRASP CPU times are measured on an SGI R10000; values after ">" are the corresponding PIII 800 lower bounds under the SPEC-based scaling.

  n  | Optimal Avg.Obj | B-S Avg.Obj | GRASP with Path Relinking: Avg.Obj / Avg. CPU Time | multiFO: Avg.Obj / Avg.Time (PIII 800) | FOGA: Avg.Obj / Avg.Time (PIII 800)
  4  | 42.2  | 43.2 | -                            | 42.2 / 0.00 s | 42.2 / 0.00 s
  6  | 40.2  | 45.4 | -                            | 40.2 / 0.01 s | 40.2 / 0.01 s
  8  | 23.8  | 33.6 | -                            | 33.6 / 0.01 s | 23.8 / 0.03 s
  10 | 19    | 40.8 | -                            | 22.6 / 0.01 s | 19 / 0.37 s
  12 | 15.6  | 24   | 15.6 / 74.79 s (> 18.7 s)    | 26.2 / 0.02 s | 15.6 / 0.87 s
  14 | 10    | 22.4 | 10 / 106.55 s (> 26.64 s)    | 26.4 / 0.02 s | 10 / 1.73 s
  16 | 10    | 25   | 10.2 / 143.89 s (> 35.97 s)  | 26.0 / 0.03 s | 10 / 1.89 s
  18 | 6.4   | 17.6 | 7.4 / 190.88 s (> 47.72 s)   | 24.6 / 0.03 s | 7.2 / 2.95 s
  20 | 4.8   | 27.4 | 6.4 / 246.70 s (> 61.68 s)   | 26.8 / 0.04 s | 5.2 / 4.01 s
  22 | 4     | 18.8 | 7.8 / 309.64 s (> 77.41 s)   | 26.4 / 0.05 s | 5.6 / 4.54 s
  24 | 1.8   | 14   | 7.4 / 382.45 s (> 95.61 s)   | 23.2 / 0.06 s | 3.2 / 5.66 s
  26 | 1.3   | 15.7 | 8.4 / 465.20 s (> 116.3 s)   | 23.2 / 0.07 s | 3.6 / 10.78 s

Hence, it is evident that our hybridization of GA and FO is successful: guided by the GA, FO can improve the solution quality consistently, and with the help of FO, the Genetic Algorithm becomes competitive in finding good solutions.

3.4 Computational Results

In this section, we demonstrate the effectiveness of our hybrid genetic algorithm (FOGA) by testing our heuristic on three benchmark datasets. All code is implemented in C/C++ on a Pentium III 800 MHz PC with 128 MB of memory. For each instance, FOGA is run only once.

However, the computing machine used in Aiex's paper [11] is an SGI Challenge R10000. Therefore, in order to compare CPU times, a scaling scheme based on SPEC² is used.

3.4.1 Balas and Saltzman Dataset

This dataset was generated by Balas and Saltzman [14]. It includes 60 test instances with problem sizes n = 4, 6, 8, ..., 24, 26. For each n, 5 instances are randomly generated with integer cost coefficients c_{i,j,k} uniformly distributed in the interval [0, 100].
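An instance of this form can be generated as follows (a NumPy-based sketch; the function names are ours, not from [14]):

```python
import numpy as np

def balas_saltzman_instance(n, seed=0):
    """Generate one random AP3 instance in the style of Balas and Saltzman:
    integer cost coefficients c[i, j, k] drawn uniformly from [0, 100]."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, 101, size=(n, n, n))  # 101 excl. upper bound

def ap3_objective(c, p, q):
    """Objective value of a solution (p, q): sum_i c[i, p(i), q(i)]."""
    return int(sum(c[i, p[i], q[i]] for i in range(len(p))))
```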

Each row of Table 3 shows the average over the 5 instances of the same size. The column "Optimal" shows the optimal solutions reported by Balas and Saltzman, while column "B-S" is the result of their VARIABLE DEPTH INTERCHANGE heuristic. Column "GRASP with Path Relinking" is the result reported in [11]. Column "multiFO" is the result of our multi-round FO algorithm, terminated after 100 rounds. Finally, column "FOGA" shows the result of our hybrid genetic algorithm. The best results among these algorithms are highlighted in the table.

As shown in Table 3, our FOGA provides much better solutions than GRASP and B-S. Furthermore, FOGA is about 10 times faster than GRASP on these instances.

3.4.2 Crama and Spieksma Dataset

Crama and Spieksma generated this dataset by restricting the coefficients to c_{i,j,k} = d_{i,j} + d_{i,k} + d_{j,k} [13]. There are 3 types of instances in this dataset. For each type, 6 instances of size n = 33 and 3 instances of size n = 66 are generated.

Tables 4, 5 and 6 report the experimental results. In these tables, column "C-S" reports the result of Crama and Spieksma's heuristic. As highlighted in the tables, for all 18 instances our FOGA provides the best solution among all heuristics.

It is surprising that, for these instances, FOGA is about 1000 times faster than GRASP: GRASP needs several hours to obtain its solutions, while FOGA takes only several seconds.

3.4.3 Burkard, Rudolf & Woeginger Dataset

Burkard et al. [12] described this dataset with decomposable cost coefficients, which means that c_{i,j,k} = α_i · β_j · γ_k. For each problem size n = 4, 6, 8, ..., 16, 100 test instances are provided. Table 7 summarizes the results over all 700 instances, where each row is the average over the 100 instances of the same size n; column "B-R-W" reports the result of Burkard et al.'s heuristic.

²SPEC (Standard Performance Evaluation Corporation, http://www.specbench.org/osg/cpu2000/) indicates that a PIII 800 is not more than 4 times faster than an SGI Challenge R10000.


Table 4: Crama and Spieksma Dataset, Type I (6 instances)

  CaseID   | C-S  | GRASP with Path Relinking: Avg.Obj / Avg. CPU Time (R10000) | multiFO: Avg.Obj / Avg.Time (PIII 800) | FOGA: Avg.Obj / Avg.Time (PIII 800)
  3DA99N1  | 1618 | 1608 / 660.5 s (> 165.13 s)   | 1608 / 0.03 s | 1608 / 0.03 s
  3DA99N2  | 1411 | 1401 / 680.5 s (> 170.13 s)   | 1401 / 0.02 s | 1401 / 0.11 s
  3DA99N3  | 1609 | 1604 / 676.1 s (> 169.03 s)   | 1604 / 0.03 s | 1604 / 0.11 s
  3DA198N1 | 2668 | 2664 / 15470.1 s (> 3867.5 s) | 2662 / 0.21 s | 2662 / 0.55 s
  3DA198N2 | 2469 | 2449 / 15010.9 s (> 3752.7 s) | 2449 / 0.21 s | 2449 / 0.27 s
  3DA198N3 | 2775 | 2759 / 15084.6 s (> 3771.2 s) | 2758 / 0.22 s | 2758 / 0.58 s

Table 5: Crama and Spieksma Dataset, Type II (6 instances)

  CaseID   | C-S  | GRASP with Path Relinking: Avg.Obj / Avg. CPU Time (R10000) | multiFO: Avg.Obj / Avg.Time (PIII 800) | FOGA: Avg.Obj / Avg.Time (PIII 800)
  3DIJ99N1 | 4861 | 4797 / 766.06 s (> 191.52 s)  | 4797 / 0.06 s | 4797 / 0.11 s
  3DIJ99N2 | 5142 | 5068 / 772.84 s (> 193.21 s)  | 5068 / 0.13 s | 5067 / 0.26 s
  3DIJ99N3 | 4352 | 4287 / 762.19 s (> 190.55 s)  | 4287 / 0.08 s | 4287 / 0.26 s
  3DI198N1 | 9780 | 9694 / 14629.1 s (> 3657.3 s) | 9687 / 0.51 s | 9684 / 4.86 s
  3DI198N2 | 9142 | 8954 / 14922.9 s (> 3730.7 s) | 8947 / 0.53 s | 8944 / 3.35 s
  3DI198N3 | 9888 | 9751 / 14391.7 s (> 3597.9 s) | 9747 / 0.53 s | 9745 / 3.09 s

Table 6: Crama and Spieksma Dataset, Type III (6 instances)

  CaseID   | C-S | GRASP with Path Relinking: Avg.Obj / Avg. CPU Time (R10000) | multiFO: Avg.Obj / Avg.Time (PIII 800) | FOGA: Avg.Obj / Avg.Time (PIII 800)
  3D1299N1 | 135 | 133 / 490.79 s (> 122.7 s)   | 133 / 0.01 s | 133 / 0.01 s
  3D1299N2 | 137 | 131 / 471.21 s (> 117.8 s)   | 132 / 0.01 s | 131 / 0.03 s
  3D1299N3 | 135 | 131 / 451.72 s (> 112.93 s)  | 131 / 0.01 s | 131 / 0.02 s
  3D1198N1 | 293 | 286 / 5322.97 s (> 1330.7 s) | 287 / 0.05 s | 286 / 0.15 s
  3D1198N2 | 294 | 286 / 5126.86 s (> 1281.7 s) | 286 / 0.05 s | 286 / 0.16 s
  3D1198N3 | 293 | 282 / 5059.06 s (> 1264.8 s) | 283 / 0.05 s | 282 / 0.23 s

Table 7: Burkard, Rudolf & Woeginger Dataset (700 instances)

  n  | B-R-W Avg.Obj | GRASP with Path Relinking: Avg.Obj / Avg. CPU Time (R10000) | multiFO: Avg.Obj / Avg.Time (PIII 800) | FOGA: Avg.Obj / Avg.Time (PIII 800)
  4  | 443.7   | -                           | 433.6 / 0.00 s   | 443.6 / 0.00 s
  6  | 634.2   | -                           | 633.72 / 0.00 s  | 633.72 / 0.01 s
  8  | 819.94  | -                           | 819.16 / 0.01 s  | 819.16 / 0.03 s
  10 | 960.55  | -                           | 959.42 / 0.03 s  | 959.41 / 0.07 s
  12 | 1188.02 | 1186.81 / 68.3 s (> 17.1 s) | 1186.83 / 0.04 s | 1186.81 / 0.13 s
  14 | 1469.27 | 1467.74 / 98.1 s (> 24.5 s) | 1467.76 / 0.07 s | 1467.74 / 0.23 s
  16 | 1476.99 | 1475.13 / 139.3 s (> 34.8 s)| 1475.15 / 0.10 s | 1475.13 / 0.40 s

As highlighted in Table 7, for these test instances our FOGA provides the same results as GRASP at about 100 times the speed. However, this dataset is considered easy, since even multiFO offers very competitive solutions, and is faster still.

4 Conclusions

The Fragmental Optimization (FO) technique proposed in this paper is a simple yet effective heuristic framework for solving combinatorial optimization problems. We applied the FO technique to the following well-known NP-hard problems:

• 2-machine bicriteria flowshop problem $F2||(\sum C_i/C_{max})$: an FO-based heuristic is proposed for this problem, which selects a length-L segment as the fragment and uses dynamic programming as the optimization method. Experimental results show that our FO heuristic outperforms other existing heuristics both in time and in solution quality. Moreover, by comparing FO to a coarse lower bound LB_v, we conclude that the solutions FO provides are very close to optimal: for small problems, the FO heuristic can usually find the optimal solution, while for large problems the solutions it provides are only about 1%–2% from optimal.

• 3-index assignment problem AP3: unlike $F2||(\sum C_i/C_{max})$, a solution to AP3 is represented as 2 permutations. In our FO heuristic, by defining each permutation as a fragment, we reduce the dimension from 3 to 2, so that the optimization can be performed by the Hungarian method in $O(n^3)$ time. We further hybridize FO with a GA, and experiments indicate that this hybridization is successful. Computational results show that our hybrid genetic algorithm (FOGA) outperforms other existing heuristics: on the benchmark instances we tested, FOGA is about 10–1000 times faster than GRASP and always offers the best solutions within several seconds.

Based on the above examples, the two elements of a Fragmental Optimization algorithm, Fragment Selection and Optimization, are tightly coupled. Generally speaking, any technique can be used for the Optimization step, such as dynamic programming, network flow, or even branch-and-bound. The key consideration is that the optimization algorithm must be efficient for the selected fragment, since it is invoked in every iteration.

In conclusion, we have proposed the Fragmental Optimization heuristic in this paper. The success of applying FO to problems from different domains suggests that FO is a sufficiently general framework for optimization problems.

References

[1] Jatinder N.D. Gupta, Karsten Hennig, Frank Werner, Local search heuristics for two-stage flow shop problems with secondary criterion, Computers & Operations Research 29 (2002) 123–149.

[2] Jatinder N.D. Gupta, Venkata R. Neppalli, Frank Werner, Minimizing total flow time in a two-machine flowshop problem with minimum makespan, International Journal of Production Economics 69 (2001) 323–338.

[3] Vincent T'Kindt, Nicolas Monmarche, Fabrice Tercinet, Daniel Laugt, An Ant Colony Optimization algorithm to solve a 2-machine bicriteria flowshop scheduling problem, EJOR 142 (2002) 250–257.

[4] Vincent T'Kindt, Jatinder N.D. Gupta and Jean-Charles Billaut, A Branch-and-Bound algorithm to solve a two-machine bicriteria flowshop scheduling problem, ORP3 (2001), Paris, September 26–29.

[5] J.N.D. Gupta, Nagarajan Palanimuthu, C.L. Chen, Designing a tabu search algorithm for the two-stage flowshop problem with secondary criterion, Production Planning and Control 10 (1999) 251–265.

[6] Venkata R. Neppalli, Chuen-Lung Chen, Jatinder N.D. Gupta, Genetic algorithms for the two-stage bicriteria flowshop problem, European Journal of Operational Research 95 (1996) 356–373.

[7] C. Rajendran, Two-stage flow shop scheduling problem with bicriteria, Journal of the Operational Research Society 43 (1992) 871–884.

[8] C.L. Chen, R.L. Bulfin, Complexity results for multi-machine multi-criteria scheduling problems, Proceedings of the Third Industrial Engineering Research Conference, 1994, pp. 662–665.

[9] R.W. Conway, W.L. Maxwell, L.W. Miller, Theory of Scheduling, Addison-Wesley, Reading, Mass., 1967.

[10] S.M. Johnson, Optimal two- and three-stage production schedules with setup times included, Naval Research Logistics Quarterly 1 (1954) 61–68.

[11] R.M. Aiex, M.G.C. Resende, P.M. Pardalos, and G. Toraldo, GRASP with path relinking for the three-index assignment problem, technical report, accepted by INFORMS Journal on Computing, 2003. http://www.research.att.com/~mgcr/doc/g3index.pdf

[12] R.E. Burkard, R. Rudolf, and G.J. Woeginger, Three-dimensional axial assignment problems with decomposable cost coefficients, Discrete Applied Mathematics 65 (1996) 123–169.

[13] Y. Crama and F.C.R. Spieksma, Approximation algorithms for three-dimensional assignment problems with triangle inequalities, European Journal of Operational Research 60 (1992) 273–279.

[14] E. Balas and M.J. Saltzman, An algorithm for the three-index assignment problem, Operations Research 39 (1991) 150–161.

[15] H.W. Kuhn, The Hungarian method for the assignment problem, Naval Research Logistics Quarterly 2 (1955) 83–97.

[16] D. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, Reading, MA, 1989.

[17] G. Syswerda, Schedule Optimization Using Genetic Algorithms, in L. Davis (Ed.), Handbook of Genetic Algorithms, pp. 332–349, Van Nostrand Reinhold, New York, 1991.


Biased Minimax Probability Machine for Medical Diagnosis

Kaizhu Huang, Haiqin Yang, Irwin King, Michael R. Lyu and Laiwan Chan
Department of Computer Science and Engineering
The Chinese University of Hong Kong
Shatin, New Territories, Hong Kong

kzhuang, hqyang, king, lyu, [email protected]

Abstract

The Minimax Probability Machine (MPM) constructs a classifier that provides a worst-case bound on the probability of misclassification of future data points, based on reliable estimates of the means and covariance matrices of the classes from the training data points, and achieves performance comparable to a state-of-the-art classifier, the Support Vector Machine. In this paper, we remove the MPM's assumption of an unbiased weight for each class and develop a critical extension, named the Biased Minimax Probability Machine (BMPM), to deal with biased classification tasks, especially in medical diagnostic applications. We outline the theoretical derivation of the BMPM. Moreover, we demonstrate that this model can be transformed into a concave-convex Fractional Programming (FP) problem, or a pseudoconcave problem. After illustrating our model on a synthetic dataset and applying it to real-world medical diagnosis datasets, we obtain encouraging and promising experimental results.

1. Introduction

Biased classifiers have many applications, including medical diagnosis. The goal of constructing a two-category biased classifier is to make the accuracy of the important class, rather than the overall accuracy, as high as possible, while maintaining the accuracy of the less important class at an acceptable level. For some biased classifiers, e.g., the weighted Support Vector Machine [8], it is often hard to evaluate quantitatively how the weight will affect the classification. Recently, a novel classification model, the Minimax Probability Machine (MPM) [4], was proposed; it provides a worst-case bound on the probability of misclassification of future data points, based on reliable estimates of the means and covariance matrices of the classes from the training data points, and achieves performance comparable to a state-of-the-art classifier, the Support Vector Machine [12].

In this paper, by removing the assumption of an unbiased weight for each class in the MPM, we develop a critical extension, the Biased Minimax Probability Machine (BMPM), to deal with biased classification tasks. This model is transformed into a concave-convex Fractional Programming (FP) [10] problem, or a pseudoconcave problem, in which every local maximum is a global maximum. Moreover, as far as we know, this model is the first quantitative method to control how the decision hyperplane moves in favor of the classification of the more important class.

The paper is organized as follows. In Section 2, we present the linear Biased Minimax Probability Machine while reviewing the original MPM model. In Section 3, we kernelize the linear Biased Minimax Probability Machine and propose a feasible solution method to extend its application to non-linear classification tasks. In Section 4, we illustrate our model on a synthetic dataset and apply it to real-world medical diagnosis datasets. Finally, we conclude the paper in Section 5.

2. The Linear Optimal Biased Probabilistic Decision Hyperplane

In this section, we present the linear biased minimax framework while reviewing the original MPM.


Suppose two random vectors $\mathbf{x}$ and $\mathbf{y}$ represent two classes of data, with means and covariance matrices $(\bar{x}, \Sigma_x)$ and $(\bar{y}, \Sigma_y)$ respectively, in a two-category classification task, where $\mathbf{x}, \mathbf{y}, \bar{x}, \bar{y} \in R^n$ and $\Sigma_x, \Sigma_y \in R^{n\times n}$. For convenience, we also use $\mathbf{x}$ and $\mathbf{y}$ to denote the corresponding classes of the $\mathbf{x}$ data and the $\mathbf{y}$ data respectively.

With reliable estimates of $\bar{x}, \Sigma_x, \bar{y}, \Sigma_y$ for the two classes of data, the Minimax Probability Machine attempts to determine the hyperplane $a^T\mathbf{z} = b$ ($a \neq 0$, $\mathbf{z} \in R^n$, $b \in R$; the superscript $T$ denotes the transpose) which separates the two classes of data with maximal probability. The original model is formulated as follows:

$$\max_{\alpha,b,a\neq 0} \alpha \quad \text{s.t.} \quad \inf \Pr\{a^T\mathbf{x} \ge b\} \ge \alpha, \qquad \inf \Pr\{a^T\mathbf{y} \le b\} \ge \alpha,$$

where $\alpha$ represents the lower bound of the accuracy on future data, i.e., the worst-case accuracy. Future points $\mathbf{z}$ for which $a^T\mathbf{z} > b$ are classified as class $\mathbf{x}$; otherwise they are judged to belong to class $\mathbf{y}$. The resulting decision hyperplane minimizes the worst-case (maximum) probability of misclassification, i.e., the error rate, on future data points. Furthermore, the MPM problem can be transformed into a convex optimization problem, more specifically a Second Order Cone Programming problem [6][7].

This model assumes an unbiased weight for the two classes, i.e., it forces the probabilities for class $\mathbf{x}$ and class $\mathbf{y}$ to take the same value $\alpha$. However, in real-world applications the two classes are not always equally important, which implies that the corresponding probabilities are not necessarily equal. Motivated by this observation, we propose the following Biased Minimax Probability Machine (BMPM) formulation:

$$\max_{\alpha,\beta,b,a\neq 0} \alpha \quad \text{s.t.} \quad \inf \Pr\{a^T\mathbf{x} \ge b\} \ge \alpha, \quad (1)$$
$$\inf \Pr\{a^T\mathbf{y} \le b\} \ge \beta, \quad (2)$$
$$\beta \ge \gamma, \quad (3)$$

where $\gamma$ is a pre-specified positive constant representing an acceptable accuracy level for the less important class.

This optimization maximizes the accuracy (the probability $\alpha$) for the biased class $\mathbf{x}$ while maintaining class $\mathbf{y}$'s accuracy at an acceptable level via the lower bound (3). The hyperplane $a^{*T}\mathbf{z} = b^*$ given by the solution of this optimization favors the classification of the important class $\mathbf{x}$ over the less important class $\mathbf{y}$ and is thus more suitable for handling biased classification tasks.

In the following, we show how to solve this optimization problem. First, we borrow Lemma 1 from [5].

Lemma 1: Given $a \neq 0$ and $b$ such that $a^T\mathbf{y} \le b$ and $\beta \in [0, 1)$, the condition

$$\inf \Pr\{a^T\mathbf{y} \le b\} \ge \beta$$

holds if and only if $b - a^T\bar{y} \ge \kappa(\beta)\sqrt{a^T\Sigma_y a}$, with $\kappa(\beta) = \sqrt{\frac{\beta}{1-\beta}}$.

This lemma can be proved by using the Lagrange multiplier method together with the following result developed in [9]:

$$\sup_{\mathbf{y}\sim(\bar{y},\,\Sigma_y)} \Pr\{a^T\mathbf{y} \ge b\} = \frac{1}{1+d^2}, \quad \text{with } d^2 = \inf_{a^T y \ge b} (y-\bar{y})^T \Sigma_y^{-1} (y-\bar{y}).$$
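Lemma 1 can be turned into a small numerical helper (a sketch with hypothetical names): inverting $b - a^T\bar{y} = \kappa(\beta)\sqrt{a^T\Sigma_y a}$ gives $\beta = t^2/(1+t^2)$ with $t = (b - a^T\bar{y})/\sqrt{a^T\Sigma_y a}$:

```python
import numpy as np

def kappa(beta):
    """kappa(beta) = sqrt(beta / (1 - beta)) from Lemma 1."""
    return np.sqrt(beta / (1.0 - beta))

def worst_case_accuracy(a, b, y_mean, y_cov):
    """Largest beta such that inf Pr{a^T y <= b} >= beta (Lemma 1 inverted):
    with t = (b - a^T ybar) / sqrt(a^T Sigma_y a), solving
    beta / (1 - beta) = t^2 gives beta = t^2 / (1 + t^2)."""
    t = (b - a @ y_mean) / np.sqrt(a @ y_cov @ a)
    return t * t / (1.0 + t * t) if t > 0 else 0.0
```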

Details of the proof can be found in [5]. By using Lemma 1, we obtain the following transformed optimization problem:

$$\max_{\alpha,\beta,b,a\neq 0} \alpha \quad \text{s.t.} \quad -b + a^T\bar{x} \ge \kappa(\alpha)\sqrt{a^T\Sigma_x a}, \quad (4)$$
$$b - a^T\bar{y} \ge \kappa(\beta)\sqrt{a^T\Sigma_y a}, \quad (5)$$
$$\beta \ge \gamma, \quad (6)$$

where $\kappa(\alpha) = \sqrt{\frac{\alpha}{1-\alpha}}$ and $\kappa(\beta) = \sqrt{\frac{\beta}{1-\beta}}$. Equation (5) is obtained directly from (2) by using Lemma 1.

Similarly, by changing $a^T\mathbf{x} \ge b$ into $a^T(-\mathbf{x}) \le -b$, (4) is obtained from (1). From (4) and (5), we get

$$a^T\bar{y} + \kappa(\beta)\sqrt{a^T\Sigma_y a} \;\le\; b \;\le\; a^T\bar{x} - \kappa(\alpha)\sqrt{a^T\Sigma_x a}. \quad (7)$$

If we eliminate $b$ from this inequality, we obtain

$$a^T(\bar{x} - \bar{y}) \ge \kappa(\alpha)\sqrt{a^T\Sigma_x a} + \kappa(\beta)\sqrt{a^T\Sigma_y a}. \quad (8)$$

We observe that the magnitude of $a$ does not influence the solution of (8). Without loss of generality, we can set $a^T(\bar{x} - \bar{y}) = 1$. In addition, since $\kappa(\alpha)$ increases monotonically with $\alpha$, maximizing $\alpha$ is equivalent to maximizing $\kappa(\alpha)$. Thus the problem can be further rewritten as

$$\max_{\alpha,\beta,a\neq 0} \kappa(\alpha) \quad \text{s.t.} \quad 1 \ge \kappa(\alpha)\sqrt{a^T\Sigma_x a} + \kappa(\beta)\sqrt{a^T\Sigma_y a}, \quad (9)$$
$$a^T(\bar{x} - \bar{y}) = 1, \quad (10)$$
$$\kappa(\beta) \ge \kappa(\gamma), \quad (11)$$

where (11) is equivalent to (6) due to the monotonicity of the $\kappa$ function.

Lemma 2: The maximum value of $\kappa(\alpha)$ under the constraints (9), (10), (11) is achieved when the right-hand side of (9) holds with equality.

Proof: Assume the maximum is achieved when $1 > \kappa(\beta)\sqrt{a^T\Sigma_y a} + \kappa(\alpha)\sqrt{a^T\Sigma_x a}$. A new solution constructed by increasing $\kappa(\alpha)$ by a small positive amount while keeping $\kappa(\beta)$ and $a$ unchanged still satisfies the constraints and is a better solution. ∎

Moreover, $\Sigma_x$ and $\Sigma_y$ can be considered positive definite matrices¹. Therefore, according to Lemma 2 we obtain

$$\kappa(\alpha) = \frac{1 - \kappa(\beta)\sqrt{a^T\Sigma_y a}}{\sqrt{a^T\Sigma_x a}}.$$

This objective is a linear function of $\kappa(\beta)$, and $\sqrt{a^T\Sigma_y a}$ is a positive term; therefore, the objective is maximized when $\kappa(\beta)$ is set to its lower bound $\kappa(\gamma)$. Thus, the BMPM optimization problem can be rewritten as the so-called Fractional Programming (FP) problem [10]:

$$\max_{a\neq 0} \frac{f(a)}{g(a)} \quad \text{s.t.} \quad a \in A = \{a \mid a^T(\bar{x} - \bar{y}) = 1\}, \quad (12)$$

where $f(a) = 1 - \kappa(\gamma)\sqrt{a^T\Sigma_y a}$ and $g(a) = \sqrt{a^T\Sigma_x a}$. In the following, we present Lemma 3 to show that this FP problem is solvable.

Lemma 3: The Fractional Programming problem (12) is a strictly quasiconcave problem and is thus solvable.

Proof: It is easy to see that the domain $A$ is a convex set in $R^n$, and $f(a)$ and $g(a)$ are differentiable on $A$. Moreover, since $\Sigma_x$ and $\Sigma_y$ can both be considered positive definite, $f(a)$ is a concave function and $g(a)$ is a convex function on $A$. Then $f(a)/g(a)$ is a concave-convex FP, or pseudoconcave, problem. Hence it is strictly quasiconcave on $A$ according to [10], and every local maximum is therefore a global maximum [10]. In other words, this Fractional Programming problem is solvable. ∎

Many methods can be used to solve this problem. For example, a conjugate gradient method can solve it in $n$ (the dimension of the data points) steps if the initial point is suitably assigned [1]. In each step, the computational cost of calculating the conjugate gradient is $O(n^2)$, so this method has a worst-case $O(n^3)$ time complexity. Adding the cost of estimating $\bar{x}, \bar{y}, \Sigma_x, \Sigma_y$, the total cost is $O(n^3 + Nn^2)$, where $N$ is the number of data points. This computational cost is of the same order as that of the Minimax Probability Machine [4] and the linear support vector machine [11].

¹In practice, we can always add a small positive amount to the diagonal elements of these two matrices to make them positive definite.

In this paper, we use the Rosen gradient projection method [1] to find the solution of this concave-convex FP problem; the method is proved to converge to a local maximum with a worst-case linear convergence rate. Moreover, the local maximum is exactly the global maximum for this problem.

From Lemma 2, we can see that the inequalities in (7) become equalities at the maximum point. The optimal $b$, denoted by $b^*$, is thus obtained as

$$b^* = a^{*T}\bar{y} + \kappa(\beta^*)\sqrt{a^{*T}\Sigma_y a^*} = a^{*T}\bar{x} - \kappa(\alpha^*)\sqrt{a^{*T}\Sigma_x a^*}. \quad (13)$$

3. Kernelization

In this section, we first use the kernel trick to find a linear classifier in the feature space $R^f$ by mapping the $n$-dimensional data points into a high-dimensional feature space, where the linear classifier in $R^f$ corresponds to a nonlinear hyperplane in the original space. Next, we propose a feasible algorithm to solve the kernelized optimization problem.

Let $\{x_i\}_{i=1}^{N_x}$ and $\{y_j\}_{j=1}^{N_y}$ denote the training data for class $\mathbf{x}$ and class $\mathbf{y}$ respectively, mapped as $\mathbf{x} \to \varphi(\mathbf{x}) \sim (\overline{\varphi(\mathbf{x})}, \Sigma_{\varphi(\mathbf{x})})$ and $\mathbf{y} \to \varphi(\mathbf{y}) \sim (\overline{\varphi(\mathbf{y})}, \Sigma_{\varphi(\mathbf{y})})$, where $\varphi: R^n \to R^f$ is a mapping function². The corresponding linear classifier in $R^f$ is $a^T\varphi(\mathbf{z}) = b$, where $a, \varphi(\mathbf{z}) \in R^f$ and $b \in R$. Similarly, the transformed FP optimization in BMPM can be written as:

$$\max_{a\neq 0} \frac{1 - \kappa(\gamma)\sqrt{a^T\Sigma_{\varphi(\mathbf{y})} a}}{\sqrt{a^T\Sigma_{\varphi(\mathbf{x})} a}} \quad \text{s.t.} \quad a^T\big(\overline{\varphi(\mathbf{x})} - \overline{\varphi(\mathbf{y})}\big) = 1. \quad (14)$$

To make the kernel work, we need to represent the final decision hyperplane and the optimization in terms of a kernel, $K(\mathbf{z}_1, \mathbf{z}_2) = \varphi(\mathbf{z}_1)^T\varphi(\mathbf{z}_2)$, namely an inner-product form of the mapped data points. We reformulate the optimization and the decision hyperplane in kernel form as follows. Let $a = a_p + a_v$, where $a_p$ is the projection of $a$ onto the space spanned by all the training data, i.e., $\{\varphi(x_i)\}_{i=1}^{N_x}$ and $\{\varphi(y_j)\}_{j=1}^{N_y}$, and $a_v$ is the component of $a$ orthogonal to this span. The component $a_v$ vanishes in the optimization (14), since $a_v^T\varphi(x_i) = 0$ and $a_v^T\varphi(y_j) = 0$. This implies that the optimal $a$ lies in the span of the training data and can thus be written as a linear combination of the training data:

$$a = \sum_{i=1}^{N_x} \mu_i \varphi(x_i) + \sum_{j=1}^{N_y} \upsilon_j \varphi(y_j), \quad (15)$$

where the coefficients $\mu_i, \upsilon_j \in R$, $i = 1, \ldots, N_x$, $j = 1, \ldots, N_y$.

where the coefficientsµi, υj ∈ R, i = 1, . . . , Nx and j = 1, . . . , Ny.Substituting (15) and the following four plug-in estimated parametersϕ(x) = 1

Nx

∑Nx

i=1 ϕ(xi), ϕ(y) =1

Ny

∑Ny

j=1 ϕ(yj), Σϕ(x) = 1Nx

∑Nx

i=1(ϕ(xi)−ϕ(x))(ϕ(xi)−ϕ(x))T , Σϕ(y) = 1Ny

∑Ny

j=1(ϕ(yj)−ϕ(y))(ϕ(yj)−ϕ(x))T into the optimization problem (14), we can obtain a kernelized version:

maxw 6=0

1− κ(γ)√

1Ny

wT KTyKyw

√1

NxwT KT

xKxws.t. wT (kx − ky) = 1 . (16)

In (16), $w = [\mu_1, \ldots, \mu_{N_x}, \upsilon_1, \ldots, \upsilon_{N_y}]^T$, and $\tilde{k}_x, \tilde{k}_y \in R^{N_x+N_y}$ with

$$[\tilde{k}_x]_i = \frac{1}{N_x}\sum_{j=1}^{N_x} K(x_j, \mathbf{z}_i), \qquad [\tilde{k}_y]_i = \frac{1}{N_y}\sum_{j=1}^{N_y} K(y_j, \mathbf{z}_i),$$

²The notation presented in this section largely follows that of [5].


where $\mathbf{z}_i = x_i$ for $i = 1, \ldots, N_x$ and $\mathbf{z}_i = y_{i-N_x}$ for $i = N_x+1, \ldots, N_x+N_y$. The matrix $\tilde{K}$ is given by

$$\tilde{K} = \begin{pmatrix} \tilde{K}_x \\ \tilde{K}_y \end{pmatrix} = \begin{pmatrix} K_x - 1_{N_x}\tilde{k}_x^T \\ K_y - 1_{N_y}\tilde{k}_y^T \end{pmatrix},$$

where $1_{N_x}$ is an $N_x$-dimensional column vector of ones and $1_{N_y}$ is defined similarly; $N_x$ and $N_y$ are the numbers of data points in class $\mathbf{x}$ and class $\mathbf{y}$ respectively. $K_x$ and $K_y$ are the matrices formed by the first $N_x$ rows and the last $N_y$ rows of the Gram matrix $K$, which is defined by $K_{ij} = \varphi(\mathbf{z}_i)^T\varphi(\mathbf{z}_j)$. Similarly, the optimal $b$ in the kernelized version, denoted by $b^*$, can be obtained as

$$b^* = w^{*T}\tilde{k}_y + \kappa(\beta^*)\sqrt{\tfrac{1}{N_y}\, w^{*T}\tilde{K}_y^T\tilde{K}_y w^*} = w^{*T}\tilde{k}_x - \kappa(\alpha^*)\sqrt{\tfrac{1}{N_x}\, w^{*T}\tilde{K}_x^T\tilde{K}_x w^*},$$

where $w^*$, $\alpha^*$ and $\beta^*$ are the optimal values given by the above optimization procedure. The kernelized decision hyperplane can be written as

$$f(\mathbf{z}) = \sum_{i=1}^{N_x} w_i^* K(\mathbf{z}, x_i) + \sum_{i=1}^{N_y} w_{N_x+i}^* K(\mathbf{z}, y_i) - b^*.$$
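Evaluating this kernelized decision function can be sketched as follows (assumptions: a Gaussian kernel is one common choice, the section fixes no particular kernel, and the helper names are ours):

```python
import numpy as np

def rbf_kernel(z1, z2, sigma=1.0):
    """Gaussian kernel K(z1, z2) = exp(-||z1 - z2||^2 / (2 sigma^2))."""
    d = np.asarray(z1) - np.asarray(z2)
    return np.exp(-(d @ d) / (2.0 * sigma * sigma))

def decision_value(z, w, X, Y, b_star, kernel=rbf_kernel):
    """Kernelized decision function
    f(z) = sum_i w_i K(z, x_i) + sum_j w_{Nx+j} K(z, y_j) - b*;
    classify z as class x when f(z) > 0, otherwise as class y."""
    Nx = len(X)
    s = sum(w[i] * kernel(z, X[i]) for i in range(Nx))
    s += sum(w[Nx + j] * kernel(z, Y[j]) for j in range(len(Y)))
    return s - b_star
```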

After kernelization, the dimension of the covariance matrices equals the number of data points, so the Rosen gradient method is no longer suitable for this large-scale optimization problem. We instead adopt the parametric method [10] to solve the kernelized Fractional Programming problem. We still present the algorithm in terms of the unkernelized version, since (16) has a form similar to the unkernelized (12). According to the parametric method, the fractional objective $f(a)/g(a)$ can be iteratively optimized in two steps:

Step 1: Find $a$ by maximizing $f(a) - \lambda g(a)$ over the domain $A$, where $\lambda \in R$ is a newly introduced parameter.

Step 2: Update $\lambda$ as $\lambda = f(a)/g(a)$.

According to [10], $\lambda$, and hence the solution of the FP problem, is guaranteed to converge to the maximum through a series of the above iterations.
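The two-step parametric scheme can be sketched generically (a Dinkelbach-style iteration; the one-dimensional example below is illustrative only, chosen so that the Step-1 argmax is available in closed form):

```python
def dinkelbach_maximize(argmax_step, f, g, x0, tol=1e-10, max_iter=100):
    """Parametric method for max f(x)/g(x) with g > 0 on the domain:
    Step 1: x <- argmax_x { f(x) - lam * g(x) }   (via argmax_step(lam))
    Step 2: lam <- f(x) / g(x); repeat until lam stabilizes."""
    x, lam = x0, f(x0) / g(x0)
    for _ in range(max_iter):
        x = argmax_step(lam)       # Step 1
        new_lam = f(x) / g(x)      # Step 2
        if abs(new_lam - lam) < tol:
            lam = new_lam
            break
        lam = new_lam
    return x, lam

# Illustrative example: maximize (1 - x^2) / (x + 2); f is concave and
# g is positive and affine near the optimum, and Step 1 has the closed
# form argmax_x { 1 - x^2 - lam * (x + 2) } = -lam / 2.
f = lambda x: 1.0 - x * x
g = lambda x: x + 2.0
x_star, val = dinkelbach_maximize(lambda lam: -lam / 2.0, f, g, 0.0)
# x_star converges to sqrt(3) - 2, the true maximizer of the ratio.
```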

In the following, we adopt a method to solve the maximization problem in Step1. Replacingf(a) andg(a), we expand the optimization problem as:

maxa6=0

1− κ(γ)√

aT Σya− λ√

aT Σxa s.t. aT (x− y) = 1 . (17)

Equation (17) is equivalent to \min_a \, \kappa(\gamma) \sqrt{a^T \Sigma_y a} + \lambda \sqrt{a^T \Sigma_x a} under the same constraint. By writing a = a_0 + F u, where a_0 = (x - y)/\|x - y\|_2^2 and F ∈ R^{n×(n−1)} is an orthogonal matrix whose columns span the subspace of vectors orthogonal to x − y, an equivalent form that removes the constraint (with a factor 1/2 over each term dropped) can be obtained:

\min_{u, \eta > 0, \xi > 0} \; \eta + \frac{\lambda^2}{\eta} \| \Sigma_x^{1/2} (a_0 + F u) \|_2^2 + \xi + \frac{\kappa(\gamma)^2}{\xi} \| \Sigma_y^{1/2} (a_0 + F u) \|_2^2 .  (18)

This optimization form is very similar to the one in the Minimax Probability Machine [4] and can also be solved by an iterative least-squares approach [1], [4].
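A minimal alternating-minimization sketch of this approach (illustrative, not the authors' implementation): for fixed u the optimal η and ξ have closed forms (min over η > 0 of η + c²/η is 2c, attained at η = c), and for fixed η and ξ the u-subproblem is an ordinary linear least-squares problem.

```python
import numpy as np

def solve_18(Sx_half, Sy_half, a0, F, lam, kappa, iters=50):
    """Alternating minimization for problem (18). Returns the final
    a = a0 + F u and the value lam*||Sx_half a|| + kappa*||Sy_half a||."""
    u = np.zeros(F.shape[1])
    for _ in range(iters):
        a = a0 + F @ u
        # closed-form eta, xi for fixed u (tiny constant guards division by 0)
        eta = lam * np.linalg.norm(Sx_half @ a) + 1e-12
        xi = kappa * np.linalg.norm(Sy_half @ a) + 1e-12
        # least-squares step in u for fixed eta, xi: min_u ||A (a0 + F u)||^2
        A = np.vstack([(lam / np.sqrt(eta)) * Sx_half,
                       (kappa / np.sqrt(xi)) * Sy_half])
        u, *_ = np.linalg.lstsq(A @ F, -A @ a0, rcond=None)
    a = a0 + F @ u
    return a, lam * np.linalg.norm(Sx_half @ a) + kappa * np.linalg.norm(Sy_half @ a)
```

With Σx = Σy = I and x − y aligned with the first coordinate axis, the minimizer is a = a0 and the returned value is λ + κ.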

4. Experiments

In this section, we first illustrate our model on a synthetic dataset. We then apply it to two real-world medical diagnosis datasets: the breast-cancer dataset and the heart disease dataset.


4.1. A Synthetic Dataset

A two-variable synthetic dataset is generated from two-dimensional gamma distributions. The two classes of data are generated under the same gamma distribution, with shape and scale parameters Γ(5, 4) for the first dimension and Γ(6, 3) for the second dimension. To illustrate the algorithm clearly, we transform the data by displacement and rotation to distinguish the two classes, as illustrated in Fig. 1. Class x (the more important class) is represented by filled squares (training points) and o's (test points). The other class y (the less important class) is represented by black +'s (training points) and blue ×'s (test points). The acceptance level is set to 90%. It is clearly observed that the solid line/curve (BMPM with the linear/Gaussian kernel) is pushed away from the biased class x when compared with the corresponding dashed line/curve. This is consistent with the lower bounds in Table I: the lower bounds for class x in BMPM are higher than those in MPM. In addition, the test-set accuracies for class x, TSAx, are significantly higher in BMPM than in MPM for both the linear and the Gaussian kernel settings. On the other hand, the test-set accuracy for the less important class y, TSAy, remains at an acceptable level, i.e., 91.1% and 93.3% for the linear and Gaussian kernels respectively, by setting the lower bound to 90.0%. Moreover, the worst-case accuracies given by αx, αy, or α are all smaller than the real test-set accuracies. This clearly demonstrates how the worst-case probability can serve as a quantitative indicator of the classification accuracy on future data points. From Table I, we also observe that the overall test-set accuracies, TSA, of BMPM are not necessarily lower than those of MPM. An interesting interpretation can be found in [3].

TABLE I. LOWER BOUND α AND TEST-SET ACCURACY WITH BMPM AND MPM ON THE SYNTHETIC DATASET.

              |               BMPM               |          MPM
Kernel        | αx     αy     TSAx   TSAy  TSA   | α     TSAx  TSAy  TSA
--------------+----------------------------------+------------------------
Linear (%)    | 94.9↑  90.0   97.8↑  91.1  94.4  | 92.7  93.3  95.6  94.4
Gaussian (%)  | 96.9↑  90.0   97.8↑  93.3  95.6  | 93.1  93.3  95.6  94.4

4.2. Medical Datasets

The breast-cancer dataset and the heart disease dataset are obtained from the UCI machine learning repository [2]. The breast-cancer dataset contains 458 instances of the benign class and 241 instances of the malignant class; each instance is described by 9 attributes. The heart disease dataset includes 120 instances with heart disease and 150 instances without heart disease; each instance is described by 13 attributes. Since handling missing attribute values is out of the scope of this paper, we remove the instances with missing attribute values in both datasets. In this experiment, the biased class is the malignant class for the breast-cancer dataset and the heart disease class for the heart disease dataset, since misclassifying a patient with a disease into the opposite class may delay therapy and lead to aggravation of the disease. Here, we denote x and y as the biased class and the less important class, respectively.

We evaluate the BMPM algorithm and the MPM algorithm on both datasets. We perform 5-fold cross validation (CV-5) in both the linear and Gaussian kernel settings for both datasets. The kernel parameter σ for the Gaussian kernel e^{−‖x−y‖²/σ} is obtained via cross validation. For the BMPM algorithm, we set the lower-bound accuracy for classifying the less important class to the "pass level" of 50.0%³ and try to maximize the accuracy of classifying the biased class. The results are shown in Table II and Table III.

³We consider the pass level of 50% an acceptable level for most people; the acceptable level can be controlled by practitioners according to their specific requirements. We note that the lower bound needs to be set suitably: if it is set too high, the maximum value of α may be smaller than this lower bound, or even a zero solution will be obtained.
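The cross-validation selection of σ can be sketched as a grid search over candidate widths; `fit_eval` is a placeholder for training a classifier on one fold and returning its test accuracy, and all names are hypothetical:

```python
import numpy as np

def cv5_select_sigma(X, y, fit_eval, sigmas, seed=0):
    """5-fold cross validation (CV-5) over candidate Gaussian-kernel widths.
    fit_eval(Xtr, ytr, Xte, yte, sigma) must return one fold's test accuracy."""
    idx = np.arange(len(X))
    np.random.default_rng(seed).shuffle(idx)
    folds = np.array_split(idx, 5)
    best_sigma, best_acc = None, -1.0
    for sigma in sigmas:
        accs = []
        for k in range(5):
            tr = np.concatenate([folds[j] for j in range(5) if j != k])
            te = folds[k]
            accs.append(fit_eval(X[tr], y[tr], X[te], y[te], sigma))
        if np.mean(accs) > best_acc:           # keep the best mean accuracy
            best_sigma, best_acc = sigma, float(np.mean(accs))
    return best_sigma, best_acc
```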


[Fig. 1 appears here: a scatter plot of the two classes, with horizontal axis z1 and vertical axis z2.]

Fig. 1. An example to illustrate the Biased Minimax Probability Machine. The solid red line is the decision hyperplane of the linear Biased Minimax Probability Machine, while the dashed green line is the decision hyperplane of the linear Minimax Probability Machine. The solid black curve is the decision hyperplane of the Gaussian kernel Biased Minimax Probability Machine, while the dashed blue curve is the decision hyperplane of the Gaussian kernel Minimax Probability Machine. Training points are indicated with magenta filled squares for class x and black +'s for class y. Test points are indicated with green o's for class x and blue ×'s for class y. The parameter σ for the Gaussian kernel is found by cross validation. The solid red line and the solid black curve are pushed away from the biased class x, with quantitative accuracy indicators αx = 94.9% and αx = 96.9% respectively in the Biased Minimax Probability Machine.

TABLE II. LOWER BOUND α AND TEST-SET ACCURACY WITH BMPM AND MPM ON THE BREAST-CANCER DATASET.

              |                          BMPM                          |                   MPM
Kernel        | αx          αy         TSAx        TSAy       TSA      | α          TSAx       TSAy       TSA
--------------+--------------------------------------------------------+--------------------------------------------
Linear (%)    | 90.0±0.3↑   50.0±0.0   99.9±0.1↑   92.0±0.2   94.9±0.2 | 84.2±0.3   96.9±0.4   97.1±0.5   96.9±0.3
Gaussian (%)  | 97.6±0.3↑   50.0±0.0   100.0±0.0↑  88.9±0.2   92.8±0.2 | 90.1±0.3   96.6±0.2   97.1±0.3   96.8±0.2

TABLE III. LOWER BOUND α AND TEST-SET ACCURACY WITH BMPM AND MPM ON THE HEART DISEASE DATASET.

              |                          BMPM                          |                   MPM
Kernel        | αx          αy         TSAx        TSAy       TSA      | α          TSAx       TSAy       TSA
--------------+--------------------------------------------------------+--------------------------------------------
Linear (%)    | 58.6±0.2↑   50.0±0.0   82.4±0.3↑   82.8±0.2   82.2±0.1 | 56.1±0.3   81.8±0.3   83.7±0.4   82.5±0.3
Gaussian (%)  | 61.1±0.2↑   50.0±0.0   83.3±0.5↑   85.7±0.4   84.8±0.3 | 58.4±0.4   81.1±0.4   86.6±0.3   85.2±0.4

From Table II and Table III, we can see that the accuracies of BMPM for the biased class increase significantly compared with those of MPM in both the linear and Gaussian kernel settings, which indicates that the corresponding decision boundaries are biased towards the biased class. Meanwhile, we observe that the accuracies of BMPM for the less important class remain at an acceptable level by setting the lower bound. We also note that the worst-case bounds are all smaller than the real test-set accuracies. This shows again that the worst-case probability can serve as a quantitative indicator of the medical diagnosis for future cases. Comparing the results of the linear kernel with those of the Gaussian kernel, we also find that both the worst-case bound and the test accuracy for the biased class are greater in the Gaussian kernel setting than in the linear kernel setting. This demonstrates the advantage of the Gaussian kernel setting.

5. Conclusion

The Minimax Probability Machine, a recently proposed classifier, provides a worst-case bound on the probability of misclassification of future data points and achieves performance comparable to a state-of-the-art classifier, the Support Vector Machine. In this paper, by eliminating the assumption of the


unbiased weight for each class in the Minimax Probability Machine, we develop a critical tool, the Biased Minimax Probability Machine, which is the first quantitative method to control how the decision hyperplane moves in favor of the classification of the more important class, in order to deal with biased classification tasks, especially in medical diagnostic applications. The model is transformed into a concave-convex Fractional Programming problem, or equivalently a pseudoconcave problem. After illustrating our model on a synthetic dataset and applying it to two real-world medical diagnosis datasets, we obtain encouraging and promising experimental results.

Some important issues remain as future work. First, our model relies heavily on good estimates of the means and covariance matrices; how to estimate them accurately and robustly is one important issue. Second, are there more efficient methods to solve the Fractional Programming optimization problem? Can decomposition techniques be applied to the Gram matrix to speed up the least-squares training? Finally, the discussion above is restricted to two-category classification tasks; extending the scheme to multi-category tasks is also one of our research topics in the near future.

Acknowledgements

We thank Gert R. G. Lanckriet at U.C. Berkeley for providing the Matlab source code of the Minimax Probability Machine on the web. We also want to thank Michael I. Jordan at U.C. Berkeley for his encouragement and comments.

References

[1] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, Massachusetts, 2nd edition, 1999.

[2] C. L. Blake and C. J. Merz. Repository of machine learning databases, University of California, Irvine, http://www.ics.uci.edu/∼mlearn/mlrepository.html, 1998.

[3] K. Huang, H. Yang, I. King, M. R. Lyu, and L. W. Chan. Minimum error minimax probability machine. Submitted to Journal of Machine Learning Research, 2003.

[4] G. R. G. Lanckriet, L. E. Ghaoui, C. Bhattacharyya, and M. I. Jordan. Minimax probability machine. In Advances in Neural Information Processing Systems (NIPS), 2001.

[5] G. R. G. Lanckriet, L. E. Ghaoui, C. Bhattacharyya, and M. I. Jordan. A robust minimax approach to classification. Journal of Machine Learning Research, 3:555–582, 2002.

[6] M. Lobo, L. Vandenberghe, S. Boyd, and H. Lebret. Applications of second-order cone programming. Linear Algebra and its Applications, 284:193–228, 1998.

[7] Y. Nesterov and A. Nemirovsky. Interior Point Polynomial Methods in Convex Programming: Theory and Applications. SIAM, Philadelphia, PA, 1994.

[8] E. Osuna, R. Freund, and F. Girosi. Support Vector Machines: Training and Applications. Technical Report AIM-1602, MIT, 1997.

[9] I. Popescu and D. Bertsimas. Optimal inequalities in probability theory: A convex optimization approach. Technical Report TM62, INSEAD, 2001.

[10] S. Schaible. Fractional programming. In R. Horst and P. M. Pardalos, editors, Handbook of Global Optimization, Nonconvex Optimization and Its Applications, pages 495–608. Kluwer Academic Publishers, Dordrecht-Boston-London, 1995.

[11] B. Scholkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

[12] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 2nd edition, 1999.


Combining cardinal direction relations and other
orientation relations in QSR∗†

Amar Isli
Fachbereich Informatik, Universität Hamburg,
Vogt-Kölln-Strasse 30, D-22527 Hamburg
[email protected]

Abstract

We propose a calculus, cCOA, combining, and thus more expressive than each of, two orientation calculi well known in QSR: Frank's projection-based cardinal direction calculus, CDA, and a coarser version, ROA, of Freksa's relative orientation calculus. An original constraint propagation procedure, PcS4c+(), for cCOA-CSPs is presented, which aims at (1) achieving path consistency (Pc) for the CDA projection; (2) achieving strong 4-consistency (S4c) for the ROA projection; and (3) more (+): the "+" consists of the implementation of the interaction between the two combined calculi. Dealing with the first two points is not new, and involves mainly the CDA composition table and the ROA composition table, which can be found in, or derived from, the literature. The originality of the propagation algorithm comes from the last point. Two tables, one for each of the two directions CDA-to-ROA and ROA-to-CDA, capturing the interaction between the two kinds of knowledge, are defined and used by the algorithm. The importance of taking the interaction into account is shown with a real example providing an inconsistent knowledge base, whose inconsistency (a) cannot be detected by reasoning separately about each of the two components of the knowledge, just because, taken separately, each is consistent, but (b) is detected by the proposed algorithm, thanks to the interaction knowledge propagated from each of the two components to the other.

Key words: Qualitative spatial reasoning, Cardinal directions, Relative orientation, Constraint satisfaction, Path consistency, Strong 4-consistency.

∗Qualitative Spatial Reasoning.
†This work was supported partly by the EU project "Cognitive Vision systems" (CogVis), under grant CogVis IST 2000-29375.



Figure 1: A model for the ROA component (left), and a model for the CDA component (right), of the knowledge in Example 1.

1 Introduction

Two important and widely known calculi for the representation and processing of qualitative orientation are the calculus of cardinal directions, CDA, developed by Frank [5], and the relative orientation calculus developed by Freksa [6]. The former uses a global, south-north/west-east reference frame, and represents knowledge as binary relations on (pairs of) 2D points. The latter allows for the representation of relative knowledge as ternary relations on (triples of) 2D points. Both kinds of knowledge are of particular importance, especially in GIS (Geographic Information Systems) and in robot navigation.

The aim of this work is to look at the importance of combining the two orientation calculi mentioned above. Considered separately, Frank's calculus [5] represents knowledge such as "Hamburg is north-west of Berlin", whereas Freksa's relative orientation calculus [6] represents knowledge such as "You see the main train station on your left when you walk down to the cinema from the university". We propose a calculus, cCOA, combining CDA and a coarser version, ROA, of Freksa's calculus. cCOA allows for more expressiveness than each of the combined calculi, and represents, within the same base, knowledge such as the one in the following example.

Example 1 Consider the following knowledge on four cities, Berlin, Hamburg, London and Paris: (1) viewed from Hamburg, Berlin is to the left of Paris, Paris is to the left of London, and Berlin is to the left of London; (2) viewed from London, Berlin is to the left of Paris; (3) Hamburg is to the north of Paris, and north-west of Berlin; and (4) Paris is to the south of London. The first two sentences express the ROA component of the knowledge (relative orientation relations on triples of the four cities), whereas the other two express the CDA component of the knowledge (cardinal direction relations on pairs of the four cities).¹ Considered separately, each of the two components is consistent, in the sense that one can find an assignment

¹Two cardinal direction calculi, to be explained later, are known from Frank's work [5]; we assume in this example the one in Figure 2 (right).


of physical locations to the cities that satisfies all the constraints of the component (see the illustration in Figure 1). However, considered globally, the knowledge is clearly inconsistent: the physical locations assigned to Hamburg, London and Paris form a triangle in any model of the ROA component, whereas they are collinear in any model of the CDA component.

Example 1 clearly shows that reasoning about combined knowledge consisting of an ROA component and a CDA component, e.g., checking its consistency, does not reduce to a matter of reasoning about each component separately: reasoning separately about each component in the case of Example 1 shows two components that are both consistent, whereas the conjunction of the knowledge in the two components is inconsistent. As a consequence, the interaction between the two kinds of knowledge has to be handled. With this in mind, a constraint propagation procedure, PcS4c+(), for cCOA-CSPs is proposed, which aims at: (1) achieving path consistency (Pc) for the CDA projection; (2) achieving strong 4-consistency (S4c) for the ROA projection; and (3) more (+). The procedure does more than just achieving path consistency for the CDA projection and strong 4-consistency for the ROA projection: it implements as well the interaction between the two combined calculi. The procedure is, to the best of our knowledge, original.

In the remainder of the paper, we first give a brief description of the propagation algorithm we propose, including its part dealing with the interaction knowledge, consisting of ROA knowledge inferred from CDA knowledge and, conversely, of CDA knowledge inferred from ROA knowledge. We then reconsider our illustrating example to show that, thanks to the interaction knowledge, more inconsistencies are detected than one would get from just applying path consistency to the CDA component and 4-consistency to the ROA component. We finish with a discussion relating the work to current research on spatio-temporalising the well-known ALC(D) family of description logics (DLs) with a concrete domain [2]: the discussion shows that if two (spatial) ontologies operate on the same universe of objects (in this work, the universe of 2D points), while using different languages for their knowledge representation, then integrating the two ontologies needs an inference mechanism for the interaction of the two languages, so that, given knowledge expressed in the integrating ontology, consisting of two components (one for each of the integrated ontologies), each of the two components can infer knowledge from the other.

2 Frank's and Freksa's orientation calculi and their integration

2.1 Frank’s calculus.

Frank's models of cardinal directions in 2D [5] are illustrated in Figure 2. They use a partition of the plane into regions determined by lines passing through a reference object, say S. Depending on the region a point P belongs to, we have



Figure 2: Frank's cone-shaped (left) and projection-based (right) models of cardinal directions.


Figure 3: The partition of the universe of 2D positions on which Freksa's relative orientation calculus is based.

No(P, S), NE(P, S), Ea(P, S), SE(P, S), So(P, S), SW(P, S), We(P, S), NW(P, S), or Eq(P, S), corresponding, respectively, to the position of P relative to S being north, north-east, east, south-east, south, south-west, west, north-west, or equal. Each of the two models can thus be seen as a binary Relation Algebra (RA) with nine atoms. Both use a global, west-east/south-north reference frame. We focus our attention on the projection-based model (Figure 2, right), which has been assessed as being cognitively more adequate [5] (cognitive adequacy of spatial orientation models is discussed in [6]).
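Under the projection-based model, the atom holding between two points follows from comparing the two coordinates independently. A small sketch (atom names as above; the function name is illustrative):

```python
def cda_atom(P, S):
    """Projection-based cardinal direction (Figure 2, right) of point P
    relative to reference point S; points are (x, y) pairs."""
    sign = lambda d: (d > 0) - (d < 0)
    ew = {-1: 'W', 0: '', 1: 'E'}[sign(P[0] - S[0])]
    ns = {-1: 'S', 0: '', 1: 'N'}[sign(P[1] - S[1])]
    # single-letter results map to the two-letter atom names used in the text
    return {'': 'Eq', 'N': 'No', 'S': 'So', 'E': 'Ea', 'W': 'We'}.get(ns + ew, ns + ew)
```

For instance, `cda_atom((-1, 1), (0, 0))` returns NW, matching "Hamburg is north-west of Berlin" when Hamburg lies above and to the left of Berlin.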

2.2 Freksa’s calculus.

A well-known model of relative orientation of 2D points is the calculus defined by Freksa [6]. The calculus corresponds to a specific partition of the plane into 15 regions, determined by a parent object, say A, and a reference object, say B (Figure 3(d)). The partition is based on the following: (1) the left/straight/right partition of the plane determined by an observer placed at the parent object and looking in the direction of the reference object (Figure 3(a)); (2) the front/neutral/back partition of the plane determined by the same observer (Figure 3(b)); and (3) the similar front/neutral/back partition of the plane obtained when we swap the roles of the parent object and the reference object (Figure 3(c)). Combining the three partitions (a), (b) and (c) of Figure 3 leads to the partition of the universe of 2D positions on which the calculus in [6] is based (Figure 3(d)).



Figure 4: The partition of the universe of 2D positions on which the ROA calculus is based.

2.3 A new relative orientation calculus.

It is known that, computationally, Freksa's relative orientation calculus, even when restricted to its 15 atoms, behaves badly [15]. We therefore consider a coarser version of it, obtained from the original one by ignoring, in the construction of the partition of the plane determined by a parent object and a reference object (Figure 3(d)), the two front/neutral/back partitions (Figure 3(b-c)). In other words, we consider only the left/straight/right partition (Figure 3(a)); we also keep the 5-element partitioning of the line joining the parent object to the reference object. The final situation is depicted in Figure 4, where A and B are the parent object and the reference object, respectively. Figure 4(b-c) depicts the general case, corresponding to A and B being distinct from each other: this general-case partition leads to 7 regions (Figure 4(c)), numbered from 2 to 8, corresponding to 7 of the nine atoms of the calculus, which we refer to as lr (to the left of the reference object), bp (behind the parent object), cp (coincides with the parent object), bw (between A and B), cr (coincides with the reference object), br (behind the reference object), and rr (to the right of the reference object). Figure 4(a) illustrates the degenerate case, corresponding to equality of A and B. The two regions, corresponding, respectively, to the primary object coinciding with A and B, and to the primary object distinct from A and B, are numbered 0 and 1. The corresponding atoms of the calculus will be referred to as de (degenerate equal) and dd (degenerate distinct).
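The nine ROA atoms can be computed from coordinates with a sign test on a cross product plus, in the collinear case, the position along the segment. A sketch (illustrative names; the argument order (parent A, reference B, primary P) mirrors the triples used later, and "left" is taken as the counterclockwise side of the directed line from A to B):

```python
def roa_atom(A, B, P):
    """ROA atom of primary object P w.r.t. parent A and reference B
    (Figure 4); integer coordinates keep the comparisons exact."""
    if A == B:
        return 'de' if P == A else 'dd'
    cross = (B[0] - A[0]) * (P[1] - A[1]) - (B[1] - A[1]) * (P[0] - A[0])
    if cross > 0:
        return 'lr'
    if cross < 0:
        return 'rr'
    # collinear: the position along the segment decides among bp, cp, bw, cr, br
    t = ((P[0] - A[0]) * (B[0] - A[0]) + (P[1] - A[1]) * (B[1] - A[1])) / \
        float((B[0] - A[0]) ** 2 + (B[1] - A[1]) ** 2)
    if t < 0: return 'bp'
    if t == 0: return 'cp'
    if t < 1: return 'bw'
    if t == 1: return 'cr'
    return 'br'
```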

From now on, we refer to the cardinal directions calculus as CDA (Cardinal Directions Algebra), and to the coarser version of Freksa's relative orientation calculus as ROA (Relative Orientation Algebra). A CDA (resp. ROA) relation is any subset of the set of all CDA (resp. ROA) atoms. A CDA (resp. ROA) relation is said to be atomic if it contains one single atom (a singleton set); it is said to be the CDA (resp. ROA) universal relation if it contains all the CDA (resp. ROA) atoms. When no confusion arises, we may omit the brackets in the representation of an atomic relation.


3 CSPs of cardinal direction relations and relative orientation relations on 2D points

We define a cCOA-CSP as a CSP whose constraints consist of a conjunction of CDA relations on pairs of the variables and ROA relations on triples of the variables. The universe of a cCOA-CSP, i.e., the domain of instantiation of its variables, is the continuous set IR² of 2D points.

3.1 Matrix representation of a cCOA-CSP.

A cCOA-CSP P can, in an obvious way, be represented as two constraint matrices: a binary constraint matrix, BP, representing the CDA part of P, i.e., the subconjunction consisting of CDA relations on pairs of the variables; and a ternary constraint matrix, TP, representing the ROA part of P, i.e., the rest of the conjunction, consisting of ROA relations on triples of the variables. We refer to the representation as ⟨BP, TP⟩. The BP entry (BP)ij consists of the CDA relation on the pair (Xi, Xj) of variables. Similarly, the TP entry (TP)ijk consists of the ROA relation on the triple (Xi, Xj, Xk) of variables.

3.2 Reasoning within CDA and the CDA-to-ROA interaction: the tables.

We present the CDA-to-ROA interaction in a knowledge base consisting of a cCOA-CSP. The other direction, i.e., the ROA-to-CDA interaction, can be found in the full paper [9]. The table in Figure 5 presents the augmented CDA composition table; for each pair (r1, r2) of CDA atoms, the table provides: the standard composition r1 ∘ r2 of r1 and r2 [5, 12]; and the most specific ROA relation r1 ⊗ r2 such that, for all 2D points x, y, z, the conjunction r1(x, y) ∧ r2(y, z) logically implies (r1 ⊗ r2)(x, y, z).

The operation ∘ is just the normal composition: it is internal to CDA, in the sense that it takes as input two CDA atoms and outputs a CDA relation. The operation ⊗, however, is not internal to CDA, in the sense that it takes as input two CDA atoms but outputs an ROA relation; ⊗ captures the interaction between CDA knowledge and ROA knowledge, in the direction CDA-to-ROA, by inferring ROA knowledge from given CDA knowledge. As an example of the new operation ⊗, from SE(Berlin, London) ∧ No(London, Paris), saying that Berlin is south-east of London and that London is north of Paris, we infer the ROA relation lr on the triple (Berlin, London, Paris): lr(Berlin, London, Paris), saying that, viewed from Berlin, Paris is to the left of London.

The reader is referred to [5, 12] for the CDA converse table, providing the converse of each CDA atom r.


r1 \ r2 | No        | So        | Ea        | We        | NE             | NW             | SE             | SW
--------+-----------+-----------+-----------+-----------+----------------+----------------+----------------+----------------
No      | No        | [So,No]   | NE        | NW        | NE             | NW             | [SE,NE]        | [SW,NW]
        | br        | bp,cp,bw  | rr        | lr        | rr             | lr             | rr             | lr
So      | [So,No]   | So        | SE        | SW        | [SE,NE]        | [SW,NW]        | SE             | SW
        | bp,cp,bw  | br        | lr        | rr        | lr             | rr             | lr             | rr
Ea      | NE        | SE        | Ea        | [We,Ea]   | NE             | [NW,NE]        | SE             | [SW,SE]
        | lr        | rr        | br        | bp,cp,bw  | lr             | lr             | rr             | rr
We      | NW        | SW        | [We,Ea]   | We        | [NW,NE]        | NW             | [SW,SE]        | SW
        | rr        | lr        | bp,cp,bw  | br        | rr             | rr             | lr             | lr
NE      | NE        | [SE,NE]   | NE        | [NW,NE]   | NE             | [NW,NE]        | [SE,NE]        | ?
        | lr        | rr        | rr        | lr        | lr,br,rr       | lr             | rr             | lr,bp,cp,bw,rr
NW      | NW        | [SW,NW]   | [NW,NE]   | NW        | [NW,NE]        | NW             | ?              | [SW,NW]
        | rr        | lr        | rr        | lr        | rr             | lr,br,rr       | lr,bp,cp,bw,rr | lr
SE      | [SE,NE]   | SE        | SE        | [SW,SE]   | [SE,NE]        | ?              | SE             | [SW,SE]
        | lr        | rr        | lr        | rr        | lr             | lr,bp,cp,bw,rr | lr,br,rr       | rr
SW      | [SW,NW]   | SW        | [SW,SE]   | SW        | ?              | [SW,NW]        | [SW,SE]        | SW
        | rr        | lr        | lr        | rr        | lr,bp,cp,bw,rr | rr             | lr             | lr,br,rr

Figure 5: The augmented composition table of the cardinal directions calculus: for each pair (r1, r2) of CDA atoms, each cell gives (top) the composition r1 ∘ r2 of r1 and r2, and (bottom) the most specific ROA relation r1 ⊗ r2 such that, for all 2D points x, y, z, the conjunction r1(x, y) ∧ r2(y, z) logically implies (r1 ⊗ r2)(x, y, z); the question mark ? represents the CDA universal relation {No, NW, We, SW, So, SE, Ea, NE, Eq}.

3.3 A constraint propagation procedure for cCOA-CSPs.

We propose a constraint propagation procedure, PcS4c+(), for cCOA-CSPs, which aims at:

1. achieving path consistency (Pc) for the CDA projection, using, for instance, the algorithm in [1];

2. achieving strong 4-consistency (S4c) for the ROA projection, using, for instance, the algorithm in [11]; and

3. more (+).

The procedure does more than just achieving path consistency for the CDA projection and strong 4-consistency for the ROA projection. It implements as well the interaction between the two combined calculi; namely:

1. The path consistency operation, (BP)ik ← (BP)ik ∩ (BP)ij ∘ (BP)jk, which, under normal circumstances, operates internally, within the same CSP, is now augmented so that it can send information from the CDA component into the ROA component.

2. The strong 4-consistency operation, (TP)ijk ← (TP)ijk ∩ (TP)ijl ∘ (TP)ilk, which also operates internally under normal circumstances, is augmented so that it can send information from the ROA component into the CDA component.


The reader is referred to the full version of the work for details [9].

Example 2 Consider again the description of Example 1. We can represent the situation as a cCOA-CSP with variables Xb, Xh, Xl and Xp, standing for the cities of Berlin, Hamburg, London and Paris, respectively.

1. The knowledge "viewed from Hamburg, Berlin is to the left of Paris" translates into the ROA constraint lr(Xh, Xp, Xb): (TP)hpb = lr.

2. The other ROA knowledge translates as follows: (TP)hlp = lr, (TP)hlb = lr, (TP)lpb = lr.

3. The CDA part of the knowledge translates as follows: (BP)hp = No, (BP)hb = NW, (BP)pl = So.

As discussed in Example 1, reasoning separately about the two components of the knowledge shows two consistent components, whereas the combined knowledge is clearly inconsistent. Using the procedure PcS4c+(), we can detect the inconsistency in the following way. From the CDA constraints (BP)hp = No and (BP)pl = So, the algorithm infers, using the augmented CDA composition table of Figure 5 (specifically, the CDA-to-ROA interaction operation ⊗), the ROA relation {bp, cp, bw} on the triple (Xh, Xp, Xl). The conjunction of the inferred knowledge {bp, cp, bw}(Xh, Xp, Xl) and the already existing knowledge lr(Xh, Xl, Xp), equivalent to rr(Xh, Xp, Xl), gives the empty relation, indicating the inconsistency of the knowledge.
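The inference just described can be replayed with a toy excerpt of the interaction table; the dictionary encoding and function name are illustrative, and only the Figure 5 entries needed here are included:

```python
# A toy excerpt of the CDA-to-ROA interaction table (Figure 5): entry
# (r1, r2) is the ROA relation implied on (x, y, z) by r1(x, y) and r2(y, z).
OTIMES = {('No', 'So'): {'bp', 'cp', 'bw'},
          ('No', 'No'): {'br'},
          ('So', 'So'): {'br'}}

def refine(cda1, cda2, stored_roa):
    """Intersect the ROA relation inferred from two CDA constraints with the
    relation already stored on the triple; an empty result means inconsistency."""
    return OTIMES[(cda1, cda2)] & stored_roa

# No(Xh, Xp) and So(Xp, Xl) infer {bp, cp, bw}(Xh, Xp, Xl); the stored
# lr(Xh, Xl, Xp) is equivalent to rr(Xh, Xp, Xl), so the intersection is empty.
result = refine('No', 'So', {'rr'})
```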

4 Discussion

Current research shows clearly the importance of developing spatial RAs: specialising an ALC(D)-like Description Logic (DL) [2], so that the roles are temporal immediate-successor (accessibility) relations and the concrete domain is generated by a decidable spatial RA in the style of the well-known Region-Connection Calculus RCC-8 [14], leads to a computationally well-behaved family of languages for spatial change in general, and for motion of spatial scenes in particular:

1. Deciding satisfiability of an ALC(D) concept w.r.t. a cyclic TBox is, in general, undecidable (see, for instance, [13]).

2. In the case of the spatio-temporalisation, however, if we use what are called weakly cyclic TBoxes in [10], then satisfiability of a concept w.r.t. such a TBox is decidable. The axioms of a weakly cyclic TBox capture the properties of modal temporal operators. The reader is referred to [10] for details.

Spatio-temporal theories such as the ones defined in [10] can be seen as single-ontology spatio-temporal theories, in the sense that the concrete domain represents


only one type of spatial knowledge (e.g., RCC-8 relations if the concrete domain is generated by RCC-8). We could extend such theories to handle more than just one concrete domain: for instance, two concrete domains, one generated by CDA, the other by ROA. This would lead to what could be called multi-ontology spatio-temporal theories. The presented work clearly shows that the reasoning issue in such multi-ontology theories does not reduce to reasoning about the projections onto the different concrete domains.

5 Summary

We have presented the combination of two calculi of spatial relations well known in Qualitative Spatial Reasoning (QSR): Frank's projection-based cardinal direction calculus [4, 5] and Freksa's relative orientation calculus [6, 7]. With an example illustrating the importance of such a combination to Geographical Information Systems (GIS), we have shown that reducing the issue of reasoning about knowledge expressed in the combined language to a simple matter of reasoning separately about each of the two components is not sufficient. The interaction between the two kinds of knowledge thus has to be handled: we have provided a constraint propagation algorithm for this purpose, which:

1. achieves path consistency for the cardinal direction component;

2. achieves strong 4-consistency for the relative orientation component; and

3. implements the interaction between the two kinds of knowledge.
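A generic path-consistency loop of the kind item 1 refers to can be sketched as follows. This is an illustration only, not the paper's PcS4c+() procedure (which also handles the ternary relative orientation constraints), and the point-algebra composition table used in the demo is an assumption made for the example:

```python
from itertools import product

def path_consistency(rels, comp):
    # `rels[(i, j)]` is the set of base relations allowed between i and j
    # (defined for every ordered pair of distinct variables);
    # `comp(r, s)` is the composition table.
    variables = {v for pair in rels for v in pair}
    changed = True
    while changed:
        changed = False
        for i, k, j in product(variables, repeat=3):
            if len({i, k, j}) < 3:
                continue
            composed = set()
            for r in rels[(i, k)]:
                for s in rels[(k, j)]:
                    composed |= comp(r, s)
            refined = rels[(i, j)] & composed
            if refined != rels[(i, j)]:
                rels[(i, j)] = refined
                changed = True
    return all(rels[p] for p in rels)   # False iff some relation became empty

# Toy composition table for the point algebra {<, =, >} (an assumption for
# the demo; the paper's tables are over CDA/ROA relations).
COMP = {("<", "<"): {"<"}, ("<", "="): {"<"}, ("=", "<"): {"<"},
        (">", ">"): {">"}, (">", "="): {">"}, ("=", ">"): {">"},
        ("=", "="): {"="},
        ("<", ">"): {"<", "=", ">"}, (">", "<"): {"<", "=", ">"}}

network = {(0, 1): {"<"}, (1, 0): {">"}, (1, 2): {"<"},
           (2, 1): {">"}, (0, 2): {">"}, (2, 0): {"<"}}
# x0 < x1 < x2 together with x0 > x2 is inconsistent:
print(path_consistency(network, lambda r, s: COMP[(r, s)]))   # False
```

Propagation refines each relation by the composition of the other two sides of every triangle until a fixpoint; an empty relation signals inconsistency, exactly as in the example of Section 3.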

Combining and integrating different kinds of knowledge is an emerging and challenging issue in QSR. Related work has been done by Gerevini and Renz [8], which deals with the combination of topological knowledge and relative size knowledge in QSR. Similar work might be carried out for other aspects of knowledge in QSR, such as qualitative distance [3] and relative orientation [6, 7], a combination known to be highly important for GIS and robot navigation applications, and on which not much has been achieved so far.

References

[1] J F Allen. Maintaining knowledge about temporal intervals. Communications of the Association for Computing Machinery, 26(11):832–843, 1983.

[2] F Baader and P Hanschke. A scheme for integrating concrete domains into concept languages. In Proceedings of the 12th International Joint Conference on Artificial Intelligence (IJCAI), pages 452–457, Sydney, 1991. Morgan Kaufmann.

[3] E Clementini, P Di Felice, and D Hernandez. Qualitative representation of positional information. Artificial Intelligence, 95:317–356, 1997.

[4] A U Frank. Qualitative spatial reasoning with cardinal directions. In Proceedings of the Austrian Conference on Artificial Intelligence, pages 157–167, Vienna, 1991.

[5] A U Frank. Qualitative spatial reasoning about distances and directions in geographic space. Journal of Visual Languages and Computing, 3:343–371, 1992.

[6] C Freksa. Using Orientation Information for Qualitative Spatial Reasoning. In A U Frank, I Campari, and U Formentini, editors, Proceedings of GIS —from Space to Territory: Theories and Methods of Spatio-Temporal Reasoning, Berlin, 1992. Springer.

[7] C Freksa and K Zimmermann. On the utilization of spatial structures for cognitively plausible and efficient reasoning. In Proceedings of the 1992 IEEE International Conference on Systems, Man and Cybernetics, 1992.

[8] A Gerevini and J Renz. Combining topological and qualitative size constraints for spatial reasoning. In M J Maher and J-F Puget, editors, Proceedings of the Fourth International Conference on Principles and Practice of Constraint Programming (CP), volume 1520 of Lecture Notes in Computer Science, pages 220–234. Springer, 1998.

[9] A Isli. Combining cardinal direction relations with relative orientation relations in QSR. In preparation (downloadable from http://kogs-www.informatik.uni-hamburg.de/ isli/home-Publications-TR.html).

[10] A Isli. Bridging the gap between modal temporal logic and constraint-based QSR as a spatio-temporalisation of ALC(D) with weakly cyclic TBoxes. Technical Report FBI-HH-M-311/02, Fachbereich Informatik, Universität Hamburg, 2002. Downloadable from http://kogs-www.informatik.uni-hamburg.de/ isli/home-Publications-TR.html and from http://arXiv.org/abs/cs.AI/0307040.

[11] A Isli and A G Cohn. A new Approach to cyclic Ordering of 2D Orientations using ternary Relation Algebras. Artificial Intelligence, 122(1-2):137–187, 2000.

[12] G Ligozat. Reasoning about cardinal Directions. Journal of Visual Languages and Computing, 9(1):23–44, 1998.

[13] C Lutz. The Complexity of Description Logics with Concrete Domains. PhD thesis, LuFG Theoretical Computer Science, RWTH Aachen, 2001.

[14] D Randell, Z Cui, and A Cohn. A spatial Logic based on Regions and Connection. In Proceedings KR-92, pages 165–176, San Mateo, 1992. Morgan Kaufmann.

[15] A Scivos and B Nebel. Double-Crossing: Decidability and Computational Complexity of a Qualitative Calculus for Navigation. In D R Montello, editor, Spatial Information Theory: Foundations of GIS, number 2205 in LNCS, pages 431–446, Morro Bay, CA, 2001. Springer.


Unrestricted vs Restricted Cut in a Tableau Method for Boolean Circuits

Matti Järvisalo, Tommi A. Junttila, and Ilkka Niemelä
Laboratory for Theoretical Computer Science

Helsinki University of Technology

Abstract. This paper studies the relative proof complexity of variations of a tableau method for Boolean circuit satisfiability checking obtained by restricting the use of the cut rule in several natural ways. The results show that the unrestricted cut rule can be exponentially more effective than any of the considered restrictions. Moreover, there are exponential differences between the restricted versions, too. The results also apply to the Davis-Putnam procedure for conjunctive normal form formulae obtained from Boolean circuits with a standard linear size translation.

1 Introduction

Propositional satisfiability checkers have been applied successfully to many interesting domains such as planning [10] and model checking of finite state systems [2, 1]. The success builds on recent significant advances in the performance of SAT checkers based both on stochastic local search algorithms and on complete systematic search, see e.g. [14, 11].

Most successful satisfiability checkers assume that the input formulae are in conjunctive normal form (CNF). The reason for this is that it is simpler to develop efficient data structures and algorithms for CNF than for arbitrary formulae. However, using CNF makes efficient modeling of an application cumbersome. Therefore one usually employs a more general formula representation and then transforms the formula into CNF by using a standard translation. This translation introduces a new variable for each Boolean connective in the formula, resulting in a CNF translation of linear size.
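The standard translation described above can be sketched as follows; the formula representation, the clause encoding shown (the usual Tseitin-style one), and the variable naming are my own, not taken from the paper:

```python
from itertools import count

def neg(lit):
    # Negate a literal given as a string variable name.
    return lit[1:] if lit.startswith("-") else "-" + lit

def tseitin(expr, fresh=None, clauses=None):
    """Linear-size CNF translation: one fresh variable per connective.
    Formulas are nested tuples such as ("and", ("or", "a", "b"), ("not", "a"))."""
    if fresh is None:
        fresh, clauses = count(1), []
    if isinstance(expr, str):            # input variable: no clauses needed
        return expr, clauses
    op, *args = expr
    lits = [tseitin(a, fresh, clauses)[0] for a in args]
    v = "t%d" % next(fresh)              # fresh variable for this connective
    if op == "not":
        clauses += [(neg(v), neg(lits[0])), (v, lits[0])]    # v <-> not x
    elif op == "and":
        clauses += [(neg(v), x) for x in lits]               # v -> each x_i
        clauses.append(tuple([v] + [neg(x) for x in lits]))  # all x_i -> v
    elif op == "or":
        clauses += [(v, neg(x)) for x in lits]               # each x_i -> v
        clauses.append(tuple([neg(v)] + lits))               # v -> some x_i
    return v, clauses

root, cnf = tseitin(("and", ("or", "a", "b"), ("not", "a")))
cnf.append((root,))    # assert the formula by a unit clause on its root variable
print(len(cnf))        # clause count grows linearly with formula size
```

Each connective contributes a constant number of clauses, which is why the whole translation stays linear in the size of the formula.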

In this paper we study satisfiability checking methods for Boolean circuits. Boolean circuits are interesting because they allow for a compact and natural representation in many domains, as the representation can be simplified by sharing common subexpressions and by preserving natural structures and concepts of the domain. A tableau method that works directly with Boolean circuits has been developed [9]. It can be seen as a lifting of the Davis-Putnam procedure for CNF to Boolean circuits. Instead of standard (cut free) tableau techniques it employs a direct cut rule combined with deterministic (non-branching) deduction rules. The aim is to achieve high performance and to avoid some computational problems in cut free tableaux [5].

The efficiency of a typical Davis-Putnam method based SAT checking system depends on (i) the applied search space pruning techniques (e.g. non-branching deduction rules, non-chronological backtracking and conflict-driven learning) and (ii) the splitting rule (i.e., on which gates the explicit cut is applied). In this paper, we focus on the splitting/cut rule. The research problem is: How do restrictions on the use of the cut rule affect the proof complexity in Boolean circuit satisfiability

The financial support from the Academy of Finland (project #53695) is gratefully acknowledged.


checking based on tableaux? For instance, one may think that it is a good idea to restrict the cuts to the input gates only, as they determine the values of all other gates. Therefore, the search space for a circuit with n gates, m of which are input gates, would be 2^m instead of 2^n. This approach is proposed, for example, in [15, 6]. However, our results show that doing so can, in the worst case, result in exponentially larger proofs compared to the unrestricted cut rule. In addition to the input gate restricted cuts, we study several other natural locality based restrictions of the cut rule, e.g., "top-down" cuts that are made only on the children of the already determined gates and "bottom-up" cuts that can be applied on input gates and on the parents of the already determined gates. Our results show that restricting the cut in any of the considered ways can result in exponentially larger proofs than by applying the unrestricted cut. In addition, we show that there are also exponential differences in the proof complexity between the restricted versions of the cut rule. Note that although this paper considers Boolean circuits and a tableau method for them in order to preserve and exploit the structure of the problem, the results directly apply to SAT checkers for CNF formulae in case the formulae are obtained from circuits by using the standard linear size translation. This is because the Boolean propagation induced by the applied tableau rules corresponds to the propagation induced by the unit-literal rule of the Davis-Putnam method for the clauses generated by the translation.

2 Boolean Circuits

A Boolean circuit (see e.g. [12]) is an acyclic directed graph in which the nodes are called gates. The gates can be divided into three categories1: (i) output gates with incoming edges but no outgoing edges; (ii) intermediate gates with both incoming and outgoing edges; and (iii) input gates with outgoing edges but no incoming edges.

A Boolean function is associated with each output and intermediate gate. An example of a Boolean circuit is shown in Figure 1. In this circuit, a, b, c, and d are input gates, e, f, g, and h are intermediate gates, and v is an output gate. Formally, we present a Boolean circuit with a given set of gates as a set of equations of the form g = f(g1, ..., gn), where g and g1, ..., gn are gates and f is a Boolean function. It is required that in the set of equations each gate has at most one equation and that the equations are non-recursive. Graphically, the Boolean circuit is shown in Figure 1. For a Boolean circuit given by its set of equations, the gates of the circuit are the gates appearing in the equations, and the edge relation of the circuit consists of the child-parent pairs induced by the equations.

Figure 1: A Boolean circuit.

A truth assignment for a Boolean circuit is a function mapping each gate to true or false. An assignment is consistent if it respects every equation g = f(g1, ..., gn) of the circuit. A constrained Boolean circuit is a Boolean circuit with the restrictions that the gates in one given set are true and those in another are false. Here, we are interested in the satisfaction problem for constrained Boolean circuits: given a constrained Boolean circuit, is there a consistent truth assignment that respects the constraints? If such a consistent assignment exists, it is called a satisfying truth assignment and the circuit is satisfiable. Otherwise the circuit is unsatisfiable. The satisfaction problem for constrained Boolean circuits is obviously NP-complete.

1We do not consider circuits in which there are trivial gates with no edges (neither incoming nor outgoing).
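Under the definitions above, checking whether a truth assignment is consistent is a direct evaluation of the gate equations. A minimal sketch; the concrete circuit and its gate names are illustrative, not the circuit of Figure 1:

```python
# A Boolean circuit as a set of equations g = f(children), covering the
# not/or/and function class of the next section.
FUNS = {"and": all, "or": any, "not": lambda vs: not vs[0]}

def consistent(equations, tau):
    # tau assigns every gate a truth value; consistency means that every
    # equation g = f(g1, ..., gn) is respected by tau.
    return all(tau[g] == FUNS[f]([tau[c] for c in children])
               for g, (f, children) in equations.items())

circuit = {"e": ("or", ["a", "b"]),
           "f": ("not", ["a"]),
           "v": ("and", ["e", "f"])}
tau = {"a": False, "b": True, "e": True, "f": True, "v": True}
print(consistent(circuit, tau))                    # True: tau satisfies the circuit
# Constraining the output v to False instead makes tau inconsistent:
print(consistent(circuit, {**tau, "v": False}))    # False
```

A satisfying truth assignment of a constrained circuit is then a consistent assignment that additionally gives the constrained gates their required values.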


For simplicity, we consider the class of Boolean circuits in which the following three types of Boolean functions are allowed: (1) a not gate is true iff its child is false; (2) an or gate is true iff at least one of its children is true; and (3) an and gate is true iff all of its children are true. It is straightforward to extend this class with additional Boolean functions such as equivalence.

3 A Tableau Method for Boolean Circuits

We concentrate on a tableau method for Boolean circuit satisfiability checking. The method consists of the rules shown in Figure 2. It is a simplified version of the method introduced in [9], where rules for further connectives, e.g. equivalence, are provided.

Figure 2: Tableau method for Boolean circuits. The rules are: (a) the explicit cut rule; (b) constant rules; (c)-(h) "up" and "down" rules for the three gate types; and (i) "last undetermined child" rules.

Given a constrained Boolean circuit, a tableau for it is a tree such that the root of the tree consists of the equations of the circuit and of its constraints: for each gate constrained to true, a true entry is added, while for each gate constrained to false, a false entry is added. The other nodes in the tree are true and false entries for gates, generated by applying the rules in Figure 2 as in the standard tableau method [4]. A branch in the tableau is contradictory if it contains both a true and a false entry for some gate. Otherwise, the branch is open. A branch is complete if it is contradictory, or if there is a true or a false entry for each gate in the branch and the branch is closed under the rules (b)-(i). A tableau is finished if all the branches of the tableau are complete. A finished tableau is closed if all of its branches are contradictory. For each gate, we say that a true (false) entry can be deduced in the branch if the entry can be generated by applying rules (b)-(i) only. A closed tableau for a constrained circuit is called a refutation for the circuit. As an example, for the circuit shown in Figure 1 with the constraint that the output gate v is true, a refutation is shown in Figure 3.

Figure 3: A refutation.


We study variations of the method in which we restrict the use of the explicit cut rule to certain types of gates. Let a constrained Boolean circuit be given. The considered variations are the following.

: Use of explicit cut is restricted to input gates (we call such cuts input cuts).

: Use of explicit cut is restricted to output gates and to the children of gates for which there already exists a true or a false entry in the branch (top-down cuts).

: Use of explicit cut is restricted to input cuts and to the parents of gates for which there already exists a true or a false entry in the branch (bottom-up cuts).

: Use of explicit cut is restricted to input and top-down cuts.

: Use of explicit cut is restricted to bottom-up and top-down cuts.

The method and its variations are sound and complete proof systems for constrained Boolean circuits in the sense that a complete open branch in a tableau for a circuit gives a satisfying truth assignment for the circuit, and a closed tableau indicates that the circuit is unsatisfiable.
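The input-cut restriction can be mimicked by brute force: split on the m input gates and check every branch. This sketch illustrates only the 2^m search space of that restriction, not the tableau rules themselves; the circuit used is an assumption for the demo:

```python
from itertools import product

FUNS = {"and": all, "or": any, "not": lambda vs: not vs[0]}

def eval_gates(equations, inputs):
    # Evaluate every gate bottom-up from an input assignment; repeated
    # passes stand in for a topological sort (the circuit is acyclic).
    val = dict(inputs)
    while len(val) < len(inputs) + len(equations):
        for g, (f, children) in equations.items():
            if g not in val and all(c in val for c in children):
                val[g] = FUNS[f]([val[c] for c in children])
    return val

def unsat_by_input_cuts(equations, input_gates, constraints):
    # Split on every input gate (2^m branches) and check that each branch
    # violates some constraint.
    for bits in product([False, True], repeat=len(input_gates)):
        val = eval_gates(equations, dict(zip(input_gates, bits)))
        if all(val[g] == b for g, b in constraints.items()):
            return False          # open branch: a satisfying assignment exists
    return True                   # every branch closed: circuit is unsatisfiable

# Tiny example: v = a AND (NOT a), constrained to true, is unsatisfiable.
eqs = {"f": ("not", ["a"]), "v": ("and", ["a", "f"])}
print(unsat_by_input_cuts(eqs, ["a"], {"v": True}))   # True
```

The theorems below make precise when this input-only splitting is exponentially worse than cutting on arbitrary gates.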

4 Proof Complexity and the Pigeon-Hole Principle

We use the notion of p-simulation [3] to study the relative efficiency of proof systems. The proof complexity (or complexity, in short) of a proposition in a proof system is the minimum size of a proof of the proposition in the system. For two proof systems S and S', we say that S p-simulates S' if there exists a polynomial p such that, whenever there exists an S'-proof of size n for a proposition, there exists an S-proof for it of size at most p(n). The p-simulation relation is transitive. If S p-simulates S' but S' does not p-simulate S, the ordering between them is strict; if neither p-simulates the other, the systems are incomparable.

An example of a propositional formula with high proof complexity in many proof systems is

the pigeon-hole principle [8]. It can be formalized as a set of clauses as follows: PHP^{n+1}_n consists of a clause P_i for each pigeon i, 1 ≤ i ≤ n+1, and a clause H_{i,j,k} for each pair of pigeons i < j and each hole k, 1 ≤ k ≤ n, where the clauses P_i and H_{i,j,k} are defined as

P_i = p_{i,1} ∨ p_{i,2} ∨ ... ∨ p_{i,n}   and   H_{i,j,k} = ¬p_{i,k} ∨ ¬p_{j,k},

and each p_{i,j} is a Boolean variable with the interpretation "p_{i,j} is true if and only if the i-th pigeon sits in the j-th hole". As there are more pigeons than holes, PHP^{n+1}_n is obviously unsatisfiable. For resolution [13], it was first proven by Haken [7] that the proof complexity of PHP^{n+1}_n is exponential w.r.t. n. We define the size of a refutation in the tableau method and its variations as the number of nodes in the closed tableau. As an example, the size of the refutation shown in Figure 3 is 14. The obvious ordering of the method and its variations based on the p-simulation relation is shown in Figure 4.
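The clause set PHP^{n+1}_n can be generated mechanically; a short sketch in DIMACS-style integer literals, with a variable numbering of my own:

```python
from itertools import combinations

def php(n):
    # Clauses of PHP^{n+1}_n; variable v(i, j) = i*n + j + 1 encodes
    # "pigeon i sits in hole j" (negative literals are negations).
    v = lambda i, j: i * n + j + 1
    clauses = [[v(i, j) for j in range(n)] for i in range(n + 1)]   # P_i
    clauses += [[-v(i, k), -v(j, k)]                                # H_{i,j,k}
                for k in range(n)
                for i, j in combinations(range(n + 1), 2)]
    return clauses

print(len(php(2)))   # 3 pigeon clauses + 2 * 3 hole clauses = 9
```

For n holes this yields n+1 pigeon clauses and n·C(n+2,2)-style quadratically many hole clauses, so the formula family is small even though its resolution refutations are exponential.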

Figure 4: Obvious ordering of the variations of the method.

We consider the canonical Boolean circuit representation3 of a set of clauses. As an example, the circuit representation of one such clause set

2Defining the size of a proof depends highly on the type of proof system considered.
3The canonical representation is obvious; each variable induces an input gate, each negated variable and each clause an intermediate (not / or) gate, and the set of clauses itself an (and) output gate.


is shown in Figure 1. We consider the circuit representation with the constraint that its output gate is true. It is not hard to show that, given any unsatisfiable set of clauses and a refutation of size s for its constrained circuit representation, we can construct a tree-like resolution refutation for the clause set of size polynomial in s. Thus we have the following theorem, which obviously applies to the restricted variations of the method as well.

Theorem 4.1 The complexity of the tableau method for the constrained circuit representation of PHP^{n+1}_n is exponential w.r.t. n.

5 Relative Efficiency of Restricted Cuts

In this section we prove negative p-simulation results between the unrestricted method and its variations. First we define some circuit constructs, which we call gadgets, that are used in the proofs. We call the circuit shown in Figure 1 the UNSAT gadget, and we call the circuit representation of the pigeon-hole clauses the PHP gadget. Obviously, both gadgets are unsatisfiable when their output gates are constrained to true. The structure of the TD gadget is shown in Figure 5(a), and the structure of the XOR gadget in Figure 5(b). In the XOR gadget we denote by + the construct shown in Figure 5(c).

Figure 5: (a) The TD gadget, (b) the XOR gadget, and (c) the logical function used in the XOR gadget as a Boolean circuit.

We now show that each pair in the p-simulation ordering of the method and its restricted variations, shown in Figure 4, is actually proper. The following theorems yield a complete classification of these systems based on relative proof complexity.

Theorem 5.1 .

Figure 6: The circuit for Theorems 5.1 and 5.2.

Proof sketch. Consider the family of circuits shown in Figure 6 with the output gate constrained to true. Any such circuit is obviously unsatisfiable. With the unrestricted cut we can construct a constant size refutation by applying the cut rule first on both of the input gates representing the variables in some clause of the leftmost PHP gadget and then on one further gate, after which every branch can be closed as seen in Figure 3. Notice that to generate a refutation we need to reach the UNSAT gadget, i.e., it is impossible to generate a contradiction in all the branches of a refutation without having an entry for some of the gates in the UNSAT gadget in the tableau. From the constraint on the output gate only a few entries can be deduced, but nothing else. With input cuts, when applying the cut rule a contradiction can be


achieved only in the PHP part, or by deducing entries for the gates between the PHP part and the UNSAT gadget by starting from the input gates and propagating the deduction through the PHP part. As the PHP gadget is unsatisfiable, the latter is impossible. Now assume that there exists a refutation of polynomial size for this circuit in the restricted system. Given such a refutation, it would be easy to construct a refutation of polynomial size for the constrained PHP gadget itself. This contradicts Theorem 4.1.

Theorem 5.2 .

Proof sketch. Consider the family of circuits shown in Figure 6 with the constraint that the output gate is true, making any such circuit unsatisfiable. In the less restricted system we can construct a constant size refutation by applying the cut rule on two gates in turn. In the restricted system, all refutations will be of exponential size, as shown in the proof of Theorem 5.1.

Theorem 5.3 .

Figure 7: The circuit for Theorem 5.3.

Proof sketch. Consider the family of circuits shown in Figure 7 with the constraint that the output gate is true. Any circuit in the family is obviously unsatisfiable. In the less restricted system we can construct a refutation of polynomial size by first applying the cut rule on two gates near the output; then the needed entries can be deduced in every branch, after which the refutation is completed in a straightforward manner. Notice that to generate a refutation we need to reach the UNSAT gadgets, as in the proof of Theorem 5.1. In the restricted system, at each level of the TD gadget the entries deduced so far imply that one of two gates is true, but we cannot deduce which one; thus we must apply the cut rule on one of them (notice the symmetry between the two). In each resulting branch only the entries of the next level can be deduced, and nothing else. By induction on the levels, the cut rule must be used on one of the two gates of every level. By increasing the number of TD gadgets by one, the number of times the cut rule is applied also increases by one, doubling the number of branches in the tableau. After reaching the bottom level, every branch contains an entry that leads to a contradiction in one of the UNSAT gadgets (as shown in Figure 3). Every refutation in the restricted system will thus be of exponential size w.r.t. the number of TD gadgets.

Theorem 5.4 .

Figure 8: The circuit for Theorem 5.4.

Proof sketch. Consider the family of circuits shown in Figure 8 with the constraint that the output gate is true, making any such circuit unsatisfiable. In the less restricted system we can construct a refutation of polynomial size by applying the cut rule first on both of the input gates representing the variables in some clause in each of the PHP gadgets, and then on one of the gates in each of the UNSAT gadgets. Again, to generate a refutation we need to reach the UNSAT gadgets. With input cuts as the only cuts, this results in a refutation of exponential size, as in the proof of Theorem 5.1. With top-down cuts as the only cuts, we are forced to apply the cut rule on at least one of the gates on every level of the XOR gadgets, inductively doubling the number of open branches on each level. This also results in a refutation of exponential size.

Theorem 5.5 .

Figure 9: The circuit for Theorem 5.5.

Proof sketch. Consider the family of circuits shown in Figure 9 with the output gate constrained to true, making any such circuit unsatisfiable. In the less restricted system we can construct a constant size refutation by applying the cut rule top-down, as described in the proof of Theorem 5.2. It is impossible to generate a refutation without reaching the UNSAT gadgets, as in the previous proofs. With bottom-up cuts as the only cuts, after applying the cut rule on a single gate we cannot deduce anything (by symmetry we only discuss one of the two XOR gadgets). Applying the cut rule on two gates on a level, we can deduce an entry for at most one gate on the level above; inductively, applying the cut rule on the gates of one level, we can deduce entries for strictly fewer gates on the next level. This approach results in a refutation of exponential size. Alternatively, after applying the cut rule on a single gate on one level, we could apply the cut rule on one of its parents on the level above; then we can deduce an entry for the other child of that parent, but nothing else. We could again apply the cut on a parent on the next level, and again deduce an entry only for its other child. Inductively, this approach also results in a refutation of exponential size. All other approaches to climbing up the levels of the XOR gadget are combinations of these two, and also result in refutations of exponential size.

Theorem 5.6 .

Figure 10: The circuit for Theorem 5.6.

Proof sketch. Consider the family of circuits shown in Figure 10 with the output gate constrained to true. Any such circuit is unsatisfiable. In the less restricted system we can construct a refutation of polynomial size by applying the cut rule on suitable pairs of gates, inductively level by level. Again, to generate a refutation we need to reach the UNSAT gadgets. With bottom-up cuts as the only cuts this results in a refutation of exponential size, as in the proof of Theorem 5.5. With top-down cuts, this yields a refutation of exponential size, similarly to the proof of Theorem 5.4.

Finally, we have the following theorem, for which the proofs are omitted. In the theorem, (a) is established using the ideas in the proofs of Theorems 5.2 and 5.3, (b) results from the approach in the proofs of Theorems 5.3 and 5.5, and (c) from the approach in the proof of Theorem 5.5 and from a circuit which consists of the circuit shown in Figure 7 with PHP gadgets hanging from all of the input gates of the UNSAT gadgets.

Theorem 5.7 . . .

By transitivity of p-simulation, the resulting ordering of the method and its restricted variations based on the p-simulation relation is shown in Figure 11.


6 Conclusions

In this paper we study the relative proof complexity of

several restricted versions of a tableau method for Boolean circuits. Especially, we consider the natural cases (and their combinations) in which (i) the cut rule can only be applied on the input gates or (ii) the cut rule is applied in a locality based manner, either "top-down" starting from the constraints or "bottom-up" starting from the input gates. We show that the unrestricted use of the cut rule may yield exponentially smaller proofs than any of the restrictions considered. Moreover, there are exponential differences

Figure 11: Summary of p-simulation relations between the method and its restricted variations. One relation is omitted from the picture for clarity.

in the efficiency between the restricted versions, too. The results indicate that it is preferable to use less restricted cut rules in order to have as small proofs as possible. Obviously, the results hold for extended classes of Boolean circuits if the set of rules involving not, or, and and gates remains unchanged in the tableau method. The results also apply to the Davis-Putnam method for conjunctive normal form formulae that are obtained from Boolean circuits by the standard linear size translation. For instance, our results are in contradiction with the common intuition that it is, in general, beneficial to restrict the splittings in a Davis-Putnam procedure to the variables corresponding to the input gates only.

References

[1] P. Bjesse, T. Leonard, and A. Mokkedem. Finding bugs in an Alpha microprocessor using satisfiability solvers. In Proceedings of the 13th International Conference on Computer-Aided Verification, volume 2102 of LNCS, pages 454–464. Springer, 2001.

[2] E. Clarke, A. Biere, R. Raimi, and Y. Zhu. Bounded model checking using satisfiability solving. Formal Methods in System Design, 19(1):7–34, July 2001.

[3] S. A. Cook and R. A. Reckhow. The relative efficiency of propositional proof systems. Journal of Symbolic Logic, 44(1):36–50, 1979.

[4] M. D'Agostino, D. M. Gabbay, R. Hähnle, and J. Posegga, editors. Handbook of Tableau Methods. Kluwer Academic Publishers, Dordrecht, The Netherlands, 1999.

[5] M. D'Agostino and M. Mondadori. The taming of the cut: Classical refutations with analytic cut. Journal of Logic and Computation, 4(3):285–319, 1994.

[6] E. Giunchiglia, A. Massarotto, and R. Sebastiani. Act, and the rest will follow: Exploiting determinism in planning as satisfiability. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), pages 948–953. AAAI Press, 1998.

[7] A. Haken. The intractability of resolution. Theoretical Computer Science, 39(2–3):297–308, 1985.

[8] S. Jukna. Extremal Combinatorics: with Applications in Computer Science. Springer, 2001.

[9] T. A. Junttila and I. Niemelä. Towards an efficient tableau method for Boolean circuit satisfiability checking. In Computational Logic – CL 2000, volume 1861 of LNAI, pages 553–567. Springer, 2000.

[10] H. Kautz and B. Selman. Pushing the envelope: Planning, propositional logic, and stochastic search. In Proceedings of the 13th National Conference on Artificial Intelligence, Portland, Oregon, July 1996.

[11] M. W. Moskewicz, C. F. Madigan, Y. Zhao, L. Zhang, and S. Malik. Chaff: Engineering an efficient SAT solver. In Proceedings of the 38th Design Automation Conference, pages 530–535. ACM, 2001.

[12] C. H. Papadimitriou. Computational Complexity. Addison-Wesley, Reading, Massachusetts, USA, 1994.

[13] J. A. Robinson. A machine-oriented logic based on the resolution principle. Journal of the ACM, 12(1):23–41, 1965.

[14] B. Selman, H. Kautz, and B. Cohen. Local search strategies for satisfiability testing. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 26:521–532, 1996.

[15] O. Shtrichman. Tuning SAT checkers for bounded model checking. In Computer Aided Verification, CAV 2000, volume 1855 of LNCS, pages 480–494. Springer, 2000.

Page 137: Artificial Intelligence and Mathematics - Rutgers Universityrutcor.rutgers.edu/~amai/aimath04/AcceptedPapers/aimath-I.pdf · Artificial Intelligence and Mathematics January 4-6, 2004

! " # $ # " # % & % # ' % # $ % % % # % ' (# $ )"*) % ' " + ,# % ) ) % ' & % -# .) % ! + * % +%#

/ + % ) )% # $ % %% ! + 0 ( 1 1 2# $ %% %% % + '3 , * ' 2 3#

4 %% ) %% "% % % # 5* # 0 ) % % ) % # 4 ) % % # 6 % 7 ! %%#4 +% ") % #

Page 138: Artificial Intelligence and Mathematics - Rutgers Universityrutcor.rutgers.edu/~amai/aimath04/AcceptedPapers/aimath-I.pdf · Artificial Intelligence and Mathematics January 4-6, 2004

%%" +% )" +% # %%" +% # $" 8 ) " 8 8 # ) % # $ %" 8 8

8

) %! % # $ "* 8

# % 7 %

% # $ % 9

8 8 8

) # : % ; #

6 "% ) % % #' %

%" + 9

8 8 8 8

8 8

$ + 0 8 #

Page 139: Artificial Intelligence and Mathematics - Rutgers Universityrutcor.rutgers.edu/~amai/aimath04/AcceptedPapers/aimath-I.pdf · Artificial Intelligence and Mathematics January 4-6, 2004

: +; + "%

8

$ ) % + 8

8

9 8 ) "* % * % * < #$ +

8 8 8 8 =

8 8 8 8 8

8 8 8 8 =

8 8 8 8

% % % %" # : % < % % % +% 9

8 8

-

8 8 8 = (

8 8 8 2

8 8 8 3

8 8 8 = ,

) # 8 9 # 4

8 # 4 8 =

) 8

8

) #

$ 8 8 0

0 8

8 8

8 8

Page 140: Artificial Intelligence and Mathematics - Rutgers Universityrutcor.rutgers.edu/~amai/aimath04/AcceptedPapers/aimath-I.pdf · Artificial Intelligence and Mathematics January 4-6, 2004

> 4<

0 8 0# 4

0 0 ) * * # / )

0 0 ) *

< (", 8 * #

: + ) % % % % #

$ * % # $ % % #

/ ? % " % % % # ) % % % " : # % Æ % # 4 % % % * 8 8 # @ ) % % # 4 % %; * #

!"

# $ %&

? 9

$ ! % '# 4 % # 4 % % % # $ % % # 4 % % ' ) % #

!" # $ #$% $ #$ $ &' ! ()*"+ '+ ,+ *$- ./0 1

-

Page 141: Artificial Intelligence and Mathematics - Rutgers Universityrutcor.rutgers.edu/~amai/aimath04/AcceptedPapers/aimath-I.pdf · Artificial Intelligence and Mathematics January 4-6, 2004

$ !" % ' % ) % !+ #4 % # 4 % ) %! % # 4 % % % % # 4 8 #

$ % +% )# ) ": ; # 6 + % ' ! % ) % - % 12# $ % % * .* - # 6 % : # 0## 6 ' & ) "* 0#0 00 # 6 +% A % #, B.; 4 C %#

? ) % C "(3, D " #0 + ; -# # E+ ) !" 4 ' C% # . 0 #

? ) ( # 6 % % ) ! % # $ ; $ # $ ! 1 + 4 % # ? ) )9 ) 1(F ! % ! ) !# > + ! & #

? ) & ! ) !"

2 $ +

+

!" $ # % $

+

$ # # 3 $ # 4 # !# + 4 # "

(

Page 142: Artificial Intelligence and Mathematics - Rutgers Universityrutcor.rutgers.edu/~amai/aimath04/AcceptedPapers/aimath-I.pdf · Artificial Intelligence and Mathematics January 4-6, 2004

) ' 5 % %$ $ 6 $ 6

787 / 9/ 7 8 99 9 8 :7 787 : /8 7 /9 7 ; 787 : 987 7 987 :9 787 9:7/ / 9:7 7 787 : 979 97/ : 7; 787 7 /78/ 9: : /77 :8 7 99 787 9 7 99 787 9 /8 8/9 9 8/9/ 9:;9 787 9 8 / 8 / 98 89/9 : 89/9 : : 9

89 9: : / 7 ( 8/ 8 : 9

$ 9 C ! #

) % # $ ; $ #

) %$ 6 $

9 :/;9 8 / 9: : 7/ 9:8 :( /

$ 9 C !" #

? ) ) ) #$ % * #

4 +% % ) %% # ! ) +% 0 ) ! ) ! 0% # ) %% ) *) % !" ) "% # $ * % ) % ' ) % # +

2

Page 143: Artificial Intelligence and Mathematics - Rutgers Universityrutcor.rutgers.edu/~amai/aimath04/AcceptedPapers/aimath-I.pdf · Artificial Intelligence and Mathematics January 4-6, 2004

[Figure: Score vs. time; left panel Time (minutes), right panel Time (hours); curves labeled Standard, ParamReusing, and StandardFixed.]


New Look-Ahead Schemes for Constraint Satisfaction

Kalev Kask, Rina Dechter and Vibhav Gogate
Department of Information and Computer Science
University of California, Irvine, CA 92697-3425

Abstract

This paper presents new look-ahead schemes for backtracking search when solving constraint satisfaction problems. The look-ahead schemes compute a heuristic for value ordering and domain pruning, which influences variable orderings at each node in the search space. As a basis for a heuristic, we investigate two tasks, both harder than the CSP task. The first is finding the solution with the minimum number of conflicts. The second is counting solutions. Clearly each of these tasks also finds a solution to the CSP problem, if one exists, or decides that the problem is inconsistent. Our plan is to use approximations of these more complex tasks as heuristics for guiding search for a solution of a CSP task. In particular, we investigate two recent partition-based strategies that approximate variable elimination algorithms, Mini-Bucket-Tree Elimination and Iterative Join-Graph Propagation (IJGP). The latter belongs to the class of belief propagation algorithms that have attracted substantial interest due to their surprising success for probabilistic inference. Our preliminary empirical evaluation is very encouraging, demonstrating that the counting-based heuristic approximated by IJGP yields a very focused search even for hard problems.

Introduction

We investigate two cost functions for modelling constraint satisfaction problems: one as an optimization, min-conflict (MC), and the other as solution counting (SC). When approximating variable elimination algorithms on each of these formulations, we obtain heuristic functions that can be used to guide backtracking search algorithms. For the min-conflict formulation, each constraint is modelled by a cost function that assigns 0 to each allowed tuple and 1 to unallowed tuples, and the overall cost is the sum of all cost functions. An assignment is a solution when its cost is 0.

Within the solution-count formulation, each constraint is modelled by a function that assigns 1 to allowed tuples and 0 otherwise. The overall cost is the product of all individual functions. An assignment is a solution when its cost is 1. The target is to find the number of solutions, namely the number of assignments having cost 1.

Copyright © 2003, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Both of these tasks can be solved exactly by variable elimination algorithms. However, the complexity of these algorithms is too high to be practical and approximations are necessary. In order to approximate variable elimination, we apply two basic approximation schemes, Mini-Bucket-Tree Elimination (MBTE) (Dechter, Kask, & Larrosa 2001) and Iterative Join-Graph Propagation (IJGP) (Dechter, Kask, & Mateescu 2002), to each of the two formalisms, yielding heuristic functions. The heuristics derived using MBTE can provide a lower bound on the min-conflict function and an upper bound in the case of solution counting (see footnote 1). The heuristics generated using IJGP provide no guarantee, but were shown to work very well sometimes for probabilistic inference tasks.

We incorporate these heuristics, computed by MBTE/IJGP on the MC/SC tasks, within backtracking search and compare the resulting backtracking algorithms against MAC (Sabin & Freuder 1997), one of the most powerful lookahead methods, and against SLS (Stochastic Local Search).

Our results are very promising. We show that, on hard random problem instances at the phase transition, the SC model yields overall stronger heuristics than the MC model. In particular, IJGP computing SC yields a very focused search with relatively few dead-ends. We show that this new algorithm, backtracking with the solution count heuristic computed by IJGP (IJGP-SC), is more scalable than MAC/SLS: it is inferior to MAC/SLS on small problems, but as the problem size grows, the performance of IJGP-SC improves relative to MAC/SLS, and on the largest problems we tried IJGP-SC outperforms both MAC and SLS. These results are significant because our base algorithm is naive backtracking that is not enhanced by either backjumping or learning, but equipped with a single look-ahead heuristic. Finally, we believe that our implementation can be optimized to yield at least an order of magnitude speed-up.

1. If we normalize messages computed by MBTE-SC, we get approximations of fractions of solutions, instead of upper bounds on solution counts.

Preliminaries

DEFINITION 0.1 (constraint satisfaction problem) A Constraint Network (CN) is defined by a triplet (X, D, C) where X is a set of variables X = {X1, ..., Xn}, associated with a set of discrete-valued domains D = {D1, ..., Dn}, and a set of constraints C = {C1, ..., Cm}. Each constraint Ci is a pair (Si, Ri), where Ri is a relation Ri ⊆ D_Si defined on a subset of variables Si ⊆ X called the scope of Ci. The relation denotes all compatible tuples of D_Si allowed by the constraint. The primal graph of a constraint network, called a constraint graph, has a node for each variable, and an arc between two nodes iff the corresponding variables participate in the same constraint. A solution is an assignment of values to variables x = (x1, ..., xn), xi ∈ Di, such that each constraint is satisfied, namely ∀ Ci ∈ C, x_Si ∈ Ri. The Constraint Satisfaction Problem (CSP) is to determine if a constraint network has a solution, and if so, to find a solution. A binary CSP is one where each constraint involves at most two variables.
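Definition 0.1 can be made concrete with a small sketch (the variable names, example constraints, and helper functions below are ours, not the paper's; a naive enumeration is used only to keep the sketch self-contained):

```python
from itertools import product

# A binary CSP in the (X, D, C) form of Definition 0.1: each constraint
# is a pair (scope, relation) where the relation lists the allowed tuples.
variables = [0, 1, 2]
domains = {0: [1, 2], 1: [1, 2], 2: [1, 2]}
constraints = [
    ((0, 1), {(1, 2), (2, 1)}),   # X0 != X1
    ((1, 2), {(1, 2), (2, 1)}),   # X1 != X2
]

def is_solution(assignment, constraints):
    """A full assignment is a solution iff every constraint's scope
    projects onto an allowed tuple (for all Ci: x_Si in Ri)."""
    return all(tuple(assignment[v] for v in scope) in rel
               for scope, rel in constraints)

def solve(variables, domains, constraints):
    """Decide the CSP by enumeration and return one solution, or None."""
    for values in product(*(domains[v] for v in variables)):
        assignment = dict(zip(variables, values))
        if is_solution(assignment, constraints):
            return assignment
    return None  # inconsistent network

sol = solve(variables, domains, constraints)
```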

As noted, a constraint satisfaction problem can be solved through a minimization problem (minimizing the number-of-conflicts cost function) or through the task of finding all solutions. Both of these tasks can be solved exactly by inference algorithms defined over a tree-decomposition of the problem specification, developed for constraint networks (Dechter & Pearl 1989) and for belief networks (Lauritzen & Spiegelhalter 1988). Intuitively, a tree-decomposition takes a collection of functions and partitions them into a tree of clusters. The cluster tree is often called a join-tree or a junction tree. We overview formally the notion of a tree-decomposition and a message-passing algorithm over the tree. The combined description is similar to the Shafer-Shenoy variant of the junction-tree algorithm (Shafer & Shenoy 1990).

We use a recent formalization of tree-decomposition given by (Gottlob, Leone, & Scarcello 1999).

DEFINITION 0.2 (cluster-tree decompositions) Let CSP = <X, D, C> be a constraint satisfaction problem. A cluster-tree decomposition for CSP is a triple D = <T, χ, ψ>, where T = (V, E) is a tree, and χ and ψ are labelling functions which associate with each vertex v ∈ V two sets, a variable label χ(v) ⊆ X and a function label ψ(v) ⊆ C.

1. For each function Ci ∈ C, there is exactly one vertex v ∈ V such that Ci ∈ ψ(v), and Si ⊆ χ(v).

2. For each variable Xi ∈ X, the set {v ∈ V | Xi ∈ χ(v)} induces a connected subtree of T. The connectedness requirement is also called the running intersection property.

Let (u, v) be an edge of a tree-decomposition. The separator of u and v is defined as sep(u, v) = χ(u) ∩ χ(v); the eliminator of u and v is defined as elim(u, v) = χ(u) − sep(u, v).
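The two conditions of Definition 0.2 can be checked mechanically; the sketch below (our own toy encoding, not from the paper) represents a candidate decomposition by its tree edges and its χ/ψ labels:

```python
def sep(chi, u, v):
    """Separator of an edge: the shared variable label (Definition 0.2)."""
    return chi[u] & chi[v]

def check_decomposition(tree_edges, chi, psi, scopes):
    """Verify both conditions of Definition 0.2 for a candidate cluster tree."""
    # Condition 1: each constraint is placed at exactly one vertex,
    # and that vertex's variable label covers the constraint's scope.
    placements = {c: [v for v in chi if c in psi[v]] for c in scopes}
    if any(len(vs) != 1 or not scopes[c] <= chi[vs[0]]
           for c, vs in placements.items()):
        return False
    # Condition 2 (running intersection): the vertices whose label
    # contains a variable must induce a connected subtree.
    for x in set().union(*chi.values()):
        vs = {v for v in chi if x in chi[v]}
        seen, frontier = {next(iter(vs))}, [next(iter(vs))]
        while frontier:
            u = frontier.pop()
            for (a, b) in tree_edges:
                for w in ((b,) if a == u else (a,) if b == u else ()):
                    if w in vs and w not in seen:
                        seen.add(w)
                        frontier.append(w)
        if seen != vs:
            return False
    return True

# A chain decomposition for constraints c1(A,B) and c2(B,C).
chi = {0: {'A', 'B'}, 1: {'B', 'C'}}
psi = {0: {'c1'}, 1: {'c2'}}
scopes = {'c1': {'A', 'B'}, 'c2': {'B', 'C'}}
ok = check_decomposition([(0, 1)], chi, psi, scopes)
```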

DEFINITION 0.3 (tree-width, induced-width) The tree-width of a tree-decomposition is tw = max_{v ∈ V} |χ(v)| − 1, and its maximum separator size is s = max_{(u,v) ∈ E} |sep(u, v)|. The tree-width of a graph is the minimum tree-width over all its tree-decompositions, and it can be shown to be identical to the graph's induced-width (Dechter & Pearl 1989).

Cluster-tree elimination (CTE) (Dechter, Kask, & Larrosa 2001) is a message-passing algorithm on a tree-decomposition, the nodes of which are called clusters, each associated with variable and function subsets (their labels). CTE computes two messages for each edge (one in each direction), from each node to its neighbors, in two passes, from the leaves to the root and from the root to the leaves. The message that cluster u sends to cluster v for the min-conflict task is as follows: the cluster sums all its own functions (in its label) with all the messages received from its neighbors excluding v, and then minimizes the resulting function relative to the eliminator between u and v. This yields a message defined on the separator between u and v. For solution counts the messages are the same, except that summation is replaced with a product and minimization with summation.
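The two message rules (sum-then-minimize for MC, product-then-sum for SC) can be sketched generically over tabular cost functions; this is a toy single-message sketch with our own names, not the paper's implementation:

```python
from itertools import product
import math

def message(funcs, cluster_vars, sep_vars, domains, combine, marginalize):
    """One CTE-style message: combine all functions in the cluster, then
    marginalize out the eliminator (cluster_vars minus sep_vars),
    yielding a function defined on the separator."""
    elim = [v for v in cluster_vars if v not in sep_vars]
    out = {}
    for sep_vals in product(*(domains[v] for v in sep_vars)):
        fixed = dict(zip(sep_vars, sep_vals))
        vals = []
        for elim_vals in product(*(domains[v] for v in elim)):
            asg = {**fixed, **dict(zip(elim, elim_vals))}
            vals.append(combine(f(asg) for f in funcs))
        out[sep_vals] = marginalize(vals)
    return out

# One "not-equal" constraint over a cluster {X, Y}; message on separator {X}.
domains = {'X': [0, 1], 'Y': [0, 1]}
neq_mc = lambda a: 0 if a['X'] != a['Y'] else 1   # MC encoding: 0 = allowed
neq_sc = lambda a: 1 if a['X'] != a['Y'] else 0   # SC encoding: 1 = allowed
mc_msg = message([neq_mc], ['X', 'Y'], ['X'], domains, sum, min)        # sum, then minimize
sc_msg = message([neq_sc], ['X', 'Y'], ['X'], domains, math.prod, sum)  # product, then sum
```

Here mc_msg gives, for each value of X, the minimum number of conflicts over extensions to Y, while sc_msg gives the number of consistent extensions.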

The complexity of CTE is time exponential in the maximum size of the variable subsets and space exponential in the maximum size of the separators. Join-tree and junction-tree algorithms for constraint and belief networks are instances of CTE. CTE is an exact algorithm for computing either the min-conflict solution or the solution counts for the whole problem. Moreover, the algorithm can output, for each variable and each value, the minimum number of conflicts, or the number of solutions extending this variable-value pair, for the respective tasks. For more details see (Dechter, Kask, & Larrosa 2001; Dechter, Kask, & Mateescu 2002). The bucket-tree-elimination algorithm, BTE, is the special case when the underlying tree-decomposition is a bucket-tree. A bucket-tree is the structure associated with bucket-elimination algorithms.

Mini-cluster-tree elimination (MCTE(i)) (Dechter, Kask, & Larrosa 2001) approximates CTE using the partitioning idea: when computing a message from cluster u to cluster v, cluster u is partitioned into mini-clusters whose function scopes have at most i variables, each of which is processed separately by the same cluster computation described above, resulting in a set of messages which are sent to cluster v. MCTE(i) computes a bound on the exact value (a lower bound in case of a minimization problem, an upper bound in case of a maximization problem) and allows a trade-off between accuracy and complexity controlled by i. Its space and time complexity is exponential in the input parameter i, which also bounds the scope of the messages.

Page 147: Artificial Intelligence and Mathematics - Rutgers Universityrutcor.rutgers.edu/~amai/aimath04/AcceptedPapers/aimath-I.pdf · Artificial Intelligence and Mathematics January 4-6, 2004

Iterative Join-Graph Propagation (IJGP) (Dechter, Kask, & Mateescu 2002) can be perceived as an iterative version of the cluster-tree elimination algorithm that applies the message-passing to join-graphs rather than join-trees. A join-graph is a decomposition of functions into clusters that interact in a graph manner, rather than a tree manner; namely, conditions 1 and 2 should be satisfied by the clusters in the graph. Therefore there are two major differences between join-trees and join-graphs:

1. Unlike a join-tree decomposition, which is defined on a tree T, a join-graph is defined on a graph G = (V, E).

2. A join-graph satisfies the same two conditions of Definition 0.2 as a join-tree. In addition, each edge e ∈ E of the join-graph has a labelling φ(e) ⊆ sep(e), such that, for any variable Xi, the set of edges {e | Xi ∈ φ(e)} is a tree.

The join-graph propagation algorithm sends the same CTE messages to the clusters' neighbors, except that the messages are computed on the edge labelling φ instead of the separators. This class of algorithms generalizes loopy belief propagation, which has been demonstrated to be a useful approximation method for various belief networks, especially for probabilistic decoding (Dechter, Kask, & Mateescu 2002).

An important difference between MCTE and IJGP is that IJGP is iterative and can improve its performance with additional iterations. Similarly to MCTE(i), IJGP can be parameterized by i, which controls the cluster size in the join-graph, yielding a class of algorithms (IJGP(i)) that allow a trade-off between accuracy and complexity. As i increases, accuracy generally increases. When i is big enough to allow a tree structure, IJGP(i) coincides with CTE and becomes exact.

Min-Cost vs. Solution-Count

As noted before, we can express the relation Ri as a cost function Ci(Xi1 = xi1, ..., Xik = xik) = 0 if (xi1, ..., xik) ∈ Ri, and 1 otherwise. We call this a Min-Conflict (MC) model of the CSP. The objective function is the sum of all cost functions. The CSP problem is to find an assignment for which the cost function is 0. We will focus on a related task: given a partial assignment E, compute, for each value a of each uninstantiated variable Xi, the number of constraints violated in the extension of the assignment E ∪ {Xi = a} that has the least number of conflicts. As an optimization (minimization) task, it is NP-hard. However, an approximation of this can serve as a heuristic function guiding the Branch-and-Bound search for finding a solution.

Alternatively, we may also express the relation Ri as a cost function Ci(Xi1 = xi1, ..., Xik = xik) = 1 if (xi1, ..., xik) ∈ Ri, and 0 otherwise. We call this a Solution Counting (SC) model of the CSP. The objective function is the product of all cost functions. The CSP problem is to find an assignment for which the objective function is 1. However, we will focus on a harder task within this representation: given a partial assignment E, compute, for each value a of each uninstantiated variable Xi, the number of solutions that agree with E ∪ {Xi = a}. As a counting problem, it is #P-complete. However, an approximation of this task can serve as a heuristic function guiding a backtracking search algorithm for finding a solution.
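The two models can be illustrated on a small hypothetical CSP (the three-variable instance, constraint set, and function names below are our own; exact counting by enumeration stands in for the variable-elimination algorithms the paper approximates):

```python
from itertools import product

# Hypothetical CSP: variables A, B, C with domains {0, 1} and two
# "not-equal" constraints, encoded in both the MC and SC models.
domains = {'A': [0, 1], 'B': [0, 1], 'C': [0, 1]}
neq = {('A', 'B'), ('B', 'C')}

def mc_cost(asg):
    """Min-conflict model: each constraint contributes 1 iff violated;
    the objective is the sum, and a solution has cost 0."""
    return sum(1 for (u, v) in neq if asg[u] == asg[v])

def sc_value(asg):
    """Solution-count model: product of 0/1 indicators;
    a solution has value 1."""
    p = 1
    for (u, v) in neq:
        p *= 1 if asg[u] != asg[v] else 0
    return p

def count_extensions(var, val):
    """Exact SC(var = val): number of full solutions agreeing with it."""
    names = list(domains)
    return sum(sc_value(dict(zip(names, vals)))
               for vals in product(*(domains[n] for n in names))
               if dict(zip(names, vals))[var] == val)
```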

Approximating by MBTE the Min-Conflict and Solution-Count

Mini-Bucket-Tree Elimination (Dechter, Kask, & Larrosa 2001), or more generally Mini-Cluster-Tree Elimination, can be applied to approximate both the Min-Conflict and Solution-Count heuristics in a straightforward manner. MBTE applied to the Min-Conflict model (MBTE(i)-MC) computes, given a partial assignment E, for each value a of each uninstantiated variable X, a lower bound on the cost of the best extension of the given partial assignment E ∪ {X = a}. When MBTE is applied to the Solution-Count model (MBTE(i)-SC), it computes, given a partial assignment E, for each value a of each uninstantiated variable Xi, an approximation of the number of solutions that agree with E ∪ {Xi = a}. Note that the i-bound can be used to control the accuracy and complexity of this approximation scheme.

The respective approximated values computed by MBTE(i)-MC and MBTE(i)-SC can also be used for domain pruning. It can be shown that the pruning power of the SC heuristic, computed by MBTE(i), is equivalent to the pruning power of the MC heuristic. More precisely:

Proposition (Equivalence of SC and MC for pruning). The pruning power of MBTE(i)-SC is equal to that of MBTE(i)-MC. Namely, if the partitioning structure used by MBTE-MC and MBTE-SC is identical, then MBTE(i)-MC(E) > 0 exactly when MBTE(i)-SC(E) = 0, where E is a set of instantiated variables. That is, MBTE(i)-MC allows pruning (> 0) iff MBTE(i)-SC allows pruning (= 0).

Overall it seems that SC is superior to MC because it allows not only pruning of domains but also value ordering. Consequently, unless MBTE-MC can be implemented more efficiently, MBTE-SC is superior overall, as will be validated by our experiments.

Approximating Solution-Count by IJGP

We will also use IJGP for approximating solution counts for each singleton assignment Xi = a. IJGP for solution counting is technically very similar to IJGP for the computation of beliefs in Bayesian networks (Dechter, Kask, & Mateescu 2002). For completeness, a formal description of IJGP(i)-SC is given in Figure 1.

IJGP(i)-SC takes, as input, a join-graph and an activation schedule which specifies the order in which messages are computed. It executes a number of IJGP iterations. The algorithm sends the same messages between neighboring clusters as CTE, using the SC model. At the end of iteration j it computes the distance ∆(j) between the messages computed during iteration j and the previous iteration j − 1. The algorithm uses this distance to decide whether IJGP(i)-SC is converging. The algorithm stops when either a predefined maximum number of iterations is exceeded (indicating that IJGP(i)-SC is not converging), the distance ∆(j) is not decreasing (IJGP(i)-SC is diverging), or ∆(j) is less than some predefined value (0.1), indicating that IJGP(i)-SC has reached a fixed point. Some of the more significant technical points are:

• As input, constraints are modelled by cost functions which assign 1 to combinations of values that are allowed, and 0 to nogoods.

• IJGP-SC diverges (solution count values computed by IJGP-SC may get arbitrarily large) and thus the solution count values computed by IJGP can be shown to be trivial upper bounds on the exact values. Also, in practice IJGP may suffer from double-precision overflow. To avoid that, we normalize all messages as they are computed. As a result, IJGP(i)-SC will compute, for each variable Xi, not solution counts but their ratios. For example, IJGP(i)-SC(X = a) = 0.4 means that in approximately 40% of the solutions, X = a. Therefore the approximated solution counts are no longer upper bounds. Still, when the solution count (ratio) computed by IJGP(i)-SC is 0, the true value is 0 as well, and therefore the corresponding value a of X can be pruned.

• Note, however, that since we use the solution counts only to create a variable and value ordering, we don't need to know the counts precisely. All we want is that the approximated solution counts be accurate enough to yield a value ordering as close as possible to that induced by the exact solution counts.
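The normalization and its sound use of zeros can be sketched in a few lines (the raw counts below are invented for illustration; only the normalize/prune/order logic mirrors the text):

```python
def normalize(counts):
    """Turn raw (possibly diverging) per-value solution counts into
    ratios, as IJGP(i)-SC does with its messages. Zeros stay zero,
    so pruning on a zero ratio remains sound."""
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()} if total else counts

# Hypothetical approximated counts for one variable's three values.
ratios = normalize({'a': 30.0, 'b': 0.0, 'c': 45.0})
pruned = [v for v, r in ratios.items() if r == 0]       # 'b' can be pruned
order = sorted((v for v in ratios if v not in pruned),
               key=lambda v: -ratios[v])                # largest ratio first
```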

As we commented earlier, it is easy to see that IJGP-SC zero values are sound.

THEOREM 0.1 (Correctness of IJGP-SC for 0's) (Dechter & Mateescu 2003) Whenever IJGP(i)-SC(Xi = a) = 0, the exact solution count SC(Xi = a) = 0 as well.

Approximating Min-Cost by IJGP

We do not use IJGP for approximating the Min-Conflict heuristic for the following reason. IJGP is an iterative algorithm that runs on a join-graph, not a join-tree. Therefore, as with IJGP-SC, IJGP-MC diverges (messages computed by IJGP-MC would get arbitrarily large). However, unlike the case with IJGP-SC, this means that the values computed by IJGP(i)-MC are not bounds (in the case of the MC task we need lower bounds), and therefore their usefulness as a heuristic is questionable. We could normalize the messages, as we did with IJGP(i)-SC, but unlike in the case of IJGP(i)-SC, heuristic value 0 cannot be used for pruning, and heuristic values greater than 0 are not guaranteed to be lower bounds, so they cannot be used for pruning either.

Backtracking algorithm with guiding heuristics

Backtracking with MBTE-MC

BB-MBTE(i)-MC is a simple Branch-and-Bound algorithm for constraint satisfaction that uses the approximated min-conflict computed by MBTE(i)-MC as a heuristic function. At each point in the search space it applies MBTE(i)-MC and prunes domains of variables that have a min-cost greater than 0. When choosing the next variable to instantiate, it chooses a variable with the smallest domain. Unfortunately, MBTE(i)-MC does not allow dynamic value ordering because all values are either pruned or have a heuristic value of 0.

Backtracking with IJGP-SC/MBTE-SC

BB-IJGP(i)-SC uses the approximated solution counts computed by IJGP(i)-SC as a heuristic function for guiding backtracking search. At each node in the search space it computes IJGP(i)-SC and prunes domains of variables that have a solution count of 0 (Theorem 0.1). When choosing the next variable to instantiate, it chooses a variable with the smallest domain computed at this point by IJGP(i)-SC, breaking ties by choosing a variable with the largest single solution count. The strength of IJGP(i)-SC is in value ordering: it chooses a value with the largest approximated solution count (fraction). BB-MBTE(i)-SC works like BB-IJGP(i)-SC except that the heuristic function is computed by MBTE(i)-SC.

Competing algorithms

Stochastic Local Search

For comparison, we also run one of the most successful greedy local search schemes for CSP. The stochastic local search algorithm we use is a basic greedy search algorithm that uses three heuristics to improve its performance:

1. Constraint weighting (Morris 1993). When the algorithm gets stuck in a local minimum, it re-weights constraints, which has the effect of changing the search space, eliminating the local minimum.

2. Dynamic restarts (Kask & Dechter 1995). When one try is executed, the program automatically determines when to quit the try and restart.

3. Tie-breaking according to historic information (Gent & Walsh 1993). When more than one flip yields the same change in the objective function, choose the one that was used the longest ago.
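The first of these heuristics, constraint weighting, can be sketched as follows (a minimal greedy search with breakout-style re-weighting in the spirit of Morris 1993; the function names and the example instance are ours, and the paper's SLS additionally uses restarts and historic tie-breaking):

```python
import random

def weighted_sls(variables, domains, constraints, max_flips=10000, seed=0):
    """Greedy local search minimizing the weighted count of violated
    constraints; when stuck in a local minimum, re-weight the violated
    constraints so the search surface changes and the minimum dissolves."""
    rng = random.Random(seed)
    weight = {i: 1 for i in range(len(constraints))}
    asg = {v: rng.choice(domains[v]) for v in variables}

    def violated(a):
        return [i for i, (scope, rel) in enumerate(constraints)
                if tuple(a[v] for v in scope) not in rel]

    def cost(a):
        return sum(weight[i] for i in violated(a))

    for _ in range(max_flips):
        if not violated(asg):
            return asg
        best = (cost(asg), None, None)
        for v in variables:
            for val in domains[v]:
                if val == asg[v]:
                    continue
                trial = dict(asg)
                trial[v] = val
                c = cost(trial)
                if c < best[0]:
                    best = (c, v, val)
        if best[1] is None:                  # local minimum: re-weight
            for i in violated(asg):
                weight[i] += 1
        else:
            asg[best[1]] = best[2]
    return None

# Usage: 3-color a triangle (all three pairs must differ).
rel = {(a, b) for a in range(3) for b in range(3) if a != b}
cons = [(('A', 'B'), rel), (('B', 'C'), rel), (('A', 'C'), rel)]
sol = weighted_sls(['A', 'B', 'C'], {v: [0, 1, 2] for v in 'ABC'}, cons)
```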

SLS algorithms, while incomplete, have been successfully applied to a wide range of automated reasoning


Algorithm IJGP(i)-SC

Input: A graph decomposition <JG, χ, ψ>, JG = (V, E), for CSP = <X, D, C>. Each constraint C(Sk) is represented by a cost function f(Sk) = 1 iff Sk ∈ Rk and 0 otherwise. Evidence variables I. Activation schedule d = (u1, v1), ..., (u_{2|E|}, v_{2|E|}).
Output: A solution count approximation for each singleton assignment X = a.

Denote by h(u,v) the message from vertex u to v in JG; cluster(u) = ψ(u) ∪ {h(v,u) | (v, u) ∈ E}; cluster_v(u) = cluster(u) excluding the message from v to u. Let h(u,v)(j) be h(u,v) computed during the j-th iteration of IJGP; δh(u,v)(j) = Σ_{sep(u,v)} (h(u,v)(j) − h(u,v)(j − 1)) / |h(u,v)(j)|, and ∆(j) = Σ_{dl ∈ d} δh_{dl}(j) / (2|E|).

1. Process observed variables: assign relevant evidence to all Rk ∈ ψ(u); χ(u) := χ(u) − I, ∀u ∈ V.

2. Repeat iterations of IJGP: along d, for each edge (ui, vi) in the ordering, compute
   h(ui,vi) = α Σ_{elim(ui,vi)} Π_{f ∈ cluster_{vi}(ui)} f.

3. Until: the maximum number of iterations is exceeded, or the distance ∆(j) is less than 0.1, or ∆(j) > ∆(j − 1).

4. Compute solution counts: for every Xi ∈ X let u be a vertex in JG such that Xi ∈ χ(u). Compute
   SC(Xi) = α Σ_{χ(u)−Xi} Π_{f ∈ cluster(u)} f.

Figure 1: Algorithm IJGP(i)-SC

problems. They have been more scalable than systematic complete methods, especially on random problems, and thus are the main competing algorithm.

The MAC algorithm

Maintaining arc consistency, or the MAC algorithm (Sabin & Freuder 1997), is one of the best performing algorithms for random binary CSPs that use arc-consistency look-ahead. It differs from chronological backtracking in the following three aspects:

1. The constraint network is initially made arc-consistent.

2. Every time a variable X is instantiated to a value v, the effects are propagated in the constraint network by treating the domain of X as {v}.

3. Every time an instantiation of a variable X to a value v is refuted, the network is made arc-consistent by removing v from the domain of X.
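The propagation step underlying all three aspects can be sketched with the classic AC-3 scheme (our own simplified sketch; the paper's implementation uses the more refined AC-7, and all names below are ours):

```python
from collections import deque

def revise(domains, constraints, x, y):
    """Remove values of x that have no supporting value in y (one arc)."""
    allowed = constraints.get((x, y))
    if allowed is None:
        return False
    keep = [a for a in domains[x] if any((a, b) in allowed for b in domains[y])]
    changed = len(keep) < len(domains[x])
    domains[x] = keep
    return changed

def ac3(domains, constraints):
    """Make a binary network arc-consistent; MAC re-runs this propagation
    after each instantiation (domain shrunk to {v}) or refutation."""
    queue = deque(constraints)
    while queue:
        x, y = queue.popleft()
        if revise(domains, constraints, x, y):
            if not domains[x]:
                return False            # domain wipe-out: inconsistent
            queue.extend((z, x) for (z, w) in constraints if w == x and z != y)
    return True

# Usage: X < Y over {1, 2, 3}, then an instantiation X = 2 as in MAC.
doms = {'X': [1, 2, 3], 'Y': [1, 2, 3]}
lt = {(a, b) for a in range(1, 4) for b in range(1, 4) if a < b}
cons = {('X', 'Y'): lt, ('Y', 'X'): {(b, a) for (a, b) in lt}}
ok = ac3(doms, cons)   # prunes X=3 and Y=1
```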

The performance of the basic MAC algorithm can be improved by using variable and value ordering heuristics during search. In our implementation (see footnote 2), we have used the dom/deg heuristic for variable ordering and the min-conflicts (MC) heuristic for value ordering. This combination was shown to perform the best on random binary CSPs (Bessiere & Regin 1996). The dom/deg heuristic selects as the next variable to be instantiated the variable that has the smallest ratio between the size of the remaining domain and the degree of the variable. The MC heuristic chooses the value that removes the smallest number of values from the domains of the future variables. Our implementation uses the AC-7 algorithm (Bessiere 1999) for maintaining arc-consistency.

2. The implementation is based on Tudor Hulubei's implementation available at http://www.hulubei.net/tudor/csp. We have augmented this implementation to include the dom/deg and MC heuristics.

Experimental Results

In this section, we report on experiments that examined the effect of adding our new lookahead schemes to the performance of the chronological backtracking algorithm. Because all the schemes are solution driven, we focused more on the soluble CSPs, and our results on the insoluble instances are incomplete. We compare the performance of our algorithms using parameters such as the CPU time and the number of backtracks. We also compare our algorithms with our implementations of the MAC algorithm and the stochastic local search algorithm described in the previous sections.

Problem Sets

So far we have experimented with randomly generated binary CSPs using Model B (MacIntyre et al. 1998). Model B can be defined by a standard 4-tuple <N, K, C, T> where N is the number of variables, K is the domain size of each variable, C is the number of pairs of variables that are involved in a constraint, and T is the number of pairs of values that are inconsistent for each constraint. When constructing a constraint graph using this model, we select C constraints uniformly at random from the possible N(N − 1)/2 constraints. Then, for each constraint, we select uniformly at random T pairs of values as nogoods from the possible K^2 pairs.

We generated four sets of random problem instances in the phase transition region having 100, 200, 500 and 1000 variables respectively using Model B. Note that researchers have indicated (Cheeseman, Kanefsky, & Taylor 1991; Smith 1994) that the hardest CSP instances appear in the phase transition region, and also that Model B generates harder problem instances than other models at the phase transition for binary CSPs (Smith 1994). The domain size for all instances was 4 and the constraint tightness was 4. These instances are typical of graph coloring problems although they are not as random. For each set, we systematically located the phase transition region by varying the number of constraints in increments of 5. We then selected four points in this range that correspond to a particular constraint density value and generated a number of random problem instances for each selected point.
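The Model B generation procedure can be sketched directly (function name and representation are ours; a fixed seed is used only to make the sketch reproducible):

```python
import random
from itertools import combinations, product

def model_b(n, k, c, t, seed=0):
    """Random binary CSP from Model B <N, K, C, T>: pick C constrained
    variable pairs uniformly from the N(N-1)/2 possible pairs, then for
    each pair pick T nogood value pairs uniformly from the K^2 possible
    value pairs. Returns a dict: (u, v) -> set of nogood value pairs."""
    rng = random.Random(seed)
    pairs = rng.sample(list(combinations(range(n), 2)), c)
    return {p: set(rng.sample(list(product(range(k), repeat=2)), t))
            for p in pairs}

csp = model_b(n=10, k=4, c=15, t=4)
```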

Results and Discussion

We will refer to the chronological backtracking algorithms that use the look-ahead schemes IJGP(i)-SC, MBTE(i)-SC and MBTE(i)-MC as IJGP(i)-SC, MBTE(i)-SC and MBTE(i)-MC respectively, where i is the i-bound used. Note that processing each search node is typically exponential in the i-bound used, and so we experimented only with i-bounds of 2, 3 and 4. Also, note that we have fixed the maximum number of iterations to 10 (see step 3 of Figure 1). This choice was rather arbitrary. All the experiments reported in this paper use a CPU time bound, i.e., if a solution is not found within this time bound, we record a time-out. Note that only those instances that were solved by at least one algorithm within the time bound are considered soluble instances, while those instances that were proved to be insoluble by at least one algorithm within the time bound are considered insoluble instances.

Experiments on the 100-variable-set All experiments on the 100-variable-set were run on a Pentium 2400 MHz machine with 1000 MB of RAM running version 9.0 of the Red Hat Linux operating system. The time bound was 500 seconds. We generated 200 instances each with 420, 430, 440 and 450 constraints and ran all the algorithms on these instances. We observed that the instances with 420 and 430 constraints lie in the under-constrained region of the phase transition, while the instances with 440 and 450 constraints lie in the over-constrained region. We observed that the large variation in cpu time and in the number of backtracks depended strongly on whether the instance was soluble or not. So we have decomposed our results into two subsets: the first consists of only the soluble instances, while the other consists of only the insoluble instances.

Table 1 shows the results for the soluble 100-variable instances. It is evident that algorithms with i-bound 2 dominate their counterparts with higher i-bounds in terms of cpu time. In general, for a given i-bound, IJGP(i)-SC was better than MBTE(i)-SC, which was in turn better than MBTE(i)-MC, in terms of cpu time and the number of backtracks. As expected, the number of backtracks decreases as the i-bound increases. While it is not reflected in the time measure, the number of backtracks required by our algorithms is significantly lower than that of MAC (Table 1).

Table 1 also shows the results for insoluble instances. From this table, we can see that for insoluble instances in the 100-variable-set, IJGP(2)-SC performs better than MBTE(2)-SC, which in turn is better than MBTE(2)-MC, both in terms of cpu time and the number of backtracks. MAC is the best performing algorithm on insoluble CSPs, both in terms of cpu time and the number of backtracks. Dechter and Mateescu (Dechter & Mateescu 2003) proved that IJGP(2) is identical to arc-consistency when run until convergence. However, we run IJGP(2)-SC for only 10 iterations, and thus IJGP(2)-SC prunes fewer values than the MAC algorithm. Moreover, MAC maintains arc-consistency both when a variable is instantiated and when a value is refuted, while we run IJGP(2) only when a variable is instantiated. Less pruning means more nodes explored and more backtracks, which is why the number of backtracks by IJGP(2)-SC is higher than that of MAC on insoluble CSPs. We believe that the poor performance of the algorithms with higher i-bounds is also due to the fact that we do not run them until convergence at each iteration.
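The instances used here are random binary CSPs with N variables, domain size K, C constraints and tightness T (forbidden value pairs per constraint). The paper does not spell out its generator, but a minimal sketch of a standard generator of this kind, with all names illustrative, might look like:

```python
import random

def random_binary_csp(n_vars, k, n_constraints, tightness, seed=0):
    """Sketch of a Model-B-style random binary CSP generator.

    Picks n_constraints distinct variable pairs and, for each pair,
    forbids exactly `tightness` of the k*k value combinations
    (mirroring the paper's T parameter).  Returns a dict mapping a
    variable pair to its set of forbidden value pairs.
    """
    rng = random.Random(seed)
    all_pairs = [(i, j) for i in range(n_vars) for j in range(i + 1, n_vars)]
    scopes = rng.sample(all_pairs, n_constraints)
    csp = {}
    for scope in scopes:
        tuples = [(a, b) for a in range(k) for b in range(k)]
        csp[scope] = set(rng.sample(tuples, tightness))
    return csp

# an analogue of the paper's 100-variable instances: K=4, T=4, C=430
csp = random_binary_csp(100, 4, 430, 4, seed=1)
```

Solubility of an instance drawn this way must still be determined by a complete solver, which is why the results above are split after the fact.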

Experiments on the 200-variable-set All experiments on the 200-variable-set were run on a Pentium 1700 MHz machine with 256 MB of RAM running version 9.0 of the Red Hat Linux operating system. The time-out used was 1800 seconds. We ran all the algorithms with i-bound 2 on 100 instances each with 840, 850, 860 and 870 constraints, respectively. Algorithms with higher i-bounds were found to be infeasible because of the higher cost, both in time and in space, required to process each node. We observed that instances with 840 and 850 constraints lie in the under-constrained region of the phase transition, while the instances with 860 and 870 constraints lie in the over-constrained region. We analyze the results on only soluble CSPs for the 200-variable-set because, as mentioned earlier, our look-ahead schemes are inherently designed for soluble CSPs. The results obtained here are similar to the 100-variable-set and are summarized in Table 2.

Page 151: Artificial Intelligence and Mathematics - Rutgers Universityrutcor.rutgers.edu/~amai/aimath04/AcceptedPapers/aimath-I.pdf · Artificial Intelligence and Mathematics January 4-6, 2004

                          IJGP(i)-SC           MBTE(i)-SC              MBTE(i)-MC
C           Quartile      i=2    i=3    i=4    i=2     i=3     i=4     i=2     i=3     i=4     SLS    MAC

Time for soluble instances
420.0(163)  1st quartile  1.2    1.8    3.5    3.7     4.6     8.7     2.3     3.0     5.2     0.1    0.1
            median        1.2    1.9    3.9    4.1     5.6     9.9     4.7     4.5     5.9     0.2    0.2
            3rd quartile  1.7    3.2    6.0    53.4    45.8    65.1    14.9    20.3    22.8    0.3    0.3
430.0(109)  1st quartile  1.2    1.9    3.6    3.9     5.0     9.0     5.3     3.2     5.0     0.1    0.1
            median        1.3    2.1    4.2    14.4    15.8    16.1    13.0    7.1     15.8    0.2    0.2
            3rd quartile  5.1    4.8    7.0    87.7    113.7   80.2    41.4    29.6    48.7    0.5    0.4
440.0(85)   1st quartile  1.3    2.0    3.6    4.5     10.4    8.7     4.7     4.1     6.9     0.1    0.2
            median        1.4    2.1    7.0    29.0    30.2    20.9    9.9     20.4    20.7    0.3    0.4
            3rd quartile  2.4    20.9   23.1   88.7    68.9    144.6   150.2   69.0    57.1    1.1    0.6
450.0(43)   1st quartile  1.3    2.0    4.6    4.2     6.5     10.0    53.4    53.4    84.1    0.0    0.3
            median        1.4    2.2    11.2   9.3     38.9    287.2   134.9   57.8    89.5    0.7    0.6
            3rd quartile  9.5    8.8    32.7   356.4   209.7   436.3   290.7   217.3   218.1   1.8    0.7

Backtracks for soluble instances (last column: MAC)
420.0(163)  1st quartile  0.0    0.0    0.0    100.0   100.0   100.8   1.8     0.0     0.0     38.8
            median        0.0    0.0    0.0    111.0   103.0   115.0   83.0    29.0    10.0    44.0
            3rd quartile  21.0   51.0   42.0   871.0   481.0   455.0   338.0   287.0   126.0   54.0
430.0(109)  1st quartile  0.0    0.0    0.0    100.8   100.0   100.0   55.3    3.3     2.3     32.3
            median        2.0    3.0    2.0    261.0   267.0   155.0   243.0   62.0    92.0    40.0
            3rd quartile  82.0   68.5   60.0   1200.5  1258.0  485.5   838.5   334.5   415.0   58.5
440.0(85)   1st quartile  0.0    0.0    0.8    117.5   171.0   102.3   58.8    15.3    31.5    42.0
            median        2.0    39.0   132.0  485.5   331.0   334.5   415.0   48.5    140.0   73.0
            3rd quartile  69.0   75.8   228.0  1334.0  849.0   565.0   1085.0  386.0   362.3   109.0
450.0(43)   1st quartile  0.0    0.0    17.0   102.0   109.0   116.0   687.0   484.0   503.0   39.0
            median        0.0    0.0    69.0   178.0   400.0   1550.0  1673.0  565.0   531.0   58.0
            3rd quartile  162.8  89.0   291.5  4213.8  2033.8  2278.5  5337.8  1823.3  1379.5  75.8

Time for insoluble instances (last column: MAC)
420.0(37)   1st quartile  46.0   70.6   107.7  237.5   183.3   290.3   274.1   192.9   202.1   0.3
            median        76.6   139.9  168.7  392.6   327.9   398.2   398.4   223.2   276.4   0.4
            3rd quartile  99.1   143.5  261.2  475.2   486.2   500.0   500.0   331.0   363.2   0.5
430.0(91)   1st quartile  24.0   32.8   86.9   158.3   167.1   242.5   104.9   88.7    116.3   0.6
            median        44.0   62.7   107.9  276.6   231.7   306.7   148.0   134.1   171.1   0.7
            3rd quartile  59.6   71.5   154.2  415.5   430.2   500.0   246.8   187.6   298.6   0.8
440.0(115)  1st quartile  19.4   30.2   56.2   106.4   108.1   154.0   97.8    92.7    105.9   0.1
            median        26.2   43.4   80.1   211.5   232.6   311.8   164.1   141.4   161.3   0.4
            3rd quartile  42.0   62.0   143.6  371.8   382.0   500.0   223.5   174.7   263.0   0.7
450.0(157)  1st quartile  15.9   23.2   46.9   144.4   155.4   216.2   80.5    79.4    102.3   0.2
            median        30.2   38.2   90.4   230.6   231.6   327.3   132.0   124.4   156.5   0.4
            3rd quartile  43.5   68.2   113.0  438.0   428.4   495.5   299.5   256.6   292.4   0.6

Backtracks for insoluble instances (last column: MAC)
420.0(37)   1st quartile  886.0  1000.5 895.0  1814.0  1402.5  1194.3  3647.3  2266.0  1418.3  57.8
            median        1501.0 1718.5 1415.0 4870.0  2870.0  2211.0  4463.5  2577.0  1764.0  92.5
            3rd quartile  2152.0 2067.0 2064.0 7253.0  5312.0  3054.0  34533.0 3044.0  2598.0  106.0
430.0(91)   1st quartile  606.5  422.0  630.8  1565.8  1358.5  1242.5  1537.8  1001.8  856.8   53.3
            median        944.0  762.5  792.5  2818.5  1829.5  1415.0  2038.0  1599.0  1350.0  61.0
            3rd quartile  1394.0 1083.0 1398.5 4664.5  3434.5  2512.5  3782.0  2116.5  2215.5  68.5
440.0(115)  1st quartile  395.0  354.0  358.0  991.0   732.0   652.0   1149.0  920.0   704.0   34.0
            median        565.0  549.0  617.0  2332.0  1861.0  1549.0  2068.0  1460.0  1048.0  67.0
            3rd quartile  910.8  793.3  881.0  3795.8  2894.5  2181.3  2790.3  1828.3  1695.5  89.5
450.0(157)  1st quartile  293.0  245.5  257.0  1210.5  953.5   760.5   920.0   711.0   550.0   27.5
            median        573.0  420.0  481.0  2090.0  1622.0  1429.0  1756.0  1131.0  918.0   52.0
            3rd quartile  1003.3 845.5  706.8  3698.5  2863.8  1868.5  3483.8  2578.3  1791.5  79.5

Table 1: Time in seconds and number of backtracks taken by the various algorithms on 100-variable problems with K=4, T=4. C: number of constraints; i: the i-bound used. The quantity in brackets alongside each constraint count is the number of instances on which the results are based.

                          IJGP(2)-SC       MBTE(2)-SC       MBTE(2)-MC       MAC              SLS
C          Quartile       T       B        T       B        T       B        T      B         T

840.0(72)  1st quartile   17.0    0.0      58.6    6.5      73.7    201.5    0.5    102.5     2.7
           median         18.2    3.0      214.6   643.0    272.6   489.0    1.3    303.0     6.9
           3rd quartile   159.0   898.0    1800.0  3334.0   1800.0  2811.0   2.9    729.0     13.3
850.0(59)  1st quartile   18.1    1.5      356.9   266.5    531.9   1473.0   0.3    135.0     2.1
           median         24.9    48.0     1800.0  3097.0   1800.0  2301.0   0.8    265.0     12.5
           3rd quartile   182.2   1589.0   1800.0  4542.0   1800.0  2814.0   3.3    1044.8    53.5
860.0(37)  1st quartile   27.6    32.5     1024.9  1416.8   689.4   1495.5   0.6    226.5     0.7
           median         135.7   656.5    1800.0  2844.5   1800.0  2068.5   0.9    312.0     7.3
           3rd quartile   973.4   4373.0   1800.0  3604.0   1800.0  2543.0   1.9    508.8     22.1
870.0(22)  1st quartile   35.7    3.75     1800.0  2466     643.6   1249.25  1.2    443       34.8
           median         83.8    358      1800.0  3451     1800.0  1657.5   2.7    887.5     50.0
           3rd quartile   549.2   2341     1800.0  3730     1800.0  2077     3.7    1924      114.9

Table 2: Time in seconds and number of backtracks made by the various algorithms on soluble 200-variable problems with K=4, T=4. C: number of constraints; B: number of backtracks; T: time in seconds; i: the i-bound used. The quantity in brackets alongside each constraint count is the number of instances on which the results are based.


                          IJGP(2)-SC         MAC                 SLS
C           Quartile      T        B         T       B           T

2040.0(41)  1st quartile  126.2    0.0       23.1    7844.0      31.7
            median        276.7    404.0     45.1    14987.0     48.4
            3rd quartile  638.3    1280.5    187.4   65446.5     94.2
2060.0(30)  1st quartile  128.8    2.0       72.9    22708.3     44.8
            median        180.9    109.0     124.9   38507.5     184.5
            3rd quartile  1800.0   11870.0   480.5   129258.0    1195.9
2080.0(24)  1st quartile  128.3    2.8       171.2   56792.3     82.7
            median        326.5    479.0     497.9   131492.0    261.5
            3rd quartile  1800.0   5570.5    498.2   135240.3    376.9
2100.0(18)  1st quartile  1665.4   3735.0    13.3    3939.0      558.6
            median        1800.0   12791.0   113.2   31759.0     988.5
            3rd quartile  1800.0   17237.0   270.7   74156.3     1503.8

Table 3: Time in seconds and number of backtracks made by the various algorithms on soluble 500-variable problems with K=4, T=4. C: number of constraints; B: number of backtracks; T: time in seconds; i: the i-bound used. The quantity in brackets alongside each constraint count is the number of instances on which the results are based.

Experiments on the 500-variable-set All experiments on the 500-variable problems were run on a Pentium 2400 MHz machine with 1000 MB of RAM running the Red Hat Linux operating system. The time-out used was 7200 s. On the 500-variable-set, we report results on IJGP(2)-SC, SLS and MAC. MBTE(2)-SC and MBTE(2)-MC were able to solve only 3 and 6 problems, respectively, out of the 400 problems considered within the stipulated time bound, so we do not report results on these algorithms. We observed that the phase transition for the 500-variable-set occurs when the number of constraints is in the range 2040-2100.

[Scatter plot: time (s) taken by SLS vs. time (s) taken by IJGP(2)-SC, one point per instance; N=500, T=4, K=4.]

Figure 2: Results comparing IJGP(2)-SC and SLS for soluble 500-variable problems with K=4, T=4.

Figures 2 and 3 show scatter plots of the time taken by IJGP(2)-SC vs. SLS and by IJGP(2)-SC vs. MAC, respectively, while Table 3 gives a summary of results for SLS, MAC and IJGP(2)-SC. From Table 3 and Figures 2 and 3, we observe that SLS and MAC are only slightly better than IJGP(2)-SC in terms of time. Once again, note that the number of backtracks performed by IJGP(2)-SC is significantly smaller than that of MAC.

[Scatter plots: number of backtracks by MAC vs. IJGP(2)-SC, and time (s) taken by MAC vs. IJGP(2)-SC, one point per instance; N=500, T=4, K=4.]

Figure 3: Results comparing IJGP(2)-SC and MAC for soluble 500-variable problems with K=4, T=4.

Experiments on the 1000-variable-set On the 1000-variable-set, we report results on IJGP(2)-SC and MAC. Our SLS implementation timed out on all the problems in the 1000-variable-set. We must acknowledge that we are using a sub-optimal implementation of SLS, and better results could be obtained with a better implementation of SLS. All experiments on the 1000-variable-set were run on a Pentium 2400 MHz machine with 1000 MB of RAM running the Red Hat Linux operating system. The time-out used was 7200 s. The number of constraints was varied between 4000 and 4100, which corresponds to the phase-transition region. Figure 4 shows scatter plots of the time and the number of backtracks taken by IJGP(2)-SC and MAC. We can see that IJGP(2)-SC is better than MAC both in terms of cpu time and the number of backtracks. This result clearly indicates that IJGP(2)-SC scales better than MAC. To summarize, we found that in general IJGP(i)-SC consistently performs better than MBTE(i)-SC, which in turn consistently performs better than MBTE(i)-MC, both in terms of cpu time and the number of backtracks. On the other hand, we found that algorithms that use a lower i-bound outperform those that use a higher i-bound in terms of cpu time, but not in terms of the number of backtracks. IJGP(2)-SC was the best performing algorithm among


[Scatter plots: number of backtracks performed by MAC vs. IJGP(2)-SC, and time in seconds by MAC vs. IJGP(2)-SC, one point per instance; 1000-variable random CSPs with K=4, T=4.]

Figure 4: Results comparing IJGP(2)-SC and MAC for soluble 1000-variable problems with K=4, T=4.

our look-ahead schemes in terms of cpu time. More importantly, we found that IJGP(2)-SC was better than MAC in terms of cpu time on larger problems (1000 variables), while it was inferior to SLS and MAC on smaller problems (100, 200 and 500 variables). In other words, our results show that IJGP(2)-SC is more scalable than MAC. Also, since the MAC implementation is highly optimized and the implementations of our look-ahead schemes are sub-optimal, we find our results very encouraging.

Summary and Conclusions

As we claim, the solution-count measure is stronger than the min-conflicts heuristic when everything else remains fixed, since, using the same bound i, they have identical pruning power but SC also provides value ordering. This heuristic is designed to be effective when the problem has a solution. We demonstrated empirically that, indeed, for MBTE, MBTE-SC is stronger than MBTE-MC. Still, it is possible to implement MBTE-MC as a strictly propagation algorithm with relation description. This is likely to be much more efficient and therefore may present a worthwhile time-accuracy trade-off that is not dominated by MBTE-SC. We plan to investigate this in the future.
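To make the distinction concrete, the following schematic (not the paper's implementation; the per-value numbers would come from a scheme such as MBTE or IJGP, and the exact min-conflicts semantics are an assumption here) contrasts the two heuristics: both can prune, but only the solution-count measure also ranks the surviving values.

```python
def order_values_sc(counts):
    """Solution-count (SC) ordering: discard values whose approximate
    solution count is 0, and try the rest from highest to lowest count
    (ties broken by value)."""
    return sorted((v for v, c in counts.items() if c > 0),
                  key=lambda v: (-counts[v], v))

def order_values_mc(conflicts):
    """Min-conflicts (MC) ordering: try values from fewest to most
    approximate conflicts; a value is pruned only when every extension
    is in conflict (signalled here by None)."""
    return sorted((v for v, c in conflicts.items() if c is not None),
                  key=lambda v: (conflicts[v], v))

# toy per-value approximations for one variable's domain {0, 1, 2}
counts = {0: 12, 1: 0, 2: 3}          # SC: prune 1, prefer 0 over 2
conflicts = {0: 1, 1: None, 2: 0}     # MC: prune 1, prefer 2 over 0
```

With identical pruning (value 1 is removed by both), the orderings can still differ, which is the sense in which SC "also provides value ordering".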

The next question is whether IJGP is more cost-effective than MBTE for SC approximation. Our empirical evaluation strongly suggests that IJGP-SC is much stronger and dominates MBTE-SC, both in pruning power, as shown over inconsistent instances (see Table 1), and in its informativeness and guidance to the solution, as shown over consistent instances (see Table 1). However, there is no complete dominance.

We see that the "focus" power of IJGP-SC for value ordering is very strong even for i=2 (the median is low even in the phase transition). It is much stronger than MBTE-SC. See in particular the results for 500 and 1000 variables.

Finally, our experiments show that IJGP-SC has better scalability than MAC and SLS (at least relative to our implementation). Comparing IJGP-SC with MAC/SLS, we observed that MAC/SLS are superior on small problems, but as the problem size grows the relative performance of IJGP-SC improves, and on 1000-variable problems IJGP-SC outperforms MAC. This is even more significant since our implementation is far from optimal and we are using the heuristic on top of chronological backtracking, without any backjumping or constraint recording.

References

Bessiere, C., and Regin, J.-C. 1996. MAC and combined heuristics: Two reasons to forsake FC (and CBJ?) on hard problems. In Principles and Practice of Constraint Programming, 61–75.

Cheeseman, P.; Kanefsky, B.; and Taylor, W. M. 1991. Where the really hard problems are. In Proceedings of the Twelfth International Joint Conference on Artificial Intelligence, IJCAI-91, Sydney, Australia, 331–337.

Bessiere, C.; Freuder, E. C.; and Regin, J.-C. 1999. Using constraint metaknowledge to reduce arc consistency computation. Artificial Intelligence 125–148.

Dechter, R., and Mateescu, R. 2003. A simple insight into iterative belief propagation's success. In UAI-2003.

Dechter, R., and Pearl, J. 1989. Tree clustering for constraint networks. Artificial Intelligence 353–366.

Dechter, R.; Kask, K.; and Larrosa, J. 2001. A general scheme for multiple lower-bound computation in constraint optimization. In Principles and Practice of Constraint Programming (CP-2001).

Dechter, R.; Kask, K.; and Mateescu, R. 2002. Iterative join graph propagation. In UAI '02, 128–136. Morgan Kaufmann.

Gent, I. P., and Walsh, T. 1993. Towards an understanding of hill-climbing procedures for SAT. In Proceedings of the Eleventh National Conference on Artificial Intelligence (AAAI-93), 28–33.

Gottlob, G.; Leone, N.; and Scarcello, F. 1999. A comparison of structural CSP decomposition methods. In IJCAI-99.

Kask, K., and Dechter, R. 1995. GSAT and


local consistency. In International Joint Conference on Artificial Intelligence (IJCAI-95), 616–622.

Lauritzen, S., and Spiegelhalter, D. 1988. Local computation with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B 50(2):157–224.

MacIntyre, E.; Prosser, P.; Smith, B.; and Walsh, T. 1998. Random constraint satisfaction: Theory meets practice. Lecture Notes in Computer Science 1520:325+.

Morris, P. 1993. The breakout method for escaping from local minima. In Proceedings of the Eleventh National Conference on Artificial Intelligence (AAAI-93), 40–45.

Sabin, D., and Freuder, E. C. 1997. Understanding and improving the MAC algorithm. In Principles and Practice of Constraint Programming, 167–181.

Shafer, G., and Shenoy, P. 1990. Probability propagation. Annals of Mathematics and Artificial Intelligence 2:327–352.

Smith, B. 1994. The phase transition in constraint satisfaction problems: A closer look at the mushy region. In Proceedings ECAI'94.


Learning via Finitely Many Queries

Andrew C. Lee
University of Louisiana at Lafayette¹

Abstract

This work introduces a new query inference model that can access data and communicate with a teacher by asking finitely many boolean queries in a language L. In this model the parameters of interest are the number of queries used and the expressive power of L. We study how the learning power varies with these parameters. Preliminary results suggest that this model can help in studying query inference in a resource-bounded environment.

1 Introduction

In computability-theoretic learning theory, concepts are modelled as collections of computable functions from N to N, where N is the set of all natural numbers. An inference machine M is simply a total Turing machine. Given a concept S and f ∈ S, M takes finite initial segments (0, f(0)), ..., (n, f(n)) of f (abbreviated σ_n(f)) as input and outputs (the Gödel number of) a program that tries to compute f. f is said to be EX-learnable by M [Gol67] if

lim_{n→∞} M(σ_n(f)) = e    (1)

and the program e computes f. S is said to be EX-learnable by M when every f ∈ S is EX-learnable by M. The object

EX = {S : (∃M)[S is EX-learnable by M]}    (2)

is the inference type associated to the learning criterion EX. Denote M(σ_n(f)) by e_n. Suppose that the following condition (3), which is less restrictive than condition (1), is satisfied:

(∃n_0)(∀n)[(n ≥ n_0) → program e_n computes f]    (3)

Then M does learn the function f semantically, although the limit lim_{n→∞} e_n may not exist. The learning model that uses criterion (3) is referred to as behaviorally correct learning. Its inference type is denoted by BC. Note that both EX and BC learners (and learners for many other basic inference criteria [CS83] as well) learn passively by observing data.
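As a concrete, if toy, illustration of criterion (1): the sketch below EX-learns the class of eventually-zero functions. Its conjectures are finite tables with default value 0 (standing in for Gödel numbers), and they change only when a new nonzero value appears, so they stabilize in the limit. The class and the encoding are illustrative choices, not from the paper.

```python
def learner(prefix):
    """An inference machine for the class of eventually-zero functions:
    on input (f(0), ..., f(n)) it outputs a 'program' -- here a frozen
    table of the nonzero points -- that agrees with the prefix and is 0
    elsewhere.  The hypothesis changes only when a new nonzero value is
    seen, so on any eventually-zero f the conjectures converge."""
    table = {i: v for i, v in enumerate(prefix) if v != 0}
    return frozenset(table.items())      # stands in for a Goedel number

def run(program, x):
    """Execute a conjectured 'program' on input x."""
    return dict(program).get(x, 0)

# target function: f(3) = 7 and f is 0 elsewhere
f = lambda x: 7 if x == 3 else 0
hyps = [learner([f(i) for i in range(n + 1)]) for n in range(10)]
# from n = 3 on, the conjecture never changes: EX-convergence
```

A BC-learner for the same class could instead emit a syntactically new but semantically equivalent table at every stage, which is exactly the gap between criteria (1) and (3).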

By contrast, an active way of learning is to ask questions. Gasarch and Smith [GS92] studied learning via queries in the computability-theoretic setting. Roughly speaking, a query inference machine (abbreviated QIM) is an inference machine that learns concepts by actively asking questions of a teacher, who provides correct answers to these questions. Questions are boolean queries, formulated in a query language L, about the function f currently under investigation. These questions represent the extra knowledge we can have while learning functions from the given concept. For any passive inference type (i.e. inference without using queries) I (e.g. I = EX, BC, etc.), the corresponding query inference type is QI(L):

QI(L) = {C : (∃M)[C is I-learnable by a QIM M via the query language L]}    (4)

The query language L is a parameter that may affect the learning power. In this work we refine the query inference model. We introduce the bounded query inference types QI(L)⟨k⟩. The quantity

¹ Computer Science Department, University of Louisiana at Lafayette, Box 41771, Lafayette, Louisiana 70504-1771. Email: [email protected]


Page 156: Artificial Intelligence and Mathematics - Rutgers Universityrutcor.rutgers.edu/~amai/aimath04/AcceptedPapers/aimath-I.pdf · Artificial Intelligence and Mathematics January 4-6, 2004

k (k ∈ N) denotes the number of queries allowed and is another parameter of interest. The additional learning power gained by a query inference machine can be measured either quantitatively, by the number of queries made, or qualitatively, by the expressive power of the language L.

The main point of this paper is to introduce this framework and demonstrate its potential for measuring the two types of resources mentioned above in query inference. Our preliminary results try to answer the following basic questions:

• Do more queries always help? We show (Corollaries 3.6 and 3.7) that for many passive inference types I, more queries always help, and this fact is independent of the query language used.

• Can one trade 'quantity' for 'quality'? We show that in our query inference model (Corollary 3.8) enriching a query language cannot reduce the number of queries used in the learning process.

Asking k boolean questions yields at most 2^k possible sequences of answers. Hence any inference process that uses k boolean queries can be simulated by a team of 2^k inference machines (see Section 3 for definitions related to team inference). It is natural to ask the following question:

• Can team inference be simulated by bounded query inference? In particular, we ask whether a team inference process by 2^k learners (without using queries) can be simulated by a single machine using at most k boolean queries in a language L. We show that when using the language [+, ×] one can simulate the team learning process. However, the simulating machine asks questions that are unrelated to the concepts it wishes to learn. For query languages that are reducible to [Succ, <]² ([GH95, GL97]), our results indicate that a bounded query inference machine cannot even simulate a team of size two, regardless of the number of questions asked.

There has been much research on computability-theoretic learning theory. We refer interested readers to [AS83, JORS99, Smi94, GS95, GS97] and the references cited therein for surveys of many major developments in this research area.

2 Notation and Basic Definitions

We assume familiarity with the basic notions of logic ([End72]) and computability theory ([Rog67, Soa87]). {ϕ_e}_{e∈N} denotes an arbitrary acceptable numbering of the partial computable functions. Throughout this paper COMP denotes the collection of all computable functions from N to N. COMP_{0,1} denotes its restriction to {0, 1}-valued computable functions; we identify COMP_{0,1} with the collection of all computable subsets of N.

2.1 Query Languages and Inference Models

A query language L consists of the usual logical symbols with equality, quantifiers, symbols for first-order variables, symbols for every element of N, and a special symbol F denoting the function we wish to learn. Extra symbols for some functions and relations on N may be included, and the query language will be denoted by these symbols. For example, [<] denotes the query language with the relation <, and [+, <] denotes the query language with the relation < and the function +. We use [∅] to denote the query language with no extra symbols. Most of the query languages we consider are first order, and we use a superscript 2 when we allow set variables and their quantifications. For example, [Succ, <]² is the query language with the successor function (Succ(x) = x + 1) and the relation <; in addition, it contains set variables and allows quantification over them. Small (resp. capital) letters are used for number (resp. set) variables, which range over N (resp. subsets of N). We consider the language [Succ, <]² in Section 4.


Page 157: Artificial Intelligence and Mathematics - Rutgers Universityrutcor.rutgers.edu/~amai/aimath04/AcceptedPapers/aimath-I.pdf · Artificial Intelligence and Mathematics January 4-6, 2004

Throughout this paper, L denotes a reasonable query language; that is, all the symbols in L represent computable operations. A few examples of queries in various languages are given below:

  L        query                                                   interpretation
  [∅]      (∀y)(∃x)[f(x) = y]                                      'Is f surjective?'
  [<]      (∃x)(∀y)[x < y → f(y) = 0]                              'Does f have finite support?'
  [+, <]   (∃x)(∃p)(∀y)[(x < y) ∧ (p > 0) → f(y) = f(y + p)]       'Is f eventually periodic?'
  [+, ×]   'Is x ∈ K?'                                             'Does ϕ_x converge on input x?'

Note that K is the halting set². Queries about sets can be defined similarly. Given a query language L, one may enrich it to L′ by introducing extra symbols; we call L′ an extension of L. For any two query languages L1 and L2, when all the extra symbols in L1 can be interpreted in the language L2, then L2 is at least as expressive as L1, and we write L1 ⪯ L2. It is known [End72] that [∅] ⪯ [<] ⪯ [+, <] ⪯ [+, ×]. Also, [Succ, <]² is at least as expressive as [Succ, <] because we allow set variables and their quantifications.

In the original model [GS92], a query inference machine obtains all of its information by making queries to a teacher and cannot access the data directly. It can, however, request whatever data it wants via queries (e.g., a QIM can find out what f(13) is by asking the questions 'f(13) = 0?', 'f(13) = 1?', etc., until a YES answer is obtained). In this work we consider a QIM which can access data and ask a finite number of queries.

Definition 2.1 Let k ∈ N ∪ {∗}. Let M_k be the collection of QIMs which can access the data values but are allowed to ask at most k logical queries in L. The bounded query inference type QI(L)⟨k⟩ is defined as the set

QI(L)⟨k⟩ = {C : (∃M ∈ M_k)[C is I-learnable by M]}

Note that QI(L)⟨0⟩ = I, and QI(L)⟨∗⟩ denotes the inference type when the machine can ask finitely many questions. Also, QI(L)⟨∗⟩ ⊆ QI(L). It follows that we have the following inference type hierarchy:

QI(L)⟨0⟩ ⊆ QI(L)⟨1⟩ ⊆ ... ⊆ QI(L)⟨k⟩ ⊆ ... ⊆ QI(L)⟨∗⟩    (5)

3 Main Results

Let I be any inference criterion and let 1 ≤ k ≤ n. The team inference type [k, n]I ([Smi82]) is defined as the set

[k, n]I = {C : (∃M_1) ... (∃M_n)(∀f ∈ C)[at least k out of the n M_i's learn f via the criterion I]}    (6)

Note that for any f_1, f_2 ∈ C (f_1 ≠ f_2), the sub-collections of machines that learn each of them may be different.

Observe that asking k questions yields at most 2^k possible sequences of answers. Hence any inference process that uses k boolean queries can be simulated by a team of 2^k inference machines. That is,

² The query language [+, ×] can express the question 'x ∈ A?' for any computably enumerable set A [Mat93]. Note the use of this result in query inference [GS92].


Page 158: Artificial Intelligence and Mathematics - Rutgers Universityrutcor.rutgers.edu/~amai/aimath04/AcceptedPapers/aimath-I.pdf · Artificial Intelligence and Mathematics January 4-6, 2004

Theorem 3.1 For any passive inference type I and for any k ≥ 1, QI(L)⟨k⟩ ⊆ [1, 2^k]I.
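The simulation behind Theorem 3.1 is purely combinatorial and can be sketched directly: for each of the 2^k possible answer vectors, build one passive machine that behaves like the query machine under the assumption that the teacher would give exactly those answers; the member whose hard-wired answers happen to be correct learns f. In the schematic below the query machine and its single query are stubs, and all names are illustrative.

```python
from itertools import product

def team_from_query_machine(query_machine, k):
    """Simulate a k-query learner by a team of 2^k passive learners.
    Each team member fixes one of the 2^k possible answer vectors and
    feeds it to the query machine in place of the teacher's replies."""
    team = []
    for answers in product([False, True], repeat=k):
        def member(prefix, answers=answers):
            # a passive learner: sees only the data prefix, never asks
            return query_machine(prefix, list(answers))
        team.append(member)
    return team

# stub 1-query machine: its query is 'is f(0) even?', and its
# conjecture depends on the (claimed) answer
def qim(prefix, answers):
    return 'even-hypothesis' if answers[0] else 'odd-hypothesis'

team = team_from_query_machine(qim, 1)
# the member whose hard-wired answers match the true teacher replies
# produces exactly the conjectures the query machine would have produced
```

The team member with the correct answer vector is indistinguishable from the query machine talking to the real teacher, which gives the [1, 2^k]I containment.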

Our main result compares the bounded query inference types with team inference types that allow errors. Let a ∈ N and f, g : N → N. We write f =^a g when the set {x : f(x) ≠ g(x)} has at most a elements. An inference machine M is said to BC^a-learn f ([CS83, Smi82]) if M, fed f, outputs over time an infinite sequence of programs p_0, p_1, ..., such that for all but finitely many n, ϕ_{p_n} =^a f. The inference type BC^a is the set

BC^a = {S : (∃M)[S is BC^a-learnable by M]}    (7)

By combining (6) and (7) one can consider the team inference types [k, n]BC^a (1 ≤ k ≤ n). We also need the following technical definitions in our proof.

Definition 3.2 Let f : N → N and let τ be a finite initial segment of f. Let dom(τ) (resp. rng(τ)) denote the domain (resp. range) of τ. φ_k(f) (k ≥ 1) denotes the query

φ_k(f) = (∀y)(∃x_1)(∃x_2) ... (∃x_k)[ ⋀_{i≠j} (x_i ≠ x_j) ∧ ⋀_{i=1}^{k} f(x_i) = y ].    (8)

We say that f is k-fold surjective if φ_k(f) is true. In addition, f is exactly k-fold surjective if φ_k(f) is true but φ_{k+1}(f) is false. We also use the notation φ_k(τ). We say that φ_k(τ) is true if

(∀y ∈ rng(τ))(∃x_1 ∈ dom(τ)) ... (∃x_k ∈ dom(τ))[ ⋀_{i≠j} (x_i ≠ x_j) ∧ ⋀_{i=1}^{k} f(x_i) = y ].    (9)

Fact 3.3 If the query φ_k(f) (resp. φ_k(σ)) is true, then φ_j(f) (resp. φ_j(σ)) is true for all j ≤ k. φ_k(f) (∀k ∈ N) does not use any extra symbols and hence can be expressed in any query language.
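Over a finite initial segment, condition (9) is decidable and amounts to counting preimages; the downward monotonicity in Fact 3.3 is then immediate. A small sketch (helper names are illustrative):

```python
from collections import Counter

def phi_k(tau, k):
    """Decide the finite-segment query phi_k(tau), with tau given as
    the list [tau(0), ..., tau(n)]: every value in rng(tau) must have
    at least k distinct preimages in dom(tau)."""
    preimages = Counter(tau)
    return all(count >= k for count in preimages.values())

def exactly_k_fold(tau, k):
    """tau is exactly k-fold surjective: phi_k holds, phi_{k+1} fails."""
    return phi_k(tau, k) and not phi_k(tau, k + 1)

tau = [0, 1, 0, 1, 0, 1]   # every value in rng(tau) is hit 3 times
```

For this tau, phi_k(tau, j) holds for j ≤ 3 and fails for j = 4, illustrating both Definition 3.2 and Fact 3.3.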

The following are generalizations of notions introduced in [Smi82].

Definition 3.4 Let f : N → N, x ∈ N, and let Y be an infinite computable subset of N. We use n_Y(x) to denote the smallest element in Y that is greater than x.

a) The dips of f in Y is the set D_Y(f) = {n_Y(x) : x ∈ Y ∧ f(n_Y(x)) < f(x)}.
b) If σ and τ are initial segments of f where τ extends σ, then τ monotonically extends σ along Y if f is monotone nondecreasing when restricted to the domain [dom(τ) − dom(σ)] ∩ Y.
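Restricted to a finite portion of Y, Definition 3.4 a) is directly computable. A sketch, under the assumption that we only inspect a finite sorted list of elements of Y (all names illustrative):

```python
def n_Y(x, Y):
    """Smallest element of the (finite, sorted) list Y greater than x."""
    return min(y for y in Y if y > x)

def dips(f, Y):
    """D_Y(f) restricted to a finite sorted list Y: the points n_Y(x)
    at which f 'dips' below its value at the previous element x of Y."""
    out = set()
    for x in Y[:-1]:              # keep n_Y(x) inside the finite list
        nx = n_Y(x, Y)
        if f(nx) < f(x):
            out.add(nx)
    return out

# f decreases along Y = [0, 2, 4, 6, 8] at the points 4 and 8
f = {0: 5, 2: 5, 4: 3, 6: 3, 8: 1}.get
```

In the construction below, the self-referential classes C_i pin an index of f to the value f takes at the last dip in Y_i, which is what the learner ultimately outputs.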

Theorem 3.5 For any query language L and for any k ≥ 1,

QEX(L)⟨k⟩ − ⋃_{a=0}^{∞} [1, 2^k − 1]BC^a ≠ ∅.

Proof: It suffices to show that for any k ≥ 1 and a ≥ 0, there is a concept S(k, a) such that S(k, a) ∈ QEX([∅])⟨k⟩ but S(k, a) ∉ [1, 2^k − 1]BC^a.

Construction of S(k, a): Partition N into 2^k infinite computable subsets Y_1, Y_2, ..., Y_{2^k}. By an implicit use of the recursion theorem (see [CR94]), we may let C_i = {f ∈ COMP : f is exactly i-fold surjective and ϕ_{f(max(D_{Y_i}(f)))} = f} (≠ ∅), and we set S(k, a) = C_1 ∪ ... ∪ C_{2^k}. By Definition 3.2, C_i and C_j (i ≠ j) are pairwise disjoint. Moreover, for any f ∈ S(k, a) and i (1 ≤ i ≤ 2^k), we have the property that

φ_i(f) is true iff f ∈ C_i ∪ C_{i+1} ∪ ... ∪ C_{2^k}    (10)



S(k, a) ∈ QEX([∅])⟨k⟩: By property (10), a bounded query inference machine M can first apply binary search via the queries φ_1, ..., φ_{2^k} to determine the unique i such that f ∈ C_i. This step takes at most k queries selected from φ_1, ..., φ_{2^k}, all of which can be formulated in the query language [∅]. Machine M then proceeds by reading initial segments of f. It first outputs f(y_0), where y_0 is the first element of Y_i (note: Y_i is computable), and M updates its output to f(n_{Y_i}(x)) whenever M reads an x ∈ Y_i such that f(n_{Y_i}(x)) < f(x). It follows that M's outputs converge to the value f(max(D_{Y_i}(f))), which is (by the definition of C_i) an index of a program that computes f.

S(k, a) ∉ [1, 2^k − 1]BC^a: We present the proof for the case k = 2; the general case follows similar ideas. Let M1, M2 and M3 be a team of inference machines. We will construct a program ϕ_{e4} by diagonalization against M1, M2, M3 such that

1. If ϕ_{e4} is total, then the function (say g) computed by ϕ_{e4} is an element of C_4, but g cannot be BC^a-learned by M1, M2 or M3.

2. If ϕ_{e4} is not total, then it follows from the construction that one machine, say M1, cannot learn any function that extends a certain initial segment (say σ) of ϕ_{e4}. Using σ and a similar construction which diagonalizes against the two remaining machines (say M2 and M3), one can create another program ϕ_{e3}. Note that the number of machines involved is reduced by 1. If ϕ_{e3} is total, then the function (say h) computed by ϕ_{e3} is an element of C_3, but h cannot be BC^a-learned by M2 or M3 (and certainly not by M1). If ϕ_{e3} is not total, one can reiterate the steps and construct another program ϕ_{e2}.

3. When the number of machines involved drops to zero, we can simply take ϕ_{e1} to be a surjective extension that monotonically extends the current segment in all the Y_i's.
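Returning to the upper bound: the query phase of machine M is an ordinary binary search over the monotone sequence φ_1, ..., φ_{2^k} (property (10) makes the set of true φ_i a prefix {1, ..., i}). A sketch, with the teacher modelled as a callback and all names illustrative:

```python
def find_class(ask_phi, m):
    """Binary search for the unique i in 1..m (m = 2^k) with f in C_i,
    using the fact that phi_j(f) is true iff j <= i.
    `ask_phi(j)` plays the teacher answering the query phi_j(f)."""
    lo, hi = 1, m
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if ask_phi(mid):          # phi_mid true  =>  i >= mid
            lo = mid
        else:                     # phi_mid false =>  i <  mid
            hi = mid - 1
    return lo

queries = []
def teacher(j, true_i=3):         # pretend f is in C_3, with m = 8 (k = 3)
    queries.append(j)
    return j <= true_i
```

Here `find_class(teacher, 8)` locates i = 3 with at most 3 = log2(8) queries, matching the "at most k queries" bound in the proof.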

To simplify our presentation, we will give the constructions of programs ϕ_{e4} and ϕ_{e3}, together with the transition steps, only.

Program ϕ_{e4}:

Initialization: Let y be the smallest element in Y_4. Set ϕ^0_{e4} = σ_0, where (y, e_4) ∈ σ_0 and φ_4(σ_0) is true. Put the machines M1, M2, M3 in a priority queue, and let M1 be the front of the queue. We use ϕ^s_{e4} to denote the initial segment constructed at the beginning of stage s.

Stage s:

1. Let M be the machine at the front of the queue. Search for extensions ϕ^s_{e4} ⊆ σ_s ⊂ τ_s and x_1, ..., x_{a+1} ∈ dom(τ_s − σ_s) that satisfy the following collection of requirements:
   a) (Have at least a + 1 errors) ∀i ≤ a + 1, M(σ_s)(x_i) ≠ τ_s(x_i).
   b) (Preserves f(max(D_{Y_4}(f))) = e_4) τ_s monotonically extends ϕ^s_{e4} along Y_4.
   c) (Avoids making f 5-fold surjective) φ_5(τ_s) is false.
2. We reach this step if the previous search terminates. To preserve 4-fold surjectivity, we extend τ_s ⊆ τ such that φ_4(τ) is true and φ_5(τ) is false.
3. (Reset parameters and queue) Set ϕ^{s+1}_{e4} = τ. Put M at the end of the priority queue and go to stage s + 1.

End of Program ϕ_{e4}

If ϕ_{e4} is total, let f = ϕ_{e4}. By construction f ∈ C_4 and f ∉ BC^a(M1, M2, M3). If ϕ_{e4} is not total, then we perform the following transition steps. Let s be the least stage at which the search does not terminate. Let σ = ϕ^s_{e4}, and let M1 be the machine at the front of the queue at stage s. Note that for any extension τ of σ that satisfies conditions 1b) and 1c) in the code of ϕ_{e4}, M1(τ) can converge on at most a points not in dom(σ); hence M1 cannot BC^a-identify any function that monotonically extends σ along Y_4. By another implicit use of the recursion theorem, we obtain the following program e_3.



Program ϕe3 :

Initialization: Let z be the smallest element in the set Y3 − dom(ϕ^s_e4). Set ϕ^0_e3 = σ^3_0, where ϕ^s_e4 ⊂ σ^3_0, φ3(σ^3_0) is true and (z, e′) ∈ σ^3_0. Here e′ is a padded version of e3 (i.e., ϕe3 = ϕe′, where e′ is large enough to satisfy the conditions stated above). Remove M1 from the queue and put the remaining machines in a priority queue with M2 at the front (i.e., the queue is now M2, M3).

Stage s :

1. Let M be the machine at the front of the queue. Search for extensions ϕ^s_e3 ⊆ σs ⊂ τs and x1, . . . , xa+1 ∈ dom(τs − σs) that satisfy the following requirements:
a) (Have at least a + 1 errors) ∀i ≤ a + 1, M(σs)(xi) ≠ τs(xi).
b) (Preserves f(max(DY3(f))) = e3) τs monotonically extends ϕ^s_e3 along Y3 and Y4.
c) (Avoids making the constructed function 4-fold surjective) φ4(τs) is false.

2. We reach this step only if the previous search terminates. (Preserve 3-fold surjectivity) Extend τs ⊆ τ such that φ3(τ) is true and φ4(τ) is false.

3. (Reset parameters and queue) Set ϕ^{s+1}_e3 = τ. Put M at the end of the priority queue and go to stage s + 1.

End of Program ϕe3

Corollary 3.6 Let I = EX or BC. Then for any query language L and for any k ∈ N, a) QI (L)〈k〉 ⊂ QI (L)〈k + 1〉, and b) QI (L)〈k〉 ⊂ QI (L)〈∗〉.

Proof: Part b) follows from part a). For part a), for any k, a ≥ 0,

QI (L)〈k〉 ⊆ [1, 2^k]I (Thm 3.1) & [1, 2^k]I ⊂ [1, 2^(k+1) − 1]BCa ([Smi82]) (11)

By Theorem 3.5 and the fact that EX ⊆ BC, there is a concept C ∈ QI (L)〈k + 1〉 − [1, 2^(k+1) − 1]BCa, and by (11), C cannot be in QI (L)〈k〉.

Corollary 3.7 Let L be a query language and k ∈ N. Let r ≥ 1 and I = EXr or I = BCr. Wehave a). QI (L)〈k〉 ⊂ QI (L)〈k + 1〉 and b). QI (L)〈k〉 ⊂ QI (L)〈∗〉.

Proof: It suffices to note that when a ≥ r, equation (11) also holds for I = EXr or I = BCr, and the previous arguments go through in these cases as well.

We also show that improving the quality of boolean queries cannot reduce the number of queries made. An intuitive explanation is that k boolean queries can convey no more than k bits of information, i.e., distinguish at most 2^k possibilities.

Corollary 3.8 Let L be any query language and let I = EX or BC. For any extension L′ of Land any k ∈ N, QI (L)〈k + 1〉 6⊆ QI (L′)〈k〉.

Proof: We prove the case I = EX. By Theorem 3.5, (∃C ∈ COMP)[C ∈ QEX (L)〈k + 1〉 − [1, 2^(k+1) − 1]BCa]. For any extension L′ of L, if C ∈ QEX (L′)〈k〉, then C ∈ [1, 2^k]EX ⊂ [1, 2^(k+1) − 1]EX ⊆ [1, 2^(k+1) − 1]BCa, a contradiction.

In the original query inference model, the inference process may require potentially infinitely many queries. We will show that when the language L can express the ‘<’ relation, asking only finitely many questions restricts the learning power of the query inference machine. We generalize the techniques of Theorem 3.5 to separate the query inference types QEX ([<])〈∗〉 and QEX ([<]).



Theorem 3.9 QEX ([<]) − QEX ([<])〈∗〉 ≠ ∅.

Proof: We sketch the main ideas only. First, by Theorem 3.1, QEX ([<])〈∗〉 ⊆ ⋃_{n≥1} [1, n]EX. We partition N into sets Yi (i ≥ 1), all of which are infinite and computable. Let C = ⋃_{n≥1} Cn, where Cn is the class of computable functions f such that ϕ_{f(max DYn(f))} = f and f is eventually n-fold surjective, that is, the following query (with k = n) is true:

φk(f) = (∃z)(∀y)(∃x1)(∃x2) . . . (∃xk)[z < y → ∧_{i≠j} (xi ≠ xj) ∧ ∧_{i=1..k} f(xi) = y].

One can show that C ∈ QEX ([<]) − ⋃_{n≥1} [1, n]EX by a minor modification of the technique used in the proof of Theorem 3.5.

4 Relationships with Team Inference

Theorem 3.1 states that for any passive inference type I and for any k ≥ 1, QI (L)〈k〉 ⊆ [1, 2^k]I. If the equality QI (L)〈k〉 = [1, 2^k]I holds for some L, then the language L is expressive enough that the team learning process can be simulated by an active learning strategy via k queries in L.

We first show that when L = [+,×], the above equality holds.

Theorem 4.1 (∀n ∈ N)[QEX ([+,×])〈n〉 = [1, 2^n]EX ].

Proof: We only sketch the proof here. We claim that (∀n ∈ N)[QEX ([+,×])〈n〉 ⊇ [1, 2^n]EX ]. Since all computably enumerable sets are diophantine [Mat93, GS92], in [+,×] one can ask queries of the form ‘Does there exist a machine with index i < j such that Mi learns the function f?’. By combining this observation with binary search, one can first determine which machine actually learns f and then simulate that machine.
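The binary-search step above can be sketched as follows. This is our own illustration, not the paper's construction: `learner_succeeds` stands in for the teacher's answer to the (diophantine) boolean query "does some machine with index in [lo, hi) learn f?", and the function name is hypothetical.

```python
# Sketch: locating the successful machine among a team of 2**n learners with
# only n boolean queries, via binary search on the index range.

def find_successful_machine(n, learner_succeeds):
    """Return an index of a machine that learns f, using at most n queries."""
    lo, hi = 0, 2 ** n          # the team [1, 2^n], 0-indexed here
    queries = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        queries += 1
        if learner_succeeds(lo, mid):   # one boolean query to the teacher
            hi = mid
        else:
            lo = mid
    assert queries <= n
    return lo

# Toy teacher: machine 5 (secretly) is the one that learns f.
print(find_successful_machine(3, lambda lo, hi: lo <= 5 < hi))  # 5
```

Once the index is known, the query inference machine simply simulates that single machine on the data, which is why n boolean queries suffice to capture a [1, 2^n] team.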

However, for many other languages used in query inference, the answer is negative. We will treat L as a parameter and apply the technique of reduction to deal with this parameter (see [GH95, GL97]). We only state here which languages are reducible to [Succ,<]2, and refer the interested reader to [GH95] for a detailed discussion of the various types of reductions of query languages used in our inference model.

Theorem 4.2 [GH95] Let b ≥ 2. Let

1. POWb be the unary predicate that determines whether a number is in the set {b^n | n ∈ N},
2. POLYb be the unary predicate that determines whether a number is in the set {n^b | n ∈ N},
3. FAC be the unary predicate that determines whether a number is in the set {n! | n ∈ N}.

The languages [+, <], [+, <, POWb], [Succ,<, POWb]2, [Succ,<, POLYb]2 and [Succ,<, FAC]2 are reducible to [Succ,<]2.

Theorem 4.3 (∀k ≥ 1)[[1, 2]EX − QEX ([Succ,<]2)〈k〉 ≠ ∅].

Proof: (sketch) Let C ∈ [1, 2]EX − EX. By the lifting theorem [GL97], there exists an operator Ω such that Ω(C) ∈ [1, 2]EX − EX. Suppose Ω(C) ∈ QEX ([Succ,<]2)〈k〉; then Ω(C) ∈ QEX ([Succ,<]2), which in turn implies [GL97] that Ω(C) ∈ EX, a contradiction.



Corollary 4.4 For any L that is reducible to [Succ,<]2 and for any k ≥ 1,

[1, 2]EX − QEX (L)〈k〉 ≠ ∅.

Proof: By applying reductions (see [GH95]), since all the languages stated above were shown to be reducible to [Succ,<]2.

5 Open problems

A new query inference model, which can access data and communicate with a teacher by asking a bounded number of logical queries, has been studied. The following questions are of interest:

• Simulation of a team learning process via active learning strategies: Can one develop a language L so that QI (L)〈k〉 = [1, 2^k]I? We showed that when L = [+,×], the equality does hold. However, in our proof the learner asks questions that are unrelated to the functions we wish to learn. It would be of interest to find an L such that the equality holds and the queries in L truly ask about properties of the functions the machine is observing.

• Using non-boolean queries: No matter how expressive a boolean query is, it only allows the communication of a single bit of information. Developing new bounded query inference models that use non-boolean queries is also of interest.

Acknowledgement: The author wishes to thank William Gasarch and Steele Russell for comments on earlier drafts of this work.

References

[AS83] D. Angluin and C.H. Smith. Inductive inference: Theory and methods. Computing Surveys, 15:237–269, 1983.

[CR94] John Case and James Royer. Subrecursive Programming Systems: Complexity and Succinctness. Springer-Verlag, 1994.

[CS83] J. Case and C.H. Smith. Comparison of identification criteria for machine inductive inference. Theoretical Computer Science, 25:193–220, 1983.

[End72] H. B. Enderton. A Mathematical Introduction to Logic. Academic Press, 1972.

[GH95] W.I. Gasarch and G.H. Hird. Reduction for learning via queries. In Proceedings of the 8th Annual ACM Conference on Computational Learning Theory, pages 152–161, 1995.

[GL97] W.I. Gasarch and Andrew C.Y. Lee. Inferring answers to queries. In Proceedings of the 10th Annual ACM Conference on Computational Learning Theory, pages 275–284, 1997.

[Gol67] E.M. Gold. Language identification in the limit. Information and Control, 10:447–474, 1967.

[GS92] W.I. Gasarch and C.H. Smith. Learning via queries. Journal of the ACM, 39:649–676, 1992.

[GS95] W.I. Gasarch and C.H. Smith. Recursion theoretic models of learning: some results and intuitions. Annals of Mathematics and Artificial Intelligence, 15(2):155–166, 1995.

[GS97] W.I. Gasarch and C.H. Smith. A survey of inductive inference with an emphasis on queries. In A. Sorbi, editor, Complexity, Logic, and Recursion Theory, number 187 in Lecture Notes in Pure and Applied Mathematics. M. Dekker, 1997.

[JORS99] Sanjay Jain, Daniel Osherson, James Royer, and Arun Sharma. Systems that Learn: An Introduction to Learning Theory. MIT Press, Cambridge, MA, second edition, 1999.

[Mat93] Yuri V. Matiyasevich. Hilbert's Tenth Problem. Foundations of Computing Series. MIT Press, Cambridge, MA, 1993. Translated from the 1993 Russian original by the author, with a foreword by Martin Davis.

[Rog67] H. Rogers, Jr. Theory of Recursive Functions and Effective Computability. McGraw-Hill, 1967.

[Smi82] C.H. Smith. The power of pluralism for automatic program synthesis. Journal of the ACM, 29(4):1144–1165, 1982.

[Smi94] C.H. Smith. Three decades of team learning. In Proceedings of the 5th International Workshop on Algorithmic Learning Theory, pages 211–228. Springer-Verlag, 1994.

[Soa87] R.I. Soare. Recursively Enumerable Sets and Degrees. Omega Series. Springer-Verlag, 1987.



Modeling and Reasoning with Star Calculus: An Extended Abstract

Debasis Mitra
Department of Computer Sciences
Florida Institute of Technology
Melbourne, FL 32901, USA
E-mail: [email protected]

1. Introduction

Spatial knowledge representation (KR) schemes are important in many areas of computation, e.g., geographical information systems, natural language technology, image processing, and battlefield management. In this paper we discuss a spatial KR technique for expressing angular relations between points in Euclidean space. The scheme is a hybrid of what are known as quantitative and qualitative schemes. In this section we briefly introduce a related area for the purpose of motivation.

Digital image processing sometimes deploys a method called "chain code" to represent a polygon as a sequence of angular relations between the respective pairs of points of the polygon [Gonzalez and Wintz, 1987]. The angles are measured with reference to an absolute direction in the space (e.g., North), in a clockwise or, typically, anti-clockwise sense. The measurement is discrete rather than continuous, i.e., the angles come from a finite discrete set of values between 0 and 360 degrees, e.g., x=0°, x=30°, x=60°, . . ., x=330°, for an angular direction x. Actually the ordering index is used rather than the angle's value, e.g., 1 for x=0°, 2 for x=30°, . . ., 12 for x=330° for the above set. Thus, a polygon is represented as a sequence of these numbers for adjacent pairs of points on its perimeter (Figure 1). Obviously the discretization loses some information and may thus distort the polygon when it is recovered. However, chain-code schemes may use different levels of granularity, e.g., 15° zones instead of 30°, depending on the requirements of an application.
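The zone computation just described can be sketched in a few lines. This is our own illustration, not code from [Gonzalez and Wintz, 1987]; for simplicity the reference direction is East rather than North (the zero direction is a free choice), measured anti-clockwise, and the snapping rule is an assumption.

```python
# Sketch of a chain code: map each polygon edge to the 1-based index of its
# discrete angular zone (30-degree zones, index 1 for 0 degrees, as in the text).
import math

def chain_code(vertices, step_deg=30):
    codes = []
    n = len(vertices)
    for i in range(n):
        (x1, y1), (x2, y2) = vertices[i], vertices[(i + 1) % n]
        ang = math.degrees(math.atan2(y2 - y1, x2 - x1)) % 360
        zone = round(ang / step_deg) % (360 // step_deg)  # snap to nearest zone
        codes.append(zone + 1)                            # 1-based index
    return codes

# Unit square traversed anti-clockwise: edge directions 0°, 90°, 180°, 270°.
print(chain_code([(0, 0), (1, 0), (1, 1), (0, 1)]))  # [1, 4, 7, 10]
```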

Figure 1: A chain-code representation with 30° angular zoning. Numbers on some of the arrowheads indicate the zones of the arrows.

The Star-calculus representation scheme for reasoning with this type of angular information, which we invented independently [Mitra, 2002, 2002-2], subsumes the chain-code technique. If the chain code is termed a quasi-quantitative knowledge representation technique, then the Star-calculus is a hybrid of qualitative and quantitative representation techniques. As we will see in a subsequent section, the Star-calculus generalizes the chain code by allowing both the 1-dimensional lines and the 2-dimensional conical regions between the semi-infinite lines. Thus, as a knowledge representation technique, the Star-calculus is more complete than the chain code. A noteworthy point about our contribution is that we have extended Helly's theorem (in linear algebra; see Chvátal, 1983). This extension may have a broader impact beyond our work on Star-calculi.

2. Spatio-temporal Qualitative Representation Revisited

Starting from the early studies of simple point-based calculus in linear time, spatio-temporal constraint-based reasoning has matured into a discipline with its own agenda and methodology [Chittaro and Montanari, 2000]. The study of such a calculus starts with an underlying "space" and develops a set of jointly exhaustive and pairwise disjoint (JEPD) "basic relations" with respect to a reference object located in that space. Basic relations correspond to equivalent regions in the space, for the purpose of placing a second object there with respect to the first one. An equivalent region is one in which the second object can be placed anywhere without affecting the relation with the reference object. The underlying space, together with such a relative zoning scheme with respect to a reference object, forms a calculus in the context of spatio-temporal knowledge representation.

Qualitative reasoning with such a spatio-temporal calculus involves a given set of objects (e.g., points or time-intervals) located in the corresponding "space" and binary disjunctive relations between some of those objects. Each disjunctive binary relation is a subset of the set B of JEPD basic relations. The satisfiability question in the reasoning problem is whether those relational constraints are consistent with each other. The power set of B is closed under the primary reasoning operators of composition, inversion, set union and set intersection, thus forming an algebra. Typically these algebras are relational algebras in the Tarskian sense [Jonsson and Tarski, 1952].

3. The Star-calculus

Star-calculus(α), where α stands for any even integer > 0, involves a generalized angular zoning scheme with respect to a point in 2D space, with a (360/α)-degree angle between any consecutive pair of lines. The set B of 2α + 1 basic relations is {Eq, 0, 1, 2, 3, …, 2α−1}, where 'Eq' is the identity relation with respect to the reference point, every even-numbered relation corresponds to a semi-infinite line fanning away from the origin, and the odd-numbered relations indicate pie-slice or conic-sectional regions between two consecutive semi-infinite lines. Figure 2 corresponds to the Star-calculus(α=12) with a 30° angle between the lines. So, apart from the relation 'Eq' of zero dimensionality, there are two types of basic relations in B depending on dimensionality: r_even, corresponding to a 1D region (semi-infinite line), and r_odd, corresponding to a 2D region (conic section).


Figure 2: Representation of Star-calculus(12)

Figure 3 shows a canonical representation G(α), indicating the topological relationship between the basic relations in the Star-calculus(α). Each circle represents an equivalent region/basic relation (synonymously used in this paper), the dark ones are the 2D-regions, the white ones are the 1D-regions, the central dot represents the Eq region, and the arcs represent the adjacency between the regions.

Figure 3: Canonical representation G(α) of the Star-ontology(α)

4. Geometrical Modeling: Representing Polygons Qualitatively

As in chain-code representation, a polygon may be represented with a Star-calculus (for some even integer α). Exhaustively, we could express all possible binary relations between every pair of points, n(n−1)/2 of them for n points. However, we really need only n such binary relations, between adjacent pairs of points.

Example 1: A quadrilateral may be described in Star-calculus(12) (i.e., θ = 30°) as (p1 2 p2), (p2 22 p3), (p3 14 p4), (p4 8 p1).
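As a concrete companion to Example 1, the basic relation between two points can be computed as follows. This helper is our own sketch of the zoning just described (even indices for the rays, odd indices for the open cones between them), not code from the paper.

```python
# Sketch: the Star-calculus(alpha) basic relation from point p to point q.
import math

def star_relation(p, q, alpha=12, eps=1e-9):
    if p == q:
        return 'Eq'
    step = 360.0 / alpha
    ang = math.degrees(math.atan2(q[1] - p[1], q[0] - p[0])) % 360
    k, frac = divmod(ang, step)
    if frac < eps or step - frac < eps:               # exactly on a ray
        return (2 * int(round(ang / step))) % (2 * alpha)
    return 2 * int(k) + 1                             # inside an open cone

print(star_relation((0, 0), (1, 0)))   # 0: the ray at 0°
print(star_relation((0, 0), (1, 1)))   # 3: 45° lies in the cone between 30° and 60°
```

A description such as Example 1 is then just the list of `star_relation` values for adjacent vertex pairs of the quadrilateral.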

An obvious point to note about the proposed representation scheme is that it is inaccurate from a strict quantitative point of view. This means that a polygon recovered from such a given description (e.g., example 1) may be distorted. It can be easily checked [Mitra, 2002-3] that the error of such a representation with a θ-degree angular zoning scheme is:


(1) any internal angle may deviate by at most 2θ in the process of recovery, and (2) the relative orientation of the whole polygon may deviate by at most θ.

5. Computational Complexity Issues

[Definition 1] The set of all disjunctive relations, the power set P(B) of the set of basic relations B = {Eq, 0, 1, 2, 3, …, 2α−1}, is closed under the disjunctive composition, inverse, set union and set intersection operations, forming the Star-algebra(α). A reasoning problem instance over any subset Θ of P(B) is expressed as (V, E), where V is a set of points situated in the 2D space, and E is a set of binary constraints Rij between some of the pairs of points i, j in V such that each Rij ∈ Θ. The satisfiability question in the reasoning problem is to check whether it is feasible to assign the points in the space following all the constraints in E.

Theorem 1 [Mitra, 2002-2]: Reasoning with the full Star-algebra(α) is NP-complete.

Proof sketch: A problem instance in the Star-algebra is constructed from an arbitrary 3-SAT instance as follows. (1) For every literal lij (in the source 3-SAT problem), create two points Pij and Rij such that Pij [1 -> (α+1)] Rij, and (2) for every clause Ci we have Pi1 [(α+1) -> (2α−1)] Ri2, Pi2 [(α+1) -> (2α−1)] Ri3 and Pi3 [(α+1) -> (2α−1)] Ri1. Here, a binary relation x [r1 -> r2] y indicates a disjunctive set of basic relations from point x to point y, within the range from r1 through r2 over G(α). Also, (3) for every literal lij that has a complementary literal lgh we have two relations between their corresponding points: Pij [(α−1) -> (2α−1)] Rgh and Pgh [(α−1) -> (2α−1)] Rij. The source 3-SAT instance is satisfiable iff the corresponding spatial reasoning instance is, and the construction takes polynomial time in the number of clauses and variables. Hence, the reasoning problem in Star-algebra(α) is NP-hard.

Given some binary constraints between a set of points in any Star-calculus(α), and an assignment of those points in the 2D space (e.g., by their Cartesian coordinates), it can easily be verified whether the assignment satisfies the constraints, in O(|E|) time for |E| binary constraints. Hence, the problem is NP-complete. QED.
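The O(|E|) verification step mentioned above can be sketched as follows; the names and the constraint encoding (a list of (i, j, allowed-set) triples) are our own assumptions for illustration.

```python
# Sketch: verify a candidate 2D assignment against Star-calculus(alpha)
# constraints, touching each constraint exactly once (O(|E|) overall).
import math

def zone(p, q, alpha):
    """Basic relation index between distinct points p and q."""
    step = 360.0 / alpha
    ang = math.degrees(math.atan2(q[1] - p[1], q[0] - p[0])) % 360
    k, frac = divmod(ang, step)
    if frac < 1e-9 or step - frac < 1e-9:             # exactly on a ray
        return (2 * int(round(ang / step))) % (2 * alpha)
    return 2 * int(k) + 1                             # inside an open cone

def satisfies(points, constraints, alpha=12):
    # constraints: list of (i, j, allowed) with `allowed` a set of basic relations
    return all(zone(points[i], points[j], alpha) in allowed
               for i, j, allowed in constraints)

pts = {1: (0.0, 0.0), 2: (1.0, 0.0)}
print(satisfies(pts, [(1, 2, {0, 1})]))  # True: p2 lies on p1's 0° ray
```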

[Definition 2] A convex relation is a disjunctive set of basic relations which can be expressed as the shortest range [r1 – r2, [eq]] over the canonical representation G(α), such that the range does not cross the half circle on G(α). When r2 is not the inverse of r1, the relation Eq is optionally included (both with and without Eq the relation is convex), but when r2 is the inverse of r1, the relation Eq must be present within a convex relation. For every 1D basic relation r, {r}, {r, Eq} and {r, Eq, r∪} are also convex relations.

[Definition 3] A preconvex relation is either a convex relation or a convex relation c excluding any number of lower-dimensional regions of c.

[Proposition 1] The set C of all convex relations is closed under disjunctive composition. This is trivially true from the definition of convex relations; the tautology/universal relation is produced when a half circle on G(α) would be crossed in the result.
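A simplified membership test for the range condition in Definition 2 can be sketched as follows. This deliberately ignores the 'Eq' subtleties and the boundary case where r2 is the inverse of r1, so it is an approximation of the definition, not a faithful implementation; all names are our own.

```python
# Sketch: do the numbered basic relations form one contiguous arc on the cycle
# 0 .. 2*alpha-1 that stays within a half circle (at most alpha positions)?
def is_convex_range(rels, alpha):
    n = 2 * alpha
    members = set(rels)
    for start in members:               # try each element as the arc's start
        length = 0
        while (start + length) % n in members:
            length += 1
            if length > len(members):   # guard: the whole cycle is present
                break
        if length == len(members) and length <= alpha:
            return True
    return False

print(is_convex_range({22, 23, 0, 1}, 12))  # True: an arc wrapping through 0
print(is_convex_range({0, 2}, 12))          # False: the gap at 1 breaks the arc
```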


[Proposition 2] The set of all convex relations is closed under the inverse and set intersection operations; similarly trivial to show.

[Proposition 3] The set P of all preconvex relations is closed under the disjunctive composition operation. Note that a preconvex relation also spans a (shortest) range on G(α) like a convex relation, except that some lower-dimensional regions (r_even or Eq) may be absent from the range. The fact that relations spanning two ranges compose to another range remains true here, as in the case of convex relations. However, the absence of an r_even from one of the operands could make the resulting range discontinuous. This situation does not arise: for any absent internal r_even in an operand, the two adjacent r_odd relations are present in the same operand and compensate for the absent r_even in the result. For example, compose the two preconvex relations {1, 3, 4} and {2, 3, 5} in the Star-calculus(6). Although the 1D regions (r_even) 2 and 4 are absent from the two operands respectively, their adjacent 2D regions (r_odd) are present. The result of the composition operation is 1.2 ∪ 1.3 ∪ 1.5 ∪ 3.2 ∪ 3.3 ∪ 3.5 ∪ 4.2 ∪ 4.3 ∪ 4.5. The adjacent 2D regions compensate for the corresponding absent 1D region in either operand, and the result of the composition is unaffected by that absence. However, if two absent r_even relations in the two operands lie on the same line (e.g., {1, 3, 4} and {0, 1, 3}, with 2 and 2 absent respectively), then their composition result (2.2 = {2}) may not be reproduced by the adjacent 2D regions. Since such a resulting absent region can only be of 1D type, the result remains a preconvex relation. Hence, the set of all preconvex relations is closed under disjunctive composition.

[Proposition 4] The preconvex set P is closed under the set intersection and inverse operations. Since the possible absence of 1D regions from the two operands (otherwise ranges) of the set intersection operation cannot cause any 2D region to be absent from the result, the preconvex property is preserved under set intersection. The inverse operation trivially preserves the preconvex property of its operand.

[Proposition 5] Thus, the convex set C and the preconvex set P (C ⊂ P) form two sub-classes of the Star-algebra(α).

Theorem 2: 4-satisfiability is sufficient to imply global consistency for the preconvex sub-class P.

Theorem 2 can be proved by using an extension of Helly's theorem for convex sets, as stated in Chvátal [1983, Theorem 17.2]: "Let F be a finite family of at least n+1 convex sets in R^n such that every n+1 sets in F have a point in common. Then all the sets in F have a point in common." One can define a corresponding notion of a pre-convex set, in which some strictly lower-dimensional convex subsets of a convex set c may be absent from c. A circle is a convex region in the 2D space. However, if we exclude a straight line (a convex region of lower dimension) crossing the circle from that circle, then the circle minus the line becomes a pre-convex region and is no longer convex. We define stricter sets/regions below.


[Definition 4] A strongly convex region in an n-dimensional space is a convex region that is also of dimension n.

[Definition 5] A strongly pre-convex region in an n-dimensional space is a strongly convex region with possibly some lower-dimensional convex regions within it excluded. (Note that an excluded lower-dimensional region is never taken from the boundary of the original region, and a convex region minus any part of its boundary is still a convex region.)

[Definition 6] The convex closure cl(p) of a pre-convex region p is the region obtained by adding the missing lower-dimensional regions of p back to p. Thus, the convex closure of a pre-convex region is a convex region.

Note that pre-convex and strongly pre-convex regions are very much "approximate" convex regions and should not be confused with arbitrary non-convex regions. Helly's theorem can be extended to strongly pre-convex sets, and using this extension one can prove Theorem 2.

Extended Helly's Theorem: In an n-dimensional space, if every n+1 elements of a finite set P of strongly pre-convex regions have a non-null intersection, then all elements of P have a non-null intersection, and vice versa ("only if"). [We presume that some element(s) of P are not convex; otherwise Helly's theorem applies directly and no separate proof is needed.]

Proof sketch of the extended Helly's theorem: The "only if" part is trivially true.

Suppose S is the set of convex closures of all the elements of the set P. Note that each element of P is a sub-region of some element of S.

Suppose every n+1 elements of P have a non-null intersection. Then so do the corresponding elements of S, because each element of P is a sub-region of some element of S. Say all elements of S intersect in a non-null region q (which could be a collection of disconnected sub-regions) [true by Helly's theorem]. Either (1) the highest dimension of q is n, or (2) q is a union of only lower-dimensional regions. Suppose all elements of P intersect in u, which is to be proved non-null.

Case 1: Let L be the union of all lower-dimensional regions that are absent from elements of P but present in the corresponding elements of S, i.e., L = (union over S) − (union over P). L is non-null (because not all elements of P are convex). Since L is of lower dimension while q is of dimension n, u = (q − L) must be non-null. All elements of P intersect in u, and it is non-null. This establishes the "if" part of the extended Helly's theorem for case 1.

Case 2: q is a part of each element of S, but must lie on the boundary of the elements of S, because q is of dimension less than n while each element of S is of dimension n (this is a simple lemma for the lower-dimensional q). Note that L cannot contain any region from the boundary of any element of P, as the convex closure of a pre-convex region g is never formed by adding a lower-dimensional region from the boundary of g. Hence q ∩ L is null, and so u = (q − L) = q is non-null. All elements of P intersect in u, and it is non-null. This completes the "if" part for case 2, and hence for both cases. QED.

Proof sketch of Theorem 2: The induction base case for four points is trivially true by the definition of 4-satisfiability. The induction hypothesis is that the assertion is true for m−1 points, and hence all m−1 points have satisfiable placements in the space. Consider a new m-th point, for which we have m−1 preconvex relations with the other m−1 older points. By the 4-satisfiability assumption, the regions determined with respect to every three old points have a non-null intersection. By the extended Helly's theorem, this implies the existence of a non-null region for the new m-th point. Hence there exists a non-null region for the placement of the new m-th point satisfying the strongly pre-convex constraints, i.e., global consistency is implied. QED.

The 4-satisfiability can be easily checked by a polynomial algorithm. Hence,

[Proposition 6] The convex and strongly pre-convex subsets of a Star-calculus are tractable sub-classes.

Theorem 3: The strongly pre-convex subclass P is maximally tractable.

Proof sketch: Define the maximal-convex relations as those corresponding to a half-space region on one side of an infinite 'line' in a Star-calculus(α). For example, in Star-calculus(6) (see the analogue of Figure 2) the regions {0, eq, 6}, {2, eq, 8} and {4, eq, 10} are three such lines, and a disjunctive relation {eq, 0, 1, 2, 3, 4, 5, 6} is such a maximal-convex relation. Now, add one of the two adjacent two-dimensional regions to each such maximal-convex relation (e.g., {eq, [0 – 7]}), which will obviously be non-convex after that addition; call any such relation m+. Next, loosen the definition of m+ by allowing some lower-dimensional regions to be absent from it (e.g., {eq, [0 – 3], [5 – 7]}), and call any such relation p+. Consider the set P+ of all such p+ relations. Our proof of NP-hardness (Theorem 1) of the Star-algebra(α) uses such p+ relations, and thus shows that the reasoning problem over P+ is NP-hard. This fact, together with Proposition 6, shows that the subclass P is maximally tractable. QED.

6. Some Special Cases of the Star-calculus

We have always maintained that the Star-calculus(α) is for even integer α. This is because odd α gives basic relations with unintuitive inverse relations, and pairs of such basic relations may compose to non-unique results [Mitra 2002-2]. In fact, a Star-calculus can be "generated" only by a set of infinite straight lines passing through a point (the Eq region) in 2D Euclidean space, which is not the case for odd values of α. Another interesting observation is that symmetric zoning is not really required for generating a Star-calculus (observed first by Hyoung-Rae Kim). For example, Star-calculus(6) can be generated by any three concurrent lines in a plane, not necessarily with 60-degree angles between them (with some minor distinguishing consequences).

Star-calculus(2) is the simplest special case, with five basic relations that can be semantically described as Equality, Front(0), Above/Left(1), Back(2) and Below/Right(3). This calculus has some interesting applications in qualitative spatial reasoning. Note that Theorem 1 on the NP-hardness of Star-algebra(α) does not automatically apply here, as this is the special case α = 2. However, the fact that 3-consistency does not imply global satisfiability can easily be verified with the following problem instance involving four points p, q, r and s. The relations are (p Eq q Eq r), (p 1|2|3 s), (q 2|3|0 s) and (r 0|Eq|2 s). The constraints among any three points are satisfiable, but the satisfiability cannot be extended to the fourth point. In general, the Star-algebra(2) is probably NP-complete.

Page 171: Artificial Intelligence and Mathematics - Rutgers Universityrutcor.rutgers.edu/~amai/aimath04/AcceptedPapers/aimath-I.pdf · Artificial Intelligence and Mathematics January 4-6, 2004

Star-calculus(4) and the corresponding algebra have been extensively studied by Ligozat (1998) as the Cardinal-directions algebra. We have also previously worked on the Star-calculus(6) as another concrete special case; it has thirteen basic relations [Mitra 2002].

8. Conclusion

In this paper we have extended our work on the previously introduced quasi-qualitative spatial representation scheme for angular directions between points. We have discussed some of its significance in geometrical modeling, and its computational complexity with reference to an enhanced Helly's theorem. Helly's theorem's importance in qualitative spatio-temporal reasoning cannot be overstated; we believe its enhancement to the notion of strongly pre-convex regions has a broader impact in the field. Our work extends some of the directional calculi proposed for spatial KR in the past.

Freksa (1992) proposed a rather unintuitive calculus with two reference points instead of one, where the reference direction is automatically determined by the line joining the two points. Frank (1991) proposed a cone-based calculus that appears similar to the Star-calculi on the surface, but actually is Star-calculus(4) with 1D lines replaced by 2D cones. The implications of our results for these calculi remain to be investigated. For example, Haroud and Faltings' results on continuous-domain CSPs with x-convexity (the existence of a convex projection of constraints on a variable) could be extended toward x-preconvexity using the extended Helly's theorem presented here.

Acknowledgement: This work is partly supported by the National Science Foundation. Valuable discussions with Hyoung-Rae Kim and Jochen Renz are acknowledged.

References

Chvátal, V. (1983). "Linear Programming," p. 266, W. H. Freeman and Company.

Chittaro, L., and Montanari, A. (2000). "Temporal representation and reasoning in artificial intelligence: Issues and approaches," Annals of Mathematics and Artificial Intelligence, Baltzer Science Publishers.

Djamila, H., and Faltings, B. (1994). "Global consistency for continuous constraints," Proceedings of the 11th European Conference on Artificial Intelligence (ECAI-94).

Frank, A. U. (1991). "Qualitative spatial reasoning about cardinal directions," Proceedings of the 7th Austrian Conference on Artificial Intelligence, pp. 157-167.

Freksa, C. (1992). "Using orientation representation for qualitative spatial reasoning," in Frank, A. U., Campari, I., and Formentini, U., editors, Theories and Methods of Spatio-temporal Reasoning in Geographic Space: Proceedings of the International Conference GIS – From Space to Territory, pp. 162-178, Springer Verlag, Pisa, Italy.

Gonzalez, R. C., and Wintz, P. (1987). "Digital Image Processing," Academic Press.


Jonsson, B., and Tarski, A. (1952). "Boolean algebras with operators II," American Journal of Mathematics, Vol. 74, pp. 127-162.

Ligozat, G. (1996). "A new proof of tractability for ORD-Horn relations," Proceedings of AAAI-96, pp. 395-401, Portland, Oregon.

Ligozat, G. (1998). "Reasoning about cardinal directions," Journal of Visual Languages and Computing, Vol. 9, pp. 23-44, Academic Press.

Mitra, D. (2002). "A class of star-algebras for point-based qualitative reasoning in two-dimensional space," Proceedings of FLAIRS-2002, Pensacola Beach, Florida.

Mitra, D. (2002-2). "Qualitative reasoning with arbitrary angular directions," Spatial and Temporal Reasoning Workshop notes, AAAI, Edmonton, Canada.

Mitra, D. (2002-3). "Representing geometrical objects by relational spatial constraints," Proceedings of the Knowledge-Based Computer Systems (KBCS) Conference, Mumbai, India.

Vilain, M. B., and Kautz, H. (1984). "Constraint propagation algorithms for temporal reasoning," Proceedings of the 5th National Conference of the AAAI.


Analysis of Greedy Robot-Navigation Methods

Apurva Mugdal and Craig Tovey
Georgia Institute of Technology
College of Computing
801 Atlantic Drive
Atlanta, GA 30332-0280, USA
apurva, [email protected]

Sven Koenig
University of Southern California
Computer Science Department
941 W 37th Street
Los Angeles, CA 90089-0781, USA
[email protected]

Abstract

Robots often have to navigate robustly despite incomplete information about the terrain or their location in the terrain. In this case, they often use greedy methods to make planning tractable. In this paper, we analyze two such robot-navigation methods. The first method is Greedy Localization, which determines the location of a robot in known terrain by always moving it to the closest location from which it will make an observation that reduces the number of its possible locations, until it has reduced that number as much as possible. We reduce the upper bound on the number of movements of Greedy Localization from O(n^{3/2}) to O(n log n) on grid graphs and thus close to the known lower bound of Ω(n log n / log log n), where n is the number of (unblocked) vertices of the graph that discretizes the terrain. The second method is Dynamic A* (D*), which is used on several prototypes of both urban reconnaissance and planetary robots. It moves a robot in initially unknown terrain from given start coordinates to given goal coordinates by always moving the robot on a shortest presumed-unblocked path from its current coordinates to the goal coordinates, pretending that unknown terrain is unblocked, until it has reached the goal coordinates or there are no more presumed-unblocked paths. We reduce the upper bound on the number of movements of D* from O(n^{3/2}) to O(n log^2 n) on arbitrary graphs and O(n log n) on planar graphs (including grid graphs) and thus close to the known lower bound of Ω(n log n / log log n), where n is the number of (blocked and unblocked) vertices of the graph that discretizes the terrain.

1 Introduction

Robot-navigation problems with incomplete information are challenging because nondeterminism results in a large number of contingencies. These problems include localization (where the map is known but the location of the robot relative to the map is unknown and the objective is to move the robot until it has discovered its current location) and goal-directed navigation in unknown terrain (where the location of the robot relative to the map is known but the map is unknown and the objective is to move the robot to a given goal location). The sensors on-board a robot can typically sense the terrain only near its current location, and the robot thus has to interleave planning with movement to sense new parts of the terrain, either to discover more about its current location or to discover more about the map.

In this paper, we analyze two robot-navigation methods that both interleave planning with movement and use greedy (= myopic) planning approaches to make planning tractable. The first method is Greedy Localization, which moves a robot in known terrain so that it determines its initially unknown location. The second method is Dynamic A* (D*), which moves a robot in initially unknown terrain from known start to known goal coordinates. Both robot-navigation methods are simple to implement, easy to integrate into complete robot architectures, and seem to result in small travel distances in practice. We analyze their travel distances to understand whether the travel distances are indeed small in any kind of terrain or whether they were small only because of properties of the terrain used to test them experimentally. We model the terrain as a graph (in practice, grid graphs are commonly used) and then analyze the worst-case number of edge traversals as a function of the size of the graph, measured by the number of its vertices n. Robots move so slowly that their task-completion times are completely dominated by their travel times. Hence this criterion is at least as important as the competitive ratio [7] because it is a guarantee on the task-completion time. We reduce the upper bound on the worst-case number of edge traversals of Greedy Localization from O(n^{3/2}) [12] to O(n log n) on grid graphs and thus close to the best known lower bound of Ω(n log n / log log n) [12]. With a completely different analysis, we reduce the upper bound on the worst-case number of edge traversals of D* from O(n^{3/2}) [5, 11] to O(n log^2 n) on arbitrary graphs and O(n log n) on planar graphs (including grid graphs) and thus close to the best known lower bound of Ω(n log n / log log n) [8], which holds even on grid graphs [11].

2 Greedy Localization

The localization problem on grids is defined as follows. The robot moves with no actuator uncertainty on the grid (with the usual north-east-south-west orientation). Each cell of the grid is either blocked (= untraversable) or unblocked (= traversable). The perimeter of the grid consists of blocked cells. The start cell of the robot is unblocked. The robot has an on-board compass and a map (including orientation and traversability information). However, it does not know in which cell it is located. The robot can always move from its current cell north, east, south or west to any neighboring cell. The movement succeeds if the cell is unblocked. The robot has localized when it has made a series of movements such that the observations made during those movements are sufficient to determine the location of the robot, or to determine that the location cannot be determined (if, for example, the robot is located within one of a pair of isomorphic connected components of the grid). After each movement, the robot is permitted to use all the observations made so far to decide in which direction to move next. The localization problem is to localize in a minimum number of movements. We assume in the following that the sensors of the robot always detect with no sensor uncertainty which of its four neighboring cells are blocked (and perhaps also the blockage status of additional cells).

Greedy Localization maintains the set of cells that the robot can be in given all observations that it has made so far. It always makes the robot execute a shortest sequence of movements so that all observations that it has made after the sequence of movements are guaranteed to reduce the number of cells that the robot can be in. It terminates once the robot cannot reduce the number any more. If the number is one after termination then the robot is localized, otherwise it cannot localize. Greedy Localization was pioneered by Nourbakhsh in robot programming classes where Nomad 150 mobile robots had to navigate mazes that were built with three-foot high and forty-inch long cardboard walls [6]. Similar strategies are used in more complex environments where robots use probability distributions over locations rather than sets of locations to be able to deal with sensor noise, and the greedy localization methods then move the robots to decrease the entropy of the probability distribution rather than the cardinality of the set [3].
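As an illustration of the belief-set mechanics just described, the following sketch (the corridor map, coordinate conventions, and function names are ours, not from the paper) shows how a single well-chosen movement can localize a robot whose initial observation is ambiguous:

```python
# Belief-set mechanics behind Greedy Localization (illustrative sketch).

N, E, S, W = (0, -1), (1, 0), (0, 1), (-1, 0)     # movement directions

def observe(free, cell):
    """Noise-free sensor: which of the four neighboring cells are unblocked."""
    x, y = cell
    return tuple((x + dx, y + dy) in free for dx, dy in (N, E, S, W))

def update(free, belief, move, obs):
    """Shift every hypothesis by the executed move and keep only those whose
    predicted observation matches the actual one.  (All hypotheses agree on
    the last observation, so the move succeeds under every hypothesis.)"""
    shifted = {(x + move[0], y + move[1]) for x, y in belief}
    return {c for c in shifted if c in free and observe(free, c) == obs}

free = {(0, 0), (1, 0), (2, 0), (3, 0)}   # a 4-cell corridor
start = (1, 0)                            # true location, unknown to the robot

belief = {c for c in free if observe(free, c) == observe(free, start)}
# The two middle cells look identical, so the robot is not yet localized:
# belief == {(1, 0), (2, 0)}

# The closest informative observation is one step west; taking it splits
# the belief set and localizes the robot in a single movement.
pos = (start[0] + W[0], start[1] + W[1])
belief = update(free, belief, W, observe(free, pos))
print(belief)                             # {(0, 0)}
```

With probability distributions instead of sets, the same update becomes a Bayes filter step, as in the entropy-based methods cited above.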

2.1 Worst-Case Travel Bound

Researchers often analyze localization methods using online criteria, by comparing the travel distance of a robot to the distance that would be traveled by an omniscient robot that knows its location at the outset and seeks only to verify that location. Dudek, Romanik, and Whitesides [2], for example, find a best possible online ratio of |H| − 1, where H is computed from the start location of the robot with long-range sensors in a polygonal model. We, on the other hand, analyze the worst-case travel distance of Greedy Localization because a small worst-case travel distance guarantees that it cannot perform very badly, which is more important than the "regret" measured by the competitive ratio. Our analysis makes heavy use of a lemma from [13].

We model the grid as a grid graph G = (V, E) whose vertices correspond to the unblocked cells. The start vertex of the robot is x_0. The observation that the robot makes in a vertex is the same as the observation that it makes in the corresponding cell. Two vertices are connected via an edge iff they correspond to neighboring cells. The number of edge traversals of the robot on the grid graph then is the same as its travel distance on the grid.

Theorem 1 Greedy Localization traverses at most |V| + 2|V| ln |V| edges on grid graphs G = (V, E).

Proof: The robot is always in a particular vertex in the graph, even though the robot does not know which one it is. For the purpose of the analysis, we follow the movements of the robot as they actually occur in the graph. During iteration i, the robot follows a shortest path from vertex x_{i-1} to vertex x_i so that the observation in x_i is guaranteed to reduce the number of vertices that the robot can be in. This implies that x_i is closest to x_{i-1} among all informative vertices, where a vertex is informative if the observation from the vertex is guaranteed to reduce the number of vertices that the robot can be in. Note that uninformative vertices remain uninformative. Let d(x, x')_G denote the distance from vertex x to vertex x' in graph G and define l_i = d(x_{i-1}, x_i)_G. The main intuition behind the proof of the theorem is the fact that no vertex within a large distance of x_{i-1} can be informative if l_i is large. Thus, the number of large l_i and thus also the number of edge traversals of Greedy Localization must be small. We use marking sequences to formalize this intuition. A marking sequence on graph G = (V, E) is a sequence of triples (v_i, r_i, M_i) for i = 1, 2, ..., whose integers r_i ≥ 0, vertices v_i ∈ V, and sets M_i ⊆ V satisfy the following properties:

1. v_i ∉ M_i,

2. M_1 = ∅ and M_i ⊂ M_{i+1}, and

3. d(v, v_i)_G ≤ r_i implies v ∈ M_{i+1}.

The cost of the marking sequence is Σ_i (1 + r_i). Vertices v are considered to be marked at step i iff v ∈ M_i. Our key construct then is as follows: Greedy Localization forms an associated marking sequence where r_i = l_i − 1, v_i = x_{i-1}, and M_{i+1} is the set of uninformative vertices after the robot has reached and made an observation from v_i. The number of edge traversals of Greedy Localization equals the cost of the associated marking sequence since 1 + r_i = l_i. Note that the marking sequence is less restrictive than Greedy Localization because v_i need not be at distance 1 + r_i from v_{i+1}. Instead, the marking sequence consists of a sequence of choices of an unmarked vertex v_i and a radius r_i. All vertices within distance r_i of v_i (and possibly additional vertices) are marked, and the marking sequence continues.

Lemma 1 The cost of any marking sequence is no larger than |V| + 2|V| ln |V| on connected graphs G = (V, E).

Proof sketch: It follows from the triangle inequality that there exists a maximum-cost marking sequence that only marks one vertex, namely v_i, per step. For if another vertex v were also marked, one could replace the step with two more expensive steps that mark only v_i and v, respectively. By viewing the marking sequence as a sequence of disjoint balls of radius r_i in the metric space of graph distances, the connectivity of the graph limits the number of radii that are at least t to 2|V|/t. The lemma follows. Full details are given in [13].

Greedy Localization constrains the movements of the robot to be in a connected component of the graph. Hence, the lemma applies and the theorem is proved.
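Lemma 1 can also be checked empirically. The sketch below (our own illustrative code, not from the paper) generates random valid marking sequences on a path graph, choosing an unmarked vertex and an arbitrary radius at each step, and confirms that the accumulated cost Σ(1 + r_i) stays below |V| + 2|V| ln |V|:

```python
import math
import random

def random_marking_cost(n, seed):
    """Cost of one random valid marking sequence on the path graph 0..n-1:
    repeatedly pick an unmarked vertex v and a radius r, mark the whole
    ball B(v, r), and pay 1 + r (extra vertices are never marked here)."""
    rng = random.Random(seed)
    marked, cost = set(), 0
    while len(marked) < n:
        v = rng.choice([u for u in range(n) if u not in marked])
        r = rng.randint(0, n - 1)
        marked |= set(range(max(0, v - r), min(n, v + r + 1)))
        cost += 1 + r
    return cost

n = 100
bound = n + 2 * n * math.log(n)          # Lemma 1: |V| + 2|V| ln |V|
costs = [random_marking_cost(n, seed) for seed in range(50)]
assert max(costs) <= bound
print(max(costs), "<=", round(bound, 1))
```

Note why large radii cannot accumulate cost: every chosen center must be unmarked, hence outside all earlier balls, which is exactly the packing property the proof sketch exploits.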

Note that our proof is not specific to robots with short-range sensors that operate on two-dimensional grid graphs in which the potential neighbors of a cell are located only to its north, east, south, and west. Our proof does require that uninformative vertices remain uninformative. This is the case, for example, if the set of vertices that the robot believes it can be in when it is in some vertex x is always included in the set of vertices that the robot believed it could have been in when it was in the same vertex x earlier. Thus our theorem holds also, for example, for higher-dimensional or differently connected grid graphs and is completely independent of the kind of sensors used by the robot.


3 D*

The goal-acquisition problem on grid graphs is defined as follows. The robot moves again with no actuator uncertainty on a grid with blocked and unblocked cells. Its perimeter consists of blocked cells. The start cell of the robot is unblocked. The robot knows its start cell and orientation and has to move to a given goal cell. It does not know initially which cells are blocked. The robot can always move from its current cell north, east, south or west to any neighboring cell. The movement succeeds if the cell is unblocked. After each movement, the robot is permitted to use all the observations made so far (corresponding to the partial map that it has learned so far) to decide in which direction to move next. The goal-acquisition problem is to move to the goal cell in a minimum number of movements. We assume in the following that the sensors of the robot always detect with no sensor uncertainty whether the cell that it attempts to move to is blocked.

D* maintains the partial map that the robot has learned so far. It always makes the robot execute a shortest sequence of movements from its current cell to the goal cell under the optimistic assumption that cells that have not been observed to be blocked are unblocked. It terminates once the robot has reached the goal cell or no such sequences of movements exist any longer (in which case the robot cannot reach the goal cell). Whenever the robot observes a blocked cell on its current path, D* needs to replan, which can be implemented efficiently [9] and easily [4]. D* has been used outdoors on an autonomous high-mobility multi-wheeled vehicle that navigated 1,410 meters to the goal location in an unknown area of flat terrain with sparse mounds of slag as well as trees, bushes, rocks, and debris [10]. As a result of this demonstration, D* is now widely used in the DARPA Unmanned Ground Vehicle (UGV) program, for example, on the UGV Demo II vehicles. D* is also being integrated into Mars Rover prototypes, tactical mobile robot prototypes and other military robot prototypes for urban reconnaissance.
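The planning loop just described can be sketched as follows. This toy version (our own; the grid layout and function names are illustrative) replans from scratch with breadth-first search instead of performing D*'s incremental replanning, but it moves the robot exactly as described: along a shortest presumed-unblocked path, assuming unobserved cells are unblocked:

```python
from collections import deque

def navigate(grid, start, goal):
    """Move under the freespace assumption: plan a shortest path that avoids
    only *known* blocked cells, walk it, and replan whenever a step bumps
    into a blocked cell.  Returns the number of movements, or None."""
    rows, cols = len(grid), len(grid[0])
    known_blocked = set()
    pos, moves = start, 0
    while pos != goal:
        path = bfs(pos, goal, rows, cols, known_blocked)
        if path is None:
            return None                       # goal provably unreachable
        for step in path[1:]:
            if grid[step[0]][step[1]] == 1:   # sensor: the cell is blocked
                known_blocked.add(step)
                break                         # replan from current position
            pos = step
            moves += 1
            if pos == goal:
                break
    return moves

def bfs(start, goal, rows, cols, blocked):
    """Shortest presumed-unblocked path (unknown cells assumed traversable)."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nb in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nb[0] < rows and 0 <= nb[1] < cols \
                    and nb not in blocked and nb not in parent:
                parent[nb] = cell
                queue.append(nb)
    return None

grid = [[0, 1, 0],       # 0 = unblocked, 1 = blocked (unknown to the robot)
        [0, 1, 0],
        [0, 0, 0]]
print(navigate(grid, (0, 0), (0, 2)))   # 6 (two bumps force two replans)
```

The worst-case number of movements of this loop is exactly the quantity bounded in the analysis below; the incremental replanning in D* proper changes the planning time per step, not the travel distance.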

3.1 Analysis

We analyze the worst-case travel distance of D* in the following. We model the grid as a grid graph whose vertices correspond to the cells. Thus, vertices can be blocked or unblocked. Two vertices are connected via an edge iff they correspond to neighboring cells. However, our analysis holds for graphs in general, not just grid graphs. We therefore generalize the problem as follows:

Consider a graph G = (V, E) with blocked and unblocked vertices. We assume without loss of generality that the graph is connected (otherwise we can consider only the connected component of the graph that contains the start vertex). The start vertex v_0 of the robot is unblocked. The robot always knows its current vertex (for example, because the robot knows its start vertex) and has to move to a given goal vertex t. It does not know initially which vertices are blocked. The robot can attempt to move from its current vertex to any neighboring vertex. If the neighboring vertex is unblocked, then the robot moves to it. If the neighboring vertex is blocked then the robot remains in its current vertex. In both cases, the robot observes whether the neighboring vertex is blocked. D* always moves the robot along a shortest presumed-unblocked path from its current vertex to the goal vertex. A path is presumed unblocked if it does not contain vertices that the robot knows to be blocked. Whenever the robot observes that a vertex is blocked, it recalculates a shortest presumed-unblocked path from its current vertex to the goal vertex and repeats the process. D* terminates once the robot has reached the goal vertex or can no longer find a presumed-unblocked path from its current vertex to the goal vertex.

We use the following notation: At the beginning of iteration i, the robot is in vertex v_{i-1} and E_i is the set of edges that are not incident on a vertex known to be blocked at that time. Note that E_1 = E. The robot then plans a shortest path P_i in graph H_i = (V, E_i) from vertex v_{i-1} to t, starts to follow it, and then stops in vertex v_i either because v_i = t or because the vertex b_i following v_i on P_i is blocked. In the latter case, let v'_i denote the vertex following b_i on P_i. The robot eventually stops in vertex v_k either because v_k = t or because there are no longer any presumed-unblocked paths from v_k to t. Thus, k ≤ |V|. Let d(x, x')_G denote the distance from vertex x to vertex x' in graph G. If x and x' are not in the same connected component of G then d(x, x')_G = ∞. The travel distance of the robot is

    C = Σ_{i=1}^{k} d(v_{i-1}, v_i)_{H_i}.

3.2 Telescoping

Lemma 2 D* traverses at most |V| + Σ_{i=1}^{k-1} d(v_i, v'_i)_{H_{i+1}} edges on connected graphs G = (V, E).

Proof: Since v_i lies on the shortest path P_i from v_{i-1} to t in H_i, by the principle of optimality,

    C = Σ_{i=1}^{k} d(v_{i-1}, v_i)_{H_i}
      = Σ_{i=1}^{k} (d(v_{i-1}, t)_{H_i} − d(v_i, t)_{H_i})
      = d(v_0, t)_{H_1} − d(v_k, t)_{H_k} + Σ_{i=1}^{k-1} (d(v_i, t)_{H_{i+1}} − d(v_i, t)_{H_i})
      ≤ |V| + Σ_{i=1}^{k-1} (d(v_i, t)_{H_{i+1}} − d(v_i, t)_{H_i}).

The last inequality uses d(v_0, t)_{H_1} < |V|, which holds since G = (V, E) = (V, E_1) = H_1 is connected.

Now consider an arbitrary i with 1 ≤ i < k. The robot planned a shortest path P_i in H_i from v_{i-1} via v_i, b_i, and v'_i to t. Thus, the subpath of P_i from v'_i to t is a shortest path in H_i from v'_i to t. Since it does not contain any edges incident on b_i, it is also a shortest path in H_{i+1} from v'_i to t and thus d(v'_i, t)_{H_{i+1}} = d(v'_i, t)_{H_i}. Thus, by the triangle inequality,

    d(v_i, t)_{H_{i+1}} ≤ d(v_i, v'_i)_{H_{i+1}} + d(v'_i, t)_{H_{i+1}} = d(v_i, v'_i)_{H_{i+1}} + d(v'_i, t)_{H_i}.

By the definition of v'_i, we have d(v_i, t)_{H_i} = 2 + d(v'_i, t)_{H_i}. Thus,

    C ≤ |V| + Σ_{i=1}^{k-1} (d(v_i, t)_{H_{i+1}} − d(v_i, t)_{H_i})
      ≤ |V| + Σ_{i=1}^{k-1} ((d(v_i, v'_i)_{H_{i+1}} + d(v'_i, t)_{H_i}) − (d(v'_i, t)_{H_i} + 2))
      ≤ |V| + Σ_{i=1}^{k-1} d(v_i, v'_i)_{H_{i+1}}.

3.3 Time Reversal and Weighted Edges

Consider the following function:

CYCLE-WEIGHT(T, S). Input: a tree T = (V, E) and an ordered list S = {e_j : 1 ≤ j ≤ k} of distinct edges from the complete graph on V such that S ∩ E = ∅. Define the weight w_i of edge e_i ∈ S to be the length of a shortest cycle that contains e_i in the graph T_i = (V, E ∪ {e_j : i ≤ j ≤ k}). Output: Σ_{i=1}^{k} w_i.
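For concreteness: since the tree T spans V, every edge e_i lies on a cycle in T_i, and the shortest such cycle has length 1 plus the distance between the endpoints of e_i in T_i with e_i removed. The function can therefore be computed directly with breadth-first search (a sketch; the small instance at the end is our own):

```python
from collections import defaultdict, deque

def cycle_weight(tree_edges, s):
    """CYCLE-WEIGHT(T, S): the weight w_i of e_i is the length of a shortest
    cycle through e_i in T_i = (V, E ∪ {e_j : i ≤ j ≤ k}); returns Σ w_i."""
    total = 0
    for i, (u, v) in enumerate(s):
        # Build T_i minus the edge e_i itself; the shortest cycle through
        # e_i then has length 1 + dist(u, v) in that graph.
        adj = defaultdict(list)
        for a, b in tree_edges + s[i + 1:]:
            adj[a].append(b)
            adj[b].append(a)
        dist, queue = {u: 0}, deque([u])
        while v not in dist:              # v is reachable via the tree path
            a = queue.popleft()
            for b in adj[a]:
                if b not in dist:
                    dist[b] = dist[a] + 1
                    queue.append(b)
        total += 1 + dist[v]
    return total

tree = [(1, 2), (2, 3), (3, 4)]           # the spanning tree T (a path)
extra = [(1, 4), (1, 3)]                  # the ordered list S
print(cycle_weight(tree, extra))          # 3 + 3 = 6
```

On the example, e_1 = (1, 4) closes the cycle 1-3-4 of length 3 (using the later edge (1, 3)), and e_2 = (1, 3) closes the cycle 1-2-3 of length 3 in the bare tree.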

We now show that Σ_{i=1}^{k-1} d(v_i, v'_i)_{H_{i+1}} ≤ CYCLE-WEIGHT(T, S) for a suitably constructed tree T and S = {e_j = (v_j, b_j) : 1 ≤ j < k}.

The basic idea relating the edge weights in CYCLE-WEIGHT to the d(v_i, v'_i)_{H_{i+1}} values can be understood by considering a special case. Assume that H_k is connected except for the isolated vertices b_1, b_2, ..., b_{k-1}. Reverse the time perspective so that the robot movement adds edges, first the edges incident on b_{k-1}, then the edges incident on b_{k-2}, and so on. Pick T to be a spanning tree of the graph (V, E_k ∪ {(b_j, v'_j) : 1 ≤ j < k}) and S to be {e_j = (v_j, b_j) : 1 ≤ j < k}. Then, w_i ≥ 2 + d(v_i, v'_i)_{H_{i+1}} for 1 ≤ i < k since every cycle that contains e_i = (v_i, b_i) in T_i must also contain (b_i, v'_i). Consequently, Σ_{i=1}^{k-1} d(v_i, v'_i)_{H_{i+1}} ≤ Σ_{i=1}^{k-1} w_i = CYCLE-WEIGHT(T, S).

Unfortunately this simple construction does not work in the general case since multiple connected components may be formed when the edges incident on a blocked vertex are removed. To get around this problem, we define the sequence of graphs F_k, F_{k-1}, ..., F_1 as follows:

• Let F_k be a spanning forest of H_k.

• For 1 ≤ i < k, let C^i_1, C^i_2, ..., C^i_{k_i} be the connected components of H_{i+1} which get merged with b_i in H_i, with the restriction that v'_i ∈ C^i_1. Select a w^i_j ∈ C^i_j for 1 ≤ j ≤ k_i such that w^i_j is a neighbor of b_i in G, with the restriction that w^i_1 = v'_i. Let F_i result from F_{i+1} by adding the edges {(b_i, w^i_j) : 1 ≤ j ≤ k_i}.

The following lemma is immediate:


Lemma 3 For 1 ≤ i ≤ k and all vertices u and v, F_i is acyclic; d(u, v)_{F_i} < ∞ iff d(u, v)_{H_i} < ∞; and d(u, v)_{F_i} ≥ d(u, v)_{H_i}.

Proof: By induction.

We are now ready to prove the bound.

Lemma 4 Let H_1, H_2, ..., H_k be a sequence of graphs as defined above. Let T = F_1 and S = {e_i = (v_i, b_i) : 1 ≤ i < k}. Then Σ_{i=1}^{k-1} d(v_i, v'_i)_{H_{i+1}} ≤ CYCLE-WEIGHT(T, S).

Proof: By Lemma 3, F_{i+1} and H_{i+1} have the same connected components. The subgraph of F_1 induced by C^i_1 is connected since C^i_1 is a component of H_{i+1}. The edges e_j for i < j < k are contained in C^i_1 since v_j, v'_j, b_j ∈ C^i_1 for all i < j < k. Thus, the graph obtained by contracting all vertices of C^i_1 in T_{i+1} is acyclic. Since T_i is obtained from T_{i+1} by adding e_i, every cycle that contains e_i = (v_i, b_i) in T_i must also contain (b_i, v'_i). Thus, w_i is equal to 2 plus the distance between v_i and v'_i in the subgraph G' of T_i induced by C^i_1. But G' is also a subgraph of H_{i+1} and hence w_i ≥ 2 + d(v_i, v'_i)_{H_{i+1}}. Consequently, Σ_{i=1}^{k-1} d(v_i, v'_i)_{H_{i+1}} ≤ Σ_{i=1}^{k-1} w_i = CYCLE-WEIGHT(T, S).

3.4 An Extremal Problem on Graphs

We now bound CYCLE-WEIGHT((V, E), S) in terms of |V| and |S|. Let E_w = {e_i : w_i ≥ w} ⊆ S be the set of edges with weight at least w. Recall that the girth of a graph is the length of its shortest cycle. Define Γ(n, w) [and Γ_P(n, w)] to denote the maximum number of edges in graphs [and, respectively, planar graphs] with n vertices and a girth of at least w. The following lemma relates E_w and Γ(n, w).

Lemma 5 |E_w| ≤ Γ(|V|, w) − |V| + 1 for all w and every instance CYCLE-WEIGHT((V, E), S).

Proof: Consider the graph T_w = (V, E ∪ E_w). We claim that T_w has a girth of at least w. To see this, assume that it does not and thus has a cycle C of length w' < w. Since (V, E) is a tree, at least one edge of C must belong to E_w. Consider the edge e_j ∈ E_w ∩ C with the smallest j. Then T_j contains C and thus w_j ≤ w' < w. On the other hand, w_j ≥ w since e_j ∈ E_w, which is a contradiction. Thus, T_w has a girth of at least w. This implies that Γ(|V|, w) ≥ |E ∪ E_w| = |E| + |E_w| = |V| − 1 + |E_w| and the lemma follows.

Corollary 1 |E_w| ≤ Γ_P(|V|, w) − |V| + 1 for all w and every instance CYCLE-WEIGHT((V, E), S) such that (V, E ∪ S) is planar.


Proof: In the proof of Lemma 5, T_w is planar because it is a subgraph of the planar graph (V, E ∪ S). Hence Γ(|V|, w) may be replaced by Γ_P(|V|, w).

We now bound CYCLE-WEIGHT((V, E), S) by making use of bounds on Γ(n, w) and Γ_P(n, w), two well-studied problems in extremal combinatorics. We first consider the case where the graph (V, E ∪ S) is planar.

Lemma 6 Γ_P(n, w) ≤ wn/(w − 2) for all n and all w > 2.

Proof: Since the sum of the lengths of all faces of any planar graph G = (V, E) is at most 2|E| and every face has a length of at least w, the number of its faces can be at most 2|E|/w. The bound of the lemma follows from substituting this relationship into Euler's formula.
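Spelled out, the substitution step is as follows (assuming G connected, so that Euler's formula n − |E| + F = 2 applies, where F is the number of faces):

```latex
n - |E| + \frac{2|E|}{w} \;\ge\; n - |E| + F \;=\; 2
\quad\Longrightarrow\quad
|E|\left(1 - \frac{2}{w}\right) \;\le\; n - 2
\quad\Longrightarrow\quad
|E| \;\le\; \frac{w(n-2)}{w-2} \;\le\; \frac{wn}{w-2}.
```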

Note that the weight of any edge in S is at most |V|. Define E_{w,2w} = {e_i ∈ S : w ≤ w_i < 2w}. Then, by Corollary 1 and Lemma 6,

    CYCLE-WEIGHT((V, E), S) ≤ Σ_{i=1}^{log |V|} 2^{i+1} |E_{2^i, 2^{i+1}}|
      ≤ O(|S|) + Σ_{i=3}^{log |V|} 2^{i+1} |E_{2^i}|
      ≤ O(|S|) + Σ_{i=3}^{log |V|} 2^{i+1} (Γ_P(|V|, 2^i) − |V| + 1)
      ≤ O(|S|) + Σ_{i=3}^{log |V|} 2^{i+1} (2^i |V| / (2^i − 2) − |V| + 1)
      ≤ O(|S|) + Σ_{i=3}^{log |V|} 2^{i+1} · 4|V| / 2^i
      = O(|S|) + Σ_{i=3}^{log |V|} 8|V|
      = O(|S| + |V| log |V|).

We now repeat the analysis for general graphs. In this case, we use a recent result by Alon, Hoory and Linial that states that any graph G = (V, E) with average degree d > 2 has a girth of at most log_{d−1} |V| [1], resulting in the following lemma.

Lemma 7 Γ(n, w) ≤ n(n^{1/w} + 1)/2 for all w and n.

Proof: Consider any graph G = (V, E) with |V| = n, |E| ≥ |V| + 1 and a girth of at least w. Then, its average degree is d = 2|E|/n > 2 and thus, by the result by Alon, Hoory and Linial, w ≤ log_{2|E|/n − 1} n. Solving this inequality for |E| yields the lemma.
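The "solving for |E|" step, made explicit: since d − 1 > 1, the condition w ≤ log_{d−1} n is equivalent to (d − 1)^w ≤ n, so

```latex
(d-1)^{w} \le n
\;\Longrightarrow\;
d \le n^{1/w} + 1
\;\Longrightarrow\;
|E| = \frac{dn}{2} \le \frac{n\left(n^{1/w} + 1\right)}{2}.
```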


This lemma allows us to bound CYCLE-WEIGHT((V, E), S) for general graphs. Using calculus, we can show that w(|V|(|V|^{1/w} − 1)) = O(|V| log |V|) for |V| ≥ w > log^2 |V|. Using this fact with Lemmata 5 and 7, we have

    CYCLE-WEIGHT((V, E), S) = Σ_{i : w_i ≤ log^2 |V|} w_i + Σ_{i : w_i > log^2 |V|} w_i
      ≤ |S| log^2 |V| + Σ_{i=2 log log |V|}^{log |V|} 2^{i+1} |E_{2^i, 2^{i+1}}|
      ≤ |S| log^2 |V| + Σ_{i=2 log log |V|}^{log |V|} 2^{i+1} |E_{2^i}|
      ≤ |S| log^2 |V| + Σ_{i=2 log log |V|}^{log |V|} 2^{i+1} (Γ(|V|, 2^i) − |V| + 1)
      = |S| log^2 |V| + Σ_{i=2 log log |V|}^{log |V|} 2^{i+1} (|V|(|V|^{1/2^i} − 1)/2 + 1)
      = |S| log^2 |V| + Σ_{i=2 log log |V|}^{log |V|} O(|V| log |V|)
      = O((|V| + |S|) log^2 |V|).

We now state these results as a lemma.

Lemma 8 CYCLE-WEIGHT((V, E), S) = O((|V| + |S|) log^2 |V|) for every instance CYCLE-WEIGHT((V, E), S), and CYCLE-WEIGHT((V, E), S) = O(|S| + |V| log |V|) for every instance CYCLE-WEIGHT((V, E), S) such that (V, E ∪ S) is planar.

3.5 Worst-Case Travel Bound

We are now ready to prove an upper bound on the worst-case travel distance of D*.

Theorem 2 D* traverses O(|V| log^2 |V|) edges on connected graphs G = (V, E). It traverses O(|V| log |V|) edges on connected planar graphs G = (V, E).

Proof: By Lemmata 2 and 4, D* traverses at most |V| + Σ_{i=1}^{k-1} d(v_i, v'_i)_{H_{i+1}} ≤ |V| + CYCLE-WEIGHT((V, E'), S) edges, where |S| < |V| and (V, E' ∪ S) is a subgraph of G. By Lemma 8, CYCLE-WEIGHT((V, E'), S) = O((|V| + |S|) log^2 |V|) = O(|V| log^2 |V|) and, if G and thus (V, E' ∪ S) are planar, CYCLE-WEIGHT((V, E'), S) = O(|S| + |V| log |V|) = O(|V| log |V|). The theorem follows.


4 Conclusions

The robot-navigation methods that we have analyzed in this paper, Greedy Localization and D*, are appealingly simple and easy to implement from a robotics point of view and appealingly complicated to analyze from a mathematical point of view. Our results, likewise, are satisfying in two ways. First, our tighter upper bounds on their worst-case travel distances guarantee that they cannot perform badly at all. Second, the gaps between the best known lower and upper bounds are now quite small, namely O(log log n) for Greedy Localization on grid graphs and D* on planar graphs (including grid graphs), and O(log n log log n) for D* on arbitrary graphs.

References

[1] N. Alon, S. Hoory, and N. Linial. The Moore bound for irregular graphs. Graphs and Combinatorics, 18(1):53–57, 2002.

[2] G. Dudek, K. Romanik, and S. Whitesides. Localizing a robot with minimum travel. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms, pages 437–446, 1995.

[3] D. Fox, W. Burgard, and S. Thrun. Active Markov localization for mobile robots. Robotics and Autonomous Systems, 25:195–207, 1998.

[4] S. Koenig and M. Likhachev. Improved fast replanning for robot navigation in unknown terrain. In Proceedings of the International Conference on Robotics and Automation, pages 968–975, 2002.

[5] S. Koenig, C. Tovey, and Y. Smirnov. Performance bounds for planning in unknown terrain. Artificial Intelligence, 2002.

[6] I. Nourbakhsh. Robot Information Packet. Distributed at the AAAI-96 Spring Symposium on Planning with Incomplete Information for Robot Problems, 1996.

[7] D. Sleator and R. Tarjan. Amortized efficiency of list update and paging rules. Communications of the ACM, 28(2):202–208, 1985.

[8] Y. Smirnov. Hybrid Algorithms for On-Line Search and Combinatorial Optimization Problems. PhD thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh (Pennsylvania), 1997. Available as Technical Report CMU-CS-97-171.

[9] A. Stentz. The focussed D* algorithm for real-time replanning. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 1652–1659, 1995.

[10] A. Stentz and M. Hebert. A complete navigation system for goal acquisition in unknown environments. Autonomous Robots, 2(2):127–145, 1995.

[11] C. Tovey, S. Greenberg, and S. Koenig. Improved analysis of D*. In Proceedings of the International Conference on Robotics and Automation, 2003.


[12] C. Tovey and S. Koenig. Gridworlds as testbeds for planning with incomplete information. In Proceedings of the National Conference on Artificial Intelligence, pages 819–824, 2000.

[13] C. Tovey and S. Koenig. Improved analysis of greedy mapping. In Proceedings of the International Conference on Intelligent Robots and Systems, 2003.


Deductive Algorithmic Knowledge∗

Riccardo Pucella
Department of Computer Science
Cornell University
Ithaca, NY 14853

[email protected]

Abstract

The framework of algorithmic knowledge assumes that agents use algorithms to compute the facts they explicitly know. In many cases of interest, a logical theory, rather than a particular algorithm, can be used to capture the formal reasoning used by the agents to compute what they explicitly know. We introduce a logic for reasoning about both implicit and explicit knowledge, where the latter is given with respect to a deductive system formalizing a logical theory for agents. The highly structured nature of such logical theories leads to a very natural axiomatization of the resulting logic over a deductive system. In the case when the deductive system is in fact tractable, we show that the decision problem for the logic is NP-complete, no harder than propositional logic.

1 Introduction

It is well known that the standard model of knowledge based on possible worlds is useful when reasoning about knowledge ascribed to agents by an external observer, but is not appropriate when agents need to explicitly compute what they know in order to make decisions. Part of the problem is that such a form of knowledge is subject to the problem of logical omniscience, that is, the agents know all the logical consequences of their knowledge [Fagin, Halpern, Moses, and Vardi 1995, Chapter 9].

Levesque [1984], among others, distinguishes implicit knowledge as described above from explicit knowledge, which represents the knowledge that is available to the agent in order to make decisions. A general approach to model explicit knowledge is that of algorithmic knowledge [Fagin, Halpern, Moses, and Vardi 1995, Chapter 10]. In the algorithmic knowledge framework, the explicit knowledge of the agents is given by an algorithm that the agents use to establish whether they know a particular fact. The algorithmic knowledge framework has the advantage of being general; this same generality also means that there are no specific properties of algorithmic knowledge unless we focus on specific algorithms.

In this paper, we study a form of algorithmic knowledge, deductive algorithmic knowledge, where the explicit knowledge of agents comes from a logical theory in which the agents perform

∗Work supported in part by NSF under grant CTC-0208535, by ONR under grant N00014-02-1-0455, by the DoD Multidisciplinary University Research Initiative (MURI) program administered by the ONR under grant N00014-01-1-0795, and by AFOSR under grant F49620-02-1-0101.


their reasoning about the facts they know. Many useful forms of explicit knowledge can be formalized in such a logical theory for agents. This can be viewed as a form of algorithmic knowledge, where the algorithm used by an agent is an algorithm that attempts to infer whether a fact is derivable from the deduction rules provided by the agent's logical theory. The highly structured presentation of an agent's logical theory lets us readily derive the properties of explicit knowledge in this context. Intuitively, the deduction rules of the logical theory directly translate into properties of explicit knowledge.

To motivate the use of logical theories to capture explicit knowledge, consider the following example. In previous work [Halpern and Pucella 2002], we showed how the framework of algorithmic knowledge could be used to reason about agents communicating via cryptographic protocols. Algorithmic knowledge is useful to model an adversary that has a certain number of capabilities to decode the messages it intercepts. There are of course restrictions on the capabilities of a reasonable adversary. For instance, the adversary may not explicitly know that it has a given message if that message is encrypted using a key that the adversary does not know. To capture these restrictions, Dolev and Yao [1983] gave a now-standard description of capabilities of adversaries. Roughly speaking, a Dolev-Yao adversary can decompose messages, or decipher them if it knows the right keys, but cannot otherwise “crack” encrypted messages. The adversary can also construct new messages by concatenating known messages, or encrypting them with a known encryption key. It is natural to formalize a Dolev-Yao adversary using a deductive system that describes what messages the adversary “has” based on the messages it has intercepted, and what messages the adversary can construct.

To reason about such examples, we introduce a modal logic that lets us reason about both the implicit knowledge of agents, which is useful for specifications, and the explicit knowledge of agents, formalized as a logical theory. In contrast with existing work on developing models for agents reasoning via logical theories (for instance, Konolige [1986] and Giunchiglia et al. [1993]), which reason completely within the logical theories, assuming perhaps a global logical theory for the world, we base our logic on a standard possible-worlds semantics. This shows that it is possible to combine a standard possible-worlds account of implicit knowledge with a logical theory representing the explicit knowledge of agents, and to reason about both simultaneously. Our approach moreover extends easily to the probabilistic setting, such as when there is a probability measure over the possible worlds [Fagin and Halpern 1994]. We focus in this paper on the technical properties of the resulting logic.

The rest of this paper is structured as follows. In the next section, we give the background on deductive systems, which we use to define the logical theories of the agents. In Section 3, we introduce the formal logic to reason about knowledge given by deductive systems. In Section 4, we show how to derive sound and complete axiomatizations of the logic for reasoning about particular deductive systems. In Section 5, we address the complexity of the decision procedure for the logic. We conclude with a discussion of natural extensions of our framework. For reasons of space, we leave these extensions, as well as the proofs of our technical results, for the full paper.

2 Deductive Systems

We start by defining the framework in which we capture the logical theories of the agents, their deductive or inferential powers. Reasoning occurs over the terms of some term algebra. More

Page 196: Artificial Intelligence and Mathematics - Rutgers Universityrutcor.rutgers.edu/~amai/aimath04/AcceptedPapers/aimath-I.pdf · Artificial Intelligence and Mathematics January 4-6, 2004

precisely, assume a fixed finite signature Σ = (f1, . . . , fn), where each fi is an operation symbol with arity ri. (Operation symbols of arity 0 are called constants.) Assume a set Vars of variables. Define the term algebra TΣ as the least set such that Vars ⊆ TΣ and, for all f ∈ Σ of arity n and all t1, . . . , tn ∈ TΣ, f(t1, . . . , tn) ∈ TΣ. Intuitively, TΣ contains all the terms that can be built from the variables, constants, and operations in Σ. We say a term is a ground term if it contains no variables. Let TgΣ be the set of ground terms in TΣ. A substitution ρ is a mapping from variables in Vars to ground terms. The application of a substitution ρ to a term t, written ρ(t), essentially consists of replacing every variable in t with the ground term corresponding to that variable in ρ. Clearly, the application of a substitution to a term yields a ground term.
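To make the definitions concrete, here is a small sketch (illustrative Python of our own, not from the paper) that encodes terms as nested tuples and implements substitution application; `is_var`, `is_ground`, and `apply_subst` are names we introduce:

```python
# A variable is a string starting with '?'; any other term is a tuple
# (operator, arg1, ..., argn), with constants represented as (op,).
def is_var(t):
    return isinstance(t, str) and t.startswith("?")

def is_ground(t):
    """A ground term contains no variables."""
    if is_var(t):
        return False
    return all(is_ground(arg) for arg in t[1:])

def apply_subst(rho, t):
    """Apply substitution rho to term t: replace every variable in t
    by the ground term rho assigns to it, yielding a ground term."""
    if is_var(t):
        return rho[t]
    return (t[0],) + tuple(apply_subst(rho, arg) for arg in t[1:])

# encr(m, ?k) under the substitution {?k -> k1} yields encr(m, k1):
rho = {"?k": ("k1",)}
t = ("encr", ("m",), "?k")
assert apply_subst(rho, t) == ("encr", ("m",), ("k1",))
assert is_ground(apply_subst(rho, t)) and not is_ground(t)
```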

A deductive system D is a subset of ℘(TΣ) × TΣ. A deduction rule (t1, . . . , tn, t) of D is typically written t1, . . . , tn ▷ t, and means that t can be immediately deduced from t1, . . . , tn. A deduction of t from a set Γ of terms is a sequence of ground terms t1, . . . , tn such that tn = t, and every ti is either (1) of the form ρ(t′), for some substitution ρ and some term t′ ∈ Γ, or (2) of the form ρ(t′), for some substitution ρ and some term t′ for which there is a deduction rule t′i1, . . . , t′ik ▷ t′ in D, where ρ(t′ij) = tij for all j, and i1, . . . , ik < i. We write Γ ⊢D t if there is a deduction from Γ to t via deduction rules in D. Observe that by definition we have t ⊢D t for all terms t.

Example 2.1: The following deductive system DY over the signature Σ = (recv, has, encr, conc, inv) captures the Dolev-Yao adversary described in the introduction. Here, recv(m) represents the fact that the adversary has received the term m, has(m) represents the fact that the adversary understands the term m, encr(m, k) represents the encryption of term m with key k, conc(m1, m2) represents the concatenation of terms m1 and m2, and inv(k) represents the inverse of the key k.

recv(m) ▷ has(m)
has(inv(k)), has(encr(m, k)) ▷ has(m)
has(conc(m1, m2)) ▷ has(m1)
has(conc(m1, m2)) ▷ has(m2)

Assume further that Σ contains constants such as m, k1, k2. We can therefore derive, for instance, that recv(encr(m, k1)), recv(encr(inv(k1), k2)), recv(inv(k2)) ⊢DY has(m). In other words, it is possible for a Dolev-Yao adversary to derive the message m if it has received m encrypted under a key k1, whose inverse it has received encrypted under a key k2, whose inverse it has received.

To account for constructing new messages, consider the signature Σ′ which extends Σ with a unary constructor constr, where constr(m) represents the fact that the adversary can construct the term m. We can account for this new constructor by adding the following deduction rules to DY.

has(m) ▷ constr(m)
constr(k), constr(m) ▷ constr(encr(m, k))
constr(m1), constr(m2) ▷ constr(conc(m1, m2))

For instance, we have recv(encr(m, k1)), recv(inv(k1)), recv(k2) ⊢DY constr(encr(m, k2)).

It should be clear from the definitions that the deductive systems we consider in this paper are monotonic. In other words, if Γ ⊢D t, then Γ′ ⊢D t when Γ ⊆ Γ′. While the framework we introduce in Section 3 would remain unchanged were we to consider nonmonotonic deductive systems, the axiomatizations of Section 4 would not be possible.
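The deduction relation Γ ⊢D t can be checked for ground goals by forward chaining: repeatedly match rule premises against the known facts and add the instantiated conclusions until nothing changes. The sketch below is our own illustrative Python (names such as `deduce` and `match` are not from the paper); it naively enumerates premise instantiations, which suffices for small examples such as the Dolev-Yao system of Example 2.1:

```python
from itertools import product

# Terms are nested tuples; rule variables are strings starting with '?'.
def is_var(t):
    return isinstance(t, str) and t.startswith("?")

def subst(rho, t):
    if is_var(t):
        return rho[t]
    return (t[0],) + tuple(subst(rho, a) for a in t[1:])

def match(pattern, term, rho):
    """One-way matching: extend rho so that subst(rho, pattern) == term."""
    if is_var(pattern):
        if pattern in rho:
            return rho if rho[pattern] == term else None
        return {**rho, pattern: term}
    if pattern[0] != term[0] or len(pattern) != len(term):
        return None
    for p, t in zip(pattern[1:], term[1:]):
        rho = match(p, t, rho)
        if rho is None:
            return None
    return rho

def deduce(rules, facts):
    """Close a set of ground facts under the deduction rules."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, concl in rules:
            # try assigning known facts to the rule's premises, in order
            for combo in product(known, repeat=len(premises)):
                rho = {}
                for p, f in zip(premises, combo):
                    rho = match(p, f, rho)
                    if rho is None:
                        break
                if rho is not None and subst(rho, concl) not in known:
                    known.add(subst(rho, concl))
                    changed = True
    return known

# The Dolev-Yao rules of Example 2.1 (?m, ?k, ... are rule variables).
DY = [
    ([("recv", "?m")], ("has", "?m")),
    ([("has", ("inv", "?k")), ("has", ("encr", "?m", "?k"))], ("has", "?m")),
    ([("has", ("conc", "?m1", "?m2"))], ("has", "?m1")),
    ([("has", ("conc", "?m1", "?m2"))], ("has", "?m2")),
]
m, k1, k2 = ("m",), ("k1",), ("k2",)
gamma = [("recv", ("encr", m, k1)),
         ("recv", ("encr", ("inv", k1), k2)),
         ("recv", ("inv", k2))]
assert ("has", m) in deduce(DY, gamma)  # Gamma |-_DY has(m)
```

Note that `itertools.product` snapshots its input when created, so adding newly derived facts to `known` during the loop is safe.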


3 Deductive Algorithmic Knowledge

We now introduce a simple modal logic for reasoning about the implicit and explicit knowledge of an agent, where the explicit knowledge is formalized as a logical theory. For simplicity, we focus on the single-agent case, and moreover on the purely propositional case. We start by introducing the simple setting, where the deductive system is used to reason about primitive propositions. We then extend it to full propositional formulas.

Primitive Deductive Systems. We first define the logic Lprim(Σ) over a signature Σ. We take the primitive propositions to be TgΣ, the ground terms over Σ. The language of the logic is obtained by starting with the primitive propositions in TgΣ, and closing off under negation, conjunction, the K operator, and the X operator applied to primitive propositions only. Intuitively, Kϕ is read as “the agent implicitly knows ϕ”, while Xp is read as “the agent knows p according to his logical theory”. (We extend the framework so that X applies to more general formulas later in this section.) We define the usual abbreviations, ϕ ∨ ψ for ¬(¬ϕ ∧ ¬ψ), and ϕ ⇒ ψ for ¬ϕ ∨ ψ. We define true as an abbreviation for an arbitrary but fixed propositional tautology, and false as an abbreviation for ¬true.

We use a special form of deductive system that essentially takes into account observations that the agent can make. A primitive deductive system D over Σ is a deductive system with a distinguished class Obs ⊆ TgΣ such that, intuitively, no term in Obs arises as the conclusion of a deduction rule in D. Formally, for all ob ∈ Obs and for all rules t1, . . . , tn ▷ t of D, there does not exist a substitution ρ such that ρ(t) = ob.

The semantics of the logic follows the standard possible-worlds presentation for modal logics of knowledge [Hintikka 1962]. A deductive knowledge structure is a tuple M = (S, π, D), where S is a set of states, π is an interpretation of the primitive propositions, and D is a primitive deductive system over Σ, with observation set Obs. Every state s in S is of the form (e, obs), where e is a state of the environment (taken from a set E), which captures the general state of the system, and obs is a finite set of observations, taken from Obs, representing the observations that the agent has made at that state. Hence, S ⊆ E × ℘(Obs).1 The interpretation π associates with every state the set of primitive propositions that are true at that state, so that for every primitive proposition p ∈ TgΣ, we have π(s)(p) ∈ {true, false}. The only assumption we make is that the interpretation respects the observations made at a state, that is, π(e, obs)(ob) = true if and only if ob ∈ obs.

Let M(Σ) be the set of all deductive knowledge structures with signature Σ, and, for a fixed deductive system D over Σ, let MD(Σ) be the set of all deductive knowledge structures over the deductive system D.

We define a relation between states that captures the states that the agent cannot distinguish based on the observations. Define s ∼ s′ if s = (e, obs) and s′ = (e′, obs) for some e, e′, and set of observations obs. Clearly, ∼ is an equivalence relation over the states.

We define what it means for a formula ϕ to be true at a state s of M, written (M, s) |= ϕ, inductively as follows.

(M, s) |= p if π(s)(p) = true

1For simplicity, we assume that the observations form a set. This implies that repetition of observations and their order is unimportant. We can easily model the case where the observations form a sequence, at the cost of complicating the presentation.


(M, s) |= ¬ϕ if (M, s) ⊭ ϕ

(M, s) |= ϕ ∧ ψ if (M, s) |= ϕ and (M, s) |= ψ

(M, s) |= Kϕ if (M, s′) |= ϕ for all s′ ∼ s

(M, s) |= Xp if s = (e, {ob1, . . . , obn}) and ob1, . . . , obn ⊢D p.

Example 3.1: Consider the deductive system DY from Example 2.1. This deductive system can be viewed as a primitive deductive system by taking

Obs = {recv(t) : t ∈ TgΣ and has does not appear in t}.

Intuitively, an observation represents a message intercepted by the adversary. Define the subterm relation ⊑ on TgΣ as the smallest relation such that:

t ⊑ t,
if t ⊑ t1 then t ⊑ conc(t1, t2),
if t ⊑ t2 then t ⊑ conc(t1, t2),
if t ⊑ t1 then t ⊑ encr(t1, t2).

Consider a structure M = (S, π, DY), where we record at every state all messages intercepted by the adversary at that state. Let π be an interpretation that respects the observations made at a state, and such that π(e, obs)(has(t)) = true if and only if there exists t′ ∈ TgΣ such that recv(t′) ∈ obs and t ⊑ t′. In other words, has(t) holds at a state if t is a subterm of a message intercepted by the adversary. Let s1 be a state with observations {recv(encr(m, k1)), recv(encr(inv(k1), k2))}, and s2 a state with observations {recv(encr(m, k1)), recv(encr(inv(k1), k2)), recv(inv(k2))}. By definition of π, we have (M, s1) |= K(has(m)) and (M, s2) |= K(has(m)), so that at both states, the adversary implicitly has the message m. However, from the results of Example 2.1, we see that (M, s2) |= X(has(m)), while (M, s1) |= ¬X(has(m)). In other words, the adversary explicitly knows it has m at state s2 (where it has intercepted the appropriate terms), but not at state s1.

The monotonicity of the deductive systems means that for a structure M with states s = (e, obs), s′ = (e′, obs′), and obs ⊆ obs′, we have that (M, s) |= Xp implies (M, s′) |= Xp. Thus, explicit knowledge of facts is never lost when new observations are made.
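Under this semantics, checking (M, s) |= Xp amounts to closing the state's observations under the deductive system and testing membership. A minimal sketch (our own illustrative Python, restricted to the ground rule instances needed for the two states of Example 3.1; `explicit_knows` is a name we introduce):

```python
def explicit_knows(ground_rules, obs, p):
    """(M, (e, obs)) |= Xp iff p is deducible from obs: close the
    observations under the ground rules and test membership."""
    known = set(obs)
    changed = True
    while changed:
        changed = False
        for premises, concl in ground_rules:
            if concl not in known and all(q in known for q in premises):
                known.add(concl)
                changed = True
    return p in known

# Ground instances of the Dolev-Yao rules relevant to Example 3.1.
rules = [
    (["recv(encr(m,k1))"], "has(encr(m,k1))"),
    (["recv(encr(inv(k1),k2))"], "has(encr(inv(k1),k2))"),
    (["recv(inv(k2))"], "has(inv(k2))"),
    (["has(inv(k2))", "has(encr(inv(k1),k2))"], "has(inv(k1))"),
    (["has(inv(k1))", "has(encr(m,k1))"], "has(m)"),
]
s1_obs = {"recv(encr(m,k1))", "recv(encr(inv(k1),k2))"}
s2_obs = s1_obs | {"recv(inv(k2))"}
assert not explicit_knows(rules, s1_obs, "has(m)")  # (M, s1) |= ¬X(has(m))
assert explicit_knows(rules, s2_obs, "has(m)")      # (M, s2) |= X(has(m))
```

The monotonicity property is visible here directly: enlarging `obs` can only enlarge the closed set `known`.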

Propositional Deductive Systems. The logic we introduced above is restricted to taking X over primitive propositions only. This is sufficient for many situations of interest, but it is often too limiting. Among other things, it cannot capture the fact that the agent may not have the deductive capabilities to do full propositional reasoning. For instance, an agent could explicitly know p and explicitly know q, but perhaps not that p ∧ q. This is what we address in this section. A propositional signature Σ is a signature with propositional constructors, that is, {true, false, not, and} ⊆ Σ, where true, false have arity 0, not has arity 1, and and has arity 2. We define a logic Lprop(Σ) over a propositional signature Σ where we can reason about explicit knowledge of propositional formulas. The primitive propositions are again taken to be TgΣ. The syntax is that of Lprim(Σ), except that we allow X to apply to propositional formulas. We also take true and false as primitives, rather than defining them as abbreviations.


As before, the semantics of this logic is given in terms of deductive knowledge structures M = (S, π, D), but here we take D to be a propositional deductive system, that is, a primitive deductive system defined over a propositional signature Σ. The satisfaction relation (M, s) |= ϕ is defined as before (along with obvious rules for true and false), except that we modify the semantics of the X operator to take into account the fact that we can query for explicit knowledge of propositional formulas.

To do this, we first define the translation of a formula ϕ in our language into a propositional term ϕT in the term algebra, in the obvious way: pT is p, trueT is true, falseT is false, (¬ϕ)T is not(ϕT), and (ϕ ∧ ψ)T is and(ϕT, ψT).
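The translation (·)T is a direct structural recursion. A sketch in Python (our own encoding, with formulas and terms as nested tuples; to keep formula connectives distinct from the term constructors not/and, we tag them `neg`/`conj`, names we introduce):

```python
def to_term(phi):
    """Translate a formula into a term of the propositional term algebra:
    p^T = p, (neg phi)^T = not(phi^T), (conj phi psi)^T = and(phi^T, psi^T)."""
    if phi[0] == "neg":
        return ("not", to_term(phi[1]))
    if phi[0] == "conj":
        return ("and", to_term(phi[1]), to_term(phi[2]))
    return phi  # true, false, and primitive propositions are already terms

# Since phi ∨ psi abbreviates ¬(¬phi ∧ ¬psi), the formula p ∨ q
# translates to the term not(and(not(p), not(q))):
p, q = ("p",), ("q",)
p_or_q = ("neg", ("conj", ("neg", p), ("neg", q)))
assert to_term(p_or_q) == ("not", ("and", ("not", p), ("not", q)))
```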

There is a subtlety about this translation, and about the logic in general. We have defined ∨ and ⇒ by abbreviation, which means that any formula containing ∨ or ⇒ is really a formula containing ∧ and ¬. Thus, in some sense, the agent cannot explicitly distinguish between ϕ ∨ ψ and ¬(¬ϕ ∧ ¬ψ); they are the same formula for him. This makes strong assumptions, again, on the reasoning capabilities of the agent. One way around this problem is to use a syntax that directly uses ∨, ⇒, and perhaps other connectives, rather than introducing them by abbreviations. We would then add similar constructors to the propositional signature, and extend the translation above accordingly. For simplicity, we will not do so in this paper, but it should be clear that it is a straightforward extension.

We modify the semantics of X to deal with all propositional formulas:

(M, s) |= Xϕ if s = (e, {ob1, . . . , obn}) and ob1, . . . , obn ⊢D ϕT.

Clearly, since pT is just p, this semantics extends that of the previous section.

Example 3.2: The following deduction rules can be added to any primitive deductive system to obtain a propositional deductive system that captures a subset of the inferences that can be performed in propositional logic.

t ▷ not(not(t))
not(not(t)) ▷ t
t ▷ not(and(not(t), not(t′)))
t′ ▷ not(and(not(t), not(t′)))
not(and(t, not(t′))), t ▷ t′
not(and(t, not(t′))), t′ ▷ t
t, t′ ▷ and(t, t′)
and(t, t′) ▷ t
and(t, t′) ▷ t′
t, not(t) ▷ false
false ▷ t

One advantage of these rules, despite the fact that they are incomplete, is that they can be used to perform very efficient (linear-time, in fact) propositional inference [McAllester 1993].

By considering full deductive systems, that is, propositional deductive systems with unary constructors know and aknow, it is clear how to give semantics to Xϕ for an arbitrary formula ϕ. We study such deductive systems in the full paper.

4 Axiomatizations

In this section, we present a sound and complete axiomatization for reasoning about explicit knowledge given by a deductive system. Clearly, for a particular deductive system, the properties of X depend on that deductive system. Intuitively, we should be able to read off the properties of X from the deduction rules themselves. The remainder of this section aims at making this statement precise.


First, let us introduce an axiomatization for reasoning about deductive systems in general, independently of the actual deduction rules of the system. We need axioms capturing propositional reasoning in the logic.

Taut. All instances of propositional tautologies

MP. From ϕ and ϕ ⇒ ψ infer ψ

Axiom Taut can be replaced by an axiomatization of propositional tautologies [Enderton 1972]. The following well-known axioms capture the properties of the knowledge operator [Fagin, Halpern, Moses, and Vardi 1995].

K1. (Kϕ ∧ K(ϕ ⇒ ψ)) ⇒ Kψ

K2. Kϕ ⇒ ϕ

K3. Kϕ ⇒ KKϕ

K4. ¬Kϕ ⇒ K¬Kϕ

Gen. From ϕ infer Kϕ

Since algorithmic knowledge is interpreted with respect to the observations at the current state, and two states are indistinguishable to an agent if the same observations are made at both states, agents know whether or not they explicitly know a fact. This is captured by the following two axioms.

X1. Xϕ ⇒ KXϕ

X2. ¬Xϕ ⇒ K¬Xϕ

Finally, observations have a special status in the logic. The interpretation of observation primitivesis fixed to depend on the observations made at the state. The following two axioms express theproperties of observations.

X3. ob ⇒ Kob

X4. ob ⇔ Xob

Axiom X3 captures the fact that indistinguishable states have the same observations. Axiom X4 follows from the definition of deduction in Section 2: recall that for all terms t of a deductive system D, we have t ⊢D t.

Let AX consist of the axioms Taut, MP, K1–K4, Gen, and X1–X4. The following result is almost immediate.

Theorem 4.1: The axiomatization AX is sound and complete for Lprim(Σ) and Lprop(Σ) with respect to M(Σ).

If we want to reason about deductive knowledge structures equipped with a particular deductive system, we can do better. We can capture the reasoning with respect to the deductive system within our logic. The basic idea is to translate deduction rules of the deductive system into formulas of our logic. First, consider a primitive deductive system D. A deduction rule in D of the form t1, . . . , tn ▷ t is translated into an axiom of the form (Xt1 ∧ . . . ∧ Xtn) ⇒ Xt. In fact, such a translation yields an axiom schema, where we view the variables in t1, . . . , tn, t as schema metavariables, to be replaced


by appropriate elements of the term algebra.2 Let AxprimD be the set of axioms derived in this way for the primitive deductive system D.
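The rule-to-axiom translation is mechanical; the sketch below (our own illustrative Python, rendering terms as strings; `rule_to_axiom` and `show` are names we introduce, and `^` / `=>` stand in for ∧ / ⇒) turns a deduction rule t1, . . . , tn ▷ t into the schema (Xt1 ∧ . . . ∧ Xtn) ⇒ Xt:

```python
def show(t):
    """Render a term (nested tuples; metavariables are plain strings) as text."""
    if isinstance(t, str):
        return t
    op, args = t[0], t[1:]
    return op if not args else f"{op}({', '.join(show(a) for a in args)})"

def rule_to_axiom(premises, conclusion):
    """Translate t1, ..., tn |> t into the schema (Xt1 ^ ... ^ Xtn) => Xt."""
    lhs = " ^ ".join(f"X {show(t)}" for t in premises)
    return f"({lhs}) => X {show(conclusion)}"

# The first two Dolev-Yao rules of Example 2.1 become:
assert rule_to_axiom([("recv", "m")], ("has", "m")) == "(X recv(m)) => X has(m)"
assert rule_to_axiom(
    [("has", ("inv", "k")), ("has", ("encr", "m", "k"))],
    ("has", "m"),
) == "(X has(inv(k)) ^ X has(encr(m, k))) => X has(m)"
```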

Theorem 4.2: The axiomatization AX ∪ AxprimD is sound and complete for Lprim(Σ) with respect to MD(Σ).

For propositional deductive systems, we need a slightly more involved translation, since we want to translate the propositional constructors that appear in the terms into logical connectives in the translated formulas. We can do this as follows. A deduction rule of the form t1, . . . , tn ▷ t in D is translated to a formula (XtT1 ∧ . . . ∧ XtTn) ⇒ XtT. We define the formula tT corresponding to the term t by induction on the structure of t: trueT is true, falseT is false, (not(t))T is ¬(tT), (and(t1, t2))T is tT1 ∧ tT2, and tT is t for all other terms t. This translation again yields an axiom schema of the kind we defined above. Note that we do not translate propositional constructors that appear under non-propositional constructors within a term. (Intuitively, these constructors will never arise out of the translation of formulas given in Section 3.) Let AxpropD be the set of axioms derived in this way for the propositional deductive system D.

Theorem 4.3: The axiomatization AX ∪ AxpropD is sound and complete for Lprop(Σ) with respect to MD(Σ).

5 Complexity

In this section, we study the decision problem for our logics, that is, the problem of determining, for a given formula, whether it is satisfiable. Since our logics extend the logic of knowledge in which the knowledge operator is interpreted over an equivalence relation (in our case, ∼), their decision problem is at least as hard. It is known that the decision problem for knowledge over an equivalence relation is NP-complete [Halpern and Moses 1992]. Adding deductive algorithmic knowledge is unproblematic if we do not require a fixed deductive system. Intuitively, for satisfiability, we can simply take as a deductive system one with specific deduction rules sufficient to satisfy the subformulas Xϕ appearing in the formula.

Theorem 5.1: The decision problem for Lprim(Σ) and Lprop(Σ) over structures in M(Σ) is NP-complete.

What happens if we fix a particular deductive system, and want to establish whether a formula ϕ is satisfiable in a structure over that particular deductive system? The difficulty of this problem depends intrinsically on the difficulty of deciding whether a deduction Γ ⊢D t exists in D. Since this problem may be undecidable for certain deductive systems D, reasoning in our logics can be undecidable over those deductive systems. On the other hand, if the deductive system is decidable in polynomial time (i.e., if the problem of deciding whether a deduction Γ ⊢D t exists in D can be solved in time polynomial in the size of Γ and t), then the decision problem for our logics remains relatively easy.

2One needs to be careful when defining this kind of axiom schema formally. Intuitively, an axiom schema of the above form, with metavariables appearing in terms, corresponds to the set of axioms where each primitive proposition in the axiom is a substitution instance of the appropriate term in the axiom schema.


Theorem 5.2: For any given primitive (resp., propositional) deductive system D that is decidable in polynomial time, the decision problem for Lprim(Σ) (resp., Lprop(Σ)) over structures in MD(Σ) is NP-complete.

In fact, there is a class of deductive systems that can be efficiently decided, and that thus, by Theorem 5.2, leads to a reasonable complexity for our logics. McAllester [1993] defines a deductive system D to be local if whenever Γ ⊢D t there exists a local deduction of t from Γ. A deduction is local if every proper subterm of a term in the deduction is either a proper subterm of t, a proper subterm of a member of Γ, or appears as a subterm of a deduction rule in D. One can show that, for any deductive system D, whether a local deduction of t from Γ exists can be decided in time polynomial in the size of Γ and t. If D is local, so that the existence of a deduction ensures the existence of a local deduction, then the deduction relation ⊢D is polynomial-time decidable. The deductive system in Example 2.1 is local, and adding the deduction rules in Example 3.2 to any local primitive deductive system yields a local deductive system.

Corollary 5.3: For any given local primitive (resp., propositional) deductive system D, the decision problem for L_prim(Σ) (resp., L_prop(Σ)) over structures in M_D(Σ) is NP-complete.

6 Conclusion

We have described in this paper an approach to combining implicit knowledge interpreted over possible worlds with a notion of explicit knowledge based on a logical theory that allows agents to derive what they explicitly know. This additional structure, the agent's logical theory, can be used to completely characterize the properties of the explicit knowledge operator. More specifically, it lets us derive sound and complete axiomatizations for the logic in a uniform way.

For simplicity, we have focused on the single agent case, but it should be clear that the framework generalizes to multiple agents in a straightforward way. It suffices to provide each agent with a deductive system. This lets one reason about what agents know about what other agents explicitly know. In combination with the full deductive systems mentioned at the end of Section 3, such a framework can capture a form of simulative inference [Kaplan and Schubert 2000], where an agent can reconstruct the reasoning of another agent in an attempt to explicitly determine what the latter explicitly knows. We study this extension in the full paper, along with an extension to dynamical systems, where observations are taken over time.

Our framework is also suitable for studying logical theories that are probabilistic. For instance, it is possible to assign a weight to every deduction rule, and to model deduction by randomly choosing which deduction rule to apply based on a distribution proportional to the weight of every applicable rule. This yields a probability distribution on the deductions, about which we can reason following ideas developed in [Halpern and Pucella 2003]. We leave this for future work.

Acknowledgments. Thanks to Hubie Chen and Joe Halpern for comments on previous drafts of this paper.


References

Dolev, D. and A. C. Yao (1983). On the security of public key protocols. IEEE Transactions on Information Theory 29(2), 198–208.

Enderton, H. B. (1972). A Mathematical Introduction to Logic. Academic Press.

Fagin, R. and J. Y. Halpern (1994). Reasoning about knowledge and probability. Journal of the ACM 41(2), 340–367.

Fagin, R., J. Y. Halpern, Y. Moses, and M. Y. Vardi (1995). Reasoning about Knowledge. The MIT Press.

Giunchiglia, F., L. Serafini, E. Giunchiglia, and M. Frixione (1993). Non-omniscient belief as context-based reasoning. In Proceedings of IJCAI-93, pp. 548–554.

Halpern, J. Y. and Y. Moses (1992). A guide to completeness and complexity for modal logics of knowledge and belief. Artificial Intelligence 54, 319–379.

Halpern, J. Y. and R. Pucella (2002). Modeling adversaries in a logic for reasoning about security protocols. In Proceedings of FASec'02. Available as Royal Holloway Department of Computer Science Technical Report CSD-TR-02-13. To appear in LNCS.

Halpern, J. Y. and R. Pucella (2003). Probabilistic algorithmic knowledge. In Theoretical Aspects of Rationality and Knowledge: Proc. Ninth Conference (TARK 2003), pp. 118–130.

Hintikka, J. (1962). Knowledge and Belief. Ithaca, N.Y.: Cornell University Press.

Kaplan, A. N. and L. K. Schubert (2000). A computational model of belief. Artificial Intelligence 120, 119–160.

Konolige, K. (1986). A Deduction Model of Belief. Los Altos, CA: Morgan Kaufmann.

Levesque, H. J. (1984). A logic of implicit and explicit belief. In Proceedings, AAAI-84, pp. 198–202.

McAllester, D. (1993). Automatic recognition of tractability in inference relations. Journal of the ACM 40(2), 284–303.


Inferring implicit preferences from negotiation actions

Angelo Restificar, University of Wisconsin-Milwaukee

Peter Haddawy, Asian Institute of Technology

Abstract

In this paper we propose to model a negotiator's decision-making behavior, expressed as preferences between an offer/counter-offer gamble and a certain offer, by learning from implicit choices that can be inferred from observed negotiation actions. The agent's actions in a negotiation sequence provide information about his preferences and risk-taking behavior. We show how offers and counter-offers in negotiation can be transformed into gamble questions, providing a basis for inferring implicit preferences. Finally, we present the results of the experiments and evaluation we have undertaken.

1 Introduction

As agent technology is applied to problems in e-commerce, we are beginning to see software agents that are able to represent and make decisions on behalf of their users. For example, in Kasbah [6] agents buy and sell on behalf of their human users. In delegating negotiation to an agent, we would like that agent to be as savvy a negotiator as possible. In negotiation, a more effective formulation of proposals can be achieved if an agent has some knowledge about the other negotiating party's behavior, in terms of preferences and risk attitude, since solutions to bargaining problems are affected by both [8, 9, 11]. But such information is private to each party, and we cannot expect that the other negotiating party will provide it freely. In addition, the elicitation of preferences and attitudes towards risk, or utility, is in general difficult and continues to be a challenge to researchers. Although various elicitation techniques for decision makers have been widely used, e.g., see techniques in [5], they are not readily applicable in the negotiation scenario. An agent, for instance, cannot simply ask gamble questions to assess its opponent's utility function. On the other hand, the use of learning mechanisms in negotiation has been investigated in several recent studies, see for example [1, 17], and has been shown to be an effective tool in handling uncertainty and incompleteness.

In this paper, we propose a novel method for constructing a model of a negotiator's decision-making behavior by learning from implicit preferences that can be inferred from observed negotiation actions alone. We show, in particular, how actions in a negotiation transaction can be transformed into offer/counter-offer gamble questions, which can subsequently be used to construct a model of the agent's decision-making behavior via machine learning techniques. The model contains information about the decision maker's preferences between a certain offer and an offer/counter-offer gamble, which can be expressed in the form of comparative preference statements. We present theoretical results that allow us to exploit data points implicit in observed actions. We use multilayer feed-forward artificial neural networks to implement and evaluate our approach. Finally, we present the results of our experiments.

2 Preliminaries

*Many thanks to Matt McGinty, Department of Economics, University of Wisconsin-Milwaukee, whose valuable comments and discussions with the first author have benefited this work.

Decision-making based on the principle of maximum expected utility requires the negotiator to choose, from a set of possible actions, that action which maximizes his expected utility. An action with uncertain outcomes can


be viewed as a gamble whose prizes are the outcomes themselves. In negotiations where agents exchange offers and counter-offers according to an alternating offers protocol [12], a counter-offer to a given offer can be viewed as a gamble with two possible outcomes: either no agreement is reached, or the counter-offer is accepted, in which case an agreement is reached. The probability in this particular gamble is the decision maker's own subjective probability of reaching an agreement with the other negotiating party. A utility maximizing negotiator can decide between a certain offer and the corresponding gamble by comparing the utility of the certain offer and the expected utility of the particular gamble. Thus, the decision-making behavior of a negotiating agent can be modeled by comparative statements that express his preferences between certain offers and particular gambles.

In this paper, we will assume that each party's utility is completely unknown to the other and that no trusted third party exists. Each agent has a limited reasoning capacity, so that endless outguessing regress does not occur. We consider a 2-level agent whose nested modeling capability consists only of a model of its own utilities and a model of the other agent's decision-making behavior based on observations of past actions [16]. As a specific example that we will use to illustrate our approach, we consider a scenario of two agents negotiating over the price of an indivisible item: one agent observing the behavior of a second agent. The first agent constructs a model of the second agent's decision-making behavior by learning from preferences inferred from the latter's observed actions. This scenario is similar to an online seller (first agent) observing a potential buyer's (second agent) actions for the purpose of making effective proposals.

We also assume that the item can be delivered as promised to the buyer and that the seller is paid for it at the agreed price. In each negotiation there is negligible bargaining cost and discount rate. Both bargaining cost and discount rate are considered private information. We assume that the two agents' bargaining behavior does not deviate from their true utilities and that the utility functions of the agents are stationary over the period in which the negotiation takes place. Due to space limitations, we will focus only on constructing a model of an agent's decision-making behavior from observed negotiation actions. We will tackle the relaxation of some of the assumptions above in subsequent work. The claim we make in this paper is that we have a method for constructing a model of an agent's decision-making behavior expressed as comparative preference statements between certain offers and corresponding gambles. The gamble is associated with the agent's counter-offer given his opponent's offer. We will call this the offer/counter-offer gamble, or o/c gamble.

Note that the set of comparative preference statements that can be obtained from our model is, in general, a proper subset of the set of comparative preference statements that can be obtained from a utility function. Thus, we can use some of the tools for eliciting utilities to construct a model of the agent's decision-making behavior. An important tool in eliciting utilities is the use of lottery or gamble questions and the concept of certainty equivalence.

Definition 1 Let D be a domain, U be a utility function over D, and let oi and oj be outcomes in a gamble G, where oi occurs with probability p, oj occurs with probability (1 − p), and oi, oj ∈ D. Let us denote this gamble G by the shorthand (oi, p, oj). If p = 0.5, then G is called the standard gamble. A certainty equivalent is an amount ô such that the decision maker (DM) is indifferent between G and ô. Thus, U(ô) = pU(oi) + (1 − p)U(oj), or ô = U⁻¹[pU(oi) + (1 − p)U(oj)].

Using the standard gamble, a decision maker can be asked what amount he would assign to ô such that he would be indifferent between the gamble and ô. The answer to this standard gamble question is the certainty equivalent ô. Alternatively, given the outcomes oi and oj and an amount ô, the decision maker can be asked what probability p he would assign such that he would be indifferent between ô and the gamble (oi, p, oj). Utility functions can be constructed by interviewing the decision maker and asking him the answers to gamble questions [5]. A gamble question is nondegenerate if p ≠ 0 and p ≠ 1. A decision maker with an increasing utility function is risk-averse (resp. risk-prone, risk-neutral) iff his certainty equivalent for any nondegenerate gamble is less than (resp. more than, equal to) the expected value of the gamble. For a decreasing utility function, a decision maker is risk-averse (resp. risk-prone, risk-neutral) iff his certainty equivalent for any nondegenerate gamble is more than (resp. less than, equal to) the expected value of the gamble.
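As a minimal numeric illustration of these definitions, the sketch below computes a certainty equivalent by inverting the utility function and compares it with the gamble's expected value to classify risk attitude. The utility function and the numbers are hypothetical, not taken from the paper:

```python
import math

def certainty_equivalent(u, u_inv, o_i, p, o_j):
    """CE of the gamble (o_i, p, o_j): the sure amount with equal utility."""
    return u_inv(p * u(o_i) + (1 - p) * u(o_j))

# Hypothetical increasing, concave utility on [0, 100]: u(x) = sqrt(x).
u = math.sqrt
u_inv = lambda y: y * y          # inverse of sqrt on this range

ce = certainty_equivalent(u, u_inv, 100, 0.5, 0)   # the standard gamble
ev = 0.5 * 100 + 0.5 * 0
# Increasing, concave utility: CE below the expected value, i.e. risk-averse.
assert ce < ev
```

Here CE = (0.5·√100 + 0.5·√0)² = 25 while the expected value is 50, matching the risk-averse case of the definition for an increasing utility function.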

We will model the agents' exchanges as alternating offers [12] but will use Zeuthen's concept of the probability to risk a conflict [18] as a basis for transforming negotiation offers and counter-offers into gamble questions. In


the alternating offers game, we refer to the history of the agents' actions as a negotiation sequence. To simplify the type of actions, we will simply call the first action the initial offer and denote all actions after that as either an accept action, a reject action, or a counter-offer action. To avoid confusion, we will use subscripts to denote which agent an action belongs to. The negotiation terminates in one of the following ways: either an agent has accepted an offer, an agent has rejected the offer, or the negotiation has reached an impasse, i.e., a deadlock. An impasse happens when no agent makes a concession, each giving the same counter-offer they have given previously for a consecutive finite number of times. We formally define a negotiation sequence below.

Definition 2 Let o, a, r, and ō denote offer, accept, reject, and counter-offer, respectively. An action α is a negotiation action iff α ∈ {o, a, r, ō}. Let D = [Pmin, Pmax] be the domain of an attribute over which a commodity is to be negotiated. Any x ∈ D is called a position. Let A = {S, B}, where S and B denote the seller and buyer agents, respectively. A negotiation sequence N is the sequence N = (αj, x1), (αk, x2), …, where j, k ∈ A, j ≠ k, xi ∈ D, and the last action is in {a, r, ō}. A negotiation transaction is any pair of consecutive actions.

Consider the following example where buyer B negotiates the price of a certain commodity with seller S.

Negotiation Fragment 1
B1. I would like to offer $45.00 for it.
S2. This is a fine item. I am selling it for $66.00.
B3. I think $45.00 is more reasonable.
S4. Alright, $60.00 for you.
B5. Let's see. I'd be willing to pay $51.00 for it.
S6. $60.00 is already a good price.
B7. $51.00 is fair, I think.
S8. I'll reduce it further to $57.00, just for you.
B9. I really like the item but I can only afford $54.00.
S10. $57.00 is a bargain.
B11. $54.00 is my last offer.
S12. Alright. Sold at $54.00.

In Negotiation Fragment 1, D = [45, 66]. The negotiation sequence N = (oB, 45), (ōS, 66), (ōB, 45), (ōS, 60), (ōB, 51), (ōS, 60), (ōB, 51), (ōS, 57), (ōB, 54), (ōS, 57), (ōB, 54), (aS, 54). The buyer B makes an offer in B1 by offering $45.00. In S2, the seller counter-offers by quoting a price of $66.00 for the item. This continues until S12, where the seller accepts the buyer's counter-offer. The terminating move is (aS, 54). We will now show how counter-offers can be viewed as gamble questions that provide implicit data points for learning an agent's decision-making behavior.
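The sequence above can be represented directly as data. The sketch below encodes Negotiation Fragment 1 and extracts its transactions (pairs of consecutive actions); the action codes and tuple layout are our own encoding, standing in for the paper's symbols:

```python
# Action codes: 'o' = initial offer, 'c' = counter-offer, 'a' = accept,
# 'r' = reject; each element is (agent, action, amount).

sequence = [
    ('B', 'o', 45), ('S', 'c', 66), ('B', 'c', 45), ('S', 'c', 60),
    ('B', 'c', 51), ('S', 'c', 60), ('B', 'c', 51), ('S', 'c', 57),
    ('B', 'c', 54), ('S', 'c', 57), ('B', 'c', 54), ('S', 'a', 54),
]

def transactions(seq):
    """Negotiation transactions: pairs of consecutive actions."""
    return list(zip(seq, seq[1:]))

trans = transactions(sequence)
assert len(trans) == 11
# The pair (S10, B11), which the paper's Example 1 analyzes:
assert trans[9] == (('S', 'c', 57), ('B', 'c', 54))
```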

3 Negotiation actions as gamble questions

The theoretical results in this section point out implicit information that may be exploited from observed counter-offers. First, we define an agent's probability to risk a conflict, following Zeuthen.

Definition 3 Let UB and US be B's and S's utility functions, respectively. Let xB be B's position and xS be S's position. The probability that B will risk a conflict, pB, and the probability that S will risk a conflict, pS, are defined as follows:

pB = (UB(xB) − UB(xS)) / UB(xB)  if UB(xB) > UB(xS);  pB = 0 otherwise.   (1)

pS = (US(xS) − US(xB)) / US(xS)  if US(xS) > US(xB);  pS = 0 otherwise.   (2)
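Equations (1) and (2) translate directly into code. The sketch below uses hypothetical monotone utility functions (in line with assumptions A.1 and A.2 stated next) purely for illustration:

```python
# Zeuthen's probability to risk a conflict, per Definition 3.
# The utility functions and prices below are hypothetical.

def risk_probability(u, own, offered):
    """(u(own) - u(offered)) / u(own) if own is strictly better, else 0."""
    if u(own) > u(offered):
        return (u(own) - u(offered)) / u(own)
    return 0.0

u_buyer = lambda x: 100 - x      # decreasing in price (cf. A.1)
u_seller = lambda x: x - 40      # increasing in price (cf. A.2)

p_B = risk_probability(u_buyer, own=45, offered=66)
p_S = risk_probability(u_seller, own=66, offered=45)
assert 0 < p_B <= 1 and 0 < p_S <= 1
# The closer the offer is to the agent's own position, the smaller the risk.
assert risk_probability(u_buyer, 45, 50) < risk_probability(u_buyer, 45, 66)
# An offer at (or better than) the agent's own position gives zero risk.
assert risk_probability(u_buyer, 45, 45) == 0.0
```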


The probability that an agent would risk a conflict is proportional to the difference between the agent's position and what he is offered. The closer the other agent's offer is to his position, the smaller this probability should be. The farther away the other agent's offer is from his position, the larger the readiness to risk a fight or conflict. If an agent is offered something equal to his position or is offered something better, then his probability of risking a conflict is zero. We assume the following about the agents' utilities:

A. 1 B’s utility function,UB, is monotonically decreasing.

A. 2 S’s utility function,US , is monotonically increasing.

By computing the expected utility, an agent can decide whether to accept the other agent's offer, reject it, or insist on his own position.

Definition 4 Let UB and US be B's and S's utility functions, respectively, and αB and αS be B's and S's actions, respectively. Let the gambles GB and GS refer to (conflictB, pS, xB) and (conflictS, pB, xS), respectively. B's (resp. S's) subjective probability that S (resp. B) will risk a conflict is pS (resp. pB). Also, let EU[GB] = (1 − pS)UB(xB) + pS UB(conflictB), and EU[GS] = (1 − pB)US(xS) + pB US(conflictS).

αB = accept iff UB(xS) ≥ EU[GB] ∧ EU[GB] > UB(conflictB);
     counter-offer iff UB(xS) < EU[GB] ∧ UB(xS) > UB(conflictB);
     reject iff UB(conflictB) ≥ UB(xS).   (3)

αS = accept iff US(xB) ≥ EU[GS] ∧ EU[GS] > US(conflictS);
     counter-offer iff US(xB) < EU[GS] ∧ US(xB) > US(conflictS);
     reject iff US(conflictS) ≥ US(xB).   (4)

Note that(1 pS) isB’s subjective probability that he himself will succeed,i.e., the probability thatS willnot risk a conflict. The knowledge of eitherp or (1 p) as a subjective estimate guides an agent’s decision.In the case where an agent’s probability to risk a conflict canbe estimated via empirical frequency, the methodwe propose here still applies. The latter, however, usuallyrequires a large sample size which may not be easilyavailable in practice.

Definition 4 specifies which relation is true when a specific action is taken. In particular, counter-offers, which comprise most actions in a negotiation sequence, imply a preference between an o/c gamble and an offer. We do not know what specific values pB and pS have, but we can learn the relation that triggered the negotiation action. B's estimate, for example, of S's probability to risk a conflict, pS, is incorporated in the relation on which B's actions are based. According to Savage [14], the use of a gamble question as a probability elicitation function when B is maximizing the effect of an outcome would induce B to reveal his opinion as expressed by his subjective probability. This clearly means that for any counter-offer by B, some specific value of pS was used by the decision maker B, although this value cannot be directly observed. Thus, the pS in the gamble GB of the preference relation GB ≻ xS has a specific value. The same can be said about B's subjective probability of success, (1 − pS), when insisting on xB. Although the exact value of pS or (1 − pS) is not known, by learning the relation that triggered B's action along with the parameters xS, xB, and conflictB, instances are learned where the values of pS and (1 − pS) are taken into account.

Learning relations also offers additional flexibility. If the value of conflict is known, then we can use it in the training instances. Note that conflictB and conflictS are the outcomes for the buyer and the seller, respectively, whenever no agreement is reached. In a buyer-seller scenario, the value of conflict for the buyer might be the prevailing market price: conflictB could be the highest possible value a buyer could end up paying should a non-agreement occur; for the seller, conflictS could be a low constant if the seller ends up having to sell the item at a deep discount or at zero profit. In the case where the price of conflict is not known,


relations can still be learned using some default constant γB ∈ D, provided UB(γB) < UB(xS) in B's case and US(γS) < US(xB) in S's case. As long as the same constant γB (resp., γS) used for training is also the one used in the query, the learned model is expected to output the corresponding preference information based on the relations it has learned. In essence, a model of the agent's decision-making behavior can be constructed without necessarily knowing the exact value of conflict.

The next three theorems allow us to generate training sets implicit in observed actions. Due to space limitations, we will only present theoretical results for the buyer. Analogous results also hold for the seller.

Theorem 1 Let xB beB’s counter-offer,xS beS’s offer, and letxB beB’s certainty equivalent for gambleGB = (conflictB ; pS ;xB). If (GB xS) then for any nondegenerateGB, xB 2 (xB ; xS).Proof. GB xS means(1 pS)UB(xB) + pSUB(conflictB) > UB(xS). SincepS 6= 0, it follows thatUB(xB)> (1 pS)UB(xB)+pSUB(conflictB) and so by transitivity, we haveUB(xS) < (1 pS)UB(xB)+pSUB(conflictB) < UB(xB). Note thatU1B [(1 pS)UB(xB)+pSUB(conflictB) = xB . By monotonicityassumption A.1,xB > xB andxB < xS . Hence,xB 2 (xB ; xS).Theorem 2 (Inferior Offers) LetxB beB’s counter-offer,xS beS’s offer, andGB = (conflictB ; pS ;xB). LetpS beS’s readiness to risk a conflict atx, xS < x. If (GB xS) then((1pS)UB(xB)+pSUB(conflictB)) >UB(x);8 x such thatxS < x.

Proof. SinceUB(xS) > UB(x) for anyx > xS, then((1pS)UB(xB)+pSUB(conflictB)) > UB(x). S’s offeris atxS , so anyx > xS must be better forS thanxS by A.2. Sincex is better thanS’s position, by definition,S’sreadiness to risk a conflict,pS, can not increase. By transitivity,(1pS)UB(xB)+pSUB(conflictB) > UB(x).Theorem 3 (Irresistible Offers) LetxB beB’s counter-offer,xS beS’s offer, andGB = (conflictB ; pS ;xB).LetpS beS’s readiness to risk a conflict atx, xB > x. If (GB xS) then((1pS)UB(xB)+pSUB(conflictB)) <UB(x); 8 x such thatxB > x.

Proof. By A.1, UB(x) > UB(xB). pS 6= 0 becauseB’s position, xB, is not the same asS’s offer, xS.Since0 < pS 1, UB(xB) > (1 pS)UB(xB). By transitivity, UB(x) > (1 pS)UB(xB). However,UB(xB) > UB(conflictB). Thus,UB(x) > (1 pS)UB(xB) + pSUB(conflictB). Sincex is less thanS’s position, xS , S’s readiness to risk a conflict increases, by definition. SopS > pS. SinceUB(x) >(1 pS)UB(xB) + pSUB(conflictB), by transitivityUB(x) > (1 pS)UB(xB) + pSUB(conflictB).

Theorem 1 states that a counter-offer reveals the interval in which the certainty equivalent lies. According to Theorem 1, if B decides to make a counter-offer xB to an offer xS by S, we can infer that B's certainty equivalent for gamble GB lies in the interval (xB, xS). This is useful because, intuitively, for B any offer by S above xS can be considered worse than GB and any offer by S below xB is preferred to GB. Theorem 2 states that if B prefers the gamble to an offer of xS by S, then B would also prefer the gamble to any offer by S that is greater than xS. Theorem 3 says that an offer by S that is less than what B would insist on is acceptable to B.

Example 1 Consider the transaction (S10, B11) in Negotiation Fragment 1. According to Theorem 1, the behavior observed in (S10, B11) implies that x̂B lies in the interval ($54.00, $57.00). According to Theorem 2, for all offers x > $57.00, B would prefer the gamble insisting on $54.00 to x. By Theorem 3, for all offers x < $54.00, B would prefer the offer to a gamble insisting on $54.00.
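Theorems 1-3 suggest a simple recipe for generating labeled data points from a single transaction. The sketch below does this for the (S10, B11) transaction; the probe prices and the 0/1 label encoding are our own illustrative choices:

```python
# Labels: 1 = B prefers the gamble (insisting on his counter-offer),
# 0 = B prefers the offer.

def training_points(counter_offer, offer, probes):
    """Label probe prices for a transaction where B countered with
    counter_offer against S's offer (counter_offer < offer)."""
    points = []
    for x in probes:
        if x >= offer:            # Theorem 2: any inferior offer -> gamble
            points.append((x, 1))
        elif x < counter_offer:   # Theorem 3: an irresistible offer -> offer
            points.append((x, 0))
        # counter_offer <= x < offer: the certainty equivalent lies in
        # (counter_offer, offer) by Theorem 1, so no label is generated.
    return points

# Transaction (S10, B11): S offers 57, B counters with 54.
pts = training_points(counter_offer=54, offer=57, probes=[50, 55, 60])
assert pts == [(50, 0), (60, 1)]
```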

Each distinct pair of exchanges in a negotiation sequence gives different information with respect to the risk attitude that each agent shows throughout the negotiation sequence. In particular, each pB and pS may differ between transactions, hence each pair gives additional information. If a transaction or a pair of counter-offers is observed repeatedly, then the number of training instances based on this information also increases, in effect adding more evidence of the decision maker's specific risk behavior. In summary, Theorems 1-3 state information that is implicit in each counter-offer. Using these, we can generate explicit data points to be used


as training sets for artificial neural networks. In previous work (see [10, 4]), artificial neural networks were used to construct a decision maker's utility model. The network can be trained to learn from the instances generated from observed data. In this work, the weights in the network encode the function that describes the decision-making behavior of the agent being modeled. The network takes as input a certain offer, a counter-offer, and the price of conflict, and outputs a prediction of whether or not the o/c gamble is preferred by the decision maker.

We envisage the models discussed here being used as tools to formulate effective offers, including tradeoff exploration, and as heuristics to help guide the search for efficient solutions in negotiation problems. Often, models predict actions of negotiating agents using expectations of what a decision maker might do given relevant information. In the game theory literature, the specific sets of relevant information that affect the decision maker's decision processes are also known as types [3]. They may include, for example, reservation price, cost and profit structure, negotiation style, amount of resources, etc. [17]. Types are often modeled using a probability distribution. As the negotiation progresses, updates using Bayes' rule on the probability distribution over these specific sets of relevant information are made. A difficulty in the use of types is how the prior probability distribution might be updated over a very large set of information, possibly one that contains an infinite number of types. As in most cases, the modeler is forced to assume a finite number of types, say strong and weak types [13], or is forced to choose a smaller set of 'relevant' information, say reservation price as opposed to everything that affects a decision maker's action, as exemplified in Zeng and Sycara's work [17].

A disadvantage in the use of probability distributions is not necessarily in determining which information is relevant but in computing posteriors given the limited information that can be inferred from observed negotiation actions. For example, if the only observations acquired during negotiations are the responses of the negotiating agents expressed as counter-offers, then clearly one could not use the observations to directly infer certain specific sets of relevant information such as the cost and profit structure. As we have shown above, we can infer the agent's decision-making behavior in terms of preferences between a certain offer and an o/c gamble from observed actions. All that is needed is the assumption that a rational player evaluates his decision by taking into consideration all the relevant information he has which affects the issue(s) under negotiation. This might include indirectly observable information like his cost and profit structure, deadline, etc. If, given this relevant information, an offer is more favorable to him than, say, a gamble, then he takes it because he prefers its outcome over that gamble. Hence, whatever observations are made can be taken as the result of such an evaluation, reflected through the decision maker's behavior. For example, one need not explicitly model the cost and profit structure of the opponent. If the opponent accepts an offer, then this can be viewed as something that he prefers, after considering his cost and profit structure, over what other consequences might result. In short, our approach allows the construction of one particular model that represents the decision maker's behavior, as opposed to choosing (via a probability distribution) the opponent's type from a very large, possibly infinite, set of information.

4 Model construction and evaluation

In the previous section, we laid out the framework for constructing a model of a negotiating agent's decision-making behavior using implicit data from observed negotiation transactions. In this section, we describe the construction and evaluation of a model of B's decision-making behavior. The results from our experiments using synthetic data sets suggest significant predictive accuracy using only an average of 5 pairs of exchanges per negotiation. For convenience, we have chosen to implement a model of B's decision-making behavior using a multilayer feed-forward artificial neural network, a statistical learning method capable of approximating linear and nonlinear functions. Note that the concept we discussed above, of how implicit data points may be generated from observed negotiation actions, is independent of the machine learning technique that may be used. We believe, however, that the choice of a particular technique must be made so that the intended preference relations are learned given the available additional training data. Since artificial neural networks have been widely studied and used as universal function approximators in various problems across many disciplines, we believe that a demonstration of our model via artificial neural networks makes our technique accessible to a wider audience. In addition, extensions and variants of artificial neural networks


like the knowledge-based artificial neural network [15] can potentially be used to improve the model's predictive performance as well as reduce its training time [10, 4].

We have assumed thatB’s decision-making behavior is guided by his utility function. We used a controlutility function for B, (x), to generate negotiation actions where each counter-offerx made byB is suchthat 1((1 pS) (x) + pS ( )) < y, for any offery made byS and for a given conflict . This meansthat a counter-offer is made becauseB’s expected utility of the gamble is higher than his utility for y. BacceptsS0s offer when the utility ofy is equal to or exceeds that of the gamble. Moreover, we have arbitrarilychosen two separate functions for (x) representing risk-averse and risk-prone behavior and haverun severalexperiments varying the following parameters to constructa synthetic data set: probability to risk a conflict(pS), negotiation domain, and negotiation strategy. Note thatwe usepS together with other parameters onlyto simulate behavior from which actions can be observed. Theconstruction of the model does not assumeknowledge ofpS. The control utility function, (x), allows us to test the predictive performance of our model.Its purpose is to representB’s unknown utility function against which our constructed model will be tested.Our data set contains observable actions based on (x). However, (x) itself is unknown to the system thatconstructs the model.

For any nondegenerate gamble, the certainty equivalent partitions the offers into two regions: the region below the certainty equivalent contains offers that B prefers to a gamble, and the region above it contains those where a gamble is preferred. The ψ(x) curve contains the utility of the certainty equivalent x̂ for any nondegenerate gamble. According to the previous section, the counter-offers in a negotiation sequence can be used as interval constraints to approximate the certainty equivalents. We evaluate the effectiveness of our approach by training the artificial neural networks using data points implied by these intervals. We then test the network's predictive performance by comparing the learned model against the control utility function which guides the decision maker's behavior. An effective model should demonstrate a significant improvement over a random guess, i.e., given S's offer and a conflict value held constant, the model should correctly classify, more than 50% of the time, whether a given offer is preferred by B over a gamble.

We generated negotiation sequences and ran experiments using two control utility functions: u1(x) = 1 − e^{0.05(x−149)} (risk-averse, decreasing) and u2(x) = e^{−0.0025x} (risk-prone, decreasing). In each negotiation sequence, pS is either generated randomly or chosen from {0.50, 0.25, 0.60}. The negotiation using u1(x) is over the domain D1 = [50, 100] and that of u2(x) is over D2 = [200, 700]. The value of conflict κ is set at the maximum value of the domain. The buyer-seller negotiation strategies vary among Boulware-Conceder, Conceder-Boulware, and Conceder-Conceder pairs. We define our Boulware strategy as one where the agent concedes only 10% of the time and a Conceder strategy as one where concession is frequent, at 90% of the time. Whenever an agent concedes, the concession is randomly chosen between 0 and 50% of the difference between both agents' most recent counter-offers.
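As an illustration, the two control utility functions (as reconstructed from the text) and the certainty equivalent of an offer/conflict gamble can be sketched as follows. This is our own sketch, not the authors' code; the function names and the bisection-based inverse are our assumptions.

```python
import math

# Control utility functions as reconstructed from the text (a sketch, not the
# authors' code): u1 is concave (risk-averse), u2 is convex (risk-prone);
# both are decreasing in the price x.

def u1(x):
    return 1 - math.exp(0.05 * (x - 149))   # domain D1 = [50, 100]

def u2(x):
    return math.exp(-0.0025 * x)            # domain D2 = [200, 700]

def certainty_equivalent(u, x, kappa, p_s, lo, hi, tol=1e-9):
    """Price whose utility equals the expected utility of the gamble
    (probability (1 - p_s): counter-offer x accepted; p_s: conflict kappa).
    Bisection works because u is monotone decreasing on [lo, hi]."""
    target = (1 - p_s) * u(x) + p_s * u(kappa)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if u(mid) > target:   # u decreasing: utility still too high -> move right
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

For a decreasing utility function the certainty equivalent of the gamble lies between the counter-offer and the conflict value, which is what makes the interval constraints of the previous section usable.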

The artificial neural network used in our experiments has one hidden layer with four nodes. The input layer contains three input nodes and the output layer contains two nodes. The output nodes represent GB ≻ xS and GB ≺ xS, where GB = (κB, pS, xB). The inputs to the network are: B's price of conflict κB, B's counter-offer xB, and S's offer or counter-offer xS. Data fed into the input layer are scaled so that values range only between 0 and 1. We point out that the input layer does not include a node for the specific pS of the gamble GB. Although pS is not directly observable, the observed gamble GB, which is used together with xS to generate the training instances, has a specific value of pS associated with it. Whenever we ask whether GB ≻ xS or GB ≺ xS given a certain offer xS, we are referring to a specific GB with an associated specific value of pS. Note that the input query in this case would only contain κB, xB, and xS.

Negotiation sequences used for training, tuning, and testing are randomly generated using a chosen strategy pair, a control utility function u(x), a negotiation domain, and a constant conflict value. We used a k−1 cross-validation method to train and tune the network, where k is the number of negotiation transactions in each negotiation sequence. Network training is stopped when either no improvement in performance is detected for 2,000 successive epochs or the number of epochs reaches 20,000. Among the data generated using the intervals, 90% is used for training and 10% for tuning. For the examples below the interval, the network is trained to output (o1 = 0.1, o2 = 0.9), and for the examples above the interval it is trained to


Page 211: Artificial Intelligence and Mathematics - Rutgers Universityrutcor.rutgers.edu/~amai/aimath04/AcceptedPapers/aimath-I.pdf · Artificial Intelligence and Mathematics January 4-6, 2004

Figure 1: Overall Performance Results. [Plot: network accuracy (y-axis, 0.40–0.80) against average interval width (x-axis, 0.24–0.36), with one curve for the mean accuracy and one for accuracy on points below the certainty equivalent.]

output (o1 = 0.9, o2 = 0.1). Data points used to test the network are separately generated using the true certainty equivalent of the gamble in each negotiation transaction. Our data set contains a total of 97 negotiation sequences. The total number of negotiation transactions is 477, which gives an average of 5 transactions per negotiation sequence. The training instances are obtained by generating a total of 200 random data points for each observed negotiation transaction: 100 random data points for each of the regions below and above the interval. These are simply data points that are directly inferable from the observed transaction and help to fill out the data presented to the neural network. The certainty equivalent, which is obtained from the control utility function, lies inside each interval. For each of the regions below and above the certainty equivalent, 100 test points are generated. We then evaluate the approach by comparing how well the model trained using the intervals performs against the test points from the control utility function. We consider data points to be correctly classified when o1 ∈ [0, 0.2], o2 ∈ [0.8, 1] for test points below the certainty equivalent and when o1 ∈ [0.8, 1], o2 ∈ [0, 0.2] for test points above it. All (pseudo)random numbers in the experiment are generated using the Mersenne Twister algorithm [7].
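The interval-based training-data generation described above can be sketched as follows. This is a simplified illustration; the function name, domain handling, and fixed seed are our assumptions, not the authors' code.

```python
import random

def training_points(lower, upper, domain, k=100, seed=0):
    """For one observed transaction, draw k labeled points below and k above
    the interval [lower, upper] implied by the counter-offer. Points below
    the interval get target (o1, o2) = (0.1, 0.9); points above get
    (0.9, 0.1), matching the network's two output nodes."""
    rng = random.Random(seed)
    lo, hi = domain
    below = [(rng.uniform(lo, lower), (0.1, 0.9)) for _ in range(k)]
    above = [(rng.uniform(upper, hi), (0.9, 0.1)) for _ in range(k)]
    return below + above
```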

Intuitively, not all negotiation transactions may be useful. For example, an offer near the maximum domain value paired with a counter-offer near the minimum domain value yields an interval whose width is close to the width of the domain. Since we are using the interval to estimate the certainty equivalent, such a negotiation transaction is less useful than one in which both the offer and counter-offer are closer to the certainty equivalent. Moreover, regression analysis indicates that the distance of the lower interval limit from the true certainty equivalent has a significant influence on our model's predictive performance. Predictive accuracy is 50% or better when the average distance of the lower interval limit is within 0.50 (normalized) of the true certainty equivalent. In a real scenario, however, we have no idea how far the lower interval limit is from the true certainty equivalent, since this is private information. We would, therefore, like a practical basis for choosing which negotiation transactions are useful as training examples. We used the interval width of each negotiation transaction, i.e., the distance between the lower and upper limit of the interval, as a basis for eliminating data points that may not be useful in constructing the model.

To test the overall performance, negotiation sequences were grouped into subsets where normalized interval widths are no greater than 0.50, 0.45, 0.40, 0.35, and 0.30. A normalized interval width is computed as the ratio of the interval width to the domain width. The average interval widths for these subsets are 0.35, 0.34, 0.31, 0.29, and 0.24, respectively. The average network performance on each of these subsets is shown in Figure 1. The overall network performance increases as the average interval width of the negotiation transactions decreases. The solid curve shows the performance of the model in predicting whether a certain offer is preferred by B to an o/c gamble using only implicit data points below the interval. This is important because B's counter-offers correspond only to the lower limit of the interval. The dotted curve shows the accuracy of the model in predicting whether a certain offer is preferred by B to an o/c gamble


and whether B prefers an o/c gamble to a certain offer. The mean accuracy is obtained by averaging the results using implicit data points below and above the interval. The results suggest that for intervals with an average width of 0.24 the network can predict about 72% of the time whether a certain offer is preferred by B to an o/c gamble. For intervals with average width less than or equal to 0.31, we are able to predict with more than 60% accuracy whether B prefers a certain offer to an o/c gamble. In addition, the predictive accuracy of the model when implicit data points above and below the interval are used is better than a random guess. We ran four right-tailed z-tests and one right-tailed t-test using the hypotheses H0: p = 0.50 and Ha: p > 0.50. For the t-test the null hypothesis is rejected at α = 0.005. In each of the z-tests, the null hypothesis is rejected at α = 0.001.
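The right-tailed one-proportion z-test used here can be sketched as follows; this is our own minimal implementation with a hard-coded standard-normal critical value, not the authors' statistics code.

```python
import math

def z_statistic(successes, n, p0=0.50):
    """Right-tailed one-proportion z-test statistic for H0: p = p0."""
    p_hat = successes / n
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# Standard-normal critical value for alpha = 0.001 (right tail).
Z_CRIT_001 = 3.090

def better_than_chance(successes, n, z_crit=Z_CRIT_001):
    """Reject H0 (accuracy is no better than a random guess)."""
    return z_statistic(successes, n) > z_crit
```

For example, 72% accuracy on 200 test points gives z ≈ 6.2, well past the α = 0.001 cutoff.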

5 Related Work and Summary

Zeng and Sycara [17] propose to model sequential decision making using a subjective probability distribution over an agent's beliefs about his environment and about his opponent, including the opponent's payoff function. A negotiating agent chooses the action that maximizes his expected payoff given the available information. Zeng and Sycara use observed actions to obtain a posterior distribution over an agent's belief set using Bayesian rules. In their framework, a decision maker's action is predicted using the posterior distribution. In our framework, a decision maker's action is predicted using a learned model that contains information about the decision maker's preferences between a certain offer and the o/c gamble associated with a counter-offer. As suggested above, our approach aims to avoid the difficulty of obtaining posterior distributions from the limited information that can be inferred from observed actions.

Bui et al. [1] propose to augment a negotiation architecture with an incremental learning module, implemented as a Bayesian classifier. Using information about past negotiations as sample data, agents learn and predict other agents' preferences. These predictions reduce the need for communication between agents, and thus coordination can be achieved even without complete information. Here, preferences are modeled using a probability distribution over the set of possible agreements, and an agent's preference is estimated using the expected value of the distribution. Our work differs from [1] on several points: first, our model includes both preferences and risk-taking behavior; second, we use a model of an agent's decision-making behavior to arrive at a possible offer instead of using a Bayes classifier; and finally, the negotiation context we consider is largely adversarial rather than cooperative, in that the agents can insist on their respective positions even if this leads to a breakdown in negotiation, agents cannot be asked about their preferences directly, and the only type of information exchanged between agents is offers and counter-offers.

Chajewska et al. [2] propose to elicit utility functions from observed negotiation behavior. An elicited utility function can be used by the decision maker to determine which action gives the maximum utility. The model we have discussed here does not elicit a utility function but rather represents and learns the negotiating agent's decision-making behavior in terms of preferences between a certain offer and an o/c gamble. Some other notable differences between our work and that of Chajewska et al. are as follows. (1) The approach proposed by Chajewska et al. starts from a database of (partially) decomposable utility functions elicited via standard techniques. Standard elicitation techniques, like interviews using gamble questions, are in general inapplicable in most negotiation scenarios. We have demonstrated how counter-offers can be viewed as gamble questions. (2) Observed behavior is used to eliminate inconsistent utilities, and from the consistent ones the expectation of the distribution of utilities is chosen. Our approach uses observed behavior to infer and learn implicit preferences. (3) In our framework, we exploit the subjective probability that is an inherent hidden component of an agent's move. However, probabilities from empirical frequency, whenever available, can also be used. The only disadvantage of using objective probabilities as suggested in [2] is that a sufficiently large sample is needed, which may not be available. We do, however, agree with Chajewska et al. about the use of existing knowledge in the elicitation process. In previous work we have shown that such knowledge can be used to guide an elicitation process [4] and improve the predictive performance of a constructed model [10].

In summary, we have presented a novel method for constructing a model of an agent's decision-making behavior by learning from implicit preferences inferred from observed negotiation actions. In particular, we


have presented theoretical results which allow counter-offers to be transformed into gamble questions, providing the basis for inferring implicit preferences. Training instances can be generated from intervals implicit in counter-offers, which can then be used to train artificial neural networks. Our model, which can be augmented by existing knowledge, determines whether a certain offer is preferred to an o/c gamble or vice versa. In our experiments, we have obtained statistically significant results which indicate over 70% accuracy for intervals whose widths are roughly 25% of the domain width. Moreover, the accuracy of the model when implicit data points above and below the interval are used is better than a random guess.

References

[1] H. H. Bui, S. Venkatesh, and D. Kieronska. Learning other agents' preferences in multi-agent negotiation using the Bayesian classifier. International Journal of Cooperative Information Systems, 8:275–294, 1999.

[2] Urszula Chajewska, Daphne Koller, and Dirk Ormoneit. Learning an agent's utility function by observing behavior. In Proceedings of the International Conference on Machine Learning, 2001.

[3] Drew Fudenberg and Jean Tirole. Game Theory. The MIT Press, 1991.

[4] Peter Haddawy, Vu Ha, Angelo Restificar, Ben Geisler, and John Miyamoto. Preference elicitation via theory refinement. Journal of Machine Learning Research, 4:317–337, July 2003.

[5] Ralph Keeney and Howard Raiffa. Decisions with Multiple Objectives: Preferences and Value Tradeoffs. Cambridge University Press, 1993.

[6] Patti Maes, Robert H. Guttman, and Alexandros G. Moukas. Agents that buy and sell. Communications of the ACM, volume 42. ACM, March 1999.

[7] M. Matsumoto and T. Nishimura. Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator. ACM Transactions on Modeling and Computer Simulation, 8(1):3–30, 1998.

[8] John Nash. The bargaining problem. Econometrica, 18(2):115–162, 1950.

[9] Martin J. Osborne. The role of risk aversion in a simple bargaining model. In Alvin E. Roth, editor, Game-Theoretic Models of Bargaining. Cambridge University Press, 1985.

[10] Angelo Restificar, Peter Haddawy, Vu Ha, and John Miyamoto. Eliciting utilities by refining theories of monotonicity and risk. In Working Notes of the AAAI-02 Workshop on Preferences in AI and CP: Symbolic Approaches. American Association for Artificial Intelligence, July 2002.

[11] A. Roth and U. Rothblum. Risk aversion and Nash's solution for bargaining games with risky outcomes. Econometrica, 50:639–647, 1982.

[12] Ariel Rubinstein. Perfect equilibrium in a bargaining model. Econometrica, 50:97–109, 1982.

[13] Ariel Rubinstein. A bargaining model with incomplete information about time preferences. Econometrica, 53(5):1151–1172, 1985.

[14] Leonard J. Savage. Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, 66(336):783–801, 1971.

[15] Geoffrey G. Towell and Jude W. Shavlik. Knowledge-based artificial neural networks. Artificial Intelligence, 70, 1995.

[16] Jose M. Vidal and Edmund H. Durfee. The impact of nested agent models in an information economy. In Proceedings of the Second International Conference on Multiagent Systems, 1996.

[17] Dajun Zeng and Katia Sycara. Bayesian learning in negotiation. International Journal of Human-Computer Studies, 48:125–141, 1998.

[18] Frederick Zeuthen. Problems of Monopoly and Economic Warfare. Routledge and Kegan Paul, Ltd., 1930.


Improving Exact Algorithms for MAX-2-SAT∗

Haiou Shen, Hantao Zhang

Computer Science Department
The University of Iowa
Iowa City, IA 52242

hshen, [email protected]

November 17, 2003

Abstract

We study three new techniques which speed up the branch-and-bound algorithm for the MAX-2-SAT problem. The first technique is a new lower bound function for the algorithm, and we show that the new lower bound function is consistently better than other lower bound functions. The other two techniques are based on the strongly connected components of the implication graph of a 2CNF formula: one uses the graph to simplify the formula and the other uses the graph to design a new variable ordering. The experiments show that the simplification can reduce the size of the input substantially when used in preprocessing, and that the new variable ordering performs much better when the clause-to-variable ratio is less than 2. The result of this research is a high-performance implementation of an exact algorithm for MAX-2-SAT which outperforms any implementation we know about in the same category. It also shows that our MAX-2-SAT implementation is a feasible and effective tool for solving large instances of the Max-Cut problem in graph theory.

1 Introduction

In recent years, there has been considerable interest in the maximum satisfiability problem (MAX-SAT) of propositional logic, which, given a set of propositional clauses, asks to find a truth assignment that satisfies the maximum number of clauses. The decision version of MAX-SAT is NP-complete, even if the clauses have at most two literals (the so-called MAX-2-SAT problem). Because the MAX-SAT problem is fundamental to many practical problems in computer science [12] and computer engineering [19], efficient methods that can solve a large set of instances of MAX-SAT are eagerly sought. One important application of MAX-2-SAT is that NP-complete graph problems, such as maximum cut and independent set, can be reduced to special instances of MAX-2-SAT [7, 15]. Many of the proposed methods for MAX-SAT are based on approximation algorithms [8]; some are based on branch-and-bound methods [12, 6, 4, 14, 13, 11, 16]; and some are based on transforming MAX-SAT into SAT [19].

To the best of our knowledge, there are only four implementations of exact algorithms for MAX-SAT that are variants of the well-known Davis-Putnam-Logemann-Loveland (DPLL) procedure [9]. One is due to Wallace and Freuder (implemented in Lisp) [18]; one is due to Borchers and Furman [6] (implemented in C and publicly available); the last two are made

∗Partially supported by the National Science Foundation under Grant CCR-0098093.


available in 2003 by Alsinet, Manya and Planes [1] (a substantial improvement over Borchers and Furman's implementation) and Zhang, Shen, and Manya [20], respectively.

In this paper we discuss three novel techniques intended to improve the performance of the branch-and-bound algorithm proposed in [20]. It is well known that the tighter the bound, the smaller the search tree in a typical branch-and-bound algorithm. We introduce a new lower bound function which can reduce the search tree substantially. The other techniques are based on the use of the strongly connected components (SCC) in the implication graph of a 2CNF instance. It is well known that the satisfiability of a 2CNF formula can be decided in linear time by computing SCCs [3]. For MAX-2-SAT, we found that computing SCCs can help us to (a) simplify the input greatly, and (b) design a new variable ordering for the branch-and-bound algorithm in [20].

In order to evaluate the new techniques, we present experimental results on thousands of MAX-2-SAT instances. We show that the improved algorithm is consistently and substantially better than all the known algorithms [6, 1, 20]. We also show the performance of the algorithm on a large set of random MAX-CUT problems.

2 Preliminary

Let F be a formula in 2CNF with n variables V = {x1, ..., xn} and m clauses. An assignment is a mapping from V to {0, 1} and may be represented by a vector X ∈ {0, 1}ⁿ, where 0 means false and 1 means true. Let S(X, F) be the number of clauses satisfied by X and K(F) be the minimum number of false clauses over all assignments.

For every literal x, we use variable(x) to denote the variable appearing in x. That is, variable(x) = x for a positive literal x and variable(x̄) = x for a negative literal x̄. If y is a literal, ȳ denotes its complement; in particular, ȳ = x if y = x̄.

A partial (complete) assignment can be represented by a set of literals (or unit clauses) in which each variable appears at most (exactly) once and each literal is meant to be true in the assignment. If a variable x does not appear in a partial assignment A, then we say the literals x and x̄ are unassigned in A. Let u(x) record the number of unit clauses x generated during the search. If there are no unit clauses in the input, u(x) is initialized to zero.

The algorithm we use is presented in Figure 1. In the algorithm, B(x) = {y | (x ∨ y) ∈ F, variable(x) < variable(y)} for each literal x.

The following result is provided in [20].

Theorem 2.1 Suppose F is a set of binary clauses over n variables. Then max_2_sat2(F, n, g0) returns true if and only if there exists an assignment under which at most g0 clauses in F are false. The time complexity of dec_max_2_sat(F, n, g0) is O(n·2ⁿ) and the space complexity is L/2 + O(n), where L is the size of the input.

3 Lower Bounds

In line 2 of dec_max_2_sat in Figure 1, popular lower bound functions can be used to improve its performance. The following two lower bound functions are used in [1, 2, 20]:

• LB1 = the number of conflicting (i.e., empty) clauses under the current partial assignment.

• LB2 = LB1 + ∑_{j=i}^{n} min(u(j), u(j̄)).


Figure 1: A decision algorithm for MAX-2-SAT.

function max_2_sat2(F: clause set, n: variable, g0: int) return boolean
  // initialization
  for i := 1 to n do
    compute B(i) and B(ī) from F;
    u(i) := u(ī) := 0;  // assuming no unit clauses in F
  end for
  return dec_max_2_sat(1, g0);
end function

function dec_max_2_sat(i: variable, g: integer) return boolean
1   if (i > n) return true;  // end of the search tree
2   if (lower_bound(i) > g) return false;
3   // decide if we want to set variable i to true
4   if (u(ī) ≤ g) ∧ (u(ī) < u(i) + |B(i)|) then
5     record_unit_clauses(ī);
6     if (dec_max_2_sat(i + 1, g − u(ī))) return true;
7     undo_record_unit_clauses(ī);
8   end if
9   // decide if we want to set variable i to false
10  if (u(i) ≤ g) ∧ (u(i) ≤ u(ī) + |B(ī)|) then
11    record_unit_clauses(i);
12    if (dec_max_2_sat(i + 1, g − u(i))) return true;
13    undo_record_unit_clauses(i);
14  end if
15  return false;
end function

procedure record_unit_clauses(x: literal)
  for y ∈ B(x) do u(y) := u(y) + 1 end for;
end procedure

procedure undo_record_unit_clauses(x: literal)
  for y ∈ B(x) do u(y) := u(y) − 1 end for;
end procedure
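A Python transcription of the decision algorithm of Figure 1 can be sketched as follows. This is our own reconstruction, not the authors' C++ implementation: the literal encoding is ours, and for clarity it omits the lower-bound test of line 2 and the dominance tests of lines 4 and 10 (so it explores both branches whenever the budget allows).

```python
def max_2_sat2(clauses, n, g0):
    """Return True iff some assignment falsifies at most g0 of the given
    binary clauses. Literals are nonzero ints: v (variable true) or -v."""
    # B[l]: partner literals y of clauses (l ∨ y) with variable(l) < variable(y)
    B = {l: [] for v in range(1, n + 1) for l in (v, -v)}
    for a, b in clauses:
        if abs(a) > abs(b):
            a, b = b, a
        B[a].append(b)
    u = {l: 0 for l in B}  # unit-clause counts (no unit clauses in the input)

    def record(l):         # literal l has just become false: (l ∨ y) turns unit
        for y in B[l]:
            u[y] += 1

    def undo(l):
        for y in B[l]:
            u[y] -= 1

    def dec(i, g):
        if i > n:
            return True
        for lit in (-i, i):       # lit is the literal made FALSE by the branch
            if u[lit] <= g:       # cost: unit clauses lit become empty
                record(lit)
                if dec(i + 1, g - u[lit]):
                    return True
                undo(lit)
        return False

    return dec(1, g0)

def min_false(clauses, n):
    """K(F): minimum number of false clauses over all assignments."""
    g = 0
    while not max_2_sat2(clauses, n, g):
        g += 1
    return g
```

Note that clauses whose smaller variable is already assigned have been turned into unit counts in u, which is why B only needs to be indexed by the smaller-variable literal.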


where u(x) is the number of unit clauses x under the current partial assignment. Using LB2 instead of LB1 contributes greatly to the improved performance of Alsinet, Manya and Planes' implementation over Borchers and Furman's. It is easy to see that LB1 ≤ LB2 ≤ K(F) when g0 ≤ K(F).

Lemma 3.1 If there is a clause x ∨ y in F such that u(x) < u(x̄) and u(y) < u(ȳ), then LB2 + 1 ≤ K(F).

The above lemma allows us to design an enhanced lower bound function as follows. For any literal x, let c(x) = u(x̄) − u(x), and let S be the set of clauses both of whose literals are unassigned under the current assignment. Then the new lower bound can be computed by the following procedure:

LB3 := LB2;
for every clause (x ∨ y) ∈ S do
  if (c(x) > 0) ∧ (c(y) > 0) then
    LB3 := LB3 + 1; c(x) := c(x) − 1; c(y) := c(y) − 1;
  end if
end for

Because c(x) > 0 and c(y) > 0 imply u(x̄) > u(x) and u(ȳ) > u(y), for a clause x ∨ y, if we assign 0 to x, then u(y) will be increased by 1, so min(u(y), u(ȳ)) will be increased by 1. The same reasoning applies to y. We can see that it is safe to increase the lower bound by 1 in this case.

Theorem 3.2 If g0 ≤ K(F ), then LB2 ≤ LB3 ≤ K(F ).

Note that the function lower_bound(i) in dec_max_2_sat (line 2 of Figure 1) returns LB2 − LB1 for our old implementation and LB3 − LB1 for NB (New Bound) in our experiments.
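The LB3 computation can be sketched in Python as follows, under our reconstruction c(x) = u(x̄) − u(x) of the definition above; data-structure choices and names are ours.

```python
def lb3(n_empty, u, unassigned):
    """Lower bound LB3. n_empty is LB1 (empty clauses so far); u maps each
    literal (nonzero int) to its unit-clause count; unassigned is the set S
    of clauses (x, y) both of whose literals are unassigned."""
    variables = {abs(l) for l in u}
    lb = n_empty + sum(min(u.get(v, 0), u.get(-v, 0)) for v in variables)  # LB2
    c = {l: u.get(-l, 0) - u.get(l, 0) for l in u}   # c(x) = u(x̄) - u(x)
    for x, y in unassigned:
        # each qualifying clause forces one more false clause (Lemma 3.1);
        # decrement c so the same unit-clause surplus is not counted twice
        if c.get(x, 0) > 0 and c.get(y, 0) > 0:
            lb += 1
            c[x] -= 1
            c[y] -= 1
    return lb
```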

4 Using SCC

Given a 2CNF formula F, the implication graph GF of F is a directed graph whose vertices are the literals whose variables appear in F, with an edge from x to y iff (x̄ ∨ y) ∈ F. It is proved in [3] that F is unsatisfiable iff GF has a strongly connected component (SCC) which contains both x and x̄ for some literal x. For MAX-2-SAT, we may use SCCs to simplify the original problem:

• If an SCC does not contain any conflicting literals, delete the literals in this SCC from the original 2CNF formula.

• If there is more than one SCC, divide the original 2CNF formula according to the SCCs and run the MAX-2-SAT program against each component separately.
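The implication graph and the conflicting-SCC test can be sketched as follows. This is a generic reconstruction using Kosaraju's two-pass algorithm, not the authors' code (which relies on the linear-time method of [3]); literal encoding and names are ours.

```python
def implication_edges(clauses):
    """A clause (a ∨ b) yields the implication edges ¬a → b and ¬b → a."""
    edges = []
    for a, b in clauses:
        edges.append((-a, b))
        edges.append((-b, a))
    return edges

def sccs(edges, nodes):
    """Strongly connected components via Kosaraju's algorithm (iterative DFS)."""
    fwd, rev = {}, {}
    for x, y in edges:
        fwd.setdefault(x, []).append(y)
        rev.setdefault(y, []).append(x)
    seen, order = set(), []

    def dfs(v, adj, out):
        stack = [(v, iter(adj.get(v, ())))]
        seen.add(v)
        while stack:
            node, it = stack[-1]
            advanced = False
            for w in it:
                if w not in seen:
                    seen.add(w)
                    stack.append((w, iter(adj.get(w, ()))))
                    advanced = True
                    break
            if not advanced:
                stack.pop()
                out.append(node)   # post-order finish

    for v in nodes:
        if v not in seen:
            dfs(v, fwd, order)
    seen.clear()
    comps = []
    for v in reversed(order):      # reverse finish order on the reversed graph
        if v not in seen:
            comp = []
            dfs(v, rev, comp)
            comps.append(comp)
    return comps

def has_conflicting_scc(clauses):
    """True iff some SCC contains a literal and its complement,
    i.e., the 2CNF formula is unsatisfiable."""
    nodes = {l for c in clauses for a in c for l in (a, -a)}
    for comp in sccs(implication_edges(clauses), nodes):
        s = set(comp)
        if any(-l in s for l in s):
            return True
    return False
```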

The idea of "divide-and-conquer" is not new in SAT. For instance, Bayardo and Pehoushek [5] have used this idea for counting models of SAT. However, basing it on the SCCs of the implication graph is a new approach. While this idea is appealing, in the study of random MAX-2-SAT we tested thousands of instances but found that very few of them contain more than one SCC with conflict (i.e., with conflicting literals). This should not be a surprise because


in the study of random graphs, Erdős and Rényi [10] showed that for a simple graph with n nodes, as the number m of edges grows from 0 to n(n − 1)/2, the first cycle appears when εn < m < (1/2 − ε)n; the graph then consists of trees and a few unicyclic components until m ≈ n/2. When m = (1 + ε)n/2 there is a unique giant component and all other components are trees or unicyclic; after that, the other components merge into the giant component. The application of this result to MAX-2-SAT implication graphs is that only the giant component can contain conflicting literals. That is, it is unlikely that there is more than one SCC with conflict.

Despite the fact that we have only one SCC with conflict in most cases, we can combine the above idea with some known preprocessing methods [12, 6, 4, 14, 13, 11, 16] to simplify a 2CNF formula before running the decision algorithm. We found that the following sequence of operations is very effective for most 2CNF formulas:

• Complementary Unit Rule: If F = (x ∨ y) ∧ (x ∨ ȳ) ∧ (x̄ ∨ z) ∧ (x̄ ∨ z̄) ∧ F′, then K(F) = K(F′) + 1.

• Convert F to the implication graph GF and locate the SCCs in GF. If an SCC does not contain conflicting literals, delete the component.

• Resolution Rule: If F = (x ∨ y) ∧ (x̄ ∨ z) ∧ F′ and F′ contains neither x nor x̄, then K(F) = K(F′ ∧ (y ∨ z)).

• Export each SCC as a new 2CNF formula: for each edge (x → y) in the component, add the clause (x̄ ∨ y) to the formula.

• Solve each new 2CNF formula by the MAX-2-SAT algorithm.

As shown later in the paper, this preprocessing procedure greatly reduces the size of random instances and considerably improves the performance of our algorithm on random MAX-2-SAT instances.

Another new use of SCCs is to design a variable ordering for the algorithm in Figure 1. The algorithm uses a fixed ordering, i.e., from 1 to n, to assign truth values to variables. It was found useful in [20] to sort the variables according to their occurrences in the input in non-increasing order and then assign variables in that order. Using SCCs, we design a new weight function for variables and then sort the variables by this weight in non-increasing order.

The weight function is computed by the following procedure: at first each variable has weight 0; we then update the weights by finding the shortest path between every pair (x, x̄) in the SCC, and for every node y on such a path we increase the weight of y by 1.

The intuition behind this ordering is that we want to derive conflicting clauses as early as possible so that the lower bound check at line 2 of the algorithm in Figure 1 can be more effective. If a clause appears on the shortest path connecting two conflicting literals in an SCC, we give the literals in this clause higher weights so that the truth value of this clause is decided early. The experimental results in the next section show that the ordering by the new weight function performs better than the occurrence ordering when either c = m/n or K(F) is small.
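The shortest-path weight computation above can be sketched with breadth-first search as follows; the adjacency representation and function names are our assumptions, not the authors' code.

```python
from collections import deque

def shortest_path(adj, src, dst):
    """BFS shortest path in the (unweighted) implication graph, or None."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        if v == dst:
            path = []
            while v is not None:   # walk the predecessor chain back to src
                path.append(v)
                v = prev[v]
            return path[::-1]
        for w in adj.get(v, ()):
            if w not in prev:
                prev[w] = v
                queue.append(w)
    return None

def scc_variable_weights(adj, scc, n):
    """For every complementary pair (x, -x) inside the SCC, add 1 to the
    weight of every variable on a shortest path from x to -x."""
    weight = {v: 0 for v in range(1, n + 1)}
    members = set(scc)
    for x in members:
        if x > 0 and -x in members:
            path = shortest_path(adj, x, -x)
            if path:
                for lit in path:
                    weight[abs(lit)] += 1
    return weight
```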

5 Experimental Results

We have implemented all three techniques presented in this paper to support the algorithm in Figure 1. The implementation is written in C++ and preliminary experimental results seem promising. We ran our program with four different configurations: sorting variables


Table 1: Experimental results on Borchers and Furman’s examples. (in seconds)

#vars  #clauses  false clauses     BF    AMP    OLD   SONB
  50      100          4         0.02   0.03   0.01   0.02
  50      150          8         0.06   0.03   0.01   0.03
  50      200         16         4.18   0.38   0.06   0.04
  50      250         22           24   0.26   0.03   0.08
  50      300         32          350   4.88   0.71   0.4
  50      350         41         2556     10   1.26   0.49
  50      400         45         2308   4.65   0.23   0.24
  50      450         63            –     44   4.42   2.98
  50      500         66            –     17   1.04   0.48
 100      200          5         0.14   0.16   0.05   0.07
 100      300         15          501     29    4.0   0.64
 100      400         29            –   1204    124    8.3
 150      350          4         0.18   0.22   1.71   0.24
 150      450         22            –      –      –    235

by occurrence (SO), sorting variables by occurrence with the new lower bound (SONB), sorting variables by weight (SW), and sorting variables by weight with the new lower bound (SWNB). All data were collected on a cluster of Pentium 4 2.4GHz Linux machines, each with 1GB of memory.

Table 1 shows some results of Borchers and Furman's program (BF) [6], Alsinet et al.'s (AMP, with the option LB2-I+JW), our old implementation (OLD), and our new implementation (SONB) on the random problems distributed by Borchers and Furman. In the table, "–" indicates an incomplete run after two hours. Note that the upper bound on K(F) in our algorithm is initially set to the number found by the first phase of Borchers and Furman's local search procedure and is then decreased by one until the optimal value is determined. It is clear that our algorithm runs consistently faster than both Borchers and Furman's program and Alsinet et al.'s modification. For simple, small problems, the old implementation is sometimes faster than the new one because our new program uses a more complex preprocessing procedure. For hard problems, our new implementation is definitely faster than the old one.

Figure 2 compares Alsinet et al.'s program and our implementation on random problems with 40 variables and 200 variables. We considered the following cases: n = 40 variables with m = 60, 80, 100, 120, 160, 200, 300, ..., 3000, 3040, 3080 clauses, and n = 200 variables with m = 300, 320, 340, 360, 380, 400 clauses. For each case, we generated 100 random problems (excluding satisfiable ones). It is clear that all four configurations of our algorithm run consistently faster than Alsinet et al.'s program. From the figure, we can see that SO is better than SW on the instances with n = 40 variables, but SW is better than SO on many instances with n = 200 variables. The new lower bound (LB3) is consistently better than LB2. Note that in the figure for n = 40 variables, the running time of our program decreases as c = m/n increases once c is large enough; the reason is that the preprocessing can remove many clauses when c = m/n is very large. For n = 200, Alsinet et al.'s program could not finish one job when m = 340, 4 jobs when m = 360, 14 jobs when m = 380, and 35 jobs when m = 400. When m = 400, both SO and SW have one job unfinished (in two hours).

More detailed results comparing the four configurations, i.e., SO, SONB, SW, and SWNB,can be found in the full version of this paper (see also the appendix). In the following, we turn


Artificial Intelligence and Mathematics, January 4-6, 2004 (rutcor.rutgers.edu/~amai/aimath04/AcceptedPapers/aimath-I.pdf)

Figure 2: Running time for SO, SONB, SW, SWNB and Alsinet et al.’s algorithm.

[Two log-scale plots of running time (seconds) against number of clauses, showing SO, SONB, SW, SWNB and AMP: (a) n = 40, (b) n = 200.]

our attention to the Max-Cut problem.

Given an undirected simple graph G = (V, E), where V = {x_1, ..., x_n} is the set of vertices and E is the set of edges, let a weight w_{x_i,x_j} be associated with each edge (x_i, x_j) ∈ E. The Max-Cut problem is to find a subset S of V such that

W(S, S̄) = Σ_{x_i ∈ S, x_j ∈ S̄} w_{x_i,x_j}

is maximized, where S̄ = V − S. In this paper, we let weight w_{x_i,x_j} = 1 for all edges. The following theorem shows how to reduce the Max-Cut problem to the MAX-2-SAT problem.

Theorem 5.1 [7, 15] Given G = (V, E), where |V| = n and |E| = m, assume weight w(x_i, x_j) = 1 for each (x_i, x_j) ∈ E. We construct an instance of MAX-2-SAT as follows: let V be the set of propositional variables and, for each edge (x_i, x_j) ∈ E, create exactly two binary clauses: (x_i ∨ x_j) and (¬x_i ∨ ¬x_j). Let F be the collection of these binary clauses. Then the Max-Cut problem has a cut of weight k iff the MAX-2-SAT problem has an assignment under which m + k clauses are true.
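To make the reduction concrete, the following sketch (the helper names are ours, and this is not the solver used in the experiments) builds the two clauses per edge and checks the m + k correspondence by brute force on a triangle:

```python
from itertools import product

def maxcut_to_max2sat(edges):
    """Encode unit-weight Max-Cut as MAX-2-SAT (Theorem 5.1).

    A positive literal v means "vertex v is in S"; -v is its negation.
    Each edge (i, j) yields the clauses (i v j) and (-i v -j), so a cut
    of weight k corresponds to an assignment satisfying m + k clauses.
    """
    clauses = []
    for i, j in edges:
        clauses.append((i, j))      # (x_i or x_j)
        clauses.append((-i, -j))    # (not x_i or not x_j)
    return clauses

def satisfied(clauses, assignment):
    """Count clauses satisfied under a dict mapping variable -> bool."""
    def lit(l):
        return assignment[abs(l)] if l > 0 else not assignment[abs(l)]
    return sum(1 for a, b in clauses if lit(a) or lit(b))

# Brute force on a triangle: the maximum cut has weight k = 2 and m = 3,
# so the best assignment satisfies m + k = 5 clauses.
edges = [(1, 2), (2, 3), (1, 3)]
clauses = maxcut_to_max2sat(edges)
best = max(satisfied(clauses, dict(zip((1, 2, 3), bits)))
           for bits in product([False, True], repeat=3))
print(best)  # 5
```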

We have run thousands of instances of the random Max-Cut problem. The performance of our MAX-2-SAT solver is much better than that of a special-purpose program for Max-Cut. Some results are shown in the full version of the paper (see also the appendix). For the instances with n = 30, 50, and 100 variables, SO is better than SW; those instances have a relatively large c = m/n > 2 in most cases. For the instances with n = 1000, SW is often more than 10 times faster than SO when c = m/n < 2.

6 Conclusion

We have given a new lower bound function for the branch-and-bound algorithm for MAX-2-SAT and showed that the new lower bound function is consistently better than other lower bound functions. We have also used the SCCs of the implication graph of a 2CNF formula to simplify the formula and to design a new weight for variables. The experiments showed that sorting variables by weight performs much better when the clause-to-variable ratio c = m/n is less than 2. The proposed preprocessing technique can greatly reduce the size of the input. The result of this research is a high-performance implementation of an exact algorithm for MAX-2-SAT, which outperforms every implementation we know of in the same category. We applied the MAX-2-SAT algorithm to solve large instances of the Max-Cut problem, and the experimental results showed this approach to be feasible and effective. All the techniques presented in the paper


can be applied to the weighted MAX-2-SAT problem, where each clause has a weight. The Max-Cut problem with arbitrary weights can easily be converted into an instance of weighted MAX-2-SAT. As future work, we will specialize and improve the MAX-2-SAT algorithm for the Max-Cut problem and solve some real-world Max-Cut problems. We will also extend the techniques presented in the paper to solve general MAX-SAT problems.

References

[1] T. Alsinet, F. Manyà, J. Planes. Improved branch and bound algorithms for Max-SAT. Proc. of the 6th International Conference on the Theory and Applications of Satisfiability Testing (SAT 2003), pages 408-415.

[2] T. Alsinet, F. Manyà, J. Planes. Improved branch and bound algorithms for Max-2-SAT and Weighted Max-2-SAT. Catalonian Conference on Artificial Intelligence, 2003.

[3] B. Aspvall, M. F. Plass, R. E. Tarjan. A linear-time algorithm for testing the truth of certain quantified Boolean formulas. Information Processing Letters, 8(3):121-123, 1979.

[4] N. Bansal, V. Raman. Upper bounds for MaxSat: Further improved. In Aggarwal and Rangan (eds.), Proceedings of the 10th Annual Conference on Algorithms and Computation (ISAAC'99), volume 1741 of Lecture Notes in Computer Science, pages 247-258. Springer-Verlag, 1999.

[5] R. J. Bayardo and J. D. Pehoushek. Counting models using connected components. In Proc. of the 17th National Conference on Artificial Intelligence (AAAI), pages 157-162, 2000.

[6] B. Borchers, J. Furman. A two-phase exact algorithm for MAX-SAT and weighted MAX-SAT problems. Journal of Combinatorial Optimization, 2(4):299-306, 1999.

[7] J. Cheriyan, W. H. Cunningham, L. Tunçel, Y. Wang. A linear programming and rounding approach to Max 2-Sat. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 26:395-414, 1996.

[8] E. Dantsin, A. Goerdt, E. A. Hirsch, R. Kannan, J. Kleinberg, C. Papadimitriou, P. Raghavan, U. Schöning. A deterministic (2 − 2/(k + 1))^n algorithm for k-SAT based on local search. Theoretical Computer Science, 2002.

[9] M. Davis, G. Logemann, D. Loveland. A machine program for theorem-proving. Communications of the ACM, 5(7):394-397, July 1962.

[10] P. Erdős and A. Rényi. On the evolution of random graphs. Mat. Kutató Int. Közl., 5:17-61, 1960.

[11] J. Gramm, E. A. Hirsch, R. Niedermeier, P. Rossmanith. New worst-case upper bounds for MAX-2-SAT with application to MAX-CUT. Preprint, submitted to Elsevier, May 2001.

[12] P. Hansen, B. Jaumard. Algorithms for the maximum satisfiability problem. Computing, 44:279-303, 1990.

[13] E. A. Hirsch. A new algorithm for MAX-2-SAT. In Proceedings of the 17th International Symposium on Theoretical Aspects of Computer Science (STACS 2000), volume 1770 of Lecture Notes in Computer Science, pages 65-73. Springer-Verlag.


Figure 3: Performance comparison for SO, SONB, SW and SWNB

[Four log-scale plots of branches against number of clauses, showing SO, SONB, SW and SWNB: (a) n = 200, (b) n = 300, (c) n = 400, (d) n = 500.]

[14] E. A. Hirsch. New worst-case upper bounds for SAT. Journal of Automated Reasoning, 24(4):397-420, 2000.

[15] M. Mahajan, V. Raman. Parameterizing above guaranteed values: MaxSat and MaxCut. Journal of Algorithms, 31:335-354, 1999.

[16] R. Niedermeier, P. Rossmanith. New upper bounds for maximum satisfiability. Journal of Algorithms, 36:63-88, 2000.

[17] H. Shen, H. Zhang. An empirical study of MAX-2-SAT phase transitions. Proc. of the LICS'03 Workshop on Typical Case Complexity and Phase Transitions, Ottawa, Canada, June 2003.

[18] R. Wallace, E. Freuder. Comparative studies of constraint satisfaction and Davis-Putnam algorithms for maximum satisfiability problems. In D. Johnson and M. Trick (eds.), Cliques, Coloring and Satisfiability, volume 26, pages 587-615, 1996.

[19] H. Xu, R. A. Rutenbar, K. Sakallah. sub-SAT: A formulation for relaxed Boolean satisfiability with applications in routing. ISPD'02, April 2002, San Diego, CA.

[20] H. Zhang, H. Shen, F. Manyà. Exact algorithms for MAX-SAT. In Proc. of the International Workshop on First-order Theorem Proving (FTP 2003). http://www.elsevier.com/gej-ng/31/29/23/135/23/show/Products/notes/index.htt

Appendix

Figure 3 and Table 2 show the results of the four configurations SO, SONB, SW and SWNB on random MAX-2-SAT problems. Each case has 100 random instances. We can see that the preprocessing procedure greatly reduces the problem size. We can also see that SW is faster than SO, and SWNB is faster than SONB, when c = m/n is relatively small. Our new lower bound function always prunes more branches.


Table 2: Experimental results on random MAX-2-SAT instances. (Reduced is the problem after preprocessing.)

Problem            Reduced            SO         SONB       SW         SWNB
#vars  #clauses    #vars  #clauses    branches   branches   branches   branches
200    300         53.4   80.8        6.88e+03   1.26e+03   1.46e+03   480
200    320         66     106         6.97e+04   5.37e+03   1.6e+04    2.54e+03
200    340         81     140         1.46e+06   9.2e+04    3.27e+05   3.04e+04
200    360         95.5   177         1.57e+07   4.37e+05   4.16e+06   1.67e+05
200    380         105    204         1.09e+08   9.38e+05   3.81e+07   1.38e+06
200    400         116    237         2.39e+08   2.28e+06   7.81e+07   3.06e+06
300    420         56.6   79.6        6.24e+03   1.19e+03   1.3e+03    326
300    450         77.4   115         4.16e+05   1.66e+04   3.42e+04   3.32e+03
300    480         102    166         1.48e+07   2.01e+05   2.51e+06   1.73e+05
300    510         122    210         1.89e+08   5.87e+06   4.71e+07   1.99e+06
300    540         139    253         5.93e+08   2.01e+07   2.63e+08   8.66e+06
400    520         49.2   64.1        5.28e+03   521        798        176
400    560         73.3   103         2.42e+05   1.25e+04   1.15e+04   2.51e+03
400    600         107    162         2.67e+07   2.88e+05   3.3e+06    1.6e+05
400    640         132    210         1.39e+08   5.97e+06   5.23e+07   4.91e+06
500    650         59     76          1.1e+04    1.3e+03    3.5e+03    1e+03
500    700         89     120         2.5e+06    1.7e+05    4e+05      1.5e+04
500    750         132    197         5e+07      5.9e+06    9.5e+06    1.2e+06
500    800         171    277         –          3.1e+07    2.9e+08    1.9e+07

Figure 4: Performance of SONB and SWNB for Max-Cut Problems

[Four log-scale plots of branches against number of edges, showing SONB and SWNB: (a) n = 30, (b) n = 50, (c) n = 100, (d) n = 1000.]


Explicit Manifold Representations for Value-Function Approximation in Reinforcement Learning

William D. Smart
Department of Computer Science and Engineering
Washington University in St. Louis
One Brookings Drive
St. Louis, MO 63130
United States
[email protected]

1 Introduction

We are interested in using reinforcement learning for large, real-world control problems. In particular, we are interested in problems with continuous, multi-dimensional state spaces, in which traditional reinforcement learning approaches perform poorly.

Value-function approximation addresses some of the problems of traditional algorithms (for example, continuous state spaces), and has been shown to be successful in a number of specific applications. However, it has been proven not to work in the general case, and has been known to fail even on simple problems.

We propose a novel approach to value-function approximation, based on the theory of manifolds. We identify the key failing of current techniques, and show how our approach avoids this problem by explicitly modeling the topology of the state space.

We begin with a brief description of the problems of current value-function approximation techniques. We then motivate our proposed approach, outline the mathematics underpinning it, and present the results of our initial experiments.

2 The Problem with Value-Function Approximation

Value-function approximation (VFA) is a technique in reinforcement learning (RL) in which the tabular representation of the value function is replaced with a general-purpose function approximator. It is generally used to deal with continuous state spaces, and to allow generalization between similar states.

The straightforward application of VFA, where the tabular representation is simply replaced by a general-purpose function approximator, has been shown not to work in the general case [2]. However, in certain simple worlds, it seems to work quite reliably, even when Gordon's conditions [3] are not fully satisfied. For example, consider the world shown in figure 1. The agent can move in the four cardinal directions, and starts in the lower left corner. Rewards are -1 for hitting a wall, +1 for reaching the upper right corner, and zero everywhere else. The agent's state corresponds to its position in the world, and is represented by a vector in R². The policy shown by the blue line was learned using Q-learning [10] with a two-layer back-propagation network as a value-function approximator. The agent reliably learned an optimal policy for this world from a variety of different random starting configurations.
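The Q-learning rule used here is that of Watkins and Dayan [10]. The sketch below applies the tabular version to a toy grid world of this kind; the grid size, learning parameters, and reward placement are illustrative assumptions, and the paper itself uses a back-propagation network rather than a table:

```python
import random

def q_learning_gridworld(episodes=2000, alpha=0.5, gamma=0.95, eps=0.1, size=5):
    """Tabular Q-learning sketch: start in the lower left, +1 for reaching
    the upper-right corner, -1 for hitting a wall, 0 elsewhere."""
    A = [(0, 1), (0, -1), (1, 0), (-1, 0)]            # N, S, E, W moves
    Q = {((x, y), a): 0.0 for x in range(size) for y in range(size) for a in A}
    goal = (size - 1, size - 1)
    for _ in range(episodes):
        s = (0, 0)
        while s != goal:
            # epsilon-greedy action selection
            a = random.choice(A) if random.random() < eps else \
                max(A, key=lambda a: Q[s, a])
            nx, ny = s[0] + a[0], s[1] + a[1]
            if 0 <= nx < size and 0 <= ny < size:
                s2, r = (nx, ny), (1.0 if (nx, ny) == goal else 0.0)
            else:
                s2, r = s, -1.0                       # bumped into a wall
            # Q-learning update (Watkins & Dayan [10])
            Q[s, a] += alpha * (r + gamma * max(Q[s2, b] for b in A) - Q[s, a])
            s = s2
    return Q

Q = q_learning_gridworld()
# The greedy action at the start state should head toward the goal (N or E).
best = max([(0, 1), (0, -1), (1, 0), (-1, 0)], key=lambda a: Q[(0, 0), a])
print(best in [(0, 1), (1, 0)])  # True
```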



Figure 1: A simple world where straightforward VFA is effective, with a policy learned using Q-learning with VFA (blue line).


Figure 2: A simple world where VFA fails (left). The value function for the "up" action, calculated by tabular Q-learning (right).

Although value-function approximation works well in the above example, we can cause it to fail simply by adding a wall (see figure 2). Typically, the learned policy will seem to ignore the wall, and attempt to drive through it towards the goal. The reason for this failure is easy to understand. VFA works by generalizing values from known states to unknown ones. There is an implicit assumption that states close to each other have similar values. For all commonly used function approximators, this measure of similarity is the (possibly weighted) Euclidean distance between the states. For the world shown in figure 2, this assumption is flawed in some parts of the state space. In particular, points A and C are close in Euclidean space, but have very different values. This difference manifests itself as a discontinuity in the value function. VFA with a simple function approximator will tend to "blur" this discontinuity, attempting to incorrectly generalize across it.

The main reason for the failure of VFA is that most function approximation algorithms implicitly make assumptions about the topology of their domain, which are often not met in practice. To separate the points, and cause VFA to succeed, we need to correctly model the topology of the domain of the VFA algorithm. We propose doing this by explicitly modeling the domain using a manifold. We give a more formal definition of a manifold below. First, however, we give a simple example to build intuition about a domain with a different topology.

Consider the simple world shown on the left of figure 3, an open room with an obstacle in the middle. As in figure 2, the Euclidean distance (red lines) between points A and B, and between A and C, is small. However, B is intuitively "closer" to A than C is.



Figure 3: A simple world containing one obstacle (left). The world is topologically cylindrical (right).

The state space of the world is a closed subset of R², corresponding to the position of the agent. A closed subset of R² with a single continuous boundary is topologically a disk. However, the manifold on which the agent can actually move (the free space) is topologically a cylinder, as can be seen in figure 3. The top edge of the cylinder corresponds to the outer edge of the obstacle, and the bottom edge to the outer wall of the world. We can safely use Euclidean distance on the surface of the cylinder to get a better measure of "nearness". The distance on the cylinder from A to C (blue line) roughly corresponds to the blue line on the left side of the figure.

The Euclidean distance metric on the cylinder corresponds to a step-based distance metric in the original world. Intuitively, this makes sense for VFA, since the value function is defined in terms of the number of steps to reward-giving states,

V^π(s) = Σ_{t=0}^{∞} γ^t r_{t+1}.

The value of a state will now be similar to those of states that are close to it in the manifold.
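For a finite reward sequence, the discounted sum above is straightforward to evaluate. The following minimal sketch assumes γ = 0.9 and a single +1 reward reached after three zero-reward steps:

```python
def discounted_return(rewards, gamma=0.9):
    """V = sum over t of gamma^t * r_{t+1}, for a finite reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Three zero-reward steps followed by the +1 goal reward: V = 0.9^3.
print(round(discounted_return([0, 0, 0, 1], gamma=0.9), 3))  # 0.729
```

Each extra step to the goal multiplies the value by γ, which is why a step-based distance is the natural similarity measure for generalizing values.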

3 Manifolds

Before giving the formal definition of a manifold, we provide an intuition about the structure, and how it can be used constructively. Consider an atlas of the world. Each page in the atlas has a partial map on it, usually of a single country. Each of these single-page maps can have a different scale, depending on the size of the country. This allows more detail to be shown where it is necessary, without committing to representing the entire atlas at a particular resolution.

The pages in the atlas also overlap with each other at the edges. For example, the map of France has part of northern Spain on it. This provides a well-defined way of moving from one map page to another. Notice that the maps may be at different scales, and may not line up perfectly. To move from one page to another, we can establish a correspondence between points in the overlap regions.

This method of using overlapping partial maps allows us to map a complex surface (the world) with a set of simpler surfaces (the individual maps). Each local map can be appropriate to the local features, such as scale, and be topologically simpler than the global map. In the case of an atlas, we are covering a sphere with a set of disks. We can define global features, such as distance, by composing their local versions on each page, and dealing appropriately with the overlap regions. This will prove to be important to us, and we will discuss it in more detail below.

We now make some definitions. Each map page in the above example is a chart. The collection of all charts is the atlas. The area shared by two adjacent pages is the overlap region. The function that maps points in the France map to points in the Spain map is the transition function. The function on each chart (for example, the elevation, or the value function) is the embedding function.


Figure 4: One possible chart allocation for a simple world with one obstacle.

We formally define a manifold, M, embedded in a space, S, as an atlas, a set of overlap regions, and a set of transition functions.

Atlas: A finite set, A, of homeomorphisms from S to the disk. Each element α_c ∈ A is called a chart. The co-domain of α_c is labeled c; the term chart is commonly also used to refer to c.

Overlap Regions: A set of subsets, U_ij = c_i ∩ c_j, where α_{c_i} and α_{c_j} are charts in A, and where U_ii = c_i. U_ij will be empty if c_i and c_j do not overlap. The atlas, A, defines U_ij in the following way: a point p ∈ c_i is in U_ij if and only if there exists a point q ∈ c_j such that α_{c_i}^{-1}(p) = α_{c_j}^{-1}(q).

Transition Functions: A set of functions, Ψ. A transition function ψ_ij ∈ Ψ is a map ψ: U_ij → U_ji, where U_ij ⊂ c_i and U_ji ⊂ c_j. Note that U_ij and U_ji will often be empty. We can define ψ_ij to be α_i ∘ α_j^{-1}.
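For the axis-aligned rectangular charts used later in the paper (and with n = m, so that each α_c is just an affine rescaling onto the unit square), these definitions can be sketched as follows. The class and function names are ours, and the transition map is computed by passing through the embedding space S:

```python
class Chart:
    """Axis-aligned rectangular chart; alpha maps the rectangle onto the
    unit square (a sketch of the n = m case)."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def contains(self, p):
        return all(l <= x <= h for x, l, h in zip(p, self.lo, self.hi))

    def alpha(self, p):      # chart coordinates in [0, 1]^n
        return tuple((x - l) / (h - l) for x, l, h in zip(p, self.lo, self.hi))

    def alpha_inv(self, q):  # back to the embedding space S
        return tuple(l + u * (h - l) for u, l, h in zip(q, self.lo, self.hi))

def transition(ci, cj, p_in_ci):
    """Transition function between two charts, defined (as in the text) by
    composing one chart map with the inverse of the other through S."""
    return cj.alpha(ci.alpha_inv(p_in_ci))

# Two overlapping 2-D charts; the point (0.75, 0.5) in S lies in their
# overlap region, so it has coordinates in both charts.
c1 = Chart((0.0, 0.0), (1.0, 1.0))
c2 = Chart((0.5, 0.0), (1.5, 1.0))
p = (0.75, 0.5)
q1 = c1.alpha(p)               # (0.75, 0.5) in chart-1 coordinates
print(transition(c1, c2, q1))  # (0.25, 0.5) in chart-2 coordinates
```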

Most of the previous work with manifolds involves using manifolds to analyze an existing surface. We propose constructing a manifold to model the domain of the value function. Some papers in the computer graphics literature describe the use of manifolds for surface construction and modeling [4, 5, 6, 8, 9]. However, they all deal with two-dimensional surfaces embedded in three dimensions.


Figure 5: Blending across charts.

We are, in general, interested in manifolds of dimension n embedded in m-dimensional spaces, where n ≤ m. For the remainder of this proposal we will, without loss of generality, deal only with the case where n = m. This makes the mapping, α_c: S → c, and the corresponding transition functions, simple and intuitive.

Now that we have defined what we mean by a manifold, the first task is to use the manifold to model the domain of the value function. The basic idea is to model the reachable parts of the state space as a set of simpler regions (the charts). Consider the world shown in figure 4. As we established previously, this world is topologically cylindrical. There are an infinite number of possible atlases for this example; any set of charts that completely covers the reachable part of the state space is acceptable. One possible atlas is shown in figure 4. Notice that charts on opposite sides of the obstacle do not overlap. The reachable space in each chart is convex; there are no obstacles


in any of the charts.¹ Euclidean distance should, therefore, be a reasonable measure of similarity within a chart. This means that straightforward VFA techniques should work well within each chart.

Now we have a model of the domain of the value function. The next step is to embed the value function approximation in this domain. To create the value function, we build a local value function on each chart, and combine the results. The values of the embedding functions (the local value functions) are combined as shown in figure 5. In this figure, there are two one-dimensional charts, each with an embedding function (the red line). Each chart also has a blend function defined on it. This is a smooth, continuous function, b_c: c → R. The value of the blend function, and all of its derivatives, must be zero at the edge of the chart. The blend functions form a partition of unity over the manifold. Embedding functions are combined linearly in the overlap regions, according to the blend functions. The global function, F, at point p is defined as a linear blend of the embedding functions, f_c, of all the charts in which p lies, weighted by the blend functions:

F(p) = ( Σ_{c∈A} b_c(p) f_c(p) ) / ( Σ_{c∈A} b_c(p) ).

Recall that, for most points p, most values of b_c(p) will be zero, since p ∉ c. Following Grimm and Hughes [4], we use a one-dimensional spline as the blend function, b(x). If we assume a chart covering the unit square, b(0) = 0, b(1) = 0, and b(x) is maximal at x = 0.5. All derivatives are zero at the boundaries: b^(n)(0) = 0, b^(n)(1) = 0. To calculate the blend value for a particular point in the chart, p ∈ R^n, the blend function is applied to each dimension and the values are multiplied together:

b_c(p) = Π_{i=1}^{n} b(p_i) if p ∈ c, and 0 otherwise.
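A minimal sketch of this blending scheme, using the C∞ bump exp(−1/(x(1−x))) as the one-dimensional spline (one valid choice with the required boundary behaviour; the specific spline of [4] may differ) and rectangular charts:

```python
import math

def bump(x):
    """C-infinity blend spline on [0, 1]: zero, with all derivatives, at the
    chart boundary, and maximal at x = 0.5."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return math.exp(-1.0 / (x * (1.0 - x)))

def blend_weight(lo, hi, p):
    """b_c(p): the 1-D spline applied to each chart-local coordinate and
    multiplied together; zero outside the chart."""
    w = 1.0
    for x, l, h in zip(p, lo, hi):
        if not (l <= x <= h):
            return 0.0
        w *= bump((x - l) / (h - l))
    return w

def global_value(charts, p):
    """F(p): blend of the per-chart embedding functions f_c, weighted by b_c."""
    num = den = 0.0
    for lo, hi, f in charts:
        w = blend_weight(lo, hi, p)
        num += w * f(p)
        den += w
    return num / den if den > 0.0 else 0.0

# Two overlapping 1-D charts with constant embedding functions 1.0 and 2.0:
# F interpolates smoothly between them across the overlap [0.5, 1.0].
charts = [((0.0,), (1.0,), lambda p: 1.0),
          ((0.5,), (1.5,), lambda p: 2.0)]
print(global_value(charts, (0.25,)))  # 1.0 (only the first chart covers it)
print(global_value(charts, (0.75,)))  # 1.5 (equal blend weights by symmetry)
```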

Blending the embedding functions in this manner allows us to make some guarantees about the resulting global function. Building a global function from this manifold will, under certain technical conditions, result in a C^k (smooth and continuous up to the k-th derivative) manifold surface. Also, the value of the global function will be locally bounded by the values of the embedding functions, since the blend functions are a partition of unity.

3.1 Manifolds for VFA

The basic idea of using manifolds for VFA is to allocate a set of overlapping charts that cover the world, reflecting the topology of the reachable space. We assign a function approximator to each of these charts, and construct the global value function by blending the predictions of these function approximators together, as described above. If we are using Q-learning, we construct one manifold for each possible action. Updating the value of a state corresponds to presenting a training example to the function approximator of each chart in which the state lies. In all other ways, we can treat the set of manifolds as a drop-in replacement for the table in standard RL algorithms. However, this somewhat simplistic description hides two hard problems: how to allocate charts, and how to calculate distance in the manifold. We are currently investigating computational approaches to these problems.
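A sketch of this update rule with leaky-integrator (constant) embedding functions, matching the approximator used in the experiments below; the uniform averaging in `predict` is a simplification of the blend-weighted combination described earlier:

```python
def make_manifold_q(charts, alpha=0.1):
    """One per-action manifold of rectangular charts, each with a
    leaky-integrator (constant) embedding function: a drop-in
    replacement for a Q-table (a sketch)."""
    values = [0.0] * len(charts)      # one constant per chart

    def contains(c, p):
        lo, hi = c
        return all(l <= x <= h for x, l, h in zip(p, lo, hi))

    def update(p, target):
        # Present the training example to every chart containing p.
        for i, c in enumerate(charts):
            if contains(c, p):
                values[i] += alpha * (target - values[i])

    def predict(p):
        covering = [values[i] for i, c in enumerate(charts) if contains(c, p)]
        return sum(covering) / len(covering) if covering else 0.0

    return update, predict

# Two overlapping 1-D charts; repeated updates at a covered point converge.
update, predict = make_manifold_q([((0.0,), (1.0,)), ((0.5,), (1.5,))])
for _ in range(200):
    update((0.75,), 10.0)
print(round(predict((0.75,)), 2))  # 10.0
```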

¹ Formally, each chart is topologically a disk.


In this paper, we will consider only approximations of the state-action value function, which can be calculated off-policy. The state-value function, on the other hand, is calculated for a particular policy. Thus, a change in the policy will affect which areas of the environment are reachable. This, in turn, will affect the effective topology of the world. This means that the best chart allocation will depend on the (possibly changing) policy. We will also simplify our initial experiments by assuming that there is a single source of reward, at the goal state.

We also make the assumption in this paper that we can arbitrarily sample states and actions, and that we are not constrained to follow trajectories. Once we have a better understanding of chart allocation schemes in this setting, we hope to relax the assumption and deal with trajectory-based agents. It seems possible, in principle, to construct charts on-line, as the agent follows trajectories through the world.

4 Preliminary Empirical Results

We have performed some preliminary experiments to investigate how well manifold-based representations of the value function work in practice. We have experimented with two very straightforward chart allocation schemes: fixed-size random placement, and random-walk based allocation.

Fixed-Size Random: Pick a random point not currently covered by the manifold. Place a fixed-size, axis-aligned square chart at that point. Repeat until the domain is completely covered by charts.

Random Walk: Pick a random point not currently covered by the manifold as the starting point. Perform n random walks of length m steps from that point by randomly selecting actions. Record all of the points visited. Place the smallest axis-aligned rectangular chart that covers all of these points. Repeat until the domain is completely covered by charts.

The random walk allocation scheme is the simplest "realistic" allocation scheme. The intuition is that points a small number of steps away from each other are likely to be similar, and should be in the same chart. Ideally, we would like to explore all possible m-step paths from the starting point, but this becomes intractable for a moderate number of actions and size of m. Thus, we sample using a random walk, and trust this to give us reasonable coverage.
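The allocation of a single chart by the random-walk scheme might be sketched as follows; the step function and world bounds are illustrative assumptions:

```python
import random

def random_walk_chart(start, step_fn, n_walks=10, m_steps=3):
    """Allocate one chart: run n random walks of m steps from `start` and
    return the smallest axis-aligned rectangle covering every visited
    point (a sketch; step_fn applies one randomly chosen action)."""
    visited = [start]
    for _ in range(n_walks):
        p = start
        for _ in range(m_steps):
            p = step_fn(p)
            visited.append(p)
    lo = tuple(min(p[d] for p in visited) for d in range(len(start)))
    hi = tuple(max(p[d] for p in visited) for d in range(len(start)))
    return lo, hi

# Toy 2-D world: unit steps in the four cardinal directions, clipped to [0, 10].
def step(p):
    dx, dy = random.choice([(0, 1), (0, -1), (1, 0), (-1, 0)])
    return (min(10, max(0, p[0] + dx)), min(10, max(0, p[1] + dy)))

lo, hi = random_walk_chart((5, 5), step)
print(all(l <= 5 <= h for l, h in zip(lo, hi)))  # True: the chart covers the start
```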

Finding a point not covered by the manifold is a hard problem, and we do not currently have a satisfying solution. It is easy to tell if a point is covered by checking each of the charts in the manifold. We use a generate-and-test approach, randomly picking points and checking whether they are covered. We keep picking points until we find one that is not covered. If we cannot generate such a point in a fixed number of tries (one million for these experiments), we declare that the domain is completely covered by the manifold.
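The generate-and-test procedure can be sketched directly; the chart representation and sampling function here are illustrative assumptions:

```python
import random

def find_uncovered(charts, sample_fn, max_tries=1_000_000):
    """Generate-and-test search for a point not covered by the manifold:
    return an uncovered sample, or None after max_tries failures (at which
    point the domain is declared completely covered)."""
    def covered(p):
        return any(all(l <= x <= h for x, l, h in zip(p, lo, hi))
                   for lo, hi in charts)
    for _ in range(max_tries):
        p = sample_fn()
        if not covered(p):
            return p
    return None

# One chart covering the left half of the unit square: any returned point
# must lie in the uncovered right half.
charts = [((0.0, 0.0), (0.5, 1.0))]
p = find_uncovered(charts, lambda: (random.random(), random.random()))
print(p is not None and p[0] > 0.5)  # True
```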

Figure 6 shows the various chart allocation strategies. Notice that the fixed-size allocation overlaps the internal walls in the world, and does not model the topology. The random walk allocations do not straddle the walls, and do model the topology of the domain. Figure 7 shows the corresponding learned value functions for the "up" action, for each of the allocations. The discount factor was set to 0.99, the learning rate to 0.8, and one million random training experiences were used. A simple leaky-integrator (constant) function approximator was used as the embedding function for each chart. Notice that the fixed-size allocation (figure 7(a)) does not capture the



Figure 6: Chart allocations for (a) fixed-size charts, and random walks of (b) length 1, (c) length 3, and (d) length 5.


Figure 7: Learned value functions for (a) fixed-size charts, and charts allocated by random walks of (b) length 1, (c) length 3, and (d) length 5.

discontinuities in the value function well. However, each of the random-walk based allocations captures the discontinuity well.

With small charts, as in figure 6(b), all points covered by a chart are very similar, and the value function approximation is very smooth. As the charts become larger, the approximation begins to reflect the underlying chart structure, as in figure 6(d). This is especially true if we use a simple embedding function, as we do here. There is a fundamental tension between the complexity of the embedding function approximator and the maximum useful chart size. Expressive function approximators permit larger charts because they can model the values of points that are relatively dissimilar. Simple function approximators necessitate small charts, since they can only model the values of points that are very similar.

5 Conclusions and Current Work

Manifold representations offer the possibility of uniting a number of other successful value function approximation schemes. For example, a tabular representation is simply a manifold with uniformly-sized charts that tile the state space with zero overlap, using a leaky-integrator function approximator as the embedding function. Variable-resolution partitioning schemes [7] can be viewed as chart allocation, again with non-overlapping charts. Interestingly, one of the most successful function approximators for VFA, the CMAC [1], is exactly a manifold. A CMAC tiles the state space with overlapping tiles, each of which has a constant value assigned to it. The prediction for a point, p, is the sum of the values of all of the tiles in which p falls. This sum is often weighted
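The correspondence can be made concrete with a minimal one-dimensional CMAC sketch (the tiling layout and learning rate are illustrative assumptions): each tile plays the role of a chart with a constant embedding value, and prediction sums the active tiles:

```python
def active_tiles(p, n_tilings=4, tile_width=1.0):
    """Tiles containing scalar input p: one tile per tiling, with each
    tiling offset by tile_width / n_tilings."""
    tiles = []
    for t in range(n_tilings):
        offset = t * tile_width / n_tilings
        tiles.append((t, int((p + offset) // tile_width)))
    return tiles

class CMAC:
    """1-D CMAC [1] viewed as a manifold: tiles are charts, each with a
    constant (leaky-integrator) embedding function."""
    def __init__(self, n_tilings=4, alpha=0.25):
        self.w = {}                    # tile -> constant value
        self.n, self.alpha = n_tilings, alpha

    def predict(self, p):
        return sum(self.w.get(t, 0.0) for t in active_tiles(p, self.n))

    def update(self, p, target):
        # Spread the error over all active tiles; the prediction moves
        # toward the target by a factor alpha per update.
        err = target - self.predict(p)
        for t in active_tiles(p, self.n):
            self.w[t] = self.w.get(t, 0.0) + self.alpha * err / self.n

cmac = CMAC()
for _ in range(50):
    cmac.update(2.3, 1.0)
print(round(cmac.predict(2.3), 3))  # 1.0
```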


by the distance of p from the center of the tile. Using the language of manifolds, the tiles are charts, the value is a leaky-integrator embedding function, and the weights are the blend functions.

For non-overlapping charts, our use of manifolds bears some similarity to piecewise function approximation schemes. For example, with a leaky-integrator embedding function and non-overlapping charts, a manifold is exactly a piecewise-constant function approximator. The addition of overlap regions causes the approximation to become smooth and continuous over the state space. The main difference between our method and existing similar function approximators is that we explicitly attempt to model the topology of the state space.

Although our experiments to date have yielded promising results, it remains an open question whether constructing an effective manifold is fundamentally easier than solving the problem through other methods, or than learning models of the transition and reward functions. To address this question empirically, we are currently working on chart-allocation schemes based on statistical tests. Our hope is that these methods will require many fewer training data samples than current approaches.

References

[1] James S. Albus. A new approach to manipulator control: The cerebellar model articulation controller (CMAC). Journal of Dynamic Systems, Measurement and Control, pages 220-227, 1975.

[2] Justin A. Boyan and Andrew W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In G. Tesauro, D. S. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems, volume 7, pages 369-376. MIT Press, 1995.

[3] Geoffrey J. Gordon. Approximate Solutions to Markov Decision Processes. PhD thesis, School of Computer Science, Carnegie Mellon University, June 1999. Also available as technical report CMU-CS-99-143.

[4] Cindy M. Grimm and John F. Hughes. Modeling surfaces of arbitrary topology using manifolds. Computer Graphics, 29(2), 1995. Proceedings of SIGGRAPH '95.

[5] Cindy M. Grimm and John F. Hughes. Smooth iso-surface approximation. In Proceedings of Implicit Surfaces '95, pages 57-67, 1995.

[6] Paton J. Lewis. Modeling surfaces of arbitrary topology with complex manifolds. Master's thesis, Brown University, 1996.

[7] Andrew W. Moore. Variable resolution reinforcement learning. Technical report CMU-RI-TR, Robotics Institute, Carnegie Mellon University, 1995.

[8] Josep Cotrina Navau and Nuria Pla Garcia. Modeling surfaces from meshes of arbitrary topology. Computer Aided Geometric Design, 17(2):643-671, 2000.

[9] Josep Cotrina Navau and Nuria Pla Garcia. Modeling surfaces from planar irregular meshes. Computer Aided Geometric Design, 17(1):1-15, 2000.

[10] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8:279-292, 1992.



Warped Landscapes and Random Acts of SAT Solving

Dave A. D. Tompkins and Holger H. Hoos

Department of Computer Science, University of British Columbia

Abstract

Recent dynamic local search (DLS) algorithms such as SAPS are amongst the state-of-the-art methods for solving the propositional satisfiability problem (SAT). DLS algorithms modify the search landscape during the search process by means of dynamically changing clause penalties. In this work, we study whether the resulting, 'warped' landscapes are easier to search than the landscapes that correspond to the original problem instances. We present empirical evidence indicating that (somewhat contrary to common belief) this is not the case, and that the main benefit of the dynamic penalty update mechanism in SAPS is an effective diversification of the search process. In most other high-performance stochastic local search algorithms, the same effect is achieved by the extensive use of randomised decisions throughout the search. We demonstrate that in SAPS, random decisions are only required in the (standard) search initialisation procedure, and can be completely eliminated from the remainder of the search process without any significant change in the behaviour or performance of the resulting algorithms compared to the original, fully randomised SAPS algorithm. We conjecture that the reason for this unexpected result lies in the ability of the deterministic variants of the scaling and smoothing mechanism and the subsidiary iterative best improvement mechanism underlying SAPS to effectively propagate the effects of initial randomisation throughout a search process that shows the sensitive dependence on initial conditions that is characteristic of chaotic processes.

1 Introduction and Background

The Propositional Satisfiability Problem (SAT) is one of the most prominent hard combinatorial decision problems; it plays an important role in many areas of computing and is of substantial theoretical and practical interest. A popular and very successful approach to solving SAT is to apply Stochastic Local Search (SLS) algorithms. A well-known class of SLS algorithms for SAT is based on the idea of Dynamic Local Search (DLS); DLS algorithms dynamically change the evaluation function of a subsidiary local search algorithm, and hence the search landscapes, during the search process. Well-known DLS algorithms for SAT include Morris' Breakout Method [8], GSAT with clause weighting [11], Frank's GSAT with rapid weight adjustment [1], the Discrete Lagrangian Method (DLM) [14], the Smoothed Descent and Flood (SDF) and Exponentiated Sub-Gradient (ESG) algorithms [9, 10], as well as our recent Scaling and Probabilistic Smoothing (SAPS) algorithm, which achieves state-of-the-art performance for SAT and MAX-SAT [6, 13].

All of these DLS algorithms use dynamically changing clause penalties in order to modify (or warp) the underlying search landscape during the search. In particular, the SAPS algorithm works as follows: All propositional variables are randomly initialised (one or zero), and each clause penalty is initialised to one. The evaluation function for SAPS is the sum of the penalties of the unsatisfied clauses. The subsidiary local search algorithm of SAPS is a simple iterative best improvement method that in each step flips the variable that decreases the evaluation function the most, with ties broken randomly. (Note that the same basic search method underlies the GSAT algorithm [7].) If no variable flip can reduce the evaluation function, then the algorithm is at a local minimum. At a local minimum, either with probability w_p a variable is randomly selected and flipped (a so-called random walk step), or the clause penalties are adjusted by a scaling step, and possibly a smoothing step. In a scaling step, all unsatisfied clause penalties are multiplied by a fixed value α, which is a parameter of the algorithm. After a scaling step, a smoothing step occurs with probability P_smooth, in which case each penalty is adjusted to clp(c) := ρ · clp(c) + (1 − ρ) · clp_avg, where clp(c) is the penalty of a given clause c, clp_avg is the average of all clause

Corresponding author; mailing address: Holger H. Hoos, Department of Computer Science, University of British Columbia, 2366 Main Mall, Vancouver, BC, Canada, V6T 1Z4; e-mail: [email protected]



weights after scaling, and the parameter ρ has a fixed value between zero and one. (A detailed description of SAPS can be found in [6].)
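Schematically, the penalty update at a local minimum can be rendered as follows (a sketch based on the description above, not the reference implementation of [6]; the function and parameter names are ours):

```python
import random

def saps_penalty_update(penalties, unsat_clauses, alpha, rho, p_smooth, rng=random):
    """Penalty update performed by SAPS at a local minimum (schematic).

    penalties: dict mapping clause id -> current clause penalty
    unsat_clauses: ids of clauses unsatisfied under the current assignment
    """
    # scaling step: multiply all unsatisfied clause penalties by alpha
    for c in unsat_clauses:
        penalties[c] *= alpha
    # with probability p_smooth, a smoothing step pulls every penalty
    # toward the average penalty: clp(c) <- rho*clp(c) + (1 - rho)*mean
    if rng.random() < p_smooth:
        mean = sum(penalties.values()) / len(penalties)
        for c in penalties:
            penalties[c] = rho * penalties[c] + (1 - rho) * mean
    return penalties
```

With rho close to one, smoothing only slowly erodes the scaled-up penalties, which is what later gives rise to the short-term versus long-term memory distinction discussed in Section 2.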

In this work, we study the factors underlying the effectiveness of DLS algorithms such as SAPS for solving hard SAT instances. In particular, we investigate the commonly held belief that the warping of the search space achieved by these algorithms makes the respective problem instances easier to solve and hence can be seen as a form of learning. We present evidence that this belief is essentially incorrect and provide better insights into the role of warping, as realised by the scaling and smoothing mechanism in SAPS, in the context of the impressive performance of state-of-the-art DLS algorithms for SAT. Furthermore, we investigate the role of random decisions in DLS algorithms and present empirical evidence that, contrary to earlier results for other SLS algorithms for SAT, for SAPS the only place where randomisation is required is in the determination of the initial truth assignment. This demonstrates that even a completely deterministic variant of the scaling and smoothing mechanism used in SAPS achieves a sufficient degree of diversification and mobility of the search process, leading to the same excellent performance as the original, heavily randomised SAPS algorithm. Furthermore, the run-time of the resulting algorithm on a given SAT instance depends solely on the initial assignment, which suggests possible directions for further, substantial performance improvements.

2 Warped Intelligence

While solving a given problem instance, DLS algorithms dynamically warp their underlying search space whenever they update their clause penalties. It has been suggested that the success of this approach is a result of the fact that the clause penalties represent accumulated 'knowledge' about the search space (see, e.g., [1]). In other words, DLS algorithms are 'learning' about the landscape they search; in particular, the resulting warped landscape should be easier to search than the original space. This is often expressed through a popular analogy in which a DLS algorithm 'fills in the holes', i.e., the local minima, of a given search landscape. Although the notion that the effectiveness of DLS algorithms is explained by their ability to produce warped landscapes in which solutions can be found more efficiently appears to be widely accepted, little evidence has been provided to support this hypothesis.

To investigate the validity of this proposed explanation, we examined the SAPS algorithm and took snapshots of the clause penalties whenever the algorithm had found a solution. These clause penalties were then used to generate weighted SAT instances, which we refer to as the SAPS generated weighted instances, where the weight of each clause is simply the corresponding clause penalty used by SAPS at the point when a solution was found. (Here and in the following, we will use the terms 'weight' and 'clause weight' to refer to weights that are statically associated with the clauses of a given, weighted SAT instance, while the terms 'penalty' and 'clause penalty' are reserved for dynamically changing penalty values associated with clauses during the run of a DLS algorithm.) Many existing SLS algorithms for SAT can be generalised easily and naturally to weighted SAT instances by replacing their standard evaluation function, defined as the total number of clauses unsatisfied under a given truth assignment, with a function that maps truth assignments to the corresponding weighted sum of unsatisfied clauses. (Note that standard SAT instances can be seen as a special case of weighted SAT instances, where all clause weights are equal.) We propose that if (on average) an algorithm can solve the weighted problem instance in fewer search steps than the corresponding unweighted problem instance, then the weights are truly making the instance easier.
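The generalised evaluation function described above can be sketched as follows (a minimal illustration; the encoding of clauses as lists of signed integer literals is our assumption):

```python
def weighted_eval(clauses, weights, assignment):
    """Weighted sum of unsatisfied clauses (schematic).

    clauses: list of clauses, each a list of int literals (a positive literal
             v is satisfied when variable v is true, a negative one when false)
    weights: clause weights, aligned with clauses
    assignment: dict mapping variable -> bool
    """
    def satisfied(clause):
        return any(assignment[abs(l)] == (l > 0) for l in clause)

    # sum the weights of all clauses left unsatisfied by the assignment
    return sum(w for clause, w in zip(clauses, weights) if not satisfied(clause))
```

With all weights equal to one, this reduces to the standard evaluation function, i.e., the number of unsatisfied clauses, illustrating the remark that unweighted SAT is the unit-weight special case.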

If the efficiency of SAPS were based on its ability to learn clause weights that render the given instance easier, this should be reflected in reduced search cost when SAPS is initialised with the clause weights from a previous successful run. Note that this is equivalent to restarting the algorithm by randomly reinitialising all variables while keeping the clause penalties the same. In other words, if the clause penalties represent 'knowledge' that the algorithm has accumulated about the search space, then the modified SAPS algorithm starts with all of that knowledge a priori.

For this experiment, and all remaining experiments reported in this paper, we chose a number of well-known random and structured instances from SATLIB (www.satlib.org) [5]. For all experiments, we did not optimise the SAPS parameters (α, ρ, P_smooth, w_p) and used the default values given in [6]. Generally, we measure the performance of an algorithm as the median or mean number of steps required to reach a solution.

                           SAPS generated        Permuted SAPS        Randomly generated
                          weighted instances   weighted instances   weighted instances
Instance      Unweighted   lo   Median   hi     lo   Median   hi     lo   Median   hi

uf100-easy            81  0.98   1.01  1.06   0.98   1.00  1.02   1.31   1.36  1.46
uf100-hard         3,763  1.08   1.11  1.14   1.00   1.05  1.09   1.03   1.06  1.10
uf250-hard       197,044  0.98   1.06  1.14   0.35   1.01  1.04   0.97   1.03  1.06
uf400-hard     2,948,181  0.92   1.04  1.17   1.00   1.04  1.07   0.95   1.10  1.19
ais10             20,319  1.06   1.09  1.11   1.03   1.08  1.12   1.04   1.11  1.19
bw large.a         2,499  0.90   0.93  0.98   0.99   1.02  1.04   1.01   1.04  1.07
bw large.b        34,548  0.97   1.02  1.08   1.00   1.02  1.06   0.99   1.07  1.11
flat100-hard      24,248  0.99   1.02  1.04   0.97   1.02  1.05   0.98   1.01  1.04
logistics.c        9,446  0.97   1.03  1.06   1.02   1.05  1.07   1.05   1.07  1.14
ssa7552-038        3,960  0.86   0.91  0.95   0.94   0.98  1.00   1.02   1.08  1.12

Table 1: Comparison of SAPS performance on SAPS generated weighted instances, permuted SAPS generated instances and randomly weighted instances. For each test instance, SAPS was run 1000 times to obtain a run-length distribution. From that distribution, 25 characteristic runs were selected at evenly spaced quantiles. Weighted instances were then generated using the final clause penalties from each of the characteristic runs. SAPS was run on each of the 25 weighted instances 250 times, and the median run from each weighted instance was identified. Those 25 medians formed a distribution, and basic descriptive statistics of that distribution (the median and a lower and upper quantile, columns 'lo' and 'hi') are presented above. For each weighted result the value shown is the ratio of the respective run-length statistic of the weighted instance to the median run-length of the unweighted instance. The procedure was then repeated on 25 instances where the weights were randomly permuted, and then finally on 25 instances with randomly generated weights.

As can be seen from the results in Table 1, the performance of SAPS on the SAPS generated weighted instances was not significantly better than its performance on the original (unweighted) instances, and on some instances it was worse. Furthermore, we found no correlation between the length of the SAPS run from which the penalties were taken (short, average, long) and the hardness of the resulting weighted instance.

Next, we considered variants of the SAPS generated weighted instances in which the weights had been randomly permuted by assigning each weight to a randomly chosen clause. The results in Table 1 demonstrate that SAPS behaves similarly on the original SAPS generated instances and the permuted instances. This provides further evidence that the individual weights do not reflect specific knowledge about the respective clauses.

Finally, we looked at instances generated with completely random weights, uniformly sampled from the interval (0,1], which we refer to as randomly weighted instances. In most of the experiments, we found no significant performance differences when we compared the performance of SAPS on the SAPS generated weighted instances and the randomly weighted instances. A few minor exceptions, in which the randomly generated weights rendered the instance slightly harder than SAPS generated weights or unit weights, can be seen in Table 1 (bw large.a and ssa7552-038).

To further investigate this matter, we considered two larger SATLIB test-sets: flat100, a set of 100 randomly generated, hard, SAT-encoded graph colouring instances with 300 variables each, and uf100, a set of 100 instances from the solubility phase transition region of Random-3-SAT with 100 variables each. (Note that the flat100 instances contain a certain amount of structure induced by the specific mechanism used for generating random graph colouring instances and, more importantly, by the encoding into SAT.) In both cases, we repeated the previously explained procedures on all 100 instances of the respective test-set in order to determine whether for any of these instances using SAPS weights rendered them significantly easier than using randomly generated


[Figure 1: two log-log scatter plots; x-axis: median run-length on SAPS generated weights [steps]; y-axis: median run-length on randomly generated weights [steps]; left: flat100, right: uf100.]

Figure 1: Comparison of SAPS performance on SAPS generated weighted instances and randomly weighted instances obtained for the SATLIB test-sets flat100 (left) and uf100 (right). For each test instance, SAPS was run 1000 times to obtain a run-length distribution. From that distribution, 25 characteristic runs were selected at evenly spaced quantiles. Weighted instances were then generated using the final clause penalties from each of the characteristic runs. SAPS was run on each of the 25 weighted instances 250 times, and the median run from each weighted instance was identified. The median of those medians was selected as the run-length shown above. The same procedure was carried out using 25 variants of each test instance with randomly generated weights.

weights. The results presented in Figure 1 show that this is not the case.

Although we have established that SAPS does not generally find the SAPS generated weighted instances

easier, it may be the case that other algorithms can use these weights to their advantage. In particular, this could be the case for statically weighted algorithms such as GWSAT [12]. Unlike DLS algorithms (such as SAPS), which dynamically change the clause penalties and may hence steer away from the original penalty values unless the instance can be solved relatively quickly, these static algorithms continue to use the original weights and may thus be in a better position to exploit them. However, preliminary experiments with GWSAT (results not presented here) indicate that, similar to SAPS, GWSAT does not appear to perform better when using SAPS generated weights as opposed to the unit weights that characterise the original, unweighted problem instances.

We also studied a slight variation of GSAT, RGSAT (Restarting GSAT), which uses the same best improvement search method underlying both GSAT and SAPS, but restarts the search from another random truth assignment whenever it cannot perform an improving search step, i.e., whenever it encounters a local minimum or plateau of the given search landscape. Note that RGSAT has no plateau search capabilities, but can also never get stuck in a local minimum. The performance of RGSAT is quite poor, but it provides interesting insights into the hardness of SAT instances for simple local search methods. We conducted a series of experiments in which we took snapshots of the warped landscape at every local minimum encountered within a SAPS search trajectory on SATLIB instance uf100-easy. We then ran RGSAT multiple times on each of the corresponding weighted instances and found that for RGSAT, these instances become progressively easier to solve. However, we also observed that RGSAT's performance generally increases with the ruggedness of a given search landscape, where (following common practice) we consider a landscape more rugged if it contains fewer plateaus and more strict local minima. It is easy to see that during a SAPS run that starts from unit penalties, the landscape ruggedness can be expected to increase progressively, and hence it is not surprising that RGSAT finds the resulting warped landscapes easier to search than the landscape of the original, unweighted instance.

Overall, based on our empirical results, there is no evidence to support the hypothesis that the warped landscapes generated by SAPS are easier to search for any reasonably powerful SLS algorithm, and hence, that the


clause penalties determined over successful runs of SAPS reflect any general knowledge on how to solve the given problem instance more efficiently. Based on the similarity of the underlying penalty update mechanisms, we predict that the same result will apply to the warped landscapes obtained from other DLS algorithms, but testing this prediction will require some additional experimentation.

Based on these findings, it is worthwhile to revisit our understanding of the behaviour of DLS algorithms. The penalty update system of SAPS is based on two mechanisms: scaling and smoothing. Scaling clearly affects the mobility of the algorithm, i.e., its ability to escape from local minima and other non-solution areas of the given search space. This was the original motivation for introducing the scaling mechanism, and it seems to serve that purpose well. It has been suggested that DLS algorithms use scaling (or similar mechanisms) to 'fill in the holes' (i.e., local minima) of the given landscape [8, 11].

While this analogy is appealing, it also turns out to be very deceiving. Consider a regular 3-SAT clause c = (l1 ∨ l2 ∨ l3), and the impact of increasing the penalty of this clause on a given search landscape. Recall that the search landscape for a CNF formula F with n variables is an n-dimensional hypercube formed by the 2^n truth assignments of F. Any assignment a has n neighbours that differ from a in the truth value of exactly one variable. If the search is at a position in the space where clause c is unsatisfied (all three of its literals are false), then n − 3 of its neighbours will also leave c unsatisfied, and increasing the clause penalty of c will raise the level (i.e., increase the evaluation function value) of those neighbours as well. So the level of the current assignment will be raised relative to only three local neighbours. But recall that 1/8 of the entire search space also has clause c unsatisfied, and so while locally the space has changed relative to three local neighbours, globally 2^(n−3) points in the space have been affected. Hence, by modifying a clause penalty, a DLS algorithm may effect a beneficial change in a very small local region of the search landscape, but at the global level, it is possible and rather likely that the side-effects of this local warping can be quite detrimental, e.g., by giving rise to new local minima. (The possibility of such detrimental side-effects has also been hinted at in [8].)
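The counting argument above can be verified by brute-force enumeration on a small hypercube (an illustrative check; the choice n = 10 and the clause over the first three variables are arbitrary):

```python
from itertools import product

n = 10
clause_vars = (0, 1, 2)            # a 3-clause over x0, x1, x2, all positive literals

def unsat(a):
    # the clause is unsatisfied iff all three of its literals are false
    return not any(a[v] for v in clause_vars)

# globally, exactly 2^(n-3), i.e. one eighth, of all assignments leave the
# clause unsatisfied, so a penalty change touches that many points at once
points = [a for a in product((0, 1), repeat=n) if unsat(a)]
assert len(points) == 2 ** (n - 3)

# locally, from any assignment with the clause unsatisfied, flipping one of
# the other n-3 variables leaves it unsatisfied: only 3 neighbours satisfy it
a = points[0]
still_unsat = sum(unsat(a[:i] + (1 - a[i],) + a[i + 1:]) for i in range(n))
assert still_unsat == n - 3
```

The gap between the three neighbours affected locally and the 2^(n−3) points affected globally is exactly the disproportion the paragraph describes.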

The smoothing mechanism is required to compensate for these undesired, global side-effects. Essentially, smoothing helps the SAPS algorithm to 'forget' clause penalties that were important in order to escape from a local minimum region, but are no longer helpful after the search has left that region. In some sense, scaling and smoothing try to balance short-term versus long-term memory effects of the search. Since in DLS algorithms like SAPS, long-term memory effects seem to be mostly undesired side-effects of crucially important short-term modifications of the search landscape, the main role of smoothing appears to be to limit the long-term effects of these local changes. (See also the discussion of long-term vs short-term effects in [1].) Hence, when we re-use the penalties from a previous run of SAPS, we start in a completely different area of the search space, and the most recent local warping (which dominates the SAPS search behaviour) is of no benefit. Recent local changes from the previous search will be gradually 'forgotten' through smoothing, leaving the accumulated residual 'long-term' memory (clause penalties) of the previous search, which according to our results does not appear to be effective in reducing search cost.

Overall, the results from our empirical investigation suggest that the penalty-based mechanism used by SAPS and other DLS algorithms for warping the search landscape mainly serves as a diversification mechanism, which allows the search process to effectively avoid and overcome stagnation due to local minima and plateau regions while maintaining its heuristic guidance.

3 Random Decisions Need Not Apply

By definition and in practice, random decisions are an essential ingredient of stochastic local search, and they are often crucial for achieving high performance in a robust way [3]. It may be noted, however, that the strongly randomised search mechanisms found in SAT algorithms such as GWSAT or WalkSAT [12] serve essentially the same purpose as the scaling and smoothing mechanism in SAPS: effective diversification of the search. Together with an earlier observation that high-performance DLS algorithms, after an initial search phase, often become nearly deterministic [10], this raises the question of where and to what extent randomness is needed in the context of DLS algorithms.

Page 237: Artificial Intelligence and Mathematics - Rutgers Universityrutcor.rutgers.edu/~amai/aimath04/AcceptedPapers/aimath-I.pdf · Artificial Intelligence and Mathematics January 4-6, 2004

We begin our investigation of this issue with the observation that after the search is initialised (at a randomly selected truth assignment), SAPS uses three types of random decisions:

1. Random tie-breaking: When two or more variables would give the identical best improvement when flipped, one of them is chosen at random.

2. Random walk steps: When a local minimum is encountered, a random walk step is performed with probability w_p.

3. Probabilistic smoothing: Scaling, which also occurs only when a local minimum is encountered, is followed by smoothing with probability P_smooth.

When examining the importance of each of these mechanisms in more detail, we found the following:

1. Random tie-breaking: On long search trajectories, after the algorithm has performed scaling and smoothing, most clause penalties become unique, and so encountering a tie becomes increasingly unlikely.

2. Random walk steps: We have not encountered any empirical evidence that the random walk mechanism (originally inherited from the ESG algorithm) is important for the performance of SAPS, and in our experience setting w_p = 0 does not degrade performance.

3. Probabilistic smoothing: The circumstances under which probabilistic smoothing will produce a substantially different search trajectory from periodic smoothing appear to be rare, and we have not seen any instances where this difference has significantly altered the algorithm's performance.

Together, these observations suggest that none of the random decisions discussed so far are crucial to the behaviour and performance of SAPS. To test this hypothesis, we designed a SAPS variant called SAPS/NR, which differs from SAPS only in the following three aspects:

1. Deterministic tie-breaking: Whenever a tie between variables occurs, the SAPS/NR algorithm deterministically chooses the variable with the lowest index value.

2. No random walk steps: The parameter w_p is always set to zero, so that random walk steps are never performed.

3. Periodic smoothing: Probabilistic smoothing is replaced with deterministic periodic smoothing, where smoothing occurs once every 1/P_smooth local minima.
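The two deterministic replacements can be sketched as follows (our own rendering; in particular, the smoothing period 1/P_smooth is an assumption chosen to match the expected smoothing frequency of the probabilistic variant):

```python
def pick_variable(improvements):
    """Deterministic tie-breaking (SAPS/NR): among the variables giving the
    best (largest) improvement, choose the one with the lowest index.

    improvements: list indexed by variable, holding the evaluation-function
    improvement obtained by flipping that variable.
    """
    best = max(improvements)
    return min(i for i, imp in enumerate(improvements) if imp == best)

def should_smooth(local_min_count, p_smooth):
    """Periodic smoothing (SAPS/NR): smooth deterministically once every
    1/p_smooth local minima instead of with probability p_smooth, so the
    long-run smoothing frequency matches the probabilistic variant."""
    period = round(1 / p_smooth)
    return local_min_count % period == 0
```

Both rules are pure functions of the search state, which is what makes every run of SAPS/NR after initialisation fully reproducible.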

At first glance, it may seem that SAPS/NR is completely deterministic, but we must emphasise that the initialisation of SAPS/NR is identical to the procedure in SAPS, and consequently the initial starting position for each run of SAPS/NR is completely random. The results of a simple experiment in which we measured the performance differences between SAPS and SAPS/NR are reported in Table 2; clearly, there is no significant difference between the behaviour of SAPS and SAPS/NR on our test instances. In particular, it is somewhat surprising that there appears to be no significant reduction in the variability of the run-time for the same problem instance.

To see how SAPS/NR behaves when the number of random decisions is reduced even further, we conducted an additional experiment in which we first reduced the total randomness to zero by initialising all variables to a fixed value and confirmed that all runs were identical. We then allowed one variable to be randomly initialised, then two variables, and so on. The results of this experiment, reported in Table 3, show that even for as few as 32 random decisions (i.e., 32 random variable initialisations), the behaviour of SAPS/NR is not substantially different from that of the fully randomised original SAPS algorithm. Even more surprisingly, with as few as 4 random decisions, as illustrated in Figure 2, the run-time, which can now take only 16 different values, often exhibits the variability that is observed for fully randomised SAPS as well as for all other state-of-the-art SLS algorithms for SAT [4].

Page 238: Artificial Intelligence and Mathematics - Rutgers Universityrutcor.rutgers.edu/~amai/aimath04/AcceptedPapers/aimath-I.pdf · Artificial Intelligence and Mathematics January 4-6, 2004

                     SAPS               SAPS/NR
Instance          Mean      c.v.     Mean      c.v.

uf100-easy             102  0.75          103  0.70
uf100-hard           5,572  0.95        5,458  0.97
uf250-hard         296,523  0.98      282,668  1.02
uf400-hard       4,349,480  0.75    3,662,192  0.83
ais10               32,810  1.01       31,527  0.99
bw large.a           3,374  0.85        3,245  0.81
bw large.b          50,025  0.95       50,266  0.94
flat100-hard        35,124  1.02       33,519  0.98
logistics.c         12,873  0.76       12,458  0.83
ssa7552-038          4,460  0.44        4,399  0.41

Table 2: Performance comparison of SAPS and SAPS/NR. For each instance, SAPS and SAPS/NR were run 1000 times, and the mean and coefficient of variation (c.v.) of the run lengths are shown for each instance. The c.v. is calculated as the standard deviation divided by the mean (c.v. = σ/μ). Note that a c.v. of 1 characterises an exponential run-length distribution, which is typical for high-performance SLS algorithms for SAT.
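The caption's claim about the coefficient of variation can be checked numerically: for an exponential distribution the c.v. is 1 regardless of the mean (a quick standard-library check; the mean of 5000 steps is illustrative):

```python
import random
import statistics

def cv(samples):
    # coefficient of variation: standard deviation divided by the mean
    return statistics.pstdev(samples) / statistics.mean(samples)

rng = random.Random(0)
# run lengths drawn from an exponential distribution with mean 5000 steps
runs = [rng.expovariate(1 / 5000) for _ in range(100_000)]
assert abs(cv(runs) - 1.0) < 0.02   # c.v. of an exponential distribution is 1
```

A c.v. well below 1 (as for ssa7552-038 in Table 2) therefore indicates a run-length distribution noticeably more concentrated than exponential.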

# Random       flat100-hard       uf100-hard        bw large.a
Decisions      Mean      c.v.     Mean     c.v.     Mean     c.v.

0            24,198     0.00     4,764    0.00     2,160    0.00
1            28,491     0.51     5,316    0.44     3,016    0.40
2            30,315     0.68     4,834    0.67     2,989    0.59
4            33,966     0.90     5,408    0.88     3,254    0.75
8            33,467     0.98     5,413    0.94     3,218    0.81
16           34,074     0.99     5,560    0.97     3,285    0.84
32           34,113     0.99     5,476    0.96     3,276    0.83
64           34,099     0.99     5,435    0.96     3,265    0.84
# vars       33,769     0.98     5,458    0.97     3,245    0.81
SAPS         33,519     0.98     5,572    0.95     3,374    0.85
(# random)   (9,780)            (2,750)           (1,590)

Table 3: The effect of varying the number of random decisions in the initialisation of SAPS/NR. For a SAPS/NR experiment with n variables and r random decisions, (n − r) variables were randomly selected and assigned a truth value (0 or 1) randomly, and then those variables were kept fixed for the remainder of that experiment. In each experiment, 1000 runs were conducted (250 for uf400-hard) and the run-length mean and c.v. were determined. Each experiment was then repeated 100 times on every instance, where the randomly selected (n − r) variables to fix were different for each experiment. The median results of those 100 experiments are presented above for each instance. For comparison, the corresponding values for SAPS are shown, in addition to the mean number of random decisions made in a run of SAPS.
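The initialisation scheme used in these experiments can be sketched as follows (our own rendering of the procedure described in the caption; variable indexing and helper names are assumptions):

```python
import random

def partially_random_init(n, r, rng):
    """Initialisation used in the Table 3 experiments (schematic): fix a
    random assignment for (n - r) randomly chosen variables once, then
    return a per-run initialiser that randomises only the remaining r
    variables, so each run makes exactly r random decisions."""
    fixed_vars = rng.sample(range(n), n - r)
    fixed_vals = {v: rng.randint(0, 1) for v in fixed_vars}
    free_vars = [v for v in range(n) if v not in fixed_vals]

    def init_run():
        a = dict(fixed_vals)
        for v in free_vars:          # the only r random decisions per run
            a[v] = rng.randint(0, 1)
        return a

    return init_run
```

With r = 4 and n = 100, any two runs can differ in at most 4 variables of the initial assignment, which is why the run-time can take at most 2^4 = 16 distinct values.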

Page 239: Artificial Intelligence and Mathematics - Rutgers Universityrutcor.rutgers.edu/~amai/aimath04/AcceptedPapers/aimath-I.pdf · Artificial Intelligence and Mathematics January 4-6, 2004

[Figure 2: two log-log quantile-quantile plots; x-axis: SAPS run-length [steps]; y-axis: SAPS/NR run-length [steps]; left: flat100-hard, legend '300 Decisions' and '4 Decisions'; right: uf100-hard, legend '100 Decisions' and '4 Decisions'.]

Figure 2: Quantile-quantile plot of the run-length distributions for SAPS/NR vs regular SAPS on SATLIB instances flat100-hard (left) and uf100-hard (right). The data is based on 1000 runs each of SAPS, SAPS/NR (4 random decisions) and SAPS/NR (n random decisions, where n is the number of variables). For the 1000 runs with only 4 random decisions, (n − 4) variables were selected at random and assigned a truth value (0 or 1) randomly, and then those variables were kept fixed for all 1000 runs.

This result appears to be in stark contrast with earlier observations by Gent and Walsh [2], who demonstrated that the initial starting position is typically not important for the performance of variants of GSAT. Indeed, it can be shown that for the regular SAPS algorithm, deterministic initialisation does not lead to significantly different behaviour, since apparently the other random decisions suffice to ensure adequate diversification and mobility of the search process. The ESG algorithm exhibits the same behaviour: ESG has only two of the three sources of randomness described for SAPS (no probabilistic smoothing), but with a deterministic initialisation those two sources provide the same degree of variability as observed for regular ESG or SAPS. However, in the absence of other random decisions, the initial starting position is not only important, but also sufficient for achieving the same (desirable) behaviour as exhibited by SAPS, ESG and other high-performance SLS algorithms for SAT.

In this context, it is interesting to note that implementations of all SLS algorithms on traditional computers are in fact deterministic, since they use pseudo-random number generators instead of a true source of randomness as the basis for any "random" decision. Much in tune with current theoretical thinking, this suggests that true randomness is probably not needed for achieving good performance in SLS algorithms and other methods for solving hard combinatorial problems. Instead, the crucial role of pseudo-random decisions is to diversify the search process in a way that is independent of features of the given problem instance, in order to compensate for weaknesses in the heuristics that otherwise guide the search. Our results suggest that for DLS algorithms such as SAPS/NR, the complex interaction between the subsidiary greedy local search process and the effects of the warping of the search space accomplished by the scaling and smoothing mechanisms to a large extent fulfils the same role. This leads to a search process that is chaotic in the sense of a dynamical system that shows extremely sensitive dependence on its initial conditions. (Note that chaotic behaviour is often defined based on sensitive dependence on initial conditions.) We conjecture that this chaotic nature of the search propagates and amplifies the effects of the random initialisation through arbitrarily long search trajectories and thus reduces the need for further random decisions throughout the search process, while achieving the high mobility and diversification that are crucial for the excellent performance of SAPS. Clearly, further investigation is required in order to validate this interpretation of the search process in SAPS/NR and SAPS.

Page 240: Artificial Intelligence and Mathematics, Rutgers University, rutcor.rutgers.edu/~amai/aimath04/AcceptedPapers/aimath-I.pdf — Artificial Intelligence and Mathematics, January 4-6, 2004

4 Conclusions and Further Work

In this paper, we presented an initial investigation into the factors underlying the excellent performance of state-of-the-art dynamic local search algorithms for SAT. Our empirical results show that the warping of search landscapes effected by the scaling and smoothing mechanism of SAPS, a state-of-the-art dynamic local search algorithm for SAT, does not typically correspond to a global simplification of the given problem instance. The local warping achieved by scaling plays a crucial role in allowing the subsidiary greedy local search algorithm to escape from local minima of its evaluation function, but can have large, unwanted effects on the search landscape on a global scale. Therefore, it is crucial to provide a mechanism that allows for undoing these unwanted changes. In SAPS, this function is performed by the smoothing stage, which helps the algorithm ‘forget’ about recent scaling effects. Hence, the primary effect of scaling and smoothing corresponds to a form of reactive short-term memory; however, because of the nature of the smoothing operation, SAPS shows long-term memory effects which typically do not appear to be helpful for the search process. Currently, we are extending the work presented here by trying to isolate the short-term from the long-term memory of the algorithm, as well as by further studying the nature of the dynamically changing landscapes in dynamic local search using standard search space analysis techniques. We believe that it may be possible to devise a modified long-term memory mechanism that, unlike the current smoothing scheme in SAPS, can be useful for rendering problem instances easier as the search progresses.

Furthermore, we have shown that, as a result of the properties of the scaling and smoothing mechanism, the role of randomness in SAPS is somewhat different from that in other SLS algorithms; in particular, after randomly initialising the search process, all further random decisions can be eliminated without any significant effect on the behaviour and performance of the algorithm. The resulting variant of SAPS, SAPS/NR, essentially shows chaotic behaviour in that the length of successful runs depends extremely sensitively on the initial truth assignment. We believe that this is a non-trivial effect of the complex interaction between the simple subsidiary greedy search algorithm and the dynamics of penalty updates. We are currently investigating this chaotic behaviour in more detail; furthermore, we are trying to find heuristics that allow us to initialise SAPS/NR in a way that biases it toward substantially shorter run-times. Particularly for structured problem instances, we are hopeful that such heuristic initialisation methods can be devised, leading to further improvements in the state-of-the-art in SAT solving.

References

[1] Jeremy Frank. Learning short-term clause weights for GSAT. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI'97), pages 384–389, 1997.

[2] Ian P. Gent and Toby Walsh. Towards an understanding of hillclimbing procedures for SAT. In Proceedings of the Eleventh National Conference on Artificial Intelligence (AAAI'93), page 283, 1993.

[3] Holger H. Hoos. On the run-time behaviour of stochastic local search algorithms for SAT. In Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI'99), pages 661–666, 1999.

[4] Holger H. Hoos and Thomas Stützle. Local search algorithms for SAT: An empirical evaluation. Journal of Automated Reasoning, 24(4):421–481, 2000.

[5] Holger H. Hoos and Thomas Stützle. SATLIB: An online resource for research on SAT. In I. P. Gent, H. van Maaren, and T. Walsh, editors, SAT 2000: Highlights of Satisfiability Research in the Year 2000, volume 63 of Frontiers in Artificial Intelligence and Applications, pages 283–292. IOS Press, Amsterdam, The Netherlands, 2000.

[6] Frank Hutter, Dave A. D. Tompkins, and Holger H. Hoos. Scaling and probabilistic smoothing: Efficient dynamic local search for SAT. In LNCS 2470: Proceedings of the Eighth International Conference on Principles and Practice of Constraint Programming, pages 233–248. Springer Verlag, 2002.

[7] David Mitchell, Bart Selman, and Hector Levesque. Hard and easy distributions of SAT problems. In Proceedings of the Tenth National Conference on Artificial Intelligence (AAAI'92), pages 459–465. AAAI Press, 1992.

[8] Paul Morris. The breakout method for escaping from local minima. In Proceedings of the Eleventh National Conference on Artificial Intelligence (AAAI'93), pages 40–45, 1993.

[9] Dale Schuurmans and Finnegan Southey. Local search characteristics of incomplete SAT procedures. In Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI'00), pages 297–302, 2000.

[10] Dale Schuurmans, Finnegan Southey, and Robert C. Holte. The exponentiated subgradient algorithm for heuristic boolean programming. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI'01), pages 334–341, 2001.

[11] Bart Selman and Henry A. Kautz. Domain-independent extensions to GSAT: Solving large structured satisfiability problems. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence (IJCAI'93), pages 290–295, 1993.

[12] Bart Selman, Henry A. Kautz, and Bram Cohen. Noise strategies for improving local search. In Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI'94), pages 337–343, 1994.

[13] Dave A. D. Tompkins and Holger H. Hoos. Scaling and probabilistic smoothing: Dynamic local search for unweighted MAX-SAT. In LNAI 2671: Proceedings of the Sixteenth Conference of the Canadian Society for Computational Studies of Intelligence (AI 2003), pages 145–159. Springer Verlag, 2003.

[14] Zhe Wu and Benjamin W. Wah. An efficient global-search strategy in discrete Lagrangian methods for solving hard satisfiability problems. In Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI'00), pages 310–315, 2000.


Using Automatic Case Splits and Efficient CNF Translation to Guide a SAT-Solver When Formally Verifying Out-of-Order Processors

Miroslav N. Velev [email protected]

http://www.ece.cmu.edu/~mvelev

Abstract. The paper integrates automatically generated case-splitting expressions and an efficient translation to CNF in order to formally verify an out-of-order superscalar processor having register renaming, as well as a Reorder Buffer and Reservation Stations that are completely implemented and instantiated. The processor was defined in the high-level Hardware Description Language AbsHDL, based on the logic of Equality with Uninterpreted Functions and Memories (EUFM), and was formally verified with an extended version of the decision procedure EVC, combined with a SAT-solver. The manual work was limited to the definition of necessary invariant constraints. The formal verification was decomposed, based on automatically generated case-splitting expressions—matching Reorder Buffer entries with Reservation Stations to compute the results for those Reorder Buffer entries, and allowing for orders of magnitude speedup if many CPUs are available for parallel runs of EVC. Efficient translation from EUFM to CNF—by producing more ITEs, and merging ITE-trees with 2 levels of their leaves—resulted in an additional 32× speedup.

1. Introduction

This paper presents a method for automatic formal verification of out-of-order superscalar processors with register renaming [38][81], and studies the potential for additional speedup if extensive computing resources are available. Currently, companies use compute-farms with thousands of CPUs to extensively test new microprocessors by binary simulation [36], without a guarantee of complete correctness. In our earlier work on applying the logic of Equality with Uninterpreted Functions and Memories (EUFM) [20]—see Sect. 2.1—to formal verification of pipelined and superscalar processors with in-order execution, we imposed some simple restrictions [92][93] on the high-level description style, resulting in correctness formulas where most of the word-level values appear only in positive equality comparisons. That allowed us to exploit the property that such word-level values can be treated as distinct constants, thus significantly pruning the solution space and achieving orders of magnitude speedup, while still performing complete formal verification; we call this property Positive Equality. Those restrictions, together with techniques to model multicycle functional units, exceptions, and branch prediction [94], allowed our tool flow [96] to be used at Motorola [53] to formally verify a model of the M•CORE processor, and to detect 3 bugs, as well as corner cases that were not fully implemented. However, the automatic formal verification of out-of-order processors with register renaming, as well as a Reorder Buffer and Reservation Stations that are completely implemented and instantiated, remains a challenge.

Previous methods for formal verification of out-of-order processors require extensive manual work: to abstract all hardware structures; to reduce the complexity of the proof—by setting up an inductive argument over the number of Reorder Buffer entries [5][40][47][48][55], or by applying symmetry reductions to decrease the number of Reorder Buffer entries, Reservation Stations, and registers considered in the proof [44][66][67], or by deriving rewriting rules that are based on the structure of the correctness formulas and are used to simplify those formulas [40][97]; to define an abstraction function—mapping an implementation state to an equivalent specification state—by extending the Reorder Buffer with auxiliary state used to simplify the expressions produced by the abstraction function [4][40][44][54][97], or by defining an intermediate abstraction to bridge the gap between the implementation and the specification [11][40][41][47][48][77][78]; and to define and prove many lemmas and theorems [40][77][78].

In this paper, the formal verification is done with an extended version of our tool flow [96] that was applied at Motorola, and consists of: 1) the term-level symbolic simulator TLSim, used to symbolically simulate the high-level implementation and specification processors, and produce an EUFM correctness formula; 2) an improved version of the decision procedure EVC that exploits Positive Equality and other optimizations to translate the EUFM correctness formula to a satisfiability-equivalent Boolean formula; and 3) an efficient SAT-solver. The tool flow was also used in an advanced computer architecture course [99], where students designed and formally verified single-issue pipelined processors, as well as extensions with exceptions and branch prediction, and dual-issue superscalar implementations.

The contributions of this paper include: 1) the definition of invariant constraints that are necessary for formal verification of out-of-order superscalar processors with completely implemented Reorder Buffer and Reservation Stations; 2) the use of automatically generated case splits, based on the invariant constraints, to decompose the proof into many simpler proofs that can be discharged efficiently by exploiting Positive Equality, and result in orders of magnitude speedup, compared to monolithic formal verification; and 3) a new translation from propositional logic to Conjunctive Normal Form (CNF) [45], uniquely targeted to the structure of the Boolean correctness formulas from such processors, and resulting in an additional 32× speedup. These contributions make possible a formal verification method that contrasts with previous approaches, where the Reorder Buffer and Reservation Stations are manually abstracted, and the user has to set up an inductive proof over the number of Reorder Buffer entries, and possibly manually apply symmetry reductions to decrease the number of Reservation Stations. Furthermore, and also in contrast to previous approaches, the Reorder Buffer is not extended with auxiliary state in order to simplify the expressions resulting from the abstraction function; significant additional speedup can be expected with such approaches, at the cost of extra manual work.

2. Background

2.1 Translation from EUFM to Propositional Logic

The syntax of EUFM [20] includes terms and formulas. Terms are used to abstract word-level values of data, register identifiers, memory addresses, and the entire states of memory arrays. A term can be an Uninterpreted Function (UF) applied to a list of argument terms, a term variable, or an ITE operator selecting between two argument terms based on a controlling formula, such that ITE(formula, term1, term2) will evaluate to term1 if formula = true, and to term2 if formula = false. The syntax for terms can be extended to model memories by means of the interpreted functions read and write [20][95] that satisfy the forwarding property of the memory semantics—that a read gets the data value written by the most recent write to the same address, or the value from the initial memory state otherwise. Formulas are used to model the control path of a microprocessor, and to express the correctness condition. A formula can be an Uninterpreted Predicate (UP) applied to a list of argument terms, a Boolean variable, an ITE operator selecting between two argument formulas based on a controlling formula, or an equation (equality comparison) of two terms. Formulas can be negated and combined by Boolean connectives. We will refer to both terms and formulas as expressions. UFs and UPs are used to abstract functional units by replacing them with “black boxes” that satisfy only the property of functional consistency—that equal inputs to the UF (UP) produce equal output values.

To check an EUFM formula for validity, we can use a specialized decision procedure [30][84], or we can translate the EUFM formula to an equivalent Boolean formula that has to be a tautology in order for the EUFM formula to be valid. The second approach allows us to benefit from the recent tremendous advances in SAT-solvers—e.g., [34][68][74] (see [59][98] for comparative studies, and [12][50][107] for surveys)—and is used in the current paper. Restrictions on the style for describing high-level processors [92][93] reduced the number of terms that appear in both positive and negated equations (called g-terms, for general terms), and increased the number of terms that appear only in positive equations (called p-terms, for positive terms). The property of Positive Equality [92][93] allows us to treat syntactically different p-terms as not equal when evaluating the validity of an EUFM formula, thus achieving dramatic simplifications and orders of magnitude speedup (see [18] for a correctness proof). However, equations between g-term variables can be either true or false, and can be encoded with Boolean variables [33][72][100], by accounting for the transitivity property of equality [19].

When translating an EUFM formula to an equivalent Boolean formula, applications of the same UF or UP can be eliminated with nested ITEs [93]. For example, if f(a1, b1), f(a2, b2), and f(a3, b3) are three applications of UF f, where a1, b1, a2, b2, a3, and b3 are terms, then the first application will be eliminated with a new term variable c1, the second with ITE((a2 = a1) ∧ (b2 = b1), c1, c2), where c2 is a new term variable, and the third with ITE((a3 = a1) ∧ (b3 = b1), c1, ITE((a3 = a2) ∧ (b3 = b2), c2, c3)), where c3 is a new term variable. That is, the second, third, and any subsequent applications of the UF are eliminated with ITE-chains that enforce functional consistency. UPs are eliminated in the same way, but using new Boolean variables instead of new term variables. This method for eliminating UFs and UPs by enforcing functional consistency is used in [54][55][79][96]. Alternatively, functional consistency can be enforced with Ackermann constraints [1]—the three applications of the UF will be replaced with the new term variables c1, c2, and c3; then, the functional consistency of the second application of the UF with respect to the first will be enforced by extending the resulting formula with the constraint (a2 = a1) ∧ (b2 = b1) ⇒ (c2 = c1), with such constraints added for each pair of applications of that UF. This method for enforcing functional consistency is used in [8][46][60][72][88][106], but does not result in ITE-chains, and so will not benefit from the CNF translation in Sect. 2.3. A general case of ITE-chains are ITE-trees, where the then-expressions can be ITEs as well. We will call leaves of an ITE-tree its inputs that appear as then- or else-inputs of the lowest level of ITEs in the tree. ITE-trees also result after eliminating a read from a sequence of writes by accounting for the forwarding property of the memory semantics, and from modeling conditional instruction flow when instructions are not stalled by the control logic.
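The nested-ITE elimination described above can be sketched in a few lines. The following is an illustrative Python rendering (terms represented as strings and tuples is an assumption of this sketch, not EVC's actual data structures): the i-th application of a UF becomes an ITE-chain over fresh term variables c1, c2, ..., guarded by argument-equality conditions against all earlier applications.

```python
def eliminate_uf(applications):
    """Replace the i-th application of a UF (given as its argument tuple)
    with an ITE-chain over fresh term variables that enforces functional
    consistency with all earlier applications.  An ITE is represented as
    ('ITE', condition_string, then_term, else_term)."""
    results = []
    for i, args in enumerate(applications):
        term = f"c{i + 1}"                      # fresh term variable c_{i+1}
        # Wrap from the innermost (most recent earlier application) outward,
        # so the first application ends up as the outermost guard.
        for j in range(i - 1, -1, -1):
            cond = " ∧ ".join(f"({a} = {b})"
                              for a, b in zip(args, applications[j]))
            term = ('ITE', cond, f"c{j + 1}", term)
        results.append(term)
    return results

apps = [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]
r1, r2, r3 = eliminate_uf(apps)
# r1 is "c1"; r2 is ITE((a2 = a1) ∧ (b2 = b1), c1, c2); r3 nests two ITEs,
# exactly as in the example in the text.
```

UPs would be handled identically, with fresh Boolean variables in place of the fresh term variables.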


After the UFs are eliminated, the terms consist of only ITE operators and term variables. In earlier EUFM decision procedures that exploit Positive Equality [54][55][79][96], equations between nested-ITE terms are eliminated by pushing the equations to the ITE leaves, and replacing the original equation with a disjunction of conjunctions of formulas. For example, given terms ITE(c1, a1, a2) and ITE(c2, b1, b2), where c1 and c2 are formulas, and a1, a2, b1, and b2 are term variables, the equation ITE(c1, a1, a2) = ITE(c2, b1, b2) will be replaced with the formula c1 ∧ c2 ∧ (a1 = b1) ∨ c1 ∧ ¬c2 ∧ (a1 = b2) ∨ ¬c1 ∧ c2 ∧ (a2 = b1) ∨ ¬c1 ∧ ¬c2 ∧ (a2 = b2). However, as observed in [102], we can preserve the ITE-tree structure of equation arguments, and replace the equation with ITE(c1, ITE(c2, a1 = b1, a1 = b2), ITE(c2, a2 = b1, a2 = b2)). Furthermore, we can translate the resulting ITE-tree to CNF by merging the constraints for ITEs inside the tree, and representing it with a single set of clauses without intermediate variables for outputs of ITEs inside the tree (see Sect. 2.3). This resulted in up to 420× speedup for processors with in-order execution [102].
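The structure-preserving rewrite from [102] can be sketched recursively. In this illustrative rendering (the tuple representation ('ITE', cond, then, else) and string leaves are assumptions of the sketch), the equation is pushed down one argument at a time, so the result is an ITE-tree whose leaves are equations between term variables rather than a flat disjunction of conjunctions.

```python
def eq_push(t1, t2):
    """Rewrite an equation between nested-ITE terms into an ITE-tree of
    leaf equations, preserving the tree structure of the arguments
    instead of flattening to a disjunction over all leaf pairs."""
    if isinstance(t1, tuple):                     # t1 = ('ITE', c, then, else)
        _, c, a, b = t1
        return ('ITE', c, eq_push(a, t2), eq_push(b, t2))
    if isinstance(t2, tuple):                     # t1 is a leaf; descend into t2
        _, c, a, b = t2
        return ('ITE', c, eq_push(t1, a), eq_push(t1, b))
    return f"({t1} = {t2})"                       # both leaves: a leaf equation

lhs = ('ITE', 'c1', 'a1', 'a2')
rhs = ('ITE', 'c2', 'b1', 'b2')
result = eq_push(lhs, rhs)
# result is ITE(c1, ITE(c2, (a1 = b1), (a1 = b2)),
#                   ITE(c2, (a2 = b1), (a2 = b2))), as in the text.
```

Because the output is again an ITE-tree, it can be handed directly to the merged-clause CNF translation of Sect. 2.3.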

In the extended version of the decision procedure EVC [96] that is used in this paper, the final Boolean formula consists of AND, OR, NOT, and ITE gates. Hashing [93] ensures that there are no duplicate gates; merges an AND having another AND as input into a single AND having as inputs all the inputs of the two gates, except for the output of the merged AND (and similarly for an OR having another OR as input); eliminates duplicate inputs to AND and OR gates; and replaces an AND/OR with a constant if the gate has complemented inputs.

2.2 Conventional Translation from Propositional Logic to CNF

A primary CNF variable is one representing the value of a primary input, i.e., an input of the original Boolean circuit. (Boolean formula and Boolean circuit are used interchangeably in this paper.) An auxiliary CNF variable is one representing the value of a gate output. In general, the translation of Boolean formulas to CNF is exponential. However, by introducing a new CNF variable for the output of every logic gate, and imposing constraints that preserve the function of that gate [87], we get a satisfiability-equivalent [22] CNF representation. Both the size of the resulting CNF and the complexity of the translation procedure are linear in the size of the original Boolean formula. For AND, OR, NOT, and ITE gates, the conventional translation to CNF is as follows (“←” stands for assignment):

o ← AND(i1, i2, ..., in) : (i1 ∨ ¬o) ∧ (i2 ∨ ¬o) ∧ ... ∧ (in ∨ ¬o) ∧ (¬i1 ∨ ¬i2 ∨ ... ∨ ¬in ∨ o)

o ← OR(i1, i2, ..., in) : (¬i1 ∨ o) ∧ (¬i2 ∨ o) ∧ ... ∧ (¬in ∨ o) ∧ (i1 ∨ i2 ∨ ... ∨ in ∨ ¬o)

o ← NOT(i) : (¬i ∨ ¬o) ∧ (i ∨ o)

o ← ITE(i, t, e) : (¬i ∨ ¬t ∨ o) ∧ (¬i ∨ t ∨ ¬o) ∧ (i ∨ ¬e ∨ o) ∧ (i ∨ e ∨ ¬o)
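These four gate translations can be written down mechanically. A minimal sketch in Python, using DIMACS-style signed integers for literals (that encoding is an assumption of this sketch, not part of the paper):

```python
def cnf_and(o, inputs):
    """o <- AND(i1, ..., in): one binary clause per input, plus one long clause."""
    return [[i, -o] for i in inputs] + [[-i for i in inputs] + [o]]

def cnf_or(o, inputs):
    """o <- OR(i1, ..., in)."""
    return [[-i, o] for i in inputs] + [list(inputs) + [-o]]

def cnf_not(o, i):
    """o <- NOT(i)."""
    return [[-i, -o], [i, o]]

def cnf_ite(o, i, t, e):
    """o <- ITE(i, t, e): two clauses for the then-side, two for the else-side."""
    return [[-i, -t, o], [-i, t, -o], [i, -e, o], [i, e, -o]]

# Example: gate 3 <- AND(1, 2) yields (1 ∨ ¬3) ∧ (2 ∨ ¬3) ∧ (¬1 ∨ ¬2 ∨ 3).
```

Note that the number of clauses per gate is linear in the number of gate inputs, which is what makes the overall translation linear in the size of the circuit.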

Instead of explicitly translating the inverters (NOT gates), we can subsume them in their fanout gates [71], by replacing all instances of the CNF variable for the inverter output with the negated CNF variable for the inverter input, thus eliminating the output variable and the 2 clauses for each inverter. The controlling value of an AND gate is false (0), since it produces a false (0) on the gate output, regardless of the values on the other inputs. Similarly, the controlling value of an OR gate is true (1).

2.3 Translation from Propositional Logic to CNF by Merging ITE-Trees

Preserving the ITE-tree structure of equation arguments (see Sect. 2.1) results in Boolean correctness formulas with ITE-trees, where each ITE inside a tree has a fanout count of 1, i.e., drives only one gate, and that gate is another ITE inside the same tree. An ITE-tree can be translated to CNF with a unified set of clauses [102], without intermediate variables for outputs of ITEs inside the tree—see Fig. 1.a. Furthermore, ITE-trees can be merged with 1 level of their AND/OR leaves [102], where each leaf has a fanout count of 1—see Fig. 1.b.
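The unified clause set for an ITE-tree can be generated by enumerating root-to-leaf paths, as illustrated for Fig. 1.a. The sketch below is an assumed rendering (tuples ('ITE', selector, then, else) over DIMACS-style integer variables, not EVC's actual code): for each leaf, two clauses relate the leaf to the tree output o, guarded by the selector literals that falsify the path to that leaf.

```python
def tree_clauses(tree, o, path=()):
    """Translate an ITE-tree into one unified CNF clause set for output o,
    with no intermediate variables for the outputs of internal ITEs.
    `tree` is an int (leaf variable) or ('ITE', c, t, e); `path` collects,
    for each ITE on the way down, the selector literal that makes the
    current branch NOT taken, so every emitted clause is vacuously
    satisfied unless the whole path is active."""
    if isinstance(tree, tuple):
        _, c, t, e = tree
        return (tree_clauses(t, o, path + (-c,)) +   # then-branch: taken when c = true
                tree_clauses(e, o, path + (c,)))     # else-branch: taken when c = false
    blocked = list(path)
    # leaf <-> o along an active path:
    # (blocked ∨ ¬leaf ∨ o) ∧ (blocked ∨ leaf ∨ ¬o)
    return [blocked + [-tree, o], blocked + [tree, -o]]

# The tree of Fig. 1(a): selectors c1=1, c2=2, c3=3; leaves a1=4, a2=5, a3=6, a4=7.
tree = ('ITE', 1, 4, ('ITE', 2, ('ITE', 3, 5, 6), 7))
clauses = tree_clauses(tree, 8)   # two clauses per leaf, eight in total
```

For leaf a1 this produces (¬c1 ∨ ¬a1 ∨ o) ∧ (¬c1 ∨ a1 ∨ ¬o), matching the path clauses listed in Fig. 1.a.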

We can similarly merge other gate groups when translating them to CNF [101]: AND/OR→ITE groups, where an ITE has an AND or OR gate as its then-input, or else-input, or a different AND/OR gate at each of these inputs; and OR/ITE→AND (AND/ITE→OR) groups, where an AND (OR) is driven by an OR (AND), or an ITE. Note that a driven AND (OR) gate may have many OR/ITE (AND/ITE) inputs with a fanout count of 1. Then, we can choose which one to merge by using a variant of the FANIN heuristic [64] for BDD-variable ordering—selecting the input gate with the highest topological level. The motivation is to shorten the longest path for Boolean Constraint Propagation (BCP) from a primary input to the output of the driven AND (OR) gate. Then, if the heuristic is applied to many gate groups, we could significantly shorten many paths for BCP from primary inputs to the output of the Boolean circuit.

Fig. 1. (a) Example ITE-tree, and its translation to CNF with a unified set of clauses without intermediate variables for outputs of ITEs inside the tree; and (b) merging an ITE-tree with 1 level of its AND/OR leaves that have a fanout count of 1. Each ITE-tree is represented with the conjunction of all clauses for paths from the tree leaves to the tree output. ITEs are shown as multiplexors.

3. Implementation Processor to be Formally Verified

The out-of-order implementation processor to be formally verified is shown in Fig. 2. It is modeled after the PowerPC 750 [42], but can execute only register-register ALU instructions. A Fetch Engine supplies up to two instructions to the Register Rename & Dispatch unit on every clock cycle. Each instruction has five fields: a RegWrite bit, indicating whether the instruction will update the Register File; an Opcode; a destination register, Dest; and two source registers, Src1 and Src2. The Register Rename & Dispatch unit issues in program order up to 2 of the instructions supplied by the Fetch Engine, as long as there are available Reorder Buffer (ROB) entries and Reservation Stations. The Reorder Buffer is a FIFO structure, implemented as a shift register (like in the PowerPC 750), that keeps the original program order of instructions currently in execution, and temporarily stores their results until the instructions are completed in program order. The Register Rename & Dispatch unit renames each destination register to a destination tag, DestTag, that is different from those of instructions currently in execution, and is used to identify the result of the instruction. The Register Rename & Dispatch unit also provides the data for each operand of an instruction that is being issued, as long as that operand is already computed and is located either in the Register File or the ROB, or else the destination tag of the most recently issued instruction that is in the ROB and will compute the operand. A Reservation Station is a buffer that temporarily stores instruction information, including the opcode, the two source tags, and operands, until both operands become available and the instruction can be executed by a functional unit. After an instruction is executed, its result is placed on a common data bus, and is forwarded to instructions in other Reservation Stations, to instructions that are being issued by the Register Rename & Dispatch unit, and to the ROB entry waiting for the result. The PowerPC 750 has 6 Reservation Stations, and 6 ROB entries. The Fetch Engine gets instructions from a read-only Instruction Memory.

Fig. 2. Block diagram of the implementation processor.

[Figure 1 content, recovered from extraction: (a) an ITE-tree with selectors c1, c2, c3, leaves a1–a4, and output o. Path clauses:
a1: (¬a1 ∨ ¬c1 ∨ o) ∧ (a1 ∨ ¬c1 ∨ ¬o)
a2: (¬a2 ∨ c1 ∨ ¬c2 ∨ ¬c3 ∨ o) ∧ (a2 ∨ c1 ∨ ¬c2 ∨ ¬c3 ∨ ¬o)
a3: (¬a3 ∨ c1 ∨ ¬c2 ∨ c3 ∨ o) ∧ (a3 ∨ c1 ∨ ¬c2 ∨ c3 ∨ ¬o)
a4: (¬a4 ∨ c1 ∨ c2 ∨ o) ∧ (a4 ∨ c1 ∨ c2 ∨ ¬o)
(b) the same tree with an AND leaf G1 (inputs b1, b2) and an OR leaf G2 (inputs d1, d2) merged in; leaves a2 and a3 are as in (a). Path clauses:
G1: (¬b1 ∨ ¬b2 ∨ ¬c1 ∨ o) ∧ (b1 ∨ ¬c1 ∨ ¬o) ∧ (b2 ∨ ¬c1 ∨ ¬o)
G2: (¬d1 ∨ c1 ∨ c2 ∨ o) ∧ (¬d2 ∨ c1 ∨ c2 ∨ o) ∧ (d1 ∨ d2 ∨ c1 ∨ c2 ∨ ¬o)]

[Figure 2 content: block diagram showing the Fetch Engine, Register Rename & Dispatch unit, Reservation Stations, ALUs, Common Data Buses, Reorder Buffer (ROB), and Register File.]


Every entry in the ROB has 6 fields: Valid, indicating whether the entry contains a valid instruction; RegWrite, controlling whether the instruction will write its result to the Register File; ValidResult, indicating whether the instruction's result has been computed, in which case it is stored in field Result of the same ROB entry; and Dest and DestTag, the destination register and its renaming tag, respectively. Instructions are executed out of program order, as soon as both operands of an instruction become available in the Reservation Station allocated for the instruction. Each Reservation Station has 8 fields: Valid, indicating whether the Reservation Station contains a valid instruction; ValidData1 and ValidData2, indicating whether the first and second data operands, respectively, are available in fields Data1 and Data2; the source tags for the data operands, SrcTag1 and SrcTag2, respectively; and the destination tag, DestTag, that will be placed on a common data bus together with the result, and will be used to forward the result to the ROB entry with the same destination tag, and to waiting Reservation Stations.

Up to 2 of the 2 oldest instructions are retired in program order on every clock cycle, i.e., are removed from the beginning of the ROB and their results are written to the Register File, as long as those ROB entries are valid, their RegWrite bits are true, and their results have been computed (bits ValidResult are true).

The user-visible state consists of the PC and Register File. The specification processor is non-pipelined and executes one instruction per clock cycle by fetching the instruction from the read-only Instruction Memory, incrementing the PC, computing the ALU result, and writing it to the instruction's destination register in the Register File if the instruction's RegWrite bit is true.

The ROB, the Reservation Stations, the Dispatch logic, the logic for renaming source registers of dispatched instructions, and the Retirement logic are completely implemented. The block that renames destination registers of dispatched instructions is abstracted with a generator of arbitrary values [94], producing new term variables. Constraints are imposed that each such term variable, abstracting a destination tag of a dispatched instruction, be different from term variables abstracting other destination tags in the execution engine, as discussed next.

4. Constraints for the Formal Verification

4.1 Necessary Constraints for the Abstracted Block That Renames Destination Registers

Let FE1 and FE2 be the first and second instruction supplied by the Fetch Engine, and let FE1.DestTag and FE2.DestTag be the renaming tags for the destination registers of these instructions. Let ROBi be the ith entry in the ROB, and let ROBi.Valid and ROBi.DestTag be its Valid bit and its destination tag that renames the instruction's destination register. Then, since the block that renames destination registers is abstracted with a generator of arbitrary values, we need to impose the constraints that the destination tag of a newly dispatched instruction be different from the destination tags of valid instructions in the ROB:

ROBi.Valid ⇒ ¬(ROBi.DestTag = FE1.DestTag), i = 1, ..., N (1)

ROBi.Valid ⇒ ¬(ROBi.DestTag = FE2.DestTag), i = 1, ..., N (2)

where N is the number of ROB entries. In the case of processors that can dispatch more instructions per clock cycle, such N constraints have to be imposed for each dispatched instruction.

Similarly, the verification required the constraint that the destination tags of both dispatched instructions be different:

¬(FE1.DestTag = FE2.DestTag) (3)

Note that constraints (1), (2), and (3) do not require an invariant check, i.e., they are assumed to be properties of the abstracted block for renaming destination registers. An actual implementation of that block has to be formally verified to satisfy these properties.
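For illustration, the families of constraints (1)-(3) can be enumerated programmatically. The sketch below simply produces the constraint expressions as strings; the signal names follow the paper's notation, but the function itself is hypothetical and not part of the EVC tool flow.

```python
def dest_tag_constraints(n_rob, dispatched=("FE1", "FE2")):
    """Enumerate the tag-distinctness constraints for an N-entry ROB and a
    tuple of simultaneously dispatched instructions (default: FE1, FE2)."""
    constraints = []
    # (1), (2): a newly generated tag differs from every valid in-flight tag.
    for fe in dispatched:
        for i in range(1, n_rob + 1):
            constraints.append(
                f"ROB{i}.Valid ⇒ ¬(ROB{i}.DestTag = {fe}.DestTag)")
    # (3): tags of simultaneously dispatched instructions differ pairwise.
    for a in range(len(dispatched)):
        for b in range(a + 1, len(dispatched)):
            constraints.append(
                f"¬({dispatched[a]}.DestTag = {dispatched[b]}.DestTag)")
    return constraints

cs = dest_tag_constraints(6)   # PowerPC 750 model: N = 6 ROB entries
# 2 * 6 constraints of forms (1)-(2), plus 1 constraint of form (3): 13 total
```

This makes the scaling explicit: a processor dispatching d instructions into an N-entry ROB needs d*N constraints of forms (1)-(2) plus d(d-1)/2 of form (3).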

4.2 Necessary Invariant Constraints

The ROB was implemented as a shift register (like in the PowerPC 750 [42]), such that instructions are removed only from the beginning of the shift register, the remaining valid entries are shifted to fill the emptied slots, and new instructions are placed from the resulting first free position. Hence, if an entry is free (i.e., not valid), then so will be all subsequent entries, and the required invariant constraints for the ROB were:

¬ROBi.Valid ⇒ ¬ROBj.Valid for all j = i+1, ..., N, i = 1, ..., N−1 (4)

Note that constraints (4) imply that if an entry in the ROB is valid, then so are all preceding entries:

ROBi.Valid ⇒ ROBj.Valid for all j = 1, ..., i−1, i = 2, ..., N (4′)

although such invariant constraints were not used.

Since each destination register is renamed to a unique destination tag at dispatch, every pair of valid entries in the ROB should have different destination tags:

ROBi.Valid ∧ ROBj.Valid ⇒ ¬(ROBi.DestTag = ROBj.DestTag), i < j, i, j = 1, ..., N (5′)

Based on (4′), if ROBj.Valid is true then ROBi.Valid is true, for i < j, and we can rewrite (5′) as:

ROBj.Valid ⇒ ¬(ROBi.DestTag = ROBj.DestTag), i < j, i, j = 1, ..., N (5)

which are the invariant constraints that were used. Similarly, since the destination tags are also kept in the Reservation Stations to identify the result when it is placed on a common data bus, every pair of Reservation Stations RSi and RSj that contain valid information should have different destination tags:

RSi.Valid ∧ RSj.Valid ⇒ ¬(RSi.DestTag = RSj.DestTag), i < j, i, j = 1, ..., K (6)

where K is the number of Reservation Stations.

Since instructions that write to the Register File, i.e., whose RegWrite bit is true, are allocated both an ROB entry and a Reservation Station at dispatch, every valid instruction in the ROB with a RegWrite bit of true and a ValidResult bit of false (i.e., the result is not yet available in the Result field of that ROB entry) should have a pending update from a Reservation Station. That Reservation Station should contain valid information and a destination tag equal to the destination tag of that ROB entry. Additionally, if the first ROB entry (containing the oldest instruction in execution) does not have its result computed, then the Reservation Station that will produce the result should have both of its data operands ready, as there are no older instructions that could produce those operands; this extra constraint is needed to avoid deadlock. Hence:

ROB1.Valid ∧ ROB1.RegWrite ∧ ¬ROB1.ValidResult
⇒ ∨_{j=1}^{K} [RSj.Valid ∧ (RSj.DestTag = ROB1.DestTag) ∧ RSj.ValidData1 ∧ RSj.ValidData2] (7′)

ROBi.Valid ∧ ROBi.RegWrite ∧ ¬ROBi.ValidResult
⇒ ∨_{j=1}^{K} [RSj.Valid ∧ (RSj.DestTag = ROBi.DestTag)], i = 2, ..., N (7)

The next invariant constraints avoid circular data dependencies between Reservation Stations, e.g., Reservation Station k waiting for an operand to be produced by Reservation Station l, which is waiting for an operand to be produced by Reservation Station k, a scenario leading to deadlock, as also noted by McMillan [66]. Hence, the invariant constraints state that if a Reservation Station is waiting for an operand, it will be produced by a Reservation Station corresponding to an earlier ROB entry:

RSi.Valid ∧ ¬RSi.ValidData1
⇒ ∨_{j=2}^{N} [ROBj.Valid ∧ ROBj.RegWrite ∧ ¬ROBj.ValidResult
∧ (RSi.DestTag = ROBj.DestTag) ∧ RSi.data1_from_before_ROBj], i = 1, ..., K (8)

where RSi.data1_from_before_ROBj is the condition that ROBj is preceded by an ROB entry that has an earlier index and is waiting for the same result as the first data operand of Reservation Station i, i.e., for the result with destination tag equal to RSi.SrcTag1:

RSi.data1_from_before_ROBj ← ∨_{k=1}^{j-1} [ROBk.Valid ∧ ROBk.RegWrite ∧ ¬ROBk.ValidResult
∧ (RSi.SrcTag1 = ROBk.DestTag)], i = 1, ..., K, j = 2, ..., N (9)

where “←” stands for assignment. In constraints (8), index j starts from 2, since constraint (7′) guarantees that if a Reservation Station will produce the result for ROB1, then that Reservation Station has both of its data operands ready. Constraints like (8) and (9) were also imposed for the second data operand of each Reservation Station.

The next invariant constraints state that if a Reservation Station contains valid information, then there is an ROB entry waiting for the result from that Reservation Station:

RSi.Valid
⇒ ∨_{j=1}^{N} [ROBj.Valid ∧ ROBj.RegWrite ∧ ¬ROBj.ValidResult
∧ (RSi.DestTag = ROBj.DestTag)], i = 1, ..., K (10)
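To make the shape of constraints (5), (6), (7), and (10) concrete, here is a small executable sketch; the tuple-based state layout and the helper names are assumptions of this illustration, not the actual term-level model used in the verification:

```python
# Each ROB entry: (valid, regwrite, validresult, desttag)
# Each RS entry:  (valid, desttag)
ROB = [(True, True, False, 3), (True, True, True, 5), (False, False, False, 0)]
RS  = [(True, 3), (False, 0)]

def c5(rob):
    # (5): valid ROB entries carry pairwise-different destination tags
    tags = [e[3] for e in rob if e[0]]
    return len(tags) == len(set(tags))

def c6(rs):
    # (6): valid Reservation Stations carry pairwise-different tags
    tags = [t for (v, t) in rs if v]
    return len(tags) == len(set(tags))

def c7(rob, rs):
    # (7)/(7'): every valid ROB entry with RegWrite set and no result yet
    # has a pending update from some valid Reservation Station
    return all(any(v and t == e[3] for (v, t) in rs)
               for e in rob if e[0] and e[1] and not e[2])

def c10(rob, rs):
    # (10): every valid Reservation Station has an ROB entry waiting for it
    return all(any(e[0] and e[1] and not e[2] and e[3] == t for e in rob)
               for (v, t) in rs if v)

assert c5(ROB) and c6(RS) and c7(ROB, RS) and c10(ROB, RS)
```

In the actual tool flow these are not runtime checks but formulas whose invariance is proved once for the implementation.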


5. Automatic Case Splits

One of the main reasons for the complexity of automatic formal verification of out-of-order processors with a Reorder Buffer, register renaming, and Reservation Stations is the large number of matchings between ROB entries waiting for results and Reservation Stations that could produce those results. If the number of ROB entries, N, is greater than or equal to the number of Reservation Stations, K, then the maximum number of ROB entries that can be waiting for results is K, and the number of matchings will be K!, where every matching includes the cases that each of those K ROB entries is either invalid (bit Valid is false), or will not write to the Register File (bit RegWrite is false), or has its result computed (bit ValidResult is true), or there is a specific Reservation Station that has valid information and will produce the result for that ROB entry. Hence, a possible way to automatically case-split the formal verification is to spawn multiple runs of the tool flow, each handling one of the K! cases, assuming enough computing resources are available, as can be expected in an industrial environment.

To explain the above decomposition in terms of the constraints from Sect. 4, we will be using case-splitting expressions that are variants of constraints (7′) and (7) with just one of the K disjuncts to the right of the implication. That is, a variant of (7′) will be used to restrict ROB1 to be either invalid, or have a RegWrite bit of false, or have its result computed, or map to a specific one of the K Reservation Stations. Similarly, a variant of (7) will be used to restrict each of ROB2 through ROBN to be either invalid, or have a RegWrite bit of false, or have its result computed, or map to a specific Reservation Station that is different from those assigned to previous ROB entries. Furthermore, for each matching between ROB entries and Reservation Stations, we can simplify each of the K constraints (8) and each of the K constraints (10) to a variant with only one of the disjuncts to the right of the implication, i.e., the disjunct corresponding to the ROB entry whose destination tag equals that of the given Reservation Station. Each of the other disjuncts in an instance of (8) or (10) will simplify to false, since valid ROB entries have different destination tags, and so there can be only one ROB entry whose destination tag equals that of a Reservation Station. Note that all possible matchings between ROB entries and Reservation Stations can be enumerated automatically, and if an automatic tool flow can prove correctness for each of the matchings, then that would imply correctness over the entire solution space.
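The enumeration of the K! matchings can be sketched as follows; the constraint-rendering helper and the signal name Active are hypothetical, used only to illustrate the case-splitting expressions described above:

```python
from itertools import permutations
from math import factorial

N, K = 6, 6  # ROB entries and Reservation Stations, as in Sect. 7

# One case per injective matching of the K potentially-waiting ROB
# entries to distinct Reservation Stations; each case also covers the
# "invalid / no RegWrite / result already computed" alternatives for
# every entry, as described above.
cases = list(permutations(range(K)))
assert len(cases) == factorial(K)  # 720 runs for K = 6

def case_constraint(matching):
    # A hypothetical textual rendering of one case-splitting expression:
    # ROB entry i is either inactive or matched to RS matching[i] only.
    return " ∧ ".join(
        f"(¬ROB{i+1}.Active ∨ (RS{j+1}.Valid ∧ RS{j+1}.DestTag = ROB{i+1}.DestTag))"
        for i, j in enumerate(matching))
```

Each case would then be conjoined with the verification condition and handed to a separate run of the tool flow.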

6. Merging ITE-Trees with 2 Levels of Leaves

ITE-trees can be further merged with 2 levels of their leaves (see Fig. 3), such that the merged leaves have a fanout count of 1.

Fig. 3. Merging an ITE-tree with 2 levels of its leaves, where each merged leaf has a fanout count of 1: (a) ITE→AND and OR→AND groups as the 2 levels of merged leaves; and (b) ITE→OR and AND→OR groups as the 2 levels of merged leaves.

[Fig. 3 diagrams (a) and (b): an ITE-tree controlled by c1, c2, c3 selects among merged leaf groups; in (a), G1 = AND(b1, G3) with G3 = ITE(c4, a1, a2), and G2 = AND(b4, G4) with G4 = OR(d1, d2); in (b), G5 = OR(b1, G7) with G7 = ITE(c4, a1, a2), and G6 = OR(b4, G8) with G8 = AND(d1, d2). The clauses generated for each path to the output o are:]

(a)
Path from G3→G1 to o: (¬a1 ∨ ¬c4 ∨ ¬b1 ∨ ¬c1 ∨ o) ∧ (a1 ∨ ¬c4 ∨ ¬c1 ∨ ¬o) ∧ (¬a2 ∨ c4 ∨ ¬b1 ∨ ¬c1 ∨ o) ∧ (a2 ∨ c4 ∨ ¬c1 ∨ ¬o) ∧ (b1 ∨ ¬c1 ∨ ¬o)
Path from b2 to o: (¬b2 ∨ c1 ∨ ¬c2 ∨ ¬c3 ∨ o) ∧ (b2 ∨ c1 ∨ ¬c2 ∨ ¬c3 ∨ ¬o)
Path from b3 to o: (¬b3 ∨ c1 ∨ ¬c2 ∨ c3 ∨ o) ∧ (b3 ∨ c1 ∨ ¬c2 ∨ c3 ∨ ¬o)
Path from G4→G2 to o: (d1 ∨ d2 ∨ c1 ∨ c2 ∨ ¬o) ∧ (¬d1 ∨ ¬b4 ∨ c1 ∨ c2 ∨ o) ∧ (¬d2 ∨ ¬b4 ∨ c1 ∨ c2 ∨ o) ∧ (b4 ∨ c1 ∨ c2 ∨ ¬o)

(b)
Path from G7→G5 to o: (¬a1 ∨ ¬c4 ∨ ¬c1 ∨ o) ∧ (a1 ∨ ¬c4 ∨ b1 ∨ ¬c1 ∨ ¬o) ∧ (¬a2 ∨ c4 ∨ ¬c1 ∨ o) ∧ (a2 ∨ c4 ∨ b1 ∨ ¬c1 ∨ ¬o) ∧ (¬b1 ∨ ¬c1 ∨ o)
Path from b2 to o: (¬b2 ∨ c1 ∨ ¬c2 ∨ ¬c3 ∨ o) ∧ (b2 ∨ c1 ∨ ¬c2 ∨ ¬c3 ∨ ¬o)
Path from b3 to o: (¬b3 ∨ c1 ∨ ¬c2 ∨ c3 ∨ o) ∧ (b3 ∨ c1 ∨ ¬c2 ∨ c3 ∨ ¬o)
Path from G8→G6 to o: (¬d1 ∨ ¬d2 ∨ c1 ∨ c2 ∨ o) ∧ (d1 ∨ b4 ∨ c1 ∨ c2 ∨ ¬o) ∧ (d2 ∨ b4 ∨ c1 ∨ c2 ∨ ¬o) ∧ (¬b4 ∨ c1 ∨ c2 ∨ o)


Any merged first-level leaves are either AND or OR gates (if they were ITEs with a fanout count of 1, they would have been part of the ITE-tree). Due to the hashing scheme for formulas in the decision procedure EVC (see Sect. 2.1), any merged second-level leaf is either an ITE, or an OR driving an AND (G4→G2 in Fig. 3.a), or an AND driving an OR (G8→G6 in Fig. 3.b). If a first-level AND/OR is driven by many gates with a fanout count of 1, the gate merged was the one with the highest topological level (see Sect. 2.3). Note that if G3 supplies the AND gate G1 with that gate's controlling value of false in Fig. 3.a, i.e., either a1 is false and is selected to appear at the output of G3 by c4 being true, or a2 is false and is selected to appear at the output of G3 by c4 being false, then the value of b1 does not affect the value of G1, and so ¬b1 does not appear in the clauses (a1 ∨ ¬c4 ∨ ¬c1 ∨ ¬o) and (a2 ∨ c4 ∨ ¬c1 ∨ ¬o). Similarly for G4→G2 and b4 in Fig. 3.a; for G7→G5 and b1 in Fig. 3.b; and for G8→G6 and b4 in Fig. 3.b.
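The per-path clause pattern can be illustrated on a minimal 2-deep ITE chain, o = ITE(c1, b1, ITE(c2, b2, b3)). The sketch below is a toy illustration (not the EVC implementation): it checks exhaustively that the path clauses, written without intermediate CNF variables, are satisfied exactly when o equals the value of the ITE chain.

```python
from itertools import product

def ite(c, t, e):
    return t if c else e

# Path clauses for o = ITE(c1, b1, ITE(c2, b2, b3)), one pair per
# root-to-leaf path, tying the leaf to o under the selecting condition:
#   path to b1 (c1 = 1):         (¬c1 ∨ ¬b1 ∨ o) ∧ (¬c1 ∨ b1 ∨ ¬o)
#   path to b2 (c1 = 0, c2 = 1): (c1 ∨ ¬c2 ∨ ¬b2 ∨ o) ∧ (c1 ∨ ¬c2 ∨ b2 ∨ ¬o)
#   path to b3 (c1 = 0, c2 = 0): (c1 ∨ c2 ∨ ¬b3 ∨ o) ∧ (c1 ∨ c2 ∨ b3 ∨ ¬o)
def path_clauses_hold(c1, c2, b1, b2, b3, o):
    return ((not c1 or not b1 or o) and (not c1 or b1 or not o) and
            (c1 or not c2 or not b2 or o) and (c1 or not c2 or b2 or not o) and
            (c1 or c2 or not b3 or o) and (c1 or c2 or b3 or not o))

# The clause set is satisfied exactly when o equals the ITE-chain value:
# selecting one path satisfies the clauses of all other paths at once.
for c1, c2, b1, b2, b3, o in product([False, True], repeat=6):
    assert path_clauses_hold(c1, c2, b1, b2, b3, o) == \
           (o == ite(c1, b1, ite(c2, b2, b3)))
```

Note how, as soon as a controlling signal deselects a path, that path's clauses are satisfied regardless of its leaf values, which is the unobservability effect discussed in Sect. 8.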

7. Results

The experiments were conducted on a Dell OptiPlex GX260 with a 3.06-GHz Intel Pentium 4 (512-KB on-chip L2 cache) and 2 GB of physical memory, running Red Hat Linux 9.0. The tool flow, consisting of the term-level symbolic simulator TLSim [96] and an extended version of the decision procedure EVC [96], was combined with the SAT-solver siege_v4 [74], an improved version of siege_v0, one of the top performers in the 2003 SAT competition [59]. In EVC, equations between term variables were encoded with new eij Boolean variables [33], while transitivity of equality was enforced as described in [19]. The abstraction function, mapping an implementation state to an equivalent specification state, was computed by using the ideas in controlled flushing [21]: the instructions in the ROB were not allowed to advance, thus reducing the ambiguity of the instruction flow; the ALUs were made to deterministically compute the results of Reservation Stations with ready operands; and the ith ROB entry was allowed to write its result to the Register File in the ith cycle of the abstraction function, i.e., in the latest cycle when that result could be computed. The processor used in the experiments had 6 ROB entries, 6 Reservation Stations, 2 data operands per instruction, and could issue and retire up to 2 instructions per cycle, the same numbers as in the PowerPC 750 [42]. The invariance check of all constraints took 12 seconds if they were verified in sequence, one constraint at a time, but much longer if verified monolithically. Most of the bugs made when designing the correct processor were detected by just checking invariance of the constraints. The discussion next is about the formal verification of safety of the correct model, i.e., that 1 step of this dual-issue model corresponds to 0, 1, or 2 steps of its specification. Safety was checked with the commutative correctness diagram used in [20][93][94][98].

Table 1 shows the results from formal verification with the necessary invariant constraints. The "old" translation to CNF does not preserve the ITE-tree structure of equation arguments when eliminating equations, but uses a disjunction of conjunctions. "# Runs" is the number of matchings when some or all of the ROB entries are matched with Reservation Stations to produce the results for those ROB entries, e.g., 720 runs (6!) when each of the 6 ROB entries is matched with a different one of the 6 Reservation Stations. With the old translation to CNF, and without case splits to enumerate matchings, siege_v4 did not finish in 168 hours (1 week). Using the old translation, and matching all 6 ROB entries with Reservation Stations for a total of 720 runs, resulted in an average SAT time of 644 seconds per run. Merging only ITE-trees [102] reduced the average SAT time to 181 seconds per run. Merging ITE-trees with one level of their leaves [102] further reduced the average SAT time to 102 seconds per run. And merging ITE-trees with 2 levels of leaves (see Sect. 6) resulted in an average SAT time of 20 seconds per run, i.e., a speedup of 5× relative to translation by merging ITE-trees with 1 level of leaves, 9× relative to translation by merging only ITE-trees, and 32× relative to the old translation from EUFM to CNF. Furthermore, if 720 CPUs are available for parallel runs of the tool flow, we can complete all the runs in 55 seconds (the maximum time for a run), resulting in 2 orders of magnitude speedup relative to the sequential execution time of 14,400 seconds for all runs on 1 CPU. The CNF formulas obtained with the old translation are available as [103].
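The quoted speedups and totals can be cross-checked with a few lines of arithmetic on the per-run averages above (times in seconds):

```python
runs = 720
avg = {"old": 644, "ite": 181, "ite+1": 102, "ite+2": 20}

# Speedups of the final translation (merging ITE-trees with 2 levels
# of leaves) relative to the earlier translations, as quoted in the text
assert round(avg["ite+1"] / avg["ite+2"]) == 5   # 5x vs 1 level of leaves
assert round(avg["ite"] / avg["ite+2"]) == 9     # 9x vs ITE-trees only
assert round(avg["old"] / avg["ite+2"]) == 32    # 32x vs old translation

# Sequential vs fully parallel execution of the 720 runs
assert runs * avg["ite+2"] == 14_400             # total sequential time
assert 14_400 / 55 > 100                         # ~2 orders of magnitude
```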

Additional experiments were conducted to determine the optimum number of ROB entries to match with Reservation Stations in order to minimize the SAT time with sequential runs. That number was found to be 4, requiring 360 runs, and resulting in an average time per run of 39 seconds, a total sequential time of 14,198 seconds, and a maximum time per run of 104 seconds. (The table of results is not shown for lack of space.) Further experiments were used to explore the benefit from additional invariant constraints (also not presented for lack of space) that are not necessary for the formal verification, but reduce the solution space. They did not make a difference if the formal verification was decomposed sufficiently by matching 4, 5, or 6 ROB entries with Reservation Stations, although they resulted in an order of magnitude speedup if fewer ROB entries were matched with Reservation Stations. Without matching any of the ROB entries with Reservation Stations, and so executing a single monolithic run, the new translation to CNF (merging ITE-trees with 2 levels of leaves) required more than 24 hours, regardless of the use of extra invariant constraints.



The speedup from the new translation to CNF is due to the structure of the given class of Boolean formulas. When formally verifying models with in-order execution, the Boolean formulas do not have ITE-trees where many first- and second-level leaves have a fanout count of 1, and so the new translation could not be applied many times.

Merging ITE-trees with their first-level leaves that have a fanout count greater than 1 slowed the SAT-solving by up to 40× [102]. Merging ITE-trees with 3 or more levels of leaves, each with a fanout count of 1, resulted in worse performance compared to merging ITE-trees with 2 levels of leaves, indicating that the optimal performance for the given class of Boolean formulas is achieved by merging ITE-trees with 2 levels of their leaves.

8. Discussion

The benefits from merging ITE-trees with their leaves include:

1) Reduced variables and clauses, relative to conventional CNF translation where each ITE is translated separately, but possibly increased literals.

2) Reduced solution space: fewer CNF variables allow a SAT-solver to make fewer decisions when evaluating a formula; removing the unimportant variables, representing signals inside ITE-trees, improves the efficiency of the search, allowing a SAT-solver to make decisions based only on important variables that control the branching in ITE-trees, or are leaves of ITE-trees.

3) Reduced BCP: eliminating intermediate variables for outputs of ITEs inside a tree allows a SAT-solver to quickly propagate the value of an ITE-tree leaf to the tree output; if the literals increase, the BCP will also increase, but the benefits from reduced variables and clauses more than offset this in the experiments; BCP takes up to 90% of the SAT time [68].

4) Automatic use of signal unobservability: the clauses introduced for each path in an ITE-tree become satisfied as soon as an ITE-controlling signal selects another path, allowing a SAT-solver more freedom in assigning values to unobservable [25][76] signals.

5) Reduced L2-cache misses: the fewer variables and clauses result in smaller CNF file sizes, and more succinctly represent the solution space, allowing the clauses for the active portion of the search to better fit in the L2 cache; also, the average number of literals per clause increases, and since the literals for a clause are situated in contiguous memory locations, SAT-solvers can better exploit the spatial locality of memory accesses [38]; big CNFs result in high L2-cache miss rates, and thus decrease the performance of SAT-solvers [108].

6) Guiding the SAT-solver branching: each path passing through an ITE-tree and its leaves is due to a different symbolic-execution trace, so that by representing each path as clauses without intermediate CNF variables, we point the SAT-solver toward processing one symbolic-execution trace at a time, and make it easier for the SAT-solver to prune infeasible paths whose clauses contain CNF variables with complemented values.

Sheeran and Stålmarck [82] found that optimal performance of Stålmarck's method [83] on many formulas from formal verification is achieved when the dilemma rule (exploring all possible assignments to a set of Boolean variables, and deducing constraints for the solution space) is applied to either 0 or 1 Boolean variables at a time. The corresponding conclusion from Sect. 7 is that the best strategy was to merge ITE-trees with 2 levels of leaves. The two conclusions indicate that different classes of Boolean formulas require different levels of SAT-solver branching to achieve optimal learning and optimal SAT-solver performance.

Additional speedup can be expected if siege_v4 is extended into an incremental SAT-solver, such as [27][28][105], where constraints (in the current paper, case splits matching ROB entries with Reservation Stations) are added to the CNF formula incrementally, one constraint at a time. For each constraint, if the formula is proved unsatisfiable, the constraint is removed from the clause database, and so are any conflict clauses triggered by that constraint; however, conflict clauses that apply to the solution space without the constraint are kept, so that the effort for deriving them is amortized across the processing of many constraints.

| Translation to CNF | # ROB entries matched with Reservation Stations | # Runs | Boolean variables | CNF variables | CNF clauses | CNF literals | Avg. literals per clause | Avg. SAT time [sec] | Total SAT time [sec], sequential runs | Total SAT time [sec], parallel runs |
|---|---|---|---|---|---|---|---|---|---|---|
| old | 0 | 1 | 323 | 64,627 | 871,053 | 2,497,863 | 2.868 | > 168 h | > 168 h | > 168 h |
| old | 6 | 720 | 323 | 47,136 | 612,784 | 1,757,288 | 2.868 | 644 | 463,680 | 872 |
| merge ITE-trees | 6 | 720 | 323 | 15,909 | 185,572 | 961,451 | 5.181 | 181 | 130,320 | 272 |
| merge ITE-trees with 1 level of leaves | 6 | 720 | 323 | 12,543 | 178,840 | 1,138,421 | 6.366 | 102 | 73,440 (20.4 h) | 223 |
| merge ITE-trees with 2 levels of leaves | 6 | 720 | 323 | 12,410 | 178,574 | 1,141,377 | 6.392 | 20 | 14,400 (4 h) | 55 |

Table 1. Results from formal verification of safety with the necessary invariant constraints. Formula statistics are averages per run.


9. Related Work on SAT and EUFM Decision Procedures

Gupta et al. [35] were the first to implement a circuit-based SAT-solver that uses structural information to identify gates with unobservable outputs and remove the clauses for those gates, as long as the gates remain unobservable. Novikov [69] exploited signal observability when deriving relations between CNF variables, by branching on up to 5 CNF variables, recording the resulting binary values for other variables, but keeping don't-cares for variables that do not get a binary value, and then extracting compact relations that hold in all assignments not leading to a conflict. Other circuit-based SAT-solvers identify signals with equal or complemented values in order to prune the solution space [52][62][70], or use a hybrid representation of Boolean circuits [32][70]: gate-level for the circuit, and CNF for constraints and learnt clauses. Franco et al. [31] present a circuit-based SAT-solver that uses BDDs [17] in its decision heuristic, and was faster than Chaff [68] on formulas from Bounded Model Checking, but slower on formulas from EUFM-based formal verification of processors with in-order execution. Theoretical results about circuit-based SAT algorithms are presented in [2][16]. Hong et al. [39] used don't-cares to minimize BDDs.

Kautz et al. [10][75] used depth-first traversal of a pebbling graph in order to generate a partial branching sequence, defining CNF-variable assignments to be made by a SAT-solver when beginning to process the CNF formula for that graph, thus guiding the search towards a solution. Reda et al. [73] used BDD-variable ordering heuristics to derive a CNF-variable decision order for SAT-solving CNFs of circuits. To reduce the cost of BCP, Bingham and Hu [13] compiled Boolean formulas to programs, and simulated the resulting code with random vectors. Additional code was generated to identify input patterns that will produce the same output value as the current vector, thus pruning the solution space.

Variations of Tseitin's transformations [87] were used in [9][14][26][29][49][71], and Larrabee [57] was the first to apply them to testing of circuits, but none of these authors used transformations for ITE-trees. Kuehlmann et al. [52] represented Boolean circuits in terms of 2-input AND gates and inverters, and transformed groups of 3 connected AND gates, one driven by the other two, into a canonical form by accounting for inverters at gate inputs.

Algebraic simplifications for CNF [7][15][24][58][61][63][65] require long processing times for big formulas. Three of those methods [15][63][65], as well as techniques for CNF variable ordering by minimizing the cut-width [3][104] (see also [23]), were applied to simpler formulas from formal verification of processors [98], but took a long time and did not accelerate the SAT-solving. However, deriving the direct and indirect implications, as well as the extended backward implications [109], between pairs of signals, and adding those implications as 2-literal clauses to the CNFs from conventional translation, resulted in orders of magnitude speedup [6]. Using multiple parallel runs of a SAT-solver with different decision heuristics [80], or with different translations from EUFM to propositional logic [98], and stopping when a run finds a solution, reduced the SAT time.

Burch [21] extended SVC [46], a decision procedure for EUFM, with simplifications that are based on observability don't-cares, but his method depended on manually provided case-splitting expressions, and was computationally expensive even for simple benchmarks. Jones et al. [46], and Levitt and Olukotun [60], devised heuristics that sped up SVC, but these did not scale for complex formulas and were not flexible, as observed by Barrett et al. [8].

Automatic methods for deriving invariant constraints in high-level microprocessors have been proposed [43][56][85][86], but they have not been applied to out-of-order designs with register renaming, as well as a Reorder Buffer and Reservation Stations that are completely implemented and instantiated, and they are likely to run into scaling problems due to the large state spaces of such designs.

Nested ITEs were first used to eliminate uninterpreted functions and uninterpreted predicates in [91], where bit-level functional units were abstracted with read-only instances of an Efficient Memory Model (EMM) [89][90] for behavioral abstraction of memories in symbolic simulation. The EMM has been adopted in verification tools by Innologic Systems [37] and Synopsys [51].

10. Conclusions

The paper studied the potential for automatic decomposition and speedup in formal verification of out-of-order superscalar processors having register renaming, as well as a Reorder Buffer and Reservation Stations that are completely implemented and instantiated. This is in contrast to previous approaches, where these hardware structures are manually abstracted, and the user has to set up an inductive proof over the number of Reorder Buffer entries, and possibly manually apply symmetry reductions to decrease the number of Reservation Stations. Furthermore, and also in contrast to previous approaches, the Reorder Buffer was not extended with auxiliary state in order to simplify the expressions resulting from the abstraction function; significant additional speedup can be expected with such approaches, at the cost of extra manual work. The formal verification was possible due to automatically generated case-splitting



expressions, matching Reorder Buffer entries with Reservation Stations that will compute the data operands for those Reorder Buffer entries, and resulting in orders of magnitude speedup if many CPUs are available for parallel runs of the tool flow. An efficient translation from the logic of EUFM to CNF, by producing more ITEs and merging ITE-trees with 2 levels of their leaves, resulted in an additional 32× speedup. Future work will examine more complex out-of-order processors, and will fine-tune the translation to CNF.

References[1] W. Ackermann, Solvable Cases of the Decision Problem, North-Holland, Amsterdam, 1954.[2] M. Alekhnovich, and A.A. Razborov, “Satisfiability, Branch-Width and Tseitin Tautologies,” Symposium on Foundations of Computer

Science (FOCS ’02), November 2002.[3] F.A. Aloul, I.L. Markov, and K.A. Sakallah, “Faster SAT and Smaller BDDs via Common Function Structure,” International Conference

on Computer-Aided Design, 2001.[4] T. Arons, and A. Pnueli, “Verifying Tomasulo’s Algorithm by Refinement,” 12th International Conference on VLSI Design (VLSI ’99),

June 1999.[5] T.Arons, and A.Pnueli, “A Comparison of Two Verification Methods for Speculative Instruction Execution,” Tools and Algorithms for the

Construction and Analysis of Systems (TACAS ’00), S. Graf, and M. Schwartzbach, eds., LNCS 1785, Springer-Verlag, March–April 2000,pp. 487–502.

[6] R. Arora, and M.S. Hsiao, “Enhancing SAT-Based Equivalence Checking with Static Logic Implications,” High Level Design Validationand Test Workshop (HLDVT ’03), November 2003.

[7] F. Bacchus, and J. Winter, “Effective Preprocessing with Hyper-Resolution and Equality Reduction,” Theory and Applications of Satisfi-ability Testing (SAT ’03), 2003.

[8] C. Barrett, D. Dill, and A. Stump, “Checking Satisfiability of First-Order Formulas by Incremental Translation to SAT,” Computer-AidedVerification (CAV ’02), July 2002.

[9] M. Bauer, D. Brand, M. Fischer, A. Meyer, and M. Paterson, “A Note on Disjunctive Form Tautologies,” SIGACT News, Vol. 4 (1973).[10] P. Beame, H. Kautz, and A. Sabharwal, “Understanding the Power of Clause Learning,” International Joint Conference on Artificial Intel-

ligence (IJCAI ’03), August 2003.[11] S. Berezin, E. Clarke, A. Biere, and Y. Zhu, “Verification of Out-Of-Order Processor Designs Using Model Checking and a Light-Weight

Completion Function,” Journal on Formal Methods in System Design (FMSD), special issue on Microprocessor Verifications, Vol. 20, No.2 (March 2002).

[12] A. Biere, and W. Kunz, “SAT and ATPG: Boolean Engines for Formal Hardware Verification,” International Conference on ComputerAided Design (ICCAD ’02), November 2002, pp. 782–785.

[13] J.D. Bingham, and A.J. Hu, “Semi-Formal Bounded Model Checking,” Computer-Aided Verification (CAV ’02), LNCS 2404, Springer-Verlag, July 2002.

[14] T. Boy de la Tour, “An Optimality Result for Clause Form Translation,” Journal of Symbolic Computation, Vol. 14 (1992), pp. 283–301.[15] R.I. Brafman, “A Simplifier for Propositional Formulas with Many Binary Clauses,” Int’l. Joint Conference on Artificial Intelligence

(IJCAI ’01), 2001.[16] E. Broering, and S.V. Lokam, “Width-Based Algorithms for SAT and CIRCUIT-SAT,” Theory and Applications of Satisfiability Testing

(SAT ’03), May 2003.[17] R.E. Bryant, “Graph-Based Algorithms for Boolean Function Manipulation,” IEEE Transactions on Computers, Vol. C-35, No. 8 (August

1986), pp. 677–691.[18] R.E. Bryant, S. German, and M.N. Velev, “Processor Verification Using Efficient Reductions of the Logic of Uninterpreted Functions to

Propositional Logic,” ACM Transactions on Computational Logic (TOCL), Vol. 2, No. 1 (January 2001), pp. 93–134.[19] R.E. Bryant, and M.N. Velev, “Boolean Satisfiability with Transitivity Constraints,” ACM Transactions on Computational Logic (TOCL),

Vol. 3, No. 4 (October 2002).[20] J.R. Burch, and D.L. Dill, “Automated Verification of Pipelined Microprocessor Control,” Computer-Aided Verification (CAV ’94), LNCS

818, Springer-Verlag, June 1994.[21] J.R. Burch, “Techniques for Verifying Superscalar Microprocessors,” 33rd Design Automation Conference (DAC ’96), June 1996.[22] H.K. Büning, and T. Lettmann, Propositional Logic: Deduction and Algorithms, Cambridge Tracts in Theoretical Computer Science 48,

Cambridge University Press, 1999.[23] E.M. Clarke, and O. Strichman, “A Failed Attempt to Optimize Variable Ordering with Tools for Constraints Solving,” Workshop on Con-

straints in Formal Verification,2002.[24] J.M. Crawford, and L.D. Auton, “Experimental Results on the Crossover Point in Satisfiability Problems,” National Conference on Artifi-

cial Intelligence, 1993.[25] M. Damiani, and G. De Micheli, “Observability Don’t Care Sets and Boolean Relations,” International Conference on Computer-Aided

Design (ICCAD ’90), 1990.[26] E. Eder, “An Implementation of a Theorem Prover Based on the Connection Method,” Artificial Intelligence: Methodology, Systems,

Applications (AIMSA ’84), 1985.[27] N. Eén, and N. Sörensson, “An Extensible SAT-solver,” 6th International Conference on Theory and Applications of Satisfiability Testing

(SAT ’03), May 2003.[28] N. Eén, and N. Sörensson, “Temporal Induction by Incremental SAT Solving,” Workshop on Bounded Model Checking (BMC ’03),

ENTCS, Vol. 89, No. 4, 2003.[29] U. Egly, and T. Rath, “On the Practical Value of Different Definitional Translations to Normal Form,” International Conference on Auto-

mated Deduction (CADE ’96), 1996.[30] J.-C. Filliâtre, S. Owre, H. Rueß, and N. Shankar, “ICS: Integrated Canonizer and Solver,” Computer-Aided Verification (CAV ’01),

G. Berry, H. Comon, and A. Finkel, eds., LNCS 2102, Springer-Verlag, July 2001, pp. 246–249.[31] J. Franco, M. Kouril, J. Schlipf, J. Ward, S. Weaver, M. Dransfield, and W.M. Vanfleet, “SBSAT: A State-Based, BDD-Based Satisfiability

Solver,” 6th International Conference on Theory and Applications of Satisfiability Testing (SAT ’03), May 2003.[32] M.K. Ganai, L. Zhang, P. Ashar, A. Gupta, S. Malik, “Combining Strengths of Circuit-Based and CNF-Based Algorithms for a High-Per-

formance SAT Solver,” 39th Design Automation Conference (DAC ’02), June 2002.[33] A. Goel, K. Sajid, H. Zhou, A. Aziz, and V. Singhal, “BDD Based Procedures for a Theory of Equality with Uninterpreted Functions,”

Computer-Aided Verification (CAV ’98), LNCS 1427, Springer-Verlag, June 1998, pp. 244–255.[34] E. Goldberg, and Y. Novikov, “BerkMin: A Fast and Robust Sat-Solver,” Design, Automation, and Test in Europe (DATE ’02), March

2002, pp. 142–149.[35] A. Gupta, A. Gupta, Z. Yang, and P. Ashar, “Dynamic Detection and Removal of Inactive Clauses in SAT with Application in Image Com-

putation,” 38th Design Automation Conference (DAC ’01), June 2001, pp. 536–541.[36] S. Hangal, and M. O’Connor, “Performance Analysis and Validation of the picoJava Processor,” IEEE Micro, May–June 1999, pp. 62–72.[37] G. Hasteer, Personal communication, February 1999.

11

Page 253: Artificial Intelligence and Mathematics - Rutgers Universityrutcor.rutgers.edu/~amai/aimath04/AcceptedPapers/aimath-I.pdf · Artificial Intelligence and Mathematics January 4-6, 2004



Multi-Agent Dialogue Protocols

Christopher D. Walton ([email protected]) ∗

Centre for Intelligent Systems and their Applications (CISA), Edinburgh, UK

November 28, 2003

Abstract

In this paper we propose a new agent communication language which separates agent dialogue from any specific agent reasoning technology. This language is intended to address a number of perceived shortcomings with the mentalistic model of agent communication on which the FIPA-ACL standard is founded. Our language expresses inter-agent dialogue through the use of agent protocols, and is intended to be independent of the technology used for message delivery. We specify the syntax of our communication language, together with an operational semantics which defines an implementation of the language. Our language specification is derived from process calculus and thus forms a sound basis for the verification of our agent protocols.

1 Introduction

A Multi-Agent System (MAS) may be defined as a collection of agents, which are autonomous and rational components, that interact within an environment [Jen00]. An individual agent of a MAS exhibits intelligent behaviour based on interactions with other agents, the environment, and internal reasoning processes. It is this intelligent behaviour that distinguishes a MAS from a conventional distributed or parallel software system. From this definition it is clear that an essential pragmatic consideration in the construction of a MAS must be a specification for the interactions between individual agents, as agents must interact in order to exhibit intelligent behaviour. A popular basis for this interaction is the theory of rational action by Cohen and Levesque [CL90]. The FIPA-ACL specification [FIP99] recognises this theory by providing a formal semantics for the performatives, expressed in BDI logic [RG98]. However, there is growing dissatisfaction with the mentalistic model of agency as a basis for defining inter-operable agents across different agent platforms [Sin98].

Inter-operability requires that agents built by different organisations, using different software systems, are able to communicate safely with one another in a common language with an agreed semantics. The problem with the BDI model as a basis for inter-operable agents is that, although agents can be defined according to a commonly agreed semantics, it is not generally possible to verify that an agent is acting according to these semantics.

∗This work is sponsored by the UK Engineering and Physical Sciences Research Council (Grant GR/N15764/01) Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration.


This stems from the fact that it is not known how to assign mental states systematically to arbitrary programs. For example, we have no way of knowing whether an agent actually believes a particular fact. For the semantics to be verifiable it would be necessary to have access to an agent's internal mental states, which is not typically possible. This problem is known as the semantic verification problem and is detailed in [Woo00]. In order to avoid the semantic verification problem, a number of alternative semantics for expressing rational agency have been proposed. Two of these approaches are a semantics based on social commitments, and a semantics based on dialogue games. A summary of these approaches, and of other semantic models, is presented in [MC02].

In this paper we do not adopt a specific semantics of rational agency, or define a fixed model of interaction between agents. Our belief is that in a truly heterogeneous agent system we cannot constrain the agents to any particular model. Instead, we define a model of dialogue which separates the rational processes and interactions from the actual dialogue itself. This is accomplished through the adoption of a dialogue protocol which exists at a layer between these processes. This approach has been adopted in the Conversation Policy [GHB99] and Electronic Institutions [ERS+01] formalisms. The definition presented in this paper differs in that our dialogue protocol specifications can be directly executed.

A dialogue protocol allows the semantics of a dialogue to be expressed independently. This approach has some compelling advantages. For example, we can succinctly express the rules of an auction as a dialogue protocol, while the agents participating in this dialogue are free to select their own auction strategies, i.e. dialogue protocols do not compromise the self-interest of the individual agents. Agents can be specified in different languages, using different rational processes, and still participate in the dialogue expressed in the protocol. The only restriction on the autonomy of the agents is that they follow the dialogue protocol, which encodes all the information necessary to participate in the dialogue. It should be noted that dialogue protocols also greatly assist in the design of large MAS, as they impose structure on the agents and co-ordinate tasks between them. They also simplify the design of individual agents, as they separate the task of defining the co-ordination of the agents from the definition of agent behaviours. This separation also permits the refinement and verification of the agent protocol independently from the design of the individual agents.

In this paper we present our approach to defining dialogue protocols suitable for constructing large MAS of inter-operable agents. Our method draws on our earlier experiences with defining Electronic Institutions and borrows its theory from the process calculus domain. In section 2 we present a small language which we use to express dialogue protocols. In section 3 we present a relational operational semantics for evaluating our language which can be implemented in an agent system. Lastly, in section 4 we describe our implementation and discuss future work concerning the verification of our dialogue protocols.

2 The MAP Language

The MAP language is a lightweight dialogue protocol language which provides a replacement for the state-chart representation of protocols found in Electronic Institutions [ERS+01]. Our formalism allows the definition of infinite-state dialogues and the mechanical processing of the resulting dialogue protocols. The semantics of our language are derived from the field of


process calculus, and in particular the Calculus of Communicating Systems (CCS) [Mil89]. We have redefined the core of the Electronic Institutions framework to provide an executable specification, while retaining the concepts of institutions, scenes, and roles.

The division of agent dialogues into scenes is a key concept in our protocol language. A scene can be thought of as a bounded space in which a group of agents interacts on a single task. The use of scenes divides a large protocol into manageable chunks. For example, a negotiation scene may be part of a larger marketplace institution. Scenes also add a measure of security to a protocol, in that agents which are not relevant to the task are excluded from the scene. This can prevent interference with the protocol, and limits the number of exceptions and special cases that must be considered in the design of the protocol. Additional security measures can also be introduced into a scene, such as placing entry and exit conditions on the agents, though we do not deal with these here. However, we do assume that a scene places barrier conditions on the agents, such that a scene cannot begin until all the agents are present, and the agents cannot leave the scene until the dialogue is complete.

The concept of an agent role is also central to our definition of a dialogue protocol. Agents entering a scene assume a fixed role which persists until the end of the scene. For example, a negotiation scene may involve agents with the roles of buyer and seller. The protocol which an agent follows in a dialogue will typically depend on the role of the agent. For example, an agent acting as a seller will typically attempt to maximise profit and will act accordingly in the negotiation. A role also identifies capabilities which the agent must provide. For example, the buyer must have the capability to make buying decisions and to purchase items. Capabilities are related to the rational processes of the agent and are encapsulated by decision procedures in our definition.

S ∈ Scene       ::= n[R, A, P(k)]               (Scene Definition)

P ∈ Protocol    ::= agent(a, r, φ(k)) = op      (Agent Protocol)

op ∈ Operation  ::= α                           (Action)
                  | op1 then op2                (Sequence)
                  | op1 or op2                  (Choice)
                  | op1 par op2                 (Parallel Composition)
                  | waitfor op1 timeout op2     (Iteration)
                  | agent(φ(k))                 (Recursion)

α ∈ Action      ::= ε                           (No Action)
                  | v = p(φ(k))                 (Decision Procedure)
                  | M => agent(φ(2))            (Send)
                  | M <= agent(φ(2))            (Receive)

M ∈ Message     ::= ρ(φ(k))                     (Performative)

φ ∈ Term        ::= v | a | r | c | _

Figure 1: MAP Abstract Syntax.

The abstract syntax of MAP is presented in Figure 1. Agents are uniquely identified by a name a, and have a fixed role r for the duration of the scene. A scene comprises a fixed set


of roles R, a set of participating agents A, and a sequence of protocols P(k). A protocol P can be considered a procedure where a, r, and φ(k) are the arguments. The initial protocol for an agent is specified by setting φ(k) to be empty (i.e. k = 0). Protocols are constructed from operations op, which control the flow of the protocol, and actions α, which have side-effects and can fail. The interface between the protocol and the rational process of the agent is achieved through the invocation of decision procedures p. Interaction between agents is performed by the exchange of messages M which contain performatives ρ. Procedures and performatives are parameterised by terms φ, which are either variables v, agents a, roles r, constants c, or wild-cards _. Variables are bound to terms by unification, which occurs in the invocation of procedures, the receipt of messages, or through recursive calls.

1  GeneralPractitionerScene[%patient, %doctor, !Patient1, !Doctor1,
2
3    agent(!Patient1, %patient) =
4      request(appointment) => agent(_, %doctor) then
5      waitfor
6        (accept(appointment, $appointment) <= agent($doctor, %doctor) then
7          ($symptoms = getSymptoms() then
8           inform(symptoms, $symptoms) => agent($doctor, %doctor) then
9           waitfor
10            (inform(refer) <= agent($doctor, %doctor) or
11             inform(norefer) <= agent($doctor, %doctor))
12          timeout (e)) or
13         reject(appointment) <= agent($doctor, %doctor))
14      timeout (e)
15
16   agent(!Doctor1, %doctor) =
17     waitfor (request(appointment) <= agent($patient, %patient)) timeout (e) then
18     ($appointment = makeAppointment($patient) then
19      accept(appointment, $appointment) => agent($patient, %patient) then
20      waitfor
21        (inform(symptoms, $symptoms) <= agent($patient, %patient) then
22          ($ref = doReferral($patient, $symptoms) then
23           inform(refer) => agent($patient, %patient)) or
24          inform(norefer) => agent($patient, %patient))
25      timeout (e)) or
26     reject(appointment) => agent($patient, %patient)]

Figure 2: General Practitioner Protocol.

It is helpful to consider an example of a MAP protocol in order to illustrate these concepts. In Figure 2 we present an example scene which would form part of a larger institution. This scene defines an interaction protocol between doctor and patient agents, and is intended to represent a patient visiting a General Practitioner (GP) to obtain a diagnosis of some symptoms. We distinguish between the different types of terms by prefixing variable names with $, role names with %, and agent names with !. We define two agents !Patient1 and !Doctor1, which have the roles %patient and %doctor respectively. The protocol for the


patient is specified separately from that of the doctor, though the two will interact closely. The agents are synchronised purely through the exchange of messages.

When exchanging messages, through send and receive actions, a unification of the terms in the definition agent(φ1, φ2) is performed, where φ1 is matched against the agent name, and φ2 is matched against the agent role. For example, the request for an appointment in line 4 of the protocol will match any agent whose role is %doctor. Similarly, the receipt of the request in line 17 of the protocol will match any agent whose role is %patient, and the name of this agent will be bound to the variable $patient. We can therefore define broadcast and multi-cast communications. Furthermore, our example will scale when more than two agents are present in the scene.
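This unification of terms can be sketched in a few lines of Python. This is an illustrative model, not part of the MAP specification: we assume variables are represented as ("var", name) tuples, wild-cards as ("_",), and ground terms (agent names, roles, constants) as plain strings.

```python
# Minimal sketch of MAP term unification. Representation is an assumption:
#   variables  -> ("var", name)
#   wild-cards -> ("_",)
#   agents, roles, constants -> plain strings such as "!Patient1", "%doctor"

def unify(pattern, term, env):
    """Match `pattern` against `term`, extending the binding environment
    `env`. Returns the (possibly extended) environment, or None on failure."""
    if pattern == ("_",):                      # wild-card matches anything
        return env
    if isinstance(pattern, tuple) and pattern[0] == "var":
        name = pattern[1]
        if name in env:                        # already bound: must agree
            return env if env[name] == term else None
        new_env = dict(env)
        new_env[name] = term                   # bind variable to term
        return new_env
    return env if pattern == term else None    # ground terms must be equal

def unify_all(patterns, terms, env):
    """Unify two equal-length term sequences, as in a procedure invocation,
    message receipt, or recursive protocol call."""
    if len(patterns) != len(terms):
        return None
    for p, t in zip(patterns, terms):
        env = unify(p, t, env)
        if env is None:
            return None
    return env
```

Under this model, matching the receive template agent($patient, %patient) against a request sent by !Patient1 would bind the variable to the sender's name, while a role mismatch would make the whole match fail.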

The semantics of message passing in our language corresponds to reliable, buffered, non-blocking communication. Sending a message will succeed immediately if an agent matches the definition, and the message M will be stored in a buffer on the recipient. Receiving a message involves an additional unification step. The message M supplied in the definition is treated as a template to be matched against any message in the buffer. For example, in line 6 of the protocol, a message must match accept(appointment, $appointment), and the variable $appointment will be bound to the second term in the message if the match is successful. Sending a message will fail if no agent matches the terms, and receiving a message will fail if no message matches the message template.
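This buffered receive can be pictured as follows. The sketch below is a hypothetical Python model (the Mailbox class and its method names are our invention, not part of MAP): delivery appends to the recipient's buffer, and a receive scans the buffer for the first message matching a template.

```python
# Sketch of reliable, buffered, non-blocking message receipt. Performatives
# are modeled as tuples, e.g. ("accept", "appointment", "tuesday-9am");
# template positions beginning with "$" bind variables, "_" matches anything.

def match(template, message, env):
    """Match a performative template against a buffered message, binding
    $-variables into a copy of env. Returns None if the match fails."""
    if len(template) != len(message):
        return None
    env = dict(env)
    for t, m in zip(template, message):
        if t == "_":
            continue
        if t.startswith("$"):
            if t in env and env[t] != m:       # bound variables must agree
                return None
            env[t] = m
        elif t != m:                           # ground terms must be equal
            return None
    return env

class Mailbox:
    def __init__(self):
        self.buffer = []                       # messages awaiting receipt

    def deliver(self, message):                # sender side: always succeeds
        self.buffer.append(message)

    def receive(self, template, env):
        """Non-blocking receive: return (env, message) for the first buffered
        message matching the template, or None (the action fails)."""
        for i, msg in enumerate(self.buffer):
            new_env = match(template, msg, env)
            if new_env is not None:
                del self.buffer[i]             # message is consumed
                return new_env, msg
        return None
```

Note that a failed receive simply returns without blocking, which is why the protocol in Figure 2 wraps every receive in a waitfor loop.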

Communication is non-blocking in that the send and receive actions do not delay the agent. For this reason, all of the receive actions are wrapped in waitfor loops to avoid race conditions. For example, in line 17 the agent will loop until a message is received. If this loop were not present, the agent might fail to find an appointment request, and the protocol would terminate prematurely. The advantage of non-blocking communication is that we can check for a number of different messages. For example, in lines 9 through 12 of the protocol the agent waits for either a refer or a norefer decision. The waitfor loop includes a timeout condition which is triggered after a certain interval has elapsed. This can be useful in handling certain kinds of failures, though we do not make use of timeouts in our example.
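One way to realise waitfor op1 timeout op2 over non-blocking actions is as a polling retry loop. The sketch below is an illustrative Python model under assumed conventions (failure of an operation is modeled as returning None; the polling interval is an implementation choice, not part of the language):

```python
import time

def waitfor(op, timeout_op, timeout_s=1.0, poll_s=0.01):
    """Sketch of 'waitfor op1 timeout op2': repeatedly retry the non-blocking
    operation `op` until it succeeds; once `timeout_s` seconds have elapsed
    without success, evaluate the timeout branch instead."""
    def run(env):
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            result = op(env)
            if result is not None:             # op succeeded: stop looping
                return result
            time.sleep(poll_s)                 # avoid busy-waiting
        return timeout_op(env)                 # interval elapsed: timeout branch
    return run
```

This matches the use in line 17 of Figure 2: the doctor's receive is retried until an appointment request arrives, rather than failing on the first empty check.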

At various points in the protocol, an agent is required to perform tasks, such as making a decision or retrieving some information. This is done through the use of decision procedures. As stated earlier, decision procedures provide an interface between the dialogue protocol and the rational processes of the agent. In our language, a decision procedure p takes a number of terms as arguments and returns a single result variable v. The actual implementation of the decision procedure is external to the dialogue protocol. In effect, the decision procedure acts as a hook between the dialogue and the rational processes. For example, the makeAppointment decision procedure in line 18 of the dialogue refers to an external appointment procedure, which can be arbitrarily complex (e.g. a timetabling application).
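This hook can be pictured as a registry of externally supplied callables. The following Python sketch is hypothetical (the registry class and the stub behind makeAppointment are our assumptions; the paper leaves the implementation of decision procedures entirely to the agent):

```python
# Sketch of decision procedures as external hooks: the protocol names a
# procedure p and a result variable v; the implementation behind p is
# supplied by the agent and may be arbitrarily complex. A procedure
# signals failure by raising, which the protocol treats as action failure.

class ProcedureFailure(Exception):
    """Raised when a decision procedure cannot produce a result."""

class DecisionProcedures:
    def __init__(self):
        self.table = {}

    def register(self, name, fn):
        self.table[name] = fn

    def invoke(self, name, env, args):
        """Evaluate v = p(args): substitute bound variables into the
        arguments, then call the external implementation."""
        actual = [env.get(a, a) for a in args]     # substitute bindings
        try:
            return self.table[name](*actual)
        except Exception as exc:
            raise ProcedureFailure(name) from exc

procs = DecisionProcedures()
# The real makeAppointment could delegate to a timetabling application;
# here an illustrative stub stands in for it:
procs.register("makeAppointment", lambda patient: f"slot-for-{patient}")
```

The protocol engine only ever sees the procedure's name and result; everything behind the registered callable belongs to the agent's rational process.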

The operations in the protocol are sequenced by the then operator, which evaluates op1 followed by op2, unless op1 involved an action which failed. The failure of actions is generally handled by the or operator. This operator is defined such that if op1 fails, then op2 is evaluated; otherwise op2 is ignored. For example, if the doReferral procedure in line 22 fails, then a norefer message will be sent in line 24. Our language also includes a par operator which evaluates op1 and op2 in parallel. This is useful when an agent is involved in more than one action simultaneously, though we do not use this in our example.


External data is represented by constants c in our language. We do not attempt to assign types to this data; rather, we leave the interpretation of the data to the decision procedures. For example, in line 7 the symptoms are returned by the getSymptoms procedure, and interpreted by the doReferral procedure in line 22. Constants can therefore refer to complex data-types, e.g. flat-file data, XML documents, or images.

It should be clear that MAP is a powerful language for expressing multi-agent dialogue. We have used this language to specify a wide range of protocols, including a number of popular negotiation and auction protocols. It is important to note that MAP is not intended to be a general-purpose language, and therefore the relative paucity of features (e.g. no user-defined data-types) is entirely appropriate. Nonetheless, dialogue protocols are executable, given a suitable communication platform and an appropriate definition of the decision procedures. In the following section we define the semantics of the language formally, as the basis for an implementation.

3 Semantics of MAP

The provision of a clean and unambiguous semantics for our MAP language was a primary consideration in the design of the language. The purpose of the semantics is to formally describe the meaning of the different language constructs, such that dialogue protocols expressed in the language can be interpreted in a consistent manner. We consider this to be a failing of the formal semantics of FIPA [FIP99], which is expressed in BDI logic. The FIPA semantics is an abstract description, which neglects practical aspects such as a definition of the communication primitives. Furthermore, the BDI modalities can be interpreted in a number of different ways, e.g. [RG95, Jen93], meaning that implementations of BDI agents have typically been ad-hoc in nature.

∆ ∈ Agent Environment ::= a ↦ (r, AE, VE, PE, ME^(k))

AE ∈ Agent Protocols ::= φ^(k) ↦ op

VE ∈ Variables ::= v ↦ φ

PE ∈ Decision Procedures ::= p ↦ φ^(k)

ME ∈ Messages ::= (a, r, M)

Figure 3: MAP Evaluation Environment.

We have chosen to present the MAP semantics in a relational operational semantics formalism called natural semantics [Kah87], so called because the evaluation of the relations is reminiscent of natural deduction. The natural semantics style is convenient because the entire evaluation of an agent dialogue can be captured within a (semi-)compositional derivation that can be reasoned about inductively. The rules of the semantics can be implemented directly (e.g. as Prolog Horn clauses) and a derivation can be performed incrementally (depth-first) from the root to the leaves. In natural semantics, we define relations between the initial and final states of program fragments. A program fragment in MAP is either an operation op, or an action α. The state is captured by an agent environment ∆ which is defined in
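To illustrate how rules in this style can be implemented directly, the following Python sketch (entirely hypothetical; the tagged-tuple encoding of operations and all names are ours, not the paper's) evaluates a small subset of the operations, roughly rules (1) to (4), over a dictionary-based environment:

```python
# Hypothetical sketch of a natural-semantics evaluator for MAP operations.
# An operation is a tagged tuple; evaluating it maps an environment to a new
# environment, mirroring the judgement form  Delta, a |- op => Delta'.

def evaluate(env, agent, op):
    tag = op[0]
    if tag == "action":                  # rule (1): an operation that is an action
        return eval_action(env, agent, op[1])
    if tag == "then":                    # rule (2): sequential composition
        env1 = evaluate(env, agent, op[1])
        return evaluate(env1, agent, op[2])
    if tag == "or":                      # rules (3)/(4): choice, tried in order,
        try:                             # with failure handled as backtracking
            return evaluate(env, agent, op[1])
        except RuntimeError:
            return evaluate(env, agent, op[2])
    raise ValueError(f"unknown operation: {tag!r}")

def eval_action(env, agent, act):
    # Stand-in for the action rules (7)-(10); here an "action" merely binds a
    # variable, returning an updated environment without mutating the original.
    name, value = act
    new_env = dict(env)
    new_env[name] = value
    return new_env
```

A depth-first implementation such as this realizes the nondeterministic choice of rules (3) and (4) as ordered backtracking, which is how a Prolog encoding of the rules would behave.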



The judgements have the forms ∆, a ⊢ op ⇒ ∆′ for operations and ∆, a ⊢ α ⇒ ∆′ for actions.

(1) If ∆, a ⊢ α ⇒ ∆′ (as an action), then ∆, a ⊢ α ⇒ ∆′ (as an operation).

(2) If ∆, a ⊢ op1 ⇒ ∆′ and ∆′, a ⊢ op2 ⇒ ∆″, then ∆, a ⊢ op1 then op2 ⇒ ∆″.

(3) If ∆, a ⊢ op1 ⇒ ∆′, then ∆, a ⊢ op1 or op2 ⇒ ∆′.

(4) If ∆, a ⊢ op2 ⇒ ∆′, then ∆, a ⊢ op1 or op2 ⇒ ∆′.

(5) If ∆, a ⊢ op1 ⇒ ∆′ and ∆, a ⊢ op2 ⇒ ∆″, then ∆, a ⊢ op1 par op2 ⇒ ∆′ ∪ ∆″.

(6) If ∆(a) = (r, AE, VE, _, _), VE ⊢ subst(φ1^(k)) ⇒ φ2^(k), there exists φ3^(k) ∈ AE with ∅ ⊢ unify(φ3^(k), φ2^(k)) ⇒ VE′, and ∆ ∪ VE′, a ⊢ op ⇒ ∆′, then ∆, a ⊢ agent(φ1^(k)) ⇒ ∆′.

(7) ∆, a ⊢ ε ⇒ ∆.

(8) If ∆(a) = (r, _, VE, PE, _), VE ⊢ subst(φ1^(k)) ⇒ φ2^(k), VE ⊢ unify(PE(p), φ2^(k)) ⇒ VE′, and VE′ ⊢ eval(p, v) ⇒ VE″, then ∆, a ⊢ v = p(φ1^(k)) ⇒ ∆ ∪ VE″.

(9) If ∆(a) = (r, _, VE, _, _), VE ⊢ subst(φ1^(k)) ⇒ φ3^(k), VE ⊢ subst(φ2^(2)) ⇒ φ4^(2), and for all a′ ∈ ∆ with ∆(a′) = (r′, _, VE′, _, ME′^(k)) such that ∅ ⊢ unify(φ4^(2), (a′, r′)) ⇒ ∅, then ∆, a ⊢ ρ1(φ1^(k)) => agent(φ2^(2)) ⇒ ∆(a′) ∪ (a, r, ρ1(φ3^(k))).

(10) If there exists ME ∈ ∆(a) with ME = (a′, r′, ρ2(φ3^(k))), ∅ ⊢ unify((a′, r′), φ2^(2)) ⇒ VE, and VE ⊢ unify(φ1^(k), φ3^(k)) ⇒ VE′, then ∆, a ⊢ ρ1(φ1^(k)) <= agent(φ2^(2)) ⇒ (∆ − ME) ∪ VE′.

The auxiliary functions are defined by:

VE ⊢ subst(v) ⇒ VE(v)
VE ⊢ subst(φ) ⇒ φ
VE ⊢ subst(φ^(k)) ⇒ (subst(φ1), …, subst(φk))

VE ⊢ unify(_, φ) ⇒ VE
VE ⊢ unify(φ, φ) ⇒ VE
VE ⊢ unify(v, φ) ⇒ VE[v ↦ φ]
VE ⊢ unify(φ1^(k), φ2^(k)) ⇒ unify(φ1¹, φ2¹) ∪ ⋯ ∪ unify(φ1ᵏ, φ2ᵏ)

Figure 4: MAP Operational Semantics.



Figure 3. The environment contains an n-tuple for each agent comprising the agent role r, the agent protocols AE, the bound variables VE, the decision procedures PE, and a message queue ME. The agent protocols AE map from arguments φ(k) to operations op, where an empty sequence of arguments is the initial agent protocol. The decision procedures PE are represented as a map from the procedure name p to the argument terms φ(k). The message queue ME(k) is a sequence of n-tuples (a, r, M), where a and r are the name and role of the sender, and M is the actual message. For brevity we omit from our definition the rules for constructing the initial environment, and for checking well-formedness of the environment.

We define the evaluation rules for the program fragments of MAP in Figure 4. To capture the exchange of messages between agents we assume that the environment ∆ is shared between agents. Thus, sending a message to an agent is captured by placing the message into the message queue ME of the recipient. Rules 1 through 6 define the evaluation of the different types of operations op. The form of these rules is ∆, a ⊢ op ⇒ ∆′, where ∆ is the state at the start of evaluation, a is the name of the agent performing the evaluation, op is the operation, and ∆′ is the state on completion. Similarly, rules 7 through 10 capture the evaluation of the actions α. The form of these rules is ∆, a ⊢ α ⇒ ∆′, which is as before, where α is the action. We also define a substitution function VE ⊢ subst(φ) ⇒ φ′, which substitutes variables for their values, and a unification function VE ⊢ unify(φ1, φ2) ⇒ VE′, which matches terms and binds variables to values. The VE ⊢ eval(p, v) ⇒ VE′ function evaluates the external decision procedure p, binding the result to v in VE′.
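The subst and unify functions can be sketched as follows (an illustrative Python rendering; the term representation with "$"-prefixed variables and "_" wildcards is our own assumption, not the paper's):

```python
# Hypothetical sketch of the subst and unify auxiliary functions.
# Terms are atoms (strings/numbers) or tuples of terms; variables are
# strings starting with "$", and "_" is a wildcard.

def is_var(term):
    return isinstance(term, str) and term.startswith("$")

def subst(ve, term):
    """VE |- subst(phi) => phi': replace bound variables by their values."""
    if is_var(term):
        return ve.get(term, term)
    if isinstance(term, tuple):
        return tuple(subst(ve, t) for t in term)
    return term

def unify(ve, t1, t2):
    """VE |- unify(phi1, phi2) => VE': match terms, extending the bindings.
    Returns the extended bindings, or None to signal failure."""
    if t1 == "_":                        # wildcard matches anything
        return ve
    if is_var(t1):                       # unify(v, phi) => VE[v -> phi]
        extended = dict(ve)
        extended[t1] = t2
        return extended
    if isinstance(t1, tuple) and isinstance(t2, tuple) and len(t1) == len(t2):
        for a, b in zip(t1, t2):         # element-wise unification of sequences
            ve = unify(ve, a, b)
            if ve is None:
                return None
        return ve
    return ve if t1 == t2 else None      # unify(phi, phi) => VE
```

Returning a fresh dictionary on each binding keeps earlier derivation states intact, which is convenient for the backtracking behaviour mentioned in the conclusions.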

4 Conclusions and Further Work

In this paper we have defined a novel language for representing dialogue protocols in Multi-Agent Systems. Our language of multi-agent dialogue protocols (MAP) fills an essential gap between the low-level communication and high-level reasoning processes found in such systems. The language is founded on process calculus and is expressive enough to describe a large range of agent protocols.

Dialogue protocols specified in the MAP language are designed to be directly executable by the agents participating in the dialogue. To this end we have presented an operational semantics for the language, which precisely defines the evaluation behaviour of the language. Our presentation in the natural semantics style enables a direct implementation of the evaluation rules of the language. We have implemented these rules directly as Prolog Horn clauses using LINDA for inter-agent communication. We have also implemented the MAP language in Java using concurrent threads for the individual agents, and an interpreter which provides the necessary back-tracking and unification behaviour.

Dialogue protocols specify complex asynchronous and concurrent interactions, and it is therefore difficult to design correct protocols. Our experience with defining protocols in MAP has shown that predicting undesirable behaviour is a non-trivial task. To address this issue we are currently investigating the use of model-checking techniques [CGP99] to perform automated verification. The appeal of this approach over simulation is that an exhaustive exploration of the dialogue space is performed. Our early experiments with this technique have shown a high success rate in the detection of failures (e.g. non-termination) in our dialogues.



References

[CGP99] E. M. Clarke, O. Grumberg, and D. A. Peled. Model Checking. MIT Press, 1999.

[CL90] P. R. Cohen and H. J. Levesque. Rational interaction as the basis for communication. Intentions in Communication, pages 221–256, 1990.

[ERS+01] M. Esteva, J. A. Rodríguez, C. Sierra, P. Garcia, and J. L. Arcos. On the Formal Specification of Electronic Institutions. In Agent-mediated Electronic Commerce (The European AgentLink Perspective), number 1991 in Lecture Notes in Artificial Intelligence, pages 126–147, 2001.

[FIP99] FIPA Foundation for Intelligent Physical Agents. FIPA Specification Part 2 -Agent Communication Language, April 1999. Available at: www.fipa.org.

[GHB99] M. Greaves, H. Holmback, and J. Bradshaw. What is a Conversation Policy? In Proceedings of the Workshop on Specifying and Implementing Conversation Policies, Autonomous Agents '99, Seattle, Washington, May 1999.

[Jen93] N. R. Jennings. Specification and Implementation of a Belief-Desire-Joint-Intention Architecture for Collaborative Problem Solving. Journal of Intelligent and Cooperative Information Systems, 2(3):289–318, 1993.

[Jen00] N. R. Jennings. On Agent-Based Software Engineering. Artificial Intelligence,117(2):277–296, 2000.

[Kah87] G. Kahn. Natural Semantics. In Proceedings of the Fourth Annual Symposium on Theoretical Aspects of Computer Science (STACS'87), number 247 in Lecture Notes in Computer Science, pages 22–39, Passau, Germany, February 1987. Springer-Verlag.

[MC02] N. Maudet and B. Chaib-draa. Commitment-based and Dialogue-game based Protocols – New Trends in Agent Communication Language. The Knowledge Engineering Review, 17(2):157–179, 2002.

[Mil89] R. Milner. Communication and Concurrency. Prentice-Hall International, 1989.

[RG95] A. S. Rao and M. P. Georgeff. BDI-agents: from theory to practice. In Proceedings of the First International Conference on Multiagent Systems (ICMAS-95), pages 312–319, San Francisco, USA, June 1995. AAAI Press.

[RG98] A. S. Rao and M. Georgeff. Decision procedures for BDI logics. Journal of Logicand Computation, 8(3):293–344, 1998.

[Sin98] M. P. Singh. Agent Communication Languages: Rethinking the Principles. IEEEComputer, pages 40–47, December 1998.

[Woo00] M. Wooldridge. Semantic issues in the verification of agent communication languages. Autonomous Agents and Multi-Agent Systems, 3(1):9–31, 2000.



Bayesian Model Averaging Across Model Spaces via Compact Encoding

Ke Yin and Ian Davidson
Department of Computer Science, SUNY – Albany, Albany, NY 12222
[email protected], [email protected]

Abstract

Bayesian Model Averaging (BMA) is well known for improving predictive accuracy by averaging inferences over all models in the model space. However, Markov chain Monte Carlo (MCMC) sampling, as the standard implementation of BMA, encounters difficulties in even relatively simple model spaces. We introduce a minimum message length (MML) coupled MCMC methodology which not only addresses these difficulties but has additional benefits. We illustrate the methodology with a mixture component model example (clustering) and show that our approach produces more interpretable results when compared to Green's popular reversible jump sampling across model sub-spaces technique. The MML principle mathematically embodies Occam's razor by assigning penalized prior probabilities to complicated models. We find that BMA prediction based on sampling across multiple sub-spaces of different complexity makes much improved predictions compared to the single best (shortest) model.

1. Introduction

Bayesian model averaging (BMA) removes model uncertainty by making inferences from all possible models in the considered model spaces, weighted by their posterior probabilities. The removal of uncertainty decreases the associated risk of making predictions from only a single model and hence improves prediction accuracy [9]. The standard BMA implementation involves running a Markov chain that has the posterior distribution of the models as its stationary distribution, typically using either the Gibbs or Metropolis-Hastings algorithm. However, the method encounters difficulties [6] in even simple model spaces, such as mixture models.
Not only are the full conditional distributions difficult to derive and sample from, but mixing can also become slow, which can be exacerbated if the data has relatively high dimension. Thus, powerful as BMA is, these difficulties have prevented its wide application in machine learning and statistical inference problems. The Minimum Message Length (MML) principle [1] evaluates the quality of a model θ by the total message length to encode both the model and the data D. The message length of a model has a simple relationship with the posterior probability of the model:

p(θ|D) ∝ exp(−MsgLen(θ) − MsgLen(D|θ))

where the information length is measured in nits. This property allows sampling models according to their posterior probabilities calculated from the message lengths of the models. In this paper we show that the MML-MCMC sampler has significant additional benefits over MCMC with traditional estimators, in particular:

• The full conditional distribution defined by the message length distribution is easier to derive and sample from.

• MML mathematically embodies Occam's razor by associating the model's prior probability with its complexity.

• MML uses message length as a universal metric to evaluate the quality of models, allowing easy sampling across different model subspaces or even fundamentally different model spaces.

• The MML discretization of the model space into regions, with each region having a representative model, greatly reduces the size of the model space, resulting in more efficient sampling.

In this paper we illustrate the MML coupled MCMC framework with the mixture model as an example, although the approach is applicable to other model families. Bayesian inference from Gaussian mixtures with unknown


number of components was first addressed by Richardson and Green [5] using their reversible jump method. Though their approach is readily applicable to univariate problems, to our knowledge it has not been successfully demonstrated on high-dimensional data. There exists work that attempts to bypass the problem of estimating the number of components by assuming infinite models using Dirichlet process methods [10][13]. That method relies on the mathematical convenience of a conjugate prior, and we intend to compare this approach to ours in future work. We start by introducing the MML principle and its formalization for mixture models. We then formally propose the five kernel moves in the MML-MCMC sampler and prove its convergence to the MML-defined posterior. The convergence is then empirically verified via chain diagnosis on an artificially generated Gaussian dataset. Next, we compare the MML-MCMC sampler with the reversible jump sampler [5] and illustrate that the MML-MCMC sampler finds a more interpretable posterior distribution of k by adopting the automatic prior in the MML principle. Finally, we empirically explore the predictive capability of BMA with a number of standard machine learning and statistical datasets. This work is an extension of our earlier work [12] to include jumping between model spaces of varying complexity. To our knowledge, it is the first work to address model averaging across model spaces of varying dimension for predictive purposes.

2. Minimum Message Length Principle

The minimum message length principle was first proposed by Wallace and Boulton [2] using Shannon's information encoding scheme. The principle measures both the model complexity and the model's fit to the data in nits of information. The total message length equals the sum of two parts: the message length to encode the model and the message length to encode the data in the presence of the model.

MsgLen(θ, D) = MsgLen(θ) + MsgLen(D|θ)   (1)

One can only encode the model to some finite precision, otherwise the message length would be infinite. For this reason, the model space must be discretized into a number of cells. All the models contained in each cell are considered to be uniformly distributed throughout it, and the cell is represented by the model at its center. The larger the cell volume, the fewer cells there will be in the model space, and the less information is required to specify a particular model. However, larger cell volumes also lead to inaccurate model specifications, resulting in a longer expected message length in the second part of the message. For a model with n_p parameters, the optimal volume V of a cell that minimizes the total expected message length is determined [4] by

V = ( κ_{n_p}^{n_p} det F(θ) )^{−1/2}   (2)

Here, κ_{n_p} is the lattice constant [7], and F(θ) is the expected Fisher information. Using the optimal cell volume produces a message length of:

MsgLen = −ln p(θ) + (n_p/2) ln κ_{n_p} + (1/2) ln det F(θ) − ln L(θ) + n_p/2   (3)

In the above equation θ is no longer a continuous variable; it may only change in units of the cell volume. Rewriting equation (3) yields

MsgLen = −ln p(θ) + (1/2) ln( κ_{n_p}^{n_p} det F(θ) ) − ln L(θ) + n_p/2 = −ln( p(θ) · V ) − ln p(D|θ) + n_p/2   (4)

Page 266: Artificial Intelligence and Mathematics - Rutgers Universityrutcor.rutgers.edu/~amai/aimath04/AcceptedPapers/aimath-I.pdf · Artificial Intelligence and Mathematics January 4-6, 2004

This expression is very close to the logarithm of Bayes' theorem shown below, less the probability of the data, which is a constant.

−ln p(θ|D) = −ln p(θ) − ln p(D|θ) + ln p(D)   (5)

We see that the first part of the message length serves as the prior probability for the model. This suggests that more complicated models, which take more information to encode, should have lower prior probability than simpler models. This is consistent with, and quantifies, the Occam's razor philosophy of choosing the simpler model when the fit to the data is equivalent. Later, in section 7, we will see that this automatic prior yields more interpretable results than a uniform prior over models of different complexity. The discretization in MML has introduced imprecision on θ. However, it can be shown that this imprecision is no greater than the difference between the estimator and the true value of θ [1]. In other words, the MML-defined optimal cell volume equals the inherent imprecision due to the distribution of the data. Thus, we have

MsgLen = −ln p(θ) − ln p(D|θ) + n_p/2 = −ln p(θ|D) − ln p(D) + n_p/2   (6)

Since n_p/2 is constant, we obtain the transformation between the message length (measured in nits) and the posterior distribution:

p(θ|D) ∝ e^(−MsgLen)   (7)

The discretization of the parameter space yields far fewer models to consider, making the sampling more efficient. Moreover, the message length distribution can be truncated by a constant amount c while still transforming into the same posterior distribution. Let Θ = (θ_1, θ_2, …, θ_n) be a set of models and let M = (m_1, m_2, …, m_n) be the corresponding two-part message lengths of the models. Then both the message length distribution M = (m_1, m_2, …, m_n) and the truncated distribution M′ = (m_1 − c, m_2 − c, …, m_n − c) transform into the same posterior distribution of Θ. The proof is straightforward, since

p(θ_i|D) = e^(−m_i) / Σ_{j=1}^n e^(−m_j) = e^(−(m_i − c)) / Σ_{j=1}^n e^(−(m_j − c))   (8)
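This truncation property is easy to verify numerically. The sketch below (illustrative only, not the authors' code) converts message lengths into posterior probabilities, shifting every length by the shortest one first, which is exactly the truncation of equation (8) put to practical use to avoid floating-point underflow:

```python
import math

def posterior_from_msglen(msglens):
    """p(theta_i|D) = exp(-m_i) / sum_j exp(-m_j), computed stably by
    truncating every length by the minimum (equation (8) with c = min m)."""
    m_min = min(msglens)
    weights = [math.exp(-(m - m_min)) for m in msglens]
    total = sum(weights)
    return [w / total for w in weights]
```

Without the shift, message lengths of a few hundred nits would underflow to zero when exponentiated; with it, only the relative differences matter.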

This result makes the message length calculation significantly less cumbersome when trying to sample from the transformed distribution. Instead of calculating the absolute message length for each model, one only needs to calculate their relative differences.

3. Minimum Message Length in the Mixture Component Model

A mixture component model M with d dimensions and n instances can be specified as M = ⟨w, p, k, θ⟩, where

w = (w_1, w_2, …, w_n): the assignments of the instances, with w_i ∈ {1, 2, …, k} for i = 1, 2, …, n;

p = (p_1, p_2, …, p_k): the relative weight of each component, with p_i ≥ 0 and Σ_{i=1}^k p_i = 1;

k ∈ {1, 2, …, K}: the number of components, where K is the maximum possible value for k;

θ = ⟨μ, σ⟩: the k × d parameter matrices of the mixture component model, where μ_ij and σ_ij stand for the mean and standard deviation of the jth dimension in the ith component.
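As a concrete reading of this specification, a minimal representation of the model tuple might look like the following (an illustrative sketch; the class and method names are ours, and assignments are 0-based here for convenience):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MixtureModel:
    """The tuple M = <w, p, k, theta> of the mixture component model."""
    w: List[int]              # assignments, w[i] in {0, ..., k-1} (0-based)
    p: List[float]            # component weights, p[j] >= 0, summing to 1
    k: int                    # number of components, 1 <= k <= K
    mu: List[List[float]]     # k x d matrix of means
    sigma: List[List[float]]  # k x d matrix of standard deviations

    def component_counts(self):
        """n_j: the number of instances assigned to each component."""
        counts = [0] * self.k
        for assignment in self.w:
            counts[assignment] += 1
        return counts
```

The per-component counts n_j computed here are the sufficient statistics used repeatedly in the weight-sampling and split/combine moves below.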

Page 267: Artificial Intelligence and Mathematics - Rutgers Universityrutcor.rutgers.edu/~amai/aimath04/AcceptedPapers/aimath-I.pdf · Artificial Intelligence and Mathematics January 4-6, 2004

The full encoding of the mixture model consists of two parts. The first part of the message encodes the independent parameters k, p, θ according to their prior distributions. The second part of the message encodes the assignments w and the data. Given the knowledge of p, the optimal coding dictionary encodes an assignment to component j with −ln p_j nits. The instances are then encoded with knowledge of w and θ. An instance x with d dimensions takes −Σ_{m=1}^d ln f(x_m | μ_{w(x),m}, σ_{w(x),m}) nits to encode. In summary,

MsgLen = −ln h(k, p, θ) − Σ_{i=1}^n ln p_{w(i)} − Σ_{i=1}^n Σ_{m=1}^d ln f(x_i^m | μ_{w(i),m}, σ_{w(i),m}) + (1/2) ln det F(k, p, θ) + (n_p/2) ln κ_{n_p} + n_p/2   (9)

An expansion similar to that of Baxter and Oliver [3] gives the full message length expansion.

[Equation (10): the full term-by-term expansion of equation (9) for the Gaussian mixture model, expanding the prior h(k, p, θ), the Fisher information det F(k, p, θ), and the data encoding into explicit sums over components, dimensions, and instances; the expression is not recoverable from the source formatting.]

Each distinct parameter tuple ⟨w, p, k, θ⟩ corresponds to a model in the model space, and its message length is calculated by equation (10), from which the joint distribution p(k, p, w, θ, D) can be derived. Any full conditional distribution in which some of the parameters are known, such as p(θ | k, p, w, D), can be conveniently obtained from the same message length distribution simply by fixing the given parameters as constants.

4. MML Coupled Markov Chain Monte Carlo

The standard approach to sampling from the posterior distribution is to run a Markov chain that has the posterior distribution as its stationary distribution. In the chain, the state at time t, M_t = ⟨w_t, p_t, k_t, θ_t⟩, depends only on the state at time t−1, M_{t−1} = ⟨w_{t−1}, p_{t−1}, k_{t−1}, θ_{t−1}⟩. The iteration from time t−1 to t, usually called a sweep, consists of sampling each parameter conditionally on all other remaining parameters. In the MML-MCMC sampler for the mixture component model, there are five moves in each sweep:

• Sampling w from p(w | k, p, θ, D)
• Sampling p from p(p | k, w, θ, D)
• Death and re-birth of a component
• Split and combine of a component
• Sampling θ from p(θ | k, p, w, D)

The first, second and fifth steps sample from the conditional distribution using the Gibbs algorithm, while the third and fourth steps use a Metropolis-Hastings algorithm to handle empty components and to jump across subspaces. We formalize each move in detail in the rest of this section.

4.1 Sampling w

The assignment w(x) of an instance x affects the message length in the following way. Firstly, the assignment w(x) of the instance to a component must be encoded with −ln p_{w(x)} nits. Then the data point itself must be encoded using θ_{w(x)}, which takes −ln f(x | θ_{w(x)}) nits. We only consider the message length parts that are functions of w(x) in equation (10), since all other parts of the message length are constants and can be ignored in calculating the full conditional distribution. The truncated message length distribution for assigning an instance x to component j is:

Page 268: Artificial Intelligence and Mathematics - Rutgers Universityrutcor.rutgers.edu/~amai/aimath04/AcceptedPapers/aimath-I.pdf · Artificial Intelligence and Mathematics January 4-6, 2004

MsgLen(x, j) = −ln p_j − ln f(x | θ_j) = −ln p_j + (d/2) ln(2π) + Σ_{m=1}^d [ ln σ_{j,m} + (x_m − μ_{j,m})² / (2 σ_{j,m}²) ]   (11)

The distribution consists of k message lengths, each corresponding to one of the k possible assignments of the instance. This message length distribution can be transformed into the full conditional distribution:

p(w(x) = j | k, p, θ, D) = e^(−MsgLen(x, j)) / Σ_{j′=1}^k e^(−MsgLen(x, j′))   (12)
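For a one-dimensional Gaussian mixture (d = 1), this assignment step can be sketched as follows (illustrative only; the function names are our own, not the authors'):

```python
import math
import random

def assignment_msglen(x, p_j, mu_j, sigma_j):
    """Equation (11) for one instance and one component, with d = 1."""
    neg_log_p = -math.log(p_j)
    neg_log_f = (0.5 * math.log(2 * math.pi) + math.log(sigma_j)
                 + (x - mu_j) ** 2 / (2 * sigma_j ** 2))
    return neg_log_p + neg_log_f

def sample_assignment(x, p, mu, sigma, rng=random):
    """Draw w(x) from the full conditional of equation (12), truncating the
    message lengths by their minimum before exponentiating."""
    msglens = [assignment_msglen(x, p[j], mu[j], sigma[j])
               for j in range(len(p))]
    m_min = min(msglens)
    weights = [math.exp(-(m - m_min)) for m in msglens]
    total = sum(weights)
    r = rng.random() * total              # inverse-CDF draw over k components
    for j, wgt in enumerate(weights):
        r -= wgt
        if r <= 0:
            return j
    return len(weights) - 1
```

Repeating this draw for every instance constitutes one pass of the first move of the sweep.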

The instance is stochastically assigned to one of the components according to this distribution. This stochastic assignment process repeats for each instance in the dataset.

4.2 Sampling p

We sample p from the full conditional distribution p(p | k, w, θ, D). Here, p can be viewed as a multinomial distribution with a uniform prior, whose p.d.f. is given by

f(p_1, p_2, …, p_k) = ( n! / Π_{j=1}^k n_j! ) Π_{j=1}^k p_j^{n_j},  where p_k = 1 − Σ_{j=1}^{k−1} p_j and n = Σ_{j=1}^k n_j   (13)

We sample p using a message length defined distribution derived from the equation below:

MsgLen(p) = −ln h(p) + (1/2) ln det F(p) − ln f(p_1, p_2, …, p_k)   (14)

The optimal cell volume for the k−1 free parameters that minimizes the expected message length is given by

V_{k−1} = ( κ_{k−1}^{k−1} det F(p̂) )^{−1/2} = ( κ_{k−1}^{k−1} n^{k−1} / Π_{j=1}^k p̂_j )^{−1/2}   (15)

We approximate the cell by a hypercube. The width s_j of the jth dimension of the cube (j = 1 to k−1) satisfies

s_j ∝ √p̂_j,  scaled so that Π_{j=1}^{k−1} s_j = V_{k−1}   (16)

We sample each p_j (1 ≤ j ≤ k−1) around its maximum likelihood estimator p̂_j = n_j / n with cell width s_j. We truncate the message length distribution by the message length at p̂. The truncated message length distribution is specified by

MsgLen(p_j) = n_j ln(p̂_j / p_j) + n_k ln(p̂_k / p_k) + (1/2) ln(p̂_j / p_j) + (1/2) ln(p̂_k / p_k)   (17)

We sample the value of p_j from the full conditional distribution transformed from this message length distribution. The sampling repeats for all j from 1 to k−1.
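The truncated message length of equation (17) depends only on the pair (p_j, p_k) being perturbed, as the following illustrative sketch shows (names are ours; it assumes p_k absorbs the change in p_j so the weights stay normalized):

```python
import math

def truncated_msglen(p_j, p_k, n_j, n_k, n):
    """Equation (17): message length of (p_j, p_k) relative to the
    maximum likelihood estimates p_hat_j = n_j/n and p_hat_k = n_k/n."""
    p_hat_j = n_j / n
    p_hat_k = n_k / n
    # Each weight contributes (n + 1/2) * ln(p_hat / p):
    # n * ln(...) from the data encoding, (1/2) * ln(...) from det F(p).
    return ((n_j + 0.5) * math.log(p_hat_j / p_j)
            + (n_k + 0.5) * math.log(p_hat_k / p_k))
```

By construction the truncated length is zero at the maximum likelihood estimate and grows as p_j moves away from it in either direction, so sampling with probabilities proportional to e^(−MsgLen) concentrates near p̂_j.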


4.3 Death and Rebirth

Completely empty components and components that contain only a single instance require special attention, as they usually indicate a redundant part of the model that increases the message length needed to specify the model yet does not help compress the data. To increase the mixing rate and shorten the total message length, better parameters can be proposed to help encode the data. The death and rebirth moves involve destroying an empty or nearly empty component and reinitializing it with new parameters, which will hopefully encode the data more efficiently. As we adopt uniform priors for θ and p, their message length is always the same no matter what values we encode. Reinitializing the parameter values of an empty component does not change the total message length, since no instance is encoded with the empty component. Also, we maintain a symmetrical parameter proposal function by reinitializing the parameters randomly, so that detailed balance is satisfied [10]. If a component contains one instance, we cannot always reinitialize the component, since changes to the parameters of the component affect the message length needed to encode the single instance in that component. The increase in message length is −ln f(x | θ_new), since it previously took 0 nits to encode the instance (the parameters of the component contain all information about the instance). Thus the probability of accepting the move is f(x | θ_new). If the dimensionality of the data is high, this acceptance rate becomes very low. One way of overcoming this difficulty is to reinitialize the parameter in only one random dimension.

4.4 Split and Combine

This is the move that enables sampling across subspaces with different k values. At each split and combine move, we stochastically determine whether to attempt a split or a combination.
When k = 1 we always attempt the split move, and when k equals the maximum number of components K we always attempt the combination move. At all other values of k, the probability of a split and the probability of a combination both equal 0.5.

4.4.1 Split

First a random component is chosen as the split candidate. We then choose the dimension with the largest standard deviation to generate the split pivot. The split pivot sp is generated randomly from [μ − σ, μ + σ] in the chosen dimension, and all instances inside the component are divided into two groups: those larger than the pivot in the chosen dimension and those smaller. The parameters of the two groups are estimated using the maximum likelihood estimator. Let s be the candidate component proposed to split into s1 and s2; the change in message length after the proposed split can be calculated from equation (10):

[Equation (18): ΔMsgLen for a split, obtained from equation (10) as the message length of the state with components s1 and s2 minus that of the state with component s; it comprises the changed prior, weight-encoding, Fisher information, and data-encoding terms. The full expression is not recoverable from the source formatting.]

The move is subject to a Metropolis test, and will pass the test with probability min(1, e^(−ΔMsgLen)). It is also important that the split proposal function and the combination proposal function be symmetrical. That is, if s splits into s1 and s2, then the probability of choosing s for the split attempt should equal the probability of choosing s1 and s2 for the combination attempt. To achieve this, we enforce that after the split, at least one of the two newly produced components must be the most adjacent component of the other. Here, when we say component j1 is the most adjacent component to j2, we mean that j1 has the shortest Euclidean distance to j2. Note that this property is not mutual: that j1 is most adjacent to j2 does not imply that j2 is


most adjacent to j1. If neither of the two components is most adjacent to the other, the split attempt is unconditionally rejected. How this proposal function enforces symmetry is discussed in greater detail in the section on proving convergence.

4.4.2 Combine

We pick the two candidate components to combine by choosing the first candidate component randomly and enforcing its most adjacent component as the second candidate. The message length change for a combination step is essentially the reverse of that for the split step:

[Equation (19): ΔMsgLen for a combination, the reverse of the split case in equation (18); the full expression is not recoverable from the source formatting.]
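Both the split and the combine moves share the same Metropolis test on the message length change; a minimal sketch (illustrative only, not the authors' code):

```python
import math
import random

def metropolis_accept(delta_msglen, rng=random):
    """Accept a split/combine proposal with probability min(1, exp(-dMsgLen)).
    Moves that shorten the total message (dMsgLen < 0) are always accepted;
    moves that lengthen it are accepted with exponentially decaying chance."""
    if delta_msglen <= 0:
        return True
    return rng.random() < math.exp(-delta_msglen)
```

Because e^(−MsgLen) plays the role of an (unnormalized) posterior, this is the standard Metropolis acceptance rule expressed in message-length units.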

The combination move is accepted with the Metropolis probability min(1, e^(−ΔMsgLen)). In the next section we will see that this is the symmetrical move for the split move.

4.5 Sampling θ

The truncated message length used in sampling θ from the conditional distribution p(θ | k, p, w, D), obtained by fixing all given parameters as constants, is

MsgLen(θ) = Σ_{j=1}^k Σ_{m=1}^d ln(1/σ_{j,m}²) + Σ_{i=1}^n Σ_{m=1}^d [ ln σ_{w(i),m} + (x_i^m − μ_{w(i),m})² / (2 σ_{w(i),m}²) ]   (20)

Because there are k components and each component has d independent Gaussian distributions, the total number of Gaussian distributions is k × d. The message length for the jth component and mth variable is:

MsgLen(σ_{j,m}, μ_{j,m}) = ln(1/σ_{j,m}²) + n_j ln σ_{j,m} + Σ_{i: w(i)=j} (x_i^m − μ_{j,m})² / (2 σ_{j,m}²)   (21)

We sample σ_{j,m} and μ_{j,m} from the full conditional distribution transformed from the above message length calculation. This step is completed for each component and each attribute.

5. Proof of Convergence

If the stationary distribution of a Markov chain is to converge towards the posterior distribution, the chain must satisfy the following three conditions [6].

• The chain must be aperiodic: there must not be any positive integer T satisfying M_t = ⟨w_t, p_t, k_t, θ_t⟩ = ⟨w_{t+T}, p_{t+T}, k_{t+T}, θ_{t+T}⟩ = M_{t+T} for all t ≥ t_0. If there were such a T, the chain would be periodic with period T.

• The chain must be irreducible: the transition time between any two states M = ⟨w, p, k, θ⟩ and M′ = ⟨w′, p′, k′, θ′⟩ must not be infinite.

• The chain must satisfy the detailed balance condition for every move, given by

p(w, p, k, θ) p((w, p, k, θ) → (w′, p′, k′, θ′)) = p(w′, p′, k′, θ′) p((w′, p′, k′, θ′) → (w, p, k, θ))   (22)
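The detailed-balance condition for the Gibbs moves can be verified numerically for a toy message-length distribution; the sketch below is illustrative only (the function names are ours):

```python
import math

def gibbs_transition_matrix(msglens):
    """For Gibbs sampling from message lengths m_i, every transition
    probability p(theta_i -> theta_j) equals exp(-m_j)/Z, independent of i."""
    z = sum(math.exp(-m) for m in msglens)
    row = [math.exp(-m) / z for m in msglens]
    return [row[:] for _ in msglens]

def detailed_balance_holds(msglens, tol=1e-12):
    """Check p(i) P(i->j) == p(j) P(j->i) for every pair of states."""
    z = sum(math.exp(-m) for m in msglens)
    pi = [math.exp(-m) / z for m in msglens]
    P = gibbs_transition_matrix(msglens)
    n = len(msglens)
    return all(abs(pi[i] * P[i][j] - pi[j] * P[j][i]) < tol
               for i in range(n) for j in range(n))
```

Here detailed balance holds trivially because p(i) P(i→j) = p(i) p(j), which is symmetric in i and j; this is the numeric counterpart of equation (23) below.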


Given any state at time t, the probability that the next state is the same state is greater than zero. Thus the chain may stay in one state for an arbitrary number of iterations before moving to other states, and it can only repeat its history with probability less than 1. Therefore, the chain can never enter a deterministic cycle satisfying the periodicity condition. The chain is also irreducible: the transition probability between any two states is greater than zero, so no transition takes infinite time. We now prove that detailed balance holds for the five moves. In all five moves the parameters are sampled from the full conditional distributions transformed from the message length distribution. In moves 1, 2 and 5, where Gibbs sampling is used, it is easy to see that detailed balance holds, since

p(θ_1) p(θ_1 → θ_2) = (e^{−m_1} / Z) · (e^{−m_2} / Z) = p(θ_2) p(θ_2 → θ_1)   (23)

Here θ stands for whichever parameter is being sampled in the move and Z = Σ_i e^{−m_i} is the normalization factor.

In moves 3 and 4, where the Metropolis algorithm is used,

p(θ_1) p(θ_1 → θ_2) = (e^{−m_1} / Z) min(1, e^{m_1 − m_2}) = (e^{−m_2} / Z) min(1, e^{m_2 − m_1}) = p(θ_2) p(θ_2 → θ_1)   (24)

It remains to show that the proposal function in the Metropolis test is symmetric. In the death and rebirth move, the parameters are randomly reinitialized, so the probability of a proposal is independent of the current state and is obviously symmetric. In the split/combine move, the probability of proposing a component as the split candidate equals 1/k. The proposed component j is split into j1 and j2. We also enforce that at least one of the components resulting from the split must be the most adjacent of the other. Without loss of generality, let j2 be the most adjacent component of j1. In the combination step, the probability of choosing the pair (j1, j2) equals the probability of choosing j1 (which automatically selects j2) plus the probability of choosing j2 and then j1 as its most adjacent component. Let p(j1, j2) be the probability of proposing components j1 and j2 as the combination candidates and p(j) the probability of proposing component j as the split candidate; then

p(j1, j2) = 1/(2k) + 1/(2k) = 1/k = p(j)   (25)

The proposal function is symmetric and detailed balance holds. In summary, the proposed MML-coupled MCMC sampler consists of five moves. The Markov chain grown by the sampler is aperiodic, irreducible and satisfies detailed balance; therefore its stationary distribution converges to the MML-defined posterior distribution of the models.

6. MCMC Diagnosis

In this section, we empirically diagnose the MML-coupled MCMC sampler to verify convergence and analyze the mixing rate. A four-component, two-dimensional Gaussian mixture dataset (Figure 1) is used for this purpose. We run the chain for 200,000 iterations, with the first 50,000 as the burn-in period. We diagnose convergence by comparing the posterior probabilities of k among different segments of the chain, each of 50,000 iterations. If the chain has converged, the posterior probabilities should be constant across the segments. The probability of a particular value of k is the fraction of iterations the sampler stays in the given model subspace. The comparison results are presented in Figure 2. Trace plots for k (Figure 3) indicate efficient mixing across the subspaces of different k, with the split move and the combine move having acceptance rates of 11.2% and 11.3% respectively.
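The segment-wise estimate of p(k) used in this diagnosis can be sketched as follows; `k_trace` is a hypothetical list holding the sampled value of k at each iteration:

```python
from collections import Counter

def posterior_of_k(k_trace, burn_in=50000, segment=50000):
    """Estimate p(k) within consecutive segments of a k-trace.

    p(k) in a segment is the fraction of iterations the sampler spends
    in the subspace with that number of components; comparing the
    per-segment estimates is the convergence check described above."""
    chain = k_trace[burn_in:]
    segments = [chain[s:s + segment] for s in range(0, len(chain), segment)]
    out = []
    for seg in segments:
        counts = Counter(seg)
        out.append({k: c / len(seg) for k, c in counts.items()})
    return out
```

The default burn-in and segment sizes match the 50,000-iteration windows used in the experiment.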


7. Comparisons with Green's Reversible Jump Sampling

The sampler is capable of jumping across subspaces with different k values, and the posterior distribution of k is the relative frequency with which the sampler stays in each subspace. This distribution of k reflects the belief in how many components there should be in the data. In this section we apply the MML-coupled MCMC sampler to three univariate datasets used by Richardson and Green (Figure 4) in their paper on reversible jump sampling [5]. We compare the posterior distribution of k found by the MML-MCMC sampler with that found by the reversible jump sampler. The comparison results are summarized in Table 1.

Table 1. Comparison of the distribution of k on the enzyme, acidity and galaxy datasets

Dataset   Sampler  k=1    k=2    k=3    k=4    k=5    k=6    k=7    k=8    Split   Combine
Enzyme    MML      0.000  0.989  0.010  0.001  0.000  0.000  0.000  0.000  0.46%   0.46%
Enzyme    RG       0.000  0.024  0.290  0.317  0.206  0.095  0.041  0.017  8%      4%
Acidity   MML      0.003  0.989  0.007  0.001  0.000  0.000  0.000  0.000  1.1%    1.1%
Acidity   RG       0.000  0.082  0.244  0.236  0.172  0.118  0.069  0.037  14%     7%
Galaxy    MML      0.060  0.746  0.174  0.018  0.002  0.000  0.000  0.000  9.50%   10.60%
Galaxy    RG       0.000  0.000  0.061  0.128  0.182  0.199  0.16   0.109  11%     18%

Figure 2. Posterior distribution of k (p(k) vs. k) among different time segments: iterations 50001-100000, 100001-150000 and 150001-200000

Figure 1. Four-component, two-dimensional Gaussian mixture (x1 vs. x2)

Figure 3. Typical trace plot of k (iterations 180001-200000)


The MML-MCMC sampler finds quite different distributions of k from the reversible jump sampler. This great difference, however, does not invalidate either approach. In Richardson and Green's work, a uniform prior distribution on k is adopted; that is, the complexity of the model is not considered as part of the problem. Under the minimum message length principle, larger k values, which imply more complicated models, are penalized because they require more information to encode. Upon visually examining the histograms of the datasets, however, the posterior distribution of k found by MML-MCMC is more consistent with human cognition. For example, the reversible jump sampler suggests the best k values for the three problems are 4, 4 and 6 respectively, while MML-MCMC suggests 2, 2 and 2. The smaller acceptance rates on split and combine for MML-MCMC on the enzyme and galaxy data do not imply a better mixing rate for the reversible jump sampler, since it finds a much more diffuse posterior distribution of k than the MML-MCMC sampler.

Figure 4. The instance density distributions (frequency histograms) for the Enzyme, Acidity and Galaxy datasets


8. Bayesian Model Averaging and Classification

As previously stated, the MML-MCMC sampler automatically adopts a prior distribution over models that penalizes overly complicated models. In this section we empirically show that the MML-MCMC sampler, when making predictions via Bayesian model averaging, performs significantly better than predictions made from a single model. We show our results on a number of benchmark machine learning and statistics datasets. We compare the prediction capability of the single best model found by EM with that of the Bayesian averaged model calculated from the MML-MCMC sampler. It should be noted that some of these datasets are high dimensional and, to the best of our knowledge, no previous sampler can sample across subspaces efficiently for high dimensional data. The first dataset we use is the handwritten digit dataset. This dataset consists of sixteen attributes and a dependent attribute representing the digit (0 through 9). The sixteen attributes represent 8 x-y coordinates the pen took while writing the digit. To reduce the computational complexity, we reduce the digit recognition problem into five smaller problems: instead of trying to identify a digit from all 10 possible classes, we paired digits similar in shape into five groups. The accuracies are given in Table 2. The prediction accuracies are based on 100 trials with a 70% training and 30% testing split.

Table 2. Performance of EM and Bayesian model averaging on the digit recognition dataset

Digits  Attributes  EM Error (%)  EM Std. Dev.  BMA Error (%)  BMA Std. Dev.  Error Reduction  Std. Dev. Reduction
0 & 2   16          0.3           0.7           0.1            0.5            67%              29%
1 & 7   16          20.1          4.4           7.3            3.2            64%              27%
3 & 8   16          6.3           3.0           3.5            2.7            44%              10%
4 & 9   16          12.3          13.6          5.3            2.5            57%              82%
5 & 6   16          18.4          4.6           0.3            0.9            98%              80%
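The Bayesian model averaging prediction used above can be sketched as follows. This is our own illustration: `predict_proba` is a hypothetical helper standing in for the per-model class-probability computation, which the paper does not spell out, and the model representation is left abstract.

```python
def bma_predict(models, predict_proba, x):
    """Average class-probability predictions over posterior model samples.

    models: list of model parameter sets drawn by the sampler (post burn-in);
    predict_proba(model, x): hypothetical helper returning a dict
    {class_label: probability} for instance x under one sampled model.
    Returns the highest-probability label and the averaged distribution."""
    totals = {}
    for m in models:
        for label, p in predict_proba(m, x).items():
            totals[label] = totals.get(label, 0.0) + p
    n = len(models)
    avg = {label: p / n for label, p in totals.items()}
    return max(avg, key=avg.get), avg
```

Averaging over all sampled models, rather than predicting from the single best one, is what removes the model uncertainty discussed below the tables.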

We further test the prediction performance on another six benchmark datasets from the UCI machine learning repository; the results are shown in Table 3.

Table 3. Performance of EM and Bayesian model averaging on UCI datasets

Dataset  Attributes  EM Error (%)  EM Std. Dev.  BMA Error (%)  BMA Std. Dev.  Error Reduction  Std. Dev. Reduction
Wine     13          5.4           3.0           4.0            2.6            25.9%            13.3%
Iris     4           8.7           5.2           4.8            3.2            44.8%            38.5%
Pima     8           35            2.6           29.3           2.6            16.3%            0.0%
Glass    9           51.6          6.3           45.7           6.5            11.4%            -3.2%
Diabete  3           15.7          6.1           12.4           4.5            21.0%            26.2%
Shuttle  9           20.2          5.3           10.3           3.3            49.0%            37.7%

We see that the Bayesian averaged model from the MML-MCMC sampler makes much better predictions, both in terms of accuracy and variance. The removal of model uncertainty occurs at two levels: not only is the uncertainty in the value of k removed, but also the model uncertainty for a given k. These accuracies might not


be the highest among all available classifiers, such as C4.5 and neural networks. However, we expect a Bayesian model averaged C4.5 to perform better than plain C4.5, and a Bayesian model averaged neural network to perform better than a plain neural network.

9. Conclusion

The minimum message length principle evaluates the quality of models by the information length needed to encode both the model and the data. The different message lengths associated with different models constitute a message length distribution, which can be conveniently transformed into the posterior distribution or full conditional distributions. This allows building MML-MCMC samplers that are easier to work with, both mathematically and pragmatically. Furthermore, the approach inherits from the MML principle a computable connection between model complexity and the prior probability of the model. With these conveniences and a newly proposed sampling algorithm with five moves, we introduced an MML-MCMC sampler for mixture component models. The sampler can sample across subspaces with different numbers of components with a good mixing rate, even for high dimensional data. It also finds a posterior distribution of k that is more consistent with human cognition; thus the sampler can reliably estimate k on high dimensional data, where estimation by visualization becomes less obvious to a human. Moreover, such a sampler allows Bayesian model averaging predictions, which are empirically verified to be more accurate than the single best models. Pragmatically, this work motivates the implementation of MML-MCMC samplers for other model families, such as Hidden Markov Models (HMMs) or C4.5 decision trees, making inference by averaging the inference from all models in each model space, or, further, in both model spaces together. MML-MCMC makes sampling across fundamentally different model spaces possible because the posterior is calculated from a universally computable metric, the message length.
Such a sampler is a long-term goal of this work. The BMA predictor in this situation should also achieve improved accuracy and reduced variance. Philosophically, this work justifies the consistency between Occam's razor and human cognition in a statistical light. By assigning smaller prior probability to more complicated models, we have obtained estimates of the number of components that are more interpretable to a human. The automatic prior obtained from the model complexity thus becomes a prior in the literal sense: it is not acquired via repeated exposure to experience, but rather preexists in the learning faculty itself [11].

References

[1] C.S. Wallace and P.R. Freeman. Estimation and Inference by Compact Encoding. Journal of the Royal Statistical Society, Series B (Methodological), 49 (1987) 240-265.
[2] C.S. Wallace and D.M. Boulton. An Information Measure for Classification. Computer Journal, 11:185-195, 1968.
[3] R.A. Baxter and J.J. Oliver. Finding Overlapping Components with MML. Statistics and Computing, 10 (2000) 5-16.
[4] J.J. Oliver and R.A. Baxter. MML and Bayesianism: Similarities and Differences. Technical Report TR 206, Dept. of Computer Science, Monash University, Clayton, Victoria 3168, Australia, 1994.
[5] S. Richardson and P. Green. On Bayesian Analysis of Mixtures with an Unknown Number of Components, with Discussion. Journal of the Royal Statistical Society, Series B, 59 (1997) 731-792.
[6] W. Gilks, S. Richardson, and D. Spiegelhalter. Markov Chain Monte Carlo in Practice. Interdisciplinary Statistics, Chapman and Hall, 1996.
[7] J.H. Conway and N.J.A. Sloane. Sphere Packings, Lattices and Groups. Springer-Verlag, London, 1988.
[8] A.R. Barron and T.M. Cover. Minimum Complexity Density Estimation. IEEE Transactions on Information Theory, 37 (1991) 1034-1054.
[9] T. Mitchell. Machine Learning. McGraw-Hill, 1997.
[10] R. Neal. Probabilistic Inference Using Markov Chain Monte Carlo Methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, 1993.
[11] I. Kant. Critique of Pure Reason. Cambridge University Press, 1998.
[12] I. Davidson and K. Yin. Message Length Estimators, Probabilistic Sampling and Optimal Prediction. DIMACS Workshop on Complexity and Inference, 2003.
[13] C.E. Rasmussen. The Infinite Gaussian Mixture Model. In Advances in Neural Information Processing Systems, 12, 2000.


Crane Scheduling with Spatial Constraints: Mathematical Model and Solving Approaches

Yi Zhu and Andrew Lim
Dept of IEEM, Hong Kong University of Science and Technology

Clear Water Bay, Kowloon, Hong Kong

zhuyi,[email protected]

Abstract

In this paper, we examine crane scheduling for ports. This important component of port operations management is studied when the non-crossing spatial constraint, which is common to crane operations, is considered. Our objective is to minimize the latest completion time for all jobs, which is a widely used criterion in practice. We prove that this problem is NP-complete and design a branch-and-bound algorithm to obtain optimal solutions. A simulated annealing meta-heuristic with an effective neighborhood search is designed to find good solutions for larger instances. Elaborate experimental results show that the branch-and-bound algorithm runs much faster than CPLEX and that the simulated annealing approach can obtain near-optimal solutions for instances of various sizes.

1 Introduction

The Port of Singapore Authority (PSA), a large technology corporation located in Singapore, operates one of the busiest ports in the world. PSA handles 17.04 million TEUs annually, or nine percent of global container throughput, in Singapore, the world's largest transshipment hub. PSA is concerned with maximizing throughput at its ports in view of pressures derived from limited port size, high cargo transshipment volumes and limited physical facilities and equipment [1][5].

Crane scheduling and work schedules play an important part in port management. Cranes are the interface between the land and water sides. It is in this multi-channel interface that we find bottlenecks, and where cranes and other cargo-handling equipment (fork lifts, conveyors, etc.) come into operation. We find that much of the operational decision-making done at PSA is based on practical experience and simulation (see, for example, [5]). While the latter is invaluable in general, analytic models can play an enhanced role without the limitations of experience-generated rules-of-thumb or simulation (see [3]). Port systems are usually peculiar, which invalidates existing models. Sabria and Daganzo [4] give a bibliography on port operations focusing on berthing and cargo-handling systems. This bibliography contains some analytic models that have been developed for other ports. Berthing is a widely analyzed port activity, where queuing theory finds application (see, for example, Sabria and Daganzo [4]).

Recently, a set of spatial constraints has been studied in the crane scheduling problem ([6][7]). The most interesting one is the non-crossing constraint, i.e., crane arms cannot be crossed over


each other simultaneously. This is a structural constraint on cranes and crane tracks. In [6][7], the problems are modelled as bipartite graph matching; two simpler variants are solved by dynamic programming algorithms, and heuristic methods are designed to tackle the hardest version.

In [6][7], however, the job-to-crane assignment is performed only once, to maximize the total throughput without taking time into consideration. In practice, the scheduling target is often to complete all the jobs with respect to certain criteria. Thus, in this paper, we augment that work by solving the problem of minimizing the latest completion time of all the jobs, which come in different sizes. We assume that once a crane starts a job, it cannot stop until it completes the whole job, i.e., the jobs are non-preemptive.

2 Mathematical Formulation

Throughout this paper we study the crane scheduling problem that concerns the job-to-crane assignment minimizing the latest completion time. The following assumptions and constraints are considered:

• All jobs have different processing times but the crane rate is constant;

• All jobs are non-preemptive, i.e., once a crane starts a job, it must complete it without any pause or shift;

• Any two cranes cannot do two jobs simultaneously if their arms would cross. Cranes line up on tracks and jobs are in ships berthed along the wharf. We label the cranes and jobs according to their relative spatial positions: cranes 1, 2, 3, . . . are arranged on a line from left to right (or east to west), and jobs 1, 2, 3, . . . are labeled in the same manner (see Figure 1). The non-crossing constraint can then be simply expressed as: if job i is assigned to crane k and job j is assigned to crane l simultaneously, then i < j if and only if k < l.
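The non-crossing condition can be checked mechanically for a candidate schedule. The following sketch uses our own representation (per-job crane assignment, completion time and processing time, all indexed in spatial order), not notation from the paper:

```python
def satisfies_non_crossing(assign, completion, proc):
    """Check the non-crossing constraint for a schedule.

    assign[i], completion[i], proc[i] give the crane, completion time and
    processing time of job i. Two jobs overlap in time if neither finishes
    before the other starts. For overlapping jobs i < j, crane order must
    match job order; overlap on the same crane is also rejected."""
    n = len(assign)
    for i in range(n):
        for j in range(i + 1, n):
            start_i = completion[i] - proc[i]
            start_j = completion[j] - proc[j]
            overlap = completion[i] > start_j and completion[j] > start_i
            if overlap and not assign[i] < assign[j]:
                return False  # crossed arms, or two jobs at once on one crane
    return True
```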

Figure 1: An Illustration of Crane and Job Line

We present the formulation of the problem below:

Parameters:
• m: the number of cranes;
• n: the number of jobs;
• pi: the processing time of job i (1 ≤ i ≤ n);


The decision variables:
• ci: an integer variable representing the completion time of job i (1 ≤ i ≤ n);
• Cmax: an integer variable representing the latest completion time among all jobs;
• xik: a binary variable equal to 1 if and only if job i is assigned to crane k (1 ≤ i ≤ n, 1 ≤ k ≤ m);
• yij: a binary variable equal to 1 if and only if job i completes no later than job j starts, i.e., ci ≤ cj − pj (1 ≤ i, j ≤ n);
• zijkl: a binary variable equal to 1 if and only if job i is assigned to crane k and job j is assigned to crane l.

The mathematical formulation (M is a sufficiently large constant):

Minimize: Cmax
Subject to:
Cmax ≥ ci, ∀1 ≤ i ≤ n (1)
ci − pi ≥ 0, ∀1 ≤ i ≤ n (2)
Σ_{k=1}^{m} xik = 1, ∀1 ≤ i ≤ n (3)
zijkl ≤ xik, ∀1 ≤ i, j ≤ n, ∀1 ≤ k, l ≤ m (4)
zijkl ≤ xjl, ∀1 ≤ i, j ≤ n, ∀1 ≤ k, l ≤ m (5)
xik + xjl − 1 ≤ zijkl, ∀1 ≤ i, j ≤ n, ∀1 ≤ k, l ≤ m (6)
ci − (cj − pj) + yij·M > 0, ∀1 ≤ i, j ≤ n (7)
ci − (cj − pj) − (1 − yij)·M ≤ 0, ∀1 ≤ i, j ≤ n (8)
yij + yji ≥ zijkk, ∀1 ≤ i, j ≤ n, i ≠ j, ∀1 ≤ k ≤ m (9)
yij + yji ≥ zijkl, ∀1 ≤ i < j ≤ n, ∀1 ≤ l < k ≤ m (10)

In the above formulation, constraints (1) and (2) define the properties of the decision variables Cmax and ci. Constraint (3) ensures that each job is assigned to exactly one crane. Constraints (4)-(6) define the variable z (a similar definition can be found in [9]). Constraints (7)-(8) define the properties of the variable y: constraint (7) forces yij = 1 if ci ≤ cj − pj, i.e., when job i finishes no later than job j starts; constraint (8) forces yij = 0 if ci > cj − pj, i.e., when job i finishes after job j starts. Constraint (9) specifies that jobs assigned to the same crane cannot overlap. Finally, constraint (10) specifies the non-crossing constraint: if job i is assigned to crane k and job j is assigned to crane l simultaneously (i.e., their processing durations overlap), then i < j if and only if k < l.

3 NP-completeness Proof

We prove the decision version of the crane scheduling problem is NP-complete:

• Input: the number of cranes m, the number of jobs n, the job processing times pi (1 ≤ i ≤ n), and a positive integer C;
• Question: Is there a job-to-crane assignment such that if job i is assigned to crane k and job j is assigned to crane l simultaneously, then i < j if and only if k < l (the non-crossing constraint is satisfied), and Max ci = C (1 ≤ i ≤ n, where ci denotes the completion time of job i)?

Proof: To prove a problem is NP-complete, we must show it is in NP and NP-hard [2]. It is trivial to show that the crane scheduling problem is in NP: checking whether a solution has latest completion time C can be done in O(n) time, and checking the non-crossing constraint can be accomplished in O(n²) time.

To prove the crane scheduling problem is NP-hard, we show that the set partition problem, which is known to be NP-complete, is reducible to it. The set partition problem is described as follows:

• Input: an integer set S = {s1, s2, . . . , sn};
• Question: Can S be partitioned into two disjoint subsets S1 and S2 such that Σ_{si∈S1} si = Σ_{si∈S2} si = hs, where hs = (1/2) Σ_{si∈S} si?

The reduction algorithm takes an instance S of the set partition problem. The constructed crane scheduling instance has 2 cranes and n + 2 jobs; the processing time of both job 1 and job (n + 2) is hs, and job (i + 1) has processing time si, where 1 ≤ i ≤ n. For example, if the set S is {1, 3, 4, 6}, then the crane scheduling problem has 6 jobs with processing times 7, 1, 3, 4, 6 and 7 respectively. Obviously, this transformation runs in polynomial time.
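The instance construction in this reduction can be sketched as follows (a minimal illustration; the tuple returned is our own packaging of the instance):

```python
def partition_to_crane_instance(S):
    """Build the crane-scheduling instance used in the NP-hardness reduction.

    Given an integer set S of size n, the instance has 2 cranes and n + 2
    jobs: jobs 1 and n + 2 both take hs = sum(S) / 2, and job i + 1 takes
    S[i]. The target latest completion time is 2 * hs."""
    hs = sum(S) / 2
    proc = [hs] + list(S) + [hs]   # processing times of jobs 1 .. n + 2
    return 2, proc, 2 * hs          # cranes, processing times, target Cmax

# e.g. S = {1, 3, 4, 6} gives 6 jobs with times 7, 1, 3, 4, 6, 7
```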

To complete the proof, we show that this transformation is indeed a reduction: the set S can be partitioned into two equal subsets if and only if all the n + 2 jobs can be completed in 2hs time.

Suppose set S can be partitioned into two equal subsets S1 and S2; then perform the job-to-crane assignments in two steps:

• Step 1: we assign job 1 to crane 1. Since crane 1 needs time hs to complete job 1, we can assign jobs j1, j2, . . . , jk to crane 2, where s_{ji−1} ∈ S1, 1 ≤ i ≤ k. Since the sum of the processing times of jobs j1, j2, . . . , jk is hs, and the non-crossing constraint does not come into play (crane 1 is doing the first job, and the labels of the jobs done by crane 2 are all greater than 1), crane 2 can complete jobs j1, j2, . . . , jk in hs time.

• Step 2: we assign job n + 2 to crane 2. Similarly, we assign jobs j′1, j′2, . . . , j′k′ to crane 1, where s_{j′i−1} ∈ S2, 1 ≤ i ≤ k′. Again the non-crossing constraint does not come into play: crane 2 is doing the last job and the labels of the jobs done by crane 1 are all less than n + 2, so crane 1 can complete jobs j′1, j′2, . . . , j′k′ in hs time.

Hence all the n + 2 jobs can be completed in 2hs time if the set S can be partitioned into two equal subsets.

Conversely, suppose all the n + 2 jobs can be done in 2hs time; then both cranes are fully utilized, as the sum of the processing times of all the jobs is 4hs. We can then conclude that job 1 and job n + 2 must be done by crane 1 and crane 2 respectively (if job 1 were done by crane 2, or job n + 2 by crane 1, the non-crossing constraint would take effect and the cranes could not be fully utilized). Assume jobs j1, j2, . . . , jk are assigned to crane 1 in addition to job 1, and jobs j′1, j′2, . . . , j′k′ are assigned to crane 2 in addition to job n + 2; then Σ_{i=1}^{k} p_{ji} = Σ_{i=1}^{k′} p_{j′i} = hs, which means the set S can be partitioned into two equal subsets S1 = {s_{j1−1}, s_{j2−1}, . . . , s_{jk−1}} and S2 = {s_{j′1−1}, s_{j′2−1}, . . . , s_{j′k′−1}}.


Hence the set S can be partitioned into two equal subsets if all the n + 2 jobs can be completed in 2hs time. Thus the crane scheduling problem is NP-hard, and it is NP-complete since it is in NP, as shown at the beginning of this section.

4 The Branch-and-Bound Algorithm

In this section, we discuss our branch-and-bound algorithm for solving the crane scheduling problem optimally. In this and the next section, ai denotes the crane assigned to job i and ci denotes the completion time of job i.

In the branch-and-bound search procedure, we make one job-to-crane assignment decision in each step (thus n steps are needed). So in every step, the n jobs can be divided into two disjoint sets: the set Sa of jobs that have been assigned to cranes, and the set Su of jobs that have not.

There are some important aspects to consider when implementing the branch-and-bound algorithm. Due to space limitations, we only discuss the most important component here: the bounding functions. The two lower bounds we use are:

• Lower Bound 1: Let CLk denote the latest job completion time on crane k in the current step, i.e., CLk = Max{ci : i ∈ Sa, ai = k}, and let C′ = Max{ci : i ∈ Sa} denote the current latest completion time. Then the following inequality holds:

C′ + Max{0, ⌈(Σ_{i∈Su} pi − Σ_{k=1}^{m} (C′ − CLk)) / m⌉} ≤ OPT

Intuitively, this lower bound assumes that all remaining unassigned jobs are preemptive and ignores the non-crossing constraint: we sum the processing times of all unassigned jobs and distribute the work uniformly over the cranes.

• Lower Bound 2: In each step, for each unassigned job i and each crane k, we define the Earliest Assignment Time, EAT(i, k), which indicates the earliest possible time at which job i could start if it were assigned to crane k in this step. The second lower bound treats the jobs as non-preemptive and assigns each job to the crane that can complete it earliest, i.e., with the least EAT value:

Max_{i=1}^{n} {Min_{k=1}^{m} EAT(i, k) + pi} ≤ OPT

Both lower bounds can be computed efficiently (O(n) for the first and O(mn) for the second). We compute the two lower bounds in each step of the search to prune branches.
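The two bounding computations can be sketched as follows. This is our reading of the formulas above, with our own function names; in Lower Bound 1 we divide the leftover work by the number of cranes m, matching the "distribute uniformly over the cranes" intuition:

```python
import math

def lower_bound_1(C_prime, CL, unassigned_p, m):
    """LB1: treat remaining jobs as preemptive, ignore non-crossing,
    and spread the leftover work uniformly over the m cranes.

    C_prime: current latest completion time; CL[k]: latest completion
    time on crane k; unassigned_p: processing times of jobs in Su."""
    slack = sum(C_prime - CL[k] for k in range(m))  # idle time up to C'
    extra = max(0, math.ceil((sum(unassigned_p) - slack) / m))
    return C_prime + extra

def lower_bound_2(eat, proc):
    """LB2: each unassigned job i goes to the crane with the least
    earliest-assignment time; the job finishing last bounds the optimum.

    eat[i][k] = EAT(i, k); proc[i] = processing time of job i."""
    return max(min(row) + p for row, p in zip(eat, proc))
```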

This branch-and-bound algorithm is observed to be efficient compared with the CPLEX solver in our experiments. The details are presented in the experimental results section.

5 A Simulated Annealing Heuristic Algorithm

5.1 Neighborhood Moves

We develop a neighborhood move based on graph search. The objective of the neighborhood search is to find better solutions by reassigning or shifting a subset of jobs to cranes. As the optimality criterion is to minimize the latest completion time Cmax, the jobs that complete exactly at time Cmax are the "trouble makers" (there may be one or several); they must be shifted forward, otherwise the objective value cannot improve. However, shifting or reassigning only those jobs is not possible. Using Figure 2 as an example, Job 7 cannot be shifted forward because of the non-crossing constraint with Job 8; Job 6 cannot be shifted forward because of the non-crossing constraint with Job 5; Job 9 cannot be shifted forward because it cannot overlap with Job 10; and so on. Hence, shifting a job a forward may require cascading changes: we must first shift job b if a is constrained by b; then jobs c and d must be shifted before b if b is constrained by c and d, and so on, until reaching jobs that are not constrained by others. Thus, an effective neighborhood move must reassign the jobs that constrain the movement of the jobs with completion time Cmax.

To represent the relationships between jobs more clearly, and to ease our search, we define the Job Constraining Graph (JCG), a directed graph depicting constraining relationships among jobs. Each node in the JCG corresponds to one job; the directed edge (i, j) exists if and only if job i is constrained by job j, i.e., job i cannot start before job j completes when the non-crossing constraint (or overlap on one crane) is taken into account. Figure 3 shows the JCG for the example in Figure 2.

Figure 2: A Simple Problem Instance

Figure 3: A Job Constraining Graph Example

With the aid of the JCG, it is easy to detect the subset of jobs that constrain the "trouble makers": we perform a breadth-first search from the "trouble maker" nodes along the directed edges, all the way to the nodes with no outgoing edges. In Figure 3, the gray nodes are those identified by the breadth-first search from the "trouble maker" node, Job 7.
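The breadth-first collection of constraining jobs can be sketched as follows; the adjacency-dict representation of the JCG is our own choice:

```python
from collections import deque

def constraining_jobs(jcg, trouble_makers):
    """Breadth-first search on the Job Constraining Graph.

    jcg[i] lists the jobs that constrain job i (the directed edges i -> j).
    Returns the set of jobs to remove and reinsert, including the
    trouble makers themselves."""
    seen = set(trouble_makers)
    queue = deque(trouble_makers)
    while queue:
        i = queue.popleft()
        for j in jcg.get(i, []):
            if j not in seen:
                seen.add(j)
                queue.append(j)
    return seen
```

For the Figure 3 example, starting from Job 7 this collects Jobs 7, 8, 6 and 5 under the hypothetical edge set used in the test below.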

After obtaining this subset of nodes from the breadth-first search on the JCG, we remove them from the current solution and insert them back after reordering. The insertion follows a greedy fashion: each job is inserted at the earliest possible place among all the cranes. During the insertion, all jobs that were not removed stay fixed. Of course, different solutions can be obtained if the insertion order varies; in our implementation, we perform the insertion several times and select the one with the best cost.

Our neighborhood moves guarantee the feasibility of the obtained neighborhood solutions, which is not easy to achieve with commonly used neighborhood moves. The solutions are also diverse, as we perform stochastic reordering.

5.2 Simulated Annealing Algorithm Framework

We can now describe our SA approach. In each iteration of SA, the above-mentioned neighborhood search is performed and one move is generated. Each neighborhood move is accepted if it improves the objective function value; other solutions are accepted according to a probabilistic criterion. The acceptance probabilities are based on the annealing process and are a function of the system temperature.

The SA heuristic framework that we employ is as follows:


Page 283: Artificial Intelligence and Mathematics - Rutgers Universityrutcor.rutgers.edu/~amai/aimath04/AcceptedPapers/aimath-I.pdf · Artificial Intelligence and Mathematics January 4-6, 2004

1. Get an initial solution xnow. Set xbest = xnow;

2. Set the annealing temperature T;

3. Perform the neighborhood search and generate a move; calculate the delta cost ∆;

4. Decide whether to perform the generated neighborhood move, with probability po = a · exp(−∆/(k · T)), where the constants a and k determine the acceptance rate. If the move is performed and the cost is smaller than that of xbest, update xbest;

5. Decrease T by the cooling-rate factor d: T = T · d. If T does not meet the termination criterion, go to Step 3.
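The steps above can be sketched as a single loop. The move-generator interface and the demonstration constants are assumptions for illustration (the paper does not specify values for a and k):

```java
import java.util.Random;
import java.util.function.Supplier;

// Sketch of the SA framework: generate a neighborhood move, always accept
// improvements, accept worse moves with probability p = a * exp(-delta / (k * T)),
// and cool T geometrically by the factor d until it reaches tMin.
public class SimulatedAnnealing {
    public static double run(Supplier<Double> moveDelta, double cost,
                             double a, double k, double d,
                             double t0, double tMin, long seed) {
        Random rng = new Random(seed);
        double best = cost;                              // x_best
        for (double T = t0; T > tMin; T *= d) {          // Step 5: cooling schedule
            double delta = moveDelta.get();              // Step 3: one neighborhood move
            double p = a * Math.exp(-delta / (k * T));   // Step 4: acceptance probability
            if (delta < 0 || rng.nextDouble() < p) {
                cost += delta;                           // perform the move
                if (cost < best) best = cost;            // update x_best
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Random r = new Random(7);
        // Hypothetical move generator: random cost change in [-10, 10).
        double best = run(() -> r.nextDouble() * 20 - 10, 500.0,
                          1.0, 1.0, 0.999, 100.0, 1e-5, 7L);
        System.out.println("best cost found: " + best);
    }
}
```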

6 Experimental Results

We conducted a series of experiments to test the correctness and efficiency of our branch-and-bound (B&B) and SA algorithms. Both algorithms are coded in Java and run on a Pentium IV 1.6 GHz machine. For comparison, we use CPLEX 7.1.0, installed on a Pentium IV 2.5 GHz machine, to solve the formulation presented in Section 2. The initial temperature T of SA is 100, and the cooling rate d is 0.999. The SA process terminates when T drops to 10⁻⁵ or the number of iterations exceeds 10⁴.

We design several categories of instances to test various aspects of our algorithms.

1. Random Instances with Small Sizes

We create 10 small instances. The processing times of the jobs are randomly generated in the interval [10, 40]. We run the CPLEX solver, the branch-and-bound algorithm, and the SA algorithm on the 10 instances; the results are shown in Table 1. We observe that our B&B algorithm runs much faster than the CPLEX solver, even though both are exact methods, and that SA obtains optimal solutions in a short time. We believe CPLEX performs poorly because it employs a general branch-and-bound method, while our B&B is specific to this problem; in addition, the formulation contains many constraints that slow down the computation. (Instances 5 and 10, marked with *, caused the program to run out of memory; hence only feasible solutions, which may not be optimal, are reported.)

2. Specially Designed Instances with Small Sizes

We designed 10 small instances. In each instance, the processing time of one job is very large (500–1000) and the others are all small (10–40). Hence the optimal value is exactly that large processing time (if there are more than two cranes), since we can complete all the other jobs while the job with the large processing time is being handled. Table 2 shows the test results. Our B&B obtains all the solutions instantly, taking advantage of our bounding functions, while CPLEX still consumes much time. SA takes longer than B&B because of its iterative improvement, but the solutions it obtains are all optimal.

3. Random Instances with Medium Sizes

Ten medium-sized instances are generated to compare the B&B algorithm and SA, as CPLEX cannot handle test data of this size. The processing times are generated in the interval [10, 50]. Table 3 reports the solutions. We can see that SA is still able to obtain optimal solutions, and its running time is superior to that of the B&B method.



4. Random Instances with Large Sizes

To test the performance of our SA algorithm thoroughly, we conduct experiments on groups of large instances. We select three values of m (the number of cranes): 5, 10, and 15; for each m, we choose n (the number of jobs) to be 5 to 10 times m; and for each combination of m and n, 10 instances are generated. In total, 180 instances are generated, divided into 18 groups. The processing time of each job is generated randomly from the interval [50, 150].

We compare our SA results with the lower bounds calculated using the two bounding functions mentioned in Section 4. Table 4 shows the summarized results. Note that each value in the table is the average over the 10 instances in the group. The excess percentage (exp) is calculated as follows: exp = (VSA − VLB)/VLB × 100%.
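As a concrete illustration, the excess-percentage formula in code form; the sample values are made up and do not come from the tables:

```java
// Sketch: excess of the SA objective value over the lower bound,
// exp = (V_SA - V_LB) / V_LB * 100%.
public class ExcessPercentage {
    public static double exp(double vSA, double vLB) {
        return (vSA - vLB) / vLB * 100.0;
    }

    public static void main(String[] args) {
        // Hypothetical values: V_SA = 150, V_LB = 100.
        System.out.println(exp(150.0, 100.0)); // prints 50.0
    }
}
```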

All the results are very close to their lower bounds, especially for the first few groups, so we believe the solutions are very near optimal. Another interesting observation is that, for each of the three crane counts, the exp percentage decreases as the number of jobs increases.

Instance   Size    CPLEX              B&B               SA
No.        (n×m)   value  CPU(sec)    value  CPU(sec)   value  CPU(sec)
1          8×3     71     105.45      71     0.063      71     8.687
2          9×3     86     84.36       86     0.250      86     10.688
3          10×3    85     783.70      85     1.250      85     13.203
4          11×3    95     2316.38     95     5.047      95     13.765
5          12×3    94*    19869.20    94     41.828     94     16.656
6          8×4     62     17.55       62     0.047      62     8.203
7          9×4     63     59.22       63     0.094      63     10.484
8          10×4    66     3305.31     66     0.500      66     11.641
9          11×4    60     7955.19     60     3.078      60     14.719
10         12×4    86*    5330.89     81     12.875     81     16.219

Table 1: Results on Random Instances with Small Sizes

7 Conclusion

In this paper we studied the crane scheduling problem under the non-crossing constraint, an important and practical spatial constraint. We showed that the problem

Instance   Size    CPLEX              B&B               SA
No.        (n×m)   value  CPU(sec)    value  CPU(sec)   value  CPU(sec)
1          16×3    500    33.23       500    0.031      500    23.563
2          17×3    400    165.63      400    0.015      400    34.062
3          18×3    675    119.77      675    0.015      675    30.109
4          19×3    1375   263.69      1375   0.031      1375   32.266
5          20×3    1643   127.03      1643   0.015      1643   54.266
6          16×4    1700   68.11       1700   0.016      1700   28.531
7          17×4    1200   543.23      1200   0.031      1200   70.969
8          18×4    1500   2977.53     1500   0.016      1500   53.687
9          19×4    1976   102.52      1976   0.047      1976   79.953
10         20×4    2700   165.36      2700   0.015      2700   72.719

Table 2: Results on Specially Designed Instances with Small Sizes



Instance   Size    B&B                SA
No.        (n×m)   value  CPU(sec)    value  CPU(sec)
1          13×3    122    271.313     122    17.218
2          13×3    113    250.547     113    17.500
3          13×4    98     109.078     98     21.485
4          13×4    94     174.172     94     20.235
5          14×3    136    1728.578    136    19.453
6          14×3    130    2132.062    130    19.719
7          14×4    112    939.265     112    20.641
8          14×4    106    1010.578    106    23.172
9          15×3    141    14134.297   141    24.422
10         15×4    102    3390.250    102    25.531

Table 3: Results on Random Instances with Medium Sizes

Group   Size     value   exp(%)   CPU(sec)
1       25×5     494     1.63     708
2       30×5     594     0.92     1015
3       35×5     698     0.72     1470
4       40×5     814     0.44     2007
5       45×5     926     0.54     2653
6       50×5     980     0.52     3404
7       50×10    524     3.03     3185
8       60×10    613     2.10     4995
9       70×10    720     2.10     7009
10      80×10    814     1.70     9292
11      90×10    922     1.75     12912
12      100×10   1022    1.33     16616
13      75×15    537     6.27     8284
14      90×15    637     6.88     12186
15      105×15   724     4.05     17620
16      120×15   832     3.71     25741
17      135×15   928     2.81     34239
18      150×15   1016    1.74     43080

Table 4: Results on Random Instances with Large Sizes

is NP-complete and provided an integer programming model for it. We devised a branch-and-bound algorithm to solve instances of moderate size, and showed that our approach outperforms the CPLEX solver applied to the mathematical model. We also devised a new graph-search-based neighborhood search and used it in a simulated annealing framework to tackle large instances of the problem. The results obtained by our simulated annealing algorithm are optimal or near-optimal.

