Integrating Logic and Probability: Algorithmic Improvements in
Markov Logic Networks
Marenglen Biba
Department of Computer Science
University of Bari, Italy
DISSERTATION submitted in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY in Computer Science
2009
Reading Committee
1. Advisor: Professor Floriana Esposito
2. Reviewer:
3. Reviewer:
4. Reviewer:
Signature from head of PhD committee:
Abstract
This dissertation proposes novel algorithms for learning and inference in Markov
Logic Networks.
Statistical Relational Learning tackles one of the problems that has challenged
Machine Learning since its birth: integrating logic and probability in learning.
Markov Logic is a powerful representation formalism that combines full first-order
logic with probabilistic graphical models by attaching weights to first-order for-
mulas and viewing these as templates for features of Markov Networks (MNs).
Markov Logic Networks (MLNs) together with a set of constants define ground
MNs. MLNs preserve the expressivity of first-order logic and take advantage of
algorithms for probabilistic graphical models; they are therefore a powerful model
for dealing with structured, noisy and uncertain data.
The rich expressivity of MLNs comes at a high computational cost for learning and
inference. Structure learning is the task of learning the logical clauses together
with their weights; it is a computationally hard task, involving a search through a
huge space of hypotheses with many local optima for the evaluation function.
Therefore, robust algorithms for structure learning of MLNs are needed. This
dissertation proposes a novel generative structure learning algorithm based on the
iterated local search metaheuristic. An extensive empirical study using real-world
benchmark datasets shows that the algorithm improves predictive accuracy and
learning time compared to the state-of-the-art algorithms.
Generative structure learning algorithms optimize the joint distribution of all the
variables. This can lead to suboptimal results for predictive tasks because of the
mismatch between the objective function used (likelihood or a function thereof)
and the goal of classification (maximizing accuracy or conditional likelihood). In
contrast, discriminative approaches maximize the conditional likelihood of a set
of outputs given a set of inputs and this often produces better results for predic-
tion problems. Unfortunately, the computational cost of optimizing structure and
parameters for conditional likelihood is prohibitive. This dissertation proposes
novel discriminative structure learning algorithms based on the simple approxi-
mation of choosing structures by maximizing conditional likelihood while setting
parameters by maximum likelihood. Extensive experiments in real-world domains
show that the proposed discriminative algorithms improve over state-of-the-art
generative structure learning and discriminative weight learning algorithms.
Inference in graphical models is NP-hard. For MLNs, MAP inference can be
performed through SAT solvers. This dissertation proposes the IRoTS algorithm for
MAP inference in MLNs and shows through experiments that it is a high-performing
algorithm, improving over the existing state-of-the-art algorithm in terms of
solution quality and inference running times. Moreover, in statistical relational
learning, both probabilistic and deterministic dependencies must be handled.
This dissertation extends IRoTS by proposing MC-IRoTS, an algorithm that com-
bines MCMC methods and SAT solvers for the problem of conditional inference
in MLNs. Empirical evaluation on real-world data shows good improvements over
the state-of-the-art algorithm for conditional inference in MLNs.
For my parents
Acknowledgements
There are a lot of people that I would like to acknowledge for having been of
support during this long period of hard work. I will start with my colleagues who,
every day, have shared with me my work on machine learning research. I would
like to thank Floriana Esposito for having given me both freedom and good advice
for research; she taught me the importance of high quality research and inspired
me to investigate pattern recognition. Many thanks to Stefano Ferilli for having
shared with me all the ideas of my research and for having been of great support
during all these years; he rounded out my background on logic programming and
relational learning, taught me the importance of empirical evaluation in machine
learning and encouraged me to publish. Thanks to Nicola Di Mauro from whom I
received precious advice and ideas on metaheuristics. Thanks also to Teresa Basile
for her careful suggestions on our joint works.
I would like to thank Nicola Fanizzi for many helpful discussions on machine
learning topics and for having shared with me my curiosity on Linux, LaTeX and
kernels. Thanks also to Claudia d’Amato for having been a great colleague in the
LACAM laboratory. Many thanks also to all the colleagues at Dipartimento di
Informatica, Bari.
Part of this research was carried out at the Department of Computer Science,
University of Washington, Seattle. Special acknowledgement goes to Pedro Domingos
of University of Washington, for having given me the possibility to deepen my
knowledge on statistical relational learning by visiting his machine learning group.
He made my period in Seattle very productive and I learned from him the impor-
tance of practical machine learning. I would like to thank all the other members of
the machine learning group at UW-CSE: Stanley Kok for having shared with me
the results on structure learning, Marc Sumner for his help on Alchemy, Hoifung
Poon for useful discussions on MC-SAT, Parag Singla for helpful discussions on
discriminative learning, Jesse Davis for helpful talks on machine learning topics
and Daniel Lowd for his help with PSCG and for having shared with me his re-
sults. I would like to thank also Liliana Mihalkova of University of Texas for her
help with BUSL.
Of great help to me were Mary Giordano and her family in Seattle, who made my
stay there a very exciting experience. Thank you all for your support and for the
splendid time we had together in Seattle.
I would like to thank my girlfriend Eni for having been of great support during
these years of hard work and for having shared with me all my difficult moments.
She made me understand the great value of taking care of people and giving love
to them. Thank you Eni!
Finally, I would like to thank my parents who helped me understand the sense of
life. They made my work easier by giving me good advice and by encouraging
me to continue my way. Thanks also to my brother and his family for their help
during these years.
Contents
List of Figures ix
List of Tables xi
List of Algorithms xiv
1 Introduction 1
1.1 Statistical Models of Relational Data . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Overview of this Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Statistical Relational Learning 9
2.1 Probabilistic Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Bayesian Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.2 Markov Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.3 Structure Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.4 Parameter Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.5 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 First-Order Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 Inductive Logic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.1 Learning from entailment . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.2 Learning from interpretations . . . . . . . . . . . . . . . . . . . . . . 19
2.3.3 Learning from proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4 Probabilistic Inductive Logic Programming . . . . . . . . . . . . . . . . . . . 20
2.4.1 Learning from Probabilistic Entailment . . . . . . . . . . . . . . . . . 20
2.4.2 Learning from Probabilistic Interpretations . . . . . . . . . . . . . . . 21
2.4.3 Learning from Probabilistic Proofs . . . . . . . . . . . . . . . . . . . . 21
2.5 SRL and PILP models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5.1 Knowledge-based Model Construction . . . . . . . . . . . . . . . . . . 22
2.5.2 Probabilistic Relational Models . . . . . . . . . . . . . . . . . . . . . 23
2.5.3 Bayesian Logic Programs . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5.4 Stochastic Logic Programs . . . . . . . . . . . . . . . . . . . . . . . . 24
2.5.5 PRISM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5.6 Relational Dependency Networks . . . . . . . . . . . . . . . . . . . . 26
2.5.7 nFOIL, TFOIL and kFOIL . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5.8 Other models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3 Markov Logic Networks 29
3.1 Markov Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Structure Learning of MLNs . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.1 Pseudo-likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.2 Two-step Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2.3 Single-step Learning by Optimizing Weighted Pseudo-likelihood . . . . 38
3.2.4 Bottom-up Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3 Parameter Learning of MLNs . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3.1 Generative approaches . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3.2 Discriminative Approaches . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4 Inference in MLNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.1 MAP Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.2 Conditional Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4 The GSL algorithm 47
4.1 The Iterated Local Search metaheuristic . . . . . . . . . . . . . . . . . . . . . 47
4.2 Generative Structure Learning using ILS . . . . . . . . . . . . . . . . . . . . . 50
4.2.1 The Perturbation Component . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.2 The Local Search Component . . . . . . . . . . . . . . . . . . . . . . 53
4.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.1 Link Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.2 Entity Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.3.3 Systems and Methodology . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5 The ILS-DSL algorithm 65
5.1 Setting Parameters through Likelihood . . . . . . . . . . . . . . . . . . . . . . 66
5.2 Scoring Structures through Conditional Likelihood . . . . . . . . . . . . . . . 67
5.3 Discriminative Structure Learning using ILS . . . . . . . . . . . . . . . . . . . 68
5.3.1 The ILS-DSLCLL version . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.3.2 The ILS-DSLAUC version . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.4.1 Link Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.4.2 Entity Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.4.3 Systems and Methodology . . . . . . . . . . . . . . . . . . . . . . . . 77
5.4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6 The RBS-DSL algorithm 87
6.1 The GRASP metaheuristic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.2 Randomized Beam Discriminative Structure Learning . . . . . . . . . . . . . . 88
6.2.1 The RBS-DSLCLL version . . . . . . . . . . . . . . . . . . . . . . . . 93
6.2.2 The RBS-DSLAUC version . . . . . . . . . . . . . . . . . . . . . . . . 93
6.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.3.1 Systems and Methodology . . . . . . . . . . . . . . . . . . . . . . . . 95
6.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
7 The IRoTS and MC-IRoTS algorithms 103
7.1 MAP/MPE inference using IRoTS . . . . . . . . . . . . . . . . . . . . . . . . 104
7.1.1 The SAT and MAX-SAT problems . . . . . . . . . . . . . . . . . . . . 104
7.1.2 Iterated Robust Tabu Search . . . . . . . . . . . . . . . . . . . . . . . 106
7.1.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.2 Conditional Inference for MLNs using MC-IRoTS . . . . . . . . . . . . . . . 117
7.2.1 The SampleIRoTS algorithm: Combining MCMC and IRoTS . . . . . 119
7.2.2 The MC-IRoTS algorithm . . . . . . . . . . . . . . . . . . . . . . . . 121
7.2.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
7.3 Discriminative Parameter Learning . . . . . . . . . . . . . . . . . . . . . . . . 125
7.3.1 Optimizing Conditional Likelihood for Weight Learning . . . . . . . . 125
7.3.2 Learning MLNs Weights by Sampling with MC-IRoTS . . . . . . . . . 128
7.3.3 Experiments on Web Page Classification . . . . . . . . . . . . . . . . . 131
7.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
8 Conclusion 135
8.1 Contributions of this Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . 135
8.2 Directions for Future Research . . . . . . . . . . . . . . . . . . . . . . . . . . 137
8.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
Appendix A The MLN++ Package 141
References 143
List of Figures
2.1 Example of the graph structure of a Bayesian Network . . . . . . . . . . . . . 10
2.2 Example of the graph structure of a Markov Network . . . . . . . . . . . . . . 13
3.1 Example of a knowledge base in first-order logic . . . . . . . . . . . . . . . . 33
3.2 Example of a knowledge base in Markov Logic . . . . . . . . . . . . . . . . . 33
3.3 Partial construction of the nodes of the ground Markov Network . . . . . . . . 34
3.4 Complete construction of the nodes of the ground Markov Network . . . . . . 34
3.5 Connecting nodes whose predicates appear in some ground formula . . . . . . 35
3.6 Complete construction of the structure of the graph for the Markov Network . . 35
4.1 The Iterated Local Search schema . . . . . . . . . . . . . . . . . . . . . . . . 50
List of Tables
2.1 Conditional Probability Tables (CPTs) for all variables . . . . . . . . . . . . . 11
4.1 All predicates in the UW-CSE domain . . . . . . . . . . . . . . . . . . . . . . 57
4.2 All predicates in the CORA domain . . . . . . . . . . . . . . . . . . . . . . . 57
4.3 Accuracy results on UW-CSE for ten parallel independent walks of GSL . . . . 58
4.4 Accuracy comparison of GSL, BUSL and BS on the UW-CSE dataset . . . . . 59
4.5 Learning times (in minutes) on UW-CSE for ten parallel independent walks of
GSL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.6 Comparison of learning times (in minutes) on UW-CSE for GSL, BUSL and BS 60
4.7 Accuracy results on CORA for ten parallel independent walks of GSL . . . . . 61
4.8 Accuracy comparison of GSL with BUSL on the CORA dataset . . . . . . . . 61
4.9 Learning times (in minutes) on CORA for ten parallel independent walks of GSL 62
4.10 Comparison of learning times (in minutes) on CORA for GSL and BUSL . . . 62
5.1 CLL results for the query predicate advisedBy in the UW-CSE domain . . . . . 80
5.2 AUC results for the query predicate advisedBy in the UW-CSE domain . . . . 80
5.3 CLL results for all query predicates in the CORA domain . . . . . . . . . . . . 81
5.4 AUC results for all query predicates in the CORA domain . . . . . . . . . . . . 81
6.1 CLL results for the query predicate advisedBy in the UW-CSE domain . . . . . 99
6.2 AUC results for the query predicate advisedBy in the UW-CSE domain . . . . 99
6.3 CLL results for all query predicates in the CORA domain . . . . . . . . . . . . 100
6.4 AUC results for all query predicates in the CORA domain . . . . . . . . . . . . 100
7.1 Inference results in terms of cost of false clauses for query predicate advisedBy
for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts, using
MLNs learned by running PSCG for 500 iterations . . . . . . . . . . . . . . . 111
7.2 Running times (in minutes) for the same number of search steps for query
predicate advisedBy for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu
with restarts, using MLNs learned by running PSCG for 500 iterations . . . . . 112
7.3 Inference results in terms of cost of false clauses for query predicate advisedBy
for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts, using
MLNs learned by running PSCG for 10 hours . . . . . . . . . . . . . . . . . . 113
7.4 Running times (in minutes) for the same number of search steps for query
predicate advisedBy for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu
with restarts, using MLNs learned by running PSCG for 10 hours . . . . . . . . 113
7.5 Inference results in terms of cost of false clauses for query predicate advisedBy
for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts, using
MLNs learned by running PSCG for 50 iterations with both advisedBy and
tempAdvisedBy as non-evidence predicates . . . . . . . . . . . . . . . . . . . 114
7.6 Running times (in minutes) for the same number of search steps for query
predicate advisedBy for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu
with restarts, using MLNs learned by running PSCG for 50 iterations with both
advisedBy and tempAdvisedBy as non-evidence predicates . . . . . . . . . . . 114
7.7 Inference results in terms of cost of false clauses for query predicate tempAd-
visedBy for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts,
using MLNs learned by running PSCG for 50 iterations with both advisedBy
and tempAdvisedBy as non-evidence predicates . . . . . . . . . . . . . . . . . 115
7.8 Running times (in minutes) for the same number of search steps for query
predicate tempAdvisedBy for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-
Tabu with restarts, using MLNs learned by running PSCG for 50 iterations with
both advisedBy and tempAdvisedBy as non-evidence predicates . . . . . . . . 115
7.9 Inference results in terms of cost of false clauses for query predicates ad-
visedBy and tempAdvisedBy in a single inference task, for IRoTS, MaxWalkSAT-
Tabu and MaxWalkSAT-Tabu with restarts, using MLNs learned by running
PSCG for 50 iterations with both advisedBy and tempAdvisedBy as non-evidence
predicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.10 Running times (in minutes) for the same number of search steps for query pred-
icates advisedBy and tempAdvisedBy in a single inference task, for IRoTS,
MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts, using MLNs learned
by running PSCG for 50 iterations with both advisedBy and tempAdvisedBy
as non-evidence predicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.11 Inference results in terms of cost of false clauses for query predicates taugh-
tBy, advisedBy and tempAdvisedBy in a single inference task, for IRoTS,
MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts, using MLNs learned
by running PSCG for 50 iterations with the three predicates as non-evidence
predicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.12 Running times (in minutes) for the same number of search steps for query pred-
icates taughtBy, advisedBy and tempAdvisedBy in a single inference task, for
IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts, using MLNs
learned by running PSCG for 50 iterations with the three predicates as non-
evidence predicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.13 Inference running times for 1000 samples in the CORA domain . . . . . . . . 122
7.14 Accuracy results of inference for 1000 samples in the CORA domain . . . . . . 124
7.15 Accuracy results of inference for 1000 samples for the advisedBy predicate
based on the MLNs generated with 500 iterations of PSCG . . . . . . . . . . . 125
7.16 Inference running times (in seconds) for 1000 samples for the predicate ad-
visedBy based on the MLNs generated with 500 iterations of PSCG . . . . . . 125
7.17 Accuracy results of inference for 1000 samples for the advisedBy predicate
based on the MLNs generated by running PSCG for 10 hours on the training
data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
7.18 Inference running times (in seconds) for 1000 samples for the predicate ad-
visedBy based on the MLNs generated by running PSCG for 10 hours on the
training data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
7.19 Accuracy results of inference for 1000 samples for the advisedBy predicate
based on the MLNs generated by running PSCG with both advisedBy and tem-
pAdvisedBy as non-evidence predicates . . . . . . . . . . . . . . . . . . . . . 127
7.20 Inference running times (in seconds) for 1000 samples for the predicate ad-
visedBy based on the MLNs generated by running PSCG with both advisedBy
and tempAdvisedBy as non-evidence predicates . . . . . . . . . . . . . . . . . 127
7.21 Accuracy results of inference for 1000 samples for the predicate tempAd-
visedBy based on the MLNs generated by running PSCG with both advisedBy
and tempAdvisedBy as non-evidence predicates . . . . . . . . . . . . . . . . . 128
7.22 Inference running times (in seconds) for 1000 samples for the predicate tem-
pAdvisedBy based on the MLNs generated by running PSCG with both ad-
visedBy and tempAdvisedBy as non-evidence predicates . . . . . . . . . . . . 128
7.23 Accuracy results of inference for 1000 samples with both query predicates ad-
visedBy and tempAdvisedBy in a single inference task . . . . . . . . . . . . . 129
7.24 Inference running times (in seconds) for 1000 samples with both query predi-
cates advisedBy and tempAdvisedBy in a single inference task . . . . . . . . . 129
7.25 Accuracy results of inference for 1000 samples with query predicates taughtBy,
advisedBy and tempAdvisedBy in a single inference task . . . . . . . . . . . . 130
7.26 Inference running times (in seconds) for 1000 samples with the query predi-
cates taughtBy, advisedBy and tempAdvisedBy in a single inference task . . . 130
7.27 Accuracy results for classifying webpages of students . . . . . . . . . . . . . . 131
7.28 Accuracy results for classifying webpages of faculty members . . . . . . . . . 132
7.29 Accuracy results for classifying webpages of research projects . . . . . . . . . 132
7.30 Accuracy results for classifying webpages of courses . . . . . . . . . . . . . . 133
7.31 Overall accuracy results for web page classification in the WebKB domain . . . 133
List of Algorithms
4.1 The Iterated Local Search algorithm . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 The GSL algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3 The SearchBestClause component of GSL . . . . . . . . . . . . . . . . . . . . 52
4.4 The local search component of GSL . . . . . . . . . . . . . . . . . . . . . . . 54
5.1 The ILS-DSL algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.2 The SearchBestClause component of ILS-DSL . . . . . . . . . . . . . . . . . 70
5.3 The subsidiary procedure LocalSearch and the Step function of ILS-DSL . . . 71
6.1 The GRASP metaheuristic. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.2 The RBS-DSL algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.3 The SearchBestClause procedure of the RBS-DSL algorithm . . . . . . . . . . 90
6.4 Randomized Construction of the best WPLL candidate list . . . . . . . . . . . 91
6.5 Randomized choice of the best CLL (or AUC) candidate list to form the new
beam. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
7.1 The WalkSAT algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.2 The Robust Tabu Search algorithm . . . . . . . . . . . . . . . . . . . . . . . . 109
7.3 The Iterated Robust Tabu Search algorithm . . . . . . . . . . . . . . . . . . . 110
7.4 The MC-IRoTS algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
Chapter 1
Introduction
1.1 Statistical Models of Relational Data
Traditionally, Artificial Intelligence research has fallen into two separate subfields: one that
has focused on logical representations, and one on statistical ones. Logical AI approaches like
logic programming, description logics, classical planning, symbolic parsing, rule induction,
etc., tend to emphasize handling complexity. Statistical AI approaches like Bayesian networks,
hidden Markov models, Markov decision processes, statistical parsing, neural networks, etc.,
tend to emphasize handling uncertainty. However, intelligent agents must be able to handle
both for real-world applications. The first attempts to integrate logic and probability in AI
date back to the works in (Bacchus 1990; Halpern 1990; Nilsson 1986). Later, several authors
began using logic programs to compactly specify Bayesian networks, an approach known as
knowledge-based model construction (Wellman and Goldman 1992).
In Machine Learning, a central problem has always been learning in rich representations
that make it possible to deal with structure and relations. Much progress has been achieved
in the field of relational learning, also known as Inductive Logic Programming (Lavrac and
Dzeroski 1994). On the other hand, successful statistical machine learning models, with their
roots in statistics and pattern recognition, have made it possible to deal with noisy and
uncertain domains in a robust manner. Powerful models such as Probabilistic Graphical Models
(Pearl 1988) and related algorithms can handle uncertainty but lack the capability of
dealing with structured domains.
Statistical Relational Learning (Getoor and Taskar 2007) or Probabilistic Inductive Logic
Programming (De Raedt et al. 2008) has undertaken the hard task not only of Machine Learning
but of Artificial Intelligence as a whole: building hybrid models that integrate logical and
statistical formalisms. A growing amount of work has been dedicated to integrating subsets of
first-order logic with probabilistic graphical models, to extending logic programs with a
probabilistic semantics, or to integrating other formalisms with probability. Some of the
logic-based approaches are: Knowledge-based Model Construction (Wellman and Goldman 1992),
Bayesian Logic Programs (Kersting and De Raedt 2001a), Stochastic Logic Programs (Cussens 2001; Muggleton
1996), Probabilistic Horn Abduction (Poole 1993), Queries for Probabilistic Knowledge Bases
(Ngo and Haddawy 1997), PRISM (Sato and Kameya 1997a), CLP(BN) (Santos Costa et al.
2003). Other approaches include frame-based systems such as Probabilistic Relational Models
(Friedman et al. 1999) or PRMs extensions defined in (Pasula and Russell 2001), description
logics based approaches such as those in (Cumby and Roth 2003) and P-CLASSIC of (Koller
et al. 1997), database query languages (Taskar et al. 2002), (Popescul and Ungar 2003), etc.
All these SRL approaches are based on subsets of first-order logic. Markov Logic (Domin-
gos and Richardson 2007; Domingos et al. 2008) is a further step in generalizing these ap-
proaches. It is a simple language that provides the full expressiveness of graphical models
and first-order logic in finite domains, and remains well-defined in many infinite domains as
the results in (Richardson and Domingos 2006; Singla and Domingos 2007) show. Markov
Logic extends first-order logic by attaching weights to first-order formulas, which are viewed
as templates for constructing Markov networks. In the infinite-weight limit, Markov Logic
reduces to standard first-order logic. Markov Logic avoids the i.i.d. (independent and
identically distributed) data assumption made by most statistical learners by using the
power of first-order logic to compactly represent dependencies among objects and relations. A
Markov Logic Network (MLN) can be seen as a knowledge base capable of soundly handling
uncertainty, tolerating imperfect and contradictory knowledge, and reducing brittleness.
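For concreteness, the distribution an MLN defines over possible worlds can be written (following Richardson and Domingos 2006) as:

```latex
% Probability of a possible world x under an MLN with first-order formulas F_i
% and weights w_i, where n_i(x) is the number of true groundings of F_i in x
% and Z is the partition function normalizing over all possible worlds:
P(X = x) \;=\; \frac{1}{Z} \exp\!\Big( \sum_{i} w_i \, n_i(x) \Big),
\qquad
Z \;=\; \sum_{x'} \exp\!\Big( \sum_{i} w_i \, n_i(x') \Big)
```

As the weights grow, a formula behaves more and more like a hard constraint, which is how Markov Logic subsumes standard first-order logic over finite domains in the infinite-weight limit.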
The expressiveness of Markov Logic poses computationally challenging problems. As
with other graphical models, there are three important tasks with MLNs: structure learning,
parameter learning and inference. This dissertation aims at giving a contribution for each of
these tasks.
Structure learning for MLNs is the task of learning the logical clauses. This can be performed
from scratch or by starting from an already learned structure and revising it. In
(Richardson and Domingos 2006) structure learning was performed through ILP methods
(Lavrac and Dzeroski 1994) followed by a weight learning phase during which maximum-
pseudolikelihood (Besag 1975) weights were learned for each previously learned clause. State-
of-the-art algorithms for structure learning are those in (Kok and Domingos 2005; Mihalkova
and Mooney 2007) where learning of MLNs is performed in a single step using weighted
pseudo-likelihood as the evaluation measure during structure search. However, these algo-
rithms follow systematic search strategies that can lead to local optima and prohibitive learning
times. The algorithm in (Kok and Domingos 2005) performs a beam search in a greedy fash-
ion which makes it very susceptible to local optima, while the algorithm in (Mihalkova and
Mooney 2007) works in a bottom-up fashion trying to consider fewer candidates for evalu-
ation. Even though it considers fewer candidates, after initially scoring all candidates, this
algorithm attempts to add them one by one to the MLN, thus changing the MLN at almost each
step, which greatly slows down the computation of the optimal weights. Moreover, both these
algorithms cannot benefit from parallel architectures. This dissertation proposes an approach
based on the Iterated Local Search (ILS) metaheuristic (Lourenço et al. 2002) that samples the
set of local optima and performs a search in the sampled space. We show that, through a simple
parallelism model such as independent multiple walks, ILS achieves important improvements
over the state-of-the-art algorithms of (Kok and Domingos 2005; Mihalkova and Mooney
2007).
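As a reminder of the evaluation measure these structure learners optimize, a sketch of the weighted pseudo-log-likelihood (in the spirit of Kok and Domingos 2005, building on the pseudo-likelihood of Besag 1975) is:

```latex
% Weighted pseudo-log-likelihood (WPLL) of a world x under weights w:
% r ranges over first-order predicates, G_r over the groundings of predicate r,
% MB_x(X_g) is the Markov blanket of ground atom X_g in x, and c_r is a
% per-predicate weight (e.g. the inverse of the number of groundings of r).
\log P^{\,\bullet}_{w}(X = x) \;=\;
\sum_{r} c_r \sum_{g \in G_r} \log P_w\big( X_g = x_g \mid MB_x(X_g) \big)
```

Because each ground atom is conditioned only on its Markov blanket, this score can be computed without inference over the full ground network, which is what makes it practical as an objective during structure search.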
Generative approaches optimize the joint distribution of all the variables. This can lead to
suboptimal results for predictive tasks because of the mismatch between the objective function
used (likelihood or a function thereof) and the goal of classification (maximizing accuracy or
conditional likelihood). In contrast, discriminative approaches maximize the conditional likelihood of a set of outputs given a set of inputs (Lafferty et al. 2001), and this often produces better results for prediction problems. In (Singla and Domingos 2005) a voted-perceptron-based algorithm for discriminative weight learning of MLNs was shown to greatly outperform
maximum-likelihood and pseudo-likelihood approaches for two real-world prediction prob-
lems. Recently, the algorithm in (Lowd and Domingos 2007), which outperforms the voted perceptron, became the state-of-the-art method for discriminative weight learning of MLNs. However,
both discriminative approaches to MLNs learn weights for a fixed structure, given by a domain
expert or learned through another structure learning method (usually generative). Better results
could be achieved if the structure could be learned in a discriminative fashion. Unfortunately,
the computational cost of optimizing structure and parameters for conditional likelihood is
prohibitive. This dissertation proposes discriminative structure learning algorithms based on
the simple approximation of choosing structures by maximizing conditional likelihood while
setting parameters by maximum likelihood. Structures are scored using the very fast inference algorithm MC-SAT (Poon and Domingos 2006), whose lazy version Lazy-MC-SAT (Poon et al. 2008) greatly reduces memory requirements, while parameters are learned through a
quasi-Newton optimization method like L-BFGS that has been found to be much faster (Sha
and Pereira 2003) than iterative scaling initially used for Markov Networks’ weights learning
(Della Pietra et al. 1997). We show through experiments in two real-world domains that the
proposed algorithm improves over the state-of-the-art algorithm of (Lowd and Domingos 2007)
in terms of conditional likelihood of the query predicates.
Discriminative approaches may not always provide the highest classification accuracy. An
empirical and theoretical comparison of discriminative and generative classifiers (logistic re-
gression and Naïve Bayes (NB)) was given in (Ng and Jordan 2002). It was shown that for small
sample sizes the generative NB classifier can outperform a discriminatively trained model. This
is consistent with the fact that, for the same representation, discriminative training has lower
bias and higher variance than generative training, and the variance term dominates at small sam-
ple sizes (Domingos and Pazzani 1997; Friedman 1997a). For the dataset sizes typically found
in practice, however, the results in (Greiner et al. 2005; Grossman and Domingos 2004; Ng and
Jordan 2002) all support the choice of discriminative training. An experimental comparison
of discriminative and generative parameter training on both discriminatively and generatively
structured Bayesian Network classifiers has been performed in (Pernkopf and Bilmes 2005).
This dissertation presents an experimental comparison between the proposed generative and discriminative structure learning algorithms for MLNs and confirms the results of (Ng and Jordan 2002) in the case of MLNs: on a small dataset the generative algorithm is competitive, while on a larger dataset the discriminative algorithm outperforms the generative one in terms of conditional likelihood.
Maximum a posteriori (MAP) inference in MNs means finding the most likely state of a set
of output variables given the state of the input variables. This problem is NP-hard. For discrim-
inative training, the voted perceptron is a special case in which tractable inference is possible
using the Viterbi algorithm (Collins 2002). In (Singla and Domingos 2005) the voted percep-
tron was generalized to MLNs by replacing the Viterbi algorithm with a weighted SAT solver.
This algorithm is essentially gradient descent, and computing the gradient of the conditional log-likelihood (CLL) requires the number of true groundings of each clause, which can be obtained by finding the MAP state. In linear-chain models the MAP state can be computed by dynamic programming; since for MLNs the MAP state is the state that maximizes the sum of the weights of the
satisfied ground clauses, this state can be efficiently found using a weighted MAX-SAT solver.
The authors in (Singla and Domingos 2005) use the MaxWalkSAT solver (Selman et al. 1996).
This dissertation proposes to use IRoTS as a MAX-SAT solver for performing MAP inference
in MLNs. Extensive experiments in real-world domains show that IRoTS performs better than
the state-of-the-art algorithm for MAP inference in MLNs in terms of solution quality and inference running time.
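The MAP-as-weighted-MAX-SAT view above can be illustrated with a brute-force sketch; the clauses and weights below form a hypothetical toy ground network, not one from the dissertation's experiments. Solvers such as MaxWalkSAT or IRoTS search this space heuristically, whereas the sketch enumerates it exhaustively:

```python
from itertools import product

# Each ground clause: (weight, [(var_index, negated), ...]).
# A clause is satisfied if any of its literals evaluates to True.
# Hypothetical toy network, for illustration only.
clauses = [
    (2.0, [(0, False), (1, True)]),   # 2.0 : x0 v ~x1
    (1.5, [(1, False), (2, False)]),  # 1.5 : x1 v x2
    (0.5, [(2, True)]),               # 0.5 : ~x2
]

def satisfied(clause_lits, state):
    # a literal is true when the variable's value disagrees with its negation flag
    return any(state[v] != neg for v, neg in clause_lits)

def map_state(clauses, n_vars):
    """Brute-force MAP: the state maximizing the total weight of
    satisfied ground clauses (feasible only for tiny networks)."""
    best, best_w = None, float("-inf")
    for state in product([False, True], repeat=n_vars):
        w = sum(wt for wt, lits in clauses if satisfied(lits, state))
        if w > best_w:
            best, best_w = state, w
    return best, best_w
```

Here the MAP state sets x0 and x1 true and x2 false, satisfying all three clauses for a total weight of 4.0.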
Conditional inference in graphical models involves computing the distribution of the query
variables given the evidence and it has been shown to be #P-complete. The most widely used
approach to approximate inference is by using Markov Chain Monte Carlo (MCMC) methods
and in particular Gibbs sampling. One of the problems that arises in real-world applications is that an inference method must be able to handle both the probabilistic and the deterministic dependencies that might hold in the domain. MCMC methods are suitable for handling probabilistic dependencies but give poor results when deterministic or near-deterministic dependencies characterize a domain. Logical methods, on the other hand, such as satisfiability testing, cannot be applied to probabilistic dependencies. One approach that deals with both kinds of dependencies is that of
(Poon and Domingos 2006) where the authors use SampleSAT (Wei et al. 2004) in a MCMC
algorithm to uniformly sample from the set of satisfying solutions. As pointed out in (Wei et al.
2004), SAT solvers find solutions very fast but they may sample highly non-uniformly. MCMC methods, on the other hand, may take exponential time, in terms of problem size, to reach the stationary distribution. For this reason, the authors in (Wei et al. 2004) proposed to use
a hybrid strategy by combining random walk steps with MCMC steps, and in particular with
Metropolis transitions. MC-SAT (Poon and Domingos 2006) is an inference algorithm that
combines ideas from satisfiability (SAT) and MCMC methods. It uses the SampleSAT (Wei
et al. 2004) algorithm as a subroutine to efficiently jump between isolated or near-isolated re-
gions of non-zero probability, while preserving detailed balance. SampleSAT is an extension
to WalkSAT to sample satisfying solutions near-uniformly by combining it with simulated an-
nealing. This dissertation proposes the novel algorithm SampleIRoTS based on the iterated
local search (Loureno et al. 2002) and robust tabu search (RoTS) (Taillard 1991) metaheuris-
tics, that interleaves RoTS steps with simulated annealing ones in an iterated local search. The
dissertation then proposes SampleIRoTS plugged in the novel proposed algorithm MC-IRoTS.
Experimental evaluation shows that on a large number of inference tasks, MC-IRoTS performs
inference faster than the state-of-the-art algorithm for MLNs while maintaining the same qual-
ity of predicted probabilities.
Often the structure of the model is already given by a domain expert and the task is to
learn the parameters of the model. Discriminative weight learning of MLNs has produced much better results for predictive tasks than generative approaches, as the results in (Singla and Domingos 2005) show. In that work the voted-perceptron algorithm was general-
ized to arbitrary MLNs by replacing the Viterbi algorithm with a weighted satisfiability solver.
The new algorithm is essentially gradient descent with an MPE approximation to the expected
sufficient statistics (true clause counts) and these can vary widely between clauses, causing
the learning problem to be highly ill-conditioned, and making gradient descent very slow. In
(Lowd and Domingos 2007) a preconditioned scaled conjugate gradient (PSCG) approach was
shown to outperform the algorithm in (Singla and Domingos 2005) in terms of learning time
and prediction accuracy. This algorithm is based on the scaled conjugate gradient method and
very good results are obtained with a simple approach: per-weight learning rates, with each weight’s learning rate being the global one divided by the corresponding clause’s empirical number of true groundings. This approach was originally proposed in (Moller 1993) for train-
ing neural networks. In each iteration, PSCG takes a step in the diagonalized Newton direction, using samples from the MC-SAT algorithm (Poon and Domingos 2006) to approximate the Hessian for MLNs instead of a line search to choose the step size. In this dissertation, we plug MC-IRoTS into the PSCG algorithm as the sampler used to approximate the Hessian. We show through experiments in the web page classification domain that parameter learning with PSCG, sampling with MC-IRoTS, produces a model whose inferred probabilities yield high classification accuracy. This shows that MC-IRoTS is not only a fast inference algorithm, but can also be used during learning, since it samples uniformly or near-uniformly.
1.2 Overview of this Dissertation
The dissertation is structured as follows:
• Chapter 2 introduces the basic notions and terminology on Probabilistic Graphical Mod-
els and Inductive Logic Programming. It describes learning and inference algorithms for
Bayesian Networks and Markov Networks and gives a detailed description of the differ-
ent ILP learning settings. Finally it describes related work on other Statistical Relational
Learning models.
• Chapter 3 presents in detail Markov Logic and the three tasks of structure learning, parameter learning and inference, together with existing algorithms for each.
• Chapter 4 presents the Generative Structure Learning (GSL) algorithm. It describes the
Iterated Local Search metaheuristic and the choice of its components for the task of
structure learning of MLNs. Finally it presents experimental evaluation of GSL.
• Chapter 5 presents the discriminative structure learning algorithm ILS-DSL based on the Iterated Local Search metaheuristic. It describes how parameters are set and how structures are scored. Two versions of the algorithm are presented, optimizing respectively conditional likelihood and the area under the precision-recall curve. Finally, experimental evaluation of ILS-DSL is presented.
• Chapter 6 presents the discriminative structure learning algorithm RBS-DSL, inspired by the GRASP metaheuristic. It describes how parameters are set and how structures are scored. Two versions of the algorithm are presented, optimizing respectively conditional likelihood and the area under the precision-recall curve. Finally, experimental evaluation of RBS-DSL is presented.
• Chapter 7 introduces the basic notions of the satisfiability problem and how MAP infer-
ence for Markov Logic Networks (MLNs) can be performed using MAX-SAT solvers.
It then presents the Iterated Robust Tabu Search algorithm with the experimental evalu-
ation for the task of MAP inference in MLNs. Then it introduces Markov Chain Monte
Carlo methods and how these can be combined with SAT solvers. It presents the Sam-
pleIRoTS algorithm for uniformly sampling from the set of satisfying assignments of a
clause. Then it presents Markov Chain IRoTS (MC-IRoTS), an algorithm that combines
MCMC with SAT. Finally it presents experiments with MC-IRoTS in two tasks: proba-
bilistic inference on a large variety of MLNs inference problems and parameter learning
with the PSCG algorithm using MC-IRoTS as sampler.
• Chapter 8 reviews the main contributions of this dissertation and outlines directions for
future research.
Chapter 2
Statistical Relational Learning
This chapter introduces basic notions of Probabilistic Graphical Models (PGMs) and Relational Learning approaches. It describes two graphical models, Bayesian Networks (BNs) and Markov Networks (MNs), and their related algorithms. Then notions of Inductive Logic Programming (ILP) are presented, describing the three ILP learning settings. Finally, the last part of the chapter presents different SRL models combining PGMs with ILP. Some other models which are not built upon PGMs but integrate statistical learning in the ILP setting, such as nFOIL, TFOIL and kFOIL, are presented at the end of the chapter to give a complete view of SRL.
2.1 Probabilistic Graphical Models
This section introduces basic notions of Probabilistic Graphical Models (PGMs). As pointed
out in (Jordan 1998), PGMs are a marriage between probability theory and graph theory and
provide a natural tool for dealing with two problems that occur throughout applied mathematics
and engineering – uncertainty and complexity. PGMs are graphs in which nodes represent
random variables, and the arcs represent probabilistic relationships between these variables
(Cowell et al. 1999; Jordan 1998; Pearl 1988). PGMs have several useful properties: they
are a simple way to visualize the structure of a probabilistic model and can be used to design
new models; insights into the properties of the model, including conditional independence, can
be obtained by inspecting the graph; and complex computations required for inference and learning can be expressed in terms of graphical manipulations, with the underlying mathematical expressions carried along implicitly. The graph captures the way in which the joint distribution over all of
the random variables can be decomposed into a product of factors each depending only on a
subset of the variables. BNs are directed graphical models, in which the links of the graphs
have a particular directionality indicated by arrows. The other major class of graphical models
are MNs (also known as Markov random fields) which are undirected graphical models, in
which the links do not carry arrows and have no directional significance. Directed graphs are
useful for expressing causal relationships between random variables, while undirected graphs
are better suited to expressing soft constraints between random variables. The following discusses key aspects of these two graphical models as needed for understanding the statistical relational learning setting. Detailed treatments of these two models can be found in (Bishop 2006; Cowell et al. 1999; Edwards 2000; Jordan 1998).
2.1.1 Bayesian Networks
In order to better present the use of directed graphs to describe probability distributions, let’s
consider first an arbitrary joint distribution p(S, A, H, F) over four variables S, A, H, F. Let’s suppose S stands for Sunny, A stands for Arsonists, H stands for Hot and F stands for Forest on Fire (Figure 2.1).
Figure 2.1: Example of the graph structure of a Bayesian Network
In directed models the notion of independence is more complicated than in undirected models, but this brings several advantages. The most important is that one can regard an arc from
A to B as indicating that A “causes” B. For example in Figure 2.1, Hot causes Fire. This guides
the construction of the graph structure. In addition, directed models can encode deterministic
relationships. In addition to the graph structure, we have to specify the parameters of the model.
A  H  P(F=T)  P(F=F)
F  F   0.0     1.0
T  T   0.99    0.01
T  F   0.9     0.1
F  T   0.9     0.1

P(S=T)  P(S=F)
 0.5     0.5

Table 2.1: Conditional Probability Tables (CPTs) for all variables
For a directed model, we must specify the Conditional Probability Distribution (CPD) at each
node. If the variables are discrete, this can be represented as a table (CPT), which lists the
probability that the child node takes on each of its different values for each combination of
values of its parents. For example, in Figure 2.1, all nodes are binary, i.e., have two possible
values, which we will denote by T (true) and F (false). An important concept for probability
distributions over multiple variables is that of conditional independence (Dawid 1980).
We can see from Figure 2.1 that the event "Forest on Fire" (F = true) has two possible causes: either there is hot weather (H = true) or an arsonist is causing fire (A = true). The strengths of these relationships are given in the table. For example, we see that P(F = true | A = true, H = false) = 0.9, and since each row must sum to one, P(F = false | A = true, H = false) = 1 − 0.9 = 0.1. Since the S node has no parents, its CPT specifies the prior probability that it is sunny (in this case, 0.5). (We are thinking of S as representing the season: if it is sunny, fires are more likely.)
The simplest conditional independence relationship encoded in a BN is the following: a
node is independent of its ancestors given its parents, where the ancestor/parent relationship is
with respect to some fixed topological ordering of the nodes.
Based on the chain rule of probability, the joint probability of all the nodes in the graph above is:

P(S, A, H, F) = P(S) P(A|S) P(H|S, A) P(F|S, A, H)

By using conditional independence relationships, we can rewrite this equation as:

P(S, A, H, F) = P(S) P(A|S) P(H|S) P(F|A, H)

simplifying the third term because H is independent of A given S (hot weather is independent of the fact that an arsonist is in action) and the last term because F is independent of S given A and H.
The conditional independence relationships allow the joint distribution to be represented more compactly. If we had n binary nodes, the full joint would require O(2^n) space to represent, but the factored form would require O(2^k) space, where k = max_i |Par(X_i)| and Par(X_i) is the set of parents of the variable X_i. Fewer parameters also make learning easier.
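The factored representation can be checked numerically. The sketch below uses P(S) and P(F|A, H) from Table 2.1; the values assumed for P(A|S) and P(H|S) are illustrative, since the text does not give them:

```python
from itertools import product

# CPTs for the Sunny/Arsonist/Hot/Fire network.  P(S) and P(F|A,H)
# follow Table 2.1; P(A|S) and P(H|S) are illustrative values that
# are NOT given in the text.
p_s = {True: 0.5, False: 0.5}
p_a_given_s = {True: 0.1, False: 0.05}    # assumed P(A=T | S)
p_h_given_s = {True: 0.7, False: 0.2}     # assumed P(H=T | S)
p_f_given_ah = {(False, False): 0.0, (True, True): 0.99,
                (True, False): 0.9, (False, True): 0.9}  # P(F=T | A, H)

def bernoulli(p_true, value):
    # probability of a binary variable taking `value`
    return p_true if value else 1.0 - p_true

def joint(s, a, h, f):
    """P(S,A,H,F) = P(S) P(A|S) P(H|S) P(F|A,H)."""
    return (bernoulli(p_s[True], s)
            * bernoulli(p_a_given_s[s], a)
            * bernoulli(p_h_given_s[s], h)
            * bernoulli(p_f_given_ah[(a, h)], f))

# The factored distribution sums to 1 over all 16 joint states.
total = sum(joint(*x) for x in product([False, True], repeat=4))
```

Note the parameter saving: the full joint needs 2^4 − 1 = 15 numbers, while the four CPTs need only 1 + 2 + 2 + 4 = 9.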
We can now give in general terms the relationship between a given directed graph and the
corresponding distribution over the variables. The joint distribution defined by a graph is given
by the product, over all of the nodes of the graph, of a conditional distribution for each node
conditioned on the variables corresponding to the parents of that node in the graph. Thus, for a
graph with N nodes, the joint distribution is given by:

p(X) = ∏_{n=1}^{N} p(x_n | Pa(x_n))    (2.1)

where Pa(x_n) denotes the set of parents of x_n, and X = {x_1, ..., x_N}. This key equation expresses the factorization properties of the joint distribution for a directed graphical model.
The directed graphs we are considering are subject to an important restriction: there must be no cycles, in other words no paths that follow the direction of the arrows from node to node and end up back at the starting node. Such graphs are called directed acyclic graphs, or DAGs. This is equivalent to the statement that there exists an ordering of the nodes such that there are no links from any node to any lower-numbered node.
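The equivalence between acyclicity and the existence of such a node ordering suggests a simple check. The following is a standard topological-sort sketch (Kahn's algorithm), applied to the structure of Figure 2.1 as implied by the factorization P(S) P(A|S) P(H|S) P(F|A, H):

```python
from collections import defaultdict, deque

def topological_order(nodes, edges):
    """Kahn's algorithm: return a topological ordering of a directed
    graph, or None if the graph contains a cycle (i.e., not a DAG)."""
    indegree = {n: 0 for n in nodes}
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for c in children[n]:
            indegree[c] -= 1
            if indegree[c] == 0:
                queue.append(c)
    # every node was emitted iff there was no cycle
    return order if len(order) == len(nodes) else None

# The network of Figure 2.1: S -> A, S -> H, A -> F, H -> F.
edges = [("S", "A"), ("S", "H"), ("A", "F"), ("H", "F")]
```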
2.1.2 Markov Networks
A Markov Network (also known as a Markov random field) is a model for the joint distribution of a set of variables X = (X1, X2, ..., Xn) ∈ χ (Della Pietra et al. 1997; Kindermann and Snell 1980). It is composed of an undirected graph G and a set of potential functions. The graph has
a node for each variable, and the model has a potential function φk for each clique in the graph.
A potential function is a non-negative real-valued function of the state of the corresponding
clique. The joint distribution represented by a MN is given by:

P(X = x) = (1/Z) ∏_k φ_k(x_{k})    (2.2)

where x_{k} is the state of the kth clique (i.e., the state of the variables that appear in that clique). Z, known as the partition function, is given by:

Z = ∑_{x∈χ} ∏_k φ_k(x_{k})    (2.3)
A clique is defined as a subset of the nodes in a graph such that there exists a link between
all pairs of nodes in the subset. In other words, the set of nodes in a clique is fully connected.
Furthermore, a maximal clique is a clique where it is not possible to include any other nodes
from the graph in the set without it ceasing to be a clique. The graphical structure in Figure 2.2
contains two maximal cliques, {S, A, F} and {S, H, F}; the other cliques, {S, A}, {S, H}, {S, F}, {H, F}, and {A, F}, are not maximal. We can consider only functions of the maximal cliques,
without loss of generality, because other cliques must be subsets of maximal cliques. For
example, if {S, A , F } is a maximal clique and we define an arbitrary function over this clique,
then including another factor defined over a subset of these variables would be redundant.
Figure 2.2: Example of the graph structure of a Markov Network
MNs are often conveniently represented as log-linear models, with each clique potential
replaced by an exponentiated weighted sum of features of the state, leading to:
P(X = x) = (1/Z) exp(∑_j w_j f_j(x))    (2.4)
A feature may be any real-valued function of the state. We will focus on binary features,
f_j(x) ∈ {0, 1}. In the most direct translation from the potential-function form, there is one feature corresponding to each possible state x_{k} of each clique, with its weight being log(φ_k(x_{k})). This representation is exponential in the size of the cliques.
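A minimal numeric sketch of the log-linear form (Eqs. 2.2–2.4), with the partition function Z computed by exhaustive enumeration; the features and weights below are illustrative, not taken from the text:

```python
import math
from itertools import product

# A tiny log-linear Markov network over three binary variables:
# P(X = x) = (1/Z) exp(sum_j w_j f_j(x)).
# Features and weights are illustrative only.
features = [
    (1.2, lambda x: x[0] == x[1]),   # f1: x0 and x1 agree
    (0.8, lambda x: x[1] == x[2]),   # f2: x1 and x2 agree
]

def score(x):
    # unnormalized measure: exp of the weighted sum of active features
    return math.exp(sum(w for w, f in features if f(x)))

states = list(product([0, 1], repeat=3))
Z = sum(score(x) for x in states)    # partition function, Eq. (2.3)

def prob(x):
    return score(x) / Z              # Eq. (2.4)
```

States in which the variables agree get exponentially more mass than states violating the features; enumeration of Z is of course only feasible for toy models.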
2.1.3 Structure Learning
Structure Learning of Bayesian Networks
Structure learning for graphical models means learning the graph structure from data. For BNs the goal is to learn a DAG that best explains the data. This problem is NP-hard, since the number of DAGs on N variables is super-exponential in N (there is no closed-form formula for this, but there are 543 DAGs on 4 nodes and O(10^18) DAGs on 10 nodes). The maximum likelihood model is a complete graph, since this has the largest number of parameters, and hence fits the
model is a complete graph, since this has the largest number of parameters, and hence fits the
data the best. A well-principled way to avoid this kind of over-fitting is putting a prior on
models, specifying preference for sparse models. Based on Bayes’ rule, the MAP (maximum
a posteriori) model is the one that maximizes:
P(G|D) = P(D|G) P(G) / P(D)    (2.5)

Taking logs of each component of the equation:

log P(G|D) = log P(D|G) + log P(G) + e    (2.6)

where e = −log P(D) is a constant independent of G. The scoring function for each model is given by P(D|G).
Structure learning can be performed under full observability or partial observability. In
the first case local search algorithms can be used efficiently (possibly with multiple restarts).
Since the scoring function is a product of local terms, local search is more efficient, because to
compute the relative score of two models that differ by only a few arcs (i.e., neighbors in the
space), it is only necessary to compute the terms which they do not have in common; the other
terms cancel when taking the ratio. One of the most common methods for learning BNs is that
of (Geiger and Chickering 1995) which performs a search over the space of network structures,
starting from an initial network which may be random, empty, or derived from prior knowl-
edge. At each step, the algorithm generates all variations of the current network that can be
obtained by adding, deleting or reversing a single arc, without creating cycles, and selects the
best one using the Bayesian Dirichlet (BD) score. Bayesian methods can also be used: in a Bayesian approach the goal is to compute the posterior P(G|D). Since the space of graphs is super-exponentially large, a well-principled way is to sample a set of graphs from this distribution. The standard approach is to use an MCMC search procedure. This approach is quite popular and different variants exist (for a review see (Murphy 2001)). In the case of partial observability, one approach
is that of doing a local search inside the M-step of the Expectation-Maximization (EM) algo-
rithm. This is called Structural EM and it is proven to converge to a local maximum of the
Bayesian Information Criterion (BIC) (Friedman 1997b). More on learning BNs can be found in (Buntine 1994; Heckerman 1998).
Structure Learning of Markov Networks
The problem of learning the structure of MNs is a very hard one. Most algorithms for this
task are based on greedy heuristic search which incrementally modifies the model by adding
and possibly deleting features. For example, the approaches in (Della Pietra et al. 1997; Mc-
Callum 2003) add features in order to greedily improve the model likelihood; once a feature is
added, it is never removed. Since the feature addition step is heuristic and greedy, it can cause
the inclusion of unnecessary features, leading therefore to overly complex structures and over-
fitting. Another approach is that of (A. Deshpande and Jordan 2001; Bach and Jordan 2002)
that searches over the space of low-treewidth models. However, the advantage of such models
in practice is unclear. Another method was proposed in (Lee et al. 2006) that is based on the
use of L1-regularization on the weights of the log-linear model. This has the effect of biasing
the model towards solutions where many of the parameters are zero. This formulation converts
the MNs learning problem into a convex optimization problem in a continuous space, which
can be solved using efficient gradient methods.
2.1.4 Parameter Learning
Parameter Learning of Bayesian Networks
When learning only the parameters, the structure is known and fixed and the goal is to
learn, for each node in the network, a probabilistic model of that variable given the values of
its parents. The goal is typically to maximize the likelihood of the training data. If the training
data is complete, this is accomplished simply by counting the co-occurrences of the values of
the node with the various values of its parents. When some of the data values are missing, the
well-known EM algorithm (Dempster et al. 1977) can be employed. EM estimates the CPTs
from the known data, then uses those to estimate the missing values, then uses those to re-
estimate the CPTs, and repeats until convergence.
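The counting step for complete data can be sketched as follows; the records are hypothetical observations for the Fire example, not data from the dissertation:

```python
from collections import Counter

def estimate_cpt(data, child, parents):
    """Maximum-likelihood CPT from complete data: for each parent
    configuration, the fraction of records where the child is True.
    `data` is a list of dicts mapping variable names to booleans."""
    child_true = Counter()
    parent_counts = Counter()
    for record in data:
        cfg = tuple(record[p] for p in parents)
        parent_counts[cfg] += 1
        if record[child]:
            child_true[cfg] += 1
    return {cfg: child_true[cfg] / n for cfg, n in parent_counts.items()}

# Hypothetical complete data for the Fire example.
data = [
    {"A": True,  "H": False, "F": True},
    {"A": True,  "H": False, "F": True},
    {"A": True,  "H": False, "F": False},
    {"A": False, "H": False, "F": False},
]
cpt = estimate_cpt(data, "F", ["A", "H"])
```

With missing values, EM would replace the hard counts above with expected counts computed under the current CPT estimates, then re-run this counting step until convergence.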
Parameter Learning of Markov Networks
For undirected graphical models, the parameters are the clique potentials. Maximum like-
lihood estimates of these can be computed using iterative proportional fitting (Jirousek and
Preucil 1995). In (Della Pietra et al. 1997), MNs weights were learned using iterative scaling.
However, maximizing the likelihood (or posterior) using a quasi-Newton optimization method like L-BFGS was found to be much faster (Sha and Pereira 2003). Second-order, quadratic-convergence methods like L-BFGS are known to be very fast if started near the optimum. In general, the most commonly used objectives are maximum likelihood and maximum conditional likelihood, often with additional parameter priors. There is no closed form for the optimal parameters, but these objectives are convex, so the global optimum can be found using iterative methods, such as simple gradient descent or more sophisticated optimization algorithms (P. 2001; Vishwanathan et al. 2006). Unfortunately, each step of these optimization algorithms requires
the computation of the log partition function and the gradient which in turn requires performing
inference on the model with the current parameters. As MRF inference is computationally ex-
pensive or even intractable, the learning task that executes inference repeatedly is often viewed
as intractable.
One commonly-used approach (Shental et al. 2003; Sutton and McCallum 2005a; Taskar
et al. 2002) is the approximation of the gradient of the maximum likelihood objective through
an approximate inference technique, most often the loopy belief propagation (LBP) (Pearl
1988; Yedidia et al. 2005) algorithm. LBP uses message passing to find fixed points of the
non-convex Bethe approximation to the energy functional (Yedidia et al. 2005). Unfortunately,
for some choices of models, LBP can be highly non-robust, providing wrong answers or not
converging at all. Recently, in (Ganapathi et al. 2008), an approach for combining MRF learn-
ing and Bethe approximation was proposed. They consider the dual of maximum likelihood
MN learning – maximizing entropy with moment matching constraints – and then approximate
both the objective and the constraints in the resulting optimization problem.
2.1.5 Inference
Graphical models specify a complete joint probability distribution (JPD) over all the variables.
Given the JPD, we can answer all possible inference queries by marginalization (summing out
over irrelevant variables). However, the JPD has size O(2^n), where n is the number of nodes, so summing over the JPD takes exponential time. One exact algorithm, Variable Elimination, uses the factored representation of the JPD to do marginalization efficiently.
The key idea is to "push sums in" as far as possible when summing (marginalizing) out irrel-
evant terms. The principle of distributing sums over products can be generalized greatly to
apply to any commutative semiring. This forms the basis of many common algorithms, such
as Viterbi decoding and the Fast Fourier Transform. If we want to compute several marginals
at the same time, we can use Dynamic Programming (DP) to avoid the redundant computa-
tion that would be involved if we used variable elimination repeatedly (Pearl 1988). However,
for real-world problems and models, such as those with repetitive structure, as in multivariate
time-series or image analysis, large induced width makes exact inference very slow. Therefore
approximation techniques must be used. Unfortunately, approximate inference is #P-hard, but, nonetheless, approximation methods often work well in practice. Some of these methods are:
Variational methods, Monte Carlo methods, Loopy belief propagation, Bounded cutset condi-
tioning, Parametric approximation methods. More on exact inference can be found in (Kschis-
chang et al. 2001; McEliece and Aji 2000), whereas for approximate inference in graphical
models more can be found in (Jordan et al. 1999; Murphy et al. 1999).
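The "push sums in" idea behind variable elimination can be sketched on a three-variable chain x1 → x2 → x3; the CPT values below are illustrative:

```python
# "Pushing sums in" on a chain x1 -> x2 -> x3:
# p(x3) = sum_{x2} p(x3|x2) * sum_{x1} p(x1) p(x2|x1),
# so we never build the full 2^3-entry joint table.
# CPT values are illustrative.
p_x1 = {0: 0.6, 1: 0.4}
p_x2_given_x1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_x3_given_x2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}

# Eliminate x1: message m12(x2) = sum_{x1} p(x1) p(x2|x1)
m12 = {x2: sum(p_x1[x1] * p_x2_given_x1[x1][x2] for x1 in (0, 1))
       for x2 in (0, 1)}

# Eliminate x2: p(x3) = sum_{x2} m12(x2) p(x3|x2)
p_x3 = {x3: sum(m12[x2] * p_x3_given_x2[x2][x3] for x2 in (0, 1))
        for x3 in (0, 1)}
```

On a chain of n binary variables this costs O(n) small sums instead of the O(2^n) enumeration of the full joint.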
2.2 First-Order Logic
A first-order knowledge base (KB) is a set of sentences or formulas in first-order logic (FOL)
(Genesereth and Nilsson 1987). Formulas in FOL are constructed using four types of symbols:
constants, variables, functions, and predicates. Constant symbols represent objects in the do-
main of interest. Variable symbols range over the objects in the domain. Function symbols
represent mappings from tuples of objects to objects. Predicate symbols represent relations
among objects in the domain or attributes of objects. A term is any expression representing an
object in the domain. It can be a constant, a variable, or a function applied to a tuple of terms.
An atomic formula or atom is a predicate symbol applied to a tuple of terms. A ground term
is a term containing no variables. A ground atom or ground predicate is an atomic formula
all of whose arguments are ground terms. Formulas are recursively constructed from atomic
formulas using logical connectives and quantifiers. A positive literal is an atomic formula; a
negative literal is a negated atomic formula. A KB in clausal form is a conjunction of clauses,
a clause being a disjunction of literals. A definite clause is a clause with exactly one positive
literal (the head, with the negative literals constituting the body). A possible world or Herbrand
interpretation assigns a truth value to each possible ground predicate.
It is often convenient to convert formulas to a more regular form, typically clausal form (also known as conjunctive normal form, CNF), as defined above. Every KB in FOL
can be converted to clausal form through a sequence of steps. Formulas in clausal form contain
no quantifiers; all variables are implicitly universally quantified. (They are also standardized
apart, i.e., no variable appears in more than one clause.) Existentially quantified variables are
replaced by Skolem functions. A Skolem function is a function of all the universally quantified
variables in whose scope the corresponding existential quantifier appears. To perform inference
in FOL using clausal form, resolution and local satisfiability search can be used. The latter is
applied after propositionalizing the KB (i.e., forming all ground instances of CNF clauses),
and proceeds by repeatedly flipping the truth values of propositions to increase the number of
satisfied clauses. Such procedures are known as SAT solvers.
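The flipping procedure described above can be illustrated with a minimal GSAT-style greedy search over propositionalized clauses. This is only a sketch under invented encodings (clauses as sets of signed literals), not the implementation of any particular solver:

```python
import random

def satisfied(clause, assignment):
    # A ground clause is a set of (atom, sign) literals; it is satisfied
    # if at least one literal agrees with the current truth assignment.
    return any(assignment[atom] == sign for atom, sign in clause)

def local_sat_search(clauses, atoms, max_flips=1000, seed=0):
    """Greedy local search: start from a random assignment and repeatedly
    flip the atom whose flip satisfies the most ground clauses."""
    rng = random.Random(seed)
    assignment = {a: rng.random() < 0.5 for a in atoms}

    def num_satisfied_after_flip(atom):
        assignment[atom] = not assignment[atom]
        n = sum(satisfied(c, assignment) for c in clauses)
        assignment[atom] = not assignment[atom]
        return n

    for _ in range(max_flips):
        if all(satisfied(c, assignment) for c in clauses):
            return assignment            # a model of the ground KB
        best = max(atoms, key=num_satisfied_after_flip)
        assignment[best] = not assignment[best]
    return None                          # no model found within the budget

# propositionalized KB: (p or q) and (not p or q); every model makes q true
clauses = [{('p', True), ('q', True)}, {('p', False), ('q', True)}]
model = local_sat_search(clauses, ['p', 'q'])
```

Real solvers add random restarts and noisy moves (as in WalkSAT) to escape local optima; the greedy scoring above is the simplest variant.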
Because of the computational complexity, KBs are generally constructed using a restricted
subset of FOL where inference and learning is more tractable. The most widely-used restriction
is to Horn clauses, which are clauses containing at most one positive literal. In other words,
a Horn clause is an implication with all positive antecedents, and only one (positive) literal in
the consequent. A program in the Prolog language is a set of Horn clauses. Prolog programs
can be learned from examples (often databases) by searching for Horn clauses that hold in the
data. The field of inductive logic programming (ILP) (Muggleton and De Raedt 1994) deals
exactly with this problem.
2.3 Inductive Logic Programming
Inductive Logic Programming (ILP) and multi-relational data mining are concerned with learn-
ing and mining within first-order logical or relational representations. The main task in ILP is
finding a hypothesis H (a logic program, i.e., a definite clause program) from a set of positive
and negative examples P and N. In particular, it is required that the hypothesis H covers
all positive examples in P and none of the negative examples in N. The representation language
for representing the examples, together with the covers relation, determines the ILP setting
(De Raedt 1997). Overviews of inductive logic learning and multi-relational data mining can
be found in (Dzeroski and Lavrac 2001; Lavrac and Dzeroski 1994; Muggleton and De Raedt
1994). In the following, the three main settings for learning in ILP are discussed. A recent
and more detailed review of these three settings can be found in (De Raedt and Kersting 2003,
2004).
2.3.1 Learning from entailment
Learning from entailment is probably the most popular ILP setting and many well-known ILP
systems such as FOIL (Quinlan 1990), PROGOL (Muggleton 1995) or ALEPH (Srinivasan)
follow this setting. In this setting, examples are definite clauses and an example e is covered
by a hypothesis H, w.r.t. the background theory B, if and only if B ∪ H |= e. Most ILP systems
in this setting require ground facts as examples. They typically proceed following a separate-
and-conquer rule-learning approach (Furnkranz 1999). This means that in the outer loop they
repeatedly search for a rule covering many positive examples and none of the negatives (the set-
covering approach (Mitchell 1997)). In the inner loop, ILP systems generally perform a general-
to-specific heuristic search using refinement operators (Nienhuys-Cheng and de Wolf 1997;
Shapiro 1983) based on θ-subsumption (Plotkin 1970). These operators perform the steps in
the search-space, by making small modifications to a hypothesis. From a logical perspective,
these refinement operators typically realize elementary generalization and specialization steps
(usually under θ -subsumption). More sophisticated systems like PROGOL or ALEPH employ
a search bias to reduce the search space of hypotheses.
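The outer set-covering loop can be sketched as follows. Here `learn_one_rule` is a hypothetical stand-in for the inner general-to-specific refinement search, and the toy "rules" (divisibility tests over integer examples) are invented purely for illustration:

```python
def separate_and_conquer(positives, negatives, learn_one_rule):
    """Outer loop of the set-covering approach: repeatedly learn a rule
    covering some positives and no negatives, then remove the covered
    positives. `learn_one_rule` stands in for the inner search."""
    theory, remaining = [], set(positives)
    while remaining:
        rule, covered = learn_one_rule(remaining, negatives)
        if not covered:        # no rule makes progress: stop
            break
        theory.append(rule)
        remaining -= covered
    return theory

# toy inner search: "rules" are divisibility tests over integer examples
def learn_one_rule(pos, neg):
    for d in (2, 3, 5):
        covered = {e for e in pos if e % d == 0}
        if covered and not any(e % d == 0 for e in neg):
            return ('divisible_by', d), covered
    return None, set()

theory = separate_and_conquer({4, 6, 9}, {7, 25}, learn_one_rule)
```

In a real ILP system the inner search instead refines clauses under θ-subsumption, but the outer control flow has exactly this shape.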
2.3.2 Learning from interpretations
In the ILP setting of learning from interpretations, examples are Herbrand interpretations and
an example e is covered by a hypothesis H, w.r.t. the background theory B, if and only if e is a
model of B∪H. A possible world is described through sets of true ground facts which are the
Herbrand interpretations. Learning from interpretations is generally easier and computationally
more tractable than learning from entailment (De Raedt 1997). This is due to the fact that
interpretations carry much more information than the examples in learning from entailment. In
learning from entailment, examples consist of a single fact, while in interpretations all the facts
that hold in the example are known. The approach followed by ILP systems learning from
interpretations is similar to that of systems that learn from entailment. The most important
difference lies in the generality relationship: in learning from entailment a hypothesis H1 is
more general than H2 if and only if H1 |= H2, while in learning from interpretations H1 is more
general than H2 when H2 |= H1.
A hypothesis H1 is more general than a hypothesis H2 if all examples covered by H2 are also
covered by H1. ILP systems that learn from interpretations are also well suited for learning
from positive examples only (De Raedt and Dehaspe 1997).
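For the ground case, the covers test of this setting (e is covered iff e is a model of B ∪ H) can be sketched as follows. The sketch assumes the clauses of B and H are already ground, each represented as a (head, body) pair, and the example atoms are invented:

```python
def is_model(interpretation, ground_clauses):
    """A Herbrand interpretation (the set of true ground atoms) is a
    model of a set of ground definite clauses iff every clause whose
    body is true in it also has a true head."""
    return all(head in interpretation
               for head, body in ground_clauses
               if all(atom in interpretation for atom in body))

# e is covered by H w.r.t. B iff e is a model of B ∪ H (ground case)
clauses = [('mammal(fido)', ['dog(fido)']), ('dog(fido)', [])]
covered = is_model({'dog(fido)', 'mammal(fido)'}, clauses)
not_covered = not is_model({'dog(fido)'}, clauses)
```

The second interpretation fails because the body of the first clause is true in it while the head is not.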
2.3.3 Learning from proofs
The first learning system to perform a kind of learning from proofs was the Model
Inference System (MIS) (Shapiro 1983). This system normally learned from entailment, but
when information was missing, it queried the user for missing information by asking the truth
value of facts. The answers to these queries would then allow MIS to reconstruct the trace
or the proof for the positive example. Inspired by the work of Shapiro on MIS, the authors
in (De Raedt and Kersting 2004) defined the learning from proofs setting of ILP. In learning
from proofs, the examples are ground proof-trees and an example e is covered by a hypothesis
H w.r.t. the background theory B if and only if e is a proof-tree for H ∪ B. There can be
different possible forms of proof trees. For example, the authors in (De Raedt and Kersting
2004) assume that the proof tree has the form of an and-tree whose nodes contain ground
atoms. They define a proof tree in this way: t is a proof-tree for T if and only if t is a rooted
tree where every node n ∈ t with children child(n) satisfies the property that there exists a
substitution θ and a clause c such that n = head(c)θ and child(n) = body(c)θ.
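This property can be sketched as a recursive check. The sketch below assumes the program is already ground, so the substitution θ is omitted, and the example program is invented:

```python
def is_proof_tree(node, children_of, ground_clauses):
    """Check the and-tree property in the ground case: each node must be
    the head of some ground clause whose body atoms are exactly its
    children (unification with θ is omitted, so clauses are ground)."""
    kids = children_of.get(node, [])
    if not kids:
        # leaves must correspond to facts (clauses with empty bodies)
        return any(h == node and not body for h, body in ground_clauses)
    return (any(h == node and sorted(body) == sorted(kids)
                for h, body in ground_clauses)
            and all(is_proof_tree(k, children_of, ground_clauses)
                    for k in kids))

# ground program: grandparent(a,c) :- parent(a,b), parent(b,c).  plus facts
program = [('grandparent(a,c)', ['parent(a,b)', 'parent(b,c)']),
           ('parent(a,b)', []), ('parent(b,c)', [])]
tree = {'grandparent(a,c)': ['parent(a,b)', 'parent(b,c)']}
ok = is_proof_tree('grandparent(a,c)', tree, program)
```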
2.4 Probabilistic Inductive Logic Programming
Probabilistic Inductive Logic Programming (PILP) can be seen as a field that aims at combining
ILP principles such as refinement operators with statistical learning. The most natural way to
do this is by giving a probabilistic semantics to the three ILP settings. In the following, it is
sketched how each ILP setting can be extended with probabilistic semantics. More on this
extension can be found in (De Raedt and Kersting 2003, 2004).
The first change from ILP is that the cover relation becomes a probabilistic one. Then
clauses become annotated with probability values. A probabilistic covers relation for an ex-
ample e, a hypothesis H and a background theory B returns a probability P. We can write
cover(e, H ∪ B) = P(e|H,B). The latter is the likelihood of the example e. With this covers
relation, the goal of PILP is to find a hypothesis H that maximizes the likelihood of the data
P(E|H,B) where E is the set of examples.
2.4.1 Learning from Probabilistic Entailment
In a probabilistic setting, a logic program C becomes a set of clauses of the form h← bi where h
is an atom and the bi are different bodies of clauses. For each clause in C, P(bi|h) is the
conditional probability that, for a random substitution θ for which hθ is ground
and true (resp. false), the query biθ succeeds (resp. fails). It is assumed that the prior probability of
h is given as P(h), the probability that for a random substitution θ, h is true (resp. false). The
covers relation P(hθ |C) (B is fixed) can thus be defined as:
P(hθ|C) = P(hθ|b1θ,...,bkθ) = P(b1θ,...,bkθ|hθ) × P(hθ) / P(b1θ,...,bkθ)
If we apply the naïve Bayes assumption to this equation we have:
P(hθ|C) = ∏i P(biθ|hθ) × P(hθ) / P(b1θ,...,bkθ)
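A toy numeric instance of this covers relation can be sketched as follows. The probabilities are invented (k = 2 body queries), and the normalizing term P(b1θ,...,bkθ) is obtained by summing the joint over the two truth values of hθ:

```python
from math import prod

def prob_covers(prior_h, like_true, like_false):
    """Naive-Bayes covers relation: P(h|b_1..b_k) is proportional to
    prod_i P(b_i|h) * P(h); the denominator P(b_1,...,b_k) is recovered
    by summing the joint over both truth values of h."""
    joint_true = prior_h * prod(like_true)           # numerator of the rule
    joint_false = (1 - prior_h) * prod(like_false)   # hθ false case
    return joint_true / (joint_true + joint_false)

# invented values: P(h) = 0.5;
# P(b_i succeeds | h true) = 0.9, 0.8; P(b_i succeeds | h false) = 0.1, 0.2
p = prob_covers(0.5, [0.9, 0.8], [0.1, 0.2])
```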
2.4.2 Learning from Probabilistic Interpretations
In order to give a probabilistic semantics to this ILP setting, probabilities must be assigned to
interpretations covered by a logic program. One way of doing this is to consider ground atoms
as random variables that are defined by the underlying definite clause programs (Kersting and
De Raedt 2001a). The authors distinguish between two types of predicates: deterministic and
probabilistic ones. The former are called logical, the latter Bayesian. A Bayesian logic program
is a set of Bayesian (definite) clauses of the form A|A1,...,An, where A is a Bayesian atom,
A1,...,An, n ≥ 0, are Bayesian and logical atoms, and all variables are (implicitly) universally
quantified. To quantify probabilistic dependencies, each Bayesian clause c is annotated with
its conditional probability distribution cpd(c) = P(A|A1, ...,An), which quantifies the proba-
bilistic dependency among ground instances of the clause. A Bayesian logic program
together with the background theory induces a Bayesian network. The random variables A
of the Bayesian network are the Bayesian ground atoms in the least Herbrand model I of the
annotated logic program (for details see (Kersting and De Raedt 2001a)).
2.4.3 Learning from Probabilistic Proofs
Learning from probabilistic proofs is similar to Stochastic Logic Programs (SLPs) (Cussens
2001; Muggleton 1996) (this model will be discussed in the following paragraph). In SLPs,
similar to stochastic context-free grammars, the clauses are annotated with probability labels
in such a way that the probabilities associated with the clauses defining any predicate
sum to 1.0 (less restricted versions have been considered in (Cussens 1999)). SLPs are an example
of learning from entailment because the examples are ground facts entailed by the target SLP,
while in (De Raedt and Kersting 2004) the idea is to learn from proofs, which carry much more
information about the structure of the underlying logic program. The basic element is that
proofs are probabilistic: the probability of a proof of a query predicate q is the product
of the probabilities of the clauses used in the proof of q, assuming proofs are finite
(see (Cussens 1999) for the general case). The probability of a ground atom is then defined
as the sum of the probabilities of all the proofs for that ground atom. An approach for
learning SLPs from proofs is that of (De Raedt et al. 2005), which combines ideas from the early
ILP system Golem [33], which employs Plotkin's (1970) least general generalization
(LGG), with bottom-up generalization of grammars and hidden Markov models (Stolcke and
Omohundro 1993). The resulting algorithm employs the likelihood of the proofs as scoring
function.
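The proof and atom probabilities of an SLP can be sketched as follows; the clause identifiers and labels form an invented toy program:

```python
from math import prod

def proof_probability(proof, labels):
    """Probability of one finite proof: the product of the labels of
    the clauses used in it (`proof` is a list of clause identifiers)."""
    return prod(labels[c] for c in proof)

def atom_probability(proofs, labels):
    """Probability of a ground atom: the sum over all of its proofs."""
    return sum(proof_probability(p, labels) for p in proofs)

# toy normalised SLP for a predicate s with two clauses (invented labels):
# 0.4 : s :- a.   0.6 : s :- b.   plus facts a and b with label 1.0
labels = {'s<-a': 0.4, 's<-b': 0.6, 'a': 1.0, 'b': 1.0}
p = atom_probability([['s<-a', 'a'], ['s<-b', 'b']], labels)
```

Since the two proofs of s are disjoint and exhaustive here, the atom probability sums to 1.0; in general, failed derivations make the treatment subtler (Cussens 1999).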
2.5 SRL and PILP models
Most approaches in PILP start from ILP and extend it with probabilistic semantics. On the
other hand, SRL approaches start from Probabilistic Graphical Models (PGMs) and extend them
with relational representations. In the following, the most well-known SRL and PILP
approaches are presented.
2.5.1 Knowledge-based Model Construction
One of the simplest ways of combining probability and first-order logic is augmenting an ex-
isting first-order (Horn clause) knowledge base with probabilistic information. This is the ap-
proach taken by knowledge-based model construction (KBMC) methods, which derive from
work by Ngo and Haddawy (1997) and earlier work surveyed in (Wellman
and Goldman 1992). Probabilistic logic programs proposed in (Ng and Subrahmanian 1992)
are also similar to this approach. The basic idea of all KBMC approaches is that each
clause in a knowledge base is associated with a set of parameters that specify how the consequent
probabilistically depends on its antecedents. In the simplest case, this is a single parameter that
specifies the probability that the consequent holds given that the antecedents hold. To answer
queries, KBMC constructs from the knowledge base a BN containing the relevant knowledge.
Each grounded predicate that is relevant to the query appears in the BN as a node. Relevant
predicates are found using Prolog backward chaining, except that rather than stopping when
finding a proof tree, KBMC finds every possible proof tree. Further, in order to find all relevant
predicates, backward chaining is performed not only from the query predicate to the evidence
predicates, but also from each evidence predicate to the other evidence predicates and the query
predicate.
2.5.2 Probabilistic Relational Models
Probabilistic relational models (PRMs) (Friedman et al. 1999) are a combination of frame-
based systems and Bayesian networks. Unlike PILP approaches, the authors start
from BN learning approaches and extend these to a rich representation language in order to
deal with both relations and uncertainty. The early idea of PRMs was to allow the properties of
an object to depend probabilistically both on other properties of that object and on properties
of related objects. The authors in (Friedman et al. 1999) generalize the ideas of (Koller and
Pfeffer 1998) on constructing rich probabilistic representations and of a related work based on
description logics (Koller et al. 1997). The major limitation of a BN is its propositional nature:
the entire domain must be known, and probabilistic parameters that could be shared end up
duplicated across many CPTs. What BNs lack is the concept of variable instantiation, which is
common in logic. PRMs achieve exactly this.
A PRM consists of a set of classes Cl1, Cl2, ..., Cln. Each class Cl has a set of attributes A; each
attribute A is denoted Cl.A (for example, Car.color). Each class also has a set of reference
slots R, where each reference slot points to an instance of the same or another class and is
denoted analogously (for example, Car.engine). Reference slots can be composed to form a slot chain
(for example, Car.engine.power refers to a car's engine power). A PRM also defines a probabilistic
relationship between attributes of classes. An attribute may depend on any attribute of the same
class, or of a class that is reachable through some slot-chain. Every PRM can be compiled into
a BN and essentially a PRM can be thought of as a template which, when given a specific
domain of objects, is “compiled” into a BN. Given a PRM and a set of objects, inference is
performed by constructing the corresponding BN and applying standard inference techniques
to it.
As the authors in (Taskar et al. 2002) point out, the need to avoid cycles in PRMs causes
significant representational and computational difficulties. Inference in PRMs is done by cre-
ating the complete ground network, which limits their scalability. PRMs require specifying a
complete conditional model for each attribute of each class, which in large complex domains
can be quite burdensome.
2.5.3 Bayesian Logic Programs
Bayesian Logic Programs (BLPs) fall in the ILP setting of learning from probabilistic entail-
ment introduced in the previous sections. A BLP together with the background theory induces
a BN. The random variables A of the Bayesian network are the Bayesian ground atoms in the
least Herbrand model I of the annotated logic program. This is similar in spirit to the KBMC
approach described above. BLPs are represented by regular clauses, on which the typical
refinement operators from ILP can be applied. However, in BLPs it is required that the exam-
ples are models of the BLP, i.e., cover(H, e) = true if and only if e is a model of H. This is
needed since the set of random variables defined by a BLP corresponds to a Herbrand model.
The requirement is enforced when learning the structure of a BLP by starting from an initial set
of hypotheses that satisfies this requirement and from then on only considering refinements that
do not result in a violation. In addition, acyclicity is enforced by checking, for each refine-
ment and each example, that the induced Bayesian network is acyclic. Scooby (Kersting and
De Raedt 2001a,b) is a greedy hill-climbing approach for learning Bayesian logic programs.
Scooby takes the initial BLP as starting point and computes the parameters maximizing the likeli-
hood. Then, refinement operators that generalize or specialize H, respectively, are used to compute
all legal neighbours of H in the hypothesis space.
2.5.4 Stochastic Logic Programs
Stochastic Logic Programs were first defined by Muggleton in (Muggleton 1996) as general-
izations of Hidden Markov Models (HMMs) and Stochastic Context-Free Grammars (SCFGs).
An SLP is a definite logic program where some of the clauses are labelled with non-negative
numbers. A pure SLP is an SLP where all clauses have labels, while in an impure SLP some
clauses do not have labels. A normalised SLP is one in which the labels of the clauses that share
the same predicate symbol sum to one; if this is not the case, the SLP is said to be unnormalised
(Cussens 2001; Cussens 1999). In normalised SLPs, labels can be considered as probabilities.
In such SLPs, since each clause has an associated probability label, SLD-resolution
with a stochastic selection rule induces a probability distribution over
the atoms of each predicate in the Herbrand base. An SLP defines a probability distribution over
derivations, where the probability of a derivation is given by the product of the labels of the
clauses used in the SLD derivation.
Parameters of SLPs can be learned through the Failure Adjusted Maximisation (FAM)
algorithm (Cussens 2001), while structure learning of SLPs can be performed by learning clauses
with an ILP system and then estimating the parameters with FAM. Another approach to learning
the structure of SLPs is that in (Muggleton 2002), which incrementally learns an additional clause
for a single predicate in an SLP. From an ILP perspective, this corresponds to a typical single
predicate learning setting under entailment. A related approach is that of learning SLPs from
proofs (De Raedt et al. 2005), which corresponds to the third ILP setting of learning from
proofs. In (De Raedt et al. 2005) the authors employ Plotkin's (1970) least general
generalization (LGG) in an ILP system to learn SLPs from proof banks.
2.5.5 PRISM
Programming in Statistical Modelling (PRISM) (Sato and Kameya 1997b, 2001) combines
logic programming and statistical modelling based on the distributional semantics introduced
by Sato (Sato 1995). PRISM programs are not just a probabilistic extension of logic pro-
grams but are also able to learn from examples through the EM (Expectation-Maximization)
algorithm, which is built into the language. PRISM is a formal knowledge represen-
tation language for modeling scientific hypotheses about phenomena that are governed by
rules and probabilities. The parameter learning algorithm provided by the language (Sato and
Kameya 2001) is a new EM algorithm, called the graphical EM algorithm, which, when combined
with tabulated search, has the same time complexity as the EM algorithms developed
independently in each research field: the Baum-Welch algorithm for HMMs, the Inside-Outside
algorithm for PCFGs, and the one for singly connected BNs. Since
PRISM programs can be arbitrarily complex (no restriction on the form or size), the most pop-
ular probabilistic modeling formalisms such as HMMs, PCFGs and BNs can be described by
these programs.
PRISM programs are defined as logic programs with a probability distribution given to
facts that is called the basic distribution. Formally, a PRISM program is P = F ∪ R, where R is a
set of logical rules working behind the observations and F is a set of facts that model the
uncertainty of observations with a probability distribution. Through the built-in graphical EM algorithm
the parameters (probabilities) of F are learned and through the rules this learned probabil-
ity distribution over the facts induces a probability distribution over the observations. The
most appealing feature of PRISM is that it allows users to make probabilistic choices in the
logic program through random switches. A random switch has a name, a space of possible
outcomes, and a probability distribution. Recent advances on PRISM can be found in (Sato
and Kameya 2008).
2.5.6 Relational Dependency Networks
Relational Dependency Networks (RDNs) are dependency networks in which each node’s
probability, conditioned on its Markov blanket, is given by a decision tree over relational at-
tributes (Neville and Jensen 2007). RDNs are an extension of dependency networks (DNs)
(Heckerman et al. 2000) for relational data. DNs are an approximate representation. They ap-
proximate the joint distribution with a set of conditional probability distributions (CPDs) that
are learned independently. This approach to learning results in significant efficiency gains over
exact models. However, because the CPDs are learned independently, DNs are not guaranteed
to specify a consistent joint distribution, i.e., one in which each CPD can be derived from the joint
distribution using the rules of probability. This limits the applicability of exact inference techniques.
RDNs can represent and reason with the cyclic dependencies required to express and ex-
ploit autocorrelation during collective inference. RDNs share certain advantages of relational
undirected models. In (Neville and Jensen 2007) the authors describe a relatively simple
method for structure learning and parameter estimation, which results in models that are easier
to understand and interpret. The primary distinction between RDNs and other existing SRL
models is that RDNs are an approximate model. RDNs approximate the full joint distribution
and thus are not guaranteed to specify a consistent probability distribution. The quality of the
approximation will be determined by the data available for learning: if the models are learned
from large data sets, and combined with Monte Carlo inference techniques, the approximation
should be sufficiently accurate.
2.5.7 nFOIL, TFOIL and kFOIL
These three PILP systems are probabilistic extensions to the ILP system FOIL (Quinlan 1990).
They all fall in the PILP setting of learning from probabilistic entailment. nFOIL (Landwehr
et al. 2005) was the first system in the literature to tightly integrate feature construction and Naïve
Bayes. Such a dynamic propositionalization was shown to be superior to static
propositionalization approaches that use Naïve Bayes only to post-process the rule set. nFOIL
adapts FOIL by using conditional likelihood as the scoring function. A significant difference
with FOIL is, however, that the covered positive examples are not removed. TFOIL (Landwehr
et al. 2007) is similar in spirit but integrates FOIL with Tree-Augmented Naïve Bayes, a
generalization of Naïve Bayes. The authors show in (Landwehr et al. 2007) that TFOIL outper-
forms nFOIL. In a recent approach (Landwehr et al. 2006), the kFOIL system integrates ILP
and support vector learning. kFOIL constructs the feature space by leveraging FOIL search for
a set of relevant clauses. The search is driven by the performance obtained by a support vector
machine based on the resulting kernel. The authors showed that kFOIL improves over nFOIL.
2.5.8 Other models
This paragraph gives an overview of other SRL and PILP models.
Relational Markov models. In this model, the states of the Markov model are labeled
with parameterized predicates (Anderson et al. 2002). Thanks to the first-order representation of
the state, RMMs are better able to use smoothing to combat data scarcity.
Structural Logistic Regression. In structural logistic regression (SLR) (Popescul and
Ungar 2003), the predictors are the output of SQL queries over the input data. This model
tightly integrates the ILP step with the statistical learning step in a dynamic propositionalization
approach similar to nFOIL, the difference being that SLR employs a more advanced (and hence
computationally more expensive) statistical model, namely logistic regression.
Maximum Entropy Modelling with Clausal Constraints. MACCENT (Dehaspe 1997)
falls in the same category as SLR described above, with the difference that it uses maximum
entropy modeling as the statistical model.
Relational Markov networks. RMNs (Taskar et al. 2002) combine MNs with database
queries. The relational structure of the RMN is defined by relational clique templates, which are
essentially SQL queries, and their associated potential functions. Each clique template, when
applied to a database, generates a set of tuples. Each tuple defines a clique in the “unrolled”
ground MN. Since they use one parameter per state of each clique, RMNs are limited to small
cliques.
1BC2. 1BC2 (Flach and Lachiche 2004) is a naïve Bayes classifier for structured data. The
logical component (and hence the features) of such a model are fixed, and only the parameters
are learned using statistical learning techniques.
SAYU. SAYU (Davis et al. 2005) uses a “wrapper” approach where (partial) clauses gener-
ated by the refinement search of an ILP system are proposed as features to a (tree augmented)
naïve Bayes, and incorporated if they improve performance. This means that feature learning
and naïve Bayes are tightly coupled similar to nFOIL. However, SAYU scores features based
on a separate tuning set. The probabilistic model is trained to maximize the likelihood on the
training data, while clause selection is based on the area under the precision-recall curve of the
model on a separate tuning set.
CLP(BN). CLP(BN) (Costa et al. 2008) aims at integrating BNs with constraint logic pro-
gramming. The authors in (Costa et al. 2008) propose the language CLP(BN), show that
this model subsumes PRMs, and show also that algorithms from ILP can be used with minor
modifications to learn CLP(BN) programs from data.
LPADs. Logic Programs with Annotated Disjunctions (Riguzzi 2004; Vennekens et al.
2004) combine logic and probability in an elegant way. Each ground annotated disjunctive
clause represents a probabilistic choice between a number of ground non-disjunctive clauses.
By choosing a head atom for each ground clause of an LPAD, one obtains a normal logic
program called an instance of the LPAD. A probability distribution is defined over the space of
instances by assuming independence between the choices made for each clause.
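This semantics can be sketched by enumerating the instances of a tiny ground LPAD; the clauses and annotations below are invented for illustration:

```python
from itertools import product
from math import prod

def instance_distribution(ground_lpad):
    """Enumerate the instances of a ground LPAD: pick one annotated head
    per ground clause; by the independence assumption, an instance's
    probability is the product of the chosen annotations. Each clause is
    given as a list of (head_atom, probability) pairs."""
    for choice in product(*ground_lpad):
        heads = tuple(h for h, _ in choice)
        yield heads, prod(p for _, p in choice)

# two independent ground disjunctive clauses (invented example):
# 0.5::heads ; 0.5::tails.   and   0.3::red ; 0.7::blue.
lpad = [[('heads', 0.5), ('tails', 0.5)],
        [('red', 0.3), ('blue', 0.7)]]
dist = dict(instance_distribution(lpad))
```

The four instances receive probabilities 0.15, 0.35, 0.15 and 0.35, which sum to one as required.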
Chapter 3
Markov Logic Networks
This chapter presents Markov Logic and how Markov Logic Networks (MLNs) serve as tem-
plates for constructing Markov Networks (MNs). It describes existing algorithms for learning
and inference in MLNs.
3.1 Markov Logic
Markov Logic is a combination of MNs and FOL. A FOL KB is a set of hard constraints on
the set of possible worlds: worlds that violate even one formula, have zero probability. Markov
Logic is based on the idea that these constraints must be soften: when a world violates one
formula in the KB it is less probable, but not impossible. A world is more probable, if it vi-
olates fewer formulas. Each formula in Markov Logic has an associated weight that reflects
how strong a constraint is: the higher the weight, the greater the difference in log probability
between a world that satisfies the formula and one that does not, other things being equal. A
set of formulas in Markov Logic is a Markov Logic Network. In this chapter, we make the
assumption that we are in a finite domain; extending MLNs to infinite domains is a topic of
current work (Singla and Domingos 2007). MLNs allow one to define probability distributions over
possible worlds (Halpern 1990). MLNs are defined as follows:
Definition 3.1.1 A Markov Logic Network (Richardson and Domingos 2006) N is a set of
pairs (Fi, wi), where Fi is a formula in first-order logic and wi is a real number. Together with
a finite set of constants C = {c1, c2, ..., cp} it defines a Markov Network M_{N,C} as follows:
1. M_{N,C} contains one binary node for each possible grounding of each predicate appearing in
N. The value of the node is 1 if the ground predicate is true, and 0 otherwise.
2. M_{N,C} contains one feature for each possible grounding of each formula Fi in N. The value
of this feature is 1 if the ground formula is true, and 0 otherwise. The weight of the feature
is the wi associated with Fi in N. Thus there is an edge between two nodes of M_{N,C} iff the
corresponding ground predicates appear together in at least one grounding of one formula in N.
An MLN can be viewed as a template for constructing MNs. The probability distribution over
possible worlds x specified by the ground MN M_{N,C} is given by:

P(X = x) = (1/Z) exp( ∑_{i=1}^{F} wi ni(x) ) = (1/Z) ∏_i φi(xi)^{ni(x)}     (3.1)
where F is the number of formulas in the MLN, ni(x) is the number of true groundings of Fi
in x and xi is the state of the atoms appearing in Fi. As formula weights increase, an MLN
increasingly resembles a purely logical KB, becoming equivalent to one in the limit of all
infinite weights.
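For a tiny ground MLN this distribution can be computed by brute-force enumeration of worlds. This sketch is only feasible for a handful of ground atoms, and the single formula and its weight are invented for illustration:

```python
from itertools import product
from math import exp

def mln_distribution(formulas, weights, atoms):
    """P(X = x) = (1/Z) exp(sum_i w_i n_i(x)) over all truth assignments
    to the ground atoms. Each formula is given as a function n_i that
    counts its true groundings in a world (a dict atom -> bool)."""
    worlds = [dict(zip(atoms, vals))
              for vals in product([False, True], repeat=len(atoms))]
    unnorm = {tuple(w.items()):
              exp(sum(wi * ni(w) for wi, ni in zip(weights, formulas)))
              for w in worlds}
    Z = sum(unnorm.values())                 # partition function
    return {world: u / Z for world, u in unnorm.items()}

# one ground formula, Smokes(A) => Cancer(A), with weight 1.5 (toy example)
atoms = ['Smokes(A)', 'Cancer(A)']
def n1(w):
    # number of true groundings of the implication (here just one)
    return int((not w['Smokes(A)']) or w['Cancer(A)'])
dist = mln_distribution([n1], [1.5], atoms)
```

The one world that violates the formula gets a strictly smaller probability than the satisfying worlds, but not zero, which is exactly the softening of hard constraints described above.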
The syntax of the formulas in an MLN is the standard syntax of FOL (Genesereth and Nils-
son 1987). Free (unquantified) variables are treated as universally quantified at the outermost
level of the formula. In this dissertation the focus is on MLNs whose formulas are function-
free clauses; domain closure is also assumed (it has been proven that no expressiveness is
lost), ensuring that the generated MNs are finite. In this case, the groundings of a formula are
formed simply by replacing its variables with constants in all possible ways.
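Such grounding can be sketched as follows; the predicates and constants are invented for illustration:

```python
from itertools import product

def groundings(variables, constants):
    """All groundings of a function-free formula: substitute each
    variable with each constant in all possible ways."""
    for combo in product(constants, repeat=len(variables)):
        yield dict(zip(variables, combo))

# Smokes(x) over C = {Anna, Bob}: two groundings of the single variable x
subs_x = list(groundings(['x'], ['Anna', 'Bob']))
# Friends(x, y) over the same constants: 2^2 = 4 groundings
subs_xy = list(groundings(['x', 'y'], ['Anna', 'Bob']))
```

In general a formula with v variables over |C| constants has |C|^v groundings, which is why the size of the ground network grows quickly with the domain.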
Considerations:
1. The predicates in every ground formula form a clique in M_{N,C}, which is not necessarily
a maximal one. The structure of the ground M_{N,C} is constructed as follows: there is an edge
between two nodes of M_{N,C} iff the corresponding ground predicates appear together in at least
one grounding of one formula in N.
2. An MLN without variables is an ordinary MN. Any log-linear model over Boolean
variables can be represented as an MLN, since each state of a Boolean clique is defined by a
conjunction of literals.
3. An MLN is different from an ordinary first-order KB in that it can produce useful results
even if it contains contradictions. An MLN can also be obtained by merging several KBs, even
if they are partly incompatible.
4. If a knowledge base KB is satisfiable, the satisfying assignments are the modes of the
distribution represented by an MLN consisting of KB with all positive weights.
5. Each state of M_{N,C} represents a possible world. A possible world is a set of objects,
functions (mappings from tuples of objects to objects), and relations that hold between the
objects; together with an interpretation, they determine the truth value of each ground atom.
The following assumptions ensure that the set of possible worlds for (N,C) is finite, and that
M_{N,C} represents a unique, well-defined probability distribution over those worlds, irrespective
of the interpretation and domain (Richardson and Domingos 2006): unique names (different
constants refer to different objects) (Genesereth and Nilsson 1987); domain closure (the only
objects in the domain are those representable using the constant and the function symbols in
(N,C)) (Genesereth and Nilsson 1987); known functions (for each function appearing in N, the
value of that function applied to every possible tuple of arguments is known, and is an element
of C).
As pointed out in (Richardson and Domingos 2006), the third assumption allows functions
to be replaced by their values when grounding formulas. Therefore the only ground atoms
to be considered are those having constants as arguments. The infinite number of terms con-
structible from all functions and constants in (N, C) (the “Herbrand universe” of (N, C)) can
be ignored, because each of those terms corresponds to a known constant in C, and atoms
involving them are already represented as the atoms involving the corresponding constants.
The possible groundings of a predicate are obtained simply by replacing each variable in the
predicate with each constant in C, and replacing each function term in the predicate by the
corresponding constant. If a formula contains more than one clause, its weight is divided equally
among the clauses, and a clause’s weight is assigned to each of its groundings.
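The grounding of a single predicate can be sketched as follows. This is a minimal illustration, not part of the original formalism; the helper name and the tuple representation of ground atoms are assumptions made here for clarity.

```python
from itertools import product

def ground_predicate(name, arity, constants):
    """Enumerate all groundings of a predicate by substituting every
    combination of domain constants for its variables."""
    return [(name,) + args for args in product(constants, repeat=arity)]

# With |C| = 2 constants, a binary predicate has 2^2 = 4 groundings.
groundings = ground_predicate("Friends", 2, ["Paolo", "Cesare"])
```

In general a predicate of arity a over |C| constants has |C|^a groundings, which is why ground networks can grow very quickly with the domain size.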
The unique names assumption can be removed by introducing an equality predicate of the
form equals(x,y) and adding the necessary axioms to the MLN. For the second assumption, as
shown in (Richardson and Domingos 2006), there can be different scenarios. When the number
n of unknown objects is known, the domain closure can be removed simply by introducing
n arbitrary new constants. If n is unknown but finite, domain closure can be removed by
introducing a distribution over n, grounding the MLN with each number of unknown objects
and computing the probability P(F) = \sum_{n=0}^{n_{max}} P(n)\, P(F \mid M^{n}_{N,C}), where M^{n}_{N,C} is the ground MLN
with n unknown objects. If n is infinite, it would be necessary to extend MLNs to the infinite
case. If H_{N,C} is the set of all ground terms constructible from the function symbols in N and the
constants in N and C (the “Herbrand universe” of (N,C)), assumption 3 (known functions)
can be removed by treating each element of H_{N,C} as an additional constant and applying the
3. MARKOV LOGIC NETWORKS
same procedure used to remove the unique names assumption. In summary, all three
assumptions can be removed if the domain is finite.
6. An MLN can be viewed as a template for constructing MNs. In different worlds (dif-
ferent sets of constants) it will produce different networks of widely varying size, but all with
certain regularities in structure and parameters, given by the MLN (e.g., all groundings of the
same formula will have the same weight).
7. When weights increase, an MLN increasingly resembles a purely logical KB. In the limit
of all infinite weights, the MLN represents a uniform distribution over the worlds that satisfy
the KB. (A non-uniform distribution could easily be represented using additional formulas with
non-zero weights.)
For MLNs to have the full power of PGMs, it is important that they subsume propositional
graphical models.
Proposition 3.1.1 Every probability distribution over discrete or finite-precision numeric variables
can be represented as a Markov Logic Network. (The proof can be found in (Richardson
and Domingos 2006).)
On the other hand, it is important to preserve the power of FOL. The following proposition
states that FOL is a special case of Markov Logic.
Proposition 3.1.2 If KB is a first-order knowledge base, let N be the MLN obtained by
assigning a weight w to every formula in KB, C be the set of constants appearing in KB, P_w(x)
be the probability assigned to a (set of) possible world(s) x by M_{N,C}, \chi_{KB} be the set of worlds
that satisfy KB, and F be an arbitrary formula in FOL. Then:

1. \forall x \in \chi_{KB}: \lim_{w \to \infty} P_w(x) = |\chi_{KB}|^{-1}

2. For all F: KB \models F \iff \lim_{w \to \infty} P_w(F) = 1
This states that, in the limit of all equal infinite weights, the MLN represents a uniform
distribution over the worlds that satisfy the KB, and all entailment queries can be answered by
computing the probability of the query formula and checking whether it is 1.
A simple example of a first-order KB is given in Figure 3.1. FOL statements are hard
constraints: the formulas state that if someone drinks heavily, he will have a car accident, and
that if friends drink, they are going to have accidents.
Figure 3.1: Example of a knowledge base in first-order logic -
Since FOL statements are not always true in practice, it is necessary to soften these hard
constraints. For example, in practice it is not always true that if someone drinks heavily, he
will have a car accident. Figure 3.2 presents a KB in Markov Logic. As can be seen,
formulas have weights attached and statements are no longer always true. Their degree
of truth depends on the attached weight. For instance, the first formula expresses a stronger
constraint than the second.
Figure 3.2: Example of a knowledge base in Markov Logic -
The simple KB in Figure 3.2, together with a set of constants, defines an MN. For example,
suppose we have two constants in the domain that represent two persons, Paolo and Cesare.
The first step in the construction of the MN is the grounding of each predicate according to
the constants of the domain. A partial grounding is shown in Figure 3.3, where only groundings
of HDrinks and CarAcc are considered. The complete set of nodes is shown in Figure 3.4,
where all the groundings of the predicates represent nodes in the graph.
In the next step, any two nodes whose corresponding predicates appear together in some
ground formula are connected. For example, in Figure 3.5, the nodes HDrinks(P) and CarAcc(P)
are connected through an arc, because the two predicates appear together in the grounding of
the second formula. The complete graph is presented in Figure 3.6.
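The edge-construction step can be sketched as follows. The clause representation below is a simplified stand-in for the grounding of the KB in Figure 3.2, assumed here for illustration only.

```python
from itertools import combinations

# Ground clauses of HDrinks(x) => CarAcc(x) for the constants Paolo and Cesare.
constants = ["Paolo", "Cesare"]
ground_clauses = [[("HDrinks", c), ("CarAcc", c)] for c in constants]

# Two nodes are connected iff their atoms appear together in some ground clause.
edges = set()
for clause in ground_clauses:
    for a, b in combinations(clause, 2):
        edges.add(frozenset((a, b)))
```

Each ground clause thus contributes a clique over its atoms, which is exactly how the graph of Figure 3.6 arises from the groundings of the formulas.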
Figure 3.3: Partial construction of the nodes of the ground Markov Network -
Figure 3.4: Complete construction of the nodes of the ground Markov Network -
Figure 3.5: Connecting nodes whose predicates appear in some ground formula -
Figure 3.6: Complete construction of the structure of the graph for the Markov Network -
3.2 Structure Learning of MLNs
This section presents the main structure learning algorithms, describing in detail the evaluation
functions and search strategies used.
3.2.1 Pseudo-likelihood
MLN weights can be learned by maximizing the likelihood of a relational database. As in ILP,
a closed-world assumption (Genesereth and Nilsson 1987) is made: all ground atoms not
in the database are assumed false. If there are n possible ground atoms, we can represent
a database as a vector x = (x_1, ..., x_i, ..., x_n), where x_i is the truth value of the ith ground atom:
x_i = 1 if the atom appears in the database and x_i = 0 otherwise. Standard methods can be used to
learn MLN weights following Equation 3.1. If the jth formula has n_j(x) true groundings, by
Equation 3.1 the derivative of the log-likelihood with respect to its weight is:
\frac{\partial}{\partial w_j} \log P_w(X = x) = n_j(x) - \sum_{x'} P_w(X = x')\, n_j(x') \quad (3.2)
where the sum ranges over all possible databases x', and P_w(X = x') is P(X = x') computed
using the current weight vector w = (w_1, ..., w_j, ...). Thus, the jth component of the gradient is the
difference between the number of true groundings of the jth formula in the data and its
expectation according to the model. Counting the number of true groundings of a first-order
formula, unfortunately, is a #P-complete problem.
The problem with Equation 3.2 is that not only is the first component intractable, but
computing the expected number of true groundings is intractable as well, requiring inference over
the model. Further, efficient optimization methods also require computing the log-likelihood
itself (Equation 3.1), and thus the partition function Z. This can be done approximately using a
Monte Carlo maximum likelihood estimator (MC-MLE) (Geyer and Thompson 1992). How-
ever, the authors in (Richardson and Domingos 2006) found in their experiments that the Gibbs
sampling used to compute the MC-MLEs and gradients did not converge in reasonable time,
and using the samples from the unconverged chains yielded poor results.
In many other fields, such as spatial statistics, social network modeling and language
processing, a more efficient alternative has been pursued: optimizing pseudo-likelihood
(Besag 1975) instead of likelihood. If x is a possible world (a database or truth assignment)
and x_l is the lth ground atom's truth value, the pseudo-likelihood of x is given by the following
equation (we follow the same notation as the authors in (Richardson and Domingos 2006)):

P^{*}_{w}(X = x) = \prod_{l=1}^{n} P_w(X_l = x_l \mid MB_x(X_l)) \quad (3.3)

where MB_x(X_l) is the state of the Markov blanket of X_l in the data (i.e., the truth values of
the ground atoms it appears with in some ground formula). From Equation 3.1 we have:
P(X_l = x_l \mid MB_x(X_l)) = \frac{\exp\left(\sum_{i=1}^{F} w_i n_i(x)\right)}{\exp\left(\sum_{i=1}^{F} w_i n_i(x_{[X_l=0]})\right) + \exp\left(\sum_{i=1}^{F} w_i n_i(x_{[X_l=1]})\right)} \quad (3.4)
Alternatively, we can take the gradient of the pseudo-log-likelihood:

\frac{\partial}{\partial w_i} \log P^{*}_{w}(X = x) = \sum_{l=1}^{n} \left[ n_i(x) - P_w(X_l = 0 \mid MB_x(X_l))\, n_i(x_{[X_l=0]}) - P_w(X_l = 1 \mid MB_x(X_l))\, n_i(x_{[X_l=1]}) \right] \quad (3.5)
where n_i(x_{[X_l=1]}) is the number of true groundings of the ith formula when X_l = 1 and the
remaining data are unchanged, and similarly for n_i(x_{[X_l=0]}). Computing Equation 3.4 or
3.5 does not require inference over the model. The optimal weights for pseudo-log-likelihood
can be found using the limited-memory BFGS algorithm (Liu and Nocedal 1989).
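Equations 3.3–3.5 can be evaluated from precomputed counts alone, without inference. A minimal sketch, assuming the per-atom counts n_i(x_{[X_l=0]}) and n_i(x_{[X_l=1]}) have already been extracted from the data (function names and data layout are illustrative assumptions):

```python
import math

def conditional(weights, n_x0, n_x1, xl):
    """P(X_l = xl | MB_x(X_l)) as in Equation 3.4: a two-way softmax over
    the counts obtained with the atom forced false (n_x0) or true (n_x1)."""
    s0 = sum(w * n for w, n in zip(weights, n_x0))
    s1 = sum(w * n for w, n in zip(weights, n_x1))
    return math.exp(s1 if xl == 1 else s0) / (math.exp(s0) + math.exp(s1))

def pseudo_log_likelihood(weights, atoms):
    """Log of Equation 3.3: sum of log conditionals over all ground atoms.
    `atoms` is a list of (xl, n_x0, n_x1) triples of precomputed counts."""
    return sum(math.log(conditional(weights, n0, n1, xl))
               for xl, n0, n1 in atoms)
```

With zero weights every conditional is 0.5, so the pseudo-log-likelihood of n atoms is n log 0.5; a quasi-Newton optimizer such as L-BFGS can then maximize this function of the weights.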
3.2.2 Two-step Learning
The first attempt to learn MLN structure was that of (Richardson and Domingos 2006), where
the authors used CLAUDIEN (De Raedt and Dehaspe 1997) in a first step to learn the clauses
of the MLN, and then learned the weights in a second step with a fixed structure. Unlike most
other ILP systems, which learn only Horn clauses, CLAUDIEN is able to learn arbitrary
first-order clauses, making it a good candidate for learning the structure of MLNs. The authors
in (Richardson and Domingos 2006) sped up the computation of the optimal weights through
several techniques:
• In Equation 3.5, the sum can be greatly sped up by ignoring predicates that do not appear in
the ith formula.
• The counts n_i(x), n_i(x_{[X_l=1]}) and n_i(x_{[X_l=0]}) do not change with the weights and need only be
computed once.
• Ground formulas whose truth value is not affected by changing the truth value of any
single literal may be ignored, since then n_i(x) = n_i(x_{[X_l=1]}) = n_i(x_{[X_l=0]}). In particular,
this holds for all clauses with at least two true literals, which can often be the great majority
of ground clauses.
To avoid overfitting, the authors penalized the pseudo-likelihood with a Gaussian prior on
each weight. The results obtained by learning the structure (from scratch or by refining an
existing KB) and then the weights were not better than learning the weights for a hand-coded
KB. This is because CLAUDIEN does not maximize the likelihood of the data, but
uses typical ILP coverage evaluation measures.
3.2.3 Single-step Learning by Optimizing Weighted Pseudo-likelihood
Since CLAUDIEN (like other ILP systems) is designed simply to learn first-order theories that
hold with some accuracy and frequency in the data, and not to maximize the data's likelihood
(and hence the quality of the MLN's probabilistic predictions), the authors in (Kok and
Domingos 2005) proposed an algorithm for learning the structure of MLNs by directly optimizing a
likelihood-type measure in a single step. They showed experimentally that it outperforms the
approach of (Richardson and Domingos 2006).
The authors in (Kok and Domingos 2005) found that the measure used in (Richardson and
Domingos 2006) gives undue weight to the largest-arity predicates, resulting in poor modeling
of the rest. For this reason they defined the weighted pseudo-log-likelihood (WPLL):
\log P^{+}_{w}(X = x) = \sum_{r \in R} c_r \sum_{k=1}^{g_r} \log P_w(X_{r,k} = x_{r,k} \mid MB_x(X_{r,k})) \quad (3.6)
where R is the set of first-order predicates, g_r is the number of groundings of first-order
predicate r, and x_{r,k} is the truth value (0 or 1) of the kth grounding of r. The choice of predicate
weights c_r depends on the user's goals. In (Kok and Domingos 2005) c_r was set to 1/g_r, which
has the effect of weighting all first-order predicates equally. If modeling a predicate is not
important (e.g., because it will always be part of the evidence), its weight can be set to zero.
To combat overfitting, the WPLL was penalized with a structure prior of e^{-\alpha \sum_{i=1}^{F} d_i}, where d_i is the
number of predicates that differ between the current version of the clause and the original one. (If the
clause is new, this is simply its length.) As in (Richardson and Domingos 2006), the authors
in (Kok and Domingos 2005) penalized each weight with a Gaussian prior.
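The effect of the weights c_r = 1/g_r in Equation 3.6 can be sketched over precomputed conditional log-probabilities; the dictionary layout below is an illustrative assumption, not part of the original algorithm.

```python
import math

def wpll(logprobs_by_predicate):
    """Weighted pseudo-log-likelihood (Equation 3.6) with c_r = 1/g_r:
    each first-order predicate contributes the average of the log
    conditional probabilities of its g_r groundings, so predicates with
    many groundings do not dominate the score."""
    return sum(sum(logprobs) / len(logprobs)
               for logprobs in logprobs_by_predicate.values())
```

A predicate with four groundings and one with a single grounding then contribute equally, which is exactly the undue-weight problem of the unweighted measure that the WPLL was designed to fix.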
Regarding search strategy, the authors used beam search to find the best clause to add. The
algorithm starts with the unit clauses and the expert-supplied ones, applies each legal literal
addition and deletion to each clause, keeps the b best ones, applies the operators to those, and
repeats until no new clause improves the WPLL. The chosen clause is the one with the highest
WPLL found in any iteration of the search. If the new clause is a refinement of a hand-coded
one, it replaces it. Since each change must improve the WPLL to be accepted (even though
literals are both added and deleted), no loops can occur.
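The beam search just described can be sketched generically; `refine` and `score` are caller-supplied stand-ins for the legal literal additions/deletions and for the WPLL of a candidate clause, so this is an abstraction, not the actual implementation.

```python
def beam_search(initial_clauses, refine, score, b=5, max_iters=20):
    """Beam search over clauses in the spirit of (Kok and Domingos 2005):
    apply every legal refinement to each clause in the beam, keep the b
    best candidates, and repeat until no new clause improves the score.
    Returns the best-scoring clause found in any iteration."""
    beam = list(initial_clauses)
    best = max(beam, key=score)
    for _ in range(max_iters):
        candidates = {c2 for c in beam for c2 in refine(c)}
        if not candidates:
            break
        beam = sorted(candidates, key=score, reverse=True)[:b]
        if score(beam[0]) <= score(best):
            break  # no new clause improves the evaluation function
        best = beam[0]
    return best
```

Because every accepted step must strictly improve the score, the search terminates even though refinements can both add and delete literals.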
As pointed out in (Kok and Domingos 2005) a potentially serious problem that arises when
evaluating candidate clauses using WPLL is that the optimal (maximum WPLL) weights need
to be computed for each candidate. Since this involves numerical optimization, and needs to be
done millions of times, it could easily make the algorithm too slow. In (Della Pietra et al. 1997;
McCallum 2003) the problem is addressed by assuming that the weights of previous features
do not change when testing a new one. Surprisingly, the authors in (Kok and Domingos 2005)
found this to be unnecessary if the very simple approach of initializing L-BFGS with the current
weights (and zero weight for a new clause) is used. Although in principle all weights could
change as the result of introducing or modifying a clause, in practice this is very rare. Second-
order, quadratic-convergence methods like L-BFGS are known to be very fast if started near
the optimum (Sha and Pereira 2003). This is what happened in (Kok and Domingos 2005):
L-BFGS typically converges in just a few iterations, sometimes one.
Experimental evaluation showed that learning the structure in a single step greatly improved
over other methods, such as purely ILP-based methods, purely probabilistic methods, or the
two-step structure learning approach of (Richardson and Domingos 2006).
3.2.4 Bottom-up Learning
The algorithm in (Kok and Domingos 2005) follows a top-down approach based on a generate-
and-test strategy, which blindly generates many potential candidates independently of the
training data and then tests them for fitness on the data. For MLNs the space of potential model
revisions is combinatorially explosive, and such a search can become intractable when following
a top-down strategy. In ILP many attempts have been made to use the data to guide the search
for good candidates. These methods follow a bottom-up approach by using the training data to
construct hypotheses (Muggleton and Feng 1992). Inspired by these approaches, the authors in
(Mihalkova and Mooney 2007) propose Bottom-Up Structure Learning (BUSL), a bottom-up
approach for learning the structure of MLNs. The algorithm uses a “propositional” MN struc-
ture learner to construct “template” networks that guide the construction of candidate clauses.
The basic idea of BUSL is to first automatically create a MN template from the provided data
and then use the nodes in this template as components for clauses that can contain one or more
literals that are connected by a shared variable. In an MN, a node is independent of all
other nodes given its immediate neighbors (i.e., its Markov blanket), and every probability
distribution respecting the independencies captured by the graph of an MN can be represented
as a product of functions defined only over the cliques of the graph. Therefore, to specify the
probability distribution over an MN template, the algorithm needs to consider only clauses defined
over the cliques of the template. BUSL does exactly this. It uses MN templates to restrict the
search space for clauses only to those candidates whose literals correspond to nodes that form
a clique in the template. In this way, it generates fewer candidates for evaluation. Even though
BUSL evaluates fewer candidates, after initially scoring all of them the algorithm attempts
to add them one by one to the MLN, thus changing the MLN at almost every step, which greatly
slows down the computation of the WPLL. This is the algorithm's main drawback regarding
speed. Regarding accuracy, the results in (Mihalkova and Mooney 2007) clearly show that
BUSL outperforms the top-down approach in terms of conditional log-likelihood (CLL) and
area under the precision-recall curve (AUC).
3.3 Parameter Learning of MLNs
Parameter learning for MNs and MLNs can be divided into generative and discriminative
approaches. Generative approaches optimize the joint probability distribution of all the variables.
In contrast, discriminative approaches maximize the conditional likelihood of a set of outputs
given a set of inputs (Lafferty et al. 2001), which often produces better results for prediction
problems. This section presents the main approaches for learning MLN weights.
3.3.1 Generative approaches
Generative approaches for MLNs optimize likelihood or pseudo-likelihood. Both of these
approaches were proposed in (Richardson and Domingos 2006). As introduced in the previous
section, the difficulty with Equation 3.2 is that not only is the first component intractable, but
computing the expected number of true groundings is intractable as well, requiring inference
over the model. Moreover, efficient optimization methods also require computing the log-
likelihood itself (Equation 3.1), and thus the partition function Z. The authors in (Richardson
and Domingos 2006) used a Monte Carlo maximum likelihood estimator (MC-MLE) (Geyer
and Thompson 1992). However, they found in their experiments that the Gibbs sampling used
to compute the MC-MLEs and gradients did not converge in reasonable time, and using the
samples from the unconverged chains yielded poor results. For this reason, pseudo-likelihood
was considered a better choice, since it does not require inference during learning. Because
the (pseudo-)likelihood is a concave function of the weights, the optimal weights can be found
efficiently using standard gradient-based or quasi-Newton optimization methods (Nocedal and Wright 1999). In
(Richardson and Domingos 2006) the optimal weights for pseudo-log-likelihood were found
using the limited-memory BFGS (L-BFGS) algorithm (Liu and Nocedal 1989).
3.3.2 Discriminative Approaches
Since generative approaches optimize the joint distribution of all the variables, there is a
mismatch between the objective function used (likelihood or a function thereof) and the goal of
classification (maximizing accuracy or conditional likelihood). This can often lead to suboptimal
results for predictive tasks in which it is known a priori which predicates will be evidence
and which ones will be queried, and the goal is to correctly predict the query predicate given
the evidence. If we partition the ground atoms in the domain into a set of evidence atoms E
and a set of query atoms Q, the conditional likelihood of Q given E is:
P(q \mid e) = \frac{1}{Z_x} \exp\left(\sum_{i \in F_q} w_i n_i(e,q)\right) = \frac{1}{Z_x} \exp\left(\sum_{j \in G_q} w_j g_j(e,q)\right) \quad (3.7)
where F_q is the set of all MLN clauses with at least one grounding involving a query atom,
n_i(e,q) is the number of true groundings of the ith clause involving query atoms, G_q is the set
of ground clauses in the ground MN involving query atoms, and g_j(e,q) = 1 if the jth ground
clause is true in the data and 0 otherwise. When some variables are “hidden” (i.e., neither query
nor evidence) the conditional likelihood should be computed by summing them out (here for
clarity we treat all non-evidence variables as query variables). The gradient of the conditional
log-likelihood (CLL) is given by:
\frac{\partial}{\partial w_i} \log P_w(q \mid e) = n_i(e,q) - \sum_{q'} P_w(q' \mid e)\, n_i(e,q') = n_i(e,q) - E_w[n_i(e,q)] \quad (3.8)
Computing the expected counts E_w[n_i(e,q)] is intractable. However, they can be approximated
by the counts n_i(e, q*_w) in the MAP state q*_w, the most probable state of the query atoms given
the evidence. Thus, computing the gradient requires only MAP inference to find q*_w, which is
much faster than the full conditional inference needed to compute
E_w[n_i(e,q)]. This approach was successfully used in (Collins 2002) for a special case of
MNs where the query nodes form a linear chain. In this case the MAP state can be found us-
ing the Viterbi algorithm (Rabiner 1989) and the voted perceptron algorithm in (Collins 2002)
follows this approach. To generalize this method to arbitrary MLNs it is necessary to replace
the Viterbi algorithm with a general-purpose algorithm for MAP inference in MLNs. From
Equation 3.7 we can see that, since q*_w is the state that maximizes the sum of the weights of
the satisfied ground clauses, it can be found using a weighted MAX-SAT solver. The authors in (Singla
and Domingos 2005) generalized the voted perceptron algorithm to arbitrary MLNs by replacing
the Viterbi algorithm with the MaxWalkSAT solver (Kautz et al. 1997b). Given an MLN
and a set of evidence atoms, the KB to be passed to MaxWalkSAT is formed by constructing
all groundings of clauses in the MLN involving query atoms, replacing the evidence atoms in
those groundings by their truth values, and simplifying.
However, unlike the Viterbi algorithm, MaxWalkSAT is not guaranteed to reach the global
MAP state. This can potentially lead to errors in the weight estimates produced. The quality
of the estimates can be improved by running a Gibbs sampler starting at the state returned by
MaxWalkSAT, and averaging counts over the samples. If the Pw(q|e) distribution has more than
one mode, doing multiple runs of MaxWalkSAT followed by Gibbs sampling can be helpful.
This approach is followed by the algorithm in (Singla and Domingos 2005), which is essentially
gradient descent.
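The resulting learner can be sketched abstractly; `map_counts_fn` stands in for a MaxWalkSAT call returning the clause counts n_i(e, q*_w) in the approximate MAP state, and the averaging ("voting") follows the voted perceptron. This is a toy sketch of the update rule, not the actual implementation.

```python
def voted_perceptron(init_weights, true_counts, map_counts_fn, eta=0.1, epochs=10):
    """Approximate CLL gradient ascent (Equation 3.8): the expected counts
    E_w[n_i] are replaced by the counts in the MAP state returned by
    `map_counts_fn`, and the weights from all epochs are averaged."""
    w = list(init_weights)
    summed = [0.0] * len(w)
    for _ in range(epochs):
        map_counts = map_counts_fn(w)  # n_i(e, q*_w), e.g. via MaxWalkSAT
        w = [wi + eta * (t - m) for wi, t, m in zip(w, true_counts, map_counts)]
        summed = [s + wi for s, wi in zip(summed, w)]
    return [s / epochs for s in summed]
```

When the MAP counts already match the data counts, the approximate gradient is zero and the weights stop moving, mirroring the fixed point of Equation 3.8.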
Weight learning in MLNs is a convex optimization problem, and thus gradient descent
is guaranteed to find the global optimum. However, convergence to this optimum may be too
slow. The sufficient statistics for MLNs are the number of true groundings of each clause. Since
this number can easily vary by orders of magnitude from one clause to another, a learning rate
that is small enough to avoid divergence in some weights may be too small for fast convergence
in others. This is an instance of the well-known problem of ill-conditioning in numerical
optimization, and many candidate solutions for it exist (Nocedal and Wright 1999). However,
most of these are not easily applicable to MLNs because of the nature of the function to be
optimized.
Another approach for discriminative weight learning of MLNs was proposed in (Lowd and
Domingos 2007). This work uses conjugate gradient (Shewchuk 1994). Gradient
descent can be sped up by performing a line search to find the optimum along the chosen
descent direction, instead of taking a small step of constant size at each iteration. This can still be
inefficient on ill-conditioned problems, since line searches along successive directions tend to
partly undo the effect of each other: each line search makes the gradient along its direction
zero, but the next line search will generally make it non-zero again. This can be solved by
imposing at each step the condition that the gradient along previous directions remain zero.
The directions chosen in this way are called conjugate, and the method conjugate gradient. In
(Lowd and Domingos 2007), the authors used the Polak-Ribiere formula for choosing conjugate
directions, since it has generally been found to be the best-performing one.
Conjugate gradient methods are among the most efficient ones, on a par with quasi-Newton
ones. Unfortunately, as the authors point out in (Lowd and Domingos 2007), applying them to
MLNs is difficult, because line searches require computing the objective function, and there-
fore the partition function Z, which is intractable. Fortunately, the Hessian can be used instead
of a line search to choose a step size. This method is known as scaled conjugate gradient
(SCG), and was proposed in (Moller 1993) for training neural networks. In (Lowd and
Domingos 2007), the step size was chosen using the Hessian, in a manner similar to a diagonal Newton method.
Conjugate gradient methods are often more effective with a preconditioner, a linear transfor-
mation that attempts to reduce the condition number of the problem (Sha and Pereira 2003).
Good preconditioners approximate the inverse Hessian. In (Lowd and Domingos 2007), the
authors used the inverse diagonal Hessian as preconditioner and called the SCG algorithm Pre-
conditioned SCG (PSCG). PSCG was shown to outperform the voted perceptron algorithm of
(Singla and Domingos 2005) on two real-world domains both for CLL and AUC. For the same
learning time, PSCG learned much more accurate models.
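A single direction update of such a preconditioned Polak-Ribiere scheme might look as follows. This is a toy sketch under the assumption that the gradient and the diagonal of the Hessian are supplied by the caller; it is not a reconstruction of the actual PSCG implementation, which also derives the step size from the Hessian.

```python
def pr_direction(grad, prev_grad, prev_dir, inv_hess_diag):
    """One preconditioned Polak-Ribiere conjugate-gradient direction:
    scale the gradient by the inverse diagonal Hessian, then mix in the
    previous search direction with the (non-negative) PR coefficient."""
    precond = [m * g for m, g in zip(inv_hess_diag, grad)]
    num = sum(p * (g - pg) for p, g, pg in zip(precond, grad, prev_grad))
    den = sum(m * pg * pg for m, pg in zip(inv_hess_diag, prev_grad))
    beta = max(0.0, num / den) if den else 0.0
    return [p + beta * d for p, d in zip(precond, prev_dir)]
```

Clamping beta at zero restarts the method with a steepest-ascent step whenever the gradients suggest the conjugacy has been lost, a common safeguard for Polak-Ribiere.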
3.4 Inference in MLNs
This section introduces the inference tasks for MLNs and the related existing algorithms.
3.4.1 MAP Inference
Maximum a posteriori (MAP) inference means finding the most likely state of a set of output
variables given the state of the input variables. As introduced in the previous Section, the MAP
state for a MLN is the state that maximizes the sum of the weights of the satisfied ground
clauses. This state can be efficiently found using a weighted MAX-SAT solver. The authors
in (Singla and Domingos 2005) use the MaxWalkSAT solver (Kautz et al. 1997b) to find the
MAP state and use it in a gradient descent method to compute the number of true groundings
of clauses.
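A minimal MaxWalkSAT sketch follows, with clauses in a DIMACS-like signed-integer encoding; this compact encoding, the all-false initial state, and the recomputation of the full cost at every flip are simplifications for illustration, not the solver's actual design.

```python
import random

def max_walksat(clauses, weights, n_vars, p=0.5, max_flips=10000, seed=0):
    """Minimal MaxWalkSAT (Kautz et al. 1997): repeatedly pick an unsatisfied
    clause; with probability p flip a random variable in it, otherwise flip
    the variable that minimizes the total weight of unsatisfied clauses.
    A literal is a signed int: -3 means "variable 3 is false"."""
    rng = random.Random(seed)
    state = [False] * (n_vars + 1)  # 1-indexed truth assignment

    def satisfied(clause):
        return any(state[abs(lit)] == (lit > 0) for lit in clause)

    def cost():
        return sum(w for c, w in zip(clauses, weights) if not satisfied(c))

    best_state, best_cost = list(state), cost()
    for _ in range(max_flips):
        unsat = [c for c in clauses if not satisfied(c)]
        if not unsat:
            best_state, best_cost = list(state), 0.0
            break
        clause = rng.choice(unsat)
        if rng.random() < p:
            var = abs(rng.choice(clause))
        else:
            def cost_if_flipped(v):
                state[v] = not state[v]
                c = cost()
                state[v] = not state[v]
                return c
            var = min({abs(lit) for lit in clause}, key=cost_if_flipped)
        state[var] = not state[var]
        if cost() < best_cost:
            best_state, best_cost = list(state), cost()
    return best_state, best_cost
```

The mix of random and greedy flips is what lets the search escape local optima while still converging on low-cost assignments.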
Propositionalization is the process of replacing a first-order KB by an equivalent proposi-
tional one. For finite domains, this can be done by replacing each universally (existentially)
quantified formula with a conjunction (disjunction) of all its groundings. A first-order KB is
satisfiable iff the equivalent propositional KB is satisfiable. Thus, inference over a first-order
KB can be performed by propositionalization followed by satisfiability testing.
Stochastic Local Search (Hoos and Stutzle 2005) methods have made much progress in
solving hard combinatorial problems. However, fully instantiating a finite first-order theory re-
quires memory on the order of the number of constants raised to the arity of the clauses, which
significantly limits the size of domains where it remains feasible. In (Singla and Domingos
2006a) a powerful algorithm called LazySAT was proposed that avoids this blowup by taking
advantage of the extreme sparseness that is typical of relational domains (i.e., only a small
fraction of ground atoms are true, and most clauses are trivially satisfied). LazySAT grounds
clauses lazily. At each step in the search it adds only those clauses that could become unsat-
isfied. In contrast, WalkSAT grounds all possible clauses at the outset, consuming time and
memory exponential in their arity.
LazySAT takes as input an MLN (or a pure first-order KB) and a database, which is a set of
ground atoms. An evidence atom is either a ground atom in the database, or a ground atom that
is false by the closed world assumption (Genesereth and Nilsson 1987). The truth values of
evidence atoms are fixed throughout the search, and ground clauses are simplified by removing
the evidence atoms. LazySAT maintains a set of active atoms and a set of active clauses. A
clause is active if it can be made unsatisfied by flipping zero or more of its active atoms. An
atom is active if it is in the initial set of active atoms, or if it was flipped at some point in the
search. The initial active atoms are all those appearing in clauses that are unsatisfied if only the
atoms in the database are true, and all others are false. At each step in the search, the variable
that is flipped is activated together with any clauses that by definition should become active as
a result. Experiments in (Singla and Domingos 2006a) showed that LazySAT greatly reduces
memory requirements compared to WalkSAT, without sacrificing speed or solution quality.
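The initialization step just described can be sketched as follows; the clause representation (lists of signed ground atoms) is an illustrative assumption, not LazySAT's internal data structure.

```python
def initial_active_atoms(ground_clauses, db_true_atoms):
    """LazySAT initialization sketch: under the closed-world assumption,
    an atom is true iff it is in the database. The initial active atoms
    are those appearing in clauses unsatisfied by that assignment.
    A clause is a list of (atom, sign) pairs; sign=True means positive."""
    active = set()
    for clause in ground_clauses:
        if not any((atom in db_true_atoms) == sign for atom, sign in clause):
            active.update(atom for atom, _ in clause)
    return active
```

Because relational domains are extremely sparse, most ground clauses are satisfied by the closed-world assignment and never need to be materialized, which is where the memory savings come from.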
3.4.2 Conditional Inference
Conditional inference in graphical models involves computing the distribution of the query
variables given the evidence and it has been shown to be #P-complete. The most widely used
approach to approximate inference is by using MCMC methods (Gilks et al. 1996) and in
particular Gibbs sampling which proceeds by sampling each variable in turn given its Markov
blanket (the variables it appears in some potential with). To generate samples from the correct
distribution, it suffices that the Markov chain satisfy ergodicity and detailed balance.
If F_1 and F_2 are two formulas in FOL, C is a finite set of constants including any constants
that appear in F_1 or F_2, and N is an MLN, then
P(F_1 \mid F_2, N, C) = P(F_1 \mid F_2, M_{N,C}) = \frac{P(F_1 \wedge F_2 \mid M_{N,C})}{P(F_2 \mid M_{N,C})} = \frac{\sum_{x \in \chi_{F_1} \cap \chi_{F_2}} P(X = x \mid M_{N,C})}{\sum_{x \in \chi_{F_2}} P(X = x \mid M_{N,C})} \quad (3.9)
where \chi_{F_i} is the set of worlds in which F_i holds, and P(x \mid M_{N,C}) is given by Equation 3.1.
Ordinary conditional queries in graphical models are the special case of Equation 3.9 in which
all predicates in F_1, F_2 and N are zero-arity and the formulas are conjunctions.
The computation of Equation 3.9 is intractable even for very small domains. Probabilistic
inference is #P-complete, and logical inference is NP-complete even in finite domains, so MLN
inference is no easier in the worst case. The hope is that, since MLNs allow
fine-grained encoding of knowledge, including context-specific independences, inference may
in some cases be more efficient than inference in an ordinary graphical model for the same
domain.
In theory, P(F1|F2,N,C) can be approximated using an MCMC algorithm that rejects all
moves to states where F2 does not hold, and counts the number of samples in which F1 holds.
However, even this is likely to be extremely slow for arbitrary formulas. What is interesting
for practical purposes is the case when F1 and F2 are conjunctions of ground literals, although
lifted inference is an active research area (Poole 2003; Singla and Domingos. 2008).
The authors in (Richardson and Domingos 2006) propose an algorithm that works in two
phases. The first phase returns the minimal subset M of the ground MN required to compute
P(F_1 \mid F_2, N, C). The second phase performs inference on this network, with the nodes in F_2 set
to their values in F_2. A possible method is Gibbs sampling, but any inference method may be
used. The basic Gibbs step consists of sampling one ground atom given its Markov blanket.
The probability of a ground atom X_l when its Markov blanket B_l is in state b_l is given by:
P(X_l = x_l) = \frac{\exp\left(\sum_{f_i \in F_l} w_i f_i(X_l = x_l, B_l = b_l)\right)}{\exp\left(\sum_{f_i \in F_l} w_i f_i(X_l = 0, B_l = b_l)\right) + \exp\left(\sum_{f_i \in F_l} w_i f_i(X_l = 1, B_l = b_l)\right)} \quad (3.10)
where F_l is the set of ground formulas that X_l appears in, and f_i(X_l = x_l, B_l = b_l) is the
value (0 or 1) of the feature corresponding to the ith ground formula when X_l = x_l and B_l = b_l.
As pointed out in (Richardson and Domingos 2006), for sets of atoms of which exactly one
is true (e.g., the possible values of an attribute), blocking can be used (i.e., one atom is set to
true and the others to false in one step, by sampling conditioned on their collective Markov
blanket). The estimated probability of a conjunction of ground literals is simply the fraction of
samples in which the ground literals are true, after the Markov chain has converged.
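A single Gibbs step over a ground atom, per Equation 3.10, can be sketched as follows; representing ground formulas as boolean functions of the state is an illustrative simplification.

```python
import math
import random

def gibbs_step(state, atom, formulas, weights, rng):
    """Resample ground atom `atom` given its Markov blanket (Equation 3.10).
    `formulas` are the ground formulas the atom appears in, as boolean
    functions of the state; all other atoms stay fixed."""
    def total_weight(value):
        state[atom] = value
        return sum(w for f, w in zip(formulas, weights) if f(state))
    e0, e1 = math.exp(total_weight(False)), math.exp(total_weight(True))
    state[atom] = rng.random() < e1 / (e0 + e1)
    return state
```

Repeating this step over all ground atoms yields the Markov chain whose converged sample fractions estimate the query probabilities.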
One of the problems that arises in real-world applications is that an inference method must
be able to handle both the probabilistic and the deterministic dependencies that might hold in
the domain. MCMC methods are suitable for handling probabilistic dependencies, but give poor
results when deterministic or near-deterministic dependencies characterize a domain. Logical
methods, on the other hand, such as satisfiability testing, cannot be applied to probabilistic
dependencies. One
approach to deal with both kinds of dependencies is that of (Poon and Domingos 2006) where
the authors use SampleSAT (Wei et al. 2004) in a MCMC algorithm to uniformly sample from
the set of satisfying solutions. As pointed out in (Wei et al. 2004), SAT solvers find solutions
very fast but they may sample highly non-uniformly. On the other side, MCMC methods may
take exponential time, in terms of problem size, to reach the stationary distribution. For this
reason, the authors in (Wei et al. 2004) proposed to use a hybrid strategy by combining random
walk steps with MCMC steps, and in particular with Metropolis transitions. This permits
to efficiently jump between isolated or near-isolated regions of non-zero probability, while
preserving detailed balance.
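The hybrid strategy can be illustrated with a small sketch: with some probability the chain takes a random-walk step that repairs an unsatisfied clause, and otherwise a Metropolis step whose acceptance rule preserves detailed balance. This is only in the spirit of SampleSAT; the CNF encoding and the parameter values are illustrative:

```python
import math
import random

def hybrid_step(assignment, clauses, p_walk=0.5, temperature=0.1, rng=random):
    """One hybrid sampling step: random-walk repair or Metropolis transition.

    assignment: dict variable -> bool; clauses: lists of signed integers
    (positive = positive literal). Both encodings are illustrative.
    """
    def unsat(a):
        return [c for c in clauses if not any((lit > 0) == a[abs(lit)] for lit in c)]

    def flipped(a, var):
        b = dict(a)
        b[var] = not b[var]
        return b

    broken = unsat(assignment)
    if broken and rng.random() < p_walk:
        # Random-walk step: flip a variable of a random unsatisfied clause.
        return flipped(assignment, abs(rng.choice(rng.choice(broken))))
    # Metropolis step: flip a random variable and accept with probability
    # min(1, exp(-dE/T)), where the energy E counts unsatisfied clauses.
    # This acceptance rule is what preserves detailed balance.
    var = rng.choice(sorted(assignment))
    proposal = flipped(assignment, var)
    d_energy = len(unsat(proposal)) - len(broken)
    if d_energy <= 0 or rng.random() < math.exp(-d_energy / temperature):
        return proposal
    return assignment
```

On a trivial instance the chain quickly reaches the satisfying region and, at low temperature, rarely leaves it.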
Experimental results in (Poon and Domingos 2006) show that MC-SAT greatly outperforms
Gibbs sampling and simulated tempering. Recently, a lazy version, Lazy-MC-SAT (Poon et al. 2008), was shown to greatly reduce memory requirements for the inference task. Experimental
evaluation in (Poon et al. 2008) shows that it reduces memory and time by orders of magnitude
compared to MC-SAT.
Chapter 4
The GSL algorithm
This chapter describes the Generative Structure Learning (GSL) algorithm for learning the
structure of MLNs based on the Iterated Local Search (ILS) metaheuristic.
4.1 The Iterated Local Search metaheuristic
Many widely known and high-performance local search algorithms make use of randomized
choice in generating or selecting candidate solutions for a given combinatorial problem in-
stance. These algorithms are called Stochastic Local Search (SLS) algorithms (Hoos and Stut-
zle 2005) and represent one of the most successful and widely used approaches for solving
hard combinatorial problems. These problems are characterized by a large space of candidate solutions to be explored, over which systematic deterministic methods can be prohibitively expensive. In the following we motivate why we chose SLS algorithms for our structure learning algorithm for MLNs.
As pointed out in (Hoos and Stutzle 2005), there are three good reasons to consider applying SLS algorithms. The first is that many problems are of a constructive nature and their instances are known to be solvable. In these situations, the goal of any search algorithm is to find a solution rather than just to decide whether a solution exists. This holds in particular for optimization problems, such as the Travelling Salesman Problem (TSP), where the actual problem is to find a solution of sufficiently high quality. Therefore, the main advantage of a complete
systematic algorithm (the ability to detect that a given problem instance has no solution) is not
relevant for finding solutions of solvable instances. Secondly, in most application scenarios,
the time to find a solution is limited. In these situations, systematic algorithms often have to be aborted once the given time has been exhausted, which renders them incomplete. This is problematic for many systematic optimization algorithms that search through spaces of partial solutions without computing complete solutions early in the search: if such an algorithm is aborted prematurely, usually no complete solution candidate is available, whereas in the same situation SLS algorithms typically return the best solution found so far. Thirdly, algorithms for real-time problems should be able to deliver reasonably good solutions at any point during their execution. For optimization problems this typically means that run-time and solution quality should be positively correlated; for decision problems one could guess a solution when a time-out occurs, where the accuracy of the guess should increase with the run-time of the algorithm. This so-called any-time property of algorithms is usually very difficult to achieve, but in many situations the SLS paradigm is naturally suited for devising any-time algorithms.
In general, it is not straightforward to decide whether to use a systematic or SLS algorithm
in a certain task. Systematic and SLS algorithms can be considered complementary to each
other. SLS algorithms are advantageous in many situations, particularly if reasonably good
solutions are required within a short time, if parallel processing is used and if the knowledge
about the problem domain is rather limited. In other cases, when time constraints are less important and some knowledge about the problem domain can be exploited, systematic search may be a better choice.
In learning the structure of MLNs, we are faced with the problem of maximizing the like-
lihood of the data. This implies searching for the possibly best structure among the candidate
structures. Many search algorithms get trapped in local optima of the evaluation function and may fail to find the best solution (the solutions they return are often of insufficient quality). SLS methods exploit different mechanisms for escaping local optima, and this feature makes them very useful in many optimization problems. Moreover, depending on the future application domain of the MLNs, the time requirements may be very strict, so that a solution of sufficiently high quality must be found within a fixed time. Future extensions of the system that learns MLNs using SLS methods may benefit from parallel processing abilities. Finally, by fulfilling the any-time property through SLS methods, the resulting system can be used to solve real-time problems that involve learning MLNs.
Many “simple” SLS methods come from other search methods by just randomizing the
selection of the candidates during search, such as Randomized Iterative Improvement (RII),
Uninformed Random Walk, etc. Many other SLS methods combine “simple” SLS methods to
exploit the abilities of each of these during search. These are known as Hybrid SLS methods
Algorithm 4.1 The Iterated Local Search algorithm
Procedure IteratedLocalSearch
    s0 = GenerateInitialSolution
    s∗ = LocalSearch(s0)
    repeat
        s′ = Perturb(s∗, history)
        s∗′ = LocalSearch(s′)
        s∗ = Accept(s∗′, s∗, history)
    until termination condition is true
    return s∗
end
(Hoos and Stutzle 2005). ILS is one of these metaheuristics because it can be easily combined
with other SLS methods.
One of the simplest and most intuitive ideas for addressing the fundamental issue of es-
caping local optima is to use two types of SLS steps: one for reaching a local optimum as
efficiently as possible, and the other for effectively escaping it. ILS methods (Hoos and Stut-
zle 2005; Loureno et al. 2002) exploit this key idea, and essentially use two types of search
steps alternatingly to perform a walk in the space of local optima w.r.t. the given evaluation
function. Algorithm 4.1 works as follows: The search process starts from a randomly selected
element s0 of the search space. From this initial candidate solution, a locally optimal solution
s∗ is obtained by applying a subsidiary local search procedure. Then each iteration step of the
algorithm consists of three major steps: first a perturbation method is applied to the current
candidate solution s∗; this yields a modified candidate solution s′ from which, in the next step, a subsidiary local search is performed until another local optimum s∗′ is obtained. In the last step, an acceptance criterion is used to decide from which of the two local optima, s∗ or s∗′, the search process is continued. The algorithm can terminate after some steps have not produced
improvement or simply after a certain number of steps. The choice of the components of the
ILS has a great impact on the performance of the algorithm. A schematic representation of
the ILS algorithm is given in Figure 4.1, where the perturbation operator causes the search to
continue in another region of the search space in order to escape the first local optimum.
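The loop of Algorithm 4.1 can be sketched generically as follows; the component names and the toy objective are illustrative:

```python
import random

def iterated_local_search(generate_initial, local_search, perturb, accept,
                          evaluate, max_no_improve=5):
    """Skeleton of an ILS run. Terminates after max_no_improve consecutive
    iterations without improvement, one of the stopping criteria in the text."""
    s_star = local_search(generate_initial())
    best, no_improve = s_star, 0
    while no_improve < max_no_improve:
        s_prime = perturb(s_star)                 # jump away from the optimum
        s_star_prime = local_search(s_prime)      # reach a new local optimum
        if evaluate(s_star_prime) > evaluate(best):
            best, no_improve = s_star_prime, 0
        else:
            no_improve += 1
        s_star = accept(s_star, s_star_prime)     # acceptance criterion
    return best

# Toy usage: maximize f(x) = -(x - 7)^2 over the integers.
def _f(x):
    return -(x - 7) ** 2

def _hill_climb(x):
    # Subsidiary local search: move to the better of the two neighbors
    # until no neighbor improves.
    while True:
        nbr = max((x - 1, x + 1), key=_f)
        if _f(nbr) <= _f(x):
            return x
        x = nbr

result = iterated_local_search(
    generate_initial=lambda: 0,
    local_search=_hill_climb,
    perturb=lambda x: x + random.randint(-10, 10),
    accept=lambda cur, new: new if _f(new) >= _f(cur) else cur,
    evaluate=_f)
```

Here the subsidiary local search always converges to x = 7, so the skeleton returns the global optimum regardless of the perturbation.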
Figure 4.1: The Iterated Local Search schema
4.2 Generative Structure Learning using ILS
In this section we describe the ILS metaheuristic tailored to the problem of learning the struc-
ture of MLNs. Algorithm 4.2 iteratively adds the best clause to the current MLN until two
consecutive steps have not produced improvement (however other stopping criteria could be
applied). Algorithm 4.3 performs an iterated local search to find the best clause to add to the
MLN. It starts by randomly choosing a unit clause CLC in the search space. Then it performs
a greedy local search to efficiently reach a local optimum CLS. At this point, a perturbation
method is applied leading to the neighbor CL′C of CLS and then a greedy local search is ap-
plied to CL′C to reach another local optimum CL′S. The accept function decides whether the search must continue from the previous local optimum CLS or from the newly found local optimum CL′S (accept can perform a random walk or iterative improvement in the space of local optima). Careful choice of the various components of Algorithm 4.3 is important to achieve
high performance.
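The outer loop just described (Algorithm 4.2) can be sketched abstractly as follows; the callbacks stand in for the MLN machinery (clause search, weight learning, WPLL scoring) and are illustrative:

```python
def gsl_outer_loop(search_best_clause, add_clause, score_fn, min_gain=0.05):
    """Outer loop of a GSL-style learner, abstracted over the MLN machinery.

    search_best_clause(best_score) -> candidate clause or None
    add_clause(clause)             -> adds the clause (and relearns weights)
    score_fn()                     -> current score (WPLL) of the model
    Stops when no clause is found or the gain stays at or below min_gain
    for two consecutive steps.
    """
    best_score = score_fn()
    low_gain_steps = 0
    while low_gain_steps < 2:
        clause = search_best_clause(best_score)
        if clause is None:
            break
        add_clause(clause)
        score = score_fn()
        gain = score - best_score
        if score >= best_score:
            best_score = score
        low_gain_steps = low_gain_steps + 1 if gain <= min_gain else 0
    return best_score

# Toy usage: each added clause improves the score by a shrinking amount,
# so the loop stops after two consecutive low-gain steps.
_gains = iter([1.0, 0.5, 0.01, 0.01, 0.0])
_state = {"score": -10.0}
final_score = gsl_outer_loop(
    search_best_clause=lambda best: "clause",
    add_clause=lambda c: _state.__setitem__("score", _state["score"] + next(_gains)),
    score_fn=lambda: _state["score"],
    min_gain=0.05)
```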
4.2.1 The Perturbation Component
The clause perturbation operator (flipping the sign of a literal, removing a literal, or adding a literal) has the goal of jumping to a different region of the search space, where the search restarts in the next iteration. Perturbations can be strong or weak: if the jump lands near the current local optimum, the subsidiary local search procedure LocalSearchII may fall back into the same local optimum or enter regions with the same value of the objective function (a plateau), while if the jump is too far, LocalSearchII may take too many steps to reach another good solution. In our algorithm we use only strong perturbations, i.e., we always restart from unit clauses (in future work we intend to dynamically adapt the
Algorithm 4.2 The GSL algorithm
Input: P: set of predicates, MLN: Markov Logic Network, RDB: Relational Database
CLS = all clauses in MLN ∪ P;
LearnWeights(MLN, RDB);
Score = WPLL(MLN, RDB);
repeat
    BestClause = SearchBestClause(P, MLN, Score, CLS, RDB);
    if BestClause ≠ null then
        Add BestClause to MLN;
        Score = WPLL(MLN, RDB);
        if BestScore ≤ Score then
            Gain = Score - BestScore;
            BestScore = Score;
        end if
    end if
until BestClause = null || Gain ≤ minGain for two consecutive steps
return MLN
nature of the perturbation). In this way we induce randomness in the search process to avoid
search stagnation. Careful check (tabu check) is performed when restarting, in order to avoid
starting again from the same unit clause.
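The perturbation operators mentioned above can be sketched as follows; the literal encoding and the way an operator is selected are illustrative, not the actual implementation:

```python
import random

def perturb_clause(clause, candidate_literals, rng=random):
    """Perturb a clause by flipping a literal's sign, removing a literal,
    or adding one. Literals are (predicate, positive) pairs; the names are
    illustrative."""
    clause = list(clause)
    op = rng.choice(["flip", "remove", "add"])
    if op == "flip" and clause:
        i = rng.randrange(len(clause))
        pred, positive = clause[i]
        clause[i] = (pred, not positive)        # flip the sign of one literal
    elif op == "remove" and len(clause) > 1:    # keep at least one literal
        clause.pop(rng.randrange(len(clause)))
    else:
        # Add a literal not already present, if any remain.
        unused = [l for l in candidate_literals if l not in clause]
        if unused:
            clause.append(rng.choice(unused))
    return clause
```

Each operator yields a clause different from its input, which is what makes the jump in the search space effective.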
Finding a good perturbation operator is not an easy task because this procedure interacts with the other components of ILS (Loureno et al. 2002). When optimizing each of the four components of ILS separately, we assume the other components remain fixed. This is quite a useful approximation, but clearly the optimization of one component depends on the choices made for the others. Therefore, at least in principle, one should tackle the global optimization of an ILS. As pointed out in (Loureno et al. 2002), since at present there is no theory for analyzing a metaheuristic such as iterated local search, there are some practical general rules to follow when trying to globally optimize ILS. For example, the procedure GenerateInitialSolution is probably irrelevant when the ILS performs well and rapidly loses memory of its starting point.
If the optimization of GenerateInitialSolution can be ignored, then the joint optimization of
the other three components must be achieved. The best choice of Perturb depends on the
choice of LocalSearch while the best choice of Accept depends on the choices of LocalSearch
and Perturb. In practice, the global optimization problem can be approximated by successively
optimizing each component assuming the others are fixed until no improvements are found for
any of the components. Thus the global optimization can be seen as an iterative process. This
Algorithm 4.3 The SearchBestClause component of GSL
Input: P: set of predicates, MLN: Markov Logic Network, BestScore: current best score, CLS: list of clauses, RDB: Relational Database
CLC = randomly pick a clause in CLS ∪ P;
CLS = LocalSearchII(CLC);
BestClause = CLS;
repeat
    CL′C = Perturb(CLS);
    CL′S = LocalSearchII(CL′C, MLN, BestScore);
    if WPLL(CL′S, MLN, RDB) ≥ WPLL(BestClause, MLN, RDB) then
        BestClause = CL′S;
        Add BestClause to MLN;
        BestScore = WPLL(CL′S, MLN, RDB)
    end if
    CLS = accept(CLS, CL′S);
until two consecutive steps have not produced improvement
return BestClause
does not guarantee global optimization of the ILS, but it should lead to an adequate optimiza-
tion of the overall algorithm.
Regarding the strength of the perturbation, its effect should not be easily undone by the
local search; if the local search has obvious short-comings, a good perturbation should com-
pensate for them. The authors in (Loureno et al. 2002) point out that the decision to use weak
perturbations depends on whether the best solutions “cluster” in the space S∗ of locally optimal
solutions. In some problems (TSP is one of them), there is a strong correlation between the cost
of a solution and its “distance” to the optimum: in effect, the best solutions cluster together, i.e.,
have many similar components. This has been referred to as the “Massif Central” phenomenon (Fonlupt et al. 1999), the principle of proximate optimality (Glover and Laguna 1997), and replica symmetry (Mezard et al. 1987). If the problem under consideration has this property, it is useful to attempt to find the true optimum using a biased sampling of S∗. In particular, it is clear that it is useful to use exploitation to improve the probability of hitting the global optimum.
For problems where the clustering of the best solutions is incomplete, i.e., where very distant solutions can be nearly as good as the optimum, weak perturbation may fail. Examples of combinatorial optimization problems in this category are graph bisection and MAX-SAT.
Naturally, exploitation is still needed to get the best solution in one’s current neighborhood, but
generally this will not lead to the optimum. After an exploitation phase, one must then explore other regions of S∗. This can be achieved by using strong perturbations whose strength grows with the instance size. Another possibility is to restart the algorithm from scratch and repeat another exploitation phase.
For MLN structure learning we do not have a theoretical analysis of the search space S∗, and to the best of the author's knowledge there is no work on the properties of the search space of the pseudo-likelihood evaluation function for MLNs. We empirically found that many good candidates, as good as the optimum, were too distant from each other (they differ in a large number of literals). This led us to follow the approach of using strong perturbations in order to explore different regions (i.e., clusters) of S∗. As the results of (Biba et al. 2008) and of the next Section show, this was quite a reasonable choice, together with an iterative improvement local search procedure.
4.2.2 The Local Search Component
As a general rule, LocalSearch should be as powerful as possible as long as it is not too expensive for the whole search process. Since finding a good MLN structure is a hard optimization problem, greedily improving the scoring function is a good choice for achieving high-quality solutions. As described in the previous paragraph, this is useful for exploring each cluster of solutions of S∗ and then leaving the cluster through a strong perturbation operator. Since we are not sure about the clustering of solutions for the task of MLN structure learning, we must make sure that when entering a cluster the best possible solution is found. We have a good chance of achieving this by using a greedy local search procedure. For this reason, regarding the procedure LocalSearchII (Algorithm 4.4), we decided to use an iterative improvement approach (the walk probability is set to zero and the best clause is always chosen in stepII) in order to balance intensification (greedily increasing solution quality by exploiting the evaluation function) and diversification (randomness induced by strong perturbations to avoid search stagnation). However, as future work we intend to study the properties of the search space S∗ and weaken intensification by using a higher walk probability. Finally, the accept function always accepts the best solution found so far.
Algorithm 4.4 The local search component of GSL
Input: CLC: current clause; wp: walk probability (the probability of performing a random step instead of an improvement step)
repeat
    NBHD = neighborhood of CLC constructed using the clause construction operators;
    CLS = StepRII(CLC, NBHD, wp);
    CLC = CLS;
until two consecutive steps do not produce improvement
return CLS;

StepRII(CLC, NBHD, wp)
    U = random(]0,1]); (random number drawn from a uniform probability distribution)
    if U ≤ wp then
        CLS = stepURW(CLC, NBHD); (Uninformed Random Walk: randomly choose a neighbor from NBHD)
    else
        CLS = stepII(CLC, NBHD); (Iterative Improvement: choose the best among the improving neighbors in NBHD; if there is no improving neighbor, choose the minimally worsening one)
    end if
    return CLS
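A minimal sketch of the StepRII component follows; note that taking the best neighbor overall implements both stepII cases (the best improving neighbor if one exists, otherwise the minimally worsening one). The interfaces are illustrative:

```python
import random

def step_rii(current, neighbors, evaluate, wp, rng=random):
    """One Randomized Iterative Improvement step: with probability wp an
    uninformed random-walk step, otherwise an iterative-improvement step."""
    u = 1.0 - rng.random()          # uniform in (0, 1], matching random(]0,1])
    if u <= wp:
        return rng.choice(neighbors)        # stepURW: pick any neighbor
    # stepII: the best neighbor overall is the best improving neighbor if any
    # improves on `current`, and the minimally worsening one otherwise.
    return max(neighbors, key=evaluate)
```

With wp = 0 this reduces to pure iterative improvement, the setting used in GSL; with wp = 1 it degenerates to an uninformed random walk.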
4.3 Experiments
In this Section, experiments in two real-world domains are presented. One is social network analysis, with the goal of predicting links among web pages that describe social actors; the other is entity resolution in a large database of citations.
4.3.1 Link Analysis
Link analysis is an important problem in many domains where entities such as people, Web
pages, computers, scientific publications, organizations, are interconnected and interact in one
way or another (Popescul and Ungar 2003). Predicting the presence of links between entities
is not an easy task due to the characteristics of such domains. The first requirement regards
the representation formalism. Flat representations are not suitable to deal with these problems,
hence relational formalisms must be used. Second, most of these domains contain noisy or partially observed data; thus, robust methods for dealing with uncertainty must be used.
Regarding other SRL models applied to link analysis, in (Popescul and Ungar 2003) an SRL model, Structural Logistic Regression, was used to solve a link analysis problem about predicting citations in scientific literature. In (Taskar et al. 2003) Relational Markov Networks were
used for link prediction in two domains: university webpages and social networks. Markov
Logic was first applied to link prediction by (Richardson and Domingos 2006) and then by
(Kok and Domingos 2005; Mihalkova and Mooney 2007; Singla and Domingos 2005).
Dataset
The experiments on link prediction were carried out on a publicly-available database: the
UW-CSE database (available at http://alchemy.cs.washington.edu/data/uw-cse) used by (Kok
and Domingos 2005; Mihalkova and Mooney 2007; Richardson and Domingos 2006; Singla
and Domingos 2005). This dataset represents a standard relational one and is used for the
important relational task of social network analysis.
The published UW-CSE dataset consists of 15 predicates (Table 4.1) and 1323 constants divided into 9 types. Types include publication, person, course, etc. Predicates include Student(person), Professor(person), AdvisedBy(person1, person2), TaughtBy(course, person, quarter), Publication(paper, person), etc. The dataset contains 2673 tuples (true ground atoms,
with the remainder assumed false). The task is to predict who is whose advisor from informa-
tion about coauthorships, classes taught, etc. More precisely, the query atoms are all ground-
ings of AdvisedBy(person1, person2), and the evidence atoms are all groundings of all other
predicates (Richardson and Domingos 2006). In our experiments we performed inference over all the predicates in the domain, not only AdvisedBy, in order to see how good the learned models are at estimating probabilities for a large number of query atoms.
4.3.2 Entity Resolution
Entity resolution is the problem of determining which records in a database refer to the same
entities, and is an essential and expensive step in the data mining process. The problem was
originally defined in (Newcombe et al. 1959) and then the work in (Fellegi and Sunter 1969)
laid the theoretical basis of what is now known by the name of object identification, record
linkage, de-duplication, merge/purge, identity uncertainty, co-reference resolution, and others.
The main characteristics of domains where this problem must be solved are relations between
objects and uncertainty about these relations due to noisy or partially observed data.
Dataset
The Cora dataset consists of 1295 citations of 132 different computer science papers, drawn
from the Cora CS Research Paper Engine. The task is to predict which citations refer to the
same paper, given the words in their author, title, and venue fields. The labeled data also
specify which pairs of author, title, and venue fields refer to the same entities. We performed
experiments for each field in order to evaluate the ability of the model to deduplicate fields
as well as citations. The dataset contains 10 predicates (Table 4.2) and 70367 tuples (explicitly listed true and false ground atoms, with the remainder assumed false). Since the number of possible equivalences is very large, as the authors did in (Lowd and Domingos 2007) we used the canopies found in (Singla and Domingos 2006b) to make this problem tractable. The dataset used is in Alchemy format (publicly available at http://alchemy.cs.washington.edu/data/cora/). The original version, not in Alchemy format, was segmented by Bilenko and Mooney in (Bilenko and Mooney 2003) (available at http://www.cs.utexas.edu/users/ml/riddle/data/cora.tar.gz).
4.3.3 Systems and Methodology
We implemented Algorithm 4.2 (GSL) as part of the MLN++ package (Section 8.3), a suite of algorithms based on Markov Logic and built upon the Alchemy framework (Kok et al. 2005). Alchemy implements inference and learning algorithms for Markov Logic. Alchemy can be viewed as a declarative programming language akin to Prolog, but with some key differences: the underlying inference mechanism is model checking instead of theorem proving, and the full syntax of first-order logic is allowed, rather than just Horn clauses. Moreover, Alchemy has built-in functionality for handling uncertainty and learning from data. MLN++ uses the Alchemy API for some tasks, such as the L-BFGS implementation for learning maximum-WPLL weights.
To evaluate GSL, we compared its performance against the state-of-the-art algorithms for generative structure learning of MLNs: BS (Beam Search) of (Kok and Domingos 2005) and BUSL (Bottom-Up Structure Learning) of (Mihalkova and Mooney 2007).
In the UW-CSE domain, we used the same leave-one-area-out methodology as in (Richard-
son and Domingos 2006). In the Cora domain, we performed cross-validation. For each sys-
tem on each test set, we measured the conditional log-likelihood (CLL) and the area under
Table 4.1: All predicates in the UW-CSE domain
TaughtBy(course, person, semester)    CourseLevel(course, level)        Position(person, pos)
AdvisedBy(person, person)             ProjectMember(project, person)    Phase(person, phase)
TempAdvisedBy(person, person)         YearsInProgram(person, year)      TA(course, person, semester)
Student(person)                       Professor(person)                 SamePerson(person, person)
SameCourse(course, course)            SameProject(project, project)     Publication(title, person)
Table 4.2: All predicates in the CORA domain
author(citation, author)          title(citation, title)
venue(citation, venue)            sameBib(citation, citation)
sameAuthor(author, author)        sameTitle(title, title)
sameVenue(venue, venue)           hasWordAuthor(author, word)
hasWordTitle(title, word)         hasWordVenue(venue, word)
the precision-recall curve (AUC) for all the predicates. The advantage of the CLL is that it
directly measures the quality of the probability estimates produced. The advantage of the AUC
is that it is insensitive to the large number of true negatives (i.e., ground atoms that are false
and predicted to be false). The CLL of a query predicate is the average over all its groundings
of the ground atom’s log-probability given evidence. The precision-recall curve for a predicate
is computed by varying the probability threshold above which a ground atom is predicted to be true; i.e., the ground atoms whose probability of being true is greater than the threshold are classified as positive and the rest as negative.
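The two measures can be sketched as follows (a simple thresholding illustration; the dissertation computes AUC with the method and package of Davis and Goadrich 2006):

```python
import math

def avg_cll(probs, truths):
    """Average conditional log-likelihood of query ground atoms: the mean
    log-probability assigned to each atom's actual truth value."""
    return sum(math.log(p if t else 1.0 - p)
               for p, t in zip(probs, truths)) / len(probs)

def precision_recall_points(probs, truths, thresholds):
    """Precision-recall pairs obtained by varying the probability threshold
    above which an atom is predicted true."""
    points = []
    for th in thresholds:
        tp = sum(1 for p, t in zip(probs, truths) if p > th and t)
        fp = sum(1 for p, t in zip(probs, truths) if p > th and not t)
        fn = sum(1 for p, t in zip(probs, truths) if p <= th and t)
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        points.append((precision, recall))
    return points
```

Plotting the points over a fine grid of thresholds and integrating yields the area under the precision-recall curve.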
For all algorithms, we used the default parameters of Alchemy changing only the following
ones: maximum variables per clause = 5 for UW-CSE and 6 for Cora; penalization of WPLL:
0.01 for UW-CSE and 0.001 for Cora. For L-BFGS: convergence threshold = 10−5 (tight) and
10−4 (loose); minWeight = 0.5 for UW-CSE for BUSL as in (Mihalkova and Mooney 2007),
1 for BS as in (Kok and Domingos 2005) and 1 for ILS; minGain = 0.05 for ILS. For GSL we used multiple-independent-walk parallelism, assigning each instance of the algorithm to a separate CPU on a cluster of Intel Core2 Duo 2.13 GHz CPUs.
4.3.4 Results
After learning the structure, we performed inference on the test fold for both datasets by using
MC-SAT (Poon and Domingos 2006) with number of steps = 10000 and simulated annealing
Table 4.3: Accuracy results on UW-CSE for ten parallel independent walks of GSL
       language               ai                     systems                graphics               theory
RUN    CLL            AUC    CLL            AUC    CLL            AUC    CLL            AUC    CLL            AUC
R1     -0.232±0.035  0.420   -0.322±0.034  0.413   -0.056±0.023  0.442   -0.080±0.016  0.425   -0.292±0.036  0.336
R2     -0.140±0.034  0.419   -0.375±0.039  0.353   -0.267±0.032  0.445   -0.342±0.041  0.421   -0.251±0.028  0.386
R3     -0.071±0.023  0.430   -0.171±0.018  0.408   -0.293±0.033  0.467   -0.064±0.013  0.462   -0.112±0.022  0.386
R4     -0.464±0.082  0.393   -0.054±0.005  0.419   -0.307±0.040  0.442   -0.111±0.012  0.426   -0.300±0.046  0.359
R5     -0.329±0.070  0.404   -0.331±0.034  0.421   -0.323±0.034  0.449   -0.465±0.046  0.365   -0.104±0.030  0.368
R6     -0.335±0.060  0.449   -0.125±0.015  0.415   -0.266±0.033  0.411   -0.358±0.036  0.442   -0.262±0.034  0.384
R7     -0.285±0.067  0.427   -0.060±0.008  0.394   -0.254±0.032  0.402   -0.306±0.035  0.465   -0.249±0.040  0.384
R8     -0.243±0.052  0.418   -0.381±0.036  0.353   -0.371±0.047  0.398   -0.053±0.007  0.483   -0.178±0.033  0.414
R9     -0.224±0.031  0.414   -0.128±0.020  0.397   -0.416±0.051  0.422   -0.348±0.030  0.400   -0.321±0.032  0.412
R10    -0.356±0.067  0.386   -0.377±0.025  0.369   -0.295±0.034  0.482   -0.299±0.034  0.447   -0.212±0.033  0.391
Avg.   -0.268±0.052  0.416   -0.233±0.023  0.394   -0.285±0.036  0.436   -0.243±0.027  0.434   -0.228±0.033  0.381
temperature = 0.5. For each experiment, all the groundings of the query predicates on the test fold were commented out (i.e., removed from the evidence). MC-SAT produces probability outputs for every grounding of the query predicate on the test fold. We used these values to compute the average CLL over all the groundings and the corresponding AUC (for AUC we used the method and the package of (Davis and Goadrich 2006)). For CORA, in some cases the memory and time requirements of the inference task were too high. Thus, to score the learned structures within reasonable time and the available memory, in some cases we used the lazy version of MC-SAT and imposed a limit of one hour for the process to complete. For all the other cases, where the memory requirements were not high, we ran MC-SAT with number of steps = 10000. For ILS we report the performance in terms
of CLL for ten parallel independent walks. Both CLL and AUC results are averaged over all
predicates of the domain.
For UW-CSE the results of GSL are reported in Table 4.3. Every value in the table for
each fold is an average of accuracy for all the predicates. The overall results for UW-CSE
comparing BUSL and GSL are reported in Table 4.4. The columns refer to the results taken for
GSL: GSL-Average refers to the average results of all the parallel runs over all the folds, GSL-
BestCLL refers to the best parallel run for each fold in terms of CLL, GSL-BestAUC refers to
the best parallel run for each fold in terms of AUC, GSL-Best refers to the best parallel run for
each fold taking into account both the optimization of CLL and AUC.
As the results of Table 4.4 show, if we take all the parallel independent walks of GSL,
these on average produce better results than BS but worse results than BUSL in terms of CLL
and AUC. However, if for each fold of the dataset we take the best run of GSL in terms of
Table 4.4: Accuracy comparison of GSL, BUSL and BS on the UW-CSE dataset
           GSL-Average            GSL-BestCLL            GSL-BestAUC
Fold       CLL            AUC     CLL            AUC     CLL            AUC
language   -0.268±0.052  0.416   -0.071±0.023  0.430   -0.335±0.060  0.449
ai         -0.233±0.023  0.394   -0.054±0.005  0.419   -0.331±0.034  0.421
systems    -0.285±0.036  0.436   -0.056±0.023  0.442   -0.295±0.034  0.482
graphics   -0.243±0.027  0.434   -0.053±0.007  0.483   -0.053±0.007  0.483
theory     -0.228±0.033  0.381   -0.104±0.030  0.368   -0.178±0.033  0.414
Average    -0.251±0.034  0.412   -0.068±0.017  0.428   -0.239±0.033  0.450

           GSL-Best               BUSL                   BS
Fold       CLL            AUC     CLL            AUC     CLL            AUC
language   -0.071±0.023  0.430   -0.090±0.030  0.439   -0.433±0.078  0.300
ai         -0.054±0.005  0.419   -0.067±0.009  0.406   -0.289±0.037  0.328
systems    -0.056±0.023  0.442   -0.053±0.005  0.461   -0.242±0.031  0.414
graphics   -0.053±0.007  0.483   -0.101±0.018  0.458   -0.282±0.039  0.287
theory     -0.112±0.022  0.386   -0.061±0.009  0.390   -0.313±0.044  0.273
Average    -0.069±0.016  0.432   -0.074±0.014  0.431   -0.312±0.046  0.320
CLL or AUC, then we see that GSL improves over BUSL. For example, if we take for each
fold the best run that produced better results in terms of CLL, the results show that GSL-
BestCLL performs better than BS and BUSL in terms of CLL. The same happens also for
GSL-BestAUC which performs better than BS and BUSL in terms of AUC. When learning
classifiers or estimating probabilities, the goal is often to optimize both CLL and AUC. This is known as multiobjective optimization and implies that more than one evaluation function be employed to evaluate the quality of the algorithms. In this case, in order to make a comparison
with BUSL, we would have to take for each fold the best independent run of GSL that produced
best results by combining CLL and AUC. From Table 4.4, we can see that GSL-Best produces
the best overall results compared to BS and BUSL.
Regarding learning times for UW-CSE, the results for GSL are shown in Table 4.5. The
comparison with BUSL and BS is presented in Table 4.6 with learning times in minutes. As
we can see, the GSL runs are quite fast compared to BS and BUSL. BUSL is the slowest and takes three to four times longer than GSL to complete. BS is faster than BUSL but generally two to three times slower than GSL.
For CORA the results of GSL are reported in Table 4.7. Every value in the table for each
fold is an average of accuracy for all the predicates. The overall results for CORA comparing
Table 4.5: Learning times (in minutes) on UW-CSE for ten parallel independent walks of GSL
Run       Language   Ai    Systems   Graphics   Theory
1         152        118   112       111        156
2         65         152   173       333        182
3         102        142   256       77         54
4         106        144   153       217        131
5         120        165   65        221        54
6         179        93    142       326        84
7         195        75    117       303        113
8         286        351   54        85         103
9         117        118   115       123        142
10        164        66    187       138        150
Average   149        142   137       193        117
Table 4.6: Comparison of learning times (in minutes) on UW-CSE for GSL, BUSL and BS
Run           Language   Ai    Systems   Graphics   Theory   Average
GSL-Average   149        142   137       193        117      148
GSL-BestCLL   102        144   112       85         54       99
GSL-BestAUC   179        165   187       85         103      144
GSL-Best      102        144   112       85         54       99
BUSL          502        664   765       560        598      618
BS            454        315   336       289        280      335
BUSL and GSL are reported in Table 4.8. The columns refer to the results taken for GSL: GSL-
Average refers to the average results of all the parallel runs over all the folds, GSL-BestCLL
refers to the best parallel run for each fold in terms of CLL, GSL-BestAUC refers to the best
parallel run for each fold in terms of AUC, GSL-Best refers to the best parallel run for each
fold taking into account both CLL and AUC. As the results show, BUSL is competitive with
GSL-Average in terms of AUC but is outperformed by GSL-Average in terms of CLL. The
same holds for GSL-BestCLL, which outperforms BUSL in terms of CLL with an overall result of −0.071 compared to −0.196 for BUSL. The other two views of the results,
GSL-BestAUC and GSL-Best clearly outperform BUSL both in terms of CLL and in terms of
AUC. GSL-Best can be viewed as an approach for multiobjective optimization since both CLL
and AUC are optimized.
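The three "best run" views reported in the tables can be derived mechanically from the per-run scores. The following sketch is illustrative only: the text does not specify the exact rule used to combine CLL and AUC for GSL-Best, so a summed-rank compromise is assumed here; the data are a subset of the Fold1 values from Table 4.7.

```python
# Hedged sketch: selecting the best parallel run per fold under the three
# criteria used in the tables. The summed-rank rule for GSL-Best is an
# assumption made for illustration, not the thesis's stated criterion.

def best_runs(runs):
    """runs: list of (cll, auc) pairs; CLL is negative, higher is better."""
    best_cll = max(runs, key=lambda r: r[0])
    best_auc = max(runs, key=lambda r: r[1])
    # rank each run on both measures; keep the run with the lowest summed rank
    by_cll = sorted(runs, key=lambda r: r[0], reverse=True)
    by_auc = sorted(runs, key=lambda r: r[1], reverse=True)
    best_both = min(runs, key=lambda r: by_cll.index(r) + by_auc.index(r))
    return best_cll, best_auc, best_both

# a subset of the Fold1 (CLL, AUC) values from Table 4.7
fold1 = [(-0.059, 0.131), (-0.143, 0.228), (-0.072, 0.235), (-0.090, 0.131)]
b_cll, b_auc, b_both = best_runs(fold1)
```

Under this rule, the run with the best CLL and the run with the best AUC need not coincide, and the compromise run may be a third one.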
Learning times for CORA are reported in Table 4.10. The independent walks of GSL are
quite fast compared to BUSL, which spends much time first evaluating clauses and then
Table 4.7: Accuracy results on CORA for ten parallel independent walks of GSL

Run    Fold1 (CLL / AUC)      Fold2 (CLL / AUC)      Fold3 (CLL / AUC)      Fold4 (CLL / AUC)      Fold5 (CLL / AUC)
R1     -0.059±0.002 / 0.131   -0.085±0.004 / 0.250   -0.076±0.003 / 0.230   -0.071±0.003 / 0.146   -0.079±0.002 / 0.124
R2     -0.143±0.003 / 0.228   -0.095±0.004 / 0.264   -0.111±0.004 / 0.134   -0.177±0.005 / 0.301   -0.114±0.003 / 0.109
R3     -0.076±0.003 / 0.134   -0.077±0.003 / 0.128   -0.075±0.004 / 0.234   -0.114±0.005 / 0.132   -0.148±0.003 / 0.234
R4     -0.086±0.003 / 0.133   -0.113±0.006 / 0.262   -0.122±0.003 / 0.200   -0.089±0.004 / 0.244   -0.112±0.003 / 0.216
R5     -0.122±0.002 / 0.134   -0.171±0.005 / 0.231   -0.071±0.004 / 0.242   -0.099±0.004 / 0.247   -0.105±0.003 / 0.131
R6     -0.065±0.003 / 0.130   -0.117±0.003 / 0.259   -0.124±0.004 / 0.126   -0.086±0.004 / 0.127   -0.132±0.005 / 0.127
R7     -0.082±0.003 / 0.133   -0.111±0.005 / 0.256   -0.070±0.004 / 0.238   -0.103±0.004 / 0.129   -0.149±0.003 / 0.125
R8     -0.072±0.002 / 0.235   -0.097±0.004 / 0.259   -0.113±0.004 / 0.132   -0.095±0.004 / 0.245   -0.131±0.004 / 0.195
R9     -0.090±0.003 / 0.131   -0.086±0.004 / 0.260   -0.143±0.004 / 0.235   -0.098±0.004 / 0.118   -0.171±0.003 / 0.230
R10    -0.088±0.004 / 0.114   -0.113±0.006 / 0.261   -0.127±0.005 / 0.132   -0.295±0.006 / 0.112   -0.143±0.003 / 0.126
Avg.   -0.088±0.003 / 0.150   -0.107±0.004 / 0.243   -0.103±0.004 / 0.190   -0.123±0.004 / 0.180   -0.128±0.003 / 0.162
Table 4.8: Accuracy comparison of GSL with BUSL on the CORA dataset

Fold   GSL-Average (CLL / AUC)   GSL-BestCLL (CLL / AUC)   GSL-BestAUC (CLL / AUC)   GSL-Best (CLL / AUC)     BUSL (CLL / AUC)
1      -0.088±0.003 / 0.150      -0.059±0.002 / 0.131      -0.072±0.002 / 0.235      -0.072±0.002 / 0.235     -0.099±0.002 / 0.220
2      -0.107±0.004 / 0.243      -0.077±0.003 / 0.128      -0.095±0.004 / 0.264      -0.086±0.004 / 0.260     -0.118±0.003 / 0.129
3      -0.103±0.004 / 0.190      -0.071±0.004 / 0.242      -0.071±0.004 / 0.242      -0.071±0.004 / 0.242     -0.558±0.007 / 0.186
4      -0.123±0.004 / 0.180      -0.071±0.003 / 0.146      -0.177±0.005 / 0.301      -0.099±0.004 / 0.247     -0.100±0.002 / 0.238
5      -0.128±0.003 / 0.162      -0.079±0.002 / 0.124      -0.148±0.003 / 0.234      -0.112±0.003 / 0.216     -0.103±0.003 / 0.234
Avg.   -0.110±0.004 / 0.185      -0.071±0.003 / 0.154      -0.112±0.004 / 0.255      -0.088±0.004 / 0.240     -0.196±0.003 / 0.201
adding them one by one to the current structure. This implies that the MLN changes at
every step, and thus the optimization of the WPLL requires much more time. In GSL this does not
happen, since clause evaluation follows the approach presented in Section 3.2.3, where
the authors of (Kok and Domingos 2005) found that the very simple approach of
initializing L-BFGS with the current weights (and a zero weight for a new clause) was quite
successful. Although in principle all weights could change as the result of introducing or
modifying a clause, in practice this is very rare. Second-order, quadratic-convergence methods
like L-BFGS are known to be very fast if started near the optimum (Sha and Pereira 2003).
For the BS algorithm (Kok and Domingos 2005) in the CORA domain we were not able to
report results, since structure learning with this algorithm did not finish in 45 days. BS is
heavily slowed by its systematic top-down nature that tends to evaluate a very large number of
candidates.
Table 4.9: Learning times (in minutes) on CORA for ten parallel independent walks of GSL

Run       Fold1   Fold2   Fold3   Fold4   Fold5
1         2088    1590    1826    848     1164
2         1781    534     1494    2027    987
3         2615    2986    586     1673    1190
4         1264    883     1468    959     522
5         2230    1734    1014    1232    450
6         787     2015    1748    1213    1697
7         1012    2578    1888    1601    872
8         1020    1064    883     1796    3280
9         2014    518     1270    1531    1958
10        1307    1889    1392    1008    1202
Average   1612    1579    1357    1389    1332
Table 4.10: Comparison of learning times (in minutes) on CORA for GSL and BUSL

Algorithm     Fold1   Fold2   Fold3   Fold4   Fold5   Average
GSL-Average   1612    1579    1357    1389    1332    1454
GSL-BestCLL   2088    2986    1888    848     1164    1795
GSL-BestAUC   1020    534     1014    2027    1190    1157
GSL-Best      1020    2014    1014    1232    522     1160
BUSL          7400    7000    8200    12000   12150   9350
4.4 Related Work
GSL is related to the growing amount of research on learning SRL models (Getoor and Taskar
2007). In particular, it is similar to approaches that tightly integrate the steps of structure and
parameter learning. These approaches typically learn the structure of the model by directly
optimizing a likelihood-type measure. Most of these approaches such as those in (Dehaspe
1997; Huynh and Mooney 2008; Landwehr et al. 2005, 2006, 2007) have their roots in the ILP
community and were extended to SRL or PILP by combining refinement operators with statis-
tical learning. All these systems perform classification, while the goal of GSL is general
probability estimation on relational data, i.e., learning the joint distribution of all the predicates.
Being based on MLNs, the most closely related algorithms are those presented in (Kok and
Domingos 2005; Mihalkova and Mooney 2007), which directly optimize the pseudo-likelihood
measure. The main difference of GSL is that it is a stochastic algorithm based on a hybrid SLS
metaheuristic, ILS. From this point of view the most closely related approach is that of
(Zelezny et al. 2006), which exploits SLS methods in ILP. However, the GSL algorithm we
propose here differs in that it uses likelihood as the evaluation measure instead of ILP coverage
criteria. Moreover, GSL differs from the algorithms proposed in (Zelezny et al. 2006) in that
it uses hybrid SLS approaches, which can combine simple SLS methods to produce
high-performance algorithms.
4.5 Summary
GSL is a stochastic algorithm that performs an Iterated Local Search in the space of structures
guided by pseudo-likelihood. The approach is based on a biased sampling of the set of local
optima focusing the search not on the full space of solutions but on a smaller subspace de-
fined by the solutions that are locally optimal for the optimization engine. It employs a strong
perturbation operator and an iterative improvement local search procedure in order to balance
diversification (randomness induced by strong perturbation to avoid search stagnation) and
intensification (greedily increase solution quality by exploiting the evaluation function). The
experimental evaluation on two benchmark datasets, regarding the problems of Link Analysis
in social networks and Entity Resolution in citation databases, shows that by running multiple
parallel independent walks, GSL achieves improvements over the state-of-the-art algorithms
for generative structure learning of Markov Logic Networks. GSL can be further improved in
future work by: weakening intensification through a higher random-walk probability in the
local search procedure; making the acceptance function probabilistic (it currently performs
iterative improvement in the space of local optima, which can lead to getting stuck, so inducing
some random walk among the local optima would help); implementing more sophisticated
parallel models such as MPI (Message Passing Interface) or PVM (Parallel Virtual Machine);
and dynamically adapting the nature of the perturbations.
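The search scheme summarized above (local search to an optimum, perturbation, local search again, acceptance) is independent of MLNs and can be sketched generically. The toy objective, neighbourhood and perturbation below are hypothetical stand-ins for the clause space and pseudo-likelihood score used by GSL:

```python
import random

# Generic Iterated Local Search skeleton: intensify with iterative
# improvement, diversify with a perturbation, accept improvements.
# All problem-specific functions are illustrative stand-ins.

def iterated_local_search(initial, neighbours, score, perturb, iters=30, seed=0):
    rng = random.Random(seed)

    def local_search(s):
        # iterative improvement: move to the best neighbour while it improves
        while True:
            best_n = max(neighbours(s), key=score)
            if score(best_n) <= score(s):
                return s
            s = best_n

    current = local_search(initial)
    best = current
    for _ in range(iters):
        candidate = local_search(perturb(current, rng))
        if score(candidate) >= score(current):  # accept improvements only
            current = candidate
        if score(current) > score(best):
            best = current
    return best

# toy usage: maximise -(x - 3)^2 over the integers
best_x = iterated_local_search(
    initial=-20,
    neighbours=lambda x: [x - 1, x + 1],
    score=lambda x: -(x - 3) ** 2,
    perturb=lambda x, rng: x + rng.randint(-10, 10),
)
```

In GSL the state is a clause, the neighbourhood comes from the clause construction operators, and the score is the WPLL; the control flow is exactly this loop.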
Chapter 5
The ILS-DSL algorithm
Generative approaches optimize the joint distribution of all the variables. This can lead to sub-
optimal results for predictive tasks because of the mismatch between the objective function
used (likelihood or a function thereof) and the goal of classification (maximizing accuracy or
conditional likelihood). In contrast discriminative approaches maximize the conditional like-
lihood of a set of outputs given a set of inputs (Lafferty et al. 2001) and this often produces
better results for prediction problems. In (Singla and Domingos 2005) the voted perceptron
based algorithm for discriminative weight learning of MLNs was shown to greatly outperform
maximum-likelihood and pseudo-likelihood approaches for two real-world prediction prob-
lems. Recently, the algorithm in (Lowd and Domingos 2007), which outperforms the voted
perceptron, became the state-of-the-art method for discriminative weight learning of MLNs. However,
both discriminative approaches to MLNs learn weights for a fixed structure, given by a domain
expert or learned through another structure learning method (usually generative). Better results
could be achieved if the structure could be learned in a discriminative fashion. Unfortunately,
the computational cost of optimizing structure and parameters for conditional likelihood is
prohibitive. In this chapter it is shown that the simple approximation of choosing structures
by maximizing conditional likelihood (CLL) or the AUC of the precision-recall (PR) curve, while
setting parameters by maximum likelihood, can produce better results in terms of predictive
accuracy. Structures are scored through a very fast inference algorithm, MC-SAT (Poon and
Domingos 2006) whose lazy version Lazy-MC-SAT (Poon et al. 2008) greatly reduces mem-
ory requirements, while parameters are learned through a quasi-Newton optimization method
like L-BFGS (Liu and Nocedal 1989) that has been found to be much faster (Sha and Pereira
2003) than iterative scaling initially used for MNs weight learning (Della Pietra et al. 1997).
This chapter presents the ILS-Discriminative Structure Learning (ILS-DSL) algorithm, which
is based on the Iterated Local Search (ILS) metaheuristic (Lourenço et al. 2002). We present
two variants of the algorithm that set parameters by maximum likelihood and choose structures
by maximum CLL or maximum AUC of the precision-recall (PR) curve, respectively.
5.1 Setting Parameters through Likelihood
Weight learning in MLNs is a convex optimization problem, thus gradient descent is guaranteed
to find the global optimum. However, in practice convergence to this optimum is extremely
slow, since MLNs, being exponential models, require as sufficient statistics the number of
true groundings of each clause in the data. Optimizing the CLL for MLNs (as for
Markov random fields) requires computing the partition function, which is generally intractable.
Moreover, since the number of true groundings of clauses can easily vary by orders of magnitude
from one clause to another, learning rates that are small enough to avoid divergence in
some weights may be too small for convergence in others. This is the ill-conditioning
problem in numerical optimization (Nocedal and Wright 2006). It can be addressed by
several methods, but the most well-known are not directly applicable to MLNs. These include
methods that perform line searches (computing the function as well as the gradient) such
as conjugate gradient and quasi-Newton methods. All of these require computing the partition
function, so the approach of optimizing CLL for every refinement is impractical.
Optimizing the WPLL, in contrast, does not require these computations.
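To make the contrast concrete, the following sketch computes the pseudo-log-likelihood of a toy ground network: no partition function over the whole state space is needed, only a two-way normalization per atom. The toy world and feature are hypothetical, and the per-predicate normalization that turns PLL into WPLL is omitted for brevity:

```python
import math

# Minimal sketch of pseudo-log-likelihood for a toy ground Markov network.
# A feature is a function world -> number of true groundings, with a weight.
# Illustrative only: real WPLL also weights each atom's contribution by the
# number of groundings of its predicate.

def pll(world, features, weights):
    """world: dict atom -> bool. Returns sum_l log P(X_l = x_l | rest)."""
    total = 0.0
    for atom in world:
        scores = {}
        for value in (False, True):
            flipped = dict(world)
            flipped[atom] = value
            scores[value] = sum(w * f(flipped) for f, w in zip(features, weights))
        # normalize over the two values of this atom only
        log_z = math.log(math.exp(scores[False]) + math.exp(scores[True]))
        total += scores[world[atom]] - log_z
    return total

# toy world with one clause-like feature "a or b", weight 1.0
world = {"a": True, "b": True}
score = pll(world, [lambda w: int(w["a"] or w["b"])], [1.0])
```

Each atom's term conditions on the rest of the world, which is why the sum stays cheap while the full likelihood would not.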
For every candidate structure, the parameters that optimize the WPLL are set through L-
BFGS. As pointed out in (Kok and Domingos 2005) a potentially serious problem that arises
when evaluating candidate clauses using WPLL is that the optimal (maximum WPLL) weights
need to be computed for each candidate. Since this involves numerical optimization, and needs
to be done millions of times, it could easily make the algorithm too slow. In (Della Pietra et al.
1997; McCallum 2003) the problem is addressed by assuming that the weights of previous
features do not change when testing a new one. Surprisingly, the authors in (Kok and Domingos
2005) found this to be unnecessary if the very simple approach of initializing L-BFGS with the
current weights (and zero weight for a new clause) is used. Although in principle all weights
could change as the result of introducing or modifying a clause, in practice this is very rare.
Second-order, quadratic-convergence methods like L-BFGS are known to be very fast if started
near the optimum (Sha and Pereira 2003). This is what happened in (Kok and Domingos 2005):
L-BFGS typically converges in just a few iterations, sometimes one. We use the same approach
for setting the parameters that optimize the WPLL.
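The warm-start trick can be illustrated on a toy objective. Plain gradient ascent stands in for L-BFGS so the sketch stays self-contained, and a concave quadratic stands in for the WPLL; the point is only that re-optimization after adding a "clause" starts near the previous optimum and so converges quickly:

```python
# Sketch of the warm-start evaluation trick: when a clause is added, the
# optimiser is initialised with the current weights plus a zero weight for
# the new clause. Gradient ascent and the toy objective
# -(1/2) * sum_i (w_i - t_i)^2 are stand-ins, not the real L-BFGS/WPLL.

def maximise(gradient, w0, lr=0.1, steps=200):
    w = list(w0)
    for _ in range(steps):
        w = [wi + lr * gi for wi, gi in zip(w, gradient(w))]
    return w

def toy_gradient(targets):
    # gradient of the toy objective, maximised exactly at w == targets
    return lambda w: [t - wi for wi, t in zip(w, targets)]

# initial structure with two clauses, optimised from scratch
weights = maximise(toy_gradient([1.0, -2.0]), [0.0, 0.0])
# a third clause is added: warm-start from the current weights, zero for the
# new one, so far fewer optimisation steps are needed
weights = maximise(toy_gradient([1.0, -2.0, 0.5]), weights + [0.0], steps=50)
```

With the warm start, only the new coordinate is far from its optimum, mirroring the observation that in practice the old weights barely change.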
5.2 Scoring Structures through Conditional Likelihood
In order to score MLN structures, we need to perform inference over the network. A very fast
algorithm for inference in MLNs is MC-SAT (Poon and Domingos 2006). Since probabilistic
inference methods like MCMC or belief propagation tend to give poor results when determin-
istic or near-deterministic dependencies are present, and logical ones like satisfiability testing
are inapplicable to probabilistic dependencies, MC-SAT combines ideas from both MCMC
and satisfiability to handle probabilistic, deterministic and near-deterministic dependencies that
are typical of statistical relational learning. MC-SAT was shown to greatly outperform Gibbs
sampling and simulated tempering on two real-world datasets regarding entity resolution and
collective classification.
Even though MC-SAT is a very fast inference algorithm, scoring candidate structures at
each step can be potentially very expensive since inference has to be performed for each can-
didate clause added to the current structure. One problem that arises is that fully instantiating
a finite first-order theory requires memory in the order of the number of constants raised to the
length of the clauses, which significantly limits the size of domains where the problem can still
be tractable. To avoid this problem, we used a lazy version of MC-SAT, Lazy-MC-SAT (Poon
et al. 2008) which reduces memory and time by orders of magnitude compared to MC-SAT.
Before Lazy-MC-SAT was introduced, the LazySat algorithm (Singla and Domingos 2006a)
was shown to greatly reduce memory requirements by exploiting the sparseness of relational
domains (i.e., only a small fraction of ground atoms are true, and most clauses are trivially sat-
isfied). The authors in (Poon et al. 2008) generalize the ideas in (Singla and Domingos 2006a)
by proposing a general method for applying lazy inference to a broad class of algorithms such
as other SAT solvers or MCMC methods. Another problem is that even though Lazy-MC-SAT
makes memory requirements tractable, constructing the Markov random field in the first step
of MC-SAT can still take too much time for every candidate structure.
To make the execution of Lazy-MC-SAT tractable for every candidate structure, we use
the following simple heuristics: 1) We score through Lazy-MC-SAT only those candidates
that produce an improvement in WPLL. Once the parameters are set through L-BFGS, it is
straightforward to compute the gain in WPLL for each candidate. This reduces the number of
candidates to be scored through Lazy-MC-SAT for a gain in CLL. 2) We impose a memory limit
on the clause activation phase of Lazy-MC-SAT, which greatly speeds up the whole inference
task. Although in principle this limit can reduce the accuracy of inference, we found that in
most cases the memory limit is never reached, making the overall inference task very fast. 3)
We impose a time limit on the clause activation phase in order to avoid those rare cases where
this step takes a very long time to complete. For most candidate structures the time limit
is never reached, and in the rare cases where it is, inference is performed using the clauses
activated within the limit.
We found that these simple approximations greatly speed up the scoring of each structure at
each step. Filtering the potential candidates through the gain in WPLL can in principle exclude
good candidates due to the mismatch between the optimization of WPLL and that of CLL.
However, we empirically found that most candidates that did not improve WPLL did not improve
CLL either. Further investigation of this issue may help to select better or more candidates to be
scored through Lazy-MC-SAT.
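Heuristic 1 amounts to a simple filter in front of the expensive scorer. A minimal sketch, with hypothetical stand-in functions (in the real system the WPLL gain comes from L-BFGS and the expensive score from Lazy-MC-SAT):

```python
# Hedged sketch of heuristic 1: only candidates that improve WPLL reach the
# expensive discriminative scorer, and the best of those is kept.

def best_candidate(candidates, wpll_gain, cll_score):
    # cheap filter: discard candidates that do not improve WPLL
    improving = [c for c in candidates if wpll_gain(c) > 0]
    if not improving:
        return None
    # expensive score (stand-in for Lazy-MC-SAT) only on the survivors
    return max(improving, key=cll_score)

# toy data: 'b' never reaches the expensive scorer because it worsens WPLL
gains = {"a": 0.10, "b": -0.20, "c": 0.05}
clls = {"a": -0.30, "b": -0.05, "c": -0.10}
choice = best_candidate(["a", "b", "c"], gains.get, clls.get)
```

Note that 'b' has the best CLL in the toy data but is filtered out, which is exactly the kind of mismatch between WPLL and CLL the text acknowledges.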
5.3 Discriminative Structure Learning using ILS
In this section we describe our proposal for tailoring the ILS metaheuristic to the problem
of learning the structure of MLNs. We describe how weights are set and how structures are
scored. The approach we follow is similar to that of (Grossman and Domingos 2004), where
Bayesian Networks were learned by setting parameters through maximum likelihood and
choosing structures by maximizing conditional likelihood.
Algorithms ILS-DSLCLL and ILS-DSLAUC (Algorithm 5.1) iteratively add the best clause
to the current MLN until δ consecutive steps have produced no improvement (other stopping
criteria could also be applied). The algorithm can start from an empty network or from an existing KB.
As in (Kok and Domingos 2005; Richardson and Domingos 2006), we add all unit clauses
(single predicates) to the MLN. The initial weights are learned in LearnWeights through L-
BFGS and the initial structure is scored in ComputeScore through MC-SAT. MC-SAT takes
as input an MLN, a query predicate and evidence ground facts, and computes for each grounding
of the query predicate its probability of being true. From these values, ComputeScore computes
the CLL as the average CLL over all these groundings. DSLCLL scores the candidate clauses
directly by CLL, while DSLAUC computes the AUC of the PR curve using the
package of (Davis and Goadrich 2006).
Algorithm 5.1 The ILS-DSL algorithm

Input: P: set of predicates, MLN: Markov Logic Network, RDB: Relational Database, QP: query predicate

CLS = all clauses in MLN ∪ P
LearnWeights(MLN, RDB)
BestScore = ComputeScore(MLN, RDB, QP)
repeat
  BestClause = SearchBestClause(P, MLN, BestScore, CLS, RDB, QP)
  if BestClause ≠ null then
    Add BestClause to MLN
    BestScore = ComputeScore(MLN, RDB, QP)
  end if
until BestClause = null for δ consecutive steps
Return MLN

For DSLCLL, ComputeScore computes the average CLL over all the groundings of the query predicate QP; for DSLAUC, it computes the AUC of the PR curve.
The search for the best clause is performed in the SearchBestClause procedure (Algorithm
5.2). The algorithm performs an iterated local search to find the best clause to add to the current
MLN. It starts by randomly choosing a unit clause CLC in the search space. Then it performs
a greedy local search to efficiently reach a local optimum CLS. At this point, a perturbation
method is applied, leading to a neighbor CL′C of CLS, and a greedy local search is then applied
to CL′C to reach another local optimum CL′S. The accept function decides whether the search
must continue from the previous local optimum CLS or from the newly found local optimum CL′S
(accept can perform a random walk or iterative improvement in the space of local optima).
Careful choice of the various components of SearchBestClause is important to achieve high
performance. The clause perturbation operator (flipping the sign of literals, removing literals
or adding literals) aims to jump to a different region of the search space where the search
should start at the next iteration. Perturbations can be strong or weak: if the jump lands near
the current local optimum, the subsidiary local search procedure LocalSearchII (Algorithm 5.3)
may fall back into the same local optimum or enter a region where the objective function has
the same value (a plateau); if the jump is too far, LocalSearchII may take too many steps to
reach another good solution. In our algorithm we use
Algorithm 5.2 The SearchBestClause component of ILS-DSL

Input: P: set of predicates, MLN: Markov Logic Network, BestScore: CLL or AUC score, BestWPLL: WPLL score, CLS: list of clauses, RDB: Relational Database, QP: query predicate

CLC = randomly pick a clause in CLS ∪ P
CLS = LocalSearchII(CLC, BestScore, BestWPLL)
BestClause = CLS
repeat
  CL′C = Perturb(CLS)
  CL′S = LocalSearchII(CL′C, MLN, BestScore, BestWPLL)
  if ComputeScore(BestClause, MLN, RDB, QP) ≤ ComputeScore(CL′S, MLN, RDB, QP) then
    BestClause = CL′S
    Add BestClause to MLN
    BestScore = ComputeScore(CL′S, MLN, RDB, QP)
  end if
  CLS = accept(CLS, CL′S)
until k consecutive steps have not produced improvement
Return BestClause

For DSLCLL, ComputeScore computes the average CLL over all the groundings of the query predicate; for DSLAUC, it computes the AUC of the PR curve.
only strong perturbations, i.e., we always re-start from unit clauses (in future work we intend
to dynamically adapt the nature of the perturbation). Regarding the procedure LocalSearchII ,
we decided to use an iterative improvement approach (the walk probability is set to zero and
the best clause is always chosen in StepII) in order to balance intensification (greedily increase
solution quality by exploiting the evaluation function) and diversification (randomness induced
by strong perturbation to avoid search stagnation). In future work we intend to further weaken
intensification by using a higher walk probability. Finally, the accept function always accepts
the best solution found so far.
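The perturbation operator named above (flipping the sign of a literal, removing a literal, or adding one) can be sketched on a toy clause representation; the (predicate, sign) encoding below is illustrative, not the thesis's data structure:

```python
import random

# Sketch of the clause perturbation operator: flip a literal's sign, remove
# a literal, or add one drawn from a pool. A clause is a list of
# (predicate, positive?) pairs; the encoding is illustrative only.

def perturb(clause, literal_pool, rng):
    clause = list(clause)
    # never empty a clause: single-literal clauses can only grow
    move = rng.choice(["flip", "remove", "add"]) if len(clause) > 1 else "add"
    if move == "flip":
        i = rng.randrange(len(clause))
        pred, positive = clause[i]
        clause[i] = (pred, not positive)
    elif move == "remove":
        del clause[rng.randrange(len(clause))]
    else:  # add a literal from the pool
        clause.append(rng.choice(literal_pool))
    return clause

rng = random.Random(0)
start = [("AdvisedBy", True), ("Professor", False)]
perturbed = perturb(start, [("Student", True)], rng)
```

A strong perturbation, as used in ILS-DSL, would instead discard the clause entirely and restart from a unit clause; the operator above corresponds to the weaker, local moves.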
5.3.1 The ILS-DSLCLL version
The ILS-DSLCLL version of the algorithm maximizes CLL during search. In Algorithm 5.2,
the function ComputeScore computes the average CLL over all the groundings of the query
Algorithm 5.3 The subsidiary procedure LocalSearchII and the Step function of ILS-DSL

LocalSearchII(CLC, BestScore, BestWPLL)
wp: walk probability, i.e., the probability of performing a random step instead of an improvement step
repeat
  NBHD = neighborhood of CLC constructed using the clause construction operators
  CLS = StepRII(CLC, NBHD, wp, BestScore, BestWPLL)
  CLC = CLS
until two consecutive steps do not produce improvement
Return CLS

StepRII(CLC, NBHD, wp, BestScore, BestWPLL)
U = random(]0,1]), a random number drawn from a uniform distribution
if U ≤ wp then
  CLS = stepURW(CLC, NBHD)
  (uninformed random walk: randomly choose a neighbor from NBHD)
else
  CLS = stepII(CLC, NBHD)
  (iterative improvement: among the neighbors in NBHD that improve BestWPLL, choose the one that maximally improves BestScore in terms of CLL or AUC; if there is no improving neighbor, choose the minimally worsening one)
end if
Return CLS
predicate QP. It uses the Lazy-MC-SAT algorithm to perform inference over the network con-
structed using the current structure MLN and the relational data RDB of a tuning set. In the
tuning set, all the groundings of the query predicate QP are commented out (i.e., withheld
from the evidence). MC-SAT produces for each grounding of the query predicate QP the
probability that it is true. These values are then used to compute the average CLL by
distinguishing positive and negative atoms: a positive atom with estimated probability P
contributes logP to the CLL, while a negative one contributes log(1−P).
In the subsidiary local search procedure LocalSearchII
and in the StepII function, the candidates are filtered based on their improvement in terms of
WPLL. Among the candidates that improve WPLL, the one that most maximizes CLL is then
chosen as the best candidate to continue the search.
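The CLL computation just described reduces to a few lines once the per-grounding probabilities are available. The clamping of probabilities away from 0 and 1 is an assumption added here to keep the logarithms finite; the thesis does not state how degenerate estimates are handled:

```python
import math

# Sketch of the scoring step: each grounding of the query predicate
# contributes log P if true in the data and log(1 - P) otherwise, and the
# score is the average over all groundings. Illustrative only.

def average_cll(probs, truths, eps=1e-6):
    total = 0.0
    for p, truth in zip(probs, truths):
        # clamp, since sampled probabilities can be exactly 0 or 1
        p = min(max(p, eps), 1.0 - eps)
        total += math.log(p) if truth else math.log(1.0 - p)
    return total / len(probs)

# two groundings, both predicted true with probability 0.5
score = average_cll([0.5, 0.5], [True, False])
```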
5.3.2 The ILS-DSLAUC version
The ILS-DSLAUC version of the algorithm maximizes the AUC of the PR curve during search. In
Algorithm 5.2, the function ComputeScore computes the average AUC over all the groundings of the
query predicate QP. It uses the Lazy-MC-SAT algorithm to perform inference over the network
constructed using the current structure MLN and the relational data RDB of a tuning set. In
the tuning set, all the groundings of the query predicate QP are commented out (i.e., withheld
from the evidence). MC-SAT produces for each grounding of the query predicate QP the
probability that it is true. The precision-recall curve for a predicate is computed by varying the
threshold on the predicted probability above which a ground atom is predicted to be true; i.e.,
the ground atoms whose probability of being true is greater than the threshold are predicted
positive and the rest negative. For the computation of AUC we used the
package of (Davis and Goadrich 2006). In the subsidiary local search procedure LocalSearchII
and in the StepII function, the candidates are filtered based on their improvement in terms of
WPLL. Among the candidates that improve WPLL, the one that most maximizes AUC is then
chosen as the best candidate to continue the search.
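The threshold sweep can be sketched as follows. Simple trapezoidal integration is used here for self-containment; the package of (Davis and Goadrich 2006) interpolates PR points differently, so its values would not match this sketch exactly:

```python
# Sketch of AUC-PR via a threshold sweep: rank groundings by predicted
# probability, trace out (recall, precision) points, and integrate.
# Illustrative only; PR interpolation differs from Davis-Goadrich.

def auc_pr(probs, truths):
    pairs = sorted(zip(probs, truths), key=lambda pt: -pt[0])
    total_pos = sum(1 for t in truths if t)
    tp = fp = 0
    points = [(0.0, 1.0)]  # (recall, precision) starting convention
    for _, truth in pairs:
        if truth:
            tp += 1
        else:
            fp += 1
        points.append((tp / total_pos, tp / (tp + fp)))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area

# a perfect ranking gives an area of 1.0
area = auc_pr([0.9, 0.8, 0.2, 0.1], [True, True, False, False])
```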
5.4 Experiments
Through experimental evaluation we want to answer the following questions:
(Q1) Are the proposed algorithms competitive with state-of-the-art discriminative training
algorithms of MLNs?
(Q2) Are the proposed algorithms competitive with the state-of-the-art generative algo-
rithm for structure learning of MLNs?
(Q3) Are the proposed algorithms competitive with pure probabilistic approaches such as
Naïve Bayes and Bayesian Networks?
(Q4) Are the proposed algorithms competitive with state-of-the-art ILP systems for the task
of structure learning of MLNs?
(Q5) Do the proposed algorithms always perform better than BUSL for classification tasks?
If not, are there any regimes in which each algorithm performs better?
(Q6) Regarding the task of Entity Resolution, do the proposed algorithms perform better
than other language-independent discriminative approaches based on MLNs?
Regarding question (Q1) we have to compare all our algorithms with Preconditioned Scaled
Conjugate Gradient (PSCG) which is the state-of-the-art discriminative training algorithm for
MLNs, proposed in (Lowd and Domingos 2007). It must be noted that this algorithm takes
as input a fixed structure, and with the clausal knowledge base we use in our experiments for
CORA (each dataset comes with a hand-coded knowledge base), PSCG has achieved the best
published results. We also exclude the approach of adapting the rule set and then learning
weights with PSCG, since it would be computationally intractable.
To answer question (Q2) we have to perform experimental comparison with the Bottom-Up
Structure Learning (BUSL) algorithm (Mihalkova and Mooney 2007) which is the state-of-the-
art algorithm for this task. Since, in principle, the structure of MLNs can be learned using any
ILP technique, it would be interesting to know how our algorithms compare to ILP approaches. In
(Kok and Domingos 2005), the proposed algorithm based on beam search (BS) was shown to
outperform FOIL and the state-of-the-art ILP system ALEPH for the task of learning MLNs
structure. Moreover, BS outperformed both Naïve Bayes and Bayesian Networks in terms of
CLL and AUC. Since in (Mihalkova and Mooney 2007) it was shown that BUSL outperforms
the BS algorithm of (Kok and Domingos 2005), our baseline for questions (Q2), (Q3) and
(Q4) is again BUSL. It must be noted that since the goal of learning MLNs (and then performing
inference over the model) is probability estimation, the proposed algorithms are not
directly comparable with ILP systems, because these are not designed to maximize the data's
likelihood (and thus the quality of the probabilistic predictions). Moreover, since ALEPH and
FOIL learn more restricted clauses (non recursive definite clauses), the only ILP system that
is directly comparable with our algorithm is CLAUDIEN which, unlike most ILP systems that
learn only Horn clauses, is able to learn arbitrary first-order clauses. Thus the comparison re-
gards the task of structure learning of MLNs where ILP systems learn the structure followed by
a weight learning phase. In (Kok and Domingos 2005) the authors showed that CLAUDIEN
(also ALEPH and FOIL) followed by a weight learning phase was outperformed by the BS
algorithm in terms of CLL and AUC. Regarding question (Q5), we compare all our algorithms
and BUSL on two datasets, one of small size and one of much larger size, with the goal of
discovering regimes in which each algorithm performs better. Finally, to answer question (Q6),
we compare our algorithms with the best language-independent discriminative approach to
Entity Resolution based
on MLNs proposed in (Singla and Domingos 2006b). In this work, the MLN(G+C+T) model is
language-independent because it does not contain rules referring to specific strings occurring
in the data. This is similar to the approach that we follow here for this task: we learn rules
which are not vocabulary specific. In (Singla and Domingos 2006b) the discriminative weight
learning approach is based on the voted perceptron for MLNs and was used to learn weights for
different hand-coded models (one of these was MLN(G+C+T)). Since in (Lowd and Domingos
2007) it was shown that PSCG generally outperforms the voted perceptron, and since for the task
of entity resolution that comparison followed a language-dependent approach
(excluding MLN(G+C+T)), it is interesting to investigate how our algorithms compare
to MLN(G+C+T).
5.4.1 Link Analysis
As introduced in Section 4.3.1, Link Analysis is an important problem in many domains where
entities such as people, Web pages, computers, scientific publications, organizations, are in-
terconnected and interact in one way or another (Popescul and Ungar 2003). Predicting the
presence of links between entities is not an easy task due to the characteristics of such do-
mains. The first requirement regards the representation formalism. Flat representations are not
suitable to deal with these problems, hence relational formalisms must be used. Second, most
of these domains contain noisy or partially observed data thus robust methods for dealing with
uncertainty must be used.
In many scenarios related to Link Analysis, it is known in advance which entity will be queried
after the model has been learned, i.e., the query variable is known beforehand. Therefore it is
useless to optimize the joint distribution of all the variables. Instead, in
order to increase classification accuracy, it is sufficient to optimize the distribution of the query
predicate given the evidence. For example, in Social Network modeling there is often a target
predicate that expresses a relationship between two entities of a certain type, and the problem
is to determine whether this relation holds between two given objects in the domain. Therefore a
discriminative approach to this problem is more helpful when it is known a priori what
the query predicate is.
Regarding other discriminative approaches based on SRL models applied to link analysis,
in (Popescul and Ungar 2003) an SRL model, Structural Logistic Regression, was used to solve
a link analysis problem: predicting citations in scientific literature. In (Taskar et al. 2003)
Relational Markov Networks were used for link prediction in two domains: university web-
pages and social networks. Markov Logic was first applied to link prediction by (Richardson
and Domingos 2006) and then by (Kok and Domingos 2005; Mihalkova and Mooney 2007;
Singla and Domingos 2005). The only discriminative approach is that of (Singla
and Domingos 2005), where a discriminative weight learning algorithm based on the voted
perceptron was applied to the problem of link prediction in social networks. The experiments of
(Singla and Domingos 2005) were performed on the same data that we use here, and no
structure was learned: a hand-coded MLN structure was used and only the parameters of
the model were learned. In this dissertation, we try to learn the clauses from the given facts
together with their parameters.
Dataset
The experiments on link prediction were carried out on a publicly-available database: the
UW-CSE database (available at http://alchemy.cs.washington.edu/data/uw-cse) used by (Kok
and Domingos 2005; Mihalkova and Mooney 2007; Richardson and Domingos 2006; Singla
and Domingos 2005). This dataset represents a standard relational one and is used for the
important relational task of social network analysis.
The published UW-CSE dataset consists of 15 predicates and 1323 constants divided into 9
types. Types include publication, person, course, etc. Predicates include Student(person),
Professor(person), AdvisedBy(person1, person2), TaughtBy(course, person, quarter),
Publication(paper, person), etc. The dataset contains 2673 tuples (true ground atoms, with the
remainder assumed false). The task is to predict who is whose advisor from information about
coauthorships, classes taught, etc. More precisely, the query atoms are all groundings of
AdvisedBy(person1, person2), and the evidence atoms are all groundings of all other predicates
(except Student and Professor) (Richardson and Domingos 2006). In our experiments we
performed inference over the predicate AdvisedBy by commenting out all the groundings of the
predicates Professor and Student.
5.4.2 Entity Resolution
As introduced in Section 4.3.2, Entity Resolution is the problem of determining which records
in a database refer to the same entities, and is an essential and expensive step in the data
mining process. The problem was originally defined in (Newcombe et al. 1959), and the
work in (Fellegi and Sunter 1969) then laid the theoretical basis of what is now known as
object identification, record linkage, de-duplication, merge/purge, identity uncertainty, co-
reference resolution, and so on. The main characteristics of domains where this problem must
be solved are relations between objects and uncertainty about these relations due to noisy or
partially observed data.
5. THE ILS-DSL ALGORITHM
In most Entity Resolution settings, it is not required to resolve entities of all the different
types in the domain, but only those of a single category. Therefore it is useless to optimize
the joint distribution of all the variables when we know a priori which entity we want to
resolve. In order to increase classification accuracy, it is sufficient to optimize the
distribution of the query predicate given the evidence. For example, in protein networks there
is often a target predicate that expresses a relationship between two proteins of a certain
type, and the problem is to find whether two observations refer to the same entity, i.e., the
proteins are the same but, due to noise or partially observed data, their equality is not
evident. Another application field is that of citation databases, where many citations may refer
to the same paper and it is important to deduplicate the data. Discriminative approaches take
a set of variables as input and produce predictions for a set of output variables. For this
reason, discriminative approaches are more suitable for entity resolution problems than
generative approaches.
Regarding other SRL models for Entity Resolution, several approaches have been proposed,
such as those in Bhattacharya and Getoor (2004); Milch et al. (2005); Pasula and
Russell (2001). Other approaches similar to ours, based on Markov Logic, are those in Lowd
and Domingos (2007); Singla and Domingos (2005, 2006b). All these propose discriminative
weight learning approaches that take an existing structure as input and learn the parameters of
the model. The approach that we propose in this chapter learns both structure and parameters
in a discriminative fashion.
Dataset
The CORA dataset consists of 1295 citations of 132 different computer science papers,
drawn from the CORA CS Research Paper Engine. The task is to predict which citations refer
to the same paper, given the words in their author, title, and venue fields. The labeled data also
specify which pairs of author, title, and venue fields refer to the same entities. We performed
experiments for each field in order to evaluate the ability of the model to deduplicate fields as
well as citations. The dataset contains 10 predicates and 70367 tuples (true and false ground
atoms, with the remainder assumed false). Since the number of possible equivalences is very
large, as the authors did in (Lowd and Domingos 2007) we used the canopies found in (Singla
and Domingos 2006b) to make this problem tractable. The dataset used is in Alchemy format
(publicly available at http://alchemy.cs.washington.edu/data/cora/). The original version, not
in Alchemy format, was segmented by Bilenko and Mooney (Bilenko and Mooney 2003)
(available at http://www.cs.utexas.edu/users/ml/riddle/data/cora.tar.gz).
5.4.3 Systems and Methodology
We implemented the algorithm ILS-DSL as part of the MLN++ package (Section 8.3), a suite of
algorithms based on Markov Logic and built upon the Alchemy framework (Kok et al. 2005).
Alchemy implements inference and learning algorithms for Markov Logic. Alchemy can be
viewed as a declarative programming language akin to Prolog, but with some key differences:
the underlying inference mechanism is model checking instead of theorem proving, and the full
syntax of first-order logic is allowed, rather than just Horn clauses. Moreover, Alchemy
provides built-in functionality for handling uncertainty and learning from data. MLN++ uses
the Alchemy API for some tasks, such as the implementations of L-BFGS and Lazy-MC-SAT,
to learn maximum-WPLL weights and compute the CLL during clause search.
Regarding parameter learning, we compared our algorithms' performance with the state-
of-the-art algorithm PSCG of (Lowd and Domingos 2007) for discriminative weight learning
of MLNs. This algorithm takes as input an MLN and the evidence (groundings of non-query
predicates) and discriminatively trains the MLN to optimize the CLL of the query predicates
given evidence. Our algorithms and PSCG optimize the CLL (or AUC) of the query predicates,
and a comparison between these algorithms is useful to understand whether automatically
learning the clauses from scratch can improve over hand-coded MLN structures in terms of
classification accuracy of the query predicates given evidence.
We performed all the experiments on a 2.13 GHz Intel Core2 Duo CPU. For the UW-
CSE dataset we trained PSCG on the hand-coded knowledge base provided with the dataset.
We used the implementation of PSCG in the Alchemy package and ran this algorithm with the
default parameters for 10 hours. For the CORA dataset, for PSCG we report the results obtained
in (Lowd and Domingos 2007), where PSCG was trained on a hand-coded MLN and achieved
the best current results on this dataset. For the language-independent approach MLN(G+C+T) of
(Singla and Domingos 2006b) we report the results for three of the query predicates in this
domain: sameBib, sameAuthor and sameVenue (the results reported in (Singla and Domingos
2006b) do not include the predicate sameTitle).
For both datasets, for all our algorithms we used the following parameters: the mean and
variance of the Gaussian prior were set to 0 and 100, respectively; maximum variables per
clause = 4; maximum predicates per clause = 4; penalization of weighted pseudo-likelihood
= 0.01 for UW-CSE and 0.001 for CORA. For L-BFGS we used the following parameters:
maximum iterations = 10,000 (tight) and 10 (loose); convergence threshold = 10^-5 (tight) and
10^-4 (loose). For Lazy-MC-SAT during learning we used the following parameters: memory
limit = 300MB for both datasets; maximum number of steps for Gibbs sampling = 100; simulated
annealing temperature = 0.5; the parameter k (number of iterations without improvement)
was set to three, while the parameter δ was set to 2. All these parameters were set in an ad
hoc manner, and per-fold optimization may lead to better results. Regarding BUSL, for both
datasets, we used the following parameters: the mean and variance of the Gaussian prior were
set to 0 and 100, respectively; maximum variables per clause = 5 for UW-CSE and 6 for CORA;
maximum predicates per clause = 6; penalization of WPLL = 0.01 for UW-CSE and 0.001 for
CORA; minWeight = 0.5 for UW-CSE and 0.01 for CORA. For L-BFGS we used the following
parameters: maximum iterations = 10,000 (tight) and 10 (loose); convergence threshold = 10^-5
(tight) and 10^-4 (loose).
In the UW-CSE domain, we followed the same leave-one-area-out methodology as in
(Richardson and Domingos 2006). In the CORA domain, we performed 5-fold cross-validation.
For each train/test split, one of the training folds is used as tuning set for computing the CLL
(or AUC). For each system on each test set, we measured the CLL and the AUC of the PR curve
for the query predicates. The advantage of the CLL is that it directly measures the quality of
the probability estimates produced. The advantage of the AUC is that it is insensitive to the
large number of true negatives (i.e., ground atoms that are false and predicted to be false), but
the disadvantage is that it ignores calibration by considering only whether true atoms are given
higher probability than false atoms. The CLL of a query predicate is the average over all its
groundings of the ground atom's log-probability given evidence. The precision-recall curve for
a predicate is computed by varying the CLL threshold above which a ground atom is predicted
to be true; i.e., the atoms whose probability of being true is greater than the threshold are
predicted positive and the rest negative. For the computation of the AUC we used the package
of (Davis and Goadrich 2006).
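Both metrics can be computed directly from the per-atom probabilities that inference outputs. The sketch below is illustrative only (the dissertation itself uses the package of Davis and Goadrich for AUC); the helper names are our own:

```python
import math

def average_cll(probs, labels, eps=1e-6):
    """Average conditional log-likelihood over all groundings:
    log p for true atoms, log(1 - p) for false ones."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += math.log(p) if y else math.log(1.0 - p)
    return total / len(probs)

def pr_curve(probs, labels):
    """Precision-recall points obtained by sweeping the probability
    threshold above which an atom is predicted true."""
    points = []
    n_pos = sum(labels)
    for t in sorted(set(probs), reverse=True):
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y)
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and not y)
        if tp + fp > 0 and n_pos > 0:
            points.append((tp / n_pos, tp / (tp + fp)))  # (recall, precision)
    return points
```

The area under this curve can then be estimated by interpolation between the points, as the Davis and Goadrich package does with proper PR-space interpolation.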
5.4.4 Results
After learning the structure discriminatively, we performed inference on the test fold for both
datasets by using MC-SAT with number of steps = 10000 and simulated annealing temperature
= 0.5. For each experiment, all the groundings of the query predicates on the test fold were
commented out: advisedBy for the UW-CSE dataset (professor and student are also commented
out) and sameBib, sameTitle, sameAuthor and sameVenue for CORA. MC-SAT produces
probability outputs for every grounding of the query predicate on the test fold. We used these
values to
compute the average CLL over all the groundings and to compute the PR curve.
We denote the two versions of the algorithm as ILS-DSL_CLL and ILS-DSL_AUC. For the
algorithm that optimizes the AUC of the PR curve during search, we scored each structure using
the package of (Davis and Goadrich 2006). The results for all algorithms on the UW-CSE
dataset are reported in Table 5.1 for CLL and Table 5.2 for AUC. In Table 5.1, CLL is averaged
over all the groundings of the predicate advisedBy in the test fold. Regarding the comparison
with PSCG in terms of CLL, in this domain our algorithms perform better than PSCG in every
fold of the dataset and overall. Regarding AUC, PSCG overall performs better than both our
algorithms. It must be noted that on two out of five folds (language and graphics) the results
of our algorithms were quite competitive and there was a large difference only in the theory
fold where PSCG achieved a high result. Our best-performing algorithm in terms of CLL was
ILS-DSL_AUC. This was a surprising result, since we expected better results from ILS-DSL_CLL,
the algorithm that optimizes CLL during search. On the other hand, in terms of AUC, our
algorithms performed equally. Overall for UW-CSE, we can state that our algorithms perform
better in terms of CLL and worse in terms of AUC.
For the CORA dataset the results are reported in Tables 5.3 and 5.4. For CLL for each
query predicate we report the average of CLL of its groundings over the test fold (for each
predicate, training is performed on four folds and testing on the remaining one in a 5-fold
cross-validation). For CORA, compared to PSCG, all our algorithms perform better in terms
of CLL for each of the query predicates, but worse in terms of AUC. We observed empirically
on each fold that the performances in terms of CLL and AUC were always balanced: a slightly
better performance in CLL always resulted in a slightly worse performance in terms of AUC,
and vice versa. Since CLL determines the quality of the probability predictions output by the
algorithm, all our algorithms outperform PSCG in terms of the ability to predict correctly the
query predicates given evidence. However, since AUC is useful to predict the few positives in
the data, PSCG produces better results for only positive examples. Hence, these results answer
question (Q1). It must be noted that PSCG has achieved the best published results on CORA in
terms of AUC (Lowd and Domingos 2007) and the approach followed is language-dependent,
i.e. the hand-coded MLN used with PSCG in (Lowd and Domingos 2007) contains rules such
that a weight is learned for each ground clause that is constructed using specific constants in
Table 5.1: CLL results for the query predicate advisedBy in the UW-CSE domain

area          language        graphics        systems         theory          ai              Overall
ILS-DSL_CLL   -0.048±0.016    -0.016±0.003    -0.020±0.003    -0.020±0.005    -0.022±0.003    -0.025±0.006
ILS-DSL_AUC   -0.028±0.008    -0.015±0.003    -0.017±0.002    -0.018±0.004    -0.019±0.003    -0.019±0.004
PSCG          -0.049±0.016    -0.023±0.005    -0.026±0.005    -0.028±0.007    -0.032±0.005    -0.032±0.008
BUSL          -0.024±0.008    -0.014±0.002    -0.295±0.000    -0.013±0.003    -0.019±0.003    -0.073±0.003

Table 5.2: AUC results for the query predicate advisedBy in the UW-CSE domain

area          language   graphics   systems   theory   ai      Overall
ILS-DSL_CLL   0.011      0.006      0.007     0.010    0.006   0.008
ILS-DSL_AUC   0.016      0.005      0.007     0.005    0.008   0.008
PSCG          0.011      0.005      0.069     0.101    0.034   0.044
BUSL          0.115      0.007      0.007     0.032    0.013   0.035
the domain. This makes the PSCG approach of (Lowd and Domingos 2007) vocabulary-specific,
while all our algorithms learn general rules not tied to a specific set of strings.
Regarding the comparison with BUSL, the results show that all our algorithms perform
better than BUSL in terms of CLL on both datasets. It must be noted, however, that for UW-
CSE, BUSL performed generally better than our algorithms, but produced very low results in
one fold. In terms of AUC, BUSL performs slightly better on the UW-CSE dataset while in
the CORA dataset all our algorithms outperform BUSL. Therefore, questions (Q2), (Q3) and
(Q4) can be answered affirmatively. Our discriminative algorithms are competitive with BUSL
even though for BUSL, in the UW-CSE domain, we used optimized parameters taken from
(Mihalkova and Mooney 2007) in terms of number of variables and literals per clause, while
for our algorithms we did not perform per-fold optimization of any parameter.
Regarding question (Q5), the goal was to verify whether the previous results of (Ng and
Jordan 2002), namely that on small datasets generative approaches can perform better than
discriminative ones, carry over to MLNs. The UW-CSE dataset, with a total of 2673 tuples, is
much smaller than CORA, which has 70367 tuples. The results of Tables 5.1 and 5.2 show
that on the UW-CSE dataset the generative algorithm BUSL performs better in terms of AUC
and is competitive in terms of CLL, since it underperforms our algorithms only because of the
low results in the systems fold of the dataset. Thus we can answer question (Q5) by confirming
the results in (Ng and Jordan 2002): on small datasets generative approaches can perform
better than discriminative ones, while on larger datasets discriminative approaches outperform
Table 5.3: CLL results for all query predicates in the CORA domain

area          sameBib         sameTitle       sameAuthor      sameVenue       Overall
ILS-DSL_CLL   -0.087±0.001    -0.077±0.006    -0.148±0.009    -0.121±0.004    -0.108±0.005
ILS-DSL_AUC   -0.168±0.002    -0.117±0.010    -0.158±0.011    -0.101±0.004    -0.136±0.007
PSCG          -0.291±0.003    -0.231±0.014    -0.182±0.013    -0.444±0.012    -0.287±0.011
MLN(G+C+T)    -0.394±0.004    −               -0.263±0.053    -1.196±0.031    -0.618±0.030
BUSL          -0.566±0.001    -0.100±0.004    -0.834±0.009    -0.232±0.005    -0.433±0.005

Table 5.4: AUC results for all query predicates in the CORA domain

area          sameBib   sameTitle   sameAuthor   sameVenue   Overall
ILS-DSL_CLL   0.603     0.428       0.371        0.315       0.429
ILS-DSL_AUC   0.334     0.470       0.688        0.252       0.436
PSCG          0.990     0.953       0.999        0.823       0.941
MLN(G+C+T)    0.973     −           0.980        0.743       0.899
BUSL          0.138     0.419       0.323        0.218       0.275
generative ones.
The final question (Q6) concerns the task of entity resolution and the approaches that are
based on MLNs and are language-independent, i.e., that do not contain rules which refer to
specific constants in the domain. The results of Tables 5.3 and 5.4 show that in terms of CLL,
all our algorithms outperform MLN(G+C+T) for all the query predicates, but in terms of AUC,
MLN(G+C+T) outperforms our algorithms. Thus, the same conclusions as for PSCG are valid
for MLN(G+C+T). Our algorithms produce in general more accurate probability predictions,
while MLN(G+C+T) produces better results for only positive atoms. Therefore, question (Q6)
can be answered affirmatively.
Finally, we give examples of clauses from MLN structures learned for both datasets (we
omit the corresponding weights). For the UW-CSE dataset, examples of learned clauses are:

position(a1,a2) ∨ ¬advisedBy(a1,a3) ∨ yearsInProgram(a1,a4) ∨ yearsInProgram(a3,a4)
¬professor(a1) ∨ student(a1) ∨ advisedBy(a2,a1) ∨ tempAdvisedBy(a1,a2)
These clauses model the relation advisedBy between students and professors. In the first
clause, a1 and a3 are variables that denote persons (students or professors), while a2 and a4
denote respectively university positions and years spent in university programs. The predicate
position relates the person denoted by a1 (only professors have a position) to his or her
university position. In the second clause, a1 and a2 are variables that denote persons who are
either in an advisedBy or a tempAdvisedBy relationship.
For CORA, examples of learned clauses are the following:
sameAuthor(a1,a2) ∨ ¬hasWordAuthor(a1,a3) ∨ ¬hasWordAuthor(a2,a3)
¬title(a1,a2) ∨ ¬title(a3,a2) ∨ sameBib(a3,a1)
In the first clause, a1 and a2 denote author fields while the predicate hasWordAuthor relates
author fields to words contained in these fields. In the second rule the predicate title relates titles
to their respective citations and the predicate sameBib is true if both its arguments denote the
same citation.
5.5 Related Work
Many works in the SRL or PILP area have addressed classification tasks. Our discriminative
method falls among those approaches that tightly integrate ILP and statistical learning in a
single step for structure learning. The earlier works in this direction are those in (Dehaspe 1997;
Popescul and Ungar 2003) that employ statistical models such as maximum entropy modeling
in (Dehaspe 1997) and logistic regression in (Popescul and Ungar 2003). These approaches
can be computationally very expensive. A simpler approach that integrates FOIL and Naïve
Bayes is nFOIL proposed in (Landwehr et al. 2005). This approach interleaves the steps of
generating rules and scoring them through CLL. In another work (Davis et al. 2005) these
steps are coupled by scoring the clauses through the improvement in classification accuracy.
This algorithm incrementally builds a Bayes net during rule learning and each candidate rule is
introduced in the network and scored by whether it improves the performance of the classifier.
In a recent approach (Landwehr et al. 2006), the kFOIL system integrates ILP and support
vector learning. kFOIL constructs the feature space by leveraging FOIL search for a set of
relevant clauses. The search is driven by the performance obtained by a support vector machine
based on the resulting kernel. The authors showed that kFOIL improves over nFOIL. Recently,
in TFOIL (Landwehr et al. 2007), Tree-Augmented Naïve Bayes, a generalization of Naïve
Bayes, was integrated with FOIL, and it was shown that TFOIL outperforms nFOIL.
Regarding other approaches on MLNs, the most closely related is the recently published
work of (Huynh and Mooney 2008). The difference from our algorithms lies in the very
restricted class of clauses that this approach can learn, namely non-recursive definite clauses.
The authors use a modification of ALEPH to generate a very large number of potential clauses
and then effectively learn their parameters by altering existing discriminative MLN weight-
learning methods to perform exact inference and L1 regularization. Since clauses are generated
by ALEPH, this approach is limited to problems where there is a target predicate that can
be inferred using non-recursive definite clauses, and only in this case is it possible to perform
exact inference. On the other hand, our algorithms, having no restrictions on the clauses that
can be learned, can deal with more general problems that need full first-order expressiveness.
Another difference is that the work of (Huynh and Mooney 2008) follows a two-step approach:
clauses are first generated by ALEPH and then weights are learned on the final theory. This
can be seen as a kind of static propositionalization, which was shown in (Landwehr et al. 2007)
to be outperformed on a large number of ILP datasets by the dynamic propositionalization
approach that we follow in our algorithms. Another advantage of our algorithms is that
Lazy-MC-SAT is an approximate inference algorithm that can handle ground atoms with
unknown truth values, which characterize many SRL domains. On the other hand, the algorithm
of (Huynh and Mooney 2008) performs exact inference and does not handle cases where data
may be incomplete or partially observed.
Regarding the integration of the two steps, the approach most closely related to the proposed
algorithms is nFOIL (and TFOIL as an extension), which is the first system in the literature to
tightly integrate feature construction and Naïve Bayes. Such a dynamic propositionalization
was shown to be superior to static propositionalization approaches that use Naïve Bayes only
to post-process the rule set. The approach differs from ours in that nFOIL selects features
and parameters that jointly optimize a probabilistic score on the training set, while our
algorithms maximize the likelihood on the training data but select the clauses based on the
tuning set. This approach is similar to SAYU (Davis et al. 2005), which uses the tuning set
to compute the score in terms of classification accuracy or AUC, with the difference that
DSL_CLL uses CLL as score instead of AUC; SAYU is thus similar only to DSL_AUC. From the
point of view of step integration, MACCENT (Dehaspe 1997) follows a similar approach by
inducing clausal constraints (one at a time) that are used as features for maximum-entropy
classification.
Another difference with nFOIL and SAYU is that all our algorithms use MC-SAT to perform
inference for the computation of CLL; MC-SAT is able to handle the probabilistic, deterministic
and near-deterministic dependencies that are typical of statistical relational learning. Moreover,
the lazy version, Lazy-MC-SAT, reduces memory and time by orders of magnitude, as the results
in (Poon et al. 2008) show. This makes it possible to apply the proposed algorithms to very large
domains.
Finally, from the point of view of search strategies, our algorithms are also similar to
approaches in ILP that exploit SLS (Zelezny et al. 2006). The algorithms that we propose here
differ in that they use likelihood as evaluation measure instead of ILP coverage criteria.
Moreover, our algorithms differ from those in (Zelezny et al. 2006) in that we use hybrid
SLS approaches, which can combine other simple SLS methods to produce high-performance
algorithms.
5.6 Summary
In this chapter we have introduced the ILS-DSL algorithm, which discriminatively learns first-
order clauses and their weights. The algorithm scores the candidate structures by maximizing
conditional likelihood or the area under the precision-recall curve, while setting the parameters
by maximum pseudo-likelihood. ILS-DSL is based on the Iterated Local Search metaheuristic.
To speed up learning we propose some simple heuristics that greatly reduce the computational
effort for scoring structures. Empirical evaluation with real-world data in two domains shows
the promise of our approach, which improves over the state-of-the-art discriminative weight
learning algorithm for MLNs in terms of conditional log-likelihood of the query predicates
given evidence. We have also compared the proposed algorithm with the state-of-the-art
generative structure learning algorithm and shown that on small datasets the generative
approach is competitive, while on larger datasets the discriminative approach outperforms
the generative one.
The algorithm can be further improved in several ways: weakening intensification through
a higher random-walk probability in the local search procedure; making the acceptance
function probabilistic, since the currently used acceptance function in ILS-DSL performs
iterative improvement in the space of local optima, which can lead to getting stuck in local
optima, and a probabilistic acceptance function would induce some random walk among the
local optima; dynamically adapting the nature of the perturbations; implementing parallel
models such as MPI (Message Passing Interface) or PVM (Parallel Virtual Machine) in order
to score more structures in parallel, or assigning each iteration of ILS-DSL to a separate
thread and then taking the best result; and developing heuristics that can find, among the
candidates that do not improve WPLL, potential candidates that can improve CLL or AUC.
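As an illustration of the probabilistic acceptance function suggested above, a Metropolis-style criterion is one possible choice. This is a sketch, not the rule used in ILS-DSL; the temperature value is an assumption, and scores are taken to be CLL values (higher is better):

```python
import math
import random

def accept(current_score, candidate_score, temperature=0.1, rng=random):
    """Probabilistic acceptance for ILS: always accept improvements,
    otherwise accept a worse local optimum with a probability that
    decays exponentially with the score loss. This replaces pure
    iterative improvement and allows a random walk among local optima."""
    if candidate_score >= current_score:
        return True
    return rng.random() < math.exp((candidate_score - current_score) / temperature)
```

With temperature near zero this degenerates to the current iterative-improvement rule; larger temperatures accept worse local optima more often.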
Chapter 6
The RBS-DSL algorithm
6.1 The GRASP metaheuristic
Greedy Randomized Adaptive Search Procedure (GRASP) (Feo and Resende 1989, 1995) is an
approach for quickly finding high-quality solutions by applying a greedy construction search
method (which, starting from an empty candidate solution, at each construction step adds the
solution component ranked best according to a heuristic selection function) and subsequently a
perturbative local search algorithm to improve the candidate solution thus obtained. This type
of hybrid search method often yields much better solution quality than simple SLS methods
initialized at candidate solutions obtained by Uninformed Random Picking (Hoos and Stutzle 2005).
Moreover, when starting from a greedily constructed candidate solution, the subsequent per-
turbative local search process typically takes much fewer improvement steps to reach a local
optimum. Since greedy construction methods can typically generate a very limited number of
different candidate solutions, GRASP avoids this disadvantage by randomizing the construc-
tion method such that it can generate a large number of different good starting points for a
perturbative local search method. In Algorithm 6.1, in each iteration, the randomized construc-
tive local search algorithm GreedyRandomizedConstruction and the perturbative LocalSearch
algorithm are applied until the termination criterion is met. The algorithm GreedyRandom-
izedConstruction, in contrast to greedy constructive algorithms, does not necessarily add the
best solution component but rather selects it randomly from a list of highly ranked solution
components (Restricted Candidate List) which can be defined by cardinality restriction or by
value restriction. In this chapter we present a novel algorithm inspired by GRASP that performs
randomized beam search by scoring the structures through maximum likelihood in the
Algorithm 6.1 The GRASP metaheuristic
Procedure GRASP
  S = ∅
  repeat
    S0 = GreedyRandomizedConstruction(S)
    S* = LocalSearch(S0)
    S = UpdateSolution(S, S*)
  until termination criterion is met
  Return S
end
first phase and then, in a second step, uses maximum CLL or AUC of the PR curve to randomly
generate a beam of the best clauses to add to the current MLN structure.
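The GRASP loop of Algorithm 6.1 can be rendered as a short generic routine; the construction, local-search and comparison functions are placeholders supplied by the caller, not part of the original pseudocode:

```python
def grasp(greedy_randomized_construction, local_search, better, iterations, rng):
    """Generic GRASP loop: repeatedly build a randomized greedy
    starting point, improve it by perturbative local search, and
    keep the best solution seen so far."""
    best = None
    for _ in range(iterations):
        s0 = greedy_randomized_construction(rng)   # randomized greedy start
        s_star = local_search(s0)                  # perturbative improvement
        if best is None or better(s_star, best):
            best = s_star
    return best
```

For instance, with a random integer starting point and ±1 hill climbing as the local search, the loop recovers the maximizer of a unimodal objective after a few restarts.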
6.2 Randomized Beam Discriminative Structure Learning
In this section we present the Randomized Beam Search Discriminative Structure Learning
(RBS-DSL) algorithm. Algorithm 6.2 starts with a beam containing the initial clauses (in case
there are clauses previously learned) and unit clauses, and iteratively adds to the current
structure the best clause found by the SearchBestClause procedure. This procedure (Algorithm
6.3) takes as input the current beam and, using the clause construction operators, constructs in
GenerateCandidates all the potential candidate clauses to be scored for addition to the current
structure. Then for each of these candidates the gain in WPLL is computed. In the next step,
the algorithm performs a randomized construction of candidate clauses in the
RandomizedConstruction procedure (Algorithm 6.4). In this procedure, similarly to a GRASP
approach, the Restricted Candidate List (RCL) is first defined in a random fashion based on the
WPLL gain values. All candidates with a gain in WPLL greater than minGain + α * (maxGain - minGain)
(where α is a random number from a uniform probability distribution) are considered for
inclusion in the RCL. The parameter α, called the RCL parameter, has the important function
of inducing randomness in the algorithm: it determines the level of randomness or greediness
in the construction. In some GRASP implementations the parameter is fixed, while in others it
is adapted dynamically. The case α = 0 corresponds to a pure greedy algorithm, while α = 1
is equivalent to a random construction.
Algorithm 6.2 The RBS-DSL algorithm
Input: (P: set of predicates, MLN: Markov Logic Network, RDB: Relational Database, QP: query predicate)
CLS = all clauses in MLN ∪ P;
LearnWeights(MLN, RDB);
BestScore = ComputeScore(MLN, RDB, QP);
repeat
  BestClause = SearchBestClause(P, MLN, BestScore, CLS, RDB, QP);
  if BestClause ≠ null then
    Add BestClause to MLN;
    BestScore = ComputeScore(MLN, RDB, QP);
  end if
until BestClause = null for δ consecutive steps
Return MLN
For RBS-DSL_CLL, ComputeScore computes the average CLL over all the groundings of the query predicate QP.
For RBS-DSL_AUC, ComputeScore computes the AUC of the PR curve.
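The outer loop of Algorithm 6.2 can be sketched in executable form. For illustration the SearchBestClause step is abstracted into a greedy scan over candidate clauses (the real procedure uses the randomized beam of Algorithm 6.3), and clauses, scoring and candidate generation are supplied by the caller:

```python
def rbs_dsl(initial_clauses, generate_candidates, score, delta):
    """Outer RBS-DSL loop (sketch): repeatedly look for a clause that
    improves the score of the current theory, add it when found, and
    stop after delta consecutive iterations without improvement."""
    mln = list(initial_clauses)
    best_score = score(mln)
    misses = 0
    while misses < delta:
        best_clause, best_clause_score = None, best_score
        for clause in generate_candidates(mln):
            s = score(mln + [clause])          # score theory with the candidate added
            if s > best_clause_score:
                best_clause, best_clause_score = clause, s
        if best_clause is not None:
            mln.append(best_clause)
            best_score = best_clause_score
            misses = 0
        else:
            misses += 1                        # no improving clause this iteration
    return mln
```

In the real algorithm, `score` stands for weight learning followed by CLL (or AUC) computation over the query-predicate groundings.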
The similarity of our algorithm with GRASP is that randomization is applied not only to
the choice of the candidates from the RCL but also to the construction of the RCL. On the other
hand, the difference from GRASP is that GRASP randomly chooses only one candidate from
the RCL in order to continue the search, while our algorithm randomly constructs a list of
clauses by choosing them from the RCL. Another difference is that we follow the heuristic
that only candidates with a positive gain in WPLL are to be considered for CLL scoring.
Thus, in case there are candidates with no gain (minGain ≤ 0), we set the value threshold to
zero. In order not to lose randomness in the case of threshold = 0, a random choice among the
RCL candidates follows. Once the potential candidates for the RCL are randomly constructed,
the algorithm randomly chooses among these according to the random number rand and the
parameter λ. In our experiments we found empirically that the value λ = 0.5 * beamSize/100
induces enough randomness in the choice from the RCL candidates. This value depends on the
size of the beam, which is a parameter of the main algorithm. In most cases, the number of
candidates in the RCL, and of those chosen from this list, can be very high. This can cause
intractable computation times, because most of these candidates have to be scored again in terms
of CLL (or AUC). For this reason, it is reasonable to place a limit on the number of clauses to
Algorithm 6.3 The SearchBestClause procedure of the RBS-DSL algorithm
SearchBestClause(P: set of predicates, MLN: Markov Logic Network, BestScore: CLL or AUC score, BestWPLL: WPLL score, CLS: list of clauses, RDB: Relational Database, QP: query predicate)
Beam = CLS;
repeat
  CandidateClauses = GenerateCandidates(Beam, P);
  for each clause C in CandidateClauses do
    Add C to the current MLN; LearnWeights(MLN, RDB);
    CWPLL = score of C by WPLL; WPLLGain of C = CWPLL - BestWPLL;
  end for
  BestWPLLClauses = RandomizedConstruction(CandidateClauses, BestWPLL);
  scoredList: list of candidates scored in terms of CLL (or AUC);
  for each clause C in BestWPLLClauses do
    Add C to the current MLN;
    ComputeScore(MLN, RDB);
    Add C to scoredList;
  end for
  NewBeam = RandomizedBeam(scoredList, BestScore);
  BestClause = best clause in NewBeam;
  Beam = NewBeam;
until two consecutive iterations have not produced improvement
Return BestClause
For RBS-DSL_CLL, ComputeScore computes the average CLL over all the groundings of the query predicate.
For RBS-DSL_AUC, ComputeScore computes the AUC of the PR curve.
6.2 Randomized Beam Discriminative Structure Learning
be evaluated in the next step. This is achieved by setting the parameter maxNumClauses which
determines the number of potential candidates to be scored by CLL (or AUC).
Algorithm 6.4 Randomized construction of the best WPLL candidate list

RandomizedConstruction(CandidateClauses, BestWPLL)
  BestWPLLClauses: randomized list of best WPLL candidates
  maxNumClauses = maximum number of clauses to choose from the RCL
  α = random([0,1])  // random number from a uniform probability distribution
  threshold: value to use as limit
  minGain = minimumWPLLGain(CandidateClauses)
  maxGain = maximumWPLLGain(CandidateClauses)
  if minGain > 0 then
    threshold = minGain + α * (maxGain - minGain)
  else
    threshold = 0
  end if
  for each clause C in CandidateClauses do
    if WPLLGain(C) > threshold then
      rand = random([0,1])  // random number from a uniform probability distribution
      if rand > λ then
        Add C to BestWPLLClauses
      end if
      if size of BestWPLLClauses = maxNumClauses then
        break
      end if
    end if
  end for
  return BestWPLLClauses
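The randomized construction above can be sketched in Python as follows. This is an illustrative sketch of Algorithm 6.4, not the actual implementation: clauses are reduced to (clause, gain) pairs, and the function name and signature are ours.

```python
import random

def randomized_construction(candidates, lam, max_num_clauses, rng):
    """GRASP-style restricted-candidate-list construction over WPLL gains
    (illustrative sketch of Algorithm 6.4). candidates: list of
    (clause, wpll_gain) pairs; lam plays the role of the parameter lambda."""
    gains = [gain for _, gain in candidates]
    min_gain, max_gain = min(gains), max(gains)
    if min_gain > 0:
        alpha = rng.random()  # uniform in [0, 1]
        threshold = min_gain + alpha * (max_gain - min_gain)
    else:
        # some candidates yield no WPLL gain: accept any positive-gain clause
        threshold = 0.0
    chosen = []
    for clause, gain in candidates:
        # keep a candidate only if it beats the threshold and survives
        # the random filter rand > lambda
        if gain > threshold and rng.random() > lam:
            chosen.append(clause)
            if len(chosen) == max_num_clauses:
                break
    return chosen
```

The RandomizedBeam procedure described later applies the same threshold-and-filter pattern, but computes the threshold unconditionally and caps the result at beamSize.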
After the procedure RandomizedConstruction returns the list BestWPLLClauses, all the
candidates in this list are scored for CLL (or AUC) and given as input to the RandomizedBeam
procedure. This procedure (Algorithm 6.5) performs the same randomized process on the
candidates, but this time based on their CLL (or AUC) values. Unlike the
RandomizedConstruction procedure, the randomized construction of the beam does not exclude
candidates with a negative gain. The value of the parameter λ is the same as for
the RandomizedConstruction procedure.
Algorithm 6.5 Randomized choice of the best CLL (or AUC) candidate list to form the new beam

RandomizedBeam(ListClauses, BestScore)
  ListClauses: list of clauses scored for CLL (or AUC)
  newBeam: new list of clauses to randomly generate from ListClauses
  beamSize = size of the beam for the algorithm RBS-DSL
  α = random([0,1])  // random number from a uniform probability distribution
  threshold: value to use as limit
  minGain = minimumGain(ListClauses)
  maxGain = maximumGain(ListClauses)
  threshold = minGain + α * (maxGain - minGain)
  for each clause C in ListClauses do
    if Gain(C) > threshold then
      rand = random([0,1])  // random number from a uniform probability distribution
      if rand > λ then
        Add C to newBeam
      end if
      if size of newBeam = beamSize then
        break
      end if
    end if
  end for
  return newBeam

For RBS−DSLCLL, minimumGain returns the minimum gain in CLL among all candidates.
For RBS−DSLAUC, minimumGain returns the minimum gain in AUC among all candidates.
6.2.1 The RBS-DSLCLL version
The RBS-DSLCLL version of the algorithm maximizes CLL during search. In Algorithms 6.2
and 6.3, the function ComputeScore computes the average CLL over all the groundings of the
query predicate QP. It uses the Lazy-MC-SAT algorithm to perform inference over the network
constructed using the current structure MLN and the relational data RDB of a tuning set. In
the tuning set all the groundings of the query predicate QP are commented out (treated as unknown during inference). MC-SAT produces
for each grounding of the query predicate QP the probability that it is true. These values are
then used to compute the average CLL by distinguishing positive and negative atoms. For a
positive atom, its estimated probability P contributes log P to the CLL, and for a negative
atom the contribution is log(1−P). In Algorithm 6.5, the minimumGain and
maximumGain functions compute respectively the minimum and maximum gain in CLL among
all the potential candidates.
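The CLL computation just described can be sketched as follows. This is a minimal illustration, not the Alchemy code: probs stands for the MC-SAT probability estimates, and the eps clipping is our addition to keep the logarithms finite.

```python
import math

def average_cll(probs, labels, eps=1e-6):
    """Average conditional log-likelihood over the groundings of a query
    predicate (sketch). probs: estimates P(atom is true) from MC-SAT;
    labels: the actual truth values of the ground atoms."""
    total = 0.0
    for p, y in zip(probs, labels):
        # clip away from 0 and 1 so the log stays finite (our safeguard)
        p = min(max(p, eps), 1.0 - eps)
        # positive atoms contribute log P, negative atoms log(1 - P)
        total += math.log(p) if y else math.log(1.0 - p)
    return total / len(probs)
```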
6.2.2 The RBS-DSLAUC version
The RBS-DSLAUC version of the algorithm maximizes the AUC of the PR curve during search. In
Algorithms 6.2 and 6.3, the function ComputeScore computes the AUC over all the groundings
of the query predicate QP. It uses the Lazy-MC-SAT algorithm to perform inference over the
network constructed using the current structure MLN and the relational data RDB of a tuning
set. In the tuning set all the groundings of the query predicate QP are commented out (treated as unknown during inference). MC-SAT
produces for each grounding of the query predicate QP the probability that it is true. The
precision-recall curve for a predicate is computed by varying the CLL threshold above which a
ground atom is predicted to be true; i.e., the ground atoms whose probability of being true is greater
than the threshold are predicted positive and the rest negative. For the computation of AUC we used
the package of (Davis and Goadrich 2006). In Algorithm 6.5, the minimumGain and maximum-
Gain functions compute respectively the minimum and maximum gain in AUC among all the
potential candidates.
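The threshold sweep that yields the precision-recall points can be sketched as below. This is only an illustration of the construction; the actual AUC values reported here were computed with the package of (Davis and Goadrich 2006), which also handles interpolation between PR points correctly. The function name is ours.

```python
def pr_points(probs, labels):
    """Points of the precision-recall curve obtained by sweeping the
    probability threshold (illustrative sketch): atoms with estimated
    probability above the threshold are predicted true.
    probs: MC-SAT estimates; labels: actual truth values."""
    ranked = sorted(zip(probs, labels), key=lambda t: t[0], reverse=True)
    total_pos = sum(1 for _, y in ranked if y)
    points, tp, fp = [], 0, 0
    for _, y in ranked:  # lower the threshold just below each probability
        if y:
            tp += 1
        else:
            fp += 1
        points.append((tp / total_pos, tp / (tp + fp)))  # (recall, precision)
    return points
```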
6.3 Experiments
Experimental evaluation of RBS-DSL was performed for the same problems introduced in
the previous chapters: Link Analysis in Social Networks and Entity Resolution in citation
databases. The datasets used for RBS-DSL are the same as those used for the algorithms GSL and ILS-DSL, introduced in sections 4.3.2 and 4.3.1.
Through experimental evaluation we want to answer the following questions:
(Q1) Are the proposed algorithms competitive with state-of-the-art discriminative training
algorithms of MLNs?
(Q2) Are the proposed algorithms competitive with the state-of-the-art generative algo-
rithm for structure learning of MLNs?
(Q3) Are the proposed algorithms competitive with pure probabilistic approaches such as
Naïve Bayes and Bayesian Networks?
(Q4) Are the proposed algorithms competitive with state-of-the-art ILP systems for the task
of structure learning of MLNs?
(Q5) Do the proposed algorithms always perform better than BUSL for classification tasks?
If not, are there any regimes in which each algorithm performs better?
(Q6) Regarding the task of Entity Resolution, do the proposed algorithms perform better
than other language-independent discriminative approaches based on MLNs?
Regarding question (Q1) we have to compare all our algorithms with Preconditioned Scaled
Conjugate Gradient (PSCG) which is the state-of-the-art discriminative training algorithm for
MLNs proposed in (Lowd and Domingos 2007). It must be noted that this algorithm takes
as input a fixed structure, and with the clausal knowledge base we use in our experiments for
CORA (each dataset comes with a hand-coded knowledge base), PSCG has achieved the best
published results. We also exclude the approach of adapting the rule set and then learning
weights with PSCG, since it would be computationally intractable.
To answer question (Q2) we have to perform experimental comparison with the Bottom-Up
Structure Learning (BUSL) algorithm (Mihalkova and Mooney 2007) which is the state-of-the-
art algorithm for this task. Since, in principle, the structure of MLNs can be learned using any ILP
technique, it would be interesting to know how our algorithms compare to ILP approaches. In
(Kok and Domingos 2005), the proposed algorithm based on beam search (BS) was shown to
outperform FOIL and the state-of-the-art ILP system ALEPH for the task of learning MLNs
structure. Moreover, BS outperformed both Naïve Bayes and Bayesian Networks in terms of
CLL and AUC. Since it was shown in (Mihalkova and Mooney 2007) that BUSL outperforms
the BS algorithm of (Kok and Domingos 2005), our baseline for questions (Q2), (Q3) and
(Q4) is again BUSL. It must be noted that, since the goal of learning MLNs (and then performing
inference over the model) is probability estimation, the proposed algorithms are not
directly comparable with ILP systems, because these are not designed to maximize the data's
likelihood (and thus the quality of the probabilistic predictions). Moreover, since ALEPH and
FOIL learn more restricted clauses (non-recursive definite clauses), the only ILP system that
is directly comparable with our algorithm is CLAUDIEN which, unlike most ILP systems that
learn only Horn clauses, is able to learn arbitrary first-order clauses. Thus the comparison re-
gards the task of structure learning of MLNs where ILP systems learn the structure followed by
a weight learning phase. In (Kok and Domingos 2005) the authors showed that CLAUDIEN
(also ALEPH and FOIL) followed by a weight learning phase was outperformed by the BS
algorithm in terms of CLL and AUC. Regarding question (Q5), we compare all our algorithms
and BUSL on two datasets with the goal of discovering regimes in which each one can per-
form better. We will use two datasets, one of which can be considered of small size and the
other one of much larger size. Finally, to answer question (Q6), we should compare our algo-
rithms with the best language-independent discriminative approach to Entity Resolution based
on MLNs proposed in (Singla and Domingos 2006b). In this work, the MLN(G+C+T) model is
language-independent because it does not contain rules referring to specific strings occurring
in the data. This is similar to the approach that we follow here for this task: we learn rules
which are not vocabulary specific. In (Singla and Domingos 2006b) the discriminative weight
learning approach is based on the voted perceptron for MLNs and was used to learn weights for
different hand-coded models (one of these was MLN(G+C+T)). Since in (Lowd and Domingos
2007) it was shown that PSCG in general outperforms the voted perceptron and for the task
of entity resolution the comparison was performed following a language-dependent approach
(excluding MLN(G+C+T)), it would be interesting to investigate how our algorithms compare
to MLN(G+C+T). Finally, we would like to compare RBS-DSL with the algorithm ILS-DSL
presented in the previous chapter and check if there is any difference in performance between
them.
6.3.1 Systems and Methodology
We implemented the algorithm RBS-DSL as part of the MLN++ package (described in section 8.3), which is a suite of
algorithms based on Markov Logic and built upon the Alchemy framework (Kok et al. 2005).
We used the implementation of L-BFGS and Lazy-MC-SAT in Alchemy to learn maximum
WPLL weights and compute CLL during clause search. Regarding parameter learning, we
compared the performance of our algorithms with the state-of-the-art algorithm PSCG of (Lowd and
Domingos 2007) for discriminative weight learning of MLNs. This algorithm takes as input
an MLN and the evidence (groundings of non-query predicates) and discriminatively trains the
MLN to optimize the CLL of the query predicates given evidence. Our algorithms and PSCG
optimize the CLL (or AUC) of the query predicates and a comparison between these algorithms
would be useful to understand whether automatically learning the clauses from scratch can improve
over hand-coded MLN structures in terms of classification accuracy of the query predicates
given evidence.
We performed all the experiments on a 2.13 GHz Intel Core2 Duo CPU. For the UW-
CSE dataset we trained PSCG on the hand-coded knowledge base provided with the dataset.
We used the implementation of PSCG in the Alchemy package and ran this algorithm with the
default parameters for 10 hours. For the CORA dataset, for PSCG we report the results obtained
in (Lowd and Domingos 2007) where PSCG was trained on a hand-coded MLN and achieved
best current results on this dataset. For the language-independent approach MLN(G+C+T) of
(Singla and Domingos 2006b) we report the results for three of the query predicates in this
domain: sameBib, sameAuthor and sameVenue (the results reported in (Singla and Domingos
2006b) do not include the predicate sameTitle).
For both datasets, for all our algorithms we used the following parameters: the mean and
variance of the Gaussian prior were set to 0 and 100, respectively; maximum variables per
clause = 4; maximum predicates per clause = 4; penalization of weighted pseudo-likelihood =
0.01 for UW-CSE and 0.001 for CORA; beamSize = 5 for UW-CSE and 10 for CORA. For L-
BFGS we used the following parameters: maximum iterations = 10,000 (tight) and 10 (loose);
convergence threshold = 10−5 (tight) and 10−4 (loose). For Lazy-MC-SAT during learning we
used the following parameters: memory limit = 600MB for UW-CSE and 1GB for CORA,
maximum number of steps for Gibbs sampling = 100; simulated annealing temperature = 0.5;
the parameter δ was set to 1. All these parameters were set in an ad hoc manner, and per-fold
optimization may lead to better results. In particular, the memory limit of Lazy-MC-SAT
was set higher for the RBS-based algorithms because we empirically observed that the larger
number of potential clauses, compared to ILS-DSL, required more memory. This
is due to the nature of RBS, which evaluates more clauses than ILS during search. However,
as can be noted from the results of the experiments, the larger memory available to Lazy-MC-SAT
for scoring the structures in the RBS-based algorithms did not produce much better results than
the ILS-based versions. This confirms our heuristic on the memory limit of Lazy-MC-SAT, namely that
most significant clauses are normally scored within a certain limit and a higher limit would
not change the results. Regarding BUSL, for both datasets, we used the following parameters:
the mean and variance of the Gaussian prior were set to 0 and 100, respectively; maximum
variables per clause: 5 for UW-CSE and 6 for CORA; maximum predicates per clause = 6;
penalization of WPLL: 0.01 for UW-CSE and 0.001 for CORA; minWeight = 0.5 for UW-CSE
and 0.01 for CORA. For L-BFGS we used the following parameters: maximum iterations =
10,000 (tight) and 10 (loose); convergence threshold = 10−5 (tight) and 10−4 (loose).
In the UW-CSE domain, we followed the same leave-one-area-out methodology as in
(Richardson and Domingos 2006). In the CORA domain, we performed 5-fold cross-validation.
For each train/test split, one of the training folds is used as tuning set for computing the CLL
(or AUC). For each system on each test set, we measured the CLL and the AUC of PR curve
for the query predicates. The advantage of the CLL is that it directly measures the quality of
the probability estimates produced. The advantage of the AUC is that it is insensitive to the
large number of true negatives (i.e., ground atoms that are false and predicted to be false), but
the disadvantage is that it ignores calibration by considering only whether true atoms are given
higher probability than false atoms. The CLL of a query predicate is the average over all its
groundings of the ground atom’s log-probability given evidence. The precision-recall curve for
a predicate is computed by varying the CLL threshold above which a ground atom is predicted
to be true; i.e., the ground atoms whose probability of being true is greater than the threshold are
predicted positive and the rest negative. For the computation of AUC we used the package of (Davis
and Goadrich 2006).
6.3.2 Results
After learning the structure discriminatively, we performed inference on the test fold for both
datasets by using MC-SAT with number of steps = 10000 and simulated annealing temperature
= 0.5. For each experiment, on the test fold all the groundings of the query predicates were
commented out: advisedBy for the UW-CSE dataset (professor and student are also commented out)
and sameBib, sameTitle, sameAuthor and sameVenue for CORA. MC-SAT produces probabil-
ity outputs for every grounding of the query predicate on the test fold. We used these values to
compute the average CLL over all the groundings and to compute the PR curve.
We denote the two versions of the algorithm as RBS−DSLCLL and RBS−DSLAUC. For the
algorithm that optimizes AUC of PR curve during search, we scored each structure by using the
package of (Davis and Goadrich 2006). We compare the results also with the algorithms pre-
sented in the previous chapter ILS−DSLCLL and ILS−DSLAUC. The results for all algorithms
on the UW-CSE dataset are reported in Table 6.1 for CLL and Table 6.2 for AUC. In Table 6.1,
CLL is averaged over all the groundings of the predicate advisedBy in the test fold. Regarding
the comparison with PSCG in terms of CLL, in this domain all our algorithms except RBS−DSLCLL
perform better than PSCG in every fold of the dataset and overall. RBS−DSLCLL
performs better than PSCG in two folds, worse in two others, and equally in the ai area. The
difference in the overall results between RBS−DSLCLL and PSCG is due to the low result of
the former in the systems fold. Regarding AUC, PSCG overall performs better than all our
algorithms. It must be noted that on two out of five folds (language and graphics) the results of
our algorithms were quite competitive and there was a large difference only in the theory fold
where PSCG achieved a high result. Our best performing algorithms in terms of CLL were
those that optimize AUC during search. This was a surprising result since we expected better
results from the algorithms that optimize CLL during search. On the other hand, in terms of
AUC, our best performing algorithm was RBS−DSLAUC. Overall for UW-CSE, we can state
that our algorithms perform better in terms of CLL and worse in terms of AUC.
For the CORA dataset the results are reported in Table 6.3 and 6.4. For CLL for each
query predicate we report the average of CLL of its groundings over the test fold (for each
predicate, training is performed on four folds and testing on the remaining one in a 5-fold
cross-validation). For CORA, compared to PSCG, all our algorithms perform better in terms
of CLL for each of the query predicates, but worse in terms of AUC. We observed empirically
on each fold that the performances in terms of CLL and AUC were always balanced: a slightly
better performance in CLL always resulted in a slightly worse performance in terms of AUC,
and vice versa. Since CLL determines the quality of the probability predictions output by the
algorithm, all our algorithms outperform PSCG in terms of the ability to predict correctly the
query predicates given evidence. However, since AUC is useful to predict the few positives in
the data, PSCG produces better results for only positive examples. Hence, these results answer
question (Q1). It must be noted that PSCG has achieved the best published results on CORA in
terms of AUC (Lowd and Domingos 2007) and the approach followed is language-dependent,
i.e. the hand-coded MLN used with PSCG in (Lowd and Domingos 2007) contains rules such
that a weight is learned for each ground clause that is constructed using specific constants in
the domain. This makes the approach with PSCG of (Lowd and Domingos 2007) vocabulary
specific while all our algorithms learn general rules not related to a specific set of strings.
Table 6.1: CLL results for the query predicate advisedBy in the UW-CSE domain
area          language       graphics       systems        theory         ai             Overall
ILS−DSLCLL    -0.048±0.016   -0.016±0.003   -0.020±0.003   -0.020±0.005   -0.022±0.003   -0.025±0.006
RBS−DSLCLL    -0.043±0.015   -0.026±0.004   -0.058±0.002   -0.019±0.004   -0.032±0.005   -0.036±0.006
ILS−DSLAUC    -0.028±0.008   -0.015±0.003   -0.017±0.002   -0.018±0.004   -0.019±0.003   -0.019±0.004
RBS−DSLAUC    -0.025±0.007   -0.015±0.003   -0.017±0.003   -0.018±0.004   -0.020±0.003   -0.019±0.004
PSCG          -0.049±0.016   -0.023±0.005   -0.026±0.005   -0.028±0.007   -0.032±0.005   -0.032±0.008
BUSL          -0.024±0.008   -0.014±0.002   -0.295±0.000   -0.013±0.003   -0.019±0.003   -0.073±0.003
Table 6.2: AUC results for the query predicate advisedBy in the UW-CSE domain
area          language   graphics   systems   theory   ai      Overall
ILS−DSLCLL    0.011      0.006      0.007     0.010    0.006   0.008
RBS−DSLCLL    0.034      0.009      0.010     0.012    0.008   0.015
ILS−DSLAUC    0.016      0.005      0.007     0.005    0.008   0.008
RBS−DSLAUC    0.073      0.005      0.005     0.005    0.007   0.019
PSCG          0.011      0.005      0.069     0.101    0.034   0.044
BUSL          0.115      0.007      0.007     0.032    0.013   0.035
Regarding the comparison with BUSL, the results show that all our algorithms perform
better than BUSL in terms of CLL on both datasets. It must be noted, however, that for UW-
CSE, BUSL performed generally better than our algorithms, but produced very low results in
one fold. In terms of AUC, BUSL performs slightly better on the UW-CSE dataset while in
the CORA dataset all our algorithms outperform BUSL. Therefore, questions (Q2), (Q3) and
(Q4) can be answered affirmatively. Our discriminative algorithms are competitive with BUSL
even though for BUSL, in the UW-CSE domain, we used optimized parameters taken from
(Mihalkova and Mooney 2007) in terms of number of variables and literals per clause, while
for our algorithms we did not perform per-fold optimization of any parameter.
Regarding question (Q5), the goal was to check whether the finding of (Ng and Jordan 2002),
that on small datasets generative approaches can perform better than discriminative ones,
carries over to MLNs. The UW-CSE dataset, with a total of 2673 tuples, is of much
smaller size than CORA, which has 70367 tuples. The results of Tables 6.1 and 6.2 show
that on the UW-CSE dataset, the generative algorithm BUSL performs better in terms of AUC
and is competitive in terms of CLL since it underperforms our algorithms only because of the
low results in the systems fold of the dataset. Thus we can answer question (Q5) confirming
the results in (Ng and Jordan 2002) that on small datasets generative approaches can perform
better than discriminative ones, while for larger datasets discriminative approaches outperform
Table 6.3: CLL results for all query predicates in the CORA domain
predicate     sameBib        sameTitle      sameAuthor     sameVenue      Overall
ILS−DSLCLL    -0.087±0.001   -0.077±0.006   -0.148±0.009   -0.121±0.004   -0.108±0.005
RBS−DSLCLL    -0.222±0.003   -0.120±0.008   -0.126±0.008   -0.129±0.005   -0.149±0.006
ILS−DSLAUC    -0.168±0.002   -0.117±0.010   -0.158±0.011   -0.101±0.004   -0.136±0.007
RBS−DSLAUC    -0.254±0.002   -0.077±0.007   -0.133±0.011   -0.172±0.005   -0.159±0.006
PSCG          -0.291±0.003   -0.231±0.014   -0.182±0.013   -0.444±0.012   -0.287±0.011
MLN(G+C+T)    -0.394±0.004   −              -0.263±0.053   -1.196±0.031   -0.618±0.030
BUSL          -0.566±0.001   -0.100±0.004   -0.834±0.009   -0.232±0.005   -0.433±0.005
Table 6.4: AUC results for all query predicates in the CORA domain
predicate     sameBib   sameTitle   sameAuthor   sameVenue   Overall
ILS−DSLCLL    0.603     0.428       0.371        0.315       0.429
RBS−DSLCLL    0.265     0.546       0.600        0.233       0.411
ILS−DSLAUC    0.334     0.470       0.688        0.252       0.436
RBS−DSLAUC    0.322     0.423       0.534        0.175       0.364
PSCG          0.990     0.953       0.999        0.823       0.941
MLN(G+C+T)    0.973     −           0.980        0.743       0.899
BUSL          0.138     0.419       0.323        0.218       0.275
generative ones.
The final question (Q6) is related to the task of entity resolution and the approaches which
are based on MLNs and are language independent, i.e. that do not contain rules which refer to
specific constants in the domain. The results of Tables 6.3 and 6.4 show that in terms of CLL,
all our algorithms outperform MLN(G+C+T) for all the query predicates, but in terms of AUC,
MLN(G+C+T) outperforms our algorithms. Thus, the same conclusions for PSCG are valid
for MLN(G+C+T). Our algorithms produce in general more accurate probability predictions,
while MLN(G+C+T) produces better results for only positive atoms. Therefore, question (Q6)
can be answered affirmatively.
Finally, regarding the comparison with ILS-DSL, on the UW-CSE dataset, RBS-DSL per-
formed generally better than ILS-DSL in terms of AUC. In terms of CLL the results were quite
balanced: ILS-DSLAUC and RBS-DSLAUC performed equally, and only ILS-DSLCLL performed
better than RBS-DSLCLL. On CORA, the ILS-DSL algorithm generally performed better than
RBS-DSL in terms of both CLL and AUC. It must be noted, however, that a larger beamSize
parameter for RBS-DSL (for CORA it was set to 10) could lead to improvements in accuracy.
This parameter seems to be more critical for RBS than the parameter k (number of restarts) is
for ILS. Moreover, for RBS the parameter δ was set to 1, while for ILS it was set to 2. All these
parameters were not tuned through a per-fold optimization process, so better performance
could be achieved by tuning the parameters of both algorithms.
6.4 Related Work
Regarding discriminative structure learning of MLNs, RBS-DSL is similar to ILS-DSL, so
we refer the reader to section 5.5 for related work on structure learning of SRL models. From the point
of view of the search strategy, the algorithm RBS-DSL has similarities with that in (Kok and
Domingos 2005) that performs a beam search. However, RBS-DSL is a stochastic algorithm
which randomizes the process of beam construction, whereas in (Kok and Domingos 2005) the
search is deterministic. Moreover, the algorithm of (Kok and Domingos 2005) is a generative
one and search is guided by WPLL while our algorithms are guided by conditional likelihood
or area under the precision-recall curve.
The RBS-DSL approach is also similar to approaches in ILP that exploit SLS (Zelezny. et al.
2006). The algorithms that we propose here differ in that they use likelihood as the evaluation
measure instead of ILP coverage criteria. Moreover, our algorithms differ from those in
(Zelezny. et al. 2006) in that we use Hybrid SLS approaches which can combine other simple
SLS methods to produce high performance algorithms.
GRASP is a widely used metaheuristic for hard combinatorial problems in many fields as
shown in (Festa and Resende 2002). However, its use in Machine Learning has, to the best of the
author's knowledge, not yet been explored. The results obtained in this chapter with
RBS-DSL show that GRASP can help in developing robust and highly efficient algorithms
for complex optimization problems in learning SRL models.
6.5 Summary
In this chapter we have introduced the RBS-DSL algorithm, which discriminatively learns first-
order clauses and their weights. The algorithm scores the candidate structures by maximizing
conditional likelihood or area under the Precision-Recall curve while setting the parameters by
maximum pseudo-likelihood. RBS-DSL is inspired by the Greedy Randomized Adaptive
Search Procedure (GRASP) metaheuristic and performs randomized beam search, scoring the struc-
tures through maximum weighted pseudo-likelihood in a first phase and then using CLL or AUC of the
PR curve in a second step to randomly generate a beam of the best clauses to add to the current
MLN structure. To speed up learning we propose some simple heuristics that greatly reduce
the computational effort for scoring structures. Empirical evaluation with real-world data in
two domains shows the promise of our approach, which improves over the state-of-the-art discrimi-
native weight learning algorithm for MLNs in terms of conditional log-likelihood of the query
predicates given evidence. We have also compared the proposed algorithm with the state-of-
the-art generative structure learning algorithm and shown that on small datasets the generative
approach is competitive, while on larger datasets the discriminative approach outperforms the
generative one.
RBS-DSL can be further improved in several ways: dynamically adapting the parameter
α for the Restricted Candidate List construction; scoring structures with MC-SAT in a parallel
model such as MPI (Message Passing Interface) or PVM (Parallel Virtual Machine) by assigning
each run of MC-SAT to a separate thread; developing heuristics that can find, among the candidates
that do not improve WPLL, those that can improve CLL or AUC; assigning the independent
iterations of GRASP to parallel CPUs in order to learn high-quality structures and greatly
speed up the whole learning task; and developing other heuristics for choosing candidates
from the RCL.
Chapter 7
The IRoTS and MC-IRoTS algorithms
Most real-world problems are characterized by both probabilistic and deterministic informa-
tion. The state-of-the-art in pure probabilistic and deterministic inference has seen important
advances in recent years towards solving hard problems. However, at the boundary of the
two, there has not been much work investigating combined methods for dealing with near-
deterministic dependencies that cause the #P-completeness of probabilistic inference (Roth
1996). Many problems with these dependencies appear in Statistical Relational Learning, thus
it is important to investigate how probabilistic and deterministic inference methods can be com-
bined. For example, in Entity Resolution (the problem of determining which observations refer
to the same object), both probabilistic inferences (e.g., observations with similar properties
are more likely to be the same object) and deterministic ones (e.g., transitive closure: if x =
y and y = z, then x = z) are involved (McCallum and Wellner 2005). This chapter presents
two algorithms, IRoTS and MC-IRoTS, for MAP/MPE and conditional inference in Markov
Logic respectively. IRoTS is a MAX-SAT solver based on the Iterated Local Search (Hoos
and Stutzle 2005; Loureno et al. 2002) and Robust Tabu Search (Taillard 1991) metaheuris-
tics while MC-IRoTS combines IRoTS with Markov Chain Monte Carlo and is able to deal
with probabilistic and deterministic dependencies. Experimental evaluation shows that IRoTS
performs better than MaxWalkSAT (Kautz et al. 1997a) for MAP/MPE inference in Markov
Logic, being faster and more accurate. We also show that MC-IRoTS improves in terms of
inference time over the state-of-the-art algorithm for conditional inference in MLNs.
7.1 MAP/MPE inference using IRoTS
The basic inference task in MNs and BNs is finding the most probable state of the world given
some evidence. This is generally known as Maximum a posteriori (MAP) inference in Markov
random fields, and Most Probable Explanation (MPE) inference in Bayesian Networks. MAP
inference in MNs means finding the most likely state of a set of output variables given the state
of the input variables, and is an NP-hard problem. From Equation 3.1 introduced in Section
3.1, for MLNs this inference task reduces to finding a truth assignment that maximizes the
sum of weights of satisfied clauses. This can be done using any weighted satisfiability solver,
and in practice need not be more expensive than standard logical inference by model checking.
The authors in (Singla and Domingos 2005) use the MaxWalkSAT solver (Kautz et al. 1997a)
for MAP inference in MLNs. This section proposes IRoTS with some modifications from the
original version of (Smyth et al. 2003) as a MAX-SAT solver for the MAP inference task in
MLNs.
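The objective that such a weighted satisfiability solver maximises can be sketched as follows. This is an illustrative encoding, with ground clauses represented as lists of signed literals; the function name and data layout are ours, not the Alchemy representation.

```python
def map_score(assignment, weighted_clauses):
    """Objective maximised by MAP/MPE inference in an MLN (sketch): the sum
    of the weights of the ground clauses satisfied by a truth assignment.
    assignment: dict ground atom -> bool; weighted_clauses: list of
    (weight, clause), each clause a list of (atom, sign) literals where
    sign=False denotes a negated atom."""
    total = 0.0
    for weight, clause in weighted_clauses:
        # a clause is satisfied if at least one of its literals is true
        if any(assignment[atom] == sign for atom, sign in clause):
            total += weight
    return total
```

A MAX-SAT solver such as MaxWalkSAT or IRoTS searches over truth assignments for one maximizing this score.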
7.1.1 The SAT and MAX-SAT problems
One of the central problems in logic is that of determining if a knowledge base (usually in
clausal form) is satisfiable, i.e., if there is an assignment of truth values to all ground atoms
that makes the KB true. The satisfiability problem in propositional logic (SAT) is the task
of deciding whether a given propositional formula has a model. More formally, given a set
of m clauses C1, ..., Cm involving n Boolean variables x1, ..., xn, the SAT problem is to decide
whether an assignment of values to variables exists such that all clauses are simultaneously
satisfied. This problem plays a crucial role in various areas of computer science, mathematical
logic and artificial intelligence.
MAX-SAT is the optimisation variant of SAT and can be seen as a generalisation of the
SAT problem: Given a propositional formula in conjunctive normal form (CNF), the MAX-
SAT problem then is to find a variable assignment that maximises the number of satisfied
clauses. In weighted MAX-SAT, each clause Ci has an associated weight wi and the goal is to
maximise the total weight of the satisfied clauses. The decision variants of SAT and MAX-SAT
are NP-complete (Garey and Johnson. 1979). Furthermore, it is known that optimal solutions
to MAX-SAT are hard to approximate; for MAX-3-SAT (unweighted MAX-SAT with 3 literals
per clause), e.g., there exists no polynomial-time approximation algorithm with a (worst-case)
approximation ratio lower than 8/7 ≈ 1.1429. It is worth noting that approximation algorithms
for MAX-SAT often achieve much better solution qualities in practice than this worst-case
bound suggests; however, their performance is usually substantially inferior to that of
state-of-the-art stochastic local search (SLS) algorithms for MAX-SAT (Hansen and
Jaumard 1990).
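To make the weighted MAX-SAT objective concrete, the following sketch enumerates all assignments of a tiny weighted CNF formula and returns one that maximises the total weight of satisfied clauses. The clause encoding and the example formula are illustrative assumptions, not from this chapter; exhaustive enumeration is feasible only for very small n, which is precisely why SLS algorithms are needed.

```python
from itertools import product

# A clause is a list of literals; literal i means "variable i is true",
# literal -i means "variable i is false". Variables are numbered from 1.
# Each weighted clause is a (clause, weight) pair.
weighted_clauses = [
    ([1, 2], 3.0),    # x1 or x2
    ([-1, 3], 2.0),   # not x1 or x3
    ([-2, -3], 1.0),  # not x2 or not x3
]
n = 3  # number of Boolean variables

def satisfied_weight(assignment, clauses):
    """Total weight of clauses satisfied by the assignment (a dict var -> bool)."""
    total = 0.0
    for clause, w in clauses:
        if any(assignment[abs(l)] == (l > 0) for l in clause):
            total += w
    return total

def brute_force_maxsat(clauses, n):
    """Exhaustive weighted MAX-SAT: only feasible for very small n."""
    best, best_w = None, -1.0
    for values in product([False, True], repeat=n):
        assignment = dict(zip(range(1, n + 1), values))
        w = satisfied_weight(assignment, clauses)
        if w > best_w:
            best, best_w = assignment, w
    return best, best_w

best, w = brute_force_maxsat(weighted_clauses, n)
```

For this toy formula the optimum leaves one clause unsatisfied, illustrating that weighted MAX-SAT, unlike SAT, asks for the best assignment rather than a perfect one.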
A successful approach to the SAT and MAX-SAT problems is stochastic local search (Hoos
and Stutzle 2005). Many SLS methods have been applied to SAT and MAX-SAT leading to
a large number of algorithms. These include algorithms originally proposed for SAT, which
can be applied to unweighted MAX-SAT in a straightforward way by keeping track of the
best solution found so far in the search process. As pointed out in (Hoos and Stutzle 2005),
it is not clear that SLS algorithms known to perform well on SAT will show equally strong
performance on MAX-SAT, and some empirical evidence suggests that this is generally not the
case. Therefore, many SLS algorithms were directly developed for
unweighted and, in particular, weighted MAX-SAT or extended from existing SLS algorithms
for SAT in various ways.
The best-performing SLS algorithms for unweighted and weighted MAX-SAT belong to
three categories: Tabu Search, Dynamic Local Search, and Iterated Local Search. High
performance on unweighted MAX-SAT instances was shown by Reactive Tabu Search (H-RTS),
a tabu search that dynamically adjusts the tabu tenure (Battiti and Protasi 1997).
High-performing Dynamic Local Search algorithms include DLM (Shang and Wah 1997), a later
extension called DLM-99-SAT (Wu and Wah 1999), and Guided Local Search (GLS) (Mills and
Tsang 2000). Computational results show that GLS is currently the top-performing SLS
algorithm for specific classes of weighted MAX-SAT instances, outperforming DLM and
MaxWalkSAT. Also highly competitive is the Iterated Local Search algorithm ILS-YI (Yagiura
and Ibaraki 2001), which uses a local search based on 2- and 3-flip neighbourhoods. In
particular, for MAX-SAT-encoded minimum-cost graph colouring and set covering instances, as
well as for a large MAX-SAT-encoded real-world time-tabling instance, the 2-flip variant of
ILS-YI performs better than the other versions of ILS-YI and a tabu search algorithm.
In (Smyth et al. 2003) the authors showed that IRoTS is highly competitive with GLS and
Novelty+/wcs+we on many MAX-SAT instances. On weighted and unweighted Uniform Ran-
dom 3-SAT instances, IRoTS performs significantly better than GLS and Novelty+ variants in
terms of CPU time; on the wjnh instances, IRoTS performs worse than Novelty+ variants and
for MAX-SAT-encoded instances, IRoTS performs worse than GLS.
7. THE IROTS AND MC-IROTS ALGORITHMS
One of the most successful SLS algorithms applied to SAT is WalkSAT (Selman et al.
1996). WalkSAT (Algorithm 7.1), starting from a random initial state, repeatedly flips
(changes the truth value of) an atom in a randomly chosen unsatisfied clause. With
probability p, WalkSAT picks a random atom from the clause, and with probability 1 − p it
picks the atom whose flip minimizes the number of satisfied clauses that become unsatisfied.
WalkSAT has been shown to be able to solve hard satisfiability instances with hundreds of
thousands of variables in minutes. The MaxWalkSAT algorithm (Kautz et al. 1997a) extends
WalkSAT to the weighted satisfiability problem, where each clause has a weight and the goal
is to maximize the sum of the weights of the satisfied clauses. (Systematic solvers have
also been extended to weighted satisfiability, but tend to perform poorly.)
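The scheme just described can be sketched in a few lines. The following is a simplified illustration of the MaxWalkSAT idea, not Alchemy's implementation; the clause encoding (positive/negative integer literals) and all names are our own assumptions.

```python
import random

def maxwalksat(clauses, n, max_flips=10000, max_tries=10, p=0.5, target=0.0, seed=0):
    """Simplified MaxWalkSAT sketch: minimise the total weight of unsatisfied
    clauses. clauses is a list of (literals, weight) pairs; literal k means
    variable k is true, -k means it is false. Variables are numbered 1..n."""
    rng = random.Random(seed)

    def unsat_cost(a):
        return sum(w for cl, w in clauses
                   if not any(a[abs(l)] == (l > 0) for l in cl))

    best, best_cost = None, float("inf")
    for _ in range(max_tries):
        a = {v: rng.random() < 0.5 for v in range(1, n + 1)}  # random restart
        for _ in range(max_flips):
            cost = unsat_cost(a)
            if cost < best_cost:
                best, best_cost = dict(a), cost
            if cost <= target:          # good enough: stop early
                return best, best_cost
            unsat = [cl for cl, w in clauses
                     if not any(a[abs(l)] == (l > 0) for l in cl)]
            clause = rng.choice(unsat)
            if rng.random() < p:        # random-walk step
                var = abs(rng.choice(clause))
            else:                       # greedy step: cheapest flip in the clause
                def cost_if_flipped(v):
                    a[v] = not a[v]
                    c = unsat_cost(a)
                    a[v] = not a[v]
                    return c
                var = min({abs(l) for l in clause}, key=cost_if_flipped)
            a[var] = not a[var]
    return best, best_cost
```

On a small satisfiable instance the sketch reaches cost 0 almost immediately; on hard weighted instances it simply returns the best assignment seen within the flip budget.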
In (Park 2005) it was shown how the problem of finding the most likely state of a Bayesian
network given some evidence can be efficiently solved by reduction to weighted satisfiabil-
ity. WalkSAT is essentially the special case of MaxWalkSAT obtained by giving all clauses
the same weight. In this dissertation we focus on function-free FOL with the domain closure
assumption (i.e., the only objects in the domain are those represented by the constants). A pred-
icate or formula is grounded by replacing all its variables by constants. Propositionalization
is the process of replacing a first-order knowledge base (KB) by an equivalent propositional
one. In finite domains, this can be done by replacing each universally (existentially) quantified
formula with a conjunction (disjunction) of all its groundings. A first-order KB is satisfiable
iff the equivalent propositional KB is satisfiable. Thus, inference over a first-order KB can
be performed by propositionalization followed by satisfiability testing. For MAP inference in
MLNs, the authors in (Singla and Domingos 2006a) use MaxWalkSAT as a weighted MAX-SAT
solver and also show how to use it in an algorithm for discriminative learning of MLN
parameters.
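The grounding step described above can be sketched as follows. The clause representation and the Smokes/Cancer example are illustrative assumptions (a standard Markov Logic toy example), not part of the UW-CSE domain used later in this chapter.

```python
from itertools import product

def ground_clause(literals, variables, constants):
    """Replace the variables of a first-order clause with every combination of
    constants, yielding the set of its propositional groundings.
    literals: list of (sign, predicate, args) where args mix variables and constants."""
    groundings = []
    for binding in product(constants, repeat=len(variables)):
        theta = dict(zip(variables, binding))  # substitution for this grounding
        ground = tuple(
            (sign, pred, tuple(theta.get(a, a) for a in args))
            for sign, pred, args in literals
        )
        groundings.append(ground)
    return groundings

# Hypothetical clause: !Smokes(x) v Cancer(x), grounded over two constants.
clause = [(False, "Smokes", ("x",)), (True, "Cancer", ("x",))]
gs = ground_clause(clause, ["x"], ["Anna", "Bob"])
```

A clause with k variables over a domain of c constants yields c^k groundings, which is the combinatorial explosion referred to later when counting ground clauses.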
7.1.2 Iterated Robust Tabu Search
Robust Tabu Search

Robust Tabu Search (RoTS) (Taillard 1991) is a special case of Tabu Search (Glover and
Laguna 1997). In each search step, the RoTS algorithm (Algorithm 7.2) for MAX-SAT flips
a non-tabu variable that achieves a maximal improvement in the total weight of the unsatisfied
clauses (the size of this improvement is also called the score) and declares it tabu for the next tt
steps. The parameter tt is called the tabu tenure. An exception to this "tabu" rule is made if a
more recently flipped variable achieves an improvement over the best solution seen so far (this
Algorithm 7.1 The WalkSAT algorithm

WalkSAT(wcl: weighted clauses, max_flips: maximum number of flips, max_tries: number of tries, target: the target cost, p: probability of random walk)
  atoms = variables in wcl;
  for i = 1 to max_tries do
    solution = a random truth assignment to atoms;
    cost = sum of weights of unsatisfied clauses in solution;
    for j = 1 to max_flips do
      if cost ≤ target then
        return solution;
      end if
      C = a randomly chosen unsatisfied clause;
      if Uniform(0,1) < p then
        AtomToFlip = a randomly chosen variable from C;
      else
        for each variable A in C do
          compute Cost(A);
        end for
        AtomToFlip = the A with lowest Cost(A);
      end if
      solution = solution with AtomToFlip flipped;
      cost = cost + Cost(AtomToFlip);
    end for
  end for
  return solution;
mechanism is called aspiration). Furthermore, whenever a variable has not been flipped within
a certain number of search steps, it is forced to be flipped. This implements a form of long-
term memory and helps prevent stagnation of the search process. The tabu status of variables is
determined by comparing the number of search steps that have been performed since the most
recent flip of a given variable with the current tabu tenure. Finally, instead of using a fixed
tabu tenure, every n iterations the parameter tt is randomly chosen from an interval [ttmin, ttmax]
according to a uniform distribution.
The RoTS algorithm is closely related to MaxWalkSAT/Tabu for weighted MAX-SAT. In
each search step, one of the non-tabu variables that achieves a maximal improvement in the total
weight of the unsatisfied clauses is flipped and declared tabu for the next tt steps. However,
unlike MaxWalkSAT, RoTS, in addition to the aspiration criterion, forces a variable to be
flipped if it has not been flipped for a certain number of steps.
Iterated Robust Tabu Search

The original version of IRoTS for MAX-SAT was proposed in (Smyth et al. 2003). Algorithm
7.3 starts by independently initializing (with equal probability) the truth values of the
atoms. Then it performs a local search using RoTS to efficiently reach a local optimum CLS.
At this point, a perturbation method, again based on RoTS, is applied, leading to the neighbor
CL′C of CLS; then a local search based on RoTS is applied to CL′C to reach another
local optimum CL′S. The accept function decides whether the search must continue from the
previous local optimum or from the last found local optimum CL′S (accept can perform random
walk or iterative improvement in the space of local optima).
Careful choice of the various components of Algorithm 7.3 is important for achieving high
performance. For the tabu tenure we adopt the parameter settings of (Smyth et al. 2003), which
have proven to perform well across many domains. At the beginning of each local
search and perturbation phase, all variables are declared non-tabu. The goal of the clause
perturbation operator (flipping atoms' truth values) is to jump to a different region of the
search space, where the next iteration of the search should start. Perturbations can be
strong or weak: if the jump lands near the current local optimum, the subsidiary local search
procedure LocalSearchRoTS may fall back into the same local optimum or enter a region with
the same value of the objective function (a plateau), while if the jump is too far,
LocalSearchRoTS may take too many steps to reach another good solution. In our
algorithm we use a fixed number of RoTS perturbation steps, 9n/10, with tabu tenure n/2, where n is the
Algorithm 7.2 The Robust Tabu Search algorithm

RoTS(F: weighted CNF formula, ttmin: minimum tabu tenure, ttmax: maximum tabu tenure, maxNoImprov: maximum number of steps without improvement)
  n = number of variables in F;
  Å = randomly chosen assignment of the variables in F;
  Score(Å) = sum of weights of unsatisfied clauses;
  A = Å;
  k = 0;
  repeat
    if k mod n = 0 then
      tt = random([ttmin, ttmax]);
    end if
    Atom = randomly selected variable whose flip results in a maximal improvement in Score;
    if Score(A with Atom flipped) < Score(Å) then
      A = A with Atom flipped;
    else
      if ∃ a variable that has not been flipped for ≥ 10 ∗ n steps then
        Atom = that variable;
        A = A with Atom flipped;
      else
        Atom = randomly selected non-tabu variable whose flip results in a maximal improvement in Score;
        A = A with Atom flipped;
      end if
    end if
    if Score(A) < Score(Å) then
      Å = A;
    end if
    k = k + 1;
  until no improvement in Å for maxNoImprov steps
  return Å;
Algorithm 7.3 The Iterated Robust Tabu Search algorithm

Input: C: set of weighted clauses in CNF; BestScore: current best score
  CLC = random initialization of truth values for atoms in C;
  CLS = LocalSearchRoTS(CLC);
  BestAssignment = CLS;
  BestScore = Score(CLS);
  repeat
    CL′C = PerturbRoTS(BestAssignment);
    CL′S = LocalSearchRoTS(CL′C);
    if Score(CL′S) ≥ BestScore then
      BestScore = Score(CL′S);
    end if
    BestAssignment = accept(BestAssignment, CL′S);
  until k consecutive iterations have produced no improvement
  return BestAssignment;
number of atoms (in future work we intend to dynamically adapt the nature of the perturbation).
The procedure LocalSearchRoTS performs RoTS steps until no improvement is achieved for
n²/d steps (we call d the threshold ratio), with a tabu tenure of n/10 + 4. The accept
function always accepts the best solution found so far. Our algorithm differs from that
in (Smyth et al. 2003) in that we do not dynamically adapt the tabu tenure and do not use
a probabilistic choice in accept.
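The overall loop with these parameter choices can be sketched as follows. The pluggable rots_phase function and the toy hill-climbing stand-in used to exercise it are our own illustrative assumptions, not the thesis implementation.

```python
def irots(score, init, rots_phase, max_stale):
    """Skeleton of IRoTS (Algorithm 7.3). rots_phase(assignment, steps, tt) is
    assumed to run a RoTS phase from `assignment` and return a new assignment;
    `score` is the quantity being maximised (e.g. total weight of satisfied
    clauses). Step/tenure choices follow the settings described in the text."""
    n = len(init)
    # initial local search phase, tabu tenure n/10 + 4
    current = rots_phase(init, steps=n * n, tt=n // 10 + 4)
    best, best_score = current, score(current)
    stale = 0
    while stale < max_stale:
        # perturbation: a short (9n/10 steps), strongly tabu-constrained RoTS run
        perturbed = rots_phase(best, steps=9 * n // 10, tt=max(1, n // 2))
        # subsidiary local search from the perturbed assignment
        candidate = rots_phase(perturbed, steps=n * n, tt=n // 10 + 4)
        if score(candidate) > best_score:
            best, best_score, stale = candidate, score(candidate), 0
        else:
            stale += 1
        # accept: always continue from the best assignment found so far
    return best, best_score

# Toy stand-in for a RoTS phase (illustration only): each step sets the first
# False bit to True, so "local search" climbs toward the all-True optimum.
def toy_phase(a, steps, tt):
    a = list(a)
    for _ in range(steps):
        for i, v in enumerate(a):
            if not v:
                a[i] = True
                break
    return tuple(a)

best, s = irots(score=sum, init=(False,) * 20, rots_phase=toy_phase, max_stale=3)
```

Because accept always keeps the incumbent, the skeleton performs iterative improvement in the space of local optima, which is exactly the deterministic variant of accept described above.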
7.1.3 Experiments
Through experimental evaluation we want to answer the following questions:
(Q1) Does the proposed IRoTS algorithm improve over the state-of-the-art algorithm for
MLNs in terms of solution quality?
(Q2) Does the performance depend on the particular configuration of clauses’ weights?
(Q3) Does the performance depend on particular features of the dataset, i.e., number of
ground clauses and predicates?
(Q4) In case IRoTS finds better solutions than the state-of-the-art algorithm, what is the
performance in terms of running times?
(Q5) What is the performance of the algorithms for huge relational domains with hundreds
of thousands of ground predicates and clauses?
Table 7.1: Inference results in terms of cost of false clauses for query predicate advisedBy for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts, using MLNs learned by running PSCG for 500 iterations

fold      IRoTS     MWSAT-Tabu  MWSAT-Tabu&Restarts
ai        93103.7   92393.5     92512.9
graphics  72221.8   72245.1     71659.8
language  32398.1   32668.2     32380.1
systems   117144.0  118416.0    118629.0
theory    71726.1   71727.9     71873.1
average   77318.7   77490.1     77411.0
We implemented the algorithm as part of the MLN++ package (section 8.3). In order to perform
MAP inference we need MLN models and evidence data. MLN models can be hand-coded or
learned from training data using the algorithms of the previous chapters. Since the goal here is
to perform inference for complex models, where it is not easy to find the best MAP state given
the evidence, we decided to generate complex models from real-world data and test IRoTS against
the current state-of-the-art algorithm MaxWalkSAT as implemented in Alchemy. As dataset we
took the UW-CSE dataset used in the previous chapter, together with the hand-coded MLN
model that comes with it. For the first experiment we learned weights using
the algorithm PSCG (Lowd and Domingos 2007), declaring advisedBy as the non-evidence predicate.
We trained the algorithm following a leave-one-out methodology for 500 iterations on each
area of the dataset. After having learned the MLNs, we performed MAP inference with IRoTS
with query predicate advisedBy. As in the previous chapters, we also commented out in the test
set the student and professor predicates, together with the predicate advisedBy. For a fair
comparison, we compared IRoTS against the tabu version of MaxWalkSAT, using the same number
of search steps for both algorithms. For IRoTS the threshold ratio d was set to 1 and the
parameter k, the number of iterations without improvement, was set to 3. We observed that on
the language and theory folds the iterations were very fast and three steps without improvement
were too few; for this reason we used k = 10 for these two areas and k = 3 for the others. In
all cases, at the end of IRoTS we counted the overall number of flips performed and used the
same number for MaxWalkSAT with tabu (MWSAT-Tabu). The tabu tenure for MWSAT-Tabu
was set to the Alchemy default, i.e., 5.
Since IRoTS uses the perturbation procedure to escape local optima, it would be fair to
Table 7.2: Running times (in minutes) for the same number of search steps for query predicate advisedBy for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts, using MLNs learned by running PSCG for 500 iterations

fold      IRoTS   MWSAT-Tabu  MWSAT-Tabu&Restarts  Num. ground preds  Num. ground clauses
ai        56.65   62.98       60.27                4760               185849
graphics  26.91   28.88       46.30                3843               136392
language  1.03    1.08        1.06                 840                15762
systems   125.71  134.35      192.16               5328               218918
theory    17.85   19.30       34.03                2499               73600
average   45.63   49.32       66.76                -                  -
compare IRoTS with a version of MaxWalkSAT-Tabu that uses a similar mechanism for jumping
to a different region of the search space. For this reason we also compared IRoTS with
MaxWalkSAT-Tabu&Restarts, using ten restarts and a number of flips for each restart equal
to 1/10th of the overall number of flips. In this way the comparison remains fair, since all
algorithms perform the same number of flips. Moreover, for MaxWalkSAT-Tabu we used the
default Alchemy tabu tenure of five, but it is more informative to compare IRoTS with a
version of MaxWalkSAT-Tabu&Restarts that has the same tabu tenure as IRoTS, i.e., n/10 + 4.
Thus we used this tabu tenure for MaxWalkSAT-Tabu&Restarts.
The results are reported in Table 7.1, where for each algorithm we report the cost of false
clauses of the final solution; the running times of inference are reported in Table 7.2. As
can be seen, IRoTS is more accurate than the other two algorithms, since it finds solutions
of higher quality. MaxWalkSAT-Tabu&Restarts is more competitive than MaxWalkSAT-Tabu
due to its ability to escape local optima by jumping to a different region of the search space.
The running times show that IRoTS is faster than both of the other algorithms even though the
number of search steps is the same. Thus, questions (Q1) and (Q4) can be answered affirmatively.
However, we want to be sure that the performance advantage of IRoTS over the other algorithms
does not depend on the weights of the model. For this reason we decided to generate other MLNs
on the same dataset, but with different weights. We did this by again using PSCG
and running it for 10 hours instead of 500 iterations on each training set. This guarantees
that the generated MLNs differ in their clause weights. The MAP inference
results for these MLNs are reported in Table 7.3. As can be seen, IRoTS again performs
better than the other algorithms; thus question (Q2) can be answered affirmatively, since for the
same number of ground clauses and predicates but with different clause weights, IRoTS finds
Table 7.3: Inference results in terms of cost of false clauses for query predicate advisedBy for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts, using MLNs learned by running PSCG for 10 hours

fold      IRoTS    MWSAT-Tabu  MWSAT-Tabu&Restarts
ai        98513.5  99737.8     99876.4
graphics  28007.9  28074.8     28005.2
language  10985.8  11070.8     10711.3
systems   73154.6  73471.8     73642.9
theory    90979.1  89517.9     89462.7
average   60328.2  60374.6     60339.7
Table 7.4: Running times (in minutes) for the same number of search steps for query predicate advisedBy for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts, using MLNs learned by running PSCG for 10 hours

fold      IRoTS  MWSAT-Tabu  MWSAT-Tabu&Restarts  Num. ground preds  Num. ground clauses
ai        71.34  78.31       74.53                4760               185849
graphics  26.3   28.02       43.06                3843               136392
language  1.48   1.57        1.52                 840                15762
systems   55.06  59.92       57.98                5328               218918
theory    25.53  26.73       49.06                2499               73600
average   35.94  38.91       45.23                -                  -
better solutions than the other algorithms. Regarding running times for the last experiments,
the results are reported in Table 7.4 and again IRoTS is faster than the other algorithms.
An important question to answer is whether the performance advantage of IRoTS over the other
algorithms depends on the number of ground clauses and predicates, i.e., whether the same
performance is maintained for different numbers of groundings. For this reason we decided to
consider an additional query predicate in the UW-CSE dataset, in order to change the number of
ground atoms and clauses. We again learned weights using PSCG, but this time considering
as non-evidence predicates both advisedBy and tempAdvisedBy. We ran PSCG
on each fold for 50 iterations. The learned MLNs should be able to predict the probability
of all groundings of both predicates given the evidence. We report experiments for each of the
predicates in turn, and finally for an inference task where both predicates are specified as query
predicates. In this way we obtain a different number of ground predicates and clauses compared
to the previous experiments. The results for the query predicate advisedBy with the new MLNs
are reported in Table 7.5 and the respective running times are
Table 7.5: Inference results in terms of cost of false clauses for query predicate advisedBy for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts, using MLNs learned by running PSCG for 50 iterations with both advisedBy and tempAdvisedBy as non-evidence predicates

fold      IRoTS  MWSAT-Tabu  MWSAT-Tabu&Restarts
ai        50.85  50.85       50.85
graphics  62.10  62.10       62.10
language  9.75   9.75        9.75
systems   52.96  52.96       52.96
theory    57.23  57.23       57.23
average   46.58  46.58       46.58
Table 7.6: Running times (in minutes) for the same number of search steps for query predicate advisedBy for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts, using MLNs learned by running PSCG for 50 iterations with both advisedBy and tempAdvisedBy as non-evidence predicates

fold      IRoTS  MWSAT-Tabu  MWSAT-Tabu&Restarts  Num. ground preds  Num. ground clauses
ai        2.15   13.55       10.30                4760               185762
graphics  2.10   8.40        6.30                 3843               136297
language  0.12   0.32        0.27                 840                15711
systems   11.20  80.71       60.07                5328               218820
theory    0.15   2.08        1.63                 2499               73540
average   3.14   21.01       15.71                -                  -
reported in Table 7.6. As can be seen, in this case the algorithms find the same solution, but
IRoTS is much faster than the other two algorithms. The number of ground predicates and
clauses differs from the previous experiments.
In Table 7.7 we report the inference results for the query predicate tempAdvisedBy. As
the results show, IRoTS performs much better than MWSAT-Tabu and is more accurate than
MWSAT-Tabu&Restarts. Regarding running times, the results for all algorithms are reported
in Table 7.8, and IRoTS is clearly faster than the other algorithms.
Finally, with the MLNs generated by declaring both advisedBy and tempAdvisedBy as
non-evidence predicates, we perform inference by specifying both predicates as query
predicates in a single inference task. The results are shown in Table 7.9. IRoTS is
clearly superior to the other algorithms. The difference in solution quality is most evident
against MWSAT-Tabu, with an improvement of approximately 12%.
MWSAT-Tabu&Restarts is competitive with IRoTS but loses on average 7% in terms of solu-
Table 7.7: Inference results in terms of cost of false clauses for query predicate tempAdvisedBy for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts, using MLNs learned by running PSCG for 50 iterations with both advisedBy and tempAdvisedBy as non-evidence predicates

fold      IRoTS     MWSAT-Tabu  MWSAT-Tabu&Restarts
ai        16112.70  16669.30    16309.00
graphics  12872.90  13153.10    12765.50
language  2238.57   2196.23     2024.31
systems   19722.50  20352.70    19938.30
theory    7388.90   7668.17     7600.41
average   11667.11  12007.90    11727.50
Table 7.8: Running times (in minutes) for the same number of search steps for query predicate tempAdvisedBy for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts, using MLNs learned by running PSCG for 50 iterations with both advisedBy and tempAdvisedBy as non-evidence predicates

fold      IRoTS  MWSAT-Tabu  MWSAT-Tabu&Restarts  Num. ground preds  Num. ground clauses
ai        93.89  101.98      143.29               4760               185672
graphics  30.63  32.81       33.03                3843               136244
language  0.73   0.75        0.68                 840                15706
systems   61.07  65.65       95.62                5328               218727
theory    47.85  55.60       34.50                4892               261078
average   46.83  51.36       61.42                -                  -
tion quality compared to IRoTS. The running times are reported in Table 7.10: IRoTS is
slightly slower than MWSAT-Tabu and slightly faster than MWSAT-Tabu&Restarts; however,
the differences are small compared to the overall running times.
The results in the last six tables clearly answer questions (Q2) and (Q3). We have generated
different MLN models with different weights, but the better performance of IRoTS over the
other algorithms does not appear to be sensitive to the clause weights. Moreover, with the last
three experiments we generated MLN models that, together with the evidence data, give rise to
different numbers of ground predicates and clauses during inference. The results show that
IRoTS is superior in terms of solution quality and that its performance does not change with the
number of ground predicates and clauses. Regarding question (Q4), IRoTS is in general faster
than the other algorithms; in only one case is IRoTS slightly slower than MWSAT-Tabu, and
there it finds much better solutions. Thus question (Q4) can be answered by stating that even
though it finds better solutions, IRoTS does not spend more time than the other algorithms; it is in fact faster
Table 7.9: Inference results in terms of cost of false clauses for query predicates advisedBy and tempAdvisedBy in a single inference task, for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts, using MLNs learned by running PSCG for 50 iterations with both advisedBy and tempAdvisedBy as non-evidence predicates

fold      IRoTS     MWSAT-Tabu  MWSAT-Tabu&Restarts
ai        13367.10  15659.60    14610.00
graphics  11511.40  12622.90    11994.40
language  1996.73   2054.05     2004.55
systems   16823.70  19283.50    18484.60
theory    6845.18   7545.73     7112.81
average   10108.82  11433.16    10841.27
Table 7.10: Running times (in minutes) for the same number of search steps for query predicates advisedBy and tempAdvisedBy in a single inference task, for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts, using MLNs learned by running PSCG for 50 iterations with both advisedBy and tempAdvisedBy as non-evidence predicates

fold      IRoTS   MWSAT-Tabu  MWSAT-Tabu&Restarts  Num. ground preds  Num. ground clauses
ai        294.01  278.3       331.79               9384               680351
graphics  166.78  160.78      150.17               7564               495227
language  7.1     8.71        8.1                  1624               52491
systems   362.78  339.2       337.44               10512              804425
theory    80.65   104.3       103.05               4900               261890
average   182.26  178.26      186.11               -                  -
than both MWSAT-Tabu and MWSAT-Tabu&Restarts.
The last experiment and the results reported in Tables 7.9 and 7.10 partially answer question
(Q5): as Table 7.10 shows, the inference task involves thousands of ground predicates and a
very large number of ground clauses.
Finally, to answer question (Q5) completely, we decided to generate MLNs with an additional
query predicate, such that the number of ground predicates and clauses becomes very
high. From the predicates of the UW-CSE domain we chose the taughtBy predicate, which has
three arguments: course, person and period. This gives rise to a huge ground MN to be
solved for MAP inference. We learned the MLNs with PSCG, specifying taughtBy as an
additional non-evidence predicate and running the weight learning algorithm for 50 iterations.
We then performed inference with query predicates taughtBy, advisedBy and tempAdvisedBy.
The results are reported in Table 7.11 and show that IRoTS again performs better
Table 7.11: Inference results in terms of cost of false clauses for query predicates taughtBy, advisedBy and tempAdvisedBy in a single inference task, for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts, using MLNs learned by running PSCG for 50 iterations with the three predicates as non-evidence predicates

fold      IRoTS      MWSAT-Tabu  MWSAT-Tabu&Restarts
ai        509738.00  536670.00   538508.00
graphics  338790.00  338554.00   339004.00
language  37034.00   43756.20    42886.70
systems   494128.00  619380.00   604742.00
theory    175668.00  214252.00   210758.00
average   311071.60  350522.44   347179.74
Table 7.12: Running times (in minutes) for the same number of search steps for query predicates taughtBy, advisedBy and tempAdvisedBy in a single inference task, for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts, using MLNs learned by running PSCG for 50 iterations with the three predicates as non-evidence predicates

fold      IRoTS  MWSAT-Tabu  MWSAT-Tabu&Restarts  Num. ground preds  Num. ground clauses
ai        52.97  49.22       50.52                23664              1894428
graphics  40.33  35.58       32.62                23485              1510794
language  1.18   1.17        1.14                 5152               157944
systems   48.58  38.58       36.62                26136              2045461
theory    15.33  11.5        12.62                14504              794484
average   31.68  27.21       26.7                 -                  -
than the other two algorithms. Running times and the numbers of ground predicates and clauses
are reported in Table 7.12. As can be seen, the number of ground clauses is very high, and in
one fold it reaches nearly 2 million. This is common for relational domains, where grounding
the first-order clauses causes a combinatorial explosion in the number of ground clauses. The
results show that IRoTS is slower than the other two algorithms, but finds better solutions.
Thus question (Q5) can be answered affirmatively: for inference tasks with a huge number of
ground predicates and clauses, IRoTS is clearly superior to the other algorithms in terms of
solution quality.
7.2 Conditional Inference for MLNs using MC-IRoTS
Conditional inference in graphical models involves computing the distribution of the query
variables given the evidence, a problem that has been shown to be #P-complete (Roth 1996). The most
widely used approach to approximate inference is based on MCMC methods, and in particular
Gibbs sampling. One of the problems that arise in real-world applications is that an inference
method must be able to handle both the probabilistic and the deterministic dependencies that
may hold in the domain. MCMC methods are suitable for handling probabilistic dependencies, but
give poor results when deterministic or near-deterministic dependencies characterize a
domain. On the other hand, logical methods such as satisfiability testing cannot be applied to
probabilistic dependencies. One approach to dealing with both kinds of dependencies is that of
(Poon and Domingos 2006), where the authors use SampleSAT (Wei et al. 2004) within an MCMC
algorithm to sample uniformly from the set of satisfying solutions. As pointed out in (Wei et al.
2004), SAT solvers find solutions very fast but may sample highly non-uniformly, while MCMC
methods may take exponential time, in terms of problem size, to reach the stationary
distribution. For this reason, the authors in (Wei et al. 2004) proposed a hybrid strategy that
combines random walk steps with MCMC steps, in particular with Metropolis transitions. This
permits efficient jumping between isolated or near-isolated regions of non-zero probability,
while preserving detailed balance.
Deterministic dependencies often cause the support of the probability distribution to break
into disconnected regions, which makes it difficult to design ergodic Markov chains for
MCMC inference (Gilks et al. 1996). Gibbs sampling can thus become trapped in a single region
and may never converge to the correct answers. A simple remedy is to run multiple
chains with random starting points, but in general this does not solve the problem, since it is
not guaranteed that different regions will be sampled with frequency proportional to their
probability. In practice there may be a very large number of regions, and simply running
multiple chains is not an optimal solution. Near-deterministic dependencies, on the other hand,
preserve ergodicity but lead to intractably long convergence times, which methods such as
simulated tempering (Marinari and Parisi 1992) attempt to alleviate. Another inference method
is belief propagation (Yedidia et al. 2001), where deterministic or near-deterministic
dependencies can lead to incorrect answers or failure to converge. Deterministic dependencies
can be exploited to speed up exact inference, but this is unlikely to scale to the problems
found in SRL domains, where there are many densely connected variables.
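To illustrate how a deterministic dependency breaks ergodicity, consider a toy distribution over two Boolean variables that must be equal (x1 = x2), so the support consists of two disconnected states. Single-site Gibbs sampling can never cross between them, since flipping either variable alone has probability zero. The sketch below is our own illustration, not from the dissertation.

```python
import random

def gibbs_equal_constraint(x, steps, rng):
    """Single-site Gibbs sampling for the uniform distribution over
    {(False, False), (True, True)}: states where x1 != x2 have probability 0.
    Resampling one variable given the other always copies the other's value,
    so the chain can never move between the two support regions."""
    states = []
    for _ in range(steps):
        i = rng.randrange(2)       # pick a variable to resample
        x = list(x)
        x[i] = x[1 - i]            # conditional puts all mass on equality
        x = tuple(x)
        states.append(x)
    return states

rng = random.Random(0)
chain = gibbs_equal_constraint((False, False), 1000, rng)
# Started in (False, False), the chain never visits (True, True), so the
# estimated P(x1 = True) is 0 instead of the correct 1/2.
```

This is exactly the failure mode that the hybrid SampleSAT-style moves, and MC-IRoTS below, are designed to avoid.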
In this dissertation we use the same approach as the authors of (Poon and Domingos
2006), but instead of SampleSAT, for MC-IRoTS we propose to use SampleIRoTS, which
performs with probability p a RoTS step and with probability 1 − p a simulated annealing (SA)
step. We use fixed-temperature annealing (i.e., Metropolis) moves. The goal is to reach
a first solution as fast as possible through IRoTS and then exploit the ability of SA to explore a
cluster of solutions. A cluster of solutions is usually a set of connected solutions, such that any
two solutions within the cluster can be connected through a series of flips without leaving the
cluster. In many domains of interest, solutions occur in clusters, and it is highly useful to explore
such clusters without leaving them. SA has good properties for exploring a connected space:
it samples near-uniformly and often explores all the neighboring solutions.
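A single SampleIRoTS-style move can be sketched as follows. This is illustrative only: the greedy step is a simplified stand-in for a full RoTS step with tabu memory, and the clause encoding and function names are our own assumptions.

```python
import math
import random

def sample_irots_step(a, clauses, p, temperature, rng):
    """One SampleIRoTS-style move (sketch): with probability p take a greedy
    RoTS-style step (flip the variable that most reduces the weight of
    unsatisfied clauses), otherwise take a fixed-temperature Metropolis (SA)
    step (flip a random variable, accepting uphill moves with prob exp(-d/T))."""
    def cost(x):
        return sum(w for cl, w in clauses
                   if not any(x[abs(l)] == (l > 0) for l in cl))

    def flipped(x, v):
        y = dict(x)
        y[v] = not y[v]
        return y

    if rng.random() < p:                       # greedy RoTS-style step
        return min((flipped(a, v) for v in a), key=cost)
    v = rng.choice(list(a))                    # Metropolis (SA) step
    b = flipped(a, v)
    delta = cost(b) - cost(a)
    if delta <= 0 or rng.random() < math.exp(-delta / temperature):
        return b                               # accept (always, if downhill)
    return a                                   # reject: stay in current state
```

The greedy steps drive the chain quickly toward a solution cluster, while the fixed-temperature Metropolis steps explore the cluster near-uniformly, mirroring the division of labour described above.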
Through MC-IRoTS we can perform conditional inference given evidence to compute
probabilities for query predicates. These probabilities can be used to make predictions from the
model. Since inference is a computationally hard task, it is highly desirable to design
high-performing algorithms. IRoTS has been shown to be a very competitive algorithm on several
SAT instances (Smyth et al. 2003), but to the best of our knowledge, no results have been
reported for huge domains such as those of SRL.
Often, in many application domains, learning and/or inference is not performed once in
batch mode, but rather over many time steps in an on-line mode. On-line learning and
inference, often used by agents, requires high-performing algorithms, since the agent
continuously updates the evidence and the query by adding, changing or deleting evidence and
query atoms, and then waits for a response from the inference algorithm in order to make a
decision based on its output. For this reason, we will compare MC-IRoTS with
the state-of-the-art algorithm not only in terms of the quality of the query probabilities
produced, but also in terms of running time. Likewise, as shown in section 7.3, inference
can be used during learning, and the inference procedure may be called thousands of times in
the course of learning. This requires fast inference algorithms in order to speed up the entire
learning process. We will show through experiments that MC-IRoTS is faster than the
state-of-the-art algorithm for inference in Markov logic.
7.2.1 The SampleIRoTS algorithm: Combining MCMC and IRoTS
One of the most widely used MCMC methods for computing conditional probabilities is Gibbs
sampling, which proceeds by sampling each variable in turn given its Markov blanket (the
variables it appears with in some potential). In order to generate samples from the correct
distribution, it is sufficient that the Markov chain satisfy ergodicity and detailed balance. In
essence, all states must be aperiodically reachable from each other, and for any two states x, y,
P(x)T(x → y) = P(y)T(y → x), where T is the chain's transition probability. In the presence
of strong dependencies, changes to the state of a variable given its neighbors become
very unlikely, and convergence of the probability estimates to the true values becomes very
slow. In the limit of deterministic dependencies, ergodicity breaks down. Simulated tempering
can be used to speed up Gibbs sampling by running, in parallel with the original chain, chains
with reduced weights, and periodically attempting to swap the states of two chains. The
disadvantage is that if weights are very large, swaps become very unlikely, and ergodicity is
broken by infinite weights.
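The detailed balance condition above can be checked numerically on a toy two-state chain. This is only an illustrative sketch: the two-state distribution P(x) ∝ e^{w·f(x)} stands in for a ground network with a single strong dependency, and the Metropolis proposal is the usual "propose the other state, accept with min(1, P(y)/P(x))" rule.

```python
import math

# Two-state distribution with a strong dependency: P(1)/P(0) = e^w.
w = 3.0
P = {0: 1.0, 1: math.exp(w)}
Z = sum(P.values())
P = {s: v / Z for s, v in P.items()}

def metropolis_T(x, y):
    """Transition probability of a Metropolis chain that proposes the other
    state and accepts with probability min(1, P(y)/P(x))."""
    if x == y:
        return 1.0 - min(1.0, P[1 - x] / P[x])
    return min(1.0, P[y] / P[x])

# Detailed balance: P(x) T(x -> y) = P(y) T(y -> x).
lhs = P[0] * metropolis_T(0, 1)
rhs = P[1] * metropolis_T(1, 0)
assert abs(lhs - rhs) < 1e-12
```

Note that as w grows, T(1 → 0) = e^{−w} shrinks: the chain stays ever longer in the high-probability state, which is exactly the slow-mixing behavior described above.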
Another widely used approach relies on auxiliary variables to capture the dependencies.
For instance, let P(X = x,U = u) = (1/Z)∏k I[0,φk(xk)](uk), where φk is the kth potential func-
tion, uk is the kth auxiliary variable, I[a,b](uk) = 1 if a ≤ uk ≤ b, and I[a,b](uk) = 0 otherwise.
The marginal distribution of X under this joint is P(X = x), thus for sampling from the original
distribution it is sufficient to sample from P(x,u) and ignore the u values. P(uk|x) is uniform
in [0,φk(xk)], and thus easy to sample from. P(x|u) is uniform in the “slice” of χ that satisfies
φk(xk) ≥ uk for all k. Identifying this region is the main difficulty in this technique, known as
slice sampling (Damien et al. 1999).
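For a tiny discrete distribution the auxiliary-variable scheme can be sketched directly. This is illustrative only: the slice {x : φ(x) ≥ u} is identified here by brute-force enumeration, which is precisely the step that is hard in general, and the single potential φ over three states is an arbitrary choice.

```python
import random

# Discrete slice sampler for P(x) proportional to phi(x) over a tiny space.
states = [0, 1, 2]
phi = {0: 1.0, 1: 2.0, 2: 4.0}

def slice_sample(num_samples, x0=0):
    x, out = x0, []
    for _ in range(num_samples):
        u = random.uniform(0.0, phi[x])                 # sample u | x
        slice_set = [s for s in states if phi[s] >= u]  # the "slice"
        x = random.choice(slice_set)                    # sample x | u, uniform
        out.append(x)
    return out

random.seed(1)
samples = slice_sample(20000)
freq = {s: samples.count(s) / len(samples) for s in states}
Z = sum(phi.values())
# Marginalizing u out of the joint P(x, u) recovers P(x) = phi(x)/Z.
assert all(abs(freq[s] - phi[s] / Z) < 0.03 for s in states)
```

The two conditionals are trivial to sample; all the difficulty is hidden in constructing `slice_set` without enumeration.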
The question of whether state-of-the-art satisfiability procedures, based on random walk
strategies, can be used to sample uniformly or near-uniformly from the space of satisfying
assignments was first addressed in (Wei et al. 2004). It was shown that random walk SAT procedures
often do reach the full set of solutions of complex logical theories. Moreover, it was also shown
that, by interleaving random walk steps with Metropolis transitions, the sampling becomes
near-uniform. At near-zero temperature, simulated annealing samples solutions uniformly, but
will generally take too long to find them. WalkSAT finds solutions very fast, but samples them
highly non-uniformly. The SampleSAT algorithm samples solutions near-uniformly and highly
efficiently by, at each iteration, performing a WalkSAT step with probability p and a simulated
annealing step with probability 1 − p. The parameter p is used to trade off uniformity and
computational cost.
In the previous section we showed that IRoTS outperforms MaxWalkSAT in terms of
the quality of the solutions found and running times. In this section, the idea is to combine the
high-performing IRoTS algorithm with simulated annealing. The novel algorithm performs a
RoTS step with probability p and a simulated annealing step with probability 1 − p. We call this
algorithm SampleIRoTS, and the expectation is that it will be faster than
SampleSAT in the same way that IRoTS is faster than WalkSAT. The goal is to exploit
SampleIRoTS in an inference algorithm for Markov logic and compare it with the state-of-the-art
algorithm for this task.
7.2.2 The MC-IRoTS algorithm
The basic idea of how to use such a sampler in an inference algorithm was first proposed in
(Poon and Domingos 2006). The MC-IRoTS algorithm that we propose here applies slice
sampling to Markov logic by using SampleIRoTS to sample a new state given the auxiliary
variables. Algorithm 7.4 gives pseudo-code for MC-IRoTS and is similar to that proposed in
(Poon and Domingos 2006). In the following we describe how it works (for further reading see
(Poon and Domingos 2006)).
Algorithm 7.4 The MC-IRoTS algorithm

MC-IRoTS(clauses, numSamples)
x(0) ← Satisfy(hard clauses)
for i = 1 to numSamples do
    M ← ∅
    for all ck ∈ clauses satisfied by x(i−1) do
        With probability 1 − e−wk add ck to M
    end for
    Sample x(i) ∼ UnifSAT(M)
end for
In a ground MN, each ground clause ck corresponds to the potential function φk(x) =
exp(wk fk(x)). This function has value ewk if ck is satisfied, and 1 otherwise. The authors
in (Poon and Domingos 2006) introduced an auxiliary variable uk for each ck. In the ith
iteration of MC-IRoTS, if ck is not satisfied by the current state x(i−1), uk is drawn uniformly from
[0, 1]; thus uk ≤ 1 and uk ≤ ewk, and ck is not required to be satisfied in the next state. On the
other hand, if ck is satisfied, uk is drawn uniformly from [0, ewk], and with probability 1 − e−wk
it will be greater than 1, in which case the next state must satisfy ck. In this way, sampling
all the auxiliary variables determines a random subset M of the currently satisfied clauses that
must also be satisfied in the next state. The next state is taken as a uniform sample from the set
of states SAT(M) that satisfy M. (SAT(M) is never empty, because it always contains at least the
current state.) The initial state is found by applying the satisfiability solver IRoTS to the set of
all hard clauses in the network (i.e., all clauses with infinite weight). If this set is unsatisfiable,
the output of MC-IRoTS is undefined.
In Algorithm 7.4, UnifSAT(M) is the uniform distribution over the set SAT(M). At each
step of the algorithm, hard clauses are selected with probability 1, and thus all sampled states
Table 7.13: Inference running times for 1000 samples in the CORA domain
preds   clauses    MC-IRoTS  SampleIRoTS  MC-SAT   SampleSAT  Gain MC  Gain Sample
1849    77701        22.71      4.48        26.95     5.44      4.24      0.96
67081   1171457     444.91    133.65       435.10   114.95     -9.81    -18.70
59536   113798       68.41     27.55       104.82    58.64     36.41     31.09
59536   118828       64.63     34.97        93.69    51.10     29.06     16.13
1681    2724901    1291.30    187.33      1479.75   382.20    188.45    194.87
9409    912673      207.22     41.65       200.76    42.83     -6.46      1.18
71289   142311      363.59    313.39       385.50   321.46     21.91      8.07
69169   69169        50.91     28.69        56.72    32.47      5.81      3.78
59536   59536        42.86     23.12        45.53    25.74      2.67      2.62
1849    79507        28.93      6.69        29.96     7.47      1.03      0.78
3844    234546       72.71     15.54        82.76    18.92     10.05      3.38
1681    68921        24.22      5.03        25.28     6.15      1.06      1.12
8836    821842      275.11     51.58       294.25    57.11     19.14      5.53
6084    350298       66.56      8.40        70.10    10.83      3.54      2.43
1849    82216        28.16      8.19        33.81     9.84      5.65      1.65
71289   142311       94.61     46.28       112.59    53.70     17.98      7.42
59536   62051        43.31     23.41        49.03    27.28      5.72      3.87
3844    121086       34.65      5.76        35.92     6.20      1.27      0.44
9409    14065         5.86      3.37         6.51     3.66      0.65      0.29
11025   1157625     428.42     92.97       448.42   103.44     20.00     10.47
-       -           182.95     53.10       200.87    66.97     17.92     13.87
satisfy them. (For simplicity, we omit the case of negative weights. These are simply handled
by considering that a clause with a negative weight is equivalent to its negation with the same
weight but opposite sign, where a clause's negation is the conjunction of the negations of all of
its literals. Instead of checking whether the clause is satisfied, the algorithm checks whether
its negation is satisfied. If the clause is satisfied, all of its negated literals are selected with
probability 1 − ew, and with probability ew none is selected.)
As shown and proven in (Poon and Domingos 2006), this kind of algorithm generates a
Markov chain that satisfies ergodicity and detailed balance. Like their algorithm, MC-IRoTS
is guaranteed to be sound even in the presence of deterministic dependencies, whereas
MCMC methods such as Gibbs sampling and simulated tempering are not. Although,
in practice, perfectly uniform samples are too hard to obtain, MC-IRoTS uses
SampleIRoTS to obtain nearly uniform ones. Furthermore, the parameter p of SampleIRoTS can
be used to trade off speed and uniformity of sampling.
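The outer slice-sampling loop of Algorithm 7.4 can be sketched as follows. This is a toy sketch: for readability, the call to SampleIRoTS is replaced by exact uniform sampling over SAT(M) via enumeration, which is feasible only for tiny ground networks, and the clause encoding is illustrative.

```python
import itertools, math, random

def satisfies(x, lits):
    """True if assignment x satisfies the clause given as a list of
    (var, is_positive) literals."""
    return any(x[v] == pos for v, pos in lits)

def mc_irots(clauses, n_vars, num_samples):
    """Sketch of the MC-SAT/MC-IRoTS loop. clauses: list of (literals, weight);
    an infinite weight marks a hard clause."""
    space = list(itertools.product([False, True], repeat=n_vars))
    hard = [lits for lits, w in clauses if math.isinf(w)]
    # Initial state: any assignment satisfying all hard clauses.
    x = random.choice([s for s in space if all(satisfies(s, c) for c in hard)])
    samples = []
    for _ in range(num_samples):
        # Select each currently satisfied clause with probability 1 - e^{-w};
        # hard clauses (w = inf) are selected with probability 1.
        M = [lits for lits, w in clauses
             if satisfies(x, lits) and random.random() < 1.0 - math.exp(-w)]
        # Next state: uniform over SAT(M); never empty, since x itself is in it.
        sat_m = [s for s in space if all(satisfies(s, c) for c in M)]
        x = random.choice(sat_m)
        samples.append(x)
    return samples
```

On a toy network with one hard and one soft clause, the sample frequencies converge to the MLN probabilities, e.g. P(query) = e^w / (e^w + 1) for a single soft unit clause of weight w.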
7.2.3 Experiments
Through experimental evaluation we want to answer the following questions:
(Q1) Does the proposed algorithm MC-IRoTS improve over the state-of-the-art algorithm
in terms of running time?
(Q2) What is the performance of MC-IRoTS compared to the state-of-the-art algorithm in
terms of the quality of the query probabilities produced?
We implemented the algorithm as part of the MLN++ package (Section 8.3). In order to perform
inference with MC-IRoTS, we first need MLN models. For this reason, we use
the MLNs learned discriminatively with the algorithms proposed in the previous chapter. For
each model, we perform inference with a query predicate both with MC-SAT and MC-IRoTS.
For CORA, we perform inference with the query predicates sameBib, sameAuthor, sameVenue
and sameTitle.
The results are reported in Table 7.13, where both algorithms were run with 1000 samples
each. We generated different models in order to have different ratios of ground predicates and
ground clauses during inference. This helps better evaluate the inference algorithm over
a wide range of inference scenarios. As the results show, MC-IRoTS improves over MC-SAT
in terms of overall running time of the inference task. Results in terms of CLL and AUC are
reported in Table 7.14. It is clear that the quality of the predicted probabilities does not differ
between the algorithms. Thus, MC-IRoTS maintains the same inference accuracy while being
faster than MC-SAT.
In order to provide further experimental evidence of the superiority of MC-IRoTS over
MC-SAT, we also performed inference on the UW-CSE dataset by exploiting the
MLNs generated in the previous section for MAP inference. We first performed experiments
with MLNs generated for the non-evidence predicate advisedBy with 500 iterations of PSCG.
The accuracy results are reported in Table 7.15 and running times in Table 7.16.
As the results show, MC-IRoTS improves in terms of running time, while preserving almost
the same accuracy in terms of CLL and AUC.
We then took the MLNs generated in the previous section by running PSCG for 10 hours on
the training data and used these to perform conditional inference with advisedBy as the query
predicate. The results are reported in Tables 7.17 and 7.18. Again MC-IRoTS is faster
than MC-SAT, but this time it loses accuracy in terms of AUC in one of the folds of the
dataset. However, the difference is not significant.
Table 7.14: Accuracy results of inference for 1000 samples in the CORA domain
MC-IRoTS                       MC-SAT
CLL             AUC            CLL             AUC
-0.043±0.003    0.901          -0.043±0.003    0.901
-0.248±0.003    0.092          -0.247±0.003    0.094
-1.686±0.003    0.059          -1.714±0.003    0.059
-0.170±0.002    0.158          -0.146±0.001    0.156
-1.427±0.010    0.050          -1.447±0.010    0.055
-2.011±0.007    0.083          -1.990±0.007    0.090
-0.079±0.000    0.815          -0.079±0.000    0.813
-0.044±0.000    0.907          -0.044±0.000    0.907
-0.057±0.001    0.797          -0.057±0.001    0.799
-0.158±0.011    0.333          -0.154±0.011    0.348
-0.056±0.005    0.432          -0.057±0.005    0.434
-0.085±0.009    0.452          -0.085±0.009    0.447
-0.083±0.002    0.324          -0.084±0.002    0.319
-0.139±0.005    0.099          -0.137±0.005    0.116
-0.159±0.010    0.333          -0.162±0.010    0.343
-0.124±0.001    0.406          -0.124±0.001    0.410
-0.315±0.004    0.283          -0.315±0.004    0.283
-0.625±0.024    0.076          -0.651±0.024    0.069
-0.246±0.008    0.108          -0.242±0.008    0.110
-0.101±0.003    0.219          -0.100±0.003    0.228
-0.393±0.006    0.346          -0.394±0.006    0.349
We performed another experiment with the MLNs generated in the previous section by
running PSCG with both advisedBy and tempAdvisedBy as query predicates.
The results for advisedBy are reported in Tables 7.19 and 7.20. Again MC-IRoTS is faster
than MC-SAT, and this time it is also more accurate in terms of AUC. The same experiments
were performed by specifying tempAdvisedBy as the query predicate. Results are shown in Tables
7.21 and 7.22. For this predicate, MC-IRoTS is faster and much more accurate than MC-SAT.
We also performed inference with both predicates as query predicates. Results are reported in
Tables 7.23 and 7.24. As can be seen, running times are lower for MC-IRoTS and accuracy
is almost the same. Finally, we exploited the MLNs generated by adding taughtBy as a
non-evidence predicate. Results are reported in Tables 7.25 and 7.26, and again MC-IRoTS is faster
and preserves the same accuracy as MC-SAT.
Table 7.15: Accuracy results of inference for 1000 samples for the advisedBy predicate based on the MLNs generated with 500 iterations of PSCG

            MC-IRoTS              MC-SAT
fold        CLL           AUC     CLL           AUC
ai          -0.031±0.005  0.043   -0.033±0.005  0.008
graphics    -0.023±0.005  0.005   -0.023±0.005  0.005
language    -0.049±0.016  0.011   -0.049±0.016  0.011
systems     -0.026±0.005  0.074   -0.028±0.005  0.006
theory      -0.028±0.007  0.101   -0.029±0.007  0.007
average     -0.031±0.008  0.047   -0.032±0.008  0.007
Table 7.16: Inference running times (in seconds) for 1000 samples for the predicate advisedBy based on the MLNs generated with 500 iterations of PSCG

fold      preds  clauses  MC-IRoTS  SampleIRoTS  MC-SAT  SampleSAT  Gain MC  Gain Sample
ai        4760   185849    59.72     16.16        69.66   16.12       9.94     -0.04
graphics  3843   136392    45.28     12.29        53.35   11.60       8.07     -0.69
language  840    15762      3.50      1.21         3.63    1.22       0.13      0.01
systems   5328   218918    62.45     19.54        80.92   20.05      18.47      0.51
theory    2499   73600     24.93      6.72        28.68    6.58       3.75     -0.14
average   -      -         39.18     11.18        47.25   11.11       8.07     -0.07
7.3 Discriminative Parameter Learning
As previously introduced in Section 3.3, parameter learning for MNs and MLNs can be divided
into generative and discriminative approaches. Generative approaches optimize the joint probability
distribution of all the variables. In contrast, discriminative approaches maximize the conditional
likelihood of a set of outputs given a set of inputs (Lafferty et al. 2001), which often produces
better results for prediction problems. In this section, we will show how the MC-IRoTS
algorithm provides good samples for a discriminative weight learning algorithm for MLNs.
7.3.1 Optimizing Conditional Likelihood for Weight Learning
As described in Section 3.3, computing the expected counts Ew[ni(e,q)] in Equation 3.8 is
intractable. These can be approximated by the counts ni(e,q∗w) in the MAP state q∗w(x). Thus,
computing the gradient requires only MAP inference to find q∗w(x), which is much faster than
the full conditional inference needed to compute Ew[ni(e,q)]. To generalize this method to
arbitrary MLNs, it is necessary to develop a general-purpose algorithm for MAP inference in MLNs. From
Table 7.17: Accuracy results of inference for 1000 samples for the advisedBy predicate based on the MLNs generated by running PSCG for 10 hours on the training data.

            MC-IRoTS              MC-SAT
fold        CLL           AUC     CLL           AUC
ai          -0.033±0.005  0.008   -0.029±0.005  0.156
graphics    -0.023±0.005  0.005   -0.023±0.005  0.005
language    -0.049±0.016  0.011   -0.049±0.016  0.011
systems     -0.027±0.005  0.006   -0.027±0.005  0.006
theory      -0.029±0.007  0.007   -0.029±0.007  0.007
average     -0.032±0.008  0.007   -0.031±0.008  0.037
Table 7.18: Inference running times (in seconds) for 1000 samples for the predicate advisedBy based on the MLNs generated by running PSCG for 10 hours on the training data.

fold      preds  clauses  MC-IRoTS  SampleIRoTS  MC-SAT  SampleSAT  Gain MC  Gain Sample
ai        4760   185849    59.47     16.05        70.40   17.18      10.93      1.13
graphics  3843   136392    45.85     12.24        54.23   12.06       8.38     -0.18
language  840    15762      3.61      1.24         3.68    1.18       0.07     -0.06
systems   5328   218918    71.99     19.99        87.35   19.70      15.36     -0.29
theory    2499   73600     24.33      6.21        28.18    6.41       3.85      0.20
average   -      -         41.05     11.15        48.77   11.31       7.72      0.16
Equation 3.7 it can be seen that, since q∗w(x) is the state that maximizes the sum of the weights
of the satisfied ground clauses, it can be found using a MAX-SAT solver. The authors in (Singla
and Domingos 2005) replaced the Viterbi algorithm with the MaxWalkSAT solver (Kautz et al.
1997b). Given an MLN and a set of evidence atoms, the KB to be passed to MaxWalkSAT is
formed by constructing all groundings of clauses in the MLN involving query atoms, replacing
the evidence atoms in those groundings by their truth values, and simplifying.
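The resulting MAP-approximated gradient step can be sketched in a few lines. This is a minimal sketch: in the actual algorithm the MAP counts would come from a MAX-SAT solver such as MaxWalkSAT or IRoTS; here they are simply inputs, and the learning rate is illustrative.

```python
def perceptron_step(weights, true_counts, map_counts, lr=0.01):
    """One voted-perceptron-style weight update for an MLN.
    weights, true_counts, map_counts: lists indexed by clause i.
    The CLL gradient w.r.t. w_i is approximated by n_i(e,q) - n_i(e,q*_w),
    i.e. true clause counts minus counts in the MAP state."""
    return [w + lr * (nt - nm)
            for w, nt, nm in zip(weights, true_counts, map_counts)]
```

Clauses whose true counts exceed their counts in the MAP state get their weights increased, and vice versa; at a fixed point the MAP state reproduces the observed counts.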
MaxWalkSAT is not guaranteed to reach the global MAP state, unlike the Viterbi algorithm.
This can lead to errors in the weight estimates produced. The quality of the estimates can
be improved by running a Gibbs sampler starting at the state returned by MaxWalkSAT, and
averaging counts over the samples. If the Pw(q|e) distribution has more than one mode, doing
multiple runs of MaxWalkSAT followed by Gibbs sampling can be helpful. This approach is
followed in the algorithm in (Singla and Domingos 2005) which is essentially gradient descent.
Weight learning in MLNs represents a convex optimization problem, and gradient descent
is guaranteed to find the global optimum. However, convergence to this optimum may be too
slow. The sufficient statistics for MLNs are the number of true groundings of each clause. Since
Table 7.19: Accuracy results of inference for 1000 samples for the advisedBy predicate based on the MLNs generated by running PSCG with both advisedBy and tempAdvisedBy as non-evidence predicates

            MC-IRoTS              MC-SAT
fold        CLL           AUC     CLL           AUC
ai          -0.028±0.004  0.066   -0.027±0.004  0.102
graphics    -0.026±0.004  0.017   -0.024±0.004  0.004
language    -0.043±0.012  0.233   -0.037±0.011  0.221
systems     -0.026±0.004  0.067   -0.029±0.004  0.004
theory      -0.022±0.005  0.307   -0.025±0.005  0.204
average     -0.029±0.006  0.138   -0.028±0.006  0.107
Table 7.20: Inference running times (in seconds) for 1000 samples for the predicate advisedBy based on the MLNs generated by running PSCG with both advisedBy and tempAdvisedBy as non-evidence predicates

fold      preds  clauses  MC-IRoTS  SampleIRoTS  MC-SAT  SampleSAT  Gain MC  Gain Sample
ai        4760   185762    66.80     22.56        76.90   22.82      10.10      0.26
graphics  3843   136297    51.10     17.19        57.63   15.16       6.53     -2.03
language  840    15711      4.00      1.74         4.10    1.87       0.10      0.13
systems   5328   218820    78.58     26.56        89.50   25.88      10.92     -0.68
theory    2499   73540     27.99      9.29        31.19    8.88       3.20     -0.41
average   -      -         45.69     15.47        51.86   14.92       6.17     -0.55
this number can easily vary by orders of magnitude from one clause to another, a learning rate
that is small enough to avoid divergence in some weights may be too small for fast convergence
in others. This is an instance of the well-known problem of ill-conditioning in numerical
optimization, and many candidate solutions for it exist (Nocedal and Wright 1999). However,
most of these are not easily applicable to MLNs because of the nature of the function to be
optimized.
In (Lowd and Domingos 2007), another approach based on conjugate gradient (Shewchuk
1994) was proposed. Gradient descent can be sped up by performing a line search to find
the optimum along the chosen descent direction instead of taking a small step of constant size
at each iteration. This can be inefficient on ill-conditioned problems, since line searches along
successive directions tend to partly undo the effect of each other: each line search makes the
gradient along its direction zero, but the next line search will generally make it non-zero again.
This can be solved by imposing at each step the condition that the gradient along previous di-
Table 7.21: Accuracy results of inference for 1000 samples for the predicate tempAdvisedBy based on the MLNs generated by running PSCG with both advisedBy and tempAdvisedBy as non-evidence predicates

            MC-IRoTS              MC-SAT
fold        CLL           AUC     CLL           AUC
ai          -0.007±0.002  0.030   -0.007±0.002  0.032
graphics    -0.004±0.001  0.174   -0.006±0.002  0.042
language    -0.002±0.002  1.000   -0.004±0.004  0.008
systems     -0.008±0.002  0.008   -0.007±0.002  0.019
theory      -0.013±0.004  0.005   -0.014±0.005  0.003
average     -0.007±0.002  0.243   -0.008±0.003  0.021
Table 7.22: Inference running times (in seconds) for 1000 samples for the predicate tempAdvisedBy based on the MLNs generated by running PSCG with both advisedBy and tempAdvisedBy as non-evidence predicates

fold      preds  clauses  MC-IRoTS  SampleIRoTS  MC-SAT  SampleSAT  Gain MC  Gain Sample
ai        4760   185672    59.01     14.20        71.46   14.92      12.45      0.72
graphics  3843   136244    43.96     10.13        53.44   10.91       9.48      0.78
language  840    15706      3.26      0.97         3.42    1.21       0.16      0.24
systems   5328   218727    69.51     16.50        84.54   17.05      15.03      0.55
theory    2499   73513     23.90      5.47        28.95    6.28       5.05      0.81
average   -      -         39.93      9.45        48.36   10.07       8.43      0.62
rections remain zero. The directions chosen in this way are called conjugate, and the method
is called conjugate gradient. In (Lowd and Domingos 2007), the authors used the Polak-Ribiere
method for choosing conjugate directions, since it has generally been found to be the
best-performing one.
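A sketch of the Polak-Ribiere direction update follows. This is illustrative: in weight learning the gradient would be the CLL gradient ni − Ew[ni], the max(0, ·) restart is a standard practical safeguard rather than part of the basic formula, and the sketch assumes the previous gradient is nonzero.

```python
def polak_ribiere_direction(grad_new, grad_old, dir_old):
    """New (ascent) search direction: the fresh gradient plus a multiple of
    the previous direction, with the Polak-Ribiere coefficient
    beta = g_new . (g_new - g_old) / (g_old . g_old), clipped at 0 (restart).
    Assumes grad_old is nonzero."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    diff = [g - h for g, h in zip(grad_new, grad_old)]
    beta = max(0.0, dot(grad_new, diff) / dot(grad_old, grad_old))
    return [g + beta * d for g, d in zip(grad_new, dir_old)]
```

When successive gradients are identical, beta is 0 and the method reduces to plain gradient ascent; otherwise the new direction keeps the gradient along previous directions near zero.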
7.3.2 Learning MLNs Weights by Sampling with MC-IRoTS
Conjugate gradient methods are among the most efficient optimization methods, on a par with
quasi-Newton methods. Unfortunately, as the authors point out in (Lowd and Domingos 2007), applying them to
MLNs is difficult, because line searches require computing the objective function, and therefore
the partition function Z, which is intractable. Fortunately, the Hessian (matrix of second-order
partial derivatives) can be used instead of a line search to choose a step size. This method is
known as scaled conjugate gradient (SCG), and was proposed in (Moller 1993) for training
neural networks. In (Lowd and Domingos 2007), a step size was chosen by using the Hessian
Table 7.23: Accuracy results of inference for 1000 samples with both query predicates advisedBy and tempAdvisedBy in a single inference task

            MC-IRoTS              MC-SAT
fold        CLL           AUC     CLL           AUC
ai          -0.020±0.003  0.028   -0.019±0.003  0.004
graphics    -0.019±0.003  0.003   -0.019±0.003  0.002
language    -0.029±0.007  0.004   -0.029±0.008  0.004
systems     -0.018±0.002  0.027   -0.019±0.002  0.003
theory      -0.020±0.003  0.005   -0.023±0.004  0.004
average     -0.021±0.004  0.013   -0.022±0.004  0.004
Table 7.24: Inference running times (in seconds) for 1000 samples with both query predicates advisedBy and tempAdvisedBy in a single inference task

fold      preds  clauses  MC-IRoTS  SampleIRoTS  MC-SAT  SampleSAT  Gain MC  Gain Sample
ai        9384   680351    210.58    52.88       243.07   56.71      32.49      3.83
graphics  7564   495227    153.00    40.31       183.70   46.34      30.70      6.03
language  1624   52491      22.34    11.04        17.56    4.44      -4.78     -6.60
systems   10512  804425    305.73   117.17       286.46   65.43     -19.27    -51.74
theory    4900   261890     83.72    22.24        97.61   22.06      13.89     -0.18
average   -      -         155.07    48.73       165.68   39.00      10.61     -9.73
similar to a diagonal Newton method. Conjugate gradient methods are often more effective
with a preconditioner, a linear transformation that attempts to reduce the condition number of
the problem (Sha and Pereira 2003). Good preconditioners approximate the inverse Hessian. In
(Lowd and Domingos 2007), the authors used the inverse diagonal Hessian as preconditioner
and called the SCG algorithm Preconditioned SCG (PSCG). PSCG was shown to outperform
the voted perceptron algorithm of (Singla and Domingos 2005) on two real-world domains both
for CLL and AUC. For the same learning time, PSCG learned much more accurate models.
However, to compute the Hessian, the MPE approximation is no longer sufficient. The
authors in (Lowd and Domingos 2007) address this problem by computing expected counts
using MC-SAT. When optimizing quadratic functions, Newton's method can move to the global
minimum or maximum in a single step. It does this by multiplying the gradient, g, by the
inverse Hessian, H−1, giving wt+1 = wt − H−1g. For hundreds or thousands of weights, the
use of the full Hessian becomes infeasible. A good approximation is to use the diagonal New-
ton (DN) method, which uses the inverse of the diagonalized Hessian in place of the inverse
Hessian. DN typically uses a smaller step size than the full Newton method. This is impor-
Table 7.25: Accuracy results of inference for 1000 samples with query predicates taughtBy, advisedBy and tempAdvisedBy in a single inference task

            MC-IRoTS              MC-SAT
fold        CLL           AUC     CLL           AUC
ai          -0.053±0.003  0.007   -0.047±0.002  0.009
graphics    -0.012±0.001  0.001   -0.012±0.001  0.001
language    -0.022±0.004  0.005   -0.022±0.004  0.005
systems     -0.017±0.001  0.003   -0.017±0.001  0.007
theory      -0.017±0.002  0.043   -0.020±0.002  0.003
average     -0.024±0.002  0.012   -0.024±0.002  0.005
Table 7.26: Inference running times (in seconds) for 1000 samples with the query predicates taughtBy, advisedBy and tempAdvisedBy in a single inference task

fold      preds  clauses   MC-IRoTS  SampleIRoTS  MC-SAT  SampleSAT  Gain MC  Gain Sample
ai        23664  1894428    559.43   142.31       672.84  141.64     113.41     -0.67
graphics  23485  1510794    434.36   109.84       556.14  123.06     121.78     13.22
language  5152   157944      44.08    10.86        56.63   12.93      12.55      2.07
systems   26136  2045461    604.50   148.70       744.24  158.96     139.74     10.26
theory    14504  794484     228.35    57.65       289.32  123.62      60.97     65.97
average   -      -          374.14    93.87       463.83  112.04      89.69     18.17
tant when applying the algorithm to non-quadratic functions, such as MLN conditional log
likelihood, where the quadratic approximation is only good within a local region.
Since the Hessian for an MLN is simply the negative covariance matrix:
∂²/(∂wi ∂wj) log P(Y = y|X = x) = Ew[ni]Ew[nj] − Ew[ninj]    (7.1)
similarly to the gradient, we can approximate this using samples from MC-IRoTS. The authors
in (Lowd and Domingos 2007) used MC-SAT to achieve this. In each iteration they took
a step in the diagonalized Newton direction:
wi = wi + α (ni − Ew[ni]) / (Ew[ni²] − (Ew[ni])²)    (7.2)
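Given sampled clause counts from MC-IRoTS, this step can be sketched in a few lines. This is an illustrative sketch: the ascent sign follows the CLL gradient ni − Ew[ni], the sampled counts would come from the inference algorithm rather than being passed in, and eps guards against zero sample variance.

```python
def diagonal_newton_step(weights, true_counts, sampled_counts, alpha=1.0, eps=1e-8):
    """One diagonalized Newton step for MLN weights.
    sampled_counts[s][i] = count n_i of clause i in the s-th sample.
    The mean estimates E_w[n_i]; the variance estimates the (negated)
    diagonal Hessian entry E_w[n_i^2] - (E_w[n_i])^2."""
    n, m = len(weights), len(sampled_counts)
    mean = [sum(s[i] for s in sampled_counts) / m for i in range(n)]
    var = [sum((s[i] - mean[i]) ** 2 for s in sampled_counts) / m for i in range(n)]
    # w_i <- w_i + alpha * (n_i - E_w[n_i]) / Var_w[n_i]
    return [w + alpha * (t - mu) / (v + eps)
            for w, t, mu, v in zip(weights, true_counts, mean, var)]
```

Clauses whose counts vary little across samples (small variance) receive large, confident steps, which is exactly the preconditioning effect discussed above.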
Since in the previous section we showed that MC-IRoTS runs faster than MC-SAT over
a wide range of inference scenarios while maintaining almost the same accuracy in the
probabilities produced, we expect MC-IRoTS to be a good sampler for estimating the sufficient statistics
needed in Equation 7.1. We will show in the next section, through experiments in the webpage
Table 7.27: Accuracy results for classifying webpages of students
fold          CLL            AUC
Wisconsin     -0.121±0.013   0.674
Washington    -0.164±0.020   0.601
Cornell       -0.175±0.014   0.551
Texas         -0.139±0.015   0.623
average       -0.150±0.015   0.612
classification domain, that MC-IRoTS provides good samples for the discriminative weight
learning algorithm of (Lowd and Domingos 2007).
7.3.3 Experiments on Web Page Classification
In this section we want to answer the following question:
(Q1) Does MC-IRoTS produce good samples to be used in a discriminative weight learning
algorithm for MLNs?
In order to perform experiments we need to learn MLNs from data with PSCG by sampling
with MC-IRoTS and see if the learned models are accurate. We decided to perform experiments
in the webpage classification domain and chose the WebKB dataset (Craven et al. 1998). The
relational version of the dataset that we use is that of (Craven and Slattery 2001), the same used
by (Lowd and Domingos 2007). This dataset contains labeled webpages from the Department
of Computer Science of four universities: Texas, Cornell, Washington and Wisconsin. The
relational version consists of 4165 webpages and 10,935 web links, together with the words on
the webpages, the anchors of the links and the neighborhoods of each link. Each webpage in the
dataset is labeled as being a page of a student, faculty, course or research project. The goal is
to predict the class for each page based on the words of that page and the links that the
page has with other pages. The databases that we used from WebKB are the following, for each
department of computer science:
Database common. Defines the relation "LinkTo", which specifies hyperlink connections.
Moreover, it contains boolean predicates characterizing the anchor text of hyperlinks,
"AllWordsCapitalized" and "HasAlphanumericWord".
Database of page-words. Contains a bag-of-words representation of the words that occur
in the webpages. Each predicate in these files specifies a stemmed word, and the instances of
the predicate are those pages that contain the word.
Table 7.28: Accuracy results for classifying webpages of faculty members
fold          CLL            AUC
Wisconsin     -0.053±0.009   0.325
Washington    -0.047±0.010   0.344
Cornell       -0.046±0.009   0.625
Texas         -0.057±0.011   0.720
average       -0.051±0.010   0.504
Table 7.29: Accuracy results for classifying webpages of research projects
fold          CLL            AUC
Wisconsin     -0.036±0.007   0.196
Washington    -0.041±0.009   0.208
Cornell       -0.072±0.009   0.071
Texas         -0.048±0.010   0.045
average       -0.049±0.009   0.130
Database of anchor-words. Contains the words that occur in the anchor text of hyperlinks.
Neighborhood-words. Contains the words that occur in the “neighboring” text of hyper-
links. The neighborhood of a hyperlink includes words in a single paragraph, list item, table
entry, title or heading in which the hyperlink is contained.
We hand-coded a very simple MLN for this problem:
Has(+w1, p1)⇒ PageClass(p1)
¬Has(+w1, p1)⇒ PageClass(p1)
Has(+w1, p1)∧HasAnchor(+w1, lnkid)⇒ PageClass(p1)
¬Has(+w1, p1)∧HasAnchor(+w1, lnkid)⇒ PageClass(p1)
PageClass(p1)∧LinkTo(+lnkid, p1, p2)⇒ PageClass(p2)
"Has" is the predicate expressing that the word "w1" is contained in the page "p1", while
"HasAnchor" relates a word with its anchor. The last rule states the relationship between pages
of class p1 and p2 linked by hyperlink "lnkid". The sign + means that a separate weight is
learned for each ground word and hyperlink. When instantiated, the model contained nearly
10,000 weights, representing a very complex non-i.i.d. probability distribution where query
predicates are linked together in a huge graph. We used the following parameters for
MC-IRoTS: d = 1, k = 3. For PSCG, we ran 100 iterations of the algorithm with 100
samples of MC-IRoTS for each inference run. We performed leave-one-area-out for each class
Table 7.30: Accuracy results for classifying webpages of courses
fold          CLL            AUC
Wisconsin     -0.059±0.006   0.633
Washington    -0.254±0.028   0.039
Cornell       -0.097±0.012   0.232
Texas         -0.058±0.009   0.434
average       -0.117±0.014   0.335
Table 7.31: Overall accuracy results for web page classification in the WebKB domain
Class         CLL            AUC
Student       -0.150±0.015   0.612
Research      -0.049±0.010   0.130
Faculty       -0.051±0.010   0.504
Course        -0.117±0.014   0.335
average       -0.092±0.012   0.395
of webpages, learning an MLN for each department. After learning the models, we performed
inference on the left-out area, again using MC-IRoTS with 1000 samples. Tables 7.27, 7.28,
7.29, 7.30 and 7.31 present the results in terms of CLL and AUC for all the classes of the domain.
Each table contains the results for each area of the dataset and the overall accuracy. As can
be seen, the CLL results are very accurate, while the AUC results are competitive. This shows that
samples from MC-IRoTS provide PSCG with good estimates of the sufficient statistics. In only one case
were the AUC results not high: for the research project webpages, AUC is quite low, but
on the other hand CLL was the best among the four classes, with an excellent result of -0.049.
For the Courses class, the AUC results were very good for three areas, and in only one area
(Washington) was AUC very low. This affected the overall result for the class.
Overall, the results obtained by using MC-IRoTS as sampler in PSCG answer question
(Q1) and confirm that MC-IRoTS is a good algorithm for inference in statistical relational
domains. In the previous section it was shown that MC-IRoTS was faster as an inference
algorithm, while in this section we showed that it is also useful to produce good samples for
a weight learning algorithm. This implies a double use of this algorithm: for inference and
weight learning in statistical relational domains.
7. THE IROTS AND MC-IROTS ALGORITHMS
7.4 Summary
Inference is the process of responding to queries once the model has been learned. Efficient and
effective inference is important to evaluate and compare the learned models. On the other hand,
inference is often a subroutine when learning statistical models of relational domains. These
models often contain hundreds of thousands of variables or more, making efficient inference
crucial to their learnability. Moreover, in on-line learning and inference, often used by agents,
decisions are based on the output of the inference process, thus fast and accurate algorithms
are strongly needed for this task. In this chapter we introduced two high-performing algorithms
for MAP and conditional inference in Markov Logic, based on the Iterated Local Search and
Tabu Search metaheuristics. The first algorithm, IRoTS, performs a biased sampling of the
set of local optima by using Tabu Search as a local search procedure and repetitively jumping
in the search space through a perturbation operator. Extensive experiments on real-world
data show that IRoTS outperforms the state-of-the-art algorithm for MAP inference in Markov
Logic. The second algorithm, MC-IRoTS, combines IRoTS with Markov Chain Monte Carlo
by interleaving RoTS steps with Metropolis transitions in an iterated local search. Experiments
on real-world domains show that MC-IRoTS is faster than the state-of-the-art algorithm for
conditional inference in Markov Logic while maintaining the same quality of the probabilities
produced. Finally, we used MC-IRoTS as a sampler to approximate the sufficient statistics in a
state-of-the-art discriminative parameter learning algorithm for MLNs and showed through
experiments in the webpage classification domain that MC-IRoTS produces good samples to be
used during learning.
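The iterated-local-search-with-tabu scheme underlying IRoTS can be sketched on a toy weighted MAX-SAT instance (a minimal illustrative sketch, not the dissertation's implementation; the clause encoding, tabu tenure, and perturbation strength are placeholder choices):

```python
import random

def unsat_weight(clauses, assign):
    # Total weight of clauses left unsatisfied by the assignment.
    # A clause is (weight, literals); literal +v / -v means variable v
    # (1-indexed) must be True / False.
    total = 0.0
    for weight, lits in clauses:
        if not any((assign[l - 1] if l > 0 else not assign[-l - 1]) for l in lits):
            total += weight
    return total

def tabu_search(clauses, assign, steps=200, tenure=5):
    # One-flip local search with a tabu list and an aspiration criterion:
    # a tabu flip is still admissible if it improves on the best cost seen.
    n = len(assign)
    tabu_until = [0] * n
    best, best_cost = assign[:], unsat_weight(clauses, assign)
    for t in range(1, steps + 1):
        moves = []
        for v in range(n):
            assign[v] = not assign[v]
            cost = unsat_weight(clauses, assign)
            assign[v] = not assign[v]
            if tabu_until[v] <= t or cost < best_cost:
                moves.append((cost, v))
        if not moves:
            continue
        cost, v = min(moves)            # best admissible flip
        assign[v] = not assign[v]
        tabu_until[v] = t + tenure
        if cost < best_cost:
            best, best_cost = assign[:], cost
    return best, best_cost

def iterated_local_search(clauses, n_vars, restarts=20, strength=2, seed=0):
    # ILS: descend to a local optimum, perturb it with a random k-flip,
    # descend again, and accept the new optimum if it is at least as good.
    rng = random.Random(seed)
    cur = [rng.random() < 0.5 for _ in range(n_vars)]
    cur, cur_cost = tabu_search(clauses, cur)
    best, best_cost = cur[:], cur_cost
    for _ in range(restarts):
        cand = cur[:]
        for v in rng.sample(range(n_vars), min(strength, n_vars)):
            cand[v] = not cand[v]       # perturbation step
        cand, cost = tabu_search(clauses, cand)
        if cost <= cur_cost:            # acceptance criterion
            cur, cur_cost = cand, cost
        if cost < best_cost:
            best, best_cost = cand[:], cost
    return best, best_cost
```

For example, on the instance `[(1.0, [1, 2]), (2.0, [-1, 3]), (1.5, [-2, -3])]` the search finds an assignment satisfying all three weighted clauses.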
Future work regards the application of both IRoTS and MC-IRoTS to other problems in
complex SRL domains and the adaptation of lazy techniques such as those presented in (Poon
et al. 2008).
Chapter 8
Conclusion
8.1 Contributions of this Dissertation
This dissertation presented novel algorithms for Markov Logic Networks by addressing the
problems of learning and inference of these models. Its contributions are:
• A novel and powerful algorithm for generative structure learning of Markov Logic Net-
works. The GSL algorithm for generative structure learning of Markov Logic Networks
exploits the iterated local search metaheuristic guided by pseudo-likelihood. The algo-
rithm performs a biased sampling of the set of local optima focusing the search not on
the full space of solutions but on a smaller subspace defined by the solutions that are lo-
cally optimal for the optimization engine. It employs a strong perturbation operator and
an iterative improvement local search procedure in order to balance diversification (randomness
induced by strong perturbation to avoid search stagnation) and intensification
(greedily increasing solution quality by exploiting the evaluation function). Experimental
evaluation on two benchmark datasets, concerning Link Analysis in
Social Networks and Entity Resolution in citation databases, shows that GSL achieves
improvements over the state-of-the-art algorithms for generative structure learning of
Markov Logic Networks.
• The first algorithm for discriminative structure learning of Markov Logic Networks. The
ILS-DSL algorithm learns discriminatively first-order clauses and their weights. The al-
gorithm scores the candidate structures by maximizing conditional likelihood or area
under the Precision-Recall curve while setting the parameters by maximum pseudo-
likelihood. ILS-DSL is based on the Iterated Local Search metaheuristic. To speed up
learning we propose some simple heuristics that greatly reduce the computational effort
for scoring structures. Empirical evaluation with real-world data in two domains shows
the promise of our approach, which improves over the state-of-the-art discriminative weight
learning algorithm for MLNs in terms of conditional log-likelihood of the query predicates
given evidence. We have also compared the proposed algorithm with the state-of-the-art
generative structure learning algorithm and shown that on small datasets the
generative approach is competitive, while on larger datasets the discriminative approach
outperforms the generative one.
• A powerful algorithm based on randomized beam search for discriminative structure
learning. The RBS-DSL algorithm learns discriminatively first-order clauses and their
weights. The algorithm scores the candidate structures by maximizing conditional likeli-
hood or area under the Precision-Recall curve while setting the parameters by maximum
pseudo-likelihood. RBS-DSL is inspired by the Greedy Randomized Adaptive Search
Procedure (GRASP) metaheuristic and performs randomized beam search: it scores the
structures through maximum likelihood in a first phase, and then uses maximum CLL or
AUC of the PR curve in a second step to randomly generate a beam of the best clauses
to add to the current MLN structure. To speed up learning we propose some simple
heuristics that greatly reduce the computational effort for scoring structures. Empirical
evaluation with real-world data in two domains shows the promise of our approach,
which improves over the state-of-the-art discriminative weight learning algorithm for MLNs in
terms of conditional log-likelihood of the query predicates given evidence. We have also
compared the proposed algorithm with the state-of-the-art generative structure learning
algorithm and shown that on small datasets the generative approach is competitive, while
on larger datasets the discriminative approach outperforms the generative one.
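The GRASP-style randomized construction at the heart of this scheme can be illustrated generically (a minimal sketch, not the actual RBS-DSL implementation; the candidate scores, `alpha`, and the restricted-candidate-list rule are illustrative choices):

```python
import random

def grasp_beam(candidates, score, beam_size, alpha=0.3, seed=0):
    # GRASP-style randomized beam construction: rather than greedily taking
    # the top-scoring candidates, repeatedly draw at random from a restricted
    # candidate list (RCL) of near-best candidates, trading a little
    # greediness for diversification across restarts.
    rng = random.Random(seed)
    pool = list(candidates)
    beam = []
    while pool and len(beam) < beam_size:
        pool.sort(key=score, reverse=True)
        s_max, s_min = score(pool[0]), score(pool[-1])
        cutoff = s_max - alpha * (s_max - s_min)   # RCL admission threshold
        rcl = [c for c in pool if score(c) >= cutoff]
        pick = rng.choice(rcl)
        beam.append(pick)
        pool.remove(pick)
    return beam
```

With `alpha = 0` this degenerates to plain greedy beam selection; with `alpha = 1` every remaining candidate is eligible at each draw.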
• A novel and powerful algorithm for MAP inference in Markov Logic. The IRoTS al-
gorithm based on the Iterated Local Search and Tabu Search metaheuristics, performs
a biased sampling of the set of local optima by using Tabu Search as a local search
procedure and repetitively jumping in the search space through a perturbation operator.
Extensive experiments on real-world data show that IRoTS outperforms the state-of-the-
art algorithm for MAP inference in Markov Logic.
• A novel and powerful algorithm for conditional inference in Markov Logic. The MC-
IRoTS algorithm combines IRoTS with Markov Chain Monte Carlo by interleaving
RoTS steps with Metropolis transitions in an iterated local search. Experiments on real-
world domains show that MC-IRoTS is faster than the state-of-the-art algorithm for con-
ditional inference in Markov Logic while maintaining the same quality of probabilities
produced. Finally, MC-IRoTS was used as a sampler to approximate the sufficient statis-
tics in a state-of-the-art discriminative parameter learning algorithm for MLNs and it
was shown through experiments in the webpage classification domain that MC-IRoTS
produces good samples to be used during learning.
8.2 Directions for Future Research
Any research effort can become the beginning of exciting new research, or even of entirely novel
research areas. This dissertation aimed at investigating the integration of logic and probability
in the context of Markov Logic Networks and this section describes the future directions that
might be followed.
• Parallel computing for models that integrate logic and probability. The GSL algorithm
is a simple example of how parallel computing can help learn better SRL models. Imple-
menting more sophisticated parallel models such as MPI (Message Passing Interface) or
PVM (Parallel Virtual Machine) could boost performance in learning complex models
such as Markov Logic Networks. In the era of multi-core computing, running parallel
threads of an algorithm has become easier and easier. Moreover, algorithms such as
those proposed in this dissertation based on ILS or GRASP could be easily parallelized
due to the independent nature of the iterations that could be assigned to separate threads.
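The independent-restart parallelism mentioned above can be sketched as follows (a minimal illustration; `run_once` is a placeholder for one seeded ILS or GRASP restart returning a (cost, solution) pair):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_restarts(run_once, n_runs, workers=4, executor_cls=ThreadPoolExecutor):
    # Independent ILS/GRASP restarts are embarrassingly parallel: run each
    # restart (identified by its own seed) concurrently and keep the best
    # (cost, solution) pair. For CPU-bound search in Python, substitute
    # ProcessPoolExecutor to sidestep the GIL; MPI or PVM would play the
    # same role across machines.
    with executor_cls(max_workers=workers) as pool:
        results = list(pool.map(run_once, range(n_runs)))
    return min(results)
```

Because restarts share no state, speed-up is close to linear in the number of workers up to the number of restarts.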
• Search space structure analysis for Markov Logic Networks and other SRL models in
general. The performance of SLS algorithms strongly depends on the structural aspects
of the search space. To the best of the author's knowledge, no theoretical or empirical
analysis of the search space for MLNs exists (nor for other SRL models). Understanding
the properties of such spaces could greatly improve our ability to use SLS algorithms
to learn MLNs (or SRL) models. These properties include fundamental features of the
search space such as size, connectivity, diameter and solution density as well as global
and local properties of the search landscapes.
• Multiobjective optimization for Markov Logic Networks and other SRL models in gen-
eral. A learned MLN should perform well not only in terms of conditional likelihood but
also in terms of the area under the precision-recall curve. Many algorithms
optimize only one of these measures during structure search, often giving poor results for the
other. It is interesting to investigate how multiobjective optimization tech-
niques can be applied to learning MLNs and SRL models in general.
• Analysis of the relationship between different evaluation functions. Pseudo-likelihood is
a good measure when learning probabilistic models due to its efficiency, but gives poor
results when long chains of inference are required at query time. Conditional likelihood
would be the perfect measure to optimize during search, but it is intractable. Thus it is
interesting to further investigate the relationship between these two measures and understand
whether a structure that is good in terms of pseudo-likelihood is also good in
terms of conditional likelihood.
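For reference, the two measures can be written explicitly (standard definitions from the MLN literature; here $x$ ranges over the $n$ ground atoms, $MB_x(X_l)$ denotes the state of the Markov blanket of atom $X_l$ in the data, and $Y$, $X$ are the query and evidence atoms):

```latex
% Pseudo-log-likelihood: each ground atom conditioned on its Markov blanket
\log P^{\ast}_w(X = x) \;=\; \sum_{l=1}^{n} \log P_w\!\left(X_l = x_l \mid MB_x(X_l)\right)

% Conditional log-likelihood of the query atoms given the evidence
\mathrm{CLL}(y \mid x) \;=\; \log P_w(Y = y \mid X = x)
```

Pseudo-likelihood avoids inference entirely (each factor is local), which is why it is efficient but blind to long chains of inference; the CLL factors cannot be decomposed this way.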
• Use of ILP techniques to restrict the space of structures. Search in ILP is restricted by
refinement operators, which direct the search of the lattice, exploring the candidates of a
certain structure according to a generality ordering. In Markov Logic this has not been
attempted yet, and both current algorithms and the algorithms proposed in this dissertation
blindly generate all the potential candidates of a certain structure, leading to a huge space
of structures. A generality ordering in Markov Logic is not easy to define, but further
investigation in this direction would be valuable in order to achieve major breakthroughs in the field.
• Efficient computation of clause true counts. The main bottleneck of learning MLN
structures is the computation of the number of true groundings of a clause. High-performing
SAT solvers such as IRoTS could be used to efficiently sample the satisfying
solutions of a clause and then count the number of its true groundings. Such approaches
represent the state of the art in model counting (Gomes et al. 2007; Wei and Selman 2005).
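In its simplest form, such a sampling-based count can be illustrated with uniform sampling of substitutions rather than SAT-solution sampling (an illustrative sketch; the clause encoding and the `world` representation as a set of true ground atoms are assumptions made for the example):

```python
import random

def estimate_true_groundings(clause, variables, constants, world,
                             n_samples=20000, seed=0):
    # Monte Carlo estimate of the number of true groundings of a clause:
    # draw substitutions uniformly at random, test the grounded clause in the
    # given world (a set of true ground atoms), and scale the satisfied
    # fraction by the total number of groundings. Each literal is a triple
    # (negated, predicate, variable_args).
    rng = random.Random(seed)
    total = len(constants) ** len(variables)
    hits = 0
    for _ in range(n_samples):
        theta = {v: rng.choice(constants) for v in variables}
        hits += any(
            ((pred, tuple(theta[a] for a in args)) in world) != negated
            for negated, pred, args in clause
        )
    return total * hits / n_samples
```

For instance, the unit clause Smokes(x) with an extra unused variable y over constants {A, B}, in a world where only Smokes(A) holds, has exactly 2 true groundings out of 4, and the estimate converges to that value.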
• Piecewise training of Markov Logic Networks. An appealing idea for undirected models
is to independently train a local undirected classifier over each clique and then combine
the learned weights into a single global model. Piecewise training or piecewise
pseudolikelihood has been shown to be more accurate than standard pseudolikelihood
(Sutton and McCallum 2005b, 2007).
8.3 Summary
Integrating logic and probability has a long history in Artificial Intelligence and Machine Learning.
This dissertation took up the challenge of exploring and developing high-performing
algorithms for a state-of-the-art model that integrates first-order logic and probability. However,
much remains to be done before AI systems reach human intelligence. A powerful
language for pursuing this goal is Markov Logic, which embodies the experience and successes of
various subfields of AI and Statistics. It makes it possible to express complexity and uncertainty, just as
humans do in complex environments. Moreover, complex models that reflect real-world
phenomena can be learned efficiently from examples, and powerful inference algorithms can
be used to answer queries about the world. This dissertation made an effort to build powerful
algorithms for these two tasks. It is thus hoped that this dissertation constitutes another
step in our attempt to better understand and build intelligent systems.
Appendix A
The MLN++ Package
The MLN++ package is a suite of algorithms built upon Alchemy (Kok et al. 2005). Alchemy can be
seen as a declarative programming language related to Prolog. Prolog has long played
an important role in Artificial Intelligence and Machine Learning; in particular, most current
state-of-the-art Inductive Logic Programming and Statistical Relational Learning systems are
written in this language. For Alchemy, however, the underlying inference mechanism is model
checking instead of theorem proving; the full syntax of first-order logic is allowed, rather than
just Horn clauses; and the ability to handle uncertainty and learn from data is already built in.
MLN++ can be seen as the analog, built upon Alchemy, of the ILP and SRL systems built upon
Prolog. MLN++ includes the algorithms GSL, ILS-DSL, RBS-DSL, IRoTS and MC-IRoTS.
It also includes LearnParams, which is a version of PSCG that works by sampling with MC-
IRoTS.
In this appendix we present, for each algorithm of MLN++, its parameters and how it is
used. Most of the parameters are shared with Alchemy, and we describe here only the
parameters specific to each algorithm. For the standard parameters of Alchemy please refer
to (Kok et al. 2005).
GSL. The GSL algorithm has the following parameters in addition to, or differing from, those of Alchemy:
bestGainUnchangedLimit. Number of iterations without improvement for iterated local search.
minGain. Minimum gain of a candidate structure to be accepted as the new best structure.
ILS-DSL. The ILS-DSL algorithm has the following parameters in addition to, or differing from, those of Alchemy:
bestGainUnchangedLimit. Number of iterations without improvement for iterated local search.
queryPredicate. The query predicate for which the discriminative model should be learned.
RBS-DSL. The RBS-DSL algorithm has the following parameters in addition to, or differing from, those of Alchemy:
beamSize. The size of beam to consider in the randomized construction of the beam of clauses.
numClausesReEval. Maximum number of clauses to be considered for scoring in terms of CLL
or AUC.
bestGainUnchangedLimit. Number of iterations without improvement for randomized beam
search.
queryPredicate. The query predicate for which the discriminative model should be learned.
IRoTS. The IRoTS algorithm has the following parameters in addition to, or differing from, those of Alchemy:
iterations. Number of iterations without improvement for iterated robust tabu search.
threshold. The threshold ratio for iterated robust tabu search.
MC-IRoTS. The MC-IRoTS algorithm has the following parameters in addition to, or differing from, those of Alchemy:
iterations. Number of iterations without improvement for iterated robust tabu search.
threshold. The threshold ratio for iterated robust tabu search.
LearnParams. The LearnParams algorithm needs the parameters necessary for MC-IRoTS:
iterations. Number of iterations without improvement for iterated robust tabu search.
threshold. The threshold ratio for iterated robust tabu search.
References
A. Deshpande, M.N. Garofalakis, and M.I. Jordan. Efficient stepwise selection in decomposable
models. In Proc. UAI, pages 128–135, 2001. 15
C. Anderson, P. Domingos, and D. Weld. Relational markov models and their application to
adaptive web navigation. In Proc. of 8th ACM SIGKDD Int’l Conf. on Knowledge Discovery
and Data Mining, pages 143–152. Edmonton, Canada: ACM Press, 2002. 27
F. Bacchus. Representing and Reasoning with Probabilistic Knowledge. Cambridge, MA: MIT
Press, 1990. 1
F. Bach and M. Jordan. Thin junction trees. In NIPS 14, 2002. 15
R. Battiti and M. Protasi. Reactive search, a history-based heuristic for max-sat. ACM Journal
of Experimental Algorithmics, (2), 1997. 105
J. Besag. Statistical analysis of non-lattice data. Statistician, 24:179–195, 1975. 2, 36
I. Bhattacharya and L. Getoor. Iterative record linkage for cleaning and integration. In Proc.
SIGMOD-04 DMKD Workshop, 2004. 76
M. Biba, S. Ferilli, and F. Esposito. Structure learning of markov logic networks through
iterated local search. In Frontiers in Artificial Intelligence and Applications, Proceedings of
18th European Conference on Artificial Intelligence (ECAI)., volume 178, pages 361–365,
2008. 53
M. Bilenko and R. Mooney. Adaptive duplicate detection using learnable string similarity
measures. In Proc. KDD-03, pages 39–48, 2003. 56, 77
C.M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006. 10
W. L. Buntine. Operations for learning with graphical models. J. AI Research, 2:159–225,
1994. 15
M. Collins. Discriminative training methods for hidden markov models: Theory and experi-
ments with perceptron algorithms. In Proc. of the 2002 Conference on Empirical Methods
in Natural Language Processing. Philadelphia, PA: ACL, 2002. 4, 42
V. Santos Costa, D. Page, and J. Cussens. Clp(bn): Constraint logic programming for proba-
bilistic knowledge. In Probabilistic Inductive Logic Programming, volume LNCS 4911,
pages 156–188. Springer, 2008. 28
R. G. Cowell, A. P. Dawid, S. L. Lauritzen, and D. J. Spiegelhalter. Probabilistic Networks and
Expert Systems. Springer-Verlag, 1999. 9, 10
M. Craven and S. Slattery. Relational learning with statistical predicate invention: Better mod-
els for hypertext. Machine Learning, 43:97–119, 2001. 131
M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery.
Learning to extract symbolic knowledge from the world wide web. In Proc. of AAAI. AAAI
Press, 1998. 131
C. Cumby and D. Roth. Feature extraction languages for propositionalized relational learning.
In Proceedings of the IJCAI-2003 Workshop on Learning Statistical Models from Relational
Data, pages 24–31. Acapulco, Mexico: IJCAII, 2003. 2
J. Cussens. Parameter estimation in stochastic logic programs. Machine Learning, 44(3):
245–271, 2001. 2, 21, 24
J. Cussens. Loglinear models for first-order probabilistic reasoning. In Fifteenth Annual Con-
ference on Uncertainty in Artificial Intelligence, pages 126–133. Morgan Kaufmann, 1999.
21, 22, 24
P. Damien, J. Wakefield, and S. Walker. Gibbs sampling for bayesian non-conjugate and hi-
erarchical models by auxiliary variables. Journal of the Royal Statistical Society B, 61:2,
1999. 120
J. Davis and M. Goadrich. The relationship between precision-recall and roc curves. In Proc.
23rd ICML, pages 233–240, 2006. 58, 68, 72, 78, 79, 93, 97, 98
J. Davis, I. de Castro Dutra E. Burnside, D. Page, and V. Santos Costa. An integrated approach
to learning bayesian networks of rules. In Proc. 16th European Conf. on Machine Learning,
volume 3720 LNCS, pages 84–95, 2005. 27, 82, 83
A. P. Dawid. Conditional independence for statistical operations. Annals of Statistics, 8:598–
617, 1980. 11
L. De Raedt. Logical settings for concept-learning. Artificial Intelligence, 95(1):197–201,
1997. 18, 19
L. De Raedt and L. Dehaspe. Clausal discovery. Machine Learning, 26:99–146, 1997. 19, 37
L. De Raedt and K. Kersting. Probabilistic logic learning. SIGKDD Explorations, 5(1):31–48,
2003. 18, 20
L. De Raedt and K. Kersting. Probabilistic inductive logic programming. In Proc. of
Algorithmic Learning Theory, pages 19–36, 2004. 18, 20, 21
L. De Raedt, K. Kersting, and S. Torge. Towards learning stochastic logic programs from
proof-banks. In Proc. of AAAI, pages 752–757. AAAI Press, 2005. 22, 25
L. De Raedt, P. Frasconi, K. Kersting, and S. Muggleton, editors. Probabilistic Inductive Logic
Programming - Theory and Applications. Springer, 2008. 1
L. Dehaspe. Maximum entropy modeling with clausal constraints. In Proc. of 17th Int’l Work-
shop on Inductive Logic Programming, volume volume 1297 of LNCS, pages 109–124.
Springer, 1997. 27, 62, 82, 83
S. Della Pietra, V. Della Pietra, and J. Laferty. Inducing features of random fields. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 19:380–392, 1997. 4, 12, 15,
39, 65, 66
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via
the em algorithm. Journal of the Royal Statistical Society, Series B, vol. 39:1–38, 1977. 15
P. Domingos and M. Pazzani. On the optimality of the simple bayesian classifier under zero-one
loss. Machine Learning, 29:103–130, 1997. 4
P. Domingos and M. Richardson. Markov logic: A unifying framework for statistical relational
learning. In L. Getoor and B. Taskar, editors, Introduction to Statistical Relational Learning,
pages 339–371. Cambridge, MA: MIT Press, 2007. 2
P. Domingos, S. Kok, H. Poon, M. Richardson, and P. Singla. Markov logic. In K. Kersting
L. De Raedt, P. Frasconi and S. Muggleton, editors, Probabilistic Inductive Logic Program-
ming, pages 92–117. New York: Springer, 2008. 2
S. Dzeroski and N. Lavrac. Relational Data Mining. Springer-Verlag, 2001. 18
D. Edwards. Introduction to Graphical Modelling, 2nd ed. Springer-Verlag, 2000. 10
I. Fellegi and A. Sunter. A theory for record linkage. J. American Statistical Association, 64:
1183–1210, 1969. 55, 75
T. A. Feo and M.G.C. Resende. A probabilistic heuristic for a computationally difficult set
covering problem. Operations Research Letters, 8(2):67–71, 1989. 87
T. A. Feo and M.G.C. Resende. Greedy randomized adaptive search procedures. Journal of
Global Optimization, 6:109–133, 1995. 87
P. Festa and M.G.C. Resende. Grasp: An annotated bibliography. In C.C. Ribeiro and
P. Hansen, editors, Essays and Surveys on Metaheuristics, pages 325–367. Kluwer Academic
Publishers, 2002. 101
P. Flach and N. Lachiche. Naïve bayesian classification of structured data. Machine Learning,
57(3):233–269, 2004. 27
C. Fonlupt, D. Robilliard, P. Preux, and E.-G. Talbi. Fitness landscape and performance of
meta-heuristics. In S. Voss, S. Martello, I.H. Osman, and C. Roucairol, editors, Meta-
Heuristics: Advances and Trends in Local Search Paradigms for Optimization, pages 257–
268. Kluwer Academic Publishers, Boston, MA, 1999. 52
J. H. Friedman. On bias, variance, 0/1 - loss, and the curse-of-dimensionality. Data Mining
and Knowledge Discovery, pages 55–77, 1997a. 4
N. Friedman. Learning belief networks in the presence of missing values and hidden variables.
In Fourteenth Inter. Conf. on Machine Learning (ICML97), 1997b. 14
N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic relational models. In
Proc. 16th Int’l Joint Conf. on AI (IJCAI), pages 1300–1307. Morgan Kaufmann, 1999. 2,
23
J. Furnkranz. Separate-and-conquer rule learning. Artificial Intelligence Review, 13(1):3–54,
1999. 19
V. Ganapathi, D. Vickrey, J. Duchi, and D. Koller. Constrained approximate maximum entropy
learning. In Proc. of UAI, 2008. 16
M.R. Garey and D.S. Johnson. Computers and Intractability: A Guide to the Theory of NP-
Completeness. Freeman, San Francisco, CA, 1979. 104
D. Heckerman, D. Geiger, and D. M. Chickering. Learning bayesian networks: The combination
of knowledge and statistical data. Machine Learning, 20:197–243, 1995. 14
M. R. Genesereth and N. J. Nilsson. Logical foundations of artificial intelligence. San Mateo,
CA: Morgan Kaufmann., 1987. 17, 30, 31, 36, 44
L. Getoor and B. Taskar. Introduction to Statistical Relational Learning. MIT, 2007. 1, 62
C. J. Geyer and E. A. Thompson. Constrained monte carlo maximum likelihood for dependent
data. Journal of the Royal Statistical Society, Series B, 54:657–699, 1992. 36, 41
W. R. Gilks, S. Richardson, and D. J. Spiegelhalter. Markov Chain Monte Carlo in Practice.
Chapman and Hall, 1996. 45, 118
F. Glover and M. Laguna. Tabu Search. Kluwer Academic Publishers, Boston, MA, 1997. 52,
106
Carla P. Gomes, Jörg Hoffmann, Ashish Sabharwal, and Bart Selman. From sampling to model
counting. In IJCAI, pages 2293–2299, 2007. 138
R. Greiner, X. Su, S. Shen, and W. Zhou. Structural extension to logistic regression: Discrim-
inative parameter learning of belief net classifiers. Machine Learning, 59:297–322, 2005.
4
D. Grossman and P. Domingos. Learning bayesian network classifiers by maximizing con-
ditional likelihood. In Proc. 21st Int’l Conf. on Machine Learning, pages 361–368. Banf,
Canada: ACM Press, 2004. 4, 68
J. Halpern. An analysis of first-order logics of probability. Artificial Intelligence, 46:311–350,
1990. 1, 29
P. Hansen and B. Jaumard. Algorithms for the maximum satisfiability problem. Computing,
44:279–303, 1990. 105
D. Heckerman. A tutorial on learning with bayesian networks. In M. Jordan, editor, Learning
in Graphical Models. MIT Press, 1998. 15
D. Heckerman, D. Chickering, C. Meek, R. Rounthwaite, and C. Kadie. Dependency networks
for inference, collaborative filtering and data visualization. Journal of Machine Learning
Research, pages 49–75, 2000. 26
H. H. Hoos and T. Stutzle. Stochastic Local Search: Foundations and Applications. Morgan
Kaufmann, San Francisco, 2005. 44, 47, 49, 87, 103, 105
T. N. Huynh and R. J. Mooney. Discriminative structure and parameter learning for markov
logic networks. In Proc. of the 25th International Conference on Machine Learning
(ICML), 2008. 62, 83
R. Jirousek and S. Preucil. On the effective implementation of the iterative proportional fitting
procedure. Computational Statistics & Data Analysis, 19:177–189, 1995. 15
M. I. Jordan, editor. Learning in Graphical Models. MIT Press, 1998. 9, 10
M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational
methods for graphical models. In M. I. Jordan (Ed.), Learning in Graphical Models. Cam-
bridge: MIT Press, 1999. 17
H. Kautz, B. Selman, and Y. Jiang. A general stochastic approach to solving problems with
hard and soft constraints. In The Satisfiability Problem: Theory and Applications. AMS.
1997a. 103, 104, 106
H. Kautz, B. Selman, and Y. Jiang. A general stochastic approach to solving problems with
hard and soft constraints. In D. Du, J. Gu, and P. Pardalos, editors, The Satisfiability
Problem: Theory and Applications, pages 573–586. New York, NY: American Mathematical
Society, 1997b. 42, 44, 126
K. Kersting and L. De Raedt. Towards combining inductive logic programming with bayesian
networks. In Proc. 11th Int’l Conf. on Inductive Logic Programming, pages 118–131.
Springer, 2001a. 2, 21, 24
K. Kersting and L. De Raedt. Adaptive bayesian logic programs. In Proc. of the 11th Confer-
ence on Inductive Logic Programming, volume 2157. Springer, 2001b. 24
R. Kindermann and J. L. Snell. Markov Random Fields and Their Applications. American
Mathematical Society, 1980. 12
S. Kok and P. Domingos. Learning the structure of markov logic networks. In Proc. 22nd Int’l
Conf. on Machine Learning, pages 441–448, 2005. 3, 38, 39, 55, 56, 57, 61, 62, 66, 68, 73,
74, 75, 94, 95, 101
S. Kok, P. Singla, M. Richardson, and P. Domingos. The alchemy system for
statistical relational ai. Technical report, Department of CSE-UW, Seattle, WA,
http://alchemy.cs.washington.edu/, 2005. 56, 77, 95, 141
D. Koller and A. Pfeffer. Probabilistic frame-based systems. In Proc. AAAI, 1998. 23
D. Koller, A. Levy, and A. Pfeffer. P-classic: A tractable probabilistic description logic. In
Proc. of NCAI-97, pages 360–397, 1997. 2, 23
F. Kschischang, B. Frey, and H. Loeliger. Factor graphs and the sum product algorithm. IEEE
Transactions on Information Theory, February 2001. 17
J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for
segmenting and labeling sequence data. In Proc. 18th Int’l Conf. on Machine Learning,
pages 282–289, 2001. 3, 40, 65, 125
N. Landwehr, K. Kersting, and L De Raedt. nfoil: Integrating naive bayes and foil. In Proc.
20th Nat’l Conf. on Artificial Intelligence, pages 795–800. AAAI Press, 2005. 26, 62, 82
N. Landwehr, A. Passerini, L. De Raedt, and P. Frasconi. kfoil: Learning simple relational
kernels. In Proc. 21st Nat’l Conf. on Artificial Intelligence. AAAI Press, 2006. 26, 62, 82
N. Landwehr, K. Kersting, and L. De Raedt. Integrating naive bayes and foil. Journal of
Machine Learning Research, pages 481–507, 2007. 26, 62, 82, 83
N. Lavrac and S. Dzeroski. Inductive Logic Programming: Techniques and applications. UK:
Ellis Horwood, Chichester, 1994. 1, 2, 18
S. Lee, V. Ganapathi, and D. Koller. Efficient structure learning of markov networks using
l1-regularization. In Proc. of NIPS, 2006. 15
D. C. Liu and J. Nocedal. On the limited memory bfgs method for large scale optimization.
Mathematical Programming, 45:503–528, 1989. 37, 41, 65
H.R. Lourenço, O. Martin, and T. Stutzle. Iterated local search. In F. Glover and
G. Kochenberger, editors, Handbook of Metaheuristics, pages 321–353. Kluwer Academic
Publishers, Norwell, MA, USA, 2002. 3, 5, 49, 51, 52, 66, 103
D. Lowd and P. Domingos. Efficient weight learning for markov logic networks. In Proc. of
the 11th PKDD, pages 200–211. Springer Verlag, 2007. 3, 4, 6, 42, 43, 56, 65, 73, 74, 76,
77, 79, 80, 94, 95, 96, 98, 111, 127, 128, 129, 130, 131
E. Marinari and G. Parisi. Simulated tempering: A new monte carlo scheme. Europhysics
Letters, 19:451–458, 1992. 118
A. McCallum. Efficiently inducing features of conditional random fields. In Proc. UAI-03,
pages 403–410, 2003. 15, 39, 66
A. McCallum and B Wellner. Conditional models of identity uncertainty with application to
noun coreference. In NIPS-04, 2005. 103
R. McEliece and S. M. Aji. The generalized distributive law. IEEE Trans. Inform. Theory, 46:
325–343, 2000. 17
M. Mezard, G. Parisi, and M. A. Virasoro. Spin-glass theory and beyond. In Lecture Notes in
Physics, volume 9. World Scientific, Singapore, 1987. 52
L. Mihalkova and R. J. Mooney. Bottom-up learning of markov logic network structure. In
Proc. 24th Int’l Conf. on Machine Learning, pages 625–632, 2007. 3, 39, 40, 55, 56, 57, 62,
73, 74, 75, 80, 94, 99
B. Milch, B. Marthi, D. Sontag, S. Russell, and D. L. Ong. Blog: Probabilistic models with
unknown objects. In Proc.IJCAI-05, pages 1352–1359. Edinburgh, Scotland, 2005. 76
P. Mills and E. Tsang. Guided local search for solving sat and weighted max-sat problems.
In I.P. Gent, H. van Maaren, and T. Walsh, editors, SAT2000 - Highlights of Satisfiability
Research in the Year 2000, pages 89–106. IOS Press, 2000. 105
T. M. Mitchell. Machine Learning. The McGraw-Hill Companies, Inc., 1997. 19
M. Moller. A scaled conjugate gradient algorithm for fast supervised learning. Neural Net-
works, 6:525–533, 1993. 6, 43, 128
S. Muggleton. Learning structure and parameters of stochastic logic programs. In Proc. of
12th Int'l Conference on Inductive Logic Programming, pages 198–206, 2002. 24
S. Muggleton. Stochastic logic programs. In L. De Raedt (Ed.), Advances in inductive logic
programming. IOS Press, Amsterdam, 1996. 2, 21, 24
S. Muggleton and L. De Raedt. Inductive logic programming: Theory and methods. Journal
of Logic Programming, 19(20):629–679, 1994. 18
S. Muggleton and C. Feng. Efficient induction of logic programs. In Inductive logic program-
ming, pages 281–297. New York: Academic Press., 1992. 39
S. H. Muggleton. Inverse entailment and progol. New Generation Computing Journal, pages
245–286, 1995. 18
K. Murphy. Learning bayes net structure from sparse data sets. Technical report, Comp. Sci.
Div., UC Berkeley, 2001. 14
K. Murphy, Y. Weiss, and M. I. Jordan. Loopy belief propagation for approximate inference:
An empirical study. In 15th Conf. on Uncertainty in Artificial Intelligence (UAI). San Mateo,
CA: Morgan Kaufmann, 1999. 17
J. Neville and D. Jensen. Relational dependency networks. Journal of Machine Learning
Research, 8(Mar):653–692, 2007. 26
H. Newcombe, J. Kennedy, S. Axford, and A. James. Automatic linkage of vital records.
Science, 130:954–959, 1959. 55, 75
A. Y. Ng and M. I. Jordan. On discriminative vs. generative classifiers: A comparison of
logistic regression and naive bayes. In Advances in Neural Information Processing Systems,
pages 841–848. Cambridge, MA: MIT Press, 2002. 4, 80, 99
R. T. Ng and V. S. Subrahmanian. Probabilistic logic programming. Information and Compu-
tation, 101(2):150–201, December 1992. 22
L. Ngo and P. Haddawy. Answering queries from context-sensitive probabilistic knowledge
bases. Theoretical Computer Science, 171:147–177, 1997. 2, 22
S.-H. Nienhuys-Cheng and R. de Wolf. Foundations of Inductive Logic Programming.
Springer-Verlag, 1997. 19
N. Nilsson. Probabilistic logic. Artificial Intelligence, 28:71–87, 1986. 1
J. Nocedal and S. Wright. Numerical Optimization. Springer, New York, NY, 2006. 66
J. Nocedal and S. J. Wright. Numerical optimization. New York, NY: Springer, 1999. 41, 42,
127
T. P. Minka. Algorithms for maximum-likelihood logistic regression. Technical report, avail-
able from http://www.stat.cmu.edu/minka/, 2001. 16
J. D. Park. Using weighted max-sat engines to solve mpe. In Proc. of AAAI, pages 682–687,
2005. 106
H. Pasula and S. Russell. Approximate inference for first-order probabilistic languages. In Pro-
ceedings of the Seventeenth International Joint Conference on Artificial Intelligence, pages
741–748. Seattle, WA: Morgan Kaufmann, 2001. 2, 76
J. Pearl. Probabilistic reasoning in intelligent systems: Networks of plausible inference. San
Francisco, CA: Morgan Kaufmann, 1988. 1, 9, 16, 17
F. Pernkopf and J. Bilmes. Discriminative versus generative parameter and structure learning
of bayesian network classifiers. In Proc. 22nd Int'l Conf. on Machine Learning, pages 657–
664, 2005. 4
G. D. Plotkin. A note on inductive generalization. In Machine Intelligence, Edinburgh Univer-
sity Press, 5:153–163, 1970. 19, 22, 25
D. Poole. First-order probabilistic inference. In Proceedings of the 18th International Joint
Conference on Artificial Intelligence, pages 985–991. Acapulco, Mexico: Morgan Kauf-
mann, 2003. 45
D. Poole. Probabilistic horn abduction and bayesian networks. Artificial Intelligence, 64:81–
129, 1993. 2
H. Poon and P. Domingos. Sound and efficient inference with probabilistic and deterministic
dependencies. In Proc. 21st Nat’l Conf. on AI, (AAAI), pages 458–463. AAAI Press, 2006.
4, 5, 6, 46, 57, 65, 67, 118, 121, 122
H. Poon, P. Domingos, and M. Sumner. A general method for reducing the complexity of
relational inference and its application to mcmc. In Proc. 23rd Nat’l Conf. on Artificial
Intelligence. Chicago, IL: AAAI Press, 2008. 4, 46, 65, 67, 84, 134
A. Popescul and L. H. Ungar. Structural logistic regression for link analysis. In Proceed-
ings of the Second International Workshop on Multi-Relational Data Mining, pages 92–106.
Washington, DC: ACM Press, 2003. 2, 27, 54, 55, 74, 82
J. R. Quinlan. Learning logical definitions from relations. Machine Learning, 5:239–266,
1990. 18, 26
L. R. Rabiner. A tutorial on hidden markov models and selected applications in speech recog-
nition. In Proceedings of the IEEE, pages 257–286. IEEE, 1989. 42
M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 62:107–136,
2006. 2, 29, 31, 32, 36, 37, 38, 39, 40, 41, 45, 46, 55, 56, 68, 74, 75, 78, 97
F. Riguzzi. Learning logic programs with annotated disjunctions. In Proc. 14th International
Conference on Inductive Logic Programming, pages 270–287. Springer, 2004. 28
D. Roth. On the hardness of approximate reasoning. Artificial Intelligence, 82:273–302, 1996.
103, 117
V. Santos Costa, D. Page, M. Qazi, and J. Cussens. Clp(bn): Constraint logic programming
for probabilistic knowledge. In Proceedings of the Nineteenth Conference on Uncertainty in
Artificial Intelligence, pages 517–524. Acapulco, Mexico: Morgan Kaufmann, 2003. 2
T. Sato. A statistical learning method for logic programs with distribution semantics. In Proc.
of the 12th Int’l Conference on Logic Programming, Tokyo, pages 715–729, 1995. 25
T. Sato and Y. Kameya. Prism: A symbolic-statistical modeling language. In Proceedings
of the Fifteenth International Joint Conference on Artificial Intelligence, pages 1330–1335.
Nagoya, Japan: Morgan Kaufmann, 1997a. 2
T. Sato and Y. Kameya. Prism: A symbolic-statistical modeling language. In Proceedings of
the 15th International Joint Conference on Artificial Intelligence, pages 1330–1335, 1997b.
25
T. Sato and Y. Kameya. Parameter learning of logic programs for symbolic-statistical modeling.
Journal of Artificial Intelligence Research (JAIR), 15:391–454, 2001. 25
T. Sato and Y. Kameya. New advances in logic-based probabilistic modeling by prism. In
Probabilistic Inductive Logic Programming, volume LNCS 4911, pages 118–155. Springer,
2008. 25
B. Selman, H. Kautz, and B. Cohen. Local search strategies for satisfiability testing. In Cliques,
Coloring, and Satisfiability: Second DIMACS Implementation Challenge, pages 521–532.
American Mathematical Society, 1996. 5, 106
F. Sha and F. Pereira. Shallow parsing with conditional random fields. In Proc. HLT-NAACL-
03, pages 134–141, 2003. 4, 16, 39, 43, 61, 65, 66, 129
Y. Shang and B. Wah. Discrete lagrangian-based search for solving max-sat problems. In
Proc. of IJCAI, pages 378–383. Morgan Kaufmann Publishers, San Francisco, CA, USA,
1997. 105
E. Shapiro. Algorithmic Program Debugging. MIT Press, 1983. 19
N. Shental, A. Zomet, T. Hertz, and Y. Weiss. Learning and inferring image segmentations
using the gbp typical cut algorithm. In Proc. ICCV, 2003. 16
J. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain.
Technical report, School of Computer Science, Carnegie Mellon University, 1994. Technical
Report CMU-CS-94-125. 42, 127
P. Singla and P. Domingos. Markov logic in infinite domains. In Proc. 23rd UAI, pages 368–
375. AUAI Press, 2007. 2, 29
P. Singla and P. Domingos. Lifted first-order belief propagation. In Twenty-Third National
Conference on Artificial Intelligence, Chicago, IL. AAAI Press, 2008. 45
P. Singla and P. Domingos. Discriminative training of markov logic networks. In Proc. 20th
Nat’l Conf. on AI, (AAAI), pages 868–873. AAAI Press, 2005. 3, 4, 5, 6, 42, 43, 44, 55, 65,
74, 75, 76, 104, 126, 129
P. Singla and P. Domingos. Memory-efficient inference in relational domains. In Proc. 21st
Nat’l Conf. on AI, (AAAI), pages 488–493. AAAI Press, 2006a. 44, 67, 106
P. Singla and P. Domingos. Entity resolution with markov logic. In Proc. ICDM-2006, pages
572–582. IEEE Computer Society Press, 2006b. 56, 73, 76, 77, 95, 96
K. Smyth, H. Hoos, and T. Stützle. Iterated robust tabu search for max-sat. In Canadian
Conference on AI, pages 129–144, 2003. 104, 105, 108, 110, 119
A. Srinivasan. The Aleph Manual. Available at http://www.comlab.ox.ac.uk/oucl/research/
areas/machlearn/Aleph/. 18
A. Stolcke and S. Omohundro. Hidden markov model induction by bayesian model merging.
In Advances in Neural Information Processing Systems, volume 5, 1993. 22
C. Sutton and A. McCallum. Piecewise training of undirected models. In Proc. UAI, 2005a.
16
C. Sutton and A. McCallum. Piecewise training for undirected models. In UAI, pages 568–
575, 2005b. 138
C. Sutton and A. McCallum. Piecewise pseudolikelihood for efficient training of conditional
random fields. In ICML, pages 863–870, 2007. 138
E.D. Taillard. Robust taboo search for the quadratic assignment problem. Parallel Computing,
17:443–455, 1991. 5, 103, 106
B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data.
In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, pages
485–492. Edmonton, Canada: Morgan Kaufmann, 2002. 2, 16, 23, 27
B. Taskar, M. F. Wong, P. Abbeel, and D. Koller. Link prediction in relational data. In Proc. of
Neural Information Processing Systems Conference. Vancouver, Canada, December 2003.
55, 74
J. Vennekens, S. Verbaeten, and M. Bruynooghe. Logic programs with annotated disjunctions.
In Proc. of the 20th International Conference on Logic Programming. Springer, 2004. 28
S. V. N. Vishwanathan, N. N. Schraudolph, M. W. Schmidt, and K. P. Murphy. Accelerated
training of conditional random fields with stochastic gradient methods. In ICML06, pages
969–976, 2006. 16
W. Wei, J. Erenrich, and B. Selman. Towards efficient sampling: Exploiting random walk
strategies. In Proc. 19th Nat’l Conf. on AI, (AAAI), 2004. 5, 46, 118, 120
W. Wei and B. Selman. A new approach to model counting. In SAT, pages 324–339, 2005.
138
M. P. Wellman, J. S. Breese, and R. P. Goldman. From knowledge bases to decision models.
Knowledge Engineering Review, 7, 1992. 1, 2, 22
Z. Wu and B.W. Wah. Trap escaping strategies in discrete lagrangian methods for solving hard
satisfiability and maximum satisfiability problems. In Proc. of AAAI, pages 673–678. MIT
Press, 1999. 105
M. Yagiura and T. Ibaraki. Efficient 2 and 3-flip neighborhood search algorithms for the max
sat: experimental evaluation. Journal of Heuristics, 7(5):423–442, 2001. 105
J. Yedidia, W. Freeman, and Y. Weiss. Constructing free-energy approximations and gener-
alized belief propagation algorithms. IEEE Transactions on Information Theory, 51:2282–
2312, 2005. 16
J. S. Yedidia, W. T. Freeman, and Y. Weiss. Generalized belief propagation. In NIPS, 2001.
118
F. Zelezny, A. Srinivasan, and D. Page. Randomised restarted search in ilp. Machine Learning,
64(1–3):183–208, 2006. 63, 84, 101