Integrating Logic and Probability: Algorithmic Improvements in
Markov Logic Networks
Marenglen Biba
Department of Computer Science
University of Bari, Italy
DISSERTATION submitted in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY in Computer Science
2009
Reading Committee
1. Advisor: Professor Floriana Esposito
2. Reviewer:
3. Reviewer:
4. Reviewer:
Signature from head of PhD committee:
Abstract
This dissertation proposes novel algorithms for learning and inference in Markov
Logic Networks.
Statistical Relational Learning tackles one of the problems that has challenged
Machine Learning since its birth: integrating logic and probability in learning.
Markov Logic is a powerful representation formalism that combines full first-order
logic with probabilistic graphical models by attaching weights to first-order for-
mulas and viewing these as templates for features of Markov Networks (MNs).
Markov Logic Networks (MLNs) together with a set of constants define ground
MNs. MLNs preserve the expressivity of first-order logic and take advantage of
algorithms for probabilistic graphical models; they are therefore a powerful model
for dealing with structured, noisy and uncertain data.
The rich expressivity of MLNs comes at a high computational cost for learning and
inference. Structure learning is the task of learning the logical clauses together
with their weights; it is a computationally hard task, involving a search through a
huge space of hypotheses with many local optima for the evaluation function.
Therefore, robust algorithms for structure learning of MLNs are needed. This
dissertation proposes a novel generative structure learning algorithm based on the
iterated local search metaheuristic. An extensive empirical study using real-world
benchmark datasets shows that the algorithm improves predictive accuracy and
learning time compared to the state-of-the-art algorithms.
Generative structure learning algorithms optimize the joint distribution of all the
variables. This can lead to suboptimal results for predictive tasks because of the
mismatch between the objective function used (likelihood or a function thereof)
and the goal of classification (maximizing accuracy or conditional likelihood). In
contrast, discriminative approaches maximize the conditional likelihood of a set
of outputs given a set of inputs and this often produces better results for predic-
tion problems. Unfortunately, the computational cost of optimizing structure and
parameters for conditional likelihood is prohibitive. This dissertation proposes
novel discriminative structure learning algorithms based on the simple approxi-
mation of choosing structures by maximizing conditional likelihood while setting
parameters by maximum likelihood. Extensive experiments in real-world domains
show that the proposed discriminative algorithms improve over state-of-the-art
generative structure learning and discriminative weight learning algorithms.
Inference in graphical models is NP-hard. For MLNs, MAP inference can be
performed through SAT solvers. This dissertation proposes the IRoTS algorithm for
MAP inference in MLNs and shows through experiments that it is a high-performing
algorithm, improving over the existing state-of-the-art algorithm in terms of
solution quality and inference running times. Moreover, in statistical relational
learning, both probabilistic and deterministic dependencies must be handled.
This dissertation extends IRoTS by proposing MC-IRoTS, an algorithm that com-
bines MCMC methods and SAT solvers for the problem of conditional inference
in MLNs. Empirical evaluation on real-world data shows good improvements over
the state-of-the-art algorithm for conditional inference in MLNs.
For my parents
Acknowledgements
There are a lot of people that I would like to acknowledge for having been of
support during this long period of hard work. I will start with my colleagues who,
every day, have shared with me my work on machine learning research. I would
like to thank Floriana Esposito for having given me both freedom and good advice
for research; she taught me the importance of high quality research and inspired
me to investigate pattern recognition. Many thanks to Stefano Ferilli for having
shared with me all the ideas of my research and for having been of great support
during all these years; he rounded out my background on logic programming and
relational learning, taught me the importance of empirical evaluation in machine
learning and encouraged me to publish. Thanks to Nicola Di Mauro from whom I
received precious advice and ideas on metaheuristics. Thanks also to Teresa Basile
for her careful suggestions on our joint works.
I would like to thank Nicola Fanizzi for many helpful discussions on machine
learning topics and for having shared with me my curiosity on Linux, LaTeX and
kernels. Thanks also to Claudia d’Amato for having been a great colleague in the
LACAM laboratory. Many thanks also to all the colleagues at Dipartimento di
Informatica, Bari.
Part of this research was carried out at the Department of Computer Science,
University of Washington, Seattle. Special acknowledgement goes to Pedro Domingos
of University of Washington, for having given me the possibility to deepen my
knowledge on statistical relational learning by visiting his machine learning group.
He made my period in Seattle very productive and I learned from him the impor-
tance of practical machine learning. I would like to thank all the other members of
the machine learning group at UW-CSE: Stanley Kok for having shared with me
the results on structure learning, Marc Sumner for his help on Alchemy, Hoifung
Poon for useful discussions on MC-SAT, Parag Singla for helpful discussions on
discriminative learning, Jesse Davis for helpful talks on machine learning topics
and Daniel Lowd for his help with PSCG and for having shared with me his re-
sults. I would like to thank also Liliana Mihalkova of University of Texas for her
help with BUSL.
Of great help to me were Mary Giordano and her family in Seattle, who made my
stay there a very exciting experience. Thank you all for your support and for the
splendid time we had together in Seattle.
I would like to thank my girlfriend Eni for having been of great support during
these years of hard work and for having shared with me all my difficult moments.
She made me understand the great value of taking care of people and giving love
to them. Thank you Eni!
Finally, I would like to thank my parents who helped me understand the sense of
life. They made my work easier by giving me good advice and by encouraging
me to continue my way. Thanks also to my brother and his family for their help
during these years.
Contents
List of Figures ix
List of Tables xi
List of Algorithms xiv
1 Introduction 1
1.1 Statistical Models of Relational Data . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Overview of this Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Statistical Relational Learning 9
2.1 Probabilistic Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Bayesian Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.2 Markov Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.3 Structure Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.4 Parameter Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.5 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 First-Order Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 Inductive Logic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.1 Learning from entailment . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.2 Learning from interpretations . . . . . . . . . . . . . . . . . . . . . . 19
2.3.3 Learning from proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4 Probabilistic Inductive Logic Programming . . . . . . . . . . . . . . . . . . . 20
2.4.1 Learning from Probabilistic Entailment . . . . . . . . . . . . . . . . . 20
2.4.2 Learning from Probabilistic Interpretations . . . . . . . . . . . . . . . 21
2.4.3 Learning from Probabilistic Proofs . . . . . . . . . . . . . . . . . . . . 21
2.5 SRL and PILP models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5.1 Knowledge-based Model Construction . . . . . . . . . . . . . . . . . . 22
2.5.2 Probabilistic Relational Models . . . . . . . . . . . . . . . . . . . . . 23
2.5.3 Bayesian Logic Programs . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5.4 Stochastic Logic Programs . . . . . . . . . . . . . . . . . . . . . . . . 24
2.5.5 PRISM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5.6 Relational Dependency Networks . . . . . . . . . . . . . . . . . . . . 26
2.5.7 nFOIL, TFOIL and kFOIL . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5.8 Other models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3 Markov Logic Networks 29
3.1 Markov Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Structure Learning of MLNs . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.1 Pseudo-likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.2 Two-step Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2.3 Single-step Learning by Optimizing Weighted Pseudo-likelihood . . . . 38
3.2.4 Bottom-up Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3 Parameter Learning of MLNs . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3.1 Generative approaches . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3.2 Discriminative Approaches . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4 Inference in MLNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.1 MAP Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.2 Conditional Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4 The GSL algorithm 47
4.1 The Iterated Local Search metaheuristic . . . . . . . . . . . . . . . . . . . . . 47
4.2 Generative Structure Learning using ILS . . . . . . . . . . . . . . . . . . . . . 50
4.2.1 The Perturbation Component . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.2 The Local Search Component . . . . . . . . . . . . . . . . . . . . . . 53
4.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.1 Link Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.2 Entity Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.3.3 Systems and Methodology . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5 The ILS-DSL algorithm 65
5.1 Setting Parameters through Likelihood . . . . . . . . . . . . . . . . . . . . . . 66
5.2 Scoring Structures through Conditional Likelihood . . . . . . . . . . . . . . . 67
5.3 Discriminative Structure Learning using ILS . . . . . . . . . . . . . . . . . . . 68
5.3.1 The ILS-DSLCLL version . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.3.2 The ILS-DSLAUC version . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.4.1 Link Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.4.2 Entity Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.4.3 Systems and Methodology . . . . . . . . . . . . . . . . . . . . . . . . 77
5.4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6 The RBS-DSL algorithm 87
6.1 The GRASP metaheuristic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.2 Randomized Beam Discriminative Structure Learning . . . . . . . . . . . . . . 88
6.2.1 The RBS-DSLCLL version . . . . . . . . . . . . . . . . . . . . . . . . 93
6.2.2 The RBS-DSLAUC version . . . . . . . . . . . . . . . . . . . . . . . . 93
6.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.3.1 Systems and Methodology . . . . . . . . . . . . . . . . . . . . . . . . 95
6.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
7 The IRoTS and MC-IRoTS algorithms 103
7.1 MAP/MPE inference using IRoTS . . . . . . . . . . . . . . . . . . . . . . . . 104
7.1.1 The SAT and MAX-SAT problems . . . . . . . . . . . . . . . . . . . . 104
7.1.2 Iterated Robust Tabu Search . . . . . . . . . . . . . . . . . . . . . . . 106
7.1.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.2 Conditional Inference for MLNs using MC-IRoTS . . . . . . . . . . . . . . . 117
7.2.1 The SampleIRoTS algorithm: Combining MCMC and IRoTS . . . . . 119
7.2.2 The MC-IRoTS algorithm . . . . . . . . . . . . . . . . . . . . . . . . 121
7.2.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
7.3 Discriminative Parameter Learning . . . . . . . . . . . . . . . . . . . . . . . . 125
7.3.1 Optimizing Conditional Likelihood for Weight Learning . . . . . . . . 125
7.3.2 Learning MLNs Weights by Sampling with MC-IRoTS . . . . . . . . . 128
7.3.3 Experiments on Web Page Classification . . . . . . . . . . . . . . . . . 131
7.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
8 Conclusion 135
8.1 Contributions of this Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . 135
8.2 Directions for Future Research . . . . . . . . . . . . . . . . . . . . . . . . . . 137
8.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
Appendix A The MLN++ Package 141
References 143
List of Figures
2.1 Example of the graph structure of a Bayesian Network . . . . . . . . . . . . . 10
2.2 Example of the graph structure of a Markov Network . . . . . . . . . . . . . . 13
3.1 Example of a knowledge base in first-order logic . . . . . . . . . . . . . . . . 33
3.2 Example of a knowledge base in Markov Logic . . . . . . . . . . . . . . . . . 33
3.3 Partial construction of the nodes of the ground Markov Network . . . . . . . . 34
3.4 Complete construction of the nodes of the ground Markov Network . . . . . . 34
3.5 Connecting nodes whose predicates appear in some ground formula . . . . . . 35
3.6 Complete construction of the structure of the graph for the Markov Network . . 35
4.1 The Iterated Local Search schema . . . . . . . . . . . . . . . . . . . . . . . . 50
List of Tables
2.1 Conditional Probability Tables (CPTs) for all variables . . . . . . . . . . . . . 11
4.1 All predicates in the UW-CSE domain . . . . . . . . . . . . . . . . . . . . . . 57
4.2 All predicates in the CORA domain . . . . . . . . . . . . . . . . . . . . . . . 57
4.3 Accuracy results on UW-CSE for ten parallel independent walks of GSL . . . . 58
4.4 Accuracy comparison of GSL, BUSL and BS on the UW-CSE dataset . . . . . 59
4.5 Learning times (in minutes) on UW-CSE for ten parallel independent walks of
GSL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.6 Comparison of learning times (in minutes) on UW-CSE for GSL, BUSL and BS 60
4.7 Accuracy results on CORA for ten parallel independent walks of GSL . . . . . 61
4.8 Accuracy comparison of GSL with BUSL on the CORA dataset . . . . . . . . 61
4.9 Learning times (in minutes) on CORA for ten parallel independent walks of GSL 62
4.10 Comparison of learning times (in minutes) on CORA for GSL and BUSL . . . 62
5.1 CLL results for the query predicate advisedBy in the UW-CSE domain . . . . . 80
5.2 AUC results for the query predicate advisedBy in the UW-CSE domain . . . . 80
5.3 CLL results for all query predicates in the CORA domain . . . . . . . . . . . . 81
5.4 AUC results for all query predicates in the CORA domain . . . . . . . . . . . . 81
6.1 CLL results for the query predicate advisedBy in the UW-CSE domain . . . . . 99
6.2 AUC results for the query predicate advisedBy in the UW-CSE domain . . . . 99
6.3 CLL results for all query predicates in the CORA domain . . . . . . . . . . . . 100
6.4 AUC results for all query predicates in the CORA domain . . . . . . . . . . . . 100
7.1 Inference results in terms of cost of false clauses for query predicate advisedBy
for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts, using
MLNs learned by running PSCG for 500 iterations . . . . . . . . . . . . . . . 111
7.2 Running times (in minutes) for the same number of search steps for query
predicate advisedBy for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu
with restarts, using MLNs learned by running PSCG for 500 iterations . . . . . 112
7.3 Inference results in terms of cost of false clauses for query predicate advisedBy
for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts, using
MLNs learned by running PSCG for 10 hours . . . . . . . . . . . . . . . . . . 113
7.4 Running times (in minutes) for the same number of search steps for query
predicate advisedBy for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu
with restarts, using MLNs learned by running PSCG for 10 hours . . . . . . . . 113
7.5 Inference results in terms of cost of false clauses for query predicate advisedBy
for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts, using
MLNs learned by running PSCG for 50 iterations with both advisedBy and
tempAdvisedBy as non-evidence predicates . . . . . . . . . . . . . . . . . . . 114
7.6 Running times (in minutes) for the same number of search steps for query
predicate advisedBy for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu
with restarts, using MLNs learned by running PSCG for 50 iterations with both
advisedBy and tempAdvisedBy as non-evidence predicates . . . . . . . . . . . 114
7.7 Inference results in terms of cost of false clauses for query predicate tempAd-
visedBy for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts,
using MLNs learned by running PSCG for 50 iterations with both advisedBy
and tempAdvisedBy as non-evidence predicates . . . . . . . . . . . . . . . . . 115
7.8 Running times (in minutes) for the same number of search steps for query
predicate tempAdvisedBy for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-
Tabu with restarts, using MLNs learned by running PSCG for 50 iterations with
both advisedBy and tempAdvisedBy as non-evidence predicates . . . . . . . . 115
7.9 Inference results in terms of cost of false clauses for query predicates ad-
visedBy and tempAdvisedBy in a single inference task, for IRoTS, MaxWalkSAT-
Tabu and MaxWalkSAT-Tabu with restarts, using MLNs learned by running
PSCG for 50 iterations with both advisedBy and tempAdvisedBy as non-evidence
predicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.10 Running times (in minutes) for the same number of search steps for query pred-
icates advisedBy and tempAdvisedBy in a single inference task, for IRoTS,
MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts, using MLNs learned
by running PSCG for 50 iterations with both advisedBy and tempAdvisedBy
as non-evidence predicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.11 Inference results in terms of cost of false clauses for query predicates taugh-
tBy, advisedBy and tempAdvisedBy in a single inference task, for IRoTS,
MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts, using MLNs learned
by running PSCG for 50 iterations with the three predicates as non-evidence
predicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.12 Running times (in minutes) for the same number of search steps for query pred-
icates taughtBy, advisedBy and tempAdvisedBy in a single inference task, for
IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts, using MLNs
learned by running PSCG for 50 iterations with the three predicates as non-
evidence predicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.13 Inference running times for 1000 samples in the CORA domain . . . . . . . . 122
7.14 Accuracy results of inference for 1000 samples in the CORA domain . . . . . . 124
7.15 Accuracy results of inference for 1000 samples for the advisedBy predicate
based on the MLNs generated with 500 iterations of PSCG . . . . . . . . . . . 125
7.16 Inference running times (in seconds) for 1000 samples for the predicate ad-
visedBy based on the MLNs generated with 500 iterations of PSCG . . . . . . 125
7.17 Accuracy results of inference for 1000 samples for the advisedBy predicate
based on the MLNs generated by running PSCG for 10 hours on the training
data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
7.18 Inference running times (in seconds) for 1000 samples for the predicate ad-
visedBy based on the MLNs generated by running PSCG for 10 hours on the
training data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
7.19 Accuracy results of inference for 1000 samples for the advisedBy predicate
based on the MLNs generated by running PSCG with both advisedBy and tem-
pAdvisedBy as non-evidence predicates . . . . . . . . . . . . . . . . . . . . . 127
7.20 Inference running times (in seconds) for 1000 samples for the predicate ad-
visedBy based on the MLNs generated by running PSCG with both advisedBy
and tempAdvisedBy as non-evidence predicates . . . . . . . . . . . . . . . . . 127
7.21 Accuracy results of inference for 1000 samples for the predicate tempAd-
visedBy based on the MLNs generated by running PSCG with both advisedBy
and tempAdvisedBy as non-evidence predicates . . . . . . . . . . . . . . . . . 128
7.22 Inference running times (in seconds) for 1000 samples for the predicate tem-
pAdvisedBy based on the MLNs generated by running PSCG with both ad-
visedBy and tempAdvisedBy as non-evidence predicates . . . . . . . . . . . . 128
7.23 Accuracy results of inference for 1000 samples with both query predicates ad-
visedBy and tempAdvisedBy in a single inference task . . . . . . . . . . . . . 129
7.24 Inference running times (in seconds) for 1000 samples with both query predi-
cates advisedBy and tempAdvisedBy in a single inference task . . . . . . . . . 129
7.25 Accuracy results of inference for 1000 samples with query predicates taughtBy,
advisedBy and tempAdvisedBy in a single inference task . . . . . . . . . . . . 130
7.26 Inference running times (in seconds) for 1000 samples with the query predi-
cates taughtBy, advisedBy and tempAdvisedBy in a single inference task . . . 130
7.27 Accuracy results for classifying webpages of students . . . . . . . . . . . . . . 131
7.28 Accuracy results for classifying webpages of faculty members . . . . . . . . . 132
7.29 Accuracy results for classifying webpages of research projects . . . . . . . . . 132
7.30 Accuracy results for classifying webpages of courses . . . . . . . . . . . . . . 133
7.31 Overall accuracy results for web page classification in the WebKB domain . . . 133
List of Algorithms
4.1 The Iterated Local Search algorithm . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 The GSL algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3 The SearchBestClause component of GSL . . . . . . . . . . . . . . . . . . . . 52
4.4 The local search component of GSL . . . . . . . . . . . . . . . . . . . . . . . 54
5.1 The ILS-DSL algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.2 The SearchBestClause component of ILS-DSL . . . . . . . . . . . . . . . . . 70
5.3 The subsidiary procedure LocalSearch and the Step function of ILS-DSL . . . 71
6.1 The GRASP metaheuristic. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.2 The RBS-DSL algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.3 The SearchBestClause procedure of the RBS-DSL algorithm . . . . . . . . . . 90
6.4 Randomized Construction of the best WPLL candidate list . . . . . . . . . . . 91
6.5 Randomized choice of the best CLL (or AUC) candidate list to form the new
beam. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
7.1 The WalkSAT algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.2 The Robust Tabu Search algorithm . . . . . . . . . . . . . . . . . . . . . . . . 109
7.3 The Iterated Robust Tabu Search algorithm . . . . . . . . . . . . . . . . . . . 110
7.4 The MC-IRoTS algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
Chapter 1
Introduction
1.1 Statistical Models of Relational Data
Traditionally, Artificial Intelligence research has fallen into two separate subfields: one that
has focused on logical representations, and one on statistical ones. Logical AI approaches like
logic programming, description logics, classical planning, symbolic parsing, rule induction,
etc., tend to emphasize handling complexity. Statistical AI approaches like Bayesian networks,
hidden Markov models, Markov decision processes, statistical parsing, neural networks, etc.,
tend to emphasize handling uncertainty. However, intelligent agents must be able to handle
both for real-world applications. The first attempts to integrate logic and probability in AI
date back to the works in (Bacchus 1990; Halpern 1990; Nilsson 1986). Later, several authors
began using logic programs to compactly specify Bayesian networks, an approach known as
knowledge-based model construction (Wellman and Goldman 1992).
In Machine Learning, a central problem has always been learning in rich representations
that make it possible to deal with structure and relations. Much progress has been achieved
in the field of relational learning, also known as Inductive Logic Programming (Lavrac and
Dzeroski 1994). On the other hand, successful statistical machine learning models, with their
roots in statistics and pattern recognition, have made it possible to deal with noisy and
uncertain domains in a robust manner. Powerful models such as Probabilistic Graphical Models
(Pearl 1988) and related algorithms can handle uncertainty but lack the capability of
dealing with structured domains.
Statistical Relational Learning (Getoor and Taskar 2007) or Probabilistic Inductive Logic
Programming (De Raedt et al. 2008) has undertaken the hard task not only of Machine Learning
but of Artificial Intelligence as a whole: building hybrid models that integrate logical and
statistical formalisms. A growing amount of work has been dedicated to integrating subsets of
first-order logic with probabilistic graphical models, to extending logic programs with a
probabilistic semantics, or to integrating other formalisms with probability. Some of the
logic-based approaches are: Knowledge-based Model Construction (Wellman and Goldman 1992),
Bayesian Logic Programs (Kersting and De Raedt 2001a), Stochastic Logic Programs (Cussens 2001; Muggleton
1996), Probabilistic Horn Abduction (Poole 1993), Queries for Probabilistic Knowledge Bases
(Ngo and Haddawy 1997), PRISM (Sato and Kameya 1997a), CLP(BN) (Santos Costa et al.
2003). Other approaches include frame-based systems such as Probabilistic Relational Models
(Friedman et al. 1999) or PRMs extensions defined in (Pasula and Russell 2001), description
logics based approaches such as those in (Cumby and Roth 2003) and P-CLASSIC of (Koller
et al. 1997), database query languages (Taskar et al. 2002), (Popescul and Ungar 2003), etc.
All these SRL approaches are based on subsets of first-order logic. Markov Logic (Domin-
gos and Richardson 2007; Domingos et al. 2008) is a further step in generalizing these ap-
proaches. It is a simple language that provides the full expressiveness of graphical models
and first-order logic in finite domains, and remains well-defined in many infinite domains as
the results in (Richardson and Domingos 2006; Singla and Domingos 2007) show. Markov
Logic extends first-order logic by attaching weights to first-order formulas, which are viewed
as templates for constructing Markov networks. In the infinite-weight limit, Markov Logic
reduces to standard first-order logic. Markov Logic avoids the i.i.d. (independent and
identically distributed) data assumption made by most statistical learners by using the
power of first-order logic to compactly represent dependencies among objects and relations. A
Markov Logic Network (MLN) can be seen as a knowledge base capable of soundly handling
uncertainty, tolerating imperfect and contradictory knowledge, and reducing brittleness.
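For concreteness, the distribution an MLN defines over possible worlds can be written (following Richardson and Domingos 2006) as:

```latex
% Probability of a possible world x under an MLN with first-order formulas F_i
% and weights w_i, where n_i(x) is the number of true groundings of F_i in x
% and Z is the partition function normalizing over all possible worlds:
P(X = x) \;=\; \frac{1}{Z} \exp\!\Big( \sum_{i} w_i \, n_i(x) \Big),
\qquad
Z \;=\; \sum_{x'} \exp\!\Big( \sum_{i} w_i \, n_i(x') \Big)
```

As the weights grow, a formula behaves more and more like a hard constraint, which is how Markov Logic subsumes standard first-order logic over finite domains in the infinite-weight limit.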
The expressiveness of Markov Logic poses computationally challenging problems. As
with other graphical models, there are three important tasks with MLNs: structure learning,
parameter learning and inference. This dissertation aims at giving a contribution for each of
these tasks.
Structure learning for MLNs is the task of learning the logical clauses. This can be performed
from scratch or by starting from an already learned structure and revising it. In
(Richardson and Domingos 2006) structure learning was performed through ILP methods
(Lavrac and Dzeroski 1994) followed by a weight learning phase during which maximum-
pseudolikelihood (Besag 1975) weights were learned for each previously learned clause. State-
of-the-art algorithms for structure learning are those in (Kok and Domingos 2005; Mihalkova
and Mooney 2007) where learning of MLNs is performed in a single step using weighted
pseudo-likelihood as the evaluation measure during structure search. However, these algo-
rithms follow systematic search strategies that can lead to local optima and prohibitive learning
times. The algorithm in (Kok and Domingos 2005) performs a beam search in a greedy fash-
ion which makes it very susceptible to local optima, while the algorithm in (Mihalkova and
Mooney 2007) works in a bottom-up fashion trying to consider fewer candidates for evalu-
ation. Even though it considers fewer candidates, after initially scoring all candidates, this
algorithm attempts to add them one by one to the MLN, thus changing the MLN at almost each
step, which greatly slows down the computation of the optimal weights. Moreover, both these
algorithms cannot benefit from parallel architectures. This dissertation proposes an approach
based on the Iterated Local Search (ILS) metaheuristic (Lourenço et al. 2002) that samples the
set of local optima and performs a search in the sampled space. We show that, through a simple
parallelism model such as independent multiple walks, ILS achieves important improvements
over the state-of-the-art algorithms of (Kok and Domingos 2005; Mihalkova and Mooney
2007).
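As a reminder of the evaluation measure these structure learners optimize, a sketch of the weighted pseudo-log-likelihood (in the spirit of Kok and Domingos 2005, building on the pseudo-likelihood of Besag 1975) is:

```latex
% Weighted pseudo-log-likelihood (WPLL) of a world x under weights w:
% r ranges over first-order predicates, G_r over the groundings of predicate r,
% MB_x(X_g) is the Markov blanket of ground atom X_g in x, and c_r is a
% per-predicate weight (e.g. the inverse of the number of groundings of r).
\log P^{\,\bullet}_{w}(X = x) \;=\;
\sum_{r} c_r \sum_{g \in G_r} \log P_w\big( X_g = x_g \mid MB_x(X_g) \big)
```

Because each ground atom is conditioned only on its Markov blanket, this score can be computed without inference over the full ground network, which is what makes it practical as an objective during structure search.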
Generative approaches optimize the joint distribution of all the variables. This can lead to
suboptimal results for predictive tasks because of the mismatch between the objective function
used (likelihood or a function thereof) and the goal of classification (maximizing accuracy or
conditional likelihood). In contrast, discriminative approaches maximize the conditional likelihood of a set of outputs given a set of inputs (Lafferty et al. 2001), and this often produces better results for prediction problems. In (Singla and Domingos 2005) a voted-perceptron-based algorithm for discriminative weight learning of MLNs was shown to greatly outperform
maximum-likelihood and pseudo-likelihood approaches for two real-world prediction prob-
lems. Recently, the algorithm in (Lowd and Domingos 2007), which outperforms the voted perceptron, became the state-of-the-art method for discriminative weight learning of MLNs. However,
both discriminative approaches to MLNs learn weights for a fixed structure, given by a domain
expert or learned through another structure learning method (usually generative). Better results
could be achieved if the structure could be learned in a discriminative fashion. Unfortunately,
the computational cost of optimizing structure and parameters for conditional likelihood is
prohibitive. This dissertation proposes discriminative structure learning algorithms based on
the simple approximation of choosing structures by maximizing conditional likelihood while
setting parameters by maximum likelihood. Structures are scored using the very fast inference algorithm MC-SAT (Poon and Domingos 2006), whose lazy version Lazy-MC-SAT (Poon et al. 2008) greatly reduces memory requirements, while parameters are learned through a
quasi-Newton optimization method like L-BFGS that has been found to be much faster (Sha
and Pereira 2003) than iterative scaling initially used for Markov Networks’ weights learning
(Della Pietra et al. 1997). We show through experiments in two real-world domains that the
proposed algorithm improves over the state-of-the-art algorithm of (Lowd and Domingos 2007)
in terms of conditional likelihood of the query predicates.
Discriminative approaches may not always provide the highest classification accuracy. An
empirical and theoretical comparison of discriminative and generative classifiers (logistic re-
gression and Naïve Bayes (NB)) was given in (Ng and Jordan 2002). It was shown that for small
sample sizes the generative NB classifier can outperform a discriminatively trained model. This
is consistent with the fact that, for the same representation, discriminative training has lower
bias and higher variance than generative training, and the variance term dominates at small sam-
ple sizes (Domingos and Pazzani 1997; Friedman 1997a). For the dataset sizes typically found
in practice, however, the results in (Greiner et al. 2005; Grossman and Domingos 2004; Ng and
Jordan 2002) all support the choice of discriminative training. An experimental comparison
of discriminative and generative parameter training on both discriminatively and generatively
structured Bayesian Network classifiers has been performed in (Pernkopf and Bilmes 2005).
This dissertation presents an experimental comparison between the proposed generative and discriminative structure learning algorithms for MLNs and confirms the results of (Ng and Jordan 2002) in the case of MLNs: on a small dataset the generative algorithm is competitive, while on a larger dataset the discriminative algorithm outperforms the generative one in terms of conditional likelihood.
Maximum a posteriori (MAP) inference in MNs means finding the most likely state of a set
of output variables given the state of the input variables. This problem is NP-hard. For discrim-
inative training, the voted perceptron is a special case in which tractable inference is possible
using the Viterbi algorithm (Collins 2002). In (Singla and Domingos 2005) the voted percep-
tron was generalized to MLNs by replacing the Viterbi algorithm with a weighted SAT solver.
This algorithm is essentially gradient descent, and computing the gradient of the conditional log-likelihood (CLL) requires the number of true groundings of each clause, which can be obtained by finding the MAP state. In linear-chain models the MAP state can be computed by dynamic programming; since for MLNs the MAP state is the state that maximizes the sum of the weights of the
satisfied ground clauses, this state can be efficiently found using a weighted MAX-SAT solver.
The authors in (Singla and Domingos 2005) use the MaxWalkSAT solver (Selman et al. 1996).
This dissertation proposes to use IRoTS as a MAX-SAT solver for performing MAP inference
in MLNs. Extensive experiments in real-world domains show that IRoTS performs better than
the state-of-the-art algorithm for MAP inference in MLNs in terms of solution quality and inference running time.
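The MAP-as-weighted-MAX-SAT view above can be illustrated with a brute-force sketch; the clauses and weights below form a hypothetical toy ground network, not one from the dissertation's experiments. Solvers such as MaxWalkSAT or IRoTS search this space heuristically, whereas the sketch enumerates it exhaustively:

```python
from itertools import product

# Each ground clause: (weight, [(var_index, negated), ...]).
# A clause is satisfied if any of its literals evaluates to True.
# Hypothetical toy network, for illustration only.
clauses = [
    (2.0, [(0, False), (1, True)]),   # 2.0 : x0 v ~x1
    (1.5, [(1, False), (2, False)]),  # 1.5 : x1 v x2
    (0.5, [(2, True)]),               # 0.5 : ~x2
]

def satisfied(clause_lits, state):
    # a literal is true when the variable's value disagrees with its negation flag
    return any(state[v] != neg for v, neg in clause_lits)

def map_state(clauses, n_vars):
    """Brute-force MAP: the state maximizing the total weight of
    satisfied ground clauses (feasible only for tiny networks)."""
    best, best_w = None, float("-inf")
    for state in product([False, True], repeat=n_vars):
        w = sum(wt for wt, lits in clauses if satisfied(lits, state))
        if w > best_w:
            best, best_w = state, w
    return best, best_w
```

Here the MAP state sets x0 and x1 true and x2 false, satisfying all three clauses for a total weight of 4.0.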
Conditional inference in graphical models involves computing the distribution of the query
variables given the evidence and it has been shown to be #P-complete. The most widely used
approach to approximate inference is by using Markov Chain Monte Carlo (MCMC) methods
and in particular Gibbs sampling. One of the problems that arises in real-world applications is that an inference method must be able to handle both the probabilistic and the deterministic dependencies that might hold in the domain. MCMC methods are suitable for handling probabilistic dependencies but give poor results when deterministic or near-deterministic dependencies characterize a domain. Logical methods, on the other hand, such as satisfiability testing, cannot be applied to probabilistic dependencies. One approach that deals with both kinds of dependencies is that of
(Poon and Domingos 2006) where the authors use SampleSAT (Wei et al. 2004) in a MCMC
algorithm to uniformly sample from the set of satisfying solutions. As pointed out in (Wei et al.
2004), SAT solvers find solutions very fast but they may sample highly non-uniformly. MCMC methods, on the other hand, may take exponential time, in terms of problem size, to reach the stationary distribution. For this reason, the authors in (Wei et al. 2004) proposed to use
a hybrid strategy by combining random walk steps with MCMC steps, and in particular with
Metropolis transitions. MC-SAT (Poon and Domingos 2006) is an inference algorithm that
combines ideas from satisfiability (SAT) and MCMC methods. It uses the SampleSAT (Wei
et al. 2004) algorithm as a subroutine to efficiently jump between isolated or near-isolated re-
gions of non-zero probability, while preserving detailed balance. SampleSAT is an extension
to WalkSAT to sample satisfying solutions near-uniformly by combining it with simulated an-
nealing. This dissertation proposes the novel algorithm SampleIRoTS based on the iterated
local search (Loureno et al. 2002) and robust tabu search (RoTS) (Taillard 1991) metaheuris-
tics, that interleaves RoTS steps with simulated annealing ones in an iterated local search. The
dissertation then proposes SampleIRoTS plugged in the novel proposed algorithm MC-IRoTS.
Experimental evaluation shows that on a large number of inference tasks, MC-IRoTS performs
inference faster than the state-of-the-art algorithm for MLNs while maintaining the same qual-
ity of predicted probabilities.
Often the structure of the model is already given by a domain expert and the task is to
learn the parameters of the model. Discriminative weight learning of MLNs has produced much better results for predictive tasks than generative approaches, as the results in (Singla and Domingos 2005) show. In that work the voted-perceptron algorithm was general-
ized to arbitrary MLNs by replacing the Viterbi algorithm with a weighted satisfiability solver.
The new algorithm is essentially gradient descent with an MPE approximation to the expected
sufficient statistics (true clause counts) and these can vary widely between clauses, causing
the learning problem to be highly ill-conditioned, and making gradient descent very slow. In
(Lowd and Domingos 2007) a preconditioned scaled conjugate gradient (PSCG) approach was
shown to outperform the algorithm in (Singla and Domingos 2005) in terms of learning time
and prediction accuracy. This algorithm is based on the scaled conjugate gradient method and
very good results are obtained with a simple approach: per-weight learning rates, with each weight’s learning rate being the global one divided by the corresponding clause’s empirical number of true groundings. This approach was originally proposed in (Moller 1993) for train-
ing neural networks. In each iteration, PSCG takes a step in the diagonalized Newton direction, using samples from the MC-SAT algorithm (Poon and Domingos 2006) to approximate the Hessian for MLNs instead of a line search to choose the step size. In this dissertation, we plug MC-IRoTS into the PSCG algorithm as the sampler used to approximate the Hessian. We show through experiments in the web page classification domain that parameter learning with PSCG, sampling with MC-IRoTS, produces a model whose inferred probabilities yield high classification accuracy. This shows that MC-IRoTS is not only a fast inference algorithm, but can also be used during learning, since it samples uniformly or near-uniformly.
1.2 Overview of this Dissertation
The dissertation is structured as follows:
• Chapter 2 introduces the basic notions and terminology on Probabilistic Graphical Mod-
els and Inductive Logic Programming. It describes learning and inference algorithms for
Bayesian Networks and Markov Networks and gives a detailed description of the differ-
ent ILP learning settings. Finally it describes related work on other Statistical Relational
Learning models.
• Chapter 3 presents in detail Markov Logic and the three tasks of structure learning, parameter learning and inference, together with existing algorithms for each.
• Chapter 4 presents the Generative Structure Learning (GSL) algorithm. It describes the
Iterated Local Search metaheuristic and the choice of its components for the task of
structure learning of MLNs. Finally it presents experimental evaluation of GSL.
• Chapter 5 presents the discriminative structure learning algorithm ILS-DSL based on the Iterated Local Search metaheuristic. It describes how parameters are set and how structures are scored. Two versions of the algorithm are presented, optimizing respectively conditional likelihood and the area under the precision-recall curve. Finally, experimental evaluation of ILS-DSL is presented.
• Chapter 6 presents the discriminative structure learning algorithm RBS-DSL, inspired by the GRASP metaheuristic. It describes how parameters are set and how structures are scored. Two versions of the algorithm are presented, optimizing respectively conditional likelihood and the area under the precision-recall curve. Finally, experimental evaluation of RBS-DSL is presented.
• Chapter 7 introduces the basic notions of the satisfiability problem and how MAP infer-
ence for Markov Logic Networks (MLNs) can be performed using MAX-SAT solvers.
It then presents the Iterated Robust Tabu Search algorithm with the experimental evalu-
ation for the task of MAP inference in MLNs. Then it introduces Markov Chain Monte
Carlo methods and how these can be combined with SAT solvers. It presents the Sam-
pleIRoTS algorithm for uniformly sampling from the set of satisfying assignments of a
clause. Then it presents Markov Chain IRoTS (MC-IRoTS), an algorithm that combines
MCMC with SAT. Finally it presents experiments with MC-IRoTS in two tasks: proba-
bilistic inference on a large variety of MLNs inference problems and parameter learning
with the PSCG algorithm using MC-IRoTS as sampler.
• Chapter 8 reviews the main contributions of this dissertation and outlines directions for
future research.
Chapter 2
Statistical Relational Learning
This chapter introduces basic notions of Probabilistic Graphical Models (PGMs) and Relational Learning approaches. It describes two graphical models, Bayesian Networks (BNs) and Markov Networks (MNs), and their related algorithms. Then notions of Inductive Logic Programming (ILP) are presented, describing the three ILP learning settings. Finally, the last part of the chapter presents different SRL models combining PGMs with ILP. Some other models which are not built upon PGMs but integrate statistical learning in the ILP setting, such as nFOIL, TFOIL and kFOIL, are presented at the end of the chapter to give a complete view of SRL.
2.1 Probabilistic Graphical Models
This section introduces basic notions of Probabilistic Graphical Models (PGMs). As pointed
out in (Jordan 1998), PGMs are a marriage between probability theory and graph theory and
provide a natural tool for dealing with two problems that occur throughout applied mathematics
and engineering – uncertainty and complexity. PGMs are graphs in which nodes represent
random variables, and the arcs represent probabilistic relationships between these variables
(Cowell et al. 1999; Jordan 1998; Pearl 1988). PGMs have several useful properties: they
are a simple way to visualize the structure of a probabilistic model and can be used to design
new models; insights into the properties of the model, including conditional independence, can
be obtained by inspecting the graph; and complex computations required for inference and learning can be expressed in terms of graphical manipulations, with the underlying mathematical expressions carried along implicitly. The graph captures the way in which the joint distribution over all of
the random variables can be decomposed into a product of factors each depending only on a
subset of the variables. BNs are directed graphical models, in which the links of the graphs
have a particular directionality indicated by arrows. The other major class of graphical models
are MNs (also known as Markov random fields) which are undirected graphical models, in
which the links do not carry arrows and have no directional significance. Directed graphs are
useful for expressing causal relationships between random variables, while undirected graphs
are better suited to expressing soft constraints between random variables. The following discusses key aspects of these two graphical models as needed for understanding the statistical relational learning setting. Detailed treatments of these two models can be found in (Bishop 2006; Cowell et al. 1999; Edwards 2000; Jordan 1998).
2.1.1 Bayesian Networks
In order to better present the use of directed graphs to describe probability distributions, let’s
consider first an arbitrary joint distribution p(S, A, H, F) over four variables S, A, H, F. Let’s suppose S stands for Sunny, A stands for Arsonists, H stands for Hot and F stands for Forest on Fire (Figure 2.1).
Figure 2.1: Example of the graph structure of a Bayesian Network
In directed models the notion of independence is more complicated than in undirected models, but this brings several advantages. The most important is that one can regard an arc from
A to B as indicating that A “causes” B. For example in Figure 2.1, Hot causes Fire. This guides
the construction of the graph structure. In addition, directed models can encode deterministic
relationships. In addition to the graph structure, we have to specify the parameters of the model.
A  H  P(F=T)  P(F=F)
F  F   0.0     1.0
T  T   0.99    0.01
T  F   0.9     0.1
F  T   0.9     0.1

P(S=T)  P(S=F)
 0.5     0.5

Table 2.1: Conditional Probability Tables (CPTs) for all variables
For a directed model, we must specify the Conditional Probability Distribution (CPD) at each
node. If the variables are discrete, this can be represented as a table (CPT), which lists the
probability that the child node takes on each of its different values for each combination of
values of its parents. For example, in Figure 2.1, all nodes are binary, i.e., have two possible
values, which we will denote by T (true) and F (false). An important concept for probability
distributions over multiple variables is that of conditional independence (Dawid 1980).
We can see from Figure 2.1 that the event "Forest on Fire" (F = true) has two possible causes: either there is hot weather (H = true) or an arsonist is causing fire (A = true). The strengths of these relationships are given in the table. For example, we see that P(F = true | A = true, H = false) = 0.9, and since each row must sum to one, P(F = false | A = true, H = false) = 1 − 0.9 = 0.1. Since the S node has no parents, its CPT specifies the prior probability that it is sunny (in this case, 0.5). (We are thinking of S as representing the season: if it is sunny, fires are more likely.)
The simplest conditional independence relationship encoded in a BN is the following: a
node is independent of its ancestors given its parents, where the ancestor/parent relationship is
with respect to some fixed topological ordering of the nodes.
Based on the chain rule of probability, the joint probability of all the nodes in the graph above is:

P(S, A, H, F) = P(S) P(A|S) P(H|S, A) P(F|S, A, H)

By using conditional independence relationships, we can rewrite this equation as:

P(S, A, H, F) = P(S) P(A|S) P(H|S) P(F|A, H)

simplifying the third term because H is independent of A given S (hot weather is independent of the fact that an arsonist is in action) and the last term because F is independent of S given A and H.
The conditional independence relationships allow the joint distribution to be represented more compactly. If we had n binary nodes, the full joint would require O(2^n) space to represent, but the factored form would require O(2^k) space, where k = max_i |Par(X_i)| and Par(X_i) is the set of parents of the variable X_i. Fewer parameters also make learning easier.
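The factored representation can be checked numerically. The sketch below uses P(S) and P(F|A, H) from Table 2.1; the values assumed for P(A|S) and P(H|S) are illustrative, since the text does not give them:

```python
from itertools import product

# CPTs for the Sunny/Arsonist/Hot/Fire network.  P(S) and P(F|A,H)
# follow Table 2.1; P(A|S) and P(H|S) are illustrative values that
# are NOT given in the text.
p_s = {True: 0.5, False: 0.5}
p_a_given_s = {True: 0.1, False: 0.05}    # assumed P(A=T | S)
p_h_given_s = {True: 0.7, False: 0.2}     # assumed P(H=T | S)
p_f_given_ah = {(False, False): 0.0, (True, True): 0.99,
                (True, False): 0.9, (False, True): 0.9}  # P(F=T | A, H)

def bernoulli(p_true, value):
    # probability of a binary variable taking `value`
    return p_true if value else 1.0 - p_true

def joint(s, a, h, f):
    """P(S,A,H,F) = P(S) P(A|S) P(H|S) P(F|A,H)."""
    return (bernoulli(p_s[True], s)
            * bernoulli(p_a_given_s[s], a)
            * bernoulli(p_h_given_s[s], h)
            * bernoulli(p_f_given_ah[(a, h)], f))

# The factored distribution sums to 1 over all 16 joint states.
total = sum(joint(*x) for x in product([False, True], repeat=4))
```

Note the parameter saving: the full joint needs 2^4 − 1 = 15 numbers, while the four CPTs need only 1 + 2 + 2 + 4 = 9.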
We can now give in general terms the relationship between a given directed graph and the
corresponding distribution over the variables. The joint distribution defined by a graph is given
by the product, over all of the nodes of the graph, of a conditional distribution for each node
conditioned on the variables corresponding to the parents of that node in the graph. Thus, for a
graph with N nodes, the joint distribution is given by:

p(X) = ∏_{n=1}^{N} p(x_n | Pa(x_n))    (2.1)

where Pa(x_n) denotes the set of parents of x_n, and X = {x_1, ..., x_N}. This key equation expresses the factorization properties of the joint distribution for a directed graphical model.
The directed graphs we are considering are subject to an important restriction: there must be no cycles, in other words no paths that follow the direction of the arrows from node to node and end up back at the starting node. Such graphs are called directed acyclic graphs, or DAGs. This is equivalent to the statement that there exists an ordering of the nodes such that there are no links from any node to any lower-numbered node.
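The equivalence between acyclicity and the existence of such a node ordering suggests a simple check. The following is a standard topological-sort sketch (Kahn's algorithm), applied to the structure of Figure 2.1 as implied by the factorization P(S) P(A|S) P(H|S) P(F|A, H):

```python
from collections import defaultdict, deque

def topological_order(nodes, edges):
    """Kahn's algorithm: return a topological ordering of a directed
    graph, or None if the graph contains a cycle (i.e., not a DAG)."""
    indegree = {n: 0 for n in nodes}
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for c in children[n]:
            indegree[c] -= 1
            if indegree[c] == 0:
                queue.append(c)
    # every node was emitted iff there was no cycle
    return order if len(order) == len(nodes) else None

# The network of Figure 2.1: S -> A, S -> H, A -> F, H -> F.
edges = [("S", "A"), ("S", "H"), ("A", "F"), ("H", "F")]
```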
2.1.2 Markov Networks
A Markov Network (also known as a Markov random field) is a model for the joint distribution of a set of variables X = (X1, X2, ..., Xn) ∈ χ (Della Pietra et al. 1997; Kindermann and Snell 1980). It is composed of an undirected graph G and a set of potential functions. The graph has
a node for each variable, and the model has a potential function φk for each clique in the graph.
A potential function is a non-negative real-valued function of the state of the corresponding
clique. The joint distribution represented by a MN is given by:

P(X = x) = (1/Z) ∏_k φ_k(x_{k})    (2.2)

where x_{k} is the state of the kth clique (i.e., the state of the variables that appear in that clique). Z, known as the partition function, is given by:

Z = ∑_{x∈χ} ∏_k φ_k(x_{k})    (2.3)
A clique is defined as a subset of the nodes in a graph such that there exists a link between
all pairs of nodes in the subset. In other words, the set of nodes in a clique is fully connected.
Furthermore, a maximal clique is a clique where it is not possible to include any other nodes
from the graph in the set without it ceasing to be a clique. The graphical structure in Figure 2.2
contains two maximal cliques, {S, A, F} and {S, H, F}; the other cliques, {S, A}, {S, H}, {S, F}, {H, F}, and {A, F}, are not maximal. We can consider only functions of the maximal cliques,
without loss of generality, because other cliques must be subsets of maximal cliques. For
example, if {S, A , F } is a maximal clique and we define an arbitrary function over this clique,
then including another factor defined over a subset of these variables would be redundant.
Figure 2.2: Example of the graph structure of a Markov Network
MNs are often conveniently represented as log-linear models, with each clique potential
replaced by an exponentiated weighted sum of features of the state, leading to:
P(X = x) = (1/Z) exp(∑_j w_j f_j(x))    (2.4)
A feature may be any real-valued function of the state. We will focus on binary features,
f_j(x) ∈ {0, 1}. In the most direct translation from the potential-function form, there is one feature corresponding to each possible state x_{k} of each clique, with its weight being log(φ_k(x_{k})). This representation is exponential in the size of the cliques.
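A minimal numeric sketch of the log-linear form (Eqs. 2.2–2.4), with the partition function Z computed by exhaustive enumeration; the features and weights below are illustrative, not taken from the text:

```python
import math
from itertools import product

# A tiny log-linear Markov network over three binary variables:
# P(X = x) = (1/Z) exp(sum_j w_j f_j(x)).
# Features and weights are illustrative only.
features = [
    (1.2, lambda x: x[0] == x[1]),   # f1: x0 and x1 agree
    (0.8, lambda x: x[1] == x[2]),   # f2: x1 and x2 agree
]

def score(x):
    # unnormalized measure: exp of the weighted sum of active features
    return math.exp(sum(w for w, f in features if f(x)))

states = list(product([0, 1], repeat=3))
Z = sum(score(x) for x in states)    # partition function, Eq. (2.3)

def prob(x):
    return score(x) / Z              # Eq. (2.4)
```

States in which the variables agree get exponentially more mass than states violating the features; enumeration of Z is of course only feasible for toy models.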
2.1.3 Structure Learning
Structure Learning of Bayesian Networks
Structure learning for graphical models means learning the graph structure from data. For BNs the goal is to learn a DAG that best explains the data. This problem is NP-hard, since the number of DAGs on N variables is super-exponential in N (there is no closed-form formula for this, but there are 543 DAGs on 4 nodes and O(10^18) DAGs on 10 nodes). The maximum likelihood model is a complete graph, since this has the largest number of parameters, and hence fits the
model is a complete graph, since this has the largest number of parameters, and hence fits the
data the best. A well-principled way to avoid this kind of over-fitting is putting a prior on
models, specifying preference for sparse models. Based on Bayes’ rule, the MAP (maximum
a posteriori) model is the one that maximizes:
P(G|D) = P(D|G) P(G) / P(D)    (2.5)

Taking logs of each component of the equation:

log P(G|D) = log P(D|G) + log P(G) + e    (2.6)

where e = −log P(D) is a constant independent of G. The scoring function for each model is given by P(D|G).
Structure learning can be performed under full observability or partial observability. In
the first case local search algorithms can be used efficiently (possibly with multiple restarts).
Since the scoring function is a product of local terms, local search is more efficient, because to
compute the relative score of two models that differ by only a few arcs (i.e., neighbors in the
space), it is only necessary to compute the terms which they do not have in common; the other
terms cancel when taking the ratio. One of the most common methods for learning BNs is that
of (Geiger and Chickering 1995) which performs a search over the space of network structures,
starting from an initial network which may be random, empty, or derived from prior knowl-
edge. At each step, the algorithm generates all variations of the current network that can be
obtained by adding, deleting or reversing a single arc, without creating cycles, and selects the
best one using the Bayesian Dirichlet (BD) score. Bayesian methods can also be used: in a Bayesian approach the goal is to compute the posterior P(G|D). Since the space of graphs is super-exponentially large, a well-principled way is to sample a set of graphs from this distribution. The standard approach is to use an MCMC search procedure. This approach is quite popular and different variants exist (for a review see (Murphy 2001)). In the case of partial observability, one approach
is that of doing a local search inside the M-step of the Expectation-Maximization (EM) algo-
rithm. This is called Structural EM and it is proven to converge to a local maximum of the
Bayesian Information Criterion (BIC) (Friedman 1997b). More on learning BNs can be found in (Buntine 1994; Heckerman 1998).
Structure Learning of Markov Networks
The problem of learning the structure of MNs is a very hard one. Most algorithms for this
task are based on greedy heuristic search which incrementally modifies the model by adding
and possibly deleting features. For example, the approaches in (Della Pietra et al. 1997; Mc-
Callum 2003) add features in order to greedily improve the model likelihood; once a feature is
added, it is never removed. Since the feature addition step is heuristic and greedy, it can cause
the inclusion of unnecessary features, leading therefore to overly complex structures and over-
fitting. Another approach is that of (A. Deshpande and Jordan 2001; Bach and Jordan 2002)
that searches over the space of low-treewidth models. However, the advantage of such models
in practice is unclear. Another method was proposed in (Lee et al. 2006) that is based on the
use of L1-regularization on the weights of the log-linear model. This has the effect of biasing
the model towards solutions where many of the parameters are zero. This formulation converts
the MNs learning problem into a convex optimization problem in a continuous space, which
can be solved using efficient gradient methods.
2.1.4 Parameter Learning
Parameter Learning of Bayesian Networks
When learning only the parameters, the structure is known and fixed and the goal is to
learn, for each node in the network, a probabilistic model of that variable given the values of
its parents. The goal is typically to maximize the likelihood of the training data. If the training
data is complete, this is accomplished simply by counting the co-occurrences of the values of
the node with the various values of its parents. When some of the data values are missing, the
well-known EM algorithm (Dempster et al. 1977) can be employed. EM estimates the CPTs
from the known data, then uses those to estimate the missing values, then uses those to re-
estimate the CPTs, and repeats until convergence.
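The counting step for complete data can be sketched as follows; the records are hypothetical observations for the Fire example, not data from the dissertation:

```python
from collections import Counter

def estimate_cpt(data, child, parents):
    """Maximum-likelihood CPT from complete data: for each parent
    configuration, the fraction of records where the child is True.
    `data` is a list of dicts mapping variable names to booleans."""
    child_true = Counter()
    parent_counts = Counter()
    for record in data:
        cfg = tuple(record[p] for p in parents)
        parent_counts[cfg] += 1
        if record[child]:
            child_true[cfg] += 1
    return {cfg: child_true[cfg] / n for cfg, n in parent_counts.items()}

# Hypothetical complete data for the Fire example.
data = [
    {"A": True,  "H": False, "F": True},
    {"A": True,  "H": False, "F": True},
    {"A": True,  "H": False, "F": False},
    {"A": False, "H": False, "F": False},
]
cpt = estimate_cpt(data, "F", ["A", "H"])
```

With missing values, EM would replace the hard counts above with expected counts computed under the current CPT estimates, then re-run this counting step until convergence.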
Parameter Learning of Markov Networks
For undirected graphical models, the parameters are the clique potentials. Maximum like-
lihood estimates of these can be computed using iterative proportional fitting (Jirousek and
Preucil 1995). In (Della Pietra et al. 1997), MNs weights were learned using iterative scaling.
However, maximizing the likelihood (or posterior) using a quasi-Newton optimization method like L-BFGS was found to be much faster (Sha and Pereira 2003). Second-order, quadratic-convergence methods like L-BFGS are known to be very fast if started near the optimum. In general, the most commonly used objectives are maximum likelihood and maximum conditional likelihood, often with additional parameter priors. There is no closed form for the optimal parameters, but these objectives are convex, so the global optimum can be found using iterative methods, such as simple gradient descent or more sophisticated optimization algorithms (P. 2001; Vishwanathan et al. 2006). Unfortunately, each step of these optimization algorithms requires
the computation of the log partition function and the gradient which in turn requires performing
inference on the model with the current parameters. As MRF inference is computationally ex-
pensive or even intractable, the learning task that executes inference repeatedly is often viewed
as intractable.
One commonly-used approach (Shental et al. 2003; Sutton and McCallum 2005a; Taskar
et al. 2002) is the approximation of the gradient of the maximum likelihood objective through
an approximate inference technique, most often the loopy belief propagation (LBP) (Pearl
1988; Yedidia et al. 2005) algorithm. LBP uses message passing to find fixed points of the
non-convex Bethe approximation to the energy functional (Yedidia et al. 2005). Unfortunately,
for some choices of models, LBP can be highly non-robust, providing wrong answers or not
converging at all. Recently, in (Ganapathi et al. 2008), an approach for combining MRF learn-
ing and Bethe approximation was proposed. They consider the dual of maximum likelihood
MN learning – maximizing entropy with moment matching constraints – and then approximate
both the objective and the constraints in the resulting optimization problem.
2.1.5 Inference
Graphical models specify a complete joint probability distribution (JPD) over all the variables.
Given the JPD, we can answer all possible inference queries by marginalization (summing out
over irrelevant variables). However, the JPD has size O(2^n), where n is the number of nodes, so summing over the JPD takes exponential time. One exact algorithm, Variable Elimination, uses the factored representation of the JPD to do marginalization efficiently.
The key idea is to "push sums in" as far as possible when summing (marginalizing) out irrel-
evant terms. The principle of distributing sums over products can be generalized greatly to
apply to any commutative semiring. This forms the basis of many common algorithms, such
as Viterbi decoding and the Fast Fourier Transform. If we want to compute several marginals
at the same time, we can use Dynamic Programming (DP) to avoid the redundant computa-
tion that would be involved if we used variable elimination repeatedly (Pearl 1988). However,
for real-world problems and models, such as those with repetitive structure, as in multivariate
time-series or image analysis, large induced width makes exact inference very slow. Therefore
approximation techniques must be used. Unfortunately, approximate inference is #P-hard, but, nonetheless, approximation methods often work well in practice. Some of these methods are:
Variational methods, Monte Carlo methods, Loopy belief propagation, Bounded cutset condi-
tioning, Parametric approximation methods. More on exact inference can be found in (Kschis-
chang et al. 2001; McEliece and Aji 2000), whereas for approximate inference in graphical
models more can be found in (Jordan et al. 1999; Murphy et al. 1999).
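The "push sums in" idea behind variable elimination can be sketched on a three-variable chain x1 → x2 → x3; the CPT values below are illustrative:

```python
# "Pushing sums in" on a chain x1 -> x2 -> x3:
# p(x3) = sum_{x2} p(x3|x2) * sum_{x1} p(x1) p(x2|x1),
# so we never build the full 2^3-entry joint table.
# CPT values are illustrative.
p_x1 = {0: 0.6, 1: 0.4}
p_x2_given_x1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_x3_given_x2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}

# Eliminate x1: message m12(x2) = sum_{x1} p(x1) p(x2|x1)
m12 = {x2: sum(p_x1[x1] * p_x2_given_x1[x1][x2] for x1 in (0, 1))
       for x2 in (0, 1)}

# Eliminate x2: p(x3) = sum_{x2} m12(x2) p(x3|x2)
p_x3 = {x3: sum(m12[x2] * p_x3_given_x2[x2][x3] for x2 in (0, 1))
        for x3 in (0, 1)}
```

On a chain of n binary variables this costs O(n) small sums instead of the O(2^n) enumeration of the full joint.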
2.2 First-Order Logic
A first-order knowledge base (KB) is a set of sentences or formulas in first-order logic (FOL)
(Genesereth and Nilsson 1987). Formulas in FOL are constructed using four types of symbols:
constants, variables, functions, and predicates. Constant symbols represent objects in the do-
main of interest. Variable symbols range over the objects in the domain. Function symbols
represent mappings from tuples of objects to objects. Predicate symbols represent relations
among objects in the domain or attributes of objects. A term is any expression representing an
object in the domain. It can be a constant, a variable, or a function applied to a tuple of terms.
An atomic formula or atom is a predicate symbol applied to a tuple of terms. A ground term
is a term containing no variables. A ground atom or ground predicate is an atomic formula
all of whose arguments are ground terms. Formulas are recursively constructed from atomic
formulas using logical connectives and quantifiers. A positive literal is an atomic formula; a
negative literal is a negated atomic formula. A KB in clausal form is a conjunction of clauses,
a clause being a disjunction of literals. A definite clause is a clause with exactly one positive
literal (the head, with the negative literals constituting the body). A possible world or Herbrand
interpretation assigns a truth value to each possible ground predicate.
It is often convenient to convert formulas to a more regular form, typically clausal form (also known as conjunctive normal form, CNF), as defined above. Every KB in FOL
can be converted to clausal form through a sequence of steps. Formulas in clausal form contain
no quantifiers; all variables are implicitly universally quantified. (They are also standardized
apart, i.e., no variable appears in more than one clause.) Existentially quantified variables are
replaced by Skolem functions. A Skolem function is a function of all the universally quantified
variables in whose scope the corresponding existential quantifier appears. To perform inference
in FOL using clausal form, resolution and local satisfiability search can be used. The latter is
applied after propositionalizing the KB (i.e., forming all ground instances of CNF clauses),
and proceeds by repeatedly flipping the truth values of propositions to increase the number of
satisfied clauses. Such procedures are known as SAT solvers.
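The flipping procedure described above can be illustrated with a minimal GSAT-style greedy search over propositionalized clauses. This is only a sketch under invented encodings (clauses as sets of signed literals), not the implementation of any particular solver:

```python
import random

def satisfied(clause, assignment):
    # A ground clause is a set of (atom, sign) literals; it is satisfied
    # if at least one literal agrees with the current truth assignment.
    return any(assignment[atom] == sign for atom, sign in clause)

def local_sat_search(clauses, atoms, max_flips=1000, seed=0):
    """Greedy local search: start from a random assignment and repeatedly
    flip the atom whose flip satisfies the most ground clauses."""
    rng = random.Random(seed)
    assignment = {a: rng.random() < 0.5 for a in atoms}

    def num_satisfied_after_flip(atom):
        assignment[atom] = not assignment[atom]
        n = sum(satisfied(c, assignment) for c in clauses)
        assignment[atom] = not assignment[atom]
        return n

    for _ in range(max_flips):
        if all(satisfied(c, assignment) for c in clauses):
            return assignment            # a model of the ground KB
        best = max(atoms, key=num_satisfied_after_flip)
        assignment[best] = not assignment[best]
    return None                          # no model found within the budget

# propositionalized KB: (p or q) and (not p or q); every model makes q true
clauses = [{('p', True), ('q', True)}, {('p', False), ('q', True)}]
model = local_sat_search(clauses, ['p', 'q'])
```

Real solvers add random restarts and noisy moves (as in WalkSAT) to escape local optima; the greedy scoring above is the simplest variant.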
Because of the computational complexity, KBs are generally constructed using a restricted
subset of FOL where inference and learning is more tractable. The most widely-used restriction
is to Horn clauses, which are clauses containing at most one positive literal. In other words,
a Horn clause is an implication with all positive antecedents, and only one (positive) literal in
the consequent. A program in the Prolog language is a set of Horn clauses. Prolog programs
can be learned from examples (often databases) by searching for Horn clauses that hold in the
data. The field of inductive logic programming (ILP) (Muggleton and De Raedt 1994) deals
exactly with this problem.
2.3 Inductive Logic Programming
Inductive Logic Programming (ILP) and multi-relational data mining are concerned with learn-
ing and mining within first-order logical or relational representations. The main task in ILP is
finding a hypothesis H (a logic program, i.e., a definite clause program) from a set of positive
and negative examples P and N. In particular, it is required that the hypothesis H covers
all positive examples in P and none of the negative examples in N. The representation language
for representing the examples, together with the covers relation, determines the ILP setting
(De Raedt 1997). Overviews of inductive logic learning and multi-relational data mining can
be found in (Dzeroski and Lavrac 2001; Lavrac and Dzeroski 1994; Muggleton and De Raedt
1994). In the following, the three main settings for learning in ILP are discussed. A recent
and more detailed review of these three settings can be found in (De Raedt and Kersting 2003,
2004).
2.3.1 Learning from entailment
Learning from entailment is probably the most popular ILP setting and many well-known ILP
systems such as FOIL (Quinlan 1990), PROGOL (Muggleton 1995) or ALEPH (Srinivasan)
follow this setting. In this setting, examples are definite clauses and an example e is covered
by a hypothesis H, w.r.t. the background theory B, if and only if B ∪ H |= e. Most ILP systems
in this setting require ground facts as examples. They typically proceed following a separate-
and-conquer rule-learning approach (Furnkranz 1999). This means that in the outer loop they
repeatedly search for a rule covering many positive examples and none of the negatives (the set-
covering approach (Mitchell 1997)). In the inner loop, ILP systems generally perform a general-
to-specific heuristic search using refinement operators (Nienhuys-Cheng and de Wolf 1997;
Shapiro 1983) based on θ-subsumption (Plotkin 1970). These operators perform the steps in
the search-space, by making small modifications to a hypothesis. From a logical perspective,
these refinement operators typically realize elementary generalization and specialization steps
(usually under θ -subsumption). More sophisticated systems like PROGOL or ALEPH employ
a search bias to reduce the search space of hypotheses.
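The outer set-covering loop can be sketched as follows. Here `learn_one_rule` is a hypothetical stand-in for the inner general-to-specific refinement search, and the toy "rules" (divisibility tests over integer examples) are invented purely for illustration:

```python
def separate_and_conquer(positives, negatives, learn_one_rule):
    """Outer loop of the set-covering approach: repeatedly learn a rule
    covering some positives and no negatives, then remove the covered
    positives. `learn_one_rule` stands in for the inner search."""
    theory, remaining = [], set(positives)
    while remaining:
        rule, covered = learn_one_rule(remaining, negatives)
        if not covered:        # no rule makes progress: stop
            break
        theory.append(rule)
        remaining -= covered
    return theory

# toy inner search: "rules" are divisibility tests over integer examples
def learn_one_rule(pos, neg):
    for d in (2, 3, 5):
        covered = {e for e in pos if e % d == 0}
        if covered and not any(e % d == 0 for e in neg):
            return ('divisible_by', d), covered
    return None, set()

theory = separate_and_conquer({4, 6, 9}, {7, 25}, learn_one_rule)
```

In a real ILP system the inner search instead refines clauses under θ-subsumption, but the outer control flow has exactly this shape.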
2.3.2 Learning from interpretations
In the ILP setting of learning from interpretations, examples are Herbrand interpretations and
an example e is covered by a hypothesis H, w.r.t. the background theory B, if and only if e is a
model of B∪H. A possible world is described through sets of true ground facts which are the
Herbrand interpretations. Learning from interpretations is generally easier and computationally
more tractable than learning from entailment (De Raedt 1997). This is due to the fact that
interpretations carry much more information than the examples in learning from entailment. In
learning from entailment, examples consist of a single fact, while in interpretations all the facts
that hold in the example are known. The approach followed by ILP systems learning from
interpretations is similar to that of systems that learn from entailment. The most important
difference lies in the generality relationship: in learning from entailment a hypothesis H1 is
more general than H2 if and only if H1 |= H2, while in learning from interpretations H1 is more
general than H2 when H2 |= H1.
A hypothesis H1 is more general than a hypothesis H2 if all examples covered by H2 are also
covered by H1. ILP systems that learn from interpretations are also well suited for learning
from positive examples only (De Raedt and Dehaspe 1997).
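For the ground case, the covers test of this setting (e is covered iff e is a model of B ∪ H) can be sketched as follows. The sketch assumes the clauses of B and H are already ground, each represented as a (head, body) pair, and the example atoms are invented:

```python
def is_model(interpretation, ground_clauses):
    """A Herbrand interpretation (the set of true ground atoms) is a
    model of a set of ground definite clauses iff every clause whose
    body is true in it also has a true head."""
    return all(head in interpretation
               for head, body in ground_clauses
               if all(atom in interpretation for atom in body))

# e is covered by H w.r.t. B iff e is a model of B ∪ H (ground case)
clauses = [('mammal(fido)', ['dog(fido)']), ('dog(fido)', [])]
covered = is_model({'dog(fido)', 'mammal(fido)'}, clauses)
not_covered = not is_model({'dog(fido)'}, clauses)
```

The second interpretation fails because the body of the first clause is true in it while the head is not.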
2.3.3 Learning from proofs
The first learning system to perform a kind of learning from proofs was the Model
Inference System (MIS) (Shapiro 1983). This system normally learned from entailment, but
when information was missing, it queried the user for missing information by asking the truth
value of facts. The answers to these queries would then allow MIS to reconstruct the trace
or the proof for the positive example. Inspired by the work of Shapiro on MIS, the authors
in (De Raedt and Kersting 2004) defined the learning from proofs setting of ILP. In learning
from proofs, the examples are ground proof-trees and an example e is covered by a hypothesis
H w.r.t. the background theory B if and only if e is a proof-tree for H ∪ B. There can be
different possible forms of proof trees. For example, the authors in (De Raedt and Kersting
2004) assume that the proof tree has the form of an and-tree whose nodes contain ground
atoms. They define a proof tree in this way: t is a proof-tree for T if and only if t is a rooted
tree where every node n ∈ t with children child(n) satisfies the property that there exists a
substitution θ and a clause c such that n = head(c)θ and child(n) = body(c)θ.
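This property can be sketched as a recursive check. The sketch below assumes the program is already ground, so the substitution θ is omitted, and the example program is invented:

```python
def is_proof_tree(node, children_of, ground_clauses):
    """Check the and-tree property in the ground case: each node must be
    the head of some ground clause whose body atoms are exactly its
    children (unification with θ is omitted, so clauses are ground)."""
    kids = children_of.get(node, [])
    if not kids:
        # leaves must correspond to facts (clauses with empty bodies)
        return any(h == node and not body for h, body in ground_clauses)
    return (any(h == node and sorted(body) == sorted(kids)
                for h, body in ground_clauses)
            and all(is_proof_tree(k, children_of, ground_clauses)
                    for k in kids))

# ground program: grandparent(a,c) :- parent(a,b), parent(b,c).  plus facts
program = [('grandparent(a,c)', ['parent(a,b)', 'parent(b,c)']),
           ('parent(a,b)', []), ('parent(b,c)', [])]
tree = {'grandparent(a,c)': ['parent(a,b)', 'parent(b,c)']}
ok = is_proof_tree('grandparent(a,c)', tree, program)
```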
2.4 Probabilistic Inductive Logic Programming
Probabilistic Inductive Logic Programming (PILP) can be seen as a field that aims at combining
ILP principles such as refinement operators with statistical learning. The most natural way to
do this is by giving a probabilistic semantics to the three ILP settings. In the following, it is
sketched how each ILP setting can be extended with probabilistic semantics. More on this
extension can be found in (De Raedt and Kersting 2003, 2004).
The first change from ILP is that the cover relation becomes a probabilistic one. Then
clauses become annotated with probability values. A probabilistic covers relation for an ex-
ample e, a hypothesis H and a background theory B returns a probability P. We can write
cover(e, H ∪ B) = P(e|H,B). The latter is the likelihood of the example e. With this covers
relation, the goal of PILP is to find a hypothesis H that maximizes the likelihood of the data
P(E|H,B) where E is the set of examples.
2.4.1 Learning from Probabilistic Entailment
In a probabilistic setting, a logic program C becomes a set of clauses of the form h← bi where h
is an atom and the bi are different bodies of clauses. For each clause in C, P(bi|h) is the
conditional probability that, for a random substitution θ for which hθ is ground
and true (resp. false), the query biθ succeeds (resp. fails). It is assumed that the prior probability of
h is given as P(h), the probability that for a random substitution θ, h is true (resp. false). The
covers relation P(hθ |C) (B is fixed) can thus be defined as:
P(hθ|C) = P(hθ|b1θ,...,bkθ) = P(b1θ,...,bkθ|hθ) × P(hθ) / P(b1θ,...,bkθ)
If we apply the naïve Bayes assumption to this equation we have:
P(hθ|C) = ∏i P(biθ|hθ) × P(hθ) / P(b1θ,...,bkθ)
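A toy numeric instance of this covers relation can be sketched as follows. The probabilities are invented (k = 2 body queries), and the normalizing term P(b1θ,...,bkθ) is obtained by summing the joint over the two truth values of hθ:

```python
from math import prod

def prob_covers(prior_h, like_true, like_false):
    """Naive-Bayes covers relation: P(h|b_1..b_k) is proportional to
    prod_i P(b_i|h) * P(h); the denominator P(b_1,...,b_k) is recovered
    by summing the joint over both truth values of h."""
    joint_true = prior_h * prod(like_true)           # numerator of the rule
    joint_false = (1 - prior_h) * prod(like_false)   # hθ false case
    return joint_true / (joint_true + joint_false)

# invented values: P(h) = 0.5;
# P(b_i succeeds | h true) = 0.9, 0.8; P(b_i succeeds | h false) = 0.1, 0.2
p = prob_covers(0.5, [0.9, 0.8], [0.1, 0.2])
```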
2.4.2 Learning from Probabilistic Interpretations
In order to give a probabilistic semantics to this ILP setting, probabilities must be assigned to
interpretations covered by a logic program. One way of doing this is to consider ground atoms
as random variables that are defined by the underlying definite clause programs (Kersting and
De Raedt 2001a). The authors distinguish between two types of predicates: deterministic and
probabilistic ones. The former are called logical, the latter Bayesian. A Bayesian logic program
is a set of Bayesian (definite) clauses of the form A|A1,...,An, where A is a Bayesian atom,
A1,...,An, n ≥ 0, are Bayesian and logical atoms, and all variables are (implicitly) universally
quantified. To quantify probabilistic dependencies, each Bayesian clause c is annotated with
its conditional probability distribution cpd(c) = P(A|A1, ...,An), which quantifies the proba-
bilistic dependency among ground instances of the clause. A Bayesian logic program
together with the background theory induces a Bayesian network. The random variables A
of the Bayesian network are the Bayesian ground atoms in the least Herbrand model I of the
annotated logic program (for details see (Kersting and De Raedt 2001a)).
2.4.3 Learning from Probabilistic Proofs
Learning from probabilistic proofs is similar to Stochastic Logic Programs (SLPs) (Cussens
2001; Muggleton 1996) (this model will be discussed in the following paragraph). In SLPs,
similar to stochastic context-free grammars, the clauses are annotated with probability labels
in such a way that the probabilities associated with the clauses defining any predicate
sum to 1.0 (less restricted versions have been considered in (Cussens 1999)). SLPs are an example
of learning from entailment because the examples are ground facts entailed by the target SLP,
while in (De Raedt and Kersting 2004) the idea is to learn from proofs, which carry much more
information about the structure of the underlying logic program. The basic element is that
proofs are probabilistic: the probability of a proof of a query predicate q is the product
of the probabilities of the clauses used in the proof of q, assuming proofs are finite
(see (Cussens 1999) for the general case). The probability of a ground atom is then defined
as the sum of the probabilities of all the proofs for that ground atom. An approach for
learning SLPs from proofs is that of (De Raedt et al. 2005), which combines ideas from the early
ILP system Golem [33], which employs Plotkin's (1970) least general generalization
(LGG), with bottom-up generalization of grammars and hidden Markov models (Stolcke and
Omohundro 1993). The resulting algorithm employs the likelihood of the proofs as scoring
function.
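The proof and atom probabilities of an SLP can be sketched as follows; the clause identifiers and labels form an invented toy program:

```python
from math import prod

def proof_probability(proof, labels):
    """Probability of one finite proof: the product of the labels of
    the clauses used in it (`proof` is a list of clause identifiers)."""
    return prod(labels[c] for c in proof)

def atom_probability(proofs, labels):
    """Probability of a ground atom: the sum over all of its proofs."""
    return sum(proof_probability(p, labels) for p in proofs)

# toy normalised SLP for a predicate s with two clauses (invented labels):
# 0.4 : s :- a.   0.6 : s :- b.   plus facts a and b with label 1.0
labels = {'s<-a': 0.4, 's<-b': 0.6, 'a': 1.0, 'b': 1.0}
p = atom_probability([['s<-a', 'a'], ['s<-b', 'b']], labels)
```

Since the two proofs of s are disjoint and exhaustive here, the atom probability sums to 1.0; in general, failed derivations make the treatment subtler (Cussens 1999).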
2.5 SRL and PILP models
Most approaches in PILP start from ILP and extend it with probabilistic semantics. On the
other hand, SRL approaches start from Probabilistic Graphical Models (PGMs) and extend them
with relational representations. In the following, the most well-known SRL and PILP
approaches are presented.
2.5.1 Knowledge-based Model Construction
One of the simplest ways of combining probability and first-order logic is augmenting an ex-
isting first-order (Horn clause) knowledge base with probabilistic information. This is the ap-
proach taken by knowledge-based model construction (KBMC) methods, which derive from
work by Ngo and Haddawy (1997) and earlier work surveyed in (Wellman
and Goldman 1992). Probabilistic logic programs proposed in (Ng and Subrahmanian 1992)
are also similar to this approach. The basic idea of all KBMC approaches is that each
clause in a knowledge base is associated with a set of parameters that specify how the consequent
probabilistically depends on its antecedents. In the simplest case, this is a single parameter that
specifies the probability that the consequent holds given that the antecedents hold. To answer
queries, KBMC constructs from the knowledge base a BN containing the relevant knowledge.
Each grounded predicate that is relevant to the query appears in the BN as a node. Relevant
predicates are found using Prolog backward chaining, except that rather than stopping when
finding a proof tree, KBMC finds every possible proof tree. Further, in order to find all relevant
predicates, backward chaining is performed not only from the query predicate to the evidence
predicates, but also from each evidence predicate to the other evidence predicates and the query
predicate.
2.5.2 Probabilistic Relational Models
Probabilistic relational models (PRMs) (Friedman et al. 1999) are a combination of frame-
based systems and Bayesian networks. Unlike PILP approaches, the authors start
from BN learning approaches and extend these to a rich representation language in order to
deal with both relations and uncertainty. The early idea of PRMs was to allow the properties of
an object to depend probabilistically both on other properties of that object and on properties
of related objects. The authors in (Friedman et al. 1999) generalize the ideas of (Koller and
Pfeffer 1998) on constructing rich probabilistic representations and of a related work based on
description logics (Koller et al. 1997). The major limitation of a BN is its propositional nature:
the entire domain must be known, and probabilistic parameters that could be shared end up
duplicated across many CPTs. What BNs lack is the concept of variable instantiation, which is
common in logic. PRMs achieve exactly this.
A PRM consists of a set of classes Cl1, Cl2, ..., Cln. Each class Cl has a set of attributes A; each
attribute A is denoted Cl.A (for example, Car.color). Each class also has a set of reference
slots R, where each reference slot points to an instance of the same or another class and is
denoted analogously (for example, Car.engine). Reference slots can be composed to form a slot chain
(for example, Car.engine.power refers to a car's engine power). A PRM also defines a probabilistic
relationship between attributes of classes. An attribute may depend on any attribute of the same
class, or of a class that is reachable through some slot-chain. Every PRM can be compiled into
a BN and essentially a PRM can be thought of as a template which, when given a specific
domain of objects, is “compiled” into a BN. Given a PRM and a set of objects, inference is
performed by constructing the corresponding BN and applying standard inference techniques
to it.
As the authors in (Taskar et al. 2002) point out, the need to avoid cycles in PRMs causes
significant representational and computational difficulties. Inference in PRMs is done by cre-
ating the complete ground network, which limits their scalability. PRMs require specifying a
complete conditional model for each attribute of each class, which in large complex domains
can be quite burdensome.
2.5.3 Bayesian Logic Programs
Bayesian Logic Programs (BLPs) fall in the ILP setting of learning from probabilistic entail-
ment introduced in the previous sections. A BLP together with the background theory induces
a BN. The random variables A of the Bayesian network are the Bayesian ground atoms in the
least Herbrand model I of the annotated logic program. This is similar in spirit to the KBMC
approach described above. BLPs are represented by regular clauses, on which the typical
refinement operators from ILP can be applied. However, in BLPs it is required that the exam-
ples are models of the BLP, i.e., cover(H, e) = true if and only if e is a model of H. This is
needed since the set of random variables defined by a BLP corresponds to a Herbrand model.
The requirement is enforced when learning the structure of a BLP by starting from an initial set
of hypotheses that satisfies this requirement and from then on only considering refinements that
do not result in a violation. In addition, acyclicity is enforced by checking, for each refine-
ment and each example, that the induced Bayesian network is acyclic. Scooby (Kersting and
De Raedt 2001a,b) is a greedy hill-climbing approach for learning Bayesian logic programs.
Scooby takes the initial BLP as starting point and computes the parameters maximizing the likeli-
hood. Then, refinement operators that generalize or specialize H, respectively, are used to compute
all legal neighbours of H in the hypothesis space.
2.5.4 Stochastic Logic Programs
Stochastic Logic Programs were first defined by Muggleton in (Muggleton 1996) as general-
izations of Hidden Markov Models (HMMs) and Stochastic Context-Free Grammars (SCFGs).
An SLP is a definite logic program where some of the clauses are labelled with non-negative
numbers. A pure SLP is an SLP where all clauses have labels, while in an impure SLP some
clauses do not have labels. A normalised SLP is one in which the labels of the clauses that share
the same predicate symbol sum to one; if this is not the case, the SLP is said to be unnormalised
(Cussens 2001; Cussens 1999). In normalised SLPs, labels can be considered as probabilities.
In such SLPs, since each clause has an associated probability label, SLD-resolution
with a stochastic selection rule induces a probability distribution over
the atoms of each predicate in the Herbrand base. An SLP defines a probability distribution over
derivations, where the probability of a derivation is given by the product of the labels of the
clauses used in the SLD derivation.
Parameters of SLPs can be learned through the Failure Adjusted Maximisation (FAM)
algorithm (Cussens 2001), while structure learning of SLPs can be performed by learning clauses
with an ILP system and then estimating the parameters with FAM. Another approach to learning
the structure of SLPs is that in (Muggleton 2002), which incrementally learns an additional clause
for a single predicate in an SLP. From an ILP perspective, this corresponds to a typical single
predicate learning setting under entailment. A related approach is that of learning SLPs from
proofs (De Raedt et al. 2005), which corresponds to the third ILP setting of learning from
proofs. In (De Raedt et al. 2005) the authors employ Plotkin's (1970) least general
generalization (LGG) in an ILP system to learn SLPs from proof banks.
2.5.5 PRISM
Programming in Statistical Modelling (PRISM) (Sato and Kameya 1997b, 2001) combines
logic programming and statistical modelling based on the distributional semantics introduced
by Sato (Sato 1995). PRISM programs are not just a probabilistic extension of logic pro-
grams but are also able to learn from examples through the EM (Expectation-Maximization)
algorithm, which is built into the language. PRISM is a formal knowledge represen-
tation language for modeling scientific hypotheses about phenomena that are governed by
rules and probabilities. The parameter learning algorithm provided by the language (Sato and
Kameya 2001) is a new EM algorithm, called the graphical EM algorithm, which, when combined
with tabulated search, has the same time complexity as the EM algorithms developed
independently in each research field: the Baum-Welch algorithm for HMMs, the Inside-Outside
algorithm for PCFGs, and the one for singly connected BNs. Since
PRISM programs can be arbitrarily complex (no restriction on the form or size), the most pop-
ular probabilistic modeling formalisms such as HMMs, PCFGs and BNs can be described by
these programs.
PRISM programs are defined as logic programs with a probability distribution given to
facts that is called the basic distribution. Formally, a PRISM program is P = F ∪ R, where R is a
set of logical rules working behind the observations and F is a set of facts that model the
uncertainty of observations with a probability distribution. Through the built-in graphical EM algorithm
the parameters (probabilities) of F are learned and through the rules this learned probabil-
ity distribution over the facts induces a probability distribution over the observations. The
most appealing feature of PRISM is that it allows users to make probabilistic choices in the
logic program through random switches. A random switch has a name, a space of possible
outcomes, and a probability distribution. Recent advances on PRISM can be found in (Sato
and Kameya 2008).
2.5.6 Relational Dependency Networks
Relational Dependency Networks (RDNs) are dependency networks in which each node’s
probability, conditioned on its Markov blanket, is given by a decision tree over relational at-
tributes (Neville and Jensen 2007). RDNs are an extension of dependency networks (DNs)
(Heckerman et al. 2000) for relational data. DNs are an approximate representation. They ap-
proximate the joint distribution with a set of conditional probability distributions (CPDs) that
are learned independently. This approach to learning results in significant efficiency gains over
exact models. However, because the CPDs are learned independently, DNs are not guaranteed
to specify a consistent joint distribution, i.e., one in which each CPD can be derived from the joint
distribution using the rules of probability. This limits the applicability of exact inference techniques.
RDNs can represent and reason with the cyclic dependencies required to express and ex-
ploit autocorrelation during collective inference. RDNs share certain advantages of relational
undirected models. In (Neville and Jensen 2007) the authors describe a relatively simple
method for structure learning and parameter estimation, which results in models that are easier
to understand and interpret. The primary distinction between RDNs and other existing SRL
models is that RDNs are an approximate model. RDNs approximate the full joint distribution
and thus are not guaranteed to specify a consistent probability distribution. The quality of the
approximation will be determined by the data available for learning: if the models are learned
from large data sets, and combined with Monte Carlo inference techniques, the approximation
should be sufficiently accurate.
2.5.7 nFOIL, TFOIL and kFOIL
These three PILP systems are probabilistic extensions to the ILP system FOIL (Quinlan 1990).
They all fall in the PILP setting of learning from probabilistic entailment. nFOIL (Landwehr
et al. 2005) was the first system in the literature to tightly integrate feature construction and Naïve
Bayes. Such a dynamic propositionalization was shown to be superior to static
propositionalization approaches that use Naïve Bayes only to post-process the rule set. nFOIL
adapts FOIL by using conditional likelihood as the scoring function. A significant difference
with FOIL is, however, that the covered positive examples are not removed. TFOIL (Landwehr
et al. 2007) is similar in spirit but integrates FOIL with Tree-Augmented Naïve Bayes, a
generalization of Naïve Bayes. The authors show in (Landwehr et al. 2007) that TFOIL outper-
forms nFOIL. In a recent approach (Landwehr et al. 2006), the kFOIL system integrates ILP
and support vector learning. kFOIL constructs the feature space by leveraging FOIL search for
a set of relevant clauses. The search is driven by the performance obtained by a support vector
machine based on the resulting kernel. The authors showed that kFOIL improves over nFOIL.
2.5.8 Other models
This paragraph gives an overview of other SRL and PILP models.
Relational Markov models. In this model, the states of the Markov model are labeled
with parameterized predicates (Anderson et al. 2002). Thanks to the first-order representation of
the state, RMMs are better able to use smoothing to combat data scarcity.
Structural Logistic Regression. In structural logistic regression (SLR) (Popescul and
Ungar 2003), the predictors are the output of SQL queries over the input data. This model
tightly integrates the ILP step with the statistical learning step in a dynamic propositionalization
approach similar to nFOIL, the difference being that SLR employs a more advanced (and hence
computationally more expensive) statistical model, namely logistic regression.
Maximum Entropy Modelling with Clausal Constraints. MACCENT (Dehaspe 1997)
falls in the same category as SLR described above, with the difference that it uses maximum
entropy modeling as the statistical model.
Relational Markov networks. RMNs (Taskar et al. 2002) combine MNs with database
queries. The relational structure of the RMN is defined by relational clique templates, which are
essentially SQL queries, and their associated potential functions. Each clique template, when
applied to a database, generates a set of tuples. Each tuple defines a clique in the “unrolled”
ground MN. Since they use one parameter per state of each clique, RMNs are limited to small
cliques.
1BC2. 1BC2 (Flach and Lachiche 2004) is a naïve Bayes classifier for structured data. The
logical component (and hence the features) of such a model are fixed, and only the parameters
are learned using statistical learning techniques.
SAYU. SAYU (Davis et al. 2005) uses a “wrapper” approach where (partial) clauses gener-
ated by the refinement search of an ILP system are proposed as features to a (tree augmented)
naïve Bayes, and incorporated if they improve performance. This means that feature learning
and naïve Bayes are tightly coupled similar to nFOIL. However, SAYU scores features based
on a separate tuning set. The probabilistic model is trained to maximize the likelihood on the
training data, while clause selection is based on the area under the precision-recall curve of the
model on a separate tuning set.
CLP(BN). CLP(BN) (Costa et al. 2008) aims at integrating BNs with constraint logic pro-
gramming. The authors in (Costa et al. 2008) propose the language CLP(BN), show that
this model subsumes PRMs, and show also that algorithms from ILP can be used with minor
modifications to learn CLP(BN) programs from data.
LPADs. Logic Programs with Annotated Disjunctions (Riguzzi 2004; Vennekens et al.
2004) combine logic and probability in an elegant way. Each ground annotated disjunctive
clause represents a probabilistic choice between a number of ground non-disjunctive clauses.
By choosing a head atom for each ground clause of an LPAD, one obtains a normal logic
program called an instance of the LPAD. A probability distribution is defined over the space of
instances by assuming independence between the choices made for each clause.
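This semantics can be sketched by enumerating the instances of a tiny ground LPAD; the clauses and annotations below are invented for illustration:

```python
from itertools import product
from math import prod

def instance_distribution(ground_lpad):
    """Enumerate the instances of a ground LPAD: pick one annotated head
    per ground clause; by the independence assumption, an instance's
    probability is the product of the chosen annotations. Each clause is
    given as a list of (head_atom, probability) pairs."""
    for choice in product(*ground_lpad):
        heads = tuple(h for h, _ in choice)
        yield heads, prod(p for _, p in choice)

# two independent ground disjunctive clauses (invented example):
# 0.5::heads ; 0.5::tails.   and   0.3::red ; 0.7::blue.
lpad = [[('heads', 0.5), ('tails', 0.5)],
        [('red', 0.3), ('blue', 0.7)]]
dist = dict(instance_distribution(lpad))
```

The four instances receive probabilities 0.15, 0.35, 0.15 and 0.35, which sum to one as required.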
Chapter 3
Markov Logic Networks
This chapter presents Markov Logic and how Markov Logic Networks (MLNs) serve as tem-
plates for constructing Markov Networks (MNs). It describes existing algorithms for learning
and inference in MLNs.
3.1 Markov Logic
Markov Logic is a combination of MNs and FOL. A FOL KB is a set of hard constraints on
the set of possible worlds: worlds that violate even one formula, have zero probability. Markov
Logic is based on the idea that these constraints must be soften: when a world violates one
formula in the KB it is less probable, but not impossible. A world is more probable, if it vi-
olates fewer formulas. Each formula in Markov Logic has an associated weight that reflects
how strong a constraint is: the higher the weight, the greater the difference in log probability
between a world that satisfies the formula and one that does not, other things being equal. A
set of formulas in Markov Logic is a Markov Logic Network. In this chapter, we make the
assumption that we are in a finite domain; extending MLNs to infinite domains is a topic of
current work (Singla and Domingos 2007). MLNs allow one to define probability distributions over
possible worlds (Halpern 1990). MLNs are defined as follows:
Definition 3.1.1 A Markov Logic Network (Richardson and Domingos 2006) N is a set of
pairs (Fi, wi), where Fi is a formula in first-order logic and wi is a real number. Together with
a finite set of constants C = {c1, c2, ..., cp} it defines a Markov Network M_{N,C} as follows:
1. M_{N,C} contains one binary node for each possible grounding of each predicate appearing in
N. The value of the node is 1 if the ground predicate is true, and 0 otherwise.
2. M_{N,C} contains one feature for each possible grounding of each formula Fi in N. The value
of this feature is 1 if the ground formula is true, and 0 otherwise. The weight of the feature
is the wi associated with Fi in N. Thus there is an edge between two nodes of M_{N,C} iff the
corresponding ground predicates appear together in at least one grounding of one formula in N.
An MLN can be viewed as a template for constructing MNs. The probability distribution over
possible worlds x specified by the ground MN M_{N,C} is given by:

P(X = x) = (1/Z) exp( ∑_{i=1}^{F} wi ni(x) ) = (1/Z) ∏_i φi(xi)^{ni(x)}     (3.1)
where F is the number of formulas in the MLN, ni(x) is the number of true groundings of Fi
in x and xi is the state of the atoms appearing in Fi. As formula weights increase, an MLN
increasingly resembles a purely logical KB, becoming equivalent to one in the limit of all
infinite weights.
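For a tiny ground MLN this distribution can be computed by brute-force enumeration of worlds. This sketch is only feasible for a handful of ground atoms, and the single formula and its weight are invented for illustration:

```python
from itertools import product
from math import exp

def mln_distribution(formulas, weights, atoms):
    """P(X = x) = (1/Z) exp(sum_i w_i n_i(x)) over all truth assignments
    to the ground atoms. Each formula is given as a function n_i that
    counts its true groundings in a world (a dict atom -> bool)."""
    worlds = [dict(zip(atoms, vals))
              for vals in product([False, True], repeat=len(atoms))]
    unnorm = {tuple(w.items()):
              exp(sum(wi * ni(w) for wi, ni in zip(weights, formulas)))
              for w in worlds}
    Z = sum(unnorm.values())                 # partition function
    return {world: u / Z for world, u in unnorm.items()}

# one ground formula, Smokes(A) => Cancer(A), with weight 1.5 (toy example)
atoms = ['Smokes(A)', 'Cancer(A)']
def n1(w):
    # number of true groundings of the implication (here just one)
    return int((not w['Smokes(A)']) or w['Cancer(A)'])
dist = mln_distribution([n1], [1.5], atoms)
```

The one world that violates the formula gets a strictly smaller probability than the satisfying worlds, but not zero, which is exactly the softening of hard constraints described above.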
The syntax of the formulas in an MLN is the standard syntax of FOL (Genesereth and Nils-
son 1987). Free (unquantified) variables are treated as universally quantified at the outermost
level of the formula. In this dissertation the focus is on MLNs whose formulas are function-
free clauses; domain closure is also assumed (it has been proven that no expressiveness is
lost), ensuring that the generated MNs are finite. In this case, the groundings of a formula are
formed simply by replacing its variables with constants in all possible ways.
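Such grounding can be sketched as follows; the predicates and constants are invented for illustration:

```python
from itertools import product

def groundings(variables, constants):
    """All groundings of a function-free formula: substitute each
    variable with each constant in all possible ways."""
    for combo in product(constants, repeat=len(variables)):
        yield dict(zip(variables, combo))

# Smokes(x) over C = {Anna, Bob}: two groundings of the single variable x
subs_x = list(groundings(['x'], ['Anna', 'Bob']))
# Friends(x, y) over the same constants: 2^2 = 4 groundings
subs_xy = list(groundings(['x', 'y'], ['Anna', 'Bob']))
```

In general a formula with v variables over |C| constants has |C|^v groundings, which is why the size of the ground network grows quickly with the domain.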
Considerations:
1. The predicates in every ground formula form a clique in M_{N,C}, which is not necessarily
a maximal one. The structure of the ground M_{N,C} is constructed as follows: there is an edge
between two nodes of M_{N,C} iff the corresponding ground predicates appear together in at least
one grounding of one formula in N.
2. An MLN without variables is an ordinary MN. Any log-linear model over Boolean
variables can be represented as an MLN, since each state of a Boolean clique is defined by a
conjunction of literals.
3. An MLN is different from an ordinary first-order KB in that it can produce useful results
even if it contains contradictions. An MLN can also be obtained by merging several KBs, even
if they are partly incompatible.
4. If a knowledge base KB is satisfiable, the satisfying assignments are the modes of the
distribution represented by an MLN consisting of KB with all positive weights.
5. Each state of M_{N,C} represents a possible world. A possible world is a set of objects,
functions (mappings from tuples of objects to objects), and relations that hold between the
objects; together with an interpretation, they determine the truth value of each ground atom.
The following assumptions ensure that the set of possible worlds for (N,C) is finite, and that
M_{N,C} represents a unique, well-defined probability distribution over those worlds, irrespective
of the interpretation and domain (Richardson and Domingos 2006): unique names (different
constants refer to different objects) (Genesereth and Nilsson 1987); domain closure (the only
objects in the domain are those representable using the constant and the function symbols in
(N,C)) (Genesereth and Nilsson 1987); known functions (for each function appearing in N, the
value of that function applied to every possible tuple of arguments is known, and is an element
of C).
As pointed out in (Richardson and Domingos 2006), the third assumption allows functions
to be replaced by their values when grounding formulas. Therefore the only ground atoms
to be considered are those having constants as arguments. The infinite number of terms con-
structible from all functions and constants in (N, C) (the “Herbrand universe” of (N, C)) can
be ignored, because each of those terms corresponds to a known constant in C, and atoms
involving them are already represented as the atoms involving the corresponding constants.
The possible groundings of a predicate are obtained simply by replacing each variable in the
predicate with each constant in C, and replacing each function term in the predicate by the
corresponding constant. If a formula contains more than one clause, its weight is divided equally
among the clauses, and a clause’s weight is assigned to each of its groundings.
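The grounding of a single predicate can be sketched as follows. This is a minimal illustration, not part of the original formalism; the helper name and the tuple representation of ground atoms are assumptions made here for clarity.

```python
from itertools import product

def ground_predicate(name, arity, constants):
    """Enumerate all groundings of a predicate by substituting every
    combination of domain constants for its variables."""
    return [(name,) + args for args in product(constants, repeat=arity)]

# With |C| = 2 constants, a binary predicate has 2^2 = 4 groundings.
groundings = ground_predicate("Friends", 2, ["Paolo", "Cesare"])
```

In general a predicate of arity a over |C| constants has |C|^a groundings, which is why ground networks can grow very quickly with the domain size.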
The unique names assumption can be removed by introducing an equality predicate of the
form equals(x,y) and adding the necessary axioms to the MLN. For the second assumption, as
shown in (Richardson and Domingos 2006), there can be different scenarios. When the number
n of unknown objects is known, the domain closure can be removed simply by introducing
n arbitrary new constants. If n is unknown but finite, domain closure can be removed by
introducing a distribution over n, grounding the MLN with each number of unknown objects
and computing the probability P(F) = \sum_{n=0}^{n_{max}} P(n)\, P(F \mid M^{n}_{N,C}), where M^{n}_{N,C} is the ground MLN
with n unknown objects. If n is infinite, it would be necessary to extend MLNs to the infinite
case. If H_{N,C} is the set of all ground terms constructible from the function symbols in N and the
constants in N and C (the “Herbrand universe” of (N,C)), assumption 3 (known functions)
can be removed by treating each element of H_{N,C} as an additional constant and applying the
3. MARKOV LOGIC NETWORKS
same procedure used to remove the unique names assumption. In summary, all three
assumptions can be removed if the domain is finite.
6. An MLN can be viewed as a template for constructing MNs. In different worlds (dif-
ferent sets of constants) it will produce different networks of widely varying size, but all with
certain regularities in structure and parameters, given by the MLN (e.g., all groundings of the
same formula will have the same weight).
7. When weights increase, an MLN increasingly resembles a purely logical KB. In the limit
of all infinite weights, the MLN represents a uniform distribution over the worlds that satisfy
the KB. (A non-uniform distribution could easily be represented using additional formulas with
non-zero weights.)
For MLNs to have the full power of PGMs, it is important that they subsume propositional
graphical models.
Proposition 3.1.1 Every probability distribution over discrete or finite-precision numeric variables
can be represented as a Markov Logic Network. (The proof can be found in (Richardson
and Domingos 2006).)
On the other hand, it is important to preserve the power of FOL. The following proposition
states that FOL is a special case of Markov Logic.
Proposition 3.1.2 If KB is a first-order knowledge base, let N be the MLN obtained by
assigning a weight w to every formula in KB, C be the set of constants appearing in KB, P_w(x)
be the probability assigned to a (set of) possible world(s) x by M_{N,C}, \chi_{KB} be the set of worlds
that satisfy KB, and F be an arbitrary formula in FOL. Then:

1. \forall x \in \chi_{KB}: \lim_{w \to \infty} P_w(x) = |\chi_{KB}|^{-1}

2. For all F: KB \models F \iff \lim_{w \to \infty} P_w(F) = 1
This states that, in the limit of all equal infinite weights, the MLN represents a uniform
distribution over the worlds that satisfy the KB, and all entailment queries can be answered by
computing the probability of the query formula and checking whether it is 1.
A simple example of a first-order KB is given in Figure 3.1. FOL statements are hard
constraints: the formulas state that if someone drinks heavily, he will have a car accident, and
that if friends drink, they are going to have accidents.
Figure 3.1: Example of a knowledge base in first-order logic -
Since FOL statements are not always true in practice, it is necessary to soften these hard
constraints. For example, in practice it is not always true that if someone drinks heavily, he
will have a car accident. Figure 3.2 presents a KB in Markov Logic. As can be seen,
formulas have weights attached and statements are no longer always true. Their degree
of truth depends on the attached weight. For instance, the first formula expresses a stronger
constraint than the second.
Figure 3.2: Example of a knowledge base in Markov Logic -
The simple KB in Figure 3.2, together with a set of constants, defines an MN. For example,
suppose we have two constants in the domain that represent two persons, Paolo and Cesare.
The first step in the construction of the MN is the grounding of each predicate according to
the constants of the domain. A partial grounding is shown in Figure 3.3, where only groundings
of HDrinks and CarAcc are considered. The complete set of nodes is shown in Figure 3.4,
where all the groundings of the predicates represent nodes in the graph.
In the next step, any two nodes whose corresponding predicates appear together in some
ground formula are connected. For example, in Figure 3.5, the nodes HDrinks(P) and CarAcc(P)
are connected through an arc, because the two predicates appear together in the grounding of
the second formula. The complete graph is presented in Figure 3.6.
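The edge-construction step can be sketched as follows. The clause representation below is a simplified stand-in for the grounding of the KB in Figure 3.2, assumed here for illustration only.

```python
from itertools import combinations

# Ground clauses of HDrinks(x) => CarAcc(x) for the constants Paolo and Cesare.
constants = ["Paolo", "Cesare"]
ground_clauses = [[("HDrinks", c), ("CarAcc", c)] for c in constants]

# Two nodes are connected iff their atoms appear together in some ground clause.
edges = set()
for clause in ground_clauses:
    for a, b in combinations(clause, 2):
        edges.add(frozenset((a, b)))
```

Each ground clause thus contributes a clique over its atoms, which is exactly how the graph of Figure 3.6 arises from the groundings of the formulas.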
Figure 3.3: Partial construction of the nodes of the ground Markov Network -
Figure 3.4: Complete construction of the nodes of the ground Markov Network -
Figure 3.5: Connecting nodes whose predicates appear in some ground formula -
Figure 3.6: Complete construction of the structure of the graph for the Markov Network -
3.2 Structure Learning of MLNs
This section presents the main structure learning algorithms, describing in detail the evaluation
functions and search strategies used.
3.2.1 Pseudo-likelihood
MLN weights can be learned by maximizing the likelihood of a relational database. As in ILP,
a closed-world assumption (Genesereth and Nilsson 1987) is made: all ground atoms not
in the database are assumed false. If there are n possible ground atoms, we can represent
a database as a vector x = (x_1, ..., x_i, ..., x_n), where x_i is the truth value of the ith ground atom:
x_i = 1 if the atom appears in the database and x_i = 0 otherwise. Standard methods can be used to
learn MLN weights following Equation 3.1. If the jth formula has n_j(x) true groundings, by
Equation 3.1 the derivative of the log-likelihood with respect to its weight is:
\frac{\partial}{\partial w_j} \log P_w(X = x) = n_j(x) - \sum_{x'} P_w(X = x')\, n_j(x') \quad (3.2)
where the sum ranges over all possible databases x', and P_w(X = x') is P(X = x') computed
using the current weight vector w = (w_1, ..., w_j, ...). Thus, the jth component of the gradient is the
difference between the number of true groundings of the jth formula in the data and its
expectation according to the model. Counting the number of true groundings of a first-order
formula, unfortunately, is a #P-complete problem.
The problem with Equation 3.2 is that not only is the first component intractable, but
computing the expected number of true groundings is intractable as well, requiring inference over
the model. Further, efficient optimization methods also require computing the log-likelihood
itself (Equation 3.1), and thus the partition function Z. This can be done approximately using a
Monte Carlo maximum likelihood estimator (MC-MLE) (Geyer and Thompson 1992). How-
ever, the authors in (Richardson and Domingos 2006) found in their experiments that the Gibbs
sampling used to compute the MC-MLEs and gradients did not converge in reasonable time,
and using the samples from the unconverged chains yielded poor results.
In many other fields, such as spatial statistics, social network modeling and language
processing, a more efficient alternative has been pursued: optimizing pseudo-likelihood
(Besag 1975) instead of likelihood. If x is a possible world (a database or truth assignment)
and x_l is the lth ground atom's truth value, the pseudo-likelihood of x is given by the following
equation (we follow the same notation as the authors in (Richardson and Domingos 2006)):

P^{*}_{w}(X = x) = \prod_{l=1}^{n} P_w(X_l = x_l \mid MB_x(X_l)) \quad (3.3)

where MB_x(X_l) is the state of the Markov blanket of X_l in the data (i.e., the truth values of
the ground atoms it appears with in some ground formula). From Equation 3.1 we have:
P(X_l = x_l \mid MB_x(X_l)) = \frac{\exp\left(\sum_{i=1}^{F} w_i n_i(x)\right)}{\exp\left(\sum_{i=1}^{F} w_i n_i(x_{[X_l=0]})\right) + \exp\left(\sum_{i=1}^{F} w_i n_i(x_{[X_l=1]})\right)} \quad (3.4)
Alternatively, we can take the gradient of the pseudo-log-likelihood:

\frac{\partial}{\partial w_i} \log P^{*}_{w}(X = x) = \sum_{l=1}^{n} \left[ n_i(x) - P_w(X_l = 0 \mid MB_x(X_l))\, n_i(x_{[X_l=0]}) - P_w(X_l = 1 \mid MB_x(X_l))\, n_i(x_{[X_l=1]}) \right] \quad (3.5)
where n_i(x_{[X_l=1]}) is the number of true groundings of the ith formula when X_l = 1 and the
remaining data are unchanged, and similarly for n_i(x_{[X_l=0]}). Computing Equation 3.4 or
3.5 does not require inference over the model. The optimal weights for pseudo-log-likelihood
can be found using the limited-memory BFGS algorithm (Liu and Nocedal 1989).
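Equations 3.3–3.5 can be evaluated from precomputed counts alone, without inference. A minimal sketch, assuming the per-atom counts n_i(x_{[X_l=0]}) and n_i(x_{[X_l=1]}) have already been extracted from the data (function names and data layout are illustrative assumptions):

```python
import math

def conditional(weights, n_x0, n_x1, xl):
    """P(X_l = xl | MB_x(X_l)) as in Equation 3.4: a two-way softmax over
    the counts obtained with the atom forced false (n_x0) or true (n_x1)."""
    s0 = sum(w * n for w, n in zip(weights, n_x0))
    s1 = sum(w * n for w, n in zip(weights, n_x1))
    return math.exp(s1 if xl == 1 else s0) / (math.exp(s0) + math.exp(s1))

def pseudo_log_likelihood(weights, atoms):
    """Log of Equation 3.3: sum of log conditionals over all ground atoms.
    `atoms` is a list of (xl, n_x0, n_x1) triples of precomputed counts."""
    return sum(math.log(conditional(weights, n0, n1, xl))
               for xl, n0, n1 in atoms)
```

With zero weights every conditional is 0.5, so the pseudo-log-likelihood of n atoms is n log 0.5; a quasi-Newton optimizer such as L-BFGS can then maximize this function of the weights.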
3.2.2 Two-step Learning
The first attempt to learn MLN structure was that of (Richardson and Domingos 2006), where
the authors used CLAUDIEN (De Raedt and Dehaspe 1997) in a first step to learn the clauses
of the MLN, and then learned the weights in a second step with a fixed structure. Unlike most
other ILP systems, which learn only Horn clauses, CLAUDIEN is able to learn arbitrary
first-order clauses, making it a good candidate for learning the structure of MLNs. The authors
in (Richardson and Domingos 2006) sped up the computation of the optimal weights through
several techniques:
• In Equation 3.5, the sum can be greatly sped up by ignoring predicates that do not appear in
the ith formula.
• The counts n_i(x), n_i(x_{[X_l=1]}) and n_i(x_{[X_l=0]}) do not change with the weights and need only be
computed once.
• Ground formulas whose truth value is not affected by changing the truth value of any
single literal may be ignored, since then n_i(x) = n_i(x_{[X_l=1]}) = n_i(x_{[X_l=0]}). In particular,
this holds for all clauses with at least two true literals, which can often be the great majority
of ground clauses.
To avoid overfitting, the authors penalized the pseudo-likelihood with a Gaussian prior on
each weight. The results obtained by learning the structure (from scratch or by refining an
existing KB) and then the weights were not better than learning the weights for a hand-coded
KB. This is because CLAUDIEN does not maximize the likelihood of the data, but
uses typical ILP coverage evaluation measures.
3.2.3 Single-step Learning by Optimizing Weighted Pseudo-likelihood
Since CLAUDIEN (like other ILP systems) is designed simply to learn first-order theories that
hold with some accuracy and frequency in the data, and not to maximize the data's likelihood
(and hence the quality of the MLN's probabilistic predictions), the authors in (Kok and
Domingos 2005) proposed an algorithm for learning the structure of MLNs by directly optimizing a
likelihood-type measure in a single step. They showed experimentally that it outperforms the
approach of (Richardson and Domingos 2006).
The authors in (Kok and Domingos 2005) found that the measure used in (Richardson and
Domingos 2006) gives undue weight to the largest-arity predicates, resulting in poor modeling
of the rest. For this reason they defined the weighted pseudo-log-likelihood (WPLL):
\log P^{+}_{w}(X = x) = \sum_{r \in R} c_r \sum_{k=1}^{g_r} \log P_w(X_{r,k} = x_{r,k} \mid MB_x(X_{r,k})) \quad (3.6)
where R is the set of first-order predicates, g_r is the number of groundings of first-order
predicate r, and x_{r,k} is the truth value (0 or 1) of the kth grounding of r. The choice of predicate
weights c_r depends on the user's goals. In (Kok and Domingos 2005) c_r was set to 1/g_r, which
has the effect of weighting all first-order predicates equally. If modeling a predicate is not
important (e.g., because it will always be part of the evidence), its weight can be set to zero.
To combat overfitting, the WPLL was penalized with a structure prior of e^{-\alpha \sum_{i=1}^{F} d_i}, where d_i is the
number of predicates that differ between the current version of the clause and the original one. (If the
clause is new, this is simply its length.) As in (Richardson and Domingos 2006), the authors
in (Kok and Domingos 2005) penalized each weight with a Gaussian prior.
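The effect of the weights c_r = 1/g_r in Equation 3.6 can be sketched over precomputed conditional log-probabilities; the dictionary layout below is an illustrative assumption, not part of the original algorithm.

```python
import math

def wpll(logprobs_by_predicate):
    """Weighted pseudo-log-likelihood (Equation 3.6) with c_r = 1/g_r:
    each first-order predicate contributes the average of the log
    conditional probabilities of its g_r groundings, so predicates with
    many groundings do not dominate the score."""
    return sum(sum(logprobs) / len(logprobs)
               for logprobs in logprobs_by_predicate.values())
```

A predicate with four groundings and one with a single grounding then contribute equally, which is exactly the undue-weight problem of the unweighted measure that the WPLL was designed to fix.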
Regarding search strategy, the authors used beam search to find the best clause to add. The
algorithm starts with the unit clauses and the expert-supplied ones, applies each legal literal
addition and deletion to each clause, keeps the b best ones, applies the operators to those, and
repeats until no new clause improves the WPLL. The chosen clause is the one with the highest
WPLL found in any iteration of the search. If the new clause is a refinement of a hand-coded
one, it replaces it. Since each change must improve the WPLL to be accepted (even though
literals are both added and deleted), no loops can occur.
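The beam search just described can be sketched generically; `refine` and `score` are caller-supplied stand-ins for the legal literal additions/deletions and for the WPLL of a candidate clause, so this is an abstraction, not the actual implementation.

```python
def beam_search(initial_clauses, refine, score, b=5, max_iters=20):
    """Beam search over clauses in the spirit of (Kok and Domingos 2005):
    apply every legal refinement to each clause in the beam, keep the b
    best candidates, and repeat until no new clause improves the score.
    Returns the best-scoring clause found in any iteration."""
    beam = list(initial_clauses)
    best = max(beam, key=score)
    for _ in range(max_iters):
        candidates = {c2 for c in beam for c2 in refine(c)}
        if not candidates:
            break
        beam = sorted(candidates, key=score, reverse=True)[:b]
        if score(beam[0]) <= score(best):
            break  # no new clause improves the evaluation function
        best = beam[0]
    return best
```

Because every accepted step must strictly improve the score, the search terminates even though refinements can both add and delete literals.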
As pointed out in (Kok and Domingos 2005) a potentially serious problem that arises when
evaluating candidate clauses using WPLL is that the optimal (maximum WPLL) weights need
to be computed for each candidate. Since this involves numerical optimization, and needs to be
done millions of times, it could easily make the algorithm too slow. In (Della Pietra et al. 1997;
McCallum 2003) the problem is addressed by assuming that the weights of previous features
do not change when testing a new one. Surprisingly, the authors in (Kok and Domingos 2005)
found this to be unnecessary if the very simple approach of initializing L-BFGS with the current
weights (and zero weight for a new clause) is used. Although in principle all weights could
change as the result of introducing or modifying a clause, in practice this is very rare. Second-
order, quadratic-convergence methods like L-BFGS are known to be very fast if started near
the optimum (Sha and Pereira 2003). This is what happened in (Kok and Domingos 2005):
L-BFGS typically converges in just a few iterations, sometimes one.
Experimental evaluation showed that learning the structure in a single step greatly improved
over other methods, such as purely ILP-based methods, purely probabilistic methods, or the
two-step structure learning approach of (Richardson and Domingos 2006).
3.2.4 Bottom-up Learning
The algorithm in (Kok and Domingos 2005) follows a top-down approach based on a generate-
and-test strategy, which blindly generates many potential candidates independently of the
training data and then tests them for fitness on the data. For MLNs the space of potential model
revisions is combinatorially explosive, and such a search can become intractable when following
a top-down strategy. In ILP many attempts have been made to use the data to guide the search
for good candidates. These methods follow a bottom-up approach by using the training data to
construct hypotheses (Muggleton and Feng 1992). Inspired by these approaches, the authors in
(Mihalkova and Mooney 2007) propose Bottom-Up Structure Learning (BUSL), a bottom-up
approach for learning the structure of MLNs. The algorithm uses a “propositional” MN struc-
ture learner to construct “template” networks that guide the construction of candidate clauses.
The basic idea of BUSL is to first automatically create a MN template from the provided data
and then use the nodes in this template as components for clauses that can contain one or more
literals that are connected by a shared variable. In an MN, a node is independent of all
other nodes given its immediate neighbors (i.e., its Markov blanket), and every probability
distribution respecting the independencies captured by the graph of an MN can be represented
as a product of functions defined only over the cliques of the graph. Therefore, to specify the
probability distribution over an MN template, the algorithm needs to consider only clauses defined
over the cliques of the template. BUSL does exactly this. It uses MN templates to restrict the
search space for clauses only to those candidates whose literals correspond to nodes that form
a clique in the template. In this way, it generates fewer candidates for evaluation. Even though
BUSL evaluates fewer candidates, after initially scoring all of them the algorithm attempts
to add them one by one to the MLN, thus changing the MLN at almost every step, which greatly
slows down the computation of the WPLL. This is the algorithm's main drawback regarding
speed. Regarding accuracy, the results in (Mihalkova and Mooney 2007) clearly show that
BUSL outperforms the top-down approach in terms of conditional log-likelihood (CLL) and
area under the precision-recall curve (AUC).
3.3 Parameter Learning of MLNs
Parameter learning for MNs and MLNs can be divided into generative and discriminative
approaches. Generative approaches optimize the joint probability distribution of all the variables.
In contrast, discriminative approaches maximize the conditional likelihood of a set of outputs
given a set of inputs (Lafferty et al. 2001), which often produces better results for prediction
problems. This section presents the main approaches for learning MLN weights.
3.3.1 Generative approaches
Generative approaches for MLNs optimize likelihood or pseudo-likelihood. Both of these
approaches were proposed in (Richardson and Domingos 2006). As introduced in the previous
section, the difficulty with Equation 3.2 is that not only is the first component intractable, but
computing the expected number of true groundings is intractable as well, requiring inference
over the model. Moreover, efficient optimization methods also require computing the log-
likelihood itself (Equation 3.1), and thus the partition function Z. The authors in (Richardson
and Domingos 2006) used a Monte Carlo maximum likelihood estimator (MC-MLE) (Geyer
and Thompson 1992). However, they found in their experiments that the Gibbs sampling used
to compute the MC-MLEs and gradients did not converge in reasonable time, and using the
samples from the unconverged chains yielded poor results. For this reason, pseudo-likelihood
was considered a better choice, since it does not require inference during learning. Because
the (pseudo-)likelihood is a concave function of the weights, the optimal weights can be found
efficiently using standard gradient-based or quasi-Newton optimization methods (Nocedal and Wright 1999). In
(Richardson and Domingos 2006) the optimal weights for pseudo-log-likelihood were found
using the limited-memory BFGS (L-BFGS) algorithm (Liu and Nocedal 1989).
3.3.2 Discriminative Approaches
Since generative approaches optimize the joint distribution of all the variables, there is a
mismatch between the objective function used (likelihood or a function thereof) and the goal of
classification (maximizing accuracy or conditional likelihood). This can often lead to suboptimal
results for predictive tasks in which it is known a priori which predicates will be evidence
and which ones will be queried, and the goal is to correctly predict the query predicate given
the evidence. If we partition the ground atoms in the domain into a set of evidence atoms E
and a set of query atoms Q, the conditional likelihood of Q given E is:
P(q \mid e) = \frac{1}{Z_x} \exp\left(\sum_{i \in F_q} w_i n_i(e,q)\right) = \frac{1}{Z_x} \exp\left(\sum_{j \in G_q} w_j g_j(e,q)\right) \quad (3.7)
where F_q is the set of all MLN clauses with at least one grounding involving a query atom,
n_i(e,q) is the number of true groundings of the ith clause involving query atoms, G_q is the set
of ground clauses in the ground MN involving query atoms, and g_j(e,q) = 1 if the jth ground
clause is true in the data and 0 otherwise. When some variables are “hidden” (i.e., neither query
nor evidence) the conditional likelihood should be computed by summing them out (here for
clarity we treat all non-evidence variables as query variables). The gradient of the conditional
log-likelihood (CLL) is given by:
\frac{\partial}{\partial w_i} \log P_w(q \mid e) = n_i(e,q) - \sum_{q'} P_w(q' \mid e)\, n_i(e,q') = n_i(e,q) - E_w[n_i(e,q)] \quad (3.8)
Computing the expected counts E_w[n_i(e,q)] is intractable. However, they can be approximated
by the counts n_i(e, q*_w) in the MAP state q*_w, the most probable state of the query atoms given
the evidence. Thus, computing the gradient requires only MAP inference to find q*_w, which is
much faster than the full conditional inference needed to compute
E_w[n_i(e,q)]. This approach was successfully used in (Collins 2002) for a special case of
MNs where the query nodes form a linear chain. In this case the MAP state can be found us-
ing the Viterbi algorithm (Rabiner 1989) and the voted perceptron algorithm in (Collins 2002)
follows this approach. To generalize this method to arbitrary MLNs it is necessary to replace
the Viterbi algorithm with a general-purpose algorithm for MAP inference in MLNs. From
Equation 3.7 we can see that, since q*_w is the state that maximizes the sum of the weights of
the satisfied ground clauses, it can be found using a weighted MAX-SAT solver. The authors in (Singla
and Domingos 2005) generalized the voted perceptron algorithm to arbitrary MLNs by replacing
the Viterbi algorithm with the MaxWalkSAT solver (Kautz et al. 1997b). Given an MLN
and a set of evidence atoms, the KB to be passed to MaxWalkSAT is formed by constructing
all groundings of clauses in the MLN involving query atoms, replacing the evidence atoms in
those groundings by their truth values, and simplifying.
However, unlike the Viterbi algorithm, MaxWalkSAT is not guaranteed to reach the global
MAP state. This can potentially lead to errors in the weight estimates produced. The quality
of the estimates can be improved by running a Gibbs sampler starting at the state returned by
MaxWalkSAT, and averaging counts over the samples. If the Pw(q|e) distribution has more than
one mode, doing multiple runs of MaxWalkSAT followed by Gibbs sampling can be helpful.
This approach is followed by the algorithm in (Singla and Domingos 2005), which is essentially
gradient descent.
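The resulting learner can be sketched abstractly; `map_counts_fn` stands in for a MaxWalkSAT call returning the clause counts n_i(e, q*_w) in the approximate MAP state, and the averaging ("voting") follows the voted perceptron. This is a toy sketch of the update rule, not the actual implementation.

```python
def voted_perceptron(init_weights, true_counts, map_counts_fn, eta=0.1, epochs=10):
    """Approximate CLL gradient ascent (Equation 3.8): the expected counts
    E_w[n_i] are replaced by the counts in the MAP state returned by
    `map_counts_fn`, and the weights from all epochs are averaged."""
    w = list(init_weights)
    summed = [0.0] * len(w)
    for _ in range(epochs):
        map_counts = map_counts_fn(w)  # n_i(e, q*_w), e.g. via MaxWalkSAT
        w = [wi + eta * (t - m) for wi, t, m in zip(w, true_counts, map_counts)]
        summed = [s + wi for s, wi in zip(summed, w)]
    return [s / epochs for s in summed]
```

When the MAP counts already match the data counts, the approximate gradient is zero and the weights stop moving, mirroring the fixed point of Equation 3.8.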
Weight learning in MLNs is a convex optimization problem, and thus gradient descent
is guaranteed to find the global optimum. However, convergence to this optimum may be too
slow. The sufficient statistics for MLNs are the number of true groundings of each clause. Since
this number can easily vary by orders of magnitude from one clause to another, a learning rate
that is small enough to avoid divergence in some weights may be too small for fast convergence
in others. This is an instance of the well-known problem of ill-conditioning in numerical
optimization, and many candidate solutions for it exist (Nocedal and Wright 1999). However,
most of these are not easily applicable to MLNs because of the nature of the function to be
optimized.
Another approach for discriminative weight learning of MLNs was proposed in (Lowd and
Domingos 2007). This work uses conjugate gradient (Shewchuk 1994). Gradient
descent can be sped up by performing a line search to find the optimum along the chosen
descent direction, instead of taking a small step of constant size at each iteration. This can still be
inefficient on ill-conditioned problems, since line searches along successive directions tend to
partly undo the effect of each other: each line search makes the gradient along its direction
zero, but the next line search will generally make it non-zero again. This can be solved by
imposing at each step the condition that the gradient along previous directions remain zero.
The directions chosen in this way are called conjugate, and the method conjugate gradient. In
(Lowd and Domingos 2007), the authors used the Polak-Ribiere formula for choosing conjugate
directions, since it has generally been found to be the best-performing one.
Conjugate gradient methods are among the most efficient ones, on a par with quasi-Newton
ones. Unfortunately, as the authors point out in (Lowd and Domingos 2007), applying them to
MLNs is difficult, because line searches require computing the objective function, and there-
fore the partition function Z, which is intractable. Fortunately, the Hessian can be used instead
of a line search to choose a step size. This method is known as scaled conjugate gradient
(SCG), and was proposed in (Moller 1993) for training neural networks. In (Lowd and
Domingos 2007), the step size was chosen using the Hessian, in a manner similar to a diagonal Newton method.
Conjugate gradient methods are often more effective with a preconditioner, a linear transfor-
mation that attempts to reduce the condition number of the problem (Sha and Pereira 2003).
Good preconditioners approximate the inverse Hessian. In (Lowd and Domingos 2007), the
authors used the inverse diagonal Hessian as preconditioner and called the SCG algorithm Pre-
conditioned SCG (PSCG). PSCG was shown to outperform the voted perceptron algorithm of
(Singla and Domingos 2005) on two real-world domains both for CLL and AUC. For the same
learning time, PSCG learned much more accurate models.
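A single direction update of such a preconditioned Polak-Ribiere scheme might look as follows. This is a toy sketch under the assumption that the gradient and the diagonal of the Hessian are supplied by the caller; it is not a reconstruction of the actual PSCG implementation, which also derives the step size from the Hessian.

```python
def pr_direction(grad, prev_grad, prev_dir, inv_hess_diag):
    """One preconditioned Polak-Ribiere conjugate-gradient direction:
    scale the gradient by the inverse diagonal Hessian, then mix in the
    previous search direction with the (non-negative) PR coefficient."""
    precond = [m * g for m, g in zip(inv_hess_diag, grad)]
    num = sum(p * (g - pg) for p, g, pg in zip(precond, grad, prev_grad))
    den = sum(m * pg * pg for m, pg in zip(inv_hess_diag, prev_grad))
    beta = max(0.0, num / den) if den else 0.0
    return [p + beta * d for p, d in zip(precond, prev_dir)]
```

Clamping beta at zero restarts the method with a steepest-ascent step whenever the gradients suggest the conjugacy has been lost, a common safeguard for Polak-Ribiere.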
3.4 Inference in MLNs
This section introduces the inference tasks for MLNs and the related existing algorithms.
3.4.1 MAP Inference
Maximum a posteriori (MAP) inference means finding the most likely state of a set of output
variables given the state of the input variables. As introduced in the previous Section, the MAP
state for a MLN is the state that maximizes the sum of the weights of the satisfied ground
clauses. This state can be efficiently found using a weighted MAX-SAT solver. The authors
in (Singla and Domingos 2005) use the MaxWalkSAT solver (Kautz et al. 1997b) to find the
MAP state and use it in a gradient descent method to compute the number of true groundings
of clauses.
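A minimal MaxWalkSAT sketch follows, with clauses in a DIMACS-like signed-integer encoding; this compact encoding, the all-false initial state, and the recomputation of the full cost at every flip are simplifications for illustration, not the solver's actual design.

```python
import random

def max_walksat(clauses, weights, n_vars, p=0.5, max_flips=10000, seed=0):
    """Minimal MaxWalkSAT (Kautz et al. 1997): repeatedly pick an unsatisfied
    clause; with probability p flip a random variable in it, otherwise flip
    the variable that minimizes the total weight of unsatisfied clauses.
    A literal is a signed int: -3 means "variable 3 is false"."""
    rng = random.Random(seed)
    state = [False] * (n_vars + 1)  # 1-indexed truth assignment

    def satisfied(clause):
        return any(state[abs(lit)] == (lit > 0) for lit in clause)

    def cost():
        return sum(w for c, w in zip(clauses, weights) if not satisfied(c))

    best_state, best_cost = list(state), cost()
    for _ in range(max_flips):
        unsat = [c for c in clauses if not satisfied(c)]
        if not unsat:
            best_state, best_cost = list(state), 0.0
            break
        clause = rng.choice(unsat)
        if rng.random() < p:
            var = abs(rng.choice(clause))
        else:
            def cost_if_flipped(v):
                state[v] = not state[v]
                c = cost()
                state[v] = not state[v]
                return c
            var = min({abs(lit) for lit in clause}, key=cost_if_flipped)
        state[var] = not state[var]
        if cost() < best_cost:
            best_state, best_cost = list(state), cost()
    return best_state, best_cost
```

The mix of random and greedy flips is what lets the search escape local optima while still converging on low-cost assignments.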
Propositionalization is the process of replacing a first-order KB by an equivalent proposi-
tional one. For finite domains, this can be done by replacing each universally (existentially)
quantified formula with a conjunction (disjunction) of all its groundings. A first-order KB is
satisfiable iff the equivalent propositional KB is satisfiable. Thus, inference over a first-order
KB can be performed by propositionalization followed by satisfiability testing.
Stochastic Local Search (Hoos and Stutzle 2005) methods have made much progress in
solving hard combinatorial problems. However, fully instantiating a finite first-order theory re-
quires memory on the order of the number of constants raised to the arity of the clauses, which
significantly limits the size of domains where it remains feasible. In (Singla and Domingos
2006a) a powerful algorithm called LazySAT was proposed that avoids this blowup by taking
advantage of the extreme sparseness that is typical of relational domains (i.e., only a small
fraction of ground atoms are true, and most clauses are trivially satisfied). LazySAT grounds
clauses lazily. At each step in the search it adds only those clauses that could become unsat-
isfied. In contrast, WalkSAT grounds all possible clauses at the outset, consuming time and
memory exponential in their arity.
LazySAT takes as input an MLN (or a pure first-order KB) and a database, which is a set of
ground atoms. An evidence atom is either a ground atom in the database, or a ground atom that
is false by the closed world assumption (Genesereth and Nilsson 1987). The truth values of
evidence atoms are fixed throughout the search, and ground clauses are simplified by removing
the evidence atoms. LazySAT maintains a set of active atoms and a set of active clauses. A
clause is active if it can be made unsatisfied by flipping zero or more of its active atoms. An
atom is active if it is in the initial set of active atoms, or if it was flipped at some point in the
search. The initial active atoms are all those appearing in clauses that are unsatisfied if only the
atoms in the database are true, and all others are false. At each step in the search, the variable
that is flipped is activated together with any clauses that by definition should become active as
a result. Experiments in (Singla and Domingos 2006a) showed that LazySAT greatly reduces
memory requirements compared to WalkSAT, without sacrificing speed or solution quality.
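The initialization step just described can be sketched as follows; the clause representation (lists of signed ground atoms) is an illustrative assumption, not LazySAT's internal data structure.

```python
def initial_active_atoms(ground_clauses, db_true_atoms):
    """LazySAT initialization sketch: under the closed-world assumption,
    an atom is true iff it is in the database. The initial active atoms
    are those appearing in clauses unsatisfied by that assignment.
    A clause is a list of (atom, sign) pairs; sign=True means positive."""
    active = set()
    for clause in ground_clauses:
        if not any((atom in db_true_atoms) == sign for atom, sign in clause):
            active.update(atom for atom, _ in clause)
    return active
```

Because relational domains are extremely sparse, most ground clauses are satisfied by the closed-world assignment and never need to be materialized, which is where the memory savings come from.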
3.4.2 Conditional Inference
Conditional inference in graphical models involves computing the distribution of the query
variables given the evidence and it has been shown to be #P-complete. The most widely used
approach to approximate inference is by using MCMC methods (Gilks et al. 1996) and in
particular Gibbs sampling which proceeds by sampling each variable in turn given its Markov
blanket (the variables it appears in some potential with). To generate samples from the correct
distribution, it suffices that the Markov chain satisfy ergodicity and detailed balance.
If F_1 and F_2 are two formulas in FOL, C is a finite set of constants including any constants
that appear in F_1 or F_2, and N is an MLN, then
P(F_1 \mid F_2, N, C) = P(F_1 \mid F_2, M_{N,C}) = \frac{P(F_1 \wedge F_2 \mid M_{N,C})}{P(F_2 \mid M_{N,C})} = \frac{\sum_{x \in \chi_{F_1} \cap \chi_{F_2}} P(X = x \mid M_{N,C})}{\sum_{x \in \chi_{F_2}} P(X = x \mid M_{N,C})} \quad (3.9)
where \chi_{F_i} is the set of worlds in which F_i holds, and P(x \mid M_{N,C}) is given by Equation 3.1.
Ordinary conditional queries in graphical models are the special case of Equation 3.9 in which
all predicates in F_1, F_2 and N are zero-arity and the formulas are conjunctions.
The computation of Equation 3.9 is intractable even for very small domains. Probabilistic
inference is #P-complete, and logical inference is NP-complete even in finite domains, so MLN
inference is no easier in the worst case. The hope is that, since MLNs allow
fine-grained encoding of knowledge, including context-specific independences, inference may
in some cases be more efficient than inference in an ordinary graphical model for the same
domain.
In theory, P(F1|F2,N,C) can be approximated using an MCMC algorithm that rejects all
moves to states where F2 does not hold, and counts the number of samples in which F1 holds.
However, even this is likely to be extremely slow for arbitrary formulas. What is interesting
for practical purposes is the case when F1 and F2 are conjunctions of ground literals, although
lifted inference is an active research area (Poole 2003; Singla and Domingos. 2008).
The authors in (Richardson and Domingos 2006) propose an algorithm that works in two
phases. The first phase returns the minimal subset M of the ground MN required to compute
P(F_1 \mid F_2, N, C). The second phase performs inference on this network, with the nodes in F_2 set
to their values in F_2. A possible method is Gibbs sampling, but any inference method may be
used. The basic Gibbs step consists of sampling one ground atom given its Markov blanket.
The probability of a ground atom X_l when its Markov blanket B_l is in state b_l is given by:
P(X_l = x_l) = \frac{\exp\left(\sum_{f_i \in F_l} w_i f_i(X_l = x_l, B_l = b_l)\right)}{\exp\left(\sum_{f_i \in F_l} w_i f_i(X_l = 0, B_l = b_l)\right) + \exp\left(\sum_{f_i \in F_l} w_i f_i(X_l = 1, B_l = b_l)\right)} \quad (3.10)
where F_l is the set of ground formulas that X_l appears in, and f_i(X_l = x_l, B_l = b_l) is the
value (0 or 1) of the feature corresponding to the ith ground formula when X_l = x_l and B_l = b_l.
As pointed out in (Richardson and Domingos 2006), for sets of atoms of which exactly one
is true (e.g., the possible values of an attribute), blocking can be used (i.e., one atom is set to
true and the others to false in one step, by sampling conditioned on their collective Markov
blanket). The estimated probability of a conjunction of ground literals is simply the fraction of
samples in which the ground literals are true, after the Markov chain has converged.
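A single Gibbs step over a ground atom, per Equation 3.10, can be sketched as follows; representing ground formulas as boolean functions of the state is an illustrative simplification.

```python
import math
import random

def gibbs_step(state, atom, formulas, weights, rng):
    """Resample ground atom `atom` given its Markov blanket (Equation 3.10).
    `formulas` are the ground formulas the atom appears in, as boolean
    functions of the state; all other atoms stay fixed."""
    def total_weight(value):
        state[atom] = value
        return sum(w for f, w in zip(formulas, weights) if f(state))
    e0, e1 = math.exp(total_weight(False)), math.exp(total_weight(True))
    state[atom] = rng.random() < e1 / (e0 + e1)
    return state
```

Repeating this step over all ground atoms yields the Markov chain whose converged sample fractions estimate the query probabilities.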
One of the problems that arises in real-world applications is that an inference method must
be able to handle both the probabilistic and the deterministic dependencies that might hold in
the domain. MCMC methods are suitable for handling probabilistic dependencies, but give poor
results when deterministic or near-deterministic dependencies characterize a domain. Logical
methods, on the other hand, such as satisfiability testing, cannot be applied to probabilistic
dependencies. One
approach to deal with both kinds of dependencies is that of (Poon and Domingos 2006) where
the authors use SampleSAT (Wei et al. 2004) in a MCMC algorithm to uniformly sample from
the set of satisfying solutions. As pointed out in (Wei et al. 2004), SAT solvers find solutions
very fast but they may sample highly non-uniformly. On the other side, MCMC methods may
take exponential time, in terms of problem size, to reach the stationary distribution. For this
reason, the authors in (Wei et al. 2004) proposed to use a hybrid strategy by combining random
walk steps with MCMC steps, and in particular with Metropolis transitions. This permits
to efficiently jump between isolated or near-isolated regions of non-zero probability, while
preserving detailed balance.
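The hybrid strategy can be illustrated with a small sketch: with some probability the chain takes a random-walk step that repairs an unsatisfied clause, and otherwise a Metropolis step whose acceptance rule preserves detailed balance. This is only in the spirit of SampleSAT; the CNF encoding and the parameter values are illustrative:

```python
import math
import random

def hybrid_step(assignment, clauses, p_walk=0.5, temperature=0.1, rng=random):
    """One hybrid sampling step: random-walk repair or Metropolis transition.

    assignment: dict variable -> bool; clauses: lists of signed integers
    (positive = positive literal). Both encodings are illustrative.
    """
    def unsat(a):
        return [c for c in clauses if not any((lit > 0) == a[abs(lit)] for lit in c)]

    def flipped(a, var):
        b = dict(a)
        b[var] = not b[var]
        return b

    broken = unsat(assignment)
    if broken and rng.random() < p_walk:
        # Random-walk step: flip a variable of a random unsatisfied clause.
        return flipped(assignment, abs(rng.choice(rng.choice(broken))))
    # Metropolis step: flip a random variable and accept with probability
    # min(1, exp(-dE/T)), where the energy E counts unsatisfied clauses.
    # This acceptance rule is what preserves detailed balance.
    var = rng.choice(sorted(assignment))
    proposal = flipped(assignment, var)
    d_energy = len(unsat(proposal)) - len(broken)
    if d_energy <= 0 or rng.random() < math.exp(-d_energy / temperature):
        return proposal
    return assignment
```

On a trivial instance the chain quickly reaches the satisfying region and, at low temperature, rarely leaves it.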
Experimental results in (Poon and Domingos 2006) show that MC-SAT greatly outperforms
Gibbs sampling and simulated tempering. Recently, a lazy version, Lazy-MC-SAT (Poon et al. 2008), was shown to greatly reduce memory requirements for the inference task. Experimental
evaluation in (Poon et al. 2008) shows that it reduces memory and time by orders of magnitude
compared to MC-SAT.
Chapter 4
The GSL algorithm
This chapter describes the Generative Structure Learning (GSL) algorithm for learning the
structure of MLNs based on the Iterated Local Search (ILS) metaheuristic.
4.1 The Iterated Local Search metaheuristic
Many widely known and high-performance local search algorithms make use of randomized
choice in generating or selecting candidate solutions for a given combinatorial problem in-
stance. These algorithms are called Stochastic Local Search (SLS) algorithms (Hoos and Stut-
zle 2005) and represent one of the most successful and widely used approaches for solving
hard combinatorial problems. These problems are characterized by a large space of candidate solutions to be explored, over which systematic deterministic methods can be prohibitively expensive. In the following we motivate why we chose SLS algorithms for our structure learning algorithm for MLNs.
As pointed out in (Hoos and Stutzle 2005), there are three good reasons to consider applying SLS algorithms. The first is that many problems are of a constructive nature and their instances are known to be solvable. In these situations, the goal of any search algorithm is to find a solution rather than just to decide whether a solution exists. This holds in particular for optimization problems, such as the Travelling Salesman Problem (TSP), where the actual problem is to find a solution of sufficiently high quality. Therefore, the main advantage of a complete
systematic algorithm (the ability to detect that a given problem instance has no solution) is not
relevant for finding solutions of solvable instances. Secondly, in most application scenarios,
the time to find a solution is limited. In these situations, systematic algorithms often have to be aborted once the given time has been exhausted, which renders them incomplete. This is problematic for many systematic optimization algorithms that search through spaces of partial solutions without computing complete solutions early in the search: if such an algorithm is aborted prematurely, usually no complete solution candidate is available, whereas in the same situation SLS algorithms typically return the best solution found so far. Thirdly, algorithms for real-time problems should be able to deliver reasonably good solutions at any point during their execution. For optimization problems this typically means that run-time and solution quality should be positively correlated; for decision problems one could guess a solution when a time-out occurs, where the accuracy of the guess should increase with the run-time of the algorithm. This so-called any-time property of algorithms is usually very difficult to achieve, but in many situations the SLS paradigm is naturally suited for devising any-time algorithms.
In general, it is not straightforward to decide whether to use a systematic or SLS algorithm
in a certain task. Systematic and SLS algorithms can be considered complementary to each
other. SLS algorithms are advantageous in many situations, particularly if reasonably good
solutions are required within a short time, if parallel processing is used and if the knowledge
about the problem domain is rather limited. In other cases, when time constraints are less important and some knowledge about the problem domain can be exploited, systematic search may be a better choice.
In learning the structure of MLNs, we are faced with the problem of maximizing the like-
lihood of the data. This implies searching for the possibly best structure among the candidate
structures. Many search algorithms get trapped in local optima of the evaluation function and may fail to find the best solution (the solutions they return are often of insufficient quality). SLS methods exploit different mechanisms for escaping local optima, and this feature makes them very useful in many optimization problems. Moreover, depending on the future application domain of the MLNs, the time requirements may be very strict, so that a solution of sufficiently high quality must be found within a fixed time. Future extensions of the system that learns MLNs using SLS methods may benefit from parallel processing abilities. Finally, by fulfilling the any-time property through SLS methods, the resulting system can be used to solve real-time problems that involve learning MLNs.
Many “simple” SLS methods come from other search methods by just randomizing the
selection of the candidates during search, such as Randomized Iterative Improvement (RII),
Uninformed Random Walk, etc. Many other SLS methods combine “simple” SLS methods to
exploit the abilities of each of these during search. These are known as Hybrid SLS methods
Algorithm 4.1 The Iterated Local Search algorithm
Procedure IteratedLocalSearch
    s0 = GenerateInitialSolution
    s∗ = LocalSearch(s0)
    repeat
        s′ = Perturb(s∗, history)
        s∗′ = LocalSearch(s′)
        s∗ = Accept(s∗′, s∗, history)
    until termination condition is true
    return s∗
end
(Hoos and Stutzle 2005). ILS is one of these metaheuristics because it can be easily combined
with other SLS methods.
One of the simplest and most intuitive ideas for addressing the fundamental issue of es-
caping local optima is to use two types of SLS steps: one for reaching a local optimum as
efficiently as possible, and the other for effectively escaping it. ILS methods (Hoos and Stut-
zle 2005; Loureno et al. 2002) exploit this key idea, and essentially use two types of search
steps alternatingly to perform a walk in the space of local optima w.r.t. the given evaluation
function. Algorithm 4.1 works as follows: The search process starts from a randomly selected
element s0 of the search space. From this initial candidate solution, a locally optimal solution
s∗ is obtained by applying a subsidiary local search procedure. Then each iteration step of the
algorithm consists of three major steps: first a perturbation method is applied to the current
candidate solution s∗; this yields a modified candidate solution s′ from which, in the next step, a subsidiary local search is performed until another local optimum s∗′ is obtained. In the last step, an acceptance criterion is used to decide from which of the two local optima, s∗ or s∗′, the search process is continued. The algorithm can terminate after some steps have not produced
improvement or simply after a certain number of steps. The choice of the components of the
ILS has a great impact on the performance of the algorithm. A schematic representation of
the ILS algorithm is given in Figure 4.1, where the perturbation operator causes the search to
continue in another region of the search space in order to escape the first local optimum.
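The loop of Algorithm 4.1 can be sketched generically as follows; the component names and the toy objective are illustrative:

```python
import random

def iterated_local_search(generate_initial, local_search, perturb, accept,
                          evaluate, max_no_improve=5):
    """Skeleton of an ILS run. Terminates after max_no_improve consecutive
    iterations without improvement, one of the stopping criteria in the text."""
    s_star = local_search(generate_initial())
    best, no_improve = s_star, 0
    while no_improve < max_no_improve:
        s_prime = perturb(s_star)                 # jump away from the optimum
        s_star_prime = local_search(s_prime)      # reach a new local optimum
        if evaluate(s_star_prime) > evaluate(best):
            best, no_improve = s_star_prime, 0
        else:
            no_improve += 1
        s_star = accept(s_star, s_star_prime)     # acceptance criterion
    return best

# Toy usage: maximize f(x) = -(x - 7)^2 over the integers.
def _f(x):
    return -(x - 7) ** 2

def _hill_climb(x):
    # Subsidiary local search: move to the better of the two neighbors
    # until no neighbor improves.
    while True:
        nbr = max((x - 1, x + 1), key=_f)
        if _f(nbr) <= _f(x):
            return x
        x = nbr

result = iterated_local_search(
    generate_initial=lambda: 0,
    local_search=_hill_climb,
    perturb=lambda x: x + random.randint(-10, 10),
    accept=lambda cur, new: new if _f(new) >= _f(cur) else cur,
    evaluate=_f)
```

Here the subsidiary local search always converges to x = 7, so the skeleton returns the global optimum regardless of the perturbation.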
Figure 4.1: The Iterated Local Search schema
4.2 Generative Structure Learning using ILS
In this section we describe the ILS metaheuristic tailored to the problem of learning the struc-
ture of MLNs. Algorithm 4.2 iteratively adds the best clause to the current MLN until two
consecutive steps have not produced improvement (however other stopping criteria could be
applied). Algorithm 4.3 performs an iterated local search to find the best clause to add to the
MLN. It starts by randomly choosing a unit clause CLC in the search space. Then it performs
a greedy local search to efficiently reach a local optimum CLS. At this point, a perturbation
method is applied leading to the neighbor CL′C of CLS and then a greedy local search is ap-
plied to CL′C to reach another local optimum CL′S. The accept function decides whether the search must continue from the previous local optimum CLS or from the newly found local optimum CL′S (accept can perform a random walk or iterative improvement in the space of local optima). Careful choice of the various components of Algorithm 4.3 is important to achieve
high performance.
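The outer loop just described (Algorithm 4.2) can be sketched abstractly as follows; the callbacks stand in for the MLN machinery (clause search, weight learning, WPLL scoring) and are illustrative:

```python
def gsl_outer_loop(search_best_clause, add_clause, score_fn, min_gain=0.05):
    """Outer loop of a GSL-style learner, abstracted over the MLN machinery.

    search_best_clause(best_score) -> candidate clause or None
    add_clause(clause)             -> adds the clause (and relearns weights)
    score_fn()                     -> current score (WPLL) of the model
    Stops when no clause is found or the gain stays at or below min_gain
    for two consecutive steps.
    """
    best_score = score_fn()
    low_gain_steps = 0
    while low_gain_steps < 2:
        clause = search_best_clause(best_score)
        if clause is None:
            break
        add_clause(clause)
        score = score_fn()
        gain = score - best_score
        if score >= best_score:
            best_score = score
        low_gain_steps = low_gain_steps + 1 if gain <= min_gain else 0
    return best_score

# Toy usage: each added clause improves the score by a shrinking amount,
# so the loop stops after two consecutive low-gain steps.
_gains = iter([1.0, 0.5, 0.01, 0.01, 0.0])
_state = {"score": -10.0}
final_score = gsl_outer_loop(
    search_best_clause=lambda best: "clause",
    add_clause=lambda c: _state.__setitem__("score", _state["score"] + next(_gains)),
    score_fn=lambda: _state["score"],
    min_gain=0.05)
```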
4.2.1 The Perturbation Component
The clause perturbation operator (flipping the sign of a literal, removing a literal, or adding a literal) has the goal of jumping to a different region of the search space, where the search restarts in the next iteration. Perturbations can be strong or weak: if the jump lands near the current local optimum, the subsidiary local search procedure LocalSearchII may fall back into the same local optimum or enter regions with the same value of the objective function (a plateau), while if the jump is too far, LocalSearchII may take too many steps to reach another good solution. In our algorithm we use only strong perturbations, i.e., we always restart from unit clauses (in future work we intend to dynamically adapt the
Algorithm 4.2 The GSL algorithm
Input: P: set of predicates, MLN: Markov Logic Network, RDB: Relational Database
CLS = all clauses in MLN ∪ P;
LearnWeights(MLN, RDB);
Score = WPLL(MLN, RDB);
repeat
    BestClause = SearchBestClause(P, MLN, Score, CLS, RDB);
    if BestClause ≠ null then
        Add BestClause to MLN;
        Score = WPLL(MLN, RDB);
        if BestScore ≤ Score then
            Gain = Score - BestScore;
            BestScore = Score;
        end if
    end if
until BestClause = null || Gain ≤ minGain for two consecutive steps
return MLN
nature of the perturbation). In this way we induce randomness in the search process to avoid
search stagnation. Careful check (tabu check) is performed when restarting, in order to avoid
starting again from the same unit clause.
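The perturbation operators mentioned above can be sketched as follows; the literal encoding and the way an operator is selected are illustrative, not the actual implementation:

```python
import random

def perturb_clause(clause, candidate_literals, rng=random):
    """Perturb a clause by flipping a literal's sign, removing a literal,
    or adding one. Literals are (predicate, positive) pairs; the names are
    illustrative."""
    clause = list(clause)
    op = rng.choice(["flip", "remove", "add"])
    if op == "flip" and clause:
        i = rng.randrange(len(clause))
        pred, positive = clause[i]
        clause[i] = (pred, not positive)        # flip the sign of one literal
    elif op == "remove" and len(clause) > 1:    # keep at least one literal
        clause.pop(rng.randrange(len(clause)))
    else:
        # Add a literal not already present, if any remain.
        unused = [l for l in candidate_literals if l not in clause]
        if unused:
            clause.append(rng.choice(unused))
    return clause
```

Each operator yields a clause different from its input, which is what makes the jump in the search space effective.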
Finding a good perturbation operator is not an easy task because this procedure interacts with the other components of ILS (Loureno et al. 2002). When optimizing each of the four components of ILS separately, we assume the other components remain fixed. This is quite a useful approximation, but clearly the optimization of one component depends on the choices made for the others. Therefore, at least in principle, one should tackle the global optimization of an ILS. As pointed out in (Loureno et al. 2002), since at present there is no theory for analyzing a metaheuristic such as iterated local search, there are some practical general rules to follow when trying to globally optimize ILS. For example, the procedure GenerateInitialSolution is probably irrelevant when the ILS performs well and rapidly loses memory of its starting point.
If the optimization of GenerateInitialSolution can be ignored, then the joint optimization of
the other three components must be achieved. The best choice of Perturb depends on the
choice of LocalSearch while the best choice of Accept depends on the choices of LocalSearch
and Perturb. In practice, the global optimization problem can be approximated by successively
optimizing each component assuming the others are fixed until no improvements are found for
any of the components. Thus the global optimization can be seen as an iterative process. This
Algorithm 4.3 The SearchBestClause component of GSL
Input: P: set of predicates, MLN: Markov Logic Network, BestScore: current best score, CLS: list of clauses, RDB: Relational Database
CLC = randomly pick a clause in CLS ∪ P;
CLS = LocalSearchII(CLC);
BestClause = CLS;
repeat
    CL′C = Perturb(CLS);
    CL′S = LocalSearchII(CL′C, MLN, BestScore);
    if WPLL(CL′S, MLN, RDB) ≥ WPLL(BestClause, MLN, RDB) then
        BestClause = CL′S;
        Add BestClause to MLN;
        BestScore = WPLL(CL′S, MLN, RDB)
    end if
    CLS = accept(CLS, CL′S);
until two consecutive steps have not produced improvement
return BestClause
does not guarantee global optimization of the ILS, but it should lead to an adequate optimiza-
tion of the overall algorithm.
Regarding the strength of the perturbation, its effect should not be easily undone by the
local search; if the local search has obvious short-comings, a good perturbation should com-
pensate for them. The authors in (Loureno et al. 2002) point out that the decision to use weak
perturbations depends on whether the best solutions “cluster” in the space S∗ of locally optimal
solutions. In some problems (TSP is one of them), there is a strong correlation between the cost
of a solution and its “distance” to the optimum: in effect, the best solutions cluster together, i.e.,
have many similar components. This has been referred to as the “Massif Central” phenomenon (Fonlupt et al. 1999), the principle of proximate optimality (Glover and Laguna 1997), and replica symmetry (Mezard et al. 1987). If the problem under consideration has this property, it is useful to attempt to find the true optimum using a biased sampling of S∗. In particular, it is clear that it is useful to use exploitation to improve the probability of hitting the global optimum.
For problems where the clustering of the best solutions is incomplete, i.e., where very distant solutions can be nearly as good as the optimum, weak perturbation may fail. Examples of combinatorial optimization problems in this category are graph bisection and MAX-SAT.
Naturally, exploitation is still needed to get the best solution in one’s current neighborhood, but
generally this will not lead to the optimum. After an exploitation phase, one must then explore other regions of S∗. This can be achieved by using strong perturbations whose strength grows with the instance size. Another possibility is to restart the algorithm from scratch and repeat another exploitation phase.
For MLN structure learning we do not have a theoretical analysis of the search space S∗, and to the best of the author's knowledge there is no work on the properties of the search space of the pseudo-likelihood evaluation function for MLNs. We empirically found that many good candidates, as good as the optimum, were too distant from each other (they differ in a large number of literals). This led us to follow the approach of using strong perturbations in order to explore different regions (i.e., clusters) of S∗. As the results of (Biba et al. 2008) and of the next Section show, this was quite a reasonable choice, together with an iterative improvement local search procedure.
4.2.2 The Local Search Component
As a general rule, LocalSearch should be as powerful as possible as long as it is not too expensive for the whole search process. Since finding a good MLN structure is a hard optimization problem, greedily improving the scoring function is a good choice for achieving high-quality solutions. As described in the previous paragraph, this is useful for exploring each cluster of solutions of S∗ and then leaving the cluster through a strong perturbation operator. Since we are not sure about the clustering of solutions for the task of MLN structure learning, we must make sure that when entering a cluster the best possible solution is found. We have a good chance of achieving this by using a greedy local search procedure. For this reason, regarding the procedure LocalSearchII (Algorithm 4.4), we decided to use an iterative improvement approach (the walk probability is set to zero and the best clause is always chosen in stepII) in order to balance intensification (greedily increasing solution quality by exploiting the evaluation function) and diversification (randomness induced by strong perturbations to avoid search stagnation). However, as future work we intend to study the properties of the search space S∗ and weaken intensification by using a higher walk probability. Finally, the accept function always accepts the best solution found so far.
Algorithm 4.4 The local search component of GSL
Input: CLC: current clause; wp: walk probability (the probability of performing a random step instead of an improvement step)
repeat
    NBHD = neighborhood of CLC constructed using the clause construction operators;
    CLS = StepRII(CLC, NBHD, wp);
    CLC = CLS;
until two consecutive steps do not produce improvement
return CLS;

StepRII(CLC, NBHD, wp)
    U = random(]0,1]); (random number drawn from a uniform probability distribution)
    if U ≤ wp then
        CLS = stepURW(CLC, NBHD); (Uninformed Random Walk: randomly choose a neighbor from NBHD)
    else
        CLS = stepII(CLC, NBHD); (Iterative Improvement: choose the best among the improving neighbors in NBHD; if there is no improving neighbor, choose the minimally worsening one)
    end if
    return CLS
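A minimal sketch of the StepRII component follows; note that taking the best neighbor overall implements both stepII cases (the best improving neighbor if one exists, otherwise the minimally worsening one). The interfaces are illustrative:

```python
import random

def step_rii(current, neighbors, evaluate, wp, rng=random):
    """One Randomized Iterative Improvement step: with probability wp an
    uninformed random-walk step, otherwise an iterative-improvement step."""
    u = 1.0 - rng.random()          # uniform in (0, 1], matching random(]0,1])
    if u <= wp:
        return rng.choice(neighbors)        # stepURW: pick any neighbor
    # stepII: the best neighbor overall is the best improving neighbor if any
    # improves on `current`, and the minimally worsening one otherwise.
    return max(neighbors, key=evaluate)
```

With wp = 0 this reduces to pure iterative improvement, the setting used in GSL; with wp = 1 it degenerates to an uninformed random walk.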
4.3 Experiments
In this Section, experiments in two real-world domains are presented. One is social network analysis, with the goal of predicting links among web pages that describe social actors; the other is entity resolution in a large database of citations.
4.3.1 Link Analysis
Link analysis is an important problem in many domains where entities such as people, Web
pages, computers, scientific publications, organizations, are interconnected and interact in one
way or another (Popescul and Ungar 2003). Predicting the presence of links between entities
is not an easy task due to the characteristics of such domains. The first requirement regards
the representation formalism. Flat representations are not suitable to deal with these problems,
hence relational formalisms must be used. Second, most of these domains contain noisy or partially observed data; thus, robust methods for dealing with uncertainty must be used.
Regarding other SRL models applied to link analysis, in (Popescul and Ungar 2003) an SRL model, Structural Logistic Regression, was used to solve a link analysis problem about predicting citations in scientific literature. In (Taskar et al. 2003) Relational Markov Networks were
used for link prediction in two domains: university webpages and social networks. Markov
Logic was first applied to link prediction by (Richardson and Domingos 2006) and then by
(Kok and Domingos 2005; Mihalkova and Mooney 2007; Singla and Domingos 2005).
Dataset
The experiments on link prediction were carried out on a publicly-available database: the
UW-CSE database (available at http://alchemy.cs.washington.edu/data/uw-cse) used by (Kok
and Domingos 2005; Mihalkova and Mooney 2007; Richardson and Domingos 2006; Singla
and Domingos 2005). This dataset represents a standard relational one and is used for the
important relational task of social network analysis.
The published UW-CSE dataset consists of 15 predicates (Table 4.1) and 1323 constants divided into 9 types. Types include publication, person, course, etc. Predicates include Student(person), Professor(person), AdvisedBy(person1, person2), TaughtBy(course, person, quarter), Publication(paper, person), etc. The dataset contains 2673 tuples (true ground atoms,
with the remainder assumed false). The task is to predict who is whose advisor from informa-
tion about coauthorships, classes taught, etc. More precisely, the query atoms are all ground-
ings of AdvisedBy(person1, person2), and the evidence atoms are all groundings of all other
predicates (Richardson and Domingos 2006). In our experiments we performed inference over all the predicates in the domain, not only AdvisedBy, in order to see how good the learned models are at estimating probabilities for a large number of query atoms.
4.3.2 Entity Resolution
Entity resolution is the problem of determining which records in a database refer to the same
entities, and is an essential and expensive step in the data mining process. The problem was
originally defined in (Newcombe et al. 1959) and then the work in (Fellegi and Sunter 1969)
laid the theoretical basis of what is now known by the name of object identification, record
linkage, de-duplication, merge/purge, identity uncertainty, co-reference resolution, and others.
The main characteristics of domains where this problem must be solved are relations between
objects and uncertainty about these relations due to noisy or partially observed data.
Dataset
The Cora dataset consists of 1295 citations of 132 different computer science papers, drawn
from the Cora CS Research Paper Engine. The task is to predict which citations refer to the
same paper, given the words in their author, title, and venue fields. The labeled data also
specify which pairs of author, title, and venue fields refer to the same entities. We performed
experiments for each field in order to evaluate the ability of the model to deduplicate fields
as well as citations. The dataset contains 10 predicates (Table 4.2) and 70367 tuples (explicitly listed true and false ground atoms, with the remainder assumed false). Since the number of possible equivalences is very large, as the authors did in (Lowd and Domingos 2007) we used the canopies found in (Singla and Domingos 2006b) to make this problem tractable. The dataset used is in Alchemy format (publicly available at http://alchemy.cs.washington.edu/data/cora/). The original version, not in Alchemy format, was segmented by Bilenko and Mooney in (Bilenko and Mooney 2003) (available at http://www.cs.utexas.edu/users/ml/riddle/data/cora.tar.gz).
4.3.3 Systems and Methodology
We implemented Algorithm 4.2 (GSL) as part of the MLN++ package (Section 8.3), a suite of algorithms based on Markov Logic and built upon the Alchemy framework (Kok et al. 2005). Alchemy implements inference and learning algorithms for Markov Logic. Alchemy can be viewed as a declarative programming language akin to Prolog, but with some key differences: the underlying inference mechanism is model checking instead of theorem proving, and the full syntax of first-order logic is allowed, rather than just Horn clauses. Moreover, Alchemy has built-in functionality for handling uncertainty and learning from data. MLN++ uses the Alchemy API for some tasks, such as the L-BFGS implementation for learning maximum-WPLL weights.
To evaluate GSL, we compared its performance against the state-of-the-art algorithms for generative structure learning of MLNs: BS (Beam Search) of (Kok and Domingos 2005) and BUSL (Bottom-Up Structure Learning) of (Mihalkova and Mooney 2007).
In the UW-CSE domain, we used the same leave-one-area-out methodology as in (Richard-
son and Domingos 2006). In the Cora domain, we performed cross-validation. For each sys-
tem on each test set, we measured the conditional log-likelihood (CLL) and the area under
Table 4.1: All predicates in the UW-CSE domain
TaughtBy(course, person, semester)    CourseLevel(course, level)        Position(person, pos)
AdvisedBy(person, person)             ProjectMember(project, person)    Phase(person, phase)
TempAdvisedBy(person, person)         YearsInProgram(person, year)      TA(course, person, semester)
Student(person)                       Professor(person)                 SamePerson(person, person)
SameCourse(course, course)            SameProject(project, project)     Publication(title, person)
Table 4.2: All predicates in the CORA domain
author(citation, author)          title(citation, title)
venue(citation, venue)            sameBib(citation, citation)
sameAuthor(author, author)        sameTitle(title, title)
sameVenue(venue, venue)           hasWordAuthor(author, word)
hasWordTitle(title, word)         hasWordVenue(venue, word)
the precision-recall curve (AUC) for all the predicates. The advantage of the CLL is that it
directly measures the quality of the probability estimates produced. The advantage of the AUC
is that it is insensitive to the large number of true negatives (i.e., ground atoms that are false
and predicted to be false). The CLL of a query predicate is the average over all its groundings
of the ground atom’s log-probability given evidence. The precision-recall curve for a predicate
is computed by varying the probability threshold above which a ground atom is predicted to be true; i.e., the ground atoms whose probability of being true is greater than the threshold are classified as positive and the rest as negative.
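The two measures can be sketched as follows (a simple thresholding illustration; the dissertation computes AUC with the method and package of Davis and Goadrich 2006):

```python
import math

def avg_cll(probs, truths):
    """Average conditional log-likelihood of query ground atoms: the mean
    log-probability assigned to each atom's actual truth value."""
    return sum(math.log(p if t else 1.0 - p)
               for p, t in zip(probs, truths)) / len(probs)

def precision_recall_points(probs, truths, thresholds):
    """Precision-recall pairs obtained by varying the probability threshold
    above which an atom is predicted true."""
    points = []
    for th in thresholds:
        tp = sum(1 for p, t in zip(probs, truths) if p > th and t)
        fp = sum(1 for p, t in zip(probs, truths) if p > th and not t)
        fn = sum(1 for p, t in zip(probs, truths) if p <= th and t)
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        points.append((precision, recall))
    return points
```

Plotting the points over a fine grid of thresholds and integrating yields the area under the precision-recall curve.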
For all algorithms, we used the default parameters of Alchemy changing only the following
ones: maximum variables per clause = 5 for UW-CSE and 6 for Cora; penalization of WPLL:
0.01 for UW-CSE and 0.001 for Cora. For L-BFGS: convergence threshold = 10−5 (tight) and
10−4 (loose); minWeight = 0.5 for UW-CSE for BUSL as in (Mihalkova and Mooney 2007),
1 for BS as in (Kok and Domingos 2005) and 1 for ILS; minGain = 0.05 for ILS. For GSL we used multiple-independent-walk parallelism, assigning each instance of the algorithm to a separate CPU on a cluster of Intel Core2 Duo 2.13 GHz CPUs.
4.3.4 Results
After learning the structure, we performed inference on the test fold for both datasets by using
MC-SAT (Poon and Domingos 2006) with number of steps = 10000 and simulated annealing
Table 4.3: Accuracy results on UW-CSE for ten parallel independent walks of GSL
       language               ai                     systems                graphics               theory
RUN    CLL            AUC    CLL            AUC    CLL            AUC    CLL            AUC    CLL            AUC
R1     -0.232±0.035  0.420   -0.322±0.034  0.413   -0.056±0.023  0.442   -0.080±0.016  0.425   -0.292±0.036  0.336
R2     -0.140±0.034  0.419   -0.375±0.039  0.353   -0.267±0.032  0.445   -0.342±0.041  0.421   -0.251±0.028  0.386
R3     -0.071±0.023  0.430   -0.171±0.018  0.408   -0.293±0.033  0.467   -0.064±0.013  0.462   -0.112±0.022  0.386
R4     -0.464±0.082  0.393   -0.054±0.005  0.419   -0.307±0.040  0.442   -0.111±0.012  0.426   -0.300±0.046  0.359
R5     -0.329±0.070  0.404   -0.331±0.034  0.421   -0.323±0.034  0.449   -0.465±0.046  0.365   -0.104±0.030  0.368
R6     -0.335±0.060  0.449   -0.125±0.015  0.415   -0.266±0.033  0.411   -0.358±0.036  0.442   -0.262±0.034  0.384
R7     -0.285±0.067  0.427   -0.060±0.008  0.394   -0.254±0.032  0.402   -0.306±0.035  0.465   -0.249±0.040  0.384
R8     -0.243±0.052  0.418   -0.381±0.036  0.353   -0.371±0.047  0.398   -0.053±0.007  0.483   -0.178±0.033  0.414
R9     -0.224±0.031  0.414   -0.128±0.020  0.397   -0.416±0.051  0.422   -0.348±0.030  0.400   -0.321±0.032  0.412
R10    -0.356±0.067  0.386   -0.377±0.025  0.369   -0.295±0.034  0.482   -0.299±0.034  0.447   -0.212±0.033  0.391
Avg.   -0.268±0.052  0.416   -0.233±0.023  0.394   -0.285±0.036  0.436   -0.243±0.027  0.434   -0.228±0.033  0.381
temperature = 0.5. For each experiment, all the groundings of the query predicates on the test fold were commented out (i.e., removed from the evidence). MC-SAT produces probability outputs for every grounding of the query predicate on the test fold. We used these values to compute the average CLL over all the groundings and the corresponding AUC (for AUC we used the method and the package of (Davis and Goadrich 2006)). For CORA, in some cases the memory and time requirements of the inference task were too high. Thus, to score the learned structures within reasonable time and the available memory, in some cases we used the lazy version of MC-SAT and imposed a limit of one hour for the process to complete. For all the other cases, where the memory requirements were not high, we ran MC-SAT with number of steps = 10000. For ILS we report the performance in terms
of CLL for ten parallel independent walks. Both CLL and AUC results are averaged over all
predicates of the domain.
For UW-CSE the results of GSL are reported in Table 4.3. Every value in the table for
each fold is an average of accuracy for all the predicates. The overall results for UW-CSE
comparing BUSL and GSL are reported in Table 4.4. The columns refer to the results taken for
GSL: GSL-Average refers to the average results of all the parallel runs over all the folds, GSL-
BestCLL refers to the best parallel run for each fold in terms of CLL, GSL-BestAUC refers to
the best parallel run for each fold in terms of AUC, GSL-Best refers to the best parallel run for
each fold taking into account both the optimization of CLL and AUC.
As the results of Table 4.4 show, if we take all the parallel independent walks of GSL,
these on average produce better results than BS but worse results than BUSL in terms of CLL
and AUC. However, if for each fold of the dataset we take the best run of GSL in terms of
Table 4.4: Accuracy comparison of GSL, BUSL and BS on the UW-CSE dataset
           GSL-Average            GSL-BestCLL            GSL-BestAUC
Fold       CLL            AUC     CLL            AUC     CLL            AUC
language   -0.268±0.052  0.416   -0.071±0.023  0.430   -0.335±0.060  0.449
ai         -0.233±0.023  0.394   -0.054±0.005  0.419   -0.331±0.034  0.421
systems    -0.285±0.036  0.436   -0.056±0.023  0.442   -0.295±0.034  0.482
graphics   -0.243±0.027  0.434   -0.053±0.007  0.483   -0.053±0.007  0.483
theory     -0.228±0.033  0.381   -0.104±0.030  0.368   -0.178±0.033  0.414
Average    -0.251±0.034  0.412   -0.068±0.017  0.428   -0.239±0.033  0.450

           GSL-Best               BUSL                   BS
Fold       CLL            AUC     CLL            AUC     CLL            AUC
language   -0.071±0.023  0.430   -0.090±0.030  0.439   -0.433±0.078  0.300
ai         -0.054±0.005  0.419   -0.067±0.009  0.406   -0.289±0.037  0.328
systems    -0.056±0.023  0.442   -0.053±0.005  0.461   -0.242±0.031  0.414
graphics   -0.053±0.007  0.483   -0.101±0.018  0.458   -0.282±0.039  0.287
theory     -0.112±0.022  0.386   -0.061±0.009  0.390   -0.313±0.044  0.273
Average    -0.069±0.016  0.432   -0.074±0.014  0.431   -0.312±0.046  0.320
CLL or AUC, then we see that GSL improves over BUSL. For example, if we take for each
fold the best run that produced better results in terms of CLL, the results show that GSL-
BestCLL performs better than BS and BUSL in terms of CLL. The same happens also for
GSL-BestAUC which performs better than BS and BUSL in terms of AUC. When learning
classifiers or estimating probabilities, the goal is often to optimize both CLL and AUC. This is known as multiobjective optimization and implies that more than one evaluation function be employed to evaluate the quality of the algorithms. In this case, in order to make a comparison
with BUSL, we would have to take for each fold the best independent run of GSL that produced
best results by combining CLL and AUC. From Table 4.4, we can see that GSL-Best produces
the best overall results compared to BS and BUSL.
Regarding learning times for UW-CSE, the results for GSL are shown in Table 4.5. The
comparison with BUSL and BS is presented in Table 4.6 with learning times in minutes. As
we can see, the GSL runs are quite fast compared to BS and BUSL. BUSL is the slowest and takes three to four times longer than GSL to complete. BS is faster than BUSL but generally two to three times slower than GSL.
For CORA the results of GSL are reported in Table 4.7. Every value in the table for each
fold is an average of accuracy for all the predicates. The overall results for CORA comparing
Table 4.5: Learning times (in minutes) on UW-CSE for ten parallel independent walks of GSL
Run       Language   Ai    Systems   Graphics   Theory
1         152        118   112       111        156
2         65         152   173       333        182
3         102        142   256       77         54
4         106        144   153       217        131
5         120        165   65        221        54
6         179        93    142       326        84
7         195        75    117       303        113
8         286        351   54        85         103
9         117        118   115       123        142
10        164        66    187       138        150
Average   149        142   137       193        117
Table 4.6: Comparison of learning times (in minutes) on UW-CSE for GSL, BUSL and BS
Run           Language   Ai    Systems   Graphics   Theory   Average
GSL-Average   149        142   137       193        117      148
GSL-BestCLL   102        144   112       85         54       99
GSL-BestAUC   179        165   187       85         103      144
GSL-Best      102        144   112       85         54       99
BUSL          502        664   765       560        598      618
BS            454        315   336       289        280      335
BUSL and GSL are reported in Table 4.8. The columns refer to the results taken for GSL: GSL-
Average refers to the average results of all the parallel runs over all the folds, GSL-BestCLL
refers to the best parallel run for each fold in terms of CLL, GSL-BestAUC refers to the best
parallel run for each fold in terms of AUC, GSL-Best refers to the best parallel run for each
fold taking into account both CLL and AUC. As the results show, BUSL is competitive with
GSL-Average in terms of AUC but is outperformed by GSL-Average in terms of CLL. The
same holds for GSL-BestCLL, which outperforms BUSL in terms of CLL with an overall result of −0.071 compared to −0.196 for BUSL. The other two views of the results,
GSL-BestAUC and GSL-Best clearly outperform BUSL both in terms of CLL and in terms of
AUC. GSL-Best can be viewed as an approach for multiobjective optimization since both CLL
and AUC are optimized.
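The three "best run" views reported in the tables can be derived mechanically from the per-run scores. The following sketch is illustrative only: the text does not specify the exact rule used to combine CLL and AUC for GSL-Best, so a summed-rank compromise is assumed here; the data are a subset of the Fold1 values from Table 4.7.

```python
# Hedged sketch: selecting the best parallel run per fold under the three
# criteria used in the tables. The summed-rank rule for GSL-Best is an
# assumption made for illustration, not the thesis's stated criterion.

def best_runs(runs):
    """runs: list of (cll, auc) pairs; CLL is negative, higher is better."""
    best_cll = max(runs, key=lambda r: r[0])
    best_auc = max(runs, key=lambda r: r[1])
    # rank each run on both measures; keep the run with the lowest summed rank
    by_cll = sorted(runs, key=lambda r: r[0], reverse=True)
    by_auc = sorted(runs, key=lambda r: r[1], reverse=True)
    best_both = min(runs, key=lambda r: by_cll.index(r) + by_auc.index(r))
    return best_cll, best_auc, best_both

# a subset of the Fold1 (CLL, AUC) values from Table 4.7
fold1 = [(-0.059, 0.131), (-0.143, 0.228), (-0.072, 0.235), (-0.090, 0.131)]
b_cll, b_auc, b_both = best_runs(fold1)
```

Under this rule, the run with the best CLL and the run with the best AUC need not coincide, and the compromise run may be a third one.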
Learning times for CORA are reported in Table 4.10. The independent walks of GSL are
quite fast compared to BUSL, which spends much time first evaluating clauses and then
Table 4.7: Accuracy results on CORA for ten parallel independent walks of GSL

Run    Fold1 (CLL / AUC)      Fold2 (CLL / AUC)      Fold3 (CLL / AUC)      Fold4 (CLL / AUC)      Fold5 (CLL / AUC)
R1     -0.059±0.002 / 0.131   -0.085±0.004 / 0.250   -0.076±0.003 / 0.230   -0.071±0.003 / 0.146   -0.079±0.002 / 0.124
R2     -0.143±0.003 / 0.228   -0.095±0.004 / 0.264   -0.111±0.004 / 0.134   -0.177±0.005 / 0.301   -0.114±0.003 / 0.109
R3     -0.076±0.003 / 0.134   -0.077±0.003 / 0.128   -0.075±0.004 / 0.234   -0.114±0.005 / 0.132   -0.148±0.003 / 0.234
R4     -0.086±0.003 / 0.133   -0.113±0.006 / 0.262   -0.122±0.003 / 0.200   -0.089±0.004 / 0.244   -0.112±0.003 / 0.216
R5     -0.122±0.002 / 0.134   -0.171±0.005 / 0.231   -0.071±0.004 / 0.242   -0.099±0.004 / 0.247   -0.105±0.003 / 0.131
R6     -0.065±0.003 / 0.130   -0.117±0.003 / 0.259   -0.124±0.004 / 0.126   -0.086±0.004 / 0.127   -0.132±0.005 / 0.127
R7     -0.082±0.003 / 0.133   -0.111±0.005 / 0.256   -0.070±0.004 / 0.238   -0.103±0.004 / 0.129   -0.149±0.003 / 0.125
R8     -0.072±0.002 / 0.235   -0.097±0.004 / 0.259   -0.113±0.004 / 0.132   -0.095±0.004 / 0.245   -0.131±0.004 / 0.195
R9     -0.090±0.003 / 0.131   -0.086±0.004 / 0.260   -0.143±0.004 / 0.235   -0.098±0.004 / 0.118   -0.171±0.003 / 0.230
R10    -0.088±0.004 / 0.114   -0.113±0.006 / 0.261   -0.127±0.005 / 0.132   -0.295±0.006 / 0.112   -0.143±0.003 / 0.126
Avg.   -0.088±0.003 / 0.150   -0.107±0.004 / 0.243   -0.103±0.004 / 0.190   -0.123±0.004 / 0.180   -0.128±0.003 / 0.162
Table 4.8: Accuracy comparison of GSL with BUSL on the CORA dataset

Fold   GSL-Average (CLL / AUC)   GSL-BestCLL (CLL / AUC)   GSL-BestAUC (CLL / AUC)   GSL-Best (CLL / AUC)     BUSL (CLL / AUC)
1      -0.088±0.003 / 0.150      -0.059±0.002 / 0.131      -0.072±0.002 / 0.235      -0.072±0.002 / 0.235     -0.099±0.002 / 0.220
2      -0.107±0.004 / 0.243      -0.077±0.003 / 0.128      -0.095±0.004 / 0.264      -0.086±0.004 / 0.260     -0.118±0.003 / 0.129
3      -0.103±0.004 / 0.190      -0.071±0.004 / 0.242      -0.071±0.004 / 0.242      -0.071±0.004 / 0.242     -0.558±0.007 / 0.186
4      -0.123±0.004 / 0.180      -0.071±0.003 / 0.146      -0.177±0.005 / 0.301      -0.099±0.004 / 0.247     -0.100±0.002 / 0.238
5      -0.128±0.003 / 0.162      -0.079±0.002 / 0.124      -0.148±0.003 / 0.234      -0.112±0.003 / 0.216     -0.103±0.003 / 0.234
Avg.   -0.110±0.004 / 0.185      -0.071±0.003 / 0.154      -0.112±0.004 / 0.255      -0.088±0.004 / 0.240     -0.196±0.003 / 0.201
adding them one by one to the current structure. This implies that the MLN changes at
every step, and thus the optimization of the WPLL requires much more time. In GSL this does not
happen, since clause evaluation follows the approach presented in Section 3.2.3, where
the authors of (Kok and Domingos 2005) found that the very simple approach of
initializing L-BFGS with the current weights (and a zero weight for a new clause) was quite
successful. Although in principle all weights could change as the result of introducing or
modifying a clause, in practice this is very rare. Second-order, quadratic-convergence methods
like L-BFGS are known to be very fast if started near the optimum (Sha and Pereira 2003).
For the BS algorithm (Kok and Domingos 2005) in the CORA domain we were not able to
report results, since structure learning with this algorithm did not finish in 45 days. BS is
heavily slowed by its systematic top-down nature that tends to evaluate a very large number of
candidates.
Table 4.9: Learning times (in minutes) on CORA for ten parallel independent walks of GSL

Run       Fold1   Fold2   Fold3   Fold4   Fold5
1         2088    1590    1826    848     1164
2         1781    534     1494    2027    987
3         2615    2986    586     1673    1190
4         1264    883     1468    959     522
5         2230    1734    1014    1232    450
6         787     2015    1748    1213    1697
7         1012    2578    1888    1601    872
8         1020    1064    883     1796    3280
9         2014    518     1270    1531    1958
10        1307    1889    1392    1008    1202
Average   1612    1579    1357    1389    1332
Table 4.10: Comparison of learning times (in minutes) on CORA for GSL and BUSL

Algorithm     Fold1   Fold2   Fold3   Fold4   Fold5   Average
GSL-Average   1612    1579    1357    1389    1332    1454
GSL-BestCLL   2088    2986    1888    848     1164    1795
GSL-BestAUC   1020    534     1014    2027    1190    1157
GSL-Best      1020    2014    1014    1232    522     1160
BUSL          7400    7000    8200    12000   12150   9350
4.4 Related Work
GSL is related to the growing amount of research on learning SRL models (Getoor and Taskar
2007). In particular, it is similar to approaches that tightly integrate the steps of structure and
parameter learning. These approaches typically learn the structure of the model by directly
optimizing a likelihood-type measure. Most of these approaches such as those in (Dehaspe
1997; Huynh and Mooney 2008; Landwehr et al. 2005, 2006, 2007) have their roots in the ILP
community and were extended to SRL or PILP by combining refinement operators with statis-
tical learning. All these systems perform classification, while the goal of GSL is general
probability estimation on relational data, i.e., learning the joint distribution of all the predicates.
Being based on MLNs, the most closely related algorithms are those presented in (Kok and
Domingos 2005; Mihalkova and Mooney 2007), which directly optimize the pseudo-likelihood
measure. The main difference of GSL is that it is a stochastic algorithm based on a hybrid SLS
metaheuristic, ILS. From this point of view the most closely related approach is that of
(Zelezny et al. 2006), which exploits SLS methods in ILP. However, the GSL algorithm we
propose here differs in that it uses likelihood as the evaluation measure instead of ILP coverage
criteria. Moreover, GSL differs from the algorithms proposed in (Zelezny et al. 2006) in that
it uses hybrid SLS approaches, which can combine simple SLS methods to produce
high-performance algorithms.
4.5 Summary
GSL is a stochastic algorithm that performs an Iterated Local Search in the space of structures
guided by pseudo-likelihood. The approach is based on a biased sampling of the set of local
optima focusing the search not on the full space of solutions but on a smaller subspace de-
fined by the solutions that are locally optimal for the optimization engine. It employs a strong
perturbation operator and an iterative improvement local search procedure in order to balance
diversification (randomness induced by strong perturbation to avoid search stagnation) and
intensification (greedily increase solution quality by exploiting the evaluation function). The
experimental evaluation on two benchmark datasets, regarding the problems of Link Analysis
in social networks and Entity Resolution in citation databases, shows that by running multiple
parallel independent walks, GSL achieves improvements over the state-of-the-art algorithms
for generative structure learning of Markov Logic Networks. GSL can be further improved in
future work by: weakening intensification through a higher random-walk probability in the
local search procedure; making the acceptance function probabilistic (it currently performs
iterative improvement in the space of local optima, which can lead to getting stuck, so inducing
some random walk among the local optima would help); implementing more sophisticated
parallel models such as MPI (Message Passing Interface) or PVM (Parallel Virtual Machine);
and dynamically adapting the nature of the perturbations.
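The search scheme summarized above (local search to an optimum, perturbation, local search again, acceptance) is independent of MLNs and can be sketched generically. The toy objective, neighbourhood and perturbation below are hypothetical stand-ins for the clause space and pseudo-likelihood score used by GSL:

```python
import random

# Generic Iterated Local Search skeleton: intensify with iterative
# improvement, diversify with a perturbation, accept improvements.
# All problem-specific functions are illustrative stand-ins.

def iterated_local_search(initial, neighbours, score, perturb, iters=30, seed=0):
    rng = random.Random(seed)

    def local_search(s):
        # iterative improvement: move to the best neighbour while it improves
        while True:
            best_n = max(neighbours(s), key=score)
            if score(best_n) <= score(s):
                return s
            s = best_n

    current = local_search(initial)
    best = current
    for _ in range(iters):
        candidate = local_search(perturb(current, rng))
        if score(candidate) >= score(current):  # accept improvements only
            current = candidate
        if score(current) > score(best):
            best = current
    return best

# toy usage: maximise -(x - 3)^2 over the integers
best_x = iterated_local_search(
    initial=-20,
    neighbours=lambda x: [x - 1, x + 1],
    score=lambda x: -(x - 3) ** 2,
    perturb=lambda x, rng: x + rng.randint(-10, 10),
)
```

In GSL the state is a clause, the neighbourhood comes from the clause construction operators, and the score is the WPLL; the control flow is exactly this loop.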
Chapter 5
The ILS-DSL algorithm
Generative approaches optimize the joint distribution of all the variables. This can lead to sub-
optimal results for predictive tasks because of the mismatch between the objective function
used (likelihood or a function thereof) and the goal of classification (maximizing accuracy or
conditional likelihood). In contrast discriminative approaches maximize the conditional like-
lihood of a set of outputs given a set of inputs (Lafferty et al. 2001) and this often produces
better results for prediction problems. In (Singla and Domingos 2005) the voted perceptron
based algorithm for discriminative weight learning of MLNs was shown to greatly outperform
maximum-likelihood and pseudo-likelihood approaches for two real-world prediction prob-
lems. Recently, the algorithm in (Lowd and Domingos 2007), which outperforms the voted
perceptron, became the state-of-the-art method for discriminative weight learning of MLNs. However,
both discriminative approaches to MLNs learn weights for a fixed structure, given by a domain
expert or learned through another structure learning method (usually generative). Better results
could be achieved if the structure could be learned in a discriminative fashion. Unfortunately,
the computational cost of optimizing structure and parameters for conditional likelihood is
prohibitive. In this chapter it is shown that the simple approximation of choosing structures
by maximizing conditional likelihood (CLL) or the AUC of the precision-recall (PR) curve, while
setting parameters by maximum likelihood, can produce better results in terms of predictive
accuracy. Structures are scored through a very fast inference algorithm, MC-SAT (Poon and
Domingos 2006) whose lazy version Lazy-MC-SAT (Poon et al. 2008) greatly reduces mem-
ory requirements, while parameters are learned through a quasi-Newton optimization method
like L-BFGS (Liu and Nocedal 1989) that has been found to be much faster (Sha and Pereira
2003) than iterative scaling initially used for MNs weight learning (Della Pietra et al. 1997).
This chapter presents the ILS-Discriminative Structure Learning (ILS-DSL) algorithm, which
is based on the Iterated Local Search (ILS) metaheuristic (Lourenço et al. 2002). We present
two variants of the algorithm that set parameters by maximum likelihood and choose structures
by maximum CLL or maximum AUC of the precision-recall (PR) curve, respectively.
5.1 Setting Parameters through Likelihood
Weight learning in MLNs is a convex optimization problem, thus gradient descent is guaranteed
to find the global optimum. However, in practice convergence to this optimum is extremely
slow, since MLNs, being exponential models, require as sufficient statistics the number of
true groundings of each clause in the data. Optimizing the CLL for MLNs (as for
Markov random fields) requires computing the partition function, which is generally intractable.
Moreover, since the number of true groundings of clauses can easily vary by orders of magnitude
from one clause to another, learning rates that are small enough to avoid divergence in
some weights may be too small for convergence in others. This is the ill-conditioning
problem in numerical optimization (Nocedal and Wright 2006). It can be addressed by
several methods, but the most well-known are not directly applicable to MLNs. These include
methods that perform line searches (computing the function as well as the gradient) such
as conjugate gradient and quasi-Newton methods. All of these require computing the partition
function, so the approach of optimizing CLL for every refinement is impractical.
Optimizing the WPLL, in contrast, does not require these computations.
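To make the contrast concrete, the following sketch computes the pseudo-log-likelihood of a toy ground network: no partition function over the whole state space is needed, only a two-way normalization per atom. The toy world and feature are hypothetical, and the per-predicate normalization that turns PLL into WPLL is omitted for brevity:

```python
import math

# Minimal sketch of pseudo-log-likelihood for a toy ground Markov network.
# A feature is a function world -> number of true groundings, with a weight.
# Illustrative only: real WPLL also weights each atom's contribution by the
# number of groundings of its predicate.

def pll(world, features, weights):
    """world: dict atom -> bool. Returns sum_l log P(X_l = x_l | rest)."""
    total = 0.0
    for atom in world:
        scores = {}
        for value in (False, True):
            flipped = dict(world)
            flipped[atom] = value
            scores[value] = sum(w * f(flipped) for f, w in zip(features, weights))
        # normalize over the two values of this atom only
        log_z = math.log(math.exp(scores[False]) + math.exp(scores[True]))
        total += scores[world[atom]] - log_z
    return total

# toy world with one clause-like feature "a or b", weight 1.0
world = {"a": True, "b": True}
score = pll(world, [lambda w: int(w["a"] or w["b"])], [1.0])
```

Each atom's term conditions on the rest of the world, which is why the sum stays cheap while the full likelihood would not.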
For every candidate structure, the parameters that optimize the WPLL are set through L-
BFGS. As pointed out in (Kok and Domingos 2005) a potentially serious problem that arises
when evaluating candidate clauses using WPLL is that the optimal (maximum WPLL) weights
need to be computed for each candidate. Since this involves numerical optimization, and needs
to be done millions of times, it could easily make the algorithm too slow. In (Della Pietra et al.
1997; McCallum 2003) the problem is addressed by assuming that the weights of previous
features do not change when testing a new one. Surprisingly, the authors in (Kok and Domingos
2005) found this to be unnecessary if the very simple approach of initializing L-BFGS with the
current weights (and zero weight for a new clause) is used. Although in principle all weights
could change as the result of introducing or modifying a clause, in practice this is very rare.
Second-order, quadratic-convergence methods like L-BFGS are known to be very fast if started
near the optimum (Sha and Pereira 2003). This is what happened in (Kok and Domingos 2005):
L-BFGS typically converges in just a few iterations, sometimes one. We use the same approach
for setting the parameters that optimize the WPLL.
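The warm-start trick can be illustrated on a toy objective. Plain gradient ascent stands in for L-BFGS so the sketch stays self-contained, and a concave quadratic stands in for the WPLL; the point is only that re-optimization after adding a "clause" starts near the previous optimum and so converges quickly:

```python
# Sketch of the warm-start evaluation trick: when a clause is added, the
# optimiser is initialised with the current weights plus a zero weight for
# the new clause. Gradient ascent and the toy objective
# -(1/2) * sum_i (w_i - t_i)^2 are stand-ins, not the real L-BFGS/WPLL.

def maximise(gradient, w0, lr=0.1, steps=200):
    w = list(w0)
    for _ in range(steps):
        w = [wi + lr * gi for wi, gi in zip(w, gradient(w))]
    return w

def toy_gradient(targets):
    # gradient of the toy objective, maximised exactly at w == targets
    return lambda w: [t - wi for wi, t in zip(w, targets)]

# initial structure with two clauses, optimised from scratch
weights = maximise(toy_gradient([1.0, -2.0]), [0.0, 0.0])
# a third clause is added: warm-start from the current weights, zero for the
# new one, so far fewer optimisation steps are needed
weights = maximise(toy_gradient([1.0, -2.0, 0.5]), weights + [0.0], steps=50)
```

With the warm start, only the new coordinate is far from its optimum, mirroring the observation that in practice the old weights barely change.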
5.2 Scoring Structures through Conditional Likelihood
In order to score MLN structures, we need to perform inference over the network. A very fast
algorithm for inference in MLNs is MC-SAT (Poon and Domingos 2006). Since probabilistic
inference methods like MCMC or belief propagation tend to give poor results when determin-
istic or near-deterministic dependencies are present, and logical ones like satisfiability testing
are inapplicable to probabilistic dependencies, MC-SAT combines ideas from both MCMC
and satisfiability to handle probabilistic, deterministic and near-deterministic dependencies that
are typical of statistical relational learning. MC-SAT was shown to greatly outperform Gibbs
sampling and simulated tempering on two real-world datasets regarding entity resolution and
collective classification.
Even though MC-SAT is a very fast inference algorithm, scoring candidate structures at
each step can be potentially very expensive since inference has to be performed for each can-
didate clause added to the current structure. One problem that arises is that fully instantiating
a finite first-order theory requires memory in the order of the number of constants raised to the
length of the clauses, which significantly limits the size of domains where the problem can still
be tractable. To avoid this problem, we used a lazy version of MC-SAT, Lazy-MC-SAT (Poon
et al. 2008) which reduces memory and time by orders of magnitude compared to MC-SAT.
Before Lazy-MC-SAT was introduced, the LazySat algorithm (Singla and Domingos 2006a)
was shown to greatly reduce memory requirements by exploiting the sparseness of relational
domains (i.e., only a small fraction of ground atoms are true, and most clauses are trivially sat-
isfied). The authors in (Poon et al. 2008) generalize the ideas in (Singla and Domingos 2006a)
by proposing a general method for applying lazy inference to a broad class of algorithms such
as other SAT solvers or MCMC methods. Another problem is that even though Lazy-MC-SAT
makes memory requirements tractable, constructing the Markov random field in the first step
of MC-SAT can still take too much time for every candidate structure.
To make the execution of Lazy-MC-SAT tractable for every candidate structure, we use
the following simple heuristics: 1) We score through Lazy-MC-SAT only those candidates
that produce an improvement in WPLL. Once the parameters are set through L-BFGS, it is
straightforward to compute the gain in WPLL for each candidate. This reduces the number of
candidates to be scored through Lazy-MC-SAT for a gain in CLL. 2) We impose a memory limit
on the clause activation phase of Lazy-MC-SAT, which greatly speeds up the whole inference
task. Although in principle this limit can reduce the accuracy of inference, we found that in
most cases the memory limit is never reached, making the overall inference task very fast. 3)
We impose a time limit on the clause activation phase in order to avoid those rare cases where
this step takes a very long time to complete. For most candidate structures the time limit
is never reached, and in the rare cases where it is, inference is performed using the clauses
activated within the limit.
We found that these simple approximations greatly speed up the scoring of each structure at
each step. Filtering the potential candidates through the gain in WPLL can in principle exclude
good candidates due to the mismatch between the optimization of WPLL and that of CLL.
However, we empirically found that most candidates that did not improve WPLL did not improve
CLL either. Further investigation of this issue may help to select better or more candidates to be
scored through Lazy-MC-SAT.
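Heuristic 1 amounts to a simple filter in front of the expensive scorer. A minimal sketch, with hypothetical stand-in functions (in the real system the WPLL gain comes from L-BFGS and the expensive score from Lazy-MC-SAT):

```python
# Hedged sketch of heuristic 1: only candidates that improve WPLL reach the
# expensive discriminative scorer, and the best of those is kept.

def best_candidate(candidates, wpll_gain, cll_score):
    # cheap filter: discard candidates that do not improve WPLL
    improving = [c for c in candidates if wpll_gain(c) > 0]
    if not improving:
        return None
    # expensive score (stand-in for Lazy-MC-SAT) only on the survivors
    return max(improving, key=cll_score)

# toy data: 'b' never reaches the expensive scorer because it worsens WPLL
gains = {"a": 0.10, "b": -0.20, "c": 0.05}
clls = {"a": -0.30, "b": -0.05, "c": -0.10}
choice = best_candidate(["a", "b", "c"], gains.get, clls.get)
```

Note that 'b' has the best CLL in the toy data but is filtered out, which is exactly the kind of mismatch between WPLL and CLL the text acknowledges.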
5.3 Discriminative Structure Learning using ILS
In this section we describe our proposal for tailoring the ILS metaheuristic to the problem
of learning the structure of MLNs. We describe how weights are set and how structures are
scored. The approach we follow is similar to that of (Grossman and Domingos 2004), where
Bayesian Networks were learned by setting parameters through maximum likelihood and
choosing structures by maximizing conditional likelihood.
Algorithms ILS-DSLCLL and ILS-DSLAUC (Algorithm 5.1) iteratively add the best clause
to the current MLN until δ consecutive steps have produced no improvement (other stopping
criteria could also be applied). The algorithm can start from an empty network or from an existing KB.
As in (Kok and Domingos 2005; Richardson and Domingos 2006), we add all unit clauses
(single predicates) to the MLN. The initial weights are learned in LearnWeights through L-
BFGS and the initial structure is scored in ComputeScore through MC-SAT. MC-SAT takes
as input an MLN, a query predicate and evidence ground facts, and computes for each grounding
of the query predicate its probability of being true. From these values, ComputeScore computes
the CLL as the average CLL over all these groundings. DSLCLL scores the candidate clauses
directly by CLL, while DSLAUC computes the AUC of the PR curve using the
package of (Davis and Goadrich 2006).
Algorithm 5.1 The ILS-DSL algorithm

Input: P: set of predicates, MLN: Markov Logic Network, RDB: Relational Database, QP: query predicate

CLS = all clauses in MLN ∪ P
LearnWeights(MLN, RDB)
BestScore = ComputeScore(MLN, RDB, QP)
repeat
  BestClause = SearchBestClause(P, MLN, BestScore, CLS, RDB, QP)
  if BestClause ≠ null then
    Add BestClause to MLN
    BestScore = ComputeScore(MLN, RDB, QP)
  end if
until BestClause = null for δ consecutive steps
Return MLN

For DSLCLL, ComputeScore computes the average CLL over all the groundings of the query predicate QP; for DSLAUC, it computes the AUC of the PR curve.
The search for the best clause is performed in the SearchBestClause procedure (Algorithm
5.2). The algorithm performs an iterated local search to find the best clause to add to the current
MLN. It starts by randomly choosing a unit clause CLC in the search space. Then it performs
a greedy local search to efficiently reach a local optimum CLS. At this point, a perturbation
method is applied, leading to a neighbor CL′C of CLS, and a greedy local search is then applied
to CL′C to reach another local optimum CL′S. The accept function decides whether the search
must continue from the previous local optimum CLS or from the newly found local optimum CL′S
(accept can perform a random walk or iterative improvement in the space of local optima).
Careful choice of the various components of SearchBestClause is important to achieve high
performance. The clause perturbation operator (flipping the sign of literals, removing literals
or adding literals) aims to jump to a different region of the search space where the search
should start at the next iteration. Perturbations can be strong or weak: if the jump lands near
the current local optimum, the subsidiary local search procedure LocalSearchII (Algorithm 5.3)
may fall back into the same local optimum or enter a region where the objective function has
the same value (a plateau); if the jump is too far, LocalSearchII may take too many steps to
reach another good solution. In our algorithm we use
Algorithm 5.2 The SearchBestClause component of ILS-DSL

Input: P: set of predicates, MLN: Markov Logic Network, BestScore: CLL or AUC score, BestWPLL: WPLL score, CLS: list of clauses, RDB: Relational Database, QP: query predicate

CLC = randomly pick a clause in CLS ∪ P
CLS = LocalSearchII(CLC, BestScore, BestWPLL)
BestClause = CLS
repeat
  CL′C = Perturb(CLS)
  CL′S = LocalSearchII(CL′C, MLN, BestScore, BestWPLL)
  if ComputeScore(BestClause, MLN, RDB, QP) ≤ ComputeScore(CL′S, MLN, RDB, QP) then
    BestClause = CL′S
    Add BestClause to MLN
    BestScore = ComputeScore(CL′S, MLN, RDB, QP)
  end if
  CLS = accept(CLS, CL′S)
until k consecutive steps have not produced improvement
Return BestClause

For DSLCLL, ComputeScore computes the average CLL over all the groundings of the query predicate; for DSLAUC, it computes the AUC of the PR curve.
only strong perturbations, i.e., we always re-start from unit clauses (in future work we intend
to dynamically adapt the nature of the perturbation). Regarding the procedure LocalSearchII ,
we decided to use an iterative improvement approach (the walk probability is set to zero and
the best clause is always chosen in StepII) in order to balance intensification (greedily increase
solution quality by exploiting the evaluation function) and diversification (randomness induced
by strong perturbation to avoid search stagnation). In future work we intend to further weaken
intensification by using a higher walk probability. Finally, the accept function always accepts
the best solution found so far.
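The perturbation operator named above (flipping the sign of a literal, removing a literal, or adding one) can be sketched on a toy clause representation; the (predicate, sign) encoding below is illustrative, not the thesis's data structure:

```python
import random

# Sketch of the clause perturbation operator: flip a literal's sign, remove
# a literal, or add one drawn from a pool. A clause is a list of
# (predicate, positive?) pairs; the encoding is illustrative only.

def perturb(clause, literal_pool, rng):
    clause = list(clause)
    # never empty a clause: single-literal clauses can only grow
    move = rng.choice(["flip", "remove", "add"]) if len(clause) > 1 else "add"
    if move == "flip":
        i = rng.randrange(len(clause))
        pred, positive = clause[i]
        clause[i] = (pred, not positive)
    elif move == "remove":
        del clause[rng.randrange(len(clause))]
    else:  # add a literal from the pool
        clause.append(rng.choice(literal_pool))
    return clause

rng = random.Random(0)
start = [("AdvisedBy", True), ("Professor", False)]
perturbed = perturb(start, [("Student", True)], rng)
```

A strong perturbation, as used in ILS-DSL, would instead discard the clause entirely and restart from a unit clause; the operator above corresponds to the weaker, local moves.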
5.3.1 The ILS-DSLCLL version
The ILS-DSLCLL version of the algorithm maximizes CLL during search. In Algorithm 5.2,
the function ComputeScore computes the average CLL over all the groundings of the query
Algorithm 5.3 The subsidiary procedure LocalSearchII and the Step function of ILS-DSL

LocalSearchII(CLC, BestScore, BestWPLL)
wp: walk probability, i.e., the probability of performing a random step instead of an improvement step
repeat
  NBHD = neighborhood of CLC constructed using the clause construction operators
  CLS = StepRII(CLC, NBHD, wp, BestScore, BestWPLL)
  CLC = CLS
until two consecutive steps do not produce improvement
Return CLS

StepRII(CLC, NBHD, wp, BestScore, BestWPLL)
U = random(]0,1]), a random number drawn from a uniform distribution
if U ≤ wp then
  CLS = stepURW(CLC, NBHD)
  (uninformed random walk: randomly choose a neighbor from NBHD)
else
  CLS = stepII(CLC, NBHD)
  (iterative improvement: among the neighbors in NBHD that improve BestWPLL, choose the one that maximally improves BestScore in terms of CLL or AUC; if there is no improving neighbor, choose the minimally worsening one)
end if
Return CLS
predicate QP. It uses the Lazy-MC-SAT algorithm to perform inference over the network con-
structed using the current structure MLN and the relational data RDB of a tuning set. In the
tuning set, all the groundings of the query predicate QP are commented out (i.e., withheld
from the evidence). MC-SAT produces for each grounding of the query predicate QP the
probability that it is true. These values are then used to compute the average CLL by
distinguishing positive and negative atoms: a positive atom with estimated probability P
contributes logP to the CLL, while a negative one contributes log(1−P).
In the subsidiary local search procedure LocalSearchII
and in the StepII function, the candidates are filtered based on their improvement in terms of
WPLL. Among the candidates that improve WPLL, the one that most maximizes CLL is then
chosen as the best candidate to continue the search.
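The CLL computation just described reduces to a few lines once the per-grounding probabilities are available. The clamping of probabilities away from 0 and 1 is an assumption added here to keep the logarithms finite; the thesis does not state how degenerate estimates are handled:

```python
import math

# Sketch of the scoring step: each grounding of the query predicate
# contributes log P if true in the data and log(1 - P) otherwise, and the
# score is the average over all groundings. Illustrative only.

def average_cll(probs, truths, eps=1e-6):
    total = 0.0
    for p, truth in zip(probs, truths):
        # clamp, since sampled probabilities can be exactly 0 or 1
        p = min(max(p, eps), 1.0 - eps)
        total += math.log(p) if truth else math.log(1.0 - p)
    return total / len(probs)

# two groundings, both predicted true with probability 0.5
score = average_cll([0.5, 0.5], [True, False])
```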
5.3.2 The ILS-DSLAUC version
The ILS-DSLAUC version of the algorithm maximizes the AUC of the PR curve during search. In
Algorithm 5.2, the function ComputeScore computes the average AUC over all the groundings of the
query predicate QP. It uses the Lazy-MC-SAT algorithm to perform inference over the network
constructed using the current structure MLN and the relational data RDB of a tuning set. In
the tuning set, all the groundings of the query predicate QP are commented out (i.e., withheld
from the evidence). MC-SAT produces for each grounding of the query predicate QP the
probability that it is true. The precision-recall curve for a predicate is computed by varying the
threshold on the predicted probability above which a ground atom is predicted to be true; i.e.,
the ground atoms whose probability of being true is greater than the threshold are predicted
positive and the rest negative. For the computation of AUC we used the
package of (Davis and Goadrich 2006). In the subsidiary local search procedure LocalSearchII
and in the StepII function, the candidates are filtered based on their improvement in terms of
WPLL. Among the candidates that improve WPLL, the one that most maximizes AUC is then
chosen as the best candidate to continue the search.
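The threshold sweep can be sketched as follows. Simple trapezoidal integration is used here for self-containment; the package of (Davis and Goadrich 2006) interpolates PR points differently, so its values would not match this sketch exactly:

```python
# Sketch of AUC-PR via a threshold sweep: rank groundings by predicted
# probability, trace out (recall, precision) points, and integrate.
# Illustrative only; PR interpolation differs from Davis-Goadrich.

def auc_pr(probs, truths):
    pairs = sorted(zip(probs, truths), key=lambda pt: -pt[0])
    total_pos = sum(1 for t in truths if t)
    tp = fp = 0
    points = [(0.0, 1.0)]  # (recall, precision) starting convention
    for _, truth in pairs:
        if truth:
            tp += 1
        else:
            fp += 1
        points.append((tp / total_pos, tp / (tp + fp)))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area

# a perfect ranking gives an area of 1.0
area = auc_pr([0.9, 0.8, 0.2, 0.1], [True, True, False, False])
```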
5.4 Experiments
Through experimental evaluation we want to answer the following questions:
(Q1) Are the proposed algorithms competitive with state-of-the-art discriminative training
algorithms of MLNs?
(Q2) Are the proposed algorithms competitive with the state-of-the-art generative algo-
rithm for structure learning of MLNs?
(Q3) Are the proposed algorithms competitive with pure probabilistic approaches such as
Naïve Bayes and Bayesian Networks?
(Q4) Are the proposed algorithms competitive with state-of-the-art ILP systems for the task
of structure learning of MLNs?
(Q5) Do the proposed algorithms always perform better than BUSL for classification tasks?
If not, are there any regimes in which each algorithm performs better?
(Q6) Regarding the task of Entity Resolution, do the proposed algorithms perform better
than other language-independent discriminative approaches based on MLNs?
Regarding question (Q1) we have to compare all our algorithms with Preconditioned Scaled
Conjugate Gradient (PSCG) which is the state-of-the-art discriminative training algorithm for
MLNs, proposed in (Lowd and Domingos 2007). It must be noted that this algorithm takes
as input a fixed structure, and with the clausal knowledge base we use in our experiments for
CORA (each dataset comes with a hand-coded knowledge base), PSCG has achieved the best
published results. We also exclude the approach of adapting the rule set and then learning
weights with PSCG, since it would be computationally intractable.
To answer question (Q2) we have to perform experimental comparison with the Bottom-Up
Structure Learning (BUSL) algorithm (Mihalkova and Mooney 2007) which is the state-of-the-
art algorithm for this task. Since, in principle, the structure of MLNs can be learned using any
ILP technique, it would be interesting to know how our algorithms compare to ILP approaches. In
(Kok and Domingos 2005), the proposed algorithm based on beam search (BS) was shown to
outperform FOIL and the state-of-the-art ILP system ALEPH for the task of learning MLNs
structure. Moreover, BS outperformed both Naïve Bayes and Bayesian Networks in terms of
CLL and AUC. Since in (Mihalkova and Mooney 2007) it was shown that BUSL outperforms
the BS algorithm of (Kok and Domingos 2005), our baseline for questions (Q2), (Q3) and
(Q4) is again BUSL. It must be noted that since the goal of learning MLNs (and then performing
inference over the model) is probability estimation, the proposed algorithms are not
directly comparable with ILP systems, because these are not designed to maximize the data's
likelihood (and thus the quality of the probabilistic predictions). Moreover, since ALEPH and
FOIL learn more restricted clauses (non recursive definite clauses), the only ILP system that
is directly comparable with our algorithm is CLAUDIEN which, unlike most ILP systems that
learn only Horn clauses, is able to learn arbitrary first-order clauses. Thus the comparison re-
gards the task of structure learning of MLNs where ILP systems learn the structure followed by
a weight learning phase. In (Kok and Domingos 2005) the authors showed that CLAUDIEN
(also ALEPH and FOIL) followed by a weight learning phase was outperformed by the BS
algorithm in terms of CLL and AUC. Regarding question (Q5), we compare all our algorithms
and BUSL on two datasets, one of small size and one of much larger size, with the goal of
discovering regimes in which each algorithm performs better. Finally, to answer question (Q6),
we compare our algorithms with the best language-independent discriminative approach to
Entity Resolution based
on MLNs proposed in (Singla and Domingos 2006b). In this work, the MLN(G+C+T) model is
language-independent because it does not contain rules referring to specific strings occurring
in the data. This is similar to the approach that we follow here for this task: we learn rules
which are not vocabulary specific. In (Singla and Domingos 2006b) the discriminative weight
learning approach is based on the voted perceptron for MLNs and was used to learn weights for
different hand-coded models (one of these was MLN(G+C+T)). Since in (Lowd and Domingos
2007) it was shown that PSCG generally outperforms the voted perceptron, and since for the task
of entity resolution that comparison followed a language-dependent approach
(excluding MLN(G+C+T)), it is interesting to investigate how our algorithms compare
to MLN(G+C+T).
5.4.1 Link Analysis
As introduced in Section 4.3.1, Link Analysis is an important problem in many domains where
entities such as people, Web pages, computers, scientific publications, organizations, are in-
terconnected and interact in one way or another (Popescul and Ungar 2003). Predicting the
presence of links between entities is not an easy task due to the characteristics of such do-
mains. The first requirement regards the representation formalism. Flat representations are not
suitable to deal with these problems, hence relational formalisms must be used. Second, most
of these domains contain noisy or partially observed data thus robust methods for dealing with
uncertainty must be used.
In many scenarios related to Link Analysis, it is known in advance which entity will be queried
after the model has been learned, i.e., the query variable is known beforehand. Therefore it is
useless to optimize the joint distribution of all the variables. Instead, in
order to increase classification accuracy, it is sufficient to optimize the distribution of the query
predicate given the evidence. For example, in Social Network modeling there is often a target
predicate that expresses a relationship between two entities of a certain type, and the problem
is to determine whether this relation holds between two given objects in the domain. Therefore a
discriminative approach to this problem is more helpful when it is known a priori what
the query predicate is.
Regarding other discriminative approaches based on SRL models applied to link analysis,
in (Popescul and Ungar 2003) an SRL model, Structural Logistic Regression, was used to solve
a link analysis problem: predicting citations in scientific literature. In (Taskar et al. 2003)
Relational Markov Networks were used for link prediction in two domains: university web-
pages and social networks. Markov Logic was first applied to link prediction by (Richardson
and Domingos 2006) and then by (Kok and Domingos 2005; Mihalkova and Mooney 2007;
Singla and Domingos 2005). The only discriminative approach is that of (Singla
and Domingos 2005), where a discriminative weight learning algorithm based on the voted
perceptron was applied to the problem of link prediction in social networks. The experiments of
(Singla and Domingos 2005) were performed on the same data that we use here, and no
structure was learned: a hand-coded MLN structure was used and only the parameters of
the model were learned. In this dissertation, we try to learn the clauses from the given facts
together with their parameters.
Dataset
The experiments on link prediction were carried out on a publicly-available database: the
UW-CSE database (available at http://alchemy.cs.washington.edu/data/uw-cse) used by (Kok
and Domingos 2005; Mihalkova and Mooney 2007; Richardson and Domingos 2006; Singla
and Domingos 2005). This dataset represents a standard relational one and is used for the
important relational task of social network analysis.
The published UW-CSE dataset consists of 15 predicates and 1323 constants divided into 9
types. Types include publication, person, course, etc. Predicates include Student(person),
Professor(person), AdvisedBy(person1, person2), TaughtBy(course, person, quarter),
Publication(paper, person), etc. The dataset contains 2673 tuples (true ground atoms, with the
remainder assumed false). The task is to predict who is whose advisor from information about
coauthorships, classes taught, etc. More precisely, the query atoms are all groundings of
AdvisedBy(person1, person2), and the evidence atoms are all groundings of all other predicates
(except Student and Professor) (Richardson and Domingos 2006). In our experiments we
performed inference over the predicate AdvisedBy by commenting out all the groundings of the
predicates Professor and Student.
5.4.2 Entity Resolution
As introduced in Section 4.3.2, Entity Resolution is the problem of determining which records
in a database refer to the same entities, and is an essential and expensive step in the data
mining process. The problem was originally defined in (Newcombe et al. 1959), and the
work in (Fellegi and Sunter 1969) then laid the theoretical basis of what is now known as
object identification, record linkage, de-duplication, merge/purge, identity uncertainty, co-
reference resolution, and so on. The main characteristics of domains where this problem must
be solved are relations between objects and uncertainty about these relations due to noisy or
partially observed data.
5. THE ILS-DSL ALGORITHM
In most Entity Resolution settings, it is not required to resolve entities of all the different
types in the domain, but only those of a single category. Therefore it is useless to optimize
the joint distribution of all the variables when we know a priori which entity we want to
resolve. In order to increase classification accuracy, it is sufficient to optimize the
distribution of the query predicate given the evidence. For example, in protein networks there
is often a target predicate that expresses a relationship between two proteins of a certain
type, and the problem is to find whether two observations refer to the same entity, i.e., the
proteins are the same but, due to noise or partially observed data, their equality is not
evident. Another application field is that of citation databases, where many citations may refer
to the same paper and it is important to deduplicate the data. Discriminative approaches take
a set of variables as input and produce predictions for a set of output variables. For this
reason, discriminative approaches are more suitable for entity resolution problems than
generative approaches.
Regarding other SRL models for Entity Resolution, several approaches have been proposed,
such as those in Bhattacharya and Getoor (2004); Milch et al. (2005); Pasula and
Russell (2001). Other approaches similar to ours, based on Markov Logic, are those in Lowd
and Domingos (2007); Singla and Domingos (2005, 2006b). All these propose discriminative
weight learning approaches that take an existing structure as input and learn the parameters of
the model. The approach that we propose in this chapter learns both structure and parameters
in a discriminative fashion.
Dataset
The CORA dataset consists of 1295 citations of 132 different computer science papers,
drawn from the CORA CS Research Paper Engine. The task is to predict which citations refer
to the same paper, given the words in their author, title, and venue fields. The labeled data also
specify which pairs of author, title, and venue fields refer to the same entities. We performed
experiments for each field in order to evaluate the ability of the model to deduplicate fields as
well as citations. The dataset contains 10 predicates and 70367 tuples (true and false ground
atoms, with the remainder assumed false). Since the number of possible equivalences is very
large, as the authors did in (Lowd and Domingos 2007) we used the canopies found in (Singla
and Domingos 2006b) to make this problem tractable. The dataset used is in Alchemy format
(publicly available at http://alchemy.cs.washington.edu/data/cora/). The original version, not
in Alchemy format, was segmented by Bilenko and Mooney (Bilenko and Mooney 2003)
(available at http://www.cs.utexas.edu/users/ml/riddle/data/cora.tar.gz).
5.4.3 Systems and Methodology
We implemented the algorithm ILS-DSL as part of the MLN++ package (Section 8.3), a suite of
algorithms based on Markov Logic and built upon the Alchemy framework (Kok et al. 2005).
Alchemy implements inference and learning algorithms for Markov Logic. Alchemy can be
viewed as a declarative programming language akin to Prolog, but with some key differences:
the underlying inference mechanism is model checking instead of theorem proving, and the full
syntax of first-order logic is allowed, rather than just Horn clauses. Moreover, Alchemy
provides built-in functionality for handling uncertainty and learning from data. MLN++ uses
the Alchemy API for some tasks, such as the implementations of L-BFGS and Lazy-MC-SAT,
to learn maximum-WPLL weights and compute the CLL during clause search.
Regarding parameter learning, we compared our algorithms' performance with the state-
of-the-art algorithm PSCG of (Lowd and Domingos 2007) for discriminative weight learning
of MLNs. This algorithm takes as input an MLN and the evidence (groundings of non-query
predicates) and discriminatively trains the MLN to optimize the CLL of the query predicates
given evidence. Our algorithms and PSCG optimize the CLL (or AUC) of the query predicates,
and a comparison between these algorithms is useful to understand whether automatically
learning the clauses from scratch can improve over hand-coded MLN structures in terms of
classification accuracy of the query predicates given evidence.
We performed all the experiments on a 2.13 GHz Intel Core2 Duo CPU. For the UW-
CSE dataset we trained PSCG on the hand-coded knowledge base provided with the dataset.
We used the implementation of PSCG in the Alchemy package and ran this algorithm with the
default parameters for 10 hours. For the CORA dataset, for PSCG we report the results obtained
in (Lowd and Domingos 2007), where PSCG was trained on a hand-coded MLN and achieved
the best current results on this dataset. For the language-independent approach MLN(G+C+T) of
(Singla and Domingos 2006b) we report the results for three of the query predicates in this
domain: sameBib, sameAuthor and sameVenue (the results reported in (Singla and Domingos
2006b) do not include the predicate sameTitle).
For both datasets, for all our algorithms we used the following parameters: the mean and
variance of the Gaussian prior were set to 0 and 100, respectively; maximum variables per
clause = 4; maximum predicates per clause = 4; penalization of weighted pseudo-likelihood
= 0.01 for UW-CSE and 0.001 for CORA. For L-BFGS we used the following parameters:
maximum iterations = 10,000 (tight) and 10 (loose); convergence threshold = 10^-5 (tight) and
10^-4 (loose). For Lazy-MC-SAT during learning we used the following parameters: memory
limit = 300MB for both datasets; maximum number of steps for Gibbs sampling = 100; simulated
annealing temperature = 0.5; the parameter k (number of iterations without improvement)
was set to three, while the parameter δ was set to 2. All these parameters were set in an ad
hoc manner, and per-fold optimization may lead to better results. Regarding BUSL, for both
datasets, we used the following parameters: the mean and variance of the Gaussian prior were
set to 0 and 100, respectively; maximum variables per clause = 5 for UW-CSE and 6 for CORA;
maximum predicates per clause = 6; penalization of WPLL = 0.01 for UW-CSE and 0.001 for
CORA; minWeight = 0.5 for UW-CSE and 0.01 for CORA. For L-BFGS we used the following
parameters: maximum iterations = 10,000 (tight) and 10 (loose); convergence threshold = 10^-5
(tight) and 10^-4 (loose).
In the UW-CSE domain, we followed the same leave-one-area-out methodology as in
(Richardson and Domingos 2006). In the CORA domain, we performed 5-fold cross-validation.
For each train/test split, one of the training folds is used as tuning set for computing the CLL
(or AUC). For each system on each test set, we measured the CLL and the AUC of the PR curve
for the query predicates. The advantage of the CLL is that it directly measures the quality of
the probability estimates produced. The advantage of the AUC is that it is insensitive to the
large number of true negatives (i.e., ground atoms that are false and predicted to be false), but
the disadvantage is that it ignores calibration by considering only whether true atoms are given
higher probability than false atoms. The CLL of a query predicate is the average over all its
groundings of the ground atom's log-probability given evidence. The precision-recall curve for
a predicate is computed by varying the CLL threshold above which a ground atom is predicted
to be true; i.e., the atoms whose probability of being true is greater than the threshold are
predicted positive and the rest negative. For the computation of the AUC we used the package
of (Davis and Goadrich 2006).
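Both metrics can be computed directly from the per-atom probabilities that inference outputs. The sketch below is illustrative only (the dissertation itself uses the package of Davis and Goadrich for AUC); the helper names are our own:

```python
import math

def average_cll(probs, labels, eps=1e-6):
    """Average conditional log-likelihood over all groundings:
    log p for true atoms, log(1 - p) for false ones."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += math.log(p) if y else math.log(1.0 - p)
    return total / len(probs)

def pr_curve(probs, labels):
    """Precision-recall points obtained by sweeping the probability
    threshold above which an atom is predicted true."""
    points = []
    n_pos = sum(labels)
    for t in sorted(set(probs), reverse=True):
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y)
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and not y)
        if tp + fp > 0 and n_pos > 0:
            points.append((tp / n_pos, tp / (tp + fp)))  # (recall, precision)
    return points
```

The area under this curve can then be estimated by interpolation between the points, as the Davis and Goadrich package does with proper PR-space interpolation.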
5.4.4 Results
After learning the structure discriminatively, we performed inference on the test fold for both
datasets by using MC-SAT with number of steps = 10000 and simulated annealing temperature
= 0.5. For each experiment, all the groundings of the query predicates on the test fold were
commented out: advisedBy for the UW-CSE dataset (professor and student are also commented
out) and sameBib, sameTitle, sameAuthor and sameVenue for CORA. MC-SAT produces
probability outputs for every grounding of the query predicate on the test fold. We used these
values to
compute the average CLL over all the groundings and to compute the PR curve.
We denote the two versions of the algorithm as ILS-DSL_CLL and ILS-DSL_AUC. For the
algorithm that optimizes the AUC of the PR curve during search, we scored each structure using
the package of (Davis and Goadrich 2006). The results for all algorithms on the UW-CSE
dataset are reported in Table 5.1 for CLL and Table 5.2 for AUC. In Table 5.1, CLL is averaged
over all the groundings of the predicate advisedBy in the test fold. Regarding the comparison
with PSCG in terms of CLL, in this domain our algorithms perform better than PSCG in every
fold of the dataset and overall. Regarding AUC, PSCG overall performs better than both our
algorithms. It must be noted that on two out of five folds (language and graphics) the results
of our algorithms were quite competitive and there was a large difference only in the theory
fold where PSCG achieved a high result. Our best-performing algorithm in terms of CLL was
ILS-DSL_AUC. This was a surprising result, since we expected better results from ILS-DSL_CLL,
the algorithm that optimizes CLL during search. On the other hand, in terms of AUC, our
algorithms performed equally. Overall for UW-CSE, we can state that our algorithms perform
better in terms of CLL and worse in terms of AUC.
For the CORA dataset the results are reported in Tables 5.3 and 5.4. For CLL for each
query predicate we report the average of CLL of its groundings over the test fold (for each
predicate, training is performed on four folds and testing on the remaining one in a 5-fold
cross-validation). For CORA, compared to PSCG, all our algorithms perform better in terms
of CLL for each of the query predicates, but worse in terms of AUC. We observed empirically
on each fold that the performances in terms of CLL and AUC were always balanced: a slightly
better performance in CLL always resulted in a slightly worse performance in terms of AUC,
and vice versa. Since CLL determines the quality of the probability predictions output by the
algorithm, all our algorithms outperform PSCG in terms of the ability to predict correctly the
query predicates given evidence. However, since AUC is useful to predict the few positives in
the data, PSCG produces better results for only positive examples. Hence, these results answer
question (Q1). It must be noted that PSCG has achieved the best published results on CORA in
terms of AUC (Lowd and Domingos 2007) and the approach followed is language-dependent,
i.e. the hand-coded MLN used with PSCG in (Lowd and Domingos 2007) contains rules such
that a weight is learned for each ground clause that is constructed using specific constants in
Table 5.1: CLL results for the query predicate advisedBy in the UW-CSE domain

area          language        graphics        systems         theory          ai              Overall
ILS-DSL_CLL   -0.048±0.016    -0.016±0.003    -0.020±0.003    -0.020±0.005    -0.022±0.003    -0.025±0.006
ILS-DSL_AUC   -0.028±0.008    -0.015±0.003    -0.017±0.002    -0.018±0.004    -0.019±0.003    -0.019±0.004
PSCG          -0.049±0.016    -0.023±0.005    -0.026±0.005    -0.028±0.007    -0.032±0.005    -0.032±0.008
BUSL          -0.024±0.008    -0.014±0.002    -0.295±0.000    -0.013±0.003    -0.019±0.003    -0.073±0.003

Table 5.2: AUC results for the query predicate advisedBy in the UW-CSE domain

area          language   graphics   systems   theory   ai      Overall
ILS-DSL_CLL   0.011      0.006      0.007     0.010    0.006   0.008
ILS-DSL_AUC   0.016      0.005      0.007     0.005    0.008   0.008
PSCG          0.011      0.005      0.069     0.101    0.034   0.044
BUSL          0.115      0.007      0.007     0.032    0.013   0.035
the domain. This makes the PSCG approach of (Lowd and Domingos 2007) vocabulary-specific,
while all our algorithms learn general rules not tied to a specific set of strings.
Regarding the comparison with BUSL, the results show that all our algorithms perform
better than BUSL in terms of CLL on both datasets. It must be noted, however, that for UW-
CSE, BUSL performed generally better than our algorithms, but produced very low results in
one fold. In terms of AUC, BUSL performs slightly better on the UW-CSE dataset while in
the CORA dataset all our algorithms outperform BUSL. Therefore, questions (Q2), (Q3) and
(Q4) can be answered affirmatively. Our discriminative algorithms are competitive with BUSL
even though for BUSL, in the UW-CSE domain, we used optimized parameters taken from
(Mihalkova and Mooney 2007) in terms of number of variables and literals per clause, while
for our algorithms we did not perform per-fold optimization of any parameter.
Regarding question (Q5), the goal was to verify whether the previous results of (Ng and
Jordan 2002), namely that on small datasets generative approaches can perform better than
discriminative ones, carry over to MLNs. The UW-CSE dataset, with a total of 2673 tuples, is
much smaller than CORA, which has 70367 tuples. The results of Tables 5.1 and 5.2 show
that on the UW-CSE dataset the generative algorithm BUSL performs better in terms of AUC
and is competitive in terms of CLL, since it underperforms our algorithms only because of the
low results in the systems fold of the dataset. Thus we can answer question (Q5) by confirming
the results in (Ng and Jordan 2002): on small datasets generative approaches can perform
better than discriminative ones, while on larger datasets discriminative approaches outperform
Table 5.3: CLL results for all query predicates in the CORA domain

area          sameBib         sameTitle       sameAuthor      sameVenue       Overall
ILS-DSL_CLL   -0.087±0.001    -0.077±0.006    -0.148±0.009    -0.121±0.004    -0.108±0.005
ILS-DSL_AUC   -0.168±0.002    -0.117±0.010    -0.158±0.011    -0.101±0.004    -0.136±0.007
PSCG          -0.291±0.003    -0.231±0.014    -0.182±0.013    -0.444±0.012    -0.287±0.011
MLN(G+C+T)    -0.394±0.004    −               -0.263±0.053    -1.196±0.031    -0.618±0.030
BUSL          -0.566±0.001    -0.100±0.004    -0.834±0.009    -0.232±0.005    -0.433±0.005

Table 5.4: AUC results for all query predicates in the CORA domain

area          sameBib   sameTitle   sameAuthor   sameVenue   Overall
ILS-DSL_CLL   0.603     0.428       0.371        0.315       0.429
ILS-DSL_AUC   0.334     0.470       0.688        0.252       0.436
PSCG          0.990     0.953       0.999        0.823       0.941
MLN(G+C+T)    0.973     −           0.980        0.743       0.899
BUSL          0.138     0.419       0.323        0.218       0.275
generative ones.
The final question (Q6) concerns the task of entity resolution and the approaches that are
based on MLNs and are language-independent, i.e., that do not contain rules which refer to
specific constants in the domain. The results of Tables 5.3 and 5.4 show that in terms of CLL,
all our algorithms outperform MLN(G+C+T) for all the query predicates, but in terms of AUC,
MLN(G+C+T) outperforms our algorithms. Thus, the same conclusions as for PSCG are valid
for MLN(G+C+T). Our algorithms produce in general more accurate probability predictions,
while MLN(G+C+T) produces better results for only positive atoms. Therefore, question (Q6)
can be answered affirmatively.
Finally, we give examples of clauses from MLN structures learned for both datasets (we
omit the corresponding weights). For the UW-CSE dataset, examples of learned clauses are:

position(a1,a2) ∨ ¬advisedBy(a1,a3) ∨ yearsInProgram(a1,a4) ∨ yearsInProgram(a3,a4)
¬professor(a1) ∨ student(a1) ∨ advisedBy(a2,a1) ∨ tempAdvisedBy(a1,a2)
These clauses model the relation advisedBy between students and professors. In the first
clause, a1 and a3 are variables that denote persons (students or professors), while a2 and a4
denote respectively university positions and years spent in university programs. The predicate
position relates the person denoted by a1 (only professors have a position) to his or her
university position. In the second clause, a1 and a2 are variables that denote persons who are
either in an advisedBy or a tempAdvisedBy relationship.
For CORA, examples of learned clauses are the following:
sameAuthor(a1,a2) ∨ ¬hasWordAuthor(a1,a3) ∨ ¬hasWordAuthor(a2,a3)
¬title(a1,a2) ∨ ¬title(a3,a2) ∨ sameBib(a3,a1)
In the first clause, a1 and a2 denote author fields while the predicate hasWordAuthor relates
author fields to words contained in these fields. In the second rule the predicate title relates titles
to their respective citations and the predicate sameBib is true if both its arguments denote the
same citation.
5.5 Related Work
Many works in the SRL or PILP area have addressed classification tasks. Our discriminative
method falls among those approaches that tightly integrate ILP and statistical learning in a
single step for structure learning. The earlier works in this direction are those in (Dehaspe 1997;
Popescul and Ungar 2003) that employ statistical models such as maximum entropy modeling
in (Dehaspe 1997) and logistic regression in (Popescul and Ungar 2003). These approaches
can be computationally very expensive. A simpler approach that integrates FOIL and Naïve
Bayes is nFOIL proposed in (Landwehr et al. 2005). This approach interleaves the steps of
generating rules and scoring them through CLL. In another work (Davis et al. 2005) these
steps are coupled by scoring the clauses through the improvement in classification accuracy.
This algorithm incrementally builds a Bayes net during rule learning and each candidate rule is
introduced in the network and scored by whether it improves the performance of the classifier.
In a recent approach (Landwehr et al. 2006), the kFOIL system integrates ILP and support
vector learning. kFOIL constructs the feature space by leveraging FOIL search for a set of
relevant clauses. The search is driven by the performance obtained by a support vector machine
based on the resulting kernel. The authors showed that kFOIL improves over nFOIL. Recently,
in TFOIL (Landwehr et al. 2007), Tree-Augmented Naïve Bayes, a generalization of Naïve
Bayes, was integrated with FOIL, and it was shown that TFOIL outperforms nFOIL.
Regarding other approaches on MLNs, the most closely related is the recently published
work of (Huynh and Mooney 2008). The difference from our algorithms lies in the very
restricted class of clauses that this approach can learn, namely non-recursive definite clauses.
The authors use a modification of ALEPH to generate a very large number of potential clauses
and then effectively learn their parameters by altering existing discriminative MLN weight-
learning methods to perform exact inference and L1 regularization. Since clauses are generated
by ALEPH, this approach is limited to problems where there is a target predicate that can
be inferred using non-recursive definite clauses, and only in this case is it possible to perform
exact inference. On the other hand, our algorithms, having no restrictions on the clauses that
can be learned, can deal with more general problems that need full first-order expressiveness.
Another difference is that the work of (Huynh and Mooney 2008) follows a two-step approach:
clauses are first generated by ALEPH and then weights are learned on the final theory. This
can be seen as a kind of static propositionalization, which was shown in (Landwehr et al. 2007)
to be outperformed on a large number of ILP datasets by the dynamic propositionalization
approach that we follow in our algorithms. Another advantage of our algorithms is that
Lazy-MC-SAT is an approximate inference algorithm that can handle ground atoms with
unknown truth values, which characterize many SRL domains. On the other hand, the algorithm
of (Huynh and Mooney 2008) performs exact inference and does not handle cases where data
may be incomplete or partially observed.
Regarding the integration of the two steps, the approach most closely related to the proposed
algorithms is nFOIL (and TFOIL as an extension), which is the first system in the literature to
tightly integrate feature construction and Naïve Bayes. Such a dynamic propositionalization
was shown to be superior to static propositionalization approaches that use Naïve Bayes only
to post-process the rule set. The approach differs from ours in that nFOIL selects features
and parameters that jointly optimize a probabilistic score on the training set, while our
algorithms maximize the likelihood on the training data but select the clauses based on the
tuning set. This approach is similar to SAYU (Davis et al. 2005), which uses the tuning set
to compute the score in terms of classification accuracy or AUC, with the difference that
DSL_CLL uses CLL as score instead of AUC; SAYU is thus similar only to DSL_AUC. From the
point of view of step integration, MACCENT (Dehaspe 1997) follows a similar approach by
inducing clausal constraints (one at a time) that are used as features for maximum-entropy
classification.
Another difference with nFOIL and SAYU is that all our algorithms use MC-SAT to perform
inference for the computation of CLL; MC-SAT is able to handle the probabilistic, deterministic
and near-deterministic dependencies that are typical of statistical relational learning. Moreover,
the lazy version, Lazy-MC-SAT, reduces memory and time by orders of magnitude, as the results
in (Poon et al. 2008) show. This makes it possible to apply the proposed algorithms to very large
domains.
Finally, from the point of view of search strategies, our algorithms are also similar to
approaches in ILP that exploit SLS (Zelezny et al. 2006). The algorithms that we propose here
differ in that they use likelihood as evaluation measure instead of ILP coverage criteria.
Moreover, our algorithms differ from those in (Zelezny et al. 2006) in that we use hybrid
SLS approaches, which can combine other simple SLS methods to produce high-performance
algorithms.
5.6 Summary
In this chapter we have introduced the ILS-DSL algorithm, which discriminatively learns first-
order clauses and their weights. The algorithm scores the candidate structures by maximizing
conditional likelihood or the area under the precision-recall curve, while setting the parameters
by maximum pseudo-likelihood. ILS-DSL is based on the Iterated Local Search metaheuristic.
To speed up learning we propose some simple heuristics that greatly reduce the computational
effort for scoring structures. Empirical evaluation with real-world data in two domains shows
the promise of our approach, which improves over the state-of-the-art discriminative weight
learning algorithm for MLNs in terms of conditional log-likelihood of the query predicates
given evidence. We have also compared the proposed algorithm with the state-of-the-art
generative structure learning algorithm and shown that on small datasets the generative
approach is competitive, while on larger datasets the discriminative approach outperforms
the generative one.
The algorithm can be further improved in several ways: weakening intensification through
a higher random-walk probability in the local search procedure; making the acceptance
function probabilistic, since the currently used acceptance function in ILS-DSL performs
iterative improvement in the space of local optima, which can lead to getting stuck in local
optima, and a probabilistic acceptance function would induce some random walk among the
local optima; dynamically adapting the nature of the perturbations; implementing parallel
models such as MPI (Message Passing Interface) or PVM (Parallel Virtual Machine) in order
to score more structures in parallel, or assigning each iteration of ILS-DSL to a separate
thread and then taking the best result; and developing heuristics that can find, among the
candidates that do not improve WPLL, potential candidates that can improve CLL or AUC.
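As an illustration of the probabilistic acceptance function suggested above, a Metropolis-style criterion is one possible choice. This is a sketch, not the rule used in ILS-DSL; the temperature value is an assumption, and scores are taken to be CLL values (higher is better):

```python
import math
import random

def accept(current_score, candidate_score, temperature=0.1, rng=random):
    """Probabilistic acceptance for ILS: always accept improvements,
    otherwise accept a worse local optimum with a probability that
    decays exponentially with the score loss. This replaces pure
    iterative improvement and allows a random walk among local optima."""
    if candidate_score >= current_score:
        return True
    return rng.random() < math.exp((candidate_score - current_score) / temperature)
```

With temperature near zero this degenerates to the current iterative-improvement rule; larger temperatures accept worse local optima more often.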
Chapter 6
The RBS-DSL algorithm
6.1 The GRASP metaheuristic
Greedy Randomized Adaptive Search Procedure (GRASP) (Feo and Resende 1989, 1995) is an
approach for quickly finding high-quality solutions by applying a greedy construction search
method (which, starting from an empty candidate solution, at each construction step adds the
solution component ranked best according to a heuristic selection function) and subsequently a
perturbative local search algorithm to improve the candidate solution thus obtained. This type
of hybrid search method often yields much better solution quality than simple SLS methods
initialized at candidate solutions obtained by Uninformed Random Picking (Hoos and Stutzle 2005).
Moreover, when starting from a greedily constructed candidate solution, the subsequent per-
turbative local search process typically takes much fewer improvement steps to reach a local
optimum. Since greedy construction methods can typically generate a very limited number of
different candidate solutions, GRASP avoids this disadvantage by randomizing the construc-
tion method such that it can generate a large number of different good starting points for a
perturbative local search method. In Algorithm 6.1, in each iteration, the randomized construc-
tive local search algorithm GreedyRandomizedConstruction and the perturbative LocalSearch
algorithm are applied until the termination criterion is met. The algorithm GreedyRandom-
izedConstruction, in contrast to greedy constructive algorithms, does not necessarily add the
best solution component but rather selects it randomly from a list of highly ranked solution
components (Restricted Candidate List) which can be defined by cardinality restriction or by
value restriction. In this chapter we present a novel algorithm inspired by GRASP that performs
randomized beam search by scoring the structures through maximum likelihood in the
Algorithm 6.1 The GRASP metaheuristic
Procedure GRASP
  S = ∅
  repeat
    S0 = GreedyRandomizedConstruction(S)
    S* = LocalSearch(S0)
    S = UpdateSolution(S, S*)
  until termination criterion is met
  Return S
end
first phase and then, in a second step, uses maximum CLL or AUC of the PR curve to randomly
generate a beam of the best clauses to add to the current MLN structure.
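The GRASP loop of Algorithm 6.1 can be rendered as a short generic routine; the construction, local-search and comparison functions are placeholders supplied by the caller, not part of the original pseudocode:

```python
def grasp(greedy_randomized_construction, local_search, better, iterations, rng):
    """Generic GRASP loop: repeatedly build a randomized greedy
    starting point, improve it by perturbative local search, and
    keep the best solution seen so far."""
    best = None
    for _ in range(iterations):
        s0 = greedy_randomized_construction(rng)   # randomized greedy start
        s_star = local_search(s0)                  # perturbative improvement
        if best is None or better(s_star, best):
            best = s_star
    return best
```

For instance, with a random integer starting point and ±1 hill climbing as the local search, the loop recovers the maximizer of a unimodal objective after a few restarts.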
6.2 Randomized Beam Discriminative Structure Learning
In this section we present the Randomized Beam Search Discriminative Structure Learning
(RBS-DSL) algorithm. Algorithm 6.2 starts with a beam containing the initial clauses (in case
there are clauses previously learned) and unit clauses, and iteratively adds to the current
structure the best clause found by the SearchBestClause procedure. This procedure (Algorithm
6.3) takes as input the current beam and, using the clause construction operators, constructs in
GenerateCandidates all the potential candidate clauses to be scored for addition to the current
structure. Then for each of these candidates the gain in WPLL is computed. In the next step,
the algorithm performs a randomized construction of candidate clauses in the
RandomizedConstruction procedure (Algorithm 6.4). In this procedure, similarly to a GRASP
approach, the Restricted Candidate List (RCL) is first defined in a random fashion based on the
WPLL gain values. All candidates with a gain in WPLL greater than minGain + α * (maxGain - minGain)
(where α is a random number from a uniform probability distribution) are considered for
inclusion in the RCL. The parameter α, called the RCL parameter, has the important function
of inducing randomness in the algorithm: it determines the level of randomness or greediness
in the construction. In some GRASP implementations the parameter is fixed, while in others it
is adapted dynamically. The case α = 0 corresponds to a pure greedy algorithm, while α = 1
is equivalent to a random construction.
Algorithm 6.2 The RBS-DSL algorithm
Input: (P: set of predicates, MLN: Markov Logic Network, RDB: Relational Database, QP: query predicate)
CLS = all clauses in MLN ∪ P;
LearnWeights(MLN, RDB);
BestScore = ComputeScore(MLN, RDB, QP);
repeat
  BestClause = SearchBestClause(P, MLN, BestScore, CLS, RDB, QP);
  if BestClause ≠ null then
    Add BestClause to MLN;
    BestScore = ComputeScore(MLN, RDB, QP);
  end if
until BestClause = null for δ consecutive steps
Return MLN
For RBS-DSL_CLL, ComputeScore computes the average CLL over all the groundings of the query predicate QP.
For RBS-DSL_AUC, ComputeScore computes the AUC of the PR curve.
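The outer loop of Algorithm 6.2 can be sketched in executable form. For illustration the SearchBestClause step is abstracted into a greedy scan over candidate clauses (the real procedure uses the randomized beam of Algorithm 6.3), and clauses, scoring and candidate generation are supplied by the caller:

```python
def rbs_dsl(initial_clauses, generate_candidates, score, delta):
    """Outer RBS-DSL loop (sketch): repeatedly look for a clause that
    improves the score of the current theory, add it when found, and
    stop after delta consecutive iterations without improvement."""
    mln = list(initial_clauses)
    best_score = score(mln)
    misses = 0
    while misses < delta:
        best_clause, best_clause_score = None, best_score
        for clause in generate_candidates(mln):
            s = score(mln + [clause])          # score theory with the candidate added
            if s > best_clause_score:
                best_clause, best_clause_score = clause, s
        if best_clause is not None:
            mln.append(best_clause)
            best_score = best_clause_score
            misses = 0
        else:
            misses += 1                        # no improving clause this iteration
    return mln
```

In the real algorithm, `score` stands for weight learning followed by CLL (or AUC) computation over the query-predicate groundings.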
The similarity of our algorithm with GRASP is that randomization is applied not only to
the choice of the candidates from the RCL but also to the construction of the RCL. On the other
hand, the difference from GRASP is that GRASP randomly chooses only one candidate from
the RCL in order to continue the search, while our algorithm randomly constructs a list of
clauses by choosing them from the RCL. Another difference is that we follow the heuristic
that only candidates with a positive gain in WPLL are to be considered for CLL scoring.
Thus, in case there are candidates with no gain (minGain ≤ 0), we set the value threshold to
zero. In order not to lose randomness in the case of threshold = 0, a random choice among the
RCL candidates follows. Once the potential candidates for the RCL are randomly constructed,
the algorithm randomly chooses among these according to the random number rand and the
parameter λ. In our experiments we found empirically that the value λ = 0.5 * beamSize/100
induces enough randomness in the choice from the RCL candidates. This value depends on the
size of the beam, which is a parameter of the main algorithm. In most cases, the number of
candidates in the RCL, and of those chosen from this list, can be very high. This can cause
intractable computation times, because most of these candidates have to be scored again in terms
of CLL (or AUC). For this reason, it is reasonable to place a limit on the number of clauses to
Algorithm 6.3 The SearchBestClause procedure of the RBS-DSL algorithm
SearchBestClause(P: set of predicates, MLN: Markov Logic Network, BestScore: CLL or AUC score, BestWPLL: WPLL score, CLS: list of clauses, RDB: Relational Database, QP: query predicate)
Beam = CLS;
repeat
  CandidateClauses = GenerateCandidates(Beam, P);
  for each clause C in CandidateClauses do
    Add C to the current MLN; LearnWeights(MLN, RDB);
    CWPLL = score of C by WPLL; WPLLGain of C = CWPLL - BestWPLL;
  end for
  BestWPLLClauses = RandomizedConstruction(CandidateClauses, BestWPLL);
  scoredList: list of candidates scored in terms of CLL (or AUC);
  for each clause C in BestWPLLClauses do
    Add C to the current MLN;
    ComputeScore(MLN, RDB);
    Add C to scoredList;
  end for
  NewBeam = RandomizedBeam(scoredList, BestScore);
  BestClause = best clause in NewBeam;
  Beam = NewBeam;
until two consecutive iterations have not produced improvement
Return BestClause
For RBS-DSL_CLL, ComputeScore computes the average CLL over all the groundings of the query predicate.
For RBS-DSL_AUC, ComputeScore computes the AUC of the PR curve.
6.2 Randomized Beam Discriminative Structure Learning
be evaluated in the next step. This is achieved by setting the parameter maxNumClauses which
determines the number of potential candidates to be scored by CLL (or AUC).
Algorithm 6.4 Randomized construction of the best WPLL candidate list

RandomizedConstruction(CandidateClauses, BestWPLL)
  BestWPLLClauses: randomized list of best WPLL candidates
  maxNumClauses = maximum number of clauses to choose from the RCL
  α = random([0,1])  // random number from a uniform probability distribution
  threshold: value to use as limit
  minGain = minimumWPLLGain(CandidateClauses)
  maxGain = maximumWPLLGain(CandidateClauses)
  if minGain > 0 then
    threshold = minGain + α * (maxGain - minGain)
  else
    threshold = 0
  end if
  for each clause C in CandidateClauses do
    if WPLLGain(C) > threshold then
      rand = random([0,1])  // random number from a uniform probability distribution
      if rand > λ then
        Add C to BestWPLLClauses
      end if
      if size of BestWPLLClauses = maxNumClauses then
        break
      end if
    end if
  end for
  return BestWPLLClauses
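The randomized construction above can be sketched in Python as follows. This is an illustrative sketch of Algorithm 6.4, not the actual implementation: clauses are reduced to (clause, gain) pairs, and the function name and signature are ours.

```python
import random

def randomized_construction(candidates, lam, max_num_clauses, rng):
    """GRASP-style restricted-candidate-list construction over WPLL gains
    (illustrative sketch of Algorithm 6.4). candidates: list of
    (clause, wpll_gain) pairs; lam plays the role of the parameter lambda."""
    gains = [gain for _, gain in candidates]
    min_gain, max_gain = min(gains), max(gains)
    if min_gain > 0:
        alpha = rng.random()  # uniform in [0, 1]
        threshold = min_gain + alpha * (max_gain - min_gain)
    else:
        # some candidates yield no WPLL gain: accept any positive-gain clause
        threshold = 0.0
    chosen = []
    for clause, gain in candidates:
        # keep a candidate only if it beats the threshold and survives
        # the random filter rand > lambda
        if gain > threshold and rng.random() > lam:
            chosen.append(clause)
            if len(chosen) == max_num_clauses:
                break
    return chosen
```

The RandomizedBeam procedure described later applies the same threshold-and-filter pattern, but computes the threshold unconditionally and caps the result at beamSize.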
After the procedure RandomizedConstruction returns the list BestWPLLClauses, all the
candidates in this list are scored for CLL (or AUC) and given as input to the RandomizedBeam
procedure. This procedure (Algorithm 6.5) performs the same randomized process on the
candidates, but this time based on their CLL (or AUC) values. Unlike the
RandomizedConstruction procedure, the randomized construction of the beam does not exclude
candidates with a negative gain. The value of the parameter λ is the same as for
the RandomizedConstruction procedure.
Algorithm 6.5 Randomized choice of the best CLL (or AUC) candidate list to form the new beam

RandomizedBeam(ListClauses, BestScore)
  ListClauses: list of clauses scored for CLL (or AUC)
  newBeam: new list of clauses to randomly generate from ListClauses
  beamSize = size of the beam for the algorithm RBS-DSL
  α = random([0,1])  // random number from a uniform probability distribution
  threshold: value to use as limit
  minGain = minimumGain(ListClauses)
  maxGain = maximumGain(ListClauses)
  threshold = minGain + α * (maxGain - minGain)
  for each clause C in ListClauses do
    if Gain(C) > threshold then
      rand = random([0,1])  // random number from a uniform probability distribution
      if rand > λ then
        Add C to newBeam
      end if
      if size of newBeam = beamSize then
        break
      end if
    end if
  end for
  return newBeam

For RBS−DSLCLL, minimumGain returns the minimum gain in CLL among all candidates.
For RBS−DSLAUC, minimumGain returns the minimum gain in AUC among all candidates.
6.2.1 The RBS-DSLCLL version
The RBS-DSLCLL version of the algorithm maximizes CLL during search. In Algorithms 6.2
and 6.3, the function ComputeScore computes the average CLL over all the groundings of the
query predicate QP. It uses the Lazy-MC-SAT algorithm to perform inference over the network
constructed using the current structure MLN and the relational data RDB of a tuning set. In
the tuning set all the groundings of the query predicate QP are commented out (treated as unknown during inference). MC-SAT produces
for each grounding of the query predicate QP the probability that it is true. These values are
then used to compute the average CLL by distinguishing positive and negative atoms. For a
positive atom, its estimated probability P contributes log P to the CLL, and for a negative
atom the contribution is log(1−P). In Algorithm 6.5, the minimumGain and
maximumGain functions compute respectively the minimum and maximum gain in CLL among
all the potential candidates.
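The CLL computation just described can be sketched as follows. This is a minimal illustration, not the Alchemy code: probs stands for the MC-SAT probability estimates, and the eps clipping is our addition to keep the logarithms finite.

```python
import math

def average_cll(probs, labels, eps=1e-6):
    """Average conditional log-likelihood over the groundings of a query
    predicate (sketch). probs: estimates P(atom is true) from MC-SAT;
    labels: the actual truth values of the ground atoms."""
    total = 0.0
    for p, y in zip(probs, labels):
        # clip away from 0 and 1 so the log stays finite (our safeguard)
        p = min(max(p, eps), 1.0 - eps)
        # positive atoms contribute log P, negative atoms log(1 - P)
        total += math.log(p) if y else math.log(1.0 - p)
    return total / len(probs)
```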
6.2.2 The RBS-DSLAUC version
The RBS-DSLAUC version of the algorithm maximizes the AUC of the PR curve during search. In
Algorithms 6.2 and 6.3, the function ComputeScore computes the AUC over all the groundings
of the query predicate QP. It uses the Lazy-MC-SAT algorithm to perform inference over the
network constructed using the current structure MLN and the relational data RDB of a tuning
set. In the tuning set all the groundings of the query predicate QP are commented out (treated as unknown during inference). MC-SAT
produces for each grounding of the query predicate QP the probability that it is true. The
precision-recall curve for a predicate is computed by varying the CLL threshold above which a
ground atom is predicted to be true; i.e., the ground atoms whose probability of being true is greater
than the threshold are predicted positive and the rest negative. For the computation of AUC we used
the package of (Davis and Goadrich 2006). In Algorithm 6.5, the minimumGain and maximum-
Gain functions compute respectively the minimum and maximum gain in AUC among all the
potential candidates.
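The threshold sweep that yields the precision-recall points can be sketched as below. This is only an illustration of the construction; the actual AUC values reported here were computed with the package of (Davis and Goadrich 2006), which also handles interpolation between PR points correctly. The function name is ours.

```python
def pr_points(probs, labels):
    """Points of the precision-recall curve obtained by sweeping the
    probability threshold (illustrative sketch): atoms with estimated
    probability above the threshold are predicted true.
    probs: MC-SAT estimates; labels: actual truth values."""
    ranked = sorted(zip(probs, labels), key=lambda t: t[0], reverse=True)
    total_pos = sum(1 for _, y in ranked if y)
    points, tp, fp = [], 0, 0
    for _, y in ranked:  # lower the threshold just below each probability
        if y:
            tp += 1
        else:
            fp += 1
        points.append((tp / total_pos, tp / (tp + fp)))  # (recall, precision)
    return points
```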
6.3 Experiments
Experimental evaluation of RBS-DSL was performed for the same problems introduced in
the previous chapters: Link Analysis in Social Networks and Entity Resolution in citation
databases. The datasets used for RBS-DSL are the same as those used for the algorithms GSL and ILS-DSL, introduced in sections 4.3.2 and 4.3.1.
Through experimental evaluation we want to answer the following questions:
(Q1) Are the proposed algorithms competitive with state-of-the-art discriminative training
algorithms of MLNs?
(Q2) Are the proposed algorithms competitive with the state-of-the-art generative algo-
rithm for structure learning of MLNs?
(Q3) Are the proposed algorithms competitive with pure probabilistic approaches such as
Naïve Bayes and Bayesian Networks?
(Q4) Are the proposed algorithms competitive with state-of-the-art ILP systems for the task
of structure learning of MLNs?
(Q5) Do the proposed algorithms always perform better than BUSL for classification tasks?
If not, are there any regimes in which each algorithm performs better?
(Q6) Regarding the task of Entity Resolution, do the proposed algorithms perform better
than other language-independent discriminative approaches based on MLNs?
Regarding question (Q1) we have to compare all our algorithms with Preconditioned Scaled
Conjugate Gradient (PSCG) which is the state-of-the-art discriminative training algorithm for
MLNs proposed in (Lowd and Domingos 2007). It must be noted that this algorithm takes
as input a fixed structure, and with the clausal knowledge base we use in our experiments for
CORA (each dataset comes with a hand-coded knowledge base), PSCG has achieved the best
published results. We also exclude the approach of adapting the rule set and then learning
weights with PSCG, since it would be computationally intractable.
To answer question (Q2) we have to perform experimental comparison with the Bottom-Up
Structure Learning (BUSL) algorithm (Mihalkova and Mooney 2007) which is the state-of-the-
art algorithm for this task. Since, in principle, the structure of MLNs can be learned using any ILP
technique, it would be interesting to know how our algorithms compare to ILP approaches. In
(Kok and Domingos 2005), the proposed algorithm based on beam search (BS) was shown to
outperform FOIL and the state-of-the-art ILP system ALEPH for the task of learning MLNs
structure. Moreover, BS outperformed both Naïve Bayes and Bayesian Networks in terms of
CLL and AUC. Since it was shown in (Mihalkova and Mooney 2007) that BUSL outperforms
the BS algorithm of (Kok and Domingos 2005), our baseline for questions (Q2), (Q3) and
(Q4) is again BUSL. It must be noted that, since the goal of learning MLNs (and then performing
inference over the model) is probability estimation, the proposed algorithms are not
directly comparable with ILP systems, because these are not designed to maximize the data's
likelihood (and thus the quality of the probabilistic predictions). Moreover, since ALEPH and
FOIL learn more restricted clauses (non-recursive definite clauses), the only ILP system that
is directly comparable with our algorithm is CLAUDIEN which, unlike most ILP systems that
learn only Horn clauses, is able to learn arbitrary first-order clauses. Thus the comparison re-
gards the task of structure learning of MLNs where ILP systems learn the structure followed by
a weight learning phase. In (Kok and Domingos 2005) the authors showed that CLAUDIEN
(also ALEPH and FOIL) followed by a weight learning phase was outperformed by the BS
algorithm in terms of CLL and AUC. Regarding question (Q5), we compare all our algorithms
and BUSL on two datasets with the goal of discovering regimes in which each one can per-
form better. We will use two datasets, one of which can be considered of small size and the
other one of much larger size. Finally, to answer question (Q6), we should compare our algo-
rithms with the best language-independent discriminative approach to Entity Resolution based
on MLNs proposed in (Singla and Domingos 2006b). In this work, the MLN(G+C+T) model is
language-independent because it does not contain rules referring to specific strings occurring
in the data. This is similar to the approach that we follow here for this task: we learn rules
which are not vocabulary specific. In (Singla and Domingos 2006b) the discriminative weight
learning approach is based on the voted perceptron for MLNs and was used to learn weights for
different hand-coded models (one of these was MLN(G+C+T)). Since in (Lowd and Domingos
2007) it was shown that PSCG in general outperforms the voted perceptron and for the task
of entity resolution the comparison was performed following a language-dependent approach
(excluding MLN(G+C+T)), it would be interesting to investigate how our algorithms compare
to MLN(G+C+T). Finally, we would like to compare RBS-DSL with the algorithm ILS-DSL
presented in the previous chapter and check if there is any difference in performance between
them.
6.3.1 Systems and Methodology
We implemented the algorithm RBS-DSL as part of the MLN++ package (described in section 8.3), which is a suite of
algorithms based on Markov Logic and built upon the Alchemy framework (Kok et al. 2005).
We used the implementation of L-BFGS and Lazy-MC-SAT in Alchemy to learn maximum
WPLL weights and compute CLL during clause search. Regarding parameter learning, we
compared the performance of our algorithms with the state-of-the-art algorithm PSCG of (Lowd and
Domingos 2007) for discriminative weight learning of MLNs. This algorithm takes as input
an MLN and the evidence (groundings of non-query predicates) and discriminatively trains the
MLN to optimize the CLL of the query predicates given evidence. Our algorithms and PSCG
optimize the CLL (or AUC) of the query predicates and a comparison between these algorithms
would be useful to understand whether automatically learning the clauses from scratch can improve
over hand-coded MLN structures in terms of classification accuracy of the query predicates
given evidence.
We performed all the experiments on a 2.13 GHz Intel Core2 Duo CPU. For the UW-
CSE dataset we trained PSCG on the hand-coded knowledge base provided with the dataset.
We used the implementation of PSCG in the Alchemy package and ran this algorithm with the
default parameters for 10 hours. For the CORA dataset, for PSCG we report the results obtained
in (Lowd and Domingos 2007) where PSCG was trained on a hand-coded MLN and achieved
best current results on this dataset. For the language-independent approach MLN(G+C+T) of
(Singla and Domingos 2006b) we report the results for three of the query predicates in this
domain: sameBib, sameAuthor and sameVenue (the results reported in (Singla and Domingos
2006b) do not include the predicate sameTitle).
For both datasets, for all our algorithms we used the following parameters: the mean and
variance of the Gaussian prior were set to 0 and 100, respectively; maximum variables per
clause = 4; maximum predicates per clause = 4; penalization of weighted pseudo-likelihood =
0.01 for UW-CSE and 0.001 for CORA; beamSize = 5 for UW-CSE and 10 for CORA. For L-
BFGS we used the following parameters: maximum iterations = 10,000 (tight) and 10 (loose);
convergence threshold = 10−5 (tight) and 10−4 (loose). For Lazy-MC-SAT during learning we
used the following parameters: memory limit = 600MB for UW-CSE and 1GB for CORA,
maximum number of steps for Gibbs sampling = 100; simulated annealing temperature = 0.5;
the parameter δ was set to 1. All these parameters were set in an ad hoc manner, and per-fold
optimization may lead to better results. In particular, the memory limit of Lazy-MC-SAT
was set higher for the RBS-based algorithms because we empirically observed that the larger
number of potential clauses, compared to ILS-DSL, required more memory. This
is due to the nature of RBS, which evaluates more clauses than ILS during search. However,
as can be noted from the results of the experiments, the larger memory available to Lazy-MC-SAT
for scoring the structures in the RBS-based algorithms did not produce much better results than
the ILS-based versions. This confirms our heuristic on the memory limit of Lazy-MC-SAT, namely that
most significant clauses are normally scored within a certain limit and a higher limit would
not change the results. Regarding BUSL, for both datasets, we used the following parameters:
the mean and variance of the Gaussian prior were set to 0 and 100, respectively; maximum
variables per clause: 5 for UW-CSE and 6 for CORA; maximum predicates per clause = 6;
penalization of WPLL: 0.01 for UW-CSE and 0.001 for CORA; minWeight = 0.5 for UW-CSE
and 0.01 for CORA. For L-BFGS we used the following parameters: maximum iterations =
10,000 (tight) and 10 (loose); convergence threshold = 10−5 (tight) and 10−4 (loose).
In the UW-CSE domain, we followed the same leave-one-area-out methodology as in
(Richardson and Domingos 2006). In the CORA domain, we performed 5-fold cross-validation.
For each train/test split, one of the training folds is used as tuning set for computing the CLL
(or AUC). For each system on each test set, we measured the CLL and the AUC of PR curve
for the query predicates. The advantage of the CLL is that it directly measures the quality of
the probability estimates produced. The advantage of the AUC is that it is insensitive to the
large number of true negatives (i.e., ground atoms that are false and predicted to be false), but
the disadvantage is that it ignores calibration by considering only whether true atoms are given
higher probability than false atoms. The CLL of a query predicate is the average over all its
groundings of the ground atom’s log-probability given evidence. The precision-recall curve for
a predicate is computed by varying the CLL threshold above which a ground atom is predicted
to be true; i.e., the ground atoms whose probability of being true is greater than the threshold are
predicted positive and the rest negative. For the computation of AUC we used the package of (Davis
and Goadrich 2006).
6.3.2 Results
After learning the structure discriminatively, we performed inference on the test fold for both
datasets by using MC-SAT with number of steps = 10000 and simulated annealing temperature
= 0.5. For each experiment, on the test fold all the groundings of the query predicates were
commented out: advisedBy for the UW-CSE dataset (professor and student are also commented out)
and sameBib, sameTitle, sameAuthor and sameVenue for CORA. MC-SAT produces probabil-
ity outputs for every grounding of the query predicate on the test fold. We used these values to
compute the average CLL over all the groundings and to compute the PR curve.
We denote the two versions of the algorithm as RBS−DSLCLL and RBS−DSLAUC. For the
algorithm that optimizes AUC of PR curve during search, we scored each structure by using the
package of (Davis and Goadrich 2006). We compare the results also with the algorithms pre-
sented in the previous chapter ILS−DSLCLL and ILS−DSLAUC. The results for all algorithms
on the UW-CSE dataset are reported in Table 6.1 for CLL and Table 6.2 for AUC. In Table 6.1,
CLL is averaged over all the groundings of the predicate advisedBy in the test fold. Regarding
the comparison with PSCG in terms of CLL, in this domain all our algorithms except RBS−DSLCLL
perform better than PSCG in every fold of the dataset and overall. RBS−DSLCLL
performs better than PSCG in two folds, worse in two others, and equally in the ai area. The
difference in the overall results between RBS−DSLCLL and PSCG is due to the low result of
the former in the systems fold. Regarding AUC, PSCG overall performs better than all our
algorithms. It must be noted that on two out of five folds (language and graphics) the results of
our algorithms were quite competitive and there was a large difference only in the theory fold
where PSCG achieved a high result. Our best performing algorithms in terms of CLL were
those that optimize AUC during search. This was a surprising result since we expected better
results from the algorithms that optimize CLL during search. On the other hand, in terms of
AUC, our best performing algorithm was RBS−DSLAUC. Overall for UW-CSE, we can state
that our algorithms perform better in terms of CLL and worse in terms of AUC.
For the CORA dataset the results are reported in Table 6.3 and 6.4. For CLL for each
query predicate we report the average of CLL of its groundings over the test fold (for each
predicate, training is performed on four folds and testing on the remaining one in a 5-fold
cross-validation). For CORA, compared to PSCG, all our algorithms perform better in terms
of CLL for each of the query predicates, but worse in terms of AUC. We observed empirically
on each fold that the performances in terms of CLL and AUC were always balanced: a slightly
better performance in CLL always resulted in a slightly worse performance in terms of AUC,
and vice versa. Since CLL determines the quality of the probability predictions output by the
algorithm, all our algorithms outperform PSCG in terms of the ability to predict correctly the
query predicates given evidence. However, since AUC is useful to predict the few positives in
the data, PSCG produces better results for only positive examples. Hence, these results answer
question (Q1). It must be noted that PSCG has achieved the best published results on CORA in
terms of AUC (Lowd and Domingos 2007) and the approach followed is language-dependent,
i.e. the hand-coded MLN used with PSCG in (Lowd and Domingos 2007) contains rules such
that a weight is learned for each ground clause that is constructed using specific constants in
the domain. This makes the approach with PSCG of (Lowd and Domingos 2007) vocabulary
specific while all our algorithms learn general rules not related to a specific set of strings.
Table 6.1: CLL results for the query predicate advisedBy in the UW-CSE domain
area          language       graphics       systems        theory         ai             Overall
ILS−DSLCLL    -0.048±0.016   -0.016±0.003   -0.020±0.003   -0.020±0.005   -0.022±0.003   -0.025±0.006
RBS−DSLCLL    -0.043±0.015   -0.026±0.004   -0.058±0.002   -0.019±0.004   -0.032±0.005   -0.036±0.006
ILS−DSLAUC    -0.028±0.008   -0.015±0.003   -0.017±0.002   -0.018±0.004   -0.019±0.003   -0.019±0.004
RBS−DSLAUC    -0.025±0.007   -0.015±0.003   -0.017±0.003   -0.018±0.004   -0.020±0.003   -0.019±0.004
PSCG          -0.049±0.016   -0.023±0.005   -0.026±0.005   -0.028±0.007   -0.032±0.005   -0.032±0.008
BUSL          -0.024±0.008   -0.014±0.002   -0.295±0.000   -0.013±0.003   -0.019±0.003   -0.073±0.003
Table 6.2: AUC results for the query predicate advisedBy in the UW-CSE domain
area          language   graphics   systems   theory   ai      Overall
ILS−DSLCLL    0.011      0.006      0.007     0.010    0.006   0.008
RBS−DSLCLL    0.034      0.009      0.010     0.012    0.008   0.015
ILS−DSLAUC    0.016      0.005      0.007     0.005    0.008   0.008
RBS−DSLAUC    0.073      0.005      0.005     0.005    0.007   0.019
PSCG          0.011      0.005      0.069     0.101    0.034   0.044
BUSL          0.115      0.007      0.007     0.032    0.013   0.035
Regarding the comparison with BUSL, the results show that all our algorithms perform
better than BUSL in terms of CLL on both datasets. It must be noted, however, that for UW-
CSE, BUSL performed generally better than our algorithms, but produced very low results in
one fold. In terms of AUC, BUSL performs slightly better on the UW-CSE dataset while in
the CORA dataset all our algorithms outperform BUSL. Therefore, questions (Q2), (Q3) and
(Q4) can be answered affirmatively. Our discriminative algorithms are competitive with BUSL
even though for BUSL, in the UW-CSE domain, we used optimized parameters taken from
(Mihalkova and Mooney 2007) in terms of number of variables and literals per clause, while
for our algorithms we did not perform per-fold optimization of any parameter.
Regarding question (Q5), the goal was to check whether the finding of (Ng and Jordan 2002),
that on small datasets generative approaches can perform better than discriminative ones,
carries over to MLNs. The UW-CSE dataset, with a total of 2673 tuples, is of much
smaller size than CORA, which has 70367 tuples. The results of Tables 6.1 and 6.2 show
that on the UW-CSE dataset, the generative algorithm BUSL performs better in terms of AUC
and is competitive in terms of CLL since it underperforms our algorithms only because of the
low results in the systems fold of the dataset. Thus we can answer question (Q5) confirming
the results in (Ng and Jordan 2002) that on small datasets generative approaches can perform
better than discriminative ones, while for larger datasets discriminative approaches outperform
Table 6.3: CLL results for all query predicates in the CORA domain
predicate     sameBib        sameTitle      sameAuthor     sameVenue      Overall
ILS−DSLCLL    -0.087±0.001   -0.077±0.006   -0.148±0.009   -0.121±0.004   -0.108±0.005
RBS−DSLCLL    -0.222±0.003   -0.120±0.008   -0.126±0.008   -0.129±0.005   -0.149±0.006
ILS−DSLAUC    -0.168±0.002   -0.117±0.010   -0.158±0.011   -0.101±0.004   -0.136±0.007
RBS−DSLAUC    -0.254±0.002   -0.077±0.007   -0.133±0.011   -0.172±0.005   -0.159±0.006
PSCG          -0.291±0.003   -0.231±0.014   -0.182±0.013   -0.444±0.012   -0.287±0.011
MLN(G+C+T)    -0.394±0.004   −              -0.263±0.053   -1.196±0.031   -0.618±0.030
BUSL          -0.566±0.001   -0.100±0.004   -0.834±0.009   -0.232±0.005   -0.433±0.005
Table 6.4: AUC results for all query predicates in the CORA domain
predicate     sameBib   sameTitle   sameAuthor   sameVenue   Overall
ILS−DSLCLL    0.603     0.428       0.371        0.315       0.429
RBS−DSLCLL    0.265     0.546       0.600        0.233       0.411
ILS−DSLAUC    0.334     0.470       0.688        0.252       0.436
RBS−DSLAUC    0.322     0.423       0.534        0.175       0.364
PSCG          0.990     0.953       0.999        0.823       0.941
MLN(G+C+T)    0.973     −           0.980        0.743       0.899
BUSL          0.138     0.419       0.323        0.218       0.275
generative ones.
The final question (Q6) is related to the task of entity resolution and the approaches which
are based on MLNs and are language independent, i.e. that do not contain rules which refer to
specific constants in the domain. The results of Tables 6.3 and 6.4 show that in terms of CLL,
all our algorithms outperform MLN(G+C+T) for all the query predicates, but in terms of AUC,
MLN(G+C+T) outperforms our algorithms. Thus, the same conclusions for PSCG are valid
for MLN(G+C+T). Our algorithms produce in general more accurate probability predictions,
while MLN(G+C+T) produces better results for only positive atoms. Therefore, question (Q6)
can be answered affirmatively.
Finally, regarding the comparison with ILS-DSL, on the UW-CSE dataset, RBS-DSL per-
formed generally better than ILS-DSL in terms of AUC. In terms of CLL the results were quite
balanced: ILS-DSLAUC and RBS-DSLAUC performed equally, and only ILS-DSLCLL performed
better than RBS-DSLCLL. On CORA, the ILS-DSL algorithm generally performed better than
RBS-DSL in terms of both CLL and AUC. It must be noted, however, that a larger beamSize
parameter for RBS-DSL (for CORA it was set to 10) could lead to improvements in accuracy.
This parameter seems to be more critical for RBS than the parameter k (number of restarts) is
for ILS. Moreover, for RBS the parameter δ was set to 1, while for ILS it was set to 2. All these
parameters were not tuned through a per-fold optimization process, so better performance
could be achieved by tuning the parameters of both algorithms.
6.4 Related Work
Regarding discriminative structure learning of MLNs, RBS-DSL is similar to ILS-DSL, so
we refer the reader to section 5.5 for related work on structure learning of SRL models. From the point
of view of the search strategy, the algorithm RBS-DSL has similarities with that in (Kok and
Domingos 2005) that performs a beam search. However, RBS-DSL is a stochastic algorithm
which randomizes the process of beam construction, whereas in (Kok and Domingos 2005) the
search is deterministic. Moreover, the algorithm of (Kok and Domingos 2005) is a generative
one and search is guided by WPLL while our algorithms are guided by conditional likelihood
or area under the precision-recall curve.
The RBS-DSL approach is also similar to approaches in ILP that exploit SLS (Zelezny. et al.
2006). The algorithms that we propose here differ in that they use likelihood as the evaluation
measure instead of ILP coverage criteria. Moreover, our algorithms differ from those in
(Zelezny. et al. 2006) in that we use Hybrid SLS approaches which can combine other simple
SLS methods to produce high performance algorithms.
GRASP is a widely used metaheuristic for hard combinatorial problems in many fields as
shown in (Festa and Resende 2002). However, its use in Machine Learning has, to the best of the
author's knowledge, not yet been explored. The results obtained in this chapter with
RBS-DSL show that GRASP can help in developing robust and highly efficient algorithms
for complex optimization problems in learning SRL models.
6.5 Summary
In this chapter we have introduced the RBS-DSL algorithm, which discriminatively learns first-
order clauses and their weights. The algorithm scores the candidate structures by maximizing
conditional likelihood or area under the Precision-Recall curve while setting the parameters by
maximum pseudo-likelihood. RBS-DSL is inspired by the Greedy Randomized Adaptive
Search Procedure (GRASP) metaheuristic and performs randomized beam search, scoring the struc-
tures through maximum weighted pseudo-likelihood in a first phase and then using CLL or AUC of the
PR curve in a second step to randomly generate a beam of the best clauses to add to the current
MLN structure. To speed up learning we propose some simple heuristics that greatly reduce
the computational effort for scoring structures. Empirical evaluation with real-world data in
two domains shows the promise of our approach, which improves over the state-of-the-art discrimi-
native weight learning algorithm for MLNs in terms of conditional log-likelihood of the query
predicates given evidence. We have also compared the proposed algorithm with the state-of-
the-art generative structure learning algorithm and shown that on small datasets the generative
approach is competitive, while on larger datasets the discriminative approach outperforms the
generative one.
RBS-DSL can be further improved in several ways: dynamically adapting the parameter
α for the Restricted Candidate List construction; scoring structures with MC-SAT in a parallel
model such as MPI (Message Passing Interface) or PVM (Parallel Virtual Machine) by assigning
each run of MC-SAT to a separate thread; developing heuristics that can find, among the candidates
that do not improve WPLL, those that can improve CLL or AUC; assigning the independent
iterations of GRASP to parallel CPUs in order to learn high-quality structures and greatly
speed up the whole learning task; and developing other heuristics for choosing candidates
from the RCL.
Chapter 7
The IRoTS and MC-IRoTS algorithms
Most real-world problems are characterized by both probabilistic and deterministic informa-
tion. The state-of-the-art in pure probabilistic and deterministic inference has seen important
advances in recent years towards solving hard problems. However, at the boundary of the
two, there has not been much work investigating combined methods for dealing with near-
deterministic dependencies that cause the #P-completeness of probabilistic inference (Roth
1996). Many problems with these dependencies appear in Statistical Relational Learning, thus
it is important to investigate how probabilistic and deterministic inference methods can be com-
bined. For example, in Entity Resolution (the problem of determining which observations refer
to the same object), both probabilistic inferences (e.g., observations with similar properties
are more likely to be the same object) and deterministic ones (e.g., transitive closure: if x =
y and y = z, then x = z) are involved (McCallum and Wellner 2005). This chapter presents
two algorithms, IRoTS and MC-IRoTS, for MAP/MPE and conditional inference in Markov
Logic respectively. IRoTS is a MAX-SAT solver based on the Iterated Local Search (Hoos
and Stutzle 2005; Loureno et al. 2002) and Robust Tabu Search (Taillard 1991) metaheuris-
tics while MC-IRoTS combines IRoTS with Markov Chain Monte Carlo and is able to deal
with probabilistic and deterministic dependencies. Experimental evaluation shows that IRoTS
performs better than MaxWalkSAT (Kautz et al. 1997a) for MAP/MPE inference in Markov
Logic, being faster and more accurate. We also show that MC-IRoTS improves in terms of
inference time over the state-of-the-art algorithm for conditional inference in MLNs.
7.1 MAP/MPE inference using IRoTS
The basic inference task in MNs and BNs is finding the most probable state of the world given
some evidence. This is generally known as Maximum a posteriori (MAP) inference in Markov
random fields, and Most Probable Explanation (MPE) inference in Bayesian Networks. MAP
inference in MNs means finding the most likely state of a set of output variables given the state
of the input variables, and is an NP-hard problem. From Equation 3.1 introduced in Section
3.1, for MLNs this inference task reduces to finding a truth assignment that maximizes the
sum of weights of satisfied clauses. This can be done using any weighted satisfiability solver,
and in practice need not be more expensive than standard logical inference by model checking.
The authors in (Singla and Domingos 2005) use the MaxWalkSAT solver (Kautz et al. 1997a)
for MAP inference in MLNs. This section proposes IRoTS with some modifications from the
original version of (Smyth et al. 2003) as a MAX-SAT solver for the MAP inference task in
MLNs.
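The objective that such a weighted satisfiability solver maximises can be sketched as follows. This is an illustrative encoding, with ground clauses represented as lists of signed literals; the function name and data layout are ours, not the Alchemy representation.

```python
def map_score(assignment, weighted_clauses):
    """Objective maximised by MAP/MPE inference in an MLN (sketch): the sum
    of the weights of the ground clauses satisfied by a truth assignment.
    assignment: dict ground atom -> bool; weighted_clauses: list of
    (weight, clause), each clause a list of (atom, sign) literals where
    sign=False denotes a negated atom."""
    total = 0.0
    for weight, clause in weighted_clauses:
        # a clause is satisfied if at least one of its literals is true
        if any(assignment[atom] == sign for atom, sign in clause):
            total += weight
    return total
```

A MAX-SAT solver such as MaxWalkSAT or IRoTS searches over truth assignments for one maximizing this score.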
7.1.1 The SAT and MAX-SAT problems
One of the central problems in logic is that of determining if a knowledge base (usually in
clausal form) is satisfiable, i.e., if there is an assignment of truth values to all ground atoms
that makes the KB true. The satisfiability problem in propositional logic (SAT) is the task
of deciding whether a given propositional formula has a model. More formally, given a set
of m clauses C1, ..., Cm involving n Boolean variables x1, ..., xn, the SAT problem is to decide
whether an assignment of values to variables exists such that all clauses are simultaneously
satisfied. This problem plays a crucial role in various areas of computer science, mathematical
logic and artificial intelligence.
MAX-SAT is the optimisation variant of SAT and can be seen as a generalisation of the
SAT problem: Given a propositional formula in conjunctive normal form (CNF), the MAX-
SAT problem then is to find a variable assignment that maximises the number of satisfied
clauses. In weighted MAX-SAT, each clause Ci has an associated weight wi and the goal is to
maximise the total weight of the satisfied clauses. The decision variants of SAT and MAX-SAT
are NP-complete (Garey and Johnson. 1979). Furthermore, it is known that optimal solutions
to MAX-SAT are hard to approximate; for MAX-3-SAT (unweighted MAX-SAT with 3 literals
per clause), e.g., there exists no polynomial-time approximation algorithm with a (worst-case)
approximation ratio lower than 8/7 ≈ 1.1429. It is worth noting that approximation algorithms
for MAX-SAT often achieve much better solution qualities in practice than this worst-case
bound suggests; however, their performance is usually substantially inferior to that of
state-of-the-art stochastic local search (SLS) algorithms for MAX-SAT (Hansen and
Jaumard 1990).
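To make the weighted MAX-SAT objective concrete, the following sketch enumerates all assignments of a tiny weighted CNF formula and returns one that maximises the total weight of satisfied clauses. The clause encoding and the example formula are illustrative assumptions, not from this chapter; exhaustive enumeration is feasible only for very small n, which is precisely why SLS algorithms are needed.

```python
from itertools import product

# A clause is a list of literals; literal i means "variable i is true",
# literal -i means "variable i is false". Variables are numbered from 1.
# Each weighted clause is a (clause, weight) pair.
weighted_clauses = [
    ([1, 2], 3.0),    # x1 or x2
    ([-1, 3], 2.0),   # not x1 or x3
    ([-2, -3], 1.0),  # not x2 or not x3
]
n = 3  # number of Boolean variables

def satisfied_weight(assignment, clauses):
    """Total weight of clauses satisfied by the assignment (a dict var -> bool)."""
    total = 0.0
    for clause, w in clauses:
        if any(assignment[abs(l)] == (l > 0) for l in clause):
            total += w
    return total

def brute_force_maxsat(clauses, n):
    """Exhaustive weighted MAX-SAT: only feasible for very small n."""
    best, best_w = None, -1.0
    for values in product([False, True], repeat=n):
        assignment = dict(zip(range(1, n + 1), values))
        w = satisfied_weight(assignment, clauses)
        if w > best_w:
            best, best_w = assignment, w
    return best, best_w

best, w = brute_force_maxsat(weighted_clauses, n)
```

For this toy formula the optimum leaves one clause unsatisfied, illustrating that weighted MAX-SAT, unlike SAT, asks for the best assignment rather than a perfect one.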
A successful approach to the SAT and MAX-SAT problems is stochastic local search (Hoos
and Stutzle 2005). Many SLS methods have been applied to SAT and MAX-SAT leading to
a large number of algorithms. These include algorithms originally proposed for SAT, which
can be applied to unweighted MAX-SAT in a straightforward way by keeping track of the
best solution found so far in the search process. As pointed out in (Hoos and Stutzle 2005),
it is not clear that SLS algorithms known to perform well on SAT will show equally strong
performance on MAX-SAT, and some empirical evidence suggests that this is generally not the
case. Therefore, many SLS algorithms were directly developed for
unweighted and, in particular, weighted MAX-SAT or extended from existing SLS algorithms
for SAT in various ways.
The best-performing SLS algorithms for unweighted and weighted MAX-SAT belong to
three categories: Tabu Search, Dynamic Local Search, and Iterated Local Search. High
performance on unweighted MAX-SAT instances was shown by Reactive Tabu Search (H-RTS),
a tabu search that dynamically adjusts the tabu tenure (Battiti and Protasi 1997).
High-performing Dynamic Local Search algorithms include DLM (Shang and Wah 1997), a later
extension called DLM-99-SAT (Wu and Wah 1999), and Guided Local Search (GLS) (Mills and
Tsang 2000). Computational results show that GLS is currently the top-performing SLS
algorithm for specific classes of weighted MAX-SAT instances, outperforming DLM and
MaxWalkSAT. Also highly competitive is the Iterated Local Search algorithm ILS-YI (Yagiura
and Ibaraki 2001), which uses a local search based on 2- and 3-flip neighbourhoods. In
particular, for MAX-SAT-encoded minimum-cost graph colouring and set covering instances, as
well as for a large MAX-SAT-encoded real-world time-tabling instance, the 2-flip variant of
ILS-YI performs better than the other versions of ILS-YI and a tabu search algorithm.
In (Smyth et al. 2003) the authors showed that IRoTS is highly competitive with GLS and
Novelty+/wcs+we on many MAX-SAT instances. On weighted and unweighted Uniform Ran-
dom 3-SAT instances, IRoTS performs significantly better than GLS and Novelty+ variants in
terms of CPU time; on the wjnh instances, IRoTS performs worse than Novelty+ variants and
for MAX-SAT-encoded instances, IRoTS performs worse than GLS.
7. THE IROTS AND MC-IROTS ALGORITHMS
One of the most successful SLS algorithms applied to SAT is WalkSAT (Selman et al.
1996). WalkSAT (Algorithm 7.1), starting from a random initial state, repeatedly flips
(changes the truth value of) an atom in a randomly chosen unsatisfied clause. With
probability p, WalkSAT picks a random atom from the clause, and with probability 1 − p it
picks the atom whose flip minimizes the number of satisfied clauses that become unsatisfied.
WalkSAT has been shown to be able to solve hard satisfiability instances with hundreds of
thousands of variables in minutes. The MaxWalkSAT algorithm (Kautz et al. 1997a) extends
WalkSAT to the weighted satisfiability problem, where each clause has a weight and the goal
is to maximize the sum of the weights of the satisfied clauses. (Systematic solvers have
also been extended to weighted satisfiability, but tend to perform poorly.)
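The scheme just described can be sketched in a few lines. The following is a simplified illustration of the MaxWalkSAT idea, not Alchemy's implementation; the clause encoding (positive/negative integer literals) and all names are our own assumptions.

```python
import random

def maxwalksat(clauses, n, max_flips=10000, max_tries=10, p=0.5, target=0.0, seed=0):
    """Simplified MaxWalkSAT sketch: minimise the total weight of unsatisfied
    clauses. clauses is a list of (literals, weight) pairs; literal k means
    variable k is true, -k means it is false. Variables are numbered 1..n."""
    rng = random.Random(seed)

    def unsat_cost(a):
        return sum(w for cl, w in clauses
                   if not any(a[abs(l)] == (l > 0) for l in cl))

    best, best_cost = None, float("inf")
    for _ in range(max_tries):
        a = {v: rng.random() < 0.5 for v in range(1, n + 1)}  # random restart
        for _ in range(max_flips):
            cost = unsat_cost(a)
            if cost < best_cost:
                best, best_cost = dict(a), cost
            if cost <= target:          # good enough: stop early
                return best, best_cost
            unsat = [cl for cl, w in clauses
                     if not any(a[abs(l)] == (l > 0) for l in cl)]
            clause = rng.choice(unsat)
            if rng.random() < p:        # random-walk step
                var = abs(rng.choice(clause))
            else:                       # greedy step: cheapest flip in the clause
                def cost_if_flipped(v):
                    a[v] = not a[v]
                    c = unsat_cost(a)
                    a[v] = not a[v]
                    return c
                var = min({abs(l) for l in clause}, key=cost_if_flipped)
            a[var] = not a[var]
    return best, best_cost
```

On a small satisfiable instance the sketch reaches cost 0 almost immediately; on hard weighted instances it simply returns the best assignment seen within the flip budget.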
In (Park 2005) it was shown how the problem of finding the most likely state of a Bayesian
network given some evidence can be efficiently solved by reduction to weighted satisfiabil-
ity. WalkSAT is essentially the special case of MaxWalkSAT obtained by giving all clauses
the same weight. In this dissertation we focus on function-free FOL with the domain closure
assumption (i.e., the only objects in the domain are those represented by the constants). A pred-
icate or formula is grounded by replacing all its variables by constants. Propositionalization
is the process of replacing a first-order knowledge base (KB) by an equivalent propositional
one. In finite domains, this can be done by replacing each universally (existentially) quantified
formula with a conjunction (disjunction) of all its groundings. A first-order KB is satisfiable
iff the equivalent propositional KB is satisfiable. Thus, inference over a first-order KB can
be performed by propositionalization followed by satisfiability testing. For MAP inference in
MLNs, the authors in (Singla and Domingos 2006a) use MaxWalkSAT as a weighted MAX-SAT
solver and also show how to use it in an algorithm for discriminative learning of MLN
parameters.
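The grounding step described above can be sketched as follows. The clause representation and the Smokes/Cancer example are illustrative assumptions (a standard Markov Logic toy example), not part of the UW-CSE domain used later in this chapter.

```python
from itertools import product

def ground_clause(literals, variables, constants):
    """Replace the variables of a first-order clause with every combination of
    constants, yielding the set of its propositional groundings.
    literals: list of (sign, predicate, args) where args mix variables and constants."""
    groundings = []
    for binding in product(constants, repeat=len(variables)):
        theta = dict(zip(variables, binding))  # substitution for this grounding
        ground = tuple(
            (sign, pred, tuple(theta.get(a, a) for a in args))
            for sign, pred, args in literals
        )
        groundings.append(ground)
    return groundings

# Hypothetical clause: !Smokes(x) v Cancer(x), grounded over two constants.
clause = [(False, "Smokes", ("x",)), (True, "Cancer", ("x",))]
gs = ground_clause(clause, ["x"], ["Anna", "Bob"])
```

A clause with k variables over a domain of c constants yields c^k groundings, which is the combinatorial explosion referred to later when counting ground clauses.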
7.1.2 Iterated Robust Tabu Search
Robust Tabu Search

Robust Tabu Search (RoTS) (Taillard 1991) is a special case of Tabu Search (Glover and
Laguna 1997). In each search step, the RoTS algorithm (Algorithm 7.2) for MAX-SAT flips
a non-tabu variable that achieves a maximal improvement in the total weight of the unsatisfied
clauses (the size of this improvement is also called the score) and declares it tabu for the next tt
steps. The parameter tt is called the tabu tenure. An exception to this "tabu" rule is made if a
more recently flipped variable achieves an improvement over the best solution seen so far (this
Algorithm 7.1 The WalkSAT algorithm

WalkSAT(wcl: weighted clauses, max_flips: maximum number of flips, max_tries: number of tries, target: the target cost, p: probability of random walk)
  atoms = variables in wcl;
  for i = 1 to max_tries do
    solution = a random truth assignment to atoms;
    cost = sum of weights of unsatisfied clauses in solution;
    for j = 1 to max_flips do
      if cost ≤ target then
        return solution;
      end if
      C = a randomly chosen unsatisfied clause;
      if Uniform(0,1) < p then
        AtomToFlip = a randomly chosen variable from C;
      else
        for each variable A in C do
          compute Cost(A);
        end for
        AtomToFlip = the A with lowest Cost(A);
      end if
      solution = solution with AtomToFlip flipped;
      cost = cost + Cost(AtomToFlip);
    end for
  end for
  return solution;
mechanism is called aspiration). Furthermore, whenever a variable has not been flipped within
a certain number of search steps, it is forced to be flipped. This implements a form of long-
term memory and helps prevent stagnation of the search process. The tabu status of variables is
determined by comparing the number of search steps that have been performed since the most
recent flip of a given variable with the current tabu tenure. Finally, instead of using a fixed
tabu tenure, every n iterations the parameter tt is randomly chosen from an interval [ttmin, ttmax]
according to a uniform distribution.
The RoTS algorithm is closely related to MaxWalkSAT/Tabu for weighted MAX-SAT. In
each search step, one of the non-tabu variables that achieves a maximal improvement in the total
weight of the unsatisfied clauses is flipped and declared tabu for the next tt steps. However,
unlike MaxWalkSAT, RoTS, in addition to the aspiration criterion, forces a variable to be
flipped if it has not been flipped for a certain number of steps.
Iterated Robust Tabu Search

The original version of IRoTS for MAX-SAT was proposed in (Smyth et al. 2003). Algorithm
7.3 starts by independently initializing (with equal probability) the truth values of the
atoms. Then it performs a local search using RoTS to efficiently reach a local optimum CLS.
At this point, a perturbation method, again based on RoTS, is applied, leading to the neighbor
CL′C of CLS; then a local search based on RoTS is applied to CL′C to reach another
local optimum CL′S. The accept function decides whether the search must continue from the
previous local optimum or from the last found local optimum CL′S (accept can perform random
walk or iterative improvement in the space of local optima).
Careful choice of the various components of Algorithm 7.3 is important for achieving high
performance. For the tabu tenure we adopt the parameter settings of (Smyth et al. 2003), which
have proven to perform well across many domains. At the beginning of each local
search and perturbation phase, all variables are declared non-tabu. The goal of the clause
perturbation operator (flipping atoms' truth values) is to jump to a different region of the
search space, where the next iteration of the search should start. Perturbations can be
strong or weak: if the jump lands near the current local optimum, the subsidiary local search
procedure LocalSearchRoTS may fall back into the same local optimum or enter a region with
the same value of the objective function (a plateau), while if the jump is too far,
LocalSearchRoTS may take too many steps to reach another good solution. In our
algorithm we use a fixed number of RoTS perturbation steps, 9n/10, with tabu tenure n/2, where n is the
Algorithm 7.2 The Robust Tabu Search algorithm

RoTS(F: weighted CNF formula, ttmin: minimum tabu tenure, ttmax: maximum tabu tenure, maxNoImprov: maximum number of steps without improvement)
  n = number of variables in F;
  Å = randomly chosen assignment of the variables in F;
  Score(Å) = sum of weights of unsatisfied clauses;
  A = Å;
  k = 0;
  repeat
    if k mod n = 0 then
      tt = random([ttmin, ttmax]);
    end if
    Atom = randomly selected variable whose flip results in a maximal improvement in Score;
    if Score(A with Atom flipped) < Score(Å) then
      A = A with Atom flipped;
    else
      if ∃ a variable that has not been flipped for ≥ 10 ∗ n steps then
        Atom = that variable;
        A = A with Atom flipped;
      else
        Atom = randomly selected non-tabu variable whose flip results in a maximal improvement in Score;
        A = A with Atom flipped;
      end if
    end if
    if Score(A) < Score(Å) then
      Å = A;
    end if
    k = k + 1;
  until no improvement in Å for maxNoImprov steps
  return Å;
Algorithm 7.3 The Iterated Robust Tabu Search algorithm

Input: C: set of weighted clauses in CNF; BestScore: current best score
  CLC = random initialization of truth values for atoms in C;
  CLS = LocalSearchRoTS(CLC);
  BestAssignment = CLS;
  BestScore = Score(CLS);
  repeat
    CL′C = PerturbRoTS(BestAssignment);
    CL′S = LocalSearchRoTS(CL′C);
    if Score(CL′S) ≥ BestScore then
      BestScore = Score(CL′S);
    end if
    BestAssignment = accept(BestAssignment, CL′S);
  until k consecutive iterations have produced no improvement
  return BestAssignment;
number of atoms (in future work we intend to dynamically adapt the nature of the perturbation).
The procedure LocalSearchRoTS performs RoTS steps until no improvement is achieved for
n²/d steps (we call d the threshold ratio), with a tabu tenure of n/10 + 4. The accept
function always accepts the best solution found so far. Our algorithm differs from that
in (Smyth et al. 2003) in that we do not dynamically adapt the tabu tenure and do not use
a probabilistic choice in accept.
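The overall loop with these parameter choices can be sketched as follows. The pluggable rots_phase function and the toy hill-climbing stand-in used to exercise it are our own illustrative assumptions, not the thesis implementation.

```python
def irots(score, init, rots_phase, max_stale):
    """Skeleton of IRoTS (Algorithm 7.3). rots_phase(assignment, steps, tt) is
    assumed to run a RoTS phase from `assignment` and return a new assignment;
    `score` is the quantity being maximised (e.g. total weight of satisfied
    clauses). Step/tenure choices follow the settings described in the text."""
    n = len(init)
    # initial local search phase, tabu tenure n/10 + 4
    current = rots_phase(init, steps=n * n, tt=n // 10 + 4)
    best, best_score = current, score(current)
    stale = 0
    while stale < max_stale:
        # perturbation: a short (9n/10 steps), strongly tabu-constrained RoTS run
        perturbed = rots_phase(best, steps=9 * n // 10, tt=max(1, n // 2))
        # subsidiary local search from the perturbed assignment
        candidate = rots_phase(perturbed, steps=n * n, tt=n // 10 + 4)
        if score(candidate) > best_score:
            best, best_score, stale = candidate, score(candidate), 0
        else:
            stale += 1
        # accept: always continue from the best assignment found so far
    return best, best_score

# Toy stand-in for a RoTS phase (illustration only): each step sets the first
# False bit to True, so "local search" climbs toward the all-True optimum.
def toy_phase(a, steps, tt):
    a = list(a)
    for _ in range(steps):
        for i, v in enumerate(a):
            if not v:
                a[i] = True
                break
    return tuple(a)

best, s = irots(score=sum, init=(False,) * 20, rots_phase=toy_phase, max_stale=3)
```

Because accept always keeps the incumbent, the skeleton performs iterative improvement in the space of local optima, which is exactly the deterministic variant of accept described above.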
7.1.3 Experiments
Through experimental evaluation we want to answer the following questions:
(Q1) Does the proposed IRoTS algorithm improve over the state-of-the-art algorithm for
MLNs in terms of solution quality?
(Q2) Does the performance depend on the particular configuration of clauses’ weights?
(Q3) Does the performance depend on particular features of the dataset, i.e., number of
ground clauses and predicates?
(Q4) In case IRoTS finds better solutions than the state-of-the-art algorithm, what is the
performance in terms of running times?
(Q5) What is the performance of the algorithms for huge relational domains with hundreds
of thousands of ground predicates and clauses?
Table 7.1: Inference results in terms of cost of false clauses for query predicate advisedBy for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts, using MLNs learned by running PSCG for 500 iterations

fold      IRoTS     MWSAT-Tabu  MWSAT-Tabu&Restarts
ai        93103.7   92393.5     92512.9
graphics  72221.8   72245.1     71659.8
language  32398.1   32668.2     32380.1
systems   117144.0  118416.0    118629.0
theory    71726.1   71727.9     71873.1
average   77318.7   77490.1     77411.0
We implemented the algorithm as part of the MLN++ package (section 8.3). In order to perform
MAP inference we need MLN models and evidence data. MLN models can be hand-coded or
learned from training data using the algorithms of the previous chapters. Since the goal here is
to perform inference for complex models, where it is not easy to find the best MAP state given
the evidence, we decided to generate complex models from real-world data and test IRoTS against
the current state-of-the-art algorithm MaxWalkSAT as implemented in Alchemy. As dataset we
took the UW-CSE dataset used in the previous chapter, together with the hand-coded MLN
model that comes with it. For the first experiment we learned weights using
the algorithm PSCG (Lowd and Domingos 2007), declaring advisedBy as the non-evidence predicate.
We trained the algorithm following a leave-one-out methodology for 500 iterations on each
area of the dataset. After having learned the MLNs, we performed MAP inference with IRoTS
with query predicate advisedBy. As in the previous chapters, we also commented out in the test
set the student and professor predicates, together with the predicate advisedBy. For a fair
comparison, we compared IRoTS against the tabu version of MaxWalkSAT, using the same number
of search steps for both algorithms. For IRoTS the threshold ratio d was set to 1 and the
parameter k, the number of iterations without improvement, was set to 3. We observed that on
the language and theory folds the iterations were very fast and three steps without improvement
were too few; for this reason we used k = 10 for these two areas and k = 3 for the others. In
all cases, at the end of IRoTS we counted the overall number of flips performed and used the
same number for MaxWalkSAT with tabu (MWSAT-Tabu). The tabu tenure for MWSAT-Tabu
was set to the Alchemy default, i.e., 5.
Since IRoTS uses the perturbation procedure to escape local optima, it would be fair to
Table 7.2: Running times (in minutes) for the same number of search steps for query predicate advisedBy for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts, using MLNs learned by running PSCG for 500 iterations

fold      IRoTS   MWSAT-Tabu  MWSAT-Tabu&Restarts  Num. ground preds  Num. ground clauses
ai        56.65   62.98       60.27                4760               185849
graphics  26.91   28.88       46.30                3843               136392
language  1.03    1.08        1.06                 840                15762
systems   125.71  134.35      192.16               5328               218918
theory    17.85   19.30       34.03                2499               73600
average   45.63   49.32       66.76                -                  -
compare IRoTS with a version of MaxWalkSAT-Tabu that uses a similar mechanism for jumping
to a different region of the search space. For this reason we also compared IRoTS with
MaxWalkSAT-Tabu&Restarts, using ten restarts and a number of flips for each restart equal
to 1/10th of the overall number of flips. In this way the comparison remains fair, since all
algorithms perform the same number of flips. Moreover, for MaxWalkSAT-Tabu we used the
default Alchemy tabu tenure of five, but it is more informative to compare IRoTS with a
version of MaxWalkSAT-Tabu&Restarts that has the same tabu tenure as IRoTS, i.e., n/10 + 4.
Thus we used this tabu tenure for MaxWalkSAT-Tabu&Restarts.
The results are reported in Table 7.1, where for each algorithm we report the cost of false
clauses of the final solution; the running times of inference are reported in Table 7.2. As
can be seen, IRoTS is more accurate than the other two algorithms, since it finds solutions
of higher quality. MaxWalkSAT-Tabu&Restarts is more competitive than MaxWalkSAT-Tabu
due to its ability to escape local optima by jumping to a different region of the search space.
The running times show that IRoTS is faster than both of the other algorithms even though the
number of search steps is the same. Thus, questions (Q1) and (Q4) can be answered affirmatively.
However, we want to be sure that the performance advantage of IRoTS over the other algorithms
does not depend on the weights of the model. For this reason we decided to generate other MLNs
on the same dataset, but with different weights. We did this by again using PSCG
and running it for 10 hours instead of 500 iterations on each training set. This guarantees
that the generated MLNs differ in their clause weights. The MAP inference
results for these MLNs are reported in Table 7.3. As can be seen, IRoTS again performs
better than the other algorithms; thus question (Q2) can be answered affirmatively, since for the
same number of ground clauses and predicates but with different clause weights, IRoTS finds
Table 7.3: Inference results in terms of cost of false clauses for query predicate advisedBy for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts, using MLNs learned by running PSCG for 10 hours

fold      IRoTS    MWSAT-Tabu  MWSAT-Tabu&Restarts
ai        98513.5  99737.8     99876.4
graphics  28007.9  28074.8     28005.2
language  10985.8  11070.8     10711.3
systems   73154.6  73471.8     73642.9
theory    90979.1  89517.9     89462.7
average   60328.2  60374.6     60339.7
Table 7.4: Running times (in minutes) for the same number of search steps for query predicate advisedBy for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts, using MLNs learned by running PSCG for 10 hours

fold      IRoTS  MWSAT-Tabu  MWSAT-Tabu&Restarts  Num. ground preds  Num. ground clauses
ai        71.34  78.31       74.53                4760               185849
graphics  26.3   28.02       43.06                3843               136392
language  1.48   1.57        1.52                 840                15762
systems   55.06  59.92       57.98                5328               218918
theory    25.53  26.73       49.06                2499               73600
average   35.94  38.91       45.23                -                  -
better solutions than the other algorithms. Regarding running times for the last experiments,
the results are reported in Table 7.4 and again IRoTS is faster than the other algorithms.
An important question to answer is whether the performance advantage of IRoTS over the other
algorithms depends on the number of ground clauses and predicates, i.e., whether the same
performance is maintained for different numbers of groundings. For this reason we decided to
consider an additional query predicate in the UW-CSE dataset, in order to change the number of
ground atoms and clauses. We again learned weights using PSCG, but this time considering
as non-evidence predicates both advisedBy and tempAdvisedBy. We ran PSCG
on each fold for 50 iterations. The learned MLNs should be able to predict the probability
of all groundings of both predicates given the evidence. We report experiments for each of the
predicates in turn, and finally for an inference task where both predicates are specified as query
predicates. In this way we obtain a different number of ground predicates and clauses compared
to the previous experiments. The results for the query predicate advisedBy with the new MLNs
are reported in Table 7.5 and the respective running times are
Table 7.5: Inference results in terms of cost of false clauses for query predicate advisedBy for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts, using MLNs learned by running PSCG for 50 iterations with both advisedBy and tempAdvisedBy as non-evidence predicates

fold      IRoTS  MWSAT-Tabu  MWSAT-Tabu&Restarts
ai        50.85  50.85       50.85
graphics  62.10  62.10       62.10
language  9.75   9.75        9.75
systems   52.96  52.96       52.96
theory    57.23  57.23       57.23
average   46.58  46.58       46.58
Table 7.6: Running times (in minutes) for the same number of search steps for query predicate advisedBy for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts, using MLNs learned by running PSCG for 50 iterations with both advisedBy and tempAdvisedBy as non-evidence predicates

fold      IRoTS  MWSAT-Tabu  MWSAT-Tabu&Restarts  Num. ground preds  Num. ground clauses
ai        2.15   13.55       10.30                4760               185762
graphics  2.10   8.40        6.30                 3843               136297
language  0.12   0.32        0.27                 840                15711
systems   11.20  80.71       60.07                5328               218820
theory    0.15   2.08        1.63                 2499               73540
average   3.14   21.01       15.71                -                  -
reported in Table 7.6. As can be seen, in this case the algorithms find the same solution, but
IRoTS is much faster than the other two algorithms. The number of ground predicates and
clauses differs from the previous experiments.
In Table 7.7 we report the inference results for the query predicate tempAdvisedBy. As
the results show, IRoTS performs much better than MWSAT-Tabu and is more accurate than
MWSAT-Tabu&Restarts. Regarding running times, the results for all algorithms are reported
in Table 7.8, and IRoTS is clearly faster than the other algorithms.
Finally, with the MLNs generated by declaring both advisedBy and tempAdvisedBy as
non-evidence predicates, we perform inference by specifying both predicates as query
predicates in a single inference task. The results are shown in Table 7.9. IRoTS is
clearly superior to the other algorithms. The difference in solution quality is most evident
against MWSAT-Tabu, with an improvement of approximately 12%.
MWSAT-Tabu&Restarts is competitive with IRoTS but loses on average 7% in terms of solu-
Table 7.7: Inference results in terms of cost of false clauses for query predicate tempAdvisedBy for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts, using MLNs learned by running PSCG for 50 iterations with both advisedBy and tempAdvisedBy as non-evidence predicates

fold      IRoTS     MWSAT-Tabu  MWSAT-Tabu&Restarts
ai        16112.70  16669.30    16309.00
graphics  12872.90  13153.10    12765.50
language  2238.57   2196.23     2024.31
systems   19722.50  20352.70    19938.30
theory    7388.90   7668.17     7600.41
average   11667.11  12007.90    11727.50
Table 7.8: Running times (in minutes) for the same number of search steps for query predicate tempAdvisedBy for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts, using MLNs learned by running PSCG for 50 iterations with both advisedBy and tempAdvisedBy as non-evidence predicates

fold      IRoTS  MWSAT-Tabu  MWSAT-Tabu&Restarts  Num. ground preds  Num. ground clauses
ai        93.89  101.98      143.29               4760               185672
graphics  30.63  32.81       33.03                3843               136244
language  0.73   0.75        0.68                 840                15706
systems   61.07  65.65       95.62                5328               218727
theory    47.85  55.60       34.50                4892               261078
average   46.83  51.36       61.42                -                  -
tion quality compared to IRoTS. The running times are reported in Table 7.10: IRoTS is
slightly slower than MWSAT-Tabu and slightly faster than MWSAT-Tabu&Restarts; however,
the differences are small compared to the overall running times.
The results in the last six tables clearly answer questions (Q2) and (Q3). We have generated
different MLN models with different weights, but the better performance of IRoTS over the
other algorithms does not appear to be sensitive to the clause weights. Moreover, with the last
three experiments we generated MLN models that, together with the evidence data, give rise to
different numbers of ground predicates and clauses during inference. The results show that
IRoTS is superior in terms of solution quality and that its performance does not change with the
number of ground predicates and clauses. Regarding question (Q4), IRoTS is in general faster
than the other algorithms; in only one case is IRoTS slightly slower than MWSAT-Tabu, and
there it finds much better solutions. Thus question (Q4) can be answered by stating that even
though it finds better solutions, IRoTS does not spend more time than the other algorithms; it is in fact faster
Table 7.9: Inference results in terms of cost of false clauses for query predicates advisedBy and tempAdvisedBy in a single inference task, for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts, using MLNs learned by running PSCG for 50 iterations with both advisedBy and tempAdvisedBy as non-evidence predicates

fold      IRoTS     MWSAT-Tabu  MWSAT-Tabu&Restarts
ai        13367.10  15659.60    14610.00
graphics  11511.40  12622.90    11994.40
language  1996.73   2054.05     2004.55
systems   16823.70  19283.50    18484.60
theory    6845.18   7545.73     7112.81
average   10108.82  11433.16    10841.27
Table 7.10: Running times (in minutes) for the same number of search steps for query predicates advisedBy and tempAdvisedBy in a single inference task, for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts, using MLNs learned by running PSCG for 50 iterations with both advisedBy and tempAdvisedBy as non-evidence predicates

fold      IRoTS   MWSAT-Tabu  MWSAT-Tabu&Restarts  Num. ground preds  Num. ground clauses
ai        294.01  278.3       331.79               9384               680351
graphics  166.78  160.78      150.17               7564               495227
language  7.1     8.71        8.1                  1624               52491
systems   362.78  339.2       337.44               10512              804425
theory    80.65   104.3       103.05               4900               261890
average   182.26  178.26      186.11               -                  -
than both MWSAT-Tabu and MWSAT-Tabu&Restarts.
The last experiment and the results reported in Tables 7.9 and 7.10 partially answer question
(Q5): as Table 7.10 shows, the inference task involves thousands of ground predicates and a
very large number of ground clauses.
Finally, to answer question (Q5) completely, we decided to generate MLNs with an additional
query predicate, such that the number of ground predicates and clauses becomes very
high. From the predicates of the UW-CSE domain we chose the taughtBy predicate, which has
three arguments: course, person and period. This gives rise to a huge ground MN to be
solved for MAP inference. We learned the MLNs with PSCG, specifying taughtBy as an
additional non-evidence predicate and running the weight learning algorithm for 50 iterations.
We then performed inference with query predicates taughtBy, advisedBy and tempAdvisedBy.
The results are reported in Table 7.11 and show that IRoTS again performs better
Table 7.11: Inference results in terms of cost of false clauses for query predicates taughtBy, advisedBy and tempAdvisedBy in a single inference task, for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts, using MLNs learned by running PSCG for 50 iterations with the three predicates as non-evidence predicates

fold      IRoTS      MWSAT-Tabu  MWSAT-Tabu&Restarts
ai        509738.00  536670.00   538508.00
graphics  338790.00  338554.00   339004.00
language  37034.00   43756.20    42886.70
systems   494128.00  619380.00   604742.00
theory    175668.00  214252.00   210758.00
average   311071.60  350522.44   347179.74
Table 7.12: Running times (in minutes) for the same number of search steps for query predicates taughtBy, advisedBy and tempAdvisedBy in a single inference task, for IRoTS, MaxWalkSAT-Tabu and MaxWalkSAT-Tabu with restarts, using MLNs learned by running PSCG for 50 iterations with the three predicates as non-evidence predicates

fold      IRoTS  MWSAT-Tabu  MWSAT-Tabu&Restarts  Num. ground preds  Num. ground clauses
ai        52.97  49.22       50.52                23664              1894428
graphics  40.33  35.58       32.62                23485              1510794
language  1.18   1.17        1.14                 5152               157944
systems   48.58  38.58       36.62                26136              2045461
theory    15.33  11.5        12.62                14504              794484
average   31.68  27.21       26.7                 -                  -
than the other two algorithms. Running times and the numbers of ground predicates and clauses
are reported in Table 7.12. As can be seen, the number of ground clauses is very high, and in
one fold it reaches nearly 2 million. This is common for relational domains, where grounding
the first-order clauses causes a combinatorial explosion in the number of ground clauses. The
results show that IRoTS is slower than the other two algorithms, but finds better solutions.
Thus question (Q5) can be answered affirmatively: for inference tasks with a huge number of
ground predicates and clauses, IRoTS is clearly superior to the other algorithms in terms of
solution quality.
7.2 Conditional Inference for MLNs using MC-IRoTS
Conditional inference in graphical models involves computing the distribution of the query
variables given the evidence, a problem that has been shown to be #P-complete (Roth 1996). The most
widely used approach to approximate inference is based on MCMC methods, and in particular
Gibbs sampling. One of the problems that arise in real-world applications is that an inference
method must be able to handle both the probabilistic and the deterministic dependencies that
may hold in the domain. MCMC methods are suitable for handling probabilistic dependencies, but
give poor results when deterministic or near-deterministic dependencies characterize a
domain. On the other hand, logical methods such as satisfiability testing cannot be applied to
probabilistic dependencies. One approach to dealing with both kinds of dependencies is that of
(Poon and Domingos 2006), where the authors use SampleSAT (Wei et al. 2004) within an MCMC
algorithm to sample uniformly from the set of satisfying solutions. As pointed out in (Wei et al.
2004), SAT solvers find solutions very fast but may sample highly non-uniformly, while MCMC
methods may take exponential time, in terms of problem size, to reach the stationary
distribution. For this reason, the authors in (Wei et al. 2004) proposed a hybrid strategy that
combines random walk steps with MCMC steps, in particular with Metropolis transitions. This
permits efficient jumping between isolated or near-isolated regions of non-zero probability,
while preserving detailed balance.
Deterministic dependencies often cause the support of the probability distribution to break
into disconnected regions, which makes it difficult to design ergodic Markov chains for
MCMC inference (Gilks et al. 1996). Gibbs sampling can thus become trapped in a single region
and may never converge to the correct answers. A simple remedy is to run multiple
chains with random starting points, but in general this does not solve the problem, since it is
not guaranteed that different regions will be sampled with frequency proportional to their
probability. In practice there may be a very large number of regions, and simply running
multiple chains is not an optimal solution. Near-deterministic dependencies, on the other hand,
preserve ergodicity but lead to intractably long convergence times, which methods such as
simulated tempering (Marinari and Parisi 1992) attempt to alleviate. Another inference method
is belief propagation (Yedidia et al. 2001), where deterministic or near-deterministic
dependencies can lead to incorrect answers or failure to converge. Deterministic dependencies
can be exploited to speed up exact inference, but this is unlikely to scale to the problems
found in SRL domains, where there are many densely connected variables.
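To illustrate how a deterministic dependency breaks ergodicity, consider a toy distribution over two Boolean variables that must be equal (x1 = x2), so the support consists of two disconnected states. Single-site Gibbs sampling can never cross between them, since flipping either variable alone has probability zero. The sketch below is our own illustration, not from the dissertation.

```python
import random

def gibbs_equal_constraint(x, steps, rng):
    """Single-site Gibbs sampling for the uniform distribution over
    {(False, False), (True, True)}: states where x1 != x2 have probability 0.
    Resampling one variable given the other always copies the other's value,
    so the chain can never move between the two support regions."""
    states = []
    for _ in range(steps):
        i = rng.randrange(2)       # pick a variable to resample
        x = list(x)
        x[i] = x[1 - i]            # conditional puts all mass on equality
        x = tuple(x)
        states.append(x)
    return states

rng = random.Random(0)
chain = gibbs_equal_constraint((False, False), 1000, rng)
# Started in (False, False), the chain never visits (True, True), so the
# estimated P(x1 = True) is 0 instead of the correct 1/2.
```

This is exactly the failure mode that the hybrid SampleSAT-style moves, and MC-IRoTS below, are designed to avoid.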
In this dissertation we use the same approach as the authors of (Poon and Domingos
2006), but instead of SampleSAT, for MC-IRoTS we propose to use SampleIRoTS, which
performs with probability p a RoTS step and with probability 1 − p a simulated annealing (SA)
step. We use fixed-temperature annealing (i.e., Metropolis) moves. The goal is to reach
a first solution as fast as possible through IRoTS and then exploit the ability of SA to explore a
cluster of solutions. A cluster of solutions is usually a set of connected solutions, such that any
two solutions within the cluster can be connected through a series of flips without leaving the
cluster. In many domains of interest, solutions occur in clusters, and it is highly useful to explore
such clusters without leaving them. SA has good properties for exploring a connected space:
it samples near-uniformly and often explores all the neighboring solutions.
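A single SampleIRoTS-style move can be sketched as follows. This is illustrative only: the greedy step is a simplified stand-in for a full RoTS step with tabu memory, and the clause encoding and function names are our own assumptions.

```python
import math
import random

def sample_irots_step(a, clauses, p, temperature, rng):
    """One SampleIRoTS-style move (sketch): with probability p take a greedy
    RoTS-style step (flip the variable that most reduces the weight of
    unsatisfied clauses), otherwise take a fixed-temperature Metropolis (SA)
    step (flip a random variable, accepting uphill moves with prob exp(-d/T))."""
    def cost(x):
        return sum(w for cl, w in clauses
                   if not any(x[abs(l)] == (l > 0) for l in cl))

    def flipped(x, v):
        y = dict(x)
        y[v] = not y[v]
        return y

    if rng.random() < p:                       # greedy RoTS-style step
        return min((flipped(a, v) for v in a), key=cost)
    v = rng.choice(list(a))                    # Metropolis (SA) step
    b = flipped(a, v)
    delta = cost(b) - cost(a)
    if delta <= 0 or rng.random() < math.exp(-delta / temperature):
        return b                               # accept (always, if downhill)
    return a                                   # reject: stay in current state
```

The greedy steps drive the chain quickly toward a solution cluster, while the fixed-temperature Metropolis steps explore the cluster near-uniformly, mirroring the division of labour described above.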
Through MC-IRoTS we can perform conditional inference given evidence to compute
probabilities for query predicates. These probabilities can be used to make predictions from the
model. Since inference is a computationally hard task, it is highly desirable to design
high-performing algorithms. IRoTS has been shown to be a very competitive algorithm on several
SAT instances (Smyth et al. 2003), but to the best of our knowledge, no results have been
reported for huge domains such as those of SRL.
Often, in many application domains, learning and/or inference is not performed once in
batch mode, but rather over many time steps in an on-line mode. On-line learning and
inference, often used by agents, requires high-performing algorithms, since the agent
continuously updates the evidence and the query by adding, changing or deleting evidence and
query atoms, and then waits for a response from the inference algorithm in order to make a
decision based on its output. For this reason, we will compare MC-IRoTS with
the state-of-the-art algorithm not only in terms of the quality of the query probabilities
produced, but also in terms of running time. Likewise, as shown in section 7.3, inference
can be used during learning, and the inference procedure may be called thousands of times in
the course of learning. This requires fast inference algorithms in order to speed up the entire
learning process. We will show through experiments that MC-IRoTS is faster than the
state-of-the-art algorithm for inference in Markov logic.
7.2.1 The SampleIRoTS algorithm: Combining MCMC and IRoTS
One of the most widely used MCMC methods for computing conditional probabilities is Gibbs
sampling, which proceeds by sampling each variable in turn given its Markov blanket (the
variables it appears with in some potential). In order to generate samples from the correct
distribution, it is sufficient that the Markov chain satisfy ergodicity and detailed balance. In
essence, all states must be aperiodically reachable from each other, and for any two states x, y,
P(x)T(x → y) = P(y)T(y → x), where T is the chain's transition probability. In the presence
of strong dependencies, changes to the state of a variable given its neighbors become
very unlikely, and convergence of the probability estimates to the true values becomes very
slow. In the limit of deterministic dependencies, ergodicity breaks down. Simulated tempering
can be used to speed up Gibbs sampling by running, in parallel with the original chain, chains
with reduced weights, and periodically attempting to swap the states of two chains. The
disadvantage is that if weights are very large, swaps become very unlikely, and ergodicity is
broken by infinite weights.
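The detailed balance condition above can be checked numerically on a toy two-state chain. This is only an illustrative sketch: the two-state distribution P(x) ∝ e^{w·f(x)} stands in for a ground network with a single strong dependency, and the Metropolis proposal is the usual "propose the other state, accept with min(1, P(y)/P(x))" rule.

```python
import math

# Two-state distribution with a strong dependency: P(1)/P(0) = e^w.
w = 3.0
P = {0: 1.0, 1: math.exp(w)}
Z = sum(P.values())
P = {s: v / Z for s, v in P.items()}

def metropolis_T(x, y):
    """Transition probability of a Metropolis chain that proposes the other
    state and accepts with probability min(1, P(y)/P(x))."""
    if x == y:
        return 1.0 - min(1.0, P[1 - x] / P[x])
    return min(1.0, P[y] / P[x])

# Detailed balance: P(x) T(x -> y) = P(y) T(y -> x).
lhs = P[0] * metropolis_T(0, 1)
rhs = P[1] * metropolis_T(1, 0)
assert abs(lhs - rhs) < 1e-12
```

Note that as w grows, T(1 → 0) = e^{−w} shrinks: the chain stays ever longer in the high-probability state, which is exactly the slow-mixing behavior described above.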
Another widely used approach relies on auxiliary variables to capture the dependencies.
For instance, let P(X = x,U = u) = (1/Z)∏k I[0,φk(xk)](uk), where φk is the kth potential func-
tion, uk is the kth auxiliary variable, I[a,b](uk) = 1 if a ≤ uk ≤ b, and I[a,b](uk) = 0 otherwise.
The marginal distribution of X under this joint is P(X = x), thus for sampling from the original
distribution it is sufficient to sample from P(x,u) and ignore the u values. P(uk|x) is uniform
in [0,φk(xk)], and thus easy to sample from. P(x|u) is uniform in the “slice” of χ that satisfies
φk(xk) ≥ uk for all k. Identifying this region is the main difficulty in this technique, known as
slice sampling (Damien et al. 1999).
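For a tiny discrete distribution the auxiliary-variable scheme can be sketched directly. This is illustrative only: the slice {x : φ(x) ≥ u} is identified here by brute-force enumeration, which is precisely the step that is hard in general, and the single potential φ over three states is an arbitrary choice.

```python
import random

# Discrete slice sampler for P(x) proportional to phi(x) over a tiny space.
states = [0, 1, 2]
phi = {0: 1.0, 1: 2.0, 2: 4.0}

def slice_sample(num_samples, x0=0):
    x, out = x0, []
    for _ in range(num_samples):
        u = random.uniform(0.0, phi[x])                 # sample u | x
        slice_set = [s for s in states if phi[s] >= u]  # the "slice"
        x = random.choice(slice_set)                    # sample x | u, uniform
        out.append(x)
    return out

random.seed(1)
samples = slice_sample(20000)
freq = {s: samples.count(s) / len(samples) for s in states}
Z = sum(phi.values())
# Marginalizing u out of the joint P(x, u) recovers P(x) = phi(x)/Z.
assert all(abs(freq[s] - phi[s] / Z) < 0.03 for s in states)
```

The two conditionals are trivial to sample; all the difficulty is hidden in constructing `slice_set` without enumeration.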
The question of whether state-of-the-art satisfiability procedures, based on random walk
strategies, can be used to sample uniformly or near-uniformly from the space of satisfying
assignments was first addressed in (Wei et al. 2004). It was shown that random walk SAT procedures
often do reach the full set of solutions of complex logical theories. Moreover, it was also shown
that, by interleaving random walk steps with Metropolis transitions, the sampling becomes
near-uniform. At near-zero temperature, simulated annealing samples solutions uniformly, but
will generally take too long to find them. WalkSAT finds solutions very fast, but samples them
highly non-uniformly. The SampleSAT algorithm samples solutions near-uniformly and highly
efficiently by, at each iteration, performing a WalkSAT step with probability p and a simulated
annealing step with probability 1 − p. The parameter p is used to trade off uniformity and
computational cost.
In the previous section we showed that IRoTS outperforms MaxWalkSAT in terms of
the quality of the solutions found and running times. In this section, the idea is to combine the
high-performing IRoTS algorithm with simulated annealing. The novel algorithm performs a
RoTS step with probability p and a simulated annealing step with probability 1 − p. We call this
algorithm SampleIRoTS, and the expectation is that it will be faster than
SampleSAT in the same way that IRoTS is faster than WalkSAT. The goal is to exploit
SampleIRoTS in an inference algorithm for Markov logic and compare it with the state-of-the-art
algorithm for this task.
7.2.2 The MC-IRoTS algorithm
The basic idea of how to use such a sampler in an inference algorithm was first proposed in
(Poon and Domingos 2006). The MC-IRoTS algorithm that we propose here applies slice
sampling to Markov logic by using SampleIRoTS to sample a new state given the auxiliary
variables. Algorithm 7.4 gives pseudo-code for MC-IRoTS and is similar to that proposed in
(Poon and Domingos 2006). In the following we describe how it works (for further reading see
(Poon and Domingos 2006)).
Algorithm 7.4 The MC-IRoTS algorithm

MC-IRoTS(clauses, numSamples)
x(0) ← Satisfy(hard clauses)
for i = 1 to numSamples do
    M ← ∅
    for all ck ∈ clauses satisfied by x(i−1) do
        With probability 1 − e−wk add ck to M
    end for
    Sample x(i) ∼ UnifSAT(M)
end for
In a ground MN, each ground clause ck corresponds to the potential function φk(x) =
exp(wk fk(x)). This function has value ewk if ck is satisfied, and 1 otherwise. The authors
in (Poon and Domingos 2006) introduced an auxiliary variable uk for each ck. In the ith
iteration of MC-IRoTS, if ck is not satisfied by the current state x(i−1), uk is drawn uniformly from
[0, 1]; thus uk ≤ 1 and uk ≤ ewk, and ck is not required to be satisfied in the next state. On the
other hand, if ck is satisfied, uk is drawn uniformly from [0, ewk], and with probability 1 − e−wk
it will be greater than 1, in which case the next state must satisfy ck. In this way, sampling
all the auxiliary variables determines a random subset M of the currently satisfied clauses that
must also be satisfied in the next state. The next state is taken as a uniform sample from the set
of states SAT(M) that satisfy M. (SAT(M) is never empty, because it always contains at least the
current state.) The initial state is found by applying the satisfiability solver IRoTS to the set of
all hard clauses in the network (i.e., all clauses with infinite weight). If this set is unsatisfiable,
the output of MC-IRoTS is undefined.
In Algorithm 7.4, UnifSAT(M) is the uniform distribution over the set SAT(M). At each
step of the algorithm, hard clauses are selected with probability 1, and thus all sampled states
Table 7.13: Inference running times for 1000 samples in the CORA domain
preds   clauses    MC-IRoTS  SampleIRoTS  MC-SAT   SampleSAT  Gain MC  Gain Sample
1849    77701        22.71      4.48        26.95     5.44      4.24      0.96
67081   1171457     444.91    133.65       435.10   114.95     -9.81    -18.70
59536   113798       68.41     27.55       104.82    58.64     36.41     31.09
59536   118828       64.63     34.97        93.69    51.10     29.06     16.13
1681    2724901    1291.30    187.33      1479.75   382.20    188.45    194.87
9409    912673      207.22     41.65       200.76    42.83     -6.46      1.18
71289   142311      363.59    313.39       385.50   321.46     21.91      8.07
69169   69169        50.91     28.69        56.72    32.47      5.81      3.78
59536   59536        42.86     23.12        45.53    25.74      2.67      2.62
1849    79507        28.93      6.69        29.96     7.47      1.03      0.78
3844    234546       72.71     15.54        82.76    18.92     10.05      3.38
1681    68921        24.22      5.03        25.28     6.15      1.06      1.12
8836    821842      275.11     51.58       294.25    57.11     19.14      5.53
6084    350298       66.56      8.40        70.10    10.83      3.54      2.43
1849    82216        28.16      8.19        33.81     9.84      5.65      1.65
71289   142311       94.61     46.28       112.59    53.70     17.98      7.42
59536   62051        43.31     23.41        49.03    27.28      5.72      3.87
3844    121086       34.65      5.76        35.92     6.20      1.27      0.44
9409    14065         5.86      3.37         6.51     3.66      0.65      0.29
11025   1157625     428.42     92.97       448.42   103.44     20.00     10.47
-       -           182.95     53.10       200.87    66.97     17.92     13.87
satisfy them. (For simplicity, we omit the case of negative weights. These are simply handled
by considering that a clause with a negative weight is equivalent to its negation with the same
weight but opposite sign, where a clause's negation is the conjunction of the negations of all of
its literals. Instead of checking whether the clause is satisfied, the algorithm checks whether
its negation is satisfied. If the clause is satisfied, all of its negated literals are selected with
probability 1 − ew, and with probability ew none is selected.)
As shown and proven in (Poon and Domingos 2006), this kind of algorithm generates a
Markov chain that satisfies ergodicity and detailed balance. Like their algorithm, MC-IRoTS
is guaranteed to be sound even in the presence of deterministic dependencies, whereas
MCMC methods such as Gibbs sampling and simulated tempering are not. Although,
in practice, perfectly uniform samples are too hard to obtain, MC-IRoTS uses
SampleIRoTS to obtain nearly uniform ones. Furthermore, the parameter p of SampleIRoTS can
be used to trade off speed and uniformity of sampling.
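The outer slice-sampling loop of Algorithm 7.4 can be sketched as follows. This is a toy sketch: for readability, the call to SampleIRoTS is replaced by exact uniform sampling over SAT(M) via enumeration, which is feasible only for tiny ground networks, and the clause encoding is illustrative.

```python
import itertools, math, random

def satisfies(x, lits):
    """True if assignment x satisfies the clause given as a list of
    (var, is_positive) literals."""
    return any(x[v] == pos for v, pos in lits)

def mc_irots(clauses, n_vars, num_samples):
    """Sketch of the MC-SAT/MC-IRoTS loop. clauses: list of (literals, weight);
    an infinite weight marks a hard clause."""
    space = list(itertools.product([False, True], repeat=n_vars))
    hard = [lits for lits, w in clauses if math.isinf(w)]
    # Initial state: any assignment satisfying all hard clauses.
    x = random.choice([s for s in space if all(satisfies(s, c) for c in hard)])
    samples = []
    for _ in range(num_samples):
        # Select each currently satisfied clause with probability 1 - e^{-w};
        # hard clauses (w = inf) are selected with probability 1.
        M = [lits for lits, w in clauses
             if satisfies(x, lits) and random.random() < 1.0 - math.exp(-w)]
        # Next state: uniform over SAT(M); never empty, since x itself is in it.
        sat_m = [s for s in space if all(satisfies(s, c) for c in M)]
        x = random.choice(sat_m)
        samples.append(x)
    return samples
```

On a toy network with one hard and one soft clause, the sample frequencies converge to the MLN probabilities, e.g. P(query) = e^w / (e^w + 1) for a single soft unit clause of weight w.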
7.2.3 Experiments
Through experimental evaluation we want to answer the following questions:
(Q1) Does the proposed algorithm MC-IRoTS improve over the state-of-the-art algorithm
in terms of running time?
(Q2) What is the performance of MC-IRoTS compared to the state-of-the-art algorithm in
terms of the quality of the query probabilities produced?
We implemented the algorithm as part of the MLN++ package (Section 8.3). In order to perform
inference with MC-IRoTS, we first need MLN models. For this reason, we use
the MLNs learned discriminatively with the algorithms proposed in the previous chapter. For
each model, we perform inference with a query predicate both with MC-SAT and MC-IRoTS.
For CORA, we perform inference with the query predicates sameBib, sameAuthor, sameVenue
and sameTitle.
The results are reported in Table 7.13, where both algorithms were run with 1000 samples
each. We generated different models in order to have different ratios of ground predicates and
ground clauses during inference. This helps better evaluate the inference algorithm over
a wide range of inference scenarios. As the results show, MC-IRoTS improves over MC-SAT
in terms of overall running time of the inference task. Results in terms of CLL and AUC are
reported in Table 7.14. It is clear that the quality of the predicted probabilities does not differ
between the algorithms. Thus, MC-IRoTS maintains the same inference accuracy while being
faster than MC-SAT.
In order to provide further experimental evidence of the superiority of MC-IRoTS over
MC-SAT, we also performed inference on the UW-CSE dataset by exploiting the
MLNs generated in the previous section for MAP inference. We first performed experiments
with MLNs generated for the non-evidence predicate advisedBy with 500 iterations of PSCG.
The accuracy results are reported in Table 7.15 and running times in Table 7.16.
As the results show, MC-IRoTS improves in terms of running time, while preserving almost
the same accuracy in terms of CLL and AUC.
We then took the MLNs generated in the previous section by running PSCG for 10 hours on
the training data and used these to perform conditional inference with advisedBy as the query
predicate. The results are reported in Tables 7.17 and 7.18. Again MC-IRoTS is faster
than MC-SAT, but this time it loses accuracy in terms of AUC in one of the folds of the
dataset. However, the difference is not significant.
Table 7.14: Accuracy results of inference for 1000 samples in the CORA domain
MC-IRoTS                       MC-SAT
CLL             AUC            CLL             AUC
-0.043±0.003    0.901          -0.043±0.003    0.901
-0.248±0.003    0.092          -0.247±0.003    0.094
-1.686±0.003    0.059          -1.714±0.003    0.059
-0.170±0.002    0.158          -0.146±0.001    0.156
-1.427±0.010    0.050          -1.447±0.010    0.055
-2.011±0.007    0.083          -1.990±0.007    0.090
-0.079±0.000    0.815          -0.079±0.000    0.813
-0.044±0.000    0.907          -0.044±0.000    0.907
-0.057±0.001    0.797          -0.057±0.001    0.799
-0.158±0.011    0.333          -0.154±0.011    0.348
-0.056±0.005    0.432          -0.057±0.005    0.434
-0.085±0.009    0.452          -0.085±0.009    0.447
-0.083±0.002    0.324          -0.084±0.002    0.319
-0.139±0.005    0.099          -0.137±0.005    0.116
-0.159±0.010    0.333          -0.162±0.010    0.343
-0.124±0.001    0.406          -0.124±0.001    0.410
-0.315±0.004    0.283          -0.315±0.004    0.283
-0.625±0.024    0.076          -0.651±0.024    0.069
-0.246±0.008    0.108          -0.242±0.008    0.110
-0.101±0.003    0.219          -0.100±0.003    0.228
-0.393±0.006    0.346          -0.394±0.006    0.349
We performed another experiment with the MLNs generated in the previous section by
running PSCG with both advisedBy and tempAdvisedBy as query predicates.
The results for advisedBy are reported in Tables 7.19 and 7.20. Again MC-IRoTS is faster
than MC-SAT, and this time it is also more accurate in terms of AUC. The same experiments
were performed by specifying tempAdvisedBy as the query predicate. Results are shown in Tables
7.21 and 7.22. For this predicate, MC-IRoTS is faster and much more accurate than MC-SAT.
We also performed inference with both predicates as query predicates. Results are reported in
Tables 7.23 and 7.24. As can be seen, running times are lower for MC-IRoTS and accuracy
is almost the same. Finally, we exploited the MLNs generated by adding taughtBy as a
non-evidence predicate. Results are reported in Tables 7.25 and 7.26, and again MC-IRoTS is faster
and preserves the same accuracy as MC-SAT.
Table 7.15: Accuracy results of inference for 1000 samples for the advisedBy predicate based on the MLNs generated with 500 iterations of PSCG

            MC-IRoTS              MC-SAT
fold        CLL           AUC     CLL           AUC
ai          -0.031±0.005  0.043   -0.033±0.005  0.008
graphics    -0.023±0.005  0.005   -0.023±0.005  0.005
language    -0.049±0.016  0.011   -0.049±0.016  0.011
systems     -0.026±0.005  0.074   -0.028±0.005  0.006
theory      -0.028±0.007  0.101   -0.029±0.007  0.007
average     -0.031±0.008  0.047   -0.032±0.008  0.007
Table 7.16: Inference running times (in seconds) for 1000 samples for the predicate advisedBy based on the MLNs generated with 500 iterations of PSCG

fold      preds  clauses  MC-IRoTS  SampleIRoTS  MC-SAT  SampleSAT  Gain MC  Gain Sample
ai        4760   185849    59.72     16.16        69.66   16.12       9.94     -0.04
graphics  3843   136392    45.28     12.29        53.35   11.60       8.07     -0.69
language  840    15762      3.50      1.21         3.63    1.22       0.13      0.01
systems   5328   218918    62.45     19.54        80.92   20.05      18.47      0.51
theory    2499   73600     24.93      6.72        28.68    6.58       3.75     -0.14
average   -      -         39.18     11.18        47.25   11.11       8.07     -0.07
7.3 Discriminative Parameter Learning
As previously introduced in Section 3.3, parameter learning for MNs and MLNs can be divided
into generative and discriminative approaches. Generative approaches optimize the joint probability
distribution of all the variables. In contrast, discriminative approaches maximize the conditional
likelihood of a set of outputs given a set of inputs (Lafferty et al. 2001), which often produces
better results for prediction problems. In this section, we will show how the MC-IRoTS
algorithm provides good samples for a discriminative weight learning algorithm for MLNs.
7.3.1 Optimizing Conditional Likelihood for Weight Learning
As described in Section 3.3, computing the expected counts Ew[ni(e,q)] in Equation 3.8 is
intractable. These can be approximated by the counts ni(e,q∗w) in the MAP state q∗w(x). Thus,
computing the gradient requires only MAP inference to find q∗w(x), which is much faster than
the full conditional inference needed to compute Ew[ni(e,q)]. To generalize this method to
arbitrary MLNs, it is necessary to develop a general-purpose algorithm for MAP inference in MLNs. From
Table 7.17: Accuracy results of inference for 1000 samples for the advisedBy predicate based on the MLNs generated by running PSCG for 10 hours on the training data.

            MC-IRoTS              MC-SAT
fold        CLL           AUC     CLL           AUC
ai          -0.033±0.005  0.008   -0.029±0.005  0.156
graphics    -0.023±0.005  0.005   -0.023±0.005  0.005
language    -0.049±0.016  0.011   -0.049±0.016  0.011
systems     -0.027±0.005  0.006   -0.027±0.005  0.006
theory      -0.029±0.007  0.007   -0.029±0.007  0.007
average     -0.032±0.008  0.007   -0.031±0.008  0.037
Table 7.18: Inference running times (in seconds) for 1000 samples for the predicate advisedBy based on the MLNs generated by running PSCG for 10 hours on the training data.

fold      preds  clauses  MC-IRoTS  SampleIRoTS  MC-SAT  SampleSAT  Gain MC  Gain Sample
ai        4760   185849    59.47     16.05        70.40   17.18      10.93      1.13
graphics  3843   136392    45.85     12.24        54.23   12.06       8.38     -0.18
language  840    15762      3.61      1.24         3.68    1.18       0.07     -0.06
systems   5328   218918    71.99     19.99        87.35   19.70      15.36     -0.29
theory    2499   73600     24.33      6.21        28.18    6.41       3.85      0.20
average   -      -         41.05     11.15        48.77   11.31       7.72      0.16
Equation 3.7 it can be seen that, since q∗w(x) is the state that maximizes the sum of the weights
of the satisfied ground clauses, it can be found using a MAX-SAT solver. The authors in (Singla
and Domingos 2005) replaced the Viterbi algorithm with the MaxWalkSAT solver (Kautz et al.
1997b). Given an MLN and a set of evidence atoms, the KB to be passed to MaxWalkSAT is
formed by constructing all groundings of clauses in the MLN involving query atoms, replacing
the evidence atoms in those groundings by their truth values, and simplifying.
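The resulting MAP-approximated gradient step can be sketched in a few lines. This is a minimal sketch: in the actual algorithm the MAP counts would come from a MAX-SAT solver such as MaxWalkSAT or IRoTS; here they are simply inputs, and the learning rate is illustrative.

```python
def perceptron_step(weights, true_counts, map_counts, lr=0.01):
    """One voted-perceptron-style weight update for an MLN.
    weights, true_counts, map_counts: lists indexed by clause i.
    The CLL gradient w.r.t. w_i is approximated by n_i(e,q) - n_i(e,q*_w),
    i.e. true clause counts minus counts in the MAP state."""
    return [w + lr * (nt - nm)
            for w, nt, nm in zip(weights, true_counts, map_counts)]
```

Clauses whose true counts exceed their counts in the MAP state get their weights increased, and vice versa; at a fixed point the MAP state reproduces the observed counts.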
MaxWalkSAT is not guaranteed to reach the global MAP state, unlike the Viterbi algorithm.
This can lead to errors in the weight estimates produced. The quality of the estimates can
be improved by running a Gibbs sampler starting at the state returned by MaxWalkSAT, and
averaging counts over the samples. If the Pw(q|e) distribution has more than one mode, doing
multiple runs of MaxWalkSAT followed by Gibbs sampling can be helpful. This approach is
followed in the algorithm in (Singla and Domingos 2005) which is essentially gradient descent.
Weight learning in MLNs represents a convex optimization problem, and gradient descent
is guaranteed to find the global optimum. However, convergence to this optimum may be too
slow. The sufficient statistics for MLNs are the number of true groundings of each clause. Since
Table 7.19: Accuracy results of inference for 1000 samples for the advisedBy predicate based on the MLNs generated by running PSCG with both advisedBy and tempAdvisedBy as non-evidence predicates

            MC-IRoTS              MC-SAT
fold        CLL           AUC     CLL           AUC
ai          -0.028±0.004  0.066   -0.027±0.004  0.102
graphics    -0.026±0.004  0.017   -0.024±0.004  0.004
language    -0.043±0.012  0.233   -0.037±0.011  0.221
systems     -0.026±0.004  0.067   -0.029±0.004  0.004
theory      -0.022±0.005  0.307   -0.025±0.005  0.204
average     -0.029±0.006  0.138   -0.028±0.006  0.107
Table 7.20: Inference running times (in seconds) for 1000 samples for the predicate advisedBy based on the MLNs generated by running PSCG with both advisedBy and tempAdvisedBy as non-evidence predicates

fold      preds  clauses  MC-IRoTS  SampleIRoTS  MC-SAT  SampleSAT  Gain MC  Gain Sample
ai        4760   185762    66.80     22.56        76.90   22.82      10.10      0.26
graphics  3843   136297    51.10     17.19        57.63   15.16       6.53     -2.03
language  840    15711      4.00      1.74         4.10    1.87       0.10      0.13
systems   5328   218820    78.58     26.56        89.50   25.88      10.92     -0.68
theory    2499   73540     27.99      9.29        31.19    8.88       3.20     -0.41
average   -      -         45.69     15.47        51.86   14.92       6.17     -0.55
this number can easily vary by orders of magnitude from one clause to another, a learning rate
that is small enough to avoid divergence in some weights may be too small for fast convergence
in others. This is an instance of the well-known problem of ill-conditioning in numerical
optimization, and many candidate solutions for it exist (Nocedal and Wright 1999). However,
most of these are not easily applicable to MLNs because of the nature of the function to be
optimized.
In (Lowd and Domingos 2007), another approach based on conjugate gradient (Shewchuk
1994) was proposed. Gradient descent can be sped up by performing a line search to find
the optimum along the chosen descent direction instead of taking a small step of constant size
at each iteration. This can be inefficient on ill-conditioned problems, since line searches along
successive directions tend to partly undo the effect of each other: each line search makes the
gradient along its direction zero, but the next line search will generally make it non-zero again.
This can be solved by imposing at each step the condition that the gradient along previous di-
Table 7.21: Accuracy results of inference for 1000 samples for the predicate tempAdvisedBy based on the MLNs generated by running PSCG with both advisedBy and tempAdvisedBy as non-evidence predicates

            MC-IRoTS              MC-SAT
fold        CLL           AUC     CLL           AUC
ai          -0.007±0.002  0.030   -0.007±0.002  0.032
graphics    -0.004±0.001  0.174   -0.006±0.002  0.042
language    -0.002±0.002  1.000   -0.004±0.004  0.008
systems     -0.008±0.002  0.008   -0.007±0.002  0.019
theory      -0.013±0.004  0.005   -0.014±0.005  0.003
average     -0.007±0.002  0.243   -0.008±0.003  0.021
Table 7.22: Inference running times (in seconds) for 1000 samples for the predicate tempAdvisedBy based on the MLNs generated by running PSCG with both advisedBy and tempAdvisedBy as non-evidence predicates

fold      preds  clauses  MC-IRoTS  SampleIRoTS  MC-SAT  SampleSAT  Gain MC  Gain Sample
ai        4760   185672    59.01     14.20        71.46   14.92      12.45      0.72
graphics  3843   136244    43.96     10.13        53.44   10.91       9.48      0.78
language  840    15706      3.26      0.97         3.42    1.21       0.16      0.24
systems   5328   218727    69.51     16.50        84.54   17.05      15.03      0.55
theory    2499   73513     23.90      5.47        28.95    6.28       5.05      0.81
average   -      -         39.93      9.45        48.36   10.07       8.43      0.62
rections remain zero. The directions chosen in this way are called conjugate, and the method
is called conjugate gradient. In (Lowd and Domingos 2007), the authors used the Polak-Ribiere
method for choosing conjugate directions, since it has generally been found to be the
best-performing one.
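A sketch of the Polak-Ribiere direction update follows. This is illustrative: in weight learning the gradient would be the CLL gradient ni − Ew[ni], the max(0, ·) restart is a standard practical safeguard rather than part of the basic formula, and the sketch assumes the previous gradient is nonzero.

```python
def polak_ribiere_direction(grad_new, grad_old, dir_old):
    """New (ascent) search direction: the fresh gradient plus a multiple of
    the previous direction, with the Polak-Ribiere coefficient
    beta = g_new . (g_new - g_old) / (g_old . g_old), clipped at 0 (restart).
    Assumes grad_old is nonzero."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    diff = [g - h for g, h in zip(grad_new, grad_old)]
    beta = max(0.0, dot(grad_new, diff) / dot(grad_old, grad_old))
    return [g + beta * d for g, d in zip(grad_new, dir_old)]
```

When successive gradients are identical, beta is 0 and the method reduces to plain gradient ascent; otherwise the new direction keeps the gradient along previous directions near zero.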
7.3.2 Learning MLNs Weights by Sampling with MC-IRoTS
Conjugate gradient methods are among the most efficient optimization methods, on a par with
quasi-Newton methods. Unfortunately, as the authors point out in (Lowd and Domingos 2007), applying them to
MLNs is difficult, because line searches require computing the objective function, and therefore
the partition function Z, which is intractable. Fortunately, the Hessian (matrix of second-order
partial derivatives) can be used instead of a line search to choose a step size. This method is
known as scaled conjugate gradient (SCG), and was proposed in (Moller 1993) for training
neural networks. In (Lowd and Domingos 2007), a step size was chosen by using the Hessian
Table 7.23: Accuracy results of inference for 1000 samples with both query predicates advisedBy and tempAdvisedBy in a single inference task

            MC-IRoTS              MC-SAT
fold        CLL           AUC     CLL           AUC
ai          -0.020±0.003  0.028   -0.019±0.003  0.004
graphics    -0.019±0.003  0.003   -0.019±0.003  0.002
language    -0.029±0.007  0.004   -0.029±0.008  0.004
systems     -0.018±0.002  0.027   -0.019±0.002  0.003
theory      -0.020±0.003  0.005   -0.023±0.004  0.004
average     -0.021±0.004  0.013   -0.022±0.004  0.004
Table 7.24: Inference running times (in seconds) for 1000 samples with both query predicates advisedBy and tempAdvisedBy in a single inference task

fold      preds  clauses  MC-IRoTS  SampleIRoTS  MC-SAT  SampleSAT  Gain MC  Gain Sample
ai        9384   680351    210.58    52.88       243.07   56.71      32.49      3.83
graphics  7564   495227    153.00    40.31       183.70   46.34      30.70      6.03
language  1624   52491      22.34    11.04        17.56    4.44      -4.78     -6.60
systems   10512  804425    305.73   117.17       286.46   65.43     -19.27    -51.74
theory    4900   261890     83.72    22.24        97.61   22.06      13.89     -0.18
average   -      -         155.07    48.73       165.68   39.00      10.61     -9.73
similar to a diagonal Newton method. Conjugate gradient methods are often more effective
with a preconditioner, a linear transformation that attempts to reduce the condition number of
the problem (Sha and Pereira 2003). Good preconditioners approximate the inverse Hessian. In
(Lowd and Domingos 2007), the authors used the inverse diagonal Hessian as preconditioner
and called the SCG algorithm Preconditioned SCG (PSCG). PSCG was shown to outperform
the voted perceptron algorithm of (Singla and Domingos 2005) on two real-world domains both
for CLL and AUC. For the same learning time, PSCG learned much more accurate models.
However, to compute the Hessian, the MPE approximation is no longer sufficient. The
authors in (Lowd and Domingos 2007) address this problem by computing expected counts
using MC-SAT. When optimizing quadratic functions, Newton's method can move to the global
minimum or maximum in a single step. It does this by multiplying the gradient, g, by the
inverse Hessian, H−1, giving wt+1 = wt − H−1g. For hundreds or thousands of weights, the
use of the full Hessian becomes infeasible. A good approximation is to use the diagonal New-
ton (DN) method, which uses the inverse of the diagonalized Hessian in place of the inverse
Hessian. DN typically uses a smaller step size than the full Newton method. This is impor-
Table 7.25: Accuracy results of inference for 1000 samples with query predicates taughtBy, advisedBy and tempAdvisedBy in a single inference task

            MC-IRoTS              MC-SAT
fold        CLL           AUC     CLL           AUC
ai          -0.053±0.003  0.007   -0.047±0.002  0.009
graphics    -0.012±0.001  0.001   -0.012±0.001  0.001
language    -0.022±0.004  0.005   -0.022±0.004  0.005
systems     -0.017±0.001  0.003   -0.017±0.001  0.007
theory      -0.017±0.002  0.043   -0.020±0.002  0.003
average     -0.024±0.002  0.012   -0.024±0.002  0.005
Table 7.26: Inference running times (in seconds) for 1000 samples with the query predicates taughtBy, advisedBy and tempAdvisedBy in a single inference task

fold      preds  clauses   MC-IRoTS  SampleIRoTS  MC-SAT  SampleSAT  Gain MC  Gain Sample
ai        23664  1894428    559.43   142.31       672.84  141.64     113.41     -0.67
graphics  23485  1510794    434.36   109.84       556.14  123.06     121.78     13.22
language  5152   157944      44.08    10.86        56.63   12.93      12.55      2.07
systems   26136  2045461    604.50   148.70       744.24  158.96     139.74     10.26
theory    14504  794484     228.35    57.65       289.32  123.62      60.97     65.97
average   -      -          374.14    93.87       463.83  112.04      89.69     18.17
tant when applying the algorithm to non-quadratic functions, such as MLN conditional log
likelihood, where the quadratic approximation is only good within a local region.
Since the Hessian for an MLN is simply the negative covariance matrix:
∂²/(∂wi ∂wj) log P(Y = y|X = x) = Ew[ni]Ew[nj] − Ew[ninj]    (7.1)
similarly to the gradient, we can approximate this using samples from MC-IRoTS. The authors
in (Lowd and Domingos 2007) used MC-SAT to achieve this. In each iteration they took
a step in the diagonalized Newton direction:
wi = wi + α (ni − Ew[ni]) / (Ew[ni²] − (Ew[ni])²)    (7.2)
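Given sampled clause counts from MC-IRoTS, this step can be sketched in a few lines. This is an illustrative sketch: the ascent sign follows the CLL gradient ni − Ew[ni], the sampled counts would come from the inference algorithm rather than being passed in, and eps guards against zero sample variance.

```python
def diagonal_newton_step(weights, true_counts, sampled_counts, alpha=1.0, eps=1e-8):
    """One diagonalized Newton step for MLN weights.
    sampled_counts[s][i] = count n_i of clause i in the s-th sample.
    The mean estimates E_w[n_i]; the variance estimates the (negated)
    diagonal Hessian entry E_w[n_i^2] - (E_w[n_i])^2."""
    n, m = len(weights), len(sampled_counts)
    mean = [sum(s[i] for s in sampled_counts) / m for i in range(n)]
    var = [sum((s[i] - mean[i]) ** 2 for s in sampled_counts) / m for i in range(n)]
    # w_i <- w_i + alpha * (n_i - E_w[n_i]) / Var_w[n_i]
    return [w + alpha * (t - mu) / (v + eps)
            for w, t, mu, v in zip(weights, true_counts, mean, var)]
```

Clauses whose counts vary little across samples (small variance) receive large, confident steps, which is exactly the preconditioning effect discussed above.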
Since in the previous section we showed that MC-IRoTS runs faster than MC-SAT over
a wide range of inference scenarios while maintaining almost the same accuracy in the
probabilities produced, we expect MC-IRoTS to be a good sampler for estimating the sufficient statistics
needed in Equation 7.1. We will show in the next section, through experiments in the webpage
Table 7.27: Accuracy results for classifying webpages of students
fold          CLL            AUC
Wisconsin     -0.121±0.013   0.674
Washington    -0.164±0.020   0.601
Cornell       -0.175±0.014   0.551
Texas         -0.139±0.015   0.623
average       -0.150±0.015   0.612
classification domain, that MC-IRoTS provides good samples for the discriminative weight
learning algorithm of (Lowd and Domingos 2007).
7.3.3 Experiments on Web Page Classification
In this section we want to answer the following question:
(Q1) Does MC-IRoTS produce good samples to be used in a discriminative weight learning
algorithm for MLNs?
In order to perform experiments we need to learn MLNs from data with PSCG by sampling
with MC-IRoTS and see if the learned models are accurate. We decided to perform experiments
in the webpage classification domain and chose the WebKB dataset (Craven et al. 1998). The
relational version of the dataset that we use is that of (Craven and Slattery 2001), the same used
by (Lowd and Domingos 2007). This dataset contains labeled webpages from the Department
of Computer Science of four universities: Texas, Cornell, Washington and Wisconsin. The
relational version consists of 4165 webpages and 10,935 web links, together with the words on
the webpages, the anchors of the links and the neighborhoods of each link. Each webpage in the
dataset is labeled as being a page of a student, faculty, course or research project. The goal is
to predict the class for each page based on the words of that page and the links that the
page has with other pages. The databases that we used from WebKB are the following, for each
department of computer science:
Database common. Defines the relation "LinkTo", which specifies hyperlink connections.
Moreover, it contains boolean predicates characterizing the anchor text of hyperlinks,
"AllWordsCapitalized" and "HasAlphanumericWord".
Database of page-words. Contains a bag-of-words representation of the words that occur
in the webpages. Each predicate in these files specifies a stemmed word, and the instances of
the predicate are those pages that contain the word.
Table 7.28: Accuracy results for classifying webpages of faculty members
fold          CLL            AUC
Wisconsin     -0.053±0.009   0.325
Washington    -0.047±0.010   0.344
Cornell       -0.046±0.009   0.625
Texas         -0.057±0.011   0.720
average       -0.051±0.010   0.504
Table 7.29: Accuracy results for classifying webpages of research projects
fold          CLL            AUC
Wisconsin     -0.036±0.007   0.196
Washington    -0.041±0.009   0.208
Cornell       -0.072±0.009   0.071
Texas         -0.048±0.010   0.045
average       -0.049±0.009   0.130
Database of anchor-words. Contains the words that occur in the anchor text of hyperlinks.
Neighborhood-words. Contains the words that occur in the “neighboring” text of hyper-
links. The neighborhood of a hyperlink includes words in a single paragraph, list item, table
entry, title or heading in which the hyperlink is contained.
We hand-coded a very simple MLN for this problem:
Has(+w1, p1)⇒ PageClass(p1)
¬Has(+w1, p1)⇒ PageClass(p1)
Has(+w1, p1)∧HasAnchor(+w1, lnkid)⇒ PageClass(p1)
¬Has(+w1, p1)∧HasAnchor(+w1, lnkid)⇒ PageClass(p1)
PageClass(p1)∧LinkTo(+lnkid, p1, p2)⇒ PageClass(p2)
"Has" is the predicate expressing that the word "w1" is contained in the page "p1", while
"HasAnchor" relates a word with its anchor. The last rule states the relationship between pages
of class p1 and p2 linked by hyperlink "lnkid". The sign + means that a separate weight is
learned for each ground word and hyperlink. When instantiated, the model contained nearly
10,000 weights, representing a very complex non-i.i.d. probability distribution where query
predicates are linked together in a huge graph. We used the following parameters for
MC-IRoTS: d = 1, k = 3. For PSCG, we ran 100 iterations of the algorithm with 100
samples of MC-IRoTS for each inference run. We performed leave-one-area-out for each class
Table 7.30: Accuracy results for classifying webpages of courses
fold          CLL            AUC
Wisconsin     -0.059±0.006   0.633
Washington    -0.254±0.028   0.039
Cornell       -0.097±0.012   0.232
Texas         -0.058±0.009   0.434
average       -0.117±0.014   0.335
Table 7.31: Overall accuracy results for web page classification in the WebKB domain
Class         CLL            AUC
Student       -0.150±0.015   0.612
Research      -0.049±0.010   0.130
Faculty       -0.051±0.010   0.504
Course        -0.117±0.014   0.335
average       -0.092±0.012   0.395
of webpages, learning an MLN for each department. After learning the models, we performed
inference on the left-out area, again using MC-IRoTS with 1000 samples. Tables 7.27, 7.28,
7.29, 7.30 and 7.31 present the results in terms of CLL and AUC for all the classes of the domain.
Each table contains the results for each area of the dataset and the overall accuracy. As can
be seen, the CLL results are very accurate, while the AUC results are competitive. This shows that
samples from MC-IRoTS provide PSCG with good estimates of the sufficient statistics. In only one case
were the AUC results not high: for the research project webpages, AUC is quite low, but
on the other hand CLL was the best among the four classes, with an excellent result of -0.049.
For the Courses class, the AUC results were very good for three areas, and in only one area
(Washington) was AUC very low. This affected the overall result for the class.
Overall, the results obtained by using MC-IRoTS as sampler in PSCG answer question
(Q1) and confirm that MC-IRoTS is a good algorithm for inference in statistical relational
domains. In the previous section it was shown that MC-IRoTS was faster as an inference
algorithm, while in this section we showed that it is also useful to produce good samples for
a weight learning algorithm. This implies a double use of this algorithm: for inference and
weight learning in statistical relational domains.
7. THE IROTS AND MC-IROTS ALGORITHMS
7.4 Summary
Inference is the process of responding to queries once the model has been learned. Efficient and
effective inference is important to evaluate and compare the learned models. On the other hand,
inference is often a subroutine when learning statistical models of relational domains. These
models often contain hundreds of thousands of variables or more, making efficient inference
crucial to their learnability. Moreover, in on-line learning and inference, often used by agents,
decisions are based on the output of the inference process, thus fast and accurate algorithms
are strongly needed for this task. In this chapter we introduced two high-performing algorithms
for MAP and conditional inference in Markov Logic, based on the Iterated Local Search and
Tabu Search metaheuristics. The first algorithm, IRoTS, performs a biased sampling of the
set of local optima by using Tabu Search as a local search procedure and repetitively jumping
in the search space through a perturbation operator. Extensive experiments on real-world
data show that IRoTS outperforms the state-of-the-art algorithm for MAP inference in Markov
Logic. The second algorithm, MC-IRoTS, combines IRoTS with Markov Chain Monte Carlo
by interleaving RoTS steps with Metropolis transitions in an iterated local search. Experiments
on real-world domains show that MC-IRoTS is faster than the state-of-the-art algorithm for
conditional inference in Markov Logic while maintaining the same quality of the probabilities
produced. Finally, we used MC-IRoTS as a sampler to approximate the sufficient statistics in a
state-of-the-art discriminative parameter learning algorithm for MLNs and showed through
experiments in the webpage classification domain that MC-IRoTS produces good samples to be
used during learning.
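The iterated-local-search-with-tabu scheme underlying IRoTS can be sketched on a toy weighted MAX-SAT instance (a minimal illustrative sketch, not the dissertation's implementation; the clause encoding, tabu tenure, and perturbation strength are placeholder choices):

```python
import random

def unsat_weight(clauses, assign):
    # Total weight of clauses left unsatisfied by the assignment.
    # A clause is (weight, literals); literal +v / -v means variable v
    # (1-indexed) must be True / False.
    total = 0.0
    for weight, lits in clauses:
        if not any((assign[l - 1] if l > 0 else not assign[-l - 1]) for l in lits):
            total += weight
    return total

def tabu_search(clauses, assign, steps=200, tenure=5):
    # One-flip local search with a tabu list and an aspiration criterion:
    # a tabu flip is still admissible if it improves on the best cost seen.
    n = len(assign)
    tabu_until = [0] * n
    best, best_cost = assign[:], unsat_weight(clauses, assign)
    for t in range(1, steps + 1):
        moves = []
        for v in range(n):
            assign[v] = not assign[v]
            cost = unsat_weight(clauses, assign)
            assign[v] = not assign[v]
            if tabu_until[v] <= t or cost < best_cost:
                moves.append((cost, v))
        if not moves:
            continue
        cost, v = min(moves)            # best admissible flip
        assign[v] = not assign[v]
        tabu_until[v] = t + tenure
        if cost < best_cost:
            best, best_cost = assign[:], cost
    return best, best_cost

def iterated_local_search(clauses, n_vars, restarts=20, strength=2, seed=0):
    # ILS: descend to a local optimum, perturb it with a random k-flip,
    # descend again, and accept the new optimum if it is at least as good.
    rng = random.Random(seed)
    cur = [rng.random() < 0.5 for _ in range(n_vars)]
    cur, cur_cost = tabu_search(clauses, cur)
    best, best_cost = cur[:], cur_cost
    for _ in range(restarts):
        cand = cur[:]
        for v in rng.sample(range(n_vars), min(strength, n_vars)):
            cand[v] = not cand[v]       # perturbation step
        cand, cost = tabu_search(clauses, cand)
        if cost <= cur_cost:            # acceptance criterion
            cur, cur_cost = cand, cost
        if cost < best_cost:
            best, best_cost = cand[:], cost
    return best, best_cost
```

For example, on the instance `[(1.0, [1, 2]), (2.0, [-1, 3]), (1.5, [-2, -3])]` the search finds an assignment satisfying all three weighted clauses.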
Future work regards the application of both IRoTS and MC-IRoTS to other problems in
complex SRL domains and the adaptation of lazy techniques such as those presented in (Poon
et al. 2008).
Chapter 8
Conclusion
8.1 Contributions of this Dissertation
This dissertation presented novel algorithms for Markov Logic Networks by addressing the
problems of learning and inference of these models. Its contributions are:
• A novel and powerful algorithm for generative structure learning of Markov Logic Net-
works. The GSL algorithm for generative structure learning of Markov Logic Networks
exploits the iterated local search metaheuristic guided by pseudo-likelihood. The algo-
rithm performs a biased sampling of the set of local optima focusing the search not on
the full space of solutions but on a smaller subspace defined by the solutions that are lo-
cally optimal for the optimization engine. It employs a strong perturbation operator and
an iterative improvement local search procedure in order to balance diversification (randomness
induced by strong perturbation to avoid search stagnation) and intensification
(greedily increasing solution quality by exploiting the evaluation function). Experimental
evaluation on two benchmark datasets, concerning Link Analysis in
Social Networks and Entity Resolution in citation databases, shows that GSL achieves
improvements over the state-of-the-art algorithms for generative structure learning of
Markov Logic Networks.
• The first algorithm for discriminative structure learning of Markov Logic Networks. The
ILS-DSL algorithm learns discriminatively first-order clauses and their weights. The al-
gorithm scores the candidate structures by maximizing conditional likelihood or area
under the Precision-Recall curve while setting the parameters by maximum pseudo-
likelihood. ILS-DSL is based on the Iterated Local Search metaheuristic. To speed up
learning we propose some simple heuristics that greatly reduce the computational effort
for scoring structures. Empirical evaluation with real-world data in two domains shows
the promise of our approach, which improves over the state-of-the-art discriminative weight
learning algorithm for MLNs in terms of conditional log-likelihood of the query predicates
given evidence. We have also compared the proposed algorithm with the state-of-the-art
generative structure learning algorithm and shown that on small datasets the
generative approach is competitive, while on larger datasets the discriminative approach
outperforms the generative one.
• A powerful algorithm based on randomized beam search for discriminative structure
learning. The RBS-DSL algorithm learns discriminatively first-order clauses and their
weights. The algorithm scores the candidate structures by maximizing conditional likeli-
hood or area under the Precision-Recall curve while setting the parameters by maximum
pseudo-likelihood. RBS-DSL is inspired by the Greedy Randomized Adaptive Search
Procedure (GRASP) metaheuristic and performs randomized beam search: it scores the
structures through maximum likelihood in a first phase, and then uses maximum CLL or
AUC of the PR curve in a second step to randomly generate a beam of the best clauses
to add to the current MLN structure. To speed up learning we propose some simple
heuristics that greatly reduce the computational effort for scoring structures. Empirical
evaluation with real-world data in two domains shows the promise of our approach,
which improves over the state-of-the-art discriminative weight learning algorithm for MLNs in
terms of conditional log-likelihood of the query predicates given evidence. We have also
compared the proposed algorithm with the state-of-the-art generative structure learning
algorithm and shown that on small datasets the generative approach is competitive, while
on larger datasets the discriminative approach outperforms the generative one.
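The GRASP-style randomized construction at the heart of this scheme can be illustrated generically (a minimal sketch, not the actual RBS-DSL implementation; the candidate scores, `alpha`, and the restricted-candidate-list rule are illustrative choices):

```python
import random

def grasp_beam(candidates, score, beam_size, alpha=0.3, seed=0):
    # GRASP-style randomized beam construction: rather than greedily taking
    # the top-scoring candidates, repeatedly draw at random from a restricted
    # candidate list (RCL) of near-best candidates, trading a little
    # greediness for diversification across restarts.
    rng = random.Random(seed)
    pool = list(candidates)
    beam = []
    while pool and len(beam) < beam_size:
        pool.sort(key=score, reverse=True)
        s_max, s_min = score(pool[0]), score(pool[-1])
        cutoff = s_max - alpha * (s_max - s_min)   # RCL admission threshold
        rcl = [c for c in pool if score(c) >= cutoff]
        pick = rng.choice(rcl)
        beam.append(pick)
        pool.remove(pick)
    return beam
```

With `alpha = 0` this degenerates to plain greedy beam selection; with `alpha = 1` every remaining candidate is eligible at each draw.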
• A novel and powerful algorithm for MAP inference in Markov Logic. The IRoTS al-
gorithm based on the Iterated Local Search and Tabu Search metaheuristics, performs
a biased sampling of the set of local optima by using Tabu Search as a local search
procedure and repetitively jumping in the search space through a perturbation operator.
Extensive experiments on real-world data show that IRoTS outperforms the state-of-the-
art algorithm for MAP inference in Markov Logic.
• A novel and powerful algorithm for conditional inference in Markov Logic. The MC-
IRoTS algorithm combines IRoTS with Markov Chain Monte Carlo by interleaving
RoTS steps with Metropolis transitions in an iterated local search. Experiments on real-
world domains show that MC-IRoTS is faster than the state-of-the-art algorithm for con-
ditional inference in Markov Logic while maintaining the same quality of probabilities
produced. Finally, MC-IRoTS was used as a sampler to approximate the sufficient statis-
tics in a state-of-the-art discriminative parameter learning algorithm for MLNs and it
was shown through experiments in the webpage classification domain that MC-IRoTS
produces good samples to be used during learning.
8.2 Directions for Future Research
Any research effort can become the beginning of exciting new research, or even of entirely novel
research areas. This dissertation aimed at investigating the integration of logic and probability
in the context of Markov Logic Networks and this section describes the future directions that
might be followed.
• Parallel computing for models that integrate logic and probability. The GSL algorithm
is a simple example of how parallel computing can help learn better SRL models. Imple-
menting more sophisticated parallel models such as MPI (Message Passing Interface) or
PVM (Parallel Virtual Machine) could boost performance in learning complex models
such as Markov Logic Networks. In the era of multi-core computing, running parallel
threads of an algorithm has become easier and easier. Moreover, algorithms such as
those proposed in this dissertation based on ILS or GRASP could be easily parallelized
due to the independent nature of the iterations that could be assigned to separate threads.
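The independent-restart parallelism mentioned above can be sketched as follows (a minimal illustration; `run_once` is a placeholder for one seeded ILS or GRASP restart returning a (cost, solution) pair):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_restarts(run_once, n_runs, workers=4, executor_cls=ThreadPoolExecutor):
    # Independent ILS/GRASP restarts are embarrassingly parallel: run each
    # restart (identified by its own seed) concurrently and keep the best
    # (cost, solution) pair. For CPU-bound search in Python, substitute
    # ProcessPoolExecutor to sidestep the GIL; MPI or PVM would play the
    # same role across machines.
    with executor_cls(max_workers=workers) as pool:
        results = list(pool.map(run_once, range(n_runs)))
    return min(results)
```

Because restarts share no state, speed-up is close to linear in the number of workers up to the number of restarts.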
• Search space structure analysis for Markov Logic Networks and other SRL models in
general. The performance of SLS algorithms strongly depends on the structural aspects
of the search space. To the best of the author's knowledge, no theoretical or empirical
analysis of the search space for MLNs exists (nor for other SRL models). Understanding
the properties of such spaces could greatly improve our ability to use SLS algorithms
to learn MLNs (or SRL) models. These properties include fundamental features of the
search space such as size, connectivity, diameter and solution density as well as global
and local properties of the search landscapes.
• Multiobjective optimization for Markov Logic Networks and other SRL models in gen-
eral. A learned MLN should perform well not only in terms of conditional likelihood but
also in terms of the area under the precision-recall curve. Many algorithms
optimize only one of these measures during structure search, often giving poor results for the
other. It is interesting to investigate how multiobjective optimization tech-
niques can be applied to learning MLNs and SRL models in general.
• Analysis of the relationship between different evaluation functions. Pseudo-likelihood is
a good measure when learning probabilistic models due to its efficiency, but gives poor
results when long chains of inference are required at query time. Conditional likelihood
would be the perfect measure to optimize during search, but it is intractable. Thus it is
interesting to further investigate the relationship between these two measures and understand
whether a structure that is good in terms of pseudo-likelihood is also good in
terms of conditional likelihood.
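For reference, the two measures can be written explicitly (standard definitions from the MLN literature; here $x$ ranges over the $n$ ground atoms, $MB_x(X_l)$ denotes the state of the Markov blanket of atom $X_l$ in the data, and $Y$, $X$ are the query and evidence atoms):

```latex
% Pseudo-log-likelihood: each ground atom conditioned on its Markov blanket
\log P^{\ast}_w(X = x) \;=\; \sum_{l=1}^{n} \log P_w\!\left(X_l = x_l \mid MB_x(X_l)\right)

% Conditional log-likelihood of the query atoms given the evidence
\mathrm{CLL}(y \mid x) \;=\; \log P_w(Y = y \mid X = x)
```

Pseudo-likelihood avoids inference entirely (each factor is local), which is why it is efficient but blind to long chains of inference; the CLL factors cannot be decomposed this way.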
• Use of ILP techniques to restrict the space of structures. Search in ILP is restricted by
refinement operators, which direct the search of the lattice, exploring the candidates of a
certain structure according to a generality ordering. In Markov Logic this has not been
attempted yet, and both current algorithms and the algorithms proposed in this dissertation
blindly generate all the potential candidates of a certain structure, leading to a huge space
of structures. A generality ordering in Markov Logic is not easy to define, but further
investigation in this direction would be valuable in order to achieve major breakthroughs in the field.
• Efficient computation of clause true counts. The main bottleneck of learning MLN
structures is the computation of the number of true groundings of a clause. High-performing
SAT solvers such as IRoTS could be used to efficiently sample the satisfying
solutions of a clause and then count the number of its true groundings. Such approaches
represent the state of the art in model counting (Gomes et al. 2007; Wei and Selman 2005).
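In its simplest form, such a sampling-based count can be illustrated with uniform sampling of substitutions rather than SAT-solution sampling (an illustrative sketch; the clause encoding and the `world` representation as a set of true ground atoms are assumptions made for the example):

```python
import random

def estimate_true_groundings(clause, variables, constants, world,
                             n_samples=20000, seed=0):
    # Monte Carlo estimate of the number of true groundings of a clause:
    # draw substitutions uniformly at random, test the grounded clause in the
    # given world (a set of true ground atoms), and scale the satisfied
    # fraction by the total number of groundings. Each literal is a triple
    # (negated, predicate, variable_args).
    rng = random.Random(seed)
    total = len(constants) ** len(variables)
    hits = 0
    for _ in range(n_samples):
        theta = {v: rng.choice(constants) for v in variables}
        hits += any(
            ((pred, tuple(theta[a] for a in args)) in world) != negated
            for negated, pred, args in clause
        )
    return total * hits / n_samples
```

For instance, the unit clause Smokes(x) with an extra unused variable y over constants {A, B}, in a world where only Smokes(A) holds, has exactly 2 true groundings out of 4, and the estimate converges to that value.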
• Piecewise training of Markov Logic Networks. An appealing idea for undirected models
is to independently train a local undirected classifier over each clique and then combine
the learned weights into a single global model. Piecewise training or piecewise
pseudolikelihood has been shown to be more accurate than standard pseudolikelihood
(Sutton and McCallum 2005b, 2007).
8.3 Summary
Integrating logic and probability has a long history in Artificial Intelligence and Machine Learning.
This dissertation took up the challenge of exploring and developing high-performing
algorithms for a state-of-the-art model that integrates first-order logic and probability. However,
much remains to be done before AI systems reach human intelligence. A powerful
language for pursuing this goal is Markov Logic, which embodies the experience and successes of
various subfields of AI and Statistics. It makes it possible to express complexity and uncertainty, just as
humans do in complex environments. Moreover, complex models that reflect real-world
phenomena can be learned efficiently from examples, and powerful inference algorithms can
be used to answer queries about the world. This dissertation made an effort to build powerful
algorithms for these two tasks. It is thus hoped that this dissertation constitutes another
step in our attempt to better understand and build intelligent systems.
Appendix A
The MLN++ Package
The MLN++ package is a suite of algorithms built upon Alchemy (Kok et al. 2005). Alchemy can be
seen as a declarative programming language related to Prolog. Prolog has long played
an important role in Artificial Intelligence and Machine Learning; in particular, most current
state-of-the-art Inductive Logic Programming and Statistical Relational Learning systems are
written in this language. For Alchemy, however, the underlying inference mechanism is model
checking instead of theorem proving; the full syntax of first-order logic is allowed, rather than
just Horn clauses; and the ability to handle uncertainty and learn from data is already built in.
MLN++ can be seen as the analog, built upon Alchemy, of the ILP and SRL systems built upon
Prolog. MLN++ includes the algorithms GSL, ILS-DSL, RBS-DSL, IRoTS and MC-IRoTS.
It also includes LearnParams, which is a version of PSCG that works by sampling with MC-
IRoTS.
In this appendix we present, for each algorithm of MLN++, its parameters and how it is
used. Most of the parameters are shared with Alchemy, and we describe here only the
parameters specific to each algorithm. For the standard parameters of Alchemy please refer
to (Kok et al. 2005).
GSL. The GSL algorithm has the following parameters in addition to, or differing from, those of Alchemy:
bestGainUnchangedLimit. Number of iterations without improvement for iterated local search.
minGain. Minimum gain of a candidate structure to be accepted as the new best structure.
ILS-DSL. The ILS-DSL algorithm has the following parameters in addition to, or differing from, those of Alchemy:
bestGainUnchangedLimit. Number of iterations without improvement for iterated local search.
queryPredicate. The query predicate for which the discriminative model should be learned.
RBS-DSL. The RBS-DSL algorithm has the following parameters in addition to, or differing from, those of Alchemy:
beamSize. The size of beam to consider in the randomized construction of the beam of clauses.
numClausesReEval. Maximum number of clauses to be considered for scoring in terms of CLL
or AUC.
bestGainUnchangedLimit. Number of iterations without improvement for randomized beam
search.
queryPredicate. The query predicate for which the discriminative model should be learned.
IRoTS. The IRoTS algorithm has the following parameters in addition to, or differing from, those of Alchemy:
iterations. Number of iterations without improvement for iterated robust tabu search.
threshold. The threshold ratio for iterated robust tabu search.
MC-IRoTS. The MC-IRoTS algorithm has the following parameters in addition to, or differing from, those of Alchemy:
iterations. Number of iterations without improvement for iterated robust tabu search.
threshold. The threshold ratio for iterated robust tabu search.
LearnParams. The LearnParams algorithm needs the parameters necessary for MC-IRoTS:
iterations. Number of iterations without improvement for iterated robust tabu search.
threshold. The threshold ratio for iterated robust tabu search.
References
A. Deshpande, M.N. Garofalakis, and M.I. Jordan. Efficient stepwise selection in decomposable
models. In Proc. UAI, pages 128–135, 2001. 15
C. Anderson, P. Domingos, and D. Weld. Relational markov models and their application to
adaptive web navigation. In Proc. of 8th ACM SIGKDD Int’l Conf. on Knowledge Discovery
and Data Mining, pages 143–152. Edmonton, Canada: ACM Press, 2002. 27
F. Bacchus. Representing and Reasoning with Probabilistic Knowledge. Cambridge, MA: MIT
Press, 1990. 1
F. Bach and M. Jordan. Thin junction trees. In NIPS 14, 2002. 15
R. Battiti and M. Protasi. Reactive search, a history-based heuristic for max-sat. ACM Journal
of Experimental Algorithmics, (2), 1997. 105
J. Besag. Statistical analysis of non-lattice data. Statistician, 24:179–195, 1975. 2, 36
I. Bhattacharya and L. Getoor. Iterative record linkage for cleaning and integration. In Proc.
SIGMOD-04 DMKD Workshop, 2004. 76
M. Biba, S. Ferilli, and F. Esposito. Structure learning of markov logic networks through
iterated local search. In Frontiers in Artificial Intelligence and Applications, Proceedings of
18th European Conference on Artificial Intelligence (ECAI)., volume 178, pages 361–365,
2008. 53
M. Bilenko and R. Mooney. Adaptive duplicate detection using learnable string similarity
measures. In Proc. KDD-03, pages 39–48, 2003. 56, 77
C.M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006. 10
W. L. Buntine. Operations for learning with graphical models. J. AI Research, 2:159–225,
1994. 15
M. Collins. Discriminative training methods for hidden markov models: Theory and experi-
ments with perceptron algorithms. In Proc. of the 2002 Conference on Empirical Methods
in Natural Language Processing. Philadelphia, PA: ACL, 2002. 4, 42
V. Santos Costa, D. Page, and J. Cussens. Clp(bn): Constraint logic programming for proba-
bilistic knowledge. In Probabilistic Inductive Logic Programming, volume LNCS 4911,
pages 156–188. Springer, 2008. 28
R. G. Cowell, A. P. Dawid, S. L. Lauritzen, and D. J. Spiegelhalter. Probabilistic Networks and
Expert Systems. Springer-Verlag, 1999. 9, 10
M. Craven and S. Slattery. Relational learning with statistical predicate invention: Better mod-
els for hypertext. Machine Learning, 43:97–119, 2001. 131
M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery.
Learning to extract symbolic knowledge from the world wide web. In Proc. of AAAI. AAAI
Press, 1998. 131
C. Cumby and D. Roth. Feature extraction languages for propositionalized relational learning.
In Proceedings of the IJCAI-2003 Workshop on Learning Statistical Models from Relational
Data, pages 24–31. Acapulco, Mexico: IJCAII, 2003. 2
J. Cussens. Parameter estimation in stochastic logic programs. Machine Learning, 44(3):
245–271, 2001. 2, 21, 24
J. Cussens. Loglinear models for first-order probabilistic reasoning. In Fifteenth Annual Con-
ference on Uncertainty in Artificial Intelligence, pages 126–133. Morgan Kaufmann, 1999.
21, 22, 24
P. Damien, J. Wakefield, and S. Walker. Gibbs sampling for bayesian non-conjugate and hi-
erarchical models by auxiliary variables. Journal of the Royal Statistical Society B, 61:2,
1999. 120
J. Davis and M. Goadrich. The relationship between precision-recall and roc curves. In Proc.
23rd ICML, pages 233–240, 2006. 58, 68, 72, 78, 79, 93, 97, 98
J. Davis, I. de Castro Dutra E. Burnside, D. Page, and V. Santos Costa. An integrated approach
to learning bayesian networks of rules. In Proc. 16th European Conf. on Machine Learning,
volume 3720 LNCS, pages 84–95, 2005. 27, 82, 83
A. P. Dawid. Conditional independence for statistical operations. Annals of Statistics, 8:598–
617, 1980. 11
L. De Raedt. Logical settings for concept-learning. Artificial Intelligence, 95(1):197–201,
1997. 18, 19
L. De Raedt and L. Dehaspe. Clausal discovery. Machine Learning, 26:99–146, 1997. 19, 37
L. De Raedt and K. Kersting. Probabilistic logic learning. SIGKDD Explorations, 5(1):31–48,
2003. 18, 20
L. De Raedt and K. Kersting. Probabilistic inductive logic programming. In Proc. of
Algorithmic Learning Theory, pages 19–36, 2004. 18, 20, 21
L. De Raedt, K. Kersting, and S. Torge. Towards learning stochastic logic programs from
proof-banks. In Proc. of AAAI, pages 752–757. AAAI Press, 2005. 22, 25
L. De Raedt, P. Frasconi, K. Kersting, and S. Muggleton, editors. Probabilistic Inductive Logic
Programming - Theory and Applications. Springer, 2008. 1
L. Dehaspe. Maximum entropy modeling with clausal constraints. In Proc. of 17th Int’l Work-
shop on Inductive Logic Programming, volume volume 1297 of LNCS, pages 109–124.
Springer, 1997. 27, 62, 82, 83
S. Della Pietra, V. Della Pietra, and J. Laferty. Inducing features of random fields. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 19:380–392, 1997. 4, 12, 15,
39, 65, 66
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via
the em algorithm. Journal of the Royal Statistical Society, Series B, vol. 39:1–38, 1977. 15
P. Domingos and M. Pazzani. On the optimality of the simple bayesian classifier under zero-one
loss. Machine Learning, 29:103–130, 1997. 4
P. Domingos and M. Richardson. Markov logic: A unifying framework for statistical relational
learning. In L. Getoor and B. Taskar, editors, Introduction to Statistical Relational Learning,
pages 339–371. Cambridge, MA: MIT Press, 2007. 2
P. Domingos, S. Kok, H. Poon, M. Richardson, and P. Singla. Markov logic. In K. Kersting
L. De Raedt, P. Frasconi and S. Muggleton, editors, Probabilistic Inductive Logic Program-
ming, pages 92–117. New York: Springer, 2008. 2
S. Dzeroski and N. Lavrac. Relational Data Mining. Springer-Verlag, 2001. 18
D. Edwards. Introduction to Graphical Modelling, 2nd ed. Springer-Verlag, 2000. 10
I. Fellegi and A. Sunter. A theory for record linkage. J. American Statistical Association, 64:
1183–1210, 1969. 55, 75
T. A. Feo and M.G.C. Resende. A probabilistic heuristic for a computationally difficult set
covering problem. Operations Research Letters, 8(2):67–71, 1989. 87
T. A. Feo and M.G.C. Resende. Greedy randomized adaptive search procedures. Journal of
Global Optimization, 6:109–133, 1995. 87
P. Festa and M.G.C. Resende. Grasp: An annotated bibliography. In C.C. Ribeiro and
P. Hansen, editors, Essays and Surveys on Metaheuristics, pages 325–367. Kluwer Academic
Publishers, 2002. 101
P. Flach and N. Lachiche. Naïve bayesian classification of structured data. Machine Learning,
57(3):233–269, 2004. 27
C. Fonlupt, D. Robilliard, P. Preux, and E.-G. Talbi. Fitness landscape and performance of
meta-heuristics. In S. Voss, S. Martello, I.H. Osman, and C. Roucairol, editors, Meta-
Heuristics: Advances and Trends in Local Search Paradigms for Optimization, pages 257–
268. Kluwer Academic Publishers, Boston, MA, 1999. 52
J. H. Friedman. On bias, variance, 0/1 - loss, and the curse-of-dimensionality. Data Mining
and Knowledge Discovery, pages 55–77, 1997a. 4
N. Friedman. Learning belief networks in the presence of missing values and hidden variables.
In Fourteenth Inter. Conf. on Machine Learning (ICML97), 1997b. 14
N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic relational models. In
Proc. 16th Int’l Joint Conf. on AI (IJCAI), pages 1300–1307. Morgan Kaufmann, 1999. 2,
23
J. Furnkranz. Separate-and-conquer rule learning. Artificial Intelligence Review, 13(1):3–54,
1999. 19
V. Ganapathi, D. Vickrey, J. Duchi, and D. Koller. Constrained approximate maximum entropy
learning. In Proc. of UAI, 2008. 16
M.R. Garey and D.S. Johnson. Computers and Intractability: A Guide to the Theory of NP-
Completeness. Freeman, San Francisco, CA, 1979. 104
D. Heckerman, D. Geiger, and D. M. Chickering. Learning bayesian networks: The combination
of knowledge and statistical data. Machine Learning, 20:197–243, 1995. 14
M. R. Genesereth and N. J. Nilsson. Logical foundations of artificial intelligence. San Mateo,
CA: Morgan Kaufmann., 1987. 17, 30, 31, 36, 44
L. Getoor and B. Taskar. Introduction to Statistical Relational Learning. MIT, 2007. 1, 62
C. J. Geyer and E. A. Thompson. Constrained monte carlo maximum likelihood for dependent
data. Journal of the Royal Statistical Society, Series B, 54:657–699, 1992. 36, 41
W. R. Gilks, S. Richardson, and D. J. Spiegelhalter. Markov Chain Monte Carlo in Practice.
Chapman and Hall, 1996. 45, 118
F. Glover and M. Laguna. Tabu Search. Kluwer Academic Publishers, Boston, MA, 1997. 52,
106
Carla P. Gomes, Jörg Hoffmann, Ashish Sabharwal, and Bart Selman. From sampling to model
counting. In IJCAI, pages 2293–2299, 2007. 138
R. Greiner, X. Su, S. Shen, and W. Zhou. Structural extension to logistic regression: Discrim-
inative parameter learning of belief net classifiers. Machine Learning, 59:297–322, 2005.
4
D. Grossman and P. Domingos. Learning bayesian network classifiers by maximizing con-
ditional likelihood. In Proc. 21st Int’l Conf. on Machine Learning, pages 361–368. Banf,
Canada: ACM Press, 2004. 4, 68
J. Halpern. An analysis of first-order logics of probability. Artificial Intelligence, 46:311–350,
1990. 1, 29
P. Hansen and B. Jaumard. Algorithms for the maximum satisfiability problem. Computing,
44:279–303, 1990. 105
D. Heckerman. A tutorial on learning with bayesian networks. In M. Jordan, editor, Learning
in Graphical Models. MIT Press, 1998. 15
D. Heckerman, D. Chickering, C. Meek, R. Rounthwaite, and C. Kadie. Dependency networks
for inference, collaborative filtering and data visualization. Journal of Machine Learning
Research, pages 49–75, 2000. 26
H. H. Hoos and T. Stutzle. Stochastic Local Search: Foundations and Applications. Morgan
Kaufmann, San Francisco, 2005. 44, 47, 49, 87, 103, 105
T. N. Huynh and R. J. Mooney. Discriminative structure and parameter learning for markov
logic networks. In Proc. of the 25th International Conference on Machine Learning
(ICML), 2008. 62, 83
R. Jirousek and S. Preucil. On the effective implementation of the iterative proportional fitting
procedure. Computational Statistics & Data Analysis, 19:177–189, 1995. 15
M. I. Jordan, editor. Learning in Graphical Models. MIT Press, 1998. 9, 10
M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational
methods for graphical models. In M. I. Jordan (Ed.), Learning in Graphical Models. Cam-
bridge: MIT Press, 1999. 17
H. Kautz, B. Selman, and Y. Jiang. A general stochastic approach to solving problems with
hard and soft constraints. In The Satisfiability Problem: Theory and Applications. AMS.
1997a. 103, 104, 106
H. Kautz, B. Selman, and Y. Jiang. A general stochastic approach to solving problems with
hard and soft constraints. In D. Du, J. Gu, and P. Pardalos, editors, The Satisfiability
Problem: Theory and Applications, pages 573–586. New York, NY: American Mathematical
Society, 1997b. 42, 44, 126
K. Kersting and L. De Raedt. Towards combining inductive logic programming with bayesian
networks. In Proc. 11th Int’l Conf. on Inductive Logic Programming, pages 118–131.
Springer, 2001a. 2, 21, 24
K. Kersting and L. De Raedt. Adaptive bayesian logic programs. In Proc. of the 11th Confer-
ence on Inductive Logic Programming, volume 2157. Springer, 2001b. 24
R. Kindermann and J. L. Snell. Markov Random Fields and Their Applications. American
Mathematical Society, 1980. 12
S. Kok and P. Domingos. Learning the structure of markov logic networks. In Proc. 22nd Int’l
Conf. on Machine Learning, pages 441–448, 2005. 3, 38, 39, 55, 56, 57, 61, 62, 66, 68, 73,
74, 75, 94, 95, 101
S. Kok, P. Singla, M. Richardson, and P. Domingos. The alchemy system for
statistical relational ai. Technical report, Department of CSE-UW, Seattle, WA,
http://alchemy.cs.washington.edu/, 2005. 56, 77, 95, 141
D. Koller and A. Pfeffer. Probabilistic frame-based systems. In Proc. AAAI, 1998. 23
D. Koller, A. Levy, and A. Pfeffer. P-classic: A tractable probabilistic description logic. In
Proc. of NCAI-97, pages 360–397, 1997. 2, 23
F. Kschischang, B. Frey, and H. Loeliger. Factor graphs and the sum product algorithm. IEEE
Transactions on Information Theory, February 2001. 17
J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for
segmenting and labeling sequence data. In Proc. 18th Int’l Conf. on Machine Learning,
pages 282–289, 2001. 3, 40, 65, 125
N. Landwehr, K. Kersting, and L De Raedt. nfoil: Integrating naive bayes and foil. In Proc.
20th Nat’l Conf. on Artificial Intelligence, pages 795–800. AAAI Press, 2005. 26, 62, 82
N. Landwehr, A. Passerini, L. De Raedt, and P. Frasconi. kfoil: Learning simple relational
kernels. In Proc. 21st Nat’l Conf. on Artificial Intelligence. AAAI Press, 2006. 26, 62, 82
N. Landwehr, K. Kersting, and L. De Raedt. Integrating naive bayes and foil. Journal of
Machine Learning Research, pages 481–507, 2007. 26, 62, 82, 83
N. Lavrac and S. Dzeroski. Inductive Logic Programming: Techniques and applications. UK:
Ellis Horwood, Chichester, 1994. 1, 2, 18
S. Lee, V. Ganapathi, and D. Koller. Efficient structure learning of markov networks using
l1-regularization. In Proc. of NIPS, 2006. 15
D. C. Liu and J. Nocedal. On the limited memory bfgs method for large scale optimization.
Mathematical Programming, 45:503–528, 1989. 37, 41, 65
H.R. Lourenço, O. Martin, and T. Stutzle. Iterated local search. In F. Glover and
G. Kochenberger, editors, Handbook of Metaheuristics, pages 321–353. Kluwer Academic
Publishers, Norwell, MA, USA, 2002. 3, 5, 49, 51, 52, 66, 103
D. Lowd and P. Domingos. Efficient weight learning for markov logic networks. In Proc. of
the 11th PKDD, pages 200–211. Springer Verlag, 2007. 3, 4, 6, 42, 43, 56, 65, 73, 74, 76,
77, 79, 80, 94, 95, 96, 98, 111, 127, 128, 129, 130, 131
E. Marinari and G. Parisi. Simulated tempering: A new monte carlo scheme. Europhysics
Letters, 19:451–458, 1992. 118
A. McCallum. Efficiently inducing features of conditional random fields. In Proc. UAI-03,
pages 403–410, 2003. 15, 39, 66
A. McCallum and B Wellner. Conditional models of identity uncertainty with application to
noun coreference. In NIPS-04, 2005. 103
R. McEliece and S. M. Aji. The generalized distributive law. IEEE Trans. Inform. Theory, 46:
325–343, 2000. 17
M. Mezard, G. Parisi, and M. A. Virasoro. Spin-glass theory and beyond. In Lecture Notes in
Physics, volume 9. World Scientific, Singapore, 1987. 52
L. Mihalkova and R. J. Mooney. Bottom-up learning of markov logic network structure. In
Proc. 24th Int’l Conf. on Machine Learning, pages 625–632, 2007. 3, 39, 40, 55, 56, 57, 62,
73, 74, 75, 80, 94, 99
B. Milch, B. Marthi, D. Sontag, S. Russell, and D. L. Ong. Blog: Probabilistic models with
unknown objects. In Proc.IJCAI-05, pages 1352–1359. Edinburgh, Scotland, 2005. 76
P. Mills and E. Tsang. Guided local search for solving sat and weighted max-sat problems.
In I.P. Gent, H. van Maaren, and T. Walsh, editors, SAT2000 - Highlights of Satisfiability
Research in the Year 2000, pages 89–106. IOS Press, 2000. 105
T. M. Mitchell. Machine Learning. The McGraw-Hill Companies, Inc., 1997. 19
M. Moller. A scaled conjugate gradient algorithm for fast supervised learning. Neural Net-
works, 6:525–533, 1993. 6, 43, 128
S. Muggleton. Learning structure and parameters of stochastic logic programs. In Proc. of
12th Int'l Conference on Inductive Logic Programming, pages 198–206, 2002. 24
S. Muggleton. Stochastic logic programs. In L. De Raedt (Ed.), Advances in inductive logic
programming. IOS Press, Amsterdam, 1996. 2, 21, 24
S. Muggleton and L. De Raedt. Inductive logic programming: Theory and methods. Journal
of Logic Programming, 19(20):629–679, 1994. 18
S. Muggleton and C. Feng. Efficient induction of logic programs. In Inductive logic program-
ming, pages 281–297. New York: Academic Press., 1992. 39
S. H. Muggleton. Inverse entailment and progol. New Generation Computing Journal, pages
245–286, 1995. 18
K. Murphy. Learning bayes net structure from sparse data sets. Technical report, Comp. Sci.
Div., UC Berkeley, 2001. 14
K. Murphy, Y. Weiss, and M. I. Jordan. Loopy belief propagation for approximate inference:
An empirical study. In 15th Conf. on Uncertainty in Artificial Intelligence (UAI). San Mateo,
CA: Morgan Kaufmann, 1999. 17
J. Neville and D. Jensen. Relational dependency networks. Journal of Machine Learning
Research, 8(Mar):653–692, 2007. 26
H. Newcombe, J. Kennedy, S. Axford, and A. James. Automatic linkage of vital records.
Science, 130:954–959, 1959. 55, 75
A. Y. Ng and M. I. Jordan. On discriminative vs. generative classifiers: A comparison of
logistic regression and naive bayes. In Advances in Neural Information Processing Systems,
pages 841–848. Cambridge, MA: MIT Press, 2002. 4, 80, 99
R. T. Ng and V. S. Subrahmanian. Probabilistic logic programming. Information and Compu-
tation, 101(2):150–201, December 1992. 22
L. Ngo and P. Haddawy. Answering queries from context-sensitive probabilistic knowledge
bases. Theoretical Computer Science, 171:147–177, 1997. 2, 22
S.-H. Nienhuys-Cheng and R. de Wolf. Foundations of Inductive Logic Programming.
Springer-Verlag, 1997. 19
N. Nilsson. Probabilistic logic. Artificial Intelligence, 28:71–87, 1986. 1
J. Nocedal and S. Wright. Numerical Optimization. Springer, New York, NY, 2006. 66
J. Nocedal and S. J. Wright. Numerical optimization. New York, NY: Springer, 1999. 41, 42,
127
T. P. Minka. Algorithms for maximum-likelihood logistic regression. Technical report, avail-
able from http://www.stat.cmu.edu/minka/, 2001. 16
J. D. Park. Using weighted max-sat engines to solve mpe. In Proc. of AAAI, pages 682–687,
2005. 106
H. Pasula and S. Russell. Approximate inference for first-order probabilistic languages. In Pro-
ceedings of the Seventeenth International Joint Conference on Artificial Intelligence, pages
741–748. Seattle, WA: Morgan Kaufmann, 2001. 2, 76
J. Pearl. Probabilistic reasoning in intelligent systems: Networks of plausible inference. San
Francisco, CA: Morgan Kaufmann, 1988. 1, 9, 16, 17
F. Pernkopf and J. Bilmes. Discriminative versus generative parameter and structure learning
of bayesian network classifiers. In Proc. 22nd Int'l Conf. on Machine Learning, pages 657–
664, 2005. 4
G. D. Plotkin. A note on inductive generalization. In Machine Intelligence, Edinburgh Univer-
sity Press, 5:153–163, 1970. 19, 22, 25
D. Poole. First-order probabilistic inference. In Proceedings of the 18th International Joint
Conference on Artificial Intelligence, pages 985–991. Acapulco, Mexico: Morgan Kauf-
mann, 2003. 45
D. Poole. Probabilistic horn abduction and bayesian networks. Artificial Intelligence, 64:81–
129, 1993. 2
H. Poon and P. Domingos. Sound and efficient inference with probabilistic and deterministic
dependencies. In Proc. 21st Nat’l Conf. on AI, (AAAI), pages 458–463. AAAI Press, 2006.
4, 5, 6, 46, 57, 65, 67, 118, 121, 122
H. Poon, P. Domingos, and M. Sumner. A general method for reducing the complexity of
relational inference and its application to mcmc. In Proc. 23rd Nat’l Conf. on Artificial
Intelligence. Chicago, IL: AAAI Press, 2008. 4, 46, 65, 67, 84, 134
A. Popescul and L. H. Ungar. Structural logistic regression for link analysis. In Proceed-
ings of the Second International Workshop on Multi-Relational Data Mining, pages 92–106.
Washington, DC: ACM Press, 2003. 2, 27, 54, 55, 74, 82
J. R. Quinlan. Learning logical definitions from relations. Machine Learning, 5:239–266,
1990. 18, 26
L. R. Rabiner. A tutorial on hidden markov models and selected applications in speech recog-
nition. In Proceedings of the IEEE, pages 257–286. IEEE, 1989. 42
M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 62:107–136,
2006. 2, 29, 31, 32, 36, 37, 38, 39, 40, 41, 45, 46, 55, 56, 68, 74, 75, 78, 97
F. Riguzzi. Learning logic programs with annotated disjunctions. In Proc. 14th International
Conference on Inductive Logic Programming, pages 270–287. Springer, 2004. 28
D. Roth. On the hardness of approximate reasoning. Artificial Intelligence, 82:273–302, 1996.
103, 117
V. Santos Costa, D. Page, M. Qazi, and J. Cussens. Clp(bn): Constraint logic programming
for probabilistic knowledge. In Proceedings of the Nineteenth Conference on Uncertainty in
Artificial Intelligence, pages 517–524. Acapulco, Mexico: Morgan Kaufmann, 2003. 2
T. Sato. A statistical learning method for logic programs with distribution semantics. In Proc.
of the 12th Int’l Conference on Logic Programming, Tokyo, pages 715–729, 1995. 25
T. Sato and Y. Kameya. Prism: A symbolic-statistical modeling language. In Proceedings
of the Fifteenth International Joint Conference on Artificial Intelligence, pages 1330–1335.
Nagoya, Japan: Morgan Kaufmann, 1997a. 2
T. Sato and Y. Kameya. Prism: A symbolic-statistical modeling language. In Proceedings of
the 15th International Joint Conference on Artificial Intelligence, pages 1330–1335, 1997b.
25
T. Sato and Y. Kameya. Parameter learning of logic programs for symbolic-statistical modeling.
Journal of Artificial Intelligence Research (JAIR), 15:391–454, 2001. 25
T. Sato and Y. Kameya. New advances in logic-based probabilistic modeling by prism. In
Probabilistic Inductive Logic Programming, volume LNCS 4911, pages 118–155. Springer,
2008. 25
B. Selman, H. Kautz, and B. Cohen. Local search strategies for satisfiability testing. In Cliques,
Coloring, and Satisfiability: Second DIMACS Implementation Challenge, pages 521–532.
American Mathematical Society, 1996. 5, 106
F. Sha and F. Pereira. Shallow parsing with conditional random fields. In Proc. HLT-NAACL-
03, pages 134–141, 2003. 4, 16, 39, 43, 61, 65, 66, 129
Y. Shang and B. Wah. Discrete lagrangian-based search for solving max-sat problems. In
Proc. of IJCAI, pages 378–383. Morgan Kaufmann Publishers, San Francisco, CA, USA,
1997. 105
E. Shapiro. Algorithmic Program Debugging. MIT Press, 1983. 19
N. Shental, A. Zomet, T. Hertz, and Y. Weiss. Learning and inferring image segmentations
using the gbp typical cut algorithm. In Proc. ICCV, 2003. 16
J. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain.
Technical report, School of Computer Science, Carnegie Mellon University, 1994. Technical
Report CMU-CS-94-125. 42, 127
P. Singla and P. Domingos. Markov logic in infinite domains. In Proc. 23rd UAI, pages 368–
375. AUAI Press, 2007. 2, 29
P. Singla and P. Domingos. Lifted first-order belief propagation. In Twenty-Third National
Conference on Artificial Intelligence, Chicago, IL. AAAI Press, 2008. 45
P. Singla and P. Domingos. Discriminative training of markov logic networks. In Proc. 20th
Nat’l Conf. on AI, (AAAI), pages 868–873. AAAI Press, 2005. 3, 4, 5, 6, 42, 43, 44, 55, 65,
74, 75, 76, 104, 126, 129
P. Singla and P. Domingos. Memory-efficient inference in relational domains. In Proc. 21st
Nat’l Conf. on AI, (AAAI), pages 488–493. AAAI Press, 2006a. 44, 67, 106
P. Singla and P. Domingos. Entity resolution with markov logic. In Proc. ICDM-2006, pages
572–582. IEEE Computer Society Press, 2006b. 56, 73, 76, 77, 95, 96
K. Smyth, H. Hoos, and T. Stützle. Iterated robust tabu search for max-sat. In Canadian
Conference on AI, pages 129–144, 2003. 104, 105, 108, 110, 119
A. Srinivasan. The Aleph Manual. Available at http://www.comlab.ox.ac.uk/oucl/research/
areas/machlearn/Aleph/. 18
A. Stolcke and S. Omohundro. Hidden markov model induction by bayesian model merging.
In Advances in Neural Information Processing Systems, volume 5, 1993. 22
C. Sutton and A. McCallum. Piecewise training of undirected models. In Proc. UAI, 2005a.
16
C. Sutton and A. McCallum. Piecewise training for undirected models. In UAI, pages 568–
575, 2005b. 138
C. Sutton and A. McCallum. Piecewise pseudolikelihood for efficient training of conditional
random fields. In ICML, pages 863–870, 2007. 138
E.D. Taillard. Robust taboo search for the quadratic assignment problem. Parallel Computing,
17:443–455, 1991. 5, 103, 106
B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data.
In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, pages
485–492. Edmonton, Canada: Morgan Kaufmann, 2002. 2, 16, 23, 27
B. Taskar, M. F. Wong, P. Abbeel, and D. Koller. Link prediction in relational data. In Proc. of
Neural Information Processing Systems Conference. Vancouver, Canada, December 2003.
55, 74
J. Vennekens, S. Verbaeten, and M. Bruynooghe. Logic programs with annotated disjunctions.
In Proc. of the 20th International Conference on Logic Programming. Springer, 2004. 28
S. V. N. Vishwanathan, N. N. Schraudolph, M. W. Schmidt, and K. P. Murphy. Accelerated
training of conditional random fields with stochastic gradient methods. In ICML06, pages
969–976, 2006. 16
W. Wei, J. Erenrich, and B. Selman. Towards efficient sampling: Exploiting random walk
strategies. In Proc. 19th Nat’l Conf. on AI, (AAAI), 2004. 5, 46, 118, 120
W. Wei and B. Selman. A new approach to model counting. In SAT, pages 324–339, 2005.
138
M. P. Wellman, J. S. Breese, and R. P. Goldman. From knowledge bases to decision models.
Knowledge Engineering Review, 7, 1992. 1, 2, 22
Z. Wu and B.W. Wah. Trap escaping strategies in discrete lagrangian methods for solving hard
satisfiability and maximum satisfiability problems. In Proc. of AAAI, pages 673–678. MIT
Press, 1999. 105
M. Yagiura and T. Ibaraki. Efficient 2 and 3-flip neighborhood search algorithms for the max
sat: experimental evaluation. Journal of Heuristics, 7(5):423–442, 2001. 105
J. Yedidia, W. Freeman, and Y. Weiss. Constructing free-energy approximations and gener-
alized belief propagation algorithms. IEEE Transactions on Information Theory, 51:2282–
2312, 2005. 16
J. S. Yedidia, W. T. Freeman, and Y. Weiss. Generalized belief propagation. In NIPS, 2001.
118
F. Zelezny, A. Srinivasan, and D. Page. Randomised restarted search in ilp. Machine Learning,
64(1–3):183–208, 2006. 63, 84, 101