arXiv:1901.02522v3 [cs.CL] 1 May 2020




On the Possibilities and Limitations of Multi-hop Reasoning Under Linguistic Imperfections

Daniel Khashabi¹* Erfan Sadeqi Azer² Tushar Khot¹ Ashish Sabharwal¹ Dan Roth³

¹Allen Institute for AI, Seattle, USA
²Indiana University, Bloomington, USA
³University of Pennsylvania, Philadelphia, USA

Abstract

Systems for language understanding have become remarkably strong at overcoming linguistic imperfections in tasks involving phrase matching or simple reasoning. Yet, their accuracy drops dramatically as the number of reasoning steps increases. We present the first formal framework to study such empirical observations. It allows one to quantify the amount and effect of ambiguity, redundancy, incompleteness, and inaccuracy that the use of language introduces when representing a hidden conceptual space. The idea is to consider two interrelated spaces: a conceptual meaning space that is unambiguous and complete but hidden, and a linguistic space that captures a noisy grounding of the meaning space in the words of a language—the level at which all systems, whether neural or symbolic, operate. Applying this framework to a special class of multi-hop reasoning, namely the connectivity problem in graphs of relationships between concepts, we derive rigorous intuitions and impossibility results even under this simplified setting. For instance, if a query requires a moderately large (logarithmic) number of hops in the meaning graph, no reasoning system operating over a noisy graph grounded in language is likely to correctly answer it. This highlights a fundamental barrier that extends to a broader class of reasoning problems and systems, and suggests an alternative path forward: focusing on aligning the two spaces via richer representations, before investing in reasoning with many hops.

1 Introduction

Reasoning can be viewed as the process of combining facts and beliefs, in order to infer new conclusions (Johnson-Laird, 1980). While there is a rich

* Work was done when the first author was affiliated with the University of Pennsylvania.

Figure 1: The interface between meanings and words capturing linguistic imperfections: each meaning (top) can be uttered in many ways as words (bottom), and the same word can have multiple meanings.

literature on reasoning, there is limited understanding of it in the context of natural language, particularly in the presence of linguistic noise. There remains a substantial gap between the strong empirical performance and little theoretical understanding of models for common language tasks, such as question answering, reading comprehension, and textual entailment. We focus on multi-hop or multi-step reasoning over text, where one must take multiple steps and combine multiple pieces of information to arrive at the correct conclusion. Remarkably, the accuracy of even the best multi-hop models is empirically known to drop drastically as the number of required hops increases (Yang et al., 2018; Mihaylov et al., 2018). This is generally viewed as a consequence of linguistic imperfections, such as those illustrated in Figure 1.

We introduce a novel formalism that quantifies



these imperfections and makes it possible to theoretically analyze a subclass of multi-hop reasoning algorithms. Our results provide formal support for existing empirical intuitions about the limitations of reasoning with inherently noisy language. Importantly, our formalism can be applied to both neural as well as symbolic systems, as long as they operate on natural language input.

The formalism consists of (A) an abstract model of linguistic knowledge, and (B) a reasoning model. The abstract model is built around the notion of two spaces (see Figure 1): a hidden meaning space of unambiguous concepts and an associated linguistic space capturing their imperfect grounding in words. The reasoning model captures the ability to infer a property of interest in the meaning space (e.g., the relationship between two concepts) while observing only its imperfect representation in the linguistic space.

The interaction between the meaning space and the linguistic space can be seen as simulating and quantifying the symbol-grounding problem (Harnad, 1990), a key difficulty in operating over language that arises when one tries to accurately map words into their underlying meaning representation. Practitioners often address this challenge by enriching their representations; for example, by mapping textual information to Wikipedia entries (Mihalcea and Csomai, 2007; Ratinov et al., 2011), grounding text to executable rules via semantic parsing (Reddy et al., 2017), or incorporating contexts into representations (Peters et al., 2018; Devlin et al., 2019). Our results indicate that advances in higher fidelity representations are key to making progress in multi-hop reasoning.

There are many flavors of multi-hop reasoning, such as finding a particular path of labeled edges through the linguistic space (e.g., textual multi-hop reasoning) or performing discrete operations over a set of such paths (e.g., semantic parsing). In this first study, we explore a common primitive shared by various tasks, namely, the connectivity problem: Can we determine whether there is a path of length k between a pair of nodes in the meaning graph, while observing only its noisy grounding as the linguistic graph? This simplification clarifies the exposition and analysis; we expect similar impossibility results, as the ones we derive, to hold for a broader class of tasks that rely on connectivity.

We note that even this simplified setting by itself captures interesting cases. Consider the task of inferring the Is-a relationship, by reasoning over a noisy taxonomic graph extracted from language. For instance, is it true that “a chip” Is-a “a nutrient”? When these elements cannot be directly looked up in a curated resource such as WordNet (Miller, 1995) (e.g., in low-resource languages, for scientific terms, etc.), recognizing the Is-a relationship between “a chip” and “a nutrient” would require finding a path in the linguistic space connecting “a chip” to “a nutrient”, while addressing the associated word-sense issues. Is-a edges would need to be extracted from text, resulting in potentially missing and/or extra edges. This corresponds precisely to the linguistic inaccuracies modeled and quantified in our framework.

Contributions. This is the first mathematical study of the challenges and limitations of reasoning algorithms in the presence of the symbol-meaning mapping difficulty. Our main contributions are:

1. Formal Framework. We establish a novel, linguistically motivated formal framework for analyzing the problem of reasoning about the ground truth (the meaning space) while operating over its noisy linguistic representation. This allows one to derive rigorous intuitions about what various classes of reasoning algorithms can and cannot achieve.

2. Impossibility Results. We study in detail the connectivity reasoning problem, focusing on the interplay between linguistic noise (redundancy, ambiguity, incompleteness, and inaccuracy) and the distance (inference steps, or hops) between two concepts in the meaning space. We prove that under low noise, reliable connectivity reasoning is indeed possible up to a few hops (Theorem 1). In contrast, even a moderate increase in the noise level makes it provably impossible to assess the connectivity of concepts if they are a logarithmic distance apart in terms of the meaning graph nodes (Theorems 2 and 3). This helps us understand empirical observations of “semantic drift” in systems, causing a substantial performance drop beyond 2-3 hops (Fried et al., 2015; Jansen, 2016). We discuss practical lessons from our findings, such as focusing on richer representations and higher-quality abstractions to reduce the number of hops.

3. Empirical Illustration. While our main contribution is theoretical, we use the experiments to illustrate the implications of our results in a controlled setting. Specifically, we apply the framework to a subset of a real-world knowledge-base, FB15k-237 (Toutanova and Chen, 2015), treated as


the meaning graph, illustrating how key parameters of our analytical model influence the possibility (or impossibility) of accurately solving the connectivity problem.

2 Overview of the Formalism

(A) Linguistically-inspired abstract model. We propose an abstract model of linguistic knowledge built around the notion of two spaces (see Figures 1 and 2). The meaning space refers to the internal conceptualization in the human mind, where we assume the information is free of noise and uncertainty. In contrast to human thinking that happens in this noise-free space, human expression of thought via the utterance of language introduces many imperfections, and happens in the linguistic space. This linguistic space is often redundant (e.g., multiple words¹ such as “CPU” and “computer processor” express the same meaning), ambiguous (e.g., a word like “chips” could refer to multiple meanings), incomplete (e.g., common-sense relations never expressed in text), and inaccurate (e.g., incorrect facts written down in text). Importantly, the noisy linguistic space—with its redundancy, ambiguity, incompleteness, and inaccuracy—is what a machine reasoning algorithm operates in.

(B) Reasoning model. For the purposes of this work, we define reasoning as the ability to infer the existence or absence of a property of interest in the meaning space, by only observing its imperfect representation in the linguistic space. Specifically, we focus on properties that can be captured by graphs, namely the meaning graph which connects concepts via semantic relationships, and the (noisy) linguistic graph which connects words via language. Nodes in the linguistic graph may represent words in various ways, such as using symbols (Miller, 1995), fixed vectors (Mikolov et al., 2013), or even words in context as captured by contextual vectors (Peters et al., 2018; Devlin et al., 2019).

The target property in the meaning graph characterizes the reasoning task. E.g., relation extraction corresponds to determining the relationship between two meaning graph nodes, by observing their grounding in text as linguistic graph nodes.

¹For simplicity of exposition, we use the term ‘words’ throughout this paper as the unit of information in the linguistic space. Often, the unit of information is instead a short phrase. Our formalism continues to apply to this case as well.

Word sense disambiguation (WSD) corresponds to identifying a meaning graph node given a word (a node in the linguistic graph) and its surrounding context. Question-answering may ask for a linguistic node that has a desired semantic relationship (when mapped to the meaning space) to another given node (e.g., what is the capital of X?).

2.1 Example

Figure 2 illustrates a reasoning setting that includes edge semantics. Most humans understand that V1:“present day spoons” and V2:“the metal spoons” are equivalent nodes (have the same meaning) for the purposes of the query, “is a metal spoon a good conductor of heat?”. However, a machine must infer this. In the linguistic space (left), the semantics of the connections between nodes are expressed through natural language sentences, which also provide a context for the nodes. For example, edges in the meaning graph (right) directly express the semantic relation has-property(metal, thermal-conductor), while a machine, operating on language, may struggle to infer this from reading Internet text expressed in various ways, e.g., “dense materials such as [V3:]metals and stones are [V5:]good conductors of heat”.

To ground this in existing efforts, consider multi-hop reasoning for QA systems (Khashabi et al., 2016; Jansen et al., 2018). Here the reasoning task is to connect local information, via multiple local “hops”, in order to arrive at a conclusion. In the meaning graph, one can trace a path of locally connected nodes to verify the correctness of a query; for example, the query has-property(metal-spoon, thermal-conductor) can be verified by tracing a sequence of nodes in Fig. 2. Thus, answering some queries can be cast as inferring the existence of a path connecting two nodes.² While doing so on the meaning graph is straightforward, reliably doing so on the noisy linguistic graph is not. Intuitively, each local “hop” introduces additional noise, allowing reliable inference to be performed only when it does not require too many steps in the underlying meaning graph. To study this issue, our approach quantifies the effect of noise accumulation for long-range reasoning.

²This particular grounding is meant to help relate our graph-based formalism to existing applications, and is not the only way of realizing reasoning on graphs.


Figure 2: The meaning graph (right) captures a clean and unique (canonical) representation of concepts and facts, while the linguistic graph (left) contains a noisy, incomplete, and redundant representation of the same knowledge. Shown here are samples of meaning graph nodes and the corresponding linguistic graph nodes for answering the question: Is a metal spoon a good conductor of heat?. Relationships in the linguistic graph are here indirectly represented via sentences.

3 Related Work

AI literature contains a variety of formalisms for automated reasoning, including reasoning with logical representations (McCarthy, 1963), semantic networks (Quillan, 1966), Bayesian networks (Pearl, 1988), etc. For our analysis, we use a graphical formalism since it provides a useful modeling tool and is a commonly used abstraction for representing items and relations between them.

One major challenge in reasoning with natural language tasks is grounding free-form text to a higher-level meaning. Example proposals to deal with this challenge include semantic parsing (Steedman and Baldridge, 2011), linking to a knowledge base (Mihalcea and Csomai, 2007), and mapping to semantic frames (Punyakanok et al., 2004). These methods can be viewed as approximate solutions to the symbol-grounding problem (Harnad, 1990; Taddeo and Floridi, 2005) — the challenge of mapping symbols to a representation that reflects their accurate meaning. Our formalism is an effort to better understand the fundamental difficulties that this problem poses.

Several efforts focus on reasoning with disambiguated inputs, such as using executable formulas (Reddy et al., 2017; Angeli and Manning, 2014) and chaining relations to infer new relations (Socher et al., 2013). Our analysis covers any algorithm for inferring patterns that can be represented in a graphical form, e.g., chaining local information, often referred to as multi-hop reasoning (Jansen et al., 2016; Lin et al., 2018). For example, Jansen et al. (2017) propose a structured multi-hop reasoning approach by aggregating sentential information from multiple KBs. They demonstrate improvements with few hops and degradation when aggregating more than 2-3 sentences. Our theoretical findings align well with such observations and further quantify them.

4 The Meaning-Language Interface

We formalize our conceptual and linguistic notions of knowledge spaces as undirected graphs:

• The meaning space, M, is a conceptual hidden space where all the facts are accurate and complete, without ambiguity. We focus on the knowledge in this space that can be represented as an undirected graph, GM(VM, EM). This knowledge is hidden, and representative of the information that exists within human minds.

• The linguistic space, L, is the space of knowledge represented in natural language for machine consumption (written sentences, curated knowledge-bases, etc.). We assume access to a graph GL(VL, EL) in this space that is a noisy approximation of GM.

The two spaces interact: when we read a sentence, we are reading from the linguistic space and interpreting it in the meaning space. When writing out our thoughts, we symbolize our thought process, by moving them from the meaning space to the linguistic space. Fig. 1 provides a high-level view of this interaction.

A reasoning system operates in the linguistic space and is unaware of the exact structure and information encoded in GM. How well it performs depends on the local connectivity of GM and the level of noise in GL, which is discussed next.


Algorithm 1: Generative, stochastic construction of a linguistic graph GL given a meaning graph GM.

Input: Meaning graph GM(VM, EM), discrete distribution r(λ), edge retention prob. p+, edge creation prob. p−
Output: Linguistic graph GL(VL, EL)

foreach v ∈ VM do
    sample k ∼ r(λ)
    construct a collection of new nodes U s.t. |U| = k
    VL ← VL ∪ U;  O(v) ← U

foreach (m1, m2) ∈ VM × VM, m1 ≠ m2 do
    W1 ← O(m1);  W2 ← O(m2)
    foreach e ∈ W1 × W2 do
        if (m1, m2) ∈ EM then
            with probability p+: EL ← EL ∪ {e}
        else
            with probability p−: EL ← EL ∪ {e}
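As a concrete illustration, the sampling process of Alg. 1 can be sketched in a few lines of Python. This is our rendering, not the authors' code; names such as `sample_linguistic_graph` are ours, and the replication factor is fixed to a constant `lam` rather than drawn from a general r(λ).

```python
import itertools
import random

def sample_linguistic_graph(meaning_nodes, meaning_edges, lam, p_plus, p_minus, seed=0):
    """Sketch of Alg. 1: sample a noisy linguistic graph G_L from a meaning graph G_M.

    Each meaning node m is replicated into lam linguistic nodes (m, 0), ..., (m, lam-1),
    modeling redundancy; edges of G_M are retained with prob. p_plus (incompleteness),
    and spurious edges are created with prob. p_minus (inaccuracy).
    """
    rng = random.Random(seed)
    oracle = {m: [(m, i) for i in range(lam)] for m in meaning_nodes}  # O(m)
    linguistic_nodes = [w for copies in oracle.values() for w in copies]

    meaning_edge_set = {frozenset(e) for e in meaning_edges}
    linguistic_edges = set()
    for m1, m2 in itertools.combinations(meaning_nodes, 2):
        # retention vs. spurious-creation probability for this meaning-node pair
        p = p_plus if frozenset((m1, m2)) in meaning_edge_set else p_minus
        for w1 in oracle[m1]:
            for w2 in oracle[m2]:
                if rng.random() < p:
                    linguistic_edges.add(frozenset((w1, w2)))
    return linguistic_nodes, linguistic_edges, oracle
```

In the noise-free extreme (lam = 1, p_plus = 1, p_minus = 0), GL is an exact copy of GM, matching the limiting case discussed in the text.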

4.1 Meaning-to-Language mapping

Let O : VM → 2^VL denote an oracle function that captures the set of nodes in GL that each (hidden) node in GM should ideally map to in a noise-free setting. This represents multiple ways in which a single meaning can be equivalently represented in language. When w ∈ O(m), i.e., w is one of the words the oracle maps a meaning m to, we write O⁻¹(w) = m to denote the reverse mapping. For example, if CHIP represents a node in the meaning graph for the common snack food, then O(CHIP) = {“potato chips”, “crisps”, ...}.

Ideally, if two nodes m1 and m2 in VM have an edge in GM, we would want an edge between every pair of nodes in O(m1) × O(m2) in GL. The actual graph GL (discussed next) will be a noisy approximation of this ideal case.

4.2 Generative Model of Linguistic Graphs

Linguistic graphs in our framework are constructed via a generative process. Starting with GM, we sample GL ← ALG(GM) using a stochastic process described in Alg. 1. Informally, the algorithm simulates the process of transforming conceptual information into linguistic utterances (web-pages, conversations, knowledge bases, etc.). This construction captures a few key properties of linguistic representation of meaning via 3 control parameters: replication factor λ, edge retention probability p+, and spurious edge creation probability p−.

Each node in GM is mapped to multiple linguistic nodes (the exact number drawn from a distribution r(λ), parameterized by λ), which models redundancy in GL. Incompleteness of knowledge is modeled by not having all edges of GM be retained in GL (controlled by p+). Further, GL contains spurious edges that do not correspond to any edge in GM, representing inaccuracy (controlled by p−). The extreme case of noise-free construction of GL corresponds to having r(λ) be concentrated at 1, p+ = 1, and p− = 0.

4.3 Noisy Similarity Metric

Additionally, we include linguistic similarity based edges to model ambiguity, i.e., a single word mapping to multiple meanings. Specifically, we view ambiguity as treating (or confusing) two distinct linguistic nodes for the same word as identical even when they correspond to different meaning nodes (e.g., confusing a “bat” node for animals with a “bat” node for sports).

Similarity metrics are typically used to judge the equivalence of words, with or without context. Let ρ : VL × VL → {0, 1} be such a metric, where ρ(w, w′) = 1 for w, w′ ∈ VL denotes the equivalence of these two nodes in GL. We define ρ to be a noisy version of the true node similarity as determined by the oracle O:

    ρ(w, w′) ≜ { 1 − Bern(ε+)   if O⁻¹(w) = O⁻¹(w′);   Bern(ε−)   otherwise }

where ε+, ε− ∈ (0, 1) are the noise parameters of ρ, both typically close to zero, and Bern(p) denotes the Bernoulli distribution with parameter p.

Intuitively, ρ is a perturbed version of the true similarity (as defined by O), with small random noise (parameterized by ε+ and ε−). With a high probability 1 − ε+/−, ρ returns the correct similarity decision as determined by the oracle (i.e., whether two words have the same meaning). The extreme case of ε+ = ε− = 0 models the perfect similarity metric. In practice, even the best similarity systems are noisy, captured here with ε+/− > 0.

We assume reasoning algorithms have access to GL and ρ, and that they use the following procedure to verify the existence of a direct connection between two nodes:

function NODEPAIRCONNECTIVITY(w, w′):
    return (w, w′) ∈ EL or ρ(w, w′) = 1
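A minimal Python sketch of this noisy metric and the connectivity check (all names here are ours), assuming a lookup table for the inverse oracle O⁻¹:

```python
import random

def make_noisy_similarity(oracle_inv, eps_plus, eps_minus, seed=0):
    """Sketch of the noisy similarity metric rho.

    oracle_inv maps a linguistic node w to its meaning node O^-1(w).
    With prob. eps_plus, rho misses a true equivalence (returns 0);
    with prob. eps_minus, it hallucinates one (returns 1).
    """
    rng = random.Random(seed)

    def rho(w1, w2):
        if oracle_inv[w1] == oracle_inv[w2]:
            return 0 if rng.random() < eps_plus else 1   # 1 - Bern(eps_plus)
        return 1 if rng.random() < eps_minus else 0      # Bern(eps_minus)

    return rho

def node_pair_connectivity(w1, w2, linguistic_edges, rho):
    """NODEPAIRCONNECTIVITY: a direct edge in G_L, or a positive similarity verdict."""
    return frozenset((w1, w2)) in linguistic_edges or rho(w1, w2) == 1
```

With eps_plus = eps_minus = 0 this reduces to the perfect similarity metric described above.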

Several corner cases result in uninteresting meaning or linguistic graphs. We focus on a regime of “non-trivial” cases where GM is neither overly-connected nor overly-sparse, there is non-zero noise (p−, ε−, ε+ > 0) and incomplete information (p+ < 1), and noisy content does not dominate actual information (e.g., p− ≪ p+).³ Henceforth, we consider only such “non-trivial” instances.

Throughout, we will also assume that the replication factor (i.e., the number of linguistic nodes corresponding to each meaning node) is a constant, i.e., r is such that P[|U| = λ] = 1.

5 Hypothesis Testing Setup: Reasoning About Meaning via Words

While a reasoning engine only sees the linguistic graph GL, it must make inferences about the latent meaning graph GM. Given a pair of nodes vL := {w, w′} ⊂ VL, the reasoning algorithm must then predict properties about the corresponding nodes vM = {m, m′} = {O⁻¹(w), O⁻¹(w′)}.

We use a hypothesis testing setup to assess the likelihood of two disjoint hypotheses defined over GM, denoted H¹M(vM) and H²M(vM) (e.g., whether vM are connected in GM or not). Given observations XL(vL) about linguistic nodes (e.g., whether vL are connected in GL), we define the goal of a reasoning algorithm as inferring which of the two hypotheses about GM is more likely to have resulted in these observations, under the sampling process (Alg. 1). That is, we are interested in:

    argmax over h ∈ {H¹M(vM), H²M(vM)} of P(h)[XL(vL)]    (1)

where P(h)[x] denotes the probability of an event x in the sample space induced by Alg. 1, when (hidden) GM satisfies hypothesis h. Defn. 6 in the Appendix formalizes this.

Since we start with two disjoint hypotheses on GM, the resulting probability spaces are generally different, making it plausible to identify the correct hypothesis with high confidence. However, with sufficient noise in the sampling process, it can be difficult for an algorithm based on the linguistic graph to distinguish the two resulting probability spaces (corresponding to the two hypotheses), depending on the observations XL(vL) used by the algorithm and the parameters of the sampling process. For example, the distance between linguistic nodes can often be an insufficient indicator for distinguishing these two hypotheses. We will explore these two contrasting behaviors in the next section.

Not all observations are equally effective in distinguishing h1 from h2. We say XL(vL) γ-separates them if:

    P(h1)[XL(vL)] − P(h2)[XL(vL)] ≥ γ.    (2)

³Defn. 5 in the Appendix gives a precise characterization.

(formal definition in Appendix, Defn. 7) We can view γ as the gap between the likelihoods of XL(vL) having originated from a meaning graph satisfying h1 vs. one satisfying h2. When γ = 1, XL(vL) is a perfect discriminator for h1 and h2. In general, any positive γ bounded away from 1 yields a valuable observation,⁴ and a reasoning algorithm:

function SEPARATOR_XL(GL, vL = {w, w′}):
    if XL(vL) = 1 then return h1 else return h2

Importantly, this algorithm does not compute the probabilities in Eqs. (1) and (2). Rather, it works with a particular instantiation GL. We refer to such an algorithm A as γ-accurate for h1 and h2 if, under the sampling choices of Alg. 1, it outputs the ‘correct’ hypothesis with probability at least γ; that is, for both i ∈ {1, 2}: P(hi)[A outputs hi] ≥ γ. This happens when XL γ-separates h1 and h2 (cf. Appendix, Prop. 8). The rest of the work will explore when one can obtain a γ-accurate algorithm, using γ-separation of the underlying observation as an analysis tool.
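For intuition, the γ gap of Eq. (2) can be estimated by simulation in a toy setting of our own construction (λ = 1), where the observation XL is “the two linguistic nodes share an edge in GL”: under h1 the meaning nodes are adjacent, so the edge survives with probability p+; under h2 they are disconnected, so an edge appears only spuriously with probability p−, giving γ ≈ p+ − p−.

```python
import random

def estimate_separation(p_plus, p_minus, trials=20000, seed=0):
    """Monte-Carlo estimate of the gamma-separation (Eq. 2) of the observation
    X_L = 'w and w' share an edge in G_L', in the lam = 1 toy case:
    under h1 the edge is retained w.p. p_plus, under h2 a spurious edge
    is created w.p. p_minus; the gap approaches p_plus - p_minus.
    """
    rng = random.Random(seed)
    p_h1 = sum(rng.random() < p_plus for _ in range(trials)) / trials
    p_h2 = sum(rng.random() < p_minus for _ in range(trials)) / trials
    return p_h1 - p_h2
```

The SEPARATOR rule above then outputs h1 exactly when the edge is observed, and should be γ-accurate for this γ, up to the sampling error of the estimate.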

6 Main Results (Informal)

Having formalized a model of the meaning-language interface above, we now present our main findings. The results in this section are intentionally stated in a somewhat informal manner for ease of exposition and to help convey the main messages clearly. Formal mathematical statements and supporting proofs are deferred to the Appendix.⁵

One simple but often effective approach for reasoning is to focus on connectivity (as described in Fig. 2). Specifically, we consider reasoning chains as valid if they correspond to a short path in the meaning space, and invalid if they correspond to disconnected nodes. Mathematically, this translates into the d-connectivity reasoning problem, defined as follows: Given access to a linguistic graph GL and two nodes w, w′ in it, let m = O⁻¹(w) and m′ = O⁻¹(w′) denote the corresponding nodes in the (hidden) meaning graph GM. While observing only GL, can we distinguish between two hypotheses about GM, namely, m, m′ have a path of length d in GM, vs. m, m′ are disconnected? Distinguishing between these two hypotheses, h1 = “m and m′ are connected by a path of length d in GM” and h2 = “m and m′ are disconnected in GM”, is the problem we are interested in solving.

⁴If the above probability gap is negative, one can instead use the complement of XL(vL) for γ-separation.

⁵While provided for completeness and to facilitate follow-up work, details in the Appendix are not crucial for understanding the key concepts and results of this work.

The answer clearly depends on how faithfully GL reflects GM. In the (unrealistic) noise-free case of GL = GM, this problem is trivially solvable by computing the shortest path between w and w′ in GL. More realistically, the higher the noise level is in Alg. 1 and NODEPAIRCONNECTIVITY, the harder it will be for any algorithm operating on GL to confidently infer a property of GM.

In other words, the redundancy, ambiguity, incompleteness, and inaccuracy of language discussed earlier directly impact the capabilities and limitations of algorithms that perform connectivity reasoning. Our goal is to quantify this intuition using the parameters p+, p−, λ, ε+, ε− of our generative model, and derive possibility and impossibility results.

A simple connectivity testing algorithm. The first set of results assumes the following algorithm: given w, w′ with a desired distance d in GM, it checks whether w, w′ have a path of length at most d̃, which is a function of d and the replication factor λ. If yes, it declares the corresponding meaning nodes m, m′ in GM to have a path of length d; otherwise it declares them disconnected.⁶
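The check itself is just a bounded breadth-first search over GL. A Python sketch (function names are ours), using the threshold d̃ = λ·d + λ − 1 that appears in Theorem 1:

```python
def within_distance(adj, src, dst, max_hops):
    """Breadth-first check: is there a path of length <= max_hops from src to dst?"""
    if src == dst:
        return True
    frontier, seen = {src}, {src}
    for _ in range(max_hops):
        nxt = set()
        for u in frontier:
            for v in adj.get(u, ()):
                if v == dst:
                    return True
                if v not in seen:
                    seen.add(v)
                    nxt.add(v)
        if not nxt:
            return False
        frontier = nxt
    return False

def simple_connectivity_test(adj, w1, w2, d, lam):
    """Declare m, m' d-connected iff w1, w2 are within lam*d + lam - 1 hops in G_L
    (constant replication factor lam); otherwise declare them disconnected."""
    return within_distance(adj, w1, w2, lam * d + lam - 1)
```

Here `adj` is an adjacency map of GL; the intuition behind the threshold is that each meaning-space hop can cost up to λ linguistic hops once redundancy is accounted for.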

Let a ⊕ b denote a + b − ab, and B(d) denote the maximum number of nodes within distance d of any node in GM. The following result shows that even the simple algorithm above guarantees accurate reasoning (specifically, it correctly infers connectivity in GM while only observing GL) for limited noise and small d (i.e., few hops):

Theorem 1 (Possibility Result; Informal). If p−, ε−, and d are small enough such that (p− ⊕ ε−) · B²(d) < 1/(2eλ²n), then there exists a γ ∈ [0, 1] that increases with p+, λ and decreases with ε+,ᵃ such that the simple connectivity testing algorithm with d̃ = λ·d + λ − 1 correctly solves the d-connectivity problem with probability at least γ.

ᵃSee Defn. 10 in the formal results for the exact expression for γ.

Here, the probability is over the sampling choices of Alg. 1 when constructing GL, and the function NODEPAIRCONNECTIVITY for determining node similarity in GL.

⁶The simple connectivity algorithm is described formally using the notion of a SEPARATOR in Appendix B.1.

On the other hand, when there is more noise and d becomes even moderately large—specifically, logarithmic in the number of nodes in GM—then this simple algorithm no longer works, even for small values of the desired accuracy γ:

Theorem 2 (Impossibility Result #1; Informal). If p− and ε− are large enough such that p− ⊕ ε− ≥ c/(λn) and d ∈ Ω(log n), where n is the number of nodes in GM, then the simple connectivity algorithm with d̃ = λd cannot solve the d-connectivity problem correctly with probability γ, for any γ > 0.

This result exposes an inherent limitation to multi-hop reasoning: even for small values of noise, the diameter of GL can quickly become very small, namely, logarithmic in n (similar to the small-world phenomenon observed by Milgram (1967) in other contexts), at which point the above impossibility result kicks in. Our result affirms that if NLP reasoning algorithms are not designed carefully, such macro behaviors will necessarily become bottlenecks, even for relatively simple tasks such as detecting connectivity.

The above result is for the simple connectivity algorithm. One can imagine other ways to determine connectivity in GM, such as analyzing the degree distribution of GL, looking at its clustering structure, etc. Our third finding extends the impossibility result to this general setting, showing that if the noise level is increased a little more (by a logarithmic factor), then no algorithm can infer connectivity in GM by observing only GL:

Theorem 3 (Impossibility Result #2; Informal). If p− and ε− are large enough such that p− ⊕ ε− > (c log n)/(λn) and d > log n, where n is the number of nodes in GM, then no algorithm can correctly solve the d-connectivity problem with probability γ, for any γ > 0.

This reveals a fundamental limitation: we may only be able to infer interesting properties of GM within small, logarithmic-sized neighborhoods. We leave the formal counterparts of these results (Theorems 11, 12, and 13, resp.) to Appendix B, and focus next on a small-scale empirical validation to complement our analytical findings.


Page 8: arXiv:1901.02522v3 [cs.CL] 1 May 2020

Figure 3: Colors in the figure depict the average distance between node pairs in GL for each true distance d (x-axis) in GM, as the noise parameter p− (y-axis) is varied. The goal is to distinguish squares in the column for a particular d from the corresponding squares in the right-most column, which corresponds to node pairs being disconnected. This is easy in the bottom-left regime and becomes progressively harder as we move upward (more noise) or rightward (higher distance in GM). (ε+ = 0.7, λ = 3)

7 Empirical Validation

Our formal analysis thus far offers worst-case bounds for two regions in a fairly large spectrum of noisy sampling parameters for the linguistic space, namely, when p− ⊕ ε− and d are either both small (Theorem 1) or both large (Theorems 2 and 3).

This section complements our theoretical findings in two ways: (a) by grounding the formalism empirically in a real-world knowledge graph, and (b) by quantifying the impact of the sampling parameters on the connectivity algorithm. We use ε− = 0 for the experiments, but the effect is identical to using ε− > 0 with appropriate adjustments of p− and p+ (see Remark 9 in the Appendix).

We consider FB15k-237 (Toutanova and Chen, 2015), a set of 〈head, relation, target〉 triples from FreeBase (Bollacker et al., 2008). For scalability, we use the movies domain subset (relations /film/*), with 2855 entity nodes and 4682 relation edges. We treat this as the meaning graph GM and sample a linguistic graph GL (via Alg. 1) to simulate the observed graph derived from text.
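Algorithm 1 itself is not reproduced in this section. The following is a hedged sketch of a sampling process consistent with the quantities used in our analysis (Corollary 15 and Lemma 16): each meaning node is replicated into λ linguistic nodes, within-cluster similarity edges are kept with probability 1 − ε+, each meaning edge induces λ² potential cross-cluster edges each kept with probability p+, and spurious cross-cluster edges appear with probability p−. Function and parameter names are our own, not the paper's:

```python
import random

def sample_linguistic_graph(meaning_edges, n, lam, p_plus, p_minus, eps_plus, seed=0):
    # Hedged sketch of sampling G_L from G_M (NOT the paper's exact Alg. 1).
    # Meaning node m maps to linguistic nodes m*lam, ..., m*lam + lam - 1.
    rng = random.Random(seed)
    edge_set = {tuple(sorted(e)) for e in meaning_edges}
    edges = set()
    # Within-cluster similarity edges, each kept with probability 1 - eps_plus.
    for m in range(n):
        cluster = list(range(m * lam, (m + 1) * lam))
        for i, u in enumerate(cluster):
            for v in cluster[i + 1:]:
                if rng.random() < 1 - eps_plus:
                    edges.add((u, v))
    # Cross-cluster edges: true meaning edges survive with p_plus; spurious
    # (noise) edges between unrelated clusters appear with p_minus.
    for m1 in range(n):
        for m2 in range(m1 + 1, n):
            p = p_plus if (m1, m2) in edge_set else p_minus
            for u in range(m1 * lam, (m1 + 1) * lam):
                for v in range(m2 * lam, (m2 + 1) * lam):
                    if rng.random() < p:
                        edges.add((u, v))
    return edges

# Tiny meaning graph: a 3-node path 0-1-2, replicated with lam = 2.
GL = sample_linguistic_graph([(0, 1), (1, 2)], n=3, lam=2,
                             p_plus=0.9, p_minus=0.001, eps_plus=0.1)
```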

We sample GL for various values of p− and plot the resulting distances in GL and GM in Fig. 3, as follows. For every value of p− (y-axis), we sample pairs of points in GM that are separated by distance d (x-axis). For these pairs of points, we compute the average distance between the corresponding linguistic nodes (in the sampled GL), and plot that in the heat map using color shades.

We make two observations from this simulation. First, for lower values of p−, disconnected nodes in GM (rightmost column) are clearly distinguishable from meaning nodes with short paths (small d), as predicted by Theorem 11, but harder to distinguish from nodes at large distances (large d). Second, and in contrast, for higher values of p−, almost every pair of linguistic nodes is connected by a very short path (dark color), making it impossible for a distance-based reasoning algorithm to confidently assess d-connectivity in GM. This simulation also confirms our finding in Theorem 12: any graph with p− ≥ 1/(λn) (which is ∼ 0.0001 here) cannot distinguish disconnected meaning nodes from nodes with paths of short (logarithmic) length (top rows).

8 Implications and Practical Lessons

Our analysis is motivated by empirical observations of "semantic drift" of reasoning algorithms as the number of hops is increased. For example, Fried et al. (2015) show modest benefits up to 2-3 hops, and then decreasing performance; Jansen et al. (2018) make similar observations in graphs built out of larger structures such as sentences, where the performance drops off around 2 hops. This pattern has interestingly been observed under a variety of representations, including word-level input, graphs, and traversal methods. A natural question is whether the field might be hitting a fundamental limit on multi-hop information aggregation. Our "impossibility" results are reaffirmations of the empirical intuition in the field. This means that multi-hop inference (and any algorithm that can be cast in that form), as we've been approaching it, is exceptionally unlikely to breach the few-hop barrier predicted in our analysis.

There are at least two practical lessons. First, our results suggest that ongoing efforts on "very long" multi-hop reasoning, especially without a careful understanding of the limitations, are unlikely to succeed, unless some fundamental building blocks are altered. Second, this observation suggests that practitioners must focus on richer representations that allow reasoning with only a "few" hops. This, in part, requires higher-fidelity abstraction and grounding mechanisms. It also points to alternatives, such as offline KB completion/expansion mechanisms, which indirectly reduce the number of steps needed at inference time.

Our formal results quantify how the different linguistic imperfection parameters affect multi-hop reasoning in different ways. This informs the design choice of which noise parameter to control when creating a new representation.


9 Conclusion

This work is the first attempt to develop a formal framework for understanding the behavior of complex natural language reasoning in the presence of linguistic noise. The importance of this work is two-fold. First, it proposes a novel graph-theoretic paradigm for studying reasoning, inspired by the symbol-meaning problem in the presence of redundancy, ambiguity, incompleteness, and inaccuracy of language. Second, it shows how to use this framework to analyze a class of reasoning algorithms. We expect our findings, as well as those from future extensions to other classes of reasoning algorithms, to have important implications for how we study problems in language comprehension.

References

Gabor Angeli and Christopher D. Manning. 2014. NaturalLI: Natural logic inference for common sense reasoning. In EMNLP.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A collaboratively created graph database for structuring human knowledge. In SIGMOD, pages 1247–1250. ACM.

Fan Chung and Linyuan Lu. 2002. The average distances in random graphs with given expected degrees. Proceedings of the National Academy of Sciences, 99(25):15879–15882.

Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2009. Introduction to Algorithms. MIT Press.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.

Xue Ding, Tiefeng Jiang, et al. 2010. Spectral distributions of adjacency and Laplacian matrices of random graphs. The Annals of Applied Probability, 20(6):2086–2117.

Paul Erdos and Alfred Renyi. 1960. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci., 5(1):17–60.

Daniel Fried, Peter Jansen, Gustave Hahn-Powell, Mihai Surdeanu, and Peter Clark. 2015. Higher-order lexical semantic models for non-factoid answer reranking. TACL, 3.

Edgar N. Gilbert. 1959. Random graphs. The Annals of Mathematical Statistics, 30(4):1141–1144.

Stevan Harnad. 1990. The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1-3):335–346.

Peter Jansen, Niranjan Balasubramanian, Mihai Surdeanu, and Peter Clark. 2016. What's in an explanation? Characterizing knowledge and inference requirements for elementary science exams. In COLING, pages 2956–2965.

Peter Jansen, Rebecca Sharp, Mihai Surdeanu, and Peter Clark. 2017. Framing QA as building and ranking intersentence answer justifications. Computational Linguistics.

Peter A. Jansen. 2016. A study of automatically acquiring explanatory inference patterns from corpora of explanations: Lessons from elementary science exams. In AKBC.

Peter A. Jansen, Elizabeth Wainwright, Steven Marmorstein, and Clayton T. Morrison. 2018. WorldTree: A corpus of explanation graphs for elementary science questions supporting multi-hop inference. In LREC.

Philip Nicholas Johnson-Laird. 1980. Mental models in cognitive science. Cognitive Science, 4(1):71–115.

Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Peter Clark, Oren Etzioni, and Dan Roth. 2016. Question answering via integer programming over semi-structured knowledge. In IJCAI.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2017. Answering complex questions using open information extraction. In ACL.

Xi Victoria Lin, Richard Socher, and Caiming Xiong. 2018. Multi-hop knowledge graph reasoning with reward shaping. In EMNLP.

John McCarthy. 1963. Programs with common sense. Defense Technical Information Center.

Rada Mihalcea and Andras Csomai. 2007. Wikify!: Linking documents to encyclopedic knowledge. In CIKM, pages 233–242.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In EMNLP.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In ICLR.

Stanley Milgram. 1967. Six degrees of separation. Psychology Today, 2:60–64.

George Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

J. Pearl. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.


Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke S. Zettlemoyer. 2018. Deep contextualized word representations. In NAACL.

Vasin Punyakanok, Dan Roth, and Scott Yih. 2004. Mapping dependencies trees: An application to question answering. AIM.

M. Ross Quillian. 1966. Semantic memory. Technical report, Bolt Beranek and Newman Inc., Cambridge, MA.

Lev Ratinov, Dan Roth, Doug Downey, and Mike Anderson. 2011. Local and global algorithms for disambiguation to Wikipedia. In ACL.

Siva Reddy, Oscar Tackstrom, Slav Petrov, Mark Steedman, and Mirella Lapata. 2017. Universal semantic parsing. In EMNLP, pages 89–101.

Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Y. Ng. 2013. Reasoning with neural tensor networks for knowledge base completion. In NIPS.

Mark Steedman and Jason Baldridge. 2011. Combinatory categorial grammar. Non-Transformational Syntax: Formal and Explicit Models of Grammar, pages 181–224.

Mariarosaria Taddeo and Luciano Floridi. 2005. Solving the symbol grounding problem: A critical review of fifteen years of research. Journal of Experimental & Theoretical Artificial Intelligence, 17(4):419–445.

Kristina Toutanova and Danqi Chen. 2015. Observed versus latent features for knowledge base and text inference. In Workshop on Continuous Vector Space Models and their Compositionality (CVSC).

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In EMNLP.


Supplemental Material

To focus the main text on key findings and intuitions, we abstract away many technical details. Here instead we go into the formal proofs with the necessary mathematical rigor. We start with our requisite definitions (§A), followed by formal statements of our three results (§B), and their proofs (§C, §D, §E). We conclude the supplementary material with additional empirical results (§F).

A Definitions: Reasoning Problems, Hypothesis Testing Setup, and γ-Separation

Notation. Throughout, we follow the standard notation for asymptotic comparison of functions: O(.), o(.), Θ(.), Ω(.), and ω(.) (Cormen et al., 2009). X ∼ f(θ) denotes a random variable X distributed according to probability distribution f(θ), parameterized by θ. Bern(p) and Bin(n, p) denote the Bernoulli and Binomial distributions, respectively. Given random variables X ∼ Bern(p) and Y ∼ Bern(q), their disjunction X ∨ Y is distributed as Bern(p ⊕ q), where p ⊕ q := 1 − (1 − p)(1 − q) = p + q − pq. We make extensive use of this notation.
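The distributional identity behind ⊕ can be verified by exhaustively enumerating the four joint outcomes of two independent Bernoulli variables (a small sanity check of ours, not part of the formal development):

```python
def p_oplus(p, q):
    # p (+) q = 1 - (1 - p)(1 - q) = p + q - p*q
    return 1 - (1 - p) * (1 - q)

def disjunction_prob(p, q):
    # P[X or Y = 1] for independent X ~ Bern(p), Y ~ Bern(q),
    # computed exactly by enumerating the four joint outcomes.
    total = 0.0
    for x in (0, 1):
        for y in (0, 1):
            weight = (p if x else 1 - p) * (q if y else 1 - q)
            if x or y:
                total += weight
    return total

for p, q in [(0.3, 0.7), (0.01, 0.99), (0.5, 0.5)]:
    assert abs(disjunction_prob(p, q) - p_oplus(p, q)) < 1e-12
```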

Let dist(u, v) be the distance between nodes u and v in G. A simple path (henceforth referred to as just a path) is a sequence of adjacent nodes that does not have repeating nodes. Let u ⇝d v denote the existence of a path of length d between u and v (and u ⇝≤d v a path of length at most d). Similarly, u ↮ v denotes that u and v are disconnected. The notion of d-neighborhood is useful when analyzing local properties of graphs:

Definition 4. For a graph G = (V, E), w ∈ V, and d ∈ N, the d-neighborhood of w is {v | dist(w, v) ≤ d}, i.e., the 'ball' of radius d around w. B(w, d) denotes the number of nodes in this d-neighborhood, and B(d) = max_{w∈V} B(w, d).
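B(w, d) and B(d) are straightforward to compute with a depth-limited breadth-first search; a minimal sketch (the helper names are our own):

```python
from collections import deque

def ball_size(adj, w, d):
    # B(w, d): number of nodes within distance d of w (including w itself).
    dist = {w: 0}
    q = deque([w])
    while q:
        u = q.popleft()
        if dist[u] == d:
            continue  # do not expand beyond radius d
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return len(dist)

def max_ball_size(adj, d):
    # B(d) = max_w B(w, d).
    return max(ball_size(adj, w, d) for w in range(len(adj)))

# 5-node path graph 0-1-2-3-4:
path = [[1], [0, 2], [1, 3], [2, 4], [3]]
assert ball_size(path, 2, 1) == 3   # {1, 2, 3}
assert max_ball_size(path, 2) == 5  # from node 2, all nodes are within 2 hops
```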

In the rest of this section, we formally define the key concepts used in the main text, as well as those necessary for our formal proofs.

Definition 5 (Nontrivial Graph Instances). A pair (GM, GL) of a meaning graph and a linguistic graph created from it is non-trivial if its generation process satisfies the following:

1. non-zero noise, i.e., p−, ε−, ε+ > 0;

2. incomplete information, i.e., p+ < 1;

3. noise content does not dominate the actual information, i.e., p− ≪ p+, ε+ < 0.5, and p+ > 0.5;

4. GM is not overly-connected, i.e., B(d) ∈ o(n), where n is the number of nodes in GM;

5. GM is not overly-sparse, i.e., |E(GM)| ∈ ω(1).

While the reasoning engine only sees the linguistic graph GL, it must make inferences about the potential latent meaning graph. Given a pair of nodes vL := {w, w′} ⊆ VL in the linguistic graph, the reasoning algorithm must then predict properties of the corresponding nodes vM = {m, m′} = {O⁻¹(w), O⁻¹(w′)} in the meaning graph.

We use a hypothesis testing setup to assess the likelihood of two disjoint hypotheses defined over GM: H¹M(vM) and H²M(vM). Given observations XL(vL) about linguistic nodes, the goal of a reasoning algorithm here is to identify which of the two hypotheses about GM is more likely to have resulted in these observations, under the sampling process of Algorithm 1. That is, we are interested in:

argmax_{h ∈ {H¹M(vM), H²M(vM)}} P(h)[XL(vL)]    (3)

where P(h)[x] denotes the probability of an event x in the sample space induced by Algorithm 1, when the (hidden) meaning graph GM satisfies hypothesis h. Formally:


Definition 6 (Reasoning Problem). The input for an instance P of the reasoning problem is a collection of parameters that characterize how a linguistic graph GL is generated from a (latent) meaning graph GM, two hypotheses H¹M(vM) and H²M(vM) about GM, and available observations XL(vL) in GL. The reasoning problem, P(p+, p−, ε+, ε−, B(d), n, λ, H¹M(vM), H²M(vM), XL(vL)), is to map the input to the hypothesis h as per Eq. (3).

Since we start with two disjoint hypotheses on GM, the resulting probability spaces are generally different, making it plausible to identify the correct hypothesis with high confidence. On the other hand, with sufficient noise in the sampling process, it can also become difficult for an algorithm to distinguish the two resulting probability spaces (corresponding to the two hypotheses), especially depending on the observations XL(vL) used by the algorithm and the parameters of the sampling process. For example, the distance between linguistic nodes can often be an insufficient indicator for distinguishing these two hypotheses. We will explore these two contrasting reasoning behaviors in the next section.

We use "separation" to measure how effective an observation XL is in distinguishing between the two hypotheses:

Definition 7 (γ-Separation). For γ ∈ [0, 1] and a reasoning problem instance P with two hypotheses h1 = H¹M(vM) and h2 = H²M(vM), we say an observation XL(vL) in the linguistic space γ-separates h1 from h2 if:

P(h1)[XL(vL)] − P(h2)[XL(vL)] ≥ γ.

We can view γ as the gap between the likelihoods of the observation XL(vL) having originated from a meaning graph satisfying hypothesis h1 vs. one satisfying hypothesis h2. When γ = 1, XL(vL) is a perfect discriminator for distinguishing h1 and h2. In general, any positive γ bounded away from 1 yields a valuable observation.⁷

Given an observation XL that γ-separates h1 and h2, there is a simple algorithm that distinguishes h1 from h2:

function SEPARATOR_XL(GL, vL = {w, w′}):
    if XL(vL) = 1 then return h1 else return h2
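A direct Python transcription of this separator might look as follows (the particular observation function and the adjacency example are our own illustrations, not the paper's):

```python
def separator(X_L, G_L, v_L):
    # SEPARATOR_{X_L}: return h1 iff the observation holds on the node pair.
    return "h1" if X_L(G_L, v_L) else "h2"

# Example observation (ours): w and w' are directly adjacent in G_L,
# where G_L is represented as a set of undirected edges.
def adjacent(G_L, v_L):
    w, w2 = v_L
    return (w, w2) in G_L or (w2, w) in G_L

G_L = {(0, 1), (1, 2)}
assert separator(adjacent, G_L, (0, 1)) == "h1"
assert separator(adjacent, G_L, (0, 2)) == "h2"
```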

Importantly, this algorithm does not compute the probabilities in Definition 7. Rather, it works with a particular instantiation GL of the linguistic graph. We refer to such an algorithm A as γ-accurate for h1 and h2 if, under the sampling choices of Algorithm 1, it outputs the 'correct' hypothesis with probability at least γ; that is, for both i ∈ {1, 2}: P(hi)[A outputs hi] ≥ γ.

Proposition 8. If observation XL γ-separates h1 and h2, then algorithm SEPARATOR_XL is γ-accurate for h1 and h2.

Proof. Let A denote SEPARATOR_XL for brevity. Combining the γ-separation of XL with how A operates, we obtain:

P(h1)[A outputs h1] − P(h2)[A outputs h1] ≥ γ
⇒ P(h1)[A outputs h1] + P(h2)[A outputs h2] ≥ 1 + γ

Since each term on the left is bounded above by 1, each of them must also be at least γ, as desired.

The rest of the work will explore when one can obtain a γ-accurate algorithm, using γ-separation of the underlying observation as an analysis tool.

The following observation allows a simplification of the proofs, without loss of any generality.

⁷ If the above probability gap is negative, one can instead use the complement of XL(vL) for γ-separation.


Remark 9. Since our procedure doesn't treat similarity edges and meaning-to-symbol noise edges differently, we can 'fold' ε− into p− and p+ (by increasing the edge probabilities). More generally, the results are identical whether one uses (p+, p−, ε−) or (p′+, p′−, ε′−), as long as:

p+ ⊕ ε− = p′+ ⊕ ε′−
p− ⊕ ε− = p′− ⊕ ε′−

For any p+ and ε−, we can find a p′+ such that ε′− = 0. Thus, w.l.o.g., in the following analysis we derive results using only p+ and p− (i.e., we assume ε− = 0). Note that we expand these terms to p+ ⊕ ε− and p− ⊕ ε−, respectively, in the final results.


B Main Results (Formal)

This section presents a more formal version of our results. As discussed below, the formal theorems are best stated using the notions of hypothesis testing, observations on linguistic graphs GL, and γ-separation of two hypotheses about GM using these observations on GL.

B.1 Connectivity Reasoning Algorithm

For nodes m, m′ in GM, we refer to distinguishing the following two hypotheses as the d-connectivity reasoning problem, and find that even these two extreme worlds can be difficult to separate:

h1 = m ⇝d m′, and h2 = m ↮ m′

For reasoning algorithms, a natural test is the connectivity of linguistic nodes in GL via the NODEPAIRCONNECTIVITY function. Specifically, it checks whether there is a path of length at most d̃ between w and w′:

X^d̃_L(w, w′) = [w ⇝≤d̃ w′]

Existing multi-hop reasoning models (Khot et al., 2017) use similar approaches to identify valid reasoning chains.

The corresponding connectivity algorithm is SEPARATOR X^d̃_L, which we would like to be γ-accurate for h1 and h2. Next, we derive bounds on γ for these specific hypotheses and observations XL(vL). While the space of possible hypotheses and observations is large, these natural and simple choices still allow us to derive valuable intuitions.

B.1.1 Possibility of Accurate Connectivity Reasoning

We begin by defining the following accuracy threshold, γ*, as a function of the parameters for sampling a linguistic graph:

Definition 10. Given n, d ∈ N and linguistic graph sampling parameters p+, ε+, λ, define γ*(n, d, p+, ε+, ε−, λ) as

(1 − (1 − (p+ ⊕ ε−))^(λ²))^d · (1 − 2e³ ε+^(λ/2))^(d+1) − 2en (λB(d))² p−.

This expression, although complex-looking, behaves in a natural way. E.g., the accuracy threshold γ* increases (higher accuracy) as p+ increases (higher edge retention) or ε+ decreases (fewer dropped connections between replicas). Similarly, the threshold increases as λ increases (higher replication dampens the impact of noisy edges between node clusters) or as d decreases (shorter paths).

Theorem 11 (formal statement of Theorem 1). Let p+, p−, λ be the parameters of Alg. 1 on a meaning graph with n nodes. Let ε+, ε− be the parameters of NODEPAIRCONNECTIVITY. Let d ∈ N and d̃ = λ·d + λ − 1. If

(p− ⊕ ε−) · B²(d) < 1/(2eλ²n),

and γ = max{0, γ*(n, d, p+, ε+, ε−, λ)}, then algorithm SEPARATOR X^d̃_L is γ-accurate for the d-connectivity problem.

B.2 Limits of Connectivity-Based Algorithms

We show that as the distance d between two meaning nodes increases, it becomes difficult to make any inference about their connectivity by assessing the connectivity of the corresponding linguistic-graph nodes. Specifically, if d is at least logarithmic in the number of meaning nodes, then, even with small noise, the algorithm will see all node pairs as being within distance d̃, making informative inference unlikely.


Theorem 12 (formal statement of Theorem 2). Let c > 1 be a constant and p−, λ be parameters of the sampling process in Alg. 1 on a meaning graph GM with n nodes. Let ε− be a parameter of NODEPAIRCONNECTIVITY. Let d ∈ N and d̃ = λd. If

p− ⊕ ε− ≥ c/(λn) and d ∈ Ω(log n),

then the connectivity algorithm SEPARATOR X^d̃_L almost surely infers every node pair in GM as connected, and is thus not γ-accurate for any γ > 0 for the d-connectivity problem.

Note that the preconditions of Theorems 11 and 12 are disjoint; that is, the two results never apply simultaneously. Since B(·) ≥ 1 and λ ≥ 1, Theorem 11 requires p− ⊕ ε− < 1/(2eλ²n) < 1/(λ²n), whereas Theorem 12 applies when p− ⊕ ε− ≥ c/(λn) > 1/(λ²n).

B.3 Limits of General Algorithms

We now extend the result to an algorithm-agnostic setting, where no assumption is made about the choice of the SEPARATOR algorithm or of XL(vL). For instance, XL(vL) could make use of the entire degree distribution of GL, compute the number of disjoint paths between various linguistic nodes, cluster nodes, etc. The analysis uses spectral properties of graphs to quantify local information.

Theorem 13 (formal statement of Theorem 3). Let c > 0 be a constant and p−, λ be parameters of the sampling process in Alg. 1 on a meaning graph GM with n nodes. Let ε− be a parameter of NODEPAIRCONNECTIVITY. Let d ∈ N. If

p− ⊕ ε− > (c log n)/(λn) and d > log n,

then there exists n0 ∈ N such that for all n ≥ n0, no algorithm can distinguish, with high probability, between two nodes in GM having a d-path vs. being disconnected, and is thus not γ-accurate for any γ > 0 for the d-connectivity problem.

This reveals a fundamental limitation: under noisy conditions, our ability to infer interesting phenomena in the meaning space is limited to a small, logarithmic neighborhood.


C Proofs: Possibility of Accurate Connectivity Reasoning

In this section we provide the proof of Theorem 11, along with the additional lemmas it requires. We first introduce a few useful lemmas, which will be used in the connectivity analysis of the clusters of nodes O(m), and then move on to the proof of Theorem 11.

Lemma 14 (Connectivity of a random graph (Gilbert, 1959)). Let Pn denote the probability of the event that a random undirected graph G(n, p) (p > 0.5) is connected. This probability can be lower-bounded as follows:

Pn ≥ 1 − [ q^(n−1) · ((1 + q^((n−2)/2))^(n−1) − q^((n−2)(n−1)/2)) + q^(n/2) · ((1 + q^((n−2)/2))^(n−1) − 1) ],

where q = 1 − p.

See Gilbert (1959) for a proof of this lemma. Since q ∈ (0, 1), this implies that Pn → 1 as n increases. The following corollary provides a simpler version of the above bound:

Corollary 15 (Connectivity of a random graph (Gilbert, 1959)). The random-graph connectivity probability Pn (Lemma 14) can be lower-bounded as follows:

Pn ≥ 1 − 2e³ q^(n/2)

Proof. We use the following inequality:

(1 + 3/n)^n ≤ e³

Given that q ≤ 0.5 and n ≥ 1, one can verify that q^((n−2)/2) ≤ 3/n. Combining this with the above inequality gives (1 + q^((n−2)/2))^(n−1) ≤ e³.

With this, we bound the two bracketed terms of the target inequality:

(1 + q^((n−2)/2))^(n−1) − q^((n−2)(n−1)/2) ≤ e³
(1 + q^((n−2)/2))^(n−1) − 1 ≤ e³

Therefore,

q^(n−1) · ((1 + q^((n−2)/2))^(n−1) − q^((n−2)(n−1)/2)) + q^(n/2) · ((1 + q^((n−2)/2))^(n−1) − 1) ≤ e³q^(n−1) + e³q^(n/2) ≤ 2e³q^(n/2),

which concludes the proof.
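The corollary's lower bound can be compared against a Monte Carlo estimate of the connectivity probability of G(n, p) (an illustrative sanity check of ours; the union-find routine and parameter values are our own choices):

```python
import math
import random

def connect_prob(n, p, trials=200, seed=0):
    # Monte Carlo estimate of P[G(n, p) is connected], via union-find.
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        parent = list(range(n))
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path compression
                x = parent[x]
            return x
        comps = n
        for u in range(n):
            for v in range(u + 1, n):
                if rng.random() < p:
                    ru, rv = find(u), find(v)
                    if ru != rv:
                        parent[ru] = rv
                        comps -= 1
        hits += (comps == 1)
    return hits / trials

n, p = 30, 0.6
bound = 1 - 2 * math.e ** 3 * (1 - p) ** (n / 2)  # Corollary 15 with q = 1 - p
assert connect_prob(n, p) >= bound
```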

We next show a lower bound on the probability of w and w′ being connected, given the connectivity of their counterpart nodes in the meaning graph. Notice that the choice of d̃ = (d + 1)(λ − 1) + d = λ·d + λ − 1 is motivated as follows: the type of path from w to w′ that we consider in the proof has at most λ − 1 edges within the nodes corresponding to a single meaning node, and d edges in between. This lemma will be used in the proof of Theorem 11:

Lemma 16 (Lower bound).

P[w ⇝d̃ w′ | m ⇝d m′] ≥ (1 − 2e³ ε+^(λ/2))^(d+1) · (1 − (1 − p+)^(λ²))^d.

Proof. We know that m and m′ are connected through some intermediate nodes m1, m2, ..., mℓ (ℓ < d). We show a lower bound on the probability of a path existing in the linguistic graph between w and w′ through the clusters of nodes O(m1), O(m2), ..., O(mℓ). We decompose this event into two kinds of events:

e1[v]: for a given meaning node v, its cluster O(v) in the linguistic graph is connected.
e2[v, u]: for any two connected nodes (u, v) in the meaning graph, there is at least one edge connecting their clusters O(u), O(v) in the linguistic graph.


The desired probability can then be refactored as:

P[w ⇝d̃ w′ | m ⇝d m′] ≥ P[ ⋂_{v ∈ {w, m1, ..., mℓ, w′}} e1[v] ∩ ⋂_{(v,u) ∈ {(w,m1), ..., (mℓ,w′)}} e2[v, u] ] ≥ P[e1]^(d+1) · P[e2]^d.

We split the two probabilities and identify lower bounds for each. Based on Corollary 15, P[e1] ≥ 1 − 2e³ε+^(λ/2), and as a result P[e1]^(d+1) ≥ (1 − 2e³ε+^(λ/2))^(d+1). The probability of connectivity between a pair of clusters is P[e2] = 1 − (1 − p+)^(λ²). Thus, similarly, P[e2]^d ≥ (1 − (1 − p+)^(λ²))^d. Combining these two, we obtain:

P[w ⇝d̃ w′ | m ⇝d m′] ≥ (1 − 2e³ε+^(λ/2))^(d+1) · (1 − (1 − p+)^(λ²))^d    (4)

The connectivity analysis of GL can be challenging, since the graph is a non-homogeneous combination of positive and negative edges. For the sake of simplifying the probabilistic arguments, given a linguistic graph GL, we introduce a (non-unique) simple graph ḠL as follows.

Definition 17. Consider a special partitioning of VGL such that the d̃-neighbourhoods of w and w′ form two of the partitions, and the rest of the nodes are arbitrarily partitioned in a way that the diameter of each component does not exceed d̃.

• The set of nodes VḠL of ḠL corresponds to the aforementioned partitions.

• There is an edge (u, v) ∈ EḠL if and only if at least one node pair from the partitions of VGL corresponding to u and v, respectively, is connected in EGL.

In the following lemma, we give an upper bound on the connectivity of neighboring nodes in ḠL:

Lemma 18. When GL is drawn at random, the probability that an edge connects two arbitrary nodes in ḠL is at most (λB(d))²p−.

Proof. Recall that a pair of nodes of ḠL, say (u, v), is connected when at least one pair of nodes from the corresponding partitions in GL is connected. Each d-neighbourhood in the meaning graph has at most B(d) nodes, which implies that each partition in GL has at most λB(d) nodes. Therefore, between each pair of partitions there are at most (λB(d))² possible edges. By the union bound, the probability of at least one edge being present between two partitions is at most (λB(d))²p−.

Let vw, vw′ ∈ VḠL be the nodes corresponding to the components containing w and w′, respectively. The following lemma establishes a relation between the connectivity of w, w′ ∈ VGL and the connectivity of vw, vw′ ∈ VḠL:

Lemma 19. P[w ⇝d̃ w′ | m ↮ m′] ≤ P[there is a path from vw to vw′ in ḠL with length at most d̃].

Proof. Let L and R be the events on the left-hand side and the right-hand side, respectively. Also, for a permutation of nodes in GL, say p, let Fp denote the event that all the edges of p are present, i.e., L = ∪p Fp. Similarly, for a permutation of nodes in ḠL, say q, let Hq denote the event that all the edges of q are present. Notice that Fp ⊆ Hq for q ⊆ p, because if all the edges of p are present, then all the edges of q are present as well. Thus,

L = ∪p Fp ⊆ ∪p H_{p ∩ EḠL} = ∪q Hq = R.

This implies that P[L] ≤ P[R].

Lemma 20 (Upper bound). If (λB(d))²p− ≤ 1/(2en), then

P[w ⇝≤d̃ w′ | m ↮ m′] ≤ 2en(λB(d))²p−.

Proof. To identify the upper bound on P[w ⇝≤d̃ w′ | m ↮ m′], recall the definition of ḠL given an instance of GL (as outlined in Lemma 18 and Lemma 19), with p = (λB(d))²p−. Lemma 19 relates the connectivity of w and w′ to a connectivity event in ḠL, i.e.,

P[w ⇝≤d̃ w′ | m ↮ m′] ≤ P[there is a path from vw to vw′ in ḠL with length at most d̃],

where vw, vw′ ∈ VḠL are the nodes corresponding to the components containing w and w′, respectively. Equivalently, in the following, we prove that the event dist(vw, vw′) ≤ d̃ happens with small probability:

P[dist(vw, vw′) ≤ d̃] = P[ ∨_{ℓ=1,...,d̃} vw ⇝ℓ vw′ ] ≤ Σ_{ℓ≤d̃} C(n, ℓ) p^ℓ ≤ Σ_{ℓ≤d̃} (en/ℓ)^ℓ p^ℓ ≤ Σ_{ℓ≤d̃} (en)^ℓ p^ℓ ≤ enp · ((enp)^d̃ − 1)/(enp − 1) ≤ enp/(1 − enp) ≤ 2enp,

where C(n, ℓ) denotes the binomial coefficient and the final inequality uses the assumption that p ≤ 1/(2en).
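The chain of inequalities above can be checked numerically for illustrative values of n, p, and the length bound (a small sanity check of ours, using the lemma's assumption p ≤ 1/(2en)):

```python
import math

def path_union_bound(n, p, d):
    # sum_{l=1..d} C(n, l) p^l: the union bound over candidate paths.
    return sum(math.comb(n, l) * p ** l for l in range(1, d + 1))

n, d = 1000, 20
p = 1 / (2 * math.e * n)  # the lemma's assumption: p <= 1/(2en)
lhs = path_union_bound(n, p, d)
assert lhs <= 2 * math.e * n * p  # i.e., lhs <= 2enp, which equals 1.0 here
```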

Armed with the bounds in Lemma 16 and Lemma 20, we are ready to provide the main proof:

Proof of Theorem 11. Recall that the algorithm checks for connectivity between the two given nodes w and w′, i.e., whether w ⇝≤d̃ w′. With this observation, we aim to infer whether the two nodes in the meaning graph are connected (m ⇝d m′) or not (m ↮ m′). We prove the theorem by using the lower and upper bounds for these two probabilities, respectively:

γ = P[w ⇝≤d̃ w′ | m ⇝d m′] − P[w ⇝≤d̃ w′ | m ↮ m′]
  ≥ LB(P[w ⇝≤d̃ w′ | m ⇝d m′]) − UB(P[w ⇝≤d̃ w′ | m ↮ m′])
  ≥ (1 − 2e³ε+^(λ/2))^(d+1) · (1 − (1 − p+)^(λ²))^d − 2en(λB(d))²p−,

where the last two terms of the above inequality are based on the results of Lemma 16 and Lemma 20, with the assumption for the latter that (λB(d))²p− ≤ 1/(2en). To write this result in its general form, we replace p+ and p− with p+ ⊕ ε− and p− ⊕ ε−, respectively (see Remark 9).


D Proofs: Limitations of Connectivity Reasoning

We provide the necessary lemmas and intuitions before proving the main theorem. A random graph is an instance sampled from a distribution over graphs. In the G(n, p) Erdos-Renyi model, a graph is constructed as follows: each edge is included in the graph with probability p, independently of all other edges. In such graphs, on average, the length of the path connecting any node pair is short (logarithmic in the number of nodes).

Lemma 21 (Diameter of a random graph, Corollary 1 of (Chung and Lu, 2002)). If n · p = c > 1 for some constant c, then almost surely the diameter of G(n, p) is Θ(log n).

We use the above lemma to prove Theorem 12. Note that the overall noise probability (i.e., p in Lemma 21) in our framework is p− ⊕ ε−.

Proof of Theorem 12. Note that |VGL| = λ·n. By Lemma 21, the linguistic graph has diameter Θ(log λn). This means that for any pair of nodes w, w′ ∈ VGL, there is a path of length Θ(log λn) between w and w′. Since d̃ = λd ∈ Ω(log λn), the multi-hop reasoning algorithm finds a path between w and w′ in the linguistic graph and returns "connected", regardless of the connectivity of m and m′.


E Proofs: Limitations of General Reasoning

The proof of the general limitations theorem follows after we introduce the necessary notation and lemmas. This section first covers key definitions and the two main lemmas that lead to the theorem, after which proofs of auxiliary lemmas and results are included to complete the formal argument.

E.1 Main Argument

Consider a meaning graph $G_M$ in which two nodes $m$ and $m'$ are connected. We drop the edges of a min-cut $C$ to make the two nodes disconnected and obtain $G'_M$ (Figure 4).

Figure 4: The construction considered in Definition 22. The node pair $m$-$m'$ is connected with distance $d$ in $G_M$, and disconnected in $G'_M$, after dropping the edges of a cut $C$. For each linguistic graph, we consider its "local" Laplacian.

Definition 22. Define a pair of meaning graphs $G$ and $G'$, both with $n$ nodes and satisfying the ball assumption $B(d)$, with the following properties: (1) $m \stackrel{d}{\rightsquigarrow} m'$ in $G$, (2) $m \not\rightsquigarrow m'$ in $G'$, (3) $E_{G'} \subset E_{G}$, (4) $C = E_{G} \setminus E_{G'}$ is an $(m, m')$ min-cut of $G$.

Definition 23. Define a distribution $\mathcal{G}$ over pairs of possible meaning graphs $G, G'$ and pairs of nodes $m, m'$ which satisfy the requirements of Definition 22. Formally, $\mathcal{G}$ is a uniform distribution over the set

$\{(G, G', m, m') \mid G, G', m, m' \text{ satisfy Definition 22}\}.$

We sample two linguistic graphs, $G_L$ and $G'_L$, as depicted in Figure 4. In the sampling of $G_L$ and $G'_L$, all the edges share the randomization, except for those corresponding to $C$ (i.e., the difference between $G_M$ and $G'_M$). Let $U$ be the union of the nodes in the $\tilde d$-neighborhoods of $w$ and $w'$ in $G_L$ and $G'_L$. Define $L, L'$ to be the Laplacian matrices corresponding to the nodes of $U$. As $n$ grows, the two Laplacians become less distinguishable whenever $p_- \oplus \varepsilon_-$ and $d$ are large enough:

Lemma 24. Let $c > 0$ be a constant and $p_-, \lambda$ be parameters of the sampling process in Algorithm 1 on a pair of meaning graphs $G$ and $G'$ on $n$ nodes constructed according to Definition 22. Let $\tilde d \in \mathbb{N}$ with $\tilde d \ge \lambda d$, and let $\bar L, \bar L'$ be the normalized Laplacian matrices for the $\tilde d$-neighborhoods of the corresponding nodes in the sampled linguistic graphs $G_L$ and $G'_L$. If $p_- \oplus \varepsilon_- \ge \frac{c \log n}{n}$ and $d > \log n$, then, with high probability, the two Laplacians are close:

$$\|\bar L - \bar L'\| \le \frac{\sqrt{2\lambda}\, B(1)}{\sqrt{n \log(n\lambda)}}.$$

Proof of Lemma 24. We start by proving an upper bound on $\bar L - \bar L'$ in the positive semi-definite order; a similar upper bound holds for $\bar L' - \bar L$, which together conclude the lemma.

$$
\bar L - \bar L' = \frac{L}{\|L\|} - \frac{L'}{\|L'\|} \preceq \frac{L}{\|L\|} - \frac{L'}{\|L - L'\| + \|L\|} \preceq \frac{L \cdot \|L - L'\|}{\|L\|^2} + \frac{L - L'}{\|L\|} \preceq \frac{\sqrt{\lambda}\, B(1)}{\sqrt{2n \log(n\lambda)}}\, I + \frac{\sqrt{\lambda}\, B(1)}{\sqrt{2n \log(n\lambda)}}\, I.
$$

The first inequality uses $\|L'\| \le \|L\| + \|L - L'\|$, the second uses the positive semi-definiteness of $L$ and $L - L'$, and the last inequality is due to Lemma 29. By symmetry, the same upper bound holds for $\bar L' - \bar L$, i.e., $\bar L' - \bar L \preceq \frac{2\sqrt{\lambda}\, B(1)}{\sqrt{2n \log(n\lambda)}}\, I$. This means that $\|\bar L - \bar L'\| \le \frac{2\sqrt{\lambda}\, B(1)}{\sqrt{2n \log(n\lambda)}} = \frac{\sqrt{2\lambda}\, B(1)}{\sqrt{n \log(n\lambda)}}$.

This can be used to show that, for such large enough $p_-$ and $d$, the two linguistic graphs $G_L$ and $G'_L$ sampled as above are, with high probability, indistinguishable by any function operating over a $\lambda d$-neighborhood of $w, w'$ in $G_L$.

A reasoning function can be thought of as a mapping defined on normalized Laplacians, since they encode all the information in a graph. For a reasoning function $f$ with limited precision, the input space can be partitioned into regions where the function is constant; for large enough values of $n$, both $\bar L$ and $\bar L'$ (with high probability) fall into regions where $f$ is constant.

Note that a reasoning algorithm is oblivious to the details of $C$, i.e., it does not know where $C$ is, or where to look for the changes. Therefore, a realistic algorithm ought to use the neighborhood information collectively.

In the next lemma, we define a function $f$ to characterize the reasoning function, which uses Laplacian information and maps it to binary decisions. We then prove that for any such function, there are regimes in which the function cannot distinguish $\bar L$ and $\bar L'$:

Lemma 25. Let the meaning and linguistic graphs be constructed under the conditions of Lemma 24. Let $\beta > 0$ and let $f : \mathbb{R}^{|U| \times |U|} \to \{0, 1\}$ be the indicator function of an open set. Then there exists $n_0 \in \mathbb{N}$ such that for all $n \ge n_0$:

$$
\mathbb{P}_{\substack{(G, G', m, m') \sim \mathcal{G} \\ G_L \leftarrow \mathrm{ALG}(G),\; G'_L \leftarrow \mathrm{ALG}(G')}} \big[ f(\bar L) = f(\bar L') \big] \ge 1 - \beta.
$$

Proof of Lemma 25. Note that $f$ maps a high-dimensional continuous space to a discrete space. To simplify the argument about $f$, we decompose it into two functions: a continuous function $g$ mapping matrices to $(0, 1)$, and a threshold function $H$ (e.g., $0.5 + 0.5\,\mathrm{sgn}(\cdot)$) which maps to one if $g$ is above a threshold and to zero otherwise. Without loss of generality, we also normalize $g$ such that its gradient has norm at most one. Formally,

$$
f = H \circ g, \quad \text{where } g : \mathbb{R}^{|U| \times |U|} \to (0, 1), \quad \big\|\nabla g\big|_{\bar L}\big\| \le 1.
$$

Lemma 30 gives a proof of existence for such a decomposition, which depends on the pre-images being open or closed.

One can find a differentiable and Lipschitz function $g$ that crosses the threshold specified by $H$ exactly at the boundaries where $f$ changes value.

Since $g$ is Lipschitz, one can upper-bound the variation of the continuous function:

$$\|g(\bar L) - g(\bar L')\| \le M \|\bar L - \bar L'\|.$$

According to Lemma 24, $\|\bar L - \bar L'\|$ is upper-bounded by a decreasing function of $n$.


For uniform choices $(G, G', m, m') \sim \mathcal{G}$ (Definition 23), the Laplacian pairs $(\bar L, \bar L')$ are randomly distributed in a high-dimensional space, and for big enough $n$, a large enough portion of the pairs $(\bar L, \bar L')$ (enough to satisfy the $1 - \beta$ probability) appear on the same side of the hyperplane corresponding to the threshold function (i.e., $f(\bar L) = f(\bar L')$).

Proof of Theorem 13. By substituting $\gamma = \beta$ and $X_L = f$, the theorem reduces to the statement of Lemma 25.

E.2 Auxiliary Lemmas and Proofs

In the following lemma, we show that the spectral differences between the two linguistic graphs in the locality of the target nodes are small. For ease of exposition, we define intermediate notation for a normalized version of the Laplacians: $\bar L = L / \|L\|_2$ and $\bar L' = L' / \|L'\|_2$.

Lemma 26. The norm-2 of the Laplacian matrix corresponding to the nodes participating in a cut can be upper-bounded by the number of edges participating in the cut, up to a constant factor.

Proof of Lemma 26. Using the definition of the Laplacian:

$$\|L_C\|_2 = \|D - A\|_2 \le \|A\|_2 + \|D\|_2,$$

where $A$ is the adjacency matrix and $D$ is a diagonal matrix with the degrees on its diagonal. We bound the norms of these matrices in terms of the size of the cut (i.e., the number of edges in the cut). For the adjacency matrix, we use the Frobenius norm:

$$\|A\|_2 \le \|A\|_F = \sqrt{\textstyle\sum_{ij} a_{ij}^2} = \sqrt{2|C|} \le 2|C|,$$

where $|C|$ denotes the number of edges in $C$. To bound the matrix of degrees, we use the fact that the norm-2 of a diagonal matrix is its largest diagonal element (its largest eigenvalue):

$$\|D\|_2 = \sigma_{\max}(D) = \max_i \deg(i) \le |C|.$$

With this we have shown that $\|L_C\|_2 \le 3|C|$.
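As a numeric sanity check of this bound (our own illustration, not part of the proof; `spectral_norm` is a hypothetical helper using plain power iteration), consider a star-shaped cut with $|C| = k$ edges, whose Laplacian has largest eigenvalue exactly $k + 1 \le 3k$:

```python
def spectral_norm(M, iters=300):
    """Estimate the norm-2 (largest |eigenvalue|) of a symmetric matrix
    via power iteration."""
    n = len(M)
    v = [float(i + 1) for i in range(n)]  # not orthogonal to top eigenvector
    norm = 0.0
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return norm

k = 5                            # a cut C with |C| = k edges
n = k + 1
A = [[0] * n for _ in range(n)]  # star: node 0 joined to nodes 1..k
for j in range(1, n):
    A[0][j] = A[j][0] = 1
deg = [sum(row) for row in A]
L_C = [[(deg[i] if i == j else 0) - A[i][j] for j in range(n)] for i in range(n)]

assert spectral_norm(L_C) <= 3 * k               # Lemma 26: ||L_C||_2 <= 3|C|
assert abs(spectral_norm(L_C) - (k + 1)) < 1e-6  # star's top eigenvalue is k+1
```

The star is the extremal local shape for a cut around a single node; the $3|C|$ bound is loose here by roughly a factor of three, which does not affect the asymptotic argument.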

For sufficiently large values of $p$, $G(n, p)$ is a connected graph with high probability. More formally:

Lemma 27 (Connectivity of random graphs). In a random graph $G(n, p)$, for any $p$ greater than $\frac{(1+\varepsilon)\ln n}{n}$, the graph will almost surely be connected.

The proof can be found in (Erdős and Rényi, 1960).

Lemma 28 (Norm of the adjacency matrix in a random graph). For a random graph $G(n, p)$, let $L$ be the adjacency matrix of the graph. For any $\varepsilon > 0$:

$$\lim_{n \to +\infty} \mathbb{P}\Big( \big| \|L\|_2 - \sqrt{2n \log n} \big| > \varepsilon \Big) = 0.$$

Proof of Lemma 28. From Theorem 1 of (Ding et al., 2010) we know that

$$\frac{\sigma_{\max}(L)}{\sqrt{n \log n}} \xrightarrow{P} \sqrt{2},$$

where $\xrightarrow{P}$ denotes convergence in probability. Also note that the norm-2 of a symmetric matrix is the magnitude of its largest eigenvalue, which concludes our proof.


Lemma 29. For any pair of meaning graphs $G$ and $G'$ constructed according to Definition 22, and,

• $d > \log n$,

• $p_- \oplus \varepsilon_- \ge c \log n / n$ for some constant $c$,

• $\tilde d \ge \lambda d$,

with $L$ and $L'$ being the Laplacian matrices corresponding to the $\tilde d$-neighborhoods of the corresponding nodes in the linguistic graphs, we have, with high probability:

$$\frac{\|L - L'\|_2}{\|L\|_2} \le \frac{\sqrt{\lambda}\, B(1)}{\sqrt{2n \log(n\lambda)}}.$$

Proof of Lemma 29. In order to simplify the exposition, w.l.o.g. assume that $\varepsilon_- = 0$ (see Remark 9). Our goal is to find an upper bound on the fraction $\frac{\|L - L'\|_2}{\|L\|_2}$. Note that the Laplacians contain only local information, i.e., the $\tilde d$-neighborhood.

First, we prove an upper bound on the numerator. By eliminating an edge in the meaning graph, the probability of edge appearance in the linguistic graph changes from $p_+$ to $p_-$. The effective result of removing the edges in $C$ thus appears as i.i.d. $\mathrm{Bern}(p_+ - p_-)$ noise. Since, by definition, $B(1)$ is an upper bound on the degree of meaning nodes, the size of the minimum cut is also upper-bounded by $B(1)$. Therefore, the maximum size of the min-cut $C$ separating two nodes $m \stackrel{d}{\rightsquigarrow} m'$ is at most $B(1)$. To account for vertex replication in the linguistic graph, the effect of the cut appears on at most $\lambda B(1)$ edges in the linguistic graph. Therefore, we have $\|L - L'\|_2 \le \lambda B(1)$ using Lemma 26.

As for the denominator, the size of the matrix $L$ is the same as the size of the $\tilde d$-neighborhood in the linguistic graph. We show that if $\tilde d > \log(\lambda n)$, the neighborhood almost surely covers the whole graph. While the growth of the $\tilde d$-neighborhood is a function of both $p_+$ and $p_-$, to keep the analysis simple we underestimate the neighborhood size by replacing $p_+$ with $p_-$; i.e., the size of the $\tilde d$-neighborhood is lower-bounded by the size of a $\tilde d$-neighborhood in $G(\lambda \cdot n, p_-)$.

By Lemma 21, the diameters of the linguistic graphs $G_L$ and $G'_L$ are both $\Theta(\log(\lambda n))$. Since $\tilde d \in \Omega(\log(\lambda n))$, the $\tilde d$-neighborhood covers the whole graph for both $G_L$ and $G'_L$.

Next, we use Lemma 28 to state that $\|L\|_2$ converges to $\sqrt{2\lambda n \log(\lambda n)}$ in probability. Combining the numerator and the denominator, we conclude that the fraction, for sufficiently large $n$, is upper-bounded by

$$\frac{\lambda B(1)}{\sqrt{2\lambda n \log(\lambda n)}},$$

which can be made arbitrarily small by a big enough choice of $n$.

Lemma 30. Suppose $f$ is the indicator function of an open set. Then it is always possible to write $f$ as the composition of two functions:

• a continuous and Lipschitz function $g : \mathbb{R}^d \to (0, 1)$,

• a thresholding function $H(x) = \mathbb{1}\{x > 0.5\}$,

such that $\forall x \in \mathbb{R}^d : f(x) = H(g(x))$.

(See https://en.wikipedia.org/wiki/Indicator_function for the definition of an indicator function.)

Proof of Lemma 30. Without loss of generality, we assume that the threshold function is defined as $H(x) = \mathbb{1}\{x > 0.5\}$; one can verify that a similar proof follows for $H(x) = \mathbb{1}\{x \ge 0.5\}$. We use the notation $f^{-1}(A)$ for the set of pre-images of a function $f$ for the set of outputs $A$.

First, let us study the collection of inputs that $f$ maps to 1. Since $f = H \circ g$, we have $f^{-1}(1) = g^{-1}(H^{-1}(1)) = g^{-1}((0.5, 1))$ and $f^{-1}(0) = g^{-1}(H^{-1}(0)) = g^{-1}((0, 0.5))$. Define


$C_0$ and $C_1$ such that $C_i \triangleq f^{-1}(i)$; note that since $g$ is continuous and $(0.5, 1)$ is open, $C_1$ is an open set (hence $C_0$ is closed). Let $d : \mathbb{R}^n \to \mathbb{R}$ be defined by

$$d(x) \triangleq \mathrm{dist}(x, C_0) = \inf_{c \in C_0} \|x - c\|.$$

Since $C_0$ is closed, it follows that $d(x) = 0$ if and only if $x \in C_0$. Therefore, letting

$$g(x) = \frac{1}{2} + \frac{1}{2} \cdot \frac{d(x)}{1 + d(x)},$$

we have $g(x) = \frac{1}{2}$ when $x \in C_0$, while $g(x) > \frac{1}{2}$ when $x \notin C_0$. This means that letting $H(x) = 1$ when $x > \frac{1}{2}$ and $H(x) = 0$ when $x \le \frac{1}{2}$, then $f = H \circ g$. One can also verify that this construction is $1/2$-Lipschitz; this follows because $d(x)$ is 1-Lipschitz, which can be proved using the triangle inequality. Hence, the necessary condition to have such a decomposition is that $f^{-1}(1)$ and $f^{-1}(0)$ be open or closed.
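To make the construction concrete, here is a one-dimensional check (our own illustration, not from the paper) with $f$ the indicator of the open set $(0, 1)$, so that $C_0$ is the closed complement:

```python
def f(x):
    """Indicator of the open set (0, 1)."""
    return 1 if 0 < x < 1 else 0

def dist_C0(x):
    """Distance to C0, the (closed) complement of (0, 1)."""
    return min(abs(x), abs(x - 1)) if 0 < x < 1 else 0.0

def g(x):
    """The Lipschitz surrogate from the proof: 1/2 + (1/2) d(x)/(1 + d(x))."""
    d = dist_C0(x)
    return 0.5 + 0.5 * d / (1 + d)

def H(y):
    """Strict threshold at 0.5."""
    return 1 if y > 0.5 else 0

xs = [i / 100 - 2 for i in range(401)]   # grid on [-2, 2]
assert all(f(x) == H(g(x)) for x in xs)  # f = H o g on the grid
# g is 1/2-Lipschitz, as claimed in the proof
assert all(abs(g(a) - g(b)) <= 0.5 * abs(a - b) + 1e-12
           for a, b in zip(xs, xs[1:]))
```

On $C_0$ the surrogate $g$ sits exactly at the threshold $1/2$ and is mapped to 0, while inside the open set it rises strictly above it, reproducing $f$ after thresholding.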


F Empirical investigation: further details

To evaluate the impact of the other noise parameters in the sampling process, we compare the average distances between nodes in the linguistic graph for a given distance between the corresponding meaning-graph nodes. In Figure 5, we plot these distributions for decreasing values of $p_-$ (from top left to bottom right). With high $p_-$ (top-left subplot), nodes in the linguistic graph are at distance less than two from each other, regardless of the distance of their corresponding node pair in the meaning graph. As a result, any reasoning algorithm that relies on connectivity cannot distinguish linguistic nodes that are connected in the meaning space from those that are not. As $p_-$ is set to lower values (i.e., as the noise reduces), the distribution of distances gets wider, and the correlation between distances in the two graphs increases. In the bottom-middle subplot, where $p_-$ has a very low value, we observe a significant correlation that can be reliably utilized by a reasoning algorithm.
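The distance collapse just described can be reproduced with a toy simulation (our own sketch, not the paper's exact sampling procedure; the function `linguistic_distance` and its parameters are illustrative): a path of meaning nodes is replicated $\lambda$ times, edges on the path appear with probability $p_+$, and noise edges appear elsewhere with probability $p_-$.

```python
import random
from collections import deque

def linguistic_distance(d_meaning, lam, p_plus, p_minus, extra=20, seed=0):
    """Sample a linguistic graph for a path meaning graph of length d_meaning
    (plus `extra` unrelated meaning nodes) and return the BFS distance between
    copies of the two path endpoints (None if disconnected)."""
    rng = random.Random(seed)
    n = d_meaning + 1 + extra
    path_edges = {(i, i + 1) for i in range(d_meaning)}
    adj = [[] for _ in range(lam * n)]
    for a in range(n):
        for b in range(a + 1, n):
            p = p_plus if (a, b) in path_edges else p_minus
            for i in range(lam):          # copies of meaning node a
                for j in range(lam):      # copies of meaning node b
                    if rng.random() < p:
                        u, v = a * lam + i, b * lam + j
                        adj[u].append(v)
                        adj[v].append(u)
    targets = {d_meaning * lam + j for j in range(lam)}
    dist = {0: 0}
    q = deque([0])
    while q:
        u = q.popleft()
        if u in targets:
            return dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return None

# High noise: the endpoints look close despite being 6 meaning hops apart.
noisy = linguistic_distance(d_meaning=6, lam=3, p_plus=0.9, p_minus=0.3)
# No noise: there are no shortcuts, so the distance reflects all 6 hops.
clean = linguistic_distance(d_meaning=6, lam=3, p_plus=0.9, p_minus=0.0)
assert noisy is not None and noisy <= 3
assert clean is None or clean >= 6
```

With $p_- = 0.3$ the noise edges act like an Erdős-Rényi overlay, pulling every pair of linguistic nodes within a couple of hops; with $p_- = 0$ the linguistic distance tracks the meaning distance, mirroring the top-left versus bottom-middle subplots of Figure 5.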

Figure 5: Heat-map representations, for varying values of $p_-$, of the distribution of average distances between node pairs in the linguistic graph as a function of the distance between their corresponding meaning nodes.
