


Learning Relational Features with Backward Random Walks

Ni Lao
Google Inc.
[email protected]

Einat Minkov
University of Haifa
[email protected]

William W. Cohen
Carnegie Mellon University
[email protected]

Abstract

The path ranking algorithm (PRA) has been recently proposed to address relational classification and retrieval tasks at large scale. We describe Cor-PRA, an enhanced system that can model a larger space of relational rules, including longer relational rules and a class of first order rules with constants, while maintaining scalability. We describe and test faster algorithms for searching for these features. A key contribution is to leverage backward random walks to efficiently discover these types of rules. An empirical study is conducted on the tasks of graph-based knowledge base inference, and person named entity extraction from parsed text. Our results show that learning paths with constants improves performance on both tasks, and that modeling longer paths dramatically improves performance for the named entity extraction task.

1 Introduction

Structured knowledge about entities and the relationships between them can be represented as an edge-typed graph, and relational learning methods often base predictions on connectivity patterns in this graph. One such method is the Path Ranking Algorithm (PRA), a random-walk based relational learning and inference framework due to Lao and Cohen (2010b). PRA is highly scalable compared with other statistical relational learning approaches, and can therefore be applied to perform inference in large knowledge bases (KBs). Several recent works have applied PRA to link prediction in semantic KBs, as well as to learning syntactic relational patterns used in information extraction from the Web (Lao et al., 2012; Gardner et al., 2013; Gardner et al., 2014; Dong et al., 2014).

A typical relational inference problem is illustrated in Figure 1. Given relational knowledge represented as a graph, we wish to infer additional relations of interest between entity pairs. For example, one may wish to infer whether an AthletePlaysInLeague relation holds between the nodes HinesWard and NFL. More generally, link prediction involves queries of the form: which entities are linked to a source node s (HinesWard) over a relation of interest r (e.g., r is AthletePlaysInLeague)?

PRA gauges the relevance of a target node t with respect to the source node s and relation r based on a set of relation paths (i.e., sequences of edge labels) that connect the node pair. Each path π_i is considered as a feature, and the value of feature π_i for an instance (s, t) is the probability of reaching t from s following path π_i. A classifier is learned in this feature space, using logistic regression.

PRA's candidate paths correspond closely to a certain class of Horn clauses: for instance, the path π = ⟨AthletePlaysForTeam, TeamPlaysInLeague⟩, when used as a feature for the relation r = AthletePlaysForLeague, corresponds to the Horn clause

AthletePlaysForTeam(s, z) ∧ TeamPlaysInLeague(z, t) → AthletePlaysForLeague(s, t)

One difference between PRA's features and more traditional logical inference is that random-walk weighting means that not all inferences instantiated by a clause will be given the same weight. Another difference is that PRA is very limited in terms of expressiveness. In particular, inductive logic programming

Figure 1: An example knowledge graph. [Figure omitted: nodes such as HinesWard, EliManning, Steelers, Giants, NFL, and MLB, linked by edges typed AthletePlaysForTeam and TeamPlaysInLeague.]

(ILP) methods such as FOIL (Quinlan and Cameron-Jones, 1993) learn first-order Horn rules that may involve constants. Consider the following rules as motivating examples.

EmployedByAgent(s, t) ∧ IsA(t, SportsTeam) → AthletePlaysForTeam(s, t)

t = NFL → AthletePlaysForTeam(s, t)

The first rule includes SportsTeam as a constant, corresponding to a particular graph node, which is the semantic class (hypernym) of the target node t. The second rule simply assigns NFL as the target node for the AthletePlaysForTeam relation; if used probabilistically, this rule can serve as a prior. Neither feature can be expressed in PRA, as PRA features are restricted to edge type sequences.

We are interested in extending the range of relational rules that can be represented within the PRA framework, including rules with constants. A key challenge is that this greatly increases the space of candidate rules. Knowledge bases such as Freebase (Bollacker et al., 2008), YAGO (Suchanek et al., 2007), or NELL (Carlson et al., 2010a) may contain thousands of predicates and millions of concepts. The number of features involving concepts as constants (even if limited to simple structures such as the example rules above) will thus be prohibitively large. Therefore, it is necessary to search the space of candidate paths π very efficiently. More efficient candidate generation is also necessary if one attempts to use a looser bound on the length of candidate paths.

To achieve this, we propose using backward random walks. Given target nodes that are known to be relevant for relation r, we perform backward random walks (up to finite length ℓ) originating at these target nodes, where every graph node c reachable in this random walk process is considered as a potentially useful constant. Consequently, the relational paths that connect nodes c and t are evaluated as possible random walk features. As we will show, such paths provide informative class priors for relational classification tasks.

Concretely, this paper makes the following contributions. First, we outline and discuss a new and larger family of relational features that may be represented in terms of random walks within the PRA framework. These features represent paths with constants, expanding the expressiveness of PRA. In addition, we propose to encode bi-directional random walk probabilities as features; we will show that accounting for this sort of directionality provides useful information about graph structure.

Second, we describe the learning of this extended set of paths by means of backward walks from relevant target nodes. Importantly, the search and computation of the extended set of features is performed efficiently, maintaining the high scalability of the framework. Concretely, using backward walks, one can compute random walk probabilities in a bi-directional fashion; this means that for paths of length 2M, the time complexity of path finding is reduced from O(|V|^{2M}) to O(|V|^M), where |V| is the number of edge types in the graph.

Finally, we report experimental results for relational inference tasks in two different domains, including knowledge base link prediction and person named entity extraction from parsed text (Minkov and Cohen, 2008). It is shown that the proposed extensions allow one to effectively explore a larger feature space, significantly improving model quality over previously published results in both domains. In particular, incorporating paths with constants significantly improves model quality on both tasks. Bi-directional walk probability computation also enables the learning of longer predicate chains, and the modeling of long paths is shown to substantially improve performance on the person name extraction task. Importantly, learning and inference remain highly efficient in both these settings.

2 Related Work

ILP complexity stems from two main sources: the complexity of searching for clauses, and of evaluating them. First-order learning systems (e.g., FOIL, FOCL (Pazzani et al., 1991)) mostly rely on hill-climbing search, i.e., incrementally expanding existing patterns to explore the combinatorial model space, and are thus often vulnerable to local maxima. PRA takes another approach, generating features using efficient random graph walks, and selecting a subset of those features which pass precision and frequency thresholds. In this respect, it resembles a stochastic approach to ILP used in earlier work (Sebag and Rouveirol, 1997). The idea of sampling-based inference and induction has been further explored by later systems (Kuzelka and Zelezny, 2008; Kuzelka and Zelezny, 2009).

Compared with conventional ILP or relational learning systems, PRA is limited to learning from binary predicates, and applies random-walk semantics to its clauses. Using sampling strategies (Lao and Cohen, 2010a), the computation of clause probabilities can be done in time that is independent of the knowledge base size, with bounded error rate (Wang et al., 2013). Unlike in FORTE and similar systems, in PRA sampling is also applied to the induction (path-finding) stage. The relational feature construction problem (or propositionalization) has previously been addressed in the ILP community; e.g., the RSD system (Zelezny and Lavrac, 2006) performs explicit first-order feature construction guided by a precision heuristic function. In comparison, PRA uses precision and recall measures, which can be readily read off from random walk results.

Bi-directional search is a popular strategy in AI, and in the ILP literature. The Aleph algorithm (Srinivasan, 2001) combines top-down with bottom-up search of the refinement graph, an approach inherited from Progol. FORTE (Richards and Mooney, 1991) was another early ILP system which enumerated paths via a bi-directional search. Computing backward random walks for PRA can be seen as a particular form of bi-directional search, one which is also assigned a random walk probability semantics. Unlike in prior work, we will use this probability semantics directly for feature selection.

3 Background

We first review the Path Ranking Algorithm (PRA) as introduced by Lao and Cohen (2010b), paying special attention to its random walk feature estimation and selection components.

3.1 Path Ranking Algorithm

Given a directed graph G, with nodes N, edges E, and edge types R, we assume that all edges can be traversed in both directions, and use r⁻¹ to denote the reverse of edge type r ∈ R. A path type π is defined as a sequence of edge types r_1 . . . r_ℓ. Such path types may be indicative of an extended relational meaning between graph nodes that are linked over these paths; for example, the path ⟨AthletePlaysForTeam, TeamPlaysInLeague⟩ implies the relationship "the league a certain player plays for". PRA encodes P(s → t; π_j), the probability of reaching target node t starting from source node s and following path π_j, as a feature that describes the semantic relation between s and t. Specifically, provided with a set of selected path types up to length ℓ, P_ℓ = {π_1, . . . , π_m}, the relevancy of target nodes t with respect to the query node s and the relationship of interest is evaluated using the following scoring function:

score(s, t) = Σ_{π_j ∈ P_ℓ} θ_j · P(s → t; π_j),    (1)

where θ are appropriate weights for the features, estimated in the following fashion.

Given a relation of interest r and a set of annotated node pairs {(s, t)} for which it is known whether r(s, t) holds or not, a training data set D = {(x, y)} is constructed, where x is a vector of all the path features for the pair (s, t), i.e., the j-th component of x is P(s → t; π_j), and y is a boolean variable indicating whether r(s, t) is true. We adopt the closed-world assumption: a set of relevant target nodes G_i is specified for every example source node s_i and relation r, and all other nodes are treated as negative target nodes. A biased sampling procedure selects only a small subset of negative samples to be included in the objective function (Lao and Cohen, 2010b). The parameters θ are estimated from both positive and negative examples using a regularized logistic regression model.
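As an illustration of this training setup, the following is a minimal sketch that assembles PRA feature vectors and fits a regularized logistic regression model, assuming the path probabilities P(s → t; π_j) have already been computed. The toy feature values and the use of scikit-learn are assumptions made for the example, not the implementation used in this work.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Rows are candidate (s, t) pairs; columns are selected path types
# pi_1, pi_2. Entry [i, j] holds P(s -> t; pi_j) (made-up values).
X = np.array([
    [0.50, 0.20],   # (HinesWard, NFL)    -- a positive pair
    [0.05, 0.00],   # (HinesWard, MLB)    -- a negative pair
    [0.45, 0.30],   # (EliManning, NFL)   -- a positive pair
    [0.00, 0.10],   # (EliManning, MLB)   -- a negative pair
])
y = np.array([1, 0, 1, 0])  # whether r(s, t) holds

# L2-regularized logistic regression estimates the path weights theta.
model = LogisticRegression(C=1.0).fit(X, y)
theta = model.coef_[0]

# score(s, t) = sum_j theta_j * P(s -> t; pi_j), as in Eq. (1)
# (up to the intercept term added by the classifier).
print(theta, X @ theta)
```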

3.2 PRA Features: Generation and Selection

PRA features are of the form P(s → t; π_j), denoting the probability of reaching target node t, originating a random walk at node s and following edge type sequence π_j. These path probabilities need to be estimated for every node pair, as part of both training and inference. High scalability is achieved due to efficient path probability estimation. In addition, feature selection is applied so as to allow efficient learning and avoid overfitting.

Concretely, the probability of reaching t from s following path type π can be recursively defined as

P(s → t; π) = Σ_z P(s → z; π′) P(z → t; r),    (2)

where r is the last edge type in path π, and π′ is its prefix, such that adding r to π′ gives π. In the terminal case that π′ is the empty path φ, P(s → z; φ) is defined to be 1 if s = z, and 0 otherwise. The probability P(z → t; r) is defined as 1/|r(z)| if r(z, t) holds, and 0 otherwise, where r(z) is the set of nodes linked to node z over edge type r. It has been shown that P(s → t; π) can be effectively estimated using random walk sampling techniques, with bounded complexity and bounded error, for all graph nodes that can be reached from s over path type π (Lao and Cohen, 2010a).
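The recursion in Eq. (2) can be computed exactly by dynamic programming on a small graph, as in the sketch below; note that the system described here instead approximates these probabilities by random walk sampling, and the toy graph and function names are assumptions made for illustration.

```python
from collections import defaultdict

# graph[node][edge_type] -> list of neighbor nodes (a toy example)
graph = {
    "HinesWard": {"AthletePlaysForTeam": ["Steelers"]},
    "Steelers":  {"TeamPlaysInLeague": ["NFL"]},
    "NFL":       {},
}

def walk_distribution(graph, source, path):
    """Return {t: P(source -> t; path)} via the recursion of Eq. (2)."""
    dist = {source: 1.0}  # empty path phi: all mass stays on the source
    for r in path:        # extend the prefix one edge type at a time
        nxt = defaultdict(float)
        for z, p in dist.items():
            targets = graph.get(z, {}).get(r, [])
            for t in targets:
                nxt[t] += p / len(targets)  # P(z -> t; r) = 1/|r(z)|
        dist = dict(nxt)
    return dist

pi = ["AthletePlaysForTeam", "TeamPlaysInLeague"]
print(walk_distribution(graph, "HinesWard", pi))  # {'NFL': 1.0}
```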

Due to the exponentially large feature space in relational domains, candidate path features are first generated using a dedicated particle filtering path-finding procedure (Lao et al., 2011), which is informed by training signals. Meaningful features are then selected using the following goodness measures, considering path precision and coverage:

precision(π) = (1/n) Σ_i P(s_i → G_i; π),    (3)

coverage(π) = Σ_i I(P(s_i → G_i; π) > 0),    (4)

where P(s_i → G_i; π) ≡ Σ_{t ∈ G_i} P(s_i → t; π). The first measure prefers paths that lead to correct nodes with high average probability. The second measure reflects the number of queries for which some correct node is reached over path π. In order for a path type π to be included in the PRA model, it is required that the respective scores pass thresholds, precision(π) ≥ a and coverage(π) ≥ h, where the thresholds a and h are tuned empirically using training data.
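A minimal sketch of this selection step, assuming a routine like walk_distribution above that returns the full walk distribution for a query node (the threshold defaults and data layout are illustrative assumptions):

```python
def select_paths(candidate_paths, queries, walk_prob, a=0.001, h=2):
    """Keep paths that pass the thresholds of Eqs. (3)-(4).
    queries: list of (s_i, G_i), with G_i the set of correct targets;
    walk_prob(s, path) -> {t: P(s -> t; path)}."""
    n = len(queries)
    selected = []
    for path in candidate_paths:
        # P(s_i -> G_i; path): probability mass landing on correct targets
        mass = [sum(p for t, p in walk_prob(s, path).items() if t in G)
                for s, G in queries]
        precision = sum(mass) / n             # Eq. (3)
        coverage = sum(m > 0 for m in mass)   # Eq. (4)
        if precision >= a and coverage >= h:
            selected.append(path)
    return selected
```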

4 Cor-PRA

We will now describe the enhanced system, which we call Cor-PRA, for the Constant and Reversed Path Ranking Algorithm. Our goal is to enrich the space of relational rules that can be represented using PRA, while maintaining the scalability of this framework.

4.1 Backward random walks

We first introduce backward random walks, which are useful for generating and evaluating the set of proposed relational path types, including paths with constants. As discussed in Sec. 4.4, the use of backward random walks also enables the modeling of long relational paths within Cor-PRA.

A key observation is that the path probability P(s → t; π) may be computed using forward random walks (Eq. (2)), or alternatively, it can be recursively defined in a backward fashion:

P(t ← s; π) = Σ_z P(t ← z; π′⁻¹) P(z ← s; r⁻¹),    (5)

where π′⁻¹ is the path that results from removing the last edge type r⁻¹ from π⁻¹. Here, in the terminal condition that π′⁻¹ = φ, P(t ← z; π′⁻¹) is defined to be 1 for z = t, and 0 otherwise. In what follows, the starting point of the random walk calculation is indicated at the left side of the arrow symbol; i.e., P(s → t; π) denotes the probability of reaching t from s computed using forward random walks, and P(t ← s; π) denotes the same probability, computed in a backward fashion.
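As an illustration of the backward computation, the sketch below starts at the target t, repeatedly peels the last edge type off π while traversing inverted edges, and recovers P(s → t; π) for every source s at once. The inverse-index construction and the toy graph are assumptions made for the example.

```python
from collections import defaultdict

graph = {  # graph[node][edge_type] -> list of neighbors
    "HinesWard":  {"AthletePlaysForTeam": ["Steelers"]},
    "EliManning": {"AthletePlaysForTeam": ["Giants"]},
    "Steelers":   {"TeamPlaysInLeague": ["NFL"]},
    "Giants":     {"TeamPlaysInLeague": ["NFL"]},
}

def invert(graph):
    """Index predecessors: inv[x][r] lists all z with x in r(z)."""
    inv = defaultdict(lambda: defaultdict(list))
    for z, rels in graph.items():
        for r, targets in rels.items():
            for x in targets:
                inv[x][r].append(z)
    return inv

def backward_walk(graph, target, path):
    """Return {s: P(s -> target; path)}, peeling path from its end."""
    inv = invert(graph)
    beta = {target: 1.0}  # empty suffix: all mass on the target itself
    for r in reversed(path):
        prev = defaultdict(float)
        for x, b in beta.items():
            for z in inv[x][r]:                   # predecessors of x under r
                prev[z] += b / len(graph[z][r])   # forward step prob 1/|r(z)|
        beta = dict(prev)
    return beta

pi = ["AthletePlaysForTeam", "TeamPlaysInLeague"]
print(backward_walk(graph, "NFL", pi))
# {'HinesWard': 1.0, 'EliManning': 1.0}, i.e. P(s -> NFL; pi) for each s
```

One pass from the target side thus scores all candidate sources simultaneously, which is what makes the bi-directional scheme of Sec. 4.4 attractive: the cost is dominated by the longer of the forward and backward segments.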

4.2 Relational paths with constants

As stated before, we wish to model relational rules that may include constants, denoting related entities or concepts. The main questions are: how can relational rules with constants be represented as path probability features? And how can meaningful rules with constants be generated and selected efficiently?

In order to address the first question, let us assume that a set of constant nodes {c}, which are known to be useful with respect to relation r, has already been identified. The relationship between each constant c and target node t may be represented in terms of path probability features, P(c → t; π). For example, the rule IsA(t, SportsTeam) corresponds to a path originating at the constant SportsTeam and reaching target node t over a direct edge typed IsA⁻¹. Such paths, which are independent of the source node s, readily represent the semantic type, or other characteristic attributes, of relevant target nodes. Similarly, a feature (c, φ), designating a constant and an empty path, forms a prior for the target node identity.

The remaining question is how to identify meaningful constant features. A priori, candidate constants range over all of the graph nodes, and searching for useful paths that originate at arbitrary constants is generally intractable. Provided with labeled examples, we apply the path-finding procedure for this purpose, where rather than searching for high-probability paths from source node s to target t, paths are explored in a backward fashion, initiating the path search at the known relevant target nodes t ∈ G_i of each labeled query. This process identifies candidate (c, π) tuples, which give high P(c ← t; π⁻¹) values, at bounded computation cost. As a second step, P(c → t; π) feature values are calculated, and useful path features are selected using the precision and coverage criteria. Further details are discussed in Section 4.4.
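The following sketch illustrates the candidate-generation idea: enumerate paths backward from known correct targets and record each node c reached as a potential constant. The real system performs this search with particle filtering; here a tiny exhaustive enumeration, with inverse edge types assumed to be stored explicitly in the graph, stands in.

```python
from collections import defaultdict

def candidate_constants(graph, targets, max_len=2):
    """Enumerate (c, path) tuples reachable backward from correct targets.
    graph[node][edge_type] -> neighbors; inverse edge types are assumed
    to be stored alongside the forward ones."""
    counts = defaultdict(int)
    for t in targets:
        frontier = [(t, ())]  # (current node, edge types walked so far)
        for _ in range(max_len):
            nxt = []
            for node, path in frontier:
                for r, neighbors in graph.get(node, {}).items():
                    for c in neighbors:
                        p = path + (r,)
                        counts[(c, p)] += 1  # coverage-style statistic
                        nxt.append((c, p))
            frontier = nxt
    return counts

graph = {
    "Steelers": {"IsA": ["SportsTeam"]},
    "Giants":   {"IsA": ["SportsTeam"]},
}
print(dict(candidate_constants(graph, ["Steelers", "Giants"], max_len=1)))
# {('SportsTeam', ('IsA',)): 2}: propose the constant c = SportsTeam with
# the feature P(c -> t; <IsA^-1>), matching the rule IsA(t, SportsTeam).
```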

4.3 Bi-directional Random Walk Features

The PRA algorithm only uses features of the form P(s → t; π). In this study we also consider graph walk features in the inverse direction, of the form P(s ← t; π⁻¹). Similarly, we consider both P(c → t; π) and P(c ← t; π⁻¹). While these path feature pairs represent the same logical expressions, the directional random walk probabilities may greatly differ. For example, it may be highly likely for a random walker to reach a target node representing a sports team t from a node s denoting a player over a path π that describes the functional AthletePlaysForTeam relation, but unlikely to reach a particular player node s from the multi-player team t via the reversed path π⁻¹.

In general, there are six types of random walk probabilities that may be modeled as relational features following the introduction of constant paths and inverse path probabilities. The random walk probabilities between s and constant nodes c, P(s → c; π) and P(s ← c; π), do not directly affect the ranking of candidate target nodes, so we do not use them in this study. It is possible, however, to generate random walk features that combine these probabilities with random walks starting or ending at t through conjunction.

Algorithm 1: Cor-PRA Feature Induction

Input: training queries {(s_i, G_i)}, i = 1 . . . n
for each query (s, G) do
    1. Path exploration:
       (i) Apply path-finding to generate paths P_s up to length ℓ that originate at s_i.
       (ii) Apply path-finding to generate paths P_t up to length ℓ that originate at every t_i ∈ G_i.
    2. Calculate random walk probabilities:
       for each π_s ∈ P_s do
           compute P(s → x; π_s) and P(s ← x; π_s⁻¹)
       end for
       for each π_t ∈ P_t do
           compute P(G → x; π_t) and P(G ← x; π_t⁻¹)
       end for
    3. Generate constant path candidates:
       for each (x ∈ N, π_t ∈ P_t) with P(G → x; π_t) > 0 do
           propose path feature P(c ← t; π_t⁻¹), setting c = x, and update its statistics by coverage += 1
       end for
       for each (x ∈ N, π_t ∈ P_t) with P(G ← x; π_t⁻¹) > 0 do
           propose P(c → t; π_t), setting c = x, and update its statistics by coverage += 1
       end for
    4. Generate long (concatenated) path candidates:
       for each (x ∈ N, π_s ∈ P_s, π_t ∈ P_t) with P(s → x; π_s) > 0 and P(G ← x; π_t⁻¹) > 0 do
           propose the long path P(s → t; π_s · π_t⁻¹), and update its statistics by coverage += 1 and precision += P(s → x; π_s) P(G ← x; π_t⁻¹)/n
       end for
       for each (x ∈ N, π_s ∈ P_s, π_t ∈ P_t) with P(s ← x; π_s⁻¹) > 0 and P(G → x; π_t) > 0 do
           propose the long path P(s ← t; π_t · π_s⁻¹), and update its statistics by coverage += 1 and precision += P(s ← x; π_s⁻¹) P(G → x; π_t)/n
       end for
end for
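Step 4 of Alg. 1 is essentially a meet-in-the-middle join: a forward distribution from s and a backward distribution from the correct targets G meet at intermediate nodes x, proposing concatenated paths π_s · π_t⁻¹. The sketch below mirrors that join for a single query; the dictionary layout and toy values are illustrative assumptions.

```python
from collections import defaultdict

def propose_long_paths(fwd, bwd, n):
    """One query's contribution to step 4 of Alg. 1.
    fwd[pi_s] = {x: P(s -> x; pi_s)}; bwd[pi_t] = {x: P(G <- x; pi_t^-1)}.
    Returns {(pi_s, pi_t): [coverage, precision]} increments."""
    stats = defaultdict(lambda: [0, 0.0])
    for pi_s, ps in fwd.items():
        for pi_t, pt in bwd.items():
            meet = set(ps) & set(pt)  # nodes x reached from both ends
            for x in meet:
                stats[(pi_s, pi_t)][0] += 1                  # coverage += 1
                stats[(pi_s, pi_t)][1] += ps[x] * pt[x] / n  # precision +=
    return stats

fwd = {("AthletePlaysForTeam",): {"Steelers": 1.0}}
bwd = {("TeamPlaysInLeague^-1",): {"Steelers": 0.5, "Giants": 0.5}}
print(dict(propose_long_paths(fwd, bwd, n=1)))
# {(('AthletePlaysForTeam',), ('TeamPlaysInLeague^-1',)): [1, 0.5]},
# i.e. propose pi_s . pi_t^-1 = <AthletePlaysForTeam, TeamPlaysInLeague>.
```

In a full run, these per-query increments would be accumulated across all training queries before thresholding.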

4.4 Cor-PRA feature induction and selection

The proposed feature induction procedure is outlined in Alg. 1. Given labeled node pairs, the particle-filtering path-finding procedure is first applied to identify edge type sequences up to length ℓ that originate at either the source nodes s_i or the relevant target nodes t_i (step 1). Bi-directional path probabilities are then calculated over these paths, recording the terminal graph nodes x (step 2). Note that since the set of nodes x may be large, path probabilities are all computed with respect to s or t as starting points. As a result of the induction process, candidate relational paths involving constants are identified and associated with precision and coverage statistics (step 3). Further, long paths up to length 2ℓ are formed between the source and target nodes as the combination of a path π_s from the source side and a path π_t from the target side, updating accuracy and coverage statistics for the concatenated paths π_s π_t (step 4).

Following feature induction, feature selection is applied. First, random walks are performed for all the training queries, so as to obtain complete (rather than sampled) precision and coverage statistics per path. Then relational paths which pass the respective tuned thresholds are added to the model. We found, however, that applying this strategy for paths with constants often leads to over-fitting. We therefore select only the top K constant features in terms of F1 [2], where K is tuned using training examples.

Finally, at test time, random walk probabilities are calculated for the selected paths, starting from either the s or c nodes of each query, since the identity of the relevant targets t is unknown and has to be revealed.

5 Experiments

In this section, we report the results of applying Cor-PRA to the tasks of knowledge base inference and person named entity extraction from parsed text.

We performed 3-fold cross validation experiments, given datasets of labeled queries. For each query node in the evaluation set, a list of graph nodes ranked by their estimated relevancy to the query node s and relation r is generated. Ideally, relevant nodes should be ranked at the top of these lists. Since the number of correct answers is large for some queries, we report results in terms of mean average precision (MAP), a measure that reflects both precision and recall (Turpin and Scholer, 2006).
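For reference, a minimal sketch of this evaluation measure, computed as the mean over queries of average precision (the rankings and answer sets below are toy values):

```python
def average_precision(ranked, relevant):
    """Precision at each rank where a correct node appears, averaged
    over the full relevant set."""
    hits, total = 0, 0.0
    for k, node in enumerate(ranked, start=1):
        if node in relevant:
            hits += 1
            total += hits / k  # precision at this recall point
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(queries):
    """queries: list of (ranked_list, relevant_set) pairs."""
    return sum(average_precision(r, g) for r, g in queries) / len(queries)

print(mean_average_precision([
    (["NFL", "MLB", "NBA"], {"NFL"}),  # AP = 1.0
    (["MLB", "NFL"], {"NFL"}),         # AP = 0.5
]))  # 0.75
```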

The coverage and precision thresholds of Cor-PRA were set to h = 2 and a = 0.001 in all of the experiments, following empirical tuning using a small subset of the training data. The particle filtering path-finding algorithm was applied using the parameter setting w_g = 10⁶, so as to find useful paths with high probability and yet constrain the computational cost.

Our results are compared against the FOIL algorithm [3], which learns first-order Horn clauses. In order to evaluate FOIL using MAP, its candidate beliefs are first ranked by the number of FOIL rules they match. We further report results using Random Walks with Restart (RWR), also known as personalized PageRank (Haveliwala, 2002), a popular random walk based graph similarity measure that has been shown to be fairly successful for many types of tasks (e.g., (Agirre and Soroa, 2009; Moro et al., 2014)). Finally, we compare against PRA, which models relational paths in the form of edge sequences (no constants), using only uni-directional path probabilities, P(s → t; π).

[2] F1 is the harmonic mean of precision and recall, where the latter is defined as coverage divided by the total number of targets in the training queries.
[3] http://www.rulequest.com/Personal/

Table 1: MAP and training time [sec] on KB inference and NE extraction tasks. const_i denotes constant paths up to length i.

                          KB inference          NE extraction
                          Time       MAP        Time       MAP
    RWR                   25.6       0.429      7,375      0.017
    FOIL                  18,918.1   0.358      366,558    0.167
    PRA                   10.2       0.477      277        0.107
    CoR-PRA-no-const      16.7       0.479      449        0.167
    CoR-PRA-const2        23.3       0.524      556        0.186
    CoR-PRA-const3        27.1       0.530      643        0.316


All experiments were run on a machine with a 16-core Intel Xeon 2.33GHz CPU and 24GB of memory. All methods are trained and tested with the same data splits. We report the total training time of each method, measuring the efficiency of inference and induction as a whole.

5.1 Knowledge Base Inference

We first consider relational inference in the context of NELL, a semantic knowledge base constructed by continually extracting facts from the Web (Carlson et al., 2010b). This work uses a snapshot of the NELL knowledge base graph, which consists of ∼1.6M edges of 353 edge types, and ∼750K nodes. Following Lao et al. (2011), we test our approach on 16 link prediction tasks, targeting relations such as Athlete-plays-in-league, Team-plays-in-league, and Competes-with.

Table 1 reports MAP results and training times for all of the evaluated methods. The maximum path length of RWR, PRA, and CoR-PRA is set to 3, since longer path lengths do not result in better MAP. As shown, RWR performance is inferior to PRA; unlike the other approaches, RWR is merely associative and does not involve path learning. PRA is significantly faster than FOIL due to its particle filtering approach to feature induction and inference. It also achieves better MAP performance due to its ability to combine random walk features in a discriminative model.

Figure 2: Path finding time (a) and MAP (b) for the KB inference (top) and name extraction (bottom) tasks. A marker iF + jB indicates the maximum path exploration depth i from query node s and j from target node t, so that the combined path length is up to i + j. No paths with constants were used. [Plots omitted: the four panels chart path finding time (s, log scale) and MAP against maximum path length for settings such as 2F, 3F, 4F, 5F, 1F+1B, 2F+1B, 3F+1B, 4F+1B, 2F+2B, 3F+2B, 4F+2B, 3F+3B.]

Table 1 further displays the evaluation results of several variants of CoR-PRA. As shown, modeling features that encode random walk probabilities in both directions (CoR-PRA-no-const), yet no paths with constants, requires longer training times, but results in slightly better performance compared with PRA. Note that for a fixed path length, CoR-PRA has "forward" features of the form P(s → t; π), the probability of reaching target node t from source node s over path π (similarly to PRA), as well as backward features of the form P(s ← t; π⁻¹), the probability of reaching s from t over the backward path π⁻¹. As mentioned earlier, these probabilities are not the same; for example, a player usually plays for one team, whereas a team is linked to many players.

Performance improves significantly, however, when paths with constants are further added. The table includes our results using constant paths up to length ℓ = 2 and ℓ = 3 (denoted as CoR-PRA-constℓ). Based on tuning experiments on one fold of the data, the K = 20 top-rated constant paths were included in the models [4]. We found that these paths provide informative class priors; example paths and their interpretation are included in Table 2.

[4] MAP performance peaked at roughly K = 20, and gradually decayed as K increased.

Table 2: Example paths with constants learnt for the knowledge base inference tasks. (φ denotes empty paths.)

r = athletePlaysInLeague
    P(mlb → t; φ): bias toward MLB.
    P(boston braves → t; ⟨athletePlaysForTeam⁻¹, athletePlaysInLeague⟩): the leagues played by Boston Braves university team members.

r = competesWith
    P(google → t; φ): bias toward Google.
    P(google → t; ⟨competesWith, competesWith⟩): companies which compete with Google's competitors.

r = teamPlaysInLeague
    P(ncaa → t; φ): bias toward NCAA.
    P(boise state → t; ⟨teamPlaysInLeague⟩): the leagues played by Boise State university teams.


Figure 2(a) shows the effect of increasing the maximal path length on path finding and selection time. The leftmost (blue) bars show the baseline performance of PRA, where only forward random walks are applied. The time spent on path finding clearly grows exponentially with ℓ. Due to memory limitations, we were able to execute forward-walk models only up to 4 steps. The bars denoted by iF + jB show the results of combining forward walks up to length i with backward walks of up to j = 1 or j = 2 steps. Time complexity using bidirectional random walks is dominated by the longest path segment (either forward or backward); e.g., the settings 3F, 3F+1B, and 3F+2B have similar time complexity. Using bidirectional search, we were able to consider relational paths up to length 5. Figure 2(b) presents MAP performance, where it is shown that extending the maximal explored path length did not improve performance in this case. This result indicates that meaningful paths in this domain are mostly short. Accordingly, the path length was set to 3 in the respective main experiments.

5.2 Named Entity Extraction

We further consider the task of named entity extraction from a corpus of parsed texts, following previous work by Minkov and Cohen (2008).

In this case, an entity-relation graph schema is used to represent a corpus of parsed sentences, as illustrated in Figure 3. Graph nodes denoting word mentions (in round-edged boxes) are linked over edges typed with dependency relations. The parsed sentence structures are connected via nodes that denote word lemmas, where every word lemma is linked to all of its mentions in the corpus via the special edge type W. We represent part-of-speech tags as another set of graph nodes, where word mentions are connected to the relevant tag over the POS edge type.
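The following sketch illustrates this graph schema on a toy fragment; the node naming scheme and the sample sentence are assumptions made for the example.

```python
from collections import defaultdict

graph = defaultdict(list)  # graph[node] -> list of (edge_type, neighbor)

def add_edge(u, rel, v):
    """Add a typed edge and its inverse, since all edges are traversable
    in both directions."""
    graph[u].append((rel, v))
    graph[v].append((rel + "^-1", u))

# Toy fragment for the parsed sentence "Bill Gates founded ...":
add_edge("lemma:Gates",   "W",     "tok:1:Gates")    # lemma -> mention
add_edge("lemma:founded", "W",     "tok:1:founded")
add_edge("tok:1:founded", "nsubj", "tok:1:Gates")    # dependency edge
add_edge("tok:1:Gates",   "POS",   "pos:nnp")        # mention -> POS tag
add_edge("tok:1:founded", "POS",   "pos:vbd")

print(graph["tok:1:Gates"])
# [('W^-1', 'lemma:Gates'), ('nsubj^-1', 'tok:1:founded'), ('POS', 'pos:nnp')]

# A walk lemma:founded -W-> tok:1:founded -nsubj-> tok:1:Gates -W^-1->
# lemma:Gates instantiates the path type <W, nsubj, W^-1>, linking a verb
# lemma to the lemmas of its subjects.
```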

In this graph, task-specific word similarity measures can be derived based on the lexico-syntactic paths that connect word types (Minkov and Cohen, 2014). The task defined in the experiments is to retrieve a ranked list of person names given a small set of seeds. This task is implemented in the graph as a query, where we let the query distribution be uniform over the given seeds (and zero elsewhere). That is, our goal is to find target nodes that are related to the query nodes over the relation r = similar-to, or coordinate-term. We apply link prediction in this case with the expected result of generating a ranked list of graph nodes which is populated with many additional person names. The named entity extraction task we consider is somewhat similar to the one adopted by FIGER (Ling and Weld, 2012), in that a finer-grain category is being assigned to proposed named entities. Our approach, however, follows set expansion settings (Wang and Cohen, 2007), where the goal is to find new instances of the specified type from parsed text.

In the experiments, we use the training set portion of the MUC-6 data set (MUC, 1995), represented as a graph of 153K nodes and 748K edges. We generated 30 labeled queries, each comprised of 4 person names selected randomly from the person names mentioned in the data set. The MUC corpus is fully annotated with entity names, so that relevant target nodes (other person names) were readily sampled. Extraction performance was evaluated considering the tagged person names which were not included in the query as the correct answer set. The maximum path length of RWR, PRA, and CoR-PRA is set to 6 due to memory limitations.

Table 1 shows that PRA is much faster than RWR or FOIL on this data set, giving competitive MAP performance to FOIL. RWR is generally ineffective on this task, because similarity in this domain is represented by a relatively small set of long paths, whereas RWR expresses local node associations in the graph (Minkov and Cohen, 2008).

Figure 3: Part of a typed graph representing a corpus of parsed sentences. [Figure omitted: token nodes such as BillGates, SteveJobs, founded, and CEO are linked by dependency edges (nsubj, appos), to word/lemma nodes over W edges, and to part-of-speech nodes (nnp, vbd) over POS edges.]

Table 3: Highly weighted paths with constants learnt for the person name extraction task.

    P(said ← t; ⟨W⁻¹, nsubj, W⟩),
    P(says ← t; ⟨W⁻¹, nsubj, W⟩): the subjects of 'said' or 'say' are likely to be a person name.

    P(vbg ← t; ⟨POS⁻¹, nsubj, W⟩),
    P(nnp ← t; ⟨POS⁻¹, W⟩),
    P(nn ← t; ⟨POS⁻¹, appos⁻¹, W⟩),
    P(nn ← t; ⟨POS⁻¹, poss, W⟩): subjects, proper nouns, and nouns with apposition or possessive constructions are likely to be person names.

Modeling inverse path probabilities improves performance substantially, and adding relational features with constants boosts performance further. The constant paths learned encode lexical features, as well as provide useful priors, mainly over different part-of-speech tags. Example constant paths that were highly weighted in the learned models, and their interpretation, are given in Table 3.

Figure 2(c) shows the effect of modeling long relational paths using bidirectional random walks in the language domain. Here, forward path finding was applied to paths up to length 5 due to memory limitations. The figure displays the results of exploring paths up to a total length of 6 edges, performing backward search from the target nodes of up to j = 1, 2, 3 steps. MAP performance (Figure 2(d)) using paths of varying lengths shows significant improvements as the path length increases. Top weighted long features include:

P(s → t; ⟨W⁻¹, conj_and⁻¹, W, W⁻¹, conj_and, W⟩)
P(s → t; ⟨W⁻¹, nn, W, W⁻¹, appos⁻¹, W⟩)
P(s → t; ⟨W⁻¹, appos, W, W⁻¹, appos⁻¹, W⟩)

These paths are similar to the top ranked paths found in previous work (Minkov and Cohen, 2008). In comparison, their results on this dataset using paths of up to 6 steps measured 0.09 in MAP. Our results reach roughly 0.16 in MAP due to the modeling of inverse paths; and, when constant paths are incorporated, MAP reaches 0.32.

Interestingly, in this domain FOIL generates fewer yet more complex rules, which are characterized by low recall and high precision, such as: W(B, A) ∧ POS(B, nnp) ∧ nsubj(D, B) ∧ W(D, said) ∧ appos(B, F) → person(A). Note that subsets of these rules, namely POS(B, nnp), nsubj(D, B) ∧ W(D, said), and appos(B, F), have been discovered by PRA as individual features assigned high weights (Table 3). This points to interesting future work, where products of random walk features could be used to express their conjunctions.

6 Conclusion

We have introduced CoR-PRA, extending an existing random walk based relational learning paradigm to consider relational paths with constants, bi-directional path features, as well as long paths. Our experiments on knowledge base inference and person name extraction tasks show significant improvements over previously published results, while maintaining efficiency. An interesting future direction is to use products of these random walk features to express their conjunctions.

Acknowledgments

We thank the reviewers for their helpful feedback. This work was supported in part by BSF grant No. 2010090 and a grant from Google Research.

References

Eneko Agirre and Aitor Soroa. 2009. Personalizing PageRank for word sense disambiguation. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL).

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247–1250. ACM.

A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. Hruschka Jr., and T. Mitchell. 2010a. Toward an architecture for never-ending language learning. In AAAI.

Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr., and Tom M. Mitchell. 2010b. Toward an architecture for never-ending language learning. In AAAI.

Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. 2014. Knowledge vault: a web-scale approach to probabilistic knowledge fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14), pages 601–610.

Matt Gardner, Partha Pratim Talukdar, Bryan Kisiel, and Tom Mitchell. 2013. Improving learning and inference in a large knowledge-base using latent syntactic cues. In EMNLP.

Matt Gardner, Partha Talukdar, Jayant Krishnamurthy, and Tom Mitchell. 2014. Incorporating vector space similarity in random walk inference over knowledge bases. In EMNLP.

Taher H. Haveliwala. 2002. Topic-sensitive PageRank. In WWW, pages 517–526.

Ondrej Kuzelka and Filip Zelezny. 2008. A restarted strategy for efficient subsumption testing. Fundamenta Informaticae, 89(1):95–109.

Ondrej Kuzelka and Filip Zelezny. 2009. Block-wise construction of acyclic relational features with monotone irreducibility and relevancy properties. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09), pages 569–576. ACM.

Ni Lao and William W. Cohen. 2010a. Fast query execution for retrieval models based on path-constrained random walks. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '10), pages 881–888. ACM.

Ni Lao and William W. Cohen. 2010b. Relational retrieval using a combination of path-constrained random walks. Machine Learning, 81:53–67.

Ni Lao, Tom M. Mitchell, and William W. Cohen. 2011. Random walk inference and learning in a large scale knowledge base. In EMNLP, pages 529–539.

Ni Lao, Amarnag Subramanya, Fernando Pereira, and William W. Cohen. 2012. Reading the web with learned syntactic-semantic inference rules. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL '12), pages 1017–1026. Association for Computational Linguistics.

X. Ling and D. S. Weld. 2012. Fine-grained entity recognition. In Proceedings of the 26th Conference on Artificial Intelligence (AAAI).

Einat Minkov and William W. Cohen. 2008. Learning graph walk based similarity measures for parsed text. In EMNLP.

Einat Minkov and William W. Cohen. 2014. Adaptive graph walk-based similarity measures for parsed text. Natural Language Engineering, 20(3).

Andrea Moro, Alessandro Raganato, and Roberto Navigli. 2014. Entity linking meets word sense disambiguation: a unified approach. Transactions of the Association for Computational Linguistics (TACL), 2.

MUC. 1995. MUC6 '95: Proceedings of the 6th Conference on Message Understanding. Association for Computational Linguistics, Stroudsburg, PA, USA.

Michael Pazzani, Cliff Brunk, and Glenn Silverstein. 1991. A knowledge-intensive approach to learning relational concepts. In Proceedings of the Eighth International Workshop on Machine Learning, pages 432–436. Morgan Kaufmann.

J. Ross Quinlan and R. Mike Cameron-Jones. 1993. FOIL: a midterm report. In ECML, pages 3–20.

B. L. Richards and R. J. Mooney. 1991. First-order theory revision. In Proceedings of the 8th International Workshop on Machine Learning, pages 447–451. Morgan Kaufmann.

Michele Sebag and Celine Rouveirol. 1997. Tractable induction and classification in first order logic via stochastic matching. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI '97), pages 888–893. Morgan Kaufmann.

Ashwin Srinivasan. 2001. The Aleph Manual. http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/Aleph/.

F. Suchanek, G. Kasneci, and G. Weikum. 2007. YAGO: a core of semantic knowledge. In WWW.

Andrew Turpin and Falk Scholer. 2006. User performance versus precision measures for simple search tasks. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR).

Richard C. Wang and William W. Cohen. 2007. Language-independent set expansion of named entities using the web. In Proceedings of the IEEE International Conference on Data Mining (ICDM).

William Yang Wang, Kathryn Mazaitis, and William W. Cohen. 2013. Programming with personalized PageRank: a locally groundable first-order probabilistic logic. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM 2013).

Filip Zelezny and Nada Lavrac. 2006. Propositionalization-based relational subgroup discovery with RSD. Machine Learning, 62(1-2):33–63.