
Page 1:

Joint work with David Belanger, Sameer Singh, Alexandre Passos, Brian Martin, Michael Wick, Sebastian Riedel, and Limin Yao.

Andrew McCallum
Department of Computer Science
University of Massachusetts Amherst

Joint Inference & FACTORIE 1.0

Page 2:

Natural Language Ambiguity

"She saw the man with the telescope."

Parsing. Which PP attachment?

Page 3:

Natural Language Ambiguity

"She saw the man with the telescope."
Parsing. Which PP attachment?

"Stolen painting found by tree." (AGENT or LOC?)
Semantic Role Labeling. Which semantic role?

Page 4:

Natural Language Processing

Dialog

Semantic Role Labeling

Coreference Resolution

Relation Extraction

Entity Recognition

Parsing

Classification (POS, NER)

Segmentation (word, phrase)

80-90% on many of these tasks.

But when combined, errors cascade:

(0.9)^6 ≈ 0.53

Page 5:

Unified Natural Language Processing

Dialog

Semantic Role Labeling

Coreference Resolution

Relation Extraction

Entity Recognition

Parsing

Classification (POS, NER)

Segmentation (word, phrase)

There'll be no escape for the Princess this time.

Page 6:

Unified Computer Vision

Video Understanding

Object Tracking

Scene Understanding

Object Recognition

Object Segmentation

Depth Perception

Part Segmentation

Line Detection

Page 7:

Unified Robotics

Planning

Manipulation

Grasping

Object Recognition

State Estimation

Localization

Locomotion

Calibration

Page 8:

Joint Inference

Fundamental issue in all Artificial Intelligence.

(Learning is fundamental too, but more under control, I’d argue.)

Page 9:

That was inspirational. But now for Pessimism.

• Often a lot of trouble for moderate gains
- Sutton & McCallum 2007, loopy BP, FCRF: 96.9% → 97.3% (+0.4%)
- Finkel & Manning 2008, forward sampling: 78.5% → 79.3% (+0.7%)

• CoNLL 2009 Shared Task Summary: "The best systems overall do not use joint learning or optimization."

Page 10:

Outline

• New algorithms for joint inference

- Guaranteed-optimal “beam search” by Column Generation [NIPS 2012]

- Joint inference with sparse values & factors [Submitted 2013]

- Compressing messages with Expectation Propagation [TR 2012]

- Focussed Query-specific MCMC Sampling [NIPS 2011]

- Joint reasoning about relations of “universal schema” [AKBC 2012]

• FACTORIE: probabilistic programming support for joint inference

- Landscape of tools
- Capabilities
- Release 1.0

Page 11:

Beam Search

• Message passing for finite-state inference with "sparsified" values, to be faster.

• Widely employed.

• Approximate. Non-optimal.

• Our new work: as fast as beam search, but guaranteed optimal. The answer is always identical to exact Viterbi.
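To make the baseline concrete, here is a minimal beam-search decoder for a linear chain. The `emit`/`trans` score arrays and the toy run are invented for illustration; the point is that pruning to the top-k label values per position is fast but can discard the true optimum, whereas the column-generation method on the following slides keeps beam-like speed while always returning the exact Viterbi answer.

```python
import numpy as np

def beam_viterbi(emit, trans, k=3):
    """MAP decoding in a linear chain that keeps only the top-k label values
    per position (plain beam search: approximate, unlike exact Viterbi)."""
    n, L = emit.shape                      # n positions, L label values
    beam = {int(y): emit[0, y] for y in np.argsort(-emit[0])[:k]}
    back = []
    for i in range(1, n):
        scores, ptr = {}, {}
        for y in range(L):
            # best way to reach label y from any label kept in the previous beam
            prev, s = max(((yp, sp + trans[yp, y]) for yp, sp in beam.items()),
                          key=lambda t: t[1])
            scores[y] = s + emit[i, y]
            ptr[y] = prev
        keep = sorted(scores, key=scores.get, reverse=True)[:k]   # prune to width k
        beam = {y: scores[y] for y in keep}
        back.append(ptr)
    y = max(beam, key=beam.get)            # trace back the best surviving path
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    return path[::-1]

# toy run: 5 tokens ("Time flies like an arrow"), 6 candidate labels, random scores
rng = np.random.default_rng(0)
print(beam_viterbi(rng.normal(size=(5, 6)), rng.normal(size=(6, 6)), k=2))
```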

Pages 12-18:

Inference in linear chains

[Figure, shown as a sequence of animation frames: a linear-chain graphical model for POS tagging of "Time flies like an arrow"; one POS-label variable per observed word, with candidate POS label values DET, NN, VB, ADJ, ADV, PP.]

Pages 19-23:

Beam Search

[Figure, animation frames: the same linear-chain model; beam search keeps only a pruned subset of the POS label values at each position as inference proceeds.]

Pages 24-28:

Message Passing with Column Generation [Belanger, Passos, Riedel, McCallum, NIPS 2012]

[Figure, animation frames: the same linear-chain model; column generation efficiently considers only the needed expansions of the per-position label-value sets.]

Related to Joshi's "Bidirectional Incremental Construction"

Page 29:

Constrained Optimization: start with just a subset of constraints

gradient direction = c

Page 30:

Constrained Optimization, Cutting Planes: find violated constraints, add just those

gradient direction = c

Page 31:

gradient direction = c

"Column Generation": analogous idea for "values" instead of "constraints"

Constrained Optimization, Cutting Planes: find violated constraints, add just those
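As a concrete toy version of the loop sketched on these two slides, the snippet below maximizes c·x over a box, starting from a small subset of the constraints A x ≤ b and adding only the ones the current solution violates. The random A, b, c are made-up stand-ins; column generation, as the slide notes, is the same trick applied to variables (columns) instead of constraints.

```python
import numpy as np
from scipy.optimize import linprog

# Full constraint set A x <= b; start with only a few rows and add violated
# ones lazily (cutting planes).  Column generation is the dual trick: start
# with a few variables and add useful ones lazily.
rng = np.random.default_rng(0)
n, m = 5, 40
A = rng.normal(size=(m, n))
b = rng.uniform(1.0, 2.0, size=m)
c = rng.normal(size=n)                      # maximize c.x  ("gradient direction = c")

active = list(range(3))                     # start with just a subset of constraints
while True:
    res = linprog(-c, A_ub=A[active], b_ub=b[active], bounds=[(-10, 10)] * n)
    x = res.x
    violation = A @ x - b                   # violation of every constraint in the full set
    worst = int(np.argmax(violation))
    if violation[worst] <= 1e-8:            # nothing violated: x is optimal for the full LP
        break
    active.append(worst)                    # add just that violated constraint

print("solved with", len(active), "of", m, "constraints")
```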

Page 32:

Linear Programming formulation of Viterbi (message passing in chains)

New constraints
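The slide's own equations did not survive extraction. For context, a standard way to write Viterbi as a linear program uses per-position and per-edge "marginal" variables μ and model scores θ (my notation, not necessarily the slide's); on a chain this LP has integral optima, so solving it exactly recovers the Viterbi path, and column generation adds μ-columns (label values) only as they are needed:

```latex
\begin{aligned}
\max_{\mu \ge 0}\quad
  & \sum_{i}\sum_{y} \theta_i(y)\,\mu_i(y)
    \;+\; \sum_{i}\sum_{y,y'} \theta_{i,i+1}(y,y')\,\mu_{i,i+1}(y,y') \\
\text{s.t.}\quad
  & \sum_{y} \mu_i(y) = 1 \qquad \forall i, \\
  & \sum_{y'} \mu_{i,i+1}(y,y') = \mu_i(y), \qquad
    \sum_{y} \mu_{i,i+1}(y,y') = \mu_{i+1}(y') \qquad \forall i,\, y,\, y'.
\end{aligned}
```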

Page 33:

Experiments

[Figure: accuracy and speed plots; WSJ POS tagging, 45 labels.]

[Belanger, Passos, Riedel, McCallum, NIPS 2012]

Page 34:

Multi-core BP

[Figure 1: Parallel Inference in Chains. (a) Exact Parallel Inference; (b) Parallel Inference with Induced Sparsity. Arrows with the same style represent messages that are independent of each other and are required to be assigned to the same core. By inducing sparsity on the values of the shaded variables (b), up to 6 cores can be used in parallel, as opposed to the restriction of using only 2 in exact inference (a).]

To incorporate this approximation into the exact algorithm outlined in Section 2, we modify it as follows. At any point during inference, the marginal of a variable i may become peaked (according to ζ), in which case we do the following. First, the marginal m_i (and m_i^{f/b}) is set to the deterministic distribution. Second, we add two messages to the queue: a forward pass to variable i+1 and a backward pass to variable i−1. Further, for every message computation, we do not add the next messages to the queue if the neighboring variable is deterministic. Note that when ζ = 1, none of the variables become deterministic, and the approach is equivalent to exact inference.

This approach can significantly improve performance even on a single core. Even though the total number of messages is still 2n, the computation of the messages is much faster for variables that neighbor a deterministic variable. In particular, the summation in Eq 3 (or 4) reduces to a single value since m^f_{i−1} (or m^b_{i+1}) is non-zero only for a single value.

Parallelism: The computations that are added to the queue are independent of each other, and thus can be computed simultaneously using multiple threads. For exact inference, the queue is initialized with only 2 computations, and each step consumes a single message and adds a single message, thereby limiting the size of the queue to 2. This results in a speedup of 2 when using 2 cores, and increasing the number of cores does not provide a benefit.

With our approximation, steps that result in a deterministic variable consume a single computation but potentially add two more computations. The queue can therefore grow to sizes > 2, thus allowing multiple cores to be utilized to provide speedups. Alternatively, we can view the deterministic variables as "islands of certainty" that each split the chain into two. This results in multiple forward and backward passes that are independent of each other. See Figure 1b for an example.

Revisiting Sparsity: Marginals of a variable may vary considerably over the course of inference, and committing to a deterministic value at an early stage of inference can lead to substantial errors. We allow revisiting of our sparsity decisions by storing the actual marginals as a separate value m′_i. After every message propagation, we update m′_i and compare the resulting distribution probabilities with ζ. If the probability of all the values is < ζ, we unsparsify the variable by setting m_i to m′_i. We also modify the deterministic value if the mode of m′_i is different from m_i and the probability of the mode is ≥ ζ.

Extension to Tree-shaped Models: Although we describe this work on linear-chain models, the algorithm generalizes to the case of the tree. Exact inference in the tree is defined by selecting a root, and performing message passing in upstream and downstream passes. However, the queue size is not limited to 2; in particular, the beginning of the upstream pass and the end of the downstream pass can grow the queue considerably. Thus we can achieve speedups > 2 even for exact inference (see [12]). When including our approximation, we expect further speedups. Specifically, as the messages propagate up the tree, the size of the queue shrinks in exact inference. In our approach, we are splitting the tree into multiple smaller trees, thus providing potential for utilizing multiple cores at every stage of inference. However, we do not expect the speedups to be as significant as in chains.

4 Experiments

To study the trade-offs provided by our approach, we run the inference algorithm on synthetic and real-world problems, described in the following sections.

Synthetic Chains: As described above, the speedup provided by multiple cores depends on the size of the queue of computations, i.e. linear speedups can be obtained if the queue size is always greater than the number of cores. However, the queue size doesn't depend on ζ; instead it depends on which variables turn deterministic during inference. To study the behavior of our system as we vary the [...]

4 threads, 5x faster

[Singh, Martin, McCallum, NIPS WS 2011]
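The excerpt above describes a queue-based parallel algorithm; the sketch below only illustrates its central observation, the "islands of certainty": once a position's marginal is peaked enough (≥ ζ) it can be clamped, which splits the chain into segments that can be processed independently. Emissions, transitions, and the clamped positions are made up, and plain max-product decoding stands in for the paper's message passing.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def viterbi(emit, trans):
    """Plain max-product decoding for one chain segment."""
    n, L = emit.shape
    score, back = emit[0].copy(), []
    for i in range(1, n):
        s = score[:, None] + trans              # s[prev_label, label]
        back.append(s.argmax(0))
        score = s.max(0) + emit[i]
    path = [int(score.argmax())]
    for b in reversed(back):
        path.append(int(b[path[-1]]))
    return path[::-1]

def decode_with_islands(emit, trans, clamped):
    """Positions in `clamped` (position -> label) act as islands of certainty:
    the chain splits into independent segments that are decoded in parallel."""
    n, _ = emit.shape
    segments, start = [], 0
    for c in sorted(clamped):
        if start < c:
            seg = emit[start:c].copy()
            if start - 1 in clamped:            # condition on the clamped left neighbour
                seg[0] += trans[clamped[start - 1]]
            seg[-1] += trans[:, clamped[c]]     # condition on the clamped right neighbour
            segments.append((start, seg))
        start = c + 1
    if start < n:
        seg = emit[start:].copy()
        if start - 1 in clamped:
            seg[0] += trans[clamped[start - 1]]
        segments.append((start, seg))

    out = dict(clamped)
    with ThreadPoolExecutor() as pool:
        for (s, _), path in zip(segments, pool.map(lambda t: viterbi(t[1], trans), segments)):
            out.update({s + i: y for i, y in enumerate(path)})
    return [out[i] for i in range(n)]

rng = np.random.default_rng(0)
emit, trans = rng.normal(size=(12, 5)), rng.normal(size=(5, 5))
print(decode_with_islands(emit, trans, clamped={3: 2, 8: 0}))
```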

Page 35:

Experiments

[Figure: accuracy and speed plots; WSJ POS tagging, 45 labels.]

[Belanger, Passos, Riedel, McCallum, NIPS 2012]

Page 36:

Outline

• New algorithms for joint inference

- Guaranteed-optimal “beam search” by Column Generation [NIPS 2012]

- Joint inference with sparse values & factors [Submitted 2013]

- Compressing messages with Expectation Propagation [TR 2012]

- Focussed Query-specific MCMC Sampling [NIPS 2011]

- Joint reasoning about relations of “universal schema” [AKBC 2012]

• FACTORIE: probabilistic programming support for joint inference

- Landscape of tools
- Capabilities
- Release 1.0

Page 37:

Joint Entities, Relations, Coref

[Figure 1: Information Extraction. The sentence "Schumacher has a contract with the Italian team through 2002." with 3 mentions labeled with entity types (Person, Commercial Org, Nation), relations (Based-In, Employ-Staff), and coreference links (blue) to three other sentences from the document: "Michael Schumacher is still celebrating Ferrari's ...", "Ferrari leads McLaren by 13 points.", "I'm still young."]

[...] et al., 2008) the top five systems in the closed challenge consisted of pipeline approaches. Even when positive results are demonstrated, as in Finkel and Manning (2009) and Kate and Mooney (2010), they are modest. Comparatively, we achieve an entity tagging error reduction of 12.4% for the joint model. Further, our results show consistent improvement as more tasks are included in the model, demonstrating the utility of joint inference.

2 Isolated Models

In this section, we present a brief background on probabilistic graphical models, before describing the three tasks of our interest, and laying out the model and the features commonly used to represent them.

2.1 Graphical Models

Graphical models define a family of probability distributions that factorize according to the dependencies encoded in the graph structure. Factor graphs are commonly used to represent undirected graphical models (Kschischang et al., 2001). Formally, a factor graph G is a bipartite graph with variable nodes y and factor nodes ψ = {ψ_i}. The probability distribution can be written as p(y) = (1/Z) ∏_{ψ_i ∈ ψ} ψ_i, where Z = Σ_y ∏_{ψ_i ∈ ψ} ψ_i is the partition function that ensures the probabilities sum to one. Factors are often defined as a log-linear combination of the inner product of sufficient statistics (or feature functions) f_k and model parameters θ. Graphical models are a popular tool for representing NLP tasks. We use the above notation in the rest of the text.
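As a tiny concrete instance of this notation (made-up numbers, not the paper's model): two binary variables, one log-linear unary factor each plus one pairwise "agreement" factor, with the partition function Z computed by brute force.

```python
import math

def psi(theta, feats):
    # log-linear factor: psi = exp(theta . f)
    return math.exp(sum(t * f for t, f in zip(theta, feats)))

theta_unary = [1.2, -0.3]    # weights for features [y = 1, y = 0]
theta_pair = [0.8]           # weight for the agreement feature

def unnormalized(y1, y2):
    score = psi(theta_unary, [y1, 1 - y1]) * psi(theta_unary, [y2, 1 - y2])
    return score * psi(theta_pair, [1.0 if y1 == y2 else 0.0])

Z = sum(unnormalized(a, b) for a in (0, 1) for b in (0, 1))   # partition function
print({(a, b): round(unnormalized(a, b) / Z, 3) for a in (0, 1) for b in (0, 1)})
```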

2.2 Entity Tagging

Entity tagging is the task of classifying each entity mention according to the type of entity to which it refers. The input for this task is the set of mention boundaries and the sentences of a document. For each mention m_i, the output of entity tagging is a label t_i from a predefined set of labels T. The specific set of labels T that is used depends on the domain of interest. For example, the set of labels used in ACE consists of PERSON, ORGANIZATION, GEO-POLITICAL, LOCATION, FACILITY, VEHICLE, and WEAPON.

As an example of entity tagging, consider the sentence "A cap for a depressurization valve floated [...]

Page 38:

Fully Joint Factor Graph

[Figure 2: Joint Model of Entity Tagging, Coreference Resolution and Relation Extraction, shown over 3 observed entity mentions (m0, m1, m2), two of which belong to the same sentence (m0 and m2). Variables: mention types t0, t1, t2; a relation r02 between mentions; pairwise coreference decisions c01, c02, c12 linked by a transitivity factor. Factors: ψ_T(0), ψ_T(1), ψ_T(2); ψ_C(01), ψ_C(02), ψ_C(12); ψ_LR(02); ψ_JR(02). There is no direct linkage between relation extraction and coreference. Note that for brevity, we have only included (i, j) in the factor labels, and have omitted t_i, t_j, m_i and m_j.]

[...] all the variables and factors into a single graphical model. To elaborate, the resulting model has the same variables and features as the individual models. There exist entity tag variables, relation label variables, and pairwise coreference decision variables. The factors of the joint model are also the set of factors instantiated by the individual models, as described in Sections 2.2, 2.3, and 2.4. See Figure 2 for an illustration of the joint model as defined over 3 mentions, two of which are in the same sentence. Note that even with such a small set of mentions, the underlying joint model is quite complex and dense. Variable counts over the train, test, and development sets are provided in Table 5.

Formally, the probability for a setting of all the variables in a document is:

p(t, r, c | m) ∝ ∏_{t_i ∈ t} ψ_T(m_i, t_i) ∏_{c_ij ∈ c} ψ_C(c_ij, m_i, m_j, t_i, t_j) ∏_{r_ij ∈ r} ψ_LR(m_i, m_j, r_ij) ψ_JR(r_ij, m_i, m_j, t_i, t_j)    (1)

The semantics of these factors are different from earlier. Instead of representing a distribution over the labels of a single task conditioned on the predictions from another task, these factors now directly represent the joint (unnormalized) distribution over the tasks that they are defined over. For example, in Section 2.4, each coreference factor defines a distribution over the pairwise boolean coreference variable given the entity tags of the mentions. In the joint model, however, this factor induces a distribution over both the pairwise boolean variable and the [...]

Inference with dynamically sparsified values and factors.

Page 39:

Experimental Results

                Entity Tagging   Relation Extraction   Entity Resolution
    Pipeline         80%                 54%                  58%
    Joint            83%                 55%                  79%

    Source    #Mentions    #Coreference    #Relation
    Train       15,640        637,160        82,479
    Dev          5,545        244,461        34,057
    Test         6,598        342,942        38,270

Table 5: Number of variables of each type in ACE train, test, and development sets.

via Wikipedia. Similar to our work, Yu and Lam (2010) also model entities and relations using a discriminative model.

Others have combined parsing with NER (Finkel and Manning, 2009) and semantic role labeling (Sutton and McCallum, 2005b) with mixed success. Finkel et al. (2006) represent a pipeline of NLP tasks as a Bayesian network where each variable represents one stage of the pipeline. Joint inference has also been applied to various information extraction tasks such as citation segmentation and matching (Wellner et al., 2004; Poon and Domingos, 2007; Singh et al., 2009) and to BioNLP (Poon and Vanderwende, 2010; Riedel and McCallum, 2011).

Our approach to joint inference differs significantly from those above. First, we are modeling three crucial information extraction tasks, including coreference, which have not been modeled together before. Coreference as the third task requires document-level joint inference, as opposed to the sentence-level joint inference in related work. The difficulty of inference is further compounded by transitivity. Second, our resulting model is significantly more loopy than those handled by a number of existing joint inference techniques. Due to its size and structure, most approximate inference techniques are not directly applicable, and we present a novel efficient inference technique that is a modification of belief propagation. Third, as opposed to some of the related work, we learn both hard and soft constraints between tasks instead of setting them by hand (as in Roth and Yih (2007)), and our inference provides marginals. Due to the dependencies represented in our model, and our inference technique, we are able to obtain consistent improvements in all three tasks, improving accuracy as we include more dependencies.

6 Experiments

We use the Automatic Content Extraction (ACE; Doddington et al. (2004)) 2004 English dataset for the experiments, a standard labeled corpus for the three tasks that we are studying. ACE consists of 443 documents from 4 distinct news domains: broadcast, newswire, Arabic treebank, and Chinese treebank. The total number of sentences and tokens are 7,790 and 172,506 respectively. We split the data into train, test, and development sets of sizes 60%, 20%, and 20%. Counts of each type of variable are shown in Table 5.

Our model takes the complete mention string as input, and for these experiments we use the gold mention boundaries.[4] Due to the complexity of inference and the size and density of the model, we restrict the set of labels we explore for both entity tags and relation extraction to the coarse-grained types (7 and 8 respectively).

[4] Using predicted mention boundaries remains future work, although we imagine boundary detection can be yet another task for joint inference.

Number of variables of each type

50% reduction in error!
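The headline figure refers to the entity resolution column of the table above:

```latex
\mathrm{err}_{\text{pipeline}} = 1 - 0.58 = 0.42, \qquad
\mathrm{err}_{\text{joint}} = 1 - 0.79 = 0.21, \qquad
\frac{0.42 - 0.21}{0.42} = 0.50 .
```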

Page 40:

Outline

• New algorithms for joint inference

- Guaranteed-optimal “beam search” by Column Generation [NIPS 2012]

- Joint inference with sparse values & factors [Submitted 2013]

- Compressing messages with Expectation Propagation [TR 2012]

- Focussed Query-specific MCMC Sampling [NIPS 2011]

- Joint reasoning about relations of “universal schema” [AKBC 2012]

• FACTORIE: probabilistic programming support for joint inference

- Landscape of tools
- Capabilities
- Release 1.0

Page 41:

BP with EP Message Compression [Singh, Riedel, McCallum, 2012]

[Figure 1: Factor graph representations of modules, reformulations and approximating families. (a) Factorial chain between two modules; (b) equivalent factorial chain (double edges correspond to hard equality constraints); (c) structure of the approximating family.]

[...]lated using the update rule:

τ̃_{u→d}(z) ∝ Σ_{x_u} φ_u(x_u, z) τ_{d→u}(z);    τ_{u→d} ∝ τ̃_{u→d} / τ_{d→u}    (3)

where the proportionality absorbs a normalizing constant. We use the second step in equation 3 (that divides out the previous incoming message) to simplify the transition to Expectation Propagation. Note that this step is not required for exact messages. Using τ̃_{u→d} we can estimate marginals p(y_i^d) under the joint model using φ_u(x_u, z) τ_{d→u}(z). The upstream pass is performed analogously:

τ̃_{d→u}(z) ∝ Σ_{y_d} φ_d(z, y_d) τ_{u→d}(z);    τ_{d→u} ∝ τ̃_{d→u} / τ_{u→d}    (4)

2.2 Expectation Propagation

For anything but very small sets of shared variables, GBP is intractable, as messages require memory and runtime exponential in the number of these variables (and their domain sizes). This suggests that we find an approximate and more compact representation of messages. Expectation Propagation [Minka, 2001] is a framework in which messages attempt to capture a set of expectations of τ_{d→u} · φ_u. One way to frame EP's update rule is:

μ_u ← E_{τ_{d→u}·φ_u}[ψ];    τ_{u→d} ∝ match(μ_u) / τ_{d→u}    (5)

Here we first calculate moments μ_u of the statistics ψ(Z) over τ_{d→u} · φ_u. Then we estimate a distribution match(μ_u) that matches these moments, and divide out the previous message. The upstream pass is performed analogously. Different message representations arise from different definitions of the statistics ψ(Z).

2.2.1 Sparse and Merged Messages

Using merged and sparse messages, the EP analogs to k-best lists, provides an attractive approximation scheme. The statistics are designed to match the probabilities of some small set of states L_u ⊆ dom(Z), and so ψ = {ψ_{z_u}}_{z_u ∈ L_u} with ψ_{z_u}(z) = δ_{z, z_u}, where δ_{i,j} is the Kronecker delta. We often choose L_u to cover the set of k most likely states under τ_{d→u} · φ_u, but we could also sample the list. For the merged messages, the matching distribution match(μ_u) assigns the probability μ_u^z = E[ψ_z] to each z ∈ L_u, and then distributes the remaining mass equally among the states outside of L_u. Sparse approximations do not match these expectations exactly; instead they assign zero probability outside of L_u, and renormalize probabilities within. Both sparse and merged messages can lead to substantial runtime and memory usage reductions, as marginalization depends on |L_u| instead of |dom(Z)|.

2.2.2 Belief Propagation

More coverage of the exact message can be achieved if we resort to a factorized approximation. In terms of EP, this means that instead of matching global statistics, we estimate and seek to match per-variable expectations. That is, we set ψ = {ψ_{i;y}} with ψ_{i;y}(z) = δ_{z_i,y} and then find a message from the family of fully factorized distributions

τ_{u→d}(z) ∝ exp{ Σ_i Σ_{y ∈ dom(Z_i)} δ_{z_i,y} π_{i,y} }    (6)

to match the moments E[ψ]. It can be shown that using this approximation family corresponds to running regular Belief Propagation in the underlying graphical model of φ_u φ_d.

The BP approximation can capture substantially more of the uncertainty than sparse message approaches like k-best lists. However, it adversely affects the downstream component in several ways. First, when the downstream component consumes a k-best list, marginalization is efficient: for each state in the list, perform inference conditioned on the state. This is particularly attractive when the module d has low tree-[width ...]

Cut complex graph into easier-to-manage parts.

Coordinate by passing messages among the parts

Page 42:

BP with EP Message Compression [Singh, Riedel, McCallum, 2012]

[Same paper excerpt as on Page 41.]

Message size exponential in # variables

Page 43:

BP with EP Message Compression [Singh, Riedel, McCallum, 2012]

[Same paper excerpt as on Page 41.]

Compress by using a graphical model to represent the message.

Expectation Propagation [Minka 2001] gives us a framework for compression.
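A small sketch of that compression in the fully factorized case of Section 2.2.2 above: instead of sending the exponentially large joint message over the shared variables, project it onto a product of per-variable distributions by matching per-variable expectations, which for this family simply means keeping each variable's marginal. The domain sizes and the random message below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
domain_sizes = (3, 4, 2)
joint = rng.random(domain_sizes)
joint /= joint.sum()                        # the exact (exponentially large) message

def factorize(joint):
    """Moment-matching projection onto prod_i q_i(z_i): keep each marginal."""
    axes = range(joint.ndim)
    return [joint.sum(axis=tuple(a for a in axes if a != i)) for i in axes]

q = factorize(joint)
approx = q[0][:, None, None] * q[1][None, :, None] * q[2][None, None, :]
print("storage:", joint.size, "->", sum(m.size for m in q))
print("KL(joint || approx) =", float((joint * np.log(joint / approx)).sum()))
```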

Page 44:

Results on Joint Seg & Coref [Singh, Riedel, McCallum, 2012]

    Method                  Prec    Rec     F1    Time (hours)
    k-Best    k = 1         98.8    42.3    59.3      0.08
              k = 2         98.9    40.6    57.6      0.16
              k = 5         99.1    46.5    63.3      0.4
              k = 10        98.8    48.1    64.8      0.8
              k = 50        98.2    53.6    69.3      4.0
              k = 100       98.2    54.3    69.9      8.0
              k = 150       98.2    54.8    70.3     12.0
    Sampling  s = 25        96.3    67.6    79.4      3.1
              s = 100       96.9    67.3    79.5     10.0
    FRAK                    95.3    69.9    80.7      0.16

Table 1: Pairwise coreference decision accuracy on the Citation Coreference task, along with average running time.

[...]ing the top k solutions, we perform Gibbs sampling over the output of the segmentation, and the sparse message to the coreference layer assigns zero probability to all configurations that are not in the set of samples. Although similar to k-best, it has been found to often work well as it captures diversity in the output distribution [Finkel et al., 2005]. As we can see from the table, Gibbs sampling outperforms k-best on this task, but does not match our performance. The merged messages, by assigning non-zero probability to the "merged" states, allow the coreference layer to recover from the mistakes made in the segmentation layer.

Our system shows substantial accuracy gains; an error reduction of 52.5% is achieved over the pipeline 1-Best approach. Further, our system takes approximately 10 minutes to run, which is much faster than sampling (s = 100 takes ~10 hours) and k-Best (k = 150 takes ~12 hours).
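The 52.5% figure is consistent with the F1 column of Table 1 (1-Best F1 = 59.3 vs. FRAK F1 = 80.7); on the rounded numbers,

```latex
\frac{(1 - 0.593) - (1 - 0.807)}{1 - 0.593}
  \;=\; \frac{0.407 - 0.193}{0.407}
  \;\approx\; 0.526 ,
```

i.e. roughly the reported 52.5% (presumably computed from unrounded scores).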

5 Related Work

There are a number of approaches that address similar concerns, yet differ from our approach in significant ways.

A number of related techniques follow a "relax then compensate" strategy, which we can also frame our method as. Our relaxation is made during inference in the downstream component, where the assumption that copies of the same variable should have identical beliefs is relaxed. This is compensated for after the downstream pass, where the beliefs from the copies are made consistent before sending them to the upstream component. From the expectation propagation perspective, we never remove the consistency factors and the relaxation is absent.

Both dual decomposition [Rush et al., 2010] and edge deletion [Choi and Darwiche, 2009] approaches also create copies of variables and relax the coupling constraint; however, as we described above, we reformulate these constraints. We also address sparsification, which allows us to handle high fan-in problems on which dual decomposition and edge deletion would be intractable.

The factorized frontier algorithm [Murphy and Weiss, 2001] and its variants aim to reduce the complexity of message passing between components from Q^N to Q^F, where Q is the domain size of the variables, N the number of variables, and F the max fan-in of variables. In most joint-inference problems of our focus, the fan-in F is large enough that Q^F is still very high, making these approaches intractable. We address this using sparsification.

Heskes and Zoeter [2002] is similar to our work in that it frames approximate inference for DBNs as expectation propagation. However, it is defined over any fixed approximation family, while in FRAK the choice of approximation family is adaptive. Further, they propose a convergent algorithm which is orthogonal to our contribution; there is potential to combine the approaches.

Even though Gibbs sampling [Finkel et al., 2005] performs only slightly worse, our approach provides significant advantages. First, sampling is comparatively very slow. Second, there is inherent variance in the results, as the set of samples may not be representative (due to local minima), and multiple chains may be required. Further, there are a number of hyperparameters that need to be tuned, such as the number of samples, burn-in period, iterations between samples, etc., that can severely impact the results.

6 Conclusion

We have presented FRAK-best lists, a novel protocol and message format for modular joint inference. FRAK-best lists are sparse in order to facilitate efficient downstream inference, and factorized to cover as much uncertainty as possible. They require no predefined list size k to be defined, and their sub-lists can be consumed as if they were independent. We show that they outperform BP, sampling and k-best lists on both a synthetic and a real-world joint inference task.

(Slide annotations: the task is joint segmentation and coreference of citation data; FRAK is our EP method; the competitive sampling baseline is roughly 20x slower.)

Page 45:

Outline

• New algorithms for joint inference

- Guaranteed-optimal “beam search” by Column Generation [NIPS 2012]

- Joint inference with sparse values & factors [Submitted 2013]

- Compressing messages with Expectation Propagation [TR 2012]

- Focussed Query-specific MCMC Sampling [NIPS 2011]

- Joint reasoning about relations of “universal schema” [AKBC 2012]

• FACTORIE: probabilistic programming support for joint inference

- Landscape of tools
- Capabilities
- Release 1.0

Page 46:

Query-specific Sampling

Given a query, no need to run sampling on whole universe of data.

Want query-specific sampling. (Amazingly: under-studied problem!)

Page 47:

Query-specific Sampling

[Figure 1: Convergence to the query marginals of the stationary distribution from an initial uniform distribution: total variation distance vs. time on several model structures (Independent, Linear Chain, Hoop, Grid, Fully Connected (PW), Fully Connected), comparing the traditional Uniform sampler, a Query-only sampler, and our query-specific samplers (QAM-Poly1, QAM-Poly2, QAM-MI, QAM-TV, QAM-TV-G).]

[Figure 2: Improvement over uniform p as the number of variables increases (3, 4, 6, 8, 10, 12 variables). Above the line x = 0 is an improvement in marginal convergence, and below is worse than the baseline. As the number of variables increases, the improvements of the query-specific techniques increase.]

• Define an influence trail score as an approximation to marginal MI.
• Prove:

3 Query Specific MCMC

Given a query Q = hxq,xl,xei, and a probability distribution ⇡ encoded by a graphical model Gwith factors and random variables x, the problem of query specific inference is to return the high-est fidelity answer to Q given a possible time budget. We can put more precision on this statementby defining “highest fidelity” as closest to the truth in total variation distance.

Our approach for query specific inference is based on the Metropolis Hastings algorithm describedin Section 2.4. A simple yet generic case of the Metropolis Hastings proposal distribution T (thathas been quite successful in practice) employs the following steps:

1: Beginning in a current state s, select a random variable xi 2 x from a probability distribution p

over the indices of the variables (1, 2, · · · , n).2: Sample a new value for xi according to some distribution q(Xi) over that variable’s domain,

leave all other variables unchanged and return the new state s

0.

In brief, this strategy arrives at a new state s

0 from a current state s by simply updating the valueof one variable at a time. In traditional MCMC inference, where the marginal distributions of allvariables are of equal interest, the variables are usually sampled in a deterministic order, or selecteduniformly at random; that is, p(i) =

1

n induces a uniform distribution over the integers 1, 2, · · · , n.

However, given a query Q, it is reasonable to choose a p that more frequently selects the queryvariables for sampling. Clearly, the query variable marginals depend on the remaining latent vari-ables, so we must tradeoff sampling between query and non-query variables. A key observation isthat not all latent variables influence the query variables equally. A fundamental question raised andaddressed in this paper is: how do we pick a variable selection distribution p for a query Q to obtain

the highest fidelity answer under a finite time budget. We propose to select variables based on theirinfluence on the query variable according to the graphical model.

We will now formalize a broad definition of influence by generalizing mutual information. Themutual information I(x, y) = ⇡(x, y) log(

⇡(x,y)

⇡(x)⇡(y)

) between two random variables measuresthe strength of their dependence. It is easy to check that this quantity is the KL divergencebetween the joint distribution of the variables and the product of the marginals: I(x, y) =

KL(⇡(x, y)||⇡(x)⇡(y)). In this sense, mutual information measures dependence as a “distance”between the full joint distribution and its independent approximation. Clearly, if x and y are inde-pendent then this distance is zero and so is their mutual information. We produce a generalizationof mutual information which we term the influence by substituting an arbitrary divergence functionf in place of the KL divergence.

Definition 1 (Influence). Let x and y be two random variables with distributions π(x, y), π(x), π(y). Let f(π₁(·), π₂(·)) ↦ r, r ∈ ℝ⁺, be a non-negative real-valued divergence between probability distributions. The influence ι(x, y) between x and y is

ι(x, y) := f(π(x, y), π(x) π(y))    (8)

If we let f be the KL divergence then ι becomes the mutual information; however, because MCMC convergence is more commonly assessed with the total variation norm, we define an influence metric based on this choice for f. In particular we define ι_tv(x, y) := ‖π(x, y) − π(x) π(y)‖_tv.
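As a worked illustration (not from the paper), here is ι_tv computed for two binary variables, using the discrete total-variation convention ‖p − q‖_tv = ½ Σ |p − q|; the joint table below is invented for the example.

// Joint π(x, y) for two binary variables (rows: x, columns: y).
val joint = Array(Array(0.4, 0.1), Array(0.1, 0.4))
val px = joint.map(_.sum)                                             // π(x) = (0.5, 0.5)
val py = Array(joint(0)(0) + joint(1)(0), joint(0)(1) + joint(1)(1))  // π(y) = (0.5, 0.5)
val iotaTv = 0.5 * (for (i <- 0 to 1; j <- 0 to 1)
  yield math.abs(joint(i)(j) - px(i) * py(j))).sum                    // = 0.3
// A value of 0.3 (out of a maximum of 1) says x and y are strongly coupled,
// so a sampler that neglects one of them pays a large total-variation error.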

As we will now show, the total variation influence (between the query variable and the latent variables) has the important property that it is exactly the error incurred from ignoring a single latent variable when sampling values for x_q. For example, suppose we design an approximate query-specific sampler that saves computational resources by ignoring a particular random variable x_l. Then the variable x_l will remain at its burned-in value x_l = v_l for the duration of query-specific sampling. As a result, the chain will converge to the invariant distribution π(· | x_l = v_l). If we use this conditional distribution to approximate the marginal, then the expected error we incur is exactly the influence score under total variation distance.

Proposition 1. If p(i) = 1(i ≠ l) · 1/(n − 1) induces an MH kernel that neglects variable x_l, then the expected total variation error ξ_tv of the resulting MH sampling procedure under the model is the total variation influence ι_tv.


Query-only

[Wick, McCallum, NIPS 2011]
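Continuing the sketch above (again illustrative Scala, not the FACTORIE API and not the exact procedure of [Wick, McCallum, NIPS 2011]): once influence scores are available, one simple way to build a query-specific selection distribution p is to weight each latent variable by its estimated ι_tv against the query variable, while reserving some mass for the query variable itself; influence and queryWeight are hypothetical inputs.

// influence(i) ≈ ι_tv(x_q, x_i) for each latent variable i; the entry at queryIndex is overwritten.
def querySpecificP(influence: Array[Double], queryIndex: Int,
                   queryWeight: Double = 1.0): Array[Double] = {
  val raw = influence.clone()
  raw(queryIndex) = queryWeight        // keep sampling the query variable itself
  val z = raw.sum
  raw.map(_ / z)                       // normalize into a distribution over indices
}
// Plugging this p into mhStep above focuses computation on the variables
// that matter most for the query, instead of treating all variables uniformly.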


Outline
• New algorithms for joint inference

- Guaranteed-optimal “beam search” by Column Generation [NIPS 2012]

- Joint inference with sparse values & factors [Submitted 2013]

- Compressing messages with Expectation Propagation [TR 2012]

- Focussed Query-specific MCMC Sampling [NIPS 2011]

- Joint reasoning about relations of “universal schema” [AKBC 2012]

• FACTORIE Probabilistic programming support for joint inference

- Landscape of tools
- Capabilities
- Release 1.0



Joint Inference

[Factor-graph figure: observed variables x1–x8, coreference/matching variables y_i and y_ij, factors f_k; regions labeled "Coreference and Canonicalization" and "Schema Matching"]

P(Y | X) = (1/Z_X) ∏_{y_i ∈ Y} ψ_w(y_i, x_i) ∏_{y_i, y_j ∈ Y} ψ_b(y_ij, x_ij)

ψ(y_i, x_i) = exp( Σ_k λ_k f_k(y_i, x_i) )

Really Hairy Models! How to do
• parameter estimation
• inference


Joint Inference

[Factor-graph figure: observed variables x1–x8, coreference/matching variables y_i and y_ij, factors f_k; regions labeled "Coreference and Canonicalization" and "Schema Matching"]

P(Y | X) = (1/Z_X) ∏_{y_i ∈ Y} ψ_w(y_i, x_i) ∏_{y_i, y_j ∈ Y} ψ_b(y_ij, x_ij)

ψ(y_i, x_i) = exp( Σ_k λ_k f_k(y_i, x_i) )

Really Hairy Models! How to do
• parameter estimation
• inference
• software engineering


Probabilistic Programming Languages

• Make it easy to specify rich, complex models, using the full power of programming languages
- data structures
- control mechanisms
- abstraction

• Inference implementation comes for free

Provides a language to easily create new models


Landscape of Tools
• BUGS  Bayesian Inf. using Gibbs Sampling
- Arbitrary graphical models
- Slow, small data
• Alchemy  Markov Logic [Domingos]
- Convenient language
- Unpredictable inefficiency
• GraphLab [Guestrin]
- Scalable parallel/distributed
- Knows nothing of prob. model
• Stanford NLP Toolkit [Manning]
- Rich NLP data types & models
- Ad-hoc collection of implementations

FACTORIE


FACTORIE
• "Factor Graphs, Imperative, Extensible"
• Implemented as a library in Scala [Martin Odersky]
- object oriented & functional; type inference
- runs in JVM (complete interoperation with Java)
- fast, JIT compiled, but also cmd-line interpreter
• Library, not a new "little language"
- integrate data pre-processing & evaluation with model specification
- leverage OO design: modularity, encapsulation, inheritance
• Distinct: Data, Factors, Inference, Learning
• Pre-built, rich NLP representations & algorithms
• Scalable
- billions of variables, super-exponential #factors
- DB back-end

http://factorie.cs.umass.edu


Stages of FACTORIE programming
1. Define "templates for data" (i.e. classes of random variables)
- Use data structures just like in deterministic programming.
- Easily express relations among pieces of data.
2. Define "templates for factors" (i.e. model)
- Distinct from the above data representation; makes it easy to modify model scoring independently.
- Define factor scores based on neighboring variable values.
3. Select inference (message passing, sampling, ...)
- Gibbs Sampling, Metropolis-Hastings, Iterated Conditional Modes, WalkSAT, LP, Belief Propagation, Sparse BP, Expectation Propagation, ...
4. Select learning (loss function, optimization, parallelization strategy)
5. Read the data, creating variables.

Then inference / parameter estimation is often a one-liner!


FACTORIE "special sauce"
• Other tools (e.g. MSR's Infer.NET) are very powerful, but inference is a black box, penetrable only by architects.
• FACTORIE is designed to make it easy for users to descend layers of abstraction, e.g.:
1. Command-line tool
2. Short script with pre-selected variables, model, inference, learning
3. Longer script with custom selections
4. Implement custom data structures (new variables)
5. Implement custom components (inference, learning)


Components / Vocabulary
• Variable: value, set, domain
• Domain: values
• Diff, DiffList: redo, undo
• Factor: variables, score
• Family (of factors): weights
• Template (of factors): factors, unroll1, unroll2
• Model (& objective): factors, score
• Marginal, Summary: marginal(variables)
• Inferencer: infer(variables): Summary
• Optimizer: step(weights, gradient, value)
• Piece (of data): accumulateGradient(model, accumulator)
• Trainer: model, process(pieces)
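To give a feel for how some of these pieces fit together, here is a conceptual sketch of the Diff / DiffList idea in plain Scala (a simplification, not the actual FACTORIE classes): a proposal records reversible changes so a sampler can cheaply undo a rejected jump.

trait Diff { def redo(): Unit; def undo(): Unit }

class DiffList {
  private val diffs = scala.collection.mutable.ArrayBuffer[Diff]()
  def +=(d: Diff): Unit = diffs += d
  def redoAll(): Unit = diffs.foreach(_.redo())
  def undoAll(): Unit = diffs.reverseIterator.foreach(_.undo())  // undo in reverse order
}

// A mutable variable whose assignments are recorded as reversible Diffs.
class IntVar(private var _value: Int) {
  def value: Int = _value
  def set(newValue: Int)(implicit d: DiffList): Unit = {
    val oldValue = _value
    d += new Diff { def redo() = _value = newValue; def undo() = _value = oldValue }
    _value = newValue
  }
}

// A rejected MCMC proposal would be rolled back with diffList.undoAll().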


Example Variables
• RealVariable
• IntegerVariable
• DiscreteVariable, CategoricalVariable
• TensorVariable
• ProportionsVariable, MassesVariable
• SeqVariable
• EdgeVariable
• SetVariable
• RefVariable


Example NLP Variables
• Token, Span, Sentence
- token.attr[PosLabel], span.attr[NerLabel]
• Sentence
• Document
• Mention, Entity
• Relation
• ParseTree


Example NLP Components
• Token segmentation
• Sentence segmentation
• Part-of-speech tagging
• Noun phrase chunking
• Named-entity extraction
• Dependency parsing
• Relation extraction
• Within-doc and cross-doc coreference


Example: Binary Doc Classifier

// Data
class Document extends FeatureVectorVariable { def domain = DocumentDomain }
class Label(val doc:Document) extends LabeledBooleanVariable


Example: Binary Doc Classifier

// Data
class Document extends FeatureVectorVariable { def domain = DocumentDomain }
class Label(val doc:Document) extends LabeledBooleanVariable
val data: Iterable[Label] = readData()


Example: Binary Doc Classifier

// Data
class Document extends FeatureVectorVariable { def domain = DocumentDomain }
class Label(val doc:Document) extends LabeledBooleanVariable
val data: Iterable[Label] = readData()
// Model
var model = new Model[Label] {
}


Example: Binary Doc Classifier

// Data
class Document extends FeatureVectorVariable { def domain = DocumentDomain }
class Label(val doc:Document) extends LabeledBooleanVariable
val data: Iterable[Label] = readData()
// Model
var model = new Model[Label] {
  val bias = new DotFamilyWithStatistics1[Label] { val weights = new ... }
  val obs = new DotFamilyWithStatistics[Label,Document] { val weights = ... }
}


Example: Binary Doc Classifier

// Data
class Document extends FeatureVectorVariable { def domain = DocumentDomain }
class Label(val doc:Document) extends LabeledBooleanVariable
val data: Iterable[Label] = readData()
// Model
var model = new Model[Label] {
  val bias = new DotFamilyWithStatistics1[Label] { val weights = new ... }
  val obs = new DotFamilyWithStatistics[Label,Document] { val weights = ... }
  def factors(label:Label) = List( bias.Factor(label), obs.Factor(label, label.doc) )
}


Example: Binary Doc Classifier
traditional batch training, L-BFGS updates

// Data
class Document extends FeatureVectorVariable { def domain = DocumentDomain }
class Label(val doc:Document) extends LabeledBooleanVariable
val data: Iterable[Label] = readData()
// Model
var model = new Model[Label] {
  val bias = new DotFamilyWithStatistics1[Label] { val weights = new ... }
  val obs = new DotFamilyWithStatistics[Label,Document] { val weights = ... }
  def factors(label:Label) = List( bias.Factor(label), obs.Factor(label, label.doc) )
}
// Learning
val trainer = new BatchTrainer(L2RegularizedLBFGS, model)
trainer.process(data.map(y => new GLMPiece(y)))


Example: Binary Doc Classifier
traditional batch training, L-BFGS updates

// Data
class Document extends FeatureVectorVariable { def domain = DocumentDomain }
class Label(val doc:Document) extends LabeledBooleanVariable
val data: Iterable[Label] = readData()
// Model
var model = new Model[Label] {
  val bias = new DotFamilyWithStatistics1[Label] { val weights = new ... }
  val obs = new DotFamilyWithStatistics[Label,Document] { val weights = ... }
  def factors(label:Label) = List( bias.Factor(label), obs.Factor(label, label.doc) )
}
// Learning
val trainer = new BatchTrainer(L2RegularizedLBFGS, model)
trainer.process(data.map(y => new GLMPiece(y)))
// Inference
val marginal = Infer(data.first, model)


Example: Binary Doc Classifier
stochastic gradient ascent, plain gradient updates

// Data
class Document extends FeatureVectorVariable { def domain = DocumentDomain }
class Label(val doc:Document) extends LabeledBooleanVariable
val data: Iterable[Label] = readData()
// Model
var model = new Model[Label] {
  val bias = new DotFamilyWithStatistics1[Label] { val weights = new ... }
  val obs = new DotFamilyWithStatistics[Label,Document] { val weights = ... }
  def factors(label:Label) = List( bias.Factor(label), obs.Factor(label, label.doc) )
}
// Learning
val trainer = new SGDTrainer(StepwiseGradientAscent, model)
trainer.process(data.map(y => new GLMPiece(y)))
// Inference
val marginal = Infer(data.first, model)


Example: Binary Doc Classifier
stochastic gradient ascent, with conf. weighting [Pereira]

// Data
class Document extends FeatureVectorVariable { def domain = DocumentDomain }
class Label(val doc:Document) extends LabeledBooleanVariable
val data: Iterable[Label] = readData()
// Model
var model = new Model[Label] {
  val bias = new DotFamilyWithStatistics1[Label] { val weights = new ... }
  val obs = new DotFamilyWithStatistics[Label,Document] { val weights = ... }
  def factors(label:Label) = List( bias.Factor(label), obs.Factor(label, label.doc) )
}
// Learning
val trainer = new SGDTrainer(AROW, model)
trainer.process(data.map(y => new GLMPiece(y)))
// Inference
val marginal = Infer(data.first, model)


Example: Binary Doc Classifier
"hogwild" parallelized stochastic gradient ascent

// Data
class Document extends FeatureVectorVariable { def domain = DocumentDomain }
class Label(val doc:Document) extends LabeledBooleanVariable
val data: Iterable[Label] = readData()
// Model
var model = new Model[Label] {
  val bias = new DotFamilyWithStatistics1[Label] { val weights = new ... }
  val obs = new DotFamilyWithStatistics[Label,Document] { val weights = ... }
  def factors(label:Label) = List( bias.Factor(label), obs.Factor(label, label.doc) )
}
// Learning
val trainer = new HogwildTrainer(AROW, model)
trainer.process(data.map(y => new GLMPiece(y)))
// Inference
val marginal = Infer(data.first, model)


Example: N-ary Doc Classifier
"hogwild" parallelized stochastic gradient ascent

// Data
class Document extends FeatureVectorVariable { def domain = DocumentDomain }
class Label(val doc:Document) extends LabeledCategoricalVariable { def domain = ... }
val data: Iterable[Label] = readData()
// Model
var model = new Model[Label] {
  val bias = new DotFamilyWithStatistics1[Label] { val weights = new ... }
  val obs = new DotFamilyWithStatistics[Label,Document] { val weights = ... }
  def factors(label:Label) = List( bias.Factor(label), obs.Factor(label, label.doc) )
}
// Learning
val trainer = new HogwildTrainer(AROW, model)
trainer.process(data.map(y => new GLMPiece(y)))
// Inference
val marginal = Infer(data.first, model)


Example: N-ary Doc Classifier
"hogwild" with SampleRank gradients

// Data
class Document extends FeatureVectorVariable { def domain = DocumentDomain }
class Label(val doc:Document) extends LabeledCategoricalVariable { def domain = ... }
val data: Iterable[Label] = readData()
// Model
var model = new Model[Label] {
  val bias = new DotFamilyWithStatistics1[Label] { val weights = new ... }
  val obs = new DotFamilyWithStatistics[Label,Document] { val weights = ... }
  def factors(label:Label) = List( bias.Factor(label), obs.Factor(label, label.doc) )
}
// Learning
val trainer = new HogwildTrainer(AROW, model)
trainer.process(data.map(y => new SampleRankPiece(y)))
// Inference
val marginal = Infer(data.first, model)


Example: Sequence Labeling
"hogwild" with SampleRank gradients

// Data
class Token extends FeatureVectorVariable { def domain = DocumentDomain }
class Label(val token:Token) extends LabeledCategoricalVariable { def domain = ... }
val sentences: Iterable[Seq[Label]] = readData()
// Model
var model = new Model[Label] {
  val bias = new DotFamilyWithStatistics1[Label] { val weights = new ... }
  val obs = new DotFamilyWithStatistics[Label,Token] { val weights = ... }
  def factors(label:Label) = List( bias.Factor(label), obs.Factor(label, label.token) )
}
// Learning
val trainer = new HogwildTrainer(AROW, model)
trainer.process(sentences.map(y => new SampleRankPiece(y)))
// Inference
val marginals = Infer(sentences.first, model)


Example: Sequence Labeling
"hogwild" with SampleRank gradients, with Markov deps

// Data
class Token extends FeatureVectorVariable { def domain = DocumentDomain }
class Label(val token:Token) extends LabeledCategoricalVariable { def domain = ... }
val sentences: Iterable[Seq[Label]] = readData()
// Model
var model = new Model[Seq[Label]] {
  val bias = new DotFamilyWithStatistics1[Label] { val weights = new ... }
  val obs = new DotFamilyWithStatistics[Label,Token] { val weights = ... }
  val markov = new DotFamilyWithStatistics2[Label,Label] { val weights = ... }
  def factors(labels:Seq[Label]) =
    (for (label <- labels) yield bias.Factor(label)) ++
    (for (label <- labels) yield obs.Factor(label, label.token)) ++
    (for (label <- labels.drop(1)) yield markov.Factor(label.prev, label))
}
// Learning
val trainer = new HogwildTrainer(AROW, model)
trainer.process(sentences.map(y => new SampleRankPiece(y)))
// Inference
val marginals = Infer(sentences.first, model)


Example: Sequence Labeling
"hogwild"..., with Markov deps, Gibbs Sampling inference

// Data
class Token extends FeatureVectorVariable { def domain = DocumentDomain }
class Label(val token:Token) extends LabeledCategoricalVariable { def domain = ... }
val sentences: Iterable[Seq[Label]] = readData()
// Model
var model = new Model[Seq[Label]] {
  val bias = new DotFamilyWithStatistics1[Label] { val weights = new ... }
  val obs = new DotFamilyWithStatistics[Label,Token] { val weights = ... }
  val markov = new DotFamilyWithStatistics2[Label,Label] { val weights = ... }
  def factors(labels:Seq[Label]) =
    (for (label <- labels) yield bias.Factor(label)) ++
    (for (label <- labels) yield obs.Factor(label, label.token)) ++
    (for (label <- labels.drop(1)) yield markov.Factor(label.prev, label))
}
// Learning
val trainer = new HogwildTrainer(AROW, model)
trainer.process(sentences.map(y => new SampleRankPiece(y)))
// Inference
val marginals = GibbsSampling.infer(sentences.first, model)


Example: Sequence Labeling
"hogwild"..., with Markov deps, BP inference

// Data
class Token extends FeatureVectorVariable { def domain = DocumentDomain }
class Label(val token:Token) extends LabeledCategoricalVariable { def domain = ... }
val sentences: Iterable[Seq[Label]] = readData()
// Model
var model = new Model[Seq[Label]] {
  val bias = new DotFamilyWithStatistics1[Label] { val weights = new ... }
  val obs = new DotFamilyWithStatistics[Label,Token] { val weights = ... }
  val markov = new DotFamilyWithStatistics2[Label,Label] { val weights = ... }
  def factors(labels:Seq[Label]) =
    (for (label <- labels) yield bias.Factor(label)) ++
    (for (label <- labels) yield obs.Factor(label, label.token)) ++
    (for (label <- labels.drop(1)) yield markov.Factor(label.prev, label))
}
// Learning
val trainer = new HogwildTrainer(AROW, model)
trainer.process(sentences.map(y => new SampleRankPiece(y)))
// Inference
val marginals = BP.infer(sentences.first, model)


Example: Sequence Labeling
"hogwild"..., with Markov deps, Sparse BP inference

// Data
class Token extends FeatureVectorVariable { def domain = DocumentDomain }
class Label(val token:Token) extends LabeledCategoricalVariable { def domain = ... }
val sentences: Iterable[Seq[Label]] = readData()
// Model
var model = new Model[Seq[Label]] {
  val bias = new DotFamilyWithStatistics1[Label] { val weights = new ... }
  val obs = new DotFamilyWithStatistics[Label,Token] { val weights = ... }
  val markov = new DotFamilyWithStatistics2[Label,Label] { val weights = ... }
  def factors(labels:Seq[Label]) =
    (for (label <- labels) yield bias.Factor(label)) ++
    (for (label <- labels) yield obs.Factor(label, label.token)) ++
    (for (label <- labels.drop(1)) yield markov.Factor(label.prev, label))
}
// Learning
val trainer = new HogwildTrainer(AROW, model)
trainer.process(sentences.map(y => new SampleRankPiece(y)))
// Inference
val marginals = SparseBP.infer(sentences.first, model)


FACTORIE Progress This Year
• Design
- Model is now a source of factors with arbitrary context
- Factor no longer tied to Family
- Tensors: efficient multi-dimensional linear algebra
- Cubbies for serialization to files and NoSQL DBs
• New application frameworks:
- NLP representations and pre-trained models
- Linear chains
- Classification
- Topic models
• New learning & joint inference algorithms
- Parameter estimation flexibility, based on "Pieces"
- New inference procedures, e.g. Sparse BP
• Released version 1.0, milestone 1
- Growing tutorial documentation


FACTORIE Current Usage
• Boyan Onyshkevych (NSA), in collaboration with John Frank (MIT) and collaborators at NIST, to support NIST TAC-KBA and internal work on large-scale entity resolution.
• DoD work at BAE, SRI, SAIC, Digital Reasoning Systems Inc., Oculus Inc., and Decisive Analytics Corporation, ...
• Corporate work at Oracle, Google, Contata, ...
• Academic work at UMass, Stanford, CMU, UMD, ...


Outline
• New algorithms for joint inference

- Guaranteed-optimal “beam search” by Column Generation [NIPS 2012]

- Joint inference with sparse values & factors [Submitted 2013]

- Compressing messages with Expectation Propagation [TR 2012]

- Focussed Query-specific MCMC Sampling [NIPS 2011]

- Joint reasoning about relations of “universal schema” [AKBC 2012]

• FACTORIE Probabilistic programming support for joint inference

- Landscape of tools
- Capabilities
- Release 1.0