
Page 1:

Probabilistic Programming and Probabilistic Databases with Imperatively-Defined Factor Graphs

Andrew McCallum
Department of Computer Science, University of Massachusetts Amherst

Joint work with Sameer Singh, Michael Wick, Karl Schultz, Sebastian Riedel, Limin Yao, Aron Culotta.

Page 2:

Goal

Build models that mine actionable knowledge from unstructured text.

Page 8:

Entities and Relations

(figure: research papers connected by "Cites" links)

Page 9:

Entities and Relations

(figure: a graph of entity types — Research Paper, Person, Grant, Venue, University, Groups — connected by relations such as Cites and Expertise)

Page 10:

Extracted Database

• 8 million research papers
• 2 million authors
• 400k grants, 90k institutions, 10k venues

(figure: pipeline — gather raw data → Text → Extraction → Mentions → Resolution → Entities → query/answer)

Combining ambiguous evidence.

Page 11:

Applying probabilistic modeling to large data.

Page 12:

Applying probabilistic modeling to large data.

(figure: application areas — information extraction, bio/medical informatics, computer vision, scientific data modeling, ... — meet computational statistics: scalability, algorithms, data structures, software engineering, parallelism)

Rich model structure: spatio-temporal, hierarchical, relational, infinite.

Implementing the new model is a significant task.

Page 13:

Bayesian Network — Directed, Generative Graphical Models

(figure: nodes a, b, c, d, e with directed edges)

  p(a, b, c, d, e) = p(a) p(b|a) p(c|a) p(d|a, c) p(e|b, c)

Page 14:

Markov Random Field — Undirected Graphical Model, a.k.a. Markov Network

(figure: nodes a, b, c, d, e with undirected edges)

  p(a, b, c, d, e) = (1/Z) φ(a, c, d) φ(a, b) φ(b, e) φ(d, e)

Page 15:

Markov Random Field — Undirected Graphical Model, a.k.a. Markov Network

  p(a, b, c, d, e) = (1/Z) φ(a, c) φ(a, d) φ(c, d) φ(a, b) φ(b, e) φ(d, e)
  p(a, b, c, d, e) = (1/Z) φ(a, c, d) φ(a, b) φ(b, e) φ(d, e)

(The undirected graph alone does not distinguish the pairwise factorization from the clique factorization.)

Page 16:

Factor Graph — can represent both directed and undirected graphical models

(figure: variables a–e with explicit factor nodes, making the factorization unambiguous)

  p(a, b, c, d, e) = (1/Z) φ(a, c) φ(a, d) φ(c, d) φ(a, b) φ(b, e) φ(d, e)
  p(a, b, c, d, e) = (1/Z) φ(a, c, d) φ(a, b) φ(b, e) φ(d, e)



Page 19:

Conditional Random Field (CRF) — undirected graphical model, conditioned on some data variables [Lafferty, McCallum, Pereira 2001]

(figure: output/predicted variables y = (a, b, c, d, e); input/observed variables x)

  p(y|x) = (1/Z_x) ∏_f φ(x_∈f, y_∈f)

where each factor is log-linear:

  φ(x_∈f, y_∈f) = exp( Σ_k λ_k f_k(x_t, y_t) )

+ Tremendous freedom to use arbitrary features of the input.
+ Predict multiple dependent variables ("structured output").

Page 20:

Relational Graphical Model — relational = repeated structure of data & factors

(figure: variables a–i with repeated factor structure)

Page 21:

Relational Graphical Model — relational = repeated structure of data & factors

(figure: the same graph, with Factor Template #1 tying each (x_i, y_i) pair, e.g. (x1, y1), and Factor Template #2 tying each adjacent label pair, e.g. (y1, y2))

Page 22:

Information Extraction with Linear-chain CRFs

Finite state model / graphical model: a state sequence s1 ... s8 over the observation sequence
"Today Morgan Stanley Inc announced Mr. Friday's appointment."
(states: person name, organization name, background)

State-of-the-art predictive accuracy on many tasks.
The logistic-regression analogue of a hidden Markov model.
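Concretely, the linear-chain case instantiates the CRF of Page 19 with one factor per sequence position. A standard way to write it, added here for reference and consistent with the slides' notation:

  p(s \mid o) \;=\; \frac{1}{Z_o} \prod_{t=1}^{T} \exp\!\Big( \sum_k \lambda_k f_k(s_{t-1}, s_t, o, t) \Big)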

Page 23:

Outline
• Motivate software engineering for statistics
• Graphical models for Extraction & Integration
  - Extraction (linear-chain CRFs)
  - Information Integration (really hairy CRFs, MCMC, SampleRank)
• Probabilistic Programming: FACTORIE
• Example
• Relation Extraction (cross-document, w/out labeled data)
• Probabilistic Programming inside a DB
• Ongoing Work


Page 25:

Information Integration

Database A (Schema A):
  First Name | Last Name | Contact
  J.         | Smith     | 222-444-1337
  J.         | Smith     | 444 1337
  John       | Smith     | (1) 4321115555

Database B (Schema B):
  Name          | Phone
  John Smith    | U.S. 222-444-1337
  John D. Smith | 444 1337
  J Smiht       | 432-111-5555

Schema Matching (Schema A ↔ Schema B):
  {First Name, Last Name} ↔ {Name}
  {Contact} ↔ {Phone}

Coreference (two entities):
  John #1: J. Smith, J. Smith, John Smith
  John #2: John Smith, J Smiht, John D. Smith

Canonicalization: normalized DB
  Entity# | Name          | Phone
  523     | John Smith    | 222-444-1337
  524     | John D. Smith | 432-111-5555
  ...     | ...           | ...

Page 26:

Schema Matching / Coreference and Canonicalization

(figure: a factor graph; on the coreference side, mention-set variables x1, x2, x3 with within-cluster factors f1, f2 and match variables y1, y2, y3, y12, y13, y23; on the schema-matching side, attribute-set variables x4 ... x8 with factors f5, f7, f8, f67 and match variables y5, y7, y8, y54, y67)

  P(Y|X) = (1/Z_X) ∏_{yi ∈ Y} ψ_w(yi, xi) ∏_{yi,yj ∈ Y} ψ_b(yij, xij)
  ψ(yi, xi) = exp( Σ_k λ_k f_k(yi, xi) )

• x1 is a set of mentions {J. Smith, John, John Smith}
• x2 is a set of mentions {Amanda, A. Jones}
• f12 is a factor between x1/x2
• y12 is a binary variable indicating a match (no)
• f1 is a factor over cluster x1
• y1 is a binary variable indicating a match (yes)
• Entity/attribute factors omitted for clarity

Page 27:

Schema Matching / Coreference and Canonicalization (continued)

• x6 is a set of attributes {phone, contact, telephone}
• x7 is a set of attributes {last name, last name}
• f67 is a factor between x6/x7
• y67 is a binary variable indicating a match (no)
• f7 is a factor over cluster x7
• y7 is a binary variable indicating a match (yes)


Page 29:

(the same factor graph) Really hairy! How to do
• parameter estimation
• inference

Page 30:

Parameter Estimation in Large State Spaces

• Most methods require calculating the gradient of the log-likelihood, P(y1, y2, y3, ... | x1, x2, x3, ...) ...
• ... which in turn requires "expectations of marginals," P(y1 | x1, x2, x3, ...) — see the identity below.
• But getting marginal distributions can be difficult.
• Alternative: Perceptron. Approximate the gradient from the difference between the true output and the model's predicted best output.
• But even finding the model's predicted best output is expensive.
• We propose: SampleRank [Culotta, Wick, Hall, McCallum, HLT 2007].
  Learn to rank intermediate solutions: P(y1=1, y2=0, y3=1, ... | ...) > P(y1=0, y2=0, y3=1, ... | ...)
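For reference (a textbook identity for log-linear models, not from the slides): the gradient the first two bullets allude to is the difference between observed and expected feature counts, and the expectation term is exactly why marginals are needed:

  \frac{\partial}{\partial \lambda_k} \log p(y^{*} \mid x) \;=\; f_k(y^{*}, x) \;-\; \mathbb{E}_{p(y \mid x)}\!\left[ f_k(y, x) \right]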

Page 31:

SampleRank: Metropolis-Hastings for MAP

Maximum a posteriori (MAP) inference: argmax_{y∈F} P(Y = y | x)¹

... over a model (a factor graph with target variables y and observed x):

  P(y|x) = (1/Z_x) ∏_{y_i} ψ(x, y_i)

... using a proposal distribution q(y′|y) : F × F → [0, 1].

MH for MAP:
1. Begin with some initial configuration y₀ ∈ F.
2. For i = 1, 2, 3, ... draw a local modification y′ ∈ F from q.
3. Probabilistically accept the jump as a Bernoulli draw with parameter

  α = min( 1, (p(y′)/p(y)) · (q(y|y′)/q(y′|y)) )

Can do MAP inference by running with a decreasing temperature on the ratio of p(y)'s, i.e. simulated annealing (sketched below).

¹ F is the feasible region defined by deterministic constraints, e.g. a valid clustering, parse-tree (non-)projectivity.
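A minimal sketch of the loop above in Scala. This is an illustration, not FACTORIE's API: Proposal, score, and propose are hypothetical stand-ins, score(y) stands for log p(y|x) up to the constant log Z_x (which cancels in the ratio), and the decreasing temperature implements the annealed MAP variant just mentioned.

  import scala.util.Random

  object MHSketch {
    // A proposed local modification, with forward/backward proposal probabilities.
    case class Proposal(newConfig: Map[String, Int], qForward: Double, qBackward: Double)

    def metropolisHastingsMAP(
        init: Map[String, Int],
        score: Map[String, Int] => Double,      // log p(y|x) up to an additive constant
        propose: Map[String, Int] => Proposal,  // draw y' from q(.|y)
        iterations: Int,
        rng: Random = new Random(0)): Map[String, Int] = {
      var y = init
      var best = init
      for (i <- 1 to iterations) {
        val temperature = 1.0 / (1.0 + 0.01 * i)  // decreasing temperature for MAP
        val p = propose(y)
        // log alpha = log min(1, [p(y')/p(y)]^(1/T) * q(y|y')/q(y'|y))
        val logAlpha = (score(p.newConfig) - score(y)) / temperature +
          math.log(p.qBackward) - math.log(p.qForward)
        if (math.log(rng.nextDouble()) < math.min(0.0, logAlpha)) y = p.newConfig
        if (score(y) > score(best)) best = y      // remember the best configuration seen
      }
      best
    }
  }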

Page 32:

M-H Natural Efficiencies

1. The partition function cancels:

  p(y′)/p(y) = p(Y = y′ | x; θ) / p(Y = y | x; θ)
             = [ (1/Z_x) ∏_{y′_i ∈ y′} ψ(x, y′_i) ] / [ (1/Z_x) ∏_{y_i ∈ y} ψ(x, y_i) ]
             = ∏_{y′_i ∈ y′} ψ(x, y′_i) / ∏_{y_i ∈ y} ψ(x, y_i)

2. Unchanged factors cancel:

             = ∏_{y′_i ∈ δy′} ψ(x, y′_i) / ∏_{y_i ∈ δy} ψ(x, y_i)

where δy is the "diff", i.e. the variables in y that have changed.

How to learn the parameters of ψ?

Page 33:

Ranking Intermediate Solutions: Example

A walk through proposal steps, comparing the model's score change against the ground truth's (update rule sketched below):

  1. (initial configuration)
  2. Δ Model = -23   Δ Truth = -0.2
  3. Δ Model =  10   Δ Truth = -0.1   → UPDATE
  4. Δ Model = -10   Δ Truth = -0.1
  5. Δ Model =  -3   Δ Truth =  0.3   → UPDATE

(A parameter update fires whenever the model's ranking of the two configurations disagrees with the truth's, as at steps 3 and 5.)

• Like Perceptron: proof of convergence under marginal separability.
• More constrained than maximum likelihood: parameters must correctly rank incorrect solutions!
• Very fast to train.
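A hedged Scala sketch of that update rule, perceptron-style on feature-count differences. It illustrates the idea rather than FACTORIE's API: Config, features, truthScore, and propose are hypothetical stand-ins, and the "follow the model" acceptance is a simplification of the MH-driven exploration the talk uses.

  object SampleRankSketch {
    type Config = Map[String, Int]

    def sampleRank(
        init: Config,
        features: Config => Map[String, Double], // sufficient statistics of a configuration
        truthScore: Config => Double,            // objective score against the labeled truth
        propose: Config => Config,               // draw a local modification
        iterations: Int,
        learningRate: Double = 1.0): Map[String, Double] = {
      var weights = Map.empty[String, Double]
      def w(k: String) = weights.getOrElse(k, 0.0)
      def modelScore(c: Config) = features(c).map { case (k, v) => w(k) * v }.sum
      var y = init
      for (_ <- 1 to iterations) {
        val yPrime = propose(y)
        val dModel = modelScore(yPrime) - modelScore(y)
        val dTruth = truthScore(yPrime) - truthScore(y)
        if (dModel * dTruth < 0) {               // model and truth rank the pair differently
          val (better, worse) = if (dTruth > 0) (yPrime, y) else (y, yPrime)
          val fb = features(better)
          val fw = features(worse)
          for (k <- fb.keySet ++ fw.keySet)      // perceptron-style update toward the truth's winner
            weights = weights.updated(
              k, w(k) + learningRate * (fb.getOrElse(k, 0.0) - fw.getOrElse(k, 0.0)))
        }
        if (dModel > 0) y = yPrime               // simplified: follow the model's preference
      }
      weights
    }
  }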

Page 34:

Comparison to Contrastive Divergence

Contrastive Divergence, n=2 [Hinton 2002]: the sufficient statistics for the update come from the truth and a short proposal chain started at it.
Persistent Contrastive Divergence [Tieleman 2008]: the proposal chain persists across parameter updates.
SampleRank: updates are made between successive proposals along the chain itself.

(figure: chains of configurations illustrating each method)

Page 35:

Comparison to LASO [Daumé, Langford, Marcu '05]

SampleRank scores possible worlds.
LASO scores "actions" (transitions).

• No concern about the generation ordering of output variables.
• Defines a standard factor-graph score on a possible world.
• Can get marginal probability distributions.

Page 36:

SampleRank on Coreference

• ACE 2004. All nouns: 28,122 mentions, 14,047 entities (e.g. he, the President, Clinton, Mrs. Clinton, Washington).

B³ results:
  2005 Ng                                    69.5%
  2007 Culotta, Wick, Hall, McCallum         79.3%
  2008 Bengtson, Roth                        80.8%
  2009 Wick, McCallum (MCMC + SampleRank)    81.5%
       Contrastive Divergence                75.1%
       Persistent Contrastive Divergence     74.9%
       Perceptron                            76.3%

Page 37:

(recap: the schema matching / coreference factor graph of Page 26)

Page 38:

Dataset

• Faculty and alumni listings from university websites, plus an IE system
• 9 different database schemas
• ~1400 mentions, 294 coreferent

Example schemas:
  DEX IE        | Northwestern Fac   | UPenn Fac
  First Name    | Name               | Name
  Middle Name   | Title              | First Name
  Last Name     | PhD Alma Mater     | Last Name
  Title         | Research Interests | Job+Department
  Department    | Office Address     |
  Company Name  | E-mail             |
  Home Phone    |                    |
  Office Phone  |                    |
  Fax Number    |                    |
  E-mail        |                    |

Page 39:

Coreference Results

                     Pair                   MUC
                     F1    Prec   Recall    F1    Prec   Recall
  No Canon   ISO     72.7  88.9   61.5      75.0  88.9   64.9
  No Canon   CASC    64.0  66.7   61.5      65.7  66.7   64.9
  No Canon   JOINT   76.5  89.7   66.7      78.8  89.7   70.3
  Canon      ISO     78.3  90.0   69.2      80.6  90.0   73.0
  Canon      CASC    65.8  67.6   64.1      67.6  67.6   67.6
  Canon      JOINT   81.7  90.6   74.4      84.1  90.6   74.4

~15% error reduction from the joint model.
ISO = isolated, CASC = cascade, JOINT = joint inference.

Page 40:

Schema Matching Results

                     Pair                   MUC
                     F1    Prec   Recall    F1    Prec   Recall
  No Canon   ISO     50.9  40.9   67.5      69.2  81.8   60.0
  No Canon   CASC    50.9  40.9   67.5      69.2  81.8   60.0
  No Canon   JOINT   68.9  100    52.5      69.6  100    53.3
  Canon      ISO     50.9  40.9   67.5      69.2  81.8   60.0
  Canon      CASC    52.3  41.8   70.0      74.1  83.3   66.7
  Canon      JOINT   71.0  100    55.0      75.0  100    60.0

~40% error reduction from the joint model.
ISO = isolated, CASC = cascade, JOINT = joint inference.

Page 41:

(recap: the schema matching / coreference factor graph) Really hairy! How to do
• parameter estimation
• inference

Page 42:

(same model) Really hairy! How to do
• parameter estimation
• inference
• software engineering

Page 43:

Outline
• Motivate software engineering for statistics
• Graphical models for Extraction & Integration
  - Extraction (linear-chain CRFs)
  - Information Integration (really hairy CRFs, MCMC, SampleRank)
• Probabilistic Programming: FACTORIE
• Example
• Relation Extraction (cross-document, w/out labeled data)
• Probabilistic Programming inside a DB
• Ongoing Work

Page 44:

Probabilistic Programming Languages

• Make it easy to specify rich, complex models, using the full power of programming languages:
  - data structures
  - control mechanisms
  - abstraction
• Inference implementation comes for free

Provides a language to easily create new models.

Page 45:

A Small Sampling of Probabilistic Programming Languages

• Explicit directed graph: BUGS
• Functional: IBAL, Church
• Object-oriented: Figaro, Infer.NET
• Logic-based: Markov logic, BLOG, PRISM

Page 46:

Declarative Model Specification

• One of the biggest advances in the Artificial Intelligence community.
• Gone too far? Much domain knowledge is also procedural.
• Logic + Probability → Imperative + Probability
  - Rising interest: Church, Infer.NET, ...

Page 47:

Imperative tools for creating a Declarative Model Specification

(same points as the previous slide, reframed under the new title)

Page 48:

Imperative tools for creating a Declarative Model Specification

Our approach: "Imperatively-Defined Factor Graphs" [McCallum, Rohanimanesh, Wick, Schultz, Singh, NIPS 2008]

• Preserve the declarative statistical semantics of factor graphs.
• Provide imperative hooks to define structure, parameterization, inference, estimation.

Page 49:

Our Design Goals

• Represent factor graphs, with emphasis on conditional random fields
• Scalability: input data, output configuration, factors, tree-width
  - observed data that cannot fit in memory
  - a super-exponential number of factors
• Leverage object-oriented benefits: modularity, encapsulation, inheritance, ...
• Integrate declarative & procedural knowledge: natural, easy to use
  - upcoming slides: 2 examples of injecting imperativ-ism into factor graphs

Page 50:

FACTORIE

• "Factor Graphs, Imperative, Extensible"
• Implemented as a library in Scala [Martin Odersky]:
  - object-oriented & functional
  - type inference
  - runs in the JVM (complete interoperation with Java)
  - fast, JIT-compiled, but also a command-line interpreter
• Library, not a new "little language":
  - integrate data pre-processing & evaluation with model specification
  - leverage OO design: modularity, encapsulation, inheritance
• Scalable:
  - billions of variables, super-exponentially many factors, DB back-end
  - fast parameter estimation through SampleRank [2009]

http://code.google.com/p/factorie

Page 51:

Stages of FACTORIE programming

1. Define "templates for data" (i.e. classes).
   - Use data structures just as in deterministic programming.
   - Only special requirement: "undo" capability for changes (see the sketch after this list).
   - (A variable holds a single possible value, not a distribution.)
2. Define "templates for factors".
   - Distinct from the data representation above; makes it easy to modify model scoring independently.
   - Leverage the data's natural relations to define the factors' relations.
3. Select inference (MCMC, variational).
   - Optionally, define MCMC proposal functions that leverage domain knowledge.
4. Read the data, creating variables.
   Then inference / parameter estimation is often a one-liner!
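A minimal sketch of the stage-1 "undo" requirement. The Diff idea echoes FACTORIE's design, but this code is a hypothetical illustration rather than its API: a change records how to reverse itself, so a rejected proposal can be rolled back.

  object UndoSketch {
    import scala.collection.mutable.ArrayBuffer

    // A change that knows how to reverse and replay itself.
    trait Diff { def undo(): Unit; def redo(): Unit }

    class IntVariable(private var _value: Int) {
      def value: Int = _value
      // Setting a value through a diff list makes the change reversible.
      def set(newValue: Int, diffs: ArrayBuffer[Diff]): Unit = {
        val old = _value
        diffs += new Diff {
          def undo(): Unit = { _value = old }
          def redo(): Unit = { _value = newValue }
        }
        _value = newValue
      }
    }
  }

  // Usage: apply a proposal, score it, roll it back if rejected.
  //   val diffs = new scala.collection.mutable.ArrayBuffer[UndoSketch.Diff]
  //   v.set(3, diffs)
  //   if (!accepted) diffs.reverseIterator.foreach(_.undo())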

Page 52:

Outline
• Motivate software engineering for statistics
• Graphical models for Extraction & Integration
  - Extraction (linear-chain CRFs)
  - Information Integration (really hairy CRFs, MCMC, SampleRank)
• Probabilistic Programming: FACTORIE
• Example
• Relation Extraction (cross-document, w/out labeled data)
• Probabilistic Programming inside a DB
• Ongoing Work

Page 53:

Scala

• New variable: var myHometown : String
• New constant: val myName = "Andrew"
• New method: def climb(increment:Double) = myAltitude += increment
• New class: class Skier extends Person
• New trait (like a Java interface with implementations): trait FirstAid { def applyBandage = ... }
• New class with a trait: class BackcountrySkier extends Skier with FirstAid
• Generics in square brackets: new ArrayList[Skier]

Page 54:

Example: Linear-Chain CRF for Segmentation

Labels:  T    F     F       T    F     F
Words:   Bill loves skiing  Tom  loves snowshoeing

Page 55:

Example: Linear-Chain CRF for Segmentation — the data classes:

  class Label(isBeg:Boolean) extends BooleanVariable(isBeg)
  class Token(word:String) extends CategoricalVariable(word)

Page 56:

  class Label(isBeg:Boolean) extends BooleanVariable(isBeg) with VarInSeq
  class Token(word:String) extends CategoricalVariable(word) with VarInSeq

(VarInSeq provides sequence neighbors: label.prev, label.next)


Page 58:

  class Label(isBeg:Boolean) extends BooleanVariable(isBeg) with VarInSeq {
    val token : Token
  }
  class Token(word:String) extends CategoricalVariable(word) with VarInSeq {
    val label : Label
  }

Avoid representing relations by indices. Do it directly with member pointers... arbitrary data structures.

Page 59:

Add arbitrary helper methods:

  class Token(word:String) extends CategoricalVariable(word) with VarInSeq {
    val label : Label
    def longerThanSix = word.length > 6
  }

Page 60:

  val model = new Model(
    new TemplateWithStatistics1[Label])

Page 61:

  val model = new Model(
    new TemplateWithStatistics1[Label],
    new TemplateWithStatistics2[Label,Token])

Page 62:

The complete data classes and model:

  class Label(isBeg:Boolean) extends BooleanVariable(isBeg) with VarInSeq {
    val token : Token
  }
  class Token(word:String) extends CategoricalVariable(word) with VarInSeq {
    val label : Label
    def longerThanSix = word.length > 6
  }

  val model = new Model(
    new TemplateWithStatistics1[Label],
    new TemplateWithStatistics2[Label,Token],
    new TemplateWithStatistics2[Label,Label])


Page 68:

Key Operation: Scoring a Proposal

• Acceptance probability ~ ratio of model scores. Scores of factors that didn't change cancel.
• To score efficiently:
  - The proposal method runs.
  - Automatically build a list of variables that changed.
  - Find the factors that touch the changed variables.
  - Find the other (unchanged) variables needed to calculate those factors' scores.
• How to find factors from variables & vice versa?
  - In BLOG, a rich, highly-indexed data structure stores the mapping variables ←→ factors.
  - But it is complex to maintain as the structure changes.
  - And factors consume memory.

Page 69:

Imperativ-ism #1: Model Structure

Primitive operation:
• Maintain no map structure between factors and variables.
• Finding factor templates is easy; usually # templates < 50.
  - Given a changed variable, query each template.
• What's hard: given a factor template and one changed variable, find the other variables.
• In the factor Template object, define imperative methods that do this:
  - unroll1(v1) returns (v1, v2, v3)
  - unroll2(v2) returns (v1, v2, v3)
  - unroll3(v3) returns (v1, v2, v3)
  - i.e., use a Turing-complete language to determine structure on the fly.
• Other nice attributes:
  - Easy to do value-conditioned structure: case-factor diagrams, etc.
  - Not only avoid the super-exponential blowup; don't even allocate all factors for the current configuration.
  - FACTORIE provides several simpler mechanisms that build on this primitive.


Page 75:

Example: Linear-Chain CRF with explicit unroll methods:

  val model = new Model(
    new TemplateWithStatistics1[Label],
    new TemplateWithStatistics2[Label,Token] {
      def unroll1(label:Label) = Factor(label, label.token)
      def unroll2(token:Token) = throw new Error // Tokens shouldn't change
    },
    new TemplateWithStatistics2[Label,Label] {
      def unroll1(label:Label) = Factor(label, label.next)
      def unroll2(label:Label) = Factor(label.prev, label)
    })

Page 76:

Imperativ-ism #2: Neighbor–Sufficient-Statistics Map

• "Neighbor variables" of a factor: the values of the variables touching the factor.
• "Sufficient statistics" of a factor: a vector whose dot product with the weights of a log-linear factor gives the factor's score.
• Usually conflated. Separate them with a user-defined function!
• Skip-chain NER [Sutton & McCallum 2006]: instead of 5x5 parameters, just 2:

  (label1, label2) → label1 == label2

Labels:  PER  O     LOC    PER  O   O
Words:   Bill loves Paris  Bill the painter ...


Page 81:

Example: Skip-Chain CRF for Segmentation (complete):

  class Label(isBeg:Boolean) extends BooleanVariable(isBeg) with VarInSeq {
    val token : Token
  }
  class Token(word:String) extends CategoricalVariable(word) with VarInSeq {
    val label : Label
    def longerThanSix = word.length > 6
  }

  val model = new Model(
    new TemplateWithStatistics1[Label],
    new TemplateWithStatistics2[Label,Token] {
      def unroll1(label:Label) = Factor(label, label.token)
      def unroll2(token:Token) = throw new Error // Tokens shouldn't change
    },
    new TemplateWithStatistics2[Label,Label] {
      def unroll1(label:Label) = Factor(label, label.next)
      def unroll2(label:Label) = Factor(label.prev, label)
    },
    new Template2[Label,Label] with Statistics1[BooleanVariable] {
      def unroll1(label:Label) =
        for (other <- label.seq; if (label.token == other.token))
          yield Factor(label, other)
      def statistics(label1:Label, label2:Label) = Stat(label1 == label2)
    })

  val labels:Collection[Label] = readData()
  val inferencer = new GibbsSampler(model)
  for (i <- 1 to numIterations) inferencer.process(labels)

Page 82:

Example Run: CoNLL 2003 NER
(screenshot of a run; output not recoverable)

Page 83:

Example: Dependency Parsing

  class Word(str:String) extends CategoricalVariable(str)
  class Node(word:Word, parent:Node) extends RefVariable(parent)

  object ChildParentTemplate extends Template1[Node] with Statistics2[Word,Word] {
    def statistics(n:Node) = Stat(n.word, n.parent.word)
  }

  object NearestVerbTemplate extends Template1[Node] with Statistics2[Word,Word] {
    def statistics(n:Node) = Stat(n.word, closestVerb(n).word)
    def closestVerb(n:Node) = if (isVerb(n.word)) n else closestVerb(n.parent)
    def unroll1(n:Node) = n.selfAndDescendants
  }

(Note: unroll1 returns the node and all its descendants because changing a node's parent can change closestVerb for every node below it.)

Page 84:

Example: Alternative Template Specification

Instead of the previous:

  val model = new Model(
    new TemplateWithStatistics1[Label],
    new TemplateWithStatistics2[Label,Token] {
      def unroll1(label:Label) = Factor(label, label.token)
      def unroll2(token:Token) = throw new Error
    },
    new TemplateWithStatistics2[Label,Label] {
      def unroll1(label:Label) = Factor(label, label.next)
      def unroll2(label:Label) = Factor(label.prev, label)
    })

Higher-level "Entity-Relationship" specification:

  val model = new Model(
    Foreach[Label] { label => Score(label) },
    Foreach[Label] { label => Score(label, label.token) },
    Foreach[Label] { label => Score(label.prev, label) })

Page 85:

Example: First-order Logic Templates

  val model = new Model(
    Forany[Person] { p => p.cancer } * 0.1,
    Forany[Person] { p => p.smokes ==> p.cancer } * 2.0,
    Forany[Person] { p => p.friends.smokes <==> p.smokes } * 1.5)

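These weighted first-order formulas score a possible world as in Markov logic. As a reminder (the standard Markov-logic form, not from the slides), with n_i counting the true groundings of formula i:

  p(\text{world}) \;=\; \frac{1}{Z} \exp\Big( \sum_i w_i \, n_i(\text{world}) \Big)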


Page 89:

Example: Latent Dirichlet Allocation (complete):

  class Z(p:Proportions, value:Int) extends MixtureChoice(p, value)
  class Word(ps:FiniteMixture[Proportions], z:MixtureChoiceVariable, value:String)
    extends CategoricalMixture[String](ps, z, value)
  class Document(val file:String) extends ArrayBuffer[Word] {
    var theta:DirichletMultinomial = null
  }

  val phis = FiniteMixture(numTopics)(new GrowableDenseDirichletMultinomial(0.01))
  val documents = new ArrayBuffer[Document]
  for (directory <- directories) {
    for (file <- new File(directory).listFiles; if (file.isFile)) {
      val doc = new Document(file.toString)
      doc.theta = new DenseDirichletMultinomial(numTopics, 0.01)
      for (word <- lexer.findAllIn(file.mkString).map(_.toLowerCase)) {
        val z = new Z(doc.theta, random.nextInt(numTopics))
        doc += new Word(phis, z, word)
      }
      documents += doc
    }
  }

  val sampler = new CollapsedGibbsSampler(phis ++ documents.map(_.theta))
  val zs = documents.flatMap(document => document.map(word => word.choice))
  sampler.processAll(zs)

Page 90:

Experimental Comparison

• Joint segmentation & coreference of research-paper citations (Cora data): 1295 mentions, 134 entities, 36,487 tokens.
• Compare with Markov Logic Networks (Alchemy), using the same observable features.
• FACTORIE results:
  - ~25% reduction in error (segmentation & coreference)
  - 3-20x faster
  - coref results: (table not preserved in this transcript)

Page 91:

Outline
• Motivate software engineering for statistics
• Graphical models for Extraction & Integration
  - Extraction (linear-chain CRFs)
  - Information Integration (really hairy CRFs, MCMC, SampleRank)
• Probabilistic Programming: FACTORIE
• Example
• Relation Extraction (cross-document, w/out labeled data)
• Probabilistic Programming inside a DB
• Ongoing Work

Page 92:

Knowledge-Base Augmentation [Yao, Riedel, McCallum 2010]

(figure: sentences such as "Microsoft was founded by Bill Gates", "With Microsoft chairman Bill Gates soon relinquishing ...", and "Paul Porter, a founder of Industry Ears" feed a joint model of entities & relations, which yields facts like founded(Bill Gates, Microsoft), nationality(Steve Jobs, USA), founded(Paul Porter, Industry Ears), and poses queries like founded(D. L. Sifry, Technorati)?)

• 2 years, 216K articles
• 334K entities, 10 types
• 488K relation instances, 54 types

Page 93:

Figure 1: Factor graph for joint relation mention prediction and relation type identification. (Cross-document joint inference: relation variables Y^rel for candidate tuples such as (Roger McNamee, USA) and (Roger McNamee, Microsoft); entity-type variables Y^type for e.g. person "R. McNamee", country "USA", company "Microsoft"; relation-mention variables Z over sentences like "Elevation Partners, was founded by Roger McNamee ..." and "Bill Gates was born in the USA in 1955"; tied together by mention factors, relation-mention factors, ner-relation factors, and a functionality factor.)

We define the following conditional distribution:

  p(y|x) = (1/Z_x) ∏_{T_j ∈ T} ∏_{(y_i, x_i) ∈ T_j} exp( Σ_{k=1}^{K_j} θ_k^j f_k^j(y_i, x_i) )    (3)

In our case the set T consists of four templates, which we describe below. Note that to construct this graphical model we use FACTORIE (McCallum et al., 2009), a probabilistic programming language that simplifies the construction process, as well as inference and learning.

3.1.1 Bias Template
We use a bias template T^Bias that prefers certain relations a priori over others. When the template is unrolled, it creates one factor per variable Y_c for candidate tuple c, and one weight θ_r^Bias and feature function f_r^Bias for each possible relation r. f_r^Bias fires if the relation associated with tuple c is r.

3.1.2 Mention Template
In order to extract relations from text, we need to model the correlation between relation instances and their mentions in text. For this purpose we define the mention template T^Men that connects each relation instance variable Y_c with its observed mention variables X_{M_c}.

The feature functions of this template are taken from (Mintz et al., 2009b), with minor modifications. They include features that inspect the lexical context between entity mentions in the same sentence, and the syntactic path between them. One example is

  f_101^Men(y_c, x_{M_c}) := 1 if y_c = founder ∧ (m1 ", director of " m2) ∈ x_{M_c}; 0 otherwise.

It tests whether, for any of the mentions of the candidate tuple, the sequence ", director of " appears between the mentions of the argument entities.

Crucially, these templates operate at a cross-document level: they gather all mentions of the candidate tuple c and extract features from all of them.

3.1.3 Selectional Preference Templates
To capture the correlations between entity types and the relations the entities participate in, we introduce the joint template T^Joint. It connects a relation instance variable Y_{e1,...,ea} to the entity type variables Y_{e1}, ..., Y_{ea}. To measure the compatibility between relation and entity variables, we use one feature f_{r,t1...ta}^Joint (and weight θ_{r,t1...ta}^Joint) for each combination of relation and entity types r, t1, ..., ta. The feature fires when the variables are in the state r, t1, ..., ta. After training we would expect the weight θ_{founder,person,company}^Joint to be larger than θ_{founder,person,country}^Joint.

We also add a template T^Pair that measures the compatibility between Y_{e1,...,ea} and each Y_{ei} in isolation. Here we use features f_{i,r,t}^Pair that fire if e_i is ...

1

nationof

Elevation Partners, was founded by Roger McNamee ...

Roger McNamee,USAYrel

Z1

personR McNamee

countryUSA

worksfor

comp. Microsoft

1Z1

1Z1

Roger McNamee,Microsoft

functionality-factor

ner-relation-factors

relation-mention factors

Ytypel

mention factors

Elevation Partners, was founded by Roger McNamee ...

Elevation Partners, was founded by Roger McNamee ...

Figure 1: Factor Graph for joint relation mention prediction and relation type identification.

fine the following conditional distribution:

p (y|x) =1

Zx

Tj∈T

(yi,xi)∈Tj

ePKj

k=1 θjkfj

k(yi,xi) (3)

In our case the set T consist of four templateswe will describe below. Note that to construct thisgraphical model we use FACTORIE (McCallum etal., 2009), a probabilistic programming languagethat simplifies the construction process, as well asinference and learning.

3.1.1 Bias TemplateWe use a bias template TBias that prefers certain

relations a priori over others. When the template isunrolled, it creates one factor per variable Ycfor can-didate tuple c and one weight θBias

r and feature func-tion fBias

r for each possible relation r. fBiasr fires if

the relation associated with tuple c is r.

3.1.2 Mention TemplateIn order to extract relations from text, we need to

model the correlation between relation instances andtheir mentions in text. For this purpose we definethe mention template TMen that connects each rela-tion instance variable Yc with its observed variablesmention variables XMc .

The feature functions of this template are takenfrom (Mintz et al., 2009b) (with minor modifica-tions). This includes features that inspect the lexical

context between entity mentions in the same sen-tence, and the syntactic path between these. Oneexample is

fMen101 (yc,xMc)

def=

1 yc = founder∧m1", director of "m2 ∈ xMc

0 otherwise.

It tests whether for any of the mentions of the can-didate tuple the sequence ", director of " appears be-tween the mentions of the argument entites.

Crucially, these templates function on a cross-document level. They gather all mentions of the can-didate tuple c and extract features from all of these.

3.1.3 Selectional Preference TemplatesTo capture the correlations between entity types

and the relations the entities participate in, we in-troduce the joint template TJoint. It connects a re-lation instance variable Ye1,...,ea to the entity typevariables Ye1 , . . . , Yen . To measure the compabil-ity between relation and entity variables, we useone feature f Joint

r,t1...ta (and weight θJointr,t1...ta) for each

combination of relation and entity types r, t1 . . . ta.The feature fires when the variables are in thestate r, t1 . . . ta. After training we would expecta weight θJoint

founder,person,company to be larger thanθJointfounder,person,country.

We also add a template TPair that measures thecompability between Ye1,...,ea and each Yei in iso-lation. Here we use features fPair

i,r,t that fire if ei is

g ( , ) � DKL ( || )

g ( ) = log�1− µi + µie

θi�− µie

θi

. . . + wφfφ (y , y , y ) + . . .

> 0

= maxy� ,y� ,y�

�y� , y� , y� �

< maxy� ,y� ,y�

�y� , y� , y� �

Φ1 (y5,7,;x) = exp (. . . + w f (y;x) + . . .)

Φ (yi,j ;x) = exp

��

k

wkfk (yi,j ;x)

p (y;x) =1

ZxΨ1 (y;x) · . . . · Ψn (y;x)

log E [Ψi]− µi

Ψi (y;x) = exp (θiφi (y;x))

µi = E [φi]

Y

Y

X1

X2

g ( , ) � DKL ( || )

g ( ) = log�1− µi + µie

θi�− µie

θi

. . . + wφfφ (y , y , y ) + . . .

> 0

= maxy� ,y� ,y�

�y� , y� , y� �

< maxy� ,y� ,y�

�y� , y� , y� �

Φ1 (y5,7,;x) = exp (. . . + w f (y;x) + . . .)

Φ (yi,j ;x) = exp

��

k

wkfk (yi,j ;x)

p (y;x) =1

ZxΨ1 (y;x) · . . . · Ψn (y;x)

log E [Ψi]− µi

Ψi (y;x) = exp (θiφi (y;x))

µi = E [φi]

Y

Y

Y

Y

Y

X1

X2

g ( , ) � DKL ( || )

g ( ) = log�1− µi + µie

θi�− µie

θi

. . . + wφfφ (y , y , y ) + . . .

> 0

= maxy� ,y� ,y�

�y� , y� , y� �

< maxy� ,y� ,y�

�y� , y� , y� �

Φ1 (y5,7,;x) = exp (. . . + w f (y;x) + . . .)

Φ (yi,j ;x) = exp

��

k

wkfk (yi,j ;x)

p (y;x) =1

ZxΨ1 (y;x) · . . . · Ψn (y;x)

log E [Ψi]− µi

Ψi (y;x) = exp (θiφi (y;x))

µi = E [φi]

Y

Y

Y

Y

Y

X1

X2

g ( , ) � DKL ( || )

g ( ) = log�1− µi + µie

θi�− µie

θi

. . . + wφfφ (y , y , y ) + . . .

> 0

= maxy� ,y� ,y�

�y� , y� , y� �

< maxy� ,y� ,y�

�y� , y� , y� �

Φ1 (y5,7,;x) = exp (. . . + w f (y;x) + . . .)

Φ (yi,j ;x) = exp

��

k

wkfk (yi,j ;x)

p (y;x) =1

ZxΨ1 (y;x) · . . . · Ψn (y;x)

log E [Ψi]− µi

Ψi (y;x) = exp (θiφi (y;x))

µi = E [φi]

Y

Y

Y

Y

Y

X1

X2

g ( , ) � DKL ( || )

g ( ) = log�1− µi + µie

θi�− µie

θi

. . . + wφfφ (y , y , y ) + . . .

> 0

= maxy� ,y� ,y�

�y� , y� , y� �

< maxy� ,y� ,y�

�y� , y� , y� �

Φ1 (y5,7,;x) = exp (. . . + w f (y;x) + . . .)

Φ (yi,j ;x) = exp

��

k

wkfk (yi,j ;x)

p (y;x) =1

ZxΨ1 (y;x) · . . . · Ψn (y;x)

log E [Ψi]− µi

Ψi (y;x) = exp (θiφi (y;x))

µi = E [φi]

Y

Y

Y

Y

Y

X1

X2

g ( , ) � DKL ( || )

g ( ) = log�1− µi + µie

θi�− µie

θi

. . . + wφfφ (y , y , y ) + . . .

> 0

= maxy� ,y� ,y�

�y� , y� , y� �

< maxy� ,y� ,y�

�y� , y� , y� �

Φ1 (y5,7,;x) = exp (. . . + w f (y;x) + . . .)

Φ (yi,j ;x) = exp

��

k

wkfk (yi,j ;x)

p (y;x) =1

ZxΨ1 (y;x) · . . . · Ψn (y;x)

log E [Ψi]− µi

Ψi (y;x) = exp (θiφi (y;x))

µi = E [φi]

Y

Y

Y

Y

Y

X1

X2

g ( , ) � DKL ( || )

g ( ) = log�1− µi + µie

θi�− µie

θi

. . . + wφfφ (y , y , y ) + . . .

> 0

= maxy� ,y� ,y�

�y� , y� , y� �

< maxy� ,y� ,y�

�y� , y� , y� �

Φ1 (y5,7,;x) = exp (. . . + w f (y;x) + . . .)

Φ (yi,j ;x) = exp

��

k

wkfk (yi,j ;x)

p (y;x) =1

ZxΨ1 (y;x) · . . . · Ψn (y;x)

log E [Ψi]− µi

Ψi (y;x) = exp (θiφi (y;x))

µi = E [φi]

Y

Y

Y

Y

Y

X1

X2

… nationality. When testing our model we then encounter a sentence such as

(3) Arrest Warrant Issued for Richard Gere in India.

that leads us to extract that RICHARD GERE is a citizen of INDIA.

2.6 Global Consistency of Facts
As discussed above, distant supervision can lead to noisy extractions. However, such noise can often be easily identified by testing how compatible the extracted facts are with each other. In this work we are concerned with a particular type of compatibility: selectional preferences.

Relations require, or prefer, their arguments to be of certain types. For example, the nationality relation requires the first argument to be a person, and the second to be a country. On inspection, we find that these preferences are often not satisfied in a baseline distant-supervision system akin to Mintz et al. (2009). This often results from patterns such as "<Entity1> in <Entity2>" that fire in many cases where <Entity2> is a location, but not a country.

3 Model
Our observations in the previous section suggest that we should (a) explicitly model compatibility between extracted facts, and (b) integrate evidence from several documents to exploit redundancy. In this work we choose a Conditional Random Field (CRF) to achieve this. CRFs are a natural fit for this task: they allow us to capture correlations in an explicit fashion, and to incorporate overlapping input features from multiple documents.

The hidden output variables of our model are Y = (Y_c)_{c \in C}. That is, we have one variable Y_c for each candidate tuple c \in C. This variable can take as value any relation with the same arity as c. See the example relation variables in Figure 1.

The observed input variables X consist of a family of variables X_c = (X^1_c, \ldots, X^m_c) for each candidate tuple c. Here X^i_c stores the relevant observations we make for the i-th candidate mention tuple of c in the corpus. For example, X^1_{BILL GATES, MICROSOFT} in Figure 1 would contain, among others, the pattern "[M2] was founded by [M1]".

3.1 Factor Templates
Our conditional probability distribution over the variables X and Y is defined using a set T of factor templates. Each template T_j \in T defines a set of factors \{(y_i, x_i)\}, a set K_j of feature indices, parameters (\theta^j_k)_{k \in K_j} and feature functions (f^j_k)_{k \in K_j}. Together they define the following conditional distribution:

p(y|x) = \frac{1}{Z_x} \prod_{T_j \in T} \prod_{(y_i, x_i) \in T_j} \exp\Big( \sum_{k \in K_j} \theta^j_k f^j_k(y_i, x_i) \Big)    (4)

In our case the set T consists of four templates, which we describe below. We construct this graphical model using FACTORIE (McCallum et al., 2009), a probabilistic programming language that simplifies the construction process, as well as inference and learning.

3.1.1 Bias Template
We use a bias template T_Bias that prefers certain relations a priori over others. When the template is unrolled, it creates one factor per variable Y_c for candidate tuple c \in C. The template also consists of one weight \theta^{Bias}_r and feature function f^{Bias}_r for each possible relation r; f^{Bias}_r fires if the relation associated with tuple c is r.

3.1.2 Mention Template
In order to extract relations from text, we need to model the correlation between relation instances and their mentions in text. For this purpose we define the template T_Mention that connects each relation instance variable Y_c with its observed mention variables X_c. Crucially, this template gathers mentions from multiple documents, which enables us to exploit redundancy.

The feature functions of this template are taken from Mintz et al. (2009). They include features that inspect the lexical content between entity mentions in the same sentence, and the syntactic path between them. One example is

f^{Men}_{101}(y_c, x_c) := \begin{cases} 1 & y_c = \text{founded} \wedge \exists i: \text{"M2 was founded by M1"} \in x^i_c \\ 0 & \text{otherwise.} \end{cases}


It tests whether, for any of the mentions of the candidate tuple, the phrase "founded by" appears between the mentions of the argument entities.

3.1.3 Selectional Preferences Templates
To capture the correlations between entity types and the relations the entities participate in, we introduce the template T_Joint. It connects a relation instance variable Y_{e_1,\ldots,e_n} to the individual entity type variables Y_{e_1}, \ldots, Y_{e_n}. To measure the compatibility between relation and entity variables, we use one feature f^{Joint}_{r,t_1 \ldots t_a} (and weight \theta^{Joint}_{r,t_1 \ldots t_a}) for each combination of relation and entity types r, t_1, \ldots, t_a. f^{Joint}_{r,t_1 \ldots t_a} fires when the factor's variables are in the state r, t_1, \ldots, t_a. For example, f^{Joint}_{founded,person,company} fires if Y_{e_1} is in state person, Y_{e_2} is in state company, and Y_{e_1,e_2} is in state founded. After training we would expect the weight \theta^{Joint}_{founded,person,company} to be larger than \theta^{Joint}_{founded,person,country}.

We also add a template T_Pair that measures the pairwise compatibility between the relation variable Y_{e_1,\ldots,e_a} and each entity variable Y_{e_i} in isolation. Here we use features f^{Pair}_{i,r,t} that fire if e_i is the i-th argument of c, has the entity type t, and the candidate tuple c is labelled as an instance of relation r. For example, f^{Pair}_{1,founded,person} fires if Y_{e_1} (argument i = 1) is in state person and Y_{e_1,e_2} is in state founded, regardless of the state of Y_{e_2}.
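To make the four templates concrete, here is a hedged Scala sketch of the log-linear score of equation (4) over a single candidate tuple; the Candidate class and the string encoding of features are illustrative stand-ins, not FACTORIE's actual API.

object TemplateSketch {
  case class Candidate(relation: String,             // current value of Y_c
                       argTypes: Seq[String],        // current values of Y_e1 ... Y_en
                       mentionPatterns: Seq[String]) // patterns gathered across documents

  // The model score is the sum of weight * feature over all templates,
  // mirroring the product of exponentials in equation (4) in log space.
  def score(c: Candidate, weights: Map[String, Double]): Double = {
    val bias    = Seq(s"Bias:${c.relation}")                                   // T_Bias
    val mention = c.mentionPatterns.distinct.map(p => s"Men:${c.relation}:$p") // T_Mention
    val joint   = Seq("Joint:" + (c.relation +: c.argTypes).mkString(":"))     // T_Joint
    val pair    = c.argTypes.zipWithIndex.map { case (t, i) => s"Pair:$i:${c.relation}:$t" } // T_Pair
    (bias ++ mention ++ joint ++ pair).map(f => weights.getOrElse(f, 0.0)).sum
  }
}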

3.2 Inference
There are two types of inference we have to perform: sampling from the posterior during training (see Section 3.3), and finding the most likely configuration (a.k.a. MAP inference). In both settings we employ a Gibbs sampler (Geman and Geman, 1990) that randomly picks a variable Y_c and samples its relation value conditioned on its Markov blanket. At test time we decrease the temperature of our sampler in order to find an approximation of the MAP solution.
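As a concrete illustration of this procedure, the following hedged Scala sketch performs single-variable Gibbs updates with a decreasing temperature; scoreWith is an assumed helper (the model score with vars(i) set to relation r), not the paper's code.

import scala.util.Random

def annealedGibbs(vars: Array[Int], numRelations: Int,
                  scoreWith: (Array[Int], Int, Int) => Double,
                  steps: Int, rng: Random): Unit = {
  for (s <- 0 until steps) {
    val temperature = 1.0 / (1.0 + 0.01 * s) // lower T concentrates mass on the mode
    val i = rng.nextInt(vars.length)
    val scores = Array.tabulate(numRelations)(r => scoreWith(vars, i, r))
    val max = scores.max
    // Draw from the categorical distribution p(Y_i = r | Markov blanket) ∝ exp(score / T)
    val probs = scores.map(sc => math.exp((sc - max) / temperature))
    var u = rng.nextDouble() * probs.sum
    var r = 0
    while (r < numRelations - 1 && u > probs(r)) { u -= probs(r); r += 1 }
    vars(i) = r
  }
}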

3.3 Training
Most learning methods need to calculate the model expectations (Lafferty et al., 2001) or the MAP configuration (Collins, 2002) before making an update to the parameters. This step of inference is usually the bottleneck for learning, even when performed approximately.

SampleRank (Wick et al., 2009) is a rank-based learning framework that alleviates this problem by performing parameter updates within MCMC inference. Every pair of consecutive samples in the MCMC chain is ranked according to the model and the ground truth, and the parameters are updated when the rankings disagree. This update can follow different schemes; here we use MIRA (Crammer and Singer, 2003). This allows the learner to acquire more supervision per instance, and has led to efficient training for models in which inference …
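The idea can be sketched as follows; this hedged Scala fragment uses a simple perceptron-style step where the paper uses MIRA, and all names below are illustrative rather than FACTORIE's API.

import scala.collection.mutable

def sampleRankStep(weights: mutable.Map[String, Double],
                   current: Map[String, Double],  // feature vector of the current sample
                   proposed: Map[String, Double], // feature vector of the proposed sample
                   truthGain: Double,             // objective(proposed) - objective(current)
                   eta: Double = 0.1): Unit = {
  def score(f: Map[String, Double]) = f.map { case (k, v) => weights.getOrElse(k, 0.0) * v }.sum
  val modelGain = score(proposed) - score(current)
  val keys = current.keySet ++ proposed.keySet
  // The truth prefers the proposal but the model does not: push weights toward it.
  if (truthGain > 0 && modelGain <= 0)
    for (k <- keys) weights(k) = weights.getOrElse(k, 0.0) +
      eta * (proposed.getOrElse(k, 0.0) - current.getOrElse(k, 0.0))
  // The truth prefers the current sample but the model does not: push the other way.
  else if (truthGain < 0 && modelGain >= 0)
    for (k <- keys) weights(k) = weights.getOrElse(k, 0.0) +
      eta * (current.getOrElse(k, 0.0) - proposed.getOrElse(k, 0.0))
}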

Entities and Relations [Yao, Riedel, McCallum 2010]

Page 94:

Relation Extraction Experiments — Manual Evaluation

Precision @50:
              Isolated   Pipeline   Joint
                0.78       0.82      0.94

Training set: 2 years / 1 year of articles

Page 95:

Outline
• Motivate software engineering for statistics
• Graphical models for Extraction & Integration
  - Extraction (linear-chain CRFs)
  - Information Integration (really hairy CRFs, MCMC, SampleRank)
• Probabilistic Programming: FACTORIE
• Example
• Relation Extraction (cross-document, w/out labeled data)
• Probabilistic Programming inside a DB
• Ongoing Work

Pages 96-100: Information Extraction into DB

Documents → Extraction & Matching → database
query → query processing → answer

Pages 101-105: Information Extraction into Pr DB

Documents → Extraction & Matching → probabilistic database
query → query processing → answer

Pages 106-120: Information Extraction into Pr DB — The MCMC Alternative

Documents → Extraction & Matching → DB containing only one possible world at a time
MH inference proposes jumps between possible worlds.
query → SQL → answer

[Wick, McCallum, Miklau 2010]
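The loop behind this picture can be sketched in a few lines of Scala; this is a minimal illustration of Metropolis-Hastings over possible worlds under an assumed symmetric proposal, not the API of the cited system.

import scala.util.Random

// The database holds one world at a time; a proposal function mutates it
// (e.g. split or merge two entities), and the move is accepted with the
// usual MH ratio. All names here are illustrative.
def mhOverWorlds[World](init: World,
                        propose: (World, Random) => World,
                        logScore: World => Double, // unnormalized model log-probability
                        steps: Int, rng: Random): Seq[World] = {
  var world = init
  val visited = Seq.newBuilder[World]
  for (_ <- 0 until steps) {
    val next = propose(world, rng)
    // accept with probability min(1, p(next) / p(world))
    if (math.log(rng.nextDouble()) < logScore(next) - logScore(world)) world = next
    visited += world
  }
  visited.result()
}

A marginal query can then be answered as the fraction of visited worlds in which the query holds, which is what lets a standard SQL engine operate on each sampled world in turn.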

Pages 121-122: Particle Filtering (with compact representation)

Documents → Extraction & Matching → weighted set of possible worlds (particles)
query → answer

[Schultz, McCallum, Miklau 2010]
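As a rough illustration of how such a particle store answers queries, the following hedged sketch computes a marginal query probability as the normalized weight of the particles satisfying the query; Particle and queryProb are illustrative names, not the compact representation of the cited work.

// Each particle carries one possible world and its importance weight.
case class Particle[World](world: World, weight: Double)

def queryProb[World](particles: Seq[Particle[World]], query: World => Boolean): Double = {
  val total = particles.map(_.weight).sum
  particles.filter(p => query(p.world)).map(_.weight).sum / total
}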

Page 123:

Ongoing Work

• Factored particle filtering
• Query-specific MCMC
• Inference caching with respect to E[query]
• Learned proposal distributions
• Bayesian inference in distributed systems
• Probabilistic database of all of Wikipedia, automatically growing by reading

Page 124:

Thank you!

Version 0.9 available at http://code.google.com/p/factorie