
Page 1: Probabilistic Programming with Imperative Factor Graphs
people.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdf

Joint work with Karl Schultz, Sameer Singh, Michael Wick, Sebastian Riedel. Some slide material from Avi Pfeffer.

Andrew McCallum

Department of Computer Science, University of Massachusetts Amherst

Probabilistic Programming with Imperative Factor Graphs

Page 2: Probabilistic Programming with Imperative Factor Graphs

Uncertainty

• Uncertainty is ubiquitous
- Partial information
- Noisy sensors
- Non-deterministic actions
- Exogenous events

• Reasoning under uncertainty is a central challenge for building intelligent systems.

Page 3: Probabilistic Programming with Imperative Factor Graphs

Probability

• Probability provides a mathematically sound basis for dealing with uncertainty.

• Combined with utilities, it provides a basis for decision-making under uncertainty.

Page 4: Probabilistic Programming with Imperative Factor Graphs

Probabilistic Modeling in the Last Few Years

• Models ever growing in richness and variety
- hierarchical
- spatio-temporal
- relational
- infinite

• Developing the representation, inference and learning for a new model is a significant task.

Page 5: Probabilistic Programming with Imperative Factor Graphs

Conditional Random Fields

Finite state model

Undirected graphical model, trained to maximize conditional probability of output (sequence) given input (sequence)

Graphical model (linear-chain) [Lafferty, McCallum, Pereira 2001]

state sequence:       y1 y2 y3 y4 y5 y6 y7 y8
observation sequence: x1 x2 x3 x4 x5 x6 x7 x8

(transition factors connect y_{t−1} and y_t; observation factors connect x_t and y_t)

p(y|x) = (1/Z_x) ∏_{t=1}^{|x|} φ(y_t, y_{t−1}) φ(x_t, y_t),  where φ(x_t, y_t) = exp( Σ_k λ_k f_k(x_t, y_t) )
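To make the factorization concrete, here is a brute-force transcription of the equation for a toy two-token chain. All weights and keys below are invented for illustration, and Z_x is enumerated exhaustively, which is only viable on tiny examples (real inference uses forward-backward):

```python
import itertools
import math

# Invented log-potentials for a toy two-label chain: trans_w plays the role of
# phi(y_t, y_{t-1}) and obs_w the role of phi(x_t, y_t) = exp(sum_k lambda_k
# f_k(x_t, y_t)); both are stored directly in log space.
trans_w = {(0, 0): 0.5, (0, 1): -0.2, (1, 0): -0.2, (1, 1): 0.8}
obs_w = {("Bill", 0): -1.0, ("Bill", 1): 1.0,
         ("loves", 0): 0.5, ("loves", 1): -0.5}

def log_score(x, y):
    """Sum of log-potentials; the first position has no transition factor."""
    s = obs_w[(x[0], y[0])]
    for t in range(1, len(x)):
        s += trans_w[(y[t], y[t - 1])] + obs_w[(x[t], y[t])]
    return s

def p(y, x):
    """p(y|x) with Z_x computed by brute-force enumeration of all label sequences."""
    Z = sum(math.exp(log_score(x, list(yy)))
            for yy in itertools.product([0, 1], repeat=len(x)))
    return math.exp(log_score(x, y)) / Z

x = ["Bill", "loves"]
probs = {yy: p(list(yy), x) for yy in itertools.product([0, 1], repeat=len(x))}
```

The probabilities over all 2^|x| label sequences sum to one, as the normalizer Z_x guarantees.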

Page 6: Probabilistic Programming with Imperative Factor Graphs

Conditional Random Fields

Finite state model

Undirected graphical model, trained to maximize conditional probability of output (sequence) given input (sequence)

Graphical model (linear-chain) [Lafferty, McCallum, Pereira 2001]

state sequence:       y1 y2 y3 y4 y5 y6 y7 y8
observation sequence: x1 x2 x3 x4 x5 x6 x7 x8

Wide-spread interest, positive experimental results in many applications:
- Noun phrase, Named entity [HLT'03], [CoNLL'03]
- Protein structure prediction [ICML'04]
- IE from Bioinformatics text [Bioinformatics '04], ...
- Asian word segmentation [COLING'04], [ACL'04]
- IE from Research papers [HLT'04]
- Object classification in images [CVPR '04]

Page 7: Probabilistic Programming with Imperative Factor Graphs

Skip-chain CRF

. . .

Senator Joe Green said today . Green chairs the ...

Joint NER across sentences

Capture long-distance dependencies

[Sutton, McCallum, 2005]

Page 8: Probabilistic Programming with Imperative Factor Graphs

Factorial CRF

Part-of-speech

Noun-phrase boundaries

Named-entity tag

English words

Those surfers like San Jose

[Sutton, McCallum ’04]

Joint Part-of-speech, NP chunking, NER

Inference by Loopy Belief Propagation

Page 9: Probabilistic Programming with Imperative Factor Graphs

Pairwise Affinity CRF

[Figure: mentions "Mr. Hill", "Amy Hall", "Dana Hill", "Dana", "she" joined by pairwise edges labeled C (coreferent) or N (not coreferent)]

[McCallum & Wellner 2003]

Page 10: Probabilistic Programming with Imperative Factor Graphs

Entity Resolution

[Figure: five "mention" nodes: "Mr. Hill", "Amy Hall", "Dana Hill", "Dana", "she"]

Page 11: Probabilistic Programming with Imperative Factor Graphs

Entity Resolution

[Figure: the mentions grouped into two candidate "entity" clusters]


Page 13: Probabilistic Programming with Imperative Factor Graphs

Entity Resolution

[Figure: the mentions grouped into three candidate "entity" clusters]

Page 14: Probabilistic Programming with Imperative Factor Graphs

CRF for Co-reference

[Figure: mentions "Mr. Hill", "Amy Hall", "Dana Hill", "Dana", "she"]

Page 15: Probabilistic Programming with Imperative Factor Graphs

CRF for Co-reference

[Figure: mentions "Mr. Hill", "Amy Hall", "Dana Hill", "Dana", "she" with pairwise edges labeled C (coreferent) or N (not coreferent)]

[McCallum & Wellner 2003]

p(y|x) = (1/Z_x) exp( Σ_{i,j} Σ_l λ_l f_l(x_i, x_j, y_ij) )

+ mechanism for preserving transitivity

Make pair-wise merging decisions jointly by:
- calculating a joint probability
- including all edge weights
- enforcing transitivity

Page 16: Probabilistic Programming with Imperative Factor Graphs

Pairwise Affinity is not Enough

[Figure: the mentions "Mr. Hill", "Amy Hall", "Dana Hill", "Dana", "she" with pairwise C/N edge labels]

Page 17: Probabilistic Programming with Imperative Factor Graphs

Pairwise Affinity is not Enough

[Figure: the mentions "she", "Amy Hall", "she", "she", "she" with pairwise C/N edge labels]

Page 18: Probabilistic Programming with Imperative Factor Graphs

Pairwise Comparisons Not Enough

Examples:

• ∀ mentions are pronouns?

• Entities have multiple attributes (name, email, institution, location);

need to measure “compatibility” among them.

• Having 2 “given names” is common, but not 4.
– e.g. Howard M. Dean / Martin, Dean / Howard Martin

• Need to measure size of the clusters of mentions.

• ∃ a pair of lastname strings that differ > 5?

We need to ask ∃, ∀ questions about a set of mentions

We want first-order logic!


Page 20: Probabilistic Programming with Imperative Factor Graphs

Partition Affinity CRF

[Figure: the mentions "she", "Amy Hall", "she", "she", "she" grouped into a partition]

Ask arbitrary questions about all entities in a partition with first-order logic...


Page 25: Probabilistic Programming with Imperative Factor Graphs

How can we perform inference and learning in models that cannot be “unrolled”?

Can’t use belief propagation.
Can’t use standard integer linear programming.

Page 26: Probabilistic Programming with Imperative Factor Graphs

Don’t represent all alternatives...

[Figure: many alternative configurations of the mentions "she", "she", "Amy Hall", "she", "she"]

Page 27: Probabilistic Programming with Imperative Factor Graphs

Don’t represent all alternatives... just one at a time

[Figure: two configurations of the mentions, connected by a stochastic jump drawn from a proposal distribution]

Markov Chain Monte Carlo

Page 28: Probabilistic Programming with Imperative Factor Graphs

SampleRank: Metropolis-Hastings for MAP

Given a factor graph with target variables y and observed x.

Maximum a posteriori (MAP) inference: argmax_{y ∈ F} P(Y = y | x)

... over a model

P(y|x) = (1/Z_x) ∏_{y_i ∈ y} ψ(x, y_i)

... using a proposal distribution q(y′|y) : F × F → [0, 1]

MH for MAP:
1. Begin with some initial configuration y⁰ ∈ F.
2. For i = 1, 2, 3, ... draw a local modification y′ ∈ F from q.
3. Probabilistically accept the jump as a Bernoulli draw with parameter α:

α = min( 1, [p(y′)/p(y)] · [q(y|y′)/q(y′|y)] )

F is the feasible region defined by deterministic constraints, e.g. clustering, parse-tree projectivity.

Can do MAP inference with a decreasing temperature on the ratio of p(y)’s.

(UMass, Amherst) SampleRank vs. Contrastive Divergence, IESL
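The MH-for-MAP loop, including the decreasing-temperature trick, can be sketched in a library-agnostic way. The toy score and proposal below are invented for illustration; `mh_map`, `score`, and `propose` are hypothetical names, not FACTORIE API:

```python
import math
import random

def mh_map(score, propose, y0, steps=2000, t0=2.0):
    """Metropolis-Hastings search for a MAP configuration.

    score(y): unnormalized log p(y), so p(y')/p(y) = exp(score(y') - score(y)).
    propose(y): returns (y', log q(y'|y) - log q(y|y')) for a local modification.
    A temperature that decreases over time sharpens the acceptance ratio so the
    walk anneals toward the MAP configuration, as on the slide.
    """
    y = best = y0
    for i in range(steps):
        temp = t0 / (1 + i)                       # decreasing temperature
        y_new, log_q_ratio = propose(y)
        log_alpha = (score(y_new) - score(y)) / temp - log_q_ratio
        if random.random() < math.exp(min(0.0, log_alpha)):  # Bernoulli(alpha)
            y = y_new
        if score(y) > score(best):                # track best state visited
            best = y
    return best

# Toy stand-in: maximize -(y - 7)^2 over integers with symmetric +/-1 jumps.
random.seed(0)
score = lambda y: -(y - 7) ** 2
propose = lambda y: (y + random.choice([-1, 1]), 0.0)  # symmetric q => log-ratio 0
y_map = mh_map(score, propose, y0=0)
```

With a symmetric proposal the q-ratio term vanishes, so acceptance depends only on the score difference, exactly as in the cancellation on the next slide.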

Page 29: Probabilistic Programming with Imperative Factor Graphs

M-H Natural Efficiencies

1. Partition function cancels:

p(y′)/p(y) = p(Y=y′|x; θ) / p(Y=y|x; θ)
           = [ (1/Z_x) ∏_{y′_i ∈ y′} ψ(x, y′_i) ] / [ (1/Z_x) ∏_{y_i ∈ y} ψ(x, y_i) ]
           = ∏_{y′_i ∈ y′} ψ(x, y′_i) / ∏_{y_i ∈ y} ψ(x, y_i)

2. Unchanged factors cancel:

           = [ ∏_{y′_i ∈ δy′} ψ(x, y′_i) · ∏_{y_i ∈ y′∖δy′} ψ(x, y_i) ] / [ ∏_{y_i ∈ δy} ψ(x, y_i) · ∏_{y_i ∈ y∖δy} ψ(x, y_i) ]
           = ∏_{y′_i ∈ δy′} ψ(x, y′_i) / ∏_{y_i ∈ δy} ψ(x, y_i)

where δy is the “diff”, i.e. the variables in y that have changed; the factors over y∖δy and y′∖δy′ are identical, so they cancel.

How do we learn the parameters θ?

SampleRank motivation

QUESTION: how do we learn θ for MH?

Problem with traditional ML: inference sits in the inner-most loop of learning
- maximum likelihood requires inference for marginals
- perceptron requires inference for decoding

Want: push updates (not inference) into the inner-most loop of learning

Idea: use MH as a guide, exploit its efficiency, and learn to rank neighboring samples during the random walk
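The idea above can be sketched as a perceptron-style update applied between neighboring samples on the walk. This is a schematic reconstruction of the SampleRank idea, not the exact published algorithm; the task, features, and names below are invented for illustration:

```python
import random

def samplerank(features, objective, propose, y0, theta, epochs=500, lr=0.1):
    """Schematic SampleRank-style trainer (a simplified reconstruction).

    features(y): sparse feature dict, so the model score is theta . features(y).
    objective(y): ground-truth quality of a configuration (higher is better).
    Whenever the model's ranking of (y, y') disagrees with the objective's
    ranking, make a perceptron update toward the objectively better sample.
    """
    def model_score(y):
        return sum(theta.get(k, 0.0) * v for k, v in features(y).items())

    y = y0
    for _ in range(epochs):
        y_new = propose(y)
        better, worse = (y_new, y) if objective(y_new) > objective(y) else (y, y_new)
        if objective(better) != objective(worse) and model_score(better) <= model_score(worse):
            for k, v in features(better).items():
                theta[k] = theta.get(k, 0.0) + lr * v
            for k, v in features(worse).items():
                theta[k] = theta.get(k, 0.0) - lr * v
        y = y_new  # keep walking; real variants accept/reject via MH instead
    return theta

# Toy task (invented): state = (number, label); label should be True iff even.
random.seed(1)
features = lambda y: {("lab", y[1], y[0] % 2): 1.0}
objective = lambda y: 1.0 if y[1] == (y[0] % 2 == 0) else 0.0
propose = lambda y: (y[0], not y[1]) if random.random() < 0.5 else ((y[0] + 1) % 10, y[1])
theta = samplerank(features, objective, propose, (0, True), {})
```

Note that only the two neighboring samples are ever compared; no marginals or decoding are computed, which is exactly the point of pushing updates into the inner-most loop.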


Page 30: Probabilistic Programming with Imperative Factor Graphs

Probabilistic Modeling in the Last Few Years

• Models ever growing in richness and variety
- hierarchical
- spatio-temporal
- relational
- infinite

• Developing the representation, reasoning and learning for a new model is a significant task.

Page 31: Probabilistic Programming with Imperative Factor Graphs

Probabilistic Programming Languages

• Make it easy to represent rich, complex models, using the full power of programming languages
- data structures
- control mechanisms
- abstraction

• Inference and learning come for free (or sort of)

• Give you the language to think of and create new models

Page 32: Probabilistic Programming with Imperative Factor Graphs

Small Sampling of Probabilistic Programming Languages

• Logic-based: Markov logic, BLOG, PRISM

• Functional: IBAL, Church

• Object oriented: Figaro, Infer.NET

Page 33: Probabilistic Programming with Imperative Factor Graphs

BLOG

#Researcher ~ NumResearchersPrior();

Name(r) ~ NamePrior();

#Paper ~ NumPapersPrior();

FirstAuthor(p) ~ Uniform({Researcher r});

Title(p) ~ TitlePrior();

PubCited(c) ~ Uniform({Paper p});

Text(c) ~ NoisyCitationGrammar (Name(FirstAuthor(PubCited(c))), Title(PubCited(c)));

• Generative model of objects and relations.

• Handles unknown number of objects

• Inference by MCMC.

[Milch et al, 2005]

Page 34: Probabilistic Programming with Imperative Factor Graphs

Church

• Tell generative story-line in Scheme.

• Do MCMC inference over execution paths.

(define (DP alpha proc)
  (let ((sticks (mem (lambda x (beta 1.0 alpha))))
        (atoms  (mem (lambda x (proc)))))
    (lambda () (atoms (pick-a-stick sticks 1)))))

(define (pick-a-stick sticks J)
  (if (< (random) (sticks J))
      J
      (pick-a-stick sticks (+ J 1))))

(define (DPmem alpha proc)
  (let ((dps (mem (lambda args
                    (DP alpha (lambda () (apply proc args)))))))
    (lambda argsin ((apply dps argsin)))))

[Goodman, Mansinghka, Roy, Tenenbaum, 2009]

Page 35: Probabilistic Programming with Imperative Factor Graphs

Figaro

• Generative model of objects and relations.

• Object oriented (also in Scala!)
- “Models” are the basic building block, composed of other models, derived by inheritance.
- Models are objects with conditions, constraints and relations to other objects.
- Model = data + factors; they are intertwined.

[Pfeffer, 2009]

Page 36: Probabilistic Programming with Imperative Factor Graphs

Figaro [Pfeffer, 2009]

People smoke with probability 0.6:
Smoke(x)   1.5
Friends are 3 times as likely to have the same smoking habit as different:
¬Friends(x,y) ∨ ¬Smoke(x) ∨ Smoke(y)   3
¬Friends(x,y) ∨ Smoke(x) ∨ ¬Smoke(y)   3

class Person { val smokes = Flip(0.6) }
val alice, bob, clara = new Person
alice.smokes.condition(true)
val friends = List((alice, bob), (bob, clara))
def constraint(pair: (Boolean, Boolean)) =
  if (pair._1 == pair._2) 3.0 else 1.0
for ((p1, p2) <- friends)
  Pair(p1.smokes, p2.smokes).constrain(constraint)

Page 37: Probabilistic Programming with Imperative Factor Graphs

Markov Logic
First-Order Logic as a Template to Define CRF Parameters

[Richardson & Domingos 2005] [Paskin & Russell 2002] [Taskar et al 2003]

ground Markov network

Grounding the Markov network requires space O(n^r), where n = number of constants and r = highest clause arity.

Page 38: Probabilistic Programming with Imperative Factor Graphs

My Approach

• I’m going to immediately dismiss the generative models.
- Interesting, but not what performs best in NLP.

Want

• Discriminatively trained factor graphs.

• Best previous example of this: Markov Logic.

Page 39: Probabilistic Programming with Imperative Factor Graphs

Logic + Probability

• Significant interest in this combination
- Poole, Muggleton, De Raedt, Sato, Domingos, ...

• We now hypothesize that in much of this previous work the “logic” aspect is mostly a red herring.
- The power comes from repeated relational structures and tied parameters.
- Logic is one way to specify these structures, but not the only one, and perhaps not the best.
- In deterministic programming, Prolog was replaced by imperative languages
  ✦ programmers have to keep the imperative solver in mind after all
  ✦ much domain knowledge is procedural anyway
- Logical inference is replaced by probabilistic inference.

Page 40: Probabilistic Programming with Imperative Factor Graphs

Declarative Model Specification

• One of the biggest advances in AI & ML

• Gone too far? Much domain knowledge is also procedural.

• Logic + Probability → Imperative + Probability
- Rising interest: Church, Infer.NET, ...

• Our approach
- Preserve the declarative statistical semantics of factor graphs
- Provide imperative hooks to define structure, parameterization, inference, learning. Efficient. Easy-to-use.
- “Imperatively-Defined Factor Graphs” (IDFs)

Page 41: Probabilistic Programming with Imperative Factor Graphs

Our Design Goals

• Represent factor graphs
- emphasis on discriminative undirected models

• Scalability
- input data, output configuration, factors, tree-width
- observed data that cannot fit in memory
- super-exponential number of factors

• Efficient discriminative parameter estimation
- sensitive to the expense of inference

• Leverage object-oriented benefits
- modularity, encapsulation, inheritance, ...

• Integrate declarative & procedural knowledge
- natural, easy-to-use
- upcoming slides: 3 examples of injecting imperativ-ism into factor graphs

Page 42: Probabilistic Programming with Imperative Factor Graphs

FACTORIE

• Factor Graphs, Imperative, Extensible

• Implemented as a library in Scala [Martin Odersky]
- object oriented & functional
- type inference
- lazy evaluation
- everything an object (int, float, ...)
- nice syntax for creating “domain-specific languages”
- runs in JVM (complete interoperation with Java)
- “Haskell++ in a Java style”

• Library, not a new “little language”
- all familiar Java constructs & libraries available to you
- integrate data pre-processing & evaluation with model specification
- Scala makes the syntax not too bad
- but not as compact as a dedicated language (BLOG, MLN)

Page 43: Probabilistic Programming with Imperative Factor Graphs

Stages of FACTORIE programming

1. Define templates for data (i.e. classes)
- Use data structures just like in deterministic programming.
- Only special requirement: provide “undo” capability for changes.

2. Define templates for factors
- Distinct from the data representation above; makes it easy to modify model scores independently.
- Use & transform the data’s natural relations to define the factors’ relations.

3. Optionally, define MCMC proposal functions that leverage domain knowledge.
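The “undo” requirement in step 1 can be sketched language-neutrally: every change to a variable is recorded on a diff list, so a rejected proposal can be rolled back. This is a hypothetical minimal design for illustration, not FACTORIE’s actual classes:

```python
class Variable:
    """A mutable variable whose changes are recorded so they can be undone."""
    def __init__(self, value):
        self.value = value

    def set(self, new_value, diff_list):
        diff_list.append((self, self.value))  # record old value before changing
        self.value = new_value

class DiffList(list):
    """Records (variable, old_value) pairs; undo() rolls a proposal back."""
    def undo(self):
        for var, old in reversed(self):  # reverse order handles repeated sets
            var.value = old
        self.clear()

# Propose a change, then reject the proposal by undoing it.
v = Variable("T")
d = DiffList()
v.set("F", d)
changed = v.value    # "F" while the proposal is pending
d.undo()
restored = v.value   # "T" again after the rejection
```

The same diff list that enables undo also tells the sampler exactly which variables changed, which is what makes the factor-cancellation scoring on the later slides cheap.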

Page 44: Probabilistic Programming with Imperative Factor Graphs

Scala

• New variable
var myHometown : String
var myAltitude = 10523.2

• New constant
val myName = "Andrew"

• New method
def climb(increment: Double) = myAltitude += increment

• New class
class Skier extends Person

• New trait (like a Java interface with implementations)
trait FirstAid { def applyBandage = ... }

• New class with trait
class BackcountrySkier extends Skier with FirstAid

• New static object [generic]
object GlobalSkierTable extends ArrayList[Skier]

Page 45: Probabilistic Programming with Imperative Factor Graphs

Example: Linear-Chain CRF for Segmentation

class Label(isBeg: Boolean) extends Bool(isBeg)

class Token(word: String) extends EnumVariable(word)

Labels: T    F     F      T   F     F
Words:  Bill loves skiing Tom loves snowshoeing

Page 46: Probabilistic Programming with Imperative Factor Graphs

Example: Linear-Chain CRF for Segmentation

class Label(isBeg: Boolean) extends Bool(isBeg) with VarSeq

class Token(word: String) extends EnumVariable(word) with VarSeq

label.prev  label.next

Labels: T    F     F      T   F     F
Words:  Bill loves skiing Tom loves snowshoeing

Page 47: Probabilistic Programming with Imperative Factor Graphs

Example: Linear-Chain CRF for Segmentation

class Label(isBeg: Boolean) extends Bool(isBeg) with VarSeq

class Token(word: String) extends EnumVariable(word) with VarSeq

Labels: T    F     F      T   F     F
Words:  Bill loves skiing Tom loves snowshoeing

Page 48: Probabilistic Programming with Imperative Factor Graphs

Example: Linear-Chain CRF for Segmentation

class Label(isBeg: Boolean) extends Bool(isBeg) with VarSeq {
  val token : Token
}
class Token(word: String) extends EnumVariable(word) with VarSeq {
  val label : Label
}

Labels: T    F     F      T   F     F
Words:  Bill loves skiing Tom loves snowshoeing

Avoid representing relations by indices. Do it directly with members, pointers... any arbitrary data structure.

Page 49: Probabilistic Programming with Imperative Factor Graphs

Example: Linear-Chain CRF for Segmentation

class Label(isBeg: Boolean) extends Bool(isBeg) with VarSeq {
  val token : Token
}
class Token(word: String) extends EnumVariable(word) with VarSeq {
  val label : Label
  def longerThanSix = word.length > 6
}

Labels: T    F     F      T   F     F
Words:  Bill loves skiing Tom loves snowshoeing

Page 50: Probabilistic Programming with Imperative Factor Graphs

Example: Linear-Chain CRF for Segmentation

class Label(isBeg: Boolean) extends Bool(isBeg) with VarSeq {
  val token : Token
}
class Token(word: String) extends EnumVariable(word) with VarSeq {
  val label : Label
  def longerThanSix = word.length > 6
}

// Factor templates
object StateTemplate extends Template1[Label]

Labels: T    F     F      T   F     F
Words:  Bill loves skiing Tom loves snowshoeing

Page 51: Probabilistic Programming with Imperative Factor Graphs

Example: Linear-Chain CRF for Segmentation

class Label(isBeg: Boolean) extends Bool(isBeg) with VarSeq {
  val token : Token
}
class Token(word: String) extends EnumVariable(word) with VarSeq {
  val label : Label
  def longerThanSix = word.length > 6
}

// Factor templates
object StateTemplate extends Template1[Label]
object StateTokenTemplate extends Template2[Label,Token]

Labels: T    F     F      T   F     F
Words:  Bill loves skiing Tom loves snowshoeing

Page 52: Probabilistic Programming with Imperative Factor Graphs

Example: Linear-Chain CRF for Segmentation

class Label(isBeg: Boolean) extends Bool(isBeg) with VarSeq {
  val token : Token
}
class Token(word: String) extends EnumVariable(word) with VarSeq {
  val label : Label
  def longerThanSix = word.length > 6
}

// Factor templates
object StateTemplate extends Template1[Label]
object StateTokenTemplate extends Template2[Label,Token]
object StateTransitionTemplate extends Template2[Label,Label]

Labels: T    F     F      T   F     F
Words:  Bill loves skiing Tom loves snowshoeing


Page 54: Probabilistic Programming with Imperative Factor Graphs

Imperativ-ism #1: Jump Function

• Proposal “jump function”
– Make changes to world state

• Sometimes simple, sometimes not
– Sample Gaussian with mean at old value
– Sample cluster to split, run stochastic greedy agglomerative clustering

• Gibbs sampling, one variable at a time
– poor mixing

• Rich jump function
– Natural place to embed domain knowledge about which variables should change in concert.
– Avoid some expensive deterministic factors with property-preserving jump functions (e.g. coref transitivity, dependency-parse projectivity)
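One way to read “property-preserving jump functions”: instead of scoring a transitivity factor, make proposals that can only produce transitive configurations, e.g. move a whole mention between clusters, so the state is always a valid partition. An illustrative sketch with made-up mentions and names:

```python
import random

def propose_move(assign, next_id):
    """Move one mention to an existing cluster or to a brand-new one.

    assign: dict mention -> cluster id. Because every proposal edits the
    partition directly, each reachable state is a valid clustering, so
    coreference transitivity holds by construction and never has to be
    enforced by an expensive deterministic factor.
    """
    new = dict(assign)
    mention = random.choice(sorted(new))
    targets = sorted(set(new.values())) + [next_id]  # next_id: fresh cluster
    new[mention] = random.choice(targets)
    return new

# Random walk over partitions of the mentions from the coreference slides.
random.seed(3)
state = {"Mr. Hill": 0, "Dana Hill": 0, "Dana": 1, "she": 1, "Amy Hall": 2}
for step in range(100):
    state = propose_move(state, next_id=3 + step)
```

After any number of jumps, every mention still belongs to exactly one cluster; the invariant is maintained by the representation, not by the model’s factors.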


Page 60: Probabilistic Programming with Imperative Factor Graphspeople.cs.umass.edu/~mccallum/talks/edinburgh-factorie2010b.pdfContrastive Divergence IESL 3 / 27 SampleRank Metropolis-Hastings

Key Operation: Scoring a Proposal

• Acceptance probability ~ ratio of model scores.Scores of factors that didn’t change cancel.

• To efficiently score:– Proposal method runs.– Automatically build a list of variables that changed.– Find factors that touch changed variables– Find other (unchanged) variables needed to calculate

those factors’ scores• How to find factors from variables & vice versa?

– In BLOG, rich, highly-indexed data structure stores mapping variables ←→ factors

– But complex to maintain as structure changes
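The steps above can be sketched concretely. The following Python toy (all names hypothetical, not FACTORIE's actual API) registers factors on the variables they touch, then scores a proposal by re-scoring only the touched factors; every unchanged factor cancels from the acceptance ratio:

```python
import math
import random

class Variable:
    def __init__(self, value):
        self.value = value
        self.factors = []          # factors touching this variable

class Factor:
    def __init__(self, variables, log_score):
        self.variables = variables
        self.log_score = log_score
        for v in variables:        # register on neighbor variables
            v.factors.append(self)

    def score(self):
        return self.log_score(*[v.value for v in self.variables])

def mh_step(variable, new_value, rng=random.random):
    """One Metropolis-Hastings step with a symmetric proposal.
    Only factors touching the changed variable are re-scored."""
    touched = variable.factors
    before = sum(f.score() for f in touched)
    old = variable.value
    variable.value = new_value
    after = sum(f.score() for f in touched)
    if after - before >= 0 or rng() < math.exp(after - before):
        return True                # accept
    variable.value = old           # reject: roll the change back
    return False

# Tiny chain y1 - y2 with a factor preferring equal neighbors.
y1, y2 = Variable(0), Variable(0)
Factor([y1, y2], lambda a, b: 2.0 if a == b else 0.0)
accepted = mh_step(y1, 1, rng=lambda: 0.99)  # score drops by 2; this draw rejects
```

Here the acceptance test uses the log-score difference of the touched factors alone, which is exactly the ratio the slide describes once unchanged factors cancel.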

Page 61:

Imperativ-ism #2: Model Structure
• Maintain no map structure between factors and variables.
• Finding the relevant templates is easy: usually # templates < 50.
• Primitive operation: given a factor template and one changed variable, find the other variables.
• In the factor Template object, define imperative methods that do this:
– unroll1(v1) returns (v1, v2, v3)
– unroll2(v2) returns (v1, v2, v3)
– unroll3(v3) returns (v1, v2, v3)
– I.e., use a Turing-complete language to determine structure on the fly.
– If you want to use a data structure instead, access it in the method.
– If you want a higher-level language for specifying structure, write it in terms of this primitive.
• Other nice attribute:
– Easy to do value-conditioned structure (Case-Factor Diagrams, etc.)
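A minimal Python analogue of the unroll idea (hypothetical names, not FACTORIE's Scala API): the template owns imperative methods that, given one changed variable, compute the factor's other neighbors on the fly, so no variable↔factor map is ever stored:

```python
# Each label knows only its position in a sequence; transition-factor
# structure is computed on demand rather than stored in an index.
class Label:
    def __init__(self, index, seq):
        self.index, self.seq = index, seq

    @property
    def prev(self):
        return self.seq[self.index - 1] if self.index > 0 else None

    @property
    def next(self):
        return self.seq[self.index + 1] if self.index + 1 < len(self.seq) else None

class TransitionTemplate:
    """Analogue of unroll1/unroll2: recover a factor's full neighbor
    tuple from whichever of its variables changed."""
    def unroll1(self, label):   # label plays role 1: factor (label, label.next)
        return [(label, label.next)] if label.next is not None else []

    def unroll2(self, label):   # label plays role 2: factor (label.prev, label)
        return [(label.prev, label)] if label.prev is not None else []

    def factors_touching(self, label):
        return self.unroll1(label) + self.unroll2(label)

seq = []
seq.extend(Label(i, seq) for i in range(3))
t = TransitionTemplate()
middle_factors = t.factors_touching(seq[1])  # the middle label touches two factors
```

Because the unroll methods are ordinary code, they can equally well consult a data structure or branch on variable values, which is how value-conditioned structure falls out for free.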

Pages 62–67:

Example: Linear-Chain CRF for Segmentation

class Label(isBeg:Boolean) extends Bool(isBeg) with VarSeq {
  val token : Token
}
class Token(word:String) extends EnumVariable(word) with VarSeq {
  val label : Label
  def longerThanSix = word.length > 6
}

// Factor templates
object StateTemplate extends Template1[Label]
object StateTokenTemplate extends Template2[Label,Token] {
  def unroll1(label:Label) = Factor(label, label.token)
  def unroll2(token:Token) = new Error // Tokens shouldn't change
}
object StateTransitionTemplate extends Template2[Label,Label] {
  def unroll1(label:Label) = Factor(label, label.next)
  def unroll2(label:Label) = Factor(label.prev, label)
}

[Figure: a linear chain of Label variables over Words, with state, state-token, and state-transition factors]

Page 68:

Imperativ-ism #3: Neighbor-Sufficient Map
• "Neighbor variables" of a factor
– the variables touching the factor.
• "Sufficient statistics" of a factor
– a vector whose dot product with the weights of a log-linear factor gives the factor's score.
• These two roles are usually conflated; separate them.
• Skip-chain NER: instead of 5×5 parameters, just 2:
(label1, label2) → label1 == label2
[Figure: skip-chain CRF over "Bill loves Paris Bill the painter ...", labels PER O LOC PER O O, with a skip edge linking the two "Bill" labels]
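The parameter-sharing trick on this slide can be sketched in a few lines of Python (hypothetical names, not FACTORIE's API): the factor's neighbors are two labels, but its sufficient statistic is the single boolean label1 == label2, so the template carries 2 weights instead of 5×5:

```python
# Separate "neighbor variables" (what the factor touches) from
# "sufficient statistics" (what the factor is actually scored on).
class SkipTemplate:
    def __init__(self):
        # One weight per value of the boolean statistic,
        # instead of one per (label1, label2) pair.
        self.weights = {True: 0.0, False: 0.0}

    def statistics(self, label1, label2):
        return label1 == label2          # collapse 5x5 label pairs to a boolean

    def score(self, label1, label2):
        return self.weights[self.statistics(label1, label2)]

t = SkipTemplate()
t.weights[True] = 1.5                    # reward agreeing skip-linked labels
labels = ["PER", "O", "LOC", "PER"]
same = t.score(labels[0], labels[3])     # the two "Bill" mentions: PER vs PER
diff = t.score(labels[0], labels[2])     # PER vs LOC
```

With 5 label values, a fully parameterized pairwise factor would need 25 weights; the statistics mapping reduces that to the 2 the slide mentions.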


Pages 70–72:

Example: Skip-Chain CRF for Segmentation

class Label(isBeg:Boolean) extends Bool(isBeg) with VarSeq {
  val token : Token
}
class Token(word:String) extends EnumVariable(word) with VarSeq {
  val label : Label
  def longerThanSix = word.length > 6
}

// Factor templates
object StateTemplate extends TemplateWithStatistics1[Label]
object StateTokenTemplate extends Template2[Label,Token] {
  def unroll1(label:Label) = Factor(label, label.token)
  def unroll2(token:Token) = new Error // Tokens shouldn't change
}
object StateTransitionTemplate extends TemplateWithStatistics2[Label,Label] {
  def unroll1(label:Label) = Factor(label, label.next)
  def unroll2(label:Label) = Factor(label.prev, label)
}
object SkipTemplate extends Template2[Label,Label] with Statistics1[Bool] {
  def unroll1(label:Label) =
    for (other <- label.seq; if (label.token == other.token)) yield Factor(label, other)
  def statistics(label1:Label, label2:Label) = Stat(label1 == label2)
}

Page 73:

Example: Dependency Parsing

class Word(str:String) extends EnumVariable(str)
class Node(word:Word, parent:Node) extends PrimitiveVariable(parent)

object ChildParentTemplate extends Template1[Node] with Statistics2[Word,Word] {
  def statistics(n:Node) = Stat(n.word, n.parent.word)
}
object NearestVerbTemplate extends Template1[Node] with Statistics2[Word,Word] {
  def statistics(n:Node) = Stat(n.word, closestVerb(n).word)
  def closestVerb(n:Node) = if (isVerb(n.word)) n else closestVerb(n.parent)
  def unroll1(n:Node) = n.selfAndDescendants
}

[Figure: a dependency tree rooted at a VERB]
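The closestVerb walk above is just a recursion over parent pointers; a small Python sketch of the same idea (illustrative names, assuming the root of the tree is a verb so the recursion terminates):

```python
# Sketch of the slide's closestVerb: climb parent pointers until a verb.
class Node:
    def __init__(self, word, pos, parent=None):
        self.word, self.pos, self.parent = word, pos, parent

def closest_verb(n):
    """Nearest ancestor-or-self that is a verb (assumes the root is a verb)."""
    return n if n.pos == "VERB" else closest_verb(n.parent)

root = Node("loves", "VERB")
subj = Node("Bill", "NOUN", parent=root)
det  = Node("the", "DET", parent=subj)
```

Because the statistic depends on every node on the path to the nearest verb, unroll1 must return the node itself and all its descendants: changing any node's parent can change the nearest verb of everything below it.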

Page 74:

Extensibility
• Many variable types provided:
- boolean, int, float, String, categorical, ...
• Create new ones!
- set-valued variable
- finite-state machine as a variable [JHU]
• Create new factor types
- Poisson, Dirichlet, ...

Page 75:

Experimental Results
• Joint segmentation & coreference of research-paper citations.
- 1295 mentions, 134 entities, 36487 tokens
• Compare with MLNs (Alchemy)
- Same observable features
• Factorie results:
- ~25% reduction in error (segmentation & coref)
- 3–20x faster
- coref results: [table not reproduced in this transcript]

Page 76:

FACTORIE Summary
• Factor graphs,
• ...object-oriented
– data types and factor template types, with inheritance
• ...scalable
– factors created on demand; only score diffs
• ...with imperative hooks
– jump function; override variable.set() for coordination
– model structure
– neighbor variables → sufficient statistics
• ...discriminative
– efficient online training by SampleRank
– generative modeling also provided (LDA = ~12 lines)
• Combine declarative & procedural knowledge

Page 77:

Conclusion: Some Reasons To Use Probabilistic Programming
• Simple
- Save time & avoid debugging complex, hand-built ML code.
- Say exactly what you want, in the way you want to say it.
• Flexible
- Encourages research exploration by making it easier to try new modeling ideas.
- The language provides the right "hinge points" for the flexibility you want, without the underlying cruft.
• Glue that binds many reasoning paradigms together.
• Allows probabilistic modeling to be integrated with all the other traditional deterministic programming.

Some of this text from Avi Pfeffer

Page 78:

Key Questions for Probabilistic Programming

• What are good design patterns for probabilistic programming?

• What are the skills required to be an effective probabilistic programmer?

• How can probabilistic programmers work well with domain experts and end users?

• What kind of tools can we develop to support probabilistic programming (debuggers, profilers etc.)?

• How can probabilistic programs be learned (especially structure)?

• How can we make inference more efficient (especially in memory) to scale up to even larger domains?

Some of this text from Avi Pfeffer


Page 81:

Markov Chain Monte Carlo
• Don't represent all alternatives... just one at a time.
• Move between configurations by a stochastic jump drawn from a proposal distribution.
[Figure: two coreference configurations over mentions "she" and "Amy Hall", connected by a stochastic jump]

Page 82:

M-H Natural Efficiencies

1. Partition function cancels:

\frac{p(y')}{p(y)}
= \frac{p(Y = y' \mid x; \theta)}{p(Y = y \mid x; \theta)}
= \frac{\frac{1}{Z_x} \prod_{y'_i \in y'} \psi(x, y'_i)}{\frac{1}{Z_x} \prod_{y_i \in y} \psi(x, y_i)}
= \frac{\prod_{y'_i \in y'} \psi(x, y'_i)}{\prod_{y_i \in y} \psi(x, y_i)}

2. Unchanged factors cancel:

= \frac{\left( \prod_{y'_i \in \delta y'} \psi(x, y'_i) \right) \left( \prod_{y_i \in y' \setminus \delta y'} \psi(x, y_i) \right)}{\left( \prod_{y_i \in \delta y} \psi(x, y_i) \right) \left( \prod_{y_i \in y \setminus \delta y} \psi(x, y_i) \right)}
= \frac{\prod_{y'_i \in \delta y'} \psi(x, y'_i)}{\prod_{y_i \in \delta y} \psi(x, y_i)}

where \delta y is the "diff", i.e. the variables in y that have changed.

(UMass, Amherst) SampleRank vs. Contrastive Divergence, IESL 5–6 / 27
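Both cancellations can be checked numerically. In this toy Python check, the full ratio of factorized unnormalized scores equals the ratio computed over the changed factors only (the partition function never needs to be computed at all):

```python
import math

# Three pairwise factors over y = (y0, y1, y2); psi[k] is factor k's potential.
psi = [lambda a, b: math.exp(1.0 if a == b else 0.0)] * 3
pairs = [(0, 1), (1, 2), (0, 2)]

def unnorm(y):
    """Unnormalized probability: product of all factor potentials."""
    p = 1.0
    for k, (i, j) in enumerate(pairs):
        p *= psi[k](y[i], y[j])
    return p

y  = (0, 0, 0)
yp = (0, 0, 1)                       # proposal: only y2 changed
full_ratio = unnorm(yp) / unnorm(y)  # Z would cancel here anyway

# Diff-based ratio: only factors touching a changed variable.
changed = [k for k, (i, j) in enumerate(pairs) if y[i] != yp[i] or y[j] != yp[j]]
diff_ratio = 1.0
for k in changed:                    # factors (1,2) and (0,2) touch y2
    i, j = pairs[k]
    diff_ratio *= psi[k](yp[i], yp[j]) / psi[k](y[i], y[j])
```

The factor (0, 1) never enters diff_ratio, mirroring the second derivation above.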

How do we learn parameters θ for M-H?

SampleRank: motivation
• Problem with traditional ML training: inference sits in the inner-most loop of learning.
- Maximum likelihood requires inference for marginals.
- Perceptron requires inference for decoding.
• Want: push updates (not inference) into the inner-most loop of learning.
• Idea: use M-H as a guide, exploit its efficiency, and learn to rank neighboring samples during the random walk.


Page 83:

Parameter Estimation in Large State Spaces
• Most methods require calculating the gradient of the log-likelihood, P(y1, y2, y3, ... | x1, x2, x3, ...)...
• ...which in turn requires expectations of marginals, e.g. P(y1 | x1, x2, x3, ...).
• But obtaining marginal distributions by sampling can be inefficient due to the large sample space.
• Alternative: perceptron. Approximate the gradient from the difference between the true output and the model's predicted best output.
• But even finding the model's predicted best output is expensive.
• We propose SampleRank [Culotta, Wick, Hall, McCallum, HLT 2007]: learn to rank intermediate solutions, e.g.
P(y1=1, y2=0, y3=1, ... | ...) > P(y1=0, y2=0, y3=1, ... | ...)

Page 84:

Ranking vs. Classification Training
• Instead of training
[Powell, Mr. Powell, he] → YES
[Powell, Mr. Powell, she] → NO
• ...rather train
[Powell, Mr. Powell, he] > [Powell, Mr. Powell, she]
• In general, the higher-ranked example may contain errors
[Powell, Mr. Powell, George, he] > [Powell, Mr. Powell, George, she]

Page 85:

Ranking Intermediate Solutions: Example
(score changes for a sequence of proposed jumps; parameters are updated when the model's ranking disagrees with the truth's)
1. initial configuration
2. Δ Model = -23, Δ Truth = -0.2
3. Δ Model = 10, Δ Truth = -0.1 → UPDATE
4. Δ Model = -10, Δ Truth = -0.1
5. Δ Model = -3, Δ Truth = 0.3 → UPDATE

• Like perceptron: proof of convergence under marginal separability.
• More constrained than maximum likelihood: parameters must correctly rank incorrect solutions!
• Very fast to train.
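A schematic SampleRank step in Python (illustrative only, not FACTORIE's implementation; the linear feature map and learning rate are assumptions): whenever the model's ranking of the current and proposed configurations disagrees with the ground-truth objective's ranking, take a perceptron-style step on the feature difference:

```python
def samplerank_update(w, feats, y, y_prime, truth_score, lr=1.0):
    """One SampleRank step: update w when the model's ranking of
    y vs. y_prime disagrees with the truth objective's ranking."""
    def model(cfg):
        return sum(wi * fi for wi, fi in zip(w, feats(cfg)))
    d_model = model(y_prime) - model(y)
    d_truth = truth_score(y_prime) - truth_score(y)
    if d_truth != 0 and d_model * d_truth <= 0:  # rankings disagree -> UPDATE
        sign = 1.0 if d_truth > 0 else -1.0
        return [wi + sign * lr * (fp - f)
                for wi, fp, f in zip(w, feats(y_prime), feats(y))]
    return w                                     # rankings agree -> no update

# Toy: one binary variable; the truth objective prefers y = 1.
feats = lambda cfg: [float(cfg)]
truth = lambda cfg: 1.0 if cfg == 1 else 0.0
w = [0.0]
w = samplerank_update(w, feats, y=0, y_prime=1, truth_score=truth)
```

After one disagreement-driven update the model ranks y=1 above y=0, and a second call with the same pair leaves the weights untouched, matching the "update only on mis-ranking" behavior sketched on the slide.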

Page 86:

Comparison to Contrastive Divergence
[Figure: sufficient statistics used for the parameter update, contrasting the proposal and truth configurations under Contrastive Divergence with n=2 [Hinton 2002], Persistent Contrastive Divergence [Tieleman 2008], and SampleRank]

Page 87:

SampleRank on Coreference
• ACE 2004
• All nouns: 28,122 mentions, 14,047 entities
e.g. he, the President, Clinton, Mrs. Clinton, Washington

B3 scores:
2005 Ng ................................... 69.5%
2007 Culotta, Wick, Hall, McCallum ........ 79.3%
2008 Bengtson, Roth ....................... 80.8%
2009 Wick, McCallum (MCMC + SampleRank) ... 81.5%
     Contrastive Divergence ............... 75.1%
     Persistent Contrastive Divergence .... 74.9%
     Perceptron ........................... 76.3%
