TRANSCRIPT
Joint work with Karl Schultz, Sameer Singh, Michael Wick, Sebastian Riedel. Some slide material from Avi Pfeffer.
Andrew McCallum
Department of Computer Science
University of Massachusetts Amherst
Probabilistic Programming with Imperative Factor Graphs
Uncertainty
• Uncertainty is ubiquitous
  - Partial information
  - Noisy sensors
  - Non-deterministic actions
  - Exogenous events
• Reasoning under uncertainty is a central challenge for building intelligent systems.
Probability
• Probability provides a mathematically sound basis for dealing with uncertainty.
• Combined with utilities, provides a basis for decision-making under uncertainty.
Probabilistic Modeling in the Last Few Years
• Models ever growing in richness and variety
  - hierarchical
  - spatio-temporal
  - relational
  - infinite
• Developing the representation, inference and learning for a new model is a significant task.
Conditional Random Fields
Finite state model
Undirected graphical model, trained to maximize conditional probability of output (sequence) given input (sequence)
transitions observations
state sequence:       y1 y2 y3 y4 y5 y6 y7 y8
observation sequence: x1 x2 x3 x4 x5 x6 x7 x8
Graphical model
(Linear-chain) [Lafferty, McCallum, Pereira 2001]
p(y|x) = \frac{1}{Z_x} \prod_{t=1}^{|x|} \phi(y_t, y_{t-1}) \, \phi(x_t, y_t),
\qquad
\phi(x_t, y_t) = \exp\Big( \sum_k \lambda_k f_k(x_t, y_t) \Big)
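The formula above can be checked numerically. The following is a minimal sketch with made-up weights (the `trans`/`emit` tables are hypothetical, not from the talk): both φ's are exponentiated weighted features, and Z_x is computed by brute force over all label sequences.

```python
import math
from itertools import product

# Toy linear-chain CRF: binary labels y_t, observations x_t.
# trans and emit hold hypothetical weights lambda_k.
trans = {(0, 0): 0.5, (0, 1): -0.2, (1, 0): -0.2, (1, 1): 0.5}
emit = {('a', 0): 1.0, ('a', 1): -1.0, ('b', 0): -1.0, ('b', 1): 1.0}

def unnormalized(x, y):
    # exp of summed transition and observation weights
    s = sum(emit[(x[t], y[t])] for t in range(len(x)))
    s += sum(trans[(y[t - 1], y[t])] for t in range(1, len(x)))
    return math.exp(s)

def p(y, x):
    # Z_x by brute-force enumeration (fine for a toy chain)
    Z = sum(unnormalized(x, yy) for yy in product((0, 1), repeat=len(x)))
    return unnormalized(x, y) / Z

x = ['a', 'b', 'a']
assert abs(sum(p(yy, x) for yy in product((0, 1), repeat=3)) - 1.0) < 1e-9
```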
Conditional Random Fields
Finite state model
Undirected graphical model, trained to maximize conditional probability of output (sequence) given input (sequence)
state sequence
observation sequence
Graphical model
(Linear-chain) [Lafferty, McCallum, Pereira 2001]
Widespread interest, positive experimental results in many applications:
- Noun phrase, named entity [HLT'03], [CoNLL'03]
- Protein structure prediction [ICML'04]
- IE from bioinformatics text [Bioinformatics '04], ...
- Asian word segmentation [COLING'04], [ACL'04]
- IE from research papers [HLT'04]
- Object classification in images [CVPR'04]
Skip-chain CRF
. . .
Senator Joe Green said today . Green chairs the ...
Joint NER across sentences
Capture long-distance dependencies
[Sutton, McCallum, 2005]
Factorial CRF
Part-of-speech
Noun-phrase boundaries
Named-entity tag
English words
Those surfers like San Jose
[Sutton, McCallum ’04]
Joint Part-of-speech, NP chunking, NER
Inference by Loopy Belief Propagation
Pairwise Affinity CRF

[Figure: mentions Mr. Hill, Amy Hall, Dana Hill, Dana, she, with pairwise coreferent (C) / not-coreferent (N) edges]
[McCallum & Wellner 2003]
Entity Resolution
[Figure: five "mention" nodes — Mr. Hill, Amy Hall, Dana Hill, Dana, she]
Entity Resolution
[Figure: the mentions grouped into two "entity" clusters]
Entity Resolution
[Figure: an alternative grouping of the same mentions into three "entity" clusters]
CRF for Co-reference

[Figure: mentions Mr. Hill, Amy Hall, Dana Hill, Dana, she, with pairwise coreferent (C) / not-coreferent (N) edge decisions]
[McCallum & Wellner 2003]
p(\vec{y}|\vec{x}) = \frac{1}{Z_{\vec{x}}} \exp\Big( \sum_{i,j} \sum_l \lambda_l f_l(x_i, x_j, y_{ij}) \Big)
+ mechanism for preserving transitivity
Make pair-wise merging decisions jointly by:
- calculating a joint probability
- including all edge weights
- enforcing transitivity
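One common way to realize a "mechanism for preserving transitivity" (a hypothetical sketch, not necessarily the construction used in the cited paper) is to materialize the clustering itself, e.g. with union-find, so that any set of merge decisions yields a transitive partition by construction:

```python
# Sketch: represent coreference decisions as a partition (union-find)
# rather than independent pairwise y_ij bits, so transitivity holds
# automatically. Names are illustrative.
class UnionFind:
    def __init__(self, items):
        self.parent = {i: i for i in items}
    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i
    def union(self, i, j):
        self.parent[self.find(i)] = self.find(j)

mentions = ["Mr. Hill", "Amy Hall", "Dana Hill", "Dana", "she"]
uf = UnionFind(mentions)
uf.union("Dana Hill", "Dana")
uf.union("Mr. Hill", "Dana Hill")
# Transitivity is free: merging (Dana Hill, Dana) and (Mr. Hill, Dana Hill)
# implies (Mr. Hill, Dana) are coreferent.
assert uf.find("Mr. Hill") == uf.find("Dana")
```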
Pairwise Affinity is not Enough

[Figure: mentions Mr. Hill, Amy Hall, Dana Hill, Dana, she, with pairwise C/N edges]
Pairwise Affinity is not Enough

[Figure: mention Amy Hall plus four occurrences of "she", with pairwise C/N edges]
Pairwise Comparisons are Not Enough

Examples:
• ∀: are all of the mentions pronouns?
• Entities have multiple attributes (name, email, institution, location); need to measure "compatibility" among them.
• Having 2 "given names" is common, but not 4.
  - e.g. Howard M. Dean / Martin, Dean / Howard Martin
• Need to measure the size of the clusters of mentions.
• ∃: is there a pair of last-name strings that differ by more than 5?

We need to ask ∃, ∀ questions about a set of mentions.
We want first-order logic!
Partition Affinity CRF

[Figure: mention Amy Hall plus four occurrences of "she", with factors over entire partitions]

Ask arbitrary questions about all entities in a partition with first-order logic...
How can we perform inference and learning in models that cannot be “unrolled” ?
Can't use belief propagation.
Can't use standard integer linear programming.
Don't represent all alternatives... just one at a time.

[Figure: a single current configuration; a stochastic jump drawn from a proposal distribution moves it to a neighboring configuration]
Markov Chain Monte Carlo
Metropolis-Hastings
SampleRank
Metropolis-Hastings for MAP

Maximum a posteriori (MAP) inference: argmax_{y ∈ F} P(Y = y | x)¹

...over a model

P(y|x) = \frac{1}{Z_X} \prod_{y_i \in y} \psi(x, y_i)

...using a proposal distribution q(y'|y) : F × F → [0, 1]

MH for MAP:
1. Begin with some initial configuration y⁰ ∈ F
2. For i = 1, 2, 3, ... draw a local modification y' ∈ F from q
3. Probabilistically accept the jump as a Bernoulli draw with parameter

α = \min\left(1, \; \frac{p(y')}{p(y)} \, \frac{q(y|y')}{q(y'|y)}\right)

¹ F is the feasible region defined by deterministic constraints, e.g. clustering, non-projective trees.

(UMass, Amherst) Sample Rank vs. Contrastive Divergence, IESL, 3/27
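The three-step loop above can be sketched in a few lines. This is a toy sketch, not FACTORIE code: `score` stands in for the log of the product of ψ's, and the symmetric one-flip proposal makes the q-ratio cancel; the decreasing temperature drives the chain toward the MAP configuration.

```python
import math
import random

random.seed(0)

def score(y):
    # toy model: prefers all-ones configurations
    return float(sum(y))

def propose(y):
    # symmetric proposal: flip one coordinate, so q(y'|y)/q(y|y') = 1
    y2 = list(y)
    i = random.randrange(len(y2))
    y2[i] = 1 - y2[i]
    return y2

def mh_map(y, steps=2000, temp=2.0, cooling=0.999):
    best = list(y)
    for _ in range(steps):
        y2 = propose(y)
        # alpha = min(1, (p(y')/p(y))^(1/temp)); q terms cancel here
        alpha = min(1.0, math.exp((score(y2) - score(y)) / temp))
        if random.random() < alpha:
            y = y2
        if score(y) > score(best):
            best = list(y)
        temp *= cooling   # decreasing temperature -> MAP
    return best

best = mh_map([0] * 10)
```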
Given a factor graph with target variables y and observed x.
Can do MAP inference by applying a decreasing temperature (annealing) to the ratio of p(y)'s.
M-H Natural Efficiencies
1. Partition function cancels
SampleRank

MH Efficiency: partition function cancels

\frac{p(y')}{p(y)} = \frac{p(Y = y'|x; \theta)}{p(Y = y|x; \theta)}
= \frac{\frac{1}{Z_X} \prod_{y'_i \in y'} \psi(x, y'_i)}{\frac{1}{Z_X} \prod_{y_i \in y} \psi(x, y_i)}
= \frac{\prod_{y'_i \in y'} \psi(x, y'_i)}{\prod_{y_i \in y} \psi(x, y_i)}

(UMass, Amherst) Sample Rank vs. Contrastive Divergence, IESL, 5/27
SampleRank

MH Efficiency: unchanged factors cancel

= \frac{\prod_{y'_i \in y'} \psi(x, y'_i)}{\prod_{y_i \in y} \psi(x, y_i)}
= \frac{\Big(\prod_{y'_i \in \delta y'} \psi(x, y'_i)\Big)\Big(\prod_{y_i \in y' \setminus \delta y'} \psi(x, y_i)\Big)}{\Big(\prod_{y_i \in \delta y} \psi(x, y_i)\Big)\Big(\prod_{y_i \in y \setminus \delta y} \psi(x, y_i)\Big)}
= \frac{\prod_{y'_i \in \delta y'} \psi(x, y'_i)}{\prod_{y_i \in \delta y} \psi(x, y_i)}

δy is the "diff", i.e. the variables in y that have changed.

(UMass, Amherst) Sample Rank vs. Contrastive Divergence, IESL, 6/27
2. Unchanged factors cancel
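A minimal sketch of this cancellation: with pairwise chain factors, flipping one variable touches at most two factors, so the acceptance ratio (a log-score difference here) costs constant work instead of a full re-score. All names are illustrative.

```python
# Sketch of "unchanged factors cancel": score a proposal by touching only
# the factors whose variables changed.
def factor(y, j):
    # log psi over edge (j, j+1): reward agreement of adjacent labels
    return 1.0 if y[j] == y[j + 1] else -1.0

def full_logscore(y):
    # O(n): sum over all chain edges
    return sum(factor(y, j) for j in range(len(y) - 1))

def delta_logscore(y, i, new_value):
    # O(1): flipping y[i] can only change edges (i-1, i) and (i, i+1)
    y2 = list(y)
    y2[i] = new_value
    touched = [j for j in (i - 1, i) if 0 <= j < len(y) - 1]
    return sum(factor(y2, j) - factor(y, j) for j in touched)

y = [0, 0, 1, 1, 1]
y2 = [0, 1, 1, 1, 1]
# the cheap diff equals the expensive full-score difference
assert abs(delta_logscore(y, 1, 1) - (full_logscore(y2) - full_logscore(y))) < 1e-12
```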
How to learn the parameters θ?
SampleRank: motivation

QUESTION: how do we learn θ for MH?

Problem with traditional ML: inference sits in the inner-most loop of learning
- maximum likelihood requires inference for marginals
- perceptron requires inference for decoding

Want: push updates (not inference) into the inner-most loop of learning.

Idea: use MH as a guide, exploit its efficiency, and learn to rank neighboring samples during the random walk.
(UMass, Amherst) Sample Rank Vs. Contrastive Divergence IESL 7 / 27
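The ranking idea can be sketched as a perceptron-style update that fires only when the model's ranking of current vs. proposed sample disagrees with the ground-truth objective's ranking. This is an illustrative sketch; the feature names and exact update rule are assumptions, not the paper's definition.

```python
# SampleRank-style update sketch: compare current and proposed samples
# under the model (theta . features) and under a truth objective; update
# on disagreement, toward the truth-preferred sample's features.
def dot(theta, f):
    return sum(theta.get(k, 0.0) * v for k, v in f.items())

def samplerank_update(theta, f_cur, f_prop, truth_cur, truth_prop, lr=1.0):
    model_prefers_prop = dot(theta, f_prop) > dot(theta, f_cur)
    truth_prefers_prop = truth_prop > truth_cur
    if model_prefers_prop != truth_prefers_prop:
        better, worse = (f_prop, f_cur) if truth_prefers_prop else (f_cur, f_prop)
        for k in set(better) | set(worse):
            theta[k] = theta.get(k, 0.0) + lr * (better.get(k, 0.0) - worse.get(k, 0.0))
    return theta

theta = {"same_cluster": 0.0}
# truth prefers the proposal (0.8 > 0.2) but the model ranks them equal,
# so the update fires and raises the weight on the proposal's feature.
theta = samplerank_update(theta, {"same_cluster": 0.0}, {"same_cluster": 1.0}, 0.2, 0.8)
assert theta["same_cluster"] > 0
```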
Probabilistic Modeling in the Last Few Years
• Models ever growing in richness and variety
  - hierarchical
  - spatio-temporal
  - relational
  - infinite
• Developing the representation, reasoning and learning for a new model is a significant task.
Probabilistic Programming Languages
• Make it easy to represent rich, complex models, using the full power of programming languages
  - data structures
  - control mechanisms
  - abstraction
• Inference and learning come for free (or sort of)
• Give you the language to think of and create new models
Small Sampling of Probabilistic Programming Languages
• Logic-based
  - Markov logic, BLOG, PRISM
• Functional
  - IBAL, Church
• Object oriented
  - Figaro, Infer.NET
BLOG
#Researcher ~ NumResearchersPrior();
Name(r) ~ NamePrior();
#Paper ~ NumPapersPrior();
FirstAuthor(p) ~ Uniform({Researcher r});
Title(p) ~ TitlePrior();
PubCited(c) ~ Uniform({Paper p});
Text(c) ~ NoisyCitationGrammar (Name(FirstAuthor(PubCited(c))), Title(PubCited(c)));
• Generative model of objects and relations.
• Handles unknown number of objects
• Inference by MCMC.
[Milch et al, 2005]
Church
• Tell generative story-line in Scheme.
• Do MCMC inference over execution paths.

(define (DP alpha proc)
  (let ((sticks (mem (lambda x (beta 1.0 alpha))))
        (atoms  (mem (lambda x (proc)))))
    (lambda () (atoms (pick-a-stick sticks 1)))))

(define (pick-a-stick sticks J)
  (if (< (random) (sticks J))
      J
      (pick-a-stick sticks (+ J 1))))

(define (DPmem alpha proc)
  (let ((dps (mem (lambda args
                    (DP alpha (lambda () (apply proc args)))))))
    (lambda argsin ((apply dps argsin)))))
[Goodman, Mansinghka, Roy, Tenenbaum, 2009]
Figaro
• Generative model of objects and relations.
• Object oriented (also in Scala!)
  - "Models" are the basic building block, composed of other models, derived by inheritance.
  - Models are objects with conditions, constraints and relations to other objects.
  - Model = data + factors; they are intertwined.
[Pfeffer, 2009]
Figaro [Pfeffer, 2009]
People smoke with probability 0.6:
  Smoke(x)  1.5
Friends are 3 times as likely to have the same smoking habit as different:
  ¬Friends(x,y) ∨ ¬Smoke(x) ∨ Smoke(y)  3
  ¬Friends(x,y) ∨ Smoke(x) ∨ ¬Smoke(y)  3
class Person { val smokes = Flip(0.6) }
val alice, bob, clara = new Person
alice.smokes.condition(true)
val friends = List((alice, bob), (bob, clara))
def constraint(pair: (Boolean, Boolean)) =
  if (pair._1 == pair._2) 3.0 else 1.0
for ((p1, p2) <- friends)
  Pair(p1.smokes, p2.smokes).constrain(constraint)
Markov LogicFirst-Order Logic as a Template to Define CRF Parameters
[Richardson & Domingos 2005][Paskin & Russell 2002][Taskar et al 2003]
ground Markov network
Grounding the Markov network requires space O(n^r), where n = number of constants and r = highest clause arity.
My Approach
• I'm going to immediately dismiss the generative models.
  - Interesting, but not what performs best in NLP.
Want
• Discriminatively trained factor graphs.
• Best previous example of this: Markov Logic.
Logic + Probability
• Significant interest in this combination
  - Poole, Muggleton, DeRaedt, Sato, Domingos, ...
• We now hypothesize that in much of this previous work the "logic" aspect is mostly a red herring.
  - Power: repeated relational structures and tied parameters.
  - Logic is one way to specify these structures, but not the only one, and perhaps not the best.
  - In deterministic programming, Prolog was replaced by imperative languages:
    ✦ programmers have to keep the imperative solver in mind after all
    ✦ much domain knowledge is procedural anyway
  - Logical inference replaced by probabilistic inference.
Declarative Model Specification
• One of the biggest advances in AI & ML
• Gone too far? Much domain knowledge is also procedural.
• Logic + Probability → Imperative + Probability
  - Rising interest: Church, Infer.NET, ...
• Our approach:
  - Preserve the declarative statistical semantics of factor graphs.
  - Provide imperative hooks to define structure, parameterization, inference, learning. Efficient. Easy to use.
  - "Imperatively-Defined Factor Graphs" (IDFs)
Our Design Goals
• Represent factor graphs
  - emphasis on discriminative undirected models
• Scalability
  - input data, output configuration, factors, tree-width
  - observed data that cannot fit in memory
  - super-exponential number of factors
• Efficient discriminative parameter estimation
  - sensitive to the expense of inference
• Leverage object-oriented benefits
  - modularity, encapsulation, inheritance, ...
• Integrate declarative & procedural knowledge
  - natural, easy to use
  - upcoming slides: 3 examples of injecting imperativ-ism into factor graphs
FACTORIE
• Factor graphs, Imperative, Extensible
• Implemented as a library in Scala [Martin Odersky]
  - object oriented & functional
  - type inference
  - lazy evaluation
  - everything an object (int, float, ...)
  - nice syntax for creating "domain-specific languages"
  - runs in the JVM (complete interoperation with Java)
  - "Haskell++ in a Java style"
• Library, not a new "little language"
  - all familiar Java constructs & libraries available to you
  - integrate data pre-processing & evaluation with model specification
  - Scala makes the syntax not too bad
  - but not as compact as a dedicated language (BLOG, MLN)
Stages of FACTORIE programming
1. Define templates for data (i.e. classes)
   - Use data structures just like in deterministic programming.
   - Only special requirement: provide "undo" capability for changes.
2. Define templates for factors
   - Distinct from the data representation above; makes it easy to modify model scores independently.
   - Use & transform the data's natural relations to define the factors' relations.
3. Optionally, define MCMC proposal functions that leverage domain knowledge.
Scala
• New variable
    var myHometown : String
    var myAltitude = 10523.2
• New constant
    val myName = "Andrew"
• New method
    def climb(increment: Double) = myAltitude += increment
• New class
    class Skier extends Person
• New trait (like a Java interface with implementations)
    trait FirstAid { def applyBandage = ... }
• New class with trait
    class BackcountrySkier extends Skier with FirstAid
• New static object [generic]
    object GlobalSkierTable extends ArrayList[Skier]
Example: Linear-Chain CRF for Segmentation

class Label(isBeg: Boolean) extends Bool(isBeg)
class Token(word: String) extends EnumVariable(word)

Labels: T    F     F      T   F     F
Words:  Bill loves skiing Tom loves snowshoeing
Example: Linear-Chain CRF for Segmentation

class Label(isBeg: Boolean) extends Bool(isBeg) with VarSeq
class Token(word: String) extends EnumVariable(word) with VarSeq

label.prev  label.next

Labels: T    F     F      T   F     F
Words:  Bill loves skiing Tom loves snowshoeing
Example: Linear-Chain CRF for Segmentation

class Label(isBeg: Boolean) extends Bool(isBeg) with VarSeq {
  val token: Token
}
class Token(word: String) extends EnumVariable(word) with VarSeq {
  val label: Label
  def longerThanSix = word.length > 6
}

Avoid representing relations by indices. Do it directly with members, pointers... arbitrary data structures.

Labels: T    F     F      T   F     F
Words:  Bill loves skiing Tom loves snowshoeing
Example: Linear-Chain CRF for Segmentation

(class definitions as above)

// Factor templates
object StateTemplate extends Template1[Label]
object StateTokenTemplate extends Template2[Label,Token]
object StateTransitionTemplate extends Template2[Label,Label]

Labels: T    F     F      T   F     F
Words:  Bill loves skiing Tom loves snowshoeing
Imperativ-ism #1: Jump Function
• Proposal "jump function"
  - Make changes to world state
• Sometimes simple, sometimes not
  - Sample Gaussian with mean at old value
  - Sample cluster to split, run stochastic greedy agglomerative clustering
• Gibbs sampling, one variable at a time
  - poor mixing
• Rich jump function
  - Natural place to embed domain knowledge about which variables should change in concert.
  - Avoid some expensive deterministic factors with property-preserving jump functions (e.g. coref transitivity, dependency-parse projectivity)
Key Operation: Scoring a Proposal
• Acceptance probability ~ ratio of model scores. Scores of factors that didn't change cancel.
• To efficiently score:
  - Proposal method runs.
  - Automatically build a list of variables that changed.
  - Find factors that touch changed variables.
  - Find other (unchanged) variables needed to calculate those factors' scores.
• How to find factors from variables & vice versa?
  - In BLOG, a rich, highly-indexed data structure stores the mapping variables ←→ factors
  - But complex to maintain as structure changes
Imperativ-ism #2: Model Structure
• Maintain no map structure between factors and variables.
• Finding factors is easy. Usually # templates < 50.
• Primitive operation: given a factor template and one changed variable, find the other variables.
• In the factor Template object, define imperative methods that do this:
  - unroll1(v1) returns (v1,v2,v3)
  - unroll2(v2) returns (v1,v2,v3)
  - unroll3(v3) returns (v1,v2,v3)
  - I.e., use a Turing-complete language to determine structure on the fly.
  - If you want to use a data structure instead, access it in the method.
  - If you want a higher-level language for specifying structure, write it in terms of this primitive.
• Other nice attribute:
  - Easy to do value-conditioned structure. Case Factor Diagrams, etc.
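A hypothetical Python analogue of the unroll primitive, for a transition template over a doubly-linked chain of labels. The point is that the variable→factor mapping is computed imperatively on demand, not stored in an index; class and method names here are illustrative, not FACTORIE's API.

```python
# Sketch of an unroll-style primitive: given a changed variable, a
# template computes the factors touching it on the fly.
class Label:
    def __init__(self, value):
        self.value = value
        self.prev = None   # previous label in the chain
        self.next = None   # next label in the chain

class TransitionTemplate:
    def unroll(self, changed):
        # transition factors are (prev, this) and (this, next) pairs
        factors = []
        if changed.prev is not None:
            factors.append((changed.prev, changed))
        if changed.next is not None:
            factors.append((changed, changed.next))
        return factors

a, b, c = Label(0), Label(1), Label(0)
a.next, b.prev = b, a
b.next, c.prev = c, b
# a middle variable touches two transition factors, an end variable one
assert len(TransitionTemplate().unroll(b)) == 2
assert len(TransitionTemplate().unroll(a)) == 1
```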
Example: Linear-Chain CRF for Segmentation

class Label(isBeg: Boolean) extends Bool(isBeg) with VarSeq {
  val token: Token
}
class Token(word: String) extends EnumVariable(word) with VarSeq {
  val label: Label
  def longerThanSix = word.length > 6
}

// Factor templates
object StateTemplate extends Template1[Label]
object StateTokenTemplate extends Template2[Label,Token] {
  def unroll1(label: Label) = Factor(label, label.token)
  def unroll2(token: Token) = new Error // Tokens shouldn't change
}
object StateTransitionTemplate extends Template2[Label,Label] {
  def unroll1(label: Label) = Factor(label, label.next)
  def unroll2(label: Label) = Factor(label.prev, label)
}
Imperativ-ism #3: Neighbor-Sufficient Map
• "Neighbor variables" of a factor
  - the variables touching the factor
• "Sufficient statistics" of a factor
  - a vector whose dot product with the weights of a log-linear factor yields the factor's score
• Usually conflated. Separate them.
• Skip-chain NER: instead of 5×5 parameters, just 2.
  (label1, label2) → label1 == label2
Words:  Bill loves Paris Bill the painter ...
Labels: PER  O     LOC   PER  O   O
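A sketch of the separation: the skip-chain factor's neighbors are two NER labels (25 value combinations), but its sufficient statistics collapse to a single boolean, so only two weights are needed. The weight values below are made up for illustration.

```python
# Neighbor variables vs. sufficient statistics: the factor touches two
# 5-valued labels, but its statistics are just (label1 == label2,),
# so the template has 2 parameters instead of 25.
def skip_statistics(label1, label2):
    return (label1 == label2,)

weights = {(True,): 2.0, (False,): -2.0}   # hypothetical weights

def skip_score(label1, label2):
    # dot product degenerates to a table lookup on the boolean statistic
    return weights[skip_statistics(label1, label2)]

assert skip_score("PER", "PER") == 2.0
assert skip_score("PER", "LOC") == -2.0
```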
Example: Skip-Chain CRF for Segmentation

class Label(isBeg: Boolean) extends Bool(isBeg) with VarSeq {
  val token: Token
}
class Token(word: String) extends EnumVariable(word) with VarSeq {
  val label: Label
  def longerThanSix = word.length > 6
}

// Factor templates
object StateTemplate extends TemplateWithStatistics1[Label]
object StateTokenTemplate extends Template2[Label,Token] {
  def unroll1(label: Label) = Factor(label, label.token)
  def unroll2(token: Token) = new Error // Tokens shouldn't change
}
object StateTransitionTemplate extends TemplateWithStatistics2[Label,Label] {
  def unroll1(label: Label) = Factor(label, label.next)
  def unroll2(label: Label) = Factor(label.prev, label)
}
object SkipTemplate extends Template2[Label,Label] with Statistics1[Bool] {
  def unroll1(label: Label) =
    for (other <- label.seq; if label.token == other.token)
      yield Factor(label, other)
  def statistics(label1: Label, label2: Label) = Stat(label1 == label2)
}
Example: Dependency Parsing

class Word(str: String) extends EnumVariable(str)
class Node(word: Word, parent: Node) extends PrimitiveVariable(parent)

object ChildParentTemplate extends Template1[Node] with Statistics2[Word,Word] {
  def statistics(n: Node) = Stat(n.word, n.parent.word)
}

object NearestVerbTemplate extends Template1[Node] with Statistics2[Word,Word] {
  def statistics(n: Node) = Stat(n.word, closestVerb(n).word)
  def closestVerb(n: Node) = if (isVerb(n.word)) n else closestVerb(n.parent)
  def unroll1(n: Node) = n.selfAndDescendants
}
Extensibility
• Many variable types provided:
  - boolean, int, float, String, categorical, ...
• Create new ones!
  - set-valued variable
  - finite-state machine as a variable [JHU]
• Create new factor types
  - Poisson, Dirichlet, ...
Experimental Results
• Joint segmentation & coreference of research paper citations
  - 1295 mentions, 134 entities, 36487 tokens
• Compare with MLNs (Alchemy)
  - same observable features
• FACTORIE results:
  - ~25% reduction in error (segmentation & coref)
  - 3-20x faster
  - coref results:
FACTORIE Summary
• Factor graphs,
• ...object-oriented
  - data types and factor template types, with inheritance
• ...scalable
  - factors created on demand, only score diffs
• ...with imperative hooks
  - jump function, override variable.set() for coordination
  - model structure
  - neighbor variables → sufficient statistics
• ...discriminative
  - efficient online training by SampleRank
  - generative modeling also provided (LDA = ~12 lines)
• Combine declarative & procedural knowledge
Conclusion: Some Reasons To Use Probabilistic Programming
• Simple
  - Save time & avoid debugging complex, hand-built ML code.
  - Say exactly what you want in the way you want to say it.
• Flexible
  - Encourage research exploration by making it easier to try new modeling ideas.
  - The language provides the right "hinge-points" to give you the flexibility you want, without the underlying cruft.
• Glue that binds many reasoning paradigms together.
• Allows probabilistic modeling to be integrated with all the other traditional deterministic programming.
Some of this text from Avi Pfeffer
Key Questions for Probabilistic Programming
• What are good design patterns for probabilistic programming?
• What are the skills required to be an effective probabilistic programmer?
• How can probabilistic programmers work well with domain experts and end users?
• What kind of tools can we develop to support probabilistic programming (debuggers, profilers etc.)?
• How can probabilistic programs be learned (especially structure)?
• How can we make inference more efficient (especially memory) to scale up to even large domains?
Parameter Estimation in Large State Spaces

• Most methods require calculating the gradient of the log-likelihood, P(y1, y2, y3, ... | x1, x2, x3, ...)...
• ...which in turn requires "expectations of marginals," P(y1 | x1, x2, x3, ...)
• But getting marginal distributions by sampling can be inefficient due to the large sample space.
• Alternative: perceptron. Approximate the gradient from the difference between the true output and the model's predicted best output.
• But even finding the model's predicted best output is expensive.
• We propose: SampleRank [Culotta, Wick, Hall, McCallum, HLT 2007]
  Learn to rank intermediate solutions: P(y1=1, y2=0, y3=1, ... | ...) > P(y1=0, y2=0, y3=1, ... | ...)
Ranking vs Classification Training
• Instead of training
    [Powell, Mr. Powell, he] → YES
    [Powell, Mr. Powell, she] → NO
• ...rather...
    [Powell, Mr. Powell, he] > [Powell, Mr. Powell, she]
• In general, the higher-ranked example may contain errors
    [Powell, Mr. Powell, George, he] > [Powell, Mr. Powell, George, she]
Ranking Intermediate Solutions: Example

1. (initial configuration)
2. ∆ Model = −23, ∆ Truth = −0.2
3. ∆ Model = 10, ∆ Truth = −0.1 → UPDATE
4. ∆ Model = −10, ∆ Truth = −0.1
5. ∆ Model = −3, ∆ Truth = 0.3 → UPDATE

An update fires when the model's ranking of a proposal disagrees with the truth's ranking, i.e. when the signs of ∆ Model and ∆ Truth differ.

• Like perceptron: proof of convergence under marginal separability.
• More constrained than maximum likelihood: parameters must correctly rank incorrect solutions!
• Very fast to train.
Comparison to Contrastive Divergence

Contrastive Divergence, n=2 [Hinton 2002]
Persistent Contrastive Divergence [Tieleman 2008]
SampleRank

[Figure: for each method, where the sufficient statistics for the update are taken, relative to the proposal chain and the truth]
SampleRank on Coreference
• ACE 2004
• All nouns: 28,122 mentions, 14,047 entities
  e.g. he, the President, Clinton, Mrs. Clinton, Washington

B³ scores:
  2005  Ng                                     69.5%
  2007  Culotta, Wick, Hall, McCallum          79.3%
  2008  Bengtson, Roth                         80.8%
  2009  Wick, McCallum (MCMC+SampleRank)       81.5%
        Contrastive Divergence                 75.1%
        Persistent Contrastive Divergence      74.9%
        Perceptron                             76.3%