Carnegie Mellon
Joseph K. Bradley
Sample Complexity of CRF Parameter Learning
4/9/2012
Joint work with Carlos Guestrin
CMU Machine Learning Lunch talk on work appearing in AISTATS 2012
Markov Random Fields (MRFs)

Goal: Model a distribution P(X) over random variables X.
Variables: X1: deadline? X2: bags under eyes? X3: sick? X4: losing hair? X5: overeating?
E.g., P( deadline | bags under eyes, losing hair )
Log-linear MRFs

Log-linear form: P(X; θ) = (1/Z(θ)) exp( θᵀφ(X) ), with parameters θ and features φ(X). (The slide shows example feature definitions for binary X and for real X.)

Our goal: Given structure Φ and data, learn parameters θ.
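The log-linear form above can be sketched in a few lines. This is a minimal toy example, assuming a hypothetical 3-node binary chain with made-up agreement features (not the talk's actual models); it computes Z(θ) by brute-force enumeration, which is exactly the step that becomes intractable as |X| grows.

```python
import itertools
import math

# Minimal log-linear MRF sketch: P(x) ∝ exp(theta · phi(x)).
# Features and weights are hypothetical, for illustration only.

def phi(x):
    """Pairwise agreement features on a 3-node chain: x1-x2, x2-x3."""
    return [float(x[0] == x[1]), float(x[1] == x[2])]

def unnormalized(theta, x):
    return math.exp(sum(t * f for t, f in zip(theta, phi(x))))

def partition(theta, n=3):
    # Z(theta) sums over all 2^n assignments -- tractable only for tiny
    # models, which is why inference makes learning hard in general.
    return sum(unnormalized(theta, x)
               for x in itertools.product([0, 1], repeat=n))

def prob(theta, x):
    return unnormalized(theta, x) / partition(theta)

theta = [1.0, 1.0]
total = sum(prob(theta, x) for x in itertools.product([0, 1], repeat=3))
```

With positive agreement weights, all-equal assignments get higher probability than disagreeing ones, and the probabilities sum to 1 by construction.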
Parameter Learning: MLE

Traditional learning: max-likelihood estimation (MLE).
Given data: n i.i.d. samples from Pθ*(X).
Minimize the objective: loss + regularization, i.e. the regularized negative log-likelihood −(1/n) Σi log Pθ(x^(i)) + λ‖θ‖.
L2 regularization is more common; our analysis applies to L1 & L2.
Gold standard: MLE is (optimally) statistically efficient.
Algorithm: iterate: compute the gradient; step along the gradient.
But computing the gradient requires inference, which is provably hard for general MRFs. Inference makes learning hard. Can we learn without intractable inference?
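The gradient step above is where inference enters: for a log-linear model, the gradient of the average log-likelihood is the empirical feature expectation minus the model's feature expectation, and the latter requires summing over all assignments. A sketch on a hypothetical tiny model (exhaustive inference, made-up features; not the talk's setup):

```python
import itertools
import math

# One MLE gradient step on a toy log-linear model. The model expectation
# E_model[phi] is the inference step that is provably hard in general.

def phi(x):
    # Hypothetical agreement features on a 3-node binary chain.
    return [float(x[0] == x[1]), float(x[1] == x[2])]

def model_expectation(theta, n=3):
    assigns = list(itertools.product([0, 1], repeat=n))
    weights = [math.exp(sum(t * f for t, f in zip(theta, phi(x))))
               for x in assigns]
    z = sum(weights)  # partition function, exponential in n
    expect = [0.0] * len(theta)
    for w, x in zip(weights, assigns):
        for j, f in enumerate(phi(x)):
            expect[j] += (w / z) * f
    return expect

def gradient_step(theta, data, lr=0.1):
    emp = [sum(phi(x)[j] for x in data) / len(data)
           for j in range(len(theta))]
    mod = model_expectation(theta)
    # Step along the gradient of the log-likelihood:
    # grad_j = E_data[phi_j] - E_model[phi_j].
    return [t + lr * (e - m) for t, e, m in zip(theta, emp, mod)]

data = [(0, 0, 0), (1, 1, 1), (0, 0, 1)]
theta = gradient_step([0.0, 0.0], data)
```

Since the data mostly agrees on neighboring variables, one step from θ = 0 pushes both weights upward, the first (always satisfied in the data) more than the second.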
Conditional Random Fields (CRFs)

MRFs → CRFs (Lafferty et al., 2001): condition on evidence variables (E1: weather, E2: full moon, E3: Steelers game, …) and model P(X | E) instead of P(X, E). Inference is exponential in |X|, not |E|.
But Z depends on E: the objective requires computing Z(e) for every training example! Inference makes learning even harder for CRFs. Can we learn without intractable inference?
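The per-example normalization can be made concrete. A sketch assuming a hypothetical two-variable CRF with made-up features (only e[0] matters, which also illustrates that the cost depends on |X|, not |E|):

```python
import itertools
import math

# CRF normalization sketch: Z depends on the evidence e, so training must
# recompute it for each example. Summing over assignments of X costs
# 2^|X|, independent of how long the evidence vector e is.

def score(theta, x, e):
    # Hypothetical features: one pairwise feature on X, one coupling X to
    # the evidence. Invented for illustration only.
    feats = [float(x[0] == x[1]), float(x[0] and e[0])]
    return math.exp(sum(t * f for t, f in zip(theta, feats)))

def log_z(theta, e, n_x=2):
    return math.log(sum(score(theta, x, e)
                        for x in itertools.product([0, 1], repeat=n_x)))

theta = [0.5, 1.0]
# Different evidence -> different partition function Z(e).
z0 = log_z(theta, e=(0, 0, 0))
z1 = log_z(theta, e=(1, 0, 0))
```

Even in this toy, z0 and z1 differ, so a training set with n examples needs n partition-function computations per gradient evaluation.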
Outline
• Parameter learning. Before: no PAC learning results for general MRFs or CRFs.
• Sample complexity results: PAC learning via pseudolikelihood for general MRFs and CRFs.
• Empirical analysis of bounds: tightness & dependence on the model.
• Structured composite likelihood: lowering sample complexity.
Related Work

Ravikumar et al. (2010): PAC bounds for regressing Yi ~ X with Ising factors. Our theory is largely derived from this work.
Liang and Jordan (2008): asymptotic bounds for pseudolikelihood and composite likelihood. Our finite-sample bounds are of the same order.
Learning with approximate inference: no PAC-style bounds for general MRFs or CRFs; cf. Hinton (2002), Koller & Friedman (2009), Wainwright (2006).
Avoiding Intractable Inference

MLE loss: the negative log-likelihood, whose log-partition term log Z(θ) is hard to compute. So replace it!
Pseudolikelihood (MPLE) (Besag, 1975)

MLE loss: −(1/n) Σi log Pθ(x^(i)).
Pseudolikelihood (MPLE) loss: −(1/n) Σi Σj log Pθ(x_j^(i) | x_-j^(i)).

Intuition: approximate the distribution as a product of local conditionals P(Xj | X-j), e.g. P(X1: deadline | X2: bags under eyes, X3: sick, X4: losing hair, X5: overeating). Each conditional normalizes over a single variable, so no intractable inference is required!

Previous work:
• Pro: consistent estimator
• Con: less statistically efficient than MLE
• Con: no PAC bounds
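The MPLE loss can be written down directly. A sketch on the same kind of hypothetical pairwise binary model used above (not the talk's models): each local conditional normalizes over just the two values of one coordinate, so there is no global Z.

```python
import itertools
import math

# Pseudolikelihood sketch: replace log P(x) with sum_j log P(x_j | x_-j).
# Each conditional normalizes over ONE variable's two values.

def unnormalized(theta, x):
    # Hypothetical agreement features on a 3-node binary chain.
    feats = [float(x[0] == x[1]), float(x[1] == x[2])]
    return math.exp(sum(t * f for t, f in zip(theta, feats)))

def local_conditional(theta, x, j):
    """P(x_j | x_-j): normalize over the two values of coordinate j only."""
    num = unnormalized(theta, x)
    x_flip = tuple(1 - v if k == j else v for k, v in enumerate(x))
    return num / (num + unnormalized(theta, x_flip))

def mple_loss(theta, data):
    return -sum(math.log(local_conditional(theta, x, j))
                for x in data for j in range(len(x))) / len(data)

data = [(0, 0, 0), (1, 1, 1)]
loss_agree = mple_loss([1.0, 1.0], data)  # weights matching the data
loss_zero = mple_loss([0.0, 0.0], data)   # uniform model
```

On data where neighbors agree, agreement weights give every local conditional probability above 1/2, so the MPLE loss beats the uniform model's loss of log 2 per variable.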
Sample Complexity: MLE
Theorem: Given n i.i.d. samples from Pθ*(X), MLE using L1 or L2 regularization achieves average per-parameter error ε with probability ≥ 1 − δ once n is large enough; the required n grows with the number of parameters r (the length of θ), with 1/ε, and with 1/Λmin, where Λmin is the minimum eigenvalue of the Hessian of the loss at θ*.
Sample Complexity: MPLE

Same form as for MLE, with r = length of θ, ε = avg. per-parameter error, δ = probability of failure.
For MLE: Λmin = min eigenvalue of the Hessian of the loss at θ*.
For MPLE: Λmin = min_i [ min eigenvalue of the Hessian of loss component i at θ* ].
Joint vs. Disjoint Optimization

MPLE approximates the distribution as a product of local conditionals; these can be optimized jointly or disjointly.
Joint MPLE: optimize the summed pseudolikelihood loss over a single shared θ.
Disjoint MPLE: regress each Xi ~ X-i separately, then average the parameter estimates.
Sample complexity bounds: disjoint MPLE has a worse bound (con) but is data parallel (pro).
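The disjoint averaging step can be sketched independently of the regressions themselves. This assumes a hypothetical layout where each local regression Xi ~ X-i returns its own estimate of every edge parameter touching node i (names and structure invented for illustration):

```python
# Disjoint-MPLE averaging sketch: each edge parameter is estimated by the
# two regressions at its endpoints; shared estimates are averaged. The
# regressions are independent, hence trivially data parallel.

def average_disjoint(estimates):
    """estimates: dict node -> dict(edge -> that regression's estimate)."""
    sums, counts = {}, {}
    for local in estimates.values():
        for edge, value in local.items():
            sums[edge] = sums.get(edge, 0.0) + value
            counts[edge] = counts.get(edge, 0) + 1
    return {edge: sums[edge] / counts[edge] for edge in sums}

# Edge (1, 2) is estimated by the regressions for node 1 and for node 2.
estimates = {
    1: {(1, 2): 0.8},
    2: {(1, 2): 1.2, (2, 3): 0.4},
    3: {(2, 3): 0.6},
}
theta = average_disjoint(estimates)
```

Averaging the two endpoint estimates of each edge parameter is what makes the final θ well defined despite the regressions never communicating.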
Bounds for Log Loss

We have seen MLE & MPLE sample complexity bounds of the same form, differing in Λmin.

Theorem: If the parameter estimation error ε is small, then the log loss converges quadratically in ε; else the log loss converges linearly in ε. (Matches the rates from Liang and Jordan, 2008.)
Synthetic CRFs
Structures: chains, stars, and grids.
Factor types: random and associative (piecewise definitions shown on the slide), each controlled by a factor-strength parameter.
Tightness of Bounds

Setup: chain, |X|=4, random factors; estimators MLE, MPLE, MPLE-disjoint.
Plot 1 (parameter estimation error ≤ f(sample size)): L1 parameter error and its bound vs. training set size.
Plot 2 (log loss ≤ f(parameter estimation error)): log (base e) loss and its bound, given parameters, vs. training set size.
Takeaway: the bound "parameter estimation error ≤ f(sample size)" is looser; the bound "log loss ≤ f(parameter estimation error)" is tighter.
Predictive Power of Bounds

The bound "parameter estimation error ≤ f(sample size)" is looser. Is it still useful (predictive)? Examine its dependence on Λmin and r.
Plots: L1 parameter error and its bound vs. 1/Λmin, for chains with random factors, 10,000 training examples, and r = 5, 11, 23; MLE shown (similar results for MPLE).
Actual error vs. bound: different constants, but similar behavior, and nearly independent of r.
Recall: Λmin

For MLE: Λmin = min eigenvalue of the Hessian of the loss at θ*.
For MPLE: Λmin = min_i [ min eigenvalue of the Hessian of loss component i at θ* ].
How do Λmin(MLE) and Λmin(MPLE) vary for different models?
Λmin ratio: MLE/MPLE: chains

Plots: Λmin ratio vs. factor strength (fixed |Y|=8) and vs. model size |Y| (fixed factor strength), for random and for associative factors; higher is better.
Λmin ratio: MLE/MPLE: stars

Plots: Λmin ratio vs. factor strength (fixed |Y|=8) and vs. model size |Y| (fixed factor strength), for random and for associative factors; higher is better.
Grid Example

MLE: estimate P(Y) all at once.
MPLE: estimate each P(Yi | Y-i) separately.
Something in between? Estimate a larger component, but keep inference tractable.
Composite Likelihood (MCLE) (Lindsay, 1988): estimate P(YAi | Y-Ai) separately, where YAi ⊆ Y.
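The MCLE component loss interpolates between MPLE and MLE: each component normalizes over the joint assignments of its block only. A sketch on a hypothetical 4-node binary chain (features and blocks invented for illustration):

```python
import itertools
import math

# Composite-likelihood sketch: each component A estimates P(Y_A | Y_-A).
# Normalization enumerates assignments of block A only (2^|A| terms),
# interpolating between MPLE (|A| = 1) and MLE (A = all of Y).

def unnormalized(theta, y):
    # Hypothetical agreement features on a 4-node binary chain.
    feats = [float(y[0] == y[1]), float(y[1] == y[2]), float(y[2] == y[3])]
    return math.exp(sum(t * f for t, f in zip(theta, feats)))

def block_conditional(theta, y, block):
    """P(y_block | y_rest): normalize over assignments of the block only."""
    denom = 0.0
    for vals in itertools.product([0, 1], repeat=len(block)):
        y_alt = list(y)
        for pos, v in zip(block, vals):
            y_alt[pos] = v
        denom += unnormalized(theta, tuple(y_alt))
    return unnormalized(theta, y) / denom

def mcle_loss(theta, data, blocks):
    return -sum(math.log(block_conditional(theta, y, b))
                for y in data for b in blocks) / len(data)

data = [(0, 0, 0, 0), (1, 1, 1, 1)]
loss = mcle_loss([1.0, 1.0, 1.0], data, blocks=[(0, 1), (2, 3)])
```

With blocks of size 2 the per-component normalization has 4 terms instead of the 16 a full Z would need, and on agreeing data the agreement weights beat the uniform model.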
Example: a grid with weak horizontal factors and strong vertical factors.
Choosing MCLE components YAi: larger is better; keep inference tractable; choose using model structure. A good choice here: vertical combs.
Λmin ratio: MLE vs. MPLE, MCLE: grids

Plots: Λmin ratio vs. grid width (fixed factor strength) and vs. factor strength (fixed |Y|=8), for random and for associative factors, comparing combs (MCLE) against MPLE; higher is better.
Structured MCLE on a Grid

Setup: grid with associative factors (fixed strength), 10,000 training samples, Gibbs sampling for inference.
Plots: log-loss ratio (other/MLE) and training time (sec) vs. grid size |X|, for MLE, MPLE, and combs; lower is better.
Combs (MCLE) lower sample complexity without increasing computation!
Averaging MCLE Estimators

MLE & MPLE sample complexity: governed by Λmin, as above.
MCLE sample complexity: the same form, governed by ρmin and Mmax:
ρmin = min_j [ sum, over the components Ai which estimate θj, of the min eigenvalue of the Hessian of component Ai's loss at θ* ];
Mmax = max_j [ number of components Ai which estimate θj ].
Figure: three example MCLE component choices on a four-node model (nodes 1–4). Some parameters are estimated by both components and some by only one; the three panels have Mmax = 2, 2, and 3.
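The Mmax bookkeeping is simple to state in code. This sketch assumes a simplified rule (invented here) that a component estimates a parameter if it contains any of the parameter's variables:

```python
# Mmax sketch: the largest number of MCLE components that estimate any
# single parameter. Component layout and parameter names are hypothetical.

def m_max(components, parameters):
    """components: list of variable sets; parameters: dict name -> var set.

    Simplifying assumption for this sketch: a component estimates a
    parameter if it contains any of the parameter's variables.
    """
    counts = {}
    for name, vars_ in parameters.items():
        counts[name] = sum(1 for comp in components if comp & vars_)
    return max(counts.values())

# Two overlapping components on a 4-node model; every edge parameter here
# is touched by both components, so Mmax = 2.
components = [{1, 2, 3}, {2, 3, 4}]
parameters = {"theta_12": {1, 2}, "theta_23": {2, 3}, "theta_34": {3, 4}}
```

With a single component covering all variables (the MLE case), every count is 1 and Mmax = 1.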
For MPLE, a single bad estimator P(Xi | X-i) can give a bad bound. For MCLE, the effect of a bad estimator P(XAi | X-Ai) can be averaged out by other, good estimators.
Structured MCLE on a Grid

Plot: Λmin vs. grid width, for a grid with strong vertical (associative) factors, comparing MPLE, MLE, combs-both, comb-vert, and comb-horiz; higher is better.
Summary: MLE vs. MPLE/MCLE

Relative performance of estimators: increasing model diameter has little effect, but MPLE/MCLE get worse with increasing factor strength, node degree, and grid width.
Structured MCLE partly solves these problems: choose the MCLE structure according to factor strength, node degree, and grid structure; same computational cost as MPLE.
Summary
• PAC learning via MPLE & MCLE for general MRFs and CRFs.
• Empirical analysis: the bounds are predictive of empirical behavior; strong factors and high-degree nodes hurt MPLE.
• Structured MCLE can have lower sample complexity than MPLE at the same computational complexity.

Future work:
• Choosing MCLE structure on natural graphs.
• Parallel learning: improving the statistical efficiency of disjoint optimization via limited communication.
• Comparing with MLE using approximate inference.

Thank you!
Canonical Parametrization

Abbeel et al. (2006): the only previous method for PAC-learning high-treewidth discrete MRFs; PAC bounds for low-degree factor graphs over discrete X.
Main idea:
• Rewrite P(X) as a ratio of many small factors P( XCi | X-Ci ). (Fine print: each factor is instantiated 2^|Ci| times using a reference assignment.)
• Estimate each small factor P( XCi | X-Ci ) from data.
• Plug the factors into the big expression for P(X).

Theorem: If the canonical parametrization uses the factorization of P(X), it is equivalent to MPLE with disjoint optimization. Computing MPLE directly is faster, and our analysis covers their learning method.