
Learning graphical models for hypothesis testing and classification [1]

Umamahesh Srinivas

iPAL Group Meeting

October 22, 2010

(Slides: signal.ee.psu.edu/graph_classification.pdf)

[1] Tan et al., IEEE Trans. Signal Processing, Nov. 2010


Outline

1 Background and motivation

2 Graphical models: some preliminaries

3 Generative learning of trees

4 Discriminative learning of trees

Discriminative learning of forests

5 Learning thicker graphs via boosting

6 Extension to multi-class problems


Binary hypothesis testing problem

Random vector $x = (x_1, \ldots, x_n) \in \mathcal{X}^n$ generated from either of two hypotheses:

$H_0: x \sim p$

$H_1: x \sim q$

Given: Training sets Tp and Tq, K samples each

Goal: Classify new sample as coming from H0 or H1

Assumption: Class densities p and q known exactly

Likelihood ratio test (LRT)

$$L(x) := \frac{p(x)}{q(x)} \;\overset{H_0}{\underset{H_1}{\gtrless}}\; \tau,$$

i.e., decide $H_0$ if $L(x) \ge \tau$ and $H_1$ otherwise.

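To make the decision rule concrete, here is a minimal sketch of an LRT classifier for a discrete random vector, assuming the class densities $p$ and $q$ are available as callables; the function name `lrt_classify` and the toy Bernoulli densities are illustrative, not from the slides.

```python
# Minimal sketch of the likelihood ratio test L(x) = p(x)/q(x) vs. threshold tau.
# Assumes p and q are callables returning the probability of a full vector x;
# the names and the toy densities below are illustrative, not from the slides.
import numpy as np

def lrt_classify(x, p, q, tau=1.0):
    """Decide H0 if L(x) = p(x)/q(x) >= tau, otherwise decide H1."""
    return "H0" if p(x) / q(x) >= tau else "H1"

# Toy densities: two independent binary features with different Bernoulli parameters.
theta_p = np.array([0.8, 0.3])   # P(x_i = 1) under H0
theta_q = np.array([0.4, 0.7])   # P(x_i = 1) under H1

def p(x):
    return float(np.prod(np.where(np.asarray(x) == 1, theta_p, 1 - theta_p)))

def q(x):
    return float(np.prod(np.where(np.asarray(x) == 1, theta_q, 1 - theta_q)))

print(lrt_classify([1, 0], p, q))  # "H0": this pattern is far more probable under p
print(lrt_classify([0, 1], p, q))  # "H1"
```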



What if true densities are not known a priori?

Estimate empirical distributions $\tilde{p}$ and $\tilde{q}$ from $T_p$ and $T_q$ respectively

LRT using $\tilde{p}$ and $\tilde{q}$

Problem: For high-dimensional data, a large number of samples is needed to get reasonable empirical estimates.

Graphical models:

Efficiently learn tractable models from insufficient data

Trade-off between consistency and generalization

Generative learning: Learning models to approximate distributions.

Learn $\hat{p}$ from $T_p$, and $\hat{q}$ from $T_q$.

Discriminative learning: Learning models for binary classification.

Learn $\hat{p}$ from both $T_p$ and $T_q$; likewise $\hat{q}$.


Graphical models: Preliminaries

(Undirected) graph $G = (V, E)$ defined by a set of nodes $V = \{1, \ldots, n\}$ and a set of edges $E \subset \binom{V}{2}$.

Graphical model: Random vector defined on a graph such that each node represents one (or more) random variables, and edges reveal conditional dependencies.

Graph structure defines factorization of joint probability distribution.

[Figure: a seven-node tree with root $x_1$; children $x_2, x_3$; nodes $x_4, x_5$ are children of $x_2$; nodes $x_6, x_7$ are children of $x_3$.]

$$f(x) = f(x_1)\, f(x_2|x_1)\, f(x_3|x_1)\, f(x_4|x_2)\, f(x_5|x_2)\, f(x_6|x_3)\, f(x_7|x_3).$$

Local Markov property:

$$p(x_i \mid x_{V \setminus i}) = p(x_i \mid x_{\mathcal{N}(i)}), \quad \forall\, i \in V.$$

Such a p(x) is Markov w.r.t. G.

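To see the factorization at work, here is a small sketch that evaluates the joint above for the seven-node tree from its node and edge factors; the conditional probability tables are invented purely for illustration.

```python
# Sketch: evaluate f(x) = f(x1) f(x2|x1) ... f(x7|x3) for the 7-node tree above.
# The conditional probability tables are made up for illustration only.
from itertools import product

p_x1 = 0.6           # P(x1 = 1)
cpt = {              # cpt[(parent, child)][parent_value] = P(child = 1 | parent_value)
    (1, 2): [0.3, 0.8], (1, 3): [0.5, 0.2],
    (2, 4): [0.1, 0.9], (2, 5): [0.7, 0.4],
    (3, 6): [0.6, 0.6], (3, 7): [0.2, 0.5],
}

def joint(x):
    """x maps node index -> value in {0, 1}."""
    prob = p_x1 if x[1] == 1 else 1 - p_x1
    for (parent, child), table in cpt.items():
        p_child_1 = table[x[parent]]
        prob *= p_child_1 if x[child] == 1 else 1 - p_child_1
    return prob

# Sanity check: the 2^7 joint probabilities sum to 1.
total = sum(joint(dict(zip(range(1, 8), bits))) for bits in product([0, 1], repeat=7))
print(round(total, 6))  # -> 1.0
```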


Trees and forests

Tree: undirected acyclic graph with exactly $(n-1)$ edges.

Forest: contains $k < (n-1)$ edges → not connected.

Factorization property:

$$p(x) = \prod_{i \in V} p(x_i) \prod_{(i,j) \in E} \frac{p_{i,j}(x_i, x_j)}{p(x_i)\, p(x_j)}.$$

Notational convention: $p$ represents a probability distribution, $\hat{p}$ represents a graphical approximation (tree- or forest-structured).

Projection $\hat{p}$ of $p$ onto a tree (or forest) with edge set $E$:

$$\hat{p}(x) := \prod_{i \in V} p(x_i) \prod_{(i,j) \in E} \frac{p_{i,j}(x_i, x_j)}{p(x_i)\, p(x_j)}.$$

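A minimal sketch of the projection formula, assuming binary variables and a joint distribution stored as a full table over $\{0,1\}^n$ (feasible only for small $n$); helper names such as `project_onto_forest` are illustrative.

```python
# Sketch: project a joint table p over {0,1}^n onto a given tree/forest edge set,
# using only the marginal and pairwise statistics of p. Helper names are illustrative.
import numpy as np
from itertools import product

def marginal(p, i):
    """1-D marginal of variable i from the full joint table p."""
    return p.sum(axis=tuple(a for a in range(p.ndim) if a != i))

def pairwise(p, i, j):
    """2-D marginal over (x_i, x_j), indexed as [x_i, x_j]."""
    m = p.sum(axis=tuple(a for a in range(p.ndim) if a not in (i, j)))
    return m if i < j else m.T

def project_onto_forest(p, edges):
    """Table of p_hat(x) = prod_i p(x_i) * prod_{(i,j) in E} p_ij/(p_i p_j)."""
    n = p.ndim
    p_hat = np.zeros_like(p)
    for x in product([0, 1], repeat=n):
        val = np.prod([marginal(p, i)[x[i]] for i in range(n)])
        for (i, j) in edges:
            val *= pairwise(p, i, j)[x[i], x[j]] / (marginal(p, i)[x[i]] * marginal(p, j)[x[j]])
        p_hat[x] = val
    return p_hat

# Toy check: a random 3-variable joint projected onto the chain 0-1-2 still sums to 1
# (and matches p on the pairwise marginals of the chosen edges).
rng = np.random.default_rng(0)
p = rng.random((2, 2, 2)); p /= p.sum()
p_hat = project_onto_forest(p, edges=[(0, 1), (1, 2)])
print(round(p_hat.sum(), 6))  # -> 1.0
```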


Generative learning of trees [2]

Optimal tree approximation of a distribution

Given $p$, find $\hat{p} = \arg\min_{\hat{p} \in \mathcal{T}} D(p\,\|\,\hat{p})$, where $D(p\,\|\,\hat{p}) := \int p(x) \log \frac{p(x)}{\hat{p}(x)}\, dx$.

Equivalent max-weight spanning tree (MWST) problem:

$$\max_{E:\, G=(V,E) \text{ is a tree}} \;\; \sum_{(i,j) \in E} I(x_i; x_j).$$

Need only marginal and pairwise statistics

Kruskal MWST algorithm.

[2] Chow and Liu, IEEE Trans. Inf. Theory, 1968

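A hedged sketch of the Chow–Liu recipe for binary samples: estimate pairwise mutual informations and feed them to a small Kruskal MWST; function names are illustrative and details such as smoothing and tie-breaking are ignored.

```python
# Sketch of Chow-Liu tree learning: empirical mutual information + Kruskal MWST.
# Assumes binary samples in a (num_samples, n) array; names are illustrative.
import numpy as np
from itertools import combinations

def mutual_information(data, i, j, eps=1e-12):
    """Empirical I(x_i; x_j) for binary columns i and j."""
    pij = np.array([[np.mean((data[:, i] == a) & (data[:, j] == b))
                     for b in range(2)] for a in range(2)])
    pi, pj = pij.sum(axis=1), pij.sum(axis=0)
    return sum(pij[a, b] * np.log((pij[a, b] + eps) / (pi[a] * pj[b] + eps))
               for a in range(2) for b in range(2) if pij[a, b] > 0)

def kruskal_mwst(n, weights):
    """Max-weight spanning tree over nodes 0..n-1; weights maps (i, j) -> weight."""
    parent = list(range(n))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    chosen = []
    for (i, j), w in sorted(weights.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:                       # keep the edge only if it creates no cycle
            parent[ri] = rj
            chosen.append((i, j))
    return chosen

# Toy data: x1 drives both x0 and x2, so the learned tree should pass through x1.
rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, 2000)
x0 = np.where(rng.random(2000) < 0.9, x1, 1 - x1)
x2 = np.where(rng.random(2000) < 0.8, x1, 1 - x1)
data = np.column_stack([x0, x1, x2])

weights = {(i, j): mutual_information(data, i, j) for i, j in combinations(range(3), 2)}
print(kruskal_mwst(3, weights))  # expected: [(0, 1), (1, 2)]
```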


J-divergence

Given distributions p and q,

$$J(p, q) := D(p\,\|\,q) + D(q\,\|\,p) = \int_{\mathcal{X}^n} \big(p(x) - q(x)\big) \log \frac{p(x)}{q(x)}\, dx.$$

$$\frac{1}{4} \exp(-J) \;\le\; \Pr(\mathrm{err}) \;\le\; \frac{1}{2} \left(\frac{J}{4}\right)^{-1/4}.$$

Maximize J to minimize upper bound on Pr(err).

Tree-approximate J-divergence of $\hat{p}, \hat{q}$ w.r.t. $p, q$:

$$\hat{J}(\hat{p}, \hat{q}; p, q) := \int_{\mathcal{X}^n} \big(p(x) - q(x)\big) \log \frac{\hat{p}(x)}{\hat{q}(x)}\, dx.$$

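For intuition, a small sketch that evaluates the J-divergence of two discrete distributions given as probability vectors, together with the error-probability bounds quoted above; the numbers are illustrative.

```python
# Sketch: J-divergence of two discrete distributions over the same finite alphabet,
# plus the quoted bounds on Pr(err). The example distributions are illustrative.
import numpy as np

def kl(p, q):
    """D(p || q) for probability vectors p, q (assumed strictly positive here)."""
    return float(np.sum(p * np.log(p / q)))

def j_divergence(p, q):
    """J(p, q) = D(p||q) + D(q||p) = sum_x (p(x) - q(x)) log(p(x)/q(x))."""
    return kl(p, q) + kl(q, p)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
J = j_divergence(p, q)
lower = 0.25 * np.exp(-J)          # (1/4) exp(-J) <= Pr(err)
upper = 0.5 * (J / 4) ** (-0.25)   # Pr(err) <= (1/2) (J/4)^(-1/4)
print(round(J, 4), round(lower, 4), round(upper, 4))
```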


Marginal consistency of $\hat{p}$ w.r.t. $p$:

$$\hat{p}_{i,j}(x_i, x_j) = p_{i,j}(x_i, x_j), \quad \forall\, (i,j) \in E_{\hat{p}}.$$

Benefits of tree-approx. J-divergence:

Maximizing the tree-approximate J-divergence gives good discriminative performance (shown experimentally).

Marginal consistency leads to tractable optimization for $\hat{p}$ and $\hat{q}$.

Trees provide a rich class of distributions for modeling high-dimensional data.


Discriminative learning of trees

$\tilde{p}$ and $\tilde{q}$: empirical distributions estimated from $T_p$ and $T_q$ respectively.

$$(\hat{p}, \hat{q}) = \arg\max_{\hat{p},\, \hat{q}\, \in\, \mathcal{T}} \hat{J}(\hat{p}, \hat{q}; \tilde{p}, \tilde{q}),$$

where $\mathcal{T}$ is the family of tree-structured distributions.

Decoupling into two independent MWST problems:

$$\hat{p} = \arg\min_{\hat{p} \in \mathcal{T}} \big[ D(\tilde{p}\,\|\,\hat{p}) - D(\tilde{q}\,\|\,\hat{p}) \big], \qquad \hat{q} = \arg\min_{\hat{q} \in \mathcal{T}} \big[ D(\tilde{q}\,\|\,\hat{q}) - D(\tilde{p}\,\|\,\hat{q}) \big].$$


Edge weights:

$$\psi^{p}_{i,j} := \mathbb{E}_{\tilde{p}_{i,j}}\!\left[\log \frac{\tilde{p}_{i,j}}{\tilde{p}_i\, \tilde{p}_j}\right] - \mathbb{E}_{\tilde{q}_{i,j}}\!\left[\log \frac{\tilde{p}_{i,j}}{\tilde{p}_i\, \tilde{p}_j}\right], \qquad \psi^{q}_{i,j} := \mathbb{E}_{\tilde{q}_{i,j}}\!\left[\log \frac{\tilde{q}_{i,j}}{\tilde{q}_i\, \tilde{q}_j}\right] - \mathbb{E}_{\tilde{p}_{i,j}}\!\left[\log \frac{\tilde{q}_{i,j}}{\tilde{q}_i\, \tilde{q}_j}\right].$$

Algorithm 1: Discriminative trees (DT)

Given: Training sets $T_p$ and $T_q$.

1: Estimate pairwise statistics $\tilde{p}_{i,j}(x_i, x_j)$ and $\tilde{q}_{i,j}(x_i, x_j)$ for all edges $(i,j)$.

2: Compute edge weights $\psi^{p}_{i,j}$ and $\psi^{q}_{i,j}$ for all edges $(i,j)$.

3: Find $E_{\hat{p}} = \mathrm{MWST}(\{\psi^{p}_{i,j}\})$ and $E_{\hat{q}} = \mathrm{MWST}(\{\psi^{q}_{i,j}\})$.

4: Obtain $\hat{p}$ by projecting $\tilde{p}$ onto $E_{\hat{p}}$; likewise $\hat{q}$.

5: LRT using $\hat{p}$ and $\hat{q}$.

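A hedged sketch of steps 1–2 of Algorithm 1 for binary data: estimate pairwise empiricals from both training sets and form the weights $\psi^{p}_{i,j}$ (the $\psi^{q}_{i,j}$ are symmetric); any MWST routine, such as the Kruskal sketch shown earlier, can then supply step 3. Names are illustrative.

```python
# Sketch of the discriminative edge weights psi^p_{i,j} from binary training data.
# Tp and Tq are (num_samples, n) arrays of 0/1 samples; names are illustrative.
import numpy as np
from itertools import combinations

def pairwise_table(data, i, j, eps=1e-9):
    """Empirical joint table of (x_i, x_j), lightly smoothed to avoid zeros."""
    t = np.array([[np.mean((data[:, i] == a) & (data[:, j] == b))
                   for b in range(2)] for a in range(2)]) + eps
    return t / t.sum()

def psi_p(Tp, Tq, i, j):
    """psi^p_{i,j} = E_{p_ij}[log p_ij/(p_i p_j)] - E_{q_ij}[log p_ij/(p_i p_j)]."""
    p_ij = pairwise_table(Tp, i, j)
    q_ij = pairwise_table(Tq, i, j)
    log_ratio = np.log(p_ij / np.outer(p_ij.sum(axis=1), p_ij.sum(axis=0)))
    return float(np.sum(p_ij * log_ratio) - np.sum(q_ij * log_ratio))

# Toy training sets with n = 3 binary features per class (illustrative only).
rng = np.random.default_rng(1)
Tp = rng.integers(0, 2, size=(500, 3))
Tq = rng.integers(0, 2, size=(500, 3))
weights_p = {(i, j): psi_p(Tp, Tq, i, j) for i, j in combinations(range(3), 2)}
print(weights_p)  # feed these to an MWST routine to obtain E_p (step 3 of Algorithm 1)
```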


Discriminative learning of forests

$\hat{p}^{(k)}$ and $\hat{q}^{(k)}$: Markov on forests with at most $k \le (n-1)$ edges.

Maximize the joint objective over both pairs of distributions:

$$(\hat{p}^{(k)}, \hat{q}^{(k)}) = \arg\max_{\hat{p},\, \hat{q}\, \in\, \mathcal{T}^{(k)}} \hat{J}(\hat{p}, \hat{q}; \tilde{p}, \tilde{q}),$$

where $\mathcal{T}^{(k)}$ is the family of forest-structured distributions with at most $k$ edges.

Useful property:

$$\hat{J}(\hat{p}, \hat{q}; \tilde{p}, \tilde{q}) = \sum_{i \in V} J(\tilde{p}_i, \tilde{q}_i) + \sum_{(i,j) \in E_{\hat{p}} \cup E_{\hat{q}}} w_{ij},$$

where $w_{ij}$ can be expressed in terms of mutual information terms and KL-divergences involving marginal and pairwise statistics.

Additivity of the cost function and optimality of the $k$-step Kruskal MWST algorithm for each $k$ ⇒ $k$-step Kruskal MWST yields the optimal forest-structured distribution.

Estimated edge sets are nested, i.e., $E_{\hat{p}^{(k-1)}} \subseteq E_{\hat{p}^{(k)}}, \ \forall\, k \le n-1$.

A single run of Kruskal MWST recovers all $(n-1)$ pairs of edge substructures!

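The nestedness and the "single run" claim can be illustrated with a generic sketch: one Kruskal pass produces an ordered list of accepted edges whose first $k$ entries form the optimal $k$-edge forest for every $k$. The weights below are arbitrary stand-ins for the $w_{ij}$ of the slides.

```python
# Sketch: a single Kruskal pass yields all nested max-weight forests at once.
# The edge weights are arbitrary numbers standing in for the w_ij of the slides.
def kruskal_forests(n, weights):
    """Accepted edges in order; the first k entries form the optimal k-edge forest."""
    parent = list(range(n))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    accepted = []
    for (i, j), w in sorted(weights.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            accepted.append((i, j))
    return accepted

weights = {(0, 1): 3.0, (1, 2): 2.5, (0, 2): 1.0, (2, 3): 0.7, (1, 3): 0.2}
edges = kruskal_forests(4, weights)
for k in range(1, len(edges) + 1):
    print(k, edges[:k])   # each k-edge forest contains the previous one
```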


Learning thicker graphs via boosting

Trees learn $(n-1)$ edges → sparse representation.

Desirable to learn more graph edges for better classification, if we can also avoid overfitting.

Learning of junction trees known to be NP-hard.

Learning general graph structures is intractable.

Use boosting to learn more than $(n-1)$ edges per model.


Boosted graphical model classification

DT classifier used as a weak learner

Training sets Tp and Tq remain unchanged

In the $t$-th iteration, learn trees $\hat{p}_t$ and $\hat{q}_t$, and classify using:

$$h_t(x) := \log \frac{\hat{p}_t(x)}{\hat{q}_t(x)}.$$

Learn trees by minimizing the weighted training error: use the weighted empiricals $(\tilde{p}_w, \tilde{q}_w)$ instead of $(\tilde{p}, \tilde{q})$.

Final boosted classifier:

$$H_T(x) = \mathrm{sgn}\!\left[\sum_{t=1}^{T} \alpha_t \log \frac{\hat{p}_t(x)}{\hat{q}_t(x)}\right] = \mathrm{sgn}\!\left[\log \frac{p^*(x)}{q^*(x)}\right], \quad \text{where } p^*(x) := \prod_{t=1}^{T} \hat{p}_t(x)^{\alpha_t}.$$

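A schematic, runnable sketch of the boosting loop. The weak learner is a weighted log-likelihood-ratio "stump" on a single binary feature, standing in for the DT tree classifier, and the AdaBoost-style weight update is a standard recipe assumed here rather than taken from the slides; the final classifier follows the $\mathrm{sgn}[\sum_t \alpha_t h_t(x)]$ form above.

```python
# Schematic boosting loop with a single-feature log-ratio stump as the weak learner
# (a stand-in for the DT classifier); the weight update is standard AdaBoost.
import numpy as np

def stump_weak_learner(X, y, w):
    """Pick the feature whose weighted single-feature log-ratio classifies best.
    y is +1 for class p and -1 for class q; w are sample weights."""
    best = None
    for i in range(X.shape[1]):
        p1 = (w[(y == 1) & (X[:, i] == 1)].sum() + 1e-3) / (w[y == 1].sum() + 2e-3)
        q1 = (w[(y == -1) & (X[:, i] == 1)].sum() + 1e-3) / (w[y == -1].sum() + 2e-3)
        def h(Z, p1=p1, q1=q1, i=i):          # h_t(x) = log p_t(x)/q_t(x) on feature i
            px = np.where(Z[:, i] == 1, p1, 1 - p1)
            qx = np.where(Z[:, i] == 1, q1, 1 - q1)
            return np.log(px / qx)
        err = float(np.sum(w * (np.sign(h(X)) != y)))
        if best is None or err < best[0]:
            best = (err, h)
    return best

def boost(X, y, T=5):
    w = np.ones(len(y)) / len(y)
    learners = []
    for _ in range(T):
        err, h = stump_weak_learner(X, y, w)
        err = float(np.clip(err, 1e-6, 1 - 1e-6))
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * np.sign(h(X)))   # up-weight misclassified samples
        w /= w.sum()
        learners.append((alpha, h))
    # Final classifier: sgn[ sum_t alpha_t * log(p_t(x)/q_t(x)) ]
    return lambda Z: np.sign(sum(a * h(Z) for a, h in learners))

# Toy data: class p tends to have x_0 = x_1 = 1, class q the opposite.
rng = np.random.default_rng(0)
Xp = (rng.random((200, 3)) < [0.8, 0.7, 0.5]).astype(int)
Xq = (rng.random((200, 3)) < [0.2, 0.3, 0.5]).astype(int)
X = np.vstack([Xp, Xq])
y = np.concatenate([np.ones(200), -np.ones(200)])
H = boost(X, y, T=5)
print(np.mean(H(X) == y))   # training accuracy of the boosted classifier
```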


Some comments

Boosting learns at most $(n-1)$ edges per iteration ⇒ a maximum of $(n-1)T$ edges (pairwise features).

With suitable normalization, $p^*(x)/Z_p(\alpha)$ is a probability distribution.

$p^*(x)/Z_p(\alpha)$ is Markov on a graph $G = (V, E_{p^*})$ with edge set

$$E_{p^*} = \bigcup_{t=1}^{T} E_{\hat{p}_t}.$$

How to avoid overfitting?

Use cross-validation to determine the optimum number of iterations $T^*$.


Extension to multi-class problems

Set of classes $\mathcal{I}$: one-versus-all strategy.

$\hat{p}^{(k)}_{i|j}(x)$ and $\hat{p}^{(k)}_{j|i}(x)$: learned forests for the binary classification problem Class $i$ versus Class $j$.

$$f^{(k)}_{ij}(x) := \log \frac{\hat{p}^{(k)}_{i|j}(x)}{\hat{p}^{(k)}_{j|i}(x)}, \quad i, j \in \mathcal{I}.$$

Multi-class decision function:

$$g^{(k)}(x) := \arg\max_{i \in \mathcal{I}} \sum_{j \in \mathcal{I}} f^{(k)}_{ij}(x).$$

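A tiny sketch of the multi-class decision rule. The pairwise log-ratio functions $f_{ij}$ are stubbed out with made-up linear scores purely to exercise the argmax; in the slides they come from the learned forests $\hat{p}^{(k)}_{i|j}$ and $\hat{p}^{(k)}_{j|i}$.

```python
# Sketch of the multi-class decision g(x) = argmax_i sum_j f_ij(x), with the pairwise
# log-ratio classifiers f_ij replaced by made-up linear scores for illustration.
import numpy as np

classes = [0, 1, 2]
rng = np.random.default_rng(0)
templates = {i: rng.normal(size=4) for i in classes}    # fake per-class "templates"

def f(i, j, x):
    """Stand-in for log( p_{i|j}(x) / p_{j|i}(x) ): positive favors class i over j."""
    return float(x @ templates[i] - x @ templates[j])

def g(x):
    """Multi-class decision: argmax over i of the summed pairwise scores."""
    scores = {i: sum(f(i, j, x) for j in classes if j != i) for i in classes}
    return max(scores, key=scores.get)

x = 2.0 * templates[2] + 0.1 * rng.normal(size=4)       # a sample near class 2's template
print(g(x))                                             # very likely prints 2
```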


Conclusion

Discriminative learning optimizes an approximation to the expectation of the log-likelihood ratio.

Superior performance in classification applications compared to generative approaches.

Learned tree models can have different edge structures → removes the restriction of the Tree-Augmented Naive Bayes (TAN) framework.

No additional computational overhead compared to existing tree-based methods.

Amenable to boosting → weak learners on weighted empiricals.

Learning thicker graphical models in a principled manner.
