direct sparse structural change detection in markov …liu/papers/ecml_talk.pdfsl is supported by...

Direct Learning of Sparse Changes in Markov Networks by Density Ratio Estimation

Song Liu¹, John Quinn², Michael Gutmann³, Taiji Suzuki¹, and Masashi Sugiyama¹

"Tempora mutantur, nos et mutamur in illis." ("Times change, and we change with time.")

Latin phrase

SL is supported by the JST PRESTO program and the JSPS fellowship, JQ is supported by the JST PRESTO program, and MS is supported by the JST CREST program. MUG is supported by the Finnish Centre-of-Excellence in Computational Inference Research COIN (251170). TS was partially supported by MEXT Kakenhi 25730013, and the Aihara Project, the FIRST program from JSPS, initiated by CSTP.

1. Tokyo Institute of Technology, Japan

2. Makerere University, Uganda

3. University of Helsinki and HIIT, Finland.


Interactions Everywhere

Examples of interactions:

• Genes regulate each other via a gene network.
• Synonyms tend to co-occur in the same text corpus.
• Brain EEG signals may be synchronized in a certain pattern.

However, such interactions may be changing!

(Wikipedia)


Structural Change Detection

Interactions between features may change.

e.g., some genes related to sleep may be activated, but only in the evening.

"apple" may co-occur with "banana" quite often in cookbooks, but not in IT news.

[Figure: the change of brain signal correlation at two different experiment intervals (Williamson et al., 2012).]

$\boldsymbol{x} = (x^{(1)}, \ldots, x^{(d)})^\top$

Outline

1. Introductions

2. Related Works
   1. Gaussian Markov Networks (GMNs)
   2. Estimating Sparse GMN: Graphical Lasso (Glasso)
   3. Glasso: Pros and Cons
   4. Detecting Changes via Fused-lasso (Flasso)
   5. Nonparanormal Extension
   6. Generalized Log-linear Model
   7. The Normalization Issue

3. Proposed Approach

4. Experiments

5. Conclusion

Gaussian Markov Networks (GMNs)


The interactions between random variables can be modelled by Markov Networks.

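Markov Networks (MNs) are undirected graphical models. The simplest example of an MN is a Gaussian MN:

$$p(\boldsymbol{x}; \boldsymbol{\Theta}) = \frac{\det(\boldsymbol{\Theta})^{1/2}}{(2\pi)^{d/2}} \exp\left(-\frac{1}{2} \boldsymbol{x}^\top \boldsymbol{\Theta} \boldsymbol{x}\right),$$

where $\boldsymbol{x} = (x^{(1)}, x^{(2)}, \ldots, x^{(d)})^\top$ and $\boldsymbol{\Theta}$ is the inverse covariance matrix.

We can visualize the above MN using an undirected graph.

[Figure: an undirected graph over nodes 1 to 6.]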
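To make the link between the inverse covariance matrix and the graph concrete, here is a minimal sketch. The helper name `edges_from_precision`, the tolerance `tol`, and the 4-node matrix are illustrative assumptions, not taken from the talk. In a Gaussian MN, nodes u and v are joined by an edge exactly when the off-diagonal entry of 𝚯 at (u, v) is nonzero.

```python
def edges_from_precision(theta, tol=1e-12):
    """Return the undirected edges (u, v), u < v, implied by a precision matrix.

    theta is a square list of lists; nodes are numbered from 1.
    An edge is reported when the off-diagonal entry exceeds tol in magnitude.
    """
    d = len(theta)
    edges = []
    for u in range(d):
        for v in range(u + 1, d):
            if abs(theta[u][v]) > tol:
                edges.append((u + 1, v + 1))
    return edges


# Hypothetical 4-node precision matrix (values chosen for illustration only).
theta = [
    [2.0, 0.5, 0.0, 0.0],
    [0.5, 2.0, -0.3, 0.0],
    [0.0, -0.3, 2.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
edges = edges_from_precision(theta)
print(edges)
```

Node 4 has no nonzero off-diagonal entries in this example, so it is isolated in the graph.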

Estimating Sparse GMN: Graphical Lasso (Glasso)

Recall, we would like to detect changes between MNs.

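Changes can be found once the structures of the two separate GMNs are known to us.

Estimating sparse GMNs can be done via Graphical Lasso (Glasso):

$$\hat{\boldsymbol{\Theta}} = \arg\max_{\boldsymbol{\Theta}} \; \log\det\boldsymbol{\Theta} - \mathrm{tr}(\boldsymbol{S}\boldsymbol{\Theta}) - \lambda \|\boldsymbol{\Theta}\|_1,$$

where $\boldsymbol{S}$ is the sample covariance matrix.

Idea: run Glasso twice (once per data set) and take the parameter difference.

[Figure: two undirected graphs over nodes 1 to 6, one per estimated GMN.]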

Tibshirani, JRSS 1996; Friedman et al., Biostatistics 2008
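The "estimate twice, then subtract" idea can be sketched as follows. This assumes the two sparse precision matrices have already been estimated (for example by Glasso, which is not implemented here); the function name `changed_edges`, the tolerance, and the example matrices are hypothetical.

```python
def changed_edges(theta_p, theta_q, tol=1e-8):
    """Compare two precision matrices entry by entry.

    Returns a dict mapping each changed edge (u, v), u < v, numbered from 1,
    to the difference theta_p[u][v] - theta_q[u][v].
    """
    d = len(theta_p)
    changes = {}
    for u in range(d):
        for v in range(u + 1, d):
            diff = theta_p[u][v] - theta_q[u][v]
            if abs(diff) > tol:
                changes[(u + 1, v + 1)] = diff
    return changes


# Hypothetical estimates: edge (2, 3) is present in P but absent in Q.
theta_p = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.4], [0.0, 0.4, 1.0]]
theta_q = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]]
changes = changed_edges(theta_p, theta_q)
print(changes)
```

Note that any estimation error in either matrix shows up directly in the difference, which is one reason to look for methods that target the change itself.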

Glasso: Pros and Cons

Pros:

Statistical properties are well studied.

Off-the-shelf software can be used.

Cons:

Cannot detect high-order correlation.

Does not work if 𝑝 or 𝑞 is dense.

Not clear how to choose 𝜆𝑃 or 𝜆𝑄.

[Diagram: P (sparse), Q (sparse), and their change (sparse).]

Can we penalize change directly?

Detecting Changes via Fused-lasso (Flasso)

We can impose sparsity directly on the parameter difference, using Fused-lasso (Tibshirani et al., 2005).

Consider the following objective:

A similar approach for Gaussian structural change detection was proposed, using pseudo-likelihood.

We don’t have to assume 𝑝 or 𝑞 is sparse.

Sparsity control is much easier than Glasso.

Gaussianity is still assumed.

(Zhang & Wang, UAI 2010)

Nonparanormal (NPN) Extension

We may assume the data are Gaussian after an NPN transform.

More flexible than Gaussian methods, still tractable.

However, the NPN extension is still restrictive.

$\boldsymbol{x} = (x^{(1)}, x^{(2)}, \ldots, x^{(d)})$

$\boldsymbol{f}(\boldsymbol{x}) = (f_1(x^{(1)}), f_2(x^{(2)}), \ldots, f_d(x^{(d)}))$

$f_k$: monotone, differentiable function

Liu et al., JMLR 2009

Generalized Log-linear Model

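Pairwise Markov Network:

$$p(\boldsymbol{x}; \boldsymbol{\theta}) = \frac{1}{Z(\boldsymbol{\theta})} \exp\left(\sum_{u \ge v} \boldsymbol{\theta}_{u,v}^\top \boldsymbol{f}(x^{(u)}, x^{(v)})\right)$$

$\boldsymbol{f}: \mathbb{R}^2 \to \mathbb{R}^b$ are feature vectors.

The normalization term $Z(\boldsymbol{\theta})$ is generally intractable. Gaussian or NPN models are exceptions.

Examples of feature vectors:

Gaussian: $f_{\mathrm{gau}}(x, y) = xy$

Nonparanormal: $f_{\mathrm{npn}}(x, y) = f(x) f(y)$

Polynomial: $\boldsymbol{f}_{\mathrm{poly}}(x, y) = [x^k, y^k, x^{k-1}y, \ldots, x, y, 1]$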
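The pairwise feature functions can be written down directly. The sketch below is illustrative only: the function names are assumptions, and since the slide elides part of the polynomial feature list, `f_poly` here simply enumerates all monomials of total degree at most k in an ordering of my own choosing, which may differ from the authors' ordering.

```python
def f_gau(x, y):
    """Gaussian pairwise feature: a single product term."""
    return [x * y]


def f_poly(x, y, k=2):
    """Polynomial pairwise features: all monomials x^a * y^b with a + b <= k.

    The ordering (descending power of x, then descending power of y)
    is an assumption made for this sketch.
    """
    return [x ** a * y ** b
            for a in range(k, -1, -1)
            for b in range(k - a, -1, -1)]


def pairwise_score(theta_uv, features):
    """Inner product between one factor's parameters and its feature vector."""
    return sum(t * f for t, f in zip(theta_uv, features))


feats = f_poly(2.0, 3.0, k=2)
print(feats)                       # monomials evaluated at (x, y) = (2, 3)
print(pairwise_score([1.0] * len(feats), feats))
```

For degree k the feature vector has (k + 1)(k + 2) / 2 entries, so the dimension b of each factor grows quadratically with k.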

The Normalization Issue

However, for a generalized Markov Network, there is no closed form for 𝑍(𝜽).

Importance sampling can be used to approximate 𝑍(𝜽).

This lets us adapt Glasso and Flasso to non-Gaussian data.

However, it may result in a high-variance estimator, depending on the choice of the instrumental distribution 𝑝inst. How to choose 𝑝inst is not clear.
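As a rough illustration of the importance-sampling approximation, the sketch below estimates the normalization term of a one-dimensional unnormalized model by averaging the unnormalized density divided by the instrumental density over samples drawn from the instrumental distribution. Everything concrete here is an assumption for the example: the model exp(-θx²/2), the Gaussian instrumental distribution with standard deviation 1.5, the sample size, and the function names. This toy model has a closed-form normalizer, which is used only to check the estimate.

```python
import math
import random


def estimate_z(log_unnorm, sample_inst, pdf_inst, n, rng):
    """Importance-sampling estimate of Z = integral of exp(log_unnorm(x)) dx."""
    total = 0.0
    for _ in range(n):
        x = sample_inst(rng)
        total += math.exp(log_unnorm(x)) / pdf_inst(x)
    return total / n


theta = 1.0   # model parameter (assumed for the example)
sigma = 1.5   # spread of the instrumental distribution (assumed)


def log_unnorm(x):
    # Unnormalized log-density of the toy model.
    return -0.5 * theta * x * x


def sample_inst(rng):
    return rng.gauss(0.0, sigma)


def pdf_inst(x):
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


rng = random.Random(0)
z_hat = estimate_z(log_unnorm, sample_inst, pdf_inst, 20000, rng)
z_true = math.sqrt(2.0 * math.pi / theta)  # known only because the toy model is Gaussian
print(z_hat, z_true)
```

If the instrumental distribution had lighter tails than the model (for instance a much smaller `sigma`), the weights would become heavy-tailed and the estimate much noisier, which is the variance issue noted above.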


Outline

1. Introductions

2. Problem Formulation

3. Related Works

4. Proposed Approach
   1. Modeling Changes Directly
   2. Density Ratio Estimation
   3. Sparsity Inducing Norm
   4. The Dual Formulation

5. Experiments

6. Conclusion

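Modeling Changes Directly

Recall that our interest is the change between the two MNs, i.e., the difference between their parameters.

The ratio of two MNs naturally incorporates this parameter difference!

So, model the ratio directly!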

Modeling Changes Directly

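We model the density ratio instead of the density functions:

$$r(\boldsymbol{x}; \boldsymbol{\beta}) = \frac{1}{N(\boldsymbol{\beta})} \exp\left(\sum_{u \ge v} \boldsymbol{\beta}_{u,v}^\top \boldsymbol{f}(x^{(u)}, x^{(v)})\right)$$

The normalization term is

$$N(\boldsymbol{\beta}) = \int q(\boldsymbol{x}) \exp\left(\sum_{u \ge v} \boldsymbol{\beta}_{u,v}^\top \boldsymbol{f}(x^{(u)}, x^{(v)})\right) \mathrm{d}\boldsymbol{x},$$

to ensure $\int q(\boldsymbol{x})\, r(\boldsymbol{x}; \boldsymbol{\beta})\, \mathrm{d}\boldsymbol{x} = 1$.

Sample average approximation, using samples $\boldsymbol{x}_1^Q, \ldots, \boldsymbol{x}_{n_Q}^Q$ drawn from $q$:

$$\hat{N}(\boldsymbol{\beta}) = \frac{1}{n_Q} \sum_{i=1}^{n_Q} \exp\left(\sum_{u \ge v} \boldsymbol{\beta}_{u,v}^\top \boldsymbol{f}(x_i^{Q(u)}, x_i^{Q(v)})\right)$$

This also works when the integral has no closed form!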
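A small sketch of the sample-average normalization may help. It assumes a log-linear ratio model whose normalizer is the average of the exponentiated scores over the samples from Q; the function name `ratio_values`, the flat parameter vector `beta`, and the toy feature values are all hypothetical. By construction, the modeled ratio then averages to one over the Q samples.

```python
import math


def ratio_values(beta, feats_q, feats_eval):
    """Evaluate a log-linear density ratio model with sample-average normalization.

    beta       : list of parameters (all factors stacked into one vector)
    feats_q    : feature vectors of the samples drawn from Q (used to normalize)
    feats_eval : feature vectors of the points at which the ratio is evaluated
    """
    def score(f):
        return sum(b * fi for b, fi in zip(beta, f))

    # Sample-average approximation of the normalization term.
    n_hat = sum(math.exp(score(f)) for f in feats_q) / len(feats_q)
    return [math.exp(score(f)) / n_hat for f in feats_eval]


# Toy example with a single feature per sample (values are illustrative).
feats_q = [[0.0], [1.0], [2.0]]
beta = [0.5]
r_q = ratio_values(beta, feats_q, feats_q)
print(r_q, sum(r_q) / len(r_q))
```

No integral is ever evaluated here, only an average over observed samples, which is why the approach does not depend on a closed-form normalizer.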


Estimating Density Ratio

Kullback-Leibler Importance Estimation Procedure (KLIEP):

Unconstrained convex optimization!

Sugiyama et al., NIPS 2007

Tsuboi et al, JIP 2009
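For a log-linear ratio model, KLIEP can be written as an unconstrained convex problem: minimize the negative average log-ratio over the samples from P, where the log-normalizer is computed from the samples from Q. The sketch below fits it with plain gradient descent on a one-dimensional toy problem. The step size, iteration count, data, and function name `kliep_fit` are assumptions of this sketch, not the authors' solver, and no sparsity penalty is included.

```python
import math
import random


def kliep_fit(feats_p, feats_q, steps=200, lr=0.5):
    """Gradient descent on the negative KLIEP log-likelihood of a log-linear ratio.

    loss(beta) = -mean_P[beta . f] + log( mean_Q[ exp(beta . f) ] )
    """
    b = len(feats_p[0])
    beta = [0.0] * b
    mean_p = [sum(f[k] for f in feats_p) / len(feats_p) for k in range(b)]
    for _ in range(steps):
        scores = [sum(bk * fk for bk, fk in zip(beta, f)) for f in feats_q]
        m = max(scores)  # subtract the max for numerical stability
        w = [math.exp(s - m) for s in scores]
        total = sum(w)
        # Mean of the features over Q samples, reweighted by the current ratio.
        mean_q = [sum(wi * f[k] for wi, f in zip(w, feats_q)) / total
                  for k in range(b)]
        # Gradient of the loss is (reweighted Q mean) - (P mean).
        beta = [bk - lr * (mq - mp) for bk, mq, mp in zip(beta, mean_q, mean_p)]
    return beta


# Toy data: P = N(1, 1), Q = N(0, 1), so the true log-ratio is x - 0.5.
rng = random.Random(0)
xs_p = [rng.gauss(1.0, 1.0) for _ in range(2000)]
xs_q = [rng.gauss(0.0, 1.0) for _ in range(2000)]
beta_hat = kliep_fit([[x] for x in xs_p], [[x] for x in xs_q])
print(beta_hat)  # should be near 1.0, the coefficient of x in the true log-ratio
```

At the optimum the reweighted feature mean over the Q samples matches the feature mean over the P samples, which is the usual moment-matching reading of this estimator.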


Sparsity Inducing Norm

Imposing sparsity constraints on each factor 𝜷𝑢,𝑣 is equivalent to imposing sparsity on the changes.

Regularizers: an L2 regularizer plus a group lasso regularizer (elastic net).

So finally, we can obtain a 𝜷 with group sparsity!
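To illustrate how a group penalty produces group sparsity, the sketch below computes an elastic-net-style penalty (squared L2 norm plus the sum of per-factor Euclidean norms) and applies group soft-thresholding, the standard proximal operator of the group lasso term. The talk does not say which solver was used, so the proximal step, the names `lam_l2` and `lam_group`, and the example values are assumptions.

```python
import math


def group_norm(group):
    """Euclidean norm of one factor's parameter vector."""
    return math.sqrt(sum(g * g for g in group))


def penalty(beta_groups, lam_l2, lam_group):
    """Squared L2 penalty plus group lasso penalty over all factors."""
    sq = sum(g * g for group in beta_groups.values() for g in group)
    grp = sum(group_norm(group) for group in beta_groups.values())
    return lam_l2 * sq + lam_group * grp


def group_soft_threshold(group, t):
    """Shrink a whole group toward zero; groups with norm <= t become exactly zero."""
    norm = group_norm(group)
    if norm <= t:
        return [0.0] * len(group)
    scale = 1.0 - t / norm
    return [scale * g for g in group]


# Hypothetical factors indexed by node pairs (u, v).
beta_groups = {(1, 2): [3.0, 4.0], (1, 3): [0.3, 0.4]}
print(penalty(beta_groups, lam_l2=0.1, lam_group=1.0))
shrunk = {uv: group_soft_threshold(g, 1.0) for uv, g in beta_groups.items()}
print(shrunk)  # the weak factor (1, 3) is removed entirely
```

Because a factor is zeroed as a whole vector, a pair (u, v) is either reported as changed or not, regardless of how many features the factor has.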

The Dual Formulation

[Figure: computational time versus dimension (40 to 80) for the primal and dual formulations.]

When dimensionality is high, the dual formulation is preferred.

Optimize 𝜶 on the probability simplex.

(Details are in the longer version.)

Outline

1. Introductions

2. Problem Formulation

3. Related Works

4. Proposed Approach

5. Experiments
   1. Numerical Experiments
   2. Real-world Application

6. Conclusion

Gaussian Distribution (𝑛 = 100, 𝑑 = 40)

Start from a 40-dimensional GMN with random correlations, then randomly drop 15 edges. Precision and recall curves are averaged over 20 runs.

[Figures: regularization paths (vertical axes labeled 𝜷𝑢,𝑣) and the P-R curve.]
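For reference, one point on a precision-recall curve can be computed from a set of detected changed edges and the set of truly changed edges, as sketched below. The function name, the convention of returning 1.0 when a denominator is empty, and the example edge sets are assumptions made for illustration.

```python
def precision_recall(detected, true_changes):
    """Precision and recall of detected changed edges against the ground truth.

    Both arguments are iterables of (u, v) pairs with u < v.
    """
    detected = set(detected)
    true_changes = set(true_changes)
    tp = len(detected & true_changes)
    precision = tp / len(detected) if detected else 1.0
    recall = tp / len(true_changes) if true_changes else 1.0
    return precision, recall


# Hypothetical example: 3 truly changed edges, 4 detections, 2 of them correct.
true_changes = {(1, 2), (2, 3), (4, 5)}
detected = {(1, 2), (4, 5), (5, 6), (1, 6)}
precision, recall = precision_recall(detected, true_changes)
print(precision, recall)
```

Sweeping the regularization strength changes how many factors are nonzero, and each setting gives one such (precision, recall) pair.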

Gaussian Distribution (𝑛 = 50, 𝑑 = 40)

Start from a 40-dimensional GMN with random correlations, then randomly drop 15 edges. Precision and recall curves are averaged over 20 runs.

[Figures: regularization paths (vertical axes labeled 𝜷𝑢,𝑣) and the P-R curve.]

Diamond Distribution (n = 5000, d = 9)

Diamond Distribution:

Samples are drawn by slice sampling.

Only the methods with the correct model have good performance.

[Figures: regularization path (vertical axis labeled 𝜷𝑢,𝑣) and the P-R curve.]

Real-world Applications

• Gene Network
  • Detecting changes from the original network to the modified network.

• Twitter Messages
  • Samples are the frequencies of 10 related keywords over time.
  • Detecting the change of co-occurrences of keywords before and after a certain event.

source: Wikipedia

Gene Network

The gene regulatory network is modified manually. 50 samples are collected before (𝑃) and after (𝑄) the change.

A polynomial kernel is used for 𝒇. 𝜆1 is chosen by hold-out cross-validation. Only KLIEP, Flasso, and IS-Flasso are compared.

Gene Network (n = 50, d = 13)

[Figures: regularization paths for KLIEP, Flasso, and IS-Flasso (vertical axes labeled 𝜷𝑢,𝑣 or |𝜷𝑢,𝑣|), and the P-R curve.]

Twitter Keywords

We choose the Deepwater Horizon oil spill as the target event.

[Figure: time axis with segments labeled Q, P, P, P, …; annotation "3 weeks~4.17".]

Twitter Keywords From 7.26-9.14

[Figures: results for KLIEP and Flasso.]

Outline

1. Introductions

2. Problem Formulation

3. Related Works

4. Proposed Approach

5. Experiments

6. Conclusion

Conclusion

Learning sparse changes in two Markov Networks, directly!

• By density ratio estimation.

Two advantages compared to conventional methods:

• Higher accuracy: thanks to the direct modeling nature.
• Wider applicability: P and Q are not limited to discrete, Gaussian, or NPN models.
