Direct Learning of Sparse Changes in Markov Networks by Density Ratio Estimation
Song Liu¹, John Quinn², Michael Gutmann³, Taiji Suzuki¹ and Masashi Sugiyama¹
"Tempora mutantur, nos et mutamur in illis."
"Times change, and we change with time."
Latin phrase.
SL is supported by the JST PRESTO program and the JSPS fellowship, JQ is supported by the JST PRESTO program, and MS is supported by the JST CREST program. MUG is supported by the Finnish Centre-of-Excellence in Computational Inference Research COIN (251170). TS was partially supported by MEXT Kakenhi 25730013, and the Aihara Project, the FIRST program from JSPS, initiated by CSTP.
1. Tokyo Institute of Technology, Japan
2. Makerere University, Uganda
3. University of Helsinki and HIIT, Finland.
Interactions Everywhere
Examples of Interactions
Genes regulate each other via a gene network.
Synonyms tend to co-occur in the same text corpus.
Brain EEG signals may be synchronized in a certain pattern.
However, such interactions may be changing!
(Wikipedia)
Structural Change Detection
Interactions between features may change.
e.g. Some genes related to sleep may be activated, but only in the evening.
“apple” may co-occur with “banana” quite often in cookbooks, but not in IT news.
The change of brain signal correlation at two different experiment intervals. (Williamson et al., 2012)
$\boldsymbol{x} = (x^{(1)}, \ldots, x^{(d)})^\top$
Outline
1. Introductions
2. Related Works
   1. Gaussian Markov Networks (GMNs)
   2. Estimating Sparse GMN: Graphical Lasso (Glasso)
   3. Glasso: Pros and Cons
   4. Detecting Changes via Fused-lasso (Flasso)
   5. Nonparanormal Extension
   6. Generalized Log-linear Model
   7. The Normalization Issue
3. Proposed Approach
4. Experiments
5. Conclusion
Gaussian Markov Networks (GMNs)
The interactions between random variables can be modelled by Markov Networks.
Markov Networks (MNs) are undirected graphical models. The simplest example of an MN is a Gaussian MN:
$\boldsymbol{x} = (x^{(1)}, x^{(2)}, \ldots, x^{(d)})$; $\boldsymbol{\Theta}$ is the inverse covariance matrix.
We can visualize the above MN using an undirected graph.
[Figure: undirected graph over nodes 1 to 6]
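The density shown on this slide did not survive in the transcript. Assuming the usual zero-mean parameterization (an assumption, since the slide's exact form is not recoverable), a Gaussian MN can be written as:

```latex
p(\boldsymbol{x}; \boldsymbol{\Theta})
  = \frac{\det(\boldsymbol{\Theta})^{1/2}}{(2\pi)^{d/2}}
    \exp\!\left(-\frac{1}{2}\,\boldsymbol{x}^\top \boldsymbol{\Theta}\, \boldsymbol{x}\right)
```

Under this form, $\Theta_{u,v} = 0$ means that nodes $u$ and $v$ are not connected in the graph.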
Estimating Sparse GMN: Graphical Lasso (Glasso)
Recall, we would like to detect changes between MNs.
Changes can be found once the structures of the two separate GMNs are known.
Estimating sparse GMNs can be done via Graphical Lasso (Glasso).
Idea: run Glasso twice, then take the parameter difference.
Glasso takes the sample covariance matrix as input.
[Figure: two undirected graphs over nodes 1 to 6]
Tibshirani, JRSS 1996; Friedman et al., Biostatistics 2008
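The Glasso objective is not captured in the transcript. A standard form, written with assumed symbols ($\boldsymbol{S}$ for the sample covariance matrix and $\lambda$ for the regularization parameter, neither taken from the slide), is:

```latex
\hat{\boldsymbol{\Theta}}
  = \operatorname*{argmax}_{\boldsymbol{\Theta} \succ 0}\;
    \log\det\boldsymbol{\Theta}
    - \operatorname{tr}(\boldsymbol{S}\boldsymbol{\Theta})
    - \lambda \,\|\boldsymbol{\Theta}\|_1
```

Under this sketch, "twice Glasso" means solving the problem once on the samples from $P$ and once on the samples from $Q$, each with its own regularization parameter, and then inspecting $\hat{\boldsymbol{\Theta}}^P - \hat{\boldsymbol{\Theta}}^Q$.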
Glasso: Pros and Cons
Pros:
Statistical properties are well studied.
Off-the-shelf software can be used.
Cons:
Cannot detect high-order correlation.
Does not work if 𝑝 or 𝑞 is dense.
Not clear how to choose $\lambda_P$ or $\lambda_Q$.
[Diagram: P (sparse), Q (sparse), Change (sparse)]
Can we penalize change directly?
Detecting Changes via Fused-lasso (Flasso)
We can impose sparsity directly on the change in parameters, using Fused-lasso (Tibshirani et al., 2005).
Consider the following objective:
A similar approach for Gaussian structural change was proposed, using pseudo-likelihood.
We don’t have to assume 𝑝 or 𝑞 is sparse.
Sparsity control is much easier than Glasso.
Gaussianity is still assumed.
(Zhang & Wang, UAI 2010)
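The Flasso objective itself is missing from the transcript. A commonly used form for two Gaussian MNs is sketched below as an assumption: $\boldsymbol{S}^P$ and $\boldsymbol{S}^Q$ (sample covariance matrices) and $\lambda_1$, $\lambda_2$ (regularization parameters) are illustrative symbols, not taken from the slide.

```latex
\max_{\boldsymbol{\Theta}^P,\, \boldsymbol{\Theta}^Q \succ 0}\;
  \log\det\boldsymbol{\Theta}^P - \operatorname{tr}(\boldsymbol{S}^P\boldsymbol{\Theta}^P)
  + \log\det\boldsymbol{\Theta}^Q - \operatorname{tr}(\boldsymbol{S}^Q\boldsymbol{\Theta}^Q)
  - \lambda_1\left(\|\boldsymbol{\Theta}^P\|_1 + \|\boldsymbol{\Theta}^Q\|_1\right)
  - \lambda_2\,\|\boldsymbol{\Theta}^P - \boldsymbol{\Theta}^Q\|_1
```

The $\lambda_2$ term is the one that penalizes the change directly. With a small $\lambda_1$, neither individual network needs to be sparse.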
Nonparanormal (NPN) Extension
We may assume the data are Gaussian after the NPN transform.
More flexible than Gaussian methods, still tractable.
However, the NPN extension is still restrictive.
$f_k$: monotone, differentiable function
$\boldsymbol{x} = (x^{(1)}, x^{(2)}, \ldots, x^{(d)})$
$\boldsymbol{f}(\boldsymbol{x}) = (f_1(x^{(1)}), f_2(x^{(2)}), \ldots, f_d(x^{(d)}))$
Liu et al., JMLR 2009
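As a rough illustration of what a monotone transform $f_k$ can look like in practice, the sketch below maps one feature to normal scores through its ranks. This is a simplified stand-in, not the estimator of Liu et al.; the function name `normal_scores` and the `rank / (n + 1)` convention are assumptions made for the example.

```python
from statistics import NormalDist


def normal_scores(column):
    """Rank-based Gaussianization of one feature (simplified NPN-style transform).

    Each value is replaced by the standard normal quantile of its
    empirical CDF value, so the map is monotone in the input.
    Ties are broken by position, which is a simplification.
    """
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    std_normal = NormalDist()
    out = [0.0] * n
    for rank, i in enumerate(order, start=1):
        # rank / (n + 1) keeps the empirical CDF strictly inside (0, 1)
        out[i] = std_normal.inv_cdf(rank / (n + 1))
    return out


x = [0.1, 5.0, 0.3, 100.0, 1.2]  # heavily skewed toy feature
z = normal_scores(x)             # same ordering as x, roughly Gaussian marginals
```

Applying such a transform to every feature and then running a Gaussian method is the general idea of the NPN route. Dependencies that are not Gaussian after a feature-wise transform remain out of reach.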
Generalized Log-linear Model
Pairwise Markov Network
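$\boldsymbol{f}$ are feature vectors, $\boldsymbol{f}: \mathbb{R}^2 \to \mathbb{R}^b$.
The normalization term $Z(\boldsymbol{\theta})$ is generally intractable. Gaussian or NPN models are exceptions.
Gaussian: $f_{\mathrm{gau}}(x, y) = xy$
Nonparanormal: $f_{\mathrm{npn}}(x, y) = f(x) f(y)$
Polynomial: $\boldsymbol{f}_{\mathrm{poly}}(x, y) = [x^k, y^k, x^{k-1}y, \ldots, x, y, 1]$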
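The model equation for the pairwise Markov network is not in the transcript. A usual way to write a pairwise log-linear model with these ingredients is sketched below. The index convention $u \ge v$ and the per-pair parameter blocks $\boldsymbol{\theta}_{u,v}$ are assumptions.

```latex
p(\boldsymbol{x}; \boldsymbol{\theta})
  = \frac{1}{Z(\boldsymbol{\theta})}
    \exp\!\left(\sum_{u \ge v} \boldsymbol{\theta}_{u,v}^\top
      \boldsymbol{f}\big(x^{(u)}, x^{(v)}\big)\right),
\qquad
Z(\boldsymbol{\theta})
  = \int \exp\!\left(\sum_{u \ge v} \boldsymbol{\theta}_{u,v}^\top
      \boldsymbol{f}\big(x^{(u)}, x^{(v)}\big)\right) \mathrm{d}\boldsymbol{x}
```

With $f_{\mathrm{gau}}(x, y) = xy$ the exponent is quadratic in $\boldsymbol{x}$, which is why the Gaussian case has a closed-form $Z(\boldsymbol{\theta})$.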
The Normalization Issue
However, for a generalized Markov Network, there is no closed form for 𝑍(𝜽).
Importance sampling can be used to approximate 𝑍(𝜽).
This lets us adapt Glasso & Flasso to non-Gaussian data.
May result in a high-variance estimator, depending on the choice of the instrumental distribution $p_{\mathrm{inst}}$. How to choose $p_{\mathrm{inst}}$ is not clear.
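A minimal sketch of the importance-sampling approximation, on a toy one-dimensional case where the true value is known. The unnormalized density $\exp(-x^2/2)$, the Gaussian instrumental distribution with standard deviation 2, and all function names are assumptions chosen for illustration.

```python
import numpy as np


def importance_sampling_Z(log_unnorm, sample_inst, log_p_inst, n, rng):
    """Estimate Z = integral of exp(log_unnorm(x)) dx.

    Draws x_i ~ p_inst and averages the weights
    exp(log_unnorm(x_i)) / p_inst(x_i).
    """
    x = sample_inst(n, rng)
    log_w = log_unnorm(x) - log_p_inst(x)
    return np.mean(np.exp(log_w))


rng = np.random.default_rng(0)
sigma = 2.0  # std of the instrumental Gaussian (illustrative choice)


def log_unnorm(x):
    # Toy unnormalized density exp(-x^2 / 2); its true Z is sqrt(2 * pi)
    return -0.5 * x**2


def sample_inst(n, rng):
    return rng.normal(0.0, sigma, size=n)


def log_p_inst(x):
    return -0.5 * (x / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))


Z_hat = importance_sampling_Z(log_unnorm, sample_inst, log_p_inst, 200_000, rng)
Z_true = np.sqrt(2.0 * np.pi)
```

In this toy case the weights have finite variance only when the instrumental variance exceeds 1/2. A narrower instrumental distribution gives an estimator with infinite variance, which illustrates why the choice of $p_{\mathrm{inst}}$ matters.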
Outline
1. Introductions
2. Problem Formulation
3. Related Works
4. Proposed Approach
   1. Modelling Changes Directly
   2. Density Ratio Estimation
   3. Sparsity Inducing Norm
   4. The Dual Formulation
5. Experiments
6. Conclusion
Modeling Changes Directly
Recall, our interest is the change in parameters between the two MNs.
The ratio of two MNs naturally incorporates this parameter difference!
So, model the ratio directly!
Modeling Changes Directly
We model density ratio instead of density function:
The normalization term is:
To ensure:
Sample average approximation. Also works when the integral has no closed form!
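The equations on this slide are missing from the transcript. A reconstruction consistent with the surrounding text is sketched here, under the assumption that $\boldsymbol{\beta}$ denotes the ratio model's parameter (playing the role of $\boldsymbol{\theta}^P - \boldsymbol{\theta}^Q$) and that $\boldsymbol{x}^Q_j$, $j = 1, \ldots, n_Q$, are the samples from $Q$:

```latex
r(\boldsymbol{x}; \boldsymbol{\beta})
  = \frac{1}{N(\boldsymbol{\beta})}
    \exp\!\left(\sum_{u \ge v} \boldsymbol{\beta}_{u,v}^\top
      \boldsymbol{f}\big(x^{(u)}, x^{(v)}\big)\right),
\qquad
N(\boldsymbol{\beta})
  = \int q(\boldsymbol{x})
    \exp\!\left(\sum_{u \ge v} \boldsymbol{\beta}_{u,v}^\top
      \boldsymbol{f}\big(x^{(u)}, x^{(v)}\big)\right) \mathrm{d}\boldsymbol{x}
  \approx \frac{1}{n_Q} \sum_{j=1}^{n_Q}
    \exp\!\left(\sum_{u \ge v} \boldsymbol{\beta}_{u,v}^\top
      \boldsymbol{f}\big(x_j^{Q(u)}, x_j^{Q(v)}\big)\right)
```

This choice of $N(\boldsymbol{\beta})$ ensures $\int q(\boldsymbol{x})\, r(\boldsymbol{x}; \boldsymbol{\beta})\, \mathrm{d}\boldsymbol{x} = 1$, and the sample average needs only samples from $Q$, not a closed-form integral.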
Estimating Density Ratio
Kullback-Leibler Importance Estimation Procedure (KLIEP):
Unconstrained convex optimization!
Sugiyama et al., NIPS 2007
Tsuboi et al., JIP 2009
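A minimal sketch of KLIEP for a log-linear ratio model, without regularization. It minimizes the negative mean log-ratio on the $P$ samples plus the log of the sample-average normalizer on the $Q$ samples. The plain gradient descent, the step size, the toy data and the name `kliep_fit` are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def kliep_fit(F_p, F_q, lr=0.5, n_iter=500):
    """Fit beta in r(x) = exp(beta^T f(x)) / N(beta) by gradient descent.

    F_p, F_q: feature matrices f(x) for samples from P and Q, shape (n, b).
    Objective: -mean_i beta^T f(x_i^P) + log mean_j exp(beta^T f(x_j^Q)),
    which is convex in beta.
    """
    beta = np.zeros(F_p.shape[1])
    mean_fp = F_p.mean(axis=0)
    for _ in range(n_iter):
        s = F_q @ beta
        w = np.exp(s - s.max())  # subtract the max for numerical stability
        w /= w.sum()             # softmax weights over the Q samples
        grad = -mean_fp + w @ F_q
        beta -= lr * grad
    return beta


# Toy check: P = N(0.5, 1), Q = N(0, 1), so log p/q = 0.5 * x - 0.125
rng = np.random.default_rng(0)
x_p = rng.normal(0.5, 1.0, size=5000)
x_q = rng.normal(0.0, 1.0, size=5000)
beta_hat = kliep_fit(x_p[:, None], x_q[:, None])  # feature f(x) = x
```

In this toy setting the true coefficient of $x$ in the log-ratio is 0.5, so `beta_hat[0]` should land near that value. Neither density is estimated separately at any point.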
Sparsity Inducing Norm
Imposing sparsity constraints on each factor $\boldsymbol{\beta}_{u,v}$ is equivalent to imposing sparsity on the changes.
So finally, we can obtain a $\boldsymbol{\beta}$ with group sparsity!
Regularizers: L2 regularizer and group lasso regularizer (Elastic Net)
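The slides do not say which optimizer handles the group lasso term. One common option, sketched here purely as an assumption, is a proximal step that shrinks each factor $\boldsymbol{\beta}_{u,v}$ as a block and sets whole blocks to zero. The function name and the dictionary layout are illustrative.

```python
import numpy as np


def group_soft_threshold(beta_groups, tau):
    """Proximal operator of tau * sum over groups of ||beta_g||_2.

    beta_groups: dict mapping an edge (u, v) to its parameter block.
    Blocks whose norm is at most tau are set exactly to zero.
    """
    out = {}
    for edge, block in beta_groups.items():
        norm = np.linalg.norm(block)
        scale = max(0.0, 1.0 - tau / norm) if norm > 0 else 0.0
        out[edge] = scale * block
    return out


groups = {
    (1, 2): np.array([3.0, 4.0]),   # strong change, norm 5
    (1, 3): np.array([0.1, -0.2]),  # weak change, norm below tau
}
shrunk = group_soft_threshold(groups, tau=1.0)
```

Here the block for edge (1, 2) is scaled by 0.8, while the block for edge (1, 3) is zeroed as a whole. That whole-block zeroing is the group sparsity behavior: an edge is either reported as changed or not.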
The Dual Formulation
[Figure: computational time (axis from 0 to 800) versus dimension (40 to 80), comparing the primal and dual formulations]
When dimensionality is high, the dual formulation is preferred.
Optimize $\boldsymbol{\alpha}$ on the probability simplex.
Details are in the longer version.
Outline
1. Introductions
2. Problem Formulation
3. Related Works
4. Proposed Approach
5. Experiments
   1. Numerical Experiments
   2. Real-world Application
6. Conclusion
Gaussian Distribution (𝑛 = 100, 𝑑 = 40)
Start from a 40-dimensional GMN with random correlations.
Randomly drop 15 edges. Precision and recall curves are averaged over 20 runs.
[Figures: three regularization path panels (axis label $\boldsymbol{\beta}_{u,v}$) and a P-R curve]
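As a sketch of how a point on such a P-R curve can be computed, the code below thresholds per-edge scores and compares the detected edges with the truly changed ones. The score dictionary, the threshold values and the function name are hypothetical. The slides do not describe the exact evaluation code.

```python
def precision_recall(scores, true_changed, threshold):
    """Precision and recall of detected changed edges at one threshold.

    scores: dict mapping an edge (u, v) to the magnitude of beta_{u,v}.
    true_changed: set of edges whose interaction really changed.
    """
    detected = {edge for edge, s in scores.items() if s > threshold}
    true_pos = len(detected & true_changed)
    precision = true_pos / len(detected) if detected else 1.0
    recall = true_pos / len(true_changed) if true_changed else 1.0
    return precision, recall


# Toy example with four candidate edges, two of which truly changed
scores = {(1, 2): 0.9, (1, 3): 0.05, (2, 3): 0.4, (2, 4): 0.0}
true_changed = {(1, 2), (2, 4)}
curve = [precision_recall(scores, true_changed, t) for t in (0.5, 0.1, -1.0)]
```

Sweeping the threshold (or the regularization parameter) traces out the curve. Averaging over repeated runs then means averaging such curves across independently generated datasets.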
Gaussian Distribution (𝑛 = 50, 𝑑 = 40)
Start from a 40-dimensional GMN with random correlations.
Randomly drop 15 edges. Precision and recall curves are averaged over 20 runs.
[Figures: three regularization path panels (axis label $\boldsymbol{\beta}_{u,v}$) and a P-R curve]
Diamond Distribution (n = 5000, d = 9)
Diamond Distribution:
Samples are drawn by slice sampling.
Only the methods with the correct model have good performance.
[Figures: regularization path (axis label $\boldsymbol{\beta}_{u,v}$) and P-R curve]
Real-world Applications
• Gene Network
  • Detecting changes from the original network to the modified network.
• Twitter Messages
  • Samples are the frequencies of 10 related keywords over time.
  • Detecting the change in keyword co-occurrences before and after a certain event.
source: Wikipedia
Gene Network
[Figure: gene regulatory networks P and Q]
The gene regulatory network is modified manually. 50 samples are collected before ($P$) and after ($Q$) the change.
A polynomial kernel is used for $\boldsymbol{f}$. $\lambda_1$ is chosen by hold-out cross-validation. Only KLIEP, Flasso and IS-Flasso are compared.
Gene Network (n = 50, d = 13)
[Figures: regularization paths for KLIEP, Flasso and IS-Flasso (axis labels $\boldsymbol{\beta}_{u,v}$ and $|\boldsymbol{\beta}_{u,v}|$), and a P-R curve]
Twitter Keywords
We choose the Deepwater Horizon oil spill as the target event.
[Figure: timeline (axis: Time) with the label "3 weeks~4.17" …]
Twitter Keywords From 7.26-9.14
[Figure panels: KLIEP, Flasso]
Outline
1. Introductions
2. Problem Formulation
3. Related Works
4. Proposed Approach
5. Experiments
6. Conclusion
Conclusion
Learning sparse changes between two Markov Networks, directly!
  By density ratio estimation.
Two advantages compared to conventional methods:
  Higher accuracy: thanks to the direct modelling nature.
  Wider applicability: P and Q are not limited to discrete, Gaussian, or NPN models.