Post on 20-Dec-2015
Inferring a Graph from Path Frequency
Tatsuya Akutsu1,2 & Daiji Fukagawa2
1 Institute for Chemical Research, Kyoto Univ., Japan
2 Graduate School of Informatics, Kyoto Univ., Japan
2
Outline
Introduction
  String inference from spectrum feature (SISF)
  Graph inference from path frequency (GIPF)
  Optimization versions (SISF-M, GIPF-M)
Algorithms for special cases
Complexity results
  Strong NP-completeness of GIPF in the general case (reduction from 3-PARTITION)
Conclusion
3
Motivation
Kernel methods (e.g. Support Vector Machines) have been applied to various problems.
In kernel methods: Data (sequences, chemical compounds, …) → Feature vector
This work: we consider the reverse direction (the inverse function), i.e., Feature vector → Data
This may be useful for designing new sequences/chemical compounds.
4
Motivation (continued)
[Figure: known compounds A and B, and a new compound at their midpoint (A+B)/2]
Potential application: drug design
For example, design a new compound which lies midway between known compounds A and B
Related work
Kernel PCA + regression [Bakir, Weston, Schölkopf 2004]
Graph pre-image [Bakir, Zien, Tsuda 2004]
But no complexity studies
5
Graph inference problem
Graph inference from path frequency: given a path frequency vector v, infer the original graph whose feature vector (= path frequency) is equal to v (or closest to v).
(length) (path: occurrence)
0 --- C x 9, O x 2
1 --- CC x 18, CO x 2, OC x 2
2 --- CCC x 24, CCO x 2, OCC x 2, OCO x 2
3 --- CCCC x 26, CCCO x 4, OCCC x 4
 : : :
[Figure: example molecular graph with 9 vertices labeled C and 2 vertices labeled O]
6
Spectrum Feature for Strings [Leslie et al. 02]
For a string S, the spectrum feature of level k is the frequency vector of all possible k-grams.
e.g. spectrum feature of level 2 for ‘aababb’:
aa x 1, ab x 2, ba x 1, bb x 1
f2(‘aababb’)=(1,2,1,1)
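The definition above can be sketched in a few lines of Python. This is only an illustration: the function name `spectrum_feature` and the convention of ordering k-grams lexicographically are assumptions, chosen so that the output matches the (aa, ab, ba, bb) ordering used on the slide.

```python
from collections import Counter
from itertools import product

def spectrum_feature(s, k, alphabet="ab"):
    """Level-k spectrum feature of s: counts of every k-gram over the alphabet.

    The k-grams are ordered lexicographically (an assumption made here so that
    K=2 over {a, b} gives the order aa, ab, ba, bb).
    """
    counts = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    return tuple(counts["".join(t)] for t in product(sorted(alphabet), repeat=k))

print(spectrum_feature("aababb", 2))  # (1, 2, 1, 1)
```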
7
Spectrum Feature for Strings [Leslie et al. 02]
$f_K(s) = (occ(t, s))_{t \in \Sigma^K}$
fK: mapping from input space to feature space,
where occ(t,s) is the number of occurrences of a substring t in a string s.
K: level (>0)
Σ: alphabet
[Figure: the mapping f2 (φ) from the input space (strings aabb, ababa, aaaaa, bbb, aabbb, abaa, abbb, babab, abba) to the feature space (vectors (1,1,0,2), (0,2,2,0), (1,1,0,1), (0,1,0,2), (4,0,0,0), (1,1,1,0), (1,0,0,2), (0,0,0,2), (0,1,1,1))]
8
Problem 1
SISF: String Inference from Spectrum Feature
Input: an integer K, feature vector v = (v_t)_{t∈Σ^K}
Output: a string s which satisfies fK(s) = v if one exists; otherwise "no solution."
Ex) Σ={a,b}, K=2, v=(vaa,vab,vba,vbb)=(1,1,0,2)
f2(‘aaaaa’)=(4,0,0,0)
f2(‘aaaab’)=(3,1,0,0)
 : :
f2(‘aabbb’)=(1,1,0,2)
 : :
f2(‘bbbbb’)=(0,0,0,4)
|s|=(vaa+vab+vba+vbb)+(K-1)=(1+1+0+2)+1=5
solution: s=‘aabbb’
Solutions may not be unique:
v=(1,1,1,1) → 4 solutions
v=(1,2,2,1) → 12 solutions
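For tiny instances, the solution counts above can be checked by exhaustive search. The sketch below is not the algorithm of the talk; it simply enumerates every string of the forced length |s| = Σ v_t + (K-1), so it takes exponential time. The name `sisf_brute_force` and the lexicographic k-gram ordering are assumptions for illustration.

```python
from collections import Counter
from itertools import product

def sisf_brute_force(v, k, alphabet="ab"):
    """Return all strings s whose level-k spectrum feature equals v.

    Exhaustive search over all strings of length sum(v) + k - 1,
    intended only for very small examples.
    """
    grams = ["".join(t) for t in product(sorted(alphabet), repeat=k)]
    n = sum(v) + k - 1                      # length forced by the feature vector
    target = tuple(v)
    solutions = []
    for letters in product(sorted(alphabet), repeat=n):
        s = "".join(letters)
        counts = Counter(s[i:i + k] for i in range(n - k + 1))
        if tuple(counts[g] for g in grams) == target:
            solutions.append(s)
    return solutions

print(sisf_brute_force((1, 1, 0, 2), 2))       # ['aabbb']
print(len(sisf_brute_force((1, 1, 1, 1), 2)))  # 4
```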
9
Linear time algorithm for SISF
Reduction to Eulerian graph problem [Pevzner]
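fK(s)=v for some s ⇔ Gv has an Eulerian path
(Gv: the directed multigraph whose vertices are the (K-1)-grams and which has v_t parallel edges for each K-gram t, going from the (K-1)-prefix of t to the (K-1)-suffix of t)
Ex.1) K=2, v=(vaa,vab,vba,vbb)=(1,1,0,2)
solution: aabbb
Ex.2) K=3, v=(vaaa,vaab,vaba,vabb,vbaa,vbab,vbba,vbbb)=(1,1,0,1,1,1,2,1)
solutions include bbaaabbbab and bbbaaabbab
[Figure: the graphs Gv for Ex.1 (vertices a, b) and Ex.2 (vertices aa, ab, ba, bb), with edges numbered in the order of an Eulerian path]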
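The Eulerian-path reduction can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes K ≥ 2, builds Gv with (K-1)-grams as vertices and one edge per K-gram occurrence, checks the degree conditions, and then runs Hierholzer's algorithm. The function name `sisf_eulerian` is hypothetical.

```python
from collections import defaultdict
from itertools import product

def sisf_eulerian(v, k, alphabet="ab"):
    """Return one string s with level-k spectrum feature v, or None (assumes k >= 2)."""
    grams = ["".join(t) for t in product(sorted(alphabet), repeat=k)]
    out_edges = defaultdict(list)   # (k-1)-gram -> successor (k-1)-grams, with multiplicity
    balance = defaultdict(int)      # out-degree minus in-degree
    n_edges = 0
    for gram, mult in zip(grams, v):
        u, w = gram[:-1], gram[1:]
        out_edges[u].extend([w] * mult)
        balance[u] += mult
        balance[w] -= mult
        n_edges += mult
    if n_edges == 0:
        return None
    # An Eulerian path needs all balances in {-1, 0, 1}; since the balances sum
    # to zero, there is then at most one start (+1) and one end (-1).
    if any(abs(b) > 1 for b in balance.values()):
        return None
    starts = [u for u, b in balance.items() if b == 1]
    if len(starts) > 1:
        return None
    start = starts[0] if starts else next(u for u in out_edges if out_edges[u])
    # Hierholzer's algorithm: follow unused edges, backtrack when stuck.
    stack, walk = [start], []
    while stack:
        u = stack[-1]
        if out_edges[u]:
            stack.append(out_edges[u].pop())
        else:
            walk.append(stack.pop())
    walk.reverse()
    if len(walk) != n_edges + 1:    # some edge was unreachable: Gv is not connected
        return None
    return walk[0] + "".join(node[-1] for node in walk[1:])

print(sisf_eulerian((1, 1, 0, 2), 2))  # aabbb
```

Each edge is pushed and popped once, so the running time is linear in the total number of K-gram occurrences plus the size of the feature vector.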
10
Problem 2
SISF-M: SISF with the Minimum Error
Input: an integer K, feature vector v = (v_t)_{t∈Σ^K}
Output: a string s which minimizes the distance between fK(s) and v
Ex) K=2, v=(vaa,vab,vba,vbb)=(1,2,0,1)
No string has feature vector exactly (1,2,0,1). Its neighbors at distance 1 are (1,2,1,1), (1,2,0,2), (1,2,0,0), (1,3,0,1), (2,2,0,1), (1,1,0,1), (0,2,0,1).
Solutions: aabb with f2 = (1,1,0,1); aababb, aabbab, abaabb, abbaab with f2 = (1,2,1,1)
11
Algorithm for SISF-M
It seems difficult to apply Eulerian path technique.
Thus, we employ another approach based on Dynamic programming.
The algorithm is a special case of the graph inference algorithm.
12
Feature vector for graphs: Path Frequency (cf. Marginalized Graph Kernel)
Feature vector $f_K : G \to \mathbb{N}^{\Sigma^{\le K}}$, $f_K(G) = (occ(t, G))_{t \in \Sigma^{\le K}}$
where occ(t,G) is the number of occurrences of paths labeled with t in a graph G
K: level (>0)
Σ: alphabet
e.g.) G: a graph with one vertex labeled a and three vertices labeled b, K=1
k=0: occ(a,G)=1, occ(b,G)=3
k=1: occ(aa,G)=0, occ(ab,G)=3, occ(ba,G)=3, occ(bb,G)=0
f1(G) = (1,3,0,3,3,0)
k=2: bab x 6
 :
[Figure: the graph G and its labeled paths of length k=0, 1, 2]
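A small sketch of the path-frequency feature, reproducing the example above. Two assumptions are made here that the slide does not spell out: the example graph is taken to be a star with center a and three leaves b (the structure consistent with the listed counts), and paths are simple paths counted once per direction. The name `path_frequency` is illustrative.

```python
from collections import Counter

def path_frequency(labels, edges, K):
    """Count the label strings of all simple paths with at most K edges.

    labels: dict mapping vertex -> label
    edges:  iterable of undirected edges (u, v)
    A path and its reverse are counted separately, which is why ab and ba
    both get count 3 in the slide's example.
    """
    adj = {u: [] for u in labels}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    counts = Counter()

    def extend(path):
        counts["".join(labels[x] for x in path)] += 1
        if len(path) - 1 < K:                 # the path has len(path) - 1 edges
            for nxt in adj[path[-1]]:
                if nxt not in path:           # simple paths only (no tottering)
                    extend(path + [nxt])

    for start in labels:
        extend([start])
    return counts

# Assumed example graph: center a (vertex 0) with three leaves b.
labels = {0: "a", 1: "b", 2: "b", 3: "b"}
edges = [(0, 1), (0, 2), (0, 3)]
f1 = path_frequency(labels, edges, 1)
print(tuple(f1[t] for t in ["a", "b", "aa", "ab", "ba", "bb"]))  # (1, 3, 0, 3, 3, 0)
```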
13
Problem 3 & 4
GIPF: Graph Inference from Path Frequency
Input: an integer K, feature vector v = (v_t)_{t∈Σ^{≤K}}
Output: a graph G which satisfies fK(G) = v if one exists; otherwise "no solution."
GIPF-M: GIPF with Minimum Error
Input: an integer K, feature vector v = (v_t)_{t∈Σ^{≤K}}
Output: a graph G which minimizes the distance between fK(G) and v
14
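Dynamic Programming for restricted GIPF (1): Trees, K=1, fixed Σ
Any tree can be constructed by inserting leaves one by one.
D(v)=1 iff there exists a tree T s.t. fK(T)=v
For Σ={a,b}:
$D(n_a, n_b, n_{aa}, n_{ab}, n_{ba}, n_{bb}) = 1$ iff
$D(n_a - 1, n_b, n_{aa} - 2, n_{ab}, n_{ba}, n_{bb}) = 1$ or
$D(n_a - 1, n_b, n_{aa}, n_{ab} - 1, n_{ba} - 1, n_{bb}) = 1$ or
$D(n_a, n_b - 1, n_{aa}, n_{ab} - 1, n_{ba} - 1, n_{bb}) = 1$ or
$D(n_a, n_b - 1, n_{aa}, n_{ab}, n_{ba}, n_{bb} - 2) = 1$
[Figure: a tree over the labels a and b, grown by attaching one new leaf at a time]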
15
Dynamic Programming for restricted GIPF (1): Trees, K=1, fixed Σ (cont’d.)
Any tree can be constructed by inserting leaves one by one.
D(v)=1 iff there exists a tree T s.t. fK(T)=v
《 Theorem 》 GIPF for trees can be solved in time polynomial in n (the size of the tree) if K=1 and the alphabet is fixed.
GIPF-M can also be solved by searching this table.
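The leaf-insertion DP for K=1 and Σ={a,b} can be sketched as a memoized recursion. This is an illustrative version, not the authors' code: the base cases (single-vertex trees) and the guards on the vertex counts are assumptions added so that the recursion terminates and a leaf is only attached to a vertex that exists.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def D(na, nb, naa, nab, nba, nbb):
    """True iff some tree has na/nb vertices labeled a/b and the given edge-label counts."""
    if min(na, nb, naa, nab, nba, nbb) < 0 or na + nb == 0:
        return False
    if na + nb == 1:                       # base case (assumed): a single vertex, no edges
        return naa == nab == nba == nbb == 0
    return bool(
        # remove an a-leaf that was attached to an a-vertex
        (na >= 2 and D(na - 1, nb, naa - 2, nab, nba, nbb))
        # remove an a-leaf that was attached to a b-vertex
        or (na >= 1 and nb >= 1 and D(na - 1, nb, naa, nab - 1, nba - 1, nbb))
        # remove a b-leaf that was attached to an a-vertex
        or (na >= 1 and nb >= 1 and D(na, nb - 1, naa, nab - 1, nba - 1, nbb))
        # remove a b-leaf that was attached to a b-vertex
        or (nb >= 2 and D(na, nb - 1, naa, nab, nba, nbb - 2))
    )

print(D(1, 3, 0, 3, 3, 0))  # True: a center a with three leaves b
print(D(1, 3, 0, 2, 2, 0))  # False: four vertices but only two edges
```

Each argument is at most O(n) for a tree on n vertices, so the memo table has polynomially many entries; the recursion depth equals n, which is fine for small examples.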
16
Dynamic Programming for restricted GIPF (2): Trees, fixed K, Δ, Σ
Extension of the DP for K=1 (not straightforward)
More complicated data structure than for K=1.
When a new leaf is added, many more new paths appear: $O(\Delta^K)$ new paths
[Figure: a new leaf attached to a tree with vertices labeled a and b]
17
Dynamic Programming for restricted GIPF (2): Trees, fixed K, Δ, Σ (cont’d)
Extended DP table: D(v,e,d)
v: feature vector, e: paths around leaves, d: depth
Since K, Δ, Σ are fixed, so is the size of e.
[Figure: a tree with depth 0, depth d-K, and depth d marked, and a new leaf being added]
《 Theorem 》 GIPF (and GIPF-M) for trees can be solved in time polynomial in n if K, Δ, Σ are all fixed.
18
Strong NP-completeness of GIPF
GIPF is strongly NP-complete even if the underlying graph is a tree and K=3
(We improved the result from that in the proceedings, where this result was shown for non-tree graphs)
Reduction from the 3-PARTITION problem
Reduction from: 3-PARTITION, P=(X,w,B)
X={x1,…,x3m}, w(xi)=wi, Σ_i wi = mB
Reduction into: GIPF, Q=(v,K)
Σ={a,b,c,d,y}∪{x1,…,x3m}∪{A1,…,Am}
K=3
v(s)=O(poly(m+B)) for every s ∈ Σ^{≤K}
|{s ∈ Σ^{≤K} | v(s) is non-zero}| = O(poly(m+B))
19
Strong NP-completeness of GIPF
《 Theorem 》 GIPF is strongly NP-complete, even if K=3 and the underlying graph is a tree.
(Proof) The reduction from 3-PARTITION can be done in poly(m+B) time, and thus the size of the resulting instance is bounded by poly(m+B).
GIPF Q is ‘yes’ ⇔ 3-PARTITION P is ‘yes’
20
3-PARTITION problem [Garey&Johnson]
3-PARTITION: a strongly NP-complete problem
Input: a set X={x1,…,x3m} of 3m items and, for each xi, a weight w(xi), s.t. Σ_i w(xi) = mB
Output: ‘yes’ if there exists a partition A1,…,Am of X s.t. |Ah|=3 and Σ_{x∈Ah} w(x) = B for every h; ‘no’ otherwise
3-PARTITION does not have a pseudo-polynomial time algorithm unless P=NP (i.e., it cannot be solved in poly(m+B) time unless P=NP).
It remains strongly NP-complete even if B/4 < w(x) < B/2.
[Figure: the 3m items arranged into m sets of 3 items, each set with total weight B]
21
An Example of Reduction (1)
An instance of 3-PARTITION P:
m=2
X = {x1,x2,x3,x4,x5,x6} (|X|=3m=6)
w=(1,2,3,4,4,6), B=10
w(x1)=1, w(x2)=2, w(x3)=3, w(x4)=4, w(x5)=4, w(x6)=6
The solution for P:
A1={x1,x3,x6}  1+3+6=10
A2={x2,x4,x5}  2+4+4=10
[Figure: the items of X grouped into A1 and A2]
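The example instance is small enough to solve by exhaustive search. The sketch below is a plain backtracking search (exponential time in general, as expected for a strongly NP-complete problem); the function name `three_partition` and the 0-based index triples it returns are illustrative choices.

```python
from itertools import combinations

def three_partition(weights, B):
    """Return a list of index triples, each with total weight B, or None if none exists."""
    def solve(remaining):
        if not remaining:
            return []
        first, rest = remaining[0], remaining[1:]
        # The first remaining item must go into some triple; try every pair of partners.
        for j, k in combinations(rest, 2):
            if weights[first] + weights[j] + weights[k] == B:
                others = tuple(i for i in rest if i not in (j, k))
                sub = solve(others)
                if sub is not None:
                    return [(first, j, k)] + sub
        return None
    return solve(tuple(range(len(weights))))

# Indices are 0-based, so (0, 2, 5) means {x1, x3, x6} and (1, 3, 4) means {x2, x4, x5}.
print(three_partition((1, 2, 3, 4, 4, 6), 10))  # [(0, 2, 5), (1, 3, 4)]
```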
22
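An Example of Reduction (2)
[Figure: the tree built from the 3-PARTITION instance. Each item xi has a block with vertices xi, y, b, c and w(xi) vertices labeled a; the center d is joined to A1 and A2; a solution for 3-PARTITION (sets A1 and A2, each of total weight B) determines which blocks attach to which Ah]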
23
An Example of Reduction (3)
There are two kinds of vertices with unique labels: xi’s and Ah’s.
xi encodes the weight w(xi) of the i-th item.
Ah is a matchmaker of the xi’s, but does not know who matches whom, because xi and Ah are too far apart to lie on a common path of length at most k = 3.
[Figure: blocks for xi, xj, xk attached to Ah; the paths of length k = 3 starting at xi and at Ah do not reach each other]
24
An Example of Reduction (4)
3-PARTITION P: w = (1,2,3,4,4,6)
GIPF Q: (v,3)
Σ={a,b,y,c,d}∪{x1,…,x3m}∪{A1,…,Am}, |Σ|=4m+5
The feature vector specifies the structures of the blocks, but does not specify the connection between the blocks {xi} and {A1,A2}.
[Figure: the six blocks for x1,…,x6 with w(x1)=1, w(x2)=2, w(x3)=3, w(x4)=4, w(x5)=4, w(x6)=6 vertices labeled a, together with A1, A2, and the center d]
25
An Example of Reduction (5)
3-PARTITION P: w = (1,2,3,4,4,6)
GIPF Q: (v,3)
Σ={a,b,y,c,d}∪{x1,…,x3m}∪{A1,…,Am}, |Σ|=4m+5
- The connection satisfying the constraints given by the feature vector corresponds to a solution of 3-PARTITION.
- In this case, {x1,x3,x6} and {x2,x4,x5} correspond to a solution of 3-PARTITION.
[Figure: the six blocks for x1,…,x6 with w(x1)=1, w(x2)=2, w(x3)=3, w(x4)=4, w(x5)=4, w(x6)=6 vertices labeled a, connected to A1 and A2 around the center d]
26
An Example of Reduction (6)
For each xi, i=1,2,…,3m, generate a graph G(xi):
G(xi) encodes w(xi)
• A label xi is unique in the whole graph
• # of a’s = w(xi)
Paths of length ≦3 which determine G(xi):
0 --- {(xi:1), (a:w(xi)), (y:1), (b:1), (c:1)}
1 --- {(xiy:1), (yxi:1), (yb:1), (by:1), (bc:1), (cb:1), (ab:w(xi)), (ba:w(xi))}
2 --- {(xiyb:1), (yba:w(xi)), (ybc:1), (byxi:1), (cby:1), (cba:w(xi)), (aby:w(xi)), (abc:w(xi)), (aba:w(xi)(w(xi)-1))}
3 --- {(xiyba:w(xi)), (xiybc:1), (cbyxi:1), (abyxi:w(xi))}
[Figure: G(xi), the path xi - y - b - c with w(xi) vertices labeled a attached to b]
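The path counts listed for G(xi) can be checked mechanically. The sketch below assumes the gadget is the path xi - y - b - c with w(xi) leaves labeled a attached to b, which is the structure implied by the listed paths; the function name `gadget_path_counts` and the single-character label `x` standing in for xi are illustrative.

```python
from collections import Counter

def gadget_path_counts(w, K=3):
    """Count label strings of all simple paths with at most K edges in the assumed G(xi)."""
    labels = {0: "x", 1: "y", 2: "b", 3: "c"}      # "x" stands for the unique label xi
    edges = [(0, 1), (1, 2), (2, 3)]
    for j in range(w):                             # w leaves labeled a, all attached to b
        labels[4 + j] = "a"
        edges.append((2, 4 + j))
    adj = {u: [] for u in labels}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    counts = Counter()
    stack = [(u,) for u in labels]
    while stack:
        path = stack.pop()
        counts["".join(labels[u] for u in path)] += 1
        if len(path) <= K:                         # fewer than K edges so far
            for nxt in adj[path[-1]]:
                if nxt not in path:
                    stack.append(path + (nxt,))
    return counts

c = gadget_path_counts(3)
print(c["a"], c["ab"], c["aba"], c["xyba"], c["xybc"])  # 3 3 6 3 1
```

For w = 3 this gives aba = w(w-1) = 6 and xiyba = abyxi = w = 3, in agreement with the table above.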
27
An Example of Reduction (7)
For each xi, i=1,2,…,3m, generate a graph G(xi).
In total over G(x1), G(x2),…, G(x3m):
0 --- ∪i{(xi:1)} ∪ {(a:mB), (y:3m), (b:3m), (c:3m)}
1 --- ∪i{(xiy:1), (yxi:1)} ∪ {(ab:mB), (ba:mB), (yb:3m), (by:3m), (bc:3m), (cb:3m)}
2 --- ∪i{(xiyb:1), (byxi:1)} ∪ {(yba:mB), (ybc:3m), (cby:3m), (cba:mB), (aby:mB), (abc:mB), (aba: Σ_i w(xi)^2 - mB)}
3 --- ∪i{(xiyba:w(xi)), (xiybc:1), (cbyxi:1), (abyxi:w(xi))}
28
Note for Uniqueness of Graph G(xi)
It is necessary to prove that the set of paths uniquely determines G(xi).
The following cases do NOT occur:
- Two vertices xi and xj adjacent to the same y --- because a path ‘xiyxj’ is not given. Hence there is a 1-to-1 correspondence between xi and y.
- Two vertices y adjacent to the same b --- because a path ‘yby’ is not given.
(Even if tottering (backtracking) is allowed, this can be proved similarly by using the # of ‘yb’ paths)
Uniqueness for a quadruple (xi, y, b, c) can be proved in a similar way
29
An Example of Reduction (8)
Generation of center graph GC
[Figure: the center graph GC. The center d is adjacent to A1,…,Ah,…,Am; below each Ah hang branches c - b - y…, and the b vertices below each Ah carry B a’s in total]
Remaining paths in GC
0 --- ∪h{(Ah:1)} ∪ {(d:1)}
1 --- ∪h{(Ahc:3m), (Ahd:m), (cAh:3m), (dAh:m)}
2 --- ∪h{(Ahcb:3), (bcAh:3), (cAhc:6), (cAhd:3), (dAhc:3)} ∪ ∪h,k{(AhdAk:1)}
3 --- ∪h{(Ahcby:3), (ybcAh:3), (Ahcba:B), (abcAh:B), (dAhcb:3), (bcAhd:3), (cAhcb:6), (bcAhc:6)} ∪ ∪h,k{(AhdAkc:3), (cAkdAh:3)}
Note: the center graph is determined without knowing the partition (without information about the w(xi)’s)
30
Strong NP-completeness of GIPF
《 Theorem》 GIPF is strongly NP-complete, even if K=3 and the underlying graph is a tree.
31
Hardness results for other special cases
《 Theorem 》 GIPF is strongly NP-complete, even for trees of bounded degree 4 and of fixed Σ.
Bounded degree (Δ): branchings for a’s and for the center d → use a binary tree
Bounded alphabet (Σ): xi’s and Ah’s → encode with a fixed alphabet
In both cases, we cannot bound K by a constant
Note: if all of Σ, K, Δ are fixed, then the problem can be solved in polynomial time
32
Conclusion
GIPF is strongly NP-complete even if the underlying graph is a tree and K=3.
GIPF (and GIPF-M) for trees is solvable in polynomial time by using DP, if Σ, K and Δ are all fixed.
Still ongoing:
Our DP is extendable to outer-planar graphs
Completeness results for more restricted cases
Future work:
Complexity of SISF-M in general cases
Approximation algorithms, etc.
34
Tractable special cases
Strings
SISF: linear time
SISF-M: polynomial time if K and Σ are fixed
Trees of fixed Σ
GIPF (and GIPF-M) for trees: polynomial time if K, Σ, Δ are fixed
35
36
DP based algorithm for trees with fixed K, Σ, Δ
37
Algorithm: the tree case (K=1)
Dynamic Programming
K=1, Σ={0,1}
D(n0,n1,n00,n01,n10,n11)=1 if there exists a tree T such that fK(T) = (n0,n1,n00,n01,n10,n11).
Otherwise D(…)=0.
D(n0,n1,n00,n01,n10,n11)=1 iff.
(n0>1 and D(n0-1, n1, n00-2, n01, n10, n11)=1) or
(n1>0 and D(n0-1, n1, n00, n01-1, n10-1, n11)=1) or
(n0>0 and D(n0, n1-1, n00, n01-1, n10-1, n11)=1) or
(n1>1 and D(n0, n1-1, n00, n01, n10, n11-2)=1) or
38
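Algorithm: the tree case (K=1)
Dynamic Programming
K=1, Σ={0,1}
D(n0,n1,n00,n01,n10,n11)=1 if there exists a tree T such that fK(T) = (n0,n1,n00,n01,n10,n11).
Otherwise D(…)=0.
Since T is a tree, each n* takes at most O(n) values, where $n = \sum_{|t|=1} v_t$
Computing the whole table takes O(n^6) time
The cost can be reduced further by using identities such as n0+n1=(n00+n01+n10+n11)/2+1
The same holds for any constant-size Σ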
39
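Algorithm: the tree case (K: constant, bounded degree)
Any tree can be constructed using only the operation "add a vertex as a leaf" → apply this to DP
The subtrees within distance K of the leaves are incorporated into the DP entries
Since K is a constant and the degree is bounded, the number of distinct subtree types is bounded by a constant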
40
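On tottering
Tottering paths: whether backtracking is allowed when considering paths
For the Marginalized Graph Kernel, whether or not tottering is allowed has little effect
In this work, we basically do not consider tottering
The results are probably still valid even if tottering is allowed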
41
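Supplement on strong NP-completeness
[Figure: three graphs, numbered 1, 2, 3, over the labels b, c, c’]
Comparison of 1 and 2:
• If tottering is not allowed, they can be distinguished by whether a path "b-c-b" exists
• If tottering is allowed, they can be distinguished by the number of "b-c-b" paths
Graph  b-c  c-c’  b-c-c’  b-c-b  c-b-c  c-c’-c  c’-c-c’
1      2    2     2       0(2)   0(2)   0(2)    0(2)
2      2    2     2       2(4)   0(2)   0(2)    0(2)
3      2    2     2       2(4)   0(2)   0(2)    2(4)
※ Values in parentheses are for the case where tottering is allowed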
42
Strings: the case with errors
[Figure: the vectors 00001, 20000, 00000, 10000, 01000, 00100, 00010]
43
Weakly NP-hard
• Planar graph (series-parallel graph)
• K=2
• Σ: unbounded
• Δ: unbounded
Reduction from PARTITION (X,w)
[Figure: the reduction graph, with vertices x1, x2, x3, x5, x7, x9, z1, z2 and groups of a’s of size w(xi), e.g. w(x1)]
44
Strongly NP-hard
• Planar graph (series-parallel graph)
• K=2
• Σ: unbounded
• Δ: unbounded
Reduction from 3-PARTITION (X,w) where B/4 < w(xi) < B/2
[Figure: the reduction graph, with vertices x1, x2, x3, x5, x7, x9, vertices labeled y, z1,…,z5, and groups of a’s]
45
Introduction: graph inference problem
Graph inference from path frequency: given a path frequency vector v, infer the original graph whose feature vector (= path frequency) is equal to v (or closest to v).
Results on complexity:
General graphs: strongly NP-complete
Trees of fixed Δ, Σ, K: in P (using DP)
46
The Complexity of GIPF
In general: strongly NP-complete
Strongly NP-complete even for planar graphs of bounded degree and for K=4.
Strongly NP-complete even for planar graphs of bounded degree and of fixed Σ.
Reduction from 3-PARTITION
GIPF for trees
Can be solved in time polynomial in n (# of vertices of the original tree), if the maximum degree, K and Σ are fixed
GIPF-M can also be solved in polynomial time
Both algorithms are based on DP
[Akutsu, Fukagawa, to appear in: CPM 2005]
47
(Reminder) Strong NP-completeness [Garey&Johnson]
Definition: A number problem P is strongly NP-complete if P is NP-complete even if all numbers in an instance of P are given in unary (i.e., even if all numbers in an instance are bounded by some polynomial in the size of the rest of the input).
The knapsack problem is an example of a problem that is NP-complete but not strongly NP-complete (unless P=NP), because it can be solved in O(nB) time, where n is the number of items and B is the size of the knapsack.
Dynamic programming:
D(i,b) = the optimal value for items {x1,..,xi} and a knapsack of size b.
D(i+1,b) = max{D(i,b), D(i,b-size(xi+1))+value(xi+1)}
where D(i,b)=-∞ for b<0
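The recurrence above translates directly into an O(nB) table computation. The sketch keeps only the current row D(i, ·); the function name `knapsack` and the example numbers are illustrative assumptions.

```python
def knapsack(sizes, values, B):
    """0/1 knapsack: best total value with total size at most B, in O(n * B) time."""
    NEG = float("-inf")                 # plays the role of D(i, b) = -infinity for b < 0
    D = [0] * (B + 1)                   # D(0, b) = 0: no items considered yet
    for size, value in zip(sizes, values):
        # D(i+1, b) = max{ D(i, b), D(i, b - size) + value }
        D = [max(D[b], D[b - size] + value if b >= size else NEG)
             for b in range(B + 1)]
    return D[B]

print(knapsack((3, 4, 5), (4, 5, 6), 7))  # 9
```

The running time is polynomial in the numeric value B rather than in its bit length, which is what makes the algorithm pseudo-polynomial.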
48
Inference of Chemical Structure from its Feature Vector
Theoretical results on time complexity
Sequences (spectrum kernel): in P
Trees of fixed K, Δ, Σ: in P
Graphs: strongly NP-hard for restricted cases
[Figure: known compounds A and B, and a new compound at their midpoint (A+B)/2]
Potential application: drug design
For example, design a new compound which is the middle of known compounds A and B
49
Algorithm for SISF-M (still unfinished)
Reduction to Eulerization problem
Add/remove edges to make Gv have an Eulerian path
Hamming distance between spectrum feature vectors = the number of edges added or removed
Ex. 3) K=2, v=(vaa,vab,vba,vbb)=(1,2,0,1)
Solutions: aabb with feature vector (1,1,0,1); aababb, aabbab, abaabb, abbaab with feature vector (1,2,1,1)
[Figure: the graph Gv on vertices a and b, before and after adding or removing an edge]
50
Notes
We cannot add an edge arbitrarily
An edge between u and v can exist only if the (K-2)-suffix of label(u) and the (K-2)-prefix of label(v) are equal.
Special case
SISF-M can be solved in polynomial time if we forbid removing edges
51
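Dynamic Programming for SISF-M
Pseudo-polynomial time algorithm
Ex) K=2, Σ={a,b}: for naa,nab,nba,nbb∈N and t∈Σ^{K-1},
$D(n_{aa}, n_{ab}, n_{ba}, n_{bb}, t) := 1$ if $\exists s.\ [\, f_K(s) = (n_{aa}, n_{ab}, n_{ba}, n_{bb}) \wedge t \text{ is a suffix of } s \,]$, and 0 otherwise
$D(n_{aa}, n_{ab}, n_{ba}, n_{bb}, a) = 1$ if $D(n_{aa} - 1, n_{ab}, n_{ba}, n_{bb}, a) = 1$ or $D(n_{aa}, n_{ab}, n_{ba} - 1, n_{bb}, b) = 1$
$D(n_{aa}, n_{ab}, n_{ba}, n_{bb}, b) = 1$ if $D(n_{aa}, n_{ab} - 1, n_{ba}, n_{bb}, a) = 1$ or $D(n_{aa}, n_{ab}, n_{ba}, n_{bb} - 1, b) = 1$
Base cases: $D(1,0,0,0,a) = 1$, $D(0,1,0,0,b) = 1$, $D(0,0,1,0,a) = 1$, $D(0,0,0,1,b) = 1$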
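A sketch of this table for K=2 and Σ={a,b}, followed by the search for the closest feasible vector. Several choices here are assumptions rather than part of the slides: the table is filled forward from the four one-edge base cases, the distance is the L1 distance, and the search is limited to vectors whose components are at most v_t + slack, where the hypothetical parameter `slack` (assumed ≥ 1) bounds the table size.

```python
def sisf_m_k2(v, slack=1):
    """Return (best distance, closest feasible feature vectors) for K=2 over {a, b}.

    v = (vaa, vab, vba, vbb). A state (counts, t) is feasible iff some string has
    2-gram counts `counts` and ends with the character t, i.e. D(counts, t) = 1.
    """
    grams = ("aa", "ab", "ba", "bb")
    bound = tuple(x + slack for x in v)          # assumed search box for the table
    feasible, frontier = set(), []
    for i, g in enumerate(grams):                # base cases: strings with a single 2-gram
        if bound[i] >= 1:
            state = (tuple(int(j == i) for j in range(4)), g[1])
            feasible.add(state)
            frontier.append(state)
    while frontier:
        counts, t = frontier.pop()
        for c in "ab":                           # append c: one more occurrence of t + c
            i = grams.index(t + c)
            if counts[i] < bound[i]:
                nxt = (counts[:i] + (counts[i] + 1,) + counts[i + 1:], c)
                if nxt not in feasible:
                    feasible.add(nxt)
                    frontier.append(nxt)

    def dist(counts):
        return sum(abs(x - y) for x, y in zip(counts, v))

    best = min(dist(counts) for counts, _ in feasible)
    return best, sorted({counts for counts, _ in feasible if dist(counts) == best})

print(sisf_m_k2((1, 2, 0, 1)))  # (1, [(1, 1, 0, 1), (1, 2, 1, 1)])
```

Restricting the table to the box is safe for feasibility because every prefix of a string has component-wise smaller counts than the string itself; whether the optimum always lies within slack 1 is not claimed here.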