inferring a graph from path frequency tatsuya akutsu 1,2 & daiji fukagawa 2 1 institute for...

Inferring a Graph from Path Frequency

Tatsuya Akutsu1,2 & Daiji Fukagawa2

1 Institute for Chemical Research, Kyoto Univ., Japan
2 Graduate School of Informatics, Kyoto Univ., Japan

2

Outline
• Introduction
  • String inference from spectrum feature (SISF)
  • Graph inference from path frequency (GIPF)
  • Optimization versions (SISF-M, GIPF-M)
• Algorithms for special cases
• Complexity results
  • Strong NP-completeness of GIPF in the general case (reduction from 3-PARTITION)
• Conclusion

3

Motivation
• Kernel methods (e.g., Support Vector Machines) have been applied to various problems.
• In kernel methods: Data (sequences, chemical compounds, …) → Feature vector
• This work: we consider the reverse direction (the inverse function), i.e., Feature vector → Data
• May be useful for designing new sequences/chemical compounds

4

Motivation (continued)

Potential application: drug design
• For example, design a new compound which lies midway between known compounds A and B.

[Figure: chemical structures of compounds A and B and of the target compound (A+B)/2]

Related work
• Kernel PCA + regression [Bakir, Weston, Schölkopf 2004]
• Graph pre-image [Bakir, Zien, Tsuda 2004]
• But no complexity studies

5

Graph inference problem

Graph inference from path frequency: given a path frequency vector v, infer the original graph whose feature vector (= path frequency) is equal to v (or closest to v).

(length) (path: occurrence)
0 --- C x 9, O x 2
1 --- CC x 18, CO x 2, OC x 2
2 --- CCC x 24, CCO x 2, OCC x 2, OCO x 2
3 --- CCCC x 26, CCCO x 4, OCCC x 4
:

[Figure: the example molecule, a graph with nine vertices labeled C and two labeled O]

6

Spectrum Feature for Strings [Leslie et al. 02]

For a string s, the spectrum feature of level k is the frequency vector of all possible k-grams.

e.g., spectrum feature of level 2 for ‘aababb’:

aa x 1, ab x 2, ba x 1, bb x 1

f2(‘aababb’)=(1,2,1,1)
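The definition above can be sketched in a few lines of Python. This is only an illustration: the function name `spectrum_feature` and the lexicographic ordering of the components (aa, ab, ba, bb for level 2) are assumptions chosen to match the example on the slide.

```python
from collections import Counter
from itertools import product


def spectrum_feature(s, k, alphabet="ab"):
    """Return the level-k spectrum feature of s as a tuple of k-gram counts.

    Components follow the lexicographic order of the k-grams over `alphabet`
    (an assumed convention: aa, ab, ba, bb for k = 2).
    """
    # Count every length-k window of s.
    counts = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    return tuple(counts["".join(g)] for g in product(alphabet, repeat=k))


print(spectrum_feature("aababb", 2))
```

For ‘aababb’ the windows are aa, ab, ba, ab, bb, which gives the vector shown on the slide.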

7

Spectrum Feature for Strings [Leslie et al. 02]

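$$ f_K(s) = \bigl(\mathrm{occ}(t,s)\bigr)_{t \in \Sigma^K} $$

where occ(t,s) is the # of occurrences of a substring t in a string s.

K: level (>0), Σ: alphabet

fK: mapping from input space to feature space

[Figure: the map f2 (φ) from the input space to the feature space. Strings shown in the input space: aabb, ababa, aaaaa, bbb, aabbb, abaa, abbb, babab, abba. Vectors shown in the feature space: (1,1,0,2), (0,2,2,0), (1,1,0,1), (0,1,0,2), (4,0,0,0), (1,1,1,0), (1,0,0,2), (0,0,0,2), (0,1,1,1).]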

8

Problem 1

SISF: String Inference from Spectrum Feature
  Input: an integer K, feature vector v = (v_t), t ∈ Σ^K
  Output: a string s satisfying fK(s) = v if one exists, otherwise “no solution.”

Ex) Σ={a,b}, K=2, v=(vaa,vab,vba,vbb)=(1,1,0,2)

|s|=(vaa+vab+vba+vbb)+(K-1)=(1+1+0+2)+1=5

f2(‘aaaaa’)=(4,0,0,0)
f2(‘aaaab’)=(3,1,0,0)
:
f2(‘aabbb’)=(1,1,0,2)
:
f2(‘bbbbb’)=(0,0,0,4)

solution: s=‘aabbb’

Solutions may not be unique:
  v=(1,1,1,1) → 4 solutions
  v=(1,2,2,1) → 12 solutions

9

Linear time algorithm for SISF

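Reduction to the Eulerian path problem [Pevzner]

fK(s)=v for some s ⇔ Gv has an Eulerian path

Ex.1) K=2, v=(vaa,vab,vba,vbb)=(1,1,0,2)
  Gv: vertices a, b; edges aa, ab, bb, bb
  solution: aabbb

Ex.2) K=3, v=(vaaa,vaab,vaba,vabb,vbaa,vbab,vbba,vbbb)=(1,1,0,1,1,1,2,1)
  Gv: vertices aa, ab, ba, bb
  solutions shown: bbaaabbbab, bbbaaabbab (other Eulerian paths, e.g. bbbabbaaab, also give this v)

[Figure: the graphs Gv for both examples, with edges numbered in the order of an Eulerian path]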
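The reduction above can be sketched as follows, assuming K ≥ 2 and the lexicographic component order used in the examples. The graph Gv gets one edge per k-gram occurrence, from its (K-1)-prefix to its (K-1)-suffix, and Hierholzer's algorithm (a standard Eulerian-path routine, swapped in here since the slide names no specific one) finds a path in linear time. The names `solve_sisf` and `spectrum` are hypothetical.

```python
from collections import Counter
from itertools import product


def spectrum(s, K, alphabet="ab"):
    """Level-K spectrum feature, components in lexicographic K-gram order."""
    c = Counter(s[i:i + K] for i in range(len(s) - K + 1))
    return tuple(c["".join(g)] for g in product(alphabet, repeat=K))


def solve_sisf(v, K, alphabet="ab"):
    """Return one string s with spectrum(s, K) == v, or None (assumes K >= 2)."""
    grams = ["".join(g) for g in product(alphabet, repeat=K)]
    out_edges, indeg, outdeg, total = {}, Counter(), Counter(), 0
    for g, c in zip(grams, v):
        if c:
            u, w = g[:-1], g[1:]          # edge: (K-1)-prefix -> (K-1)-suffix
            out_edges.setdefault(u, []).extend([w] * c)
            outdeg[u] += c
            indeg[w] += c
            total += c
    if total == 0:
        return None
    nodes = set(indeg) | set(outdeg)
    diff = {n: outdeg[n] - indeg[n] for n in nodes}
    starts = [n for n in nodes if diff[n] == 1]
    ends = [n for n in nodes if diff[n] == -1]
    if any(abs(d) > 1 for d in diff.values()) or len(starts) > 1 or len(starts) != len(ends):
        return None                        # degree condition fails
    start = starts[0] if starts else next(iter(outdeg))
    stack, path = [start], []
    while stack:                           # Hierholzer's algorithm
        u = stack[-1]
        if out_edges.get(u):
            stack.append(out_edges[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    if len(path) != total + 1:
        return None                        # edges not all reachable: no Eulerian path
    return path[0] + "".join(n[-1] for n in path[1:])


print(solve_sisf((1, 1, 0, 2), 2))
```

Because several Eulerian paths may exist, the function returns just one of them; `spectrum` can be used to confirm that the returned string reproduces v.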

10

Problem 2

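SISF-M: SISF with the Minimum Error
  Input: an integer K, feature vector v = (v_t), t ∈ Σ^K
  Output: a string s which minimizes the distance between fK(s) and v

Ex) K=2, v=(vaa,vab,vba,vbb)=(1,2,0,1)

No string has feature vector (1,2,0,1). The vectors at distance 1 from it are (1,2,1,1), (1,2,0,2), (1,2,0,0), (1,3,0,1), (2,2,0,1), (1,1,0,1), (0,2,0,1).

Solutions:
  aabb, with feature vector (1,1,0,1)
  aababb, aabbab, abaabb, abbaab, with feature vector (1,2,1,1)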
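For tiny instances the example above can be checked by exhaustive search. The sketch below is not the authors' algorithm; it simply enumerates strings and assumes the distance is the L1 distance between feature vectors. The names `spectrum` and `sisf_m_bruteforce` are hypothetical.

```python
from collections import Counter
from itertools import product


def spectrum(s, K, alphabet):
    """Level-K spectrum feature, components in lexicographic K-gram order."""
    c = Counter(s[i:i + K] for i in range(len(s) - K + 1))
    return tuple(c["".join(g)] for g in product(alphabet, repeat=K))


def sisf_m_bruteforce(v, K, alphabet="ab"):
    """Return (minimum L1 distance, sorted list of all minimizing strings)."""
    best, minimizers = None, []
    # A string with L K-grams is at L1 distance >= |L - sum(v)| from v, and a
    # single K-gram already achieves distance <= sum(v) + 1, so strings with
    # more than 2*sum(v) + 1 K-grams can never be optimal.
    max_len = 2 * sum(v) + 1 + (K - 1)
    for n in range(K, max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            d = sum(abs(x - y) for x, y in zip(spectrum(s, K, alphabet), v))
            if best is None or d < best:
                best, minimizers = d, [s]
            elif d == best:
                minimizers.append(s)
    return best, sorted(minimizers)


print(sisf_m_bruteforce((1, 2, 0, 1), 2))
```

The search is exponential in the string length, so it is useful only as a cross-check on small examples such as the one on this slide.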

11

Algorithm for SISF-M

It seems difficult to apply Eulerian path technique.

Thus, we employ another approach based on Dynamic programming.

The algorithm is a special case of the graph inference algorithm.

12

Feature vector for graphs: Path Frequency (cf. Marginalized Graph Kernel)

$$ f_K(G) = \bigl(\mathrm{occ}(t,G)\bigr)_{t \in \Sigma^{\le K}}, \qquad f_K : G \to \mathbb{N}^{\Sigma^{\le K}} $$

where occ(t,G) is the # of occurrences of paths labeled with t in a graph G.

K: level (>0), Σ: alphabet

e.g.) G: a graph in which one vertex labeled a is adjacent to three vertices labeled b; K=1

k=0: occ(a,G)=1, occ(b,G)=3
k=1: occ(aa,G)=0, occ(ab,G)=3, occ(ba,G)=3, occ(bb,G)=0

f1(G) = (1,3,0,3,3,0)

k=2: the path bab occurs 6 times
:
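The counts in this example can be reproduced with a short enumeration. The sketch assumes that a path may not revisit a vertex (no tottering), that each path is counted once per direction (as the slide does for ab and ba), and that labels are single characters; `path_frequency` is a hypothetical name.

```python
from collections import Counter


def path_frequency(labels, edges, K):
    """Count labeled paths with 0..K edges that never revisit a vertex.

    labels: dict vertex -> single-character label
    edges:  iterable of undirected (u, w) pairs
    """
    adj = {u: [] for u in labels}
    for u, w in edges:
        adj[u].append(w)
        adj[w].append(u)
    counts = Counter()

    def extend(path):
        counts["".join(labels[x] for x in path)] += 1
        if len(path) - 1 < K:              # path currently has len(path)-1 edges
            for w in adj[path[-1]]:
                if w not in path:
                    extend(path + [w])

    for u in labels:
        extend([u])
    return counts


# The example graph: one vertex a adjacent to three vertices b.
star = path_frequency({0: "a", 1: "b", 2: "b", 3: "b"}, [(0, 1), (0, 2), (0, 3)], 2)
print(star["a"], star["b"], star["ab"], star["ba"], star["bab"])
```

Label sequences that never occur (aa and bb here) are simply absent from the counter and read back as 0.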

13

Problem 3 & 4

GIPF: Graph Inference from Path Frequency
  Input: an integer K, feature vector v = (vt)t∈ΣK
  Output: a graph G satisfying fK(G) = v if one exists, otherwise “no solution.”

GIPF-M: GIPF with Minimum Error
  Input: an integer K, feature vector v = (vt)t∈ΣK
  Output: a graph G which minimizes the distance between fK(G) and v

14

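Dynamic Programming for restricted GIPF (1): Trees, K=1, fixed ∑

Any tree can be constructed by inserting leaves one by one.

D(v)=1 iff there exists a tree T s.t. fK(T)=v. For Σ={a,b}:

$$
D(n_a,n_b,n_{aa},n_{ab},n_{ba},n_{bb}) = 1 \iff
\begin{cases}
D(n_a-1,\ n_b,\ n_{aa}-2,\ n_{ab},\ n_{ba},\ n_{bb}) = 1, \text{ or}\\
D(n_a-1,\ n_b,\ n_{aa},\ n_{ab}-1,\ n_{ba}-1,\ n_{bb}) = 1, \text{ or}\\
D(n_a,\ n_b-1,\ n_{aa},\ n_{ab}-1,\ n_{ba}-1,\ n_{bb}) = 1, \text{ or}\\
D(n_a,\ n_b-1,\ n_{aa},\ n_{ab},\ n_{ba},\ n_{bb}-2) = 1
\end{cases}
$$

[Figure: small trees over the labels a and b, each obtained from the previous one by attaching one leaf]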

15

Dynamic Programming for restricted GIPF (1): Trees, K=1, fixed ∑ (cont’d.)

Any tree can be constructed by inserting leaves one by one.

D(v)=1 iff there exists a tree T s.t. fK(T)=v

[Figure: small trees over the labels a and b, grown by attaching one leaf at a time]

《Theorem》 GIPF for trees can be solved in time polynomial in n (the size of the tree) if K=1 and the alphabet is fixed.

GIPF-M can also be solved by searching this table.

16

Dynamic Programming for restricted GIPF(2) Trees, fixed K,Δ,∑

Extension of the DP for K=1 (not straightforward)

A more complicated data structure is needed than for K=1.

When a new leaf is added, many more new paths appear: O(Δ^K) new paths.

[Figure: a tree with vertices labeled a and b, with a new leaf attached]

17

Dynamic Programming for restricted GIPF(2) Trees, fixed K,Δ,∑ (cont’d)

Extended DP table: D(v,e,d)
  v: feature vector, e: paths around leaves, d: depth

K, Δ, ∑ are fixed, and so is the size of e.

[Figure: a tree with levels marked depth 0, depth d-K and depth d, with a new leaf added at the bottom]

《Theorem》 GIPF (and GIPF-M) for trees can be solved in time polynomial in n if K, Δ, ∑ are all fixed.

18

Strong NP-completeness of GIPF
• GIPF is strongly NP-complete even if the underlying graph is a tree and K=3.
  (We improved the result from that in the proceedings, where this result was shown for non-tree graphs.)
• Reduction from the 3-PARTITION problem

Reduction from: 3-PARTITION, P=(X,w,B)
  X={x1,…,x3m}, w(xi)=wi, Σwi=Bm

Reduction into: GIPF, Q=(v,K)
  Σ={a,b,c,d,y}∪{x1,…,x3m}∪{A1,…,Am}
  K=3
  v(s)=O(poly(m+B)) for every s ∈∑K
  |{s∈∑K | v(s) is non-zero}| = O(poly(m+B))

19

Strong NP-completeness of GIPF

《 Theorem 》 GIPF is strongly NP-complete, even if K=3 and the underlying graph is a tree.

(Proof)
  The reduction from 3-PARTITION can be done in poly(m+B) time, and thus its size is bounded by poly(m+B).
  GIPF Q is ‘yes’ ⇔ 3-PARTITION P is ‘yes’

20

3-PARTITION problem [Garey&Johnson]

3-PARTITION: a strongly NP-complete problem
  Input: a set X={x1,…,x3m} of 3m items and, for each xi, a weight w(xi), s.t. Σi w(xi)=mB
  Output: ‘yes’ if there exists a partition A1,…,Am of X s.t. |Ah|=3 and Σx∈Ah w(x)=B for every h; ‘no’ otherwise

3-PARTITION does not have a pseudo-polynomial time algorithm unless P=NP (i.e., it cannot be solved in poly(m+B) time unless P=NP).

Strongly NP-complete even if B/4 < w(x) < B/2

[Figure: 3m items partitioned into m sets of 3 items each, every set of total weight B]

21

An Example of Reduction (1)

An instance of 3-PARTITION P:
  m=2
  X = {x1,x2,x3,x4,x5,x6} (|X|=3m=6)
  w=(1,2,3,4,4,6), B=10
  w(x1)=1, w(x2)=2, w(x3)=3, w(x4)=4, w(x5)=4, w(x6)=6

The solution for P:
  A1={x1,x3,x6}: 1+3+6=10
  A2={x2,x4,x5}: 2+4+4=10

[Figure: the items x1,…,x6 of X, and their arrangement into the two sets A1 and A2]
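The instance above is small enough to solve by exhaustive search. The following sketch is a plain backtracking solver (exponential time in general, consistent with the hardness of the problem); the name `three_partition` and the 0-based item indices are choices made for this illustration.

```python
from itertools import combinations


def three_partition(weights, B):
    """Return a list of index triples, each summing to B, or None if none exists."""

    def solve(remaining):
        if not remaining:
            return []
        first, rest = remaining[0], remaining[1:]
        # The first unused item must be grouped with two of the other items.
        for j, k in combinations(rest, 2):
            if weights[first] + weights[j] + weights[k] == B:
                sub = solve([i for i in rest if i not in (j, k)])
                if sub is not None:
                    return [(first, j, k)] + sub
        return None

    if len(weights) % 3 or sum(weights) != B * (len(weights) // 3):
        return None
    return solve(list(range(len(weights))))


print(three_partition([1, 2, 3, 4, 4, 6], 10))
```

With 0-based indices, the triple (0, 2, 5) is {x1,x3,x6} and (1, 3, 4) is {x2,x4,x5}, the solution given on the slide.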

22

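An Example of Reduction (2)

[Figure: the tree built from the instance. Each item xi has a block with vertices xi, y, b, c and w(xi) vertices labeled a; the blocks are connected to A1 and A2, which are joined through the vertex d. Attaching x1, x3, x6 to A1 and x2, x4, x5 to A2 gives B a’s on each side, which is the solution for 3-PARTITION.]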

23


There are two kinds of vertices which have unique labels: the xi’s and the Ah’s.

xi encodes the weight w(xi) of the i-th item.

Ah is a matchmaker of the xi’s, but doesn’t know who matches whom, because xi and Ah are distant.

[Figure: blocks of xi, xj, xk (vertices y, b, c and a’s) attached to Ah, with paths of length k = 3 marked]

24

3-PARTITION P: w = (1,2,3,4,4,6)
GIPF Q: (v,3)

Σ={a,b,y,c,d}∪{x1,…,x3m}∪{A1,…,Am}, |Σ|=4m+5

The feature vector specifies the structures of the blocks, but does not specify the connection between the blocks {xi} and {A1,A2}.

[Figure: the six blocks for x1,…,x6, with w(x1)=1, w(x2)=2, w(x3)=3, w(x4)=4, w(x5)=4, w(x6)=6 vertices labeled a respectively, drawn apart from A1, A2 and d]

25

3-PARTITION P: w = (1,2,3,4,4,6)
GIPF Q: (v,3)

Σ={a,b,y,c,d}∪{x1,…,x3m}∪{A1,…,Am}, |Σ|=4m+5

• The connection satisfying the constraints given by the feature vector corresponds to a solution of 3-PARTITION.
• In this case, {x1,x3,x6} and {x2,x4,x5} correspond to a solution of 3-PARTITION.

[Figure: the six blocks for x1,…,x6 (w(x1)=1, w(x2)=2, w(x3)=3, w(x4)=4, w(x5)=4, w(x6)=6) connected to A1 and A2, which are joined through d]

26

For each xi (i=1,2,…,3m), generate a graph G(xi):
• G(xi) encodes w(xi)
• The label xi is unique in the whole graph
• # of a’s = w(xi)

Paths of length ≦3 which determine G(xi)
0 --- {(xi:1), (a:w(xi)), (y:1), (b:1), (c:1)}
1 --- {(xiy:1), (yxi:1), (yb:1), (by:1), (bc:1), (cb:1), (ab:w(xi)), (ba:w(xi))}
2 --- {(xiyb:1), (yba:w(xi)), (ybc:1), (byxi:1), (cby:1), (cba:w(xi)), (aby:w(xi)), (abc:w(xi)), (aba:w(xi)(w(xi)-1))}
3 --- {(xiyba:w(xi)), (xiybc:1), (cbyxi:1), (abyxi:w(xi))}

[Figure: G(xi), the chain xi, y, b, c with w(xi) vertices labeled a attached to b]
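The path table for G(xi) can be checked mechanically. The sketch below builds the gadget as read off from the table (chain xi, y, b, c, with w(xi) leaves labeled a attached to b), writes the label xi as the single character x, and counts paths that never revisit a vertex. The function name `gadget_path_counts` is hypothetical.

```python
from collections import Counter


def gadget_path_counts(w, K=3):
    """Path-label counts (paths with 0..K edges, no vertex revisited) in G(x_i)."""
    labels = {0: "x", 1: "y", 2: "b", 3: "c"}   # 'x' stands for the unique label x_i
    edges = [(0, 1), (1, 2), (2, 3)]
    for j in range(w):                           # w leaves labeled 'a', all attached to b
        labels[4 + j] = "a"
        edges.append((2, 4 + j))
    adj = {u: [] for u in labels}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    counts = Counter()

    def extend(path):
        counts["".join(labels[u] for u in path)] += 1
        if len(path) <= K:                       # fewer than K edges so far
            for v in adj[path[-1]]:
                if v not in path:
                    extend(path + [v])

    for u in labels:
        extend([u])
    return counts


c = gadget_path_counts(3)
print(c["aba"], c["xyba"], c["xybc"], c["cbyx"], c["abyx"])
```

For w(xi)=3 the count of aba is w(xi)(w(xi)-1)=6, and the length-3 rows agree with the table above.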

27

For each xi; i=1,2,…,3m, generate a graph G(xi):

In total, over G(x1), G(x2), …, G(x3m):
0 --- ∪i{(xi:1)} ∪ {(a:mB), (y:3m), (b:3m), (c:3m)}
1 --- ∪i{(xiy:1), (yxi:1)} ∪ {(ab:mB), (ba:mB), (yb:3m), (by:3m), (bc:3m), (cb:3m)}
2 --- ∪i{(xiyb:1), (byxi:1)} ∪ {(yba:mB), (ybc:3m), (cby:3m), (cba:mB), (aby:mB), (abc:mB), (aba:Σi w(xi)^2 - mB)}
3 --- ∪i{(xiyba:w(xi)), (xiybc:1), (cbyxi:1), (abyxi:w(xi))}

[Figure: G(xi), the chain xi, y, b, c with w(xi) vertices labeled a attached to b]

28

Note for Uniqueness of Graph G(xi)

It is necessary to prove that the set of paths uniquely determines G(xi).

The following cases do NOT occur:
• A vertex y adjacent to both xi and xj --- because a path ‘xiyxj’ is not given. Hence there is a 1-to-1 correspondence between the xi’s and the y’s.
• A vertex b adjacent to two vertices y (those of xi and xj) --- because a path ‘yby’ is not given.
  (Even if tottering (backtracking) is admitted, this is provable similarly by using the # of ‘yb’.)

Uniqueness for a quadruple (xi, y, b, c) can be proved in a similar way.

29


Generation of center graph GC

[Figure: the center graph GC. The vertex d is adjacent to A1, …, Ah, …, Am; each Ah is adjacent to three vertices c, each followed by b, y, … and a’s, with B a’s in total below each Ah]

Remaining paths in GC
0 --- ∪h{(Ah:1)} ∪ {(d:1)}
1 --- ∪h{(Ahc:3m), (Ahd:m), (cAh:3m), (dAh:m)}
2 --- ∪h{(Ahcb:3), (bcAh:3), (cAhc:6), (cAhd:3), (dAhc:3)} ∪ ∪h,k{(AhdAk:1)}
3 --- ∪h{(Ahcby:3), (ybcAh:3), (Ahcba:B), (abcAh:B), (dAhcb:3), (bcAhd:3), (cAhcb:6), (bcAhc:6)} ∪ ∪h,k{(AhdAkc:3), (cAkdAh:3)}

Note: the center graph is determined without knowing the partition (without information about the w(xi)’s)

30

Strong NP-completeness of GIPF

《 Theorem》 GIPF is strongly NP-complete, even if K=3 and the underlying graph is a tree.

31

Hardness results for other special cases

《Theorem》 GIPF is strongly NP-complete, even for trees of bounded degree 4 and of fixed ∑.

• Bounded degrees (Δ): branchings for a’s and for the center d; use a binary tree
• Bounded alphabets (∑): xi’s and Ah’s; encode with a fixed alphabet

In both cases, we cannot bound K by a constant.

Note: if all of ∑, K, Δ are fixed, then the problem can be solved in polynomial time.

32

Conclusion
• GIPF is strongly NP-complete even if the underlying graph is a tree and K=3.
• GIPF (and GIPF-M) for trees is solvable in polynomial time by using DP, if Σ, K and Δ are all fixed.

Still ongoing:
• Our DP is extendable to outer-planar graphs
• Completeness results for more restricted cases

Future work:
• Complexity of SISF-M in general cases
• Approximation algorithms, etc.

34

Tractable special cases

Strings
• SISF: linear time
• SISF-M: polynomial time if K and ∑ are fixed

Trees of fixed ∑
• GIPF (and GIPF-M) for trees: polynomial time if K, ∑, Δ are fixed

36

DP-based algorithm for trees with fixed K, Σ, Δ

37

Algorithm: the tree case (K=1)

Dynamic Programming, K=1, Σ={0,1}

D(n0,n1,n00,n01,n10,n11)=1 if there exists a tree T such that fK(T) = (n0,n1,n00,n01,n10,n11).
Otherwise D(…)=0.

D(n0,n1,n00,n01,n10,n11)=1 iff
  (n0>1 and D(n0-1, n1, n00-2, n01, n10, n11)=1) or
  (n1>0 and D(n0-1, n1, n00, n01-1, n10-1, n11)=1) or
  (n0>0 and D(n0, n1-1, n00, n01-1, n10-1, n11)=1) or
  (n1>1 and D(n0, n1-1, n00, n01, n10, n11-2)=1) or
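The recurrence above can be sketched as a memoized function. Each of the four cases removes a leaf labeled 0 or 1 from its neighbor labeled 0 or 1. The recurrence on the slide is cut off after the last "or", so the base cases used here (the two single-vertex trees) are an assumption, as is the name `tree_exists`.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def tree_exists(n0, n1, n00, n01, n10, n11):
    """True iff some tree over labels {0,1} has this K=1 path frequency vector."""
    v = (n0, n1, n00, n01, n10, n11)
    if min(v) < 0:
        return False
    # Assumed base cases: a tree consisting of a single vertex.
    if v in ((1, 0, 0, 0, 0, 0), (0, 1, 0, 0, 0, 0)):
        return True
    return bool(
        # last leaf labeled 0, attached to a vertex labeled 0
        (n0 > 1 and tree_exists(n0 - 1, n1, n00 - 2, n01, n10, n11))
        # last leaf labeled 0, attached to a vertex labeled 1
        or (n1 > 0 and tree_exists(n0 - 1, n1, n00, n01 - 1, n10 - 1, n11))
        # last leaf labeled 1, attached to a vertex labeled 0
        or (n0 > 0 and tree_exists(n0, n1 - 1, n00, n01 - 1, n10 - 1, n11))
        # last leaf labeled 1, attached to a vertex labeled 1
        or (n1 > 1 and tree_exists(n0, n1 - 1, n00, n01, n10, n11 - 2))
    )


# A vertex labeled 0 with three neighbors labeled 1.
print(tree_exists(1, 3, 0, 3, 3, 0))
```

The recursion depth is at most n0+n1, and memoization keeps the number of evaluated entries polynomial in the number of vertices.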

38

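Algorithm: the tree case (K=1)

Dynamic Programming, K=1, Σ={0,1}

D(n0,n1,n00,n01,n10,n11)=1 if there exists a tree T such that fK(T) = (n0,n1,n00,n01,n10,n11).
Otherwise D(…)=0.

• Since T is a tree, each n* takes a value of at most O(n), where n = Σ_{|t|=1} v_t is the number of vertices.
• Computing the whole table takes O(n^6) time.
• The computation can be reduced further by using relations such as n0+n1=(n00+n01+n10+n11)/2+1.
• The same holds for any constant-size Σ.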

39

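Algorithm: the tree case (K: constant, bounded degree)

• Any tree can be constructed using only the operation “add a vertex as a leaf”; this is applied to the DP.
• The subtrees within distance K of the leaves are incorporated into the DP entries.
• Since K is a constant and the degree is bounded, the number of distinct subtree types is bounded by a constant.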

40

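On tottering (tottering paths)

• Tottering: whether backtracking is allowed when considering paths.
• For the Marginalized Graph Kernel, the effect of allowing or disallowing tottering is small.
• In this work, we basically do not consider tottering.
• The results probably remain valid even if tottering is allowed.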

41

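Supplement on strong NP-completeness

[Figure: three candidate structures, numbered 1, 2 and 3, built from vertices labeled b, c, c’]

Comparison of 1 and 2:
• If tottering is not allowed, they can be distinguished by the presence or absence of the path “b-c-b”.
• If tottering is allowed, they can be distinguished by the number of “b-c-b” paths.

Path counts (values in parentheses are the values when tottering is allowed):

     b-c   c-c’   b-c-c’   b-c-b   c-b-c   c-c’-c   c’-c-c’
1    2     2      2        0(2)    0(2)    0(2)     0(2)
2    2     2      2        2(4)    0(2)    0(2)     0(2)
3    2     2      2        2(4)    0(2)    0(2)     2(4)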

42

Strings, the case with errors

[Figure: vectors 00001, 20000, 00000, 10000, 01000, 00100, 00010]

43

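weakly NP-hard
• Planar graph (series-parallel graph)
• K=2
• Σ: unbounded
• Δ: unbounded

Reduction from PARTITION (X,w)

[Figure: the reduction graph, with vertices labeled x1, x2, x5, x3, x7, x9, z1, z2 and groups of vertices labeled a; w(x1) marks the a’s belonging to x1]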

44

strongly NP-hard
• Planar graph (series-parallel graph)
• K=2
• Σ: unbounded
• Δ: unbounded

Reduction from 3-PARTITION (X,w) where B/4 < w(xi) < B/2

[Figure: the reduction graph, with vertices labeled x1, x2, x5, x3, x7, x9, z1, …, z5, y and groups of vertices labeled a]

45

Introduction: graph inference problem

Graph inference from path frequency: given a path frequency vector v, infer the original graph whose feature vector (= path frequency) is equal to v (or closest to v).

Results on complexities:
• General graphs: strongly NP-complete
• Trees of fixed Δ, Σ, K: in P (using DP)

46

The Complexity of GIPF

In general: strongly NP-complete
• Strongly NP-complete even for planar graphs of bounded degree and for K=4.
• Strongly NP-complete even for planar graphs of bounded degree and of fixed Σ.
• Reduction from 3-PARTITION

GIPF for trees
• Can be solved in time polynomial in n (# of vertices of the original tree), if the maximum degree, K and Σ are fixed
• GIPF-M can also be solved in polynomial time
• Both algorithms are based on DP

[Akutsu, Fukagawa, to appear in: CPM 2005]

47

(Reminder) Strong NP-completeness [Garey&Johnson]

Definition: A number problem P is strongly NP-complete if P is NP-complete even if all numbers in an instance of P are given in unary (i.e., even if all numbers in an instance are bounded by some polynomial in the size of the rest of the input).

The knapsack problem is an example of an NP-complete problem that is not strongly NP-complete (unless P=NP), because it can be solved in O(nB) time, where n is the number of items and B is the size of the knapsack.

Dynamic programming:
  D(i,b) = the optimal value for items {x1,..,xi} and a knapsack of size b.
  D(i+1,b) = max{D(i,b), D(i,b-size(xi+1))+value(xi+1)}
  where D(i,b)=-∞ for b<0
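The recurrence above translates directly into a table-filling loop. This is a minimal sketch assuming non-negative integer sizes and D(0,b)=0 for every b ≥ 0; the function name `knapsack` is hypothetical.

```python
def knapsack(sizes, values, B):
    """Optimal total value for a knapsack of size B, in O(n*B) time."""
    neg_inf = float("-inf")
    # D[b] holds D(i, b) for the current i; D(0, b) = 0 for every b >= 0.
    D = [0] * (B + 1)
    for size, value in zip(sizes, values):
        nxt = D[:]                      # start from D(i, b): item i+1 not taken
        for b in range(B + 1):
            # D(i, b - size) is -infinity when b - size < 0
            take = D[b - size] + value if b - size >= 0 else neg_inf
            nxt[b] = max(D[b], take)
        D = nxt
    return D[B]


print(knapsack([3, 4, 5], [4, 5, 6], 7))
```

The running time depends on the numeric value B rather than on its bit length, which is why this algorithm is pseudo-polynomial.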

48

Inference of Chemical Structure from its Feature Vector

Theoretical results on time complexity
• Sequences (spectrum kernel): in P
• Trees of fixed K, Δ, ∑: in P
• Graphs: strongly NP-hard even in restricted cases

Potential application: drug design
• For example, design a new compound which lies midway between known compounds A and B.

[Figure: chemical structures of compounds A and B and of the target compound (A+B)/2]

49

Algorithm for SISF-M (still unfinished)

Reduction to the Eulerization problem
• Add/remove edges to make Gv have an Eulerian path
• Hamming distance between spectrum feature vectors = the number of edges added or removed

Ex. 3) K=2, v=(vaa,vab,vba,vbb)=(1,2,0,1)

[Figure: Gv on the vertices a and b, and two modified graphs: removing one edge ab gives (1,1,0,1), adding one edge ba gives (1,2,1,1)]

Solutions:
  (1,1,0,1): aabb
  (1,2,1,1): aababb, aabbab, abaabb, abbaab

50

Notes

We cannot add an edge arbitrarily
• An edge between u and v can exist only if the (K-2)-suffix of label(u) and the (K-2)-prefix of label(v) are equal.

Special case
• SISF-M can be solved in polynomial time if we forbid removing edges

51

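Dynamic Programming for SISF-M

Pseudo-polynomial time algorithm

Ex) K=2, ∑={a,b}: for naa,nab,nba,nbb∈N and t∈∑^(K-1),

$$
D(n_{aa},n_{ab},n_{ba},n_{bb},t) :=
\begin{cases}
1 & \text{if } \exists s.\ [\, f_K(s) = (n_{aa},n_{ab},n_{ba},n_{bb}) \text{ and } s \text{ ends with } t \,] \\
0 & \text{otherwise}
\end{cases}
$$

$$
D(n_{aa},n_{ab},n_{ba},n_{bb},a) = 1 \iff
\begin{cases}
n_{aa} \ge 1 \text{ and } D(n_{aa}-1,\ n_{ab},\ n_{ba},\ n_{bb},\ a) = 1, \text{ or}\\
n_{ba} \ge 1 \text{ and } D(n_{aa},\ n_{ab},\ n_{ba}-1,\ n_{bb},\ b) = 1
\end{cases}
$$

$$
D(n_{aa},n_{ab},n_{ba},n_{bb},b) = 1 \iff
\begin{cases}
n_{ab} \ge 1 \text{ and } D(n_{aa},\ n_{ab}-1,\ n_{ba},\ n_{bb},\ a) = 1, \text{ or}\\
n_{bb} \ge 1 \text{ and } D(n_{aa},\ n_{ab},\ n_{ba},\ n_{bb}-1,\ b) = 1
\end{cases}
$$

Base cases:

$$
D(1,0,0,0,a) = 1, \quad D(0,1,0,0,b) = 1, \quad D(0,0,1,0,a) = 1, \quad D(0,0,0,1,b) = 1
$$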
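The table for K=2, ∑={a,b} can be sketched as a memoized recursion over the last character of the string: a string ending in a has aa or ba as its last 2-gram, and a string ending in b has ab or bb. The function names `D` and `feasible` are illustrative, and the base cases are the four single-2-gram strings.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def D(naa, nab, nba, nbb, t):
    """True iff some string s has f_2(s) = (naa, nab, nba, nbb) and ends with t."""
    if min(naa, nab, nba, nbb) < 0:
        return False
    total = naa + nab + nba + nbb
    if total == 0:
        return False
    if total == 1:
        # Base cases: the string is a single 2-gram; compare its last character.
        last = "a" if (naa or nba) else "b"
        return last == t
    if t == "a":
        # The last 2-gram is aa (prefix ends with a) or ba (prefix ends with b).
        return D(naa - 1, nab, nba, nbb, "a") or D(naa, nab, nba - 1, nbb, "b")
    # The last 2-gram is ab (prefix ends with a) or bb (prefix ends with b).
    return D(naa, nab - 1, nba, nbb, "a") or D(naa, nab, nba, nbb - 1, "b")


def feasible(v):
    """True iff some string has 2-gram feature vector v over {a, b}."""
    return D(*v, "a") or D(*v, "b")


print(feasible((1, 1, 0, 2)), feasible((1, 2, 0, 1)))
```

Once the table is available, SISF-M can be answered by scanning the feasible entries for the one closest to v; the table has one entry per count vector and last character, which is what makes the algorithm pseudo-polynomial.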
