TopK Interesting Subgraph Discovery in Information Networks

Manish Gupta, Jing Gao, Xifeng Yan, Hasan Cam, Jiawei Han


Page 2: TopK  Interesting Subgraph Discovery in Information Networks

Real World Problems

• Network Bottlenecks Discovery (Computer Networks)
  – Interestingness = Lowest Bandwidth
• Suspicious Relationships Discovery (Social Networks)
  – Interestingness = Highest Negative Association Strength of Attribute Values
• Team Selection (Organization Networks)
  – Interestingness = Highest Historical Compatibility
• Resource Allocation (Battlefield Networks)
  – Interestingness = Lowest Distance between Entities

Page 3: TopK  Interesting Subgraph Discovery in Information Networks

The Basic Underlying Problem

• Network Bottlenecks Discovery: Interestingness = Lowest Bandwidth
• Team Selection: Interestingness = Highest Historical Compatibility
• Suspicious Relationships Discovery: Interestingness = Highest Negative Association Strength
• Resource Allocation: Interestingness = Lowest Distance

• Given
  – Edge-weighted Typed Network G
  – Typed Subgraph Query Q
  – Edge Interestingness measure
• Find
  – TopK matching subgraphs

Page 4: TopK  Interesting Subgraph Discovery in Information Networks

Naïve Solution: Ranking After Matching

[Figure: Network G (nodes 1 to 13 of types A, B, C with weighted edges), Query Q (nodes 1, 2, 3 of type A and node 4 of type B), and the nine matches M1 to M9 produced by the Matching step, which the Ranking step then sorts by score.]

Match scores after ranking: 2.2, 2.2, 2.1, 2.0, 1.8, 1.8, 1.7, 1.6, 1.4

Why compute all matches? We need only top-2!
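The ranking-after-matching baseline can be sketched on the example network. This is a minimal illustration, not the authors' implementation. The edge list is transcribed from the sorted edge lists shown later in the slides. Two things are assumptions inferred from the listed match scores: that Query Q is the path 1-2-3-4, and that a match is scored by the sum of its edge weights.

```python
# Example network from the slides: node types and weighted undirected edges.
NODE_TYPE = {1: "B", 2: "A", 3: "A", 4: "A", 5: "A", 6: "B", 7: "B",
             8: "A", 9: "A", 10: "A", 11: "C", 12: "C", 13: "C"}
EDGES = {
    (5, 9): 0.9, (3, 4): 0.8, (4, 5): 0.8, (2, 3): 0.7, (8, 9): 0.6, (9, 10): 0.3,
    (12, 13): 0.2,
    (2, 7): 0.7, (5, 6): 0.6, (8, 7): 0.5, (2, 1): 0.2, (4, 7): 0.1,
    (3, 12): 0.5, (4, 12): 0.4, (3, 13): 0.4, (2, 13): 0.3,
    (7, 11): 0.2, (1, 11): 0.1,
}

# Assumption: Query Q is the path 1-2-3-4 with node types A, A, A, B
# (positions are stored 0-based here).
QUERY_TYPES = ["A", "A", "A", "B"]
QUERY_EDGES = [(0, 1), (1, 2), (2, 3)]


def build_adjacency(edges):
    """Undirected adjacency map: node -> {neighbor: weight}."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj


def all_matches(adj):
    """Enumerate every match of the query by backtracking (the costly step)."""
    matches = []

    def extend(assigned):
        i = len(assigned)
        if i == len(QUERY_TYPES):
            # Assumption: score = sum of the matched edge weights.
            score = sum(adj[assigned[a]][assigned[b]] for a, b in QUERY_EDGES)
            matches.append((round(score, 6), tuple(assigned)))
            return
        for v in adj:
            if NODE_TYPE[v] != QUERY_TYPES[i] or v in assigned:
                continue
            # Every query edge that ends at position i must exist in G.
            if all(assigned[a] in adj[v] for a, b in QUERY_EDGES if b == i):
                extend(assigned + [v])

    extend([])
    return matches


def rank_after_matching(k):
    """Materialize all matches, sort by score, return (top-k, full ranking)."""
    ranked = sorted(all_matches(build_adjacency(EDGES)),
                    key=lambda m: (-m[0], m[1]))
    return ranked[:k], ranked


top2, ranked = rank_after_matching(2)
print(len(ranked), top2)
```

All nine matches are built even though only two are returned, which is the cost the slide questions.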

Page 5: TopK  Interesting Subgraph Discovery in Information Networks

Our Contributions

• New notion: TopK interesting subgraph detection in information networks

• Three new low-cost indexes
  – Graph topology index
  – Sorted edge lists
  – Graph maximum metapath weight index

• Novel top-K algorithm to answer interestingness queries on large graphs

• Detailed effectiveness and efficiency validation on several synthetic and real datasets

Page 6: TopK  Interesting Subgraph Discovery in Information Networks

Relationship with Previous Work

• Subgraph matching
  – Approximate: fuzzy node/edge similarity
  – Exact: Matching without ranking
  – RDF graphs, probabilistic graphs, temporal graphs

• TopK querying on graphs
  – H-hop aggregate queries
  – Keyword queries on RDF graphs
  – K most frequent patterns
  – Twig queries

Page 7: TopK  Interesting Subgraph Discovery in Information Networks

System Overview

Offline Index Construction
• Inputs: Network G, Distance D
• Breadth First Traversal from each Node up to Distance D → Graph Topology Index, Graph Maximum MetaPath Weight Index
• Sort Edges → Sorted Edge Lists

Online Query Processing
• Input: Query Q
• Find Candidate Nodes → Candidate Nodes
• Top-K Computation → Top-K Subgraphs

Page 8: TopK  Interesting Subgraph Discovery in Information Networks

Index Structures

G=(V,E), B=avg #neighbors, T=#types

[Figure: Network G, as on Page 4.]

Sorted edge lists (one list per edge type, in decreasing weight order)
AA: (5,9):0.9, (3,4):0.8, (4,5):0.8, (2,3):0.7, (8,9):0.6, (9,10):0.3
BB: (empty)
CC: (12,13):0.2
AB: (2,7):0.7, (5,6):0.6, (8,7):0.5, (2,1):0.2, (4,7):0.1
AC: (3,12):0.5, (4,12):0.4, (3,13):0.4, (2,13):0.3
BC: (7,11):0.2, (1,11):0.1

[Table: Index / Time Complexity / Space Complexity, with rows for Sorted edge lists, Graph topology index, and Graph max metapath weight index. The complexity expressions are missing from the transcript.]

Graph max metapath weight index (d=1: A, B, C; d=2: AA to CC; "-" marks an empty cell)

Node Id |  A    B    C  |  AA   BA   CA   AB   BB   CB   AC   BC   CC
1       | 0.2   -   0.1 | 0.9   -    -   0.9   -   0.3  0.5   -    -
2       | 0.7  0.7  0.3 | 1.5  1.2  0.7   -    -    -   1.2  0.9  0.5
3       | 0.8   -   0.5 | 1.6   -   0.9  1.4   -    -   1.2   -   0.7
4       | 0.8  0.1  0.4 | 1.7  0.8  0.9  1.4   -    -   1.3  0.3  0.6
5       | 0.9  0.6   -  | 1.6   -    -   0.9   -    -   1.2   -    -
6       | 0.6   -    -  | 1.5   -    -    -    -    -    -    -    -
7       | 0.7   -   0.2 | 1.4   -    -   0.9   -   0.3  1     -    -
8       | 0.6  0.5   -  | 1.5  1.2   -    -    -    -    -   0.7   -
9       | 0.9   -    -  | 1.7   -    -   1.5   -    -    -    -    -
10      | 0.3   -    -  | 1.2   -    -    -    -    -    -    -    -
11      |  -   0.2   -  |  -   0.9   -    -    -    -    -    -    -
12      | 0.5   -   0.2 | 1.3   -   0.6  0.5   -    -   0.9   -    -

Graph topology index (same columns)

Node Id | A  B  C | AA BA CA AB BB CB AC BC CC
1       | 1  -  1 | 1  -  -  1  -  1  1  -  -
2       | 1  2  1 | 1  2  1  -  -  -  2  1  1
3       | 2  -  2 | 1  -  2  2  -  -  2  -  2
4       | 2  1  1 | 2  2  1  1  -  -  2  1  1
5       | 2  1  - | 3  -  -  1  -  -  1  -  -
6       | 1  -  - | 2  -  -  -  -  -  -  -  -
7       | 3  -  1 | 3  -  -  1  -  1  2  -  -
8       | 1  1  - | 2  2  -  -  -  -  -  1  -
9       | 3  -  - | 1  -  -  2  -  -  -  -  -
10      | 1  -  - | 2  -  -  -  -  -  -  -  -
11      | -  2  - | -  3  -  -  -  -  -  -  -
12      | 2  -  1 | 4  -  2  1  -  -  1  -  -
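The three indexes can be sketched in a few lines for the example network. This is an illustrative sketch under assumptions, not the authors' code: a metapath such as "AB" is read as "first hop to a type-A node, second hop to a type-B node", the topology index counts distinct end nodes of simple paths, the MMW index stores the maximum path weight, and a depth-first walk stands in for the breadth-first traversal named on the System Overview slide. All function names are hypothetical.

```python
from collections import defaultdict

NODE_TYPE = {1: "B", 2: "A", 3: "A", 4: "A", 5: "A", 6: "B", 7: "B",
             8: "A", 9: "A", 10: "A", 11: "C", 12: "C", 13: "C"}
EDGES = {
    (5, 9): 0.9, (3, 4): 0.8, (4, 5): 0.8, (2, 3): 0.7, (8, 9): 0.6, (9, 10): 0.3,
    (12, 13): 0.2,
    (2, 7): 0.7, (5, 6): 0.6, (8, 7): 0.5, (2, 1): 0.2, (4, 7): 0.1,
    (3, 12): 0.5, (4, 12): 0.4, (3, 13): 0.4, (2, 13): 0.3,
    (7, 11): 0.2, (1, 11): 0.1,
}


def sorted_edge_lists(edges, node_type):
    """One list per edge type (for example "AB"), in decreasing weight order."""
    lists = defaultdict(list)
    for (u, v), w in edges.items():
        etype = "".join(sorted(node_type[u] + node_type[v]))
        lists[etype].append(((u, v), w))
    for etype in lists:
        lists[etype].sort(key=lambda item: -item[1])  # stable for equal weights
    return dict(lists)


def topology_and_mmw(edges, node_type, max_d=2):
    """Per node and metapath: number of reachable end nodes, and max path weight."""
    adj = defaultdict(dict)
    for (u, v), w in edges.items():
        adj[u][v] = w
        adj[v][u] = w
    reach = {v: defaultdict(set) for v in node_type}
    mmw = {v: {} for v in node_type}

    def walk(start, node, seen, metapath, weight):
        # Types are single characters, so len(metapath) is the path length.
        if len(metapath) == max_d:
            return
        for nxt, w in adj[node].items():
            if nxt in seen:  # simple paths only
                continue
            mp = metapath + node_type[nxt]
            total = round(weight + w, 6)
            reach[start][mp].add(nxt)
            mmw[start][mp] = max(mmw[start].get(mp, 0.0), total)
            walk(start, nxt, seen | {nxt}, mp, total)

    for v in node_type:
        walk(v, v, {v}, "", 0.0)
    topo = {v: {mp: len(nodes) for mp, nodes in d.items()}
            for v, d in reach.items()}
    return topo, mmw


edge_lists = sorted_edge_lists(EDGES, NODE_TYPE)
topo, mmw = topology_and_mmw(EDGES, NODE_TYPE)
print(edge_lists["AA"][0], topo[2], mmw[2])
```

Under these assumptions the computed rows for node 2 agree with the node 2 rows of the two index tables on the slide.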

Page 9: TopK  Interesting Subgraph Discovery in Information Networks

Find Candidate Nodes

Graph Topology Index + Query Q → Query Topology → Candidate Nodes

[Figure: Query Q with nodes 1, 2, 3 of type A and node 4 of type B.]

Query topology (d=1: A, B, C; d=2: AA to CC; "-" marks an empty cell)

Node Id | A  B  C | AA BA CA AB BB CB AC BC CC
1       | 1  -  - | 1  -  -  -  -  -  -  -  -
2       | 2  -  - | -  -  -  1  -  -  -  -  -
3       | 1  1  - | 1  -  -  -  -  -  -  -  -
4       | 1  -  - | 1  -  -  -  -  -  -  -  -

Nodes listed for each query node
Query node 1: 2, 3, 4, 5, 8, 9, 10
Query node 2: 2, 3, 4, 5, 8, 9, 10
Query node 3: 2, 3, 4, 5, 8, 9, 10
Query node 4: 1, 6, 7

Graph topology index: same table as on Page 8.
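One way to turn the two topology tables into a candidate filter is a dominance test. The transcript does not spell out the rule, so the following is an assumption: a graph node is kept for a query node when the types agree and every metapath count of the graph node is at least the corresponding count of the query node. The path shape of Query Q (1-2-3-4) and all function names are likewise assumptions for illustration.

```python
G_TYPE = {1: "B", 2: "A", 3: "A", 4: "A", 5: "A", 6: "B", 7: "B",
          8: "A", 9: "A", 10: "A", 11: "C", 12: "C", 13: "C"}
G_EDGES = [(5, 9), (3, 4), (4, 5), (2, 3), (8, 9), (9, 10), (12, 13),
           (2, 7), (5, 6), (8, 7), (2, 1), (4, 7),
           (3, 12), (4, 12), (3, 13), (2, 13), (7, 11), (1, 11)]

# Assumption: Query Q is the path 1-2-3-4 with types A, A, A, B.
Q_TYPE = {1: "A", 2: "A", 3: "A", 4: "B"}
Q_EDGES = [(1, 2), (2, 3), (3, 4)]


def adjacency(edges):
    """Undirected adjacency map: node -> set of neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def topology(edges, node_type, max_d=2):
    """For each node: metapath -> number of distinct end nodes within max_d hops."""
    adj = adjacency(edges)
    index = {}
    for start in node_type:
        reach = {}
        frontier = [(start, "", frozenset([start]))]
        for _ in range(max_d):
            next_frontier = []
            for node, metapath, seen in frontier:
                for nb in adj.get(node, ()):
                    if nb in seen:  # simple paths only
                        continue
                    mp = metapath + node_type[nb]
                    reach.setdefault(mp, set()).add(nb)
                    next_frontier.append((nb, mp, seen | {nb}))
            frontier = next_frontier
        index[start] = {mp: len(nodes) for mp, nodes in reach.items()}
    return index


def candidate_nodes(g_topo, q_topo):
    """Assumed filter: same type, and graph counts dominate query counts."""
    result = {}
    for q, q_counts in q_topo.items():
        result[q] = sorted(
            v for v, g_counts in g_topo.items()
            if G_TYPE[v] == Q_TYPE[q]
            and all(g_counts.get(mp, 0) >= c for mp, c in q_counts.items())
        )
    return result


cands = candidate_nodes(topology(G_EDGES, G_TYPE), topology(Q_EDGES, Q_TYPE))
print(cands)
```

With this rule the lists for query nodes 2 and 3 shrink to four nodes each, while the lists for query nodes 1 and 4 stay as shown on the slide.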

Page 10: TopK  Interesting Subgraph Discovery in Information Networks

Finding and Scoring Matches: Key Idea

[Flowchart: Top-K Computation. Boxes: Start; More valid edges?; Generate a Size-1 Candidate; Compute Actual and UB Score; TopK Quit?; Grow Candidates; Candidate Size==|Q|?; Update Heap; Compute Max UB Score; Done! The Y/N branch labels cannot be placed from the transcript. The Top-K Heap is drawn holding matches M1 to M5.]

[Figure: Query Q with nodes 1, 2, 3 of type A and node 4 of type B.]

Sorted edge lists used by the query
AA: (5,9):0.9, (3,4):0.8, (4,5):0.8, (2,3):0.7, (8,9):0.6, (9,10):0.3
BA: (2,7):0.7, (5,6):0.6, (8,7):0.5, (2,1):0.2, (4,7):0.1
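The "Update Heap" box can be illustrated with a bounded min-heap, so that the smallest score among the current top-K sits at the root and is cheap to compare against. This is a generic sketch using Python's heapq, not the authors' code; the function name update_heap is hypothetical, and the five matches fed in are taken from the example with their scores.

```python
import heapq


def update_heap(heap, k, score, match):
    """Keep the k best (score, match) pairs; return True if the heap changed."""
    if len(heap) < k:
        heapq.heappush(heap, (score, match))
        return True
    if score > heap[0][0]:  # heap[0] is the current k-th best entry
        heapq.heapreplace(heap, (score, match))
        return True
    return False


heap = []
found = [
    (2.1, (8, 9, 5, 6)),
    (2.0, (5, 9, 8, 7)),
    (2.2, (3, 4, 5, 6)),
    (2.2, (4, 3, 2, 7)),
    (1.8, (10, 9, 5, 6)),
]
changed = [update_heap(heap, 2, score, match) for score, match in found]
print(sorted(heap), changed)
```

The root score heap[0][0] is the min(heap) value that the prune, discard, and quit checks compare upper bounds against.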

Page 11: TopK  Interesting Subgraph Discovery in Information Networks

Finding and Scoring Matches: Generating Size-1 Candidates

Sorted edge lists AA and BA: as on Page 10.

[Figure: Query Q with nodes 1, 2, 3 of type A and node 4 of type B.]

Size-1 Candidates
[Figure: four placements of edge (5,9) in Query Q.]
• Multiple query edges of the same type
• Query Edge with both endpoints of same type
• Order: (5,9), (3,4), (4,5), (2,3), (2,7), …

Candidate Growth
[Figure: the candidate holding nodes 5 and 9 is extended with node 8 and then node 6, and alternatively with node 10 and then node 6. Partially grown candidates face Prune? or Grow?; fully grown candidates face Heapify? or Discard?]
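The two reasons the slide gives for one graph edge producing several size-1 candidates can be made concrete. In this sketch the query is assumed to be the path 1-2-3-4 with types A, A, A, B, and size1_candidates is a hypothetical helper: it tries every query edge and both orientations of the graph edge, keeping the placements whose node types agree.

```python
NODE_TYPE = {1: "B", 2: "A", 3: "A", 4: "A", 5: "A", 6: "B", 7: "B",
             8: "A", 9: "A", 10: "A", 11: "C", 12: "C", 13: "C"}

# Assumption: Query Q is the path 1-2-3-4 with types A, A, A, B.
Q_TYPE = {1: "A", 2: "A", 3: "A", 4: "B"}
Q_EDGES = [(1, 2), (2, 3), (3, 4)]


def size1_candidates(edge):
    """All type-consistent placements of one graph edge on one query edge."""
    u, v = edge
    placements = []
    for a, b in Q_EDGES:                  # multiple query edges of the same type
        for x, y in ((u, v), (v, u)):     # both orientations of the graph edge
            if NODE_TYPE[x] == Q_TYPE[a] and NODE_TYPE[y] == Q_TYPE[b]:
                placements.append({a: x, b: y})
    return placements


from_aa_edge = size1_candidates((5, 9))
from_ab_edge = size1_candidates((2, 7))
print(from_aa_edge)
print(from_ab_edge)
```

Edge (5,9) fits both AA query edges in both directions, giving four candidates, while edge (2,7) fits only the single AB query edge in one direction.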

Page 12: TopK  Interesting Subgraph Discovery in Information Networks

Finding and Scoring Matches: Actual Score and Upper Bound Score

[Figure: Candidate Growth of the candidate holding nodes 5 and 9 (extended with node 8, then node 6), next to the Useful Edge Lists AA and BA from Page 10.]

Actual Score = 0.9

UB Score = 0.9 + UB(NonConsidered Edges) = 0.9 + (0.6 + 0.6) = 2.1

• Partially grown candidate
  – Prune if UBScore < min(heap)
  – Grow otherwise

• Fully grown candidate
  – Discard if UBScore < min(heap)
  – Update heap otherwise

Page 13: TopK  Interesting Subgraph Discovery in Information Networks

Finding and Scoring Matches: Global Top-K Quit

K=2

TopK Heap: (4,3,2,7): 2.2, (3,4,5,6): 2.2

0.7+0.6+0.7 = 2 < 2.2 → Stop

[Figure: Network G, Query Q, and the sorted edge lists AA and BA, as on Pages 4 and 10.]
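The stop test 0.7+0.6+0.7 = 2 < 2.2 can be reproduced with a threshold-style bound. The reading used here is an assumption: once edges (5,9), (3,4) and (4,5) have been processed, any match not yet seen uses only unprocessed edges, so its score is at most the sum of the best remaining weights, one per query edge of each type. The function names are hypothetical.

```python
from collections import Counter


def max_unseen_score(remaining, query_edge_types):
    """Upper bound on any match built only from unprocessed edges."""
    bound = 0.0
    for etype, count in Counter(query_edge_types).items():
        best = sorted(remaining.get(etype, []), reverse=True)[:count]
        bound += sum(best)
    return round(bound, 6)


def can_quit(heap_scores, k, bound):
    """Global quit: the heap is full and no unseen match can beat its minimum."""
    return len(heap_scores) >= k and bound < min(heap_scores)


# Assumed state: AA weights 0.9, 0.8, 0.8 are processed; the AB list is untouched.
remaining = {"AA": [0.7, 0.6, 0.3], "AB": [0.7, 0.6, 0.5, 0.2, 0.1]}
query_edge_types = ["AA", "AA", "AB"]   # Query Q has two AA edges and one AB edge

bound = max_unseen_score(remaining, query_edge_types)
stop = can_quit([2.2, 2.2], 2, bound)
print(bound, stop)
```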

Page 14: TopK  Interesting Subgraph Discovery in Information Networks

Faster Query Processing using Graph Maximum MetaPath Weight Index

[Figure: a Query with nodes 1 to 5, a Partial Candidate in which edge 1-2 is instantiated, and the Paths to cover Non-Considered Edges.]

UB Score = Actual Score(1-2) + UB(1-3) + UB(2-3) + UB(3-4) + UB(4-5)

Using MMW Index!

UB Score = Actual Score(1-2) + UB(1-3-4-5) + UB(2-3)

Slight complication

[Figure: a larger Query with nodes 1 to 7, its Partial Instantiation, the Paths to cover Non-Considered Edges, and the Edges to Consider Separately.]

UB Score = Actual Score(1-2) + UB(1-3-4-5-7) + UB(2-3) + UB(4-6) + UB(6-7)

Page 15: TopK  Interesting Subgraph Discovery in Information Networks

Faster Query Processing using Graph Maximum MetaPath Weight Index

[Figure: partial candidate in Query Q holding nodes 9 and 5; decision Prune? or Grow?]

K=2

TopK Heap: (8,9,5,6): 2.1, (5,9,8,7): 2.0

Edge-based UBScore: 0.9+0.8+0.7 = 2.4 > 2.0 → Grow

Path-based UBScore: 0.9+UB(5-A-B) = 0.9+0.9 = 1.8 < 2.0 → Prune

Sorted edge lists AA and BA: as on Page 10.

MMW Index: same table as on Page 8.
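The two bounds on this slide can be compared directly in code. This is a sketch under assumptions: the edge-based bound is taken to add, for each open query edge, the heaviest edge of that type not already used by the candidate, and the path-based bound is taken to add the MMW entry of the node the open path starts from. The MMW row for node 5 is copied from the slide's table, and the function names are hypothetical.

```python
EDGE_LISTS = {
    "AA": [((5, 9), 0.9), ((3, 4), 0.8), ((4, 5), 0.8),
           ((2, 3), 0.7), ((8, 9), 0.6), ((9, 10), 0.3)],
    "AB": [((2, 7), 0.7), ((5, 6), 0.6), ((8, 7), 0.5),
           ((2, 1), 0.2), ((4, 7), 0.1)],
}
# Row of the MMW index for node 5, as listed on the slide.
MMW_NODE_5 = {"A": 0.9, "B": 0.6, "AA": 1.6, "AB": 0.9, "AC": 1.2}


def edge_based_ub(actual, used_edges, open_edge_types):
    """Actual score plus the best unused edge weight for every open query edge."""
    taken = set(used_edges)
    bound = actual
    for etype in open_edge_types:
        for edge, weight in EDGE_LISTS[etype]:
            if edge not in taken:
                taken.add(edge)   # one graph edge cannot fill two query edges
                bound += weight
                break
    return round(bound, 6)


def path_based_ub(actual, mmw_row, metapath):
    """Actual score plus the max weight of a path with the given metapath."""
    return round(actual + mmw_row[metapath], 6)


def decide(ub, heap_min):
    """Prune a partially grown candidate whose bound is below the heap minimum."""
    return "prune" if ub < heap_min else "grow"


# Candidate: edge (5,9) with score 0.9; one AA and one AB query edge remain open.
edge_ub = edge_based_ub(0.9, [(5, 9)], ["AA", "AB"])
path_ub = path_based_ub(0.9, MMW_NODE_5, "AB")
print(edge_ub, decide(edge_ub, 2.0))
print(path_ub, decide(path_ub, 2.0))
```

The path-based bound is tighter because it only counts edges that actually hang together starting at node 5, which is why the same candidate is grown under the edge-based bound but pruned under the path-based one.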

Page 16: TopK  Interesting Subgraph Discovery in Information Networks

Discussions

• Queries with multiple edge semantics
• Directed graphs
• Homogeneous networks
• Weighted query edges
  – Weights signify expected amount of interestingness
  – Weights signify importance of query edge
• Faster computations versus index size

Page 17: TopK  Interesting Subgraph Discovery in Information Networks

Low-cost Index Structures

[Plot: index construction Time (sec) versus |V| from 1000 to 1000000, log scale. Series: Topology+MMW (D=2), SPath (D=2), Sorted Edge Lists.]

[Plot: Index Size (KBs) versus |V| from 1000 to 1000000, log scale. Series: Edge Lists, Topology (D=2), Topology (D=3), MMW (D=2), MMW (D=3), SPath (D=2), SPath (D=3), Graph Size.]

Page 18: TopK  Interesting Subgraph Discovery in Information Networks

Faster Query Execution

        |Q|=2   |Q|=3   |Q|=4    |Q|=5
RAM      158    3186    39294    469962
RWM0     10     165     824      4660
RWM1     12     195     1022     5891
RWM2     12     212     3135     27363
RWM3     111    1486    3978     9972
RWM4     12     165     791      4518

        |Q|=2   |Q|=3   |Q|=4    |Q|=5
RAM      144    8698    34639    174992
RWM0     10     375     14689    229136
RWM1     13     446     16754    200065
RWM2     12     562     19088    201708
RWM3     156    2277    17182    161533
RWM4     11     346     13547    199617

        |Q|=2   |Q|=3   |Q|=4    |Q|=5
RAM      245    2004    14628    169328
RWM0     15     32      43       122
RWM1     19     36      98       178
RWM2     20     40      442      6887
RWM3     218    1733    2337     3933
RWM4     18     34      42       118

Query Execution Time (msec) for Path Queries (Graph G2 and indexes with D=2)

Query Execution Time (msec) for Clique Queries (Graph G2 and indexes with D=2)

Query Execution Time (msec) for Subgraph Queries (Graph G2 and indexes with D=2)

RAM: Ranking After Matching baseline
RWM0: without using the candidate node filtering
RWM1: without using the MMW index
RWM2: same as RWM1 without pruning any partially grown candidates
RWM3: same as RWM1 without the global top-K quit check
RWM4: same as RWM1 with the MMW index

Page 19: TopK  Interesting Subgraph Discovery in Information Networks

Good Scalability

Graph \ Query Size   |Q|=2   |Q|=3   |Q|=4   |Q|=5    |Q|=6     |Q|=7
|V|=1e+3             5       18      77      382      1870      7656
|V|=1e+4             10      90      407     2267     12366     87657
|V|=1e+5             52      396     2794    18412    131256    1006773
|V|=1e+6             362     4907    28600   184523   1216893   9786327

Running time (msec) for different Query Sizes and Graph Sizes (D=2)

Good Scalability thanks to Effective Pruning

                         |Q|=2   |Q|=3   |Q|=4   |Q|=5
#Candidates of Size 2    9.54    7.86    4.38    1.63
#Candidates of Size 3    -       28.28   18.31   7.94
#Candidates of Size 4    -       -       24.42   25.5
#Candidates of Size 5    -       -       -       13.61

Number of Candidates as Percentage of Total Matches for Different Query Sizes and Candidate Sizes

[Plot: Query Execution Time for Different Values of K. X axis: Size of the Query (|Q|=2 to |Q|=5). Y axis: Average Query Execution Time (msec), log scale from 10 to 10000. Series: K=10, K=20, K=50, K=100.]

Page 20: TopK  Interesting Subgraph Discovery in Information Networks

Real Dataset Case Studies

Queries
• Q1 (nodes 1 to 3): Author, Author, Conf
• Q2 (nodes 1 to 4): Author, Author, Conf, Keyword
• Q3 (nodes 1 to 3): Person, Person, Film
• Q4 (nodes 1 to 4): Person, Person, Company, Settlement

Dataset                           DBLP           Wikipedia
#Nodes                            138K           670K
#Edges                            1.6M           4.1M
#Types                            3              10
Edge List Index Size              50 MB          261 MB
Topology Index Size               5.8 MB         148 MB
MMW Index Size                    11.4 MB        249 MB
SPath Index Size                  4.3 GB         13.7 GB
Topology+MMW Construction Time    513 minutes    1203 minutes
Avg Query Time                    100 sec        42 sec

Page 21: TopK  Interesting Subgraph Discovery in Information Networks

Real Dataset Case Studies

• DBLP
  – 1: Rohit Gupta, 2: BICoB, 3: Vipin Kumar
    • Rohit Gupta -- computer networking
    • Vipin Kumar -- Data and Information Systems
    • BICoB -- International Conference on Bioinformatics and Computational Biology
  – 1: Jimeng Sun, 2: Operating Systems Review (SIGOPS), 3: Christos Faloutsos, 4: mining
    • Jimeng Sun and Christos Faloutsos -- Data and Information Systems, Artificial intelligence, and Computational biology
    • "mining" -- Data and Information Systems
    • "Operating Systems Review (SIGOPS)" -- Operating systems, Computer architecture, Computer networking

Page 22: TopK  Interesting Subgraph Discovery in Information Networks

Real Dataset Case Studies

• Wikipedia
  – 1: Stacy Keach, 2: The Biggest Battle, 3: John Huston
    • Stacy Keach and John Huston starred in the movie “The Biggest Battle”
    • Stacy Keach (American), John Huston (American), movie is Italian
    • Stacy (narration, comedy, music), John (drama, documentary, adventure), movie (war)
  – 1: Medha Patkar, 2: BBC, 3: Felix D’Alviella, 4: Mogilino
    • Medha Patkar -- Indian social activist -- won Best International Political Campaigner by BBC
    • Felix D’Alviella -- Belgian actor in the BBC soap opera Doctors
    • Mogilino -- village in Bulgaria -- BBC showed the popular film "Bulgaria’s Abandoned Children" in 2007
    • British company rewarding an Indian woman, covering a place in Bulgaria or linked to a person from Belgium is rare

Page 23: TopK  Interesting Subgraph Discovery in Information Networks


Related Work (1)

• Theory literature on subgraph isomorphism [Cordella et al., 2004; McKay, 1981; Ullmann, 1976]

• Exact subgraph matching [Cheng et al., 2008; He and Singh, 2008; Sun et al., 2012; Zhang et al., 2007; Zhang et al., 2009; Zhao and Han, 2010; Zou et al., 2009]

• Approximate subgraph matching [Zou et al., 2007; Zeng et al., 2012; Tian et al., 2007; Zhang et al., 2010]

Page 24: TopK  Interesting Subgraph Discovery in Information Networks

Related Work (2)

• Matching in graph databases [Ranu and Singh, 2009; Yan et al., 2005; Zhu et al., 2012]

• Matching for RDF graphs [Liu et al., 2012], probabilistic graphs [Yuan et al., 2012] and temporal graphs [Bogdanov et al., 2011]

• Top-K queries
  – h-hop aggregate queries [Yan et al., 2010]
  – K most frequent patterns [Yang et al., 2012; Zhu et al., 2011]
  – Top-K keyword queries on RDF graphs [Tran et al., 2009]
  – Top-K similarity queries [Zou et al., 2007]
  – Twig queries [Gou and Chirkova, 2008]

Page 25: TopK  Interesting Subgraph Discovery in Information Networks

Conclusion

• Given
  – Typed unweighted query
  – A heterogeneous edge-weighted information network
  – Edge interestingness measure

• Find
  – Top-K interesting subgraphs

• Investigated ranking after matching baseline
• Proposed three new graph indexes and exploited them for building a top-K solution
• Showed efficiency, scalability and effectiveness on multiple synthetic and real datasets

Page 26: TopK  Interesting Subgraph Discovery in Information Networks


Thanks!