
Dealing with Diversity in Mining and Query Processing

Jeffrey Xu Yu (于旭)
Department of Systems Engineering and Engineering Management
The Chinese University of Hong Kong
[email protected]

http://www.cuhk.edu.hk/

Books on Social Networks
Social and Economic Networks, by Matthew O. Jackson
Social Network Data Analysis, by Charu C. Aggarwal
Exploratory Social Network Analysis with Pajek, by Wouter de Nooy, Andrej Mrvar, and Vladimir Batagelj
Networks, Crowds, and Markets: Reasoning about a Highly Connected World, by David Easley and Jon Kleinberg
Networks: An Introduction, by M.E.J. Newman

Some Online Courses
Mining of Massive Datasets (Anand Rajaraman and Jeff Ullman): http://infolab.stanford.edu/~ullman/mmds.html
Networks, Crowds, and Markets: Reasoning about a Highly Connected World, by David Easley and Jon Kleinberg: http://www.cs.cornell.edu/home/kleinber/networks-book
Topics in Data Management & Mining - Social Networks, Laks V.S. Lakshmanan: http://www.cs.ubc.ca/~laks/534l/cpsc534l.html


Stanford Large Network Dataset Collection: http://snap.stanford.edu/data
Social networks
Communication networks
Citation networks
Collaboration networks
Web graphs
Amazon networks
Internet networks
Road networks
Autonomous systems
Signed networks
Wikipedia networks and metadata
Twitter and Memetracker

Graph Database: http://en.wikipedia.org/wiki/Graph_database
Pregel: Google’s internal graph processing platform
Trinity: Microsoft Research Asia
Neo4j: commercial graph database
…

Diversified Ranking
Why diversified ranking?
Users' information requirements are diverse.
Queries are often incomplete.

Problem Statement
For query-dependent diversified ranking, the goal is to find K nodes in a graph that are relevant to the query node and dissimilar to each other.
For query-independent diversified ranking, the goal is to find K prestigious nodes in a graph that are dissimilar to each other.
Main applications
Ranking nodes in a social network, ranking papers, etc.

Challenges
Diversity measures
There is no widely accepted diversity measure on graphs in the literature.
Scalability
Most existing methods do not scale to large graphs.
Lack of intuitive interpretation.

Existing Methods
Grasshopper [Zhu, et al., HLT-NAACL’07]
ManiRank [Zhu, et al., WWW’11]
DivRank [Mei, et al., KDD’10]
DRAGON [Tong, et al., KDD’11]
Resistive Graph Centers [Dubey, et al., KDD’11]

Grasshopper/ManiRank
The main idea
Work in an iterative manner.
Select a node at each iteration by a random walk.
Set the selected node to be an absorbing node, and perform the random walk again to select the second node.
Repeat the same process for K iterations to get K nodes.
No diversity measure
Diversity is achieved only by intuition and experiments.
Cannot scale to large graphs (time complexity O(…))

Grasshopper/ManiRank Initial random walk with no absorbing states

Absorbing random walk after ranking the first item

DivRank
Based on a vertex-reinforced random walk.
No diversity measure.
Convergence properties are not clear.
Time and space complexity is …

DRAGON, Resistive Graph Centers
DRAGON [Tong, et al., KDD’11]
The diversity measure lacks a clear topological interpretation.
Resistive Graph Centers [Dubey, et al., KDD’11]
Based on personalized PageRank with a learnable teleportation parameter.
Does not scale to large graphs.

A Summary

Comparison with existing methods

Our Approach
The main idea
Relevance of the top-K nodes (denoted by a set S) is achieved by large (Personalized) PageRank scores.
Diversity of the top-K nodes is achieved by a large expansion ratio.
Expansion ratio of a node set S: σ(S) = |N(S)|/n, where N(S) is the neighborhood of S and n is the number of nodes in the graph.
A larger expansion ratio implies better diversity.
k-step expansion ratio of S: σ_k(S) = |N_k(S)|/n, where N_k(S) is the k-step neighborhood of S.
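As a minimal sketch of the two ratios, the snippet below computes N_k(S) with a multi-source breadth-first search and divides by n. It assumes that the neighborhood includes the seed nodes themselves and that the graph is given as an adjacency dictionary; the function names and the toy path graph are illustrative only.

```python
from collections import deque

def k_step_neighborhood(adj, seeds, k):
    """Nodes within k hops of any seed (seeds included, which is an assumption)."""
    dist = {s: 0 for s in seeds}
    queue = deque(dist)
    while queue:
        u = queue.popleft()
        if dist[u] == k:          # do not expand past k hops
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)

def expansion_ratio(adj, seeds, k=1):
    """sigma_k(S) = |N_k(S)| / n."""
    return len(k_step_neighborhood(adj, seeds, k)) / len(adj)

# Hypothetical path graph 0-1-2-3-4-5
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
close_pair = expansion_ratio(adj, [0, 1])    # N = {0, 1, 2}
spread_pair = expansion_ratio(adj, [0, 4])   # N = {0, 1, 3, 4, 5}
```

On this toy graph the spread-out pair covers more of the graph than the adjacent pair, which is the intuition behind using the expansion ratio as a diversity measure.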

Our diversity measures

The K-step Expansion

The diversified ranking problem on a graph is formulated as a discrete optimization problem.
Submodularity
F(S) is shown to be submodular and non-decreasing.
The greedy algorithm
A (1-1/e)-approximation algorithm for solving Eq. (1).
Linear time and space complexity w.r.t. the size of the graph.

A Discrete Optimization Problem

The Greedy Algorithm

Works in K rounds Select a node with maximal marginal gain at one round

Marginal gain
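The greedy rounds can be sketched as follows. The objective used here, F(S) = (1 - lam) * (sum of relevance scores in S) + lam * |N(S)|/n, is an assumed form chosen only to combine relevance with the expansion ratio; `rank`, `lam`, and the toy graph are hypothetical. This naive version re-evaluates F for every candidate and is not the linear-time implementation mentioned on the slides.

```python
def neighborhood(adj, nodes):
    """N(S): the nodes of S together with their neighbors (assumed definition)."""
    covered = set(nodes)
    for u in nodes:
        covered.update(adj[u])
    return covered

def objective(adj, rank, nodes, lam=0.5):
    """Assumed objective: weighted relevance plus weighted expansion ratio."""
    relevance = sum(rank[u] for u in nodes)
    return (1 - lam) * relevance + lam * len(neighborhood(adj, nodes)) / len(adj)

def greedy_diversified(adj, rank, K, lam=0.5):
    """K rounds; each round adds the node with the maximal marginal gain."""
    selected = []
    for _ in range(min(K, len(adj))):
        base = objective(adj, rank, selected, lam)
        best = max((u for u in adj if u not in selected),
                   key=lambda u: objective(adj, rank, selected + [u], lam) - base)
        selected.append(best)
    return selected

# Two hypothetical stars: hub 0 with leaves 1-3, hub 4 with leaves 5-6
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0], 4: [5, 6], 5: [4], 6: [4]}
rank = {0: 0.30, 1: 0.25, 2: 0.10, 3: 0.10, 4: 0.15, 5: 0.05, 6: 0.05}
picked = greedy_diversified(adj, rank, K=2)
```

Ranking by relevance alone would return nodes 0 and 1; the greedy selection prefers node 4 in the second round because it opens up a new part of the graph.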

Maximize F_k(S) subject to the cardinality constraint |S| <= K
Submodularity
F_k(S) is shown to be submodular and non-decreasing.
Randomized greedy algorithm
A near (1-1/e)-approximation algorithm.
Linear time and space complexity w.r.t. the size of the graph.

Generalized Diversified Ranking Optimization

Randomized greedy algorithm
Same idea as the greedy algorithm: it works in K rounds, and at each round it selects the node with the maximal marginal gain.
However, evaluating the maximal marginal gain exactly is expensive.
Our idea: use a probabilistic counting data structure to sketch the k-step neighborhood of each node.

Generalized Diversified Ranking Optimization

Marginal gain of a node u: it combines the node weight w_u with the neighborhood growth |N_k(S ∪ {u})| − |N_k(S)|

The FM Sketch is a probabilistic counting structure devised by Flajolet and Martin.
It estimates the cardinality of a multi-set using only log C + t bits, where C denotes the cardinality and t is a small constant.
Each FM Sketch is a bitmap of log C + t bits.
Advantage: to estimate the cardinality of the union of two multi-sets, we only need a bitwise OR between the two FM Sketches.
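A minimal single-bitmap sketch in the Flajolet-Martin style is shown below to illustrate the union property. The 32-bit width, the SHA-256 based hash, and the function names are choices made for this illustration; a practical implementation would average many bitmaps, because a single bitmap gives a high-variance estimate.

```python
import hashlib

NUM_BITS = 32      # bitmap width (hypothetical choice for this sketch)
PHI = 0.77351      # Flajolet-Martin correction constant

def _rho(item):
    """Index of the lowest set bit in a hash of the item."""
    digest = hashlib.sha256(repr(item).encode()).digest()
    h = int.from_bytes(digest[:8], "big")
    if h == 0:
        return NUM_BITS - 1
    return min((h & -h).bit_length() - 1, NUM_BITS - 1)

def fm_sketch(items):
    """Bitmap with bit rho(x) set for every item x; duplicates have no effect."""
    bitmap = 0
    for item in items:
        bitmap |= 1 << _rho(item)
    return bitmap

def fm_estimate(bitmap):
    """Estimate the cardinality from the position of the lowest zero bit."""
    r = 0
    while bitmap & (1 << r):
        r += 1
    return (2 ** r) / PHI

a = fm_sketch(range(0, 600))
b = fm_sketch(range(400, 1000))
union = fm_sketch(range(0, 1000))
```

Because each item sets a bit that depends only on its hash, `a | b` is exactly the sketch of the union, so the union cardinality can be estimated without touching the original sets.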

FM Sketch and Its Properties

Randomized greedy algorithm
For each node u, use an FM Sketch to sketch N_k({u}).
Use the following rule to sketch N_k({u}), which can be implemented in a recursive way.
Use an FM Sketch to sketch N_k(S).
Evaluating the marginal gain can be implemented by a bitwise OR between the sketches of N_k(S) and N_k({u}).

The Randomized Greedy Algorithm

N_k({u}) = ∪_{(u,v)∈E} N_{k−1}({v})
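The recursive rule can be illustrated with exact bitmaps standing in for FM Sketches: bit v of a node's bitmap is set when v is in its k-step neighborhood, and the union over neighbors becomes a bitwise OR. The sketch assumes nodes are numbered 0..n-1 and that a node belongs to its own neighborhood; the helper names are hypothetical.

```python
def _or_all(values):
    """Bitwise OR of an iterable of integers."""
    out = 0
    for x in values:
        out |= x
    return out

def k_step_bitmaps(adj, k):
    """bitmaps[u] has bit v set iff v is within k hops of u (u included)."""
    bitmaps = {u: 1 << u for u in adj}                 # N_0({u}) = {u}
    for _ in range(k):
        # N_j({u}) is built from the (j-1)-step bitmaps of u and its neighbors
        bitmaps = {u: bitmaps[u] | _or_all(bitmaps[v] for v in adj[u])
                   for u in adj}
    return bitmaps

def marginal_gain(covered, bitmap_u):
    """|N_k(S ∪ {u})| - |N_k(S)| computed with one bitwise OR."""
    return bin(covered | bitmap_u).count("1") - bin(covered).count("1")

# Hypothetical path graph 0-1-2-3-4-5
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
maps = k_step_bitmaps(adj, k=2)
gain = marginal_gain(maps[0], maps[3])
```

With FM Sketches in place of exact bitmaps the same OR-based update applies, and the population counts are replaced by sketch estimates.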

Experimental Studies
We conduct experiments on 5 real networks (3 collaboration networks, 1 citation network, and 1 social network).
We show some results on Flickr, a popular photo-sharing website (data from the ASU social computing data repository).
Undirected social network (80,513 nodes, 5,899,882 edges, and 195 different groups)

Some Testing Results on Flickr

Make a Top-K Algorithm Diversified

The result of searching “apple” in Google image

Existing top-K search algorithms
Search results are ranked independently.
When searching “apple” in Google image search, 9 out of the top 15 results are the logo of Apple Inc.

Structural Keyword Search (1)

Example: Keyword Search in Graphs
Input: a graph with text information on each node, and a user-given keyword query
Output: the top-k minimal Steiner trees that contain all user-given keywords
Query keywords: “graph patterns”, “keyword search”

[Figure: four answer trees v1 to v4 over DBLP. Each tree is rooted at a1 (Author: Jiawei Han) and reaches two papers through Write actions w1 to w4.]
v1: a1-w1-p1 and a1-w2-p2
v2: a1-w1-p1 and a1-w4-p4
v3: a1-w3-p3 and a1-w2-p2
v4: a1-w3-p3 and a1-w4-p4
Papers shown for p1 to p4: Mining Graph Patterns; Mining Significant Graph Patterns by Leap Search; Optimizing Index for Taxonomy Keyword Search; Keyword Search in Text Cube: Finding Top-k Aggregated Cell Documents

Structural Keyword Search (2)

[Figure: the four answer trees with their scores: v1 score = 0.8, v2 score = 0.5, v3 score = 0.5, v4 score = 0.4. Pairs of results are linked by similarity values: four pairs have similarity 0.6 and two pairs have similarity 0.2.]

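Suppose the similarity of two results is given, as in the values shown above, and let a similarity threshold be fixed.
One pair of results is better than another pair when the results in the other pair are similar to each other.
Among pairs whose results are not similar, the pair with the larger total score is better.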

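Diversified Top-K
We should consider both similarity and score.
Let S = {v_1, v_2, …} be a list of search results.
Let score(v_i) be the score of result v_i.
Let sim(v_i, v_j) be the similarity of v_i and v_j.
For any v_i, v_j in S, v_i and v_j are similar if sim(v_i, v_j) is above τ, where τ is a user-given threshold.
Diversified top-K results D(S)
D(S) is a subset of S.
At most K results: |D(S)| ≤ K.
No two results in D(S) are similar.
The total score of the results in D(S) is maximized.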

A Diversity Graph

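[Figure: an example diversity graph over results v1 to v6; the node scores shown are 10, 8, 7, 7, 6, and 1]
Diversity Graph
Undirected graph G(S) = (V, E), where V = S
For v_i, v_j in S, there is an edge (v_i, v_j) in E iff v_i is similar to v_j
The diversified top-K result set D(S) is an independent set of G(S)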

𝐾=2 ,𝐷={𝑣1 ,𝑣2 } 𝐾=3 ,𝐷={𝑣1 ,𝑣2 }

Existing Top-K Search Frameworks

Most existing top-K search frameworks avoid exploring all search results by finding an early stop condition.
Incremental Top-K
Results are generated one by one in ranked order.
Stops when K results are output.
Bounding Top-K
Results are generated not necessarily in ranked order.
A non-increasing score upper bound ū for the unseen results is maintained.
Stops when the K-th largest score generated is no smaller than ū.

Our Framework
We support the existing top-K frameworks
Results are generated one by one.
Stops if a certain stop condition is satisfied.
Our framework
We extend the existing algorithms to get top-K diversified results by three new functions.
sufficient(): a new early stop condition
necessary(): the necessary stop condition
div-search(): search the top-K diversified results on the current results
Step 1
• Check the stop condition sufficient()
• Stop if sufficient() is satisfied
Step 2
• Generate the next result using the original top-K algorithm
Step 3
• Check the necessary() condition
• If necessary() is satisfied, search the diversified top-K results using div-search()
• Go to Step 1
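The three steps can be sketched as a loop around an incremental top-K generator. Several simplifications are assumed here: results arrive in non-increasing score order, so the last score seen bounds every unseen score; necessary() is treated as always satisfied; div-search is a brute-force search; and the upper bound best(S) is taken as the maximum over i of score(D_i(S)) + (K − i) · ū. All names are hypothetical.

```python
from itertools import combinations

def div_search(results, similar, K):
    """best[i], picks[i]: top score and members of a non-similar subset of size <= i."""
    best, picks = [0] * (K + 1), [()] * (K + 1)
    for size in range(1, K + 1):
        best[size], picks[size] = best[size - 1], picks[size - 1]
        for cand in combinations(results, size):
            if any(similar(a[0], b[0]) for a, b in combinations(cand, 2)):
                continue
            total = sum(score for _, score in cand)
            if total > best[size]:
                best[size], picks[size] = total, cand
    return best, picks

def diversified_search(ranked_results, similar, K):
    seen, best, picks = [], [0] * (K + 1), [()] * (K + 1)
    for name, score in ranked_results:            # Step 2: generate the next result
        seen.append((name, score))
        best, picks = div_search(seen, similar, K)   # Step 3, necessary() assumed true
        unseen_bound = score                      # later results cannot score higher
        optimistic = max(best[i] + (K - i) * unseen_bound for i in range(K + 1))
        if best[K] >= optimistic:                 # Step 1: sufficient()
            break
    return [name for name, _ in picks[K]], best[K], len(seen)

SIMILAR = {frozenset(p) for p in
           [("v1", "v2"), ("v1", "v3"), ("v2", "v4"), ("v3", "v4")]}
similar = lambda a, b: frozenset((a, b)) in SIMILAR
ranked = [("v1", 8), ("v2", 5), ("v3", 5), ("v4", 4), ("v5", 1)]
names, total, consumed = diversified_search(ranked, similar, K=2)
```

In this run the loop stops after four results: no combination involving an unseen result can beat the pair already found, so the fifth result is never generated.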

Sufficient Stop Condition
Sufficient stop condition sufficient()
S: the set of current generated results
best(S): an upper bound of the optimal solution calculated from the current generated results
D_i(S): the current diversified top-i results, with score score(D_i(S))
ū: the score upper bound of all unseen results
For each i, in the ideal situation for the unseen results, all the remaining K − i results are set to have score ū
We have best(S) = max over 0 ≤ i ≤ K of { score(D_i(S)) + (K − i) · ū }
The sufficient stop condition is score(D_K(S)) ≥ best(S)

Necessary Stop Condition

Necessary stop condition necessary()
S: the set of current generated results
Assume the stop condition of the original algorithm is satisfied; otherwise the algorithm cannot stop
S′: the set of results at the last time necessary() was satisfied (or ∅ if necessary() has never been satisfied)
If D_i(S′) ≠ ∅ for a certain i, we need at least K − i more results generated in order to get K results
The necessary stop condition is |S| ≥ |S′| + K − max{ i | 1 ≤ i ≤ K, D_i(S′) ≠ ∅ }

The Possible Search Algorithms

[Figure: diversity graph G for K = 100, drawn twice to mark the greedy and the optimal solutions. v0 has score 100, v1 to v100 have score 99 each, and u1 to u100 have scores 0.5, 1, 1, …, 1.]

Greed is Not Good
Given the diversity graph G(S) for the current generated result set S
Finding D(S) on G(S) is an NP-hard problem
Example graph G (K = 100)
Greedy solution: score = 199
Optimal solution: score = 9900
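The gap between the two scores can be reproduced with a simple greedy rule that repeatedly takes the highest-scoring result not similar to anything already chosen. The graph below is a reconstruction under stated assumptions: v0 (score 100) is adjacent to v1..v100 (score 99 each), and u1..u100 are isolated with scores 0.5, 1, 1, …, 1.

```python
def greedy_independent(scores, adj, K):
    """Take results in non-increasing score order, skipping neighbors of chosen ones."""
    chosen, blocked = [], set()
    for v in sorted(scores, key=scores.get, reverse=True):
        if len(chosen) == K:
            break
        if v not in blocked:
            chosen.append(v)
            blocked.update(adj.get(v, ()))
    return chosen

# Assumed reconstruction of the example graph
scores = {"v0": 100, "u1": 0.5}
scores.update({f"v{i}": 99 for i in range(1, 101)})
scores.update({f"u{i}": 1 for i in range(2, 101)})
adj = {"v0": {f"v{i}" for i in range(1, 101)}}
for i in range(1, 101):
    adj[f"v{i}"] = {"v0"}

greedy = greedy_independent(scores, adj, K=100)
greedy_score = sum(scores[v] for v in greedy)
optimal_score = sum(scores[f"v{i}"] for i in range(1, 101))
```

Greedy commits to v0 first, which blocks all of v1..v100 and leaves only the low-scoring u nodes; skipping v0 yields a far better set, which motivates the exact algorithms that follow.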

Three New Search Algorithms
We propose three exact algorithms
div-astar: an A* based approach
div-dp: decompose div-astar using operator ⊕
div-cut: further decompose div-dp using operators ⊕ and ⊗

[Figure: search-space decomposition under div-astar, div-dp, and div-cut; every block is labeled NP]

An A* Based Approach

We use a heap to maintain partial solutions.
Each partial solution is a triple (solution, score, bound):
solution: the set of results selected in the partial solution
score: the total score of the results in the solution
bound: the upper bound of the score if the solution is expanded to a full solution
Entries in the heap are expanded in non-increasing order of bound.
The algorithm stops when the bound of the next solution is no larger than the score of the current best solution.

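An A* Based Approach
Calculation of the bound
The formula uses the set of adjacent nodes of a node in the diversity graph.
The equation is a relaxation of the optimal solution w.r.t. the partial solution.
One of its constraints is there to avoid generating redundant results.
The bound can be calculated in O(…) time in the worst case.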


An Example (K = 3)
[Figure: diversity graph with node scores v1 = 10, v2 = 8, v3 = 7, v4 = 7, v5 = 6, v6 = 3]
Each heap entry is written as (solution, score, bound).
Step 1: Expand (∅, 0, 25), generating ({v1}, 10, 21), ({v2}, 8, 8), ({v3}, 7, 20), ({v4}, 7, 13), ({v5}, 6, 6), ({v6}, 3, 3).
Step 2: Expand ({v1}, 10, 21), generating ({v1, v2}, 18, 18) and ({v1, v6}, 13, 13).
Step 3: Expand ({v3}, 7, 20), generating ({v3, v4}, 14, 20) and ({v3, v5}, 13, 13).
Step 4: Expand ({v3, v4}, 14, 20), generating ({v3, v4, v5}, 20, 20).
Step 5: Expand ({v3, v4, v5}, 20, 20). The current best score is 20 and the next best bound is 18: stop.
Optimal solution: {v3, v4, v5}
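A best-first search in the spirit of the A* approach can be sketched as below. The bound used here is an assumption: the score of the partial solution plus the largest remaining scores among later, non-adjacent nodes. The edge list is a hypothetical graph chosen so that the heap entries listed in the example (such as the bounds 25, 21, 20, and 13) come out the same.

```python
import heapq

def div_astar(scores, adj, K):
    """Best-first search for a max-score independent set with at most K nodes."""
    order = sorted(scores, key=scores.get, reverse=True)     # non-increasing score
    pos = {v: i for i, v in enumerate(order)}

    def candidates(sol):
        start = pos[sol[-1]] + 1 if sol else 0    # later nodes only: no duplicate sets
        banned = set().union(*(adj.get(v, ()) for v in sol))
        return [v for v in order[start:] if v not in banned]

    def bound(sol, score):
        # assumed relaxation: add the best remaining scores, ignoring their conflicts
        return score + sum(scores[v] for v in candidates(sol)[:K - len(sol)])

    best_score, best_sol = 0, ()
    heap = [(-bound((), 0), 0, ())]               # max-heap via negated bound
    while heap:
        neg_bound, score, sol = heapq.heappop(heap)
        if -neg_bound <= best_score:              # nothing left can beat the best
            break
        if len(sol) == K:
            continue
        for v in candidates(sol):
            child, child_score = sol + (v,), score + scores[v]
            if child_score > best_score:
                best_score, best_sol = child_score, child
            heapq.heappush(heap, (-bound(child, child_score), child_score, child))
    return set(best_sol), best_score

scores = {"v1": 10, "v2": 8, "v3": 7, "v4": 7, "v5": 6, "v6": 3}
edges = [("v1", "v3"), ("v1", "v4"), ("v1", "v5"), ("v2", "v3"), ("v2", "v4"),
         ("v2", "v5"), ("v2", "v6"), ("v4", "v6"), ("v5", "v6")]   # hypothetical
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)
answer = div_astar(scores, adj, K=3)
```

On this graph the search expands the empty set, {v1}, {v3}, and {v3, v4}, then stops when the next bound no longer exceeds the best score found.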

A DP Based Approach
The diversity graph may contain many disconnected components.
It is costly to apply the A* algorithm on the whole diversity graph.
Combine the results of disconnected components using operator ⊕, based on Dynamic Programming (DP).
Dynamic Programming
Suppose G contains two disconnected components G1 and G2.
State G.s_i: the optimal score of the diversified top-i results on G
State transition equation:
G.s_i = max over 0 ≤ j ≤ i of { G1.s_j + G2.s_(i−j) }
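The state transition can be sketched as a small function over per-component score arrays. The sketch assumes that s[i] is the best score using exactly i results and marks sizes with no solution as None; the function name is hypothetical. The input arrays follow the two component tables of the deck's example, reading their zero entries for i = 4 and i = 5 as "no solution".

```python
def combine_components(s1, s2):
    """G.s_i = max over j of (G1.s_j + G2.s_(i-j)); None marks an infeasible size."""
    K = len(s1) - 1
    out = [None] * (K + 1)
    for i in range(K + 1):
        options = [s1[j] + s2[i - j] for j in range(i + 1)
                   if s1[j] is not None and s2[i - j] is not None]
        out[i] = max(options) if options else None
    return out

g1 = [0, 10, 18, 20, None, None]
g2 = [0, 10, 18, 22, None, None]
g = combine_components(g1, g2)
```

Because the two components share no edges, any choice of j results from one and i − j from the other is a valid independent set, which is why the maximum over the split point suffices.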

A DP Based Approach

An Example (K = 5)
[Figure: a diversity graph G made of two disconnected components, G1 and G2]
G = G1 ⊕ G2, with the score s of the top-i solution for each i:

i    G1.s    G2.s    G.s
0    0       0       0
1    10      10      10
2    18      18      20
3    20      22      28
4    0       0       36
5    0       0       40

A Cut Point Based Approach
Cut point of graph G
Suppose G is a connected graph.
A cut point is a point whose removal makes G disconnected.
G can be further decomposed using cut points.
Suppose c is a cut point of G; there are two situations:
G.ex(c): c is excluded from the final solution. After removing c, G becomes several disconnected components.
G.in(c): c is included in the final solution. After removing c and all of c’s adjacent nodes, G becomes several disconnected components. Add c to each result in G.in(c).
G.ex(c) and G.in(c) are combined using operator ⊗ to compute G.
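Cut points can be found with the standard depth-first search that tracks discovery times and low-link values. This is a textbook technique shown for illustration and is not necessarily the procedure used by the authors; the graph, two triangles sharing node "c", is hypothetical.

```python
def cut_points(adj):
    """Articulation points of an undirected graph via DFS low-link values."""
    disc, low, result = {}, {}, set()

    def dfs(u, parent):
        disc[u] = low[u] = len(disc)
        children = 0
        for v in adj[u]:
            if v == parent:
                continue
            if v in disc:                      # back edge
                low[u] = min(low[u], disc[v])
            else:                              # tree edge
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if parent is not None and low[v] >= disc[u]:
                    result.add(u)              # v's subtree cannot bypass u
        if parent is None and children > 1:
            result.add(u)                      # root with several DFS subtrees

    for root in adj:
        if root not in disc:
            dfs(root, None)
    return result

adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d", "e"],
       "d": ["c", "e"], "e": ["c", "d"]}
points = cut_points(adj)
```

Removing "c" splits this graph into the components {a, b} and {d, e}, which is the excluded case; the included case also removes the neighbors of "c". The recursive form is fine for small graphs, while large graphs would need an iterative DFS.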

A Cut Point Based Approach
Let c be a cut point of G.
Let G1 be the solution obtained by excluding c.
Let G2 be the solution obtained by including c.
G1 and G2 are mutually exclusive.
G.s_i: the optimal score of the diversified top-i results on G
Calculating G.s_i:
G.s_i = max{ G1.s_i, G2.s_i }
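Since the two cases are mutually exclusive, combining them reduces to an element-wise maximum over the score arrays. In the sketch below the function name is hypothetical, and the arrays are the score columns that appear in the deck's cut-point example, read here as the excluded and included cases.

```python
def combine_cases(excluded, included):
    """G.s_i = max(G1.s_i, G2.s_i) for every i."""
    return [max(a, b) for a, b in zip(excluded, included)]

g_ex = [0, 10, 20, 28, 36, 40]   # cut point excluded
g_in = [0, 13, 23, 33, 36, 39]   # cut point included
g = combine_cases(g_ex, g_in)
```

Including the cut point wins for small i, while excluding it wins at i = 5, so the combined array mixes entries from both cases.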

A Cut Point Based Approach
Handling multiple cut points
Step 1: Construct a cut-point tree (cptree)
Each node is associated with a cut point (a leaf node is associated with a virtual cut point).
Each edge is associated with a subgraph that connects two cut points (the subgraph can be empty or disconnected).
Step 2: Search the cptree in a bottom-up fashion
[Figure: a sample cptree with root c0, children c1, c2, c3, and leaves c4, c5, c6, whose edges are labeled G1 to G6; and a graph G built from subgraphs G1, G2, G3, G4 through the intermediate graphs G12 and G34, with cut points c12, c34, and c24]

A Cut Point Based Approach

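Suppose G1, G2, G3, and G4 have been computed.
We now compute G12 and G34.
An Example
[Figure: graph G built from G1, G2, G3, G4 through G12 and G34, with cut points c12, c34, and c24]
Computing an intermediate graph
(Case 1) the cut point is excluded
(Case 2) the cut point is included; each subgraph result is the one obtained after removing the adjacent nodes of the cut point from that subgraph
The two cases are then combined; the other intermediate graph can be computed similarly.
Computing G
(Case 1) the cut point is excluded
(Case 2) the cut point is included
The two cases are then combined.
Do not forget to add the cut point to all the results of the included case.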

A Cut Point Based Approach
An Example
[Figure: graph G with nodes w2 to w6 and subgraphs G1, G23, G3, and G4]
G = G.ex(w2) ⊗ G.in(w2), with the score s of the top-i solution for each i:

i    G.ex(w2).s    G.in(w2).s    G.s
0    0             0             0
1    10            13            13
2    20            23            23
3    28            33            33
4    36            36            36
5    40            39            40

Further Improvements
Example: a node can be removed from G when there exists another node that satisfies the removal condition.
After the removal, other nodes become cut points.
[Figure: graph G before the removal and graph G′ after it, with nodes w1 to w6 and subgraphs G1, G23, G3, and G4]

Performance Studies Experimental Setup

We use 2 real datasets: Enwiki and Reuters
Enwiki: 11,930,681 articles from English Wikipedia
Reuters: 21,578 news articles from Reuters
Query: a set of keywords
Answer: top-K documents
We compare three algorithms
div-astar: A* based approach
div-dp: dynamic programming based approach
div-cut: cut point based approach
We vary 3 parameters:
K (two groups)
Small K: 40, 80, 120, 160, 200, default 120
Large K: 500, 700, 900, 1300, 2000, default 900
Similarity threshold τ: 0.4, 0.5, 0.6, 0.7, 0.8, default 0.6
Keyword frequency: 5 levels (1, 2, 3, 4, 5), default 3

Performance Studies
Score function: given a query Q and a document d,
score(Q, d) = Σ_{q∈Q} tf(q, d) × idf(q) / √len(d)
tf(q, d) is the term frequency of keyword q in d
idf(q) is the inverse document frequency of q for the dataset
len(d) is the total number of words in d
Similarity function: given two documents d1 and d2,
sim(d1, d2) = Σ_{w∈d1∩d2} idf(w) / Σ_{w∈d1∪d2} idf(w)
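Both functions can be sketched directly from the formulas. The sketch assumes that documents are lists of words, that tf is the raw count of a keyword in the document, and that idf values are supplied in a precomputed dictionary; the sample words and idf values are made up.

```python
import math
from collections import Counter

def score(query, doc_words, idf):
    """Sum of tf(q, d) * idf(q) over query keywords, divided by sqrt(len(d))."""
    tf = Counter(doc_words)                    # raw counts (assumed tf definition)
    total = sum(tf[q] * idf.get(q, 0.0) for q in query)
    return total / math.sqrt(len(doc_words))

def sim(doc1_words, doc2_words, idf):
    """idf-weighted overlap: shared-word idf mass over union idf mass."""
    a, b = set(doc1_words), set(doc2_words)
    union = sum(idf.get(w, 0.0) for w in a | b)
    if union == 0:
        return 0.0
    return sum(idf.get(w, 0.0) for w in a & b) / union

idf = {"graph": 2.0, "mining": 1.0, "search": 2.0, "keyword": 1.0}   # made-up values
d1 = ["graph", "mining", "graph", "search"]
d2 = ["keyword", "search"]
relevance = score(["graph"], d1, idf)
overlap = sim(d1, d2, idf)
```

Here the keyword "graph" occurs twice in a four-word document, and the two documents share only the word "search", so a threshold of 0.6 would not mark them as similar.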

Performance Studies

[Figures: results when varying K on Enwiki, with panels for small K and large K]

Conclusion
We study diversified ranking.
We study the diversified top-K search problem.
The diversity uses only the similarity of the search results themselves.
We propose a framework such that most top-K algorithms can be easily extended to handle diversified top-K search.

APWeb 2013 in Sydney, Australia
The 15th International Asia-Pacific Web Conference (APWeb), 4-6 April, 2013, Sydney, Australia
Just before ICDE 2013.
Paper Submission Deadline: October 20.
Three Keynote Speakers
H.V. Jagadish (University of Michigan)
Dan Suciu (University of Washington)
Mark Sanderson (RMIT)
A Special Issue on WWW Journal

Research Postgraduate Study at SEEM/CUHK [www.se.cuhk.edu.hk/programmes]
Research Postgraduate Programs
M.Phil, PhD, M.Phil-PhD (Articulated)
Deadlines:
December 1, 2012 (First Round)
January 31, 2013 (Official Final Round). However, due to Chinese New Year, submit early, before January 20.
Postgraduate Studentship: HK$13,600 per month (non-taxable)
Current Tuition Fees: HK$42,100/year
Hong Kong PhD Fellowship Scheme 2013-2014 (135 positions in HK)
Deadline: December 1, 2012
Monthly stipend of HK$20,000
10,000 travel allowance
Current Tuition Fees: HK$42,100/year

Taught Postgraduate Study at SEEM/CUHK [www.se.cuhk.edu.hk/programmes]
Taught Postgraduate Programmes
MSc Programme in SEEM (Systems Engineering and Engineering Management)
MSc Programme in ECLT (E-Commerce and Logistics Technologies)
Current Tuition Fees: (Provisional) HK$128,000
Full-Time One-Year study in HK
Application deadline:
1st Round: January 15, 2013
2nd Round: March 15, 2013
Early applications are encouraged; offers may be made to eligible applicants well before March 15.

Thank you! Questions?