challenging problems for scalable mining of heterogeneous social and information networks by jiawei...

Challenging Problems for Scalable Mining of Heterogeneous Social and Information Networks

Jiawei Han, Computer Science, University of Illinois at Urbana-Champaign

Collaborated with many, especially Yizhou Sun, Ming Ji, Chi Wang, Tim Weninger, Xiaoxin Yin, Bo Zhao

Acknowledgements: ARL, NSF, AFOSR (MURI), NASA, Microsoft, IBM, Yahoo!, Boeing

April 13, 2023


Outline

Why Is Mining Heterogeneous Social and Info Networks Promising?

Homogeneous vs. Heterogeneous Social and Info. Networks

On the Power of Mining Structured, Heterogeneous Social and Info. Networks

Challenges on BigMine: Scalable Mining of Massive Heterogeneous Social and Information Networks

PathSim: Online, Query-Based Similarity Search

PathPredict: Query-Based Prediction Using Meta-Path

Efficient Hidden Network Discovery: A Scalability Challenge

Conclusions

Where There Is Information, There Are Networks!

Social Networking Websites

Biological Network: Protein Interaction

Research Collaboration Network

Product Recommendation Network via Emails

The Real World: Heterogeneous Networks

Multiple object types and/or multiple link types

DBLP Bibliographic Network: Venue, Paper, Author

The IMDB Movie Network: Actor, Movie, Director, Studio

Homogeneous networks are information-loss projections of heterogeneous networks!

The Facebook Network

Directly mining information-richer heterogeneous networks

Structured Heterogeneous Network Modeling Leads to the New Power of Data Mining!

DBLP: A Computer Science bibliographic database

A sample publication record in DBLP (>2 M papers, >0.7 M authors, >10 K venues), …


Power of heterogeneous network modeling: Treat Author, Venue, Term, and Paper all as first-class citizens!
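As a minimal sketch of what treating every object type as first-class can look like, the snippet below stores typed links and then projects them onto a homogeneous co-author graph. The records, link names, and helper functions are hypothetical illustrations, not DBLP data or an API from the talk. The projection keeps only who wrote with whom, while the typed network can still answer at which venue.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records: (paper, authors, venue). Not taken from DBLP.
PAPERS = [
    ("p1", ["Ann", "Bob"], "KDD"),
    ("p2", ["Ann", "Cat"], "VLDB"),
    ("p3", ["Bob", "Cat"], "KDD"),
]

def build_heterogeneous(papers):
    """Typed links: (source type, link type, target type) -> set of (source, target)."""
    links = defaultdict(set)
    for paper, authors, venue in papers:
        links[("Paper", "published_in", "Venue")].add((paper, venue))
        for author in authors:
            links[("Author", "writes", "Paper")].add((author, paper))
    return links

def project_coauthor(links):
    """Homogeneous projection: co-author pairs with the number of shared papers."""
    by_paper = defaultdict(set)
    for author, paper in links[("Author", "writes", "Paper")]:
        by_paper[paper].add(author)
    weights = defaultdict(int)
    for authors in by_paper.values():
        for a, b in combinations(sorted(authors), 2):
            weights[(a, b)] += 1
    return dict(weights)

def shared_venues(links, a, b):
    """Answerable only with typed links: where did a and b publish together?"""
    writes = links[("Author", "writes", "Paper")]
    venue_of = dict(links[("Paper", "published_in", "Venue")])
    papers = {p for x, p in writes if x == a} & {p for x, p in writes if x == b}
    return {venue_of[p] for p in papers}

network = build_heterogeneous(PAPERS)
coauthors = project_coauthor(network)
print(coauthors)                             # every pair has weight 1
print(shared_venues(network, "Ann", "Bob"))  # the venue is lost in the projection
```

In the projected graph all three pairs look identical, which is the information loss the slide refers to.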


On the Power of Mining Structured, Heterogeneous Networks

Links carry a lot of hidden information in structured, heterogeneous social and information networks

Effectiveness of mining

Clustering in heterogeneous networks: rank-based clustering (RankClus [EDBT’09] and NetClus [KDD’09]) and user-guided, meta-path-based clustering [KDD’12]

Knowledge propagation through heterogeneous links (GNetMine [ECMLPKDD’10]) and rank-based classification (RankClass [KDD’11])

Meta-path-based similarity search (PathSim [VLDB’11])

Meta-path-based prediction in heterogeneous networks (PathPredict [ASONAM’11])

RankClus: Integrated Clustering and Ranking

Highly ranked objects are more important (i.e., more weighted) in a cluster than weakly ranked ones

Ranking will make more sense within one cluster than in multiple clusters

Ranking, as a feature of the cluster, is conditional on a specific cluster

[Figure: RankClus on a venue-author network. Venues: SIGMOD, SDM, ICDM, KDD, EDBT, VLDB, ICML, AAAI. Authors: Tom, Jim, Lucy, Mike, Jack, Tracy, Cindy, Bob, Mary, Alice. Panel labels: Objects, Sub-Network, Ranking, Clustering.]

Clustering and ranking mutually enhance each other at each iteration

RankClus [EDBT’09]: An efficient, EM-like algorithm
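The mutual enhancement of ranking and clustering can be illustrated with a heavily simplified rank-then-reassign loop. This is a toy sketch in the spirit of the idea, not the published RankClus algorithm: it uses a plain link-share ranking and an argmax reassignment instead of the mixture-model estimation, and the link weights and initial assignment are hypothetical.

```python
from collections import defaultdict

# Hypothetical venue -> {author: paper count} links (toy data, not from DBLP).
W = {
    "SIGMOD": {"Tom": 3, "Jim": 2},
    "VLDB": {"Tom": 2, "Jim": 3},
    "KDD": {"Lucy": 3, "Mike": 2},
    "ICDM": {"Lucy": 2, "Mike": 3, "Tom": 1},
}
INIT = {"SIGMOD": 0, "VLDB": 0, "KDD": 0, "ICDM": 1}  # deliberately poor start

def rank_authors(members):
    """Simple conditional ranking: an author's share of the links inside one cluster."""
    counts = defaultdict(float)
    for venue in members:
        for author, n in W[venue].items():
            counts[author] += n
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()} if total else {}

def rank_then_reassign(assign, k, max_iter=20):
    """Alternate: rank authors per cluster, then move each venue to its best cluster."""
    for _ in range(max_iter):
        ranks = [rank_authors([v for v, c in assign.items() if c == j]) for j in range(k)]
        new_assign = {}
        for venue, authors in W.items():
            # Score of a venue under cluster j: its links weighted by cluster-j author ranks.
            scores = [sum(n * ranks[j].get(a, 0.0) for a, n in authors.items())
                      for j in range(k)]
            new_assign[venue] = max(range(k), key=scores.__getitem__)
        if new_assign == assign:  # converged: ranks match the final clusters
            break
        assign = new_assign
    return assign, ranks

assign, ranks = rank_then_reassign(INIT, k=2)
print(assign)
print(ranks)
```

On this toy input, KDD starts in the wrong cluster and moves next to ICDM after one round, because the ranking computed inside the ICDM cluster puts Lucy and Mike on top.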


NetClus: Ranking & Clustering with Star Network Schema [KDD’09]

Beyond bi-typed information network: A Star Network Schema

Split a network into different layers, each represented by a net-cluster

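[Figure: star network schema. Research Paper (P) is linked to Term (T) by Contain, to Author (A) by Write, and to Venue (V) by Publish. NetClus splits the Computer Science network into net-clusters such as Database, Hardware, and Theory.]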

NetClus: Database System Cluster

Term

database 0.0995511
databases 0.0708818
system 0.0678563
data 0.0214893
query 0.0133316
systems 0.0110413
queries 0.0090603
management 0.00850744
object 0.00837766
relational 0.0081175
processing 0.00745875
based 0.00736599
distributed 0.0068367
xml 0.00664958
oriented 0.00589557
design 0.00527672
web 0.00509167
information 0.0050518
model 0.00499396
efficient 0.00465707

Author

Surajit Chaudhuri 0.00678065
Michael Stonebraker 0.00616469
Michael J. Carey 0.00545769
C. Mohan 0.00528346
David J. DeWitt 0.00491615
Hector Garcia-Molina 0.00453497
H. V. Jagadish 0.00434289
David B. Lomet 0.00397865
Raghu Ramakrishnan 0.0039278
Philip A. Bernstein 0.00376314
Joseph M. Hellerstein 0.00372064
Jeffrey F. Naughton 0.00363698
Yannis E. Ioannidis 0.00359853
Jennifer Widom 0.00351929
Per-Ake Larson 0.00334911
Rakesh Agrawal 0.00328274
Dan Suciu 0.00309047
Michael J. Franklin 0.00304099
Umeshwar Dayal 0.00290143
Abraham Silberschatz 0.00278185

Venue

VLDB 0.318495
SIGMOD Conf. 0.313903
ICDE 0.188746
PODS 0.107943
EDBT 0.0436849

Go one level deeper: Authors in the XML, XQuery cluster

Rank-Based Clustering for Others

RankCompete: Organize your photo album automatically!

Rank treatments for AIDS from MEDLINE

Classification in Heterogeneous Networks

GNetMine [ECMLPKDD'10]: Knowledge propagation across heterogeneous links

RankClass [KDD’11]: Integration of ranking and classification in heterogeneous network analysis

Highly ranked objects play a larger role in classification

An object can only be ranked high in some focused classes

Class membership and ranking are statistical distributions

Let ranking and classification mutually enhance each other!

Output: Classification results + ranking list of objects within each class

Experiments with Very Small Training Set

DBLP: four-field data set (DB, DM, AI, IR) forming a heterogeneous information network

Rank objects within each class (with extremely limited label information)

Obtain high classification accuracy and excellent rankings within each class

Top-5 ranked conferences

Database    Data Mining    AI       IR
VLDB        KDD            IJCAI    SIGIR
SIGMOD      SDM            AAAI     ECIR
ICDE        ICDM           ICML     CIKM
PODS        PKDD           CVPR     WWW
EDBT        PAKDD          ECML     WSDM

Top-5 ranked terms

Database    Data Mining       AI           IR
data        mining            learning     retrieval
database    data              knowledge    information
query       clustering        reasoning    web
system      classification    logic        search
xml         frequent          cognition    text

Similarity Search: Find Similar Objects in Networks

Who are most similar to Christos Faloutsos?

Meta-Path: Meta-level description of a path between two objects

Meta-Path: Author-Paper-Author (APA): Christos’s students or close collaborators

Meta-Path: Author-Paper-Venue-Paper-Author (APVPA): Similar reputation at similar venues

Schema of the DBLP Network

Different meta-paths lead to very different results!

Different meta-paths carry rather different semantics
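To make the difference in semantics concrete, the sketch below counts path instances under APA and under APVPA by multiplying small sparse adjacency matrices. The toy authors, papers, and the dict-of-dicts matrix layout are assumptions made for illustration only.

```python
from collections import defaultdict

def matmul(X, Y):
    """Multiply sparse matrices stored as {row: {col: value}}."""
    out = defaultdict(lambda: defaultdict(int))
    for i, row in X.items():
        for k, x in row.items():
            for j, y in Y.get(k, {}).items():
                out[i][j] += x * y
    return {i: dict(r) for i, r in out.items()}

def transpose(X):
    """Swap rows and columns of a sparse matrix."""
    out = defaultdict(dict)
    for i, row in X.items():
        for j, x in row.items():
            out[j][i] = x
    return dict(out)

# Hypothetical toy adjacency: author -> paper and paper -> venue.
AP = {"Ann": {"p1": 1, "p2": 1}, "Bob": {"p1": 1}, "Cat": {"p3": 1}}
PV = {"p1": {"KDD": 1}, "p2": {"VLDB": 1}, "p3": {"KDD": 1}}

# APA counts shared papers: it links direct collaborators only.
APA = matmul(AP, transpose(AP))

# APVPA counts paper pairs at a shared venue: it also links authors who never met.
AV = matmul(AP, PV)
APVPA = matmul(AV, transpose(AV))

print(APA["Cat"])    # Cat has no co-author, so only the self-path remains
print(APVPA["Cat"])  # Cat is connected to Ann and Bob through KDD
```

Bob and Cat never wrote a paper together, so APA gives them no path, yet APVPA connects them because both published at KDD.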

Which Similarity Measure Is Better?

Anhai Doan: CS, Wisconsin; Database area; PhD: 2002

Meta-Path: Author-Paper-Venue-Paper-Author (APVPA)

• Jignesh Patel: CS, Wisconsin; Database area; PhD: 1998

• Amol Deshpande: CS, Maryland; Database area; PhD: 2004

• Jun Yang: CS, Duke; Database area; PhD: 2001

PathSim [VLDB’11]
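The slide does not spell out the measure. As defined in the PathSim paper, for a symmetric meta-path with commuting matrix M the similarity is s(x, y) = 2 M[x][y] / (M[x][x] + M[y][y]). The sketch below applies it under APVPA to hypothetical author-venue counts (the names and numbers are made up) and contrasts it with the raw path count.

```python
import heapq

# Hypothetical author -> venue paper counts, i.e. the A-P-V half of APVPA.
AV = {
    "Mike": {"SIGMOD": 2, "VLDB": 1},
    "Jim": {"SIGMOD": 50, "VLDB": 20},
    "Bob": {"SIGMOD": 2, "VLDB": 1},
    "Ann": {"KDD": 1},
}

def path_count(x, y):
    """Number of APVPA path instances between x and y: entry (x, y) of AV * AV^T."""
    return sum(n * AV[y].get(v, 0) for v, n in AV[x].items())

def pathsim(x, y):
    """PathSim: path count normalized by the two self path counts."""
    denom = path_count(x, x) + path_count(y, y)
    return 2 * path_count(x, y) / denom if denom else 0.0

def top_k(query, k):
    """Return the k authors most similar to the query under PathSim."""
    others = (a for a in AV if a != query)
    return heapq.nlargest(k, others, key=lambda a: pathsim(query, a))

print(top_k("Mike", 2))
print(path_count("Mike", "Jim"), path_count("Mike", "Bob"))
```

The raw path count favors the highly visible Jim, while the normalization favors Bob, whose publication profile matches Mike's. That is the peer notion the Anhai Doan example illustrates.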

PathPredict: Meta-Path Based New Co-author Relationship Prediction in DBLP [ASONAM’11]

Co-authorship prediction: Whether two authors are going to collaborate for the first time

Co-authorship encoded in meta-path Author-Paper-Author (A-P-A)

Topological features encoded in meta-paths as below:

Meta-paths between authors under length 4

Meta-Path Semantic Meaning
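Candidate meta-paths can be enumerated mechanically from the network schema. The sketch below walks a hypothetical schema with Author (A), Paper (P), Term (T), and Venue (V) types and lists every type sequence from Author back to Author within a link budget. The schema and the budget are assumptions, and the output is not the table from the original slide.

```python
# Hypothetical schema: undirected links between object types.
SCHEMA = {
    "A": ["P"],            # Author - Paper
    "P": ["A", "T", "V"],  # Paper - Author / Term / Venue
    "T": ["P"],            # Term - Paper
    "V": ["P"],            # Venue - Paper
}

def meta_paths(start, end, max_len):
    """Enumerate type sequences from start to end that use at most max_len links."""
    found = []
    stack = [[start]]
    while stack:
        path = stack.pop()
        if len(path) > 1 and path[-1] == end:
            found.append("-".join(path))
        if len(path) - 1 < max_len:  # still within the link budget, keep extending
            for nxt in SCHEMA[path[-1]]:
                stack.append(path + [nxt])
    return sorted(found)

paths = meta_paths("A", "A", max_len=4)
for p in paths:
    print(p)
```

Each enumerated meta-path can then serve as one topological feature, for example the count of its path instances between a candidate author pair.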


The Success of PathPredict: Exploring Meta-Paths

Explain the prediction power of each meta-path: Wald test for logistic regression

Higher prediction accuracy than using the projected homogeneous network: 11% higher in prediction accuracy

Citation prediction: The selected meta-paths could be rather different

Co-author prediction for Jian Pei: Only 42 among 4809 candidates are true first-time co-authors! (Features collected in [1996, 2002]; test period in [2003, 2009])

Challenges on BigMine

Scalable mining of massive information networks: Necessity

Many such networks are gigantic: News, PubMed, …

DBLP is a small one: 2M papers and 0.8M authors, …

Meta-path: Potentially long chains of matrix multiplication over such networks

APVPA: AP × PV × VP × PA

Comparative analysis of multiple meta-paths is costly
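For a single query, the APVPA chain does not have to be materialized as a full author-by-author matrix. One hedged way to see this is to push a sparse indicator vector for the query author through the chain, one vector-matrix product at a time. The toy matrices and helper names below are hypothetical.

```python
def vecmat(vec, M):
    """Multiply a sparse row vector {k: value} by a sparse matrix {k: {j: value}}."""
    out = {}
    for k, x in vec.items():
        for j, y in M.get(k, {}).items():
            out[j] = out.get(j, 0) + x * y
    return out

def transpose(M):
    """Swap rows and columns of a sparse matrix."""
    out = {}
    for i, row in M.items():
        for j, x in row.items():
            out.setdefault(j, {})[i] = x
    return out

# Hypothetical toy adjacency: author -> paper and paper -> venue.
AP = {"Ann": {"p1": 1, "p2": 1}, "Bob": {"p1": 1}, "Cat": {"p3": 1}}
PV = {"p1": {"KDD": 1}, "p2": {"VLDB": 1}, "p3": {"KDD": 1}}

def query_row(author, chain):
    """Row of the commuting matrix for one author, computed left to right."""
    vec = {author: 1}
    for M in chain:
        vec = vecmat(vec, M)
    return vec

APVPA_CHAIN = [AP, PV, transpose(PV), transpose(AP)]  # AP x PV x VP x PA
row = query_row("Ann", APVPA_CHAIN)
print(row)
```

Each step touches only the nonzero entries reachable from the query, so the work scales with the query's neighborhood rather than with the size of the full matrix product.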

Scalable mining of massive information networks: Possibility

Many functions do not need to compute eigenvalues

Top-k computation may save computation cost substantially

Precomputation may save online computation substantially

Clustering-based precomputation:

Computing Eigenvalues: When Is It Needed?

Computations needed: clustering (RankClus), classification (RankClass), similarity search (PathSim), prediction (PathPredict)

A small number of iterative processing steps (e.g., EM-style)

Meta-path-based prediction: Selection from a set of “parallel” meta-paths

Long Meta-Path May Not Carry the Right Semantics

Repeat the meta-path 2, 4, and infinite times for conference similarity query

Top-K Computation Is What We Need

Similarity search: “Who are similar to Christos?”

There is no need/interest to calculate and rank the remaining 0.8M authors

Only top-k (e.g., top-100) authors are needed in practice

Lots of optimizations can be explored for top-k computation

Precomputation vs. online computation

Precomputation of long meta-paths will save costly online multi-matrix multiplication

Clustering-based precomputation

Example: top-k similar authors

Precomputation by clustering: only compute within groups of rather similar authors

Co-Clustering-Based Pruning Algorithm

General idea: Store commuting matrices for short path schemas and compute top-k queries online

Framework

Generate co-clusters for materialized commuting matrices, for feature objects and target objects

Derive upper bounds for the similarity between an object and a target cluster, and between an object and an object

Safely prune target clusters and objects if the upper-bound similarity is lower than the current threshold

Dynamically update the top-k threshold
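The pruning framework can be sketched with a deliberately simple bound. The version below assumes hand-made target clusters and hypothetical author-venue counts, and it bounds PathSim for a whole cluster from two offline summaries: the per-venue maximum count and the smallest self path count. This is an illustrative bound, not the co-clustering bound derived in the PathSim paper.

```python
import heapq

# Hypothetical author -> venue counts and a hand-made grouping of target authors.
AV = {
    "Mike": {"SIGMOD": 2, "VLDB": 1},
    "Bob": {"SIGMOD": 2, "VLDB": 1},
    "Jim": {"SIGMOD": 50, "VLDB": 20},
    "Ann": {"KDD": 3},
    "Eve": {"KDD": 5, "ICDM": 2},
}
CLUSTERS = {"db": ["Bob", "Jim"], "dm": ["Ann", "Eve"]}

def dot(u, v):
    return sum(x * v.get(k, 0) for k, x in u.items())

def pathsim(x, y):
    return 2 * dot(AV[x], AV[y]) / (dot(AV[x], AV[x]) + dot(AV[y], AV[y]))

def summarize(members):
    """Offline summary: per-venue maximum count and the smallest self path count."""
    max_vec = {}
    for y in members:
        for v, n in AV[y].items():
            max_vec[v] = max(max_vec.get(v, 0), n)
    return max_vec, min(dot(AV[y], AV[y]) for y in members)

SUMMARY = {c: summarize(m) for c, m in CLUSTERS.items()}

def upper_bound(x, c):
    """No member of cluster c can exceed this PathSim value (counts are nonnegative)."""
    max_vec, min_self = SUMMARY[c]
    return min(1.0, 2 * dot(AV[x], max_vec) / (dot(AV[x], AV[x]) + min_self))

def top_k(x, k):
    heap, pruned = [], []  # heap[0] holds the current top-k threshold
    for c in sorted(CLUSTERS, key=lambda c: upper_bound(x, c), reverse=True):
        if len(heap) == k and upper_bound(x, c) < heap[0][0]:
            pruned.append(c)  # safe: nothing inside can beat the threshold
            continue
        for y in CLUSTERS[c]:
            item = (pathsim(x, y), y)
            if y == x:
                continue
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heapreplace(heap, item)  # threshold rises dynamically
    return sorted(heap, reverse=True), pruned

result, pruned = top_k("Mike", 1)
print(result, pruned)
```

Visiting clusters in descending bound order raises the threshold early, so the data-mining cluster is skipped here without computing a single exact similarity inside it.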

Similarity Search: Experiments on Efficiency

Searching for top-20 objects vs. 1001st-1020th objects: PathSim-pruning is more efficient than PathSim-baseline

The denser the corresponding commuting matrix, the more PathSim-pruning can improve

The more neighbors a query has, the more PathSim-pruning can improve

Then compare the efficiency under different top-k’s (k = 5, 10, 20) for PathSim-pruning using query set 1

A smaller k has stronger pruning power, and thus needs less execution time

PathPredict: Exploring Big Data Space

Scalable computation in really huge heterogeneous networks?

Sampling may lead to similar judgment on the importance of a meta-path

Query-dependent prediction can be “selective” and thus may not need that many resources

Precomputation and clustering may further enhance its efficiency

Mining Query-Relevant “Hidden” Networks

Query-relevant hidden networks

What is the hidden network closely relevant to “SVM”?

The network should contain a weighted network consisting of papers, terms, authors and venues

Is “kernel machine” closely relevant to “SVM”? How could we know it?

It takes substantial computation to derive such a “weighted/ranked” hidden heterogeneous network

Due to the diversity of queries (e.g., SVM + Cloud + SIGMOD), it is impossible to precompute every possible combination

How can we compute such a hidden network efficiently on the fly?

An interesting open problem

Conclusions

Heterogeneous social & information networks are ubiquitous

Most datasets can be “organized” or “transformed” into “structured” multi-typed heterogeneous info. networks

Examples: DBLP, IMDB, Flickr, Google News, Wikipedia, …

Surprisingly rich knowledge can be mined from structured heterogeneous info. networks

Clustering, ranking, classification, path prediction, …

Knowledge is power, but knowledge is hidden in massive but “relatively structured” nodes and links!

Challenge to BigMine: How to mine massive, heterogeneous information networks efficiently

Some progress/tricks on scalability and efficiency

Many open problems and much more to be explored!

From Data Mining to Mining Info. Networks

Han, Kamber and Pei, Data Mining, 3rd ed. 2011

Yu, Han and Faloutsos (eds.), Link Mining, 2010

Sun and Han, Mining Heterogeneous Information Networks, 2012

References

M. Ji, J. Han, and M. Danilevsky, "Ranking-Based Classification of Heterogeneous Information Networks", KDD'11

Y. Sun and J. Han, Mining Heterogeneous Information Networks: Principles and Methodologies, Morgan & Claypool Publishers, 2012

Y. Sun, J. Han, et al., "RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis", EDBT'09

Y. Sun, Y. Yu, and J. Han, "Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema", KDD'09

Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu, "PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks", VLDB'11

Y. Sun, R. Barber, M. Gupta, C. Aggarwal and J. Han, "Co-Author Relationship Prediction in Heterogeneous Bibliographic Networks", ASONAM'11

Y. Sun, J. Han, C. C. Aggarwal, N. Chawla, "When Will It Happen? Relationship Prediction in Heterogeneous Information Networks", WSDM'12

F. Tao, et al., "EventCube: Multi-Dimensional Search and Mining of Structured and Text Data", (system demo) KDD'13

C. Wang, J. Han, et al., "Mining Advisor-Advisee Relationships from Research Publication Networks", KDD'10

C. Wang, M. Danilevsky, et al., "A Phrase Mining Framework for Recursive Construction of a Topical Hierarchy", KDD'13

http://www.cs.uiuc.edu/homes/hanj/pdf/kdd13_ftao.pdf
