challenging problems for scalable mining of heterogeneous social and information networks by jiawei...
DESCRIPTION
In today’s interconnected world, social and informational entities are linked together, forming gigantic, integrated social and information networks. By structuring these data objects into multiple types, such networks become semi-structured heterogeneous social and information networks. Most real-world applications that handle big data, including social media and social networks, medical information systems, online e-commerce systems, and database systems, can be structured into typed, heterogeneous social and information networks. For example, in a medical care network, objects of multiple types, such as patients, doctors, diseases, and medications, and links such as visits, diagnoses, and treatments are intertwined, providing rich information and forming heterogeneous information networks. Effective analysis of large-scale heterogeneous social and information networks poses an interesting but critical challenge. In this talk, we present a set of data mining scenarios in heterogeneous social and information networks and show that mining typed, heterogeneous networks is a new and promising research frontier in data mining. However, such mining raises serious challenges for scalable computation. We identify a set of problems in scalable computation and call for serious study of them. These include how to efficiently compute (1) meta-path-based similarity search, (2) rank-based clustering, (3) rank-based classification, (4) meta-path-based link/relationship prediction, and (5) topical hierarchies from heterogeneous information networks. We introduce some recent efforts, discuss the trade-offs between query-independent pre-computation and query-dependent online computation, and point out some promising research directions.
TRANSCRIPT
Challenging Problems for Scalable Mining of Heterogeneous Social and Information Networks
Jiawei Han, Computer Science, University of Illinois at Urbana-Champaign
Collaborated with many, especially Yizhou Sun, Ming Ji, Chi Wang, Tim Weninger, Xiaoxin Yin, Bo Zhao
Acknowledgements: ARL, NSF, AFOSR (MURI), NASA, Microsoft, IBM, Yahoo!, Boeing
April 13, 2023
Outline
Why Is Mining Heterogeneous Social and Info Networks Promising?
Homogeneous vs. Heterogeneous Social and Info. Networks
On the Power of Mining Structured, Heterogeneous Social and Info. Networks
Challenges on BigMine: Scalable Mining of Massive Heterogeneous Social and Information Networks
PathSim: Online, Query-Based Similarity Search
PathPredict: Query-Based Prediction Using Meta-Path
Efficient Hidden Network Discovery: A Scalability Challenge
Conclusions
Where There Is Information, There Are Networks!
Social Networking Websites
Biological Network: Protein Interaction
Research Collaboration Network
Product Recommendation Network via Emails
The Real World: Heterogeneous Networks
Multiple object types and/or multiple link types
[Figure: DBLP Bibliographic Network (Venue, Paper, Author); The IMDB Movie Network (Actor, Movie, Director, Movie Studio)]
Homogeneous networks are information-loss projections of heterogeneous networks!
The Facebook Network
Directly mining information-richer heterogeneous networks
Structured Heterogeneous Network Modeling Leads to the New Power of Data Mining!
DBLP: A Computer Science bibliographic database
A sample publication record in DBLP (>2 M papers, >0.7 M authors, >10 K venues), …
Power of heterogeneous network modeling: Treat Author, Venue, Term, and Paper all as first-class citizens!
On the Power of Mining Structured, Heterogeneous Networks
Links carry a lot of hidden information in structured, heterogeneous social and information networks
Effectiveness of mining
Clustering in heterogeneous networks: rank-based clustering (RankClus [EDBT’09] and NetClus [KDD’09]) and user-guided, meta-path-based clustering [KDD’12]
Knowledge propagation through heterogeneous links (GNetMine [ECMLPKDD’10]) and rank-based classification (RankClass [KDD’11])
Meta-path-based similarity search (PathSim [VLDB’11])
Meta-path-based prediction in heterogeneous networks (PathPredict [ASONAM’11])
RankClus: Integrated Clustering and Ranking
Highly ranked objects are more important (i.e., carry more weight) in a cluster than weakly ranked ones
Ranking makes more sense within one cluster than across multiple clusters
Ranking, as a feature of the cluster, is conditional on a specific cluster
[Figure labels: Objects, Ranking, Sub-Network, Clustering; venues: SIGMOD, SDM, ICDM, KDD, EDBT, VLDB, ICML, AAAI; authors: Tom, Jim, Lucy, Mike, Jack, Tracy, Cindy, Bob, Mary, Alice]
Clustering and ranking mutually enhance each other at each iteration
RankClus [EDBT’09]: An efficient, EM-like algorithm
NetClus: Ranking & Clustering with Star Network Schema [KDD’09]
Beyond bi-typed information networks: a star network schema
Split a network into different layers, each represented by a net-cluster
[Figure: star schema with Research Paper (P) at the center, linked to Term (T) via Contain, Author (A) via Write, and Venue (V) via Publish; NetClus splits the Computer Science network into net-clusters such as Database, Hardware, and Theory]
NetClus: Database System Cluster

Term
database       0.0995511
databases      0.0708818
system         0.0678563
data           0.0214893
query          0.0133316
systems        0.0110413
queries        0.0090603
management     0.00850744
object         0.00837766
relational     0.0081175
processing     0.00745875
based          0.00736599
distributed    0.0068367
xml            0.00664958
oriented       0.00589557
design         0.00527672
web            0.00509167
information    0.0050518
model          0.00499396
efficient      0.00465707

Venue
VLDB           0.318495
SIGMOD Conf.   0.313903
ICDE           0.188746
PODS           0.107943
EDBT           0.0436849

Author
Surajit Chaudhuri        0.00678065
Michael Stonebraker      0.00616469
Michael J. Carey         0.00545769
C. Mohan                 0.00528346
David J. DeWitt          0.00491615
Hector Garcia-Molina     0.00453497
H. V. Jagadish           0.00434289
David B. Lomet           0.00397865
Raghu Ramakrishnan       0.0039278
Philip A. Bernstein      0.00376314
Joseph M. Hellerstein    0.00372064
Jeffrey F. Naughton      0.00363698
Yannis E. Ioannidis      0.00359853
Jennifer Widom           0.00351929
Per-Ake Larson           0.00334911
Rakesh Agrawal           0.00328274
Dan Suciu                0.00309047
Michael J. Franklin      0.00304099
Umeshwar Dayal           0.00290143
Abraham Silberschatz     0.00278185

Go one level deeper: Authors in the XML, XQuery cluster
Rank-Based Clustering for Others
RankCompete: Organize your photo album automatically!
Rank treatments for AIDS from MEDLINE
Classification in Heterogeneous Networks
GNetMine [ECMLPKDD’10]: Knowledge propagation across heterogeneous links
RankClass [KDD’11]: Integration of ranking and classification in heterogeneous network analysis
Highly ranked objects play a greater role in classification
An object can only be ranked high in some focused classes
Class membership and ranking are statistical distributions
Let ranking and classification mutually enhance each other!
Output: Classification results + ranking list of objects within each class
Experiments with Very Small Training Set
DBLP: 4-field data set (DB, DM, AI, IR) forming a heterogeneous information network
Rank objects within each class (with extremely limited label information)
Obtain high classification accuracy and excellent rankings within each class
Top-5 ranked conferences
Database    Data Mining    AI       IR
VLDB        KDD            IJCAI    SIGIR
SIGMOD      SDM            AAAI     ECIR
ICDE        ICDM           ICML     CIKM
PODS        PKDD           CVPR     WWW
EDBT        PAKDD          ECML     WSDM

Top-5 ranked terms
Database    Data Mining       AI           IR
data        mining            learning     retrieval
database    data              knowledge    information
query       clustering        reasoning    web
system      classification    logic        search
xml         frequent          cognition    text
Similarity Search: Find Similar Objects in Networks
Who are most similar to Christos Faloutsos?
Meta-Path: Meta-level description of a path between two objects
Meta-Path: Author-Paper-Author (APA): Christos’s students or close collaborators
Meta-Path: Author-Paper-Venue-Paper-Author (APVPA): Similar reputation at similar venues
Schema of the DBLP Network
Different meta-paths lead to very different results!
Different meta-paths carry rather different semantics
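To make the difference between meta-paths concrete, here is a minimal sketch that counts path instances by multiplying adjacency matrices. The toy network (3 authors, 4 papers, 2 venues) and the names `AP`, `PV`, `apa`, and `apvpa` are illustrative assumptions, not data from the talk.

```python
import numpy as np

# Hypothetical toy network: 3 authors, 4 papers, 2 venues.
# AP[a, p] = 1 if author a wrote paper p.
AP = np.array([[1, 1, 0, 0],
               [0, 1, 1, 0],
               [0, 0, 0, 1]])
# PV[p, v] = 1 if paper p appeared at venue v.
PV = np.array([[1, 0],
               [1, 0],
               [0, 1],
               [1, 0]])

# APA: entry (x, y) counts papers co-authored by x and y.
apa = AP @ AP.T

# APVPA: entry (x, y) counts pairs (paper of x, paper of y) at the same venue.
AV = AP @ PV
apvpa = AV @ AV.T

print("APA path counts:\n", apa)
print("APVPA path counts:\n", apvpa)
```

In this toy data, authors 0 and 2 never co-authored, so their APA count is 0, yet APVPA links them through a shared venue. The same pair looks unrelated under one meta-path and related under the other.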
Which Similarity Measure Is Better?
Meta-Path: Author-Paper-Venue-Paper-Author (APVPA)
Anhai Doan: CS, Wisconsin; Database area; PhD: 2002
• Jignesh Patel: CS, Wisconsin; Database area; PhD: 1998
• Amol Deshpande: CS, Maryland; Database area; PhD: 2004
• Jun Yang: CS, Duke; Database area; PhD: 2001
PathSim [VLDB’11]
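As a rough illustration of the PathSim measure, the sketch below applies its normalization to a commuting matrix `M` of a symmetric meta-path: the score of (x, y) is twice the number of paths from x to y, divided by the number of paths from x to itself plus the number from y to itself. The matrix values and the function name `pathsim` are assumptions made for this example, and the code assumes every object has at least one path to itself (no zero on the diagonal).

```python
import numpy as np

def pathsim(M):
    """PathSim scores from a commuting matrix M of a symmetric meta-path.

    s(x, y) = 2 * M[x, y] / (M[x, x] + M[y, y])
    Assumes the diagonal of M has no zeros.
    """
    M = np.asarray(M, dtype=float)
    d = np.diag(M)
    return 2.0 * M / (d[:, None] + d[None, :])

# Hypothetical APVPA path counts for three authors.
M = np.array([[4, 2, 2],
              [2, 2, 1],
              [2, 1, 1]])

S = pathsim(M)
print(np.round(S, 3))
```

Author 0 has more raw paths to author 1 than author 2 has, but the normalization penalizes mismatched visibility, so the scores favor peers with a similar number of self-paths.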
PathPredict: Meta-Path Based New Co-author Relationship Prediction in DBLP [ASONAM’11]
Co-authorship prediction: Whether two authors are going to collaborate for the first time
Co-authorship encoded in meta-path Author-Paper-Author (A-P-A)
Topological features encoded in meta-paths as below:
Meta-paths between authors under length 4
[Table: Meta-Path | Semantic Meaning]
The Success of PathPredict: Exploring Meta-Paths
Explain the prediction power of each meta-path: Wald Test for logistic regression
Higher prediction accuracy than using projected homogeneous network: 11% higher in prediction accuracy
Citation prediction: The selected meta-paths could be rather different
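The idea of feeding meta-path features into logistic regression can be sketched as follows. This is only a minimal illustration under stated assumptions: the two feature columns stand for path counts along two hypothetical meta-paths, the six training pairs are made-up toy numbers (not results from the paper), and the model is fit by plain gradient descent rather than any particular library. The Wald test mentioned on the slide is not implemented here.

```python
import numpy as np

def fit_logistic(X, y, lr=0.5, steps=5000):
    """Fit logistic regression by batch gradient descent; returns [bias, w1, w2, ...]."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))
        w -= lr * (Xb.T @ (p - y)) / len(y)
    return w

def predict_proba(w, X):
    """Predicted probability that each author pair will collaborate."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return 1.0 / (1.0 + np.exp(-(Xb @ w)))

# Toy features per candidate author pair (made-up numbers):
# column 0 = path count along a hypothetical meta-path such as A-P-A-P-A,
# column 1 = path count along a second hypothetical meta-path.
X = np.array([[2, 1], [3, 0], [3, 1],    # pairs that later co-authored
              [0, 1], [0, 0], [1, 1]],   # pairs that did not
             dtype=float)
y = np.array([1, 1, 1, 0, 0, 0], dtype=float)

w = fit_logistic(X, y)
probs = predict_proba(w, X)
print("weights:", np.round(w, 2))
print("probabilities:", np.round(probs, 2))
```

The learned weight on each column gives a first, informal reading of how much that meta-path contributes to the prediction; a significance test would be needed before drawing conclusions.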
Co-author prediction for Jian Pei: Only 42 among 4809 candidates are true first-time co-authors! (Features collected in [1996, 2002]; test period in [2003, 2009])
Challenges on BigMine
Scalable mining of massive information networks: Necessity
Many such networks are gigantic: News, PubMed, …
DBLP is a small one: 2M papers and 0.8M authors, …
Meta-path: Potentially long chains of matrix multiplication over such networks
APVPA: AP × PV × VP × PA
Comparative analysis of multiple meta-paths is costly
Scalable mining of massive information networks: Possibility
Many functions do not need to compute eigenvalues
Top-k computation may save computation cost substantially
Precomputation may save online computation substantially
Clustering-based precomputation:
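One simple way a query-dependent computation can avoid the full matrix chain is to push a single query row through the chain with vector-matrix products. The sketch below assumes small dense NumPy arrays and a hypothetical helper name `commuting_row`; a real system would presumably use sparse matrices.

```python
import numpy as np

def commuting_row(query, chain):
    """Row `query` of the product of the matrices in `chain`,
    computed with vector-matrix products only."""
    row = np.zeros(chain[0].shape[0])
    row[query] = 1.0
    for m in chain:
        row = row @ m
    return row

# Hypothetical toy network: 3 authors, 4 papers, 2 venues.
AP = np.array([[1, 1, 0, 0],
               [0, 1, 1, 0],
               [0, 0, 0, 1]], dtype=float)
PV = np.array([[1, 0],
               [1, 0],
               [0, 1],
               [1, 0]], dtype=float)

# APVPA = AP x PV x VP x PA, with VP = PV.T and PA = AP.T.
chain = [AP, PV, PV.T, AP.T]

row0 = commuting_row(0, chain)            # online, for one query author
full = np.linalg.multi_dot(chain)         # full commuting matrix, for comparison
print(row0)
```

For a single query the cost is a handful of vector-matrix products instead of an author-by-author matrix, which is the trade-off between online and precomputed evaluation in miniature.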
Computing Eigenvalues: When Is It Needed?
Computations needed: Clustering (RankClus), classification (RankClass), similarity search (PathSim), prediction (PathPredict)
A small number of iterative processing steps (e.g., EM-style)
Meta-path-based prediction: Selection from a set of “parallel” meta-paths
Long Meta-Path May Not Carry the Right Semantics
Repeat the meta-path 2, 4, and infinite times for conference similarity query
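One possible reading of this slide, offered as an interpretation rather than the authors' own argument: when a symmetric meta-path is repeated many times, powers of its commuting matrix are dominated by the leading eigenvector, so the similarity scores drift toward a query-independent global structure. The sketch below checks this numerically on a made-up 3x3 commuting matrix; `pathsim` and `repeated` are hypothetical helper names.

```python
import numpy as np

def pathsim(M):
    """s(x, y) = 2 * M[x, y] / (M[x, x] + M[y, y])."""
    d = np.diag(M)
    return 2.0 * M / (d[:, None] + d[None, :])

def repeated(M, times):
    """Commuting matrix of the meta-path repeated `times` times
    (`times` must be a power of two), rescaled to avoid overflow."""
    P = M.copy()
    n = 1
    while n < times:
        P = P @ P
        P /= P.max()  # PathSim is unchanged by rescaling the whole matrix
        n *= 2
    return P

# Made-up symmetric, positive commuting matrix for three conferences.
M = np.array([[4.0, 2.0, 1.0],
              [2.0, 3.0, 1.0],
              [1.0, 1.0, 5.0]])

# Limit predicted by the leading eigenvector v: M^n is proportional to v v^T.
_, vecs = np.linalg.eigh(M)
v = vecs[:, -1]
limit = pathsim(np.outer(v, v))

for times in (1, 2, 4, 64):
    print(times, np.round(pathsim(repeated(M, times)), 3))
print("limit", np.round(limit, 3))
```

In the limit each score depends only on one number per object (its eigenvector component), so the path-specific evidence that made short meta-paths meaningful is washed out.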
Top-K Computation Is What We Need
Similarity search: “Who are similar to Christos?”
There is no need/interest to calculate and rank the remaining 0.8M authors
Only top-k (e.g., top-100) authors are needed in practice
Lots of optimizations can be explored for top-k computation
Precomputation vs. online computation
Precomputation of long meta-paths will save costly online multi-matrix multiplication
Clustering-based precomputation
Example: top-k similar authors
Precomputation by clustering: only computing rather similar author groups
Co-Clustering-Based Pruning Algorithm
General idea:
Store commuting matrices for short path schemas and compute top-k queries online
Framework
Generate co-clusters for materialized commuting matrices, for feature objects and target objects
Derive upper bounds for the similarity between an object and a target cluster, and between an object and an object
Safely prune target clusters and objects if the upper-bound similarity is lower than the current threshold
Dynamically update the top-k threshold
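The bound-and-prune loop can be sketched in a simplified form. Note the substitution: instead of the co-cluster bounds described above, this example uses a per-object Cauchy-Schwarz bound computed from row norms of a stored half-path matrix `W` (so that the commuting matrix is `W @ W.T`). The function name `topk_pathsim`, the toy matrix, and the assumption that no row of `W` is all zeros are illustrative choices, not the published algorithm.

```python
import heapq
import numpy as np

def topk_pathsim(W, query, k):
    """Top-k PathSim neighbors of `query` for the commuting matrix W @ W.T.

    Returns (results, computed), where results is a list of (object, score)
    sorted by descending score and computed counts exact score evaluations.
    """
    sq = np.einsum("ij,ij->i", W, W)          # diagonal of W @ W.T
    norms = np.sqrt(sq)
    # Cauchy-Schwarz: W[x] . W[y] <= |W[x]| * |W[y]|, hence an upper bound on PathSim.
    bound = 2.0 * norms[query] * norms / (sq[query] + sq)
    heap = []                                  # min-heap of (score, object)
    computed = 0
    for y in np.argsort(-bound):               # most promising candidates first
        if y == query:
            continue
        if len(heap) == k and bound[y] < heap[0][0]:
            break                              # every remaining bound is below the threshold
        score = 2.0 * float(W[query] @ W[y]) / (sq[query] + sq[y])
        computed += 1
        if len(heap) < k:
            heapq.heappush(heap, (score, int(y)))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, int(y)))   # threshold rises dynamically
    results = [(obj, s) for s, obj in sorted(heap, reverse=True)]
    return results, computed

# Hypothetical author-by-venue path counts (half of an APVPA meta-path).
W = np.array([[2, 1], [2, 1], [10, 10], [1, 0], [0, 20]], dtype=float)
results, computed = topk_pathsim(W, query=0, k=2)
print(results, "exact evaluations:", computed)
```

Here only two of the four candidates need an exact score; the two prolific authors are pruned because their bounds already fall below the current top-k threshold.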
Similarity Search: Experiments on Efficiency
Searching for top-20 objects vs. 1001st-1020th objects: PathSim-pruning is more efficient than PathSim-baseline
The denser the corresponding commuting matrix, the more PathSim-pruning can improve
The more neighbors of a query, the more PathSim-pruning can improve
Then compare the efficiency under different top-k’s (k = 5, 10, 20) for PathSim-pruning using query set 1
A smaller top-k has stronger pruning power, and thus needs less execution time
PathPredict: Exploring Big Data Space
Scalable computation in really huge heterogeneous networks?
Sampling may lead to similar judgment on importance of meta-path
Query-dependent prediction can be “selective” and thus may not need that many resources
Precomputation and clustering may further enhance its efficiency
Mining Query-Relevant “Hidden” Networks
Query-relevant hidden networks
What is the hidden network closely relevant to “SVM”?
The network should be a weighted network consisting of papers, terms, authors and venues
Is “kernel machine” closely relevant to “SVM”? How could we know it?
It takes substantial computation to derive such a “weighted/ranked” hidden heterogeneous network
Due to the diversity of queries (e.g., SVM + Cloud + SIGMOD), it is impossible to precompute every possible combination
How can we compute such a hidden network efficiently on the fly?
An interesting open problem
Conclusions
Heterogeneous social & information networks are ubiquitous
Most datasets can be “organized” or “transformed” into “structured” multi-typed heterogeneous info. networks
Examples: DBLP, IMDB, Flickr, Google News, Wikipedia, …
Surprisingly rich knowledge can be mined from structured heterogeneous info. networks
Clustering, ranking, classification, path prediction, …
Knowledge is power, but knowledge is hidden in massive but “relatively structured” nodes and links!
Challenge to BigMine: How to mine massive, heterogeneous information networks efficiently
Some progress/tricks on scalability and efficiency
Many open problems and much more to be explored!
From Data Mining to Mining Info. Networks
Han, Kamber and Pei, Data Mining, 3rd ed. 2011
Yu, Han and Faloutsos (eds.), Link Mining, 2010
Sun and Han, Mining Heterogeneous Information Networks, 2012
References
M. Ji, J. Han, and M. Danilevsky, “Ranking-Based Classification of Heterogeneous Information Networks”, KDD’11
Y. Sun and J. Han, Mining Heterogeneous Information Networks: Principles and Methodologies, Morgan & Claypool Publishers, 2012
Y. Sun, J. Han, et al., “RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis”, EDBT’09
Y. Sun, Y. Yu, and J. Han, “Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema”, KDD’09
Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu, “PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks”, VLDB’11
Y. Sun, R. Barber, M. Gupta, C. Aggarwal, and J. Han, “Co-Author Relationship Prediction in Heterogeneous Bibliographic Networks”, ASONAM’11
Y. Sun, J. Han, C. C. Aggarwal, and N. Chawla, “When Will It Happen? Relationship Prediction in Heterogeneous Information Networks”, WSDM’12
F. Tao, et al., “EventCube: Multi-Dimensional Search and Mining of Structured and Text Data” (system demo), KDD’13
C. Wang, J. Han, et al., “Mining Advisor-Advisee Relationships from Research Publication Networks”, KDD’10
C. Wang, M. Danilevsky, et al., “A Phrase Mining Framework for Recursive Construction of a Topical Hierarchy”, KDD’13