on community outliers and their efficient detection in information networks

27
Intelligent Database Systems Lab 國國國國國國國國 National Yunlin University of Science and Technology 1 On Community Outliers and their Efficient Detection in Information Networks Jing Gao, Feng Liang, Wei Fan, Chi Wang, Yizhou Sun, and Jiawei Han SIGKDD, 2010 Presented by Hung-Yi Cai 2011/2/10

Upload: clio

Post on 24-Feb-2016

44 views

Category:

Documents


0 download

DESCRIPTION

On Community Outliers and their Efficient Detection in Information Networks. Jing Gao , Feng Liang, Wei Fan, Chi Wang, Yizhou Sun, and Jiawei Han SIGKDD, 2010 Presented by Hung-Yi Cai 2011/2/10. Outlines. Motivation Objectives Related Work Methodology Experiments Conclusions - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: On Community Outliers and their Efficient Detection in Information Networks

Intelligent Database Systems Lab

國立雲林科技大學National Yunlin University of Science and Technology

1

On Community Outliers and their Efficient Detection in Information Networks

Jing Gao, Feng Liang, Wei Fan, Chi Wang, Yizhou Sun, and Jiawei HanSIGKDD, 2010

Presented by Hung-Yi Cai2011/2/10

Page 2: On Community Outliers and their Efficient Detection in Information Networks

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.

2

Outlines· Motivation· Objectives· Related Work· Methodology· Experiments· Conclusions· Comments

Page 3: On Community Outliers and their Efficient Detection in Information Networks

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.

3

Motivation· Linked or networked data are ubiquitous in many

applications.

· Outlier detection in information networks can reveal important anomalous and interesting behaviors that are not obvious if community information is ignored.

Page 4: On Community Outliers and their Efficient Detection in Information Networks

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.

4

Objectives· To propose an e cient solution by modeling ffi networked data

as a mixture model composed of multiple normal communities and a set of randomly generated outliers.

· The probabilistic model characterizes both data and links simultaneously by defining their joint distribution based on hidden Markov random fields (HMRF).

Page 5: On Community Outliers and their Efficient Detection in Information Networks

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Related Work

5

Page 6: On Community Outliers and their Efficient Detection in Information Networks

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Methodology

6

Page 7: On Community Outliers and their Efficient Detection in Information Networks

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Methodology Community Outlier

Outlier Detection via HMRF Modeling Continuous and Text Data

Fitting Community Outlier Detection Model Inference Parameter Estimation

7

Page 8: On Community Outliers and their Efficient Detection in Information Networks

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Community Outlier

· Outlier Detection via HMRF:

8

Page 9: On Community Outliers and their Efficient Detection in Information Networks

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Community Outlier

· Modeling Continuous and Text Data:─ Continuous data

─ Text data

9

Page 10: On Community Outliers and their Efficient Detection in Information Networks

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Fitting Community Outlier Detection Model

· Inference:─ We first assume that the model parameters in Θ are

known, and discuss how to obtain an assignment of the hidden variables.

─ The objective is to find the configuration that maximizes the posterior distribution given Θ.

10

Page 11: On Community Outliers and their Efficient Detection in Information Networks

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Fitting Community Outlier Detection Model

· In general, we seek a labeling of the objects, Z = {z1,...,zM}, to maximize the posterior probability (MAP):

· Using the Iterated Conditional Modes (ICM) algorithm to solve this MAP estimation problem. It adopts a greedy strategy by calculating local minimization iteratively and the convergence is guaranteed after a few iterations.

11

Page 12: On Community Outliers and their Efficient Detection in Information Networks

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Fitting Community Outlier Detection Model

· Normal community…

12

Page 13: On Community Outliers and their Efficient Detection in Information Networks

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Fitting Community Outlier Detection Model

· Outlier…

13

Page 14: On Community Outliers and their Efficient Detection in Information Networks

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Fitting Community Outlier Detection Model

14

Page 15: On Community Outliers and their Efficient Detection in Information Networks

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Fitting Community Outlier Detection Model

· Parameter Estimation:─ In this part, we consider the problem of estimating

unknown Θ from the data.─ We view it as an “incomplete-data” problem, and use

the expectation-maximization (EM) algorithm to solve it.

15

Page 16: On Community Outliers and their Efficient Detection in Information Networks

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Fitting Community Outlier Detection Model

· The outlier component…

16

Page 17: On Community Outliers and their Efficient Detection in Information Networks

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Fitting Community Outlier Detection Model

· The normal component…

17

Page 18: On Community Outliers and their Efficient Detection in Information Networks

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Fitting Community Outlier Detection Model

18

Page 19: On Community Outliers and their Efficient Detection in Information Networks

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Experiments· To conduct experiments on synthetic data to

compare detection accuracy with the baseline methods, and evaluate on real datasets to validate that the proposed algorithm can detect community outliers e ectively.ff

19

Page 20: On Community Outliers and their Efficient Detection in Information Networks

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Experiments - Synthetic Data· Baseline Method:

20

GLODAThis baseline looks at the data values only. We use the popular outlier detection algorithm LOF to detect “global” outliers without taking the network structure into account.

DNODA

This method only considers the values of each object’s direct neighbors in the graph.

CNA

In this approach, we partition the graph into K communities using clustering algorithms, and define outliers as the objects that have significantly different values compared with the other objects in the same community.

Page 21: On Community Outliers and their Efficient Detection in Information Networks

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Experiments - Synthetic Data· Empirical Results:

21

Page 22: On Community Outliers and their Efficient Detection in Information Networks

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Experiments - Synthetic Data· Sensitivity and Time Complexity:

22

Page 23: On Community Outliers and their Efficient Detection in Information Networks

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Experiments - DBLP· Sub-network of Conferences:

23

Page 24: On Community Outliers and their Efficient Detection in Information Networks

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Experiments - DBLP· Sub-network of Conferences:

─ The community outliers detected by the proposed algorithm include CVPR and CIKM.

24

Page 25: On Community Outliers and their Efficient Detection in Information Networks

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Experiments - DBLP· Sub-network of Authors:

─ Community Outliers in DBLP co-authors.

25

Page 26: On Community Outliers and their Efficient Detection in Information Networks

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.

26

Conclusions· In this paper, discussing a new outlier detection

problem in networks containing rich information, including data about each object and relationships among objects.

· Proposing a generative model called CODA that unifies both community discovery and outlier detection in a probabilistic formulation based on hidden Markov random fields.

Page 27: On Community Outliers and their Efficient Detection in Information Networks

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.

27

Comments· Advantages

─ The CODA algorithm outperform the baseline methods on synthetic and DBLP data.

· Applications─ Outlier Detection ─ Community Discovery