on community outliers and their efficient detection in information networks
Post on 23-Feb-2016
37 Views
Preview:
DESCRIPTION
TRANSCRIPT
On Community Outliers and their Efficient Detection in Information
NetworksJing Gao1, Feng Liang1, Wei Fan2,
Chi Wang1, Yizhou Sun1, Jiawei Han1
University of Illinois, IBM TJ WatsonDebapriya Basu
2
OBJECTIVES
Determine outliers in information networks Compare various algorithms which does the
same
3
CHARACTERISTICS OF INFORMATION NETWORK
Eg Internet, Social Networking Sites Nodes – characterized by feature values Links - representative of relation between
nodes
4
Outliers – anomalies, novelties Different kinds of outliers
◦ Global◦ Contextual
OUTLIERS
5
OUTLIERS IN INFORMATION NETWORKS
V7
10
V9
V8
30
V10
40 70 100 110 140 160
V6 V1 V4 V5 V3 V2
Global Outlier
Salary (in $1000)
6
COMMUNITY OUTLIER Unified model considering both nodes and
links Community discovery and outlier detection
are related processes
7
SUMMARY OF THE APPROACH Treat each object as a multivariate data point Use K components to describe normal
community behavior and one component to denote outliers
Induce a hidden variable zi at each object indicating community
Treat network information as a graph Model the graph as a Hidden Markov Random
Field on zi Find the local minimum of the posterior
probability potential energy of the model.
8
UNIFIED PROBABILISTIC MODELcommunity
label Zoutlier
node feature
Xlink
structure W
high-income:mean: 116k
std: 35k
low-income:mean: 20k
std: 12k
model parameter
sK: number of communities
9
SYMBOLS IN THE MODEL
Symbol DefinitionI = {1,2,3….i,..M} Indices of the objectsV = {v1,v2….vm} Set of objectsS = {s1,s2,….sm} Given attributes of
objectsWM*M = {wij} Adjacency matrix
containing the weights of the links
Z = {z1,…..,zm} RVs for hidden labels of objects
X = {x1,…..,xm} RVs for observed dataNi (i ∈ I) Neighborhood of object
vi
1,….,k,….K Indices of normal communities
Θ = {Θ1, Θ2,……, Θk} R.Vs for model parameters
10
◦ Set of R.Vs X are conditionally independent given their labelsP(X=S|Z) = ΠP(xi=si|zi)
◦ Kth normal community is characterized by a set of parametersP(xi=si|zi =k) = P(xi=si|Θk)
◦ Outliers are characterized by uniform distribution◦ P(xi=si|zi =0) = ρ0◦ Markov random field is defined over hidden variable Z ◦ P(zi|zI-{i}) = P(zi|zNi)◦ The equivalent Gibbs distribution is P(Z) = exp(-U(Z))*1/H1
H1 = normalizing constant, U(Z) = sum of clique potentials.◦ Goal is to find the configuration of z that maximizes
P(X=S|Z)P(Z) for a given Θ
THE MODEL
11
MODELING DATA Continuous Data
◦ Is modeled as Gaussian distribution◦ Model parameters: mean, standard deviation
Text Data◦ Is modeled as Multinomial distribution◦ Model parameters: probability of a word
appearing in a community
12
COMMUNITY OUTLIER DETECTION ALGORITHM
Given Θ, find Z that maximizes P(Z|X)
Given Z, find Θthat maximizes P(X|Z)
Initialize Z
INFERENCE
PARAMETER ESTIMATION
Θ : model parametersZ: community labels
13
PARAMETER ESTIMATION Calculate model parameters
◦ maximum likelihood estimation Continuous
◦ mean: sample mean of the community◦ standard deviation: square root of the sample
variance of the community Text
◦ probability of a word appearing in the community: empirical probability
14
INFERENCES Calculate Zi values
◦ Given Model parameters,◦ Iteratively update the community labels of nodes at
each timestep◦ Select the label that maximizes P(Z|X,ZN)
Calculate P(Z|X,ZN) values◦ Both the node features and community labels of
neighbors if Z indicates a normal community◦ If the probability of a node belonging to any community
is low enough, label it as an outlier
15
DISCUSSIONS Setting Hyper parameters
◦ a0 = threshold ◦ Λ = confidence in the network◦ K = number of communities
Initialization◦ Group outliers in clusters.◦ It will eventually get corrected.
16
SIMULATED EXPERIMENTS Data Generation
◦ Generate continuous data based on Gaussian distributions and generate labels according to the model
◦ Define r: percentage of outliers, K: number of communities
Baseline models◦ GLODA: global outlier detection (based on node
features only)◦ DNODA: local outlier detection (check the feature
values of direct neighbors)◦ CNA: partition data into communities based on links
and then conduct outlier detection in each community
17
TRUE POSITIVE RATE(Top r% as outliers)
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
r=1% K=5 r=5% K=5 r=1% K=8 r=5% K=8
GLODADNODACNACODA
18
EXPERIMENTS ON DBLP Communities
◦ data mining, artificial intelligence, database, information analysis
Sub network of Conferences Links: percentage of common authors among two
conferences Node features: publication titles in the conference
Sub network of Authors Links: co-authorship relationship Node features: titles of publications by an author
19
RESULTS
Community outliers: CVPR CIKM
20
CONCLUSIONS
Community Outliers
Community Outlier Detection
QUESTIONS
21
REFERENCES
On Community Outliers and their Efficient Detection in Information Networks – Gao, Liang, Fan, Wang, Sun, Han
Outlier detection – Irad Ben-Gal Automated detection of outliers in real-world data –
Last, Kandel Outlier Detection for High Dimensional Data –
Aggarwal, Yu
top related