A Probabilistic Framework for Semi-Supervised Clustering
Sugato Basu, Mikhail Bilenko, Raymond J. Mooney
Department of Computer Sciences, University of Texas at Austin
Presented by Jingting Zeng
Outline
Introduction
Background
Algorithm
Experiments
Conclusion
What is Semi-Supervised Clustering?
Use human input to provide labels for some of the data
Improve existing naive clustering methods
Use the labeled data to guide the clustering of the unlabeled data
The end result is a better clustering of the data
Motivation
Large amounts of unlabeled data exist, and more is being produced all the time
Labels for data are expensive to generate, usually requiring human intervention
We want to use limited amounts of labeled data to guide the clustering of the entire dataset, yielding a better clustering
Semi-Supervised Clustering
Constraint-based: modify the objective function so that it includes a penalty for violating constraints, or enforce the constraints during the clustering or initialization process
Distance-based: a distance function trained on the supervised data to satisfy the labels or constraints is then applied to the complete dataset
Method
Use both constraint-based and distance-based approaches in a unified method
Use a Hidden Markov Random Field (HMRF) to model the constraints
Use the constraints for initialization and for the assignment of points to clusters
Use an adaptive distance function so the distortion measure itself is learned
Cluster the data to minimize the resulting objective function
Main Points of Method
Improved initialization: initial clusters are formed based on the constraints
Constraint-sensitive assignment of points to clusters: points are assigned to clusters to minimize a distortion function while also minimizing the number of constraints violated
Iterative distance learning: the distortion measure is re-estimated after each iteration
Constraints
Pairwise constraints with must-link or cannot-link labels: a set M of must-link constraints and a set C of cannot-link constraints
A list of associated costs for violating the must-link or cannot-link requirements
Class labels do not have to be known; a user can still specify relationships between points
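A minimal sketch of one way to hold this supervision in code; the dict-of-pairs representation and the example weights are illustrative assumptions, not the paper's data structures:

```python
# Supervision as two dicts: a pair of point indices -> cost of violating
# that constraint. Representation and weights are illustrative only.
must_link = {(0, 1): 1.0, (2, 3): 1.0}     # set M: should share a cluster
cannot_link = {(0, 4): 1.0, (1, 5): 2.0}   # set C: should be separated

# Only pairwise relations are given; the points' true class labels
# never need to be known.
```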
HMRF
Posterior probability of the cluster labels given the data (sketched below)
This is an “incomplete-data” problem: the cluster representatives as well as the cluster labels are unknown
A popular method for solving this type of problem is Expectation Maximization (EM)
K-Means is equivalent to an EM algorithm with hard clustering assignments
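One way to write out the posterior the slide alludes to, following the paper's HMRF formulation; the notation D, mu, and the potentials anticipate the later slides, so treat this as a sketch rather than the paper's exact equation:

$$
\Pr(L \mid X) \;\propto\;
\exp\!\Bigl(-\sum_{x_i \in X} D(x_i, \mu_{l_i})\Bigr)\,
\exp\!\Bigl(-\sum_{(x_i, x_j)} V(l_i, l_j)\Bigr)
$$

where $V(l_i, l_j)$ is the constraint potential: $f_M(x_i, x_j)$ for a violated must-link, $f_C(x_i, x_j)$ for a violated cannot-link, and 0 otherwise. Maximizing this posterior over the labels L is equivalent to minimizing the objective $J_{\mathrm{obj}}$ defined later; with no constraints, the MAP assignment reduces to the hard E-step of K-Means.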
Must-Link Violation Cost Function
$$ f_M(x_i, x_j) \;=\; w_{ij}\, D(x_i, x_j) $$
Ensures that the penalty for violating a must-link constraint between two points that are far apart is higher than between two points that are close
Penalizes distance functions under which must-link points are far apart
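A small sketch of this penalty; the diagonally weighted squared Euclidean distortion is an assumption (the paper supports several distortion measures), and the names are illustrative:

```python
import numpy as np

def distance(x_i, x_j, a):
    # Parameterized (diagonally weighted) squared Euclidean distortion.
    # One of several distortion measures the framework allows; assumed
    # throughout these sketches.
    d = x_i - x_j
    return float(np.sum(a * d * d))

def must_link_penalty(x_i, x_j, w_ij, a):
    # f_M(x_i, x_j) = w_ij * D(x_i, x_j): the farther apart a violated
    # must-link pair is, the larger the penalty.
    return w_ij * distance(x_i, x_j, a)
```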
Cannot-Link Violation Cost Function
$$ f_C(x_i, x_j) \;=\; \bar{w}_{ij}\,\bigl(D_{\max} - D(x_i, x_j)\bigr) $$
Ensures that the penalty for violating a cannot-link constraint between points that are nearby under the current distance function is higher than between distant points
Penalizes distance functions that place two cannot-link points close together
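Continuing the sketch above (the `distance` helper and the parameter vector `a` are the same assumed distortion):

```python
def cannot_link_penalty(x_i, x_j, w_bar_ij, a, d_max):
    # f_C(x_i, x_j) = w_bar_ij * (D_max - D(x_i, x_j)): the closer a
    # violated cannot-link pair is, the larger the penalty. d_max is the
    # maximum pairwise distance under the current measure, which keeps
    # the penalty non-negative.
    return w_bar_ij * (d_max - distance(x_i, x_j, a))
```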
Objective Function
Goal: find the minimum of the objective function
Supervised data is used in initialization
Constraints are used in cluster assignments
Distance learning happens in the M-step
$$
J_{\mathrm{obj}} \;=\; \sum_{x_i \in X} D(x_i, \mu_{l_i})
\;+\; \sum_{\substack{(x_i, x_j) \in M \\ l_i \neq l_j}} w_{ij}\, D(x_i, x_j)
\;+\; \sum_{\substack{(x_i, x_j) \in C \\ l_i = l_j}} \bar{w}_{ij}\,\bigl(D_{\max} - D(x_i, x_j)\bigr)
\;+\; \log Z
$$

where $l_i$ is the cluster label of $x_i$, $\mu_{l_i}$ is its cluster representative, and $Z$ is the normalizing constant of the HMRF.
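A sketch of the whole objective, reusing the assumed distortion and the dict representation of M and C from the earlier sketches; log Z is omitted since it is constant with respect to the cluster assignments:

```python
def objective(X, labels, centroids, must_link, cannot_link, a):
    # Distortion of every point to its cluster representative.
    J = sum(distance(X[i], centroids[labels[i]], a) for i in range(len(X)))
    # D_max: maximum pairwise distance under the current measure.
    d_max = max(distance(X[i], X[j], a)
                for i in range(len(X)) for j in range(i + 1, len(X)))
    # Must-link violations: the pair ended up in different clusters.
    for (i, j), w in must_link.items():
        if labels[i] != labels[j]:
            J += w * distance(X[i], X[j], a)
    # Cannot-link violations: the pair ended up in the same cluster.
    for (i, j), w in cannot_link.items():
        if labels[i] == labels[j]:
            J += w * (d_max - distance(X[i], X[j], a))
    return J   # log Z omitted (constant w.r.t. assignments)
```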
Algorithm
EM framework
Initialization step: use constraints to guide the initial cluster formation
E-step: minimize the objective function over cluster assignments
M-step: minimize the objective function over the cluster representatives, then over the parameters of the distortion measure
Initialization
Initialize: form the transitive closure of the must-link constraints
This gives a set of connected components (neighborhoods), each consisting of points connected by must-link constraints
Let y be the number of connected components
If y < K (the number of clusters), the y connected neighborhoods are used to create y initial clusters
The remaining clusters are initialized by random perturbations of the global centroid of the data
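A minimal union-find sketch of the neighborhood computation (illustrative, not the authors' code):

```python
from collections import defaultdict

def must_link_neighborhoods(n_points, must_link):
    # Connected components of the must-link graph (i.e., its transitive
    # closure); each component is one candidate initial neighborhood.
    parent = list(range(n_points))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for i, j in must_link:                  # iterate over constraint pairs
        parent[find(i)] = find(j)

    groups = defaultdict(list)
    for p in range(n_points):
        groups[find(p)].append(p)
    return list(groups.values())
```

If fewer than K components are found, the remaining K - y centroids would then come from small random perturbations of the global centroid, as the slide describes.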
What If There Are More Neighborhoods Than Clusters?
If y > K (the number of clusters), K initial clusters are selected using the distance measure
Farthest-first traversal is a good heuristic; a weighted variant is used
The distance between two centroids is multiplied by their corresponding weights
The weight of each centroid is proportional to the size of the corresponding neighborhood
This biases the selection toward centroids that are relatively far apart and come from reasonably large neighborhoods
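A sketch of one reading of the weighted traversal: each candidate's distance to the already-chosen set is scaled by the neighborhood-size weights of both endpoints. The plain Euclidean centroid distance is an assumption:

```python
import numpy as np

def weighted_farthest_first(centroids, weights, k):
    # centroids: (y, d) array of neighborhood centroids, y > k.
    # weights: neighborhood sizes (weight proportional to size).
    chosen = [int(np.argmax(weights))]          # start from the heaviest
    while len(chosen) < k:
        best, best_score = None, -1.0
        for c in range(len(centroids)):
            if c in chosen:
                continue
            # Weighted distance of candidate c to the chosen set:
            # pairwise distance scaled by both endpoints' weights.
            score = min(weights[c] * weights[s]
                        * np.linalg.norm(centroids[c] - centroids[s])
                        for s in chosen)
            if score > best_score:
                best, best_score = c, score
        chosen.append(best)
    return chosen
```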
Initialization Continued
Assuming consistency of the data:
Augment set M with the must-link constraints inferred from the transitive closure
For each pair of neighborhoods Np, Np' that have at least one cannot-link constraint between them, add cannot-link constraints between every member of Np and every member of Np'
The aim is to learn as much about the data from the constraints as possible (a sketch follows the diagrams below)
Augment set M
[Diagram: must-link(a, b) and must-link(b, c) yield an inferred must-link(a, c)]
Augment Set C
[Diagram: must-link(a, b) and cannot-link(b, c) yield an inferred cannot-link(a, c)]
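A sketch of the cannot-link inference pictured above, reusing the dict representation from earlier; the must-link augmentation falls out of the transitive closure already computed during initialization. `default_w` is an assumed violation cost:

```python
def augment_cannot_link(neighborhoods, cannot_link, default_w=1.0):
    # If some cannot-link constraint joins neighborhoods Np and Np',
    # add a cannot-link between every member of Np and every member
    # of Np'.
    member_of = {p: h for h, hood in enumerate(neighborhoods) for p in hood}
    augmented = dict(cannot_link)
    for (i, j) in list(cannot_link):
        hi, hj = member_of.get(i), member_of.get(j)
        if hi is None or hj is None or hi == hj:
            continue
        for p in neighborhoods[hi]:
            for q in neighborhoods[hj]:
                augmented.setdefault((min(p, q), max(p, q)), default_w)
    return augmented
```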
E-step
Assignment of data points to clusters
Since the model incorporates interactions between points, computing the assignment that minimizes the objective function is computationally intractable
Approximate inference methods include iterated conditional modes (ICM), belief propagation, and linear programming relaxation
ICM uses a greedy strategy, sequentially updating the cluster assignment of each point while keeping the other points fixed
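A sketch of the ICM sweep, reusing the `objective` helper from above (unoptimized: it re-evaluates the full objective for every candidate label, though only the terms involving x_i actually change):

```python
def icm_e_step(X, labels, centroids, must_link, cannot_link, a, sweeps=5):
    # Greedy strategy from the slide: sequentially re-assign each point
    # to the label that minimizes the objective, holding all other
    # assignments fixed; stop when a full sweep changes nothing.
    k = len(centroids)
    for _ in range(sweeps):
        changed = False
        for i in range(len(X)):
            old = labels[i]
            best_l, best_cost = old, float("inf")
            for l in range(k):
                labels[i] = l
                cost = objective(X, labels, centroids,
                                 must_link, cannot_link, a)
                if cost < best_cost:
                    best_l, best_cost = l, cost
            labels[i] = best_l
            if best_l != old:
                changed = True
        if not changed:
            break
    return labels
```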
M-step
First, the cluster representatives are re-estimated to decrease the objective function
The constraints do not factor into this sub-step, so it is equivalent to the K-Means centroid update
If a parameterized variant of the distance measure is used, it is updated here; the parameter updates can be found through partial derivatives of the distance function
The learning step modifies the distortion measure so that similar points are brought closer together while dissimilar points are pulled apart
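A sketch of both M-step sub-steps under the assumed weighted squared Euclidean distortion; the paper derives measure-specific updates, whereas this takes a single generic gradient step and ignores the dependence of D_max on the parameters:

```python
import numpy as np

def m_step(X, labels, k, must_link, cannot_link, a, lr=0.01):
    # labels: NumPy integer array of current assignments.
    # 1) Re-estimate representatives: the constraint terms do not involve
    #    the centroids, so this is the ordinary K-Means centroid update.
    centroids = np.stack([X[labels == c].mean(axis=0) for c in range(k)])

    # 2) One gradient step on the distortion weights `a`.
    #    For D(x, y) = sum_m a_m (x_m - y_m)^2, dD/da_m = (x_m - y_m)^2.
    grad = np.zeros_like(a)
    for i in range(len(X)):
        grad += (X[i] - centroids[labels[i]]) ** 2
    for (i, j), w in must_link.items():
        if labels[i] != labels[j]:
            grad += w * (X[i] - X[j]) ** 2
    for (i, j), w in cannot_link.items():
        if labels[i] == labels[j]:
            grad -= w * (X[i] - X[j]) ** 2   # from the (D_max - D) term
    a = np.maximum(a - lr * grad, 1e-8)      # keep the weights positive
    return centroids, a
```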
Results
HMRF-KMeans-I-C-D: the complete HMRF-KMeans algorithm, with supervised data in initialization (I), cluster assignment (C), and distance learning (D)
HMRF-KMeans-I-C: HMRF-KMeans without distance learning
HMRF-KMeans-I: HMRF-KMeans without distance learning and without supervised cluster assignment
Results

[Results charts not reproduced in this transcript]
Conclusion
HMRF-KMeans performs well (compared to naïve K-Means) with a limited number of constraints
The goal of the algorithm was to provide a better clustering method using a limited number of constraints
HMRF-KMeans learns quickly from a limited number of constraints
It should be applicable to datasets where we want to limit the amount of human labeling and where supervision can be specified as pairwise constraints
Questions
Can all types of constraints be captured in pairwise associations? What about hierarchical structure?
Could other types of labels be included in this model, e.g., class labels as well as pairwise constraints?
How does this model handle noise in the data or labels? For example: point A has a must-link constraint to point B, point B has a must-link constraint to point C, yet point A has a cannot-link constraint to point C
More Questions
How does this apply to other types of data? The authors mention wanting to try the method on other types of data in the future, such as gene representations
Who provides the weights for constraint violations, and how are those weights determined?
The evaluation only compares against the naïve K-Means method; how does it compare with other semi-supervised clustering methods?
Reference
S. Basu, M. Bilenko, and R.J. Mooney, “A Probabilistic Framework for Semi-Supervised Clustering,” Proc. 10th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), Aug. 2004.
Thank you!