the university of kansas eecs 800 research seminar mining biological data instructor: luke huan...
TRANSCRIPT
The UNIVERSITY of Kansas
EECS 800 Research SeminarMining Biological Data
Instructor: Luke Huan
Fall, 2006
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide2
11/01/2006Semi-supervised learning
AdministrativeAdministrative
Student presentations start from next weekSend me the list of references that you use, if your choices are different from what are listed at the class webpage
Speaker:Allocate enough time for background.
20 minutes for general audience
20 minutes for people working in “your area”
<20 minutes for experts
Keep in mind of five “w” questions: What is it? Why is it important? Who cares? Works or not? and So what?
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide3
11/01/2006Semi-supervised learning
AdministrativeAdministrative
ParticipantsNeed to ask at least one question
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide4
11/01/2006Semi-supervised learning
OverviewOverview
Semi-supervised learningSemi-supervised classification
Semi-supervised clustering
Semi-supervised clusteringSearch based methods
Cop K-mean
Seeded K-mean
Constrained K-mean
Similarity based methods
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide5
11/01/2006Semi-supervised learning
Supervised Classification Example
Supervised Classification Example
.
..
.
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide6
11/01/2006Semi-supervised learning
Supervised Classification Example
Supervised Classification Example
...
..
. .. ..
.
.
....
.
...
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide7
11/01/2006Semi-supervised learning
Supervised Classification Example
Supervised Classification Example
...
..
. .. ..
.
.
....
.
...
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide8
11/01/2006Semi-supervised learning
.
Unsupervised Clustering Example
Unsupervised Clustering Example
...
..
. .. ..
.
.
....
.
...
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide9
11/01/2006Semi-supervised learning
Unsupervised Clustering Example
Unsupervised Clustering Example
...
..
. .. ..
.
.
....
.
...
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide10
11/01/2006Semi-supervised learning
Semi-Supervised LearningSemi-Supervised Learning
Combines labeled and unlabeled data during training to improve performance:
Semi-supervised classification: Training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier.
Semi-supervised clustering: Uses small amount of labeled data to aid and bias the clustering of unlabeled data.
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide11
11/01/2006Semi-supervised learning
Semi-Supervised Classification Example
Semi-Supervised Classification Example
...
..
. .. ..
.
.
....
.
...
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide12
11/01/2006Semi-supervised learning
Semi-Supervised Classification Example
Semi-Supervised Classification Example
...
..
. .. ..
.
.
....
.
...
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide13
11/01/2006Semi-supervised learning
Semi-Supervised Classification
Semi-Supervised Classification
Algorithms:Semisupervised EM [Ghahramani:NIPS94,Nigam:ML00].
Co-training [Blum:COLT98].
Transductive SVM’s [Vapnik:98,Joachims:ICML99].
Assumptions:Known, fixed set of categories given in the labeled data.
Goal is to improve classification of examples into these known categories.
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide14
11/01/2006Semi-supervised learning
Semi-Supervised Clustering Example
Semi-Supervised Clustering Example
...
..
. .. ..
.
.
....
.
...
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide15
11/01/2006Semi-supervised learning
Semi-Supervised Clustering Example
Semi-Supervised Clustering Example
.
...
. .. ..
.
.
....
.
...
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide16
11/01/2006Semi-supervised learning
Second Semi-Supervised Clustering Example
Second Semi-Supervised Clustering Example
...
..
. .. ..
.
.
....
.
...
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide17
11/01/2006Semi-supervised learning
Second Semi-Supervised Clustering Example
Second Semi-Supervised Clustering Example
.
...
. .. ..
.
.
....
.
...
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide18
11/01/2006Semi-supervised learning
Semi-Supervised ClusteringSemi-Supervised Clustering
Can group data using the categories in the initial labeled data.
Can also extend and modify the existing set of categories as needed to reflect other regularities in the data.
Can cluster a disjoint set of unlabeled data using the labeled data as a “guide” to the type of clusters desired.
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide19
11/01/2006Semi-supervised learning
Problem definitionProblem definition
Input:A set of unlabeled objects
Some domain knowledge
Output:A partitioning of the objects into clusters
Objective:Maximum intra-cluster similarity
Minimum inter-cluster similarity
High consistency between the partitioning and the domain knowledge
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide20
11/01/2006Semi-supervised learning
What is Domain Knowledge?What is Domain Knowledge?
Must-link and cannot-link
Class labels
Ontology
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide21
11/01/2006Semi-supervised learning
Why semi-supervised clustering?Why semi-supervised clustering?
Why not clustering?Could not incorporate prior knowledge into clustering process
Why not classification?Sometimes there are insufficient labeled data.
Potential applicationsBioinformatics (gene and protein clustering)
Document hierarchy construction
News/email categorization
Image categorization
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide22
11/01/2006Semi-supervised learning
Semi-Supervised ClusteringSemi-Supervised Clustering
ApproachesSearch-based Semi-Supervised Clustering
Alter the clustering algorithm using the constraints
Similarity-based Semi-Supervised Clustering
Alter the similarity measure based on the constraints
Combination of both
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide23
11/01/2006Semi-supervised learning
Search-Based Semi-Supervised Clustering
Search-Based Semi-Supervised Clustering
Alter the clustering algorithm that searches for a good partitioning by:
Modifying the objective function to give a reward for obeying labels on the supervised data [Demeriz:ANNIE99].
Enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff:ICML00, Wagstaff:ICML01].
Use the labeled data to initialize clusters in an iterative refinement algorithm (kMeans, EM) [Basu:ICML02].
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide24
11/01/2006Semi-supervised learning
Unsupervised KMeans ClusteringUnsupervised KMeans Clustering
KMeans iteratively partitions a dataset into K clusters.
Algorithm:
Initialize K cluster centers randomly. Repeat until convergence:
Cluster Assignment Step: Assign each data point x to the cluster Xl, such that L2 distance of x from (center of Xl) is minimum
Center Re-estimation Step: Re-estimate each cluster center as the mean of the points in that cluster
}{1l
K
l
l
l
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide25
11/01/2006Semi-supervised learning
KMeans Objective FunctionKMeans Objective Function
Locally minimizes sum of squared distance between the data points and their corresponding cluster centers:
Initialization of K cluster centers:Totally random
Random perturbation from global mean
Heuristic to ensure well-separated centers etc.
2
1||||
K
l Xx lili
x
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide26
11/01/2006Semi-supervised learning
K Means ExampleK Means Example
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide27
11/01/2006Semi-supervised learning
K Means ExampleRandomly Initialize Means
K Means ExampleRandomly Initialize Means
x
x
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide28
11/01/2006Semi-supervised learning
K Means ExampleAssign Points to Clusters
K Means ExampleAssign Points to Clusters
x
x
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide29
11/01/2006Semi-supervised learning
K Means ExampleRe-estimate MeansK Means ExampleRe-estimate Means
x
x
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide30
11/01/2006Semi-supervised learning
K Means ExampleRe-assign Points to Clusters
K Means ExampleRe-assign Points to Clusters
x
x
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide31
11/01/2006Semi-supervised learning
K Means ExampleRe-estimate MeansK Means ExampleRe-estimate Means
x
x
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide32
11/01/2006Semi-supervised learning
K Means ExampleRe-assign Points to Clusters
K Means ExampleRe-assign Points to Clusters
x
x
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide33
11/01/2006Semi-supervised learning
K Means ExampleRe-estimate Means and Converge
K Means ExampleRe-estimate Means and Converge
x
x
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide34
11/01/2006Semi-supervised learning
Semi-Supervised K-MeansSemi-Supervised K-Means
Constraints (Must-link, Cannot-link)COP K-Means
Partial label information is givenSeeded K-Means (Basu, ICML’02)
Constrained K-Means
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide35
11/01/2006Semi-supervised learning
COP K-MeansCOP K-Means
COP K-Means is K-Means with must-link (must be in same cluster) and cannot-link (cannot be in same cluster) constraints on data points.
Initialization: Cluster centers are chosen randomly but no must-link constraints that may be violated
Algorithm: During cluster assignment step in COP-K-Means, a point is assigned to its nearest cluster without violating any of its constraints. If no such assignment exists, abort.
Based on Wagstaff et al.: ICML01
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide36
11/01/2006Semi-supervised learning
COP K-Means AlgorithmCOP K-Means AlgorithmCOP K-Means Algorithm
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide37
11/01/2006Semi-supervised learning
IllustrationIllustration
xx
Must-link
Determineits label
Assign to the red class
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide38
11/01/2006Semi-supervised learning
IllustrationIllustration
xx
Cannot-link
Determineits label
Assign to the red class
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide39
11/01/2006Semi-supervised learning
IllustrationIllustration
xx
Cannot-link
Determineits label
The clustering algorithm fails
Must-link
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide40
11/01/2006Semi-supervised learning
EvaluationEvaluation
Rand index: measures the agreement between two partitions, P1 and P2, of the same data set D.
Each partition is viewed as a collection of n(n-1)/2 pairwise decisions, where n is the size of D.
a is the number of decisions where P1 and P2 put a pair of objects into the same cluster
b is the number of decisions where two instances are placed in different clusters in both partitions.
Total agreement can then be calculated using Rand(P1; P2) = (a + b)/ (n (n -1)/2):
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide41
11/01/2006Semi-supervised learning
EvaluationEvaluation
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide42
11/01/2006Semi-supervised learning
Semi-Supervised K-MeansSemi-Supervised K-Means
Seeded K-Means:Labeled data provided by user are used for initialization: initial center for cluster i is the mean of the seed points having label i.
Seed points are only used for initialization, and not in subsequent steps.
Constrained K-Means:Labeled data provided by user are used to initialize K-Means algorithm.
Cluster labels of seed data are kept unchanged in the cluster assignment steps, and only the labels of the non-seed data are re-estimated.
Based on Basu et al., ICML’02.
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide43
11/01/2006Semi-supervised learning
Seeded K-MeansSeeded K-Means
Use labeled data to find the initial centroids andthen run K-Means.
The labels for seeded points may change.
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide44
11/01/2006Semi-supervised learning
Seeded K-Means ExampleSeeded K-Means Example
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide45
11/01/2006Semi-supervised learning
Seeded K-Means ExampleInitialize Means Using Labeled Data
Seeded K-Means ExampleInitialize Means Using Labeled Data
xx
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide46
11/01/2006Semi-supervised learning
Seeded K-Means ExampleAssign Points to Clusters
Seeded K-Means ExampleAssign Points to Clusters
xx
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide47
11/01/2006Semi-supervised learning
Seeded K-Means ExampleRe-estimate Means
Seeded K-Means ExampleRe-estimate Means
xx
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide48
11/01/2006Semi-supervised learning
Seeded K-Means ExampleAssign points to clusters and
Converge
Seeded K-Means ExampleAssign points to clusters and
Converge
xx the label is changed
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide49
11/01/2006Semi-supervised learning
Constrained K-MeansConstrained K-Means
Use labeled data to find the initial centroids andthen run K-Means.
The labels for seeded points will not change.
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide50
11/01/2006Semi-supervised learning
Constrained K-Means Example
Constrained K-Means Example
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide51
11/01/2006Semi-supervised learning
Constrained K-Means Example
Initialize Means Using Labeled Data
Constrained K-Means Example
Initialize Means Using Labeled Data
xx
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide52
11/01/2006Semi-supervised learning
Constrained K-Means Example
Assign Points to Clusters
Constrained K-Means Example
Assign Points to Clusters
xx
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide53
11/01/2006Semi-supervised learning
Constrained K-Means Example
Re-estimate Means and Converge
Constrained K-Means Example
Re-estimate Means and Converge
xx
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide54
11/01/2006Semi-supervised learning
DatasetsDatasets
Data sets: UCI Iris (3 classes; 150 instances)CMU 20 Newsgroups (20 classes; 20,000 instances) Yahoo! News (20 classes; 2,340 instances)
Data subsets created for experiments:Small-20 newsgroup: random sample of 100 documents from each newsgroup, created to study effect of datasize on algorithms.Different-3 newsgroup: 3 very different newsgroups (alt.atheism, rec.sport.baseball, sci.space), created to study effect of data separability on algorithms.Same-3 newsgroup: 3 very similar newsgroups (comp.graphics, comp.os.ms-windows, comp.windows.x).
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide55
11/01/2006Semi-supervised learning
EvaluationEvaluation
Mutual information
Objective function
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide56
11/01/2006Semi-supervised learning
Results: MI and SeedingResults: MI and Seeding
Zero noise in seeds [Small-20 NewsGroup]Semi-Supervised KMeans substantially better than unsupervised KMeans
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide57
11/01/2006Semi-supervised learning
Results: Objective function and Seeding
Results: Objective function and Seeding
User-labeling consistent with KMeans assumptions [Small-20 NewsGroup] Obj. function of data partition increases exponentially with seed fraction
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide58
11/01/2006Semi-supervised learning
Results: Objective Function and Seeding
Results: Objective Function and Seeding
User-labeling inconsistent with KMeans assumptions [Yahoo! News] Objective function of constrained algorithms decreases with seeding
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide59
11/01/2006Semi-supervised learning
Similarity Based MethodsSimilarity Based Methods
Questions: given a set of points and the class labels, can we learn a distance matrix such that inter-cluster distance are minimized and intra-cluster distance are maximized?
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide60
11/01/2006Semi-supervised learning
Distance metric learningDistance metric learning
Define a new distance measure of the form:
Linear transformation of the original data
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide61
11/01/2006Semi-supervised learning
Distance metric learningDistance metric learning
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide62
11/01/2006Semi-supervised learning
Semi-Supervised Clustering Example
Similarity Based
Semi-Supervised Clustering Example
Similarity Based
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide63
11/01/2006Semi-supervised learning
Semi-Supervised Clustering Example
Distances Transformed by Learned Metric
Semi-Supervised Clustering Example
Distances Transformed by Learned Metric
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide64
11/01/2006Semi-supervised learning
Semi-Supervised Clustering Example
Clustering Result with Trained Metric
Semi-Supervised Clustering Example
Clustering Result with Trained Metric
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide65
11/01/2006Semi-supervised learning
Source: E. Xing, et al. Distance metric learning
EvaluationEvaluation
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide66
11/01/2006Semi-supervised learning
Source: E. Xing, et al. Distance metric learning
EvaluationEvaluation
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide67
11/01/2006Semi-supervised learning
Additional ReadingsAdditional Readings
Combining Similarity and Search-Based Semi-Supervised Clustering “Comparing and Unifying Search-Based and Similarity-Based Approaches to Semi-Supervised Clustering”, Basu, et al.
Ontology based semi-supervised clustering “A framework for ontology-driven subspace clustering”, Liu et al.
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide68
11/01/2006Semi-supervised learning
SummarySummary
Semi-supervised clustering
Its applications in Bioinformatics? Not fully explored
Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide69
11/01/2006Semi-supervised learning
ReferenceReference
UT machine learning grouphttp://www.cs.utexas.edu/~ml/publication/unsupervised.html
Semi-supervised Clustering by Seeding http://www.cs.utexas.edu/users/ml/papers/semi-icml-02.pdf
Constrained K-means clustering with background knowledge
http://www.litech.org/~wkiri/Papers/wagstaff-kmeans-01.pdf
Some slides are from Jieping Ye at Arizona State