the university of kansas eecs 800 research seminar mining biological data instructor: luke huan...

Post on 17-Dec-2015

218 Views

Category:

Documents

1 Downloads

Preview:

Click to see full reader

TRANSCRIPT

The UNIVERSITY of Kansas

EECS 800 Research SeminarMining Biological Data

Instructor: Luke Huan

Fall, 2006

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide2

11/01/2006Semi-supervised learning

AdministrativeAdministrative

Student presentations start from next weekSend me the list of references that you use, if your choices are different from what are listed at the class webpage

Speaker:Allocate enough time for background.

20 minutes for general audience

20 minutes for people working in “your area”

<20 minutes for experts

Keep in mind of five “w” questions: What is it? Why is it important? Who cares? Works or not? and So what?

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide3

11/01/2006Semi-supervised learning

AdministrativeAdministrative

ParticipantsNeed to ask at least one question

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide4

11/01/2006Semi-supervised learning

OverviewOverview

Semi-supervised learningSemi-supervised classification

Semi-supervised clustering

Semi-supervised clusteringSearch based methods

Cop K-mean

Seeded K-mean

Constrained K-mean

Similarity based methods

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide5

11/01/2006Semi-supervised learning

Supervised Classification Example

Supervised Classification Example

.

..

.

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide6

11/01/2006Semi-supervised learning

Supervised Classification Example

Supervised Classification Example

...

..

. .. ..

.

.

....

.

...

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide7

11/01/2006Semi-supervised learning

Supervised Classification Example

Supervised Classification Example

...

..

. .. ..

.

.

....

.

...

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide8

11/01/2006Semi-supervised learning

.

Unsupervised Clustering Example

Unsupervised Clustering Example

...

..

. .. ..

.

.

....

.

...

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide9

11/01/2006Semi-supervised learning

Unsupervised Clustering Example

Unsupervised Clustering Example

...

..

. .. ..

.

.

....

.

...

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide10

11/01/2006Semi-supervised learning

Semi-Supervised LearningSemi-Supervised Learning

Combines labeled and unlabeled data during training to improve performance:

Semi-supervised classification: Training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier.

Semi-supervised clustering: Uses small amount of labeled data to aid and bias the clustering of unlabeled data.

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide11

11/01/2006Semi-supervised learning

Semi-Supervised Classification Example

Semi-Supervised Classification Example

...

..

. .. ..

.

.

....

.

...

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide12

11/01/2006Semi-supervised learning

Semi-Supervised Classification Example

Semi-Supervised Classification Example

...

..

. .. ..

.

.

....

.

...

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide13

11/01/2006Semi-supervised learning

Semi-Supervised Classification

Semi-Supervised Classification

Algorithms:Semisupervised EM [Ghahramani:NIPS94,Nigam:ML00].

Co-training [Blum:COLT98].

Transductive SVM’s [Vapnik:98,Joachims:ICML99].

Assumptions:Known, fixed set of categories given in the labeled data.

Goal is to improve classification of examples into these known categories.

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide14

11/01/2006Semi-supervised learning

Semi-Supervised Clustering Example

Semi-Supervised Clustering Example

...

..

. .. ..

.

.

....

.

...

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide15

11/01/2006Semi-supervised learning

Semi-Supervised Clustering Example

Semi-Supervised Clustering Example

.

...

. .. ..

.

.

....

.

...

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide16

11/01/2006Semi-supervised learning

Second Semi-Supervised Clustering Example

Second Semi-Supervised Clustering Example

...

..

. .. ..

.

.

....

.

...

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide17

11/01/2006Semi-supervised learning

Second Semi-Supervised Clustering Example

Second Semi-Supervised Clustering Example

.

...

. .. ..

.

.

....

.

...

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide18

11/01/2006Semi-supervised learning

Semi-Supervised ClusteringSemi-Supervised Clustering

Can group data using the categories in the initial labeled data.

Can also extend and modify the existing set of categories as needed to reflect other regularities in the data.

Can cluster a disjoint set of unlabeled data using the labeled data as a “guide” to the type of clusters desired.

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide19

11/01/2006Semi-supervised learning

Problem definitionProblem definition

Input:A set of unlabeled objects

Some domain knowledge

Output:A partitioning of the objects into clusters

Objective:Maximum intra-cluster similarity

Minimum inter-cluster similarity

High consistency between the partitioning and the domain knowledge

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide20

11/01/2006Semi-supervised learning

What is Domain Knowledge?What is Domain Knowledge?

Must-link and cannot-link

Class labels

Ontology

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide21

11/01/2006Semi-supervised learning

Why semi-supervised clustering?Why semi-supervised clustering?

Why not clustering?Could not incorporate prior knowledge into clustering process

Why not classification?Sometimes there are insufficient labeled data.

Potential applicationsBioinformatics (gene and protein clustering)

Document hierarchy construction

News/email categorization

Image categorization

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide22

11/01/2006Semi-supervised learning

Semi-Supervised ClusteringSemi-Supervised Clustering

ApproachesSearch-based Semi-Supervised Clustering

Alter the clustering algorithm using the constraints

Similarity-based Semi-Supervised Clustering

Alter the similarity measure based on the constraints

Combination of both

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide23

11/01/2006Semi-supervised learning

Search-Based Semi-Supervised Clustering

Search-Based Semi-Supervised Clustering

Alter the clustering algorithm that searches for a good partitioning by:

Modifying the objective function to give a reward for obeying labels on the supervised data [Demeriz:ANNIE99].

Enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff:ICML00, Wagstaff:ICML01].

Use the labeled data to initialize clusters in an iterative refinement algorithm (kMeans, EM) [Basu:ICML02].

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide24

11/01/2006Semi-supervised learning

Unsupervised KMeans ClusteringUnsupervised KMeans Clustering

KMeans iteratively partitions a dataset into K clusters.

Algorithm:

Initialize K cluster centers randomly. Repeat until convergence:

Cluster Assignment Step: Assign each data point x to the cluster Xl, such that L2 distance of x from (center of Xl) is minimum

Center Re-estimation Step: Re-estimate each cluster center as the mean of the points in that cluster

}{1l

K

l

l

l

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide25

11/01/2006Semi-supervised learning

KMeans Objective FunctionKMeans Objective Function

Locally minimizes sum of squared distance between the data points and their corresponding cluster centers:

Initialization of K cluster centers:Totally random

Random perturbation from global mean

Heuristic to ensure well-separated centers etc.

2

1||||

K

l Xx lili

x

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide26

11/01/2006Semi-supervised learning

K Means ExampleK Means Example

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide27

11/01/2006Semi-supervised learning

K Means ExampleRandomly Initialize Means

K Means ExampleRandomly Initialize Means

x

x

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide28

11/01/2006Semi-supervised learning

K Means ExampleAssign Points to Clusters

K Means ExampleAssign Points to Clusters

x

x

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide29

11/01/2006Semi-supervised learning

K Means ExampleRe-estimate MeansK Means ExampleRe-estimate Means

x

x

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide30

11/01/2006Semi-supervised learning

K Means ExampleRe-assign Points to Clusters

K Means ExampleRe-assign Points to Clusters

x

x

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide31

11/01/2006Semi-supervised learning

K Means ExampleRe-estimate MeansK Means ExampleRe-estimate Means

x

x

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide32

11/01/2006Semi-supervised learning

K Means ExampleRe-assign Points to Clusters

K Means ExampleRe-assign Points to Clusters

x

x

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide33

11/01/2006Semi-supervised learning

K Means ExampleRe-estimate Means and Converge

K Means ExampleRe-estimate Means and Converge

x

x

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide34

11/01/2006Semi-supervised learning

Semi-Supervised K-MeansSemi-Supervised K-Means

Constraints (Must-link, Cannot-link)COP K-Means

Partial label information is givenSeeded K-Means (Basu, ICML’02)

Constrained K-Means

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide35

11/01/2006Semi-supervised learning

COP K-MeansCOP K-Means

COP K-Means is K-Means with must-link (must be in same cluster) and cannot-link (cannot be in same cluster) constraints on data points.

Initialization: Cluster centers are chosen randomly but no must-link constraints that may be violated

Algorithm: During cluster assignment step in COP-K-Means, a point is assigned to its nearest cluster without violating any of its constraints. If no such assignment exists, abort.

Based on Wagstaff et al.: ICML01

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide36

11/01/2006Semi-supervised learning

COP K-Means AlgorithmCOP K-Means AlgorithmCOP K-Means Algorithm

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide37

11/01/2006Semi-supervised learning

IllustrationIllustration

xx

Must-link

Determineits label

Assign to the red class

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide38

11/01/2006Semi-supervised learning

IllustrationIllustration

xx

Cannot-link

Determineits label

Assign to the red class

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide39

11/01/2006Semi-supervised learning

IllustrationIllustration

xx

Cannot-link

Determineits label

The clustering algorithm fails

Must-link

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide40

11/01/2006Semi-supervised learning

EvaluationEvaluation

Rand index: measures the agreement between two partitions, P1 and P2, of the same data set D.

Each partition is viewed as a collection of n(n-1)/2 pairwise decisions, where n is the size of D.

a is the number of decisions where P1 and P2 put a pair of objects into the same cluster

b is the number of decisions where two instances are placed in different clusters in both partitions.

Total agreement can then be calculated using Rand(P1; P2) = (a + b)/ (n (n -1)/2):

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide41

11/01/2006Semi-supervised learning

EvaluationEvaluation

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide42

11/01/2006Semi-supervised learning

Semi-Supervised K-MeansSemi-Supervised K-Means

Seeded K-Means:Labeled data provided by user are used for initialization: initial center for cluster i is the mean of the seed points having label i.

Seed points are only used for initialization, and not in subsequent steps.

Constrained K-Means:Labeled data provided by user are used to initialize K-Means algorithm.

Cluster labels of seed data are kept unchanged in the cluster assignment steps, and only the labels of the non-seed data are re-estimated.

Based on Basu et al., ICML’02.

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide43

11/01/2006Semi-supervised learning

Seeded K-MeansSeeded K-Means

Use labeled data to find the initial centroids andthen run K-Means.

The labels for seeded points may change.

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide44

11/01/2006Semi-supervised learning

Seeded K-Means ExampleSeeded K-Means Example

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide45

11/01/2006Semi-supervised learning

Seeded K-Means ExampleInitialize Means Using Labeled Data

Seeded K-Means ExampleInitialize Means Using Labeled Data

xx

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide46

11/01/2006Semi-supervised learning

Seeded K-Means ExampleAssign Points to Clusters

Seeded K-Means ExampleAssign Points to Clusters

xx

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide47

11/01/2006Semi-supervised learning

Seeded K-Means ExampleRe-estimate Means

Seeded K-Means ExampleRe-estimate Means

xx

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide48

11/01/2006Semi-supervised learning

Seeded K-Means ExampleAssign points to clusters and

Converge

Seeded K-Means ExampleAssign points to clusters and

Converge

xx the label is changed

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide49

11/01/2006Semi-supervised learning

Constrained K-MeansConstrained K-Means

Use labeled data to find the initial centroids andthen run K-Means.

The labels for seeded points will not change.

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide50

11/01/2006Semi-supervised learning

Constrained K-Means Example

Constrained K-Means Example

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide51

11/01/2006Semi-supervised learning

Constrained K-Means Example

Initialize Means Using Labeled Data

Constrained K-Means Example

Initialize Means Using Labeled Data

xx

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide52

11/01/2006Semi-supervised learning

Constrained K-Means Example

Assign Points to Clusters

Constrained K-Means Example

Assign Points to Clusters

xx

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide53

11/01/2006Semi-supervised learning

Constrained K-Means Example

Re-estimate Means and Converge

Constrained K-Means Example

Re-estimate Means and Converge

xx

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide54

11/01/2006Semi-supervised learning

DatasetsDatasets

Data sets: UCI Iris (3 classes; 150 instances)CMU 20 Newsgroups (20 classes; 20,000 instances) Yahoo! News (20 classes; 2,340 instances)

Data subsets created for experiments:Small-20 newsgroup: random sample of 100 documents from each newsgroup, created to study effect of datasize on algorithms.Different-3 newsgroup: 3 very different newsgroups (alt.atheism, rec.sport.baseball, sci.space), created to study effect of data separability on algorithms.Same-3 newsgroup: 3 very similar newsgroups (comp.graphics, comp.os.ms-windows, comp.windows.x).

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide55

11/01/2006Semi-supervised learning

EvaluationEvaluation

Mutual information

Objective function

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide56

11/01/2006Semi-supervised learning

Results: MI and SeedingResults: MI and Seeding

Zero noise in seeds [Small-20 NewsGroup]Semi-Supervised KMeans substantially better than unsupervised KMeans

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide57

11/01/2006Semi-supervised learning

Results: Objective function and Seeding

Results: Objective function and Seeding

User-labeling consistent with KMeans assumptions [Small-20 NewsGroup] Obj. function of data partition increases exponentially with seed fraction

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide58

11/01/2006Semi-supervised learning

Results: Objective Function and Seeding

Results: Objective Function and Seeding

User-labeling inconsistent with KMeans assumptions [Yahoo! News] Objective function of constrained algorithms decreases with seeding

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide59

11/01/2006Semi-supervised learning

Similarity Based MethodsSimilarity Based Methods

Questions: given a set of points and the class labels, can we learn a distance matrix such that inter-cluster distance are minimized and intra-cluster distance are maximized?

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide60

11/01/2006Semi-supervised learning

Distance metric learningDistance metric learning

Define a new distance measure of the form:

Linear transformation of the original data

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide61

11/01/2006Semi-supervised learning

Distance metric learningDistance metric learning

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide62

11/01/2006Semi-supervised learning

Semi-Supervised Clustering Example

Similarity Based

Semi-Supervised Clustering Example

Similarity Based

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide63

11/01/2006Semi-supervised learning

Semi-Supervised Clustering Example

Distances Transformed by Learned Metric

Semi-Supervised Clustering Example

Distances Transformed by Learned Metric

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide64

11/01/2006Semi-supervised learning

Semi-Supervised Clustering Example

Clustering Result with Trained Metric

Semi-Supervised Clustering Example

Clustering Result with Trained Metric

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide65

11/01/2006Semi-supervised learning

Source: E. Xing, et al. Distance metric learning

EvaluationEvaluation

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide66

11/01/2006Semi-supervised learning

Source: E. Xing, et al. Distance metric learning

EvaluationEvaluation

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide67

11/01/2006Semi-supervised learning

Additional ReadingsAdditional Readings

Combining Similarity and Search-Based Semi-Supervised Clustering “Comparing and Unifying Search-Based and Similarity-Based Approaches to Semi-Supervised Clustering”, Basu, et al.

Ontology based semi-supervised clustering “A framework for ontology-driven subspace clustering”, Liu et al.

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide68

11/01/2006Semi-supervised learning

SummarySummary

Semi-supervised clustering

Its applications in Bioinformatics? Not fully explored

Mining Biological DataKU EECS 800, Luke Huan, Fall’06 slide69

11/01/2006Semi-supervised learning

ReferenceReference

UT machine learning grouphttp://www.cs.utexas.edu/~ml/publication/unsupervised.html

Semi-supervised Clustering by Seeding http://www.cs.utexas.edu/users/ml/papers/semi-icml-02.pdf

Constrained K-means clustering with background knowledge

http://www.litech.org/~wkiri/Papers/wagstaff-kmeans-01.pdf

Some slides are from Jieping Ye at Arizona State

top related