
Clustering Documents

Upload: john-jennings

Posted on 27-Dec-2015



Overview

Clustering is the process of partitioning a data set into meaningful subclasses, where every item in a subclass shares a common trait. It helps a user understand the natural grouping or structure in a data set.

Clustering is unsupervised learning: the groupings go by many names (cluster, category, group, class), and there is no training data from which a classifier learns how to group. Documents that share the same properties are placed in the same cluster.

Key design choices are the cluster size, the number of clusters, and the similarity measure. A common rule of thumb sets the number of clusters to the square root of n, where n is the number of documents. Latent Semantic Indexing (LSI) can be used to represent documents before clustering.

What is clustering?

A grouping of data objects such that the objects within a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups:
- Inter-cluster distances are maximized.
- Intra-cluster distances are minimized.

Outliers

Outliers are objects that do not belong to any cluster, or that form clusters of very small cardinality.

In some applications we are interested in discovering outliers rather than clusters (outlier analysis).

Why do we cluster?

Clustering: given a collection of data objects, group them so that objects are
- similar to one another within the same cluster, and
- dissimilar to the objects in other clusters.

Clustering results are used:
- As a stand-alone tool to get insight into the data distribution; visualization of clusters may unveil important information.
- As a preprocessing step for other algorithms; efficient indexing or compression often relies on clustering.

Applications of clustering

- Image processing: cluster images based on their visual content.
- Web: cluster groups of users based on their access patterns on web pages; cluster web pages based on their content.
- Bioinformatics: cluster similar proteins together (similarity with respect to chemical structure and/or functionality, etc.).
- Many more.

The clustering task

Group observations so that observations belonging to the same group are similar, whereas observations in different groups are different.

Basic questions:
- What does "similar" mean?
- What is a good partition of the objects? That is, how is the quality of a solution measured?
- How do we find a good partition of the observations?

Observations to cluster may have:
- Real-valued attributes/variables, e.g., salary, height.
- Binary attributes, e.g., gender (M/F), has_cancer (T/F).
- Nominal (categorical) attributes, e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.).
- Ordinal/ranked attributes, e.g., military rank (soldier, sergeant, lieutenant, captain, etc.).
- Variables of mixed types: multiple attributes with various types.

Aim of Clustering

Partition unlabeled examples into subsets of clusters such that:
- Examples within a cluster are very similar.
- Examples in different clusters are very different.

Clustering Example (figure: points grouped into clusters; omitted)

Cluster Organization

For a small number of documents, simple/flat clustering is acceptable: search a smaller set of clusters for relevancy, and if a cluster is relevant, the documents in the cluster are also relevant. The problem is looking for broader or more specific documents. Hierarchical clustering has a tree-like structure.

Dendrogram

A dendrogram presents the progressive, hierarchy-forming merging process pictorially.

Visualization of a Dendrogram

Example

- D1: Human machine interface for computer applications
- D2: A survey of user opinion of computer system response time
- D3: The EPS user interface management system
- D4: System and human system engineering testing of the EPS system
- D5: The generation of the random binary and ordered trees
- D6: The intersection graphs of paths in a tree
- D7: Graph minors: A survey

(Figure: dendrogram over D1-D7, from broad clusters near the root to specific clusters at the leaves.)

Cluster Parameters

- A minimum and maximum size of clusters. A large cluster size means one cluster attracting many documents, with multi-topic themes.
- A matching threshold value for including documents in a cluster (a minimum degree of similarity). This affects the number of clusters: a high threshold means fewer documents can join a cluster and a larger number of clusters.
- The degree of overlap between clusters. Some documents deal with more than one topic; a low degree of overlap gives greater separation of clusters.
- A maximum number of clusters.

Cluster-Based Search

In an inverted file organization, query keywords must exactly match word occurrences. A clustered file organization instead matches a keyword against a set of cluster representatives, where each cluster representative consists of popular words related to a common topic. In flat clustering, the query is compared against the centroids of the clusters. A centroid is the average representative of a group of documents, built from the composite text of all member documents.

Automatic Document Classification

Searching vs. browsing. Disadvantages of using inverted index files:
- Information pertaining to a document is scattered among many different inverted-term lists.
- Information relating to different documents with similar term assignments is not in close proximity in the file system.

Approaches:
- Inverted-index files (for searching) plus a clustered document collection (for browsing).
- A clustered file organization (for both searching and browsing).

(Figure: a typical clustered file organization; the typical search path descends from the highest-level centroid through supercentroids and centroids down to documents.)

Cluster Generation vs. Cluster Search

- Cluster generation: the cluster structure is generated only once, and cluster maintenance can be carried out at relatively infrequent intervals, so the generation process may be slower and more expensive.
- Cluster search: search operations may have to be performed continually, so they must be carried out efficiently.

Hierarchical Cluster Generation

Two strategies: pairwise item similarities, and heuristic methods.

Models:
- Divisive clustering (top down): the complete collection is assumed to represent one complete cluster, which is then subsequently broken down into smaller pieces.
- Hierarchical agglomerative clustering (bottom up): individual item similarities are used as a starting point, and a gluing operation collects similar items, or groups, into larger groups.

Searching with a Taxonomy

Two ways to search a document collection organized in a taxonomy:
- Top-down search: start at the root and progressively compare the query with cluster representatives. A single error at a higher level leads down a wrong path to an incorrect cluster.
- Bottom-up search: compare the query with the most specific clusters at the lowest level. A high number of low-level clusters increases computation time; use an inverted index for the low-level representatives.
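The flat centroid-based search described above can be sketched in Python. This is only an illustration: the toy two-term vectors and the use of cosine similarity are our assumptions, not prescribed by the slides.

```python
import math

def cosine(u, v):
    # Cosine similarity between two term-weight vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(docs):
    # Average representative built from the member document vectors.
    n = len(docs)
    return [sum(col) / n for col in zip(*docs)]

def clustered_search(query, clusters):
    # Flat clustered-file search: match the query against cluster centroids
    # first, then rank only the documents of the best-matching cluster.
    best = max(clusters, key=lambda docs: cosine(query, centroid(docs)))
    return sorted(best, key=lambda d: cosine(query, d), reverse=True)

clusters = [
    [[1.0, 0.0], [2.0, 0.0]],   # toy 2-term document vectors (hypothetical)
    [[0.0, 1.0], [0.0, 2.0]],
]
print(clustered_search([1.0, 0.1], clusters))  # only cluster 0's docs are ranked
```

Note that the second cluster's documents are never scored against the query, which is exactly the saving a clustered file organization provides over scanning the whole collection.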

Aim of Clustering, again

Partition data into classes with high intra-class similarity and low inter-class similarity. Is this well-defined?

What is Similarity?

Similarity is clearly a subjective or problem-dependent measure.

How Similar Are Clusters? Example 1: two clusters or one cluster?

How Similar Are Clusters? Example 2: a cluster or outliers?

Similarity Measures

Most clustering methods use a matrix of similarity computations: the similarities between all pairs of documents.

Homework

What similarity measures are used in text mining? Discuss their advantages and disadvantages. Wherever appropriate, comment on the application areas for each similarity measure. List your references and use your own words.

Linking Methods

- Clique: every member is related to every other member.
- Star: every member is related to the central member; the central member can be an average.
- String: each member may be related to only one other member; the topics at the two ends may be dissimilar, and the cluster size may grow very large.

Clustering Methods

There are many methods to compute clusters. Clustering is an NP-complete problem: each solution can be evaluated quickly, but exhaustive evaluation of all solutions is not feasible. Each trial may produce a different cluster organization. Each document is placed into one cluster or topic.

Stable Clustering

- Results should be independent of the initial order of documents.
- Clusters should not be substantially different when new documents are added to the collection.
- Results from consecutive runs should not differ significantly.

K-Means

A heuristic with complexity O(n log n), compared with matrix-based algorithms at O(n^2). It begins with an initial set of clusters:
- Pick the cluster centroids randomly, or
- use matrix-based similarity on a small subset, or
- use a density test to pick cluster centers from sample data: D_i is a cluster center if at least n other documents have similarity to it greater than a threshold. A set of documents that are sufficiently dissimilar must exist in the collection.

K-Means Algorithm

1. Select k documents from the collection to form k initial singleton clusters.
2. Repeat until the termination conditions are satisfied:
   a. For every document d, find the cluster i whose centroid is most similar, and assign d to cluster i.
   b. For every cluster i, recompute the centroid based on the current member documents.
   c. Check for termination: minimal or no changes in the assignment of documents to clusters.
3. Return the list of clusters.

Each document is assigned to one cluster. The size of k and the initial set are important.

Simulated Annealing

Avoids local optima by randomized search:
- Downhill move: a new solution with a higher (better) value than the previous solution.
- Uphill move: a worse solution is accepted in order to avoid local minima; the frequency of such moves decreases during the life cycle.
The method is an analogy to crystal formation.
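The k-means loop above can be sketched in Python. This is a minimal illustration, not the lecture's implementation: for determinism it seeds the k singleton clusters with the first k documents (the slides suggest random selection among other options), and it assumes cosine similarity.

```python
import math

def cos(u, v):
    # Cosine similarity, the usual document similarity measure.
    d = sum(a * b for a, b in zip(u, v))
    n = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return d / n if n else 0.0

def kmeans(docs, k, iters=20):
    # Step 1: k initial singleton clusters (first k docs, for the sketch).
    centroids = [list(docs[i]) for i in range(k)]
    assign = None
    for _ in range(iters):
        # Step 2a: assign every document to the most similar centroid.
        new = [max(range(k), key=lambda c: cos(d, centroids[c])) for d in docs]
        if new == assign:                 # step 2c: no change, terminate
            break
        assign = new
        # Step 2b: recompute each centroid from its current members.
        for c in range(k):
            members = [docs[i] for i, a in enumerate(assign) if a == c]
            if members:                   # keep old centroid if cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

docs = [[1, 0], [0, 1], [2, 0], [0, 3]]   # toy vectors (hypothetical)
print(kmeans(docs, 2))                    # [0, 1, 0, 1]
```

On this toy input the assignments stabilize after one pass, which triggers the termination check.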

Simulated Annealing Algorithm

1. Get an initial set of clusters and set the temperature to T (e.g., T = 100).
2. Repeat until the temperature is reduced to the minimum:
   a. Run a loop x times:
      - Find a new set of clusters by altering the membership of some documents.
      - Compare the values of the new and old sets of clusters. If there is an improvement, accept the new set of clusters; otherwise accept the new set with probability p.
   b. Reduce the temperature based on the cooling schedule, e.g., reduce by a tenth using T = 0.9 * T.
3. Return the final set of clusters.
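A sketch of the annealing loop, again purely illustrative: the slides do not fix the objective function, so this version scores a clustering by the mean pairwise similarity within clusters (our choice), uses the acceptance probability exp(delta / T), and applies the cooling schedule T = 0.9 * T from T = 100 given above. The similarity matrix is the A-F example used later in these notes.

```python
import math
import random

def intra(assign, S):
    # Value of a clustering (our choice of objective): mean pairwise
    # similarity within clusters; higher is better.
    total, pairs = 0.0, 0
    for i in range(len(assign)):
        for j in range(i + 1, len(assign)):
            if assign[i] == assign[j]:
                total += S[i][j]
                pairs += 1
    return total / pairs if pairs else 0.0

def anneal(S, k, T=100.0, t_min=0.1, x=50, seed=1):
    random.seed(seed)
    n = len(S)
    assign = [random.randrange(k) for _ in range(n)]   # initial clusters
    value = intra(assign, S)
    while T > t_min:
        for _ in range(x):  # inner loop of x membership alterations
            i, c = random.randrange(n), random.randrange(k)
            old = assign[i]
            assign[i] = c
            delta = intra(assign, S) - value
            # Downhill (improving) moves are always accepted; uphill moves
            # are accepted with a probability that shrinks as T falls.
            if delta >= 0 or random.random() < math.exp(delta / T):
                value += delta
            else:
                assign[i] = old
        T *= 0.9  # cooling schedule from the slides
    return assign, value

S = [[0, .3, .5, .6, .8, .9],
     [.3, 0, .4, .5, .7, .8],
     [.5, .4, 0, .3, .5, .2],
     [.6, .5, .3, 0, .4, .1],
     [.8, .7, .5, .4, 0, .3],
     [.9, .8, .2, .1, .3, 0]]
assignment, value = anneal(S, k=2)
print(assignment, round(value, 3))
```

At high temperature nearly every move is accepted (a random walk over clusterings); as T cools, the search increasingly keeps only improving moves.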

Simulated Annealing: Properties

- Simple to implement.
- Solutions are reasonably good and avoid local minima.
- Successful in other optimization tasks.
- The initial set is very important.
- Adjusting the size of clusters is difficult.

Genetic Algorithm

- Pick two parent solutions x and y from the set of all solutions, with preference for solutions with higher fitness scores.
- Use a crossover operation to combine x and y to generate a new solution z.
- Periodically mutate a solution by randomly exchanging two documents within it.

(Extra material: learn the scatter/gather algorithm.)

Hierarchical Agglomerative Clustering

Basic procedure:
1. Place each of the N documents into a class of its own.
2. Compute all pairwise document-document similarity coefficients (N(N-1)/2 coefficients).
3. Form a new cluster by combining the most similar pair of current clusters i and j; update the similarity matrix by deleting the rows and columns corresponding to i and j; calculate the entries in the row corresponding to the new cluster i+j.
4. Repeat step 3 while the number of clusters left is greater than 1.

How to Combine Clusters?

Intercluster similarity can be measured by single link, complete link, or group-average link.

Single-link clustering:
- Each document must have a similarity exceeding a stated threshold value with at least one other document in the same class.
- The similarity between a pair of clusters is taken to be the similarity between the most similar pair of items.
- Each cluster member will be more similar to at least one member of its own cluster than to any member of another cluster.

Complete-link clustering:
- Each document has a similarity to all other documents in the same class that exceeds the threshold value.
- The similarity between the least similar pair of items from the two clusters is used as the cluster similarity.
- Each cluster member is more similar to the most dissimilar member of its own cluster than to the most dissimilar member of any other cluster.

Group-average link clustering:
- A compromise between the extremes of single-link and complete-link systems.
- Each cluster member has a greater average similarity to the remaining members of its own cluster than to all members of any other cluster.

Example for Agglomerative Clustering

Six items A-F give 6(6-1)/2 = 15 pairwise similarities, processed in decreasing order:

      A    B    C    D    E    F
 A    .   .3   .5   .6   .8   .9
 B   .3    .   .4   .5   .7   .8
 C   .5   .4    .   .3   .5   .2
 D   .6   .5   .3    .   .4   .1
 E   .8   .7   .5   .4    .   .3
 F   .9   .8   .2   .1   .3    .

Single-Link Clustering

Under single link, sim(AF, X) = max(sim(A, X), sim(F, X)), and similarly for larger clusters, e.g., sim(AEF, X) = max(sim(AF, X), sim(E, X)).

1. AF 0.9: merge A and F at level 0.9.
2. AE 0.8: E joins AF at level 0.8.
3. BF 0.8: B joins AEF at level 0.8 (note that E and B are on the same level).
4. BE 0.7: B and E are already in the same cluster; nothing changes.
5. AD 0.6: D joins ABEF at level 0.6.
6. AC 0.5: C joins ABDEF at level 0.5.

Single-Link Clusters
- At similarity level 0.7 (i.e., similarity threshold 0.7): clusters ABEF, C, D.
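The single-link merge sequence above can be reproduced with a short program. The only divergence is tie-breaking: AF-B and AF-E both tie at similarity 0.8, so this code happens to attach B before E where the slides attach E first; both orders are valid and the merge levels are identical.

```python
def single_link(items, sim):
    # Agglomerative clustering where cluster-to-cluster similarity is the
    # MAXIMUM pairwise item similarity (single link). Returns the level
    # and resulting cluster of each merge.
    clusters = [frozenset([x]) for x in items]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = max(sim[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        s, i, j = best
        merged = clusters[i] | clusters[j]
        merges.append((s, merged))
        clusters = [c for m, c in enumerate(clusters) if m not in (i, j)] + [merged]
    return merges

# Pairwise similarities from the A-F example matrix.
pairs = {"AB": .3, "AC": .5, "AD": .6, "AE": .8, "AF": .9, "BC": .4,
         "BD": .5, "BE": .7, "BF": .8, "CD": .3, "CE": .5, "CF": .2,
         "DE": .4, "DF": .1, "EF": .3}
sim = {x: {} for x in "ABCDEF"}
for p, v in pairs.items():
    a, b = p
    sim[a][b] = sim[b][a] = v

for s, c in single_link("ABCDEF", sim):
    print(round(s, 1), "".join(sorted(c)))  # levels 0.9, 0.8, 0.8, 0.6, 0.5
```

The printed levels match the dendrogram in the slides: AF at 0.9, B and E at 0.8, D at 0.6, C at 0.5.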

- At similarity level 0.5: the single cluster ABCDEF.

Complete-Link Cluster Generation

Under complete link, sim(AF, X) = min(sim(A, X), sim(F, X)). Pairs are again processed in decreasing order of similarity, but two clusters can only be merged once every cross-cluster pair between them has already been seen; until then, the missing pairs are recorded as "check" operations.

1.  AF 0.9: merge A and F at level 0.9.
2.  AE 0.8: check (E, F).
3.  BF 0.8: check (A, B).
4.  BE 0.7: merge B and E at level 0.7.
5.  AD 0.6: check (D, F).
6.  AC 0.5: check (C, F).
7.  BD 0.5: check (D, E).
8.  CE 0.5: check (B, C).
9.  BC 0.4: all pairs between C and cluster BE are now covered; C joins BE at level 0.4.
10. DE 0.4: check (B, D), (C, D).
11. AB 0.3: check (A, C), (A, E), (B, F), (C, F), (E, F).
12. CD 0.3: all pairs between D and cluster BCE are now covered; D joins at level 0.3.
13. EF 0.3: check (B, F), (C, F), (D, F).
14. CF 0.2: check (D, F).
15. DF 0.1: last pair; AF joins BCDE at level 0.1.

Complete-Link Clusters
- At similarity level 0.7: clusters AF (joined at 0.9), BE (joined at 0.7), C, D.
- At similarity level 0.4: clusters AF, BCE, D.
- At similarity level 0.3: clusters AF, BCDE.

Group-Average Link Clustering

- Uses the average values of the pairwise links within a cluster to determine similarity.
- All objects contribute to intercluster similarity.
- Results in a structure intermediate between loosely bound single-link clusters and tightly bound complete-link clusters.

Comparison: Behavior of Single-Link Clustering

- The single-link process tends to produce a small number of large clusters characterized by a chaining effect.
- Each element is usually attached to only one other member of the same cluster at each similarity level.
- It is sufficient to remember the list of previously clustered single items.

Comparison: Behavior of Complete-Link Clustering

- The complete-link process produces a much larger number of small, tightly linked groupings.
- Each item in a complete-link cluster is guaranteed to resemble all other items in that cluster at the stated similarity level.
- It is necessary to remember the list of all item pairs previously considered in the clustering process.

The complete-link clustering system may be better adapted to retrieval than single-link clusters, but complete-link cluster generation is more expensive to perform than a comparable single-link process.
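The complete-link result can be checked with the same greedy skeleton as in the single-link example, with min in place of max; recomputing cluster-to-cluster similarities directly replaces the pair-checking bookkeeping of the step-by-step trace.

```python
def complete_link(items, sim):
    # Agglomerative clustering where cluster-to-cluster similarity is the
    # MINIMUM pairwise item similarity (complete link).
    clusters = [frozenset([x]) for x in items]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = min(sim[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        s, i, j = best
        merged = clusters[i] | clusters[j]
        merges.append((s, merged))
        clusters = [c for m, c in enumerate(clusters) if m not in (i, j)] + [merged]
    return merges

# Pairwise similarities from the A-F example matrix.
pairs = {"AB": .3, "AC": .5, "AD": .6, "AE": .8, "AF": .9, "BC": .4,
         "BD": .5, "BE": .7, "BF": .8, "CD": .3, "CE": .5, "CF": .2,
         "DE": .4, "DF": .1, "EF": .3}
sim = {x: {} for x in "ABCDEF"}
for p, v in pairs.items():
    a, b = p
    sim[a][b] = sim[b][a] = v

for s, c in complete_link("ABCDEF", sim):
    print(round(s, 1), "".join(sorted(c)))  # levels 0.9, 0.7, 0.4, 0.3, 0.1
```

The merges (AF, BE, then C, then D, then AF last) and their levels match the complete-link dendrogram, and the clusters are visibly smaller and tighter than in the single-link run on the same matrix.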

Notation:
- D_i = (w_1,i, w_2,i, ..., w_t,i) is the document vector for document D_i.
- L_j = (l_j,1, l_j,2, ..., l_j,nj) is the inverted list for term T_j.
- l_j,i denotes the document identifier of the ith document listed under term T_j.
- n_j denotes the number of postings for term T_j.

for j = 1 to t                 (for each of t possible terms)
    for i = 1 to n_j - 1       (for the first n_j - 1 entries on the jth list)
        for k = i + 1 to n_j   (for each later entry on the same list)
            compute sim(D_{l_j,i}, D_{l_j,k})
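The loop above computes similarities only between documents that co-occur on some posting list; documents sharing no term are never compared. A Python rendering of the same idea is sketched below. The term names and the shared-term count used as a stand-in similarity are illustrative only; a real system would accumulate term weights.

```python
from collections import defaultdict

def pairwise_on_postings(inverted_lists):
    # For each term's posting list L_j, compare every pair (i, k), k > i,
    # of documents on that list. The accumulated value here is just the
    # number of shared terms (a toy similarity measure).
    sims = defaultdict(int)
    for postings in inverted_lists.values():
        for i in range(len(postings)):
            for k in range(i + 1, len(postings)):
                pair = tuple(sorted((postings[i], postings[k])))
                sims[pair] += 1
    return dict(sims)

# Hypothetical inverted lists over documents D1-D6.
inv = {"computer": ["D1", "D2"],
       "system": ["D2", "D3", "D4"],
       "tree": ["D5", "D6"]}
print(pairwise_on_postings(inv))
```

Pairs such as (D1, D3), which share no term, never appear in the output, which is the efficiency argument for driving the similarity computation from the inverted lists.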