SCATTER/GATHER : A CLUSTER BASED APPROACH FOR BROWSING LARGE DOCUMENT COLLECTIONS GROUPER : A DYNAMIC CLUSTERING INTERFACE TO WEB SEARCH RESULTS MINAL PATANKAR MADHURI WUDALI

Upload: roland-horton

Post on 16-Jan-2016

SCATTER/GATHER : A CLUSTER BASED APPROACH FOR BROWSING LARGE DOCUMENT COLLECTIONS

GROUPER : A DYNAMIC CLUSTERING INTERFACE TO WEB SEARCH RESULTS

MINAL PATANKAR MADHURI WUDALI

DOCUMENT CLUSTERING

Process of grouping documents with similar contents into a common cluster

ADVANTAGES OF DOCUMENT CLUSTERING

• If a collection is well clustered, we can search only the cluster that will contain relevant documents.
• Clustering also improves browsing through the document collection.

[Figure: overview diagram. A document collection or a meta search engine feeds a clustering step. Traditional text-based clustering algorithms (Buckshot, Fractionation) use word-based similarity and underlie Scatter/Gather; STC uses phrase-based similarity and underlies Grouper. The resulting interfaces give the user a tool for searching and a tool for browsing.]

SCATTER/GATHER INTERFACE

SCATTER/GATHER SESSION

• The user is presented with short summaries of a small number of document groups.
• The user selects one or more groups for further study.
• This process continues until the individual document level is reached.

[Figure: Scatter/Gather processing diagram with the labels Fractionation, Buckshot, Buckshot and Cluster Digest]

HOW IS SCATTER/GATHER DONE?

• Static offline partitioning phase: Fractionation algorithm
• Online reclustering phase: Buckshot algorithm
– Step 1: Group average agglomerative clustering
– Step 2: K-Means

Clustering

• Partitional: K-Means
• Hybrid: Buckshot, Fractionation
• Hierarchical
– Agglomerative: Single link, Complete link, Group average link
– Divisive

HIERARCHICAL AGGLOMERATIVE CLUSTERING

• Create an NxN doc-doc similarity matrix.
• Each document starts as a cluster of size one.
• Do until there is only one cluster:
– combine the two clusters with the greatest similarity
– update the doc-doc matrix
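The loop above can be sketched in a few lines of Python. This is only an illustration: the four documents w, x, y, z and their similarity values are hypothetical, the function name `hac` is made up here, and instead of updating the matrix in place the sketch recomputes each cluster-to-cluster similarity from the original document pairs, which is slower but easier to read.

```python
from itertools import combinations

def hac(labels, sim, linkage="average"):
    """Agglomerative clustering on a doc-doc similarity table.

    labels: document names.
    sim: dict mapping frozenset({a, b}) -> similarity (higher = more similar).
    Returns the merges in order as (cluster_1, cluster_2, similarity).
    """
    combine = {
        "single": max,                           # most similar pair
        "complete": min,                         # least similar pair
        "average": lambda v: sum(v) / len(v),    # mean over all pairs
    }[linkage]

    def cluster_sim(pair):
        c1, c2 = pair
        return combine([sim[frozenset((a, b))] for a in c1 for b in c2])

    clusters = [(name,) for name in labels]      # every document starts alone
    merges = []
    while len(clusters) > 1:
        best = max(combinations(clusters, 2), key=cluster_sim)
        merges.append((best[0], best[1], cluster_sim(best)))
        clusters = [c for c in clusters if c not in best] + [best[0] + best[1]]
    return merges

# Hypothetical similarities between four documents.
raw = {("w", "x"): 9, ("w", "y"): 2, ("w", "z"): 1,
       ("x", "y"): 3, ("x", "z"): 2, ("y", "z"): 8}
sim = {frozenset(pair): value for pair, value in raw.items()}

merges = hac(["w", "x", "y", "z"], sim)
for left, right, value in merges:
    print(left, "+", right, "at similarity", value)
```

With these values w and x merge first, then y and z, and the two pairs are joined last.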

Example

     A    B    C    D    E
A    _    2    7    6    4
B    2    _    9   11   14
C    7    9    _    4    8
D    6   11    4    _    2
E    4   14    8    2    _

[Figure: dendrogram over A, B, C, D, E in which B and E are merged into BE]

• SC(A,BE) = 4 if we are using single link (take max)
• SC(A,BE) = 2 if we are using complete linkage (take min)
• SC(A,BE) = 3 if we are using group average (take average)

Note: C - BE is now the highest link
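The three linkage rules differ only in how the pairwise similarities are combined. A minimal sketch, using SC(A,B) = 2 and SC(A,E) = 4 from the matrix; the function name `cluster_similarity` is illustrative, not part of any library.

```python
def cluster_similarity(pairwise, linkage):
    """Combine document-to-document similarities into one cluster score."""
    if linkage == "single":
        return max(pairwise)                    # best-matching pair
    if linkage == "complete":
        return min(pairwise)                    # worst-matching pair
    if linkage == "average":
        return sum(pairwise) / len(pairwise)    # mean over all pairs
    raise ValueError(f"unknown linkage: {linkage}")

# Similarities of A to the two members of the merged cluster BE.
a_to_be = [2, 4]    # SC(A,B), SC(A,E)

results = {name: cluster_similarity(a_to_be, name)
           for name in ("single", "complete", "average")}
print(results)
```

Because the matrix holds similarities rather than distances, single link takes the maximum and complete link the minimum.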

Example

      A     BE    C     D
A     _     3     7     6
BE    3     _     8.5   6.5
C     7     8.5   _     4
D     6     6.5   4     _

COMBINING

• SC(C,B) = 9
• SC(C,E) = 8
• SC(C,BE) = 8.5

[Figure: dendrogram in which C joins BE to form BEC; A and D remain separate]

Example

       A     BEC    D
A      _     5      6
BEC    5     _      5.75
D      6     5.75   _

COMBINING

[Figure: dendrogram in which A and D are merged into A,D next to BEC]

SCATTER/GATHER SESSION STAGE 1

FRACTIONATION

• Corpus C is broken into N/m buckets of fixed size m > k.
• Apply group average agglomerative clustering to each bucket.
• Generate document groups, which are given as input to the next iteration.
• Repeat until ‘k’ centers remain.
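A rough sketch of these steps, under several assumptions that the slide does not state: documents are small term-weight vectors compared by cosine similarity, each bucket is reduced by a hypothetical factor `rho` per pass, and the parameters satisfy m > k and rho * m >= k so that a pass cannot overshoot k. All function names are illustrative.

```python
import math

def cosine(u, v):
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0

def group_average_hac(vectors, groups, target):
    """Merge groups (lists of document indices) until `target` groups remain."""
    groups = [list(g) for g in groups]

    def avg_sim(g1, g2):
        total = sum(cosine(vectors[i], vectors[j]) for i in g1 for j in g2)
        return total / (len(g1) * len(g2))

    while len(groups) > target:
        pairs = [(a, b) for a in range(len(groups))
                 for b in range(a + 1, len(groups))]
        a, b = max(pairs, key=lambda p: avg_sim(groups[p[0]], groups[p[1]]))
        groups[a].extend(groups.pop(b))
    return groups

def fractionation(vectors, k, m, rho=0.5):
    """Repeatedly cluster fixed-size buckets until k groups remain."""
    if not (m > k and rho * m >= k):
        raise ValueError("this sketch assumes m > k and rho * m >= k")
    groups = [[i] for i in range(len(vectors))]
    while len(groups) > k:
        if len(groups) <= m:
            # Everything fits in one bucket: cluster straight down to k.
            return group_average_hac(vectors, groups, k)
        next_groups = []
        for start in range(0, len(groups), m):
            bucket = groups[start:start + m]
            target = max(1, math.ceil(rho * len(bucket)))
            next_groups += group_average_hac(vectors, bucket, target)
        groups = next_groups    # groups become the input of the next pass
    return groups

# Eight hypothetical documents about two topics.
vectors = [(1.0, 0.1), (0.1, 1.0), (0.9, 0.2), (0.2, 0.9),
           (1.0, 0.0), (0.0, 1.0), (0.8, 0.1), (0.1, 0.8)]
centers = fractionation(vectors, k=2, m=4)
print(sorted(sorted(g) for g in centers))
```

The first pass clusters two buckets of four documents each; the second pass merges the four resulting groups down to k = 2.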

SCATTER/GATHER SESSION STAGE 2

BUCKSHOT

STEP 1: HAC

• First, randomly take a sample of size sqrt(kn).
• Apply group average agglomerative clustering until we obtain ‘k’ clusters.
• Return the obtained clusters.
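A minimal sketch of this first Buckshot step, assuming documents are term-weight vectors compared by cosine similarity. The function name `buckshot_step1`, the vectors and the seeded random generator are illustrative choices, not taken from the slides.

```python
import math
import random

def cosine(u, v):
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0

def buckshot_step1(vectors, k, rng):
    """Sample about sqrt(k*n) documents and cluster the sample into k groups."""
    n = len(vectors)
    sample_size = min(n, max(k, math.ceil(math.sqrt(k * n))))
    groups = [[i] for i in rng.sample(range(n), sample_size)]

    def avg_sim(g1, g2):
        total = sum(cosine(vectors[i], vectors[j]) for i in g1 for j in g2)
        return total / (len(g1) * len(g2))

    # Group average agglomerative clustering on the sample only.
    while len(groups) > k:
        pairs = [(a, b) for a in range(len(groups))
                 for b in range(a + 1, len(groups))]
        a, b = max(pairs, key=lambda p: avg_sim(groups[p[0]], groups[p[1]]))
        groups[a].extend(groups.pop(b))
    return groups

# Eight hypothetical documents; with k = 2 the sample holds sqrt(2 * 8) = 4.
vectors = [(1.0, 0.1), (0.1, 1.0), (0.9, 0.2), (0.2, 0.9),
           (1.0, 0.0), (0.0, 1.0), (0.8, 0.1), (0.1, 0.8)]
seed_clusters = buckshot_step1(vectors, k=2, rng=random.Random(0))
print(seed_clusters)
```

Clustering only the sample keeps the quadratic agglomerative step cheap: the number of pairwise comparisons grows with the sample size squared, which is roughly k times n.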

SCATTER/GATHER STAGE 2

BUCKSHOT

STEP 2: K-Means

• Arbitrarily select K documents as seeds; they are the initial centroids of the clusters.
• Assign all other documents to the closest centroid.
• Compute the centroid of each cluster again to get the new centroid of each cluster.
• Repeat steps 2 and 3 until the centroid of each cluster doesn’t change.
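The steps above can be sketched as follows. The sketch assumes points in a small numeric space with squared Euclidean distance, and takes the seeds as an explicit argument so the run is reproducible; the points and the function names are hypothetical.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, seeds, max_iter=100):
    """Plain K-Means: assign to the closest centroid, recompute, repeat."""
    centroids = [tuple(s) for s in seeds]
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        new_centroids = [mean(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:    # centroids stopped moving
            break
        centroids = new_centroids
    return centroids, clusters

# Four hypothetical documents in a two-term space, seeded with two of them.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
centroids, clusters = kmeans(points, seeds=[(0, 0), (10, 0)])
print(centroids)
print(clusters)
```

In Buckshot the seeds would be the centroids of the clusters obtained in step 1 rather than arbitrary documents.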

[Figure: Fractionation example (contd.). Documents A to H are split into Bucket 1 and Bucket 2, and group average agglomerative clustering is applied in each iteration; the groups shown include BG, AH, DE and CF, followed by AH and BGCFDE.]

[Figure: Buckshot example. The documents in the sample (A, D, G, E) are grouped by group average agglomerative clustering into the clusters AG and DE; the remaining documents are then assigned to these clusters using K-means.]

GENESIS OF GROUPER

GROUPER

• A dynamic web interface to the Husky Search meta-search engine
• Clusters the top retrieved results of the Husky Search meta-search engine
• Dynamically groups search results into clusters
• Uses the STC algorithm for clustering

Grouper’s query interface.

Grouper Interface

STC (Suffix Tree Clustering)

• A fast, incremental algorithm
• Operates on web document snippets
• Relies on a suffix tree to identify common phrases
• Uses the common information to create clusters

WHAT IS A SUFFIX TREE?

• A suffix tree is a rooted, directed tree

• Each internal node has at least 2 children

• Each edge is labeled with a non-empty sub-string of S.

• The label of a node is the concatenation of the edge-labels on the path from the root to that node.

• No two edges out of the same node can have edge-labels that begin with the same word.

STEPS OF STC

Step-1: Document “Cleaning”

Step-2: Identifying Base Clusters

Step-3: Combining Base Clusters

Step-4: Score clusters

DOCUMENT CLEANING

• Stemming
• Stripping of HTML, punctuation and numbers

Example: <html>2 Cats ate<b> cheese</b>.</html> becomes "Cat ate cheese"
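A small sketch of the cleaning step using only the standard library. The regular expressions, the lowercasing and the `naive_stem` helper are assumptions made for illustration; in particular `naive_stem` only drops a trailing plural "s" and stands in for a real stemmer.

```python
import re

def naive_stem(word):
    """Crude placeholder for a real stemmer: drops a trailing plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def clean(snippet):
    """Strip HTML tags, punctuation and numbers, then stem each word."""
    text = re.sub(r"<[^>]+>", " ", snippet)     # remove HTML tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)    # remove punctuation and digits
    return [naive_stem(word) for word in text.lower().split()]

cleaned = clean("<html>2 Cats ate<b> cheese</b>.</html>")
print(" ".join(cleaned))
```

The result is returned as a list of words because the later STC steps work on word sequences rather than on characters.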

Identifying Base Clusters

• Create an inverted index of strings from the web document collection using a suffix tree.
• Each node of the suffix tree represents a group of documents and a string that is common to all of them.
• The label of the node represents the common string.
• Each node represents a base cluster.

[Figure: suffix tree built from three documents: 1. "cat ate cheese", 2. "mouse ate cheese too", 3. "cat ate mouse too". Edges are labeled with phrases such as "cat ate", "ate", "cheese", "mouse", "too" and "ate cheese"; each node lists the documents (for example 1,2 or 1,2,3) that contain its phrase.]

BASE CLUSTERS IDENTIFIED!!

Node    Phrase        Documents
a       cat ate       1,3
b       ate           1,2,3
c       cheese        1,2
d       mouse         2,3
e       too           2,3
f       ate cheese    1,2

Table 1: Six nodes and their corresponding base clusters
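The table can be reproduced with a short sketch. Note that it does not build a suffix tree: it swaps in a brute-force enumeration of word n-grams and keeps a phrase only if it occurs in at least two documents and is followed by at least two different continuations (the end of each document counts as its own continuation), which mimics where a suffix tree would branch. The function name `base_clusters` is illustrative.

```python
from collections import defaultdict

def base_clusters(docs):
    """docs: dict doc_id -> list of words. Returns {phrase: set of doc ids}."""
    doc_sets = defaultdict(set)     # phrase -> documents containing it
    followers = defaultdict(set)    # phrase -> distinct continuations
    for doc_id, words in docs.items():
        for start in range(len(words)):
            for end in range(start + 1, len(words) + 1):
                phrase = tuple(words[start:end])
                doc_sets[phrase].add(doc_id)
                # The end of a document acts as a continuation unique to it.
                nxt = words[end] if end < len(words) else ("<end>", doc_id)
                followers[phrase].add(nxt)
    return {" ".join(phrase): ids
            for phrase, ids in doc_sets.items()
            if len(ids) >= 2 and len(followers[phrase]) >= 2}

docs = {
    1: "cat ate cheese".split(),
    2: "mouse ate cheese too".split(),
    3: "cat ate mouse too".split(),
}
clusters = base_clusters(docs)
for phrase, ids in sorted(clusters.items()):
    print(phrase, sorted(ids))
```

The single word "cat" is filtered out because it is always followed by "ate", so the tree has a node for "cat ate" but none for "cat". The enumeration is quadratic in document length, which is acceptable for short snippets but is exactly the cost a suffix tree avoids.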

SCORING BASE CLUSTERS

$S(B) = |B| \cdot f(|P|)$

• |B| is the number of documents in base cluster B
• |P| is the number of words in phrase P
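A small sketch of the score. The slide does not define f, so the `phrase_weight` function below is purely an assumption for illustration: it gives single-word phrases a reduced weight, grows linearly for phrases of two to six words, and stays constant for longer ones.

```python
def phrase_weight(num_words):
    """Hypothetical f(|P|); the actual weights are not given in the slides."""
    if num_words <= 1:
        return 0.5              # assumed penalty for single-word phrases
    return min(num_words, 6)    # assumed: linear up to six words, then flat

def score(documents, phrase):
    """S(B) = |B| * f(|P|) for a base cluster with the given phrase."""
    return len(documents) * phrase_weight(len(phrase.split()))

# Three base clusters taken from the example documents.
examples = {"cat ate": {1, 3}, "ate": {1, 2, 3}, "ate cheese": {1, 2}}
scores = {phrase: score(ids, phrase) for phrase, ids in examples.items()}
print(scores)
```

Under this assumed f, the two-word phrases outscore "ate" even though "ate" covers more documents.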

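Combining Base Clusters

Binary similarity measure: the similarity of base clusters $B_m$ and $B_n$ is 1 if

$\frac{|B_m \cap B_n|}{|B_m|} > 0.5$ and $\frac{|B_m \cap B_n|}{|B_n|} > 0.5$

and 0 otherwise, where

• $|B_m \cap B_n|$ is the number of documents which are in both clusters
• $|B_m|$ is the number of documents in cluster ‘m’
• $|B_n|$ is the number of documents in cluster ‘n’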

COMBINING THE BASE CLUSTERS

[Figure: base cluster graph. The nodes are the six base clusters "cat ate" (1,3), "ate" (1,2,3), "cheese" (1,2), "mouse" (2,3), "too" (2,3) and "ate cheese" (1,2).]
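The binary similarity measure and the base cluster graph can be sketched as below. The sketch assumes that two base clusters are linked when both overlap ratios exceed 0.5 and that the final clusters are the connected components of the resulting graph; the function names are illustrative.

```python
def similar(bm, bn, threshold=0.5):
    """Binary similarity: True if the overlap exceeds the threshold both ways."""
    overlap = len(bm & bn)
    return overlap / len(bm) > threshold and overlap / len(bn) > threshold

def merge_base_clusters(base):
    """base: dict phrase -> set of doc ids. Returns a list of (phrases, docs)."""
    names = list(base)
    seen, merged = set(), []
    for name in names:
        if name in seen:
            continue
        # Depth-first walk over the base cluster graph from this node.
        seen.add(name)
        stack, component = [name], []
        while stack:
            current = stack.pop()
            component.append(current)
            for other in names:
                if other not in seen and similar(base[current], base[other]):
                    seen.add(other)
                    stack.append(other)
        docs = set().union(*(base[p] for p in component))
        merged.append((sorted(component), docs))
    return merged

base = {
    "cat ate": {1, 3}, "ate": {1, 2, 3}, "cheese": {1, 2},
    "mouse": {2, 3}, "too": {2, 3}, "ate cheese": {1, 2},
}
final = merge_base_clusters(base)
print(final)
```

For the six example base clusters everything is linked through "ate", so a single cluster containing documents 1, 2 and 3 results, even though pairs such as "cat ate" and "cheese" are not similar to each other directly.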

STC is Incremental

• As each document arrives from the web, we “clean” it.
• Add it to the suffix tree. Each node that is updated or created as a result is tagged.
• Update the relevant base clusters and recalculate the similarity of these base clusters to the rest of the k highest scoring base clusters.
• Check for any changes to the final clusters.
• Score and sort the final clusters, and choose the top 10.

STC allows cluster overlap…

Why is overlap reasonable?

• A document often has more than one topic.
• STC allows a document to appear in more than one cluster, since documents may share more than one phrase with other documents.
