9/18/2001information organization and retrieval vector representation, term weights and clustering...

32
9/18/2001 Information Organization and Retrieval Vector Representation, Term Weights and Clustering (continued) Ray Larson & Warren Sack University of California, Berkeley School of Information Management and Systems SIMS 202: Information Organization and Retrieval Lecture authors: Marti Hearst & Ray Larson & Warren Sack

Post on 19-Dec-2015

222 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 9/18/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering (continued) Ray Larson & Warren Sack University of California,

9/18/2001 Information Organization and Retrieval

Vector Representation, Term Weights and Clustering

(continued)

Ray Larson & Warren Sack

University of California, Berkeley

School of Information Management and Systems

SIMS 202: Information Organization and RetrievalLecture authors: Marti Hearst & Ray Larson & Warren Sack

Page 2: 9/18/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering (continued) Ray Larson & Warren Sack University of California,

9/18/2001 Information Organization and Retrieval

Last Time

• Document Vectors

• Inverted Files

• Vector Space Model

• Term Weighting

• Clustering

Page 3: 9/18/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering (continued) Ray Larson & Warren Sack University of California,

9/18/2001 Information Organization and Retrieval

Document Vectors

nova galaxy heat h’wood film role diet fur

10 5 3

5 10

10 8 7

9 10 5

10 10

9 10

5 7 9

6 10 2 8

7 5 1 3

ABCDEFGHI

Document ids

Page 4: 9/18/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering (continued) Ray Larson & Warren Sack University of California,

9/18/2001 Information Organization and Retrieval

We Can Plot the VectorsStar

Diet

Doc about astronomyDoc about movie stars

Doc about mammal behavior

Page 5: 9/18/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering (continued) Ray Larson & Warren Sack University of California,

9/18/2001 Information Organization and Retrieval

Inverted Index• This is the primary data structure for text

indexes• Main Idea:

– Invert documents into a big index

• Basic steps:– Make a “dictionary” of all the tokens in the

collection– For each token, list all the docs it occurs in.– Do a few things to reduce redundancy in the data

structure

Page 6: 9/18/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering (continued) Ray Larson & Warren Sack University of California,

9/18/2001 Information Organization and Retrieval

Inverted IndexesWe have seen “Vector files” conceptually.

An Inverted File is a vector file “inverted” so that rows become columns and columns become rowsdocs t1 t2 t3D1 1 0 1D2 1 0 0D3 0 1 1D4 1 0 0D5 1 1 1D6 1 1 0D7 0 1 0D8 0 1 0D9 0 0 1

D10 0 1 1

Terms D1 D2 D3 D4 D5 D6 D7 …

t1 1 1 0 1 1 1 0t2 0 0 1 0 1 1 1t3 1 0 1 0 1 0 0

Page 7: 9/18/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering (continued) Ray Larson & Warren Sack University of California,

9/18/2001 Information Organization and Retrieval

Vector Space Model

• Documents are represented as vectors in term space– Terms are usually stems– Documents represented by binary vectors of terms

• Queries represented the same as documents• Query and Document weights are based on

length and direction of their vector• A vector distance measure between the query

and documents is used to rank retrieved documents

• This makes partial matching possible

Page 8: 9/18/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering (continued) Ray Larson & Warren Sack University of California,

9/18/2001 Information Organization and Retrieval

Documents in 3D Space

Assumption: Documents that are “close together” in space are similar in meaning.

Page 9: 9/18/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering (continued) Ray Larson & Warren Sack University of California,

9/18/2001 Information Organization and Retrieval

Assigning Weights• tf x idf measure:

– term frequency (tf)– inverse document frequency (idf) -- a way

to deal with the problems of the Zipf distribution

• Goal: assign a tf * idf weight to each term in each document

Page 10: 9/18/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering (continued) Ray Larson & Warren Sack University of California,

9/18/2001 Information Organization and Retrieval

Similarity Measures

|)||,min(|

||

||||

||

||||

||||

||2

||

21

21

DQ

DQ

DQ

DQ

DQDQ

DQ

DQ

DQ

Simple matching (coordination level match)

Dice’s Coefficient

Jaccard’s Coefficient

Cosine Coefficient

Overlap Coefficient

Page 11: 9/18/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering (continued) Ray Larson & Warren Sack University of California,

9/18/2001 Information Organization and Retrieval

Computing Similarity Scores

2

1 1D

Q2D

98.0cos

74.0cos

)8.0 ,4.0(

)7.0 ,2.0(

)3.0 ,8.0(

2

1

2

1

Q

D

D

1.0

0.8

0.6

0.8

0.4

0.60.4 1.00.2

0.2

Page 12: 9/18/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering (continued) Ray Larson & Warren Sack University of California,

9/18/2001 Information Organization and Retrieval

Text ClusteringClustering is

“The art of finding groups in data.” Kaufmann and Rousseeuw, Finding Groups in Data, 1990

Term 1

Term 2

Page 13: 9/18/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering (continued) Ray Larson & Warren Sack University of California,

9/18/2001 Information Organization and Retrieval

Text Clustering

Term 1

Term 2

Clustering is“The art of finding groups in data.” Kaufmann and Rousseeuw, Finding Groups in Data, 1990

Page 14: 9/18/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering (continued) Ray Larson & Warren Sack University of California,

9/18/2001 Information Organization and Retrieval

Types of Clustering

• Hierarchical vs. Flat

• Hard vs.Soft vs. Disjunctive

(set vs. uncertain vs. multiple assignment)

Page 15: 9/18/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering (continued) Ray Larson & Warren Sack University of California,

9/18/2001 Information Organization and Retrieval

Page 16: 9/18/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering (continued) Ray Larson & Warren Sack University of California,

9/18/2001 Information Organization and Retrieval

Flat Clustering

• K-Means – Hard– O(n)

• EM (soft version of K-Means)

Page 17: 9/18/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering (continued) Ray Larson & Warren Sack University of California,

9/18/2001 Information Organization and Retrieval

K-Means Clustering

• 1 Create a pair-wise similarity measure• 2 Find K centers • 3 Assign each document to nearest center,

forming new clusters• 4 Repeat 3 as necessary

Page 18: 9/18/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering (continued) Ray Larson & Warren Sack University of California,

9/18/2001 Information Organization and Retrieval

Page 19: 9/18/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering (continued) Ray Larson & Warren Sack University of California,

9/18/2001 Information Organization and Retrieval

Page 20: 9/18/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering (continued) Ray Larson & Warren Sack University of California,

9/18/2001 Information Organization and Retrieval

Scatter/Gather

Cutting, Pedersen, Tukey & Karger 92, 93Hearst & Pedersen 95

• Cluster sets of documents into general “themes”, like a table of contents

• Display the contents of the clusters by showing topical terms and typical titles

• User chooses subsets of the clusters and re-clusters the documents within

• Resulting new groups have different “themes”

Page 21: 9/18/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering (continued) Ray Larson & Warren Sack University of California,

9/18/2001 Information Organization and Retrieval

Scatter/Gather Example: query on “star”

Encyclopedia text14 sports

8 symbols 47 film, tv 68 film, tv (p) 7 music97 astrophysics 67 astronomy(p) 12 stellar phenomena 10 flora/fauna 49 galaxies, stars

29 constellations 7 miscelleneous

Clustering and re-clustering is entirely automated

Page 22: 9/18/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering (continued) Ray Larson & Warren Sack University of California,
Page 23: 9/18/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering (continued) Ray Larson & Warren Sack University of California,
Page 24: 9/18/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering (continued) Ray Larson & Warren Sack University of California,
Page 25: 9/18/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering (continued) Ray Larson & Warren Sack University of California,

9/18/2001 Information Organization and Retrieval

Another use of clustering

• Use clustering to map the entire huge multidimensional document space into a huge number of small clusters.

• “Project” these onto a 2D graphical representation:

Page 26: 9/18/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering (continued) Ray Larson & Warren Sack University of California,

9/18/2001 Information Organization and Retrieval

Clustering Multi-Dimensional Document Space

Wise, Thomas, Pennock, Lantrip, Pottier, Schur, Crow“Visualizing the Non-Visual: Spatial analysis and interaction with Information from text documents,” 1995

Page 27: 9/18/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering (continued) Ray Larson & Warren Sack University of California,

9/18/2001 Information Organization and Retrieval

Clustering Multi-Dimensional Document Space

Wise et al., 1995

Page 28: 9/18/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering (continued) Ray Larson & Warren Sack University of California,

9/18/2001 Information Organization and Retrieval

Concept “Landscapes”Browsing without search

Pharmocology

Anatomy

Legal

Disease

Hospitals

(e.g., Xia Lin, “Visualization for the Document Space,” 1992)

Based on Kohonen feature maps;See http://websom.hut.fi/websom/

Page 29: 9/18/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering (continued) Ray Larson & Warren Sack University of California,

9/18/2001 Information Organization and Retrieval

More examples ofinformation visualization

• Stuart Card, Jock Mackinlay, Ben Schneiderman (eds.) Readings in Information Visualization (San Francisco: Morgan Kaufmann, 1999)

• Martin Dodge, www.cybergeography.org

Page 30: 9/18/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering (continued) Ray Larson & Warren Sack University of California,

9/18/2001 Information Organization and Retrieval

Clustering• Advantages:

– See some main themes

• Disadvantage:– Many ways documents could group together are

hidden

• Thinking point: what is the relationship to classification systems and faceted queries?

e.g., f1: (osteoporosis OR ‘bone loss’) f2: (drugs OR pharmaceuticals) f3: (prevention OR cure)

Page 31: 9/18/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering (continued) Ray Larson & Warren Sack University of California,

9/18/2001 Information Organization and Retrieval

More information on content analysis and clustering

• Christopher Manning and Hinrich Schutze, Foundations of Statistical Natural Language Processing (Cambridge, MA: MIT Press, 1999)

• Daniel Jurafsky and James Martin, Speech and Language Processing (Upper Saddle River, NJ: Prentice Hall, 2000)

Page 32: 9/18/2001Information Organization and Retrieval Vector Representation, Term Weights and Clustering (continued) Ray Larson & Warren Sack University of California,

9/18/2001 Information Organization and Retrieval

And now on to…

• Vector Space Ranking

• Probabilistic Models and Ranking