2002.10.31 - slide 1is 202 – fall 2002 prof. ray larson & prof. marc davis uc berkeley sims...

78
2002.10.31 - SLIDE 1 IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002 http://www.sims.berkeley.edu/academics/courses/ is202/f02/ SIMS 202: Information Organization and Retrieval Lecture 18: Vector Representation

Post on 19-Dec-2015

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 1IS 202 – FALL 2002

Prof. Ray Larson & Prof. Marc Davis

UC Berkeley SIMS

Tuesday and Thursday 10:30 am - 12:00 pm

Fall 2002http://www.sims.berkeley.edu/academics/courses/is202/f02/

SIMS 202:

Information Organization

and Retrieval

Lecture 18: Vector Representation

Page 2: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 2IS 202 – FALL 2002

Lecture Overview

• Review– Content Analysis– Statistical Properties of Text

• Zipf Distribution• Statistical Dependence

– Indexing and Inverted Files

• Vector Representation• Term Weights• Vector Matching• Clustering

Credit for some of the slides in this lecture goes to Marti Hearst

Page 3: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 3IS 202 – FALL 2002

Lecture Overview

• Review– Content Analysis– Statistical Properties of Text

• Zipf Distribution• Statistical Dependence

– Indexing and Inverted Files

• Vector Representation• Term Weights• Vector Matching• Clustering

Credit for some of the slides in this lecture goes to Marti Hearst

Page 4: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 4IS 202 – FALL 2002

Techniques for Content Analysis

• Statistical– Single Document– Full Collection

• Linguistic– Syntactic– Semantic– Pragmatic

• Knowledge-Based (Artificial Intelligence)

• Hybrid (Combinations)

Page 5: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 5IS 202 – FALL 2002

Content Analysis Areas

How isthe text processed?Index

Pre-Process

Parse

Collections

Rank

Query

Text Input

How isthe queryconstructed?

InformationNeed

Page 6: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 6

Document Processing Steps

From “Modern IR” Textbook

Page 7: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 7IS 202 – FALL 2002

Errors Generated by Porter Stemmer

Too Aggressive Too Timid organization/ organ european/ europe

policy/ police cylinder/ cylindrical

execute/ executive create/ creation

arm/ army search/ searcher

From Krovetz ‘93

Page 8: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 8IS 202 – FALL 2002

A Small Collection (Stems)Rank Freq Term1 37 system2 32 knowledg3 24 base4 20 problem5 18 abstract6 15 model7 15 languag8 15 implem9 13 reason10 13 inform11 11 expert12 11 analysi13 10 rule14 10 program15 10 oper16 10 evalu17 10 comput18 10 case19 9 gener20 9 form

150 2 enhanc151 2 energi152 2 emphasi153 2 detect154 2 desir155 2 date156 2 critic157 2 content158 2 consider159 2 concern160 2 compon161 2 compar162 2 commerci163 2 clause164 2 aspect165 2 area166 2 aim167 2 affect

Page 9: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 9IS 202 – FALL 2002

The Corresponding Zipf Curve

Rank Freq1 37 system2 32 knowledg3 24 base4 20 problem5 18 abstract6 15 model7 15 languag8 15 implem9 13 reason10 13 inform11 11 expert12 11 analysi13 10 rule14 10 program15 10 oper16 10 evalu17 10 comput18 10 case19 9 gener20 9 form

Page 10: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 10IS 202 – FALL 2002

Zipf Distribution

• The Important Points:– A few elements occur very frequently– A medium number of elements have medium

frequency– Many elements occur very infrequently

Page 11: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 11

Zipf Distribution

Linear Scale Logarithmic Scale

Page 12: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 12IS 202 – FALL 2002

Related Distributions/”Laws”

• Bradford’s Law of Scattering

• Lotka’s Law of Productivity

• De Solla Price’s Urn Model for “Cumulative Advantage Processes”

½ = 50% 2/3 = 66% ¾ = 75%Pick Pick

Replace +1 Replace +1

Page 13: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 13IS 202 – FALL 2002

Frequent Words on the WWW• 65002930 the• 62789720 a• 60857930 to• 57248022 of• 54078359 and• 52928506 in• 50686940 s• 49986064 for• 45999001 on• 42205245 this• 41203451 is• 39779377 by• 35439894 with• 35284151 or• 34446866 at• 33528897 all• 31583607 are

• 30998255 from• 30755410 e• 30080013 you• 29669506 be• 29417504 that• 28542378 not• 28162417 an• 28110383 as• 28076530 home• 27650474 it• 27572533 i• 24548796 have• 24420453 if• 24376758 new• 24171603 t• 23951805 your• 23875218 page

• 22292805 about• 22265579 com• 22107392 information• 21647927 will• 21368265 can• 21367950 more• 21102223 has• 20621335 no• 19898015 other• 19689603 one• 19613061 c• 19394862 d• 19279458 m• 19199145 was• 19075253 copyright• 18636563 us

(see http://elib.cs.berkeley.edu/docfreq/docfreq.html)

Page 14: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 14IS 202 – FALL 2002

Word Frequency vs. Resolving Power

The most frequent words are not the most descriptive

(from van Rijsbergen 79)

Page 15: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 15IS 202 – FALL 2002

Statistical Independence

• Two events x and y are statistically independent if the product of the probabilities of their happening individually equals the probability of their happening together

),()()( yxPyPxP

Page 16: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 16IS 202 – FALL 2002

Lexical Associations

• Subjects write first word that comes to mind– doctor/nurse; black/white (Palermo & Jenkins 64)

• Text Corpora can yield similar associations• One measure: Mutual Information (Church and

Hanks 89)

• If word occurrences were independent, the numerator and denominator would be equal (if measured across a large collection)

)(),(

),(log),( 2 yPxP

yxPyxI

Page 17: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 17IS 202 – FALL 2002

Interesting Associations with “Doctor”

I(x,y) f(x,y) f(x) x f(y) y11.3 12 111 Honorary 621 Doctor

11.3 8 1105 Doctors 44 Dentists

10.7 30 1105 Doctors 241 Nurses

9.4 8 1105 Doctors 154 Treating

9.0 6 275 Examined 621 Doctor

8.9 11 1105 Doctors 317 Treat

8.7 25 621 Doctor 1407 Bills

AP Corpus, N=15 million, Church & Hanks 89

Page 18: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 18IS 202 – FALL 2002

I(x,y) f(x,y) f(x) x f(y) y0.96 6 621 doctor 73785 with

0.95 41 284690 a 1105 doctors

0.93 12 84716 is 1105 doctors

These associations were likely to happen because the non-doctor words shown here are very common

and therefore likely to co-occur with any noun

Un-Interesting Associations with “Doctor”

AP Corpus, N=15 million, Church & Hanks 89

Page 19: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 19IS 202 – FALL 2002

Content Analysis Summary

• Content Analysis: transforming raw text into more computationally useful forms

• Words in text collections exhibit interesting statistical properties– Word frequencies have a Zipf distribution– Word co-occurrences exhibit dependencies

• Text documents are transformed to vectors– Pre-processing includes tokenization, stemming,

collocations/phrases– Documents occupy multi-dimensional space

Page 20: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 20IS 202 – FALL 2002

Inverted Indexes

• We have seen “Vector files” conceptually– An Inverted File is a vector file “inverted” so

that rows become columns and columns become rows

docs t1 t2 t3D1 1 0 1D2 1 0 0D3 0 1 1D4 1 0 0D5 1 1 1D6 1 1 0D7 0 1 0D8 0 1 0D9 0 0 1

D10 0 1 1

Terms D1 D2 D3 D4 D5 D6 D7 …

t1 1 1 0 1 1 1 0t2 0 0 1 0 1 1 1t3 1 0 1 0 1 0 0

Page 21: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 21IS 202 – FALL 2002

How Inverted Files are Created

Dictionary PostingsTerm Doc # Freqa 2 1aid 1 1all 1 1and 2 1come 1 1country 1 1country 2 1dark 2 1for 1 1good 1 1in 2 1is 1 1it 2 1manor 2 1men 1 1midnight 2 1night 2 1now 1 1of 1 1past 2 1stormy 2 1the 1 2the 2 2their 1 1time 1 1time 2 1to 1 2was 2 2

Doc # Freq2 11 11 12 11 11 12 12 11 11 12 11 12 12 11 12 12 11 11 12 12 11 22 21 11 12 11 22 2

Term N docs Tot Freqa 1 1aid 1 1all 1 1and 1 1come 1 1country 2 2dark 1 1for 1 1good 1 1in 1 1is 1 1it 1 1manor 1 1men 1 1midnight 1 1night 1 1now 1 1of 1 1past 1 1stormy 1 1the 2 4their 1 1time 2 2to 1 2was 1 2

Page 22: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 22IS 202 – FALL 2002

Inverted Indexes

• Permit fast search for individual terms• For each term, you get a list consisting of:

– Document ID – Frequency of term in doc (optional) – Position of term in doc (optional)

• These lists can be used to solve Boolean queries:

• country -> d1, d2• manor -> d2• country AND manor -> d2

• Also used for statistical ranking algorithms

Page 23: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 23IS 202 – FALL 2002

How Inverted Files are Used

Dictionary PostingsDoc # Freq

2 11 11 12 11 11 12 12 11 11 12 11 12 12 11 12 12 11 11 12 12 11 22 21 11 12 11 22 2

Term N docs Tot Freqa 1 1aid 1 1all 1 1and 1 1come 1 1country 2 2dark 1 1for 1 1good 1 1in 1 1is 1 1it 1 1manor 1 1men 1 1midnight 1 1night 1 1now 1 1of 1 1past 1 1stormy 1 1the 2 4their 1 1time 2 2to 1 2was 1 2

Query on

“time” AND “dark”

2 docs with “time” in dictionary ->

IDs 1 and 2 from posting file

1 doc with “dark” in dictionary ->

ID 2 from posting file

Therefore, only doc 2 satisfied the query

Page 24: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 24IS 202 – FALL 2002

Lecture Overview

• Review– Content Analysis– Statistical Properties of Text

• Zipf Distribution• Statistical Dependence

– Indexing and Inverted Files

• Vector Representation• Term Weights• Vector Matching• Clustering

Credit for some of the slides in this lecture goes to Marti Hearst

Page 25: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 25IS 202 – FALL 2002

Document Vectors

• Documents are represented as “bags of words”

• Represented as vectors when used computationally– A vector is like an array of floating point– Has direction and magnitude– Each vector holds a place for every term in

the collection– Therefore, most vectors are sparse

Page 26: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 26IS 202 – FALL 2002

Vector Space Model

• Documents are represented as vectors in term space– Terms are usually stems– Documents represented by binary or weighted vectors

of terms

• Queries represented the same as documents• Query and Document weights are based on

length and direction of their vector• A vector distance measure between the query

and documents is used to rank retrieved documents

Page 27: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 27IS 202 – FALL 2002

Vector Representation

• Documents and Queries are represented as vectors

• Position 1 corresponds to term 1, position 2 to term 2, position t to term t

• The weight of the term is stored in each position

absent is terma if 0

,...,,

,...,,

21

21

w

wwwQ

wwwD

qtqq

dddi itii

Page 28: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 28IS 202 – FALL 2002

Document Vectors

ID nova galaxy heat h'wood film role diet furA 10 5 3B 5 10C 10 8 7D 9 10 5E 10 10F 9 10G 5 7 9H 6 10 2 8I 7 5 1 3

“Nova” occurs 10 times in text A“Galaxy” occurs 5 times in text A“Heat” occurs 3 times in text A(Blank means 0 occurrences.)

Page 29: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 29IS 202 – FALL 2002

Document Vectors

ID nova galaxy heat h'wood film role diet furA 10 5 3B 5 10C 10 8 7D 9 10 5E 10 10F 9 10G 5 7 9H 6 10 2 8I 7 5 1 3

“Hollywood” occurs 7 times in text I“Film” occurs 5 times in text I“Diet” occurs 1 time in text I“Fur” occurs 3 times in text I

Page 30: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 30IS 202 – FALL 2002

Document Vectors

ID nova galaxy heat h'wood film role diet furA 10 5 3B 5 10C 10 8 7D 9 10 5E 10 10F 9 10G 5 7 9H 6 10 2 8I 7 5 1 3

Page 31: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 31IS 202 – FALL 2002

We Can Plot the Vectors

Star

Diet

Doc about astronomyDoc about movie stars

Doc about mammal behavior

Page 32: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 32IS 202 – FALL 2002

Documents in 3D Space

Primary assumption of the Vector Space Model: Documents that are “close together” in space are similar in meaning

Page 33: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 33IS 202 – FALL 2002

Vector Space Documents and Queries

docs t1 t2 t3 RSV=Q.DiD1 1 0 1 4D2 1 0 0 1D3 0 1 1 5D4 1 0 0 1D5 1 1 1 6D6 1 1 0 3D7 0 1 0 2D8 0 1 0 2D9 0 0 1 3

D10 0 1 1 5D11 1 0 1 4Q 1 2 3

q1 q2 q3

D1

D2

D3

D4

D5

D6

D7

D8

D9

D10

D11

t2

t3

t1

Boolean term combinationsQ is a query – also represented as a vector

Page 34: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 34IS 202 – FALL 2002

Documents in Vector Space

t1

t2

t3

D1

D2

D10

D3

D9

D4

D7

D8

D5

D11

D6

Page 35: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 35IS 202 – FALL 2002

Lecture Overview

• Review– Content Analysis– Statistical Properties of Text

• Zipf Distribution• Statistical Dependence

– Indexing and Inverted Files

• Vector Representation• Term Weights• Vector Matching• Clustering

Credit for some of the slides in this lecture goes to Marti Hearst

Page 36: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 36IS 202 – FALL 2002

Assigning Weights to Terms

• Binary Weights

• Raw term frequency

• tf*idf– Recall the Zipf distribution– Want to weight terms highly if they are

• Frequent in relevant documents … BUT• Infrequent in the collection as a whole

• Automatically derived thesaurus terms

Page 37: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 37IS 202 – FALL 2002

Binary Weights

• Only the presence (1) or absence (0) of a term is included in the vector

docs t1 t2 t3D1 1 0 1D2 1 0 0D3 0 1 1D4 1 0 0D5 1 1 1D6 1 1 0D7 0 1 0D8 0 1 0D9 0 0 1

D10 0 1 1D11 1 0 1

Page 38: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 38IS 202 – FALL 2002

Raw Term Weights

• The frequency of occurrence for the term in each document is included in the vector

docs t1 t2 t3D1 2 0 3D2 1 0 0D3 0 4 7D4 3 0 0D5 1 6 3D6 3 5 0D7 0 8 0D8 0 10 0D9 0 0 1

D10 0 3 5D11 4 0 1

Page 39: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 39IS 202 – FALL 2002

Assigning Weights

• tf*idf measure:– Term frequency (tf)– Inverse document frequency (idf)

• A way to deal with some of the problems of the Zipf distribution

• Goal: Assign a tf*idf weight to each term in each document

Page 40: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 40IS 202 – FALL 2002

tf*idf

)/log(* kikik nNtfw

log

Tcontain that in documents ofnumber the

collection in the documents ofnumber total

in T termoffrequency document inverse

document in T termoffrequency

document in term

nNidf

Cn

CN

Cidf

Dtf

DkT

kk

kk

kk

ikik

ik

Page 41: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 41IS 202 – FALL 2002

Inverse Document Frequency

• IDF provides high values for rare words and low values for common words

41

10000log

698.220

10000log

301.05000

10000log

010000

10000log

For a collectionof 10000 documents(N = 10000)

Page 42: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 42IS 202 – FALL 2002

Similarity Measures

|)||,min(|

||

||||

||

||||

||||

||2

||

21

21

DQ

DQ

DQ

DQ

DQDQ

DQ

DQ

DQ

Simple matching (coordination level match)

Dice’s Coefficient

Jaccard’s Coefficient

Cosine Coefficient

Overlap Coefficient

Page 43: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 43IS 202 – FALL 2002

tf*idf Normalization

• Normalize the term weights (so longer vectors are not unfairly given more weight)– Normalize usually means force all values to

fall within a certain range, usually between 0 and 1, inclusive

t

k kik

kikik

nNtf

nNtfw

1

22 )]/[log()(

)/log(

Page 44: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 44IS 202 – FALL 2002

Vector Space Similarity

• Now, the similarity of two documents is:

• This is also called the cosine, or normalized inner product – The normalization was done when weighting

the terms

),( 1

t

kjkikji wwDDsim

Page 45: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 45IS 202 – FALL 2002

Vector Space Similarity Measure

• Combine tf and idf into a similarity measure

)()(

),(

:combined

are comparison similarity theandion normalizat otherwise

),( :normalized are weights termif

absent is terma if 0 ...,,

,...,,

1

2

1

2

1

1

,21

21

t

jd

t

jqj

t

jdqj

i

t

jdqji

qtqq

dddi

ij

ij

ij

itii

ww

ww

DQsim

wwDQsim

wwwwQ

wwwD

Page 46: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 46IS 202 – FALL 2002

Computing Similarity Scores

98.0cos

74.0cos

)8.0 ,4.0(

)7.0 ,2.0(

)3.0 ,8.0(

2

1

2

1

Q

D

D

2

1 1D

Q2D

1.0

0.8

0.6

0.8

0.4

0.60.4 1.00.2

0.2

Page 47: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 47IS 202 – FALL 2002

What’s Cosine Anyway?

“One of the basic trigonometric functions encountered in trigonometry. Let theta be an angle measured counterclockwise from the x-axis along the arc of the unit circle. Then cos(theta) is the horizontal coordinate of the arcendpoint. As a result of this definition, the cosine function is periodic with period 2pi.”

From http://mathworld.wolfram.com/Cosine.html

Page 48: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 48IS 202 – FALL 2002

Cosine vs. Degrees

Cosine

Degrees

Page 49: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 49IS 202 – FALL 2002

Computing a Similarity Score

98.0 42.0

64.0

])7.0()2.0[(*])8.0()4.0[(

)7.0*8.0()2.0*4.0(),(

yield? comparison similarity their doesWhat

)7.0,2.0(document Also,

)8.0,4.0(or query vect have Say we

22222

2

DQsim

D

Q

Page 50: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 50IS 202 – FALL 2002

Vector Space Matching

1.0

0.8

0.6

0.4

0.2

0.80.60.40.20 1.0

D2

D1

Q

1

2

Term B

Term A

Di=(di1,wdi1;di2, wdi2;…;dit, wdit)Q =(qi1,wqi1;qi2, wqi2;…;qit, wqit)

t

j

t

j dq

t

j dq

i

ijj

ijj

ww

wwDQsim

1 1

22

1

)()(),(

Q = (0.4,0.8)D1=(0.8,0.3)D2=(0.2,0.7)

98.042.0

64.0

])7.0()2.0[(])8.0()4.0[(

)7.08.0()2.04.0()2,(

2222

DQsim

74.058.0

56.),( 1 DQsim

Page 51: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 51IS 202 – FALL 2002

Weighting Schemes

• We have seen something of– Binary– Raw term weights– TF*IDF

• There are many other possibilities– IDF alone– Normalized term frequency

Page 52: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 52IS 202 – FALL 2002

Term Weights in SMART

• SMART is an experimental IR system developed by Gerard Salton (and continued by Chris Buckley) at Cornell

• Designed for laboratory experiments in IR – Easy to mix and match different weighting

methods– Really terrible user interface– Intended for use by code hackers (and even

they have trouble using it)

Page 53: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 53IS 202 – FALL 2002

Term Weights in SMART

• In SMART weights are decomposed into three factors:

norm

collectfreqw kkd

kd

Page 54: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 54IS 202 – FALL 2002

SMART Freq Components

1)ln(

)max(2

1

2

1)max(

}1,0{

kd

kd

kd

kd

kd

kd

freq

freqfreq

freq

freq

freq

Binary

maxnorm

augmented

log

Page 55: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 55IS 202 – FALL 2002

Collection Weighting in SMART

k

k

k

k

k

k

Doc

Doc

DocNDocDoc

NDoc

Doc

NDoc

collect

1

log

log

log

2

Inverse

squared

probabilistic

frequency

Page 56: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 56IS 202 – FALL 2002

Term Normalization in SMART

jvector

vectorj

vectorj

vectorj

w

w

w

w

norm

max

4

2

sum

cosine

fourth

max

Page 57: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 57IS 202 – FALL 2002

To Think About

• How does the tf*idf ranking algorithm behave?– Make a set of hypothetical documents

consisting of terms and their weights– Create some hypothetical queries– How are the documents ranked, depending on

the weights of their terms and the queries’ terms?

Page 58: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 58IS 202 – FALL 2002

Document Space Has High Dimensionality

• What happens beyond 2 or 3 dimensions?

• Similarity still has to do with how many tokens are shared in common

• More terms -> harder to understand which subsets of words are shared among similar documents

• One approach to handling high dimensionality: Clustering

Page 59: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 59IS 202 – FALL 2002

Vector Space Visualization

Page 60: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 60IS 202 – FALL 2002

Text Clustering

• Finds overall similarities among groups of documents

• Finds overall similarities among groups of tokens

• Picks out some themes, ignores others

Page 61: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 61IS 202 – FALL 2002

Text Clustering

Clustering is“The art of finding groups in data.” -- Kaufmann and Rousseeu

Term 1

Term 2

Page 62: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 62IS 202 – FALL 2002

Text Clustering

Clustering is“The art of finding groups in data.” -- Kaufmann and Rousseeu

Term 1

Term 2

Page 63: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 63IS 202 – FALL 2002

Pair-Wise Document Similarity

ID nova galaxy heat h'wood film role diet furA 1 3 1B 5 2C 2 1 5D 4 1

How to compute document similarity?

Page 64: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 64IS 202 – FALL 2002

Pair-Wise Document Similarity

t

iii

t

t

wwDDsim

wwwD

wwwD

12121

2,22212

1,12111

),(

...,,

...,,

9)11()42(),(

0),(

0),(

0),(

0),(

11)32()51(),(

DCsim

DBsim

CBsim

DAsim

CAsim

BAsim

ID nova galaxy heat h'wood film role diet furA 1 3 1B 5 2C 2 1 5D 4 1

(no normalization for simplicity)

Page 65: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 65IS 202 – FALL 2002

Document/Document Matrix

....

.....

.....

....

....

...

21

2212

1121

21

nnn

t

t

t

ddD

ddD

ddD

DDD

jiij DDd to of similarity

Page 66: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 66IS 202 – FALL 2002

Agglomerative Clustering

A B C D E F G HI

Page 67: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 67IS 202 – FALL 2002

Agglomerative Clustering

A B C D E F G HI

Page 68: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 68IS 202 – FALL 2002

Agglomerative Clustering

A B C D E F G HI

Page 69: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 69IS 202 – FALL 2002

ClusteringAgglomerative methods: Polythetic, Exclusive or Overlapping, Unorderedclusters are order-dependent

1. Select initial centers (i.e., seed the space)2. Assign docs to highest matching centers and compute centroids3. Reassign all documents to centroid(s)

DocDoc

DocDoc

DocDoc

DocDoc

Rocchio’s method

Page 70: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 70IS 202 – FALL 2002

Automatic Class Assignment

Doc

Doc

Doc

Doc

Doc

Doc

Doc

SearchEngine

1. Create pseudo-documents representing intellectually derived classes.2. Search using document contents3. Obtain ranked list4. Assign document to N categories ranked over threshold. OR assign to top-ranked category

Automatic Class Assignment: Polythetic, Exclusive or Overlapping, usually orderedclusters are order-independent, usually based on an intellectually derived scheme

Page 71: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 71IS 202 – FALL 2002

K-Means Clustering

1) Create a pair-wise similarity measure

2) Find K centers using agglomerative clustering

– Take a small sample – Group bottom up until K groups found

3) Assign each document to nearest center, forming new clusters

4) Repeat 3 as necessary

Page 72: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 72IS 202 – FALL 2002

Scatter/Gather

• Cutting, Pedersen, Tukey & Karger 92, 93, Hearst & Pedersen 95

• Cluster sets of documents into general “themes”, like a table of contents

• Display the contents of the clusters by showing topical terms and typical titles

• User chooses subsets of the clusters and re-clusters the documents within

• Resulting new groups have different “themes”

Page 73: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 73IS 202 – FALL 2002

S/G Example: Query on “star”

Encyclopedia text14 sports

8 symbols 47 film, tv 68 film, tv (p) 7 music97 astrophysics 67 astronomy(p) 12 stellar phenomena10 flora/fauna 49 galaxies, stars

29 constellations 7 miscelleneous

Clustering and re-clustering is entirely automated

Page 74: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002
Page 75: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002
Page 76: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002
Page 77: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 77IS 202 – FALL 2002

Clustering Result Sets

• Advantages:– See some main themes

• Disadvantage:– Many ways documents could group together

are hidden

• Thinking point: What is the relationship to classification systems and facets?

Page 78: 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

2002.10.31 - SLIDE 78IS 202 – FALL 2002

Next Time

• Probabilistic Models

• Relevance Feedback