text similarity dr eamonn keogh computer science & engineering department university of...
Post on 21-Dec-2015
![Page 1: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/1.jpg)
Text Similarity
Dr Eamonn Keogh
Computer Science & Engineering Department
University of California - Riverside
Riverside, CA 92521
eamonn@cs.ucr.edu
![Page 2: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/2.jpg)
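| Word Length | Twain Sample 1 | Twain Sample 2 | Twain Sample 3 | Twain Sample 4 | Twain Sample 5 | Snodgrass |
|---|---|---|---|---|---|---|
| 1 | 74 | 312 | 116 | 138 | 122 | 424 |
| 2 | 349 | 1146 | 496 | 532 | 466 | 2685 |
| 3 | 456 | 1394 | 673 | 741 | 653 | 2752 |
| 4 | 374 | 1177 | 565 | 591 | 517 | 2302 |
| 5 | 212 | 661 | 381 | 357 | 343 | 1431 |
| 6 | 127 | 442 | 249 | 258 | 207 | 992 |
| 7 | 107 | 367 | 185 | 215 | 152 | 896 |
| 8 | 84 | 231 | 125 | 150 | 103 | 638 |
| 9 | 45 | 181 | 94 | 83 | 92 | 465 |
| 10 | 27 | 109 | 51 | 55 | 45 | 276 |
| 11 | 13 | 50 | 23 | 30 | 18 | 152 |
| 12 | 8 | 24 | 8 | 10 | 12 | 101 |
| 13+ | 9 | 12 | 8 | 9 | 9 | 61 |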
[Figure: two charts over word lengths 1 to 13. The first plots Sample 1 and Sample 2 on a y-axis from 0 to 1600; the second plots two series (Series1, Series2) on a y-axis from 0 to 0.3.]
![Page 3: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/3.jpg)
[Figure: six items labeled 1, 2, 5, 3, 4, 6]
![Page 4: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/4.jpg)
Information Retrieval
• Task Statement:
Build a system that retrieves documents that users are likely to find relevant to their queries.
• This assumption underlies the field of Information Retrieval.
![Page 5: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/5.jpg)
[Diagram: IR system overview with boxes labeled Information need, text input, Query, Parse, Pre-process, Collections, Index, Rank, and Evaluate. Callouts: "How is the query constructed?" and "How is the text processed?"]
![Page 6: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/6.jpg)
Terminology
Token: A natural language word, e.g. “Swim”, “Simpson”, “92513”, etc.
Document: Usually a web page, but more generally any file.
![Page 7: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/7.jpg)
Some IR History
– Roots in the scientific “Information Explosion” following WWII
– Interest in computer-based IR from the mid-1950s
• H.P. Luhn at IBM (1958)
• Probabilistic models at Rand (Maron & Kuhns) (1960)
• Boolean system development at Lockheed (‘60s)
• Vector Space Model (Salton at Cornell, 1965)
• Statistical weighting methods and theoretical advances (‘70s)
• Refinements and advances in application (‘80s)
• User interfaces, large-scale testing and application (‘90s)
![Page 8: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/8.jpg)
Relevance
• In what ways can a document be relevant to a query?
– Answer a precise question precisely.
   – Who is Homer’s boss? Montgomery Burns.
– Partially answer the question.
   – Where does Homer work? Power Plant.
– Suggest a source for more information.
   – What is Bart’s middle name? Look in Issue 234 of Fanzine.
– Give background information.
– Remind the user of other knowledge.
– Others ...
![Page 9: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/9.jpg)
[Diagram: the IR system overview again (Information need, text input, Query, Parse, Pre-process, Collections, Index, Rank, Evaluate), with the callouts "How is the query constructed?" and "How is the text processed?"]
The section that follows is about Content Analysis (transforming raw text into a computationally more manageable form).
![Page 10: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/10.jpg)
Stemming and Morphological Analysis
• Goal: “normalize” similar words
• Morphology (“form” of words)
– Inflectional Morphology
• E.g., inflect verb endings and noun number
• Never changes grammatical class
– dog, dogs
– Bike, Biking
– Swim, Swimmer, Swimming
What about… build, building?
![Page 11: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/11.jpg)
Examples of Stemming (using Porter's algorithm)

| Original Words | Stemmed Words |
|---|---|
| … | … |
| consign | consign |
| consigned | consign |
| consigning | consign |
| consignment | consign |
| consist | consist |
| consisted | consist |
| consistency | consist |
| consistent | consist |
| consistently | consist |
| consisting | consist |
| consists | consist |
| … | |

Porter's algorithm is available in Java, C, Lisp, Perl, Python etc. from
http://www.tartarus.org/~martin/PorterStemmer/
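
To make the idea of stemming concrete, here is a minimal sketch of suffix stripping. It is deliberately NOT Porter's algorithm (which applies several ordered rule phases with conditions on the stem); the suffix list and the `toy_stem` helper are hypothetical and were picked by hand only so that the words on this slide collapse to the same stems.

```python
# Toy suffix stripper: an illustration of the idea, not Porter's algorithm.
# The suffix list is a hand-picked assumption that happens to cover this slide.
SUFFIXES = ["ently", "ency", "ment", "ing", "ent", "ed", "s"]


def toy_stem(word: str, min_stem: int = 3) -> str:
    """Strip the longest matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


words = ["consign", "consigned", "consigning", "consignment",
         "consist", "consisted", "consistency", "consistent",
         "consistently", "consisting", "consists"]
stems = [toy_stem(w) for w in words]
for original, stem in zip(words, stems):
    print(f"{original:14s} -> {stem}")
```

A rule list this naive will over- and under-stem on real text, which is the kind of error the next slide discusses for the real Porter stemmer.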
![Page 12: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/12.jpg)
Errors Generated by Porter Stemmer (Krovetz 93)

| Too Aggressive | Too Timid |
|---|---|
| organization / organ | european / europe |
| policy / police | cylinder / cylindrical |
| execute / executive | create / creation |
| arm / army | search / searcher |

Homework!! Play with the following URL: http://fusion.scs.carleton.ca/~dquesnel/java/stuff/PorterApplet.html
![Page 13: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/13.jpg)
Statistical Properties of Text
• Token occurrences in text are not uniformly distributed
• They are also not normally distributed
• They do exhibit a Zipf distribution
![Page 14: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/14.jpg)
Government documents, 157734 tokens, 32259 unique

Most frequent tokens (count token): 8164 the, 4771 of, 4005 to, 2834 a, 2827 and, 2802 in, 1592 The, 1370 for, 1326 is, 1324 s, 1194 that, 973 by

Next most frequent: 969 on, 915 FT, 883 Mr, 860 was, 855 be, 849 Pounds, 798 TEXT, 798 PUB, 798 PROFILE, 798 PAGE, 798 HEADLINE, 798 DOCNO

Tokens occurring once: 1 ABC, 1 ABFT, 1 ABOUT, 1 ACFT, 1 ACI, 1 ACQUI, 1 ACQUISITIONS, 1 ACSIS, 1 ADFT, 1 ADVISERS, 1 AE
![Page 15: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/15.jpg)
Plotting Word Frequency by Rank
• Main idea: count
– How many times each token occurs in the text
• Over all texts in the collection
• Now order the tokens by how often they occur. A token's position in this ordering is called its rank.
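
The counting and ranking steps above can be sketched in a few lines. This is only an illustration: the `rank_by_frequency` name is made up here, and splitting on whitespace after lowercasing is an assumed, very naive tokenizer.

```python
from collections import Counter


def rank_by_frequency(texts):
    """Count tokens over all texts, then return (rank, token, frequency) tuples."""
    counts = Counter()
    for text in texts:
        # Assumption: lowercase + whitespace split stands in for real tokenization.
        counts.update(text.lower().split())
    # most_common() orders by descending count; rank 1 is the most frequent token.
    return [(rank, token, freq)
            for rank, (token, freq) in enumerate(counts.most_common(), start=1)]


ranked = rank_by_frequency(["the cat and the dog", "the dog"])
for rank, token, freq in ranked:
    print(rank, freq, token)
```

Plotting frequency against rank for a real collection gives the kind of curve shown on the following slides.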
![Page 16: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/16.jpg)
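The Corresponding Zipf Curve

| Rank | Freq | Term |
|---|---|---|
| 1 | 37 | system |
| 2 | 32 | knowledg |
| 3 | 24 | base |
| 4 | 20 | problem |
| 5 | 18 | abstract |
| 6 | 15 | model |
| 7 | 15 | languag |
| 8 | 15 | implem |
| 9 | 13 | reason |
| 10 | 13 | inform |
| 11 | 11 | expert |
| 12 | 11 | analysi |
| 13 | 10 | rule |
| 14 | 10 | program |
| 15 | 10 | oper |
| 16 | 10 | evalu |
| 17 | 10 | comput |
| 18 | 10 | case |
| 19 | 9 | gener |
| 20 | 9 | form |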
![Page 17: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/17.jpg)
Zipf Distribution
• The Important Points:
– a few elements occur very frequently
– a medium number of elements have medium frequency
– many elements occur very infrequently
![Page 18: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/18.jpg)
Zipf Distribution
• The product of the frequency of words (f) and their rank (r) is approximately constant
– Rank = order of words’ frequency of occurrence

$$f = C \cdot \frac{1}{r}, \qquad C \approx N/10$$

• Another way to state this is with an approximately correct rule of thumb:
– Say the most common term occurs C times
– The second most common occurs C/2 times
– The third most common occurs C/3 times
– …
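
One way to check the "frequency times rank is roughly constant" claim is to compute the products directly. The sketch below does that for the six highest counts from the government-documents slide; the function name is hypothetical, and on real data the products only stay within the same order of magnitude rather than being exactly equal.

```python
def rank_frequency_products(frequencies):
    """Return rank * frequency for each frequency, ranked from most to least frequent."""
    ordered = sorted(frequencies, reverse=True)
    return [rank * freq for rank, freq in enumerate(ordered, start=1)]


# Counts for "the", "of", "to", "a", "and", "in" from the government-documents slide.
products = rank_frequency_products([8164, 4771, 4005, 2834, 2827, 2802])
print(products)

# An idealized Zipf sequence C, C/2, C/3 gives exactly constant products.
print(rank_frequency_products([6, 3, 2]))
```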
![Page 19: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/19.jpg)
Zipf Distribution (linear and log scale)
Illustration by Jakob Nielsen
![Page 20: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/20.jpg)
What Kinds of Data Exhibit a Zipf Distribution?
• Words in a text collection
– Virtually any language usage
• Library book checkout patterns
• Incoming Web Page Requests
• Outgoing Web Page Requests
• Document Size on Web
• City Sizes
• …
![Page 21: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/21.jpg)
Consequences of Zipf
• There are always a few very frequent tokens that are not good discriminators.
– Called “stop words” in IR
• English examples: to, from, on, and, the, ...
• There are always a large number of tokens that occur once and can mess up algorithms.
• Medium-frequency words are the most descriptive
![Page 22: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/22.jpg)
Word Frequency vs. Resolving Power (from van Rijsbergen 79)
The most frequent words are not the most descriptive.
![Page 23: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/23.jpg)
Statistical Independence
Two events x and y are statistically independent if the product of their individual probabilities equals the probability of their happening together.

$$P(x)\,P(y) = P(x, y)$$
![Page 24: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/24.jpg)
Lexical Associations
• Subjects write the first word that comes to mind
– doctor/nurse; black/white (Palermo & Jenkins 64)
• Text corpora yield similar associations
• One measure: Mutual Information (Church and Hanks 89)

$$I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$$

• If word occurrences were independent, the numerator and denominator would be equal (if measured across a large collection)
![Page 25: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/25.jpg)
Statistical Independence
• Compute for a window of words

$$P(x)\,P(y) = P(x, y) \quad \text{if independent.}$$

$$P(x) = f(x) / N$$

We'll approximate $P(x, y)$ as follows:

$$P(x, y) = \frac{1}{N} \sum_{i=1}^{N - |w|} w_i(x, y)$$

$|w|$ = length of window $w$ (say 5)
$w_i$ = words within window starting at position $i$
$w(x, y)$ = number of times $x$ and $y$ co-occur in $w$
$N$ = number of words in collection

[Figure: a token sequence a b c d e f g h i j k l m n o p with windows w1, w11 and w21 marked]
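
The windowed estimate of P(x, y) can be sketched as below. Two things are assumptions of this sketch rather than of the slide: the function name, and the choice to let each window contribute 1 when it contains both words (the slide only says "number of times x and y co-occur in w").

```python
def cooccurrence_probability(tokens, x, y, window=5):
    """Estimate P(x, y) as (1/N) * sum of w_i(x, y) over N - |w| window positions."""
    n = len(tokens)
    total = 0
    # Python index i = 0 .. N-|w|-1 corresponds to the slide's i = 1 .. N-|w|.
    for i in range(n - window):
        w_i = tokens[i:i + window]
        # Assumption: a window counts once if both x and y appear in it.
        if x in w_i and y in w_i:
            total += 1
    return total / n if n else 0.0


tokens = "a b c d e f g h i j".split()
p_cd = cooccurrence_probability(tokens, "c", "d", window=5)
print(p_cd)
```

In this toy sequence the pair (c, d) falls inside the windows starting at a, b and c, so three of the five windows count and the estimate is 3/10.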
![Page 26: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/26.jpg)
Interesting Associations with “Doctor”
(AP Corpus, N=15 million, Church & Hanks 89)

| I(x,y) | f(x,y) | f(x) | x | f(y) | y |
|---|---|---|---|---|---|
| 11.3 | 12 | 111 | Honorary | 621 | Doctor |
| 11.3 | 8 | 1105 | Doctors | 44 | Dentists |
| 10.7 | 30 | 1105 | Doctors | 241 | Nurses |
| 9.4 | 8 | 1105 | Doctors | 154 | Treating |
| 9.0 | 6 | 275 | Examined | 621 | Doctor |
| 8.9 | 11 | 1105 | Doctors | 317 | Treat |
| 8.7 | 25 | 621 | Doctor | 1407 | Bills |
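
As a rough check of the table, the mutual information formula can be evaluated directly from the counts, estimating each probability as a count divided by N. This is a sketch under that assumption (the `mutual_information` helper is hypothetical); with N taken as exactly 15 million, the results agree with the published values to within about 0.1.

```python
import math


def mutual_information(f_xy, f_x, f_y, n):
    """I(x, y) = log2(P(x, y) / (P(x) * P(y))), with each P estimated as count / n."""
    p_xy = f_xy / n
    p_x = f_x / n
    p_y = f_y / n
    return math.log2(p_xy / (p_x * p_y))


N = 15_000_000  # corpus size quoted on the slide (rounded)
honorary_doctor = mutual_information(12, 111, 621, N)
doctors_nurses = mutual_information(30, 1105, 241, N)
doctor_with = mutual_information(6, 621, 73785, N)
print(honorary_doctor, doctors_nurses, doctor_with)
```

The last call uses the counts for "doctor" and "with" from the un-interesting associations: a very common word yields a score near 1 instead of near 11.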
![Page 27: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/27.jpg)
Un-Interesting Associations with “Doctor”
(AP Corpus, N=15 million, Church & Hanks 89)

| I(x,y) | f(x,y) | f(x) | x | f(y) | y |
|---|---|---|---|---|---|
| 0.96 | 6 | 621 | doctor | 73785 | with |
| 0.95 | 41 | 284690 | a | 1105 | doctors |
| 0.93 | 12 | 84716 | is | 1105 | doctors |

These associations were likely to happen because the non-doctor words shown here are very common and therefore likely to co-occur with any noun.
![Page 28: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/28.jpg)
Associations Are Important Because…
• We may be able to discover phrases that should be treated as a single word, e.g. “data mining”.
• We may be able to automatically discover synonyms, e.g. “Bike” and “Bicycle”.
![Page 29: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/29.jpg)
Content Analysis Summary
• Content Analysis: transforming raw text into more computationally useful forms
• Words in text collections exhibit interesting statistical properties
– Word frequencies have a Zipf distribution
– Word co-occurrences exhibit dependencies
• Text documents are transformed to vectors
– Pre-processing includes tokenization, stemming, collocations/phrases
![Page 30: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/30.jpg)
![Page 31: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/31.jpg)
[Diagram: the IR system overview again (Information need, text input, Query, Parse, Pre-process, Collections, Index, Rank, Evaluate), with the callout "How is the index constructed?"]
The section that follows is about Index Construction.
![Page 32: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/32.jpg)
Inverted Index
• This is the primary data structure for text indexes
• Main Idea:
– Invert documents into a big index
• Basic steps:
– Make a “dictionary” of all the tokens in the collection
– For each token, list all the docs it occurs in.
– Do a few things to reduce redundancy in the data structure
![Page 33: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/33.jpg)
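Inverted Indexes
We have seen “Vector files” conceptually. An Inverted File is a vector file “inverted” so that rows become columns and columns become rows.

| docs | t1 | t2 | t3 |
|---|---|---|---|
| D1 | 1 | 0 | 1 |
| D2 | 1 | 0 | 0 |
| D3 | 0 | 1 | 1 |
| D4 | 1 | 0 | 0 |
| D5 | 1 | 1 | 1 |
| D6 | 1 | 1 | 0 |
| D7 | 0 | 1 | 0 |
| D8 | 0 | 1 | 0 |
| D9 | 0 | 0 | 1 |
| D10 | 0 | 1 | 1 |

| Terms | D1 | D2 | D3 | D4 | D5 | D6 | D7 | … |
|---|---|---|---|---|---|---|---|---|
| t1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | |
| t2 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | |
| t3 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | |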
![Page 34: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/34.jpg)
How Are Inverted Files Created
• Documents are parsed to extract tokens. These are saved with the Document ID.

Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

Term (Doc #), in order of occurrence:
now (1), is (1), the (1), time (1), for (1), all (1), good (1), men (1), to (1), come (1), to (1), the (1), aid (1), of (1), their (1), country (1), it (2), was (2), a (2), dark (2), and (2), stormy (2), night (2), in (2), the (2), country (2), manor (2), the (2), time (2), was (2), past (2), midnight (2)
![Page 35: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/35.jpg)
How Inverted Files are Created
• After all documents have been parsed the inverted file is sorted alphabetically.

Term (Doc #), sorted:
a (2), aid (1), all (1), and (2), come (1), country (1), country (2), dark (2), for (1), good (1), in (2), is (1), it (2), manor (2), men (1), midnight (2), night (2), now (1), of (1), past (2), stormy (2), the (1), the (1), the (2), the (2), their (1), time (1), time (2), to (1), to (1), was (2), was (2)
![Page 36: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/36.jpg)
How Inverted Files are Created
• Multiple term entries for a single document are merged.
• Within-document term frequency information is compiled.

Term (Doc #, Freq), after merging:
a (2, 1), aid (1, 1), all (1, 1), and (2, 1), come (1, 1), country (1, 1), country (2, 1), dark (2, 1), for (1, 1), good (1, 1), in (2, 1), is (1, 1), it (2, 1), manor (2, 1), men (1, 1), midnight (2, 1), night (2, 1), now (1, 1), of (1, 1), past (2, 1), stormy (2, 1), the (1, 2), the (2, 2), their (1, 1), time (1, 1), time (2, 1), to (1, 2), was (2, 2)
![Page 37: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/37.jpg)
How Inverted Files are Created
• Then the file can be split into
– A Dictionary file
and
– A Postings file
![Page 38: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/38.jpg)
How Inverted Files are Created

The Dictionary holds one row per term (Term, N docs, Tot Freq); each row points to that term's entries (Doc #, Freq) in the Postings file.

| Term | N docs | Tot Freq | Postings (Doc #, Freq) |
|---|---|---|---|
| a | 1 | 1 | (2, 1) |
| aid | 1 | 1 | (1, 1) |
| all | 1 | 1 | (1, 1) |
| and | 1 | 1 | (2, 1) |
| come | 1 | 1 | (1, 1) |
| country | 2 | 2 | (1, 1), (2, 1) |
| dark | 1 | 1 | (2, 1) |
| for | 1 | 1 | (1, 1) |
| good | 1 | 1 | (1, 1) |
| in | 1 | 1 | (2, 1) |
| is | 1 | 1 | (1, 1) |
| it | 1 | 1 | (2, 1) |
| manor | 1 | 1 | (2, 1) |
| men | 1 | 1 | (1, 1) |
| midnight | 1 | 1 | (2, 1) |
| night | 1 | 1 | (2, 1) |
| now | 1 | 1 | (1, 1) |
| of | 1 | 1 | (1, 1) |
| past | 1 | 1 | (2, 1) |
| stormy | 1 | 1 | (2, 1) |
| the | 2 | 4 | (1, 2), (2, 2) |
| their | 1 | 1 | (1, 1) |
| time | 2 | 2 | (1, 1), (2, 1) |
| to | 1 | 2 | (1, 2) |
| was | 1 | 2 | (2, 2) |
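
The parse, sort, merge and split steps walked through on these slides can be sketched compactly. This is a minimal illustration, not a production index: the function name is made up, and lowercasing plus an `[a-z]+` regular expression is an assumed tokenizer.

```python
import re
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns (dictionary, postings).

    dictionary: term -> (number of docs containing it, total frequency)
    postings:   term -> [(doc_id, within-document frequency), ...]
    """
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        # Parse: extract lowercase alphabetic tokens (assumed tokenizer).
        tokens = re.findall(r"[a-z]+", docs[doc_id].lower())
        # Merge: Counter collapses repeated terms into one entry with a frequency.
        for term, freq in Counter(tokens).items():
            postings[term].append((doc_id, freq))
    # Split: the dictionary summarizes each term's postings list, sorted by term.
    dictionary = {term: (len(plist), sum(freq for _, freq in plist))
                  for term, plist in sorted(postings.items())}
    return dictionary, dict(postings)


docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}
dictionary, postings = build_inverted_index(docs)
print(dictionary["the"], postings["the"])
```

A real system would store the postings on disk and keep only offsets in the dictionary; here both are plain in-memory dicts for readability.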
![Page 39: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/39.jpg)
Inverted Indexes
• Permit fast search for individual terms
• For each term, you get a list consisting of:
– document ID
– frequency of term in doc (optional)
– position of term in doc (optional)
• These lists can be used to solve Boolean queries:
• country -> d1, d2
• manor -> d2
• country AND manor -> d2
• Also used for statistical ranking algorithms
![Page 40: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/40.jpg)
How Inverted Files are Used
Query on “time” AND “dark”
2 docs with “time” in dictionary -> IDs 1 and 2 from posting file
1 doc with “dark” in dictionary -> ID 2 from posting file
Therefore, only doc 2 satisfies the query.
[Figure: the Dictionary (Term, N docs, Tot Freq) and Postings (Doc #, Freq) tables for the two example documents, repeated from the earlier slide]
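
The AND lookup described on this slide amounts to intersecting two postings lists. A minimal sketch follows, using a hand-written postings dict for just the terms in the examples; the `boolean_and` helper is hypothetical, and a real engine would merge sorted lists rather than build sets.

```python
def boolean_and(postings, term_a, term_b):
    """Return the sorted doc IDs that contain both terms."""
    docs_a = {doc_id for doc_id, _freq in postings.get(term_a, [])}
    docs_b = {doc_id for doc_id, _freq in postings.get(term_b, [])}
    return sorted(docs_a & docs_b)


# term -> [(doc_id, freq), ...], copied from the example postings file
postings = {
    "time": [(1, 1), (2, 1)],
    "dark": [(2, 1)],
    "country": [(1, 1), (2, 1)],
    "manor": [(2, 1)],
}
print(boolean_and(postings, "time", "dark"))      # docs matching time AND dark
print(boolean_and(postings, "country", "manor"))  # docs matching country AND manor
```

OR would use set union instead of intersection, and NOT would subtract a term's doc set from the set of all doc IDs.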
![Page 41: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/41.jpg)
![Page 42: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/42.jpg)
[Diagram: the IR system overview again (Information need, text input, Query, Parse, Pre-process, Collections, Index, Rank, Evaluate), with the callout "How is the index constructed?"]
The section that follows is about Querying (and ranking).
![Page 43: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/43.jpg)
Simple query language: Boolean
– Terms + Connectors (or operators)
– terms
• words
• normalized (stemmed) words
• phrases
– connectors
• AND
• OR
• NOT
• NEAR (Pseudo Boolean)

| Word | Doc |
|---|---|
| Cat | x |
| Dog | |
| Collar | x |
| Leash | |
![Page 44: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/44.jpg)
Boolean Queries
• Cat
• Cat OR Dog
• Cat AND Dog
• (Cat AND Dog)
• (Cat AND Dog) OR Collar
• (Cat AND Dog) OR (Collar AND Leash)
• (Cat OR Dog) AND (Collar OR Leash)
![Page 45: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/45.jpg)
Boolean Searching
“Measurement of the width of cracks in prestressed concrete beams”

Formal Query:
cracks AND beams AND Width_measurement AND Prestressed_concrete

[Diagram labels: Cracks, Beams, Width measurement, Prestressed concrete]

Relaxed Query:
(C AND B AND P) OR
(C AND B AND W) OR
(C AND W AND P) OR
(B AND W AND P)
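
The relaxed query is the OR of every three-term AND, which is the same as asking for at least three of the four terms. A small sketch of that equivalence, with made-up document term sets and a hypothetical helper name:

```python
def matches_at_least(doc_terms, query_terms, k):
    """True if the document contains at least k of the query terms."""
    return len(set(doc_terms) & set(query_terms)) >= k


query = ["cracks", "beams", "width_measurement", "prestressed_concrete"]

# Hypothetical documents, represented by the index terms assigned to them.
doc_all_four = {"cracks", "beams", "width_measurement", "prestressed_concrete"}
doc_three = {"cracks", "beams", "prestressed_concrete", "bridges"}
doc_two = {"cracks", "beams", "bridges"}

formal = [matches_at_least(d, query, 4) for d in (doc_all_four, doc_three, doc_two)]
relaxed = [matches_at_least(d, query, 3) for d in (doc_all_four, doc_three, doc_two)]
print(formal, relaxed)
```

Lowering k further keeps widening the result set, which hints at why a graded notion of match is useful.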
![Page 46: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/46.jpg)
Ordering of Retrieved Documents
• Pure Boolean has no ordering
• In practice:
– order chronologically
– order by total number of “hits” on query terms
• What if one term has more hits than others?
• Is it better to have one of each term or many of one term?
![Page 47: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/47.jpg)
Boolean Model
• Advantages
– simple queries are easy to understand
– relatively easy to implement
• Disadvantages
– difficult to specify what is wanted
– too much returned, or too little
– ordering not well determined
• Dominant language in commercial Information Retrieval systems until the WWW
Since the Boolean model is limited, let's consider a generalization…
![Page 48: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/48.jpg)
Vector Model
• Documents are represented as “bags of words”
• Represented as vectors when used computationally
– A vector is like an array of floating-point numbers
– Has direction and magnitude
– Each vector holds a place for every term in the collection
– Therefore, most vectors are sparse
• Smithers secretly loves Monty Burns
• Monty Burns secretly loves Smithers
Both map to… [Burns, loves, Monty, secretly, Smithers]
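
A short sketch of why the two sentences above are indistinguishable under this model: a bag of words keeps counts but discards order. The helper names are illustrative, and whitespace splitting is an assumed tokenizer.

```python
from collections import Counter


def bag_of_words(text):
    """Map a text to its multiset of tokens; word order is discarded."""
    return Counter(text.split())


def to_vector(bag, vocabulary):
    """One floating-point slot per vocabulary term, 0.0 where the term is absent."""
    return [float(bag[term]) for term in vocabulary]


s1 = "Smithers secretly loves Monty Burns"
s2 = "Monty Burns secretly loves Smithers"
bag1, bag2 = bag_of_words(s1), bag_of_words(s2)

# Case-insensitive alphabetical order reproduces the term order shown on the slide.
vocabulary = sorted(set(bag1) | set(bag2), key=str.lower)
print(vocabulary)
print(to_vector(bag1, vocabulary), to_vector(bag2, vocabulary))
```

With a vocabulary covering a whole collection, most slots in each document's vector would be 0.0, which is the sparsity the slide mentions.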
![Page 49: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/49.jpg)
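Document Vectors
One location for each word

Terms (columns): nova, galaxy, heat, h’wood, film, role, diet, fur
Document ids (rows): A to I

Non-empty entries in each row, left to right (the column position of each entry is not preserved):
A: 10, 5, 3
B: 5, 10
C: 10, 8, 7
D: 9, 10, 5
E: 10, 10
F: 9, 10
G: 5, 7, 9
H: 6, 10, 2, 8
I: 7, 5, 1, 3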
![Page 50: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/50.jpg)
We Can Plot the Vectors
[Figure: documents plotted against two axes, Star and Diet, with points labeled "Doc about astronomy", "Doc about movie stars", and "Doc about mammal behavior"]
![Page 51: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/51.jpg)
Illustration from Jurafsky & Martin
Documents in 3D Vector Space
[Figure: documents D1 to D11 plotted in a three-dimensional space with axes t1, t2 and t3]
![Page 52: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/52.jpg)
Vector Space Model

| docs | Homer | Marge | Bart |
|---|---|---|---|
| D1 | * | | * |
| D2 | * | | |
| D3 | | * | * |
| D4 | * | | |
| D5 | * | * | * |
| D6 | * | * | |
| D7 | | * | |
| D8 | | * | |
| D9 | | | * |
| D10 | | * | * |
| D11 | * | | * |
| Q | | * | |

Note that the query is projected into the same vector space as the documents.
The query here is for “Marge”.
We can use a vector similarity model to determine the best match to our query (details in a few slides).
But what weights should we use for the terms?
![Page 53: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/53.jpg)
Assigning Weights to Terms
• Binary Weights
• Raw term frequency
• tf x idf
– Recall the Zipf distribution
– Want to weight terms highly if they are
• frequent in relevant documents … BUT
• infrequent in the collection as a whole
![Page 54: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/54.jpg)
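Binary Weights
• Only the presence (1) or absence (0) of a term is included in the vector

| docs | t1 | t2 | t3 |
|---|---|---|---|
| D1 | 1 | 0 | 1 |
| D2 | 1 | 0 | 0 |
| D3 | 0 | 1 | 1 |
| D4 | 1 | 0 | 0 |
| D5 | 1 | 1 | 1 |
| D6 | 1 | 1 | 0 |
| D7 | 0 | 1 | 0 |
| D8 | 0 | 1 | 0 |
| D9 | 0 | 0 | 1 |
| D10 | 0 | 1 | 1 |
| D11 | 1 | 0 | 1 |

We have already seen and discussed this model.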
![Page 55: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/55.jpg)
Raw Term Weights
• The frequency of occurrence for the term in each document is included in the vector

| docs | t1 | t2 | t3 |
|---|---|---|---|
| D1 | 2 | 0 | 3 |
| D2 | 1 | 0 | 0 |
| D3 | 0 | 4 | 7 |
| D4 | 3 | 0 | 0 |
| D5 | 1 | 6 | 3 |
| D6 | 3 | 5 | 0 |
| D7 | 0 | 8 | 0 |
| D8 | 0 | 10 | 0 |
| D9 | 0 | 0 | 1 |
| D10 | 0 | 3 | 5 |
| D11 | 4 | 0 | 1 |

This model is open to exploitation by websites… sex sex sex sex sex sex sex sex sex sex …
Counts can be normalized by document lengths.
![Page 56: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/56.jpg)
tf * idf Weights
• tf * idf measure:
– term frequency (tf)
– inverse document frequency (idf) -- a way to deal with the problems of the Zipf distribution
• Goal: assign a tf * idf weight to each term in each document
![Page 57: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/57.jpg)
tf * idf

$$w_{ik} = tf_{ik} \times \log(N / n_k)$$

$T_k$ = term $k$ in document $D_i$
$tf_{ik}$ = frequency of term $T_k$ in document $D_i$
$idf_k$ = inverse document frequency of term $T_k$ in $C$
$N$ = total number of documents in the collection $C$
$n_k$ = the number of documents in $C$ that contain $T_k$

$$idf_k = \log(N / n_k)$$
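A minimal sketch of the weight $w_{ik} = tf_{ik} \times \log(N / n_k)$, assuming base-10 logarithms and a tiny made-up corpus. The function name `tf_idf`, the dictionary layout, and the four documents are hypothetical and chosen only for illustration.

```python
import math

def tf_idf(term_counts):
    """term_counts maps doc id -> {term: raw frequency}; returns doc id -> {term: weight}."""
    N = len(term_counts)                       # total number of documents
    n = {}                                     # n[term] = number of documents containing term
    for counts in term_counts.values():
        for term, tf in counts.items():
            if tf > 0:
                n[term] = n.get(term, 0) + 1
    return {
        doc: {term: tf * math.log10(N / n[term])
              for term, tf in counts.items() if tf > 0}
        for doc, counts in term_counts.items()
    }

# Hypothetical corpus: "the" occurs everywhere, the other terms are rarer.
corpus = {
    "D1": {"the": 5, "river": 2},
    "D2": {"the": 3, "bank": 1},
    "D3": {"the": 4, "river": 1, "bull": 2},
    "D4": {"the": 6},
}
weights = tf_idf(corpus)
for doc, w in weights.items():
    print(doc, {t: round(x, 3) for t, x in w.items()})
```

A term that appears in every document, such as "the" here, gets weight 0 however often it occurs, because $\log(N / N) = 0$.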
![Page 58: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/58.jpg)
Inverse Document Frequency
• IDF provides high values for rare words and low values for common words
$$idf_k = \log(N / n_k)$$

For a collection of 10000 documents (logarithms are base 10):

$$\log(10000 / 10000) = 0$$
$$\log(10000 / 5000) = 0.301$$
$$\log(10000 / 20) = 2.699$$
$$\log(10000 / 1) = 4$$
![Page 59: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/59.jpg)
Similarity Measures

Simple matching (coordination level match):
$$|Q \cap D|$$
Dice’s Coefficient:
$$\frac{2\,|Q \cap D|}{|Q| + |D|}$$
Jaccard’s Coefficient:
$$\frac{|Q \cap D|}{|Q \cup D|}$$
Cosine Coefficient:
$$\frac{|Q \cap D|}{|Q|^{1/2} \times |D|^{1/2}}$$
Overlap Coefficient:
$$\frac{|Q \cap D|}{\min(|Q|, |D|)}$$
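The five set-based coefficients translate directly into code when the query Q and the document D are treated as sets of terms. This is a sketch under that assumption. The function names and the example terms are illustrative, and both sets are assumed to be non-empty.

```python
import math

def simple_matching(q, d):
    return len(q & d)

def dice(q, d):
    return 2 * len(q & d) / (len(q) + len(d))

def jaccard(q, d):
    return len(q & d) / len(q | d)

def cosine(q, d):
    return len(q & d) / math.sqrt(len(q) * len(d))

def overlap(q, d):
    return len(q & d) / min(len(q), len(d))

# Hypothetical query and document term sets (two shared terms).
Q = {"jordan", "river", "bank"}
D = {"jordan", "river", "bull", "farm"}

for measure in (simple_matching, dice, jaccard, cosine, overlap):
    print(measure.__name__, round(measure(Q, D), 3))
```

Only simple matching is unbounded. The other four divide the shared-term count by some measure of set size, so they fall between 0 and 1.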
![Page 60: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/60.jpg)
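Cosine

$Q = (0.4, 0.8)$
$D_1 = (0.8, 0.3)$
$D_2 = (0.2, 0.7)$

$\cos \alpha_1 = 0.73$
$\cos \alpha_2 = 0.98$

[Figure: $Q$, $D_1$ and $D_2$ drawn as vectors in a two-term plane, both axes running from 0 to 1.0; $\alpha_1$ is the angle between $Q$ and $D_1$, $\alpha_2$ the angle between $Q$ and $D_2$]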
![Page 61: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/61.jpg)
Vector Space Similarity Measure

$$D_i = (w_{d_{i1}}, w_{d_{i2}}, \ldots, w_{d_{it}})$$
$$Q = (w_{q1}, w_{q2}, \ldots, w_{qt})$$
$w = 0$ if a term is absent

If term weights are normalized:
$$sim(Q, D_i) = \sum_{j=1}^{t} w_{qj} \times w_{d_{ij}}$$

Otherwise normalize in the similarity comparison:
$$sim(Q, D_i) = \frac{\sum_{j=1}^{t} w_{qj} \times w_{d_{ij}}}{\sqrt{\sum_{j=1}^{t} (w_{qj})^2 \times \sum_{j=1}^{t} (w_{d_{ij}})^2}}$$
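The normalized similarity can be checked against the two-term vectors from the Cosine slide, Q = (0.4, 0.8), D1 = (0.8, 0.3) and D2 = (0.2, 0.7). The sketch below is one way to write it. The helper names `sim` and `normalize` are assumptions made for the example.

```python
import math

def sim(q, d):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    norm = math.sqrt(sum(wq * wq for wq in q) * sum(wd * wd for wd in d))
    return dot / norm if norm else 0.0

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    length = math.sqrt(sum(w * w for w in v))
    return [w / length for w in v] if length else list(v)

Q, D1, D2 = (0.4, 0.8), (0.8, 0.3), (0.2, 0.7)
print(round(sim(Q, D1), 3), round(sim(Q, D2), 3))

# With pre-normalized weights the plain dot product gives the same score.
dot_normalized = sum(a * b for a, b in zip(normalize(Q), normalize(D2)))
print(round(dot_normalized, 3))
```

The scores come out at roughly 0.73 for D1 and 0.98 for D2, so D2 is ranked above D1 for this query.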
![Page 62: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/62.jpg)
Problems with Vector Space
• There is no real theoretical basis for the assumption of a term space
– it is more for visualization than having any real basis
– most similarity measures work about the same regardless of model
• Terms are not really orthogonal dimensions
– Terms are not independent of all other terms
![Page 63: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/63.jpg)
Probabilistic Models
• Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query
• Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle)
• Rely on accurate estimates of probabilities
![Page 64: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/64.jpg)
![Page 65: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/65.jpg)
Relevance Feedback
• Main Idea:
– Modify existing query based on relevance judgements
• Query Expansion: Extract terms from relevant documents and add them to the query
• Term Re-weighting: and/or re-weight the terms already in the query
– Two main approaches:
• Automatic (pseudo-relevance feedback)
• Users select relevant documents
– Users/system select terms from an automatically-generated list
![Page 66: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/66.jpg)
Definition: Relevance Feedback is the reformulation of a search query in response to feedback provided by the user for the results of previous versions of the query.
Suppose you are interested in bovine agriculture on the banks of the river Jordan…
Initial query:
Term Vector [Jordan, Bank, Bull, River]
Term Weights [1, 1, 1, 1]
Feedback loop: Search → Display Results → Gather Feedback → Update Weights
Query after feedback:
Term Vector [Jordan, Bank, Bull, River]
Term Weights [1.1, 0.1, 1.3, 1.2]
![Page 67: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/67.jpg)
Rocchio Method

$$Q_1 = Q_0 + \frac{\beta}{n_1} \sum_{i=1}^{n_1} R_i - \frac{\gamma}{n_2} \sum_{i=1}^{n_2} S_i$$

where
$Q_0$ = the vector for the initial query
$R_i$ = the vector for the relevant document $i$
$S_i$ = the vector for the non-relevant document $i$
$n_1$ = the number of relevant documents chosen
$n_2$ = the number of non-relevant documents chosen
$\beta$ and $\gamma$ tune the importance of relevant and nonrelevant terms (in some studies best to set $\beta$ to 0.75 and $\gamma$ to 0.25)
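The Rocchio update adds the scaled mean of the relevant document vectors to the initial query and subtracts the scaled mean of the non-relevant ones. The following is a minimal sketch, assuming plain Python lists as vectors and the 0.75 / 0.25 settings quoted above as defaults. The function name `rocchio` and the example document vectors are made up for illustration.

```python
def rocchio(q0, relevant, nonrelevant, beta=0.75, gamma=0.25):
    """Return the updated query vector Q1 from Q0 and the judged document vectors."""
    q1 = list(q0)
    for docs, sign, coeff in ((relevant, +1, beta), (nonrelevant, -1, gamma)):
        if not docs:                     # no judged documents of this kind
            continue
        for j in range(len(q0)):
            mean_j = sum(d[j] for d in docs) / len(docs)
            q1[j] += sign * coeff * mean_j
    return q1

# Term order: [Jordan, Bank, Bull, River]; the document vectors are hypothetical.
q0 = [1, 1, 1, 1]
relevant = [[1, 0, 2, 1], [1, 0, 0, 1]]      # about cattle by the river Jordan
nonrelevant = [[0, 4, 0, 0]]                 # about financial banks
q1 = rocchio(q0, relevant, nonrelevant)
print(q1)
```

In this example the weight of "Bank" falls while the other three terms gain weight, which is the same direction of change as in the feedback example earlier.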
![Page 68: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/68.jpg)
Rocchio Illustration
Although we usually work in vector space for text, it is easier to visualize Euclidean space.
[Figure: three panels labelled Original Query, Term Re-weighting, and Query Expansion]
Note that both the location of the center and the shape of the query have changed.
![Page 69: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/69.jpg)
Rocchio Method
• Rocchio automatically
– re-weights terms
– adds in new terms (from relevant docs)
• Most methods perform similarly
– results heavily dependent on test collection
• Machine learning methods are proving to work better than standard IR approaches like Rocchio
![Page 70: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/70.jpg)
Using Relevance Feedback
• Known to improve results
• People don’t seem to like giving feedback!
![Page 71: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/71.jpg)
[Figure: IR system pipeline. Boxes: Information need, text input, Parse, Pre-process, Query, Collections, Index, Rank, Evaluate. Callout: How is the index constructed?]
The section that follows is about Evaluation
![Page 72: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/72.jpg)
Evaluation
• Why Evaluate?
• What to Evaluate?
• How to Evaluate?
![Page 73: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/73.jpg)
Why Evaluate?
• Determine if the system is desirable
• Make comparative assessments
![Page 74: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/74.jpg)
What to Evaluate?
• How much of the information need is satisfied.
• How much was learned about a topic.
• Incidental learning:
– How much was learned about the collection.
– How much was learned about other topics.
• How inviting the system is.
![Page 75: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/75.jpg)
What to Evaluate?
What can be measured that reflects users’ ability to use the system? (Cleverdon 66)
– Coverage of Information
– Form of Presentation
– Effort required/Ease of Use
– Time and Space Efficiency
– Recall
• proportion of relevant material actually retrieved
– Precision
• proportion of retrieved material actually relevant
(Recall and Precision together are labelled “effectiveness”.)
![Page 76: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/76.jpg)
Relevant vs. Retrieved
[Figure: the Relevant and Retrieved sets drawn inside All docs]
![Page 77: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/77.jpg)
Precision vs. Recall

$$\text{Recall} = \frac{|\text{RelRetrieved}|}{|\text{Rel in Collection}|}$$

$$\text{Precision} = \frac{|\text{RelRetrieved}|}{|\text{Retrieved}|}$$

[Figure: the Relevant and Retrieved sets drawn inside All docs]
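Both ratios follow directly from the overlap of the retrieved and relevant sets. A small sketch, assuming document ids held in Python sets. The function name `precision_recall` and the ids are hypothetical, and returning 0.0 for an empty set is a convention chosen for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    rel_retrieved = len(retrieved & relevant)
    precision = rel_retrieved / len(retrieved) if retrieved else 0.0
    recall = rel_retrieved / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {"D1", "D3", "D5", "D7"}          # hypothetical relevance judgments
retrieved = {"D1", "D2", "D3"}               # hypothetical system output

p, r = precision_recall(retrieved, relevant)
print(round(p, 3), round(r, 3))
```

Here two of the three retrieved documents are relevant (precision 2/3), and two of the four relevant documents were found (recall 1/2).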
![Page 78: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/78.jpg)
Why Precision and Recall?
Intuition:
Get as much good stuff while at the same time getting as little junk as possible.
![Page 79: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/79.jpg)
Retrieved vs. Relevant Documents
[Figure: the retrieved documents drawn against the Relevant set]
Very high precision, very low recall
![Page 80: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/80.jpg)
Retrieved vs. Relevant Documents
[Figure: the retrieved documents drawn against the Relevant set]
Very low precision, very low recall (0 in fact)
![Page 81: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/81.jpg)
Retrieved vs. Relevant Documents
[Figure: the retrieved documents drawn against the Relevant set]
High recall, but low precision
![Page 82: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/82.jpg)
Retrieved vs. Relevant Documents
[Figure: the retrieved documents drawn against the Relevant set]
High precision, high recall (at last!)
![Page 83: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/83.jpg)
Precision/Recall Curves
• There is a tradeoff between Precision and Recall
• So measure Precision at different levels of Recall
• Note: this is an AVERAGE over MANY queries
[Figure: precision (vertical axis) plotted against recall (horizontal axis), with four points marked x]
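For a single query, the points of such a curve can be read off a ranked result list: each time a relevant document appears, record the recall reached so far and the precision at that rank. The sketch below assumes a hypothetical ranking and set of relevance judgments. Averaging over many queries, as the slide notes, would be a further step not shown here.

```python
def precision_recall_points(ranking, relevant):
    """Return [(recall, precision), ...] measured at each relevant document in the ranking."""
    points = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points

ranking = ["D3", "D9", "D1", "D4", "D7", "D2"]   # hypothetical ranked output
relevant = {"D1", "D3", "D7"}                    # hypothetical relevance judgments

points = precision_recall_points(ranking, relevant)
for recall, precision in points:
    print(round(recall, 2), round(precision, 2))
```

In this example precision falls from 1.0 to 0.6 as recall rises from 1/3 to 1.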
![Page 84: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/84.jpg)
Precision/Recall Curves
• Difficult to determine which of these two hypothetical results is better:
[Figure: two hypothetical sets of points, marked x, on precision (vertical axis) vs. recall (horizontal axis)]
![Page 85: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/85.jpg)
Document Cutoff Levels
• Another way to evaluate:
– Fix the number of documents retrieved at several levels:
• top 5
• top 10
• top 20
• top 50
• top 100
• top 500
– Measure precision at each of these levels
– Take (weighted) average over results
• This is a way to focus on how well the system ranks the first k documents.
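Precision at a fixed cutoff k looks only at the first k ranked documents. The sketch below divides by k even when fewer than k documents were returned, which is one common convention and an assumption here. The ranking, the judgments, and the small cutoff values are illustrative stand-ins for the top 5 / top 10 / … levels listed above.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    top_k = ranking[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

ranking = ["D3", "D9", "D1", "D4", "D7", "D2"]   # hypothetical ranked output
relevant = {"D1", "D3", "D7"}                    # hypothetical relevance judgments

cutoffs = [1, 3, 5]
scores = [precision_at_k(ranking, relevant, k) for k in cutoffs]
average = sum(scores) / len(scores)              # unweighted average over cutoffs
print([round(s, 3) for s in scores], round(average, 3))
```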
![Page 86: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/86.jpg)
Problems with Precision/Recall
• Can’t know true recall value
– except in small collections
• Precision/Recall are related
– A combined measure sometimes more appropriate
• Assumes batch mode
– Interactive IR is important and has different criteria for successful searches
– Assumes a strict rank ordering matters.
![Page 87: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/87.jpg)
Relation to Contingency Table

| | Doc is Relevant | Doc is NOT relevant |
|---|---|---|
| Doc is retrieved | a | b |
| Doc is NOT retrieved | c | d |

| | Doc is Relevant | Doc is NOT relevant |
|---|---|---|
| Doc is retrieved | $N_{ret \wedge rel}$ | $N_{ret \wedge \overline{rel}}$ |
| Doc is NOT retrieved | $N_{\overline{ret} \wedge rel}$ | $N_{\overline{ret} \wedge \overline{rel}}$ |

• Accuracy: (a+d) / (a+b+c+d)
• Precision: a/(a+b)
• Recall: a/(a+c)
• Why don’t we use Accuracy for IR?
– (Assuming a large collection)
– Most docs aren’t relevant
– Most docs aren’t retrieved
– Inflates the accuracy value
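A quick numeric check makes the accuracy problem concrete. The counts below are hypothetical: a collection of one million documents with 100 relevant ones, and a system that retrieves 50 documents of which 10 are relevant. The function name `contingency_metrics` is likewise an assumption.

```python
def contingency_metrics(a, b, c, d):
    """a, b, c, d are the four cells of the contingency table (a = relevant and retrieved)."""
    accuracy = (a + d) / (a + b + c + d)
    precision = a / (a + b) if a + b else 0.0
    recall = a / (a + c) if a + c else 0.0
    return accuracy, precision, recall

a, b, c = 10, 40, 90                 # hypothetical counts
d = 1_000_000 - a - b - c            # everything else: not relevant, not retrieved

accuracy, precision, recall = contingency_metrics(a, b, c, d)
print(round(accuracy, 5), precision, recall)
```

Accuracy stays above 0.999 even though precision is 0.2 and recall is 0.1, because the cell d dominates the total.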
![Page 88: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/88.jpg)
The E-Measure
Combine Precision and Recall into one number (van Rijsbergen 79)

$$E = 1 - \frac{1 + b^2}{\frac{b^2}{R} + \frac{1}{P}}$$

P = precision
R = recall
b = measure of relative importance of P or R
For example, b = 0.5 means user is twice as interested in precision as recall
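The effect of b is easiest to see by scoring two hypothetical systems, one precision-heavy and one recall-heavy. The sketch below uses the form $E = 1 - (1 + b^2) / (b^2/R + 1/P)$, where lower E is better. Returning 1.0 when precision or recall is zero is an assumption made to avoid division by zero, and the function name `e_measure` is illustrative.

```python
def e_measure(precision, recall, b=1.0):
    """van Rijsbergen's E: 0 is best, 1 is worst."""
    if precision == 0 or recall == 0:
        return 1.0                                   # assumed limit when nothing useful is found
    return 1 - (1 + b ** 2) / (b ** 2 / recall + 1 / precision)

precise_system = (0.8, 0.4)    # hypothetical (precision, recall)
thorough_system = (0.4, 0.8)   # hypothetical (precision, recall)

for b in (0.5, 1.0, 2.0):
    print(b, round(e_measure(*precise_system, b=b), 3),
          round(e_measure(*thorough_system, b=b), 3))
```

With b = 0.5 the precision-heavy system gets the lower (better) E, with b = 2 the recall-heavy system does, and with b = 1 the two are tied.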
![Page 89: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/89.jpg)
How to Evaluate?
Test Collections
![Page 90: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/90.jpg)
TREC
• Text REtrieval Conference/Competition
– Run by NIST (National Institute of Standards & Technology)
– 2004 (November) will be 13th year
• Collection: >6 Gigabytes (5 CD-ROMs), >1.5 Million Docs
– Newswire & full text news (AP, WSJ, Ziff, FT)
– Government documents (federal register, Congressional Record)
– Radio Transcripts (FBIS)
– Web “subsets”
![Page 91: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/91.jpg)
TREC (cont.)
• Queries + Relevance Judgments
– Queries devised and judged by “Information Specialists”
– Relevance judgments done only for those documents retrieved -- not entire collection!
• Competition
– Various research and commercial groups compete (TREC 6 had 51, TREC 7 had 56, TREC 8 had 66)
– Results judged on precision and recall, going up to a recall level of 1000 documents
![Page 92: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/92.jpg)
TREC
• Benefits:
– made research systems scale to large collections (pre-WWW)
– allows for somewhat controlled comparisons
• Drawbacks:
– emphasis on high recall, which may be unrealistic for what most users want
– very long queries, also unrealistic
– comparisons still difficult to make, because systems are quite different on many dimensions
– focus on batch ranking rather than interaction
– no focus on the WWW
![Page 93: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/93.jpg)
TREC is changing
• Emphasis on specialized “tracks”
– Interactive track
– Natural Language Processing (NLP) track
– Multilingual tracks (Chinese, Spanish)
– Filtering track
– High-Precision
– High-Performance
• http://trec.nist.gov/
![Page 94: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649d555503460f94a31d4a/html5/thumbnails/94.jpg)
Homework…