Indexing
LBSC 796/CMSC 828o
Session 9, March 29, 2004
Doug Oard
Agenda
• Questions
• Finish up evaluation from last time
• Computational complexity
• Inverted indexes
• Project planning
User Studies
• Goal is to account for interface issues
  – By studying the interface component
  – By studying the complete system
• Formative evaluation
  – Provide a basis for system development
• Summative evaluation
  – Designed to assess performance
Quantitative User Studies
• Select independent variable(s)
  – e.g., what info to display in selection interface
• Select dependent variable(s)
  – e.g., time to find a known relevant document
• Run subjects in different orders
  – Average out learning and fatigue effects
• Compute statistical significance
  – Null hypothesis: independent variable has no effect
  – Rejected if p<0.05
Variation in Automatic Measures
• System
  – What we seek to measure
• Topic
  – Sample topic space, compute expected value
• Topic+System
  – Pair by topic and compute statistical significance
• Collection
  – Repeat the experiment using several collections
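Pairing by topic can be illustrated with an exact sign-flip (randomization) test on per-topic score differences. This is only a sketch of one possible paired test, not necessarily the one used in the course; the function name and the average precision scores are hypothetical.

```python
from itertools import product

def paired_sign_flip_p(scores_a, scores_b):
    """Exact two-sided sign-flip test on per-topic differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    # Under the null hypothesis the sign of each per-topic difference is
    # arbitrary, so enumerate every possible assignment of signs.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    return hits / total

# Hypothetical average precision for two systems on the same six topics.
system_a = [0.42, 0.31, 0.55, 0.18, 0.60, 0.47]
system_b = [0.35, 0.30, 0.41, 0.20, 0.52, 0.39]
p_value = paired_sign_flip_p(system_a, system_b)
print(f"p = {p_value:.3f}")
```

The enumeration grows as 2^(number of topics), so for realistic topic sets a sampled version of the same idea would be needed.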
Additional Effects in User Studies
• Learning
  – Vary topic presentation order
• Fatigue
  – Vary system presentation order
• Topic+User (Expertise)
  – Ask about prior knowledge of each topic
Presentation Order
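One common way to vary presentation order across subjects is a Latin square, in which every condition appears once in every position. The following is a minimal sketch assuming a simple cyclic rotation; the function name and the system labels are illustrative, not taken from the lecture.

```python
def latin_square(conditions):
    """Row i is the order for subject i: the condition list rotated by i places."""
    n = len(conditions)
    return [[conditions[(i + j) % n] for j in range(n)] for i in range(n)]

# Hypothetical systems to compare; each subject sees them in a different order.
orders = latin_square(["System A", "System B", "System C"])
for subject, order in enumerate(orders, start=1):
    print(subject, order)
```

The same rotation can be applied to topics, so that learning and fatigue effects are spread evenly over the conditions.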
Document Selection Experiments
[Figure: document selection experiment. Inputs: Topic Description and Standard Ranked List; task: Interactive Selection; measure: F (α = 0.8)]
Measures of Effectiveness
• Query Formulation: Uninterpolated Average Precision
  – Expected value of precision [over relevant document positions]
  – Interpreted based on query content at each iteration

$$AP = E_j\left[\frac{j}{r(j)}\right]$$

where r(j) is the rank of the j-th relevant document

• Document Selection: Unbalanced F-Measure
  – P = precision
  – R = recall
  – α = 0.8 favors precision
    • Models expensive human translation

$$F_\alpha = \frac{1}{\dfrac{\alpha}{P} + \dfrac{1-\alpha}{R}}$$
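Both measures can be sketched in a few lines. The sketch assumes average precision is the mean of the precision at the rank of each relevant document, and that the unbalanced F-measure has the form F = 1 / (α/P + (1−α)/R); the function names and the sample ranked list are illustrative.

```python
def average_precision(ranked_relevance):
    """Mean of precision at the rank of each relevant document.

    ranked_relevance: list of booleans, top-ranked document first.
    Assumes every relevant document appears somewhere in the list.
    """
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0

def f_measure(precision, recall, alpha=0.8):
    """Unbalanced F-measure; alpha > 0.5 weights precision more heavily."""
    if precision == 0 or recall == 0:
        return 0.0
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)

# Relevant documents at ranks 1 and 3: AP = (1/1 + 2/3) / 2
print(average_precision([True, False, True, False]))

# With alpha = 0.8, high precision is rewarded more than high recall.
print(f_measure(1.0, 0.5), f_measure(0.5, 1.0))
```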
End-to-End Experiments
[Figure: end-to-end experiment. Topic Description feeds Query Formulation, Automatic Retrieval, and Interactive Selection; measures: Average Precision and F (α = 0.8)]
End-to-End Experiment Results
[Figure: F (α = 0.8) results]
English queries, German documents
4 searchers, 20 minutes per topic
Summary
• Qualitative user studies suggest what to build
• Design decomposes task into components
• Automated evaluation helps to refine components
• Quantitative user studies show how well it works
Supporting the Search Process
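[Figure: the search process. User-side stages: Source Selection, Query Formulation, Search, Selection, Examination, Delivery. Intermediate products: Query, Ranked List, Document. IR system side: Acquisition (producing the Collection) and Indexing (producing the Index)]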
Some Questions for Today
• How long will it take to find a document?
  – Is there any work we can do in advance?
    • If so, how long will that take?
• How big a computer will I need?
  – How much disk space? How much RAM?
• What if more documents arrive?
  – How much of the advance work must be repeated?
  – Will searching become slower?
  – How much more disk space will be needed?
A Cautionary Tale
• Searching is easy - just ask Microsoft!
  – “Find” can search my hard drive in a few minutes
    • If it only looks at the file names...
• How long would it take for the Web?
  – A 100 GB disk?
  – For the World Wide Web?
• Computers are getting faster, but…
  – How does Google give answers in 3 seconds?
Find “complex” in the dictionary
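[Figure: searching an alphabetized word list by repeatedly halving it. Words examined on the way to “complex”: marsupial, belligerent, complex. Other words shown: arcade, astronomical, mastiff, relatively, relaxation, resplendent]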
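The dictionary lookup is a binary search: open to the middle, decide which half holds the word, and repeat. A minimal sketch, using only the nine words visible on the slide as a stand-in for a full dictionary; the function name and the probe counter are illustrative.

```python
def binary_search(sorted_words, target):
    """Return (index or None, number of words examined)."""
    low, high = 0, len(sorted_words) - 1
    probes = 0
    while low <= high:
        mid = (low + high) // 2
        probes += 1
        if sorted_words[mid] == target:
            return mid, probes
        if sorted_words[mid] < target:
            low = mid + 1   # target can only be in the upper half
        else:
            high = mid - 1  # target can only be in the lower half
    return None, probes

words = sorted(["arcade", "astronomical", "belligerent", "complex", "marsupial",
                "mastiff", "relatively", "relaxation", "resplendent"])
index, probes = binary_search(words, "complex")
print(index, probes)
```

Each probe halves the remaining range, so the number of words examined grows with the logarithm of the dictionary size rather than with the size itself.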
Computational Complexity
• Time complexity: how long will it take?
• Space complexity: how much memory is needed?
• Things you need to know to assess complexity:
  – What is the “size” of the input? (“n”)
    • What aspects of the input are we paying attention to?
  – How is the input represented?
  – How is the output represented?
  – What are the internal data structures?
  – What is the algorithm?
Worst Case Complexity
[Chart: 10n, n^2, and 100n plotted for n = 10 to 40; vertical axis from 0 to 4500]
[Chart: 10n, n^2, 100n, and 100n+25263 plotted for n = 50 to 350; vertical axis from 0 to 140000]
• 10n: O(n)
• 100n: O(n)
• 100n+25263: O(n)
• n^2: O(n^2)
• n^2+45662: O(n^2)
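The point of the two charts is that constants and lower-order terms only delay the moment at which the faster-growing function takes over. A small sketch that finds that crossover for the functions plotted; the helper name and the search limit are illustrative.

```python
def first_n_where(f, g, limit=10_000):
    """Smallest n >= 1 with f(n) > g(n), or None if none is found below limit."""
    for n in range(1, limit):
        if f(n) > g(n):
            return n
    return None

def quadratic(n):
    return n * n

# n^2 overtakes each of the linear functions at some finite n.
print(first_n_where(quadratic, lambda n: 10 * n))           # 11
print(first_n_where(quadratic, lambda n: 100 * n))          # 101
print(first_n_where(quadratic, lambda n: 100 * n + 25263))  # 217
```

Beyond n = 217 the quadratic stays above all three linear functions, which is why only the highest-order term matters in O-notation.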
“Asymptotic” Complexity
• Constant, i.e. O(1)
  – n doesn’t matter
• Sublinear, e.g. O(log n)
  – n = 65536: log₂ n = 16
• Linear, i.e. O(n)
  – n = 65536: n = 65536
• Polynomial, e.g. O(n^3)
  – n = 65536: n^3 = 281,474,976,710,656
• Exponential, e.g. O(2^n)
  – n = 65536: beyond astronomical
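The figures for n = 65536 can be checked directly. This sketch counts the decimal digits of 2^n with a logarithm rather than printing the number itself; the variable names are illustrative.

```python
import math

n = 65536

log_n = math.log2(n)   # sublinear: 16.0
cube = n ** 3          # polynomial: 281474976710656

# 2**n is too large to be worth printing, so count its decimal digits instead.
digits_in_two_to_n = math.floor(n * math.log10(2)) + 1

print(log_n, cube, digits_in_two_to_n)
```

A number with nearly twenty thousand digits is what “beyond astronomical” means in practice.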
The “Inverted File” Trick
• Organize the bag of words matrix by terms
  – You know the terms that you are looking for
• Look up terms like you search dictionaries
  – For each letter, jump directly to the right spot
    • For terms of reasonable length, this is very fast
  – For each term, store the document identifiers
    • For every document that contains that term
• At query time, use the document identifiers
  – Consult a “postings file”
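The trick can be sketched with a dictionary that maps each term to a sorted list of document identifiers. This is a toy in-memory version, not the on-disk structure discussed later; the three documents and the function names are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted list of the document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def intersect(postings_a, postings_b):
    """Boolean AND: merge two sorted postings lists in a single pass."""
    i = j = 0
    result = []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result

# Hypothetical toy collection.
docs = {1: "quick brown fox", 2: "lazy brown dog", 3: "quick lazy fox"}
index = build_inverted_index(docs)

print(index["brown"])                            # [1, 2]
print(intersect(index["brown"], index["lazy"]))  # [2]
```

Only the postings for the query terms are touched at query time; documents that contain none of the terms are never examined.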
An Example
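| Term | Doc 1 | Doc 2 | Doc 3 | Doc 4 | Doc 5 | Doc 6 | Doc 7 | Doc 8 | Inverted File | Postings |
|---|---|---|---|---|---|---|---|---|---|---|
| aid | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | AI | 4, 8 |
| all | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | AL | 2, 4, 6 |
| back | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | BA | 1, 3, 7 |
| brown | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | BR | 1, 3, 5, 7 |
| come | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | C | 2, 4, 6, 8 |
| dog | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | D | 3, 5 |
| fox | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | F | 3, 5, 7 |
| good | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | G | 2, 4, 6, 8 |
| jump | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | J | 3 |
| lazy | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | L | 1, 3, 5, 7 |
| men | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | M | 2, 4, 8 |
| now | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | N | 2, 6, 8 |
| over | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | O | 1, 3, 5, 7, 8 |
| party | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | P | 6, 8 |
| quick | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | Q | 1, 3 |
| their | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | TH | 1, 5, 7 |
| time | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | TI | 2, 4, 6 |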
The Finished Product
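| Term | Inverted File | Postings |
|---|---|---|
| aid | AI | 4, 8 |
| all | AL | 2, 4, 6 |
| back | BA | 1, 3, 7 |
| brown | BR | 1, 3, 5, 7 |
| come | C | 2, 4, 6, 8 |
| dog | D | 3, 5 |
| fox | F | 3, 5, 7 |
| good | G | 2, 4, 6, 8 |
| jump | J | 3 |
| lazy | L | 1, 3, 5, 7 |
| men | M | 2, 4, 8 |
| now | N | 2, 6, 8 |
| over | O | 1, 3, 5, 7, 8 |
| party | P | 6, 8 |
| quick | Q | 1, 3 |
| their | TH | 1, 5, 7 |
| time | TI | 2, 4, 6 |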
What Goes in a Postings File?
• Boolean retrieval
  – Just the document number
• Ranked Retrieval
  – Document number and term weight (TF*IDF, ...)
• Proximity operators
  – Word offsets for each occurrence of the term
    • Example: Doc 3 (t17, t36), Doc 13 (t3, t45)
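Positional postings are the richest of the three layouts, and the simpler ones can be derived from them. A minimal sketch: the offsets for `term_a` follow the example above, while `term_b`, the dictionary layout, and the function names are hypothetical.

```python
# Positional postings: term -> {document number: [word offsets]}
positional = {
    "term_a": {3: [17, 36], 13: [3, 45]},  # offsets from the example above
    "term_b": {3: [18, 90], 7: [2]},       # hypothetical second term
}

def boolean_postings(term):
    """Boolean retrieval needs only the document numbers."""
    return sorted(positional[term])

def near(term_a, term_b, max_distance):
    """Documents where the two terms occur within max_distance words."""
    shared = positional[term_a].keys() & positional[term_b].keys()
    matches = []
    for doc_id in sorted(shared):
        offsets_a = positional[term_a][doc_id]
        offsets_b = positional[term_b][doc_id]
        if any(abs(a - b) <= max_distance for a in offsets_a for b in offsets_b):
            matches.append(doc_id)
    return matches

print(boolean_postings("term_a"))    # [3, 13]
print(near("term_a", "term_b", 1))   # [3]
```

Storing an offset for every occurrence is what makes this layout so much larger than document numbers alone.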
How Big Is the Postings File?
• Very compact for Boolean retrieval
  – About 10% of the size of the documents
    • If an aggressive stopword list is used!
• Not much larger for ranked retrieval
  – Perhaps 20%
• Enormous for proximity operators
  – Sometimes larger than the documents!
Building an Inverted Index
• Simplest solution is a single sorted array
  – Fast lookup using binary search
  – But sorting large files on disk is very slow
  – And adding one document means starting over
• Tree structures allow easy insertion
  – But the worst case lookup time is linear
• Balanced trees provide the best of both
  – Fast lookup and easy insertion
  – But they require 45% more disk space
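The linear worst case for a plain tree shows up when keys arrive already sorted: every new key becomes the right child of the previous one. A minimal sketch with an unbalanced binary search tree; the class and function names are illustrative, and integers stand in for terms.

```python
import random

class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert into a plain (unbalanced) binary search tree; return the root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                return root
            node = node.left
        elif key > node.key:
            if node.right is None:
                node.right = Node(key)
                return root
            node = node.right
        else:
            return root  # key already present

def height(root):
    """Number of nodes on the longest root-to-leaf path (iterative)."""
    best = 0
    stack = [(root, 1)] if root is not None else []
    while stack:
        node, depth = stack.pop()
        best = max(best, depth)
        for child in (node.left, node.right):
            if child is not None:
                stack.append((child, depth + 1))
    return best

def build(keys):
    root = None
    for key in keys:
        root = insert(root, key)
    return root

keys = list(range(1000))
sorted_height = height(build(keys))    # degenerates into a chain of 1000 nodes
random.Random(0).shuffle(keys)
shuffled_height = height(build(keys))  # much shallower for a random order
print(sorted_height, shuffled_height)
```

A balanced tree such as a B+ tree rearranges itself on insertion so that the sorted-input case cannot produce the long chain.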
Starting a B+ Tree Inverted File
Input text: Now is the time for all good …
[Figure: B+ tree with an index node labeled “aaaaa” and “now” above leaf entries for all, good, now, time]
Adding a New Term
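Input text: Now is the time for all good men …
[Figure: the same B+ tree after adding “men”. Index node labels shown: “aaaaa now” and “aaaaa men”; leaf entries: all, good, men, now, time]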
How Big is the Inverted Index?
• Typically smaller than the postings file
  – Depends on number of terms, not documents
• Eventually, most terms will already be indexed
  – But the postings file will continue to grow
• Postings dominate asymptotic space complexity
  – Linear in the number of documents
Index Compression
• CPUs are much faster than disks
  – A disk can transfer 1,000 bytes in ~20 ms
  – The CPU can do ~10 million instructions in that time
• Compressing the postings file is a big win
  – Trade decompression time for fewer disk reads
• Key idea: reduce redundancy
  – Trick 1: store relative offsets (some will be the same)
  – Trick 2: use an optimal coding scheme
Compression Example
• Postings (one byte each = 7 bytes = 56 bits)
  – 37, 42, 43, 48, 97, 98, 243
• Difference
  – 37, 5, 1, 5, 49, 1, 145
• Huffman-style prefix code
  – 0:1, 10:5, 110:37, 1110:49, 1111:145
• Compressed (17 bits)
  – 11010010111001111
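The two tricks can be replayed step by step: take differences between successive postings, then encode each difference with the code table from the example. The sketch also runs the standard Huffman construction on the same gap frequencies for comparison; the function names are illustrative.

```python
import heapq
from collections import Counter

def gaps(postings):
    """Trick 1: keep the first posting, then store successive differences."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def encode(values, code_table):
    """Trick 2: replace each value by its variable-length bit string."""
    return "".join(code_table[v] for v in values)

def huffman_total_bits(values):
    """Encoded length when code lengths come from the Huffman construction."""
    heap = list(Counter(values).values())
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged  # every merge adds one bit to each symbol beneath it
        heapq.heappush(heap, merged)
    return total

postings = [37, 42, 43, 48, 97, 98, 243]
code_table = {1: "0", 5: "10", 37: "110", 49: "1110", 145: "1111"}

differences = gaps(postings)
bits = encode(differences, code_table)
print(differences)      # [37, 5, 1, 5, 49, 1, 145]
print(bits, len(bits))  # 11010010111001111 17
print(huffman_total_bits(differences))
```

The table from the example yields 17 bits, against 56 bits uncompressed. Running the Huffman construction on these particular gap frequencies gives 16 bits, so the table is close to, but one bit above, the minimum for a prefix code. The bit counts cover only the encoded stream; the code table itself also has to be stored or agreed on.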
Indexing and Searching
• Indexing
  – Walk the inverted file, splitting if needed
  – Insert into the postings file in sorted order
  – Hours or days for large collections
• Query processing
  – Walk the inverted file
  – Read the postings file
  – Manipulate postings based on query
  – Seconds, even for enormous collections
Summary
• Slow indexing yields fast query processing
  – Key fact: most terms don’t appear in most documents
• We use extra disk space to save query time
  – Index space is in addition to document space
  – Time and space complexity must be balanced
• Disk block reads are the critical resource
  – This makes index compression a big win
Project Options
• LBSC 796 MLS/MIM
  – Option 1: TREC-like IR evaluation (team of 2)
  – Option 2: Design and run a user study (team of 3)
• LBSC 796 Ph.D.
  – Research paper
• CMSC 828o
  – Program a new capability
One Minute Paper
What was the muddiest point in today’s lecture?