![Page 1: Ghislain Fourny Information Retrieval - Systems Group · 2019-07-15 · Term vocabulary and posting lists Tolerant retrieval Evaluation Scale up Index compression Vector space model](https://reader030.vdocuments.mx/reader030/viewer/2022040300/5e693f89e432f052f7740f2a/html5/thumbnails/1.jpg)
Ghislain Fourny
Information Retrieval
12. Wrap-Up
Picture copyright: johan2011/123RF Stock Photo
Lecture Overview
Introduction
Boolean queries
Term vocabulary and posting lists
Tolerant retrieval
Evaluation
Scale up
Index compression
Vector space model
Probabilistic information retrieval
Language models
Indexing the Web
Topic groups: Basics of Information Retrieval, Advanced topics, Alternate methodologies
Data Shapes: Text
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam vel erat nec dui aliquet vulputate sed quis nulla. Donec eget ultricies magna, eu dignissim elit. Nullam sed urna nec nisl rhoncus ullamcorper placerat et enim. Integer varius ornare libero quis consequat. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean eu efficitur orci. Aenean ac posuere tellus. Ut id commodo turpis.
Praesent nec libero metus. Praesent at turpis placerat, congue ipsum eget, scelerisque justo. Ut volutpat, massa ac lacinia cursus, nisl dui volutpat arcu, quis interdum sapien turpis in tellus. Suspendisse potenti. Vestibulum pharetra justo massa, ac venenatis mi condimentum nec. Proin viverra tortor non orci suscipit rutrum. Phasellus sit amet euismod diam. Nullam convallis nunc sit amet diam suscipit dapibus. Integer porta hendrerit nunc. Quisque pharetra congue porta. Suspendisse vestibulum sed mi in euismod. Etiam a purus suscipit, accumsan nibh vel, posuere ipsum. Nulla nec tempor nibh, id venenatis lectus. Duis lobortis id urna eget tincidunt.
Boolean retrieval
Query: lawyer AND Penang AND NOT silver
Input: set of documents
Output: subset of documents
Document
Documents
Term
Sherlock, lawyer, Switzerland, Unterwalden nid dem Wald, ETH Zürich, person, watch, run, paper, book, ...
Model and abstraction
Document as a list of words (with duplicates)
Simplification
Document as a set of words
Document as a vector of booleans
(0 1 0 1 0 1 1 1 0 0 1 1 0 1 0 1 0 1 0 1 0)
Incidence Matrix
[Figure: incidence matrix with terms (t, u, v, w, x, y) as rows and documents (1 to 10) as columns]
Warm up
Term  Postings       Count
a     1 2 3 5 6 8    6
b     3 4 7 8 9      5
c     1 2 4 5 7      5
d     1 3 5 8 9      5
e     2 3 4 7        4
f     1 2 4 5 8 9    6
g     3 5 7 8        4
Intersection algorithm
List A: 1 2 4 5 8 9 10 12
List B: 1 3 4 6 7 8 11 12
Intersection of A and B: 1 4 8 12
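The linear merge over two sorted postings lists can be sketched as follows. This is a minimal Python sketch, assuming both lists are sorted in increasing order of document ID; the function name `intersect` is illustrative and not taken from the lecture.

```python
def intersect(a, b):
    """Intersect two postings lists sorted in increasing order."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Same document ID in both lists: keep it and advance both.
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Advance the pointer that sits on the smaller document ID.
            i += 1
        else:
            j += 1
    return result


list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))
```

Each pointer only moves forward, so the cost is linear in the total length of the two lists.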
Index construction
Collect documents
Tokenizing
Linguistic preprocessing
Build the index (postings list)
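The four steps above can be sketched in a few lines. This is a minimal sketch, assuming a regular-expression tokenizer and lowercasing as the only linguistic preprocessing (both are simplifying assumptions); the names `tokenize`, `normalize` and `build_index` are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split a document into tokens (runs of word characters)."""
    return re.findall(r"\w+", text)


def normalize(token):
    """Stand-in for linguistic preprocessing: lowercasing only."""
    return token.lower()


def build_index(documents):
    """Map each term to the sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[normalize(token)].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in sorted(index.items())}


# Step 1, "collect documents": two lines quoted on the slides.
docs = {
    1: "You come most carefully upon your hour",
    2: "My hour is almost come",
}
index = build_index(docs)
print(index["hour"], index["my"])
```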
Type
You come most carefully upon your hour
thine, betime, Laertes, hour, thy, fair, Take
My hour is almost come
Possess it merely That it should come to this
Type = equivalence class (same sequences)
Stop words
a an and are as at be by for from has he in
is it its of on that the to was were will with
Query expansion
Upon indexing: the postings lists of Lift and Elevator are expanded in the index itself, so the query "Lift" needs no rewriting.
Upon querying: the index is left as is, and the query "Lift" is expanded to "Lift OR Elevator".
[Figure: postings lists of Lift and Elevator before and after expansion]
Porter Stemmer
https://tartarus.org/martin/PorterStemmer/
(m>0) ENCI -> ENCE        valenci -> valence
(m>0) ANCI -> ANCE        hesitanci -> hesitance
(m>0) IZER -> IZE         digitizer -> digitize
(m>0) ABLI -> ABLE        conformabli -> conformable
(m>0) ALLI -> AL          radicalli -> radical
(m>0) ENTLI -> ENT        differentli -> different
(m>0) ELI -> E            vileli -> vile
(m>0) OUSLI -> OUS        analogousli -> analogous
(m>0) IZATION -> IZE      vietnamization -> vietnamize
(m>0) ATION -> ATE        predication -> predicate
(m>0) ATOR -> ATE         operator -> operate
(m>0) ALISM -> AL         feudalism -> feudal
(m>0) IVENESS -> IVE      decisiveness -> decisive
(m>0) FULNESS -> FUL      hopefulness -> hopeful
(m>0) OUSNESS -> OUS      callousness -> callous
Skip lists
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
In practice: $\sqrt{\text{number of postings}}$ skip pointers
Bi-word indices (Phrase search feature)
Help ETH Zurich to flexibly react to new challenges and to set new accents in the future.
Index
Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to
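Generating the bi-words of a document amounts to pairing each token with its successor. A minimal sketch, assuming whitespace tokenization; the function name `biwords` is hypothetical.

```python
def biwords(text):
    """Return every pair of consecutive tokens as one index entry."""
    tokens = text.split()
    return [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]


sentence = "Help ETH Zurich to flexibly react to new challenges"
for entry in biwords(sentence)[:6]:
    print(entry)
```

A phrase query such as "ETH Zurich" then becomes a lookup of a single bi-word entry.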
Positional index (phrase search feature)
Help C,1: 1
ETH C,1: 2
Zurich C,1: 3
to C,3: 4, 7, 11
flexibly C,1: 5
react C,1: 6
Query: "ETH Zurich"
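A positional index and a phrase lookup over it can be sketched as follows. This is a minimal sketch, assuming whitespace tokenization, positions counted from 1, and a single document with the identifier C as in the listing for the positional index; the names `build_positional_index` and `phrase_query` are hypothetical.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map term -> document ID -> list of positions (starting at 1)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, token in enumerate(text.split(), start=1):
            index[token][doc_id].append(position)
    return index


def phrase_query(index, phrase):
    """Return the documents in which the terms occur at consecutive positions."""
    terms = phrase.split()
    if not terms or any(term not in index for term in terms):
        return []
    # Only documents that contain every term can match.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    matches = []
    for doc_id in sorted(candidates):
        for start in index[terms[0]][doc_id]:
            if all(start + offset in index[term][doc_id]
                   for offset, term in enumerate(terms)):
                matches.append(doc_id)
                break
    return matches


docs = {"C": "Help ETH Zurich to flexibly react to new challenges "
             "and to set new accents in the future."}
index = build_positional_index(docs)
print(index["to"]["C"])
print(phrase_query(index, "ETH Zurich"))
```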
Search structures
Hash tables
Trees (B, B+)
B+-tree
Root (2 children): possess
Inner nodes (4 children each): come, is, merely | that, thy, upon
Leaves: almost, be, carefully | come, fair, hour | is, it, Laertes | merely, most, my | possess, should, take | that, thine, this | thy, time, to | upon, you, your
But it's fine if the root has fewer.
Wildcard queries
foo*eth*bar (multiple wildcards)
Permuterm index
plant
$plant
t$plan
nt$pla
ant$pl
lant$p
plant$
Rotations
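The rotations can be generated mechanically, and a query with a single wildcard can be rotated so that the wildcard ends up at the end, turning it into a prefix lookup. A minimal sketch; the names `permuterm_rotations` and `wildcard_key` are hypothetical, and the handling of exactly one `*` is a simplifying assumption.

```python
def permuterm_rotations(term):
    """Return all rotations of the term followed by the end marker $."""
    augmented = term + "$"
    return [augmented[i:] + augmented[:i] for i in range(len(augmented))]


def wildcard_key(query):
    """Rotate a query X*Y (exactly one *) into the prefix Y$X."""
    prefix, suffix = query.split("*")
    return suffix + "$" + prefix


rotations = permuterm_rotations("plant")
print(rotations)

# The query pl*t becomes the prefix t$pl, which matches the rotation t$plan.
key = wildcard_key("pl*t")
print([rotation for rotation in rotations if rotation.startswith(key)])
```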
k-grams
computer
1-grams: $, c, o, m, p, u, t, e, r, $
2-grams: $c, co, om, mp, pu, ut, te, er, r$
3-grams: $co, com, omp, mpu, put, ute, ter, er$
4-grams: $com, comp, ompu, mput, pute, uter, ter$
5-grams: $comp, compu, omput, mpute, puter, uter$
6-grams: $compu, comput, ompute, mputer, puter$
7-grams: $comput, compute, omputer, mputer$
...
Small k: not very useful
Intermediate k: usable zone
Large k: not space efficient
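The k-grams of a term can be produced with a sliding window over the term padded with the boundary marker $. A minimal sketch; the function name `kgrams` is hypothetical.

```python
def kgrams(term, k):
    """Return the k-grams of a term, with $ marking both word boundaries."""
    padded = "$" + term + "$"
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]


for k in (1, 2, 3):
    print(k, kgrams("computer", k))
```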
Edit distance (cat, ate)
    #  a  t  e
#   0  1  2  3
c   1  1  2  3
a   2  1  2  3
t   3  2  1  2
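The table for cat and ate can be filled with the usual dynamic program, where each cell is the cheapest of a deletion, an insertion, or a substitution (free when the letters match). A minimal sketch assuming unit costs for all three operations; the function names are hypothetical.

```python
def edit_distance_table(source, target):
    """Fill the dynamic-programming table of edit distances between prefixes."""
    rows, cols = len(source) + 1, len(target) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i  # delete the first i letters of source
    for j in range(cols):
        table[0][j] = j  # insert the first j letters of target
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,                 # deletion
                table[i][j - 1] + 1,                 # insertion
                table[i - 1][j - 1] + substitution,  # match or substitution
            )
    return table


def edit_distance(source, target):
    """The distance is the bottom-right cell of the table."""
    return edit_distance_table(source, target)[-1][-1]


for row in edit_distance_table("cat", "ate"):
    print(row)
print(edit_distance("cat", "ate"))
```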
Jaccard coefficient
3-grams of computer: $co, com, omp, mpu, put, ute, ter, er$
3-grams of cmputer: $cm, cmp, mpu, put, ute, ter, er$
|A ∩ B| / |A ∪ B| = 5 / 10 = 0.5
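The coefficient can be checked by computing both sets of 3-grams and dividing the size of their intersection by the size of their union. A minimal sketch; the misspelling cmputer is the one implied by the 3-grams $cm and cmp, and the function names are hypothetical.

```python
def kgram_set(term, k=3):
    """Return the set of k-grams of a term padded with the marker $."""
    padded = "$" + term + "$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b)


grams_a = kgram_set("computer")
grams_b = kgram_set("cmputer")
print(sorted(grams_a & grams_b))
print(jaccard(grams_a, grams_b))
```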
Soundex algorithm
Change...           To...
A E H I O U W Y     0
B F P V             1
C G J K Q S X Z     2
D T                 3
L                   4
M N                 5
R                   6
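The table only gives the letter-to-digit mapping. The sketch below wraps it in the commonly described four-character variant of Soundex: keep the first letter, encode the remaining letters, collapse runs of identical digits, drop the zeros, and pad with zeros. Those surrounding steps are an assumption on top of the table, and other Soundex variants differ in details; the names `GROUPS`, `CODES` and `soundex` are hypothetical.

```python
# Letter-to-digit table from the Soundex mapping.
GROUPS = [("AEHIOUWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
          ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]
CODES = {letter: digit for letters, digit in GROUPS for letter in letters}


def soundex(term):
    """Return a four-character code: first letter plus three digits."""
    letters = [c for c in term.upper() if c in CODES]
    if not letters:
        return ""
    # Encode everything after the first letter.
    digits = [CODES[c] for c in letters[1:]]
    # Keep one digit out of each run of identical consecutive digits.
    collapsed = []
    for digit in digits:
        if not collapsed or collapsed[-1] != digit:
            collapsed.append(digit)
    # Drop the zeros, then pad with trailing zeros to four characters.
    tail = "".join(d for d in collapsed if d != "0")
    return (letters[0] + tail + "000")[:4]


print(soundex("Hermann"), soundex("Herman"))
```

Names that sound alike, such as Hermann and Herman, end up with the same code and can share one entry in a phonetic index.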
Memory hierarchy
Cache (CPU), level 1 and 2: volatile
Memory (RAM): volatile
Disk (secondary storage): non-volatile
Tapes, DVDs (tertiary storage): non-volatile
TermIDs
TermID  Postings
t1      1 2 3 ...
t2      3 4 7 ...
t3      1 2 4 ...
t4      1 3 5 ...
t5      2 3 4 ...
t6      1 2 4 ...
t7      3 5 7 ...
Blocked Sort-Based Indexing
Single-Pass In-Memory Indexing
MapReduce
[Figure: MapReduce index construction over the terms ETH, computer, data, CPU and information, with postings for documents 1 and 2 grouped by term]
Logarithmic Merging
I0: n postings
I1: 2n postings
I2: 4n postings
Intermediate indexes: Z0, Z1
Heaps' law
$M = k\sqrt{T}$, with $30 \le k \le 100$
$M$: number of terms, $T$: number of tokens
Zipf's law
$\text{Frequency} = \dfrac{k}{\text{Rank}}$
[Plot: term frequency (axis from 0 to 60000000) against rank]
Compression: Front coding
[Figure: dictionary with a frequency column (6, 5, 5, 5, 4, 6, 4) and pointer columns, annotated "4 bytes", "3 bytes" and "4 bytes (less bytes)"; term pointers are kept only every k terms]
8automat*a 8○e 9○ic 10○ion
Variable byte encoding
decimal  binary    variable byte encoding
0        0         0000
1        1         0001
2        10        0010
3        11        0011
4        100       0100
5        101       0101
6        110       0110
7        111       0111
8        1000      1001 0000
9        1001      1001 0001
10       1010      1001 0010
11       1011      1001 0011
12       1100      1001 0100
13       1101      1001 0101
14       1110      1001 0110
15       1111      1001 0111
16       10000     1010 0000
17       10001     1010 0001
18       10010     1010 0010
19       10011     1010 0011
20       10100     1010 0100
21       10101     1010 0101
22       10110     1010 0110
23       10111     1010 0111
...
64       1000000   1001 1000 0000
0 to 7: fits on 3 bits
8 to 63: fits on 6 bits
50% less space
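The codes in the table can be reproduced by cutting the number into 3-bit chunks and prefixing each chunk with a flag bit. The sketch below assumes the convention read off the table (flag 1 means another group follows, flag 0 marks the last group) and groups of 4 bits; byte-oriented implementations use 7 payload bits per byte instead, and some texts invert the flag. The function names and the `payload_bits` parameter are hypothetical.

```python
def variable_length_encode(number, payload_bits=3):
    """Encode a non-negative integer as flag-prefixed groups of bits."""
    chunks = []
    while True:
        chunks.append(number % (1 << payload_bits))
        number >>= payload_bits
        if number == 0:
            break
    chunks.reverse()  # most significant chunk first
    groups = []
    for position, chunk in enumerate(chunks):
        # Flag 1: more groups follow. Flag 0: this is the last group.
        flag = "0" if position == len(chunks) - 1 else "1"
        groups.append(flag + format(chunk, f"0{payload_bits}b"))
    return " ".join(groups)


def variable_length_decode(encoded, payload_bits=3):
    """Rebuild the integer by concatenating the payload of every group."""
    number = 0
    for group in encoded.split():
        number = (number << payload_bits) | int(group[1:], 2)
    return number


for value in (5, 8, 23, 64):
    print(value, variable_length_encode(value))
```

Calling `variable_length_encode(number, payload_bits=7)` gives the byte-sized variant under the same flag convention.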
Gamma encoding
19 in binary: 10011
Offset (binary without the leading 1): 0011
Length of the offset in unary: 11110
Gamma code (length, then offset): 111100011
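The gamma code of 19 can be reproduced step by step: drop the leading 1 of the binary form to get the offset, write the length of the offset in unary, and concatenate the two. A minimal sketch for positive integers; the function names are hypothetical.

```python
def gamma_encode(number):
    """Return the gamma code of a positive integer as a string of bits."""
    if number < 1:
        raise ValueError("gamma codes are defined for positive integers only")
    offset = format(number, "b")[1:]   # binary form without the leading 1
    length = "1" * len(offset) + "0"   # length of the offset in unary
    return length + offset


def gamma_decode(bits):
    """Decode a single gamma code back to the integer."""
    length = bits.index("0")           # number of leading 1s
    offset = bits[length + 1:length + 1 + length]
    return int("1" + offset, 2)        # put the leading 1 back


print(gamma_encode(19))
print(gamma_decode("111100011"))
```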
Ranked retrieval
Query: lawyer Penang silver
Input: set of documents
Output: ranked subset of documents (ranks 1, 2, 3, 4 in the figure)
Parametric search
[Search form with the fields Title (filled in with "Algorithms"), Author, Publication Date, Language, Country and Cost (from $ to $), and a Search button]
Parametric indices
Fields: Title, Author, Publication Date, Language, Country, Cost
Each field: search structure with posting lists
Term frequency, (Inverse) Document frequency
idf
foo     5
bar     10
foobar  3
tf      A   B
foo     5   1
bar     0   4
foobar  2   1
tf-idf  A   B
foo     25  5
bar     0   40
foobar  6   3
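The tf-idf table is the entry-wise product of the tf table with the idf column. A minimal sketch using the numbers for foo, bar and foobar in documents A and B; the function name `tf_idf` is hypothetical.

```python
idf = {"foo": 5, "bar": 10, "foobar": 3}
tf = {
    "A": {"foo": 5, "bar": 0, "foobar": 2},
    "B": {"foo": 1, "bar": 4, "foobar": 1},
}


def tf_idf(tf, idf):
    """Multiply each term frequency by the idf of its term."""
    return {
        doc: {term: count * idf[term] for term, count in counts.items()}
        for doc, counts in tf.items()
    }


weights = tf_idf(tf, idf)
for doc, row in weights.items():
    print(doc, row)
```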
Model and abstraction
Document as a list of words (with duplicates)
Simplification
Document as a bag of words
Document as a vector of numbers
(0. 1.2 0.15 0.34 2.4 23.5.4324.5 0.13)
Vector-Space Model
Documents (d1, d2, d3, d4, d5) = vectors in the first quadrant of $\mathbb{R}^M$
Queries as vectors
Queries (q1, q2) = points in the first quadrant of $\mathbb{R}^M$, next to the documents d1, d2, d3, d4, d5
d3 is a good result for q2!
Inner product as score
$\vec{x} \cdot \vec{y} = \sum_{i=1}^{M} x_i y_i$
[Figure: two vectors separated by an angle $\theta$]
Evidence accumulation
[Figure: postings lists of the query terms ETH, computer and data; each posting stores $\mathrm{tf}_{t,d}$, each term stores $\mathrm{idf}_t$, the query provides $\mathrm{tf}_{t,q}$ and $\|q\|$, and $\|d\|$ is stored per document; one accumulator per document (1 to 7)]
Contribution of term $t$ to the score of document $d$: $\dfrac{\mathrm{tf}_{t,q} \times \mathrm{idf}_t \times \mathrm{tf}_{t,d} \times \mathrm{idf}_t}{\|q\| \times \|d\|}$
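Evidence accumulation can be sketched as a term-at-a-time loop: for each query term, walk its postings list and add the product of the query weight and the document weight to the accumulator of each document, then divide by the two norms. The postings, term frequencies and query below are hypothetical toy data, idf is taken as $\log(N/\mathrm{df}_t)$, and the function names are assumptions.

```python
import math
from collections import defaultdict


def document_norms(postings, idf):
    """Compute ||d|| for every document from its tf-idf weights."""
    squared = defaultdict(float)
    for term, plist in postings.items():
        for doc_id, tf_td in plist:
            squared[doc_id] += (tf_td * idf[term]) ** 2
    return {doc_id: math.sqrt(total) for doc_id, total in squared.items()}


def cosine_scores(query_tf, postings, idf, norms):
    """Term-at-a-time scoring with one accumulator per document."""
    query_weights = {t: tf * idf[t] for t, tf in query_tf.items() if t in postings}
    accumulators = defaultdict(float)
    for term, w_tq in query_weights.items():
        for doc_id, tf_td in postings[term]:
            # tf_tq * idf_t * tf_td * idf_t, accumulated per document
            accumulators[doc_id] += w_tq * tf_td * idf[term]
    query_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    ranked = [(score / (query_norm * norms[doc_id]), doc_id)
              for doc_id, score in accumulators.items()]
    return sorted(ranked, reverse=True)


# Hypothetical toy index: term -> list of (document ID, term frequency).
postings = {
    "ETH": [(1, 2), (2, 1)],
    "computer": [(2, 3), (3, 1)],
    "data": [(1, 1), (3, 2)],
}
N = 3  # number of documents in the toy collection
idf = {term: math.log(N / len(plist)) for term, plist in postings.items()}
norms = document_norms(postings, idf)
results = cosine_scores({"ETH": 1, "computer": 1}, postings, idf, norms)
print([doc_id for _, doc_id in results])
```

Only documents that appear in the postings list of at least one query term ever receive an accumulator, which is what keeps this cheaper than scoring every document.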
SMART notation
atc.lnb
Query weights (lnb):
l: sublinear term frequency
n: natural document frequency
b: byte-size normalization
Probabilistic Information Retrieval
Sort by:
$P(R = 1 \mid D = d \wedge Q = q)$
$P(R = 1 \mid D = e \wedge Q = q)$
$P(R = 1 \mid D = f \wedge Q = q)$
$P(R = 1 \mid D = g \wedge Q = q)$
... falling back to Ranked Retrieval and evidence accumulation!
$\mathrm{RSV}_d = \sum_{k \,:\, d_k = 1 \wedge q_k = 1} \log \dfrac{N}{\mathrm{df}_k}$
This justifies idf weighting in the Vector-Space Model!
Language models
The user enters a query q.
Thought experiment: imagine that
• we picked a random document and built its model
• we used this model to generate a new document
• that document turns out to be q
Which document is the most likely to have been picked and to have generated q?
Results
[Figure: the returned results, split into relevant and not relevant]
$\text{Precision} = \dfrac{|\text{relevant} \cap \text{returned}|}{|\text{returned}|}$
Results
[Figure: the relevant documents, split into positives (returned) and negatives (not returned)]
$\text{Recall} = \dfrac{|\text{relevant} \cap \text{returned}|}{|\text{relevant}|}$
Specificity
[Figure: the documents that are not relevant, split into returned and not returned]
$\text{Specificity} = \dfrac{|\text{not relevant} \cap \text{not returned}|}{|\text{not relevant}|}$
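Precision, recall and specificity can all be computed from the same four counts. A minimal sketch using sets of document IDs; the collection, the returned set and the relevant set are hypothetical toy data, and the sketch assumes that none of the denominators is zero.

```python
def evaluate(returned, relevant, collection):
    """Compute precision, recall and specificity from sets of document IDs."""
    true_positives = len(returned & relevant)
    false_positives = len(returned - relevant)
    false_negatives = len(relevant - returned)
    true_negatives = len(collection - returned - relevant)
    return {
        "precision": true_positives / (true_positives + false_positives),
        "recall": true_positives / (true_positives + false_negatives),
        "specificity": true_negatives / (true_negatives + false_positives),
    }


# Hypothetical toy data: ten documents, four returned, five relevant.
collection = set(range(1, 11))
returned = {1, 2, 3, 4}
relevant = {2, 4, 6, 8, 10}
metrics = evaluate(returned, relevant, collection)
print(metrics)
```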
F measure: harmonic mean
$F_\alpha = \dfrac{1}{\dfrac{\alpha}{P} + \dfrac{1 - \alpha}{R}}$
Weighting: $\alpha = 1$ (precision only), $\alpha = 0$ (recall only)
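The weighted harmonic mean is a one-line computation once precision P and recall R are known. A minimal sketch assuming P and R are both non-zero; the function name `f_measure` and the sample values are hypothetical.

```python
def f_measure(precision, recall, alpha=0.5):
    """Weighted harmonic mean of precision and recall."""
    return 1 / (alpha / precision + (1 - alpha) / recall)


precision, recall = 0.5, 0.4
print(f_measure(precision, recall))             # balanced, alpha = 0.5
print(f_measure(precision, recall, alpha=1.0))  # equals the precision
print(f_measure(precision, recall, alpha=0.0))  # equals the recall
```

With alpha = 0.5 the value coincides with the familiar 2PR / (P + R).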
Precision-Recall curves
[Plot: precision (y-axis) against recall (x-axis)]
ROC Curves
[Plot: recall (sensitivity) on the y-axis against 1 - specificity on the x-axis]