TRANSCRIPT
NLU & IR: Classical IR
Omar Khattab
CS224U: Natural Language Understanding
Spring 2021
1
Ranked Retrieval
■ Scope: A large corpus of text documents (e.g., Wikipedia)
■ Input: A textual query (e.g., a natural-language question)
■ Output: Top-K Ranking of relevant documents (e.g., top-100)
2
How do we conduct ranked retrieval?
■ We’ve touched on one way before: the Term–Document Matrix
■ With good weights, this allows us to answer single-term queries!
3
How do we conduct ranked retrieval?
■ For multi-term queries, classical IR models would tokenize and then treat the tokens independently.
RelevanceScore(query, doc) = Σ_{term ∈ query} Weight(doc, term)
■ This reduces a large fraction of classical IR to:
– How do we best tokenize (and stem) queries and documents?
– How do we best weight each term–document pair?
4
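As a sketch, the scoring model above can be written in a few lines of Python. The tokenizer and the weight function here are illustrative stand-ins, not from the lecture; any term–document weighting scheme (such as the TF-IDF or BM25 formulas on the next slides) can be plugged in as `weight`:

```python
def tokenize(text):
    """Naive lowercase/whitespace tokenizer; real systems also stem."""
    return text.lower().split()

def relevance_score(query, doc_id, weight):
    """Score a multi-term query by summing per-term document weights.

    `weight(doc_id, term)` is any term--document weighting function."""
    return sum(weight(doc_id, term) for term in tokenize(query))
```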
Term–Document Weighting: Intuitions
■ Frequency of occurrence will remain a primary factor
– If a term 𝑡 occurs frequently in document 𝑑, the document is more likely to be relevant for queries including 𝑡
■ Normalization will remain a primary component too
– If that term 𝑡 is rather rare, then document 𝑑 is even more likely to be relevant for queries including 𝑡
– If that document 𝑑 is rather short, this also improves its odds
5
Term–Document Weighting: TF-IDF
■ Let N = |Collection| and df(term) = |{doc ∈ Collection : term ∈ doc}|

TF(term, doc) = log(1 + Freq(term, doc))
IDF(term) = log(N / df(term))
TF-IDF(term, doc) = TF(term, doc) × IDF(term)
TF-IDF(query, doc) = Σ_{term ∈ query} TF-IDF(term, doc)
6
TF and IDF both grow sub-linearly with frequency and 1/df (in particular, logarithmically).
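A minimal sketch of these TF-IDF formulas over a toy tokenized collection. The collection and variable names are illustrative; a real system would read frequencies from an inverted index rather than scanning documents:

```python
import math
from collections import Counter

# Toy tokenized collection: doc_id -> list of tokens (illustrative).
collection = {
    "d1": ["deep", "learning", "for", "search"],
    "d2": ["classical", "search", "search", "engines"],
}
N = len(collection)

# Document frequency: in how many documents does each term occur?
df = Counter()
for terms in collection.values():
    df.update(set(terms))

def tf(term, doc_id):
    # Sub-linear (logarithmic) growth with raw frequency.
    return math.log(1 + collection[doc_id].count(term))

def idf(term):
    # Rare terms get higher weight; a term in every doc gets 0.
    return math.log(N / df[term])

def tf_idf(query_terms, doc_id):
    return sum(tf(t, doc_id) * idf(t) for t in query_terms if t in df)
```

Note that "search" appears in every document here, so its IDF is log(N/N) = 0 and it contributes nothing to the score.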
Term–Document Weighting: BM25
Or "Finding the best match, seriously this time! Attempt #25" :-)
BM25:IDF(term) = log(1 + (N − df(term) + 0.5) / (df(term) + 0.5))
BM25:TF(term, doc) = Freq(term, doc) × (k + 1) / (Freq(term, doc) + k × (1 − b + b × |doc| / avgdoclen))
BM25(term, doc) = BM25:TF(term, doc) × BM25:IDF(term)
BM25(query, doc) = Σ_{term ∈ query} BM25(term, doc)
7
k, b are parameters. Unlike TF-IDF, term frequency in BM25 saturates and penalizes longer documents!
Robertson, Stephen, and Hugo Zaragoza. The probabilistic relevance
framework: BM25 and beyond. Now Publishers Inc, 2009.
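The BM25 formulas can be sketched the same way. The toy collection is illustrative, and k and b are set here to commonly used default values (they are tunable parameters):

```python
import math
from collections import Counter

# Toy tokenized collection (illustrative): d1 is longer than d2.
collection = {
    "d1": ["neural", "networks", "for", "text", "ranking", "and", "ranking"],
    "d2": ["text", "ranking"],
}
N = len(collection)
avgdoclen = sum(len(terms) for terms in collection.values()) / N

df = Counter()
for terms in collection.values():
    df.update(set(terms))

k, b = 1.5, 0.75  # common defaults; both are tunable

def bm25_idf(term):
    return math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))

def bm25_tf(term, doc_id):
    freq = collection[doc_id].count(term)
    # Length normalization: longer-than-average docs are penalized.
    norm = 1 - b + b * len(collection[doc_id]) / avgdoclen
    # Saturation: as freq grows, this approaches (k + 1).
    return freq * (k + 1) / (freq + k * norm)

def bm25(query_terms, doc_id):
    return sum(bm25_tf(t, doc_id) * bm25_idf(t) for t in query_terms if t in df)
```

In this toy example, d1 contains "ranking" twice but is well above average length, so the shorter d2 actually scores higher: saturation and length normalization at work.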
Efficient Retrieval: Inverted Indexing
■ Raw Collection: Document → Terms
■ Term–document matrix: Term → Documents
– But it’s extremely sparse and thus wastes space!
■ The inverted index is just a sparse encoding of this matrix
– Mapping each unique term 𝑡 in the collection to a posting list
– The posting list enumerates the non-zero <Freq, DocID> pairs for 𝑡
8
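A minimal sketch of building such an inverted index from a tokenized collection (function and variable names are illustrative):

```python
from collections import defaultdict

def build_inverted_index(collection):
    """Map each unique term to its posting list.

    `collection`: dict of doc_id -> list of tokens.
    Each posting list holds (doc_id, frequency) pairs; documents where
    the term's frequency is zero are simply absent, which is what makes
    the encoding sparse."""
    index = defaultdict(list)
    for doc_id, terms in collection.items():
        counts = {}
        for t in terms:
            counts[t] = counts.get(t, 0) + 1
        for term, freq in counts.items():
            index[term].append((doc_id, freq))
    return index
```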
Beyond term matching in classical IR…
■ Query and Document expansion
■ Term dependence and phrase search
■ Learning to Rank with various features:
– Different document fields (e.g., title, body, anchor text)
– Link Analysis (e.g., PageRank)
Lots of IR exploration into these!
However, BM25 remained a very strong baseline for the best you can
do "ad hoc" until 2019, with the arrival of BERT-based ranking!
9
IR Evaluation
■ A search system must be efficient and effective
– If we had infinite resources, we'd just hire experts to look through all the documents one by one!
■ Efficiency
– Latency (milliseconds; for one query)
– Throughput (queries/sec)
– Space (GBs for the index? TBs?)
– Hardware required (one CPU core? Many cores? GPUs?)
– Scaling to various collection sizes, under different loads
10
IR Effectiveness
■ Do our top-k rankings fulfill users’ information needs?
– Often harder to evaluate than classification/regression!
■ If you have lots of users, you can run online experiments…
■ But we’re typically interested in reusable test collections
11
Test Collections
■ Document Collection (or “Corpus”)
■ Test Queries (or “Topics”)
– Could also include a train/dev split, if resources allow!
– Or, in some cases, cross-validation could be used.
■ Query–Document Relevance Assessments
– Is document 𝑗 relevant to query 𝑖?
■ Binary judgments: relevant (1) vs. non-relevant (0)
■ Graded judgments: {-1, 0, 1, 2} (e.g., junk, irrelevant, relevant, key)
12
We typically have to make the (significant!) assumption that unjudged
documents are irrelevant. Some test collections label only a few
positives per query.
Test Collections: TREC
■ The Text REtrieval Conference (TREC) includes numerous annual tracks for comparing IR systems.
■ The 2021 iteration has tracks for Conversational Assistance, Health Misinformation, Fair Ranking, and "Deep Learning".
■ TREC tends to emphasize careful evaluation with a very small set of queries (e.g., 50 queries, each with >100 annotated documents)
– Having only a few test queries does not imply few documents!
13
Test Collections: MS MARCO Ranking Tasks
■ MS MARCO Ranking is the largest public IR benchmark
– adapted from a Question Answering dataset
– consists of more than 500k Bing search queries
■ Sparse labels: approx. one relevance label per query!
■ Fantastic for training IR models!
■ MS MARCO Passage Ranking (9M short passages; sparse labels)
■ MS MARCO Document Ranking (3M long documents; sparse labels)
■ TREC DL’19 and DL’20 (short & long; dense labels for a few queries)
14
Test Collections: Other Benchmarks
■ Lots of small or domain-specific benchmarks!
■ BEIR is a recent effort to use those for testing models in “zero-shot” scenarios
15
Thakur, Nandan, et al. "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models." arXiv:2104.08663 (2021)
We will also see later that OpenQA benchmarks can serve as large IR benchmarks too!
IR Effectiveness Metrics
■ We’ll use “metric”@K, often with K in {5, 10, 100, 1000}.
– Selection of the metric (and the cutoff K) depends on the task.
■ For all metrics here, we’ll [macro-]average across all queries.
– All queries will be assigned equal weight, for our purposes.
16
IR Effectiveness Metrics: Success & MRR
■ Let 𝑟𝑎𝑛𝑘 ∈ {1, 2, 3, … } be the position of the first relevant document
■ Success@K = 1 if rank ≤ K, 0 otherwise
■ ReciprocalRank@K = 1/rank if rank ≤ K, 0 otherwise
– This is MRR (M for “mean”), but we drop the M since we’re looking at only one query
17
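These two single-query metrics are tiny to implement. A sketch, where `rank` is assumed 1-based and `None` means no relevant document was retrieved:

```python
def success_at_k(rank, k):
    """1 if the first relevant document appears within the top K."""
    return 1.0 if rank is not None and rank <= k else 0.0

def reciprocal_rank_at_k(rank, k):
    """1/rank of the first relevant document, if it is within the top K."""
    return 1.0 / rank if rank is not None and rank <= k else 0.0
```

Macro-averaging `reciprocal_rank_at_k` over all queries gives MRR@K.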
IR Effectiveness Metrics: Precision & Recall
■ Let 𝑅𝑒𝑡(𝐾) be the top-K retrieved documents
■ Let 𝑅𝑒𝑙 be the set of all documents judged as relevant
■ Precision@K = |Ret(K) ∩ Rel| / K
■ Recall@K = |Ret(K) ∩ Rel| / |Rel|
18
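A sketch of both metrics, assuming a ranked list of doc IDs and a set of judged-relevant doc IDs:

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-K retrieved documents that are relevant.

    `ranking`: ordered list of doc IDs; `relevant`: set of relevant doc IDs."""
    retrieved = set(ranking[:k])
    return len(retrieved & relevant) / k

def recall_at_k(ranking, relevant, k):
    """Fraction of all relevant documents found in the top K."""
    retrieved = set(ranking[:k])
    return len(retrieved & relevant) / len(relevant)
```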
IR Effectiveness Metrics: MAP
■ (M)AP = (Mean) Average Precision
■ Let rank_1, rank_2, …, rank_|Rel| be the positions of all relevant documents
– Compute precision@i at each of those positions—and average!
■ Equivalently, AveragePrecision@K = ( Σ_{i=1}^{K} [ Precision@i if the i-th document is relevant, 0 otherwise ] ) / |Rel|
19
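A sketch of AveragePrecision@K following the definition above, with the same inputs as before (a ranked list of doc IDs and a set of relevant doc IDs):

```python
def average_precision_at_k(ranking, relevant, k):
    """Average of Precision@i at each relevant position in the top K,
    normalized by the total number of relevant documents."""
    hits, total = 0, 0.0
    for i, doc in enumerate(ranking[:k], start=1):
        if doc in relevant:
            hits += 1
            total += hits / i  # this is exactly Precision@i here
    return total / len(relevant)
```

Macro-averaging this value over all queries gives MAP@K.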
IR Effectiveness Metrics: DCG
■ Discounted Cumulative Gain
– Not inherently normalized, so we also consider Normalized DCG
DCG@K = Σ_{i=1}^{K} graded_relevance(i-th document) / log2(i + 1)
NDCG@K = DCG@K / ideal DCG@K
20
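A sketch of DCG@K and NDCG@K; for simplicity, the "ideal" DCG here reorders the gains of the same retrieved list, whereas a full implementation would sort the gains of all judged documents for the query:

```python
import math

def dcg_at_k(gains, k):
    """`gains`: graded relevance of each ranked document, in rank order."""
    return sum(g / math.log2(i + 1)
               for i, g in enumerate(gains[:k], start=1))

def ndcg_at_k(gains, k):
    # Simplification: the ideal ranking reorders these same gains;
    # a full implementation sorts the gains of ALL judged documents.
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0
```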
Next…
■ Neural IR.
21
References
22
Manning, Christopher, Prabhakar Raghavan, and Hinrich Schütze. "Introduction to Information Retrieval." (2008).
Manning, Christopher, and Pandu Nayak (2019). CS276 Information Retrieval and Web Search: Evaluation [Class handout]. Retrieved from http://web.stanford.edu/class/cs276/19handouts/lecture8-evaluation-6per.pdf
Hofstätter, Sebastian. Advanced Information Retrieval: {IR Fundamentals, Evaluation, Test Collections} [Class handout]. Retrieved from https://github.com/sebastian-hofstaetter/teaching
Robertson, Stephen, and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond. Now Publishers Inc, 2009.
Nguyen, Tri, et al. "MS MARCO: A human generated machine reading comprehension dataset." CoCo@NIPS. 2016.
Craswell, Nick, et al. "TREC Deep Learning Track: Reusable Test Collections in the Large Data Regime." arXiv preprint arXiv:2104.09399 (2021).
Thakur, Nandan, et al. "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models." arXiv:2104.08663 (2021).