TRANSCRIPT
NLU & IR: Classical IR
Omar Khattab
CS224U: Natural Language Understanding
Spring 2021
1
Ranked Retrieval
■ Scope: A large corpus of text documents (e.g., Wikipedia)
■ Input: A textual query (e.g., a natural-language question)
■ Output: Top-K Ranking of relevant documents (e.g., top-100)
2
How do we conduct ranked retrieval?
■ We’ve touched on one way before: the Term–Document Matrix
■ With good weights, this allows us to answer single-term queries!
3
How do we conduct ranked retrieval?
■ For multi-term queries, classical IR models would tokenize and then treat the tokens independently.
RelevanceScore(query, doc) = Σ_{term ∈ query} Weight(doc, term)
■ This reduces a large fraction of classical IR to:
– How do we best tokenize (and stem) queries and documents?
– How do we best weight each term–document pair?
4
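As a sketch, the scoring model above can be written in a few lines of Python. The tokenizer and the weight function here are illustrative stand-ins, not from the lecture; any term–document weighting scheme (such as the TF-IDF or BM25 formulas on the next slides) can be plugged in as `weight`:

```python
def tokenize(text):
    """Naive lowercase/whitespace tokenizer; real systems also stem."""
    return text.lower().split()

def relevance_score(query, doc_id, weight):
    """Score a multi-term query by summing per-term document weights.

    `weight(doc_id, term)` is any term--document weighting function."""
    return sum(weight(doc_id, term) for term in tokenize(query))
```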
Term–Document Weighting: Intuitions
■ Frequency of occurrence will remain a primary factor
– If a term 𝑡 occurs frequently in document 𝑑, the document is more likely to be relevant for queries including 𝑡
■ Normalization will remain a primary component too
– If that term 𝑡 is rather rare, then document 𝑑 is even more likely to be relevant for queries including 𝑡
– If that document 𝑑 is rather short, this also improves its odds
5
Term–Document Weighting: TF-IDF
■ Let N = |Collection| and df(term) = |{doc ∈ Collection : term ∈ doc}|

TF(term, doc) = log(1 + Freq(term, doc))
IDF(term) = log(N / df(term))
TF-IDF(term, doc) = TF(term, doc) × IDF(term)
TF-IDF(query, doc) = Σ_{term ∈ query} TF-IDF(term, doc)
6
TF and IDF both grow sub-linearly with frequency and 1/df (in particular, logarithmically).
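A minimal sketch of these TF-IDF formulas over a toy tokenized collection. The collection and variable names are illustrative; a real system would read frequencies from an inverted index rather than scanning documents:

```python
import math
from collections import Counter

# Toy tokenized collection: doc_id -> list of tokens (illustrative).
collection = {
    "d1": ["deep", "learning", "for", "search"],
    "d2": ["classical", "search", "search", "engines"],
}
N = len(collection)

# Document frequency: in how many documents does each term occur?
df = Counter()
for terms in collection.values():
    df.update(set(terms))

def tf(term, doc_id):
    # Sub-linear (logarithmic) growth with raw frequency.
    return math.log(1 + collection[doc_id].count(term))

def idf(term):
    # Rare terms get higher weight; a term in every doc gets 0.
    return math.log(N / df[term])

def tf_idf(query_terms, doc_id):
    return sum(tf(t, doc_id) * idf(t) for t in query_terms if t in df)
```

Note that "search" appears in every document here, so its IDF is log(N/N) = 0 and it contributes nothing to the score.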
Term–Document Weighting: BM25
Or "Finding the best match, seriously this time! Attempt #25" :-)
BM25:IDF(term) = log(1 + (N − df(term) + 0.5) / (df(term) + 0.5))
BM25:TF(term, doc) = Freq(term, doc) × (k + 1) / (Freq(term, doc) + k × (1 − b + b × |doc| / avgdoclen))
BM25(term, doc) = BM25:TF(term, doc) × BM25:IDF(term)
BM25(query, doc) = Σ_{term ∈ query} BM25(term, doc)
7
k, b are parameters. Unlike TF-IDF, term frequency in BM25 saturates and penalizes longer documents!
Robertson, Stephen, and Hugo Zaragoza. The probabilistic relevance
framework: BM25 and beyond. Now Publishers Inc, 2009.
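The BM25 formulas can be sketched the same way. The toy collection is illustrative, and k and b are set here to commonly used default values (they are tunable parameters):

```python
import math
from collections import Counter

# Toy tokenized collection (illustrative): d1 is longer than d2.
collection = {
    "d1": ["neural", "networks", "for", "text", "ranking", "and", "ranking"],
    "d2": ["text", "ranking"],
}
N = len(collection)
avgdoclen = sum(len(terms) for terms in collection.values()) / N

df = Counter()
for terms in collection.values():
    df.update(set(terms))

k, b = 1.5, 0.75  # common defaults; both are tunable

def bm25_idf(term):
    return math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))

def bm25_tf(term, doc_id):
    freq = collection[doc_id].count(term)
    # Length normalization: longer-than-average docs are penalized.
    norm = 1 - b + b * len(collection[doc_id]) / avgdoclen
    # Saturation: as freq grows, this approaches (k + 1).
    return freq * (k + 1) / (freq + k * norm)

def bm25(query_terms, doc_id):
    return sum(bm25_tf(t, doc_id) * bm25_idf(t) for t in query_terms if t in df)
```

In this toy example, d1 contains "ranking" twice but is well above average length, so the shorter d2 actually scores higher: saturation and length normalization at work.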
Efficient Retrieval: Inverted Indexing
■ Raw Collection: Document → Terms
■ Term–document matrix: Term → Documents
– But it’s extremely sparse and thus wastes space!
■ The inverted index is just a sparse encoding of this matrix
– Mapping each unique term 𝑡 in the collection to a posting list
– The posting list enumerates the non-zero <Freq, DocID> pairs for 𝑡
8
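A minimal sketch of building such an inverted index from a tokenized collection (function and variable names are illustrative):

```python
from collections import defaultdict

def build_inverted_index(collection):
    """Map each unique term to its posting list.

    `collection`: dict of doc_id -> list of tokens.
    Each posting list holds (doc_id, frequency) pairs; documents where
    the term's frequency is zero are simply absent, which is what makes
    the encoding sparse."""
    index = defaultdict(list)
    for doc_id, terms in collection.items():
        counts = {}
        for t in terms:
            counts[t] = counts.get(t, 0) + 1
        for term, freq in counts.items():
            index[term].append((doc_id, freq))
    return index
```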
Beyond term matching in classical IR…
■ Query and Document expansion
■ Term dependence and phrase search
■ Learning to Rank with various features:
– Different document fields (e.g., title, body, anchor text)
– Link Analysis (e.g., PageRank)
Lots of IR exploration into these!
However, BM25 remained a very strong baseline for the best you can
do "ad hoc" until 2019, with the arrival of BERT-based ranking!
9
IR Evaluation
■ A search system must be efficient and effective
– If we had infinite resources, we'd just hire experts to look through all the documents one by one!
■ Efficiency
– Latency (milliseconds; for one query)
– Throughput (queries/sec)
– Space (GBs for the index? TBs?)
– Hardware required (one CPU core? Many cores? GPUs?)
– Scaling to various collection sizes, under different loads
10
IR Effectiveness
■ Do our top-k rankings fulfill users’ information needs?
– Often harder to evaluate than classification/regression!
■ If you have lots of users, you can run online experiments…
■ But we’re typically interested in reusable test collections
11
Test Collections
■ Document Collection (or “Corpus”)
■ Test Queries (or “Topics”)
– Could also include a train/dev split, if resources allow!
– Or, in some cases, cross-validation could be used.
■ Query–Document Relevance Assessments
– Is document 𝑗 relevant to query 𝑖?
■ Binary judgments: relevant (1) vs. non-relevant (0)
■ Graded judgments: {-1, 0, 1, 2} (e.g., junk, irrelevant, relevant, key)
12
We typically have to make the (significant!) assumption that unjudged
documents are irrelevant. Some test collections label only a few
positives per query.
Test Collections: TREC
■ The Text REtrieval Conference (TREC) includes numerous annual tracks for comparing IR systems.
■ The 2021 iteration has tracks for Conversational Assistance, Health Misinformation, Fair Ranking, and "Deep Learning".
■ TREC tends to emphasize careful evaluation with a very small set of queries (e.g., 50 queries, each with >100 annotated documents)
– Having only a few test queries does not imply few documents!
13
Test Collections: MS MARCO Ranking Tasks
■ MS MARCO Ranking is the largest public IR benchmark
– adapted from a Question Answering dataset
– consists of more than 500k Bing search queries
■ Sparse labels: approx. one relevance label per query!
■ Fantastic for training IR models!
■ MS MARCO Passage Ranking (9M short passages; sparse labels)
■ MS MARCO Document Ranking (3M long documents; sparse labels)
■ TREC DL’19 and DL’20 (short & long; dense labels for a few queries)
14
Test Collections: Other Benchmarks
■ Lots of small or domain-specific benchmarks!
■ BEIR is a recent effort to use those for testing models in “zero-shot” scenarios
15
Thakur, Nandan, et al. "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models." arXiv:2104.08663 (2021)
We will also see later that OpenQA benchmarks can serve as large IR benchmarks too!
IR Effectiveness Metrics
■ We’ll use “metric”@K, often with K in {5, 10, 100, 1000}.
– Selection of the metric (and the cutoff K) depends on the task.
■ For all metrics here, we’ll [macro-]average across all queries.
– All queries will be assigned equal weight, for our purposes.
16
IR Effectiveness Metrics: Success & MRR
■ Let 𝑟𝑎𝑛𝑘 ∈ {1, 2, 3, … } be the position of the first relevant document
■ Success@K = 1 if rank ≤ K, 0 otherwise
■ ReciprocalRank@K = 1/rank if rank ≤ K, 0 otherwise
– This is MRR (M for “mean”), but we drop the M since we’re looking at only one query
17
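These two single-query metrics are tiny to implement. A sketch, where `rank` is assumed 1-based and `None` means no relevant document was retrieved:

```python
def success_at_k(rank, k):
    """1 if the first relevant document appears within the top K."""
    return 1.0 if rank is not None and rank <= k else 0.0

def reciprocal_rank_at_k(rank, k):
    """1/rank of the first relevant document, if it is within the top K."""
    return 1.0 / rank if rank is not None and rank <= k else 0.0
```

Macro-averaging `reciprocal_rank_at_k` over all queries gives MRR@K.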
IR Effectiveness Metrics: Precision & Recall
■ Let 𝑅𝑒𝑡(𝐾) be the top-K retrieved documents
■ Let 𝑅𝑒𝑙 be the set of all documents judged as relevant
■ Precision@K = |Ret(K) ∩ Rel| / K
■ Recall@K = |Ret(K) ∩ Rel| / |Rel|
18
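A sketch of both metrics, assuming a ranked list of doc IDs and a set of judged-relevant doc IDs:

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-K retrieved documents that are relevant.

    `ranking`: ordered list of doc IDs; `relevant`: set of relevant doc IDs."""
    retrieved = set(ranking[:k])
    return len(retrieved & relevant) / k

def recall_at_k(ranking, relevant, k):
    """Fraction of all relevant documents found in the top K."""
    retrieved = set(ranking[:k])
    return len(retrieved & relevant) / len(relevant)
```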
IR Effectiveness Metrics: MAP
■ (M)AP = (Mean) Average Precision
■ Let rank_1, rank_2, …, rank_|Rel| be the positions of all relevant documents
– Compute precision@i at each of those positions—and average!
■ Equivalently, AveragePrecision@K = ( Σ_{i=1}^{K} [ Precision@i if the i-th document is relevant, 0 otherwise ] ) / |Rel|
19
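A sketch of AveragePrecision@K following the definition above, with the same inputs as before (a ranked list of doc IDs and a set of relevant doc IDs):

```python
def average_precision_at_k(ranking, relevant, k):
    """Average of Precision@i at each relevant position in the top K,
    normalized by the total number of relevant documents."""
    hits, total = 0, 0.0
    for i, doc in enumerate(ranking[:k], start=1):
        if doc in relevant:
            hits += 1
            total += hits / i  # this is exactly Precision@i here
    return total / len(relevant)
```

Macro-averaging this value over all queries gives MAP@K.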
IR Effectiveness Metrics: DCG
■ Discounted Cumulative Gain
– Not inherently normalized, so we also consider Normalized DCG
DCG@K = Σ_{i=1}^{K} graded_relevance(i-th document) / log2(i + 1)
NDCG@K = DCG@K / ideal DCG@K
20
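A sketch of DCG@K and NDCG@K; for simplicity, the "ideal" DCG here reorders the gains of the same retrieved list, whereas a full implementation would sort the gains of all judged documents for the query:

```python
import math

def dcg_at_k(gains, k):
    """`gains`: graded relevance of each ranked document, in rank order."""
    return sum(g / math.log2(i + 1)
               for i, g in enumerate(gains[:k], start=1))

def ndcg_at_k(gains, k):
    # Simplification: the ideal ranking reorders these same gains;
    # a full implementation sorts the gains of ALL judged documents.
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0
```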
Next…
■ Neural IR.
21
References
22
Manning, Christopher, Prabhakar Raghavan, and Hinrich Schütze. "Introduction to Information Retrieval." (2008).
Manning, Christopher, and Pandu Nayak (2019). CS276 Information Retrieval and Web Search: Evaluation [Class handout]. Retrieved from http://web.stanford.edu/class/cs276/19handouts/lecture8-evaluation-6per.pdf
Hofstätter, Sebastian. Advanced Information Retrieval: {IR Fundamentals, Evaluation, Test Collections} [Class handout]. Retrieved from https://github.com/sebastian-hofstaetter/teaching
Robertson, Stephen, and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond. Now Publishers Inc, 2009.
Nguyen, Tri, et al. "MS MARCO: A human generated machine reading comprehension dataset." CoCo@NIPS. 2016.
Craswell, Nick, et al. "TREC Deep Learning Track: Reusable Test Collections in the Large Data Regime." arXiv preprint arXiv:2104.09399 (2021).
Thakur, Nandan, et al. "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models." arXiv:2104.08663 (2021).