Information retrieval I
Introduction, efficient indexing, querying
Clovis Galiez
Ensimag ISI
December 7, 2020
C. Galiez (LJK-SVH) Information retrieval I December 7, 2020 1 / 68
Objectives of the course
Acquire a culture in information retrieval
Master the basic concepts needed to understand:
  what is at stake in novel IR methods
  what the technical limits are
This will give you the basic tools to analyze current limitations or gaps, and to imagine novel solutions.
Organization of the lectures
No lecture handbook, only the slides and the materials of the practical sessions.
So... take notes and ask questions!
No heavy theory, but some fundamentals are required: linear algebra, pattern matching, complexity, etc.
Evaluation: exam and optional project (bonus)
Outline of the lectures
Indexing, basic querying, vector-space model
Latent semantics, embeddings
  Hands-on session: programming a search engine (Python)
Ranking
  Hands-on session, part II
Evaluation of IR methods
  Hands-on session, part III
Today’s outline
What is information retrieval (in general)?
Querying (correctness) and ranking (relevance)
IR in the context of the web
  Elements of web protocols and languages
  Gathering data on the web: crawling
  What data size is at stake?
How to represent the information?
  Indexing
  Sparse representations
  Reverse indexing
Flexible representation: vector model?
  tf-idf
  latent semantics
Practicals: recent patent analysis
What is information retrieval (IR)?
Definition
Answering a query by extracting relevant information from a collection of documents.
Typical example
Qwant.
Some open-source tools for in-house IR
IR tools:
NLP tools:
NLTK (Python)
spaCy
(far from exhaustive!)
What are “documents” and “information”?
Information retrieval
Answering a query by extracting relevant information from a collection of documents.
Here, documents are web pages, images, PDFs, etc.
How would you define information in the context of information retrieval?
Information
Subset of documents relevant to a query.
How could you qualify or measure information, e.g. relevance?
Correctness, relevance and truth
When was the last US presidential election?
Correct or incorrect
- Blue
- 42:17
True, false or...
- 1st Sept. 2018
- Nov. 2020
Relevance
- Same time as the previous ones, but 4 years later
- during the 21st century
- 1604361600 since the Unix epoch (a)
(a) Number of seconds elapsed since 1st of January 1970, 00:00 UTC
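The last answer is true but hardly relevant for a human reader. As a quick check, the timestamp can be decoded with Python's standard library (a minimal sketch; the variable names are illustrative):

```python
from datetime import datetime, timezone

# Timestamp quoted on the slide, in seconds since 1970-01-01 00:00:00 UTC.
timestamp = 1604361600

# Interpret the timestamp in UTC so the result does not depend on the local time zone.
election_day = datetime.fromtimestamp(timestamp, tz=timezone.utc)

print(election_day.date().isoformat())  # prints 2020-11-03
```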
Querying and ranking: a two-stage procedure
Collection of documents
↓
Query −→ Querying system
↓
Correct answers
↓
Ranking of answers (relevance)
Querying systems deal with correctness
Filter the documents that correctly answer a given query:
Boolean queries
  Check whether a word is present or not in a document
Vector-based models
Trained models (“AI”)
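As an illustration of the Boolean approach, here is a minimal sketch on a made-up toy collection. The documents, the whitespace tokenizer and the function names are assumptions for the example, not part of the course material:

```python
# Toy collection: document id -> text (made-up examples).
documents = {
    "d1": "results of the last presidential elections",
    "d2": "the president visited the united states",
    "d3": "weather forecast for the united kingdom",
}

def words(text):
    """Split a text into a set of lowercase words (naive whitespace tokenizer)."""
    return set(text.lower().split())

def boolean_and(query, docs):
    """Return the ids of the documents containing every word of the query."""
    query_words = words(query)
    return sorted(doc_id for doc_id, text in docs.items()
                  if query_words <= words(text))

def boolean_or(query, docs):
    """Return the ids of the documents containing at least one query word."""
    query_words = words(query)
    return sorted(doc_id for doc_id, text in docs.items()
                  if query_words & words(text))

print(boolean_and("united states", documents))        # ['d2']
print(boolean_or("president elections", documents))   # ['d1', 'd2']
```

Note that "d1" contains "presidential" but not "president": exact word matching says nothing about related word forms, and it does not rank the answers either.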
Ranking systems deal with relevance
Ranking methods:
Content-based algorithms
  Vector model
Structure-based
  PageRank
Supervised ranking (“AI”)
  Neural nets
What are the pitfalls?
Exercise
Take a few minutes to list what could be the different pitfalls for querying and ranking systems.
Complexity of natural language
Ambiguity of natural language
Size of the data
...
IR specific to the World Wide Web
IR and the web
Two specificities:
Building the collection of documents
  Crawling the web
  Indexing documents
Ranking the documents (next lecture)
Gathering data on the web (crawling)
The web structure: a huge graph
Initiated in the 1970s with ARPANET. In 2017, more than 8 billion “nodes”.
The web protocols
The textual web uses HTTP (1) over the TCP/IP protocol (2).
(1) T. Berners-Lee, 1990, at CERN
(2) Cerf and Kahn, 1974
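To make the layering concrete, here is a minimal sketch of an HTTP/1.1 GET request sent as plain text over a TCP connection. The function names are hypothetical, and `fetch` is only defined, not called, so the example runs without network access:

```python
import socket

def build_get_request(host, path="/"):
    """Build the bytes of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",
        "",  # the empty line marks the end of the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")

def parse_status_line(raw_response):
    """Extract (version, status code, reason) from the first response line."""
    status_line = raw_response.split(b"\r\n", 1)[0].decode("ascii")
    version, code, reason = status_line.split(" ", 2)
    return version, int(code), reason

def fetch(host, path="/", port=80, timeout=5.0):
    """Send the request over a TCP connection and return the raw response."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(build_get_request(host, path))
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:  # the server closed the connection
                break
            chunks.append(chunk)
    return b"".join(chunks)

# Offline illustration with a hand-written response.
sample = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
print(build_get_request("example.org", "/index.html"))
print(parse_status_line(sample))  # ('HTTP/1.1', 200, 'OK')
```

In practice a crawler would rely on an HTTP library, which also handles HTTPS, redirects and encodings.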
Cutting-edge technology: the Internet
The web structure: languages
HTML (HyperText Markup Language) is the main language for describing a web page. From https://en.wikipedia.org/wiki/Information_retrieval:
HTML source code behind:
... obtaining <a href="/wiki/Information_system" title="Information system">
information system</a> resources relevant to an ...
How would you collect information from the web?
The web structure: crawling
Hopping from link to link, one can collect/process data on the web:
Web crawling
JumpStation (1993)
Must keep track of already visited pages (e.g. trie, hashtable).
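The idea can be sketched as a breadth-first traversal with a hash table of visited pages. The "web" below is a hypothetical in-memory graph, so no network access is needed; a real crawler would download each page and parse its links instead:

```python
from collections import deque

# Hypothetical toy "web": page URL -> URLs it links to.
toy_web = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["a.html", "d.html"],
    "c.html": ["d.html"],
    "d.html": ["a.html"],
    "e.html": ["a.html"],  # no page links to it: unreachable from a.html
}

def crawl(start, get_links):
    """Breadth-first crawl from `start`; returns the pages in visiting order."""
    visited = {start}          # hash table of already seen pages
    frontier = deque([start])  # pages discovered but not yet processed
    order = []
    while frontier:
        page = frontier.popleft()
        order.append(page)
        for link in get_links(page):
            if link not in visited:  # avoid loops and repeated downloads
                visited.add(link)
                frontier.append(link)
    return order

print(crawl("a.html", lambda page: toy_web.get(page, [])))
# ['a.html', 'b.html', 'c.html', 'd.html']
```

Page "e.html" is never reached: a crawler only sees what is linked from its starting points.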
Parsing links from a web page
With regular expressions (regex) for instance.
Regex         Match examples
a*            aaa, a
M.x           Max, Mix
M.*x          Max, Matrix
M[^a]*[xn]    Mix, Moon
Regex is a simple pattern-matching formalism.
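These examples can be checked with Python's `re` module. A minimal sketch, where the `examples` list simply restates the patterns and strings given above:

```python
import re

# (pattern, candidate strings) pairs taken from the examples above.
examples = [
    (r"a*", ["aaa", "a"]),
    (r"M.x", ["Max", "Mix"]),
    (r"M.*x", ["Max", "Matrix"]),
    (r"M[^a]*[xn]", ["Mix", "Moon"]),
]

for pattern, candidates in examples:
    for text in candidates:
        # fullmatch succeeds only if the whole string matches the pattern.
        assert re.fullmatch(pattern, text) is not None

# Counter-example: "Max" contains an "a", which [^a]* refuses.
print(re.fullmatch(r"M[^a]*[xn]", "Max"))  # None
```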
Regex: groups
Regex     Data      1st group
M(.)x     Max       a
M(.)x     Mix       i
M(.*)x    Matrix    atri
Exercise
Find a regex that extracts the URL of an HTML link.
<a href="http://www.wikipedia.org">The linked text</a>
Extend your regex to extract both the URL and the linked text.
Quality of the data: HTML errors make parsing difficult → use libraries!
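One possible answer to the exercise, followed by a library-based alternative, is sketched below. The regex assumes a lowercase tag with a double-quoted `href` as first attribute, which real pages do not guarantee. The `LinkCollector` class is a hypothetical helper built on the standard-library `html.parser` module:

```python
import re
from html.parser import HTMLParser

html = '<a href="http://www.wikipedia.org">The linked text</a>'

# Group 1 captures the URL, group 2 the linked text.
link_regex = re.compile(r'<a\s+href="([^"]*)"[^>]*>(.*?)</a>')
match = link_regex.search(html)
print(match.group(1), "|", match.group(2))

class LinkCollector(HTMLParser):
    """Collect (url, text) pairs with the standard-library HTML parser."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (url, text) pairs
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None

# Uppercase tag and unquoted attribute: the regex above misses this link.
messy = "<A HREF=http://example.org>messy link</A>"
collector = LinkCollector()
collector.feed(messy + html)
print(collector.links)
```

The parser normalizes tag and attribute names to lowercase and tolerates unquoted values, which is why it finds both links while the regex only finds the well-formed one.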
Size of the crawled data
From technical solution to practice... Quiz:
Number of pages indexed by Google: ∼ $10^{11}$
Search index size of Google: > $10^{8}$ GB
Source (2020): google.com/search/howsearchworks/crawling-indexing
Between 0.2% and 4% of the web is accessible by crawling (3). What is “uncrawlable” is called the deep web.
(3) [Pandya et al. IJIRST 2017]
Big data: algorithms matter!
Electric energy consumption in France per year per person: 0.067 GWh
Electric energy consumption of Google per year: 2,500 GWh (4)
With these figures, this is equivalent to the consumption of about 40,000 people (2,500 / 0.067 ≈ 37,000).
(4) According to Google. Other sources say 4× more.
Structure of the crawled data: beware of algorithmic impacts!
$N_k$: number of pages with more than $k$ incoming links. $N_{k/2} \approx 2^{\gamma} N_k$ (Zipf’s law)
$N_L$: number of pages of length greater than $L$. $N_{L/2} \approx 2^{\gamma} N_L$ (Zipf’s law)
6 degrees of separation law.
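The halving relation is exactly what a power law predicts. As a short worked step, assume counts of the form $N_k = C\,k^{-\gamma}$, where the constant $C$ is introduced here only for illustration:

```latex
N_k = C\,k^{-\gamma}
\quad\Longrightarrow\quad
N_{k/2} = C\left(\frac{k}{2}\right)^{-\gamma}
        = 2^{\gamma}\,C\,k^{-\gamma}
        = 2^{\gamma}\,N_k .
```

The same computation applies to $N_L$. On a log-log plot, such a law appears as a straight line of slope $-\gamma$.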
Experimental evidence of Zipf's law
[Adamic et al., Glottometrics, 2002]
C. Galiez (LJK-SVH) Information retrieval I December 7, 2020 27 / 68
![Page 53: Information retrieval I · Information retrieval I Introduction, e cient indexing, querying Clovis Galiez Mast ere Big Data December 3, 2019 C. Galiez (LJK-SVH) Information retrieval](https://reader034.vdocuments.mx/reader034/viewer/2022051815/603f19bdfc994e771f59618a/html5/thumbnails/53.jpg)
From gathering to representation
Collection of documents
↓
Query → Querying system
↓
Correct answers
↓
Ranking of answers (relevance)
C. Galiez (LJK-SVH) Information retrieval I December 7, 2020 28 / 68
![Page 54: Information retrieval I · Information retrieval I Introduction, e cient indexing, querying Clovis Galiez Mast ere Big Data December 3, 2019 C. Galiez (LJK-SVH) Information retrieval](https://reader034.vdocuments.mx/reader034/viewer/2022051815/603f19bdfc994e771f59618a/html5/thumbnails/54.jpg)
Representations of a web document
C. Galiez (LJK-SVH) Information retrieval I December 7, 2020 29 / 68
![Page 55: Information retrieval I · Information retrieval I Introduction, e cient indexing, querying Clovis Galiez Mast ere Big Data December 3, 2019 C. Galiez (LJK-SVH) Information retrieval](https://reader034.vdocuments.mx/reader034/viewer/2022051815/603f19bdfc994e771f59618a/html5/thumbnails/55.jpg)
How to query for correct documents?
Exercise
Take a few minutes to think about how you would retrieve the web documents corresponding to the query:
result last elections president united states
You may have encountered the following issues:
how to correctly match words in the document (tokenization)
how to match equivalent words (e.g. plurals)
how to implement it
C. Galiez (LJK-SVH) Information retrieval I December 7, 2020 30 / 68
![Page 60: Information retrieval I · Information retrieval I Introduction, e cient indexing, querying Clovis Galiez Mast ere Big Data December 3, 2019 C. Galiez (LJK-SVH) Information retrieval](https://reader034.vdocuments.mx/reader034/viewer/2022051815/603f19bdfc994e771f59618a/html5/thumbnails/60.jpg)
Tokenization
Process of chopping the text of a document into atomic elements:
Brian is in the kitchen → Brian | is | in | the | kitchen
United States president → United States | president
Usually, tokenizers remove the punctuation.
May be difficult: United States ≠ United + States!
Data mining approaches help to extract the right tokens: if two words are seen one after the other significantly often, they may be considered a single token.
Some languages are agglutinative (e.g. Turkish).
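The data mining idea above can be sketched as follows. This is only an illustration: the significance score (pointwise mutual information), the thresholds `min_count` and `min_pmi`, the function name `find_collocations` and the toy corpus are all assumptions, not something prescribed by the lecture.

```python
from collections import Counter
from math import log

def find_collocations(sentences, min_count=2, min_pmi=1.5):
    """Return adjacent word pairs that co-occur more often than chance."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    total = sum(unigrams.values())
    collocations = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # too rare to say anything
        # Pointwise mutual information: log(observed / expected frequency).
        pmi = log(count * total / (unigrams[w1] * unigrams[w2]))
        if pmi >= min_pmi:
            collocations[(w1, w2)] = pmi
    return collocations

# Toy corpus (hypothetical), just large enough to show the effect.
corpus = [
    "the united states president spoke",
    "the president of the united states",
    "the states are united in the vote",
    "the vote in the kitchen",
]
pairs = find_collocations(corpus)
print(pairs)
```

On this toy corpus, "united states" passes the threshold, whereas frequent but uninformative pairs such as "the united" do not, because "the" appears almost everywhere.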
C. Galiez (LJK-SVH) Information retrieval I December 7, 2020 31 / 68
![Page 61: Information retrieval I · Information retrieval I Introduction, e cient indexing, querying Clovis Galiez Mast ere Big Data December 3, 2019 C. Galiez (LJK-SVH) Information retrieval](https://reader034.vdocuments.mx/reader034/viewer/2022051815/603f19bdfc994e771f59618a/html5/thumbnails/61.jpg)
Python library for NLP: nltk
>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
https://www.nltk.org/
C. Galiez (LJK-SVH) Information retrieval I December 7, 2020 32 / 68
![Page 63: Information retrieval I · Information retrieval I Introduction, e cient indexing, querying Clovis Galiez Mast ere Big Data December 3, 2019 C. Galiez (LJK-SVH) Information retrieval](https://reader034.vdocuments.mx/reader034/viewer/2022051815/603f19bdfc994e771f59618a/html5/thumbnails/63.jpg)
Stemming
Language-specific rules defining equivalent words up to a usual transformation (e.g. -ing, -ed, -s, etc.).
For instance, we would transform:
plural forms: elections → election
substantive/adjectival forms: presidential → president
stop-words removal: in → NULL
Again, difficulties can appear:
ambiguity: police, policy → polic
non-conflation: mother, maternal
C. Galiez (LJK-SVH) Information retrieval I December 7, 2020 33 / 68
![Page 64: Information retrieval I · Information retrieval I Introduction, e cient indexing, querying Clovis Galiez Mast ere Big Data December 3, 2019 C. Galiez (LJK-SVH) Information retrieval](https://reader034.vdocuments.mx/reader034/viewer/2022051815/603f19bdfc994e771f59618a/html5/thumbnails/64.jpg)
Implementation of stemming in Python
>>> from nltk.stem.snowball import SnowballStemmer
>>> stemmer = SnowballStemmer("english")
>>> print(stemmer.stem("running"))
run
https://www.nltk.org/
C. Galiez (LJK-SVH) Information retrieval I December 7, 2020 34 / 68
![Page 65: Information retrieval I · Information retrieval I Introduction, e cient indexing, querying Clovis Galiez Mast ere Big Data December 3, 2019 C. Galiez (LJK-SVH) Information retrieval](https://reader034.vdocuments.mx/reader034/viewer/2022051815/603f19bdfc994e771f59618a/html5/thumbnails/65.jpg)
After tokenizing and stemming: querying
Query: result last elections president united states
Stemmed tokens: result, last, election, president, united states
How to query your documents?
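Before querying, the query has to go through the same preprocessing as the documents. A minimal pure-Python sketch, assuming a hand-written phrase list `PHRASES` and a crude plural-stripping rule `toy_stem` (both hypothetical stand-ins for a mined phrase list and a real stemmer such as the nltk one):

```python
# Hypothetical list of multi-word tokens; in practice it would be mined from data.
PHRASES = {("united", "states")}

def toy_stem(word):
    """Strip a plural -s; a crude stand-in for a real stemmer."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def preprocess(query):
    """Lowercase, split on whitespace, merge known phrases, stem the rest."""
    words = query.lower().split()
    tokens, i = [], 0
    while i < len(words):
        if tuple(words[i:i + 2]) in PHRASES:
            tokens.append(" ".join(words[i:i + 2]))  # keep the phrase as one token
            i += 2
        else:
            tokens.append(toy_stem(words[i]))
            i += 1
    return tokens

tokens = preprocess("result last elections president united states")
print(tokens)
```

Phrases are merged before stemming so that "states" is not reduced to "state" inside "united states".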
C. Galiez (LJK-SVH) Information retrieval I December 7, 2020 35 / 68
![Page 72: Information retrieval I · Information retrieval I Introduction, e cient indexing, querying Clovis Galiez Mast ere Big Data December 3, 2019 C. Galiez (LJK-SVH) Information retrieval](https://reader034.vdocuments.mx/reader034/viewer/2022051815/603f19bdfc994e771f59618a/html5/thumbnails/72.jpg)
Naive representation: vector of tokens
Matrix with the occurrence of tokens in documents.

| | tok 1 | tok 2 | tok 3 | tok 4 | tok 5 | ... |
|---|---|---|---|---|---|---|
| | election | president | crazy | united | United States | ... |
| doc 1 | 1 | 1 | 0 | 0 | 1 | ... |
| doc 2 | 0 | 1 | 1 | 0 | 1 | ... |
| doc 3 | 1 | 1 | 1 | 0 | 1 | ... |
| ... | ... | ... | ... | ... | ... | ... |
| Query | 1 | 1 | 0 | 0 | 1 | ... |

Exercise
Write down a simple algorithm that, given a doc-tok matrix, extracts the documents matching a query.
Can you foresee any practical problem? What is the size of the matrix? Can it fit in memory?
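One possible answer to the first question, as a sketch. It assumes that "matching" means the document contains every token of the query; the function name `match_dense` is made up for the example, and the data is the matrix of the slide.

```python
# Columns: election, president, crazy, united, United States
doc_tok = [
    [1, 1, 0, 0, 1],  # doc 1
    [0, 1, 1, 0, 1],  # doc 2
    [1, 1, 1, 0, 1],  # doc 3
]
query = [1, 1, 0, 0, 1]

def match_dense(matrix, query):
    """Return the 1-based ids of documents containing every query token."""
    wanted = [j for j, bit in enumerate(query) if bit]
    return [i + 1 for i, row in enumerate(matrix)
            if all(row[j] for j in wanted)]

matches = match_dense(doc_tok, query)
print(matches)  # [1, 3]
```

The algorithm itself is trivial; the practical problem is the matrix, which needs one cell per (document, token) pair even though almost all cells are 0.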
C. Galiez (LJK-SVH) Information retrieval I December 7, 2020 36 / 68
![Page 75: Information retrieval I · Information retrieval I Introduction, e cient indexing, querying Clovis Galiez Mast ere Big Data December 3, 2019 C. Galiez (LJK-SVH) Information retrieval](https://reader034.vdocuments.mx/reader034/viewer/2022051815/603f19bdfc994e771f59618a/html5/thumbnails/75.jpg)
Sparse representation
Since documents contain only a small fraction of the existing tokens, most entries of the token vector are null.
We can use a sparse encoding of the same information:

| tok 1 | tok 2 | tok 3 | tok 4 | tok 5 |
|---|---|---|---|---|
| election | president | crazy | united | United States |

doc 1 → tok 1, tok 2, tok 5
doc 2 → tok 2, tok 3, tok 5
doc 3 → tok 1, tok 2, tok 3, tok 5
Exercise
What is the size of the sparse encoding data structure?
Hmmm
Write an algorithm extracting the matching documents from a sparse doc-tok encoding (on-line).
How long would it take to process a query?
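A sketch of such an algorithm, under the same assumption as before (a document matches when it contains every query token). The dictionary layout and the name `match_sparse` are illustrative choices, not the lecture's reference solution.

```python
# Sparse doc-tok encoding from the slide: document id -> ids of its tokens.
sparse_index = {
    1: {1, 2, 5},      # doc 1 -> tok 1, tok 2, tok 5
    2: {2, 3, 5},      # doc 2 -> tok 2, tok 3, tok 5
    3: {1, 2, 3, 5},   # doc 3 -> tok 1, tok 2, tok 3, tok 5
}

def match_sparse(index, query_tokens):
    """Scan every document and keep those containing all query tokens."""
    query_tokens = set(query_tokens)
    return [doc for doc, toks in index.items() if query_tokens <= toks]

matches = match_sparse(sparse_index, [1, 2, 5])
print(matches)  # [1, 3]
```

Memory is now proportional to the number of (document, token) pairs that actually occur, but a query still has to visit every document; with token lists instead of hash sets, each visit would also scan the whole list.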
C. Galiez (LJK-SVH) Information retrieval I December 7, 2020 37 / 68
![Page 82: Information retrieval I · Information retrieval I Introduction, e cient indexing, querying Clovis Galiez Mast ere Big Data December 3, 2019 C. Galiez (LJK-SVH) Information retrieval](https://reader034.vdocuments.mx/reader034/viewer/2022051815/603f19bdfc994e771f59618a/html5/thumbnails/82.jpg)
Elements in complexity
Definition
The complexity of an algorithm measures the amount of memory it needs and the time it takes to run.
It is usually expressed as an order of magnitude in terms of the size of the input data.
Examples: let N be the number of indexed documents, T the total number of distinct tokens, and t the average number of tokens per document⁵.
Memory complexity of the naive indexing algorithm? O(N·T)
Memory complexity of the sparse indexing algorithm? O(N·t)
Time complexity of your search algorithm in a sparse index? O(N·t)
Exercise
Can we do better?
⁵ t can be related to the length L of the document through Heaps' law: $t = K \cdot L^{\beta}$. In practice, K = 50, β = 0.5.
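To get a feeling for these orders of magnitude, here is a small numerical sketch. Only K = 50 and β = 0.5 come from the slide; the values of N, T and L below are hypothetical round numbers chosen for the arithmetic, not measurements.

```python
def heaps_law(length, K=50, beta=0.5):
    """Number of distinct tokens predicted by Heaps' law for a text of given length."""
    return K * length ** beta

# Hypothetical sizes, for illustration only.
N = 10**9      # indexed documents
T = 10**6      # distinct tokens in the whole collection
L = 10_000     # length of a document, in words

t = heaps_law(L)          # 50 * sqrt(10000) = 5000 distinct tokens
dense_cells = N * T       # naive doc-tok matrix: O(N.T)
sparse_cells = N * t      # sparse encoding: O(N.t)
ratio = dense_cells / sparse_cells

print(f"t = {t:.0f}")
print(f"dense:  {dense_cells:.1e} cells")
print(f"sparse: {sparse_cells:.1e} entries")
print(f"the sparse encoding is about {ratio:.0f} times smaller")
```

Under these assumptions the sparse encoding saves a factor T / t in memory, but the O(N·t) search time still grows with the number of documents.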
C. Galiez (LJK-SVH) Information retrieval I December 7, 2020 38 / 68
![Page 84: Information retrieval I · Information retrieval I Introduction, e cient indexing, querying Clovis Galiez Mast ere Big Data December 3, 2019 C. Galiez (LJK-SVH) Information retrieval](https://reader034.vdocuments.mx/reader034/viewer/2022051815/603f19bdfc994e771f59618a/html5/thumbnails/84.jpg)
Complexity matters
| Action | Complexity |
|---|---|
| Sorting | O(N log N) |
| Searching a sorted list | O(log N) |
| Accessing an element of a matrix | O(1) |
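The first two rows can be illustrated with the standard library: sort once, then answer each membership question by binary search. The helper name `contains` and the document ids are made up for the example.

```python
from bisect import bisect_left

def contains(sorted_ids, target):
    """Binary search in a sorted list: O(log N) comparisons."""
    pos = bisect_left(sorted_ids, target)
    return pos < len(sorted_ids) and sorted_ids[pos] == target

doc_ids = sorted([102, 3, 57, 1, 2])  # sorting: O(N log N), done once
found = contains(doc_ids, 57)         # each lookup afterwards: O(log N)
missing = contains(doc_ids, 58)
print(doc_ids, found, missing)
```

Paying the sorting cost once at indexing time is what makes every later query cheap.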
C. Galiez (LJK-SVH) Information retrieval I December 7, 2020 39 / 68
![Page 87: Information retrieval I · Information retrieval I Introduction, e cient indexing, querying Clovis Galiez Mast ere Big Data December 3, 2019 C. Galiez (LJK-SVH) Information retrieval](https://reader034.vdocuments.mx/reader034/viewer/2022051815/603f19bdfc994e771f59618a/html5/thumbnails/87.jpg)
Inverse sparse index
Idea
Inverting the sparse representation and sorting by document makes it possible to reduce the complexity.

| tok 1 | tok 2 | tok 3 | tok 4 | tok 5 |
|---|---|---|---|---|
| election | president | crazy | united | United States |

doc 1 → tok 1, tok 2, tok 5
doc 2 → tok 2, tok 3, tok 5
doc 3 → tok 1, tok 2, tok 3, tok 5

tok 1 → doc 1, doc 3, ...
tok 2 → doc 1, doc 2, doc 3, ...
tok 3 → doc 2, doc 3, ...
tok 4 → doc 102, ...
tok 5 → doc 1, doc 2, doc 3, ...
Exercise
Write an algorithm that indexes a document using an inverse sparse index (on-line). Compute the time complexity for querying an inverse sparse index.
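A possible sketch of both operations. It assumes that documents are indexed in increasing id order (so each posting list stays sorted without an explicit sort) and that a query is a conjunction of tokens; the names `add_document`, `intersect` and `query` are illustrative.

```python
from collections import defaultdict

def add_document(index, doc_id, tokens):
    """Append doc_id to the posting list of each distinct token.

    Assumes documents arrive in increasing doc_id order, which keeps
    every posting list sorted."""
    for tok in set(tokens):
        index[tok].append(doc_id)

def intersect(a, b):
    """Merge-intersect two sorted posting lists in O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def query(index, tokens):
    """Return the documents containing all tokens, shortest list first."""
    postings = sorted((index.get(t, []) for t in tokens), key=len)
    if not postings:
        return []
    result = postings[0]
    for p in postings[1:]:
        result = intersect(result, p)
    return result

index = defaultdict(list)
add_document(index, 1, ["election", "president", "united states"])
add_document(index, 2, ["president", "crazy", "united states"])
add_document(index, 3, ["election", "president", "crazy", "united states"])
hits = query(index, ["election", "president", "united states"])
print(hits)  # [1, 3]
```

The query cost now depends on the lengths of the posting lists of the query tokens only, not on the total number of documents times t.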
C. Galiez (LJK-SVH) Information retrieval I December 7, 2020 40 / 68
![Page 88: Information retrieval I · Information retrieval I Introduction, e cient indexing, querying Clovis Galiez Mast ere Big Data December 3, 2019 C. Galiez (LJK-SVH) Information retrieval](https://reader034.vdocuments.mx/reader034/viewer/2022051815/603f19bdfc994e771f59618a/html5/thumbnails/88.jpg)
Building an inverted index in practice
The full sparse index does not fit in memory.
Block Sort-Based Indexing is a simple algorithm that makes it possible to invert big dictionaries that do not fit in main memory, at low cost, and that can even be parallelized⁶.
⁶ https://westmont.instructure.com/files/51060/download?download_frd=1
C. Galiez (LJK-SVH) Information retrieval I December 7, 2020 41 / 68
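A heavily simplified sketch of the idea behind block sort-based indexing: cut the (token, document) pairs into blocks that fit in memory, sort each block and write it to disk, then merge the sorted blocks. The file format, the function names and the `block_size` parameter are assumptions made for this example; a real implementation would also write the merged index to disk instead of keeping it in a dictionary.

```python
import heapq
import os
import tempfile

def write_block(pairs, directory):
    """Sort one in-memory block of (token, doc_id) pairs and write it to disk."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".block", text=True)
    with os.fdopen(fd, "w") as f:
        for token, doc_id in sorted(pairs):
            f.write(f"{token}\t{doc_id}\n")
    return path

def read_block(path):
    """Stream the sorted (token, doc_id) pairs of one block back from disk."""
    with open(path) as f:
        for line in f:
            token, doc_id = line.rstrip("\n").split("\t")
            yield token, int(doc_id)

def bsbi(documents, block_size, directory):
    """Build an inverted index from (doc_id, tokens) items, block by block."""
    paths, block = [], []
    for doc_id, tokens in documents:
        block.extend((tok, doc_id) for tok in set(tokens))
        if len(block) >= block_size:      # block full: sort and flush to disk
            paths.append(write_block(block, directory))
            block = []
    if block:
        paths.append(write_block(block, directory))
    # k-way merge of the sorted blocks, reading them as streams.
    index = {}  # kept in memory here only to keep the sketch short
    for token, doc_id in heapq.merge(*(read_block(p) for p in paths)):
        index.setdefault(token, []).append(doc_id)
    return index

docs = [
    (1, ["election", "president"]),
    (2, ["president", "crazy"]),
    (3, ["election", "crazy"]),
]
with tempfile.TemporaryDirectory() as tmp:
    index = bsbi(docs, block_size=3, directory=tmp)
print(index)
```

Since each block is sorted independently, the block-sorting step could be distributed over several machines before the final merge.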
![Page 89: Information retrieval I · Information retrieval I Introduction, e cient indexing, querying Clovis Galiez Mast ere Big Data December 3, 2019 C. Galiez (LJK-SVH) Information retrieval](https://reader034.vdocuments.mx/reader034/viewer/2022051815/603f19bdfc994e771f59618a/html5/thumbnails/89.jpg)
Summary
C. Galiez (LJK-SVH) Information retrieval I December 7, 2020 42 / 68
![Page 90: Information retrieval I · Information retrieval I Introduction, e cient indexing, querying Clovis Galiez Mast ere Big Data December 3, 2019 C. Galiez (LJK-SVH) Information retrieval](https://reader034.vdocuments.mx/reader034/viewer/2022051815/603f19bdfc994e771f59618a/html5/thumbnails/90.jpg)
C. Galiez (LJK-SVH) Information retrieval I December 7, 2020 43 / 68
![Page 91: Information retrieval I · Information retrieval I Introduction, e cient indexing, querying Clovis Galiez Mast ere Big Data December 3, 2019 C. Galiez (LJK-SVH) Information retrieval](https://reader034.vdocuments.mx/reader034/viewer/2022051815/603f19bdfc994e771f59618a/html5/thumbnails/91.jpg)
Extras
C. Galiez (LJK-SVH) Information retrieval I December 7, 2020 44 / 68
![Page 94: Information retrieval I · Information retrieval I Introduction, e cient indexing, querying Clovis Galiez Mast ere Big Data December 3, 2019 C. Galiez (LJK-SVH) Information retrieval](https://reader034.vdocuments.mx/reader034/viewer/2022051815/603f19bdfc994e771f59618a/html5/thumbnails/94.jpg)
How to store already crawled URLs?
Exercise
What is the cost of checking if the crawler already visited a web page? Is it reasonable?
A trie structure.
Exercise
What is the complexity in time?
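A minimal character-level trie for visited URLs, as a sketch. The class names `TrieNode` and `UrlTrie` are made up for the example; a production crawler would more likely use a compressed trie or a hash-based structure.

```python
class TrieNode:
    __slots__ = ("children", "terminal")

    def __init__(self):
        self.children = {}     # next character -> child node
        self.terminal = False  # True if a stored URL ends at this node

class UrlTrie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, url):
        """Insert a URL, one node per character."""
        node = self.root
        for ch in url:
            node = node.children.setdefault(ch, TrieNode())
        node.terminal = True

    def __contains__(self, url):
        """Follow the URL character by character; fail on the first gap."""
        node = self.root
        for ch in url:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.terminal

visited = UrlTrie()
visited.add("https://www.nltk.org/")
print("https://www.nltk.org/" in visited)  # True
print("https://www.nltk.org" in visited)   # False: only a prefix of a stored URL
```

Both insertion and lookup take a number of steps proportional to the length of the URL, independently of how many URLs have already been stored; shared prefixes such as "https://www." are stored only once.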
C. Galiez (LJK-SVH) Information retrieval I December 7, 2020 45 / 68
![Page 95: Information retrieval I · Information retrieval I Introduction, e cient indexing, querying Clovis Galiez Mast ere Big Data December 3, 2019 C. Galiez (LJK-SVH) Information retrieval](https://reader034.vdocuments.mx/reader034/viewer/2022051815/603f19bdfc994e771f59618a/html5/thumbnails/95.jpg)
Examples of (successful) companies in IR
ElasticSearch
Distributed, RESTful⁷ search and analytics engine capable of solving a growing number of use cases. [...] centrally stores your data so you can discover the expected and uncover the unexpected.
swiftype
All-in-one relevance, lightning-fast setup and unprecedented control.
blekko
Now in IBM Watson
⁷ i.e. based on HTTP
C. Galiez (LJK-SVH) Information retrieval I December 7, 2020 46 / 68