Information Retrieval Introduction and basic concepts Luca Bondi


Page 1: Information Retrieval Introduction and basic conceptshome.deib.polimi.it/lbondi/data/uploads/irdm15-16/slides/01_introduction_v2.pdf · Information Retrieval Introduction What is

Information Retrieval

Introduction and basic concepts

Luca Bondi


Introduction

What is Information Retrieval?

• Information Retrieval deals with the representation, storage, organization of, and access to information items [Baeza-Yates, Ribeiro-Neto, 1999]

• The user’s need determines what counts as information

• How can the user’s information need be characterized in a way that computers can handle?

Information Retrieval

A user expresses a need in natural language and expects a computer system to generate results relevant to that need

Data Retrieval

A user specifies a query in a formal language and expects a computer system to generate results that exactly match the query statement


Why do we need Information Retrieval?

• Huge document collections due to cheap and easy generation, storage, processing
    • 968,882,453 (almost 1 billion) websites at the end of 2014
    • 1.5 billion Facebook users in Q2 2015
    • 316 million active Twitter users in Q2 2015
    • 30 billion Instagram photos up to August 2015
        • 70 million new photos per day
    • 300 hours of new video are uploaded to YouTube every minute

• We need Information Retrieval to find, as fast as possible, something (or everything) that is relevant to our needs!


The Information Retrieval challenges

• User needs are not uniquely definable
• Example
    • I need a new mouse for my workstation
    • Let’s Google for it!
    • Mmm... not exactly the kind of mouse I was looking for. But the second result might be helpful…


The nature of Information Retrieval

• Retrieving all objects which might be useful or relevant to the user’s information need
• Given unstructured user queries
• Errors in the results are tolerated
• What is relevant to a user?
• What is the trade-off between precision and recall?
• How to rank results to make users happy?
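The precision/recall trade-off can be made concrete with the standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. The following is a minimal Python sketch; the function name `precision_recall` and the document identifiers are illustrative assumptions, not part of the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids returned by the system; relevant: ids judged relevant.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments: three documents (d1, d2, d3) are relevant.
relevant = ["d1", "d2", "d3"]
narrow = precision_recall(["d1", "d2"], relevant)
broad = precision_recall(["d1", "d2", "d3", "d4", "d5", "d6"], relevant)
```

In this made-up case the narrow result list is fully precise but misses one relevant document, while the broad list finds every relevant document at the cost of returning three irrelevant ones.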


Typical tasks covered in IR

• Search
    • Static document collection
    • Dynamic queries
    • Example
        • Searching for a file within your PC
• Filtering
    • Dynamic document collection
    • Static queries
    • Example
        • Automatic e-mail filtering to separate students’ e-mail from department spam
• Clustering, Categorization, Recommendation, Browsing, Summarization, Question answering
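The difference between search and filtering listed above can be sketched in a few lines of Python. This is only an illustration: the matching rule in `matches` (every query word must occur in the document) and all names and sample texts are assumptions made for the example, not something the slides prescribe.

```python
def matches(query, document):
    """Hypothetical matching rule: every query word occurs in the document."""
    words = set(document.lower().split())
    return all(term in words for term in query.lower().split())


def search(collection, query):
    """Search: the collection is fixed, a new query arrives each time."""
    return [doc for doc in collection if matches(query, doc)]


def make_filter(query):
    """Filtering: the query is fixed, new documents keep arriving."""
    return lambda document: matches(query, document)


# Search over a static collection with an ad hoc query.
collection = ["exam results for students", "department seminar announcement"]
found = search(collection, "exam results")

# Filtering a stream of incoming messages with a standing query.
is_student_mail = make_filter("exam")
incoming = ["question about the exam", "department newsletter"]
kept = [mail for mail in incoming if is_student_mail(mail)]
```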


Definitions

Information Retrieval Model

$IRM \triangleq \langle D, Q, R(q_i, d_j) \rangle$

$D$: set of logical views for the documents, with $d_j \in D$

$Q$: set of logical views for the user’s needs (aka queries), with $q_i \in Q$

$R(q_i, d_j) \to \mathbb{R}$: ranking function. It associates a real number with a document representation $d_j$ with respect to a query $q_i$. The ranking defines an ordering among all the documents with regard to the query $q_i$.


Relevance

• The relevance of a document with respect to a user’s needs is
    • Subjective
        • different users may express the same information need in different ways, thus issuing different queries that lead to different document rankings
        • two users with the same information need, expressed by the same query, may give different judgments on the same retrieved document
    • Dynamic
        • in time: documents retrieved and displayed now could influence the user’s judgment on documents that will be displayed later
        • in space: documents relevant to a user in a specific location may not be relevant to a user in another geographic location
    • Not known to the system prior to the user’s judgment
        • The system guesses the document relevance by computing the ranking function, which depends on the adopted IRM


Ranking vs Relevance

Ranking
• deterministic
• independent from user context (at least in simple cases)
• time and space invariant (at least in simple cases)

Relevance
• subjective
• user dependent
• time variant
• space variant


Basic concepts

Similarity

[Figure: a document $d_j$ and a query $q_i$ are fed to the IRM, which outputs $SC(q_i, d_j)$]

Given a document $d_j$ and a query $q_i$, an Information Retrieval Model assigns a measure of similarity, the Similarity Coefficient $SC(q_i, d_j)$, between the document and the query.

An idea of what similarity means:

The more often terms are found in both the document and the query, the more relevant the document is with regard to the query

Given a query $q$ and a set of documents $D = \{d_1, d_2, \dots, d_N\}$, a retrieval strategy is an algorithm that computes the Similarity Coefficient $SC(q, d_j)$ for each document $d_j$, $\forall j \in [1, N]$.


Similarity and Rank

Similarity
• independent from the document collection
• depends on the model
• the higher, the better

Rank
• depends on the document collection
• depends on the model
• the lower, the better


Similarity and Rank: a trivial example

$d_j$                                $SC(q, d_j)$   $R(q, d_j)$
I’m scared by black cats             2              1
she’s missing her cat                1              2 (tie)
too many cats in my neighbourhood    1              2 (tie)
what a beautiful flower!             0              4

$q$ = “black cat”

$SC(q, d_j)$ = “number of words in query $q$ that also appear in document $d_j$”
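The trivial example can be reproduced with a short Python sketch. Two points are assumptions that the slide does not state: for “cats” to match the query word “cat”, some word normalization is needed, so `normalize` strips a trailing plural “s” as a crude stand-in for stemming; and tied documents share the best rank, with the following rank skipped, which is how the sketch reads the “2 (tie)” and “4” entries. All function names are illustrative.

```python
import re


def normalize(text):
    """Lowercase, split into words, strip a plural 's' (crude stand-in for stemming)."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w[:-1] if len(w) > 3 and w.endswith("s") else w for w in words}


def similarity(query, document):
    """SC(q, d_j): number of query words that also appear in the document."""
    return len(normalize(query) & normalize(document))


def ranks(scores):
    """Rank with ties: 1 + number of documents with a strictly higher score."""
    return [1 + sum(other > s for other in scores) for s in scores]


documents = [
    "I'm scared by black cats",
    "she's missing her cat",
    "too many cats in my neighbourhood",
    "what a beautiful flower!",
]
query = "black cat"

sc = [similarity(query, d) for d in documents]  # [2, 1, 1, 0]
r = ranks(sc)                                   # [1, 2, 2, 4]
```

Note that each similarity value is computed from one document alone, while each rank requires the scores of the whole collection.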


Index terms

Each document $d_j$ is represented by a set of keywords called index terms $t_i$

Index terms are used to index and summarize the document content

Distinct index terms have varying relevance (to the user) when used to describe document contents. This effect is modelled by assigning a numerical weight $w_{i,j}$ to each index term $t_i$ for each document $d_j$

$w_{i,j}$ quantifies the importance of the index term for describing the document’s semantic contents


Each document $d_j$ is associated with an index term vector

$\mathbf{d}_j = [w_{1,j}, w_{2,j}, \dots, w_{M,j}]^T$

$M$ = total number of index terms

$w_{i,j} = g(t_i, d_j)$, where $g$ is a function that computes the weight of term $t_i$ in document $d_j$

$w_{i,j} = 0$ if term $t_i$ does not appear in document $d_j$
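As a minimal sketch of how such vectors could be built, the Python code below assumes that $g$ is the raw term frequency (the number of times $t_i$ occurs in $d_j$). The slides leave $g$ unspecified, so this choice, the tokenizer, and the function names are assumptions made for illustration only.

```python
import re
from collections import Counter


def tokenize(text):
    """Split a text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def index_term_vectors(documents):
    """Return (terms, vectors); vectors[j][i] is w_{i,j}, the weight of t_i in d_j."""
    counts = [Counter(tokenize(d)) for d in documents]
    terms = sorted(set().union(*counts))  # the M index terms t_1 ... t_M
    # Assumed g(t_i, d_j): raw term frequency. Counter yields 0 for absent terms.
    vectors = [[c[t] for t in terms] for c in counts]
    return terms, vectors


terms, vectors = index_term_vectors(["black cat", "cat and cat"])
# terms   -> ['and', 'black', 'cat']
# vectors -> [[0, 1, 1], [1, 0, 2]]
```

Every vector has the same length $M$, and a zero entry marks a term that does not appear in that document.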


Index term weights are usually assumed to be mutually independent

This means that knowing the weight $w_{i,j}$ associated with the pair $(t_i, d_j)$ tells us nothing about the weight $w_{i+1,j}$ associated with the pair $(t_{i+1}, d_j)$

This is clearly a simplification because occurrences of index terms in a document are not uncorrelated

• In a telecommunication book, for instance, the terms computer and network are likely to appear coupled, thus the weights of those terms are clearly correlated
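One way to see why independence is only a simplification is to measure the correlation between the weights of two terms across a collection. The Python sketch below computes the Pearson correlation coefficient on made-up term-frequency weights; the numbers and variable names are hypothetical and chosen only to illustrate the point.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long weight lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical weights of three terms over four documents (made-up numbers).
w_computer = [5, 4, 0, 1]
w_network = [6, 3, 0, 1]
w_flower = [1, 0, 1, 0]

r_coupled = pearson(w_computer, w_network)  # close to 1: the weights move together
r_other = pearson(w_computer, w_flower)     # 0 for these particular numbers
```

With weights like these, knowing that computer is heavily weighted in a document does tell us something about the weight of network, which is exactly what the independence assumption ignores.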