10 - information retrieval - 1

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology Information Service Engineering Lecture 10: Linked Data Engineering - 5 and Information Retrieval - 1 Prof. Dr. Harald Sack FIZ Karlsruhe - Leibniz Institute for Information Infrastructure AIFB - Karlsruhe Institute of Technology Summer Semester 2017 This file is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0 )

Upload: harald-sack

Post on 21-Jan-2018


Page 1: 10 - Information Retrieval - 1

Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology

Information Service Engineering

Lecture 10: Linked Data Engineering - 5 and Information Retrieval - 1

Prof. Dr. Harald Sack

FIZ Karlsruhe - Leibniz Institute for Information Infrastructure

AIFB - Karlsruhe Institute of Technology

Summer Semester 2017

This file is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0)

Page 2: 10 - Information Retrieval - 1

Information Service Engineering

Last Lecture: Linked Data Engineering - 4

3.1 Knowledge Representations and Ontologies

3.2 Semantic Web and the Web of Data

3.3 Linked Data Principles

3.4 How to name Things - URIs

3.5 Resource Description Framework (RDF)

3.6 Creating new Models with RDFS

3.7 Querying RDF(S) with SPARQL

3.8 More Expressivity with Web Ontology Language (OWL)

3.9 Wikipedia, DBpedia, and Wikidata

3.10 Linked Data Programming

● From Wikipedia to DBpedia

● Differences between DBpedia and Wikidata

Page 3: 10 - Information Retrieval - 1

Information Service Engineering

Lecture 10: Linked Data Engineering - 5

3.1 Knowledge Representations and Ontologies

3.2 Semantic Web and the Web of Data

3.3 Linked Data Principles

3.4 How to name Things - URIs

3.5 Resource Description Framework (RDF) as simple Data Model

3.6 Creating new Models with RDFS

3.7 Querying RDF(S) with SPARQL

3.8 More Expressivity with Web Ontology Language (OWL)

3.9 Wikipedia, DBpedia, and Wikidata

3.10 Linked Data Programming

Page 4: 10 - Information Retrieval - 1

Linked Data Driven Web Applications

● Required Components:

○ Local RDF Store

■ caching of results

■ permanent storage

M.Hausenblas: Linked Data Applications, DERI Technical Report, 2009

3. Linked Data Engineering / 3.10 Linked Data Programming

Page 5: 10 - Information Retrieval - 1

Linked Data Driven Web Applications

● Required Components:

○ Logic (Controller) and

○ User Interface

■ (=Business Logic)

■ (not LOD specific)

M.Hausenblas: Linked Data Applications, DERI Technical Report, 2009

Page 6: 10 - Information Retrieval - 1

Linked Data Driven Web Applications

● Required Components:

● Data Integration component

○ get data directly from LOD-Cloud or

○ via Semantic Indexer

M.Hausenblas: Linked Data Applications, DERI Technical Report, 2009

Page 7: 10 - Information Retrieval - 1

Linked Data Driven Web Applications

● Required Components:

● Data Re-/Publishing component

○ write back application-dependent data into the Web of Data

M.Hausenblas: Linked Data Applications, DERI Technical Report, 2009

Page 8: 10 - Information Retrieval - 1

Linked Data Driven Web Applications

● Required Components:

○ Local RDF Store

■ caching of results

■ permanent storage

○ Logic (Controller) and

○ User Interface (=Business Logic)

■ (not LOD specific)

○ Data Integration component

■ get data directly from LOD-Cloud or

■ via Semantic Indexer

○ Data Re-/Publishing component

■ write back application-dependent data into the Web of Data

M.Hausenblas: Linked Data Applications, DERI Technical Report, 2009

Page 9: 10 - Information Retrieval - 1

Linked Data Driven Web Applications

● The easiest way is to make use of a suitable library:

○ SPARQL JavaScript Library: http://www.thefigtrees.net/lee/blog/2006/04/sparql_calendar_demo_a_sparql.html

○ ARC for SPARQL (PHP): https://github.com/semsol/arc2/wiki

○ dotNetRDF (C#): https://dotnetrdf.github.io/

○ Jena/ARQ (Java): http://jena.apache.org/

○ Sesame (Java): http://rdf4j.org/

○ SPARQL Wrapper (Python): http://rdflib.github.io/sparqlwrapper/

Page 10: 10 - Information Retrieval - 1

Linked Data Programming Example

● The easiest way is to make use of a suitable library:

○ SPARQL Wrapper (Python): http://rdflib.github.io/sparqlwrapper/

● Access to Linked Data via SPARQL endpoints

○ let's choose DBpedia (just for simplicity...): http://dbpedia.org/sparql

● ...now we have to think of a simple example...

Page 11: 10 - Information Retrieval - 1

Linked Data Programming Example

● Example Application:

○ Build a simple application that looks for today's birthdays of famous people, e.g. authors

○ Create a list of authors whose birthday is today, including some additional information, e.g.

■ Year of birth

■ Short description

○ Let's create a simple (local) web page for the task (i.e. encode the results in HTML), which can be displayed in the browser

○ We use Python and the SPARQL Wrapper library

Page 12: 10 - Information Retrieval - 1

Prerequisites - (Manual) Data Analysis

http://dbpedia.org/page/Alexandre_Dumas

● Choose a representative example:

○ E.g. Alexandre Dumas from DBpedia

Page 13: 10 - Information Retrieval - 1

Prerequisites - (Manual) Data Analysis

● Data Analysis via SPARQL

○ What kind of entities are you looking for?

■ ?author rdf:type dbo:Writer .

○ What information do you need?

■ ?author dbo:birthDate ?birthdate .

■ ?author rdfs:label ?name .

■ ?author rdfs:comment ?description .

■ OPTIONAL { ?author dbo:thumbnail ?thumbnail }

○ Any filter criteria?

■ (lang(?name)="en") && (lang(?description)="en")

■ (SUBSTR(STR(?birthdate),6)="07-05")

■ More sophisticated: compare against SUBSTR(STR(bif:curdate('')),6) instead of a fixed date (bif:curdate is a Virtuoso triple store built-in function for the current date)
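The triple patterns and filter criteria above can be assembled into one complete query. The following is a minimal sketch, not the query shown on the original slide: the function name `build_birthday_query`, the `ORDER BY` clause, and the `LIMIT` are assumptions added for illustration, and the date is computed in Python rather than with the Virtuoso-specific bif:curdate.

```python
from datetime import date

# Standard namespace prefixes for the vocabularies used in the patterns.
PREFIXES = """\
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo:  <http://dbpedia.org/ontology/>
"""


def build_birthday_query(day=None, limit=50):
    """Return a SPARQL query for writers whose birthday falls on `day`.

    `day` defaults to today; `limit` is an illustrative safety cap.
    """
    day = day or date.today()
    month_day = day.strftime("%m-%d")  # e.g. "07-05"
    return PREFIXES + f"""
SELECT DISTINCT ?author ?name ?birthdate ?description ?thumbnail
WHERE {{
  ?author rdf:type dbo:Writer .
  ?author dbo:birthDate ?birthdate .
  ?author rdfs:label ?name .
  ?author rdfs:comment ?description .
  OPTIONAL {{ ?author dbo:thumbnail ?thumbnail }}
  FILTER ( (lang(?name) = "en") && (lang(?description) = "en") )
  FILTER ( SUBSTR(STR(?birthdate), 6) = "{month_day}" )
}}
ORDER BY ?birthdate
LIMIT {limit}
"""


if __name__ == "__main__":
    # Print the query for the date used in the filter example (5 July).
    print(build_birthday_query(date(2017, 7, 5)))
```

The printed query can be pasted into the query form of a SPARQL endpoint such as http://dbpedia.org/sparql to inspect the results before writing any application code.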

Page 14: 10 - Information Retrieval - 1

Prerequisites - SPARQL Queries

SPARQL query at dbpedia.org

Page 15: 10 - Information Retrieval - 1

Prerequisites - Download & Install

● You should have Python installed on your computer

○ https://www.python.org/downloads/

● Download SPARQL Wrapper for Python

○ http://rdflib.github.io/sparqlwrapper/

● Follow the instructions for Installation
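Once the library is installed, the birthday example could look roughly like the sketch below. It assumes the SPARQLWrapper package is importable; the helper names `run_select`, `bindings_to_html`, and `main`, the output file name `birthdays.html`, and the fixed date "07-05" are illustrative choices, not taken from the original slides.

```python
import html

ENDPOINT = "http://dbpedia.org/sparql"

# Fixed-date variant of the birthday query (5 July), kept short for illustration.
QUERY = """
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo:  <http://dbpedia.org/ontology/>
SELECT DISTINCT ?author ?name ?birthdate ?description ?thumbnail
WHERE {
  ?author rdf:type dbo:Writer ;
          dbo:birthDate ?birthdate ;
          rdfs:label ?name ;
          rdfs:comment ?description .
  OPTIONAL { ?author dbo:thumbnail ?thumbnail }
  FILTER ( (lang(?name) = "en") && (lang(?description) = "en") )
  FILTER ( SUBSTR(STR(?birthdate), 6) = "07-05" )
}
LIMIT 50
"""


def run_select(query, endpoint=ENDPOINT):
    """Send a SELECT query to the endpoint and return the result bindings."""
    # Imported here so that the HTML helper works without the library installed.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()["results"]["bindings"]


def bindings_to_html(bindings, title="Authors born today"):
    """Render SPARQL JSON result bindings as a small HTML page."""
    items = []
    for b in bindings:
        name = html.escape(b["name"]["value"])
        year = html.escape(b["birthdate"]["value"][:4])  # "1802-07-24" -> "1802"
        text = html.escape(b["description"]["value"])
        image = ""
        if "thumbnail" in b:  # OPTIONAL variable may be unbound
            url = html.escape(b["thumbnail"]["value"], quote=True)
            image = f'<img src="{url}" height="80" alt=""> '
        items.append(f"<li>{image}<b>{name}</b> ({year}): {text}</li>")
    heading = html.escape(title)
    return (
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
        f"<title>{heading}</title></head>\n"
        f"<body><h1>{heading}</h1>\n<ul>\n" + "\n".join(items) + "\n</ul></body></html>\n"
    )


def main():
    """Query the endpoint and write the result page (needs network access)."""
    page = bindings_to_html(run_select(QUERY))
    with open("birthdays.html", "w", encoding="utf-8") as f:
        f.write(page)
```

Calling `main()` writes a local `birthdays.html` that can be opened in the browser. Keeping the HTML rendering separate from the network call makes it possible to test the page layout with hand-written bindings.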

Page 16: 10 - Information Retrieval - 1

Page 17: 10 - Information Retrieval - 1

Linked Data Programming Example

Page 18: 10 - Information Retrieval - 1

Linked Data Programming Example

● Alternative: simple example with Jena ARQ and Java:

http://jena.apache.org/

// Jena 2 package name; since Jena 3 the package is org.apache.jena.query
import com.hp.hpl.jena.query.*;

String service = "...";       // address of the SPARQL endpoint
String query = "SELECT ...";  // your SPARQL query
QueryExecution e = QueryExecutionFactory.sparqlService(service, query);

// iterate over all solutions (result rows) of the SELECT query
ResultSet results = e.execSelect();
while (results.hasNext()) {
    QuerySolution s = results.nextSolution();
    // ...
}

// release the resources held by the query execution
e.close();

Page 19: 10 - Information Retrieval - 1

You may also try out Wikidata...

SPARQL query at wikidata.org

Page 20: 10 - Information Retrieval - 1

Information Service Engineering

Next Lecture: Linked Data Engineering - 5

3.1 Knowledge Representations and Ontologies

3.2 Semantic Web and the Web of Data

3.3 Linked Data Principles

3.4 How to name Things - URIs

3.5 Resource Description Framework (RDF) as simple Data Model

3.6 Creating new Models with RDFS

3.7 Querying RDF(S) with SPARQL

3.8 More Expressivity with Web Ontology Language (OWL)

3.9 Wikipedia, DBpedia, and Wikidata

3.10 Linked Data Programming

Page 21: 10 - Information Retrieval - 1

Information Service Engineering

Lecture Overview

1. Information, Natural Language and the Web

2. Natural Language Processing

3. Linked Data Engineering

4. Information Retrieval

5. Knowledge Mining

6. Exploratory Search and Recommender Systems

Page 22: 10 - Information Retrieval - 1

Information Service Engineering

Lecture 10: Information Retrieval

4.1 A Brief History of Libraries and IR

4.2 Fundamental Concepts of IR

4.3 Information Retrieval Models

4.4 Retrieval Evaluation

4.5 Web Information Retrieval

4.6 Document Crawling, Text Processing, and Indexing

4.7 Query Processing and Result Representation

4.8 Question Answering

Page 23: 10 - Information Retrieval - 1

Information Retrieval

4. Information Retrieval / 4.1 A Brief History of Libraries and IR

● “Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.” (Gerard Salton, 1968, [1])

● “IR is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).” (Manning et al., 2008, [2])

Page 24: 10 - Information Retrieval - 1

How old is Information Retrieval?

Page 25: 10 - Information Retrieval - 1

Libraries and Information Retrieval

● 3,000 - 2,000 BC: Sumerian archives

○ Clay tablets in cuneiform script stored in temple rooms

○ Mostly inventories and records of commercial transactions

● 300 BC: Library of Alexandria

○ Idea: a universal library holding copies of all the world’s books

○ At its height, the library contained almost 750,000 books in the form of papyrus scrolls

https://commons.wikimedia.org/wiki/File:Milkau_Oberer_Teil_der_Stele_mit_dem_Text_von_Hammurapis_Gesetzescode_369-2.jpg

Page 26: 10 - Information Retrieval - 1

Libraries and Information Retrieval

● Middle Ages: Monastic Libraries

○ Christian monks saved texts of Roman and Greek antiquity from getting lost by copying them

○ Vatican Library founded in 1475

● c. 1450: Printing Press

○ Johannes Gutenberg introduced movable type to Europe

○ Copying books became much easier and less expensive

https://commons.wikimedia.org/wiki/File:Buchdrucker-1568.png

Page 27: 10 - Information Retrieval - 1

Libraries and Information Retrieval

● German National Library

○ 24M items

○ Located in Leipzig, Frankfurt (Main), and Berlin

● Library of Congress

○ the world’s largest library

○ 155M items

○ Classification system: Library of Congress Classification

Page 28: 10 - Information Retrieval - 1

Library Catalog and Index

● Items are catalogued by metadata

○ Author, Editor, Title, ISBN, ...

○ Keyword, e.g. “information retrieval”

○ Subject area, e.g. “information systems”

○ Specialized classification systems

■ Library of Congress Classification (LCC)

■ Dewey Decimal Classification (DDC)

■ Universal Decimal Classification (UDC)

http://www.worldcat.org/title/modern-information-retrieval/oclc/40602840

Page 29: 10 - Information Retrieval - 1

Full Text Search and Concordance

● Catalogue cards serve as document proxies

● Experts must catalogue each item individually

● Full text search: every word is a keyword

https://commons.wikimedia.org/wiki/File%3ASchlagwortkatalog.jpg

Page 30: 10 - Information Retrieval - 1

Full Text Search and Concordance

● Before information retrieval, in the pre-computer era: Concordances

○ Alphabetical list of the principal words used in a book, listing every instance of each word with its immediate context

○ Only for works of special importance, e.g. the Bible

○ First Bible concordance by Hugh of Saint-Cher, with the help of 500 monks, c. 1250

https://commons.wikimedia.org/wiki/File%3ASchlagwortkatalog.jpg

https://commons.wikimedia.org/w/index.php?title=File:A_Concordance_to_the_English_Poems_of_Thomas_Gray_(1908).djvu&page=17

Page 31: 10 - Information Retrieval - 1

Early Information Retrieval

● 1957: Hans-Peter Luhn (IBM) uses words as indexing units for documents

○ Measure similarity between documents by word overlap

● 1960s and 1970s: Gerard Salton and his students (Harvard, Cornell) create the SMART system

○ Vector space model

○ Relevance feedback

● 1972: Karen Spärck Jones introduced inverse document frequency (IDF)

Page 32: 10 - Information Retrieval - 1

Information Retrieval Timeline

● 1992: TREC - annual Text REtrieval Conference

○ Sponsored by the U.S. National Institute of Standards and Technology and the U.S. Department of Defense

○ Many different tracks, e.g. blogs, genomics, spam, video, etc.

○ Provides data sets and test problems

● 1994: WebCrawler, the first Web search engine with full-text search

● 1998: Google

● Current Research Questions:

○ Scalability, Speed, Quality

● IR related Research at ISE:

○ Semantic Search, Exploratory Search, Question Answering

Page 33: 10 - Information Retrieval - 1

Information Service Engineering

Lecture 10: Information Retrieval

4.1 A Brief History of Libraries and IR

4.2 Fundamental Concepts of IR

4.3 Information Retrieval Models

4.4 Retrieval Evaluation

4.5 Web Information Retrieval

4.6 Document Crawling, Text Processing, and Indexing

4.7 Query Processing and Result Representation

4.8 Question Answering

Page 34: 10 - Information Retrieval - 1

Information Retrieval Basics

4. Information Retrieval / 4.2 Fundamental Concepts of IR

● A document is a coherent passage of free text

● Examples:

○ Web pages, email, books, news stories, scholarly papers, text messages, Word™, PowerPoint™, PDF, forum postings, patents, IM sessions, dictionary entries etc.

● Common properties:

○ Written in natural language

○ Significant text content

○ Some structure, e.g.

■ Papers: title, author, date, or

■ Email: subject, sender, destination, date

https://pixabay.com/en/files-paper-office-paperwork-stack-1614223/

Page 35: 10 - Information Retrieval - 1

Information Retrieval Basics

● A document collection is a set of documents

○ also known as corpus

○ usually, all documents within a collection are similar with respect to some criterion

● Examples:

○ Chinese Patents

○ The articles covered by The New York Times

○ Amazon Product Reviews

○ The Web

https://pixabay.com/en/files-paper-office-paperwork-stack-1614223/

Page 36: 10 - Information Retrieval - 1

Information Retrieval Basics

● An information need is the topic about which the user (or a group of users) desires to know more

○ Refers to an individual, hidden cognitive state

○ Paradoxical: It describes the user’s ignorance

○ Ill-defined

● Examples:

○ What is the capital of Uruguay?

○ Is it really true that Elvis is still alive?

○ Show me some definitions of “information need”!

https://pixabay.com/en/files-paper-office-paperwork-stack-1614223/

Page 37: 10 - Information Retrieval - 1

Information Retrieval Basics

● A query is what the user conveys to the computer in an attempt to communicate the information need

○ Stated in a formal query language

○ Usually a list of search terms (keywords)

● Keyword queries are often poor descriptions of actual information needs

○ E.g., a query for “jaguar” could mean “places to buy Jaguar cars” or the “cat”.

● Search queries (in particular one-word queries) are under-specified.

○ Semantics of long queries are ignored

https://pixabay.com/en/files-paper-office-paperwork-stack-1614223/

Page 38: 10 - Information Retrieval - 1

Information Retrieval Basics

● A document is relevant with respect to some user’s information need if the user perceives it as containing information of value with respect to this information need

○ Usually assumed to be a binary concept, but could also be graded

● Example:

○ Information need: “What is relevance in IR?”

● Relevant document:

○ Wikipedia’s entry “Relevance (information retrieval)”

https://pixabay.com/en/files-paper-office-paperwork-stack-1614223/

Page 39: 10 - Information Retrieval - 1

The Information Retrieval Paradigm

[Figure: Set of Queries → Query Formulation → query; Set of Documents → Indexing → index; query and index are matched based on (string) similarity]

https://pixabay.com/en/files-paper-office-paperwork-stack-1614223/ https://pixabay.com/en/post-it-paper-notes-record-memory-1079361/

Page 40: 10 - Information Retrieval - 1

Classical Information Retrieval - Simplified Form

[Figure labels: search term(s), keyword(s), search index, search query, document corpus, document]

https://pixabay.com/en/files-paper-office-paperwork-stack-1614223/

Page 41: 10 - Information Retrieval - 1

IR System Architecture - Indexing: Basic Building Blocks

[Figure: building blocks Text Acquisition, Text Transformation, and Index Creation, connected to the Document Store and the Index]
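The indexing building blocks can be illustrated with a toy inverted index. This is a minimal sketch under simplifying assumptions: the document store is a plain dictionary, text transformation is only lowercasing and splitting on non-alphanumeric characters, and the names `tokenize` and `build_index` are illustrative.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Text transformation: lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Index creation: map each term to the sorted ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Text acquisition is simulated by a small in-memory document store.
document_store = {
    1: "Information retrieval finds documents.",
    2: "Libraries catalogue documents and books.",
    3: "Retrieval models rank documents.",
}

index = build_index(document_store)

if __name__ == "__main__":
    for term in sorted(index):
        print(term, index[term])
```

A real system would add steps such as stop word removal and stemming during text transformation, and would store term positions or frequencies in the postings.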

Page 42: 10 - Information Retrieval - 1

IR System Architecture - Querying: Basic Building Blocks

[Figure: building blocks User Interaction, Ranking, and Evaluation, connected to the Index, the Document Store, and Log Data]

The retrieval model uses queries and the index to generate a ranked list of documents

Page 43: 10 - Information Retrieval - 1

Information Service Engineering

Lecture 10: Information Retrieval

4.1 A Brief History of Libraries and IR

4.2 Fundamental Concepts of IR

4.3 Information Retrieval Models

4.4 Retrieval Evaluation

4.5 Web Information Retrieval

4.6 Document Crawling, Text Processing, and Indexing

4.7 Query Processing and Result Representation

4.8 Question Answering

Page 44: 10 - Information Retrieval - 1

Categorization of IR Models

4. Information Retrieval / 4.3 Information Retrieval Models

● Set-theoretic models

○ represent documents as sets of words or phrases

○ similarities are usually derived from set-theoretic operations on those sets
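As an illustration of a similarity derived from set operations, the sketch below uses the Jaccard coefficient (size of the intersection divided by size of the union). The choice of Jaccard is an assumption made for this example; the slide does not name a specific measure.

```python
def jaccard(a, b):
    """Jaccard coefficient of two term collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


doc = "information retrieval models".split()
query = "retrieval models evaluation".split()

if __name__ == "__main__":
    # Two shared terms out of four distinct terms overall.
    print(jaccard(doc, query))
```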

Page 45: 10 - Information Retrieval - 1

Categorization of IR Models

● Algebraic models

○ represent documents and queries usually as vectors, matrices, or tuples

○ similarity of the query vector and document vector is represented as a scalar value

Page 46: 10 - Information Retrieval - 1

Categorization of IR Models

● Probabilistic models

○ treat the process of document retrieval as a probabilistic inference

○ similarities are computed as probabilities that a document is relevant for a given query

Page 47: 10 - Information Retrieval - 1

Categorization of IR Models

● Models without term interdependencies

○ treat different terms/words as independent

○ in vector space models, this is represented by the orthogonality assumption of term vectors

○ in probabilistic models, this is represented by an independence assumption for term variables

Page 48: 10 - Information Retrieval - 1

Categorization of IR Models

● Models with immanent term interdependencies

○ allow a representation of interdependencies between terms

○ the interdependency between two terms is defined by the model itself

● Models with transcendent term interdependencies

○ do not specify how the interdependency between two terms is defined

○ rely on an external source for the degree of interdependency between two terms

Page 49: 10 - Information Retrieval - 1

Categorization of IR Models

Dominik Kuropka: Modelle zur Repräsentation natürlichsprachlicher Dokumente. Ontologie-basiertes Information-Filtering und -Retrieval mit relationalen Datenbanken, Advances in Information Systems and Management Science, Bd. 10, Logos Verlag, Berlin, 2004.

Page 50: 10 - Information Retrieval - 1

Boolean Retrieval Model

● Propositional Logic as retrieval language

● selection and connection of arbitrary document sets via boolean connectors (search operators)

● easy to implement

● no differentiated term weights

● no ranking
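Boolean retrieval can be sketched as set operations on posting sets. The tiny index, the document ids, and the helper names below are hypothetical and only serve to show how AND, OR, and NOT select document sets without any weighting or ranking.

```python
# Hypothetical inverted index: term -> set of ids of documents containing it.
INDEX = {
    "information": {1, 2},
    "retrieval": {1, 3},
    "library": {2, 4},
}
ALL_DOCS = {1, 2, 3, 4}


def docs(term):
    """Posting set of a term (empty if the term is not indexed)."""
    return INDEX.get(term, set())


def op_and(a, b):
    return a & b


def op_or(a, b):
    return a | b


def op_not(a):
    return ALL_DOCS - a


# information AND NOT library
result = op_and(docs("information"), op_not(docs("library")))

if __name__ == "__main__":
    print(sorted(result))
```

The result is an unordered set: every matching document satisfies the query equally, which is exactly the "no ranking" limitation listed above.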

Page 51: 10 - Information Retrieval - 1

Vector Space Model

● Documents and queries are represented as points in a high-dimensional vector space ℝ^n

● for retrieval, the Euclidean distance or the cosine similarity between the search query vector and the document vector is used

● ranking according to distance

● differentiated term weights

● linear order of terms in documents is lost

● No semantic sensitivity (vocabulary dependency)
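A minimal sketch of vector space ranking follows. It assumes tf-idf as the term weighting (with idf = ln(N / df)) and cosine similarity as the measure; the slide only asks for differentiated term weights, so this particular weighting scheme and all function names are assumptions.

```python
import math
from collections import Counter


def tf_idf_vectors(token_lists):
    """Return sparse tf-idf vectors (dicts) for all documents plus the idf table."""
    n = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in token_lists
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_tokens, token_lists):
    """Return (score, document position) pairs, best match first."""
    vectors, idf = tf_idf_vectors(token_lists)
    query = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query_tokens).items()}
    scores = [(cosine(query, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scores, reverse=True)


documents = [
    ["jaguar", "car", "dealer"],
    ["jaguar", "cat", "jungle"],
    ["library", "catalogue"],
]

if __name__ == "__main__":
    for score, position in rank(["jaguar", "cat"], documents):
        print(position, round(score, 3))
```

Because the vectors are bags of words, the order of terms inside a document plays no role, and a query term only matches documents that contain exactly that word.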

Page 52: 10 - Information Retrieval - 1

Probabilistic Model

● Documents are weighted according to their relevance for a search query

● The IR system estimates the probability of relevance for a search query

● Relevance feedback for search query q yields term weights for the terms t_i of q

● For a new document d_m, the relevance of d_m for the search query q can be determined via the term weights of the terms t_i
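One way to derive term weights from relevance feedback is the Robertson/Spärck Jones weight of the binary independence model, sketched below with the usual 0.5 smoothing. This formula is a standard technique swapped in for illustration; the slide does not state which weighting it uses, and the function names and sample data are hypothetical.

```python
import math


def rsj_weight(N, R, n, r):
    """Robertson/Spärck Jones weight with 0.5 smoothing.

    N: judged documents, R: relevant ones,
    n: documents containing the term, r: relevant documents containing it.
    """
    return math.log(
        ((r + 0.5) * (N - n - R + r + 0.5)) / ((n - r + 0.5) * (R - r + 0.5))
    )


def term_weights(judged_docs, query_terms):
    """judged_docs: list of (set of terms, is_relevant) pairs from user feedback."""
    N = len(judged_docs)
    R = sum(1 for _, relevant in judged_docs if relevant)
    weights = {}
    for t in query_terms:
        n = sum(1 for terms, _ in judged_docs if t in terms)
        r = sum(1 for terms, relevant in judged_docs if relevant and t in terms)
        weights[t] = rsj_weight(N, R, n, r)
    return weights


def score(doc_terms, weights):
    """Relevance score of a new document: sum of weights of matching query terms."""
    return sum(w for t, w in weights.items() if t in doc_terms)


# Hypothetical feedback for the query "jaguar cat".
feedback = [
    ({"jaguar", "cat"}, True),
    ({"jaguar", "jungle", "cat"}, True),
    ({"jaguar", "car"}, False),
    ({"car", "dealer"}, False),
]
weights = term_weights(feedback, ["jaguar", "cat"])

if __name__ == "__main__":
    print(weights)
    print(score({"jaguar", "cat", "zoo"}, weights))
```

In this sample, "cat" occurs only in relevant documents and therefore receives a higher weight than "jaguar", which also occurs in a non-relevant one.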

Page 53: 10 - Information Retrieval - 1

Information Service Engineering

Lecture 10: Information Retrieval

4.1 A Brief History of Libraries and IR

4.2 Fundamental Concepts of IR

4.3 Information Retrieval Models

4.4 Retrieval Evaluation

4.5 Web Information Retrieval

4.6 Document Crawling, Text Processing, and Indexing

4.7 Query Processing and Result Representation

4.8 Question Answering

Page 54: 10 - Information Retrieval - 1

Retrieval Evaluation

4. Information Retrieval / 4.4 Retrieval Evaluation

[Figure: building blocks User Interaction, Ranking, and Evaluation, connected to the Index, the Document Store, and Log Data]

Evaluation monitors and measures effectiveness and efficiency (primarily offline)

Page 55: 10 - Information Retrieval - 1

Retrieval Evaluation

● Evaluation is key to building effective and efficient search engines.

● Drives advancement of search engines (when intuition fails)

● Measurement is usually carried out in controlled laboratory experiments (to control the many factors)

● Effectiveness: Measures the ability to find the right information

○ Compare ranking to user relevance feedback

● Efficiency: Measures ability to do this quickly

○ Measure time and space requirements

● Effectiveness, efficiency, and cost are related

○ Efficiency and cost targets may impact effectiveness.

Page 56: 10 - Information Retrieval - 1

Retrieval Evaluation

● How to objectively measure the quality of a (classification) experiment?

○ Compare your achieved results with a ground truth (gold standard)

● How to obtain a ground truth?

○ Often this requires manual effort…

● How to compare achieved results with a ground truth?

○ Correctness → Precision

○ Completeness → Recall

○ Correctness & Completeness → F-Measure

Page 57: 10 - Information Retrieval - 1

Confusion Matrix

● Contains information about relevant documents and documents retrieved by a search engine

● A table with two rows and two columns that reports the number of false positives, false negatives, true positives, and true negatives

                                  retrieved (search results)
                                  true             false
relevant (ground truth)   true    true positive    false negative
                          false   false positive   true negative
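The four cells of the confusion matrix can be counted with set operations. The following is a minimal sketch; the function name `confusion_counts`, the document IDs, and the collection size are hypothetical and chosen only for illustration.

```python
def confusion_counts(relevant, retrieved, collection):
    """Return (tp, fn, fp, tn) for one query over a document collection."""
    relevant, retrieved, collection = set(relevant), set(retrieved), set(collection)
    tp = len(relevant & retrieved)               # relevant and retrieved
    fn = len(relevant - retrieved)               # relevant but not retrieved
    fp = len(retrieved - relevant)               # retrieved but not relevant
    tn = len(collection - relevant - retrieved)  # neither relevant nor retrieved
    return tp, fn, fp, tn


# Hypothetical collection of 10 documents with IDs 1..10
collection = range(1, 11)
relevant = {1, 2, 3, 4}   # ground truth
retrieved = {3, 4, 5}     # search results

tp, fn, fp, tn = confusion_counts(relevant, retrieved, collection)
```

The four counts always add up to the size of the collection, which is a quick sanity check for a hand-built matrix.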

Page 58: 10 - Information Retrieval - 1

Recall and Precision

● Recall is the fraction of relevant documents that are retrieved

● Precision is the fraction of retrieved documents that are relevant

[Figure: Venn diagram of the sets "relevant documents" and "retrieved documents" with the regions true positives, false negatives, false positives, and true negatives]
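Expressed with the confusion matrix counts, recall is TP / (TP + FN) and precision is TP / (TP + FP). The sketch below computes both; the function names and the example counts are hypothetical, and returning 0.0 for an empty denominator is a convention assumed here, not something the slides prescribe.

```python
def recall(tp, fn):
    """Fraction of relevant documents that are retrieved: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """Fraction of retrieved documents that are relevant: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical counts: 2 true positives, 2 false negatives, 1 false positive
r = recall(tp=2, fn=2)     # 2 of 4 relevant documents were retrieved
p = precision(tp=2, fp=1)  # 2 of 3 retrieved documents are relevant
```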

Page 59: 10 - Information Retrieval - 1

F-Measure

● F1-Measure is the harmonic mean of precision and recall.

[Figure: Venn diagram of the sets "relevant documents" and "retrieved documents" with the regions true positives, false negatives, false positives, and true negatives]
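The harmonic mean of precision P and recall R is 2PR / (P + R). A minimal sketch follows; the function name `f1_measure` and the input values are hypothetical, and the return value 0.0 for P = R = 0 is an assumed convention for the otherwise undefined case.

```python
def f1_measure(precision, recall):
    """Harmonic mean of precision and recall: 2 * P * R / (P + R)."""
    if precision + recall == 0:
        return 0.0  # assumed convention for the undefined 0/0 case
    return 2 * precision * recall / (precision + recall)


f1 = f1_measure(precision=0.5, recall=1.0)
```

For P = 0.5 and R = 1.0 the result is about 0.67, which is below the arithmetic mean of 0.75: the harmonic mean penalizes an imbalance between the two values.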

Page 67: 10 - Information Retrieval - 1

Ranking Effectiveness

● Problem: Evaluate Ranking and not just a Boolean classification

● Idea: Calculate Recall and Precision at every rank position

[Figure: two rankings of ten documents each, with the relevant documents marked]

Ranking #1
Rank       1     2     3     4     5     6     7     8     9     10
Recall     0.17  0.17  0.33  0.5   0.67  0.83  0.83  0.83  0.83  1.0
Precision  1.0   0.5   0.67  0.75  0.8   0.83  0.71  0.63  0.56  0.6

Ranking #2
Rank       1     2     3     4     5     6     7     8     9     10
Recall     0.0   0.17  0.17  0.17  0.33  0.5   0.67  0.67  0.83  1.0
Precision  0.0   0.5   0.33  0.25  0.4   0.5   0.57  0.5   0.56  0.6
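The recall and precision rows can be recomputed by walking down a ranking and counting the relevant documents seen so far. In the sketch below, the relevance patterns are not given explicitly in the transcript; they are inferred from the positions at which the recall values increase, and the function name is hypothetical.

```python
def recall_precision_at_ranks(relevance, total_relevant):
    """Return (recalls, precisions) with one entry per rank position.

    relevance: list of 0/1 flags, 1 if the document at that rank is relevant.
    total_relevant: number of relevant documents for the query.
    """
    recalls, precisions = [], []
    hits = 0
    for rank, is_relevant in enumerate(relevance, start=1):
        hits += is_relevant
        recalls.append(hits / total_relevant)
        precisions.append(hits / rank)
    return recalls, precisions


# Relevance patterns inferred from the recall rows (1 = relevant document)
ranking_1 = [1, 0, 1, 1, 1, 1, 0, 0, 0, 1]
ranking_2 = [0, 1, 0, 0, 1, 1, 1, 0, 1, 1]

recalls_1, precisions_1 = recall_precision_at_ranks(ranking_1, total_relevant=6)
recalls_2, precisions_2 = recall_precision_at_ranks(ranking_2, total_relevant=6)
```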

Page 68: 10 - Information Retrieval - 1

Summarizing a Ranking

● Problem: Long lists are difficult to compare

● Ideas:

1. Calculate recall and precision at a small number of fixed rank positions

■ Compare two rankings:

● If precision at position p is higher for one ranking, its recall at position p is higher too.

● “Precision at rank p” (p=5, p=10, p=20)

● Ignores the ranking after position p and the order within positions 1 to p.

2. Average the precision values from the rank positions where relevant documents are retrieved
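Precision at rank p only counts how many of the first p results are relevant. The sketch below applies it to the two rankings of the "Ranking Effectiveness" slide; the 0/1 relevance patterns are inferred from the recall values given there, and the function name `precision_at` is hypothetical.

```python
def precision_at(relevance, p):
    """Precision over the first p results; the order within the top p is ignored."""
    return sum(relevance[:p]) / p


# Relevance patterns inferred from the recall rows (1 = relevant document)
ranking_1 = [1, 0, 1, 1, 1, 1, 0, 0, 0, 1]
ranking_2 = [0, 1, 0, 0, 1, 1, 1, 0, 1, 1]

p5_ranking_1 = precision_at(ranking_1, 5)    # 4 relevant documents in the top 5
p5_ranking_2 = precision_at(ranking_2, 5)    # 2 relevant documents in the top 5
p10_ranking_1 = precision_at(ranking_1, 10)  # 6 relevant documents in the top 10
p10_ranking_2 = precision_at(ranking_2, 10)  # 6 relevant documents in the top 10
```

At p=5 the measure separates the two rankings, but at p=10 both have the same precision although ranking #1 places its relevant documents higher. This is the limitation noted above: the order within positions 1 to p is ignored.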

Page 69: 10 - Information Retrieval - 1

Average Precision

● Rankings #1 and #2 with their recall and precision values are the same as on the "Ranking Effectiveness" slide

● Average precision for ranking #1: (1.0+0.67+0.75+0.8+0.83+0.6)/6 = 0.78

● Average precision for ranking #2: (0.5+0.4+0.5+0.57+0.56+0.6)/6 = 0.52

● Emphasizes top-ranked documents
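The calculation above can be sketched as follows. The relevance patterns are inferred from the recall values of the two rankings, the function name is hypothetical, and the sketch assumes that every relevant document appears somewhere in the ranking (as is the case in this example).

```python
def average_precision(relevance):
    """Mean of the precision values at the ranks where a relevant document appears.

    Assumption: every relevant document occurs somewhere in the ranking.
    """
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Relevance patterns inferred from the recall rows (1 = relevant document)
ranking_1 = [1, 0, 1, 1, 1, 1, 0, 0, 0, 1]
ranking_2 = [0, 1, 0, 0, 1, 1, 1, 0, 1, 1]

ap_1 = average_precision(ranking_1)
ap_2 = average_precision(ranking_2)
```

Without intermediate rounding the results are approximately 0.775 and 0.521, which agrees with the rounded values 0.78 and 0.52.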

Page 70: 10 - Information Retrieval - 1

Mean Average Precision

● Each ranking produces an average precision

● Mean Average Precision (MAP):

○ Summarizes rankings from multiple queries by averaging their average precision values

○ Most commonly used measure in research papers

○ Requires many relevance judgements

Page 71: 10 - Information Retrieval - 1

Mean Average Precision

[Figure: result rankings for two queries, with the relevant documents for each query marked]

Result #1 (query #1, 5 relevant documents)
Rank       1     2     3     4     5     6     7     8     9     10
Recall     0.2   0.2   0.4   0.4   0.4   0.6   0.6   0.6   0.8   1.0
Precision  1.0   0.5   0.67  0.5   0.4   0.5   0.43  0.38  0.44  0.5

Result #2 (query #2, 3 relevant documents)
Rank       1     2     3     4     5     6     7     8     9     10
Recall     0.0   0.33  0.33  0.33  0.67  0.67  1.0   1.0   1.0   1.0
Precision  0.0   0.5   0.33  0.25  0.4   0.33  0.43  0.38  0.33  0.3

Average precision for result #1: (1.0+0.67+0.5+0.44+0.5)/5 = 0.62

Average precision for result #2: (0.5+0.4+0.43)/3 = 0.44

Mean Average Precision MAP = (0.62+0.44)/2 = 0.53
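A minimal sketch of the MAP calculation for the two queries follows. The relevance patterns are inferred from the positions at which the recall values increase; the function names are hypothetical.

```python
def average_precision(relevance):
    """Average of the precision values at the ranks of the relevant documents."""
    hits, total = 0, 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(rankings):
    """Average the per-query average precision values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Relevance patterns inferred from the recall rows (1 = relevant document)
query_1 = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]  # 5 relevant documents
query_2 = [0, 1, 0, 0, 1, 0, 1, 0, 0, 0]  # 3 relevant documents

map_score = mean_average_precision([query_1, query_2])
```

The unrounded result is approximately 0.533, in line with the value 0.53 obtained from the rounded per-query averages.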

Page 74: 10 - Information Retrieval - 1

4. Information Retrieval Syllabus Questions

● What are the main components of Linked Data driven Web applications and how do they interact?

● Explain the fundamental concepts of Information Retrieval

● Explain the Architecture of an IR System

● Explain the Boolean Retrieval model. What are its benefits and its drawbacks?

● Explain the Vector Space Retrieval model. What are its benefits and its drawbacks?

● Explain how the ranking of search results can be evaluated.