10 - information retrieval - 1
TRANSCRIPT
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
Information Service Engineering
Lecture 10: Linked Data Engineering - 5 and Information Retrieval - 1
Prof. Dr. Harald Sack
FIZ Karlsruhe - Leibniz Institute for Information Infrastructure
AIFB - Karlsruhe Institute of Technology
Summer Semester 2017
This file is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0)
Last Lecture: Linked Data Engineering - 4
3.1 Knowledge Representations and Ontologies
3.2 Semantic Web and the Web of Data
3.3 Linked Data Principles
3.4 How to name Things - URIs
3.5 Resource Description Framework (RDF)
3.6 Creating new Models with RDFS
3.7 Querying RDF(S) with SPARQL
3.8 More Expressivity with Web Ontology Language (OWL)
3.9 Wikipedia, DBpedia, and Wikidata
3.10 Linked Data Programming
● From Wikipedia to DBpedia
● Differences between DBpedia and Wikidata
Lecture 10: Linked Data Engineering - 5
3.1 Knowledge Representations and Ontologies
3.2 Semantic Web and the Web of Data
3.3 Linked Data Principles
3.4 How to name Things - URIs
3.5 Resource Description Framework (RDF) as simple Data Model
3.6 Creating new Models with RDFS
3.7 Querying RDF(S) with SPARQL
3.8 More Expressivity with Web Ontology Language (OWL)
3.9 Wikipedia, DBpedia, and Wikidata
3.10 Linked Data Programming
3.10 Linked Data Programming
Linked Data Driven Web Applications
● Required Components:
○ Local RDF Store
■ caching of results
■ permanent storage
○ Logic (Controller) and User Interface
■ (= business logic)
■ (not LOD specific)
○ Data Integration component
■ get data directly from LOD-Cloud or
■ via Semantic Indexer
○ Data Re-/Publishing component
■ write back application dependent data into the Web of Data
M. Hausenblas: Linked Data Applications, DERI Technical Report, 2009
Linked Data Driven Web Applications
● The easiest way is to make use of a suitable library:
○ SPARQL JavaScript Library: http://www.thefigtrees.net/lee/blog/2006/04/sparql_calendar_demo_a_sparql.html
○ ARC for SPARQL (PHP): https://github.com/semsol/arc2/wiki
○ dotNetRDF (C#): https://dotnetrdf.github.io/
○ Jena/ARQ (Java): http://jena.apache.org/
○ Sesame (Java): http://rdf4j.org/
○ SPARQL Wrapper (Python): http://rdflib.github.io/sparqlwrapper/
Linked Data Programming Example
● The easiest way is to make use of a suitable library:
○ SPARQL Wrapper (Python): http://rdflib.github.io/sparqlwrapper/
● Access to Linked Data via SPARQL endpoints
○ let's choose DBpedia (just for simplicity...): http://dbpedia.org/sparql
● ...now we have to think of a simple example...
Linked Data Programming Example
● Example Application:
○ Build a simple application that looks for today's birthdays of famous people, e.g. authors
○ Create a list of authors whose birthday is today, including some additional information, e.g.
■ Year of Birth
■ Short description
○ Let's create a simple (local) web page for the task (i.e. encode results in HTML), which can be displayed in the browser
○ We use Python and the SPARQL Wrapper library
Prerequisites - (Manual) Data Analysis
http://dbpedia.org/page/Alexandre_Dumas
● Choose a representative example:
○ E.g. Alexandre Dumas from DBpedia
Prerequisites - (Manual) Data Analysis
● Data Analysis via SPARQL
○ What kind of entities are you looking for?
■ ?author rdf:type dbo:Writer .
○ What information do you need?
■ ?author dbo:birthDate ?birthdate .
■ ?author rdfs:label ?name .
■ ?author rdfs:comment ?description .
■ OPTIONAL { ?author dbo:thumbnail ?thumbnail }
○ Any filter criteria?
■ (lang(?name)="en") && (lang(?description)="en")
■ (SUBSTR(STR(?birthdate),6)="07-05")
■ More sophisticated: SUBSTR(STR(bif:curdate('')),6), where bif:curdate is a Virtuoso triple store built-in function for the current date
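The triple patterns and filters from this analysis can be assembled into one query string. The following is a minimal Python sketch under stated assumptions: the function name `build_birthday_query`, the `LIMIT`, and the `ORDER BY` clause are illustrative and not taken from the slides, and the month-day comparison assumes birth dates of the form YYYY-MM-DD.

```python
from datetime import date

# Namespace prefixes used by the triple patterns below.
PREFIXES = """\
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo:  <http://dbpedia.org/ontology/>
"""


def build_birthday_query(day: date, limit: int = 20) -> str:
    """Return a SPARQL query for writers whose birthday falls on `day`.

    The name, the LIMIT and the ORDER BY are illustrative assumptions.
    """
    month_day = day.strftime("%m-%d")  # e.g. "07-05"
    return PREFIXES + f"""
SELECT DISTINCT ?author ?name ?birthdate ?description ?thumbnail
WHERE {{
  ?author rdf:type dbo:Writer .
  ?author dbo:birthDate ?birthdate .
  ?author rdfs:label ?name .
  ?author rdfs:comment ?description .
  OPTIONAL {{ ?author dbo:thumbnail ?thumbnail }}
  FILTER ( (lang(?name) = "en") && (lang(?description) = "en") )
  FILTER ( SUBSTR(STR(?birthdate), 6) = "{month_day}" )
}}
ORDER BY ?birthdate
LIMIT {limit}
"""


query = build_birthday_query(date(2017, 7, 5))
```

Computing the month and day on the client side keeps the query portable; the `bif:curdate` variant only works on Virtuoso-based endpoints.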
Prerequisites - SPARQL Queries
SPARQL query at dbpedia.org
Prerequisites - Download & Install
● You should have Python installed on your computer
○ https://www.python.org/downloads/
● Download SPARQL Wrapper for Python
○ http://rdflib.github.io/sparqlwrapper/
● Follow the instructions for Installation
Linked Data Programming Example
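A minimal sketch of how such a program might look with the SPARQL Wrapper library. The function names `fetch_bindings` and `render_html`, the page layout, and the sample data are assumptions made for illustration; the code also assumes the result variables ?name, ?birthdate, ?description and (optionally) ?thumbnail, and four-digit birth years.

```python
import html

ENDPOINT = "http://dbpedia.org/sparql"  # the DBpedia endpoint chosen above


def fetch_bindings(query, endpoint=ENDPOINT):
    """Send a SELECT query to the endpoint and return the JSON result bindings."""
    # Imported lazily so that the rendering code also runs without the library.
    from SPARQLWrapper import SPARQLWrapper, JSON
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()["results"]["bindings"]


def render_html(bindings, title="Today's birthdays"):
    """Encode result bindings as a small HTML page; all values are escaped."""
    items = []
    for b in bindings:
        name = html.escape(b["name"]["value"])
        year = html.escape(b["birthdate"]["value"][:4])  # assumes YYYY-MM-DD
        text = html.escape(b["description"]["value"])
        image = ""
        if "thumbnail" in b:  # bound only if the OPTIONAL pattern matched
            url = html.escape(b["thumbnail"]["value"], quote=True)
            image = f'<img src="{url}" alt=""> '
        items.append(f"<li>{image}<b>{name}</b> (born {year}): {text}</li>")
    body = "\n".join(items)
    heading = html.escape(title)
    return (f'<html><head><meta charset="utf-8"><title>{heading}</title></head>\n'
            f"<body><h1>{heading}</h1>\n<ul>\n{body}\n</ul></body></html>")


# Hand-written sample in the shape of a SPARQL JSON result (not live data).
sample_bindings = [{
    "name": {"value": "Alexandre Dumas"},
    "birthdate": {"value": "1802-07-24"},
    "description": {"value": "French writer <novelist>"},
}]
page = render_html(sample_bindings)
```

With network access and the library installed, `render_html(fetch_bindings(query))` would produce the page for a real query; writing the returned string to a local .html file makes it viewable in the browser.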
Linked Data Programming Example
● Alternative: simple example with Jena ARQ and Java:
http://jena.apache.org/
// Jena 2.x package name; in Jena 3 and later use org.apache.jena.query.*
import com.hp.hpl.jena.query.*;

String service = "...";      // address of the SPARQL endpoint
String query = "SELECT ..."; // your SPARQL query
QueryExecution e = QueryExecutionFactory.sparqlService(service, query);
ResultSet results = e.execSelect();
while (results.hasNext()) {
    QuerySolution s = results.nextSolution();
    // ...
}
e.close();
You may also try out Wikidata...
SPARQL query at wikidata.org
Lecture Overview
1. Information, Natural Language and the Web
2. Natural Language Processing
3. Linked Data Engineering
4. Information Retrieval
5. Knowledge Mining
6. Exploratory Search and Recommender Systems
Lecture 10: Information Retrieval
4.1 A Brief History of Libraries and IR
4.2 Fundamental Concepts of IR
4.3 Information Retrieval Models
4.4 Retrieval Evaluation
4.5 Web Information Retrieval
4.6 Document Crawling, Text Processing, and Indexing
4.7 Query Processing and Result Representation
4.8 Question Answering
4.1 A Brief History of Libraries and IR
Information Retrieval
● “Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.” (Gerard Salton, 1968, [1])
● “IR is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).” (Manning et al., 2008, [2])
How old is Information Retrieval?
Libraries and Information Retrieval
● 3,000 - 2,000 BC Sumerian archives
○ Clay tablets in cuneiform script stored in temple rooms
○ Mostly inventories and records of commercial transactions
● 300 BC Library of Alexandria
○ Idea: a universal library holding copies of all the world’s books
○ At its height, the library contained almost 750,000 books in the form of papyrus scrolls
https://commons.wikimedia.org/wiki/File:Milkau_Oberer_Teil_der_Stele_mit_dem_Text_von_Hammurapis_Gesetzescode_369-2.jpg
Libraries and Information Retrieval
● Middle Ages Monastic Libraries
○ Christian monks saved texts of Roman and Greek antiquity from getting lost by copying them
○ Vatican Library founded in 1475
● c. 1450 Printing Press
○ Johannes Gutenberg introduced movable type to Europe
○ Copying books became much easier and less expensive
https://commons.wikimedia.org/wiki/File:Buchdrucker-1568.png
Libraries and Information Retrieval
● German National Library
○ 24M items
○ Located in Leipzig, Frankfurt (Main), and Berlin
● Library of Congress
○ the world’s largest library
○ 155M items
○ Classification system: Library of Congress Classification
Library Catalog and Index
● Items are catalogued by metadata
○ Author, Editor, Title, ISBN, ...
○ Keyword, e.g. “information retrieval”
○ Subject area, e.g. “information systems”
○ Specialized classification systems
■ Library of Congress Classification (LCC)
■ Dewey Decimal Classification (DDC)
■ Universal Decimal Classification (UDC)
http://www.worldcat.org/title/modern-information-retrieval/oclc/40602840
Full Text Search and Concordance
● Catalogue cards serve as document proxies
● Experts must catalogue each item individually
● Full text search: every word is a keyword
https://commons.wikimedia.org/wiki/File%3ASchlagwortkatalog.jpg
Full Text Search and Concordance
● Before information retrieval, in the pre-computer era: Concordances
○ Alphabetical list of the principal words used in a book, listing every instance of each word with its immediate context
○ Only for works of special importance, e.g. the Bible
○ First Bible concordance by Hugh of Saint-Cher, with the help of 500 monks, c. 1250
https://commons.wikimedia.org/w/index.php?title=File:A_Concordance_to_the_English_Poems_of_Thomas_Gray_(1908).djvu&page=17
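What the monks compiled by hand can be sketched in a few lines of Python. This is an illustration only: the function name `build_concordance`, the context width, and the tiny stopword set standing in for "non-principal" words are assumptions, not part of the lecture.

```python
import re
from collections import defaultdict


def build_concordance(text, context=3, stopwords=frozenset()):
    """Map each principal word to all its occurrences with surrounding context.

    Returns a dict with alphabetically ordered keys; each value is a list of
    (left context, word as written, right context) tuples.
    """
    tokens = re.findall(r"[A-Za-z']+", text)
    entries = defaultdict(list)
    for i, token in enumerate(tokens):
        word = token.lower()
        if word in stopwords:  # skip words not considered "principal"
            continue
        left = " ".join(tokens[max(0, i - context):i])
        right = " ".join(tokens[i + 1:i + 1 + context])
        entries[word].append((left, token, right))
    return dict(sorted(entries.items()))


line = "The curfew tolls the knell of parting day"
concordance = build_concordance(line, context=2, stopwords={"the", "of"})
for word, occurrences in concordance.items():
    for left, token, right in occurrences:
        print(f"{word:10} {left:>12} [{token}] {right}")
```

Every occurrence of every word is listed, which is why a concordance is essentially a full text index built before computers existed.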
Early Information Retrieval
● 1957: Hans Peter Luhn (IBM) uses words as indexing units for documents
○ Measure similarity between documents by word overlap
● 1960s and 1970s: Gerard Salton and his students (Harvard, Cornell) create the SMART system
○ Vector space model
○ Relevance feedback
● 1972: Karen Spärck Jones introduced inverse document frequency
Information Retrieval Timeline
● 1992: TREC - annual Text REtrieval Conference
○ Sponsored by the U.S. National Institute of Standards and Technology and the U.S. Department of Defense
○ many different tracks, e.g. blogs, genomics, spam, video, etc.
○ Provides data sets and test problems
● 1994: WebCrawler, the first full-text Web search engine
● 1998: Google
● Current Research Questions:
○ Scalability, Speed, Quality
● IR related Research at ISE:
○ Semantic Search, Exploratory Search, Question Answering
4.2 Fundamental Concepts of IR
Information Retrieval Basics
● A document is a coherent passage of free text
● Examples:
○ Web pages, email, books, news stories, scholarly papers, text messages, Word™, PowerPoint™, PDF, forum postings, patents, IM sessions, dictionary entries etc.
● Common properties:
○ Written in natural language
○ Significant text content
○ Some structure, e.g.
■ Papers: title, author, date, or
■ Email: subject, sender, destination, date
https://pixabay.com/en/files-paper-office-paperwork-stack-1614223/
Information Retrieval Basics
● A document collection is a set of documents
○ also known as corpus
○ usually, all documents within a collection are similar with respect to some criterion
● Examples:
○ Chinese Patents
○ The articles covered by The New York Times
○ Amazon Product Reviews
○ The Web
Information Retrieval Basics
● An information need is the topic about which the user (or a group of users) desires to know more
○ Refers to an individual, hidden cognitive state
○ Paradoxical: It describes the user’s ignorance
○ Ill-defined
● Examples:
○ What is the capital of Uruguay?
○ Is it really true that Elvis is still alive?
○ Show me some definitions of “information need”!
Information Retrieval Basics
● A query is what the user conveys to the computer in an attempt to communicate the information need
○ Stated in a formal query language
○ Usually a list of search terms (keywords)
● Keyword queries are often poor descriptions of actual information needs
○ E.g., a query for “jaguar” could refer to places to buy Jaguar cars or to the cat
● Search queries (in particular one-word queries) are under-specified.
○ Semantics of long queries are ignored
Information Retrieval Basics
● A document is relevant with respect to some user’s information need if the user perceives it as containing information of value with respect to this information need
○ Usually assumed to be a binary concept, but could also be graded
● Example:
○ Information need: “What is relevance in IR?”
● Relevant document:
○ Wikipedia’s entry “Relevance (information retrieval)”
The Information Retrieval Paradigm
[Figure: Set of Queries → Query Formulation → query; Set of Documents → Indexing → index; query and index: matches based on (string) similarity]
https://pixabay.com/en/post-it-paper-notes-record-memory-1079361/
Classical Information Retrieval - Simplified Form
[Figure labels: search term(s), search query, keyword(s), search index, document, document corpus]
IR System Architecture - Indexing
Basic Building Blocks
[Figure: components Text Acquisition, Text Transformation, Index Creation; data stores Document Store, Index]
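A minimal sketch of the indexing side, assuming a tiny in-memory document store. The names `transform` and `build_index`, the stopword list, and the three sample documents are illustrative assumptions; real systems add steps such as stemming and store positions and frequencies in the index.

```python
import re
from collections import defaultdict

STOPWORDS = frozenset({"a", "an", "and", "is", "of", "the", "to"})

# Text acquisition: here simply a hand-written document store (id -> text).
document_store = {
    1: "Information retrieval is the retrieval of documents.",
    2: "The vector space model ranks documents.",
    3: "Boolean retrieval returns a set of documents.",
}


def transform(text):
    """Text transformation: lowercase, tokenize, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def build_index(documents):
    """Index creation: map each term to the sorted ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in transform(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


index = build_index(document_store)
print(index["retrieval"])
```

The resulting structure is an inverted index: instead of looking up the words of a document, one looks up the documents of a word.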
IR System Architecture - Querying
Basic Building Blocks
[Figure: components User Interaction, Ranking, Evaluation; data stores Index, Document Store, Log Data]
Retrieval model uses queries and index to generate a ranked list of documents
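A minimal sketch of the querying side, assuming an inverted index that maps terms to document ids. The scoring rule (number of matching query terms) is a deliberately simple stand-in for a real retrieval model, and the names `rank`, `search`, and `query_log` are hypothetical.

```python
from collections import Counter

# Toy inverted index (term -> ids of documents containing the term).
index = {"retrieval": [1, 3], "boolean": [3], "vector": [2]}
query_log = []  # log data, later usable for evaluation


def rank(query_terms, index):
    """Ranking: score each document by the number of query terms it contains."""
    scores = Counter()
    for term in query_terms:
        for doc_id in index.get(term, []):
            scores[doc_id] += 1
    # Highest score first; ties are broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def search(query, index):
    """User interaction: accept a keyword query, log it, return a ranked list."""
    query_log.append(query)
    return rank(query.lower().split(), index)


results = search("Boolean retrieval", index)
print(results)
```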
4.3 Information Retrieval Models
Categorization of IR Models
● Set-theoretic models
○ represent documents as sets of words or phrases
○ similarities are usually derived from set-theoretic operations on those sets
● Algebraic models
○ represent documents and queries usually as vectors, matrices, or tuples
○ similarity of the query vector and document vector is represented as a scalar value
● Probabilistic models
○ treat the process of document retrieval as probabilistic inference
○ similarities are computed as probabilities that a document is relevant for a given query
● Models without term interdependencies
○ treat different terms/words as independent
○ in vector space models this is represented by the orthogonality assumption of term vectors
○ in probabilistic models this is represented by an independence assumption for term variables
● Models with immanent term interdependencies
○ allow a representation of interdependencies between terms
○ the interdependency between two terms is defined by the model itself
● Models with transcendent term interdependencies
○ do not specify how the interdependency between two terms is defined
○ rely on an external source for the degree of interdependency between two terms
Dominik Kuropka: Modelle zur Repräsentation natürlichsprachlicher Dokumente. Ontologie-basiertes Information-Filtering und -Retrieval mit relationalen Datenbanken, Advances in Information Systems and Management Science, Bd. 10, Logos Verlag, Berlin, 2004.
Boolean Retrieval Model
● Propositional logic as retrieval language
● selection and connection of arbitrary document sets via Boolean connectors (search operators)
● easy to implement
● no differentiated term weights
● no ranking
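A minimal sketch of Boolean retrieval over an inverted index, assuming queries are given as nested tuples such as ("AND", "jaguar", ("NOT", "car")). The tuple syntax, the function name `evaluate`, and the toy index are assumptions for illustration.

```python
def evaluate(expr, index, all_docs):
    """Evaluate a Boolean query expression to the set of matching document ids."""
    if isinstance(expr, str):  # a single search term
        return set(index.get(expr, set()))
    operator, *operands = expr
    if operator == "NOT":
        return all_docs - evaluate(operands[0], index, all_docs)
    left, right = (evaluate(o, index, all_docs) for o in operands)
    if operator == "AND":
        return left & right
    if operator == "OR":
        return left | right
    raise ValueError(f"unknown operator: {operator}")


index = {"jaguar": {1, 2, 4}, "car": {2, 3}, "cat": {1, 4}}
all_docs = {1, 2, 3, 4}

# Documents about jaguars that do not mention cars.
hits = evaluate(("AND", "jaguar", ("NOT", "car")), index, all_docs)
print(sorted(hits))
```

The result is an unordered set: a document either satisfies the expression or it does not, which is exactly why the model offers no ranking and no term weights.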
Vector Space Model
● Documents and queries are represented as points in a high-dimensional vector space ℝ^n
● for retrieval, the Euclidean distance or the cosine similarity between search query vector and document vector is used
● ranking according to distance
● differentiated term weights
● linear order of terms in documents is lost
● No semantic sensitivity (vocabulary dependency)
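A minimal sketch of vector space ranking, assuming tf-idf term weights with idf = log(N / df) and cosine similarity. This is one common weighting choice, not necessarily the one used later in the lecture; the toy documents and all names are illustrative.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return sparse tf-idf vectors (dicts) for tokenized documents, plus idf."""
    n = len(docs)
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = {
        doc_id: {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for doc_id, tokens in docs.items()
    }
    return vectors, idf


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if one of them is zero."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = {
    "d1": "jaguar car dealer".split(),
    "d2": "jaguar cat habitat".split(),
    "d3": "car dealer prices".split(),
}
vectors, idf = tf_idf_vectors(docs)

query_tf = Counter("jaguar cat".split())
query_vector = {term: tf * idf.get(term, 0.0) for term, tf in query_tf.items()}

ranking = sorted(docs, key=lambda d: cosine(query_vector, vectors[d]), reverse=True)
print(ranking)
```

Because each document is reduced to a bag of weighted terms, the word order is lost, and a document using only synonyms of the query terms gets similarity 0.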
Probabilistic Model
● Documents are weighted according to their relevance for a search query
● The IR system estimates the probability of relevance for a search query
● term weights for terms t_i for a search query q
● for a new document d_m, the relevance of d_m for the search query q can be determined via the term weights of the terms t_i
[Figure label: Relevance feedback for search query q]
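One well-known way to obtain such term weights from relevance feedback is the Robertson/Spärck Jones weight with 0.5 smoothing. The sketch below uses it as an assumption for illustration; the slide's own formula is not part of this transcript, and the counts in the example are invented.

```python
import math


def rsj_weight(n_docs, n_term, n_rel, n_rel_term):
    """Robertson/Spärck Jones relevance weight of a term (0.5 smoothing).

    n_docs:     number of documents in the collection
    n_term:     number of documents containing the term
    n_rel:      number of documents judged relevant for the query
    n_rel_term: number of relevant documents containing the term
    """
    odds_in_relevant = (n_rel_term + 0.5) / (n_rel - n_rel_term + 0.5)
    odds_in_others = (n_term - n_rel_term + 0.5) / (
        n_docs - n_term - n_rel + n_rel_term + 0.5)
    return math.log(odds_in_relevant / odds_in_others)


def score(document_terms, weights):
    """Relevance score of a new document: sum of weights of terms it contains."""
    return sum(w for term, w in weights.items() if term in document_terms)


# Invented feedback statistics: 100 documents, 10 of them judged relevant.
stats = {"jaguar": (20, 8), "the": (100, 10)}  # term -> (n_term, n_rel_term)
weights = {t: rsj_weight(100, n, 10, r) for t, (n, r) in stats.items()}

new_document = {"jaguar", "cat"}
print(score(new_document, weights))
```

A term that occurs mostly in relevant documents receives a positive weight, while a term that occurs everywhere does not help to separate relevant from non-relevant documents.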
4.4 Retrieval Evaluation
Retrieval Evaluation
[Figure: components User Interaction, Ranking, Evaluation; data stores Index, Document Store, Log Data]
Evaluation monitors and measures effectiveness and efficiency (primarily offline)
Retrieval Evaluation
● Evaluation is key to building effective and efficient search engines.
● Drives advancement of search engines (when intuition fails)
● Measurement usually carried out in controlled laboratory experiments (to control the many factors)
● Effectiveness: Measures the ability to find the right information
○ Compare ranking to user relevance feedback
● Efficiency: Measures the ability to do this quickly
○ Measure time and space requirements
● Effectiveness, efficiency, and cost are related
○ Efficiency and cost targets may impact effectiveness.
Retrieval Evaluation
● How to objectively measure the quality of a (classification) experiment?
○ Compare your achieved results with a ground truth (gold standard)
● How to achieve a ground truth?
○ Often this means investing manual effort…
● How to compare achieved results with a ground truth?
○ Correctness → Precision
○ Completeness → Recall
○ Correctness & Completeness → F-Measure
Confusion Matrix
● Contains information about relevant documents and documents retrieved by a search engine
● A table with two rows and two columns that reports the number of false positives, false negatives, true positives, and true negatives.

                                   retrieved (search results)
                                   true              false
relevant (ground truth)   true     true positive     false negative
                          false    false positive    true negative
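The four cells can be counted directly from sets of document ids. A minimal sketch, assuming a hypothetical collection of ten documents; the function name and the example sets are illustrative.

```python
def confusion_matrix(all_docs, relevant, retrieved):
    """Count the four cells of the confusion matrix from sets of document ids."""
    return {
        "true_positive": len(relevant & retrieved),    # relevant and retrieved
        "false_negative": len(relevant - retrieved),   # relevant, but missed
        "false_positive": len(retrieved - relevant),   # retrieved, not relevant
        "true_negative": len(all_docs - relevant - retrieved),
    }


all_docs = set(range(1, 11))  # hypothetical collection of 10 documents
relevant = {1, 2, 3, 4}       # ground truth
retrieved = {3, 4, 5}         # search results

cells = confusion_matrix(all_docs, relevant, retrieved)
print(cells)
```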
Recall and Precision
● Recall is the fraction of relevant documents that are retrieved
● Precision is the fraction of retrieved documents that are relevant
[Figure: Venn diagram of relevant documents and retrieved documents with the regions True Positives, False Negatives, False Positives, and True Negatives]
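Written out in terms of the document sets and the confusion matrix counts, the two definitions read as follows (standard notation; TP, FN, and FP abbreviate the numbers of true positives, false negatives, and false positives):

```latex
\mathrm{Recall} = \frac{|\mathrm{relevant} \cap \mathrm{retrieved}|}{|\mathrm{relevant}|} = \frac{TP}{TP + FN}
\qquad
\mathrm{Precision} = \frac{|\mathrm{relevant} \cap \mathrm{retrieved}|}{|\mathrm{retrieved}|} = \frac{TP}{TP + FP}
```

For example, with 4 relevant documents, 3 retrieved documents, and 2 documents in both sets, recall is 2/4 = 0.5 and precision is 2/3 ≈ 0.67.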
F-Measure
● F1-Measure is the harmonic mean of precision and recall.
[Figure: Venn diagram of relevant documents and retrieved documents with the regions True Positives, False Negatives, False Positives, and True Negatives]
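The harmonic mean of precision P and recall R is 2PR / (P + R). A minimal sketch; returning 0.0 when both values are 0 is a convention assumed here to avoid a division by zero.

```python
def f1_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention for the undefined case
    return 2 * precision * recall / (precision + recall)


# Example: precision 2/3 and recall 1/2 give F1 = 4/7.
print(round(f1_measure(2 / 3, 1 / 2), 3))
```

The harmonic mean is dominated by the smaller of the two values, so a system cannot reach a high F1 by maximizing only precision or only recall.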
Ranking Effectiveness
● Problem: Evaluate Ranking and not just a Boolean classification
● Idea: Calculate Recall and Precision at every rank position
4. Information Retrieval / 4.4 Retrieval Evaluation
= relevant documents
Ranking #1
Ranking #2
Recall 0.17 0.17 0.33 0.5Precision 1.0 0.5 0.67 0.75
Recall 0.0 0.17 0.17 0.17Precision 0.0 0.5 0.33 0.25
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
Ranking Effectiveness
● Problem: Evaluate Ranking and not just a Boolean classification
● Idea: Calculate Recall and Precision at every rank position
4. Information Retrieval / 4.4 Retrieval Evaluation
= relevant documents
Ranking #1
Ranking #2
Recall 0.17 0.17 0.33 0.5 0.67Precision 1.0 0.5 0.67 0.75 0.8
Recall 0.0 0.17 0.17 0.17 0.33Precision 0.0 0.5 0.33 0.25 0.4
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
Ranking Effectiveness
● Problem: Evaluate Ranking and not just a Boolean classification
● Idea: Calculate Recall and Precision at every rank position
4. Information Retrieval / 4.4 Retrieval Evaluation
= relevant documents
Ranking #1
Ranking #2
Recall 0.17 0.17 0.33 0.5 0.67 0.83Precision 1.0 0.5 0.67 0.75 0.8 0.83
Recall 0.0 0.17 0.17 0.17 0.33 0.5Precision 0.0 0.5 0.33 0.25 0.4 0.5
Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
Ranking Effectiveness
● Problem: Evaluate Ranking and not just a Boolean classification
● Idea: Calculate Recall and Precision at every rank position
4. Information Retrieval / 4.4 Retrieval Evaluation
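[Figure: two rankings of 10 documents with the relevant documents highlighted. Per the recall values, the relevant documents are at ranks 1, 3, 4, 5, 6, 10 in Ranking #1 and at ranks 2, 5, 6, 7, 9, 10 in Ranking #2.]

Ranking #1
Rank        1     2     3     4     5     6     7     8     9     10
Recall      0.17  0.17  0.33  0.5   0.67  0.83  0.83  0.83  0.83  1.0
Precision   1.0   0.5   0.67  0.75  0.8   0.83  0.71  0.63  0.56  0.6

Ranking #2
Rank        1     2     3     4     5     6     7     8     9     10
Recall      0.0   0.17  0.17  0.17  0.33  0.5   0.67  0.67  0.83  1.0
Precision   0.0   0.5   0.33  0.25  0.4   0.5   0.57  0.5   0.56  0.6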
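The rank-by-rank values can be reproduced with a short loop. This is a sketch under the assumption that a ranking is given as a list of 0/1 relevance flags; the flags for Ranking #1 are inferred from the recall values on the slide, and the function name is hypothetical.

```python
def precision_recall_by_rank(is_relevant, total_relevant):
    """Return a list of (rank, recall, precision) tuples, one per rank position."""
    rows = []
    hits = 0  # relevant documents seen so far
    for rank, rel in enumerate(is_relevant, start=1):
        hits += int(rel)
        rows.append((rank, hits / total_relevant, hits / rank))
    return rows


# Relevance flags for Ranking #1, inferred from the recall row (6 relevant documents).
ranking_1 = [1, 0, 1, 1, 1, 1, 0, 0, 0, 1]

for rank, rec, prec in precision_recall_by_rank(ranking_1, total_relevant=6):
    # Printed values can differ from the slide in the last digit,
    # e.g. 5/8 = 0.625 sits exactly on a rounding boundary.
    print(f"rank {rank:2d}  recall {rec:.2f}  precision {prec:.2f}")
```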
Summarizing a Ranking
● Problem: Long lists are difficult to compare
● Ideas:
1. Calculate recall and precision at a small number of fixed rank positions
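■ Compare two rankings for the same query:
● If precision at position p is higher, recall at position p is higher too, because both count the same relevant documents among the top p.
● “Precision at rank p” (p=5, p=10, p=20)
● Ignores the ranking after position p and the order of documents within positions 1 to p.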
2. Average the precision values from the rank positions where relevant
documents are retrieved
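The first idea, precision at a fixed rank p, can be sketched as follows. The relevance flags are inferred from the recall values of Ranking #1 and Ranking #2 in this lecture; the function name `precision_at` is an assumption for illustration.

```python
def precision_at(p, is_relevant):
    """Precision over the top p positions of a ranking given as 0/1 flags."""
    return sum(is_relevant[:p]) / p


# Relevance flags inferred from the recall rows of the two example rankings.
ranking_1 = [1, 0, 1, 1, 1, 1, 0, 0, 0, 1]
ranking_2 = [0, 1, 0, 0, 1, 1, 1, 0, 1, 1]

print(precision_at(5, ranking_1))   # 0.8
print(precision_at(5, ranking_2))   # 0.4
print(precision_at(10, ranking_1))  # 0.6
print(precision_at(10, ranking_2))  # 0.6
```

At p=10 both rankings score 0.6 although Ranking #1 places relevant documents much earlier, which illustrates that the measure ignores the order within positions 1 to p.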
Average Precision
Average precision for ranking #1: (1.0+0.67+0.75+0.8+0.83+0.6)/6 = 0.78
Average precision for ranking #2: (0.5+0.4+0.5+0.57+0.56+0.6)/6 = 0.52
Emphasizes top-ranked documents
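The calculation above can be sketched in a few lines. The sketch assumes that every relevant document appears somewhere in the ranking, as in the two example rankings, so the sum is divided by the number of relevant documents found; the relevance flags are inferred from the recall values, and the function name is hypothetical.

```python
def average_precision(is_relevant):
    """Average of the precision values at the ranks where a relevant document appears.

    Assumption: all relevant documents occur in the ranking, so the number of
    hits equals the total number of relevant documents.
    """
    hits = 0
    precisions = []
    for rank, rel in enumerate(is_relevant, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0


ranking_1 = [1, 0, 1, 1, 1, 1, 0, 0, 0, 1]
ranking_2 = [0, 1, 0, 0, 1, 1, 1, 0, 1, 1]

# Roughly 0.775 and 0.521, which the slide reports as 0.78 and 0.52.
print(average_precision(ranking_1))
print(average_precision(ranking_2))
```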
Mean Average Precision
● Each ranking produces an average precision
● Mean Average Precision (MAP):
○ Summarize rankings from multiple queries by averaging the average
precision
○ Most often used measure in research papers
○ Requires many relevance judgements
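[Figure: result lists for two queries with the relevant documents highlighted. Per the recall values, query #1 has 5 relevant documents at ranks 1, 3, 6, 9, 10 and query #2 has 3 relevant documents at ranks 2, 5, 7.]

Result #1 (query #1)
Rank        1     2     3     4     5     6     7     8     9     10
Recall      0.2   0.2   0.4   0.4   0.4   0.6   0.6   0.6   0.8   1.0
Precision   1.0   0.5   0.67  0.5   0.4   0.5   0.43  0.38  0.44  0.5

Result #2 (query #2)
Rank        1     2     3     4     5     6     7     8     9     10
Recall      0.0   0.33  0.33  0.33  0.67  0.67  1.0   1.0   1.0   1.0
Precision   0.0   0.5   0.33  0.25  0.4   0.33  0.43  0.38  0.33  0.3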
Average precision for result #1 (query #1): (1.0+0.67+0.5+0.44+0.5)/5 = 0.62
Average precision for result #2 (query #2): (0.5+0.4+0.43)/3 = 0.44
Mean Average Precision: MAP = (0.62+0.44)/2 = 0.53
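The MAP example can be reproduced with the sketch below. It assumes each query's result list is given as 0/1 relevance flags (inferred here from the recall values of the two results) and that all relevant documents appear in the list; both function names are hypothetical.

```python
def average_precision(is_relevant):
    """Mean of the precision values at the ranks of the relevant documents."""
    hits = 0
    precisions = []
    for rank, rel in enumerate(is_relevant, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    # Assumption: every relevant document is in the list, so divide by hits.
    return sum(precisions) / hits if hits else 0.0


def mean_average_precision(rankings):
    """Average the per-query average precision over all queries."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Relevance flags inferred from the recall rows of the two results.
query_1 = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]  # 5 relevant documents
query_2 = [0, 1, 0, 0, 1, 0, 1, 0, 0, 0]  # 3 relevant documents

print(round(mean_average_precision([query_1, query_2]), 2))  # 0.53
```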
Information Service Engineering
Lecture 10: Information Retrieval
4.1 A Brief History of Libraries and IR
4.2 Fundamental Concepts of IR
4.3 Information Retrieval Models
4.4 Retrieval Evaluation
4.5 Web Information Retrieval
4.6 Document Crawling, Text Processing, and Indexing
4.7 Query Processing and Result Representation
4.8 Question Answering
4. Information Retrieval Bibliography
[1] G. Salton, Automatic Information Organization and Retrieval, McGraw-Hill, New York, 1968.
[2] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. https://nlp.stanford.edu/IR-book/
● Further Reading:
○ R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, 2nd ed., Addison-Wesley, 2010.
4. Information Retrieval Syllabus Questions
● What are the main components of Linked Data driven Web applications and how do they
interact?
● Explain the fundamental concepts of Information Retrieval
● Explain the Architecture of an IR System
● Explain the Boolean Retrieval model. What are its benefits and its drawbacks?
● Explain the Vector Space Retrieval model. What are its benefits and its drawbacks?
● Explain how the ranking of search results can be evaluated.