a pair of shoes in the thesaurus; some reflexions on human and computer indexing



DESCRIPTION

Presentation at the Society of Indexers 2010 Conference, 30 September 2010, Middelburg

TRANSCRIPT

Page 1: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

Eric Sieverts

Media, information & communication, Amsterdam University of Applied Sciences /

Section Innovation & Development, University Library Utrecht

A pair of shoes in the thesaurus: reflexions on human and computer indexing

Society of Indexers Conference 2010, "The challenging future of indexing"

30 September 2010, Middelburg

Page 2: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

agenda

• searching in the world of Google

• what's wrong with Google (and the like)

• metadata and indexing

• indexing and knowledge organization

• knowledge organization and the semantic web

the holy grail for search systems: let people find what they are searching for

Eric Sieverts | [email protected] | [email protected] | http://www.library.uu.nl/medew/it/eric | Middelburg 30-9-2010

Page 4: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

searching in the world of Google

Google appears to be "the measure of all things" in search:

– with Google "everything can be found"

but isn't there a paradox?

– if Google (or Yahoo! or Bing) contains everything (> 500,000,000,000 items), can "it" still be found?

>> anticipating the user's intentions and peerless ranking algorithms become increasingly important

Page 5: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

the basic search-and-find paradigm

[Figure: searcher / query on one side, documents on the other, with "match" between them]

Page 6: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

validity for free-text matching?

[Figure: the search-and-find diagram, with "match" between query and documents]

"just read; it does not mean what you're reading"

(paraphrasing the Dutch poetry title "Lees maar, er staat niet wat er staat")

• How does Google know what you mean?

• How does Google know what a document means?

Page 7: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

filename: thesaurus.jpg

is this meant to be representative of the ease of use of thesauri?

to what query is this Google's answer?

Page 8: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

Want to know something about "hallenkerken" (Dutch for "hall churches") through Google Books?

Google's first hit is a book about building thesauri, containing the word in a single example of broader and narrower terms

Page 9: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

searching in the world of Google

The new Google Instant tries to predict user intent (the holy grail for search engine developers)

after typing 1 or 2 letters it already presents results for the statistically most probable (longer) words

but is Google really guessing right?

Page 10: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

classical situation with controlled human indexing

searcher must enter the "term(s)" that have been used to characterize the subject

indexer must assign "correct" terms to characterize the document

in principle a perfect match is possible

Page 11: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

classical situation with controlled human indexing

however

not user-friendly: searcher has to come up with the correct terms

expensive: indexers must analyze the document in order to assign the correct terms

Page 12: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

search in the world of Google

searcher just types some words (or often only a single word)

search system contains (all) the words from the documents themselves

often you don't find all you need - still satisfied?

Page 13: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

why still user satisfaction?

despite recall and precision problems:

• search system looks attractively simple

• searcher always finds something (in 500 billion web pages)

• smart relevance ranking, providing some relevant items among the first 10 for most (simple) questions, for the majority of users, very often even at #1

and: who cares about lousy recall & precision (in the Google world)?

Page 14: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

language technology at the searcher side

original simple query expanded & disambiguated

statistics generate additional terms to refine queries

search system contains just the words from the documents themselves

will improved queries result in better answers?

Page 15: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

language technology for better "query"

"word stemming" and "fuzzy search": automatically search for more word forms >> better recall

semantic network (or ontology) contains semantic relations between words: query expanded with semantically related terms >> better recall

for different meanings of a word, a semantic network (or ontology) contains relations with different words >> disambiguation >> better precision

no scientific evidence yet about how much improvement this brings
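The two recall mechanisms on this slide, stemming and expansion with related terms, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the suffix list is deliberately crude and is not a real stemming algorithm, and the `RELATED` dictionary is a hypothetical stand-in for a semantic network.

```python
# Hypothetical miniature "semantic network": stem -> related terms.
RELATED = {"shoe": {"footwear", "boot"}}

# Deliberately crude suffix list; a real stemmer handles far more cases.
SUFFIXES = ("ing", "s")


def stem(word):
    """Strip one known suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def expand_query(query):
    """Return the stemmed query words plus their related terms."""
    terms = {stem(w) for w in query.lower().split()}
    for term in list(terms):
        terms |= RELATED.get(term, set())
    return terms


def matches(document, terms):
    """A document matches when it shares at least one stemmed word with the terms."""
    words = {stem(w) for w in document.lower().split()}
    return bool(terms & words)
```

With these toy data, the query "shoes" on its own does not match a document about "winter boots", while the expanded query does; that is the recall gain the slide refers to.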

Page 16: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

language technology for better "query"

• statistical analysis of the search result generates characteristic terms, from which the user can choose to refine the query

• such words can also be derived from a synonym list, thesaurus, semantic network, et cetera

mostly >> better precision
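One simple way to derive such characteristic terms statistically is to compare how often a word occurs in the search result with how often it occurs in the whole collection. The sketch below is an assumption about how this could be done, not a description of any particular search engine; the function name and the frequency-ratio score are chosen for illustration, and the result documents are assumed to be part of the collection.

```python
from collections import Counter


def characteristic_terms(results, collection, top_n=3):
    """Rank words that are relatively more frequent in the results than in the collection.

    Assumes every result document is also part of the collection,
    so each word in the results has a non-zero collection count.
    """
    result_counts = Counter(w for doc in results for w in doc.lower().split())
    collection_counts = Counter(w for doc in collection for w in doc.lower().split())
    result_total = sum(result_counts.values())
    collection_total = sum(collection_counts.values())

    def lift(word):
        in_results = result_counts[word] / result_total
        in_collection = collection_counts[word] / collection_total
        return in_results / in_collection

    # Highest ratio first; ties are broken alphabetically.
    ranked = sorted(result_counts, key=lambda w: (-lift(w), w))
    return ranked[:top_n]
```

The user would then pick one of the suggested words to narrow the query, which is why the slide expects mostly better precision from this technique.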

Page 17: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

language technology at the document side

search with "correct" or "important" terms

language technology enriches the document with "correct" terms (from a thesaurus) or derives characteristic terms from the text

in principle a perfect match is possible

Page 18: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

automatic classification

Page 19: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

automatic classification or enrichment

1. deriving specific terms from the document itself: on the basis of word lists and text analysis, specific types of terms (e.g. names of persons, places, products, parties, companies, etc.) can be recognized and marked as such

2. adding characteristics to classify a document: after training, a system can analyze documents and classify them with terms from a thesaurus or with classes from a taxonomy

despite some limitations it's getting better all the time

even for less tangible tasks such as sentiment analysis
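The first approach, recognizing specific types of terms from word lists, can be sketched as a simple lookup. The word lists below are hypothetical and tiny; production systems combine much larger lists with further text analysis, which this sketch does not attempt.

```python
import re

# Hypothetical word lists, one per type of term.
WORD_LISTS = {
    "person": ["Eric Sieverts", "Joop van Gent"],
    "place": ["Middelburg", "Utrecht", "Amsterdam"],
    "company": ["Irion", "Google"],
}


def recognize_entities(text):
    """Return (name, type, start offset) for every word-list entry found in the text."""
    found = []
    for entity_type, names in WORD_LISTS.items():
        for name in names:
            pattern = r"\b" + re.escape(name) + r"\b"
            for match in re.finditer(pattern, text):
                found.append((name, entity_type, match.start()))
    # Report the entities in the order in which they occur in the text.
    return sorted(found, key=lambda item: item[2])
```

A pure word-list lookup cannot tell apart different meanings of the same string, which is one of the limitations the slide alludes to.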

Page 20: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

The Calais Web Service automatically creates rich semantic metadata

Named Entities, Facts, Events

Page 21: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

geographical recognition in Google Books

Page 22: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

training a system

[Diagram: thesaurus + training documents → analysis module → "fingerprints" → training module → enrichment of thesaurus]

Joop van Gent, Irion

Page 23: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

classification with the system

[Diagram: enriched thesaurus + new documents → analysis module → "fingerprints" → classification module → enriched documents]

Joop van Gent, Irion
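The train-then-classify flow of these two diagrams can be sketched with a generic technique: word-count "fingerprints" compared by cosine similarity. This is an assumption made for illustration and not the method of the system shown on the slides; all function names are hypothetical.

```python
import math
from collections import Counter


def fingerprint(text):
    """Represent a text as a bag of lower-cased words with their counts."""
    return Counter(text.lower().split())


def train(labelled_documents):
    """Merge the fingerprints of all training documents per thesaurus term."""
    profiles = {}
    for term, text in labelled_documents:
        profiles.setdefault(term, Counter()).update(fingerprint(text))
    return profiles


def cosine(a, b):
    """Cosine similarity between two word-count fingerprints."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, profiles):
    """Assign the thesaurus term whose profile is most similar to the new document."""
    fp = fingerprint(text)
    return max(profiles, key=lambda term: cosine(fp, profiles[term]))
```

Trained on a few example texts for two terms such as "chess" and "equestrianism", a classifier of this kind uses the surrounding words to decide which term fits a new document that mentions a "horse".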

Page 24: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

endgame tips: checkmate with bishop and knight (in Dutch: "horse")

chess

equestrianism

Page 25: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

knowledge organization systems

metadata: more than keywords or thesauri?

Page 26: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

knowledge organization systems

knowledge organization systems can be more than just metadata models or tools for subject indexing

4 types of KOS:

• categorization systems (like classifications and taxonomies)

• metadata models (like MARC or Dublin Core)

• relational models (like thesauri, semantic networks, ontologies)

• term lists (like authority files)

more about ontologies in a moment

Page 27: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

knowledge organization systems

4 types of functions for KOS:

• description and labeling (e.g. subject indexing with a thesaurus)

• definition (e.g. specification of the meaning of concepts in a thesaurus or ontology)

• translation (e.g. concordance between systems for interoperability)

• navigation (through the systematic structure of a taxonomy or classification, or the hierarchy of concepts in a thesaurus or ontology)

some of these play a role in the semantic web

Page 28: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

ontologies

• "knowledge representation" in which knowledge about (a small part of) the world is stored

• mostly not directly used for subject indexing

• allows more complete and complex representations of reality than a thesaurus

• with many possible types of relations between concepts

• with fixed roles and properties of these concepts

• often for limited domains ("wine ontology")

• sometimes broader, in so-called "core ontologies"

for example: CIDOC-CRM (conceptual reference model) for concepts, relations and properties in the field of cultural heritage

Page 29: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

relations between some concepts in a simple "wine ontology"

Page 30: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

example of the relations between concepts about the statue of Balzac by Rodin [in CIDOC-CRM]

Page 31: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

semantic web

Page 32: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

ontologies

"ontologies" in relation to the semantic web

• in a more general connotation:

general name for all kinds of systems for subject indexing (thesauri, classifications, taxonomies, name authority lists, ...)

• essential requirement:

an ontology must be available in a form that can be read, interpreted and processed by a computer program

→ this needs notations and formal languages to describe it

Page 33: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

ontology notation for semantic web

RDF (resource description framework): standard to describe relations between an object and its metadata

OWL (web ontology language): standard for computer-readable description of ontologies

RDFS (RDF schema): standard for description of a KOS in RDF

SKOS (simple knowledge organization system): standard for describing KOSs and relations between them in RDF

Page 34: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

resource description framework

• RDF uses XML to describe the relation between a resource (or object), its metadata and the metadata standards used

• resources should have a URI to refer to them

• RDF uses "namespaces" to refer to computer-readable descriptions of the standards (link via URL)

• RDF is meant to (re)use and to combine existing semantic systems

• properties (metadata) are registered in so-called triples: subject <predicate> object (which we could perhaps also write as: thing <property> value)

• RDF-triples are used in "linked data"

Page 35: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

rdf triples

subject <predicate> object

doc1 <has author> auth1

auth1 <has name> john smith

auth1 <has affiliation> home inc.

auth1 <has email> [email protected]

graphical representation of a simple network of 4 RDF-triples
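The four triples above form a small graph that a program can walk. The sketch below stores them as plain tuples instead of using an RDF library, which is a simplification for illustration; the helper names are hypothetical, and the e-mail value is kept exactly as it appears (redacted) in the source.

```python
# The four triples from the slide as (subject, predicate, object) tuples.
# The e-mail value is redacted in the source and is kept as it appears there.
TRIPLES = [
    ("doc1", "has author", "auth1"),
    ("auth1", "has name", "john smith"),
    ("auth1", "has affiliation", "home inc."),
    ("auth1", "has email", "[email protected]"),
]


def objects(subject, predicate, triples=TRIPLES):
    """All objects of the triples with the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def author_names(document, triples=TRIPLES):
    """Follow document -> author -> name, i.e. walk two edges of the graph."""
    return [
        name
        for author in objects(document, "has author", triples)
        for name in objects(author, "has name", triples)
    ]
```

Because the object of one triple (auth1) is the subject of others, separate statements link up into a network; the same principle underlies linked data.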

Page 36: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

SKOS representation of a thesaurus term & its relations can be described in RDF

Term: Economic cooperation

Used For: Economic co-operation

Broader terms: Economic policy

Narrower terms: Economic integration, European economic cooperation, European industrial cooperation, Industrial cooperation

Related terms: Interdependence

Scope Note: Includes cooperative measures in banking, trade, industry etc., between and among countries.

Page 37: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

SKOS representation in RDF

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept>
    <skos:prefLabel>Economic cooperation</skos:prefLabel>
    <skos:altLabel>Economic co-operation</skos:altLabel>
    <skos:scopeNote>Includes cooperative measures in banking, trade, industry etc., between and among countries.</skos:scopeNote>
    <skos:broader>
      <skos:Concept>
        <skos:prefLabel>Economic policy</skos:prefLabel>
      </skos:Concept>
    </skos:broader>
    <skos:related>
      <skos:Concept>
        <skos:prefLabel>Interdependence</skos:prefLabel>
      </skos:Concept>
    </skos:related>
    <skos:narrower>
      <skos:Concept>
        <skos:prefLabel>Economic integration</skos:prefLabel>
      </skos:Concept>
    </skos:narrower>
    <!-- ...more narrower terms omitted ... -->
  </skos:Concept>
</rdf:RDF>
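To show that such a description is indeed computer-readable, the sketch below reads a shortened copy of the SKOS fragment with Python's standard XML parser and rebuilds the thesaurus record. It treats the RDF/XML purely as XML, which works for this nested layout but is an assumption; a general RDF parser would be needed for other layouts. The `describe` function and its dictionary keys are hypothetical.

```python
import xml.etree.ElementTree as ET

SKOS = "{http://www.w3.org/2004/02/skos/core#}"

# Shortened copy of the slide's example (scope note left out).
RDF_XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept>
    <skos:prefLabel>Economic cooperation</skos:prefLabel>
    <skos:altLabel>Economic co-operation</skos:altLabel>
    <skos:broader>
      <skos:Concept><skos:prefLabel>Economic policy</skos:prefLabel></skos:Concept>
    </skos:broader>
    <skos:related>
      <skos:Concept><skos:prefLabel>Interdependence</skos:prefLabel></skos:Concept>
    </skos:related>
    <skos:narrower>
      <skos:Concept><skos:prefLabel>Economic integration</skos:prefLabel></skos:Concept>
    </skos:narrower>
  </skos:Concept>
</rdf:RDF>"""


def describe(concept):
    """Collect the labels of a concept and of its directly related concepts."""

    def labels(relation):
        path = f"{SKOS}{relation}/{SKOS}Concept/{SKOS}prefLabel"
        return [element.text for element in concept.findall(path)]

    return {
        "term": concept.findtext(f"{SKOS}prefLabel"),
        "used_for": [e.text for e in concept.findall(f"{SKOS}altLabel")],
        "broader": labels("broader"),
        "narrower": labels("narrower"),
        "related": labels("related"),
    }


root = ET.fromstring(RDF_XML)
record = describe(root.find(f"{SKOS}Concept"))
```

The resulting `record` carries the same fields as the thesaurus entry on the previous slide: term, used-for, broader, narrower and related terms.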

Page 38: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

RDF and "linked data"

a lot of buzz recently about "linked (open) data"

• it's just RDF-triples

• so it's computer readable

• it's on the internet

• so it's open

• it's meant to be re-used

• so it's an important ingredient for the semantic web

• it's standardized

• so it can be re-used

• everybody can (and has to) contribute data

• so it is also somewhat messy

Page 39: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

the "linked data cloud" - September 2010 - 24 billion RDF triples online

Page 40: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

data sets labeled in the figure:

• VIAF: virtual international authority file

• DBpedia: data from Wikipedia

• last.fm: artists

• GeoNames: 6.2 M toponyms

• BBC: Wildlife Finder

• LCSH

• Reuters: OpenCalais

• IMDB

Page 41: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

topic maps

XML-based information systems

• that can be considered as ontologies

• that need no additional notations and/or standards to make them computer-readable

• that combine knowledge representations and the indexed information in a single self-contained, interlinked system

• suited to make local knowledge accessible

Page 42: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

topic maps

consist of:

• concepts (= topics)

• that are characterized with

– "names" (can be any word, even several, to describe them) (names are topics themselves as well!)

– "types" (describing to what class of concepts it belongs) (types are topics themselves as well!)

– "associations" (specified types of relations between topics) (associations are also topics, thus having types!)

– "occurrences" (information items "about" the concept-topic) (occurrences are also topics, thus having types!)

• all of this described in XML

Page 43: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

simple example of an opera topic map (adapted from Pepper)

topics: verdi, puccini, lucca, italy (also named italia, italië, italien), tosca, madame-butterfly (also named madama-butterfly), rome (also named roma)

topic types: composer, opera, city, country

association types: situated in, influenced, composed, location for, place of birth

occurrences
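The building blocks of this example (topics with several names, topic types, typed associations) can be sketched as a small data structure. The sketch uses plain Python objects instead of the XML notation of topic maps, and the specific links between topics are assumptions added for illustration, since the transcript only lists the topics, topic types and association types, not which topics are connected.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    names: list                                      # one topic can carry several names
    type: str                                        # topic type, e.g. "composer"
    occurrences: list = field(default_factory=list)  # links to information items


topics = {
    "puccini": Topic(["puccini"], "composer"),
    "verdi": Topic(["verdi"], "composer"),
    "tosca": Topic(["tosca"], "opera"),
    "madame-butterfly": Topic(["madame-butterfly", "madama-butterfly"], "opera"),
    "lucca": Topic(["lucca"], "city"),
    "rome": Topic(["rome", "roma"], "city"),
    "italy": Topic(["italy", "italia", "italië", "italien"], "country"),
}

# (topic, association type, topic); these specific links are assumed.
associations = [
    ("puccini", "composed", "tosca"),
    ("puccini", "composed", "madame-butterfly"),
    ("puccini", "place of birth", "lucca"),
    ("verdi", "influenced", "puccini"),
    ("rome", "location for", "tosca"),
    ("lucca", "situated in", "italy"),
    ("rome", "situated in", "italy"),
]


def find_topic(name):
    """Return the identifier of the topic that carries the given name, if any."""
    for topic_id, topic in topics.items():
        if name in topic.names:
            return topic_id
    return None


def associated(topic_id, association_type):
    """Topics reached from topic_id through associations of the given type."""
    return [o for s, a, o in associations if s == topic_id and a == association_type]
```

Looking up "italië" or "italien" leads to the same topic as "italy", which shows how multiple names on one topic support searching in several languages.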

Page 44: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

© Antony Pitts, Kal Ahmed, MusicDNA

topic map application

Royal Academy of Music in London developed a model to describe "everything" around music, from work/composition to the experience of a particular performance

conceptually similar to the relational FRBR model in the library world

Page 45: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

© Antony Pitts, Kal Ahmed, MusicDNA

Page 46: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

semantic web

• ultimate application of interoperability

• using combination of methods and standards for storing, structuring, filling, formalizing, describing and interpreting metadata

– RDF(S)

– ontologies (as well as thesauri, taxonomies, semantic networks, …)

– formal languages (like SKOS and OWL)

– annotation of resources/objects (= subject indexing)

• so that computers will be able to interpret meaning and to combine knowledge from separate systems

Page 47: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

© Guus Schreiber UvA / VU

rdf annotation of web resource


Page 48: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

iconclass annotation

Page 49: A pair of shoes in the thesaurus; some reflexions on human and computer indexing


© Guus Schreiber UvA / VU

"species ontology"

Page 50: A pair of shoes in the thesaurus; some reflexions on human and computer indexing

the semantic web (and interoperability) still require a lot of subject indexing, but with smart systems that:

• (help to) index dumb documents

• can infer meaning

• can match heterogeneous metadata

• can improve dumb searches

even a monkey may find correct information, even information he didn't know he was looking for