IMT530- Organization of Information Resources 1
Recap
• Descriptive metadata elements can be used for access or selection
• For access, it is important to have good authority control to enable the users to:
– Find known items from the information they have available
– Gather all the items of a similar nature together
– Choose the right one from among retrieved items
• Authority control takes time and effort, but pays off in better results for users
– Need to balance cost against benefits and make a decision on your approach for each project
– Don’t do it halfway, because it’s not worth it
Module 5b: Subject Analysis and Indexing
IMT530: Organization of Information Resources
Winter 2008
Michael Crandall
Module 5b Outline
• Subject analysis
– Definition
– Why do this?
– Mai’s domain-centered analysis
– Consistency
• Subject indexing
– Definition and purpose of subject indexing
– Types of subject indexing
– Indexing non-text objects
– Types of terms used in subject indexing
– The subject indexing process
Some Questions
• Library catalogs often lump fiction into one subject heading: why?
• Would you describe the subject of “The Organization of Information” to your mother the same way you would to a classmate?
• Would you use the same subjects to describe Chapter 9 in Taylor that you would to describe the whole book?
• If you wanted to assign a subject to your kitchen or garage, what would it be?
• What if you had to describe snow to a Papua New Guinea native? What words would you use? Would they be the same for an Inuit?
• How do you describe the subject of a picture or film?
Subject Analysis - Definition
• The process of determining the subject and other content-related attributes of an object
• The purpose of subject analysis is to come to an understanding of or judgment regarding:
– what an object is about, in the context of how it might be used;
– what an object exemplifies;
– what discipline (or other aspect, including community) an object reflects (for classification)
Why Subject Analysis?
• One of the primary means of access to information is through “subjects”
• In order for a computer to access those subjects, there has to be some way to get to them
– an index of some kind
– Remember Soergel’s model, and the necessity for a means to match user requests to information objects
• Automatic indexing works for some situations, but not all
– As we’ll see, subject concepts are not necessarily contained in words (especially not in images!!)
– A specific audience may dictate specific analysis
Wilson on Subjects
• One of the main purposes of Wilson’s chapter on subjects is to analyze the subject analysis process – to take it apart
• Starts with the words, then the sentences, then the work itself, and asks questions about how you can elicit descriptions of “aboutness”
• Wilson suggests four different ways to approach this:
– Purposive: why did the author write the work?
– Figure-ground: what stands out among all the possible subjects?
– Objective: count what is most frequently mentioned
– Appeal to unity and completeness: what questions are answered within the work?
• Ultimately, he concludes that any extraction will miss some part of the work, and not satisfy some user
Subject Analysis in Context
• Subject analysis should always be done in context
• Context considerations include:
– user (children, medical practitioners, etc.)
– uses (developing egg substitutes, learning how to cook)
– the document itself (the “text” of a document, intended audience, uses, etc.)
– institution (public library, corporate intranet)
– administrative and information systems context
Mai’s Domain-Centered Approach
Relevance
• Taylor’s stages in development of an information need
– The visceral need
– The conscious need
– The formalized need
– The compromised need
• Relevance is usually measured against the last of these, while ignoring the more complex situational aspects that affect the other states
– Mai concludes that evaluation should be less mechanistic (focused on terminology matches) and more humanistic (focused on the visceral needs)
– Requires contextual analysis and qualitative research rather than just precision/recall measures
Consistency
• Taylor points out the difficulty of getting people to assign similar subjects to objects
• But when controlled vocabularies and rules for selecting subject terms from those vocabularies are used, consistency is much better
– Assumes trained subject indexers
– Not likely to be the case in most settings other than libraries
– Again points out need to determine what your objectives in building a taxonomy are before you make the investment
• So how do you go about subject indexing?
Definition and Purpose of Subject Indexing
• Subject indexing is the process or technique of identifying and selecting terms (words, phrases, sentences, taxonomic categories, notation) used in a domain of information to indicate the subject content of a resource for users and to provide subject access
• Purposes of subject indexing may be seen in light of Cutter’s objects of the catalog:
– To facilitate finding a particular object on the basis of its subject content (finding function)
– To display to a user all of the objects that exhibit particular subject content (collocating function)
– To aid a user in the selection of a particular object (choice function).
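The finding and collocating functions can be pictured as lookups in an index that maps each subject term to the objects carrying it. The following is only a minimal sketch: the record IDs, the subject terms, and the `build_subject_index` helper are hypothetical and are not drawn from any real catalog.

```python
from collections import defaultdict

def build_subject_index(records):
    """Map each subject term to the set of object IDs indexed with it."""
    index = defaultdict(set)
    for object_id, terms in records.items():
        for term in terms:
            index[term].add(object_id)
    return index

# Hypothetical objects and the subject terms an indexer assigned to them.
records = {
    "article-1": ["Marriage", "Job Satisfaction"],
    "article-2": ["Marriage", "Counseling Techniques"],
    "book-1": ["Job Satisfaction"],
}

index = build_subject_index(records)

# Collocating function: every object that exhibits the subject "Marriage".
collocated = sorted(index["Marriage"])
```

The choice function is not covered by the lookup itself; it depends on what the system displays about each retrieved object.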
Rowley Article
• Trade off between precision and recall
• 4 eras in indexing
– Era 1: Pre-computer access - title indexing
– Era 2: Online age - Cranfield and other retrieval studies showed free indexing worked as well as controlled in abstract databases
– Era 3: Full-text vs. subject indexing - shown to complement each other (Taylor also points out the tradeoff between summarization for document retrieval vs. depth indexing for information retrieval)
– Era 4: Tests with real users instead of controlled experiments - difficulty in using search interfaces because of complex and varied systems
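The precision/recall trade-off is easier to see with numbers. The sketch below uses the standard definitions (precision = relevant retrieved / all retrieved; recall = relevant retrieved / all relevant); the document IDs and the two searches are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search, given collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical collection in which d2, d3, and d5 are the relevant documents.
relevant = {"d2", "d3", "d5"}

# A narrow search retrieves little, but everything it retrieves is relevant.
narrow = precision_recall({"d2", "d3"}, relevant)

# A broad search finds every relevant document, along with three irrelevant ones.
broad = precision_recall({"d1", "d2", "d3", "d4", "d5", "d6"}, relevant)
```

Here the narrow search has perfect precision but misses a relevant document, while the broad search has perfect recall but only half of what it returns is relevant.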
Types of Subject Indexing: Derived Indexing
• Derived Indexing: in derived indexing, terms used for indexing are limited to those that actually appear in the document or resource.
• Derived indexing may be done manually or automatically
– Search engine indexes are examples of automatic derived indexing
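A rough sketch of automatic derived indexing, assuming a simple frequency count over the words of the text. The stopword list, the sample text, and the `derive_index_terms` helper are illustrative assumptions; real search engines are far more elaborate.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; real systems use much longer ones.
STOPWORDS = {"the", "of", "and", "a", "to", "in", "is", "for", "that", "on"}

def derive_index_terms(text, max_terms=5):
    """Pick the most frequent non-stopword words that appear in the text itself."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word not in STOPWORDS)
    return [term for term, _ in counts.most_common(max_terms)]

text = "Trust and commitment matter. Trust supports couples; couples value trust."
terms = derive_index_terms(text, max_terms=2)
all_terms = derive_index_terms(text, max_terms=10)
```

Because every term is taken from the text, a concept the text never names (for example "marriage" in this sample) cannot become an index term, however central it may be to what the text is about.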
Assigned Indexing
• Assigned Indexing: in assigned indexing, terms used for indexing are not limited to those in the object, but may come from the object, the mind of the indexer, or from a controlled vocabulary
• There are two types of Assigned indexing: Free Indexing and Indexing from controlled vocabularies
Free Indexing
• In free indexing, the indexer or indexing program is free to assign terms from anywhere inside or outside the object
– the indexer may take terms from the object, or use any terms that occur to them
– In some “free” indexing settings, very detailed instructions guide indexers in their selection of terms
– Other settings are much looser: users can pick any terms that mean something to them or others
• Pictures (http://flickr.com)
• Folksonomies (http://del.icio.us)
Controlled Vocabulary Indexing
• In indexing from controlled vocabularies, indexers are constrained by the terms that are available in lists of terms called “controlled vocabularies” - they must assign one or more terms from the controlled vocabulary.
• Controlled vocabulary indexing is much like choosing terms from a very large drop-down menu.
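The drop-down-menu analogy can be sketched as a check against a fixed list. Everything below is hypothetical: the vocabulary is a handful of terms chosen for illustration (not an excerpt from any real thesaurus), and the entry-term mapping stands in for the "use" references a real controlled vocabulary might provide.

```python
# Hypothetical controlled vocabulary: the only terms an indexer may assign.
CONTROLLED_VOCABULARY = {
    "Counseling Techniques",
    "Dual Career Family",
    "Job Satisfaction",
    "Marriage",
}

# Hypothetical entry terms that point to the preferred term.
ENTRY_TERMS = {
    "two-career couples": "Dual Career Family",
    "wedlock": "Marriage",
}

def assign_controlled_terms(candidates):
    """Split candidate terms into those the vocabulary allows and those it rejects."""
    assigned, rejected = [], []
    for candidate in candidates:
        term = candidate.strip()
        if term in CONTROLLED_VOCABULARY:
            assigned.append(term)
        elif term.lower() in ENTRY_TERMS:
            assigned.append(ENTRY_TERMS[term.lower()])
        else:
            rejected.append(term)
    return assigned, rejected

assigned, rejected = assign_controlled_terms(
    ["Marriage", "Two-career couples", "Commuting"]
)
```

A rejected candidate such as "Commuting" leaves the indexer with two options: pick the closest available term, or propose the term as an addition to the vocabulary.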
Automatic Indexing
• In automatic indexing, it is common for indexing software applications to use derived indexing techniques only, enhanced with word stemming and spelling algorithms to improve matching
• However, more advanced programs are being developed that mimic free indexing (e.g., text summarization programs)
• Some advanced automatic indexing programs (particularly those in medicine) are making use of controlled vocabularies in term selection and identification.
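To show what word stemming contributes to matching, here is a deliberately crude suffix-stripping sketch. It is not the Porter stemmer or any production algorithm; the suffix list and the `naive_stem` helper are assumptions made for illustration.

```python
def naive_stem(word):
    """Strip one common English suffix, keeping at least three leading letters."""
    word = word.lower()
    for suffix in ("ing", "ers", "er", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Word forms that a user and a document might use for the same concept.
variants = ["indexing", "indexes", "indexer", "indexers", "index"]
stems = {naive_stem(word) for word in variants}

# With stemming, a query for "indexes" matches a document that says "indexing".
query_matches = naive_stem("indexes") == naive_stem("indexing")
```

Without the stemming step, each of these word forms would be a separate index entry and a search on one would miss the others.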
Mai’s Conceptions of Indexing
• Simplistic conception of indexing
– automatic extraction (derived indexing)
• Document-oriented indexing
– focus on document & document parts
• Content-oriented indexing
– focus on content in document (still document oriented)
• User-oriented indexing
– focus on user & possible uses of the document
• Requirement-oriented indexing
– relies on in-depth knowledge of users & uses of documents; complete knowledge of context
Types of Terms Used in Subject Indexing
• Words or short phrases
– descriptors, identifiers, subject headings, or keywords
• Sentences
– derived indexing may use whole sentences, but rarely done
– used in some web documents and for derived abstracts
– abstracts, summaries, or annotations
• Taxonomic categories (such as the type used in the Yahoo directory)
• Notation (such as the type used in the Dewey Decimal Classification)
Sample ERIC Indexing Record
PERSONAL AUTHOR: Magnuson,-Sandy; Norem,-Ken
TITLE: Challenges for Higher Education Couples in Commuter Marriages: Insights for Couples and Counselors Who Work with Them.
PUBLICATION YEAR: 1999
SOURCE (JOURNAL CITATION): Family-Journal:-Counseling-and-Therapy-for-Couples-and-Families; v7 n2 p125-34 Apr 1999
DOCUMENT TYPE: Journal-Articles (080); Reports-Research (143)
LANGUAGE: English
MAJOR DESCRIPTORS: *Counseling-Techniques; *Dual-Career-Family; *Job-Satisfaction; *Marital-Satisfaction; *Marriage-
MINOR DESCRIPTORS: Trust-Psychology
MAJOR IDENTIFIERS: *Career-Commitment
MINOR IDENTIFIERS: Quality-Time
ABSTRACT: Focuses on the experiences of dual-career couples that maintain two homes to attain career satisfaction. Findings include support for the potential strength and satisfaction of commuting relationships. Trust, commitment, regular communication, and quality shared time were endorsed as factors contributing to successful distance marriages. (Author/GCP)
Indexing Non-text Objects
• Layne discusses the indexing of images and points out some useful distinctions
– Defines four general types of attributes
• Biographical
• Subject
• Exemplified
• Relationship
– While she discusses these in the context of images, they can prove useful when indexing almost any object
Identification of Concepts
• Taylor lists several concepts that can be helpful in teasing out subject terms
– Topics
– Names
• Persons, corporations, geographic, other
– Time periods
– Form (genre)
• http://isotropic.org/papers/chicken.pdf
• See the appendix in Taylor for an example and checklist
Indexing Policies
• Many indexers are guided by indexing policies that determine the types of terms that are finally used in indexing
• Three characteristics of indexing upon which indexing policies may be built:
– Exhaustivity
– Specific entry (sometimes called “specificity”, but incorrectly)
– Coextensivity
ISO 5963
• Despite Wilson’s assertion that subject analysis is impossible, a variety of standards exist prescribing how it should be done – the British Standard ISO 5963 in your readings this week is one of them
• Viewed from Wilson’s or Mai’s perspective (and your own), what are the problems with this standard?
Steps in Free and Assigned Indexing
1. Identify subject content
2. Identify disciplinary context or domain (for classifications or taxonomies)
3. Express or describe content (steps 1-3 describe the subject analysis process)
4. Select or create terms and add them to the document representation
5. If working with a controlled vocabulary (CV), update and maintain the CV based on the indexing experience
Questions?
• If not, take a break!!!
Exercise 5
• Purpose is to try different methods of extracting concepts from an article, so you can see the impact on users
• Spend the rest of class working through the questions in Exercise 5
• We’ll discuss before the end of class
Differences
• Hopefully, this exercise gave you a chance to see a couple things:
– How difficult it can be to actually determine what something is about
– How different methods of assigning terms would result in very different access for users
• We didn’t throw in Mai’s perspective on domain indexing in this exercise, which makes it even more difficult
– This is obviously not a simple thing to do well
– But you now are aware of the issues, and can keep them in mind when working in this area
Next Week
• We’ll start looking in more detail at controlled vocabularies and discuss how they might interact with emergent social tagging systems
• Remember to read assignments BEFORE class
• Important: your mid-term assignments are due at the start of class next week!!