controlled vocabularies: name authority control


Page 1: Controlled Vocabularies:  Name Authority Control

11/7/2000 Information Organization and Retrieval

Controlled Vocabularies: Name Authority Control

University of California, Berkeley

School of Information Management and Systems

SIMS 202: Information Organization and Retrieval

Page 2: Controlled Vocabularies:  Name Authority Control

Review

• Dublin Core

• Other Metadata Systems

• Cognitive basis of categorization and subject classification

Page 3: Controlled Vocabularies:  Name Authority Control

Dublin Core Elements

• Title
• Creator
• Subject
• Description
• Publisher
• Other Contributors
• Date
• Resource Type
• Format
• Resource Identifier
• Source
• Language
• Relation
• Coverage
• Rights Management

Page 4: Controlled Vocabularies:  Name Authority Control

Issues in Dublin Core

• Lack of guidance on what to put into each element

• How to structure or organize at the element level?

• How to ensure consistency across descriptions for the same persons, places, things, etc.?

Page 5: Controlled Vocabularies:  Name Authority Control

More Metadata Systems

• The following is a sample of metadata systems for a variety of special types of data, documents, and objects.

Page 6: Controlled Vocabularies:  Name Authority Control

Types of Metadata Systems and Standards

• Naming and ID systems – URLs, ISBNs
• Bibliographic description – MARC, Dublin Core, TEI, etc.
• Music – SMDL
• Images and objects – CIMI, VRA Core Categories
• Numeric Data – DDI, SDSM
• Geospatial Data – FGDC
• Collections – EAD

Page 7: Controlled Vocabularies:  Name Authority Control

Metadata Resources

• Check the Links section from the class home page

• Best site is the “Digital Library: Metadata Resources” page from IFLA at http://www.ifla.org/ifla/II/metadata.htm

Page 8: Controlled Vocabularies:  Name Authority Control

Hierarchical vs. Faceted (Subject Heading vs. Descriptor)

Category Systems

Page 9: Controlled Vocabularies:  Name Authority Control

Controlled Vocabulary (The following slides follow Bates 88)

• Start with the text of the document
• Attempt to “control” or regularize:
  – The concepts expressed within
    • mutually exclusive
    • exhaustive
  – The language used to express those concepts
    • limit the normal linguistic variations
    • regulate word order and structure of phrases
    • reduce the number of synonyms or near-synonyms
• Also, provide cross-references between concepts and their expression.

Page 10: Controlled Vocabularies:  Name Authority Control

Classification Schemes

• Classify possible concepts.
• Goals:
  – Completely distinct conceptual categories (mutually exclusive)
  – Complete coverage of conceptual categories (exhaustive)

Page 11: Controlled Vocabularies:  Name Authority Control

Assigning Headings vs. Descriptors

• Subject headings
  – Assign one (or a few) complex heading(s) to the document
• Descriptors
  – Mix and match

How would we describe recipes using each technique?

Page 12: Controlled Vocabularies:  Name Authority Control

Subject Heading vs. Descriptor

WILSONLINE (subject headings)
– Athletes
– Athletes -- Health & Hygiene
– Athletes -- Nutrition
– Athletes -- Physical Exams
– …
– Athletics
– Athletics -- Administration
– Athletics -- Equipment -- Catalogs
– …
– Sports -- Accidents and injuries
– Sports -- Accidents and injuries -- prevention

ERIC (descriptors)
– Athletes
– Athletic Coaches
– Athletic Equipment
– Athletic Fields
– Athletics
– …
– Sports psychology
– Sportsmanship

Page 13: Controlled Vocabularies:  Name Authority Control

Subject Headings vs. Descriptors

Subject headings:
• Describe the contents of an entire document
• Designed to be looked up in an alphabetical index
  – Look up document under its heading
• Few (1-5) headings per document

Descriptors:
• Describe one concept within a document
• Designed to be used in Boolean searching
  – Combine to describe the desired document
• Many (5-25) descriptors per document
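The difference in lookup style can be illustrated with a minimal Python sketch. The document ids and the term assignments below are hypothetical and invented for illustration; only the idea (one pre-combined heading versus several single-concept descriptors combined with Boolean AND) comes from the slide.

```python
# Hypothetical mini-collection: each document carries several single-concept descriptors.
descriptors = {
    "doc1": {"Athletes", "Nutrition", "Sports psychology"},
    "doc2": {"Athletes", "Athletic Equipment"},
    "doc3": {"Athletic Coaches", "Sports psychology"},
}

# Subject-heading style: one pre-combined heading, looked up as a single entry.
headings = {"Athletes--Nutrition": ["doc1"]}


def boolean_and(index, *terms):
    """Return the ids of documents indexed with every requested descriptor."""
    wanted = set(terms)
    return sorted(doc for doc, assigned in index.items() if wanted <= assigned)


# Descriptor style: the searcher combines concepts at query time.
hits = boolean_and(descriptors, "Athletes", "Sports psychology")
```

With subject headings the combination is made by the indexer in advance; with descriptors it is made by the searcher at query time.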

Page 14: Controlled Vocabularies:  Name Authority Control

Hierarchical Classification

– Each category is successively broken down into smaller and smaller subdivisions
– No item occurs in more than one subdivision
– Each level is divided out by a “character of division”, also known as a feature.
  • Example: distinguish Literature based on:
    – Language
    – Genre
    – Time Period

Page 15: Controlled Vocabularies:  Name Authority Control

Hierarchical Classification

[Figure: tree diagram. Literature is divided by language (English, French, Spanish); each language is divided by genre (Prose, Poetry, Drama); each genre is divided by period (16th, 17th, 18th, 19th, ...).]

Page 16: Controlled Vocabularies:  Name Authority Control

Labeled Categories for Hierarchical Classification

• LITERATURE
  – 100 English Literature
    • 110 English Prose
      – English Prose 16th Century
      – English Prose 17th Century
      – English Prose 18th Century
      – ...
    • 111 English Poetry
      – 121 English Poetry 16th Century
      – 122 English Poetry 17th Century
      – ...
    • 112 English Drama
      – 130 English Drama 16th Century
      – …
  – 200 French Literature

Page 17: Controlled Vocabularies:  Name Authority Control

Faceted Classification

• Create a separate, free-standing list for each characteristic of division (feature).

• Combine features to create a classification.

Page 18: Controlled Vocabularies:  Name Authority Control

Faceted Classification along with Labeled Categories

• A Language
  – a English
  – b French
  – c Spanish
• B Genre
  – a Prose
  – b Poetry
  – c Drama
• C Period
  – a 16th Century
  – b 17th Century
  – c 18th Century
  – d 19th Century

Combined notations:
• Aa English Literature
• AaBa English Prose
• AaBaCa English Prose 16th Century
• AbBbCd French Poetry 19th Century
• BcCd Drama 19th Century
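As a rough illustration of how facet codes combine, the following Python sketch decodes a notation by reading it two characters at a time (facet letter, then value letter). The `FACETS` table mirrors the three facet lists on this slide; the function name `decode` and the two-character encoding rule are assumptions made for the example, not part of any published scheme.

```python
# Facet tables copied from the slide: facet letter -> (facet name, value letter -> label).
FACETS = {
    "A": ("Language", {"a": "English", "b": "French", "c": "Spanish"}),
    "B": ("Genre", {"a": "Prose", "b": "Poetry", "c": "Drama"}),
    "C": ("Period", {"a": "16th Century", "b": "17th Century",
                     "c": "18th Century", "d": "19th Century"}),
}


def decode(notation):
    """Split a notation such as 'AbBbCd' into facet/value pairs and join the labels."""
    if len(notation) % 2:
        raise ValueError("notation must be a sequence of two-character codes")
    labels = []
    for i in range(0, len(notation), 2):
        facet, value = notation[i], notation[i + 1]
        labels.append(FACETS[facet][1][value])
    return " ".join(labels)


label = decode("AbBbCd")  # French Poetry 19th Century
```

Because each facet is a free-standing list, a notation may omit facets: `decode("BcCd")` describes drama of the 19th century in any language.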

Page 19: Controlled Vocabularies:  Name Authority Control

Important Question: How to use both types of classification structures?

• How to look through them?

• How to use them in search?

Page 20: Controlled Vocabularies:  Name Authority Control

Today

• More on Controlled vocabularies

• Choice of names

• Form of names

• Name Authority files

• Types of Controlled Vocabularies

Page 21: Controlled Vocabularies:  Name Authority Control

Controlled Vocabularies

• Vocabulary control is the attempt to provide a standardized and consistent set of terms (such as subject headings, names, classifications, etc.) with the intent of aiding the searcher in finding information.

Page 22: Controlled Vocabularies:  Name Authority Control

Controlled Vocabularies

• Names and name authorities & Other Types of Controlled Vocabulary (Today)

• Design of controlled vocabularies for subject access -- Thesaurus design (Thursday)

Page 23: Controlled Vocabularies:  Name Authority Control

Names

• Cutter’s objectives of bibliographic description:
  – To enable a person to find a document of which the author is known
  – To show what the library has by a given author
• The first serves access
• The second serves collocation

Page 24: Controlled Vocabularies:  Name Authority Control

Problems with Names

• How many names should be associated with a document?

• Which of these should be the “main entry”?

• What form should each of the names take?

• What references should be made from other possible forms of names that haven’t been used?

Page 25: Controlled Vocabularies:  Name Authority Control

The problem

• Proliferation of the forms of names
  – Different names for the same person
  – Different people with the same names
• Examples
  – From Books in Print (semi-controlled but not consistent)
  – ERIC author index (not controlled)

Page 26: Controlled Vocabularies:  Name Authority Control

Rules for description

• AACR II and other sets of descriptive cataloging rules provide guidelines for:
  – Determining the number of name entries
  – Choosing a main entry
  – Deciding on the form of name to be used
  – Deciding when to make references

Page 27: Controlled Vocabularies:  Name Authority Control

Authority control

• Authority control is concerned with the creation and maintenance of a set of terms that have been chosen as the standard representatives (also known as established terms) based on some set of rules.

• If you have rules, why do you need to keep track of all of the headings? Can’t you just infer the headings from the rules?

Page 28: Controlled Vocabularies:  Name Authority Control

Conditions of Authorship?

• Single person or single corporate entity
• Unknown or anonymous authors
  – Fictitiously ascribed works
• Shared responsibility
• Collections or editorially assembled works
• Works of mixed responsibility (e.g. translations)
• Related works

Page 29: Controlled Vocabularies:  Name Authority Control

Added Entries

• Personal names
  – Collaborators
  – Editors, compilers, writers
  – Translators (in some cases)
  – Illustrators (in some cases)
  – Other persons associated with the work (such as the honoree in a Festschrift)
• Corporate names
  – Any prominently named corporate body that has involvement in the work beyond publication, distribution, etc.

Page 30: Controlled Vocabularies:  Name Authority Control

Choice of Name

• AACR II says that the predominant form of the name used in a particular author’s writings should be chosen as the form of name.

• References should be made from the other forms of the name.

Page 31: Controlled Vocabularies:  Name Authority Control

Form of the Name

• When names appear in multiple forms, one form needs to be chosen. Criteria for choice are:
  – Fullness (e.g. full names vs. initials only)
  – Language of the name
  – Spelling (choose predominant form)
• Entry element:
  – John Smith or Smith, John?
  – Mao Zedong or Zedong, Mao? (Mao Tse Tung?)

Page 32: Controlled Vocabularies:  Name Authority Control

Name Authority Files

ID:NAFL8057230 ST:p EL:n STH:a MS:c UIP:a TD:19910821174242
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:05-14-80
RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
VST:d 08-21-91 Other Versions: earlier
040 DLC$cDLC$dDLC$dOCoLC
053 PR6005.R517
100 10 Creasey, John
400 10 Cooke, M. E.
400 10 Cooke, Margaret,$d1908-1973
400 10 Cooper, Henry St. John,$d1908-1973
400 00 Credo,$d1908-1973
400 10 Fecamps, Elise
400 10 Gill, Patrick,$d1908-1973
400 10 Hope, Brian,$d1908-1973
400 10 Hughes, Colin,$d1908-1973
400 10 Marsden, James
400 10 Matheson, Rodney
400 10 Ranger, Ken
400 20 St. John, Henry,$d1908-1973
400 10 Wilde, Jimmy
500 10 $wnnnc$aAshe, Gordon,$d1908-1973

Different names for the same person
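To show how such a record supports searching, here is a minimal Python sketch that maps each variant form (the 400 "see from" fields) to the established heading (the 100 field). It works on the simplified text display used on this slide, not on real MARC data, and the helper names `parse_fields` and `build_lookup` are hypothetical. The sample uses a few fields from the record above; the 500 "see also" field is left out.

```python
# A few fields from the slide's record, in its simplified display form.
RECORD = """\
100 10 Creasey, John
400 10 Cooke, M. E.
400 10 Cooke, Margaret,$d1908-1973
400 10 Marsden, James
400 20 St. John, Henry,$d1908-1973
"""


def parse_fields(record):
    """Yield (tag, name) pairs; the name is the text before the first '$' subfield mark."""
    for line in record.splitlines():
        tag, _indicators, value = line.split(" ", 2)
        name = value.split("$", 1)[0].rstrip(", ")
        yield tag, name


def build_lookup(record):
    """Map the heading (100) and every see-from form (400) to the established heading."""
    fields = list(parse_fields(record))
    heading = next(name for tag, name in fields if tag == "100")
    return {name: heading for tag, name in fields if tag in ("100", "400")}


lookup = build_lookup(RECORD)
established = lookup["Cooke, Margaret"]  # Creasey, John
```

A catalog could consult a table like this before searching, so that a query under any pseudonym is redirected to the established form.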

Page 33: Controlled Vocabularies:  Name Authority Control

Name Authority Files

ID:NAFO9114111 ST:p EL:n STH:a MS:n UIP:a TD:19910817053048
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:06-03-91
RFE:a CSC:c SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
VST:d 08-19-91
040 OCoLC$cOCoLC
100 10 Marric, J. J.,$d1908-1973
500 10 $wnnnc$aCreasey, John
663 Works by this author are entered under the name used in the item. For a listing of other names used by this author, search also under$bCreasey, John
670 OCLC 13441825: His Gideon's day, 1955$b(hdg.: Creasey, John; usage: J.J. Marric)
670 LC data base, 6/10/91$b(hdg.: Creasey, John; usage: J.J. Marric)
670 Pseuds. and nicknames dict., c1987$b(Creasey, John, 1908-1973; British author; pseud.: Marric, J. J.)

Page 34: Controlled Vocabularies:  Name Authority Control

Name Authority Files

ID:NAFL8166762 ST:p EL:n STH:a MS:c UIP:a TD:19910604053124
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:08-20-81
RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
VST:d 06-06-91 Other Versions: earlier
040 DLC$cDLC$dDLC$dOCoLC
100 10 Butler, William Vivian,$d1927-
400 10 Butler, W. V.$q(William Vivian),$d1927-
400 10 Marric, J. J.,$d1927-
670 His The durable desperadoes, 1973.
670 His The young detective's handbook, c1981:$bt.p. (W.V. Butler)
670 His Gideon's way, 1986:$bCIP t.p. (William Vivian Butler writing as J.J. Marric)

Different people writing with the same name

Page 35: Controlled Vocabularies:  Name Authority Control

Other Types of Controlled Vocabularies

• Gazetteers (Geographic Names)

• Code lists (e.g. LC Language Codes)

• Subject Heading Lists

• Classification Schemes

• Thesauri

Page 36: Controlled Vocabularies:  Name Authority Control

Structure of an IR System

[Figure: Information Storage and Retrieval System, adapted from Soergel, p. 19. The search line runs from interest profiles and queries, through formulating the query in terms of descriptors, to storage of profiles (Store 1: profiles/search requests). The storage line runs from documents and data, through indexing (descriptive and subject), to storage of documents (Store 2: document representations). Both lines follow the "rules of the game": rules for subject indexing plus a thesaurus, which consists of a lead-in vocabulary and an indexing language. Comparison/matching of the two stores yields potentially relevant documents.]

Page 37: Controlled Vocabularies:  Name Authority Control

Uses of Controlled Vocabularies

• Library Subject Headings, Classification and Authority Files
• Commercial Journal Indexing Services and databases
• Yahoo, and other Web classification schemes
• Online and Manual Systems within organizations
  – SunSolve
  – MacArthur

Page 38: Controlled Vocabularies:  Name Authority Control

Types of Indexing Languages

• Uncontrolled Keyword Indexing
• Indexing Languages
  – Controlled, but not structured
• Thesauri
  – Controlled and Structured
• Classification Systems
  – Controlled, Structured, and Coded
• Faceted Classification Systems

Page 39: Controlled Vocabularies:  Name Authority Control

Indexing Languages

• An index is a systematic guide designed to indicate topics or features of documents in order to facilitate retrieval of documents or parts of documents.

• An indexing language is the set of terms used in an index to represent topics or features of documents, together with the rules for combining or using those terms.

Page 40: Controlled Vocabularies:  Name Authority Control

Indexing Languages

• Library of Congress Subject Headings

• Yellow Pages Topics

• Wilson Indexes (“Reader’s Guide”)

Page 41: Controlled Vocabularies:  Name Authority Control

Thesauri

• A Thesaurus is a collection of selected vocabulary (preferred terms or descriptors) with links among Synonymous, Equivalent, Broader, Narrower and other Related Terms
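The link types named in this definition can be modeled with a small data structure. The Python sketch below is one possible representation, not a standard one: the class names `Term` and `Thesaurus`, the method names, and the sample terms are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """A preferred term (descriptor) with its hierarchical and associative links."""
    label: str
    broader: set = field(default_factory=set)
    narrower: set = field(default_factory=set)
    related: set = field(default_factory=set)


class Thesaurus:
    def __init__(self):
        self.terms = {}  # preferred label -> Term
        self.use = {}    # non-preferred (synonym) label -> preferred label

    def add_term(self, label):
        return self.terms.setdefault(label, Term(label))

    def add_broader(self, narrow, broad):
        """Record the reciprocal broader/narrower pair."""
        self.add_term(narrow).broader.add(broad)
        self.add_term(broad).narrower.add(narrow)

    def add_related(self, a, b):
        """Record a symmetric related-term link."""
        self.add_term(a).related.add(b)
        self.add_term(b).related.add(a)

    def add_synonym(self, variant, preferred):
        self.add_term(preferred)
        self.use[variant] = preferred

    def resolve(self, label):
        """Return the preferred term for any entry label, or None if unknown."""
        label = self.use.get(label, label)
        return label if label in self.terms else None


# Hypothetical sample entries.
thesaurus = Thesaurus()
thesaurus.add_broader("Poetry", "Literature")
thesaurus.add_related("Poetry", "Drama")
thesaurus.add_synonym("Verse", "Poetry")
```

Keeping synonyms in a separate table from the preferred terms reflects the split between entry vocabulary and the terms actually assigned to documents.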

Page 42: Controlled Vocabularies:  Name Authority Control

Thesauri (cont.)

• National and International Standards for Thesauri
  – ANSI/NISO Z39.19-1994 -- American National Standard Guidelines for the Construction, Format and Management of Monolingual Thesauri
  – ANSI/NISO Draft Standard Z39.4-199x -- American National Standard Guidelines for Indexes in Information Retrieval
  – ISO 2788 -- Documentation -- Guidelines for the establishment and development of monolingual thesauri
  – ISO 5964 -- Documentation -- Guidelines for the establishment and development of multilingual thesauri

Page 43: Controlled Vocabularies:  Name Authority Control

Thesauri (cont.)

• Examples:
  – The ERIC Thesaurus of Descriptors
  – The Art and Architecture Thesaurus
  – The Medical Subject Headings (MeSH) of the National Library of Medicine

Page 44: Controlled Vocabularies:  Name Authority Control

Classification Systems

• A classification system is an indexing language often based on a broad ordering of topical areas. Thesauri and classification systems both use this broad ordering and maintain a structure of broader, narrower, and related topics. Classification schemes commonly use a coded notation for representing a topic and its place in relation to other terms.

Page 45: Controlled Vocabularies:  Name Authority Control

Classification Systems (cont.)

• Examples:
  – The Library of Congress Classification System
  – The Dewey Decimal Classification System
  – The ACM Computing Reviews Categories
  – The American Mathematical Society Classification System

Page 46: Controlled Vocabularies:  Name Authority Control

Automatic Indexing and Classification

• Automatic indexing is typically the simple deriving of keywords from a document and providing access to all of those words.

• More complex Automatic Indexing Systems attempt to select controlled vocabulary terms based on terms in the document.

• Automatic classification attempts to automatically group similar documents using either:
  – A fully automatic clustering method.
  – An established classification scheme and set of documents already indexed by that scheme.
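The simplest form of automatic indexing described above (derive every word and provide access through it) can be sketched as an inverted index. This Python example is illustrative only: the sample documents are invented, and the tokenization rule (lowercase runs of letters and digits, no stop-word removal or stemming) is an assumption.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map every word occurring in any document to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


# Hypothetical sample documents.
docs = {
    1: "Name authority control in library catalogs",
    2: "Thesaurus design for subject access",
    3: "Authority files and subject headings",
}
index = build_index(docs)
```

Nothing here is controlled: "catalogs" and "catalog" would be separate entries, which is the kind of variation a controlled vocabulary is meant to remove.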

Page 47: Controlled Vocabularies:  Name Authority Control

Clustering

Agglomerative methods: Polythetic, Exclusive or Overlapping, Unordered; clusters are order-dependent.

[Figure: documents (Doc) scattered in a space and grouped into clusters.]

1. Select initial centers (i.e. seed the space)
2. Assign docs to highest matching centers and compute centroids
3. Reassign all documents to centroid(s)

Rocchio’s method
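The three numbered steps can be sketched in Python as follows. This is a simplified illustration rather than a faithful implementation of Rocchio's method: documents are assumed to be small numeric vectors, similarity is assumed to be cosine, each document goes to exactly one cluster (the exclusive case), and all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign(docs, centers):
    """Put each document index into the cluster of its most similar center."""
    clusters = [[] for _ in centers]
    for i, doc in enumerate(docs):
        best = max(range(len(centers)), key=lambda c: cosine(doc, centers[c]))
        clusters[best].append(i)
    return clusters


def centroid(docs, members):
    """Average the vectors of the listed documents."""
    dims = len(docs[0])
    return tuple(sum(docs[i][d] for i in members) / len(members) for d in range(dims))


def cluster(docs, seeds):
    # 1. Seed the space with the chosen initial centers.
    centers = [docs[s] for s in seeds]
    # 2. Assign documents to the best-matching center and compute centroids.
    clusters = assign(docs, centers)
    centers = [centroid(docs, m) if m else centers[k] for k, m in enumerate(clusters)]
    # 3. Reassign all documents to the centroids.
    return assign(docs, centers)


# Hypothetical two-dimensional document vectors; documents 0 and 2 are the seeds.
docs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
clusters = cluster(docs, seeds=[0, 2])
```

The outcome depends on which documents are chosen as seeds, which is one way such a method can be order-dependent.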

Page 48: Controlled Vocabularies:  Name Authority Control

Automatic Class Assignment

[Figure: a set of documents (Doc) submitted to a search engine.]

1. Create pseudo-documents representing intellectually derived classes.
2. Search using document contents
3. Obtain ranked list
4. Assign document to N categories ranked over threshold, OR assign to top-ranked category

Automatic Class Assignment: Polythetic, Exclusive or Overlapping, usually ordered; clusters are order-independent, usually based on an intellectually derived scheme
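The four steps can be sketched in Python under some loud assumptions: the "search engine" is replaced by a crude word-overlap score, the pseudo-documents are invented bags of words, and the function names are hypothetical. A real system would use a proper ranking function.

```python
import re
from collections import Counter


def tokens(text):
    """Count lowercase alphabetic words in a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def score(doc, pseudo_doc):
    """Count shared word occurrences: a crude stand-in for a search engine's ranking."""
    d, p = tokens(doc), tokens(pseudo_doc)
    return sum(min(d[w], p[w]) for w in d)


def assign_classes(doc, classes, threshold=None):
    """Rank classes by score; return those at or above threshold, else only the top one."""
    ranked = sorted(classes, key=lambda name: score(doc, classes[name]), reverse=True)
    if threshold is None:
        return ranked[:1]
    return [name for name in ranked if score(doc, classes[name]) >= threshold]


# Step 1: hypothetical pseudo-documents, one per intellectually derived class.
classes = {
    "Poetry": "poetry poem verse sonnet stanza",
    "Drama": "drama play stage theatre act",
}

# Steps 2-4: search with the document's contents, rank, and assign.
doc = "A sonnet is a poem of fourteen lines, sometimes read on stage"
top = assign_classes(doc, classes)
over_threshold = assign_classes(doc, classes, threshold=1)
```

Returning only the top-ranked class gives an exclusive assignment; using a threshold allows a document to fall into several classes (the overlapping case).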