
8/28/97 Information Organization and Retrieval

Controlled Vocabularies: Name Authority Control

University of California, Berkeley

School of Information Management and Systems

SIMS 202: Information Organization and Retrieval


Review

• Mapping to the relational model

• Database Design & Normalization

• ER Diagrams and Assignment


Normalization

• First Normal Form: functional dependency of nonkey attributes on the primary key; atomic values only

• Second Normal Form: full functional dependency of nonkey attributes on the primary key

• Third Normal Form: no transitive dependency between nonkey attributes

• Boyce-Codd and higher: all determinants are candidate keys; single multivalued dependency


Unnormalized Relations

• First step in normalization is to convert the data into a two-dimensional table

• In unnormalized relations data can repeat within a column


Unnormalized Relation

Patient # | Surgeon # | Surg. date | Patient Name | Patient Addr | Surgeon | Surgery | Postop drug | Drug side effects
1111 | 145; 311 | Jan 1, 1995; June 12, 1995 | John White | 15 New St. New York, NY | Beth Little; Michael Diamond | Gallstones removal; Kidney stones removal | Penicillin; none | rash; none
1234 | 243; 467 | Apr 5, 1994; May 10, 1995 | Mary Jones | 10 Main St. Rye, NY | Charles Field; Patricia Gold | Eye Cataract removal; Thrombosis removal | Tetracycline; none | Fever; none
2345 | 189 | Jan 8, 1996 | Charles Brown | Dogwood Lane Harrison, NY | David Rosen | Open Heart Surgery | Cephalosporin | none
4876 | 145 | Nov 5, 1995 | Hal Kane | 55 Boston Post Road, Chester, CN | Beth Little | Cholecystectomy | Demicillin | none
5123 | 145 | May 10, 1995 | Paul Kosher | Blind Brook Mamaroneck, NY | Beth Little | Gallstones Removal | none | none
6845 | 243 | Apr 5, 1994; Dec 15, 1984 | Ann Hood | Hilton Road Larchmont, NY | Charles Field | Eye Cornea Replacement; Eye cataract removal | Tetracycline | Fever


First Normal Form

• To move to First Normal Form a relation must contain only atomic values at each row and column.
– No repeating groups
– A column or set of columns is called a Candidate Key when its values can uniquely identify the row in the relation.


First Normal Form

Patient # | Surgeon # | Surgery Date | Patient Name | Patient Addr | Surgeon Name | Surgery | Drug admin | Side Effects
1111 | 145 | 01-Jan-95 | John White | 15 New St. New York, NY | Beth Little | Gallstones removal | Penicillin | rash
1111 | 311 | 12-Jun-95 | John White | 15 New St. New York, NY | Michael Diamond | Kidney stones removal | none | none
1234 | 243 | 05-Apr-94 | Mary Jones | 10 Main St. Rye, NY | Charles Field | Eye Cataract removal | Tetracycline | Fever
1234 | 467 | 10-May-95 | Mary Jones | 10 Main St. Rye, NY | Patricia Gold | Thrombosis removal | none | none
2345 | 189 | 08-Jan-96 | Charles Brown | Dogwood Lane Harrison, NY | David Rosen | Open Heart Surgery | Cephalosporin | none
4876 | 145 | 05-Nov-95 | Hal Kane | 55 Boston Post Road, Chester, CN | Beth Little | Cholecystectomy | Demicillin | none
5123 | 145 | 10-May-95 | Paul Kosher | Blind Brook Mamaroneck, NY | Beth Little | Gallstones Removal | none | none
6845 | 243 | 05-Apr-94 | Ann Hood | Hilton Road Larchmont, NY | Charles Field | Eye Cornea Replacement | Tetracycline | Fever
6845 | 243 | 15-Dec-84 | Ann Hood | Hilton Road Larchmont, NY | Charles Field | Eye cataract removal | none | none
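The step from the unnormalized relation to First Normal Form can be sketched in Python. This is only an illustration: the dictionary layout, the column names, and the function `to_first_normal_form` are assumptions made for the example, not part of the course materials. It takes one patient row whose surgery columns hold several values and emits one flat row per surgery.

```python
# Hypothetical representation of one unnormalized row: the repeating
# columns hold parallel lists, one entry per surgery.
unnormalized = [
    {
        "patient_no": 1111,
        "patient_name": "John White",
        "surgeon_no": [145, 311],
        "surgery_date": ["01-Jan-95", "12-Jun-95"],
        "surgery": ["Gallstones removal", "Kidney stones removal"],
        "drug": ["Penicillin", "none"],
        "side_effect": ["rash", "none"],
    },
]

# Columns that repeat within a single unnormalized row (an assumption
# read off the example table).
REPEATING = ["surgeon_no", "surgery_date", "surgery", "drug", "side_effect"]


def to_first_normal_form(rows):
    """Emit one flat row per surgery so every cell holds a single value."""
    flat = []
    for row in rows:
        # Values that occur once per patient are copied into every output row.
        single = {k: v for k, v in row.items() if k not in REPEATING}
        # zip walks the parallel lists together, one surgery at a time.
        for values in zip(*(row[col] for col in REPEATING)):
            flat.append({**single, **dict(zip(REPEATING, values))})
    return flat


first_nf = to_first_normal_form(unnormalized)
for r in first_nf:
    print(r)
```

The patient name is now repeated on every row, which is the redundancy that the later normal forms remove.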


Second Normal Form

• A relation is said to be in Second Normal Form when every nonkey attribute is fully functionally dependent on the primary key.
– That is, every nonkey attribute needs the full primary key for unique identification


Second Normal Form

Patient # | Patient Name | Patient Address
1111 | John White | 15 New St. New York, NY
1234 | Mary Jones | 10 Main St. Rye, NY
2345 | Charles Brown | Dogwood Lane Harrison, NY
4876 | Hal Kane | 55 Boston Post Road, Chester,
5123 | Paul Kosher | Blind Brook Mamaroneck, NY
6845 | Ann Hood | Hilton Road Larchmont, NY


Second Normal Form

Patient # | Surgeon # | Surgery Date | Surgery | Drug Admin | Side Effects
1111 | 145 | 01-Jan-95 | Gallstones removal | Penicillin | rash
1111 | 311 | 12-Jun-95 | Kidney stones removal | none | none
1234 | 243 | 05-Apr-94 | Eye Cataract removal | Tetracycline | Fever
1234 | 467 | 10-May-95 | Thrombosis removal | none | none
2345 | 189 | 08-Jan-96 | Open Heart Surgery | Cephalosporin | none
4876 | 145 | 05-Nov-95 | Cholecystectomy | Demicillin | none
5123 | 145 | 10-May-95 | Gallstones Removal | none | none
6845 | 243 | 15-Dec-84 | Eye cataract removal | none | none
6845 | 243 | 05-Apr-94 | Eye Cornea Replacement | Tetracycline | Fever
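The split into a Patient table and a Surgery table is a pair of relational projections. The sketch below is a minimal illustration, assuming the primary key of the First Normal Form relation is (Patient #, Surgeon #, Surgery Date); the slides do not state the key explicitly, and the helper `project` is a name made up for this example. Patient name depends on Patient # alone, so it moves to its own table.

```python
# A few rows of the First Normal Form relation, as tuples:
# (patient_no, surgeon_no, surgery_date, patient_name, surgery)
first_nf = [
    (1111, 145, "01-Jan-95", "John White", "Gallstones removal"),
    (1111, 311, "12-Jun-95", "John White", "Kidney stones removal"),
    (1234, 243, "05-Apr-94", "Mary Jones", "Eye Cataract removal"),
    (1234, 467, "10-May-95", "Mary Jones", "Thrombosis removal"),
]


def project(rows, columns):
    """Relational projection: keep the given column positions, drop duplicate rows."""
    seen = []
    for row in rows:
        picked = tuple(row[i] for i in columns)
        if picked not in seen:
            seen.append(picked)
    return seen


# Patient name needs only part of the assumed key (patient_no),
# so it goes into a separate Patient table.
patients = project(first_nf, [0, 3])

# The remaining attributes need the full assumed key.
surgeries = project(first_nf, [0, 1, 2, 4])

print(patients)
print(surgeries)
```

Each patient now appears once in `patients`, however many surgeries are recorded.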


Third Normal Form

• A relation is said to be in Third Normal Form if there is no transitive functional dependency between nonkey attributes
– When one nonkey attribute can be determined by one or more other nonkey attributes, there is said to be a transitive functional dependency.

• The side effect column in the Surgery table is determined by the drug administered
– Side effect is transitively functionally dependent on drug, so Surgery is not in 3NF


Third Normal Form

Patient # | Surgeon # | Surgery Date | Surgery | Drug Admin
1111 | 145 | 01-Jan-95 | Gallstones removal | Penicillin
1111 | 311 | 12-Jun-95 | Kidney stones removal | none
1234 | 243 | 05-Apr-94 | Eye Cataract removal | Tetracycline
1234 | 467 | 10-May-95 | Thrombosis removal | none
2345 | 189 | 08-Jan-96 | Open Heart Surgery | Cephalosporin
4876 | 145 | 05-Nov-95 | Cholecystectomy | Demicillin
5123 | 145 | 10-May-95 | Gallstones Removal | none
6845 | 243 | 15-Dec-84 | Eye cataract removal | none
6845 | 243 | 05-Apr-94 | Eye Cornea Replacement | Tetracycline


Third Normal Form

Drug Admin | Side Effects
Cephalosporin | none
Demicillin | none
none | none
Penicillin | rash
Tetracycline | Fever
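One way to see the dependency that Third Normal Form removes is to test it against the sample rows. The sketch below is illustrative only; the function name `determines` and the dictionary layout are assumptions for this example. Note that a dependency holding in sample data does not prove it holds in general; that is a design decision about the domain.

```python
def determines(rows, lhs, rhs):
    """Return True if each lhs value maps to exactly one rhs value in rows."""
    seen = {}
    for row in rows:
        # setdefault records the first rhs seen for this lhs value;
        # any later disagreement means lhs does not determine rhs.
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True


# Drug and side-effect columns of the Surgery table from the slides.
surgery = [
    {"drug": "Penicillin", "side_effect": "rash"},
    {"drug": "none", "side_effect": "none"},
    {"drug": "Tetracycline", "side_effect": "Fever"},
    {"drug": "none", "side_effect": "none"},
    {"drug": "Cephalosporin", "side_effect": "none"},
    {"drug": "Demicillin", "side_effect": "none"},
    {"drug": "Tetracycline", "side_effect": "Fever"},
]

# Drug determines side effect in this data, but not the other way round.
drug_fixes_effect = determines(surgery, "drug", "side_effect")
effect_fixes_drug = determines(surgery, "side_effect", "drug")

# Moving the dependency into its own table yields one row per drug.
drug_table = sorted({(r["drug"], r["side_effect"]) for r in surgery})

print(drug_fixes_effect, effect_fixes_drug)
print(drug_table)
```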


Joins

Part # | Name | Price | Count
1 | Big blue widget | 3.76 | 2
2 | Small blue Widget | 7.35 | 4
3 | Tiny red widget | 5.25 | 7
4 | large red widget | 157.23 | 23
5 | double widget rack | 10.44 | 12
6 | Small green Widget | 30.45 | 58
7 | Big yellow widget | 7.96 | 1
8 | Tiny orange widget | 81.75 | 42
9 | Big purple widget | 55.99 | 9

Invoice # | Part # | Quantity
93774 | 3 | 10
84747 | 23 | 1
88367 | 75 | 2
88647 | 4 | 3
776879 | 22 | 5
65689 | 76 | 12
93774 | 23 | 10
88367 | 34 | 2

Invoice # | Cust # | Rep #
93774 | 3 | 1
84747 | 4 | 1
88367 | 5 | 2
88647 | 9 | 1
776879 | 2 | 2
65689 | 6 | 2

Cust # | COMPANY | STREET1 | STREET2 | CITY | STATE | ZIPCODE
1 | Integrated Standards Ltd. | 35 Broadway | Floor 12 | New York | NY | 02111
2 | MegaInt Inc. | 34 Bureaucracy Plaza | Floors 1-172 | Phildelphia | PA | 03756
3 | Cyber Associates | 3 Control Elevation Place | Cyber Assicates Center | Cyberoid | NY | 08645
4 | General Consolidated | 35 Libra Plaza | | Nashua | NH | 09242
5 | Consolidated MultiCorp | 1 Broadway | | Middletown | IN | 32467
6 | Internet Behometh Ltd. | 88 Oligopoly Place | | Sagrado | TX | 78798
7 | Consolidated Brands, Inc. | 3 Independence Parkway | | Rivendell | CA | 93456
8 | Little Mighty Micro | 34 Last One Drive | | Orinda | CA | 94563
9 | SportLine Ltd. | 38 Champion Place | Suite 882 | Compton | CA | 95328
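A join recombines these tables by matching key values. The sketch below uses Python's built-in `sqlite3` module with a small subset of the rows. The slide does not name the tables or give SQL column names, so `part`, `invoice`, `invoice_line`, `customer`, and all column names here are assumptions chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table and column names; the slide shows the tables unnamed.
conn.executescript("""
CREATE TABLE part (part_no INTEGER PRIMARY KEY, name TEXT, price REAL, stock_count INTEGER);
CREATE TABLE invoice (invoice_no INTEGER PRIMARY KEY, cust_no INTEGER, rep_no INTEGER);
CREATE TABLE invoice_line (invoice_no INTEGER, part_no INTEGER, quantity INTEGER);
CREATE TABLE customer (cust_no INTEGER PRIMARY KEY, company TEXT);
""")

# A subset of the rows from the slide.
conn.executemany("INSERT INTO part VALUES (?, ?, ?, ?)", [
    (3, "Tiny red widget", 5.25, 7),
    (4, "large red widget", 157.23, 23),
])
conn.executemany("INSERT INTO invoice VALUES (?, ?, ?)", [
    (93774, 3, 1),
    (88647, 9, 1),
])
conn.executemany("INSERT INTO invoice_line VALUES (?, ?, ?)", [
    (93774, 3, 10),
    (88647, 4, 3),
])
conn.executemany("INSERT INTO customer VALUES (?, ?)", [
    (3, "Cyber Associates"),
    (9, "SportLine Ltd."),
])

# Each JOIN follows one shared key: invoice number, customer number, part number.
rows = conn.execute("""
    SELECT c.company, p.name, l.quantity
    FROM invoice_line AS l
    JOIN invoice  AS i ON i.invoice_no = l.invoice_no
    JOIN customer AS c ON c.cust_no = i.cust_no
    JOIN part     AS p ON p.part_no = l.part_no
    ORDER BY l.invoice_no
""").fetchall()

for row in rows:
    print(row)
```

These are inner joins, so an invoice line whose part number has no matching row in the part table would be dropped from the result.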


More on Assignment and ER

• Just what is this Cookie database?

• What sort of ways might it be used?

• What are those ER symbols again?


Original Assignment

• Examine the Cookie database using Access and look at the ER Diagram for it posted on the assignments page.

• Consider the possibilities of Book publications
– What are the problems with the database?
– What new fields would you add to the database, and where?
– Draw a new ER diagram showing your design.


Cookie ER diagram

[Figure: ER diagram of the Cookie database. Entities: BIBFILE, CALLFILE, LIBFILE, PUBFILE, SUBFILE, INDXFILE. Relationships: Has call, Has copy, publishes, Has index, Has subject. Linking attributes: accno, libid, pubid, subcode.]

Note: diagram contains only attributes used for linking


Cookie Database

• Cookie is a bibliographic database that contains information about a hypothetical union catalog of several libraries

• There are currently 5 main types of entities in the database (and one linking relation)
– Books (bibfile)
– Local Call numbers (callfile)
– Libraries (libfile)
– Publishers (pubfile)
– Subject headings (subfile)
– Links between subject and books (indxfile)


BIBFILE

• Books (BIBFILE) contains information about particular books. It includes one record for each book. The attributes are:

– accno -- an “accession” or serial number

– author -- The author’s name

– title -- The title of the book

– loc -- Location of publication (where published)

– date -- Date of publication

– price -- Price of the book

– pagination -- Number of pages

– ill -- What type of illustrations (maps, etc) if any

– height -- Height of the book in centimeters


CALLFILE

• CALLFILE contains call numbers and holdings information linking particular books with particular libraries. Its attributes are:

– accno -- the book accession number

– libid -- the id of the holding library

– callno -- the call number of the book in the particular library

– copies -- the number of copies held by the particular library


LIBFILE

• LIBFILE contains information about the libraries participating in this union catalog. Its attributes include:
– libid -- Library id number
– library -- Name of the library
– laddress -- Street address for the library
– lcity -- City name
– lstate -- State code (postal abbreviation)
– lzip -- zip code
– lphone -- Phone number
– mop - suncl -- Library opening and closing times for each day of the week.


PUBFILE

• PUBFILE contains information about the publishers of books. Its attributes include:
– pubid -- The publisher’s id number
– publisher -- Publisher name
– paddress -- Publisher street address
– pcity -- Publisher city
– pstate -- Publisher state
– pzip -- Publisher zip code
– pphone -- Publisher phone number
– ship -- standard shipping time in days


SUBFILE

• SUBFILE contains each unique subject heading that can be assigned to books. Its attributes are:
– subcode -- Subject identification number
– subject -- the subject heading/description


INDXFILE

• INDXFILE provides a way to allow many-to-many mapping of subject headings to books. Its attributes consist entirely of links to other tables:
– subcode -- link to subject id
– accno -- link to book accession number
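The linking role of INDXFILE can be sketched with Python's built-in `sqlite3` module. The table and column names follow the slides, but the column types, the composite primary key, and the sample subject headings and assignments are assumptions made up for this illustration; the real Cookie database was distributed as an Access file and may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Only the columns needed for the illustration; types are assumed.
conn.executescript("""
CREATE TABLE bibfile (accno INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE subfile (subcode INTEGER PRIMARY KEY, subject TEXT);
CREATE TABLE indxfile (
    subcode INTEGER REFERENCES subfile(subcode),
    accno   INTEGER REFERENCES bibfile(accno),
    PRIMARY KEY (subcode, accno)
);
""")

# Hypothetical sample rows (accession numbers and headings are invented).
conn.executemany("INSERT INTO bibfile VALUES (?, ?)", [
    (1, "Microcosmographia Academica"),
    (2, "Universities: American, English, German"),
])
conn.executemany("INSERT INTO subfile VALUES (?, ?)", [
    (10, "Higher education"),
    (20, "Universities and colleges"),
])
# One row per (subject, book) pair: a book can carry several subjects
# and a subject can apply to several books.
conn.executemany("INSERT INTO indxfile VALUES (?, ?)", [
    (10, 1), (10, 2), (20, 2),
])


def titles_for(subject):
    """Titles of all books indexed under the given subject heading."""
    rows = conn.execute("""
        SELECT b.title
        FROM bibfile AS b
        JOIN indxfile AS x ON x.accno = b.accno
        JOIN subfile  AS s ON s.subcode = x.subcode
        WHERE s.subject = ?
        ORDER BY b.title
    """, (subject,)).fetchall()
    return [title for (title,) in rows]


def subjects_for(accno):
    """Subject headings assigned to the book with this accession number."""
    rows = conn.execute("""
        SELECT s.subject
        FROM subfile AS s
        JOIN indxfile AS x ON x.subcode = s.subcode
        WHERE x.accno = ?
        ORDER BY s.subject
    """, (accno,)).fetchall()
    return [subject for (subject,) in rows]


print(titles_for("Higher education"))
print(subjects_for(2))
```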


Some examples of Cookie Searches

• Who wrote Microcosmographia Academica?

• How many pages long is Alfred Whitehead’s The Aims of Education and Other Essays?

• Which branches in Berkeley’s public library system are open on Sunday?

• What is the call number of Moffitt Library’s copy of Abraham Flexner’s book Universities: American, English, German?

• What books on the subject of higher education are among the holdings of Berkeley (both UC and City) libraries?

• Print a list of the Mechanics Library holdings, in descending order by height.

• What would it cost to replace every copy of each book that contains illustrations (including graphs, maps, portraits, etc.)?

• Which library closes earliest on Friday night?
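As a rough sketch of how one of these searches maps onto the tables, the "Mechanics Library holdings, in descending order by height" question could be expressed as a three-table join. The example uses Python's built-in `sqlite3` module; the column names come from the slides, while the column types, accession numbers, call numbers, and heights are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Reduced versions of three Cookie tables; types are assumptions.
conn.executescript("""
CREATE TABLE bibfile (accno INTEGER PRIMARY KEY, title TEXT, height REAL);
CREATE TABLE libfile (libid INTEGER PRIMARY KEY, library TEXT);
CREATE TABLE callfile (accno INTEGER, libid INTEGER, callno TEXT, copies INTEGER);
""")

# Hypothetical data: heights, ids, and call numbers are made up.
conn.executemany("INSERT INTO bibfile VALUES (?, ?, ?)", [
    (1, "Microcosmographia Academica", 19.0),
    (2, "The Aims of Education and Other Essays", 21.0),
    (3, "Universities: American, English, German", 23.0),
])
conn.executemany("INSERT INTO libfile VALUES (?, ?)", [
    (1, "Mechanics Library"),
    (2, "Moffitt Library"),
])
conn.executemany("INSERT INTO callfile VALUES (?, ?, ?, ?)", [
    (1, 1, "X1", 1),
    (3, 1, "X3", 2),
    (2, 2, "X2", 1),
])

# callfile links books to the libraries that hold them, so the query
# goes bibfile -> callfile -> libfile and filters on the library name.
holdings = conn.execute("""
    SELECT b.title, b.height
    FROM bibfile AS b
    JOIN callfile AS c ON c.accno = b.accno
    JOIN libfile  AS l ON l.libid = c.libid
    WHERE l.library = ?
    ORDER BY b.height DESC
""", ("Mechanics Library",)).fetchall()

for title, height in holdings:
    print(height, title)
```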


ER Diagram Symbols

• Entity: Rectangles are used to indicate entities (that is, the representatives or records describing persons, things, or events in the database).

• Attribute: Ovals are used to indicate the attributes associated with an entity or relationship (that is, the pieces of information recorded in the database about the entity or relationship).

• Primary key: An underlined name indicates that the attribute is a primary key (that is, it can uniquely identify the entity).

• Relationship: Diamonds are used to indicate relationships between entities (that is, some association between the data records of different entities).


Cookie ER diagram

[Figure: ER diagram of the Cookie database. Entities: BIBFILE, CALLFILE, LIBFILE, PUBFILE, SUBFILE, INDXFILE. Relationships: Has call, Has copy, publishes, Has index, Has subject. Linking attributes: accno, libid, pubid, subcode.]

Note: diagram contains only attributes used for linking


Assignment Goal

• The main intent is to have you start thinking about how databases are structured, and what types of information can or should be included when designing a database

• The main task is to look for MISSING elements in the current design, or badly designed elements given the particular data

• What attributes and/or new relations need to be added to the database?


And now for something completely different...


Today

• Controlled vocabularies

• Choice of names

• Form of names

• Name Authority files


Controlled Vocabularies

• Vocabulary control is the attempt to provide a standardized and consistent set of terms (such as subject headings, names, classifications, etc.) with the intent of aiding the searcher in finding information.


Controlled Vocabularies

• Names and name authorities (Today)

• Cognitive basis of categorization and subject classification (Thursday)

• Design of controlled vocabularies for subject access -- Thesaurus design (next week)


Names

• Cutter’s objectives of bibliographic description:
– To enable a person to find a document of which the author is known.
– To show what the library has by a given author.

• First serves access.

• Second serves collocation.


Problems with Names

• How many names should be associated with a document?

• Which of these should be the “main entry”?

• What form should each of the names take?

• What references should be made from other possible forms of names that haven’t been used?


The problem

• Proliferation of the forms of names
– Different names for the same person
– Different people with the same names

• Examples
– from Books in Print (semi-controlled but not consistent)
– ERIC author index (not controlled)


Rules for description

• AACR II and other sets of descriptive cataloging rules provide guidelines for:
– Determining the number of name entries
– Choosing a main entry
– Deciding on the form of name to be used
– Deciding when to make references


Authority control

• Authority control is concerned with the creation and maintenance of a set of terms that have been chosen as the standard representatives (also known as “established” terms) based on some set of rules.

• If you have rules, why do you need to keep track of all of the headings?


Conditions of Authorship?

• Single person or single corporate entity

• Unknown or anonymous authors

• Shared responsibility

• Collections or editorially assembled works

• Works of mixed responsibility (e.g. translations)

• Related Works


Added Entries

• Personal names
– Collaborators
– Editors, compilers, writers
– Translators (in some cases)
– Illustrators (in some cases)
– Other persons associated with the work (such as the honoree in a Festschrift).

• Corporate Names
– Any prominently named corporate body that has involvement in the work beyond publication, distribution, etc.


Choice of Name

• AACR II says that the predominant form of the name used in a particular author’s writings should be chosen as the form of name.

• References should be made from the other forms of the name.


Form of the Name

• When names appear in multiple forms, one form needs to be chosen. Criteria for choice are:
– Fullness (e.g. Full names vs. initials only)
– Language of the name.
– Spelling (choose predominant form)

• Entry element:
– John Smith or Smith, John?
– Mao Zedong or Zedong, Mao? (Mao Tse Tung?)


Name Authority Files

ID:NAFL8057230 ST:p EL:n STH:a MS:c UIP:a TD:19910821174242
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:05-14-80
RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
VST:d 08-21-91 Other Versions: earlier
040 DLC$cDLC$dDLC$dOCoLC
053 PR6005.R517
100 10 Creasey, John
400 10 Cooke, M. E.
400 10 Cooke, Margaret,$d1908-1973
400 10 Cooper, Henry St. John,$d1908-1973
400 00 Credo,$d1908-1973
400 10 Fecamps, Elise
400 10 Gill, Patrick,$d1908-1973
400 10 Hope, Brian,$d1908-1973
400 10 Hughes, Colin,$d1908-1973
400 10 Marsden, James
400 10 Matheson, Rodney
400 10 Ranger, Ken
400 20 St. John, Henry,$d1908-1973
400 10 Wilde, Jimmy
500 10 $wnnnc$aAshe, Gordon,$d1908-1973
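To show how such a record supports searching, here is a minimal Python sketch that reads a few of the field lines from the Creasey record and builds a lookup from variant names to the established heading. In MARC authority records the 100 field carries the established personal name heading, 400 fields are "see from" references, and 500 fields are "see also" references. The parsing rules (space-separated tag and indicators, an implied leading `$a` when the line does not start with a subfield code) are assumptions about this particular display, and the function names are invented for the example.

```python
import re

# A few field lines copied from the Creasey authority record.
RECORD = """\
100 10 Creasey, John
400 10 Cooke, M. E.
400 10 Cooke, Margaret,$d1908-1973
400 10 Marsden, James
400 20 St. John, Henry,$d1908-1973
500 10 $wnnnc$aAshe, Gordon,$d1908-1973
"""


def parse_fields(text):
    """Split each line into (tag, indicators, {subfield code: value})."""
    fields = []
    for line in text.splitlines():
        tag, indicators, data = line.split(" ", 2)
        if not data.startswith("$"):
            # Assumption: this listing omits the leading $a subfield code.
            data = "$a" + data
        subfields = dict(re.findall(r"\$(\w)([^$]*)", data))
        fields.append((tag, indicators, subfields))
    return fields


def clean(name):
    """Drop the trailing comma that precedes a following subfield."""
    return name.rstrip(", ")


def build_index(text):
    """Return the established heading, a variant-to-heading map, and see-also names."""
    fields = parse_fields(text)
    established = next(clean(sf["a"]) for tag, _, sf in fields if tag == "100")
    see_from = {clean(sf["a"]): established for tag, _, sf in fields if tag == "400"}
    see_also = [clean(sf["a"]) for tag, _, sf in fields if tag == "500"]
    return established, see_from, see_also


established, see_from, see_also = build_index(RECORD)
print(established)
print(see_from.get("Cooke, Margaret"))
print(see_also)
```

A catalog search for any of the variant forms can then be redirected to the single established heading, which is what gives collocation of an author's works.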


Name Authority Files

ID:NAFO9114111 ST:p EL:n STH:a MS:n UIP:a TD:19910817053048
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:06-03-91
RFE:a CSC:c SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
VST:d 08-19-91
040 OCoLC$cOCoLC
100 10 Marric, J. J.,$d1908-1973
500 10 $wnnnc$aCreasey, John
663 Works by this author are entered under the name used in the item. For a listing of other names used by this author, search also under$bCreasey, John
670 OCLC 13441825: His Gideon's day, 1955$b(hdg.: Creasey, John; usage: J.J. Marric)
670 LC data base, 6/10/91$b(hdg.: Creasey, John; usage: J.J. Marric)
670 Pseuds. and nicknames dict., c1987$b(Creasey, John, 1908-1973; British author; pseud.: Marric, J. J.)


Name authority files

ID:NAFL8166762 ST:p EL:n STH:a MS:c UIP:a TD:19910604053124
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:08-20-81
RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
VST:d 06-06-91 Other Versions: earlier
040 DLC$cDLC$dDLC$dOCoLC
100 10 Butler, William Vivian,$d1927-
400 10 Butler, W. V.$q(William Vivian),$d1927-
400 10 Marric, J. J.,$d1927-
670 His The durable desperadoes, 1973.
670 His The young detective's handbook, c1981:$bt.p. (W.V. Butler)
670 His Gideon's way, 1986:$bCIP t.p. (William Vivian Butler writing as J.J. Marric)