8/28/97 Information Organization and Retrieval
Controlled Vocabularies: Name Authority Control
University of California, Berkeley
School of Information Management and Systems
SIMS 202: Information Organization and Retrieval
Review
• Mapping to the relational model
• Database Design & Normalization
• ER Diagrams and Assignment
Normalization
• First Normal Form: functional dependency of nonkey attributes on the primary key; atomic values only
• Second Normal Form: full functional dependency of nonkey attributes on the primary key
• Third Normal Form: no transitive dependency between nonkey attributes
• Boyce-Codd and higher: all determinants are candidate keys; single multivalued dependency
Unnormalized Relations
• First step in normalization is to convert the data into a two-dimensional table
• In unnormalized relations data can repeat within a column
Unnormalized Relation
Patient # | Surgeon # | Surg. date | Patient Name | Patient Addr | Surgeon | Surgery | Postop drug | Drug side effects
1111 | 145; 311 | Jan 1, 1995; June 12, 1995 | John White | 15 New St. New York, NY | Beth Little; Michael Diamond | Gallstones removal; Kidney stones removal | Penicillin; none | rash; none
1234 | 243; 467 | Apr 5, 1994; May 10, 1995 | Mary Jones | 10 Main St. Rye, NY | Charles Field; Patricia Gold | Eye Cataract removal; Thrombosis removal | Tetracycline; none | Fever; none
2345 | 189 | Jan 8, 1996 | Charles Brown | Dogwood Lane Harrison, NY | David Rosen | Open Heart Surgery | Cephalosporin | none
4876 | 145 | Nov 5, 1995 | Hal Kane | 55 Boston Post Road, Chester, CN | Beth Little | Cholecystectomy | Demicillin | none
5123 | 145 | May 10, 1995 | Paul Kosher | Blind Brook Mamaroneck, NY | Beth Little | Gallstones Removal | none | none
6845 | 243 | Apr 5, 1994; Dec 15, 1984 | Ann Hood | Hilton Road Larchmont, NY | Charles Field | Eye Cornea Replacement; Eye cataract removal | Tetracycline | Fever
First Normal Form
• To move to First Normal Form a relation must contain only atomic values at each row and column.
– No repeating groups
– A column or set of columns is called a Candidate Key when its values can uniquely identify the row in the relation.
First Normal Form
Patient # | Surgeon # | Surgery Date | Patient Name | Patient Addr | Surgeon Name | Surgery | Drug admin | Side Effects
1111 | 145 | 01-Jan-95 | John White | 15 New St. New York, NY | Beth Little | Gallstones removal | Penicillin | rash
1111 | 311 | 12-Jun-95 | John White | 15 New St. New York, NY | Michael Diamond | Kidney stones removal | none | none
1234 | 243 | 05-Apr-94 | Mary Jones | 10 Main St. Rye, NY | Charles Field | Eye Cataract removal | Tetracycline | Fever
1234 | 467 | 10-May-95 | Mary Jones | 10 Main St. Rye, NY | Patricia Gold | Thrombosis removal | none | none
2345 | 189 | 08-Jan-96 | Charles Brown | Dogwood Lane Harrison, NY | David Rosen | Open Heart Surgery | Cephalosporin | none
4876 | 145 | 05-Nov-95 | Hal Kane | 55 Boston Post Road, Chester, CN | Beth Little | Cholecystectomy | Demicillin | none
5123 | 145 | 10-May-95 | Paul Kosher | Blind Brook Mamaroneck, NY | Beth Little | Gallstones Removal | none | none
6845 | 243 | 05-Apr-94 | Ann Hood | Hilton Road Larchmont, NY | Charles Field | Eye Cornea Replacement | Tetracycline | Fever
6845 | 243 | 15-Dec-84 | Ann Hood | Hilton Road Larchmont, NY | Charles Field | Eye cataract removal | none | none
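The move from the unnormalized relation to First Normal Form can be sketched in a few lines of Python. This is only an illustration: the dictionary layout, the column names, and the helper `to_first_normal_form` are assumptions made for the example, and only two patients and a subset of columns are shown.

```python
# Hypothetical in-memory form of the unnormalized relation:
# repeating groups are held as parallel lists inside one row.
unnormalized = [
    {
        "patient_no": 1111,
        "patient_name": "John White",
        "surgeon_no": [145, 311],
        "surgery_date": ["01-Jan-95", "12-Jun-95"],
        "surgery": ["Gallstones removal", "Kidney stones removal"],
    },
    {
        "patient_no": 2345,
        "patient_name": "Charles Brown",
        "surgeon_no": [189],
        "surgery_date": ["08-Jan-96"],
        "surgery": ["Open Heart Surgery"],
    },
]

# Columns that repeat within a single unnormalized row.
REPEATING = ("surgeon_no", "surgery_date", "surgery")


def to_first_normal_form(rows):
    """Emit one flat row per repeated group so every cell is atomic."""
    flat = []
    for row in rows:
        # Walk the parallel lists in step: one tuple per surgery.
        for values in zip(*(row[col] for col in REPEATING)):
            new_row = {k: v for k, v in row.items() if k not in REPEATING}
            new_row.update(zip(REPEATING, values))
            flat.append(new_row)
    return flat


first_nf = to_first_normal_form(unnormalized)
for r in first_nf:
    print(r)
```

Patient 1111 becomes two rows, one per surgery, with the patient name repeated on each. That repetition is what the later normal forms remove.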
Second Normal Form
• A relation is said to be in Second Normal Form when every nonkey attribute is fully functionally dependent on the primary key.
– That is, every nonkey attribute needs the full primary key for unique identification
Second Normal Form
Patient # | Patient Name | Patient Address
1111 | John White | 15 New St. New York, NY
1234 | Mary Jones | 10 Main St. Rye, NY
2345 | Charles Brown | Dogwood Lane Harrison, NY
4876 | Hal Kane | 55 Boston Post Road, Chester,
5123 | Paul Kosher | Blind Brook Mamaroneck, NY
6845 | Ann Hood | Hilton Road Larchmont, NY
Second Normal Form
Patient # | Surgeon # | Surgery Date | Surgery | Drug Admin | Side Effects
1111 | 145 | 01-Jan-95 | Gallstones removal | Penicillin | rash
1111 | 311 | 12-Jun-95 | Kidney stones removal | none | none
1234 | 243 | 05-Apr-94 | Eye Cataract removal | Tetracycline | Fever
1234 | 467 | 10-May-95 | Thrombosis removal | none | none
2345 | 189 | 08-Jan-96 | Open Heart Surgery | Cephalosporin | none
4876 | 145 | 05-Nov-95 | Cholecystectomy | Demicillin | none
5123 | 145 | 10-May-95 | Gallstones Removal | none | none
6845 | 243 | 15-Dec-84 | Eye cataract removal | none | none
6845 | 243 | 05-Apr-94 | Eye Cornea Replacement | Tetracycline | Fever
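One way to see why the patient columns were split off is to test functional dependencies directly against the First Normal Form rows. The helper `determines` and the column names below are assumptions made for this sketch. A dependency that holds in sample rows is only evidence, not proof, that it holds for the relation in general.

```python
def determines(rows, lhs, rhs):
    """True if, in these rows, equal lhs values always imply equal rhs values."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        val = tuple(row[c] for c in rhs)
        # Remember the first rhs value seen for each lhs value.
        if seen.setdefault(key, val) != val:
            return False
    return True


# A few rows of the First Normal Form relation (subset of columns).
rows = [
    {"patient_no": 1111, "surgeon_no": 145, "date": "01-Jan-95",
     "patient_name": "John White", "surgery": "Gallstones removal"},
    {"patient_no": 1111, "surgeon_no": 311, "date": "12-Jun-95",
     "patient_name": "John White", "surgery": "Kidney stones removal"},
    {"patient_no": 6845, "surgeon_no": 243, "date": "05-Apr-94",
     "patient_name": "Ann Hood", "surgery": "Eye Cornea Replacement"},
    {"patient_no": 6845, "surgeon_no": 243, "date": "15-Dec-84",
     "patient_name": "Ann Hood", "surgery": "Eye cataract removal"},
]

# Patient name depends on only part of the key, so it is a partial dependency.
name_by_patient = determines(rows, ["patient_no"], ["patient_name"])

# Surgery is not determined by patient number alone.
surgery_by_patient = determines(rows, ["patient_no"], ["surgery"])

# Surgery needs the full key (patient, surgeon, date).
surgery_by_full_key = determines(
    rows, ["patient_no", "surgeon_no", "date"], ["surgery"])

# Moving the partially dependent column into its own relation.
patients = {r["patient_no"]: r["patient_name"] for r in rows}
print(name_by_patient, surgery_by_patient, surgery_by_full_key, patients)
```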
Third Normal Form
• A relation is said to be in Third Normal Form if there is no transitive functional dependency between nonkey attributes
– When one nonkey attribute can be determined with one or more nonkey attributes there is said to be a transitive functional dependency.
• The side effect column in the Surgery table is determined by the drug administered
– Side effect is transitively functionally dependent on drug so Surgery is not 3NF
Third Normal Form
Patient # | Surgeon # | Surgery Date | Surgery | Drug Admin
1111 | 145 | 01-Jan-95 | Gallstones removal | Penicillin
1111 | 311 | 12-Jun-95 | Kidney stones removal | none
1234 | 243 | 05-Apr-94 | Eye Cataract removal | Tetracycline
1234 | 467 | 10-May-95 | Thrombosis removal | none
2345 | 189 | 08-Jan-96 | Open Heart Surgery | Cephalosporin
4876 | 145 | 05-Nov-95 | Cholecystectomy | Demicillin
5123 | 145 | 10-May-95 | Gallstones Removal | none
6845 | 243 | 15-Dec-84 | Eye cataract removal | none
6845 | 243 | 05-Apr-94 | Eye Cornea Replacement | Tetracycline
Third Normal Form
Drug Admin | Side Effects
Cephalosporin | none
Demicillin | none
none | none
Penicillin | rash
Tetracycline | Fever
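The Third Normal Form split can be sketched as two projections followed by a rejoin that checks nothing was lost. The tuple layout and variable names are assumptions made for the example, and only four of the Surgery rows are used.

```python
# Rows of the Second Normal Form Surgery relation:
# (patient #, surgeon #, surgery date, surgery, drug admin, side effects)
surgery_2nf = [
    (1111, 145, "01-Jan-95", "Gallstones removal", "Penicillin", "rash"),
    (1111, 311, "12-Jun-95", "Kidney stones removal", "none", "none"),
    (1234, 243, "05-Apr-94", "Eye Cataract removal", "Tetracycline", "Fever"),
    (6845, 243, "05-Apr-94", "Eye Cornea Replacement", "Tetracycline", "Fever"),
]

# Project away the transitively dependent column (side effects).
surgery_3nf = [row[:5] for row in surgery_2nf]

# Keep each (drug, side effect) pair exactly once.
drug = sorted({(row[4], row[5]) for row in surgery_2nf})

# Rejoin on the drug column to confirm the decomposition is lossless
# for these rows.
side_effect_of = dict(drug)
rejoined = [row + (side_effect_of[row[4]],) for row in surgery_3nf]

print(drug)
print(rejoined == surgery_2nf)
```

The Tetracycline/Fever pair is stored once in the drug relation rather than once per surgery, so a change to a drug's recorded side effect is made in a single place.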
Joins
Part # | Name | Price | Count
1 | Big blue widget | 3.76 | 2
2 | Small blue Widget | 7.35 | 4
3 | Tiny red widget | 5.25 | 7
4 | large red widget | 157.23 | 23
5 | double widget rack | 10.44 | 12
6 | Small green Widget | 30.45 | 58
7 | Big yellow widget | 7.96 | 1
8 | Tiny orange widget | 81.75 | 42
9 | Big purple widget | 55.99 | 9

Invoice # | Part # | Quantity
93774 | 3 | 10
84747 | 23 | 1
88367 | 75 | 2
88647 | 4 | 3
776879 | 22 | 5
65689 | 76 | 12
93774 | 23 | 10
88367 | 34 | 2

Invoice # | Cust # | Rep #
93774 | 3 | 1
84747 | 4 | 1
88367 | 5 | 2
88647 | 9 | 1
776879 | 2 | 2
65689 | 6 | 2

Cust # | COMPANY | STREET1 | STREET2 | CITY | STATE | ZIPCODE
1 | Integrated Standards Ltd. | 35 Broadway | Floor 12 | New York | NY | 02111
2 | MegaInt Inc. | 34 Bureaucracy Plaza | Floors 1-172 | Philadelphia | PA | 03756
3 | Cyber Associates | 3 Control Elevation Place | Cyber Associates Center | Cyberoid | NY | 08645
4 | General Consolidated | 35 Libra Plaza | | Nashua | NH | 09242
5 | Consolidated MultiCorp | 1 Broadway | | Middletown | IN | 32467
6 | Internet Behemoth Ltd. | 88 Oligopoly Place | | Sagrado | TX | 78798
7 | Consolidated Brands, Inc. | 3 Independence Parkway | | Rivendell | CA | 93456
8 | Little Mighty Micro | 34 Last One Drive | | Orinda | CA | 94563
9 | SportLine Ltd. | 38 Champion Place | Suite 882 | Compton | CA | 95328
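A join between the parts table and the invoice line table can be tried with Python's built-in `sqlite3` module. This is a sketch, not the tool used in the course: the table and column names (`part`, `invoice_line`, `part_no`, `stock_count`) are made up for the example, and only a few rows from the tables above are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE part (
        part_no INTEGER PRIMARY KEY, name TEXT, price REAL, stock_count INTEGER);
    CREATE TABLE invoice_line (
        invoice_no INTEGER, part_no INTEGER, quantity INTEGER);
""")
conn.executemany(
    "INSERT INTO part VALUES (?, ?, ?, ?)",
    [(3, "Tiny red widget", 5.25, 7), (4, "large red widget", 157.23, 23)],
)
conn.executemany(
    "INSERT INTO invoice_line VALUES (?, ?, ?)",
    [(93774, 3, 10), (88647, 4, 3), (84747, 23, 1)],
)

# Inner join: each invoice line is matched to its part through part_no.
joined = conn.execute("""
    SELECT l.invoice_no, p.name, l.quantity, p.price * l.quantity AS line_total
    FROM invoice_line AS l
    JOIN part AS p ON p.part_no = l.part_no
    ORDER BY l.invoice_no
""").fetchall()

for row in joined:
    print(row)
```

The invoice line for part 23 has no matching row in `part`, so the inner join leaves it out. A `LEFT JOIN` from `invoice_line` would keep it with NULL part columns.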
More on Assignment and ER
• Just what is this Cookie database?
• What sort of ways might it be used?
• What are those ER symbols again?
Original Assignment
• Examine the Cookie database using Access and look at the ER Diagram for it posted on the assignments page.
• Consider the possibilities of Book publications
– What are the problems with the database?
– What new fields would you add to the database, and where?
– Draw a new ER diagram showing your design.
Cookie ER diagram
[Figure: ER diagram of the Cookie database. Entities: BIBFILE, CALLFILE, LIBFILE, PUBFILE, SUBFILE, INDXFILE. Relationships: Has call, Has copy, Has index, Has subject, publishes. Linking attributes: accno, libid, pubid, subcode.]
Note: diagram contains only attributes used for linking
Cookie Database
• Cookie is a bibliographic database that contains information about a hypothetical union catalog of several libraries
• There are currently 5 main types of entities in the database (and one linking relation)
– Books (bibfile)
– Local Call numbers (callfile)
– Libraries (libfile)
– Publishers (pubfile)
– Subject headings (subfile)
– Links between subject and books (indxfile)
BIBFILE
• Books (BIBFILE) contains information about particular books. It includes one record for each book. The attributes are:
– accno -- an “accession” or serial number
– author -- The author’s name
– title -- The title of the book
– loc -- Location of publication (where published)
– date -- Date of publication
– price -- Price of the book
– pagination -- Number of pages
– ill -- What type of illustrations (maps, etc) if any
– height -- Height of the book in centimeters
CALLFILE
• CALLFILE contains call numbers and holdings information linking particular books with particular libraries. Its attributes are:
– accno -- the book accession number
– libid -- the id of the holding library
– callno -- the call number of the book in the particular library
– copies -- the number of copies held by the particular library
LIBFILE
• LIBFILE contains information about the libraries participating in this union catalog. Its attributes include:
– libid -- Library id number
– library -- Name of the library
– laddress -- Street address for the library
– lcity -- City name
– lstate -- State code (postal abbreviation)
– lzip -- zip code
– lphone -- Phone number
– mop - suncl -- Library opening and closing times for each day of the week.
PUBFILE
• PUBFILE contains information about the publishers of books. Its attributes include:
– pubid -- The publisher’s id number
– publisher -- Publisher name
– paddress -- Publisher street address
– pcity -- Publisher city
– pstate -- Publisher state
– pzip -- Publisher zip code
– pphone -- Publisher phone number
– ship -- standard shipping time in days
SUBFILE
• SUBFILE contains each unique subject heading that can be assigned to books. Its attributes are:
– subcode -- Subject identification number
– subject -- the subject heading/description
INDXFILE
• INDXFILE provides a way to allow many-to-many mapping of subject headings to books. Its attributes consist entirely of links to other tables
– subcode -- link to subject id
– accno -- link to book accession number
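The many-to-many mapping through INDXFILE can be sketched with `sqlite3`. The table and column names follow the slides, but the `accno` and `subcode` values, the second subject heading, and the subject assignments are invented for illustration, and BIBFILE is cut down to three columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bibfile (accno INTEGER PRIMARY KEY, author TEXT, title TEXT);
    CREATE TABLE subfile (subcode INTEGER PRIMARY KEY, subject TEXT);
    -- Linking relation: one row per (subject, book) pair.
    CREATE TABLE indxfile (
        subcode INTEGER REFERENCES subfile(subcode),
        accno   INTEGER REFERENCES bibfile(accno),
        PRIMARY KEY (subcode, accno)
    );
""")
conn.executemany("INSERT INTO bibfile VALUES (?, ?, ?)", [
    (1, "Whitehead, Alfred", "The Aims of Education and Other Essays"),
    (2, "Flexner, Abraham", "Universities: American, English, German"),
])
conn.executemany("INSERT INTO subfile VALUES (?, ?)", [
    (10, "Higher education"),
    (20, "Education -- Philosophy"),   # hypothetical heading
])
# Book 1 carries two subjects; subject 10 is shared by two books.
conn.executemany("INSERT INTO indxfile VALUES (?, ?)",
                 [(10, 1), (20, 1), (10, 2)])

titles = [row[0] for row in conn.execute("""
    SELECT b.title
    FROM bibfile AS b
    JOIN indxfile AS i ON i.accno = b.accno
    JOIN subfile  AS s ON s.subcode = i.subcode
    WHERE s.subject = ?
    ORDER BY b.title
""", ("Higher education",))]
print(titles)
```

Because INDXFILE holds only the two keys, a book can carry any number of subjects and a subject any number of books without repeating either record.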
Some examples of Cookie Searches
• Who wrote Microcosmographia Academica?
• How many pages long is Alfred Whitehead’s The Aims of Education and Other Essays?
• Which branches in Berkeley’s public library system are open on Sunday?
• What is the call number of Moffitt Library’s copy of Abraham Flexner’s book Universities: American, English, German?
• What books on the subject of higher education are among the holdings of Berkeley (both UC and City) libraries?
• Print a list of the Mechanics Library holdings, in descending order by height.
• What would it cost to replace every copy of each book that contains illustrations (including graphs, maps, portraits, etc.)?
• Which library closes earliest on Friday night?
ER Diagram Symbols
• Entity: Rectangles are used to indicate entities (that is, the representatives or records describing persons, things, or events in the database).
• Attribute: Ovals are used to indicate the attributes associated with an entity or relationship (that is, the pieces of information recorded in the database about the entity or relationship).
• Primary key: An underlined name indicates that the attribute is a primary key (that is, it can uniquely identify the entity).
• Relationship: Diamonds are used to indicate relationships between entities (that is, some association between the data records of different entities).
Assignment Goal
• The main intent is to have you start thinking about how databases are structured, and what types of information can or should be included when designing a database
• The main task is to look for MISSING elements in the current design, or badly designed elements given the particular data
• What attributes and/or new relations need to be added to the database?
And now for something completely different...
Today
• Controlled vocabularies
• Choice of names
• Form of names
• Name Authority files
Controlled Vocabularies
• Vocabulary control is the attempt to provide a standardized and consistent set of terms (such as subject headings, names, classifications, etc.) with the intent of aiding the searcher in finding information.
Controlled Vocabularies
• Names and name authorities (Today)
• Cognitive basis of categorization and subject classification (Thursday)
• Design of controlled vocabularies for subject access -- Thesaurus design (next week)
Names
• Cutter’s objectives of bibliographic description:
– To enable a person to find a document of which the author is known.
– To show what the library has by a given author.
• First serves access.
• Second serves collocation.
Problems with Names
• How many names should be associated with a document?
• Which of these should be the “main entry”?
• What form should each of the names take?
• What references should be made from other possible forms of names that haven’t been used?
The problem
• Proliferation of the forms of names
– Different names for the same person
– Different people with the same names
• Examples
– from Books in Print (semi-controlled but not consistent)
– ERIC author index (not controlled)
Rules for description
• AACR II and other sets of descriptive cataloging rules provide guidelines for:
– Determining the number of name entries
– Choosing a main entry
– Deciding on the form of name to be used
– Deciding when to make references
Authority control
• Authority control is concerned with the creation and maintenance of a set of terms that have been chosen as the standard representatives (also known as established headings) based on some set of rules.
• If you have rules, why do you need to keep track of all of the headings?
Conditions of Authorship?
• Single person or single corporate entity
• Unknown or anonymous authors
• Shared responsibility
• Collections or editorially assembled works
• Works of mixed responsibility (e.g. translations)
• Related Works
Added Entries
• Personal names
– Collaborators
– Editors, compilers, writers
– Translators (in some cases)
– Illustrators (in some cases)
– Other persons associated with the work (such as the honoree in a Festschrift).
• Corporate Names
– Any prominently named corporate body that has involvement in the work beyond publication, distribution, etc.
Choice of Name
• AACR II says that the predominant form of the name used in a particular author’s writings should be chosen as the form of name.
• References should be made from the other forms of the name.
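The "predominant form" rule can be illustrated with a simple tally. This sketch reduces the rule to a count of how often each form appears; actual cataloging practice involves more judgment. The function `choose_heading` and the sample usage list are hypothetical.

```python
from collections import Counter


def choose_heading(usages):
    """Pick the most frequent form as the heading; the rest become references."""
    counts = Counter(usages)
    heading, _ = counts.most_common(1)[0]
    references = sorted(form for form in counts if form != heading)
    return heading, references


# Hypothetical forms of name found on one author's title pages.
usages = [
    "Smith, John",
    "Smith, J.",
    "Smith, John",
    "Smith, John Q.",
    "Smith, John",
]

heading, references = choose_heading(usages)
print(heading)
print(references)
```

Each form in `references` would get a "see" reference pointing to the chosen heading.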
Form of the Name
• When names appear in multiple forms, one form needs to be chosen. Criteria for choice are:
– Fullness (e.g. Full names vs. initials only)
– Language of the name.
– Spelling (choose predominant form)
• Entry element:
– John Smith or Smith, John?
– Mao Zedong or Zedong, Mao? (Mao Tse Tung?)
Name Authority Files
ID:NAFL8057230 ST:p EL:n STH:a MS:c UIP:a TD:19910821174242
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:05-14-80
RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
VST:d 08-21-91 Other Versions: earlier
040 DLC$cDLC$dDLC$dOCoLC
053 PR6005.R517
100 10 Creasey, John
400 10 Cooke, M. E.
400 10 Cooke, Margaret,$d1908-1973
400 10 Cooper, Henry St. John,$d1908-1973
400 00 Credo,$d1908-1973
400 10 Fecamps, Elise
400 10 Gill, Patrick,$d1908-1973
400 10 Hope, Brian,$d1908-1973
400 10 Hughes, Colin,$d1908-1973
400 10 Marsden, James
400 10 Matheson, Rodney
400 10 Ranger, Ken
400 20 St. John, Henry,$d1908-1973
400 10 Wilde, Jimmy
500 10 $wnnnc$aAshe, Gordon,$d1908-1973
Name Authority Files
ID:NAFO9114111 ST:p EL:n STH:a MS:n UIP:a TD:19910817053048
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:06-03-91
RFE:a CSC:c SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
VST:d 08-19-91
040 OCoLC$cOCoLC
100 10 Marric, J. J.,$d1908-1973
500 10 $wnnnc$aCreasey, John
663 Works by this author are entered under the name used in the item. For a listing of other names used by this author, search also under$bCreasey, John
670 OCLC 13441825: His Gideon's day, 1955$b(hdg.: Creasey, John; usage: J.J. Marric)
670 LC data base, 6/10/91$b(hdg.: Creasey, John; usage: J.J. Marric)
670 Pseuds. and nicknames dict., c1987$b(Creasey, John, 1908-1973; British author; pseud.: Marric, J. J.)
Name Authority Files
ID:NAFL8166762 ST:p EL:n STH:a MS:c UIP:a TD:19910604053124
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:08-20-81
RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
VST:d 06-06-91 Other Versions: earlier
040 DLC$cDLC$dDLC$dOCoLC
100 10 Butler, William Vivian,$d1927-
400 10 Butler, W. V.$q(William Vivian),$d1927-
400 10 Marric, J. J.,$d1927-
670 His The durable desperadoes, 1973.
670 His The young detective's handbook, c1981:$bt.p. (W.V. Butler)
670 His Gideon's way, 1986:$bCIP t.p. (William Vivian Butler writing as J.J. Marric)
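The 100 (established heading) and 400 (see-from reference) fields of the Creasey record can be turned into a lookup table that sends a searcher from any variant to the established form. This is a rough sketch: the helpers `parse_name` and `build_index` are invented for the example, the record is abbreviated to a few fields, and the parsing assumes one field per line in the `tag indicators body` layout shown above, not a full MARC parser.

```python
# A few fields from the Creasey authority record, one per line.
record = """\
100 10 Creasey, John
400 10 Cooke, M. E.
400 10 Cooke, Margaret,$d1908-1973
400 00 Credo,$d1908-1973
400 10 Wilde, Jimmy
"""


def parse_name(field_body):
    """Keep the text before the first $ subfield marker; drop a trailing comma."""
    return field_body.split("$", 1)[0].rstrip(", ")


def build_index(text):
    """Map each variant (and the heading itself) to the established heading."""
    heading = None
    variants = []
    for line in text.splitlines():
        tag, _indicators, body = line.split(" ", 2)
        if tag == "100":
            heading = parse_name(body)
        elif tag == "400":
            variants.append(parse_name(body))
    # Lower-case the keys so lookups ignore capitalization.
    index = {v.lower(): heading for v in variants}
    index[heading.lower()] = heading
    return index


index = build_index(record)
print(index.get("cooke, margaret"))
```

The sketch skips the 500 (see-also) fields. As the Marric records show, the same pseudonym can point to different people, so a fuller version would have to keep dates and handle more than one heading per name.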