Building Taxonomies, Part 4
Alice Redmond-Neal, Access Innovations, Inc.
Enterprise Search Summit, New York City, May 21, 2006
Copyright © 2006 Access Innovations, Inc.




Evaluating terms

• Do terms represent all necessary concepts?
  – Gap analysis
• Do terms capture necessary details?
  – Level of granularity
• Are terms understood by users?
  – Domain expert vs. common user


Talk about terms

• Term format
• Grammatical issues
• Singular and plural forms
• Spelling
• Abbreviations and acronyms
• Capitalization
• Other punctuation
• Consistency


Term format

• KISS: keep it short and simple
  – 1-2-3 words
• Effect on search
• Factoring, postcoordination (coming)
• Grammatical issues
  – Nouns and noun phrases
  – Verbish things
  – Adjectives
  – Adverbs
  – Initial articles


Most terms are nouns

• Nouns or simple noun phrases (phrase = compound or bound term)
  – Adj + Noun: Art history (ANSI/NISO standard)
  – Noun + Prep + Noun: History of art (ISO standard)
  – Exceptions: Burden of proof, Coats of arms, Prisoners of war, Birds of prey, etc.


Other parts of speech

• Verbs
  – Gerund form: Fishing
• Adjectives
  – Not used in isolation
  – Very rare (lots in Art & Architecture Thesaurus)
  – OK when combined with another term: Dental bridges
• Adverbs
  – No, except as part of proper name: Very Large Array
• Articles
  – No, except as part of proper name: El Salvador, Le Mans


Singular and plural forms

• Plural form for count nouns
  – “how many”: clouds, animals, highways
• Singular form for mass nouns
  – “how much”: security, oxygen, rain
• Exceptions
  – Body parts in medicine singular (heart, foot)
  – Unique entities singular (Brooklyn Bridge)
  – User warrant plural/singular (fishes)

stocks? fishes? monies?


Term spelling

• Preferred spelling depends on audience
  – Multinational company may need alternative spellings in same taxonomy
• Use most widely accepted spelling
• Use secondary spelling as NonPreferred Term (synonym)
• Exception:
  – Proper names: Labour Party


Abbreviations and acronyms

• Use only when full form is rarely seen: SCUBA, LASER, DNA, LASIK
• Use full form if abbreviation is not widely used and understood
  – Automated teller machines, for ATM
  – Driving while intoxicated, for DWI
• Alternative becomes NonPreferred Term
• Use and acceptance always shifting
• Be consistent
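
The rule that the alternative form becomes a NonPreferred Term can be pictured as a lookup from entry terms to Preferred Terms. The following is a minimal sketch, not a description of any particular thesaurus software: the `USE` table and the `preferred` function are hypothetical names, and the entries simply reuse the slide's own examples.

```python
# Hypothetical entry vocabulary: NonPreferred Term -> Preferred Term.
# The pairs reuse the slide's examples; a real thesaurus would hold many more.
USE = {
    "ATM": "Automated teller machines",
    "DWI": "Driving while intoxicated",
}

# Case-insensitive view of the table, so "atm" and "ATM" resolve the same way.
_USE_FOLDED = {key.casefold(): value for key, value in USE.items()}


def preferred(term: str) -> str:
    """Return the Preferred Term for `term`, or `term` unchanged if no USE reference exists."""
    cleaned = term.strip()
    return _USE_FOLDED.get(cleaned.casefold(), cleaned)


print(preferred("atm"))    # resolved through the USE table
print(preferred("LASER"))  # no USE reference, so returned as typed
```

Because use and acceptance keep shifting, swapping a Preferred Term for its abbreviation later only means reversing one entry in the table.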


Capitalization

• Standards: use all lower case
  – Exceptions:
    • Initialisms: DNA
    • Proper names: Queen Mary
    • Trade names: Thesaurus Master™
    • Taxonomic names: Homo sapiens
• Much variation in practice


Parentheses

• Use only for
  – Parenthetical qualifiers to disambiguate homographs
    • Bridges (Dentistry), Bridges (Roadways), Bridges (Music)
  – Different meanings for singular / plural word forms
    • Bridges [all the above] vs. Bridge (Card game)
    • Wood (Material) vs. Woods (Forest)
    • Damage (Injury) vs. Damages (Law)
  – Facet indicators: Paint (by finish)
  – Part of the term: benzo(a)pyrene
  – Trademark indicator: (tm) becomes ™
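
To show how a trailing qualifier differs from parentheses that are part of the term, here is a small sketch. It assumes that a qualifier is always the last thing in the term and is preceded by a space. That assumption and the name `split_qualifier` are illustrative only.

```python
import re

# A trailing "(...)" preceded by whitespace is treated as a qualifier.
# Parentheses inside the term, as in benzo(a)pyrene, do not match.
_QUALIFIER = re.compile(r"^(.*\S)\s+\(([^()]+)\)$")


def split_qualifier(term: str):
    """Split 'Bridges (Dentistry)' into ('Bridges', 'Dentistry').

    Returns (term, None) when the term has no trailing qualifier.
    """
    match = _QUALIFIER.match(term.strip())
    if match:
        return match.group(1), match.group(2)
    return term.strip(), None


for example in ["Bridges (Dentistry)", "Bridge (Card game)", "benzo(a)pyrene"]:
    print(split_qualifier(example))
```

A facet indicator such as Paint (by finish) has the same shape, so this sketch would report it as a qualifier too. Telling the two apart needs editorial knowledge rather than pattern matching.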


Hyphens

• Generally avoid: nonfiction
• Use only if
  – Omitting the hyphen would be ambiguous
    • cocitation vs. co-occurrence
  – The hyphen is part of the term
    • n-body problem
    • p-benzoquinone
    • CD-ROM


Other punctuation bits

• Apostrophes
  – Keep for possessive case
• Diacritical marks
  – Keep if possible: Québec
• Other random marks
  – Keep if part of a proper name: A&W Root Beer, Standard & Poor's


Compound terms (aka bound terms) and factored terms

• Term consisting of more than one word that represents a single concept
• Keep compound term or factor out (split)?


Compound terms are precoordinated

• Elements are bound together to specify a concept at the indexing stage
• Can’t change the parts
  – Water pollution
  – Library science
  – Television influence on preschoolers
  – Chicken dinner with turnips and rutabagas: no substitutions of menu items!


Factored terms can be postcoordinated

• Elements can be strung together to specify a concept at the search stage
• Elements can be mixed and combined as needed
  – Few clothing pieces, several outfits
• The sum of the elements reflects the concept (usually)


To factor or not to factor

Is each factor a single concept?
Is each factor in your thesaurus?

If YES, break term down to factors:
  California highway construction → California + Highways + Construction

If NO, or if factoring would be confusing, retain the compound term:
  Children’s television → Television + Children ??
  Science library → Library + Science ??
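
The two questions on this slide can be sketched as a small decision function. Everything named here is hypothetical: `factor_term`, the sample `thesaurus` set, and the `keep_compound` set, which stands in for the editor's judgment that splitting would confuse users. The code cannot decide whether a factor is "a single concept". It only checks thesaurus membership and defers to that editorial list.

```python
def factor_term(compound, factors, thesaurus, keep_compound=frozenset()):
    """Return the factors to index with, or [compound] if the term should stay whole.

    compound      -- the candidate compound term
    factors       -- the proposed single-concept factors (chosen by an editor)
    thesaurus     -- set of terms already in the thesaurus
    keep_compound -- compounds an editor has judged confusing to split
    """
    if compound in keep_compound:
        return [compound]
    if all(factor in thesaurus for factor in factors):
        return list(factors)
    return [compound]


# Illustrative data only, drawn from the slide's examples.
thesaurus = {"California", "Highways", "Construction",
             "Television", "Children", "Library", "Science"}
keep = {"Children's television", "Science library"}

print(factor_term("California highway construction",
                  ["California", "Highways", "Construction"], thesaurus, keep))
print(factor_term("Children's television",
                  ["Television", "Children"], thesaurus, keep))
```

The second call keeps the compound even though both factors exist in the thesaurus, which is the "factoring would be confusing" branch.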


Precoordination positives

• User expectations: Rapid transit
  – Occurs commonly in data
  – Splitting would be odd
  – Reflects a single concept for the audience
• Better accuracy: captures specific concepts precisely
• Fewer false drops
• Term information is retained (Related Terms, NonPreferred Terms, Scope Notes, …)


Precoordination negatives

• Poorer total recall
• Term proliferation
  – Combinations and permutations increase thesaurus size
• Higher cost
• Limited flexibility in expressing new concepts


Postcoordination pros and cons

• Higher recall
• Lower cost
• Greater flexibility: enables expression of new concepts through novel combinations
x Lower accuracy, some false drops
  – Library science NOT = Library + Science
  – Art museums NOT = Art + Museums
• Postcoordination is implicit in most online searches (implied AND between search words)
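
The trade-off between accuracy and recall can be seen in a toy search. This is a sketch under stated assumptions: the two documents, their index terms, and the `search` function are invented for illustration, and the implied AND is modeled as a simple subset test.

```python
# Two hypothetical documents, each indexed two ways.
# doc1 is about the discipline of library science;
# doc2 is about a library that holds a science collection.
PRECOORDINATED = {
    "doc1": {"Library science"},
    "doc2": {"Science libraries"},
}
POSTCOORDINATED = {
    "doc1": {"Library", "Science"},
    "doc2": {"Library", "Science"},
}


def search(index, *terms):
    """Return ids of documents indexed with every search term (implied AND)."""
    wanted = set(terms)
    return sorted(doc for doc, assigned in index.items() if wanted <= assigned)


precise = search(PRECOORDINATED, "Library science")
broad = search(POSTCOORDINATED, "Library", "Science")
print(precise)  # only the document about the discipline
print(broad)    # both documents; doc2 is a false drop for this query
```

The postcoordinated search finds more, but it cannot tell the discipline from the building, which is the false drop the slide warns about.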


About “and”

• Avoid “and” in terms: not a single concept
  – Instead of: Children and television
  – Factor and postcoordinate
  – USE Media influence + Television + Children
• “and” OK when both elements are members of a broader class
  – Vessels: Ships and boats
• Your need for granularity may dictate your choice


So far you’ve got

• Hierarchy
• Complete term records
  – Broader and Narrower Terms
    • Polyhierarchies when needed
  – Preferred/NonPreferred Terms (equivalence relationships)
  – Related Terms (associative relationships)
  – Scope Notes
  – Correct term format
  – Compound terms when needed
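
A complete term record, as listed above, can be pictured as a small data structure. The sketch below is one possible shape, not the format of any specific thesaurus tool. The class name `TermRecord`, its field names, and the sample record are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TermRecord:
    """One Preferred Term and its relationships (hypothetical layout)."""
    preferred: str
    broader: list[str] = field(default_factory=list)       # hierarchical, upward
    narrower: list[str] = field(default_factory=list)      # hierarchical, downward
    nonpreferred: list[str] = field(default_factory=list)  # equivalence (synonyms)
    related: list[str] = field(default_factory=list)       # associative
    scope_note: str = ""

    def is_polyhierarchical(self) -> bool:
        """A term with more than one Broader Term sits in a polyhierarchy."""
        return len(self.broader) > 1


# Illustrative record only; the relationships are invented for the example.
record = TermRecord(
    preferred="Bridges (Dentistry)",
    broader=["Dentistry", "Prosthetics"],
    nonpreferred=["Dental bridges"],
    related=["Crowns (Dentistry)"],
    scope_note="Fixed replacements for missing teeth.",
)
print(record.is_polyhierarchical())
```

Keeping all relationships on the record is what lets a compound term retain its Related Terms, NonPreferred Terms, and Scope Notes.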


Notation

• Symbols (numbers, letters, hyphens, colons…)
  – 1: Apples
    • 1.1: Granny Smith
    • 1.2: Winesap
• Another kind of ordering (non-alphabetic)
  – Chronological, positional, numeric sequence, or other logical sequence for user group
  – Same terms presented differently
  – Different user groups, different purposes
• Adjunct to verbal expression of term
• Secondary to verbal concept organization
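
As a sketch of how notation gives a non-alphabetic ordering, the code below sorts the same terms two ways. The dotted-number scheme follows the slide's apple example; the entries "2: Pears" and "10: Quinces" and the function name `notation_key` are hypothetical additions, included to show why notations should be compared as numbers rather than as text.

```python
# Notation -> term. The first three entries come from the slide;
# "2" and "10" are invented to show numeric ordering.
terms = {
    "1": "Apples",
    "1.1": "Granny Smith",
    "1.2": "Winesap",
    "2": "Pears",
    "10": "Quinces",
}


def notation_key(notation: str):
    """Turn '1.2' into (1, 2) so that 10 sorts after 2, not after 1."""
    return tuple(int(part) for part in notation.split("."))


by_notation = [terms[n] for n in sorted(terms, key=notation_key)]
alphabetical = sorted(terms.values())

print(by_notation)   # hierarchy order given by the notation
print(alphabetical)  # same terms, alphabetical order
```

The notation stays an adjunct to the term: the terms themselves are unchanged, and only the presentation order differs.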


Automatic taxonomy construction

• Words and phrases from documents
• Based on frequency and co-occurrence of words
• No semantic analysis
• Produces list of possible terms
• Requires editorial analysis
  – hierarchical and conceptual organization
  – association of related concepts
  – identifying and deduplicating equivalent concepts
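
The frequency and co-occurrence step can be sketched in a few lines. This is a deliberately naive illustration, not the method of any commercial tool: the stopword list, the `min_count` threshold, and the function name `candidate_terms` are assumptions, and co-occurrence is counted once per document for each unordered pair of words.

```python
import re
from collections import Counter
from itertools import combinations

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "and", "in", "is", "of", "the", "to"}


def candidate_terms(documents, min_count=2):
    """Return (frequent words, co-occurring word pairs) seen at least min_count times."""
    word_counts = Counter()
    pair_counts = Counter()
    for text in documents:
        words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
        word_counts.update(words)
        # Each unordered pair of distinct words counts once per document.
        pair_counts.update(combinations(sorted(set(words)), 2))
    frequent = sorted(w for w, n in word_counts.items() if n >= min_count)
    cooccurring = sorted(p for p, n in pair_counts.items() if n >= min_count)
    return frequent, cooccurring


docs = [
    "Water pollution in rivers",
    "Air pollution and water quality",
    "Rivers and highways",
]
frequent, cooccurring = candidate_terms(docs)
print(frequent)
print(cooccurring)
```

The output is only a list of possible terms. Nothing here knows that "water" and "pollution" form one concept, which is why the editorial steps on the slide still have to follow.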


Review, edit, test, edit, use, edit, and maintain, i.e. edit

• Review
  – Users
  – Expert reviewers
• Test
  – Index 500+ documents (more for variable writing style; fewer for strict style)
  – Monitor search log
• Edit and maintain
  – Add term
  – Change existing term
  – Change term status
  – Delete term
  – Add term relationship
  – Delete term relationship
  – Add/modify Scope Note
  – Change overall structure
• Consider machine automated / assisted indexing software
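
Monitoring the search log can be as simple as counting the queries that match nothing in the vocabulary. The sketch below assumes the log is a list of query strings and the vocabulary is a set of Preferred and NonPreferred Terms. The function name `unmatched_queries` and the sample data are hypothetical.

```python
from collections import Counter


def unmatched_queries(search_log, vocabulary):
    """Count queries that match no term, most frequent first.

    Matching is exact apart from case and surrounding whitespace.
    """
    known = {term.casefold() for term in vocabulary}
    misses = Counter()
    for query in search_log:
        normalized = query.strip().casefold()
        if normalized and normalized not in known:
            misses[normalized] += 1
    return misses.most_common()


# Illustrative data only.
vocabulary = {"Rapid transit", "Automated teller machines"}
log = ["ATM", "atm", "Rapid transit", "cash machines", "ATM "]
print(unmatched_queries(log, vocabulary))
```

Frequent misses are candidates for the edit step: here, adding ATM as a NonPreferred Term would resolve three of the four failed searches.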