November 2005 INIS Training Seminar 1

INIS Training Seminar
14-18 November 2005

Subject Analysis, Thesaurus and Computer-assisted Indexing

Alexander Nevyjel
Database Production and Development Group

INIS Unit, INIS&NKM Section, IAEA

Introduction to Subject Analysis

Subject Analysis should be carried out whenever possible by subject specialists with a good knowledge of the subject matter and a familiarity with the subject analysis tools of the respective database (subject categories, thesaurus, subject analysis rules).

Steps of Subject Analysis:
  subject classification
  abstracting
  subject indexing

Subject Classification

The main topic of the document determines the primary subject category.
If there are other significant topics, one or more secondary subject categories can be assigned in addition.

Abstracting

Each input item should contain an English abstract (exception: short communications).
Abstracts in other languages are optional.
If an author abstract is available, it should be checked by the subject specialist and edited if necessary.
An abstract should be as informative as possible.
Emphasize what is novel about the information in the original document.

Thesaurus

What is a Thesaurus?

“A thesaurus is a terminological control device used in translating from the natural language of documents, indexers or users into a more constrained system language. It is a controlled and dynamic vocabulary of semantically and generically related terms which covers a specific domain of knowledge.”

This definition has been adopted by UNESCO: “Guidelines for the establishment and development of monolingual thesauri”, UNESCO, SC/W/255, Paris, September 1973

The Thesaurus and its Structure

Relationship    Symbol   Cross reference
hierarchical    BT       broader term (level 1, 2, ...)
hierarchical    NT       narrower term (level 1, 2, ...)
affinitive      RT       related term
preferential    UF       used for (reciprocally USE ...)
preferential    UF+      used for multiple (reciprocally USE ... AND ...)
preferential    SF       seen for (reciprocally SEE ... OR ...)
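
The relationship types listed above can be pictured as a small data structure. The following is a minimal sketch only: the class and method names are hypothetical, the sample terms are illustrative placeholders and are not taken from the INIS Thesaurus, and the SF / SEE ... OR ... relation is left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One thesaurus entry with its cross references (hypothetical layout)."""
    name: str
    bt: set = field(default_factory=set)   # broader terms (BT)
    nt: set = field(default_factory=set)   # narrower terms (NT)
    rt: set = field(default_factory=set)   # related terms (RT)
    uf: set = field(default_factory=set)   # forbidden terms this descriptor is "used for"
    use: set = field(default_factory=set)  # on a forbidden term: descriptor(s) to USE


class Thesaurus:
    def __init__(self):
        self.terms = {}

    def _get(self, name):
        # Create the entry on first use so relations can be added in any order.
        return self.terms.setdefault(name, Term(name))

    def add_hierarchy(self, narrower, broader):
        """Record a BT relation together with its reciprocal NT."""
        self._get(narrower).bt.add(broader)
        self._get(broader).nt.add(narrower)

    def add_related(self, a, b):
        """RT is symmetric, so it is stored on both terms."""
        self._get(a).rt.add(b)
        self._get(b).rt.add(a)

    def add_forbidden(self, forbidden, *descriptors):
        """UF with one descriptor; UF+ (USE ... AND ...) with several."""
        for descriptor in descriptors:
            self._get(descriptor).uf.add(forbidden)
            self._get(forbidden).use.add(descriptor)

    def all_broader(self, name):
        """Broader terms at every level (level 1, 2, ...)."""
        seen, stack = set(), [name]
        while stack:
            for broader in self._get(stack.pop()).bt:
                if broader not in seen:
                    seen.add(broader)
                    stack.append(broader)
        return seen


# Illustrative placeholder hierarchy, not an extract from the real thesaurus.
thesaurus = Thesaurus()
thesaurus.add_hierarchy("IRON BROMIDES", "BROMIDES")
thesaurus.add_hierarchy("BROMIDES", "HALIDES")
thesaurus.add_forbidden("FORBIDDEN TERM X", "DESCRIPTOR A", "DESCRIPTOR B")
```

Storing each relation together with its reciprocal keeps BT/NT and UF/USE consistent by construction, which is the property the table describes.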

Subject Indexing

Subject indexing means analysing the information content of a piece of literature and expressing the meaningful information content in the language of the database, using the controlled vocabulary of the Thesaurus.
Understanding of the content --> subject specialist
Familiarity with Thesaurus and indexing rules
Select a set of descriptors that describes the subject content of the piece of literature

Procedures for Indexing

Carefully read the title and abstract and scan the body of the piece of literature
  scan the full text (introduction, table of contents, tables, graphs, figures, conclusion) to find information items missing from the abstract or requiring more precision
Identify the concept(s) about which the piece of literature contains useful information
Translate the concepts into descriptors
Avoid overindexing

Proposed Terms (Technical Note 175)

If no suitable descriptor exists in the Thesaurus for the retrieval of a useful concept, make a proposal for a new one, containing the following:
  Proposed term
  Proposed word block of the term (in particular proposed BTs)
  Potential forbidden terms pointing to this proposed descriptor
  Scope note when appropriate
  Explanation and justification for the proposal
  One or more sample records

The purpose of subject indexing is

to enable useful retrieval

Computer-assisted Indexing

Kick-off Meeting                       Jan 2004
Implementation and Customisation       Jun 2004
Production Indexing                    from Jun 2004, ongoing
CAI version 1.0 final acceptance       Aug 2004
Tuning of the system                   from Aug 2004, ongoing
CAI version 1.10 kick-off              Dec 2004
CAI version 1.10 acceptance            Apr 2005
RetrievalWare pilot                    Aug 2005
CAI Thesaurus extension                planned Jan 2006

CAI Thesaurus extension

“Hidden terms” are character patterns representing the different ways in which a concept that is indexed by one or more descriptors can appear in the free text.
handled similarly to “forbidden terms”, with one or more USE relations
CAI internal only
not exported to INIS production system
not exported to FIBRE
not printed in any appearance of the thesaurus
support identification of descriptors in the free text

Hidden Terms: Compounds

Descriptor              hidden term        free text
MAGNESIUM BORIDES       MgB_2              MgB2
MAGNESIUM CARBONATES    MgCO_3             MgCO3
MAGNESIUM HYDRIDES      MgH_2              MgH2
IRON BROMIDES           iron dibromide
IRON BROMIDES           iron tribromide
ARSENIC IONS            As"3"-             As3-
ACETYLENE               C_2H_2             C2H2
ACETALDEHYDE            C_2H_4O            C2H4O
ACETIC ACID             C_2H_4O_2          C2H4O2

approx. 1400 hidden terms (expected 3000)
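
To illustrate how hidden terms can support the identification of descriptors in free text, here is a minimal sketch that looks up some of the free-text forms from the table above. The slides do not describe how CAI actually matches patterns, so the function name, the case-sensitive matching and the word-boundary rule are all assumptions.

```python
import re

# Free-text forms from the compounds table, mapped to their descriptors.
HIDDEN_TERMS = {
    "MgB2": ["MAGNESIUM BORIDES"],
    "MgCO3": ["MAGNESIUM CARBONATES"],
    "MgH2": ["MAGNESIUM HYDRIDES"],
    "iron dibromide": ["IRON BROMIDES"],
    "iron tribromide": ["IRON BROMIDES"],
    "C2H2": ["ACETYLENE"],
    "C2H4O": ["ACETALDEHYDE"],
    "C2H4O2": ["ACETIC ACID"],
}


def suggest_descriptors(text):
    """Return descriptors whose hidden terms occur in the text (assumed logic)."""
    found = []
    for pattern, descriptors in HIDDEN_TERMS.items():
        # The lookarounds stop "C2H4O" from matching inside "C2H4O2".
        regex = r"(?<!\w)" + re.escape(pattern) + r"(?!\w)"
        if re.search(regex, text):
            for descriptor in descriptors:
                if descriptor not in found:
                    found.append(descriptor)
    return found


print(suggest_descriptors("Superconductivity in MgB2 and reactions of C2H4O2"))
```

Several hidden terms may point to the same descriptor (both iron bromide spellings map to IRON BROMIDES), so the sketch removes duplicates before returning its suggestions.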

Hidden Terms: Isotopes

Descriptor    hidden term     free text
CESIUM 137                    Cesium 137, Cesium-137
              "1"3"7cs        137Cs
              137 caesium     137 Caesium, 137-Caesium
              caesium 137     Caesium 137, Caesium-137
              137 cesium      137 Cesium, 137-Cesium
              137 cs          137 Cs, 137-Cs
              cs 137          Cs 137, Cs-137
              cs"1"3"7        Cs137
              cs137           Cs137
CESIUM 138    "1"3"8"mcs      138mCs
              cs"1"3"8"m      Cs138m

approx. 22.400 hidden terms
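
The large number of isotope hidden terms follows from the many ways a single nuclide can be written. The sketch below generates the free-text spellings shown for CESIUM 137; the function name and the generation rules are assumptions derived only from that example, and metastable states such as 138m are not covered.

```python
def isotope_variants(names, symbol, mass):
    """Generate free-text spellings of an isotope (assumed rules).

    names  -- element names, e.g. US and UK spelling
    symbol -- chemical symbol
    mass   -- mass number
    """
    variants = []
    for label in list(names) + [symbol]:
        variants += [
            f"{label} {mass}",   # e.g. Cesium 137
            f"{label}-{mass}",   # e.g. Cesium-137
            f"{mass} {label}",   # e.g. 137 Cesium
            f"{mass}-{label}",   # e.g. 137-Cesium
        ]
    # Forms written without a separator, e.g. 137Cs and Cs137.
    variants += [f"{mass}{symbol}", f"{symbol}{mass}"]
    return variants


variants = isotope_variants(["Cesium", "Caesium"], "Cs", 137)
# Every spelling maps back to the single descriptor CESIUM 137.
lookup = {variant: "CESIUM 137" for variant in variants}
print(len(lookup), "spellings for CESIUM 137")
```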

Hidden Terms: Elementary Particles

Descriptor                    hidden term      free text
B QUARKS                      bottom quarks
T QUARKS                      top quarks
ELECTRON NEUTRINOS            #nu#_e           νe
MUON NEUTRINOS                #nu#_#mu#        νµ
TAU NEUTRINOS                 #nu#_#tau#       ντ
RHO-770 MESONS                #rho#-770        ρ-770
OMEGA-782 MESONS              #omega#-782      ω-782
KAONS NEUTRAL                 K"0              K0
KAONS NEUTRAL SHORT-LIVED     K"0_S            K0S
KAONS NEUTRAL LONG-LIVED      K"0_L            K0L

approx. 300 hidden terms

Hidden Terms: UK/US Spellings

Descriptor                   hidden term
A CENTERS                    a centres
ACTIVITY METERS              activity metres
ANALOG COMPUTERS             analogue computers
ANESTHESIA                   anaesthesia
ARCHAEOLOGY                  archeology
AUSTRIAN ORGANIZATIONS       austrian organisations
BALLISTIC MISSILE DEFENSE    ballistic missile defence
BAYARD-ALPERT GAGES          bayard-alpert gauges
BEAM ANALYZERS               beam analysers
BEHAVIOR                     behaviour
CATALOGS                     catalogues

approx. 800 hidden terms

Hidden Terms: Diacritics and Countries

Descriptor                  hidden term

Diacritics:
BAECKLUND TRANSFORMATION    backlund transformation
BRUECKNER MODEL             bruckner model
BRUNSBUETTEL REACTOR        brunsbuttel reactor
MOESSBAUER EFFECT           mossbauer effect

Country Names:
CAMBODIA                    kampuchea
COTE D'IVOIRE               ivory coast
GREECE                      hellas
MYANMAR                     burma
SYRIA                       syrian arab republic
THAILAND                    siam

approx. 250 hidden terms

Hidden Terms: Other Spellings

Descriptor                  hidden term

Singular/Plural
FUNGI                       fungus
FUNGI                       funguses
G MATRIX                    g matrices
G MATRIX                    g matrixes

Reverse Sequence
ATOM-MOLECULE COLLISIONS    atom-molecule scattering
ATOM-MOLECULE COLLISIONS    molecule-atom scattering
ATOM-MOLECULE COLLISIONS    atom-molecule reactions
ATOM-MOLECULE COLLISIONS    molecule-atom reactions
ATOM-MOLECULE COLLISIONS    atom-molecule interactions
ATOM-MOLECULE COLLISIONS    molecule-atom interactions

approx. 900 hidden terms

CAI Thesaurus Extension

Thesaurus
  Valid Descriptors 21.953
  Forbidden Terms 9.411

CAI Hidden Terms 29.237

Total 60.601

Terminological Knowledge Base

Further Improvements under Development

“+” and “-” signs
  K+ --> KAONS PLUS, KAONS MINUS, POTASSIUM IONS

Case sensitivity
  TiN --> TIN (instead of TITANIUM NITRIDES)
  gas --> GALLIUM SULFIDES
  “…who is the …” --> WHO (World Health Organization)

Verbs versus Nouns
  “… this leads us to …” --> LEAD
  “… this leaves it ….” --> LEAVES

Homographic terms
  Solutions --> SOLUTIONS or MATHEMATICAL SOLUTIONS

Nuclear Reactions, e.g. 14N(γ,α)10B
  Targets
  Beams
  Reactions
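
The case-sensitivity problem listed above can be made concrete with a small sketch. It is one possible approach, not the method used in CAI, which the slides do not describe: formula-like patterns are matched with exact case, and only ordinary words are matched regardless of case. The dictionaries and the function name are hypothetical; the descriptor names are those on the slide.

```python
import re

# Patterns that must match with exact case (chemical formulae, acronyms).
CASE_SENSITIVE = {
    "TiN": "TITANIUM NITRIDES",
    "GaS": "GALLIUM SULFIDES",
    "WHO": "WHO",
}
# Ordinary words that may match in any case.
CASE_INSENSITIVE = {
    "tin": "TIN",
}


def suggest(text):
    """Suggest descriptors, checking exact-case patterns first (assumed logic)."""
    suggestions = []
    for token in re.findall(r"[A-Za-z]+", text):
        if token in CASE_SENSITIVE:
            descriptor = CASE_SENSITIVE[token]
        elif token.lower() in CASE_INSENSITIVE:
            descriptor = CASE_INSENSITIVE[token.lower()]
        else:
            continue  # "gas" and "who" fall through and suggest nothing
        if descriptor not in suggestions:
            suggestions.append(descriptor)
    return suggestions


print(suggest("TiN coatings on tin substrates in a gas atmosphere; who is the author?"))
```

With this split, "TiN" and "tin" lead to different descriptors, and the lower-case words "gas" and "who" no longer trigger GALLIUM SULFIDES or WHO.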

CAI-Workflow

[Workflow diagram. Inputs: Input from Member States; Electronic Records from Publishers. Record labels: Full Indexing; Proposed Terms / No Indexing; Records with CAI-suggested Descriptors; Records with Full Indexing. Components: CAI Interactive; CAI Offline/Batch; Training of CAI; INIS Subject Analysis Module; INIS Verification and Production System. Paths: Interactive CAI Processing; Batch Mode; Conventional Processing.]

CAI Batch Processing Statistics
Nov 2004 – November 2005

Country           Records   Files
AR Argentina          133       7
AU Australia          443       2
BD Bangladesh           2       1
BG Bulgaria            27       1
BR Brazil              10       1
CH Switzerland         58       4
CN China              294       3
DE Germany            363      11
FR France             243       3
JP Japan                6       1
MK Macedonia          107       1
MY Malaysia           125       3
SE Sweden              27       1
TH Thailand            15       1
UZ Uzbekistan         144       2

Total                1997      42

CAI Batch Processing

Input:  MemSt-CC-yymmdd-xxxxxxxxxxx
Output: _MemSt-CC-yymmdd-xxxxxxxxxxx

MemSt is a standard prefix (meaning “member state”)
CC is the country code
yymmdd is the date when the file was generated
xxxxxxxxxxx is any additional identification

Examples
MemSt-AR-041203-thisismytestfile
MemSt-FR-041212-fileidentification
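
The naming convention above can be checked mechanically before a file is submitted. This is a minimal sketch with hypothetical function names. It assumes that the country code is two upper-case letters (as in the examples) and that the additional identification is non-empty; the slides state neither explicitly.

```python
import re
from datetime import datetime

# MemSt-CC-yymmdd-xxxxxxxxxxx
FILE_NAME = re.compile(r"^MemSt-(?P<cc>[A-Z]{2})-(?P<date>\d{6})-(?P<ident>.+)$")


def parse_input_name(name):
    """Split an input file name into country code, date and identification."""
    match = FILE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a valid CAI batch input name: {name!r}")
    # strptime also rejects impossible dates such as 041332.
    generated = datetime.strptime(match.group("date"), "%y%m%d").date()
    return {
        "country": match.group("cc"),
        "generated": generated,
        "ident": match.group("ident"),
    }


def output_name(input_name):
    """The output file carries the input name with a leading underscore."""
    parse_input_name(input_name)  # validate first
    return "_" + input_name


info = parse_input_name("MemSt-AR-041203-thisismytestfile")
print(info["country"], info["generated"].isoformat(), info["ident"])
print(output_name("MemSt-FR-041212-fileidentification"))
```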

CAI Batch Processing

Output: _MemSt-CC-yymmdd-xxxxxxxxxxx

These files will carry the CAI-suggested descriptors in tag 800, preceded by the string

##CAI suggestions##;

Example:

800^##CAI suggestions##; DESCRIPTOR1; DESCRIPTOR2; DESCRIPTOR3; …….

The output files are sent back to the member state for reviewing.
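
A reviewer's tool could read the tag 800 line shown in the example as follows. This is a sketch under assumptions: only the single example line is known from the slides, so the use of "^" after the tag number and "; " between descriptors is inferred from it, and the function names are hypothetical.

```python
MARKER = "##CAI suggestions##"


def parse_tag_800(line):
    """Return (has_marker, descriptors) for a tag 800 line."""
    tag, separator, value = line.partition("^")
    if separator != "^" or tag.strip() != "800":
        raise ValueError(f"not a tag 800 line: {line!r}")
    # Split on ";" and drop empty pieces left by trailing separators.
    parts = [part.strip() for part in value.split(";") if part.strip()]
    has_marker = bool(parts) and parts[0] == MARKER
    descriptors = parts[1:] if has_marker else parts
    return has_marker, descriptors


def format_tag_800(descriptors):
    """Write the reviewed descriptors back without the CAI marker."""
    return "800^" + "; ".join(descriptors)


has_marker, descriptors = parse_tag_800(
    "800^##CAI suggestions##; DESCRIPTOR1; DESCRIPTOR2; DESCRIPTOR3"
)
print(has_marker, descriptors)
print(format_tag_800(descriptors))
```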

CAI Batch Processing
Reviewing Process

Delete all suggested descriptors which are too general
Add relevant descriptors which were not found
  numerical values, e.g. pressure ranges, temperature ranges, ...
  nuclear reactions
  chemical compounds, alloys, etc.
CAI is cleaning up BT/NTs
  clean up BT/NTs from manual additions
Clean up suggestions from homographic terms
Delete “##CAI suggestions##”
Submit file to “INIS Input Box”
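
One plausible reading of "clean up BT/NTs" is that a descriptor is dropped when a narrower term of it is also assigned. The sketch below implements only that reading, which the slides do not spell out; the function name is hypothetical and the sample hierarchy is an illustrative placeholder, not an extract from the INIS Thesaurus.

```python
def drop_redundant_broader_terms(descriptors, broader):
    """Remove descriptors that are broader terms of another assigned descriptor.

    broader maps a descriptor to the set of all its broader terms
    (every level), as recorded in the thesaurus.
    """
    redundant = set()
    for descriptor in descriptors:
        redundant |= broader.get(descriptor, set())
    # Keep the original order of the remaining descriptors.
    return [d for d in descriptors if d not in redundant]


# Illustrative placeholder hierarchy.
broader = {"CESIUM 137": {"CESIUM ISOTOPES", "ISOTOPES"}}
assigned = ["ISOTOPES", "CESIUM 137", "SOILS"]
print(drop_redundant_broader_terms(assigned, broader))
```

Running the same check after manual additions would cover the step in which the reviewer cleans up BT/NTs introduced by hand.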

INIS Subject Specialists

A2477    Alexander Nevyjel              Physics
A2472    Bekele Negeri                  Life Science
A2474    (Christine Krieger-Levine)     Reactors
A2474    Christine Krieger-Levine       Chemistry
A2479    Anvar Avezov                   Physics