CIG Conference Norwich September 2006
AUTINDEX: Automatic Indexing and Classification of Texts
Catherine Pease & Paul Schmidt
IAI, Saarbrücken
{cath,paul}@iai.uni-sb.de
http://www.iai.uni-sb.de
Automatic Indexing and Classification of Texts
AUTINDEX:-
• calculates keywords in texts
• places text in its appropriate classification
APPLICATIONS
• Information Services for indexing scientific articles
• Document Management Systems for text classification according to content
• Libraries for indexing incoming books and articles
Basic Components
• Morpho-syntactic analysis: tagging and lemmatisation
• Shallow parsing: resolution of grammatical ambiguities and identification of NPs
Linguistic Resources for Pre-processing
• Morphological Analyser & Morpheme dictionaries
• Grammar rules for shallow parsing
Morphological Analyser
“Cost reduction”
cost:
{lu=cost,ls=cost,c=verb,vtype=fiv}
{lu=cost,ls=cost,c=verb,vtype=inf}
{lu=cost,ls=cost,c=noun,nb=sg}
reduction:
{lu=reduction,ls=reduce,c=noun,nb=sg}
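As a minimal sketch, the readings above could be held as feature dictionaries and queried for ambiguity. The feature names (lu, ls, c, vtype, nb) are copied from the slide; `LEXICON`, `analyse`, `categories` and `is_ambiguous` are hypothetical names for illustration, not part of AUTINDEX.

```python
# Hypothetical toy lexicon mirroring the readings shown above.
LEXICON = {
    "cost": [
        {"lu": "cost", "ls": "cost", "c": "verb", "vtype": "fiv"},
        {"lu": "cost", "ls": "cost", "c": "verb", "vtype": "inf"},
        {"lu": "cost", "ls": "cost", "c": "noun", "nb": "sg"},
    ],
    "reduction": [
        {"lu": "reduction", "ls": "reduce", "c": "noun", "nb": "sg"},
    ],
}


def analyse(token):
    """Return every reading for a token (empty list if unknown)."""
    return LEXICON.get(token.lower(), [])


def categories(token):
    """Return the set of categories (the c feature) a token can have."""
    return {reading["c"] for reading in analyse(token)}


def is_ambiguous(token):
    """True if the token has readings in more than one category."""
    return len(categories(token)) > 1
```

Here "cost" is ambiguous between verb and noun, which is the kind of ambiguity the shallow parsing step has to resolve.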
Shallow Parsing
The company evaluated the cost reduction
[The company]: NP
[evaluated]: finite verb
[the cost reduction]: NP ("cost" resolved as noun)
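The example can be reproduced with a toy chunker. This is only a sketch under stated assumptions: the candidate categories in `CANDIDATES` and the single disambiguation rule (prefer the noun reading after a determiner or noun) are invented for illustration and are not the grammar rules AUTINDEX uses.

```python
# Candidate categories per token, as a morphological analyser might return them.
# The table and the rule below are illustrative assumptions only.
CANDIDATES = {
    "the": {"det"},
    "company": {"noun"},
    "evaluated": {"verb"},
    "cost": {"noun", "verb"},
    "reduction": {"noun"},
}


def disambiguate(tokens):
    """Pick one category per token using a single toy context rule."""
    tags = []
    for tok in tokens:
        cats = CANDIDATES[tok.lower()]
        if len(cats) == 1:
            tag = next(iter(cats))
        elif tags and tags[-1] in ("det", "noun") and "noun" in cats:
            # After a determiner or noun, prefer the noun reading.
            tag = "noun"
        else:
            tag = sorted(cats)[0]
        tags.append(tag)
    return tags


def chunk(tokens):
    """Group runs of determiners and nouns into NP chunks."""
    chunks, current = [], []
    for tok, tag in zip(tokens, disambiguate(tokens)):
        if tag in ("det", "noun"):
            current.append(tok)
            continue
        if current:
            chunks.append(("NP", " ".join(current)))
            current = []
        chunks.append((tag, tok))
    if current:
        chunks.append(("NP", " ".join(current)))
    return chunks


print(chunk("The company evaluated the cost reduction".split()))
```

A real shallow parser would need many more rules; this one only shows how resolving "cost" to a noun lets "the cost reduction" be recognised as a single NP.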
Controlled Indexing
• Identifies multiword terms and their syntactic variants
• Calculates keywords based on frequency and semantic weighting
• Checks thesaurus for relevant entry
• Classifies text
Linguistic Resources for Indexing
• Multiword Terms and Variants
Direct match: cost reduction -> cost reduction
Indirect match (inflectional differences): cost reduction -> cost reductions
Linguistic Resources for Indexing
lexical synonyms: rise – increase
derivational synonyms: biomagnetic – biomagnetism; air pollutant – air pollution
Linguistic Resources for Indexing
structural variants: costs of reduction – reduction costs
combined (structural plus derivational):
transmitted DC power – DC power transmission
to calculate plane waves – plane wave calculation
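One simple way to illustrate variant matching is to reduce each phrase to an order-independent key of derivational stems. This is a sketch, not the AUTINDEX method: the `STEM` table and `STOPWORDS` set are hand-built assumptions covering only the examples above, and ignoring word order will over-match in real text.

```python
# Hypothetical stem table: maps inflected and derived forms to one base form.
STEM = {
    "cost": "cost", "costs": "cost",
    "reduction": "reduce", "reductions": "reduce",
    "transmitted": "transmit", "transmission": "transmit",
    "calculate": "calculate", "calculation": "calculate",
    "plane": "plane", "wave": "wave", "waves": "wave",
    "dc": "dc", "power": "power",
}
# Function words ignored when comparing structural variants (assumed list).
STOPWORDS = {"of", "to", "the"}


def term_key(phrase):
    """Reduce a phrase to a sorted tuple of stems, ignoring function words."""
    words = [w.lower() for w in phrase.split()]
    stems = [STEM.get(w, w) for w in words if w not in STOPWORDS]
    return tuple(sorted(stems))


def same_term(a, b):
    """True if two phrases reduce to the same stem key."""
    return term_key(a) == term_key(b)
```

With this key, `same_term("transmitted DC power", "DC power transmission")` holds because both phrases reduce to the stems dc, power and transmit.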
Semantic Weighting
• 140 semantic types in dictionaries
• Weight assigned to nouns depending on semantic type
• Result of weighting: a set of keywords belonging to the most frequent semantic classes
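The weighting idea can be sketched as frequency multiplied by a per-type weight, keeping only keywords from the highest-scoring semantic classes. The semantic types, the weights and the `top_types` cut-off below are invented for illustration; the slides do not give the actual types or formula.

```python
from collections import Counter

# Hypothetical semantic types and weights (the real system has 140 types).
SEM_TYPE = {
    "cost": "economic",
    "price": "economic",
    "reduction": "process",
    "company": "organisation",
    "year": "time",
}
TYPE_WEIGHT = {"economic": 2.0, "process": 1.5, "organisation": 1.0, "time": 0.2}


def weighted_keywords(nouns, top_types=2):
    """Score nouns by frequency x type weight; keep only the top semantic types."""
    freq = Counter(nouns)
    type_mass = Counter()
    for noun, count in freq.items():
        sem = SEM_TYPE.get(noun)
        if sem is not None:
            type_mass[sem] += count * TYPE_WEIGHT.get(sem, 1.0)
    keep = {sem for sem, _ in type_mass.most_common(top_types)}
    scores = {
        noun: count * TYPE_WEIGHT.get(SEM_TYPE[noun], 1.0)
        for noun, count in freq.items()
        if SEM_TYPE.get(noun) in keep
    }
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

For the nouns cost, cost, price, reduction, year, company, the two strongest classes are "economic" and "process", so "year" and "company" drop out of the keyword list.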
Classification
• Descriptors annotated with Classification Code
• Hyperonym and Synonym relations used
• Frequency used to calculate Topic Classification
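A minimal sketch of frequency-based classification: each descriptor carries a classification code, and the code with the highest total descriptor frequency is chosen as the topic. The codes and the `DESCRIPTOR_CODE` table are hypothetical, and the hyperonym and synonym relations mentioned above are left out to keep the example short.

```python
from collections import Counter

# Hypothetical descriptor -> classification code annotations.
DESCRIPTOR_CODE = {
    "cost reduction": "ECON",
    "price": "ECON",
    "air pollution": "ENV",
}


def classify(descriptor_counts):
    """Return the code with the highest summed descriptor frequency, or None."""
    totals = Counter()
    for descriptor, count in descriptor_counts.items():
        code = DESCRIPTOR_CODE.get(descriptor)
        if code is not None:
            totals[code] += count
    if not totals:
        return None
    return totals.most_common(1)[0][0]
```

For example, a text with "cost reduction" three times, "price" once and "air pollution" twice would be assigned the code ECON (4 against 2).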
User-Specific Thesauri
• Keywords checked against Thesaurus
• Hierarchical Structure of Thesaurus used to calculate Descriptors:
- hyperonym relations
- synonym relations
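The thesaurus lookup can be sketched as two small mappings: synonyms point to a preferred descriptor, and hyperonym links point to the broader descriptor. The thesaurus content below is invented for illustration (only the pair rise/increase comes from the slides), and `to_descriptor` and `with_broader` are hypothetical helper names.

```python
# Hypothetical user thesaurus.
SYNONYMS = {"rise": "increase"}  # non-preferred term -> preferred descriptor
HYPERONYMS = {"air pollution": "pollution", "increase": "change"}  # narrower -> broader
DESCRIPTORS = {"increase", "change", "air pollution", "pollution"}


def to_descriptor(keyword):
    """Map a keyword to its thesaurus descriptor, or None if it has no entry."""
    term = SYNONYMS.get(keyword, keyword)
    return term if term in DESCRIPTORS else None


def with_broader(keyword):
    """Return the descriptor for a keyword followed by its broader descriptors."""
    chain, seen = [], set()
    current = to_descriptor(keyword)
    # The seen set guards against accidental cycles in the hierarchy.
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)
        current = HYPERONYMS.get(current)
    return chain
```

A keyword with no thesaurus entry yields an empty list, which is where a free descriptor would be reported instead.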
Example Output
• Keywords: List of descriptors from thesaurus plus weighting
• List of free terms / free descriptors plus weighting
• Topic Classification with relevant code
Free Indexing
• Free indexing follows the same steps as for controlled indexing but without the use of a thesaurus
• The result is a list of free descriptors
Architecture [figure: system architecture diagram]
Bilingual Components
• Automatic language recognition
• Bilingual dictionaries
• Bilingual thesauri
Libraries & the Internet
• Switch of focus from libraries to Internet because of:
- search engines, e.g. Google
- poor access to library resources
Reasons for Poor Access
• search tools need full text match
• human indexation too general and inconsistent
• no flexibility in terms of semantic relations
AUTINDEX in Libraries
• A high percentage of all queries return no hit in the electronic library catalogue
• Of the hits returned for the remaining queries, a high percentage are not used
IntelligentCAPTURE
• Complete processing chain for digital content in libraries:
- scanning of contents tables
- treatment with OCR technology
- automatic indexation
- feeding results into library system
- integration of improved retrieval system
Dandelon database
• Supports 16 EU languages for multilingual retrieval
• Running in 4 countries at 9 libraries
Work Flow [figure: workflow diagram]
Summary
• AUTINDEX provides for controlled and free indexing
• Integrated in a complete processing chain, AUTINDEX can be used to improve access to library resources through efficient methods of indexation