Post on 22-Dec-2015
HIKM’2006 AMTEx
Automatic Document Indexing in Large Medical Collections
Angelos Hliaoutakis, Kalliopi Zervanou, Euripides G.M. Petrakis
Technical University of Crete, Chania, Greece
Evangelos E. Milios
Dalhousie University, Halifax, Canada
Overview
• The need for automatic assignment of index terms in large medical collections
• MMTx (by the US NLM)
• The AMTEx approach to medical document indexing
• AMTEx resources: MeSH & C/NC value
• Experiments & evaluation
• Discussion and future research
Motivation and Objectives
• MeSH is a taxonomy of medical terms
• Subset of UMLS Metathesaurus
• MEDLINE is indexed by MeSH terms (assigned by experts)
• Other medical texts need to be associated with MEDLINE, e.g. consumer medical literature
• Need for automatic assignment of MeSH terms to any medical text
MMTx (MetaMap Transfer)
Maps arbitrary text to UMLS Metathesaurus concepts:
• Parsing to extract noun phrases (syntactic analysis, linguistic filter)
• Variant Generation (uses the SPECIALIST Lexicon)
• Candidate Retrieval (mapping process to Metathesaurus concepts)
• Candidate Evaluation (criteria: centrality, variation, coverage, cohesiveness)
MMTx Example
Parsing
• Shallow syntactic analysis of the input text
• Linguistic filtering: isolates noun phrases
Variant Generation
• e.g. “obstructive sleep apnea” has variants: obstructive sleep apnea, sleep apnea, sleep, apnea, osa, …
Candidate Retrieval
• Candidate Metathesaurus concepts for the variant “osa”: osa [osa antigen], osa [osa gene product], osa [osa protein], osa [obstructive sleep apnea]
Candidate Evaluation
Concept                  Score
Obstructive Sleep Apnea  1000
Sleep Apnea              901
Apnea                    827
…                        …
Sleeping                 793
Sleepy                   755
MMTx limitations
• MMTx focuses on UMLS rather than MeSH
– but MEDLINE indexing is based on MeSH
• Exhaustive variant generation: the initial phrase is iteratively expanded into all possible UMLS variants
– term overgeneration
– term concept diffusion
– unrelated terms added to the final candidate list
The AMTEx method
• New method for automatic indexing of medical documents
• Main idea:
Initial term extraction based on a hybrid linguistic/statistical approach, the C/NC value
Extracts general single and multi-word terms
Extracted terms are validated against MeSH
AMTEx Outline
[Pipeline diagram] INPUT: Document Collection → C/NC value Multi-word Term Extraction & Term Ranking → MeSH Term Validation → Single-word Term Extraction (non-MeSH multi-word terms are broken down & validated against MeSH) → Variant Generation → Term Expansion (MeSH) → OUTPUT: MeSH Term Lists
Resource: MeSH Thesaurus
MeSH: Medical Subject Headings
The NLM medical & biological terms thesaurus:
• Organized in IS-A hierarchies
– more than 15 taxonomies & more than 22,000 terms
– a term may appear in multiple taxonomies
• No PART-OF relationships
• Terms organized into synonym sets called entry terms, including stemmed term forms
Fragment of the MeSH IS-A Hierarchy
[Tree diagram; nodes from top to bottom: Root; Nervous system diseases; Neurologic manifestations, Cranial nerve diseases; Pain; Headache, Neuralgia; Facial neuralgia]
The C/NC value method
• Hybrid (linguistic / statistical) term extraction method
• Domain independent
• Specifically designed for the identification of multi-word and nested terms:
compound & multi-word terms very common in biomedical domain
multi-word terms often used in indexing
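C-value
• C-value: a phrase may be a term if it often appears alone or within other candidate terms

In the standard formulation of the method:

\[
\text{C-value}(\alpha) =
\begin{cases}
\log_2 |\alpha| \cdot f(\alpha), & \text{if } \alpha \text{ is not nested} \\
\log_2 |\alpha| \cdot \left( f(\alpha) - \dfrac{1}{P(T_\alpha)} \sum_{b \in T_\alpha} f(b) \right), & \text{otherwise}
\end{cases}
\]

α: candidate term
|α|: length of α in words
f(α): frequency of α
Tα: set of candidate terms containing α
P(Tα): number of such terms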
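The C-value scoring can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation; it assumes the standard formulation, log2|α| · f(α) for a term that is not nested and log2|α| · (f(α) − (1/P(Tα)) Σ f(b)) otherwise. The function name `c_values` and the toy frequencies are hypothetical.

```python
import math

def c_values(freq):
    """Score candidate terms with the C-value.

    freq maps each candidate term (a tuple of words) to its corpus frequency.
    Assumes the standard formulation; single-word candidates score 0
    because log2(1) == 0.
    """
    def contains(longer, shorter):
        # True if `shorter` occurs as a contiguous word sequence inside `longer`.
        n = len(shorter)
        return len(longer) > n and any(
            longer[i:i + n] == shorter for i in range(len(longer) - n + 1)
        )

    scores = {}
    for a, f_a in freq.items():
        # T_a: the longer candidate terms that contain a.
        nesting = [b for b in freq if contains(b, a)]
        if nesting:
            # Subtract the average frequency of the terms that contain a.
            adjusted = f_a - sum(freq[b] for b in nesting) / len(nesting)
        else:
            adjusted = f_a
        scores[a] = math.log2(len(a)) * adjusted
    return scores

# Hypothetical counts for two candidates taken from the MMTx example.
toy = {("sleep", "apnea"): 10, ("obstructive", "sleep", "apnea"): 6}
scores = c_values(toy)
```

With these toy counts, "sleep apnea" is penalised for the 6 occurrences in which it is nested inside "obstructive sleep apnea", while the longer term keeps its full frequency.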
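NC-value
• NC-value: a phrase is more likely a term if it often appears in a specific word context

In the standard formulation of the method:

\[
\text{NC-value}(\alpha) = 0.8 \cdot \text{C-value}(\alpha) + 0.2 \cdot \sum_{w \in C_\alpha} f_\alpha(w) \cdot \frac{t(w)}{n}
\]

Cα: set of context words of α
w: context word
t(w): number of terms w appears with
n: number of all terms
fα(w): frequency of w as context word of α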
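A small Python sketch of the NC-value combination may help here. It assumes the standard weighting, 0.8 · C-value(α) + 0.2 · Σ fα(w) · t(w)/n, with the sum taken over the context words of α. The helper names and the toy numbers are hypothetical.

```python
def context_weights(terms_per_context_word, n_terms):
    """weight(w) = t(w) / n, where t(w) is the number of terms w appears with."""
    return {w: t / n_terms for w, t in terms_per_context_word.items()}

def nc_value(c_value, context_freq, weights):
    """Combine a term's C-value with its weighted context-word frequencies.

    context_freq maps each context word w to f_a(w), its frequency as a
    context word of the term. The 0.8 / 0.2 split is the assumed standard one.
    """
    context_score = sum(f * weights.get(w, 0.0) for w, f in context_freq.items())
    return 0.8 * c_value + 0.2 * context_score

# Hypothetical numbers: 4 extracted terms, two context words.
weights = context_weights({"severe": 3, "patient": 1}, n_terms=4)
score = nc_value(4.0, {"severe": 2, "patient": 4}, weights)
```

Context words that co-occur with many different terms ("severe" in the toy data) contribute more per occurrence than words seen with only one term.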
AMTEx step 1: C/NC value Multi-word Term Extraction & Ranking
Part-of-Speech Tagging
Linguistic filtering:
• N+ N
• (A|N)+ N
• ( (A|N)+ | ( (A|N)* (N P)? ) (A|N)* ) N
Candidate term ranking based on C/NC-value
Keep terms with NC-value > T1
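The three linguistic filters can be read as regular expressions over part-of-speech tags. The sketch below assumes that N, A and P stand for noun, adjective and preposition, and that each phrase arrives already tagged with these one-letter tags; the filter names and the tagged examples are hypothetical.

```python
import re

# One regular expression per filter, applied to a string of one-letter tags.
FILTERS = {
    "noun": re.compile(r"N+N"),
    "adj_noun": re.compile(r"(A|N)+N"),
    "adj_noun_prep": re.compile(r"((A|N)+|((A|N)*(NP)?)(A|N)*)N"),
}

def tag_string(tagged_phrase):
    """Turn [(word, tag), ...] into a tag string such as 'ANN'."""
    return "".join(tag for _, tag in tagged_phrase)

def passes(tagged_phrase, filter_name):
    """True if the whole phrase matches the chosen linguistic filter."""
    return FILTERS[filter_name].fullmatch(tag_string(tagged_phrase)) is not None

osa = [("obstructive", "A"), ("sleep", "N"), ("apnea", "N")]
qol = [("quality", "N"), ("of", "P"), ("life", "N")]
```

Here `osa` passes the second and third filters but not the noun-only one, and `qol` passes only the third, which is the one that admits a preposition.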
AMTEx step 2: MeSH Term Validation
Candidate terms are validated against the MeSH Thesaurus (simple string matching)
Only candidate terms matching MeSH are kept
Multi-word candidates not matching MeSH may still contain (shorter) MeSH terms
AMTEx step 3: Single-word Term Extraction
For multi-word terms not matching MeSH:
Multi-word terms are split into single-word terms
Single-word terms matched against MeSH
Matched MeSH terms added to term list
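Steps 2 and 3 together amount to a lookup with a fallback, which can be sketched as follows. The sketch assumes case-insensitive exact string matching against a flat set of MeSH terms; the function name `validate_against_mesh` and the three-term vocabulary are hypothetical stand-ins for the real thesaurus.

```python
def validate_against_mesh(candidates, mesh_terms):
    """Keep candidates that match MeSH; split the rest into single words.

    Matching is a simple case-insensitive string comparison (an assumption).
    Multi-word candidates that do not match are broken into words, and each
    word is checked against MeSH on its own.
    """
    mesh = {term.lower() for term in mesh_terms}
    accepted = []
    for candidate in candidates:
        key = candidate.lower()
        if key in mesh:
            accepted.append(key)          # step 2: direct MeSH match
        elif " " in key:
            # step 3: fall back to the single words of a non-matching phrase
            accepted.extend(word for word in key.split() if word in mesh)
    # Remove duplicates while keeping the original order.
    return list(dict.fromkeys(accepted))

toy_mesh = {"Pain", "Knee", "Data Collection"}   # hypothetical mini-vocabulary
terms = validate_against_mesh(
    ["data collection", "chronic knee pain", "baseline survey"], toy_mesh
)
```

In the toy run, "data collection" is kept as is, "chronic knee pain" contributes "knee" and "pain", and "baseline survey" is dropped because neither the phrase nor its words are in the vocabulary.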
AMTEx step 4: Term Variant Generation
Variants are added to the list of terms:
• Inflectional variants of the extracted terms identified during term extraction (C/NC-value)
• Stemmed term-forms available in MeSH
AMTEx step 5: Term Expansion
• Each term in the list is expanded with neighbouring terms in the MeSH hierarchy
• The expansion may include terms more than one level higher or lower than the original term, depending on the similarity threshold T
• Semantic similarity metric by Li et al.
Y. Li, Z. A. Bandar, and D. McLean. An Approach for Measuring Semantic Similarity between Words Using Multiple Information Sources. IEEE Trans. on Knowledge and Data Engineering, 15(4):871–882, July/Aug. 2003.
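Threshold-based expansion can be illustrated with a short Python sketch. It assumes the Li et al. metric in the form sim = e^(−αl) · tanh(βh), where l is the shortest path length between two terms and h is the depth of their deepest common ancestor, with α = 0.2 and β = 0.6 taken as default parameters. The toy hierarchy reuses node names from the MeSH fragment, but its edges are an illustrative assumption, not the actual MeSH tree.

```python
import math

# Toy IS-A fragment (child -> parent). The edges are assumed for illustration.
PARENT = {
    "nervous system diseases": "root",
    "neurologic manifestations": "nervous system diseases",
    "pain": "neurologic manifestations",
    "headache": "pain",
    "neuralgia": "pain",
}

def path_to_root(term):
    """Return [term, parent, grandparent, ..., root]."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def li_similarity(t1, t2, alpha=0.2, beta=0.6):
    """sim = exp(-alpha * l) * tanh(beta * h) on the toy hierarchy."""
    p1, p2 = path_to_root(t1), path_to_root(t2)
    depth = {node: len(p2) - 1 - i for i, node in enumerate(p2)}  # root has depth 0
    for steps_up, node in enumerate(p1):
        if node in depth:                      # deepest common ancestor
            l = steps_up + p2.index(node)      # shortest path length
            h = depth[node]                    # depth of the common ancestor
            return math.exp(-alpha * l) * math.tanh(beta * h)
    raise ValueError("terms share no common ancestor")

def expand(term, threshold):
    """Neighbouring terms whose similarity to `term` reaches the threshold."""
    vocabulary = set(PARENT) | {"root"}
    return sorted(t for t in vocabulary
                  if t != term and li_similarity(term, t) >= threshold)
```

On this toy tree, `expand("headache", 0.7)` returns only the parent "pain", while lowering the threshold to 0.6 also brings in the sibling "neuralgia", so a lower T does expand further.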
Example
Input: Full text article
MEDLINE index terms: “Aged”, “Data Collection”, “Humans”, “Knee”, “Middle Aged”, “Osteoarthritis, Knee/complications”, “Osteoarthritis, Knee/diagnosis”, “Pain/classification”, “Pain/etiology”, “Prospective Studies”, “Research Support, Non-U.S. Gov’t”
MMTx terms: “osteoarthritis knee”, “retention”, “peat”, “rheumatology”, “acetylcholine”, “lysine acetate”, “potassium acetate”, “questionnaires”, “target population”, “population”, “selection bias”, “creativeness”, “reproduction”, “cohort studies”, “europe”, “couples”, “naloxone”, “sample size”, “arthritis”, “data collection”, “mail”, “health status”, “respondents”, “ontario”, “universities”, “dna”, “baseline survey”, “medical records”, “informatics”, “general practitioners”, “gender”, “beliefs”, “logistic regression”, “female”, “marital status”, “employment status”, “comprehension”, “surveys”, “age distribution”, “manual”, “occupations”, “manuals”, “persons”, “females”, “minor”, “minority groups”, “incentives”, “business”, “ability”, “comparative study”, “odds ratio”, “biomedical research”, “pubmed”, “copyright”, “coding”, “longitudinal studies”, “immunoelectrophoresis”, “skin diseases”, “government”, “norepinephrine”, “social sciences”, “survey methods”, “tyrosine”, “new zealand”, “azauridine”, “gold”, “nonrespondents”, “cycloheximide”, “rheum”, “jordan”, “cadmium”, “radiopharmaceuticals”, “community”, “disease progression”, “history”
AMTEx terms: “health surveys”, “pain”, “review publication type”, “data collection”, “osteoarthritis knee”, “knee”, “science”, “health services needs and demand”, “population”, “research”, “questionnaires”, “informatics”, “health”
Evaluation
Precision and Recall measures
Dataset:
• 61 full MEDLINE documents (not abstracts), from the PMC database of NCBI PubMed
• MEDLINE documents are paired with their respective MeSH index terms, manually assigned by experts
Ground Truth:
• the set of MeSH document index terms
Benchmark method:
• MMTx against our AMTEx
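Per-document precision and recall against the expert-assigned index terms can be computed as in the sketch below. It assumes that extracted and ground-truth terms are compared as exact, already normalised strings; the slides do not say how terms are normalised or how scores are aggregated over the 61 documents, so the function name and the toy term sets are hypothetical.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted index terms for one document.

    predicted: terms produced by the indexing method
    gold:      MeSH index terms assigned by experts (the ground truth)
    """
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)                     # correctly assigned terms
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical, already normalised term sets for a single document.
p, r = precision_recall(
    ["pain", "knee", "health"],
    ["pain", "knee", "humans", "aged"],
)
```

In the toy case two of the three predicted terms are correct (precision 2/3) and two of the four expert terms are found (recall 0.5).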
Multi-Word Terms only
Method            Precision  Recall
MMTx              0.013      0.015
AMTEx (T = 0.5)   0.186      0.108
AMTEx (T = 0.6)   0.218      0.090
AMTEx (T = 0.7)   0.236      0.072
AMTEx (T = 0.8)   0.236      0.072
AMTEx (T = 0.9)   0.236      0.070
T: term expansion threshold; a lower T means further expansion
Contribution of Single-Word Terms
Method                           Precision  Recall
MMTx                             0.013      0.015
AMTEx                            0.236      0.070
AMTEx & single-word MeSH terms   0.120      0.228
Conclusions: AMTEx
Designed for indexing and retrieval of MEDLINE documents
Focuses on multi-word term extraction using valid linguistic & statistical criteria
Based on MeSH -- similarly to human indexing
Selectively expands into term variants, synonyms
Outperforms the current benchmark MMTx method, in both precision & recall