A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text
A. S. Schwartz & M. A. Hearst
UC Berkeley
Presented by Jing Jiang
The Problem – to Identify Acronyms
• To identify <“short form”, “long form”> pairs from biomedical text:
  – Short form is an abbreviation of the long form
  – There exists a character mapping from short form to long form
  – Example:
    • Gcn5-related N-acetyltransferase (GNAT)
• A non-trivial problem:
  – Words in the long form may be skipped
  – Internal letters in the long form may be used
Previous Work
• Machine learning approach
  – Linear regression (Chang et al.)
  – Encoding and compression (Yeates et al.)
• Heuristic approach
  – Rule-based
  – Factors considered include:
    • Distance between definition and abbreviation
    • Number of stop words
    • Capitalization
Step 1: Identifying Candidates
• Consider only two cases:
  – long form ‘(‘ short form ‘)’
  – short form ‘(‘ long form ‘)’
• Short form:
  – No more than 2 words
  – Between 2 and 10 chars
  – At least one letter
  – First char alphanumeric
• Long form:
  – Adjacent to short form
  – No more than min(|A| + 5, |A| * 2) words, where |A| is the number of characters in the short form
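The candidate filters listed above can be sketched in Java roughly as follows. This is an illustration only, not the authors' code: the class and method names are hypothetical, words are split on whitespace (an assumption), and only the "long form (short form)" case is handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch; names and tokenization are assumptions, not the authors' code. */
final class CandidateFinder {

    /** A short form and the window of words preceding it (the candidate long form). */
    static final class Candidate {
        final String shortForm;
        final String longForm;

        Candidate(String shortForm, String longForm) {
            this.shortForm = shortForm;
            this.longForm = longForm;
        }
    }

    // Text inside one pair of parentheses; nested parentheses are not handled.
    private static final Pattern PARENS = Pattern.compile("\\(([^()]+)\\)");

    /** Short-form filters: at most 2 words, 2 to 10 chars, at least one letter, alphanumeric first char. */
    static boolean isValidShortForm(String text) {
        String s = text.trim();
        if (s.length() < 2 || s.length() > 10) return false;
        if (s.split("\\s+").length > 2) return false;
        if (!Character.isLetterOrDigit(s.charAt(0))) return false;
        return s.chars().anyMatch(Character::isLetter);
    }

    /** Upper bound on the long form, in words: min(|A| + 5, |A| * 2). */
    static int maxLongFormWords(String shortForm) {
        int a = shortForm.length();
        return Math.min(a + 5, a * 2);
    }

    /** Finds candidates of the form: long form '(' short form ')'. */
    static List<Candidate> findCandidates(String sentence) {
        List<Candidate> result = new ArrayList<>();
        Matcher m = PARENS.matcher(sentence);
        while (m.find()) {
            String shortForm = m.group(1).trim();
            if (!isValidShortForm(shortForm)) continue;
            String before = sentence.substring(0, m.start()).trim();
            if (before.isEmpty()) continue;
            // Keep only the last few words adjacent to the parenthesis.
            String[] words = before.split("\\s+");
            int keep = Math.min(words.length, maxLongFormWords(shortForm));
            String longForm = String.join(" ",
                    Arrays.copyOfRange(words, words.length - keep, words.length));
            result.add(new Candidate(shortForm, longForm));
        }
        return result;
    }
}
```

The window returned here is deliberately generous; trimming it down to the actual long form is the job of Step 2.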
Step 2: Identifying Correct Long Forms
• Scanning from right to left, pick the shortest long form that matches the short form:
  – Each character in the short form must match a character in the long form, in the same order
  – The first character of the short form must match a character at the start of the first word in the long form
Java Code for Finding the Best Long Form for a Given Short Form
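The code listing from this slide is not part of the transcript. Below is a minimal sketch of the right-to-left matching rule described in Step 2, not a verbatim copy of the authors' listing. The class name `LongFormMatcher` is hypothetical, and the candidate long form is assumed to be the window of words that precedes the short form.

```java
/** Illustrative sketch of the right-to-left matching rule; not the authors' verbatim listing. */
final class LongFormMatcher {

    /**
     * Returns the shortest suffix of candidateLongForm (cut at a space) whose
     * characters cover shortForm in order, or null if there is no match.
     */
    static String findBestLongForm(String shortForm, String candidateLongForm) {
        int l = candidateLongForm.length() - 1;
        for (int s = shortForm.length() - 1; s >= 0; s--) {
            char c = Character.toLowerCase(shortForm.charAt(s));
            if (!Character.isLetterOrDigit(c)) continue; // ignore punctuation such as '-'
            // Move left until the characters match. For the first short-form
            // character, additionally require that the match begins a word.
            while (l >= 0
                    && (Character.toLowerCase(candidateLongForm.charAt(l)) != c
                        || (s == 0 && l > 0
                            && Character.isLetterOrDigit(candidateLongForm.charAt(l - 1))))) {
                l--;
            }
            if (l < 0) return null; // ran out of long-form characters
            l--;
        }
        // Extend back to the preceding space so that whole words are returned.
        int start = candidateLongForm.lastIndexOf(' ', l) + 1;
        return candidateLongForm.substring(start);
    }
}
```

For instance, `LongFormMatcher.findBestLongForm("HMM", "we use a hidden Markov model")` yields `"hidden Markov model"`, while `findBestLongForm("5-HT", "serotonin")` yields `null` because no character mapping exists.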
Evaluation
• 1000 randomly selected MEDLINE abstracts
  – 82% recall, 95% precision
• Medstract Gold Standard Evaluation Corpus
  – 82% recall, 96% precision
  – Compared with
    • 83% recall, 80% precision (Chang et al., linear regression)
    • 72% recall, 98% precision (Pustejovsky et al., heuristics)
Missing Pairs
• Skipped characters in short form
  – <CNS1, cyclophilin seven suppressor>
• No match
  – <5-HT, serotonin>
• Out of order
  – <ATN, anterior thalamus>
• Partial match
  – <Pol I, RNA polymerase I>
Discussion
• Cons:
  – Simple method
  – Decent performance
• Questions:
  – Tradeoff between complexity of rules and performance
  – Generality of the heuristic rules
  – Heuristics vs. machine learning
Mining MEDLINE for Implicit Links between Dietary Substances and Diseases
P. Srinivasan & B. Libbus
U. Iowa
Presented by Jing Jiang
The Goal – to Discover Implicit Links between Topics
• Open discovery
  – Start from topic A
  – Navigate through intermediate topics B1, B2, etc.
  – Reach terminal topics C1, C2, etc.
• Closed discovery
  – Start from topics A and C
  – Find connections B1, B2, etc.
General model for discovering implicit links between topics
Terminology
• Topic Profile: a set of terms that are highly related to the topic, together with weights assigned to each term
• MeSH: Medical Subject Heading
• UMLS types: Unified Medical Language System semantic types
Open Discovery Algorithm
• Input:
  – Topic A
  – Two sets of UMLS types ST-B & ST-C
  – Threshold M
• Output:
  – Terms related to A and of some type in ST-C
Open Discovery Algorithm (cont.)
• Build topic A’s profile AP
• For each type in ST-B, select M top terms B1, B2, etc. from AP
• Build Bi’s profiles BPi
• Build combined profile CP from BPs limited to types in ST-C
• Remove terms directly linked to A from CP
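The steps above can be sketched in Java as follows. This is a sketch under stated assumptions, not the authors' implementation: `profileOf` and `typesOf` are hypothetical lookup functions supplied by the caller, BP weights are combined by summing (the slides do not say how), and "directly linked to A" is taken to mean "already present in AP".

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Illustrative sketch of open discovery; helper functions and combination rule are assumptions. */
final class OpenDiscovery {

    static Map<String, Double> discover(
            String topicA,
            Set<String> stB,
            Set<String> stC,
            int m,
            Function<String, Map<String, Double>> profileOf, // hypothetical: topic -> weighted terms
            Function<String, Set<String>> typesOf) {         // hypothetical: term -> UMLS semantic types

        // Build topic A's profile AP.
        Map<String, Double> ap = profileOf.apply(topicA);

        // For each type in ST-B, select the M highest-weighted terms from AP.
        Set<String> bTerms = new LinkedHashSet<>();
        for (String type : stB) {
            ap.entrySet().stream()
                    .filter(e -> typesOf.apply(e.getKey()).contains(type))
                    .sorted(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()))
                    .limit(m)
                    .forEach(e -> bTerms.add(e.getKey()));
        }

        // Build the combined profile CP from the B profiles, limited to ST-C types.
        // Summing the weights across B profiles is an assumption.
        Map<String, Double> cp = new HashMap<>();
        for (String b : bTerms) {
            for (Map.Entry<String, Double> e : profileOf.apply(b).entrySet()) {
                boolean inStC = typesOf.apply(e.getKey()).stream().anyMatch(stC::contains);
                if (inStC) cp.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }

        // Remove terms directly linked to A (assumed here: terms that occur in AP).
        cp.keySet().removeAll(ap.keySet());
        cp.remove(topicA);
        return cp;
    }
}
```

Passing the two lookups in as functions keeps the sketch independent of how profiles are actually built or where semantic types come from.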
Building Profile for Topic A
• Search PubMed for A
• Extract MeSH terms from relevant documents
• Compute TF * IDF
  – TF: # occurrences of the term in retrieved document set
  – IDF: log(N/TF)
  – N: # retrieved documents
• Normalize the weight vector
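The weighting step can be illustrated with a small Java sketch that takes the slide's definitions at face value: weight = TF * log(N/TF). The class name is hypothetical, the PubMed search and MeSH extraction are assumed to have already produced the term counts, and scaling to unit Euclidean length is an assumption, since the slide does not say which normalization is used.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch of profile weighting; the normalization choice is an assumption. */
final class TopicProfile {

    /**
     * @param termCounts    TF per MeSH term: occurrences in the retrieved document set
     * @param retrievedDocs N: number of retrieved documents
     * @return term weights TF * log(N / TF), scaled to unit Euclidean length
     */
    static Map<String, Double> build(Map<String, Integer> termCounts, int retrievedDocs) {
        Map<String, Double> weights = new LinkedHashMap<>();
        double sumSquares = 0.0;
        for (Map.Entry<String, Integer> e : termCounts.entrySet()) {
            int tf = e.getValue();
            if (tf <= 0) continue; // a term that never occurs carries no weight
            double w = tf * Math.log((double) retrievedDocs / tf);
            weights.put(e.getKey(), w);
            sumSquares += w * w;
        }
        double norm = Math.sqrt(sumSquares);
        if (norm > 0) {
            weights.replaceAll((term, w) -> w / norm);
        }
        return weights;
    }
}
```

Under this formula a term that occurs once in every retrieved document (TF = N) gets weight zero, which may be one reason the closing discussion slide questions the use of TF and IDF.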
Testing with Turmeric
• Topic A: Turmeric
• ST-B:
  – Gene or Genome
  – Enzyme
  – Amino Acid, Peptide or Protein
• ST-C:
  – Body Part, Organ or Organ Component
  – Disease or Syndrome
  – Neoplastic Process
• M: 5, 10, 15
Results
• B terms:
  – 37% recall, 38% precision (compared with manually identified terms)
• C terms:
  – 67% recall, 67% precision (compared with manual results)
Novel C MeSH Terms
Discussion
• Cons:
  – Simple method
  – Domain knowledge (MeSH terms, UMLS types) to shape search direction
• Questions:
  – TF & IDF?
  – Longer path?
  – What relationships?
  – Co-occurrence = link?
End of Talk