a simple algorithm for identifying abbreviation definitions in biomedical text

A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text

A. S. Schwartz & M. A. Hearst

UC Berkeley

Presented by Jing Jiang

The Problem – to Identify Acronyms

• To identify <“short form”, “long form”> pairs from biomedical text:
  – Short form is an abbreviation of the long form
  – There exists a character mapping from short form to long form
  – Example:
    • Gcn5-related N-acetyltransferase (GNAT)

• A non-trivial problem:
  – Words in the long form may be skipped
  – Internal letters in the long form may be used

Previous Work

• Machine learning approach
  – Linear regression (Chang et al.)
  – Encoding and compression (Yeates et al.)

• Heuristic approach
  – Rule-based
  – Factors considered include:
    • Distance between definition and abbreviation
    • Number of stop words
    • Capitalization

Step 1: Identifying Candidates

• Consider only two cases:
  – long form ‘(’ short form ‘)’
  – short form ‘(’ long form ‘)’

• Short form:
  – No more than 2 words
  – Between 2 and 10 chars
  – At least one letter
  – First char alphanumeric

• Long form:
  – Adjacent to short form
  – No more than min(|A| + 5, |A| * 2) words, where |A| is the number of characters in the short form
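The candidate rules above can be sketched in Python. This is a minimal illustration, not the authors' implementation: it covers only the long form ‘(’ short form ‘)’ case, takes |A| to be the number of characters in the short form, and the function names (`is_valid_short_form`, `max_long_form_words`, `candidate_pairs`) are hypothetical.

```python
import re


def is_valid_short_form(s: str) -> bool:
    """Check the short-form constraints listed on the slide."""
    words = s.split()
    return (
        1 <= len(words) <= 2                 # no more than 2 words
        and 2 <= len(s) <= 10                # between 2 and 10 characters
        and any(c.isalpha() for c in s)      # at least one letter
        and s[0].isalnum()                   # first character alphanumeric
    )


def max_long_form_words(short_form: str) -> int:
    """Upper bound on long-form length: min(|A| + 5, |A| * 2) words."""
    n = len(short_form)
    return min(n + 5, n * 2)


def candidate_pairs(sentence: str) -> list:
    """Return (short form, candidate long form) pairs for 'long form (short form)'."""
    pairs = []
    for m in re.finditer(r"\(([^()]+)\)", sentence):
        short = m.group(1).strip()
        if not is_valid_short_form(short):
            continue
        # The candidate long form is the text adjacent to the parenthesis,
        # truncated to the allowed number of words.
        before = sentence[: m.start()].split()
        window = before[-max_long_form_words(short):]
        if window:
            pairs.append((short, " ".join(window)))
    return pairs


pairs = candidate_pairs("We studied Gcn5-related N-acetyltransferase (GNAT) proteins.")
```

The candidate long form is deliberately too generous here; trimming it to the correct span is the job of Step 2.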

Step 2: Identifying Correct Long Forms

• From right to left, find the shortest long form that matches the short form:
  – Each character in the short form must match a character in the long form
  – The character at the beginning of the short form must match a character in the initial position of the first word in the long form
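The right-to-left matching rule can be sketched in Python as follows. This is an illustrative sketch of the rule as stated on the slide, not the Java listing from the talk; the function name `find_best_long_form` and the choices to compare case-insensitively and to skip non-alphanumeric short-form characters are assumptions.

```python
from typing import Optional


def find_best_long_form(short_form: str, long_form: str) -> Optional[str]:
    """Return the shortest suffix of long_form that matches short_form, or None."""
    s = len(short_form) - 1
    l = len(long_form) - 1
    while s >= 0:
        ch = short_form[s].lower()
        if not ch.isalnum():
            # Assumption: punctuation in the short form needs no match.
            s -= 1
            continue
        # Move left until the character matches. For the first short-form
        # character, additionally require that the match starts a word.
        while (l >= 0 and long_form[l].lower() != ch) or (
            s == 0 and l > 0 and long_form[l - 1].isalnum()
        ):
            l -= 1
        if l < 0:
            return None  # ran out of long-form text: no valid mapping
        l -= 1
        s -= 1
    # Extend the match to the beginning of the word it starts in.
    start = long_form.rfind(" ", 0, l + 1) + 1
    return long_form[start:]


best = find_best_long_form("HSF", "the heat shock transcription factor")
```

Because the scan runs from the end of both strings, the first successful match is also the shortest one: on the example above it stops at "heat" and leaves out the leading "the".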

Java Code for Finding the Best Long Form for a Given Short Form

Evaluation

• 1000 randomly selected MEDLINE abstracts
  – 82% recall, 95% precision

• Medstract Gold Standard Evaluation Corpus
  – 82% recall, 96% precision
  – Compared with
    • 83% recall, 80% precision (Chang et al., linear regression)
    • 72% recall, 98% precision (Pustejovsky et al., heuristics)

Missing Pairs

• Skipped characters in short form
  – <CNS1, cyclophilin seven suppressor>

• No match
  – <5-HT, serotonin>

• Out of order
  – <ATN, anterior thalamus>

• Partial match
  – <Pol I, RNA polymerase I>

Discussion

• Cons:
  – Simple method
  – Decent performance

• Questions:
  – Tradeoff between complexity of rules and performance
  – Generality of the heuristic rules
  – Heuristics vs. machine learning

Mining MEDLINE for Implicit Links between Dietary Substances and Diseases

P. Srinivasan & B. Libbus

U. Iowa

Presented by Jing Jiang

The Goal – to Discover Implicit Links between Topics

• Open discovery
  – Start from topic A
  – Navigate through intermediate topics B1, B2, etc.
  – Reach terminal topics C1, C2, etc.

• Closed discovery
  – Start from topics A and C
  – Find connections B1, B2, etc.

General model for discovering implicit links between topics

Terminology

• Topic Profile: a set of terms that are highly related to the topic, together with weights assigned to each term

• MeSH: Medical Subject Headings

• UMLS types: Unified Medical Language System semantic types

Open Discovery Algorithm

• Input:
  – Topic A
  – Two sets of UMLS types ST-B & ST-C
  – Threshold M

• Output:
  – Terms related to A and of some type in ST-C

Open Discovery Algorithm (cont.)

• Build topic A’s profile AP
• For each type in ST-B, select M top terms B1, B2, etc. from AP
• Build Bi’s profiles BPi
• Build combined profile CP from BPs limited to types in ST-C
• Remove terms directly linked to A from CP
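The five steps above can be sketched in Python. This is a toy illustration under stated assumptions, not the authors' system: profile building and semantic-type lookup are passed in as callables (`build_profile`, `semantic_types`), BPs are combined by summing weights, and "directly linked to A" is taken to mean "appears in AP". None of these choices is specified on the slide, and the profiles and type assignments below are made-up example data.

```python
def open_discovery(topic_a, st_b, st_c, m, build_profile, semantic_types):
    """Return a combined profile CP of candidate C terms for topic A."""
    ap = build_profile(topic_a)  # step 1: profile AP

    # Step 2: for each type in ST-B, keep the M highest-weighted terms of AP.
    b_terms = set()
    for sem_type in st_b:
        typed = [(t, w) for t, w in ap.items() if sem_type in semantic_types(t)]
        typed.sort(key=lambda tw: tw[1], reverse=True)
        b_terms.update(t for t, _ in typed[:m])

    # Steps 3-4: build each BPi and merge terms whose type is in ST-C.
    # Assumption: weights from different BPs are summed.
    cp = {}
    for b in sorted(b_terms):
        for term, w in build_profile(b).items():
            if set(semantic_types(term)) & set(st_c):
                cp[term] = cp.get(term, 0.0) + w

    # Step 5: drop terms already directly linked to A (assumed: present in AP).
    for term in ap:
        cp.pop(term, None)
    cp.pop(topic_a, None)
    return cp


# Hypothetical toy data standing in for PubMed-derived profiles and UMLS types.
toy_profiles = {
    "Turmeric": {"GeneX": 0.9, "EnzymeY": 0.5, "Skin": 0.2},
    "GeneX": {"Retina": 0.7, "Skin": 0.4, "Turmeric": 0.3},
    "EnzymeY": {"Retina": 0.2, "Crohn Disease": 0.6},
}
toy_types = {
    "GeneX": {"Gene or Genome"},
    "EnzymeY": {"Enzyme"},
    "Skin": {"Body Part, Organ or Organ Component"},
    "Retina": {"Body Part, Organ or Organ Component"},
    "Crohn Disease": {"Disease or Syndrome"},
}

cp = open_discovery(
    "Turmeric",
    st_b=["Gene or Genome", "Enzyme"],
    st_c=["Body Part, Organ or Organ Component", "Disease or Syndrome"],
    m=1,
    build_profile=lambda t: toy_profiles.get(t, {}),
    semantic_types=lambda t: toy_types.get(t, set()),
)
```

In the toy run, "Skin" is reachable through GeneX but is removed in the last step because it already appears in the Turmeric profile, which leaves only the implicit links.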

Building Profile for Topic A

• Search PubMed for A
• Extract MeSH terms from relevant documents
• Compute TF * IDF
  – TF: # occurrences of the term in retrieved document set
  – IDF: log(N/TF)
  – N: # retrieved documents
• Normalize the weight vector
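The weighting above can be sketched in Python. This is a minimal sketch under two assumptions that the slide does not state: each MeSH term is counted at most once per document (so TF never exceeds N), and normalization means scaling the weight vector to unit Euclidean length. The PubMed search is left out; the hypothetical function `build_profile` takes the MeSH term sets of the retrieved documents as input.

```python
import math
from collections import Counter


def build_profile(docs):
    """docs: list of sets of MeSH terms, one set per retrieved document."""
    n = len(docs)
    # TF: number of retrieved documents in which the term occurs.
    tf = Counter(term for doc in docs for term in set(doc))
    # Weight = TF * IDF, with IDF = log(N / TF) as defined on the slide.
    raw = {t: c * math.log(n / c) for t, c in tf.items()}
    # Assumption: normalize to unit Euclidean length.
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0:
        return {t: 0.0 for t in raw}
    return {t: w / norm for t, w in raw.items()}


profile = build_profile([
    {"Curcumin", "Neoplasms"},
    {"Curcumin", "Retina"},
    {"Curcumin"},
    {"Neoplasms"},
])
```

With this definition, a term that appears in every retrieved document gets weight 0, because log(N/TF) = log(1) = 0.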

Testing with Turmeric

• Topic A: Turmeric

• ST-B:
  – Gene or Genome
  – Enzyme
  – Amino Acid, Peptide or Protein

• ST-C:
  – Body Part, Organ or Organ Component
  – Disease or Syndrome
  – Neoplastic Process

• M: 5, 10, 15

Results

• B terms:
  – 37% recall, 38% precision (compared with manually identified terms)

• C terms:
  – 67% recall, 67% precision (compared with manual results)

Novel C MeSH Terms

Discussion

• Cons:
  – Simple method
  – Domain knowledge (MeSH terms, UMLS types) to shape search direction

• Questions:
  – TF & IDF?
  – Longer path?
  – What relationships?
  – Co-occurrence = link?

End of Talk
