1 automated recognition of malignancy mentions in biomedical literature bmc bioinformatics 2006,...

34
1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan and Hsin-Hsi Chen

Upload: janice-ross

Post on 02-Jan-2016

218 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan

1

Automated recognition of malignancy mentions in biomedical literature

BMC Bioinformatics 2006, 7:492

Speaker: Yu-Ching Fang

Advisors: Hsueh-Fen Juan and Hsin-Hsi Chen

Page 2: 1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan

2

Outline

• Background

• Methods

• Results

• Discussion

• Conclusion

Page 3: 1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan

3

Background - Motivation• The rapid proliferation of biomedical literature

makes it increasingly difficult for researchers peruse, query, and synthesize it for biomedical knowledge gain.

• Less biomedical text mining work has been performed to identify disease-related objects and concepts.

• Related works about automated disease entity recognition often do not perform well.

• More extensive work on medical entity class recognition is necessary.

Page 4: 1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan

4

Related works

1. Automated extractors for the identification of gene and protein names.

2. Automated entity recognition to the identification of phenotypic and disease objects.

3. A machine-learning algorithm to extract gene-disorder relations.

4. Extract phenotypic attributes from Online Mendelian Inheritance in Man (OMIM).

Page 5: 1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan

5

Goal

• Develop a named entity recognizer (MTag), an entity tagger for recognizing clinical descriptions of malignancy presented in text. MTag is based upon the probability model Conditional Random Fields (CRFs).

• Minimize manual efforts and still perform with high accuracy.

Page 6: 1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan

6

Conditional Random Fields (CRFs)

• CRFs are probabilistic tagging models that give the conditional probability of a possible tag sequence t = t1, ... tn given the input token sequence o = o1,..., on (Ryan McDonald and Fernando Pereira, 2005).

• For example, the identification of gene mentions in text can be implemented as a tagging task.

Begins (B), continues (I), or is outside (O) of a gene mention

o:

t:

Page 7: 1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan

7

Conditional Random Fields (CRFs)

John Lafferty et al., 2001

input token sequence

tag sequence

Page 8: 1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan

8

Methods – Task definition

• To develop an automated method that would accurately identify and extract strings of text corresponding to a clinician’s or researcher’s reference to cancer (malignancy).

• Label “Malignancy”: the full noun phrase encompassing a mention of a cancer subtype.

• For example, “neuroblastoma”, “localized neuroblastoma” and “primary extracranial neuroblastoma” were considered to be distinct mentions of malignancy.

• Directly adjacent prepositional phrases were not allowed, such as “cancer <of the lung>”.

Page 9: 1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan

9

Two corpus combination

1. The first corpus concentrated upon a specific malignancy (neuroblastoma) and consisted of 1,000 randomly selected abstracts identified by querying PubMed with the query terms "neuroblastoma" and "gene".

2. The second corpus consisted of 600 abstracts previously selected as likely containing gene mutation instances for genes commonly mutated in a wide variety of malignancies.

Page 10: 1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan

10

Two corpus combination (cont.)

• 1000+600-158=1442 abstracts (eliminating 158 abstracts that appeared to be non-topical, had no abstract body, or were not written in English.)

• Manually annotated for tokenization, part-of-speech assignments, and malignancy named entity recognition.

Page 11: 1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan

11

Two corpus combination (cont.)

• Annotations were performed on all documents by experienced annotators with biomedical knowledge.

• Discrepancies were resolved through forum discussions.

• A total of 7,303 malignancy mentions were identified in the document set.

Page 12: 1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan

12

MTag algorithm

• MTag was developed using the probability model Conditional Random Fields (CRFs).

• CRFs model the conditional probability of a tag sequence given an observation sequence.

• O is an observation sequence, or a sequence of tokens in the text.

• t is a corresponding tag sequence in which each tag labels the corresponding token with either Malignancy (meaning that the token is part of a malignancy mention) or Other.

O: Lung cancer may be related to gene mutation.

t: <Malignancy><Malignancy><Other><Other><Other><Other><Other><Other>

Page 13: 1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan

13

MTag algorithm (cont.)

• CRFs are based on a set of feature functions, fi(tj, tj-1, O).

• This feature represents the probability of whether the token "cancer" is tagged with label Malignancy given the presence of "lung" as the previous token.

O: Lung cancer may be related to gene mutation.

t: <Malignancy><Malignancy><Other><Other><Other><Other><Other><Other>

Page 14: 1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan

14

MTag algorithm (cont.)

• Consider many textual features when it makes decisions on classifying whether a word comprises all or part of a malignancy mention.

• Word-based features: The frequency of each string of 2, 3, or 4 adjacent characters (character n-grams) within each word of the training text was calculated.

For example, lung (lu, lun, lung, un, ung, ng) • The differential frequency of each n-gram within

words manually tagged as being malignancy mentions was considered as a series of features.

For example: lung (bigram: 3/6, trigram:2/6, fourgram:1/6)

Page 15: 1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan

15

MTag algorithm (cont.)

• Orthographic features included the usage and distribution of punctuation, alternative spellings, and case usage.

• Domain-specific features comprised a lexicon of 5,555 malignancies and a regular expression for tokens containing the suffix -oma.

Page 16: 1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan

16

Evaluation

• The evaluation set: 432 abstracts

- 2,031 sentences containing mentions of malignancy

- 3,752 sentences without mentions• Correctly identified if the predicted and manually

labeled tags were exactly the same in content and both boundary determinations.

• The performance of MTag was calculated according to precision, recall and F-measure.

Page 17: 1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan

17

Results - MTag performance

• Two separate training experiments were performed, either with or without the inclusion of malignancy-specific features, which were the addition of a lexicon of malignancy mentions and a list of indicative suffixes.

Page 18: 1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan

18

MTag performance (cont.)

MTag model

Evaluation set

all biological

feature sets: Yes

all biological

feature sets: No

neuroblastoma-

specific and genome-specific

Precision: 0.846

Recall: 0.831

F-measure: 0.838

Precision: 0.851

Recall: 0.818

F-measure: 0.834

Neuroblastoma-specific

Precision: 0.88

Recall: 0.87

F-measure: 0.88

genome-specific Precision: 0.77

Recall: 0.69

F-measure: 0.73

Page 19: 1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan

19

MTag performance (cont.)

• As expected, the extractor performed with higher accuracy with the more narrowly defined corpus (neuroblastoma).

• At least for this class of entities, the extractor performs the task of identifying malignancy mentions efficiently without the use of a specialized lexicon.

Page 20: 1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan

20

Extraction versus string matching

• String matching: the NCI (National Cancer Institute) neoplasm ontology, a term list of 5,555 malignancies, was used as a lexicon to identify malignancy mentions.

• Lexicon terms were individually queried against text by case-insensitive exact string matching.

Page 21: 1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan

21

Extraction versus string matching (cont.)

Testing set (432 abstracts)

39 abstracts (202 malignancy

mentions)

random selectionMTag: automated

extractor

String matching

Page 22: 1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan

22

Extraction versus string matching (cont.)

• MTag identified 190 of the 202 mentions correctly (94.1%), while the NCI list identified only 85 mentions (42.1%), all of which were also identified by the extractor.

Page 23: 1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan

23

Extraction versus string matching (cont.)

• Change lexicon for string matching

• 79 of 202 mentions (39.1%) • Combining the manually-derived lexicon with the

NCI lexicon yielded 124 of 202 matches (61.4%).

NCI listMalignancy mentions identified in the manually curated training set annotati

ons (1,010 documents)

85 mentions (42.1%)

Page 24: 1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan

24

Extraction versus string matching (cont.)

• 202-124=78 (68) malignancy mentions

68 malignancy mentions

Minor variations in spelling and form (e.g., "leukaemia" ve

rsus "leukemia")

New mentions of malignancies that were in neither in the NCI list or trai

ning set.

acronyms (e.g., "AML" in place of "acute myeloid leukemia")

•Missed by the string matching with combined lists but positively identified by MTag.

•This suggests that MTag contributes a significant learning component.

Page 25: 1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan

25

Application to MEDLINE

• MTag was used to extract mentions of malignancy from all MEDLINE abstracts through 2005.

• 15,433,668 documents

• A total of 9,153,340 redundant mentions and 580,002 unique mentions (ignoring case) were identified.

Page 26: 1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan

26

Application to MEDLINE (cont.)• The 25 mentions found in the greatest number of abstracts by MTag

Page 27: 1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan

27

Application to MEDLINE (cont.)

• Six false postives: pulmonary, fibroblasts, neoplastic, neoplasm metastasis, extramural, and abdominal

• Only "extramural“ is not frequently associated with malignancy descriptions.

• The remaining five phrases are likely the result of the extractor:

- failing to properly define mention boundaries in certain cases. For example, "neoplasm“ v.s “neoplasm metastasis”.

- shared use of an otherwise indicative character string (e.g., "opl" in "brain neoplasm" and "neoplastic") between a true positive and a false positive.

Page 28: 1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan

28

Application to MEDLINE (cont.)

• To assess document-level precision, 100 abstracts identified by MTag were randomly selected each for the malignancies "breast cancer" and "adenocarcinoma".

• Manual evaluation of these abstracts showed that all of the articles were true positives.

Page 29: 1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan

29

MTag input and output

• Directly accept files downloaded from PubMed and formatted in MEDLINE format as input.

• Text or HTML file versions of the extractor output results.

Page 30: 1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan

30

MTag HTML output

Page 31: 1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan

31

Discussion

• It is evident that an F-measure of 0.83 is not sufficient as a stand-alone approach for curation tasks.

• However, such an approach provides highly enriched material for manual curators to utilize further.

• Substantial improvement and efficiency• MTag appeared to be accurately

predicting malignancy mentions.

Page 32: 1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan

32

Discussion (cont.)

• Analysis of mis-annotations would likely suggest additional features and/or heuristics that could boost performance considerably.

• It may be no need for extensive domain-specific lexicons because the addition of biological features provided very little boost to the recall rate.

Page 33: 1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan

33

Conclusion

• MTag is one of the first directed efforts to automatically extract entity mentions in a disease-oriented domain with high accuracy.

• MTag substantially outperformed information retrieval methods using specialized lexicons.

• When combined with expert evaluation of output, MTag can assist with vocabulary building for cancer entity class.

Page 34: 1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan

34

Thank you for your attention