Finding and Accessing Diagrams in Biomedical Publications
DESCRIPTION
(CC Attribution License does not apply to included third-party material on slides 3, 6, 12, and 19; see the paper for the references: http://www.tkuhn.ch/pub/kuhn2012amia.pdf)
TRANSCRIPT
![Page 1: Finding and Accessing Diagrams in Biomedical Publications](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554e81d5b4c9054a698b5517/html5/thumbnails/1.jpg)
Finding and Accessing Diagrams in Biomedical Publications
Tobias Kuhn, ThaiBinh Luong, and Michael Krauthammer
Krauthammer Lab, Department of Pathology, Yale University School of Medicine
AMIA 2012 Annual Symposium, 6 November 2012
Chicago
![Page 2: Finding and Accessing Diagrams in Biomedical Publications](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554e81d5b4c9054a698b5517/html5/thumbnails/2.jpg)
Introduction
The inclusion of figure images is a recent trend in the area of literature mining.
The increasing amount of open access publications makes such images available for automated analysis.
Image mining techniques can be used for image search interfaces, for relation mining, and to complement text mining approaches.
T. Kuhn, T. Luong, and M. Krauthammer, Yale University Finding and Accessing Diagrams in Biomedical Publications 2 / 20
![Page 3: Finding and Accessing Diagrams in Biomedical Publications](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554e81d5b4c9054a698b5517/html5/thumbnails/3.jpg)
Answer Queries with Images
Often, a query is best answered by an image.
For example, WolframAlpha for “growth age 6”:
Idea: Use existing diagrams of scientific articles to answer queries.
![Page 4: Finding and Accessing Diagrams in Biomedical Publications](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554e81d5b4c9054a698b5517/html5/thumbnails/4.jpg)
Yale Image Finder
http://krauthammerlab.med.yale.edu/imagefinder/
![Page 5: Finding and Accessing Diagrams in Biomedical Publications](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554e81d5b4c9054a698b5517/html5/thumbnails/5.jpg)
Detection and Analysis of Specific Image Types
For the next version of the Yale Image Finder, we are working on the detection and analysis of specific image types:
• Axis Diagrams
• Gel Images
• Network Diagrams (work in progress)
![Page 6: Finding and Accessing Diagrams in Biomedical Publications](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554e81d5b4c9054a698b5517/html5/thumbnails/6.jpg)
Axis Diagrams: Examples
![Page 7: Finding and Accessing Diagrams in Biomedical Publications](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554e81d5b4c9054a698b5517/html5/thumbnails/7.jpg)
Axis Diagrams
Axis diagrams are important for several reasons:
• They are abundant in biomedical literature: about 38% of all subfigures are axis diagrams
• They follow simple common patterns based on axes
• They are complex in the sense that they combine several dimensions
• They summarize data for human readers
![Page 8: Finding and Accessing Diagrams in Biomedical Publications](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554e81d5b4c9054a698b5517/html5/thumbnails/8.jpg)
Axis Diagram Detection Steps
Basic Idea: Large segments are detected as center segments of axis diagrams if surrounded by a number of small label segments.
Steps: 1. original, 2. segments, 3. center candidates, 4. label candidates, 5. result
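The basic idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used for the Yale Image Finder: the `Segment` type, the area thresholds, the `margin`, and the minimum number of labels are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Axis-aligned bounding box of an image segment (hypothetical type)."""
    x: int
    y: int
    w: int
    h: int

    @property
    def area(self) -> int:
        return self.w * self.h


def is_near(center: Segment, label: Segment, margin: int) -> bool:
    """True if the label's midpoint lies within `margin` pixels of the
    center box but not inside the center box itself."""
    cx = label.x + label.w / 2
    cy = label.y + label.h / 2
    in_expanded = (center.x - margin <= cx <= center.x + center.w + margin
                   and center.y - margin <= cy <= center.y + center.h + margin)
    in_center = (center.x <= cx <= center.x + center.w
                 and center.y <= cy <= center.y + center.h)
    return in_expanded and not in_center


def detect_axis_diagrams(segments, min_center_area=10_000, max_label_area=600,
                         margin=40, min_labels=3):
    """Return large segments that are surrounded by enough small segments.
    All thresholds are assumed values for illustration."""
    centers = [s for s in segments if s.area >= min_center_area]
    labels = [s for s in segments if s.area <= max_label_area]
    return [c for c in centers
            if sum(is_near(c, l, margin) for l in labels) >= min_labels]


# Toy input: one plot area with four tick labels, one large area without labels.
plot = Segment(50, 10, 200, 150)
photo = Segment(400, 10, 200, 150)
ticks = [Segment(20, 30, 20, 10), Segment(20, 100, 20, 10),
         Segment(100, 170, 20, 10), Segment(200, 170, 20, 10)]
print(detect_axis_diagrams([plot, photo] + ticks))
```

Only `plot` is reported in this toy input, because `photo` has no small segments nearby.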
![Page 9: Finding and Accessing Diagrams in Biomedical Publications](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554e81d5b4c9054a698b5517/html5/thumbnails/9.jpg)
Additional Classifiers
To compare and improve our approach, we apply SVM classifiers with the following two types of features:
• Image: texture and histogram features of the bitmap image
• Caption: word vector of the tokenized caption text
These classifiers only act on the complete figure and cannot spot the location of axis diagrams.
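As an illustration of the caption features, a word vector can be built as a simple bag of words. The sketch below is an assumption about what such a vector might look like: the lowercase alphanumeric tokenization and the function names are hypothetical, and the SVM itself is not shown.

```python
import re
from collections import Counter


def tokenize(caption: str) -> list[str]:
    """Lowercase the caption and split it into alphanumeric tokens (assumed scheme)."""
    return re.findall(r"[a-z0-9]+", caption.lower())


def build_vocabulary(captions: list[str]) -> dict[str, int]:
    """Map every token seen in the training captions to a vector index."""
    tokens = sorted({tok for caption in captions for tok in tokenize(caption)})
    return {tok: i for i, tok in enumerate(tokens)}


def caption_vector(caption: str, vocabulary: dict[str, int]) -> list[int]:
    """Count vector of the caption; tokens outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for tok, count in Counter(tokenize(caption)).items():
        index = vocabulary.get(tok)
        if index is not None:
            vector[index] = count
    return vector


vocabulary = build_vocabulary(["Growth curve of cells", "Western blot of cells"])
print(caption_vector("Growth curve", vocabulary))
```

Vectors of this kind could then be passed to any SVM implementation as feature input.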
![Page 10: Finding and Accessing Diagrams in Biomedical Publications](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554e81d5b4c9054a698b5517/html5/thumbnails/10.jpg)
Results
Evaluation on a random sample of 100 articles from PubMed Central with at least one figure. These 404 figures were manually annotated: they contained 508 axis diagrams.

| task | method | precision | recall | F-score |
|---|---|---|---|---|
| detection of figures with axis diagrams | segments | 0.87 | 0.66 | 0.75 |
| | image | 0.66 | 0.90 | 0.76 |
| | caption | 0.84 | 0.77 | 0.80 |
| | image + segments | 0.80 | 0.73 | 0.76 |
| | caption + segments | 0.90 | 0.85 | 0.88 |
| | image + caption | 0.85 | 0.84 | 0.84 |
| | image + caption + segments | 0.90 | 0.89 | 0.89 |
| extraction of axis diagram locations | segments | 0.85 | 0.40 | 0.54 |
| | image + segments | 0.84 | 0.39 | 0.54 |
| | caption + segments | 0.88 | 0.39 | 0.54 |
| | image + caption + segments | 0.89 | 0.39 | 0.55 |
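The F-scores in these results are consistent with the balanced F-score, the harmonic mean of precision and recall. A short check, assuming that definition, for the two segments-only rows:

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Segments-only method, detection task and location task.
print(round(f_score(0.87, 0.66), 2))  # 0.75
print(round(f_score(0.85, 0.40), 2))  # 0.54
```

Because the reported precision and recall are themselves rounded, recomputed values for other rows can differ from the reported ones in the last digit.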
![Page 11: Finding and Accessing Diagrams in Biomedical Publications](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554e81d5b4c9054a698b5517/html5/thumbnails/11.jpg)
Gel Images
Gel diagrams are another important type of image:
• They are the result of gel electrophoresis (e.g. Southern, Western and Northern blotting)
• They are often shown in biomedical publications as evidence for the discussed findings (e.g. protein-protein interactions and protein expressions under different conditions)
• About 15% of all subfigures are gel images
• They are structured according to common regular patterns
![Page 12: Finding and Accessing Diagrams in Biomedical Publications](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554e81d5b4c9054a698b5517/html5/thumbnails/12.jpg)
Relations from Gel Images
| Condition | Measurement | Result |
|---|---|---|
| MDA-MB-231 | 14-3-3σ | high expression |
| NHEM | 14-3-3σ | no expression |
| C8161.9 | 14-3-3σ | high expression |
| LOX | 14-3-3σ | low expression |
| MDA-MB-231 | β-actin | high expression |
| NHEM | β-actin | high expression |
| C8161.9 | β-actin | high expression |
| LOX | β-actin | high expression |

| Condition | Measurement | Result |
|---|---|---|
| IL-1β (–) DEX (–) RU486 (–) | p-p38 | low expression |
| IL-1β (+) DEX (–) RU486 (–) | p-p38 | high expression |
| IL-1β (–) DEX (+) RU486 (–) | p-p38 | no expression |
| IL-1β (+) DEX (+) RU486 (–) | p-p38 | low expression |
| IL-1β (–) DEX (–) RU486 (+) | p-p38 | no expression |
| IL-1β (+) DEX (–) RU486 (+) | p-p38 | high expression |
| IL-1β (–) DEX (+) RU486 (+) | p-p38 | low expression |
| IL-1β (+) DEX (+) RU486 (+) | p-p38 | high expression |
| ... | ... | ... |
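Relations of this kind are triples of condition, measurement, and result. One possible way to hold them in code is sketched below; the `GelRelation` type and the `conditions_with` helper are hypothetical names for illustration, and the rows are copied from the first table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GelRelation:
    """One row of a relation table read off a gel image (hypothetical type)."""
    condition: str
    measurement: str
    result: str


relations = [
    GelRelation("MDA-MB-231", "14-3-3σ", "high expression"),
    GelRelation("NHEM", "14-3-3σ", "no expression"),
    GelRelation("C8161.9", "14-3-3σ", "high expression"),
    GelRelation("LOX", "14-3-3σ", "low expression"),
]


def conditions_with(relations, measurement, result):
    """Conditions under which a measurement shows the given result."""
    return [r.condition for r in relations
            if r.measurement == measurement and r.result == result]


print(conditions_with(relations, "14-3-3σ", "high expression"))
```

A structure like this would make extracted gel results searchable by condition or by measured protein.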
![Page 13: Finding and Accessing Diagrams in Biomedical Publications](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554e81d5b4c9054a698b5517/html5/thumbnails/13.jpg)
Procedure
[Pipeline diagram, steps 1 to 7: articles, figures, segments, text, gels, gel panels, named entities, relations]
We focus here on steps 4, 5, and 6. Steps 1, 2, and 3 have been addressed in prior work. Step 7 is future work.
![Page 14: Finding and Accessing Diagrams in Biomedical Publications](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554e81d5b4c9054a698b5517/html5/thumbnails/14.jpg)
Gel Segment Detection
[Step 4 of the pipeline: gels]
Random forest classifiers on a number of features of image segments (position, size, grayscale histogram, color, texture, and number of recognized characters).
Results on 1000 manually annotated, random figures:
| Setting | Threshold | Precision | Recall | F-score |
|---|---|---|---|---|
| high recall | 0.15 | 0.439 | 0.909 | 0.592 |
| | 0.30 | 0.765 | 0.739 | 0.752 |
| high precision | 0.60 | 0.926 | 0.301 | 0.455 |

AUC: 0.980
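The trade-off between the high-recall and high-precision settings comes from applying different thresholds to the same classifier scores. The sketch below illustrates this on made-up scores and labels; the data and the `>=` comparison are assumptions, not values from the evaluation.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when every score >= threshold counts as 'gel'."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and l for p, l in zip(predicted, labels))
    fp = sum(p and not l for p, l in zip(predicted, labels))
    fn = sum((not p) and l for p, l in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical classifier scores and ground-truth labels for six segments.
scores = [0.9, 0.7, 0.4, 0.35, 0.2, 0.1]
labels = [True, True, False, True, True, False]

print(precision_recall(scores, labels, 0.15))  # (0.8, 1.0)
print(precision_recall(scores, labels, 0.60))  # (1.0, 0.5)
```

A low threshold accepts more segments and so favors recall, while a high threshold accepts only confident segments and so favors precision.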
![Page 15: Finding and Accessing Diagrams in Biomedical Publications](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554e81d5b4c9054a698b5517/html5/thumbnails/15.jpg)
Gel Panel Detection
[Step 5 of the pipeline: gel panels]
Algorithm:
• Start with a gel segment according to the high-precision classifier
• Repeatedly look for adjacent gel segments according to the high-recall classifier, and merge them
• Collect labels in the form of text segments around the detected gel region
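The first two steps of this algorithm amount to seeded region growing. A minimal sketch follows, assuming each segment carries a classifier score and that the thresholds 0.60 and 0.15 from the gel segment results serve as the high-precision and high-recall settings. The `Segment` type, the `gap` tolerance, and the adjacency test are hypothetical, and label collection is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Bounding box plus gel-classifier score (hypothetical type)."""
    x: int
    y: int
    w: int
    h: int
    score: float


def adjacent(a: Segment, b: Segment, gap: int = 5) -> bool:
    """True if the two boxes overlap or lie within `gap` pixels of each other."""
    return (a.x - gap <= b.x + b.w and b.x - gap <= a.x + a.w
            and a.y - gap <= b.y + b.h and b.y - gap <= a.y + a.h)


def grow_panels(segments, high_precision=0.60, high_recall=0.15, gap=5):
    """Start from high-precision seeds and merge adjacent high-recall segments."""
    candidates = [s for s in segments if s.score >= high_recall]
    used = set()
    panels = []
    for seed in candidates:
        if seed.score < high_precision or seed in used:
            continue
        panel = [seed]
        used.add(seed)
        grew = True
        while grew:  # repeat until no further segment can be merged
            grew = False
            for s in candidates:
                if s not in used and any(adjacent(s, m, gap) for m in panel):
                    panel.append(s)
                    used.add(s)
                    grew = True
        panels.append(panel)
    return panels


a = Segment(0, 0, 100, 20, 0.9)    # confident seed
b = Segment(0, 22, 100, 20, 0.2)   # weak, but touching the seed
c = Segment(0, 44, 100, 20, 0.3)   # weak, touching b
d = Segment(0, 200, 100, 20, 0.2)  # weak and isolated
print(grow_panels([a, b, c, d]))
```

In this toy input, `a`, `b`, and `c` end up in one panel, while the isolated segment `d` is dropped because no high-precision seed reaches it.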
Results on another set of 500 manually annotated figures:
| Precision | Recall | F-score |
|---|---|---|
| 0.951 | 0.379 | 0.542 |
![Page 16: Finding and Accessing Diagrams in Biomedical Publications](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554e81d5b4c9054a698b5517/html5/thumbnails/16.jpg)
Named Entity Recognition
[Step 6 of the pipeline: named entities]
Detection of gene and protein names in gel labels from a sample of 2000 random figures (tokenization; case-sensitive Entrez Gene lookup; exclude very short and very common words):
| | absolute | relative |
|---|---|---|
| Total | 156 | 100.0% |
| Incorrect | 54 | 34.6% |
| – Not mentioned (OCR errors) | 28 | 17.9% |
| – Not references to genes or proteins | 26 | 16.7% |
| Correct | 102 | 65.4% |
| – Partially correct (could be more specific) | 14 | 9.0% |
| – Fully correct | 88 | 56.4% |
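The lookup procedure (tokenization, case-sensitive dictionary lookup, exclusion of very short and very common words) can be sketched as follows. The tiny name set stands in for an Entrez Gene dictionary and, like the common-word list and the minimum length of 3, is a hypothetical choice for illustration.

```python
import re

# Hypothetical stand-ins: a real run would load names from Entrez Gene
# and a list of frequent English words.
GENE_NAMES = {"ACTB", "AKT1", "p38", "WAS", "T"}
COMMON_WORDS = {"and", "was", "the"}


def tokenize(label: str) -> list[str]:
    """Split a gel label into alphanumeric tokens, keeping the original case."""
    return re.findall(r"[A-Za-z0-9]+", label)


def find_gene_tokens(label, gene_names=GENE_NAMES, common_words=COMMON_WORDS,
                     min_length=3):
    """Return tokens that match a gene name exactly (case-sensitive)."""
    hits = []
    for token in tokenize(label):
        if len(token) < min_length:        # exclude very short words
            continue
        if token.lower() in common_words:  # exclude very common words
            continue
        if token in gene_names:            # case-sensitive lookup
            hits.append(token)
    return hits


print(find_gene_tokens("p-p38 and ACTB was T"))  # ['p38', 'ACTB']
print(find_gene_tokens("actb"))                  # []
```

The filters trade recall for precision: the one-letter token `T` and the word "was" are skipped even though they appear in the name set.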
![Page 17: Finding and Accessing Diagrams in Biomedical Publications](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554e81d5b4c9054a698b5517/html5/thumbnails/17.jpg)
Overall Results on PubMed Central
We ran our pipeline on the whole open access subset of PubMed Central:

| | |
|---|---|
| Total articles | 410 950 |
| Processed articles | 386 428 |
| Total figures from processed articles | 1 110 643 |
| Processed figures | 884 152 |
| Detected gel panels | 85 942 |
| Detected gel panels per figure | 0.097 |
| Detected gene tokens | 1 854 609 |
| Detected gene tokens in gel labels | 75 610 |
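The per-figure rate follows directly from the counts, assuming it is computed over processed figures. The share of gene tokens found in gel labels is derived here for illustration and is not a figure from the slides.

```python
processed_figures = 884_152
detected_gel_panels = 85_942
detected_gene_tokens = 1_854_609
gene_tokens_in_gel_labels = 75_610

# Gel panels per processed figure, as reported in the table.
panels_per_figure = detected_gel_panels / processed_figures
print(round(panels_per_figure, 3))  # 0.097

# Derived value: fraction of all detected gene tokens that sit in gel labels.
gel_label_share = gene_tokens_in_gel_labels / detected_gene_tokens
print(round(gel_label_share, 3))  # 0.041
```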
![Page 18: Finding and Accessing Diagrams in Biomedical Publications](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554e81d5b4c9054a698b5517/html5/thumbnails/18.jpg)
Conclusions and Future Work
Conclusions
• The location of certain diagram types like axis and gel diagrams can be extracted at a high precision of about 90% with an F-score around 55%
Future Work
• Relation extraction
• Include other image types like network diagrams
• Combination with classical text mining techniques
• Detection of other named entity types: cell lines, drugs, ...
• Sophisticated diagram search interface
• Standard for biomedical diagrams?
![Page 19: Finding and Accessing Diagrams in Biomedical Publications](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554e81d5b4c9054a698b5517/html5/thumbnails/19.jpg)
Discussion: Standardized Biomedical Diagrams?
It seems feasible to extract relations from gel images at satisfactory accuracy, but it is clear that this procedure is far from perfect.
Do we need a standard for biomedical diagrams? A Unified Modeling Language (UML) for biology and medicine?
![Page 20: Finding and Accessing Diagrams in Biomedical Publications](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554e81d5b4c9054a698b5517/html5/thumbnails/20.jpg)
Thank you for your Attention!
Questions?