NSF REU Program in Medical Informatics
D. Raicu (1), J. Furst (1), D. Channin (2), S. Armato (3), and K. Suzuki (3)
(1) DePaul University, (2) Northwestern University, and (3) University of Chicago
REU Data
Overview
Goal: continue promoting interdisciplinary studies at the frontier between information technology and medicine to undergraduate students, especially students from groups historically underrepresented in the exact sciences
Duration: 10 weeks over the summer
Example Teaching:
Interdisciplinary tutorials: Image processing, machine learning
Technology tools tutorials: MatLab, SPSS
Presentations by mentors about projects
Example Activities:
Follow-on activities
Bi-weekly group meetings, presentations to the entire MedIX group, final reports (in conference formats), seminars to support student publication
Special events
“Day in the life of a PhD student”, “Developing a research career”, “Women in science”, tours of medical facilities, etc.
Unique Site Aspect:
Multi-institution & multi-disciplinary site @ the frontier between computer science & medicine
Outcomes (2005-2007)
88% of students had at least one research publication
Over 23 publications (1 journal paper, 15 conference papers, 8 extended abstracts)
3 honors theses and senior projects, 4 graduate fellowships, and 1 CRA honorable mention for outstanding undergraduate research
Statistics (2005-2007)
Student demographics: 8 students per year
Female: 46%; first-generation college: 15%; from outside the home institutions: 73%
Prior research experience: had given a visual (poster) research presentation (31%) or an oral research presentation (27%), had (co-)authored a publication in an academic journal (12%), or had been involved in a research project in the previous two years (42%).
Total number of Faculty mentors: 4
Years of operation: 2005 to 2010
Example Research topics: see on the left side
Introduction
This work thoroughly investigates ways to predict the results of a semantic-based image retrieval system by using solely content-based image features. We extend our previous work [1] by studying the relationships between the two types of retrieval, content-based and semantic-based, with the final goal of integrating them into a system that will take advantage of both retrieval approaches. Our results on the Lung Image Database Consortium (LIDC) dataset show that a substantial number of nodules identified as similar based on image features are also identified as similar based on semantic characteristics. Furthermore, by integrating the two types of features, the similarity retrieval improves with respect to certain nodule characteristics.
Methodology
Computation to best represent semantic-based similarity values using only content-based features.
The goal is to find similar nodules to make a better diagnosis of the query.
Content-based image retrieval is the goal, as that would involve little human interaction on very large data sets.
The 149 CT scans, one for each nodule, are from the Lung Image Database Consortium (LIDC).
Results greatly improve the usefulness of a content-based image retrieval system.
Up to four radiologists rated the nodules on 9 distinct features. Only 7 features varied enough to incorporate; each is rated on a scale of 1 to 5.
The radiologist compares similar nodules to aid in the diagnosis. Often, comparing similar nodules can lead to a more certain diagnosis [3].
Figure 1 — Methodology
The LIDC contains complete thoracic CT scans for 85 patients with lesions. Nodules with a diameter larger than three millimeters were rated by a panel of four radiologists [2]. They rated nine characteristics of the masses that they considered nodules. Seven of those characteristics, each rated on a scale of one to five, are useful to our analysis:
Lobulation, Malignancy, Margin, Sphericity, Spiculation, Subtlety, and Texture
For each image, we calculated 64 different content-based features [1]:
Shape Features: circularity, roughness, elongation, compactness, eccentricity, solidity, extent, and standard deviation of radial distance
Size Features: area, convex area, perimeter, convex perimeter, equivalence diameter, major axis length, and minor axis length
Gray-Level Intensity Features: minimum, maximum, mean, standard deviation, and difference
Texture Features: based on co-occurrence matrices, Gabor filters, and Markov random fields
Content-based versus Semantic-based Similarity Retrieval: A LIDC Case Study
Sarah Jabon (a), Jacob Furst (b), Daniela Raicu (b)
(a) Rose-Hulman Institute of Technology, Terre Haute, IN 47803; (b) Intelligent Multimedia Processing Laboratory, School of Computer Science, Telecommunications, and Information Systems, DePaul University, Chicago, IL, USA, 60604
Using k Number of Matches
The number of nodules that had 2 to 5 matches was relatively consistent across all image features, but slightly higher for Gabor and Markov. No combination of image features had more than 10 matches out of the twenty most similar.
Below is a scatter plot of the content-based similarity versus the semantic-based similarity value.
[1] Lam, M., Disney, T., Pham, M., Raicu, D., Furst, J., “Content-Based Image Retrieval for Pulmonary Computed Tomography Nodule Images”, SPIE Medical Imaging Conference, San Diego, CA, February 2007.
[2] The National Cancer Institute, “Lung Imaging Database Consortium (LIDC)”, http://imaging.cancer.gov/programsandresources/InformationSystems/LIDC.
[3] Li, Q., Li, F., Shiraishi, J., Katsuragawa, S., Sone, S., Doi, K., “Investigation of New Psychophysical Measures for Evaluation of Similar Images on Thoracic Computed Tomography for Distinction between Benign and Malignant Nodules”, Medical Physics 30:2584-2593, 2003.
[4] Han, J., Kamber, M., [Data Mining: Concepts and Techniques], London: Academic Press, 2001.
Image Data
Calculating Similarity
Similarity Comparisons
In order to assess the correlation between the two similarity measures, we used a round-robin approach where we extracted one nodule as a query and compared it to the remaining 148 nodules. We took the k most similar values from each query’s semantic-based similarity ordered list and content-based similarity ordered list and counted how many nodules were common to both lists.
Here is an example with nodule 117 as the query nodule. Below are the most similar nodules listed with their attributes.
Notice that the semantic similarity values have a much smaller range, from 0 to about 0.3, whereas the content-based similarities range from 0 to 1. Most of the semantic features are very similar. A ranking of i signifies that the nodule was the ith most similar nodule in the list of similar nodules based on the appropriate feature set.
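The round-robin comparison described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the pairwise values are stored in two square lists of lists (the hypothetical names `semantic` and `content`) and that a smaller value means a more similar nodule, which is how the example for nodule 117 is ordered.

```python
def top_k(values_for_query, query, k):
    """Indices of the k nodules most similar to `query` (smallest value first)."""
    candidates = [i for i in range(len(values_for_query)) if i != query]
    return sorted(candidates, key=lambda i: values_for_query[i])[:k]


def match_counts(semantic, content, k):
    """For every query nodule, count the nodules shared by both top-k lists."""
    counts = []
    for q in range(len(semantic)):
        semantic_top = set(top_k(semantic[q], q, k))
        content_top = set(top_k(content[q], q, k))
        counts.append(len(semantic_top & content_top))
    return counts
```

With 149 nodules and k = 20, `match_counts` would return one count per query nodule, which could then be grouped into ranges such as 0 to 1, 2 to 5, and 6 to 10 matches.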
Analysis
References
Conclusions
Our preliminary results show that a substantial number of nodules identified as similar based on image features are also identified as similar based on semantic characteristics; therefore, the image features capture properties that radiologists look at when interpreting lung nodules. There are many similarity metrics that can be used to try to correlate the two retrieval systems. We found the Euclidean distance to work best for the content-based features and the cosine similarity measure to work best for the semantic-based characteristics. In our future work, we will try principal component analysis and linear regression on the data. Further research is necessary to investigate the correlations between the two types of features and to integrate them into one retrieval system that will be of clinical use.
Rad.         Lob.  Mal.  Marg.  Spher.  Spic.  Subt.  Text.
A            3     4     4      2       4      3      4
B            4     3     4      4       3      5      5
C            4     2     3      4       3      4      5
D            4     3     2      2       4      3      3
Summarized:  4     3     4      3       3      3      5
Figure 2 — Sample CT Scan with Four Radiologists’ Ratings
Semantic-Based Features
Content-Based Features
[Histogram: x-axis VAR00002 (0.00 to 0.40), y-axis Frequency (0 to 1,200). Mean = 0.0766, Std. Dev. = 0.06374, N = 11,026.]
[Histogram: x-axis VAR00001 (0.0 to 1.0), y-axis Frequency (0 to 600). Mean = 0.2840127, Std. Dev. = 0.154278896, N = 11,026.]
At right is a histogram of the content-based similarity values for all 11,026 nodule pairs. The similarity values are calculated with the Euclidean distance, which is defined below, and then min-max normalization is applied [4].
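As a rough sketch of this step, the code below computes the Euclidean distance between every pair of feature vectors and rescales the results to the range 0 to 1. It rests on two assumptions that the text does not confirm: that the min-max normalization is applied to the pairwise distances rather than to the individual features, and that `features` (a hypothetical name) holds one list of content-based features per nodule.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalized_pairwise_distances(features):
    """Return {(i, j): value} for every unordered pair, min-max scaled to [0, 1]."""
    raw = {
        (i, j): euclidean(features[i], features[j])
        for i, j in combinations(range(len(features)), 2)
    }
    lo, hi = min(raw.values()), max(raw.values())
    span = hi - lo
    # If every distance is identical there is nothing to scale; map all to 0.
    return {pair: ((d - lo) / span if span else 0.0) for pair, d in raw.items()}
```

For 149 nodules the dictionary has `math.comb(149, 2)` entries, which equals the 11,026 nodule pairs shown in the histogram.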
At the end of the feature extraction process, each nodule is represented by a vector as shown below, where c stands for a semantic concept and f for an image feature.
Figure 4 — Histogram of Content-Based Similarity
Figure 3 — Histogram of Semantic-Based Similarity
The cosine similarity measure minimized the ceiling effect. The similarity value calculation using the cosine formula is shown below.
The histogram to the right shows the semantic-based similarity values for all 11,026 nodule pairs. Although the values do not follow a perfect normal curve, the ceiling effect is drastically reduced compared with computing a simple distance on the seven characteristics.
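The cosine formula itself did not survive in this transcript, so the sketch below is an assumption: it takes the semantic-based similarity value to be one minus the cosine of the angle between the two seven-element rating vectors. That reading is consistent with the retrieval example in Figure 5, where the query has a value of 0 against itself and smaller values rank higher. The function name is illustrative.

```python
import math


def semantic_similarity_value(q, n):
    """One minus the cosine similarity of two rating vectors (0 = identical direction)."""
    dot = sum(a * b for a, b in zip(q, n))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_n = math.sqrt(sum(b * b for b in n))
    return 1.0 - dot / (norm_q * norm_n)


# Semantic feature vectors (Lob, Mal, Mar, Sph, Spic, Sub, Tex) taken from Figure 5.
query_117 = [2, 3, 5, 5, 2, 4, 5]
nodule_104 = [2, 3, 4, 4, 2, 3, 4]
nodule_126 = [2, 3, 5, 5, 1, 4, 5]

print(round(semantic_similarity_value(query_117, nodule_104), 6))
print(round(semantic_similarity_value(query_117, nodule_126), 6))
```

Under this assumption the two printed values agree with the 0.004452 and 0.004596 listed for nodules 104 and 126 in Figure 5.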
Query Nodule (Q):
Database Nodule (N):
No.   Semantic-Based          Content-Based           Semantic Feature Vector
      Ranking  Sim. Value     Ranking  Sim. Value     Lob  Mal  Mar  Sph  Spic  Sub  Tex
117   -        0              -        0              2    3    5    5    2     4    5
104   2        0.004452       5        0.415918       2    3    4    4    2     3    4
126   3        0.004596       6        0.421249       2    3    5    5    1     4    5
98    6        0.006817       17       0.505317       2    3    4    5    2     3    5
28    8        0.009119       16       0.504996       1    3    5    5    1     4    5
27    11       0.012752       2        0.380517       1    3    5    5    1     3    5
137   14       0.013072       9        0.430289       1    3    4    5    1     3    4
127   16       0.013606       11       0.474226       2    4    5    4    3     4    5
119   17       0.015268       20       0.538589       3    3    4    4    2     3    5
90    20       0.016383       7        0.425751       1    2    3    4    1     2    4
Figure 5 — Example of Image Retrieval Results
Applying a Threshold
We analyzed the difference in the scales of similarity by counting how many matches there were based on thresholds. Below is a graph for two different similarity thresholds, 0.02 and 0.04. These thresholds are applied to the semantic similarity values. There were many more matches within these thresholds.
Matches   Gabor   Markov   Co-Occurrence   Gabor, Markov, and Co-Occurrence   All Features
6 – 10    24      18       31              36                                 43
2 – 5     107     104      94              98                                 93
0 – 1     18      27       24              15                                 13
Figure 6 — Match Count in 20 Most Similar Nodules
Figure 7 — Content-Based Similarity vs. Semantic-Based Similarity
[Scatter plot: Similarity Values: Content-Based vs. Semantic-Based. x-axis: Semantic-Based Similarity (0 to 0.5); y-axis: Content-Based Similarity (0 to 1).]
[Graph: Thresholds on Semantic-Based Characteristics. x-axis: Number of Matches (0 to 56); y-axis: Occurrence (out of 149); series: Threshold 0.02 and Threshold 0.04.]
Figure 6 — Match Based on All Features and Thresholds