Journal of Biomedical Informatics
Histology image search using multimodal fusion
http://dx.doi.org/10.1016/j.jbi.2014.04.016
1532-0464/© 2014 Published by Elsevier Inc.
* Corresponding author. Current address: Computer Science Department, University of Illinois at Urbana-Champaign, 201 N Goodwin Ave, Urbana, IL 61801, USA.
E-mail addresses: [email protected] (J.C. Caicedo), [email protected] (J.A. Vanegas), [email protected] (F. Paez), [email protected] (F.A. González).
1 Work done while at Universidad Nacional de Colombia.
Juan C. Caicedo a,*,1, Jorge A. Vanegas b, Fabian Paez b, Fabio A. González b
a University of Illinois at Urbana–Champaign, IL, USA
b MindLab Research Laboratory, Universidad Nacional de Colombia, Bogotá, Colombia
Article info
Article history:
Received 18 June 2013
Accepted 30 April 2014
Available online xxxx

Keywords:
Histology
Digital pathology
Image search
Multimodal fusion
Visual representation
Semantic spaces
Abstract
This work proposes a histology image indexing strategy based on multimodal representations obtained from the combination of visual features and associated semantic annotations. Both data modalities are complementary information sources for an image retrieval system, since visual features lack explicit semantic information and semantic terms do not usually describe the visual appearance of images. The paper proposes a novel strategy to build a fused image representation using matrix factorization algorithms and data reconstruction principles to generate a set of multimodal features. The methodology can seamlessly recover the multimodal representation of images without semantic annotations, allowing us to index new images using visual features only, and also accepting single example images as queries. Experimental evaluations on three different histology image data sets show that our strategy is a simple, yet effective approach to building multimodal representations for histology image search, and outperforms the response of the popular late fusion approach to combine information.
© 2014 Published by Elsevier Inc.
1. Introduction
Digital pathology makes it easy to exchange histology images and enables pathologists to rapidly study multiple samples from different cases without having to unpack the glass [1]. The increasing adoption of digital repositories for microscopy images results in large databases with thousands of records, which may be useful to support the decision making process in clinical and research activities. However, in modern hospitals and health care centers, the number of images to keep track of is beyond the ability of any specialist. A very promising direction to realize the potential of these collections is through efficient and effective tools for image search. For instance, when a new slide is being observed, a camera coupled to the microscope can capture the current view, send the picture to the retrieval system, and show results on a connected computer. These results can help to clarify structures in the observed image, explore previous cases and, in general, may allow clinicians and researchers to explore large collections of records previously evaluated and diagnosed by other physicians.
The query-by-example paradigm for image search—when the user's query is an example image with no annotations—has a number of potential applications in medicine and clinical activities [2]. The main challenge when implementing such a system consists of correctly defining the matching criteria between query images and database images. The standard approach for content-based retrieval in image collections relies on using similarity measures between low-level visual features to perform a nearest-neighbor search [3]. The problem of this approach is that these characteristics usually fail to capture high-level semantics of images, a problem known as the semantic gap [4]. Different methods to bridge this gap have been proposed to build a model that connects low-level features with high-level semantic content, such as automatic image annotation [5] and query by semantic example [6]. These methods represent images in a semantic space spanned by keywords, so a nearest neighbors search in that space retrieves semantically related images. Approaches like these have also been investigated for histology image search [7–9].
Image search systems based on a semantic representation have been shown to outperform purely visual search systems in terms of Mean Average Precision (MAP) [3]. However, these approaches may lose the notion of visual similarity among images since the search process ends up relying entirely on high level descriptions of images. The ranking of search results is based on potentially relevant keywords, ignoring useful appearance clues that are not described by index terms. In a clinical setting, visual information plays an important role for searching histology images, which ultimately reveals the biological evidence for the decision making process in clinical activities. We consider that both visual content and semantic data are complementary sources of information that may be combined to produce high quality search results.
Multimodal fusion has emerged as a very useful approach to combine different signals with the purpose of making certain semantic decisions in automated systems. We refer the reader to [10] for a comprehensive survey of multimodal fusion in various multimedia applications. For image indexing in particular, multimodal fusion consists of combining visual and semantic data. Several methodologies have recently been proposed to model the relationships between these two data modalities, with the goal of constructing better image search systems. Two main strategies may be identified to achieve the combination of both data modalities: (1) early fusion [11], to build a combined representation of images before the ranking procedure, and (2) late fusion [12], to combine similarity measures during the ranking procedure. One of the advantages of early fusion over late fusion is that the former often benefits from explicitly modeling the relationships between the two data modalities, instead of simply using them as separate opinions. However, this clearly requires a significant effort in understanding and extracting multimodal correspondences.
In this work, we propose a novel method for indexing histology images using an early multimodal fusion approach, that is, combining the two data modalities in a single representation to generate the ranking directly in such a space. The proposed method uses semantic annotations as an additional data source that represents images in a vector space model. Then, matrix-factorization-based algorithms are used to find the relationships between data modalities, by learning a function that projects visual data to the semantic space and the other way around. We take advantage of this property by fusing both data modalities in the same vector space, obtaining as a result the combined representation of images.
A systematic experimental evaluation was conducted on three different histology image databases. Our goal is to validate the potential of various image search techniques to understand the strengths and weaknesses of visual, semantic and multimodal indexing in histology image collections. We focus our evaluation on two performance measures commonly used in information retrieval research: Mean Average Precision (MAP), and Precision at the first 10 results of the ranked list (P@10), for early precision. We observed that semantic approaches are very good at maximizing MAP, while visual search is a strong baseline for P@10, revealing a trade-off in performance when using one or the other representation. This also confirms the importance of combining both data modalities.
Our approach combines multimodal data using a convex combination of the visual and semantic information, resulting in a continuous spectrum of multimodal representations and allowing us to explore various mixes from purely visual to purely semantic representations as needed. This is similar in spirit to late fusion, which allows the setting of weights to scores produced by each modality. However, our study shows significant improvement in performance when building an explicitly fused representation, instead of considering modalities as separate voters for the rank of images. We also found that multimodal fusion can balance a trade-off between maximizing MAP and early precision, demonstrating the potential to improve the response of histology image retrieval systems.
1.1. Overview
This work proposes an indexing technique for image search, using both visual image content and associated semantic terms. Fig. 1 illustrates a pipeline for image search in a clinical setting, which involves a physician or expert pathologist working with microscopy equipment with digital image acquisition capabilities or in a virtual microscopy system. Through an interactive mechanism, the user can ask the system to take a picture of the current view and send a query to the image search system. The system has a pre-computed fused representation of images in the database. A ranking algorithm is used to identify the most relevant results in the database, which are retrieved and presented to the user.
The main goal of the system is to support clinicians during the decision making process by providing relevant associated information. The ability to find related cases among past records in a database has the potential to improve the quality of health care using an evidence-based reasoning approach. Historic archives in a hospital comprise a knowledge base reflecting its institutional experience and expertise, which can be used to enhance the daily medical practice.
This paper focuses on two important aspects of the entire pipeline: (1) strategies for constructing the index based on a multimodal fused representation and (2) an empirical evaluation of different strategies for histology image search using collections of real diagnostic images. The main contribution of our work is a novel method for combining visual and semantic data in a fused image representation, using a computationally efficient strategy that outperforms the popular late-fusion approach, and that balances the trade-off between visual and semantic data. While the applicability of the proposed model may be extended to general image collections beyond histology images, the second contribution of this work is an extensive evaluation on histology images, since a straightforward application of image retrieval techniques may not result in an optimal outcome. Part of our experimental evaluation shows that off-the-shelf indexing methods such as latent semantic indexing and late fusion do not always exploit specific characteristics of histology images.
Fig. 1. Overview of the image search pipeline. Images acquired in a clinical setting are used as example queries. The system processes and matches queries with entries in a multimodal index, which represent images and text in the database. Results are returned to support the decision making process.
Another question related to the use of a system like this in a real environment is what the impact of search results would be on the medical practice itself, and the outcome of such decisions on the quality of health care for patients. We believe this question is both very interesting in nature and quite important to investigate. To the best of our knowledge, a formal study of this problem has not been conducted yet in the histology domain, and it is also beyond the scope of this paper. However, other studies in radiology have shown improved decisions in the final diagnosis made by inexperienced physicians when they use image retrieval technologies [13].
The contents of this paper are organized as follows: Section 2 discusses relevant related work in histology image retrieval. The three histology image data sets used for experimental evaluations are presented in Section 3. Section 4 introduces the proposed algorithms and methods for multimodal fusion. The experimental evaluation and results are presented in Section 5. Finally, Section 6 summarizes and presents the concluding remarks.
2. Previous work
The automatic analysis of histology images is an important and growing research field that comprises different purposes and techniques. From image classification [14] to automatic pathology grading [15], the large amount of microscopy images in medicine may benefit from automated methods that allow users to manage visual collections for supporting the decision making process in clinical practice. This work is primarily focused on image search and retrieval technologies, which serve as a mechanism to find relevant and useful histology images from an available database.
2.1. Content-based medical image retrieval
Early studies of content-based medical image retrieval were reviewed by Müller et al. [2]. One of the first systems for histology image retrieval was reported by Zheng et al. [16], which uses low-level visual features to discriminate between various pathological samples. The use of low-level features was quickly recognized to have limitations for distinguishing among complex semantic arrangements in histology images, so researchers proposed the semantic analysis of histology slides to build image search systems. Later, a system based on artificial neural networks that learned to recognize twenty concepts of gastro-intestinal tissues on digital slides was presented by Tang et al. [7]. The system allowed clinicians to query using specific regions on example histology images. However, important efforts to collect labeled examples were required since the design of these learning algorithms needed local annotations, a procedure that might be very expensive.
Relatively few works continued the effort of designing semantic image search systems for histology images; these include the work of Naik et al. [8] on breast tissue samples, and Caicedo et al. [17] on skin cancer slides. However, the task of histology image classification has been actively explored in various microscopy domains [18–22], which is related to semantic retrieval. The primary purpose of these methods is to assign correct labels to images, which differs from the problem of building a multimodal representation. Besides, the transformation from visual content to strict semantic keywords may lead to a loss of useful visual information for a search engine, since images are summarized in a few keywords and visual details are not considered anymore.
2.2. Multimodal fusion for medical image retrieval
Multimodal retrieval has been approached in the medical imaging domain to find useful images in academic journal repositories and biomedical image collections, by combining captions along with visual characteristics to find relevant results using late fusion [23,24]. However, these strategies assume that the user's query is composed of example images as well as a text description. If users only provide example images, because they do not know precise terms or just because of another practical reason, the system has no choice but to match purely visual content. Relevant work by Rahman et al. [25] uses fusion techniques for biomedical image retrieval by combining multiple features and scores of classifiers on the fly, allowing users to interact and provide relevance feedback to refine results.
For ten years, the ImageCLEFmed community also dedicated efforts to study the problem of medical image retrieval using multimodal data around an academic challenge [26–29]. Each year, various research groups obtained a copy of a single collection of medical images that includes almost all medical imaging techniques at once, such as X-rays, PET, MRI, CT and microscopy. The goal was to index its contents and provide answers for specific queries composed of example images and text. Multimodal fusion was a central idea in the development of solutions in this challenge, and late fusion has been reported as one of the most robust techniques.
Our work is focused on combining semantic terms and visual features in a fused image representation (based on early fusion principles), which can be used for image retrieval or as input for other classifiers and systems. An important component of the proposed strategy is its ability to use the same representation for images that do not have text annotations. In that way, the system can handle example image queries as well as database images without semantic meta-data. In this work, we build on top of Nonnegative Matrix Factorization algorithms recently proposed to find relationships in multimodal image collections [30]. We extend these ideas to propose a novel algorithm for fusing multimodal information in histology image databases with an arbitrary number of terms.
Other studies of multimodal fusion for multimedia retrieval have been conducted recently. A strategy to summarize and browse large collections of Flickr images using text and visual information was presented by Fan et al. [31]. However, they learn latent topics for text independently of visual features, so multimodal relationships are not modeled or extracted. The use of latent topic models has been extended to explicitly find relationships between visual features and text terms using probabilistic latent semantic indexing [32] and very rich graphical models [33]. In this work, we formulate the problem of extracting multimodal relationships as a subspace learning problem, which generates multimodal representations using vector operations.
Recent studies of multimodal fusion for histology images deal with the problems of combining different imaging modalities (such as MRI images and microscopy images) [34] or combining decisions made at different regions of the same image [35]. To the best of our knowledge, our work is the first study of multimodal fusion of semantic and visual data specifically oriented to histology image retrieval. We reported promising experimental results in our previous work [36], and this paper extends that evaluation in substantial ways. First, the notion of multimodal fusion by back-projection is introduced for the first time, which allows us to effectively combine visual and semantic representations for histology image indexing. Second, a more comprehensive experimentation was carried out, using three different data sets, additional evaluations and extended discussions.
3. Histology image collections
Three different histology image collections were selected as case studies for this work. The first two are from pathology cases with corresponding diagnosis and descriptions. They were collected as part of different long-term projects in the Pathology and the Biology departments, with the collaboration of several experts and graduate students from the Medicine School of Universidad Nacional de Colombia, in Bogotá. The data sets were collected and annotated by expert pathologists, extracting information from a larger database of real cases. In general, these collections of images have been annotated by several individuals, who agree on the results after discussions in a committee-like process. The efforts of collecting these annotations have been oriented to create index terms to preserve information related to cases, which could be accessible through an information retrieval system. These cases were anonymized to remove any information related to patients, and only data associated with the diagnosis and description of images were preserved.

Table 1
Number of images and semantic terms on each data set.

Data set              | Images | Training | Query | Terms
Cervical Cancer       |    530 |      447 |    54 |     8
Basal-cell Carcinoma  |   1502 |     1201 |   301 |    18
Histology Atlas       |   2641 |     2113 |   528 |    46
The third data set is part of a histology atlas containing images from the four fundamental tissues of living beings. These images were collected and labeled by researchers in the Biology Department, to provide students with high quality reference material in digital format. Table 1 presents basic statistics of these collections and Fig. 2 shows some example images for each data set. More details about each data set are presented below.
1. Cervical Cancer. This data set, with 530 images from more than 120 cases, characterizes various conditions and stages of cervical cancer. Images in this collection were acquired by a medical resident and validated by an expert pathologist from tissue samples stained with hematoxylin and eosin. Images were captured at 40× magnification with controlled lighting conditions, and from each slide, an average of 4.5 sub-regions of 3840 × 3072 pixels were selected. Each image preserves as metadata the case number to which it belongs and a list of global annotations. Annotations span eight different categories including relevant diagnostic information and other tissue characteristics. This list of categories includes: cervicitis inflammatory pathology, intraepithelial lesion, squamous cell carcinoma, and metaplasia, among others.
2. Basal-cell Carcinoma. This collection has 1502 images of skin samples stained with hematoxylin and eosin used to diagnose cancer, from a collection of more than 300 cases. About 900 of these images correspond to pathological cases, while the remaining 600 are from normal tissue samples, which allows physicians to contrast differences between both conditions. This makes a difference with respect to the previous data set, which only has pathological cases. This data set contains images acquired at different magnification levels, including 8×, 10× and 20×, and stored at 1280 × 1024 pixels in JPG format. Global annotations were assigned by a pathologist to highlight various tissue structures and relevant diagnostic information using a list of eighteen different terms. This collection has been used in previous histology image retrieval work [37,9].
3. Histology Atlas. This is the largest data set used in our study, with 2641 images illustrating biological structures of the four fundamental tissues in biology: connective, epithelial, muscular and nervous. Images of these tissues come from different organs of several young adult mice, where samples were stained using hematoxylin and eosin, and immunohistochemical techniques. These images are in different resolutions and magnification factors, and are organized in hierarchical annotations generated by pathologists and biologists, indicating the observed biological system and organs, giving a total of 46 different indexing terms. The resulting annotations include terms like circulatory system, heart, lymphatic system and thymus, among others. This data set has also been used in previous work at our lab [38] and is the only one currently available online free of charge.2

2 Dataset of 20,000 histology images at http://www.informed.unal.edu.co. Dataset of tissue types at http://168.176.61.90/histologyDS/.
Notice that in all cases, a single image can have several semantic terms associated with it. This is an important characteristic of real world databases, which do not split collections into disjoint sets, but rather allow images to have multiple annotations to describe several aspects or objects within the image.
Images are usually regions of interest with patterns observed in full tissue slides, which were focused and selected by the team of pathologists and biologists. They rarely include views of a complete tissue slide. The resulting image collections present variations in magnification and acquisition style, which are considered natural properties of large-scale, real world medical image collections. The acquisition process was not restricted to highly controlled or rigid image views, but instead encouraged spontaneous variability motivated by domain-specific interestingness, which may make the search process more challenging.
3.1. Data representation
3.1.1. Visual representation
A large variety of methods have been investigated to extract and represent visual characteristics in histology images. Be it for automated grading [15], classification [20] or image retrieval [7], two important features are usually modeled: color and texture. Color features exploit useful information associated with staining levels, which are natural bio-markers for pathologists. Texture features exploit regularities in biological structures, since tissues tend to follow homogeneous patterns. In this work, a bag-of-features representation is used, which has been shown to be a useful representation of histology images due to its ability to adaptively encode distributions of textures in an image collection.
We selected the Discrete Cosine Transform (DCT) computed at each RGB color channel as the local feature descriptor. A dictionary of 500 codeblocks is constructed using the k-means algorithm for each image collection separately. Then, a histogram of the distribution of these codeblocks is computed for each image. As a result, we have vectors in $\mathbb{R}^n$, with $n = 500$ visual features for each image. When appropriately trained, the dictionary is able to encode meaningful patterns that correlate with high-level concepts such as the size and density of nuclei, which may allow the system to distinguish important features such as magnification factor and tissue type. We refer the reader to the work of Cruz-Roa et al. [38] for more details about this histology image representation approach, which we followed closely in this work.
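A minimal sketch of this bag-of-features pipeline is shown below. The patch size, the non-overlapping sampling grid and the library choices are illustrative assumptions, not the paper's exact configuration; the authoritative details are in Cruz-Roa et al. [38].

```python
import numpy as np
from scipy.fftpack import dctn
from sklearn.cluster import KMeans

def extract_dct_features(image, patch=8):
    """Collect DCT coefficients of non-overlapping patches from each RGB channel.
    `patch` and the sampling grid are illustrative assumptions."""
    h, w, _ = image.shape
    descriptors = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            block = image[y:y + patch, x:x + patch, :]
            coeffs = [dctn(block[:, :, c], norm='ortho').ravel() for c in range(3)]
            descriptors.append(np.concatenate(coeffs))
    return np.array(descriptors)

def build_codebook(all_descriptors, k=500):
    """Learn the dictionary of codeblocks with k-means (k = 500 as in the paper)."""
    return KMeans(n_clusters=k, n_init=3, random_state=0).fit(all_descriptors)

def bag_of_features(image, codebook):
    """L1-normalized histogram over codeblock assignments: the visual vector in R^500."""
    words = codebook.predict(extract_dct_features(image))
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()
```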
3.1.2. Semantic representation
Likewise, semantic data is herein represented as a bag-of-words following a vector space strategy commonly used in information retrieval [39]. First, the dictionary of indexing terms is constructed using the list of available keywords. Then, assuming a dictionary with $m$ terms, each image is represented as a binary vector in $\mathbb{R}^m$, in which each dimension indicates whether the corresponding semantic term is assigned to the image. Using this representation, each image can have as many semantic terms assigned as needed. Also, the size of the semantic dictionary is not limited and can be easily extended.
Fig. 2. Sample images and annotations from the three histology data sets used in this work: (a) Cervical Cancer data set. (b) Basal Cell Carcinoma data set. (c) Histology Atlas data set. These sample images have been selected to illustrate the kind of contents and annotations available in each collection.
The number of different terms for each data set can be found in the last column of Table 1. None of the images in the three data sets have all annotations at the same time. Usually, a single image has between two and four semantic annotations assigned to it, depending on the data set. These keywords may co-occur in some cases, and they can also exclude each other in other cases. We do not exploit these term relationships explicitly since the bag-of-words representation has been adopted.
Notice that we use the term semantics to refer to semantic terms only. In this work, the relationships among terms are not explicitly considered through the use of ontologies or similar data structures. The use of semantics throughout the paper is intended to emphasize our goal of assigning high-level interpretations to low-level visual signals, which are not easily understood by computers in the same way as humans do. Smeulders et al. [4] named this condition the semantic gap, and many other studies thereafter have adopted similar uses of the term to refer to this problem [40].
Since both visual and semantic representations are vectors, a database of images can be represented with two matrices by stacking the corresponding vectors of visual and semantic features as columns of two matrices. The notation used in the following sections sets the matrix of visual data for a collection of $l$ images as $X_V \in \mathbb{R}^{n \times l}$, where $n$ is the number of visual patterns in the bag-of-features representation. The matrix of semantic terms for the same collection is $X_S \in \mathbb{R}^{m \times l}$, where $m$ is the number of keywords in the semantic dictionary.
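As a concrete illustration, the two matrices can be assembled as follows; this helper is a hypothetical sketch, with `visual_histograms`, `annotations` and `vocabulary` standing in for the actual data structures.

```python
import numpy as np

def build_data_matrices(visual_histograms, annotations, vocabulary):
    """Stack per-image vectors as columns: X_V in R^{n x l}, X_S in R^{m x l}.
    `annotations` is a list of term sets per image; `vocabulary` lists the m index terms."""
    X_V = np.stack(visual_histograms, axis=1)            # n x l visual matrix
    term_index = {t: j for j, t in enumerate(vocabulary)}
    X_S = np.zeros((len(vocabulary), len(annotations)))  # m x l binary semantic matrix
    for i, terms in enumerate(annotations):
        for t in terms:
            X_S[term_index[t], i] = 1.0
    return X_V, X_S
```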
3 The term "basis" is a slight abuse of terminology here, since the vectors in the matrix W are not necessarily linearly independent and the set of vectors may be redundant.
4. Multimodal fusion
The search method proposed in this work is based on a multimodal representation of images that combines visual features with semantic information. Fig. 3 presents an overview of the proposed approach, which comprises three sequential stages: (1) visual indexing, (2) semantic embedding and (3) multimodal fusion. Three image representations are obtained throughout the process: (1) visual features, (2) semantic features and (3) the proposed fused representation. The retrieval engine can be set up to search using any of the three representations. In the following subsections, we assume a visual and semantic data representation following the description of Section 3.1, and focus on describing components 2 and 3 of Fig. 3.

Fig. 3. Overview of the proposed fused representation. From the input image to the final fused representation, three main processes are carried out: visual indexing, semantic embedding and multimodal fusion. Each stage also produces the corresponding representation: only visual, only semantic and fused.
4.1. Semantic embedding
The goal of a semantic embedding is to learn the relationships between visual features and semantic terms, to generate a new representation of images based on high level concepts. The strategy proposed in this work is based on a matrix factorization algorithm, which allows the system to learn these relationships as a linear projection between the visual and semantic spaces. In this work, we adopt the notions of multimodal image indexing using Nonnegative Matrix Factorization (NMF) recently proposed in [30], and extend these ideas by introducing a direct semantic embedding. In the following sections, two strategies for modeling visual-to-semantic relationships are presented.
4.1.1. Latent semantic embedding
The first algorithm for semantic embedding is based on NMF, which allows us to extract structural information from a collection of data samples. For any input matrix $X \in \mathbb{R}^{n \times l}$, containing $l$ data samples with $n$ nonnegative features in its column vectors, NMF finds a low rank approximation of the data using non-negativity constraints:

$$X \approx WH, \qquad W, H \ge 0$$

where $W \in \mathbb{R}^{n \times r}$ is the basis3 of the vector space in which the data will be represented and $H \in \mathbb{R}^{r \times l}$ is the new data representation using $r$ factors. Both $W$ and $H$ are unknowns in this problem, and the decomposition can be found using alternating optimization. We use the divergence criterion to find an approximate solution with the multiplicative updating rules proposed by Lee and Seung [41].
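The following is a minimal numpy sketch of these multiplicative updates for the divergence criterion [41]; the random initialization scheme and iteration count are illustrative assumptions.

```python
import numpy as np

def nmf_kl(X, r, n_iter=200, eps=1e-10, seed=0):
    """Lee-Seung multiplicative updates minimizing the divergence D(X || WH) [41]."""
    n, l = X.shape
    rng = np.random.default_rng(seed)
    W = rng.random((n, r)) + eps
    H = rng.random((r, l)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (X / WH)) / (W.sum(axis=0)[:, None] + eps)   # update H
        WH = W @ H + eps
        W *= ((X / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)   # update W
    return W, H
```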
Given the matrix of visual features, $X_V$, and the matrix of semantic terms, $X_S$, we aim to find correlations between both. We model these relationships using a common latent space, in which both data modalities have to be projected. We employ the following two-stage approach to find semantic latent factors for image features and keywords:
1. Decompose $X_S$: The matrix of semantic terms is first decomposed using NMF to find a low rank approximation of semantic data. This step may be understood as finding correlations between various terms according to their joint occurrence on images. The result of this first step is a decomposition of the form $X_S = W_S H$, with matrix $W_S$ as the semantic projection and $H$ as the latent semantic representation.
2. Embed $X_V$: Find a projection function for visual features to embed data in the latent factor space. This is achieved by solving the equation $X_V = W_V H$ for $W_V$ only, while fixing the latent representation $H$ equal to the matrix found in the previous stage.
The correlations between visual features and semantic data are encoded through the matrices $W_V$, $W_S$ and $H$. The new latent representation is semantic by design, since the function to project visual features spans a latent space that has been originally formed by semantic data. We refer to this algorithm as NMF Asymmetric (NMFA) in our later discussions.
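A sketch of this two-stage procedure, reusing the `nmf_kl` helper from the previous sketch, could look as follows; initialization and iteration counts are again illustrative assumptions.

```python
import numpy as np

def nmfa(X_S, X_V, r, n_iter=200, eps=1e-10, seed=1):
    """Two-stage asymmetric embedding: X_S ~ W_S H, then solve X_V ~ W_V H with H fixed."""
    W_S, H = nmf_kl(X_S, r, n_iter)                   # stage 1: decompose semantics
    n = X_V.shape[0]
    W_V = np.random.default_rng(seed).random((n, r)) + eps
    for _ in range(n_iter):                           # stage 2: update W_V only, H fixed
        WH = W_V @ H + eps
        W_V *= ((X_V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W_S, W_V, H
```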
4.1.2. Direct semantic embedding
An alternative approach to model the relationships between visual features and semantic terms is to find a transformation function between both spaces directly. This problem is formulated as follows:

$$X_V \approx W X_S, \qquad W \ge 0 \qquad (1)$$

where $W \in \mathbb{R}^{n \times m}$ is a matrix that approximately embeds visual features in the space of semantic terms. Instead of extracting a latent factor structure from the data, this strategy fixes the latent encoding (matrix $H$) as the known semantic representation of images in the collection, $X_S$. This can be understood as requiring the latent factors to match exactly the semantic representation of images, resulting in a scheme for learning the structure of visual features that directly correlates with keywords.
To solve this problem, the divergence between the matrix of visual features and the embedded data is adopted as the objective function:

$$D(X_V \,\|\, W X_S) = \sum_{ij} \left( (X_V)_{ij} \log \frac{(X_V)_{ij}}{(W X_S)_{ij}} - (X_V)_{ij} + (W X_S)_{ij} \right) \qquad (2)$$

The goal is to minimize this divergence measurement on a set of training images, considering that this optimization problem is convex and can be solved efficiently following gradient descent or interior point strategies. In this work, the matrix $W$ is learned using the following multiplicative updating rule:

$$W_{ij} \leftarrow W_{ij} \frac{\sum_u (X_S)_{ju} (X_V)_{iu} / (W X_S)_{iu}}{\sum_v (X_S)_{jv}} \qquad (3)$$
This is a rescaled gradient descent approach that uses a data-dependent step size, following Lee and Seung's methodology for NMF [41]. Then, the solution is found by iteratively running this updating rule for $W$ for a number of iterations or until a fixed error reduction is reached. We refer to this algorithm as the Nonnegative Semantic Embedding (NSE) in our later discussions.
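In vectorized form, the update of Eq. (3) can be sketched as below; the stopping criterion (a fixed iteration count) is an assumption for illustration.

```python
import numpy as np

def nse(X_V, X_S, n_iter=200, eps=1e-10, seed=2):
    """Direct embedding: learn W with X_V ~ W X_S via the multiplicative rule of Eq. (3)."""
    n, m = X_V.shape[0], X_S.shape[0]
    W = np.random.default_rng(seed).random((n, m)) + eps
    denom = X_S.sum(axis=1)[None, :] + eps           # sum_v (X_S)_jv, per column j of W
    for _ in range(n_iter):
        WXS = W @ X_S + eps
        W *= ((X_V / WXS) @ X_S.T) / denom           # Eq. (3)
    return W
```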
4.1.3. Projecting unlabeled images
To recover the semantic representation of an image without keywords, we need to solve the following equation for $x_S$:

$$x_V \approx W x_S, \qquad x_S \ge 0 \qquad (4)$$

where $x_V$ is the observed vector of visual features and $W$ is a learned semantic embedding function. This formulation is also compatible with the latent semantic embedding, assuming $W = W_V$ and $x_S = h$. In both situations, the non-negativity constraint holds and the same procedure is followed. Thus, the problem in Eq. (4) can be formulated as minimizing the divergence between the observed visual data and its reconstruction from the semantic space. Regarding the non-negativity restriction, the solution can be efficiently approximated using the following multiplicative updating rule in an iterative fashion:

$$(x_S)_a \leftarrow (x_S)_a \frac{\sum_i W_{ia} (x_V)_i / (W x_S)_i}{\sum_k W_{ka}} \qquad (5)$$
Following this procedure, a semantic representation of new images is constructed.
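A sketch of this inference step, under the same illustrative assumptions (uniform nonnegative initialization, fixed iteration count):

```python
import numpy as np

def project_unlabeled(x_V, W, n_iter=200, eps=1e-10):
    """Infer x_S for an unannotated image by minimizing D(x_V || W x_S), Eq. (5)."""
    m = W.shape[1]
    x_S = np.full(m, 1.0 / m)                        # uniform start (assumption)
    denom = W.sum(axis=0) + eps                      # sum_k W_ka
    for _ in range(n_iter):
        recon = W @ x_S + eps
        x_S *= (W.T @ (x_V / recon)) / denom         # Eq. (5)
    return x_S
```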
4.2. Fusing visual and semantic content
So far, we have considered the visual and semantic strategies for image search. The first strategy is entirely based on visual features, to match for visually similar images. The second strategy is based on an inferred semantic representation, to match for semantically related images. In this section we introduce a third strategy, based on multimodal fusion. The main goal of this scheme is to combine visual features and semantic data together in the same image representation to exploit the best properties of each modality.
4.2.1. Fusion by back-projection
The proposed fusion strategy projects semantic data back to the visual feature space to make a convex combination of both visual and semantic representations, as illustrated in Fig. 4. This can be understood as an early fusion strategy, since the representations are merged before their subsequent use. Assuming a histogram of visual features $x_V$ and a vector of a predicted semantic representation, $x_S$, the fusion procedure generates a new image representation defined as:

$$x_f := \lambda x_V + (1 - \lambda) W x_S \qquad (6)$$
where $x_f \in \mathbb{R}^n$ is the vector of fused features in the visual space and $\lambda$ is the parameter of the convex combination that controls the relative importance of data modalities. This fusion approach takes the semantic representation of images and projects it back to the visual space using the reconstruction formula:

$$\hat{x}_V := W x_S \qquad (7)$$
This back-projection is a linear combination of the column vectors in $W$ using the semantic annotations as weights. In that way, the reconstructed vector $\hat{x}_V$ represents the set of visual features that an image should have according to the learned semantic relationships in the image collection. Therefore, $\hat{x}_V$ and $x_V$ highlight different visual structures of the same image, since $\hat{x}_V$ is a semantic approximation of the observed visual features, according to Eqs. (4) and (7). Notice that this extension can be applied to a latent semantic embedding using NMFA or to a direct semantic embedding using NSE. We refer to this extension using the suffix BP for Back-Projection (NMFA-BP or NSE-BP).

Fig. 4. Illustration of the fusion by back-projection procedure. The first step represents a query image in the visual feature space. Second, a semantic embedding is done to approximate high-level concepts for that image. Third, the semantic representation is projected back to the visual space and combined with the original visual representation.
4.2.2. Controlling modality importance
The parameter $\lambda$ in the convex combination of the fusion strategy (Eq. (6)) allows us to control the importance of each data modality. The problem of assigning more weight to one or the other modality mainly depends on the performance that each modality offers to solve queries. More specifically, it depends on how faithfully one modality represents the true content of an image. On the one hand, visual features may be inaccurate for representing high level semantic concepts, but good at representing low level visual arrangements. On the other hand, the semantic representation may be noisy or incomplete because of human errors or prediction discrepancies.
The parameter $\lambda$ is split into two different parameters to consider two kinds of images: database images and query images. For both kinds of images, the semantic representations are predicted by the learned model. For database images, the parameter $\lambda$ will be called $\alpha$, and for query images it will be called $\beta$ throughout the paper. This distinction allows us to control the importance of the semantic modality for new, unseen query images, taking into account that predicting an approximate semantic representation may have some inference noise. We evaluate the influence of these parameters in the following sections.
4.3. Histology image search
Indexing images by visual content in the retrieval system means that all searchable images in the collection, as well as all query images, are represented using the bag-of-features histogram, which is a non-parametric probability distribution of visual patterns. The latent semantic representation and the direct semantic representation are both nonnegative vectors that can be properly normalized by making their $\ell_1$-norm equal to 1, so the new values are interpreted as the probabilities of high-level semantic concepts for one image. In the case of the fused representation, features are once again represented in the visual feature space and the $\ell_1$ normalization is applied as well.
The retrieval system requires a similarity function to rank images in the collection by comparing them with the features of the query. The three representations discussed above can be considered as probability distributions, and the most natural way to compare these features is using a similarity measure appropriate for probability distributions. The histogram intersection is a measure for estimating the commonalities between two non-parametric probability distributions represented by histograms. It computes the common area between both histograms, obtaining a maximum value when both histograms are the same distribution and zero when there is nothing in common. The histogram intersection is defined as follows:

$$k_{\cap}(x, y) = \sum_{i=1}^{n} \min\{x_i, y_i\} \qquad (8)$$
where $x$ and $y$ are histograms and the subindex $i$ represents the $i$-th bin in each histogram of a total of $n$. This similarity measure has been shown to be a valid kernel function for machine learning applications [42], and has been successfully used in different computer vision tasks [43]. We adopt this similarity measure for image search in the three representation spaces evaluated in this work, which are the visual space, the semantic space and the fused space.
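A direct numpy sketch of this similarity and the resulting ranking step:

```python
import numpy as np

def histogram_intersection(x, y):
    """Histogram intersection similarity of Eq. (8) for L1-normalized vectors."""
    return np.minimum(x, y).sum()

def rank_database(query, database):
    """Indices of database items sorted by decreasing similarity to the query."""
    scores = np.array([histogram_intersection(query, d) for d in database])
    return np.argsort(-scores)
```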
5. Experiments and results
Experiments to evaluate retrieval performance were conducted on the three histology data sets described in Section 3. In our experimental evaluation, we focus on image search experiments under the query-by-example paradigm, in other words, the use of non-annotated query images to retrieve relevant database images. Our goal is to demonstrate the strengths and weaknesses of each of the three image representations: visual, semantic and multimodal.
5.1. Experimental protocol
5.1.1. Training, validation and test
We follow a training, validation and test experimental scheme to set up and evaluate algorithms. A sample with 20% of the images in each collection is separated as held out data for testing experiments. The other 80% of the images are used for training and validation experiments. This partition is made using stratified sampling over the distribution of semantic terms, to separate representative samples of the semantic concepts in the data set. The number of images in each data set and the number of images in the corresponding partitions is reported in Table 1. For training and validation, a 10-fold cross validation procedure was employed, and test experiments are conducted on held out data.
5.1.2. Performance measures
A single experiment in a particular data set consists of a simulated query, i.e., an example image taken from the test or validation sets, from which semantic annotations are known but hidden. Then, the ranking algorithm is run over all database images and the list of ranked results is evaluated.
The evaluation criterion adopted in our experiments is based on the assumption that a result is relevant to the query if both share one or more semantic terms. This assumption is reasonable under the query by example paradigm evaluated in this work. Since the system does not receive an explicit set of keywords, the query is highly ambiguous and the intention of the user may be implicit. It is even possible that the user is not completely aware of exactly what she is looking for. Therefore, the system does a good job if it retrieves images that are related to the query in at least one possible sense, helping the user to better understand image contents and supporting the decision making process.
We performed automated experiments by sending a query to the system and evaluating the relevance of the results. The quality of the results list is evaluated using information retrieval measures, mainly Mean Average Precision (MAP) and precision at the first ten results (P@10, or early precision) [39]. For computing these measurements, we used the trec_eval tool, available online.4 These two measurements are complementary views of the performance of a retrieval system, and we found in our experiments that there is a trade-off when trying to maximize both at the same time, as is discussed below.

4 http://trec.nist.gov/trec_eval/.
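For reference, the two measures can be computed per query as in the sketch below; the paper uses trec_eval, so this is only an illustrative re-implementation of the standard definitions.

```python
def average_precision(relevant, ranked):
    """Average precision for one query; `relevant` is a set of relevant item ids."""
    hits, total = 0, 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / k
    return total / max(len(relevant), 1)

def precision_at_10(relevant, ranked):
    """Fraction of relevant items among the first ten results."""
    return sum(1 for item in ranked[:10] if item in relevant) / 10.0
```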
5.2. Baselines
The first natural baseline for image search under the query-by-example paradigm is the performance of purely visual search, that is, when no effort is made to introduce semantic information into the search process. In this case, the histogram intersection similarity measure is used directly to match query features to similar visual content from the database. This baseline allows us to observe performance gains when using semantic information with a particular indexing algorithm.
We also consider late fusion as a second baseline, since it can combine data from two different data sources. Late fusion is a very popular strategy for combining multiple similarity measures in a retrieval system thanks to its simplicity. In particular, a simple score combination has been shown to be robust both in theory [44] and practice [45,46], performing better than other schemes such as rank combination, minimum, maximum, and other operators. We adopt the score combination strategy using a convex combination of visual and semantic similarities to produce a single score for each image with respect to the query. Both similarities are first normalized using a min–max procedure. In addition, we optimize the parameter of the convex combination to produce the best baseline possible.
In the experiments reported in this work, the semantic information used during late fusion is not provided by the user: it is automatically generated by the proposed methods. For any query image, we predict a semantic representation using the latent (NMFA) or direct (NSE) embeddings, and compute similarity scores with respect to database images. Also, similarity scores are computed using visual features independently. Then, both visual and semantic scores are combined to include the opinion of both views. Notice that this is fundamentally different from the proposed fusion by back-projection, since computing similarity scores in each space separately does not require any learning or modeling of multimodal relationships.
5.3. Trade-offs in retrieval performance
The first set of experiments focuses on performance evaluation for semantic and multimodal strategies. Results reported in this section correspond to experiments conducted on the training and validation sets, following a 10-fold cross-validation procedure. In the following subsections we describe our findings on the trade-off between MAP and P@10 when using semantic or visual search, and show how our algorithms can help to find a convenient balance.
5.3.1. Visual and semantic search
The NMFA and NSE algorithms, described in Sections 4.1.1 and 4.1.2 respectively, are used to project images to a semantic space spanned by terms. One of the main advantages of NSE is that it does not need any parameter tuning during learning or inference, so it can be directly deployed in a retrieval system on top of the visual representation without further effort. For NMFA, the number of latent factors has to be chosen with cross-validation. We compare the performance of these two semantic indexing mechanisms in Table 2 for our three data sets.
As a baseline strategy we use visual matching, which is based on visual features only to retrieve related images. We also compare against the expected performance of a random ranking strategy, which was estimated using a Monte Carlo simulation. The chance performance is significantly lower than the visual baseline in all three data sets.
Observe that the semantic embeddings always outperform the visual baseline by a significant amount in terms of MAP. This shows that the proposed embeddings learn to predict a good semantic representation of images. For the Cervical Cancer data set, the best embedding is NSE, while NMFA performs better on the Basal-cell Carcinoma and Histology Atlas data sets. This is consistent with the complexity of the semantic vocabularies of these data sets, since the Cervical Cancer set has a term dictionary of eight keywords, while the Basal-cell Carcinoma and Histology Atlas have 18 and 46 terms, respectively. It is therefore natural for a direct embedding to perform best with simple vocabularies, while the latent embedding is able to exploit complex correlations among many terms.
Notice also that none of the semantic embeddings were able to improve the performance of early precision with respect to the visual baseline, according to the results in Table 2. Early precision is the measure of how many relevant images are shown in the first top ten results, and in this case, visual matching seems to be a strong baseline. Our result is consistent with previous studies that show how the k-nearest neighbors algorithm in the visual space serves as a good baseline for image annotation [47]. This suggests that in the visual space, nearby images are more likely to have important correspondences of appearance patterns, which could result in some semantic relationship among them. This matching of visual patterns disappears when images are projected to the semantic space, and only dominant semantic concepts are preserved there.

Table 2
Retrieval performance of semantic strategies compared to visual matching. Results show the trade-off between MAP and P@10 on all image collections. Semantic search produces superior MAP while visual search is a strong baseline for early precision. Chance performance refers to the expected performance of a random ranking strategy, and it is significantly lower than visual and semantic search.

Method                     | Cervical Cancer  | Basal-cell C.    | Histology Atlas
                           |  P@10  |  MAP    |  P@10  |  MAP    |  P@10  |  MAP
Visual matching (baseline) | 0.5904 | 0.5214  | 0.4360 | 0.2928  | 0.7372 | 0.2751
Latent embedding (NMFA)    | 0.5067 | 0.6591  | 0.3176 | 0.4947  | 0.5263 | 0.6309
Direct embedding (NSE)     | 0.5414 | 0.6970  | 0.2543 | 0.4317  | 0.5230 | 0.6113
Chance performance         | 0.4623 | 0.4681  | 0.2183 | 0.1806  | 0.0978 | 0.1008
5.3.2. Multimodal fusion
To balance the trade-off between MAP and early precision, a multimodal fusion strategy may be used. The following experiments compare the ability of late fusion and early fusion by back-projection to balance the performance of semantic and visual representations. Both strategies allow us to control the relative importance of data modalities, using one parameter for database images and another for query images, as described in Section 4.2.
Fig. 5. Performance of multimodal fusion on the three histology image collections (panels: Pareto Frontier for the Cervical Cancer, Basal-cell Carcinoma and Histology Atlas data sets, each plotting P@10 against MAP). Results obtained during parameter search by varying α and β to generate the multimodal representation. Plots show the Pareto frontier on the performance space for all multimodal fusion strategies while visual and semantic performance are presented as points. Reported methods are: Nonnegative Semantic Embedding + Late Fusion (NSE-Late), NMF-Asymmetric + Late Fusion (NMFA-Late), NMF-Asymmetric + Backprojection (NMFABP), Nonnegative Semantic Embedding + Backprojection (NSEBP), Direct Visual Matching (Visual), Latent semantic indexing using NMF-Asymmetric (NMFA), and Direct semantic indexing using Nonnegative Semantic Embedding (NSE).
Each parameter is the weight of a convex combination between visual and semantic representations. The influence of these parameters was evaluated by producing fused representations with varying values between 0 and 1, with a step of 0.1, for $\alpha$ (database images) and $\beta$ (query images). When evaluating the performance of each $\alpha, \beta$ pair, MAP and early precision (P@10) are measured.
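This grid search can be sketched as follows, reusing the earlier helpers; the `evaluate` callback, assumed to return (MAP, P@10) for a set of rankings, is a hypothetical placeholder.

```python
import itertools
import numpy as np

def sweep_fusion_params(query_hists, db_hists, W, evaluate, step=0.1):
    """Grid search over alpha (database weight) and beta (query weight) of Eq. (6)."""
    grid = np.round(np.arange(0.0, 1.0 + step, step), 2)
    results = {}
    for alpha, beta in itertools.product(grid, grid):
        db = [fuse_backprojection(x, W, alpha) for x in db_hists]   # database side
        qs = [fuse_backprojection(q, W, beta) for q in query_hists] # query side
        rankings = [rank_database(q, db) for q in qs]
        results[(alpha, beta)] = evaluate(rankings)                 # (MAP, P@10)
    return results
```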
The parameter space was explored following this procedure onthe three data sets using 10-fold cross-validation, to fuse visualfeatures with the semantic representations obtained by the twoproposed semantic embeddings (NMFA and NSE). Also, we fol-lowed the same parameter search procedure for the late fusionbaseline, using both semantic embeddings as well. Fig. 5 presentsthe results of the multimodal fusion evaluation on the trainingset of the three histology image collections. These plots show theperformance space with MAP on the x-axis and early precision
Each point in this space corresponds to one configuration of the retrieval system, either with multimodal fusion or with one of the baselines. Performance is best when both measures are maximized simultaneously, and we only plot the results for (α, β) pairs on the Pareto frontier of each multimodal fusion strategy.
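A Pareto frontier of this kind can be extracted from the grid of (MAP, P@10) scores with a simple dominance filter. The sketch below is illustrative, assuming higher is better for both measures; the sample points are invented.

```python
import numpy as np

def pareto_frontier(points):
    """Return the (MAP, P@10) points not dominated by any other point;
    q dominates p when q >= p in both measures and q > p in at least one."""
    pts = np.asarray(points, dtype=float)
    keep = []
    for p in pts:
        dominated = np.any(np.all(pts >= p, axis=1) & np.any(pts > p, axis=1))
        if not dominated:
            keep.append(tuple(p))
    return keep

# Toy (MAP, P@10) scores from a parameter sweep; the last point is dominated.
print(pareto_frontier([(0.60, 0.52), (0.65, 0.50), (0.58, 0.55), (0.59, 0.51)]))
```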
First, notice the difference between visual and semantic search (NMFA and NSE) in all plots, and how these two points lie on opposite sides of the performance space, making their performance trade-off visible. The plots also show that all multimodal fusion strategies can be configured to produce intermediate results between the performance of visual and semantic search by giving more weight to one or the other data modality. These Pareto frontiers trace the path along the trade-off between visual and semantic search, showing that we cannot obtain a fused result that is better than both of them individually. However, the proposed strategy allows us to choose a good balance that preserves the best response of visual or semantic data as needed.
Finally, our proposed early fusion, based on back-projection of the semantic data, consistently outperforms the late fusion baseline, achieving improved performance in terms of both MAP and P@10. The benefits of back-projection over late fusion are clear in all three image collections by a large margin. Two characteristics make our strategy better than late fusion: first, during the back-projection step, we take advantage of the learned relationships between visual and semantic data, and explicitly encode these correspondences in a reconstructed vector (see Eqs. (6) and (7)). Second, we complement the visual representation with semantically reconstructed visual features, preserving the ability to match the original visual structures as well as semantic relations in the same space.
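Without restating Eqs. (6) and (7), the back-projection idea can be sketched as follows. The sketch substitutes a linear embedding learned by ridge regression for the paper's NMF-based embeddings, and maps annotations back to the visual space through a pseudo-inverse; it illustrates the principle under those stated assumptions, not the paper's exact formulation, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_vis, d_sem = 300, 100, 20
V = rng.random((n, d_vis))                         # visual features
Y = (rng.random((n, d_sem)) > 0.8).astype(float)   # binary annotations

# Stand-in semantic embedding T learned by ridge regression, so that
# V @ T approximates Y (the paper learns NMF-based embeddings instead).
lam = 1e-2
T = np.linalg.solve(V.T @ V + lam * np.eye(d_vis), V.T @ Y)  # d_vis x d_sem

# Back-projection: map annotations to "semantically reconstructed" visual
# features through the pseudo-inverse of T, kept nonnegative.
V_rec = np.clip(Y @ np.linalg.pinv(T), 0.0, None)

# Early fusion: complement the original visual features with the
# reconstruction; alpha controls the modality balance.
alpha = 0.6
V_fused = alpha * V + (1.0 - alpha) * V_rec
```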
5.4. Histology image search
This section presents the results of the retrieval experiments conducted on the test set, after fixing the best parameters on the training set for each model.
Fig. 6. Mean Average Precision (MAP) of all evaluated strategies on the three histology image collections. Bars are absolute MAP values, and percentages indicate the relative improvement with respect to the method with the lowest performance, which is the purely visual search in all cases. Semantic methods have the best overall performance, followed by the proposed early fusion methods. Note that the scale of performance has been set differently for each data set to highlight relative improvements. The overall tendency across data sets is best viewed in color. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
The results compare the performance of visual, semantic and multimodal search, along with the proposed baselines. We keep our attention on the two performance measures evaluated in the previous sections, to contrast the benefits of each approach. The evaluated methods can be grouped into the following broader strategies:
1. Visual matching: Only visual features are used to match query images and database images.
2. Semantic search: Images are projected and matched in a semantic space. Two semantic indexing methods are evaluated: Nonnegative Semantic Embedding (NSE in Section 4.1) and NMF-Asymmetric (NMFA in Section 4.1.1).
3. Early fusion: Visual and semantic features are combined in the same image representation. Here we evaluated the proposed back-projection strategy with the two semantic embeddings: NSE Back-Projection (NSE-BP) and NMFA Back-Projection (NMFA-BP).
4. Late fusion: Visual and semantic features are matched independently and combined during the ranking procedure by mixing their similarity measures (a minimal score-fusion sketch follows below). We evaluated a late fusion strategy between visual features and the predicted semantic features of NSE and NMFA.
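The late fusion of group 4 mixes similarity scores rather than representations. A minimal sketch, assuming cosine similarity per modality and an illustrative mixing weight gamma (names and toy data are not from the paper's code):

```python
import numpy as np

def cosine_sim(queries, database):
    """Pairwise cosine similarities between query and database rows."""
    qn = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    dn = database / np.linalg.norm(database, axis=1, keepdims=True)
    return qn @ dn.T

def late_fusion_rank(V_q, V_db, S_q, S_db, gamma=0.5):
    """Match each modality independently, then mix the two similarity
    matrices before ranking (score-level fusion)."""
    mixed = gamma * cosine_sim(V_q, V_db) + (1.0 - gamma) * cosine_sim(S_q, S_db)
    return np.argsort(-mixed, axis=1)

# Toy usage with random stand-ins for the two modalities.
rng = np.random.default_rng(2)
order = late_fusion_rank(rng.random((5, 30)), rng.random((80, 30)),
                         rng.random((5, 12)), rng.random((80, 12)))
```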
These four groups facilitate the interpretation of the results and allow meaningful comparisons between the different strategies. The results with respect to MAP are presented in Fig. 6 for each data set. This figure ranks the methods in decreasing order to compare relative gains in MAP. Notice that semantic methods occupy the top positions of the rankings, indicating that semantic indexing is good at optimizing the precision of an image retrieval system. Precision measured by MAP can also be read as the performance of image auto-annotation, i.e., how faithful the predicted semantic terms are for all images. The results therefore suggest that semantic embeddings are able to predict meaningful annotations for images, which are then matched correctly.
The second group (according to MAP) in the ranking of methods shown in Fig. 6 comprises the methods based on back-projection. The difference from the semantic methods is mainly due to the trade-off discussed in the previous section.
Fig. 7. Early precision (P@10) of all evaluated strategies on the three histology image collections. Bars are absolute P@10 values, and percentages indicate the relative improvement with respect to the method with the lowest performance, which is the pure semantic search in all cases. The proposed fusion approach presents the best performance for two data sets, improving over the visual baseline. In all cases, our proposed method outperforms late fusion. Note that the scale of performance has been set differently for each data set to highlight relative improvements. The overall tendency across data sets is best viewed in color. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
5 Semantic embeddings use visual information only to project images to the semantic space.
We selected fusion parameters mainly to improve early precision (P@10), since our goal is to balance the trade-off for image retrieval, and the first results are crucial for a good user experience. Finally, notice that our proposed approach for fusion consistently outperforms the late fusion of similarities in all three data sets.
As for early precision, Fig. 7 shows the relative differences among all evaluated methods. The ranking of methods has changed, leaving the semantic-based approaches at the bottom, even below the visual baseline. In two of the three data sets, the proposed back-projection scheme achieves the best performance, since we balanced the parameters during cross-validation to improve this measure. We wanted to bring more semantic information to the top of the ranked list of results, and our strategy proves to be effective at combining both data modalities in a single representation. The visual baseline is very strong in the case of the Histology Atlas data set, outperforming all other methods by a large margin. Nevertheless, fusion by back-projection offers a significant improvement over the original semantic representation, showing a good intermediate compromise. Our fusion methodology also shows important improvements over the late fusion baseline.
Some queries are illustrated in Fig. 8 along with the top nine results retrieved by three methods: visual matching, a semantic embedding and the multimodal representation. Queries are single image examples with no text descriptions. The visual ranking brings images that match features without any knowledge of their high-level interpretations, and thus sometimes fails to retrieve the correct results. The semantic embedding selected for each database corresponds to the one with the best performance on the test set according to MAP (NSE for Cervical Cancer and NMFA for Basal-cell Carcinoma and Histology Atlas). Results obtained by matching the representation in the semantic space are diverse and correspond to images with higher scores in the terms
predicted for the query. This strategy clearly does not consider visual information for ranking images,5 which results in large variations of appearance. The ranking produced by the fused representation can improve the retrieval performance of the response and also produces more visually consistent results, since the fusion takes place in the visual space. This shows how the proposed approach can effectively introduce semantic information into the visual representation to bring correct images that respect visual structures.
Finally, to give a more general sense of the benefits of each approach, we compare the methods with respect to their positions in the rankings of MAP and P@10 shown in Figs. 6 and 7. We use the average position of a method across the three histology image collections, and re-rank the methods to provide a unified comparison. Fig. 9 presents a visualization of the rankings with respect to MAP (on the x-axis) and P@10 (on the y-axis). An ideal method would sit at coordinate (1, 1), meaning it ranked first on both performance measures. This visualization also reveals the trade-off between visual and semantic representations, indicating that, on average, semantic methods rank first with respect to MAP.
Fusion methods rank in the intermediate positions, and fusion by back-projection ranks on average above late fusion with respect to both performance measures, providing an improved balance. Also, notice how NMFA-BP usually ranks ahead of the visual baseline in terms of P@10, and also in terms of MAP.
Fig. 8. Example queries at left with the top nine images retrieved by visual, semantic and fused methods. Green frames indicate relevant images, and red frames indicate incorrectly retrieved images. Notice that semantic methods (NSE, NMFA) produce results with large visual variations since no visual information is considered for ranking. The proposed fusion approach (NSE-BP, NMFA-BP) improves the precision of the results and also brings a set of more visually consistent images. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 9. Rankings of evaluated methods according to their average position of performance among the three data sets. The x-axis represents the average ranking with respect to MAP, and the y-axis represents the average ranking with respect to P@10. Points close to the origin have better performance.
Indeed, the results suggest that NMFA-BP, which consists of a latent semantic embedding with a corresponding back-projection and fusion, produces the multimodal image representation with the best compromise: it stands in the first position of the rankings for early precision while remaining close to the performance of the semantic embeddings with respect to MAP.
5.5. Discussion
5.5.1. Multimodal fusion
The proposed framework provides a learning-based tool for the fusion of visual content and semantic information in histology images. The method models cross-modal relationships through semantic embeddings, which have the interesting property of making the two modalities exchangeable from one space to the other. This property may be understood as a translation scheme between two languages that express the same concepts in different ways: one language is visual, communicating the optical details found in images, and the other is semantic, representing high-level interpretations of images. These two views of the same data are complementary and are fused to build a better image representation.
This paper presents an approach to the problem of histology image retrieval following a multimodal setup, the first of its kind reported in the literature. Previous work on semantic retrieval of histology images is mainly oriented to training classifiers that recognize biological structures in images [7,8,22]. That strategy can be understood as a translation from the visual space to the semantic space without the possibility of a translation in the opposite direction, and it is thus limited to a late fusion procedure only.
Experimental results in this work have shown that an exclusively semantic representation may lose information that is important for image search with example images. More importantly, our results show a consistent benefit of an early fusion strategy based on multimodal relationships over the popular late fusion approach, and the benefits of early fusion go beyond improved performance. A better image representation may be the starting point for other systems that take the multimodal representation as input to learn classifiers or to solve other, more complex tasks.
5.5.2. Query expansion effect
The main reason for studying the fusion of visual and semantic data is that they are complementary sources of information: while visual data tends to be ambiguous, semantic data tends to be very specific; and while visual data provides detailed appearance descriptions, semantic data gives no clues about how an image looks. So, depending on the fusion strategy, multimodal relationships become more useful for making decisions on the data. Our setup for image retrieval considers example images as queries. Since the visual content representation used in this work is based on a bag of features, an analogy with text vocabularies may help to explain the effects of multimodal fusion.
Visual features in the dictionary of codeblocks may be understood as visual words representing specific visual arrangements or configurations. One specific pattern is a low-level word that may have different meanings from a high-level, semantic perspective. This problem is known in natural language processing as polysemy, and it usually decreases retrieval precision, that is, the ability of the system to retrieve only relevant documents [48]. Also, different visual words may be related to the same high-level meaning, which is known as synonymy, and can reduce the ability of an information retrieval system to retrieve all relevant documents [48].
The experimental results in Section 5.4 are consistent with these definitions: a visual polysemy effect is observed when the retrieval system is based on visual features only (the lowest MAP score). On the other hand, a visual synonymy effect is observed when using semantic data, with a higher MAP score but lower early
precision (P@10). Thus, the back-projection of semantic data is able to disambiguate visual words by introducing other visual words semantically correlated with the query, thereby correcting the synonymy effect. It is in this context that the visual query expansion effect takes place. Moreover, when both modalities are combined, the polysemy effect can also be corrected if appropriate weights are assigned.
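To make the analogy concrete, the following toy sketch expands a bag-of-features query with co-occurring ("synonym") visual words. The co-occurrence-based correlation matrix is an illustrative stand-in for the semantic correlations that the back-projection actually learns; all data here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.poisson(0.5, (500, 64)).astype(float)  # bag-of-features counts

# Positive correlations between visual words, estimated from their
# co-occurrence across the collection (a stand-in for the learned
# visual-semantic relationships).
C = np.clip(np.corrcoef(X.T), 0.0, None)

q = np.zeros(64)
q[[3, 17]] = 1.0                      # query activates two visual words
q_expanded = 0.7 * q + 0.3 * (q @ C)  # add correlated "synonym" words
```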
5.5.3. Large semantic vocabularies
Previous work on histology image retrieval is mainly based on classifiers trained to recognize several biological structures [7,18,20,8,22,9]. Transferring these methodologies to real-world system implementations requires a significant tuning effort, since each classifier may have its own optimal configuration. The proposed method is a unified approach that integrates all semantic labels in a single matrix for learning multimodal relationships. This makes an implementation simpler and ready to scale up to new keywords, as long as corresponding example images are available.
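With a binary image-by-term label matrix of this kind, adding a keyword amounts to appending a column. A toy sketch with invented sizes and annotation positions, purely to illustrate the claimed extensibility:

```python
import numpy as np

# Binary label matrix: rows are images, columns are vocabulary terms.
Y = np.zeros((1000, 45))                 # e.g., 1000 images, 45 keywords
Y[[10, 42, 97], [0, 0, 3]] = 1.0         # a few expert annotations

# A new keyword is just a new column holding its example images; the
# multimodal model is then retrained on the extended matrix.
new_term = np.zeros((1000, 1))
new_term[[5, 11, 300]] = 1.0             # images annotated with the new term
Y = np.hstack([Y, new_term])             # vocabulary grows to 46 terms
```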
Our methods adapt to vocabularies of different sizes, as shown in the experimental evaluations, which included three histology databases with different numbers of images and associated keywords. The effort of introducing new semantic terms into our model is virtually zero. In fact, our experiments analyze histology images with the largest vocabulary reported so far. We believe that image retrieval systems have the potential to support clinical activities, and to achieve that, the underlying computational methods have to be very flexible and prepared to use semantic data as it is available in current management systems. This involves vocabularies with hundreds of medical terms and thousands of images, which can be easily handled by the methods proposed in this paper.
5.5.4. Histology image collections
Currently, digital pathology makes it possible to manage, share and preserve slides together with electronic health records, which are very important steps to modernize infrastructure and provide improved services. However, these systems can go beyond passive repositories of data to actually help with the organization, search, visualization and discovery of information buried in histology image collections. This potential could benefit diagnostic activities, as well as scientific research and academic training, and to realize it, new tools and methodologies have to be designed and evaluated.
Visual search technologies are among the most pervasive applications in daily life, and they could be seamlessly integrated into the practice of pathology, as long as these methodologies meet the requirements of such an endeavor. This work proposed building enhanced histology image representations from visual and semantic features to support effective retrieval systems. The resulting representation can also feed other automated analysis systems, which could be essential in medical imaging departments to support various decisions in clinical practice.
5.5.5. Other considerations
This paper has presented a study with experimental evidence in favor of an early fusion strategy. Even though the proposed algorithm for early fusion has shown improved performance, the final accuracy is still far from perfect and there are several opportunities for improvement, on both the technical and the experimental sides.
On the technical side, our early fusion algorithm may be understood as a procedure to learn an image representation given visual features and text annotations. The visual features were learned in an unsupervised way following a bag-of-features approach, which has limited capacity to encode very complex visual patterns. Learning more powerful visual features may help to improve
performance, as suggested by several recent works [49,50]. Also, even though our early fusion algorithm is simple and efficient, it still requires more computation per image than late fusion algorithms.
On the experimental side, one limitation of our study has been access to annotated data. We conducted experiments on three data sets of small to medium size. An indexing method like the one proposed in this paper could benefit from more data, which is difficult to collect from real medical cases and has restricted use in research and even in practical settings. We have shared part of the data collections used in this work in the hope that other researchers may benefit from open, high-quality histology images, and we keep looking for opportunities to access more sources of information, both in the community and within our own institutions.
6. Conclusions
This work presented a framework to build histology image representations that combine visual and semantic features, following a novel early fusion approach. The proposed method learns the relationships between both data modalities and uses that model to project semantic information back to the visual space, in which the fused representation is built. The resulting multimodal representation is used in an image search system that matches potential results using a similarity measure; however, its use can be extended to other histology image analysis tasks, such as classification or clustering.
The experimental evaluation conducted in this work included three histology image collections with various sizes and numbers of text terms, demonstrating the potential of the proposed multimodal indexing methods under different conditions. We observed a trade-off between optimizing MAP and early precision when using either a semantic or a visual representation, which is mainly explained by the complementary nature of the two data modalities. The proposed multimodal fusion approach is an effective strategy to balance this trade-off and to improve the quality of image representations. Our methods consistently outperformed the visual matching and late fusion baselines in the image retrieval task, providing the best balance between visual and semantic search.
We observed that, overall, semantic search strategies are very good at maximizing MAP, and our proposed strategies for early fusion can incorporate more visual information into the search process at the cost of small reductions in MAP. Fusion methods still require further investigation into how to better exploit visual features to satisfy the visual consistency or visual diversity criteria demanded by potential users, without decreasing the semantic meaningfulness of the retrieved results. Semantic-based indexing can also be exploited using keyword-based search, instead of query-by-visual-example, which was the main search paradigm evaluated in this work. Keyword-based search may also be executed on a multimodal index, since by definition it contains both information modalities: visual and semantic.
Further potential research directions include the application of this representation to other image analysis tasks, such as image classification and automated grading. Also, since the formulation of our method can handle arbitrarily large semantic vocabularies, we are interested in extending its applicability to large-scale biomedical image collections.
We make an argument in favor of multimodal indexing, not only because of its potential to significantly improve relative performance, as we have shown in this paper, but also because this strategy can model different user interaction mechanisms, which could be adapted according to real needs.
Nevertheless, an additional intriguing question beyond indexing mechanisms is: what is the minimum required performance for image search technologies in a real clinical setting? The impact that an image retrieval system might have on health care is promising [13], but it will require more coordinated and collaborative efforts to be widely adopted. Machine learning tools are capable of empowering clinicians with timely and relevant information to make evidence-based decisions, which may result in improved quality of care for patients. This is currently a driving force for a large body of research.
Acknowledgements
The authors would like to thank the anonymous reviewers for their constructive comments, which helped to improve and clarify this manuscript. This work was partially funded by the LACCIR-Microsoft project "Multimodal Image Retrieval to Support Medical Case-Based Scientific Literature Search".
References
[1] Kragel P, Kragel P. Digital microscopy: a survey to examine patterns of use and technology standards. In: Proceedings of the IASTED international conference on telehealth/assistive technologies. Anaheim (CA, USA): ACTA Press; 2008. p. 195–7.
[2] Müller H, Michoux N, Bandon D, Geissbuhler A. A review of content-based image retrieval systems in medical applications – clinical benefits and future directions. Int J Med Inf 2004;73(1):1–23.
[3] Datta R, Joshi D, Li J, Wang JZ. Image retrieval: ideas, influences, and trends of the new age. ACM Comput Surv 2008;40(2):1–60.
[4] Smeulders AW, Worring M, Santini S, Gupta A, Jain R. Content-based image retrieval at the end of the early years. IEEE Trans Pattern Anal Mach Intell 2000;22(12):1349–80.
[5] Barnard K, Duygulu P, Forsyth D, de Freitas N, Blei DM, Jordan MI. Matching words and pictures. J Mach Learn Res 2003;3:1107–35.
[6] Rasiwasia N, Moreno PJ, Vasconcelos N. Bridging the gap: query by semantic example. IEEE Trans Multimedia 2007;9(5):923–38.
[7] Tang HL, Hanka R, Ip HHS. Histological image retrieval based on semantic content analysis. IEEE Trans Inf Technol Biomed 2003;7(1):26–36.
[8] Naik J, Doyle S, Basavanhally A, Ganesan S, Feldman MD, Tomaszewski JE, et al. A boosted distance metric: application to content based image retrieval and classification of digitized histopathology. SPIE Med Imag: Comput-Aided Diagn 2009;7260:72603F1–12.
[9] Caicedo JC, Romero E, González FA. Content-based histopathology image retrieval using a kernel-based semantic annotation framework. J Biomed Inf 2011;44:519–28.
[10] Atrey PK, Hossain MA, El Saddik A, Kankanhalli MS. Multimodal fusion for multimedia analysis: a survey. Multimedia Syst 2010;16(6):345–79.
[11] La Cascia M, Sethi S, Sclaroff S. Combining textual and visual cues for content-based image retrieval on the world wide web. In: Proceedings of the IEEE workshop on content-based access of image and video libraries; 1998. p. 24–8.
[12] Nuray R, Can F. Automatic ranking of information retrieval systems using data fusion. Inf Process Manage 2006;42(3):595–614.
[13] Marchiori A. Automated storage and retrieval of thin-section CT images to assist diagnosis: system description and preliminary assessment. Radiology 2003;228:265–70.
[14] Bonnet N. Some trends in microscope image processing. Micron 2004;35(8):635–53.
[15] Doyle S, Hwang M, Shah K, Madabhushi A, Feldman M, Tomaszeweski J. Automated grading of prostate cancer using architectural and textural image features. In: 4th IEEE international symposium on biomedical imaging: from nano to macro; 2007. p. 1284–7.
[16] Zheng L, Wetzel AW, Gilbertson J, Becich MJ. Design and analysis of a content-based pathology image retrieval system. IEEE Trans Inf Technol Biomed 2003;7(4):249–55.
[17] Caicedo JC, Gonzalez FA, Romero E. A semantic content-based retrieval method for histopathology images. Inf Retriev Technol LNCS 2008;4993:51–60.
[18] Orlov N, Shamir L, Macura T, Johnston J, Eckley DM, Goldberg IG. WND-CHARM: multi-purpose image classification using compound image transforms. Pattern Recogn Lett 2008;29(11):1684–93.
[19] Tambasco M, Costello BM, Kouznetsov A, Yau A, Magliocco AM. Quantifying the architectural complexity of microscopic images of histology specimens. Micron 2009;40(4):486–94.
[20] Caicedo JC, Cruz A, Gonzalez FA. Histopathology image classification using bag of features and kernel functions. In: Artif Intell Med. Springer; 2009. p. 126–35.
[21] Mosaliganti K, Janoos F, Irfanoglu O, Ridgway R, Machiraju R, Huang K, et al. Tensor classification of N-point correlation function features for histology tissue segmentation. Med Image Anal 2009;13(1):156–66.
[22] Meng T, Lin L, Shyu M-L, Chen S-C. Histology image classification using supervised classification and multimodal fusion. In: 2010 IEEE international symposium on multimedia. IEEE; 2010. p. 145–52.
[23] Müller H, Kalpathy-Cramer J. The ImageCLEF medical retrieval task at ICPR 2010. In: Proceedings of the 20th international conference on pattern recognition; 2010. p. 3284–7.
[24] Kalpathy-Cramer J, Hersh W. Multimodal medical image retrieval: image categorization to improve search precision. In: Proceedings of the international conference on multimedia information retrieval. ACM; 2010. p. 165–74.
[25] Rahman M, Antani S, Thoma G. A learning-based similarity fusion and filtering approach for biomedical image retrieval using SVM classification and relevance feedback. IEEE Trans Inf Technol Biomed 2011;15(4):640–6.
[26] Müller H, Deselaers T, Deserno T, Clough P, Kim E, Hersh W. Overview of the ImageCLEFmed 2006 medical retrieval and medical annotation tasks. In: Evaluation of multilingual and multi-modal information retrieval. Springer; 2007. p. 595–608.
[27] Müller H, Eggel I, Bedrick S, Radhouani S, Bakke B, Kahn Jr. C, et al. Overview of the CLEF 2009 medical image retrieval track. In: Cross Language Evaluation Forum (CLEF) working notes; 2009.
[28] de Herrera AGS, Kalpathy-Cramer J, Demner-Fushman D, Antani S, Müller H. Overview of the ImageCLEF 2013 medical tasks. In: Working notes of CLEF; 2013.
[29] Müller H, de Herrera AGS, Kalpathy-Cramer J, Demner-Fushman D, Antani S, Eggel I. Overview of the ImageCLEF 2012 medical image retrieval and classification tasks. In: CLEF (online working notes/labs/workshop); 2012.
[30] Caicedo JC, BenAbdallah J, González FA, Nasraoui O. Multimodal representation, indexing, automated annotation and retrieval of image collections via non-negative matrix factorization. Neurocomputing 2012;76(1):50–60.
[31] Fan J, Gao Y, Luo H, Keim DA, Li Z. A novel approach to enable semantic and visual image summarization for exploratory image search. In: Proceedings of the 1st ACM international conference on multimedia information retrieval. ACM; 2008. p. 358–65.
[32] Romberg S, Lienhart R, Hörster E. Multimodal image retrieval. Int J Multimedia Inf Retriev 2012;1(1):31–44.
[33] Putthividhy D, Attias HT, Nagarajan SS. Topic regression multi-modal latent Dirichlet allocation for image annotation. In: 2010 IEEE conference on computer vision and pattern recognition. IEEE; 2010. p. 3408–15.
[34] Rusu M, Wang H, Golden T, Gow A, Madabhushi A. Multiscale multimodal fusion of histological and MRI lung volumes for characterization of lung inflammation. In: SPIE medical imaging. International Society for Optics and Photonics; 2013. p. 86720X.
[35] Meng T, Lin L, Shyu M-L, Chen S-C. Histology image classification using supervised classification and multimodal fusion. In: 2010 IEEE international symposium on multimedia. IEEE; 2010. p. 145–52.
[36] Vanegas JA, Caicedo JC, González FA, Romero E. Histology image indexing using a non-negative semantic embedding. In: Proceedings of the second MICCAI international conference on medical content-based retrieval for clinical decision support. LNCS, vol. 7075; 2012. p. 80–91 [chapter 8].
[37] Caicedo JC, Gonzalez FA, Triana E, Romero E. Design of a medical image database with content-based retrieval capabilities. Adv Image Video Technol LNCS 2007;4872:919–31.
[38] Cruz-Roa A, Caicedo JC, González FA. Visual pattern mining in histology image collections using bag of features. Artif Intell Med 2011;52(2):91–106.
[39] Manning CD, Raghavan P, Schütze H. Introduction to information retrieval. Cambridge University Press; 2008.
[40] Hare JS, Samangooei S, Lewis PH, Nixon MS. Semantic spaces revisited: investigating the performance of auto-annotation and semantic retrieval using semantic spaces. In: Proceedings of the 2008 international conference on content-based image and video retrieval. New York (NY, USA): ACM; 2008. p. 359–68.
[41] Lee DD, Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature 1999;401(6755):788–91.
[42] Barla A, Odone F, Verri A. Histogram intersection kernel for image classification. In: Proceedings of the international conference on image processing, vol. 3; 2003. p. 513–16.
[43] Grauman K, Darrell T. The pyramid match kernel: discriminative classification with sets of image features. In: Tenth IEEE international conference on computer vision, vol. 2; 2005.
[44] Hsu DF, Taksa I. Comparing rank and score combination methods for data fusion in information retrieval. Inf Retriev 2005;8(3):449–80.
[45] Mc Donald K, Smeaton AF. A comparison of score, rank and probability-based fusion methods for video shot retrieval. In: Image and video retrieval. Springer; 2005. p. 61–70.
[46] Lee JH. Analyses of multiple evidence combination. In: ACM SIGIR conference. ACM; 1997. p. 267–76.
[47] Makadia A, Pavlovic V, Kumar S. A new baseline for image annotation. In: Proceedings of the 10th European conference on computer vision. Berlin, Heidelberg: Springer-Verlag; 2008. p. 316–29.
[48] Carpineto C, Romano G. A survey of automatic query expansion in information retrieval. ACM Comput Surv 2012;44(1):1–50.
[49] Cruz-Roa A, Arevalo JE, Madabhushi A, González FA. A deep learning architecture for image representation, visual interpretability and automated basal-cell carcinoma cancer detection. In: Medical image computing and computer-assisted intervention – MICCAI 2013. Springer; 2013. p. 403–10.
[50] Wang H, Cruz-Roa A, Basavanhally A, Gilmore H, Shih N, Feldman M, et al. Cascaded ensemble of convolutional neural networks and handcrafted features for mitosis detection; 2014.