
Histology image search using multimodal fusion


Juan C. Caicedo a,⇑,1, Jorge A. Vanegas b, Fabian Paez b, Fabio A. González b

a University of Illinois at Urbana-Champaign, IL, USA
b MindLab Research Laboratory, Universidad Nacional de Colombia, Bogotá, Colombia

⇑ Corresponding author. Current address: Computer Science Department, University of Illinois at Urbana-Champaign, 201 N Goodwin Ave, Urbana, IL 61801, USA. E-mail addresses: [email protected] (J.C. Caicedo), [email protected] (J.A. Vanegas), [email protected] (F. Paez), [email protected] (F.A. González).
1 Work done while at Universidad Nacional de Colombia.

ARTICLE INFO

Article history:
Received 18 June 2013
Accepted 30 April 2014
Available online xxxx

Keywords:
Histology
Digital pathology
Image search
Multimodal fusion
Visual representation
Semantic spaces

ABSTRACT

This work proposes a histology image indexing strategy based on multimodal representations obtained from the combination of visual features and associated semantic annotations. Both data modalities are complementary information sources for an image retrieval system, since visual features lack explicit semantic information and semantic terms do not usually describe the visual appearance of images. The paper proposes a novel strategy to build a fused image representation using matrix factorization algorithms and data reconstruction principles to generate a set of multimodal features. The methodology can seamlessly recover the multimodal representation of images without semantic annotations, allowing us to index new images using visual features only, and also accepting single example images as queries. Experimental evaluations on three different histology image data sets show that our strategy is a simple, yet effective approach to building multimodal representations for histology image search, and outperforms the response of the popular late fusion approach to combine information.

© 2014 Published by Elsevier Inc.


1. Introduction

Digital pathology makes it easy to exchange histology images and enables pathologists to rapidly study multiple samples from different cases without having to unpack the glass [1]. The increasing adoption of digital repositories for microscopy images results in large databases with thousands of records, which may be useful for supporting the decision making process in clinical and research activities. However, in modern hospitals and health care centers, the number of images to keep track of is beyond the ability of any specialist. A very promising direction to realize the potential of these collections is through efficient and effective tools for image search. For instance, when a new slide is being observed, a camera coupled to the microscope can capture the current view, send the picture to the retrieval system, and show results on a connected computer. These results can help to clarify structures in the observed image, explore previous cases and, in general, may allow clinicians and researchers to explore large collections of records previously evaluated and diagnosed by other physicians.


The query-by-example paradigm for image search, in which the user's query is an example image with no annotations, has a number of potential applications in medicine and clinical activities [2]. The main challenge when implementing such a system consists of correctly defining the matching criteria between query images and database images. The standard approach for content-based retrieval in image collections relies on using similarity measures between low-level visual features to perform a nearest-neighbor search [3]. The problem with this approach is that these characteristics usually fail to capture the high-level semantics of images, a problem known as the semantic gap [4]. Different methods to bridge this gap have been proposed to build a model that connects low-level features with high-level semantic content, such as automatic image annotation [5] and query by semantic example [6]. These methods represent images in a semantic space spanned by keywords, so a nearest neighbors search in that space retrieves semantically related images. Approaches like these have also been investigated for histology image search [7–9].

Image search systems based on a semantic representation have been shown to outperform purely visual search systems in terms of Mean Average Precision (MAP) [3]. However, these approaches may lose the notion of visual similarity among images since the search process ends up relying entirely on high-level descriptions of images. The ranking of search results is based on potentially relevant keywords, ignoring useful appearance clues that are not described by index terms. In a clinical setting, visual information plays an important role for searching histology images, which ultimately reveals the biological evidence for the decision making process in clinical activities. We consider that both visual content and semantic data are complementary sources of information that may be combined to produce high quality search results.

Multimodal fusion has emerged as a very useful approach to combine different signals with the purpose of making certain semantic decisions in automated systems. We refer the reader to [10] for a comprehensive survey of multimodal fusion in various multimedia applications. For image indexing in particular, multimodal fusion consists of combining visual and semantic data. Several methodologies have recently been proposed to model the relationships between these two data modalities, with the goal of constructing better image search systems. Two main strategies may be identified to achieve the combination of both data modalities: (1) early fusion [11], to build a combined representation of images before the ranking procedure, and (2) late fusion [12], to combine similarity measures during the ranking procedure. One of the advantages of early fusion over late fusion is that the former often benefits from explicitly modeling the relationships between the two data modalities, instead of simply using them as separate opinions. However, this clearly requires a significant effort in understanding and extracting multimodal correspondences.

In this work, we propose a novel method for indexing histology images using an early multimodal fusion approach, that is, combining the two data modalities in a single representation to generate the ranking directly in such a space. The proposed method uses semantic annotations as an additional data source that represents images in a vector space model. Then, matrix-factorization-based algorithms are used to find the relationships between data modalities, by learning a function that projects visual data to the semantic space and the other way around. We take advantage of this property by fusing both data modalities in the same vector space, obtaining as a result the combined representation of images.

A systematic experimental evaluation was conducted on three different histology image databases. Our goal is to validate the potential of various image search techniques to understand the strengths and weaknesses of visual, semantic and multimodal indexing in histology image collections. We focus our evaluation on two performance measures commonly used for information retrieval research: Mean Average Precision (MAP), and precision at the first 10 results of the ranked list (P@10), for early precision. We observed that semantic approaches are very good at maximizing MAP, while visual search is a strong baseline for P@10, revealing a trade-off in performance when using one or the other representation. This also confirms the importance of combining both data modalities.

Our approach combines multimodal data using a convex combination of the visual and semantic information, resulting in a continuous spectrum of multimodal representations and allowing us to explore various mixes from purely visual to purely semantic representations as needed. This is similar in spirit to late fusion, which allows the setting of weights to scores produced by each modality. However, our study shows significant improvement in performance when building an explicitly fused representation, instead of considering modalities as separate voters for the rank of images. We also found that multimodal fusion can balance a trade-off between maximizing MAP and early precision, demonstrating the potential to improve the response of histology image retrieval systems.

Fig. 1. Overview of the image search pipeline. Images acquired in a clinical setting are used as example queries. The system processes and matches queries with entries in a multimodal index, which represent images and text in the database. Results are returned to support the decision making process.

1.1. Overview

This work proposes an indexing technique for image search, using both visual image content and associated semantic terms. Fig. 1 illustrates a pipeline for image search in a clinical setting, which involves a physician or expert pathologist working with microscopy equipment with digital image acquisition capabilities or in a virtual microscopy system. Through an interactive mechanism, the user can ask the system to take a picture of the current view and send a query to the image search system. The system has a pre-computed fused representation of the images in the database. A ranking algorithm is used to identify the most relevant results in the database, which are retrieved and presented to the user.

The main goal of the system is to support clinicians during the decision making process by providing relevant associated information. The ability to find related cases among past records in a database has the potential to improve the quality of health care using an evidence-based reasoning approach. Historic archives in a hospital comprise a knowledge base reflecting its institutional experience and expertise, and they can be used to enhance daily medical practice.

This paper focuses on two important aspects of the entire pipeline: (1) strategies for constructing the index based on a multimodal fused representation and (2) an empirical evaluation of different strategies for histology image search using collections of real diagnostic images. The main contribution of our work is a novel method for combining visual and semantic data in a fused image representation, using a computationally efficient strategy that outperforms the popular late-fusion approach and balances the trade-off between visual and semantic data. While the applicability of the proposed model may extend to general image collections beyond histology images, the second contribution of this work is an extensive evaluation on histology images, since a straightforward application of image retrieval techniques may not result in an optimal outcome. Part of our experimental evaluation shows that off-the-shelf indexing methods such as latent semantic indexing and late fusion do not always exploit specific characteristics of histology images.


Another question related to the use of a system like this in a real environment is what the impact of search results would be on the medical practice itself and on the outcome of such decisions for the quality of health care for patients. We believe this question is both very interesting in nature and quite important to investigate. To our knowledge, a formal study of this problem has not yet been conducted in the histology domain, and it is also beyond the scope of this paper. However, other studies in radiology have shown improved decisions in the final diagnosis made by inexperienced physicians when they use image retrieval technologies [13].

The contents of this paper are organized as follows: Section 2 discusses relevant related work in histology image retrieval. The three histology image data sets used for experimental evaluations are presented in Section 3. Section 4 introduces the proposed algorithms and methods for multimodal fusion. The experimental evaluation and results are presented in Section 5. Finally, Section 6 summarizes and presents the concluding remarks.


2. Previous work

The automatic analysis of histology images is an important and growing research field that comprises different purposes and techniques. From image classification [14] to automatic pathology grading [15], the large amount of microscopy images in medicine may benefit from automated methods that allow users to manage visual collections for supporting the decision making process in clinical practice. This work is primarily focused on image search and retrieval technologies, which serve as a mechanism to find relevant and useful histology images from an available database.

2.1. Content-based medical image retrieval

Early studies of content-based medical image retrieval were reviewed by Müller et al. [2]. One of the first systems for histology image retrieval was reported by Zheng et al. [16], which uses low-level visual features to discriminate between various pathological samples. The use of low-level features was quickly recognized to have limitations for distinguishing among complex semantic arrangements in histology images, so researchers proposed the semantic analysis of histology slides to build image search systems. Later, a system based on artificial neural networks that learned to recognize twenty concepts of gastro-intestinal tissues on digital slides was presented by Tang et al. [7]. The system allowed clinicians to query using specific regions on example histology images. However, important efforts to collect labeled examples were required, since the design of these learning algorithms needed local annotations, a procedure that might be very expensive.

Relatively few works continued the effort of designing semantic image search systems for histology images; these include the work of Naik et al. [8] on breast tissue samples and Caicedo et al. [17] on skin cancer slides. However, the task of histology image classification has been actively explored in various microscopy domains [18–22], which is related to semantic retrieval. The primary purpose of these methods is to assign correct labels to images, which differs from the problem of building a multimodal representation. Besides, the transformation from visual content to strict semantic keywords may lead to a loss of useful visual information for a search engine, since images are summarized in a few keywords and visual details are not considered anymore.

2.2. Multimodal fusion for medical image retrieval

Multimodal retrieval has been approached in the medical imaging domain to find useful images in academic journal repositories and biomedical image collections, by combining captions along with visual characteristics to find relevant results using late fusion [23,24]. However, these strategies assume that the user's query is composed of example images as well as a text description. If users only provide example images, because they do not know precise terms or just because of other practical reasons, the system does not have any other choice than to match purely visual content. Relevant work by Rahman et al. [25] uses fusion techniques for biomedical image retrieval by combining multiple features and scores of classifiers on the fly, allowing users to interact and provide relevance feedback to refine results.

For ten years, the ImageCLEFmed community also dedicated efforts to studying the problem of medical image retrieval using multimodal data around an academic challenge [26–29]. Each year, various research groups obtained a copy of a single collection of medical images that includes almost all medical imaging techniques at once, such as X-rays, PET, MRI, CT and microscopy. The goal was to index its contents and provide answers for specific queries composed of example images and text. Multimodal fusion was a central idea in the development of solutions in this challenge, and late fusion has been reported as one of the most robust techniques.

Our work is focused on combining semantic terms and visual features in a fused image representation (based on early fusion principles), which can be used for image retrieval or as input for other classifiers and systems. An important component of the proposed strategy is its ability to use the same representation for images that do not have text annotations. In that way, the system can handle example image queries as well as database images without semantic meta-data. In this work, we build on top of Nonnegative Matrix Factorization algorithms recently proposed to find relationships in multimodal image collections [30]. We extend these ideas to propose a novel algorithm for fusing multimodal information in histology image databases with an arbitrary number of terms.

Other studies of multimodal fusion for multimedia retrieval have been conducted recently. A strategy to summarize and browse large collections of Flickr images using text and visual information was presented by Fan et al. [31]. However, they learn latent topics for text independently of visual features, so multimodal relationships are not modeled or extracted. The use of latent topic models has been extended to explicitly find relationships between visual features and text terms using probabilistic latent semantic indexing [32] and very rich graphical models [33]. In this work, we formulate the problem of extracting multimodal relationships as a subspace learning problem, which generates multimodal representations using vector operations.

Recent studies of multimodal fusion for histology images deal with the problems of combining different imaging modalities (such as MRI images and microscopy images) [34] or combining decisions made at different regions of the same image [35]. To our knowledge, our work is the first study of multimodal fusion of semantic and visual data specifically oriented to histology image retrieval. We reported promising experimental results in our previous work [36], and this paper extends that evaluation in substantial ways. First, the notion of multimodal fusion by back-projection is introduced for the first time, which allows us to effectively combine visual and semantic representations for histology image indexing. Second, a more comprehensive experimentation was carried out, using three different data sets, additional evaluations and extended discussions.

3. Histology image collections

Three different histology image collections were selected as case studies for this work. The first two are from pathology cases with corresponding diagnoses and descriptions. They were collected as part of different long-term projects in the Pathology and Biology departments, with the collaboration of several experts and graduate students from the Medicine School of Universidad Nacional de Colombia, in Bogotá. The data sets were collected and annotated by expert pathologists, extracting information from a larger database of real cases. In general, these collections of images have been annotated by several individuals, who agree on the results after discussions in a committee-like process. The efforts of collecting these annotations have been oriented to creating index terms that preserve information related to cases, which could be accessible through an information retrieval system. These cases were anonymized to remove any information related to patients, and only data associated with the diagnosis and description of images were preserved.

The third data set is part of a histology atlas containing images from the four fundamental tissues of living beings. These images were collected and labeled by researchers in the Biology Department, to provide students with high quality reference material in digital format. Table 1 presents basic statistics of these collections and Fig. 2 shows some example images for each data set. More details about each data set are presented below.

Table 1. Number of images and semantic terms on each data set.

Data set                Images    Training    Query    Terms
Cervical Cancer            530         447       54        8
Basal-cell Carcinoma      1502        1201      301       18
Histology Atlas           2641        2113      528       46

1. Cervical Cancer. This data set, with 530 images from more than 120 cases, characterizes various conditions and stages of cervical cancer. Images in this collection were acquired by a medical resident and validated by an expert pathologist from tissue samples stained with hematoxylin and eosin. Images were captured at 40× magnification with controlled lighting conditions, and from each slide, an average of 4.5 sub-regions of 3840 × 3072 pixels were selected. Each image preserves as metadata the case number to which it belongs and a list of global annotations. Annotations span eight different categories including relevant diagnostic information and other tissue characteristics. This list of categories includes: cervicitis inflammatory pathology, intraepithelial lesion, squamous cell carcinoma, and metaplasia, among others.

2. Basal-cell Carcinoma. This collection has 1502 images of skin samples stained with hematoxylin and eosin, used to diagnose cancer from a collection of more than 300 cases. About 900 of these images correspond to pathological cases, while the remaining 600 are from normal tissue samples, which allows physicians to contrast differences between both conditions. This is a difference with respect to the previous data set, which only has pathological cases. This data set contains images acquired at different magnification levels, including 8×, 10× and 20×, and stored at 1280 × 1024 pixels in JPG format. Global annotations were assigned by a pathologist to highlight various tissue structures and relevant diagnostic information using a list of eighteen different terms. This collection has been used in previous histology image retrieval work [37,9].

3. Histology Atlas. This is the largest data set used in our study, with 2641 images illustrating biological structures of the four fundamental tissues in biology: connective, epithelial, muscular and nervous. Images of these tissues come from different organs of several young-adult mice, where samples were stained using hematoxylin and eosin, and immunohistochemical techniques. These images are in different resolutions and magnification factors, and are organized in hierarchical annotations generated by pathologists and biologists, indicating the observed biological system and organs, giving a total of 46 different indexing terms. The resulting annotations include terms like circulatory system, heart, lymphatic system and thymus, among others. This data set has also been used in previous work at our lab [38] and is the only one currently available online free of charge.2

2 Dataset of 20,000 histology images at http://www.informed.unal.edu.co. Dataset of tissue types at http://168.176.61.90/histologyDS/.

Notice that in all cases, a single image can have several semantic terms associated with it. This is an important characteristic of real-world databases, which do not split collections into disjoint sets, but rather allow images to have multiple annotations to describe several aspects or objects within the image.

Images are usually regions of interest with patterns observed in full tissue slides, which were focused and selected by the team of pathologists and biologists. They rarely include views of a complete tissue slide. The resulting image collections present variations in magnification and acquisition style, which are considered natural properties of large-scale, real world medical image collections. The acquisition process was not restricted to highly controlled or rigid image views, but instead encouraged spontaneous variability motivated by domain-specific interestingness, which may make the search process more challenging.

3.1. Data representation

3.1.1. Visual representation

A large variety of methods have been investigated to extract and represent visual characteristics in histology images. Be it for automated grading [15], classification [20] or image retrieval [7], two important features are usually modeled: color and texture. Color features exploit useful information associated with staining levels, which are natural bio-markers for pathologists. Texture features exploit regularities in biological structures, since tissues tend to follow homogeneous patterns. In this work, a bag-of-features representation is used, which has been shown to be a useful representation of histology images due to its ability to adaptively encode distributions of textures in an image collection.

We selected the Discrete Cosine Transform (DCT), computed at each RGB color channel, as the local feature descriptor. A dictionary of 500 codeblocks is constructed using the k-means algorithm for each image collection separately. Then, a histogram of the distribution of these codeblocks is computed for each image. As a result, we have vectors in R^n, with n = 500 visual features for each image. When appropriately trained, the dictionary is able to encode meaningful patterns that correlate with high-level concepts such as the size and density of nuclei, which may allow the system to distinguish important features such as magnification factor and tissue type. We refer the reader to the work of Cruz-Roa et al. [38] for more details about this histology image representation approach, which we followed closely in this work.
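To make the bag-of-features pipeline concrete, the following Python sketch outlines the main steps under simplifying assumptions (non-overlapping 8×8 patches, a generic k-means dictionary); function names and parameters are illustrative and do not reproduce the implementation of [38].

```python
import numpy as np
from scipy.fft import dctn               # 2-D DCT for local patch descriptors
from sklearn.cluster import KMeans       # dictionary of codeblocks

def dct_patch_descriptors(image, patch=8):
    """Compute one DCT descriptor per non-overlapping patch, per RGB channel."""
    h, w, _ = image.shape
    feats = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            block = image[y:y + patch, x:x + patch, :]
            feats.append(np.concatenate([dctn(block[:, :, c], norm='ortho').ravel()
                                         for c in range(3)]))
    return np.array(feats)

def bag_of_features(images, n_codeblocks=500):
    """Learn a k-means dictionary and return one codeblock histogram per image (columns of X_V)."""
    descriptors = [dct_patch_descriptors(im) for im in images]
    kmeans = KMeans(n_clusters=n_codeblocks, n_init=4).fit(np.vstack(descriptors))
    hists = [np.bincount(kmeans.predict(d), minlength=n_codeblocks).astype(float)
             for d in descriptors]
    return np.stack(hists, axis=1)       # X_V with shape (n, l): n visual features, l images
```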

3.1.2. Semantic representation

Likewise, semantic data is herein represented as a bag-of-words following a vector space strategy commonly used in information retrieval [39]. First, the dictionary of indexing terms is constructed using the list of available keywords. Then, assuming a dictionary with m terms, each image is represented as a binary vector in R^m, in which each dimension indicates whether the corresponding semantic term is assigned to the image. Using this representation, each image can have as many semantic terms assigned as needed. Also, the size of the semantic dictionary is not limited and can be easily extended.
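For illustration, a binary term matrix of this form can be assembled as in the minimal sketch below; the vocabulary and annotation lists are hypothetical toy values.

```python
import numpy as np

def term_matrix(annotations, vocabulary):
    """Build the binary semantic matrix X_S (m terms x l images) from per-image keyword lists."""
    index = {term: i for i, term in enumerate(vocabulary)}
    X_S = np.zeros((len(vocabulary), len(annotations)))
    for j, keywords in enumerate(annotations):
        for term in keywords:
            X_S[index[term], j] = 1.0    # 1 if the term is assigned to image j
    return X_S

# Toy example with a three-term dictionary and two annotated images
vocab = ["connective", "epithelial", "heart"]
X_S = term_matrix([["connective"], ["epithelial", "heart"]], vocab)
```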


Fig. 2. Sample images and annotations from the three histology data sets used in this work: (a) Cervical Cancer data set. (b) Basal Cell Carcinoma data set. (c) Histology Atlas data set. These sample images have been selected to illustrate the kind of contents and annotations available in each collection.


The number of different terms for each data set can be found in the last column of Table 1. None of the images in the three data sets have all annotations at the same time. Usually, a single image has between two and four semantic annotations assigned to it, depending on the data set. These keywords may co-occur in some cases, and they can also exclude each other in some cases. We do not exploit these term relationships explicitly since the bag-of-words representation has been adopted.

Notice that we use the term semantics to refer to semantic terms only. In this work, the relationships among terms are not explicitly considered through the use of ontologies or similar data structures. The use of semantics throughout the paper is intended to emphasize our goal of assigning high-level interpretations to low-level visual signals, which are not easily understood by computers in the same way as humans do. Smeulders et al. [4] named this condition the semantic gap, and many other studies thereafter have adopted similar uses of the term to refer to this problem [40].

Since both visual and semantic representations are vectors, a database of images can be represented with two matrices by stacking the corresponding vectors of visual and semantic features as columns of two matrices. The notation used in the following sections sets the matrix of visual data for a collection of l images as X_V ∈ R^{n×l}, where n is the number of visual patterns in the bag-of-features representation. The matrix of semantic terms for the same collection is X_S ∈ R^{m×l}, where m is the number of keywords in the semantic dictionary.


4. Multimodal fusion

The search method proposed in this work is based on a multimodal representation of images that combines visual features with semantic information. Fig. 3 presents an overview of the proposed approach, which comprises three sequential stages: (1) visual indexing, (2) semantic embedding and (3) multimodal fusion. Three image representations are obtained throughout the process: (1) visual features, (2) semantic features and (3) the proposed fused representation. The retrieval engine can be set up to search using any of the three representations. In the following subsections, we assume a visual and semantic data representation following the description of Section 3.1, and focus on describing components 2 and 3 of Fig. 3.

Fig. 3. Overview of the proposed fused representation. From the input image to the final fused representation, three main processes are carried out: visual indexing, semantic embedding and multimodal fusion. Each stage also produces the corresponding representation: only visual, only semantic and fused.


4.1. Semantic embedding

The goal of a semantic embedding is to learn the relationships between visual features and semantic terms, to generate a new representation of images based on high-level concepts. The strategy proposed in this work is based on a matrix factorization algorithm, which allows the system to learn these relationships as a linear projection between the visual and semantic spaces. In this work, we adopt the notions of multimodal image indexing using Nonnegative Matrix Factorization (NMF) recently proposed in [30], and extend these ideas by introducing a direct semantic embedding. In the following sections, two strategies for modeling visual-to-semantic relationships are presented.

4.1.1. Latent semantic embedding

The first algorithm for semantic embedding is based on NMF, which allows us to extract structural information from a collection of data samples. For any input matrix X ∈ R^{n×l}, containing l data samples with n nonnegative features in its column vectors, NMF finds a low-rank approximation of the data using non-negativity constraints:

$$X \approx WH, \qquad W, H \geq 0$$

where W ∈ R^{n×r} is the basis³ of the vector space in which the data will be represented and H ∈ R^{r×l} is the new data representation using r factors. Both W and H are unknowns in this problem, and the decomposition can be found using alternating optimization. We use the divergence criterion to find an approximate solution with the multiplicative updating rules proposed by Lee and Seung [41].

³ The term "basis" is slightly abusive here, since the vectors in the matrix W are not necessarily linearly independent and the set of vectors may be redundant.
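As a concrete illustration, the multiplicative updates of Lee and Seung for this divergence objective can be written compactly in numpy. This is a simplified sketch with a fixed iteration budget, not the authors' implementation.

```python
import numpy as np

def nmf_divergence(X, r, n_iter=300, eps=1e-9):
    """Factorize a nonnegative matrix X (n x l) as X ~ W H with r factors,
    using multiplicative updates for the (generalized KL) divergence objective."""
    rng = np.random.default_rng(0)
    W = rng.random((X.shape[0], r)) + eps
    H = rng.random((r, X.shape[1])) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (X / WH)) / (W.sum(axis=0, keepdims=True).T + eps)   # update H
        WH = W @ H + eps
        W *= ((X / WH) @ H.T) / (H.sum(axis=1, keepdims=True).T + eps)   # update W
    return W, H
```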

Given the matrix of visual features, X_V, and the matrix of semantic terms, X_S, we aim to find correlations between both. We model these relationships using a common latent space, in which both data modalities have to be projected. We employ the following two-stage approach to find semantic latent factors for image features and keywords:


1. Decompose X_S: The matrix of semantic terms is first decomposed using NMF to find a low-rank approximation of the semantic data. This step may be understood as finding correlations between various terms according to their joint occurrence on images. The result of this first step is a decomposition of the form X_S = W_S H, with the matrix W_S as the semantic projection and H as the latent semantic representation.

2. Embed X_V: Find a projection function for visual features to embed data in the latent factor space. This is achieved by solving the equation X_V = W_V H for W_V only, while fixing the latent representation H equal to the matrix found in the previous stage.

The correlations between visual features and semantic data are encoded through the matrices W_V, W_S and H. The new latent representation is semantic by design, since the function to project visual features spans a latent space that has been originally formed by semantic data. We refer to this algorithm as NMF Asymmetric (NMFA) in our later discussions.
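A compact sketch of this two-stage procedure is shown below. It is an illustration under assumed defaults (r latent factors chosen by cross-validation, KL divergence, a fixed number of multiplicative updates), using scikit-learn's NMF for the first stage; it is not the authors' code.

```python
import numpy as np
from sklearn.decomposition import NMF

def solve_W_fixed_H(X, H, n_iter=200, eps=1e-9):
    """Multiplicative KL-divergence updates for W in X ~ W H, with H held fixed."""
    W = np.random.rand(X.shape[0], H.shape[0]) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        W *= ((X / WH) @ H.T) / (H.sum(axis=1, keepdims=True).T + eps)
    return W

def nmfa(X_S, X_V, r=10):
    """Two-stage latent semantic embedding (NMFA):
    stage 1 factorizes the term matrix X_S ~ W_S H,
    stage 2 learns W_V such that X_V ~ W_V H, with H fixed."""
    nmf = NMF(n_components=r, solver='mu', beta_loss='kullback-leibler',
              init='nndsvda', max_iter=400)
    W_S = nmf.fit_transform(X_S)   # (m x r) semantic projection
    H = nmf.components_            # (r x l) latent representation of the training images
    W_V = solve_W_fixed_H(X_V, H)  # (n x r) projection of visual features into the latent space
    return W_S, W_V, H
```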

4.1.2. Direct semantic embedding

An alternative approach to model the relationships between visual features and semantic terms is to find a transformation function between both spaces directly. This problem is formulated as follows:

$$X_V \approx W X_S, \qquad W \geq 0 \qquad (1)$$

where W ∈ R^{n×m} is a matrix that approximately embeds visual features in the space of semantic terms. Instead of extracting a latent factor structure from the data, this strategy fixes the latent encoding (matrix H) as the known semantic representation of images in the collection, X_S. This can be understood as requiring the latent factors to match exactly the semantic representation of images, resulting in a scheme for learning the structure of visual features that directly correlates with keywords.

To solve this problem, the divergence between the matrix of visual features and the embedded data is adopted as the objective function:

$$D(X_V \mid W X_S) = \sum_{ij} \left( (X_V)_{ij} \log \frac{(X_V)_{ij}}{(W X_S)_{ij}} - (X_V)_{ij} + (W X_S)_{ij} \right) \qquad (2)$$

The goal is to minimize this divergence measurement on a set of training images, considering that this optimization problem is convex and can be solved efficiently following gradient descent or interior point strategies. In this work, the matrix W is learned using the following multiplicative updating rule:

$$W_{ij} \leftarrow W_{ij} \, \frac{\sum_{u} (X_S)_{ju} (X_V)_{iu} / (W X_S)_{iu}}{\sum_{v} (X_S)_{jv}} \qquad (3)$$

This is a rescaled gradient descent approach that uses a data-dependent step size, following Lee and Seung's methodology for NMF [41]. The solution is then found by iteratively running this updating rule for W for a number of iterations or until a fixed error reduction is reached. We refer to this algorithm as the Nonnegative Semantic Embedding (NSE) in our later discussions.
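A direct numpy translation of the update in Eq. (3) could look as follows; the initialization and stopping criterion are illustrative choices, not part of the original description.

```python
import numpy as np

def nse_fit(X_V, X_S, n_iter=300, eps=1e-9):
    """Learn W (n x m) such that X_V ~ W X_S, using the multiplicative rule of Eq. (3)."""
    W = np.random.rand(X_V.shape[0], X_S.shape[0]) + eps
    for _ in range(n_iter):
        WX = W @ X_S + eps
        # numerator: sum_u (X_V)_iu / (W X_S)_iu * (X_S)_ju ; denominator: sum_v (X_S)_jv
        W *= ((X_V / WX) @ X_S.T) / (X_S.sum(axis=1, keepdims=True).T + eps)
    return W
```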

4.1.3. Projecting unlabeled images

To recover the semantic representation of an image without keywords, we need to solve the following equation for x_S:

$$x_V \approx W x_S, \qquad x_S \geq 0 \qquad (4)$$

where x_V is the observed vector of visual features and W is a learned semantic embedding function. This formulation is also compatible with the latent semantic embedding, assuming W = W_V and x_S = h. In both situations, the non-negativity constraint holds and the same procedure is followed. Thus, the problem in Eq. (4) can be formulated as minimizing the divergence between the observed visual data and its reconstruction from the semantic space. Regarding the non-negativity restriction, the solution can be efficiently approximated using the following multiplicative updating rule in an iterative fashion:

$$(x_S)_{a} \leftarrow (x_S)_{a} \, \frac{\sum_{i} W_{ia} (x_V)_{i} / (W x_S)_{i}}{\sum_{k} W_{ka}} \qquad (5)$$

Following this procedure, a semantic representation of new images is constructed.
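For example, the update of Eq. (5) can be applied to a single unannotated image as in the sketch below; the initialization and iteration count are illustrative.

```python
import numpy as np

def project_to_semantic(x_V, W, n_iter=200, eps=1e-9):
    """Infer a nonnegative semantic vector x_S for an unannotated image by
    approximately solving x_V ~ W x_S with the multiplicative rule of Eq. (5)."""
    x_S = np.full(W.shape[1], 1.0 / W.shape[1])   # uniform nonnegative start
    for _ in range(n_iter):
        Wx = W @ x_S + eps
        x_S *= (W.T @ (x_V / Wx)) / (W.sum(axis=0) + eps)
    return x_S
```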

4.2. Fusing visual and semantic content

So far, we have considered the visual and semantic strategies for image search. The first strategy is entirely based on visual features, to match for visually similar images. The second strategy is based on an inferred semantic representation, to match for semantically related images. In this section we introduce a third strategy, based on multimodal fusion. The main goal of this scheme is to combine visual features and semantic data together in the same image representation, to exploit the best properties of each modality.

4.2.1. Fusion by back-projection

The proposed fusion strategy projects semantic data back to the visual feature space to make a convex combination of both visual and semantic representations, as illustrated in Fig. 4. This can be understood as an early fusion strategy, since the representations are merged before their subsequent use. Assuming a histogram of visual features x_v and a vector of a predicted semantic representation, x_s, the fusion procedure generates a new image representation defined as:


$$x_f := \lambda x_v + (1 - \lambda) W x_s \qquad (6)$$

where x_f ∈ R^n is the vector of fused features in the visual space and λ is the parameter of the convex combination that controls the relative importance of the data modalities. This fusion approach takes the semantic representation of images and projects it back to the visual space using the reconstruction formula:

$$\hat{x}_v := W x_s \qquad (7)$$

This back-projection is a linear combination of the column vectors in W using the semantic annotations as weights. In that way, the reconstructed vector x̂_v represents the set of visual features that an image should have according to the learned semantic relationships in the image collection. Therefore, x̂_v and x_v highlight different visual structures of the same image, since x̂_v is a semantic approximation of the observed visual features, according to Eqs. (4) and (7). Notice that this extension can be applied to a latent semantic embedding using NMFA or to a direct semantic embedding using NSE. We refer to this extension using the suffix BP for Back-Projection (NMFA-BP or NSE-BP).

Fig. 4. Illustration of the fusion by back-projection procedure. The first step represents a query image in the visual feature space. Second, a semantic embedding is done to approximate high-level concepts for that image. Third, the semantic representation is projected back to the visual space and combined with the original visual representation.
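In code, the back-projection fusion of Eqs. (6) and (7) reduces to a few lines. The sketch below assumes W is the learned embedding matrix; the parameter lam plays the role of λ (set to α for database images and β for queries, as discussed in the next subsection).

```python
import numpy as np

def fuse_back_projection(x_v, x_s, W, lam=0.5):
    """Fused features in the visual space, Eq. (6): x_f = lam * x_v + (1 - lam) * W x_s."""
    x_hat_v = W @ x_s              # back-projected semantics, Eq. (7): visual features implied by x_s
    return lam * x_v + (1.0 - lam) * x_hat_v
```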


4.2.2. Controlling modality importance

The parameter λ in the convex combination of the fusion strategy (Eq. (6)) allows us to control the importance of each data modality. The problem of assigning more weight to one or the other modality mainly depends on the performance that each modality offers to solve queries. More specifically, it depends on how faithfully one modality represents the true content of an image. On the one hand, visual features may be inaccurate for representing high-level semantic concepts, but good at representing low-level visual arrangements. On the other hand, the semantic representation may be noisy or incomplete because of human errors or prediction discrepancies.

The parameter λ is split into two different parameters to consider two kinds of images: database images and query images. For both kinds of images, the semantic representations are predicted by the learned model. For database images, the parameter λ will be called α, and for query images it will be called β throughout the paper. This distinction allows us to control the importance of the semantic modality for new, unseen query images, taking into account that predicting an approximate semantic representation may involve some inference noise. We evaluate the influence of these parameters in the following sections.


4.3. Histology image search

Indexing images by visual content in the retrieval system means that all searchable images in the collection, as well as all query images, are represented using the bag-of-features histogram, which is a non-parametric probability distribution of visual patterns. The latent semantic representation and the direct semantic representation are both nonnegative vectors that can be properly normalized by making their ℓ1-norm equal to 1, so the new values are interpreted as the probabilities of high-level semantic concepts for one image. In the case of the fused representation, features are once again represented in the visual feature space and the ℓ1 normalization is applied as well.

The retrieval system requires a similarity function to rank images in the collection by comparing them with the features of the query. The three representations discussed above can be considered as probability distributions, and the most natural way to compare these features is using a similarity measure appropriate for probability distributions. The histogram intersection is a measure for estimating the commonalities between two non-parametric probability distributions represented by histograms. It computes the common area between both histograms, obtaining a maximum value when both histograms are the same distribution and zero when there is nothing in common. The histogram intersection is defined as follows:

$$k_{\cap}(x, y) = \sum_{i=1}^{n} \min\{x_i, y_i\} \qquad (8)$$

where x and y are histograms and the sub-index i represents the i-th bin in each histogram of a total of n. This similarity measure has been shown to be a valid kernel function for machine learning applications [42], and has been successfully used in different computer vision tasks [43]. We adopt this similarity measure for image search in the three representation spaces evaluated in this work, which are the visual space, the semantic space and the fused space.
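A minimal ranking sketch using this measure is given below; it assumes the query and database vectors live in the same (visual, semantic or fused) space and are ℓ1-normalized before comparison.

```python
import numpy as np

def histogram_intersection(x, y):
    """Eq. (8): k(x, y) = sum_i min(x_i, y_i) for two histograms."""
    return float(np.minimum(x, y).sum())

def rank_database(query, database):
    """Return database indices sorted by decreasing histogram-intersection similarity to the query."""
    l1 = lambda v: v / (v.sum() + 1e-12)          # L1 normalization
    q = l1(query)
    scores = np.array([histogram_intersection(q, l1(d)) for d in database])
    return np.argsort(-scores)
```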

5. Experiments and results

Experiments to evaluate retrieval performance were conducted on the three histology data sets described in Section 3. In our experimental evaluation, we focus on image search experiments under the query-by-example paradigm, in other words, the use of non-annotated query images to retrieve relevant database images. Our goal is to demonstrate the strengths and weaknesses of each of the three image representations: visual, semantic and multimodal.

5.1. Experimental protocol

5.1.1. Training, validation and test

We follow a training, validation and test experimental scheme to set up and evaluate the algorithms. A sample with 20% of the images in each collection is separated as held-out data for testing experiments. The other 80% of the images are used for training and validation experiments. This partition is made using stratified sampling over the distribution of semantic terms, to separate representative samples of the semantic concepts in the data set. The number of images in each data set and the number of images in the corresponding partitions are reported in Table 1. For training and validation, a 10-fold cross-validation procedure was employed, and test experiments are conducted on the held-out data.

5.1.2. Performance measures

A single experiment in a particular data set consists of a simulated query, i.e., an example image taken from the test or validation sets for which semantic annotations are known, but hidden. Then, the ranking algorithm is run over all database images and the list of ranked results is evaluated.

The evaluation criterion adopted in our experiments is based on the assumption that one result is relevant to the query if both share one or more semantic terms. This assumption is reasonable under the query-by-example paradigm evaluated in this work. Since the system does not receive an explicit set of keywords, the query is highly ambiguous and the intention of the user may be implicit. It is even possible that the user is not completely aware of exactly what she is looking for. Therefore, the system does a good job if it retrieves images that are related to the query in at least one possible sense, helping the user to better understand image contents and supporting the decision making process.

We performed automated experiments by sending a query to the system and evaluating the relevance of the results. The quality of the results list is evaluated using information retrieval measures, mainly Mean Average Precision (MAP) and precision at the first ten results (P@10 or early precision) [39]. For computing these measurements, we used the trec_eval tool, available online.4 These two measurements are complementary views of the performance of a retrieval system, and we found in our experiments that there is a trade-off when trying to maximize both at the same time, as is discussed below.

4 http://trec.nist.gov/trec_eval/.
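The paper relies on trec_eval for the reported numbers; purely as a toy illustration of the two measures (assuming binary relevance labels and that all relevant items appear in the ranked list), the computation looks like this:

```python
import numpy as np

def average_precision(relevances):
    """Average precision of one ranked list of binary relevance labels (1 = relevant)."""
    rel = np.asarray(relevances, dtype=float)
    precisions = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precisions * rel).sum() / max(rel.sum(), 1))

def precision_at_k(relevances, k=10):
    """Fraction of relevant results among the first k positions (P@10 for k = 10)."""
    return float(np.mean(np.asarray(relevances, dtype=float)[:k]))
```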

5.2. Baselines

The first natural baseline for image search under the query-by-example paradigm is the performance of purely visual search, that is, when no effort is made to introduce semantic information into the search process. In this case, the histogram intersection similarity measure is used directly to match query features to similar visual content from the database. This baseline allows us to observe performance gains when using semantic information with a particular indexing algorithm.

We also consider late fusion as a second baseline, since it can combine data from two different data sources. Late fusion is a very popular strategy for combining multiple similarity measures in a retrieval system thanks to its simplicity. In particular, a simple score combination has been shown to be robust both in theory [44] and in practice [45,46], performing better than other schemes such as rank combination, minimum, maximum, and other operators. We adopt the score combination strategy using a convex combination of visual and semantic similarities to produce a single score for each image with respect to the query. Both similarities are first normalized using a min-max procedure. In addition, we optimize the parameter of the convex combination to produce the best baseline possible.

In the experiments reported in this work, the semantic information used during late fusion is not provided by the user: it is automatically generated by the proposed methods. For any query image, we predict a semantic representation using the latent (NMFA) or direct (NSE) embeddings, and compute similarity scores with respect to database images. Also, similarity scores are computed using visual features independently. Then, both visual and semantic scores are combined to include the opinion of both views. Notice that this is fundamentally different from the proposed fusion by back-projection, since computing similarity scores in each space separately does not require any learning or modeling of multimodal relationships.
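A sketch of this late fusion baseline, with min-max normalization and a convex score combination, is given below; the weight is a hypothetical parameter tuned on validation data.

```python
import numpy as np

def min_max(scores):
    """Min-max normalization of a vector of similarity scores."""
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min() + 1e-12)

def late_fusion_scores(visual_scores, semantic_scores, weight=0.5):
    """Convex combination of independently computed visual and semantic scores (late fusion baseline)."""
    return weight * min_max(visual_scores) + (1.0 - weight) * min_max(semantic_scores)
```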

5.3. Trade-offs in retrieval performance

The first set of experiments focuses on performance evaluation for the semantic and multimodal strategies. Results reported in this section correspond to experiments conducted on the training and validation sets, following a 10-fold cross-validation procedure. In the following subsections we describe our findings on the trade-off between MAP and P@10 when using semantic or visual search, and show how our algorithms can help to find a convenient balance.

5.3.1. Visual and semantic search

The NMFA and NSE algorithms described in Sections 4.1.1 and 4.1.2, respectively, are used to project images to a semantic space spanned by terms. One of the main advantages of NSE is that it does not need any parameter tuning during learning or inference, so it can be directly deployed in a retrieval system on top of the visual representation without further effort. For NMFA, the number of latent factors has to be chosen by cross-validation. We compare the performance of these two semantic indexing mechanisms in Table 2 for our three data sets.
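As a rough illustration of what such a projection looks like at query time, the sketch below maps a visual feature vector to term scores through a learned nonnegative linear embedding. This is a simplified assumption about the inference step, not the exact NSE or NMFA formulation given in Section 4; the matrix W and the normalization are illustrative.

```python
import numpy as np

def project_to_semantic_space(visual_features: np.ndarray,
                              W: np.ndarray) -> np.ndarray:
    """Map a visual feature vector (n_visual_words,) to term scores
    (n_terms,) with a learned nonnegative embedding matrix W of shape
    (n_terms, n_visual_words). Scores are clipped to be nonnegative and
    L1-normalized so that images can be compared in the semantic space."""
    scores = np.maximum(W @ visual_features, 0.0)
    total = scores.sum()
    return scores / total if total > 0 else scores
```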

As a baseline strategy we use visual matching, which relies on visual features only to retrieve related images. We also compare against the expected performance of a random ranking strategy, which was estimated using a Monte Carlo simulation. The chance performance is significantly lower than the visual baseline in all three data sets.
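Chance performance can be estimated by repeatedly shuffling the database and scoring the random rankings. The following is a minimal sketch of that Monte Carlo estimate, reusing the metric helpers defined earlier in this section; the number of trials is an arbitrary choice.

```python
import random

def chance_performance(db_ids, relevant_ids, trials: int = 1000, k: int = 10):
    """Expected MAP and P@10 of a random ranking for a single query,
    estimated over a number of random shuffles of the database."""
    ap_sum, p10_sum = 0.0, 0.0
    ids = list(db_ids)
    for _ in range(trials):
        random.shuffle(ids)
        ap_sum += average_precision(ids, relevant_ids)
        p10_sum += precision_at_k(ids, relevant_ids, k)
    return ap_sum / trials, p10_sum / trials
```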

Observe that the semantic embeddings always outperform the visual baseline by a significant amount in terms of MAP. This shows that the proposed embeddings learn to predict a good semantic representation of images. For the Cervical Cancer data set, the best embedding is NSE, while NMFA performs better on the Basal-cell Carcinoma and Histology Atlas data sets. This is consistent with the complexity of the semantic vocabularies of these data sets: the Cervical Cancer set has a term dictionary of eight keywords, while the Basal-cell Carcinoma and Histology Atlas sets have 18 and 46 terms, respectively. It is therefore natural for a direct embedding to perform best with simple vocabularies, while the latent embedding is able to exploit complex correlations among many terms.

Notice also that none of the semantic embeddings were able to improve early precision with respect to the visual baseline, according to the results in Table 2. Early precision measures how many relevant images are shown among the top ten results, and in this case visual matching seems to be a strong baseline.




Table 2
Retrieval performance of semantic strategies compared to visual matching. Results show the trade-off between MAP and P@10 on all image collections. Semantic search produces superior MAP while visual search is a strong baseline for early precision. Chance performance refers to the expected performance of a random ranking strategy, and it is significantly lower than visual and semantic search.

Method                      | Cervical Cancer   | Basal-cell C.     | Histology Atlas
                            | P@10      MAP     | P@10      MAP     | P@10      MAP
Visual matching (baseline)  | 0.5904    0.5214  | 0.4360    0.2928  | 0.7372    0.2751
Latent embedding (NMFA)     | 0.5067    0.6591  | 0.3176    0.4947  | 0.5263    0.6309
Direct embedding (NSE)      | 0.5414    0.6970  | 0.2543    0.4317  | 0.5230    0.6113
Chance performance          | 0.4623    0.4681  | 0.2183    0.1806  | 0.0978    0.1008


Our result is consistent with previous studies showing that a k-nearest neighbors search in the visual space serves as a good baseline for image annotation [47]. This suggests that in the visual space, nearby images are more likely to have important correspondences of appearance patterns, which may result in some semantic relationship among them. This matching of visual patterns disappears when images are projected to the semantic space, where only dominant semantic concepts are preserved.

5.3.2. Multimodal fusion

To balance the trade-off between MAP and early precision, a multimodal fusion strategy may be used. The following experiments compare the ability of late fusion and of early fusion by back-projection to balance the performance of semantic and visual representations. Both strategies allow us to control the relative importance of the data modalities, using one parameter for database images and another for query images, as described in Section 4.2.

Fig. 5. Performance of multimodal fusion on the three histology image collections (Cervical Cancer, Basal-cell Carcinoma and Histology Atlas). Results obtained during parameter search by varying α and β to generate the multimodal representation. Plots show the Pareto frontier on the performance space for all multimodal fusion strategies while visual and semantic performance are presented as points. Reported methods are: Nonnegative Semantic Embedding + Late Fusion (NSE-Late), NMF-Asymmetric + Late Fusion (NMFA-Late), NMF-Asymmetric + Backprojection (NMFABP), Nonnegative Semantic Embedding + Backprojection (NSEBP), Direct Visual Matching (Visual), Latent semantic indexing using NMF-Asymmetric (NMFA), and Direct semantic indexing using Nonnegative Semantic Embedding (NSE).


Each parameter is the weight of a convex combination between the visual and semantic representations. The influence of these parameters was evaluated by producing fused representations with values varying between 0 and 1, with a step of 0.1, for α (database images) and β (query images). When evaluating the performance of each (α, β) pair, MAP and early precision (P@10) are measured.

The parameter space was explored following this procedure on the three data sets using 10-fold cross-validation, to fuse visual features with the semantic representations obtained by the two proposed semantic embeddings (NMFA and NSE). We also followed the same parameter search procedure for the late fusion baseline, using both semantic embeddings as well. Fig. 5 presents the results of the multimodal fusion evaluation on the training set of the three histology image collections. These plots show the performance space with MAP on the x-axis and early precision (P@10) on the y-axis.
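The parameter sweep itself is a simple grid search; a sketch under stated assumptions is shown below. The callables fuse_database, fuse_query and evaluate_retrieval stand in for the fusion operations of Section 4.2 and the cross-validated retrieval run; they are illustrative names, not part of our implementation.

```python
from typing import Callable, List, Tuple
import numpy as np

def parameter_sweep(
    fuse_database: Callable[[float], object],
    fuse_query: Callable[[float], object],
    evaluate_retrieval: Callable[[object, object], Tuple[float, float]],
    step: float = 0.1,
) -> List[Tuple[float, float, float, float]]:
    """Evaluate every (alpha, beta) pair on a regular grid.

    fuse_database(alpha) and fuse_query(beta) build fused representations for
    database and query images; evaluate_retrieval returns (MAP, P@10) for a
    retrieval run. All three are supplied by the caller, since they depend on
    the fusion strategy being tested."""
    grid = np.arange(0.0, 1.0 + 1e-9, step)
    points = []
    for alpha in grid:
        for beta in grid:
            map_score, p10 = evaluate_retrieval(fuse_database(alpha),
                                                fuse_query(beta))
            points.append((alpha, beta, map_score, p10))
    return points
```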






Each point in this space corresponds to one configuration of the retrieval system, either with multimodal fusion or any other baseline. Performance is best when both measures are maximized simultaneously, and we only plot the results for the (α, β) pairs on the Pareto frontier of each multimodal fusion strategy.
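Extracting the Pareto frontier from the swept operating points amounts to keeping the configurations that are not dominated in both MAP and P@10 at once; a minimal sketch is shown below, operating on the (alpha, beta, MAP, P@10) tuples produced by the sweep sketched above.

```python
def pareto_frontier(points):
    """Keep the (alpha, beta, MAP, P@10) tuples that are not dominated,
    i.e., no other point is at least as good in both MAP and P@10 and
    strictly better in at least one of them."""
    frontier = []
    for p in points:
        dominated = any(
            q[2] >= p[2] and q[3] >= p[3] and (q[2] > p[2] or q[3] > p[3])
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier
```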

First, notice the difference between visual and semantic search (NMFA and NSE) in all plots, and how these two points lie on opposite sides of the performance space, allowing us to visualize their trade-off in performance. The plots also show that all multimodal fusion strategies can be configured to produce intermediate results between the performance of visual and semantic search by giving more preference to one or the other data modality. These Pareto frontiers illustrate the path along the trade-off between visual and semantic search, showing that we cannot obtain a fused result that is better than both of them individually. However, the proposed strategy allows us to choose a good balance that preserves the best response of visual or semantic data as needed.

Finally, our proposed early fusion, based on back-projection of the semantic data, consistently outperforms the results obtained by the late fusion baseline, achieving improved performance in terms of both MAP and P@10. The benefits of back-projection over late fusion are clear in all three image collections by a large margin. Two important characteristics make our strategy better than late fusion: first, during the back-projection step, we take advantage of the learned relationships between visual and semantic data, and explicitly encode these correspondences in a reconstructed vector (see Eqs. (6) and (7)). Second, we complement the visual representation with semantically reconstructed visual features, preserving the ability to match the original visual structures as well as semantic relations in the same space.
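To make the contrast with late fusion concrete, the following sketch shows one plausible form of fusion by back-projection: the predicted semantic vector is mapped back to the visual space through a learned reconstruction matrix and then combined with the original visual features. This is only an illustration of the idea; the actual reconstruction follows Eqs. (6) and (7) in Section 4.2, and the matrix B and the weight alpha are assumptions made for this example.

```python
import numpy as np

def back_projection_fusion(visual: np.ndarray, semantic: np.ndarray,
                           B: np.ndarray, alpha: float) -> np.ndarray:
    """Fuse modalities in the visual space.

    visual:   original visual feature vector, shape (n_visual_words,)
    semantic: predicted term scores, shape (n_terms,)
    B:        learned back-projection matrix mapping term scores to
              reconstructed visual features, shape (n_visual_words, n_terms)
    alpha:    convex-combination weight giving preference to one modality."""
    reconstructed = np.maximum(B @ semantic, 0.0)  # semantically driven visual features
    return alpha * visual + (1.0 - alpha) * reconstructed
```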

5.4. Histology image search

This section presents the results of retrieval experiments conducted on the test set, after fixing the best parameters on the training set for each model.


Fig. 6. Mean Average Precision (MAP) of all evaluated strategies on the three histology image collections. Bars are absolute MAP values, and percentages indicate the relative improvement with respect to the method with the lowest performance, which is the purely visual search in all cases. Semantic methods have the best overall performance, followed by the proposed early fusion methods. Note that the scale of performance has been set differently for each data set to highlight relative improvements. The overall tendency across data sets is best viewed in color. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


The results compare the performance of visual, semantic and multimodal search, along with the proposed baselines. We keep our attention on the two performance measures evaluated in the previous sections, to contrast the benefits of each approach. The evaluated methods can be grouped into the following broader strategies:

1. Visual matching: only visual features are used to match query images and database images.
2. Semantic search: images are projected and matched in a semantic space. Two semantic indexing methods are evaluated: Nonnegative Semantic Embedding (NSE, Section 4.1) and NMF-Asymmetric (NMFA, Section 4.1.1).
3. Early fusion: visual and semantic features are combined in the same image representation. Here we evaluated the proposed back-projection strategy with the two semantic embeddings: NSE Back-Projection (NSE-BP) and NMFA Back-Projection (NMFA-BP).
4. Late fusion: visual and semantic features are matched independently and combined during the ranking procedure by mixing their similarity measures. We evaluated a late fusion strategy between visual features and the predicted semantic features of NSE and NMFA.

These four groups facilitate the interpretation of results and also allow meaningful comparisons between different strategies. The results with respect to MAP are presented in Fig. 6 for each data set. This figure shows a ranking of the methods in decreasing order to compare relative gains in MAP. Notice that semantic methods are at the top positions of the rankings, indicating that semantic indexing is good at optimizing the precision of an image retrieval system. Precision measured by MAP can also be seen as the performance of image auto-annotation, reflecting how faithful the predictions of semantic terms for all images are. The results therefore suggest that semantic embeddings are able to predict meaningful annotations for images, which are then matched correctly.

The second group (according to MAP) in the ranking of methods shown in Fig. 6 is the group of methods based on back-projection. The difference with respect to the semantic methods is mainly due to the trade-off discussed in the previous section.







Fig. 7. Early precision (P@10) of all evaluated strategies on the three histology image collections. Bars are absolute P@10 values, and percentages indicate the relative improvement with respect to the method with the lowest performance, which is the pure semantic search in all cases. The proposed fusion approach presents the best performance for two data sets, improving over the visual baseline. In all cases, our proposed method outperforms late fusion. Note that the scale of performance has been set differently for each data set to highlight relative improvements. The overall tendency across data sets is best viewed in color. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)



We selected the fusion parameters mainly to improve early precision (P@10), since our goal is to balance the trade-off for image retrieval, and the first results are crucial for a good user experience. Finally, notice that our proposed approach for fusion consistently outperforms the late fusion of similarities in all three data sets.

As for early precision, Fig. 7 shows the relative differences among all evaluated methods. The ranking of methods has changed, leaving the semantic-based approaches at the bottom in performance, even below the visual baseline. In two of the three data sets, the proposed back-projection scheme obtains the best performance, since we balanced the parameters during cross-validation to improve this measure. We wanted to bring more semantic information to the top of the ranked list of results, and our strategy proves to be effective for combining both data modalities in a single representation. The visual baseline is very strong in the case of the Histology Atlas data set, leaving all other methods behind by a large margin. Nevertheless, fusion by back-projection offers a significant improvement over the original semantic representation, showing a good intermediate compromise. Our fusion methodology also shows important improvements over the late fusion baseline.

Some example queries are illustrated in Fig. 8 along with the top nine results retrieved by three methods: visual matching, a semantic embedding and the multimodal representation. Queries are single example images with no text descriptions. The visual ranking brings images that match features without any knowledge about their high-level interpretations, and thus it sometimes fails to retrieve the correct results. The semantic embedding selected for each database corresponds to the one with the best performance on the test set according to MAP (NSE for Cervical Cancer and NMFA for Basal-cell Carcinoma and Histology Atlas). Results obtained by matching the representation in the semantic space are diverse and correspond to images with higher scores in the terms predicted for the query.


This strategy clearly does not consider visual information for ranking images (semantic embeddings use visual information only to project images to the semantic space), which results in large variations of appearance. The ranking produced by the fused representation can improve the retrieval performance of the response and also produces more visually consistent results, since the fusion takes place in the visual space. This shows how the proposed approach can effectively introduce semantic information into the visual representation to retrieve correct images that respect visual structures.

Finally, to provide a more general sense of the benefits of each approach, we compare the methods with respect to their positions in the rankings of MAP and P@10 shown in Figs. 6 and 7. We use the average position of a method across the three different histology image collections, and rank the methods again to provide a unified comparison. Fig. 9 presents a visualization of the rankings with respect to MAP (on the x-axis) and P@10 (on the y-axis). An ideal method would be at coordinate (1,1), which means it ranked first with respect to both performance measures. This visualization also reveals the trade-off between visual and semantic representations, indicating that, on average, semantic methods rank first with respect to MAP.
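The unified comparison is a simple rank aggregation: average each method's position in the per-data-set rankings and sort by that average. A small sketch follows, with made-up method orderings used only to show the data layout.

```python
def average_rank(rank_tables):
    """rank_tables: list of per-data-set rankings, each an ordered list of
    method names from best to worst. Returns methods sorted by their
    average position (1 = best)."""
    positions = {}
    for table in rank_tables:
        for pos, method in enumerate(table, start=1):
            positions.setdefault(method, []).append(pos)
    averages = {m: sum(p) / len(p) for m, p in positions.items()}
    return sorted(averages.items(), key=lambda item: item[1])

# Hypothetical example of the expected input layout:
# map_ranks = [["NMF-A", "NSE", "NMFA-BP"], ["NSE", "NMF-A", "NMFA-BP"]]
# print(average_rank(map_ranks))
```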

Fusion methods rank in the intermediate positions, indicating also that fusion by back-projection ranks, on average, above late fusion with respect to both performance measures, providing an improved balance. Also, notice how NMFA-BP usually ranks ahead of the visual baseline in terms of P@10, and also in terms of MAP. In fact, the results suggest that NMFA-BP, which consists of a latent semantic embedding with the corresponding back-projection and fusion, produces the multimodal representation of images with the best compromise: it stands in the first position of the rankings for early precision while remaining close to the performance of the semantic embeddings with respect to MAP.



Fig. 8. Example queries at left with the top nine images retrieved by visual, semantic and fused methods. Green frames indicate relevant images, and red frames indicate incorrectly retrieved images. Notice that semantic methods (NSE, NMFA) produce results with large visual variations since no visual information is considered for ranking. The proposed fusion approach (NSE-BP, NMFA-BP) improves the precision of the results and also brings a set of more visually consistent images. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 9. Rankings of evaluated methods according to their average position of performance among the three data sets. The x-axis represents the average ranking with respect to MAP, and the y-axis represents the average ranking with respect to P@10. Points close to the origin have better performance.





5.5. Discussions

5.5.1. Multimodal fusion

The proposed framework provides a learning-based tool for the fusion of visual content and semantic information in histology images. The method models cross-modal relationships through semantic embeddings, which have the interesting property of making the two modalities exchangeable from one space to another. This property may be understood as a translation scheme between two languages that express the same concepts in different ways. One of the languages is visual, communicating optical details found in images, and the other is semantic, representing high-level interpretations of images. These two views of the same data are complementary and are fused to build a better image representation.

This paper presents an approach to the problem of histology image retrieval following a multimodal setup, which is the first of its kind reported in the literature. Previous work on semantic retrieval of histology images is mainly oriented to training classifiers to recognize biological structures in images [7,8,22]. That strategy can be understood as a translation from the visual space to the semantic space without the possibility of a translation in the opposite direction, and is thus limited to a fusion procedure based on late fusion only.

Experimental results in this work have shown that an exclusively semantic representation may lead to the loss of important information for image search with example images. More importantly, our results show a consistent benefit of an early fusion strategy based on multimodal relationships over the popular approach of late fusion, and the benefits of early fusion go beyond improved performance. A better image representation may be the starting point for other systems that take this multimodal representation as input to learn classifiers or to solve other, more complex tasks.

5.5.2. Query expansion effect

The main reason for studying the fusion of visual and semantic data is that they are complementary sources of information: while visual data tends to be ambiguous, semantic data tends to be very specific; and while visual data provides detailed appearance descriptions, semantic data gives no clues about what an image looks like. So, depending on the fusion strategy, multimodal relationships become more useful for making decisions on the data. Our setup for image retrieval considers example images as queries. Since the visual content representation used in this work is based on a bag of features, an analogy with text vocabularies may help to explain the effects of multimodal fusion.

Visual features in the dictionary of codeblocks may be understood as visual words representing specific visual arrangements or configurations. One specific pattern is a low-level word that may have different meanings from a high-level or semantic perspective. This problem is known in natural language processing as polysemy and usually decreases retrieval precision, that is, the ability of the system to retrieve only relevant documents [48]. Also, different visual words may be related to the same high-level meaning, which is known as synonymy, and can reduce the ability of an information retrieval system to retrieve all relevant documents [48].

Experimental results in Section 5.4 are consistent with these definitions: a visual polysemy effect is observed when the retrieval system is based on visual features only (the lowest MAP score). On the other hand, a visual synonymy effect is observed when using semantic data, with a higher MAP score but lower early precision (P@10).


Thus, the back-projection of semantic data is able to disambiguate visual words by introducing other visual words semantically correlated with the query, thereby correcting the synonymy effect. It is in this context that the visual query expansion effect takes place. In addition, when both modalities are combined, the polysemy effect can also be corrected, provided that appropriate weights are assigned.

5.5.3. Large semantic vocabularies

Previous work on histology image retrieval is mainly based on classifiers trained to recognize several biological structures [7,18,20,8,22,9]. Transferring these methodologies to real-world system implementations requires a significant tuning effort, since each classifier may have its own optimal configuration. The proposed method is a unified approach that integrates all semantic labels together in a matrix for learning multimodal relationships. This makes an implementation simpler and ready to scale up to new keywords, as long as corresponding example images are available.

Our methods adapt to different vocabulary sizes, as shown in the experimental evaluations, which included three histology databases of different sizes and different numbers of associated keywords. The effort of introducing new semantic terms in our model is virtually zero. In fact, our experiments include an analysis of histology images with the largest vocabulary reported so far. We believe that image retrieval systems have the potential to support clinical activities, and to achieve that, the underlying computational methods have to be very flexible and prepared to use semantic data as it is available in current management systems. This involves vocabularies with hundreds of medical terms and thousands of images, which can easily be handled by the methods proposed in this paper.

5.5.4. Histology image collections

Currently, digital pathology makes it possible to manage, share and preserve slides together with electronic health records, which are very important steps to modernize infrastructure and to provide improved services. However, these systems can go beyond passive repositories of data to actually help with the organization, search, visualization and discovery of information hidden or buried in histology image collections. This potential could benefit diagnostic activities, as well as scientific research and academic training, and to realize it, new tools and methodologies have to be designed and evaluated.

Visual search technologies are among the most pervasive applications in daily life, and they could be seamlessly integrated into the practice of pathology as long as these methodologies meet the requirements of such an endeavor. This work proposed building enhanced histology image representations, using visual and semantic features, to construct effective retrieval systems. The resulting representation can also be used by other automated analysis systems, which could be essential in medical imaging departments to support various decisions in clinical practice.

5.5.5. Other considerations

This paper has presented a study with experimental evidence in favor of an early fusion strategy. Even though the proposed algorithm for early fusion has shown improved performance, the final accuracy is still far from perfect and there are several opportunities for improvement, both on the technical and on the experimental side.

On the technical side, our early fusion algorithm may be understood as a procedure to learn an image representation given visual features and text annotations. Visual features have been learned in an unsupervised way following a bag-of-features approach, which has limited capacity to encode very complex visual patterns. Learning more powerful visual features may help to improve performance, as suggested by several recent works [49,50].





Also, even though our early fusion algorithm is simple and efficient, it still requires more computations per image than late fusion algorithms.

On the experimental side, one of the limitations of our study has been access to more annotated data. We conducted experiments on three data sets of small to medium size. However, an indexing method like the one proposed in this paper could benefit from more data, which is difficult to collect from real medical cases and has restricted use in research and even in practical settings. We have shared part of the data collections used in this work in the hope that other researchers may benefit from open and high-quality histology images, and we continue to look for opportunities to access more sources of information, both in the community and within our own institutions.


6. Conclusions

This work presented a framework to build histology image representations that combine visual and semantic features, following a novel early-fusion approach. The proposed method learns the relationships between both data modalities and uses that model to project semantic information back to the visual space, in which the fused representation is built. The resulting multimodal representation is used in an image search system that matches potential results using a similarity measure; however, its use can be extended to other histology image analysis tasks such as classification or clustering.

The experimental evaluation conducted in this work included three histology image collections with various sizes and different numbers of text terms, demonstrating the potential of the proposed multimodal indexing methods under different conditions. We observed a trade-off between optimizing MAP and early precision at the same time when using either a semantic or a visual representation. This is mainly explained by the complementary nature of both data modalities. The proposed multimodal fusion approach is an effective strategy to balance this trade-off and to improve the quality of image representations. Our methods consistently outperformed the visual matching and late fusion baselines in the image retrieval task, providing the best balance between visual and semantic search.

We observed that, overall, semantic search strategies are very good at maximizing MAP, and our proposed strategies for early fusion can incorporate more visual information into the search process at the cost of small reductions in MAP. Fusion methods still require further investigation on how to better utilize visual features to satisfy the visual consistency or visual diversity criteria demanded by potential users, without decreasing the semantic meaningfulness of the retrieved results. Semantic-based indexing can also be exploited using keyword-based search instead of query by visual example, which was the main search paradigm evaluated in this work. Keyword-based search may also be executed on a multimodal index, since by definition it contains both information modalities: visual and semantic.

Further potential research directions include the application of this representation to other image analysis tasks, such as image classification and automated grading. Also, since the formulation of our method can handle arbitrarily large semantic vocabularies, we are interested in extending its applicability to large-scale biomedical image collections.

We make an argument in favor of multimodal indexing, not only because of its potential to significantly improve relative performance, as we have shown in this paper, but also because this strategy has the ability to model different user interaction mechanisms, which could be adapted according to real needs.


Nevertheless, an additional intriguing question beyond indexing mechanisms is: what is the minimum required performance for image search technologies in a real clinical setting? The impact that an image retrieval system might have in health care is promising [13], but it will require more coordinated and collaborative efforts to be widely adopted. Machine learning tools are capable of empowering clinicians with timely and relevant information to make evidence-based decisions, which may result in improved quality of care for patients. This is currently a driving force for a large body of research.

Acknowledgements

The authors would like to thank the anonymous reviewers for their constructive comments, which helped to improve and clarify this manuscript. This work was partially funded by the LACCIR–Microsoft project ‘‘Multimodal Image Retrieval to Support Medical Case-Based Scientific Literature Search’’.

References

[1] Kragel P, Kragel P. Digital microscopy: a survey to examine patterns of use and technology standards. In: Proceedings of the IASTED international conference on telehealth/assistive technologies. Anaheim (CA, USA): ACTA Press; 2008. p. 195–7.
[2] Müller H, Michoux N, Bandon D, Geissbuhler A. A review of content-based image retrieval systems in medical applications–clinical benefits and future directions. Int J Med Inf 2004;73(1):1–23.
[3] Datta R, Joshi D, Li J, Wang JZ. Image retrieval: ideas, influences, and trends of the new age. ACM Comput Surv 2008;40(2):1–60.
[4] Smeulders AW, Worring M, Santini S, Gupta A, Jain R. Content-based image retrieval at the end of the early years. IEEE Trans Pattern Anal Mach Intell 2000;22(12):1349–80.
[5] Barnard K, Duygulu P, Forsyth D, de Freitas N, Blei DM, Jordan MI. Matching words and pictures. J Mach Learn Res 2003;3:1107–35.
[6] Rasiwasia N, Moreno PJ, Vasconcelos N. Bridging the gap: query by semantic example. IEEE Trans Multimedia 2007;9(5):923–38.
[7] Tang HL, Hanka R, Ip HHS. Histological image retrieval based on semantic content analysis. IEEE Trans Inf Technol Biomed 2003;7(1):26–36.
[8] Naik J, Doyle S, Basavanhally A, Ganesan S, Feldman MD, Tomaszewski JE, et al. A boosted distance metric: application to content based image retrieval and classification of digitized histopathology. SPIE Med Imag: Comput-Aided Diagn 2009;7260:72603F1–12.
[9] Caicedo JC, Romero E, González FA. Content-based histopathology image retrieval using a kernel-based semantic annotation framework. J Biomed Inf 2011;44:519–28.
[10] Atrey PK, Hossain MA, El Saddik A, Kankanhalli MS. Multimodal fusion for multimedia analysis: a survey. Multimedia Syst 2010;16(6):345–79.
[11] La Cascia M, Sethi S, Sclaroff S. Combining textual and visual cues for content-based image retrieval on the world wide web. In: Proceedings of the IEEE workshop on content-based access of image and video libraries; 1998. p. 24–8.
[12] Nuray R, Can F. Automatic ranking of information retrieval systems using data fusion. Inf Process Manage 2006;42(3):595–614.
[13] Marchiori A. Automated storage and retrieval of thin-section CT images to assist diagnosis: system description and preliminary assessment. Radiology 2003;228:265–70.
[14] Bonnet N. Some trends in microscope image processing. Micron 2004;35(8):635–53.
[15] Doyle S, Hwang M, Shah K, Madabhushi A, Feldman M, Tomaszeweski J. Automated grading of prostate cancer using architectural and textural image features. In: 4th IEEE international symposium on biomedical imaging: from nano to macro; 2007. p. 1284–7.
[16] Zheng L, Wetzel AW, Gilbertson J, Becich MJ. Design and analysis of a content-based pathology image retrieval system. IEEE Trans Inf Technol Biomed 2003;7(4):249–55.
[17] Caicedo JC, Gonzalez FA, Romero E. A semantic content-based retrieval method for histopathology images. Inf Retriev Technol LNCS 2008;4993:51–60.
[18] Orlov N, Shamir L, Macura T, Johnston J, Eckley DM, Goldberg IG. WND-CHARM: multi-purpose image classification using compound image transforms. Pattern Recogn Lett 2008;29(11):1684–93.
[19] Tambasco M, Costello BM, Kouznetsov A, Yau A, Magliocco AM. Quantifying the architectural complexity of microscopic images of histology specimens. Micron 2009;40(4):486–94.
[20] Caicedo JC, Cruz A, Gonzalez FA. Histopathology image classification using bag of features and kernel functions. In: Artif Intell Med. Springer; 2009. p. 126–35.
[21] Mosaliganti K, Janoos F, Irfanoglu O, Ridgway R, Machiraju R, Huang K, et al. Tensor classification of N-point correlation function features for histology tissue segmentation. Med Image Anal 2009;13(1):156–66.


[22] Meng T, Lin L, Shyu M-L, Chen S-C. Histology image classification using supervised classification and multimodal fusion. In: 2010 IEEE international symposium on multimedia. IEEE; 2010. p. 145–52.
[23] Müller H, Kalpathy-Cramer J. The ImageCLEF medical retrieval task at ICPR 2010. In: Proceedings of the 20th international conference on pattern recognition; 2010. p. 3284–7.
[24] Kalpathy-Cramer J, Hersh W. Multimodal medical image retrieval: image categorization to improve search precision. In: Proceedings of the international conference on multimedia information retrieval. ACM; 2010. p. 165–74.
[25] Rahman M, Antani S, Thoma G. A learning-based similarity fusion and filtering approach for biomedical image retrieval using SVM classification and relevance feedback. IEEE Trans Inf Technol Biomed 2011;15(4):640–6.
[26] Müller H, Deselaers T, Deserno T, Clough P, Kim E, Hersh W. Overview of the ImageCLEFmed 2006 medical retrieval and medical annotation tasks. In: Evaluation of multilingual and multi-modal information retrieval. Springer; 2007. p. 595–608.
[27] Müller H, Eggel I, Bedrick S, Radhouani S, Bakke B, Kahn Jr. C, et al. Overview of the CLEF 2009 medical image retrieval track. In: Cross Language Evaluation Forum (CLEF) working notes.
[28] de Herrera AGS, Kalpathy-Cramer J, Demner-Fushman D, Antani S, Müller H. Overview of the ImageCLEF 2013 medical tasks. Working Notes of CLEF; 2013.
[29] Müller H, de Herrera AGS, Kalpathy-Cramer J, Demner-Fushman D, Antani S, Eggel I. Overview of the ImageCLEF 2012 medical image retrieval and classification tasks. In: CLEF (Online Working Notes/Labs/Workshop); 2012.
[30] Caicedo JC, BenAbdallah J, González FA, Nasraoui O. Multimodal representation, indexing, automated annotation and retrieval of image collections via non-negative matrix factorization. Neurocomputing 2012;76(1):50–60.
[31] Fan J, Gao Y, Luo H, Keim DA, Li Z. A novel approach to enable semantic and visual image summarization for exploratory image search. In: Proceedings of the 1st ACM international conference on multimedia information retrieval. ACM; 2008. p. 358–65.
[32] Romberg S, Lienhart R, Hörster E. Multimodal image retrieval. Int J Multimedia Inf Retriev 2012;1(1):31–44.
[33] Putthividhy D, Attias HT, Nagarajan SS. Topic regression multi-modal latent Dirichlet allocation for image annotation. In: 2010 IEEE conference on computer vision and pattern recognition. IEEE; 2010. p. 3408–15.
[34] Rusu M, Wang H, Golden T, Gow A, Madabhushi A. Multiscale multimodal fusion of histological and MRI lung volumes for characterization of lung inflammation. In: SPIE medical imaging, International Society for Optics and Photonics; 2013. p. 86720X.
[35] Meng T, Lin L, Shyu M-L, Chen S-C. Histology image classification using supervised classification and multimodal fusion. In: 2010 IEEE international symposium on multimedia. IEEE; 2010. p. 145–52.


[36] Vanegas JA, Caicedo JC, González FA, Romero E. Histology image indexing using a non-negative semantic embedding. In: Proceedings of the second MICCAI international conference on medical content-based retrieval for clinical decision support, vol. 7075. LNCS; 2012. p. 80–91 [chapter 8].
[37] Caicedo JC, Gonzalez FA, Triana E, Romero E. Design of a medical image database with content-based retrieval capabilities. Adv Image Video Technol LNCS 2007;4872:919–31.
[38] Cruz-Roa A, Caicedo JC, González FA. Visual pattern mining in histology image collections using bag of features. Artif Intell Med 2011;52(2):91–106.
[39] Manning CD, Raghavan P, Schütze H. Introduction to information retrieval. Cambridge University Press; 2008.
[40] Hare JS, Samangooei S, Lewis PH, Nixon MS. Semantic spaces revisited: investigating the performance of auto-annotation and semantic retrieval using semantic spaces. In: Proceedings of the 2008 international conference on content-based image and video retrieval. New York (NY, USA): ACM; 2008. p. 359–68.
[41] Lee DD, Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature 1999;401(6755):788–91.
[42] Barla A, Odone F, Verri A. Histogram intersection kernel for image classification. In: Proceedings of the international conference on image processing, vol. 3; 2003. p. 513–16.
[43] Grauman K, Darrell T. The pyramid match kernel: discriminative classification with sets of image features. In: Tenth IEEE international conference on computer vision, vol. 2; 2005.
[44] Hsu DF, Taksa I. Comparing rank and score combination methods for data fusion in information retrieval. Inf Retriev 2005;8(3):449–80.
[45] Mc Donald K, Smeaton AF. A comparison of score, rank and probability-based fusion methods for video shot retrieval. In: Image and video retrieval. Springer; 2005. p. 61–70.
[46] Lee JH. Analyses of multiple evidence combination. In: Special interest group on information retrieval. ACM SIGIR conference, vol. 31. ACM; 1997. p. 267–76.
[47] Makadia A, Pavlovic V, Kumar S. A new baseline for image annotation. In: Proceedings of the 10th European conference on computer vision. Berlin, Heidelberg: Springer-Verlag; 2008. p. 316–29.
[48] Carpineto C, Romano G. A survey of automatic query expansion in information retrieval. ACM Comput Surv 2012;44(1):1–50.
[49] Cruz-Roa A, Arevalo JE, Madabhushi A, González FA. A deep learning architecture for image representation, visual interpretability and automated basal-cell carcinoma cancer detection. In: Medical image computing and computer-assisted intervention–MICCAI 2013. Springer; 2013. p. 403–10.
[50] Wang H, Cruz-Roa A, Basavanhally A, Gilmore H, Shih N, Feldman M, et al. Cascaded ensemble of convolutional neural networks and handcrafted features for mitosis detection; 2014.
