
Holistic and Comprehensive Annotation of Clinically Significant Findings on Diverse CT Images: Learning from Radiology Reports and Label Ontology

Ke Yan¹, Yifan Peng², Veit Sandfort¹, Mohammadhadi Bagheri¹, Zhiyong Lu², Ronald M. Summers¹
¹ Imaging Biomarkers and Computer-Aided Diagnosis Laboratory, Clinical Center
² National Center for Biotechnology Information, National Library of Medicine
¹,² National Institutes of Health, Bethesda, MD 20892
{ke.yan, yifan.peng, veit.sandfort, mohammad.bagheri, zhiyong.lu, rms}@nih.gov

Abstract

In radiologists' routine work, one major task is to read a medical image, e.g., a CT scan, find significant lesions, and describe them in the radiology report. In this paper, we study the lesion description or annotation problem. Given a lesion image, our aim is to predict a comprehensive set of relevant labels, such as the lesion's body part, type, and attributes, which may assist downstream fine-grained diagnosis. To address this task, we first design a deep learning module to extract relevant semantic labels from the radiology reports associated with the lesion images. With the images and text-mined labels, we propose a lesion annotation network (LesaNet) based on a multilabel convolutional neural network (CNN) to learn all labels holistically. Hierarchical relations and mutually exclusive relations between the labels are leveraged to improve the label prediction accuracy. The relations are utilized in a label expansion strategy and a relational hard example mining algorithm. We also attach a simple score propagation layer to LesaNet to enhance recall and explore implicit relations between labels. Multilabel metric learning is combined with classification to enable interpretable prediction. We evaluated LesaNet on the public DeepLesion dataset, which contains over 32K diverse lesion images. Experiments show that LesaNet can precisely annotate the lesions using an ontology of 171 fine-grained labels with an average AUC of 0.9344. The labels of DeepLesion and the code have been released¹.

1. Introduction

In recent years, there has been remarkable progress on computer-aided diagnosis (CAD) based on medical images, especially with the help of deep learning technologies [23, 36].

¹ https://github.com/rsummers11/CADLab/tree/master/LesaNet

Figure 1. The overall framework. We propose the lesion annotation network (LesaNet) to predict fine-grained labels that describe diverse lesions on CT images. The training labels are text-mined from radiology reports. Label relations are utilized in learning.

Lesion classification is one of the most important topics in CAD. Typical applications include using medical images to classify the type of liver lesions and lung tissues [10, 15, 36], to describe the fine-grained attributes of pulmonary nodules and breast masses [5, 30], and to predict their malignancy [6, 12]. However, existing studies on this topic usually focus on certain body parts (lung, breast, liver, etc.) and attempt to distinguish between a limited set of labels. Hence, many clinically meaningful lesion labels covering different body parts have not yet been explored. Besides, in practice, multiple labels can be assigned per lesion, and these labels are often correlated.

In this paper, we tackle a more general and clinically useful problem to mimic radiologists. When an experienced radiologist reads a medical image such as a computed tomography (CT) scan, he or she can detect all kinds of lesions in various body parts, identify the lesions' detailed information, including the associated body part, type, and attributes, and finally link these labels to a predefined ontology. We aim to develop a new framework to predict these semantic labels holistically (jointly learning all labels), so as to move one step closer to the goal of "learning to read CT images". In brief, we wish the computer to recognize where, what, and how the lesion is, helping the user comprehensively understand it. We call this task lesion annotation due to its analogy to the multilabel image annotation/tagging problem in the general computer vision literature [49].

To learn to annotate lesions, a large-scale and diverse dataset of lesion images is needed. Existing lesion datasets [8, 34] are typically either too small or insufficiently diverse. Fortunately, the recently released DeepLesion dataset [47, 48] has largely mitigated this limitation. It contains bounding boxes of over 32K lesions from a variety of body parts on CT images. However, no fine-grained semantic labels are given for each lesion in DeepLesion. Manual annotation is tedious, expensive, and not scalable, not to mention that it requires experts with considerable domain knowledge. Inspired by recent studies [43, 41], we take an automatic data-mining approach to extract labels from radiology reports. Reports contain rich but complex information about multiple findings in the medical image. In the course of interpreting a CT scan, a radiologist may manually annotate a lesion in the image and place a hyperlink to the annotation (a "bookmark") in the report. We first locate the sentence with a bookmark in the report that refers to a lesion, then extract labels from the sentence. We defined a fine-grained ontology based on the RadLex lexicon [19]. This process is entirely data-driven and requires minimal manual effort, so it can easily be employed to build large datasets with rich vocabularies. A sample lesion image, sentence, and labels can be found in Fig. 1.

We propose a LESion Annotation NETwork (LesaNet) to predict semantic labels given a lesion image of interest. This lesion annotation task is treated as a multilabel image classification problem [49]. Despite extensive previous studies [11, 14, 16, 42, 21], our problem is particularly challenging for several reasons: 1) Radiology reports are often free-text, so extracted labels can be noisy and incomplete [43]. 2) Some labels are difficult to distinguish or learn, e.g., adjacent body parts, similar types, and subtle attributes. 3) The labels are highly imbalanced and long-tailed. To tackle these challenges, we present the framework shown in Fig. 1. First, we reduce the noise in training labels with a text-mining module. The module analyzes the report to find the labels relevant to the lesion of interest. Second, we build an ontology which includes the hierarchical hyponymy and mutually exclusive relations between the labels. With the hierarchical relations, we apply a label expansion strategy to infer the missing parent labels. The exclusive relations are used in a relational hard example mining (RHEM) algorithm to help LesaNet learn hard cases and improve precision. Third, we attach a simple score propagation layer to enhance recall, especially for rare labels. Finally, metric learning is incorporated into LesaNet to not only improve classification accuracy but also enable prediction interpretability.

The main contributions of this work include the following: 1) We study the holistic lesion annotation problem and propose an automatic learning framework with minimal manual annotation effort; 2) An algorithm is proposed to text-mine relevant labels from radiology reports; 3) We present LesaNet, an effective lesion annotation algorithm that can also be adopted in other multilabel image classification problems; and 4) To leverage the ontology-based medical knowledge, we incorporate label relations in LesaNet.

2. Related Work

Medical image analysis with reports: Annotating medical images is tedious and requires considerable medical knowledge. To reduce the manual annotation burden, some researchers have leveraged the rich information contained in associated radiology reports. Disease-related labels have been mined from reports for classification and weakly-supervised localization on X-ray [43, 41] and CT images [35, 15]. This approach boosts the size of datasets and label sets. However, current studies can only extract image-level labels, which cannot be accurately mapped to specific lesions on the image. The DeepLesion dataset² consists of lesions from a variety of body parts on CT images. It has been adopted to train algorithms for universal lesion detection [46], retrieval [48], and segmentation and measurement [3, 40]. This paper explores its use for lesion-level semantic annotation. Another line of study directly generates reports from the whole image [44, 50]. Although the generated reports may learn to focus on certain lesions on the image, it is difficult to assess the usability of generated reports. The key information in a report is its labels. If we can accurately predict the labels for each lesion on the image, the creation of high-quality (structured) reports becomes straightforward.

Multilabel image classification: Multilabel image classification [49] is a long-standing topic that has been tackled from multiple angles. A direct idea is to treat each label independently with a binary cross-entropy loss [43]. The pairwise ranking loss is applied in [45, 14, 21] to make the scores of positive labels larger than those of the negative ones for each sample. CNN-RNN [42] uses a recurrent model to predict multiple labels one by one. It can implicitly model label dependency and avoid the score thresholding issue. In [11], deep metric learning and hard example mining are combined to deal with imbalanced labels.

² https://nihcc.box.com/v/DeepLesion


Noisy and incomplete training labels often exist in datasets mined from the web [7], which is similar to our labels mined from reports. Strategies to handle them include data filtering [18], noise-robust losses [31], noise modeling [25], finding reliable negative samples [20], and so on. We use a text-mining module to filter noisy positive labels and leverage label relations to find reliable negative labels. Label relations have also been exploited to improve image classification. Novel loss functions were proposed in [9] for labels with hierarchical tree-like structures. In [24, 16], prediction scores of different labels are propagated between network layers whose structure is designed to capture label relations. We apply label expansion and RHEM strategies to use label relations explicitly, and at the same time employ a score propagation layer to learn them implicitly.

3. Label Mining and Ontology

3.1. Ontology Construction

We constructed our lesion ontology based on RadLex [19], a comprehensive lexicon for the standardized indexing and retrieval of radiology information resources [26, 1]. The labels in our lesion ontology fall into three classes: 1. Body parts, which include coarse-level body parts (e.g., chest, abdomen), organs (lung, lymph node), fine-grained organ parts (right lower lobe, pretracheal lymph node), and other body regions (porta hepatis, paraspinal); 2. Types, which include general terms (nodule, mass) and more specific ones (adenoma, liver mass); and 3. Attributes, which describe the intensity, shape, size, etc., of the lesions (hypodense, spiculated, large).

The labels in the lesion ontology are organized in a hierarchical structure (Fig. 2). For example, a fine-grained body part (left lung) can be a part of a coarse-scale one (lung); a type (hemangioma) can be a sub-type of another one (neoplasm); and a type (lung nodule) can be located in a body part (lung). These relations form a directed graph instead of a tree, because one child (lung nodule) may have multiple parents (lung, nodule). Some labels are also mutually exclusive, meaning that the presence of one label signifies the absence of the others (e.g., left and right lung). However, in Fig. 2, chest and lymph node are not exclusive because they may physically overlap; lung nodule and ground-glass opacity are not exclusive either, since they may coexist in one lesion. We hypothesize that if labels a and b are exclusive, any child of a and any child of b are also exclusive. This rule helps us annotate exclusive labels, as the sketch below illustrates.
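As an illustration of this rule, the following minimal Python sketch expands a small seed set of exclusive pairs over a toy parent-to-child graph. The function and variable names (`descendants`, `expand_exclusive_pairs`, `children`) are ours, not from the paper's released code, and the toy ontology fragment is only loosely based on Fig. 2.

```python
from itertools import product

def descendants(label, children):
    """All labels reachable from `label` in the parent->child graph, including itself."""
    stack, seen = [label], {label}
    while stack:
        for c in children.get(stack.pop(), []):
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen

def expand_exclusive_pairs(seed_pairs, children):
    """If labels a and b are exclusive, every descendant of a is exclusive
    with every descendant of b (the rule hypothesized in the text)."""
    expanded = set()
    for a, b in seed_pairs:
        for x, y in product(descendants(a, children), descendants(b, children)):
            if x != y:
                expanded.add((min(x, y), max(x, y)))
    return expanded

# Toy ontology fragment (illustrative only)
children = {"left lung": ["left upper lobe"], "right lung": ["right lower lobe"]}
print(expand_exclusive_pairs({("left lung", "right lung")}, children))
# One seed pair expands to four exclusive pairs over the two subtrees.
```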

3.2. Relevant Label Extraction

Figure 2. Sample labels with relations. Blue, red, and green labels correspond to body parts, types, and attributes, respectively. Single-headed arrows point from the parent to the child. Double-headed arrows indicate exclusive labels.

After constructing the lesion ontology, we extracted labels from the associated radiology reports of DeepLesion [47]. In the reports, radiologists describe the lesions and sometimes insert hyperlinks, size measurements, or slice numbers (known as bookmarks) in the sentence to refer to the image of interest. In this work, we only used the sentences with bookmarks to text-mine labels associated with the lesions. First, we tokenized the sentence and lemmatized its words using NLTK [2] to obtain their base forms. Then, we matched the named-entity mentions in the preprocessed sentences and normalized them to labels based on their synonyms.
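The matching step can be sketched as follows. This is our own minimal illustration rather than the paper's released code; the tiny `SYNONYMS` table is hypothetical (a real one would be derived from the RadLex-based ontology), and the NLTK tokenizer and lemmatizer resources are assumed to be installed.

```python
import nltk
from nltk.stem import WordNetLemmatizer

# Hypothetical synonym table: surface form -> normalized ontology label.
SYNONYMS = {
    "nodule": "nodule",
    "large": "large",
    "right lower lobe": "right lower lobe",
    "rll": "right lower lobe",
}

# May require: nltk.download("punkt"); nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()

def extract_labels(sentence, max_ngram=4):
    """Lemmatize the sentence, then match n-grams against the synonym table."""
    tokens = [lemmatizer.lemmatize(t.lower()) for t in nltk.word_tokenize(sentence)]
    found = set()
    for n in range(1, max_ngram + 1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in SYNONYMS:
                found.add(SYNONYMS[phrase])
    return found

print(extract_labels("Unchanged large nodules in the right lower lobe."))
# -> {'large', 'nodule', 'right lower lobe'}
```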

The bookmarked sentences often contain a complex mixture of information, describing not only the bookmarked lesion but also other related lesions and unrelated findings. A sample sentence is shown in Fig. 1, where the word "BOOKMARK" is the hyperlink of interest, while "OTHER_BMK" is the hyperlink for another lesion. There are 4 labels matched based on the ontology, namely large, nodule, right lower lobe, and right middle lobe. Among them, "right lower lobe" is irrelevant, since it describes another lesion. In other examples, there are also uncertain labels such as "adenopathy or mass". Since both the irrelevant and uncertain labels may bring noise to downstream training, we developed a text-mining module to distinguish them from relevant labels. Specifically, we reformulate this as a relation classification problem: given a sentence with multiple labels and bookmarks, we aim to assign relevant labels to each bookmark from all label-bookmark pairs.

To achieve this, we use a CNN model based on Peng et al. [28, 29]. The input of our model consists of two parts: the word sequence with the mentioned labels and bookmarks, and the sentence embedding [4]. The model outputs a probability vector corresponding to the type of relation between the label and the bookmark (irrelevant, uncertain, or relevant). Due to space limits, we refer readers to [29] for details of this algorithm.

4. Lesion Annotation Network (LesaNet)

Fig. 3 displays the framework of the proposed lesion annotation network (LesaNet). In this section, we introduce each component in detail.

Figure 3. The framework of LesaNet. The input is the lesion image patch and the final output is the refined score vector $\tilde{\mathbf{s}}$. The expanded labels are used to train LesaNet and optimize the four losses. Modules in red are our main contributions.

4.1. Multiscale Multilabel CNN

The backbone of the network is VGG-16 [38] with batch normalization [17]. In our task, different labels may be best modeled by features at different levels. For instance, body parts require high-level contextual features, while many attributes depict low-level details. Therefore, we use a multiscale feature representation similar to [48]. Region of interest pooling layers (RoIPool) [13] are used to pool the feature maps to 5 × 5 in each convolutional block. For conv1_2, conv2_2, and conv3_3, the RoI is the bounding box of the lesion in the patch, to focus on its details. For conv4_3 and conv5_3, the RoI is the entire patch, to capture the context. Each pooled feature map is then projected to a 256D vector by a fully-connected (FC) layer, and the vectors are concatenated. After another FC layer, the network outputs a score vector $\mathbf{s} \in \mathbb{R}^C$, where $C$ is the number of labels. Because positive cases are sparse for most labels, we adopt a weighted cross-entropy (CE) loss [43] for each label:

$$L_{\text{WCE}} = -\sum_{i=1}^{B} \sum_{c=1}^{C} \left( \beta^p_c\, y_{i,c} \log \sigma_{i,c} + \beta^n_c\, (1 - y_{i,c}) \log(1 - \sigma_{i,c}) \right), \quad (1)$$

where $B$ is the number of lesion images in a minibatch; $\sigma_{i,c} = \mathrm{sigmoid}(s_{i,c})$ is the confidence of lesion $i$ having label $c$, whose ground truth is $y_{i,c} \in \{0, 1\}$; and the loss weights are $\beta^p_c = (P_c + N_c)/(2P_c)$ and $\beta^n_c = (P_c + N_c)/(2N_c)$, where $P_c$ and $N_c$ are the numbers of positive and negative cases of label $c$ in the training set, respectively.
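For concreteness, here is a minimal PyTorch sketch of Eq. 1. It is our own illustration, not the released code: the function name is ours, and the toy weights are computed from the minibatch for brevity, whereas the paper computes $P_c$ and $N_c$ over the whole training set.

```python
import torch

def weighted_ce_loss(scores, targets, beta_p, beta_n):
    """Per-label weighted binary cross-entropy (Eq. 1).

    scores:  (B, C) raw logits s
    targets: (B, C) binary ground truth y
    beta_p, beta_n: (C,) class weights (P+N)/(2P) and (P+N)/(2N)
    """
    sigma = torch.sigmoid(scores).clamp(1e-6, 1 - 1e-6)  # avoid log(0)
    loss = -(beta_p * targets * torch.log(sigma)
             + beta_n * (1 - targets) * torch.log(1 - sigma))
    return loss.sum()  # summed over the minibatch and labels, as in Eq. 1

# Toy usage with random data
B, C = 4, 171
scores = torch.randn(B, C)
targets = (torch.rand(B, C) > 0.9).float()
pos = targets.sum(0).clamp(min=1)            # per-label positive counts
beta_p = B / (2 * pos)                       # illustrative; the paper derives
beta_n = B / (2 * (B - pos).clamp(min=1))    # these from the full training set
print(weighted_ce_loss(scores, targets, beta_p, beta_n))
```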

4.2. Leveraging Label Relations

Label expansion: Labels extracted from reports are incomplete. The hierarchical label relations can help us infer the missing parent labels: if a child label is true, all of its parents should also be true. In this way, we can infer the labels "right lung", "lung", and "chest" in Fig. 3 from the existing label "right mid lung", in both training and inference.
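A minimal sketch of this expansion, assuming the ontology is stored as a child-to-parents adjacency map (the names are ours):

```python
def ancestors(label, parents):
    """All ancestors of `label` in the child->parents graph (may be a DAG)."""
    stack, seen = [label], set()
    while stack:
        for p in parents.get(stack.pop(), []):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def expand_labels(labels, parents):
    """If a child label is positive, mark all of its ancestors positive too."""
    expanded = set(labels)
    for lb in labels:
        expanded |= ancestors(lb, parents)
    return expanded

parents = {"right mid lung": ["right lung"], "right lung": ["lung"], "lung": ["chest"]}
print(expand_labels({"right mid lung", "nodule"}, parents))
# -> {'right mid lung', 'right lung', 'lung', 'chest', 'nodule'}
```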

Relational hard example mining (RHEM): Label expansion cannot fill in other missing labels if their child labels are not mentioned in the report. This happens when radiologists do not describe every attribute of a lesion or omit the fine-grained body part. Although it is hard to retrieve these missing positive labels, we can utilize the exclusive relations to find reliable negative labels. In other words, if the expanded labels of a lesion are reliably 1, then their exclusive labels should be reliably 0.

One challenge of our task is that some labels are difficult to learn, and we want the loss function to emphasize them automatically. Inspired by online hard example mining (OHEM) [37], we define the online difficulty of label $c$ for lesion $i$ as

$$\delta_{i,c} = |\sigma_{i,c} - y_{i,c}|^{\gamma}, \quad (2)$$

where $\gamma > 0$ is a focusing hyper-parameter similar to the one in the focal loss [22]; a higher $\gamma$ puts more focus on hard examples. We then sample $S$ lesion-label pairs in the minibatch according to $\delta$ and compute their average CE loss. The higher $\delta_{i,c}$ is, the more times the pair is sampled, so the loss automatically focuses on hard lesion-label pairs. This stochastic sampling strategy works better in our experiments than the selection strategy in OHEM [37] and the reweighting one in the focal loss [22]. Importantly, sampling is performed only on reliable lesion-label pairs, so as to avoid treating missing positive labels as hard negatives. RHEM also works as a dynamic weighting mechanism for imbalanced labels, so there is no need to impose weights [37] such as the $\beta$ in Eq. 1. We combine the CE loss of RHEM with Eq. 1 instead of replacing it, since some labels have no exclusive counterparts and must be learned from Eq. 1.
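The sampling step might look like the following PyTorch sketch. This is our own illustration; the construction of `reliable_mask` (1 for the expanded positive labels and their exclusive negatives, 0 elsewhere) is assumed to be given, and the names are ours.

```python
import torch

def rhem_loss(scores, targets, reliable_mask, S=10_000, gamma=2.0):
    """A sketch of RHEM: sample S reliable lesion-label pairs with
    probability proportional to the difficulty delta = |sigma - y|^gamma
    (Eq. 2), then average the CE loss of the sampled pairs."""
    sigma = torch.sigmoid(scores).clamp(1e-6, 1 - 1e-6)
    delta = (sigma - targets).abs() ** gamma
    delta = delta * reliable_mask           # only reliable pairs may be sampled
    probs = delta.flatten()
    probs = probs / probs.sum()             # assumes at least one reliable pair
    idx = torch.multinomial(probs, S, replacement=True)
    ce = -(targets * torch.log(sigma) + (1 - targets) * torch.log(1 - sigma))
    return ce.flatten()[idx].mean()
```

Sampling with replacement realizes the "harder pairs are sampled more times" behavior described above.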

4.3. Score Propagation Layer

A score propagation layer (SPL) is attached at the end of LesaNet (Fig. 3). It is a simple FC layer that refines the predicted scores with a linear transformation matrix $W$, followed by a weighted CE loss (Eq. 1). $W$ is initialized as an identity matrix and can learn to capture the first-order correlation between labels. Although the hierarchical and exclusive label relations have been explicitly expressed by label expansion and RHEM, SPL is still useful: it can enhance the scores of positively related labels and suppress the scores of labels with negative correlation and clear separation. On the other hand, some exclusive labels can be very similar in location and appearance, for instance, hemangioma and metastasis in the liver. When SPL sees a high score for hemangioma, it knows the lesion may also be a metastasis, since in some cases they are hard to distinguish. Therefore, SPL actually increases the score for metastasis slightly instead of suppressing it. This mechanism is particularly beneficial for improving the recall of rare labels, whose prediction scores are often low. This rationale distinguishes SPL from previous knowledge propagation methods [16] that enforce negative weights on exclusive labels, which led to lower performance in our task. By observing the learned $W$, we can also discover more label correlations and compare them with our prior knowledge.
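As a sketch, SPL can be written as a single linear layer whose weight matrix starts at the identity, so training begins from a no-op refinement. Whether the paper's FC layer includes a bias term is not stated; we assume none here.

```python
import torch
import torch.nn as nn

class ScorePropagationLayer(nn.Module):
    """A minimal sketch of the SPL: refined scores s_tilde = W s,
    with W initialized to the identity matrix."""
    def __init__(self, num_labels):
        super().__init__()
        self.W = nn.Parameter(torch.eye(num_labels))

    def forward(self, scores):          # scores: (batch, num_labels)
        return scores @ self.W.t()      # row b of the output equals W @ scores[b]

spl = ScorePropagationLayer(171)
refined = spl(torch.randn(2, 171))      # same shape as the input scores
```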

4.4. Multilabel Triplet Loss

Interpretability is important for CAD tasks [32]. We expect the algorithm to provide evidence for its predictions. After classifying a lesion, it is desirable for LesaNet to show lesions in the database that have similar labels, which helps the user better understand the prediction as well as the lesion itself. This is a joint lesion annotation and retrieval problem. Lesion retrieval was studied in [48], but only 8 coarse-scale body-part labels were used. In this paper, we use the comprehensive labels mined from reports to learn a feature embedding that models the similarity between lesions. As shown in Fig. 3, an FC layer is applied to project the multiscale features to a 256D vector, followed by a triplet loss [33]. To measure the similarity between two images with multiple labels, Zhao et al. [51] used the number of common positive labels as a criterion. However, we argue that each lesion may have a different number of labels, so the number of disjoint positive labels also matters. Suppose $X$ and $Y$ are the sets of positive labels of lesions $A$ and $B$; we use the following similarity criterion:

$$\mathrm{sim}(A, B) = |X \cap Y|^2 / |X \cup Y|. \quad (3)$$

During training, we first randomly sample an anchor lesion $A$ from the minibatch, then find a similar lesion $B$ in the minibatch such that $\mathrm{sim}(A, B) \geq \theta$, and finally find a dissimilar lesion $C$ such that $\mathrm{sim}(A, C) < \mathrm{sim}(A, B)$, where $\theta$ is the similarity threshold. We sample $T$ such triplets from the minibatch and calculate the triplet loss

$$L_{\text{triplet}} = \frac{1}{T} \sum_{t=1}^{T} \max(0,\, d(A, B) - d(A, C) + \mu), \quad (4)$$

where $d(A, B)$ is the $L_2$ distance between the embeddings of $A$ and $B$, and $\mu$ is the margin. $L_{\text{triplet}}$ pulls lesions with similar label sets closer in the embedding space.
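The sampling strategy and Eq. 4 can be sketched as follows; this is our own illustration (the names and the retry cap are assumptions), not the released code.

```python
import random
import torch

def multilabel_sim(X, Y):
    """Similarity of two positive-label sets (Eq. 3)."""
    return len(X & Y) ** 2 / len(X | Y)

def sample_triplets(label_sets, T, theta=1.0, max_tries=100_000):
    """Sample (anchor, similar, dissimilar) index triplets from a minibatch
    following Sec. 4.4; gives up after max_tries to stay a safe sketch."""
    n, triplets, tries = len(label_sets), [], 0
    while len(triplets) < T and tries < max_tries:
        tries += 1
        a = random.randrange(n)
        pos = [j for j in range(n)
               if j != a and multilabel_sim(label_sets[a], label_sets[j]) >= theta]
        if not pos:
            continue
        b = random.choice(pos)
        s_ab = multilabel_sim(label_sets[a], label_sets[b])
        neg = [j for j in range(n)
               if j != a and multilabel_sim(label_sets[a], label_sets[j]) < s_ab]
        if neg:
            triplets.append((a, b, random.choice(neg)))
    return triplets

def triplet_loss(emb, triplets, mu=0.1):
    """Average hinge loss over the sampled triplets (Eq. 4)."""
    loss = emb.new_zeros(())
    for a, b, c in triplets:
        loss = loss + torch.relu(torch.dist(emb[a], emb[b])
                                 - torch.dist(emb[a], emb[c]) + mu)
    return loss / max(len(triplets), 1)
```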

The final loss of LesaNet combines the four components:

$$L = L_{\text{WCE}} + L_{\text{CE,RHEM}} + L_{\text{WCE,SPL}} + \lambda L_{\text{triplet}}. \quad (5)$$

5. Experiments

5.1. Dataset

From DeepLesion and its associated reports, we gathered 19,213 lesions with sentences as the training set, 1,852 as the validation set, and 1,759 as the test set. Each patient was assigned to only one of the subsets. The total is smaller than DeepLesion because not all lesions have bookmarks in the reports. We extracted labels as per Sec. 3.2, then kept the labels occurring at least 10 times in the training set and 2 times in both the validation (val) and test sets, resulting in a list of 171 unique labels. Among them, there are 115 body parts, 27 types, and 29 attributes. We extracted hierarchical label relations from RadLex, followed by manual review, and obtained 137 parent-child pairs. We further invited a radiologist to annotate mutually exclusive labels and obtained 4,461 exclusive pairs.

We manually annotated the label relevance (relevant / uncertain / irrelevant, Sec. 3.2) in the val and test sets, with verification by two expert radiologists. As a result, there are 4,648 relevant, 443 uncertain, and 1,167 irrelevant labels in the test set. The text-mining module was trained on the val set and applied to the training set. Labels predicted as relevant or uncertain in the training set were then used to train LesaNet. LesaNet was evaluated on the relevant labels in the test set. Because the bookmarked sentences may not include all information about a lesion, the test set may have missing annotations when relying on sentences only. Hence, two radiologists further manually annotated 500 random lesions in the test set in a more comprehensive fashion. On average, there are 4.2 labels per lesion in the original test set and 5.4 in the hand-labeled test set; an average of 1.2 labels are missing from each bookmarked sentence. We call the original test set the "text-mined test set" because its labels were mined from reports. The second, hand-labeled test set is also used to evaluate LesaNet.

5.2. Implementation Details

For each lesion, we cropped a 120mm² patch around it as the input of LesaNet. To encode 3D information, we used 3 neighboring slices to compose a 3-channel image. Other details about the dataset and image preprocessing are presented in Sec. 7.1. For the weighted CE loss, we clamped the weights β to at most 300 to ensure training stability. For RHEM, we set γ = 2 and S = 10⁴. For the triplet loss, we empirically set θ = 1, µ = 0.1, and T = 5000. The triplet-loss weight was λ = 5, since this loss is generally smaller than the other loss terms. LesaNet was implemented in PyTorch [27] and trained from scratch. Lesions with at least one positive label were used in training. The batch size was 128. LesaNet was trained using stochastic gradient descent (SGD) with a learning rate of 0.01 for 10 epochs, then 0.001 for 5 more epochs.

5.3. Evaluation Metric

The AUC, i.e., the area under the receiver operating characteristic (ROC) curve, is a popular metric in CAD tasks [43, 6]. However, AUC is a rank-based metric that does not involve a label decision, so it cannot evaluate the quality of the final predicted label set in the multilabel setting. Thus, we also computed the precision, recall, and F1 score for each label, which are often used in multilabel image classification tasks [49]. Each metric was averaged across labels with equal weights (per-class averaging). Overall averaging [49] was not adopted because it biases towards the frequent labels (chest, abdomen, etc.), which are less informative. To turn confidence scores into label decisions, we calibrated a threshold for each label that yielded the best F1 on the validation set, and then applied it to the test set.
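The per-label calibration can be sketched as a simple grid search over thresholds; the grid and function name below are our own assumptions.

```python
import numpy as np

def calibrate_thresholds(scores, labels, grid=np.linspace(0.05, 0.95, 19)):
    """For each label, pick the score threshold that maximizes F1 on the
    validation set (a minimal sketch of the calibration described above).

    scores: (N, C) confidences in [0, 1]; labels: (N, C) binary ground truth.
    """
    thresholds = np.zeros(scores.shape[1])
    for c in range(scores.shape[1]):
        best_f1 = -1.0
        for t in grid:
            pred = scores[:, c] >= t
            tp = np.sum(pred & (labels[:, c] == 1))
            prec = tp / max(pred.sum(), 1)
            rec = tp / max((labels[:, c] == 1).sum(), 1)
            f1 = 2 * prec * rec / max(prec + rec, 1e-9)
            if f1 > best_f1:
                best_f1, thresholds[c] = f1, t
    return thresholds
```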

5.4. Lesion Annotation Results

A comparison of different methods and an ablation study of our method are shown in Table 1. The baseline method is the multiscale multilabel CNN described in Sec. 4.1. The weighted approximate ranking pairwise loss (WARP) [14] is a widely used multilabel loss that aims to rank positive labels higher than negative ones; we applied it to the multiscale multilabel CNN and specified that fine-grained labels should rank higher than coarse-scale ones if both are positive. Lesion embedding [48] was trained on DeepLesion based on labels of coarse-scale body parts, lesion location, and size. Among these four methods, LesaNet achieved the best AUC and F1 scores on the two test sets.

The AUCs in Table 1 are relatively high: the algorithms correctly ranked most positive cases higher than negative ones, demonstrating their effectiveness. However, the F1 scores are relatively low, mainly for two reasons: 1) The dataset is highly imbalanced with many rare labels. A total of 78 labels have fewer than 10 positive cases in the text-mined test set. These labels may have many more false positives (FPs) than true positives (TPs) at test time, resulting in a low F1. 2) There are missing annotations in the test sets, which is why the accuracies (especially the precisions) on the hand-labeled set are significantly higher than on the text-mined test set.

Accuracies of some typical labels on the text-mined test set are displayed in Table 2. The average AUCs of body parts, types, and attributes are 0.9656, 0.9044, and 0.8384, respectively. Body parts are easier to predict since they typically have more regular appearances. The visual features of some labels (e.g., paraspinal, nodule) are variable and thus harder to learn. The high AUC and low F1 of "paraspinal" can be explained by the lack of positive test cases (see the explanation in the previous paragraph). Some types (e.g., metastasis) can be better predicted by incorporating additional prior knowledge and reasoning. Attributes have lower AUCs, partially because some attributes are subjective ("large") or subtle ("sclerotic"). Besides, radiologists typically do not describe every attribute of a lesion in the report, so there are missing annotations in the test set.

Fig. 4 demonstrates examples of our predictions. LesaNet accurately predicted the labels of many lesions. For example, in subplots (a) and (b), two fine-grained body parts (right hilum and pretracheal lymph nodes) were identified; in (c) and (d), a ground-glass opacity and a cavitary lung lesion; in (g) and (h), a hemangioma and a metastasis in the liver. Some attributes were also predicted correctly, such as "calcified" in (e), "lobular" in (h), and "tiny" in (i). Errors can occur on similar body parts and types. In (c), although "left lower lobe" has a high score, "left upper lung" was also predicted, since the two body parts are close. In (g), "metastasis" is a wrong prediction, as it may be hard to distinguish from hemangioma in certain cases. Some rare and/or variable labels were not learned very well, such as "conglomerate" and "necrosis" in (b). Please see the supplementary material for more results.

It is efficient to learn all labels jointly and holistically. Furthermore, our experiments showed that doing so does not harm the accuracy of individual labels. We conducted an experiment to train and test LesaNet on subsets of labels. For example, subset 1 consists of labels with more than 1000 occurrences in the training set (n_tr > 1000), and subset 2 contains labels with n_tr > 500. When trained on subset 2, we can test on both subsets 1 and 2 to see whether the accuracy on subset 1 has degraded. The results are exhibited in Fig. 5. For the same test subset, the F1 score did not change significantly as the number of training labels increased. Thus, with more data harvested, we may safely add more clinically meaningful labels into training. On the other hand, as more rare labels were added to the test set, the F1 became lower. Fine-grained body parts, types, and many attributes are rare, and they are harder to learn due to the lack of training cases. Possible solutions include harvesting more data automatically [47] and using few-shot learning [39].

5.5. Ablation Study and Analysis

Score propagation layer: From the ablation study in Table 1, we find that removing SPL decreased the average per-class recall by 3%. Within this, the recall of frequent labels (n_tr > 1000) only decreased by 0.4%, showing that SPL is important for the recall of rare labels, at the cost of a small precision loss. We further examined the learned transformation matrix W in SPL; see Fig. 6 for an example. We can find that W(liver, hemangioma) and W(enhancing, hemangioma) are high.


Method                         Text-mined test set                  Hand-labeled test set
                               AUC     Prec.   Recall  F1           AUC     Prec.   Recall  F1
Multiscale multilabel CNN      0.9048  0.2738  0.5224  0.2823       0.9151  0.3823  0.5340  0.3894
WARP [14]                      0.9250  0.2441  0.6202  0.3017       0.9316  0.6677  0.3273  0.3325
Lesion embedding [48]          0.8933  0.2290  0.5767  0.2610       0.9017  0.3496  0.5776  0.3615
LesaNet                        0.9344  0.3593  0.5327  0.3423       0.9398  0.4737  0.5274  0.4344
  w/o score propagation layer  0.9275  0.3680  0.4733  0.3233       0.9326  0.4833  0.4965  0.4092
  w/o RHEM                     0.9338  0.2983  0.5550  0.3178       0.9374  0.4341  0.5327  0.4303
  w/o label expansion          0.9148  0.3523  0.5104  0.3270       0.9236  0.4503  0.5420  0.4205
  w/o text-mining module       0.9334  0.3365  0.5350  0.3324       0.9392  0.4869  0.5361  0.4250
  w/o triplet loss             0.9312  0.3201  0.5394  0.3274       0.9335  0.4645  0.5624  0.4337

Table 1. Multilabel classification accuracy averaged across labels on the two test sets. Bold results are the best ones. Red underlined results in the ablation studies are the worst ones, indicating the ablated strategy is the most important for that criterion.

Figure 4. Sample predicted labels with confidence scores on the text-mined test set. Green, red, and blue results correspond to TPs, FPs, and FNs (false negatives), respectively. Underlined labels are TPs with missing annotations and were thus treated as FPs during evaluation. Only the most fine-grained predictions are shown, with their parents omitted for clarity.

This means SPL discovered the fact that if a lesion is a hemangioma in DeepLesion, it is highly likely in the liver and enhancing, so SPL increased the scores for "liver" and "enhancing". In turn, the scores of liver and enhancing also contributed positively to the final score of hemangioma (see Fig. 4 (g) for an example of a hemangioma). Note that these relations were not explicitly defined in the ontology. The label "chest" is exclusive with "abdomen" and "liver", so the learned weights between them are negative. As explained in Sec. 4.3, hemangioma and metastasis in the liver are hard for the algorithm to distinguish, so SPL also learned positive weights between them. In the future, using our holistic and comprehensive prediction framework, we may try to incorporate more human knowledge into the model, such as "type a is located in body part b and has attribute c" or "type d is similar to type e except for attribute f".



Label            AUC   F1      Label        AUC   F1
Chest            96.2  90.2    Nodule       89.1  66.9
Lung             98.6  92.0    Cyst         96.0  40.7
Liver            98.6  78.8    Adenoma      99.9  30.8
Lymph node       93.7  76.2    Metastasis   74.0  10.7
Adrenal gland    99.5  76.2    Hypodense    87.7  50.9
Right mid lung   98.7  56.6    Sclerotic    99.7  75.4
Pancreatic tail  97.5  35.3    Cavitary     94.9  25.0
Paraspinal       97.5   9.8    Large        80.6  17.5

Table 2. Accuracies (%) of typical body parts, types, and attributes.

Figure 5. Accuracy of training and testing LesaNet on different subsets of labels. Each curve corresponds to a test subset.

Figure 6. A part of the learned score propagation weights $W$. The weight in row $i$, column $j$ is $w_{ij}$, i.e., the refinement of label $i$'s score received from label $j$'s score. The final scores are $\tilde{\mathbf{s}} = W\mathbf{s}$.

Relational hard example mining: In contrast to SPL, RHEM is crucial for improving precision (Table 1), probably because it suppresses the scores of the reliable hard negative labels, at the cost of a mildly decreased recall.

Figure 7. Sample lesion retrieval results.

In RHEM, the hard negative labels of a lesion are selected from the exclusive labels of the existing positive ground truths. If we discard this "reliable" requirement and select negative labels from all labels that are not positive, precision increases by 1.5% because more negative labels are suppressed, but recall decreases by 4%, since many of the suppressed negative labels are actually positive due to missing annotations.

Label expansion: Without it, the training set loses 40% of its labels (the parent labels), and the accuracy degrades accordingly.

Text-mining module: When this module was not used, the overall accuracy dropped because the irrelevant training labels brought noise. However, the performance did not degrade substantially, showing that our model can tolerate noisy labels to a certain degree [18]. We also found that training with the relevant + uncertain labels was better than using the relevant labels only. This is because most uncertain labels are radiologists' inferences that are very likely to be true, especially if we only consider the lesion's appearance.

Triplet loss: The triplet loss also contributed slightly to the classification accuracy. The 256D embedding learned from the triplet loss can be used to retrieve lesions similar to a given query from the database. In Fig. 7, LesaNet not only predicted the labels of the query lesion correctly, but also retrieved lesions with the same labels, although their appearances are not identical. The retrieved lesions and reports can provide evidence for the predicted labels as well as help the user understand the query lesion.

More qualitative and quantitative results are presented in the supplementary material.

6. Conclusion and Future Work

In this paper, we studied the holistic lesion annotation problem and proposed a framework to automatically learn clinically meaningful labels from radiology reports and a label ontology. A lesion annotation network was proposed with effective strategies that both improve the accuracy and bring insights and interpretations. Our future work may include harvesting more data to better learn rare and hard labels, as well as incorporating more human knowledge.

Acknowledgment

This research was supported by the Intramural Research Program of the NIH Clinical Center and the National Library of Medicine. We thank NVIDIA for the donation of GPUs, and Dr. Le Lu and Dr. Jing Xiao for their valuable comments.

7. Supplementary Material

7.1. Dataset and Image Preprocessing Details

The DeepLesion dataset [47] was mined from a hospital's picture archiving and communication system (PACS) based on bookmarks, which are markers annotated by radiologists during their routine work to measure significant image findings. It is a large-scale dataset with 32,735 lesions on 32,120 axial slices from 10,594 CT studies of 4,427 unique patients. There are 1–3 lesions on each axial slice. The numbers of training samples of some typical labels in our experiment are shown in Fig. 8, from which we can see that the labels are imbalanced and positive cases are sparse for most labels.

Figure 8. Distribution of label occurrences in the training set.

We rescaled the 12-bit CT intensity range to floating-point numbers in [0, 255] using a single windowing (-1024 to 3071 HU) that covers the intensity ranges of lung, soft tissue, and bone. Every image slice was resized so that each pixel corresponds to 1 mm. The slice intervals of most CT scans in the dataset are either 1 mm or 5 mm; we interpolated along the z-axis to make the intervals of all volumes 2 mm.
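The windowing and in-plane resampling can be sketched as follows. This is our own minimal illustration (the function name is ours, and nearest-neighbor resizing is used for brevity; a real pipeline would interpolate).

```python
import numpy as np

def preprocess_slice(hu_slice, spacing_mm):
    """Rescale a CT slice from Hounsfield units to [0, 255] with the single
    wide window (-1024 to 3071 HU) described above, then resample the plane
    to 1 mm/pixel given its original pixel spacing in mm."""
    lo, hi = -1024.0, 3071.0
    img = (np.clip(hu_slice, lo, hi) - lo) / (hi - lo) * 255.0
    h, w = img.shape
    # New size in pixels equals the physical size in mm (1 pixel = 1 mm).
    rows = np.clip((np.arange(int(h * spacing_mm)) / spacing_mm).astype(int), 0, h - 1)
    cols = np.clip((np.arange(int(w * spacing_mm)) / spacing_mm).astype(int), 0, w - 1)
    return img[np.ix_(rows, cols)]
```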

7.2. More Lesion Annotation Results

7.2.1 Examples

Fig. 9 shows more lesion annotation examples of LesaNet in various body parts. We found that:

• LesaNet is good at identifying fine-grained lymph nodes (subplots (c), (e), (g), (h)), which account for a major part of the DeepLesion dataset.

• In (d), LesaNet correctly recognized the coarse-scale body part (axilla), but it classified the lesion as a lymph node instead of a mass-like skin thickening (the ground truth). This is possibly because most axillary lesions in DeepLesion are lymph nodes, while axillary skin lesions are rare.

7.2.2 Quantitative Results

In order to observe the effect of the components in LesaNet more clearly, we randomly re-split the training and validation sets at the patient level 10 times and reran the ablation study. Mean and standard deviation accuracies are reported in Table 3. The conclusions are similar to those in Sec. 5.5 of the main paper.

The batch size during training may affect the results because of the triplet loss and RHEM strategies used in LesaNet. We tested various batch sizes from 16 to 200, with and without the two strategies. No significant correlation was observed between batch size and accuracy, and methods with the triplet loss and RHEM were consistently better than those without them.

7.3. More Lesion Retrieval Examples

Fig. 10 demonstrates more lesion retrieval examples of LesaNet (please refer to Fig. 7 in the main paper). We constrain the query and all retrieved lesions to come from different patients, so as to better exhibit the retrieval ability and avoid finding identical lesions of the same patient. For lesions that are common in DeepLesion, such as lung nodules and liver masses, it is easy for LesaNet to retrieve lesions that are very similar in both visual appearance and semantic labels, e.g., Fig. 10 (a) and (b). Moreover, LesaNet is also able to retrieve lesions that look different but share similar semantic labels, e.g., the rib/chest wall mass in subplot (c), the pancreatic tail mass in (d), and the left adrenal nodule in (e).

We also conducted an experiment to quantitatively compare the lesion retrieval accuracy of LesaNet and the lesion embedding of [48]. We used the lesions in the text-mined test set as queries to retrieve similar lesions from the training set, which has no patient-level overlap with the test set.


Figure 9. Sample predicted labels with confidence scores on the text-mined test set. Green, red, and blue results correspond to TPs, FPs, and FNs, respectively. Underlined labels are TPs with missing annotations and were thus treated as FPs during evaluation. Only the most fine-grained predictions are shown, with their parents omitted for clarity.


Method                         Text-mined test set                                  Hand-labeled test set
                               AUC         Prec.       Recall      F1               AUC         Prec.       Recall      F1
LesaNet                        93.24±0.08  30.89±1.23  53.74±1.62  31.76±0.90       93.83±0.18  47.01±2.09  54.63±1.41  42.29±1.08
  w/o score propagation layer  92.42±0.09  34.25±2.60  49.61±1.55  30.89±0.83       93.28±0.30  50.60±2.06  51.74±1.72  41.09±1.09
  w/o RHEM                     93.21±0.10  28.40±1.49  56.05±2.19  31.02±0.93       93.62±0.22  43.09±1.49  57.65±2.11  42.04±1.06
  w/o label expansion          92.37±0.12  30.16±1.72  55.68±1.95  30.73±0.60       93.32±0.30  45.61±2.09  55.87±3.14  40.94±1.24
  w/o text-mining module       93.27±0.09  30.79±1.43  53.77±1.90  31.94±1.16       93.68±0.23  46.16±2.05  54.05±2.68  41.49±0.65
  w/o triplet loss             93.03±0.07  30.65±1.94  53.91±1.86  31.60±1.19       93.56±0.18  46.29±1.30  54.73±1.53  41.84±1.22

Table 3. Multilabel classification accuracy averaged across labels on the two test sets. Bold results are the best ones. Red underlined results in the ablation studies are the worst ones, indicating the ablated strategy is the most important for that criterion. We report the mean and standard deviation of accuracies calculated on 10 random data splits, formatted as mean ± std.

The accuracy criterion is the average cumulative gain (ACG), which is defined as the average number of overlapping labels between the query and each of the top-K retrieved samples [51]. The ACG@top-5 of the lesion embedding [48] is 2.25, meaning that a retrieved lesion shares an average of 2.25 labels with the query lesion. The ACG@top-5 of LesaNet is 2.36. LesaNet learned from the more fine-grained labels text-mined from radiology reports, which is the main reason for its improved accuracy, despite using a shorter embedding vector (256D vs. 1024D) and not being primarily trained for retrieval.
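For reference, ACG is straightforward to compute; this is our own tiny sketch of the definition above.

```python
def acg_at_k(query_labels, retrieved_label_sets, k=5):
    """Average cumulative gain: the mean number of labels shared between
    the query and each of the top-k retrieved lesions (as defined in [51])."""
    top = retrieved_label_sets[:k]
    return sum(len(query_labels & s) for s in top) / len(top)

print(acg_at_k({"liver", "hypodense", "metastasis"},
               [{"liver", "hypodense"}, {"liver"}, {"lung", "nodule"}], k=3))
# -> 1.0  (the query shares 2 + 1 + 0 labels with the three retrieved lesions)
```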

References
[1] BioPortal. Radiology Lexicon, 2018.
[2] Steven Bird and Edward Loper. NLTK: The natural language toolkit. In Annual Meeting of the Association for Computational Linguistics, pages 63–70, 2016.
[3] Jinzheng Cai, Youbao Tang, Le Lu, Adam P. Harrison, Ke Yan, Jing Xiao, Lin Yang, and Ronald M. Summers. Accurate Weakly-Supervised Deep Lesion Segmentation using Large-Scale Clinical Annotations: Slice-Propagated 3D Mask Generation from 2D RECIST. In MICCAI, pages 396–404, 2018.
[4] Qingyu Chen, Yifan Peng, and Zhiyong Lu. BioSentVec: creating sentence embeddings for biomedical texts. arXiv preprint arXiv:1810.09302, 2018.
[5] Sihong Chen, Jing Qin, Xing Ji, Baiying Lei, Tianfu Wang, Dong Ni, and Jie Zhi Cheng. Automatic Scoring of Multiple Semantic Attributes with Multi-Task Feature Leverage: A Study on Pulmonary Nodules in CT Images. IEEE Transactions on Medical Imaging, 36(3):802–814, Mar. 2017.
[6] Jie-Zhi Cheng, Dong Ni, Yi-Hong Chou, Jing Qin, Chui-Mei Tiu, Yeun-Chung Chang, Chiun-Sheng Huang, Dinggang Shen, and Chung-Ming Chen. Computer-Aided Diagnosis with Deep Learning Architecture: Applications to Breast Lesions in US Images and Pulmonary Nodules in CT Scans. Sci. Rep., 6(1):24454, 2016.
[7] Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yantao Zheng. NUS-WIDE: A Real-World Web Image Database from National University of Singapore. In Proceedings of the ACM International Conference on Image and Video Retrieval (CIVR), page 48, 2009.
[8] Kenneth Clark, Bruce Vendt, Kirk Smith, John Freymann, Justin Kirby, Paul Koppel, Stephen Moore, Stanley Phillips, David Maffitt, Michael Pringle, Lawrence Tarbox, and Fred Prior. The Cancer Imaging Archive (TCIA): Maintaining and operating a public information repository. Journal of Digital Imaging, 26(6):1045–1057, 2013.
[9] Sergey Demyanov, Rajib Chakravorty, Zongyuan Ge, Seyed-Behzad Bozorgtabar, Michelle Pablo, Adrian Bowling, and Rahil Garnavi. Tree-loss function for training neural networks on weakly-labelled datasets. In ISBI, pages 287–291. IEEE, Apr. 2017.
[10] Idit Diamant, Assaf Hoogi, Christopher F. Beaulieu, Mustafa Safdari, Eyal Klang, Michal Amitai, Hayit Greenspan, and Daniel L. Rubin. Improved Patch-Based Automated Liver Lesion Classification by Separate Analysis of the Interior and Boundary Regions. IEEE J. Biomed. Heal. Informatics, 20(6):1585–1594, 2016.
[11] Qi Dong, Shaogang Gong, and Xiatian Zhu. Imbalanced Deep Learning by Minority Class Incremental Rectification. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–14, 2018.
[12] Jose Raniery Ferreira, Marcelo Costa Oliveira, and Paulo Mazzoncini de Azevedo-Marques. Characterization of Pulmonary Nodules Based on Features of Margin Sharpness and Texture. Journal of Digital Imaging, pages 1–13, 2017.
[13] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[14] Yunchao Gong, Yangqing Jia, Thomas Leung, Alexander Toshev, and Sergey Ioffe. Deep Convolutional Ranking for Multilabel Image Annotation. arXiv preprint arXiv:1312.4894, 2013.
[15] Johannes Hofmanninger and Georg Langs. Mapping visual features to semantic profiles for retrieval in medical imaging. In CVPR, pages 457–465, 2015.
[16] Hexiang Hu, Guang-Tong Zhou, Zhiwei Deng, Zicheng Liao, and Greg Mori. Learning Structured Inference Neural Networks with Label Relations. In CVPR, pages 2960–2968, 2016.
[17] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In ICML, pages 448–456, 2015.
[18] Jonathan Krause, Benjamin Sapp, Andrew Howard, Howard Zhou, Alexander Toshev, Tom Duerig, James Philbin, and Li Fei-Fei. The unreasonable effectiveness of noisy data for fine-grained recognition. In ECCV, pages 301–320, 2016.


Query | Retrieved #1 | Retrieved #2 | Retrieved #3

(a) Unchanged pulmonary nodule at the left lower lobe | At least 2 subcentimeter peripheral left lower lung focus | Left lower lung mass unchanged | Noncalcified left lower lung mass unchanged

(b) Abnormality likely represent metastasis including focal mass right lobe liver | Other new concerning hypodense mass include lesion scattered in the right lobe | The upper abdomen is unchanged with a hypodense liver lesion | Additional enlarging hypodense lesion are present near the resection margin in the right lobe

(c) Expanded right posterior rib lesion | Posterior left rib mass | Right chest wall mass | Unchanged large right 7th rib expansile mass

(d) Complex retroperitoneal mass involving the region of the tail and body of the pancreas | Pancreatic tail mass | Centrally hypoattenuating mass within the pancreatic tail | Low attenuation pancreatic tail mass

(e) Left adrenal nodule not significantly changed in size | Left adrenal nodule | Left adrenal mass unchanged, probably due to adenoma | Left Adrenal Nodule

Figure 10. Sample lesion retrieval results of LesaNet. The input of LesaNet is the lesion image patch only; the associated report sentence is shown for reference. The irrelevant words in the sentences describing other lesions have been removed for clarity.
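Results like those in Figure 10 come from nearest-neighbor search in a learned lesion embedding space. The sketch below illustrates one plausible way such retrieval could be implemented; the embed() function, the L2-normalized-embedding assumption, and all variable names are hypothetical stand-ins for illustration, not the released LesaNet code.

    import numpy as np

    def retrieve_similar_lesions(query_embedding, gallery_embeddings, k=3):
        """Return indices of the k gallery lesions most similar to the query.

        Assumes the query vector (D,) and the gallery matrix (N, D) hold
        L2-normalized embeddings, so cosine similarity reduces to a dot
        product.
        """
        similarities = gallery_embeddings @ query_embedding  # shape (N,)
        # Sort by descending similarity and keep the top k matches.
        return np.argsort(-similarities)[:k]

    # Hypothetical usage, where embed() stands in for a forward pass of a
    # trained network that outputs one normalized feature vector per patch:
    #   top3 = retrieve_similar_lesions(embed(query_patch), all_embeddings)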


[19] Curtis P. Langlotz. RadLex: a new method for indexing online educational materials. Radiographics, 26(6):1595–1597, Nov. 2006.

[20] Xiaoli Li and Bing Liu. Learning to classify texts using positive and unlabeled data. In IJCAI, pages 587–592, 2003.

[21] Yuncheng Li, Yale Song, and Jiebo Luo. Improving pairwise ranking for multi-label image classification. In CVPR, pages 1837–1845, 2017.

[22] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal Loss for Dense Object Detection. In ICCV, pages 2980–2988, 2017.

[23] Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen A.W.M. van der Laak, Bram van Ginneken, and Clara I. Sánchez. A survey on deep learning in medical image analysis. Medical Image Analysis, 42:60–88, Dec. 2017.

[24] Kenneth Marino, Ruslan Salakhutdinov, and Abhinav Gupta. The More You Know: Using Knowledge Graphs for Image Classification. In CVPR, pages 20–28, 2017.

[25] Ishan Misra, C. Lawrence Zitnick, Margaret Mitchell, and Ross Girshick. Seeing through the Human Reporting Bias: Visual Classifiers from Noisy Human-Centric Labels. In CVPR, pages 2930–2939, 2016.

[26] The National Institutes of Health. RadLex, 2016.

[27] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.

[28] Yifan Peng, Anthony Rios, Ramakanth Kavuluru, and Zhiyong Lu. Extracting chemical-protein relations with ensembles of SVM and deep learning models. Database: The Journal of Biological Databases and Curation, 2018, Jan. 2018.

[29] Yifan Peng, Ke Yan, Veit Sandfort, Ronald M. Summers, and Zhiyong Lu. A self-attention based deep learning method for lesion attribute detection from CT reports. In IEEE International Conference on Healthcare Informatics, 2019.

[30] Hariharan Ravishankar, Prasad Sudhakar, Rahul Venkataramani, Sheshadri Thiruvenkadam, Pavan Annangi, Narayanan Babu, and Vivek Vaidya. Medical Image Description Using Multi-task-loss CNN. In LABELS 2016, DLMIA 2016, pages 121–129, 2016.

[31] Scott Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich. Training Deep Neural Networks on Noisy Labels with Bootstrapping. In ICLR Workshop, 2015.

[32] Berkman Sahiner, Aria Pezeshk, Lubomir M. Hadjiiski, Xiaosong Wang, Karen Drukker, Kenny H. Cha, Ronald M. Summers, and Maryellen L. Giger. Deep learning in medical imaging and radiation therapy. Med. Phys., Oct. 2018.

[33] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In CVPR, pages 815–823, 2015.

[34] Arnaud Arindra Adiyoso Setio et al. Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: The LUNA16 challenge. Medical Image Analysis, 42:1–13, 2017.

[35] Hoo Chang Shin, Le Lu, Lauren Kim, Ari Seff, Jianhua Yao, and Ronald Summers. Interleaved text/image deep mining on a large-scale radiology image database for automated image interpretation. Journal of Machine Learning Research, 17:305–321, 2016.

[36] Hoo-Chang Shin, Holger R. Roth, Mingchen Gao, Le Lu, Ziyue Xu, Isabella Nogues, Jianhua Yao, Daniel Mollura, and Ronald M. Summers. Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning. IEEE Transactions on Medical Imaging, 35(5):1285–1298, May 2016.

[37] Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training Region-based Object Detectors with Online Hard Example Mining. In CVPR, pages 761–769, 2016.

[38] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

[39] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip H. S. Torr, and Timothy M. Hospedales. Learning to Compare: Relation Network for Few-Shot Learning. In CVPR, pages 1199–1208, 2018.

[40] Youbao Tang, Adam P. Harrison, Mohammadhadi Bagheri, Jing Xiao, and Ronald M. Summers. Semi-Automatic RECIST Labeling on CT Scans with Cascaded Convolutional Neural Networks. In MICCAI, pages 405–413, 2018.

[41] Yuxing Tang, Xiaosong Wang, Adam P. Harrison, Le Lu, Jing Xiao, and Ronald M. Summers. Attention-Guided Curriculum Learning for Weakly Supervised Classification and Localization of Thoracic Diseases on Chest Radiographs. In International Workshop on Machine Learning in Medical Imaging, pages 249–258. Springer, 2018.

[42] Jiang Wang, Yi Yang, Junhua Mao, Zhiheng Huang, Chang Huang, and Wei Xu. CNN-RNN: A Unified Framework for Multi-label Image Classification. In CVPR, pages 2285–2294, 2016.

[43] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M. Summers. ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. In CVPR, pages 2097–2106, 2017.

[44] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, and Ronald M. Summers. TieNet: Text-Image Embedding Network for Common Thorax Disease Classification and Reporting in Chest X-rays. In CVPR, pages 9049–9058, 2018.

[45] Jason Weston, Samy Bengio, and Nicolas Usunier. WSABIE: Scaling Up To Large Vocabulary Image Annotation. In IJCAI, pages 2764–2770, 2011.

[46] Ke Yan, Mohammadhadi Bagheri, and Ronald M. Summers. 3D Context Enhanced Region-based Convolutional Neural Network for End-to-End Lesion Detection. In MICCAI, pages 511–519, 2018.

[47] Ke Yan, Xiaosong Wang, Le Lu, and Ronald M. Summers. DeepLesion: automated mining of large-scale lesion annotations and universal lesion detection with deep learning. Journal of Medical Imaging, 5(3), 2018.

[48] Ke Yan, Xiaosong Wang, Le Lu, Ling Zhang, Adam Harrison, Mohammadhadi Bagheri, and Ronald Summers. Deep Lesion Graphs in the Wild: Relationship Learning and Organization of Significant Radiology Image Findings in a Diverse Large-scale Lesion Database. In CVPR, pages 9261–9270, 2018.

[49] Min-Ling Zhang and Zhi-Hua Zhou. A review on multi-label learning algorithms. IEEE Trans. Knowl. Data Eng., 26(8):1819–1837, Aug. 2014.

[50] Zizhao Zhang, Yuanpu Xie, Fuyong Xing, Mason McGough, and Lin Yang. MDNet: A Semantically and Visually Interpretable Medical Image Diagnosis Network. In CVPR, pages 6428–6436, 2017.

[51] Fang Zhao, Yongzhen Huang, Liang Wang, and Tieniu Tan. Deep semantic ranking based hashing for multi-label image retrieval. In CVPR, pages 1556–1564, 2015.