
Geometry in Active Learning for Binary and Multi-class Image Segmentation

Ksenia Konyushkova, Raphael Sznitman, and Pascal Fua, Fellow, IEEE

Abstract—We propose an Active Learning approach to image segmentation that exploits geometric priors to streamline the annotation process. We demonstrate this for both background-foreground and multi-class segmentation tasks in 2D images and 3D image volumes. Our approach combines geometric smoothness priors in the image space with more traditional uncertainty measures to estimate which pixels or voxels are most in need of annotation. For multi-class settings, we additionally introduce two novel criteria for uncertainty. In the 3D case, we use the resulting uncertainty measure to show the annotator voxels lying on the same planar patch, which makes batch annotation much easier than if they were randomly distributed in the volume. The planar patch is found using a branch-and-bound algorithm that finds a patch with the most informative instances. We evaluate our approach on Electron Microscopy and Magnetic Resonance image volumes, as well as on regular images of horses and faces. We demonstrate a substantial performance increase over state-of-the-art approaches.

Index Terms—Active Learning, Multi-Class Active Learning, Image Segmentation, Branch-and-Bound


1 INTRODUCTION

Machine learning techniques are a key component of modern approaches to segmentation, making the need for sufficient amounts of training data critical. As far as images of everyday scenes are concerned, this is addressed by compiling large training databases and obtaining the ground truth via crowd-sourcing [1], [2], but at a high cost. By contrast, in specialized domains such as biomedical image processing, this is not always an option because only experts whose time is scarce and precious can annotate images reliably. This stands in the way of wide acceptance of many state-of-the-art segmentation algorithms, which are formulated in terms of a classification problem and require large amounts of annotated data for training. The problem is even more acute for multi-class segmentation, which requires even larger training sets and more sophisticated interfaces to produce them [3].

Active learning (AL) is an established way to reduce the annotation workload by automatically deciding which parts of the image an annotator should label to train the system as quickly as possible and with a minimal amount of manual intervention. However, most AL techniques used in computer vision, such as [3], [4], [5], [6], [7], [8], [9], are inspired by earlier methods developed primarily for general tasks or Natural Language Processing [10], [11]. As such, they rarely account for the specific difficulties, or exploit the opportunities, that arise when annotating individual pixels in 2D images and 3D voxels in image volumes. Moreover, multi-class classification is rarely considered in the AL literature, despite its importance in computer vision.

• K. Konyushkova and P. Fua are with the Computer Vision Laboratory, EPFL, Lausanne, Switzerland. E-mail: FirstName.LastName@epfl.ch

• R. Sznitman is with the ARTORG Center, University of Bern, Bern, Switzerland. E-mail: [email protected]

This manuscript is submitted to the Computer Vision and Image Understanding journal.

More specifically, 3D stacks such as those depicted by Fig. 1 are common in the biomedical field and are particularly challenging, in part because it is difficult both to develop effective interfaces to visualize the huge image data and for users to quickly figure out what they are looking at. In this paper, we therefore introduce a novel approach to AL that is geared towards segmenting 3D stacks while accounting for geometric constraints on region shapes, thus making the annotation process convenient. Our approach applies both to background-foreground and multi-class segmentation of ordinary 2D images and 3D volumes. Our main contributions are as follows.

• We introduce an approach to exploiting geometric priors to more effectively select the image data the expert user is asked to annotate, both for background-foreground and multi-class segmentation.

• We introduce a novel definition of uncertainty for multi-class AL, which can be combined with the above-mentioned geometric priors.

• We streamline the annotation process in 3D volumes so that annotating them is no more cumbersome than annotating ordinary 2D images, as depicted by Fig. 2.

We first presented these ideas in a conference paper [12] that only addressed the foreground-background case, and we extend them here to the multi-class case. Moreover, we present additional comparisons to baselines and describe experiments designed to provide a deeper understanding of our method's inner workings. In the remainder of this paper, we first review current approaches to binary and multi-class AL and discuss why they are not necessarily the most effective when dealing with pixels and voxels. We then give a short overview of our approach before discussing in detail our use of geometric priors and how we search for an optimal cutting plane to simplify the annotation process. Additionally, we test multi-class AL on image classification tasks. Finally, we compare our results against those of state-of-the-art techniques in a few challenging cases and provide experiments that illustrate the role of human intuition in the labeling procedure.

arXiv:1606.09029v3 [cs.CV] 16 Jan 2018


[Figure: three orthogonal slices of a stack, labeled (yz), (xy), and volume cut (xz).]
Fig. 1. Interface of the FIJI Visualization API [13], which is extensively used to interact with 3D image stacks. The user is presented with three orthogonal planar slices of the stack. While effective when working slice by slice, this is extremely cumbersome for random access to voxels anywhere in the 3D stack, which is what a naive AL implementation would require.

Fig. 2. Our approach to annotation. Top row: the system selects an optimal plane in an arbitrary orientation and presents the user with a patch that is easy to annotate; the annotated area is shown as part of the full 3D stack. Bottom row: user interface, showing the planar patch the user would see. Left: when two classes are present in the patch, it can be annotated by clicking twice to specify the red segment that forms the boundary between the inside and outside of a target object within the green circle. Right: alternatively, the user corrects mistakes in the current prediction; supervoxels predicted to be mitochondria are shown in red, background in blue. A click on a misclassified supervoxel lets the user select the correct class among those proposed (class labels in the figure: mitochondrion, synapse, membrane, background). Best viewed in color, as are most figures in this paper.

2 RELATED WORK AND MOTIVATION

In this paper, we are concerned with situations where domain experts are available to annotate images. However, their time is limited and expensive. We would therefore like to exploit it as effectively as possible. In such a scenario, AL is the technique of choice because it seeks the smallest possible set of training samples to annotate for effective model instantiation [14].

In practice, almost any classification scheme can be incorporated into an AL framework. For image processing purposes, that includes SVMs [5], [8], [3], Conditional Random Fields [6], Gaussian Processes [4], [9] and Random Forests [7]. Typical strategies for query selection rely on uncertainty sampling [5], query-by-committee [15], [16], expected model change [14], [17], [6], [18], or measuring the information in the Fisher matrix [19]. While there is a wide range of literature on AL for binary classification, multi-class problems are seldom considered. Multi-class scenarios are often approached by reducing the problem to one-vs.-all binary classification [4], [8]. Alternative methods rely on the expected reduction in misclassification risk and are independent of the number of classes [3]. Unfortunately, they can run in reasonable time only when they are combined with a classifier that has an easy model-update rule for adding new samples. On the other hand, for uncertainty sampling, one needs to redefine the notion of uncertainty or disagreement [14], [5], [9], [20], [21], [22]. Three ways to define the most uncertain datapoint are introduced in [14]: (1) maximum entropy of the posterior probability distribution over classes, (2) minimal probability of the selected class, and (3) minimal gap between the two most probable classes. Many works rely on one of the above criteria or on combinations of them. This includes selection uncertainty [21], posterior distribution entropy [5], [20], the combination of entropy, minimum margin and exploration criteria [9], and all three strategies of [14] together [22].

The above techniques have been used for tasks such as natural language processing [10], [11], [23], image classification [5], [19], [4], [3], visual recognition [9], [8], semantic segmentation [6], [16], and foreground segmentation [24]. However, selection strategies are rarely designed to take advantage of image specificities when labeling individual pixels or voxels, such as the fact that a neighborhood of pixels/voxels tends to have homogeneous labels. The segmentation methods presented in [25], [16], [26] do take such geometric constraints into account for classification purposes, but not to guide AL, as we do.

Batch-mode selection [27], [19], [28], [29] has become a standard way to increase efficiency by asking the expert to annotate more than one sample at a time [30], [31]. But again, this has been mostly investigated in terms of semantic queries without due consideration of the fact that, in images, it is much easier for annotators to quickly label many samples in a localized image patch than to annotate random image locations. If samples are distributed randomly in a 3D volume, it is extremely cumbersome to label them using current image display tools such as the popular FIJI platform depicted by Fig. 1. Thus, in 3D image volumes [25], [16], [32], it is even more important to provide the annotator with a patch in a well-defined plane, such as the one shown in Fig. 2. The technique of Top [33] is an exception in that it asks users to label objects of interest in a plane of maximum uncertainty. Our approach is similar, but has several distinctive features. First, the procedure we use to find the plane requires far fewer parameters to be set, as discussed in Sec. 5. Second, we search for the most uncertain patch in the plane and do not require the user to annotate the whole plane. Finally, our approach can be used in conjunction with an ergonomic interface that requires at most three mouse clicks per iteration when two classes are involved. Also, as we show in the results section, our method combined with geometric smoothness priors outperforms the earlier one.


Fig. 3. Geometry-based uncertainty score. Left: binary classification map for an 8×8 image. We assume that the classifier assigns the yellow pixels to class 1 with probability 1 and the blue ones to class 0, also with probability 1. Right: geometric uncertainty score of Section 4.3. The area of transition between the two classes is given a high uncertainty score. Its maximum is reached where the boundary is not smooth.

3 APPROACH

We begin by broadly outlining our framework, which is set in a traditional AL context. That is, we wish to train a classifier for segmentation purposes, but initially have only a few labeled and many unlabeled training samples at our disposal.

Since the segmentation of 3D volumes is computationally expensive, supervoxels have been extensively used to speed up the process [34], [35]. In the remainder of this section and in Sec. 4, we will refer almost solely to supervoxels for simplicity, but the definitions apply equally to superpixels when dealing with 2D images. We formulate our problem in terms of classifying supervoxels as being part of a specific target object. As such, we start by oversegmenting the image volume using the SLIC algorithm [36] and computing for each resulting supervoxel si a feature vector xi. When dealing with ordinary 2D images, we simply replace the 3D supervoxels with 2D superpixels, which SLIC can also produce. Our AL problem thus involves iteratively finding the next set of supervoxels that should be labeled by an expert to improve segmentation performance as quickly as possible. To this end, our algorithm proceeds as follows:

1) Train a classifier on the labeled supervoxels SL and use it to predict the class probabilities for the remaining supervoxels SU.

2) Score SU on the basis of a novel uncertainty function that we introduce in Sec. 4. It is inspired by the geometric properties of images, in which semantically meaningful regions tend to have smooth boundaries. Fig. 3 illustrates its behavior given a simple prediction map: non-smooth regions tend to be assigned the highest uncertainty scores.

3) In volumes, select a 2D plane that contains a patch with the most uncertain supervoxels, as shown in Fig. 2, and, in regular images, select a patch around the most uncertain superpixel. The expert can then effortlessly label the indicated 2D patch without having to examine the image data from multiple perspectives, as would be the case otherwise and as depicted by Fig. 1. Furthermore, we can then design a simple interface that lets the user label supervoxel or superpixel batches with just a few mouse clicks, as shown in Fig. 2 and described in Sec. 6.

4) Sets SL and SU are updated and the process is repeated until the segmentation quality is satisfactory. (A code sketch of this loop is given below.)
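The following sketch strings these steps together in Python. It is purely illustrative: the mean-intensity feature is a toy placeholder for the descriptors used in Sec. 6, and clf, uncertainty, select_patch and oracle are hypothetical stand-ins for the components described in Secs. 4-6.

```python
# Sketch of the loop of Sec. 3: SLIC oversegmentation, then iterative
# train / score / annotate / update. Illustrative only; `uncertainty`,
# `select_patch` and `oracle` are hypothetical stand-ins.
import numpy as np
from skimage.segmentation import slic

def oversegment(volume, n_segments=500):
    # Grayscale 3D stack -> supervoxel label map plus toy features.
    labels = slic(volume, n_segments=n_segments, compactness=0.1,
                  channel_axis=None)
    ids = np.unique(labels)
    feats = np.array([[volume[labels == i].mean()] for i in ids])
    return labels, feats

def al_loop(X, labeled, y, clf, uncertainty, select_patch, oracle, n_iters):
    pool = np.setdiff1d(np.arange(len(X)), labeled)
    for _ in range(n_iters):
        clf.fit(X[labeled], y)                      # step 1: train on S_L
        proba = clf.predict_proba(X[pool])          # step 1: predict on S_U
        scores = uncertainty(proba)                 # step 2: score (Sec. 4)
        batch = select_patch(scores, pool)          # step 3: patch (Sec. 5)
        y = np.concatenate([y, [oracle(i) for i in batch]])
        labeled = np.concatenate([labeled, batch])  # step 4: update S_L, S_U
        pool = np.setdiff1d(np.arange(len(X)), labeled)
    return clf
```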

4 GEOMETRY-BASED ACTIVE LEARNING

Most AL methods were developed for general tasks and operate exclusively in feature space, thus ignoring the geometric properties of images and, more specifically, their geometric consistency. We start from Uncertainty Sampling (US), which is designed to focus the annotators' attention on samples for which image features do not yet provide enough information for the classifier to decide what label to assign. It selects the samples that are most uncertain in feature space to be annotated first, so that the classifier is updated with the largest amount of information. We will refer to this family of approaches as Feature Uncertainty (FUn). These methods are both effective and computationally inexpensive, and we therefore choose them as the basis of our work. However, they do not use image geometry as a clue to which samples may be mislabeled.

To remedy this, we first introduce the concept of Geometric Uncertainty (GUn) and then show how to combine it with FUn. Our basic insight is that supervoxels that are assigned a label different from that of their neighbors ought to be considered more carefully than those that are assigned the same label, as illustrated by Fig. 3. In this 2D toy example, pixels near classification boundaries in the image space, as opposed to the feature space, are marked as being more uncertain, and those near irregular parts of the boundary even more so.

We express both kinds of uncertainty in terms of entropy so that we can combine them in a principled way. Doing this in the multi-class segmentation case requires a new criterion, which we introduce below.

4.1 Uncertainty Measures

For each supervoxel si and each label y in a set Y of possible labels, let p(yi = y|xi) be the probability that its label yi is y, given the corresponding feature vector xi. In this section we are not concerned with the question of how this probability is obtained. For background-foreground segmentation, we take Y to be {0, 1}. In the multi-class scenario, Y is a larger set, such as {background, hair, skin, eyes, nose, mouth} for face segmentation. We describe below three entropy-based uncertainty measures. We start from the well-known Shannon entropy of the predicted distribution over all possible classes, and we then introduce two novel uncertainty measures which are both entropic in nature but account for different properties of the predicted probability distribution.

4.1.1 Total Entropy

The simplest and most common way to estimate uncertainty is to compute the Shannon entropy of the total probability distribution over classes,

H(p(y_i = y|x_i)) = -\sum_{y \in Y} p(y_i = y|x_i) \log p(y_i = y|x_i) ,   (1)

which we will refer to as Total Entropy. By definition, it is not restricted to the binary case and can be used straightforwardly in the multi-class scenario as well.

4.1.2 Selection Entropy

When there are more than two elements in Y, another way to evaluate uncertainty is to consider the label b1 with the highest probability against all the others taken together. For b_k ∈ {b_1, \bar{b}_1} this yields a probability distribution

p_s = p(y_i = b_k|x_i) ,   (2)


[Figure: five ternary plots titled Conditional entropy, Selection entropy and Total entropy (top row), and Min margin and Min max (bottom row).]
Fig. 4. Measures of Feature Uncertainty in a three-class problem. In each triangle, the color denotes the uncertainty as a function of the three probabilities assigned to the classes, which sum to 1. The three corners correspond to a point belonging to one of the three classes with probability 1, and therefore to no uncertainty. By contrast, the center point can belong to any class with equal probability and therefore yields maximal uncertainty. For easier comparison, we inverted some values so that yellow corresponds to higher uncertainty and dark blue to lower uncertainty. Top: entropy-based measures of Sec. 4.1. Bottom: measures proposed in [14].

such that p(y_i = \bar{b}_1|x_i) = \sum_{y \in Y \setminus \{b_1\}} p(y_i = y|x_i). Then, we compute the entropy of the resulting probability distribution over two classes as the Selection Entropy H_s,

H_s = H(p_s) .   (3)

This definition of uncertainty is motivated by our desire to minimize the number of misclassified samples by concentrating on the classifier's decision output.

4.1.3 Conditional Entropy

Another way to evaluate uncertainty in a multi-class scenario is to consider how much more likely the top candidate is than the second one. More precisely, let b1 and b2 be the two highest-ranked classes for a supervoxel si, with p(yi = b1|xi) > p(yi = b2|xi) > p(yi = bj|xi) for all bj ≠ b1, b2. If one of them truly is the correct class, we can condition on this fact. For b_k ∈ {b_1, b_2} this yields

p_c = p(y_i = b_k | x_i, y_i^* \in \{b_1, b_2\}) = \frac{p(y_i = b_k|x_i)}{p(y_i = b_1|x_i) + p(y_i = b_2|x_i)} ,   (4)

where y_i^* stands for the true class label. We then take the Conditional Entropy uncertainty to be the Shannon entropy of this probability distribution, which is

H_c = H(p_c) .   (5)

This definition of uncertainty is motivated by the fact that the classifier is rarely confused about all possible classes. More typically, there are two classes that are hard to distinguish, and we want to focus on those.¹

1. For example, when trying to recognize digits from 0 to 9, it is unusual to find samples that resemble all possible classes with equal probability, but there are many cases in which 3 and 5 are not easily distinguishable.
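For concreteness, the sketch below transcribes the three measures into Python for a single sample; it is an illustration of Eqs. (1)-(5), not code from the paper.

```python
# The three uncertainty measures of Sec. 4.1 for one sample, given a
# vector p of class probabilities summing to 1.
import numpy as np

def shannon(p, eps=1e-12):
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p + eps)))

def total_entropy(p):                    # Eq. (1)
    return shannon(p)

def selection_entropy(p):                # Eqs. (2)-(3): b1 vs. all the rest
    p1 = float(np.max(p))
    return shannon([p1, 1.0 - p1])

def conditional_entropy(p):              # Eqs. (4)-(5): b1 vs. b2
    b1, b2 = np.sort(p)[::-1][:2]        # the two largest probabilities
    return shannon([b1 / (b1 + b2), b2 / (b1 + b2)])

print(total_entropy([0.5, 0.3, 0.2]),
      selection_entropy([0.5, 0.3, 0.2]),
      conditional_entropy([0.5, 0.3, 0.2]))
```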

4.2 Feature Uncertainty (FUn)

In practice, we estimate p(yi = y|xi) by means of a classifier trained with parameters θ, and we denote the resulting probability distribution by pθ. Then, any of the uncertainty measures from Sec. 4.1 can be applied to the probability distribution pθ(yi = y|xi) for all y ∈ Y, resulting in the Feature Total Entropy H from Eq. (1), the Feature Selection Entropy Hs from Eq. (3), and the Feature Conditional Entropy Hc from Eq. (5). While all Feature Uncertainty measures are equivalent in the binary classification case, they behave quite differently in a multi-class scenario, as shown in the top row of Fig. 4. Furthermore, even though our Selection Entropy and Conditional Entropy measures are in the same spirit as the Min margin and Min max measures of [14] (bottom row of Fig. 4), they select different samples and, unlike those measures, can be combined with geometric priors, as shown in Sec. 4.4. In the remainder of the paper, we will refer to any one of these three uncertainty measures as the Feature Uncertainty Hθ.

4.3 Geometric Uncertainty

Estimating the uncertainty as described above does not explicitly account for correlations between neighboring supervoxels. To account for them, we can estimate the entropy of a different probability, specifically the probability that supervoxel si belongs to class y given the classifier predictions of its neighbors, which we denote pG(yi = y).

To this end, we treat the supervoxels of a single image volume as the nodes of a directed weighted graph G whose edges connect neighboring supervoxels, as depicted in Fig. 5. We let Ak(si) = {si1, si2, ..., sik} be the set of k nearest neighbors of si and assign to each edge a weight inversely proportional to the Euclidean distance between the supervoxel centers. This simple definition makes the most sense when the supervoxels are close to being spherical, which is the case when using the algorithm of [36]. For each node si, we normalize the weights of all incoming edges so that they sum to one, and treat the result as the probability pT(yi = y|yj = y) of node si having the same label as node sij ∈ Ak(si). In other words, the closer two nodes are, the more likely they are to have the same label.

To define pG(yi = y), we use a random walk procedure [37], as it reflects our smoothness assumption well and has been extensively used for image segmentation purposes [38], [33]. Given the transition probabilities pT(yi = y|yj = y), we


Fig. 5. Image represented as a graph: we treat supervoxels as nodes of the graph and the edge weights between them reflect the probability of the same label being transferred to a neighbor. Supervoxel si has k neighbors Ak(si) = {si1, si2, ..., sik}; pT(yi = y|yj = y) is the probability of node si having the same label as node sij, and pθ(yi = y|xi) is the probability that yi, the class of si, is y, given only the corresponding feature vector xi.

can compute the probabilities pG iteratively, initially taking p0G(yi = y) to be pθ(yi = y|xi) and then iteratively computing

p_G^{\tau+1}(y_i = y) = \sum_{s_j \in A_k(s_i)} p_T(y_i = y|y_j = y) \, p_G^{\tau}(y_j = y) .   (6)

Note that pθ(yj = y|xj), p0G(yi = y) and pτ+1G(yi = y) are vectors whose dimension is the cardinality of Y, the set of all possible labels. The above procedure propagates the labels of individual supervoxels into their neighborhood, and the number of iterations, τmax, defines the radius of the neighborhood involved in the computation of pG for si, thus encoding the smoothness prior. Fig. 3 shows the result of this computation for a simple 8×8 image, with the initial prediction of the classifier shown on the left, k = 4 neighbors, and equal edge weights. We apply τmax = 4 iterations, and the resulting geometric uncertainty on the right shows how the smoothness prior is reflected in the uncertainty: non-smooth boundaries receive the highest uncertainty scores.
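A sketch of this propagation, assuming precomputed neighbor lists and row-normalized inverse-distance weights as described above, could look as follows; the Combined Uncertainty of Sec. 4.4 is then simply the sum of the entropies computed on pθ and on the propagated pG.

```python
# Sketch of the random-walk propagation of Eq. (6). `neighbors[i]` holds
# the indices of the k nearest supervoxels of s_i, and `weights[i]` the
# matching row-normalized inverse-distance transition probabilities p_T.
import numpy as np

def propagate(p_theta, neighbors, weights, tau_max):
    """p_theta: array of shape (n_supervoxels, n_classes)."""
    p_g = p_theta.copy()                  # p_G^0 = p_theta
    for _ in range(tau_max):
        p_next = np.empty_like(p_g)
        for i in range(len(p_g)):
            w = np.asarray(weights[i])    # sums to 1 for node i
            # Eq. (6): mix the neighbors' current label distributions.
            p_next[i] = w @ p_g[neighbors[i]]
        p_g = p_next
    return p_g
```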

Given these probabilities, we can use the approaches of Secs. 4.1.1, 4.1.2 and 4.1.3 to compute the Geometric Uncertainty HG from the probability distribution pG(yi = y) for all y ∈ Y, obtaining the Geometric Total Entropy H, Geometric Selection Entropy Hs and Geometric Conditional Entropy Hc, respectively.

4.4 Combining Feature and Geometric Entropy

Finally, given a trained classifier, we can estimate both FUn and GUn. To use them jointly, we should in theory estimate the joint probability distribution pθ,G(yi = y|xi) and the corresponding joint entropy. As this is not modeled by our AL procedure, we take advantage of the fact that the joint entropy is upper bounded by the sum of the individual entropies Hθ and HG. Thus, for each supervoxel, we take the Combined Uncertainty (CUn) to be

H_{\theta,G} = H_\theta + H_G ,   (7)

that is, the upper bound of the joint entropy. The same rule applies equally to the Total Entropy and to the entropy-based Selection Entropy and Conditional Entropy. This principled way of combining the uncertainties gives much better results than simply summing the scores of the Min margin and Min max methods of [14]. In practice, using this measure means that supervoxels that individually receive uncertain predictions and lie in areas of non-smooth transition between classes will be considered first, as depicted by Fig. 3. Note that the AL method of [39], based on Zhou's propagation [26], which is similar to the one we use, operates exclusively on HG. However, we experimentally observed on our datasets that considering the upper bound on the joint entropy from Eq. (7) results in a significant improvement in the learning curve. Our MATLAB implementation of Combined Total Entropy takes 1.4s per iteration on 10 volumes of resolution 176×170×220 from the MRI dataset of Sec. 6.3.2 (2.3 GHz Intel Core i7, 64-bit). Time performance is extremely important in interactive applications.

5 BATCH-MODE GEOMETRY QUERY SELECTION

The simplest way to exploit the CUn of Sec. 4.4 would be to pick the most uncertain supervoxel, ask the expert to label it, retrain the classifier, and iterate. A more effective way, however, is to find appropriately-sized batches of uncertain supervoxels and ask the expert to label them all at once before retraining the classifier. As discussed in Sec. 2, this is referred to as batch-mode selection, which usually reduces the time-complexity of AL. However, a naive implementation would force the user to randomly view and annotate several supervoxels in 3D volumes regardless of where they are. This would not be user friendly, as the user would have to navigate a potentially large volume at each iteration.

In this section, we therefore introduce an approach that uses the uncertainty measure described in the previous section to first select a planar patch in a 3D volume and then allow the user to quickly label supervoxels within it, as shown in Fig. 2. In practice, we operate on SLIC superpixels/supervoxels [36], which are roughly circular/spherical. We allow the annotator to consider only circular regions within planar patches, such as the ones depicted in Figs. 2 and 7. These can be understood as the intersection of a sphere with a plane of arbitrary orientation.

Recall from Sec. 4 that we can assign to each supervoxel si an uncertainty estimate U(si) in one of several ways. Whichever one we choose, finding the circular patch of maximal uncertainty p* can be formulated as finding

p^* = \arg\max_p \sum_{s_j \in p} U(s_j) ,   (8)

where the summation is over the supervoxels that intersect the plane and are within the sphere.

Since Eq. (8) is linear in U(sj) ≥ 0 for any given supervoxel sj, we designed a branch-and-bound approach to finding the plane that yields the largest uncertainty. It recursively eliminates whole subsets of planes and quickly converges to the correct solution. Whereas an exhaustive search would be excruciatingly slow, our current MATLAB implementation takes 0.024s per plane search on the MRI dataset with the same settings as in Sec. 4.4. This means that a C implementation of the entire pipeline would be real-time, which is critical for acceptance by the users of such an interactive method.

As discussed above, in theory this procedure could be used in conjunction with any one of the uncertainty measures defined in the previous section. In practice, as shown in Sec. 6, it is most beneficial when used in combination with the geometry-aware criterion of Sec. 4.4. We describe our branch-and-bound plane-finding procedure in more detail below.


Fig. 6. a) A corridor is the union of the areas between planes p1 and p4 and between p2 and p3. The green points depict supervoxels included in the corridor Ω, while the black points depict supervoxels outside of it. b) Coordinate system for plane selection. A circular patch is defined as the intersection of a plane with a sphere. Plane pi (yellow) is parametrised by two angles, φ and γ: φ is the angle between the negative direction of axis Y (−Y) and the plane's intersection with XY (blue); similarly, γ is the angle between −Z and the plane's intersection with YZ (red). c) Sector-splitting procedure, where U(p0) < U(Ω). We split the corridor Ω into corridors [φmin, φ0) × [γmin, γ0), [φmin, φ0) × [γ0, γmax), [φ0, φmax) × [γmin, γ0) and [φ0, φmax) × [γ0, γmax) and evaluate their uncertainty values. Among all available sectors, we select the one with the highest value to be split next. Best seen in color.

5.1 Parameterizing the Search Space

Let us consider a spherical volume centered at supervoxel si, such as the one depicted by Fig. 6a. Since the SLIC superpixels/supervoxels are always roughly circular/spherical, any supervoxel sj can be well approximated by a spherical object with center wj and radius κ, set to a constant for a particular dataset. We will refer to such an approximation as sj. Then, every sj = (wj, κ) is characterized by its center wj and the common radius κ.

Let Sri be the set of supervoxels within distance r of si, that is,

S_i^r = \{ s_j = (w_j, \kappa) \mid \|w_j - w_i\| \leq r \} .   (9)

If we take the desired patch size to be r, we can then operate exclusively on the elements of Sri. Let Pi be the set of all planes bisecting it at the center of si. As we will see below, our procedure requires defining planes, area splits of approximately equal size, and supervoxel membership in certain areas and planes. To make this easy to do, we parameterize the planes in Pi as follows.

Let us consider a plane p ∈ Pi, such as the one shown in yellow in Fig. 6b. It intersects the XY plane along a line characterized by a vector v1, shown in blue. Without loss of generality, we can choose the orientation of v1 so that its X coordinate is positive, and denote by φ the angle between the negative direction of axis Y (−Y) and v1. Similarly, let us consider the intersection of p with the YZ plane and characterize it by the vector v2 with a positive Y coordinate, shown in red. Now let γ be the angle between −Z and v2. We can then parameterize the plane p by the two angles φ ∈ [0, π) and γ ∈ [0, π), because there is one and only one plane passing through two intersecting lines. We will refer to (φ, γ) as the plane's angular coordinates. Finally, let Cri(p) be the set of supervoxels sj ∈ Sri lying on p, that is,

C_i^r(p) = \{ s_j \in S_i^r \mid \text{distance}(p, w_j) \leq 2\kappa \} .   (10)

The set Pi can be represented by the Cartesian product [0, π) × [0, π) of the full ranges of φ and γ. Let Φ = [φmin, φmax) and Γ = [γmin, γmax) be two angular intervals. We will refer to the set of planes with angular coordinates in Φ × Γ as the corridor Ω = Φ × Γ, as illustrated by Fig. 6a. The boundaries of this corridor are defined by the planes p1 = (φmin, γmin), p2 = (φmin, γmax), p3 = (φmax, γmin) and p4 = (φmax, γmax).
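Under the angle conventions above (v1 with a positive X coordinate at angle φ from −Y, v2 with a positive Y coordinate at angle γ from −Z), the plane normal and the membership tests of Eqs. (9) and (10) take only a few lines. The sketch below is an illustration, with kappa standing for the common supervoxel radius.

```python
# Sketch: plane normal from the angular coordinates (phi, gamma) of
# Sec. 5.1, plus the sphere and plane membership tests of Eqs. (9)-(10).
import numpy as np

def plane_normal(phi, gamma):
    v1 = np.array([np.sin(phi), -np.cos(phi), 0.0])      # intersection with XY
    v2 = np.array([0.0, np.sin(gamma), -np.cos(gamma)])  # intersection with YZ
    n = np.cross(v1, v2)
    return n / np.linalg.norm(n)

def sphere_members(centers, w_i, r):          # S_i^r of Eq. (9)
    return np.where(np.linalg.norm(centers - w_i, axis=1) <= r)[0]

def plane_members(centers, w_i, phi, gamma, kappa):  # C_i^r(p) of Eq. (10)
    n = plane_normal(phi, gamma)
    dist = np.abs((centers - w_i) @ n)        # the plane passes through w_i
    return np.where(dist <= 2 * kappa)[0]
```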

5.2 Searching for the Best Bisecting Plane

5.2.1 Uncertainty of Planes and Corridors

Recall that we assign to each supervoxel sj an uncertainty value U(sj) ≥ 0. We take the uncertainty of plane p to be

U(p) = \sum_{s_j \in C_i^r(p)} U(s_j) .   (11)

Finding a circular patch p* of maximum uncertainty then amounts to finding

p^* = (\phi^*, \gamma^*) = \arg\max_{p \in P_i} U(p) .   (12)

Similarly, we define the uncertainty of a corridor as the sum of the uncertainty values of all the supervoxels lying between the four planes bounding it, that is, between p1 and p4 and between p2 and p3, as depicted by Fig. 6a. We therefore write

U(\Omega) = \sum_{s_j \in C_i^r(\Omega)} U(s_j) ,   (13)

where Cri(Ω) represents the supervoxels lying between the four bounding planes. In practice, a supervoxel is considered to belong to the corridor if its center lies either between p1 and p4 or between p2 and p3, or is no further than κ away from any of these planes. When the angles are acute, this is easily decided by checking that the dot products of the supervoxel center coordinates with the plane normals all have the same sign, provided that the normal orientations are chosen so that they all point inside the corridor.

5.2.2 Branch and Bound

To solve Eq. (12) and find the optimal circular patch, we use a branch-and-bound approach. It involves quickly eliminating entire subsets of the parameter space Φ × Γ using a bounding function [40], a recursive search procedure, and a termination criterion, which we describe below.

Bounding function. Let us again consider the corridor Ω = [φmin, φmax) × [γmin, γmax) bounded by the four planes p1 to p4. Let us also introduce the plane p0 = (α1φmin + β1φmax, α2γmin + β2γmax), where α1 + β1 = 1 and α2 + β2 = 1, depicted by Fig. 6c. Given that U(sj) ≥ 0 and that Eq. (12) is linear in U(sj), the uncertainty of p0 will always be less than or equal


to that of Ω. This allows us to bound the uncertainty of any plane in a corridor from above and to search for the solution only within the most promising parameter intervals, as follows.

Search procedure. As in [40], we maintain a priority queue L of corridors. At each iteration, we pop the corridor Ωjmax = [pj1, pj2, pj3, pj4] with the highest uncertainty U(Ωjmax) according to Eq. (13) and process it as follows. We introduce two new angles φj0 = (φjmin + φjmax)/2 and γj0 = (γjmin + γjmax)/2 and split the original parameter intervals in two, as shown in Fig. 6c. We compute the uncertainties of the corridors [φmin, φ0) × [γmin, γ0), [φmin, φ0) × [γ0, γmax), [φ0, φmax) × [γmin, γ0) and [φ0, φmax) × [γ0, γmax) and add them to the priority queue L.

Note that after the first iteration, with initialization [0, π), we always operate on acute angles, which allows us to compute the uncertainty scores of corridors as discussed in Section 5.2.1.

Termination condition. The search procedure terminates when the bisector plane p0 = (φ0, γ0) of the corridor Ωjmax touches all the supervoxels in the corridor. To fulfill this condition, it is enough to ensure that the distance from any supervoxel in the corridor to the bisector plane is within the offset 2κ, that is,

\text{distance}(p_0, s_l) \leq 2\kappa , \ \forall s_l \in \Omega_{max}^j .   (14)

Since U(p0) is then greater than the uncertainty of all the remaining corridors, which is itself greater than that of all the planes they contain, as discussed above, p0 is guaranteed to be the optimal plane we are looking for.
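Putting these pieces together, a compact version of the search might read as below. Here plane_score and corridor_bound are assumed to implement Eqs. (11) and (13) from the membership tests above, and the exact 2κ termination test of Eq. (14) is replaced by a simpler angular tolerance; this is a sketch, not the paper's MATLAB implementation.

```python
# Sketch of the branch-and-bound search of Sec. 5.2 over boxes
# (phi_lo, phi_hi, gamma_lo, gamma_hi); corridor_bound upper-bounds every
# plane inside a box (Eq. (13)), plane_score evaluates one plane (Eq. (11)).
import heapq
import numpy as np

def branch_and_bound(plane_score, corridor_bound, tol=1e-3):
    root = (0.0, np.pi, 0.0, np.pi)
    queue = [(-corridor_bound(root), root)]      # max-heap via negated keys
    while queue:
        _, (p_lo, p_hi, g_lo, g_hi) = heapq.heappop(queue)
        p0, g0 = 0.5 * (p_lo + p_hi), 0.5 * (g_lo + g_hi)
        if max(p_hi - p_lo, g_hi - g_lo) < tol:  # corridor ~ one plane: done
            return (p0, g0), plane_score(p0, g0)
        for box in [(p_lo, p0, g_lo, g0), (p_lo, p0, g0, g_hi),
                    (p0, p_hi, g_lo, g0), (p0, p_hi, g0, g_hi)]:
            heapq.heappush(queue, (-corridor_bound(box), box))
    return None
```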

5.2.3 Global Optimization

Our branch-and-bound search is relatively fast for a single supervoxel but not fast enough to be performed for every supervoxel in a stack. Instead, we restrict our search to the t most uncertain supervoxels in the volume.

We rely on the fact that uncertainty scores are often consistent within small neighborhoods, which is especially true for the geometry-based uncertainty of Section 4.3. This enables us to find a solution that is close to the optimal one with a low value of t. The final algorithm thus first takes all the supervoxels S with uncertainties U and selects the top t locations. Then, we find the best plane for each of these t supervoxels and choose the best plane among them.
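A short sketch of this restriction, reusing the branch-and-bound search above through a hypothetical search_around helper, might be:

```python
# Sketch of the global optimization of Sec. 5.2.3: run the plane search
# only around the t most uncertain supervoxels. `search_around(i)` is a
# hypothetical helper returning (plane, score) for the sphere centered
# at supervoxel i.
import numpy as np

def best_plane(U, search_around, t=5):
    top = np.argsort(U)[-t:]                    # t most uncertain centers
    candidates = [search_around(i) for i in top]
    return max(candidates, key=lambda c: c[1])  # keep the best-scoring plane
```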

6 EXPERIMENTS

In this section, we evaluate our full approach on two different Electron Microscopy (EM) datasets and on a Magnetic Resonance Imaging (MRI) dataset. We then demonstrate that CUn is also effective for natural 2D images. On multi-class MRI data and multi-class natural 2D images of faces, the extended version of our approach also results in enhanced performance.

6.1 Setup and Parameters

For all our experiments, we used Boosted Trees selected by gradient boosting [41], [42] as our underlying classifier. This is a general-purpose classifier that can be trained fast, provides probabilistic predictions, and extends naturally to the multi-class scenario. However, there exists no closed-form solution for updating Boosted Trees when new points are added. Thus, AL strategies such as expected model change or expected error reduction are ill suited, because they take hours for one model update at typical dataset sizes in our applications. Given that only limited amounts of training data are available during early AL rounds, we limit the depth of our trees to 2 to avoid over-fitting. Following standard practice, individual trees are optimized using 40%-60% of the available training data chosen at random, and 10 to 40 features are explored per split. The average supervoxel radius κ is 4.3 in the EM dataset and 5.7 in the MRI dataset. We set the number k of nearest neighbors of Sec. 4.3 to the average number of immediately adjacent supervoxels, which is between 7 and 15 depending on the resolution of the image and the size of the supervoxels. However, experiments showed that the algorithm is not very sensitive to the choice of this parameter. We restrict the size of each planar patch to be small enough to typically contain no more than 2 classes of objects, and we explain what happens when this condition is not satisfied. To this end, we take the radius r of Sec. 5.1 to be between 10 and 15, which yields patches such as those depicted by Fig. 7.

Fig. 7. Circular patches to be annotated by the expert, highlighted by the yellow circle, in Electron Microscopy and natural images. The patches can be annotated either with a line that separates 2 classes or by correcting the mistakes in the current prediction, as shown in Fig. 2.

6.1.1 Baselines

For each dataset, we compare our approach against several baselines. The simplest is Random Sampling (Rand), which involves randomly selecting the samples to be labeled. It can be understood as an indicator of how difficult the learning task is.

In practice, the most widely accepted approach is to perform Uncertainty Sampling [29], [1], [21], [5], [20], [9], [22] on supervoxels using one of the three criteria described in [14]. To test several variations of it, we use several standard definitions of uncertainty. The first involves choosing the sample with the smallest posterior probability for its predicted class b1, that is,

\arg\min_{s_i \in S_U} p_\theta(y_i = b_1|x_i) .   (15)

Because of the min-max nature of this strategy, we will refer to it as FMnMx. Uncertainty can also be defined by considering the probability difference between the first and second most highly ranked classes b1 and b2. The most uncertain sample is then taken to be

\arg\min_{s_i \in S_U} \left( p_\theta(y_i = b_1|x_i) - p_\theta(y_i = b_2|x_i) \right) .   (16)

We will refer to this strategy as FMnMar. Finally, the AL procedure can take into account the entire distribution of scores over all classes, compute the Total Entropy H of Sec. 4.1.1, and select

s^* = \arg\max_{s_i \in S_U} H(s_i) ,   (17)


which we will refer to as FEnt. Recall that FMnMx and FMnMar cannot easily be combined with the geometric uncertainty because no upper-bound rule applies to them.

In the case of binary classification, FMnMx, FMnMar and FEnt are strictly equivalent because the corresponding expressions are monotonic functions of each other. In the multi-class scenario, however, using one or the other can result in different behaviors, as shown in [5], [9], [14], [20], [21], [22]. According to [14], FEnt is best for minimizing the expected logarithmic loss, while FMnMx and FMnMar are better suited for minimizing the expected 0/1-loss.
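Given the matrix of predicted class probabilities over the pool SU, the three baselines reduce to a few lines each; the sketch below is illustrative only.

```python
# The baseline selectors of Eqs. (15)-(17), given `proba` of shape
# (n_unlabeled, n_classes); each returns the index of the queried sample.
import numpy as np

def fmnmx(proba):                    # Eq. (15): smallest top probability
    return int(np.argmin(proba.max(axis=1)))

def fmnmar(proba):                   # Eq. (16): smallest margin b1 - b2
    part = np.sort(proba, axis=1)
    return int(np.argmin(part[:, -1] - part[:, -2]))

def fent(proba, eps=1e-12):          # Eq. (17): largest Total Entropy
    H = -np.sum(proba * np.log(proba + eps), axis=1)
    return int(np.argmax(H))
```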

6.1.2 Proposed strategies

All the entropy-based measures introduced in Secs. 4.2 and 4.3 can be used in our unified framework. Let HF be the specific one we use in a given experiment. The strategy then is to select

s^* = \arg\max_{s_i \in S_U} H_F(s_i) .   (18)

Recall that we refer to the Feature Uncertainty (FUn) strategy that relies on the standard Total Entropy as FEnt. By analogy, we will refer to those that rely on the Selection Entropy and Conditional Entropy of Eqs. (3) and (5) as FEntS and FEntC, respectively. Similarly, when using the Combined Uncertainty (CUn) of Sec. 4.4, we will distinguish between CEnt, CEntS, and CEntC, depending on whether we use Total Entropy, Selection Entropy, or Conditional Entropy. For the random walk inference, τmax = 10 iterations yield the best learning rates in the multi-class case and τmax = 20 in the binary-segmentation one.

Any strategy can be applied in a randomly chosen plane, which we denote by adding a p to its name, as in pFEnt. Finally, we refer to the plane selection strategy of Sec. 5 in conjunction with either FUn or CUn as p*FEnt, p*FEntS, p*FEntC, p*CEnt, p*CEntS and p*CEntC, depending on whether the uncertainty from FEnt, FEntS, FEntC, CEnt, CEntS, or CEntC is used in the plane optimization. All plane selection strategies use the t = 5 best supervoxels in the optimization procedure described in Sec. 5. Further increasing this value does not yield any significant learning-rate improvement.

Figs. 2 and 7 jointly depict what a potential user would see with the plane selection strategies, given a small enough patch radius. With a well-designed interface, it typically requires no more than one or two mouse clicks to provide the required feedback, as depicted by Fig. 2. The easiest way to annotate patches with only two classes is to indicate a line between them; in situations where more than two classes co-occur in one patch, we instead allow users to correct mistakes in the current prediction. We will show that this requires no more than three corrections per iteration. For performance-evaluation purposes, we therefore estimate that each user intervention for p*FEnt, p*CEnt, p*FEntS, p*CEntS, p*FEntC and p*CEntC requires either two or three inputs from the user, whereas for the other strategies it requires only one.

Note that p*FEnt is similar in spirit to the approach of Top [33] and can therefore be taken as a good indicator of how that approach would perform on our data. However, unlike in [33], we do not require the user to label the whole plane, and we retain our proposed interface for a fair comparison.

[Figure: (a) two fitted Gaussian score distributions; (b) threshold value versus number of expert inputs for Rand, FEnt, CEnt, p*FEnt and p*CEnt.]
Fig. 8. Threshold selection. (a) We estimate the mean and standard deviation of the classifier scores of positive-class datapoints (µ+ and σ+, data shown in red) and negative-class datapoints (µ−, σ−, data shown in blue) and fit 2 Gaussian distributions. Given their pdfs, we estimate the optimal Bayesian error with threshold h∗. (b) Convergence rate of the classifier threshold obtained by Adaptive Thresholding for different AL strategies.

Fig. 9. Sample images from the image-classification datasets Chinese,Butterflies and Digits.

6.1.3 Adaptive Thresholding for binary AL

The probability of a supervoxel belonging to a certain class, used in Sec. 4.2, is computed as

p_\theta(y_i = y|x_i) = \frac{\exp(-2 (F_y - h_y))}{\sum_{y_j \in Y} \exp(-2 (F_{y_j} - h_{y_j}))} ,   (19)


where F = {F_y | y ∈ Y} are the classifier outputs and h = {h_y | y ∈ Y} are the thresholds [43]. Given enough training data, the threshold can be chosen by cross-validation, but this may be misleading or even impossible in an AL context. In practice, we observe that the optimal threshold value varies significantly across binary classification tasks and that the uncertainty measures are sensitive to it. By contrast, in multi-class scenarios, the threshold values remain close to 0 and our proposed entropy-based strategies are comparatively unaffected. In our experiments, we therefore take the threshold to be 0 for multi-class segmentation and compute it as follows in the binary case. We assume that the scores of the training samples in each class are Gaussian distributed with unknown parameters µ and σ. We then find an optimal threshold h∗ by fitting Gaussian distributions to the scores of the positive and negative classes and choosing the value that yields the smallest Bayesian error, as depicted by Fig. 8a. We refer to this approach as Adaptive Thresholding and use it for all our experiments. Fig. 8b depicts the value of the selected threshold as the amount of annotated data increases. Note that our various strategies yield different convergence rates, with the fastest being the plane-based strategies p*FEnt and p*CEnt.
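A simple version of this procedure fits one Gaussian per class to the training scores and picks, by dense search, the threshold that minimizes the resulting Bayes-error estimate; the sketch below assumes equal class priors for illustration.

```python
# Sketch of Adaptive Thresholding (Sec. 6.1.3): fit Gaussians to the
# positive and negative score distributions and select the h* minimizing
# the estimated Bayes error. Equal class priors are assumed for simplicity.
import numpy as np
from scipy.stats import norm

def adaptive_threshold(scores_pos, scores_neg):
    mu_p, sd_p = np.mean(scores_pos), np.std(scores_pos)
    mu_n, sd_n = np.mean(scores_neg), np.std(scores_neg)
    lo = min(mu_n - 3 * sd_n, mu_p - 3 * sd_p)
    hi = max(mu_n + 3 * sd_n, mu_p + 3 * sd_p)
    grid = np.linspace(lo, hi, 1000)
    # Error(h) = P(score > h | neg) + P(score < h | pos), up to priors.
    err = norm.sf(grid, mu_n, sd_n) + norm.cdf(grid, mu_p, sd_p)
    return grid[np.argmin(err)]
```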

6.1.4 Experimental Protocol

In all cases, we start with 5 labeled supervoxels from each class and perform AL iterations until we receive 100 simulated user inputs in the binary case and 200 in the multi-class case. Each method starts with the same random subset of samples and each experiment is repeated N = 40-50 times. We therefore plot not only accuracy results but also indicate the variance of these results. We use half of the available data for independent testing, and the AL strategy selects new training datapoints from the other half.

We have access to fully annotated ground-truth volumes and use them to simulate the expert's intervention in our experiments. This ground truth also allows us to model several hypothetical human-expert strategies, as will be shown in Sec. 6.5. We detail the specific features we used for EM, MRI, and natural images below.

6.2 Multi-Class Active Learning

Recall from Sec. 4.1 that in multi-class scenarios our different approaches to measuring FUn yield different results, as shown in Fig. 4. Therefore, even though all our strategies derive from a similar intuition, they favor different points. For example, FMnMar selects samples with a small margin between the most probable classes, irrespective of the absolute values of the probabilities, whereas FEntC allows for bigger margins at higher values. Selection Entropy FEntS tends to avoid samples that look like they could belong to any of the existing classes. This property can be useful for avoiding queries on outliers that look unlikely to belong to any class.

To study these differences independently of a full image-segmentation pipeline, we first test the various strategies on a simple multi-class image classification task, using the three datasets depicted by Fig. 9. Digits is the standard MNIST collection of 10 hand-written digits, for which we use raw pixel values as features. Chinese comprises 3 classes from a dataset of Chinese handwriting characters [44]. Butterflies contains 5 classes of British butterfly images from a museum collection [45]. For the Chinese and Butterflies datasets, features are extracted using a Deep Net [45], [46].

We use a logistic regression classifier and test our various multi-class AL strategies, including Expected Model Change (EMC) [14], [17], [6], [18]. The results are shown in Fig. 10. The strategies based on the Selection and Conditional Entropy perform either better than or at least as well as the strategies based on the standard measures of uncertainty. The performance of the EMC approach is not consistent and does not justify its high computational cost: 45 and 310 seconds per iteration on the Chinese and Butterflies datasets with 4096 samples and 359 and 750 features respectively, against 0.005 and 0.01 seconds for Conditional Entropy. The EMC execution time grows with the AL pool size, and we thus did not run experiments with more than 10 000 samples.

6.3 Results on volumetric data

6.3.1 Results on EM data

First, we work with two 3D EM stacks of rat neural tissue, one from the striatum and the other from the hippocampus.² One stack of size 318×711×422 (165×1024×653 for the hippocampus) is used for training, and another stack of size 318×711×450 (165×1024×883) is used to evaluate the performance. Their resolution is 5nm in all three spatial orientations. The slices of Fig. 1 as well as the patches of Fig. 7a come from the striatum dataset. The hippocampus volume is shown in Fig. 11a. Since the images have the same resolution in all dimensions, they can be viewed equally well in all orientations, and specialized tools have been developed for use by neuroscientists [47].

The task is to segment mitochondria, the intra-cellular structures that supply the cell with its energy and are of great interest to neuroscientists. It is extremely laborious to annotate enough training data for learning-based segmentation algorithms to work satisfactorily. Furthermore, different brain areas have different characteristics, which means that the annotation process must be repeated often. The features we feed our Boosted Trees rely on local texture and shape information, using ray descriptors and intensity histograms as in [35].

In Fig. 12, we plot the performance of all the approaches in terms of the intersection-over-union (IoU) score [48], a commonly used measure for this kind of application, as a function of the annotation effort. The horizontal line at the top depicts the IoU score obtained by using the whole training set, which comprises 276 130 and 325 880 supervoxels for the striatum and the hippocampus, respectively. FEnt provides a boost over Rand, and CEnt yields a larger one. Any strategy can be combined with batch-mode AL by selecting a 2D plane to be annotated. For example, strategies pRand, pFEnt and pCEnt present the user with a randomly selected 2D plane around the sample selected by Rand, FEnt and CEnt, respectively. Adding a plane boosts the performance of all the corresponding strategies, but further improvement is obtained by introducing the batch-mode geometry query selection with the optimal plane search of the branch-and-bound algorithm in strategies p*FEnt and p*CEnt. The final strategy p*CEnt outperforms all the others thanks to the synergy between the geometry-inspired uncertainty criterion and the selection of a batch.

Recall that these numbers are averaged over many runs. In Table 1, we give the corresponding variances. Note that both using the CUn and using the batch mode with optimal plane selection tend to reduce the variances, thus making the process more predictable. Also

2. http://cvlab.epfl.ch/data/em


[Figure: learning curves (accuracy vs. # inputs from expert) on the Chinese, Butterflies and Digits datasets for Rand, FMnMx, FMnMar, FEnt, FEntS, FEntC and EMC.]
Fig. 10. Multi-class AL strategies applied to image classification tasks, with logistic regression as the underlying classifier. We compare the standard multi-class AL criteria to the newly introduced entropy-based criteria.

Fig. 11. Examples of 3D datasets. (a) Hippocampus volume for mitochondria segmentation. (b) MRI data for tumor segmentation (FLAIR image).


6.3.2 Results on MRI data

In this section, we consider multimodal brain-tumor segmentation in MRI brain scans. Segmentation quality critically depends on the amount of training data, and only highly-trained experts can provide it. T1, T2, FLAIR, and post-Gadolinium T1 MR images are available in the BRATS dataset for each of 20 subjects [49]. We use standard filters, such as Gaussian, gradient, tensor, Laplacian of Gaussian, and Hessian filters with different parameters, to compute the feature vectors we feed to the Boosted Trees.
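A minimal sketch of such a filter-bank feature extraction with SciPy is given below; the scales and the exact filter subset are illustrative choices, not necessarily the parameters used in our experiments:

```python
import numpy as np
from scipy import ndimage

def filter_features(volume, sigmas=(1.0, 2.0, 4.0)):
    """Stack simple per-voxel filter responses at several scales.

    `sigmas` is an illustrative choice; the experiments vary the filter
    parameters and also include tensor and Hessian responses."""
    vol = volume.astype(np.float32)
    feats = []
    for s in sigmas:
        feats.append(ndimage.gaussian_filter(vol, s))              # smoothed intensity
        feats.append(ndimage.gaussian_gradient_magnitude(vol, s))  # gradient filter
        feats.append(ndimage.gaussian_laplace(vol, s))             # Laplacian of Gaussian
    return np.stack(feats, axis=-1)  # shape: volume.shape + (n_features,)
```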

6.3.2.1 Foreground-Background Segmentation: We first consider the segmentation of tumor versus healthy tissue. We plot the performance of all the approaches in terms of the Dice score [32] (Fig. 13, left), a commonly used quality measure for brain tumor segmentation, as a function of the annotation effort, and in Table 1 we give the corresponding variances. We observe the same pattern as in Fig. 12, with p*CEnt again yielding the highest score. Note that the difference between p*CEnt and pCEnt is greater than that between p*FEnt and pFEnt in all the experiments. This is evidence of the synergy brought by the geometric uncertainty and the geometry-based batch selection.
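For reference, a minimal sketch of the Dice score on two binary masks; the names are illustrative:

```python
import numpy as np

def dice_score(pred, gt):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) of two binary masks."""
    pred, gt = np.asarray(pred, bool), np.asarray(gt, bool)
    denom = pred.sum() + gt.sum()
    return 2.0 * np.logical_and(pred, gt).sum() / denom if denom > 0 else 1.0
```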

The patch radius parameter r of Sec. 5.1 plays an important role in the plane selection procedure. To evaluate its influence, we recomputed our p*CEnt results 50 times using three different values, r = 10, 15, and 20. The resulting plot is shown in Fig. 13 on the right. As expected, with larger radii the learning rate is slightly higher, since more voxels are labeled at each iteration. However, as the patches become larger, it is no longer clear that they can be annotated with small user effort, which is why we limit ourselves to radii of 10 to 15.

6.3.2.2 Multi-Class Segmentation: To test our multi-class approach, we use the full label set of the BRATS competition: healthy tissue (label 1), necrotic center (2), edema (3), non-enhancing gross abnormalities (4), and enhancing tumor core (5). Fig. 14 shows a ground-truth example for one of the volumes, where different classes are indicated by different colors. Note that the ground truth is highly unbalanced: the full training dataset contains 4000 samples of healthy tissue, 1600 of edema, 750 of enhancing tumor core, 250 of necrotic center, and 200 of non-enhancing gross abnormalities. We use the protocol of the BRATS competition [49] to analyze our results. This involves evaluating how well we segment complete tumors (classes 2, 3, 4, and 5), core tumors (classes 2, 4, and 5), and enhancing tumors (class 5 only).
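The three evaluation tasks amount to collapsing the label map into binary masks. A minimal sketch, with the label groupings taken from the protocol above and illustrative names:

```python
import numpy as np

# The BRATS evaluation tasks reduce the multi-class label map to binary
# masks; the groupings follow the protocol described in the text.
TASKS = {
    "complete":  [2, 3, 4, 5],  # complete tumor
    "core":      [2, 4, 5],     # tumor core
    "enhancing": [5],           # enhancing tumor
}

def task_mask(labels, task):
    """Binary mask of the voxels belonging to the given evaluation task."""
    return np.isin(labels, TASKS[task])
```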

Fig. 15 depicts our results and those of the selected baselines on these three tasks. As before, the results clearly indicate that AL provides a significant improvement over passive selection. In this case, we do not show all the variants of batch-mode query selection, for the sake of figure clarity. Among the basic strategies, FMnMar gives the best performance in subtasks 1 and 2, and FMnMx in subtask 3. Our entropy-based uncertainty strategies FEntS and FEntC perform at least as well as the corresponding baselines FMnMx and FMnMar, as in the image classification task before. Next, the CUn strategies CEnt, CEntS, and CEntC outperform their corresponding FUn versions FEnt, FEntS, and FEntC, where the improvement depends on the subtask and the strategy. Note that FEntS and FEntC, as well as CEntS and CEntC, perform equally well and can thus be used interchangeably. A further improvement is obtained when each of the strategies is combined with the optimal plane selection; we omit the random plane selection for the clarity of the figure.

In practice, we observed that around 43% of the selected patches contain more than two classes. In such cases, simply finding a line separating two classes is not enough, so we propose a different annotation scheme: the current prediction on the supervoxels is displayed to the annotator, who then corrects the mistakes in the prediction. We counted the number of misclassified samples throughout the experiments; on average, no more than 10.42% of the supervoxel classes were erroneous, that is, approximately 2.42 samples per iteration. Thus, we show the learning curves for the plane-based strategy counting one annotation iteration as either two or three inputs from the user, with both variants dominating the simple strategies.


Fig. 12. Comparison of various AL strategies for (binary) mitochondria segmentation (IoU vs. number of inputs from the expert). Left: striatum dataset; right: hippocampus dataset.

Fig. 13. Comparison of various AL strategies on MRI data for binary tumor segmentation (Dice score vs. number of inputs from the expert). Left: Dice scores of the AL strategies on the BRATS2012 dataset; right: the p*CEnt strategy with patches of radius 10, 15, and 20.

Fig. 14. Example of ground truth for multi-class brain-tumor segmentation. Necrotic center in red, edema in green, non-enhancing gross abnormalities in blue, and enhancing tumor core in yellow. Best seen in color.


The difference between the competing CUn strategies becomes negligible, with a slight dominance of Selection Entropy p*CEntS in subtasks 1 and 2 and of Total Entropy p*CEnt in the last subtask. In seven of nine cases, CUn in conjunction with the plane selection yields better results than FUn with plane selection, and in two of nine they perform equally well. For illustrative purposes, Fig. 15 contains only the best-performing learning curve of p*CEntS, and Fig. 16 shows the performance of all strategies based on the Selection Entropy in the third subtask.

TABLE 1
Variability of results (in the metric corresponding to each task) for different binary AL strategies. 80% of the scores lie within the indicated interval. The best (lowest) value in each row is achieved by p*CEnt, except for natural images, where pCEnt is best and the plane-optimizing strategies do not apply.

Dataset    FEnt    CEnt    pFEnt   pCEnt   p*FEnt   p*CEnt
Striatum   0.133   0.105   0.121   0.094   0.115    0.086
Hippoc.    0.117   0.101   0.081   0.092   0.090    0.078
MRI        0.076   0.064   0.078   0.074   0.073    0.048
Natural    0.145   0.140   0.149   0.124   —        —

6.4 Results on Natural Images

Finally, we turn to natural 2D images and replace supervoxels by superpixels. In this case, the plane selection reduces to a simple selection of patches in the image, and we refer to these strategies as pFEnt and pCEnt because they do not involve the branch-and-bound selection procedure. In practice, we simply select superpixels together with their 4 neighbors in binary segmentation and 7 in multi-class segmentation, as sketched below. This parameter is determined by the size of the superpixels used in the oversegmentation; increasing it would lead to higher learning rates, in the same way as increasing the patch radius r, but we restrict it to a small value to ensure that labeling can be done with two mouse clicks on average.
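A minimal sketch of how such a neighborhood can be read off a superpixel label map, assuming a SLIC-style oversegmentation; the dilation-based adjacency test and the function name are illustrative, not our exact implementation:

```python
import numpy as np
from scipy import ndimage

def superpixel_neighbors(labels, sp_id, k):
    """Return up to k superpixels adjacent to `sp_id` in a 2D label map.

    Adjacency is found by dilating the superpixel's mask by one pixel and
    collecting the labels it touches; k = 4 (binary) or 7 (multi-class)."""
    mask = labels == sp_id
    ring = ndimage.binary_dilation(mask) & ~mask      # one-pixel boundary ring
    neighbors, counts = np.unique(labels[ring], return_counts=True)
    order = np.argsort(-counts)                       # prefer longest shared borders
    return neighbors[order][:k]
```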


Fig. 15. Comparison of different AL strategies for multi-class MRI segmentation (Dice score vs. number of inputs from the expert). Dice scores for the three BRATS2012 tasks: complete tumor, tumor core, and enhancing tumor.


6.4.0.1 Foreground-Background Segmentation: We study the results of binary AL on the Weizmann horse database [50] in Fig. 17 and give the corresponding variances in Table 1. To compute image features, we use Gaussian, Laplacian, Laplacian of Gaussian, Prewitt, and Sobel filters on intensity and color values, gather first-order statistics such as the local standard deviation, local range, and gradient magnitude and direction histograms, and add SIFT features. The pattern is again similar to the one observed in Figs. 12 and 13, with the difference between CEnt and pCEnt being smaller because the 2D batch-mode approach does not involve any optimization of the patch selection. Note, however, that while the first few iterations result in reduced scores for all methods, the plane-based methods recover from this effect quite fast.

Fig. 16. Dice score for enhancing tumor segmentation. Performance of the strategies that have Selection Entropy at their basis.

0 20 40 60 80 1000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

# inputs from expert

f−sc

ore

all dataRandFEntCEntpFEntpCEnt

Fig. 17. Comparison of various AL strategies for binary segmentation ofnatural images.


6.4.0.2 Multi-Class Face Segmentation: We apply multi-class AL to the task of segmenting human faces [51]. We distinguish 6 classes: background, hair, skin, eyes, nose, and mouth. Fig. 18 shows an example image from the dataset with the corresponding ground-truth annotation. Notice again that we are dealing with an unbalanced problem: the classes 'eyes', 'nose', and 'mouth' are clearly a minority compared to 'background', 'skin', and 'hair'. We use the same features as for the Weizmann horse database, plus HOG features.

As in the case of the multi-class MRI images, we must handle cases in which more than two classes are present in a single patch. However, this only happens in 0.84% of the selected patches, because three classes do not often co-occur in the same small neighborhood. Thus, we can still use the simple line-separation heuristic depicted by Fig. 2 in most iterations, and we leave the user the option of a standard brush for the rare refined annotations.

In Fig. 19, we compare our results to those of the baselines in terms of the precision averaged over each of the 6 classes.


Fig. 18. Dataset for face segmentation. (a) Example image from the face segmentation dataset. (b) Ground-truth annotation for the given image; different classes are indicated by different colors. Best viewed in color.

This measure was chosen because it is better suited to capturing the performance on the smaller classes and thus better reflects segmentation performance with unbalanced classes. At the same time, we monitor the total precision (but omit the figure for conciseness), which behaves similarly for all AL strategies; this is done to ensure that performance on the dominant classes is not sacrificed. The entropy-based algorithms FEntS and FEntC are better than the standard FMnMx and FMnMar, respectively. Moreover, entropy-based selection can be combined with CUn, which brings a further improvement in average precision with the strategies CEnt, CEntS, and CEntC. Next, each of the strategies can be used in conjunction with patch selection, which further increases the learning rate. We show the patch-selection results only for Selection Entropy and Conditional Entropy, and skip Total Entropy as it performs poorly in total precision. As we can see, the combination of plane selection with the CUn strategies demonstrates the best results at the end of the learning experiments, with the best result obtained by pCEntS.
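For clarity, a minimal sketch of the class-averaged (macro) precision used here, with illustrative names:

```python
import numpy as np

def average_precision_per_class(y_true, y_pred, n_classes=6):
    """Precision averaged over the classes (macro average)."""
    precisions = []
    for c in range(n_classes):
        predicted_c = (y_pred == c)
        if predicted_c.sum() > 0:                     # skip classes never predicted
            precisions.append((y_true[predicted_c] == c).mean())
    return float(np.mean(precisions))
```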

To understand the implications of CUn, we study the distances to the closest class boundary for the selected samples. For this purpose, we count how many samples lie within a radius of 10 pixels from the boundary for 2 strategies: Rand and CEntS. We observe that the CEntS strategy samples 7.4% more datapoints in this area than Rand. The larger number of superpixels selected in this area illustrates the effect of the geometric component, which prefers regions in the non-smooth areas of the prediction.
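A minimal sketch of this boundary-distance analysis, assuming a 2D predicted label map and a list of queried (row, col) locations; the 4-neighborhood boundary test is an illustrative choice:

```python
import numpy as np
from scipy import ndimage

def near_boundary_fraction(labels, points, radius=10):
    """Fraction of queried pixel locations within `radius` of a class boundary.

    `points` is a list of (row, col) tuples. Boundary pixels are those whose
    4-neighborhood contains another class; distances to them come from a
    Euclidean distance transform."""
    boundary = np.zeros_like(labels, dtype=bool)
    boundary[:-1, :] |= labels[:-1, :] != labels[1:, :]
    boundary[:, :-1] |= labels[:, :-1] != labels[:, 1:]
    dist = ndimage.distance_transform_edt(~boundary)  # distance to nearest boundary
    rows, cols = zip(*points)
    return float(np.mean(dist[rows, cols] <= radius))
```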

Fig. 19. Comparison of several AL strategies for multi-class face segmentation (average precision vs. number of inputs from the expert).

6.5 Active Learning or Human Intuition

As part of the AL query selection procedure, we predict the segmentation of the whole training volume at every iteration. Given this prediction, a human expert could be expected to manually identify patches that are worth labeling. For example, he might first correct the most obvious mistakes or, alternatively, first label patches at the boundary between classes. To illustrate this, we implemented two selection strategies that simulate these behaviors. In the first, we select the most confidently but wrongly classified sample (the max error strategy). In the second, we select samples at the classification boundaries (the boundary strategy). We ran fifty trials using each of these two strategies on the face segmentation problem of Section 6.4.0.2. Fig. 20 depicts the results.
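A minimal sketch of these two simulated strategies, assuming access to the classifier's class posteriors on the pool and, for simulation purposes only, to the true labels. Interpreting "at the classification boundary" as a small margin between the two most probable classes is our illustrative proxy, not necessarily the exact criterion used in the experiments:

```python
import numpy as np

def max_error_query(proba, y_true):
    """Most confidently classified sample among the misclassified ones."""
    y_pred = proba.argmax(axis=1)
    confidence = proba.max(axis=1)
    confidence[y_pred == y_true] = -np.inf   # only consider mistakes
    return int(np.argmax(confidence))

def boundary_query(proba):
    """Sample closest to the classification boundary, i.e., with the
    smallest margin between the two most probable classes."""
    top2 = np.sort(proba, axis=1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]
    return int(np.argmin(margin))
```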

Fig. 20. Hypothetical human expert selection strategies (precision vs. number of inputs from the expert; random, boundary, and max error). Strategies that are intuitive for a human annotator do not result in better performance than passive sampling.

Surprisingly, the human strategies perform much worse than even the random sample selection scheme, which confirms the difficulty of the AL problem. The heuristics we proposed derive from our intuitive understanding of the problem, but applying these intuitions is not straightforward for a human user. Thus, intelligent and automated query selection is necessary to determine how uncertain the classifier is and what smoothness prior should be used when selecting the next samples to be labeled.

7 CONCLUSION

In this paper, we introduced an approach that exploits the geometric priors inherent to images to increase the effectiveness of Active Learning for segmentation. We introduced entropy-based AL methods for multi-class classification that empirically dominate standard uncertainty measures such as Total Entropy, and that combine elegantly with geometric uncertainty for segmentation tasks. For 2D images, our approach relies on uncertainty sampling that accounts not only for the uncertainty of the prediction at a specific location but also for the uncertainty in its neighborhood. For 3D image stacks, it adds the ability to automatically select a planar patch in which manual annotation is easy to perform.

We demonstrated the efficiency of our approach on several datasets featuring MRI, EM, and natural images, both for foreground-background and for multi-class segmentation. In future work, we will strive to replace the heuristics we have introduced by AL strategies that are themselves learned.


ACKNOWLEDGEMENTS

This project has received funding from the European Union's Horizon 2020 Research and Innovation Programme under Grant Agreement No. 720270 (HBP SGA1). We would like to thank Lucas Maystre for asking many questions that helped to improve this article. The research on multi-class AL started as a semester project by Udaranga Wickramasinghe. Finally, we would like to thank Carlos Becker for proofreading and for his comments on the text.


REFERENCES

[1] C. Long, G. Hua, and A. Kapoor, "Active Visual Recognition with Expertise Estimation in Crowdsourcing," in International Conference on Computer Vision, 2013.

[2] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. Zitnick, "Microsoft COCO: Common Objects in Context," in European Conference on Computer Vision, 2014, pp. 740–755.

[3] A. J. Joshi, F. Porikli, and N. P. Papanikolopoulos, "Scalable Active Learning for Multiclass Image Classification," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 11, pp. 2259–2273, 2012.

[4] A. Kapoor, K. Grauman, R. Urtasun, and T. Darrell, "Active Learning with Gaussian Processes for Object Categorization," in International Conference on Computer Vision, 2007.

[5] A. Joshi, F. Porikli, and N. Papanikolopoulos, "Multi-Class Active Learning for Image Classification," in Conference on Computer Vision and Pattern Recognition, 2009.

[6] A. Vezhnevets, V. Ferrari, and J. Buhmann, "Weakly Supervised Structured Output Learning for Semantic Segmentation," in Conference on Computer Vision and Pattern Recognition, 2012.

[7] J. Maiora and M. Graña, "Abdominal CTA Image Analysis through Active Learning and Decision Random Forests: Application to AAA Segmentation," in International Joint Conference on Neural Networks, 2012.

[8] T. Luo, K. Kramer, S. Samson, A. Remsen, D. B. Goldgof, L. O. Hall, and T. Hopkins, "Active Learning to Recognize Multiple Types of Plankton," in International Conference on Pattern Recognition, 2004.

[9] J. Long, E. Shelhamer, and T. Darrell, "Fully Convolutional Networks for Semantic Segmentation," in Conference on Computer Vision and Pattern Recognition, 2015.

[10] S. Tong and D. Koller, "Support Vector Machine Active Learning with Applications to Text Classification," Machine Learning, 2002.

[11] D. Lewis and W. Gale, "A Sequential Algorithm for Training Text Classifiers," in ACM SIGIR Proceedings on Research and Development in Information Retrieval, 1994.

[12] K. Konyushkova, R. Sznitman, and P. Fua, "Introducing Geometry into Active Learning for Image Segmentation," in International Conference on Computer Vision, 2015.

[13] B. Schmid, J. Schindelin, A. Cardona, M. Longair, and M. Heisenberg, "A High-Level 3D Visualization API for Java and ImageJ," BMC Bioinformatics, vol. 11, p. 274, 2010.

[14] B. Settles, "Active Learning Literature Survey," University of Wisconsin–Madison, Tech. Rep., 2010.

[15] R. Gilad-Bachrach, A. Navot, and N. Tishby, "Query by Committee Made Real," in Advances in Neural Information Processing Systems, 2005.

[16] J. Iglesias, E. Konukoglu, A. Montillo, Z. Tu, and A. Criminisi, "Combining Generative and Discriminative Models for Semantic Segmentation," in Information Processing in Medical Imaging, 2011.

[17] R. Sznitman and B. Jedynak, "Active Testing for Face Detection and Localization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 10, pp. 1914–1920, June 2010.

[18] C. Kading, A. Freytag, E. Rodner, P. Bodesheim, and J. Denzler, "Active Learning and Discovery of Object Categories in the Presence of Unnameable Instances," in Conference on Computer Vision and Pattern Recognition, 2015, pp. 4343–4352.

[19] S. Hoi, R. Jin, J. Zhu, and M. Lyu, "Batch Mode Active Learning and Its Application to Medical Image Classification," in International Conference on Machine Learning, 2006.

[20] Y. Yang, Z. Ma, F. Nie, X. Chang, and A. G. Hauptmann, "Multi-Class Active Learning by Uncertainty Sampling with Diversity Maximization," International Journal of Computer Vision, vol. 113, no. 2, pp. 113–127, 2015.

[21] P. Jain and A. Kapoor, "Active Learning for Large Multi-Class Problems," in Conference on Computer Vision and Pattern Recognition, 2009.

[22] C. Korner and S. Wrobel, "Multi-Class Ensemble-Based Active Learning," in European Conference on Machine Learning, 2006, pp. 687–694.

[23] F. Olsson, "A Literature Survey of Active Machine Learning in the Context of Natural Language Processing," Swedish Institute of Computer Science, 2009.

[24] S. D. Jain and K. Grauman, "Active Image Segmentation Propagation," in Conference on Computer Vision and Pattern Recognition, 2016.

[25] Q. Li, Z. Deng, Y. Zhang, X. Zhou, U. Nagerl, and S. Wong, "A Global Spatial Similarity Optimization Scheme to Track Large Numbers of Dendritic Spines in Time-Lapse Confocal Microscopy," IEEE Transactions on Medical Imaging, vol. 30, no. 3, pp. 632–641, 2011.

[26] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Scholkopf, "Learning with Local and Global Consistency," in Advances in Neural Information Processing Systems, 2004, pp. 321–328.

[27] B. Settles, "From Theories to Queries: Active Learning in Practice," Active Learning and Experimental Design, 2011.

[28] B. Settles, M. Craven, and S. Ray, "Multiple-Instance Active Learning," in Advances in Neural Information Processing Systems, 2008.

[29] E. Elhamifar, G. Sapiro, A. Yang, and S. Sastry, "A Convex Optimization Framework for Active Learning," in International Conference on Computer Vision, 2013.

[30] S. Olabarriaga and A. Smeulders, "Interaction in the Segmentation of Medical Images: A Survey," Medical Image Analysis, 2001.

[31] A. Al-Taie, H. K. Hahn, and L. Linsen, "Uncertainty Estimation and Visualization in Probabilistic Segmentation," Computers & Graphics, 2014.

[32] N. Gordillo, E. Montseny, and P. Sobrevilla, "State of the Art Survey on MRI Brain Tumor Segmentation," Magnetic Resonance Imaging, 2013.

[33] A. Top, G. Hamarneh, and R. Abugharbieh, "Active Learning for Interactive 3D Image Segmentation," in Conference on Medical Image Computing and Computer Assisted Intervention, 2011.

[34] B. Andres, U. Koethe, M. Helmstaedter, W. Denk, and F. Hamprecht, "Segmentation of SBFSEM Volume Data of Neural Tissue by Hierarchical Classification," in DAGM Symposium on Pattern Recognition, 2008, pp. 142–152.

[35] A. Lucchi, K. Smith, R. Achanta, G. Knott, and P. Fua, "Supervoxel-Based Segmentation of Mitochondria in EM Image Stacks with Learned Shape Features," IEEE Transactions on Medical Imaging, vol. 31, no. 2, pp. 474–486, February 2012.

[36] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Suesstrunk, "SLIC Superpixels Compared to State-of-the-Art Superpixel Methods," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 11, pp. 2274–2282, November 2012.

[37] L. Lovasz, "Random Walks on Graphs: A Survey," Combinatorics, Paul Erdős is Eighty, 1993.

[38] L. Grady, "Random Walks for Image Segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 11, pp. 1768–1783, 2006.

[39] A. Mosinska, R. Sznitman, P. Glowacki, and P. Fua, "Active Learning for Delineation of Curvilinear Structures," in Conference on Computer Vision and Pattern Recognition, 2016.

[40] C. Lampert, M. Blaschko, and T. Hofmann, "Beyond Sliding Windows: Object Localization by Efficient Subwindow Search," in Conference on Computer Vision and Pattern Recognition, 2008.

[41] R. Sznitman, C. Becker, F. Fleuret, and P. Fua, "Fast Object Detection with Entropy-Driven Evaluation," in Conference on Computer Vision and Pattern Recognition, 2013, pp. 3270–3277.

[42] C. Becker, R. Rigamonti, V. Lepetit, and P. Fua, "Supervised Feature Learning for Curvilinear Structure Segmentation," in Conference on Medical Image Computing and Computer Assisted Intervention, September 2013.

[43] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning. Springer, 2001.

[44] C.-L. Liu, F. Yin, D.-H. Wang, and Q.-F. Wang, "CASIA Online and Offline Chinese Handwriting Databases," in International Conference on Document Analysis and Recognition, 2011, pp. 37–41.

[45] E. Johns, O. Mac Aodha, and G. J. Brostow, "Becoming the Expert: Interactive Multi-Class Machine Teaching," in Conference on Computer Vision and Pattern Recognition, 2015, pp. 2616–2624.

[46] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional Architecture for Fast Feature Embedding," in Proceedings of the 22nd ACM International Conference on Multimedia, 2014, pp. 675–678.

[47] T. Pietzsch, S. Saalfeld, S. Preibisch, and P. Tomancak, "BigDataViewer: Visualization and Processing for Large Image Data Sets," Nature Methods, vol. 12, no. 6, pp. 481–483, 2015.

[48] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes Challenge (VOC2010) Results," 2010.

[49] B. H. Menze, A. Jakab, et al., "The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS)," IEEE Transactions on Medical Imaging, 2014.

[50] E. Borenstein and S. Ullman, "Combined Top-Down/Bottom-Up Segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 12, pp. 2109–2125, 2008.

[51] K. Khan, M. Mauro, and R. Leonardi, "Multi-Class Semantic Segmentation of Faces," in International Conference on Image Processing, 2015.