
Aggregating Crowdsourced Image Segmentations

Doris Jung-Lin Lee
University of Illinois, Urbana-Champaign
[email protected]

Akash Das Sarma
Facebook, Inc.
[email protected]

Aditya Parameswaran
University of Illinois, Urbana-Champaign
[email protected]

Abstract

Instance-level image segmentation provides rich information crucial for scene understanding in a variety of real-world applications. In this paper, we evaluate multiple algorithms for the crowdsourced image segmentation problem, including novel worker-aggregation-based methods and retrieval-based methods from prior work. We characterize the different types of worker errors observed in crowdsourced segmentation, and present a clustering algorithm, used as a preprocessing step, that is able to capture and eliminate errors arising from workers having different semantic perspectives. We demonstrate that aggregation-based algorithms attain higher accuracies than existing retrieval-based approaches, while scaling better with increasing numbers of worker segmentations.

1 Introduction

Precise, instance-level object segmentation is crucial for identifying and tracking objects in a variety of real-world emergent applications of autonomy, including robotics (Natonek 1998), image organization and retrieval (Yamaguchi 2012), and medicine (Irshad et al. 2014). To this end, there has been a lot of work on employing crowdsourcing to generate training data for segmentation, including Pascal-VOC (Everingham et al. 2015), LabelMe (Torralba et al. 2010), OpenSurfaces (Bell et al. 2015), and MS-COCO (Lin et al. 2014). Unfortunately, raw data collected from the crowd is known to be noisy due to varying degrees of worker skill, attention, and motivation (Bell et al. 2014; Welinder et al. 2010).

To deal with these challenges, many have employed heuristics indicative of crowdsourced segmentation quality to pick the best worker-provided segmentation (Sorokin and Forsyth 2008; Vittayakorn and Hays 2011). However, this approach ends up discarding the majority of the worker segmentations and is limited by what the best worker can do. In this paper, we make two contributions. First, we introduce a novel class of aggregation-based methods that incorporate portions of segmentations from multiple workers into a combined segmentation, described in Section 4. To our surprise, despite its intuitive simplicity, we have not seen this class of algorithms described or evaluated in prior work; we evaluate it against existing methods in Section 6. Second, our analysis of common worker errors in crowdsourced segmentation shows that workers often segment the wrong objects or erroneously include or exclude large, semantically ambiguous portions of an object in the resulting segmentation. We discuss such errors in Section 3 and propose a clustering-based preprocessing technique that resolves them in Section 5.

Copyright © 2018 for this paper by its authors. Copying permitted for private and academic purposes.

Figure 1: Taxonomy of quality evaluation algorithms for crowdsourced segmentation, including existing methods (blue) and our novel algorithms (yellow).

2 Related Work

As shown in Figure 1, quality evaluation methods for crowdsourced segmentation can be classified into two categories.

Retrieval-based methods pick the “best” worker segmentation based on scoring criteria that evaluate the quality of each segmentation, including vision information (Vittayakorn and Hays 2011; Russakovsky et al. 2015) and clickstream behavior (Cabezas et al. 2015; Sameki et al. 2015; Sorokin and Forsyth 2008).

Aggregation-based methods combine multiple worker segmentations to produce a final segmentation that is not restricted to any single worker segmentation. An aggregation-based majority vote approach was employed in Sameki et al. (2015), but to create an expert-established gold standard for characterizing their dataset and algorithmic accuracies, rather than for segmentation quality evaluation as described here.

3 Error Analysis

On collecting and analyzing a number of crowdsourced segmentations (described in Section 6), we found that common worker segmentation errors can be classified into three types: (1) Semantic Ambiguity: workers have differing opinions on whether particular regions belong to an object (Figure 2 left: annotations around ‘flower and vase’ when ‘vase’ is requested); (2) Semantic Error: workers annotate the wrong object entirely (Figure 2 right: annotations around ‘turtle’ and ‘monitor’ when ‘computer’ is requested); and (3) Boundary Imperfection: workers make unintentional mistakes while drawing the boundaries, due to low image resolution, the small area of the object, or a lack of drawing skill (Figure 3 left: imprecision around the ‘dog’ object).

Quality evaluation methods in prior work have largely focused on minimizing boundary imperfection issues. So, we first describe our novel aggregation-based algorithms designed to reduce boundary imperfections in Section 4. Next, in Section 5, we discuss a preprocessing method that eliminates semantic ambiguities and errors. We present our experimental evaluation in Section 6.

Figure 2: Examples of common worker errors: semantic ambiguity [‘vase’], semantic error [‘computer’], and boundary imprecision [‘dog’]. Each panel shows the ground truth, individual worker segmentations, and a pointer to the semantic object to be segmented.

4 Fixing Boundary Imperfections

At the heart of our aggregation techniques is the tile data representation. A tile is the smallest non-overlapping discrete unit created by overlaying all of the workers’ segmentations on top of each other. The tile representation allows us to aggregate segmentations from multiple workers, rather than being restricted to a single worker’s segmentation, allowing us to fix one worker’s errors with help from another. In Figure 3 (left), we display three worker segmentations for a toy example with six resulting tiles. Any subset of these tiles can contribute towards the final segmentation.

This simple but powerful idea of tiles also allows us to reformulate our problem from one of “generating a segmentation” to a setting that is much more familiar to crowdsourcing researchers. Since tiles are the lowest-granularity units created by overlaying all workers’ segmentations on top of each other, each tile is either completely contained within or completely outside a given worker segmentation. Specifically, we can regard a worker segmentation as multiple boolean responses, where the worker has voted ‘yes’ or ‘no’ on every tile independently: a worker votes ‘yes’ for every tile that is contained in their segmentation, and ‘no’ for every tile that is not. As shown in Figure 3 (right), tile t2 is voted ‘yes’ by workers 1, 2, and 3, while tile t3 is voted ‘yes’ by workers 2 and 3. The goal of our aggregation algorithms is to pick an appropriate set of tiles that effectively trades off precision versus recall.
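To make the tile construction concrete, here is a minimal sketch (ours, not from the paper; helper names such as tiles_and_votes are hypothetical) that derives tiles and per-tile worker votes from worker segmentations given as boolean numpy masks of equal shape. A tile is simply a maximal set of pixels sharing an identical in/out signature across all workers.

```python
import numpy as np

def tiles_and_votes(masks):
    """Derive tiles and worker votes from boolean masks.

    masks: array of shape (num_workers, height, width), where
    masks[w] is True on pixels inside worker w's segmentation.
    Returns per-tile pixel areas, the vote matrix votes[w, t]
    (True iff worker w voted 'yes' on tile t), and a pixel-to-tile map.
    """
    masks = np.asarray(masks, dtype=bool)
    n_workers = masks.shape[0]
    flat = masks.reshape(n_workers, -1)            # (workers, pixels)
    # Each pixel's signature is its column of in/out bits; a tile is
    # the set of pixels sharing one signature.
    signatures, tile_of_pixel = np.unique(
        flat.T, axis=0, return_inverse=True)
    tile_areas = np.bincount(tile_of_pixel)        # pixels per tile
    votes = signatures.T                           # (workers, tiles)
    return tile_areas, votes, tile_of_pixel
```

Note that the region lying outside every worker segmentation (the all-‘no’ signature) also surfaces as a tile; it receives no votes and never enters an output segmentation.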

Now that we have modeled a segmentation as a collection of worker votes over tiles, we can develop familiar variants of standard quality evaluation algorithms for this setting.

Aggregation: Majority Vote Aggregation (MV)
This simple algorithm includes a tile in the output segmentation if and only if the tile has ‘yes’ votes from at least 50% of all workers.

Aggregation: Expectation-Maximization (EM)
Unlike MV, which assumes that all workers perform uniformly, EM approaches infer the likelihood that a tile is part of the ground truth segmentation while simultaneously estimating hidden worker qualities. In Section 6, we evaluate an EM variant that assumes each worker has a (different) fixed probability of voting correctly. Details of this and of more fine-grained variants can be found in our technical report (Lee et al. 2018).

Aggregation: Greedy Tile Picking (greedy)
The greedy algorithm picks tiles in descending order of the ratio between a tile’s (estimated) overlap area with the ground truth and its (estimated) non-overlap area with the ground truth, for as long as the (estimated) Jaccard similarity of the resulting segmentation continues to increase. Intuitively, tiles with a high overlap area and a low non-overlap area contribute to high recall with limited loss of precision. Since the tile overlap and non-overlap areas, as well as the Jaccard similarity of segmentations with the ground truth, are unknown, we use different heuristics to estimate these values. We discuss the details of this algorithm and its theoretical guarantees in our technical report; a sketch of MV and greedy follows below.

Retrieval: Number of Control Points (num pts)
This algorithm picks the worker segmentation with the largest number of control points along the segmentation boundary (i.e., the most precise drawing) as the output segmentation (Vittayakorn and Hays 2011; Sorokin and Forsyth 2008).
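Here is a hedged sketch of MV and of the greedy picker, building on the tiles_and_votes helper above (function names are ours). For greedy, we show the ‘full information’ variant evaluated in Section 6, where each tile’s true overlap area with the ground truth is assumed known; the heuristic estimates used in the absence of ground truth are described in the technical report.

```python
import numpy as np

def majority_vote(votes):
    """MV: keep tiles with 'yes' votes from at least 50% of workers."""
    return votes.mean(axis=0) >= 0.5              # boolean tile selection

def greedy_full_info(tile_areas, gt_overlap):
    """Greedy tile picking, 'full information' variant.

    gt_overlap[t] is tile t's true overlap area with the ground truth
    (in practice this would be a heuristic estimate). Tiles are added
    in descending overlap/non-overlap ratio for as long as the Jaccard
    score of the running segmentation keeps improving.
    """
    non_overlap = tile_areas - gt_overlap
    ratio = gt_overlap / np.maximum(non_overlap, 1e-9)
    picked = np.zeros(len(tile_areas), dtype=bool)
    inter, union = 0.0, float(gt_overlap.sum())   # empty S: union = area(GT)
    best_j = 0.0
    for t in np.argsort(-ratio):
        j = (inter + gt_overlap[t]) / (union + non_overlap[t])
        if j <= best_j:                           # Jaccard stopped improving
            break
        picked[t] = True
        inter += gt_overlap[t]
        union += non_overlap[t]
        best_j = j
    return picked
```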

Figure 3: Left: Toy example demonstrating the tiles t1–t6 created by three workers’ segmentations (workers 1–3) around an object boundary delineated by the black dotted line. Right: Tile-based inference on a five-worker example; the segmentation boundaries drawn by the five workers are shown in red, and overlaying the segmentations creates a mask whose color indicates the number of workers who voted for each tile region.

5 Perspective Resolution

As discussed in Section 3, disagreements often arise in segmentation due to differing worker perspectives on large tile regions. We developed a clustering-based preprocessing approach to resolve this issue. Based on the intuition that workers with similar perspectives will produce segmentations that are close to each other, we compute the Jaccard similarity between each pair of segmentations and perform spectral clustering to separate the segmentations into clusters. Figure 2 (bottom) illustrates how spectral clustering divides the worker segmentations into clusters with meaningful semantic associations, reflecting the diversity of perspectives for the same task. Clustering results can be used as a preprocessing step for any quality evaluation algorithm by keeping only the segmentations that belong to the largest cluster, which is typically free of semantic errors, as in the sketch below.
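The following is a minimal sketch of this preprocessing step (ours; helper names are hypothetical), assuming worker segmentations are boolean masks and using scikit-learn’s SpectralClustering on a precomputed Jaccard affinity matrix. The paper does not specify the number of clusters, so treating it as a free parameter is our assumption.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def jaccard(a, b):
    """Jaccard similarity between two boolean masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def largest_perspective_cluster(masks, n_clusters=2):
    """Return indices of the workers in the largest cluster, found by
    spectral clustering over pairwise Jaccard similarities."""
    n = len(masks)
    affinity = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            affinity[i, j] = affinity[j, i] = jaccard(masks[i], masks[j])
    labels = SpectralClustering(
        n_clusters=n_clusters, affinity="precomputed", random_state=0,
    ).fit_predict(affinity)
    keep = np.bincount(labels).argmax()            # largest cluster wins
    return np.flatnonzero(labels == keep)
```

The surviving segmentations can then be handed to any of the quality evaluation algorithms of Section 4.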

Clustering also offers the benefit of preserving a worker’s semantic intentions. For example, while the green cluster in Figure 2 (bottom right) would be considered bad segmentations for the particular task (‘computer’), this cluster can provide more data for another segmentation task corresponding to ‘monitor’. A potential future direction is to crowdsource semantic labels for the computed clusters, enabling the reuse of segmentations across multiple objects to lower costs.


Figure 4: Example image showing clustering performed on the same object from Figure 2 (left and middle); after clustering, colors denote clusters with different worker perspectives.

6 Experimental Evaluation

Dataset Description
We collected crowdsourced segmentations from Amazon Mechanical Turk; each HIT consisted of one segmentation task for a specific pre-labeled object in an image. Workers were compensated $0.05 per task. There were a total of 46 objects in 9 images from the MS-COCO dataset (Lin et al. 2014), each segmented by 40 different workers, resulting in a total of 1,840 segmentations. Each task contained a keyword for the object and a pointer indicating the object to be segmented. Two of the authors generated the ground truth segmentations by carefully segmenting the objects using the same interface.

Evaluation Metrics
Our evaluation metric measures how well the final segmentation (S) produced by an algorithm compares against the ground truth (GT). We use the Jaccard score J = IA(S) / UA(S), where IA(S) = area(S ∩ GT) is the intersection area and UA(S) = area(S ∪ GT) is the union area between the output and ground truth segmentations.
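As a quick worked example of the metric (toy numbers, ours): if the ground truth covers 100 pixels and the output covers 100 pixels of which 90 overlap the ground truth, then IA = 90, UA = 100 + 100 − 90 = 110, and J = 90/110 ≈ 0.82.

```python
import numpy as np

# Toy 1-D "masks": ground truth covers pixels 0-99; the output
# segmentation covers pixels 10-109 (90 overlapping, 10 spilling over).
GT = np.zeros(120, dtype=bool); GT[:100] = True
S = np.zeros(120, dtype=bool);  S[10:110] = True

IA = np.logical_and(S, GT).sum()   # intersection area: 90
UA = np.logical_or(S, GT).sum()    # union area: 110
print(IA / UA)                     # Jaccard score: 0.8181...
```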

Figure 5: Performance of the original algorithms that do not make use of ground truth information (left) and of those that do (right). Here, the EM result overlaps with MV, as they exhibit similar performance; other, diverging variants of EM are described in our technical report.

Experiment 1: Aggregation-based methods perform significantly better than retrieval-based methods.
In Figure 5, we vary the number of worker segmentations along the x-axis and plot, on the y-axis, the Jaccard score averaged across different worker samples of a given size, for each algorithm. Figure 5 (left) shows that the performance of the aggregation-based algorithms (greedy, EM) exceeds the best achievable through existing retrieval-based methods (Retrieval). Then, in Figure 5 (right), we estimate the upper-bound performance of each algorithm by assuming that ‘full information’ based on ground truth is given to the algorithm. For greedy, the algorithm is aware of the actual overlap and non-overlap areas of every tile against the ground truth. For EM, the true worker quality parameter values (under our worker quality model) are known. For retrieval, the full-information version directly picks the worker with the highest Jaccard similarity with respect to the ground truth. By making use of ground truth information (Figure 5 right), the best aggregation-based algorithm can achieve a close-to-perfect average Jaccard score of 0.98 as an upper bound, far exceeding the results achievable by any single ‘best’ worker (J = 0.91). This result demonstrates that aggregation-based methods achieve better performance by performing inference at the tile granularity, which is guaranteed to be at least as fine-grained as any individual worker segmentation.

The performance of aggregation-based methods scales well as more worker segmentations are added.
Intuitively, larger numbers of worker segmentations result in finer-granularity tiles for the aggregation-based methods. The first row of Table 1 shows the average percentage change in performance between 5-worker and 30-worker samples. We observe that aggregation-based methods typically improve in performance as the number of workers increases, while this is not generally true for retrieval-based methods.

Experiment 2: Clustering as preprocessing improves algorithmic performance.
The second row of Table 1 shows the average percentage change in Jaccard score when clustering preprocessing is used. Clustering generally results in an accuracy increase; however, since the ‘full information’ variants are already free of semantic errors, we do not see further improvement for them.

                     Retrieval-based     |        Aggregation-based
Algorithm           num pts   worker*    |    MV      EM    greedy   greedy*
Worker Scaling        -6.30      2.58    |  2.12    1.78      2.07      5.38
Clustering Effect      5.92     -0.02    |  2.05    0.03      5.73     0.283

Table 1: Jaccard percentage change due to worker scaling and clustering preprocessing. Algorithms marked with * use ground truth information.

7 Conclusion and Future Work

We identified three different types of errors for crowdsourced image segmentation, developed a clustering-based method to capture the semantic diversity caused by differing worker perspectives, and introduced novel aggregation-based methods that produce more accurate segmentations than existing retrieval-based methods.

Our preliminary studies show that our worker quality models are good indicators of the actual accuracy of worker segmentations. We also observe that the greedy algorithm is capable of achieving close-to-perfect segmentation accuracy with ground truth information. Given the success of aggregation-based methods, including the simple majority vote algorithm, we plan to use our worker-quality insights to improve our EM and greedy algorithms. We are also working on using computer vision signals to further improve our algorithms.


References

[Bell et al. 2014] Sean Bell, Kavita Bala, and Noah Snavely. Intrinsic images in the wild. ACM Transactions on Graphics (SIGGRAPH), 33(4), 2014.

[Bell et al. 2015] Sean Bell, Paul Upchurch, Noah Snavely, and Kavita Bala. Material recognition in the wild with the Materials in Context database. Computer Vision and Pattern Recognition (CVPR), 2015.

[Cabezas et al. 2015] Ferran Cabezas, Axel Carlier, Vincent Charvillat, Amaia Salvador, and Xavier Giro-i-Nieto. Quality control in crowdsourced object segmentation. Proceedings of the International Conference on Image Processing (ICIP), pages 4243–4247, 2015.

[Everingham et al. 2015] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, January 2015.

[Irshad et al. 2014] H. Irshad, Montaser-Kouhsari, et al. Crowdsourcing image annotation for nucleus detection and segmentation in computational pathology: Evaluating experts, automated methods, and the crowd. Biocomputing 2015, pages 294–305, 2014.

[Lee et al. 2018] Doris Jung-Lin Lee, Akash Das Sarma, and Aditya Parameswaran. Aggregating crowdsourced image segmentations. Technical report, Stanford InfoLab (ilpubs.stanford.edu:8090/1161/), 2018.

[Lin et al. 2012] Christopher H. Lin, Mausam, and Daniel S. Weld. Crowdsourcing control: Moving beyond multiple choice. AAAI Conference on Human Computation and Crowdsourcing (HCOMP), pages 491–500, 2012.

[Lin et al. 2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. European Conference on Computer Vision (ECCV), pages 740–755, 2014.

[Natonek 1998] E. Natonek. Fast range image segmentation for servicing robots. In Proceedings of the 1998 IEEE International Conference on Robotics and Automation, volume 1, pages 406–411, May 1998.

[Russakovsky et al. 2015] Olga Russakovsky, Li-Jia Li, and Li Fei-Fei. Best of both worlds: Human-machine collaboration for object annotation. Computer Vision and Pattern Recognition (CVPR), pages 2121–2131, 2015.

[Sameki et al. 2015] Mehrnoosh Sameki, Danna Gurari, and Margrit Betke. Characterizing image segmentation behavior of the crowd. pages 1–4, 2015.

[Sorokin and Forsyth 2008] Alexander Sorokin and David Forsyth. Utility data annotation with Amazon Mechanical Turk. Proceedings of the 1st IEEE Workshop on Internet Vision at CVPR, pages 1–8, 2008.

[Torralba et al. 2010] Antonio Torralba, Bryan C. Russell, and Jenny Yuen. LabelMe: Online image annotation and applications. Proceedings of the IEEE, 98(8):1467–1484, 2010.

[Vittayakorn and Hays 2011] Sirion Vittayakorn and James Hays. Quality assessment for crowdsourced object annotations. Proceedings of the British Machine Vision Conference (BMVC), pages 109.1–109.11, 2011.

[Welinder et al. 2010] Peter Welinder, Steve Branson, Serge Belongie, and Pietro Perona. The multidimensional wisdom of crowds. Conference on Neural Information Processing Systems (NIPS), 6:1–9, 2010.

[Yamaguchi 2012] Kota Yamaguchi. Parsing clothing in fashion photographs. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3570–3577. IEEE Computer Society, Washington, DC, USA, 2012.