Relating Things and Stuff via Object Property Interactions

    Min Sun, Byung-soo Kim, Pushmeet Kohli, and Silvio Savarese

Abstract: In the last few years, substantially different approaches have been adopted for segmenting and detecting "things" (object categories that have a well defined shape such as people and cars) and "stuff" (object categories which have an amorphous spatial extent such as grass and sky). While things have typically been detected by sliding window or Hough transform based methods, detection of stuff is generally formulated as a pixel- or segment-wise classification problem. This paper proposes a framework for scene understanding that models both things and stuff using a common representation while preserving their distinct nature by using a property list. This representation allows us to enforce sophisticated geometric and semantic relationships between thing and stuff categories via property interactions in a single graphical model. We use the latest advances made in the field of discrete optimization to efficiently perform maximum a posteriori (MAP) inference in this model. We evaluate our method on the Stanford dataset by comparing it against state-of-the-art methods for object segmentation and detection. We also show that our method achieves competitive performance on the challenging PASCAL '09 segmentation dataset.

Index Terms: Scene understanding, semantic labeling, segmentation, graph-cut

    1 INTRODUCTION

The last decade has seen the development of a number of methods for object detection, segmentation and scene understanding. These methods can be divided into two broad categories: methods that attempt to model and detect object categories that have distinct shape properties, such as cars or humans (things), and methods that seek to model and identify object categories whose internal structure and spatial support are more heterogeneous, such as grass or sky (stuff). In the first category, we find that methods based on pictorial structures (e.g., Felzenszwalb et al. [1]), pyramid structures (e.g., Grauman and Darrell [2]), the generalized Hough transform [3], [4], [5], [6], [7], or multi-view models [8], [9] work best. These representations are appropriate for capturing shape or structural properties of things, and typically identify the object by a bounding box. The second category of methods aims at segmenting the image into semantically consistent regions [10], [11], [12] and works well for stuff, like sky or road.

In order to coherently interpret the depicted scene, various types of contextual relationships among objects (stuff or things) have been explored. For example, co-occurrence relationships (e.g., cow and grass typically occur in the same image) have been captured in [13], [14], 2D geometric relationships (e.g., below, next-to, etc.) have been utilized in [15], [16], [17], and 2.5D geometry relationships (e.g., the horizon line) have been incorporated by Hoiem et al. [18] and Bao et al. [19]. The use of such contextual relationships has inspired the development of robust algorithms for various object recognition tasks. For instance, many segmentation methods [13], [20], [21] have been proposed to capture stuff-stuff relationships in a random field formulation. Similarly, thing-thing relationships have been incorporated into a random field for jointly detecting multiple objects (Desai et al. [16]).

Recently, researchers have proposed methods to jointly detect things and segment stuff. Gould et al. [22] proposed a random field model incorporating stuff-stuff, thing-stuff, and thing-horizon relationships. One limitation of their approach is that it cannot capture 2D geometric and co-occurrence relationships between things. Moreover, inference is computationally very demanding and typically takes around five minutes per image. To overcome this limitation, some authors have proposed inference procedures which leverage existing approaches for detection and segmentation and use the output of such approaches as input features in an iterative fashion [23], [24], [25], [26]. Unfortunately, convergence is not guaranteed for most of these approaches.

We propose a novel framework for jointly detecting things and segmenting stuff that can coherently capture many known types of contextual relationships. Our contributions are three-fold. First, the model infers the geometric and semantic relationships describing the objects (e.g., object x is behind object y) via object property interactions. Second, the model enables instance-based segmentation (see the color-coded segments in Fig. 1d) by associating segments of thing

M. Sun is with the University of Washington, AC101 Paul G. Allen Center, 185 Stevens Way, Seattle, WA 98195-2350. E-mail: [email protected].

B. Kim is with the University of Michigan and the Computer Science Department, Stanford University, 353 Serra Mall, Gates Building, Stanford, CA 94305-9020. E-mail: [email protected].

P. Kohli is with Microsoft Research, 21 Station Road, Cambridge CB1 2FB, United Kingdom. E-mail: [email protected].

S. Savarese is with the Computer Science Department, Stanford University, 353 Serra Mall, Gates Building, Stanford, CA 94305-9020. E-mail: [email protected].

Manuscript received 19 Aug. 2012; revised 1 May 2013; accepted 15 Sept. 2013. Date of publication 2 Oct. 2013; date of current version 13 June 2014. Recommended for acceptance by D. Ramanan. Digital Object Identifier no. 10.1109/TPAMI.2013.193


categories to instance-specific labels. Finally, the special design of the model potentials allows us to utilize a combination of state-of-the-art discrete optimization techniques to achieve efficient inference, which takes a few seconds per image on average using a single core.

Object hypotheses and property lists. Our framework extends the basic conditional random field (CRF) formulations for scene segmentation (i.e., assigning an object category to each segmentation variable X) [11], [27] by introducing the concept of a generic object hypothesis described by a property list (Fig. 2-Top). Instead of only capturing the semantic property of the hypothesis using an object category label l, as in the basic conditional random field, we allow the list to include geometric properties, such as the 2D location (u, v), the distance from the camera (depth) d, and the set of segments V associated with the object hypothesis. Notice that the generic object hypothesis naturally describes the existence of an object instance with respect to the camera. Hence, our scene segmentation framework can not only segment a scene into different object categories but also into object instances with different properties (i.e., things), as shown in Fig. 1d. Thus, it generalizes the work of Barinova et al. [6], [28].

We augment the above-mentioned CRF formulation with object hypothesis indicator (binary) variables which capture the presence or absence of object hypotheses (see the solid (on state) and dashed (off state) nodes in Fig. 2a-Top). We refer to our model as the augmented CRF, or ACRF, to highlight the newly added object hypothesis indicator variables. Two additional relationships are captured in our ACRF. First, the state of the indicator variable needs to be consistent with the assignment of the segmentation variables associated with the corresponding object hypothesis. We introduce a novel higher-order potential function to penalize, for

Fig. 2. Our augmented CRF model (ACRF). Panel (a) illustrates that our model jointly segments the image by assigning labels to segments (bottom layer) and detects objects by determining which object hypotheses exist (top layer). The existence of the hypotheses is indicated by solid (on) and dashed (off) nodes. As shown in the top panel, each thing object hypothesis possesses properties such as category, location (u, v), etc. On the other hand, each stuff object hypothesis possesses only the category label. In panel (b), we give examples of relationships established via property interactions. For thing categories, geometric relationships such as behind and above can be established. On the other hand, for stuff categories, semantic relationships such as co-occurrence can be established. Notice that the two edges, which connect to the stuff categories and end in the dashed separator line, represent the co-occurrence relationships between stuff and all thing categories. In panel (c), the figure shows the label space of the segmentation variables X (color-coded blocks) and the indicator variables Y (0 and 1 blocks). The higher-order potential capturing the relationship between indicator i and a set of segmentation variables in V_i is represented by an edge between a node in the top layer and a set of nodes in the bottom layer. In panel (d)-Bottom, the figure shows the pairwise and higher-order potentials among segmentation variables X which are present in the basic CRF formulation. In panel (d)-Top, the figure shows the pairwise potential between pairs of indicator variables Y which encodes different geometric and semantic relationships.

Fig. 1. Our goal is to segment the image into things (e.g., cars, humans, etc.) and stuff (e.g., road, sky, etc.) by combining segmentation (bottom) with object detection (top). Results from different variants of our method (capturing a subset of critical contextual relationships) are shown from left to right. At the top of each column, we show the top 4 probable bounding boxes, where light and dark boxes denote the confidence ranking from high to low. Instance-based segmentations are shown at the bottom of each column, where different colors represent different object instances, and all stuff categories are labelled with white color to make the figure less cluttered. The CRF+Det model enforces the detections to be consistent with the segment labels. While the green car segments are reinforced by the strong (white) car detection, very weak (dark) false detections are also introduced due to noisy segment labels in the background region. On the other hand, our final ACRF captures the key relationships so that it recovers many missing detections and segmentation labels. Thing-Stuff and Thing-Thing relationships are indicated by color-coded arrows connecting bounding boxes and/or segments. Different color codes indicate different types of relationships.


the first time, both types of inconsistency: i) the indicator is off but many segments in set V are still assigned to the corresponding hypothesis; ii) the indicator is on but only a few segments in set V are assigned to the corresponding hypothesis. Second, the object indicator variables allow us to easily encode sophisticated semantic and geometric relationships between pairs of object hypotheses. For instance, simple pairwise potentials defined over object indicator variables allow us to incorporate i) 2D geometric relationships such as above, which models the property that one hypothesis lies above the other (e.g., a person sitting on a bike), and ii) 2.5D depth-ordering and occlusion relationships such as in-front, which models the property that one hypothesis lies in front of the other (e.g., a person standing in front of a car). More sophisticated relationships, such as compositions of these basic 2D or 2.5D relationships, can also be supported. Critically, the ACRF model generalizes Ladicky et al.'s model [13], which captures stuff-stuff co-occurrence contextual relationships only. In contrast, our model can encode relationships between generic object hypotheses (both things and stuff) depending on their semantic and geometrical properties. We illustrate the efficacy of our approach in Fig. 1. As seen in the figure, detections typically do not agree with the segmentation results (Fig. 1b) if detection and segmentation are applied separately. A model capturing relationships among object hypotheses and segments ensures consistency between detection and segmentation results (Fig. 1c). However, the relationships between object hypotheses are ignored. Hence, false object hypotheses are sometimes introduced. Finally, when pairwise relationships of object hypotheses (e.g., next-to, behind, etc.) are included, even small object instances that are hard to detect and segment can be discovered (Fig. 1d).

Learning. Given the property list, a pre-defined set of pairwise relationships of object hypotheses is encoded in our model via property interactions, as described in Section 3.3. The likelihoods of the relationships are treated as model parameters that are learned from training data. For example, the model should learn that a person is likely to sit on a motorbike, and that a cow and an airplane are unlikely to co-occur. In our model, a likely relationship adds a negative cost to the energy of the model. On the other hand, an unlikely relationship adds a positive cost. We formulate the problem of learning these costs jointly with other parameters as a Structured SVM (SSVM) [29] learning problem with two types of loss functions, related to the segmentation loss and the detection loss, respectively (see Section 5 for details).

MAP Inference. Jointly estimating the segmentation variables X and indicator variables Y (see nodes in Fig. 2c) is challenging due to the intrinsic difference between the indicator and segmentation variable spaces, and the newly added complex relationships between them (see edges in Figs. 2c and 2d). We design an efficient graph-cut-based move making algorithm by combining state-of-the-art discrete optimization techniques. Our method is based on the α-expansion move making approach [30], which works by projecting the energy minimization problem of the segmentation variables X into a binary energy minimization problem so that it has the same space as the indicator variables Y. We use a probing approach similar to the one introduced by Rother et al. [31] to handle the non-submodular function describing pairwise relationships of object hypotheses. Our MAP inference algorithm takes only a few seconds per image on average using a single core, as opposed to five minutes by Gould et al. [22].

Outline of the Paper. The rest of the paper is organized as follows. We first describe the related work in Section 2. Model representation, inference, and learning are elaborated in Sections 3, 4, and 5, respectively. Implementation details and experimental results are given in Section 6.

    2 RELATED WORK

Our method is closely related to the following three methods, which can all be considered as special cases of our model. Desai et al. [16] propose a CRF model capturing thing-thing relationships and show that object detection performance can be consistently improved for multiple object categories. Their model can be considered as a special case of our model when no segmentation variable X exists.

Both of Ladicky et al.'s methods [13], [32] extend the basic CRF model to incorporate more sophisticated relationships. Ladicky et al. [32] incorporate thing-stuff relationships and demonstrate that the information from object detection can be used to improve the segmentation performance consistently across all object categories. Their model can be considered as a special case of our model when no thing-thing relationship is incorporated. One more subtle difference is that their model only weakly enforces the consistency between things and stuff. Their model does not penalize the case when the indicator is off but many segments are still assigned to the corresponding hypothesis. Notice that the strong thing-stuff consistency in our model is crucial in combination with thing-thing relationships. Otherwise, many segments will still be assigned to the corresponding hypothesis even when the hypothesis is suppressed by thing-thing relationships. Ladicky et al. [13] propose to capture co-occurrence types of object relationships and demonstrate that the co-occurrence information can be used to improve the segmentation performance significantly. Their model can also be considered as a special case of our model when no geometric relationships of object hypotheses (i.e., above, same horizon, etc.) are established. Finally, [13] cannot be used to assign segments to object instances or to localize object instances. Our results on the Stanford dataset demonstrate that our model achieves superior performance compared to [13], [32].

Our method shares similar ideas with Yao et al. [33] in jointly modeling object detection and scene segmentation. However, there are two main differences. Yao et al. also jointly model the scene classification problem and use the scene class to restrict the possible object categories existing in an image. Therefore, they need to apply a set of pre-trained scene classifiers in addition to the object detectors and scene segmentation methods. Moreover, their model does not incorporate thing-thing geometric relationships. Their model ignores facts such as that cars are typically next-to each other and cups are on top of a table. As a result, their method works well on datasets which contain only a few object instances (typically fewer than 3), such as the MSRC dataset [27]. On the contrary, our experimental results show that our method works well on datasets containing many more object instances, such as the Stanford dataset [21].


The semantic structure from motion (SSFM) approach proposed by Bao and Savarese [34] also jointly models object instances and regions. However, unlike our method, which utilizes a single image, their approach utilizes the correspondences of object instances and regions established across multiple images to improve object detection and segmentation performance.

Li et al. [35], [36] propose generative models to jointly classify the scene, recognize the class of each segment, and/or annotate the images with a list of tags. However, their models cannot localize each object instance. Hence, the thing-thing and thing-stuff interactions are not incorporated in their models.

Many methods explore contextual relationships between segments and/or object hypotheses to improve a specific visual task such as detection, category discovery, etc. For instance, [37] uses contextual relationships to discover object categories commonly appearing within a set of images. It was demonstrated in [38], [39] that contextual relationships can be used to improve object detection performance.

    3 AUGMENTED CRF

We now explain our augmented conditional random field (ACRF) model. The ACRF jointly models object detection and scene segmentation (Fig. 2a) by incorporating contextual relationships between things and stuff, and between multiple things (Fig. 2b).

Basic CRF. Similar to other scene segmentation methods, our model is developed on top of a basic conditional random field model. The basic CRF model is defined over a set of random variables X = {x_i}, i ∈ V, where V represents the set of image elements, which could be pixels, patches, segments, etc. (Fig. 2c-Bottom). Each random variable x_i is assigned a label from a discrete label space L, which is typically the set of object categories such as grass, road, car and people.

The energy (or cost) function E(X) of the CRF is the negative logarithm of the joint posterior distribution of the model and has the following form:

E(X) = -\log P(X\,|\,E) = -\log f_{eRF}(X\,|\,E) + K = \sum_{c \in \mathcal{C}_X} \psi_c(X_c) + K,   (1)

where E is the given evidence from the image and any additional information (e.g., object property lists), and f_{eRF}(X|E) takes the form of a CRF model with higher-order potentials defined over image elements (Fig. 2d-Bottom). f_{eRF}(X|E) can be decomposed into potentials ψ_c, each a cost function defined over a set of element variables X_c (called a clique) indexed by c ∈ C_X; C_X is the set of cliques for image elements, and K is a constant related to the partition function. The problem of finding the most probable, or maximum a posteriori, assignment of the CRF model is equivalent to solving the following discrete optimization problem: X* = arg min_{X ∈ L^{|V|}} E(X), where |V| indicates the number of elements.

The basic CRF model mostly relies on bottom-up information. It is constructed using unary potentials based on local classifiers and smoothness potentials defined over pairs of neighboring pixels. Higher-order potentials (such as the ones used in [11]) encourage labels of image elements within a group to be the same. This classic representation for object segmentation has led to excellent results for the stuff object categories, but has failed to replicate the same level of performance on the thing object categories. The reason for this dichotomy lies in the model's inability to explicitly encode the relationship between the shape and relative positions of different parts of structured object categories, such as the head and the torso of a person.

In contrast, part-based models such as pictorial structures [40], Latent SVM (LSVM) [1], and Hough transform based models [3], [6] have been shown to be much more effective at detecting things by generating a list of object hypotheses ordered according to their scores/likelihoods. Each hypothesis is often characterized by a property list including the category of the object l, the spatial location in the image (u, v), the depth or distance d of the object instance from the camera, and the set of segments V associated with the object hypothesis (Fig. 2, top panel). In many applications, a detection problem can be relaxed into an image-level classification problem. A classification method generates a hypothesis of the existence of an object category in the image without specifying the spatial configuration of the object. Since the spatial configuration of the object does not need to be specified, hypotheses for both things and stuff can be generated. Notice that, in this case, the property list of such a hypothesis only includes the category of the object l.

Augmented CRF. In order to take advantage of both the object detection and segmentation methods, we introduce a set of indicator variables (later referred to as indicators) Y = {y_j ∈ {0, 1}, j ∈ Q̂} corresponding to every object hypothesis in our ACRF model (Fig. 2c-Top). Theoretically, the number of all possible object hypotheses |Q̂| is large, since it is the Cartesian product of the space of all possible object category labels L, all possible spatial locations U × V in the image, and all depth or distance values within a range [0, D]. For example, a sliding window detector exploring 10 scales/depths considers 369K hypotheses in total for a 320 × 240 image. Therefore, it is potentially hard to handle. Fortunately, in real world images, only a few hypotheses are actually present. Thus, most indicator variables y_j, j ∈ Q̂ are off (i.e., y_j = 0). We use object detectors that have been trained to achieve a high recall rate to generate a relatively small set of plausible object hypotheses Q_d (about 20 per class on the Stanford dataset^1) compared to the size of the set of all possible object hypotheses Q̂. In addition, a set of object hypotheses Q_c with only an object category label, similar to the ones generated by image-level classification methods, is included. As a result, we obtain the set of object hypotheses Q = Q_d ∪ Q_c.
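The sketch below illustrates, under assumed data structures, how the hypothesis set Q = Q_d ∪ Q_c could be assembled from thresholded detector outputs plus category-only hypotheses; the dictionary fields and the helper `build_hypothesis_set` are hypothetical, and only the 0.7 detector threshold comes from footnote 1.

```python
def build_hypothesis_set(detections, category_labels, score_thresh=0.7):
    """Assemble Q = Q_d + Q_c.

    Q_d: detector hypotheses carrying geometric properties (label, location,
    depth, associated segments). Q_c: category-only hypotheses, as produced by
    image-level classification. Field names here are illustrative assumptions.
    """
    Q_d = [d for d in detections if d["score"] >= score_thresh]
    Q_c = [{"label": l} for l in category_labels]
    return Q_d + Q_c

# Toy usage: one detection survives the threshold, plus three stuff categories.
dets = [{"label": "car", "u": 120, "v": 80, "depth": 14.0,
         "segments": [3, 4, 7], "score": 0.9},
        {"label": "person", "u": 40, "v": 95, "depth": 9.0,
         "segments": [11], "score": 0.4}]
Q = build_hypothesis_set(dets, ["sky", "road", "grass"])
print(len(Q))   # 1 detector hypothesis + 3 category-only hypotheses = 4
```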

Recall that the variables X representing the image elements in the basic CRF formulation for object segmentation typically take values from the set of object categories L. In contrast, in our framework, these variables take values from the set of plausible object hypotheses, x_i ∈ L ∪ Q (referred to as the

1. We set the pre-trained detector threshold to 0.7.


augmented labeling space). This allows us to obtain segmentations of instances of particular object categories, which the basic CRF formulation is unable to handle.

The joint posterior distribution of the segmentation variables X and indicator variables Y can be written as

P(X, Y\,|\,E) \propto f_{eRF}(X\,|\,E)\; f_{oRF}(Y\,|\,E)\; f_{con}(X, Y\,|\,E).   (2)

The function f_{oRF} takes the form of a CRF model defined over object indicator variables as follows:

f_{oRF}(Y\,|\,E) = \prod_{c \in \mathcal{C}_Y} e^{-\psi_c(Y_c)},   (3)

where the potential ψ_c(Y_c) is a cost function defined over a set of indicator variables Y_c indexed by c ∈ C_Y, and C_Y is the set of cliques of indicators. The potential function f_{con} enforces that the segmentation and indicator variables take values which are consistent with each other (Fig. 2c). The term is formally defined as

f_{con}(X, Y\,|\,E) = \prod_{j \in Q} e^{-\Phi(y_j, X_j)},   (4)

where Φ(y_j, X_j) is the potential relating each indicator y_j with the set of image elements X_j = {x_i, i ∈ V_j} corresponding to the set of segments V_j of the jth hypothesis. Hence, the model energy can be written as

E(X, Y) = \sum_{c \in \mathcal{C}_X} \psi_c(X_c) + \sum_{j \in Q} \Phi(y_j, X_j) + \sum_{c \in \mathcal{C}_Y} \psi_c(Y_c).   (5)

The first term of the energy function is defined in a manner similar to [11]. We describe the other terms of the energy function in detail in the following sections.

    3.1 Relating Y and X

The function Φ(y_j, X_j) (Fig. 2c) is a likelihood term that enforces consistency in the assignments of the jth indicator variable y_j and the set of segmentation variables X_j. It is defined as follows.

Φ(y_j, X_j) (Eq. (6)) takes the value ∞ if y_j ≠ δ(X_j, l_j), a finite value if y_j = δ(X_j, l_j) = 1, and 0 if y_j = δ(X_j, l_j) = 0, where δ(X_j, l_j) ∈ {0, 1} indicates whether a sufficient fraction of the elements in X_j is assigned the hypothesis label l_j, as governed by the category-specific consistency threshold ρ_{l_j} (cf. Section 5). During the α-expansion moves used for inference, this potential is rewritten (Eq. (21)) in terms of transformed binary variables T_j = {t_i, i ∈ V_j}, where V_j(l_j) = {i : x_i = l_j} ∩ {i ∈ V_j} is the set of image elements in V_j with label l_j, and V_j \ V_j(l_j) is the remaining set of elements in V_j with labels other than l_j (i.e., {i : x_i ≠ l_j} ∩ {i ∈ V_j}). Most importantly, we observed that when α ≠ l_j the resulting function is submodular in (y_j, t_i), but when α = l_j it is submodular in (ȳ_j, t_i), where ȳ_j = 1 − y_j is the negation of y_j. This motivates us to dynamically negate a subset of the indicator variables according to the category labels {l_j}.

4.2.3 Probing ψ_p(y_j, y_k)

The graph construction for the pairwise instance indicators in Eq. (10) is equivalent to the construction for binary variables described in [30]. However, there is one issue that we need to resolve in order to ensure submodularity of the pairwise binary function. First of all, since it is essential to capture both attractive (i.e., both indicators having the same labels) and repulsive (i.e., both indicators having different labels) relationships, some pairwise functions ψ_p(y_j, y_k) will be submodular and others will be non-submodular in (y_j, y_k). Therefore, we need to fix the negating pattern of the indicator variables. However, this contradicts the dynamic negating approach described in the previous section.

Fortunately, a simple approach of probing the indicator variables, similar to the one described in [31], can effectively handle the non-submodular function, since each indicator only interacts with a small number of nearby indicators. The probing approach randomly fixes the small set of indicator variables {y_j, j ∈ Q_fix} for which the contradiction between the fixed negating pattern and the dynamic negating algorithm takes place. As a result, the pairwise function is ensured to be submodular. Notice that our inference algorithm does not rely on sophisticated techniques such as QPBO [31], which requires more memory and computation time.
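The need for probing comes from the standard submodularity condition for pairwise binary potentials that graph cuts require: θ(0,0) + θ(1,1) ≤ θ(0,1) + θ(1,0). A minimal check, with toy attractive and repulsive potentials (both invented for illustration):

```python
def is_submodular(theta):
    """theta[a][b] is the cost of (y_j, y_k) = (a, b); submodular iff
    theta(0,0) + theta(1,1) <= theta(0,1) + theta(1,0)."""
    return theta[0][0] + theta[1][1] <= theta[0][1] + theta[1][0]

attractive = [[0.0, 1.0], [1.0, 0.0]]   # rewards agreement: submodular
repulsive  = [[0.0, 0.0], [0.0, 2.0]]   # penalizes both indicators on: not submodular
print(is_submodular(attractive), is_submodular(repulsive))   # True False

# Negating one variable of the repulsive term, theta'(a, b) = theta(1 - a, b),
# gives [[0, 2], [0, 0]], which satisfies the condition again.
```

Negating one of the two variables flips the condition, which is why a fixed negating pattern (or, where that clashes with the dynamic negation, probing a few indicators) restores submodularity.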

    4.3 Overall Projected Energy Function

At each iteration of the α-expansion, the terms of the original model energy (Eq. (5)) become a pairwise and submodular function of T, Y, and Ȳ. The overall projected energy function (except the term in Eq. (13)) becomes

E(T, Y, \bar{Y}) = \sum_{c \in \mathcal{C}_X} \psi_c(T_c) + \sum_{j \in Q_\alpha} F(y_j, T) + \sum_{j \in \bar{Q}_\alpha} F(\bar{y}_j, T) + \sum_{\substack{j \in Q_\alpha,\, k \in Q_\alpha \\ [j,k] \in U_d}} \psi_p(y_j, y_k) + \sum_{\substack{j \in Q_\alpha,\, k \in \bar{Q}_\alpha \\ [j,k] \in U_d}} \psi_p(y_j, \bar{y}_k) + \sum_{\substack{j \in \bar{Q}_\alpha,\, k \in \bar{Q}_\alpha \\ [j,k] \in U_d}} \psi_p(\bar{y}_j, \bar{y}_k),   (22)

Fig. 4. Comparison between the original function F(y, X_j) (blue lines) and the approximated function (red lines) in Eqs. (18) and (17). The left panel shows the case when y = 1. The right panel shows the case when y = 0. Notice that the dashed blue lines indicate the sharp transition from finite values to infinite values.


where Q_α = {j : l_j ≠ α} \ Q_fix, Q̄_α = {j : l_j = α} \ Q_fix, and U_d = U \ U_c. Therefore, we construct the graph using T, partially using the indicators Y = {y_j, j ∈ Q_α}, and partially using the negated indicators Ȳ = {ȳ_j, j ∈ Q̄_α}, depending on whether l_j = α. Notice that Q_fix is randomly re-selected at every iteration; therefore, no indicator variable is always fixed. The α-expansion algorithm eventually converges to a local optimal solution.

    5 LEARNING

The full ACRF model in Eq. (5) contains several terms. In order to balance the importance of the different terms, we introduce a set of linear weights for each term as follows:

W^T \Psi(X, Y) = \sum_{c \in \mathcal{C}_X} w_c\, \psi_c(X_c) + \sum_{[j,k] \in U_d} w^{r_{jk}}_{l_j, l_k}\, \psi_p(y_j, y_k) + \sum_{j \in Q_d} w^u_{l_j}\, \big(F(y_j, X_j) + u(y_j)\big) + w^{co} \Big( \sum_{[j,k] \in U_c} \psi_p(y_j, y_k) + \sum_{j \in Q_c} F(y_j, X_j) \Big),   (23)

where w_c models the weights for the unary, pairwise, and higher-order terms in X, w^u_l is the object-category-specific weight for the unary term in y, w^{co} is the weight for hypotheses with only a category label, and w^{r_{jk}}_{l_j, l_k} is the weight for a pair of object categories (l_j, l_k) with the relationship type r_{jk} in Eq. (10). Recall from Sections 3.2 and 4 that Q_d and Q_c are the sets of hypotheses with geometric properties and with only a category label, respectively. Similarly, U_d and U_c are the subsets of U such that j, k ∈ Q_d and j, k ∈ Q_c, respectively. Notice that the function Ψ(X, Y, I) also depends on the image evidence I; we typically omit I for simplicity. Since all these weights are linearly related to the energy function, we formulate the problem of jointly training them as a Structured SVM (SSVM) learning problem [29], similar to [16].

Assume that a set of example images, ground-truth segment object category labels, and ground-truth object bounding boxes {I^n, X^n, Y^n}_{n=1,...,N} is given. The SSVM problem is as follows:

\min_{W,\, \xi \geq 0} \; W^T W + C \sum_n \xi_n
\text{s.t.} \quad \xi_n \geq \max_{X, Y} \big[ \Delta(X, Y; X^n, Y^n) + W^T \Psi(X^n, Y^n, I^n) - W^T \Psi(X, Y, I^n) \big], \quad \forall n,   (24)

where W concatenates all the model parameters which are linearly related to the potentials Ψ(X, Y); C controls the relative weight of the sum of the violated terms {ξ_n} with respect to the regularization term; and Δ(X, Y; X^n, Y^n) is the loss function, which generates a large loss when X or Y is very different from X^n or Y^n. Depending on the performance evaluation metric, we design different loss functions, as described in Section 5.1.

Following the SSVM formulation, we propose to use a stochastic subgradient descent method to solve Eq. (24). The subgradient ∂_W ξ_n can be calculated as Ψ(X^n, Y^n) − Ψ(X*, Y*), where (X*, Y*) = arg min_{X,Y} W^T Ψ(X, Y) − Δ(X, Y; X^n, Y^n). When the loss function can be decomposed into a sum of local losses on individual segments and individual detections, (X*, Y*) can be found using graph cuts, similar to the inference problem (Section 4). For other, more complicated loss functions, we found that it is effective to set (X*, Y*) approximately as arg min_{X,Y} W^T Ψ(X, Y) when the loss is bigger than a threshold.
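A minimal sketch of one stochastic subgradient step under the formulation above; `joint_features` (Ψ), `loss_augmented_map`, the learning rate, and the use of plain unprojected gradient descent are all assumptions for illustration, not the training schedule used in the paper.

```python
def ssvm_subgradient_step(w, example, joint_features, loss_augmented_map,
                          C=1.0, lr=1e-3):
    """One stochastic subgradient step for the margin-rescaled SSVM of Eq. (24).

    example            : (I_n, X_n, Y_n), an image with its ground-truth labeling
    joint_features     : Psi(X, Y, I) -> feature vector (linear in the energy terms)
    loss_augmented_map : solves argmin_{X,Y} w^T Psi(X, Y, I) - Delta(X, Y; X_n, Y_n)
    All arguments are hypothetical placeholders; w and the features are assumed
    to be numpy arrays (or anything supporting elementwise arithmetic).
    """
    I_n, X_n, Y_n = example
    X_hat, Y_hat = loss_augmented_map(w, I_n, X_n, Y_n)
    # Subgradient of the violated slack: Psi(ground truth) - Psi(loss-augmented MAP).
    g = joint_features(X_n, Y_n, I_n) - joint_features(X_hat, Y_hat, I_n)
    # Gradient of the W^T W regularizer plus the scaled slack subgradient.
    return w - lr * (2 * w + C * g)
```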

The remaining model parameters are set as follows. The object-category-specific consistency thresholds ρ_l in Eq. (6) are estimated using the median values observed in the training data. The γ involved in Eq. (13) is estimated from the MSE as described in [13]. The β in Eq. (9) is set to be the calibrated detection confidence such that β ≥ 0. The unary potentials in ψ_c(X_c) are obtained from off-the-shelf stuff classifiers [13], [43] (see details in Section 6). The pairwise potentials in ψ_c(X_c) are modelled using codebook representations similar to [40].

    5.1 Loss Function

For the experiments on the Stanford dataset, the performance is measured by the pixel-wise classification accuracy (i.e., the percentage of pixels correctly classified) and the detection accuracy. We define the loss function Δ(X, Y; X^n, Y^n) as the sum of the segmentation loss Δ(X, X^n) and the detection loss Δ(Y, Y^n).

The segmentation accuracy is measured by

\frac{\text{true positive}}{\text{true positive} + \text{false negative}}.   (25)

Hence, the segmentation loss Δ(X, X^n) is defined as

\Delta(X, X^n) = \frac{1}{Q} \sum_{i \in \mathcal{V}} \mathbf{1}\{x_i \neq x^n_i\}\, c_x(l_i),   (26)

where V captures the indices of the set of segments, 1{STATEMENT} is 1 if the STATEMENT is true, c_x(l_i) is the object-category-specific cost for category l_i (used to re-weight the loss contributed by different object categories), and Q = \sum_{i \in \mathcal{V}} c_x(l_i). Therefore, the overall segmentation loss can be decomposed into a sum of local losses, one term \frac{1}{Q}\mathbf{1}\{x_i \neq x^n_i\} c_x(l_i) per segment.

The detection loss Δ(Y, Y^n) is defined as

\Delta(Y, Y^n) = \frac{1}{M} \sum_{i \in Q_d} \mathbf{1}\{y_i \neq y^n_i\}\, c_y(l_i),   (27)

where Q_d captures the indices of the set of detections and M = \sum_{i \in \mathcal{B}} c_y(l_i). Similarly, the overall detection loss can be decomposed into a sum of local losses, one term \frac{1}{M}\mathbf{1}\{y_i \neq y^n_i\} c_y(l_i) per detection.
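Both losses above decompose over individual segments and detections, which is what makes loss-augmented inference tractable with graph cuts. A small sketch of Eqs. (26) and (27) follows; treating c_x and c_y as functions of the ground-truth labels is our reading of the text and is marked as an assumption in the comments.

```python
import numpy as np

def weighted_seg_loss(x, x_gt, class_cost):
    """Segmentation loss of Eq. (26): class-weighted fraction of mislabelled segments.
    Assumption: the per-segment cost c_x is looked up from the ground-truth label."""
    x, x_gt = np.asarray(x), np.asarray(x_gt)
    costs = np.array([class_cost[l] for l in x_gt])
    return ((x != x_gt) * costs).sum() / costs.sum()

def weighted_det_loss(y, y_gt, det_labels, class_cost):
    """Detection loss of Eq. (27): class-weighted fraction of wrong indicator states.
    det_labels holds the (assumed ground-truth) category of each detection hypothesis."""
    y, y_gt = np.asarray(y), np.asarray(y_gt)
    costs = np.array([class_cost[l] for l in det_labels])
    return ((y != y_gt) * costs).sum() / costs.sum()

# Toy usage: equal class costs; one of four segments and one of three detections wrong.
print(weighted_seg_loss([0, 1, 1, 0], [0, 1, 0, 0], {0: 1.0, 1: 1.0}))       # 0.25
print(weighted_det_loss([1, 0, 1], [1, 1, 1], ["car", "person", "car"],
                        {"car": 1.0, "person": 1.0}))                         # ~0.33
```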

For the experiments on the PASCAL dataset, the overall loss function Δ(X, Y; X^n, Y^n) is similarly decomposed into the sum of the segmentation loss Δ(X, X^n) and the detection loss Δ(Y, Y^n). The detection loss is the same as before. However, since the segmentation performance is measured differently, by

\frac{\text{true positive}}{\text{true positive} + \text{false positive} + \text{false negative}},   (28)


the segmentation loss is defined as 1 minus the segmentation performance. Notice that this segmentation loss cannot be decomposed into a per-segment loss. Therefore, we obtain (X*, Y*) approximately as arg min_{X,Y} W^T Ψ(X, Y) when Δ(X, Y; X^n, Y^n) is bigger than a threshold.
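The two evaluation measures differ only in whether false positives enter the denominator, which is why the PASCAL loss of Eq. (28) no longer decomposes per segment. A tiny numeric illustration:

```python
def stanford_class_accuracy(tp, fn):
    """Per-class accuracy used on Stanford (Eq. (25)): TP / (TP + FN), i.e. recall."""
    return tp / (tp + fn)

def pascal_class_accuracy(tp, fp, fn):
    """Per-class accuracy used on PASCAL (Eq. (28)): TP / (TP + FP + FN),
    the intersection-over-union of predicted and ground-truth pixel sets."""
    return tp / (tp + fp + fn)

# The same prediction scores differently under the two metrics because
# false positives only penalize the PASCAL measure.
print(stanford_class_accuracy(tp=80, fn=20))        # 0.8
print(pascal_class_accuracy(tp=80, fp=40, fn=20))   # ~0.571
```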

    6 EXPERIMENTS

We compare our full ACRF model with [13], [21], [32], [44], [45] on the Stanford background (referred to as Stanford) dataset [21], as well as with several state-of-the-art techniques on the PASCAL VOC 2009 segmentation (referred to as PASCAL) dataset [46]. As opposed to other datasets, such as MSRC [27], the Stanford dataset contains more cluttered scenes and more object instances per image. Hence, segmenting and detecting things is particularly challenging. The challenging PASCAL segmentation dataset contains a large number of thing labels with a single stuff/background label. However, it contains a limited number of object instances per image, which is less ideal for demonstrating the importance of pairwise relationships between object hypotheses.

Implementation details. For all the experiments below, we use the same pre-trained LSVM detectors [1] to obtain a set of object hypotheses with geometric properties for thing categories (e.g., car, person, and bike). The object depths are inferred by combining cues from the size and the bottom positions of the object bounding boxes, similar to [18], [19], [24]. The subset of segments V associated with each object hypothesis is obtained by using the average object segmentation in the training set. In detail, for each mixture component in LSVM, we estimate the average object segmentation and use it to select the set of segmentation variables overlapping with the average object segmentation. The responses from off-the-shelf stuff classifiers are used as the unary stuff potentials in our model. On the Stanford dataset, we use the STAIR Vision Library [43] that was earlier used in [21]. On the PASCAL dataset, we use only the pixel-wise unary responses from the first layer of the hierarchical CRF [13]. We model different types of pairwise stuff relationships using a codebook representation similar to [47]. The following geometric pairwise relationships are used in the experiments to incorporate geometric relationships between two object hypotheses: next-to, above, below, in-front, behind, and overlap. On top of that, we have one additional geometric relationship based on horizon-line agreement between two hypotheses. The types of relationships are determined as described in Section 3.3. All models are trained with object bounding boxes and pixel-wise class supervision.
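The depth cue can be illustrated with a simple pinhole-camera sketch: the bounding-box height gives one depth estimate and the box's bottom edge relative to the horizon gives another. The formulas below are a generic approximation in the spirit of [18], [19], [24]; the focal length, object height prior, camera height, horizon row, and the naive averaging are all illustrative assumptions.

```python
def depth_from_bbox(box_h_px, box_bottom_px, focal_px, obj_height_m,
                    cam_height_m, horizon_px):
    """Two rough monocular depth cues for a detected object.

    Size cue   : a pinhole camera gives depth ~ f * H_object / h_box.
    Ground cue : for an object resting on the ground plane,
                 depth ~ f * H_camera / (y_bottom - y_horizon),
                 with image y growing downward.
    The focal length, object height prior, camera height, horizon row, and the
    plain averaging below are illustrative assumptions, not the paper's values.
    """
    d_size = focal_px * obj_height_m / box_h_px
    d_ground = focal_px * cam_height_m / max(box_bottom_px - horizon_px, 1e-6)
    return 0.5 * (d_size + d_ground)

# A car-like box 60 px tall whose bottom edge sits 80 px below an assumed horizon.
print(depth_from_bbox(box_h_px=60, box_bottom_px=320, focal_px=800,
                      obj_height_m=1.5, cam_height_m=1.6, horizon_px=240))  # ~18 m
```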

    6.1 Stanford Dataset

The Stanford dataset [21] contains 715 images of challenging urban and rural scenes. On top of the eight background (stuff) categories, we annotate nine foreground (thing) object categories: car, person, motorbike, bus, boat, cow, sheep, bicycle, and others. We follow the five-fold cross-validation scheme, which splits the data into different sets of 572 training and 143 testing images. In Table 2a,^2 our ACRF model outperforms state-of-the-art methods [13], [21], [32], [44], [45] in the percentage of pixels correctly classified as either one of the eight background classes or a general foreground class (referred to as global accuracy).

Global accuracy versus average accuracy. The global accuracy is not ideal for highlighting the accuracy gain of our method on foreground classes, since it ignores classification errors among the fine foreground classes and the number of background pixels clearly outnumbers the number of foreground pixels. Hence, we report per-class accuracy and the uniform average accuracy across a general background class and the eight foreground classes (referred to as average accuracy) in Tables 1, 2b, and 5.

In Table 1, the performance on most foreground classes (seven out of eight) is significantly improved when additional components are added on top of the basic CRF model. As a result, the full ACRF model obtains a 14.2 percent

TABLE 2
Segmentation Performance Comparison on the Stanford Dataset

(a) Global accuracy of our ACRF model compared to state-of-the-art methods. (b) Sensitivity analysis of our segmentation accuracy with respect to the quality of the detectors on the Stanford dataset. Notice that the average accuracy decreases only gradually when the maximum recall decreases.

TABLE 1
System Analysis of Our Model on the Stanford Dataset

The CRF row shows the results of the basic CRF model, which uses only the stuff-stuff relationship component (first term in Eq. (5)) of our ACRF model. The C+D row shows the results obtained by adding object hypothesis indicators from pre-trained detectors to the CRF model (first two terms in Eq. (5)). The last row shows the results of the full ACRF model. Notice that in the Background column we treat all background classes as a general background class. Our ACRF model improves the average accuracy by a significant 14 percent compared to the basic CRF model.

2. We implemented [13], [32] ourselves and evaluated their performance.


average improvement over the basic CRF model. Typical results are shown in Fig. 9-Top. Using our efficient inference algorithm, inference takes on average 1.33 seconds for each image, which has 200 to 300 super-pixels, on an Intel(R) Xeon(R) CPU @ 2.40 GHz. We highlight that our model can generate object instance-based segmentations due to the ability to reason in the augmented labeling space Q (Fig. 5a).

Another advantage of using our model is improved detection accuracy. We measure detection performance in terms of recall versus false positives per image (FPPI) in Fig. 5b, where the detection results from the five-fold validations are accumulated and shown in one curve. The performance of the proposed model is compared with the pre-trained LSVM [1]. Our model achieves consistently higher recall than the LSVM baseline at small numbers of FPPI, as shown in Fig. 5b.
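For reference, recall-versus-FPPI curves of the kind reported in Fig. 5b can be computed from scored detections as sketched below; the matching of detections to ground truth (e.g., by IoU) is assumed to have been done already.

```python
import numpy as np

def recall_vs_fppi(scores, is_tp, num_gt, num_images):
    """Recall versus false positives per image (FPPI) from scored detections.

    scores : detection confidences
    is_tp  : 1 if the detection matches an unmatched ground-truth object, else 0
    """
    order = np.argsort(-np.asarray(scores))
    tp = np.cumsum(np.asarray(is_tp)[order])
    fp = np.cumsum(1 - np.asarray(is_tp)[order])
    return fp / num_images, tp / num_gt     # (fppi, recall) as the threshold is lowered

fppi, recall = recall_vs_fppi([0.9, 0.8, 0.6, 0.4], [1, 0, 1, 0],
                              num_gt=3, num_images=2)
print(list(zip(fppi, recall)))   # recall reaches 2/3 at 0.5 FPPI in this toy example
```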

Since our method utilizes pre-trained object detectors to obtain a set of plausible object hypotheses, we evaluate the segmentation accuracy given worse detectors to see how much our method depends on the quality of the detectors. We simulate a worse detector by reducing the number of recalled objects in the set of plausible detections. As shown in Table 2b, the average accuracy decreases only gradually when the maximum recall decreases. Notice that, even when the recall is only 20 percent, our ACRF model still achieves better accuracy than the basic CRF.

    6.2 PASCAL Dataset

This dataset contains 14,743 images with 21 categories, including 20 thing categories and 1 stuff category. Only a subset of the images have segmentation labels, and we used the standard split for training (749 images), validation (750 images), and testing (750 images). A system analysis of our model (Table 3) shows that the performance on most classes is improved when additional components are added on top of the basic CRF model. Moreover, our ACRF model is able to significantly boost the performance and achieves competitive accuracy compared to other teams in the challenge (ranked 4th in Table 4). Typical results are shown in Fig. 9-Bottom.

    6.3 Relationship Analysis

We found that the pairwise geometric relationships contribute more to the accuracy improvement of our ACRF model than the co-occurrence relationships, since CRF + Det + co-occurrence relationships (last row in Table 5) does not consistently improve the accuracy of all categories compared to CRF + Det (middle row in Table 1). Moreover, all

Fig. 5. (a) Typical thing segmentation results on the Stanford dataset. Notice that our model can obtain instance-based segmentations (last column) due to the ability to reason in the augmented labeling space Q. (b) Recall versus FPPI curves of our ACRF and LSVM on the Stanford dataset. Our ACRF achieves better recall at different FPPI values.

TABLE 3
The Segmentation Accuracy of Different Variants of Our Model (i.e., CRF, CRF+Detection, and Full ACRF Models) on the PASCAL Dataset

TABLE 4
Average Segmentation Accuracy of Our ACRF Model Compared to Other State-of-the-Art Methods on the PASCAL Dataset

TABLE 5
Segmentation Accuracy on the Stanford Dataset When the Pairwise Geometric Relationships Are Partially (e.g., Car and Person, or Car and Bus Geometric Relationships Removed) or Totally (i.e., All Geometric Relationships) Removed


geometric relationships collectively contribute to the accuracy improvement, since our model is still better than CRF + Det when the geometric relationships of the two most frequently co-occurring pairs of object categories, namely car versus person and car versus bus, are removed, respectively (first two rows in Table 5). The learned relation parameters of the two most frequently co-occurring pairs of object categories are visualized in Figs. 6a and 6b. Nevertheless, the learned relationships sometimes introduce errors. We show failure cases in Fig. 7.

In Fig. 6c, we evaluate the percentage of pairs of true positive object detections for each relationship. Before applying inference (i.e., for the raw detections from LSVM [1]), the percentage is fairly low since there are many false positive detections. After applying our ACRF model, the percentage increases dramatically, as expected. We also outperform two stronger baseline methods aimed at pruning out incorrect pairs of object hypotheses for each relationship, defined as follows. BL1 uses only the detection confidence to prune out detections. Specifically, for each pair of detections with a certain relationship, we assign a score equal to the sum of the LSVM scores of both bounding boxes. Then, we sample the p percent of pairs with the highest scores, where p is the percentage of pairs of true positive detections for that relationship in the training set. BL2 incorporates pairwise object interactions to prune out detections. Again, for each pair of detections with a certain relationship, we assign a score equal to the sum of the detection scores of both detections. Then, we sample pairs within the top p(c1, c2) percent, where p(c1, c2) is a class-pair-specific percentage of pairs of true positive detections from the training set, and c1 and c2 are the classes corresponding to the two bounding boxes.
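A sketch of the BL1 pruning rule as we read it from the description above; the pair scores and the kept fraction p are toy values, and the function name is hypothetical.

```python
import numpy as np

def bl1_prune_pairs(pair_scores, p_keep):
    """Baseline BL1 from Section 6.3: rank candidate detection pairs by the sum of
    their two detection scores and keep the top p percent, where p is the fraction
    of true-positive pairs observed on the training set for that relationship."""
    pair_scores = np.asarray(pair_scores)
    k = max(1, int(round(p_keep * len(pair_scores))))
    return np.argsort(-pair_scores)[:k]          # indices of the kept pairs

# Keep the top 50 percent of four candidate pairs.
print(bl1_prune_pairs([1.7, 0.4, 2.1, 0.9], p_keep=0.5))   # -> indices [2, 0]
```

BL2 follows the same recipe but replaces the single fraction p with a class-pair-specific fraction p(c1, c2).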

Using the inferred relationships, we can provide a high-level geometric description of the scene and determine properties such as: object x is behind object y. Finally, we can obtain 3D pop-up models of the scene from a single image, as in Fig. 8.

    7 CONCLUSION

We have presented a unified CRF-based framework for jointly detecting and segmenting thing and stuff categories in natural images. We have shown that our framework incorporates, in a coherent fashion, various types of (geometric and semantic) contextual relationships via property interactions. Our new formulation generalizes previous CRF-based results whose focus was to capture the co-occurrence between stuff categories only. We have quantitatively and qualitatively demonstrated that our method: i) produces better segmentation results than the state-

Fig. 6. Examples of the learned pairwise relationships between object hypotheses are visualized in panels (a, b). The grayscale color code indicates to what degree the relationship is likely (white means likely, black means unlikely). Our model learned that (a) a car is likely to be in front of a bus and unlikely to be below a bus, and (b) a car is likely to be behind a person. (c) Prediction accuracy of object co-occurrence for each type of relationship, averaged over the five-fold validations. The first and last columns show the accuracy before and after applying inference with our full ACRF model, respectively. Notice that there is a consistent improvement across all types. The performance of the two baseline methods is reported in the middle two columns; both are inferior to our results.

Fig. 7. Failure case analysis. Panel (a) shows a case where the learned pairwise relationship between a car and a person next-to each other does not match the relationship present in the test image. As a result, a false alarm of a car (red box) appears with ACRF. (b) A typical example where the depth heuristic fails. The yellow car in the center of the image is successfully detected and segmented with the CRF + Det model. However, the ACRF model fails to detect it, because the depth is not correctly inferred, due to the fact that the height of the yellow car is not close to the average height of the cars in the training set.

Fig. 8. 3D pop-up models from the Stanford dataset. Videos related to the above 3D pop-up models can be found on the project page: cvgl.stanford.edu/vision/projects/ACRF/ACRFproj.html.


of-the-art on the Stanford dataset and competitive results on the PASCAL '09 dataset; ii) improves the recall of object instances on the Stanford dataset; and iii) enables the estimation of contextual relationships among things and stuff. Extensions for future work include incorporating more sophisticated types of properties.

Fig. 9. Typical results on the Stanford (top four rows) and the PASCAL (bottom four rows) datasets. Every set of results compares the ground truth annotation, the disjoint model (object detection and segmentation methods applied separately), CRF+Det, and ACRF, from left to right, respectively. The odd rows show the top K object hypotheses (color-coded bounding boxes representing the confidence ranking from light (high confidence) to dark (low confidence)), where K is the number of recalled objects in the ACRF result. The even rows show the segmentation results (the color code is shown at the bottom).


ACKNOWLEDGMENTS

The authors would like to acknowledge the support of the US National Science Foundation (NSF) CAREER grant (#1054127), ARO grant (W911NF-09-1-0310), and ONR grant (N00014-13-1-0761).

REFERENCES

[1] P. Felzenszwalb, D. McAllester, and D. Ramanan, A Discriminatively Trained, Multiscale, Deformable Part Model, Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2008.
[2] K. Grauman and T. Darrell, The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features, Proc. 10th IEEE Int'l Conf. Computer Vision (ICCV), 2005.
[3] B. Leibe, A. Leonardis, and B. Schiele, Combined Object Categorization and Segmentation with an Implicit Shape Model, Proc. ECCV Workshop on Statistical Learning in Computer Vision, 2004.
[4] J. Gall and V. Lempitsky, Class-Specific Hough Forests for Object Detection, Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2009.
[5] S. Maji and J. Malik, Object Detection Using a Max-Margin Hough Transform, Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2009.
[6] O. Barinova, V. Lempitsky, and P. Kohli, On Detection of Multiple Object Instances Using Hough Transforms, Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2010.
[7] M. Sun, G. Bradski, B.X. Xu, and S. Savarese, Depth-Encoded Hough Voting for Coherent Object Detection, Pose Estimation, and Shape Recovery, Proc. 11th European Conf. Computer Vision: Part V (ECCV), 2010.
[8] M. Sun, H. Su, S. Savarese, and L. Fei-Fei, A Multi-View Probabilistic Model for 3D Object Classes, Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2009.
[9] Y. Xiang and S. Savarese, Estimating the Aspect Layout of Object Categories, Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2012.
[10] X. He, R.S. Zemel, and M.A. Carreira-Perpiñán, Multiscale Conditional Random Fields for Image Labeling, Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2004.
[11] P. Kohli, L. Ladicky, and P.H. Torr, Robust Higher Order Potentials for Enforcing Label Consistency, Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2008.
[12] J. Shotton, A. Blake, and R. Cipolla, Semantic Texton Forests for Image Categorization and Segmentation, Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2008.
[13] L. Ladicky, C. Russell, P. Kohli, and P.H. Torr, Graph Cut Based Inference with Co-Occurrence Statistics, Proc. 11th European Conf. Computer Vision: Part V (ECCV), 2010.
[14] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie, Objects in Context, Proc. 11th IEEE Conf. Computer Vision (ICCV), 2007.
[15] A. Gupta and L. Davis, Beyond Nouns: Exploiting Prepositions and Comparators for Learning Visual Classifiers, Proc. European Conf. Computer Vision (ECCV), 2008.
[16] C. Desai, D. Ramanan, and C. Fowlkes, Discriminative Models for Multi-Class Object Layout, Proc. 11th IEEE Conf. Computer Vision (ICCV), 2009.
[17] G. Heitz and D. Koller, Learning Spatial Context: Using Stuff to Find Things, Proc. 10th European Conf. Computer Vision: Part I (ECCV), 2008.
[18] D. Hoiem, A.A. Efros, and M. Hebert, Putting Objects in Perspective, Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2006.
[19] S.Y. Bao, M. Sun, and S. Savarese, Toward Coherent Object Detection and Scene Layout Understanding, Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2010.
[20] J. Winn and J. Shotton, The Layout Consistent Random Field for Recognizing and Segmenting Partially Occluded Objects, Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2006.
[21] S. Gould, R. Fulton, and D. Koller, Decomposing a Scene into Geometric and Semantically Consistent Regions, Proc. 11th IEEE Conf. Computer Vision (ICCV), 2009.
[22] S. Gould, T. Gao, and D. Koller, Region-Based Segmentation and Object Detection, Proc. 23rd Ann. Conf. Neural Information Processing Systems (NIPS), 2009.
[23] G. Heitz, S. Gould, A. Saxena, and D. Koller, Cascaded Classification Models: Combining Models for Holistic Scene Understanding, Proc. Ann. Conf. Neural Information Processing Systems (NIPS), 2008.
[24] M. Sun, S.Y. Bao, and S. Savarese, Object Detection with Geometrical Context Feedback Loop, Proc. British Machine Vision Conf. (BMVC), 2010.
[25] D. Hoiem, A.A. Efros, and M. Hebert, Closing the Loop on Scene Interpretation, Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2008.
[26] C. Li, A. Kowdle, A. Saxena, and T. Chen, Toward Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 34, no. 7, pp. 1394-1408, July 2012.
[27] J. Shotton, J. Winn, C. Rother, and A. Criminisi, TextonBoost: Joint Appearance, Shape and Context Modeling for Multi-Class Object Recognition and Segmentation, Proc. European Conf. Computer Vision (ECCV), 2006.
[28] O. Barinova, V. Lempitsky, and P. Kohli, On Detection of Multiple Object Instances Using Hough Transforms, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 34, no. 9, pp. 1773-1784, Sept. 2012.
[29] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun, Support Vector Learning for Interdependent and Structured Output Spaces, Proc. 21st Int'l Conf. Machine Learning (ICML), 2004.
[30] Y. Boykov, O. Veksler, and R. Zabih, Fast Approximate Energy Minimization via Graph Cuts, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 11, pp. 1222-1239, Nov. 2001.
[31] C. Rother, V. Kolmogorov, V. Lempitsky, and M. Szummer, Optimizing Binary MRFs via Extended Roof Duality, Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2007.
[32] L. Ladicky, P. Sturgess, K. Alahari, C. Russell, and P.H. Torr, What, Where, and How Many? Combining Object Detectors and CRFs, Proc. European Conf. Computer Vision (ECCV), 2010.
[33] Y. Yao, S. Fidler, and R. Urtasun, Describing the Scene as a Whole: Joint Object Detection, Scene Classification and Semantic Segmentation, Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2012.
[34] S.Y. Bao and S. Savarese, Semantic Structure from Motion, Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2011.
[35] L.J. Li, R. Socher, and L. Fei-Fei, Towards Total Scene Understanding: Classification, Annotation and Segmentation in an Automatic Framework, Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2009.
[36] L. Fei-Fei and L.J. Li, What, Where and Who? Telling the Story of an Image by Activity Classification, Scene Recognition and Object Categorization, Computer Vision, pp. 157-171, 2010.
[37] Y.J. Lee and K. Grauman, Object-Graphs for Context-Aware Category Discovery, Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2010.
[38] C. Galleguillos, B. McFee, S. Belongie, and G.R.G. Lanckriet, Multi-Class Object Localization by Combining Local Contextual Interactions, Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2010.
[39] S.K. Divvala, D. Hoiem, J. Hays, A.A. Efros, and M. Hebert, An Empirical Study of Context in Object Detection, Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2009.
[40] P.F. Felzenszwalb and D.P. Huttenlocher, Pictorial Structures for Object Recognition, Int'l J. Computer Vision, vol. 61, no. 1, pp. 55-79, 2005.
[41] E. Boros and P. Hammer, Pseudo-Boolean Optimization, Discrete Applied Math., vol. 123, pp. 155-225, 2002.
[42] V. Kolmogorov and R. Zabih, What Energy Functions Can Be Minimized via Graph Cuts, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 2, pp. 147-159, Feb. 2004.
[43] S. Gould, O. Russakovsky, I. Goodfellow, P. Baumstarck, A. Ng, and D. Koller, The STAIR Vision Library (v2.3), 2009.
[44] J. Tighe and S. Lazebnik, SuperParsing: Scalable Nonparametric Image Parsing with Superpixels, Proc. European Conf. Computer Vision (ECCV), 2010.
[45] D. Munoz, J.A. Bagnell, and M. Hebert, Stacked Hierarchical Labeling, Proc. European Conf. Computer Vision (ECCV), 2010.
[46] M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, and A. Zisserman, The PASCAL Visual Object Classes Challenge 2009 (VOC 2009) Results, 2009.


[47] X.B. Bosch, J.M. Gonfaus, J. van de Weijer, A.D. Bagdanov, J. Serrat, and J. Gonzalez, Harmony Potentials for Joint Classification and Segmentation, Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2010.

Min Sun received the BSc degree from National Chiao Tung University, the MSc degree from Stanford University, and the PhD degree from the University of Michigan, Ann Arbor. He is currently a postdoctoral researcher at the University of Washington. His research interests include object recognition, image understanding, human pose estimation, and machine learning. He won a best paper award at 3DRR 2007.

Byung-soo Kim received the BSc degree from the Korea Advanced Institute of Science and Technology and the MSc degree from the University of Michigan, Ann Arbor, where he is currently working toward the PhD degree. His research interests include object recognition and image understanding from color/depth images.

Pushmeet Kohli is a senior researcher in the Machine Learning and Perception group at Microsoft Research Cambridge, and an associate of the Psychometric Centre, University of Cambridge. His PhD thesis, written at Oxford Brookes University, was the winner of the British Machine Vision Association's Sullivan Doctoral Thesis Award and a runner-up for the British Computer Society's Distinguished Dissertation Award. His research revolves around intelligent systems and computational sciences, with a particular emphasis on algorithms and models for scene understanding and human pose estimation. His papers have appeared in SIGGRAPH, NIPS, ICCV, AAAI, CVPR, PAMI, IJCV, CVIU, ICML, AISTATS, AAMAS, UAI, ECCV, and ICVGIP, and have won best paper awards from ECCV 2010, ISMAR 2011, and ICVGIP 2006 and 2010.

Silvio Savarese received the PhD degree in electrical engineering from the California Institute of Technology in 2005. He is an assistant professor of computer science at Stanford University. He was a Beckman Institute Fellow at the University of Illinois at Urbana-Champaign from 2005 to 2008. He is a recipient of the James R. Croes Medal in 2013, a TWR Automotive Endowed Research Award in 2012, an NSF CAREER Award in 2011, and a Google Research Award in 2010. In 2002, he was awarded the Walker von Brimer Award for outstanding research initiative. He served as workshops chair and area chair for CVPR 2010, and as area chair for ICCV 2011 and CVPR 2013. He was editor of the Elsevier journal Computer Vision and Image Understanding, special issue on 3D Representation for Recognition, in 2009. He co-authored a book on 3D object and scene representation published by Morgan and Claypool in 2011. His work with his students has received several best paper awards, including a best student paper award from the IEEE CORP Workshop in conjunction with ICCV 2011, a Best Paper Award from the Journal of Construction Engineering and Management in 2011, and the CETI Award from the 2010 FIATECH Technology Conference. His research interests include computer vision, object recognition and scene understanding, shape representation and reconstruction, human activity recognition, and visual psychophysics.

    " For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.

