
ConVeS: A Context Verification Framework for Object Recognition System

Mozaherul Hoque Abul Hasanat
Computer Vision Research Group
School of Computer Sciences, Universiti Sains Malaysia
Malaysia
Tel: (+604) 653 4393
[email protected]

Dhanesh Ramachandram
Computer Vision Research Group
School of Computer Sciences, Universiti Sains Malaysia
Malaysia
Tel: (+604) 653 4046
[email protected]

Mandava Rajeswari
Computer Vision Research Group
School of Computer Sciences, Universiti Sains Malaysia
Malaysia
Tel: (+604) 653 4641
[email protected]

ABSTRACT

Context is a vital element in both biological and synthetic vision systems, and it is essential for deriving a meaningful explanation of an image. Unfortunately, there is a lack of consensus in the computer vision community on what context is and how it should be represented. In this paper, context is defined generally as "any and all information that is not directly derived from the object of interest but helps in explaining it". Furthermore, a description of context is provided in terms of its three major aspects, namely scope, source, and type. As an application of context to improving object detection results, a Context Verification System (ConVeS) is proposed. ConVeS combines semantic and spatial context with an external knowledgebase to verify object detection results produced by state-of-the-art machine learning algorithms such as support vector machines or artificial neural networks. ConVeS is presented as a simple framework that can be effectively applied to a wide range of computer vision applications such as medical imaging, surveillance video, and natural imagery.

1. INTRODUCTION

The ability of humans to recognize objects is amazingly dynamic and robust. Human beings can recognize an object in a variety of poses, under varying illumination, and even when it is partially occluded. A key reason for this ability is the use of context in the visual perception process. In the real world, objects often co-appear with other objects and in particular environments, providing contextual associations that can be learned by an intelligent vision system such as that of a human being [3][17], or perhaps by a synthetic vision system. These contextual associations prove useful for distinguishing visually similar objects, as exemplified in Figure 1.

Figure 1: The two image regions in the setting sun scene (left) and the orange in the orange basket (right) are visually very similar, but can easily be distinguished as one representing a setting sun and the other representing an orange due to the context provided by their neighboring objects. (Photographs used with permission from FreeFoto.com)

The importance of context in computer vision was emphasized by the neuroscientist Moshe Bar in [2]. He presented a detailed account of different aspects of context in the biological vision system and of how these might be used in computer vision. Unfortunately, there is a lack of consensus among computer vision researchers on what context is and how it should be represented. A general definition of context is therefore important in order to effectively utilize this resource in computer vision applications. We attempt to address this issue in Section 2 of this paper. The rest of the paper is organized as follows: Section 3 briefly reviews past research that has used context in one form or another. Section 4 introduces ConVeS (Context Verification System), a system that verifies object detection results produced by any machine learning algorithm; this is followed by a discussion of the advantages and implementation challenges of ConVeS in Section 5. Finally, conclusions are drawn in Section 6.

2. WHAT IS CONTEXT?

The understanding of context in the area of computer vision is somewhat imprecise. Some researchers are of the opinion that context should not be defined at all and should instead be regarded as a primitive; Hirst, for example, explained in [13] that

". . . context is what context does"

and avoided giving any strict definition of context. Carneiro [7], He [12], and Ramstrom [25] assumed context to be statistical features of pixel groups. Kruppa et al. [18] described context as the spatial correlation between two image regions.


Oliva and Torralba [21] explained context as a global texture feature of the whole image. Papadopoulos [23] construed context as the relationship between objects, derived from an external knowledgebase based on the semantic labels of the objects in the image. Clearly, researchers have described context from restricted points of view, without attempting to provide a holistic view of what context means in computer vision or to explain the different types of contextual information available for use; the underlying problem is the lack of a common understanding of, or consensus on, the meaning of context.

So far, the most comprehensive definition of context has been provided by Wolf and Bileschi in [28]. According to them, context is:

". . . information relevant to the detection task but not directly due to the physical appearance of the object".

But this definition restricts the utility of context to the object detection task alone, whereas context can also be used to recognize and explain the image and its contents. The notion of context includes any and all information that can be derived from nearby objects or from the whole scene. We therefore provide an improved definition of context:

"Context is any and all information that is not directly derived from the object of interest in the image but helps in explaining the object."

Our definition of context implies that there exist several types of context, based on the different sources of contextual information. Furthermore, any context has a scope within which its meaning is relevant. Hence, we describe context by its three important aspects: i) scope, ii) source, and iii) type. Context derived from neighboring areas has a local scope, whereas context derived from the whole image has a global scope. Local context sources include pixels, regions, and labeled objects. Global context sources are the region encompassing the whole image and the background scene of the image. Based on these different sources, context can be categorized into several types: statistical feature context is computed on pixels, regions, and the whole image; semantic, spatial, scale, and orientation context is obtained from objects and the scene; and finally, knowledge-based context is derived from an external knowledgebase and metadata. Table 1 provides an overview of these three aspects and their relations. Interestingly, these different aspects of context can be combined with each other, achieving a synergy in the generated contextual knowledge. For instance, semantic context can be linked with an external knowledgebase to explore the inter-object relationships that exist naturally among real-life objects.
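To make this taxonomy concrete, the following sketch (our illustration; the type names follow Table 1, everything else is hypothetical) encodes the three aspects as a small Python data structure:

```python
from dataclasses import dataclass
from enum import Enum

class Scope(Enum):
    LOCAL = "local"    # derived from neighboring areas
    GLOBAL = "global"  # derived from the whole image or scene

class ContextType(Enum):
    STATISTICAL = "statistical features"
    SEMANTIC = "semantic"
    SPATIAL = "spatial relations"
    SCALE = "scale"
    ORIENTATION = "orientation"
    KNOWLEDGE = "knowledge"

@dataclass
class ContextCue:
    ctype: ContextType
    scope: Scope
    source: str    # e.g. "pixel", "region", "object", "scene", "knowledgebase"
    value: object  # the contextual information itself, e.g. a label or a histogram

# A local semantic cue: a neighboring object labeled "basket"
cue = ContextCue(ContextType.SEMANTIC, Scope.LOCAL, "object", "basket")
```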

A review of the relevant research works involving different types and sources of context is presented in the next section.

3. RELATED WORKS

Existing context-based vision systems use the various types of context mentioned earlier for recognition tasks. Information derived from the statistical features of surrounding regions has been used by many researchers as context to determine the identity of an image segment; details of such systems can be found in [4][10] and [25]. Kruppa et al. [18] used co-occurrence information of image regions to infer the location of an object; their system detects the human head to find the face region in the image. Pixel co-occurrence relationships have also been used by Millet et al. in [19] to improve recognition results. However, such systems disregard the existence of any distant object which might have a strong influence in determining the identity of the object. Additionally, such systems do not require finding the identity of the neighboring objects and only seek correlations among nearby image regions to identify the object of interest. This leaves the system unable to discriminate between image patches of two different objects that have similar visual attributes; for example, "sky" and "water" have very similar visual attributes but provide different contextual cues.

Systems using semantic context extracted from nearby regions do not suffer from this problem, but they still lack knowledge about objects present at a distance. All of these systems also fail to utilize external knowledge sources, which can provide a very good contextual cue, as exploited in [14].

Another approach to integrating context into a vision system is to derive semantic context from the scene of the image, as described by Oliva and Torralba in [22]. Murphy and Torralba [20][27] have shown that it is possible to consider the whole scene, instead of individual objects, as context. Such an approach does not require the detection of surrounding object identities and thus avoids the difficulties associated with typical segmentation approaches. They suggested that learning statistical relations between objects allows the detection of one object or scene to generate strong expectations about the probable presence and location of other objects. Although this scene-context approach is good for predicting the general content of the image or its domain, it does not explore the inter-class relationships among objects within the domain. For example, the likelihood of finding a car in the image increases if wheels can be detected; the relation between the wheel and the car can only be derived through an external knowledgebase.

A handful of researchers have attempted to incorporate knowledgebase-derived context in various ways. Strat and Fischler [26] encoded contextual knowledge as a set of control rules and used it to interpret and verify detection results. An interesting way to incorporate contextual knowledge is to structure the knowledge itself in a generic (not image-specific) manner that represents the real-world scenario. This can be achieved through a graph structure that encodes semantic relationships among objects. Aslandogan et al. [1] and Rabinovich et al. [24] used WordNet [9] and GoogleSet^1, respectively, to create an ontology or ontology-like tree structure. But due to the undirected and acyclic nature of a tree structure, it cannot model the dependency relations among the members of a domain that exist in real-life natural imagery [2]. Im and Cho [15] used a Bayesian-network-based model of objects and their locations to infer the probability of a location given a set of detected objects. The model did not incorporate any domain information or inter-object dependency relations. Furthermore, most of the previous works that employ some form of knowledge structure did not consider spatial relationships in their systems, even though spatial relationships have proved to be an important cue in verifying the identity of an object both in the primate vision system [11] and in computer vision [6]. The influence of a specific domain on the likelihood of each object was also largely ignored.

^1 http://labs.google.com/sets


Table 1: Relationship of various context types, scopes and sources

Type                  Scope   Source                    Example
--------------------  ------  ------------------------  --------------------------
Statistical Features  Local   Pixel                     RGB values
                      Local   Region                    Mean, median, circularity
                      Global  Region (whole image)      Histogram
Semantic              Local   Object                    "car", "building"
                      Global  Scene                     "city", "beach", "indoor"
Spatial Relations     Local   Object, Region            "beside", "above"
                      Global  Scene                     "top", "bottom"
Scale                 Local   Object, Region
Orientation           Local   Object, Region
Knowledge             Local   Knowledgebase, Metadata
                      Global  Knowledgebase, Metadata

We introduce ConVeS to incorporate high-level semantic context and spatial relationships in order to verify object detection results generated by any typical machine learning algorithm, such as a support vector machine [16] or an artificial neural network [5]. Given an external knowledgebase and a set of detected objects, we seek to answer the question:

"Is this really the object that the machine learning algorithm thinks it is?"

Unlike the system proposed by Im and Cho [15], our system incorporates domain information by linking to an external knowledgebase and is thus able to reason about object dependencies that exist in the real world. A detailed description of our proposed system is presented in the following section.

4. CONVES - CONTEXT VERIFICATION SYSTEM

4.1 Overview

The proposed system utilizes high-level semantic context derived from the image and verifies it against a knowledgebase to improve the agreement between each label and the segment the label is assigned to. The system incorporates domain knowledge, object co-occurrence, and spatial relationships for this purpose. The components of the proposed framework are illustrated in Figure 2.

4.2 Input

In an object recognition system, a classifier algorithm typically generates labeled image segments along with the corresponding detection scores or probabilities for each label. Our system accepts the output of this classifier as its input. For example, for a given image the object detection algorithm typically produces a set of segments S = {s_1, s_2, ..., s_n}, where n is the total number of segments. Based on the algorithm's belief about the identity of each segment, every segment s_i is assigned a set of labels L = {l_1, l_2, ..., l_m}, where m is the number of labels per segment, together with the corresponding probabilities P(s_i l_j), j = 1, ..., m.
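As an illustration of this input format, the following Python sketch (the names and numbers are ours, purely hypothetical) represents each detected segment with its candidate labels and detection probabilities:

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    """One image segment s_i as produced by the upstream classifier."""
    seg_id: int
    # Candidate labels l_1..l_m with the classifier's probabilities P(s_i l_j)
    label_probs: dict = field(default_factory=dict)

# A hypothetical classifier output for the scene of Figure 1:
segments = [
    Segment(1, {"sun": 0.55, "orange": 0.45}),  # ambiguous round region
    Segment(2, {"sky": 0.80, "water": 0.20}),   # large background region
    Segment(3, {"basket": 0.90, "fence": 0.10}),
]
```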

4.3 Identify neighboring objects

The objects adjacent to each segment are identified and listed along with their sets of labels and corresponding probabilities. Adjacency is determined based on 8-connectivity rules. These neighboring objects form the semantic context for the object in question. The set of neighboring objects is denoted as S^nbr = {s^nbr_1, s^nbr_2, ..., s^nbr_{n-1}}, where n is the total number of segments in the image.
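A minimal sketch of the adjacency step, assuming (our assumption) that the segmentation is available as an integer label map in which each pixel stores the id of its segment:

```python
import numpy as np

def neighboring_segments(label_map: np.ndarray) -> dict:
    """For each segment id in an integer label map, collect the ids of
    segments adjacent to it under 8-connectivity."""
    h, w = label_map.shape
    nbrs = {int(s): set() for s in np.unique(label_map)}
    # Comparing each pixel with its right, down, down-right and down-left
    # neighbor covers every 8-connected pixel pair exactly once.
    for dy, dx in [(0, 1), (1, 0), (1, 1), (1, -1)]:
        rows_a, rows_b = slice(0, h - dy), slice(dy, h)
        if dx >= 0:
            cols_a, cols_b = slice(0, w - dx), slice(dx, w)
        else:
            cols_a, cols_b = slice(-dx, w), slice(0, w + dx)
        a, b = label_map[rows_a, cols_a], label_map[rows_b, cols_b]
        touching = a != b  # pixel pairs belonging to different segments
        for u, v in set(zip(a[touching].tolist(), b[touching].tolist())):
            nbrs[u].add(v)
            nbrs[v].add(u)
    return nbrs
```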

4.4 Generate context map with spatial relations

The list of neighboring objects is augmented with spatial context information; we call this augmented list the context map. The neighboring objects and their spatial relations are considered the context for the image segment in question. The spatial relationship between the segment and a neighboring object is determined from the position of the object relative to the segment. Figure 3 provides a simple schema showing the possible spatial relationships.
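One plausible realization of this step compares region centroids; the direction names and the dominance rule below are our reading of the schema in Figure 3, not a specification from the paper:

```python
import numpy as np

def spatial_relation(seg_mask: np.ndarray, nbr_mask: np.ndarray) -> str:
    """Classify where a neighboring object lies relative to the segment,
    using the displacement between the two region centroids."""
    sy, sx = np.argwhere(seg_mask).mean(axis=0)
    ny, nx = np.argwhere(nbr_mask).mean(axis=0)
    dy, dx = ny - sy, nx - sx
    vertical = "above" if dy < 0 else "below"  # image y grows downward
    horizontal = "left of" if dx < 0 else "right of"
    # If one axis clearly dominates, use it alone; otherwise report both.
    if abs(dy) > 2 * abs(dx):
        return vertical
    if abs(dx) > 2 * abs(dy):
        return horizontal
    return vertical + " and " + horizontal

# The context map pairs each neighbor with its labels and spatial relation:
# context_map[i] = [(j, labels[j], spatial_relation(mask[i], mask[j]))
#                   for j in neighbors[i]]
```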

4.5 Verify contextual agreement

After creating the context map for each segment, we verify the object label based on the contextual agreement of the object with its neighbors and the scene, checked against an expert-constructed knowledgebase. Here we propose a knowledge structure based on a Bayesian network to model the real-world relationships among objects in any domain. The proposed knowledgebase models the domain memberships and spatial relationships among objects that exist in natural imagery. A Bayesian network has the advantage of being a directed graph structure, as opposed to the tree-structured ontology used in [24], which allows us to model dependency relationships.


Figure 2: Proposed Context Verification System - ConVeS, demonstrating a hypothetical verification task. (Photographs used with permission from FreeFoto.com)

Figure 3: Positions of adjacent objects and their spatial relations with respect to object S

In order to verify the object label, the probability score of the candidate segment is iteratively updated by presenting the probabilities of all other neighboring segments to the knowledgebase as observed evidence. According to Bayes' rule, the probability of image segment s_i being labeled l_j, using the notation introduced in Sections 4.2 and 4.3, is:

P(s_i l_j | s^nbr_1 l_j, s^nbr_2 l_j, ..., s^nbr_{n-1} l_j)
    = P(s_i l_j) P(s^nbr_1 l_j, s^nbr_2 l_j, ..., s^nbr_{n-1} l_j | s_i l_j) / P(s^nbr_1 l_j, s^nbr_2 l_j, ..., s^nbr_{n-1} l_j)
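The following sketch illustrates this update under a conditional-independence (naive Bayes) simplification of the joint likelihood above; the likelihood function stands in for a knowledgebase query, and its toy values are invented for illustration:

```python
def verify_label(prior: float, candidate: str, nbr_labels: list,
                 kb_likelihood) -> float:
    """Update P(s_i = candidate) given the labels of neighboring segments,
    treating the neighbors as conditionally independent evidence
    (a naive-Bayes simplification of the joint likelihood in the text)."""
    p_e_given_c, p_e_given_not_c = 1.0, 1.0
    for nbr in nbr_labels:
        p_e_given_c *= kb_likelihood(nbr, candidate)      # P(nbr | candidate)
        p_e_given_not_c *= kb_likelihood(nbr, None)       # background rate
    numerator = prior * p_e_given_c
    denominator = numerator + (1.0 - prior) * p_e_given_not_c
    return numerator / denominator

# Toy knowledgebase: oranges co-occur with baskets, suns with sky.
def kb_likelihood(nbr, candidate):
    table = {("basket", "orange"): 0.7, ("sky", "sun"): 0.8}
    return table.get((nbr, candidate), 0.1)

# The "orange vs. sun" ambiguity of Figure 1, resolved by a basket neighbor:
print(verify_label(0.45, "orange", ["basket"], kb_likelihood))  # ≈ 0.85
```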

4.6 Update segment labels

Segment labels that are verified through multiple knowledge bases with a probability exceeding a predefined threshold are selected as final candidates. The process is illustrated in Figure 4.

5. DISCUSSION

ConVeS integrates semantic context and a knowledgebase to verify the object detection results for an image. The system is designed to be generic: given an appropriate knowledgebase, ConVeS can be applied to natural imagery, medical images, surveillance videos, or any other problem domain in computer vision. A difficulty in implementing ConVeS may arise from the construction of a proper knowledgebase. Ideally, the underlying Bayesian network model of a knowledgebase should encode all the object dependencies and the related conditional probabilities; in a real-world scenario this requires a considerable amount of time and expertise. Fortunately, for reasonably good accuracy a Bayesian network does not require all conditional probabilities to be specified, as suggested by Charniak in [8]. Domain experts can subjectively determine the probabilities based on a representative set of examples.


Figure 4: Illustration of updating segment labels using a hypothetical knowledgebase and probabilities. (Photographs used with permission from FreeFoto.com)

6. CONCLUSION

Understanding the meaning and the different aspects of context is vital in order to fully utilize this rich resource in computer vision tasks. In this paper we provided a functional definition of context together with a brief discussion of its aspects. The proposed ConVeS (Context Verification System) incorporates contextual information into a generic object recognition system. ConVeS illustrates that semantic context and spatial relationships can be effectively combined with an external knowledgebase to improve the agreement between a label and the image segment it is assigned to. It can be generalized to any image and video domain, such as natural imagery, medical imaging, and surveillance. In this regard we further propose the construction of a knowledgebase that models the real-world relationships that exist among objects. We believe ConVeS will make vision applications more reliable and applicable to wider fields where accuracy is vital for success.

7. ACKNOWLEDGMENTS

This research has been supported by the Ministry of Higher Education, Government of Malaysia, under the Fundamental Research Grant Scheme.

8. REFERENCES

[1] Y. Aslandogan, C. Thier, C. Yu, J. Zou, and N. Rishe. Using semantic contents and WordNet in image retrieval. In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 286-295, 1997.

[2] M. Bar. Visual objects in context. Nature Reviews Neuroscience, 5(8):617-629, 2004.

[3] M. Bar and S. Ullman. Spatial context in recognition. Perception, 25(3):343-352, 1996.

[4] N. Bergboer, E. Postma, and H. v. d. Herik. Context-based object detection in still images. Image and Vision Computing, 24(9):987-1000, Sep. 2006.

[5] K. Boehnke, M. Otesteanu, P. Roebrock, W. Winkler, and W. Neddermeyer. Neural network based object recognition using color block matching. In Proceedings of the Fourth IASTED International Conference on Signal Processing, Pattern Recognition, and Applications, pages 122-125. ACTA Press, Anaheim, CA, USA, 2007.

[6] P. Carbonetto, N. de Freitas, and K. Barnard. A statistical model for general contextual object recognition. In Computer Vision - ECCV 2004, pages 350-362, 2004.

[7] G. Carneiro and A. Jepson. Pruning local feature correspondences using shape context. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), 3:16-19, 23-26 Aug. 2004.

[8] E. Charniak. Bayesian networks without tears. AI Magazine, 12(4):50-63, 1991.

[9] C. Fellbaum. WordNet: An Electronic Lexical Database. MIT Press, 1998.

[10] H. Feng and T.-S. Chua. A learning-based approach for annotating large on-line image collection. In Proceedings of the 10th International Multimedia Modelling Conference, pages 249-256, 5-7 Jan. 2004.

[11] I. Gauthier and M. Tarr. Unraveling mechanisms for expert object recognition: Bridging brain activity and behavior. Journal of Experimental Psychology: Human Perception and Performance, 28(2):431-446, 2002.

[12] X. He, R. Zemel, and M. Carreira-Perpinan. Multiscale conditional random fields for image labeling. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), volume 2, pages II-695-II-702, 27 June-2 July 2004.

[13] G. Hirst. Context as a spurious concept. In Proceedings, Conference on Intelligent Text Processing and Computational Linguistics, pages 273-287, Mexico City, 2000.

[14] A. Hoogs and R. Collins. Object boundary detection in images using a semantic ontology. In CVPR Workshops (CVPRW), page 111, 2006.

[15] S.-B. Im and S.-B. Cho. Context-based scene recognition using Bayesian networks with scale-invariant feature transform. In Advanced Concepts for Intelligent Vision Systems, volume 4179 of Lecture Notes in Computer Science, pages 1080-1087. Springer Berlin / Heidelberg, 2006.

[16] W. Kienzle, G. Bakir, M. Franz, and B. Scholkopf. Efficient approximations for support vector machines in object detection. In Proc. DAGM'04, pages 54-61, 2005.

[17] A. Kleinschmidt, C. Büchel, S. Zeki, and R. S. J. Frackowiak. Human brain activity during spontaneously reversing perception of ambiguous figures. Proceedings of the Royal Society B: Biological Sciences, 265(1413):2427-2433, Dec. 1998.

[18] H. Kruppa and B. Schiele. Using local context to improve face detection. In Proceedings of the British Machine Vision Conference (BMVC), Norwich, England, pages 3-12, 2003.

[19] C. Millet, I. Bloch, P. Hede, and P. Moellic. Using relative spatial relationships to improve individual region recognition. In Proc. 2nd European Workshop on the Integration of Knowledge, Semantics and Digital Media Technology, pages 119-126, 2005.

[20] K. Murphy, A. Torralba, and W. Freeman. Using the forest to see the trees: A graphical model relating features, objects and scenes. Advances in Neural Information Processing Systems, 16, 2003.

[21] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145-175, 2001.

[22] A. Oliva and A. Torralba. The role of context in object recognition. Trends in Cognitive Sciences, 11(12):520-527, Dec. 2007.

[23] G. Papadopoulos, P. Mylonas, V. Mezaris, Y. Avrithis, and I. Kompatsiaris. Knowledge-assisted image analysis based on context and spatial optimization. International Journal on Semantic Web and Information Systems, 2(3):17-36, July-September 2006.

[24] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie. Objects in context. In Proceedings of the 11th IEEE International Conference on Computer Vision (ICCV 2007), pages 1-8, 14-21 Oct. 2007.

[25] O. Ramstrom and H. Christensen. Object detection using background context. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), 3:45-48, 23-26 Aug. 2004.

[26] T. Strat and M. Fischler. Context-based vision: Recognition of natural scenes. In Twenty-Third Asilomar Conference on Signals, Systems and Computers, volume 1, pages 532-536, 1989.

[27] A. Torralba. Contextual priming for object detection. International Journal of Computer Vision, 53(2):169-191, 2003.

[28] L. Wolf and S. Bileschi. A critical view of context. International Journal of Computer Vision, 69(2):251-261, Aug. 2006.
