A Novel Object Representation Model for Object Detection

Yihua Lan, Haozheng Ren, Yong Zhang, Hongbo Yu, and Guangwei Wang

Fourth International Workshop on Advanced Computational Intelligence (IWACI 2011), Wuhan, Hubei, China; October 19-21, 2011

Abstract—In this paper, we present a non-symmetry and anti-packing object pattern representation model (NAM). The NAM model codes the geometry of generic object categories as a hierarchy of sub-patterns, and each sub-pattern is represented by a rich set of image cues. The sub-pattern-based NAM model is designed to decouple variations due to affine warps and other forms of shape deformation, while the combination of multiple features handles the local variation of the object. We then train part classifiers and, based on this model, apply a generalized Hough voting scheme to generate object locations and scales. Experimental results on a variety of categories demonstrate that our method successfully detects objects within images.

I. INTRODUCTION

Object detection is an important task in computer vision which has made great use of learning. It is a difficult problem because objects in a category can vary greatly in shape and appearance. Variation arises not only from changes in illumination, occlusion, background clutter, and viewpoint, but also from non-rigid deformations and intra-class variation in shape and other visual properties among objects in a rich category.

How do we deal with this variation, especially the intra-class and pose variability of objects? Most current research has focused on modeling object variability, including several kinds of deformable template models [1]-[2] and a variety of part-based, fragment-based models [3]-[9].

The pictorial structure models [5]-[10] represent an object by a collection of parts arranged in a deformable configuration, where the deformable configuration is represented by spring-like connections between pairs of parts. Crandall et al. propose the k-fans model [11] to study the extent to which additional spatial constraints among parts are actually helpful in detection and localization.

Manuscript received April 7, 2011. This work was supported in part by the National Natural Science Foundation of China under Grant No. 40806011, the Natural Science Foundation of Jiangsu under Grant BK20082140, the Natural Science Foundation of Huaihai Institute of Technology under Grant Z2009013, and the Construction Fund of Jiangsu for Key Disciplines in Applications of Computing.

Yihua Lan is with the School of Computer Engineering, Huaihai Institute of Technology, Lianyungang 222005, China (corresponding author; phone: 086 18986188662; e-mail: [email protected]).

Haozheng Ren is with the School of Computer Engineering, Huaihai Institute of Technology, Lianyungang 222005, China (e-mail: [email protected]).

Yong Zhang is with the School of Computer Engineering, Huaihai Institute of Technology, Lianyungang 222005, China (e-mail: [email protected]).

Hongbo Yu is with the School of Computer Engineering, Huaihai Institute of Technology, Lianyungang 222005, China (e-mail: [email protected]).

Guangwei Wang is with the School of Computer Science and Technology, Huazhong University of Science and Technology, 430074, China (e-mail: [email protected]).

The patchwork-of-parts model from [6] is similar, but it explicitly considers how the appearance models of overlapping parts interact to define a dense appearance model for images. These works show that adding spatial constraints gives better performance.

There are several possibilities for representing object classes. A star-shaped model [12]-[13] can be easily trained and evaluated, in contrast to the constellation model [14] or complex graphical models [15]. It allows using as many parts as required, since the complexity scales linearly. Moreover, this model is flexible enough to deal with large variations in the object shape and appearance of rigid and articulated structures.

The method of Leibe et al. [16] gives a highly flexible learned representation for object shape that can combine the information observed in different training examples. Opelt et al. [9] explore a geometric representation similar to that of Leibe et al., but use only the boundaries of the object, both internal and external (silhouette).

A number of recent approaches have attempted to learn feature weights automatically [17]-[18] using variants of multiple kernel learning. These learning mechanisms, however, only identify and weight the most discriminative features; they do not identify and model the interplay between features that may prove important for representing objects well.

Our approach deals with object variation at two levels, global and local. Firstly, we propose a non-symmetry and anti-packing object pattern representation model (NAM) to represent an object category. The NAM object model consists of several local parts, which we call sub-patterns. The model codes the global geometry of generic visual object categories with spatial relations linking the object pattern to its sub-patterns.

Secondly, multiple cues per part are incorporated as a key component of the local features. We introduce an appropriate descriptor for each cue to describe the information of a sub-pattern.

The proposed framework can be applied to any object that consists of distinguishable parts arranged in a relatively fixed spatial configuration. Our experiments are performed on images of side views of horses; therefore, this object class will be used as a running example throughout the paper to illustrate the ideas and techniques involved.

The rest of the paper is organized as follows. Section II describes the non-symmetry and anti-packing model. Section III presents the part feature descriptors, and Section IV describes the learning and detection method. In Section V, experiments on real images show that the model is effective for object categorization. Conclusions are given in Section VI.



II. DESCRIPTION OF THE NON-SYMMETRY AND ANTI-PACKING PATTERN REPRESENTATION MODEL

The non-symmetry and anti-packing object pattern representation model (NAM) is built around an anti-packing problem. The idea of the NAM can be described as follows: given a packed pattern and n predefined sub-patterns, pick these sub-patterns out of the packed pattern and then represent the packed pattern as the combination of these sub-patterns.

Here, we use an object pattern $\Omega$ to describe an object category. The model for an object with $n$ parts is formally defined by a set of part models $(P_1, P_2, \ldots, P_n)$, where $P_i = (v, x, l, d, w, \varphi)$, i.e.,

$$\Omega = \bigcup_{i=1}^{n} P_i\big(v, x, l, d, w, \varphi(x, l) = \{f_1, f_2, \ldots, f_m\}\big), \qquad (1)$$

where $v$ is a two-dimensional vector specifying an "anchor" position for sub-pattern $P_i$ relative to the pattern position; $x$ is the position of sub-pattern $P_i$ in the query image; $l$ represents the scale of the sub-pattern; $d$ is a deformation vector; $\varphi(x, l)$ denotes the feature vector of the $i$-th sub-pattern; and $f_j$ $(1 \le j \le m)$ is one of its feature descriptors.
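As an illustration only, the following Python sketch mirrors the part tuple in (1) as a plain data structure; the class and field names (SubPattern, ObjectPattern, anchor, features, and so on) are our own and are not defined in the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

import numpy as np


@dataclass
class SubPattern:
    """One part model P_i = (v, x, l, d, w, phi) of the object pattern in Eq. (1)."""
    anchor: Tuple[float, float]                      # v: anchor position relative to the pattern
    position: Tuple[float, float] = (0.0, 0.0)       # x: position in the query image
    scale: float = 1.0                               # l: scale of the sub-pattern
    deformation: Tuple[float, float] = (0.0, 0.0)    # d: deformation vector
    weight: float = 1.0                              # w: discriminative weight
    features: Dict[str, np.ndarray] = field(default_factory=dict)  # phi(x, l) = {f_1, ..., f_m}


@dataclass
class ObjectPattern:
    """Object pattern Omega: the union of its n sub-patterns."""
    category: str
    parts: List[SubPattern]


# Toy example: a horse pattern with two sub-patterns and placeholder cue histograms.
horse = ObjectPattern(
    category="horse",
    parts=[
        SubPattern(anchor=(0.3, 0.4), features={"edge_dir": np.zeros(8)}),
        SubPattern(anchor=(0.7, 0.6), features={"edge_dir": np.zeros(8)}),
    ],
)
```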

Fig. 1. The NAM object representation model. Sub-patterns $p_i$ $(1 \le i \le n)$ (black boxes) are arranged within the bounding box (red). Small red circles show the optimal positions of the sub-patterns, and green ellipses show the spatial uncertainty $d_c$. Blue arrows represent the voting transformation. The big red circle is the collection of sub-pattern votes.

The non-symmetry relationship between sub-patterns describes the global structure information of the object category and is designed to decouple variations due to affine warps, pose variability, and other forms of shape deformation.

Anti-packing is the procedure of finding sub-patterns in query images, combining them into an object pattern, and classifying the result.

III. SUB-PATTERN FEATURE DESCRIPTOR

The object representation model in this paper is capable of incorporating all useful object information and the corresponding feature descriptors, so it is suitable for use with very basic feature detectors. Each object is represented by a set of parts based on the NAM model. Parts are described by a rich set of cues (shape, color, appearance, and texture) inside them, based on the observation that, for a wide variety of common object categories, shape [19]-[21] matters more than other information [22]-[23].

Here, we adapt the method of [24] for describing sub-patterns. The bounding box of a sub-pattern is subdivided evenly into an $n \times n$ grid, and each cell encodes information only inside the part. We capture different cues from the cells, and each type of cue is encoded by concatenating the cell signals into a histogram. In this paper, we consider contour shape, appearance, and color cues.
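The grid encoding above can be sketched in a few lines of Python; `cell_histogram` is a placeholder for any of the per-cue histograms (EDH, HOG, texton, or color), and the even split of the bounding box is an assumption about implementation details the paper does not spell out.

```python
import numpy as np


def grid_cells(patch: np.ndarray, n: int):
    """Split a sub-pattern patch evenly into an n x n grid of cells."""
    h, w = patch.shape[:2]
    ys = np.linspace(0, h, n + 1, dtype=int)
    xs = np.linspace(0, w, n + 1, dtype=int)
    for i in range(n):
        for j in range(n):
            yield patch[ys[i]:ys[i + 1], xs[j]:xs[j + 1]]


def cell_descriptor(patch: np.ndarray, n: int, cell_histogram, bins: int = 8) -> np.ndarray:
    """Concatenate one per-cell histogram into a single sub-pattern descriptor."""
    return np.concatenate([cell_histogram(cell, bins) for cell in grid_cells(patch, n)])
```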

Contour shape: we use shape information as a key component for object detection. Since edge points are closely related to shape information, the edge direction histogram (EDH) is a very simple and direct way to characterize the shape of an object. The EDH is computed by grouping edge pixels according to their edge directions and counting the number of pixels in each direction.
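The paper does not specify the edge detector or the number of direction bins; the sketch below assumes a simple gradient-magnitude threshold and eight orientation bins, purely for illustration.

```python
import numpy as np


def edge_direction_histogram(gray: np.ndarray, bins: int = 8,
                             edge_thresh: float = 0.1) -> np.ndarray:
    """Quantize gradient directions at edge pixels into a normalized histogram (EDH)."""
    gy, gx = np.gradient(gray.astype(float))
    magnitude = np.hypot(gx, gy)
    direction = np.arctan2(gy, gx) % np.pi              # orientations folded into [0, pi)
    edges = magnitude > edge_thresh * magnitude.max()   # crude stand-in for an edge detector
    hist, _ = np.histogram(direction[edges], bins=bins, range=(0.0, np.pi))
    return hist / max(hist.sum(), 1)
```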

Appearance: represented by the HOG descriptor [25], since local object appearance and shape can often be characterized rather well by the distribution of local intensity gradients. Texture: described by texton histograms.
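The paper does not list HOG parameters; the values below are the common defaults from [25], and scikit-image's `hog` is shown only as one possible off-the-shelf implementation.

```python
import numpy as np
from skimage.feature import hog  # reference HOG implementation from scikit-image


def appearance_descriptor(gray_patch: np.ndarray) -> np.ndarray:
    """HOG descriptor of a sub-pattern patch, following Dalal and Triggs [25]."""
    return hog(gray_patch,
               orientations=9,
               pixels_per_cell=(8, 8),
               cells_per_block=(2, 2),
               block_norm="L2-Hys",
               feature_vector=True)
```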

IV. LEARNING AND DETECTION

A. Learning and Initial Detection

The task of learning is to establish $n$ classifiers $\{Cf_1(\cdot), Cf_2(\cdot), \ldots, Cf_n(\cdot)\}$ for an object pattern with $n$ sub-patterns. Taking one classifier as an example: given a set of training image windows labeled as positive (object) or negative (non-object), each image window is converted into a feature vector as described above. These vectors are then fed as input to a supervised learning algorithm that learns to classify an image window as a member or non-member of the object pattern. In our experiments, we chose a linear SVM as the classifier.
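The paper only states that linear SVMs are used as the part classifiers; the sketch below uses scikit-learn's LinearSVC with its default regularization as an assumed stand-in.

```python
from sklearn.svm import LinearSVC


def train_part_classifiers(training_sets):
    """Train one linear SVM per sub-pattern.

    training_sets: list of (X, y) pairs, one per sub-pattern, where each row of X
    is a window feature vector phi(x, l) and y holds the positive/negative labels.
    """
    classifiers = []
    for X, y in training_sets:
        clf = LinearSVC(C=1.0)   # regularization constant is an assumption
        clf.fit(X, y)
        classifiers.append(clf)
    return classifiers
```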

Sliding window classification [20][27] is a simple, effective technique for object detection. The initial detection problem is to determine whether the query image contains sub-pattern instances and, if so, where they are. The classifier $Cf_i(\cdot)$ is applied to fixed-size windows at various locations in a feature pyramid, each window being represented as a feature vector $\varphi(x, l)$, where $x$ specifies the position of the window in the image and $l$ specifies the level of the image in the pyramid. The following expression evaluates the classifier $Cf_i(\cdot)$ at one of the sliding windows:

$$t_{P_i, j} = Cf_i(\varphi(x, l)). \qquad (2)$$

If $t_{P_i, j}$ indicates a positive classification, then $h_{P_i, j} = (x, l)$ is a hypothesis position of sub-pattern $P_i$; otherwise, it is not. We put each $h_{P_i, j}$ into the $i$-th sub-pattern hypothesis set $H_{P_i} = \{h_{P_i, 1}, h_{P_i, 2}, \ldots, h_{P_i, K_i}\}$, and let $\mathcal{H} = \{H_{P_1}, H_{P_2}, \ldots, H_{P_n}\}$ denote the collection of all sub-pattern hypothesis positions.
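A rough sketch of this sliding-window stage is given below. The window size, stride, pyramid construction, and the rule of accepting windows with a positive SVM score are all assumptions; the paper does not report these details.

```python
import numpy as np


def sub_pattern_hypotheses(classifier, pyramid, extract_features,
                           window=(64, 64), stride=8):
    """Collect hypothesis positions h = (x, l) for one sub-pattern, cf. Eq. (2).

    pyramid: list of images, one per scale level l.
    extract_features: maps an image window to its feature vector phi(x, l).
    """
    hypotheses = []
    wh, ww = window
    for level, image in enumerate(pyramid):
        h, w = image.shape[:2]
        for y in range(0, h - wh + 1, stride):
            for x in range(0, w - ww + 1, stride):
                phi = extract_features(image[y:y + wh, x:x + ww])
                score = classifier.decision_function(phi.reshape(1, -1))[0]
                if score > 0:                      # positive response -> hypothesis
                    hypotheses.append(((x, y), level, score))
    return hypotheses
```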

B. Voting and Detection

The goal of this stage is to generate bounding boxes of the category in the image. To achieve this, we use a generalized Hough voting scheme based on the transformation between the detected parts and the associated NAM object model.


In recognition techniques based on Hough voting [27], the main idea is to represent the object as a collection of parts (sub-patterns) and have each part cast votes in a discrete voting space. Each vote corresponds to a hypothesis of object location $\hat{h}$ and class $\Omega$. The object is identified by the conglomeration of votes in a small neighborhood of the voting space $V(\hat{h}, \Omega)$, which is typically defined as the sum of independent votes $\hat{p}_i(v, x, l, d, w, \varphi)$ from each part $i$, where $x$ is the location of the part in the query image, $l$ is the scale of the part, and $\varphi$ is the part feature. The final Hough image is computed by combining the votes from all parts for the image in question.

The first step is to generate a voting score at a centroid location $c_i$ of the object pattern $\Omega$ by applying a transformation $\mathrm{T}(\cdot)$ to $h_{P_i}$ and $\Omega$. The transformation is a voting procedure, which is characterized by

$$s_i(c_i) = \mathrm{T}(h_{P_i}, \Omega), \qquad (3)$$

where $c_i$ is the voting location of sub-pattern hypothesis $h_{P_i}$. This transformation provides not only the voting score and location $c_i$, but also the deviation $d_i$ and weight $w_i$ of the sub-pattern within the object pattern. In Fig. 1, the blue arrows show the transformation procedure. The transformation $\mathrm{T}(\cdot)$ exploits the rough localization provided by the spatial relationship between the sub-patterns and the object pattern.

Next, we want to find the local maximum voting scores in the Hough image. The object location is obtained by choosing local maxima in the Hough image whose responses are above a certain threshold $\tau$:

$$S_{det}(\hat{\Omega}) = \sum_{i=1,\; c_i \in \hat{D}_{\hat{\Omega}}}^{n} \big(s_i(c_i)\, w_i - d_i\big). \qquad (4)$$

The overall detection score $S_{det}(\hat{\Omega})$ for object pattern $\hat{\Omega}$ is a combination of the detected part voting scores in the domain $\hat{D}_{\hat{\Omega}}$, where $w_i$ is the discriminative weight of sub-pattern $p_i$ and $d_i$ is the deviation of the sub-pattern from its optimal position. If $S_{det}(\hat{\Omega}) > \tau$, then $\hat{D}_{\hat{\Omega}}$ is a centroid of a detected object $\Omega$; otherwise, it is not.
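To make the voting stage concrete, the sketch below accumulates part votes on a coarse grid and reports local maxima above the threshold. The pyramid scaling, the grid cell size, and the omission of the deviation term $d_i$ are simplifications of Eqs. (3) and (4), not the authors' implementation.

```python
import numpy as np


def hough_detect(hypotheses_per_part, anchors, weights, image_shape,
                 cell=8, tau=1.0):
    """Accumulate sub-pattern votes and pick local maxima above tau.

    hypotheses_per_part[i]: list of ((x, y), level, score) for sub-pattern P_i.
    anchors[i]: assumed offset from sub-pattern P_i to the object centroid.
    """
    hh, hw = image_shape[0] // cell, image_shape[1] // cell
    votes = np.zeros((hh, hw))
    for part, hyps in enumerate(hypotheses_per_part):
        ax, ay = anchors[part]
        for (x, y), level, score in hyps:
            scale = 2 ** level                              # assumed pyramid scaling
            cx, cy = int((x + ax) * scale), int((y + ay) * scale)
            if 0 <= cy // cell < hh and 0 <= cx // cell < hw:
                votes[cy // cell, cx // cell] += weights[part] * score
    detections = []
    for i in range(1, hh - 1):
        for j in range(1, hw - 1):
            if votes[i, j] > tau and votes[i, j] == votes[i - 1:i + 2, j - 1:j + 2].max():
                detections.append(((j * cell, i * cell), votes[i, j]))
    return detections
```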

V. EXPERIMENTS

We present an evaluation of the detection performance of our technique on several challenging datasets and compare it against other methods. We adopt the evaluation criteria used in the PASCAL challenge [28]: a detection is considered correct if the area of overlap between the predicted bounding box and the ground-truth bounding box exceeds 50%.
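The overlap test from the PASCAL protocol [28] is the usual intersection-over-union ratio; a direct implementation is shown below for reference.

```python
def overlap_ratio(box_a, box_b):
    """Intersection area over union area of two (x_min, y_min, x_max, y_max) boxes.

    A detection is accepted when this ratio against the ground-truth box exceeds 0.5.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```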

First, we test our algorithm on the INRIA (170 positive and 170 negative images) and Weizmann Horse (328 images) databases. Both are very challenging datasets of horse images, containing different breeds, colors, and textures, with varied articulations, lighting conditions, and scales.

We compare our method with [29] and our previous work [30] on the INRIA and Weizmann Horse databases. Table I shows the recognition accuracy at the equal-error-rate point for our algorithm, Marius et al. [29], and our previous work [30]. It can be seen that the performance of our algorithm is superior to that of the other two methods.

We also compare our method with V. Ferrari et al. [31] on the ETHZ Shape Classes dataset. The ETHZ shape database (collected by V. Ferrari et al. [32]) consists of five distinctive shape categories (apple logos, bottles, giraffes, mugs, and swans) in a total of 255 images. All categories have significant intra-class variations, scale changes, and illumination changes. Moreover, many objects are surrounded by extensive background clutter and have interior contours. Figure 2 shows our comparison to [31] on each of the categories. Our method outperforms [31] on all five categories, and the average detection rate increases by 11.2% (78.4% with respect to their 67.2%) at a false-positives-per-image (FPPI) rate of 0.3.

TABLE I
CATEGORY RECOGNITION RATES (AT EQUAL ERROR RATE) ON THE WEIZMANN AND INRIA HORSE DATASETS.

Dataset             Previous work   Marius et al.   Ours
Horses (Weizmann)   92.67%          92.02%          93.12%
Horses (INRIA)      89.72%          87.14%          91.76%


Fig. 2. Comparison of detection performance with Ferrari et al. [32] on the ETHZ shape database. Each plot shows the detection rate as a function of FPPI.

VI. CONCLUSIONS

In this paper, we have proposed a non-symmetry and anti-packing object pattern representation model (NAM) to represent an object category. This model effectively codes the global structure of an object. The object pattern model consists of several part sub-patterns, and we selected appropriate feature descriptors for the sub-patterns to deal with local variation. In our work, several feature descriptors were introduced to describe the information of the sub-patterns. Based on this representation, a Hough voting method was used to detect objects. The proposed framework can be applied to any object category that consists of distinguishable parts arranged in a relatively fixed spatial configuration.

ACKNOWLEDGMENT

The authors would like to thank the Department of Network Engineering, School of Computer Engineering, Huaihai Institute of Technology for providing the hardware environment used to carry out the experiments, and especially Jian Zhang, a colleague of the authors, and Jun Shi, vice dean of the School of Computer Engineering, for very useful discussions on the optimization algorithm at HHIT.

REFERENCES

[1] T. F. Cootes, G. J. Edwards, and C. J. Taylor, "Active appearance models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, pp. 681-685, 2001.
[2] I. Matthews and S. Baker, "Active appearance models revisited," International Journal of Computer Vision, vol. 60, pp. 135-164, 2004.
[3] R. Fergus, P. Perona, and A. Zisserman, "A sparse object category model for efficient learning and exhaustive recognition," Proc. CVPR 2005, pp. 380-387.
[4] S. Agarwal, A. Awan, and D. Roth, "Learning to detect objects in images via a sparse, part-based representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, pp. 1475-1490, 2004.
[5] P. F. Felzenszwalb and D. P. Huttenlocher, "Pictorial structures for object recognition," International Journal of Computer Vision, vol. 61, pp. 55-79, 2005.
[6] Y. Amit and A. Trouvé, "POP: Patchwork of parts models for object recognition," International Journal of Computer Vision, vol. 75, pp. 267-282, 2007.
[7] K. M. Lee, "Component-based face detection and verification," Pattern Recognition Letters, vol. 29, pp. 200-214, 2008.
[8] H. Schneiderman and T. Kanade, "Object detection using the statistics of parts," International Journal of Computer Vision, vol. 56, pp. 151-177, 2004.
[9] A. Opelt, A. Pinz, and A. Zisserman, "A boundary-fragment-model for object detection," Proc. ECCV 2006, pp. 575-588, 2006.
[10] D. Crandall, P. Felzenszwalb, and D. Huttenlocher, "Spatial priors for part-based recognition using statistical models," Proc. CVPR 2005, pp. 10-17.
[11] B. Leibe, A. Leonardis, and B. Schiele, "Robust object detection with interleaved categorization and segmentation," International Journal of Computer Vision, vol. 77, pp. 259-289, 2008.
[12] J. Shotton, A. Blake, and R. Cipolla, "Contour-based learning for object detection," Proc. ICCV 2005, vol. 1, pp. 503-510.
[13] B. Leibe, E. Seemann, and B. Schiele, "Pedestrian detection in crowded scenes," Proc. CVPR 2005, pp. 878-885.
[14] R. Fergus, P. Perona, and A. Zisserman, "Object class recognition by unsupervised scale-invariant learning," Proc. CVPR 2003, pp. 264-271.
[15] E. B. Sudderth, A. Torralba, and W. T. Freeman, "Learning hierarchical models of scenes, objects, and parts," Proc. ICCV 2003, pp. 1331-1338.
[16] B. Leibe, A. Leonardis, and B. Schiele, "Robust object detection with interleaved categorization and segmentation," International Journal of Computer Vision, vol. 77, pp. 259-289, 2008.
[17] A. Bosch and A. Zisserman, "Image classification using ROIs and multiple kernel learning," International Journal of Computer Vision, pp. 1-25, 2008.
[18] K. Grauman and T. Darrell, "The pyramid match kernel: Efficient learning with sets of features," The Journal of Machine Learning Research, vol. 8, pp. 725-760, 2007.
[19] J. Sullivan, "Exploiting part-based models and edge boundaries for object detection," Digital Image Computing 2008, pp. 199-206.
[20] V. Ferrari, L. Fevrier, F. Jurie, and C. Schmid, "Groups of adjacent contour segments for object detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 36-51, 2007.


[21] A. Opelt, A. Pinz, and A. Zisserman, "A boundary-fragment-model for object detection," Proc. ECCV 2006, pp. 575-588, 2006.
[22] G. Bouchard and B. Triggs, "Hierarchical part-based visual object categorization," Proc. CVPR 2005.
[23] L. L. Zhu, Y. Chen, A. Yuille, and W. Freeman, "Latent hierarchical structural learning for object detection," Proc. CVPR 2010.
[24] C. Gu, J. J. Lim, and P. Arbeláez, "Recognition using regions," Proc. ICCV 2009.
[25] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," Proc. CVPR 2005.
[26] S. Agarwal and D. Roth, "Learning a sparse representation for object detection," Proc. ECCV 2002, pp. 113-130, Springer, May 2002.
[27] D. H. Ballard, "Generalizing the Hough transform to detect arbitrary shapes," Pattern Recognition, vol. 13, pp. 111-122, 1981.
[28] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results," http://www.pascalnetwork.org/challenges/VOC/voc2007/workshop/index.html.
[29] M. Leordeanu, M. Hebert, and R. Sukthankar, "Beyond local appearance: Category recognition from pairwise interactions of simple features," Proc. CVPR 2007.
[30] G. W. Wang and C. B. Chen, "Using non-symmetry and anti-packing representation model for object detection," Proc. CiSE 2010.
[31] V. Ferrari, L. Fevrier, and C. Schmid, "Accurate object detection with deformable shape models learnt from images," Proc. CVPR 2007.
[32] V. Ferrari, T. Tuytelaars, and L. Van Gool, "Object detection by contour segment networks," Proc. ECCV 2006.
