

Computational Visual Attention Model Capable of Exploring Similarity

Ru-Je Lin, Wei-Song Lin, Yu-Wei Huang

Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan

[email protected], [email protected], [email protected]

Abstract—A computational visual attention (CVA) model is one of the methods that focus on finding the region of interest (ROI) in an image or a scene. Similarity attention is one important task in CVA. When there are many objects in a scene, people pick out the most abnormal one, which may be either a similar or a dissimilar one depending on the composition of the scene. The capability of similarity attention enables human vision to promptly focus on similar or dissimilar regions in a scene. This paper implements this capability in a CVA model by attaching a high-level similarity comparison function to find the ROI in the scene. The output of the model simulates the serial search mode and thus more closely approximates human visual behavior. Experimental results show that the function of similarity attention can be achieved successfully.

Keywords-computational visual attention; top-down; bottom-up

I. INTRODUCTION

Visual attention has been investigated by several scientific fields. Psychologists are interested in human visual attention behavior, computer scientists concentrate on building computational visual attention models for machine vision, and cognitive neuroscientists try to find anatomical evidence to support attention theory. In computer vision, visual attention can be seen as a method of finding the region of interest (ROI), that is, of determining which region or object in a scene is more important than the others. This method is useful for reducing computational effort and saving system resources for subsequent image processing procedures. Applications have been found in surveillance systems, robot control, pattern recognition, vehicle cruise control and automatic pilot systems [1][2].

Basically, the architectures of computational visual attention (CVA) models can be divided into bottom-up and top-down approaches. Itti [3] proposed a bottom-up model which can effectively find attentive spots in a scene. The top-down approach can be roughly divided into two sub-groups. The first uses previous images or data as training samples to adjust the weightings and parameters of a model in order to fit different operating situations [4][5][6]. This is a kind of internal top-down approach because the prior information is usually obtained from the system itself. The second analyzes tasks and user preferences from outside, or stores patterns beforehand for matching, in order to perform advanced functions [7][8]. This is a kind of external top-down approach because the prior information is not provided by the system itself. Combining the top-down and bottom-up approaches may yield a CVA model that behaves much like human vision.

In recent years, advances in CVA modeling have focused on finding the ROI in continuous scenes or videos [9][10][11]. In addition, Reference [12] presented a new bottom-up model which uses a stochastic approach instead of traditional local contrast to simulate attention behavior, and Reference [13] proposed a discriminant formulation of top-down visual saliency which connects well to the recognition problem. Even so, some characteristics of human visual attention are still absent from CVA models, and similarity attention is one of the most important. It is the capability to find a distinct object in a scene filled with similar objects, or to find similar objects among a batch of various objects. Technically, similarity attention finds the ROI by examining similarity against neighbors. Fig. 1 shows some examples of similarity attention. Objects that are similar in size or shape can be found easily. Objects with similar features such as pattern or relative position also attract visual attention, due to the similarity attention mechanism.

Feature-integration theory (FIT) [14] gives a simple explanation of similarity attention. It states that some simple visual features are extracted in the cortex and scanned over large regions of an image simultaneously. This parallel search mode occurs in the human pre-attentive stage. In this stage, features are free-floating and the response time is almost fixed regardless of how many objects are in the scene. For more complex features, however, further processing in the brain is needed to perceive the scene. People then switch to a serial search mode in the attentive stage, and the response time increases with the number of objects in the scene. It is believed that this physiological phenomenon is caused by the limitations of human visual processing ability. Unfortunately, while many CVA models have implemented the parallel search mode to some extent, no existing CVA model realizes the serial search mode. A mechanism called pop-out in Itti's model can process and compare large ranges of components in parallel in low-level feature maps [3]. However, implementing the serial search mode requires more advanced processing, such as separating objects from the background.



The concept of an "object" is an important kind of perception in human vision and is discussed in many psychological studies, but very few CVA models mention or implement it. Sun [15] proposed an object-based attention model that considers relationships between different objects by adding grouping concepts. However, this method relies on manual preprocessing for segmentation and grouping, and cannot discriminate objects of different shapes. Boiman [16] proposed a method that detects irregularities in images and in video. It uses a small set of previous visual examples as training data, determines their validity, and composes a set of regular pieces to detect irregularities in new images or video. Although it is not a CVA model, it can find salient regions by comparing different parts of an image; however, its ability to solve the similarity problem is unclear.

To implement similarity attention, this paper presents a CVA model with a high-level similarity comparison function. The model realizes the serial search mode of the human attentive stage and includes a top-down mechanism which can memorize and analyze patterns for further comparison and matching. Statistical parameters and SIFT key points [17] are used to characterize high-level features of objects such as size, shape and structure. The experimental results show that the model is effective and better fits human visual behavior.

II. MODEL STRUCTURE

A. The Processing Procedures

With reference to Fig. 2, the input image goes through the feature extraction procedure, and the outcomes are divided into the color, intensity and orientation channels. The color channel consists of the R-G and B-Y color features; the orientation channel consists of the 0°, 45°, 90° and 135° maps. Each channel is investigated at three hierarchical levels, so 21 feature maps are obtained in total. The center-surround mechanism convolves the feature maps with an ON DoG filter and an OFF DoG filter. The low-level similarity comparison mechanism is then conducted to produce 42 maps. These maps are combined by averaging to attain one saliency map.

The input image also goes to the high-level similarity comparison module, which computes the relationships of objects in the image. As a top-down mechanism, some memorized images (or properties) can also be sent into the high-level similarity comparison module to bias the result of the similarity comparison. Outcomes of the high-level similarity comparison module go to the adjustment unit, which may adjust the saliency map in accordance with the similarity between objects and user preference. The saliency map is adjusted by inhibiting or exciting some regions depending on the scenario or user decision. The final outcome is the similarity saliency map.
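The bottom-up path just described can be summarized in a short sketch. The code below is only a minimal illustration, assuming the 21 feature maps are already available as 2-D arrays; the Gaussian sigmas of the DoG filters and the half-wave rectification are our own assumptions, not values given in the paper. The pop-out weighting of Eq. (1) (Section II-B) would be applied to each of the 42 responses before averaging.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_response(fmap, sigma_center=1.0, sigma_surround=4.0, on=True):
    """ON or OFF center-surround response of one feature map (illustrative sigmas)."""
    center = gaussian_filter(fmap, sigma_center)
    surround = gaussian_filter(fmap, sigma_surround)
    diff = center - surround if on else surround - center
    return np.clip(diff, 0.0, None)          # keep only the enhanced regions

def bottom_up_saliency(feature_maps):
    """Combine the 21 feature maps into one saliency map (Sec. II-A).

    Each map yields an ON and an OFF response (42 maps in total), which are
    averaged; the pop-out weighting of Eq. (1) is omitted here for brevity."""
    responses = [dog_response(f, on=flag) for f in feature_maps for flag in (True, False)]
    return np.mean(responses, axis=0)
```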

B. The Low-level Similarity Comparison

The low-level similarity comparison unit uses a bottom-up mechanism called the pop-out function [3] to adjust the weighting factor of each feature map. This mechanism may depress the magnitude of a feature map by multiplying it with a smaller weighting factor. In general, a feature map with more local peaks (maxima) is depressed more, so the weighting factor depends on the number of local peaks. We use a simpler method proposed by Frintrop [5], as shown in (1):

$W(X) = X / m$  (1)

where $W(\cdot)$ denotes the weighting decision process, $X$ is the feature map, and $m$ is the number of local maxima above a certain threshold (we use 0.8) after normalization.
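A minimal sketch of the pop-out weighting in Eq. (1), assuming each map is first normalized to [0, 1] and that local maxima are detected with a small maximum filter; the neighborhood size is an assumption, since the paper does not state how local maxima are located.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def popout_weight(fmap, threshold=0.8, neighborhood=3):
    """Eq. (1): W(X) = X / m, where m is the number of local maxima of the
    normalized map that exceed `threshold`."""
    peak = fmap.max()
    norm = fmap / peak if peak > 0 else fmap
    is_peak = (norm == maximum_filter(norm, size=neighborhood)) & (norm > threshold)
    m = max(int(is_peak.sum()), 1)           # guard against division by zero
    return fmap / m
```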

In the center-surround mechanism, the pop-out function is carried out for both the ON DoG filter and the OFF DoG filter. The ON DoG filter is responsible for enhancing bright regions, whereas the OFF DoG filter enhances dark regions. Basically, the pop-out function deals with the similarity problem by exciting unusual maps and inhibiting ordinary maps in the low-level feature domain. It only adjusts values over whole feature maps rather than in specific regions. Tasks that discriminate high-level features such as shape, size and more complex structure are delegated to the high-level similarity comparison module.

Figure 1. Searching paradigms; (a) parallel search mode; (b) serial search mode.

Figure 2. The flow chart of our computational visual attention model.

III. HIGH-LEVEL SIMILARITY COMPARISON MODULE

The high-level similarity comparison module consists of five units. The size comparison unit, shape comparison unit, structure comparison unit and recognized feature comparison unit compare the similarity of the objects which are segmented by the object separation unit. The indexes used in the four comparison units are listed in Table 1.

A. Object Separation Unit

The object separation unit (OSU) segments the objects in a frame. A Canny filter is used to detect edges, and a low-pass filter is used to smooth the result.
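The paper names only the two filters, so the sketch below fills in the remaining steps (thresholds, kernel size, connected components) as assumptions in order to obtain one binary mask per object:

```python
import cv2
import numpy as np

def separate_objects(gray_img, canny_low=50, canny_high=150):
    """OSU sketch: Canny edge detection, low-pass smoothing, connected components."""
    edges = cv2.Canny(gray_img, canny_low, canny_high)
    smoothed = cv2.GaussianBlur(edges, (5, 5), 0)          # low-pass filter
    _, binary = cv2.threshold(smoothed, 0, 255, cv2.THRESH_BINARY)
    num_labels, labels = cv2.connectedComponents(binary)
    return [(labels == i).astype(np.uint8) for i in range(1, num_labels)]
```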

B. Size Comparison Unit

This unit compares the sizes of the objects. The pixel area index is the area of the object in pixels. The filled pixel area index is the area of the object after any vacancies inside it have been filled. These indexes capture the occupied size of the object. The comparison algorithm is described in Section III-F.
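A small sketch of the two size indexes, assuming each object is given as a binary mask (for example from the OSU sketch above); hole filling is one way to realize the "filled" area:

```python
import numpy as np
from scipy.ndimage import binary_fill_holes

def size_indexes(mask):
    """Return (pixel area, filled pixel area) of a binary object mask."""
    pixel_area = int(np.count_nonzero(mask))
    filled_area = int(np.count_nonzero(binary_fill_holes(mask)))
    return pixel_area, filled_area
```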

C. Shape Comparison Unit

This unit compares the shapes of the objects. Eccentricity and compactness are chosen as indexes. Each object is enclosed by the smallest enclosing ellipse, as shown in Fig. 3(a). The lengths of the major and minor axes and the distance between the foci are calculated. Eccentricity is defined as the ratio of the distance between the two foci to the major axis length. This value is between 0 and 1; a larger eccentricity means a longer and thinner object. Compactness is defined by

$\text{compactness} = \frac{1}{P}\sum_{i=1}^{P}\sqrt{(x_i - x_c)^2 + (y_i - y_c)^2}$  (2)

where $P$ is the number of pixels pertaining to the object, $(x_i, y_i)$ is the position of the i-th object pixel, and $(x_c, y_c)$ is the coordinate of the object center. Compactness indicates the degree to which the object approximates a circle. Fig. 3(b) and Fig. 3(c) show two objects that are different in shape and compactness but identical in eccentricity.
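A sketch of the two shape indexes. Here eccentricity is approximated from the second-order moments of the pixel coordinates rather than from an explicitly fitted enclosing ellipse, and compactness follows Eq. (2) as reconstructed above, so both are illustrative assumptions rather than the authors' exact procedure.

```python
import numpy as np

def shape_indexes(mask):
    """Return (eccentricity, compactness) of a binary object mask."""
    ys, xs = np.nonzero(mask)
    xc, yc = xs.mean(), ys.mean()
    # Moment-based ellipse: eccentricity from the eigenvalues of the coordinate covariance.
    cov = np.cov(np.stack([xs, ys]))
    minor_var, major_var = np.sort(np.linalg.eigvalsh(cov))
    eccentricity = float(np.sqrt(1.0 - minor_var / major_var)) if major_var > 0 else 0.0
    # Eq. (2): mean distance of the object pixels to the object center.
    compactness = float(np.mean(np.sqrt((xs - xc) ** 2 + (ys - yc) ** 2)))
    return eccentricity, compactness
```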

D. Structure Comparison Unit

This unit compares the internal intensity structures of the objects. Two statistical parameters are used as indexes: the mean intensity, obtained by averaging the intensity over the object, and the standard deviation of intensity, obtained by computing the intensity standard deviation over the object. These indexes characterize the structure of the object.
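The two structure indexes are straightforward; a short sketch assuming a grayscale image and a binary object mask:

```python
import numpy as np

def structure_indexes(gray_img, mask):
    """Return (mean intensity, standard deviation of intensity) inside the object."""
    values = gray_img[mask > 0].astype(float)
    return float(values.mean()), float(values.std())
```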

E. Recognized Feature Comparison Unit

This unit compares invariant features inside the objects. The scale-invariant feature transform (SIFT) is used to extract invariant features [17], and the SIFT key points of each object are found by the SIFT algorithm. The matched rate from object i to object j is defined by

$MR^{\mathrm{SIFT}}_{i,j} = m_{i,j} / k_i$  (3)

where $k_i$ is the number of SIFT key points in object i, and $m_{i,j}$ is the number of matched key points from object i to object j. Note that $MR_{i,j}$ and $MR_{j,i}$ may not be equal.
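A sketch of Eq. (3) using OpenCV's SIFT implementation on two grayscale object patches; the brute-force matcher and Lowe's ratio test are assumptions, since the paper does not say how key points are matched.

```python
import cv2

def sift_match_rate(patch_i, patch_j, ratio=0.75):
    """Eq. (3): MR_{i,j} = m_{i,j} / k_i, the fraction of key points of object i
    that find a match in object j."""
    sift = cv2.SIFT_create()
    kp_i, des_i = sift.detectAndCompute(patch_i, None)
    kp_j, des_j = sift.detectAndCompute(patch_j, None)
    if not kp_i or des_j is None:
        return 0.0
    matches = cv2.BFMatcher().knnMatch(des_i, des_j, k=2)
    good = [p[0] for p in matches if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    return len(good) / len(kp_i)
```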

Table 1. Indexes used in the high-level comparison module

Unit: Indexes
Size Comparison Unit: Pixel Area; Filled Pixel Area
Shape Comparison Unit: Eccentricity; Compactness
Structure Comparison Unit: Mean Intensity; Standard Deviation of Intensity
Recognized Feature Comparison Unit: SIFT Key Points

F. Similarity Index

Outcomes of the recognized feature comparison unit directly signify relationships between objects. Outcomes of the other comparison units, however, need further processing before they can signify object relationships. The associated matched rate is calculated by

$MR_{i,j,k} = \left| 1 - \frac{|f_{i,k} - f_{j,k}|}{(f_{i,k} + f_{j,k})/2} \right|$  (4)

where $f_{i,k}$ is the k-th index value of the i-th object. For example, if the fifth index values of object 1 and object 2 are 90 and 110, respectively, then the matched rate is

$MR_{1,2,5} = \left| 1 - \frac{|90 - 110|}{(90 + 110)/2} \right| = 0.8$  (5)
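A one-line sketch of Eq. (4), using the reconstruction above (absolute difference normalized by the mean of the two index values), which reproduces the worked example of Eq. (5):

```python
def index_match_rate(f_i, f_j):
    """Eq. (4): matched rate between the k-th index values of two objects."""
    return abs(1.0 - abs(f_i - f_j) / ((f_i + f_j) / 2.0))

print(index_match_rate(90, 110))   # 0.8, as in Eq. (5)
```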

Figure 3. Indexes in the shape comparison unit. (a) The red ellipse indicates the minimum ellipse enclosing the object; the red dots are the positions of the foci, and the blue lines indicate the major and minor axes, from which eccentricity can be calculated. (b) An object whose eccentricity approaches 0 and whose compactness is 18.61. (c) An object whose eccentricity approaches 0 and whose compactness is 11.09.

Similarity between object i and object j is indexed by the average of all matched rates,

$S_{i,j} = \frac{1}{n}\sum_{k=1}^{n} MR_{i,j,k}$  (6)

where $n$ is the number of related indexes. By comparing the similarity index against a threshold, two objects can be judged similar or dissimilar. For each object, the number of similar objects existing in the same scene is recorded; this count is reported to the adjustment decision unit for further computation.
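A sketch that ties Eq. (4) and Eq. (6) together and counts, for each object, how many similar objects are present in the scene; the similarity threshold of 0.8 is our assumption, since the paper does not give its value.

```python
import numpy as np

def index_match_rate(f_i, f_j):
    """Eq. (4), repeated here so the sketch is self-contained."""
    return abs(1.0 - abs(f_i - f_j) / max((f_i + f_j) / 2.0, 1e-9))

def count_similar_objects(index_vectors, threshold=0.8):
    """For each object (one index vector per object), count its similar neighbors.

    The matched rate of Eq. (4) is computed index-wise, averaged as in Eq. (6),
    and compared against the threshold."""
    counts = [0] * len(index_vectors)
    for i, fi in enumerate(index_vectors):
        for j, fj in enumerate(index_vectors):
            if i == j:
                continue
            s_ij = float(np.mean([index_match_rate(a, b) for a, b in zip(fi, fj)]))
            if s_ij > threshold:
                counts[i] += 1
    return counts
```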

IV. OBJECT-BASED VISUAL ATTENTION

The similarity attention model consists of the low-level bottom-up portion, the high-level similarity comparison module, the adjustment decision unit and the top-down mechanism.

A. Adjustment Decision Unit

Depending on the similarity index, the adjustment decision unit sends a command to inhibit or excite the corresponding object region in the saliency map. Thus, an object neighboring many similar objects can be enhanced or depressed. Since humans focus on similar or on different things depending on the situation, the choice of excitation or inhibition is selected according to user preference. In this way, similarity attention is achieved.

How much the magnitude should be adjusted is another important question. There is no definite answer because human perception of this phenomenon is difficult to measure and quantify. A simple way is to divide or multiply every pixel belonging to the object by the number of similar objects, depending on whether inhibition or excitation mode is used. It is rough but effective.
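A sketch of this simple rule, assuming one binary mask and one similar-object count per segmented object; the mode is selected by user preference as in the text.

```python
import numpy as np

def adjust_saliency(saliency, object_masks, similar_counts, mode="inhibit"):
    """Sec. IV-A: divide (inhibit) or multiply (excite) each object region of the
    saliency map by the number of its similar objects."""
    adjusted = saliency.astype(float).copy()
    for mask, count in zip(object_masks, similar_counts):
        if count == 0:
            continue
        if mode == "inhibit":
            adjusted[mask > 0] /= count
        else:
            adjusted[mask > 0] *= count
    return adjusted
```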

B. Top-Down Mechanism

The high-level similarity comparison module can extract information not only from the input image but also from memorized images. If memorized object images appear in the input image, the similarity attention can enhance or depress the regions of the similar objects in the input image. This mechanism can be seen as a pattern recognition function.
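As a sketch of this top-down path, memorized object images (or their precomputed indexes) can simply be run through the same comparison units as the input objects; the function below and its names are illustrative assumptions, not the authors' implementation.

```python
def topdown_flags(input_objects, memorized_objects, similarity, threshold=0.8):
    """Sec. IV-B: flag input objects that match a memorized pattern so that the
    adjustment unit can enhance or depress their regions.

    `similarity(a, b)` is any similarity index in [0, 1], e.g. Eq. (6)."""
    return [any(similarity(obj, mem) > threshold for mem in memorized_objects)
            for obj in input_objects]
```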

V. EXPERIMENTAL RESULTS

Several pictures containing similar objects were chosen as samples for the experiments. The results of the similarity attention model are compared with those obtained using the Saliency Toolbox [18] from iLab. In Fig. 4, the columns from left to right contain the original images, the iLab results, the saliency maps, and the similarity saliency maps. Each target differs in shape or size from the other objects in the same picture. Since the target is unique in the picture, we choose to inhibit similar objects so as to highlight the target. Fig. 4 shows that the Saliency Toolbox selects the brightest or largest object, which may not be the target. In contrast, in the similarity saliency map the target is enhanced by the similarity attention mechanism and is therefore highlighted.

Fig. 5 contains five Chinese chess pieces that are the same in shape and size but different in structure (the character on each piece). We use red circles to indicate the locations of the targets in the original images. Because every chess piece appears with equal intensity in the image, the Saliency Toolbox cannot find the target correctly. In contrast, each target is found correctly in the similarity saliency map. Evidently, the similarity attention model is effective.

VI. CONCLUSION

A similarity attention model was attained by attaching the high-level similarity comparison module to a CVA model. The similarity attention mechanism may enhance or depress image regions containing objects that are similar in size, shape or structure. This mechanism implements the serial search mode that occurs in the human attentive stage.

REFERENCES

[1] M. T. López, A. Fernández-Caballero, M. A. Fernández, J. Mira and A. E. Delgado, "Visual surveillance by dynamic visual attention method," Pattern Recognition, 2006.

[2] F. Shic and B. Scassellati, "A Behavioral Analysis of Computational Models of Visual Attention," IJCV, 2007.

[3] L. Itti, C. Koch and E. Niebur, "A Model of Saliency-Based Visual Attention for Rapid Scene Analysis," PAMI, 1998.

[4] R. Milanese, H. Wechsler, S. Gill, J. M. Bost and T. Pun, "Integration of bottom-up and top-down cues for visual attention using non-linear relaxation," CVPR, 1994

[5] S. Frintrop, "VOCUS: A visual attention system for object detection and goal-directed search," University of Bonn, Germany, 2006.

[6] F. H. Hamker, "Modeling feature-based attention as an active top-down inference process," Biosystems, 2006.

[7] V. Navalpakkam and L. Itti, "Modeling the influence of task on attention," Vision Research, 2005.

[8] D. Walther, U. Rutishauser, C. Koch and P. Perona, "Selective visual attention enables learning and recognition of multiple objects in cluttered scenes," CVIU, 2005.

[9] L. Itti and P. Baldi, "A principled approach to detecting surprising events in video," CVPR 2005.

[10] J. K. Tsotsos, Y. Liu, J. C. Martinez-Trujillo, M. Pomplun, E. Simine and K. Zhou, "Attending to visual motion," CVIU , 2005.

[11] M. T. López, M. A. Fernández, A. Fernández-Caballero, J. Mira and A. E. Delgado, "Dynamic visual attention model in image sequences," Image and Vision Computing, 2007.

[12] T. Avraham and M. Lindenbaum, "Esaliency (Extended Saliency): Meaningful Attention Using Stochastic Image Modeling," PAMI, 2010.

[13] D. Gao, S. Han and N. Vasconcelos, "Discriminant Saliency, the Detection of Suspicious Coincidences, and Applications to Visual Recognition," PAMI, 2009.

[14] A. M. Treisman and G. Gelade, "A feature-integration theory of attention," Cognitive Psychology, 1980.

[15] Y. Sun and R. Fisher, "Object-based visual attention for computer vision," Artificial Intelligence, 2003.

[16] O. Boiman and M. Irani, "Detecting Irregularities in Images and in Video," IJCV, 2007.

[17] D. G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," IJCV, 2004.

[18] D. Walther. SaliencyToolbox [Online]. Available: http://www.saliencytoolbox.net/



Figure 4. Pictures containing similar objects; left column: original images (the target is unique in shape or size); second column: iLab results; third column: saliency maps; right column: similarity saliency maps.


Figure 5. Pictures containing Chinese chess pieces; left column: original images with the target highlighted by a red circle; second column: iLab results; third column: saliency maps; right column: similarity saliency maps. The chess patterns used are shown above the figure; from left to right: general, chariot, horse, artillery and soldier.