

Real-Time Imaging 8, 427–437 (2002)
doi:10.1006/rtim.2002.0298, available online at http://www.idealibrary.com

Adaptive Resolution System for Distributed Surveillance

We present a real-time foveation system for remote and distributed surveillance. The system performs face detection, tracking, selective encoding of the face and background, and efficient data transmission. A Java-based client–server architecture connects a radial network of camera servers to their central processing unit. Each camera server includes a detection and tracking module that signals the human presence within an observation area and provides the 2D face coordinates and its estimated scale to the video transmission module. The captured video data are then efficiently represented in log-polar coordinates, with the foveation point centered on the face, and sent to the connecting client modules for further processing. The current set-up of the system employs active cameras that track the detected person, by switching between smooth pursuit and saccadic movements, as a function of the target presence in the fovea region. High reconstruction resolution in the fovea region enables the successive application of recognition/verification modules on the transmitted video without sacrificing their performance. The system modules are well suited for implementation on the next generation of Java-based intelligent cameras.

© 2002 Elsevier Science Ltd. All rights reserved.

Dorin Comaniciu¹, Fabio Berton² and Visvanathan Ramesh¹

¹Real-Time Vision and Modeling Department, Siemens Corporate Research, 755 College Road East, Princeton, NJ 08540, USA. E-mail: [email protected]

²Aitek SRL, Via della Crocetta 15, Genova, Italy

Introduction

Autonomous processing of visual information for the efficient description and transmission of events and data of interest represents the new challenge for next generation video surveillance systems [1, 2]. The advances in the new generation of intelligent cameras having local processing capabilities, either supporting Java applications or based on DSP chips, will make possible the customization of the devices for a specific video understanding and summarization task. As a result, important bandwidth reduction can be achieved, combined with a decrease in the operational cost [3]. The low bit rate allows wireless transmission of the data of interest to a central processing unit, for further processing, and/or retransmission. This scenario is not limited to video surveillance and is appropriate for industrial monitoring or video conference applications [4, 5].

This paper presents a system for real-time distributed surveillance that follows the guidelines from above. Starting with our previous work on object tracking [6, 7], we have built a modular client–server architecture


Figure 1. Intelligent processing and summarization of video in a radial network of cameras. Wireless connections can be employed for the transmission of data of interest to the central processing unit.


that supports the intelligent transmission of the visual information from a radial network of (active) cameras to the central unit (see Figure 1).

Each camera module functions as an image server that reports to the client (the central processing unit) whenever a human is present in its field of view. For the transmission of the visual information associated with this event, the data are filtered and sampled in log-polar coordinates, with the foveation point centered on the subject's face. The client performs the inverse mapping to derive an approximated replica of the original data.

A flexible log-polar representation of the background has been designed to deal in real-time with scale and location changes of the face of interest, while maintaining the required bit rate. The main idea is to preserve the high resolution of the face and trade off the quality of the background for the efficiency of the representation. Our system performs lossless transmission of the fovea region; hence, it allows the application of recognition/verification processes at the central unit. As a novelty, the sampling grid employed to compute the log-polar representation has locally a hexagonal lattice structure, known to be 13.4% more efficient than rectangular lattices in the case of circularly band-limited signals [8].

The paper is organized as follows. The following section presents a short review of color-based face detection and tracking techniques and discusses foveation processing. The subsequent section summarizes the mean shift-guided module performing the detection and tracking of faces. The flexible log-polar mapping is presented next. Another section describes the dual-mode active camera controller. Finally, the system performance is analyzed.

Background

This section provides first a brief survey on face detection and tracking using color. For a more detailed presentation, see [9, 10]. Color-based methods are classified as feature-invariant approaches, since they aim at locating faces even when the illumination, viewpoint, or pose vary.

In [11], the face color is represented in the normalized RG space [12] and face color samples are obtained by analyzing faces of different human races. A face color model that is built with a Gaussian assumption is back-projected in the current frame for face localization. Additional geometric and motion constraints are imposed. A different approach is taken in [13], where robustness to occlusions is achieved via an explicit hypothesis-tree model of the occlusion process. Histograms are employed in [14] for a better characterization of the face colors. The search for the face is based on histogram intersection. In addition to the color model, a gradient-based module enforces the elliptical shape of the detected face. The technique in [15] derives a skin-color reference map in the YCrCb color space and binarizes the input frame according to the model. Subsequent stages perform morphological operations and contour extraction for the localization of faces. The perceptually uniform CIE Lab color space is employed in [16] to build chroma charts that are used afterwards to transform the input into a skin-likelihood image. The peaks in the skin-likelihood image represent probable face locations. Adaptive Gaussian mixtures are used in [17] to model the face colors based on the change in appearance. The hue, saturation, intensity (HSI) representation was employed and color distributions were modeled in the hue–saturation space.

In the second part of this section we discuss the problem of foveation processing. The foveation processing of video is mainly inspired by the primate visual system that reduces the enormous amount of available data through a non-uniformly sampled retina [18]. The density of photoreceptors and ganglion cells varies in the retina of the human eye, being larger at the point of the fovea and dropping away from that point as a function of eccentricity. There has been a growing


interest in recent years in video processing techniques that resemble the primate visual system [19, 20]. These techniques divide the field of view into a region of maximal resolution (fovea) and a region whose resolution decreases towards the extremities (periphery) [21]. The motivation is that many applications only require high-resolution representation for some parts of the video data. Generally, the fovea covers the object of interest and the background is represented by the periphery. This framework extends naturally to multiple objects of interest. In a surveillance scenario, for example, one is interested in representing with high resolution the human faces for subsequent face recognition/verification procedures.

A very common implementation of the foveation is through log-polar sampling [22]. The fovea is typically uniformly sampled, while the periphery is sampled according to a log-polar grid. Space-variant sensors that implement in CCD technology the mapping between the Cartesian and log-polar plane are presented in [23, 24]. The use of a specific lens system to obtain non-uniform resolution is discussed in [25]. However, a major drawback of these approaches is that the parameters of the data reduction are fixed, depending on the physical design. A more flexible solution, to which we subscribe, is the software implementation of the log-polar mapping [26]. The performance through the use of additional hardware is discussed in [27].

The definition of anti-aliasing and interpolation filters is a difficult problem for non-uniform sampling. In the simpler case of polar sampling, it can be shown [28, 29, 30] that a real function f(r, θ) whose Fourier transform is of compact support can be reconstructed from its equally spaced samples in the θ direction and at the normalized zeros of Bessel functions of the first kind, in the r direction. Efficient approximations for reconstruction are presented in [31]. However, the Fourier transform formulation is not valid when the mapping is space variant, such as the log-polar mapping. Although there exist fast methods for reconstruction from non-uniform samples [32], their efficiency is still not appropriate for real-time applications. The anti-aliasing filtering in the case of log-polar sampling is typically based on position-dependent Gaussian filters with a disk-shaped support region with exponentially growing radius size [27]. Note that a frequency-domain representation can be constructed by applying the log-polar mapping directly to the Fourier integral. The result is the exponential chirp transform, shown to preserve the shift-invariant properties of the usual Fourier transform [33].

Face Detection and Tracking

This section presents our module performing the detection and tracking of faces. The main idea is to combine viewpoint-invariant models of the face with efficient gradient-based optimization. For more details please see [6].

Modeling and optimization framework

The color model of the face is obtained by computing the mean histogram of face instances recorded in the morning, afternoon, and at night. The mean histogram is represented in the intensity-normalized RG space with 128 × 128 bins. As a dissimilarity measure between the face model and the face candidates, a metric based on the Bhattacharyya coefficient [34] is employed. Hence, the problem of face localization is reduced to a metric minimization, or equivalently to the maximization of the Bhattacharyya coefficient between two color distributions. By including spatial information into the color histograms, we showed that the maximization of the Bhattacharyya coefficient is equivalent to maximizing a density estimate. As a consequence, the gradient ascent mean shift procedure [35] can be employed to guide a fast search for the best face candidate in the neighborhood of a given image location.
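For concreteness, the Bhattacharyya coefficient between two discrete distributions p and q is ρ(p, q) = Σu √(pu·qu), and the associated metric is d = √(1 − ρ). A minimal Java sketch (ours, not the paper's code), assuming the histograms are already normalized to sum to one:

```java
/** Minimal sketch of the Bhattacharyya coefficient between two
 *  normalized color histograms (the bin layout is illustrative only). */
public class BhattacharyyaDemo {

    /** rho(p, q) = sum_u sqrt(p_u * q_u); both histograms must sum to 1. */
    public static double coefficient(double[] p, double[] q) {
        double rho = 0.0;
        for (int u = 0; u < p.length; u++) {
            rho += Math.sqrt(p[u] * q[u]);
        }
        return rho;                       // 1.0 for identical distributions
    }

    /** Distance derived from the coefficient, as used for localization.
     *  The clamp guards against tiny negative values from rounding. */
    public static double distance(double[] p, double[] q) {
        return Math.sqrt(Math.max(0.0, 1.0 - coefficient(p, q)));
    }

    public static void main(String[] args) {
        double[] model     = {0.5, 0.3, 0.2};   // toy face-model histogram
        double[] candidate = {0.5, 0.3, 0.2};   // identical candidate
        System.out.println(coefficient(model, candidate)); // ~1.0
        System.out.println(distance(model, candidate));    // ~0.0
    }
}
```

Maximizing the coefficient over candidate locations is then equivalent to minimizing this distance, which is the form the mean shift search exploits.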

The resulting optimization achieves convergence in only a few iterations, thus being well suited for the task of real-time detection and tracking. To adapt to the scale changes of the target, the scale invariance property of the Bhattacharyya coefficient is exploited, as well as the gradient information on the border of the hypothesized face region. Mean shift processes are run at different scales and the scale corresponding to the best response is chosen.
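One relocation step of the mean shift search can be sketched as below. This is our simplified illustration, not the authors' implementation: it keeps only the Bhattacharyya-derived pixel weights wi = √(qu/pu) and the weighted-centroid update, omitting the kernel profile and the pixel-to-bin computation of the full algorithm in [6, 35]:

```java
/** Schematic single mean-shift step: pixels inside the current window are
 *  weighted by sqrt(q_u / p_u), where u is the pixel's histogram bin, and
 *  the window center moves to the weighted centroid (simplified sketch). */
public class MeanShiftStep {

    /** One relocation step. xs, ys: pixel coordinates inside the window;
     *  bins: histogram bin index of each pixel; q: target (model) histogram;
     *  p: candidate histogram at the current location. Returns {newX, newY}. */
    public static double[] step(double[] xs, double[] ys, int[] bins,
                                double[] q, double[] p) {
        double sumW = 0.0, sumX = 0.0, sumY = 0.0;
        for (int i = 0; i < xs.length; i++) {
            int u = bins[i];
            // Pixels whose color is under-represented in the candidate
            // (p_u small relative to q_u) pull the window toward them.
            double w = (p[u] > 0.0) ? Math.sqrt(q[u] / p[u]) : 0.0;
            sumW += w;
            sumX += w * xs[i];
            sumY += w * ys[i];
        }
        return new double[] { sumX / sumW, sumY / sumW };
    }
}
```

Iterating this step until the displacement falls below a threshold climbs the Bhattacharyya surface to the nearest peak, which is why only a few iterations are needed per frame.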

Detection

The detection process involves the mean shift optimization with multiple initializations, each one in a different location of the current image frame. For the current settings of the system (320 × 240 pixel images with subjects at a distance between 3 and 30 m from the camera) we use five initial regions of elliptical shape with semi-axes equal to (37, 51), as shown in Figure 2. This arrangement guarantees that at least one initial ellipse is in the basin of attraction of a face of typical size.

Figure 3 presents the surface obtained by computing the Bhattacharyya coefficient for the entire image from

Figure 2. Three-people image and initialization ellipses used for face detection.


Figure 2. One can easily identify the three peaks of the surface, one for each face in the image. The advantage of our method should be obvious from Figure 3. While most of the tracking approaches based on regions must perform an exhaustive search of the image (or a given neighborhood) to find the maximum value of the similarity measure, our algorithm exploits the gradient of the surface to climb to the closest peak. With proper multiple initializations, the highest peak is found very quickly.

Tracking

The tracking process involves only optimizations in the neighborhood of the previous face location estimate, and is therefore sufficiently fast to run at the frame rate


Figure 3. Values of the Bhattacharyya coefficient corresponding to the image shown in Figure 2. One can identify the three peaks of the surface, one for each face in the image.

(30 frames/s). The module that implements the log-polar mapping receives from the tracking module two vectors for each frame, representing the estimated position and scale of the currently observed face.

A set of experiments is presented in Figure 4, demonstrating face detection and tracking on a sequence of 1059 frames. The subject's face has been detected within a few frames after entering the field of view of the camera (frame 42). The detected face is shown in the small upper-left window. The camera is then tracking the face during walking (frames 69 and 147) and turns (frames 99 and 177). The subject tries to escape the tracker by performing fast lateral movements (frame 267) or hand waving (frame 309). Observe the blurring that accompanies these movements, without affecting the tracker. Next, the subject tries to hide behind a chair (frames 552 and 582), but only when the head is completely occluded does the tracker fail (frame 654). However, once the occlusion is terminated, the face is immediately recovered. Finally, one can see the scale adaptation working when the subject approaches the camera.

A Flexible Log-Polar Mapping

The design of a flexible log-polar mapping that satisfies certain geometric constraints is presented in this section. Let us denote by (x, y) the Cartesian coordinates, with the origin assumed to be in the center of the image, and by (r, θ) the polar coordinates. The transformation between Cartesian and polar grid is then defined by the pair

r = √(x² + y²),  θ = arctan(y/x)        (1)

and

x = r cos θ,  y = r sin θ        (2)

The log-polar mapping is defined by the transformation

rξ = A·λ^ξ − B,  θ = θ0·φ        (3)

where (ξ, φ) are positive integers representing the log-polar coordinates, λ > 1 is the base of the transformation, and A, B, and θ0 are constants that will be derived from geometric constraints. The coordinate ξ represents the index of rings whose radius increases exponentially, while φ denotes equidistant radii starting from the origin.

Figure 4. Dorin sequence.

Figure 5. The first geometric constraint imposes that the radius difference between the rings of indexes 1 and 0 is equal to 1.

Geometric constraint for rings

Let the index of the smallest ring be ξ = 0 and the index of the largest ring be ξ = M, and r0 and rM their radii, respectively. There are M rings that cover the periphery (see Figure 5). The ring of index 0 is the border between the fovea and periphery.

We assume in the sequel that M, r0, and rM are known and will show how to compute the parameters A, B, and λ of transformation (3). Since the fovea is represented with full resolution, a natural geometric constraint is that the radius difference between the rings of indexes 1 and 0 is equal to 1. Using the above constraint, it can be shown (see Appendix I) that λ is the real root of the polynomial of order M−1:

g(λ) = λ^(M−1) + λ^(M−2) + … + 1 − (rM − r0)        (4)

When M < rM − r0, the polynomial g(λ) has a root larger than 1. Using the same constraint we obtain

A = 1/(λ − 1)        (5)

and

B = 1/(λ − 1) − r0        (6)


Introducing now Eqns (5) and (6) in Eqn (3), we have

rξ = r0 + (λ^ξ − 1)/(λ − 1)        (7)
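To make Eqns (4)–(7) concrete, λ can be obtained numerically as the root larger than 1 of g(λ). The following Java sketch (ours, not the paper's implementation) uses plain bisection and checks the construction on the example values used later in the text (M = 30, r0 = 9, rM = 100):

```java
/** Numerical sketch: solve g(lambda) = lambda^(M-1) + ... + 1 - (rM - r0) = 0
 *  by bisection, then evaluate the ring radii of Eqn (7). */
public class LogPolarBase {

    /** g(lambda) from Eqn (4). */
    static double g(double lambda, int M, double r0, double rM) {
        double sum = 0.0;
        for (int k = 0; k <= M - 1; k++) sum += Math.pow(lambda, k);
        return sum - (rM - r0);
    }

    /** Root of g larger than 1 (exists when M < rM - r0, since g(1) < 0). */
    public static double solveLambda(int M, double r0, double rM) {
        double lo = 1.0, hi = 2.0;
        for (int it = 0; it < 200; it++) {
            double mid = 0.5 * (lo + hi);
            if (g(mid, M, r0, rM) < 0.0) lo = mid; else hi = mid;
        }
        return 0.5 * (lo + hi);
    }

    /** Ring radius r_xi = r0 + (lambda^xi - 1)/(lambda - 1), Eqn (7). */
    public static double ringRadius(int xi, double lambda, double r0) {
        return r0 + (Math.pow(lambda, xi) - 1.0) / (lambda - 1.0);
    }

    public static void main(String[] args) {
        double lambda = solveLambda(30, 9.0, 100.0);
        System.out.println(lambda);                       // ~1.068
        System.out.println(ringRadius(30, lambda, 9.0));  // ~100 = rM, by construction
    }
}
```

By construction, ringRadius(0) = r0, the spacing between rings 0 and 1 is exactly 1 (the unit constraint), and ringRadius(M) recovers rM.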

Geometric constraint for radii

The parameter θ0 is derived such that the resulting sampling grid (defined by the centers of the log-polar pixels) has locally a hexagonal lattice structure. This modified log-polar mapping has been suggested in [30], but (to our knowledge) has not yet been exploited. The processing of hexagonal sequences is discussed in [8].

Our construction assumes the rotation of each other ring by θ0/2. To obtain a structure that is locally hexagonal, the aspect ratio of the log-polar pixels should be √3/2, that is, the centers of three adjacent pixels should form an equilateral triangle. Although the aspect ratio changes with the index of the ring, for typical parameter values the changes are negligible. By enforcing the hexagonal constraint for the pixels at the half of the periphery region (ξh = M/2), it can be shown after some trigonometric manipulations that

θ0 = 2 arctan[ (2/√3) · (λ^ξh − λ^(ξh−1)) / (λ^ξh + λ^(ξh−1) − 2 + 2r0(λ − 1)) ]        (8)

Mapping design

According to the formulas derived in the last two subsections, the log-polar grid is completely specified if the number of rings M is given, together with the radius of the fovea region r0, and the maximum radius of the periphery rM. As an example, Figure 6 presents the log-polar pixels


Figure 6. (a) Log-polar pixels for an image of 160 × 120 pixels. (b) The corresponding sampling grid exhibits a local hexagonal structure. See text for the numerical value of the parameters.

and the sampling grid corresponding to an image of 160 × 120 pixels. We imposed a number of M = 30 rings, a fovea radius r0 = 9, and a maximum periphery radius rM = 100, equal to half of the image diagonal. Solving for the real root of polynomial (4) it results that λ = 1.068 and using (8) we have θ0 = 0.0885 rad.

The number of log-polar pixels is b = M·⌈2π/θ0⌉ = 30 × 71 = 2130 pixels, which is 9 times smaller than the number of Cartesian pixels (the notation ⌈·⌉ denotes the ceiling function). When the foveation point does not coincide with the image center, the number of log-polar pixels decreases.
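The worked example can be checked numerically. The sketch below (ours, not the authors' code) solves Eqn (4) by bisection and evaluates Eqn (8) at ξh = M/2; it gives θ0 ≈ 0.089 rad, in good agreement with the value quoted above, and reproduces the pixel count b = 2130:

```java
/** Sketch checking the worked example: theta0 from Eqn (8) and the
 *  log-polar pixel count b = M * ceil(2*pi/theta0). */
public class LogPolarGrid {

    /** Root > 1 of Eqn (4) via bisection (exists when M < rM - r0). */
    static double solveLambda(int M, double r0, double rM) {
        double lo = 1.0, hi = 2.0;
        for (int it = 0; it < 200; it++) {
            double mid = 0.5 * (lo + hi), sum = 0.0;
            for (int k = 0; k < M; k++) sum += Math.pow(mid, k);
            if (sum < rM - r0) lo = mid; else hi = mid;
        }
        return 0.5 * (lo + hi);
    }

    /** Eqn (8), evaluated at the half-periphery ring index xiH = M/2. */
    static double theta0(int M, double r0, double lambda) {
        int xiH = M / 2;
        double num = Math.pow(lambda, xiH) - Math.pow(lambda, xiH - 1);
        double den = Math.pow(lambda, xiH) + Math.pow(lambda, xiH - 1)
                   - 2.0 + 2.0 * r0 * (lambda - 1.0);
        return 2.0 * Math.atan((2.0 / Math.sqrt(3.0)) * num / den);
    }

    public static void main(String[] args) {
        int M = 30; double r0 = 9.0, rM = 100.0;
        double lambda = solveLambda(M, r0, rM);
        double t0 = theta0(M, r0, lambda);
        int b = M * (int) Math.ceil(2.0 * Math.PI / t0);
        System.out.println(lambda);  // ~1.068
        System.out.println(t0);      // ~0.089 rad
        System.out.println(b);       // 2130 = 30 * 71
    }
}
```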

Imposing the transmission bandwidth

Assume in the sequel a communication scenario with fixed bandwidth. The main idea of our system is to transmit with full resolution the detected face, while adapting the log-polar mapping for the remaining bandwidth of the communication channel. The error-free transmission of the face is required in the case when a recognition module is employed at the receiver site.

Since the dependence between the number of log-polar pixels and the input parameters of the mapping is complex, we employ an LUT of size 100 kB to determine the mapping parameters for a given bandwidth and fovea size.

Implementation details

The input image of size 320 × 240 is first converted to the YCbCr color format, the color planes being subsampled to 160 × 120 pixels. Based on the location and scale information received from the tracking



module, the fovea regions of both the luminance and color planes are transmitted separately.

If the bandwidth of the communication channel is known in advance, the log-polar mapping has only one variable parameter, the radius of the fovea region (the number of rings being constrained by the bandwidth). By coarsely quantizing the range of values of the fovea radius into 10 values, for each radius value we built a set of LUTs for anti-aliasing filtering and sampling. The pre-sampling filtering employs Gaussian filters whose kernel width is position dependent. More specifically, the width of each kernel is proportional to the local distance between two consecutive rings at the location of the kernel. The Gaussian kernels are pre-computed in small LUTs. The relative position of the sampling points for the ten fovea radii is memorized in an LUT of 500 kB for the luminance data and an LUT of 125 kB for the color data. Bilinear interpolation is used for the recovery of uniformly sampled data. The interpolation for each Cartesian location is based on its bounding triangle (Figure 6(b)).
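The position-dependent filtering can be illustrated as follows. This is our sketch, not the authors' code: the proportionality constant c and the kernel truncation are our own assumptions (the paper pre-computes the kernels in LUTs rather than evaluating them on the fly):

```java
/** Sketch of position-dependent anti-aliasing: a Gaussian kernel per ring,
 *  whose width grows with the local ring spacing of the log-polar grid. */
public class FoveationFilter {

    /** Ring radius from Eqn (7). */
    static double ringRadius(int xi, double lambda, double r0) {
        return r0 + (Math.pow(lambda, xi) - 1.0) / (lambda - 1.0);
    }

    /** Kernel width for ring xi: proportional (constant c is our guess)
     *  to the local distance between two consecutive rings. */
    public static double sigma(int xi, double lambda, double r0, double c) {
        return c * (ringRadius(xi + 1, lambda, r0) - ringRadius(xi, lambda, r0));
    }

    /** Normalized 1-D Gaussian taps of the given half-width (illustrative;
     *  the paper uses disk-shaped 2-D supports). */
    public static double[] kernel(double sigma, int halfWidth) {
        double[] k = new double[2 * halfWidth + 1];
        double sum = 0.0;
        for (int i = -halfWidth; i <= halfWidth; i++) {
            k[i + halfWidth] = Math.exp(-0.5 * i * i / (sigma * sigma));
            sum += k[i + halfWidth];
        }
        for (int i = 0; i < k.length; i++) k[i] /= sum;  // normalize to sum 1
        return k;
    }
}
```

Because the ring spacing grows exponentially with ξ, the smoothing automatically becomes stronger toward the periphery, which is exactly the behavior the adaptive representation needs.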

The resulting system maintains an approximately constant transmission bandwidth, while allowing the full-resolution transmission of the detected face, independent of the scale. The penalty is paid by the periphery (background) whose quality decreases when the face scale increases.

Camera Control

In the case when active cameras are used, the adequate control of the pan, tilt, and zoom is an important phase of the tracking process. The camera should execute fast saccades in response to sudden and large movements of the target while providing a smooth pursuit when the target is quasi-stationary [36, 37]. We implemented this type of control, which resembles that of the human visual system. The fovea subimage occupies laterally about 6° of the camera's 50° field of view, at zero zoom.

However, contrary to other tracking systems that suspend the processing of visual information during the saccadic movements [38], our visual face tracker is sufficiently robust to deal with the large amount of blurring resulting from camera motion. As a result, the visual tracking is a continuous process that is not interrupted by the servo commands. A standard RS–232C interface is used to communicate with the Sony EVI-D30 camera.
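A minimal sketch of this dual-mode control (the fovea threshold and pursuit gain below are illustrative values of ours, not parameters from the paper): a saccade when the target leaves the fovea region, and a damped proportional correction while it stays inside.

```java
/** Sketch of dual-mode camera control: saccade when the target is outside
 *  the fovea region, smooth pursuit while it remains inside. Thresholds
 *  and gains are illustrative assumptions. */
public class CameraController {

    public enum Mode { SMOOTH_PURSUIT, SACCADE }

    private final double foveaRadius;   // pixels from the image center
    private final double pursuitGain;   // fraction of the error per frame

    public CameraController(double foveaRadius, double pursuitGain) {
        this.foveaRadius = foveaRadius;
        this.pursuitGain = pursuitGain;
    }

    /** Choose the mode from the target's offset (dx, dy) w.r.t. the center. */
    public Mode mode(double dx, double dy) {
        return Math.hypot(dx, dy) > foveaRadius ? Mode.SACCADE
                                                : Mode.SMOOTH_PURSUIT;
    }

    /** Pan/tilt correction for this frame: the full error for a saccade,
     *  a damped fraction of it for smooth pursuit. */
    public double[] correction(double dx, double dy) {
        double k = (mode(dx, dy) == Mode.SACCADE) ? 1.0 : pursuitGain;
        return new double[] { k * dx, k * dy };
    }
}
```

Because the tracker keeps running during saccades, a controller of this form needs no special re-acquisition logic after a jump: the next frame's face estimate simply drives the next correction.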

System Performance

We discuss in this section a set of results obtained by running the system in an indoor environment. For all considered experiments, we impose an overall compression ratio of 16.

Reconstruction quality

Figure 7(a) presents the user interface of a camera server, showing the detected face in the upper-left side of the image. The entire server package, performing detection, tracking, camera control, and adaptive log-polar representation, is well suited for implementation on the next generation of intelligent cameras. Figure 7(b) shows the reconstructed image and user interface of the primary client, which is able to control the camera from a distance. Note the preservation of the details on the face region.

Another pair of original and reconstructed images is presented in Figure 8(a) and (b), respectively. Again, the face details in the reconstructed image are identical to those in the original.

The left part of Figure 9 shows four frames of a sequence of about 450 frames containing the detection and tracking with an active camera of a subject that meets another person. The same frames, reconstructed by the client application, are presented in the right part of Figure 9. For visualization purposes, both sequences were captured in real-time, simultaneously, on the same computer. Since the entire data transmission chain was operating, one can observe a slight time delay between the original and received frames. The transmission mechanism was implemented through serializable Java classes. An image server and a client performing all processing stages (detection, tracking, efficient representation, transmission, reconstruction, and visualization) operate at frame rate on a single computer.

Computational load

We measured the computational load for running the system on a 900 MHz PC. The face tracking at 30 frames/s involves only 10–20% of the CPU. The filtering and log-polar sampling needs an additional 40%.

Figure 7. User interface of the (a) image server and (b) primary client.


The reconstruction filters at the receiver require about 20% of computational power. The Java software can be further optimized by creating C++ DLLs for its most critical parts.

Figure 8. (a) Original and (b) received images.

Conclusions

This paper presented a flexible framework towards intelligent processing of visual data for low bandwidth

Figure 9. Alessio sequence. Left: original data. Right: received data. The frames 30, 240, 390, and 420 are shown.


communication, with direct application in distributed surveillance using a radial network of cameras. We showed that the region of interest (target) from the current image frame can be detected and transmitted in real-time with high resolution. To compensate for bit rate variations due to the changes in the target scale, the representation quality of the background is modified through an adaptive log-polar mapping.

We envision further developments of the current system. The extension of the detection module for multiple targets, for example, will also allow the transmission of the face of the second person from Figure 9, frame 240. At the same time, the two streams of information (faces and log-polar pixels corresponding to the background) can be further compressed in the MPEG-4 [39] framework, resulting in additional bandwidth reduction. The overall complexity would be decreased in comparison with direct MPEG-4 compression since the time-consuming motion estimation part of MPEG-4 will be applied only to the reduced log-polar frames. Ongoing research focuses on the implementation of this scenario on smart cameras.

Acknowledgement

We would like to thank our colleagues who helped us with the experiments. Some parts of the material are presented in [40].

References

1. Aggarwal, J.K. & Cai, Q. (1999) Human motion analysis: a review. Computer Vision and Image Understanding, 73: 428–440.

2. Collins, R.T., Lipton, A.J. & Kanade, T. (1999) A system for video surveillance and monitoring. American Nuclear Society Eighth International Meeting on Robotics and Remote Systems.

3. Geisler, W.S. & Perry, J.S. (1998) A real-time foveated multiresolution system for low-bandwidth video communication. In: Rogowitz, B. & Pappas, T. (eds), Human Vision and Electronic Imaging. SPIE 3299: 294–305.

4. Crowley, J.L. & Berard, F. (1997) Multi-modal tracking of faces for video communications. IEEE Conference on Computer Vision and Pattern Recognition, Puerto Rico, pp. 640–645.

5. Eleftheriadis, A. & Jacquin, A. (1995) Automatic face location detection and tracking for model-assisted coding of video teleconference sequences at low bit rates. Signal Processing – Image Communication, 7: 231–248.

6. Comaniciu, D., Ramesh, V. & Meer, P. (2000) Real-time tracking of non-rigid objects using mean shift. IEEE Conference on Computer Vision and Pattern Recognition, Hilton Head Island, SC, Vol. 2: pp. 142–149.

7. Greiffenhagen, M., Ramesh, V., Comaniciu, D. & Niemann, H. (2000) Statistical modeling and performance characterization of a real-time dual camera surveillance system. IEEE Conference on Computer Vision and Pattern Recognition, Hilton Head Island, South Carolina, Vol. 2: pp. 335–342.

8. Woodward, M. & Muir, F. (1984) Hexagonal sampling. Stanford Exploration Project, SEP-38: 183–194.

9. Hjelmas, E. & Low, B.K. (2001) Face detection: a survey. Computer Vision and Image Understanding 83: 236–274.

10. Yang, M.H., Kriegman, D.J. & Ahuja, N. (2002) Detecting faces in images: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 24: 34–58.

11. Yang, J. & Waibel, A. (1996) A real-time face tracker. IEEE Workshop on Applications of Computer Vision, Sarasota, pp. 142–147.

12. Wyszecki, G. & Stiles, W.S. (1982) Color Science: Concepts and Methods, Quantitative Data and Formulae (2nd edn). New York: Wiley.

13. Fieguth, P. & Terzopoulos, D. (1997) Color-based tracking of heads and other mobile objects at video frame rates. IEEE Conference on Computer Vision and Pattern Recognition, Puerto Rico, pp. 21–27.

14. Birchfield, S. (1998) Elliptical head tracking using intensity gradients and color histograms. IEEE Conference on Computer Vision and Pattern Recognition, Santa Barbara, pp. 232–237.

15. Chai, D. & Ngan, K.N. (1999) Face segmentation using skin color map in videophone applications. IEEE Transactions on Circuits and Systems for Video Technology 9: 551–564.

16. Cai, J. & Goshtasby, A. (1999) Detecting human faces in color images. Image and Vision Computing 18: 63–75.

17. McKenna, S.J., Raja, Y. & Gong, S. (1999) Tracking colour objects using adaptive mixture models. Image and Vision Computing 17: 223–229.

18. Wandell, B.A. (1995) Foundations of Vision. Sunderland, MA: Sinauer Associates, Inc.

19. Wallace, R.S., Ong, P.W., Bederson, B.B. & Schwartz, E.L. (1994) Space variant image processing. International Journal of Computer Vision 13: 71–90.

20. Sela, G. & Levine, M.D. (1997) Real-time attention for robotic vision. Real-Time Imaging 3: 223–194.

21. Wang, Z. & Bovik, A.C. (2001) Embedded foveation image coding. IEEE Transactions on Image Processing 10: 1397–1410.

22. Jurie, F. (1999) A new log-polar mapping for space variant imaging. Application to face detection and tracking. Pattern Recognition, 32: 865–875.

23. Ferrari, F., Nielsen, J., Questa, P. & Sandini, G. (1995) Space variant sensing for personal communication and remote monitoring. EU-HCM Smart Workshop, Lisbon, Portugal.

24. Sandini, G. et al. (1996) Image-based personal communication using an innovative space-variant CMOS sensor. ROMAN'96, Tsukuba, Japan.

25. Shin, C.W. & Inokushi, S. (1994) A new retina-like visual sensor performing the polar transform. IAPR Workshop on Machine Vision Applications, Kawasaki, Japan, pp. 52–56.

26. Rojer, A.S. & Schwartz, E.L. (1990) Design considerations for a space-variant visual sensor with complex-logarithmic geometry. International Conference on Pattern Recognition, Atlantic City, NJ, pp. 278–285.

27. Bolduc, M. & Levine, M.D. (1997) A real-time foveated sensor with overlapping receptive fields. Real-Time Imaging 3: 195–212.

28. Jerri, A.J. (1977) The Shannon sampling theorem – its various extensions and applications: a tutorial review. Proceedings of the IEEE 65: 1565–1596.

29. Stark, H. (1979) Sampling theorems in polar coordinates. Journal of the Optical Society of America 69: 1519–1525.

30. Lewitt, R.M. (1983) Reconstruction algorithms: transform methods. Proceedings of the IEEE 71: 390–408.

31. Derrode, S. & Ghorbel, F. (2001) Robust and efficient Fourier–Mellin transform approximations for gray-level image reconstruction and complete invariant description. Computer Vision and Image Understanding 83: 57–78.

32. Feichtinger, H.G., Grochenig, K. & Strohmer, T. (1995) Efficient numerical methods in non-uniform sampling theory. Numerische Mathematik 69: 423–440.

33. Bonmassar, G. & Schwartz, E.L. (1997) Space-variant Fourier analysis: the exponential chirp transform. IEEE Transactions on Pattern Analysis and Machine Intelligence 19: 1080–1089.

34. Kailath, T. (1967) The divergence and Bhattacharyya distance measures in signal selection. IEEE Transactions on Communication Technology COM-15: 52–60.

35. Comaniciu, D. & Meer, P. (1999) Mean shift analysis and applications. IEEE International Conference on Computer Vision, Kerkyra, Greece, pp. 1197–1203.

36. Murray, D.W., Bradshaw, K.J., McLauchlan, P.F., Reid, I.D. & Sharkey, P.M. (1995) Driving saccade to pursuit using image motion. International Journal of Computer Vision 16: 204–228.

37. Rotstein, H.P. & Rivlin, E. (1996) Optimal servoing for active foveated vision. IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, pp. 227–182.

38. Batista, J., Peixoto, P. & Araujo, H. (1998) Real-time active visual surveillance by integrating peripheral motion detection with foveated tracking. IEEE Workshop on Visual Surveillance, Bombay, India, pp. 18–25.

39. Moving Picture Experts Group (2000) Overview of the MPEG-4 Standard. ISO/IEC JTC1/SC29/WG11.

40. Comaniciu, D. & Ramesh, V. (2000) Robust detection and tracking of human faces with an active camera. IEEE International Workshop on Visual Surveillance, Dublin, Ireland, pp. 11–18.

APPENDIX I

We derive below the base λ and the constants A and B of the transformation

rξ = A·λ^ξ − B        (I.1)

By imposing a unit difference between the radii of the rings ξ = 1 and 0 we obtain

A = 1/(λ − 1)        (I.2)

Using Eqn (I.2) in the smallest ring equation

r0 = A·λ⁰ − B        (I.3)

it results that

B = 1/(λ − 1) − r0        (I.4)

To obtain λ we introduce Eqns (I.2) and (I.4) in Eqn (I.1) and write the expression of the largest periphery ring

rM = (λ^M − 1)/(λ − 1) + r0        (I.5)

which is equivalent to

λ^(M−1) + λ^(M−2) + … + 1 − (rM − r0) = 0        (I.6)

This shows that λ is the real root of a polynomial of order M−1.