
Accurate Pose Estimation for Forensic Identification

Gert Merckx^a, Jeroen Hermans^a and Dirk Vandermeulen^a

^a Center for Processing Speech and Images, Katholieke Universiteit Leuven, Belgium

ABSTRACT

In forensic authentication, one aims to identify the perpetrator among a series of suspects or distractors. A fundamental problem in any recognition system that aims for identification of subjects in a natural scene is the lack of constraints on viewing and imaging conditions. In forensic applications, identification proves even more challenging, since most surveillance footage is of abysmal quality. In this context, robust methods for pose estimation are paramount. In this paper we will therefore present a new pose estimation strategy for very low quality footage. Our approach uses 3D-2D registration of a textured 3D face model with the surveillance image to obtain accurate far field pose alignment. Starting from an inaccurate initial estimate, the technique uses novel similarity measures based on the monogenic signal to guide a pose optimization process. We will illustrate the descriptive strength of the introduced similarity measures by using them directly as a recognition metric. Through validation, using both real and synthetic surveillance footage, our pose estimation method is shown to be accurate, and robust to lighting changes and image degradation.

Keywords: pose estimation, surveillance, recognition, monogenic image phase, 3D-2D registration

1. INTRODUCTION

With the dramatic increase of CCTV monitoring in today's society, the problem of identifying subjects in the acquired surveillance footage is becoming increasingly important. While great advances in 2D face recognition have been made, most systems require passport style photos or extensive training to achieve acceptable results. Identification based on surveillance data is significantly more challenging.1 Wide variations in subject pose, unconstrained lighting conditions and poor image quality still form serious hindrances that any viable natural scene recognition system needs to cope with. Robust pose estimation prior to recognition is therefore of great importance and, in a forensic context, often even surpasses the importance of fully automatic identification since, in the court of law, the value of forensic evidence is always assessed by a human expert.2 In this paper, we will therefore focus on the problem of obtaining an accurate far field pose estimate, irrespective of changes in lighting conditions and low image quality. Fortunately, in the forensic setting, the pose estimation problem can be somewhat constrained: one usually has a series of suspects (the gallery) that need to be matched to the perpetrator in the footage (the probe). This allows us to make use of 3D models of these suspects to create an augmented reality reconstruction of the scene through 3D-2D registration. In this setting of pose estimation through registration, the use of a robust similarity measure that can handle low image quality and varying illumination conditions is crucial. The similarity measure used for registration should therefore capture as much as possible of the structural information available in the image.

As shown in the famous paper by Oppenheim and Lim3 on Fourier phase based image reconstruction, most of the structural information in an image is embedded in its phase component, whereas the magnitude component contains contrast and brightness information. The local variant of the Fourier phase, the local or instantaneous phase, maintains this property and effectively encodes both the type (edges, peaks and troughs) and location of textural image features.4 As a consequence, local image phase information is to a large extent invariant to lighting and contrast variations. For our new pose estimation scheme, we will therefore propose new phase based similarity measures to guide 3D-2D registration. The robustness of these similarity measures will be tested by evaluating their accuracy and lighting invariance with respect to existing techniques. Furthermore, we will test their descriptive strength by directly using the assessed similarity as a recognition metric.

Further author information: (Send correspondence to Gert Merckx.)
Gert Merckx: E-mail: [email protected], Telephone: +3232652445
Jeroen Hermans: E-mail: [email protected], Telephone: +3216349049
Dirk Vandermeulen: E-mail: [email protected], Telephone: +3216349001

2. RELATED WORK

Within the field of pose estimation, the use of registration techniques is rather limited, since textured 3D face models of the subject are usually not available. As a consequence, most systems require passport quality photos or extensive training to estimate 3D pose from an image with acceptable accuracy. A comprehensive overview of the field is offered in Ref. 5. An often encountered approach for far field pose estimation is to classify face images into a number of discrete pose classes or to restrict the degrees of freedom of head movement. A pose estimation technique that shares some similarity with the method presented in this paper is that of La Cascia,6 in which a texture mapped cylindrical 3D model is aligned with a near field 2D passport quality image. In another method, presented by Gall,7 3D object to image matching is achieved by 3D-2D feature detection and alignment.

The volume of literature produced on phase based image processing is considerably more vast. After interest in the topic was sparked by Oppenheim and Lim,3 phase based image processing has been used in numerous computer vision applications, mainly focused on feature detection. The Local Energy model of feature detection postulated by Morrone8 states that features (edges, troughs and peaks) can be defined and classified based on their phase signatures (see Section 3). Kovesi9 defined phase congruency and used it to successfully detect image features and edges. In the field of image registration, phase has been used predominantly in the Fourier domain by means of phase correlation.10 It is robust to noise and is claimed to provide subpixel accuracy.11 Only recently was the use of local image phase adopted by the medical image processing community. Hemmendorff12 used local phase for 3D volumetric registration of CT and MRI images. Mellor and Brady13 used mutual information of local image phase for multi-modal registration. They used the monogenic signal14 to obtain the local phase information, but discarded the additional benefits the monogenic signal representation has to offer (see Section 3). Pan4 adapted the approach to create a multi-scale feature detector. In the field of pose estimation, the number of methods that use image phase information is even more limited. Gabor phase signatures were used by Peters15 to track 3D object pose. Gabor phase was also used in Ref. 16 to estimate head pose and gaze direction: based on the image phase, the system learned to assign images to a discrete number of pose classes.

3. LOCAL IMAGE PHASE AND ENERGY

As stated in the introduction, our pose estimation system will make use of a similarity measure that includes the local image phase to align 3D face models to 2D surveillance footage. This choice was motivated by the fact that local image phase is to a large extent invariant to the smooth image intensity changes that occur as a result of unconstrained illumination conditions of a 3D surface.

The concepts of local phase and energy originate from the field of 1D signal analysis. When considering a real function f(x), the local or instantaneous phase ϕ(x) is the argument of the function's complex valued analytic representation fa(x), whereas the local energy is the magnitude of fa(x). This signal representation offers a so-called split of identity: structural information is separated from energy information. The local phase angle describes what signal feature occurs at x (edge, peak or trough), whereas the local amplitude describes how prominent that feature is. The analytic signal is formally defined as

fa(x) = f(x)− i fH(x), (1)

in which i = √−1 and fH(x) is the Hilbert transform of f(x). The Hilbert transform fH(x) of f(x) is most easily interpreted as the result of applying a specific multiplicative operator in the frequency domain:

FH(ω) = i sign(ω) · F(ω) (2)

with F(ω) being the Fourier transform of f(x). When used as a Fourier multiplier, σH(ω) = i sign(ω), the Hilbert transform only affects the phase of a Fourier series, shifting negative frequency components by −π/2 and positive frequency components by +π/2:

σH(ω) =  −i = e^{−iπ/2}   for ω < 0,
          0                for ω = 0,
         +i = e^{+iπ/2}   for ω > 0.


Figure 1. The 1D analytic phase and the monogenic 3-vector. Left: the 1D local phase angle and its corresponding signal profile. Right: the monogenic 3-vector, consisting of the even bandpass filter response p and the odd filter responses in the x- and y-directions (i.e. q1 and q2, respectively).22

The signal's Hilbert transform fH(x) is then obtained through inverse Fourier transformation of (2). By combining the original signal and its Hilbert transform as in (1), we obtain the analytic signal: a complex representation of the original signal in which negative frequency components are suppressed and positive frequency components are doubled in energy, as can be understood by combining the Fourier transform of (1) with (2) into:

Fa(ω) = F(ω)· [1 + sign(ω)].
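As a concrete illustration of equations (1) and (2), the analytic signal and its phase/energy split can be computed directly with FFTs. The following NumPy sketch is ours, not part of the paper's implementation:

```python
import numpy as np

def analytic_signal(f):
    """f_a = f - i f_H, built from the multiplier sigma_H(w) = i sign(w)
    of eq. (2); f is a real, uniformly sampled 1D signal."""
    F = np.fft.fft(f)
    w = np.fft.fftfreq(f.size)        # signed frequencies
    FH = 1j * np.sign(w) * F          # Hilbert transform in the Fourier domain
    fH = np.fft.ifft(FH)
    return f - 1j * fH                # eq. (1)

x = np.linspace(0, 4 * np.pi, 512, endpoint=False)
fa = analytic_signal(np.cos(x))       # equals exp(i x) up to numerical error
energy = np.abs(fa)                   # local energy: feature prominence
phase = np.angle(fa)                  # local phase: feature type
```

For f(x) = cos(x) this yields fa(x) ≈ e^{ix}: the local phase increases linearly while the local energy stays constant, illustrating the split of identity.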

To provide accurate spatial (time) and scale (frequency) localization, local phase and energy are usually determined through convolution of the signal with quadrature filters. An even bandpass filter is used to generate the real part of the analytic signal, whereas its odd counterpart generates the Hilbert transform.

fa(x) = (he(x)− i ho(x))⊗ f(x),

in which he(x) is the even bandpass filter and ho(x) is its odd counterpart, the Hilbert transform of he(x). The phase of the analytic signal effectively encodes signal features such as peaks or steps, as illustrated in Fig. 1 (left).

Unfortunately, due to the lack of a straightforward 2D generalization of the Hilbert transform, a viable extension of the analytic signal to two dimensions has eluded researchers for quite some time. Recently however, the generalization proposed by Felsberg17 incited a renewed interest in phase based image processing. In the remainder of this section, we will briefly illustrate the disadvantages of the widely adopted Gabor filter approach for 2D local phase estimation and present a solution in the form of the monogenic signal, as proposed by Felsberg.

3.1 Gabor Filterbanks

One of the most popular choices among filters for local phase estimation is the Gabor class of filters.18 A 2D Gabor filter is a two-dimensional Gaussian kernel modulated by a complex exponential. Their popularity can be attributed to their optimality with respect to the uncertainty principle of scale-space localization,19 which also explains their application in Short Time Fourier Transforms (STFT). However, one major drawback of using spatial Gabor filters in our application is their implicit orientation. In a quadrature pair of Gabor filters, the odd filter always has a specific orientation. To be receptive to image features (e.g. troughs and ridges) along n orientations, a bank of n filters is required. This is undesirable, as only a limited number of local orientations are considered, which makes the filter response anisotropic. Furthermore, using such a filterbank increases computation time significantly, as n convolutions with the filter are required. A solution to the orientation anisotropy of a filterbank was devised by Freeman et al.: in Ref. 20, a filtering method is proposed in which the local dominant orientation is determined prior to filtering. However, to determine the dominant orientation, a bank of basis filters is still used. For a more detailed discussion of the advantages and disadvantages of Gabor and other quadrature filters, the reader is referred to Ref. 21.

Figure 2. Example of a spherical quadrature filter in the image domain. Left: convolution kernel of a spherical bandpass filter (DOP). Middle and right: the two components of the conjugate Riesz transformed filter (DOCP).

3.2 Monogenic Signal

As illustrated in the previous paragraph, using a quadrature pair of Gabor filters as an example, the problem of anisotropy of the odd spatial filter cannot be solved using scalar valued filters. In Ref. 17, Felsberg et al. proposed another solution to this anisotropy problem. By substituting the Hilbert transform, used to obtain the analytic signal, with the vector valued Riesz transform, the notion of the 1D analytic signal was extended to multiple dimensions, while maintaining all desirable properties of the analytic signal. This generalization of the analytic signal is known as the monogenic signal and is formally defined as

fM(x) = f(x) − (i, j) fR(x),

in which fR(x) is the (vector valued) Riesz transform of the signal (or image) f(x). Setting aside the Clifford algebra involved (see Ref. 17), the monogenic signal quaternion can be understood as being made up of three components: one real part and two imaginary parts. As in the case of the analytic signal, the individual components of the monogenic signal can be obtained through convolution with quadrature filters. The isotropic set of quadrature filters proposed by Felsberg can consist of any even spherical bandpass filter and its odd Riesz transform. The Riesz transformed filter is vector valued and hence can be considered to consist of two separate, orthogonally oriented filters (see Fig. 2). The monogenic signal can therefore be represented as a 3-vector consisting of the even filter response p and the two odd filter responses q1 and q2. The monogenic representation of an image is thus a field of vectors, with each vector encoding local orientation, local phase and local energy. As can be understood from Fig. 1 (right), local energy Ax, local orientation θ, and local phase ϕ can be extracted from the monogenic signal vector as:

Ax = √(p² + q1² + q2²),

θ = tan⁻¹(q2 / q1),

ϕ = tan⁻¹((cos θ · q1 + sin θ · q2) / p).
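Translating these expressions into code is direct; a small NumPy helper (our illustration, using arctan2 for quadrant-safe angles):

```python
import numpy as np

def monogenic_components(p, q1, q2):
    """Split a monogenic vector field (p, q1, q2) into local energy A,
    local orientation theta and local phase phi (element-wise)."""
    A = np.sqrt(p**2 + q1**2 + q2**2)                        # local energy
    theta = np.arctan2(q2, q1)                               # local orientation
    phi = np.arctan2(np.cos(theta) * q1 + np.sin(theta) * q2, p)  # local phase
    return A, theta, phi
```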

The monogenic signal preserves the desirable properties found in the analytic signal (i.e. the split of identity), and therefore provides a convenient mathematical framework for fast and isotropic estimation of local image properties. The pose estimation method presented in this work will use this framework to register the monogenic representations of the surveillance footage and the rendered 3D model. Each component of the monogenic vector in itself offers significant advantages for image registration:

• Local phase: the image phase component ϕ qualitatively describes the structures encountered in an image. The phase angle encodes whether an image structure is an edge, a peak, a trough, or anything in between. Since the strength of a feature is encoded in the energy component, the phase component is to a large extent invariant to the way the subject is illuminated in the scene.

• Local energy: the degree to which a feature is present in the considered region is contained in the monogenic signal's local energy component. The higher the energy of a feature, the more visible it is. The monogenic energy component can therefore be considered a confidence measure for the local phase estimate: clearly visible features can be given a higher weight during alignment.

• Local orientation: the explicit availability of a local orientation estimate for each encountered structure offers obvious advantages during image alignment: the angle between orientation vectors directly relates to the rotational misalignment of corresponding structures in the image plane. In contrast to the monogenic phase based method in Ref. 13, this information will also be utilized during registration for increased robustness. The orientation estimate obtained through spherical quadrature filtering is also claimed14 to be more accurate than that obtained by Sobel filtering.

Figure 3. The monocular camera calibration problem: what are the camera parameters of the camera in the scene (left) that generated the image on the right?

4. METHODS: 3D-2D REGISTRATION

Registration of 3D models to 2D images can be approached as a monocular camera calibration problem. In monocular camera calibration, one tries to recover the internal and external camera parameters from one of the camera's images, as illustrated in Fig. 3. A projective camera maps a 3D world point X to a point x on the 2D image plane. Assuming a pinhole camera model, the camera projection matrix P is composed of an internal camera calibration matrix K and an external camera calibration matrix T:

x = P·X = K·T·X, with T = [R|t].

T is a rigid transformation matrix describing the mapping of points from the 3D world coordinate system to the 3D camera coordinate system. It consists of a rotation matrix R and a translation vector t = −R·c, in which c is the coordinate vector of the camera center in world space. Our method for recovering the perspective projection matrix uses a two-stage registration process, comprising an initialization and an optimization phase.
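As an illustration of this camera model (a minimal sketch of our own, not the paper's code), projecting world points with known K, R and camera centre c:

```python
import numpy as np

def project(X_world, K, R, c):
    """Project Nx3 world points with a pinhole camera.
    K: 3x3 internal calibration, R: 3x3 rotation,
    c: camera centre in world coordinates (length 3)."""
    t = -R @ c                                   # translation, t = -R c
    T = np.hstack([R, t.reshape(3, 1)])          # external calibration [R|t]
    P = K @ T                                    # 3x4 projection matrix
    Xh = np.hstack([X_world, np.ones((len(X_world), 1))])  # homogeneous coords
    xh = (P @ Xh.T).T
    return xh[:, :2] / xh[:, 2:3]                # dehomogenize to pixels
```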

4.1 Initialization

One typically needs 6 3D-2D correspondences to recover the 11 parameters of the camera projection matrix. Due to the nature of surveillance equipment and the way the footage is stored, the quality of the acquired imagery is often insufficient to reliably use feature detectors to find at least 6 3D-2D correspondences. Faces consisting of a limited number of pixels and undergoing serious noise and compression degradation are not suitable candidates for conventional automatic feature matching. Furthermore, to our knowledge, there exists no feature detector that offers reliable detection and matching of corresponding feature points between a 3D textured surface and a 2D image, as such a feature would have to be completely viewpoint invariant.

To reduce manual interaction to a minimum, a number of assumptions (no skew, unit aspect ratio) can be made to initialize the internal camera calibration matrix. Since our method aims at far field pose estimation, the focal length can also be assumed to be known. This reduces the camera calibration problem to a pose estimation problem, for which several correspondence based solutions exist. In this work, the Pose from Orthography and Scaling with Iterations (POSIT) algorithm was used to estimate the model's initial rotation matrix R and the world center to model center translation vector t. As illustrated in Fig. 4, POSIT usually requires only 3 3D-2D correspondences23 for a fast and robust estimate of the external camera calibration matrix. By rendering the 3D model using these estimates of R and t, we obtain an initial augmented reality image by overlaying or blending the original image with the rendered one.

Figure 4. The initialization stage: POSIT alignment based on 3 3D-2D correspondences. Left: the surveillance image; middle: the 3D model; right: the initial pose estimate obtained by rendering the model in the original image.

Figure 5. The optimization stage: left, pose after initialization; right, pose after optimization.
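The paper's initialization uses POSIT;23 purely as an illustration, the same correspondence-based initialization can be sketched with OpenCV's solvePnP, with EPnP standing in for POSIT. All landmark coordinates, the focal length and the image size below are invented:

```python
import numpy as np
import cv2

# Invented 3D landmarks on the face model (mm): eye corners, nose tip, chin.
model_pts = np.array([[-33.0,  30.0,  0.0],
                      [ 33.0,  30.0,  0.0],
                      [  0.0,   0.0, 25.0],
                      [  0.0, -45.0,  5.0]])
# Their manually indicated pixel positions in the surveillance frame.
image_pts = np.array([[311.0, 205.0],
                      [336.0, 203.0],
                      [324.0, 221.0],
                      [322.0, 239.0]])

# Assumed internal calibration: known focal length, zero skew,
# unit aspect ratio, principal point at the image centre (PAL frame).
f, cx, cy = 900.0, 360.0, 288.0
K = np.array([[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]])

# EPnP recovers R and t from a handful of 3D-2D correspondences,
# playing the role POSIT plays in the paper.
ok, rvec, tvec = cv2.solvePnP(model_pts, image_pts, K, None,
                              flags=cv2.SOLVEPNP_EPNP)
R, _ = cv2.Rodrigues(rvec)        # initial rotation matrix
t = tvec.ravel()                  # initial translation vector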

4.2 Optimization

Starting from the initial pose estimate, the external camera calibration parameters are optimized to match the subject's pose more accurately. In each iteration, the external camera calibration matrix T is modified and the similarity between the projected 3D face model and the overlapping part of the surveillance image is evaluated. Maximizing similarity (or minimizing the cost function E(T)) should ideally lead to convergence to a global optimum at which optimal 3D-2D alignment is achieved:

T* = argmin_T E(T),

in which T* is the recovered external camera calibration matrix.

As optimizer we use the line search algorithm proposed by Powell in Ref. 24, since it requires no derivatives and is quite robust to local minima in the cost function. The choice of the cost function itself is less straightforward. Several similarity measures, both new and existing, were implemented to compare their performance within our 3D-2D registration framework. Among the numerous measures tested, we will discuss only the three measures that proved most accurate.
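A toy version of this optimization loop (our sketch; the quadratic stand-in replaces the real render-and-compare cost E(T)):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
target_pose = np.array([0.1, -0.05, 0.02, 5.0, -3.0, 40.0])  # rx,ry,rz,tx,ty,tz

def cost(pose):
    """Stand-in for E(T): in the real system this would render the 3D
    model under `pose` and return the negated monogenic similarity with
    the surveillance image; here, a synthetic bowl with a known optimum."""
    return np.sum((pose - target_pose) ** 2)

initial_pose = target_pose + rng.normal(scale=0.5, size=6)  # "POSIT" estimate
res = minimize(cost, initial_pose, method="Powell")  # derivative-free search
print(res.x)  # recovered external calibration parameters T*
```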

4.2.1 Monogenic cost functions

We implemented several novel monogenic based cost functions to assess the similarity between the monogenic representations of two images. As spherical quadrature filter we used an even DOP (Difference of Poisson) bandpass filter and its odd Riesz conjugate (DOCP):

he(x) = s1 / (2π(s1² + x²)^{3/2}) − s2 / (2π(s2² + x²)^{3/2}),

ho(x) = x / (2π(s1² + x²)^{3/2}) − x / (2π(s2² + x²)^{3/2}),

in which s1 and s2 are parameters that define the fine and coarse scales of the bandpass filter, respectively. Note that applying the DOCP filter ho results in vector valued output. The triple formed by combining the even and odd filter outputs constitutes the monogenic 3-vector [p(x) q1(x) q2(x)]. The monogenic signal representation of an entire image can be interpreted as a field of vectors (see Fig. 1), one for each pixel x. As noted in Section 3, these vectors contain local phase and energy information, as well as a local orientation estimate. With this wealth of information combined in one vector, it becomes very appealing to use these vectors directly to align the 3D model to the 2D image. Since there are numerous options for assessing the similarity of two vector fields, we will only describe the measures that proved most accurate during our preliminary testing; a short code sketch of these measures follows the list below.

Figure 6. MahCosine and NMI registration: from left to right, surveillance footage; initial pose estimate; MahCosine based pose estimate; NMI based pose estimate.

Figure 7. MVC and NMI registration: from left to right, surveillance footage; initial pose estimate; MVC based pose estimate; NMI based pose estimate.
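A frequency-domain sketch of this filtering step (our illustration, not the paper's Matlab code; the Poisson kernel's frequency response is taken as e^{−sρ}, with constants depending on the Fourier convention):

```python
import numpy as np

def monogenic(img, s1=1.0, s2=4.0):
    """Monogenic representation of a 2D float image using a
    difference-of-Poisson (DOP) bandpass and its Riesz transform (DOCP),
    both applied in the frequency domain. Returns the fields (p, q1, q2)."""
    rows, cols = img.shape
    wy = np.fft.fftfreq(rows)[:, None] * 2 * np.pi
    wx = np.fft.fftfreq(cols)[None, :] * 2 * np.pi
    rho = np.hypot(wx, wy)                       # radial frequency
    Be = np.exp(-s1 * rho) - np.exp(-s2 * rho)   # even DOP bandpass
    rho[0, 0] = 1.0                              # avoid division by zero at DC
    R1, R2 = 1j * wx / rho, 1j * wy / rho        # Riesz (DOCP) multipliers
    F = np.fft.fft2(img)
    p = np.real(np.fft.ifft2(Be * F))            # even response
    q1 = np.real(np.fft.ifft2(R1 * Be * F))      # odd response, x-direction
    q2 = np.real(np.fft.ifft2(R2 * Be * F))      # odd response, y-direction
    return p, q1, q2
```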

• Monogenic whitened cosine (MahCosine): The cosine of the angle α between two monogenic vectors u and v at corresponding locations in the monogenic representations U and V of two images can be used as a functional vector similarity measure:

d = cos α = (u · v) / (‖u‖ ‖v‖).

If we calculate |d| for each pixel of the overlapping region of the surveillance image and the projected model, the mean of these values can be used as a monogenic similarity measure. It disregards the local energy Ax and therefore expresses similarity purely based on the monogenic phase and orientation estimates.

• Monogenic normalized vector difference: Direct subtraction of the corresponding vectors of two monogenic image representations U and V does not result in a viable difference measure. This can be attributed to the fact that the energies of features in the low quality surveillance footage are typically much lower than those in the sharp, high quality rendering of the 3D model. Therefore, all amplitudes of the monogenic vectors were set to unity prior to subtraction. The mean norm of the resulting vector differences was used as a monogenic similarity measure. It does not include local energy information.

• Monogenic vector correlation (MVC): The mean correlation between the monogenic vectors of the overlapping image regions can also be used as a vector field similarity measure. There are, however, different opinions on how the concept of vector correlation should be defined.25,26 The definition proposed by Crosby27 is closely related to other correlation measures such as canonical correlation. The Crosby correlation coefficient is formally expressed as:

ρv² = Tr[(ΣUU)⁻¹ ΣUV (ΣVV)⁻¹ ΣVU],

in which ΣUV is the cross-covariance matrix of two samples of vectors U and V. This definition incorporates both directional and magnitude information into the correlation estimate. The result of MVC based optimization is shown in Fig. 5 and Fig. 7.
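A minimal NumPy sketch of the three measures just described, operating on N monogenic 3-vectors gathered from the overlap region (our illustration):

```python
import numpy as np

def mahcosine(U, V):
    """Mean |cos| of the angle between corresponding monogenic vectors.
    U, V: arrays of shape (N, 3), one 3-vector per overlapping pixel."""
    num = np.abs(np.sum(U * V, axis=1))
    den = np.linalg.norm(U, axis=1) * np.linalg.norm(V, axis=1) + 1e-12
    return np.mean(num / den)

def normalized_vector_difference(U, V):
    """Mean norm of differences after forcing all amplitudes to unity
    (a dissimilarity: lower is better)."""
    Un = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
    Vn = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-12)
    return np.mean(np.linalg.norm(Un - Vn, axis=1))

def crosby_vector_correlation(U, V):
    """Crosby's rho^2 = Tr[Suu^-1 Suv Svv^-1 Svu] on centred samples."""
    Uc, Vc = U - U.mean(0), V - V.mean(0)
    n = len(U)
    Suu, Svv = Uc.T @ Uc / n, Vc.T @ Vc / n
    Suv = Uc.T @ Vc / n
    return np.trace(np.linalg.inv(Suu) @ Suv @ np.linalg.inv(Svv) @ Suv.T)
```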

4.2.2 Baseline cost functions

• Normalized Mutual Information: We used Normalized Mutual Information28 (NMI) as a baseline for testing registration accuracy and robustness. It is the gold standard among the intensity based similarity measures. In Ref. 29, the NMI measure was shown to be sensitive to smooth shading variations, which in our application is expected to be a major drawback.

• Gabor Phase Mutual Information: Gabor Phase MI was also included as a baseline metric. It was implemented using a filterbank consisting of eight orientations. For each filter orientation, the mutual information between the floating and reference images was calculated. All orientations were combined by summing their MI values. Although computationally inefficient and therefore not suited for use as a cost function, Gabor Phase Mutual Information offers a clear view of the benefits of using the monogenic vector over using only phase information.

• Phase Mutual Information: Monogenic Phase MI has been used for non-rigid multimodal registration of medical images in Ref. 13. It uses the phase component (see Fig. 1) of the monogenic vector and can therefore be evaluated much more quickly than its Gabor counterpart.

5. TESTING AND VALIDATION

5.1 Data description and implementation

Our small dataset was kindly provided by the Netherlands Forensic Institute and includes 3D face models and surveillance footage of 8 subjects over a wide variety of poses. Surveillance footage was recorded by a number of cameras of common CCTV brands. Images are of standard PAL resolution (720 by 576 pixels) and compressed using lossy compression schemes. Only subjects located in the far field of the camera view were considered; as such, faces in the surveillance footage typically consist of approximately 30 by 30 pixels. The interactive registration tool is implemented in Matlab and all calculations are performed on a single core of a 3 GHz Core i7 CPU. A typical registration using the novel cost functions takes approximately 30 seconds, in which the bottleneck is the CPU based rendering. Typical Gabor Phase based registrations take approximately 120 seconds, in which the bottleneck is the filterbank. Significant speedup can be achieved by using a GPU for model rendering.

5.2 Registration accuracy under image degradation

For accuracy testing purposes, surveillance images with known pose of the subject's face are required to serve as ground truth. Since, to our knowledge, no publicly available databases exist that offer both subject specific 3D scans and 2D footage with known ground truth pose, synthetic images were used. These were created by manually aligning a subject's face scan with the face of that subject in an image. This augmented reality image can now serve as ground truth, since we know the transformation and lighting parameters used to render the 3D model in the surveillance image. Compared to rendering a model in empty space, these overlay images are more challenging, since structures in the image background might introduce additional local optima due to varying overlap during optimization.

To test the accuracy of our phase based similarity measures with respect to model rotation, we sample the cost functions for model rotations of up to 14 degrees away from the true optimum, and we repeat this for the three orthogonal rotation axes. The optimum of the sampled cost functions is expected to coincide with the true optimum. The process is then repeated after degrading the quality of the synthetic image through convolution with a Gaussian smoothing kernel. As a result, errors between the optimum of the sampled cost function and the true optimum will start to occur, allowing us to assess the robustness of the cost functions to degrading image quality. The evolution of the error with respect to degrading image quality is plotted in Fig. 8. The error plotted is the maximum of the errors along the three rotation axes. Results were averaged over 2 experiments.
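In code, one sweep of this protocol might look as follows (a sketch; `sample_cost` is a hypothetical stand-in that renders the model rotated about one axis and evaluates a similarity measure):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def optimum_error(image, sigma, sample_cost, angles=np.arange(-14, 15, 2)):
    """Degrade a synthetic ground-truth image by Gaussian smoothing and
    report how far the sampled cost optimum drifts from the true pose at
    0 degrees, taking the worst of the three rotation axes as in Fig. 8."""
    degraded = gaussian_filter(image, sigma)
    errors = []
    for axis in range(3):                        # three rotation axes
        costs = [sample_cost(degraded, axis, a) for a in angles]
        errors.append(abs(angles[int(np.argmin(costs))]))
    return max(errors)
```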

[Plot: mean angular error (°) versus Gaussian smoothing factor σ, for the Normalized Vector Difference, MVC, MahCosine, NMI, Phase NMI and Gabor Phase MI measures.]

Figure 8. Mean registration error with respect to progressive image degradation: at σ = 1, the face region is 100 by 100 pixels; at σ = 10, the face area is 10 by 10 pixels. Results are averaged over two series of progressively deteriorated images.

5.3 Invariance to changing lighting conditions

To verify, to some extent, the claim of lighting invariance of phase based image representations without requiring large numbers of registrations under different lighting conditions, we tested the stability of the true cost function optimum with respect to changes in the directional lighting of the model. Since absolute changes of the cost function values are of no significance in this case, we compared the evolution of the cost functions due to lighting changes with their evolution due to rotational misalignment. This is achieved by calculating a stability measure S formulated as

S = σR / σL,

with σR and σL being the standard deviations of the cost function samples due to model rotation and due to lighting vector rotation, respectively. Although S cannot be used as a performance metric for registration purposes, it can be considered an indicator of the robustness of the similarity measures to lighting changes. Table 1 shows the stability under lighting changes of the six most accurate similarity measures. This analysis is performed on traces of the cost functions, as depicted in Fig. 9.

Table 1. Global optimum stability with respect to lighting changes: S expresses the ratio of variations in cost function evaluations due to model misalignment to variations due to lighting changes. A high value indicates that a cost function is more sensitive to model misalignment than to lighting variations.

Similarity measure     S
MVC                    17.8
Norm. vect. diff.      14.4
Phase NMI              12.3
MahCosine               9.4
Gabor NMI               9.4
NMI                     2.9
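Computing S from the sampled cost traces is a one-liner per measure; a sketch with hypothetical trace arrays:

```python
import numpy as np

def stability(rotation_traces, lighting_traces):
    """S = sigma_R / sigma_L: cost variation caused by model rotation
    relative to that caused by rotating the lighting vector.
    Each argument: array of cost samples, e.g. shape (3 axes, n angles)."""
    return np.std(rotation_traces) / np.std(lighting_traces)
```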

[Cost function traces for a synthetic Gaussian smoothed image (σ = 3); panels: Norm. vector diff., Vector correlation, MahCosine, NMI, Phase NMI and Gabor Phase MI; each panel plots the cost against θ from −10 to 10.]

Figure 9. Cost function traces: the evolution of each cost function is plotted with respect to model rotation around the x, y and z axes (3 full lines), and with respect to directional lighting vector rotation around the x, y and z axes (3 dashed lines). θ ranges from −14 to 14 in 2° increments. These evaluations are performed on a face region of 25 by 25 pixels. Large cost changes due to model rotation combined with small cost changes due to lighting changes suggest robustness to variable lighting conditions. The traces are also used to empirically assess smoothness and the area of convergence.

5.4 Recognition performance

As, in the court of law, judgment on the identity of the suspect with respect to the subject in the surveillance footage is passed upon visual assessment of the augmented reality image, the main purpose of this work is accurate and robust 3D-2D registration. It is nevertheless interesting to evaluate the recognition performance of the proposed similarity measures. This test is performed using real surveillance images. In this identification task, the subject in the 2D image can be considered the probe. We therefore register each of the real surveillance images with each of the 3D face models and evaluate the similarity measures before and after each registration. Similarity should be highest when the probe is registered with his own 3D model. To circumvent the problem of manual point based initialization, we only manually initialized each surveyed subject with its own model and used 3D-3D ICP registration to find the transformations between the models. The recognition rate on our small database of 8 subjects is plotted in Fig. 10.

[Two plots of recognition rate versus rank (1 to 8): left, MahCosine and NMI; right, MahCosine, MVC and NMI.]

Figure 10. Rank n recognition rate: left, recognition rate prior to pose optimization; right, recognition rate after pose optimization.
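The rank-n rates of Fig. 10 follow from an all-pairs similarity matrix; a minimal CMC sketch (ours), assuming gallery entry i is the correct match for probe i:

```python
import numpy as np

def cmc(similarity):
    """Rank-n recognition rates from an (n x n) similarity matrix where
    entry [i, j] scores probe i against gallery model j."""
    n = len(similarity)
    order = np.argsort(-similarity, axis=1)           # best match first
    rank = np.array([np.where(order[i] == i)[0][0] for i in range(n)])
    return np.array([np.mean(rank < r) for r in range(1, n + 1)])
```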

6. DISCUSSION

With good image quality, our accuracy tests in Section 5.2 show that, under constrained lighting conditions, NMI performs similarly to the newly introduced Monogenic MahCosine, Monogenic Vector Difference and Monogenic Vector Correlation similarity measures. However, when progressing to the poor quality images, it becomes clear that all monogenic signal based measures are more robust to image degradation than NMI. In turn, NMI performs better than the phase-only measures (Gabor Phase MI and Monogenic Phase MI). As our lighting invariance test in Section 5.3 indicates, however, NMI is far more sensitive to lighting changes than any of the phase or vector based similarity measures. When considering the monogenic based measures, it is striking that the MVC performs so well. Given that it includes local energy information, this can most likely be explained by the fact that the local energy component can be considered a confidence measure for the local phase and orientation estimates, as noted in Section 3.

In Section 5.4, the illumination sensitivity of NMI is confirmed by analyzing the pre- and post-registration recognition results, which were obtained under unconstrained lighting conditions. Recognition performance of the Monogenic MahCosine measure is boosted by performing registration. Combined with the fact that a better alignment is achieved by performing registration, this suggests that the MahCosine measure is descriptive enough to capture and assess facial similarity. Visual evaluation also leads us to conclude that NMI is responsible for a higher percentage of misregistrations than the Monogenic MahCosine and MVC similarity measures. Comparison of the overall recognition system to existing pose estimation systems is not straightforward, mainly because we know of no other system that uses subject specific 3D models, since usually non-forensic applications are the aim. Nevertheless, when comparing the accuracy tests from Section 5.2 with state of the art pose estimation techniques5 on passport quality photos, the achieved accuracy on low resolution synthetic images looks promising. Validation should be continued when 3D-2D datasets with readily available ground truth become available.

7. CONCLUSIONS AND FUTURE WORK

In this work, a new method for forensic head pose estimation was introduced. Using similarity measures for image registration as a cost function, we optimized an initial pose estimate to obtain accurate alignment of a 3D face model to a 2D low quality surveillance image. As expected, the use of the 3D model information pushes far field accuracy toward acceptable results for forensic pose estimation. The monogenic similarity measures proposed to achieve this were shown to be more resistant to varying imaging conditions than conventional mutual information or purely phase based measures. A recognition experiment on real data also supported the superiority of the proposed similarity measures.

Future work on our pose estimation system will be aimed at including focal length as a parameter to be optimized, thereby enabling the system to perform pose estimation regardless of whether the subject is situated in the near or far field of the surveillance camera. Another valuable extension would be the adoption of a multi-scale approach to the monogenic signal based registration process. Moreover, it is entirely possible that far better methods exist for assessing the similarity between two monogenic representations of images. We consider the methods and measures in this paper to be only a first step toward the adoption of monogenic signal representations for registration.

REFERENCES

[1] Goos, M. I. M., Alberink, I. B., and Ruifrok, A. C. C., "2D/3D image (facial) comparison using camera matching," Forensic Science International 163(1-2), 10–17 (2006).

[2] Bijhold, J., Ruifrok, A., Jessen, M., Geradts, Z., Ehrhardt, S., and Alberink, I., "Forensic audio and visual evidence 2004-2007: A review," in [15th INTERPOL Forensic Science Symposium, October], (2007).

[3] Oppenheim, A. V. and Lim, J. S., "The importance of phase in signals," Proceedings of the IEEE 69(5), 529–541 (1981).

[4] Pan, X., Brady, M., Highnam, R., and Declerck, J., "The use of multi-scale monogenic signal on structure orientation identification and segmentation," Lecture Notes in Computer Science 4046, 601 (2006).

[5] Murphy-Chutorian, E. and Trivedi, M. M., "Head pose estimation in computer vision: A survey," IEEE Trans. Pattern Anal. Mach. Intell. 31(4), 607–626 (2009).

[6] La Cascia, M., Sclaroff, S., and Athitsos, V., "Fast, reliable head tracking under varying illumination: an approach based on registration of texture-mapped 3D models," IEEE Transactions on Pattern Analysis and Machine Intelligence 22(4), 322–336 (2000).

[7] Gall, J., Rosenhahn, B., and Seidel, H., "Robust pose estimation with 3D textured models," Lecture Notes in Computer Science 4319, 84 (2006).

[8] Morrone, M. C., Navangione, A., and Burr, D., "An adaptive approach to scale selection for line and edge detection," Pattern Recognition Letters 16(7), 667–677 (1995).

[9] Kovesi, P., "Image features from phase congruency," Videre: Journal of Computer Vision Research 1(3), 1–26 (1999).

[10] Castro, E. D. and Morandi, C., "Registration of translated and rotated images using finite Fourier transforms," IEEE Trans. Pattern Anal. Mach. Intell. 9(5), 700–703 (1987).

[11] Foroosh, H., Zerubia, J. B., and Berthod, M., "Extension of phase correlation to subpixel registration," IEEE Transactions on Image Processing 11(3), 188–200 (2002).

[12] Hemmendorff, M., Andersson, M. T., Kronander, T., and Knutsson, H., "Phase-based multidimensional volume registration," in [2000 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP'00. Proceedings], 6 (2000).

[13] Mellor, M. and Brady, M., "Phase mutual information as a similarity measure for registration," Medical Image Analysis 9(4), 330–343 (2005).

[14] Felsberg, M. and Sommer, G., "A new extension of linear signal processing for estimating local properties and detecting features," Proceedings of the DAGM 2000, 195–202 (2000).

[15] Peters, G., Zitova, B., and von der Malsburg, C., "How to measure the pose robustness of object views," Image and Vision Computing 20(5-6), 341–348 (2002).

[16] Weidenbacher, U., Layher, G., Bayerl, P., and Neumann, H., "Detection of head pose and gaze direction for human-computer interaction," Lecture Notes in Computer Science 4021, 9 (2006).

[17] Felsberg, M. and Sommer, G., "Structure multivector for local analysis of images," Lecture Notes in Computer Science, 93–104 (2001).

[18] Fleet, D. J. and Jepson, A. D., "Computation of component image velocity from local phase information," International Journal of Computer Vision 5(1), 77–104 (1990).

[19] Daugman, J. G., "Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters," Journal of the Optical Society of America A 2(7), 1160–1169 (1985).

[20] Freeman, W. T. and Adelson, E. H., "The design and use of steerable filters," IEEE Transactions on Pattern Analysis and Machine Intelligence 13(9), 891–906 (1991).

[21] Boukerroui, D., Noble, J. A., and Brady, M., "On the choice of band-pass quadrature filters," Journal of Mathematical Imaging and Vision 21(1), 53–80 (2004).

[22] Felsberg, M., "Optical flow estimation from monogenic phase," Lecture Notes in Computer Science 3417, 1 (2007).

[23] DeMenthon, D. F. and Davis, L. S., "Model-based object pose in 25 lines of code," International Journal of Computer Vision 15(1), 123–141 (1995).

[24] Powell, M. J. D., "Direct search algorithms for optimization calculations," Acta Numerica 7, 287–336 (1998).

[25] Stephens, M. A., "Vector correlation," Biometrika 66, 41–48 (Apr. 1979).

[26] Jupp, P. E. and Mardia, K. V., "A general correlation coefficient for directional data and related regression problems," Biometrika 67(1), 163–173 (1980).

[27] Crosby, D., Breaker, L., and Gemmill, W., "A proposed definition for vector correlation in geophysics: Theory and application," Journal of Atmospheric and Oceanic Technology 10, 355–367 (June 1993).

[28] Studholme, C., Hill, D. L. G., and Hawkes, D. J., "An overlap invariant entropy measure of 3D medical image alignment," Pattern Recognition 32(1), 71–86 (1999).

[29] Roche, A., Malandain, G., Pennec, X., and Ayache, N., "The correlation ratio as a new similarity measure for multimodal image registration," Lecture Notes in Computer Science, 1115–1124 (1998).