International Journal of Computer Vision 21(1/2), 99–122 (1997). © 1997 Kluwer Academic Publishers. Manufactured in The Netherlands.

On Photometric Issues in 3D Visual Recognition from a Single 2D Image

AMNON SHASHUA
The Hebrew University of Jerusalem, Institute for Computer Science, Jerusalem 91904, Israel
[email protected]

Abstract. We describe the problem of recognition under changing illumination conditions and changing viewing positions from a computational and human vision perspective. On the computational side we focus on the mathematical problems of creating an equivalence class for images of the same 3D object undergoing certain groups of transformations—mostly those due to changing illumination, and briefly discuss those due to changing viewing positions. The computational treatment culminates in proposing a simple scheme for recognizing, via alignment, an image of a familiar object taken from a novel viewing position and a novel illumination condition. On the human vision side, the paper is motivated by empirical evidence inspired by Mooney images of faces that suggests a relatively high level of visual processing is involved in compensating for photometric sources of variability, and furthermore, that certain limitations on the admissible representations of image information may exist. The psychophysical observations and the computational results that follow agree in several important respects, such as the same (apparent) limitations on image representations.

1. Introduction

The problem of visual recognition is one of the well-known challenges to researchers in human and machine vision. The task seems very easy and natural for biological systems, yet has proven to be very difficult to place within a comprehensive analytic framework. Some of the difficulties arise due to a lack of a widely accepted definition of what the problem is. For example, one can easily recognize scenes (such as a highway scene, city scene, restaurant scene, and so forth) (Potter, 1975) without an apparent need to recognize individual objects in the scene, nor to have a detailed recollection of the spatial layout of “things” in the scene. In this case it seems that some form of statistical regularity of scenes of a specific type is exploited, rather than what we normally associate with the task of “recognizing an object”. Other difficulties arise due to hard mathematical problems in understanding the relationship between 3D objects and their images. For example, as we move our eyes, change position relative to the object, or move the object relative to ourselves, the image of the object undergoes change. Some of these changes are intuitive and include displacement and/or rotation in the image plane, but in general the changes are far from obvious. If the illumination conditions change, that is, the level of illumination, as well as the positions and distributions of light sources, then the image of the object changes as well. The light intensity distribution changes, and shadows and highlights may change their position.

In this paper we focus on the mathematical problems of creating an equivalence class for images of the same 3D object undergoing a certain group of changes. Before narrowing further the scope of discussion it may be worthwhile to consider further the types of “sources of variability” that are of general interest in the recognition of individual objects. Following the seminal work of Ullman (1986), we distinguish four general sources of variability:

• Photometric: Changes in the light intensity distribution as a result of changing the illumination conditions.
• Geometric: Changes in the spatial location of image information as a result of a relative change of viewing position.


• Varying Context: Objects rarely appear in isolation, and a typical image contains multiple objects that are next to each other or partially occluding each other. Changes in the image can, therefore, occur by changing the context without applying any transformation to the object itself.
• Non-rigid Object Characteristics: These include objects changing shape (such as facial expressions), objects having movable parts (like scissors), and so forth.

The photometric source of variability has to do with the relation between objects and the images they produce under changing conditions of illumination, i.e., changing the level of illumination and the direction and number of light sources. This has the effect of changing the light intensity distribution in the image and the location of shadows and highlights. We will examine this issue later in the paper.

The geometric source of variability has to do with the geometric relation between rigid objects and their perspective images produced under changing viewing positions (relative motion between the viewer and the object). This is probably the most emphasized source of variability in computer vision circles and has received much attention in the context of recognition, structure from motion, visual navigation, and recently in the body of research on geometric invariants (e.g., Mundy and Zisserman, 1992; Mundy et al., 1994). We note that even relatively small changes in viewing position between two images of the same object often create a real problem in matching the two against each other. Figure 1 illustrates this point by superimposing two edge images of a face separated by a relatively small rotation around the vertical axis. We will discuss geometric issues of recognition later in Section 6, but for most of this paper we will focus only on the photometric source of variability.

Figure 1. Demonstrating the effects of changing viewing position on the matching process. The difficulty of matching two different views can be illustrated by superimposing the two. One can see that, even for relatively small changes in viewing position, it could be very difficult to determine whether the two views come from the same face without first compensating for the effects of the viewing transformation.

The third source of variability has to do with the effect of varying context. A typical image often contains multiple objects that are next to each other, or partially occluding each other. If we attempt to compare the entire image (containing a familiar object) to the model representation of the object in question, then we are unlikely to have a match between the two. The problem of varying context is, therefore, a question of how the image representation of an object (say its contours) can be separated from the rest of the image before we have identified the object. The problem is difficult and is often referred to as the problem of “segmentation”, “grouping” or “selection”. In the context of achieving recognition the crucial question is whether the problem of context can be approached in a bottom-up manner, i.e., irrespective of the object to be recognized, or whether it requires top-down processes as well. It appears that in some cases in human vision the processes for performing grouping and segmentation cannot be isolated from the recognition process. In some well-known examples, such as R.C. James’ image of a Dalmatian dog (see Marr, 1982), it appears unlikely that the image of the object can be separated from the rest of the image based on image properties alone and, therefore, some knowledge about the specific class of objects is required to interpret the image.

Human vision, however, appears also to contain relatively elaborate processes that perform grouping and segmentation solely on a data-driven basis, independent of subsequent recognition processes. For example, Kinsbourne and Warrington (1962), cited in Farah (1990), report that patients with lesions in the left inferior temporo-occipital region are generally able to recognize single objects, but do poorly when more than one object is present in the scene. Another line of evidence comes from displays containing occlusions. The occluding stimuli, when made explicit, seem to stimulate an automatic ‘grouping’ process that groups together different parts of the same object (Nakayama et al., 1989). The third line of evidence comes from ‘saliency’ displays in which structures, not necessarily recognizable ones, are shown against a complex background. Some examples are shown in Fig. 2. In these displays, the figure-like structures seem to be detected immediately despite the lack of any apparent local distinguishing cues, such as local orientation, contrast and curvature (Shashua and Ullman, 1988, 1991).


Figure 2. Structural-saliency displays. The figure-like structures seem to ‘pop up’ from the display, despite the lack of any apparent local distinguishing cues, such as local orientation, contrast and curvature (Shashua and Ullman, 1988).

We will not consider further the problem of varying context, and assume instead that the region in the image containing the object has been isolated for purposes of recognition.

The fourth source of variability has to do with objects changing their shape. These include objects with movable parts (such as the human body) and flexible objects (for example, a face where the changes in shape are induced by facial expressions). This source of variability is geometrical, but unlike changing viewing positions, the geometric relation between objects and their images has less to do with issues of projective geometry and more to do with defining the space of admissible transformations in object space.

Our primary focus will be on the photometric source of variability, with some discussion of the geometric source. We discuss next in more detail the issues related to changing illumination in visual recognition.

2. The Photometric Source of Variability and Its Impact on Visual Recognition

The problem of varying illumination conditions, or the photometric source of variability as we refer to it here, raises the question of whether the problem can be isolated and dealt with independently of subsequent recognition processes, or whether it is coupled with the recognition process.

It appears that in some cases in human vision the effects of illumination are factored out at a relatively early stage of visual processing and independently of subsequent recognition processes. A well-known example is the phenomenon of lightness and color constancy. In human vision the color of an object, or its greyness, is determined primarily by its reflectance curve, not by the actual wavelengths that reach the observer’s eye. This property of the visual system is not completely robust, as it is known, for example, that fluorescent lighting alters our perception of colors. Nevertheless, this property appears to suggest that illumination is being factored out at an early stage prior to recognition. Early experiments that were used to demonstrate this used simple displays such as a planar ensemble of rectangular color patches, named after Mondrian’s paintings, or comparisons among Munsell chips (Land and McCann, 1971). More recent psychophysical experiments demonstrated the effect of 3D structure on the perception of color and lightness (Gilchrist, 1979; Knill and Kersten, 1991). These experiments show that the perception of lightness changes with the perceived shape of the object. The objects that were used for these experiments are relatively simple, such as cylinders, polyhedra and so forth. It is therefore conceivable that the 3D structure of the object displayed in these kinds of experiments can be reconstructed on the basis of image properties alone, after which illumination effects can be factored out.


Another example of factoring out the illumination at an early stage, prior to and independently of recognition, is the use of edge detection. Edge detection is the most dominant approach to the problem of changing illumination and is based on recovering features from the image that are invariant to changes of illumination. The best-known example of such features is step edges, i.e., contours where the light intensity changes relatively abruptly from one level to another. Such edges are often associated with object boundaries, changes in surface orientation, or material properties (Marr, 1976; Marr and Hildreth, 1980). Edge images contain most of the relevant information in the original grey-level image in cases where the information is mostly contained in changing surface material, in sharp changes in surface depth and/or orientation, and in surface texture, color, or greyness. In terms of 3D shape, these are characteristics of relatively simple objects. Therefore, the edges of simple objects are relatively informative (or recognizable) and will change only slightly when the illumination conditions change.

Many natural objects have a more complex structure, however: surface patches do not change orientation abruptly but rather smoothly. In this case, step edges may not be an ideal representation for two reasons: the edge image may not necessarily contain most of the relevant information in the grey-level image, and not all edges are stable with respect to changing illumination. For example, edges that correspond to surface inflections in depth are actually “phantom” edges and depend on the direction of the light source (Moses, 1993).

Alternative edge detectors, prompted by the need for more recognizable or more stable contour images, search instead for extremal points of the light intensity distribution, known as valleys and ridges, or build up a “composite” edge representation made out of the union of step edges, valleys, and ridges (Freeman and Adelson, 1991; Morrone and Burr, 1988; Pearson et al., 1986; Perona and Malik, 1990). The composite edge images do not necessarily contain the subset of edges that are stable against changing illumination; they generally look better than step edges alone, but that varies considerably depending on the specific object.

The process of edge detection, producing step edges, ridges, valleys, and composite edge images, is illustrated in Figs. 3 and 4. In Fig. 3 three ‘Ken’ doll images are shown, each taken under a distinct illumination condition, with their corresponding step edges. In Fig. 4 the ridges, valleys, and composite edge images of the three original images are shown (produced by Freeman and Adelson’s (1991) edge and line detector). These results show that the invariance of edges is not complete; some edges appear or disappear, some change location, and spurious edges result from shadows (especially attached shadows), specularities, and so forth.

Figure 3. Grey-scale images of ‘Ken’ taken under three different illumination conditions. The bottom row shows the step edges detected by local energy measures followed by hysteresis (Freeman and Adelson, 1991). The step edges look very similar to the ones produced by Canny’s edge detection scheme.

Figure 4. Valleys, ridges, and composite contour images produced by Freeman’s contour detection method applied to the three images of the previous figure.

The ‘Ken’ images and their edge representations also demonstrate the practical side of the problem of recognition under changing illumination conditions.


Figure 5. Images of ‘Ken’ taken under different illumination conditions followed by a thresholding operation. The recognizability of the thresholded images suggests that some knowledge about objects is required in order to factor out the illumination, and specifically that the image we are looking at is an image of a face.

The images appear different to the degree that a template match between any two of them is not likely to succeed without first compensating for the changing illumination.

Human vision appears also to contain processes that factor out the effect of illumination during the recognition process. In other words, the image and the model are coupled together early on in the stages of visual processing. Consider, for example, the images displayed in Fig. 5. The images are of ‘Ken’ lit by two different illumination conditions, and thresholded by an arbitrary value. The thresholded images appear to be recognizable, at least in the sense that one can clearly identify the image as containing a face. Because the appearance of the thresholded images critically relies on the illumination conditions, it appears unlikely that recognition in this case is based on the input properties alone. Some knowledge about objects (specifically that we are looking at the image of a face) may be required in order to factor out the illumination.

Thresholded images are familiar in psychological circles, but less so in computational ones. A well-known example is the set of thresholded face images produced by Mooney (1960) for clinical recognizability tests, known as the closure faces test, in which patients had to sort the pictures into general classes that include: boy, girl, grown-up man or woman, old man or woman, and so forth. Examples of Mooney’s pictures are shown in Fig. 6. Most of the control subjects could easily label most of the pictures correctly. Some of Mooney’s pictures are less interpretable (for example, Fig. 7), but as a general phenomenon it seems remarkable that a vivid visual interpretation is possible from what seems an ambiguous collection of binary patches that do not bear a particularly strong relationship to surface structure or other surface properties.

Figure 6. Mooney faces and their level-crossings.

Figure 7. A less interpretable Mooney picture and its level-crossings.

Mooney images are sometimes referred to as representing the phenomenon of “shape from shadows” (Cavanagh, 1990). Although some Mooney images do contain cast shadows, the phenomenon is not limited to the difficulty of separating shadow borders from object contours. The thresholded image shown in Fig. 8, for example, is no less difficult to account for in computational terms, yet the original image was not lit in a way to create cast or attached shadows.

These kinds of images appear also to indicate that in some cases in human vision the interpretation process involves more than just contours. It is evident that the contours (level-crossings) alone are not interpretable, as can be seen with the original Mooney pictures and with the level-crossing image in Fig. 8.


Figure 8. Top row: a ‘Ken’ image represented by grey-levels, the same image followed by a threshold, and the level-crossings of the thresholded image. The thresholded image shown in the center display is difficult to account for in computational terms, yet the original image was not lit in a way to create cast or attached shadows. Bottom row: the sign-bits of the Laplacian of Gaussian operator applied to the original image, and its zero-crossings (step edges). Interpretability of the sign-bit image is considerably better than the interpretability of the zero-crossings.

It seems that only when the distinction of which regions are above the threshold and which are below the threshold is made clear (we refer to that as adding “sign-bits”) does the resulting image become interpretable. This appears to be true not only for thresholded images but also for step edges and their sign-bits (see Fig. 8, bottom row).

It appears, therefore, that in some cases in human vision the illumination is factored out within the recognition process using top-down information, and that the process responsible apparently requires more than just contours—but not much more. We refer from here on to Mooney-like images as reduced images. From a computational standpoint we will be interested not only in factoring out the illumination, in a model-based approach, but also in doing so from reduced images.

3. Problem Scope

The recognition problem we consider is that of identifying an image of an arbitrary individual 3D object. We allow the object to be viewed from arbitrary viewing positions and to be illuminated by an arbitrary setting of light sources. We assume that the image of the object is already separated from the rest of the image, but may have missing parts (for example, as caused by occlusion).

We adopt the alignment methodology, which defines “success” as the ability to exactly reconstruct the input image representation of the object (possibly viewed under novel viewing and illumination conditions) from the model representation of the object stored in memory. Alignment has been studied in the past for compensating for changes of viewing position, and the method is typically realized by storing a small number of “model” views (two, for example), or a 3D model of the object; with the help of corresponding points between the model and any novel input view, the object is “re-projected” onto the novel viewing position. Recognition is achieved if the re-projected image is successfully matched against the input image (Fischler and Bolles, 1981; Huttenlocher and Ullman, 1990; Lowe, 1985; Ullman, 1986; Ullman and Basri, 1989). Our approach is to apply the basic concept of alignment to the photometric domain, and then combine both sources of variability into a single alignment framework that can deal with changes due to geometry and photometry occurring simultaneously.

The basic method, which we call photometric alignment, for compensating for the effects of illumination during recognition is introduced next. The method is based on a result that three images of the surface provide a basis that spans all other images of the surface (same viewing position, but changing illumination conditions). The photometric problem of recognition is, therefore, reduced to the problem of determining the linear coefficients—which is conceptually similar to the idea of Ullman and Basri (1989) in the geometric context. We then extend the basic method to deal with situations of recognition from reduced image representations, with results that appear to agree with the empirical observation made earlier that sign-bits appear to be sufficient for visual interpretation, whereas edges alone do not.

4. Photometric Alignment

The basic approach is based on finding an algebraic connection between all images of an object taken under varying illumination conditions. We start by defining the family of surface reflectance functions for which our results will hold:

Definition. An order k Linear Reflectance Model is defined as the scalar product x · a, where x is a vector in k-dimensional Euclidean space of invariant surface properties (such as surface normal, surface albedo, and so forth), and a is an arbitrary vector (of the same dimension).


The Lambertian model of reflection is an obvious case of an order 3 linear reflectance model. The grey-value I(p) at location p in the image can be represented by the scalar product of the surface normal vector and the light source vector,

I(p) = n_p · s.

Here the length of the surface normal n_p represents the surface albedo (a scalar ranging from zero to one). The length of the light source vector s represents a mixture of the spectral response of the image filters and the spectral composition of the light sources—both of which are assumed to be fixed for all images of the surface (we assume for now that light sources can change direction and level of intensity but not spectral composition).

Another example of a linear reflectance model is the image irradiance of a tilted Lambertian surface under a hemispherical sky. Horn (1986, p. 234) shows that the image irradiance equation is E δ_p cos²(α/2), where α is the angle between the surface normal and the zenith, E is the intensity of the light source, and δ_p is the surface albedo. Using the identity cos²(α/2) = (1 + cos α)/2, the equation is seen to be an order 4 linear reflectance function:

I(p) = (1/2) E δ_p (1 + cos α) = n_p · s + |n_p| · |s| = (n_p, |n_p|)^t (s, |s|),

where s represents the direction of the zenith and has length E/2.

Proposition 1. An image of an object with an order k linear reflectance model I(p) = x(p) · a can be represented as a linear combination of a fixed set of k images of the object.

Proof: Let a_1, ..., a_k be some arbitrary set of basis vectors that span k-dimensional Euclidean space. The image intensity I(p) = x(p) · a is therefore represented by

I(p) = x(p) · [α_1 a_1 + · · · + α_k a_k] = α_1 I_1(p) + · · · + α_k I_k(p),

where α_1, ..., α_k are the linear coefficients that represent a with respect to the basis vectors, and I_1, ..., I_k are the k images given by I_j(p) = x(p) · a_j. □
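The span property is easy to check numerically for the Lambertian (k = 3) case. The following sketch, with illustrative variable names of our own (not from the paper), builds synthetic image vectors from random albedo-scaled normals and confirms that an image under a novel source is reproduced exactly by the corresponding linear combination of the basis images.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Normals" whose length encodes the albedo: one 3-vector per pixel.
normals = rng.uniform(-1.0, 1.0, size=(1000, 3))

# Three linearly independent light-source vectors playing the role of a_1..a_3.
s1 = np.array([1.0, 0.1, 0.8])
s2 = np.array([0.2, 1.0, 0.7])
s3 = np.array([-0.3, 0.4, 1.0])

# A novel light source expressed in that basis with coefficients alpha.
alpha = np.array([0.4, 1.1, -0.3])
s_novel = alpha[0] * s1 + alpha[1] * s2 + alpha[2] * s3

# Basis images I_j(p) = x(p) . a_j and the novel image I(p) = x(p) . a.
I1, I2, I3 = normals @ s1, normals @ s2, normals @ s3
I_novel = normals @ s_novel

# The novel image is exactly the same linear combination of the basis images.
assert np.allclose(I_novel, alpha[0] * I1 + alpha[1] * I2 + alpha[2] * I3)
```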

To see the relevance of this proposition to visual recognition, consider the case of a Lambertian surface under a point light source (or multiple point light sources). Assume we take three pictures of the object, I_1, I_2, I_3, from light source directions s_1, s_2, s_3, respectively. The linear combination result is that any other image I of the object, taken from a novel setting of light sources, is simply a linear combination of the three pictures,

I(p) = α_1 I_1(p) + α_2 I_2(p) + α_3 I_3(p),

for some coefficients α_1, α_2, α_3. The coefficients can be solved for by observing the grey-values of three points, providing three equations. Using more than three points will provide a least squares solution. The solution is unique provided that s_1, s_2, s_3 are linearly independent, and that the normal directions of the three sampled points span all other surface normals (for a general 3D surface, for example, the three normals should be linearly independent).

Alignment-based recognition under changing illumination can proceed in the following way. The images I_1, ..., I_k are the model images of the object (three for a Lambertian surface under point light sources). For any new input image I, rather than matching it directly to previously seen images (the model images), we first select a number of points (at least k) to solve for the coefficients, and then synthesize an image I′ = α_1 I_1 + · · · + α_k I_k. If the image I is of the same object, and the only change is in illumination, then I and I′ should perfectly match (the matching is not necessarily done at the image intensity level; one can match the edges of I against the edges of I′, for example). This procedure has factored out the effects of changing illumination from the recognition process without recovering scene information, i.e., surface albedo or surface normals, and without assuming knowledge of the direction of light sources. Another property of this method is that one can easily find a least squares solution for the reconstruction of the synthesized image, thereby being less sensitive to errors in the model or in the input.
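The procedure above maps directly onto a few lines of linear algebra. The following is a minimal sketch, assuming the three model images and the novel image are already in pixelwise correspondence (same viewing position); the function and parameter names are ours, not the paper's.

```python
import numpy as np

def photometric_alignment(model_images, novel_image, sample_points):
    """Recover the linear coefficients from a few sampled pixels (least squares)
    and synthesize the novel illumination from the model images."""
    I1, I2, I3 = model_images
    rows = np.array([[I1[p], I2[p], I3[p]] for p in sample_points])   # m x 3
    rhs = np.array([novel_image[p] for p in sample_points])           # m
    alpha, *_ = np.linalg.lstsq(rows, rhs, rcond=None)
    # Synthesize I' = alpha1*I1 + alpha2*I2 + alpha3*I3; negative values
    # correspond to attached shadows and are clipped to zero (Section 4.1).
    synthesized = np.clip(alpha[0] * I1 + alpha[1] * I2 + alpha[2] * I3, 0, None)
    return alpha, synthesized

# Usage (hypothetical): match the synthesized image, or its edges, against the
# input image to accept or reject the model.
# alpha, I_prime = photometric_alignment((I1, I2, I3), I_input,
#                                        [(12, 40), (55, 80), (90, 17)])
```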

We address below the problems that arise when some of the object points are occluded from some of the light sources, and when the surface reflects light specularly. We then extend the results to deal with cases of changing spectral composition of light sources.

4.1. Attached and Cast Shadows

We have practically assumed that surfaces are convex, because the linear combination result requires that points be visible to the light sources.


In a general non-convex surface, object points may be occluded from some, or from all, of the light sources. This situation generally leads to two types of shadows, known as attached and cast shadows. A point P is in an attached shadow if the angle between the surface normal and the direction of the light source is obtuse (n_p · s < 0). An object point P is in a cast shadow if it is obstructed from the light source by another object or by part of the same object. An attached shadow, therefore, lies directly on the object, whereas cast shadows are thrown from one object onto another, or from one part onto another part of the same object (such as when the nose casts a shadow on the cheek under oblique illumination).

In the case of attached shadows, a correct reconstruction of the image grey-value at p does not require that the object point P be visible to the novel light source s, but only that it be visible to the model light sources s_1, s_2, s_3. If P is not visible to s, then the linear combination will produce a negative grey-value (because n_p · s < 0), which can be set to 0 for purposes of display or recognition.

If P is not visible to one of the model light sources, say s_1, then the linear combination of the three model images will produce I′(p) under a light source s′ which is the projection of s onto the subspace spanned by s_2, s_3. This implies that photometric alignment would perform best in the case where the novel light source direction s is within the cone of directions s_1, s_2, s_3.
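To make the previous paragraph concrete, here is a small numerical illustration (our own notation and numbers, not from the paper): dropping the s_1 component of the novel source in the model basis leaves the part of s lying in the span of s_2 and s_3, which is the effective source "seen" by a point shadowed in the first model image.

```python
import numpy as np

s1 = np.array([1.0, 0.2, 0.9])
s2 = np.array([0.1, 1.0, 0.8])
s3 = np.array([-0.3, 0.4, 1.0])
s = np.array([0.2, 0.5, 1.3])                 # novel light source

# Express s in the model basis: s = a1*s1 + a2*s2 + a3*s3.
a = np.linalg.solve(np.stack([s1, s2, s3], axis=1), s)

# A point shadowed in model image 1 contributes I1(p) = 0, so its reconstructed
# grey-value behaves as if lit by the component of s lying in span{s2, s3}.
s_effective = a[1] * s2 + a[2] * s3
```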

The remaining case is when the object point P is in a cast shadow region with respect to the novel light direction. In this case there is no way to predict a low, or zero, grey-value for I′(p) and the reconstruction will not match I(p) in that region. Therefore, cast shadow regions in the novel image are not modeled in this framework, and hence the performance degrades with an increasing number and extent of cast shadows in the novel image.

With regard to human vision, there appears to be a marked increase in difficulty in interpreting cast shadows compared to attached shadows. Arnheim (1954) discusses the effect of cast shadows on visual perception, its relation to chiaroscuro in Renaissance art, and its symbolism in various cultures. He points out that cast shadows often interfere with the object’s integrity, whereas attached shadows are often perceived as an integral part of the object. The general observation is that the more the cast shadow extends from the part that throws it, the less meaningful is the connection made with the object. The interpretability of cast shadows can be illustrated by the ‘Ken’ images displayed in Fig. 3. The three model images have extensive attached shadows that appear naturally integrated with the object. The cast shadow region thrown from the nose in the image on the right appears less integrated with the overall composition of the image.

In conclusion, attached shadows in the novel image, or shadows in general in the model images, do not have significant adverse effects on the photometric alignment scheme. Cast shadows in the novel image cannot be reconstructed or even approximated, and therefore are not modeled in this framework. It may be noted that there is apparently a perceptual difference between attached and cast shadows, whereby the latter may appear to be disconnected from the object upon which they are cast.

4.2. Detecting and Removing Specular Reflections

The linear combination result and the photometric alignment scheme that followed assume that objects are matte. In general, inhomogeneous surfaces are dominantly Lambertian, except for isolated regions that are specularly reflecting light. In practice, if the specular component is ignored, the reconstructed image contains the specular regions of all three model images combined together, and the specular regions of the novel image are not reconstructed. For purposes of recognition, as long as the specular regions are relatively small, they do not seem to have a significant adverse effect on the overall photometric alignment scheme. Nevertheless, the alignment method can be used to detect the specular regions and replace them with the Lambertian reflectance, provided that four images are used.

The detection of specular points is based on the observation that if a point is in the specular lobe, then it is likely to be so in at most one of the images. This is because the specular lobe occupies a region that falls off exponentially from the specular direction. In general we cannot detect the specular points by simply comparing grey-values in one image with the grey-values of the same points in the other images, because the intensity of the light source may change arbitrarily from one image to another.

By using Proposition 1, that is, the result that three images uniquely determine the Lambertian component of the fourth image, we can compare the reconstructed intensity of the fourth image with the observed intensity and check for significant deviations.


For every point p, we select the image with the highest intensity, call it I_s, and reconstruct I′_s(p) from the other three images (we recover the coefficients once, based on points that are not likely to be specular or shadowed, i.e., that do not have an especially high or low intensity). If I_s(p) is in the specular lobe, then I′_s(p) ≪ I_s(p). To avoid deviations that are a result of shadowed points, we apply this procedure to points for which none of the images has an especially low grey-value.

In practice we observe that the deviations that occur at specular points are an order of magnitude higher than deviations anywhere else, which makes it relatively easy to select a threshold for deciding what is specular and what is not. A similar approach for detecting specular points was suggested by Coleman and Jain (1982), based on photometric stereo (Woodham, 1980). The idea is to have four images and to reconstruct the normal at each point from every subset of three images. If the point in question is not significantly specular, then the reconstructed normals should have the same direction and length; otherwise the point is likely to be specular. Their method, however, requires knowledge of the direction and intensity of the light sources, whereas ours does not.
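A sketch of this four-image detection procedure, in our own simplified form (the thresholds low, high and deviation_thresh are illustrative parameters, not values from the paper): the brightest observation at each pixel is compared with its Lambertian reconstruction from the other three images, and large positive deviations are flagged as specular.

```python
import numpy as np

def detect_speculars(images, deviation_thresh, low=20, high=235):
    """images: four aligned grey-level images (2D arrays) of the same object."""
    images = [im.astype(float) for im in images]
    stack = np.stack(images)                                 # 4 x H x W
    # "Safe" pixels: mid-range in every image, hence unlikely to be specular
    # or shadowed; used only to recover the linear coefficients.
    safe = np.all((stack > low) & (stack < high), axis=0)
    coeffs = []
    for k in range(4):
        others = [images[j] for j in range(4) if j != k]
        A = np.stack([im[safe] for im in others], axis=1)    # N x 3
        b = images[k][safe]
        coeffs.append(np.linalg.lstsq(A, b, rcond=None)[0])
    # Per pixel, compare the brightest observation with its reconstruction from
    # the other three images; ignore pixels that are very dark in some image.
    brightest = np.argmax(stack, axis=0)
    not_shadowed = np.all(stack > low, axis=0)
    specular = np.zeros(brightest.shape, dtype=bool)
    for k in range(4):
        others = [images[j] for j in range(4) if j != k]
        recon = sum(c * im for c, im in zip(coeffs[k], others))
        specular |= (brightest == k) & not_shadowed & (images[k] - recon > deviation_thresh)
    return specular
```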

4.3. Some Experiments

We used the three ‘Ken’ images displayed in Fig. 3 as model images for the photometric alignment scheme. The surface of the doll is non-convex and not purely matte, which gives rise to specular reflections and shadows. The novel image (shown in Fig. 9) was taken using light source directions that were within the cone of directions used to create the model images. In principle, one can use novel light source directions that are outside the cone of directions, but that will increase the likelihood of creating new cast shadow regions. The reconstruction was based on a least squares solution using eight points. The points were chosen automatically by searching for smooth regions of image intensity. The search was restricted to the area of the face, not including the background. To minimize the chance of selecting shadowed or specular points, a point was considered an admissible candidate if it was contained in an 8 × 8 sized smooth area and its intensity was not at the low or high end of the spectrum. We then selected eight points that were widely separated from each other. The reconstructed image (linear combination of the three model images) is displayed in Fig. 9 together with its step edges. The novel and reconstructed images are visually very similar at the grey-value level, and even more so at the edge-map level. The difference between the two edge maps is negligible and is mostly due to quantization of pixel locations.

Figure 9. Reconstructing a novel image. Row 1 (left to right): a novel image taken with two point light sources, and the reconstructed image (linear combination of the three model images). Row 2: step edges of the novel and reconstructed images. Row 3: overlaying both edge maps, and subtracting (xor operation) the edge maps from each other. The difference between the images, both at the grey-scale and edge level, is hardly noticeable.

In conclusion, this result shows that for the purposes of recognition, the existence of shadows and (small) specular regions in the model images does not have a significantly adverse effect on the reconstruction. Moreover, we did not use a matte surface for the experiment, illustrating the point that plastic surfaces are dominantly Lambertian, and therefore the method applies to them sufficiently well.
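For completeness, a rough sketch of the automatic point selection used in the experiment above; the smoothness measure (local standard deviation over an 8 × 8 window), the intensity bounds and the separation distance are our own illustrative choices, and the restriction to the face region is omitted here.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def pick_sample_points(image, n_points=8, smooth_thresh=3.0, low=40, high=215, min_dist=30):
    img = image.astype(float)
    # Local mean and standard deviation over an 8x8 window.
    mean = uniform_filter(img, size=8)
    mean_sq = uniform_filter(img ** 2, size=8)
    std = np.sqrt(np.maximum(mean_sq - mean ** 2, 0.0))
    # Admissible candidates: smooth neighbourhood, mid-range intensity.
    candidates = np.argwhere((std < smooth_thresh) & (img > low) & (img < high))
    # Greedily keep candidates that are far from the points already chosen.
    chosen = []
    for c in candidates:
        if all(np.linalg.norm(c - q) > min_dist for q in chosen):
            chosen.append(c)
        if len(chosen) == n_points:
            break
    return [tuple(p) for p in chosen]
```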

Figure 10 demonstrates the specular detection scheme. The method appears to be successful in identifying small specular regions.


Figure 10. Detecting and removing specular regions. Row 1: the image on the left is a novel image, and the one on the right is the same image following the procedure for detecting and removing the specular regions. The specular regions are replaced with the reconstructed grey-values from the model images. Row 2: the specular regions that were detected in the image.

Other schemes for detecting specular regions using the dichromatic model of reflection often require a relatively large region of analysis and, therefore, would have difficulties in detecting small specular regions (Klinker et al., 1990; Shafer, 1985).

4.4. The Linear Combination of Color Bands

The photometric problem considered so far involved only changes in the direction and intensity of light sources, but not changes in their spectral compositions. Light sources that change their spectral composition are common; for example, sunlight changes its spectral composition depending on the time of day (because of scattering). The implication for recognition, however, is not entirely clear because there may be an adaptation factor involved rather than an explicit process of eliminating the effects of illumination. Adaptation is not a possibility when it comes to changing the direction of the light source, because objects are free to move in space and hence change their positions with respect to the light sources. Nevertheless, it is of interest to explore the possibility of compensating for changing spectral composition as well as direction of light sources.

We assume, for reasons that will be detailed below, that our surface is either neutral, or is of the same color everywhere, but may change in luminosity. A neutral surface is a grey-scale surface that affects only the scale of the light falling on the surface, not its spectral composition. For example, the shades of grey from white to black are all neutral. Note that the assumption is weaker than the uniform albedo assumption because we allow changes in luminosity, but is less general than what we had previously because we do not allow changes in hue or saturation to occur across the surface. We also assume that our model of the object consists of a single color image obtained by overlaying three color images of the object, each taken with a distinct direction of light source having a distinct spectral composition.

Let I_r, I_g, I_b be the three color bands that together define the color picture. Let δ_p ρ(λ) be the surface reflectance function, where δ_p is the surface albedo and ρ(λ) is the spectral reflectance function of wavelength λ. Note that the neutral surface assumption means that across the surface ρ(λ) is fixed, but δ_p may change arbitrarily. Let S_1(λ), S_2(λ), S_3(λ) be the spectral compositions of the three light sources, and s_1, s_2, s_3 be their directions. As before, we require that the directions be non-coplanar, and that the spectral compositions be different from each other. This, however, does not mean that the three spectral functions should form a basis (such as is required in some color constancy models (Maloney and Wandell, 1986)). Finally, let R_r(λ), R_g(λ), R_b(λ) be the spectral sensitivity functions of the three CCD filters (or retinal cones). The composite color picture (taking the picture separately under each light source, and then combining the results) is, therefore, determined by the following equation:

  [I_r(p), I_g(p), I_b(p)]^t =
      [∫S_1(λ)ρ(λ)R_r(λ)dλ, ∫S_1(λ)ρ(λ)R_g(λ)dλ, ∫S_1(λ)ρ(λ)R_b(λ)dλ]^t (n_p · s_1)
    + [∫S_2(λ)ρ(λ)R_r(λ)dλ, ∫S_2(λ)ρ(λ)R_g(λ)dλ, ∫S_2(λ)ρ(λ)R_b(λ)dλ]^t (n_p · s_2)
    + [∫S_3(λ)ρ(λ)R_r(λ)dλ, ∫S_3(λ)ρ(λ)R_g(λ)dλ, ∫S_3(λ)ρ(λ)R_b(λ)dλ]^t (n_p · s_3),


where the length of n_p is δ_p. This can be re-written in matrix form as follows:

  [I_r(p), I_g(p), I_b(p)]^t = v (n_p · s_1) + u (n_p · s_2) + w (n_p · s_3)
                             = [v, u, w] [s_1, s_2, s_3]^t n_p = A n_p,

where v, u, w denote the three integral vectors above (in the same order), and [s_1, s_2, s_3]^t is the 3 × 3 matrix whose rows are the light source directions.

The 3 × 3 matrix [v, u, w] is assumed to be non-singular (for that reason we required that the spectral compositions of the light sources be different from one another), and therefore the matrix A is also non-singular. Note that because of the assumption that the surface is neutral, the matrix A is independent of position. Consider any novel image of the same surface, taken under a new direction of light source with a possibly different spectral composition. Let the novel picture be J_r, J_g, J_b. The red color band, for instance, can be represented as a linear combination of the three color bands I_r, I_g, I_b, as follows:

  J_r(p) = [∫S(λ)ρ(λ)R_r(λ)dλ] n_p · s
         = n_p · (α_1 A_1 + α_2 A_2 + α_3 A_3)
         = α_1 I_r(p) + α_2 I_g(p) + α_3 I_b(p),

where A_1, A_2, A_3 are the rows of the matrix A. Because A is non-singular, the row vectors form a basis that spans the vector [∫S(λ)ρ(λ)R_r(λ)dλ] s with some coefficients α_1, α_2, α_3. These coefficients are fixed for all points in the red color band because the scale ∫S(λ)ρ(λ)R_r(λ)dλ is independent of position (the neutral surface albedo δ_p is associated with the length of n_p). Similarly, the remaining color bands J_g, J_b are also represented as linear combinations of I_r, I_g, I_b, but with different coefficients. We have, therefore, arrived at the following result:

Proposition 2. An image of a Lambertian object with a neutral surface reflectance (grey-scale surface) taken under an arbitrary point light source condition (intensity, direction and spectral composition of the light source) can be represented as a linear combination of the three color bands of a model picture of the same object taken under three point light sources having different (non-coplanar) directions and different spectral compositions.

For a neutral surface, the linear combination of color bands can span only images of the same surface with the same hue and saturation under varying illumination conditions. The combination of color bands of a non-neutral surface spans the space of illumination and color (hue and saturation). That is, two surfaces with the same structure but with different hue and saturation levels are considered the same under the photometric alignment scheme.
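A minimal sketch (our own notation) of Proposition 2 in use: each band of a novel color image is spanned by the three bands of the single composite model image, so one coefficient triple is recovered per novel band from a few sampled pixels.

```python
import numpy as np

def align_color_band(model_rgb, novel_band, sample_points):
    """model_rgb: (Ir, Ig, Ib) bands of the composite model picture;
    novel_band: one band (e.g., Jr) of the novel image."""
    Ir, Ig, Ib = model_rgb
    A = np.array([[Ir[p], Ig[p], Ib[p]] for p in sample_points])
    b = np.array([novel_band[p] for p in sample_points])
    alpha, *_ = np.linalg.lstsq(A, b, rcond=None)
    # Reconstruct the novel band as a linear combination of the model bands.
    return alpha[0] * Ir + alpha[1] * Ig + alpha[2] * Ib
```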

5. Photometric Alignment with Reduced Images

We have seen in Section 2 empirical evidence suggesting that in some cases the process responsible for factoring out the illumination during the recognition process appears to require more than just contour information, but only slightly more. So far we have proposed a scheme which can directly factor out the illumination during the model-to-image matching stage by using the information contained in the grey-values of the model and novel images.

In this section we explore the possibility of using less than grey-values for the purpose of factoring out the illumination. In other words, since the photometric alignment method is essentially about recovering the linear coefficients that represent the novel image as a linear combination of the three model images, the question is whether those coefficients can be recovered by observing more reduced representations of the novel image, such as edges, edges and gradients, sign-bits, and so forth. Specifically, we are most interested in making a computational connection with the empirical observation that sign-bits appear to be sufficient for visual interpretation, whereas edges alone are not.

The proposition below shows that the level-crossing or zero-crossing contours of the novel image are, in principle, sufficient for recovering the linear coefficients for combining the model images.

Proposition 3. The coefficients that span an image I from three model images, as described in Proposition 1, can be solved for, up to a common scale factor, from just the contours of I, zero-crossings or level-crossings.

Proof: Let α_j be the coefficients that span I by the basis images I_j, j = 1, 2, 3, i.e., I = Σ_j α_j I_j. Let f, f_j be the result of applying a Laplacian of Gaussian (LoG) operator, with the same scale, to the images I, I_j, j = 1, 2, 3. Since the LoG is a linear operator we have f = Σ_j α_j f_j. Since f(p) = 0 along zero-crossing points p of I, then by taking three zero-crossing points, which are not on a cast shadow border and whose corresponding surface normals are non-coplanar, we get a homogeneous set of equations from which the α_j can be solved up to a common scale factor.

Similarly, let k be an unknown threshold applied to I. Along level-crossing points p of I we have k = Σ_j α_j I_j(p); hence four level-crossing points that are visible to all four light sources are sufficient for solving for the α_j and k. □
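A sketch of the two cases of Proposition 3 (our own helper functions, not from the paper): the coefficients are obtained, up to a common scale, as the null space of a small homogeneous system built from contour points of the novel image and the corresponding values of the (LoG-filtered or raw) model images.

```python
import numpy as np

def coeffs_from_zero_crossings(f1, f2, f3, zc_points):
    """f1..f3: LoG-filtered model images; zc_points: zero-crossing pixels of the
    LoG of the novel image. Returns alpha up to a common scale factor."""
    M = np.array([[f1[p], f2[p], f3[p]] for p in zc_points])
    _, _, vt = np.linalg.svd(M)
    return vt[-1]        # right singular vector of the smallest singular value

def coeffs_from_level_crossings(I1, I2, I3, lc_points):
    """lc_points: level-crossing pixels of the thresholded novel image.
    Returns (alpha, k) up to a common scale factor."""
    M = np.array([[I1[p], I2[p], I3[p], -1.0] for p in lc_points])
    _, _, vt = np.linalg.svd(M)
    v = vt[-1]
    return v[:3], v[3]
```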

The result is that in principle we could cancel the effects of illumination directly from the zero-crossings (or level-crossings) of the novel image instead of from the raw grey-values of the novel image. Note that the model images are represented as before by grey-values. Because the model images are taken only once, it is not unreasonable to assume stricter requirements on the quality of those images. We therefore make a distinction between the model acquisition, or learning, phase, for which grey-values are used, and the recognition phase, for which a reduced representation of the novel image is used.

The result that contours may be used instead of grey-values is not surprising at a theoretical level, considering the literature on image compression. Under certain restrictions on the class of signals, it is known that the zero-crossings form a complete representation of an arbitrary signal of that class. The case of one-dimensional bandpass signals, with certain conditions on the signals’ Hilbert transform, is provided by Logan (1977). The more general case is approached by assuming the signal can be represented as a finite complex polynomial (Curtis et al., 1985; Sanz and Huang, 1989). Complex polynomials have the well-known property that they are fully determined by their analytic varieties (curves in the one-dimensional case) using analytic continuation methods (see, for example, Saff and Snider, 1976). It is well known that analytic continuation is an unstable process (Hille, 1962) and, therefore, the reconstruction of the image from its zero-crossings is likely to be unstable. Curtis et al. (1985) report, for instance, that zero-crossings must be recorded with great precision, at sub-pixel accuracy of 14 digits.

The result of Proposition 3 can be viewed as a model-based reconstruction theorem that applies to a much less restricted class of signals (images do not have to be bandpass, for instance). The process is much simpler, but on the other hand it is restricted to a specific model undergoing a restricted group of transformations (changing illumination). The simplicity of the model-based reconstruction, however, is not of great help in circumventing the problem of instability. Stability depends on whether contours are recorded accurately and whether those contours are invariant across the model images.

The assumption that the value of f at a zero-crossing location p is zero holds only for the sub-pixel location p. In other words, it is unlikely that f(p) = 0 at an integral location p. This introduces, therefore, a source of error whose magnitude depends on the ‘strength’ of the edge that gives rise to the zero-crossing in the signal f; that is, the sharper and stronger the discontinuity in image intensities along an edge in the image I, the larger the variance around f(p). This suggests that ‘weak’ edges, of more or less the same strength, should be sampled, so that by sampling more than the minimum required number of points the error can be canceled by a least squares solution.

The second source of error has to do with the stability of the particular edge under changing illumination. Assume, for example, that the zero-crossing at p (recorded accurately) is the result of a sharp change in surface reflectance. Although the image intensity distribution around p changes across the model images, the location of the discontinuity does not, i.e., the zero-crossing is stable. In this case we have f(p) = f_j(p) = 0, j = 1, 2, 3. Therefore, such a point will not contribute any information if recorded accurately, and will contribute pure noise if recorded with less than the required degree of accuracy. This suggests, therefore, that zero-crossings should be sampled along attached shadow contours or along valleys and ridges of image intensity (a valley or a ridge gives rise to two unstable zero-crossings (Moses, 1993)).

The situation with reconstruction from level-crossings is slightly different. The first source of error, related to the accuracy in recording the location of level-crossings, still applies, but the second source does not. In general, the variance in intensity around a level-crossing point p is not as high as the variance around an edge point. A random sampling of points for a least squares solution, however, is not likely to have a zero-mean error, and the mean error would therefore be absorbed in the unknown threshold k.


error, however, and the mean error would therefore beabsorbed in the unknown thresholdk. The least squaressolution would be biased towards a zero mean error so-lution that will affect both the recovered threshold andthe linear coefficientsα j . The solution, therefore, doesnot necessarily consist of a correct set of coefficientsand a slightly off thresholdk, but a mixture of both in-accurate coefficients and an inaccurate threshold. Thisimplies that level-crossings should be sampled at loca-tions that do not correspond to zero-crossings in orderto minimize the magnitude of errors.

To summarize, the reconstruction of the novel image from three model images and the contours of the novel image is possible in principle. In the case of both zero-crossings and level-crossings, the locations of the contours must be recorded at sub-pixel accuracy. In the case of zero-crossings, another source of potential error arises, which is related to the stability of the zero-crossing location under changing illumination. Therefore, a stable reconstruction requires a sample of points along weak edges that correspond to attached shadow contours or to ridges and valleys of intensity. Alternatively, the locations of contour points must be recorded at sub-pixel accuracy, given also that the sample is large enough to contain unstable points with respect to illumination. Experimental results show that a random sample of ten points (spread evenly all over the object) with accuracy of two digits for zero-crossings and one digit for level-crossings is sufficient to produce results comparable to those produced from sampling image intensities directly. The performance with integral locations of points sampled over edges p that have no corresponding edges in a 3 × 3 window around p in any of the model images was not satisfactory.

These results show that reconstruction from contours does not appear to be generally useful for the photometric alignment scheme because of its potential instability. It is also important to note that in these experiments the viewing position is fixed, thereby eliminating the correspondence problem that would arise otherwise and would most likely increase the magnitude of errors.

5.1. Photometric Alignment from Contours and Gradients

When zero-crossings are supplemented with gradient data, the reconstruction no longer suffers from the two sources of error that were discussed in the previous section. We can use gradient data to solve for the coefficients, because the operation of taking derivatives (continuous and discrete) is linear and therefore leaves the coefficients unchanged. The accuracy requirement is relaxed because the gradient data is associated with the integral location of contour points, not with their sub-pixel location. Stable zero-crossings do not affect the reconstruction, because the gradient depends on the distribution of grey-values in the neighborhood of the zero-crossing, and the distribution changes with a change in illumination (even though the location of the zero-crossing may not change).
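A minimal sketch of this idea, assuming the gradients of the filtered model images and of the filtered novel image have been sampled (as arrays) at the integer contour locations; each contour point then contributes two linear equations in the three coefficients:

```python
import numpy as np

def coeffs_from_gradients(grad_novel, grad_models):
    """Solve for the coefficients from gradient data at contour points.

    grad_novel  : (n, 2) array, gradient of the filtered novel image at
                  n contour points (integer pixel locations).
    grad_models : (n, 2, 3) array, gradients of the three filtered model
                  images at the same points.
    Differentiation is linear, so grad_novel ~ sum_j a_j * grad_models[:, :, j];
    stacking the x and y components gives 2n equations in three unknowns.
    """
    A = grad_models.reshape(-1, 3)          # (2n, 3)
    b = grad_novel.reshape(-1)              # (2n,)
    alpha, *_ = np.linalg.lstsq(A, b, rcond=None)
    return alpha
```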

Errors, however, may be more noticeable once we allow changes in viewing positions in addition to changes in illumination. Changes in viewing positions may introduce errors in matching edge points across images. Because the change in image intensity distribution around an edge point is localized and may change significantly at nearby points, errors in matching edge points across the model images may lead to significant errors in the contribution those points make to the system of equations.

5.2. Photometric Alignment from Sign-Bits

Reconstruction from contours, general or model-based, appears to rely on the accurate location of contours. This reliance, however, seems to be at odds with the intuitive interpretation of Mooney-type pictures, like those in Fig. 6. These images suggest that, instead of contours being the primary vehicle for shape interpretation, the regions bounded by the contours (the sign-bit regions) are primarily responsible for the interpretation process. Thus instead of a contour-based technique we investigate below an area-based technique arising from the use of sign-bits. It is worth noting that the "sign-bit correlation" method for stereo matching proposed by Nishihara (1984) was advocated on similar grounds. Nishihara's conclusion was that the sign-bits contributed to increased stability (because regions change less than contours do); these conclusions are similar to what we find below. It is also worthwhile noting that, theoretically speaking, only one bit of information is added in the sign-bit displays. This is because zero-crossings and level-crossings form nested loops (Koenderink and Van Doorn, 1980), and therefore the sign-bit function is completely determined up to a common sign flip. In practice, however, this property of contours does not emerge from edge detectors because weak contours are often thresholded out as they tend to be the most sensitive to noise (see, for example, Fig. 3). This may also explain why our visual system apparently does not use this property of contours. We therefore do not make use of the global property of the sign-bit function; rather, we treat it as a local source of information, i.e., one bit of information per pixel.

Because the location of contours is an unreliable source of information, especially when the effects of changing viewing positions are considered, we propose to rely instead only on the sign-bit source of information. From a computational standpoint, the only information that a point inside a region can provide is whether the function to be reconstructed (the filtered image f, or the thresholded image I) is positive or negative (or above/below threshold). This information can be incorporated in a scheme for finding a separating hyperplane, as suggested in the following proposition:

Proposition 4. Solving for the coefficients from the sign-bit image of I is equivalent to solving for a separating hyperplane in 3D or 4D space in which image points serve as "examples".

Proof: Let z(p) = (f_1, f_2, f_3)^T be a vector function and ω = (α_1, α_2, α_3)^T be the unknown weight vector. Given the sign-bit filtered image f of I, we have that for every point p, excluding zero-crossings, the scalar product ω^T z(p) is either positive or negative. In this respect, points in f can be considered as "examples" in 3D space and the coefficients α_j as a vector normal to the separating hyperplane. Similarly, the reconstruction of the thresholded image I can be represented as a separating hyperplane problem in 4D space, in which z(p) = (I_1, I_2, I_3, −1)^T and ω = (α_1, α_2, α_3, k)^T. □

The contours lead to a linear system of equations, whereas the sign-bits lead to a linear system of inequalities. The solution of a linear system of inequalities Aw < b can be approached using Linear Programming techniques or using Linear Discriminant Analysis techniques (see (Duda and Hart, 1973) for a review). Geometrically, the unknown weight vector w can be considered as the normal direction to a plane, passing through the origin, in 3D Euclidean space, and a solution is found in such a way that the plane separates the "positive" examples, ω^T z(p) > 0, from the "negative" examples, ω^T z(p) < 0. In the general case, where b ≠ 0, the solution is a point inside a polytope whose faces are planes in 3D space.

The most straightforward solution is known as the perceptron algorithm (Rosenblatt, 1962). The basic perceptron scheme proceeds by iteratively modifying the estimate of w by the following rule:

w_{n+1} = w_n + Σ_{i ∈ M} z_i

where w_n is the current estimate of w, and M is the set of examples z_i that are incorrectly classified by w_n. The critical feature of this scheme is that it is guaranteed to converge to a solution, irrespective of the initial guess w_0, provided that a solution exists (the examples are linearly separable). Another well known method is to reformulate the problem as a least squares optimization problem of the form

min_w |Aw − b|^2

where the i'th row of A is z_i, and b is a vector of arbitrarily specified positive constants (often b = 1). The solution w can be found using the pseudoinverse of A, i.e.,

w = A^+ b = (A^T A)^{-1} A^T b,

or iteratively through a gradient descent procedure, which is known as the Widrow-Hoff procedure. The least squares formulation is not guaranteed to find a correct solution but has the advantage of finding a solution even when a correct solution does not exist (a perceptron algorithm is not guaranteed to converge in that case).
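The following is a schematic implementation of both options for the sign-bit inequalities (a sketch under the notation above; the array shapes and the stopping rule are our own choices, not taken from the original experiments):

```python
import numpy as np

def perceptron(Z, s, n_iter=1000):
    """Batch perceptron (Rosenblatt, 1962) for the sign-bit inequalities.

    Z : (n, d) array of 'examples' z(p); d = 3 for the filtered image f,
        d = 4 (with a trailing -1 component) for the thresholded image I.
    s : (n,) array of observed sign-bits in {+1, -1}.
    Seeks w with sign(Z @ w) == s; converges if the examples are separable.
    """
    X = Z * s[:, None]                  # sign-normalize: want X @ w > 0
    w = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        misclassified = X @ w <= 0
        if not misclassified.any():
            break
        w = w + X[misclassified].sum(axis=0)   # w_{n+1} = w_n + sum of z_i
    return w

def least_squares_hyperplane(Z, s):
    """Pseudoinverse (Widrow-Hoff) alternative: min_w |Aw - b|^2 with b = 1."""
    A = Z * s[:, None]
    b = np.ones(len(Z))
    return np.linalg.pinv(A) @ b
```

In the 4D case the recovered vector packs both the coefficients α_j and the threshold k, so a synthesized thresholded image can then be formed by thresholding the linear combination of the model images at the recovered k.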

By using the sign-bits instead of the contours, we are trading a unique, but unstable, solution for an approximate, but stable, solution. The stability of reconstruction from sign-bits is achieved by sampling points that are relatively far away from the contours. This sampling process also has the advantage of tolerating a certain degree of misalignment between the images as a result of less than perfect correspondence due to changes in viewing position (this feature is discussed further in Section 7.4). Experimental results (see Figs. 11 and 12) demonstrate that 10 to 20 points, distributed over the entire object, are sufficient to produce results that are comparable to those obtained from an exact solution. The experiments were done on images of 'Ken' and on another set of face images taken from a plaster bust of Roy Lamson (courtesy of the M.I.T. Media Laboratory). Both the perceptron algorithm and the least-squares approach were implemented and both yielded practically the same results. The sample points were chosen manually, and over several trials we found that the reconstruction is not sensitive to the particular choice of sample points, as long as they are not clustered in a local area of the image and are sampled a few pixels away from the contours. The results show the reconstruction of a novel thresholded image from three model images. The linear coefficients and the threshold are recovered from the system of inequalities using a sample of 16 points; the model images are then combined and thresholded with the recovered threshold to produce a synthesized thresholded image. Recognition then proceeds by matching the novel thresholded image given as input against the synthesized image.

Figure 11. Reconstruction from sign-bits. Top row (left to right): the input novel image; the same image but with the sample points marked for display. Bottom row: the reconstructed image; the overlay of the original level-crossings and the level-crossings of the reconstructed thresholded image.

Figure 12. Reconstruction from sign-bits. Row 1: three model images. Row 2: novel image; thresholded input; reconstructed image (same procedure as described in the previous figure). Note that the left ear has not been reconstructed; this is mainly because the ear is occluded in two of the three model images. Row 3: the level-crossings of the novel input; level-crossings of the reconstructed image; the overlay of both level-crossing images.

6. The Geometric Source of Variability

So far, we have assumed that the object is viewed from a fixed viewing position, and allowed only photometric changes to occur. This restriction was convenient because it allowed us to combine the model images in a very simple manner. When changes in viewing positions are allowed to occur, the same image point across different projections no longer corresponds to the same object point. The simplest example is translation and rotation in the image plane, which occur when the object translates, rotates around the line of sight, and then orthographically projects onto the image. This transformation can easily be undone if we observe two corresponding points between the novel input image and the model images. In general, however, the effects of changing viewing positions may not be straightforward, as happens when the object rotates in depth and when the projection is perspective.

Because of the Lambertian assumption, we can treat the photometric and geometric sources of variability independently of each other. In other words, we have assumed photometric changes occurring in the absence of any geometric changes, and now we will assume that the different views are taken under identical illumination conditions. We can later combine the two sources of variability into a single framework which will allow us to compensate for both photometric and geometric changes occurring simultaneously.

The geometric source of variability raises two related issues. The first is establishing point-to-point correspondence between the model images. The second is, given correspondence between the model images, undoing the effects of the viewing transformation based on a small number of corresponding points between the novel view and the model views.

The process of "undoing" the effects of the changing viewing transformation between views is known as the "alignment" approach in recognition. Given a 3D model, or at least two model views in full correspondence, one can "re-project" the object onto the novel viewing position with the help of a small number of corresponding points. Recognition is achieved if the re-projected image is successfully matched against the input image. We refer from hereafter to the problem of predicting a novel view from a set of model views using a limited number of corresponding points as the problem of re-projection. The problem of finding a small number of corresponding points between two views is often referred to as Minimal Correspondence (Huttenlocher and Ullman, 1987).

The problem of establishing full correspondence between the model images requires not only undoing the effects of the viewing transformation, but knowledge of object structure as well. Given two views, there is no finite number of corresponding points that would determine uniquely all other correspondences, unless the object is planar.

We discuss these issues briefly in the next section, restricting the discussion to the case of orthographic views. A more detailed treatment of these issues, including perspective views, can be found in Shashua (1991, 1992, 1994, 1995; Shashua and Navab, 1996; Shashua and Toelg, 1994).

6.1. Re-Projection and Correspondence

Let O, P_1, P_2, P_3 be four non-coplanar object points, referred to as reference points, and let O', P'_1, P'_2, P'_3 be the coordinates of the reference points from the second camera position. Let b_1, b_2, b_3 be the affine coordinates of an object point of interest P with respect to the basis OP_1, OP_2, OP_3, i.e.,

OP = Σ_{j=1}^{3} b_j (OP_j),

where OP denotes the vector from O to P. Under parallel projection the viewing transformation between the two cameras can be represented by an arbitrary affine transformation, i.e., O'P' = T(OP) for some linear transformation T. Therefore, the coordinates b_1, b_2, b_3 of P remain fixed under the viewing transformation, i.e.,

O'P' = Σ_{j=1}^{3} b_j (O'P'_j).

Since depth is lost under parallel projection, we have a similar relation in image coordinates:

op = Σ_{j=1}^{3} b_j (op_j),      (1)

o'p' = Σ_{j=1}^{3} b_j (o'p'_j).      (2)

Given the corresponding points p, p' (in image coordinates), the two formulas (1), (2) provide four equations for solving for the three affine coordinates associated with the object point P that projects to the points p, p'. Furthermore, since the affine coordinates are fixed for all viewing transformations, we can predict the location p'' in a novel view by first recovering the affine coordinates from the two model views and then substituting them in the following formula:

o''p'' = Σ_{j=1}^{3} b_j (o''p''_j).

We have, therefore, a method for recovering affine coordinates from two views and a method for achieving re-projection given two model views (in full correspondence) and four corresponding points across the three views.
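A minimal numerical sketch of this recipe (our own variable layout, assuming all points are given in image coordinates): the four equations of (1)-(2) are solved in the least-squares sense for b_1, b_2, b_3, which are then substituted into the re-projection formula.

```python
import numpy as np

def affine_coords(o, P, o2, P2, p, p2):
    """Recover the affine coordinates b1, b2, b3 of a point from two views.

    o, o2 : (2,) image coordinates of the reference origin O in views 1 and 2.
    P, P2 : (3, 2) image coordinates of the reference points P1, P2, P3.
    p, p2 : (2,) image coordinates of the point of interest in both views.
    """
    A = np.vstack([(P - o).T, (P2 - o2).T])     # four equations, Eqs. (1)-(2)
    rhs = np.concatenate([p - o, p2 - o2])
    b, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return b

def reproject(b, o3, P3):
    """Predict the point's location in a third view from its affine coordinates."""
    return o3 + (P3 - o3).T @ b
```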

Assume we would like to find the corresponding point p' given that we know the correspondences due to the four reference points. It is clear that this cannot be done with the available information because a dimension is lost due to the projection from 3D to 2D (in fact, any number of corresponding points n would not be sufficient for determining the correspondence of the (n + 1)'th point (Aloimonos and Brown, 1989; Huang and Lee, 1989)). Since we do not have a sufficient number of observations to recover the affine coordinates, we look for an additional source of information.

We assume that both views are taken under similar illumination conditions:

I(x + Δx, y + Δy, t + Δt) = I(x, y, t),

where v = (Δx, Δy) is the displacement vector, i.e., p' = p + v. We assume the convention that the two views were taken at times t and t + Δt. A first order approximation by a Taylor series expansion leads to the following equation, which describes a linear approximation to the change of image grey-values at p due to motion:

∇I · v + I_t = 0,      (3)

where ∇I is the gradient at point p, and I_t is the temporal derivative at p. Equation (3) is known as the "constant brightness equation" and was introduced by Horn and Schunk (1981). In addition to assuming that the change in grey-values is due entirely to motion, we have assumed that the motion (or the size of view separation) is small, and that the surface patch at P is locally smooth.

The constant brightness equation provides only one component of the displacement vector v, the component along the gradient direction, or normal to the iso-brightness contour at p. This "normal flow" information is sufficient to uniquely determine the affine coordinates b_j at p, as shown next. By subtracting Eq. (1) from Eq. (2) we get the following relation:

v = Σ_{j=1}^{3} b_j v_j + (1 − Σ_j b_j) v_o,      (4)

where v_j (j = 0, . . . , 3) are the known displacement vectors of the points o, p_1, p_2, p_3. By substituting Eq. (4) in the constant brightness equation, we get a new equation in which the affine coordinates are the only unknowns:

Σ_j b_j [∇I · (v_j − v_o)] + I_t + ∇I · v_o = 0.      (5)

Equations (1) and (5) provide a complete set of linear equations to solve for the affine coordinates at all locations p that have a non-vanishing gradient that is not perpendicular to the direction of the epipolar line passing through p'. Once the affine coordinates are recovered, the location of p' immediately follows. We have, therefore, derived a scheme for obtaining full correspondence given a small number of known correspondences, and a scheme for re-projecting the object onto any third view, given four corresponding points with the third view (the affine coordinates b_1, b_2, b_3 are view independent). Both schemes can be considerably simplified by expressing the problems in terms of one unknown per image point (instead of three) as follows.

Let A and w be the six affine parameters determined (uniquely) from the three corresponding vectors op_j ↔ o'p'_j, j = 1, 2, 3, i.e.,

o'p'_j = A(op_j) + w,   j = 1, 2, 3.      (6)

For an arbitrary pair of corresponding points p, p', we have the following relation:

o'p' = Σ_{j=1}^{3} b_j (o'p'_j) = Σ_{j=1}^{3} b_j (A(op_j) + w) = A(op) + (Σ_j b_j) w,

or equivalently:

p' = [A(op) + o' + w] + γ_p w.      (7)

where γ_p = Σ_j b_j − 1 is view-point invariant and is unknown. Note that γ_p = 0 if the object point P is coplanar with the plane P_1 P_2 P_3 (reference plane), and therefore γ_p represents the relative deviation of P from the reference plane (Shashua, 1991; Koenderink and Van Doorn, 1991). A convenient way to view this result is that the location of the corresponding point p' is determined by a "nominal component", described by A(op) + o' + w, and a "residual parallax component", described by γ_p w. The nominal component is determined from the four known correspondences, and the residual component can be determined using the constant brightness Eq. (3) (for more details and discussion see (Shashua, 1991, 1992)):

γ_p = (−I_t − ∇I · [A − I](op)) / (∇I · w).
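As an illustration only (the bookkeeping below is obtained by substituting Eq. (7) into the constant brightness Eq. (3), and may be arranged differently from the compact formula printed above), the correspondence of a single point can be sketched as follows, assuming the nominal transform A, w and the image derivatives are given:

```python
import numpy as np

def residual_correspondence(p, o, o2, A, w, grad, It):
    """Locate p' in the second view via the nominal + residual decomposition.

    p, o, o2 : (2,) coordinates of the point and of the reference origin
               in the first and second views.
    A, w     : 2x2 matrix and (2,) vector of the nominal affine transform
               determined by the three reference correspondences (Eq. 6).
    grad, It : (2,) spatial gradient and temporal derivative of the image at p.
    Requires a non-vanishing gradient that is not perpendicular to w.
    """
    nominal = A @ (p - o) + o2 + w          # nominal component of p' (Eq. 7)
    v_nominal = nominal - p                 # nominal part of the displacement
    gamma = -(It + grad @ v_nominal) / (grad @ w)   # one equation in gamma_p
    return nominal + gamma * w, gamma       # p' and the affine depth gamma_p
```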

The "affine depth" γ_p can also be used to simplify the re-projection scheme onto a third view: since γ_p is invariant, it can be computed from the correspondence between the two model views and substituted in the equation describing the epipolar relation between the first and third (novel) view. The re-projection scheme can be further simplified (Shashua, 1992) to yield the "linear combination of views" of (Ullman and Basri, 1989) (also (Poggio, 1990)):

x'' = α_1 x' + α_2 x + α_3 y + α_4,      (8)

y'' = β_1 y' + β_2 x + β_3 y + β_4,      (9)

where the coefficients α_j, β_j are functions of the affine viewing transformations between the three views. The coefficients can be recovered from four corresponding points across the three views, and then used to generate p'' for every corresponding pair p ↔ p'.
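A hedged sketch of this re-projection step (plain least squares over n ≥ 4 points; the original experiments may have solved it differently):

```python
import numpy as np

def lcv_coefficients(pts1, pts2, pts3):
    """Recover the linear-combination-of-views coefficients of Eqs. (8)-(9).

    pts1, pts2 : (n, 2) corresponding points in the two model views (n >= 4).
    pts3       : (n, 2) their locations in the novel (third) view.
    Returns (alpha, beta), each of length 4.
    """
    x, y = pts1[:, 0], pts1[:, 1]
    xp, yp = pts2[:, 0], pts2[:, 1]
    ones = np.ones(len(pts1))
    alpha, *_ = np.linalg.lstsq(np.column_stack([xp, x, y, ones]),
                                pts3[:, 0], rcond=None)
    beta, *_ = np.linalg.lstsq(np.column_stack([yp, x, y, ones]),
                               pts3[:, 1], rcond=None)
    return alpha, beta

def lcv_predict(alpha, beta, pts1, pts2):
    """Generate p'' for every corresponding pair p <-> p'."""
    x, y = pts1[:, 0], pts1[:, 1]
    xp, yp = pts2[:, 0], pts2[:, 1]
    ones = np.ones(len(pts1))
    xpp = np.column_stack([xp, x, y, ones]) @ alpha
    ypp = np.column_stack([yp, x, y, ones]) @ beta
    return np.column_stack([xpp, ypp])
```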

These techniques extend to the general case of perspective views. Instead of affine coordinates one can recover projective coordinates from two views and eight corresponding points (Faugeras, 1992; Hartley et al., 1992; Shashua, 1994). The affine depth invariant γ_p turns into a projective invariant ("relative affine structure") (Shashua and Navab, 1996). Finally, the linear combination of views result of Ullman and Basri (1989) turns into a trilinear relation requiring seven matching points in general (instead of four), or into a bilinear relation, requiring five matching points, in case only the model views are orthographic (Shashua, 1995).

7. Combining Changes in Illumination with Changes in Viewing Positions: Experimental Results

We have described so far three components that are necessary building blocks for dealing with recognition via alignment under the geometric and photometric sources of variability. First is the component describing the photometric relation between three model images and a novel image of the object. Second is the component describing the geometric relation between two model views and a novel view of an object of interest. Third is the correspondence component with which it becomes possible to represent objects by a small number of model images. The geometric and photometric components were treated independently of each other. In other words, the photometric problem assumed the surface is viewed from a fixed viewing position. The geometric problem assumed that the views are taken under a fixed illumination condition, i.e., the displacement of feature points across the different views is due entirely to a change of viewing position. In practice, the visual system must confront both sources of variability at the same time. The combined geometric and photometric problem is defined below:

We assume we are given three model images of a 3D matte object taken under different viewing positions and illumination conditions. For any input image, determine whether the image can be produced by the object from some viewing position and by some illumination condition.

The combined problem definition suggests that the problem be solved in two stages: first, changes in viewing positions are compensated for, such that the three model images are aligned with the novel input image. Second, changes of illumination are subsequently compensated for, by using the photometric alignment method. In the following sections we describe several experiments with 'Ken' images, starting from the procedure that was used for creating the model images, followed by three recognition situations: (i) the novel input image is represented by its grey-levels, (ii) the input representation consists of sign-bits, and (iii) the input representation consists of grey-levels, but the model images are taken from a fixed viewing position (different from the viewing position of the novel image). In this case we make use of the sign-bits in order to achieve photometric alignment although the novel image is taken from a different viewing position.

7.1. Creating a Model of the Object

The combined recognition problem implies that the model images represent both sources of variability, i.e., be taken from at least two distinct viewing positions and from three distinct illumination conditions. The three model images displayed in the top row of Fig. 13 were taken under three distinct illumination conditions, and from two distinct viewing positions (23° apart, mainly around the vertical axis). In order to apply the correspondence method described in the previous section, we took an additional image in the following way. Let the three illumination conditions be denoted by the symbols S_1, S_2, S_3, and the two viewing positions be denoted by V_1, V_2. The three model images, from left to right, can be described by ⟨V_1, S_1⟩, ⟨V_2, S_2⟩ and ⟨V_1, S_3⟩, respectively. Since the first and third model images are taken from the same viewing position, the two images are already aligned. In order to achieve full correspondence between the first two model images, a fourth image ⟨V_2, S_1⟩ was taken. Correspondence between ⟨V_1, S_1⟩ and ⟨V_2, S_1⟩ was achieved via the correspondence method described in the previous section. Since ⟨V_2, S_1⟩ and ⟨V_2, S_2⟩ are from the same viewing position, the correspondence achieved previously holds also between the first and second model images. The fourth image ⟨V_2, S_1⟩ was then discarded and did not participate in subsequent recognition experiments.

7.2. Recognition from Grey-Level Images

Figure 13. Recognition from full grey-level novel image (see text for more detailed description). Row 1 (left to right): three model images (the novel image is shown in the third row, lefthand display). Row 2: view-compensated model images (all three model images are transformed, using four points, as if viewed from the novel viewing position). Row 3: novel image, edges of novel image, photometric alignment of the three view-compensated model images (both view and illumination compensated). Row 4: edges of the resulting synthesized image (third row, righthand), overlay of edges of novel and synthesized image.

The method for achieving recognition under both sources of variability is divided into two stages: first, the three model images are re-projected onto the novel image. This is achieved by first assuming minimal correspondence between the novel image and one of the model images. With minimal correspondence of four points across the images (model and novel) we can predict the new locations of model points that should match with the novel image (assuming orthographic projection). Second, photometric alignment is subsequently applied by selecting a number of points (no correspondence is needed at this stage because all images are now view-compensated) to solve for the linear coefficients. The three model images are then linearly combined to produce a synthetic image that is both view and illumination compensated, i.e., should match the novel image.
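The second (photometric) stage of this procedure is, in essence, a small linear solve; the sketch below assumes the model images have already been re-projected onto the novel viewing position (stage one) and that the sample points have been chosen from smooth intensity regions, both of which are outside the scope of the snippet:

```python
import numpy as np

def photometric_align(novel, warped_models, sample_pts):
    """Compensate for illumination given view-compensated model images.

    novel         : (H, W) novel grey-level image.
    warped_models : list of three (H, W) model images already re-projected
                    onto the novel viewing position.
    sample_pts    : iterable of (row, col) samples from smooth regions.
    Solves I(p) ~ sum_j a_j I_j(p) at the samples and returns the view- and
    illumination-compensated synthesized image.
    """
    A = np.array([[m[r, c] for m in warped_models] for (r, c) in sample_pts])
    b = np.array([novel[r, c] for (r, c) in sample_pts])
    alpha, *_ = np.linalg.lstsq(A, b, rcond=None)
    return sum(a * m for a, m in zip(alpha, warped_models))
```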

Figure 13 illustrates the chain of alignment transformations. The novel image, displayed in the third row left image, is taken from an in-between viewing position and illumination condition. Although, in principle, the recognition components are not limited to in-between situations, there are a few practical limitations. The more extrapolated the viewing position is, the more new object points appear and old object points disappear, and similarly, the more extrapolated the illumination condition is, the more new cast shadows are created (see Section 4.1). Minimal correspondence was achieved by manually selecting four points that corresponded to the far corners of the eyes, one eyebrow corner, and one mouth corner. The model views were re-projected onto the novel view, and their original grey-values retained. As a result, we have created three synthesized model images (shown in Fig. 13, second row) that are from the same viewing position as the novel image, but have different image intensity distributions due to changing illumination. The photometric alignment method was then applied to the three synthesized model images and the novel image, without having to deal with correspondence because all four images were already aligned. The sample points for the photometric alignment method were chosen automatically by searching over smooth regions of image intensity (as described in Section 4.3). The resulting synthesized image is displayed in Fig. 13, third row right image. The similarity between the novel and the synthesized image is illustrated by superimposing the step edges of the two images (Fig. 13, bottom row right image).

7.3. Recognition from Reduced Images

A similar procedure to the one described above can be applied to recognize a reduced novel image. In this case the input image is taken from a novel viewing position and illumination condition, followed by a thresholding operator (unknown to the recognition system). Figure 14 illustrates the procedure. We applied the linear combination method of re-projection (Ullman and Basri, 1989) and used more than the minimum required four points. In this case it is more difficult to reliably extract corresponding points between the thresholded input and the model images. Therefore, seven points were manually selected and their corresponding points were manually estimated in the model images. The linear combination method was then applied using a least squares solution for the linear coefficients to produce three synthesized view-compensated model images. The photometric alignment method from sign-bits was then applied (Section 5.2) using a similar distribution of sample points as shown in Fig. 11.

Figure 14. Recognition from a reduced image. Row 1 (left to right): novel thresholded image; its level-crossings (the original grey-levels of the novel image are shown in the previous figure, third row on the left). Row 2: the synthesized image produced by the recognition procedure; its level-crossings. Row 3: overlay of both level-crossings for purposes of verifying the match.

We consider next another case of recognition from reduced images, in which we make use of the property that exact alignment is not required when using sign-bits.

7.4. Recognition from a Single Viewing Position

Photometric alignment from sign-bits raises the possibility of compensating for changing illumination without an exact correspondence between the model images and the novel image. The reason lies in the way points are sampled for setting up the system of inequalities; that is, points are sampled relatively far away from the contours (see Section 5.2). In addition, the separation of image displacements into nominal and residual components (Section 6.1) suggests that in an area of interest bounded by at least three reference points, the nominal component alone may be sufficient to bring the model images close enough to the novel image so that we can apply the photometric alignment from sign-bits method.

Figure 15. Demonstrating the effect of applying only the nominal transformation between two distinct views. Row 1: edges of two distinct views. Row 2: overlay of both edge images, and overlay of the edges of the left image above and the nominally transformed righthand image.

Consider, for example, the effect of applying only the nominal transformation between two different views (Fig. 15). Superimposing the two views demonstrates that the displacement is concentrated mostly in the center area of the face (most likely the area in which we would like to select the sample points). By selecting three corresponding points covering the center area of the face (two extreme eye corners and one mouth corner), the 2D affine transformation (nominal transformation) accounts for most of the displacement in the area of interest at the expense of large displacements at the boundaries (Fig. 15, bottom row on the right). This is expected from the geometric interpretation of the affine depth γ_p, as it increases as the object gets farther from the reference plane (Koenderink and Van Doorn, 1991; Shashua, 1991).
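For illustration, the nominal transformation used here is just a 2D affine map fitted to the three chosen correspondences; the sketch below works directly in absolute image coordinates rather than relative to the reference origin of Eq. (6), which is a closely related parameterization:

```python
import numpy as np

def nominal_affine(src, dst):
    """Estimate the 2D affine (nominal) transformation from three points.

    src, dst : (3, 2) arrays of corresponding image points (e.g., the two
               extreme eye corners and one mouth corner in the two views).
    Returns A (2x2) and t (2,) such that dst_i ~ A @ src_i + t.
    """
    M = np.hstack([src, np.ones((3, 1))])    # rows [x, y, 1]
    sol = np.linalg.solve(M, dst)            # (3, 2): rows of A^T, then t
    return sol[:2].T, sol[2]
```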


Taken together, the use of sign-bits and the nominal transformation suggests that one can compensate for illumination and for relatively small changes in viewing positions from model images taken from the same viewing position. We apply first the nominal transformation to all three model images and obtain three synthesized images. We then apply the photometric alignment from sign-bits to recover the linear coefficients used for compensating for illumination. The three synthesized images are then linearly combined to obtain an illumination-compensated image. The remaining displacement between the synthesized image and the novel image can be recovered by applying the residual motion transformation (along the epipolar direction using the constant brightness equation).

Figure 16 illustrates the alignment steps. The three model images are displayed in the top row and are the same as those used in Section 4.3 for compensating for illumination alone. The novel image (second row, left display) is the same as in Fig. 13, i.e., it is taken from a novel viewing position and novel illumination condition. The image in the center of the second row illustrates the result of attempting to recover the correspondence (using the full correspondence method described in the previous section) between the novel image and one of the model images without first compensating for illumination. The image on the left in the third row is the result of first applying the nominal transformation to the three model images followed by the photometric alignment using the sign-bits (the sample points used by the photometric alignment method are displayed in the image on the right in the second row). The remaining residual displacement between the latter image and the novel image is recovered using the full correspondence method and the result is displayed in the center image in the third row. The similarity between the final synthesized image and the novel image is illustrated by superimposing their step edges (fourth row, right display).

Figure 16. Recognition from a single viewing position (see text for details).

8. Summary and Discussion

In this paper we addressed the connection between recognition of general 3D objects and the ability to create an equivalence class of images of the same object. Recognizing objects eventually reduces to comparing/matching images against each other or against models of objects. This can be viewed as comparing measurements (features, x, y positions of points, and so forth) that ideally must be selected or manipulated such that they remain invariant if coming from images of the same object.

We have distinguished two sources of variability against which invariant measurements are needed. One is the well known geometric problem of changing viewing positions between the camera and the object. The second is the photometric problem due to changing illumination conditions in the scene. Our emphasis in this paper was on the latter problem which, unlike the geometric problem of recognition, did not receive much attention in the past. The traditional assumption concerning the photometric problem is that one can recover a reasonably complete array of invariants just from a single image alone, such as the representations produced by edge detectors. It is interesting to note that in the earlier days of recognition, the geometric problem was approached in a similar manner by restricting the class of objects to polyhedra (the so-called "blocks world"). We have argued that complete invariance of edges is achieved when simple block-like objects are considered, whereas for more natural and complex objects, like a face, it may be necessary to explore model-based approaches, i.e., the photometric invariants are model-dependent.

Motivated by several empirical observations on human vision, and by computational arguments on the use of edge detection, we arrived at two observations: first, there appears to be a need for a model-based approach to the photometric problem. Second, the process responsible for factoring out the illumination during the recognition process appears to require more than contour information, but just slightly more.

We suggested a method, which we call photometric alignment, that is based on recording multiple images of the object. We do not use these pre-recorded images to recover intrinsic properties of the object, as in photometric stereo, but rather to directly compensate for the change in illumination conditions for any other novel image of the object. This difference is critical, as we are no longer bound by assumptions on the light source parameters (e.g., we do not need to recover light source directions) or assumptions on the distribution of surface albedo (e.g., arbitrary distributions of surface albedo are allowed). We have discussed the situations of shadows, specular reflections, and changing spectral compositions of light sources. In the case of shadows, we have seen that the alignment scheme degrades with increasing cast shadow regions in the novel input image. As a result, photometric alignment, when applied to general non-convex surfaces, is most suitable for reconstructing novel images whose illumination conditions are in between those used to create the model images. We have also seen that specular reflections arising from non-homogeneous surfaces can be detected and removed if necessary. Finally, the theory was extended to deal with color images and the case of changing spectral composition of light sources: the color bands of a single model image of a neutral surface can form a basis set for reconstructing novel images.

We next introduced two new results to explore the possibility of working with reduced representations instead of image grey-values, as suggested by empirical evidence from human vision (Section 2). First, step edges and level-crossings of the novel image are theoretically sufficient for the photometric alignment scheme. This result, however, assumes that edges are given at sub-pixel accuracy, a finding that implies difficulties in making use of this result in practice. Second, the sign-bit information can be used instead of edges. Photometric alignment using sign-bits is a region-based process by which points inside the binary regions of the sign-bit image are sampled and each contributes a partial observation. Taken together, the partial observations are sufficient to determine the solution for compensating for illumination. The more points sampled, the more accurate the solution. Experimental results show that a relatively small number of points (10 to 20) are generally sufficient for obtaining solutions that are comparable to those obtained by using the image grey-values. This method agrees with the empirical observations that were made in Section 2 regarding the possibility of having a region-based process rather than a contour-based one, the possibility of preferring sign-bits over edges, and the sufficiency of sign-bits for factoring out the illumination. The possibility of using sign-bits instead of edges also raises a potentially practical issue related to changing viewing positions. A region-based computation has the advantage of tolerating a small degree of misalignment between the images due to changing viewing positions. This finding implies that the illumination can be factored out even in the presence of small changes in viewing positions without explicitly addressing the geometric problem of compensating for viewing transformations, a property that was demonstrated experimentally in Section 7.4.

Finally, we have shown how the geometric, photometric, and correspondence components can be put together to address the case when both changes in illumination and changes in viewing positions occur simultaneously, i.e., recognition of an image of a familiar object taken from a novel viewing position and a novel illumination condition.

8.1. Issues of Future Directions

The ability to interpret Mooney images of faces may suggest that these images are an extreme case of a wider phenomenon. Some see it as a tribute to the human ability to separate shadow borders from object borders (Cavanagh, 1990); here we have noted that the phenomenon may indicate that in some cases illumination is factored out in a model-based manner and that the process responsible apparently requires more than just contour information, but only slightly more. A possible topic of future research in this domain would be to draw a connection, both at the psychophysical and computational levels, between Mooney images and more natural kinds of inputs. For example, images seen in newspapers, images taken under poor lighting, and other low-quality imagery have less shading information to rely on and their edge information may be highly unreliable, yet are interpreted without much difficulty by the human visual system. Another related example is the image information contained in draftsmen's drawings. Artists rarely use just contours in their drawings and rely on techniques such as "double stroking" to create a sense of relief (surface recedes towards the contours) and highlights to make the surface protrude. These pictorial additions that artists introduce are generally not interpretable at the level of contours alone, yet do not introduce any direct shading information. In other words, it would be interesting (and probably important on practical grounds) to discover a continuous transformation, a spectrum or scale-space of sorts, starting from high-quality grey-level imagery, producing midway low-quality imagery of the type mentioned above, and converging upon Mooney-type imagery.

Another related topic of future interest is the level at which sources of variability are compensated for. In this paper the geometric and photometric sources of variability were factored out based on connections between different images of individual objects. The empirical observations we used to support the argument that illumination should be compensated for in a model-based manner actually indicate that if indeed such a process exists, it is likely to take place at the level of classifying the image as belonging to a general class of objects, rather than at the level of identifying the individual object. This is simply because the Mooney images are of generally unfamiliar faces, and therefore, the only model-based information available is that we are looking at an image of a face. A similar situation may exist in the geometric domain as well, as it is known that humans can recognize novel views just from a single view of the object.

There are also questions of narrower scope related to the photometric domain that may be of general interest. The question of image representation in this paper was applied only to the novel image. A more general question should apply to the model acquisition stage as well. In other words, what information needs to be extracted from the model images, at the time of model acquisition, in order to later compensate for photometric effects? This question applies to both the psychophysical and computational aspects of the problem. For example, can we learn to generalize to novel images just from observing many Mooney-type images of the object (changing illumination, viewing positions, threshold, and so forth)? A more basic question is whether the Mooney phenomenon is limited exclusively to faces. And if not, what level of familiarity with the object, or class of objects, is necessary in order to generalize to other Mooney-type images of the same object, or class of objects?

At a more technical level, there may be interest in further pursuing the use of sign-bits. The sign-bits were used as a source of partial observations that, taken together, can restrict sufficiently well the space of possible solutions for the photometric alignment scheme. In order to make further use of this idea, and perhaps apply it to other domains, the question of how to select sample points, and the number and distribution of sample points, should be addressed in a more systematic manner.

Acknowledgments

I thank my advisor Shimon Ullman for allowing me to benefit from his knowledge, experience and financial support during my doctoral studies. Thanks to William T. Freeman for producing the contour images in Fig. 4.

References

Aloimonos, J. and Brown, C.M. 1989. On the kinetic depth effect. Biological Cybernetics, 60:445–455.

Arnheim, R. 1954. Art and Visual Perception: A Psychology of the Creative Eye. University of California Press: Berkeley, CA.

Cavanagh, P. 1990. What's up in top-down processing? In Proceedings of the XIIIth ECVP, A. Gorea (Ed.), pp. 1–10.

Coleman, E.N., Jr. and Jain, R. 1982. Obtaining 3-dimensional shape of textured and specular surfaces using four-source photometry. Computer Graphics and Image Processing, 18:309–328.

Curtis, S.R., Oppenheim, A.V., and Lim, J.S. 1985. Signal reconstruction from Fourier transform sign information. IEEE Transactions on Acoustic Speech and Signal Processing, ASSP-33:643–657.

Duda, R.O. and Hart, P.E. 1973. Pattern Classification and Scene Analysis. John Wiley: New York.

Farah, M.J. 1990. Visual Agnosia. MIT Press: Cambridge, MA.

Faugeras, O.D. 1992. What can be seen in three dimensions with an uncalibrated stereo rig? In Proceedings of the European Conference on Computer Vision, Santa Margherita Ligure, Italy, pp. 563–578.

Fischler, M.A. and Bolles, R.C. 1981. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24:381–395.

Freeman, W.T. and Adelson, E.H. 1991. The design and use of steerable filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-13.

Gilchrist, A.L. 1979. The perception of surface blacks and whites. SIAM J. Comp., pp. 112–124.


Hartley, R., Gupta, R., and Chang, T. 1992. Stereo from uncalibrated cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Champaign, IL, pp. 761–764.

Hille, E. 1962. Analytic Function Theory. Ginn/Blaisdell: Waltham, MA.

Horn, B.K.P. 1986. Robot Vision. MIT Press: Cambridge, MA.

Horn, B.K.P. and Schunk, B.G. 1981. Determining optical flow. Artificial Intelligence, 17:185–203.

Huang, T.S. and Lee, C.H. 1989. Motion and structure from orthographic projections. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-11:536–540.

Huttenlocher, D.P. and Ullman, S. 1987. Object recognition using alignment. In Proceedings of the International Conference on Computer Vision, London, pp. 102–111.

Huttenlocher, D.P. and Ullman, S. 1990. Recognizing solid objects by alignment with an image. International Journal of Computer Vision, 5(2):195–212.

Kinsbourne, M. and Warrington, E.K. 1962. A disorder of simultaneous form perception. Brain, 85:461–486.

Klinker, G.J., Shafer, S.A., and Kanade, T. 1990. A physical approach to color image understanding. International Journal of Computer Vision, 4:7–38.

Knill, D.C. and Kersten, D. 1991. Apparent surface curvature affects lightness perception. Nature, 351:228–230.

Koenderink, J.J. and Van Doorn, A.J. 1980. Photometric invariants related to solid shape. Optica Acta, 27:981–986.

Koenderink, J.J. and Van Doorn, A.J. 1991. Affine structure from motion. Journal of the Optical Society of America, 8:377–385.

Land, E.H. and McCann, J.J. 1971. Lightness and retinex theory. Journal of the Optical Society of America, 61:1–11.

Logan, B. 1977. Information in the zero-crossings of band pass signals. Bell Syst. Tech. J., 56:510.

Lowe, D.G. 1985. Perceptual Organization and Visual Recognition. Kluwer Academic Publishing: Hingham, MA.

Maloney, L.T. and Wandell, B. 1986. A computational model of color constancy. Journal of the Optical Society of America, 1:29–33.

Marr, D. 1979. Early processing of visual information. Proceedings of the Royal Society of London B, 175:483–534.

Marr, D. 1982. Vision. W.H. Freeman and Company: San Francisco, CA.

Marr, D. and Hildreth, E.C. 1980. Theory of edge detection. Proceedings of the Royal Society of London B, 207:187–217.

Mooney, C.M. 1960. Recognition of ambiguous and unambiguous visual configurations with short and longer exposures. Brit. J. Psychol., 51:119–125.

Morrone, M.C. and Burr, D.C. 1988. Feature detection in human vision: A phase-dependent energy model. Proceedings of the Royal Society of London B, 235:221–245.

Moses, Y. 1993. Face recognition: Generalization to novel images. Ph.D. Thesis, Applied Math. and Computer Science, The Weizmann Institute of Science, Israel.

Mundy, J. and Zisserman, A. 1992. Geometric Invariances in Computer Vision. MIT Press: Cambridge.

Mundy, J.L., Zisserman, A., and Forsyth, D. 1994. Applications of Invariance in Computer Vision. Springer-Verlag: LNCS 825.

Nakayama, K., Shimojo, S., and Silverman, G.H. 1989. Stereoscopic depth: Its relation to image segmentation, grouping, and the recognition of occluded objects. Perception, 18:55–68.

Nishihara, H.K. 1984. Practical real-time imaging stereo matcher. Optical Engineering, 23(5):536–545.

Pearson, D.E., Hanna, E., and Martinez, K. 1986. Computer-generated cartoons. In Images and Understanding, H. Barlow, C. Blakemore, and M. Weston-Smith (Eds.), Cambridge University Press: New York, NY, 1990. Collection based on Rank Prize Funds' International Symposium, Royal Society.

Perona, P. and Malik, J. 1990. Detecting and localizing edges composed of steps, peaks and roofs. In Proceedings of the International Conference on Computer Vision, Osaka, Japan.

Poggio, T. 1990. 3D object recognition: On a result of Basri and Ullman. Technical Report IRST 9005-03.

Potter, M.C. 1975. Meaning in visual search. Science, 187:965–966.

Rosenblatt, F. 1962. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books: Washington, DC.

Saff, E.B. and Snider, A.D. 1976. Fundamentals of Complex Analysis. Prentice-Hall: New Jersey.

Sanz, J. and Huang, T. 1989. Image representations by sign information. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-11:729–738.

Shafer, S.A. 1985. Using color to separate reflection components. COLOR Research and Applications, 10:210–218.

Shashua, A. 1991. Correspondence and affine shape from two orthographic views: Motion and recognition. A.I. Memo No. 1327, Artificial Intelligence Laboratory, Massachusetts Institute of Technology.

Shashua, A. 1992. Geometry and photometry in 3D visual recognition. Ph.D. Thesis, M.I.T. Artificial Intelligence Laboratory, AI-TR-1401.

Shashua, A. 1994. Projective structure from uncalibrated images: Structure from motion and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(8):778–790.

Shashua, A. 1995. Algebraic functions for recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8):779–789.

Shashua, A. and Navab, N. 1996. Relative affine structure: Canonical model for 3D from 2D geometry and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(9).

Shashua, A. and Toelg, S. 1994. The quadric reference surface: Applications in registering views of complex 3D objects. In Proceedings of the European Conference on Computer Vision, Stockholm, Sweden.

Shashua, A. and Ullman, S. 1988. Structural saliency: The detection of globally salient structures using a locally connected network. In Proceedings of the International Conference on Computer Vision, Tampa, FL, pp. 321–327.

Shashua, A. and Ullman, S. 1991. Grouping contours by iterated pairing network. In Advances in Neural Information Processing Systems 3, R.P. Lippmann, J.E. Moody, and D.S. Touretzky (Eds.), Morgan Kaufmann Publishers: San Mateo, CA, pp. 335–341. Proceedings of the Third Annual Conference NIPS, Dec. 1990, Denver, CO.

Ullman, S. 1986. Aligning pictorial descriptions: An approach to object recognition. Cognition, 32:193–254, 1989. Also in MIT AI Memo 931.

Ullman, S. and Basri, R. 1989. Recognition by linear combination of models. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-13:992–1006, 1991. Also in M.I.T. AI Memo 1052.

Woodham, R.J. 1980. Photometric method for determining surface orientation from multiple images. Optical Engineering, 19:139–144.