


International Journal of Computer Vision 67(1), 71–91, 2006
© 2006 Springer Science + Business Media, Inc. Manufactured in The Netherlands.

DOI: 10.1007/s11263-006-4068-8

Shape from Texture without Boundaries

ANTHONY LOBAY AND D.A. FORSYTH
Computer Science Division, University of California at Berkeley, Berkeley, CA 94720, USA

[email protected]

Received January 22, 2005; Revised July 25, 2005; Accepted August 8, 2005

First online version published in February, 2006

Abstract. We describe a shape from texture method that constructs an estimate of surface geometry using only the deformation of individual texture elements. Our method does not need to use either the boundary of the observed surface or any assumption about the overall distribution of elements. The method assumes that surface texture elements are drawn from a number of different types, each of fixed shape. Neither the shape of the elements nor the number of types need be known in advance. We show that, with this assumption and assuming a generic, scaled orthographic view and texture, each type of texture element can be reconstructed in a frontal coordinate system from image instances. Interest-point methods supply a method of simultaneously obtaining instances of each texture element automatically and defining each type of element. Furthermore, image instances that have been marked in error can be identified and ignored using the Expectation-Maximization algorithm. A further EM procedure yields a surface reconstruction and a relative irradiance map from the data. We provide numerous examples of reconstructions for images of real scenes, show a comparison between our reconstruction and range maps, and demonstrate that the reconstructions display geometric and irradiance phenomena that can be observed in the original image.

Keywords: shape from texture, texture, computer vision, surface fitting, structure from motion, auto-calibration, interest point methods

1. Introduction

Texture is the result of a variety of phenomena. Spatial variation in the BRDF of a surface is one possibility; another is that the surface is rough at an appropriate scale, and surface elements either shadow or reflect light onto one another. One might have phenomena like the leaves on a bush, where we see elements whose only claim to being on a surface is that the texture they generate sometimes looks like they lie on a surface. Phenomena related to roughness and surface structure have been studied (Lu et al., 1998, 1999; Pont and Koenderink, 2002), but few algorithms for inferring surface structure from these phenomena are known. It is quite difficult to talk precisely about what phenomena should be thought of as surface roughness, as one has to compare the scale of the relief from the "surface" (a fairly dubious concept for, say, aerial views of mountains) with the viewing distance.

As is usual in the computer vision literature, we will restrict our attention to phenomena caused by variations in albedo alone. We assume that we have a Lambertian surface, which is smooth (in the sense of not being rough) at scales up to about that of the texture elements. The force of this assumption is that we can model the texture elements as lying on the tangent plane of the underlying surface. The albedo of this surface varies in some, possibly principled, way. We should like to infer information about the shape of this surface from these albedo variations, which we shall call "texture". We discuss alternate models of texture briefly in Section 6.

1.1. Shape from Texture

There is a substantial body of literature on shape from texture. It is usual to think of textures as being made up of instances of elements, though these elements may not necessarily be identified explicitly. These elements are distributed on the surface in some fashion. One can then reason about either the distortion of this distribution, or the distortion of the elements, or both.

1.1.1. Texture Terminology and a Texture Model. A point process is some random procedure that results in points lying on a surface (exact definitions involve tedious measure theory (Daley and Vere-Jones, 1988)). A marked point process is one where each point carries a mark, drawn randomly according to some mark density from an available collection (for example, points might be red or blue; rendered as squares or circles; etc.); we assume that this collection is discrete. There is a long tradition of using marked point processes as texture models (explicitly in, for example, Ahuja and Schachter (1983a, 1983b), Blake and Marinos (1990), Schachter (1980), Schachter and Ahuja (1979), and implicitly in pretty much all existing literature). A Poisson model has the property that the expected number of elements in a domain is proportional to the area of the domain. The constant of proportionality is known as the model's intensity. A texture is isotropic if the choice of element rotation is uniform and random, and is homogeneous if the density from which texture elements are drawn is independent of position on the surface.

We use a standard, general model of texture. We model a texture on a surface as a marked point process of unknown spatial properties. The marks are texture elements (texels or textons, as one prefers) and the orientation of those texture elements with respect to some surface coordinate system. We assume that the marks are drawn from some known, finite set of classes of Euclidean equivalent texels. Each mark is defined in its own coordinate system; the surface is textured by taking a mark, placing it on the tangent plane of the surface at the point being marked, translating the mark's origin to lie on the surface point being marked, and rotating randomly about the mark's origin (according to the mark distribution) (Fig. 1). The choice of rotation, etc. may follow some probabilistic model. We assume that these texture elements can be isolated. Furthermore, we assume that they are sufficiently small that they can be modelled as lying on a surface's tangent plane at a point.

Figure 1. The general texture imaging model we use. We assume that some texture element—which may belong to a discrete set of elements—is rotated in its coordinate frame and then placed on the tangent plane of the surface. This element is foreshortened by a map that is diagonal in the right coordinate frame (the slant-tilt frame) and the frame rotates to the image frame. Notice that the translation is ignored, because we can reasonably expect that detectors can identify the origin of the element's frame, and because it contains no geometric information.
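To make the model concrete, a minimal sampling sketch (our own illustration, not code from the paper; the function name and parameters are assumptions) draws a homogeneous Poisson number of points on a flat reference domain and attaches a discrete mark and a random rotation to each:

```python
# A sketch of the marked point process texture model (all names ours).
import numpy as np

rng = np.random.default_rng(0)

def sample_marked_point_process(width, height, intensity, n_types):
    """Homogeneous Poisson point process with discrete marks and a
    uniform random rotation about each mark's origin."""
    n = rng.poisson(intensity * width * height)     # expected count ~ area
    xy = rng.uniform([0.0, 0.0], [width, height], size=(n, 2))
    types = rng.integers(0, n_types, size=n)        # which element class
    thetas = rng.uniform(0.0, 2.0 * np.pi, size=n)  # rotation about origin
    return xy, types, thetas

xy, types, thetas = sample_marked_point_process(10.0, 10.0, 2.0, 3)
print(f"{len(xy)} element instances drawn from 3 classes")
```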

1.1.2. Methods Exploiting the Distribution of the Element. An isotropy assumption is a natural source of shape information. In either orthographic or perspective views, some surface directions are more heavily foreshortened than others, meaning that the image texture of a view of an isotropic surface is not itself isotropic. Now assume we have a view of a plane; if we can observe enough texture elements, we can reconstruct its normal by (in essence) computing an inverse viewing transformation that leads to an isotropic texture. This extends to obtaining a surface normal if the elements are small enough (Witkin, 1981). The reasoning applies to both orthographic and perspective views. Unfortunately, isotropic natural textures appear to be uncommon.

A homogeneity assumption can yield normal estimates, too. If one has a perspective view, then more distant surface elements appear smaller in the image (so that there appear to be more texture elements per unit area in the image plane). This means that the orientation of a plane can be recovered by (in essence) computing an inverse viewing transformation that leads to a homogeneous texture (e.g. see the discussion in Forsyth and Ponce (2002)).


Homogeneity can lead to a shape estimate for curved surfaces under an orthographic viewing assumption as well. One assumes that the texture elements are distributed according to a homogeneous Poisson process of fixed intensity. For the moment, assume that the intensity is known. Assuming an orthographic viewing model, a surface element with area dA projects to an image element with area cos σ dA, where σ is the angle between the normal and the viewing direction. In turn, this means that if the process has intensity λ on the surface, it will appear to have intensity λ/(cos σ) in the image. This means that we have an estimate of cos σ at each point on the surface, from which the surface can be reconstructed using methods like those of shape-from-shading; the main methods in this class are (Aloimonos, 1986; Blake and Marinos, 1990). Notice that, if λ is unknown, it can still be recovered, because in any view of a smooth compact surface there will be at least one frontal point; this will have the smallest image intensity, whose value will be λ. This approach will work poorly for small values of λ, because there will be relatively few elements near any given point, so that reasonable estimates of the process's intensity will require large neighbourhoods. Perspective viewing could complicate this process slightly, but in Section 1.2 we give quite strong arguments that scaled orthography is almost certainly sufficient for the case of curved surfaces.
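A toy sketch of this estimator (our own illustration under the stated Poisson assumptions; the disc-counting scheme and all names are ours, not the cited methods): count detected elements in a window around each query point, take the smallest local intensity as λ, and divide.

```python
# Sketch: cos(sigma) from local element counts under the Poisson argument.
import numpy as np

def local_intensity(points, centers, radius):
    """Detected elements per unit image area in a disc at each center."""
    d = np.linalg.norm(points[None, :, :] - centers[:, None, :], axis=2)
    counts = (d < radius).sum(axis=1)
    return np.maximum(counts / (np.pi * radius ** 2), 1e-12)

def cos_sigma_estimates(points, centers, radius):
    lam_img = local_intensity(points, centers, radius)
    lam = lam_img.min()     # the frontal point has the smallest intensity
    return np.clip(lam / lam_img, 0.0, 1.0)
```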

1.1.3. Methods Exploiting the Deformation of the Element. Typical textured images contain multiple views of the same texture element. Camera transformations that depend on the local surface normal are applied to the model element to produce each image instance. This observation offers a variety of different possible cues. For example, texture elements that are repeated on the surface are imaged as different affine transformations of the same element in the image, and one can tune filters to these affine deformations (with some assumptions about the element—see the methods in Lee and Kuo (1998), Sakai and Finkel (1994), Stone and Isard (1995)).

In another approach, one observes that the per-element imaging transformations are going to affect the spatial frequency components on the surface; this means that if the texture has constrained spatial frequency properties, one may observe the orientation from the texture gradient (Bajcsy and Lieberman, 1976; Krumm and Shafer, 1990, 1992; Sakai and Finkel, 1994; Super and Bovik, 1995). A more recent approach due to Garding (1992, 1995), extended by Malik and Rosenholtz (1997), and given a statistical form by Clerc and Mallat (1999) assumes that elements are translated along the surface but do not rotate—this is referred to as local homogeneity—and then recovers the normal and curvature at each element by an analysis of the image texture gradient. Local homogeneity can be true in the large only for surfaces of constant and non-positive Gaussian curvature; furthermore, this assumption excludes most textures produced by random processes that are local in nature—for example, splats and splashes; surface damage; sprinkles on a doughnut—because the elements are not allowed to rotate with respect to one another.

One may construct a generative model, where object texture is modelled with a parametric random model, then choose a geometry and parameters that minimize the difference between either a predicted image and the observed image (Choe and Kashyap, 1991) or a predicted image density and the observed image density (Lee and Kuo, 1998).

1.2. Perspective vs. Orthography

We show below that one can reconstruct a texture element in a Euclidean frame (with unknown scale) from three or more instances in a scaled-orthographic view. This process yields an estimate of the surface normal at each instance. But is this enough? Should one use a perspective model? The answer seems to be that there is no reason to. Shape from texture is a sharply different problem for the case of planes and of curved surfaces. Perspective effects in views of textured planes are so familiar as to not need illustration. They occur because a plane with a very large change in depth can span a small visual angle. Furthermore, if the surface in view is known to be plane, it is most unwise to reconstruct locally—instead, reconstruction is properly seen as an attempt to recover the three parameters defining the plane.

Curved surfaces—or rather, surfaces sufficiently curved to be observably different from a plane—are different, because to obtain a large depth range we must accumulate curvature, and so face the prospect that the surface will turn away from the eye. This means that for many (possibly most) curved surfaces, we can ignore perspective from the start: the surface spans a small range of depths, so perspective effects will be too small to notice (in practice, it is usual to ignore the effects of perspective when the range of depths spanned by the observed scene is small compared to the average depth—1/10 is a convenient and popular threshold).


A second, important difference about curved surfaces is that we can meaningfully estimate their properties locally. If a surface is known to be a plane, one should approach shape estimation as estimating the coefficients of that plane, meaning that normal estimates are not independent. If nothing is known about the form of the surface in advance (except that it is reasonably smooth), then local normal measurements constrain one another only over small spatial scales (from the smoothness). We use an automatic method for finding texton instances that collects together instances that could be views of the same texture element, and rejects instances that are inconsistent with the reconstructed frontal pattern. This means, in turn, that the method must automatically obtain a pool of element instances for which scaled orthography applies. Since each instance reports its surface normal information and only this is used for reconstruction, in principle we can ignore perspective effects in recovering the normal.

It remains the case that surfaces that are curved, but have large flat patches, might be intractable by the reasoning presented below. We regard this case as being rare, however; in principle one might attack it with a perspective extension of our methods, but this will result in a considerably more complex formulation. Another difficult case involves surfaces that have very high entropy textures—ones where elements repeat seldom—but no current method applies to such textures, which appear to be rare.

All this means that an orthographic model is sufficient for our purposes. But why are perspective methods common in the shape from texture literature? There seem to be two reasons. First, perspective really is important in the case of planes. Second, for local element methods (e.g. those of Garding (1992, 1995), of Malik and Rosenholtz (1997) and of Clerc and Mallat (1999, 2002)), the perspective scaling term must be of no significance and cancel; if it were of significance, then the methods could not work, because the changes of depth between elements in the examples in those papers are two to three orders of magnitude smaller than the overall depth to the elements—perspective effects are simply not observable under these circumstances, and so cannot have contributed to the effectiveness of these methods.

1.3. What is a Texture Element?

We have modelled texture as consisting of a series of elements that are repeated. But what are to be the elements? It is common to assume that elements have some semantics—for example, are polka dots, bricks, or little butterflies—and most papers on shape from texture either explicitly or implicitly endorse this notion. We argue that a much more inclusive definition of an element is possible. Section 2 will show that, if one possesses three generic scaled-orthographic views of a plane element, one can determine (a) the frontal appearance of the element and (b) the slant and tilt at which each instance is viewed. All this suggests that we require only two properties of a texture element:

• Repetition: An element must be repeated often enough to provide a useful guide to the surface geometry; since the minimum number of views is three, this is a very weak constraint.

• Localizability: An element must have sufficient structure that (a) it can be localized in the image (i.e. line endings might be elements but lines should not) and (b) it can be used to infer the viewing transformation.

These are both extremely weak constraints, too.

The advantage of defining an element by these properties is that textures that do not appear to have many repeated instances of whole pattern elements—paisley prints, for example—may well repeat smaller structures—corners or small spots, say.

To benefit from this approach, we must be able to identify putative instances of elements automatically from images. If we strip elements of all semantics and require only repetition and localizability, this process boils down to identifying image structures that are repeated and can be localized. This is a relatively simple business, if we see it in terms of three steps:

• First, identifying putative instances of a particular element. We explain this process below, and expand in Section 3.

• Second, reconstructing the frontal appearance of that element while discarding instances that cannot be within a scaled orthographic view of the frontal element. We show how to do this in Section 4.

• Third, discarding elements whose frontal appearance suggests that they cannot yield good slant and tilt estimates. We show how to do this in Section 3.1.2.

The key to this procedure is identifying instances ofa particular element while not knowing the element.


Instances must be within an affine transformation of one another. Furthermore, we expect that these affine transformations have commensurate eigenvalues (strongly foreshortened instances give no guide to the slant and tilt because there are too few image pixels in the instance), suggesting that we can get by using representations that are not rigorously affine invariant.

A particularly simple and effective mechanism for finding instances has been made available by recent work on representing image patches around interest points. Schmid and Mohr demonstrated that one could match objects by identifying interest points in an image and then building representations of the image around those points that are invariant to some appropriately chosen family of transformations (Schmid and Mohr, 1997). The key observation in this work is that such a representation can (a) distinctively identify image patches and (b) be robust to affine transformations. Repeated patches then must have similar or the same representation, meaning we can find instances of distinct elements by simply clustering the patch representations. An alternative strategy is to cluster image patches; there is some history of doing this successfully (e.g. (Leung and Malik, 1996; Malik et al., 1999)), but we feel using the interest point representation is easier and more efficient.

Interest point representations are now widely used in recognition (e.g. (Lowe, 1999, 2004; Schmid and Mohr, 1997); points are matched to points in images of models) and tracking (e.g. (Lowe, 2004); points are matched to points in the next frame). Furthermore, one can build a texture representation by identifying points that repeat and are good for matching (Lazebnik et al., 2003a, 2003b). The emphasis in these last papers is on reducing the number of interest points by identifying patches that are uncommon within one scene and match well across views. We want a dense set of texton instances, and we want textons to match one another within a given scene.

A comparison of methods by Mikolajczyk and Schmid (2003) is unequivocally in favour of the method of Lowe (2004), which we use. This method finds each interest point and then produces a 128 dimensional vector which represents the image patch surrounding it. The representation is invariant under scale, rotation and translation, and has been shown to be robust to about 60 degrees of foreshortening (Mikolajczyk and Schmid, 2003). While we have found this approach satisfactory in practice, we make no assertion that it is necessarily the best; the recent literature contains several methods to build affine covariant neighbourhoods that may offer better elements.

A further advantage of finding interest points and local regions, and then clustering them (however one does so), is that the question of camera models is then considerably eased. This is because we can expect that an image that contains significant perspective effects will result in instances that cluster into image neighbourhoods within each of which scaled orthography applies, even though it may not apply from neighbourhood to neighbourhood.

1.4. Applications

Applications for shape from texture have been largely absent, explaining its status as a minority interest. However, we believe that image-based rendering of clothing is an application with substantial promise. Cloth is difficult to model for a variety of reasons. It is much more resistant to stretch than to bend: this means that dynamical models result in stiff differential equations (for example, see (Terzopoulos et al., 1987)) and that it buckles in fine scale, complex folds (for example, see (Bridson et al., 2002)). However, rendering cloth is an important technical problem, because people are interesting to look at and most people wear clothing. A natural strategy for rendering objects that are intrinsically difficult to model satisfactorily is to rearrange existing pictures of the objects to yield a rendering. In particular, one would wish to be able to retexture and reshade such images. Earlier work on motion capturing cloth used stereopsis, but faced difficulties with motion blur and calibration (Pritchard, 2003; Pritchard and Heidrich, 2003); we believe that, in future, shape from texture methods may make it possible to avoid some of these problems.

1.5. Overview of our Method

We first demonstrate that three instances of a texture element allow unique frontal reconstruction and provide normal information (Section 2). We then show practical applications of this method in a pipeline where we:

• Recover image instances of multiple distinct texture elements, which we do using the interest point method of Lowe (2004) (Section 3).


• Recover the frontal appearance of all elements, which we do using the methods of Section 2. By so doing we can exclude uninformative elements, obtain relative irradiance and normal estimates, and (often) significantly enrich the field of elements (Section 3).

• Obtain a surface model and a relative irradiance map from element appearances. To obtain a surface model, we use EM to resolve the two-fold ambiguity in normal direction that results from our recovery method (Section 4). Remarkably, the relative irradiance map is unambiguous.

The main novel features of this paper are:

• Opportunistic definition of texture elements as patterns that are (a) repeated and (b) localizable means that we can find an extremely dense field of element instances (and so normal measurements) automatically.

• Scaled orthography is enough, because (a) perspective effects are not significant for most curved surfaces and (b) the method of defining texture elements automatically identifies points that are likely to be repeated instances of the same element under a scaled orthographic model. Thus, if scaled orthography doesn't apply globally, it still applies locally and we can in principle still obtain measurements.

• Very dense normal measurements from element deformation mean that, while we use a smoothness term to obtain a surface reconstruction, the reconstructions don't appear to be significantly biased by this term.

• A relative irradiance reconstruction follows automatically from our method.

2. The Geometry of Shape from Texture under Scaled Orthography

We assume that we have an orthographic view of a compact smooth surface and the viewing direction is the z-axis. It will turn out that all significant results hold for scaled orthographic views, too. We write the surface in the form (x, y, f(x, y)), and adopt the usual convention of writing f_x = p and f_y = q. We write matrices as M, vectors as x, and use I for the identity.

2.1. Texture Imaging Transformations for Orthographic Views

Now consider one class of texture element; each instance in the image of this class was obtained by a Euclidean transformation of the model texture element, followed by a foreshortening. The transformation from the model texture element to the particular image instance is affine. One consequence of our use of an interest point method for representing the local texture elements is that the methods identify the same point on each instance of a patch, which can be thought of as a projection of an origin in the texture element's frame. This means we need not consider the translation component further. Furthermore, in an appropriate coordinate system on the surface and in the image, the foreshortening can be written as

F_i = \begin{pmatrix} 1 & 0 \\ 0 & \cos \sigma_i \end{pmatrix}

where σ_i is the angle between the surface normal at mark i and the z axis. Notice that cos σ_i is always non-negative.

The transformation from the model texture element in its (2D) model coordinate system to the i'th image element is then

T_{M \to i} = R_G(i) \, F_i \, R_S(i)

where R_S(i) rotates the texture element in the local surface frame, F_i foreshortens it, and R_G(i) rotates the element in the image frame. From elementary considerations, we have that

R_G(i) = \frac{1}{\sqrt{p^2 + q^2}} \begin{pmatrix} p & q \\ -q & p \end{pmatrix}

The transformation from the model texture element to the image element is not a general affine transformation (because there are only three degrees of freedom); instead, it has the form T = R_G F R_S, where F is a diagonal foreshortening as above. We call a matrix of this form a local texture imaging transformation. In particular, TT^T has eigenvalues 1 and (cos σ)^2. Recall that the trace of a matrix is the sum of its eigenvalues, that the determinant is the product of its eigenvalues, that det(AB) = det(A)det(B) and that det(A^T) = det(A). This yields a characterization of local texture imaging transformations. We have:


Lemma 1. An affine transformation T can be written as R_G F R_S, where R_G, R_S are arbitrary rotations and F is a foreshortening (as above), if and only if

det(T)^2 - trace(TT^T) + 1 = 0

and

0 \le det(T) \le 1

Proof: First, we show that a texture imaging transformation has these properties. From above, we have det(T)^2 = det(TT^T) = (cos σ)^2 and trace(TT^T) = 1 + (cos σ)^2, and so the equality holds. Now, we show how to construct our expansion from a matrix T that satisfies the conditions. Write R_G D R_S for the singular value decomposition of T. Then R_G D^2 R_G^T = TT^T. Because D is diagonal, the diagonal entries of D^2 must be the eigenvalues of TT^T. Because T meets our conditions, the diagonal entries of D^2 must be 1 and c, with 0 ≤ c ≤ 1. We can arrange the rows of R_G to ensure that D_{00} = 1, and choose the positive square root of c to be D_{11}; we now have an expansion of the given form for T. □
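A numerical companion to Lemma 1 (a sketch in numpy under our notation, not the authors' code; the function names and tolerance are ours): test the constraint and factor T via its SVD, exactly as in the constructive proof.

```python
# Sketch: test and factor a local texture imaging transformation.
import numpy as np

def is_texture_imaging(T, tol=1e-8):
    d = np.linalg.det(T)
    return abs(d ** 2 - np.trace(T @ T.T) + 1.0) < tol and -tol <= d <= 1.0 + tol

def factor_texture_imaging(T):
    """Return RG, F, RS with T = RG @ F @ RS and F = diag(1, cos sigma)."""
    U, S, Vt = np.linalg.svd(T)     # singular values sorted descending
    if np.linalg.det(U) < 0:
        # det(T) >= 0 forces det(U) * det(Vt) >= 0, so reflections pair
        # up; absorb both into the second singular direction.
        U[:, 1] *= -1.0
        Vt[1, :] *= -1.0
    return U, np.diag(S), Vt

# Example: build a transformation and recover cos(sigma) from F.
R = lambda t: np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])
T = R(0.3) @ np.diag([1.0, 0.6]) @ R(-1.1)
RG, F, RS = factor_texture_imaging(T)
print(is_texture_imaging(T), F[1, 1])   # True, ~0.6
```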

Notice that, given an affine transformation A that is a texture imaging transformation, we know the factorization into components only up to a two-fold ambiguity. This is because

A = R_G F R_S = R_G (-I) F (-I) R_S = A

The other square roots of the identity are ruled out by the requirement that cos σ_i be positive. Now assume that the model texture element(s) are known. In principle, we can obtain the surface normals by first recovering all transformations from the model texture elements to the image texture elements. We then perform an eigenvalue decomposition of T_i T_i^T, yielding R_G(i) and F_i. From the equations above, it is obvious that these yield the value of p and q at the i'th point up to a sign ambiguity (i.e. (p, q) and (−p, −q) are both solutions). So far, things have been straightforward, but we usually do not know the texture element in its own frame—we have to determine it, and this requires a process rather like self calibration.
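Given such a factorization, reading off the gradient up to sign is immediate (again a sketch under our notation; it assumes the standard relation cos σ = 1/sqrt(1 + p^2 + q^2) for a surface (x, y, f(x, y))):

```python
# Sketch: gradient (p, q) from the recovered factors, up to sign.
import numpy as np

def gradient_from_factors(RG, cos_sigma):
    """Gradient implied by RG and cos(sigma); the other solution is
    its negation. Uses cos(sigma) = 1 / sqrt(1 + p^2 + q^2)."""
    tan_sigma = np.sqrt(max(1.0 / cos_sigma ** 2 - 1.0, 0.0))
    # The first row of RG is (p, q) / sqrt(p^2 + q^2).
    return RG[0, 0] * tan_sigma, RG[0, 1] * tan_sigma
```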

2.2. The Texture Element is Unambiguous in a Generic Scaled Orthographic View

Generally, the model texture element is not known. However, an image texture element can be used in its place. We know that an image texture element is within some (unknown) affine transformation of the model texture element. This is not obviously helpful on its own, because that transformation is unknown. Write the transformation from image element j to image element i as

T_{j \to i}

While we don't need to measure this transformation in fact, it can quite easily be measured in principle (e.g. (Forsyth, 2001; Malik and Rosenholtz, 1997; Rosenholtz and Malik, 1997)).

Now we observe many image texture elements, and so can construct many such maps. We should like to determine a frontal view of the texture element in its own frame. It is important to know what ambiguities exist. If we are using texture element j as a model, there is at least one affine transformation A such that

T_{M \to i} = T_{j \to i} A

for every image element i—we could choose

A = T_{M \to j}

Now assume that we have many image texture elements. We must choose some A such that

T_{j \to i} A

is a texture imaging transformation for every i. If there are sufficient instances, the result is that we know the element in a frontal frame.
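This search can be posed as a small least-squares problem (our formulation, not the paper's implementation; function names and the choice of BFGS are assumptions): penalize violations of the Lemma 1 equality over all instances and minimize over the four entries of A.

```python
# Sketch: search over A so that every T_{j->i} @ A satisfies Lemma 1.
import numpy as np
from scipy.optimize import minimize

def lemma1_residual(a, Ts):
    """Sum of squared violations of the Lemma 1 equality."""
    A = a.reshape(2, 2)
    r = 0.0
    for Tji in Ts:
        T = Tji @ A
        r += (np.linalg.det(T) ** 2 - np.trace(T @ T.T) + 1.0) ** 2
    return r

def recover_frontalizing_A(Ts):
    """Ts: measured 2x2 maps T_{j->i} from a chosen model instance j.
    The recovered A is known only up to rotation and flip (Lemma 4)."""
    res = minimize(lemma1_residual, np.eye(2).ravel(), args=(Ts,))
    return res.x.reshape(2, 2)
```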

We start with straightforward, but useful, properties of texture imaging transformations.

Lemma 2. Assume T is a texture imaging transformation; then so is T^T. Furthermore, there is some v such that |v| = |Tv|.

Proof: The first point follows from the definition. The second point follows because, for any u, |Tu|^2 = u^T T^T T u. But T^T is a texture imaging transformation (because T is), so that T^T T has one eigenvalue equal to 1. Assume the corresponding eigenvector is v; then v^T T^T T v = v^T v. □

Now we need the following simple technical result, whose proof appears in the appendix.

Lemma 3. Assume that B is a linear transformation of the plane. Assume that there is some set of vectors v_i such that |v_i| = |B v_i|. If, in that set of vectors, there is a set of three vectors no two of which are parallel, then B^T B = I (the identity).

Proof: Relegated to the appendix. □

And with this result, we have

Lemma 4. Assume T_{M→i} for i = 1, ..., N are texture imaging transformations arising from a generic surface, with N sufficiently large. Assume T_{M→i} B is a texture imaging transformation for i = 1, ..., N and B some affine transformation. Then B^T B = I.

Proof: For each T_{M→i}, there is a direction v_i in which length is fixed, i.e. such that v_i^T v_i = v_i^T (T_{M→i}^T T_{M→i}) v_i. Write T*_{M→i} = T_{M→i} B; this is a texture imaging transformation, too, by the hypothesis. Now for each T*_{M→i}, there is a direction v*_i in which length is fixed. Notice that T_{M→i} v_i = T*_{M→i} v*_i, because the images of these transformations are the same—they result in the same image element. From this, we have that |v_i| = |v*_i|. But we also have that v_i = B v*_i, because T_{M→i} has full rank. This means that B fixes the length of N vectors. Now if the surface is generic and N is greater than three, then in this set of vectors there will be three, no two of which are parallel; so, by the lemma above, B^T B = I. □

All this means that, from a set of texture imaging transformations, we can determine the frontal appearance of the element up to a possible rotation and flip; neither is of any significance. We could obtain the frontal appearance by searching over affine transformations A to find transformations such that

T_{M \to i} = T_{j \to i} A

is a texture imaging transformation for every i. As long as the transformations T_{j→i} can be determined (there is a special case, which we discuss below), Lemma 4 gives us that this search yields the element in a frontal frame.

Lemma 4 is crucial, because it means that, for orthographic views, we can recover the texture element independent of the surface geometry (whether we should is another matter). We have not seen Lemma 4 in the literature before, but assume that it is known in some form—it doesn't appear in Mundy and Zisserman (1994), which describes other repeated structure properties, in Schaffalitzky and Zisserman (1999), which groups plane repeated structures, or in Leung and Malik (1996), which groups affine equivalent structures but doesn't recover normals. At heart, it is a structure from motion result, echoing Malik and Rosenholtz's linking of shape from texture, stereopsis and structure from motion as similar problems (in the introduction to Malik and Rosenholtz (1997)). An interesting comparison is between this result and Triggs' demonstration, based on a parameter counting argument and numerical optimization, that five uncalibrated perspective views of a plane object yield the object up to a Euclidean transformation (equivalently, self-calibrate the views) (Triggs, 1998).

2.2.1. The Effect of Scale. In scaled orthography, we are missing the scale factor that connects absolute length with image coordinates. Although Lemma 4 was derived for orthography, it applies to scaled orthography with one important exception—it is no longer possible to estimate the absolute scale of the element. The easiest way to see this is to let the absolute length be in numbers of pixels (i.e., we allow world coordinates to be in pixels, rather than metres). If we do this, then the camera is orthographic, and Lemma 4 goes through. However, we no longer know the scale of the recovered texture item in metres.

2.2.2. Special Cases. The non-generic cases are interesting. If we are to be unable to reconstruct, then for every instance but one, the same direction v_i in the frame of the element must be preserved. This can occur with a general element as a result of an unfortunate coincidence between view and texture field—in particular, at each instance of an element on the surface, the surface gradient is in the same direction in the element's coordinate frame. It is a view property, because the gradient of the surface (which is determined by the view) is aligned with the texture field; this case can be dismissed by the generic view assumption. A circular element presents a nastier problem, because one cannot distinguish between directions in the element's coordinate frame. A circular element results in a collection of image instances that are ellipses, each of whose major axis is the same length l (we ignore the possibility that we have an exactly frontal view of one element). The ellipse with smallest aspect ratio could be the frontal element; but so could any ellipse with major axis l and smaller aspect ratio, including a circle with diameter l. Now assume that the element is an ellipse, as opposed to a circle. Then the major axis of each of these elements on the surface is parallel to the view direction, so we have a view property. The alternative is that the texture consists of circular elements. Notice that if the element is an ellipse, we do not generally expect all major axes of image instances to be the same. All this suggests that a complete theory would detect this case, and then set the element to a circle.

3. Recovering the Element

As Section 1.3 described, we recover putative instances of texture elements by marking interest points in the image, obtaining affine robust descriptors for those points, and clustering them using the descriptors. Each cluster represents putative instances of a single element. We use Lowe's method for finding interest points and computing descriptors, by applying his program (which he has kindly made available at http://www.cs.ubc.ca/~lowe/keypoints/). These descriptors are then clustered using k-means to find descriptors that appear to represent instances of the same texture element. Because the descriptors produced by Lowe's program are invariant under rotation and translation and robust to quite substantial foreshortening, each cluster should represent instances of a potential texture element. Note that there is little reason to attempt to extract heavily foreshortened instances, because a fortiori they must result in poor estimates of surface normal and of element appearance (there are few pixels on the element). We must now determine (a) which putative instances are, in fact, instances of each element and (b) which textons are useful. This information emerges from the process of recovering frontal textons.
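A modern rendition of this step might look as follows (a sketch, not the paper's pipeline: it substitutes OpenCV's SIFT implementation and scikit-learn's k-means for Lowe's original binary, and the parameter choices are our assumptions):

```python
# Sketch: detect interest points and cluster descriptors into putative
# texture element types.
import cv2
from sklearn.cluster import KMeans

def detect_and_cluster(image_path, k=12):
    """Each cluster label indexes a putative element type; each keypoint
    in a cluster is a putative instance of that element."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(img, None)
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(descriptors)
    return keypoints, labels
```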

3.1. Recovering Information for a Single Texton

We assume, for the moment, that we are dealing with image instances that are either instances of a single texture element or noise. We will deal with the case of multiple textons below; we do so basically by assuming that instances are common and mislabelling instances is relatively uncommon, so that we need not bother with dealing with all elements at once; instead, we do each separately. For a given element (or, equivalently, equivalence class of image instances), one can simultaneously determine (a) whether a descriptor is an instance of an element, (b) the appearance of that element and (c) the texture imaging transformation, by an application of the EM algorithm.

Notation: For the i'th image instance of the texture element, write θ_{gi} for the rotation angle of the in-image rotation, σ_i for the foreshortening, θ_{si} for the rotation angle of the on-surface rotation and T_i = T_i(θ_{gi}, σ_i, θ_{si}) for the texture imaging transformation encoded by these parameters. Write δ_i for the hidden variable that encodes whether the image texture element is an instance of the model texture element or not. Write I_µ for the estimate of the texture element, and I_i = I_i(θ_{gi}, σ_i, θ_{si}) for the patch obtained by applying the known texture imaging transformation T_i^{-1} to the image texture element i.

If the irradiance is unknown, we can assume it is constant over the texture element (elements are "small"). For the moment, assume that all texture imaging transformations are known, but the element is not known. Write the sum of squared pixel differences between two image patches I and J as

\| I - J \|^2

(a notation that allows us to remain vague about the domains in which the comparison is made for the moment). We must then choose I_µ and some set of scalars L_i to minimize

\sum_i \| L_i I_\mu - I_i \|^2

The L_i are required because each instance of the texton will be subject to different irradiance; although the absolute irradiance cannot be measured, L_i represents the irradiance at i up to some unknown constant, which we call the relative irradiance. Now assume that we have an estimate of the model texture element and the relative irradiance field; we can clearly recover the texture imaging transformations by transforming the lighted model texture element to look like an image patch. Finally, given all parameters, it is possible to tell whether an image texture element represents an instance of the model texture element or not—it will be an instance if, by applying the inverse texture imaging transformation and relative irradiance to the image texture element, we obtain a pattern that looks like the model texture element. This suggests that we can insert a set of hidden variables, one for each image texture element, which encode whether the image observation is an instance or not.
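For fixed texture imaging transformations, the two updates implied by this objective are closed-form (our derivation from the expression above, including the (L_i − 1)^2 prior introduced below; `w[i]` stands for the expected value of δ_i, and all names are ours):

```python
# Sketch: closed-form updates for the element estimate and the relative
# irradiance, given rectified patches and responsibilities w.
import numpy as np

def update_element(patches, L, w):
    """I_mu minimizing sum_i w_i ||L_i I_mu - I_i||^2: a weighted
    average of the rectified patches."""
    num = sum(wi * Li * Ii for wi, Li, Ii in zip(w, L, patches))
    den = sum(wi * Li ** 2 for wi, Li in zip(w, L))
    return num / den

def update_irradiance(patches, I_mu, w, s2_im, s2_light):
    """L_i minimizing the same term plus the (L_i - 1)^2 prior."""
    g = I_mu.ravel()
    return np.array([
        (wi * np.dot(g, Ii.ravel()) / s2_im + 1.0 / s2_light) /
        (wi * np.dot(g, g) / s2_im + 1.0 / s2_light)
        for wi, Ii in zip(w, patches)])
```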

Domains: To compare image and model texture elements, we must be careful about domains. Implicit in the definition of I_µ is its domain of definition D—say an n × n pixel grid—and we can use this. It is significantly easier to compare estimates of the texture element by mapping all of these estimates into the element frame and comparing them there. The alternative, mapping the estimate of the element onto the image and comparing there, creates some problems with resolution. Write T_i^{-1}(I) for the pattern obtained by applying T_i^{-1} to the image domain T_i(D). This is most easily computed by scanning D, and for each sample point s = (s_x, s_y), evaluating the image at the corresponding image location. The scale of the domain is a more delicate point (see Section 6); we assume a fixed texture element size in the frontal frame, and set this constant once, by hand, for all images.
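A sketch of this rectification step (our rendering, under the convention that T maps the element frame into the image, so applying T^{-1} to the instance amounts to sampling the image at T s for each element-frame sample s; interpolation via SciPy, names ours):

```python
# Sketch: pull one image instance back into the frontal element frame.
import numpy as np
from scipy.ndimage import map_coordinates

def rectify_instance(image, T, center, n=32):
    """Scan the n x n element domain D; for each sample s, look up the
    image at T @ s, offset by the instance origin."""
    ys, xs = np.mgrid[0:n, 0:n] - (n - 1) / 2.0
    pts = T @ np.vstack([xs.ravel(), ys.ravel()])
    coords = [pts[1] + center[1], pts[0] + center[0]]   # (row, col)
    return map_coordinates(image, coords, order=1,
                           mode="nearest").reshape(n, n)
```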

The negative log-posterior is now relatively straightforward to write. We assume that imaging noise is normally distributed with zero mean and standard deviation σ_im. We assume that image texture elements that are not instances of the model texture element arise with uniform probability—which we encode with a constant K, whose value is chosen by experiment. We have that 0 ≤ cos σ_i ≤ 1 for all i, a property that can be enforced with a prior term. To avoid the meaningless symmetry where illumination is increased and albedo falls, we use a prior that charges for L_i different from one. We can now write the negative log-posterior

\frac{1}{2\sigma_{im}^2} \sum_i \delta_i \, \| L_i I_\mu - I_i \|^2 + K \sum_i (1 - \delta_i) + \frac{1}{2\sigma_{light}^2} \sum_i (L_i - 1)^2 + L

where L is some unknown normalizing constant of no further interest.

Figure 2. Gradient information obtained by the process described in Section 3 for one type of texture element (there are a total of 8 types for this image). The image instances of this element were identified automatically, as described in that section. The gradient information has a two-fold ambiguity, as described in Section 2. We show the gradient by rendering the image of a circle viewed at the relevant slant and tilt—this means that if we recover the gradients accurately, the surface should look as though it is carrying small open circles.

Applying EM to this expression is straightforward. Computing expected values of the δ_i follows the usual pattern, but the continuous parameters require numerical minimization. This minimization is unusual in being efficiently performed by coordinate descent. This is because, for fixed I_µ, each T_i can be obtained by independently minimizing a function of only three variables. We therefore minimize by iterating two sweeps: fix I_µ and minimize over each T_i in turn; now fix all the T_i and minimize over I_µ (this is done in closed form by computing an average).

This process produces normal information automatically, as each T_i is an explicit function of rotation on the surface, the surface slant and p and q (Section 2). However, there is a two-fold ambiguity, as a rotation on the surface of 180° can be absorbed by the map (p, q) → (−p, −q). Furthermore, the EM coefficients encode the extent to which an image pattern is, in fact, an instance of a texton. However, with many image elements the process could be slow. In fact, increased efficiency is possible because, although using all putative instances gives the best estimate of the frontal element, one runs into diminishing returns quite quickly. This suggests our strategy of using a subset of the instances to estimate the frontal element, then fixing the appearance of the element and using this to estimate configuration parameters, relative irradiance and δ's for all other instances. Recovery of the frontal appearance of the texton is good; Fig. 4 shows all frontal textons from the shirt of Fig. 3. Recall that frontal appearances are estimated by backprojection and averaging: the relatively crisp images suggest the image instances have been very well registered by the backprojection process.


Figure 3. In the center, an image of a shirt with the position of each texton instance superimposed as a cross; there are so many it is difficult to resolve them, as the detail from the collar region (left) shows. There are 350 instances in total, and instances are less dense in the area of darker shading near the arms (detail on the right). Instances from this area do not result in much surface normal data, because the representation provided by Lowe's method appears to be sensitive to relatively large changes in brightness. In turn, this means that interest points found here tend to be discarded as not being instances of their element (the EM coefficients of Section 4 are low), and so we have little surface normal information in these regions.

Figure 4. The left row shows estimates of the frontal appearance of a texture element for the image of the shirt depicted in Fig. 3 after 1, 5, 10 and 20 iterations of EM respectively. Initially the estimate is blurred, because the slant-tilt estimates are poor, but very quickly it becomes sharp. The rows on the right show the frontal appearance of each of the 12 texture elements found for this shirt. Note that the clustering could reasonably be criticized, but that it is not particularly important to identify the correct number of clusters. Each texton consists of a small patch centered on some part of the shirt pattern; the more such patches, the better, because this leads to a very dense set of surface orientation and relative irradiance estimates. The two elements on the bottom right are difficult to localize; this is detected automatically using the Hessian trick of Section 3.1.2 and they are omitted from reconstruction.

3.1.1. Handling Multiple Textons. It is relatively straightforward to deal with multiple textons (Figs. 3 and 4). We first cluster putative instances using k-means; note that the value of k isn't crucial here, as long as it is neither too small nor too large. This is because if k exceeds the number of texton classes, some elements will be represented by more than one cluster. The only consequence of processing these clusters independently is (in principle) a slight reduction in the accuracy with which the frontal appearance of the element can be estimated. This doesn't appear in practice, because we have so many instances of each element that the reduction in accuracy is not noticeable.

Each cluster is then processed independently, to produce independent frontal appearance, normal and relative irradiance estimates at the instance centers. The normal estimates will be linked by the surface reconstruction process, and the relative irradiance estimates by smoothness. The relative irradiance estimates for a given element are known up to a single missing scale factor. We can fix the scale for one element, and must scale all others to be consistent with that element (which we do by smoothness, Section 4).

3.1.2. Bad Elements. Bad elements—as opposed to bad instances, which are dealt with by the missing variables—are those that cannot produce reliable estimates of p and q. For example, consider an element that has a constant grey level, or is a single point. We identify bad elements by looking at the Hessian of the fitting error. Once we have done so, nothing further need be done to merge estimates of p and q obtained from different texture elements.

4. Fitting a Surface and a Relative Irradiance Map

We now have a set of points (x_i, y_i) at which we know measurements of the gradient up to a two-fold sign ambiguity: either d_i = (p, q) or d_i = (−p, −q). Furthermore, we have an estimate—from the expected value of the hidden variable in the previous section—of the reliability of these measurements. We accept only measurements for which these expected values exceed a threshold (0.8 for what follows). There are three possibilities at each point that has been accepted: first, d_i = (p, q); second, d_i = (−p, −q); third, the measurement does not derive from the surface (a bad texton match, say). We encode these states using a missing variable, assigning δ_i^1 = 1 for the first case, δ_i^2 = 1 for the second case, and δ_i^3 = 1 for the third case. As usual, one of the three is one and the other two zero for every site. We can now apply EM.

Assume for the moment that there is no sign ambiguity. We must now fit a surface to gradient data. We represent the surface with radial basis functions, a natural choice for scattered data interpolation. We use

\phi_j(x, y) = \frac{1}{(x - x_j)^2 + (y - y_j)^2 + \epsilon^2}

as a basis function, so at (x, y), the surface is

(x, y, h(x, y)) = \left( x, \; y, \; \sum_j a_j \phi_j(x, y) \right)

and to simplify fitting to normal data, we use a linear model. In particular, we require that the normal measurement be orthogonal to the tangent of the fitted surface. If we write p_i for the measured x-derivative at (x_i, y_i), etc., we must minimize

\sum_{i \in \mathrm{points}} \left( \frac{\partial h}{\partial x}(x_i, y_i) - p_i \right)^2 + \left( \frac{\partial h}{\partial y}(x_i, y_i) - q_i \right)^2
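Because ∂h/∂x and ∂h/∂y are linear in the coefficients a, this fit (anticipating the coefficient penalty discussed next) reduces to a ridge-regression solve. A self-contained sketch under our notation, with all names ours:

```python
# Sketch: fit RBF surface coefficients to gradient measurements.
import numpy as np

def phi_grads(X, C, eps=0.5):
    """Gradients of phi_j(x) = 1 / (|x - c_j|^2 + eps^2) at points X;
    returns d(phi)/dx and d(phi)/dy, each (n_points, n_centers)."""
    d = X[:, None, :] - C[None, :, :]
    r2 = (d ** 2).sum(-1) + eps ** 2
    g = -2.0 * d / r2[..., None] ** 2
    return g[..., 0], g[..., 1]

def fit_surface(X, p, q, C, mu=1e-3, eps=0.5):
    """Ridge solve for a in h = sum_j a_j phi_j, matching (p, q)."""
    Gx, Gy = phi_grads(X, C, eps)
    A = np.vstack([Gx, Gy])          # stacked gradient design matrix
    b = np.concatenate([p, q])
    return np.linalg.solve(A.T @ A + mu * np.eye(C.shape[0]), A.T @ b)
```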

where h is a linear function of the vector of surface coefficients a, so that the error is quadratic in the surface coefficients. Encoding the possibility that a measurement is incorrect with a hidden variable means that we don't need to use an m-estimator. This makes for a considerably simpler M-step than that in Forsyth (2002). We should like to impose a smoothness constraint. There are three alternatives: First, one can compute and charge for large second derivatives (a widespread practice, described in, for example, Horn and Schunck (1981)). Unfortunately, this approximation to curvature is unreliable near boundaries when the surface is near vertical. Second, one can compute the average norm of the shape operator and charge for that (see Forsyth (2002)). This is painfully slow, however. Third, one can penalize large coefficients in the surface expansion (for this method, see, for example, Hastie et al. (2001)). We have found this method to be entirely satisfactory. Incorporating the hidden variables, the negative log-likelihood becomes

\sum_{i \in \mathrm{points}} \left[ \frac{\delta_i^1}{2\sigma_l^2} \left\{ \left( \frac{\partial h}{\partial x}(x_i, y_i) - p_i \right)^2 + \left( \frac{\partial h}{\partial y}(x_i, y_i) - q_i \right)^2 \right\} + \frac{\delta_i^2}{2\sigma_l^2} \left\{ \left( \frac{\partial h}{\partial x}(x_i, y_i) + p_i \right)^2 + \left( \frac{\partial h}{\partial y}(x_i, y_i) + q_i \right)^2 \right\} + \delta_i^3 K_r \right] + \mu \, a^T a + C

where K_r encodes the (uniform) probability that a detector response is not an instance and is chosen by experiment, C is a constant of no further interest, and µ adjusts the weight of the smoothness term with respect to the error term. From this point, the application of EM is straightforward; the usual expressions apply for the re-estimates of the hidden variables, and when known values of the hidden variables are substituted, the minimization problem involves solving a linear system. As a result, the method is very much faster than that of Forsyth (2002); it appears to produce much better surfaces, too.
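A sketch of this EM loop (our rendering, reusing `phi_grads` from the previous sketch; parameter values are assumptions). The M-step uses the fact that the δ^1/δ^2 mixture reduces to a weighted least-squares problem with weight w_1 + w_2 and target ((w_1 − w_2)/(w_1 + w_2)) (p_i, q_i):

```python
# Sketch: EM over the two-fold sign ambiguity plus an outlier state.
import numpy as np

def em_fit(X, p, q, C, Kr=5.0, s2=0.05, mu=1e-3, eps=0.5, n_iters=15):
    Gx, Gy = phi_grads(X, C, eps)
    a = np.zeros(C.shape[0])
    for _ in range(n_iters):
        hx, hy = Gx @ a, Gy @ a
        e1 = ((hx - p) ** 2 + (hy - q) ** 2) / (2.0 * s2)   # state (p, q)
        e2 = ((hx + p) ** 2 + (hy + q) ** 2) / (2.0 * s2)   # state (-p, -q)
        W = np.exp(-np.vstack([e1, e2, np.full_like(e1, Kr)]))
        W /= W.sum(axis=0)                                  # E-step
        w1, w2 = W[0], W[1]
        ws = np.sqrt(w1 + w2)                               # weighted LSQ
        tgt = (w1 - w2) / np.maximum(w1 + w2, 1e-12)
        A = np.vstack([Gx * ws[:, None], Gy * ws[:, None]])
        b = np.concatenate([ws * tgt * p, ws * tgt * q])
        a = np.linalg.solve(A.T @ A + mu * np.eye(C.shape[0]), A.T @ b)
    return a, W
```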

A direct method is possible, too. One uses the approximating surface to evaluate the slant and tilt at each texton instance, rectifies the image instance to its frontal frame, and compares with the recovered appearance of the texton. In principle, this approach should lead to an improved representation, because one can then couple the process of estimating whether an image pattern is a texton instance with that of estimating the approximating surface. In our experience, this does not materially change the recovered surface, probably because there are so many instances that are good that the accuracy added in principle is not significant in practice.

Starting the method is straightforward. In all the examples shown, we start with a cylinder aligned by hand with the major axis of the body, positioned by finding the horizontal component of the center of gravity of the instances.

4.1. Recovering a Relative Irradiance Map

We identify acceptable instances of texture elements by testing the local value of the relevant hidden variable against a threshold. At each acceptable instance i is an estimate of irradiance relative to some unknown scale, given by L_i. The scale is fixed for each type of texton. Write the irradiance at the i'th texton as I_i. Now if the i'th texton comes from the l'th type, we have that

L_i = s_l I_i

where s_l is an unknown scale factor, one for each type. This estimate is available because we must shade the texton to get it to agree with the image, and we are assuming that repeated instances all have the same albedo.

We wish to recover a relative irradiance field—equivalently, we care only about L_i/L_j for any pair of i, j—and so we must recover r_l = s_1/s_l for all l > 1. There are two ways to attack this; first, we could ignore all but the most populous type of texton, and so recover a set of point samples of relative irradiance. We would typically expect to reconstruct a field from these point samples, which we would do by some form of smoothing or interpolation. The disadvantage of doing this is that we are discarding some information.

The second alternative is to apply a prior to the interpolating relative irradiance field, which will yield estimates of r_l for all the other types of texton. This approach is justified if the textons are spaced closely with respect to expected fast changes in the irradiance field—but this case applies here, because the textons are very dense. As above, write φ_j(x, y) for the radial basis function

\frac{1}{(x - x_j)^2 + (y - y_j)^2 + \epsilon^2}

We can then estimate the relative irradiance field by minimizing

\sum_{k \in \mathrm{classes}} \left\{ \sum_{i \in \mathrm{class}\ k} \left( r_k L_i - \sum_j a_j \phi_j(x_i, y_i) \right)^2 \right\} + \nu \sum_j a_j^2

with respect to the r's and the a's. The last term is a smoothness term, as above, r_1 = 1, and ν weights the smoothness term with respect to the reconstruction error and is set by experiment. The expression is quadratic in the variables and minimization involves solving a straightforward linear system. Notice that, remarkably, there is no ambiguity in the relative irradiance (as opposed to the normal direction).

5. Experimental Results

It is always difficult to evaluate a reconstruction method, particularly if ground truth is not available. For most interesting cases of shape from texture, it is not; furthermore, synthetic images are (as usual) an unreliable guide.

The first point to notice is just how many feature points there are (Fig. 3 shows a typical case). This suggests that a competent reconstruction method should be able to obtain detail at quite a fine scale. In the lowest third of that image (particularly the patch indicated by the arrow), there are relatively few accurate orientation estimates, because dark patches mean that many instances are poorly rectified. We ascribe this phenomenon to strong shading differences, following Mikolajczyk and Schmid (2003), who note that Lowe's method is affected by strong changes in illumination. If one reconstructs incorporating the scattered good measurements that remain in this area, the reconstruction is poorer than if one omits them (Fig. 5). This implies that a rich set of feature points is truly helpful.

Figure 5. On the left, a reconstruction of the shirt image of Fig. 3, obtained using all the interest point detector responses in the image (top: textured and bottom: untextured). As the caption of that figure indicated, there are relatively few good normal estimates in the lower third of the shirt. The reconstruction that results has some substantial ripples in it, and is quite unsatisfactory. However, if one ignores the lower third of the image and uses only normals estimated in the top two thirds, one obtains the reconstruction on the right (again, top textured and bottom untextured). This is a much better surface for the shirt. All this suggests that our very dense normal estimates are genuinely useful.

The second point to notice is that the frontal estimates of the elements appear to be quite good. Figure 4 shows the set of twelve frontal elements obtained from the shirt of Fig. 3. Notice how sharp these elements are, and recall that they are obtained by averaging many instances through the estimated slant and tilt. For the elements to be so sharp requires that (a) the feature point detector is accurate and (b) the slant and tilt estimates are good—otherwise slightly misaligned estimates would result in an excessively smooth element. Notice also how quickly the estimate converges—the element estimate after five EM iterations is not much different from that after 20.
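The averaging computation behind these frontal estimates is simple to sketch. The following Python fragment uses a hypothetical signature; we assume each instance comes with an estimated 2x2 frontal-to-image map encoding its slant, tilt and rotation.

import numpy as np
from scipy.ndimage import affine_transform

def frontal_element(patches, maps):
    # Hypothetical sketch: average image instances back through their
    # estimated frontal-to-image linear maps.  patches: same-sized grayscale
    # arrays; maps[i]: 2x2 matrix taking frontal coordinates to image
    # coordinates for instance i.
    acc = np.zeros(patches[0].shape, dtype=float)
    c = (np.asarray(patches[0].shape) - 1) / 2.0   # warp about the patch center
    for patch, M in zip(patches, maps):
        # affine_transform samples `patch` at M @ out + offset, i.e. it pulls
        # the image instance back into the frontal frame
        acc += affine_transform(patch.astype(float), M, offset=c - M @ c, order=1)
    return acc / len(patches)

Sharpness of the average is then a direct diagnostic: misestimated slants and tilts would misalign the rectified instances and blur the result.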

Figure 9 compares a reconstruction obtained using our method with a reconstruction obtained using a method due to Clerc and Mallat (Fig. 7(a) and (c) of Clerc and Mallat (2002)). We chose to compare with Clerc and Mallat because their method is the most recent synthesis of thinking in local shape from texture. Note that our method has produced a surface that appears to have fewer fine scale ripples than that obtained using their method, although the number of basis elements is high. We did not artificially set the smoothing term to be high, and believe that this result is caused by the very dense set of texton instances. The advantage of these instances over deforming a surface to correspond to estimates of texture gradient locally is that we are able to pool instances over a wide scale to obtain an improved $(p, q)$ estimate. A further point of comparison between methods is that Clerc and Mallat's method is demonstrated on images from which shading has been "removed" (e.g. Clerc and Mallat (2002), p. 542, caption to Fig. 6); our method infers and reports the shading. We are aware of no other shape from texture method that does so.

Our reconstructions are qualitatively accurate, too. Figures 7, 8 and 6 show reconstructions of different dresses. Note that the reconstructions have been able to identify the visible folds in the dress, and the overall fall of the garment. Again, we attribute this to the dense set of texton instances (1200 in the case of Fig. 7), meaning that we can reconstruct surface detail at quite a small scale. However, our reconstructions are subject to a concave-convex ambiguity (the hidden variables encoding the choice of normal could all flip simultaneously), and this can manifest itself (Fig. 7). We believe that in many applications, an appropriate choice of starting surface will suppress this effect.

Figure 6. On the left, an image of a spotted dress with circles superimposed on the locations of elements corresponding to the second of 8 texton channels (this is a repeat of Fig. 2, for convenience in comparison). Circles are viewed as if they were at the slant and tilt of the element—note how good the normal estimates are. In the center, the reconstruction rendered with surface texture, and on the right the reconstruction rendered as a grey surface.

Figure 7. On the far left, a view of a model in a spotted dress. In the center, a textured view of the reconstruction obtained using our method. This reconstruction used 1200 texton instances, in 8 clusters. Note the relatively fine detail that was obtained by the reconstruction, including the two main folds in the skirt (indicated with white arrows). Typically, rendering texture on top of the view produces a better looking surface, so we show the surface without texturing on the right; arrows indicate the reconstructed folds in the geometry. Notice that the fold in the skirt is well represented. The smoothing term is generally good at resolving normal ambiguities, but patches of surface that are not well connected to the main body can be subject to a concave-convex ambiguity, as has happened to part of the skirt's bodice here (black arrows).

Finally, we can compare reconstructions obtained using shape from texture with range data. Figure 11 shows a view of a piece of textured cloth together with views of a surface reconstructed from the cloth.


Figure 8. On the far left, another view of a model in a spotted dress. In the center, a textured view of the reconstructed surface, and on the right, a view without texture. Note that the bodice of the dress and the main fold in the center have been recovered correctly, and that the surface is generally satisfactory.

Figure 9. On the left, an image from Clerc and Mallat's work and, in the center, their reconstruction (Fig. 7(a) and (c) respectively from Clerc and Mallat (2002)). On the right, a reconstruction obtained using our method. The arrow shows the rough direction of viewing for both reconstructions. Note that our surface lacks the little wiggles of their surface, probably because the global pass, where the frontal appearance of a texture element is estimated from image instances, helps obtain more accurate normal estimates than a method looking at local texture gradients can yield. Moreover, as Fig. 3 illustrates, we typically have a very large number of instances and so can obtain dense normal estimates. This is one of Clerc and Mallat's images from which shading has been "removed" (e.g. Clerc and Mallat (2002), p. 542, caption to Fig. 6); we do not require that shading be removed (Section 4.1).

Because a numerical comparison is tricky to assess—the meaning of least squares error depends on how one manages correspondence and how one assesses patches missing from a reconstruction, meaning that it is most useful to score two reconstruction methods against one another—we show a qualitative comparison instead.

In particular, we have taken the surface reconstructed using our methods, and rotated and translated it by hand to get the best correspondence with the range surface. Figure 12 shows the comparison. Note that our methods cannot reconstruct surface folds at a scale smaller than the texture elements (Fig. 13 gives a detailed view), but produce a generally satisfactory reconstruction that reproduces the main structure of the original surface.

Figure 10. Vignetting occurs when a surface sees less light than it could because other surfaces obstruct the view. The most significant form for our purposes occurs locally, at the bottom of gutters, and is illustrated on the far left; here, point B is darker than point A for almost any illumination, because many of the directions along which light could arrive at B are blocked by the local surface itself. This effect appears commonly on clothing, and is reproduced in our recovered irradiance maps. Center and right show the images of Figs. 7 and 8, together with the irradiance map (up to scale) recovered using our method. The arrows show vignetting effects in the image reproduced in the recovered irradiance map; no features were recovered in the very dark narrow gutter in the center image (because it is too dark) and so it is not reproduced. We believe our method to be the only irradiance recovery method that can measure such effects.

Figure 11. On the left, a view of a textured cloth; center left and center right show a set of recovered normals (ellipses are rendered as circles on the surface; green ellipses show elements believed to be good, and red those believed to be bad, using the methods of Section 3.1), using the automatically detected texture element, whose frontal reconstruction is shown in the inlay. Note the sharp reconstruction, which suggests normal estimates are good, as image pixels are being averaged together to obtain this reconstructed value. On the right, a textured surface reconstruction. Figure 12 compares this reconstruction with a range map of the surface.

No comparison is possible for our reconstruction of relative irradiance maps, because this is the only method of which we know that can obtain a fine-scale measurement of relative irradiance on a surface without using a simple parametric model like a point light source. The relative irradiance maps also appear to be reasonable. Our method identifies dark patches associated with folds in the garment for Fig. 10. This is a strong qualitative validation of the method; simple physical reasoning confirms that these dark patches must be where the method identifies them, because the surface is vignetted—at the base of the fold, much of a patch's incoming hemisphere of directions is occluded by the surface itself (Fig. 10; see Koenderink and Van Doorn (1983) for more discussion of this phenomenon). The dark patches are of about the right size and shape, and in about the right position. This suggests that our relative irradiance recovery method is producing useful answers.

6. Discussion

Much of current vision research is, like this paper, an elaboration of two points: repeated views are useful, and repetition is awfully common. Lemma 4 can be read as a form of structure from motion result, a shape from texture lemma, or an affine clustering result. We haven't seen it in the literature to date, however. The relationship between the three topics suggests some interesting possibilities. Shape from texture is an attractive measurement strategy, because reconstruction from a single view means that we can deal with deforming objects. However, there is a requirement for repetition in the texture. One might be able to reconstruct deforming objects with textures that do not repeat by viewing a motion sequence. By working across the sequence, we can recover the frontal appearance of each element (by using Lemma 4 across time, rather than space); now by working within each frame, we can recover a surface estimate for that frame. Similarly, one might regard the problem of identifying the actors in a movie (after Fitzgibbon and Zisserman (2002), who attacked it as an affine clustering problem; also see Miller et al. (2004) and Berg et al. (2004)) as a problem of estimating the frontal appearance of a texture element. Alternatively, one might see the problem of identifying the frontal appearance of a texture element as one of congealing, along the lines of Miller et al. (2000).

Figure 12. On the top, the same view of textured cloth shown in Fig. 11. The lower three images show comparisons between range data (in red) and the reconstructed data of Fig. 11 (in blue). The point marked A in each image corresponds, and the surfaces are rotated about a vertical axis as indicated in the first image; in the last image of the series, the point A is at the back, and occluded. Note, in particular, that our reconstruction preserves the main features of the range map rather well, and that the smoothing used to infer the direction of the normal does not flatten the surface excessively. Our reconstruction cannot preserve detail at a scale finer than the texture elements, as Fig. 13 indicates.

Our reformulation of what a texture element is—anything that repeats and is localizable—appears to be useful, mainly because it means that most pictures contain very large numbers of texture elements. Furthermore, as Fig. 4 shows, we clearly can get the frontal appearance of an element and its slant and tilt rather accurately. In turn, this means that we obtain very dense normal estimates, which is probably why the surfaces are quite good. Reconstructed surfaces have curious semantics in computer vision. On the one hand, why would one believe an interpolate in a region where data is absent? On the other, most surfaces do have relatively low curvature, meaning that normals are quite strongly correlated over small scales, so that fitting a surface should help control error. This appears to be a classic bias-variance tradeoff—fitting a surface avoids variance in estimates of geometric properties, but can create bias by oversmoothing. We believe that this difficulty can be avoided by obtaining very dense normal estimates (as we have) and fitting surfaces with a relatively large number of degrees of freedom (as we have). In this case, one can benefit from correlations between normals over small scales without suffering too much bias from a rigid surface model.

Figure 13. With relatively large elements, the cloth can fold in structures that are not flat on the scale of the element, and so are difficult to capture with our method. On the top left, the image of textured cloth from Fig. 11, with a window overlaid to show the rough configuration of a fold identified in the comparison between range data and our reconstruction on the top right. Again, the range data is red and the reconstruction is blue; the view is approximately that of the image, but not exactly, and this is reflected in our approximate region of interest. A rectangle indicates the region of detail shown below in each view. On the bottom left, range; bottom center, reconstruction; and bottom right, textured reconstruction only. The reconstruction offers a reasonable guide to the general structure of the surface, but is incapable of recovering the folds, which are rather smaller than individual elements (cf. Fig. 11). We show the textured reconstruction because the folds appear visible in this case; this suggests that information about these folds still exists, either in the shading pattern or in the finer details of the texture deformation.

One advantage of our view of what a texture element is, is that, from the point of view of illumination estimation, pretty much any object can be seen as a calibration object. This is because if we know that a pattern is repeated, we can tell how brightly particular instances are illuminated. Of course, this approach would fail if a pattern were repeated with albedo variations, but this case seems to be less common, and one might reasonably invoke a generic view assumption. We are aware of no other shape from texture method that recovers an irradiance estimate at the same time it estimates surface geometry, though the problems appear to be quite naturally coupled. It is interesting to speculate on the effects of coupling the surface normal estimate with the irradiance estimate—this might provide stronger disambiguating information than surface smoothness.

Finally, we note that our method of handling the estimated size of an element (assuming that this size is some fixed constant in a frontal frame) is manageable in practice, but is not truly satisfactory. Elements repeat at a small scale because their appearance is not significantly disrupted by surface curvature. In the particular and important case of loose cloth, there is little Gaussian curvature (which is energetically expensive; see Terzopoulos et al. (1987), House et al. (2000), Baraff and Witkin (1998), and Bridson et al. (2003)), and it is important to establish the material parametrization of the surface. We hypothesize that, at least in this case and possibly in others, a similar form of regularity applies at longer spatial scales, and is a powerful cue to a natural parametrization of surfaces.


Future work will involve building automatic parametrizations of reconstructions, building reconstructions of moving surfaces, and attempting to link surface geometry inference to irradiance measurements. A matter of particular interest is how to think about shape and texture in the case of surfaces that are not smooth, where texture is a matter of more than albedo variation.

Appendix

We need to prove

Lemma 3. Assume that $B$ is a linear transformation of the plane. Assume that there is some set of vectors $v_i$ such that $|v_i| = |Bv_i|$. If, in that set of vectors, there is a set of three vectors no two of which are parallel, then $B^T B = I$ (the identity).

Proof: Write $v_i = (v_{i0}, v_{i1})^T$. Write

$$B^T B = \begin{pmatrix} a & b \\ b & c \end{pmatrix}$$

For each $i$ our hypothesis yields an equation linear in $a$, $b$ and $c$:

$$v_{i0}^2 + v_{i1}^2 = a\,v_{i0}^2 + 2b\,v_{i0}v_{i1} + c\,v_{i1}^2$$

It is clear that the solution $(a, b, c) = (1, 0, 1)$ is always available. Choose three vectors, say $i = 1, 2, 3$; if any other solution is available, the determinant

$$d = \det \begin{pmatrix} v_{10}^2 & v_{10}v_{11} & v_{11}^2 \\ v_{20}^2 & v_{20}v_{21} & v_{21}^2 \\ v_{30}^2 & v_{30}v_{31} & v_{31}^2 \end{pmatrix}$$

must vanish. This determinant factors as

$$d = (v_{11}v_{20} - v_{10}v_{21})(v_{11}v_{30} - v_{10}v_{31})(v_{31}v_{20} - v_{30}v_{21})$$

and so can vanish only if two of these three vectors are parallel; in any other case, $B^T B = I$ as claimed. □
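The factorization is easy to confirm symbolically; the following check (not part of the original argument, using sympy) verifies that the determinant and the product above are identical polynomials:

import sympy as sp

v10, v11, v20, v21, v30, v31 = sp.symbols('v10 v11 v20 v21 v30 v31')
M = sp.Matrix([[v10**2, v10*v11, v11**2],
               [v20**2, v20*v21, v21**2],
               [v30**2, v30*v31, v31**2]])
# product of the three pairwise "cross products" from the factorization
product = (v11*v20 - v10*v21)*(v11*v30 - v10*v31)*(v31*v20 - v30*v21)
assert sp.expand(M.det() - product) == 0   # identical polynomials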

Acknowledgments

This paper is almost entirely in response to conversations with Andrew Zisserman and Jitendra Malik. Anonymous referees have made a number of very helpful suggestions.

References

Ahuja, N. and Schachter, B. 1983a. Image models. ACM Computing Surveys, 15(1):83–84.

Ahuja, N. and Schachter, B. 1983b. Pattern Models. Wiley.

Aloimonos, Y. 1986. Detection of surface orientation from texture. I. The case of planes. In IEEE Conf. on Computer Vision and Pattern Recognition, pp. 584–593.

Bajcsy, R.K. and Lieberman, L.I. 1976. Texture gradient as a depth cue. Computer Graphics Image Processing, 5(1):52–67.

Baraff, D. and Witkin, A. 1998. Large steps in cloth simulation. Computer Graphics, 32 (Annual Conference Series):43–54.

Berg, T.L., Berg, A.C., Edwards, J., and Forsyth, D.A. 2004. Who's in the picture? In Proc. NIPS.

Blake, A. and Marinos, C. 1990. Shape from texture: estimation, isotropy and moments. Artificial Intelligence, 45(3):323–380.

Bridson, R., Fedkiw, R., and Anderson, J. 2002. Robust treatment of collisions, contact and friction for cloth animation. Computer Graphics, (Annual Conference Series):594–603.

Bridson, R., Marino, S., and Fedkiw, R. 2003. Simulation of clothing with folds and wrinkles. In SCA '03: Proceedings of the 2003 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 28–36. Eurographics Association.

Choe, Y. and Kashyap, R.L. 1991. 3-D shape from a shaded textural surface image. IEEE Trans. Pattern Analysis and Machine Intelligence, 13(9):907–919.

Clerc, M. and Mallat, S. 1999. Shape from texture through deformations. In Int. Conf. on Computer Vision, pp. 405–410.

Clerc, M. and Mallat, S. 2002. The texture gradient equation for recovering shape from texture. IEEE Trans. Pattern Analysis and Machine Intelligence, 24(4):536–549.

Daley, D.J. and Vere-Jones, D. 1988. An Introduction to the Theory of Point Processes. Springer-Verlag.

Fitzgibbon, A.W. and Zisserman, A. 2002. On affine invariant clustering and automatic cast listing in movies. In Proc. 7th European Conference on Computer Vision. Springer-Verlag.

Forsyth, D.A. 2001. Shape from texture and integrability. In Int. Conf. on Computer Vision, pp. 447–452.

Forsyth, D.A. 2002. Shape from texture without boundaries. In Proc. ECCV, vol. 3, pp. 225–239.

Forsyth, D.A. and Ponce, J. 2002. Computer Vision: A Modern Approach. Prentice-Hall.

Garding, J. 1992. Shape from texture for smooth curved surfaces. In European Conference on Computer Vision, pp. 630–638.

Garding, J. 1995. Surface orientation and curvature from differential texture distortion. In Int. Conf. on Computer Vision, pp. 733–739.

Hastie, T., Tibshirani, R., and Friedman, J. 2001. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag.

Horn, B.K.P. and Schunck, B.G. 1981. Determining optical flow. Artificial Intelligence, 17:185–203.

House, D.H. and Breen, D. (Eds.). 2000. Cloth Modelling and Animation. A.K. Peters.

Koenderink, J.J. and Van Doorn, A.J. 1983. Geometrical modes as a method to treat diffuse interreflections in radiometry. J. Opt. Soc. Am., 73(6):843–850.

Krumm, J. and Shafer, S.A. 1990. Local spatial frequency analysis for computer vision. In Proceedings, Third International Conference on Computer Vision, pp. 354–358.

Krumm, J. and Shafer, S.A. 1992. Shape from periodic texture using the spectrogram. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 284–289. IEEE Press.

Lazebnik, S., Schmid, C., and Ponce, J. 2003. Affine-invariant local descriptors and neighborhood statistics for texture recognition. In Int. Conf. on Computer Vision.

Lazebnik, S., Schmid, C., and Ponce, J. 2003. Sparse texture representation using affine-invariant neighborhoods. In IEEE Conf. on Computer Vision and Pattern Recognition.

Lee, K.M. and Kuo, C.C.J. 1998. Direct shape from texture using a parametric surface model and an adaptive filtering technique. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 402–407. IEEE Press.

Leung, T. and Malik, J. 1996. Detecting, localizing and grouping repeated scene elements from an image. In European Conference on Computer Vision, pp. 546–555.

Lowe, D.G. 1999. Object recognition from local scale-invariant features. In Int. Conf. on Computer Vision, pp. 1150–1157.

Lowe, D.G. 2004. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110.

Lu, R., Koenderink, J.J., and Kappers, A.M.L. 1998. Optical properties (bidirectional reflection distribution functions) of velvet. Applied Optics, 37(25):5974–5984.

Lu, R., Koenderink, J., and Kappers, A. 1999. Specularities on surfaces with tangential hairs or grooves. In ICCV, pp. 2–7.

Malik, J., Belongie, S., Shi, J., and Leung, T. 1999. Textons, contours and regions: cue integration in image segmentation. In Int. Conf. on Computer Vision, pp. 918–925.

Malik, J. and Rosenholtz, R. 1997. Computing local surface orientation and shape from texture for curved surfaces. Int. J. Computer Vision, pp. 149–168.

Mikolajczyk, K. and Schmid, C. 2003. A performance evaluation of local descriptors. In IEEE Conf. on Computer Vision and Pattern Recognition.

Miller, E., Matsakis, N., and Viola, P. 2000. Learning from one example through shared densities on transforms. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 464–471.

Miller, T., Berg, A., Edwards, J., Maire, M., White, R., Teh, Y.-W., Miller, E., and Forsyth, D.A. 2004. Faces and names in the news. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition.

Mundy, J.L. and Zisserman, A. 1994. Repeated structures: image correspondence constraints and 3D structure recovery. In J.L. Mundy, A. Zisserman, and D.A. Forsyth (Eds.), Applications of Invariance in Computer Vision, pp. 89–107.

Pont, S.C. and Koenderink, J.J. 2002. Bidirectional texture contrast function. In Proc. ECCV: Lecture Notes in Computer Science 2353, pp. 808–823.

Pritchard, D. 2003. Cloth parameters and motion capture. Master's thesis, University of British Columbia.

Pritchard, D. and Heidrich, W. 2003. Cloth motion capture. Computer Graphics Forum (Eurographics 2003), 22(3):263–271.

Rosenholtz, R. and Malik, J. 1997. Surface orientation from texture: isotropy or homogeneity (or both)? Vision Research, 37(16):2283–2293.

Sakai, K. and Finkel, L.H. 1994. A shape-from-texture algorithm based on human visual psychophysics. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 527–532. IEEE Press.

Schachter, B. 1980. Model-based texture measures. IEEE Trans. Pattern Analysis and Machine Intelligence, 2(2):169–171.

Schachter, B. and Ahuja, N. 1979. Random pattern generation processes. Computer Graphics Image Processing, 10(1):95–114.

Schaffalitzky, F. and Zisserman, A. 1999. Geometric grouping of repeated elements within images. In D.A. Forsyth, J.L. Mundy, V. Di Gesu, and R. Cipolla (Eds.), Shape, Contour and Grouping in Computer Vision, pp. 165–181.

Schmid, C. and Mohr, R. 1997. Local grayvalue invariants for image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(5):530–534.

Stone, J.V. and Isard, S.D. 1995. Adaptive scale filtering: a general method for obtaining shape from texture. IEEE Trans. Pattern Analysis and Machine Intelligence, 17(7):713–718.

Super, B.J. and Bovik, A.C. 1995. Shape from texture using local spectral moments. IEEE Trans. Pattern Analysis and Machine Intelligence, 17(4):333–343.

Terzopoulos, D., Platt, J., Barr, A., and Fleischer, K. 1987. Elastically deformable models. In Computer Graphics (SIGGRAPH 87 Proceedings), pp. 205–214.

Triggs, B. 1998. Autocalibration from planar scenes. In Proc. ECCV, vol. 1, pp. 89–105.

Witkin, A.P. 1981. Recovering surface shape and orientation from texture. Artificial Intelligence, 17:17–45.