
Int J Comput Vis (2012) 98:123–145. DOI 10.1007/s11263-011-0501-8

Estimating the Natural Illumination Conditions from a Single Outdoor Image

Jean-François Lalonde · Alexei A. Efros · Srinivasa G. Narasimhan

Received: 24 January 2011 / Accepted: 3 October 2011 / Published online: 27 October 2011
© Springer Science+Business Media, LLC 2011

Abstract Given a single outdoor image, we present a method for estimating the likely illumination conditions of the scene. In particular, we compute the probability distribution over the sun position and visibility. The method relies on a combination of weak cues that can be extracted from different portions of the image: the sky, the vertical surfaces, the ground, and the convex objects in the image. While no single cue can reliably estimate illumination by itself, each one can reinforce the others to yield a more robust estimate. This is combined with a data-driven prior computed over a dataset of 6 million photos. We present quantitative results on a webcam dataset with annotated sun positions, as well as quantitative and qualitative results on consumer-grade photographs downloaded from the Internet. Based on the estimated illumination, we show how to realistically insert synthetic 3-D objects into the scene, and how to transfer appearance across images while keeping the illumination consistent.

Keywords Illumination estimation · Data-driven methods · Shadow detection · Scene understanding · Image synthesis

    1 Introduction

The appearance of a scene is determined to a great extent by the prevailing illumination conditions. Is it sunny or overcast, morning or noon, clear or hazy?

J.-F. Lalonde (✉) · A.A. Efros · S.G. Narasimhan
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
e-mail: [email protected]

A.A. Efros
e-mail: [email protected]

S.G. Narasimhan
e-mail: [email protected]

Claude Monet, a fastidious student of light, observed: "A landscape does not exist in its own right... but the surrounding atmosphere brings it to life... For me, it is only the surrounding atmosphere which gives subjects their true value." Within the Grand Vision Problem, illumination is one of the key variables that must be untangled in order to get from pixels to image understanding.

But while a lot of work has been done on modeling and using illumination in a laboratory setting, relatively little is known about it "in the wild", i.e. in a typical outdoor scene. In fact, most vision applications treat illumination more as a nuisance, something that one strives to be invariant to, rather than as a source of signal. Examples include illumination adaptation in tracking and surveillance (e.g. Stauffer 1999), or contrast normalization schemes in popular object detectors (e.g. Dalal and Triggs 2005). Alas, the search for the ultimate illumination invariant might be in vain (Chen et al. 2000). Instead, we believe there is much to be gained by embracing illumination, even in the challenging, uncontrolled world of consumer photographs.

In this paper, we propose a method for estimating natural illumination (sun position and visibility) from a single outdoor image. To be sure, this is an extremely difficult task, even for humans (Cavanagh 2005). In fact, the problem is severely underconstrained in the general case: while some images might have enough information for a reasonably precise estimate, others will be completely uninformative. Therefore, we will take a probabilistic approach, estimating illumination parameters using as much information as may be available in a given image and producing the maximum likelihood solution (see Fig. 1).

So what information about illumination is available in a single image? Unfortunately, there is no simple answer. When we humans perform this task, we look at different parts of the image for clues. The appearance of the sky can


Fig. 1 A synthetic 3-D statue has been placed in a photograph (a) in an illumination-consistent way (the original image is shown in the inset). To enable this type of operation, we develop an approach that uses information from the sky, the shading, the shadows, and the visible pedestrians to estimate a distribution on the likely sun positions (b), and is able to generate a synthetic sky model (c)

tell us if it is clear or overcast (i.e. is the sun visible?). On a clear day, the sky might give some weak indication about the sun position. The presence of shadows on the ground plane can, again, inform us about sun visibility, while the direction of shadows cast by vertical structures can tell us about sun direction. The relative shading of surfaces at differing orientations (e.g. two building facades at a right angle) can also give a rough indication of sun direction, and so can the shading effects on convex objects populating the scene (e.g. pedestrians, cars, poles, etc.).

Our approach is based on implementing some of these intuitions into a set of illumination cues. More precisely, the variation in sky color, cast shadows on the ground, relative shading of vertical surfaces, and intensity variation on convex objects (pedestrians) are the four cues used in this work. Of course, each one of them is rather weak and unreliable when taken individually. The sky might be completely saturated, or might not even be present in the image. The ground might not be visible, be barren of any shadow-casting

structures, or be deprived of any recognizable convex objects. Shading information might, likewise, be inaccessible due to a lack of appropriate surfaces or large differences between surface reflectances. Furthermore, computing these cues will inevitably lead to more noise and error (misdetected shadows or pedestrians, poor segmentation, incorrect camera parameters, etc.). Hence, in this work, we combine the information obtained from these weak cues together, while applying a data-driven prior computed over a set of 6 million Internet photographs.

The result sections (Sects. 3.2 and 6) will show that the sun visibility can be estimated with 83.5% accuracy, and the combined estimate is able to successfully locate the sun within an octant (quadrant) for 40% (55%) of the real-world images in our very challenging test set, and as such outperforms any of the cues taken independently. While this goes to show how hard the problem of illumination from single images truly is, we believe this can still be a useful result for a number of applications. For example, just knowing that the sun is somewhere on your left might be enough for a point-and-shoot camera to automatically adjust its parameters, or for a car detector to be expecting cars with shadows on the right.

    1.1 Related Work

The color and geometry of illuminants can be directly observed by placing probes, such as mirror spheres (Stumpfel et al. 2004), color charts, or integrating spheres, within the scene. But, alas, most of the photographs captured do not contain such probes, and thus we are forced to look for cues within the scene itself. There is a long and rich history in computer vision of understanding illumination from images. We will briefly summarize relevant works here.

Color Constancy  These approaches strive to extract scene representations that are insensitive to the illuminant color. For this, several works either derive transformations between scene appearances under different source colors (e.g. Finlayson et al. 1993), or transform images into different color spaces that are insensitive to illuminant colors (e.g. Zickler et al. 2006). Our work focuses on a complementary representation of outdoor illumination (sun direction and visibility).

Model-Based Reflectance and Illumination Estimation  Several works estimate illumination (light direction and location) in conjunction with model-based estimation of object shape and reflectances (Lambertian, Dichromatic, Torrance-Sparrow), from one or more images of the scene (Sato et al. 2003; Basri et al. 2007). Of particular interest, Sun et al. (2009) estimate illumination conditions in complex urban scenes by registering the photographs to 3-D


models of the scene. Our work does not rely on specific reflectance models of outdoor surfaces or exact estimation of 3-D geometry.

Shadow Extraction and Analysis  Many works detect and remove shadows using one or more images (Weiss 2001; Finlayson et al. 2002; Wu and Tang 2005). The extracted shadows have also been used to estimate the sun direction in constrained settings (Kim and Hong 2005) or in webcams (Junejo and Foroosh 2008). But shadows are only weakly informative about illumination when their sizes in the image are too small or their shapes are complex or blurred. Li et al. (2003) propose a technique that combines shadow, shading, and specularity information in a framework for estimating multiple illuminant directions in single images, but their approach is restricted to tabletop-like objects. Our work, for the first time, combines shadow cues with other semi-informative cues to better estimate illumination from a single image of a general outdoor scene.

Illumination Estimation from Time-Lapse Sequences  Sunkavalli et al. (2008) develop techniques to estimate sun direction and scene geometry by fitting a photometric model of scene reflectance to a time-lapse sequence of an outdoor scene. Lalonde et al. (2010) exploit a physically-based model of sky appearance (Perez et al. 1993) to estimate the sun position relative to the viewing direction from a time-lapse sequence. We will use the same model of the sky but recover the most likely representation of the complete sky dome (sky appearance, sun position, and sun visibility) from a single image.

Finally, Lalonde et al. (2007) use cues such as multivariate histograms of color and intensity together with a rough classification of scene geometry (Hoiem et al. 2007a) to match illuminations of different scenes. However, their cues are global in nature and cannot be used to match sun directions. This makes their approach ill-suited for 3-D object insertion.

    2 Representations for Natural Illumination

Because of its predominant role in any outdoor setting, understanding natural illumination has been of critical importance in many fields such as computer vision and graphics, but also in other research areas such as biology (Hill 1994), architecture (Reinhart et al. 2006), solar energy (Mills 2004), and remote sensing (Healey and Slater 1999). Consequently, researchers have proposed many different representations for natural illumination adapted to their respective applications. We present here an overview of popular representations that we divide into three main categories: physically-based, environment maps, and statistical. We conclude this section by describing the representation that we use in this work.

    2.1 Physically-Based Representations

The type of representation that is probably the most popular in the literature is what we name here "physically-based" representations: mathematical expressions that model the physics of natural illumination. They typically stem from the following equation (adapted from Slater and Healey 1998), which models the ground spectral irradiance L(λ) as a function of the sun and sky radiances E_sun and E_sky respectively:

$$L(\lambda) = K\,E_{\mathrm{sun}}(\theta_s, \phi_s, \lambda)\cos(\theta_s) + \int_{\phi=0}^{2\pi}\int_{\theta=0}^{\pi/2} E_{\mathrm{sky}}(\theta,\phi,\lambda)\cos(\theta)\sin(\theta)\,d\theta\,d\phi, \qquad (1)$$

where K is a binary constant which accounts for occluding bodies in the solar-to-surface path, and E_sky is integrated over the entire sky hemisphere. While (1) may look simple, its complexity lies in the characterization of the sun and sky radiance components, which depend on the form and amount of atmospheric scattering. As a result, a wide variety of ways to model the sun and the sky have been proposed; we summarize a few here that are most relevant to our work.
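To make (1) concrete, here is a minimal numerical sketch in Python that evaluates the ground irradiance for one wavelength. The radiance functions e_sun and e_sky below are made-up placeholders (the real components come from atmospheric models such as Bird 1984 or Perez et al. 1993), and the hemisphere integral is approximated with the trapezoidal rule.

```python
import numpy as np

def e_sun(theta_s, phi_s, lam):
    # Hypothetical sun spectral radiance (arbitrary units).
    return 1000.0 * np.exp(-lam / 700.0)

def e_sky(theta, phi, lam):
    # Hypothetical isotropic sky radiance (arbitrary units).
    return 50.0 * np.exp(-lam / 500.0)

def ground_irradiance(theta_s, phi_s, lam, sun_occluded=False):
    K = 0.0 if sun_occluded else 1.0                 # binary occlusion constant
    direct = K * e_sun(theta_s, phi_s, lam) * np.cos(theta_s)

    # Integrate E_sky over the sky hemisphere: theta in [0, pi/2], phi in [0, 2*pi].
    thetas = np.linspace(0.0, np.pi / 2.0, 90)
    phis = np.linspace(0.0, 2.0 * np.pi, 180)
    t, p = np.meshgrid(thetas, phis, indexing="ij")
    integrand = e_sky(t, p, lam) * np.cos(t) * np.sin(t)
    diffuse = np.trapz(np.trapz(integrand, phis, axis=1), thetas)

    return direct + diffuse

print(ground_irradiance(theta_s=np.deg2rad(30.0), phi_s=0.0, lam=550.0))
```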

Being the dominant light source outdoors, the sun has been studied extensively. For example, in architectural design, the relative position of the sun is an important factor in the heat gain of buildings and in radiance computations that determine the quantity of natural light received inside each of the rooms (Ward 1994). Understanding the quantity of energy delivered by the sun at ground level is also critical to predict the quantity of electricity that can be produced by photovoltaic cells (Mills 2004). As such, highly precise physical models for sunlight transport through the atmosphere (E_sun in (1)) have been proposed (Bird 1984; Kasten and Young 1989).

The second main illuminant outdoors, the sky, has also long been studied by physicists. One of the most popular physically-based sky models was introduced by Perez et al. (1993), and was built from measured sky luminances. This model has been used in graphics for relighting architectural models (Yu and Malik 1998), and for developing an efficient sky rendering algorithm (Preetham et al. 1999). This is also the model we will ourselves use to understand the sky appearance (see Sect. 4.1). It has recently been shown that some of the parameters of this model can be estimated from an image sequence captured by a static camera in which the sky is visible (Lalonde et al. 2010).

In computer vision, Sato and Ikeuchi (1995) use a model similar to (1) along with simple reflectance models to estimate the shape of objects lit by natural illumination. This has led to applications in outdoor color representation and classification (Buluswar and Draper 2002), surveillance (Tsin et al. 2001), and robotics (Manduchi 2006). A similar flavor of the same model has also been used in the color constancy


domain by Maxwell et al. (2008). Because physical models possess many parameters, recovering them from images is not an easy task. Therefore, researchers have resorted to simplifying assumptions, approximations, or to the use of image sequences (Sunkavalli et al. 2008) to make the estimation problem more tractable.

    2.2 Environment Map Representations

An alternative representation is based on the accurate measurement and direct storage of the quantity of natural light received at a point from all directions. As opposed to physically-based representations, here no compact formula is sought and all the information is stored explicitly. Originally introduced in the computer graphics community by Blinn and Newell (1976) to relight shiny objects, the representation (also dubbed "light probe" in the community) was further developed by Debevec (1998, 1997) to realistically insert virtual objects in real images. It is typically captured using a high-quality, high-dynamic-range camera equipped with an omni-directional lens (or a spherical mirror), and requires precise calibration. Stumpfel et al. (2004) more recently proposed to employ this representation to capture the sky over an entire day into an environment map format, and used it for both rendering and relighting (Debevec et al. 2004).

In comparison to its physically-based counterpart, the environment map representation has no parameters to estimate: it is simply measured directly. However, its main drawback (aside from large memory requirements) is that physical access to the scene of interest is necessary to capture it. Thus, it cannot be used on images which have already been captured.

    2.3 Statistical Representations

The third type of representation for natural illumination has its roots in data mining and dimensionality reduction techniques. The idea is to extract the low-dimensional trends from datasets of measurements of natural illumination. This statistical, or data-driven, representation is typically obtained by gathering a large number of observations, either from physical measurements of real-world illumination captured around the world (Judd et al. 1964) or by generating samples using a physical model (Slater and Healey 1998), and performing a dimensionality reduction technique (e.g. PCA) to discover which dimensions explain most of the variance in the samples.

One of the first to propose such an idea was Judd et al. (1964), whose work is still considered to this date one of the best experimental analyses of daylight (Slater and Healey 1998). Their study considered 622 samples of daylight measured in the visible spectrum, and observed that most of them

could be approximated accurately by a linear combination of three fixed functions. Subsequently, Slater and Healey (1998) reported that a 7-dimensional PCA representation captures 99% of the variance of the spectral distribution of natural illumination in the visible and near-infrared spectra, based on a synthetically-generated dataset of spectral lighting profiles. Dror et al. (2004) performed a similar study using a set of HDR environment maps as input.

Since then, linear representations for illumination have been successful in many applications, notably in the joint estimation of lighting and reflectance from images of an object of known geometry. Famously, Ramamoorthi and Hanrahan (2001) used a spherical harmonics representation (linear in the frequency domain) of reflectance and illumination and expressed their interaction as a convolution. More recently, Romeiro and Zickler (2010) also use a linear illumination basis to infer material properties by marginalizing over a database of real-world illumination conditions.

    2.4 An Intuitive Representation for Natural Illumination

All the previous representations assume that the camera (or other sensing modality) used to capture illumination is of very high quality (there must be a linear relationship between the illumination radiance and the sensor reading, it must have high dynamic range, etc.) and that the images contain very few, easily recognizable objects. However, most consumer photographs found on the Internet do not obey these rules, and we have to deal with acquisition process issues like non-linearity, vignetting, compression and resizing artifacts, limited dynamic range, out-of-focus blur, etc. In addition, scenes contain occlusions, depth discontinuities, and a wide range of scales due to perspective projection. These issues make the previous illumination representations very hard to estimate from such images.

Instead, we propose to use a representation that is more amenable to understanding these types of images. The model focuses on the main source of outdoor lighting: the sun. We model the sun using two variables:

V : its visibility, i.e. whether or not it is shining on the scene;
S : its angular position relative to the camera. In spherical coordinates, S = (θ_s, φ_s), where θ_s is the sun zenith angle and φ_s is the sun azimuth angle measured relative to the camera azimuth (the s and c indices denote the sun and camera respectively). S is specified only if V = 1, and undefined if V = 0.

In contrast to existing approaches, this representation for natural illumination is simpler and more intuitive, therefore making it better suited for single image interpretation. For example, if the sun visibility V is 0 (for example, when the sky is overcast), then the sun relative position S is undefined. Indeed, if the sun is not shining on the scene, then all the objects in the scene are lit by the same area light source


Fig. 2 (Color online) An intuitive representation for natural illumination. (a) Illumination I is modeled using two variables: (1) the sun visibility V, which indicates whether the sun is shining on the scene or is blocked by an occluder (e.g. a cloud); and (2) its angular position relative to the camera S = (θ_s, φ_s). Throughout the paper, the sun position probability P(S) for an input image such as (b) is displayed (c) as if the viewer is looking straight up (the center point is the zenith), with the camera field of view drawn at the bottom. Probabilities are represented with a color scale that varies from blue (low probability) to red (high probability). In this example, the maximum likelihood sun position (yellow circle) is estimated to be at the back-right of the camera. We sometimes illustrate the results by inserting a virtual sun dial (red stick in (b)) and drawing its shadow corresponding to the sun position

that is the sky. It is not useful to recover the sun direction in this case. If V is 1, however, then knowing the sun position S is very important, since it is responsible for illumination effects such as cast shadows, shading, specularities, etc., which might strongly affect the appearance of the scene. Figure 2 illustrates our model schematically.

    3 Is the Sun Visible or Not?

Using the above representation, we estimate P(I | I), the probability distribution over the illumination parameters I, given a single image I. Because our representation defines the sun position S only if it is visible, we propose to first

    estimate the sun visibility V :

P(I | I) = P(S, V | I) = P(S | V, I) P(V | I).    (2)

In this section, we present how we estimate the distribution over the sun visibility variable given the image, P(V | I). The remainder of the paper will then focus on how we estimate P(S | V = 1, I), that is, the probability distribution over the sun position if it was determined to be visible. We propose a supervised learning approach to learn P(V | I), where a classifier is trained on features computed on a manually-labelled dataset of images. We first describe the features that were developed for this task, then provide details on the classifier used.

    3.1 Image Cues for Predicting the Sun Visibility

When the sun is shining on the scene, it usually creates noticeable effects in the image. Consider for a moment the first and last images of Fig. 3. When the sun is visible (Fig. 3a), it creates hard cast shadow boundaries, bright and dark areas corresponding to sunlit and shadowed regions, highly-saturated colors, and clear skies. On the other hand, when the sun is occluded (Fig. 3e), the colors are dull, the sky is gray or saturated, and there are no visible shadows.

We devise sun visibility features based on these intuitions. We first split the image into three main geometric regions: the ground G, the sky S, and the vertical surfaces V, using the geometric context classifier of Hoiem et al. (2007a). We use these regions when computing the following set of features:

Bright and dark regions: We compute the mean intensity of the brightest of two clusters on the ground G, where the clustering is performed with k-means with k = 2; the same is done for the vertical surfaces V;

Saturated colors: We compute 4-dimensional marginal normalized histograms of the scene (G ∪ V) in the saturation and value color channels; to add robustness to variations in exposure, white balance, and gamma response functions across cameras, we follow Dale et al. (2009) and also compute a 4-dimensional histogram of the scene in the normalized log-RGB space;

Sky: If the sky is visible, we compute its average color in RGB;

Contrast: We use the contrast measure proposed in Ke et al. (2006) (a measure of the histogram spread), computed over the scene pixels only (G ∪ V);

Shadows: We apply the shadow detection method of Lalonde et al. (2010), and count the fraction of pixels that belong to the ground G and to the scene (G ∪ V) and are labelled as shadows.

These features are concatenated into a 20-dimensional vector which is used in the learning framework described next.
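As an illustration, the following Python sketch computes two of the features above: the mean intensity of the brightest k-means cluster within a region, and the 4-bin saturation/value histograms of the scene. The masks are assumed to come from a geometric-context-style classifier, HSV values are assumed to lie in [0, 1], and the exact binning may differ from the implementation used in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def brightest_cluster_mean(intensity, mask):
    """Mean intensity of the brighter of two k-means clusters inside a region."""
    vals = intensity[mask].reshape(-1, 1)
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(vals)
    means = [vals[labels == k].mean() for k in range(2)]
    return max(means)

def saturation_value_histograms(hsv, scene_mask, n_bins=4):
    """4-bin normalized marginal histograms in the S and V channels of the scene."""
    feats = []
    for c in (1, 2):  # saturation, value channels
        h, _ = np.histogram(hsv[..., c][scene_mask], bins=n_bins, range=(0.0, 1.0))
        feats.append(h / max(h.sum(), 1))
    return np.concatenate(feats)
```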


Fig. 3 Results of applying the sun visibility classifier on single images to estimate P(V | I), sorted in decreasing order of probability

Table 1 Confusion matrix for the sun visibility classifier. Overall, the class-normalized classification accuracy is 83.5%

              Not visible   Visible
Not visible      87.6%       12.4%
Visible          20.5%       79.5%

    3.2 Sun Visibility Classifier

We employ a classical supervised learning formulation, in which the image features are first pre-computed on a set of manually-labelled training images, and then fed to a classifier. We now provide more details on the learning setup used to predict whether or not the sun is visible in an image.

We selected a random subset of outdoor images from the LabelMe dataset (Russell et al. 2008), split into 965 images used for training and 425 for testing. Because LabelMe images taken from the same folders are likely to come from the same location or to be taken by the same camera, the training and test sets were carefully split to avoid folder overlap. Treating V as a binary variable, we then manually labelled each image with either V = 1 if the sun is shining on the scene, or V = 0 otherwise. Notice that this binary representation of V is only a coarse approximation of the physical phenomenon: in reality, the sun may have fractional visibility (e.g. due to partially-occluding clouds). In practice, however, we find that it is extremely difficult, even for humans, to reliably estimate a continuous value for the sun visibility from a single image. Additionally, precisely measuring this value requires expensive equipment that is not available for databases of existing images.

We then train a binary, linear SVM classifier on this training dataset, and evaluate its performance on the test set. We use the libsvm package (Chang and Lin 2001), and convert the SVM scores to probabilities using the sigmoid fitting method of Platt (1999). Overall, we have found this method to work quite well on a variety of images: the resulting class-normalized test classification accuracy is 83.5%, and the full confusion matrix is shown in Table 1. Qualitative results are also shown in Fig. 3.
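The sketch below reproduces the spirit of this setup with scikit-learn instead of libsvm: a linear SVM is trained on the 20-dimensional features, and a Platt-style sigmoid is fit to its decision scores to obtain P(V = 1 | I). The arguments X_train and y_train are assumed to hold the pre-computed features and the manual visibility labels.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

def train_visibility_classifier(X_train, y_train):
    """X_train: (N, 20) feature matrix; y_train: manual V labels in {0, 1}."""
    svm = LinearSVC(C=1.0).fit(X_train, y_train)

    # Platt (1999)-style scaling: fit a sigmoid on the SVM decision scores.
    scores = svm.decision_function(X_train).reshape(-1, 1)
    platt = LogisticRegression().fit(scores, y_train)

    def p_sun_visible(x):
        s = svm.decision_function(np.atleast_2d(x)).reshape(-1, 1)
        return platt.predict_proba(s)[:, 1]   # P(V = 1 | I)

    return p_sun_visible
```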

Now that we have a classifier that determines whether or not the sun is visible in the image, P(V | I), we consider

how we can estimate the remaining factor in our representation (2): the distribution over sun positions S given that the sun is visible, P(S | V = 1, I).

    4 Image Cues for Predicting the Sun Direction

When the sun is visible, its position affects different parts of the scene in very different ways. In our approach, information about the sun position is captured from four major parts of the image (the sky pixels S, the ground pixels G, the vertical surface pixels V, and the pixels belonging to pedestrians P) via four cues. To partition the image in this way, we use the approach of Hoiem et al. (2007a), which returns a pixel-wise labeling of the image, together with the pedestrian detector of Felzenszwalb et al. (2010), which returns the bounding boxes of potential pedestrian locations. Both these detectors also include a measure of confidence in their respective outputs.

We represent the sun position S = (θ_s, φ_s) using two parameters: θ_s is the sun zenith angle, and φ_s the sun azimuth angle with respect to the camera. This section describes how we compute distributions over these parameters given the sky, the shadows, the shading on the vertical surfaces, and the detected pedestrians individually. Afterwards, in Sect. 5, we will see how to combine these cues to estimate the sun position given the entire image.

    4.1 Sky

In order to estimate the sun position angles from the sky, we take inspiration from the work of Lalonde et al. (2010), which shows that a physically-based sky model (Perez et al. 1993) can be used to estimate the maximum likelihood orientation of the camera with respect to the sun from a sequence of sky images. However, we are now dealing with a single image only. If we, for now, assume that the sky is completely clear, our solution is to discretize the parameter space and try to fit the sky model for each parameter setting. For this, we assume that the sky pixel intensities s_i ∈ S are conditionally independent given the sun position, and are distributed according to the following generative model, a function of the Perez sky model (Perez et al. 1993)


Fig. 4 (Color online) Illumination cue from the sky only. Starting from the input image (a), we compute the sky mask (b) using Hoiem et al. (2007a). The resulting sky pixels are then used to estimate

P(θ_s, φ_s | S) (c). The maximum likelihood sun position is shown with a yellow circle. We use this position to artificially synthesize a sun dial in the scene (d)

g(·), the image coordinates (u_i, v_i) of s_i, and the camera focal length f_c and zenith angle (with respect to vertical) θ_c:

$$s_i \sim \mathcal{N}\big(k\,g(\theta_s, \phi_s, u_i, v_i, f_c, \theta_c),\ \sigma_s^2\big), \qquad (3)$$

where N(μ, σ²) is the normal distribution with mean μ and variance σ², and k is an unknown scale factor (see Lalonde et al. 2010 for details). We obtain the distribution over sun positions by computing

$$P(\theta_s, \phi_s \mid \mathcal{S}) \propto \exp\left(-\sum_{s_i \in \mathcal{S}} \frac{\big(s_i - k\,g(\theta_s, \phi_s, \ldots)\big)^2}{2\sigma_s^2}\right) \qquad (4)$$

for each bin in the discrete (θ_s, φ_s) space, and normalizing appropriately. Note that since k in (3) and (4) is unknown, we first optimize for k for each sun position bin independently using a non-linear least-squares optimization scheme (see Lalonde et al. 2010 for more details on the sky model).
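The following sketch illustrates the grid search implied by (3) and (4): for each discrete sun position bin, the unknown scale k is fit by least squares and the sky pixels are scored under the model. Here perez_g is a placeholder for the sky model g(·) of Lalonde et al. (2010), and sky_vals, uv, fc, and theta_c are assumed inputs; the bin counts and sigma_s are arbitrary choices.

```python
import numpy as np
from scipy.optimize import least_squares

def sky_cue_distribution(sky_vals, uv, fc, theta_c, perez_g, sigma_s=0.1,
                         n_zenith=20, n_azimuth=36):
    """Return P(theta_s, phi_s | sky) over a discrete sun-position grid."""
    thetas = np.linspace(0.0, np.pi / 2.0, n_zenith)
    phis = np.linspace(-np.pi, np.pi, n_azimuth)
    logp = np.zeros((n_zenith, n_azimuth))

    for i, ts in enumerate(thetas):
        for j, ps in enumerate(phis):
            pred = perez_g(ts, ps, uv, fc, theta_c)          # model value per sky pixel
            # Fit the unknown scale k for this bin by least squares.
            res = least_squares(lambda k: sky_vals - k[0] * pred, x0=[1.0])
            logp[i, j] = -np.sum(res.fun ** 2) / (2.0 * sigma_s ** 2)

    p = np.exp(logp - logp.max())
    return p / p.sum()
```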

As indicated in (3), the sky model g(·) also requires knowledge of two important camera parameters: its zenith angle θ_c and its focal length f_c. If we assume that f_c is available via the EXIF tag of the photograph,¹ then θ_c can be computed by finding the horizon line v_h in the image (assuming the camera has no roll angle). We circumvent the hard problem of horizon line estimation by making a simple approximation: select the row midway between the lowest sky pixel and the highest ground pixel as the horizon. Note that all the results shown in this paper have been obtained using

¹When unavailable in the EXIF data, the focal length f_c defaults to the equivalent of a horizontal field of view of 40°, as in Hoiem et al. (2005).

this approximation, which we have found to work quite well in practice. Figure 4 demonstrates the distribution over sun positions obtained using the sky cue.
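A minimal sketch of this camera-parameter estimation is shown below: the horizon row is taken midway between the lowest sky pixel and the highest ground pixel, the focal length falls back to a 40° horizontal field of view when EXIF data is missing, and the camera zenith angle follows from simple trigonometry. The sign convention assumed here (image rows increasing downwards, no roll) may differ from the actual implementation.

```python
import numpy as np

def estimate_camera_zenith(sky_mask, ground_mask, fc=None, image_width=None):
    """Approximate the camera zenith angle (radians) from sky/ground masks."""
    h = sky_mask.shape[0]
    lowest_sky = np.max(np.where(sky_mask.any(axis=1))[0])
    highest_ground = np.min(np.where(ground_mask.any(axis=1))[0])
    v_horizon = 0.5 * (lowest_sky + highest_ground)   # horizon row approximation

    if fc is None:  # default: 40-degree horizontal field of view
        fc = (image_width / 2.0) / np.tan(np.deg2rad(40.0) / 2.0)

    # Horizon above the image center means the camera is pitched down,
    # i.e. its optical axis has a zenith angle greater than 90 degrees.
    return np.pi / 2.0 + np.arctan2(h / 2.0 - v_horizon, fc)
```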

So far, we have assumed that the sky is completely clear, which might not always be the case! Even if the sun is shining on the scene, the sky might be covered with clouds, thus rendering the physical model g(·) useless. To deal with this problem, we classify the sky into one of three categories: clear, partially cloudy, or completely overcast. For this, we build a small database of representative skies for each category from images downloaded from Flickr, and compute the illumination context feature (Lalonde et al. 2007) on each. We then find the k nearest neighbors in the database, and assign the most common label (we use k = 5). If the sky is found to be overcast, the sun position distribution P(θ_s, φ_s | S) is left uniform. For partly cloudy scenes, we remove the clouds with a simple binary color segmentation of the sky pixels (keeping the cluster that is closer to blue) and fit the sky model described earlier only to the clear portion of the sky.

4.2 Ground Shadows Cast by Vertical Objects

Shadows cast on the ground by vertical structures can essentially serve as sun dials and are often used by humans to determine the sun direction (Koenderink et al. 2004). Unfortunately, it is extremely hard to determine whether a particular shadow was cast by a vertical object. Luckily, it turns out that due to the statistics of the world (gravity makes many things stand up straight), the majority of long shadows should, in fact, be produced by vertical things. Therefore, if we can detect a set of long and strong shadow lines


(edges), we can use them in a probabilistic sense to determine a likely sun azimuth (up to a directional ambiguity). While shadows have also been used to estimate the sun direction with user input (Kim and Hong 2005) or in webcams (Junejo and Foroosh 2008), no technique so far has proposed to do so automatically from a single image.

Most existing techniques for detecting shadows from a single image are based on computing illumination invariants that are physically-based and are functions of individual pixel values (Finlayson et al. 2002, 2004, 2007; Maxwell et al. 2008; Tian et al. 2009) or of the values in a local image neighborhood (Narasimhan et al. 2005). Unfortunately, reliable computation of these invariants requires high quality images with wide dynamic range and high intensity resolution, in which the camera radiometry and color transformations are accurately measured and compensated for. Even slight perturbations (imperfections) in such images can cause the invariants to fail severely. Thus, they are ill-suited for regular consumer-grade photographs such as those from Flickr and Google, which are noisy and often contain compression, resizing and aliasing artifacts, as well as effects due to automatic gain control and color balancing. Since much of current computer vision research is done on consumer photographs (and even worse-quality photos from mobile phones), there is an acute need for a shadow detector that can work on such images. In this work, we use the shadow detection method we introduced in Lalonde et al. (2010), which we summarize briefly here for completeness.

Our approach relies on a classifier trained to recognize ground shadow edges by using features computed over a local neighborhood around the edge. The features used are ratios of color intensities computed on both sides of each edge, in three color spaces (RGB, LAB, and that of Chong et al. 2008), and at four different scales. As suggested by Zhu et al. (2010), we also employ a texture description of the regions on both sides of each edge. The feature distribution is estimated using a logistic regression version of Adaboost (Collins et al. 2002), with twenty 16-node decision trees as weak learners. This classification method provides good feature selection and outputs probabilities, and has been successfully used in a variety of other vision tasks (Hoiem et al. 2007a, 2007b). To train the classifier, we introduced a novel dataset containing more than 130 images in which each shadow boundary on the ground has been manually labelled (Lalonde et al. 2009). Finally, a Conditional Random Field (CRF) is used to obtain smoother shadow contours which lie on the ground (Hoiem et al. 2007a).

From the resulting shadow boundaries, we detect the long shadow lines on the ground l_i ∈ G by applying the line detection algorithm of Košecká and Zhang (2002). Let us consider one shadow line l_i. If its orientation on the ground plane is denoted θ_i, then the angle between the shadow line

Fig. 5 (Color online) How do shadow lines predict the sun azimuth? We use our shadow boundary dataset (Lalonde et al. 2009) to compute a non-parametric estimate for P(φ_s | Δ(θ_i, φ_s)) (a). A total of 1,700 shadow lines taken from 74 images were used to estimate the distribution. (b) Example image from the shadow dataset which contains shadows both aligned with and perpendicular to the sun (lamppost and roof line respectively). For visualization, the ground truth sun direction is indicated by the red stick and its shadow

    orientation and the sun azimuth angle is

$$\Delta(\theta_i, \phi_s) = \min\big\{\angle(\theta_i, \phi_s),\ \angle(\theta_i + 180^\circ, \phi_s)\big\}, \qquad (5)$$

with the 180° ambiguity due to our assumption that we do not know which object is casting this shadow. Here, ∠(·, ·) denotes the angular difference. We obtain a non-parametric estimate for P(φ_s | θ_i) by detecting long lines on the ground truth shadow boundaries in 74 images from our shadow boundary dataset (Lalonde et al. 2009), in which we have manually labeled the sun azimuth. The distribution obtained from the resulting 1,700 shadow lines is shown in Fig. 5a. The strongest peak, at Δ(θ_i, φ_s) < 5°, confirms our intuition that long shadow lines align with the sun direction. Another, smaller peak seems to rise at Δ(θ_i, φ_s) > 85°.


Fig. 6 (Color online) Illumination cue from the shadows only. Starting from the input image (a), we compute the ground mask (b) using Hoiem et al. (2007a), estimate shadow boundaries using Lalonde et al. (2010) (shown in red), and finally extract long lines (Košecká and Zhang 2002) (shown in blue). The long shadow lines are used to estimate

P(θ_s, φ_s | G) (c). Note that shadow lines alone can only predict the sun relative azimuth angle up to a 180° ambiguity. For visualization, the two most likely sun positions (shown with yellow circles) are used to artificially synthesize a sun dial in the scene (d)

This is explained by the roof lines of buildings, quite common in our database, which cast shadows that are perpendicular to the sun direction (Fig. 5b).

All the shadow lines are combined by making each one vote for its preferred sun direction:

$$P(\phi_s \mid \mathcal{G}) \propto \sum_{l_i \in \mathcal{G}} P(\phi_s \mid \theta_i). \qquad (6)$$

Of course, computing the shadow line orientation θ_i on the ground requires knowledge of the zenith angle θ_c and focal length f_c of the camera. For this, we use the estimates obtained in the previous section. Figure 6 illustrates the results obtained using the shadow cue only.
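The sketch below puts (5) and (6) together: each detected shadow line votes for candidate sun azimuths according to a non-parametric table of the angular difference Δ. The table p_delta (one probability per 1° bin of Δ in [0°, 90°]) and the list of ground-plane line orientations are assumed to be given; the voting is written as a sum, matching the voting interpretation above.

```python
import numpy as np

def angular_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    return np.abs((a - b + 180.0) % 360.0 - 180.0)

def delta(theta_i, phi_s):
    # 180-degree ambiguity: we do not know which end of the line points at the sun.
    return min(angular_diff(theta_i, phi_s), angular_diff(theta_i + 180.0, phi_s))

def shadow_azimuth_distribution(line_orientations, p_delta):
    """Accumulate votes for the sun azimuth from all detected shadow lines."""
    phis = np.arange(-180, 180)                       # candidate sun azimuths (deg)
    votes = np.zeros(phis.shape, dtype=float)
    for theta_i in line_orientations:
        d = np.array([delta(theta_i, p) for p in phis])
        votes += p_delta[np.clip(d.astype(int), 0, 90)]
    return votes / votes.sum()                        # P(phi_s | G), up to ambiguity
```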

    4.3 Shading on Vertical Surfaces

If the rough geometric structure of the scene is known, then analyzing the shading on the main surfaces can often provide an estimate for the possible sun positions. For example, a brightly lit surface indicates that the sun may be pointing in the direction of its normal, or at least in its vicinity. Of course, this reasoning also assumes that the albedos of the surfaces are either known or equal, neither of which is true. However, we have experimentally found that, within a given image, the albedos of the major vertical surfaces are often relatively similar (e.g. different sides of the same house, or similar houses on the same street), while the ground is quite different. Therefore, we use the three coarse vertical surface

orientations (front, left-facing, and right-facing) computed by Hoiem et al. (2007a) and attempt to estimate the azimuth direction only.

Intuitively, we assume that a surface w_i ∈ V should predict that the sun lies in front of it if it is bright. On the contrary, the sun should be behind it if the surface is dark. We discover the mapping between the average brightness of the surface and the relative sun position with respect to the surface normal orientation θ_i by manually labeling the sun azimuth in the Geometric Context dataset (Hoiem et al. 2007a). In particular, we learn the relationship between surface brightness b_i and whether the sun is in front of or behind the surface, i.e. whether Δ(θ_i, φ_s) is less than or greater than 90°. This is done by computing the average brightness of all the vertical surfaces in the dataset, and applying logistic regression to learn P(Δ(θ_i, φ_s) < 90° | b_i). This effectively models the probability in the following fashion:

$$P\big(\Delta(\theta_i, \phi_s) < 90^\circ \mid b_i\big) = \frac{1}{1 + e^{-(x_1 + x_2 b_i)}}, \qquad (7)$$

where, after learning, x_1 = −3.35 and x_2 = 5.97. Figure 7 shows the computed data points and the fitted model obtained with this method. As expected, a bright surface (high b_i) predicts that the sun is more likely to be in front of it than behind it, and vice-versa for dark surfaces (low b_i). In practice, even if the sun is shining on a surface, a large portion of it might be in shadow because of occlusions. Therefore, b_i is set to be the average brightness of the brightest cluster, computed using k-means with k = 2.


To model the distribution of the sun given a vertical surface of orientation θ_i, we use:

$$P(\phi_s \mid w_i) \propto \begin{cases} \mathcal{N}(\theta_i, \sigma_w^2) & \text{if } (7) \geq 0.5,\\ \mathcal{N}(\theta_i + 180^\circ, \sigma_w^2) & \text{if } (7) < 0.5, \end{cases} \qquad (8)$$

Fig. 7 How do the vertical surfaces predict the sun? From the Geometric Context dataset (Hoiem et al. 2007a), the mapping between the probability of the sun being in front of a surface and its average brightness b_i is learned using logistic regression. Bright surfaces (high b_i) predict that the sun is in front of them (high probability), and vice-versa for dark surfaces (low b_i)

where σ_w² is such that the fraction of the mass of the Gaussian N that lies in front of (behind) the surface corresponds to P(Δ(θ_i, φ_s) < 90° | b_i) if it is greater (less) than 0.5. Note that θ_i ∈ {−90°, 90°, 180°} since we assume only 3 coarse surface orientations. We combine the surfaces by making each one vote for its preferred sun direction:

$$P(\phi_s \mid \mathcal{V}) \propto \sum_{w_i \in \mathcal{V}} P(\phi_s \mid w_i). \qquad (9)$$

Figure 8 shows the sun azimuth prediction results obtained using the shading on vertical surfaces only. We find that this cue can often help resolve the ambiguity arising from shadow lines.
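The sketch below follows (7)–(9): the learned logistic model maps surface brightness to the probability that the sun is in front of the surface, and each coarse surface votes with a (wrapped) Gaussian centred on its normal direction or on the opposite direction. The coefficients follow the values quoted above (the sign of x1 is assumed), and sigma_w is a fixed placeholder rather than the mass-matching width described in the text.

```python
import numpy as np

X1, X2 = -3.35, 5.97      # logistic coefficients from the text (sign of X1 assumed)
SIGMA_W = 40.0            # degrees; placeholder width for the voting Gaussian

def p_sun_in_front(b):
    """Probability that the sun is in front of a surface of brightness b, eq. (7)."""
    return 1.0 / (1.0 + np.exp(-(X1 + X2 * b)))

def wrapped_gaussian(phis, mu, sigma):
    """Gaussian over azimuths (degrees) with wraparound at +/-180."""
    d = np.abs((phis - mu + 180.0) % 360.0 - 180.0)
    g = np.exp(-0.5 * (d / sigma) ** 2)
    return g / g.sum()

def surface_azimuth_distribution(surfaces):
    """surfaces: list of (normal_azimuth_deg, brightness) for left/front/right walls."""
    phis = np.arange(-180, 180, dtype=float)
    votes = np.zeros_like(phis)
    for normal_az, brightness in surfaces:
        in_front = p_sun_in_front(brightness) >= 0.5
        mu = normal_az if in_front else normal_az + 180.0
        votes += wrapped_gaussian(phis, mu, SIGMA_W)
    return votes / votes.sum()     # P(phi_s | V)
```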

    4.4 Pedestrians

When convex objects are present in the image, their appearance can also be used to predict where the sun is (Langer and Bülthoff 2001). One notable example of this idea is the work of Bitouk et al. (2008), which uses faces to recover illumination in a face swapping application. However, front-looking faces are rarely of sufficient resolution in the type of images we are considering (they were using mostly high resolution closeup portraits), but entire people more often are. As shown in Fig. 9, when shone upon by the sun, pedestrians also exhibit strong appearance characteristics that depend on the sun position: shadows are cast at the feet, a horizontal intensity gradient exists on the person's body, the wrinkles in

Fig. 8 (Color online) Illumination cue from the vertical surfaces only. Starting from the input image (a), we compute the vertical surfaces mask (b) using Hoiem et al. (2007a) (blue = facing left, green = facing right, red = facing forward). The distribution of pixel intensities on each of these surfaces is then used to estimate P(θ_s, φ_s | V) (c).

Note that in our work, vertical surfaces cannot predict the sun zenith angle θ_s. In these examples, the sun zenith is set to 45° for visualization. We then find the most likely sun position (shown with a yellow circle), which is used to artificially synthesize a sun dial in the scene (d)


Fig. 9 Looking at these pedestrians, it is easy to say that (a) is lit from the left, and (b) from the right. In this work, we train a classifier that predicts the sun direction based on the appearance of pedestrians

the clothing are highlighted, etc. It is easy for us to say that one person is lit from the left (Fig. 9a), or from the right (Fig. 9b).

Because several applications require detecting pedestrians in an image (surveillance or safety applications in particular), they are a class of objects that has received significant attention in the literature (Dalal and Triggs 2005; Dollár et al. 2009; Park et al. 2010; Felzenszwalb et al. 2010), and as such many efficient detectors exist. In this section, we present a novel approach to estimate the azimuth angle of the sun with respect to the camera given detected pedestrian bounding boxes in an image. In our work, the pedestrians are detected using Felzenszwalb et al. (2010), which uses the standard Histogram of Oriented Gradients (HOG) features (Dalal and Triggs 2005) for detection. The detector is operated at a high-precision, low-recall setting to ensure that only very confident detections are used.

We employ a supervised learning approach to the problem of predicting the sun location given the appearance of a pedestrian. In particular, a set of 2,000 random images was selected from the LabelMe dataset (Russell et al. 2008), which contain the ground truth location of pedestrians, and for which we also manually labelled the ground truth sun azimuth. To make the problem more tractable, the space of sun azimuths φ_s ∈ [−180°, 180°] is discretized into four bins of 90° intervals: [−180°, −90°], [−90°, 0°], and so forth. We then train a multi-class SVM classifier which uses the same HOG features used for detection. However, the difference now is that the classifier is conditioned on the presence of a pedestrian in the bounding box, so it effectively learns different feature weights that capture effects due only to illumination. In practice, the multi-class SVM is implemented as a set of four one-vs-all binary SVM classifiers. The SVM training is done with the libsvm library (Chang and Lin 2001), and the output of each is normalized via a non-linear least-squares sigmoid fitting process (Platt 1999) to obtain a probability for each class.

Of course, the illumination effects on pedestrians shown in Fig. 9 and captured by the classifier only arise when a pedestrian is actually lit by the sun. However, since buildings or scene structures frequently cast large shadows upon the ground, pedestrians may very well be in the shade. To account for that, we train another binary SVM classifier to predict whether or not the pedestrian is in shadow. This binary classifier uses simple features computed on the bounding box region, such as coarse 5-bin histograms in the HSV and RGB colorspaces, and a histogram of oriented gradients within the entire bounding box to capture contrast. This simple "sunlit pedestrian" classifier achieves more than 82% classification accuracy. Figure 10 shows the sun azimuth estimation results obtained using the automatically-detected sunlit pedestrians only.

As with the previous cues, we compute the probability P(φ_s | p_i) of the sun azimuth given a single pedestrian p_i, and make each one vote for its preferred sun direction:

$$P(\phi_s \mid \mathcal{P}) \propto \sum_{p_i \in \mathcal{P}} P(\phi_s \mid p_i). \qquad (10)$$

    5 Estimating the Sun Position

Now that we are equipped with several features that can be computed over an image, we show how we combine them in order to get a more reliable estimate. Because each cue can be very weak and might not even be available in a given image, we combine them in a probabilistic framework that captures the uncertainty associated with each of them.

    5.1 Cue Combination

We are interested in estimating the distribution P(S | V = 1, I) over the sun position S = (θ_s, φ_s), given the entire image I and assuming the sun is visible (see Sect. 3). We saw in the previous section that the image I is divided into features computed on the sky S, the shadows on the ground G, the vertical surfaces V, and the pedestrians P, so we apply Bayes' rule and write

$$P(S \mid \mathcal{S}, \mathcal{G}, \mathcal{V}, \mathcal{P}) \propto P(\mathcal{S}, \mathcal{G}, \mathcal{V}, \mathcal{P} \mid S)\,P(S). \qquad (11)$$

We make the Naive Bayes assumption that the image pixels are conditionally independent given the illumination conditions, and that the priors on each region of the image (P(S), P(G), P(V), and P(P)) are uniform over their own respective domains. Applying Bayes' rule twice, we get

$$P(S \mid \mathcal{S}, \mathcal{G}, \mathcal{V}, \mathcal{P}) \propto P(S \mid \mathcal{S})\,P(S \mid \mathcal{G})\,P(S \mid \mathcal{V})\,P(S \mid \mathcal{P})\,P(S). \qquad (12)$$
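Since every cue and the prior are represented as distributions over the same discrete (θ_s, φ_s) grid, the combination in (12) reduces to an element-wise product followed by renormalization, as in this minimal sketch (cues that are unavailable in an image can simply be passed in as uniform maps):

```python
import numpy as np

def combine_cues(cue_maps, prior):
    """cue_maps: list of arrays P(S | cue); prior: array P(S); all the same shape."""
    combined = prior.astype(float).copy()
    for m in cue_maps:
        combined *= m                      # Naive Bayes product of per-cue terms
    total = combined.sum()
    if total == 0:                         # degenerate case: fall back to uniform
        return np.full_like(combined, 1.0 / combined.size)
    return combined / total
```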


Fig. 10 (Color online) Illumination cue from the pedestrians only. Starting from the input image (a), we detect pedestrians using the detector of Felzenszwalb et al. (2010) (b). Each of these detections is used to estimate P(φ_s | P) (c) by running our pedestrian-based

illumination predictor, and combining all predictions via a voting framework. The most likely sun azimuth (shown with a yellow circle) is used to artificially synthesize a sun dial in the scene (d). For display purposes, the most likely sun zenith is set to 45°

The process of combining the cues according to (12) for the sun position is illustrated in Fig. 11. We have presented how we compute the conditionals P(S | S), P(S | G), P(S | V), and P(S | P) in Sects. 4.1, 4.2, 4.3 and 4.4 respectively. We now look at how we can compute the prior P(S) on the sun positions themselves.

    5.2 Data-Driven Illumination Prior

The prior P(S) = P(θ_s, φ_s) captures the typical sun positions in outdoor scenes. We now proceed to show how we can compute it given a large dataset of consumer photographs.

The sun position (θ_s, φ_s) depends on the latitude L of the camera, its azimuth angle φ_c, the date D, and the time of day T expressed in the local timezone:

$$P(\theta_s, \phi_s) = P\big(f(L, D, T, \phi_c)\big), \qquad (13)$$

where f(·) is a non-linear function defined in Reda and Andreas (2005). To estimate (13), we can sample points from P(L, D, T, φ_c) and use f(·) to recover θ_s and φ_s. But estimating this distribution is not currently feasible, since it requires images with known camera orientations φ_c, which are not yet available in large quantities. On the other hand, geo- and time-tagged images do exist, and are widely available on photo sharing websites such as Flickr. The database of 6 million images from Hays and Efros (2008) is used to compute the empirical distribution P(L, D, T). We compute (13) by randomly sampling 1 million points from the distribution

P(L, D, T) P(φ_c), assuming P(φ_c) to be uniform in the [−180°, 180°] interval. As a consequence, (13) is flat along the φ_s dimension, which is marginalized out.
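The sketch below illustrates how such a prior can be built: (latitude, date, time) triplets are drawn from the photo collection, a camera azimuth is drawn uniformly, each sample is converted to a sun position, and the sun zenith is histogrammed. The helpers sample_latitude_date_time and sun_position (standing in for the Reda and Andreas 2005 algorithm) are assumed to exist, and the default sample count is reduced here for illustration (the paper uses 1 million samples).

```python
import numpy as np

def estimate_sun_zenith_prior(sample_latitude_date_time, sun_position,
                              n_samples=100_000, n_bins=90):
    """Histogram the sun zenith over samples drawn from the photo collection."""
    zeniths = np.empty(n_samples)
    for i in range(n_samples):
        lat, date, time = sample_latitude_date_time()        # draw from P(L, D, T)
        phi_c = np.random.uniform(-180.0, 180.0)             # uniform camera azimuth
        theta_s, _ = sun_position(lat, date, time, phi_c)    # assumed solar-position helper
        zeniths[i] = theta_s
    hist, _ = np.histogram(zeniths, bins=n_bins, range=(0.0, 90.0), density=True)
    return hist    # P(theta_s); the prior is flat along phi_s, which is marginalized out
```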

Figure 12c shows 4 estimates for P(θ_s, φ_s), computed with slightly different variations. First, a uniform sampling of locations on Earth and times of day is used as a baseline comparison. The three other priors use data-driven information. Considering non-uniform date and time distributions decreases the likelihood of having pictures taken with the sun close to the horizon (θ_s close to 90°). Interestingly, the red and green curves overlap almost perfectly, which indicates that the three variables L, D, and T seem to be independent.

This database captures the distribution of where photographs are most likely to be taken on the planet, which is indeed very different from considering each location on Earth as equally likely (as shown in Figs. 12a and 12b). We will show in the next section that this distinction is critical to improving our estimation results. Finally, note that the assumption of uniform camera azimuth is probably not true in practice, since a basic rule of thumb of good photography is to take a picture with the sun to the back. With the advent of additional sensors such as compasses on digital cameras, this data will surely become available in the near future.

    6 Evaluation and Results

We evaluate our technique in three different ways. First, we quantitatively evaluate our sun estimation technique for both


Fig. 11 Combining illumination features computed from image (a) yields a more confident final estimate (b). We show how (c) through (f) are estimated in Sect. 4, and how we compute (g) in Sect. 5.2. To

visualize the relative confidence between the different cues, all probability maps in (c) through (g) are drawn on the same color scale

zenith and azimuth angles using images taken from calibrated webcam sequences. Second, we show another quantitative evaluation, this time for the sun azimuth only, but performed on a dataset of manually labelled single images downloaded from the Internet. Finally, we also provide several qualitative results on single images that demonstrate the performance of our approach. These results will show that, although far from being perfect, our approach is still able to exploit visible illumination cues to reason about the sun.

    6.1 Quantitative Evaluation Using Webcams

We use the technique of Lalonde et al. (2010) to estimate the positions of the sun in 984 images taken from 15 different time-lapse image sequences downloaded from the Internet. Two example images from our dataset are shown in Figs. 13a and 13b. Our algorithm is applied to every image from the sequences independently, and the results are compared against ground truth.

Figure 13 reports the cumulative histogram of errors in sun position estimation for different scenarios: chance, making a constant prediction of θ_s = 0° (straight up), using only

the priors from Sect. 5.2 (we tested both the data-driven and Earth-uniform priors), scene cues only, and using our combined measure P(θ_s, φ_s | I) with both priors as well. Figure 13 highlights the performance at errors of less than 22.5° (50% of images) and 45° (71% of images), which correspond to accurately predicting the sun position within an octant (e.g. North vs. North-West) or a quadrant (e.g. North vs. West) respectively.

The cues contribute to the end result differently for different scenes. For instance, the sky in Fig. 13a is not informative, since it is small and the sun is always behind the camera. It is more informative in Fig. 13b, as it occupies a larger area. Pedestrians may appear in Fig. 13a and may be useful; however, they are too small in Fig. 13b to be of any use.

    6.2 Quantitative Azimuth Evaluation on Single Images

When browsing popular publicly available webcam datasets (Lalonde et al. 2009; Jacobs et al. 2007), one quickly realizes that the types of scenes captured in such sequences are inherently different from those found in single images.


Fig. 12 (Color online) Illumination priors (13) on the sun zenith angle. The Earth visualizations illustrate the (a) uniform and (b) data-driven sampling over GPS locations that are compared in this work. The data-driven prior is learned from a database of 6M images downloaded from Flickr. (c) The priors are obtained by sampling 1 million points (1) uniformly across GPS locations and time of day (cyan); (2) from the marginal latitude distribution P(L) only (blue); (3) from the product of independent marginals P(L)P(D)P(T) obtained from data (green); and (4) from the joint P(L, D, T), also obtained from data (red). The last two curves overlap, indicating that the three variables L, D, and T indeed seem to be independent

Webcams are typically installed at high vantage points, giving them a broad overlook on large-scale panoramas such as natural or urban landscapes. On the other hand, single images are most often taken at human eye level, where earthbound

Fig. 13 (Color online) Quantitative evaluation using 984 images taken from 15 webcam sequences, calibrated using Lalonde et al. (2010). Two example images, from (a) Madrid and (b) Vatican City, compare sun dials rendered using our estimated sun position (gray) and the ground truth (red). (c) Cumulative sun position error (angle between the estimated and ground truth directions) for different methods. Our result, which combines both the scene cues and the data-driven illumination prior, outperforms the others. The data-driven prior surpasses the Earth-uniform one (both alone and when combined with the scene cues), showing the importance of being consistent with likely image locations. Picking a constant value of θ_s = 0° has an error of at least 20°

objects like cars, pedestrians, bushes or trees occupy a much larger fraction of the image.

In addition, the scene in a webcam sequence does not change over time (considering static cameras only); it is the illumination conditions that vary. In single images, however, both the scenes and the illumination conditions differ from one image to the next. In this section, we evaluate our method on a dataset of single images. Admittedly, this task is much harder than the case of webcams, but we still expect to be able to extract meaningful information across a wide variety of urban and natural scenes.

The main challenge we face here is the unavailability of ground truth sun positions in single images: as of this day, no such publicly-available dataset exists. Additionally, as discussed in Sect. 5.2, while the GPS coordinates and time of capture of images are commonly recorded by modern-day cameras, the camera azimuth angle φ_c is not.


We randomly selected 300 outdoor images from the LabelMe dataset (Russell et al. 2008) where the sun appears to be shining on the scene (i.e. is not occluded by a cloud or a very large building), and manually labelled the sun position in each one of them. The labeling was done using an interactive graphical interface that resembles the virtual sun dials used throughout this paper (see Fig. 5b for example), where the task of the human labeler was to orient the shadow so that it aligned with the perceived sun direction. If the labeler judged that he or she could not identify the sun direction with sufficient accuracy in a given image, that image was discarded from the dataset. After the labeling process, 239 images were retained for evaluation. In addition, we have found that reliably labeling the sun zenith angle θ_s is very hard in the absence of objects of known heights (Koenderink et al. 2004), so we asked the user to label the sun azimuth φ_s only.

Figure 14 reports the cumulative histograms of errors in sun azimuth estimation for each of the cues independently (Figs. 14a through 14d), and jointly (Fig. 14e). We also report the percentage of images which have less than 22.5° and 45° error, corresponding to correctly estimating the sun azimuth within an octant and a quadrant, respectively. Since each cue is not available in all images, we also report the percentage of images for which each cue is available.

Let us point out a few noteworthy observations. First, since the shadow lines alone (Fig. 14b) have a 180° ambiguity in azimuth estimation, we keep the minimum error between each predicted (opposite) direction and the ground truth sun azimuth. This explains why the error is at most 90°, and as such it should not be compared directly with the other cues. A second note is that vertical surfaces (Fig. 14c) are available in 99% of the images in our dataset, which indicates that our dataset might be somewhat biased towards urban environments, much like LabelMe currently is. Also observe how the pedestrians are, independently, one of the best cues when available (Fig. 14d); their performance is the best of all cues. As a side note, we also observed that there is a direct correlation between the number of pedestrians visible in the image and the quality of the azimuth estimate: the more the better. For example, the mean error is 41° when there is only one pedestrian, 22° with two, and 20° with three or more. Finally, the overall system (Fig. 14e) estimates the sun azimuth angle within 22.5° for 40.2% of the images, and within 45° for 54.8% of the images, which is better than any of the cues taken independently.
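For reference, the evaluation above boils down to computing the angular error with wraparound and thresholding it at 22.5° and 45°; a minimal sketch follows, with estimates and ground_truth assumed to be arrays of azimuths in degrees.

```python
import numpy as np

def azimuth_errors(estimates, ground_truth):
    """Absolute angular error in degrees, with wraparound at +/-180."""
    return np.abs((np.asarray(estimates) - np.asarray(ground_truth) + 180.0)
                  % 360.0 - 180.0)

def report(estimates, ground_truth):
    err = azimuth_errors(estimates, ground_truth)
    print("within octant  (<= 22.5 deg): %.1f%%" % (100.0 * np.mean(err <= 22.5)))
    print("within quadrant (<= 45 deg) : %.1f%%" % (100.0 * np.mean(err <= 45.0)))
```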

    6.3 Qualitative Evaluation on Single Images

    Figure 15 shows several example results of applying our algorithm to typical consumer-grade images taken from the test set introduced in Sect. 6.2. The rows are arranged in order of decreasing estimation error. Notice how our technique recovers the entire distribution over the sun parameters (θ_s, φ_s), which also captures the degree of confidence in the estimates. High confidence cases are usually obtained when one cue is very strong (e.g., the large intensity difference between vertical surfaces of different orientations in the 5th row, 2nd column), or when all four cues are correlated (as in Fig. 11). On the other hand, highly cluttered scenes with few vertical surfaces, shadowed pedestrians, or shadows cast by vegetation (4th row, 2nd column of Fig. 15) usually yield lower confidences, and the most likely sun positions might be uncertain.
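    The sun dials shown in Fig. 15 are driven by the MAP estimate of the discretized distribution P(θ_s, φ_s | I). The sketch below illustrates how such a grid could be reduced to a MAP position and a rough confidence score; the grid layout, the function name, and the use of entropy as a confidence proxy are our own assumptions, not the paper's implementation:

```python
import numpy as np

def map_sun_position(p_grid, zeniths_deg, azimuths_deg):
    """Return the MAP (zenith, azimuth) from a discretized distribution
    P(theta_s, phi_s | I), stored as a 2-D array indexed by [zenith, azimuth],
    along with the entropy of the distribution as a rough confidence proxy
    (lower entropy = more peaked = more confident)."""
    p = np.asarray(p_grid, dtype=float)
    p = p / p.sum()                       # make sure the grid is normalized
    iz, ia = np.unravel_index(np.argmax(p), p.shape)
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))
    return zeniths_deg[iz], azimuths_deg[ia], entropy

# Example on a coarse 10 x 36 grid (zenith x azimuth).
zeniths = np.linspace(0, 90, 10)
azimuths = np.linspace(0, 350, 36)
p = np.random.rand(10, 36)               # stand-in for an estimated distribution
print(map_sun_position(p, zeniths, azimuths))
```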

    Figure 16 shows typical failure cases. The assumption that most shadows are cast by thin vertical objects (see Sect. 4.2) is not always satisfied (Fig. 16a). In Fig. 16b, the shadows are correctly detected, but the other cues fail to resolve the ambiguity in their orientation. Misdetections, whether of pedestrians (Fig. 16c) or of vertical surfaces (Fig. 16e), may also cause problems. In Fig. 16d, the sunlit pedestrian classifier (Sect. 4.4) incorrectly identifies a shadowed pedestrian as being in the sunlight, thus yielding erroneous sun direction estimates. Finally, light clouds can mistakenly be interpreted as intensity variations in a clear sky (Fig. 16f).

    6.4 Application: 3-D Object Insertion

    We now demonstrate how to use our technique to insert a 3-D object into a single photograph with realistic lighting. This requires generating a plausible environment map to light virtual objects (Debevec 1998). An environment map is a sample of the plenoptic function at a single point in space, capturing the full sphere of light rays incident at that point. It is typically captured either by taking a high dynamic range (HDR) panoramic photograph from the point of view of the object, or by placing a mirrored ball at the desired location and photographing it. Such an environment map can then be used as an extended light source for rendering synthetic objects.

    Given only an image, it is generally impossible to recover the true environment map of the scene, since the image will only contain a small visible portion of the full map (and from the wrong viewpoint besides). We propose to use our model of natural illumination to estimate a high dynamic range environment map from a single image. Since we are dealing with outdoor images, we can divide the environment map into two parts: the sky probe and the scene probe. We now detail the process of building a realistic approximation to the real environment map for these two parts.

    The sky probe can be generated by using the physically-based sky model g(·) from Sect. 4.1. We first estimate


    Fig. 14 Cumulative sun azimuth error (angle between estimated and ground truth directions) on a set of 239 single images taken from the LabelMe dataset (Russell et al. 2008), for the individual cues presented in Sect. 4: (a) the sky, (b) the shadows cast on the ground by vertical objects, (c) the vertical surfaces, and (d) the pedestrians. The plot in (e) shows the result obtained by combining all cues together with the sun prior, as discussed in Sect. 5. The percentages indicated represent the fraction of images in the test set for which the cue is available


    Fig. 15 (Color online) Sun direction estimation from a single image. A virtual sun dial is inserted in each input image (first and third columns), whose shadow corresponds to the MAP sun position in the corresponding probability maps P(θ_s, φ_s | I) (second and fourth columns). The ground truth sun azimuth is shown in cyan; since the ground truth zenith is not available, a zenith angle of 45° is used for visualization


    Fig. 16 (Color online) Typical failure cases. (a) First, the dominating shadow lines are not cast by thin vertical objects, and are therefore not aligned with the sun direction. (b) Mislabeling the vertical surface orientations causes the wrong shadow direction to be selected. Note that the pedestrian visible in this example was not a high-confidence detection, and thus was not used for the illumination estimate. (c) A false positive pedestrian detection causes the illumination predictor to fail. (d) A pedestrian is correctly detected, but incorrectly classified as being in the sun. (e) A vertical surface (red) incorrectly encompasses a left-facing side wall, leading the classifier to believe the sun is behind the camera. (f) Light clouds are mistaken for clear sky variations. In all cases, the other cues were not confident enough to compensate, thus yielding erroneous estimates

    the illumination conditions using our approach. Given the most likely sun position and the clear sky pixels, we use the technique of Lalonde et al. (2010) to recover the most likely sky appearance by fitting its turbidity t (a scalar value which captures the degree of scattering in the atmosphere) in a least-squares fashion. This is done by taking

t^* = \arg\min_t \sum_{s_i \in S} \left( s_i - k\, g(\theta_s, \phi_s, t, \ldots) \right)^2, \qquad (14)

where S is the set of clear sky pixels. The estimated turbidity t^* and the camera parameters f_c and θ_c allow us to use g(·) to extrapolate the sky appearance over the entire hemisphere, even though only a small portion of it was originally visible to the camera. The sun is simulated by a bright circular patch (10^4 times brighter than the maximum scene brightness). An example sky probe obtained using this technique is shown in Fig. 17b.
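    The following sketch illustrates this two-step procedure (fit the sky parameters on the visible clear-sky pixels, then extrapolate over the hemisphere and stamp a bright sun disk). It is only an illustration under stated assumptions: sky_model is a hypothetical callable standing in for g(·), and the joint fit of (t, k), the 128x256 map resolution and the 0.5° sun radius are choices of ours, not values from the paper.

```python
import numpy as np
from scipy.optimize import least_squares

def fit_turbidity(sky_pixels, pixel_dirs, theta_s, phi_s, sky_model):
    """Least-squares fit of the turbidity t and a global scale k, in the
    spirit of (14). sky_model(theta_s, phi_s, t, dirs) is assumed to return
    the modelled sky intensity along the viewing directions `dirs`;
    `sky_pixels` are the observed clear-sky intensities along `pixel_dirs`."""
    def residuals(x):
        t, k = x
        return sky_pixels - k * sky_model(theta_s, phi_s, t, pixel_dirs)
    sol = least_squares(residuals, x0=[3.0, 1.0],
                        bounds=([1.7, 1e-6], [20.0, np.inf]))
    return sol.x  # (t*, k*)

def render_sky_probe(theta_s, phi_s, t, k, sky_model, max_scene_brightness,
                     height=128, width=256):
    """Extrapolate the fitted sky over the upper hemisphere of a
    latitude-longitude environment map, and stamp a bright circular sun."""
    theta = np.linspace(0, np.pi / 2, height)      # zenith angle of each row
    phi = np.linspace(-np.pi, np.pi, width)        # azimuth of each column
    T, P = np.meshgrid(theta, phi, indexing="ij")
    dirs = np.stack([np.sin(T) * np.cos(P),
                     np.sin(T) * np.sin(P),
                     np.cos(T)], axis=-1)
    probe = k * sky_model(theta_s, phi_s, t, dirs)
    # Sun disk: directions within ~0.5 degree of the sun, 1e4 x brighter
    # than the brightest scene pixel.
    sun_dir = np.array([np.sin(theta_s) * np.cos(phi_s),
                        np.sin(theta_s) * np.sin(phi_s),
                        np.cos(theta_s)])
    ang = np.arccos(np.clip(dirs @ sun_dir, -1.0, 1.0))
    probe[ang < np.deg2rad(0.5)] = 1e4 * max_scene_brightness
    return probe
```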

    For the bottom part of the environment map (not shown), we use the spherical projection technique of Khan et al. (2006) on the pixels below the horizon line, as in Lalonde et al. (2009). A realistic 3-D model is relit using off-the-shelf rendering software (see Fig. 17c). Notice how the shadows on the ground, and the shading and reflections on the car, are consistent with the image. Another example is shown in Fig. 1.
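    For completeness, the sketch below is only a crude placeholder for that lower half: it stretches the below-horizon pixels over the bottom hemisphere of the latitude-longitude map. It is not an implementation of the Khan et al. (2006) spherical projection, and crude_scene_probe and its fixed output resolution are our own.

```python
import numpy as np

def crude_scene_probe(image, horizon_row, height=128, width=256):
    """Very rough stand-in for the lower half of the environment map:
    stretch the pixels below the horizon line over the lower hemisphere of
    a latitude-longitude map (placeholder only, not the actual spherical
    projection of Khan et al. 2006)."""
    below = image[horizon_row:]          # (h, w, 3) region under the horizon
    h, w = below.shape[:2]
    rows = (np.linspace(0, 1, height, endpoint=False) * h).astype(int)
    cols = (np.linspace(0, 1, width, endpoint=False) * w).astype(int)
    return below[rows][:, cols]
```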


    Fig. 17 3-D object relighting. From a single image (a), we render the most likely sky appearance (b) using the sun position computed with our method, and then fitting the sky parameters using Lalonde et al. (2010). We can realistically insert a 3-D object into the image (c)

    7 Discussion

    We now discuss three important aspects of our approach: the distribution over the sun positions, how we can exploit additional sources of information when available, and higher-order interactions between cues.

    7.1 Distribution over Sun Positions

    The quantitative evaluation presented in Sect. 6.2 revealed that our method successfully predicts the sun azimuth within 22.5° for 40% of the images in our test set, within 45° for 55% of them, and within 90° for 80% (see Fig. 14e). Admittedly, this is far from perfect. However, we believe this is still a very useful result for applications which might not require a very precise estimate. For instance, all that might be required in some cases is to correctly identify the sun quadrant (
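    As an illustration of this coarser use case, the sketch below (our own function and camera-relative convention, with azimuth 0° meaning the sun is directly in front of the camera and angles increasing towards the right) collapses an estimated azimuth distribution into the most likely quadrant:

```python
import numpy as np

def most_likely_quadrant(p_azimuth, azimuths_deg):
    """Collapse a (possibly multi-modal) sun azimuth distribution into the
    most likely quadrant relative to the camera. p_azimuth can be obtained
    by marginalizing P(theta_s, phi_s | I) over the zenith angle."""
    p = np.asarray(p_azimuth, dtype=float)
    p = p / p.sum()
    # Quadrants centred on 0, 90, 180 and 270 degrees.
    quadrant = ((np.asarray(azimuths_deg) + 45.0) % 360.0 // 90.0).astype(int)
    mass = np.array([p[quadrant == q].sum() for q in range(4)])
    names = ["front", "right", "back", "left"]
    return names[int(np.argmax(mass))], mass
```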


    Fig. 18 Different scenarios result in different confidences in the illumination estimate. When the cues are strong and extracted properly, the resulting estimate is highly confident (a). A scene with more clutter typically results in lower confidence (b). The ambiguity created by shadows is sometimes visible when the other cues are weaker (c)-(d). When the cues are very weak (e.g., no bright vertical surfaces, no strong cast shadows, pedestrians in shadows, etc.), the resulting estimate is not confident, and the maximum likelihood sun position is meaningless (e)-(f). All probability maps are drawn on the same color scale, so confidences can be compared directly

    7.3 Higher-Order Interactions Between Cues

    In Sect. 5, we saw how we can combine the predictions from multiple cues to obtain a final, more confident estimate. To combine the cues, we rely on the Naive Bayes assumption, which states that all cues are conditionally independent given the sun direction. While this conditional independence assumption makes intuitive sense for many cues (for example, the appearance of the sky is independent of the shadow directions on the ground once the sun direction is known), it does not apply to all of them. Here we discuss a few dependencies that arise in our framework, and how we could leverage them to better capture interactions across cues.
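    Concretely, the Naive Bayes combination amounts to multiplying the per-cue likelihoods with the prior on a discretized sun-position grid. A minimal sketch, with our own function name and grids assumed to be same-shaped arrays, is:

```python
import numpy as np

def combine_cues(prior, cue_likelihoods):
    """Naive Bayes combination over a discretized sun-position grid:
    P(sun | cues) is proportional to P(sun) * prod_i P(cue_i | sun).
    `prior` and each entry of `cue_likelihoods` are arrays of the same
    shape (e.g. zenith x azimuth bins). Done in log space for stability."""
    log_p = np.log(np.asarray(prior, dtype=float) + 1e-12)
    for lik in cue_likelihoods:
        log_p += np.log(np.asarray(lik, dtype=float) + 1e-12)
    log_p -= log_p.max()             # avoid underflow before exponentiating
    p = np.exp(log_p)
    return p / p.sum()
```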

    Even if the sun direction is known, there is still a strong dependency between objects and their shadows. Knowing the position of vertical objects (e.g., pedestrians) tells us that cast shadows should be near their point of contact with the ground. Similarly, knowing the location of shadow boundaries on the ground constrains the possible locations of pedestrians, since they must cast a shadow (if they are in sunlight). Capturing this interaction between pedestrians and shadows would be very beneficial: since we know pedestrians are vertical objects, simply finding their shadows would be enough to get a good estimate of the sun position, and the ambiguity in direction from Sect. 4.2 could even be resolved.

    There is also a dependency, albeit a potentially weaker one, between vertical surfaces and cast shadows on the ground. The top, horizontal edges of vertical surfaces (e.g., the roofs of buildings) also cast shadows on the ground. Reasoning about the interaction between buildings and shadows


    Fig. 19 Capturing higher-level interactions across cues. The current approach uses a Naive Bayes model (a), which assumes that all cues are conditionally independent given the sun position. Capturing higher-order interactions across cues would require a more complex model (b), with less restrictive independence assumptions

    would allow us to discard their shadows, which typically do not point away from the sun (Sect. 4.2).

    Another interesting cross-cue dependency arises between pedestrians and vertical surfaces. Since large surfaces may create large shadow regions, if the sun comes from behind a large wall and a pedestrian is close to that wall, it is likely that this pedestrian is in shadow, and therefore uninformative about the sun position.

    Capturing these higher-level interactions across cues, while beneficial, would also increase the complexity of the probabilistic model used to solve the problem. Figure 19 shows a comparison between the graphical model that corresponds to our current approach (Fig. 19a) and a new one that would capture these dependencies (Fig. 19b). The caveat here is that the complexity of the model is exponential in the clique size, which is determined by the number of cues in the image (e.g., number of shadow lines, number of pedestrians, etc.). Learning and inference in such a model will certainly be more challenging.

    8 Conclusion

    Outdoor illumination affects the appearance of scenes in complex ways. Untangling illumination from surface and material properties is a hard problem in general. Surprisingly, however, numerous consumer-grade photographs captured outdoors contain rich and informative cues about illumination, such as the sky, the shadows on the ground and the shading on vertical surfaces. Our approach extracts the collective wisdom from these cues to estimate the sun visibility and, if deemed visible, its position relative to the camera. Even when the lighting information within an image is minimal and the resulting estimates are weak, we believe it can still be a useful result for a number of applications. For example, just knowing that the sun is somewhere on your left might be enough for a point-and-shoot camera to automatically adjust its parameters, or for a car detector to be expecting cars with shadows on the right. Several additional pieces of information can also be exploited to help in illumination estimation. For instance, GPS coordinates, time of day and camera orientation are increasingly being tagged in images. Knowing these quantities can further constrain the position of the sun and increase confidences in the probability maps that we estimate. We will explore these avenues in the future.
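    As a rough illustration of how GPS coordinates and time constrain the sun, the sketch below computes an approximate sun position from latitude, longitude and UTC time using textbook formulas (ignoring the equation of time and atmospheric refraction). It is our own back-of-the-envelope approximation, not the NREL algorithm cited in the references.

```python
import numpy as np
from datetime import datetime, timezone

def approx_sun_position(lat_deg, lon_deg, when_utc):
    """Back-of-the-envelope solar position (zenith and azimuth in degrees,
    azimuth measured clockwise from north) from GPS coordinates and UTC
    time; accurate to roughly a degree, enough to serve as a prior."""
    day = when_utc.timetuple().tm_yday
    frac_hour = when_utc.hour + when_utc.minute / 60.0 + when_utc.second / 3600.0
    # Solar declination (degrees), simple cosine approximation.
    decl = -23.44 * np.cos(np.deg2rad(360.0 / 365.0 * (day + 10)))
    # Hour angle: zero at solar noon, positive in the afternoon.
    solar_time = frac_hour + lon_deg / 15.0      # ignores the equation of time
    hour_angle = 15.0 * (solar_time - 12.0)
    lat, dec, ha = map(np.deg2rad, (lat_deg, decl, hour_angle))
    elev = np.arcsin(np.sin(lat) * np.sin(dec) +
                     np.cos(lat) * np.cos(dec) * np.cos(ha))
    az = np.arctan2(-np.sin(ha),
                    np.tan(dec) * np.cos(lat) - np.sin(lat) * np.cos(ha))
    return 90.0 - np.rad2deg(elev), np.rad2deg(az) % 360.0

# Example: Pittsburgh, mid-afternoon UTC on the summer solstice.
print(approx_sun_position(40.44, -79.99,
                          datetime(2011, 6, 21, 18, 0, tzinfo=timezone.utc)))
```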

    Acknowledgements This work has been partially supported by a Microsoft Fellowship to J.-F. Lalonde, and by NSF grants CCF-0541230, IIS-0546547, IIS-0643628 and ONR grant N00014-08-1-0330. A. Efros is grateful to the WILLOW team at ENS Paris for their hospitality. Parts of the results presented in this paper have previously appeared in Lalonde et al. (2009).

    References

    Basri, R., Jacobs, D., & Kemelmacher, I. (2007). Photometric stereo with general, unknown lighting. International Journal of Computer Vision, 72(3), 239–257.

    Bird, R. E. (1984). A simple spectral model for direct normal and diffuse horizontal irradiance. Solar Energy, 32, 461–471.

    Bitouk, D., Kumar, N., Dhillon, S., Belhumeur, P. N., & Nayar, S. K. (2008). Face swapping: automatically replacing faces in photographs. ACM Transactions on Graphics (SIGGRAPH 2008), 27(3), 39:1–39:8.

    Blinn, J. F., & Newell, M. E. (1976). Texture and reflection in computer generated images. In Proceedings of ACM SIGGRAPH 1976.

    Buluswar, S. D., & Draper, B. A. (2002). Color models for outdoor machine vision. Computer Vision and Image Understanding, 85(2), 71–99.

    Cavanagh, P. (2005). The artist as neuroscientist. Nature, 434, 301–307.

    Chang, C.-C., & Lin, C.-J. (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

    Chen, H., Belhumeur, P., & Jacobs, D. (2000). In search of illumination invariants. In IEEE conference on computer vision and pattern recognition.

    Chong, H. Y., Gortler, S. J., & Zickler, T. (2008). A perception-based color space for illumination-invariant image processing. ACM Transactions on Graphics (SIGGRAPH 2008) (pp. 61:1–61:7).

    Collins, M., Schapire, R., & Singer, Y. (2002). Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48(1), 158–169.

    Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In IEEE conference on computer vision and pattern recognition.

    Dale, K., Johnson, M. K., Sunkavalli, K., Matusik, W., & Pfister, H. (2009). Image restoration using online photo collections. In International conference on computer vision.

    Debevec, P. (1998). Rendering synthetic objects into real scenes: bridging traditional and image-based graphics with global illumination and high dynamic range photography. In Proceedings of ACM SIGGRAPH 1998.

    Debevec, P., & Malik, J. (1997). Recovering high dynamic range radiance maps from photographs. In Proceedings of ACM SIGGRAPH 1997, August.

    Debevec, P., Tchou, C., Gardner, A., Hawkins, T., Poullis, C., Stumpfel, J., Jones, A., Yun, N., Einarsson, P., Lundgren, T., Fajardo, M., & Martinez, P. (2004). Estimating surface reflectance properties of a complex scene under captured natural illumination (Technical Report ICT-TR-06.2004, USC ICT).

    Dollár, P., Wojek, C., Schiele, B., & Perona, P. (2009). Pedestrian detection: a benchmark. In IEEE conference on computer vision and pattern recognition.


    Dror, R. O., Willsky, A. S., & Adelson, E. H. (2004). Statistical characterization of real-world illumination. Journal of Vision, 4, 821–837.

    Felzenszwalb, P., Girshick, R. B., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence (pp. 1627–1645).

    Finlayson, G., Drew, M., & Funt, B. (1993). Diagonal transforms suffice for color constancy. In International conference on computer vision.

    Finlayson, G. D., Drew, M. S., & Lu, C. (2004). Intrinsic images by entropy minimization. In European conference on computer vision.

    Finlayson, G. D., Fredembach, C., & Drew, M. S. (2007). Detecting illumination in images. In IEEE international conference on computer vision.

    Finlayson, G. D., Hordley, S. D., & Drew, M. S. (2002). Removing shadows from images. In European conference on computer vision.

    Hays, J., & Efros, A. A. (2008). im2gps: estimating geographic information from a single image. In IEEE conference on computer vision and pattern recognition.

    Healey, G., & Slater, D. (1999). Models and methods for automated material identification in hyperspectral imagery acquired under unknown illumination and atmospheric conditions. IEEE Transactions on Geoscience and Remote Sensing, 37(6), 2706–2717.

    Hill, R. (1994). Theory of geolocation by light levels. In B. J. Le Boeuf & R. M. Laws (Eds.), Elephant seals: population ecology, behavior, and physiology (pp. 227–236). Berkeley: University of California Press. Chapter 12.

    Hoiem, D., Efros, A. A., & Hebert, M. (2005). Automatic photo pop-up. ACM Transactions on Graphics (SIGGRAPH 2005), 24(3), 577–584.

    Hoiem, D., Efros, A. A., & Hebert, M. (2007a). Recovering surface layout from an image. International Journal of Computer Vision, 75(1), 151–172.

    Hoiem, D., Stein, A., Efros, A. A., & Hebert, M. (2007b). Recovering occlusion boundaries from a single image. In IEEE international conference on computer vision.

    Jacobs, N., Roman, N., & Pless, R. (2007). Consistent temporal variations in many outdoor scenes. In IEEE conference on computer vision and pattern recognition.

    Judd, D. B., MacAdam, D. L., Wyszecki, G., Budde, H. W., Condit, H. R., Henderson, S. T., & Simonds, J. L. (1964). Spectral distribution of typical daylight as a function of correlated color temperature. Journal of the Optical Society of America A, 54(8), 1031–1036.

    Junejo, I. N., & Foroosh, H. (2008). Estimating geo-temporal location of stationary cameras using shadow trajectories. In European conference on computer vision.

    Kasten, F., & Young, A. T. (1989). Revised optical air mass tables and approximation formula. Applied Optics, 28, 4735–4738.

    Ke, Y., Tang, X., & Jing, F. (2006). The design of high-level features for photo quality assessment. In IEEE conference on computer vision and pattern recognition.

    Khan, E. A., Reinhard, E., Fleming, R., & Bülthoff, H. (2006). Image-based material editing. ACM Transactions on Graphics (SIGGRAPH 2006), August (pp. 654–663).

    Kim, T., & Hong, K.-S. (2005). A practical single image based approach for estimating illumination distribution from shadows. In IEEE international conference on computer vision.

    Koenderink, J. J., van Doorn, A. J., & Pont, S. C. (2004). Light direction from shad(ow)ed random Gaussian surfaces. Perception, 33(12), 1405–1420.

    Košecká, J., & Zhang, W. (2002). Video compass. In European conference on computer vision.

    Lalonde, J.-F., Efros, A. A., & Narasimhan, S. G. (2009). Estimating natural illumination from a single outdoor image. In IEEE international conference on computer vision.

    Lalonde, J.-F., Efros, A. A., & Narasimhan, S. G. (2009). Webcam clip art: appearance and illuminant transfer from time-lapse sequences. ACM Transactions on Graphics (SIGGRAPH Asia 2009), 28(5), December (pp. 131:1–131:10).

    Lalonde, J.-F., Efros, A. A., & Narasimhan, S. G. (2010). Detecting ground shadows in outdoor consumer photographs. In European conference on computer vision.

    Lalonde, J.-F., Efros, A. A., & Narasimhan, S. G. (2010). Ground shadow boundary dataset. http://graphics.cs.cmu.edu/projects/shadows, September.

    Lalonde, J.-F., Hoiem, D., Efros, A. A., Rother, C., Winn, J., & Criminisi, A. (2007). Photo clip art. ACM Transactions on Graphics (SIGGRAPH 2007).

    Lalonde, J.-F., Narasimhan, S. G., & Efros, A. A. (2010). What do the sun and the sky tell us about the camera? International Journal of Computer Vision, 88(1), 24–51.

    Langer, M. S., & Bülthoff, H. H. (2001). A prior for global convexity in local shape-from-shading. Perception, 30(4), 403–410.

    Li, Y., Lin, S., Lu, H., & Shum, H.-Y. (2003). Multiple-cue illumination estimation in textured scenes. In IEEE international conference on computer vision.

    Manduchi, R. (2006). Learning outdoor color classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(11), 1713–1723.

    Maxwell, B. A., Friedhoff, R. M., & Smith, C. A. (2008). A bi-illuminant dichromatic reflection model for understanding images. In IEEE conference on computer vision and pattern recognition.

    Mills, D. (2004). Advances in solar thermal electricity and technology. Solar Energy, 76, 19–31.

    Narasimhan, S. G., Ramesh, V., & Nayar, S. K. (2005). A class of photometric invariants: separating material from shape and illumination. In IEEE international conference on computer vision.

    Park, D., Ramanan, D., & Fowlkes, C. C. (2010). Multiresolution models for object detection. In European conference on computer vision.

    Perez, R., Seals, R., & Michalsky, J. (1993). All-weather model for sky luminance distribution: preliminary configuration and validation. Solar Energy, 50(3), 235–245.

    Platt, J. C. (1999). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers.

    Preetham, A. J., Shirley, P., & Smits, B. (1999). A practical analytic model for daylight. In Proceedings of ACM SIGGRAPH 1999, August.

    Ramamoorthi, R., & Hanrahan, P. (2001). A signal-processing framework for inverse rendering. In Proceedings of ACM SIGGRAPH 2001.

    Reda, I., & Andreas, A. (2005). Solar position algorithm for solar radiation applications (Technical Report NREL/TP-560-34302). National Renewable Energy Laboratory, November.

    Reinhart, C. F., Mardaljevic, J., & Rogers, Z. (2006). Dynamic daylight performance metrics for sustainable building design. Leukos, 3(1), 1–25.

    Romeiro, F., & Zickler, T. (2010). Blind reflectometry. In European conference on computer vision.

    Russell, B. C., Torralba, A., Murphy, K. P., & Freeman, W. T. (2008). LabelMe: a database and web-based tool for image annotation. International Journal of Computer Vision, 77(1–3), 157–173.

    Sato, I., Sato, Y., & Ikeuchi, K. (2003). Illumination from shadows. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(3), 290–300.


    Sato, Y., & Ikeuchi, K. (1995). Reflectance analysis under solar illumination. In Proceedings of the IEEE workshop on physics-based modeling and computer vision (pp. 180–187).

    Slater, D., & Healey, G. (1998). Analyzing the spectral dimensionality of outdoor visible and near-infrared illumination functions. Journal of the Optical Society of America, 15(11), 2913–2920.

    Stauffer, C. (1999). Adaptive background mixture models for real-time tracking. In IEEE conference on computer vision and pattern recognition.

    Stumpfel, J., Jones, A., Wenger, A., Tchou, C., Hawkins, T., & Debevec, P. (2004). Direct HDR capture of the sun and sky. In Proceedings of AFRIGRAPH.

    Sun, M., Schindler, G., Turk, G., & Dellaert, F. (2009). Color matching and illumination estimation for urban scenes. In IEEE international workshop on 3-D digital imaging and modeling.

    Sunkavalli, K., Romeiro, F., Matusik, W., Zickler, T., & Pfister, H. (2008). What do color changes reveal about an outdoor scene? In IEEE conference on computer vision and pattern recognition.

    Tian, J., Sun, J., & Tang, Y. (2009). Tricolor attenuation model for shadow detection. IEEE Transactions on Image Processing, 18(10), 2355–2363.

    Tsin, Y., Collins, R. T., Ramesh, V., & Kanade, T. (2001). Bayesian color constancy for outdoor object recognition. In IEEE conference on computer vision and pattern recognition.

    Ward, G. (1994). The RADIANCE lighting simulation and rendering system. In Proceedings of ACM SIGGRAPH 1994.

    Weiss, Y. (2001). Deriving intrinsic images from image sequences. In IEEE international conference on computer vision.

    Wu, T.-P., & Tang, C.-K. (2005). A Bayesian approach for shadow extraction from a single image. In IEEE international conference on computer vision.

    Yu, Y., & Malik, J. (1998). Recovering photometric properties of architectural scenes from photographs. In Proceedings of ACM SIGGRAPH 1998, July.

    Zhu, J., Samuel, K. G. G., Masood, S. Z., & Tappen, M. F. (2010). Learning to recognize shadows in monochromatic natural images. In IEEE conference on computer vision and pattern recognition.

    Zickler, T., Mallick, S. P., Kriegman, D. J., & Belhumeur, P. N. (2006). Color subspaces as photometric invariants. In IEEE conference on computer vision and pattern recognition.
