
The 'What Could Have Been' Photo

Tian Kai Woon, Huimin Huang
EE 368, Stanford University, Spring 2010

    Abstract

There are many traditional algorithms that fix patches or holes in pictures based on texture and/or other features found in the same picture. In our project, we replace these patches with information from other pictures, thereby extending our sources and reducing the limitation in information that the original photo holds. Ideally, these features should blend in well and be contextually valid. We adapt the scene completion algorithm by Hays and Efros [2007], which uses millions of photographs, for this project. However, we limit the number of photographs to a more manageable quantity that can be processed on a single computer by pre-labeling our stock images, so that the average person will be able to perform this form of image editing on one machine. This would be like an additional 'What Could Have Been' function in a home version of Adobe Photoshop or GIMP.

    1 Introduction

How many times have we been on that dream holiday, only to find out during photo development that the otherwise perfect picture is marred by the finger imprint of an inexperienced photographer? There are also instances when even a good photographer cannot manipulate the surroundings to capture exactly what he wants.

    Figure 1: Man obstructing part of picture

If only the scaffolding weren't covering half the building, if only the crane weren't in the middle of that wonderful sunset, if only that stranger weren't in my group photo. Occasionally, we might also just want a change of landscape, or to combine the 'best of' elements in different photographs. If only I could have that backdrop from that photo in this photo instead. If only, if only... These creative alternatives form the motivation behind our project.

Finding an alternative for repairing the image not only gives us the freedom to inject new elements into the picture, it is also a better way of hole patching compared to interpolating existing pixels in the original image or duplicating elements in the original image.

Interpolating and then blending can result in a blurry patch that is far from natural. Meanwhile, duplicating existing features in the image, while contextually valid, is easy to spot since human sight is sensitive to recurring patterns.

Figure 2: Poisson blending of hole in original image [Hays and Efros 2007]

Figure 3: Duplicating existing features in original image to fill hole [Hays and Efros 2007]

Having said that, isolating only the contextually valid elements to insert into an image hole is one of the challenges when selecting from a wide range of sources. We address this by pre-labeling our database images when we download and save them. This reduces our selection process to locally matching the selected area of interest amongst a pool of contextually valid images.

    2 Prior and Related Work

In Hays and Efros' work [2007], they utilized 2.3 million unique images on a cluster of 15 machines and used the gist scene descriptor developed by Oliva and Torralba [2006] to group and recognize their pictures. As with Wilczkowiak et al. [2005], their seam finding operation relaxed the use of the remaining valid pixels in the original picture as hard constraints, allowing said pixels to be removed and replaced by new content.

Other algorithms to fill an unknown area with new information are based on example-based texture synthesis. These make use of texture and contour information from adjacent pixels, which are then extrapolated to fill up the missing area. Authors of such algorithms include Wilczkowiak et al. [2005] and Wexler et al. [2004].

There are also methods that use multiple photographs of the same area in order to accurately reconstruct the image, such as those developed by Agarwala et al. [2004].

    3 Preliminaries

All images used were scaled down to a size of 512 by 512 pixels using the bicubic interpolation method. In addition, images that were too small and images that were duplicates were discarded.
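As a rough Matlab sketch of this preprocessing step, the resizing and size filtering could look as follows; the folder names and the minimum-size threshold here are our own illustrative choices, not fixed by the report:

files = dir(fullfile('database', '*.jpg'));      % hypothetical stock-image folder
for n = 1:numel(files)
    img = imread(fullfile('database', files(n).name));
    if size(img, 1) < 256 || size(img, 2) < 256  % assumed minimum-size threshold
        continue;                                % discard images that are too small
    end
    img = imresize(img, [512 512], 'bicubic');   % scale to the common 512 x 512 size
    imwrite(img, fullfile('resized', files(n).name));
end

Duplicate detection is omitted above; it could be as simple as comparing checksums of the resized files.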

We obtained our database of images using Google Image Swirl, which "organizes image search results based on their visual and semantic similarities". When we saved those images, we labeled and categorized them according to their underlying themes, e.g. 'Road', 'Sky', 'Lake', etc. Doing this offers us two advantages:

First, the element chosen to fill the hole will have the relevant context. After all, local image matching is limited to local gradient information and pixel-by-pixel information, with no regard for the big picture. Pre-labeling ensures that we still retain information about the context of the picture as a whole, reducing instances where images have extremely out-of-place elements.

Second, pre-labeling sifts out the irrelevant images, thus reducing the number of images to be considered for local image matching. Since local image matching is the most computationally demanding step in our program, this gives us significant time savings.

For each image category, we have about 20 pictures. A themespace, which captures the main characteristics of the scene, was constructed for each of these categories. Given a new image, its projection onto these themespaces can be compared with the actual image to determine how well it fits into a particular category.

Each database image was first reshaped to a column vector. 20 such vectors from the 20 pictures in a category were then used to form a matrix, A. The mean of A was subtracted, and the first 5 singular vectors of this resultant matrix were found. The themespace corresponding to a certain theme category, e.g. sky, was then formed from the linear combination of these 5 singular vectors.
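A minimal Matlab sketch of this construction for a single category might look as follows, assuming imgs is a 512 x 512 x 20 stack of grayscale category images (the variable names are ours):

A = reshape(double(imgs), [], 20);             % one column vector per image
meanImage = mean(A, 2);                        % mean of the 20 category images
[U, ~, ~] = svd(A - repmat(meanImage, 1, 20), 'econ');  % SVD of mean-subtracted matrix
U = U(:, 1:5);                                 % themespace basis: first 5 singular vectors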

4 Description of Algorithm

    A. Selection Of Area

The user first selects the region he wants replaced using a program like Paint, where a large degree of freedom is given for selection of an irregularly shaped area. A binary mask of the selected area is then created.

    Figure 4: Binary mask of selected area
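For illustration, such a mask could also be created inside Matlab itself rather than in an external editor; roipoly lets the user trace an irregular polygon and returns the corresponding logical mask (this is an alternative sketch, not the tool used in the report):

img  = imread('input.jpg');   % hypothetical input file name
mask = roipoly(img);          % user traces the region; returns a logical binary mask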

    B. Classification

The input image is classified by matching its projection onto each of the themespaces, as described in the Preliminaries section.

    This projection is given by:

x_proj = mean_image + ∑_{i=1}^{r} α_i^opt u_i

where

α^opt = argmin_α ∥ x − mean_image − ∑_{i=1}^{r} α_i u_i ∥²

Here x is the input image written as a column vector, mean_image is the mean of the category's images, u_1, ..., u_r are the themespace's singular vectors (r = 5), and the α_i are the projection coefficients.

The input image is projected onto each of the themespaces and the resulting error is calculated. Images from the 3 categories which yield the least error are then used for local context matching with the input image.
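A sketch of this classification step, using the themespace basis U and mean from the Preliminaries (here stored per category in the assumed cell arrays bases and means; inputImg is the query image):

x   = reshape(double(inputImg), [], 1);      % input image as a column vector
err = zeros(numel(bases), 1);
for c = 1:numel(bases)
    d     = x - means{c};
    alpha = bases{c}' * d;                   % optimal coefficients (least squares)
    xProj = means{c} + bases{c} * alpha;     % projection onto themespace c
    err(c) = sum((x - xProj).^2);            % squared projection error
end
[~, order] = sort(err);
bestCategories = order(1:3);                 % keep the 3 categories with least error

Because the singular vectors are orthonormal, the least-squares coefficients reduce to the inner products U' * (x − mean_image), which is what the loop computes.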

    C. Local Context Matching

The local context used for matching is defined to be the 10-pixel band along the boundary of the seam formed by the binary mask. This local context is used for comparisons among the relevant images (60-80 images) sifted out by the classification process.
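One way to extract this band in Matlab is by morphological dilation of the binary mask; the disk-shaped structuring element is our assumption, and any 10-pixel dilation would serve:

band = imdilate(mask, strel('disk', 10)) & ~mask;  % 10-pixel ring just outside the mask
% pixels where band is true form the local context used for the comparisons below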

The cost function associated with each replacement image is formulated based on three considerations. First, the distance of each pixel along the seam to the original binary mask defined by the user. Second, the magnitude of the gradient of the sum of squared differences (SSD) between the original image and the replacement image along the seam. The final consideration is the mean squared error (MSE) in intensity between the original image and the replacement image over a band of width 10 pixels from the seam.

The seam is found by obtaining the local minimum of the following cost function around the original binary mask defined by the user.

C(S) = ∑_{p∈S} [ C_d(p, M) + |∇SSD(p)| ]

where S is the set of pixels which form the seam, M is the original binary mask defined by the user, and B is the 10-pixel band from the seam S (B enters through the MSE term described below).

C_d(p, M) = k · Dist(p, M)³

C_d(p, M) assigns a cost to each pixel along the seam that increases with distance from the original binary mask. The value of k was empirically set to 0.02.

At regions that are missing from the original image, C_d(p, M) is made very large to prevent the seam from crossing regions removed by the user.

|∇SSD(p)| is the magnitude of the gradient of the sum of squared differences between the original and replacement images at pixel p. Using the gradient in the cost function causes the seam to avoid paths with high-frequency changes in color; such high-frequency edges are difficult to blend and may appear unnatural in the blended image.
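The per-pixel terms of this cost could be sketched in Matlab as below, interpreting Dist(p, M) as the Euclidean distance transform of the mask (via bwdist) and computing |∇SSD| with standard image gradients; these interpretations, the large constant for removed regions, and the variable names (orig, repl, M) are our assumptions:

k  = 0.02;                                       % empirically chosen weight
Cd = k * bwdist(M).^3;                           % C_d(p, M): grows with distance from mask
Cd(M) = 1e9;                                     % forbid the seam from crossing the hole
ssd = sum((double(orig) - double(repl)).^2, 3);  % per-pixel SSD summed over color channels
[gx, gy] = gradient(ssd);
gradSSD = sqrt(gx.^2 + gy.^2);                   % |grad SSD(p)|
costMap = Cd + gradSSD;                          % C(S) sums this map over the seam pixels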

The seam is initially taken to be the boundary formed by the original binary mask. The local minimum of C(S) is then found by shifting S towards pixels of lower cost.

The final cost function is evaluated by adding the MSE over the 10-pixel band B along the seam to C(S). The individual components of the cost function are then weighted to give roughly equal contributions.

For each replacement image, the cost function is calculated, and the four matches with the lowest scores are selected for blending.

    D. Laplacian Pyramid Blending

After selecting the four lowest-cost replacement images, we perform Laplacian pyramid blending to obtain the composite output image.

An N by N image can be represented as a pyramid of 1 by 1, 2 by 2, 4 by 4, ..., 2^k by 2^k images, where N = 2^k. The image is sub-sampled as it moves from level 0 (512 by 512) to level k (1 by 1). Gaussian pre-filtering is applied to the image before sub-sampling.

Figure 5: Formation of the Gaussian pyramid [Burt 1983]
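A minimal sketch of this construction in Matlab, assuming grayscale input; the 5 x 5 filter and σ = 1 are our own choices for the Gaussian pre-filter:

function G = gaussianPyramid(img)
    G = {im2double(img)};                            % level 0: 512 x 512
    h = fspecial('gaussian', 5, 1);                  % assumed Gaussian pre-filter
    while size(G{end}, 1) > 1
        blurred  = imfilter(G{end}, h, 'replicate'); % pre-filter before sub-sampling
        G{end+1} = blurred(1:2:end, 1:2:end, :);     % sub-sample by 2 for the next level
    end
end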

In pyramid blending, two Laplacian pyramids LA and LB are first built from the replacement and original images respectively. In our implementation, the Laplacian pyramids are approximated using difference-of-Gaussian pyramids.

Figure 6: Left and right Laplacian pyramids [Burt 1983]

Next, a Gaussian pyramid GR is built from the selected region R represented by the local minimum seam, S.

A combined pyramid LS is formed from LA and LB using the nodes of GR as weights:

LS(i,j) = GR(i,j)·LA(i,j) + (1 − GR(i,j))·LB(i,j)

The LS pyramid is then collapsed to get the final blended image.
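Putting the steps together, a sketch of the whole blend using the gaussianPyramid helper from the previous sketch (single-channel 512 x 512 doubles assumed; imresize stands in for the pyramid expand operation, and the variable names are ours):

GA = gaussianPyramid(repl);               % replacement image
GB = gaussianPyramid(orig);               % original image
GR = gaussianPyramid(double(R));          % weights from the selected region R
n  = numel(GA);
LS = cell(1, n);
for lev = 1:n-1                           % Laplacian level = level minus expanded next level
    sz = size(GA{lev});
    LA = GA{lev} - imresize(GA{lev+1}, sz(1:2));
    LB = GB{lev} - imresize(GB{lev+1}, sz(1:2));
    LS{lev} = GR{lev}.*LA + (1 - GR{lev}).*LB;
end
LS{n} = GR{n}.*GA{n} + (1 - GR{n}).*GB{n};   % coarsest level blends raw intensities
out = LS{n};
for lev = n-1:-1:1                        % collapse: expand and add back up the pyramid
    out = imresize(out, [size(LS{lev},1) size(LS{lev},2)]) + LS{lev};
end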

    E. Mode And Results Selection

The program offers two modes from which the user can select. Normal mode searches amongst images that are contextually similar. Random mode is for users who would like to see more possible combinations that may or may not be semantically valid; it searches for good local context matches among a larger group of images, generating more possibilities but also taking more computation time.

Four blended composite images with the lowest scores are then presented for the user to choose from.

5 Experimental Results

We have attached some results from our test runs (Figure 7). The user is able to choose among four images that look different from one another.

In our best results, the composite image has elements that fit within the general context of the picture, is well blended, and looks natural.

Meanwhile, the less-than-desirable composite images may contain elements that are obviously out of scale with the rest of the picture, have visible blending artifacts, or have elements with unnatural cut-offs and/or strange illumination patterns.

The Matlab implementation of this algorithm takes approximately three to five minutes to run on a single computer for one input image with a database of about 100 images.

    6 Conclusion

This project, in summary, covers the local context matching section of the paper by Hays and Efros [2007], since the semantic scene matching section is done via pre-labeling of downloaded images. This makes it a more manageable task that can be done with limited computational resources. However, the database can be expanded to include a much larger number of images, which could give us combinations with much lower scores compared to what we currently have with a smaller set of images.

Another area that can be further developed would be to take the scales of both original and replacement images into account. When the replacement element is interleaved into the original, it may so happen that the new element is of a much larger or smaller scale compared to other elements in the background, making the composite look unnaturally disproportionate.

Finally, a less local method of selecting areas from the replacement image could be looked into. This would prevent instances where part of the replacement object is cropped off. An object-preserving component could be included that defines the edges that make a particular feature 'whole'.

    Bibliography

[1] AGRAWAL, A., RASKAR, R., AND CHELLAPPA, R. 2006. What is the range of surface reconstructions from a gradient field? In ECCV.

[2] BURT, P. J., AND ADELSON, E. H. 1983. The Laplacian pyramid as a compact image code. IEEE Transactions on Communications 31, 4, 532-540.

[3] HAYS, J., AND EFROS, A. A. 2007. Scene completion using millions of photographs. ACM SIGGRAPH 2007 conference proceedings.

[4] OLIVA, A., AND TORRALBA, A. 2006. Building the gist of a scene: the role of global image features in recognition. In Visual Perception, Progress in Brain Research, vol. 155.

[5] PEREZ, P., GANGNET, M., AND BLAKE, A. 2003. Poisson image editing. ACM Trans. Graph. 22, 3, 313-318.

[6] TURK, M. A., AND PENTLAND, A. P. 1991. Face recognition using eigenfaces. IEEE, June 1991, 586-591.

[7] WEXLER, Y., SHECHTMAN, E., AND IRANI, M. 2004. Space-time video completion. CVPR 01, 120-127.

[8] WILCZKOWIAK, M., BROSTOW, G. J., TORDOFF, B., AND CIPOLLA, R. 2005. Hole filling through photomontage. BMVC, 492-501.

    Appendix

Tian Kai: Section A, Section B, Section C, code testing and editing

Huimin: Preliminaries, Section D, Section E, report drafting and editing

Figure 7: Example images. Input images and replacement images are blended to form composite output images. The top four matches associated with each input image are displayed.