
Supplementary Material: Deep Image Harmonization

Yi-Hsuan Tsai1 Xiaohui Shen2 Zhe Lin2 Kalyan Sunkavalli2 Xin Lu2 Ming-Hsuan Yang1

1University of California, Merced  2Adobe Research
1{ytsai2, mhyang}@ucmerced.edu  2{xshen, zlin, sunkaval, xinl}@adobe.com

1. Comparison between Joint Training and Separate Model

To validate the effectiveness of our joint training scheme, we also evaluate an alternative that incorporates an off-the-shelf state-of-the-art scene parsing model [3] into our single encoder-decoder harmonization framework to provide semantic information. This network architecture is shown in Figure 1. We show quantitative comparisons on our synthesized datasets in Tables 1 and 2. The MSE and PSNR of the results generated by the framework with the separate scene parsing model are worse than those of our joint model, where the scene parsing decoder is learned from scratch as described in the main manuscript. This shows that our joint training scheme achieves better harmonization results than the one based on separate training.

Table 1. Comparisons of different methods on three synthesized datasets using mean-squared error (MSE).

                        MSCOCO   MIT-Adobe   Flickr
cut-and-paste            400.5       552.5    701.6
Lalonde [1]              667.0      1207.8   2371.0
Xue [2]                  351.6       568.3    785.1
Zhu [4]                  322.2       360.3    475.9
Ours (w/o semantics)      80.5       168.8    491.7
Ours (separate model)     86.0       227.5    511.2
Ours (joint training)     76.1       142.8    406.8

Table 2. Comparisons of different methods on three synthesized datasets using PSNR.

                        MSCOCO   MIT-Adobe   Flickr
cut-and-paste             26.3        23.9     25.9
Lalonde [1]               22.7        21.1     18.9
Xue [2]                   26.9        24.6     25.0
Zhu [4]                   26.9        25.8     25.4
Ours (w/o semantics)      32.2        27.5     27.2
Ours (separate model)     32.3        27.2     26.7
Ours (joint training)     32.9        28.7     27.4
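For reference, PSNR and MSE for 8-bit images are related by PSNR = 10 log10(255^2 / MSE). A minimal NumPy sketch of both metrics is given below; this is an illustrative assumption rather than the authors' evaluation code, and the tables' numbers may additionally involve per-image averaging or foreground masking, so they need not match a naive conversion of one another.

```python
import numpy as np

def mse(pred, target):
    """Mean-squared error over all pixels and channels (8-bit scale)."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    return float(np.mean(diff ** 2))

def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio in dB for 8-bit images."""
    err = mse(pred, target)
    if err == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / err))
```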

Table 3. PSNR improvement by adding semantics on MSCOCO.

            bear   giraffe   horse   zebra   boat   train   bird
PSNR diff  +1.26     +1.19   +0.94   +0.85  +0.40   +0.56  +0.24

Table 4. PSNR improvement by adding semantics, with scene parsing IoUs, on ADE20K.

             sky    tree   person   sofa   table
PSNR diff  +2.21   +2.81    +2.38  +1.16   +1.26
IoU         86.1    57.9     52.5   17.7    18.2

2. Results on Different Categories

Table 3 shows the results on different object categories of the MSCOCO dataset. For objects that have specific patterns (bear, giraffe, zebra), the PSNR values improve significantly, while for categories with diverse appearances the improvement is marginal. Table 4 shows the PSNR improvements from using semantics, together with the corresponding scene parsing IoUs, on the ADE20K test set. Similarly, larger improvements are achieved for categories with stronger semantic patterns and better scene parsing results.
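The claimed link between parsing quality and harmonization gain can be sanity-checked with a quick Pearson correlation between the IoU and PSNR-diff rows of Table 4. This is our own check on the table's five numbers, not an analysis from the paper:

```python
import numpy as np

# Rows of Table 4: sky, tree, person, sofa, table
iou = np.array([86.1, 57.9, 52.5, 17.7, 18.2])
psnr_diff = np.array([2.21, 2.81, 2.38, 1.16, 1.26])

# Pearson correlation between parsing quality and harmonization gain
r = np.corrcoef(iou, psnr_diff)[0, 1]
print(f"Pearson r = {r:.2f}")  # strongly positive
```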

3. Results on Real Composite Images

We present all the results on real composite images used in our user study, including examples created by ourselves (Figures 2 to 8) and examples from [2] (Figures 9 to 16). We compare the harmonization results generated by our joint model with those of other state-of-the-art algorithms, including Lalonde [1], Xue [2], and Zhu [4].

References

[1] J.-F. Lalonde and A. A. Efros. Using color compatibility for assessing image realism. In ICCV, 2007.
[2] S. Xue, A. Agarwala, J. Dorsey, and H. Rushmeier. Understanding and improving the realism of image composites. ACM Trans. Graph. (Proc. SIGGRAPH), 31(4), 2012.
[3] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Semantic understanding of scenes through the ADE20K dataset. CoRR, abs/1608.05442, 2016.
[4] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Learning a discriminative model for the perception of realism in composite images. In ICCV, 2015.



Figure 1. Overview of the network architecture that incorporates a pre-trained scene parsing model. Given a composite image and a provided foreground mask, we use the same encoder-decoder architecture for harmonization as described in the main manuscript. To propagate semantic information, we use a pre-trained scene parsing model (DilatedNet) [3] with the top 25 most frequent labels in the ADE20K dataset. We first extract the response map before the output layer as semantic information. During the training process for harmonization, we then resize this response and concatenate it to each deconvolutional layer in the harmonization decoder (denoted by red dotted lines). Note that, unlike the proposed scene parsing decoder jointly trained with harmonization described in the main manuscript, this separate scene parsing model only provides semantic information; its parameters are not updated through backpropagation.
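The propagation step described above — resizing the parser's pre-output response map and concatenating it to each decoder layer — can be sketched as follows. The nearest-neighbor resize, the (C, H, W) array layout, and the layer sizes are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def resize_nn(feat, out_h, out_w):
    """Nearest-neighbor resize of a (C, H, W) feature map."""
    c, h, w = feat.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return feat[:, rows[:, None], cols[None, :]]

def concat_semantics(decoder_feat, response_map):
    """Resize the scene-parsing response map (e.g. 25 label channels)
    to the decoder feature's spatial size and stack it along channels,
    as done for each deconvolutional layer in Figure 1."""
    _, h, w = decoder_feat.shape
    sem = resize_nn(response_map, h, w)
    return np.concatenate([decoder_feat, sem], axis=0)
```

A 64-channel decoder feature at 8x8 combined with a 25-label response map at 4x4 would yield an 89-channel 8x8 input to the next layer.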

Input Lalonde [1] Xue [2] Zhu [4] Ours

Figure 2. Sample results on real composite images for the input, three state-of-the-art methods, and our proposed network. We show that our method produces realistic harmonized images by adjusting composite foreground regions containing various scenes or objects.

CVPR 2017 Submission #1434. CONFIDENTIAL REVIEW COPY. DO NOT DISTRIBUTE.

Input Lalonde [1] Xue [2] Zhu [4] Ours

Figure 3. Sample results on real composite images for the input, three state-of-the-art methods, and our proposed network. We show that our method produces realistic harmonized images by adjusting composite foreground regions containing various scenes or objects.

Input Lalonde [1] Xue [2] Zhu [4] Ours

Figure 4. Sample results on real composite images for the input, three state-of-the-art methods, and our proposed network. We show that our method produces realistic harmonized images by adjusting composite foreground regions containing various scenes or objects.

Input Lalonde [1] Xue [2] Zhu [4] Ours

Figure 5. Sample results on real composite images for the input, three state-of-the-art methods, and our proposed network. We show that our method produces realistic harmonized images by adjusting composite foreground regions containing various scenes or objects.

Input Lalonde [1] Xue [2] Zhu [4] Ours

Figure 6. Sample results on real composite images for the input, three state-of-the-art methods, and our proposed network. We show that our method produces realistic harmonized images by adjusting composite foreground regions containing various scenes or objects.

Input Lalonde [1] Xue [2] Zhu [4] Ours

Figure 7. Sample results on real composite images for the input, three state-of-the-art methods, and our proposed network. We show that our method produces realistic harmonized images by adjusting composite foreground regions containing various scenes or objects.

Input Lalonde [1] Xue [2] Zhu [4] Ours

Figure 8. Sample results on real composite images for the input, three state-of-the-art methods, and our proposed network. We show that our method produces realistic harmonized images by adjusting composite foreground regions containing various scenes or objects.

Input Lalonde [1] Xue [2] Zhu [4] Ours

Figure 9. Sample results on real composite images for the input, three state-of-the-art methods, and our proposed network. We show that our method produces realistic harmonized images by adjusting composite foreground regions containing various scenes or objects.

Input Lalonde [1] Xue [2] Zhu [4] Ours

Figure 10. Sample results on real composite images for the input, three state-of-the-art methods, and our proposed network. We show that our method produces realistic harmonized images by adjusting composite foreground regions containing various scenes or objects.

Input Lalonde [1] Xue [2] Zhu [4] Ours

Figure 11. Sample results on real composite images for the input, three state-of-the-art methods, and our proposed network. We show that our method produces realistic harmonized images by adjusting composite foreground regions containing various scenes or objects.

Input Lalonde [1] Xue [2] Zhu [4] Ours

Figure 12. Sample results on real composite images for the input, three state-of-the-art methods, and our proposed network. We show that our method produces realistic harmonized images by adjusting composite foreground regions containing various scenes or objects.

Input Lalonde [1] Xue [2] Zhu [4] Ours

Figure 13. Sample results on real composite images for the input, three state-of-the-art methods, and our proposed network. We show that our method produces realistic harmonized images by adjusting composite foreground regions containing various scenes or objects.

Input Lalonde [1] Xue [2] Zhu [4] Ours

Figure 14. Sample results on real composite images for the input, three state-of-the-art methods, and our proposed network. We show that our method produces realistic harmonized images by adjusting composite foreground regions containing various scenes or objects.

Input Lalonde [1] Xue [2] Zhu [4] Ours

Figure 15. Sample results on real composite images for the input, three state-of-the-art methods, and our proposed network. We show that our method produces realistic harmonized images by adjusting composite foreground regions containing various scenes or objects.

Input Lalonde [1] Xue [2] Zhu [4] Ours

Figure 16. Sample results on real composite images for the input, three state-of-the-art methods, and our proposed network. We show that our method produces realistic harmonized images by adjusting composite foreground regions containing various scenes or objects.