Learning to Detect Specular Highlights from Real-world Images
Gang Fu, Wuhan University, [email protected]
Qing Zhang, Sun Yat-sen University
Qifeng Lin, Wuhan University
Lei Zhu, Tianjin University
Chunxia Xiao∗, Wuhan University, [email protected]
ABSTRACT
Specular highlight detection is a challenging problem with many
applications, such as shiny object detection and light source
estimation. Although various highlight detection methods have been
proposed, they fail to disambiguate bright material surfaces from
highlights, and cannot handle non-white-balanced images. More-
over, at present, there is still no benchmark dataset for highlight
detection. In this paper, we present a large-scale real-world high-
light dataset containing a rich variety of material categories, with
diverse highlight shapes and appearances, in which each image has
an annotated ground-truth mask. Based on the dataset, we
develop a deep learning-based specular highlight detection network
(SHDNet) leveraging multi-scale context contrasted features to ac-
curately detect specular highlights of varying scales. In addition, we
design a binary cross-entropy (BCE) loss and an intersection-over-
union edge (IoUE) loss for our network. Compared with existing
highlight detection methods, our method can accurately detect
highlights of different sizes, while effectively excluding non-highlight
regions such as bright materials, non-specular and colored lighting,
and even light sources.
CCS CONCEPTS
• Computing methodologies→ Interest point and salient re-
gion detections.
KEYWORDS
Highlight detection; context contrasted feature; highlight dataset
ACM Reference Format:
Gang Fu, Qing Zhang, Qifeng Lin, Lei Zhu, and Chunxia Xiao. 2020. Learning
to Detect Specular Highlights from Real-world Images. In Proceedings of
the 28th ACM International Conference on Multimedia (MM ’20), October
12–16, 2020, Seattle, WA, USA. ACM, New York, NY, USA, 9 pages. https:
//doi.org/10.1145/3394171.3413586
∗Corresponding author.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
MM ’20, October 12–16, 2020, Seattle, WA, USA
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-7988-5/20/10. . . $15.00
https://doi.org/10.1145/3394171.3413586
Figure 1: Comparison with state-of-the-art highlight detection methods on an example image with very bright texts and materials. (a) Input, (b) Li et al. [26], (c) Zhang et al. [50], (d) Ours.
1 INTRODUCTION
Specular highlight¹, as a common phenomenon in the real world,
usually appears as a bright spot on shiny objects when they are
illuminated. It thus has a great effect on the appearance of objects,
and constitutes a double-edged sword that cannot be ignored. On the one hand, the
presence of specular highlights degrades the performance of some
tasks, such as intrinsic image decomposition [5, 7], general object
detection, clustering, and recoloring [4]. But on the other hand,
it also enhances the performance of other tasks, such as specu-
lar highlight removal [17, 39], glossy object recognition [33, 35],
depth estimation [40], face recognition [50], highlight manipulation
[37], and light source estimation [47]. Therefore, specular highlight
detection has long been a fundamental problem. A high-quality
highlight detection method would significantly benefit the above
tasks and even more applications.
¹The three terms specular highlight, highlight, and specularity are used interchangeably in this paper, unless stated otherwise.
Nearly all existing detection methods are based on various forms
of thresholding operations [26, 32, 35, 36, 41], which are built
upon the strict premise that the illumination is white, and
treat the brightest pixels as highlights. However, this premise is
often invalid for real-world images with very bright material surfaces
and non-specular lighting (e.g., interreflection and overexposure),
whose intensities can exceed those of highlights. Essentially, existing
traditional methods neither semantically disambiguate all-white
and/or nearly-white materials from highlights in complex
real-world scenes, nor accurately locate the desired highlights. As
shown in Figure 1, two recent state-of-the-art methods [26, 50]
fail to produce satisfactory results, particularly where many white
texts are undesirably copied to the highlight maps. In addition,
existing methods are unable to handle non-white-balanced images,
since they often depend on the even stronger premise that the
illumination is white. Moreover, at present, there is still no public
large real-world benchmark dataset to evaluate the performance
of highlight detection algorithms. This limits the development of
highlight detection techniques and even related techniques such as
highlight removal.
In this work, we present a large-scale dataset for the first time
to promote specular highlight detection for real-world images. Our
dataset includes 4,310 images featuring a wide variety of scenes and
materials, each with a labeled ground truth highlight mask. It also
contains almost all everyday shiny materials, such as metal, plastic,
glass, and wood, on which specular highlights often appear. To the
best of our knowledge, it is the first large-scale real-world dataset
for the specular highlight detection task. Thus, it would promote a
fair performance comparison among existing highlight detection
methods, and even benefit the evaluation of other tasks such as highlight
removal and light source estimation. Figure 2 shows several example
image pairs in our dataset.
Based on our dataset, we have also designed a deep learning-
based specular highlight detection method. We observe that specular
highlights usually change the appearance of objects in the scene
drastically, which makes them stand out from their surroundings. However,
this change includes both low-level color variations and high-level
semantics. So we leverage the context contrasted information for
specular highlight detection. We note that this kind of informa-
tion has been successfully exploited in semantic segmentation [13]
and saliency detection [52]. Also, we use a binary cross-entropy
(BCE) loss and an intersection-over-union edge (IoUE) loss for our
network to help more accurately locate highlights. In addition, to
increase the generalization ability and robustness of our method to
non-white-balanced images, we further perform data augmentation to produce more training images with colored illumination.
To sum up, our main contributions are:
• We present the first large-scale specularity dataset with specular highlight annotations.
• We develop a deep learning-based method leveraging multi-
scale context contrasted features, equipped with a BCE loss
and an IoUE loss, to accurately detect specularities.
• We evaluate our method and state-of-the-art detection meth-
ods on our dataset and many challenging images in the wild
from the Internet, and show the superior performance of our
method both quantitatively and qualitatively.
2 RELATED WORK
In this section, we briefly review state-of-the-art specular highlight
detection methods, as well as salient object detection, shadow detection,
and semantic segmentation from related fields.
Specular highlight detection. Early researchers proposed various
algorithms [9, 31] for specular highlight detection and compensation,
since camera saturation caused by specular highlights interferes
with existing image processing algorithms that use decision
thresholds [36]. Subsequently, researchers have proposed more
specular highlight detection algorithms [2, 24, 25], based on the
concept of "well-definedness" of mappings between the chromaticity
spaces corresponding to images taken under different illuminations.
Later, Park et al. [36] proposed a truncated least squares
method for the detection of specular highlights in color images.
Recently, several methods [26, 32, 34, 41] designed algorithms
leveraging a thresholding scheme in RGB or HSV channels. Although
these methods are simple and fast, they are often sensitive
to the threshold value. Tian et al. [42] proposed a real-time highlight
detection method based on an unnormalized version of the
Wiener Entropy. Zhang et al. [50] formulated specular highlight de-
tection as Non-negative Matrix Factorization (NMF) [20] based on
the fact that only a small portion of an image contains highlights.
Although existing highlight detection methods have achieved
remarkable progress, they fail to produce satisfactory results for
real-world images with very bright materials and non-specular
lighting. In comparison, we utilize modern deep learning tools to
achieve high-quality results by building a large-scale real-world
dataset.
Salient object detection. Its goal is to find the most visually
distinctive objects in an image. Recently, Li et al. [27] combined
a well-trained contour detection model with a saliency detection
model, with a contour-to-saliency transferring scheme. Qin et al.
[38] proposed a densely Encoder-Decoder network and a residual
refinement module for saliency prediction and saliency map refine-
ment respectively. Hou et al. [19] introduced a deeply supervised
method with short connections, making full use of multi-level and
multi-scale features. Zhu et al. [54] designed the guided non-local
block to utilize guidance information for saliency detection.
Shadow detection. It aims to detect shadows in the scene. Re-
cently, Hu et al. [21, 22] proposed a deep learning-based detection
method, making full use of direction-aware spatial context features.
Zheng et al. [51] proposed a distraction-aware detection method by
learning and integrating the semantics of visual distraction regions.
Zhu et al. [53] explored the global context in deep layer and local
context in shallow layers of a deep convolutional neural network to
build a bidirectional feature pyramid network for shadow detection.
Semantic segmentation. It aims at predicting pixel-level labels
of object categories for images. A pioneer work in semantic seg-
mentation is Fully Convolutional Network (FCN) [30]. To produce
high-quality semantic segmentation results, some researchers have
utilized various heavy backbones [12] or powerful modules to lever-
age multi-scale context information [11]. Recently, Lin et al. [28]
integrated the graph convolution network into neural architecture
search as a communication mechanism between independent cells.
Yang et al. [46] used a Fourier Transform for semantic segmentation.
Figure 2: Example highlight images and corresponding highlight masks in our dataset. Please zoom in to view more details.
Objects and shadows to be detected in a scene in the above three
tasks often appear to be a few connected large areas. In contrast,
specular highlights in a scene have various shapes, ranging
from tiny dots to long areas spanning entire large objects. Therefore,
based on the special characteristics of highlights, we design a
high-performance deep learning-based network for highlight detec-
tion, for the first time in this field. Since the network needs many
labeled images for training, we also provide a large-scale real-world
specularity dataset.
Specularity datasets with annotated highlight masks. Al-
though there are many studies of specular highlight detection, in
general, most methods [15, 50] conduct only a visual evaluation on
a few images selected by the authors, without annotated ground-truth
highlight masks. This falls far short of a comprehensive quantitative
evaluation of highlight detection algorithms on a large-scale benchmark
dataset. To the best of our knowledge, there is no public dataset
with specularity images and corresponding ground truth highlight
masks. We thus present such a dataset to enable systematic repro-
ducible research in this field.
3 OUR DATASET
In this section, we first describe the annotation pipeline in detail,
and then provide a comprehensive analysis of the dataset.
3.1 Building dataset
To build our dataset, we first downloaded various photos from the
Internet. Then, we filtered out undesired photos. Next, the recruited
workers annotated highlights for each photo. Finally, we checked
the quality of their annotated masks. Note that for our annotation task,
we recruited 15 students from a college campus who are skilled at
Photoshop. In the following, we will describe these steps in detail.
Step 1: Collecting images. First, we downloaded a set of every-
day photos from the following sources: (1) photo-sharing websites,
such as Flickr; (2) shopping websites, such as Amazon; (3) image
search engines, such as Google Images and Bing Images. The pro-
portion of images in our initial set of photos collected from the
above three sources is about 5 : 4 : 1. As is well known, specular
highlights are often produced on the surfaces of shiny materials.
With this in mind, we downloaded a total of 44,800 images from the
above websites by searching objects whose materials could be iron,
silver, snack, nail polish, book, paint, cobblestone, wine glass, and old
Figure 3: Three example situations in labeling. (a)(c)(e) Input local patches. (b)(d)(f) Labeled highlight masks of (a)(c)(e) using the threshold, matting, and free select tools respectively.
wooden table, etc. To facilitate later annotations and ensure the quality
of the dataset, we initially pruned the image list, discarding
images that are: (i) ≤ 0.5 megapixels, or (ii) ≥ 20 MB in size
(to control our disk footprint). After this process, we had an image
set of 38,163 images.
Step 2: Filtering images. Further, we excluded images that are:
(i) unnatural-looking, such as synthetic, distorted, blurred, and
noisy photos, or (ii) with very few or even no obvious highlights
(i.e., typically fewer than three highlight regions). For this task,
each image was shown to five workers. We found that the workers
are very fast and reliable at this task. The image list was thus
finally reduced to 4,310 images.
Step 3: Data annotation. This task is to annotate specular highlight
regions to produce binary highlight masks, which is similar to
the salient object annotation and instance segmentation annotation
tasks. However, compared with salient objects and instances
in a scene, specular highlights have particularly different characteristics,
namely various sizes, a possibly large quantity, fuzzy
contours, and even a messy distribution. Therefore, the annotation
pipelines of those tasks are not suitable for our task. Instead of
the LabelMe tool [43], we adopted the more powerful Photoshop
software with the versatile tools (free select, pencil, threshold, mat-
ting, smooth zoom, and undo/redo etc.) we need. These tools were
especially important to enable workers to quickly produce more
accurate masks even for small highlight regions. For this task, each
photo is annotated by one worker. The workers were shown several
examples of good and bad annotation masks. Several example
images of annotated local patches (not the whole input photos)
in our dataset are presented in Figure 3.
Step 4: Voting for annotation quality. We made an additional
task where human subjects vote on the quality of each annotated
mask. Then we used these results to determine the quality of each
mask. We consider an annotated mask as high quality only if there
exists a certain amount of consensus. Five voters vote on the quality
of each highlight mask. To check the quality of annotation masks,
similar to [6], we use the CUBAM learning-based algorithm [45]
which uses a model of noisy user behavior for binary tasks (in our
case, voting for the quality of annotated highlight mask) to extract
better results. This method partly models the bias of each worker
based on how often they agree with the other workers. We then
ran CUBAM on the resulting votes, and discarded highlight masks
with a CUBAM score below a given threshold. Afterward, we
selected some bad highlight mask examples from the discarded masks
to show the workers so that they can avoid similar problematic
annotations in subsequent annotation rounds. We kept repeating
this process until all annotated highlight masks met our requirements.
3.2 Dataset analysis
To provide a thorough understanding of our dataset, we show a
series of statistical analysis on our proposed dataset.
Highlight number distribution. First, we group connected
highlight pixels and count the number of separate highlights per
image in our dataset. To avoid the effect of noisy labels, we discard
highlight regions with fewer than 25 pixels, at the cost of undesirably
excluding "true" separate highlights of extremely small size.
Figure 4(a) shows the resulting statistics. On the
whole, the number of separate connected regions in our dataset is
considerably large.
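The grouping-and-counting step above can be sketched as follows; the use of SciPy's connected-component labeling (and its default 4-connectivity) is our assumption, since the paper does not specify the implementation:

```python
import numpy as np
from scipy import ndimage

def count_highlight_regions(mask, min_pixels=25):
    """Count separate highlight regions in a binary mask, discarding
    connected components smaller than min_pixels (noisy labels)."""
    labeled, num = ndimage.label(mask > 0)  # label connected components
    if num == 0:
        return 0
    sizes = np.bincount(labeled.ravel())[1:]  # pixels per component (skip background)
    return int(np.sum(sizes >= min_pixels))
```

For example, a mask with one 10×10 blob and one 3×3 blob yields one region at the paper's 25-pixel cutoff, but two regions if the cutoff is lowered.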
Highlight area proportion. Next, we identify the proportion
of pixels occupied by highlights in each image. Figure 4(b) presents
the histogram plots of highlight area proportion for our dataset.
We can see that most images in our dataset have relatively small
highlight areas.
Color contrast distribution. Real-world highlights are often
high-intensity and texture-less, which means that the color contrast
between highlight and non-highlight regions is often high. We adopt the χ²
distance to measure the contrast between the color histograms of
the highlight and non-highlight regions in each image. Figure 4(c)
plots the color contrast distribution in our dataset, where a high
value on the horizontal axis means high color contrast. Nevertheless,
this does not mean that locating highlights in our dataset is easy,
since the non-highlight region may well contain an equal or even
greater amount of bright material surfaces or light sources.
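A minimal sketch of the per-image χ² color-contrast measure; the bin count and the per-channel summation are assumptions, as the paper does not give these details:

```python
import numpy as np

def chi2_color_contrast(img, mask, bins=16):
    """Chi-squared distance between normalized color histograms of the
    highlight (mask > 0) and non-highlight (mask == 0) pixels,
    computed per RGB channel and summed."""
    dist = 0.0
    for c in range(3):
        h1, _ = np.histogram(img[..., c][mask > 0], bins=bins, range=(0, 256))
        h2, _ = np.histogram(img[..., c][mask == 0], bins=bins, range=(0, 256))
        h1 = h1 / max(h1.sum(), 1)   # normalize to probability histograms
        h2 = h2 / max(h2.sum(), 1)
        denom = h1 + h2
        denom[denom == 0] = 1.0      # avoid division by zero in empty bins
        dist += 0.5 * np.sum((h1 - h2) ** 2 / denom)
    return dist
```

A bright highlight region against a dark background produces a value near the maximum (1 per channel), matching the intuition that highlight/non-highlight color contrast is high.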
Highlight location distribution. To analyze the spatial distri-
bution of highlights in our dataset, we resize all highlight masks to
512 × 512, and sum them up to obtain a probability map to show
how likely each pixel is to belong to a specular highlight. Figure 4 plots
the highlight location distribution of our dataset. As can be seen,
highlights tend to cluster around the lower center of the image,
since objects are usually placed approximately at human eye level.
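The resize-and-average computation of the location probability map can be sketched as below; the nearest-neighbour resize keeps the sketch dependency-free and is our simplification:

```python
import numpy as np

def highlight_location_map(masks, size=512):
    """Average binary masks (resized to size x size) into a per-pixel
    probability map of highlight occurrence."""
    prob = np.zeros((size, size), dtype=np.float64)
    for m in masks:
        # nearest-neighbour resize via index sampling
        ys = np.arange(size) * m.shape[0] // size
        xs = np.arange(size) * m.shape[1] // size
        prob += (m[np.ix_(ys, xs)] > 0).astype(np.float64)
    return prob / max(len(masks), 1)
```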
Scene categories. In our dataset, indoor scenes have a large
proportion, since specular and shiny materials are highly common
in indoor scenes. Specifically, our dataset mostly includes scenes
from the following categories: living room, office, supermarket,
kitchen, dining room, and museum.
Figure 4: Statistics of the proposed specular highlight detec-
tion dataset. We show its distributions including highlight
numbers, area proportion, color contrast, and location.
Material categories. Our dataset contains various specular and
shiny materials on which specular highlights easily appear, such
as metal, plastic (clear), plastic (opaque), rubber/latex, glass, wood,
painted, porcelain, stone, leather, wax, marble, and tile.
4 PROPOSED METHOD
We observe that specular highlights in the real world often have
high intensity and are rather sparse in distribution, which leads
to a drastic change in the appearance of objects in the scene. This
observation inspires us to utilize the contrast between the high-
light and non-highlight regions. To accomplish this, we introduce
a context contrasted feature module (CCFM) to extract contrast
features for highlight detection. Based on it, we further design a
multi-scale context contrasted feature module (MCCFM) to har-
vest contrasted features to help locate highlights of various scales.
In addition, we exploit a binary cross-entropy (BCE) loss and an
intersection-over-union edge (IoUE) loss for our network.
4.1 Overview
Figure 5 illustrates the proposed specular highlight detection net-
work (SHDNet). It takes a single image as input and outputs the
highlight map in an end-to-end way. We take FPN [29] as the backbone.
SHDNet uses a convolutional neural network to generate five
feature maps with varying spatial resolutions. Features at shallow
CNN layers tend to capture fine details in both highlight regions
and many non-highlight regions, due to a lack of high-level semantic
information, while features at deep CNN layers suppress most
non-highlight regions but lack highlight-region detail, due to their
larger receptive fields compared with shallow layers. We upsample the low-resolution feature maps
at deep layers and sum them with the adjacent feature maps at
shallow layers to construct the feature pyramid. It can make full
use of the complementary information among features with differ-
ent spatial resolutions. Then, we embed an MCCFM (see Figure 6)
to generate context contrasted features for each layer. Next, con-
catenating these contrast features with convolutional features, we
build the context contrasted feature pyramid. Finally, with supervision
signals applied at each layer, we obtain a highlight map at
each layer, and treat the highlight map with the highest resolution
as the final highlight mask.
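The upsample-and-sum construction of the feature pyramid can be sketched as below; matching channel counts across levels (normally handled by FPN's lateral 1×1 convolutions) are assumed here for brevity:

```python
import torch
import torch.nn.functional as F

def build_feature_pyramid(feats):
    """FPN-style top-down pathway sketch: upsample each deeper feature map
    and add it to the adjacent shallower map.
    `feats` is ordered shallow -> deep; channel counts are assumed equal."""
    pyramid = [feats[-1]]  # start from the deepest (coarsest) map
    for f in reversed(feats[:-1]):
        up = F.interpolate(pyramid[0], size=f.shape[-2:],
                           mode="bilinear", align_corners=False)
        pyramid.insert(0, f + up)  # fuse with the adjacent shallower map
    return pyramid  # shallow -> deep, fused
```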
4.2 Multi-scale Context Contrasted Feature
Figure 6 demonstrates the architecture of the proposed MCCFM.
Given the input features extracted by the feature extraction net-
work, the purpose of MCCFM is to provide multi-scale context
contrasted features for locating specular highlights.
Figure 5: The schematic illustration of the proposed specular highlight detection network (SHDNet).
Figure 6: The schematic illustration of the proposed multi-scale context contrasted feature module (MCCFM). The MCCFM consists of four chained CCFMs. In each CCFM (gray dashed box), we compute context contrasted features between local information and its surrounding contexts. Then we concatenate the resulting context contrasted features as the output.
CCFM. The high intensity and sparseness are the distinctive
characteristics of specular highlight, which make it significantly
stand out from its surrounding non-highlight regions. To capture
this kind of high-contrast information, the desired context contrasted
features should be uniform inside the highlight region and
inside the non-highlight region, but at the same time differ between
the two. To this end, we design the CCFM to learn context contrasted
features between local information (generated by standard
convolutions) and its surrounding contexts (generated by average
pooling with different kernel sizes), expressed as
C = F − Up(T(AvgPool(F; s))), (1)
where F denotes the input features; C denotes the resulting context
contrasted features; Up denotes bilinear interpolation that up-samples
the contextual features to the same size as F; T denotes a
convolutional layer with kernel size 1 that combines the context
features across channels without changing their dimensions; and s
is the kernel size of the average pooling (AvgPool).
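Equation (1) maps directly onto a small PyTorch module; treating T as a single 1×1 convolution follows the text, while the names and channel handling are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CCFM(nn.Module):
    """Context contrasted feature module, a sketch of Eq. (1):
    C = F - Up(T(AvgPool(F; s)))."""
    def __init__(self, channels, pool_size):
        super().__init__()
        self.pool_size = pool_size
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)  # T in Eq. (1)

    def forward(self, feat):
        ctx = F.avg_pool2d(feat, self.pool_size)        # AvgPool(F; s)
        ctx = self.proj(ctx)                            # T(...)
        ctx = F.interpolate(ctx, size=feat.shape[-2:],  # Up(...)
                            mode="bilinear", align_corners=False)
        return feat - ctx                               # local minus context
```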
MCCFM. To make the highlight detection problem tractable, previous
methods [26, 50] typically assume that specular highlights are
small in size. However, we observe that specular highlights in the real
world can actually take up a large portion of an object (e.g., a
specularity spanning an entire large object in the scene, as shown in
Figures 1 and 9), for example due to the view direction or a large
light source. Based on the CCFM, we further propose a multi-scale context
contrasted feature module (MCCFM) to help capture specularities
of various scales, by chaining four CCFMs. Figure 6 illustrates the
scheme of the proposed MCCFM. Here, we empirically set the kernel
sizes of the average pooling operations to 2×2, 4×4, 6×6, and 8×8, so that
long-range spatial contrast can be captured. Then, the contrasted
features produced by the CCFMs are concatenated to yield feature
maps that identify the dividing edges.
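A sketch of the MCCFM with the four pooling kernels above; running the four contrast branches in parallel on a shared input and concatenating, rather than strictly chaining them, is a simplifying assumption of this sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MCCFM(nn.Module):
    """Multi-scale context contrasted feature module sketch: one contrast
    branch per pooling kernel size (2, 4, 6, 8), outputs concatenated."""
    def __init__(self, channels, pool_sizes=(2, 4, 6, 8)):
        super().__init__()
        self.pool_sizes = pool_sizes
        self.projs = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=1) for _ in pool_sizes)

    def forward(self, feat):
        outs = []
        for s, proj in zip(self.pool_sizes, self.projs):
            ctx = F.interpolate(proj(F.avg_pool2d(feat, s)),
                                size=feat.shape[-2:], mode="bilinear",
                                align_corners=False)
            outs.append(feat - ctx)  # contrasted features at scale s
        return torch.cat(outs, dim=1)  # concatenate multi-scale contrasts
```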
4.3 Loss function
Our total loss is defined as L = Σ_k w_k L_k, where L_k is the loss of the
k-th side output, K is the total number of outputs, and w_k is the
weighting parameter of each loss. As illustrated in Section 4, our
SHDNet is supervised with five side paths, i.e., K = 5. To obtain
high-quality highlight detection, we define L_k as L_k = L_k^BCE + L_k^IoUE,
where L_k^BCE and L_k^IoUE denote the binary cross-entropy
(BCE) loss and the intersection-over-union edge (IoUE) loss, respectively.
BCE loss. The cross-entropy loss is commonly adopted in salient
object detection and semantic segmentation problems. Similarly,
we adopt this loss for our specularity detection task, written as
L_k^BCE = − Σ_p (G_p log(H_p) + (1 − G_p) log(1 − H_p)), (2)

where p denotes a pixel location; H_p is the predicted probability of
pixel p being specular highlight; and G_p ∈ {0, 1} is the ground-truth
label of pixel p.
IoUE loss. We also introduce an IoUE loss, written as

L_k^IoUE = Σ_j (1 − 2|C_j ∩ Ĉ_j| / (|C_j| + |Ĉ_j|)), (3)

where j denotes a highlight region; C_j is the gradient magnitude
of the predicted highlight map of region j, while Ĉ_j is the gradient
magnitude of the ground-truth highlight map of region j, both estimated
by a Sobel operator. Here the intersection is estimated using a
pixel-wise multiplication operator.
The IoUE loss helps locate highlights, resulting in highlight
masks with clearer contours/edges. The reason is two-fold.
First, the IoUE loss leverages edge information, which is relatively
insensitive to the scale differences of highlight regions. Second,
since the training set includes highlight regions with diverse
sizes, the edges of both large highlight regions and very thin
(or tiny) regions are incorporated into the network training. By
doing so, our network with the IoUE loss can locate highlights
of different sizes, as demonstrated by the detection
results of our method. See the ablation study in Section 5.4 for more
details.
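Equation (3) can be sketched as follows; computing the loss over the whole map instead of per region j, and clamping the Sobel magnitudes to [0, 1] so the Dice-style ratio stays bounded, are simplifications not specified by the paper:

```python
import torch
import torch.nn.functional as F

def sobel_magnitude(x):
    """Gradient magnitude of an (N,1,H,W) map via fixed Sobel kernels,
    clamped to [0, 1] (a simplification for boundedness)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=x.device).view(1, 1, 3, 3)
    gx = F.conv2d(x, kx, padding=1)
    gy = F.conv2d(x, kx.transpose(-1, -2), padding=1)
    return torch.clamp(torch.sqrt(gx ** 2 + gy ** 2), max=1.0)

def iou_edge_loss(pred, gt, eps=1e-8):
    """IoUE sketch over the whole map: 1 - 2|C ∩ Ĉ| / (|C| + |Ĉ|),
    with the intersection as a pixel-wise product of edge magnitudes."""
    c_pred = sobel_magnitude(pred)
    c_gt = sobel_magnitude(gt)
    inter = (c_pred * c_gt).sum()
    return 1.0 - 2.0 * inter / (c_pred.sum() + c_gt.sum() + eps)
```

A perfect binary prediction yields a loss near 0, while a prediction with no edges at all yields a loss near 1.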
(a) Input (b) [41] (c) [32] (d) [50] (e) [26]
(f) [10] (g) [21] (h) [51] (i) Ours (j) Ground truth
Figure 7: Visual comparison of our method against recent state-of-the-art detection methods on the testing set of our dataset.
4.4 Implementation details
To ensure that the training set and testing set of our dataset have
basically matching distributions, we randomly select 3,017 images as the training set,
and the other 1,293 images as the testing set. We train our model on the
training set of our dataset. We have implemented the SHDNet in
PyTorch on a PC equipped with Intel(R) Xeon(R) CPU E5-2630 v4 @
2.20GHz, and NVIDIA GeForce GTX 1080Ti. The Adam optimizer
[23] is used to train our network. The hyper-parameters are set as follows:
learning rate = 5e−5, weight decay = 0.0005, momentum = 0.9,
batch size = 8, and loss weight for each side output w = 1. In
addition, we train the network for 100 epochs and divide the learning
rate by 10 after 20 epochs. We do not validate the network on
the dataset during training. Training the network to convergence
takes about 80 hours on our GPU. During inference, we can
obtain a predicted specular highlight map and its corresponding
highlight edge map. In our method, we use the highlight map with
the highest resolution as the final highlight mask.
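The training schedule above can be sketched as follows; the model stand-in is hypothetical, and we read "momentum = 0.9" as Adam's β₁ (which is already PyTorch's default) and the one-time learning-rate drop as a MultiStepLR milestone:

```python
import torch

# Hypothetical stand-in for SHDNet; only the optimizer settings matter here.
model = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)

# Paper's hyper-parameters: lr = 5e-5, weight decay = 0.0005.
# Adam's default betas (0.9, 0.999) already match the stated momentum of 0.9.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5, weight_decay=0.0005)

# Train 100 epochs; divide the learning rate by 10 once, after epoch 20.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20], gamma=0.1)
```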
5 EXPERIMENTS
5.1 Experimental settings
Dataset and evaluation metrics. We have evaluated the performance
of all detection methods on our dataset. Note that most
existing datasets contain a very small number of images and only
allow subjective evaluation due to lack of ground truth annotations.
In contrast, the test set of our dataset covers a broad range of im-
ages with annotations, and allows both qualitative and quantitative
evaluations.
Following the common metrics in the semantic segmentation, saliency
detection, and shadow detection communities, we adopt
four commonly used metrics, including MAE, mean F-measure,
max F-measure, and S-measure [1, 8, 14], to evaluate the performance
of our method and the compared detection methods.
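MAE and the F-measure at a single threshold can be sketched as below; the threshold of 0.5 and β² = 0.3 are common choices in the saliency literature, not values stated in this paper:

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a predicted map and a mask, both in [0, 1]."""
    return float(np.mean(np.abs(pred - gt)))

def f_measure(pred, gt, thresh=0.5, beta2=0.3):
    """F-measure at one binarisation threshold; beta^2 = 0.3 is a common
    (assumed) weighting favouring precision."""
    b = pred >= thresh
    tp = float(np.sum(b & (gt > 0)))              # true positives
    prec = tp / max(b.sum(), 1)                   # precision
    rec = tp / max((gt > 0).sum(), 1)             # recall
    return (1 + beta2) * prec * rec / max(beta2 * prec + rec, 1e-8)
```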
Data augmentation. To improve the generalization ability
of our method to handle colored highlights, we perform data
augmentation on the training set of our dataset. First, we use
the fast Fourier color constancy method proposed by Barron et al.
[3] to restore non-white-balanced images to white balance for
subsequent data augmentation. Then, we generate more training
images by mixing each white-balanced image with varying colored
illumination. We express this process as I^c_p = I_p ∗ C, where I is an
input image; I^c is an augmented image; p denotes a pixel location; ∗
denotes element-wise multiplication; and C = [R, G, B] is the mixed
illumination color vector (three random values between 0 and 1).
For simplicity, we assume that there is only one dominant light
source with a basically constant color in a scene. For each image,
10 augmented images are produced.
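The mixing step I^c_p = I_p ∗ C can be sketched as follows; operating on 8-bit images normalized to [0, 1] is an assumption of this sketch:

```python
import numpy as np

def augment_colored_illumination(img, rng=None):
    """Mix a white-balanced uint8 image with a random illumination colour:
    I^c_p = I_p * C, with C = [R, G, B] drawn uniformly from (0, 1)."""
    rng = rng or np.random.default_rng()
    c = rng.uniform(0.0, 1.0, size=3)           # dominant light colour C
    out = img.astype(np.float64) / 255.0 * c    # element-wise, per channel
    return (out * 255.0).astype(np.uint8), c
```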
Compared methods. For comparison, we choose four state-of-the-art highlight detection methods [26, 32, 41, 50].
In addition, we also choose several state-of-the-art learning-based
segmentation and detection methods from related fields, including
DSS [19] in the saliency detection community, Deeplab v3+ [10] in
the semantic segmentation community, DSC [21, 22] and DSD [51]
in the shadow detection community.
Fairness setting. Since none of the compared methods were
given the opportunity to train on our dataset, we tried our best to
make the evaluation fair by testing various parameters for each
method. We varied the window size n for [41], the inner dimension
of factorization for [50] and the two weighting parameters α and
β for [26] on our augmented testing set. Note that [32] has
no adjustable parameters. In addition, we re-trained DSS,
Deeplab v3+, DSC, and DSD on our training set, and set the input
image size to 512 × 512. For these four learning-based methods,
we tuned their many parameters to produce results that are as good
as possible. To measure the final quantitative result,
we use leave-one-out cross-validation (namely, test each image
using the best parameters for all other images).
5.2 Comparison on our testing set
Quantitative comparison. Table 1 reports the quantitative com-
parison results on our testing set. As can be seen, our method
achieves the best quantitative results on all four metrics including
S-measure, mean F-measure (meanF), max F-measure (maxF), and
MAE. Notice that the methods of Meslouhi et al. [32] and Li et al. [26]
achieve unsatisfactory results on our testing set, since they rely on the
strict premise that the illumination is white-balanced.
Visual comparison. In Figure 7, we provide some visual com-
parison results. The results show that our method can produce
more accurate results than other state-of-the-art highlight detection
methods. We found that the four traditional methods [26, 32, 41, 50]
often mistakenly detect the very bright material areas and non-
specular lighting as the specular highlights, e.g., the white product
labels on the four bottles (1st group of results), and the white text
textures as well as the white dots of the carpet (2nd group of re-
sults). The reason is that these methods are based on the analysis or
statistics of the pixel intensities and fail to semantically distinguish
between highlight pixels and achromatic pixels (i.e., very bright
material surfaces and text textures). In addition, the compared deep
learning-based semantic segmentation method [10], salient object
detection method [19], and shadow detection methods [21, 51] are
likely to have the ability to exclude very bright material colors.
Table 1: Quantitative comparison results on our testing set.
↑ and ↓ mean that larger and smaller values are better, respectively.
Metrics [41] [32] [50] [26] [19] [10] [21] [51] Ours
S-m. ↑ 0.132 0.125 0.521 0.201 0.491 0.619 0.412 0.480 0.793
meanF ↑ 0.027 0.030 0.410 0.223 0.218 0.451 0.108 0.202 0.676
maxF ↑ 0.031 0.033 0.383 0.301 0.382 0.501 0.114 0.490 0.736
MAE ↓ 0.423 0.478 0.021 0.383 0.053 0.019 0.091 0.049 0.006
Table 2: User study results.
Methods [41] [32] [50] [26] [19] [10] [21] [51] Ours
PoV 1.13% 1.81% 2.41% 2.82% 4.19% 18.2% 1.18% 8.01% 60.25%
Table 3: Ablation study. “basic” denotes our network without
the MCCFM module and the proposed losses.
Networks S-m.↑ meanF↑ maxF↑ MAE↓
basic + BCE loss 0.638 0.508 0.612 0.013
basic + hybrid loss 0.680 0.609 0.645 0.012
basic + CCFM + BCE loss 0.691 0.623 0.649 0.012
basic + CCFM + hybrid loss 0.762 0.652 0.703 0.009
basic + MCCFM + BCE loss 0.789 0.661 0.721 0.008
basic + MCCFM + hybrid loss 0.793 0.676 0.736 0.006
However, these methods fail to locate tiny highlight details, and
undesirably treat the diffuse pixels around highlights as highlights.
This is because the shape and appearance of specular highlights in
our task differ greatly from those of objects and shadows in their
tasks. In comparison, our method effectively locates highlights of
different sizes while excluding bright material surfaces.
5.3 User study
We further conduct a user study to compare the results. To this end,
we first download an extra 500 test images from Flickr by searching
with keywords such as study room, restaurant, grocery store, and
reception (Figure 8 presents an example). These real-world images of
whole scenes in the wild are quite challenging, with heavy texture,
noise, colored lighting, achromatic surfaces, etc. Then, we use our
method and the compared methods to produce highlight masks
for each test image, and recruit 100 participants from a school
campus to rate each group of results, which are presented in random
order to avoid subjective bias.
For each result, the participants are asked to give a rating on
a Likert scale from 1 (worst) to 5 (best). After all participants
have finished the survey, for each method we compute the
percentage of votes (PoV) as the final result by
$\mathrm{PoV} = \frac{1}{N}\sum_{i}\sum_{j} V_{i,j} \times 100\%$,
where $i$ denotes the $i$-th group of results, $j$ denotes the
$j$-th participant, $V_{i,j}$ denotes the number of votes on the
$i$-th group of results given by the $j$-th participant, and $N$ is
the total voting score. Table 2 reports the results of the user study,
in which our method substantially outperformed the other compared
detection methods. The visual comparison results on a randomly
selected image are shown in Figure 8.
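The PoV statistic can be computed as in this toy sketch, assuming, as defined above, that the nested structure holds per-participant scores and that N is the total voting score over all methods:

```python
def percentage_of_votes(votes, total_score):
    """votes[i][j]: score given by participant j to this method on result group i.
    total_score: N, the total voting score accumulated over all methods."""
    return sum(sum(row) for row in votes) / total_score * 100.0
```

Because every vote is counted against the same total N, the PoV values of all compared methods sum to 100%.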
Figure 8: Visual comparison with state-of-the-art detection methods on a real-world photo in the wild. (a) Input. (b) [41]. (c) [32]. (d) [50]. (e) [26]. (f) [19]. (g) [10]. (h) [21]. (i) [51]. (j) Ours.
Figure 9: Ablation study for our MCCFM. (a) Input. (b) Basic.
(c) Basic + CCFM + hybrid loss. (d) Basic + MCCFM + BCE
loss. (e) Basic + MCCFM + hybrid loss. (f) Ground truth.
Figure 10: Sensitivity to colored lighting (rows: Input, [50], Ours). Zoom in for detail.
5.4 Discussions
Ablation study. We perform experiments to evaluate the perfor-
mance of the proposed MCCFM, as well as the BCE loss and IoUE
loss. The quantitative comparison results are shown in Table 3. As
can be seen, our SHDNet with the hybrid loss (i.e., the BCE loss
plus the IoUE loss) obtains better results than with the BCE loss
alone. As shown in Figure 9(d)(e), our SHDNet with the hybrid loss
produces highlight masks with clearer contours/edges, compared
to omitting the IoUE loss. Moreover, Table 3 also shows that our
network with MCCFM (4 CCFMs) improves performance more
than with only one CCFM. As shown in Figure 9, our SHDNet with
MCCFM learns contrasted features more effectively, accurately
locating highlights while avoiding residual non-highlight details
in the highlight maps, compared to using only one CCFM.
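The hybrid loss can be sketched roughly as BCE plus a soft-IoU term. This is a simplification, not the paper's exact formulation: the IoUE loss additionally emphasizes highlight edges, which this sketch omits, and `smooth` is an assumed stabilizing constant.

```python
import numpy as np

def bce_loss(pred, gt, eps=1e-7):
    # Binary cross-entropy between a sigmoid output and a binary mask.
    pred = np.clip(pred, eps, 1 - eps)
    return float(-(gt * np.log(pred) + (1 - gt) * np.log(1 - pred)).mean())

def iou_loss(pred, gt, smooth=1.0):
    # Soft intersection-over-union loss on continuous predictions.
    inter = (pred * gt).sum()
    union = pred.sum() + gt.sum() - inter
    return float(1.0 - (inter + smooth) / (union + smooth))

def hybrid_loss(pred, gt):
    # BCE handles per-pixel classification; the IoU term penalizes
    # region-level mismatch, which tends to sharpen mask boundaries.
    return bce_loss(pred, gt) + iou_loss(pred, gt)
```

Intuitively, BCE alone treats every pixel independently, while the IoU term couples all pixels of a highlight region, which is consistent with the clearer contours observed in Figure 9.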
Robustness to colored lighting. Our method is robust to col-
ored lighting, mainly due to our data augmentation scheme. The
visual comparison results in Figure 10 show that the traditional
methods fail to detect colored highlights well, since they depend
heavily on the premise of white-balanced illumination. In contrast,
our method effectively overcomes this issue.
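One plausible form of such a colored-lighting augmentation, a guess at the scheme rather than the paper's exact recipe, is to scale each RGB channel of a training image by an independent random gain:

```python
import numpy as np

def random_color_cast(img, rng, low=0.6, high=1.4):
    """Simulate non-white-balanced illumination on a float image in [0, 1]
    by applying a random per-channel gain (illustrative parameter range)."""
    gains = rng.uniform(low, high, size=3)  # one gain per RGB channel
    return np.clip(img * gains, 0.0, 1.0)
```

Training on such casts discourages the network from assuming that highlights are achromatic, which is exactly the white-balance premise the traditional methods rely on.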
6 CONCLUSION
This paper presented a large-scale specularity dataset consisting of
4,310 annotated real-world images for specular highlight detection.
In addition, by analyzing the shape and appearance of specular
highlights, we designed a novel deep convolutional neural network,
called SHDNet, that leverages multi-scale context contrasted features
for specular highlight detection. Our method achieves superior
performance over existing methods on our collected dataset and
on many challenging images in the wild from the Internet. We hope
that our dataset will benefit many multimedia applications, and
enable more researchers to tackle the highlight detection task. In the
future, we would like to extend our method to specular highlight
detection in videos, and to use our method as a pre-processing step
for specularity removal [17], intrinsic decomposition [16, 18],
exposure correction [44, 48, 49], and so on.
ACKNOWLEDGMENTS
We would like to thank Qimeng Zhang, Jianlin Zhu, Yiwen Liu,
Zihao Wang, Ting Zhang, Xinyue Zhang, Zhuoxu Huang, Yizhe Lv,
and Jiapan Wang for their invaluable help in building the dataset.
We also thank the ACM MM reviewers for their feedback. This
work was partly supported by the Key Technological Innovation
Projects of Hubei Province (2018AAA062), NSFC (Nos. 61672390,
61972298, and 61902275), and the National Key Research and
Development Program of China (2017YFB1002600).
REFERENCES
[1] Radhakrishna Achanta, Sheila Hemami, Francisco Estrada, and Sabine Susstrunk. 2009. Frequency-tuned salient region detection. In CVPR. 1597–1604.
[2] Ruzena Bajcsy, Sang Wook Lee, and Aleš Leonardis. 1990. Color image segmentation with detection of highlights and local illumination induced by inter-reflections. In CVPR. 785–790.
[3] Jonathan T. Barron and Yun-Ta Tsai. 2017. Fast Fourier Color Constancy. In CVPR. 6950–6958.
[4] Shida Beigpour and Joost van de Weijer. 2011. Object recoloring based on intrinsic image estimation. In ICCV. 327–334.
[5] Sean Bell, Kavita Bala, and Noah Snavely. 2014. Intrinsic Images in the Wild. ACM Transactions on Graphics 33, 4 (2014), 159.
[6] Sean Bell, Paul Upchurch, Noah Snavely, and Kavita Bala. 2013. OpenSurfaces: A richly annotated catalog of surface appearance. ACM Transactions on Graphics 32, 4 (2013), 1–17.
[7] Sai Bi, Xiaoguang Han, and Yizhou Yu. 2015. An L1 Image Transform for Edge-Preserving Smoothing and Scene-Level Intrinsic Decomposition. ACM Transactions on Graphics 34, 4 (2015), 78.
[8] Ali Borji, Ming-Ming Cheng, Huaizu Jiang, and Jia Li. 2015. Salient Object Detection: A Benchmark. IEEE Transactions on Image Processing 24, 12 (2015), 5706–5722.
[9] Gavin Brelstaff and Andrew Blake. 1988. Detecting Specular Reflections Using Lambertian Constraints. In ICCV. 297–302.
[10] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. 2018. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In ECCV. 833–851.
[11] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV. 801–818.
[12] François Chollet. 2017. Xception: Deep learning with depthwise separable convolutions. In CVPR. 1251–1258.
[13] Henghui Ding, Xudong Jiang, Bing Shuai, Ai Qun Liu, and Gang Wang. 2018. Context Contrasted Feature and Gated Multi-Scale Aggregation for Scene Segmentation. In CVPR. 2393–2402.
[14] Deng-Ping Fan, Ming-Ming Cheng, Yun Liu, Tao Li, and Ali Borji. 2017. Structure-measure: A new way to evaluate foreground maps. In ICCV. 4548–4557.
[15] Rogerio Feris, Ramesh Raskar, Kar-Han Tan, and Matthew Turk. 2006. Specular Highlights Detection and Reduction With Multi-Flash Photography. Journal of the Brazilian Computer Society 12, 1 (2006), 35–42.
[16] Gang Fu, Lian Duan, and Chunxia Xiao. 2019. A Hybrid L2-Lp Variational Model for Single Low-light Image Enhancement with Bright Channel Prior. In ICIP. 1925–1929.
[17] Gang Fu, Qing Zhang, Chengfang Song, Qifeng Lin, and Chunxia Xiao. 2019. Specular Highlight Removal for Real-world Images. Computer Graphics Forum 7 (2019), 253–263.
[18] Gang Fu, Qing Zhang, and Chunxia Xiao. 2019. Towards high-quality intrinsic images in the wild. In ICME. 175–180.
[19] Qibin Hou, Ming-Ming Cheng, Xiaowei Hu, Ali Borji, Zhuowen Tu, and Philip H. S. Torr. 2019. Deeply Supervised Salient Object Detection With Short Connections. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 4 (2019), 815–828.
[20] Patrik O. Hoyer. 2004. Non-Negative Matrix Factorization With Sparseness Constraints. Journal of Machine Learning Research 5 (2004), 1457–1469.
[21] Xiaowei Hu, Chi-Wing Fu, Lei Zhu, Jing Qin, and Pheng-Ann Heng. 2019. Direction-Aware Spatial Context Features for Shadow Detection and Removal. IEEE Transactions on Pattern Analysis and Machine Intelligence (2019). To appear.
[22] Xiaowei Hu, Lei Zhu, Chi-Wing Fu, Jing Qin, and Pheng-Ann Heng. 2018. Direction-aware spatial context features for shadow detection. In CVPR. 7454–7462.
[23] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR.
[24] Gudrun J. Klinker, Steven A. Shafer, and Takeo Kanade. 1988. The Measurement of Highlights in Color Images. International Journal of Computer Vision 2, 1 (1988), 7–32.
[25] Hsien-Che Lee. 1986. Method for Computing the Scene-Illuminant Chromaticity From Specular Highlights. JOSA A 3, 10 (1986), 1694–1699.
[26] Ranyang Li, Junjun Pan, Yaqing Si, Bin Yan, Yong Hu, and Hong Qin. 2019. Specular Reflections Removal for Endoscopic Image Sequences With Adaptive-RPCA Decomposition. IEEE Transactions on Medical Imaging 39, 2 (2019), 328–340.
[27] Xin Li, Fan Yang, Hong Cheng, Wei Liu, and Dinggang Shen. 2018. Contour knowledge transfer for salient object detection. In ECCV. 355–370.
[28] Peiwen Lin, Peng Sun, Guangliang Cheng, Sirui Xie, Xi Li, and Jianping Shi. 2020. Graph-Guided Architecture Search for Real-Time Semantic Segmentation. In CVPR. 4203–4212.
[29] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. 2017. Feature pyramid networks for object detection. In CVPR. 2117–2125.
[30] Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In CVPR. 3431–3440.
[31] Laurence T. Maloney and Brian A. Wandell. 1986. Color Constancy: A Method for Recovering Surface Spectral Reflectance. JOSA A 3, 1 (1986), 29–33.
[32] Othmane Meslouhi, Mustapha Kardouchi, Hakim Allali, Taoufiq Gadi, and Yassir Benkaddour. 2011. Automatic Detection and Inpainting of Specular Reflections for Colposcopic Images. Open Computer Science 1, 3 (2011), 341–354.
[33] Aaron Netz and Margarita Osadchy. 2012. Recognition Using Specular Highlights. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 3 (2012), 639–652.
[34] Francisco Ortiz, Fernando Torres Medina, and Pablo Gil. 2005. A Comparative Study of Highlights Detection and Elimination by Color Morphology and Polar Color Models. In Pattern Recognition and Image Analysis. 295–302.
[35] Margarita Osadchy, David W. Jacobs, and Ravi Ramamoorthi. 2003. Using Specularities for Recognition. In ICCV. 1512–1519.
[36] Jae Byung Park and Avinash C. Kak. 2003. A truncated least squares approach to the detection of specular highlights in color images. In The IEEE International Conference on Robotics and Automation. 1397–1403.
[37] Véronique Prinet, Michael Werman, and Dani Lischinski. 2013. Specular highlight enhancement from video sequences. In ICIP. 558–562.
[38] Xuebin Qin, Zichen Zhang, Chenyang Huang, Chao Gao, Masood Dehghan, and Martin Jagersand. 2019. BASNet: Boundary-Aware Salient Object Detection. In CVPR. 7479–7489.
[39] Minjung Son, Yunjin Lee, and Hyun Sung Chang. 2020. Toward Specular Removal From Natural Images Based on Statistical Reflection Models. IEEE Transactions on Image Processing 29 (2020), 4204–4218.
[40] Michael W. Tao, Jong-Chyi Su, Ting-Chun Wang, Jitendra Malik, and Ravi Ramamoorthi. 2016. Depth Estimation and Specular Removal for Glossy Surfaces Using Point and Line Consistency With Light-Field Cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 6 (2016), 1155–1169.
[41] Stephane Tchoulack, J. M. Pierre Langlois, and Farida Cheriet. 2008. A video stream processor for real-time detection and correction of specular reflections in endoscopic images. In Joint International IEEE Northeast Workshop on Circuits and Systems and TAISA Conference. 49–52.
[42] Qing Tian and James J. Clark. 2013. Real-Time Specularity Detection Using Unnormalized Wiener Entropy. In The International Conference on Computer and Robot Vision. 356–363.
[43] Kentaro Wada. 2016. labelme: Image Polygonal Annotation with Python. https://github.com/wkentaro/labelme.
[44] Ruixing Wang, Qing Zhang, Chi-Wing Fu, Xiaoyong Shen, Wei-Shi Zheng, and Jiaya Jia. 2019. Underexposed photo enhancement using deep illumination estimation. In CVPR. 6849–6857.
[45] Peter Welinder, Steve Branson, Pietro Perona, and Serge J. Belongie. 2010. The multidimensional wisdom of crowds. In NIPS. 2424–2432.
[46] Yanchao Yang and Stefano Soatto. 2020. FDA: Fourier Domain Adaptation for Semantic Segmentation. In CVPR. 4085–4095.
[47] Ling Zhang, Qingan Yan, Zheng Liu, Hua Zou, and Chunxia Xiao. 2017. Illumination Decomposition for Photograph With Multiple Light Sources. IEEE Transactions on Image Processing 26, 9 (2017), 4114–4127.
[48] Qing Zhang, Yongwei Nie, and Wei-Shi Zheng. 2019. Dual Illumination Estimation for Robust Exposure Correction. Computer Graphics Forum 38, 7 (2019), 243–252.
[49] Qing Zhang, Ganzhao Yuan, Chunxia Xiao, Lei Zhu, and Wei-Shi Zheng. 2018. High-quality exposure correction of underexposed photos. In ACM MM. 582–590.
[50] Wuming Zhang, Xi Zhao, Jean-Marie Morvan, and Liming Chen. 2018. Improving Shadow Suppression for Illumination Robust Face Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 3 (2018), 611–624.
[51] Quanlong Zheng, Xiaotian Qiao, Ying Cao, and Rynson W. H. Lau. 2019. Distraction-aware shadow detection. In CVPR. 5167–5176.
[52] Zhiming Luo, Akshaya Kumar Mishra, Andrew Achkar, Justin A. Eichel, Shaozi Li, and Pierre-Marc Jodoin. 2017. Non-local Deep Features for Salient Object Detection. In CVPR. 6593–6601.
[53] Lei Zhu, Zijun Deng, Xiaowei Hu, Chi-Wing Fu, Xuemiao Xu, Jing Qin, and Pheng-Ann Heng. 2018. Bidirectional Feature Pyramid Network with Recurrent Attention Residual Modules for Shadow Detection. In ECCV. 122–137.
[54] Lei Zhu, Xiaowei Hu, Chi-Wing Fu, Jing Qin, and Pheng-Ann Heng. 2018. Saliency-Aware Texture Smoothing. IEEE Transactions on Visualization and Computer Graphics 26, 7 (2018), 2471–2484.