Learning to Detect Specular Highlights from Real-world Images
Gang Fu, Wuhan University, [email protected]
Qing Zhang, Sun Yat-sen University
Qifeng Lin, Wuhan University
Lei Zhu, Tianjin University
Chunxia Xiao∗, Wuhan University, [email protected]
ABSTRACT
Specular highlight detection is a challenging problem with many
applications, such as shiny object detection and light source
estimation. Although various highlight detection methods have been
proposed, they fail to disambiguate bright material surfaces from
highlights, and cannot handle non-white-balanced images. More-
over, at present, there is still no benchmark dataset for highlight
detection. In this paper, we present a large-scale real-world high-
light dataset containing a rich variety of material categories, with
diverse highlight shapes and appearances, in which each image has
an annotated ground-truth mask. Based on the dataset, we
develop a deep learning-based specular highlight detection network
(SHDNet) leveraging multi-scale context contrasted features to ac-
curately detect specular highlights of varying scales. In addition, we
design a binary cross-entropy (BCE) loss and an intersection-over-
union edge (IoUE) loss for our network. Compared with existing
highlight detection methods, our method can accurately detect
highlights of different sizes, while effectively excluding non-highlight
regions such as bright materials, non-specular and colored lighting,
and even light sources.
CCS CONCEPTS
• Computing methodologies→ Interest point and salient re-
gion detections.
KEYWORDS
Highlight detection; context contrasted feature; highlight dataset
ACM Reference Format:
Gang Fu, Qing Zhang, Qifeng Lin, Lei Zhu, and Chunxia Xiao. 2020. Learning
to Detect Specular Highlights from Real-world Images. In Proceedings of
the 28th ACM International Conference on Multimedia (MM ’20), October
12–16, 2020, Seattle, WA, USA. ACM, New York, NY, USA, 9 pages. https:
//doi.org/10.1145/3394171.3413586
∗Corresponding author.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
MM ’20, October 12–16, 2020, Seattle, WA, USA
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-7988-5/20/10. . . $15.00
https://doi.org/10.1145/3394171.3413586
Figure 1: Comparison with state-of-the-art highlight detection methods on an example image with very bright texts and materials. (a) Input, (b) Li et al. [26], (c) Zhang et al. [50], (d) Ours.
1 INTRODUCTION
Specular highlight¹, as a common phenomenon in the real world,
usually appears as a bright spot on shiny objects when they are
illuminated. It thus has a great effect on the appearance of objects,
and constitutes a double-edged sword that cannot be ignored. On the one hand, the
presence of specular highlights degrades the performance of some
tasks, such as intrinsic image decomposition [5, 7], general object
detection, clustering, and recoloring [4]. But on the other hand,
it also enhances the performance of other tasks, such as specu-
lar highlight removal [17, 39], glossy object recognition [33, 35],
depth estimation [40], face recognition [50], highlight manipulation
[37], and light source estimation [47]. Therefore, specular highlight
detection has long been a fundamental problem. A high-quality
highlight detection method would significantly benefit the above
tasks and even more applications.
¹The three terms specular highlight, highlight, and specularity are used interchangeably in this paper, unless stated otherwise.
Nearly all existing detection methods are based on various forms
of thresholding operations [26, 32, 35, 36, 41], which are built
upon the strict premise that the illumination is white, and
treat the brightest pixels as highlights. However, this premise is
often invalid for real-world images with very bright material surfaces
and non-specular lighting (e.g., interreflection and overexposure),
whose intensities can exceed those of highlights. Essentially, existing
traditional methods neither semantically disambiguate all-white
and/or nearly-white materials from highlights in complex
real-world scenes, nor accurately locate the desired highlights. As
shown in Figure 1, two recent state-of-the-art methods [26, 50]
fail to produce satisfactory results, particularly where many white
texts are undesirably copied to the highlight maps. In addition,
existing methods are unable to handle non-white-balanced images,
since they often depend on the even stronger premise that the
illumination is white. Moreover, at present, there is still no public
large real-world benchmark dataset to evaluate the performance
of highlight detection algorithms. This limits the development of
highlight detection techniques and even related techniques such as
highlight removal.
In this work, we present a large-scale dataset for the first time
to promote specular highlight detection for real-world images. Our
dataset includes 4,310 images featuring a wide variety of scenes and
materials, each with a labeled ground truth highlight mask. It also
contains almost all everyday shiny materials, such as metal, plastic,
glass, and wood, on which specular highlights often appear. To the
best of our knowledge, it is the first large-scale real-world dataset
for the specular highlight detection task. Thus, it would promote a
fair performance comparison among existing highlight detection
methods, and even benefit the evaluation of other tasks such as highlight
removal and light source estimation. Figure 2 shows several example
image pairs in our dataset.
Based on our dataset, we have also designed a deep learning-
based specular highlight detection method. We observe that specular
highlights usually change the appearance of objects in the scene
drastically, which makes them stand out from their surroundings. However,
this change includes both low-level color variations and high-level
semantics. So we leverage the context contrasted information for
specular highlight detection. We note that this kind of informa-
tion has been successfully exploited in semantic segmentation [13]
and saliency detection [52]. Also, we use a binary cross-entropy
(BCE) loss and an intersection-over-union edge (IoUE) loss for our
network to help more accurately locate highlights. In addition, to
increase the generalization ability and robustness of our method to
non-white-balanced images, we further perform data augmentation to produce more training images with colored illumination.
To sum up, our main contributions are:
• We present the first large-scale specularity dataset with specular highlight annotations.
• We develop a deep learning-based method leveraging multi-
scale context contrasted features, equipped with a BCE loss
and an IoUE loss, to accurately detect specularities.
• We evaluate our method and state-of-the-art detection meth-
ods on our dataset and many challenging images in the wild
from the Internet, and show the superior performance of our
method both quantitatively and qualitatively.
2 RELATED WORK
In this section, we briefly review state-of-the-art specular highlight
detection methods, as well as salient object detection, shadow detection,
and semantic segmentation from related fields.
Specular highlight detection. Early researchers proposed various
algorithms [9, 31] for specular highlight detection and compensation,
since camera saturation caused by specular highlights interferes
with existing image processing algorithms that use decision
thresholds [36]. Subsequently, researchers have proposed more
specular highlight detection algorithms [2, 24, 25], based on the
concept of "well-definedness" of mappings between the chromaticity
spaces corresponding to images taken under different illuminations.
Later, Park et al. [36] proposed a truncated least squares
method for the detection of specular highlights in color images.
Recently, several methods [26, 32, 34, 41] designed algorithms
leveraging a thresholding scheme in RGB or HSV channels. Although
these methods are simple and fast, they are often sensitive
to the threshold value. Tian et al. [42] proposed a real-time highlight
detection method based on an unnormalized version of the
Wiener Entropy. Zhang et al. [50] formulated specular highlight de-
tection as Non-negative Matrix Factorization (NMF) [20] based on
the fact that only a small portion of an image contains highlights.
Although existing highlight detection methods have achieved
remarkable progress, they fail to produce satisfactory results for
real-world images with very bright materials and non-specular
lighting. In comparison, we utilize modern deep learning tools to
achieve high-quality results by building a large-scale real-world
dataset.
Salient object detection. Its goal is to find the most visually
distinctive objects in an image. Recently, Li et al. [27] combined
a well-trained contour detection model with a saliency detection
model, with a contour-to-saliency transferring scheme. Qin et al.
[38] proposed a densely Encoder-Decoder network and a residual
refinement module for saliency prediction and saliency map refine-
ment respectively. Hou et al. [19] introduced a deeply supervised
method with short connections, making full use of multi-level and
multi-scale features. Zhu et al. [54] designed the guided non-local
block to utilize guidance information for saliency detection.
Shadow detection. It aims to detect shadows in the scene. Re-
cently, Hu et al. [21, 22] proposed a deep learning-based detection
method, making full use of direction-aware spatial context features.
Zheng et al. [51] proposed a distraction-aware detection method by
learning and integrating the semantics of visual distraction regions.
Zhu et al. [53] explored the global context in deep layer and local
context in shallow layers of a deep convolutional neural network to
build a bidirectional feature pyramid network for shadow detection.
Semantic segmentation. It aims at predicting pixel-level labels
of object categories for images. A pioneer work in semantic seg-
mentation is Fully Convolutional Network (FCN) [30]. To produce
high-quality semantic segmentation results, some researchers have
utilized various heavy backbones [12] or powerful modules to lever-
age multi-scale context information [11]. Recently, Lin et al. [28]
integrated the graph convolution network into neural architecture
search as a communication mechanism between independent cells.
Yang et al. [46] used a Fourier Transform for semantic segmentation.
Figure 2: Example highlight images and corresponding highlight masks in our dataset. Please zoom in to view more details.
Objects and shadows to be detected in a scene in the above three
tasks often appear to be a few connected large areas. In contrast,
specular highlights in a scene have various shapes, ranging
from tiny dots to long areas spanning entire large objects. Therefore,
based on the special characteristics of highlights, we design a
high-performance deep learning-based network for highlight detec-
tion, for the first time in this field. Since the network needs many
labeled images for training, we also provide a large-scale real-world
specularity dataset.
Specularity datasets with annotated highlight masks. Al-
though there are many studies of specular highlight detection, in
general, most methods [15, 50] conduct only a visual evaluation on
a few images selected by the authors, without annotated ground-truth
highlight masks. This falls far short of a comprehensive quantitative
evaluation of highlight detection algorithms on a large-scale benchmark
dataset. To the best of our knowledge, there is no public dataset
with specularity images and corresponding ground truth highlight
masks. We thus present such a dataset to enable systematic repro-
ducible research in this field.
3 OUR DATASET
In this section, we first describe the annotation pipeline in detail,
and then provide a comprehensive analysis of the dataset.
3.1 Building dataset
To build our dataset, we first downloaded various photos from the
Internet. Then, we filtered out undesired photos. Next, the recruited
workers annotated highlights for each photo. Finally, we checked
the quality of their annotated masks. Note that for our annotation task,
we recruited 15 students from a college campus who are skilled at
Photoshop. In the following, we will describe these steps in detail.
Step 1: Collecting images. First, we downloaded a set of every-
day photos from the following sources: (1) photo-sharing websites,
such as Flickr; (2) shopping websites, such as Amazon; (3) image
search engines, such as Google Images and Bing Images. The pro-
portion of images in our initial set of photos collected from the
above three sources is about 5 : 4 : 1. As is well known, specular
highlights are often produced on the surfaces of shiny materials.
With this in mind, we downloaded a total of 44,800 images from the
above websites by searching objects whose materials could be iron,
silver, snack, nail polish, book, paint, cobblestone, wine glass, and old
Figure 3: Three example situations in labeling. (a)(c)(e) Input local patches. (b)(d)(f) Labeled highlight masks of (a)(c)(e) using the threshold, matting, and free select tools respectively.
wooden table, etc. To facilitate later annotations and ensure the quality
of the dataset, we initially pruned the image list, discarding
images that are: (i) ≤ 0.5 megapixels, or (ii) ≥ 20 MB in size
(to control our disk footprint). After this process, we had an image
set of 38,163 images.
Step 2: Filtering images. Further, we excluded images that are:
(i) unnatural-looking, such as synthetic, distorted, blurred, and
noisy photos, or (ii) with very few or even no obvious highlights
(i.e., typically fewer than three highlight regions). For this task,
each image was shown to five workers. We found that the workers
are very fast and reliable at this task. The image list was thus
finally reduced to 4,310 images.
Step 3: Data annotation. This task is to annotate specular highlight
regions to produce binary highlight masks, which is similar to
the salient object annotation and instance segmentation annotation
tasks. However, compared with salient objects and instances
in a scene, specular highlights have particularly different characteristics,
namely various sizes, a possibly large quantity, fuzzy
contours, and even a messy distribution. Therefore, the annotation
pipelines of those tasks are not suitable for our task. Instead of
the LabelMe tool [43], we adopted the more powerful Photoshop
software with the versatile tools (free select, pencil, threshold, mat-
ting, smooth zoom, and undo/redo etc.) we need. These tools were
especially important to enable workers to quickly produce more
accurate masks even for small highlight regions. For this task, each
photo is annotated by one worker. The workers were shown several
examples of good and bad annotation masks. Several example
images of annotated local patches (not the whole input photos)
in our dataset are presented in Figure 3.
Step 4: Voting for annotation quality. We made an additional
task where human subjects vote on the quality of each annotated
mask. Then we used these results to determine the quality of each
mask. We consider an annotated mask as high quality only if there
exists a certain amount of consensus. Five voters vote on the quality
of each highlight mask. To check the quality of annotation masks,
similar to [6], we use the CUBAM learning-based algorithm [45]
which uses a model of noisy user behavior for binary tasks (in our
case, voting for the quality of annotated highlight mask) to extract
better results. This method partly models the bias of each worker
based on how often they agree with the other workers. We then
ran CUBAM on the resulting votes, and discarded highlight masks
with a CUBAM score below a given threshold. Afterward, we
selected some bad highlight mask examples from the discarded masks
to show the workers so that they can avoid similar problematic
annotations in subsequent annotation rounds. We kept repeating
this process until all annotated highlight masks met our requirements.
3.2 Dataset analysis
To provide a thorough understanding of our dataset, we show a
series of statistical analysis on our proposed dataset.
Highlight number distribution. First, we group connected
highlight pixels and count the number of separate highlights per
image in our dataset. To avoid the effect of noisy labels, we discard
highlight regions with fewer than 25 pixels, at the cost of undesirably
excluding "true" separate highlights of extremely small size.
Figure 4(a) shows the resulting statistics. On the
whole, the number of separate connected regions in our dataset is
considerably large.
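The grouping-and-counting step above can be sketched as follows; the use of SciPy's connected-component labeling (and its default 4-connectivity) is our assumption, since the paper does not specify the implementation:

```python
import numpy as np
from scipy import ndimage

def count_highlight_regions(mask, min_pixels=25):
    """Count separate highlight regions in a binary mask, discarding
    connected components smaller than min_pixels (noisy labels)."""
    labeled, num = ndimage.label(mask > 0)  # label connected components
    if num == 0:
        return 0
    sizes = np.bincount(labeled.ravel())[1:]  # pixels per component (skip background)
    return int(np.sum(sizes >= min_pixels))
```

For example, a mask with one 10×10 blob and one 3×3 blob yields one region at the paper's 25-pixel cutoff, but two regions if the cutoff is lowered.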
Highlight area proportion. Next, we identify the proportion
of pixels occupied by highlights in each image. Figure 4(b) presents
the histogram plots of highlight area proportion for our dataset.
We can see that most images in our dataset have relatively small
highlight areas.
Color contrast distribution. Real-world highlights are often
high-intensity and texture-less, which means that the color contrast
between highlight and non-highlight regions is often high. We adopt the χ²
distance to measure the contrast between the color histograms of
the highlight and non-highlight regions in each image. Figure 4(c)
plots the color contrast distribution in our dataset, where a high
value on the horizontal axis means high color contrast. Nevertheless,
this does not mean that locating highlights in our dataset is easy,
since the non-highlight region may well contain an equal or even
greater amount of bright material surfaces or light sources.
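A minimal sketch of the per-image χ² color-contrast measure; the bin count and the per-channel summation are assumptions, as the paper does not give these details:

```python
import numpy as np

def chi2_color_contrast(img, mask, bins=16):
    """Chi-squared distance between normalized color histograms of the
    highlight (mask > 0) and non-highlight (mask == 0) pixels,
    computed per RGB channel and summed."""
    dist = 0.0
    for c in range(3):
        h1, _ = np.histogram(img[..., c][mask > 0], bins=bins, range=(0, 256))
        h2, _ = np.histogram(img[..., c][mask == 0], bins=bins, range=(0, 256))
        h1 = h1 / max(h1.sum(), 1)   # normalize to probability histograms
        h2 = h2 / max(h2.sum(), 1)
        denom = h1 + h2
        denom[denom == 0] = 1.0      # avoid division by zero in empty bins
        dist += 0.5 * np.sum((h1 - h2) ** 2 / denom)
    return dist
```

A bright highlight region against a dark background produces a value near the maximum (1 per channel), matching the intuition that highlight/non-highlight color contrast is high.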
Highlight location distribution. To analyze the spatial distri-
bution of highlights in our dataset, we resize all highlight masks to
512 × 512, and sum them up to obtain a probability map to show
how likely each pixel is to belong to a specular highlight. Figure 4 plots
the highlight location distribution of our dataset. As can be seen,
highlights tend to cluster around the lower center of the image,
since objects are usually placed approximately at human eye level.
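The resize-and-average computation of the location probability map can be sketched as below; the nearest-neighbour resize keeps the sketch dependency-free and is our simplification:

```python
import numpy as np

def highlight_location_map(masks, size=512):
    """Average binary masks (resized to size x size) into a per-pixel
    probability map of highlight occurrence."""
    prob = np.zeros((size, size), dtype=np.float64)
    for m in masks:
        # nearest-neighbour resize via index sampling
        ys = np.arange(size) * m.shape[0] // size
        xs = np.arange(size) * m.shape[1] // size
        prob += (m[np.ix_(ys, xs)] > 0).astype(np.float64)
    return prob / max(len(masks), 1)
```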
Scene categories. In our dataset, indoor scenes have a large
proportion, since specular and shiny materials are highly common
in indoor scenes. Specifically, our dataset mostly includes scenes
from the following categories: living room, office, supermarket,
kitchen, dining room, and museum.
Figure 4: Statistics of the proposed specular highlight detec-
tion dataset. We show its distributions including highlight
numbers, area proportion, color contrast, and location.
Material categories. Our dataset contains various specular and
shiny materials on which specular highlights easily appear, such
as metal, plastic (clear), plastic (opaque), rubber/latex, glass, wood,
painted, porcelain, stone, leather, wax, marble, and tile.
4 PROPOSED METHOD
We observe that specular highlights in the real world often have
high intensity and are rather sparse in distribution, which leads
to a drastic change in the appearance of objects in the scene. This
observation inspires us to utilize the contrast between the high-
light and non-highlight regions. To accomplish this, we introduce
a context contrasted feature module (CCFM) to extract contrast
features for highlight detection. Based on it, we further design a
multi-scale context contrasted feature module (MCCFM) to har-
vest contrasted features to help locate highlights of various scales.
In addition, we exploit a binary cross-entropy (BCE) loss and an
intersection-over-union edge (IoUE) loss for our network.
4.1 Overview
Figure 5 illustrates the proposed specular highlight detection net-
work (SHDNet). It takes a single image as input and outputs the
highlight map in an end-to-end way. We take FPN [29] as the backbone.
SHDNet uses a convolutional neural network to generate five
feature maps with varying spatial resolutions. Features at shallow
CNN layers tend to capture fine details in both highlight regions
and many non-highlight regions, due to a lack of high-level semantic
information, while features at deep CNN layers suppress most
non-highlight regions but lack highlight-region detail, due to their
larger receptive fields compared with shallow layers. We upsample the low-resolution feature maps
at deep layers and sum them with the adjacent feature maps at
shallow layers to construct the feature pyramid. It can make full
use of the complementary information among features with differ-
ent spatial resolutions. Then, we embed an MCCFM (see Figure 6)
to generate context contrasted features for each layer. Next, con-
catenating these contrast features with convolutional features, we
build the context contrasted feature pyramid. Finally, with supervision
signals applied at each layer, we obtain a highlight map at
each layer, and treat the highlight map with the highest resolution
as the final highlight mask.
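The upsample-and-sum construction of the feature pyramid can be sketched as below; matching channel counts across levels (normally handled by FPN's lateral 1×1 convolutions) are assumed here for brevity:

```python
import torch
import torch.nn.functional as F

def build_feature_pyramid(feats):
    """FPN-style top-down pathway sketch: upsample each deeper feature map
    and add it to the adjacent shallower map.
    `feats` is ordered shallow -> deep; channel counts are assumed equal."""
    pyramid = [feats[-1]]  # start from the deepest (coarsest) map
    for f in reversed(feats[:-1]):
        up = F.interpolate(pyramid[0], size=f.shape[-2:],
                           mode="bilinear", align_corners=False)
        pyramid.insert(0, f + up)  # fuse with the adjacent shallower map
    return pyramid  # shallow -> deep, fused
```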
4.2 Multi-scale Context Contrasted Feature
Figure 6 demonstrates the architecture of the proposed MCCFM.
Given the input features extracted by the feature extraction net-
work, the purpose of MCCFM is to provide multi-scale context
contrasted features for locating specular highlights.
Figure 5: The schematic illustration of the proposed specular highlight detection network (SHDNet).
Figure 6: The schematic illustration of the proposed multi-scale context contrasted feature module (MCCFM). The MCCFM consists of four chained CCFMs. In each CCFM (gray dashed box), we compute context contrasted features between local information and its surrounding contexts. Then we concatenate the resulting context contrasted features as the output.
CCFM. The high intensity and sparseness are the distinctive
characteristics of specular highlight, which make it significantly
stand out from its surrounding non-highlight regions. To capture
this kind of high-contrast information, the desired context contrasted
features should be uniform inside the highlight region and
inside the non-highlight region, but at the same time differ between
the two. To this end, we design the CCFM to learn context contrasted
features between local information (generated by standard
convolutions) and its surrounding contexts (generated by average
pooling with different kernel sizes), expressed as
C = F − Up(T(AvgPool(F; s))), (1)
where F denotes the input features; C denotes the resulting context
contrasted features; Up denotes bilinear interpolation that up-samples
the contextual features to the same size as F; T denotes a
convolutional layer with kernel size 1 that combines the context
features across channels without changing their dimensions; and s
is the kernel size of the average pooling (AvgPool).
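Equation (1) maps directly onto a small PyTorch module; treating T as a single 1×1 convolution follows the text, while the names and channel handling are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CCFM(nn.Module):
    """Context contrasted feature module, a sketch of Eq. (1):
    C = F - Up(T(AvgPool(F; s)))."""
    def __init__(self, channels, pool_size):
        super().__init__()
        self.pool_size = pool_size
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)  # T in Eq. (1)

    def forward(self, feat):
        ctx = F.avg_pool2d(feat, self.pool_size)        # AvgPool(F; s)
        ctx = self.proj(ctx)                            # T(...)
        ctx = F.interpolate(ctx, size=feat.shape[-2:],  # Up(...)
                            mode="bilinear", align_corners=False)
        return feat - ctx                               # local minus context
```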
MCCFM. To make the highlight detection problem tractable, previous
methods [26, 50] typically assume that specular highlights are
small in size. However, we observe that specular highlights in the real
world can actually take up a large portion of an object (e.g., a
specularity spanning an entire large object in the scene, as shown in
Figures 1 and 9), for example due to the view direction or a large
light source. Based on the CCFM, we further propose a multi-scale context
contrasted feature module (MCCFM) to help capture specularities
of various scales, by chaining four CCFMs. Figure 6 illustrates the
scheme of the proposed MCCFM. Here, we empirically set the kernel
sizes of the average pooling operations to 2×2, 4×4, 6×6, and 8×8, so that
long-range spatial contrast can be captured. Then, the contrasted
features produced by the CCFMs are concatenated to yield feature
maps that identify the dividing edges.
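A sketch of the MCCFM with the four pooling kernels above; running the four contrast branches in parallel on a shared input and concatenating, rather than strictly chaining them, is a simplifying assumption of this sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MCCFM(nn.Module):
    """Multi-scale context contrasted feature module sketch: one contrast
    branch per pooling kernel size (2, 4, 6, 8), outputs concatenated."""
    def __init__(self, channels, pool_sizes=(2, 4, 6, 8)):
        super().__init__()
        self.pool_sizes = pool_sizes
        self.projs = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=1) for _ in pool_sizes)

    def forward(self, feat):
        outs = []
        for s, proj in zip(self.pool_sizes, self.projs):
            ctx = F.interpolate(proj(F.avg_pool2d(feat, s)),
                                size=feat.shape[-2:], mode="bilinear",
                                align_corners=False)
            outs.append(feat - ctx)  # contrasted features at scale s
        return torch.cat(outs, dim=1)  # concatenate multi-scale contrasts
```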
4.3 Loss function
Our total loss is defined as L = Σ_k w_k L_k, where L_k is the loss of the
k-th side output, K is the total number of outputs, and w_k is the
weighting parameter of each loss. As illustrated in Section 4, our
SHDNet is supervised with five side paths, i.e., K = 5. To obtain
high-quality highlight detection, we define L_k as L_k = L_k^BCE + L_k^IoUE,
where L_k^BCE and L_k^IoUE denote the binary cross-entropy
(BCE) loss and the intersection-over-union edge (IoUE) loss, respectively.
BCE loss. The cross-entropy loss is commonly adopted in salient
object detection and semantic segmentation problems. Similarly,
we adopt this loss for our specularity detection task, written as
L_k^BCE = − Σ_p (G_p log(H_p) + (1 − G_p) log(1 − H_p)), (2)

where p denotes a pixel location; H_p is the predicted probability of
pixel p being specular highlight; and G_p ∈ {0, 1} is the ground-truth
label of pixel p.
IoUE loss. We also introduce an IoUE loss, written as

L_k^IoUE = Σ_j (1 − 2|C_j ∩ Ĉ_j| / (|C_j| + |Ĉ_j|)), (3)

where j denotes a highlight region; C_j is the gradient magnitude
of the predicted highlight map of region j, while Ĉ_j is the gradient
magnitude of the ground-truth highlight map of region j, both estimated
by a Sobel operator. Here the intersection is estimated using a
pixel-wise multiplication operator.
The IoUE loss helps locate highlights, resulting in highlight
masks with clearer contours/edges. The reason is two-fold.
First, the IoUE loss leverages edge information, which is relatively
insensitive to the scale differences of highlight regions. Second,
since the training set includes highlight regions with diverse
sizes, the edges of both large highlight regions and very thin
(or tiny) regions are incorporated into the network training. By
doing so, our network with the IoUE loss can locate highlights
of different sizes, as demonstrated by the detection
results of our method. See the ablation study in Section 5.4 for more
details.
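Equation (3) can be sketched as follows; computing the loss over the whole map instead of per region j, and clamping the Sobel magnitudes to [0, 1] so the Dice-style ratio stays bounded, are simplifications not specified by the paper:

```python
import torch
import torch.nn.functional as F

def sobel_magnitude(x):
    """Gradient magnitude of an (N,1,H,W) map via fixed Sobel kernels,
    clamped to [0, 1] (a simplification for boundedness)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=x.device).view(1, 1, 3, 3)
    gx = F.conv2d(x, kx, padding=1)
    gy = F.conv2d(x, kx.transpose(-1, -2), padding=1)
    return torch.clamp(torch.sqrt(gx ** 2 + gy ** 2), max=1.0)

def iou_edge_loss(pred, gt, eps=1e-8):
    """IoUE sketch over the whole map: 1 - 2|C ∩ Ĉ| / (|C| + |Ĉ|),
    with the intersection as a pixel-wise product of edge magnitudes."""
    c_pred = sobel_magnitude(pred)
    c_gt = sobel_magnitude(gt)
    inter = (c_pred * c_gt).sum()
    return 1.0 - 2.0 * inter / (c_pred.sum() + c_gt.sum() + eps)
```

A perfect binary prediction yields a loss near 0, while a prediction with no edges at all yields a loss near 1.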
(a) Input (b) [41] (c) [32] (d) [50] (e) [26]
(f) [10] (g) [21] (h) [51] (i) Ours (j) Ground truth
Figure 7: Visual comparison of our method against recent state-of-the-art detection methods on the testing set of our dataset.
4.4 Implementation details
To ensure that the training set and testing set of our dataset have
basically matching distributions, we randomly select 3,017 images as the training set,
and the other 1,293 images as the testing set. We train our model on the
training set of our dataset. We have implemented the SHDNet in
PyTorch on a PC equipped with Intel(R) Xeon(R) CPU E5-2630 v4 @
2.20GHz, and NVIDIA GeForce GTX 1080Ti. The Adam optimizer
[23] is used to train our network. The hyper-parameters are set as follows:
learning rate = 5e−5, weight decay = 0.0005, momentum = 0.9,
batch size = 8, and loss weight for each side output w = 1. In
addition, we train the network for 100 epochs and divide the learning
rate by 10 after 20 epochs. We do not validate the network on
the dataset during training. Training the network to convergence
takes about 80 hours on our GPU. During inference, we can
obtain a predicted specular highlight map and its corresponding
highlight edge map. In our method, we use the highlight map with
the highest resolution as the final highlight mask.
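The training schedule above can be sketched as follows; the model stand-in is hypothetical, and we read "momentum = 0.9" as Adam's β₁ (which is already PyTorch's default) and the one-time learning-rate drop as a MultiStepLR milestone:

```python
import torch

# Hypothetical stand-in for SHDNet; only the optimizer settings matter here.
model = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)

# Paper's hyper-parameters: lr = 5e-5, weight decay = 0.0005.
# Adam's default betas (0.9, 0.999) already match the stated momentum of 0.9.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5, weight_decay=0.0005)

# Train 100 epochs; divide the learning rate by 10 once, after epoch 20.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20], gamma=0.1)
```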
5 EXPERIMENTS
5.1 Experimental settings
Dataset and evaluation metrics. We have evaluated the performance
of all detection methods on our dataset. Note that most
existing datasets contain a very small number of images and only
allow subjective evaluation due to lack of ground truth annotations.
In contrast, the test set of our dataset covers a broad range of im-
ages with annotations, and allows both qualitative and quantitative
evaluations.
Following the common metrics in the semantic segmentation, saliency
detection, and shadow detection communities, we adopt
four commonly used metrics, including MAE, mean F-measure,
max F-measure, and S-measure [1, 8, 14], to evaluate the performance
of our method and the compared detection methods.
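MAE and the F-measure at a single threshold can be sketched as below; the threshold of 0.5 and β² = 0.3 are common choices in the saliency literature, not values stated in this paper:

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a predicted map and a mask, both in [0, 1]."""
    return float(np.mean(np.abs(pred - gt)))

def f_measure(pred, gt, thresh=0.5, beta2=0.3):
    """F-measure at one binarisation threshold; beta^2 = 0.3 is a common
    (assumed) weighting favouring precision."""
    b = pred >= thresh
    tp = float(np.sum(b & (gt > 0)))              # true positives
    prec = tp / max(b.sum(), 1)                   # precision
    rec = tp / max((gt > 0).sum(), 1)             # recall
    return (1 + beta2) * prec * rec / max(beta2 * prec + rec, 1e-8)
```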
Data augmentation. To improve the generalization ability
of our method to handle colored highlights, we perform data
augmentation on the training set of our dataset. First, we use
the fast Fourier color constancy method proposed by Barron et al.
[3] to restore non-white-balanced images to white balance for
subsequent data augmentation. Then, we generate more training
images by mixing each white-balanced image with varying colored
illumination. We express this process as I^c_p = I_p ∗ C, where I is an
input image; I^c is an augmented image; p denotes a pixel location; ∗
denotes element-wise multiplication; and C = [R, G, B] is the mixed
illumination color vector (three random values between 0 and 1).
For simplicity, we assume that there is only one dominant light
source with a basically constant color in a scene. For each image,
10 augmented images are produced.
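The mixing step I^c_p = I_p ∗ C can be sketched as follows; operating on 8-bit images normalized to [0, 1] is an assumption of this sketch:

```python
import numpy as np

def augment_colored_illumination(img, rng=None):
    """Mix a white-balanced uint8 image with a random illumination colour:
    I^c_p = I_p * C, with C = [R, G, B] drawn uniformly from (0, 1)."""
    rng = rng or np.random.default_rng()
    c = rng.uniform(0.0, 1.0, size=3)           # dominant light colour C
    out = img.astype(np.float64) / 255.0 * c    # element-wise, per channel
    return (out * 255.0).astype(np.uint8), c
```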
Compared methods. For comparison, we choose four state-of-the-art highlight detection methods [26, 32, 41, 50].
In addition, we also choose several state-of-the-art learning-based
segmentation and detection methods from related fields, including
DSS [19] in the saliency detection community, Deeplab v3+ [10] in
the semantic segmentation community, DSC [21, 22] and DSD [51]
in the shadow detection community.
Fairness setting. Since none of the compared methods were
given the opportunity to train on our dataset, we tried our best to
make the evaluation fair by testing various parameters for each
method. We varied the window size n for [41], the inner dimension
of factorization for [50] and the two weighting parameters α and
β for [26] on our augmented testing set. Note that [32] has
no adjustable parameters. In addition, we re-trained DSS,
Deeplab v3+, DSC, and DSD on our training set, and set the input
image size to 512 × 512. For these four learning-based methods,
we tuned their many parameters to produce results that are as good
as possible. To measure the final quantitative result,
we use leave-one-out cross-validation (namely, test each image
using the best parameters for all other images).
5.2 Comparison on our testing set
Quantitative comparison. Table 1 reports the quantitative com-
parison results on our testing set. As can be seen, our method
achieves the best quantitative results on all four metrics including
S-measure, mean F-measure (meanF), max F-measure (maxF), and
MAE. Notice that the methods of Meslouhi et al. [32] and Li et al. [26]
achieve unsatisfactory results on our testing set, since they rely on the
strict premise that the illumination is white-balanced.
Visual comparison. In Figure 7, we provide some visual com-
parison results. The results show that our method can produce
more accurate results than other state-of-the-art highlight detection
methods. We found that the four traditional methods [26, 32, 41, 50]
often mistakenly detect the very bright material areas and non-
specular lighting as the specular highlights, e.g., the white product
labels on the four bottles (1st group of results), and the white text
textures as well as the white dots of the carpet (2nd group of re-
sults). The reason is that these methods are based on the analysis or
statistics of the pixel intensities and fail to semantically distinguish
between highlight pixels and achromatic pixels (i.e., very bright
material surfaces and text textures). In addition, the compared deep
learning-based semantic segmentation method [10], salient object
detection method [19], and shadow detection methods [21, 51] are
likely to have the ability to exclude very bright material colors.
Table 1: Quantitative comparison results on our testing set.
↑ and ↓ mean that larger and smaller values are better, respectively.
Metrics [41] [32] [50] [26] [19] [10] [21] [51] Ours
S-m. ↑ 0.132 0.125 0.521 0.201 0.491 0.619 0.412 0.480 0.793
meanF ↑ 0.027 0.030 0.410 0.223 0.218 0.451 0.108 0.202 0.676
maxF ↑ 0.031 0.033 0.383 0.301 0.382 0.501 0.114 0.490 0.736
MAE ↓ 0.423 0.478 0.021 0.383 0.053 0.019 0.091 0.049 0.006
Table 2: User study results.
Methods [41] [32] [50] [26] [19] [10] [21] [51] Ours
PoV 1.13% 1.81% 2.41% 2.82% 4.19% 18.2% 1.18% 8.01% 60.25%
Table 3: Ablation study. “basic” denotes our network without
the MCCFM module and the proposed losses.
Networks S-m.↑ meanF↑ maxF↑ MAE↓
basic + BCE loss 0.638 0.508 0.612 0.013
basic + hybrid loss 0.680 0.609 0.645 0.012
basic + CCFM + BCE loss 0.691 0.623 0.649 0.012
basic + CCFM + hybrid loss 0.762 0.652 0.703 0.009
basic + MCCFM + BCE loss 0.789 0.661 0.721 0.008
basic + MCCFM + hybrid loss 0.793 0.676 0.736 0.006
However, these methods fail to locate tiny highlight details, and
undesirably treat the diffuse pixels around highlights as highlights.
This is because the shape and appearance of specular highlights in
our task differ greatly from those of objects and shadows in their
tasks. In comparison, our method effectively locates highlights of
different sizes while excluding bright material surfaces.
5.3 User study
We further conduct a user study to compare the results. To this end,
we first download an extra 500 test images from Flickr by searching
with keywords such as study room, restaurant, grocery store, and
reception (Figure 8 presents an example). These real-world images of
whole scenes in the wild are quite challenging, with heavy texture,
noise, colored lighting, achromatic surfaces, etc. Then, we use our
method and the compared methods to produce highlight masks
for each test image, and recruit 100 participants from a school
campus to rate each group of results, which are presented in random
order to avoid subjective bias.
For each result, the participants are asked to give a rating on
a Likert scale from 1 (worst) to 5 (best). After all participants
have finished the survey, for each method we compute the
percentage of votes (PoV) as the final result by
$\mathrm{PoV} = \frac{1}{N}\sum_{i}\sum_{j} V_{i,j} \times 100\%$,
where $i$ denotes the $i$-th group of results, $j$ denotes the
$j$-th participant, $V_{i,j}$ denotes the number of votes on the
$i$-th group of results given by the $j$-th participant, and $N$ is
the total voting score. Table 2 reports the results of the user study,
in which our method substantially outperformed the other compared
detection methods. The visual comparison results on a randomly
selected image are shown in Figure 8.
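The PoV statistic can be computed as in this toy sketch, assuming, as defined above, that the nested structure holds per-participant scores and that N is the total voting score over all methods:

```python
def percentage_of_votes(votes, total_score):
    """votes[i][j]: score given by participant j to this method on result group i.
    total_score: N, the total voting score accumulated over all methods."""
    return sum(sum(row) for row in votes) / total_score * 100.0
```

Because every vote is counted against the same total N, the PoV values of all compared methods sum to 100%.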
Figure 8: Visual comparison with state-of-the-art detection methods on a real-world photo in the wild. (a) Input. (b) [41]. (c) [32]. (d) [50]. (e) [26]. (f) [19]. (g) [10]. (h) [21]. (i) [51]. (j) Ours.
Figure 9: Ablation study for our MCCFM. (a) Input. (b) Basic.
(c) Basic + CCFM + hybrid loss. (d) Basic + MCCFM + BCE
loss. (e) Basic + MCCFM + hybrid loss. (f) Ground truth.
Figure 10: Sensitivity to colored lighting (rows: Input, [50], Ours). Zoom in for detail.
5.4 Discussions
Ablation study. We perform experiments to evaluate the perfor-
mance of the proposed MCCFM, as well as the BCE loss and IoUE
loss. The quantitative comparison results are shown in Table 3. As
can be seen, our SHDNet with the hybrid loss (i.e., the BCE loss
plus the IoUE loss) obtains better results than with the BCE loss
alone. As shown in Figure 9(d)(e), our SHDNet with the hybrid loss
produces highlight masks with clearer contours/edges, compared
to omitting the IoUE loss. Moreover, Table 3 also shows that our
network with MCCFM (4 CCFMs) improves performance more
than with only one CCFM. As shown in Figure 9, our SHDNet with
MCCFM learns contrasted features more effectively, accurately
locating highlights while avoiding residual non-highlight details
in the highlight maps, compared to using only one CCFM.
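The hybrid loss can be sketched roughly as BCE plus a soft-IoU term. This is a simplification, not the paper's exact formulation: the IoUE loss additionally emphasizes highlight edges, which this sketch omits, and `smooth` is an assumed stabilizing constant.

```python
import numpy as np

def bce_loss(pred, gt, eps=1e-7):
    # Binary cross-entropy between a sigmoid output and a binary mask.
    pred = np.clip(pred, eps, 1 - eps)
    return float(-(gt * np.log(pred) + (1 - gt) * np.log(1 - pred)).mean())

def iou_loss(pred, gt, smooth=1.0):
    # Soft intersection-over-union loss on continuous predictions.
    inter = (pred * gt).sum()
    union = pred.sum() + gt.sum() - inter
    return float(1.0 - (inter + smooth) / (union + smooth))

def hybrid_loss(pred, gt):
    # BCE handles per-pixel classification; the IoU term penalizes
    # region-level mismatch, which tends to sharpen mask boundaries.
    return bce_loss(pred, gt) + iou_loss(pred, gt)
```

Intuitively, BCE alone treats every pixel independently, while the IoU term couples all pixels of a highlight region, which is consistent with the clearer contours observed in Figure 9.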
Robustness to colored lighting. Our method is robust to col-
ored lighting, mainly due to our data augmentation scheme. The
visual comparison results in Figure 10 show that the traditional
methods fail to detect colored highlights well, since they depend
heavily on the premise of white-balanced illumination. In contrast,
our method effectively overcomes this issue.
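One plausible form of such a colored-lighting augmentation, a guess at the scheme rather than the paper's exact recipe, is to scale each RGB channel of a training image by an independent random gain:

```python
import numpy as np

def random_color_cast(img, rng, low=0.6, high=1.4):
    """Simulate non-white-balanced illumination on a float image in [0, 1]
    by applying a random per-channel gain (illustrative parameter range)."""
    gains = rng.uniform(low, high, size=3)  # one gain per RGB channel
    return np.clip(img * gains, 0.0, 1.0)
```

Training on such casts discourages the network from assuming that highlights are achromatic, which is exactly the white-balance premise the traditional methods rely on.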
6 CONCLUSION
This paper presented a large-scale specularity dataset consisting of
4,310 annotated real-world images for specular highlight detection.
In addition, by analyzing the shape and appearance of specular
highlights, we designed a novel deep convolutional neural network,
called SHDNet, that leverages multi-scale context contrasted features
for specular highlight detection. Our method achieves superior
performance over existing methods on our collected dataset and
on many challenging images in the wild from the Internet. We hope
that our dataset will benefit many multimedia applications, and
enable more researchers to tackle the highlight detection task. In the
future, we would like to extend our method to specular highlight
detection in videos, and to use our method as a pre-processing step
for specularity removal [17], intrinsic decomposition [16, 18],
exposure correction [44, 48, 49], and so on.
ACKNOWLEDGMENTS
We would like to thank Qimeng Zhang, Jianlin Zhu, Yiwen Liu,
Zihao Wang, Ting Zhang, Xinyue Zhang, Zhuoxu Huang, Yizhe Lv,
and Jiapan Wang for their invaluable help in building the dataset.
We also thank the ACM MM reviewers for their feedback. This
work was partly supported by the Key Technological Innovation
Projects of Hubei Province (2018AAA062), NSFC (Nos. 61672390,
61972298, and 61902275), and the National Key Research and
Development Program of China (2017YFB1002600).
REFERENCES
[1] Radhakrishna Achanta, Sheila Hemami, Francisco Estrada, and Sabine Susstrunk. 2009. Frequency-tuned salient region detection. In CVPR. 1597–1604.
[2] Ruzena Bajcsy, Sang Wook Lee, and Aleš Leonardis. 1990. Color image segmentation with detection of highlights and local illumination induced by inter-reflections. In CVPR. 785–790.
[3] Jonathan T. Barron and Yun-Ta Tsai. 2017. Fast Fourier Color Constancy. In CVPR. 6950–6958.
[4] Shida Beigpour and Joost van de Weijer. 2011. Object recoloring based on intrinsic image estimation. In ICCV. 327–334.
[5] Sean Bell, Kavita Bala, and Noah Snavely. 2014. Intrinsic Images in the Wild. ACM Transactions on Graphics 33, 4 (2014), 159.
[6] Sean Bell, Paul Upchurch, Noah Snavely, and Kavita Bala. 2013. OpenSurfaces: A richly annotated catalog of surface appearance. ACM Transactions on Graphics 32, 4 (2013), 1–17.
[7] Sai Bi, Xiaoguang Han, and Yizhou Yu. 2015. An L1 Image Transform for Edge-Preserving Smoothing and Scene-Level Intrinsic Decomposition. ACM Transactions on Graphics 34, 4 (2015), 78.
[8] Ali Borji, Ming-Ming Cheng, Huaizu Jiang, and Jia Li. 2015. Salient Object Detection: A Benchmark. IEEE Transactions on Image Processing 24, 12 (2015), 5706–5722.
[9] Gavin Brelstaff and Andrew Blake. 1988. Detecting Specular Reflections Using Lambertian Constraints. In ICCV. 297–302.
[10] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. 2018. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In ECCV. 833–851.
[11] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV. 801–818.
[12] François Chollet. 2017. Xception: Deep learning with depthwise separable convolutions. In CVPR. 1251–1258.
[13] Henghui Ding, Xudong Jiang, Bing Shuai, Ai Qun Liu, and Gang Wang. 2018. Context Contrasted Feature and Gated Multi-Scale Aggregation for Scene Segmentation. In CVPR. 2393–2402.
[14] Deng-Ping Fan, Ming-Ming Cheng, Yun Liu, Tao Li, and Ali Borji. 2017. Structure-measure: A new way to evaluate foreground maps. In ICCV. 4548–4557.
[15] Rogerio Feris, Ramesh Raskar, Kar-Han Tan, and Matthew Turk. 2006. Specular Highlights Detection and Reduction With Multi-Flash Photography. Journal of the Brazilian Computer Society 12, 1 (2006), 35–42.
[16] Gang Fu, Lian Duan, and Chunxia Xiao. 2019. A Hybrid L2-Lp Variational Model for Single Low-light Image Enhancement with Bright Channel Prior. In ICIP. 1925–1929.
[17] Gang Fu, Qing Zhang, Chengfang Song, Qifeng Lin, and Chunxia Xiao. 2019. Specular Highlight Removal for Real-world Images. Computer Graphics Forum 7 (2019), 253–263.
[18] Gang Fu, Qing Zhang, and Chunxia Xiao. 2019. Towards high-quality intrinsic images in the wild. In ICME. 175–180.
[19] Qibin Hou, Ming-Ming Cheng, Xiaowei Hu, Ali Borji, Zhuowen Tu, and Philip H. S. Torr. 2019. Deeply Supervised Salient Object Detection With Short Connections. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 4 (2019), 815–828.
[20] Patrik O. Hoyer. 2004. Non-Negative Matrix Factorization With Sparseness Constraints. Journal of Machine Learning Research 5 (2004), 1457–1469.
[21] Xiaowei Hu, Chi-Wing Fu, Lei Zhu, Jing Qin, and Pheng-Ann Heng. 2019. Direction-Aware Spatial Context Features for Shadow Detection and Removal. IEEE Transactions on Pattern Analysis and Machine Intelligence (2019). To appear.
[22] Xiaowei Hu, Lei Zhu, Chi-Wing Fu, Jing Qin, and Pheng-Ann Heng. 2018. Direction-aware spatial context features for shadow detection. In CVPR. 7454–7462.
[23] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR.
[24] Gudrun J. Klinker, Steven A. Shafer, and Takeo Kanade. 1988. The Measurement of Highlights in Color Images. International Journal of Computer Vision 2, 1 (1988), 7–32.
[25] Hsien-Che Lee. 1986. Method for Computing the Scene-Illuminant Chromaticity From Specular Highlights. JOSA A 3, 10 (1986), 1694–1699.
[26] Ranyang Li, Junjun Pan, Yaqing Si, Bin Yan, Yong Hu, and Hong Qin. 2019. Specular Reflections Removal for Endoscopic Image Sequences With Adaptive-RPCA Decomposition. IEEE Transactions on Medical Imaging 39, 2 (2019), 328–340.
[27] Xin Li, Fan Yang, Hong Cheng, Wei Liu, and Dinggang Shen. 2018. Contour knowledge transfer for salient object detection. In ECCV. 355–370.
[28] Peiwen Lin, Peng Sun, Guangliang Cheng, Sirui Xie, Xi Li, and Jianping Shi. 2020. Graph-Guided Architecture Search for Real-Time Semantic Segmentation. In CVPR. 4203–4212.
[29] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. 2017. Feature pyramid networks for object detection. In CVPR. 2117–2125.
[30] Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In CVPR. 3431–3440.
[31] Laurence T. Maloney and Brian A. Wandell. 1986. Color Constancy: A Method for Recovering Surface Spectral Reflectance. JOSA A 3, 1 (1986), 29–33.
[32] Othmane Meslouhi, Mustapha Kardouchi, Hakim Allali, Taoufiq Gadi, and Yassir Benkaddour. 2011. Automatic Detection and Inpainting of Specular Reflections for Colposcopic Images. Open Computer Science 1, 3 (2011), 341–354.
[33] Aaron Netz and Margarita Osadchy. 2012. Recognition Using Specular Highlights. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 3 (2012), 639–652.
[34] Francisco Ortiz, Fernando Torres Medina, and Pablo Gil. 2005. A Comparative Study of Highlights Detection and Elimination by Color Morphology and Polar Color Models. In Pattern Recognition and Image Analysis. 295–302.
[35] Margarita Osadchy, David W. Jacobs, and Ravi Ramamoorthi. 2003. Using Specularities for Recognition. In ICCV. 1512–1519.
[36] Jae Byung Park and Avinash C. Kak. 2003. A truncated least squares approach to the detection of specular highlights in color images. In The IEEE International Conference on Robotics and Automation. 1397–1403.
[37] Véronique Prinet, Michael Werman, and Dani Lischinski. 2013. Specular highlight enhancement from video sequences. In ICIP. 558–562.
[38] Xuebin Qin, Zichen Zhang, Chenyang Huang, Chao Gao, Masood Dehghan, and Martin Jagersand. 2019. BASNet: Boundary-Aware Salient Object Detection. In CVPR. 7479–7489.
[39] Minjung Son, Yunjin Lee, and Hyun Sung Chang. 2020. Toward Specular Removal From Natural Images Based on Statistical Reflection Models. IEEE Transactions on Image Processing 29 (2020), 4204–4218.
[40] Michael W. Tao, Jong-Chyi Su, Ting-Chun Wang, Jitendra Malik, and Ravi Ramamoorthi. 2016. Depth Estimation and Specular Removal for Glossy Surfaces Using Point and Line Consistency With Light-Field Cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 6 (2016), 1155–1169.
[41] Stephane Tchoulack, J. M. Pierre Langlois, and Farida Cheriet. 2008. A video stream processor for real-time detection and correction of specular reflections in endoscopic images. In Joint International IEEE Northeast Workshop on Circuits and Systems and TAISA Conference. 49–52.
[42] Qing Tian and James J. Clark. 2013. Real-Time Specularity Detection Using Unnormalized Wiener Entropy. In The International Conference on Computer and Robot Vision. 356–363.
[43] Kentaro Wada. 2016. labelme: Image Polygonal Annotation with Python. https://github.com/wkentaro/labelme.
[44] Ruixing Wang, Qing Zhang, Chi-Wing Fu, Xiaoyong Shen, Wei-Shi Zheng, and Jiaya Jia. 2019. Underexposed photo enhancement using deep illumination estimation. In CVPR. 6849–6857.
[45] Peter Welinder, Steve Branson, Pietro Perona, and Serge J. Belongie. 2010. The multidimensional wisdom of crowds. In NIPS. 2424–2432.
[46] Yanchao Yang and Stefano Soatto. 2020. FDA: Fourier Domain Adaptation for Semantic Segmentation. In CVPR. 4085–4095.
[47] Ling Zhang, Qingan Yan, Zheng Liu, Hua Zou, and Chunxia Xiao. 2017. Illumination Decomposition for Photograph With Multiple Light Sources. IEEE Transactions on Image Processing 26, 9 (2017), 4114–4127.
[48] Qing Zhang, Yongwei Nie, and Wei-Shi Zheng. 2019. Dual Illumination Estimation for Robust Exposure Correction. Computer Graphics Forum 38, 7 (2019), 243–252.
[49] Qing Zhang, Ganzhao Yuan, Chunxia Xiao, Lei Zhu, and Wei-Shi Zheng. 2018. High-quality exposure correction of underexposed photos. In ACM MM. 582–590.
[50] Wuming Zhang, Xi Zhao, Jean-Marie Morvan, and Liming Chen. 2018. Improving Shadow Suppression for Illumination Robust Face Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 3 (2018), 611–624.
[51] Quanlong Zheng, Xiaotian Qiao, Ying Cao, and Rynson W. H. Lau. 2019. Distraction-aware shadow detection. In CVPR. 5167–5176.
[52] Zhiming Luo, Akshaya Kumar Mishra, Andrew Achkar, Justin A. Eichel, Shaozi Li, and Pierre-Marc Jodoin. 2017. Non-local Deep Features for Salient Object Detection. In CVPR. 6593–6601.
[53] Lei Zhu, Zijun Deng, Xiaowei Hu, Chi-Wing Fu, Xuemiao Xu, Jing Qin, and Pheng-Ann Heng. 2018. Bidirectional Feature Pyramid Network with Recurrent Attention Residual Modules for Shadow Detection. In ECCV. 122–137.
[54] Lei Zhu, Xiaowei Hu, Chi-Wing Fu, Jing Qin, and Pheng-Ann Heng. 2018. Saliency-Aware Texture Smoothing. IEEE Transactions on Visualization and Computer Graphics 26, 7 (2018), 2471–2484.