
Computer Vision and Normalizing Flow Based Defect Detection

Zijian Kuang
Department of Computing Science
University of Alberta
Edmonton, Canada
Email: [email protected]

Xinran Tie
Department of Computing Science
University of Alberta
Edmonton, Canada
Email: [email protected]

Abstract—Surface defect detection is essential for controlling product quality during manufacturing. The challenges in this complex task include: 1) collecting defective samples and manually labeling them for training is time-consuming; 2) the defects’ characteristics are difficult to define, as new types of defects can appear at any time; and 3) real-world product images contain substantial background noise. In this paper, we present a two-stage defect detection network based on the object detection model YOLO [1] and the normalizing flow-based defect detection model DifferNet [2]. Our model shows high robustness and performance on defect detection using real-world video clips taken from a production line monitoring system. The normalizing flow-based anomaly detection model requires only a small number of good samples for training and then performs defect detection on the product images detected by YOLO. Our model employs two novel strategies: 1) a two-stage network that uses YOLO and a normalizing flow-based model to perform product defect detection; and 2) multi-scale image transformations that address the background noise contained in product images cropped by YOLO. In addition, extensive experiments are conducted on a new dataset collected from a real-world factory production line. We demonstrate that our proposed model can learn from a small number of defect-free samples of single or multiple product types. The dataset will also be made public to encourage further research in surface defect detection.

Index Terms—Surface defect detection, Normalizing flow, Computer vision, Visual inspection, Deep Neural Network, YOLO

I. INTRODUCTION

Surface defects have a significant impact on the quality of industrial products. Small defects need to be carefully and reliably detected during monitoring. It is crucial to ensure that defective products are noticed at an early stage, which prevents damage to a company’s reputation and additional financial loss. In recent research, surface defect detection has been increasingly studied and has improved quality control in industry. However, surface defect detection is challenging because 1) collecting defective samples and manually labeling them for training is time-consuming; 2) the defects’ characteristics are difficult to define, as new types of defects can appear at any time; and 3) real-world product images contain substantial background noise. These factors make the results of defect detection less reliable.

In current industry, defect types are varied, and the defects’ characteristics are difficult to define. Most existing defect datasets lack defect richness and data scale. Specifically, they are limited to a few product categories and a small number of samples. To ensure our experiments’ realism and applicability, we introduce a new dataset collected from a real-world production line monitoring system. This dataset includes 21 video clips and 1634 images of multiple types of bottles, with both good and defective samples. The bottle videos are taken from assembly line footage provided by ZeroBox Inc.

In this paper, we propose a two-stage defect detection model based on object detection and normalizing flow-based defect detection. For a given product video as input, YOLO [1] performs object detection to draw bounding boxes around the products in each frame. Each product image is then cropped by our model and fed into the normalizing flow-based model for training and prediction. Since the cropped product images contain substantial background noise, multi-scale image transformations such as image cropping and rotation are also implemented in our model to ensure high robustness and performance. We also introduce a visualization model that plots the predicted bounding box for each bottle and the predicted anomaly result on every video frame. In summary, the main contributions of this paper are:

• We create a new dataset that includes various types of bottles collected from a real-world production line monitoring system.

• We propose a two-stage defect detection model based on object detection and normalizing flow-based defect detection, together with a visualization model that plots the predicted bounding box for each bottle and the predicted anomaly result, using a quality control inspection video as input.

• We propose an image transformation model that rotates each bottle image and crops its edges to remove the background noise surrounding the bottle.

• Extensive experiments on the new dataset demonstrate the proposed model’s effectiveness and our dataset’s practicability.

arXiv:2012.06737v1 [cs.CV] 12 Dec 2020

II. RELATED WORK

A. Normalizing Flows

Normalizing Flows (NF) are networks that can generate complex distributions by transforming a probability density from a latent space through a series of invertible affine transformations, called ‘flows’ [3]. Based on the change-of-variables rule, the bijective mappings between feature space and latent space can be evaluated in both directions. The formula is as follows:

$$P_G(x_i) = \pi(z_i)\,\left|\det\left(J_{G^{-1}}\right)\right| \tag{1}$$

$P_G(x_i)$ refers to the complex distribution generated from the learned normal distribution $\pi(z_i)$. The magnitude of the Jacobian determinant indicates how much the transformation locally stretches or squishes the area. This correction factor is necessary for the density obtained from $\pi(z)$ to remain properly normalized.
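As a minimal sketch of Eq. (1), assume a hypothetical one-dimensional flow $G(z) = az + b$ with a standard normal latent density. The function names and the parameters `a` and `b` are illustrative assumptions and do not come from the paper. The density obtained through the change-of-variables rule should then agree with the closed-form normal density with mean $b$ and standard deviation $|a|$.

```python
import math

def standard_normal_pdf(z):
    """Latent density pi(z) of a standard normal distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def flow_density(x, a=2.0, b=1.0):
    """Density of x = G(z) = a*z + b via the change-of-variables rule.

    a and b are hypothetical flow parameters chosen for illustration.
    """
    z = (x - b) / a          # z = G^{-1}(x)
    jac_inv = abs(1.0 / a)   # |det J_{G^{-1}}| in one dimension
    return standard_normal_pdf(z) * jac_inv

def gaussian_pdf(x, mean, std):
    """Closed-form normal density, used only as a reference."""
    u = (x - mean) / std
    return math.exp(-0.5 * u * u) / (std * math.sqrt(2.0 * math.pi))
```

Without the Jacobian factor, the transformed density in this sketch would integrate to $|a|$ rather than to one.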

To ensure that the mapping is invertible and tractable and that the Jacobian is easy to compute, the coupling layer was introduced by L. Dinh et al. in 2017 [4]. Given a $D$-dimensional input $x$ and $d < D$, the output $y$ of an affine coupling layer follows the equations:

$$y_{1:d} = x_{1:d} \tag{2}$$

$$y_{d+1:D} = x_{d+1:D} \odot \exp\left(s\left(x_{1:d}\right)\right) + t\left(x_{1:d}\right) \tag{3}$$

In the coupling layer, the input is split into two halves. The first half $x_{1:d}$ is copied directly to the output $y_{1:d}$. The second half $x_{d+1:D}$ goes through an affine transformation to generate the output $y_{d+1:D}$, where $s$ and $t$ stand for scale and translation.

After a number of affine coupling transformations, the complex distribution is transformed into a simple normal distribution. Conversely, the inverse of this flow can generate the complex distribution from the learned normal distribution.
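The coupling layer of Eqs. (2) and (3) can be sketched in a few lines of Python. In this sketch, `s` and `t` are hypothetical fixed functions standing in for the learned scale and translation networks, and all function names are illustrative. The point is that the layer can be inverted without inverting `s` or `t`, and that its log-determinant is simply the sum of the scale outputs.

```python
import math

def s(x1, out_dim):
    """Hypothetical stand-in for the learned scale function s."""
    return [math.tanh(0.5 * sum(x1) + k) for k in range(out_dim)]

def t(x1, out_dim):
    """Hypothetical stand-in for the learned translation function t."""
    return [0.1 * sum(x1) - k for k in range(out_dim)]

def coupling_forward(x, d):
    """Eqs. (2)-(3): copy x[:d], affinely transform x[d:]."""
    x1, x2 = x[:d], x[d:]
    scale, shift = s(x1, len(x2)), t(x1, len(x2))
    y2 = [v * math.exp(sc) + sh for v, sc, sh in zip(x2, scale, shift)]
    log_det = sum(scale)  # log|det J| of a coupling layer is the sum of s
    return x1 + y2, log_det

def coupling_inverse(y, d):
    """Recover x from y; s and t are only ever evaluated, never inverted."""
    y1, y2 = y[:d], y[d:]
    scale, shift = s(y1, len(y2)), t(y1, len(y2))
    x2 = [(v - sh) * math.exp(-sc) for v, sc, sh in zip(y2, scale, shift)]
    return y1 + x2
```

Because the first part passes through unchanged, the inverse can recompute exactly the same scale and shift values that the forward pass used.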

B. Semi-Supervised Defect Detection with Normalizing Flows

Our work is based on a normalizing flow model called DifferNet, proposed in 2020 by M. Rudolph et al. [2]. DifferNet uses the latent space of a normalizing flow to represent the feature distribution of normal samples. Unlike other generative models such as variational autoencoders (VAEs) and GANs, the flow-based generator provides a bijective mapping between feature space and latent space, so that every sample is assigned a likelihood. Thus a scoring function can be derived to decide whether an image contains an anomaly. As a result, most common samples have a high likelihood, while uncommon images have a lower likelihood. Since DifferNet requires only good product images as the training dataset, defects are not present during training. Therefore, defective products are assigned a lower likelihood, which can be easily detected by the scoring function [2].

DifferNet implements the coupling layers proposed in Real-NVP [4]. The structure of each coupling block is shown in Fig. 1.

Fig. 1: The coupling layers proposed in Real-NVP [4]

The coupling layer $f_{NF}$ splits the input data into $y_{in,1}$ and $y_{in,2}$ and then applies a series of affine transformations that regress multiplicative (scaling function $s$) and additive (translation function $t$) components. The scale and translation operations are written as follows [2]:

$$y_{out,2} = y_{in,2} \odot e^{s_1\left(y_{in,1}\right)} + t_1\left(y_{in,1}\right) \tag{4}$$

$$y_{out,1} = y_{in,1} \odot e^{s_2\left(y_{out,2}\right)} + t_2\left(y_{out,2}\right) \tag{5}$$

The exponential function is applied to the output of the functions $s$ to ensure non-zero coefficients. The product symbol in Eqs. (4) and (5) denotes element-wise multiplication. The functions $s$ and $t$ can be any differentiable functions. In DifferNet, they are implemented as fully connected networks that generate their outputs from the input values [2].
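As a short worked step that is not spelled out in the text, the coupling block of Eqs. (4) and (5) can be inverted in closed form by undoing the two updates in reverse order. The log-determinant in the last line follows from the triangular structure of each update and is stated here as a standard property of affine coupling layers rather than as a result quoted from [2].

```latex
\[
\begin{aligned}
y_{in,1} &= \left(y_{out,1} - t_2\left(y_{out,2}\right)\right) \odot e^{-s_2\left(y_{out,2}\right)}, \\
y_{in,2} &= \left(y_{out,2} - t_1\left(y_{in,1}\right)\right) \odot e^{-s_1\left(y_{in,1}\right)}, \\
\log\left|\det \frac{\partial y_{out}}{\partial y_{in}}\right|
  &= \sum_j s_1\left(y_{in,1}\right)_j + \sum_k s_2\left(y_{out,2}\right)_k .
\end{aligned}
\]
```

The first line must be evaluated first, since the second line needs the recovered $y_{in,1}$ as the argument of $s_1$ and $t_1$.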

This model aims to find the probability distribution $p_Z(z)$ in the latent space $Z$ that maximizes the likelihood of the extracted features $y$ [2]. Applying the change-of-variables formula and taking the logarithm of both sides gives the log-likelihood on which the loss function is based:

$$\log p_Y(y) = \log p_Z(z) + \log\left|\det\frac{\partial z}{\partial y}\right| \tag{6}$$

A scoring function $\tau(x)$ uses the likelihoods to classify a sample as defective or normal. Rotations and manipulations of brightness and contrast are applied to the input image, and the negative log-likelihoods are averaged over these transformations to obtain an anomaly score. The formula is defined as follows [2]:

$$\tau(x) = \mathbb{E}_{T_i \in \mathcal{T}}\left[-\log p_Z\left(f_{NF}\left(f_{ex}\left(T_i(x)\right)\right)\right)\right] \tag{7}$$

The anomaly score is then compared with a threshold value $\theta$, which is learned during training. A sample is classified as anomalous when $A(x) = 1$ and as a good product when $A(x) = 0$ [2].

$$A(x) = \begin{cases} 1 & \text{for } \tau(x) \geq \theta \\ 0 & \text{for } \tau(x) < \theta \end{cases} \tag{8}$$
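The scoring rule of Eqs. (7) and (8) can be sketched as follows. This is an illustrative outline, not the DifferNet implementation: it assumes a standard normal latent density, and `f_ex`, `f_nf`, and the list of transformations are hypothetical callables supplied by the caller.

```python
import math

def neg_log_likelihood(z):
    """-log p_Z(z) for a latent vector z under a standard normal density."""
    d = len(z)
    return 0.5 * sum(v * v for v in z) + 0.5 * d * math.log(2.0 * math.pi)

def anomaly_score(x, transforms, f_ex, f_nf):
    """Eq. (7): average negative log-likelihood over the transformations T_i."""
    scores = [neg_log_likelihood(f_nf(f_ex(T(x)))) for T in transforms]
    return sum(scores) / len(scores)

def classify(score, theta):
    """Eq. (8): 1 marks an anomaly, 0 marks a good product."""
    return 1 if score >= theta else 0
```

With identity functions as stand-ins for `f_ex` and `f_nf`, an input close to the origin receives a low score and an input far from it receives a high one, which is the behavior the threshold $\theta$ relies on.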

C. You Only Look Once: Unified, Real-Time Object Detection

In 2016, J. Redmon et al. introduced YOLO, a unified model for object detection. It reframes object detection as a regression problem over spatially separated bounding boxes and their associated class probabilities [1]. A single convolutional neural network predicts both bounding boxes and class probabilities. Given an image as input, the system first divides the image into an $S \times S$ grid. Each cell predicts $B$ bounding boxes and their corresponding confidence scores. The confidence score is defined as $\Pr(\text{Object}) \cdot \text{IOU}^{\text{truth}}_{\text{pred}}$, where IOU is the intersection over union between the ground truth and the predicted bounding box. The conditional class probabilities are then multiplied by the confidence score of each bounding box to obtain class-specific confidence scores:

$$\Pr(\text{Class}_i \mid \text{Object}) \cdot \Pr(\text{Object}) \cdot \text{IOU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i) \cdot \text{IOU}^{\text{truth}}_{\text{pred}}$$

YOLO is extremely fast, reasons globally about the image, and learns a more generalized representation of objects, which, according to its authors, allows it to outperform other detection methods. It achieves efficient performance in both fetching images from the camera and displaying the detections. However, YOLO struggles with small objects that appear in groups because of the strong spatial constraints on its bounding box predictions. It also struggles to identify objects in new or unusual configurations that it has not seen during training [1].
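The two quantities in the confidence score can be made concrete with a small sketch. The corner-based box format `(x_min, y_min, x_max, y_max)` and the function names are assumptions made for this illustration and are not taken from [1].

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    ix_min, iy_min = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix_max, iy_max = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def class_confidence(p_class_given_object, p_object, iou_value):
    """Pr(Class_i | Object) * Pr(Object) * IOU, the class-specific score."""
    return p_class_given_object * p_object * iou_value
```

For example, two 2x2 boxes offset by one unit in each direction overlap in a unit square, so their IOU is 1 / (4 + 4 - 1) = 1/7.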

D. Improving Unsupervised Defect Segmentation by Applying Structural Similarity To Autoencoders

Convolutional autoencoders have become a popular approach for unsupervised defect segmentation of images. In [5], a model is proposed that uses the structural similarity (SSIM) metric with an autoencoder to capture the inter-dependencies between local regions of an image. This model is trained exclusively on defect-free images and is able to segment defective regions in an image after training [5].

The autoencoder in the proposed model attempts to reconstruct an input image precisely after passing it through a bottleneck, effectively projecting the input image into a latent space. To prevent the model from simply copying the input image, the latent space dimension is much smaller than the input image’s dimension. For a given input image $x$, the overall process is summarized as [5]:

$$\hat{x} = D(E(x)) = D(z) \tag{9}$$

Here $D$ is the decoder function, $E$ is the encoder function, and $z$ denotes the latent-space representation. If the autoencoder encounters images that were not seen in training, i.e., samples with defects, it will fail to reconstruct them [5].

SSIM is a distance measure designed to capture the similarity between two images. It is less sensitive to edge alignment and considers luminance, contrast, and structural information at the same time. Given patches $p$ and $q$ from two images, the SSIM index compares the patches on these three statistical features and is summarized as:

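$$\text{SSIM}(p, q) = \frac{\left(2\mu_p\mu_q + c_1\right)\left(2\sigma_{pq} + c_2\right)}{\left(\mu_p^2 + \mu_q^2 + c_1\right)\left(\sigma_p^2 + \sigma_q^2 + c_2\right)} \tag{10}$$

where $\mu_p$ and $\mu_q$ are the patches’ mean intensities, $\sigma_p^2$ and $\sigma_q^2$ denote the patches’ variances, $\sigma_{pq}$ is their covariance, and $c_1$ and $c_2$ are constants that stabilize the division. The advantage of using the autoencoder with the SSIM index is that it makes the model less sensitive to localization inaccuracies in the reconstruction and boosts performance on real-world datasets [5].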
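A minimal sketch of the SSIM index for a single pair of patches is given below. Patches are represented as flat lists of intensities, and the default values of `c1` and `c2` are illustrative assumptions rather than the constants used in [5].

```python
def ssim(p, q, c1=0.01, c2=0.03):
    """SSIM index of two equally sized patches given as flat intensity lists.

    c1 and c2 are small stabilizing constants; the defaults are
    illustrative assumptions, not values taken from the cited paper.
    """
    n = len(p)
    mu_p, mu_q = sum(p) / n, sum(q) / n
    var_p = sum((a - mu_p) ** 2 for a in p) / n
    var_q = sum((b - mu_q) ** 2 for b in q) / n
    cov_pq = sum((a - mu_p) * (b - mu_q) for a, b in zip(p, q)) / n
    numerator = (2 * mu_p * mu_q + c1) * (2 * cov_pq + c2)
    denominator = (mu_p ** 2 + mu_q ** 2 + c1) * (var_p + var_q + c2)
    return numerator / denominator
```

Comparing a patch with itself yields an index of 1, while a patch whose structure is inverted yields a much lower value because the covariance term becomes negative.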

E. Segmentation-Based Deep-Learning Approach for Surface-Defect Detection

The work in [6] proposes a segmentation-based deep-learning architecture for the specific domain of surface crack detection, building on the success that deep-learning methods have achieved in quality control. The model is trained with a small number of samples, approximately 25-30 of which are defective [6].

In that work, a deep convolutional network is constructed based on a two-stage architecture. The first stage contains a segmentation network that performs pixel-wise localization of defects. It focuses on detecting small defects in a large image, which requires a large receptive field. The second stage is a decision network built on top of the segmentation network, where binary image classification is performed. It ensures that the model captures not only local shapes but also global ones. The performance of the model has been demonstrated on the specific task of crack detection. Moreover, the network architecture can be applied to new domains with multiple complex surfaces and other defect types [6].

III. PROPOSED METHOD

In this paper, object detection is performed by YOLO to detect and draw bounding boxes around the products in each frame. After comparing recent defect detection methods, we focus on one of the state-of-the-art normalizing flow-based models, DifferNet, to perform defect detection [2]. However, because the product images detected and cropped by YOLO usually contain substantial background noise, we propose an improved normalizing flow-based model with an additional image transformation layer that removes the background noise. A visualization model is also introduced to plot the predicted bounding box for each bottle and the predicted anomaly result on every video frame. Fig. 2 shows an overview of our proposed model.

A. Proposed model

1) Our model takes video clips of bottle products as input and uses YOLO to detect and draw a bounding box around each bottle in each frame.

2) A data extraction model crops the bottle images based on the bounding boxes drawn by YOLO. Both the cropped bottle images and the original video frames are saved into separate folders.

3) An image transformation model then rotates each bottle image and crops its edges to remove the background noise surrounding the bottle.

4) The processed bottle images are passed into the normalizing flow-based defect detection model, which maps them to a normal distribution through maximum likelihood training.

5) After the model is trained, a scoring function uses the likelihoods to classify the input sample as defective or normal. We also create a visualization model to plot both the bounding box and the anomaly prediction onto the original input video frames.

Fig. 2: Overview of our proposed model

B. Novel Combination of YOLO and improved DifferNet

YOLO is a state-of-the-art model that is extremely fast, reasons globally, and learns a more generalized representation of objects, which, according to its authors, allows it to outperform other detection methods. The model is constructed with twenty-four convolutional layers and two fully connected layers. Given an image as input, the system first divides the image into an $S \times S$ grid. Each cell predicts $B$ bounding boxes and their corresponding confidence scores. Each bounding box drawn by YOLO carries the object class, the center coordinates, and the height and width of the box [1]. We use YOLO to perform object detection on video clips collected from a real-world production line monitoring system. The bottle images cropped based on the bounding boxes are then passed into our improved DifferNet for training and prediction.
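The cropping step can be sketched as below. This sketch assumes that the center coordinates, width, and height are normalized to the range [0, 1] relative to the frame size, which is a common YOLO label convention but is not stated in the text; the function names and the list-of-rows image representation are likewise illustrative.

```python
def yolo_box_to_pixels(cx, cy, w, h, img_w, img_h):
    """Convert a normalized (center x, center y, width, height) box to
    integer pixel corners (left, top, right, bottom), clipped to the image."""
    left = max(0, int(round((cx - w / 2) * img_w)))
    top = max(0, int(round((cy - h / 2) * img_h)))
    right = min(img_w, int(round((cx + w / 2) * img_w)))
    bottom = min(img_h, int(round((cy + h / 2) * img_h)))
    return left, top, right, bottom

def crop(image, box):
    """Crop a frame stored as a list of pixel rows."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]
```

Clipping to the frame boundaries keeps the crop valid when a predicted box extends slightly past the edge of the image.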

DifferNet is a state-of-the-art model that uses the latent space of a normalizing flow to represent the feature distribution of normal samples. Unlike other generative models such as variational autoencoders (VAEs) and GANs, the flow-based generator provides a bijective mapping between feature space and latent space, so that every sample is assigned a likelihood [2].

To improve the performance of DifferNet on the output images from YOLO, we propose an image transformation model that rotates and crops each bottle image. In training, bottle images are cropped at various scales to reduce background noise interference. Moreover, the range of rotation for input images is reduced from 360 degrees to 10 or 20 degrees for better computing performance.
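A minimal sketch of the two transformations is shown below, assuming images stored as lists of pixel rows. The default crop fractions mirror one of the settings reported in the experiments, and a rotation range of 20 degrees is interpreted here as angles between -10 and +10 degrees. The function names are hypothetical, and the rotation itself is left to an image library; only the angle sampling is shown.

```python
import random

def crop_edges(image, top=0.10, bottom=0.10, left=0.05, right=0.05):
    """Remove a fraction of rows and columns from each edge of an image
    stored as a list of pixel rows."""
    height, width = len(image), len(image[0])
    r0, r1 = int(height * top), height - int(height * bottom)
    c0, c1 = int(width * left), width - int(width * right)
    return [row[c0:c1] for row in image[r0:r1]]

def sample_rotation(angle_range=20.0, rng=random):
    """Draw an angle uniformly from [-angle_range/2, +angle_range/2]."""
    return rng.uniform(-angle_range / 2.0, angle_range / 2.0)
```

Restricting the sampled angles to a narrow range keeps the augmented images close to the upright orientation that bottles have on the production line.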

The transformed images are then fed into a pre-trained AlexNet to extract features. The extracted feature maps are passed into the normalizing flow coupling layers, which map them to a normal distribution through maximum likelihood training. DifferNet uses the negative log-likelihood loss $L(y)$ to obtain a minimization problem [2]:

$$\log p_Y(y) = \log p_Z(z) + \log\left|\det\frac{\partial z}{\partial y}\right| \tag{11}$$

$$L(y) = \frac{\|z\|_2^2}{2} - \log\left|\det\frac{\partial z}{\partial y}\right| \tag{12}$$
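The step from Eq. (11) to Eq. (12) can be made explicit under the assumption that $p_Z$ is a standard normal density in $d$ dimensions, where $d$ is a symbol introduced here for the latent dimension. Negating the log-likelihood and dropping the term that does not depend on the model parameters gives the loss:

```latex
\[
\begin{aligned}
\log p_Z(z) &= -\frac{\|z\|_2^2}{2} - \frac{d}{2}\log(2\pi), \\
-\log p_Y(y) &= \frac{\|z\|_2^2}{2} - \log\left|\det\frac{\partial z}{\partial y}\right| + \frac{d}{2}\log(2\pi), \\
L(y) &= \frac{\|z\|_2^2}{2} - \log\left|\det\frac{\partial z}{\partial y}\right| .
\end{aligned}
\]
```

Minimizing $L(y)$ is therefore equivalent to maximizing the likelihood of the extracted features.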

To classify whether an input image is anomalous, DifferNet uses a scoring function that calculates the average of the negative log-likelihoods over multiple transformations $T_i(x)$ of an image $x$:

$$\tau(x) = \mathbb{E}_{T_i \in \mathcal{T}}\left[-\log p_Z\left(f_{NF}\left(f_{ex}\left(T_i(x)\right)\right)\right)\right] \tag{13}$$

The result is compared with the threshold value $\theta$ to determine whether the image contains an anomaly [2].

IV. EXPERIMENTS AND RESULTS

In this section, we evaluate the proposed model on real-world videos obtained from the factory. First, we briefly introduce the dataset used in the experiments. Then, the results of the experiments are analyzed using AUROC scores and the corresponding plots. Since the complexity of the experiments primarily stems from the noisy background in the video clips, our experiments concentrate on logo-free products, grouped into single and multiple product categories.

A. Dataset

In this paper, we apply our model to a real-world defect detection problem. We created a new dataset collected from a real-world production line monitoring system. This dataset includes 21 video clips covering 20 types of bottles, with both good and defective samples. The bottle videos are taken from assembly line footage provided by ZeroBox Inc. YOLO detection and cropping generate 1381 images of good bottles and 253 images of defective bottles. Examples of defective and defect-free samples can be seen in Fig. 3.

Since our normalizing flow-based model is semi-supervised, it requires only about 200 good sample images to learn how to represent a good sample’s complex distribution with a simple normal distribution. In our experiments, we use only 200 good sample images for training, and all remaining sample images are used as the test dataset.
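The resulting split sizes can be worked out from the reported counts. This short calculation assumes that the 200 training images are drawn from the full set of 1381 good images and that every remaining image goes to the test set; the variable names are illustrative.

```python
good_total, defective_total = 1381, 253  # image counts reported for the dataset
train_good = 200                         # good samples used for training

test_good = good_total - train_good           # good images left for testing
test_total = test_good + defective_total      # all test images
defective_share = defective_total / test_total  # fraction of defective test images
```

Under these assumptions the test set holds 1434 images, of which roughly 17.6% are defective. The individual experiments on one or several product types use subsets, so their class balance may differ.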

B. Implementation Details

(a) (b) (c) (d)

Fig. 3: Example images from the contributed dataset of bottles. Fig. 3a and Fig. 3b show examples of the original image and the cropped image of a good bottle. Fig. 3c and Fig. 3d show examples of the original image and the cropped image of a defective bottle.

The Area Under the Receiver Operating Characteristic curve (AUROC) is computed for performance evaluation. We adopt this metric because it reveals the model’s ability to discriminate between positive and negative samples. It is the area under the ROC curve, a graph that plots the true positive rate against the false positive rate at different classification thresholds. AUROC is not sensitive to the percentage of defective samples and is therefore chosen as the metric for performance evaluation. The experimental results are presented and analyzed both qualitatively and quantitatively.
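For reference, AUROC can be computed directly from anomaly scores and labels using its rank interpretation: it equals the probability that a randomly chosen defective sample receives a higher score than a randomly chosen good one. The sketch below uses this pairwise form for clarity; it is not the evaluation code used in the experiments, and the function name is illustrative.

```python
def auroc(scores, labels):
    """Probability that a randomly chosen defective sample (label 1) scores
    higher than a randomly chosen good sample (label 0); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

Because the value depends only on how positives rank against negatives, changing the proportion of defective samples does not by itself change the score.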

A. Detection on One Product Type with Image Processing Techniques

1) Experiment Result of Image Cropping: Table I and Fig. 4 present a detailed comparison of the AUROC results for detection on one product with different cropping scales. The proposed model obtains better defect detection results after the background of the bottle images is cropped. Depending on the image background, an adequate adjustment of the cropping scales that retains the defective regions yields higher AUROC scores.

| Scale of Image Cropping | AUROC [%] |
|---|---|
| Original Image | 76.9 |
| Top 10% Cropping, Bottom, Left and Right 5% Cropping | 92.2 |
| Top and Bottom 10% Cropping, Left and Right 5% Cropping | 99.2 |

TABLE I: AUROC (%) in detection on one product with image cropping

2) Experiment Result of Image Rotation: Table II and Fig. 7 display the AUROC comparison for detection on one product with random image rotation within a specific range. Since the bottles are motionless on the production line, we reduce the range of random rotation from the initial 360 degrees in DifferNet to smaller ranges. As a result, the proposed model obtains better defect detection results with a rotation angle between -5 and 5 degrees or between -10 and 10 degrees.

| Range of Image Rotation | AUROC [%] |
|---|---|
| Original Image | 76.9 |
| 10 Degrees of Image Rotation | 97.8 |
| 20 Degrees of Image Rotation | 99.6 |
| 360 Degrees of Image Rotation | 93.3 |

TABLE II: AUROC (%) in detection on one product with image rotation

B. Detection on Multiple Product Types with Image Processing Techniques

Table III and Fig. 5 present the AUROC scores for detection on three different product types. With extra cropping on the top and bottom edges, the proposed model achieves better overall performance.

| Scale of Image Cropping | AUROC [%] |
|---|---|
| Original Image | 73.2 |
| Top and Bottom 10% Cropping | 86.1 |
| Top and Bottom 15% Cropping | 93.4 |

TABLE III: AUROC (%) in detection on multiple product types with image cropping

C. Detection on All Product Types with Image Processing Techniques

Table IV and Fig. 6 show the AUROC results of the model for detection on all products. Similar to the results for single and multiple product types, cropping the images to eliminate background noise near their edges enables the model to achieve better performance.

| Scale of Image Cropping | AUROC [%] |
|---|---|
| Original Image | 88.1 |
| Top and Bottom 10% Cropping | 90.1 |
| Top and Bottom 15% Cropping | 93.5 |

TABLE IV: AUROC (%) in detection on all product types with image cropping

V. CONCLUSION

In this paper, we introduce a new dataset for bottle surface defect detection. This dataset poses several challenges regarding defect types, background noise, and dataset size. We also propose a two-stage defect detection network based on object detection and normalizing flow-based defect detection. To overcome the significant effect of background noise on both positive and negative samples, we present multi-scale image transformations. Finally, extensive experiments show that the proposed approach is robust for the detection of surface defects on bottle products. In the future, we will work on using background and foreground segmentation with an end-to-end trained mask to eliminate the background noise in images cropped by YOLO. In addition, more data samples will be collected for training and improving our proposed method.

REFERENCES

[1] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” 2016.

[2] M. Rudolph, B. Wandt, and B. Rosenhahn, “Same same but differnet: Semi-supervised defect detection with normalizing flows,” 2020.

[3] D. J. Rezende and S. Mohamed, “Variational inference with normalizing flows,” 2015.

[4] L. Dinh, J. Sohl-Dickstein, and S. Bengio, “Density estimation using real nvp,” 2017.

[5] P. Bergmann, S. Löwe, M. Fauser, D. Sattlegger, and C. Steger, “Improving unsupervised defect segmentation by applying structural similarity to autoencoders,” 2019. [Online]. Available: http://dx.doi.org/10.5220/0007364503720380

[6] D. Tabernik, S. Šela, J. Skvarč, and D. Skočaj, “Segmentation-based deep-learning approach for surface-defect detection,” Journal of Intelligent Manufacturing, vol. 31, no. 3, pp. 759–776, May 2019. [Online]. Available: http://dx.doi.org/10.1007/s10845-019-01476-x


Fig. 4: AUROC of Multi-scale Image Cropping on One Product Type. Panels: (a) Original Image; (b) Top 10% Cropping + Bottom, Left and Right 5% Cropping; (c) Top and Bottom 10% Cropping + Left and Right 5% Cropping.

Fig. 5: AUROC of Multi-scale Image Cropping on Multiple Product Types. Panels: (a) Original Image; (b) Top and Bottom 10% Cropping; (c) Top and Bottom 15% Cropping.

Fig. 6: AUROC of Multi-scale Image Cropping on All Product Types. Panels: (a) Original Image; (b) Top and Bottom 10% Cropping; (c) Top and Bottom 15% Cropping.

Fig. 7: AUROC of Multi-scale Image Rotation on One Product Type. Panels: (a) Original Image; (b) Image Rotation in angle range (-5, 5); (c) Image Rotation in angle range (-10, 10); (d) Image Rotation in angle range (-180, 180).
