
Research Article

Image Recognition Based on Multiscale Pooling Deep Convolution Neural Networks

Haitao Sang,1 Li Xiang,1 Shifeng Chen,1 Bo Chen,1 and Li Yan2

1College of Information Engineering, Lingnan Normal University, Zhanjiang 524048, China
2College of Science, Guangdong University of Petrochemical Technology, Maoming 525000, China

Correspondence should be addressed to Li Xiang; xiangl@lingnan.edu.cn

Received 5 June 2020; Revised 16 July 2020; Accepted 20 July 2020; Published 7 September 2020

Guest Editor: Kailong Liu

Copyright © 2020 Haitao Sang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Deep neural networks (DNNs) have become a research hotspot in the field of image recognition. Developing a suitable solution to introduce effective operations and layers into DNN models is of great significance for improving the performance of image and video recognition. To this end, a multiscale pooling deep convolution neural network model is designed in this paper, which makes full use of the block information of different sizes and scales in the image. No matter how large the feature map is, the multiscale sampling layer outputs three fixed-size characteristic matrices. Experimental results demonstrate that this method greatly improves the performance obtained from a single training image and is suitable for image generation, style migration, image editing, and other problems. It provides an effective solution for further industrial practice in the fields of medical imaging, remote sensing, and satellite imaging.

1. Introduction

Edge computing extends computing, network, storage, and other capabilities to the edge of the network, near the Internet of Things devices, while artificial intelligence technology represented by deep learning gives each edge computing node the ability to compute and make decisions. This allows some complex intelligent applications to be processed at the local edge, meeting the needs of agile connection, real-time data optimization, business application intelligence, security, and privacy protection. Intelligent edge computing collects and analyses data by means of the edge devices of the Internet of Things, so as to realize the flow of intelligence between the cloud and the edge. Because this places new demands on artificial intelligence algorithms, terminals, and chips, it attracts more and more artificial intelligence (AI) enterprises [1–3]. However, most current edge devices cannot support deep learning applications in a low-latency, low-power, high-precision manner because of resource constraints; for example, deep learning model inference requires a large amount of computational resources.

Deep neural networks, like many other machine learning models, operate in two stages: training and inference. The training phase learns the parameters of the model from data (for a neural network, mainly the weights of the network). The inference phase feeds new data into the model and computes the results. Overparameterization refers to the training phase: the network needs a large number of parameters to capture the fine details in the data, but once training is finished and the model reaches the inference stage, it no longer needs so many parameters. Based on this assumption, the model can be simplified before deployment. However, deep neural network models for removing a specific blur kernel and noise usually require different models to be trained for different noise levels, which not only lacks flexibility but also cannot cope with more general image recognition tasks. Therefore, how to use DNN models flexibly in image recognition tasks has important research value [4].

Although the purpose of neural networks is to solve general machine learning problems, domain knowledge also plays an important role in the design of deep models. Among the applications related to image and video, the most successful one is the deep convolution network, whose design exploits the special structure of images. Its two most important operations, convolution and pooling, come from domain knowledge related to images. How to introduce new effective operations and layers into the deep model through knowledge of the research field is of great significance for improving the performance of image and video recognition. For example, the pooling layer brings about local translation invariance, and the deformation pooling layer proposed in [5–7] can better describe the geometric deformation of various parts of an object. It can be further extended to achieve scale invariance, rotation invariance, and robustness to occlusion.

Hindawi, Complexity, Volume 2020, Article ID 6180317, 13 pages, https://doi.org/10.1155/2020/6180317

In this paper, we put forward the Multiscale Pooling Deep Convolution Neural Networks model, which aims to train a set of fast and effective schemes. Rather than learning MAP inference-guided discriminative models, we adopt a plain DNN to learn the denoisers, taking advantage of recent progress in DNNs as well as the merit of GPU computation. While providing good performance for image recognition, the learned set of multiscale pooling models is plugged into a model-based optimization method to tackle various problems.

The contributions of this work are summarized as follows:

(i) We trained a set of fast and effective Multiscale Pooling Deep Convolution Neural Networks (MSDNN) models. Drawing on the core idea of the pyramid method, its powerful multiscale expression ability is applied to the deep levels of the network.

(ii) Extensive experiments on classical feature extraction problems, including superresolution and deblurring, have demonstrated great improvement in the performance obtained from a single training image; the model can also be applied to image generation, style migration, image editing, and other problems.

2. Related Work

2.1. Feature Learning. The biggest difference between traditional pattern recognition methods and deep learning lies in the features: in deep learning they are learned automatically from big data rather than designed manually. Good features can improve the performance of a pattern recognition system [8–10]. In the past few decades, manually designed features have dominated the various applications of pattern recognition. Manual design depends mainly on the prior knowledge of the designer, so it is difficult to take advantage of big data. Because it relies on manual parameter adjustment, the number of parameters allowed in the design of a feature is very limited. Deep learning can automatically learn feature representations from big data, and these can contain thousands of parameters [11–16]. Designing effective features by hand can take five to ten years, whereas deep learning can quickly learn new and effective feature representations from training data for new applications.

A pattern recognition system includes two parts: features and a classifier. In traditional methods, the optimization of features and classifiers is separated. Under the framework of neural networks, the feature representation and the classifier are jointly optimized, which gives full play to the performance of their cooperation [17–21].

In 2012, Hinton participated in the ImageNet competition and adopted a convolution network model that contains 60 million parameters learned from millions of samples [22]. The feature representation learned from ImageNet has very strong generalization ability and can be successfully applied to other datasets and tasks, such as object detection, tracking, and inspection. Another famous competition in the field of computer vision is PASCAL VOC. However, its training set is small and is not suitable for training deep learning models. Some scholars used the feature representation learned from ImageNet for object detection on PASCAL VOC, and the detection rate increased by 20% [23].

Since feature learning is so important, what are good features? In an image, various complex factors are often combined in a nonlinear way. For example, a face image contains a variety of information such as identity, posture, age, expression, and lighting. The key to deep learning is to successfully separate these factors through multilevel nonlinear mapping; for example, in the last hidden layer of a deep model, different neurons represent different factors. If this hidden layer is taken as the feature representation, face recognition, pose estimation, expression recognition, and age estimation become very simple, because the various factors become simple linear relationships and no longer interfere with each other.

2.2. ImageNet Image Classification. The most important development of deep learning in object recognition is the image classification task in the ImageNet ILSVRC challenge. The lowest error rate of traditional computer vision methods on this test set is 26.172%. In 2012, Hinton's team used convolutional networks to reduce the error rate to 15.315%. This network structure is called AlexNet [24]. Compared with traditional convolution networks, it has three differences. First, AlexNet adopts the dropout training strategy: during training, some neurons in the input layer and the middle layers are randomly set to zero. This simulates noise interfering with the input data, so that some neurons fail to detect some visual patterns. Dropout makes the training process converge more slowly, but the resulting network model is more robust. Second, AlexNet uses the rectified linear unit (ReLU) as the nonlinear activation function. This not only greatly reduces the computational complexity but also makes the output of neurons sparse and more robust to various interferences. Third, AlexNet generates more training samples and reduces overfitting by mirroring the training images and adding random translation disturbances.
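The three ingredients described above can be illustrated with a minimal NumPy sketch. It is not AlexNet's actual implementation: the function names are hypothetical, the rescaling in `dropout` (so-called inverted dropout) is an assumed common variant, and `augment` assumes a 2-D grayscale image with zero padding at the borders.

```python
import numpy as np


def relu(x):
    """Rectified linear unit: keep positive values, set the rest to zero."""
    return np.maximum(0.0, x)


def dropout(x, drop_prob, rng, training=True):
    """Randomly zero activations during training.

    Assumption: survivors are rescaled by 1 / (1 - drop_prob) so that the
    expected activation is unchanged (inverted dropout).
    """
    if not training or drop_prob == 0.0:
        return x
    keep_mask = rng.random(x.shape) >= drop_prob
    return x * keep_mask / (1.0 - drop_prob)


def augment(image, max_shift, rng):
    """Mirror a 2-D image with probability 0.5, then shift it randomly.

    The shift is at most max_shift pixels in each direction; pixels that
    move in from outside the image are zero (an illustrative choice).
    """
    if rng.random() < 0.5:
        image = image[:, ::-1]
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    padded = np.pad(image, max_shift, mode="constant")
    h, w = image.shape
    top, left = max_shift + dy, max_shift + dx
    return padded[top:top + h, left:left + w]
```

At inference time `dropout` is called with `training=False`, so the activations pass through unchanged.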

In the ImageNet ILSVRC 2013 competition, the top 20 groups all used deep learning techniques. The winner was Rob Fergus's research group at New York University. The deep model adopted was a convolutional network with a further optimized network structure. The error rate was 11.197%, and the model is called Clarifai [25]. In the ILSVRC 2014 competition, the winner GoogLeNet [25] reduced the error rate to 6.656%. The outstanding feature of GoogLeNet is that it greatly increases the depth of the convolution network to more than 20 layers, which was unimaginable before. A deep network structure makes the backpropagation of the prediction error difficult: the prediction error is transmitted from the top layer to the bottom layer, and the error that reaches the bottom layer is very small, so it is difficult to drive the updating of the bottom-layer parameters. GoogLeNet's strategy is to add the supervisory signal directly to multiple middle layers, which means that the feature representations of the middle and bottom layers should also be able to classify the training data accurately. In the last ILSVRC competition, in 2017, the "DPN dual-channel network + basic aggregation" deep learning model proposed by the 360 Artificial Intelligence Research Institute and the National University of Singapore (NUS) team achieved the lowest localization error rates of 0.062263 and 0.061941, respectively, setting a world record. Among them, DPN-92 costs about 15% fewer parameters than ResNeXt-101, while DPN-98 costs about 26% fewer parameters than ResNeXt-101. In addition, DPN-92 consumes about 19% fewer FLOPs than ResNeXt-101, and DPN-98 consumes about 25% fewer FLOPs than ResNeXt-101. How to train deep network models effectively is still an important research topic for the future.

2.3. Face Recognition. Another important breakthrough of deep learning in object recognition is face recognition. The biggest challenge of face recognition is how to distinguish the intraclass changes caused by lighting, posture, expression, and other factors from the interclass changes caused by different identities. The distributions of these two kinds of changes are nonlinear and extremely complex and cannot be effectively distinguished by traditional linear models. The purpose of deep learning is to obtain a new feature representation through multilayer nonlinear transformations. These new features need to remove as many intraclass changes as possible while retaining interclass changes.

Face recognition includes two tasks: face verification and face identification. Face verification judges whether two face photos belong to the same person, which is a binary classification problem; the accuracy of random guessing is 50%. Face identification assigns a face image to one of N categories, which are defined by the identity of the face. This is a multiclass classification problem, which is more challenging; the difficulty increases with the number of categories, and the accuracy of random guessing is 1/N. Both tasks can learn the feature representation of faces through deep models.

Using a convolution network to learn face features, the works in [26–28] used the face verification task as the supervisory signal and obtained a recognition rate of 92.52% on LFW. Although this result is lower than that of later deep learning methods, it still exceeds most non-deep-learning algorithms. Because face verification is a binary classification problem, using it to learn face features is relatively inefficient, and it is easy to overfit the training set. Face identification is a more challenging multiclass classification problem that is less prone to overfitting, and it is more suitable for learning face features through deep models [29–31]. Moreover, in face verification, each pair of training samples is manually labelled as one of two categories, which carries little information, whereas in face identification, each training sample is labelled as one of N classes, which carries a large amount of information.

Some people think that the success of deep learning is due to the use of complex models with a large number of parameters to fit the dataset [32]. In fact, it is far from that simple. For example, the success of DeepID2+ lies in its many important and interesting characteristics [33]: its top neurons respond in a moderately sparse way, and it has strong selectivity to face identity and various face attributes and strong robustness to local occlusion. In previous studies, obtaining these attributes often required adding various explicit constraints to the model. DeepID2+ acquires these attributes automatically through large-scale learning, and the theoretical analysis behind this is worth further study.

2.4. Deep Learning for Video Analysis. The application of deep learning to video classification is still in its infancy, and there is still a lot of work to be done. The deep models learned from ImageNet can be used to describe the static image features of video; the difficulty is how to describe the dynamic features. In the past, the description of dynamic features in visual research methods often depended on optical flow estimation, key point tracking, and dynamic texture. How to embody this information in a deep model is a difficulty. The most direct way is to treat video as a three-dimensional image and apply a convolution network directly [34–36], learning three-dimensional filters at each level. But this idea obviously does not take into account the differences between the time and space dimensions. Another simple but more effective way is to compute the spatial distribution of the optical flow field or other dynamic features by preprocessing and use it as an input channel of the convolution network [37–39]. There are also research works using deep autoencoders to extract dynamic texture in a nonlinear way [38]. In recent research work [40], long short-term memory (LSTM) has attracted wide attention because it can capture long-term dependence and model complex dynamics in video.

In 2018, Ulyanov et al. proposed the Deep Image Prior (DIP) model [1], which considers the neural network structure itself to be a prior that does not require learning or pretraining on a large number of datasets; only by adaptively learning the network hyperparameters can image conversion be realized, and this is verified on the tasks of denoising, superresolution, and inpainting. After DIP was proposed, it was successively used in image decomposition-related tasks [41] and blind deconvolution [42]. Double-DIP [43] is an unsupervised layer decomposition of a single image based on coupled DIP networks, which is used for image segmentation (foreground layer and background layer), image defogging, watermark removal, and other problems. Ren et al. [42] proposed SelfDeblur, a blind deconvolution framework based on deep priors, which uses DIP and a fully convolutional network to model the latent image and the blur kernel, respectively.

Besides image processing, sound signal processing is another direction in which deep learning has been applied successfully, especially speech recognition, which many large companies are pursuing. Unlike images, sound signals are one-dimensional sequence data. Although they can be converted to a two-dimensional spectrum by frequency domain conversion algorithms such as the FFT, the two dimensions have specific meanings (the vertical axis represents frequency and the horizontal axis represents the time frame), so the spectrum cannot be processed directly as an image and requires processing methods specific to the domain. Deep neural networks can extract features automatically and jointly optimize the feature representation and the classification model. Especially after 2016, deep learning has made great progress in the field of speech recognition; for example, RNN, CRN, and other structures have made breakthroughs in speech recognition performance [44–46]. With the development of research and the updating of hardware, some end-to-end models, for example, the sequence-to-sequence model (CTC), have achieved state-of-the-art performance. As speech recognition performance continues to improve, industry has produced many applications, such as the virtual assistants Google Home, Amazon Alexa, and Microsoft Cortana.

Although deep learning has achieved great success in practice and the characteristics of deep models obtained by big data training (such as sparsity, selectivity, and robustness to occlusion [29]) are striking, the theoretical analysis behind them still needs to be completed.

3. Method

In the field of target detection, the manual extraction of characteristics has been replaced by conventional multilayer NNs, which has saved a great deal of the cost of background research. But as the number of network layers increases, the corresponding training methods become inadequate. Moreover, conventional NNs are sensitive to translation and scale changes of the target, which leads to a relatively low detection rate. At present, the CNN can realize the simulation of a large-scale neural network by the following three techniques: local receptive fields, weight sharing, and pooled sampling. Therefore, we put forward a Multiscale Pooling Deep Convolution Neural Networks method to overcome the restriction in image detection that the input image must be of a fixed size.

3.1. The Basic Structure of Convolutional Neural Network. A multilayer NN has too many parameters and is difficult to train. To overcome this limitation, the CNN emerged, which is based on the ANN and includes convolution and sampling operations. The characteristics extracted by a CNN are spatially invariant to a certain degree, which makes the CNN more suitable for target detection in an image. Although there are many kinds of CNN, the basic structure is largely the same (Figure 1): the input layer I, the convolution layer C, the sampling layer S, the output layer O, and sometimes the full-connection layer F.

The main purpose of the convolution layer is to realize local feature perception through the convolution operation, to extract features of the same kind through weight sharing, and to extract different types of features by using different convolution kernels. It is computed as follows:

$$X_j^l = f\left(\sum_{i \in M_j} X_i^{l-1} * K_{ij}^l + b_j^l\right), \tag{1}$$

where $X_j^l$ represents the convolutional output of the $j$th characteristic map of the $l$th layer, $f(\cdot)$ represents the activation function, $l$ represents the number of the layer, $K$ is the kernel, $M_j$ represents one choice of the input characteristic maps, and $b$ is an offset.
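Equation (1) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the convolution is "valid" with stride 1, the activation defaults to `tanh` (the text leaves $f$ unspecified at this point), and the names `conv2d_valid`, `conv_layer`, and `selections` (standing for the sets $M_j$) are hypothetical.

```python
import numpy as np


def conv2d_valid(x, k):
    """Valid 2-D convolution with stride 1 (the kernel is flipped)."""
    kh, kw = k.shape
    k_flipped = k[::-1, ::-1]
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(x[r:r + kh, c:c + kw] * k_flipped)
    return out


def conv_layer(inputs, kernels, biases, selections, f=np.tanh):
    """Evaluate equation (1) for every output map j.

    inputs:     list of 2-D arrays, the maps X_i of layer l-1
    kernels:    dict mapping (i, j) to the 2-D kernel K_ij
    biases:     list of scalars b_j
    selections: list where selections[j] is the index set M_j
    """
    outputs = []
    for j, m_j in enumerate(selections):
        total = sum(conv2d_valid(inputs[i], kernels[(i, j)]) for i in m_j)
        outputs.append(f(total + biases[j]))
    return outputs
```

For instance, with two 5×5 input maps, 3×3 kernels, and $M_0 = \{0, 1\}$, the single output map has size 3×3, because a valid convolution shrinks each side by the kernel size minus one.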

The sampling layer is an important layer of the convolution neural network, which samples the feature map to obtain the average or maximum value of each local region. It is computed as follows:

$$X_j^l = f\left(\beta_j^l \,\mathrm{down}\left(X_j^{l-1}\right) + b_j^l\right), \tag{2}$$

where $\mathrm{down}(\cdot)$ is a sampling function obtaining the average or maximum of a rectangular area.
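A minimal NumPy sketch of equation (2) is given below. It assumes non-overlapping square sampling windows and drops any trailing rows or columns that do not fill a whole window; both choices, the default `tanh` activation, and the function names are illustrative assumptions.

```python
import numpy as np


def down(x, n, mode="max"):
    """Sample a 2-D map over non-overlapping n x n blocks.

    Rows and columns that do not fill a complete block are dropped.
    """
    h, w = (x.shape[0] // n) * n, (x.shape[1] // n) * n
    blocks = x[:h, :w].reshape(h // n, n, w // n, n)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "mean":
        return blocks.mean(axis=(1, 3))
    raise ValueError("mode must be 'max' or 'mean'")


def sampling_layer(x, beta, bias, n=2, mode="max", f=np.tanh):
    """Evaluate equation (2) for one characteristic map."""
    return f(beta * down(x, n, mode) + bias)
```

With `n=2`, each spatial dimension of the map is halved, which is the behaviour used by the sampling layers of the network described later.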

The full-connection layer is the penultimate layer of the convolution neural network, in which the input feature maps and the output feature maps are fully connected. Its function is to reduce the dimension effectively and to facilitate the classification processing of the output layer. It is computed as follows:

$$X_j^l = f\left(\sum_{i=1}^{N_{\mathrm{in}}} \alpha_{ij}\left(X_i^{l-1} * K_i^l\right) + b_j^l\right), \tag{3}$$

where the constraints $\sum_i \alpha_{ij} = 1$ and $0 \le \alpha_{ij} \le 1$ must be satisfied.

The constraints on the variable $\alpha_{ij}$ can be enforced by expressing $\alpha_{ij}$ as a softmax function of a set of unconstrained implicit weights $c_{ij}$:

$$\alpha_{ij} = \frac{\exp\left(c_{ij}\right)}{\sum_k \exp\left(c_{kj}\right)}. \tag{4}$$
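The following short sketch shows how the softmax in equation (4) turns unconstrained weights into coefficients that satisfy both constraints. The array layout (rows indexed by $i$, columns by $j$) and the subtraction of the column maximum for numerical stability are assumptions of this illustration.

```python
import numpy as np


def softmax_weights(c):
    """Map unconstrained weights c[i, j] to alpha[i, j] as in equation (4).

    Each column j of the result is nonnegative and sums to one.
    Subtracting the column maximum does not change the result but
    avoids overflow in exp.
    """
    shifted = c - c.max(axis=0, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=0, keepdims=True)
```

Because the normalization is done separately for each column, every group of weights for a given $j$ can be updated independently of the others.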

For a given $j$, every group of weight values $c_{ij}$ is independent of those in the other groups. We can therefore omit the subscript $j$ for the sake of convenience. In other words, one needs only to consider updating one map, because the other maps are updated by the same process; the only difference is the index $j$ of the map. The details of the softmax regression process are not explained here. It should be noted that the input to the FCL needs to have a fixed dimensionality so that training with the BP algorithm is straightforward, which requires the size of the input image to be consistent with the size of the training network. This requirement is one shortcoming of the network structure. Therefore, to solve the problem that the size of the input image must be fixed, the multiscale sampling method is used.

3.2. The Multiscale Sampling

3.2.1. Shortcomings of the Conventional CNN. The conventional CNN has a fixed input window size, which restricts the size and length-to-width ratio of the input image. When the input image is of arbitrary size, one often has to first fit the image to a size suitable for the training network by clipping it or deforming it along the horizontal and vertical directions. This manual alignment cannot preserve the size and length-to-width ratio of the original image and cannot guarantee that the clipped image subblocks encompass the whole image, which leads to the loss of some important information and to geometrical distortion due to rotation and stretching of the image. When the image size changes to a large extent, the whole network has to be trained again to adapt to the new image, even if the network was already trained on images of a given input size. Otherwise, the accuracy of detecting the whole image or subblocks of other arbitrary sizes would be reduced. The fixed input dimensionality greatly restricts the development of CNNs in multiscale image detection.

The convolutional and sampling layers extract image characteristics by using sliding windows. The FCL, as the hidden intermediate between the sampling and output layers, plays a pivotal role in the whole network structure. In fact, the convolutional layers can generate characteristic maps of different sizes and ratios without requiring input images of a fixed size. By contrast, according to the definition of the CNN, the input to the FCL needs to be of a fixed size. The reason is that only the FCL has a fixed input dimensionality, which in turn fixes the dimensionality of the parameters used in training and thus allows the BP algorithm to be used for training.

To solve the problem, one can use multiscale sampling to integrate the deep characteristics generated by the preceding convolutional-sampling layers into a characteristic expression of fixed length and then send it to the FCL. In this way, one need not clip images of different sizes and length-to-width ratios before sending them to the network. At the layer preceding the FCL, namely, the last sampling layer, a multiscale sampling layer (MSL) is used to fix the size of the characteristics so that they can be input to the FCL. Figure 2 shows how the CNN structure is changed by introducing the MSL.

Now one can see that the MSL integrates the image characteristics between the convolutional-pooled layers and the FCL. The integration at deep layers is reasonable because it conforms well with bionic topology: human learning starts from the overall classification rather than from the clipped parts, so to detect a target, the brain need not first process the target to fit a fixed size. Some authors proposed inputting images of different sizes to the CNN but used images of different sizes for both training and testing; they found no significant improvement of the detection rate. For MSDNN, the training and testing are performed on the same network.

The idea of multiscale sampling was primarily inspired by the pyramid method, which preserves the spatial information of the local parts during the integration. The pioneers applied the pyramid method to all the levels of the CNN and extracted multiscale image characteristics by sliding windows. These image characteristics were then used to train or test the networks of the respective sizes. This enriched the scales of the input image and extended the training set, but training became more difficult with more network parameters, which can easily lead to overfitting. The MSL design makes it unnecessary to use different networks to train and test images of different sizes. The MSL does not simply use the pyramid method to extract the image's multiscale characteristics but applies the core idea of the pyramid method, namely, multiscale expression, to the deep layers of the network. This enables not only the training of the NN with images of different sizes, which enriches the characteristic expressions, but also the output of characteristic vectors of a fixed size.

Figure 1: The basic structure of CNN. (a) Input layer I. (b) Layer C1. (c) Layer S1. (d) Layer C2. (e) Layer S2. (f) Output.

Figure 2: The MSDNN structure (flowchart: Start; Input the image; The convolutional sampling layers; The multiscale sampling layer; The fully connected layer; The output layer; End).

3.2.2. Methods of Multiscale Sampling. Multiscale expression is applied to the deep layers of the NN. We propose to replace the normal sampling with multiscale sampling at the final sampling layer, where "multiscale" refers to the use of multiple sampling sizes and strides. Figure 3 shows that, no matter how large the previous characteristic map is, the MSL outputs three characteristic matrices with sizes 1×m, 4×m, and 9×m, where m is the number of characteristic maps of the previous convolutional layer. The three characteristic matrices are concatenated in column-major order to form a column vector with fixed dimensionality (14×m)×1. The column vector is the input to the FCL.

The sizes of the sampling windows are matched to those of the input images. That is, the sampling size and sampling stride change as the size of the input changes. In this way, the number of sampling windows, namely, the size of the output characteristic column vector, stays fixed no matter how large the input image is. This makes it easy to send the vector to the FCL. For example, if the size of the input characteristic map is r×s and we use the maximum sampling method with three sampling sizes and strides, then the computation formulas are as follows:

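$$\begin{aligned}
\text{size} &= r \times s,\ \left\lceil (r/2) \times (s/2) \right\rceil,\ \left\lceil (r/3) \times (s/3) \right\rceil, \\
\text{stride} &= r \times s,\ \left\lceil (r/2) \times (s/2) \right\rceil,\ \left\lceil (r/3) \times (s/3) \right\rceil,
\end{aligned} \tag{5}$$

$$\begin{aligned}
y_j^1 &= \max\left(x_i\right), \quad i \in R^{r \times s},\ j \in R^{1 \times 1}, \\
y_j^2 &= \max\left(x_i\right), \quad i \in R^{[(r/2) \times (s/2)]},\ j \in R^{2 \times 2},
\end{aligned} \tag{6}$$

$$y_j^3 = \max\left(x_i\right), \quad i \in R^{[(r/3) \times (s/3)]},\ j \in R^{3 \times 3}, \tag{7}$$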

where $y_j^1$, $y_j^2$, and $y_j^3$ represent the outputs of every characteristic map after the multiscale maximum sampling. These operations finally yield three output characteristic matrices with fixed sizes 1×1×m, 2×2×m, and 3×3×m, which can be converted into characteristic matrices of sizes 1×m, 4×m, and 9×m by unfolding in column-major order. Finally, these characteristic matrices are concatenated in sequence, becoming a characteristic column vector of a fixed size (14×m)×1. Take two input images of different sizes as an example (Figure 4). The sizes of the characteristic maps are 16×16 and 13×9, respectively, and the target is a 3×3 output matrix. In Figure 4(a), the sampling size is 6×6 and the sampling stride is 5×5. By (5) and (7), one takes the maximum value of each 6×6 area as the output, which finally gives a 3×3 output characteristic matrix. In Figure 4(b), the sampling size is 5×3 and the sampling stride is 4×3. By (5) and (7), one takes the maximum value of each 5×3 area as the output, which again gives a 3×3 output characteristic matrix. In both examples, the sampling size equals the side length divided by 3 and rounded up, and the stride equals the side length divided by 3 and rounded down.
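The multiscale maximum sampling described above can be sketched in NumPy as follows. This is an illustration under one explicit assumption, not the authors' code: the window size is rounded up and the stride is rounded down, which reproduces the sizes and strides quoted for the 16×16 and 13×9 examples. The function names are hypothetical.

```python
import numpy as np


def window_params(r, s, n):
    """Window size and stride for sampling an r x s map into an n x n grid.

    Assumption: size rounds up and stride rounds down, which matches the
    worked 16x16 and 13x9 examples in the text.
    """
    if r < n or s < n:
        raise ValueError("feature map is smaller than the output grid")
    size = (-(-r // n), -(-s // n))   # ceil(r / n), ceil(s / n)
    stride = (r // n, s // n)         # floor(r / n), floor(s / n)
    return size, stride


def max_pool_to_grid(x, n):
    """Max-sample one 2-D characteristic map into an n x n matrix."""
    (kh, kw), (sh, sw) = window_params(x.shape[0], x.shape[1], n)
    out = np.empty((n, n))
    for a in range(n):
        for b in range(n):
            out[a, b] = x[a * sh:a * sh + kh, b * sw:b * sw + kw].max()
    return out


def multiscale_sampling(feature_maps, levels=(1, 2, 3)):
    """Map an (m, r, s) stack of maps to a (14 * m, 1) column vector."""
    parts = []
    for n in levels:
        # Unfold each n x n output column-major into an (n * n, m) matrix.
        matrix = np.stack(
            [max_pool_to_grid(fm, n).ravel(order="F") for fm in feature_maps],
            axis=1,
        )
        parts.append(matrix.ravel(order="F"))
    return np.concatenate(parts).reshape(-1, 1)
```

Under this rounding rule the windows never leave the map, but for some sizes the last row or column may not be covered (for example, a side length of 5 with a 3×3 grid); the sizes used in this paper do not run into that case.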

3.2.3. The Fully Connected Layer with Multilayer Characteristic Expression. The FCL, as the hidden intermediate between the sampling and output layers, plays a pivotal role in the whole network structure. Here, the neurons of the FCL are simultaneously, indirectly, and fully connected to the characteristic maps of several convolutional layers, so that the FCL can receive multiscale and multilevel feature representations as input.

3.3. The Topological Structure of the Overall Network. Figure 5 shows the topological structure of the MSDNN put forward in this paper. Because of the page limit, we take a 64×64 input image as an example and briefly introduce the main parameters and implementation methods of every network layer.

The input layer: the preprocessed grayscale image $X_{\mathrm{in}}$.

The convolutional layer C1: the layer consists of 20 different characteristic maps $X_j^{C1}$. Each characteristic map is obtained by convolving the input with one of twenty 5×5 convolution kernels. The result is offset by $b_j^{C1}$ and then used as the independent variable of the activation function ReLU to obtain the characteristic map $X_j^{C1}$:

$$X_j^{C1} = \mathrm{ReLU}\left(X_{\mathrm{in}} \otimes K_{ij}^{C1} + b_j^{C1}\right) = \max\left(0,\ X_{\mathrm{in}} \otimes K_{ij}^{C1} + b_j^{C1}\right), \quad i = 1;\ j = 1, 2, \ldots, 20, \tag{8}$$

Figure 3: Illustration of the multiscale sampling (an image of unlimited size passes through the convolutional layer and multiscale sampling to produce the characteristic column vector of fixed size).

where $\otimes$ represents the convolution operation with stride 1, the activation function is defined as $\mathrm{ReLU}(x) = \max(0, x)$, and the size of the obtained $X_j^{C1}$ is 60×60.

The sampling layer S1: perform statistical computation on the results obtained by the convolutional layer C1 by using maximum value sampling. After the sampling, the horizontal and vertical spatial resolutions become half of the original resolutions. The size is 30×30.

The convolutional layer C2: by convolving with convolutional kernels of size 3×3, the 20 characteristic maps $X_j^{S1}$ are extended to 40 characteristic maps $X_j^{C2}$ of size 28×28:

$$X_j^{C2} = \max\left(0,\ \sum_i X_i^{S1} \otimes K_{ij}^{C2} + b_j^{C2}\right), \quad i = 1, 2, \ldots, 20;\ j = 1, 2, \ldots, 40. \tag{9}$$

The sampling layer S2: perform the maximum value sampling with a sampling size of 2 and without overlapping of the sampling areas. After the sampling, the horizontal and vertical spatial resolutions become half of the original resolutions. The size is 14×14.

The convolutional layer C3: by convolving with convolutional kernels of size 3×3, the 40 characteristic maps $X_j^{S2}$ are extended to 60 characteristic maps $X_j^{C3}$ of size 12×12:

$$X_j^{C3} = \max\left(0,\ \sum_i X_i^{S2} \otimes K_{ij}^{C3} + b_j^{C3}\right), \quad i = 1, 2, \ldots, 40;\ j = 1, 2, \ldots, 60. \tag{10}$$

The sampling layer S3: perform the maximum value sampling with a sampling size of 2 and without overlapping of the sampling areas. After the sampling, the horizontal and vertical spatial resolutions become half of the original resolutions. The size is 6×6.


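The convolutional layer C4: by convolving with convolutional kernels of size 3 × 3, the 60 characteristic maps $X_j^{S3}$ are extended to 80 characteristic maps $X_j^{C4}$ of size 4 × 4:

$$X_j^{C4} = \max\left(0,\; \sum_i X_i^{S3} \otimes K_{ij}^{C4} + b_j^{C4}\right), \quad i = 1, 2, \ldots, 60;\; j = 1, 2, \ldots, 80. \tag{11}$$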
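The chain of map sizes from the 64 × 64 input down to the 4 × 4 output of C4 can be checked with a short sketch. The function name `trace_sizes` and its parameters are illustrative; the sketch assumes valid stride-1 convolutions (side n − k + 1) and non-overlapping 2 × 2 sampling with floor rounding, which matters only when a side is odd and never happens for a 64 × 64 input.

```python
def trace_sizes(n, conv_kernels=(5, 3, 3, 3), pool=2):
    """Return (layer, side length) pairs for C1, S1, C2, S2, C3, S3, C4."""
    sizes = []
    for idx, k in enumerate(conv_kernels, start=1):
        n = n - k + 1                    # valid convolution, stride 1
        sizes.append((f"C{idx}", n))
        if idx < len(conv_kernels):      # C4 feeds the multiscale layer, not a plain sampler
            n = n // pool                # non-overlapping max sampling (floor is an assumption)
            sizes.append((f"S{idx}", n))
    return sizes

print(trace_sizes(64))
# [('C1', 60), ('S1', 30), ('C2', 28), ('S2', 14), ('C3', 12), ('S3', 6), ('C4', 4)]
```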

The multiscale sampling layer MS4 performs maximum value sampling with three different scales on the 80 characteristic maps $X_j^{C4}$. The three sampling sizes and strides are as follows:

$$\begin{aligned} \text{size} &= 4 \times 4,\; 2 \times 2,\; 2 \times 2, \\ \text{stride} &= 4 \times 4,\; 2 \times 2,\; 1 \times 1. \end{aligned} \tag{12}$$

This yields three output characteristic matrices of the fixed sizes 1 × 1 × 80, 2 × 2 × 80, and 3 × 3 × 80. By unfolding the matrices in column-major order, one obtains characteristic column vectors of sizes 1 × 80, 4 × 80, and 9 × 80. Finally, the vectors are concatenated in sequence to form a characteristic column vector $x^{MS4}$ with a fixed size 14 × 80 = 1120 × 1. For input images of other sizes, the sampling size and stride change adaptively, but the dimensionality of the final characteristic column vector does not change.
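A minimal sketch of such a fixed-length multiscale max sampling is given below. The bin rule (start at floor(i·h/n), end at ceil((i+1)·h/n)) is an assumption, since this section does not state the adaptive rule; it is chosen because it reproduces the sizes and strides of equation (12) on a 4 × 4 map and always yields an n × n grid. The names `adaptive_max_pool` and `multiscale_vector` are illustrative only.

```python
import math

def adaptive_max_pool(fmap, n):
    """Max-pool one 2-D map into an n x n grid, whatever its size."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for i in range(n):
        r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)   # row bin (assumed rule)
        row = []
        for j in range(n):
            c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)  # column bin
            row.append(max(fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)))
        out.append(row)
    return out

def multiscale_vector(fmaps, levels=(1, 2, 3)):
    """Concatenate the pooled grids of every map into one fixed-length vector."""
    vec = []
    for n in levels:
        for fmap in fmaps:
            grid = adaptive_max_pool(fmap, n)
            # column-major unfolding, as described in the text
            vec.extend(grid[r][c] for c in range(n) for r in range(n))
    return vec

demo = [[r * 4 + c for c in range(4)] for r in range(4)]   # one 4 x 4 map
print(adaptive_max_pool(demo, 3))  # [[5, 6, 7], [9, 10, 11], [13, 14, 15]]
print(len(multiscale_vector([demo] * 80)))  # 1120
```

With 80 maps the vector length is (1 + 4 + 9) × 80 = 1120 for any map size, for example the 16 × 16 and 13 × 9 cases of Figure 4.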

The multiscale sampling layer MS3 performs maximum value sampling with three different scales on the 60 characteristic maps $X_j^{C3}$. A characteristic column vector $x^{MS3}$ with a fixed size 14 × 60 = 840 × 1 is formed. The three sampling sizes and strides are as follows:



The multiscale sampling layer MS2 performs maximum value sampling with three different scales on the 40 characteristic maps $X_j^{C2}$. A characteristic column vector $x^{MS2}$ with a fixed size 14 × 40 = 560 × 1 is formed. The three sampling sizes and strides are as follows:



The fully connected layer FC5: the output expression column vector $x^{FC5}$ with the size 400 × 1 is obtained in the FCL from the three characteristic column vectors obtained at the respective MSLs:

Figure 4: Multiscale sampling with variable sampling size and stride. (a) 16 × 16 multiscale sampling. (b) 13 × 9 multiscale sampling.

$$x^{FC5} = \max\left(0,\; W_2 \cdot x^{MS2} + W_3 \cdot x^{MS3} + W_4 \cdot x^{MS4} + b^{FC5}\right), \tag{15}$$

where $W_2$, $W_3$, and $W_4$ are the respective weight matrices that connect the three column vectors to the FCL, and $b^{FC5}$ is the offset (bias) vector used in the FCL.
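Equation (15) can be sketched in plain Python as follows. The function name `fc5` and the toy weights are hypothetical; only the layer sizes used in the final parameter count (400 outputs fed by vectors of length 560, 840, and 1120) come from the text.

```python
def fc5(weights, inputs, bias):
    """ReLU(sum_k W_k x_k + b) for several input vectors sharing one output layer."""
    out = []
    for r, b in enumerate(bias):
        s = b
        for W, x in zip(weights, inputs):
            s += sum(w * v for w, v in zip(W[r], x))  # row r of W_k times x_k
        out.append(max(0.0, s))
    return out

# Toy sizes (hypothetical): two outputs; inputs of length 2, 3, and 1
W2 = [[1.0, 0.0], [0.0, -1.0]]
W3 = [[0.5, 0.5, 0.5], [0.0, 0.0, 0.0]]
W4 = [[2.0], [-3.0]]
y = fc5([W2, W3, W4], [[1.0, 2.0], [2.0, 2.0, 2.0], [1.0]], bias=[0.0, 1.0])
print(y)  # [6.0, 0.0]

# Parameter count for the sizes stated in the text
n_params = 400 * (560 + 840 + 1120) + 400
print(n_params)  # 1008400
```

Summing three separate products is equivalent to one weight matrix applied to the concatenation of the three vectors, which is why the parameter count depends only on the total input length of 2520.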

The output layer: a label vector $y_{output}$ is obtained from the output expression column vector $x^{FC5}$ obtained in FC5.

3.4. Methods of Training and Learning. Before the training and testing, one should first build structures according to the degree of difficulty of the detection, the actual conditions of the database, etc. In general, the larger the database and the more complex the classification, the deeper the structure required. One then needs to do some preprocessing on the database so that the training proceeds more smoothly. For small databases, one can expand the data capacity to reduce overfitting. The number of training samples can be increased by operations such as translation and random cropping.

Figure 5: The MSDNN topology graph (Input 64 × 64; C1 fmap 60 × 60; S1 fmap 30 × 30; C2 fmap 28 × 28; S2 fmap 14 × 14; C3 fmap 12 × 12; C4 fmap 4 × 4; multiscale sampling; output).

Here, the method of capacity expansion is used. The dataset is from the Internet and contains 1980 single human face images of different sizes, skin colors, directions, and positions. The same person may have different expressions; some faces are occluded, some have a mustache, some wear glasses, etc. To improve the robustness of the network and increase the diversity of the samples, we apply transformations to a part of the images in the initial human face sample set, such as horizontal flipping, ±20° displacement along the horizontal and vertical directions, ±20° rotation, and reduction of contrast. Of course, one can combine different transformations and then crop the corresponding image subblocks. In this way, tens of sample images can be obtained from a single image. A part of the human face image samples is illustrated in Figure 6.
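Some of these transformations can be sketched on a grayscale image stored as a list of rows. This is only an illustration under stated assumptions: the function names, the one-pixel shift, and the contrast factor 0.5 are hypothetical, and rotation is omitted because it needs interpolation.

```python
def hflip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]

def shift(img, dx, dy, fill=0):
    """Translate by (dx, dy) pixels; vacated pixels take the fill value."""
    h, w = len(img), len(img[0])
    return [[img[r - dy][c - dx] if 0 <= r - dy < h and 0 <= c - dx < w else fill
             for c in range(w)] for r in range(h)]

def reduce_contrast(img, factor=0.5):
    """Pull every pixel toward the mean intensity (0 < factor < 1)."""
    flat = [p for row in img for p in row]
    mean = sum(flat) / len(flat)
    return [[mean + factor * (p - mean) for p in row] for row in img]

def augment(img):
    """Combine transformations: each of {original, flipped} x {as is, shifted, low contrast}."""
    out = []
    for base in (img, hflip(img)):
        out.extend([base, shift(base, 1, 0), reduce_contrast(base)])
    return out

face = [[1, 2], [3, 4]]        # stand-in for a face image
print(len(augment(face)))      # 6
```

Combining more shifts, rotations, and crops in the same way is how tens of samples can be derived from a single image.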

The non-face image dataset can be any set of non-face images. In fact, any randomly cropped image can be used as a non-face sample. However, it is impossible to train a network that can recognize all kinds of non-face samples. One can use an iterative bootstrapping procedure to collect non-face images and iteratively retrain the system to update the training set.

The training of MSDNNs can be divided into the forward propagation stage and the backward propagation stage. They are not described here because they are similar to the BP algorithm of the conventional NN.

4. Experiments

4.1. Validity Analysis of the Multiscale Sampling. The innovation of this paper is multiscale sampling, which improves the structure of the conventional CNN for the detection of human faces. In the following, we perform experiments to compare their performance. To accelerate training and testing, we use the popular Caffe deep learning framework with GPU computation. An advantage of Caffe is that little programming is needed: once the network structure is designed, Caffe can train the NN automatically. However, currently there are no frameworks that accept input images of variable sizes. Therefore, the experiments focus on training the NN with input images of three fixed sizes.

To verify the effectiveness of multiscale sampling, we use a single-layer cascade network structure and connect it to the final multiscale layer MS4 (Figure 5). Two conditions are analyzed in the following.

4.1.1. The Training of MSDNN with a Single Input Size. Take a 64 × 64 input image as an example. The three structures are as follows:

(1) Without multiscale sampling, i.e., the ordinary CNN structure

(2) One sampling size (1 × 80 characteristic matrix)

(3) Two sampling sizes (4 × 80 and 9 × 80 characteristic matrices)

According to the final testing results obtained by averaging over 100 experiments (Table 1), MSDNN has a higher detection rate than the ordinary CNN. As far as the number of characteristics is concerned, the dimensionalities of the ordinary CNN and the MSDNN are 1280 and 1120, respectively, which demonstrates that the improvement of the detection rate by MSDNN does not require an increase in dimensionality. It is the change of the spatial arrangement that brings about the improvement in target detection.
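The two dimensionalities can be recovered from the layer sizes given in Section 3, assuming that the ordinary CNN simply flattens the 80 characteristic maps of size 4 × 4 produced by C4, while the MSDNN concatenates the 1 × 1, 2 × 2, and 3 × 3 sampled grids of the same 80 maps:

$$4 \times 4 \times 80 = 1280, \qquad \left(1^2 + 2^2 + 3^2\right) \times 80 = 14 \times 80 = 1120.$$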

4.1.2. The Training of MSDNN with Multiple Input Sizes. Because the Caffe framework can only accept input of a fixed size during one period of training, the trained network is applied to process images of arbitrary sizes in the testing stage. Taking into account the quality of the human face images in the library and the quality of the hardware, we train the network with three input image sizes: 64 × 64, 80 × 80, and 100 × 100. The images differ only in the resolution, sampling size, and sampling stride; the other properties (the content, layout, and output characteristic column vector) are the same. In this way, one can first randomly choose any one of the three network sizes (e.g., 64 × 64) to complete one training period, at the end of which all the parameters are saved. Then, one switches to the other two network sizes to continue the training until the termination condition is satisfied.

Figure 6: A part of face images.

Table 1: Detection rates of the ordinary CNN and of the MSDNN with one, two, and three sampling sizes.

Methods                 Detection rate (%)
The ordinary CNN        97.7
One sampling size       95.4
Two sampling sizes      97.5
Three sampling sizes    98.3
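The alternation between input sizes can be sketched as a simple schedule. Everything here is hypothetical scaffolding rather than the Caffe workflow used in the experiments: `train_one_period` stands for one fixed-size training period that takes the saved parameters and returns the updated ones, and a fixed number of periods replaces the unspecified termination condition.

```python
import random
from itertools import cycle, islice

def size_schedule(sizes, periods, seed=0):
    """Input size for each training period: random starting order, then cycle."""
    order = list(sizes)
    random.Random(seed).shuffle(order)
    return list(islice(cycle(order), periods))

def train(sizes=(64, 80, 100), periods=6, train_one_period=None):
    """Run the periods in turn, carrying the shared parameters from one size to the next."""
    params = {}                                        # saved after every period
    for size in size_schedule(sizes, periods):
        if train_one_period is not None:
            params = train_one_period(params, size)    # hypothetical callback
    return params

schedule = size_schedule((64, 80, 100), periods=6)
print(len(schedule))  # 6
```

Sharing one set of parameters across sizes is possible because the multiscale sampling layer always delivers a vector of the same length to the fully connected layer.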

In the experiments, the detection rate of the multiscale sampling training with multiple input sizes is higher than that with a single input size. This is because the network is trained on the characteristics of images of different sizes and is thus better suited to real detection. The final detection rates of the networks trained with one, two, and three input sizes are 98.25%, 98.46%, and 98.59%, respectively.

4.2. Comparative Experiments. The experiments follow this scheme: first, skin color is segmented from the background in the YCbCr space to remove the wide background area, so that the computation speed can be improved; then, the MSDNN classifier screens the extracted skin-color areas to determine whether a human face exists in the image and, if so, labels it.
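The first stage can be illustrated with a deliberately simple stand-in: a fixed-range threshold on Cb and Cr instead of the Gaussian mixture model used in the experiments. The conversion is the full-range BT.601 (JPEG-style) formula; the ranges 77 to 127 for Cb and 133 to 173 for Cr are commonly quoted illustrative values, not taken from this paper, and the function names are hypothetical.

```python
def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 (JPEG-style) conversion for 8-bit RGB values."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

def is_skin(r, g, b, cb_range=(77, 127), cr_range=(133, 173)):
    """Fixed-range chrominance test (illustrative thresholds, not the paper's GMM)."""
    _, cb, cr = rgb_to_ycbcr(r, g, b)
    return cb_range[0] <= cb <= cb_range[1] and cr_range[0] <= cr <= cr_range[1]

def skin_mask(image):
    """Binary mask (1 = candidate skin) for an image given as rows of (r, g, b) tuples."""
    return [[1 if is_skin(*px) else 0 for px in row] for row in image]

print(skin_mask([[(220, 170, 140), (0, 0, 255)]]))  # [[1, 0]]
```

Only the pixels marked 1 would be passed on to the MSDNN classifier, which is what reduces the computation on large background areas.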

We used MATLAB 2014 as the platform to validate the detection algorithm. The data came from two kinds of sources. The first kind was downloaded from the Internet or taken from everyday photos and includes human faces under different illumination, with different complex backgrounds, and of different sizes. The other data were human face images from the CMU database. We performed the detection and analyzed the results both qualitatively and quantitatively.

We performed the simulations using MATLAB 2014 installed on an ordinary PC with an Intel Core i5 CPU at 3.20 GHz and 8 GB of RAM. Figure 7 is an image of multiple front-view human faces obtained from the Internet. The human faces differ in size and direction. Figures 8 and 9 present the results of skin-color segmentation by the Gaussian mixture model. Because the segmentation yields only candidate skin-color regions, it was necessary to locate human faces by using MSDNN. The characteristic matrices after the three-scale sampling were 1 × 80, 4 × 80, and 9 × 80. The concatenated characteristic column vector, which was sent to the FCL, was 1120 × 1. The results are presented in Figure 10.

Figure 7: The original image.

Figure 8: The binary image after skin-color segmentation.

Figure 9: The binary image after morphological operation.

To test the validity of the new method in detecting human face images with a certain degree of rotation and with different resolutions, we performed human face detection on the segmented original image and the segmented blurred image by the conventional CNN and MSDNN methods. We used two sampling scales and obtained 1 × 80 and 4 × 80 characteristic matrices. The circled areas are the detected human faces (Figures 11 and 12).

From Figures 11 and 12, one sees that the conventional CNN detects only two human faces, while the MSDNN detects three faces: the two front-view faces and, accurately positioned, the side-view face. Even if the image is blurred and of low resolution, the MSDNN method can still locate the human faces. These results further support the methods put forward in this paper and show that multiscale information plays an important role in image detection.

Figure 10: The detection results.

Figure 11: Detection on the original image by the CNN and MSDNN methods. (a) Results by the CNN method. (b) Results by the MSDNN method.

Figure 12: Detection on the blurred image by the CNN and MSDNN methods. (a) Results by the CNN method. (b) Results by the MSDNN method.

To demonstrate the merits of the new method, we compared it with the conventional CNN, PCA, and ANN methods by performing experiments on eight face images (50 human face windows) taken from the CMU database. The detection rates and times of these methods are presented in Table 2.

From Table 2, one sees that the detection rate of the new method was higher than that of the CNN, PCA, and ANN methods. Moreover, the computation time remained moderate: it was longer than that of the CNN and PCA methods but shorter than that of the ANN method (hardware configuration: Intel Core i5-6500 CPU at 3.20 GHz × 4, 16 GB RAM, with GPU acceleration). Therefore, in terms of detection rate, the performance of the new method is better than that of the previous methods.
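The detection rates in Table 2 follow directly from the counts of detected and nondetected face windows (50 windows per method). The short check below copies the counts from Table 2; the helper name `detection_rate` is illustrative.

```python
results = {  # method: (nondetected, detected), copied from Table 2
    "CNN": (3, 47),
    "PCA": (8, 42),
    "ANN": (5, 45),
    "The present method": (2, 48),
}

def detection_rate(nondetected, detected):
    """Percentage of face windows that were detected."""
    return 100.0 * detected / (nondetected + detected)

rates = {name: detection_rate(*counts) for name, counts in results.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.0f}%")
```

With only 50 windows, one additional detected face changes the rate by 2 percentage points, which is the margin between the present method and the CNN.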

5. Conclusions

In this paper, a fast and effective MSDNN image recognition method is designed and trained. In particular, by establishing a multiscale sampling layer to replace the ordinary sampling layer in the network structure, the network structure is improved, and the generalization ability across collected samples is improved to a certain extent. The deep model is not a black box; it is closely related to traditional computer vision systems, and the various layers of the neural network are jointly learned and globally optimized, so that performance is greatly improved. Applications related to image recognition are also driving the rapid development of deep learning in network structure, layer design, and training methods. It can be expected that, in the next few years, deep learning will enter a period of rapid development in theory, algorithms, and applications.

Data Availability

All the data have been included in the paper; therefore, there are no other data available.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Competitive Allocation of Special Funds for Science and Technology Innovation Strategy in Guangdong Province of China (no. 2018A06001), Zhanjiang Science and Technology Project (Grant no. 2019B01076), Lingnan Normal University Natural Science Research Project (no. ZL2004), Science and Technology Program of Maoming (no. 2019397), and Scientific Research Foundation for Talents Introduction, Guangdong University of Petrochemical Technology (no. 519025).

References

[1] D. Ulyanov, A. Vedaldi, and V. Lempitsky, "Deep image prior," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9446–9454, Salt Lake City, UT, USA, June 2018.

[2] G. Yossi, A. Shocher, and M. Irani, "Unsupervised image decomposition via coupled deep-image-priors," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, June 2019.

[3] A. Krull, T. Buchholz, and F. Jug, "Noise2Void-learning denoising from single noisy images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2129–2137, Long Beach, CA, USA, June 2019.

[4] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221–231, 2013.

[5] K. Liu, K. Li, Q. Peng, Y. Guo, and L. Zhang, "Data-driven hybrid internal temperature estimation approach for battery thermal management," Complexity, vol. 2018, Article ID 9642892, 15 pages, 2018.

[6] J. Lehtinen and M. Jacob, "Noise2Noise: learning image restoration without clean data," in Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, July 2018.

[7] Z. Yang, K. Liu, J. Fan, Y. Guo, Q. Niu, and J. Zhang, "A novel binary/real-valued pigeon-inspired optimization for economic/environment unit commitment with renewables and plug-in vehicles," Science China Information Sciences, vol. 62, no. 7, pp. 110–112, 2019.

[8] R. Jaroensri, C. Biscarrat, M. Aittala, and F. Durand, "Generating training data for denoising real RGB images via camera pipeline simulation," 2019, http://arxiv.org/abs/1904.08825.

[9] Y. Chen and T. Pock, "Trainable nonlinear reaction diffusion: a flexible framework for fast and effective image restoration," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1256–1272, 2017.

[10] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," 2014, http://arxiv.org/abs/1406.2199.

[11] K. Liu, X. Hu, Z. Wei, Y. Li, and Y. Jiang, "Modified Gaussian process regression models for cyclic capacity prediction of lithium-ion batteries," IEEE Transactions on Transportation Electrification, vol. 5, no. 4, pp. 1225–1236, 2019.

[12] X. Tang, K. Liu, X. Wang et al., "Real-time aging trajectory prediction using a base model-oriented gradient-correction particle filter for Lithium-ion batteries," Journal of Power Sources, vol. 440, 2019.

[13] Y. Li, K. Liu, A. M. Foley et al., "Data-driven based lithium-ion battery health diagnostic and prognostic technologies: a review," Renewable and Sustainable Energy Reviews, vol. 113, 2019.

[14] K. Liu, C. Zou, K. Li, and T. Wik, "Charging pattern optimization for lithium-ion batteries with an electrothermal-aging model," IEEE Transactions on Industrial Informatics, vol. 14, no. 12, pp. 5463–5474, 2018.

Table 2: Comparison of the three methods.

Methods               Nondetected   Detected   Detection rate (%)   Time (s)
CNN                   3             47         94                   323
PCA                   8             42         84                   301
ANN                   5             45         90                   575
The present method    2             48         96                   366

[15] W. Ouyang, P. Luo, X. Zeng et al., "DeepID-Net: multi-stage and deformable deep convolutional neural networks for object detection," 2014, https://arxiv.org/abs/1409.3505.

[16] K. Liu, Y. Li, X. Hu, M. Lucu, and W. D. Widanage, "Gaussian process regression with automatic relevance determination kernel for calendar aging prediction of lithium-ion batteries," IEEE Transactions on Industrial Informatics, vol. 16, no. 6, pp. 3767–3777, 2020.

[17] X. Mao, C. Shen, and Y. Yang, "Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections," Advances in Neural Information Processing Systems, pp. 2802–2810, 2016.

[18] K. Liu, X. Hu, Z. Yang, Y. Xie, and S. Feng, "Lithium-ion battery charging management considering economic costs of electrical energy loss and battery degradation," Energy Conversion and Management, vol. 195, pp. 167–179, 2019.

[19] K. Xie, W. Zuo, Y. Chen, D. Meng, and L. Zhang, "Beyond a Gaussian denoiser: residual learning of deep CNN for image denoising," IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.

[20] X. Meng, K. Liu, X. Wang et al., "Model migration neural network for predicting battery aging trajectories," IEEE Transactions on Transportation Electrification, vol. 6, no. 2, pp. 363–374, 2020.

[21] K. Liu, Y. Shang, Q. Ouyang, and W. D. Widanage, "A data-driven approach with uncertainty quantification for predicting future capacities and remaining useful life of lithium-ion battery," IEEE Transactions on Industrial Electronics, 2020.

[22] K. Zhang, W. Zuo, and L. Zhang, "FFDNet: toward a fast and flexible solution for CNN-based image denoising," IEEE Transactions on Image Processing, vol. 27, no. 9, pp. 4608–4622, 2018.

[23] B. Mildenhall, J. T. Barron, J. Chen, D. Sharlet, R. Ng, and R. Carroll, "Burst denoising with kernel prediction networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2502–2510, Salt Lake City, UT, USA, June 2018.

[24] S. Guo, Z. Yan, K. Zhang, W. Zuo, and L. Zhang, "Toward convolutional blind denoising of real photographs," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1712–1722, 2019.

[25] C. Szegedy, W. Liu, Y. Jia et al., "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, Boston, MA, USA, June 2015.

[26] T. Brooks, K. Mildenhall, T. Xue, J. Chen, D. Sharlet, and J. T. Barron, "Unprocessing images for learned raw denoising," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11036–11045, 2019.

[27] U. Schmidt, J. Jancsary, and S. Nowozin, "Cascades of regression tree fields for image restoration," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 4, pp. 677–689, 2016.

[28] J. Rother, L. Cai, G. Luo et al., "Lithium-ion battery state of health estimation with short-term current pulse test and support vector machine," Microelectronics Reliability, pp. 1216–1220, 2018.

[29] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[30] Y. Sun, X. Wang, and X. Tang, "Deeply learned face representations are sparse, selective, and robust," in Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, June 2015.

[31] L. Cai, J. Meng, D.-I. Stroe, G. Luo, and R. Teodorescu, "An evolutionary framework for lithium-ion battery state of health estimation," Journal of Power Sources, vol. 412, pp. 615–622, 2019.

[32] L. Luo, W. Ouyang, X. Wang et al., "Deep learning for generic object detection: a survey," International Journal of Computer Vision, vol. 128, no. 2, pp. 261–318, 2020.

[33] R. Fieguth, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, June 2014, https://arxiv.org/abs/1311.2524.

[34] A. Krizhevsky, L. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," Neural Information Processing Systems, vol. 25, no. 2, 2012.

[35] Z. Yang, M. Mourshed, K. Liu, X. Xu, and S. Feng, "A novel competitive swarm optimized RBF neural network model for short-term solar power generation forecasting," Neurocomputing, vol. 397, pp. 415–421, 2020.

[36] L. Cai, J. Meng, S. Daniel-Ioan, J. Peng, G. Luo, and R. Teodorescu, "Multi-objective optimization of data-driven model for lithium-ion battery SOH estimation with short-term feature," IEEE Transactions on Power Electronics, vol. 35, no. 11, pp. 11855–11864, 2020.

[37] J. Shao, C. C. Loy, and X. Wang, "Deeply learned attributes for crowded scene understanding," in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, June 2015.

[38] Y. Tai, J. Yang, X. Liu et al., "A persistent memory network for image restoration," in Proceedings of the IEEE International Conference on Computer Vision, pp. 4539–4547, Venice, Italy, October 2017.

[39] J. Donahue, L. A. Hendricks, S. Guadarrama et al., "Long-term recurrent convolutional networks for visual recognition and description," 2014, https://arxiv.org/abs/1411.4389.

[40] Y. Sun, X. Wang, and X. Tang, "Hybrid deep learning for computing face similarities," in Proceedings of the IEEE International Conference on Computer Vision, Madurai, India, May 2013.

[41] M. Liu and Y. Ding, "A unified selective transfer network for arbitrary image attribute editing," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, April 2019.

[42] D. Ren, K. Zhang, Q. Wang, Q. Hu, and W. Zuo, "Neural blind deconvolution using deep priors," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3341–3350, Seattle, WA, USA, November 2020.

[43] K. Kang and X. Wang, "Fully convolutional neural networks for crowd segmentation," 2014, https://arxiv.org/abs/1411.4464.

[44] F. Feng, S. Teng, K. Liu et al., "Co-estimation of lithium-ion battery state of charge and state of temperature based on a hybrid electrochemical-thermal-neural-network model," Journal of Power Sources, vol. 455, p. 227935, 2020.

[45] Q. Ouyang, Z. Wang, K. Liu, G. Xu, and Y. Li, "Optimal charging control for lithium-ion battery packs: a distributed average tracking approach," IEEE Transactions on Industrial Informatics, vol. 16, no. 5, pp. 3430–3438, 2019.

[46] F. Feng, X. Hu, K. Liu et al., "A practical and comprehensive evaluation method for series-connected battery pack models," IEEE Transactions on Transportation Electrification, vol. 6, no. 2, pp. 391–416, 2020.


Among the applications related to images and video, the most successful is the deep convolutional network, whose design exploits the special structure of images. Its two most important operations, convolution and pooling, come from domain knowledge related to images. How to introduce new effective operations and layers into the deep model through knowledge of the research field is of great significance for improving the performance of image and video recognition. For example, the pooling layer brings about local translation invariance, and the deformation pooling layer proposed in [5–7] can better describe the geometric deformation of various parts of an object. At present, it can be further extended to achieve scale invariance, rotation invariance, and robustness to occlusion.

In this paper, we put forward a Multiscale Pooling Deep Convolution Neural Networks model that aims to train a set of fast and effective schemes. Rather than learning MAP inference-guided discriminative models, we instead adopt a plain DNN to learn the denoisers, taking advantage of recent progress in DNNs as well as the merit of GPU computation. Meanwhile, while providing good performance for image recognition, the learned set of multiscale pooling is plugged into a model-based optimization method to tackle various problems.

The contributions of this work are summarized as follows:

(i) We trained a fast and effective Multiscale Pooling Deep Convolution Neural Networks (MSDNN) model. Borrowing the core idea of the pyramid method, its powerful multiscale expression ability is applied to the deep levels of the network.

(ii) Extensive experiments on classical feature extraction problems, including super-resolution and deblurring, have demonstrated great improvement in performance with the current single training image, and the approach can be applied to image generation, style transfer, image editing, and other problems.

2. Related Work

2.1. Feature Learning. The biggest difference between traditional pattern recognition methods and deep learning lies in the features: in deep learning, features are learned automatically from big data rather than designed by hand. Good features can improve the performance of a pattern recognition system [8–10]. In the past few decades, hand-designed features have dominated the various applications of pattern recognition. Manual design mainly depends on the prior knowledge of designers, so it is difficult to take advantage of big data. Because of the dependence on manual parameter adjustment, the number of parameters allowed in the design of a feature is very limited. Deep learning can automatically learn feature representations from big data and can contain thousands of parameters [11–16]. It can take five to ten years to design effective features by hand, whereas deep learning can quickly learn new and effective feature representations from training data for new applications.

A pattern recognition system includes two parts: the feature and the classifier. In traditional methods, the optimization of features and classifiers is separated. Under the framework of neural networks, the feature representation and the classifier are jointly optimized, which gives full play to the performance of their cooperation [17–21].

In 2012, Hinton participated in the ImageNet competition and adopted a convolutional network model containing 60 million parameters learned from millions of samples [22]. The feature representation learned from ImageNet has very strong generalization ability and can be successfully applied to other datasets and tasks, such as object detection, tracking, and inspection. Another famous competition in the field of computer vision is PASCAL VOC. However, its training set is small and is not suitable for training deep learning models. Some scholars have used the feature representation learned from ImageNet for object detection on PASCAL VOC, and the detection rate increased by 20% [23].

Since feature learning is so important, what are good features? In an image, various complex factors are often combined in a nonlinear way. For example, a face image contains a variety of information, such as identity, posture, age, expression, and lighting. The key to deep learning is to successfully separate these factors through multilevel nonlinear mapping; for example, in the last hidden layer of the deep model, different neurons represent different factors. If this hidden layer is taken as the feature representation, face recognition, pose estimation, expression recognition, and age estimation become very simple, because the various factors are then related in a simple linear way and no longer interfere with each other.

2.2. ImageNet Image Classification. The most important development of deep learning in object recognition is the image classification task of the ImageNet ILSVRC challenge. The lowest error rate of traditional computer vision methods on this test set is 26.172%. In 2012, Hinton's team used convolutional networks to reduce the error rate to 15.315%. This network structure is called AlexNet [24]. Compared with the traditional convolutional network, it has three differences. First, AlexNet adopts the dropout training strategy: during training, some neurons in the input layer and the middle layers are randomly set to zero. This simulates the situation in which noise interferes with the input data, so that some neurons fail to detect some visual patterns. Dropout makes the training process converge more slowly, but the resulting network model is more robust. Second, AlexNet uses the rectified linear unit (ReLU) as the nonlinear activation function. This not only greatly reduces the computational complexity but also makes the output of neurons sparse and more robust to various interferences. Third, AlexNet generates more training samples and reduces overfitting by mirroring the training images and adding random translation perturbations.

In the ImageNet ILSVRC 2013 competition, the top 20 groups all used deep learning techniques. The winner was Rob Fergus's research group at New York University. The deep model adopted was a convolutional network, and the network structure was further optimized. The error rate is
































































































