application of deep learning
TRANSCRIPT
APPLICATIONOFDEEPLEARNINGTO
FISHRECOGNITION
AProject
Presentedto
TheFacultyoftheGraduateSchool
AttheUniversityofMissouri
InPartialFulfillment
OftheRequirementsfortheDegree
MasterofScience
ImplementedandDefendedby:
FumingXiangProf.YiShang,Advisor
May2018
2
TABLEOFCONTENTS
TABLEOFCONTENTS-----------------------------------------------------------------------------------------------------2
ListofFigures---------------------------------------------------------------------------------------------------------------4
ListofTables----------------------------------------------------------------------------------------------------------------7
ACKNOWLEDGEMENTS------------------------------------------------------------------------------------------8
Abstract--------------------------------------------------------------------------------------------------------------9
Introduction-------------------------------------------------------------------------------------------------------10
1. RelatedWorks----------------------------------------------------------------------------------------------12
1.1 ConvolutionalNeuralNetwork(CNN)--------------------------------------------------------------------12
1.2 VGG16-------------------------------------------------------------------------------------------------------------13
1.3 ResNet50---------------------------------------------------------------------------------------------------------13
1.4 SSDCaffe----------------------------------------------------------------------------------------------------------14
2. DesignandImplementation-----------------------------------------------------------------------------16
2.1 Design-------------------------------------------------------------------------------------------------------------16
2.1.1 DataPreprocessing--------------------------------------------------------------------------------------------16
2.1.2 ImagePipeline--------------------------------------------------------------------------------------------------17
2.1.3 InstancePipeline-----------------------------------------------------------------------------------------------18
2.1.4 InstanceRotationPipeline-----------------------------------------------------------------------------------19
2.1.5 EnsemblePipeline---------------------------------------------------------------------------------------------21
2.2 Implementation------------------------------------------------------------------------------------------------23
2.2.1 Configuration---------------------------------------------------------------------------------------------------23
3
2.2.2 EvaluationMetric----------------------------------------------------------------------------------------------24
2.2.3 ImagePipeline--------------------------------------------------------------------------------------------------29
2.2.4 InstancePipeline-----------------------------------------------------------------------------------------------34
2.2.5 InstanceRotationPipeline-----------------------------------------------------------------------------------38
2.2.6 EnsemblePipeline---------------------------------------------------------------------------------------------44
3. Results---------------------------------------------------------------------------------------------------------49
4. ConclusionandFutureWork----------------------------------------------------------------------------53
4.1 ConclusionandContribution -------------------------------------------------------------------------------53
4.2 FutureWork-----------------------------------------------------------------------------------------------------53
5. References---------------------------------------------------------------------------------------------------54
4
ListofFigures
Figure1FishSpeciesandDistributioninMissouri--------------------------------------------------------------------10
Figure2Fishsizeratiodistribution---------------------------------------------------------------------------------------11
Figure3VGG16Architecture----------------------------------------------------------------------------------------------13
Figure4ResNet50Architecture------------------------------------------------------------------------------------------14
Figure5SSDCaffeArchitecture-------------------------------------------------------------------------------------------15
Figure6Examplesofgrayscale,non-fishormissing-fish,andambiguityimages-----------------------------16
Figure7Exampleoflabeledimage---------------------------------------------------------------------------------------17
Figure8ImagePipeline------------------------------------------------------------------------------------------------------17
Figure9InstancePipeline---------------------------------------------------------------------------------------------------18
Figure10Instanceimagesgeneration-----------------------------------------------------------------------------------19
Figure11InstanceRotationPipeline-------------------------------------------------------------------------------------19
Figure12Orientationsplit--------------------------------------------------------------------------------------------------20
Figure13Orientationvector-----------------------------------------------------------------------------------------------20
Figure14InstanceRotationimagesgeneration-----------------------------------------------------------------------21
Figure15EnsemblePipeline-----------------------------------------------------------------------------------------------22
Figure16ConfusionMatrixExample-------------------------------------------------------------------------------------25
Figure17IntersectionoverUnion----------------------------------------------------------------------------------------26
5
Figure18ThresholdexampleofIntersectionoverUnion-----------------------------------------------------------27
Figure19Pythoncodeforaccuracycalculation-----------------------------------------------------------------------29
Figure20Accuracy_0andAccuracy_1orientationdistribution---------------------------------------------------29
Figure21ModifiedVGG16Architecture-------------------------------------------------------------------------------30
Figure22CompilemodelinKeraswithcross-entropylossfunction---------------------------------------------30
Figure23ImagePipelinelossandaccuracycurve–VGG16Classifier-------------------------------------------32
Figure24ImagePipelinetrainingconfusionmatric–VGG16Classifier-----------------------------------------32
Figure25ImagePipelinevalidationconfusionmatric–VGG16Classifier--------------------------------------33
Figure26ImagePipelinetestingconfusionmatric–VGG16Classifier------------------------------------------33
Figure27SSDCaffetrainingloss------------------------------------------------------------------------------------------34
Figure28SSDCaffeprecision-recallcurveonthetestdataset----------------------------------------------------35
Figure29InstancePipelinelossandaccuracycurve–VGG16Classifier----------------------------------------36
Figure30InstancePipelinetrainingconfusionmatric–VGG16Classifier--------------------------------------37
Figure31InstancePipelinevalidationconfusionmatric–VGG16Classifier-----------------------------------37
Figure32InstancePipelinetestingconfusionmatric–VGG16Classifier---------------------------------------38
Figure33PoseEstimatorlossandaccuracycurve--------------------------------------------------------------------39
Figure34PoseEstimatortrainingconfusionmatrix-----------------------------------------------------------------40
Figure35PoseEstimatorvalidationconfusionmatrix---------------------------------------------------------------40
6
Figure36PoseEstimatortestingconfusionmatrix-------------------------------------------------------------------41
Figure37InstanceRotationPipelinelossandaccuracycurve–VGG16Classifier----------------------------42
Figure38InstanceRotationPipelinetrainingconfusionmatrix–VGG16Classifier--------------------------42
Figure39InstanceRotationPipelinevalidationconfusionmatrix–VGG16Classifier-----------------------43
Figure40InstanceRotationPipelinetestingconfusionmatrix–VGG16Classifier---------------------------43
Figure41EnsemblePipeline-ResNet50lossandaccuracycurve-----------------------------------------------44
Figure42EnsemblePipelinetrainingconfusionmatrix–ResNet50Classifier---------------------------------45
Figure43EnsemblePipelinevalidationconfusionmatrix–ResNet50Classifier------------------------------45
Figure44EnsemblePipelinetestingconfusionmatrix–ResNet50Classifier---------------------------------46
Figure45VGGratio-validationaccuracycurve----------------------------------------------------------------------46
Figure46EnsemblePipelinetrainingconfusionmatrix–WeightedClassifier--------------------------------47
Figure47EnsemblePipelinevalidationconfusionmatrix–WeightedClassifier------------------------------48
Figure48EnsemblePipelinetestingconfusionmatrix–WeightedClassifier----------------------------------48
Figure49Image-ClassifierAccuracy--------------------------------------------------------------------------------------49
Figure50ImagePipelinetestaccuracy----------------------------------------------------------------------------------50
Figure51InstancePipelinetestaccuracy-------------------------------------------------------------------------------50
Figure52InstanceRotationPipelinetestaccuracy-------------------------------------------------------------------51
Figure53EnsemblePipelinetestaccuracy-----------------------------------------------------------------------------51
7
ListofTables
Table1DatasetdownloadedfromImageNet--------------------------------------------------------------------------11
Table2ImagePipelinedataset--------------------------------------------------------------------------------------------18
Table3InstancePipelinedataset-----------------------------------------------------------------------------------------19
Table4Poseestimatordataset--------------------------------------------------------------------------------------------21
Table5InstanceRotationPipelinedataset-----------------------------------------------------------------------------21
Table6EnsemblePipelinedataset---------------------------------------------------------------------------------------23
Table7Developmentenvironmentconfiguration--------------------------------------------------------------------24
Table8ImagePipelinetrainingphase–VGG16Classifier----------------------------------------------------------31
Table9InstancePipelinetrainingphase–VGG16Classifier-------------------------------------------------------36
Table10PoseEstimatortrainingphase---------------------------------------------------------------------------------39
Table11InstanceRotationPipelinetrainingphase–VGG16Classifier-----------------------------------------41
Table12ResNet50trainingphase---------------------------------------------------------------------------------------44
Table13Finaltestresults---------------------------------------------------------------------------------------------------52
8
ACKNOWLEDGEMENTS
Firstly, I would like to thank Prof. Yi Shang for his mentorship, guidance, and support
throughoutmygraduateschoolandthroughthisresearchproject. Iwouldalso liketothankJoel for
theopportunity tocollaboratewith theMissouriDepartmentofConservationandtheir feedbackon
thisproject.
I would also like to thank everyone from the Computer Science lab, especially Guang Chen,
PengSun,YangLiu,andNickolasWergeles,forsharingtheirexperiencesandbeinganintegralpartof
thisproject. Iwouldalso liketoshowappreciationtoJodieLenser,ShirleyHoldmeierforguidingme
throughmyapplicationprocesstoMizzouaswellasadvisingmeonmycourses.
Finally, Iwouldliketothankmyfamily.Withouttheirsupport,trustandbelief inme,Iwould
notbewhereIamtoday.IhopeIhavedoneyouproud.
9
ABSTRACT
Overthepastfewyears,deeplearninghasbeenwidelyusedandobtainedverygoodresultsin
imagerecognition.Inthisproject,severalstate-of-the-artdeeplearningmodelsandtheircombinations
have been applied to fish recognition in images, in particular 9 common species of fish inMissouri
rivers. Four different data processing and machine learnings pipelines have been developed and
extensiveexperimentshavebeenconductedtoevaluatetheirperformances.Thedeepconvolutional
neural network (CNN)models used in these pipelines include SSD, VGG16, ResNet50, etc. The four
pipelines are image-based, instance-based, instance rotation based, and ensemble, with increasing
complexity.Withoutdoinganypreprocessing,theimage-basedpipelinetakesanentireimageasinput
to classify the image into one of the target classes using deep CNNs. This pipeline achieved up to
75.57% classification accuracy on our test dataset. The instance-based pipeline consists of object
detectionbyonedeepCNNfollowedbyclassificationbyanotherdeepCNN.Thismethodachievedup
to80.03%accuracyonourtestdataset.TheinstancerotationbasedpipelineaddsadeepCNNtodo
poseestimationbetweenobjectdetectionandclassification.Theposture-adjustedfish imageisused
astheinputtotheclassificationmodel,whichhelpthepipelinetoachieveupto82.83%accuracyon
the same dataset. Finally, the ensemble pipeline is a combination of two instance rotation based
pipelines.Thedifferenceofthesetwoinstancerotationbasedpipelinesisintheclassificationmodel:
one is VGG16 and the other ResNet50. The ensemble pipeline achieved up to 87.22% accuracy,
outperformingallotherpipelinessignificantly.
10
INTRODUCTION
TheprojectisproposedbyMissouriDepartmentofConservation(MDC),tosolvetheissuethat
countingandclassifyingfish in imagemanually isahugeamountofworkandtime-consuming.Since
theConvolutionalneuralnetworkperformancebetterandbetter,andfornow,theCNNinsomecase
couldgethigheraccuracythanhumanrecognition,wearegoingtouseexistingCNNmodelstosolve
thefishclassificationproblem.Theprimarytaskofthisprojectisaimingtoclassify9differentspecies
offishinMissouririver.
Figure1FishSpeciesandDistributioninMissouri
The classification task is taking a or a batch of RGB images, and return the species or the
probabilityoffishspecies.
AllthedatausedinthisprojectisdownloadfromImageNet,whichisalargeimagedatasetwith
categorical labels, In thedataset, thereare4842 imagesdistribute in9classes,andeach imagemay
containseveralfishinstances,butonlyonespecies.
11
Table1DatasetdownloadedfromImageNet
Amongthewholedataset,mostof the imagecontains1 fish.Tovisualize thesizeof fish,we
definetheratiobyusingthis formulatheratioequalstothe labeledboxareaof fish,dividedbythe
imageareainpixelmeasurement.
𝑅𝑎𝑡𝑖𝑜&'() = 𝐴𝑟𝑒𝑎&'()/01/𝐴𝑟𝑒𝑎'3456
Figure2Fishsizeratiodistribution
12
1. RELATEDWORKS
1.1 ConvolutionalNeuralNetwork(CNN)
TheConvolutionalNeuralNetwork is a kindof artificialneuralnetwork,whichhasbecomea
research hotspot in the field of speech analysis and image recognition. Its weight-sharing network
structuremakesitmoresimilartobiologicalneuralnetworks,reducingthecomplexityofthenetwork
model and reducing the number ofweights. This advantage ismore obviouswhen the input of the
network is a multi-dimensional image so that the image can be directly used as the input of the
network, which avoids the complicated feature extraction and data reconstruction process in the
traditional recognition algorithm. A convolutional network is a multi-layer perceptron specially
designedtoidentifytwo-dimensionalshapes.Thisnetworkstructureishighlyinvarianttotranslation,
scaling,tilting,orotherformsofdeformation.
CNNis thefirst learningalgorithmtotrainmulti-layernetworkstructuressuccessfully. Ituses
spatial relationships to reduce the number of parameters that need to be learned to improve the
trainingperformanceofgeneralbackpropagationalgorithms.CNNsareproposedasadeep learning
framework to minimize data preprocessing requirements. In CNN, a small part of the image (local
receptivearea)isusedasthelowestlevelinputtothehierarchicalstructure.Theinformationisthen
transmittedtodifferentlayers.Eachlayerpassesadigitalfiltertoobtainthemostsignificantfeatures
of the observed data. Thismethod can obtain significant features of the observation data that are
invarianttotranslation,scaling,androtation,becausethelocalperceptionregionoftheimageallows
theneuronortheprocessingunittoaccessthemostbasicfeatures,suchasorientededgesorcorner
points.
13
1.2 VGG16
Inthisproject,differentCNNswareused,suchasVGG16,ResNet50,andSSDCaffe.Theoriginal
title of the VGG paper was “VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE
RECOGNITION”.ThepaperwaspublishedonICLR2015.VGG16andVGG19aretwoverygoodDCNN
(Deep Convolutional Neural Network) proposed by this article. This series ofmodels achieved best-
performanceinILSVRC-2013andachievedgoodresultsinotherDCNN-basedwork(egFCN).Compared
to ALEXNET, VGG has a more accurate estimate of the picture and more space. VGG16 has 3
convolutional blocks with max pooling, and 3 fully connected layers with ReLU functions, finally
followingasoftmaxfunction.Eachconvolutionalblockcontains2or3convolutionallayers.
Figure3VGG16Architecture
1.3 ResNet50
An original paper of the ResNet model: Deep Residual Learning for Image Recognition.
Itcanbeclearlyseenthattheexpressivenessofthenetworkisenhancedasthedepthofthenetwork
increases.TheexperimentsofHeKaimingandothersalsoprovedthatthenetworkstructurewiththe
14
samecomplexity, therelativedeepernetworkperformsbetter.However, it isnot.Regardlessof the
issueofcomputationalcost,whenthedepthofthenetworkisdeeper,increasingthenumberoflayers
willnot improveperformance,butwillcausethegradientdiffusion.Undertheguidanceofthis idea,
DeepResidualLearning(ResNet)wasproposed.ResNethastwobasicblocks,oneisIdentityBlock,the
inputandoutputdimensionsare the same,anotherbasicblock isConvBlock, the inputandoutput
dimensions are not the same, so cannot be connected in series, its rolewas initially to change the
dimensionofthefeaturevector.
Figure4ResNet50Architecture
1.4 SSDCaffe
The SSD algorithm proposed in the paper “SSD: single shotmultibox detector”, is an object
detection algorithm that directly predicts the coordinates and categories of the bounding box, and
doesnotgenerateaproposalprocess. For thedetectionofobjectsofdifferent sizes, the traditional
approach is to convert the images into different sizes, then process them separately, and finally
synthesizetheresults.Inthispaper,thessdcansynthesizedifferentconvolutionallayerfeaturemaps
toachievethesameeffect.ThemainnetworkstructureofthealgorithmisVGG16,whichtransforms
two fully connected layers into a convolutional layer and then adds four convolutional layers to
15
constructanetworkstructure.Theoutputoffivedifferentconvolutionallayersisconvolvedwithtwo
3*3convolutionkernels,onefortheoutputclassification,eachdefaultboxgenerates21confidences,
andtheotherforlocalization,eachdefaultboxgenerates4coordinatevalues(x,y,w,h).Inaddition,
these5 convolutional layers alsogenerate thedefaultbox (thegenerated coordinates) via theprior
boxlayer.Finally,thefirstthreecalculationresultsarerespectivelymergedandthenpassedtotheloss
layer.
Figure5SSDCaffeArchitecture
16
2. DESIGNANDIMPLEMENTATION
2.1 Design
2.1.1 DataPreprocessing
ImagesaredownloadedfromImageNetbythelinkswithkeywords.Thus,therewouldbesome
datathatisnotsuitableforourproject,suchasgrayscaleimages,non-fishormissing-fishimage,and
ambiguityimage.Whatweneedtodoistoremovetheseimagestocleanthedatasetforthisproject.
Figure6Examplesofgrayscale,non-fishormissing-fish,andambiguityimages
SincetheimagesdownloadedfromImageNetonlyprovidecategorical labels, inordertotrain
thenetwork,especiallyfortheobjectdetectionandposeestimation,weneedtolabelthefishwithan
annotationtool:Sloth.
Sloth’s purpose is to provide a versatile tool for various labeling tasks in the context of
computervisionresearch.Inthisproject,foreveryimage,thelabelweneedisapositionboundingbox
(red in the figure 7) containing a particular species of fish, and its head and tail coordinates (green
pointsinfigure7).
17
Figure7Exampleoflabeledimage
2.1.2 ImagePipeline
Image Pipeline is the basic pipeline in this project. It takes the non-processed images as the
input. For this pipeline, we used VGG16 as the prediction model. Since the VGG16 was trained in
ImageNetandtheoutputofthepre-trainedmodelis1000,inordertousethepre-trainedweightsto
getahigheraccuracy,weusedpre-trainedVGG16asthebasicmodelandaddedanotherdenselayer
(fullyconnectedlayer)withtheoutput9.
Figure8ImagePipeline
Werandomlysplit4842imagesintotraining(80%),validation(10%),andtesting(10%)dataset
byspecies.
18
Table2ImagePipelinedataset
2.1.3 InstancePipeline
However,inourproject,likewesawindataoverviewpart,thefishintheimageissmall,mostly
smallerthanhalfoftheimagesize,whichmeans,intheimage,thereareplentyofuselessinformation,
whichwould influence theperformanceof classification. Thus, in order to avoid it, fish detection is
necessary. In Instance Pipeline, detection is added before the VGG16 classification, which would
outputtheboxpositionoffish,andwedothepaddingtoavoiddistortionandcropit.
Figure9InstancePipeline
Basedon thepipelinewedesigned, cropped imagesareneeded for the training stage to the
classifier.ThegeneratedimagesarecomingfromtheImagePipeline.Wepaddedandcroppedthefish
intosquareinstanceimages.Iftheborderofthecroppedimageisoutoftheimageborder,weshould
dopaddingaftercropping,andfinally,wegot5136instanceimages.
19
Figure10Instanceimagesgeneration
Table3InstancePipelinedataset
2.1.4 InstanceRotationPipeline
In the Instance Pipeline, we did object detection, and image classification to filter out the
useless information by cropping the fish out. However, the cropped images still have some useless
informationwhichwecouldfilterout:theorientationoffish.Thus,weaddedaposeestimatorafter
theobjectdetectiontoalignthefishintoaparticularorientation.
Figure11InstanceRotationPipeline
20
Amount5136 instance images,weassigntheposeof fish into16orientations inacircle,and
everyorientationtakes22.5degrees.
Figure12Orientationsplit
Togetwhichorientationthatthefishshouldbeassignedto,weusedtheheadandtail
coordinatestocalculatethevectorofeachfish.
Figure13Orientationvector
𝑣𝑒𝑐𝑡𝑜𝑟 = 𝑐𝑜𝑜𝑟𝑑𝑖𝑛𝑎𝑡𝑒)64; − 𝑐𝑜𝑜𝑟𝑑𝑖𝑛𝑎𝑡𝑒=4'>
Forthetraining,validation,andtestingdatainposeestimator,wediddataaugmentationto
enlargethedataset,suchasflippingandrotating.Finally,wehave7687fortheestimatordataset.
21
Table4Poseestimatordataset
Forthetraining,validation,andtestingdatainVGG16classifier,wedothealignmentfromthe
InstancePipelineVGG16classifierdataset.
Figure14InstanceRotationimagesgeneration
Afteralignment,wegotthetraining,validation,andtestingdatafortheVGG16classifierin
InstanceRotationPipeline.
Table5InstanceRotationPipelinedataset
2.1.5 EnsemblePipeline
22
Intheresultsofamassivemachinelearningcompetition,thebestresultisusuallytheensemble
modelratherthanthesinglemodel.Forexample,ILSVRC2015'shighest-scoringsinglemodelstructure
achievedthe13thplace.The1stto12thhaveuseddifferenttypesofmodelensemble.
Therearemanydifferenttypesofensemblemodel:oneofthemisstacking.Thistypeismore
generalandcantheoreticallycharacterizeanyotherensembletechnique.Forthispipeline, Iwilluse
thepurestformofstacking,whichinvolvesaveragingtheensemblemodeloutputs.Sincetheaveraging
process does not include any parameters, this process does not require training (only the trained
modelisneeded).
In the Ensemble Pipeline, we train another classifier by using ImageNet pre-trained model:
ResNet50,andtaketheweightaveragetogettheensembleresult.
Figure15EnsemblePipeline
WhenwegotthepredictionresultsforVGG16andResNet50,weappliedweightaveragetoget
thefinalresultforthispipeline.
𝑃𝑟𝑒𝑑6@(63/>6 = 𝑟𝑎𝑡𝑖𝑜ABB ∗ 𝑝𝑟𝑒𝑑ABB + (1 − 𝑟𝑎𝑡𝑖𝑜ABB) ∗ 𝑝𝑟𝑒𝑑I6(J6=
The only difference for the Ensemble Pipeline is thatwe need to train the ResNet50model
similarlyasthetrainingstagefortheVGG16modelinInstanceRotationPipeline.Thus,thedatasetis
same.
23
Table6EnsemblePipelinedataset
2.2 Implementation
2.2.1 Configuration
The project's development environment is Linux, Ubuntu 16.04, using python as the only
developmentlanguage.Pythondependencypackageversioninformationisasfollows.
PackageName Version Description
h5py 2.7.1 toolsfor*.h5file(kerasweights)
imutils 0.4.3 imageprocessingtools
jupyter 1.0.0 pythonGUI(inbrowser)
Keras 1.2.0 highlevelmachinelearningAPI
matplotlib 2.1.0 datavisualization
numpy 1.14.0 pythonscientificcomputinglibrary
pandas 0.20.3 pythondataanalysislibrary
pillow 3.1.2 imageprocessingtools
pip 9.0.1 pythonpackagemanagementtool
24
progressbar 2.3.0 progressbarviewcomponents
python 2.7 language
scikit-image 0.13.1 imageprocessingtools
scikit-learn 0.19.1 machinelearningtoolkit
scipy 1.0.0 pythonalgorithmlibraryandmathtoolkit
six 1.11.0 libraryforcompatibilitywithpython2andpython3
sloth 1.0 annotationtools
Theano 0.9.0 Usedtodefine,optimize,andefficientlysolveanalogestimationproblemsofmulti-dimensionalarraydata
correspondingmathematicalexpressionsTable7Developmentenvironmentconfiguration
2.2.2 EvaluationMetric
In the whole project, we use object detection, and classification (pose estimator and fish
classifier). In object detection, precision and recall are the primary evaluation metrics, and for
classification,calculatingtheaccuracyisasimplemethodtoevaluatethemodel.
2.2.2.1 ObjectDetectionAccuracy
Traditionallyintheevaluationofobjectdetectionalgorithms,forasingledetectionfileandits
corresponding ground truth file, two values, recall, and precision, can be calculated. Usually, in the
objectdetection task,weuse confusionmatrix toget theprediction results.A confusionmatrix is a
summary of prediction results on a detection or classification problem. The number of correct and
incorrectpredictionsaresummarizedwithcountvaluesandbrokendownbyeachclass.Tocalculate
theconfusionmatrix:
1. Youneedatestdatasetoravalidationdatasetwithexpectedoutcomevalues.
25
2. Makeapredictionforeachrowinyourtestdataset.
3. Fromtheexpectedoutcomesandpredictionscount:
a. Thenumberofcorrectpredictionsforeachclass.
b. Thenumberofincorrectpredictionsforeachclass,organizedbytheclassthat
waspredicted.
Efficiently,scikit-learnprovidesafunctiontogetconfusionmatrixbyinputtingthelabeland
predictioninorder.Forobjectdetectionorbinaryclassification,forexample,wehavetheconfusion
matrixasfollows:
Figure16ConfusionMatrixExample
● TruePositives(TP):Thesearecasesinwhichwepredictedyes,andtheirlabelisyes.
● TrueNegatives(TN):Wepredictedno,andtheirlabelisno.
● FalsePositives(FP):Wepredictedyes,buttheirlabelisno.
● FalseNegatives(FN):Wepredictedno,buttheirlabelisyes.
𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 = 𝑇𝑃/(𝑇𝑃 + 𝐹𝑃)
𝑅𝑒𝑐𝑎𝑙𝑙 = 𝑇𝑃/(𝑇𝑃 + 𝐹𝑁)
26
Intuitively,recalltellsushowmanyofourobjectshavebeendetected,andprecisiongivesus
information on the number of false alarms. Both are higher if better and should be close to 1 for
perfectsystems.
Inorder tobeable tocalculate these twomeasures (recallandprecision),anevaluation tool
needstofindouttwothings:
1. for each ground truth rectangle, we need to know whether it has been correctly
detectedornot
2. foreachdetectedrectangle,weneedtoknowwhetheritcorrespondstoavalidground
truthrectangleornot
Inordertomeasurewhethertheobjectiscorrectlydetected,weuseIntersectionoverUnion.
Figure17IntersectionoverUnion
Examining this equation, you can see that Intersection over Union is simply a ratio. In the
numerator,we compute the area of overlap between the predicted bounding box and the ground-
truthboundingbox.Thedenominatoristheareaofunion,ormoresimply,theareaencompassedby
27
boththepredictedboundingboxandtheground-truthboundingbox.Dividingtheareaofoverlapby
theareaofunionyieldsourfinalscore—theIntersectionoverUnion.
Inall reality, it’sextremelyunlikely that the (x,y)-coordinatesofourpredictedboundingbox
are going to exactlymatch the (x, y)-coordinates of the ground-truth bounding box.Due to varying
parametersofourmodel(imagepyramidscale,slidingwindowsize,featureextractionmethod,etc.),a
completeandtotalmatchbetweenpredictedandground-truthboundingboxes issimplyunrealistic.
Becauseof this,weneed todefineanevaluationmetric that rewardspredictedboundingboxes for
heavilyoverlappingwiththeground-truth:
Figure18ThresholdexampleofIntersectionoverUnion
Asyoucansee,predictedboundingboxesthatheavilyoverlapwiththeground-truthbounding
boxeshavehigherscoresthanthosewithlessoverlap.ThismakesIntersectionoverUnionanexcellent
metricforevaluatingcustomobjectdetectors.
28
2.2.2.2 ClassificationAccuracy
Accuracyisonemetricforevaluatingclassificationmodels.Informally,accuracyisthefraction
ofpredictionsourmodelgotright.Formally,accuracyhasthefollowingdefinition:
𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦 = 𝑁𝑢𝑚𝑏𝑒𝑟𝑜𝑓𝑐𝑜𝑟𝑟𝑒𝑐𝑡𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛𝑠𝑇𝑜𝑡𝑎𝑙𝑛𝑢𝑚𝑏𝑒𝑟𝑜𝑓𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛𝑠
Usually, in classification task, we use confusion matrix similarly to the object detection but
usingmultiple-class.
Bylookingattheconfusionmatrix,wecouldcalculatetheaccuracyeasily.
𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦 = 𝑇𝑟𝑎𝑐𝑒34=U'1𝑆𝑢𝑚34=U'1
2.2.2.3 PoseEstimationAccuracy
PoseEstimationisaclassificationtask,butweapplytwodifferentmeasurementstocalculate
theaccuracyofthemodel:
1. Traditionalclassificationaccuracymeasurement.
2. Offsettingclassificationaccuracymeasurement.
In traditional classification accuracy measurement, we could calculate the confusion matrix,
andgettheaccuracybyusingtheaboveformula.However,insomecases,weallowsomeoffsettothe
rotatedfish,whichmeansthenearbyorientationisacceptable.
𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦0 =𝑇𝑟𝑢𝑒𝑃𝑜𝑠𝑖𝑡𝑖𝑣𝑒𝑠
𝑁𝑢𝑚𝑏𝑒𝑟𝑜𝑓𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛𝑠
29
𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦1 = 𝑂𝑓𝑓𝑠𝑒𝑡𝑡𝑖𝑛𝑔𝑇𝑟𝑢𝑒𝑃𝑜𝑠𝑖𝑡𝑖𝑣𝑒𝑠𝑁𝑢𝑚𝑏𝑒𝑟𝑜𝑓𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛𝑠
Where:
𝑇𝑟𝑢𝑒𝑃𝑜𝑠𝑖𝑡𝑖𝑣𝑒𝑠: (𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛 == 𝑔𝑟𝑜𝑢𝑛𝑑𝑡𝑟𝑢𝑡ℎ)and
𝑂𝑓𝑓𝑠𝑒𝑡𝑡𝑖𝑛𝑔𝑇𝑟𝑢𝑒𝑃𝑜𝑠𝑖𝑡𝑖𝑣𝑒𝑠: (𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛 == 𝑔𝑟𝑜𝑢𝑛𝑑𝑡𝑟𝑢𝑡ℎ||𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛 ==
𝑔𝑟𝑜𝑢𝑛𝑑𝑡𝑟𝑢𝑡ℎ4;]0'@'@5)
Usually,inNumpy,wecoulddoiteasilybyusingthefollowingcode(cmistheconfusion
matrix):
Figure19Pythoncodeforaccuracycalculation
Forexample, in the following figure, if theground truthorientation is 14,bothof those two
measurementsareacceptable.
Figure20Accuracy_0andAccuracy_1orientationdistribution
2.2.3 ImagePipeline
30
IntheImagePipeline,aswementioned,weusedVGG16pre-trainedmodelandaddedanother
fullyconnectedlayerwithoutput9.Sincethepre-trainedmodelistrainedonImageNetdataset,and
ourdataisalsocomingfromImageNet,weonlytrainlasttwolayers(transferlearning).
Figure21ModifiedVGG16Architecture
Forthelossfunction,wechosecrossentropy,whichisusuallyusedinmulti-classclassification.
𝑙(𝑦, 𝑦) = −𝑦_𝑙𝑜𝑔𝑦
InKeras,itiseasytosetupthelossfunctionbyusingthefollowingcode
Figure22CompilemodelinKeraswithcross-entropylossfunction
TheinputofthenetworkisabatchofRGBimageswiththesizeof(3,224,224).3isforR,G,
andBchannel,and224istheinputoftraditionalVGGstructure.Anyimagesshouldberesizedto224
by224beforeit isgiventothenetwork.SinceweusedTheanoasthebackendinKeras,thechannel
31
shouldbe the firstparameter.Anyonewhowants touseTensorflowas theKerasbackend, theonly
modification is using the (224, 224, 3) as the input size.Usually, in training stepof traditional deep
learningproblem,choosing1e-4astheinitiallearningrateisbetter,andwhenthelossvaluenolonger
declinesinacertainepoch,ortheaccuracyoftheverificationdatasetnolongerrises,wecanreduce
thelearningrateby0.1toachievebetterresults.
In this pipeline, we used (3, 224, 224) as the input, initial learning rate 1e-4, batch size 64,
optimizerSGD,andtrainedthenetworkbyreducingthelearningrateby0.1ifthevalidationaccuracy
doesnotimprovein30epochs.
input lr # epoch batch size time / epoch total time last update
epoch loss val_loss val_Acc
3,224,224 1e-4 56 64 39 s 36 min 24 s 25 0.1155 0.9666 0.77
Table8ImagePipelinetrainingphase–VGG16Classifier
After25epochs,weachieve77%accuracyonvalidationdatasetwiththetrainingloss0.1155
andvalidationloss0.9666.Thefiguresshowthecurveoflossandaccuracyontrainingandvalidation
dataset.
32
Figure23ImagePipelinelossandaccuracycurve–VGG16Classifier
Forthetrainingdataset,wegetthefollowingconfusionmatrixwiththeaccuracy0.9866
Figure24ImagePipelinetrainingconfusionmatric–VGG16Classifier
Forthevalidationdataset,wegetthefollowingconfusionmatrixwiththeaccuracy0.7711
33
Figure25ImagePipelinevalidationconfusionmatric–VGG16Classifier
Forthetestdataset,wegetthefollowingconfusionmatrixwiththeaccuracy0.7557
Figure26ImagePipelinetestingconfusionmatric–VGG16Classifier
34
2.2.4 InstancePipeline
IntheInstancePipeline,therearetwomodelswhichshouldbetrainedontheinstancedataset:
SSDCaffedetectorandVGG16classifier.
FortheSSDCaffedetector,weresizetheimagesto512by512andusethesize(512,512,3)as
theinputwithboundingboxlabels.Trainingphasetook20,000iterationsin12hours.InSSDCaffe,the
lossfunctionisthecombinationofconfidencelossandboxlocationloss.
𝐿(𝑥, 𝑐, 𝑙, 𝑔) = 1𝑁 (𝐿b0@&(𝑥, 𝑐) + 𝛼𝐿>0b(𝑥, 𝑙, 𝑔))
WhereNisthenumberofmatchedboxwithgroundtruthbox.
After20,000iterations,wegotthefinalloss1.28
Figure27SSDCaffetrainingloss
35
Ontestdataset,wegotthepredictionresultswithoutconfidencethreshold.Inordertochoose
thethresholdthatisbestforthemodeltogetrelativelyhighprecisionandrecall,weusetheprecision-
recall curve tomeasure the performance. For each confidence threshold from 0.0 to 1.0,we could
haveapointwiththexofrecallandyofprecision.
Figure28SSDCaffeprecision-recallcurveonthetestdataset
The AUC (Area Under Curve) of this figure is the average precision. When the confidence
thresholdis0.5,wecouldhavearelativelyhighprecisionandrecallwiththeaverageprecision0.9275
FortheVGG16classifierininstancedataset,weusethesimilarmethodtomodifythenetwork
structure:usingImageNetpre-trainedmodel,andaddingafullyconnectedlayerwithoutput9,freeze
thelayersexceptforthelasttwolayers.Weused(3,224,224)astheinput,initiallearningrate1e-4,
36
batchsize64andtrainedthenetworkbyreducingthe learningrateby0.1 if thevalidationaccuracy
doesnotimprovein30epochs.
input lr # epoch batch size time / epoch total time last update
epoch loss val_loss val_Acc
3,224,224 1e-4 114 64 37 s 1 h 10 m 18 s 83 0.3386 0.4751 0.8236
Table9InstancePipelinetrainingphase–VGG16Classifier
After83epochs,weachieved82.36%accuracyonvalidationdatasetwiththetrainingloss
0.3386andvalidationloss0.4751.Thefiguresshowthecurveoflossandaccuracyontrainingand
validationdataset.
Figure29InstancePipelinelossandaccuracycurve–VGG16Classifier
Forthetrainingdataset,wegetthefollowingconfusionmatrixwiththeaccuracy0.9589
37
Figure30InstancePipelinetrainingconfusionmatric–VGG16Classifier
Forthevalidationdataset,wegetthefollowingconfusionmatrixwiththeaccuracy0.8236
Figure31InstancePipelinevalidationconfusionmatric–VGG16Classifier
38
Forthetestingdataset,wegetthefollowingconfusionmatrixwiththeaccuracy0.8064
Figure32InstancePipelinetestingconfusionmatric–VGG16Classifier
2.2.5 InstanceRotationPipeline
In Instance Rotation Pipeline, the detector is same as Instance Pipeline. Two models are
needed: Pose Estimator and VGG16 Classifier in Instance rotated dataset. The pose estimator is a
traditionalclassificationproblem,sowestillusethesamemethodtomodifytheneuralnetworkbased
on VGG16: using ImageNet pre-trained model, and adding a fully connected layer with output 16,
freezethelayersexceptforthelasttwolayers.Weused(3,224,224)astheinput,initiallearningrate
1e-4, batch size 64, and trained the network by reducing the learning rate by 0.1 if the validation
accuracydoesnotimprovein30epochs.
39
input lr # epoch batch size time / epoch total time last update
epoch loss val_loss val_Acc
3,224,224 1e-4 53 64 54 s 47 min 42 s 23 0.2951 0.2944 0.9138
Table10PoseEstimatortrainingphase
After23epochs,weachieved91.38%accuracyonvalidationdatasetwiththetrainingloss
0.2951andvalidationloss0.2944.Thefiguresshowthecurveoflossandaccuracyontrainingand
validationdataset.
Figure33PoseEstimatorlossandaccuracycurve
For the accuracy, we use two different measurement methods as mentioned in Evaluation
Metric part. accuracy_0 for traditional classification accuracy, accuracy_1 for offsetting classification
accuracy.Forthetrainingdataset,wegetthefollowingconfusionmatrixwiththeaccuracy_0:0.9768,
andaccuracy_1:0.9985
40
Figure34PoseEstimatortrainingconfusionmatrix
Forthevalidationdataset,wegetthefollowingconfusionmatrixwiththeaccuracy_0:0.9138,
andaccuracy_1:0.9948
Figure35PoseEstimatorvalidationconfusionmatrix
41
Forthetestingdataset,wegetthefollowingconfusionmatrixwiththeaccuracy_0:0.9239,and
accuracy_1:0.9961
Figure36PoseEstimatortestingconfusionmatrix
In theVGG16 classifier for instance rotation classification,wedid the same step,modify the
VGG16structurewithabasicVGGmodelandaddedafullyconnectedlayer(outputshape9).Weused
(3,224,224)astheinput,initiallearningrate1e-4,batchsize64,andtrainedthenetworkbyreducing
thelearningrateby0.1ifthevalidationaccuracydoesnotimprovein30epochs.
input lr # epoch batch size time / epoch total time last update
epoch loss val_loss val_Acc
3,224,224 1e-4 126 64 40 s 1h 24 min 96 0.1039 0.3305 0.8938
Table11InstanceRotationPipelinetrainingphase–VGG16Classifier
42
After96epochs,weachieved89.38%accuracyonvalidationdatasetwiththetrainingloss
0.1039andvalidationloss0.3305.Thefiguresshowthecurveoflossandaccuracyontrainingand
validationdataset.
Figure37InstanceRotationPipelinelossandaccuracycurve–VGG16Classifier
Forthetrainingdataset,wegetthefollowingconfusionmatrixwiththeaccuracy0.9915
Figure38InstanceRotationPipelinetrainingconfusionmatrix–VGG16Classifier
43
Forthevalidationdataset,wegetthefollowingconfusionmatrixwiththeaccuracy0.8938
Figure39InstanceRotationPipelinevalidationconfusionmatrix–VGG16Classifier
Forthetestingdataset,wegetthefollowingconfusionmatrixwiththeaccuracy0.8383
Figure40InstanceRotationPipelinetestingconfusionmatrix–VGG16Classifier
44
2.2.6 EnsemblePipeline
FortheEnsemblePipeline,obviously,weneedanotherCNNclassifier,ResNet50,andtrainiton
theinstancerotationdataset.SimilartothemodificationinVGG16,weloadtheImageNetpre-trained
ResNet50modelandaddedanotherfullyconnectedlayerwithoutput9.Weused(3,224,224)asthe
input,initiallearningrate1e-4,batchsize64andtrainedthenetworkbyreducingthelearningrateby
0.1ifthevalidationaccuracydoesnotimprovein30epochs.
input lr # epoch batch size time / epoch total time last update
epoch loss val_loss val_Acc
3,224,224 1e-4 98 64 46 s 1h 15m 8s 68 0.1912 0.27 0.8717
Table12ResNet50trainingphase
After98epochs,weachieved87.17%accuracyonvalidationdatasetwiththetrainingloss
0.1912andvalidationloss0.27.Thefiguresshowthecurveoflossandaccuracyontrainingand
validationdataset.
Figure41EnsemblePipeline-ResNet50lossandaccuracycurve
45
Forthetrainingdataset,wegetthefollowingconfusionmatrixwiththeaccuracy
Figure42EnsemblePipelinetrainingconfusionmatrix–ResNet50Classifier
Forthevalidationdataset,wegetthefollowingconfusionmatrixwiththeaccuracy
Figure43EnsemblePipelinevalidationconfusionmatrix–ResNet50Classifier
Forthetestingdataset,wegetthefollowingconfusionmatrixwiththeaccuracy0.8683
46
Figure44EnsemblePipelinetestingconfusionmatrix–ResNet50Classifier
Fortheensemblemodel,weuseaweightedaverageinsteadoftheaveragealgorithm.Inorder
tofindthebestratio,weuseVGGasastandard,andtheratioisincreasedby0.001from0to1aswe
mentionedindesignpart.Onthevalidationdataset,wehavethefollowingcurve.
Figure45VGGratio-validationaccuracycurve
47
When𝑟𝑎𝑡𝑖𝑜ABB =0.501,themaximumaccuracyoftheensemblemodelis0.9098.
Forthetrainingdataset,wegetthefollowingconfusionmatrixwiththeaccuracy0.9915
Figure46EnsemblePipelinetrainingconfusionmatrix–WeightedClassifier
Forthevalidationdataset,wegetthefollowingconfusionmatrixwiththeaccuracy0.9098
48
Figure47EnsemblePipelinevalidationconfusionmatrix–WeightedClassifier
Forthetestingdataset,wegetthefollowingconfusionmatrixwiththeaccuracy0.8802
Figure48EnsemblePipelinetestingconfusionmatrix–WeightedClassifier
49
3. RESULTS
In the implementation part, we use different image dataset for training on VGG16 and
ResNet50model.Foreachoftheimage-classifiercombinationsweuse,wehavethefollowingaccuracy
chart.
Figure49Image-ClassifierAccuracy
Bypreprocessingthe imagesand lettingtheclassifierfocusonthefish instance itself,wecan
improvetheaccuracymoreeffectively.Forallpairsofimage-classifier,thevalidationaccuracyishigher
than test accuracy. Only considering the classifier, the Ensemble Pipeline performs best both on
validationandtestdataset.
Hereweregardeachpipelineasasystemandexecutethemonthetestset.Foreachpipeline,
wegetthefollowingconfusionmatrix.
52
For each system, execution time is another standard formeasuring systemefficiency on the
samedata.Finally,combinedwithalltheinformationobtained,wehavethefollowingresults.
# of Test Image # of fish
# of detected
‘fish’
tp in detection
correct classificatio
n
Accuracy correct
classification / # fish
Time (s)
Image Pipeline 479 - - - 362 75.57 % 13.16
Instance Pipeline 479 501 500 494 401 80.03 % 53.16
Instance Rotation Pipeline
VGG16 479 501 500 494 415 82.83 % 132.15
ResNet50 479 501 500 494 428 85.43 % 138.02
Ensemble Pipeline 479 501 500 494 437 87.22 % 140.58
Table13Finaltestresults
EnsemblePipelinegetshighest classificationaccuracy,87.22%,on the same testdataset,but
takesthe longest time,140.58s.Even ifEnsemblePipelinespendsthe longest time,westillhaveto
chooseensembleasthefinalmodel,becauseinourproject,accuracyisthemostimportant.
53
4. CONCLUSIONANDFUTUREWORK
4.1 ConclusionandContribution
Throughthestep-by-steppurificationoftheimages,theclassifiercanfocusonthefishinstance
itself, filter out unwanted information such as background and orientation, and our classifier can
obtainhigheraccuracy.
In thisproject,wecameupwith fishclassificationpipelineonmess images.Forour selected
pipeline, Ensemble Pipeline, given the input RGB image, we can return the position, category, and
confidenceoffishtotheuserasasystem.Finally,weachieved87.22%accuracyfortheclassification
task.
4.2 FutureWork
In the future, we will improve classification accuracy by trying Automatic Hyper-Parameter
Adjustment.Tomorefocusontheinformationweneedandclassifythefishbyitsprominentfeature,
wearegoingtoextractthetailorfinsoffish,anduseanothermodeltoclassifythemforsupporting
thepredictionresult.For theensemblemethod,weightedaverage isnot thebest,wecould try tiny
networkinsteadofit.
Moreover,wewouldliketouseadatavisualizationmethodtoreturntheuseranoutputimage
withtheinformationonittoeasilygrabtheposition,species,andconfidenceoffish.
54
5. REFERENCES
[1] Liu, Wei, et al. "Ssd: Single shot multibox detector." European conference on computer vision.
Springer,Cham,2016.
[2]K.SimonyanandA.Zisserman.Verydeepconvolutionalnetworksforlarge-scaleimagerecognition.
InICLR,2015.
[3] He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE
conferenceoncomputervisionandpatternrecognition.2016.
[4]Chen,Guang,etal."AutomaticFishClassificationSystemUsingDeepLearning."InICTAI,2017
[5]He,Kaiming,et al. "Mask r-cnn."ComputerVision (ICCV),2017 IEEE InternationalConferenceon.
IEEE,2017.
[6] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep
convolutionalneuralnetworks."Advancesinneuralinformationprocessingsystems.2012.
[7]Redmon,Joseph,etal."Youonlylookonce:Unified,real-timeobjectdetection."Proceedingsofthe
IEEEconferenceoncomputervisionandpatternrecognition.2016.
[8]M.Ravanbakhsh,M.R.Shortis,F.Shafait,A.Mian,E.S.Harvey,andJ.W.Seager,“Automatedfish
detection in underwater imagesusing shape-based level sets,”ThePhotogrammetric Record,
vol.30,no.149,pp.46–62,2015.