
APPLICATION OF DEEP LEARNING TO FISH RECOGNITION

A Project Presented to The Faculty of the Graduate School at the University of Missouri

In Partial Fulfillment of the Requirements for the Degree Master of Science

Implemented and Defended by: Fuming Xiang

Prof. Yi Shang, Advisor

May 2018

TABLE OF CONTENTS

List of Figures
List of Tables
ACKNOWLEDGEMENTS
Abstract
Introduction
1. Related Works
1.1 Convolutional Neural Network (CNN)
1.2 VGG16
1.3 ResNet50
1.4 SSD Caffe
2. Design and Implementation
2.1 Design
2.1.1 Data Preprocessing
2.1.2 Image Pipeline
2.1.3 Instance Pipeline
2.1.4 Instance Rotation Pipeline
2.1.5 Ensemble Pipeline
2.2 Implementation
2.2.1 Configuration
2.2.2 Evaluation Metric
2.2.3 Image Pipeline
2.2.4 Instance Pipeline
2.2.5 Instance Rotation Pipeline
2.2.6 Ensemble Pipeline
3. Results
4. Conclusion and Future Work
4.1 Conclusion and Contribution
4.2 Future Work
5. References

List of Figures

Figure 1 Fish Species and Distribution in Missouri
Figure 2 Fish size ratio distribution
Figure 3 VGG16 Architecture
Figure 4 ResNet50 Architecture
Figure 5 SSD Caffe Architecture
Figure 6 Examples of grayscale, non-fish or missing-fish, and ambiguous images
Figure 7 Example of a labeled image
Figure 8 Image Pipeline
Figure 9 Instance Pipeline
Figure 10 Instance image generation
Figure 11 Instance Rotation Pipeline
Figure 12 Orientation split
Figure 13 Orientation vector
Figure 14 Instance Rotation image generation
Figure 15 Ensemble Pipeline
Figure 16 Confusion Matrix Example
Figure 17 Intersection over Union
Figure 18 Threshold example of Intersection over Union
Figure 19 Python code for accuracy calculation
Figure 20 Accuracy_0 and Accuracy_1 orientation distribution
Figure 21 Modified VGG16 Architecture
Figure 22 Compiling the model in Keras with the cross-entropy loss function
Figure 23 Image Pipeline loss and accuracy curves – VGG16 Classifier
Figure 24 Image Pipeline training confusion matrix – VGG16 Classifier
Figure 25 Image Pipeline validation confusion matrix – VGG16 Classifier
Figure 26 Image Pipeline testing confusion matrix – VGG16 Classifier
Figure 27 SSD Caffe training loss
Figure 28 SSD Caffe precision-recall curve on the test dataset
Figure 29 Instance Pipeline loss and accuracy curves – VGG16 Classifier
Figure 30 Instance Pipeline training confusion matrix – VGG16 Classifier
Figure 31 Instance Pipeline validation confusion matrix – VGG16 Classifier
Figure 32 Instance Pipeline testing confusion matrix – VGG16 Classifier
Figure 33 Pose Estimator loss and accuracy curves
Figure 34 Pose Estimator training confusion matrix
Figure 35 Pose Estimator validation confusion matrix
Figure 36 Pose Estimator testing confusion matrix
Figure 37 Instance Rotation Pipeline loss and accuracy curves – VGG16 Classifier
Figure 38 Instance Rotation Pipeline training confusion matrix – VGG16 Classifier
Figure 39 Instance Rotation Pipeline validation confusion matrix – VGG16 Classifier
Figure 40 Instance Rotation Pipeline testing confusion matrix – VGG16 Classifier
Figure 41 Ensemble Pipeline - ResNet50 loss and accuracy curves
Figure 42 Ensemble Pipeline training confusion matrix – ResNet50 Classifier
Figure 43 Ensemble Pipeline validation confusion matrix – ResNet50 Classifier
Figure 44 Ensemble Pipeline testing confusion matrix – ResNet50 Classifier
Figure 45 VGG ratio - validation accuracy curve
Figure 46 Ensemble Pipeline training confusion matrix – Weighted Classifier
Figure 47 Ensemble Pipeline validation confusion matrix – Weighted Classifier
Figure 48 Ensemble Pipeline testing confusion matrix – Weighted Classifier
Figure 49 Image-Classifier Accuracy
Figure 50 Image Pipeline test accuracy
Figure 51 Instance Pipeline test accuracy
Figure 52 Instance Rotation Pipeline test accuracy
Figure 53 Ensemble Pipeline test accuracy

List of Tables

Table 1 Dataset downloaded from ImageNet
Table 2 Image Pipeline dataset
Table 3 Instance Pipeline dataset
Table 4 Pose estimator dataset
Table 5 Instance Rotation Pipeline dataset
Table 6 Ensemble Pipeline dataset
Table 7 Development environment configuration
Table 8 Image Pipeline training phase – VGG16 Classifier
Table 9 Instance Pipeline training phase – VGG16 Classifier
Table 10 Pose Estimator training phase
Table 11 Instance Rotation Pipeline training phase – VGG16 Classifier
Table 12 ResNet50 training phase
Table 13 Final test results

ACKNOWLEDGEMENTS

First, I would like to thank Prof. Yi Shang for his mentorship, guidance, and support throughout my graduate school and this research project. I would also like to thank Joel for the opportunity to collaborate with the Missouri Department of Conservation and for their feedback on this project.

I would also like to thank everyone from the Computer Science lab, especially Guang Chen, Peng Sun, Yang Liu, and Nickolas Wergeles, for sharing their experience and being an integral part of this project. I would also like to show my appreciation to Jodie Lenser and Shirley Holdmeier for guiding me through my application process to Mizzou as well as advising me on my courses.

Finally, I would like to thank my family. Without their support, trust, and belief in me, I would not be where I am today. I hope I have done you proud.

ABSTRACT

Over the past few years, deep learning has been widely used and has obtained very good results in image recognition. In this project, several state-of-the-art deep learning models and their combinations have been applied to fish recognition in images, in particular nine common species of fish in Missouri rivers. Four different data-processing and machine-learning pipelines have been developed, and extensive experiments have been conducted to evaluate their performance. The deep convolutional neural network (CNN) models used in these pipelines include SSD, VGG16, and ResNet50. The four pipelines are image-based, instance-based, instance-rotation-based, and ensemble, in order of increasing complexity. Without any preprocessing, the image-based pipeline takes an entire image as input and classifies it into one of the target classes using deep CNNs. This pipeline achieved up to 75.57% classification accuracy on our test dataset. The instance-based pipeline consists of object detection by one deep CNN followed by classification by another deep CNN. This method achieved up to 80.03% accuracy on our test dataset. The instance-rotation-based pipeline adds a deep CNN for pose estimation between object detection and classification. The posture-adjusted fish image is used as the input to the classification model, which helps the pipeline achieve up to 82.83% accuracy on the same dataset. Finally, the ensemble pipeline is a combination of two instance-rotation-based pipelines that differ only in the classification model: one uses VGG16 and the other ResNet50. The ensemble pipeline achieved up to 87.22% accuracy, significantly outperforming all other pipelines.

INTRODUCTION

This project was proposed by the Missouri Department of Conservation (MDC) to address the problem that counting and classifying fish in images manually takes a huge amount of work and time. Since convolutional neural networks have been performing better and better, and in some cases can now achieve higher accuracy than human recognition, we use existing CNN models to solve the fish classification problem. The primary task of this project is to classify nine different species of fish in Missouri rivers.

Figure 1 Fish Species and Distribution in Missouri

The classification task takes one or a batch of RGB images and returns the species or the probability of each fish species.

All the data used in this project was downloaded from ImageNet, a large image dataset with categorical labels. The dataset contains 4842 images distributed across 9 classes; each image may contain several fish instances, but only one species.

Table 1 Dataset downloaded from ImageNet

Among the whole dataset, most of the images contain one fish. To visualize the size of the fish, we define a ratio: the labeled box area of the fish divided by the image area, both measured in pixels.

$$Ratio_{fish} = Area_{fish\,box} / Area_{image}$$

Figure 2 Fish size ratio distribution

1. RELATED WORKS

1.1 Convolutional Neural Network (CNN)

The Convolutional Neural Network is a kind of artificial neural network which has become a research hotspot in the fields of speech analysis and image recognition. Its weight-sharing network structure makes it more similar to biological neural networks, reducing the complexity of the network model and the number of weights. This advantage is more obvious when the input of the network is a multi-dimensional image, so that the image can be used directly as the input of the network, avoiding the complicated feature extraction and data reconstruction processes of traditional recognition algorithms. A convolutional network is a multi-layer perceptron specially designed to identify two-dimensional shapes. This network structure is highly invariant to translation, scaling, tilting, and other forms of deformation.

CNN was the first learning algorithm to successfully train multi-layer network structures. It uses spatial relationships to reduce the number of parameters that need to be learned, improving the training performance of general backpropagation algorithms. CNNs were proposed as a deep learning framework to minimize data preprocessing requirements. In a CNN, a small part of the image (the local receptive field) is used as the lowest-level input to the hierarchical structure. The information is then transmitted through different layers, each of which applies a digital filter to obtain the most significant features of the observed data. This method can obtain significant features of the observed data that are invariant to translation, scaling, and rotation, because the local receptive field of the image allows the neuron or processing unit to access the most basic features, such as oriented edges or corner points.

1.2 VGG16

In this project, different CNNs were used, such as VGG16, ResNet50, and SSD Caffe. The original title of the VGG paper is "Very Deep Convolutional Networks for Large-Scale Image Recognition", published at ICLR 2015. VGG16 and VGG19 are two very good DCNNs (Deep Convolutional Neural Networks) proposed in that paper. This series of models achieved top performance in ILSVRC-2014 and has produced good results in other DCNN-based work (e.g., FCN). Compared to AlexNet, VGG makes more accurate predictions on images at the cost of more memory. VGG16 has 5 convolutional blocks, each followed by max pooling, and 3 fully connected layers with ReLU activations, ending with a softmax function. Each convolutional block contains 2 or 3 convolutional layers.

Figure 3 VGG16 Architecture

1.3 ResNet50

The original paper of the ResNet model is "Deep Residual Learning for Image Recognition". It might seem that the expressiveness of the network is enhanced as the depth of the network increases, and the experiments of Kaiming He and others proved that, among network structures of the same complexity, the relatively deeper network performs better. However, this does not hold indefinitely. Setting aside the issue of computational cost, once the network is very deep, adding more layers no longer improves performance but instead causes gradient diffusion. Under the guidance of this idea, Deep Residual Learning (ResNet) was proposed. ResNet has two basic blocks. One is the Identity Block, whose input and output dimensions are the same. The other is the Conv Block, whose input and output dimensions differ, so such blocks cannot simply be stacked in series; its role was initially to change the dimension of the feature vector.

Figure 4 ResNet50 Architecture

1.4 SSD Caffe

The SSD algorithm, proposed in the paper "SSD: Single Shot MultiBox Detector", is an object detection algorithm that directly predicts the coordinates and categories of bounding boxes without a proposal generation process. For the detection of objects of different sizes, the traditional approach is to convert the image into different sizes, process them separately, and finally synthesize the results. In this paper, SSD synthesizes feature maps from different convolutional layers to achieve the same effect. The main network structure of the algorithm is VGG16, which transforms two fully connected layers into convolutional layers and then adds four convolutional layers to construct the network structure. The outputs of five different convolutional layers are each convolved with two 3x3 convolution kernels: one for classification, where each default box generates 21 confidences, and the other for localization, where each default box generates 4 coordinate values (x, y, w, h). In addition, these 5 convolutional layers also generate the default boxes (the generated coordinates) via the prior box layer. Finally, the first three computation results are merged and passed to the loss layer.

Figure 5 SSD Caffe Architecture

2. DESIGN AND IMPLEMENTATION

2.1 Design

2.1.1 Data Preprocessing

Images were downloaded from ImageNet via links matched by keywords. Thus, some of the data is not suitable for our project, such as grayscale images, non-fish or missing-fish images, and ambiguous images. We need to remove these images to clean the dataset for this project.

Figure 6 Examples of grayscale, non-fish or missing-fish, and ambiguous images

Since the images downloaded from ImageNet only provide categorical labels, in order to train the networks, especially for object detection and pose estimation, we need to label the fish with an annotation tool: Sloth.

Sloth's purpose is to provide a versatile tool for various labeling tasks in the context of computer vision research. In this project, for every image, the label we need is a bounding box (red in Figure 7) containing a particular species of fish, together with its head and tail coordinates (green points in Figure 7).

Figure 7 Example of a labeled image

2.1.2 Image Pipeline

The Image Pipeline is the basic pipeline in this project. It takes the unprocessed images as input. For this pipeline, we used VGG16 as the prediction model. Since VGG16 was trained on ImageNet and the output of the pre-trained model has 1000 classes, in order to use the pre-trained weights to get higher accuracy, we used pre-trained VGG16 as the base model and added another dense (fully connected) layer with output size 9.

Figure 8 Image Pipeline

We randomly split the 4842 images into training (80%), validation (10%), and testing (10%) datasets, stratified by species.
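A minimal sketch of such a stratified split, assuming the image paths and species labels are held in parallel lists (the variable names and the use of scikit-learn's train_test_split here are illustrative; the report does not show the original split code):

```python
from sklearn.model_selection import train_test_split

# image_paths and labels are parallel lists (illustrative names).
# First hold out 20% of the data, stratified by species, ...
train_paths, rest_paths, train_labels, rest_labels = train_test_split(
    image_paths, labels, test_size=0.2, stratify=labels, random_state=42)
# ...then split that 20% evenly into validation and test sets.
val_paths, test_paths, val_labels, test_labels = train_test_split(
    rest_paths, rest_labels, test_size=0.5, stratify=rest_labels, random_state=42)
```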

Table 2 Image Pipeline dataset

2.1.3 Instance Pipeline

However, in our project, as we saw in the data overview, the fish in each image is small, mostly smaller than half of the image size. This means an image contains plenty of useless information, which hurts classification performance. Thus, fish detection is necessary. In the Instance Pipeline, detection is added before the VGG16 classification; the detector outputs the box position of the fish, and we pad the box to avoid distortion and crop it.

Figure 9 Instance Pipeline

Based on the pipeline we designed, cropped images are needed to train the classifier. The generated images come from the Image Pipeline dataset. We padded and cropped each fish into a square instance image. If the border of the crop falls outside the image border, we pad after cropping. In the end, we obtained 5136 instance images.
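A minimal sketch of this square crop with padding, assuming the image is a NumPy array in (height, width, channel) order and the labeled box is given as (x, y, w, h); the function name and box format are assumptions, not the report's code:

```python
import numpy as np

def crop_square_instance(image, box):
    """Crop a square region centered on the labeled box,
    zero-padding wherever the square extends past the image border."""
    x, y, w, h = box                      # illustrative (x, y, w, h) box format
    side = max(w, h)                      # side of the square crop
    cx, cy = x + w // 2, y + h // 2       # box center
    x0, y0 = cx - side // 2, cy - side // 2
    out = np.zeros((side, side, image.shape[2]), dtype=image.dtype)
    # Intersection between the square and the image bounds.
    ix0, iy0 = max(x0, 0), max(y0, 0)
    ix1 = min(x0 + side, image.shape[1])
    iy1 = min(y0 + side, image.shape[0])
    out[iy0 - y0:iy1 - y0, ix0 - x0:ix1 - x0] = image[iy0:iy1, ix0:ix1]
    return out
```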

Figure 10 Instance image generation

Table 3 Instance Pipeline dataset

2.1.4 Instance Rotation Pipeline

In the Instance Pipeline, we used object detection before image classification to filter out useless information by cropping the fish out. However, the cropped images still contain some information we can filter out: the orientation of the fish. Thus, we added a pose estimator after the object detection to align the fish to a particular orientation.

Figure 11 Instance Rotation Pipeline

Among the 5136 instance images, we assign the pose of each fish to one of 16 orientations on a circle, each orientation covering 22.5 degrees.

Figure 12 Orientation split

To determine which orientation a fish should be assigned to, we use the head and tail coordinates to calculate the vector of each fish.

Figure 13 Orientation vector

$$vector = coordinate_{head} - coordinate_{tail}$$
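A minimal sketch of this orientation binning, assuming head and tail are (x, y) pixel coordinates; the function name is illustrative, and note that the y-axis of image coordinates points down:

```python
import math

def orientation_bin(head, tail, n_bins=16):
    """Map the tail-to-head vector to one of n_bins orientations,
    each covering 360 / n_bins = 22.5 degrees."""
    dx, dy = head[0] - tail[0], head[1] - tail[1]
    angle = math.degrees(math.atan2(dy, dx)) % 360.0   # 0..360 degrees
    return int(round(angle / (360.0 / n_bins))) % n_bins
```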

For the training, validation, and testing data of the pose estimator, we applied data augmentation, such as flipping and rotating, to enlarge the dataset. In the end, we have 7687 images in the estimator dataset.

Table 4 Pose estimator dataset

For the training, validation, and testing data of the VGG16 classifier, we align the images from the Instance Pipeline VGG16 classifier dataset.

Figure 14 Instance Rotation image generation

After alignment, we obtained the training, validation, and testing data for the VGG16 classifier in the Instance Rotation Pipeline.

Table 5 Instance Rotation Pipeline dataset

2.1.5 Ensemble Pipeline

In the results of massive machine learning competitions, the best result is usually an ensemble model rather than a single model. For example, the highest-scoring single-model structure in ILSVRC 2015 achieved only 13th place; the 1st through 12th places all used some type of model ensemble.

There are many different types of ensemble models; one of them is stacking. This type is more general and can theoretically characterize any other ensemble technique. For this pipeline, I use the purest form of stacking, which involves averaging the ensemble model outputs. Since the averaging process does not include any parameters, it does not require training (only the trained base models are needed).

In the Ensemble Pipeline, we train another classifier from an ImageNet pre-trained model, ResNet50, and take a weighted average to get the ensemble result.

Figure 15 Ensemble Pipeline

Given the prediction results of VGG16 and ResNet50, we apply a weighted average to get the final result for this pipeline.

$$Pred_{ensemble} = ratio_{VGG} \cdot pred_{VGG} + (1 - ratio_{VGG}) \cdot pred_{ResNet}$$

The only additional work for the Ensemble Pipeline is training the ResNet50 model in the same way as the VGG16 model in the Instance Rotation Pipeline. Thus, the dataset is the same.

Table 6 Ensemble Pipeline dataset

2.2 Implementation

2.2.1 Configuration

The project's development environment is Linux, Ubuntu 16.04, using Python as the only development language. Python dependency package version information is as follows.

| Package Name | Version | Description |
|---|---|---|
| h5py | 2.7.1 | tools for *.h5 files (Keras weights) |
| imutils | 0.4.3 | image processing tools |
| jupyter | 1.0.0 | Python GUI (in browser) |
| Keras | 1.2.0 | high-level machine learning API |
| matplotlib | 2.1.0 | data visualization |
| numpy | 1.14.0 | Python scientific computing library |
| pandas | 0.20.3 | Python data analysis library |
| pillow | 3.1.2 | image processing tools |
| pip | 9.0.1 | Python package management tool |
| progressbar | 2.3.0 | progress bar view components |
| python | 2.7 | language |
| scikit-image | 0.13.1 | image processing tools |
| scikit-learn | 0.19.1 | machine learning toolkit |
| scipy | 1.0.0 | Python algorithm library and math toolkit |
| six | 1.11.0 | library for compatibility with Python 2 and Python 3 |
| sloth | 1.0 | annotation tool |
| Theano | 0.9.0 | used to define, optimize, and efficiently evaluate mathematical expressions over multi-dimensional arrays |

Table 7 Development environment configuration

2.2.2 Evaluation Metric

In this project, we use object detection and classification (the pose estimator and the fish classifier). In object detection, precision and recall are the primary evaluation metrics; for classification, calculating the accuracy is a simple way to evaluate the model.

2.2.2.1 Object Detection Accuracy

Traditionally, in the evaluation of object detection algorithms, for a single detection file and its corresponding ground truth file, two values, recall and precision, can be calculated. Usually, in the object detection task, we use a confusion matrix to summarize the prediction results. A confusion matrix is a summary of prediction results on a detection or classification problem: the numbers of correct and incorrect predictions are summarized with count values and broken down by class. To calculate the confusion matrix:

1. You need a test dataset or a validation dataset with expected outcome values.

2. Make a prediction for each row in your test dataset.

3. From the expected outcomes and predictions, count:

a. the number of correct predictions for each class;

b. the number of incorrect predictions for each class, organized by the class that was predicted.

Conveniently, scikit-learn provides a function that computes the confusion matrix from the labels and predictions. For object detection or binary classification, for example, we have a confusion matrix as follows:

Figure 16 Confusion Matrix Example

● True Positives (TP): cases in which we predicted yes, and the label is yes.

● True Negatives (TN): we predicted no, and the label is no.

● False Positives (FP): we predicted yes, but the label is no.

● False Negatives (FN): we predicted no, but the label is yes.

$$Precision = TP / (TP + FP)$$

$$Recall = TP / (TP + FN)$$
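A minimal sketch of these computations with scikit-learn and NumPy (the label arrays are illustrative; sklearn.metrics.confusion_matrix is the function mentioned above):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 1])   # illustrative binary labels
y_pred = np.array([1, 0, 0, 1, 1, 1])   # illustrative predictions

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

precision = float(tp) / (tp + fp)
recall = float(tp) / (tp + fn)
```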

Intuitively, recall tells us how many of our objects have been detected, and precision gives us information about the number of false alarms. Higher is better for both, and both should be close to 1 for a perfect system.

In order to calculate these two measures (recall and precision), an evaluation tool needs to find out two things:

1. for each ground truth rectangle, whether it has been correctly detected or not;

2. for each detected rectangle, whether it corresponds to a valid ground truth rectangle or not.

In order to measure whether an object is correctly detected, we use Intersection over Union.

Figure 17 Intersection over Union

Examining this equation, you can see that Intersection over Union is simply a ratio. In the numerator, we compute the area of overlap between the predicted bounding box and the ground-truth bounding box. The denominator is the area of union, or more simply, the area encompassed by both the predicted bounding box and the ground-truth bounding box. Dividing the area of overlap by the area of union yields our final score, the Intersection over Union.

In reality, it is extremely unlikely that the (x, y)-coordinates of our predicted bounding box will exactly match the (x, y)-coordinates of the ground-truth bounding box. Due to varying parameters of our model (image pyramid scale, sliding window size, feature extraction method, etc.), a complete and total match between predicted and ground-truth bounding boxes is simply unrealistic. Because of this, we need to define an evaluation metric that rewards predicted bounding boxes for heavily overlapping with the ground truth:

Figure 18 Threshold example of Intersection over Union

As you can see, predicted bounding boxes that heavily overlap with the ground-truth bounding boxes have higher scores than those with less overlap. This makes Intersection over Union an excellent metric for evaluating custom object detectors.
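A minimal sketch of the Intersection over Union computation described above, assuming boxes are given as (x1, y1, x2, y2) corner coordinates (the function name and box format are illustrative):

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    # Union = sum of the two areas minus the double-counted intersection.
    return inter / float(area_a + area_b - inter)
```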

2.2.2.2 Classification Accuracy

Accuracy is one metric for evaluating classification models. Informally, accuracy is the fraction of predictions our model got right. Formally, accuracy has the following definition:

$$Accuracy = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$

Usually, in a classification task, we use the confusion matrix in the same way as in object detection, but with multiple classes. By looking at the confusion matrix, we can calculate the accuracy easily:

$$Accuracy = \frac{Trace_{matrix}}{Sum_{matrix}}$$

2.2.2.3 Pose Estimation Accuracy

Pose estimation is a classification task, but we apply two different measurements to calculate the accuracy of the model:

1. Traditional classification accuracy measurement.

2. Offsetting classification accuracy measurement.

In the traditional classification accuracy measurement, we calculate the confusion matrix and get the accuracy using the formula above. However, in some cases, we allow some offset in the predicted fish rotation, which means a neighboring orientation is also acceptable.

$$Accuracy_0 = \frac{TruePositives}{\text{Number of predictions}}$$

$$Accuracy_1 = \frac{OffsettingTruePositives}{\text{Number of predictions}}$$

where

$TruePositives: (prediction == groundtruth)$, and

$OffsettingTruePositives: (prediction == groundtruth \;||\; prediction == groundtruth_{adjoining})$

Usually, in NumPy, we can compute this easily with the following code (cm is the confusion matrix):

Figure 19 Python code for accuracy calculation
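Figure 19 is an image in the original report; the following NumPy sketch is consistent with the formulas above, assuming the 16 orientation classes wrap around the circle (so the neighbors of class 0 are classes 1 and 15):

```python
import numpy as np

def pose_accuracies(cm):
    """Traditional and offsetting accuracy from a confusion matrix cm,
    where cm[i, j] counts ground truth class i predicted as class j."""
    n = cm.shape[0]
    total = float(cm.sum())
    accuracy_0 = np.trace(cm) / total
    # Also count predictions that land on an adjoining orientation
    # (one bin off in either direction, wrapping around the circle).
    hits = sum(cm[i, i] + cm[i, (i - 1) % n] + cm[i, (i + 1) % n]
               for i in range(n))
    accuracy_1 = hits / total
    return accuracy_0, accuracy_1
```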

For example, in the following figure, if the ground truth orientation is 14, the figure shows which predictions are accepted under each of the two measurements.

Figure 20 Accuracy_0 and Accuracy_1 orientation distribution

2.2.3 Image Pipeline

In the Image Pipeline, as mentioned, we used a pre-trained VGG16 model and added another fully connected layer with output size 9. Since the pre-trained model is trained on the ImageNet dataset, and our data also comes from ImageNet, we only train the last two layers (transfer learning).

Figure 21 Modified VGG16 Architecture

For the loss function, we chose cross-entropy, which is commonly used in multi-class classification.

$$l(y, \hat{y}) = -\sum_{i} y_i \log \hat{y}_i$$

In Keras, it is easy to set up the loss function with the following code:

Figure 22 Compiling the model in Keras with the cross-entropy loss function
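Figure 22 is an image in the original report; a minimal Keras 1.2 sketch of the modification and compilation described above is shown below. The exact head replacement is an assumption from the text, not necessarily the author's exact code:

```python
from keras.applications.vgg16 import VGG16
from keras.layers import Dense
from keras.models import Model
from keras.optimizers import SGD

# ImageNet pre-trained VGG16; freeze everything but the last two layers.
base = VGG16(weights='imagenet', include_top=True)
for layer in base.layers[:-2]:
    layer.trainable = False

# Replace the 1000-way ImageNet head with a 9-way softmax (assumed wiring).
features = base.layers[-2].output          # last 4096-d fully connected output
predictions = Dense(9, activation='softmax')(features)
model = Model(input=base.input, output=predictions)

model.compile(optimizer=SGD(lr=1e-4),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```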

The input to the network is a batch of RGB images of size (3, 224, 224): 3 for the R, G, and B channels, and 224 is the input size of the traditional VGG structure. Any image should be resized to 224 by 224 before it is given to the network. Since we used Theano as the Keras backend, the channel is the first dimension. Anyone who wants to use TensorFlow as the Keras backend only needs to use (224, 224, 3) as the input size instead. Usually, in the training step of a traditional deep learning problem, 1e-4 is a good initial learning rate; when the loss value no longer declines over a certain number of epochs, or the accuracy on the validation dataset no longer rises, we can multiply the learning rate by 0.1 to achieve better results.

In this pipeline, we used (3, 224, 224) as the input, initial learning rate 1e-4, batch size 64, optimizer SGD, and trained the network, multiplying the learning rate by 0.1 whenever the validation accuracy does not improve for 30 epochs.
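Keras ships a callback for exactly this schedule; a minimal sketch of how it could be wired into training (the monitor name, epoch count, and fit arguments are illustrative):

```python
from keras.callbacks import ReduceLROnPlateau

# Multiply the learning rate by 0.1 when validation accuracy
# has not improved for 30 epochs.
reduce_lr = ReduceLROnPlateau(monitor='val_acc', factor=0.1, patience=30)

model.fit(x_train, y_train, batch_size=64, nb_epoch=200,
          validation_data=(x_val, y_val), callbacks=[reduce_lr])
```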

| input | lr | # epochs | batch size | time / epoch | total time | last update epoch | loss | val_loss | val_acc |
|---|---|---|---|---|---|---|---|---|---|
| 3,224,224 | 1e-4 | 56 | 64 | 39 s | 36 min 24 s | 25 | 0.1155 | 0.9666 | 0.77 |

Table 8 Image Pipeline training phase – VGG16 Classifier

After 25 epochs, we achieve 77% accuracy on the validation dataset, with training loss 0.1155 and validation loss 0.9666. The figures show the loss and accuracy curves on the training and validation datasets.

Figure 23 Image Pipeline loss and accuracy curves – VGG16 Classifier

For the training dataset, we get the following confusion matrix with accuracy 0.9866.

Figure 24 Image Pipeline training confusion matrix – VGG16 Classifier

For the validation dataset, we get the following confusion matrix with accuracy 0.7711.

Figure 25 Image Pipeline validation confusion matrix – VGG16 Classifier

For the test dataset, we get the following confusion matrix with accuracy 0.7557.

Figure 26 Image Pipeline testing confusion matrix – VGG16 Classifier

2.2.4 Instance Pipeline

In the Instance Pipeline, two models are trained on the instance dataset: the SSD Caffe detector and the VGG16 classifier.

For the SSD Caffe detector, we resize the images to 512 by 512 and use size (512, 512, 3) as the input, with bounding box labels. The training phase took 20,000 iterations in 12 hours. In SSD Caffe, the loss function is the combination of a confidence loss and a box location loss:

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$

where N is the number of predicted boxes matched with a ground truth box.

After 20,000 iterations, we got a final loss of 1.28.

Figure 27 SSD Caffe training loss

On the test dataset, we got the prediction results without any confidence threshold. In order to choose the threshold that gives the model relatively high precision and recall, we use the precision-recall curve to measure performance. For each confidence threshold from 0.0 to 1.0, we plot a point with recall as x and precision as y.

Figure 28 SSD Caffe precision-recall curve on the test dataset

The AUC (Area Under Curve) of this figure is the average precision. When the confidence threshold is 0.5, we have relatively high precision and recall, with an average precision of 0.9275.
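A minimal sketch of building such a precision-recall curve, assuming each detection has already been matched against the ground truth via IoU; scores and is_true_positive are illustrative parallel arrays, and n_ground_truth is the total number of labeled fish:

```python
import numpy as np

def precision_recall_points(scores, is_true_positive, n_ground_truth):
    """Precision and recall at each confidence threshold from 0.0 to 1.0."""
    scores = np.asarray(scores)
    tp_flags = np.asarray(is_true_positive, dtype=float)
    precisions, recalls = [], []
    for threshold in np.arange(0.0, 1.01, 0.01):
        keep = scores >= threshold          # detections surviving the threshold
        tp = tp_flags[keep].sum()
        n_det = keep.sum()
        precisions.append(tp / n_det if n_det else 1.0)
        recalls.append(tp / float(n_ground_truth))
    return recalls, precisions
```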

For the VGG16 classifier on the instance dataset, we use a similar method to modify the network structure: using the ImageNet pre-trained model, adding a fully connected layer with output size 9, and freezing all but the last two layers. We used (3, 224, 224) as the input, initial learning rate 1e-4, batch size 64, and trained the network, multiplying the learning rate by 0.1 whenever the validation accuracy does not improve for 30 epochs.

| input | lr | # epochs | batch size | time / epoch | total time | last update epoch | loss | val_loss | val_acc |
|---|---|---|---|---|---|---|---|---|---|
| 3,224,224 | 1e-4 | 114 | 64 | 37 s | 1 h 10 m 18 s | 83 | 0.3386 | 0.4751 | 0.8236 |

Table 9 Instance Pipeline training phase – VGG16 Classifier

After 83 epochs, we achieved 82.36% accuracy on the validation dataset, with training loss 0.3386 and validation loss 0.4751. The figures show the loss and accuracy curves on the training and validation datasets.

Figure 29 Instance Pipeline loss and accuracy curves – VGG16 Classifier

For the training dataset, we get the following confusion matrix with accuracy 0.9589.

Figure 30 Instance Pipeline training confusion matrix – VGG16 Classifier

For the validation dataset, we get the following confusion matrix with accuracy 0.8236.

Figure 31 Instance Pipeline validation confusion matrix – VGG16 Classifier

For the testing dataset, we get the following confusion matrix with accuracy 0.8064.

Figure 32 Instance Pipeline testing confusion matrix – VGG16 Classifier

2.2.5 Instance Rotation Pipeline

In the Instance Rotation Pipeline, the detector is the same as in the Instance Pipeline. Two models are needed: the Pose Estimator and the VGG16 Classifier on the instance-rotated dataset. Pose estimation is a traditional classification problem, so we still use the same method to modify the neural network based on VGG16: using the ImageNet pre-trained model, adding a fully connected layer with output size 16, and freezing all but the last two layers. We used (3, 224, 224) as the input, initial learning rate 1e-4, batch size 64, and trained the network, multiplying the learning rate by 0.1 whenever the validation accuracy does not improve for 30 epochs.

| input | lr | # epochs | batch size | time / epoch | total time | last update epoch | loss | val_loss | val_acc |
|---|---|---|---|---|---|---|---|---|---|
| 3,224,224 | 1e-4 | 53 | 64 | 54 s | 47 min 42 s | 23 | 0.2951 | 0.2944 | 0.9138 |

Table 10 Pose Estimator training phase

After 23 epochs, we achieved 91.38% accuracy on the validation dataset, with training loss 0.2951 and validation loss 0.2944. The figures show the loss and accuracy curves on the training and validation datasets.

Figure 33 Pose Estimator loss and accuracy curves

For the accuracy, we use the two different measurement methods mentioned in the Evaluation Metric section: accuracy_0 for traditional classification accuracy, and accuracy_1 for offsetting classification accuracy. For the training dataset, we get the following confusion matrix with accuracy_0: 0.9768 and accuracy_1: 0.9985.

Figure 34 Pose Estimator training confusion matrix

For the validation dataset, we get the following confusion matrix with accuracy_0: 0.9138 and accuracy_1: 0.9948.

Figure 35 Pose Estimator validation confusion matrix

For the testing dataset, we get the following confusion matrix with accuracy_0: 0.9239 and accuracy_1: 0.9961.

Figure 36 Pose Estimator testing confusion matrix

For the VGG16 classifier for instance rotation classification, we followed the same steps: modify the VGG16 structure from a base VGG model and add a fully connected layer (output shape 9). We used (3, 224, 224) as the input, initial learning rate 1e-4, batch size 64, and trained the network, multiplying the learning rate by 0.1 whenever the validation accuracy does not improve for 30 epochs.

| input | lr | # epochs | batch size | time / epoch | total time | last update epoch | loss | val_loss | val_acc |
|---|---|---|---|---|---|---|---|---|---|
| 3,224,224 | 1e-4 | 126 | 64 | 40 s | 1 h 24 min | 96 | 0.1039 | 0.3305 | 0.8938 |

Table 11 Instance Rotation Pipeline training phase – VGG16 Classifier

After 96 epochs, we achieved 89.38% accuracy on the validation dataset, with training loss 0.1039 and validation loss 0.3305. The figures show the loss and accuracy curves on the training and validation datasets.

Figure 37 Instance Rotation Pipeline loss and accuracy curves – VGG16 Classifier

For the training dataset, we get the following confusion matrix with accuracy 0.9915.

Figure 38 Instance Rotation Pipeline training confusion matrix – VGG16 Classifier

For the validation dataset, we get the following confusion matrix with accuracy 0.8938.

Figure 39 Instance Rotation Pipeline validation confusion matrix – VGG16 Classifier

For the testing dataset, we get the following confusion matrix with accuracy 0.8383.

Figure 40 Instance Rotation Pipeline testing confusion matrix – VGG16 Classifier

2.2.6 Ensemble Pipeline

For the Ensemble Pipeline, we need another CNN classifier, ResNet50, trained on the instance rotation dataset. Similar to the modification of VGG16, we load the ImageNet pre-trained ResNet50 model and add another fully connected layer with output size 9. We used (3, 224, 224) as the input, initial learning rate 1e-4, batch size 64, and trained the network, multiplying the learning rate by 0.1 whenever the validation accuracy does not improve for 30 epochs.

| input | lr | # epochs | batch size | time / epoch | total time | last update epoch | loss | val_loss | val_acc |
|---|---|---|---|---|---|---|---|---|---|
| 3,224,224 | 1e-4 | 98 | 64 | 46 s | 1 h 15 m 8 s | 68 | 0.1912 | 0.27 | 0.8717 |

Table 12 ResNet50 training phase

After 68 epochs, we achieved 87.17% accuracy on the validation dataset, with training loss 0.1912 and validation loss 0.27. The figures show the loss and accuracy curves on the training and validation datasets.

Figure 41 Ensemble Pipeline - ResNet50 loss and accuracy curves

For the training dataset, we get the following confusion matrix.

Figure 42 Ensemble Pipeline training confusion matrix – ResNet50 Classifier

For the validation dataset, we get the following confusion matrix with accuracy 0.8717.

Figure 43 Ensemble Pipeline validation confusion matrix – ResNet50 Classifier

For the testing dataset, we get the following confusion matrix with accuracy 0.8683.

Figure 44 Ensemble Pipeline testing confusion matrix – ResNet50 Classifier

For the ensemble model, we use a weighted average instead of a plain average. In order to find the best weight, we use the VGG ratio as the free parameter and increase it by 0.001 from 0 to 1, as mentioned in the design section. On the validation dataset, we obtain the following curve.

Figure 45 VGG ratio - validation accuracy curve

When $ratio_{VGG} = 0.501$, the maximum accuracy of the ensemble model is 0.9098.
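A minimal sketch of this weighted-average search, assuming pred_vgg and pred_resnet are (n_samples, 9) arrays of softmax probabilities on the validation set and y_val holds the integer labels (all names are illustrative):

```python
import numpy as np

best_ratio, best_acc = 0.0, 0.0
for ratio in np.arange(0.0, 1.001, 0.001):
    # Weighted average of the two classifiers' probabilities.
    pred = ratio * pred_vgg + (1.0 - ratio) * pred_resnet
    acc = np.mean(np.argmax(pred, axis=1) == y_val)
    if acc > best_acc:
        best_ratio, best_acc = ratio, acc

print("best ratio %.3f, accuracy %.4f" % (best_ratio, best_acc))
# e.g. 0.501, 0.9098 in our experiments
```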

For the training dataset, we get the following confusion matrix with accuracy 0.9915.

Figure 46 Ensemble Pipeline training confusion matrix – Weighted Classifier

For the validation dataset, we get the following confusion matrix with accuracy 0.9098.

Figure 47 Ensemble Pipeline validation confusion matrix – Weighted Classifier

For the testing dataset, we get the following confusion matrix with accuracy 0.8802.

Figure 48 Ensemble Pipeline testing confusion matrix – Weighted Classifier

3. RESULTS

In the implementation part, we used different image datasets for training the VGG16 and ResNet50 models. For each image-classifier combination we use, we have the following accuracy chart.

Figure 49 Image-Classifier Accuracy

By preprocessing the images and letting the classifier focus on the fish instance itself, we can improve the accuracy effectively. For all image-classifier pairs, the validation accuracy is higher than the test accuracy. Considering only the classifier, the Ensemble Pipeline performs best on both the validation and test datasets.

Here we regard each pipeline as a system and execute it on the test set. For each pipeline, we get the following confusion matrix.

Figure 50 Image Pipeline test accuracy

Figure 51 Instance Pipeline test accuracy

Figure 52 Instance Rotation Pipeline test accuracy

Figure 53 Ensemble Pipeline test accuracy

For each system, execution time is another standard for measuring system efficiency on the same data. Finally, combining all the information obtained, we have the following results.

| Pipeline | # of test images | # of fish | # of detected 'fish' | TP in detection | correct classifications | Accuracy (correct / # of fish) | Time (s) |
|---|---|---|---|---|---|---|---|
| Image Pipeline | 479 | - | - | - | 362 | 75.57 % | 13.16 |
| Instance Pipeline | 479 | 501 | 500 | 494 | 401 | 80.03 % | 53.16 |
| Instance Rotation Pipeline (VGG16) | 479 | 501 | 500 | 494 | 415 | 82.83 % | 132.15 |
| Instance Rotation Pipeline (ResNet50) | 479 | 501 | 500 | 494 | 428 | 85.43 % | 138.02 |
| Ensemble Pipeline | 479 | 501 | 500 | 494 | 437 | 87.22 % | 140.58 |

Table 13 Final test results

The Ensemble Pipeline achieves the highest classification accuracy, 87.22%, on the same test dataset, but takes the longest time, 140.58 s. Even though the Ensemble Pipeline spends the longest time, we still choose the ensemble as the final model, because in our project accuracy is the most important criterion.

4. CONCLUSION AND FUTURE WORK

4.1 Conclusion and Contribution

Through the step-by-step purification of the images, the classifier can focus on the fish instance itself, filtering out unwanted information such as background and orientation, and our classifier can obtain higher accuracy.

In this project, we developed fish classification pipelines for messy images. With our selected pipeline, the Ensemble Pipeline, given an input RGB image, the system returns the position, category, and confidence of each fish to the user. Finally, we achieved 87.22% accuracy for the classification task.

4.2 Future Work

In the future, we will try to improve classification accuracy with automatic hyper-parameter adjustment. To focus more on the information we need and classify fish by their prominent features, we are going to extract the tails or fins of fish and use another model to classify them, supporting the prediction result. For the ensemble method, the weighted average is not the best; we could try a tiny network instead.

Moreover, we would like to use a data visualization method to return to the user an output image annotated with the information, so the position, species, and confidence of each fish can be grasped easily.

5. REFERENCES

[1] Liu, Wei, et al. "SSD: Single shot multibox detector." European Conference on Computer Vision. Springer, Cham, 2016.

[2] K. Simonyan and A. Zisserman. "Very deep convolutional networks for large-scale image recognition." In ICLR, 2015.

[3] He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.

[4] Chen, Guang, et al. "Automatic Fish Classification System Using Deep Learning." In ICTAI, 2017.

[5] He, Kaiming, et al. "Mask R-CNN." Computer Vision (ICCV), 2017 IEEE International Conference on. IEEE, 2017.

[6] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems. 2012.

[7] Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.

[8] M. Ravanbakhsh, M. R. Shortis, F. Shafait, A. Mian, E. S. Harvey, and J. W. Seager, "Automated fish detection in underwater images using shape-based level sets," The Photogrammetric Record, vol. 30, no. 149, pp. 46–62, 2015.