YOLO Net on iOS

Maneesh Apte
Stanford University
[email protected]

Simar Mangat
Stanford University
[email protected]

Priyanka Sekhar
Stanford University
[email protected]

Abstract

Our project aims to investigate the trade-offs between speed and accuracy of implementing CNNs for real-time object detection on mobile devices. We focused on modifying the YOLO architecture to achieve optimal speed and accuracy for mobile use cases. Specifically, we investigated the effect of a fire layer, as used in SqueezeNet, on classification performance. While this layer did increase the speed of classification, the decrease in accuracy was too great for use on mobile. We then deployed the highest-performing tiny-YOLO net, which operates at 8-11 FPS with a mAP of 51%, to an iOS app using the Metal framework. The app can detect the 20 classes from the VOC dataset and translate the categories into 5 different languages in real time. This real-time detection is essentially a mobile Rosetta Stone that runs natively and is thus Wi-Fi independent.

1. Introduction

With the growing popularity of neural networks, object detection on images, and subsequently videos, has increasingly become a reality. However, fast object detection in real time with high accuracy still has large room for improvement. Our project is a combination of application and model modification. It aims to leverage existing work in the field of real-time object detection and extend this work to 1) explore the trade-offs between speed and accuracy for different models and 2) be deployed to iOS via an iPhone app that uses the phone's video camera.

The Darkflow [25] implementation of the Darknet YOLO architecture [21] served as the base for our modifications. The input to our net is images from the VOC 2012 dataset, and the output is an image with a bounding box, classification, and confidence, along with a JSON object containing all the prediction information.
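For illustration, one element of that JSON array can be modeled in Swift as below. The field names follow Darkflow's JSON output as we recall it, so treat them as an assumption to verify against the Darkflow repository:

import Foundation

// Mirrors one Darkflow prediction, e.g.
// {"label": "dog", "confidence": 0.56,
//  "topleft": {"x": 184, "y": 101}, "bottomright": {"x": 274, "y": 382}}
struct DarkflowPrediction: Codable {
    struct Point: Codable { let x: Int; let y: Int }
    let label: String        // predicted class
    let confidence: Double   // detection confidence
    let topleft: Point       // bounding box corners in pixels
    let bottomright: Point
}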

1.1. Dataset

Our primary dataset is from The PASCAL Visual Object Classes Project, abbreviated as VOC. The VOC challenge ran annually from 2007-2012, giving participants standardized image datasets for object class recognition and evaluating the performance of models via mean average precision (mAP) on their validation and test sets. Training/validation data is available for download for each year on the competition website: http://host.robots.ox.ac.uk/pascal/VOC/index.html.

The VOC 2012 data consists of 20 classes; the train/validation data has 11,530 images containing 27,450 ROI (region of interest) annotated objects and 6,929 segmentations.

We chose this dataset because of its simple, broad categories, which prove useful in introductory language learning. In addition, the relatively small number of classes (20) enabled us to train a model that was feasible to run on mobile. For use in the current version of YOLO, images were resized to 416x416 resolution during training and test.

1.2. Methodology Overview

After surveying a number of other real-time object detection algorithms, such as DPM, R-CNN, Fast R-CNN, and Faster R-CNN, we decided to proceed with the YOLO (You Only Look Once) model [21]. YOLO provides fairly accurate object detection at extremely fast speeds.

We focused specifically on modifying and deploying tiny-YOLO [21], a version of YOLO that uses 9 convolution layers and 6 pooling layers instead of the 24 convolution layers in the traditional YOLO net.

We then used the Metal framework and the Forge library [10] to translate our architecture to Swift for deployment on our iPhones [12].

2. Background and Related Work

We surveyed several object detection architectures and mobile frameworks.

We first investigated Deformable Part Models, which are graphical models that can be used for predictions. According to Girshick et al., these models can be formulated as CNNs [7]. Current literature builds on this work to investigate how deformable convolution and deformable pooling can enhance the transformation modeling capability of the CNN [3]. However, because of our focus on mobile, we shifted our attention to other models with a greater emphasis on parameter reduction.

The first of these models is the R-CNN, which proposes regions of interest and classifies regions with an SVM after convolution on each region [6]. The speed of this model is improved upon in Fast R-CNN [5] and Faster R-CNN [23], which speeds up passes by introducing Region Proposal Networks (RPNs) to predict proposals from features.

Figure 1: Comparison of R-CNN speeds [1]

The R-CNN is often used as a benchmark for the speed-accuracy trade-off. For work focused exclusively on humans and human poses, Mask R-CNN is a promising path. This architecture includes joint coordinates in addition to box coordinates [8].

We then looked into the "You Only Look Once" (YOLO) net [20] and decided this framework would be best suited for the limited resources available on mobile. The original net was improved upon in 2015, and this YOLO v2 serves as the primary focus of our experiments [21].

Our next focus was finding methods of parameter reduction to increase speed without sacrificing accuracy. SqueezeNet [15] provided a great overview of the effect of downsampling on AlexNet [17]. The promising work of SqueezeDet [26], which builds on the SqueezeNet principles and applies them to self-driving car applications, was particularly interesting and validated our thoughts that SqueezeNet had relevant ideas that could transcend AlexNet. While we did survey methods of compression, such as one-shot whole-network compression with Bayesian matrix factorization, Tucker decomposition on the kernel tensor, and fine-tuning to recover accumulated loss of accuracy, as presented by Kim et al. [16], we decided the best path forward, given our prior familiarity with SqueezeNet, was to focus on using its principles for parameter reduction in a different model.

We surveyed several architectures that might benefit from downsampling. These included high-parameter architectures such as ResNet [9] and GoogLeNet [24]. GoogLeNet also provided us with a foundational understanding of how 1x1 filters and concatenation can effectively reduce parameters without losing expressivity. However, we decided that, even if heavily optimized, these would still run too slowly on modern mobile devices, and we continued to focus on faster models with better real-time detection potential. We also looked at the current trade-offs of various models as presented by Huang et al. [14], which showed that combinations of models are often optimal.

While we decided to use the VOC dataset, we also considered the COCO dataset [18], which has far more categories and may be the focus of future studies. For this study, VOC provided the focus we needed.

Finally, MobileNet from Google [13] has demonstrated that depth-wise separable convolutions can remove the bottleneck of convolution layers with many channels. These "thinner" models and the methodology behind them inform our future work.

We chose to focus on YOLO because it provided a decent baseline for accuracy/speed and was most accessible for modification.

3. Technical Approach

3.1. YOLO

The YOLO model's novel motivation is that it re-frames object detection as a single regression problem, directly from image pixels to bounding box coordinates and class probabilities. This means that the YOLO model only "looks once" at an image for object detection.

It works by dividing an image into an S x S grid. Each grid cell predicts B bounding boxes and a confidence score for each box. The confidence scores show how confident the model is that there is an object in that bounding box. This is formally defined as:

$$ \text{Confidence} = \Pr(\text{Object}) \cdot \text{IOU}^{\text{truth}}_{\text{pred}} $$


$\text{IOU}^{\text{truth}}_{\text{pred}}$ is defined as the intersection over union between the predicted box and the ground truth. For reference, our model used S = 13 and B = 5, as done in the original YOLOv2 paper.
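As a concrete reference, the IOU between two axis-aligned boxes can be computed as in the following sketch (the CGRect representation is our choice for illustration, not from the paper):

import CoreGraphics

/// Intersection over union of two axis-aligned bounding boxes.
func iou(_ a: CGRect, _ b: CGRect) -> CGFloat {
    let inter = a.intersection(b)          // null rect if the boxes do not overlap
    guard !inter.isNull else { return 0 }
    let interArea = inter.width * inter.height
    let unionArea = a.width * a.height + b.width * b.height - interArea
    return unionArea > 0 ? interArea / unionArea : 0
}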

Following this, each bounding box has 5 predictions: dx, dy, w, h, and confidence. The (dx, dy) coordinates represent the center of the box relative to the bounds of the grid cell. Each grid cell also predicts C conditional class probabilities, which in mathematical notation is given by $\Pr(\text{Class}_i \mid \text{Object})$.

Figure 2: YOLO model regression pipeline (Redmon et al.)

The formulation of object detection described above allows us to put a neural network architecture on top of this representation and get predictions for objects in image frames. However, certain architectural choices regarding the final prediction layer greatly influence the number of parameters, and therefore the overall speed, of the implementation. A novel idea introduced by YOLOv2 was to completely eliminate the need for a fully connected layer at the end of the network for predictions and instead use a convolution layer that makes predictions. To make this possible, the YOLO implementation uses anchor boxes to predict bounding boxes, meaning it predicts offsets instead of direct coordinates, which makes it possible for a convolutional layer to make reasonable predictions. The locations for the anchor boxes are given by the configuration files downloaded from the YOLO website: https://pjreddie.com/darknet/yolo/. For our project, we focused on models specific to the VOC task, meaning the number of classes for prediction was 20.
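Concretely, the anchor-box parameterization from the YOLO9000 paper [21] decodes a box from the predicted offsets $(t_x, t_y, t_w, t_h)$, the grid cell's top-left corner $(c_x, c_y)$, and the anchor's prior width and height $(p_w, p_h)$:

$$ b_x = \sigma(t_x) + c_x, \qquad b_y = \sigma(t_y) + c_y, \qquad b_w = p_w e^{t_w}, \qquad b_h = p_h e^{t_h} $$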

Since there are various architectures that can be implemented for this task, the authors of YOLO described a full YOLO version, which achieves optimal accuracy, and a more compact YOLO called tiny-YOLO, which runs faster but isn't as accurate. Tiny-YOLO was important to our project because it allowed us to get reasonable results when deployed to the limited hardware of a mobile device.

The full tiny-YOLO architecture is below (max-pool-2 means a max pool of size 2x2 and stride 2; NxN convolution-F-P means a convolution of F filters of size NxN with pad P, followed by batchnorm):

Input: 416x416x3
3x3 convolution-16-1
max-pool-2
3x3 convolution-32-1
max-pool-2
3x3 convolution-64-1
max-pool-2
3x3 convolution-128-1
max-pool-2
3x3 convolution-256-1
max-pool-2
3x3 convolution-512-1
max-pool-2
3x3 convolution-1024-1
3x3 convolution-1024-1
1x1 convolution-125-1
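The 125 filters in the final 1x1 convolution follow directly from the grid formulation: with the paper's S = 13, B = 5, and C = 20, each grid cell must output B boxes of 4 coordinates plus a confidence, together with C class scores per box, so

$$ \text{filters} = B \times (5 + C) = 5 \times (5 + 20) = 125, \qquad \text{output tensor} = S \times S \times 125 = 13 \times 13 \times 125. $$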

3.2. Fire Layer

A goal of this project was to speed up the implementation of YOLO so that it could classify faster on the limited hardware of a mobile device. As a result, our group researched architectures from recent research that minimize the number of parameters in a neural network without compromising accuracy. We found SqueezeNet [15] and realized that we could borrow motifs from it and use them in the YOLO architecture to replace bottleneck layers while, hopefully, maintaining accuracy. In particular, we looked at SqueezeNet's "fire module," which is a composition of two steps of convolutions with a concatenation at the end. The first layer is called the "squeeze" layer, a 1x1 convolution with s_1x1 filters. The second layer has two separate convolutions: one is 1x1 with e_1x1 filters, and the second is 3x3 with e_3x3 filters. A diagram of this is shown in Figure 3.

Figure 3: Fire module diagram (Iandola et al.)

We focused on reducing the apparent bottleneck in the last two layers of the tiny-YOLO architecture, where the 3x3 convolutions (with padding of 1) had 1024 filters, replacing those layers with fire modules. See the experiments section for more details.
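To see why this reduces parameters, consider a rough count (our own arithmetic, using the second fire layer's configuration from the experiments section and assuming a 512-channel input): a plain 3x3 convolution from 512 channels to 1024 filters needs about $9 \cdot 512 \cdot 1024 \approx 4.7\text{M}$ weights, while the fire module replacing it needs only

$$ \underbrace{512 \cdot 42}_{\text{squeeze } 1\times1} + \underbrace{42 \cdot 512}_{\text{expand } 1\times1} + \underbrace{9 \cdot 42 \cdot 512}_{\text{expand } 3\times3} \approx 0.24\text{M} $$

weights, with the two expand outputs concatenated to restore 1024 output channels.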

3.3. Mobile

Since iOS does not support CUDA or OpenCL, we use Metal and Metal Performance Shaders to perform work on the GPUs found in iOS devices. Thus, to bring our implementation onto mobile, we had to convert our weights into ones that could be interpreted by Metal. We used the open-source "YAD2K" script, which converts Darknet weights to a Keras format [2], to produce the compatible format. For new models, we could almost directly translate architectures into Swift using Metal's convolution and maxpool abstractions and incorporate the weights trained by Darkflow. However, one major drawback of this approach is that Metal does not currently support an abstracted batchnorm layer, a regularization technique used in the tiny-YOLO implementation.

Figure 4: Histogram of output distributions before and after batchnorming for the first convolutional layer. As noted in the Hollemans post, without batchnorming, output distributions tend to become diffuse (left) and accuracy can decrease. Image from Hollemans [12].

We were able to overcome this hurdle by using a strategy posted by M. Hollemans [12], who addressed the discrepancy by using a folding technique to essentially merge the batch norm weights into the previous convolutional layer [11]. We can use the following equations to compute new weights and bias terms for the convolution layer:

$$ w_{\text{new}} = \frac{\gamma \cdot w}{\sqrt{\text{variance}}} $$

$$ b_{\text{new}} = \frac{\gamma \cdot (b - \text{mean})}{\sqrt{\text{variance}}} + \beta $$
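A minimal Swift sketch of this folding, applied per output channel (the array layout and the epsilon term for numerical stability are our own assumptions, not from the paper):

import Foundation

/// Folds batch norm parameters into the preceding convolution,
/// channel by channel, following the equations above.
func foldBatchNorm(weights: [[Float]],   // weights[c] = all weights of output channel c
                   bias: [Float],
                   gamma: [Float], beta: [Float],
                   mean: [Float], variance: [Float],
                   epsilon: Float = 1e-3) -> (weights: [[Float]], bias: [Float]) {
    var newWeights = weights
    var newBias = bias
    for c in 0..<weights.count {
        let scale = gamma[c] / (variance[c] + epsilon).squareRoot()
        newWeights[c] = weights[c].map { $0 * scale }        // w_new = gamma * w / sqrt(var)
        newBias[c] = scale * (bias[c] - mean[c]) + beta[c]   // b_new = gamma * (b - mean) / sqrt(var) + beta
    }
    return (newWeights, newBias)
}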

Hollemans' post also offered starter YOLO code that leverages the Forge wrapper library around Metal and was integral to our project [10]. Forge is a collection of helper code that makes it easier to construct deep neural networks using Apple's MPSCNN framework. For example, to set up tiny-YOLO using Forge, you can implement the layers in Swift code like below.
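A sketch of that layer setup in Forge's DSL, following the style of Hollemans' example code (exact Forge type and parameter names may differ from the shipped library, so treat this as an approximation):

import Metal
import MetalPerformanceShaders
import Forge

let device = MTLCreateSystemDefaultDevice()!
let leaky = MPSCNNNeuronReLU(device: device, a: 0.1)   // leaky ReLU with slope 0.1

let input = Input()
let output = input
    --> Resize(width: 416, height: 416)
    --> Convolution(kernel: (3, 3), channels: 16, padding: true, activation: leaky, name: "conv1")
    --> MaxPooling(kernel: (2, 2), stride: (2, 2))
    --> Convolution(kernel: (3, 3), channels: 32, padding: true, activation: leaky, name: "conv2")
    --> MaxPooling(kernel: (2, 2), stride: (2, 2))
    // ... the remaining conv/max-pool pairs mirror the architecture in Section 3.1 ...
    --> Convolution(kernel: (1, 1), channels: 125, padding: true, name: "conv9")

let model = Model(input: input, output: output)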

After appropriate channel and bounding box processing, combined with UI functionality, we could spin up an app that drew bounding boxes with confidence scores in real time. To add additional functionality, we also added language buttons that allow a user to identify objects in 5 different languages (English, Spanish, French, Hindi, and Chinese).

4. Experiments

Our experiments aimed to determine which net (YOLO, tiny-YOLO, or tiny-YOLO with the fire layer modification) would be best for deployment on mobile. Our modified tiny-YOLO replaced 3 convolutional layers, the final 1024 and 512 convolutional layers, with two fire layers:

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[fire]
s_1x1=38
e_1x1=256
e_3x3=256

[fire]
s_1x1=42
e_1x1=512
e_3x3=512

[convolutional]
size=1
stride=1
pad=1
filters=425
activation=linear

In addition, we used the YOLO and tiny-YOLO net architectures to train and test on the VOC 2012 validation dataset and evaluated their speed and accuracy locally using mAP.

We chose to evaluate the models on CPU to mimic the limited computing resources available on modern phones. We then took the model with the highest CPU speed and implemented it in a mobile app.

4.1. Baselines

The VOC dataset is considered "solved" in many contexts. The VOC 2012 competition saw mAP scores of over 90% in each category [4]. The full YOLO version has a reported mAP score of 78% on the VOC 2007 test set after being trained on the VOC 2007 + 2012 sets [21]. The full YOLO net served as the baseline for our experiments. The GPU runs were conducted with a Titan X and served as our benchmark.

We used the Adam optimizer in all training sessions due to its tendency for expedient convergence, and we set a learning rate of 1e-3 to begin training, as suggested in the Darkflow implementation [25]. If the model's loss started thrashing (oscillating between two large numbers), we paused training and resumed from the last checkpoint with a lower learning rate. The lowest rate we used was 1e-6. We used the default mini-batch size of 64 for training, and we ran CPU evaluations using pre-trained weights from Darknet [21].

4.2. Evaluation

We evaluated the accuracy of the networks using mAP scores on the VOC test set. We heavily modified an R-CNN mAP evaluation script [19] to implement our own mAP score evaluator that is compatible with the JSON output of Darkflow.

To get the mean average precision, we compute the average precision for each of the 20 classes as follows:

$$ \text{AvgPrecision} = \sum_{k=1}^{n} P(k)\,\Delta r(k) $$

In this equation, P(k) refers to precision at a given threshold k, and $\Delta r(k)$ refers to the change in recall between k and k-1. This corresponds to the area under the precision-recall curve. Mean average precision is then simply $\frac{1}{C}\sum \text{AvgPrecision}$, where C = 20 for the PASCAL VOC dataset.
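A compact Swift sketch of this per-class computation, assuming detections have already been matched to ground truth and sorted by descending confidence (a simplified stand-in for our modified evaluator, with our own names):

/// Average precision as the sum over ranked detections of P(k) * delta-r(k).
/// `hits[k]` is true if the k-th ranked detection matched a ground-truth box.
func averagePrecision(hits: [Bool], totalGroundTruth: Int) -> Double {
    guard totalGroundTruth > 0 else { return 0 }
    var truePositives = 0
    var previousRecall = 0.0
    var ap = 0.0
    for (k, isHit) in hits.enumerated() {
        if isHit { truePositives += 1 }
        let precision = Double(truePositives) / Double(k + 1)    // P(k)
        let recall = Double(truePositives) / Double(totalGroundTruth)
        ap += precision * (recall - previousRecall)              // P(k) * delta-r(k)
        previousRecall = recall
    }
    return ap
}

// mAP is then the mean of averagePrecision over the C = 20 VOC classes.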

To evaluate test-time inference speed on CPU, we took 100 images from the VOC validation set and used the Darknet framework to time how long each architecture took to make predictions for all the images. We then divided the number of images (100) by the total time to get an average number of images (frames) per second. To get scores for mobile, we used the Forge library to determine the max FPS for our deployed net. The GPU speeds reference those reported on the Darknet website [21].

The GPU times were recorded using the NVIDIA Titan X, as reported in Darknet [21]. The CPU times were recorded on a 2016 MacBook Pro with a 2.7 GHz Intel Core i5. The mobile times were recorded on a standard iPhone 7 running iOS 10.

4.3. Results

We report our findings on speed and accuracy on mobile and GPU, in comparison to modern models, below.

Graph 1: Empirical Speed and mAP of Locally Trained and Tested Models Using the VOC 2012 Dataset

We include a detailed comparison in Table 1. The tiny-YOLO net significantly outperformed all others in terms of speed, overtaking YOLO by a factor of 5x on both CPU and GPU. We achieved slightly lower accuracies when the nets were tested locally on a sample from the 2012 VOC validation set (rather than on the 2007 sets used to set the benchmarks on the Darknet website). The tiny-YOLO fire layer modification was slightly faster but failed to predict any examples.

The locally tested YOLOs struggled with similar categories as their GPU counterparts, as expected. tiny-YOLO also trained and predicted significantly faster than YOLO because it has a fraction of the layers and parameters. tiny-YOLO-fire had even fewer parameters overall (although fire layers perform 2 convolutions, we replaced 3 convolution layers with 2 fire layers, reducing the overall parameter count) and thus performed slightly faster than tiny-YOLO, as expected.


Model                 | Speed (FPS) | FLOPS    | mAP
----------------------|-------------|----------|------
SqueezeNet (GPU)      | -           | 2.17 Bn  | -
YOLO (GPU)            | 40          | 59.68 Bn | 78.6
tiny-YOLO (GPU)       | 207         | 0.81 Bn  | 57.1
YOLO (CPU)            | 2           | 59.68 Bn | 65
tiny-YOLO (CPU)       | 6           | 0.81 Bn  | 51
tiny-YOLO-fire (CPU)  | 8           | -        | < 1
tiny-YOLO (mobile)    | 11          | 0.81 Bn  | -

Table 1: Best reported inference speed and mAP scores of models, compared to our tests of YOLO, tiny-YOLO, and tiny-YOLO-fire on CPU.

4.4. Discussion of Fire Layer

We attempted to train the modified tiny-YOLO with fire layers from scratch. This required instantiating a blank file, and training took around 4 hours on the GPU provisioned by Google Cloud, resulting in a minimum loss of 6.

Figure 5: TensorBoard loss convergence graph of tiny-YOLO-fire after 500 iterations

However, although the loss converged, the net still had a mAP of 0 at test time and was unable to make predictions for almost all images. As a result, we attempted to modify the s_1x1 layer of the second fire layer to have 46 channels instead of 42 (to better align with the original SqueezeNet implementation [15]) and tuned hyperparameters such as the learning rate. This also had no effect.

We then used the pre-trained weights for the first convolutional layers from tiny-YOLO and fine-tuned the remaining fire layers using these weights as a starting point. To test this technique, we created a small set of 3 images and annotations from the VOC validation set and trained the fire-layer tiny-YOLO on these images in an attempt to overfit and accurately predict them at test time. The net was unable to make accurate classifications on this dataset as well.

These experiments led us to believe that fire layers are ill-suited as replacements for vanilla convolution layers. Intuitively, this makes sense, as the fire layers concatenate outputs, leading to very different maps from sequential convolution.

Furthermore, the fire layers added at the end of the architecture squeeze an already aggressively reduced neural net. However, before these layers, the activations are already somewhat well-defined and the parameter count is very low. As a result, additional squeezing without careful precision may result in the loss or scrambling of useful information, making the model far less expressive and unable to make sense of information learned in previous layers. Additionally, while fire layers may have made sense for all-image input without bounding boxes, where concatenated channels revealed information about the interplay of different features in an image, adding text and bounding boxes likely only adds noise to the outputs, since there is a disconnect between the meaning of the bounding box parameters and the pixel values.

Further tuning of the s_1x1, e_1x1, and e_3x3 filter sizes and layers would need to take place to ensure that the network captures relevant data in these steps.

Another factor to consider is that while YOLO relies on Euclidean loss, the original SqueezeNet uses softmax [15]:

$$ L_{\text{Eucl}} = \frac{1}{2} \sum_{i=1}^{n} (x_i - y_i)^2 $$

$$ L_{\text{Softmax}} = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right) $$

These are two fundamentally different ways of measuring loss, with one relying on distance while the other relies on probability distributions.

Empirical evidence suggests that YOLO is not as effective when evaluated using softmax, with mAP scores dropping as low as 5-10% upon changing to the softmax loss function [22]. It is possible that the reverse is true for SqueezeNet-inspired architectures and Euclidean loss, with the incompatible loss function having a strong negative effect on final mAP scores.

4.5. Results of Deployment

The tiny-YOLO net was successfully deployed to the iPhone 7 using Forge [12]. The app only draws bounding boxes if the confidence of a prediction is above a user-defined threshold (0.5 in our case). The deployment to mobile actually saw a slightly higher maximum inference speed than on the laptop CPU because of the direct access to the mobile GPU provided by Metal 2. However, the average speed roughly matched the results of the same network on CPU.
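The thresholding itself is a simple filter over the decoded predictions; a minimal sketch (the Prediction type is a hypothetical stand-in for the app's decoded network output):

import CoreGraphics

// Hypothetical decoded detection; the real type comes from the app's
// YOLO post-processing on top of Forge.
struct Prediction {
    let classIndex: Int
    let confidence: Float
    let boundingBox: CGRect
}

// Keep only boxes above the user-defined confidence threshold (0.5 in our app).
func boxesToDraw(_ predictions: [Prediction], threshold: Float = 0.5) -> [Prediction] {
    return predictions.filter { $0.confidence >= threshold }
}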

The relatively low mAP score is evidenced by the tendency of our app to ignore chairs and struggle to predict bottles. The categories with the lowest average precisions, such as potted plants and sofas, are rarely detected. This is likely because these categories are highly variant in both shape and color and often include a lot of background noise. The app has the highest mAP for the categories of people and cars. To view a demo, visit https://www.youtube.com/watch?v=m0_GXm9aN5Q&feature=youtu.be. Images from a run are shown below.

Figure 6: Detections in different languages

5. Conclusions and Future Work

While the fire layer may be promising for speeding up bottlenecked high-channel layers, the lack of predictions in this project makes it hard to draw positive conclusions about its future usefulness. Further tuning over a large number of parameters would need to take place before conclusions can be drawn about this hybrid architecture.

More likely, it seems that layer-pruning and depth-wise separable convolutions, as proposed in Google's MobileNet [13], will provide an effective trade-off between speed and efficiency. These models shrink the number of hyperparameters without making drastic changes to the architecture and seem to be the next step in making deep learning practical on mobile. Next steps would investigate MobileNet and potential hybrids between it and YOLO.

Mobile deep learning is quickly becoming more accessible. With a push for increased GPU access on phones, even the full YOLO may one day run in real time on our iPhones. Consequently, the best development strategy to take advantage of mobile deep learning might be to slowly add to existing models such as tiny-YOLO to make them approach YOLO accuracy. Nevertheless, our current app indicates that object detection on mobile can be done in near real time and hints at many possibilities in the realms of facial recognition, AR, and autonomous driving in the near future.

References

[1] CS231n Lecture 11, Slide 80. May 10, 2017.
[2] allanzelener. YAD2K. https://github.com/allanzelener/YAD2K, 2017.
[3] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. CoRR, abs/1703.06211, 2017.
[4] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. Assessing the significance of performance differences on the PASCAL VOC challenges via bootstrapping. 2012.
[5] R. B. Girshick. Fast R-CNN. CoRR, abs/1504.08083, 2015.
[6] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR, abs/1311.2524, 2013.
[7] R. B. Girshick, F. N. Iandola, T. Darrell, and J. Malik. Deformable part models are convolutional neural networks. CoRR, abs/1409.5403, 2014.
[8] K. He, G. Gkioxari, P. Dollar, and R. B. Girshick. Mask R-CNN. CoRR, abs/1703.06870, 2017.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
[10] hollance. Forge: A Neural Network Toolkit for Metal. https://github.com/hollance/Forge, 2017.
[11] hollance. yolo2metal open-source script. https://github.com/hollance/Forge/blob/master/Examples/YOLO/yolo2metal.py, 2017.
[12] M. Hollemans. Real-time object detection with YOLO. http://machinethink.net/blog/object-detection-with-yolo/, 2017.
[13] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
[14] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. CoRR, abs/1611.10012, 2016.
[15] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. CoRR, abs/1602.07360, 2016.
[16] Y. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. CoRR, abs/1511.06530, 2015.
[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
[18] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. CoRR, abs/1405.0312, 2014.
[19] rbgirshick. Faster R-CNN. https://github.com/rbgirshick/py-faster-rcnn/blob/master/lib/datasets/, 2016.
[20] J. Redmon and A. Farhadi. You Only Look Once: Unified, real-time object detection. https://pjreddie.com/darknet/yolo/, 2015.
[21] J. Redmon and A. Farhadi. YOLO9000: Better, faster, stronger. https://pjreddie.com/darknet/yolo/, 2016.
[22] J. Redmon and A. Farhadi. YOLO9000: Better, faster, stronger. https://pjreddie.com/publications/yolo/, 2016.
[23] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.
[24] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
[25] thtrieu. Darkflow. https://github.com/thtrieu/darkflow, 2017.
[26] B. Wu, F. N. Iandola, P. H. Jin, and K. Keutzer. SqueezeDet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving. CoRR, abs/1612.01051, 2016.

8

Page 2: YOLO Net on iOS - Stanford Universitycs231n.stanford.edu/reports/2017/pdfs/135.pdf · YOLO Net on iOS Maneesh Apte Stanford University mapte05@stanford.edu Simar Mangat Stanford University

tion layers and 6 pooling layers instead of the 24 convolu-tion layers in the traditional YOLO net

We then used the Metal framework and the Forge library[10] to translate our architecture to Swift for deployment onour iPhones [12]

2 Background and Related WorkWe surveyed several object detection architectures and

mobile frameworksWe first investigated Deformable Part Models which are

graphical models that can be used for predictions Accord-ing to Girshik et al these models can be formulated asCNNs [7] Current literature builds on this work to inves-tigate how deformable convolution and deformable poolingcan enhance the transformation modeling capability of theCNN [3] However because of our focus on mobile weshifted our focus to other models with a greater emphasison parameter reduction

The first of these models is the RCNN which proposesregions of interest and classifies regions with an SVM afterconvolution on each region [6] The speed of this model isimproved upon in Fast-RCNN [5] and Faster-RCNN [23]which speeds up passes by introducing Region ProposalNetworks (RPN) to predict proposals from features

Figure 1 Comparison of RCNN speeds [1]

The RCNN is often used as a benchmark for the speed-accuracy trade off For work focused exclusively on hu-mans and human poses Mask RCNN is a promising pathThis architecture includes joint coordinates in addition tobox coordinate [8]

We then looked into the rdquoYou Only Look Oncerdquo (YOLO)net [20] and decided this framework would be best suitedfor the limited resources available on mobile The originalnet was improved upon in 2015 and this YOLO v2 servesas the primary focus of our experiments [21]

Our next focus was finding methods of parameter re-duction to increase speed without sacrificing accuracySqueezeNet [15] provided a great overview of the effect

of downsampling on AlexNet [17] The promising workof SqueezeDet [26] which builds on the SqueezeNet prin-ciples and applies them to self-driving car applicationswas particularly interesting and validated our thoughtsthat SqueezeNet had relevant ideas that could transcendAlexNet While we did survey methods of compressionsuch as one-shot whole network compression with Bayesianmatrix factorization Tucker decomposition on kernel ten-sor and fine-tuning to recover accumulated loss of ac-curacy as presented by Kim et al [16] we decided thebest path forward forward given our prior familiarity withSqueezeNet was to focus on using its principals for param-eter reduction in a different model

We surveyed several architectures that might benefitfrom downsampling These included high parameter archi-tectures such as ResNet [9] GoogLeNet [24] GoogLeNetalso provided us with a foundational understanding of how1x1 filters and concatenation could effectively reduce pa-rameters without losing expressivity However we decidedthat even if heavily optimized they would still run tooslowly on modern mobile devices and continued to focuson faster models with better real time detection potentialWe also looked at the current trade offs of various modelsas presented by Huang et al [14] which showed that com-binations of models are often optimal

While we decided to use the VOC dataset we also con-sidered the COCO dataset [18] which has far more cate-gories and may be the focus of future studies For this studyVOC provided the focus we needed

Finally MobileNet from Google [13] has demonstratedthat depth-wise separable convolutions can remove the bot-tleneck of convolution layers with many channels Theserdquothinnerrdquo models and the methodology behind them informour future work

We chose to focus on YOLO because it provided a de-cent baseline for accuracyspeed and was most accessiblefor modification

3 Technical Approach31 YOLO

The YOLO modelrsquos novel motivation is that it re-framesobject detection as a single regression problem directlyfrom image pixels to bounding box coordinates and classprobabilities This means that the YOLO model only rdquolooksoncerdquo at an image for object detection

It works by dividing an image into S x S grid Each gridcell predicts B bounding boxes and a confidence score foreach box The confidence scores show how confident themodel is that there is an object in that bounding box Thisis formally defined as

Confidence = Pr(Object) lowast IOU truthpred

2

IOU truthpred is defined as the intersection over the union

between the predicted box and the ground truth For refer-ence our model used S = 13 and B = 5 as done in theYolov2 original paper

Following this each bounding box has 5 predictions dxdy w h and confidence The (dx dy) coordinates rep-resent the center of the box relative to the bounds of thegrid cell Each grid cell also predicts C conditional classprobabilities which in mathematical notation is given byPr(Classi|Object)

Figure 2 Yolo model regression pipeline (Redmon et al)

The formulation of object detection described above al-lows us to put a neural network architecture on top ofthis representation and get predictions for objects in im-age frames However certain architectures regarding thefinal prediction layer greatly influence the number of pa-rameters and therefore overall speed of the implementa-tion A novel idea introduced by YoloV2 was to com-pletely eliminate the need for a fully connected layer at theend of the network for predictions and instead use a con-volution layer that makes predictions In order to makethis possible the Yolo implementation uses anchor boxesto predict bounding boxes meaning it predicts offsets in-stead of direct coordinates and thus accessible for a con-volutional layer to make reasonable predictions The loca-tions for the anchor boxes are given by the configurationfiles that were downloaded from the Yolo website httpspjreddiecomdarknetyolo For our projectwe focused on models specific for the VOC task meaningthe number of classes for prediction was 20

Since there are a various architectures that can be imple-mented for this task the authors of YOLO described a fullYOLO version which achieved optimal accuracy and a morecompact YOLO called tiny-yolo that run faster but isnrsquot asaccurate Tiny-yolo was important to our project because itallowed us to get reasonable results when deployed to thelimited hardward of a mobile device

The full architecture yolo-tiny is below (max-pool-2means max pool of size 2x2 and stride 2 NxN convolution-F-P means convolution of F filters of size NxN with pad P

followed by batchnorm)

Input 416x416x33x3 convolution-16-1max-pool-23x3 convolution-32-1max-pool-23x3 convolution-64-1max-pool-23x3 convolution-128-1max-pool-23x3 convolution-512-1max-pool-23x3 convolution-1024-13x3 convolution-1024-11x1 convolution-125-1

32 Fire Layer

A goal of this project was to speed up the implementa-tion of YOLO so that it could classify faster on the limitedhardware of a mobile device As a result our group re-searched various architectures that have been used in recentresearch that minimize the number of parameters in a neu-ral network without compromising on accuracy Thus wefound the architecture SqueezeNet [15] and realized thatwe could borrow motifs from SqueezeNet and use them inthe YOLO architecture to replace bottleneck layers whilehopefully maintaining accuracy In particular we looked atSqueezeNetrsquos rdquofire modulerdquo which is a composition of twosteps of convolutions with a concatenation at the end Thefirst layer is called the rdquosqueezerdquo layer which is a 1x1 con-volution with s1x1 filters The second layer has two separateconvolutions one is 1x1 with e1x1 filters and the second is3x3 with e3x3 filters A diagram of this is shown in figure3

Figure 3 Fire module diagram (Iandola et al)

We focused on reducing the apparent bottleneck in thelast two layers of the tiny-yolo architecture where the 3x3

3

filters with padding of 1 convolutions had 1024 filters re-placing those layers with fire modules See experimentssection for more details

33 Mobile

Since iOS does not support CUDA or OpenCL we useMetal and Metal Performance Shaders to perform work onthe GPUs found in iOS devices Thus to bring our imple-mentation onto mobile we had to convert weights into onesthat could be interpreted by Metal We used the open-sourcerdquoYAD2Krdquo script which converts the Darknet weights to aKeras format [2] to the compatible format For new modelswe could almost directly translate architectures into Swiftusing Metals convolution and maxpool abstraction and in-corporate the weights trained by Darkflow However onemajor drawback of this approach is that that Metal does notcurrently support an abstracted batchnorm layer a regular-ization technique used in the tiny-YOLO implementation

Figure 4 Histogram of output distributions before and after batch-norming for the first convolutional layer As noted in the Holle-mans post without batchnorming output distributions tend to be-come diffuse (left) and accuracy can decrease Image from Holle-mans [12]

We were able to overcome this hurdle by using a strategyposted by M Hollemans [12] who addressed the discrep-ancy by using a folding technique to essentially merge thebatch norm weights into the previous convolutional layer[11] We can use the following equations to compute newweights and bias terms for the convolution layer

wnew =gamma lowast w

sqrt(variance)

bnew =gamma lowast (bminusmean)

sqrt(variance)+ beta

Hollemansrsquo post also offered starter Yolo code that lever-ages the Forge wrapper library around Metal and was inte-gral to our project [10] Forge is a collection of helper codethat makes it easier to construct deep neural networks usingApplersquos MPSCNN framework For example to set up tiny-yolo using Forge you can implement the layers in Swiftcode like below

After appropriate channel and bounding box process-ing combined with UI functionality we could spin up anapp that drew bounding boxes with confidence scores inreal time To add additional functionality we also addedlanguage buttons that allow a user to identify objects in 5different languages (English Spanish French Hindhi andChinese)

4 ExperimentsOur experiments aimed to determine which net YOLO

tiny-YOLO or tiny-YOLO with fire layer modificationwould be best for deployment on mobile Our modifiedtiny-YOLO replaced 3 convolutional layers - the final 1024and 512 convolutional layers - with two fire layers

[convolutional]batch_normalize=1filters=256size=3stride=1pad=1activation=leaky

[maxpool]size=2stride=2

[fire]s_1x1=38e_1x1=256e_3x3=256

[fire]s_1x1=42e_1x1=512e_3x3=512

[convolutional]size=1stride=1pad=1

4

filters=425activation=linear

In addition we used the YOLO and the tiny-YOLO net ar-chitectures to train and test on the VOC 2012 validationdataset and evaluated their speed and accuracy locally us-ing mAP

We chose to evaluate the models on CPU to mimic thelimited computing resources available on modern phonesWe then took the model with the highest CPU speed andimplemented it in a mobile app

41 Baselines

The VOC dataset is considered rdquosolvedrdquo in many con-texts The VOC 2012 competition saw mAP scores of over90 in each category [4] The full YOLO version has a re-ported mAP score of 78 on the VOC 2007 test set afterbeing trained on the VOC 2007 + 2012 sets [21] The fullYOLO net served as the baseline for our experiments TheGPU runs were conducted with a Titan X and served as ourbenchmark

We used the Adam optimizer in all training sessionsdue to its tendency for expedient convergence and we seta learning rate of 1eminus3 to begin training as suggested inthe Darkflow implementation [25] If the modelrsquos lossstarted thrashing (oscillating between two large numbers)we paused training and resumed from the last checkpointwith a lower learning rate The lowest rate we used was1eminus6 We used the default mini-batch size of 64 for train-ing and we ran CPU evaluations using pre-trained weightsfrom Darknet [21]

42 Evaluation

We evaluted the accuracy of networks using mAP scoreson the VOC test set We heavily modified an R-CNN mAPevaluation script [19] to implement our own mAP scoreevaluator that is compatible with the json output of Dark-flow

To get the mean average precision we compute the aver-age precision for each of the 20 classes as follows

AvgPrecision =

nsumk=1

P (k)∆r(k)

In this equation P (k) refers to precision at a giventhreshold k and ∆r(k) refers to the change in recall betweenk and k minus 1 This corresponds to the area under the preci-sion recall curve Thus mean average precision is simplyAvgPrecisionC Where C = 20 for the VOC PASCALdataset

To evaluate test time inference speed on CPU we took100 images from the VOC validation set and used the Dark-net framework to time how long each architecture took to

make predictions for all the images We then divided thetotal time by the number of images (100) to get an averagenumber of images (frames) per second To get scores forMobile we used the Forge Library to determine the maxFPS for our deployed net The GPU speed references thosereported on the Darknet Website [21]

The GPU times were recorded using the NVIDIA Ti-tan X as reported in Darkent [21] The CPU times wererecorded on a 2016 MacBook Pro with 27 GHz Intel Corei5 The mobile times were recorded on a standard iPhone 7running iOS 10

43 Results

We report our findings on speed and accuracy on mobileand GPU in comparison to modern models below

Graph 1 Empirical Speed and mAP of Locally Trainedand Tested Models Using the VOC 2012 Dataset

We include a detailed comparison in Table 1 The tiny-YOLO net significantly outperformed all others in terms ofspeed overtaking YOLO by a factor of 5x on both CPUand GPU We achieved slightly lower accuracies when thenets were tested locally on a sample from the 2012 VOCvalidation set (rather than on the 2007 sets used to set thebenchmarks on the Darknet website) The tiny-YOLO firelayer modification was slightly faster but failed to predictany examples

The locally tested YOLOs struggled with similar cate-gories as their GPU counterparts as expected tiny-YOLOalso trained and predicted significantly faster than YOLObecause it has a fraction of the layers and parameters tiny-YOLO-fire had even fewer parameters overall since al-though fire layers perform 2 convolutions we replaced 3convolution layers with 2 fire layers reducing overall pa-rameter count and thus performed slightly faster than tiny-YOLO as expected

5

Model Speed (FPS) FLOPS mAP

SqueezeNet (GPU) - 217 Bn -YOLO (GPU) 40 5968 Bn 786

tiny-YOLO (GPU) 207 081 Bn 571YOLO (CPU) 2 5968 Bn 65

tiny-YOLO (CPU) 6 081 Bn 51tiny-YOLO-fire (CPU) 8 - lt 1tiny-YOLO (mobile) 11 081 Bn -

Table 1 Best Reported Inference Speed and mAP scores of modelscompared to our tests of YOLO tiny-YOLO and tiny-YOLO-fireon CPU

44 Discussion of Fire Layer

We attempted to train the modified tiny-YOLO with firelayers from scratch This required instantiating a blank fileand training took around 4 hours on the GPU provisionedby Google Cloud which resulted in a minimum loss of 6

Figure 5 Tensorboard loss convergence graph of tiny-YOLO-fireafter 500 iterations

However although the loss converged the net still had amAP of 0 at test time and was unable to make predictionsfor almost all images As a result we attempted to modifythe s1x1 layer of the second fire layer to have 46 channelsinstead of 42 (to better align with the original SqueezeNetimplementation[15]) and tuned hyperparameters such asthe learning rate This also had no effect

We then used the pre-trained weights for the first convo-lutional layers from tiny-YOLO and fine-tuned the remain-ing fire layers using these weights as a starting point To testthis technique we created a small set of 3 images and an-notations from the VOC validation set and trained the fire-layer tiny-YOLO on these images in an attempt to overfitand accurately predict these images at test time The netwas unable to make accurate classifications on this datasetas well

These experiments led us to believe that fire layers are illsuited as replacements to vanilla convolution layers Intu-itively this makes sense as the fire layers concatenate out-puts leading to very different maps from sequential convo-

lutionFurthermore the fire layers added at the end of the archi-

tecture squeeze an already aggressively reduced neural netHowever before these layers the activations are alreadysomewhat well-defined and the parameter count is very lowAs a result additional squeezing without careful precisionmay result in the loss or scrambling of useful informationmaking the model far less expressive and unable to makesense of information learned in previous layers Addition-ally while fire layers may have made sense for all-imageinput without bounding boxes where concatenated chan-nels revealed information about the interplay of differentfeatures in an image adding text and bounding boxes likelyonly adds noise to the outputs since there is a disconnectbetween the meaning of the bounding box parameters andthe pixel values

Further tuning of the s1x1 e1x1 and e3x3 filter sizes andlayers would need to take place to ensure that the networkcaptures relevant data in these steps

Another factor to consider is that while YOLO relies onEuclidean loss the original SqueezeNet uses Softmax [15]

LEucl =1

2

nsumi=1

(xi minus yi)2

LSoftmax = minuslog(efyisumj e

fj)

These are two fundamentally different ways of measuringloss with one relying on distance while the other relies onprobability distributions

Empirical evidence suggests that YOLO is not as ef-fective when evaluated using Softmax with mAP scoresdropping as low as 5-10 upon changing to the Softmaxloss function [22] It is possible that the reverse is truefor SqueezeNet-inspired architectures and Euclidean Losswith the incompatible loss function having a strong negativeeffect on final mAP scores

45 Results of Deployment

The tiny-YOLO net was successfully deployed to theiPhone 7 using Forge [12] The app only draws bound-ing boxes if the confidence of prediction is above a user-defined threshold (05 in our case) The deployment tomobile actually saw a slightly higher maximum InferenceSpeed than on laptop CPU because of the direct access tothe mobile GPU provided by Metal2 However the averagespeed roughly matched the results of the same network onCPU

The relatively low mAP score is evidenced by thetendency of our app to ignore chairs and struggle topredict bottles The categories with the lowest averageprecisions such as potted plants and sofas are rarelydetected This is likely because these categories are so

6

variant in both shape and color and often include a lotof background noise The app has the highest mAPfor the categories of people and cars To view a demovisit httpswwwyoutubecomwatchv=m0_GXm9aN5Qampfeature=youtube Images from a runare shown below

Figure 6 Detections in different languages

5 Conclusions and Future Work

While the fire layer may be promising for speeding upthe bottlenecked high-channel layers the lack of predictionsin this project make it hard to draw positive conclusions forits future usefulness Further tuning on a large number ofparameters would need to take place before conclusions canbe drawn with this hybrid architecture

More likely it seems that layer-pruning and depth-wiseseparable convolutions as proposed in Googlersquos MobileNet[13] will provide an effective trade-off between speed andefficiency These models shrink the number of hyperparam-eters without making drastic changes to architecture andseem to be the next step in making deep learning practi-cal on mobile Next steps would investigate MobileNet andpotential hybrids between it and YOLO

Mobile deep learning is quickly becoming more accessi-ble With a push for increased GPU access on phones eventhe full YOLO may one day run in real time on our iPhonesConsequently the best development strategy to take advan-tage of mobile deep learning might be to slowly add to exist-ing models such tiny-YOLO to make them approach YOLOaccuracy Nevertheless our current app indicates that objectdetection for mobile can be done in near real time and hintsat many possibilities in the realms of facial recognition ARand autonomous driving in the near future

References

[1] Cs231n lecture 11 slide 80 May 10 2017[2] allanzelener Yad2k httpsgithubcomallanzelenerYAD2K

2017

[3] J Dai H Qi Y Xiong Y Li G Zhang H Hu andY Wei Deformable convolutional networks CoRRabs170306211 2017

[4] M Everingham L Van Gool C K I Williams J Winn andA Zisserman Assessing the Significance of PerformanceDifferences on the PASCAL VOC Challenges via Bootstrap-ping 2012

[5] R B Girshick Fast R-CNN CoRR abs150408083 2015[6] R B Girshick J Donahue T Darrell and J Malik Rich

feature hierarchies for accurate object detection and semanticsegmentation CoRR abs13112524 2013

[7] R B Girshick F N Iandola T Darrell and J MalikDeformable part models are convolutional neural networksCoRR abs14095403 2014

[8] K He G Gkioxari P Dollar and R B Girshick MaskR-CNN CoRR abs170306870 2017

[9] K He X Zhang S Ren and J Sun Deep residual learningfor image recognition CoRR abs151203385 2015

[10] hollance Forge A Neural Network Toolkit for MetalhttpsgithubcomhollanceForge 2017

[11] hollance Yolo2metal open source script 2017httpsgithubcomhollanceForgeblobmasterExamplesYOLOyolo2metalpy

[12] M Hollemans Real-time object detection with yolohttpmachinethinknetblogobject-detection-with-yolo2017

[13] A G Howard M Zhu B Chen D Kalenichenko W WangT Weyand M Andreetto and H Adam Mobilenets Effi-cient convolutional neural networks for mobile vision appli-cations CoRR abs170404861 2017

[14] J Huang V Rathod C Sun M Zhu A KorattikaraA Fathi I Fischer Z Wojna Y Song S Guadarrama andK Murphy Speedaccuracy trade-offs for modern convolu-tional object detectors CoRR abs161110012 2016

[15] F N Iandola M W Moskewicz K Ashraf S Han W JDally and K Keutzer Squeezenet Alexnet-level accuracywith 50x fewer parameters and lt1mb model size CoRRabs160207360 2016

[16] Y Kim E Park S Yoo T Choi L Yang and D ShinCompression of deep convolutional neural networks for fastand low power mobile applications CoRR abs1511065302015

[17] A Krizhevsky I Sutskever and G E Hinton Imagenetclassification with deep convolutional neural networks InAdvances in Neural Information Processing Systems page2012

[18] T Lin M Maire S J Belongie L D Bourdev R BGirshick J Hays P Perona D Ramanan P Dollar andC L Zitnick Microsoft COCO common objects in contextCoRR abs14050312 2014

[19] rbgirshick Faster r-cnn httpsgithubcomrbgirshickpy-faster-rcnnblobmasterlibdatasets 2016

[20] J Redmon and A Farhadi You only lookonce Unified real-time object detectionhttpspjreddiecomdarknetyolo 2015

[21] J Redmon and A Farhadi Yolo9000 Better faster strongerhttpspjreddiecomdarknetyolo 2016

7

[22] J Redmon and A Farhadi Yolo9000 Better faster strongerhttpspjreddiecompublicationsyolo 2016

[23] S Ren K He R B Girshick and J Sun Faster R-CNNtowards real-time object detection with region proposal net-works CoRR abs150601497 2015

[24] C Szegedy W Liu Y Jia P Sermanet S E ReedD Anguelov D Erhan V Vanhoucke and A RabinovichGoing deeper with convolutions CoRR abs140948422014

[25] thtrieu Darkflow httpsgithubcomthtrieudarkflow 2017[26] B Wu F N Iandola P H Jin and K Keutzer Squeezedet

Unified small low power fully convolutional neural net-works for real-time object detection for autonomous drivingCoRR abs161201051 2016

8

Page 3: YOLO Net on iOS - Stanford Universitycs231n.stanford.edu/reports/2017/pdfs/135.pdf · YOLO Net on iOS Maneesh Apte Stanford University mapte05@stanford.edu Simar Mangat Stanford University

IOU truthpred is defined as the intersection over the union

between the predicted box and the ground truth For refer-ence our model used S = 13 and B = 5 as done in theYolov2 original paper

Following this each bounding box has 5 predictions dxdy w h and confidence The (dx dy) coordinates rep-resent the center of the box relative to the bounds of thegrid cell Each grid cell also predicts C conditional classprobabilities which in mathematical notation is given byPr(Classi|Object)

Figure 2 Yolo model regression pipeline (Redmon et al)

The formulation of object detection described above al-lows us to put a neural network architecture on top ofthis representation and get predictions for objects in im-age frames However certain architectures regarding thefinal prediction layer greatly influence the number of pa-rameters and therefore overall speed of the implementa-tion A novel idea introduced by YoloV2 was to com-pletely eliminate the need for a fully connected layer at theend of the network for predictions and instead use a con-volution layer that makes predictions In order to makethis possible the Yolo implementation uses anchor boxesto predict bounding boxes meaning it predicts offsets in-stead of direct coordinates and thus accessible for a con-volutional layer to make reasonable predictions The loca-tions for the anchor boxes are given by the configurationfiles that were downloaded from the Yolo website httpspjreddiecomdarknetyolo For our projectwe focused on models specific for the VOC task meaningthe number of classes for prediction was 20

Since there are a various architectures that can be imple-mented for this task the authors of YOLO described a fullYOLO version which achieved optimal accuracy and a morecompact YOLO called tiny-yolo that run faster but isnrsquot asaccurate Tiny-yolo was important to our project because itallowed us to get reasonable results when deployed to thelimited hardward of a mobile device

The full architecture yolo-tiny is below (max-pool-2means max pool of size 2x2 and stride 2 NxN convolution-F-P means convolution of F filters of size NxN with pad P

followed by batchnorm)

Input 416x416x33x3 convolution-16-1max-pool-23x3 convolution-32-1max-pool-23x3 convolution-64-1max-pool-23x3 convolution-128-1max-pool-23x3 convolution-512-1max-pool-23x3 convolution-1024-13x3 convolution-1024-11x1 convolution-125-1

32 Fire Layer

A goal of this project was to speed up the implementa-tion of YOLO so that it could classify faster on the limitedhardware of a mobile device As a result our group re-searched various architectures that have been used in recentresearch that minimize the number of parameters in a neu-ral network without compromising on accuracy Thus wefound the architecture SqueezeNet [15] and realized thatwe could borrow motifs from SqueezeNet and use them inthe YOLO architecture to replace bottleneck layers whilehopefully maintaining accuracy In particular we looked atSqueezeNetrsquos rdquofire modulerdquo which is a composition of twosteps of convolutions with a concatenation at the end Thefirst layer is called the rdquosqueezerdquo layer which is a 1x1 con-volution with s1x1 filters The second layer has two separateconvolutions one is 1x1 with e1x1 filters and the second is3x3 with e3x3 filters A diagram of this is shown in figure3

Figure 3: Fire module diagram (Iandola et al.).

We focused on reducing the apparent bottleneck in the last two layers of the tiny-YOLO architecture, where the 3x3 convolutions (with padding of 1) had 1024 filters, replacing those layers with fire modules. See the experiments section for more details.
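For intuition about the savings (a back-of-the-envelope calculation using the fire-layer sizes from section 4 and assuming a 512-channel input; these numbers are ours, not the paper's): a 3x3 convolution from 512 to 1024 channels costs

$$3 \cdot 3 \cdot 512 \cdot 1024 \approx 4.7\text{M parameters},$$

while a fire module with $s_{1x1} = 42$, $e_{1x1} = 512$, and $e_{3x3} = 512$ costs

$$512 \cdot 42 + 42 \cdot 512 + 3 \cdot 3 \cdot 42 \cdot 512 \approx 0.24\text{M parameters},$$

roughly a 20x reduction, which is what makes the fire module attractive for the bottleneck layers.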

3.3. Mobile

Since iOS does not support CUDA or OpenCL, we use Metal and Metal Performance Shaders to perform work on the GPUs found in iOS devices. Thus, to bring our implementation onto mobile, we had to convert the weights into a format that could be interpreted by Metal. We used the open-source "YAD2K" script, which converts Darknet weights to a Keras format [2]. For new models, we could almost directly translate architectures into Swift using Metal's convolution and max-pool abstractions and incorporate the weights trained by Darkflow. However, one major drawback of this approach is that Metal does not currently support an abstracted batchnorm layer, a regularization technique used in the tiny-YOLO implementation.

Figure 4: Histogram of output distributions before and after batchnorming for the first convolutional layer. As noted in the Hollemans post, without batchnorming, output distributions tend to become diffuse (left) and accuracy can decrease. Image from Hollemans [12].

We were able to overcome this hurdle by using a strategy posted by M. Hollemans [12], who addressed the discrepancy with a folding technique that essentially merges the batch-norm weights into the previous convolutional layer [11]. We can use the following equations to compute new weights and bias terms for the convolution layer:

$$w_{new} = \frac{\gamma \cdot w}{\sqrt{variance}}, \qquad b_{new} = \frac{\gamma \cdot (b - mean)}{\sqrt{variance}} + \beta$$
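A minimal sketch of this folding in Swift, applied per output channel (the small epsilon term is our addition for numerical stability, as in common folding implementations; it is not in the equations above):

// Fold batch-norm parameters into the preceding convolution, per output channel.
// weights[c] holds every conv weight that feeds output channel c.
func foldBatchNorm(weights: [[Float]], bias: [Float],
                   gamma: [Float], beta: [Float],
                   mean: [Float], variance: [Float],
                   epsilon: Float = 1e-3) -> (weights: [[Float]], bias: [Float]) {
    var foldedWeights = weights
    var foldedBias = bias
    for c in 0..<bias.count {
        // scale = gamma / sqrt(variance); epsilon is an assumption for stability
        let scale = gamma[c] / (variance[c] + epsilon).squareRoot()
        foldedWeights[c] = weights[c].map { $0 * scale }        // w_new = gamma * w / sqrt(variance)
        foldedBias[c] = scale * (bias[c] - mean[c]) + beta[c]   // b_new = gamma * (b - mean) / sqrt(variance) + beta
    }
    return (foldedWeights, foldedBias)
}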

Hollemans' post also offered starter YOLO code that leverages Forge, a wrapper library around Metal, and this was integral to our project [10]. Forge is a collection of helper code that makes it easier to construct deep neural networks using Apple's MPSCNN framework. For example, to set up tiny-YOLO using Forge, you can implement the layers in Swift code like the sketch below.
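(The original code listing did not survive extraction; the following is a reconstruction in the style of the public Forge examples. The layer types, the --> operator, and the exact parameter labels are assumptions based on those examples, and the channel sizes mirror the architecture listed in section 3.1.)

import Metal
import MetalPerformanceShaders
import Forge

// Build the tiny-YOLO layer graph with Forge's layer abstractions.
let device = MTLCreateSystemDefaultDevice()!
let leaky = MPSCNNNeuronReLU(device: device, a: 0.1)

let input = Input()
let output = input
    --> Resize(width: 416, height: 416)
    --> Convolution(kernel: (3, 3), channels: 16, padding: true, activation: leaky, name: "conv1")
    --> MaxPooling(kernel: (2, 2), stride: (2, 2))
    --> Convolution(kernel: (3, 3), channels: 32, padding: true, activation: leaky, name: "conv2")
    --> MaxPooling(kernel: (2, 2), stride: (2, 2))
    --> Convolution(kernel: (3, 3), channels: 64, padding: true, activation: leaky, name: "conv3")
    --> MaxPooling(kernel: (2, 2), stride: (2, 2))
    --> Convolution(kernel: (3, 3), channels: 128, padding: true, activation: leaky, name: "conv4")
    --> MaxPooling(kernel: (2, 2), stride: (2, 2))
    --> Convolution(kernel: (3, 3), channels: 512, padding: true, activation: leaky, name: "conv5")
    --> MaxPooling(kernel: (2, 2), stride: (2, 2))
    --> Convolution(kernel: (3, 3), channels: 1024, padding: true, activation: leaky, name: "conv6")
    --> Convolution(kernel: (3, 3), channels: 1024, padding: true, activation: leaky, name: "conv7")
    --> Convolution(kernel: (1, 1), channels: 125, padding: false, name: "conv8")
let model = Model(input: input, output: output)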

After appropriate channel and bounding-box processing, combined with UI functionality, we could spin up an app that drew bounding boxes with confidence scores in real time. For additional functionality, we also added language buttons that allow a user to identify objects in 5 different languages (English, Spanish, French, Hindi, and Chinese).

4. Experiments

Our experiments aimed to determine which net (YOLO, tiny-YOLO, or tiny-YOLO with the fire-layer modification) would be best for deployment on mobile. Our modified tiny-YOLO replaced 3 convolutional layers, the final 1024- and 512-filter convolutional layers, with two fire layers:

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[fire]
s_1x1=38
e_1x1=256
e_3x3=256

[fire]
s_1x1=42
e_1x1=512
e_3x3=512

[convolutional]
size=1
stride=1
pad=1
filters=425
activation=linear

In addition, we used the YOLO and tiny-YOLO net architectures to train and test on the VOC 2012 validation dataset and evaluated their speed and accuracy locally using mAP.

We chose to evaluate the models on CPU to mimic the limited computing resources available on modern phones. We then took the model with the highest CPU speed and implemented it in a mobile app.

4.1. Baselines

The VOC dataset is considered "solved" in many contexts. The VOC 2012 competition saw mAP scores of over 90% in each category [4]. The full YOLO version has a reported mAP score of 78% on the VOC 2007 test set after being trained on the VOC 2007 + 2012 sets [21]. The full YOLO net served as the baseline for our experiments. The GPU runs were conducted with a Titan X and served as our benchmark.

We used the Adam optimizer in all training sessions due to its tendency for expedient convergence, and we set a learning rate of 1e-3 to begin training, as suggested in the Darkflow implementation [25]. If the model's loss started thrashing (oscillating between two large numbers), we paused training and resumed from the last checkpoint with a lower learning rate. The lowest rate we used was 1e-6. We used the default mini-batch size of 64 for training, and we ran CPU evaluations using pre-trained weights from Darknet [21].

4.2. Evaluation

We evaluated the accuracy of the networks using mAP scores on the VOC test set. We heavily modified an R-CNN mAP evaluation script [19] to implement our own mAP score evaluator that is compatible with the JSON output of Darkflow.

To get the mean average precision, we compute the average precision for each of the 20 classes as follows:

$$\text{AvgPrecision} = \sum_{k=1}^{n} P(k)\,\Delta r(k)$$

In this equation, $P(k)$ refers to precision at a given threshold $k$, and $\Delta r(k)$ refers to the change in recall between $k$ and $k-1$. This corresponds to the area under the precision-recall curve. Thus, mean average precision is simply $\frac{1}{C}\sum \text{AvgPrecision}$, where C = 20 for the PASCAL VOC dataset.
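As a small illustration of this computation (our sketch, not the modified evaluation script itself), assuming precision and recall values are already ordered by threshold k:

// Average precision as the discrete area under the precision-recall curve:
// AP = sum over k of P(k) * (r(k) - r(k-1)).
func averagePrecision(precision: [Float], recall: [Float]) -> Float {
    var ap: Float = 0
    var previousRecall: Float = 0
    for (p, r) in zip(precision, recall) {
        ap += p * (r - previousRecall)  // P(k) * delta r(k)
        previousRecall = r
    }
    return ap
}

// Mean average precision over the C = 20 VOC classes.
func meanAveragePrecision(perClassAP: [Float]) -> Float {
    return perClassAP.reduce(0, +) / Float(perClassAP.count)
}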

To evaluate test-time inference speed on CPU, we took 100 images from the VOC validation set and used the Darknet framework to time how long each architecture took to make predictions for all the images. We then divided the total time by the number of images (100) to get an average number of images (frames) per second. To get scores for mobile, we used the Forge library to determine the max FPS for our deployed net. The GPU speeds reference those reported on the Darknet website [21].

The GPU times were recorded using the NVIDIA Titan X, as reported in Darknet [21]. The CPU times were recorded on a 2016 MacBook Pro with a 2.7 GHz Intel Core i5. The mobile times were recorded on a standard iPhone 7 running iOS 10.
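The per-image timing reduces to a few lines; a generic sketch of this methodology (the predict closure is a hypothetical stand-in for one forward pass):

import Foundation

// Average frames per second: total prediction time over N images, inverted.
func averageFPS<Image>(images: [Image], predict: (Image) -> Void) -> Double {
    let start = Date()
    for image in images { predict(image) }
    let elapsed = Date().timeIntervalSince(start)
    return Double(images.count) / elapsed
}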

4.3. Results

We report our findings on speed and accuracy on mobile and GPU, in comparison to modern models, below.

Graph 1: Empirical speed and mAP of locally trained and tested models using the VOC 2012 dataset.

We include a detailed comparison in Table 1. The tiny-YOLO net significantly outperformed all others in terms of speed, overtaking YOLO by a factor of 5x on both CPU and GPU. We achieved slightly lower accuracies when the nets were tested locally on a sample from the 2012 VOC validation set (rather than on the 2007 sets used to set the benchmarks on the Darknet website). The tiny-YOLO fire-layer modification was slightly faster but failed to predict any examples.

The locally tested YOLOs struggled with similar categories as their GPU counterparts, as expected. tiny-YOLO also trained and predicted significantly faster than YOLO because it has a fraction of the layers and parameters. tiny-YOLO-fire had even fewer parameters overall (although fire layers perform two convolutions, we replaced 3 convolution layers with 2 fire layers, reducing the overall parameter count) and thus performed slightly faster than tiny-YOLO, as expected.


Model                  Speed (FPS)   FLOPS      mAP
SqueezeNet (GPU)       -             2.17 Bn    -
YOLO (GPU)             40            59.68 Bn   78.6
tiny-YOLO (GPU)        207           0.81 Bn    57.1
YOLO (CPU)             2             59.68 Bn   65
tiny-YOLO (CPU)        6             0.81 Bn    51
tiny-YOLO-fire (CPU)   8             -          < 1
tiny-YOLO (mobile)     11            0.81 Bn    -

Table 1: Best reported inference speed and mAP scores of models compared to our tests of YOLO, tiny-YOLO, and tiny-YOLO-fire on CPU.

4.4. Discussion of Fire Layer

We attempted to train the modified tiny-YOLO with fire layers from scratch. This required instantiating a blank weights file, and training took around 4 hours on the GPU provisioned by Google Cloud, which resulted in a minimum loss of 6.

Figure 5: TensorBoard loss convergence graph of tiny-YOLO-fire after 500 iterations.

However, although the loss converged, the net still had a mAP of 0 at test time and was unable to make predictions for almost all images. As a result, we attempted to modify the $s_{1x1}$ layer of the second fire layer to have 46 channels instead of 42 (to better align with the original SqueezeNet implementation [15]) and tuned hyperparameters such as the learning rate. This also had no effect.

We then used the pre-trained weights for the first convolutional layers from tiny-YOLO and fine-tuned the remaining fire layers using these weights as a starting point. To test this technique, we created a small set of 3 images and annotations from the VOC validation set and trained the fire-layer tiny-YOLO on these images in an attempt to overfit and accurately predict them at test time. The net was unable to make accurate classifications on this dataset as well.

These experiments led us to believe that fire layers are ill suited as replacements for vanilla convolution layers. Intuitively this makes sense, as the fire layers concatenate outputs, leading to very different maps from sequential convolution.

Furthermore, the fire layers added at the end of the architecture squeeze an already aggressively reduced neural net. Before these layers, the activations are already somewhat well-defined and the parameter count is very low. As a result, additional squeezing without careful precision may result in the loss or scrambling of useful information, making the model far less expressive and unable to make sense of information learned in previous layers. Additionally, while fire layers may have made sense for all-image input without bounding boxes, where concatenated channels revealed information about the interplay of different features in an image, adding text and bounding boxes likely only adds noise to the outputs, since there is a disconnect between the meaning of the bounding-box parameters and the pixel values.

Further tuning of the $s_{1x1}$, $e_{1x1}$, and $e_{3x3}$ filter sizes and layers would need to take place to ensure that the network captures relevant data in these steps.

Another factor to consider is that while YOLO relies on Euclidean loss, the original SqueezeNet uses Softmax [15]:

$$L_{Eucl} = \frac{1}{2}\sum_{i=1}^{n}(x_i - y_i)^2$$

$$L_{Softmax} = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)$$

These are two fundamentally different ways of measuring loss, with one relying on distance while the other relies on probability distributions.

Empirical evidence suggests that YOLO is not as effective when evaluated using Softmax, with mAP scores dropping as low as 5-10% upon changing to the Softmax loss function [22]. It is possible that the reverse is true for SqueezeNet-inspired architectures and Euclidean loss, with the incompatible loss function having a strong negative effect on final mAP scores.

4.5. Results of Deployment

The tiny-YOLO net was successfully deployed to the iPhone 7 using Forge [12]. The app only draws bounding boxes if the confidence of a prediction is above a user-defined threshold (0.5 in our case). The deployment to mobile actually saw a slightly higher maximum inference speed than on the laptop CPU because of the direct access to the mobile GPU provided by Metal 2. However, the average speed roughly matched the results of the same network on CPU.

The relatively low mAP score is evidenced by the tendency of our app to ignore chairs and struggle to predict bottles. The categories with the lowest average precisions, such as potted plants and sofas, are rarely detected. This is likely because these categories vary widely in both shape and color and often include a lot of background noise. The app has the highest mAP for the categories of people and cars. To view a demo, visit https://www.youtube.com/watch?v=m0_GXm9aN5Q. Images from a run are shown below.

Figure 6: Detections in different languages.

5. Conclusions and Future Work

While the fire layer may be promising for speeding up the bottlenecked high-channel layers, the lack of predictions in this project makes it hard to draw positive conclusions about its future usefulness. Further tuning of a large number of parameters would need to take place before conclusions can be drawn about this hybrid architecture.

More likely, it seems that layer pruning and depth-wise separable convolutions, as proposed in Google's MobileNet [13], will provide an effective trade-off between speed and efficiency. These models shrink the number of hyperparameters without making drastic changes to the architecture and seem to be the next step in making deep learning practical on mobile. Next steps would investigate MobileNet and potential hybrids between it and YOLO.

Mobile deep learning is quickly becoming more accessible. With a push for increased GPU access on phones, even the full YOLO may one day run in real time on our iPhones. Consequently, the best development strategy for taking advantage of mobile deep learning might be to slowly add to existing models such as tiny-YOLO to make them approach YOLO accuracy. Nevertheless, our current app indicates that object detection on mobile can be done in near real time, and it hints at many possibilities in the realms of facial recognition, AR, and autonomous driving in the near future.

References

[1] CS231n lecture 11, slide 80. May 10, 2017.
[2] allanzelener. YAD2K. https://github.com/allanzelener/YAD2K, 2017.
[3] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. CoRR, abs/1703.06211, 2017.
[4] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. Assessing the significance of performance differences on the PASCAL VOC challenges via bootstrapping. 2012.
[5] R. B. Girshick. Fast R-CNN. CoRR, abs/1504.08083, 2015.
[6] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR, abs/1311.2524, 2013.
[7] R. B. Girshick, F. N. Iandola, T. Darrell, and J. Malik. Deformable part models are convolutional neural networks. CoRR, abs/1409.5403, 2014.
[8] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick. Mask R-CNN. CoRR, abs/1703.06870, 2017.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
[10] hollance. Forge: A neural network toolkit for Metal. https://github.com/hollance/Forge, 2017.
[11] hollance. yolo2metal open source script. https://github.com/hollance/Forge/blob/master/Examples/YOLO/yolo2metal.py, 2017.
[12] M. Hollemans. Real-time object detection with YOLO. http://machinethink.net/blog/object-detection-with-yolo/, 2017.
[13] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
[14] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. CoRR, abs/1611.10012, 2016.
[15] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. CoRR, abs/1602.07360, 2016.
[16] Y. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. CoRR, abs/1511.06530, 2015.
[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
[18] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. CoRR, abs/1405.0312, 2014.
[19] rbgirshick. Faster R-CNN. https://github.com/rbgirshick/py-faster-rcnn/blob/master/lib/datasets, 2016.
[20] J. Redmon and A. Farhadi. You only look once: Unified, real-time object detection. https://pjreddie.com/darknet/yolo/, 2015.
[21] J. Redmon and A. Farhadi. YOLO9000: Better, faster, stronger. https://pjreddie.com/darknet/yolo/, 2016.
[22] J. Redmon and A. Farhadi. YOLO9000: Better, faster, stronger. https://pjreddie.com/publications/yolo/, 2016.
[23] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.
[24] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
[25] thtrieu. Darkflow. https://github.com/thtrieu/darkflow, 2017.
[26] B. Wu, F. N. Iandola, P. H. Jin, and K. Keutzer. SqueezeDet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving. CoRR, abs/1612.01051, 2016.
