

Multi-scale Generative Adversarial Networks for Crowd Counting

Jianxing Yang and Yuan Zhou
School of Electrical and Information Engineering,
Tianjin University, Tianjin, China
[email protected]

Sun-Yuan Kung
Electrical Engineering Department,
Princeton University, Princeton, USA
[email protected]

Abstract—We investigate generative adversarial networks as an effective solution to the crowd counting problem. These networks not only learn the mapping from a crowd image to the corresponding density map, but also learn a loss function to train this mapping. The task of crowd counting poses many challenges, such as severe occlusion in extremely dense crowd scenes, perspective distortion, and high visual similarity between pedestrians and background elements. To address these problems, we propose a multi-scale generative adversarial network to generate high-quality crowd density maps for crowd scenes of arbitrary density. We utilize the adversarial loss from the discriminator to improve the quality of the estimated density map, which is critical for accurately predicting crowd counts. The proposed multi-scale generator extracts hierarchical features at multiple levels from the crowd image. The results show that the proposed method outperforms current state-of-the-art methods.

I. INTRODUCTION

Crowd counting is used to calculate the number of people in different crowd scenes. It has recently attracted considerable research attention owing to practical demands, including crowd analysis [1] and public safety. The tasks of crowd counting and density estimation are valuable for urban planning, crowd control, crowd detection [2] and video surveillance. Rapid urbanization and growth of the urban population have resulted in an increase in mass activities, such as sporting events, political rallies, transportation and communication. In such scenarios, there is a risk of crowd-related, life-threatening disasters, such as stampedes. Therefore, it is essential to estimate the crowd density distribution and analyze crowd behavior to improve management, safety and security.

However, the task of crowd counting has many challenges, such as severe occlusion, non-uniform density, perspective distortion, and the highly similar appearance of people and background elements. Several methods have been proposed to address these problems. Most existing detection-based methods [3] [4] [5] use a sliding window detector to detect individuals in a crowd. This approach assumes that the people in the crowd can be perceived as distinct individual entities; when applied to high-density crowds with severe occlusion, however, such detection-based methods are limited. Researchers have thus attempted to address this problem by adopting a regression model that learns a mapping from crowd features to crowd counts [6]. More precisely, crowd features are extracted from the foreground, such as the area of a crowd mask [7] [8], the edge count [8] [9], or texture features [7].

Recently, researchers have leveraged convolutional neural network (CNN) architectures to learn a nonlinear mapping from crowd images to the corresponding density maps [10] [9] [11]. A density map, or heatmap, represents the crowd density distribution in a crowd image, and the crowd count is obtained by integrating over it. Mapping to a density map preserves more information about the spatial distribution than mapping directly from crowd features to crowd counts.
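Concretely, once a density map is available, counting reduces to summation over its pixels. A minimal sketch (the file name is hypothetical):

```python
import numpy as np

# A density map assigns each pixel a fractional person density;
# the crowd count is the integral of the map, i.e., the sum over all pixels.
density_map = np.load("estimated_density_map.npy")  # hypothetical file, shape (H, W)
crowd_count = float(density_map.sum())
print(f"estimated count: {crowd_count:.1f}")
```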

The density-map-based approaches have demonstrated good performance on most datasets. Nevertheless, two key challenges remain, as outlined below.

The first challenge is that in many complicated and highly clustered crowd scenes, the estimated density maps often have poor quality because of the high visual similarity between pedestrians and background elements. Most existing CNN-based approaches focus on accurate estimation of the crowd count while ignoring the quality of the estimated crowd density maps.

The second challenge is that crowd images are often captured from varying viewpoints, resulting in a wide variety of perspectives and scales. People near the camera are often captured in fine detail, i.e., the face or even the entire body of the individual is visible. However, for people distant from the camera, or when images are captured from an aerial viewpoint, each person is represented only as a so-called head blob. Detecting people efficiently in both scenarios requires the model to handle significant variation in the scale of the people in the images.

To overcome these challenges, we propose a multi-scale generative adversarial network (MS-GAN) to generate high-quality crowd density maps of arbitrary crowd densities and perspectives. The MS-GAN consists of two sub-networks: a generator and a discriminator. The MS-GAN adopts a multi-scale generator to predict a density map for a given image with a large variation in the scale of the people. Moreover, a discriminator is used to refine the resulting density map produced by the generator.

The generator is a multi-scale fully convolutional network that combines both global and local features. In this architecture, features from multiple hierarchical convolutional layers with different receptive fields are incorporated to detect people with large-scale variation.



[Fig. 1. Overall architecture of the proposed density map estimation system: a crowd image passes through the multi-scale generator (conv1 to conv4 with max pooling and inception modules Incep-1 to Incep-3) to produce an estimated density map, which the discriminator compares against the ground truth map via the adversarial loss.]

The discriminator network serves as a supervisor and provides guidance on the quality of the density maps. We use an adversarial loss in addition to the Euclidean loss to improve estimation accuracy. The discriminator is trained to distinguish the real ground truth density maps from the generated poor-quality density maps.

In this study, we adopt an adversarial training mode for crowd density estimation. The generator network and discriminator network are trained in an alternating manner to solve the min-max problem. The generator network is trained to deceive the discriminator by generating precise density maps. The discriminator, in turn, is trained to distinguish the generated density maps from the real ground truth ones; it also provides feedback on localization precision and estimation accuracy to the generator. Trained simultaneously, the two networks complement each other toward an optimal result.

We summarize our main contributions as follows:

1) To the best of our knowledge, our study is the first to successfully apply GAN-like models to the challenging task of crowd counting.

2) A multi-scale fully convolutional network (FCN) is presented as the density map generator, which combines both global and local features. The features extracted by multiple hierarchical convolutional layers with different receptive fields are fused to detect pedestrians with large-scale variations.

3) We use a conditional GAN with an adversarial loss to generate a high-quality density map. We feed both the crowd image and the density map into the discriminator as input; it is trained to distinguish the ground truth density map from the generated poor-quality density map. The use of the adversarial loss engenders greater stability and improves the quality of the generated density maps.

II. NETWORK ARCHITECTURE

The proposed MS-GAN addresses the previously mentioned challenges of crowd density estimation. Fig. 1 illustrates the architecture of the MS-GAN, which contains two major components: a generator and a discriminator. Because the number of pixels a person occupies in a crowd image varies greatly owing to perspective distortion, we introduce a new design for the generator network, a multi-scale fully convolutional network (FCN), to estimate non-uniformly distributed crowd density. The proposed multi-scale FCN combines different levels of features extracted from multiple hierarchical convolutional layers. The discriminator network is trained to distinguish the generated density map from the real ground truth map. In this section, we provide the architectural details of the generator and discriminator.


A. Generator with Multi-Scale FCN

Crowd images are often obtained using surveillance cameras from different viewpoints, which introduces perspective distortion, so a significant variation exists in the scale of the people in the images. In particular, only the heads of individuals can be observed in highly congested scenes on account of severe occlusion. Recently, Hariharan et al. [12] showed that the information of interest for pixel-level tasks is spread across all layers of a CNN. They introduced the concept of the hypercolumn, defined as the concatenation of the features corresponding to a spatial location across all layers of a deep network.

Inspired by the above method [12] and the VGG network [13], we employ a fully convolutional network (FCN) to extract multi-scale features. The proposed FCN contains four convolutional blocks (conv-1 to conv-4) with two convolutional layers in each block, and three inception modules. The architectural details are shown in Table I.

We extract features from the first three blocks, in which the convolutional layers have different receptive fields, for the multi-scale representation. The architecture of the generator is shown in Fig. 1. Features from the lower convolutional layers help capture the tiny heads in extremely dense crowds, because the receptive field in these layers is not too large and is appropriate to the head scale. The feature maps from these blocks are fed into inception modules, a structure introduced in GoogLeNet [14] that uses multi-scale convolution kernels.

[Fig. 2. The architecture of the inception module: parallel 3×3, 5×5, and 7×7 convolutions applied to the previous layer, followed by filter concatenation.]

We adopt the inception module with filters of different receptive fields, i.e., 1 × 1, 3 × 3, and 5 × 5, to capture features at multiple scales, as shown in Fig. 2. Each inception module can capture a multi-scale representation of deep features. To fuse the features of the different inception modules, we pool the outputs of incep-1 and incep-2 down to the same size as that of incep-3, so that concatenation can be applied. The concatenated outputs of the inception modules are fed into conv-4, whose layers each have a kernel size of 3 × 3.
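To make the module concrete, here is a minimal PyTorch sketch of such an inception module; our assumptions are 16 filters per branch (following Table I) and ReLU activations, with padding chosen to preserve the spatial size:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Parallel 1x1, 3x3 and 5x5 convolutions whose outputs are
    concatenated along the channel axis (16 filters per branch);
    padding keeps the spatial size unchanged."""
    def __init__(self, in_channels, branch_channels=16):
        super().__init__()
        self.branch1 = nn.Conv2d(in_channels, branch_channels, kernel_size=1)
        self.branch3 = nn.Conv2d(in_channels, branch_channels, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_channels, branch_channels, kernel_size=5, padding=2)

    def forward(self, x):
        return torch.cat([torch.relu(self.branch1(x)),
                          torch.relu(self.branch3(x)),
                          torch.relu(self.branch5(x))], dim=1)
```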

TABLE I
THE ARCHITECTURE DETAILS OF THE MULTI-SCALE GENERATOR

Block        Layer      Kernel             Num
Conv-1       2 conv     7 × 7              32
Down sample  Max-pool   2 × 2              -
Conv-2       2 conv     5 × 5              64
Down sample  Max-pool   2 × 2              -
Conv-3       2 conv     5 × 5              64
Incp-1       conv       (5/3/1) × (5/3/1)  3 × 16
Incp-2       conv       (5/3/1) × (5/3/1)  3 × 16
Incp-3       conv       (5/3/1) × (5/3/1)  3 × 16
Conv-4       conv       3 × 3              48
             conv       3 × 3              1
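Reading Table I and Fig. 1 together, the full generator might be sketched as follows (reusing the InceptionModule above; the pooling factors used to align the incep-1/incep-2 outputs with incep-3 are our reading of Fig. 1, not stated explicitly in the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(in_ch, out_ch, k):
    # two k x k convolutions with ReLU; padding preserves resolution
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=k // 2), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, k, padding=k // 2), nn.ReLU(inplace=True))

class MSGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = conv_block(3, 32, 7)   # Conv-1: two 7x7 convs, 32 filters
        self.conv2 = conv_block(32, 64, 5)  # Conv-2: two 5x5 convs, 64 filters
        self.conv3 = conv_block(64, 64, 5)  # Conv-3: two 5x5 convs, 64 filters
        self.incp1, self.incp2, self.incp3 = (
            InceptionModule(32), InceptionModule(64), InceptionModule(64))
        self.conv4 = nn.Sequential(         # conv-4: 3x3 with 48 filters, then 3x3 with 1
            nn.Conv2d(3 * 48, 48, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(48, 1, 3, padding=1))

    def forward(self, x):
        f1 = self.conv1(x)                    # full resolution
        f2 = self.conv2(F.max_pool2d(f1, 2))  # 1/2 resolution
        f3 = self.conv3(F.max_pool2d(f2, 2))  # 1/4 resolution
        # pool the incep-1 and incep-2 outputs so all three match incep-3's size
        m1 = F.max_pool2d(self.incp1(f1), 4)
        m2 = F.max_pool2d(self.incp2(f2), 2)
        m3 = self.incp3(f3)
        return self.conv4(torch.cat([m1, m2, m3], dim=1))  # 1-channel density map
```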

B. Discriminator

To discriminate between the real density map corresponding to a crowd image and a poor-quality density map that mismatches its crowd image, we input pairs of a density map and the corresponding crowd image as a joint observation and train the discriminator to judge the pairs as real or fake. We follow the architecture introduced in [15]. The discriminator consists of four stride-2 convolutional layers (conv1 to conv4), so no pooling layers are needed. The filters in every convolutional layer adopt the same 5 × 5 kernel size, and a rectified linear unit (ReLU) activation is applied after each convolutional layer. Finally, a fully connected layer with sigmoid activation outputs the probability for the binary classification.
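A hedged sketch of such a discriminator; the 4-channel input (RGB crowd image concatenated with the 1-channel density map, assumed to share one spatial size) and the filter count of 64 per layer are our assumptions, as the paper does not specify them:

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Four stride-2 5x5 convolutions (no pooling), each followed by ReLU,
    then a sigmoid-activated fully connected layer for real/fake."""
    def __init__(self, in_channels=4, width=64):  # 3 image + 1 density channels
        super().__init__()
        layers, ch = [], in_channels
        for _ in range(4):
            layers += [nn.Conv2d(ch, width, 5, stride=2, padding=2),
                       nn.ReLU(inplace=True)]
            ch = width
        self.features = nn.Sequential(*layers)
        self.fc = nn.LazyLinear(1)  # infers the flattened size on first call

    def forward(self, image, density_map):
        x = torch.cat([image, density_map], dim=1)  # joint observation
        x = self.features(x).flatten(1)
        return torch.sigmoid(self.fc(x))
```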

III. ADVERSARIAL TRAINING

Generative adversarial networks [16] are commonly used to produce images that appear realistic. The generative model and discriminative model are trained simultaneously toward an optimal result. The generative model captures the distribution of the real data and generates realistic samples from normally distributed noise as input. The discriminative model is a simple binary classifier used to distinguish the real data from the generated samples (which we refer to as fake); its output is the probability that the input image belongs to the real data. In other words, the generator strives to produce samples that look realistic enough to deceive the discriminator, while the discriminative model strives to detect whether a sample comes from the real data or is a generated image. The two models compete in this adversarial game, improving the generated results until the generated samples are indistinguishable to the discriminative model.

The density map estimation problem differs from the above scenario in two significant ways. First, the goal of crowd estimation is to accurately produce a realistic density map rather than a realistic natural image; accordingly, the input to our generator is not random noise but a crowd image. Second, because the crowd image contains considerable information, we input both the crowd image and the density map to the discriminator.


A. Learning objective

To train our model, we introduce a conditional generative adversarial network to accurately predict a high-quality density map. The generator and discriminator play the following adversarial min-max game on V(D, G):

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x),\, y \sim p_{data}(y)}[\log D(x, y)] + \mathbb{E}_{x \sim p_{data}(x)}[\log(1 - D(x, G(x)))] \quad (1)$$

Here, G represents the generator, which predicts a density map G(x) from a crowd image x, and D represents the discriminator, which outputs the likelihood that a given density map was sampled from the ground truth. G tries to minimize this objective against an adversarial D that tries to maximize it: the discriminator attempts to distinguish the generated density map G(x) from the real ground truth density map y, while the generator is trained to produce a high-quality density map that the adversarial discriminator cannot distinguish from the real one.

As shown in Fig. 1, we include both the crowd image and the density map as inputs to the discriminator. The discriminator distinguishes between the generated density map and the ground truth density map, and judges whether the generated density map matches its corresponding crowd image.

B. Content loss

The generator learns a mapping from the input crowd image to the corresponding density map. We minimize the Euclidean distance between the generated density map and the ground truth map. The pixel-wise mean squared error (MSE) loss can be formulated as:

$$L_{MSE} = \frac{1}{N} \sum_{i=1}^{N} \| y_i - G(x_i; \theta) \|_2^2 \quad (2)$$

where G(x_i; θ) is the estimated density map produced by the generator with learnable parameters θ, y_i is the ground truth density map of the i-th image x_i, and N is the number of training images.

C. Adversarial loss

As mentioned, the discriminator determines the difference between the generated and real density map pairs. We thus label the generated density map with zero and the ground truth map with one. The output of the discriminator represents a score for the estimation accuracy of the generated density map; the discriminator is trained to solve a binary classification task between the poor-quality density maps and the ground truth ones. We train the density map generator with an additional adversarial loss provided by the discriminator to generate high-quality density maps. The adversarial loss function can be formulated as:

$$L_{ADV} = -\log D(x_i, G(x_i; \theta)) \quad (3)$$

where D(x_i, G(x_i; θ)) represents the probability that the estimated density map is the precise density map matching the corresponding crowd image. We concatenate the crowd image x_i and the generated density map G(x_i) as the input to the discriminator.

Our final loss is composed of the mean squared error (MSE) loss for the generator and the adversarial loss from the discriminator. The overall loss function for the generator during adversarial training is formulated as:

$$L_{GAN} = L_{MSE} + \alpha L_{ADV} \quad (4)$$

We set the hyperparameter α = 0.002 to trade off the two losses. Combining the two losses makes the network more stable and enables more accurate prediction of the density map.
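In code, the generator objective of Eqs. (2)-(4) might be computed as follows (a sketch; `G` and `D` are the networks above, and the small epsilon is our addition for numerical safety):

```python
import torch

def generator_loss(G, D, images, gt_maps, alpha=0.002):
    """L_GAN = L_MSE + alpha * L_ADV, Eqs. (2)-(4)."""
    est_maps = G(images)
    l_mse = torch.mean((gt_maps - est_maps) ** 2)         # Eq. (2), pixel-wise MSE
    eps = 1e-8                                            # avoids log(0)
    l_adv = -torch.log(D(images, est_maps) + eps).mean()  # Eq. (3)
    return l_mse + alpha * l_adv                          # Eq. (4)
```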

IV. EXPERIMENT

We evaluated the performance of our proposed MS-GAN architecture on four crowd counting datasets. We take the generator as a regressor to predict the density map; the total count of people in an image is obtained by summation over the predicted density map. We demonstrate that the multi-scale convolutional network with an adversarial architecture achieves competitive, and often superior, performance.

A. Evaluation metric

Following the convention of existing works on crowd counting, we evaluated the different methods using the mean absolute error (MAE) and the mean squared error (MSE), defined as follows:

$$MAE = \frac{1}{N} \sum_{i=1}^{N} |z_i - \hat{z}_i|, \qquad MSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (z_i - \hat{z}_i)^2} \quad (5)$$

where N is the number of test images, and z_i and ẑ_i are the actual count and the estimated count of the i-th crowd image, respectively. Roughly speaking, the MAE indicates the accuracy of the estimation, and the MSE indicates its robustness.
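Both metrics of Eq. (5) are straightforward to compute from per-image counts, e.g.:

```python
import math

def evaluate(counts_true, counts_est):
    """MAE and MSE over N test images, as in Eq. (5); note that the
    'MSE' used in this literature is a root-mean-squared error."""
    n = len(counts_true)
    mae = sum(abs(z - z_hat) for z, z_hat in zip(counts_true, counts_est)) / n
    mse = math.sqrt(sum((z - z_hat) ** 2
                        for z, z_hat in zip(counts_true, counts_est)) / n)
    return mae, mse
```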

B. Implementation details

The whole network was trained by stochastic gradient descent with the Adam optimizer, using a learning rate of 0.0002, on a single NVIDIA GeForce GTX TITAN X GPU with 12 GB of memory. The parameters of the generator and discriminator were initialized from a normal distribution. We set the batch size to one because the datasets are not very large. During training, the generator and discriminator were trained alternately. We first trained the generator network for 20 epochs using the MSE loss, computed between the estimated density map and the ground truth. We then added the discriminator network and alternately updated the generator and discriminator for 100 epochs.
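A schematic of this schedule; `G`, `D` and the data `loader` are assumed defined, and the use of binary cross-entropy for the discriminator update is our assumption:

```python
import torch

g_opt = torch.optim.Adam(G.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = torch.nn.BCELoss()

# Phase 1: 20 epochs of MSE-only pretraining for the generator.
for epoch in range(20):
    for image, gt_map in loader:  # batch size 1
        loss = torch.mean((G(image) - gt_map) ** 2)
        g_opt.zero_grad(); loss.backward(); g_opt.step()

# Phase 2: 100 epochs of alternating generator/discriminator updates.
for epoch in range(100):
    for image, gt_map in loader:
        # Discriminator step: real pairs -> 1, generated pairs -> 0.
        d_real = D(image, gt_map)
        d_fake = D(image, G(image).detach())
        d_loss = (bce(d_real, torch.ones_like(d_real)) +
                  bce(d_fake, torch.zeros_like(d_fake)))
        d_opt.zero_grad(); d_loss.backward(); d_opt.step()

        # Generator step: MSE plus adversarial term (Eq. 4, alpha = 0.002).
        est = G(image)
        g_loss = (torch.mean((est - gt_map) ** 2)
                  - 0.002 * torch.log(D(image, est) + 1e-8).mean())
        g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```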


[Fig. 3. Results of our MS-GAN model on the UCF_CC_50 dataset, with ground-truth (GT) and estimated (Est) counts. From left to right: crowd images, CrowdNet [10], MCNN [11], MS-GAN (ours), and the ground truth density maps.]

C. UCF_CC_50 dataset

The UCF_CC_50 dataset contains 50 extremely dense crowd images, first introduced by Idrees et al. [17]. The 50 images are all high-density crowd scenes selected from the Internet; the crowd density distributions are non-uniform, and the images suffer from severe occlusion. The drawback of this dataset is that only a limited number of images are available for training and evaluation. We therefore adopted a data augmentation approach that cropped nine patches from each image for training, each patch being a quarter the size of the original image. Following the standard setting in [17], we performed five-fold cross-validation to evaluate our method.
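For reference, this augmentation might be implemented as below; the even 3 × 3 grid of crop positions is our assumption, since the paper does not state where the patches are taken ('a quarter the size' is read as half the height and half the width):

```python
def nine_quarter_patches(img):
    """Crop nine patches, each a quarter of the image area
    (half the height and width), from a 3x3 grid of positions."""
    h, w = img.shape[:2]
    ph, pw = h // 2, w // 2
    return [img[dy:dy + ph, dx:dx + pw]
            for dy in (0, (h - ph) // 2, h - ph)
            for dx in (0, (w - pw) // 2, w - pw)]
```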

We compared our method with four existing methods on the UCF_CC_50 dataset. As shown in Fig. 3, the estimated density maps are of high quality and consistent with the corresponding input crowd images. The MAE and MSE results of our MS-GAN are shown in Table II. Idrees et al. [17] employed multiple sources of information to estimate the counts of people in extremely dense crowd images. The method presented in [9] is a CNN-based model with two related objectives, crowd density and crowd count. Boominathan et al. [10] leveraged a combination of deep and shallow fully convolutional networks to estimate the number of individuals in a dense crowd. Zhang et al. [11] proposed multi-column convolutional networks to estimate the crowd count in a single image with broad variations of people. Our method achieves the best MAE and MSE among the compared approaches.

D. ShanghaiTech dataset

The ShanghaiTech dataset is a large-scale crowd counting dataset introduced in [11]. It contains 1,198 annotated images with a total of 330,165 people and consists of two parts.

TABLE II
COMPARISON PERFORMANCE OF DIFFERENT METHODS ON THE UCF_CC_50 DATASET.

Method              MAE     MSE
Idrees et al. [17]  419.5   541.6
Zhang et al. [9]    467.0   498.5
CrowdNet [10]       452.5   -
MCNN [11]           377.6   509.1
MS-GAN (ours)       345.7   418.3

TABLE III
COMPARISON RESULTS OF DIFFERENT APPROACHES WITH OUR MS-GAN MODEL ON THE SHANGHAITECH DATASET.

                      Part A          Part B
Method              MAE     MSE     MAE    MSE
LBP+RR              303.3   371.0   59.1   87.1
Zhang et al. [9]    181.8   277.7   32.0   49.8
MCNN [11]           110.2   173.2   26.4   41.3
MS-GAN (ours)        96.5   135.9   18.7   30.5

Part A has 482 crowd images randomly obtained from the Internet, and Part B contains 716 crowd images taken from the busy streets of metropolitan areas in Shanghai. Both parts were divided into training and testing sets: Part A uses 300 images for training and the remaining 182 for testing, while Part B uses 400 images for training and 316 for testing. Table III gives the comparison with other methods on both the MAE and MSE metrics. Our method predicts the crowd count accurately despite large fluctuations in crowd density.

We compared our method with the LBP+RR method, which uses local binary pattern (LBP) features and ridge regression (RR) to estimate the crowd number. Zhang et al. [9] proposed a deep CNN for cross-scene crowd counting, and the MCNN [11] estimates the crowd count in a single image via a multi-column CNN.

V. CONCLUSION

In this paper, we proposed a multi-scale generative adversarial network to generate high-quality crowd density maps for arbitrary crowd densities and arbitrary perspectives. We utilized the generative adversarial framework to improve the quality of the estimated density map, which is critical for accurately predicting crowd counts. The proposed multi-scale generator extracts hierarchical features at multiple levels from the crowd image. Extensive experiments on diverse datasets showed that our adversarial model achieves state-of-the-art performance on the major datasets.

VI. ACKNOWLEDGEMENT

This work was supported in part by the National Natural Science Foundation of China under Grant 61520106002 and Grant 61571326, in part by the Natural Science Foundation of Tianjin under Grant 16JCQNJC00900, and in part by the Brandeis Program of the Defense Advanced Research Projects Agency and the Space and Naval Warfare Systems Center Pacific under Contract 66001-15-C-4068.


REFERENCES

[1] S. Wang, E. Zhu, J. Yin, and F. Porikli, "Anomaly detection in crowded scenes by SL-HOF descriptor and foreground classification," in International Conference on Pattern Recognition, 2017, pp. 3398–3403.

[2] K. Nakamura, T. Ono, and N. Babaguchi, "Detection of groups in crowd considering their activity state," in International Conference on Pattern Recognition, 2017, pp. 277–282.

[3] P. Dollar, C. Wojek, B. Schiele, and P. Perona, "Pedestrian detection: An evaluation of the state of the art," IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 34, no. 4, pp. 743–761, 2012.

[4] M. Li, Z. Zhang, K. Huang, and T. Tan, "Estimating the number of people in crowded scenes by MID based foreground segmentation and head-shoulder detection," in International Conference on Pattern Recognition, 2009, pp. 1–4.

[5] Z. Lin and L. S. Davis, "Shape-based human detection and segmentation via hierarchical part-template matching," IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 32, no. 4, pp. 604–618, 2010.

[6] K. Chen, C. L. Chen, S. Gong, and T. Xiang, "Feature mining for localised crowd counting," in British Machine Vision Conference, 2012.

[7] A. B. Chan, Z. S. J. Liang, and N. Vasconcelos, "Privacy preserving crowd monitoring: Counting people without people models or tracking," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1–7.

[8] N. Paragios and V. Ramesh, "A MRF-based approach for real-time subway monitoring," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2001, pp. I-1034–I-1040.

[9] C. Zhang, H. Li, X. Wang, and X. Yang, "Cross-scene crowd counting via deep convolutional neural networks," in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 833–841.

[10] L. Boominathan, S. S. S. Kruthiventi, and R. V. Babu, "CrowdNet: A deep convolutional network for dense crowd counting," in ACM Multimedia, 2016, pp. 640–644.

[11] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma, "Single-image crowd counting via multi-column convolutional neural network," in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 589–597.

[12] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, "Hypercolumns for object segmentation and fine-grained localization," in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 447–456.

[13] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.

[14] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.

[15] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5967–5976.

[16] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in International Conference on Neural Information Processing Systems, 2014, pp. 2672–2680.

[17] H. Idrees, I. Saleemi, C. Seibert, and M. Shah, "Multi-source multi-scale counting in extremely dense crowd images," in IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2547–2554.

[18] B. Xu and G. Qiu, "Crowd density estimation based on rich features and random projection forest," in IEEE Winter Conference on Applications of Computer Vision, 2016, pp. 1–8.

[19] B. Sheng, C. Shen, G. Lin, J. Li, W. Yang, and C. Sun, "Crowd counting via weighted VLAD on dense attribute feature maps," IEEE Transactions on Circuits & Systems for Video Technology, vol. PP, no. 99, pp. 1–1, 2016.
