


Hybrid Graph Neural Networks for Crowd Counting

Ao Luo^1, Fan Yang^2, Xin Li^2, Dong Nie^3, Zhicheng Jiao^4, Shangchen Zhou^5, Hong Cheng^1*

^1 Center for Robotics, University of Electronic Science and Technology of China
^2 Inception Institute of Artificial Intelligence
^3 Department of Computer Science, University of North Carolina at Chapel Hill
^4 University of Pennsylvania
^5 Nanyang Technological University

aoluo [email protected], [email protected]

Abstract

Crowd counting is an important yet challenging task due to the large scale and density variation. Recent investigations have shown that distilling rich relations among multi-scale features and exploiting useful information from the auxiliary task, i.e., localization, are vital for this task. Nevertheless, how to comprehensively leverage these relations within a unified network architecture is still a challenging problem. In this paper, we present a novel network structure called Hybrid Graph Neural Network (HyGnn), which addresses the problem by interweaving the multi-scale features for crowd density and its auxiliary task (localization) together and performing joint reasoning over a graph. Specifically, HyGnn integrates a hybrid graph to jointly represent the task-specific feature maps of different scales as nodes, and two types of relations as edges: (i) multi-scale relations capturing the feature dependencies across scales and (ii) mutual beneficial relations building bridges for the cooperation between counting and localization. Thus, through message passing, HyGnn can distill rich relations between the nodes to obtain more powerful representations, leading to robust and accurate results. Our HyGnn performs remarkably well on four challenging datasets: ShanghaiTech Part A, ShanghaiTech Part B, UCF_CC_50 and UCF-QNRF, outperforming the state-of-the-art approaches by a large margin.

Introduction

Crowd counting, with the purpose of analyzing large crowds quickly, is a crucial yet challenging computer vision and AI task. It has drawn a lot of attention due to its potential applications in public security and planning, traffic control, crowd management, public space design, etc.

As with many other computer vision tasks, the performance of crowd counting has been substantially improved by Convolutional Neural Networks (CNNs). Recently, state-of-the-art crowd counting methods (Liu, Weng, and Mu 2019; Liu, Salzmann, and Fua 2019; Wan et al. 2019; Jiang et al. 2019) mostly follow the density-based paradigm. Given an image or video frame, CNN-based regressors are trained to estimate the

* denotes the corresponding author.
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Figure 1: Illustration of the proposed HyGnn model. (a) Input image, in which crowds have heavy overlaps and occlusions. (b) Backbone, which is a truncated VGG-16 model. (c) Domain-specific branches: one for crowd counting and the other for localization. (d) HyGnn, which represents the features from different scales and domains as nodes, and the relations between them as edges; after several message passing iterations, multiple types of useful relations are built. (e) Crowd density map (for counting) and localization map (as the auxiliary task).

crowd density map, whose values are summed to give the entire crowd count.

Recent studies (Shen et al. 2018; Cao et al. 2018; Li et al. 2017; 2018) have shown that multi-scale information, or relations across multiple scales, helps to capture the contextual knowledge that benefits crowd counting. Moreover, crowd counting and its auxiliary task (localization), despite analyzing the crowd scene from different perspectives, can provide beneficial clues for each other (Liu, Weng, and Mu 2019; Lian et al. 2019). The crowd density map can offer guidance information and self-adaptive perception for precise crowd localization; conversely, crowd localization can help to alleviate the local inconsistency issue in the density map. This mutual cooperation, which we call the mutual beneficial relation, is the key factor in estimating a high-quality density map. However, most methods consider the crowd counting problem from only one of these aspects while ignoring the other. Consequently, they fail to fully utilize the multiple types of useful relations or structural dependencies in the learning and inference processes, resulting in sub-optimal results.

One primary reason is the lack of a unified and effective framework capable of modeling the different types of relations (i.e., multi-scale relations and mutual beneficial relations) within a single model. To address this issue, we introduce a novel Hybrid Graph Neural Network (HyGnn), which formulates crowd counting and localization as a graph-based, joint reasoning procedure. As shown in Fig. 1, we build a hybrid graph that consists of two types of nodes, i.e., counting nodes storing density-related features and localization nodes storing location-related features, and two different pairwise relationships (edge types) between them. By interweaving the multi-scale and multi-task features together and progressively propagating information over the hybrid graph, HyGnn can fully leverage the different types of useful information and is capable of distilling the valuable, high-order relations among them for much more comprehensive crowd analysis.

HyGnn is easy to implement and end-to-end learnable. Importantly, it has two major benefits in comparison to existing models for crowd counting (Liu, Weng, and Mu 2019; Liu, Salzmann, and Fua 2019; Wan et al. 2019; Jiang et al. 2019). (i) HyGnn interweaves crowd counting and localization through joint, multi-scale, graph-based processing rather than the simple combination used in most existing solutions. Thus, HyGnn significantly strengthens the information flow between tasks and across scales, enabling the augmented representation to incorporate more useful priors learned from the auxiliary task and from different scales. (ii) HyGnn explicitly models and reasons over all relations (multi-scale relations and mutual beneficial relations) simultaneously within a hybrid graph, whereas most existing methods are not capable of dealing with such complicated relations. Therefore, our HyGnn can effectively capture their dependencies to overcome inherent ambiguities in crowd scenes. Consequently, our predicted crowd density map is potentially more accurate and more consistent with the true crowd localization.

In our experiments, we show that HyGnn performs remarkably well on four widely used benchmarks and surpasses prior methods by a large margin. Our contributions are summarized in three aspects:

• We present a novel end-to-end learnable model, namely Hybrid Graph Neural Network (HyGnn), for joint crowd counting and localization. To the best of our knowledge, HyGnn is the first deep model capable of explicitly modeling and mining high-level relations between counting and its auxiliary task (localization) across different scales through a hybrid graph model.

• HyGnn is equipped with a unique multi-tasking property, where different types of nodes, connections (or edges), and message passing functions are parameterized by different neural designs. With this property, HyGnn can more precisely leverage cooperative information between crowd counting and localization to boost counting performance.

• We conduct extensive experiments on four well-known benchmarks, including ShanghaiTech Part A, ShanghaiTech Part B, UCF_CC_50 and UCF-QNRF, on which we set new records.

Related Works

Crowd Counting and Localization. Early works in crowd counting (Viola, Jones, and Snow 2005) use detection-based methods and employ handcrafted features such as Haar (Viola, Jones, and others 2001) and HOG (Dalal and Triggs 2005) to train the detector. The overall performance of these algorithms is rather limited due to various occlusions. Regression-based methods, which avoid solving the hard detection problem, have become mainstream and achieved great performance breakthroughs. Traditionally, regression models (Chen et al. 2013; Lempitsky and Zisserman 2010; Pham et al. 2015) learn the mapping between low-level image features and object count or density using Gaussian process or random forest regressors. Recently, various CNN-based counting methods have been proposed (Zhang et al. 2015; 2016; Liu, Weng, and Mu 2019; Liu, Salzmann, and Fua 2019; Wan et al. 2019; Jiang et al. 2019) to better deal with different challenges by predicting a density map whose values are summed to give the count. In particular, the scale variation issue has attracted the most attention in recent CNN-based methods (Dai et al. 2019; Varior et al. 2019). On the other hand, as observed in some recent research (Idrees et al. 2018a; Liu, Weng, and Mu 2019; Lian et al. 2019), although current state-of-the-art methods can report accurate crowd counts, they may produce density maps that are inconsistent with the true density. One major reason is the lack of crowd localization information. Some recent studies (Zhao et al. 2019; Liu, Weng, and Mu 2019) have tried to exploit useful information from localization in a unified framework. They, however, only share the underlying representations or simply interweave the two task-specific modules for more robust representations. Differently, our HyGnn considers a better way to utilize the mutual guidance information: explicitly modeling and iteratively distilling the mutual beneficial relations across scales within a hybrid graph. For a more comprehensive survey, we refer interested readers to (Kang, Ma, and Chan 2018).

Graph Neural Networks. The essential idea of a Graph Neural Network (GNN) is to enhance node representations by propagating information between nodes. Scarselli et al. (Scarselli et al. 2008) first introduced the concept of GNN, which extended recursive neural networks for processing graph-structured data. Li et al. (Li et al. 2016) proposed to improve the representation capacity of GNNs by using Gated Recurrent Units (GRUs). Gilmer et al. (Gilmer et al. 2017) used a message passing neural network to generalize the GNN. Recently, GNNs have been successfully applied to attribute recognition (Meng et al. 2018), human-object interactions (Qi et al. 2018a), action recognition (Si et al. 2018), etc.

Figure 2: Overview of our HyGnn model. The model is built on a truncated VGG-16 and includes a Domain-specific Feature Learning Module to extract features for the two domains. The HyGnn itself distills multi-scale and cross-domain information so as to learn better representations; in each of the K iterations, cross-domain message passing is followed by cross-scale message passing. Finally, the multi-scale features are fused to produce the density map for counting as well as the auxiliary task prediction (the localization map).

Our HyGnn shares with the above methods the idea of fully exploiting the underlying relationships between multiple latent representations through a GNN. However, most existing GNN-based models are designed to deal with only one relation type, which may limit the power of the GNN. To overcome this limitation, our HyGnn is equipped with a multi-tasking property, i.e., it parameterizes different types of connections (or edges) and message passing functions with different neural designs, which significantly distinguishes HyGnn from all existing GNNs.

Methodology

Preliminaries

Problem Formulation. Let the crowd counting model be represented by a function $M$ that takes an image $I$ as input and generates the corresponding crowd density map $D$ (for counting) as well as the auxiliary task prediction, i.e., the localization map $L$. Let $D^g$ and $L^g$ be the ground truth density map and localization map, respectively. Our goal is to learn powerful domain-specific representations, denoted $f_d$ and $f_l$, that minimize the errors between the estimated $D$ and the ground truth $D^g$, as well as between $L$ and $L^g$. Notably, the two tasks share a common meta-objective, and $D^g$ and $L^g$ are obtained from the same point annotations without additional target labels.

Notations. To achieve this goal, we need to distill the underlying dependencies between multi-task and multi-scale features. Given the multi-scale density feature maps $F_d = \{f^{s_i}_d\}_{i=1}^{N}$ and the multi-scale localization feature maps $F_l = \{f^{s_i}_l\}_{i=1}^{N}$, we represent $F_d$ and $F_l$ with a directed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is a set of nodes and $\mathcal{E}$ is a set of edges. The nodes in our HyGnn are grouped into two types: $\mathcal{V} = \mathcal{V}^1 \cup \mathcal{V}^2$, where $\mathcal{V}^1 = \{v^1_i\}_{i=1}^{|\mathcal{V}^1|}$ is the set of counting (density) nodes and $\mathcal{V}^2 = \{v^2_i\}_{i=1}^{|\mathcal{V}^2|}$ is the set of localization nodes. In our model, we have the same number of nodes in the two latent domains, and therefore $|\mathcal{V}^1| = |\mathcal{V}^2| = N$. Accordingly, there are two types of edges $\mathcal{E} = \mathcal{E}_{i,j} \cup \mathcal{E}^{m,n}$ between them: (i) cross-scale edges $e^m_{i,j} = (v^m_i, v^m_j) \in \mathcal{E}_{i,j}$ stand for the multi-scale relations between nodes from the $i$th scale to the $j$th scale within the same domain $m \in \{1, 2\}$, where $i, j \in \{1, \dots, N\}$; (ii) cross-domain edges $e^{m,n}_i = (v^m_i, v^n_i) \in \mathcal{E}^{m,n}$ reflect mutual beneficial relations between nodes from domain $m$ to domain $n$ at the same scale $i \in \{1, \dots, N\}$, where $m, n \in \{1, 2\}$ and $m \neq n$. For each node $v^m_i$ ($i \in \{1, \dots, N\}$, $m \in \{1, 2\}$), we learn its updated representation $h^m_i$ by aggregating the representations of its neighbors. Finally, the updated multi-scale features $H^1 = \{h^1_i\}_{i=1}^{|\mathcal{V}^1|}$ and $H^2 = \{h^2_i\}_{i=1}^{|\mathcal{V}^2|}$ are fused to produce the final representations $f_d$ and $f_l$, which are used to generate the outputs $D$ and $L$. We only consider multi-scale relations between nodes in the same domain, and mutual beneficial (cross-domain) relations between nodes at the same scale. Since our graph model is designed to simultaneously deal with two different node and relation types, we term it a Hybrid Graph Neural Network (HyGnn), which we detail in the following section.
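To make the graph structure concrete, the sketch below (our own illustration in Python, not the authors' released code) enumerates the nodes and the two edge sets for $N$ scales and two domains:

```python
from itertools import product

N = 3  # number of scales per domain (the paper's final choice)

# Nodes are (domain, scale) pairs; domain 1 = counting, domain 2 = localization.
nodes = [(m, i) for m, i in product((1, 2), range(N))]

# Cross-scale edges: ordered pairs of distinct scales within one domain.
cross_scale_edges = [((m, i), (m, j)) for m in (1, 2)
                     for i in range(N) for j in range(N) if i != j]

# Cross-domain edges: same scale, opposite domains, in both directions.
cross_domain_edges = [((m, i), (n, i)) for i in range(N)
                      for m, n in ((1, 2), (2, 1))]

print(len(nodes), len(cross_scale_edges), len(cross_domain_edges))  # 6 12 6
```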

Hybrid Graph Neural Network (HyGnn)

Overview. The key idea of our HyGnn is to perform $K$ message propagation iterations over $\mathcal{G}$ to jointly distill and reason over all relations between crowd counting and the auxiliary task (localization) across scales. Generally, as shown in Fig. 2, HyGnn maps the given image $I$ to the final predictions $D$ and $L$ in three phases. First, in the domain-specific feature extraction phase, HyGnn generates the multi-scale density features $F_d$ and localization features $F_l$ for $I$ through a Domain-specific Feature Learning Module (DFL), and represents these features with a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. Second, a parametric message passing phase runs for $K$ iterations to propagate messages between nodes and to update the node representations according to the received messages. Third, a readout phase fuses the updated multi-scale features $H^1$ and $H^2$ to generate the final representations (i.e., $f_d$ and $f_l$) and maps them to the outputs $D$ and $L$. Note that, as crowd counting is our main task, we emphasize the accuracy of $D$ during the learning process.

Page 4: Hybrid Graph Neural Networks for Crowd Counting · Hybrid Graph Neural Network (HyGnn), for joint crowd counting and localization. To the best of our knowledge, HyGnn is the first

Figure 3: The architecture of the learnable adapter. The adapter takes the node representation of one (source) domain $h^{m(k)}_i$ as input and outputs the adaptive convolution parameters $\theta^{*m(k)}_i$. The adaptive representation $h'^{m(k)}_i$ is generated conditioned on $h^{n(k)}_i$.

Domain-specific Feature Learning Module (DFL). DFL is one of the major modules of our model; it extracts the multi-scale, domain-specific features $F_d = \{f^{s_i}_d\}_{i=1}^{N}$ and $F_l = \{f^{s_i}_l\}_{i=1}^{N}$ from the input $I$. DFL is composed of three parts: one front-end and two domain-specific back-ends. The front-end $Fr(\cdot)$ is based on the well-known VGG-16 and maps the RGB image $I$ to the shared underlying representation $f_{share} = Fr(I)$. More specifically, the first 10 layers of VGG-16 are deployed as the front-end, which is shared by the two tasks. Meanwhile, two series of convolution layers with different dilation rates are appended as the back-ends, denoted $B_d(\cdot)$ and $B_l(\cdot)$. With their large receptive fields, the stacked convolutions are tailored for learning domain-specific features: $f_d = B_d(f_{share})$ and $f_l = B_l(f_{share})$. In addition, the Pyramid Pooling Module (PPM) (Zhao et al. 2017) is applied in each domain-specific back-end to extract multi-scale features, followed by an interpolation layer $R(\cdot)$ that ensures the multi-scale feature maps have the same size $H \times W$.
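A rough PyTorch sketch of the DFL as described is given below. Only the 10-layer VGG-16 front-end, the dilated back-ends, and the PPM-plus-interpolation pipeline come from the paper; the back-end depth, channel widths, dilation rates, and pyramid bin sizes shown here are our assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class DFL(nn.Module):
    """Domain-specific Feature Learning: shared front-end + two dilated back-ends."""
    def __init__(self, channels=512, n_scales=3):
        super().__init__()
        # First 10 convolutional layers of VGG-16 (through conv4_3) as Fr(.).
        self.frontend = nn.Sequential(*list(vgg16().features[:23]))

        def backend():  # dilated convolutions enlarge the receptive field
            return nn.Sequential(
                nn.Conv2d(512, channels, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 3, padding=2, dilation=2), nn.ReLU(inplace=True))

        self.backend_d, self.backend_l = backend(), backend()  # B_d(.), B_l(.)
        self.pool_sizes = [1, 2, 4][:n_scales]  # PPM bin sizes (assumed)

    def multi_scale(self, f):
        # P(.) then R(.): pool to each pyramid level, then resize back to H x W.
        h, w = f.shape[-2:]
        return [F.interpolate(F.adaptive_avg_pool2d(f, s), size=(h, w),
                              mode='bilinear', align_corners=False)
                for s in self.pool_sizes]

    def forward(self, img):
        f_share = self.frontend(img)                          # f_share = Fr(I)
        f_d, f_l = self.backend_d(f_share), self.backend_l(f_share)
        return self.multi_scale(f_d), self.multi_scale(f_l)   # F_d, F_l
```

The two returned lists correspond to the initial node states $h^{1(0)}_i$ and $h^{2(0)}_i$ of Eqs. 1-2 below.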

Node Embedding. In our HyGnn, each node $v^1_i$ or $v^2_i \in \mathcal{V}$, where $i$ takes a unique value from $\{1, \dots, |\mathcal{V}|\}$, is associated with an initial node embedding (or node state). We use the domain-specific feature maps produced by DFL as the initial node representations. Taking an arbitrary counting node $v^1_i \in \mathcal{V}^1$ as an example, its initial representation $h^{1(0)}_i$ is calculated by:

$$h^{1(0)}_i = v^1_i = R(P(f_d, s_i)) \in \mathbb{R}^{H \times W \times C}, \quad (1)$$

where $h^{1(0)}_i \in \mathbb{R}^{H \times W \times C}$ is a 3D tensor feature (the batch dimension is omitted), and $R(\cdot)$ and $P(\cdot)$ denote the interpolation operation and the pyramid pooling operation, respectively. The initial representation for a localization node $v^2_i \in \mathcal{V}^2$ is defined similarly:

$$h^{2(0)}_i = v^2_i = R(P(f_l, s_i)) \in \mathbb{R}^{H \times W \times C}. \quad (2)$$

Cross-scale Edge Embedding. A cross-scale edge $e^m_{i,j} \in \mathcal{E}_{i,j}$ connects two nodes $v^m_i$ and $v^m_j$ from the same domain $m \in \{1, 2\}$ but different scales $i, j \in \{1, \dots, N\}$. The cross-scale edge embedding, denoted $e^m_{i,j}$, is used to distill the multi-scale relation from $v^m_i$ to $v^m_j$ as the edge representation. To this end, we employ a relation function $f_{rel}(\cdot, \cdot)$ to capture the relations:

$$e^{m(k)}_{i,j} = f_{rel}(h^{m(k)}_i, h^{m(k)}_j) = Conv(g(h^{m(k)}_i, h^{m(k)}_j)) \in \mathbb{R}^{H \times W \times C}, \quad (3)$$

where $g(\cdot, \cdot)$ is a function that combines the features $h^{m(k)}_i$ and $h^{m(k)}_j$. Following (Wang et al. 2019b), we model $g(h_i, h_j) = h_i - h_j$, basing the relation on the difference between node embeddings to alleviate the symmetry issue in feature combination. $Conv(\cdot)$ denotes a convolution operation used to learn the edge embedding in a data-driven way. Each element in $e^{m(k)}_{i,j}$ reflects the pixel-level relation between the nodes of scales $i$ and $j$. As a result, $e^{m(k)}_{i,j}$ can be considered a feature that depicts the multi-scale relationship between nodes.

Figure 4: Detailed illustration of the cross-domain edge embedding and message aggregation. Please see the text for details.
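Under this reading, Eq. 3 is only a few lines of code; the paper does not specify the depth of $Conv(\cdot)$, so a single 3 × 3 convolution is assumed here:

```python
import torch.nn as nn

class CrossScaleEdge(nn.Module):
    """Eq. 3: e_{i,j} = Conv(g(h_i, h_j)) with g(h_i, h_j) = h_i - h_j."""
    def __init__(self, channels=512):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)  # depth assumed

    def forward(self, h_i, h_j):
        return self.conv(h_i - h_j)  # pixel-level relation map between scales
```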

Cross-domain Edge Embedding. Since our HyGnn is designed to fully exploit the complementary knowledge contained in the nodes of different domains ($m, n \in \{1, 2\}$, $m \neq n$), one major challenge is to overcome the "domain gap" between them. Rather than directly combining features as in the cross-scale edge embedding, we first adapt the node representation of one (source) domain $h^{m(k)}_i$ conditioned on the node representation of the other (target) domain $h^{n(k)}_i$ to overcome the domain difference. Here, inspired by (Bertinetto et al. 2016), we integrate a learnable adapter $A^m(h^{m(k)}_i \mid h^{n(k)}_i)$ into our HyGnn to transform the original node representation $h^{m(k)}_i$ into the adaptive representation $h'^{m(k)}_i$ as follows:

$$h'^{m(k)}_i = A^m(h^{m(k)}_i \mid h^{n(k)}_i) = \theta^{*m(k)}_i * h^{n(k)}_i, \quad \text{where } \theta^{*m(k)}_i = E_\phi(h^{m(k)}_i). \quad (4)$$

In the above function, $*$ is the convolution operation and $\theta^{*m(k)}_i$ denotes the dynamic convolutional kernels. $E_\phi(\cdot)$ is a one-shot learner that predicts the dynamic parameters $\theta^{*m(k)}_i$ from a single exemplar. Following (Nie et al. 2018), as shown in Fig. 3, we implement it as a small CNN with learnable parameters $\phi$.

After obtaining the adaptive representation $h'^{m(k)}_i$, the cross-domain edge embedding $e^{m,n}_i$ for the edge $e^{m,n}_i = (v^m_i, v^n_i) \in \mathcal{E}^{m,n}$ can be formulated as:

$$e^{m,n(k)}_i = f_{rel}(h'^{m(k)}_i, h^{n(k)}_i) = Conv(g(h'^{m(k)}_i, h^{n(k)}_i)) \in \mathbb{R}^{H \times W \times C}, \quad (5)$$

where $e^{m,n}_i \in \mathbb{R}^{H \times W \times C}$ is a 3D tensor that contains the hidden representation of the cross-domain relation. The detailed architecture can be found in Fig. 4.
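A minimal sketch of the adapter of Eqs. 4-5, assuming batch size 1 and depthwise dynamic kernels; the paper does not specify the layout of the kernels predicted by $E_\phi$, so the depthwise choice is ours, made to keep the number of predicted parameters manageable:

```python
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """One-shot learner E_phi predicts dynamic kernels from the source state;
    the kernels then convolve the target state (Eq. 4). The edge embedding
    of Eq. 5 is a convolution over the difference, as in Eq. 3."""
    def __init__(self, channels=512, k=3):
        super().__init__()
        self.k = k
        self.learner = nn.Sequential(            # E_phi(.)
            nn.AdaptiveAvgPool2d(k),             # squeeze H x W down to k x k
            nn.Conv2d(channels, channels, 1))    # refine into kernel values
        self.rel = nn.Conv2d(channels, channels, 3, padding=1)  # f_rel's Conv(.)

    def forward(self, h_src, h_tgt):             # each: (1, C, H, W)
        theta = self.learner(h_src).squeeze(0).unsqueeze(1)    # (C, 1, k, k)
        h_adapt = F.conv2d(h_tgt, theta, padding=self.k // 2,
                           groups=h_tgt.shape[1])              # theta * h_tgt
        return h_adapt, self.rel(h_adapt - h_tgt)              # Eq. 4, Eq. 5
```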

Cross-scale Message Aggregation. In our HyGnn, we employ different aggregation schemes for each node to aggregate feature messages from its neighbors. For the message $m^m_{i,j}$ passed from node $v^m_i$ to $v^m_j$ within the same domain across different scales, we have:

$$m^{m(k)}_{i,j} = M(h^{m(k-1)}_i, e^{m(k-1)}_{i,j}) = Sig(e^{m(k-1)}_{i,j}) \cdot h^{m(k-1)}_i, \quad (6)$$

where $M(\cdot)$ is the cross-scale message passing function (aggregator), and $Sig(\cdot)$ maps the edge embedding into a link weight. Note that since our HyGnn is devised to handle a pixel-level task, the link weight between nodes takes the form of a 2D map. Thus, $m^m_{i,j}$ passes pixel-wise weighted features from node $v^m_i$ to $v^m_j$ to aggregate information.
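In code, Eq. 6 is a sigmoid-gated, element-wise weighting of the sender's state:

```python
import torch

def cross_scale_message(h_i, e_ij):
    """Eq. 6: m_{i,j} = Sig(e_{i,j}) * h_i; the gate is a pixel-wise weight map."""
    return torch.sigmoid(e_ij) * h_i
```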

Cross-domain Message Aggregation. As the cross-domain discrepancy is significant in the high-dimensional feature space and distribution, directly passing the learned representations of one node to its neighboring nodes for aggregation is a sub-optimal solution. Therefore, we formulate the message passing from node $v^m_i$ to $v^n_i$ as an adaptive representation learning process conditioned on $h^n_i$. Here, we use an idea similar to that of the cross-domain edge embedding process, i.e., using a one-shot adapter to predict the message that should be passed:

$$\begin{aligned}
m^{m,n(k)}_i &= M(h^{m(k-1)}_i, e^{m,n(k-1)}_i \mid h^{n(k-1)}_i) \\
&= A^m(Sig(e^{m,n(k-1)}_i) \cdot h^{m(k-1)}_i \mid h^{n(k-1)}_i) \\
&= E_\eta(Sig(e^{m,n(k-1)}_i) \cdot h^{m(k-1)}_i) * h^{n(k-1)}_i \\
&= \psi^{*m(k-1)}_i * h^{n(k-1)}_i,
\end{aligned} \quad (7)$$

where $M(\cdot)$ is the message passing function between nodes from two different domains, $A(\cdot)$ is the adapter conditioned on the node embedding of the target domain $h^{n(k-1)}_i$, and $E_\eta(\cdot)$ is a small CNN with learnable parameters $\eta$, serving as a one-shot learner that predicts the dynamic parameters. $\psi^{*m(k-1)}_i$ denotes the produced dynamic convolutional kernels, which encode the guidance information to be propagated from node $v^m_i$ to $v^n_i$.
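Reusing the Adapter sketch above, Eq. 7 reads as gating the source state with the cross-domain edge embedding and letting the one-shot learner produce the kernels that convolve the target state. In the paper, $E_\eta$ has its own parameters $\eta$; here we assume it shares the Adapter structure and simply pass a separate instance:

```python
import torch

def cross_domain_message(adapter_eta, h_src, h_tgt, e_cross):
    """Eq. 7: m = E_eta(Sig(e) * h_src) * h_tgt, with E_eta an Adapter instance."""
    gated = torch.sigmoid(e_cross) * h_src   # Sig(e) . h_src
    message, _ = adapter_eta(gated, h_tgt)   # dynamic kernels applied to h_tgt
    return message
```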

Two-stage Node State Update. At the $k$th step, our HyGnn first aggregates the information from the cross-domain node at the same scale $i$ using Eq. 7. Thus, $v^n_i$ ($i \in \{1, \dots, N\}$, $n \in \{1, 2\}$) obtains an intermediate state $\tilde{h}^{n(k)}_i$ by taking into account its received cross-domain message $m^{m,n(k)}_i$ and its prior state $h^{n(k-1)}_i$. Here, following (Qi et al. 2018b), we apply a Gated Recurrent Unit (GRU) (Ballas et al. 2015) as the update function:

$$\tilde{h}^{n(k)}_i = U_{GRU}(h^{n(k-1)}_i, m^{m,n(k)}_i). \quad (8)$$

Then, HyGnn performs message passing across scales within the same domain $n$ using Eq. 3 and aggregates the messages using Eq. 6. After that, $v^n_i$ obtains its new state $h^{n(k)}_i$ for the $k$th iteration by considering the cross-scale message $m^{n(k)}_{j,i}$ and its intermediate state $\tilde{h}^{n(k)}_i$:

$$h^{n(k)}_i = U_{GRU}(\tilde{h}^{n(k)}_i, m^{n(k)}_{j,i}). \quad (9)$$
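Since the node states are 2D feature maps, $U_{GRU}$ is naturally read as a convolutional GRU in the spirit of (Ballas et al. 2015). The cell below and the two-stage ordering are our interpretation, not the authors' exact design:

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """GRU over feature maps: all gates are 3x3 convolutions."""
    def __init__(self, channels=512):
        super().__init__()
        self.gates = nn.Conv2d(2 * channels, 2 * channels, 3, padding=1)  # z, r
        self.cand = nn.Conv2d(2 * channels, channels, 3, padding=1)       # h~

    def forward(self, h, m):  # prior state h, incoming message m
        z, r = torch.chunk(torch.sigmoid(self.gates(torch.cat([h, m], 1))), 2, 1)
        h_tilde = torch.tanh(self.cand(torch.cat([r * h, m], 1)))
        return (1 - z) * h + z * h_tilde

# Two-stage update (Eqs. 8-9): cross-domain message first, then cross-scale.
#   h_mid = gru(h_prev, m_cross_domain)   # Eq. 8
#   h_new = gru(h_mid,  m_cross_scale)    # Eq. 9
```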

Readout Function. After $K$ message passing iterations, the updated multi-scale features of the two domains, $H^1 = \{h^1_i\}_{i=1}^{|\mathcal{V}^1|}$ and $H^2 = \{h^2_i\}_{i=1}^{|\mathcal{V}^2|}$, are merged to form the final representations $f_d$ and $f_l$:

$$f_d = C_d(H^1) \quad \text{and} \quad f_l = C_l(H^2), \quad (10)$$

where $C_d(\cdot)$ and $C_l(\cdot)$ are merge functions implemented by concatenation. Then, $f_d$ and $f_l$ are fed into a convolution layer to obtain the final per-pixel predictions.
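A sketch of this readout under our reading (the channel counts and the 1 × 1 prediction head are assumptions):

```python
import torch
import torch.nn as nn

class Readout(nn.Module):
    """C(.): concatenate the N updated scale states, then predict a 1-channel map."""
    def __init__(self, channels=512, n_scales=3):
        super().__init__()
        self.merge = nn.Conv2d(n_scales * channels, channels, 1)
        self.head = nn.Conv2d(channels, 1, 1)

    def forward(self, states):  # list of N tensors, each (1, C, H, W)
        return self.head(self.merge(torch.cat(states, dim=1)))
```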

Loss. Our HyGnn is fully differentiable and end-to-end trainable. The loss for each task is computed after the readout functions, and the error propagates back according to the chain rule. Here, we simply employ the Mean Square Error (MSE) loss to optimize the network parameters for the two tasks:

$$\mathcal{L} = \mathcal{L}_1(D^g, D) + \lambda \mathcal{L}_2(L^g, L), \quad (11)$$

where $\mathcal{L}_1$ and $\mathcal{L}_2$ are MSE losses and $\lambda$ is a combination weight. As our main task is crowd counting, we set $\lambda = 0.001$ to emphasize the accuracy of the counting results.
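Eq. 11 in code, with $\lambda = 0.001$ as stated:

```python
import torch.nn.functional as F

def hygnn_loss(d_pred, d_gt, l_pred, l_gt, lam=0.001):
    """L = L1(D^g, D) + lambda * L2(L^g, L), both plain MSE (Eq. 11)."""
    return F.mse_loss(d_pred, d_gt) + lam * F.mse_loss(l_pred, l_gt)
```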

Experiments

In this section, we empirically validate our HyGnn on four public counting benchmarks (ShanghaiTech Part A, ShanghaiTech Part B, UCF_CC_50 and UCF-QNRF). First, we conduct ablation experiments to demonstrate the effectiveness of our hybrid graph model and of the multi-task learning. Then, our proposed HyGnn is evaluated on all of these public benchmarks, and its performance is compared with state-of-the-art approaches.

Datasets. We use ShanghaiTech (Zhang et al. 2016), UCF_CC_50 (Idrees et al. 2013) and UCF-QNRF (Idrees et al. 2018b) for benchmarking our HyGnn. ShanghaiTech provides 1,198 annotated images containing more than 330K people with head center annotations, and includes two subsets: ShanghaiTech Part A and ShanghaiTech Part B. UCF_CC_50 provides 50 images with 63,974 head annotations in total; the small dataset volume and large count variance make it a very challenging dataset. UCF-QNRF is the largest dataset to date, containing 1,535 images divided into training and testing sets of 1,201 and 334 images, respectively. All of these benchmarks have been widely used for performance evaluation by state-of-the-art approaches.

Implementation Details and Evaluation Protocol. To make a fair comparison with existing works, we use a truncated VGG as the backbone network. Specifically, the first 10 convolutional layers of VGG-16 are used as the front-end and shared by the two tasks. Following (Li, Zhang, and Chen 2018), our counting and localization back-ends are each composed of 8 dilated convolutions with kernel size 3 × 3.

Table 1: Comparison with state-of-the-art crowd counting methods on four benchmark crowd counting datasets using the MAE and MSE metrics. (SHT = ShanghaiTech Part.)

| Methods | SHT A MAE | SHT A MSE | SHT B MAE | SHT B MSE | UCF_CC_50 MAE | UCF_CC_50 MSE | UCF-QNRF MAE | UCF-QNRF MSE |
|---|---|---|---|---|---|---|---|---|
| Crowd CNN (Zhang et al. 2015) | 181.8 | 277.7 | 32 | 49.8 | 467 | 498 | - | - |
| MC-CNN (Zhang et al. 2016) | 110.2 | 173.2 | 26.4 | 41.3 | 377.6 | 509.1 | 277 | 426 |
| Switching CNN (Sam, Surya, and Babu 2017) | 90.4 | 135 | 21.6 | 33.4 | 318.1 | 439.2 | 228 | 445 |
| CP-CNN (Sindagi and Patel 2017) | 73.6 | 106.4 | 20.1 | 30.1 | 298.8 | 320.9 | - | - |
| D-ConvNet (Shi et al. 2018) | 73.5 | 112.3 | 18.7 | 26 | 288.4 | 404.7 | - | - |
| L2R (Liu, van de Weijer, and Bagdanov 2018) | 72 | 106.6 | 13.7 | 21.4 | 279.6 | 388.9 | - | - |
| CSRNet (Li, Zhang, and Chen 2018) | 68.2 | 115 | 10.6 | 16 | 266.1 | 397.5 | - | - |
| PACNN (Shi et al. 2019) | 66.3 | 106.4 | 8.9 | 13.5 | 267.9 | 357.8 | - | - |
| RA2-Net (Liu, Weng, and Mu 2019) | 65.1 | 106.7 | 8.4 | 14.1 | - | - | 116 | 195 |
| SFCN (Wang et al. 2019a) | 64.8 | 107.5 | 7.6 | 13 | 214.2 | 318.2 | 124.7 | 203.5 |
| TEDNet (Jiang et al. 2019) | 64.2 | 109.1 | 8.2 | 12.8 | 249.4 | 354.2 | 113 | 188 |
| ADCrowdNet (Liu et al. 2019) | 63.2 | 98.9 | 7.6 | 13.9 | 257.1 | 363.5 | - | - |
| HyGnn (Ours) | 60.2 | 94.5 | 7.5 | 12.7 | 184.4 | 270.1 | 100.8 | 185.3 |

Table 2: Analysis of the proposed method. Our results are obtained on ShanghaiTech Part A.

| Methods | MAE↓ | MSE↓ |
|---|---|---|
| Baseline Model (a truncated VGG) | 68.2 | 115.0 |
| Baseline + PSP (Zhao et al. 2017) | 65.3 | 106.8 |
| Baseline + Bidirectional Fusion (Yang et al. 2018) | 65.1 | 105.9 |
| Single-task GNN | 62.5 | 103.4 |
| Multi-task GNN w/o adapter | 62.4 | 101.8 |
| HyGnn (N=2, K=3) | 62.1 | 100.8 |
| HyGnn (N=3, K=3) | 60.2 | 94.5 |
| HyGnn (N=5, K=3) | 60.2 | 94.1 |
| HyGnn (N=3, K=1) | 65.4 | 109.2 |
| HyGnn (N=3, K=3) | 60.2 | 94.5 |
| HyGnn (N=3, K=5) | 60.1 | 94.4 |
| HyGnn (full model) | 60.2 | 94.5 |

We use the Adam optimizer with an initial learning rate of $10^{-4}$, momentum of 0.9, weight decay of $10^{-4}$, and a batch size of 8. For data augmentation, the training images and the corresponding ground truths are randomly flipped and cropped from different locations at a size of 400 × 400. In the testing phase, we simply feed the whole image into the model to predict the counting and localization results.
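This optimization setup maps directly onto PyTorch; note that for Adam the quoted momentum of 0.9 corresponds to $\beta_1$:

```python
import torch

def make_optimizer(model):
    # lr 1e-4 and weight decay 1e-4 as stated; momentum 0.9 becomes beta1.
    return torch.optim.Adam(model.parameters(), lr=1e-4,
                            betas=(0.9, 0.999), weight_decay=1e-4)
```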

We adopt the Mean Absolute Error (MAE) and Mean Squared Error (MSE) to evaluate performance, defined as:

$$MAE = \frac{1}{N}\sum_{i=1}^{N}\left|C_i - C^{GT}_i\right| \quad \text{and} \quad MSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(C_i - C^{GT}_i\right)^2}, \quad (12)$$

where $C_i$ and $C^{GT}_i$ are the estimated count and the ground truth count of the $i$th testing image, respectively.
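For completeness, Eq. 12 over per-image counts (note that the "MSE" conventionally reported in crowd counting is the root of the mean squared error):

```python
import numpy as np

def mae_mse(pred_counts, gt_counts):
    """Eq. 12: MAE and (root) MSE over per-image count errors."""
    diff = np.asarray(pred_counts, float) - np.asarray(gt_counts, float)
    return np.abs(diff).mean(), np.sqrt((diff ** 2).mean())
```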

Ablation Study. Extensive ablation experiments are performed on ShanghaiTech Part A to verify the impact of each component of our HyGnn. The results are summarized in Tab. 2.

Effectiveness of HyGnn. To show the importance of our HyGnn, we provide a baseline model without HyGnn, i.e., our backbone model: the truncated VGG with dilated back-ends. As shown in Tab. 2, our HyGnn significantly outperforms the baseline by 8.0 in MAE (68.2 → 60.2) and 20.5 in MSE (115.0 → 94.5). This is because our HyGnn simultaneously models the multi-scale and cross-domain relationships that are important for achieving accurate crowd counting results.

Multi-task GNN vs. Single-task GNN. To evaluate the advantage of multi-task cooperation, we provide a single-task model that only formulates the cross-scale relationship. According to our experiments, HyGnn outperforms the single-task graph neural network by 2.3 in MAE (62.5 → 60.2) and 8.9 in MSE (103.4 → 94.5). This is because our HyGnn is able to distill the mutual benefits between density and localization, whereas the single-task graph neural network ignores this important information.

Effectiveness of the Cross-domain Edge Embedding. Our HyGnn handles cross-domain information carefully through a learnable adapter. To evaluate its effectiveness, we provide a multi-task GNN without the learnable adapter, which instead directly fuses features from different domains through the aggregation operation. As shown in Tab. 2, our cross-domain edge embedding method achieves better performance in both MAE (60.2 vs. 62.4) and MSE (94.5 vs. 101.8), which indicates that our cross-domain edge embedding design helps to better leverage the information from the other domain.

Node Number N in HyGnn. In our model, we have $N$ nodes in each domain, i.e., $|\mathcal{V}^1| = |\mathcal{V}^2| = N$. To investigate the impact of the node number, we report the performance of our HyGnn with different $N$. We find that with more scales in the model (2 → 3), the performance improves significantly (62.1 → 60.2 in MAE and 100.8 → 94.5 in MSE). However, further increasing the number of scales (3 → 5) brings only a slight gain: MAE remains at 60.2 while MSE improves from 94.5 to 94.1. This may be due to redundant information within the additional features. Considering the tradeoff between efficiency and performance, we set N = 3 in the following experiments.

Message Passing Iterations K. To evaluate the impact of the number of message passing iterations $K$, we report the performance of our model with different numbers of iterations.

Figure 5: Density and localization maps generated by our HyGnn, shown alongside the ground truth and the density maps estimated by CSRNet for comparison. Clearly, our HyGnn produces more accurate results.

Each message passing iteration in HyGnn includes two cascaded steps: (i) cross-domain message passing and (ii) cross-scale message passing (Eqs. 8-9). We find that with more iterations (1 → 3), the performance of our model improves considerably, while further iterations (3 → 5) bring only a slight improvement. Therefore, we set K = 3, with which our HyGnn converges to an optimal result.

GNN vs. Other Multi-feature Aggregation Methods. We also conduct an ablation to evaluate the superiority of the GNN. To exclude other interfering factors, we use a single-task GNN to distill the underlying relationships between multi-scale features, and compare it with two well-known multi-scale feature aggregation methods: PSP (Zhao et al. 2017) and Bidirectional Fusion (Yang et al. 2018). As can be seen, our GNN-based method outperforms both by a large margin.

Comparison with State-of-the-Art. We compare our HyGnn with the state-of-the-art in counting performance.

Quantitative Results. As shown in Tab. 1, our HyGnn consistently achieves better results than other methods on four widely used benchmarks. Specifically, our method outperforms the previous best result by 3.0 in MAE and 4.4 in MSE on ShanghaiTech Part A. Although previous methods have made remarkable progress on ShanghaiTech Part B, our HyGnn still achieves the best performance: compared with the existing top approaches ADCrowdNet (Liu et al. 2019) and SFCN (Wang et al. 2019a), our HyGnn improves by 0.1 in MAE and 1.2 in MSE, and by 0.1 in MAE and 0.3 in MSE, respectively. On the most challenging UCF_CC_50, our HyGnn achieves a considerable performance gain, decreasing the MAE from the previous best of 214.2 to 184.4 and the MSE from 318.2 to 270.1. On the UCF-QNRF dataset, our HyGnn also outperforms other methods by a large margin: as shown in Tab. 1, it achieves a significant improvement of 10.8% in MAE over the previous best result, produced by TEDNet (Jiang et al. 2019). Compared with other top-ranked methods, our HyGnn produces more accurate results because it is able to leverage free-of-cost localization information and jointly reason over all the relations among the features.

Qualitative Results. Fig. 5 provides visual comparisons of the predicted density maps and counts with CSRNet (Li, Zhang, and Chen 2018), along with our localization results. We observe that our HyGnn achieves much more accurate count estimates and preserves better consistency with the real crowd distributions. This is because our HyGnn distills significant beneficial information from the auxiliary task through the graph.

Conclusions

In this paper, we propose a novel method for crowd counting based on a hybrid graph model. To the best of our knowledge, it is the first deep neural network model that distills both multi-scale relations and mutual beneficial relations within a unified graph for crowd counting. The whole HyGnn is end-to-end differentiable and able to handle the different relation types effectively. Meanwhile, the domain gap between the different tasks is carefully considered in our HyGnn. According to our experiments, HyGnn achieves significant improvements over recent state-of-the-art methods on four benchmarks. We believe that our HyGnn can also incorporate other knowledge, e.g., foreground information, for further improvements.

Acknowledgement. This work was supported in part by the National Key R&D Program of China (No. 2017YFB1302300) and the NSFC (No. U1613223).


References

Ballas, N.; Yao, L.; Pal, C.; and Courville, A. 2015. Delving deeper into convolutional networks for learning video representations. arXiv preprint arXiv:1511.06432.

Bertinetto, L.; Henriques, J. F.; Valmadre, J.; Torr, P.; and Vedaldi, A. 2016. Learning feed-forward one-shot learners. In NeurIPS.

Cao, X.; Wang, Z.; Zhao, Y.; and Su, F. 2018. Scale aggregation network for accurate and efficient crowd counting. In ECCV.

Chen, K.; Gong, S.; Xiang, T.; and Change Loy, C. 2013. Cumulative attribute space for age and crowd density estimation. In CVPR.

Dai, F.; Liu, H.; Ma, Y.; Cao, J.; Zhao, Q.; and Zhang, Y. 2019. Dense scale network for crowd counting. CoRR abs/1906.09707.

Dalal, N., and Triggs, B. 2005. Histograms of oriented gradients for human detection. In CVPR.

Gilmer, J.; Schoenholz, S. S.; Riley, P. F.; Vinyals, O.; and Dahl, G. E. 2017. Neural message passing for quantum chemistry. CoRR abs/1704.01212.

Idrees, H.; Saleemi, I.; Seibert, C.; and Shah, M. 2013. Multi-source multi-scale counting in extremely dense crowd images. In CVPR.

Idrees, H.; Tayyab, M.; Athrey, K.; Zhang, D.; Al-Maadeed, S.; Rajpoot, N.; and Shah, M. 2018a. Composition loss for counting, density map estimation and localization in dense crowds. In ECCV.

Idrees, H.; Tayyab, M.; Athrey, K.; Zhang, D.; Al-Maadeed, S.; Rajpoot, N.; and Shah, M. 2018b. Composition loss for counting, density map estimation and localization in dense crowds. In ECCV.

Jiang, X.; Xiao, Z.; Zhang, B.; Zhen, X.; Cao, X.; Doermann, D.; and Shao, L. 2019. Crowd counting and density estimation by trellis encoder-decoder networks. In CVPR.

Kang, D.; Ma, Z.; and Chan, A. B. 2018. Beyond counting: Comparisons of density maps for crowd analysis tasks - counting, detection, and tracking. TCSVT 29(5):1408-1422.

Lempitsky, V., and Zisserman, A. 2010. Learning to count objects in images. In NeurIPS.

Li, Y.; Tarlow, D.; Brockschmidt, M.; and Zemel, R. 2016. Gated graph sequence neural networks. In ICLR.

Li, X.; Yang, F.; Cheng, H.; Chen, J.; Guo, Y.; and Chen, L. 2017. Multi-scale cascade network for salient object detection. In ACM MM.

Li, X.; Yang, F.; Cheng, H.; Liu, W.; and Shen, D. 2018. Contour knowledge transfer for salient object detection. In ECCV.

Li, Y.; Zhang, X.; and Chen, D. 2018. CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes. In CVPR.

Lian, D.; Li, J.; Zheng, J.; Luo, W.; and Gao, S. 2019. Density map regression guided detection network for RGB-D crowd counting and localization. In CVPR.

Liu, N.; Long, Y.; Zou, C.; Niu, Q.; Pan, L.; and Wu, H. 2019. ADCrowdNet: An attention-injective deformable convolutional network for crowd understanding. In CVPR.

Liu, W.; Salzmann, M.; and Fua, P. 2019. Context-aware crowd counting. In CVPR.

Liu, X.; van de Weijer, J.; and Bagdanov, A. D. 2018. Leveraging unlabeled data for crowd counting by learning to rank. In CVPR.

Liu, C.; Weng, X.; and Mu, Y. 2019. Recurrent attentive zooming for joint crowd counting and precise localization. In CVPR.

Meng, Z.; Adluru, N.; Kim, H. J.; Fung, G.; and Singh, V. 2018. Efficient relative attribute learning using graph neural networks. In ECCV.

Nie, X.; Feng, J.; Zuo, Y.; and Yan, S. 2018. Human pose estimation with parsing induced learner. In CVPR.

Pham, V.-Q.; Kozakaya, T.; Yamaguchi, O.; and Okada, R. 2015. Count forest: Co-voting uncertain number of targets using random forest for crowd density estimation. In ICCV.

Qi, S.; Wang, W.; Jia, B.; Shen, J.; and Zhu, S.-C. 2018a. Learning human-object interactions by graph parsing neural networks. In ECCV.

Qi, S.; Wang, W.; Jia, B.; Shen, J.; and Zhu, S.-C. 2018b. Learning human-object interactions by graph parsing neural networks. In ECCV.

Sam, D. B.; Surya, S.; and Babu, R. V. 2017. Switching convolutional neural network for crowd counting. In CVPR.

Scarselli, F.; Gori, M.; Tsoi, A. C.; Hagenbuchner, M.; and Monfardini, G. 2008. The graph neural network model. TNNLS 20(1):61-80.

Shen, Z.; Xu, Y.; Ni, B.; Wang, M.; Hu, J.; and Yang, X. 2018. Crowd counting via adversarial cross-scale consistency pursuit. In CVPR.

Shi, Z.; Zhang, L.; Liu, Y.; Cao, X.; Ye, Y.; Cheng, M.-M.; and Zheng, G. 2018. Crowd counting with deep negative correlation learning. In CVPR.

Shi, M.; Yang, Z.; Xu, C.; and Chen, Q. 2019. Revisiting perspective information for efficient crowd counting. In CVPR.

Si, C.; Jing, Y.; Wang, W.; Wang, L.; and Tan, T. 2018. Skeleton-based action recognition with spatial reasoning and temporal stack learning. In ECCV.

Sindagi, V. A., and Patel, V. M. 2017. Generating high-quality crowd density maps using contextual pyramid CNNs. In CVPR.

Varior, R. R.; Shuai, B.; Tighe, J.; and Modolo, D. 2019. Scale-aware attention network for crowd counting. CoRR abs/1901.06026.

Viola, P.; Jones, M.; et al. 2001. Rapid object detection using a boosted cascade of simple features.

Viola, P.; Jones, M. J.; and Snow, D. 2005. Detecting pedestrians using patterns of motion and appearance. IJCV 63(2):153-161.

Wan, J.; Luo, W.; Wu, B.; Chan, A. B.; and Liu, W. 2019. Residual regression with semantic prior for crowd counting. In CVPR.

Wang, Q.; Gao, J.; Lin, W.; and Yuan, Y. 2019a. Learning from synthetic data for crowd counting in the wild. In CVPR.

Wang, Y.; Sun, Y.; Liu, Z.; Sarma, S. E.; Bronstein, M. M.; and Solomon, J. M. 2019b. Dynamic graph CNN for learning on point clouds. TOG.

Yang, F.; Li, X.; Cheng, H.; Guo, Y.; Chen, L.; and Li, J. 2018. Multi-scale bidirectional FCN for object skeleton extraction. In AAAI.

Zhang, C.; Li, H.; Wang, X.; and Yang, X. 2015. Cross-scene crowd counting via deep convolutional neural networks. In CVPR.

Zhang, Y.; Zhou, D.; Chen, S.; Gao, S.; and Ma, Y. 2016. Single-image crowd counting via multi-column convolutional neural network. In CVPR.

Zhao, H.; Shi, J.; Qi, X.; Wang, X.; and Jia, J. 2017. Pyramid scene parsing network. In CVPR.

Zhao, M.; Zhang, J.; Zhang, C.; and Zhang, W. 2019. Leveraging heterogeneous auxiliary tasks to assist crowd counting. In CVPR.