


Learning Hierarchical Graph for Occluded Pedestrian Detection

Gang Li∗‡ Nanjing University of Science and Technology

[email protected]

Jian Li∗ Youtu Lab, Tencent

[email protected]

Shanshan Zhang†‡ Nanjing University of Science and Technology

[email protected]

Jian Yang‡ Nanjing University of Science and Technology

[email protected]

ABSTRACT
Although pedestrian detection has made significant progress with the help of deep convolutional neural networks, it is still a challenging problem to detect occluded pedestrians, since the occluded ones cannot provide sufficient information for classification and regression. In this paper, we propose a novel Hierarchical Graph Pedestrian Detector (HGPD), which integrates semantic and spatial relation information to construct two graphs, named intra-proposal graph and inter-proposal graph, without relying on extra cues w.r.t. visible regions. In order to capture the occlusion patterns and enhance features from visible regions, the intra-proposal graph considers body parts as nodes and assigns corresponding edge weights based on semantic relations between body parts. On the other hand, the inter-proposal graph adopts spatial relations between neighbouring proposals to provide additional proposal-wise context information for each proposal, which alleviates the lack of information caused by occlusion. We conduct extensive experiments on standard benchmarks of CityPersons and Caltech to demonstrate the effectiveness of our method. On CityPersons, our approach outperforms the baseline method by a large margin of 5.24 pp on the heavy occlusion set, and surpasses all previous methods; on Caltech, we establish a new state of the art of 3.78% MR. Code is available at https://github.com/ligang-cs/PedestrianDetection-HGPD.

CCS CONCEPTS
• Computing methodologies → Object detection.

KEYWORDS
computer vision, object detection, occluded pedestrian detection, graph neural networks

∗Both authors contributed equally to this research.
†The corresponding author is Shanshan Zhang.
‡Gang Li, Shanshan Zhang and Jian Yang are with PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, and Jiangsu Key Lab of Image and Video Understanding for Social Security, School of Computer Science and Engineering, Nanjing University of Science and Technology.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
MM ’20, October 12–16, 2020, Seattle, WA, USA
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-7988-5/20/10. . . $15.00
https://doi.org/10.1145/3394171.3413983

ACM Reference Format:
Gang Li, Jian Li, Shanshan Zhang, and Jian Yang. 2020. Learning Hierarchical Graph for Occluded Pedestrian Detection. In Proceedings of the 28th ACM International Conference on Multimedia (MM ’20), October 12–16, 2020, Seattle, WA, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3394171.3413983

1 INTRODUCTION
Pedestrian detection is a popular topic in computer vision. On the one hand, it has extensive applications, such as autonomous driving, video surveillance, and robotics. On the other hand, it also serves as a fundamental step for some other vision-based tasks, e.g. person re-identification [26, 39], pose estimation [10, 19], etc. Recently, significant improvements have been achieved with the development of deep convolutional neural networks. Although some state-of-the-art detectors can provide reasonable detection results for non-occluded or partially occluded pedestrians, the performance for heavily occluded pedestrians is still far from satisfactory.

Occlusion occurs frequently in real-world applications, and thus it is an important problem to solve. Though quite some efforts have been made for handling occlusion, it is still far from being solved. By analyzing the occlusion issue, we assume the difficulty mainly comes from the following two reasons: (1) lack of human body information from invisible parts; (2) background noise inside the detection window of occluded pedestrians. We show one example in Figure 1.

To improve the model’s discrimination ability for occluded pedestrians, an intuitive way is to enhance human body features from visible regions and suppress noisy features from occluded regions. To this end, we need to predict visible parts and then perform feature aggregation or re-weighting based on the prediction. For example, OR-CNN [37] divides each proposal into five pre-defined parts and aggregates features from these parts with predicted visibility scores; MGAN [21] produces a pixel-wise attention map based on visible regions, and then gives higher attention weights to features from visible regions; Zhang et al. [38] propose to use the predicted visible box or part detection results as guidance to conduct channel-wise feature selection. All of the above works rely on either visible box annotations as additional supervision at training time or an external body part detector. However, visible boxes are not provided on some large-scale pedestrian datasets, such as EuroCity Persons [1] and NightOwls [18], as they require extensive human labor; running an additional body part detector is computationally expensive at inference.
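The core idea behind both graphs can be illustrated with a minimal plain-Python sketch. Everything below is an illustrative assumption, not the paper's implementation: node features (body parts within a proposal for the intra-proposal graph, or neighbouring proposals for the inter-proposal graph) are aggregated under edge weights obtained from a softmax over pairwise relation scores, so nodes with strong relations to the rest (e.g. visible parts) dominate the result.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of relation scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def graph_aggregate(node_feats, relations):
    """One round of message passing on a fully connected graph.

    node_feats: list of feature vectors, one per node (body parts or
        neighbouring proposals, depending on which graph is built).
    relations:  relations[i][j] is a relation score between nodes i
        and j (hypothetical inputs; the paper derives these from
        learned semantic/spatial cues).

    Each node's output is a weighted sum of all node features, with
    edge weights given by a softmax over that node's relation row.
    """
    n = len(node_feats)
    dim = len(node_feats[0])
    out = []
    for i in range(n):
        weights = softmax(relations[i])  # edge weights for node i
        out.append([sum(weights[j] * node_feats[j][d] for j in range(n))
                    for d in range(dim)])
    return out
```

With uniform relation scores this degenerates to plain averaging; the interesting case is when relation scores down-weight occluded parts, which is exactly what the learned edge weights are meant to achieve.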


Table 1: Ablation study on CityPersons validation set. Numbers are log-average miss rates (lower number indicates better performance).

Baseline  Two-step Regression  Intra-proposal Graph  Inter-proposal Graph      R     HO
   ✓                                                                       13.46  56.98
                   ✓                                                       12.50  55.58
                                         ✓                                 12.33  53.19
                                                               ✓           12.58  53.59
                                         ✓                     ✓           11.82  52.87
                   ✓                     ✓                     ✓           11.27  51.74

Overall Improvement: +2.19 (R), +5.24 (HO)

4.2 Evaluation metric
Following [9], the log-average miss rate (noted as MR) is used in our work. It is calculated by averaging miss rates over 9 points uniformly sampled in log-space from [10^-2, 10^0] false positives per image (FPPI). On the CityPersons validation set and Caltech test set, we report results across two different occlusion subsets: Reasonable (R) and Heavy Occlusion (HO). The visibility ratio in R is larger than 65%, and the visibility ratio in HO ranges from 20% to 65%. In R and HO, the height of pedestrians is at least 50 pixels. Better results on HO are considered as stronger evidence of better occlusion handling. On the CityPersons test set, besides R and HO, we also report results on the All subset. All contains pedestrians whose visibility ratio is larger than 20% and height is at least 20 pixels. Besides, to evaluate the quality of region proposals, we introduce the average recall (noted as AR), which is calculated by averaging the recalls across IoU thresholds from 0.5 to 0.95 with a step of 0.05.
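Both metrics can be sketched in a few lines of Python. This is a hedged illustration of the definitions above; the input formats are assumptions, and the official evaluation toolkits should be used for any reported numbers.

```python
import math

def log_average_miss_rate(fppi, miss_rate):
    """Log-average miss rate (MR): sample the miss-rate curve at nine
    FPPI values evenly spaced in log-space over [1e-2, 1e0], then
    average in the log domain (i.e. a geometric mean of the samples).
    fppi / miss_rate are parallel lists describing the detector's
    curve, sorted by ascending FPPI (an assumed input format)."""
    samples = []
    for k in range(9):
        target = 10 ** (-2 + 2 * k / 8)  # nine points on [1e-2, 1e0]
        mr = miss_rate[0]
        for f, m in zip(fppi, miss_rate):
            if f <= target:  # miss rate at the last point below target
                mr = m
        samples.append(max(mr, 1e-10))   # guard against log(0)
    return math.exp(sum(math.log(m) for m in samples) / 9)

def average_recall(recall_at_iou):
    """AR: mean recall over the ten IoU thresholds 0.5, 0.55, ..., 0.95.
    recall_at_iou maps an IoU threshold to the recall achieved there."""
    thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]
    return sum(recall_at_iou[t] for t in thresholds) / len(thresholds)
```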

4.3 Implementation
We implement our method with PyTorch [22] and mmdetection [5]. No data augmentation is used except standard horizontal image flipping. SGD is used as the optimizer.
CityPersons. The model is trained on one GTX 2080Ti GPU with a batch size of 2, for 14 epochs. The learning rate is set to 0.02 and reduced to 0.002 after 10 epochs.
Caltech. As in [15, 16, 31, 37], we start with the model pretrained on CityPersons, then finetune the model on the Caltech dataset for another 6 epochs. For the first 4 epochs, the learning rate is set to 0.02, and it is then reduced to 0.002 for the last two epochs.
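The two step schedules can be written down as a small helper. This is only a sketch of the schedules described above, not the actual mmdetection configuration; the function name and the list-of-pairs format are illustrative assumptions.

```python
def learning_rate(epoch, milestones):
    """Return the step-schedule learning rate for a 0-indexed epoch.
    milestones: list of (start_epoch, lr) pairs sorted by start_epoch;
    the last pair whose start_epoch has been reached wins."""
    lr = milestones[0][1]
    for start, value in milestones:
        if epoch >= start:
            lr = value
    return lr

# Schedules as described in the text: CityPersons trains 14 epochs at
# lr 0.02, dropped to 0.002 from epoch 10; the Caltech finetuning runs
# 6 epochs with the drop from epoch 4.
CITYPERSONS = [(0, 0.02), (10, 0.002)]
CALTECH = [(0, 0.02), (4, 0.002)]
```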

4.4 Ablation Study
To better understand our model, we conduct ablation experiments on the CityPersons validation set. The R set is used as the training set for these experiments.
Component-wise Analysis. To demonstrate the effectiveness of our hierarchical graph pedestrian detector, a comprehensive component-wise analysis is performed, in which different components are added on top of a strong baseline method step by step. The results are reported in Table 1. As Table 1 shows, both the intra- and inter-proposal graphs dramatically reduce the error on the HO set.

Table 2: The effects of intra-proposal graph.

                     intra-proposal graph
w/o graph   self-attention   inter-affinity       R     HO
                                              13.46  56.98
    ✓                                         13.98  55.15
                  ✓                           12.36  54.47
                  ✓                ✓          12.33  53.19

Table 3: Ablation study on the number of human parts. 2 or 3 parts means uniformly cropping the full body into 2 or 3 parts vertically. 5 parts means cropping the full body into 3 parts vertically and 2 parts horizontally.

# parts      R     HO   R+HO
2        11.96  53.59  30.92
3        12.33  53.19  31.05
5        12.49  55.67  31.65

Specifically, the intra-proposal graph brings an absolute gain of 3.79 pp (pp represents percentage points), and the inter-proposal graph also outperforms the baseline by 3.39 pp, demonstrating that the proposed hierarchical graph is useful for addressing occlusion. With the help of two-step regression, more accurate region proposals are sent into the hierarchical graph. Finally, HGPD achieves log-average miss rates of 11.27% on the R set and 51.74% on the HO set, a total improvement over the baseline of 2.19 pp and 5.24 pp respectively. These results demonstrate that the proposed method effectively handles different levels of occlusion.
The effects of the intra-proposal graph. To model accurate occlusion patterns, we propose two options to define the edge weights, namely self-attention and inter-affinity. From Table 2, when only self-attention is used as edge weights, the intra-proposal graph brings gains of 1.10 pp on the R set and 2.51 pp on the HO set; when we involve inter-affinity as an additional term for the edge weights, we achieve larger improvements, i.e. 1.13 pp and 3.79 pp on the R and HO sets respectively. These results validate that inter-affinity guidance can help to better model occlusion patterns. We also conduct experiments to verify the effect of using the intra-proposal graph. For comparison, we introduce another reference method without a graph, for which features from body parts are multiplied with the corresponding visibility scores and simply concatenated for classification. The results in Table 2 indicate that the concatenation operation without a graph brings negligible improvements on the HO set, and it even drops by 0.52 pp on the R set. Compared to the simple concatenation operation, our intra-proposal graph integrates the features from different body parts in a more effective way.
Number of body parts. Table 3 shows the performance for different numbers of body parts. When we divide the full body into 2 parts (top and bottom half), it achieves the best performance on the R set, as the occlusion ratio on the R set is lower than 35%, so 2 parts can handle all occlusion patterns on the R set. But the


Table 4: The effects of two-step regression.

RPN   R-CNN   AR100  AR300  AR1000
               64.8   69.7   71.2
 ✓             69.6   72.8   73.7
        ✓      72.2   73.0   73.2

Table 5: The effects of decoupling two tasks.

sibling  separate      R     HO
   ✓               12.44  53.36
             ✓     11.27  51.74

best performance on the HO set is obtained with 3 parts, where more diverse occlusion patterns can be modeled. The result also indicates that more body parts, e.g. 5, do not bring consistent improvements. We divide the full body into 3 parts in the following experiments.
The effects of two-step regression. Table 4 shows the quality of proposals when we place two-step regression at different locations in Faster R-CNN. AR100, AR300 and AR1000 refer to the average recalls for the top 100, 300 and 1000 proposals in each image. Two-step regression can be placed at either the RPN or R-CNN stage, and we conduct experiments to compare these two choices. As Table 4 indicates, regardless of where it is placed, two-step regression brings significant improvements in AR. In practice, we usually use a large number of region proposals for Faster R-CNN (i.e. 1000), so AR1000 is a better indicator. Performing two-step regression in the RPN stage achieves the highest AR1000 of 73.7%, and outperforms the baseline and the R-CNN placement by 2.5 pp and 0.5 pp respectively.
Effects of decoupling classification and regression tasks. We assume that combining local features and neighboring features would be harmful to the localization task, since localization is sensitive to boundary features [29]. So we propose to select features before the hierarchical graph to perform bounding box regression, which is noted as the separate head. Performing both classification and regression on the output features of the hierarchical graphs is noted as the sibling head. The comparison in Table 5 demonstrates that decoupling the two tasks works better: it outperforms the counterpart by 1.17 pp/1.61 pp on the R/HO set.
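The head decoupling above amounts to a routing decision over two feature streams. The sketch below uses hypothetical names (the real heads are learned layers inside the detector) and only captures which features feed which task.

```python
def build_head_inputs(pre_graph_feat, post_graph_feat, separate=True):
    """Pick the input features for the two detection tasks.

    separate=True  (decoupled head): classification consumes the
        context-enriched post-graph features, while box regression
        keeps the pre-graph local features, preserving the boundary
        cues that localization is sensitive to.
    separate=False (sibling head): both tasks share the post-graph
        features.
    """
    cls_input = post_graph_feat
    reg_input = pre_graph_feat if separate else post_graph_feat
    return cls_input, reg_input
```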

4.5 Comparison on CityPersons
We compare our method with state-of-the-art methods on the CityPersons validation set in Table 6. Note that existing pedestrian detectors employ different subsets of training samples, which differ in occlusion level, and different input scales, both of which highly affect performance. For a fair comparison, we compare at matched settings in terms of training subset and input scale. Based on whether visible box annotations (VBB) are used at training time, pedestrian detectors are divided into two groups, namely VBB-free and VBB-based methods.

First, we compare our method with VBB-free methods, including ATT+part [38], RepLoss [31], adaptive-NMS [12], CSP [16], and ALFNet [15]. From Table 6(a), we have the following observations:

Table 6: Comparisons of different methods on the CityPersons validation set. Numbers are log-average miss rates (lower is better). The scale column indicates the upsampling factor of input images. Bold indicates the best results. We separate VBB-free and VBB-based methods in two subtables.

(a) Comparison of VBB-free methods

Methods                   Visibility  Scale      R    HO
ATT+part [38] [CVPR18]    ≥ 65%       1x      16.0  56.7
RepLoss [31] [CVPR18]     ≥ 65%       1x      13.2  56.9
AdapNMS [12] [CVPR19]     ≥ 65%       1x      11.9  55.2
HGPD (Ours)               ≥ 65%       1x      11.3  51.7
ALFNet [15] [ECCV18]      ≥ 0%        1x      12.0  52.0
CSP [16] [CVPR19]         ≥ 0%        1x      11.0  49.3
HGPD (Ours)               ≥ 0%        1x      11.5  41.3
RepLoss [31] [CVPR18]     ≥ 65%       1.3x    11.5  55.3
AdapNMS [12] [CVPR19]     ≥ 65%       1.3x    10.8  54.0
HGPD (Ours)               ≥ 65%       1.3x    10.7  50.9

(b) Comparison of VBB-based methods

Methods                   Visibility  Scale      R    HO
MGAN [21] [ICCV19]        ≥ 65%       1x      11.5  51.7
HGPD (Ours)               ≥ 65%       1x      11.3  51.7
MGAN [21] [ICCV19]        ≥ 0%        1x      11.3  42.0
HGPD (Ours)               ≥ 0%        1x      11.5  41.3
OR-CNN [37] [ECCV18]      ≥ 50%       1x      12.8  55.7
MGAN [21] [ICCV19]        ≥ 50%       1x      10.8  46.7
HGPD (Ours)               ≥ 50%       1x      11.5  45.9
Bi-box [41] [ECCV18]      ≥ 30%       1.3x    11.2  44.2
A+DT [40] [ICCV19]        ≥ 30%       1.3x    11.1  44.3
HGPD (Ours)               ≥ 30%       1.3x    10.9  40.9

(1) Our method outperforms top competitors by a large margin (more than 3 pp) on the HO set and achieves comparable performance to the second best one (CSP) on the R set. (2) Our HGPD also consistently outperforms the methods designed for occlusion handling, including RepLoss and Adaptive-NMS, on both R and HO. With a 1x input scale and the R training set, our method achieves log-average miss rates of 11.3% and 51.7% on the R and HO sets, surpassing RepLoss and Adaptive-NMS. (3) Besides, compared with one-stage detectors (ALFNet and CSP), our method outperforms both of them on the HO set by a large margin (∼ 8 pp).

Furthermore, we compare our HGPD with VBB-based methods in Table 6(b). Though our method uses less supervision information, it still achieves the best performance on the HO set. MGAN [21] is a strong competitor, which uses visible regions to generate spatial-wise attention. We conduct comprehensive comparisons with MGAN; specifically, we use three training sets, with visibility ratio ≥ 65%, ≥ 50% and ≥ 0%. With each training set, our


REFERENCES
[1] Markus Braun, Sebastian Krebs, Fabian Flohr, and Dariu M Gavrila. 2018. The EuroCity Persons dataset: A novel benchmark for object detection. arXiv preprint arXiv:1805.07193 (2018).

[2] Garrick Brazil and Xiaoming Liu. 2019. Pedestrian detection with autoregressive network phases. In CVPR. 7231–7240.

[3] Garrick Brazil, Xi Yin, and Xiaoming Liu. 2017. Illuminating pedestrians via simultaneous detection & segmentation. In ICCV. 4950–4959.

[4] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. 2013. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203 (2013).

[5] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, et al. 2019. MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019).

[6] Cheng Chi, Shifeng Zhang, Junliang Xing, Zhen Lei, Stan Z Li, and Xudong Zou. 2019. PedHunter: Occlusion robust pedestrian detector in crowded scenes. arXiv preprint arXiv:1909.06826 (2019).

[7] Cheng Chi, Shifeng Zhang, Junliang Xing, Zhen Lei, Stan Z Li, and Xudong Zou. 2019. Relational learning for joint head and human detection. arXiv preprint arXiv:1909.10674 (2019).

[8] Xuangeng Chu, Anlin Zheng, Xiangyu Zhang, and Jian Sun. 2020. Detection in crowded scenes: One proposal, multiple predictions. arXiv preprint arXiv:2003.09163 (2020).

[9] Piotr Dollar, Christian Wojek, Bernt Schiele, and Pietro Perona. 2011. Pedestrian detection: An evaluation of the state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence 34, 4 (2011), 743–761.

[10] Jiefeng Li, Can Wang, Hao Zhu, Yihuan Mao, Hao-Shu Fang, and Cewu Lu. 2019. CrowdPose: Efficient crowded scenes pose estimation and a new benchmark. In CVPR. 10863–10872.

[11] Chunze Lin, Jiwen Lu, Gang Wang, and Jie Zhou. 2018. Graininess-aware deep feature learning for pedestrian detection. In ECCV. 732–747.

[12] Songtao Liu, Di Huang, and Yunhong Wang. 2019. Adaptive NMS: Refining pedestrian detection in a crowd. In CVPR. 6459–6468.

[13] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. 2018. Path aggregation network for instance segmentation. In CVPR. 8759–8768.

[14] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. 2016. SSD: Single shot multibox detector. In ECCV. Springer, 21–37.

[15] Wei Liu, Shengcai Liao, Weidong Hu, Xuezhi Liang, and Xiao Chen. 2018. Learning efficient single-stage pedestrian detectors by asymptotic localization fitting. In ECCV. 618–634.

[16] Wei Liu, Shengcai Liao, Weiqiang Ren, Weidong Hu, and Yinan Yu. 2019. High-level semantic feature detection: A new perspective for pedestrian detection. In CVPR. 5187–5196.

[17] Jiayuan Mao, Tete Xiao, Yuning Jiang, and Zhimin Cao. 2017. What can help pedestrian detection?. In CVPR. 3127–3136.

[18] Lukáš Neumann, Michelle Karg, Shanshan Zhang, Christian Scharfenberger, Eric Piegert, Sarah Mistr, Olga Prokofyeva, Robert Thiel, Andrea Vedaldi, Andrew Zisserman, et al. 2018. NightOwls: A pedestrians at night dataset. In ACCV. Springer, 691–705.

[19] Alejandro Newell, Zhiao Huang, and Jia Deng. 2017. Associative embedding: End-to-end learning for joint detection and grouping. In NeurIPS. 2277–2287.

[20] Junhyug Noh, Soochan Lee, Beomsu Kim, and Gunhee Kim. 2018. Improving occlusion and hard negative handling for single-stage pedestrian detectors. In CVPR. 966–974.

[21] Yanwei Pang, Jin Xie, Muhammad Haris Khan, Rao Muhammad Anwer, Fahad Shahbaz Khan, and Ling Shao. 2019. Mask-guided attention network for occluded pedestrian detection. In ICCV. 4967–4975.

[22] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. (2017).

[23] Jimmy Ren, Xiaohao Chen, Jianbo Liu, Wenxiu Sun, Jiahao Pang, Qiong Yan, Yu-Wing Tai, and Li Xu. 2017. Accurate single stage detector using recurrent rolling convolution. In CVPR. 5420–5428.

[24] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS. 91–99.

[25] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2008. The graph neural network model. IEEE Transactions on Neural Networks 20, 1 (2008), 61–80.

[26] Yantao Shen, Hongsheng Li, Shuai Yi, Dapeng Chen, and Xiaogang Wang. 2018. Person re-identification with deep similarity-guided graph neural network. In ECCV. 486–504.

[27] Lei Shi, Yifan Zhang, Jian Cheng, and Hanqing Lu. 2019. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In CVPR. 12026–12035.

[28] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).

[29] Guanglu Song, Yu Liu, and Xiaogang Wang. 2020. Revisiting the sibling head in object detector. arXiv preprint arXiv:2003.07540 (2020).

[30] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS. 5998–6008.

[31] Xinlong Wang, Tete Xiao, Yuning Jiang, Shuai Shao, Jian Sun, and Chunhua Shen. 2018. Repulsion loss: Detecting pedestrians in a crowd. In CVPR. 7774–7783.

[32] Jin Xie, Yanwei Pang, Hisham Cholakkal, Rao Muhammad Anwer, Fahad Shahbaz Khan, and Ling Shao. 2020. PSC-Net: Learning part spatial co-occurrence for occluded pedestrian detection. arXiv preprint arXiv:2001.09252 (2020).

[33] Yichao Yan, Qiang Zhang, Bingbing Ni, Wendong Zhang, Minghao Xu, and Xiaokang Yang. 2019. Learning context graph for person search. In CVPR. 2158–2167.

[34] Shanshan Zhang, Rodrigo Benenson, Mohamed Omran, Jan Hosang, and Bernt Schiele. 2016. How far are we from solving pedestrian detection?. In CVPR. 1259–1267.

[35] Shanshan Zhang, Rodrigo Benenson, Mohamed Omran, Jan Hosang, and Bernt Schiele. 2017. Towards reaching human performance in pedestrian detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 4 (2017), 973–986.

[36] Shanshan Zhang, Rodrigo Benenson, and Bernt Schiele. 2017. CityPersons: A diverse dataset for pedestrian detection. In CVPR. 3213–3221.

[37] Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z Li. 2018. Occlusion-aware R-CNN: Detecting pedestrians in a crowd. In ECCV. 637–653.

[38] Shanshan Zhang, Jian Yang, and Bernt Schiele. 2018. Occluded pedestrian detection through guided attention in CNNs. In CVPR. 6995–7003.

[39] Feng Zheng, Cheng Deng, Xing Sun, Xinyang Jiang, Xiaowei Guo, Zongqiao Yu, Feiyue Huang, and Rongrong Ji. 2019. Pyramidal person re-identification via multi-loss dynamic training. In CVPR. 8514–8522.

[40] Chunluan Zhou, Ming Yang, and Junsong Yuan. 2019. Discriminative feature transformation for occluded pedestrian detection. In ICCV. 9557–9566.

[41] Chunluan Zhou and Junsong Yuan. 2018. Bi-box regression for pedestrian detection and occlusion estimation. In ECCV. 135–151.