Adapted Center and Scale Prediction: More Stable and More Accurate

Jusheng Zhang; Wenhao Wang

arxiv: 2002.09053 · v3 · pith:TY3DTH5Znew · submitted 2020-02-20 · 💻 cs.CV

Adapted Center and Scale Prediction: More Stable and More Accurate

Wenhao Wang , Jusheng Zhang This is my paper

Pith reviewed 2026-05-24 14:12 UTC · model grok-4.3

classification 💻 cs.CV

keywords pedestrian detectionanchor-free detectorone-stage detectorCityPersons benchmarkcenter and scale predictioncompressing widthswitchable normalization

0 comments

The pith

Adapting Center and Scale Prediction with robustness fixes and compressing width yields second-best results on CityPersons pedestrian benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper adapts the Center and Scale Prediction detector to improve its robustness during training and introduce a compressing width method for predicting object widths. These changes produce an anchor-free one-stage model that reaches 9.3 percent log-average miss rate on the reasonable set of CityPersons. The result also shows 8.7 percent on the partial set and 5.6 percent on the bare set. This performance indicates that such simple detectors can still reach high accuracy levels previously associated with more complex two-stage approaches. The work additionally examines further properties of Switchable Normalization.

Core claim

By improving the robustness of CSP and introducing compressing width prediction, the adapted detector attains 9.3% MR on reasonable, 8.7% on partial, and 5.6% on bare sets of CityPersons, showing anchor-free and one-stage detectors can still have high accuracy.

What carries the argument

Adapted Center and Scale Prediction detector that adds robustness improvements and a compressing width prediction method.

If this is right

An anchor-free one-stage detector can achieve second-best performance on the CityPersons benchmark.
The model reaches 9.3% log-average miss rate on the reasonable set, 8.7% on partial, and 5.6% on bare.
Switchable Normalization has capabilities not mentioned in its original paper.
Pedestrian detection can use simpler detector designs while maintaining competitive accuracy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The robustness and width adaptations might transfer to other one-stage detection tasks to reduce training instability.
Further study of normalization methods could identify additional ways to boost accuracy in similar models.
Lowering dependence on anchor boxes may simplify deployment in resource-limited settings.

Load-bearing premise

The reported performance gains on CityPersons are caused by the proposed adaptations rather than differences in training procedure, data handling, or other implementation details.

What would settle it

Retraining the original CSP detector with the same training procedure and data handling as the adapted version and checking whether it matches or exceeds the reported miss rates on CityPersons.

Figures

Figures reproduced from arXiv: 2002.09053 by Jusheng Zhang, Wenhao Wang.

**Figure 2.** Figure 2: It is the architecture of original CSP [ [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: The proportion of the weight of each normalization method in different parts is shown in the [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Comparisons of different batch size. It is [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

read the original abstract

Pedestrian detection benefits from deep learning technology and gains rapid development in recent years. Most of detectors follow general object detection frame, i.e. default boxes and two-stage process. Recently, anchor-free and one-stage detectors have been introduced into this area. However, their accuracies are unsatisfactory. Therefore, in order to enjoy the simplicity of anchor-free detectors and the accuracy of two-stage ones simultaneously, we propose some adaptations based on a detector, Center and Scale Prediction(CSP). The main contributions of our paper are: (1) We improve the robustness of CSP and make it easier to train. (2) We propose a novel method to predict width, namely compressing width. (3) We achieve the second best performance on CityPersons benchmark, i.e. 9.3% log-average miss rate(MR) on reasonable set, 8.7% MR on partial set and 5.6% MR on bare set, which shows an anchor-free and one-stage detector can still have high accuracy. (4) We explore some capabilities of Switchable Normalization which are not mentioned in its original paper. The code is publicly available at https://github.com/WangWenhao0716/Adapted-Center-and-Scale-Prediction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes two adaptations to the Center and Scale Prediction (CSP) anchor-free one-stage pedestrian detector: (1) robustness improvements intended to make training more stable and easier, and (2) a 'compressing width' technique for width prediction. It reports second-best results on the CityPersons benchmark (9.3% log-average miss rate on the reasonable set, 8.7% on partial, 5.6% on bare) and explores additional properties of Switchable Normalization. The code is released publicly.

Significance. If the reported gains are shown to result from the two listed adaptations rather than training-protocol differences, the work would demonstrate that anchor-free one-stage detectors can reach competitive accuracy on pedestrian detection benchmarks. The public code release is a clear strength for reproducibility.

major comments (2)

[Experiments] Experiments section (Tables 1-3 and associated text): no ablation is presented that re-trains the original CSP under the exact same optimizer, augmentation, epoch count, and data protocol as the adapted version. Without this controlled comparison, the 9.3% MR on the reasonable set cannot be attributed to the robustness fixes or compressing-width change rather than unstated implementation differences.
[§3.2] §3.2 (compressing width): the method is described only at a high level with no equation, loss term, or pseudocode that distinguishes it from standard width regression; therefore its claimed novelty and its contribution to the final numbers cannot be evaluated.

minor comments (2)

[Abstract and §4] The abstract states that Switchable Normalization capabilities 'not mentioned in its original paper' are explored, yet the main text does not list the specific new observations or provide a dedicated subsection or table for them.
[Figures] Figure captions and axis labels in the CityPersons result plots use inconsistent font sizes and omit error bars or run-to-run variance, reducing clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The two major comments identify important gaps in experimental controls and methodological clarity. We address each below and will revise the manuscript to strengthen the claims.

read point-by-point responses

Referee: [Experiments] Experiments section (Tables 1-3 and associated text): no ablation is presented that re-trains the original CSP under the exact same optimizer, augmentation, epoch count, and data protocol as the adapted version. Without this controlled comparison, the 9.3% MR on the reasonable set cannot be attributed to the robustness fixes or compressing-width change rather than unstated implementation differences.

Authors: We agree that the current experiments lack a controlled re-implementation of the original CSP under identical training settings. This prevents definitive attribution of the 9.3% MR improvement. In the revised manuscript we will add an ablation that re-trains the baseline CSP with the exact optimizer, augmentation, epoch schedule, and data protocol used for the adapted model, allowing direct isolation of the contributions from the robustness changes and compressing-width technique. revision: yes
Referee: [§3.2] §3.2 (compressing width): the method is described only at a high level with no equation, loss term, or pseudocode that distinguishes it from standard width regression; therefore its claimed novelty and its contribution to the final numbers cannot be evaluated.

Authors: We acknowledge that §3.2 currently provides only a high-level description. In the revision we will insert the precise mathematical formulation of the compressing-width prediction, the modified loss term, and pseudocode that explicitly contrasts it with standard width regression. These additions will clarify the novelty and enable readers to assess its specific contribution to the reported results. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark results are independent of any self-defined derivations

full rationale

The paper describes adaptations (robustness improvements and compressing width) to the CSP detector and reports standard CityPersons benchmark numbers (9.3% MR reasonable etc.). No equations, fitted parameters, or predictions are presented that reduce by construction to inputs. No self-citation load-bearing uniqueness theorems, ansatzes smuggled via citation, or renaming of known results appear in the provided text. The central claim is an empirical performance comparison whose validity rests on implementation details and ablations (not supplied), not on any derivation chain that collapses to self-reference. This matches the default case of a non-circular empirical ML paper.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Abstract-only review; insufficient detail to enumerate specific free parameters, axioms, or invented entities beyond standard deep-learning training assumptions.

free parameters (1)

training hyperparameters
Deep learning detectors rely on many fitted hyperparameters such as learning rate and loss weights not detailed in the abstract.

axioms (1)

domain assumption CityPersons benchmark provides a fair and representative measure of pedestrian detection performance.
Performance claims rest on the validity of this standard evaluation protocol.

pith-pipeline@v0.9.0 · 5747 in / 1229 out tokens · 36277 ms · 2026-05-24T14:12:06.579543+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We replace all BN layers with SN layers... compressing width: w = r · h where r < 0.41... vanilla L1 loss... achieve 9.3% MR on reasonable set
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We make original CSP more robust... no special occlusion handling... second best on CityPersons

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 11 internal anchors

[1]

Layer Normalization

Jimmy Lei Ba, Jamie Ryan Kiros, and Geof- frey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[2]

Eurocity persons: A novel benchmark for person detection in traﬃc scenes

Markus Braun, Sebastian Krebs, Fabian Flohr, and Dariu M Gavrila. Eurocity persons: A novel benchmark for person detection in traﬃc scenes. IEEE transactions on pattern analysis and ma- chine intelligence , 41(8):1844–1861, 2019. 10

work page 2019
[3]

The cityscapes dataset for seman- tic urban scene understanding

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Ro- drigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for seman- tic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016

work page 2016
[4]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE con- ference on computer vision and pattern recogni- tion, pages 248–255. Ieee, 2009

work page 2009
[5]

Fast feature pyramids for object detection

Piotr Doll´ ar, Ron Appel, Serge Belongie, and Pietro Perona. Fast feature pyramids for object detection. IEEE transactions on pattern anal- ysis and machine intelligence , 36(8):1532–1545, 2014

work page 2014
[6]

Pedestrian detection: A benchmark

Piotr Doll´ ar, Christian Wojek, Bernt Schiele, and Pietro Perona. Pedestrian detection: A benchmark. In 2009 IEEE Conference on Com- puter Vision and Pattern Recognition , pages 304–311. IEEE, 2009

work page 2009
[7]

Are we ready for autonomous driving? the kitti vision benchmark suite

Andreas Geiger, Philip Lenz, and Raquel Urta- sun. Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE Con- ference on Computer Vision and Pattern Recog- nition, pages 3354–3361. IEEE, 2012

work page 2012
[8]

Fast r-cnn

Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015

work page 2015
[9]

Rich feature hierarchies for ac- curate object detection and semantic segmenta- tion

Ross Girshick, Jeﬀ Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for ac- curate object detection and semantic segmenta- tion. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 580–587, 2014

work page 2014
[10]

Deep residual learning for image recog- nition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recog- nition. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 770–778, 2016

work page 2016
[11]

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Eﬃcient convolutional neural net- works for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[12]

DenseBox: Unifying Landmark Localization with End to End Object Detection

Lichao Huang, Yi Yang, Yafeng Deng, and Yi- nan Yu. Densebox: Unifying landmark local- ization with end to end object detection. arXiv preprint arXiv:1509.04874, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[13]

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Sergey Ioﬀe and Christian Szegedy. Batch nor- malization: Accelerating deep network train- ing by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[14]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[15]

Foveabox: Beyond anchor-based object detector

Tao Kong, Fuchun Sun, Huaping Liu, Yun- ing Jiang, and Jianbo Shi. Foveabox: Beyond anchor-based object detector. arXiv preprint arXiv:1904.03797, 2019

work page arXiv 1904
[16]

Cornernet: Detecting objects as paired keypoints

Hei Law and Jia Deng. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pages 734–750, 2018

work page 2018
[17]

Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes

Yuhong Li, Xiaofan Zhang, and Deming Chen. Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes. In Proceedings of the IEEE conference on com- puter vision and pattern recognition, pages 1091– 1100, 2018

work page 2018
[18]

Detnet: Design backbone for object detection

Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, and Jian Sun. Detnet: Design backbone for object detection. In Pro- ceedings of the European Conference on Com- puter Vision (ECCV) , pages 334–350, 2018

work page 2018
[19]

An ex- tended set of haar-like features for rapid object detection

Rainer Lienhart and Jochen Maydt. An ex- tended set of haar-like features for rapid object detection. In Proceedings. international confer- ence on image processing , volume 1, pages I–I. IEEE, 2002. 11

work page 2002
[20]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Pi- otr Doll´ ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision , pages 740–755. Springer, 2014

work page 2014
[21]

Adaptive nms: Reﬁning pedestrian detection in a crowd

Songtao Liu, Di Huang, and Yunhong Wang. Adaptive nms: Reﬁning pedestrian detection in a crowd. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages 6459–6468, 2019

work page 2019
[22]

Ssd: Single shot multi- box detector

Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multi- box detector. In European conference on com- puter vision , pages 21–37. Springer, 2016

work page 2016
[23]

Learning eﬃcient single- stage pedestrian detectors by asymptotic local- ization ﬁtting

Wei Liu, Shengcai Liao, Weidong Hu, Xuezhi Liang, and Xiao Chen. Learning eﬃcient single- stage pedestrian detectors by asymptotic local- ization ﬁtting. In Proceedings of the European Conference on Computer Vision (ECCV) , pages 618–634, 2018

work page 2018
[24]

High-level semantic feature detection: A new perspective for pedestrian de- tection

Wei Liu, Shengcai Liao, Weiqiang Ren, Weidong Hu, and Yinan Yu. High-level semantic feature detection: A new perspective for pedestrian de- tection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages 5187–5196, 2019

work page 2019
[25]

Semantic head en- hanced pedestrian detection in a crowd

Ruiqi Lu and Huimin Ma. Semantic head en- hanced pedestrian detection in a crowd. arXiv preprint arXiv:1911.11985, 2019

work page arXiv 1911
[26]

Differentiable Learning-to-Normalize via Switchable Normalization

Ping Luo, Jiamin Ren, Zhanglin Peng, Ruimao Zhang, and Jingyu Li. Diﬀerentiable learning-to- normalize via switchable normalization. arXiv preprint arXiv:1806.10779, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[27]

Local Decorrelation For Improved Detection

Woonhyun Nam, Piotr Doll´ ar, and Joon Hee Han. Local decorrelation for improved detection. arXiv preprint arXiv:1406.1134 , 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[28]

Mask-guided attention network for occluded pedestrian detection

Yanwei Pang, Jin Xie, Muhammad Haris Khan, Rao Muhammad Anwer, Fahad Shahbaz Khan, and Ling Shao. Mask-guided attention network for occluded pedestrian detection. In Proceed- ings of the IEEE International Conference on Computer Vision , pages 4967–4975, 2019

work page 2019
[29]

Automatic diﬀeren- tiation in pytorch

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary De- Vito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic diﬀeren- tiation in pytorch. 2017

work page 2017
[30]

You only look once: Uni- ﬁed, real-time object detection

Joseph Redmon, Santosh Divvala, Ross Gir- shick, and Ali Farhadi. You only look once: Uni- ﬁed, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016

work page 2016
[31]

Yolo9000: bet- ter, faster, stronger

Joseph Redmon and Ali Farhadi. Yolo9000: bet- ter, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recog- nition, pages 7263–7271, 2017

work page 2017
[32]

Faster r-cnn: Towards real-time ob- ject detection with region proposal networks

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time ob- ject detection with region proposal networks. In Advances in neural information processing sys- tems, pages 91–99, 2015

work page 2015
[33]

Weight nor- malization: A simple reparameterization to ac- celerate training of deep neural networks

Tim Salimans and Durk P Kingma. Weight nor- malization: A simple reparameterization to ac- celerate training of deep neural networks. In Ad- vances in neural information processing systems , pages 901–909, 2016

work page 2016
[34]

How does batch normalization help optimization? In Ad- vances in Neural Information Processing Sys- tems, pages 2483–2493, 2018

Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization? In Ad- vances in Neural Information Processing Sys- tems, pages 2483–2493, 2018

work page 2018
[35]

CrowdHuman: A Benchmark for Detecting Human in a Crowd

Shuai Shao, Zijian Zhao, Boxun Li, Tete Xiao, Gang Yu, Xiangyu Zhang, and Jian Sun. Crowd- human: A benchmark for detecting human in a crowd. arXiv preprint arXiv:1805.00123 , 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[36]

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large- scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 12

work page internal anchor Pith review Pith/arXiv arXiv 2014
[37]

Small-scale pedestrian detec- tion based on topological line localization and temporal feature aggregation

Tao Song, Leiyu Sun, Di Xie, Haiming Sun, and Shiliang Pu. Small-scale pedestrian detec- tion based on topological line localization and temporal feature aggregation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 536–551, 2018

work page 2018
[38]

Compositional human pose regres- sion

Xiao Sun, Jiaxiang Shang, Shuang Liang, and Yichen Wei. Compositional human pose regres- sion. In Proceedings of the IEEE International Conference on Computer Vision , pages 2602– 2611, 2017

work page 2017
[39]

Integral human pose regression

Xiao Sun, Bin Xiao, Fangyin Wei, Shuang Liang, and Yichen Wei. Integral human pose regression. In Proceedings of the European Conference on Computer Vision (ECCV) , pages 529–545, 2018

work page 2018
[40]

Mean teach- ers are better role models: Weight-averaged con- sistency targets improve semi-supervised deep learning results

Antti Tarvainen and Harri Valpola. Mean teach- ers are better role models: Weight-averaged con- sistency targets improve semi-supervised deep learning results. In Advances in neural informa- tion processing systems , pages 1195–1204, 2017

work page 2017
[41]

Instance Normalization: The Missing Ingredient for Fast Stylization

Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The miss- ing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[42]

Robust real- time object detection

Paul Viola, Michael Jones, et al. Robust real- time object detection. International journal of computer vision, 4(34-47):4, 2001

work page 2001
[43]

Repulsion loss: Detecting pedestrians in a crowd

Xinlong Wang, Tete Xiao, Yuning Jiang, Shuai Shao, Jian Sun, and Chunhua Shen. Repulsion loss: Detecting pedestrians in a crowd. In Pro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages 7774– 7783, 2018

work page 2018
[44]

Group normaliza- tion

Yuxin Wu and Kaiming He. Group normaliza- tion. In Proceedings of the European Conference on Computer Vision (ECCV) , pages 3–19, 2018

work page 2018
[45]

Psc-net: Learning part spatial co-occurence for occluded pedestrian detection

Jin Xie, Yanwei Pang, Hisham Cholakkal, Rao Muhammad Anwer, Fahad Shahbaz Khan, and Ling Shao. Psc-net: Learning part spatial co-occurence for occluded pedestrian detection. arXiv preprint arXiv:2001.09252 , 2020

work page arXiv 2001
[46]

Deep layer aggregation

Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell. Deep layer aggregation. In Pro- ceedings of the IEEE conference on computer vi- sion and pattern recognition , pages 2403–2412, 2018

work page 2018
[47]

Jialiang Zhang, Lixiang Lin, Yun-chen Chen, Yao Hu, Steven C. H. Hoi, and Jianke Zhu. CSID: center, scale, identity and density-aware pedestrian detection in a crowd. CoRR, abs/1910.09188, 2019

work page arXiv 1910
[48]

Towards reaching human performance in pedestrian de- tection

Shanshan Zhang, Rodrigo Benenson, Mohamed Omran, Jan Hosang, and Bernt Schiele. Towards reaching human performance in pedestrian de- tection. IEEE transactions on pattern analysis and machine intelligence , 40(4):973–986, 2017

work page 2017
[49]

Citypersons: A diverse dataset for pedestrian detection

Shanshan Zhang, Rodrigo Benenson, and Bernt Schiele. Citypersons: A diverse dataset for pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3221, 2017

work page 2017
[50]

Filtered channel features for pedes- trian detection

Shanshan Zhang, Rodrigo Benenson, Bernt Schiele, et al. Filtered channel features for pedes- trian detection. In CVPR, volume 1, page 4, 2015

work page 2015
[51]

Occluded pedestrian detection through guided attention in cnns

Shanshan Zhang, Jian Yang, and Bernt Schiele. Occluded pedestrian detection through guided attention in cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6995–7003, 2018

work page 2018
[52]

Occlusion-aware r-cnn: de- tecting pedestrians in a crowd

Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z Li. Occlusion-aware r-cnn: de- tecting pedestrians in a crowd. In Proceedings of the European Conference on Computer Vision (ECCV), pages 637–653, 2018

work page 2018
[53]

Discriminative feature transformation for oc- cluded pedestrian detection

Chunluan Zhou, Ming Yang, and Junsong Yuan. Discriminative feature transformation for oc- cluded pedestrian detection. In Proceedings of the IEEE International Conference on Com- puter Vision , pages 9557–9566, 2019. 13

work page 2019
[54]

Bi-box regression for pedestrian detection and occlu- sion estimation

Chunluan Zhou and Junsong Yuan. Bi-box regression for pedestrian detection and occlu- sion estimation. In Proceedings of the European Conference on Computer Vision (ECCV) , pages 135–151, 2018

work page 2018
[55]

Objects as Points

Xingyi Zhou, Dequan Wang, and Philipp Kr¨ ahenb¨ uhl. Objects as points.arXiv preprint arXiv:1904.07850, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1904
[56]

Bottom-up object detection by group- ing extreme and center points

Xingyi Zhou, Jiacheng Zhuo, and Philipp Kra- henbuhl. Bottom-up object detection by group- ing extreme and center points. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 850–859, 2019

work page 2019
[57]

Feature selective anchor-free module for single- shot object detection

Chenchen Zhu, Yihui He, and Marios Savvides. Feature selective anchor-free module for single- shot object detection. In Proceedings of the IEEE Conference on Computer Vision and Pat- tern Recognition, pages 840–849, 2019. 14

work page 2019

[1] [1]

Layer Normalization

Jimmy Lei Ba, Jamie Ryan Kiros, and Geof- frey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[2] [2]

Eurocity persons: A novel benchmark for person detection in traﬃc scenes

Markus Braun, Sebastian Krebs, Fabian Flohr, and Dariu M Gavrila. Eurocity persons: A novel benchmark for person detection in traﬃc scenes. IEEE transactions on pattern analysis and ma- chine intelligence , 41(8):1844–1861, 2019. 10

work page 2019

[3] [3]

The cityscapes dataset for seman- tic urban scene understanding

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Ro- drigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for seman- tic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016

work page 2016

[4] [4]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE con- ference on computer vision and pattern recogni- tion, pages 248–255. Ieee, 2009

work page 2009

[5] [5]

Fast feature pyramids for object detection

Piotr Doll´ ar, Ron Appel, Serge Belongie, and Pietro Perona. Fast feature pyramids for object detection. IEEE transactions on pattern anal- ysis and machine intelligence , 36(8):1532–1545, 2014

work page 2014

[6] [6]

Pedestrian detection: A benchmark

Piotr Doll´ ar, Christian Wojek, Bernt Schiele, and Pietro Perona. Pedestrian detection: A benchmark. In 2009 IEEE Conference on Com- puter Vision and Pattern Recognition , pages 304–311. IEEE, 2009

work page 2009

[7] [7]

Are we ready for autonomous driving? the kitti vision benchmark suite

Andreas Geiger, Philip Lenz, and Raquel Urta- sun. Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE Con- ference on Computer Vision and Pattern Recog- nition, pages 3354–3361. IEEE, 2012

work page 2012

[8] [8]

Fast r-cnn

Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015

work page 2015

[9] [9]

Rich feature hierarchies for ac- curate object detection and semantic segmenta- tion

Ross Girshick, Jeﬀ Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for ac- curate object detection and semantic segmenta- tion. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 580–587, 2014

work page 2014

[10] [10]

Deep residual learning for image recog- nition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recog- nition. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 770–778, 2016

work page 2016

[11] [11]

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Eﬃcient convolutional neural net- works for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[12] [12]

DenseBox: Unifying Landmark Localization with End to End Object Detection

Lichao Huang, Yi Yang, Yafeng Deng, and Yi- nan Yu. Densebox: Unifying landmark local- ization with end to end object detection. arXiv preprint arXiv:1509.04874, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[13] [13]

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Sergey Ioﬀe and Christian Szegedy. Batch nor- malization: Accelerating deep network train- ing by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[14] [14]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[15] [15]

Foveabox: Beyond anchor-based object detector

Tao Kong, Fuchun Sun, Huaping Liu, Yun- ing Jiang, and Jianbo Shi. Foveabox: Beyond anchor-based object detector. arXiv preprint arXiv:1904.03797, 2019

work page arXiv 1904

[16] [16]

Cornernet: Detecting objects as paired keypoints

Hei Law and Jia Deng. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pages 734–750, 2018

work page 2018

[17] [17]

Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes

Yuhong Li, Xiaofan Zhang, and Deming Chen. Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes. In Proceedings of the IEEE conference on com- puter vision and pattern recognition, pages 1091– 1100, 2018

work page 2018

[18] [18]

Detnet: Design backbone for object detection

Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, and Jian Sun. Detnet: Design backbone for object detection. In Pro- ceedings of the European Conference on Com- puter Vision (ECCV) , pages 334–350, 2018

work page 2018

[19] [19]

An ex- tended set of haar-like features for rapid object detection

Rainer Lienhart and Jochen Maydt. An ex- tended set of haar-like features for rapid object detection. In Proceedings. international confer- ence on image processing , volume 1, pages I–I. IEEE, 2002. 11

work page 2002

[20] [20]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Pi- otr Doll´ ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision , pages 740–755. Springer, 2014

work page 2014

[21] [21]

Adaptive nms: Reﬁning pedestrian detection in a crowd

Songtao Liu, Di Huang, and Yunhong Wang. Adaptive nms: Reﬁning pedestrian detection in a crowd. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages 6459–6468, 2019

work page 2019

[22] [22]

Ssd: Single shot multi- box detector

Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multi- box detector. In European conference on com- puter vision , pages 21–37. Springer, 2016

work page 2016

[23] [23]

Learning eﬃcient single- stage pedestrian detectors by asymptotic local- ization ﬁtting

Wei Liu, Shengcai Liao, Weidong Hu, Xuezhi Liang, and Xiao Chen. Learning eﬃcient single- stage pedestrian detectors by asymptotic local- ization ﬁtting. In Proceedings of the European Conference on Computer Vision (ECCV) , pages 618–634, 2018

work page 2018

[24] [24]

High-level semantic feature detection: A new perspective for pedestrian de- tection

Wei Liu, Shengcai Liao, Weiqiang Ren, Weidong Hu, and Yinan Yu. High-level semantic feature detection: A new perspective for pedestrian de- tection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages 5187–5196, 2019

work page 2019

[25] [25]

Semantic head en- hanced pedestrian detection in a crowd

Ruiqi Lu and Huimin Ma. Semantic head en- hanced pedestrian detection in a crowd. arXiv preprint arXiv:1911.11985, 2019

work page arXiv 1911

[26] [26]

Differentiable Learning-to-Normalize via Switchable Normalization

Ping Luo, Jiamin Ren, Zhanglin Peng, Ruimao Zhang, and Jingyu Li. Diﬀerentiable learning-to- normalize via switchable normalization. arXiv preprint arXiv:1806.10779, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[27] [27]

Local Decorrelation For Improved Detection

Woonhyun Nam, Piotr Doll´ ar, and Joon Hee Han. Local decorrelation for improved detection. arXiv preprint arXiv:1406.1134 , 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[28] [28]

Mask-guided attention network for occluded pedestrian detection

Yanwei Pang, Jin Xie, Muhammad Haris Khan, Rao Muhammad Anwer, Fahad Shahbaz Khan, and Ling Shao. Mask-guided attention network for occluded pedestrian detection. In Proceed- ings of the IEEE International Conference on Computer Vision , pages 4967–4975, 2019

work page 2019

[29] [29]

Automatic diﬀeren- tiation in pytorch

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary De- Vito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic diﬀeren- tiation in pytorch. 2017

work page 2017

[30] [30]

You only look once: Uni- ﬁed, real-time object detection

Joseph Redmon, Santosh Divvala, Ross Gir- shick, and Ali Farhadi. You only look once: Uni- ﬁed, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016

work page 2016

[31] [31]

Yolo9000: bet- ter, faster, stronger

Joseph Redmon and Ali Farhadi. Yolo9000: bet- ter, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recog- nition, pages 7263–7271, 2017

work page 2017

[32] [32]

Faster r-cnn: Towards real-time ob- ject detection with region proposal networks

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time ob- ject detection with region proposal networks. In Advances in neural information processing sys- tems, pages 91–99, 2015

work page 2015

[33] [33]

Weight nor- malization: A simple reparameterization to ac- celerate training of deep neural networks

Tim Salimans and Durk P Kingma. Weight nor- malization: A simple reparameterization to ac- celerate training of deep neural networks. In Ad- vances in neural information processing systems , pages 901–909, 2016

work page 2016

[34] [34]

How does batch normalization help optimization? In Ad- vances in Neural Information Processing Sys- tems, pages 2483–2493, 2018

Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization? In Ad- vances in Neural Information Processing Sys- tems, pages 2483–2493, 2018

work page 2018

[35] [35]

CrowdHuman: A Benchmark for Detecting Human in a Crowd

Shuai Shao, Zijian Zhao, Boxun Li, Tete Xiao, Gang Yu, Xiangyu Zhang, and Jian Sun. Crowd- human: A benchmark for detecting human in a crowd. arXiv preprint arXiv:1805.00123 , 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[36] [36]

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large- scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 12

work page internal anchor Pith review Pith/arXiv arXiv 2014

[37] [37]

Small-scale pedestrian detec- tion based on topological line localization and temporal feature aggregation

Tao Song, Leiyu Sun, Di Xie, Haiming Sun, and Shiliang Pu. Small-scale pedestrian detec- tion based on topological line localization and temporal feature aggregation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 536–551, 2018

work page 2018

[38] [38]

Compositional human pose regres- sion

Xiao Sun, Jiaxiang Shang, Shuang Liang, and Yichen Wei. Compositional human pose regres- sion. In Proceedings of the IEEE International Conference on Computer Vision , pages 2602– 2611, 2017

work page 2017

[39] [39]

Integral human pose regression

Xiao Sun, Bin Xiao, Fangyin Wei, Shuang Liang, and Yichen Wei. Integral human pose regression. In Proceedings of the European Conference on Computer Vision (ECCV) , pages 529–545, 2018

work page 2018

[40] [40]

Mean teach- ers are better role models: Weight-averaged con- sistency targets improve semi-supervised deep learning results

Antti Tarvainen and Harri Valpola. Mean teach- ers are better role models: Weight-averaged con- sistency targets improve semi-supervised deep learning results. In Advances in neural informa- tion processing systems , pages 1195–1204, 2017

work page 2017

[41] [41]

Instance Normalization: The Missing Ingredient for Fast Stylization

Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The miss- ing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[42] [42]

Robust real- time object detection

Paul Viola, Michael Jones, et al. Robust real- time object detection. International journal of computer vision, 4(34-47):4, 2001

work page 2001

[43] [43]

Repulsion loss: Detecting pedestrians in a crowd

Xinlong Wang, Tete Xiao, Yuning Jiang, Shuai Shao, Jian Sun, and Chunhua Shen. Repulsion loss: Detecting pedestrians in a crowd. In Pro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages 7774– 7783, 2018

work page 2018

[44] [44]

Group normaliza- tion

Yuxin Wu and Kaiming He. Group normaliza- tion. In Proceedings of the European Conference on Computer Vision (ECCV) , pages 3–19, 2018

work page 2018

[45] [45]

Psc-net: Learning part spatial co-occurence for occluded pedestrian detection

Jin Xie, Yanwei Pang, Hisham Cholakkal, Rao Muhammad Anwer, Fahad Shahbaz Khan, and Ling Shao. Psc-net: Learning part spatial co-occurence for occluded pedestrian detection. arXiv preprint arXiv:2001.09252 , 2020

work page arXiv 2001

[46] [46]

Deep layer aggregation

Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell. Deep layer aggregation. In Pro- ceedings of the IEEE conference on computer vi- sion and pattern recognition , pages 2403–2412, 2018

work page 2018

[47] [47]

Jialiang Zhang, Lixiang Lin, Yun-chen Chen, Yao Hu, Steven C. H. Hoi, and Jianke Zhu. CSID: center, scale, identity and density-aware pedestrian detection in a crowd. CoRR, abs/1910.09188, 2019

work page arXiv 1910

[48] [48]

Towards reaching human performance in pedestrian de- tection

Shanshan Zhang, Rodrigo Benenson, Mohamed Omran, Jan Hosang, and Bernt Schiele. Towards reaching human performance in pedestrian de- tection. IEEE transactions on pattern analysis and machine intelligence , 40(4):973–986, 2017

work page 2017

[49] [49]

Citypersons: A diverse dataset for pedestrian detection

Shanshan Zhang, Rodrigo Benenson, and Bernt Schiele. Citypersons: A diverse dataset for pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3221, 2017

work page 2017

[50] [50]

Filtered channel features for pedes- trian detection

Shanshan Zhang, Rodrigo Benenson, Bernt Schiele, et al. Filtered channel features for pedes- trian detection. In CVPR, volume 1, page 4, 2015

work page 2015

[51] [51]

Occluded pedestrian detection through guided attention in cnns

Shanshan Zhang, Jian Yang, and Bernt Schiele. Occluded pedestrian detection through guided attention in cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6995–7003, 2018

work page 2018

[52] [52]

Occlusion-aware r-cnn: de- tecting pedestrians in a crowd

Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z Li. Occlusion-aware r-cnn: de- tecting pedestrians in a crowd. In Proceedings of the European Conference on Computer Vision (ECCV), pages 637–653, 2018

work page 2018

[53] [53]

Discriminative feature transformation for oc- cluded pedestrian detection

Chunluan Zhou, Ming Yang, and Junsong Yuan. Discriminative feature transformation for oc- cluded pedestrian detection. In Proceedings of the IEEE International Conference on Com- puter Vision , pages 9557–9566, 2019. 13

work page 2019

[54] [54]

Bi-box regression for pedestrian detection and occlu- sion estimation

Chunluan Zhou and Junsong Yuan. Bi-box regression for pedestrian detection and occlu- sion estimation. In Proceedings of the European Conference on Computer Vision (ECCV) , pages 135–151, 2018

work page 2018

[55] [55]

Objects as Points

Xingyi Zhou, Dequan Wang, and Philipp Kr¨ ahenb¨ uhl. Objects as points.arXiv preprint arXiv:1904.07850, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1904

[56] [56]

Bottom-up object detection by group- ing extreme and center points

Xingyi Zhou, Jiacheng Zhuo, and Philipp Kra- henbuhl. Bottom-up object detection by group- ing extreme and center points. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 850–859, 2019

work page 2019

[57] [57]

Feature selective anchor-free module for single- shot object detection

Chenchen Zhu, Yihui He, and Marios Savvides. Feature selective anchor-free module for single- shot object detection. In Proceedings of the IEEE Conference on Computer Vision and Pat- tern Recognition, pages 840–849, 2019. 14

work page 2019