Human Pose Estimation for Real-World Crowded Scenarios

Arne Schumann; Jürgen Beyerer; Thomas Golda; Tobias Kalb

arxiv: 1907.06922 · v1 · pith:BSIJAMUFnew · submitted 2019-07-16 · 💻 cs.CV

Pith reviewed 2026-05-24 21:09 UTC · model grok-4.3

classification 💻 cs.CV

keywords human pose estimation · crowded scenes · occlusion augmentation · synthetic data · JTA dataset · COCO dataset · keypoint detection

The pith

Occlusion augmentation from COCO, explicit occluded-part detection, and an extended JTA dataset raise pose estimation accuracy in crowds by 4.7 percent AP.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines ways to make human pose estimation work better when many people stand close together and block each other. It tests three changes to a standard method: randomly pasting person and object shapes from the COCO dataset to create training occlusions, training a model that outputs separate flags for visible and hidden body parts using JTA occlusion labels, and extending the synthetic JTA dataset with more varied poses and denser crowds. If these changes succeed, pose estimators could handle everyday crowded settings like streets or events without large drops in performance. The reported result is a 4.7 percent gain in average precision, bringing the system in line with current top methods on the tested data.
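The first of these changes, pasting COCO cutouts over training images, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names `paste_occluder` and `random_occlusion`, the grayscale nested-list image format, the binary mask, and the `prob` parameter are all assumptions made for the example.

```python
import random


def paste_occluder(image, cutout, mask, top, left):
    """Paste `cutout` onto a copy of `image` wherever `mask` is 1.

    image, cutout: 2D lists of pixel values (grayscale for brevity).
    mask: 2D list of 0/1 flags with the same shape as `cutout`.
    Cutout pixels that fall outside the image are skipped.
    """
    out = [row[:] for row in image]
    height, width = len(image), len(image[0])
    for dy, (cut_row, mask_row) in enumerate(zip(cutout, mask)):
        for dx, (pixel, keep) in enumerate(zip(cut_row, mask_row)):
            y, x = top + dy, left + dx
            if keep and 0 <= y < height and 0 <= x < width:
                out[y][x] = pixel
    return out


def random_occlusion(image, cutouts, prob=0.5, rng=random):
    """With probability `prob`, paste one random (cutout, mask) pair
    at a random position; otherwise return an unchanged copy."""
    if not cutouts or rng.random() >= prob:
        return [row[:] for row in image]
    cutout, mask = rng.choice(cutouts)
    top = rng.randrange(len(image))
    left = rng.randrange(len(image[0]))
    return paste_occluder(image, cutout, mask, top, left)
```

In a real pipeline the cutouts would come from segmentation masks of persons and objects, and the keypoint labels of the training image would stay unchanged so that the network learns to predict joints hidden behind the pasted shape.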

Core claim

The authors report that three changes together raise the accuracy of a baseline pose detector by 4.7% AP on crowded scenarios and reach results comparable to the state of the art: COCO-based occlusion augmentation during training, a two-branch model trained on JTA's occlusion flags to distinguish visible from occluded parts, and an extension of the JTA dataset that better matches real-world crowd densities and pose variety.

What carries the argument

Occlusion data augmentation using COCO cutouts combined with explicit occluded-body-part detection via JTA flags and a dataset extension for transfer to real crowds.

Load-bearing premise

The JTA extension and COCO occlusion augmentation create training examples that match the distribution of real-world crowd occlusions and densities without harmful artifacts.

What would settle it

Running the method on a fresh, real-world crowded pose dataset collected independently and checking whether the 4.7% AP improvement still appears would confirm or refute the claim.

Figures

Figures reproduced from arXiv: 1907.06922 by Arne Schumann, Jürgen Beyerer, Thomas Golda, Tobias Kalb.

**Figure 1.** Example of a scene captured by a surveillance camera. Such a situation is characterized by heterogeneous levels of crowdedness and lots of occlusions and ambiguities at more crowded spots. The drawn poses show the difficulty of the task.

**Figure 2.** CrowdIndex distributions. CrowdPose and JTA differ significantly regarding their CrowdIndex distributions. JTA-Ext was created to diminish this difference and create a distribution closer to a uniform distribution.
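For readers unfamiliar with the CrowdIndex, the sketch below shows one way to compute it. It assumes the definition given in the CrowdPose paper [6]: for each person, the number of joints belonging to other people that fall inside that person's bounding box, divided by the number of the person's own joints in the box, averaged over all people in the image. The function name `crowd_index` and the dictionary layout are hypothetical.

```python
def crowd_index(people):
    """Average ratio of foreign joints to own joints per bounding box.

    people: list of dicts with
      "bbox":   (x_min, y_min, x_max, y_max)
      "joints": list of (x, y) annotated joint positions
    """
    def inside(point, box):
        x, y = point
        x0, y0, x1, y1 = box
        return x0 <= x <= x1 and y0 <= y <= y1

    ratios = []
    for i, person in enumerate(people):
        own = sum(inside(j, person["bbox"]) for j in person["joints"])
        if own == 0:
            continue  # no annotated joints inside this box, skip the person
        foreign = sum(
            inside(j, person["bbox"])
            for k, other in enumerate(people) if k != i
            for j in other["joints"]
        )
        ratios.append(foreign / own)
    return sum(ratios) / len(ratios) if ratios else 0.0
```

A value near 0 means bounding boxes contain almost only their own person's joints; larger values indicate heavier overlap between people.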

**Figure 3.** Our two architecture extensions: OccNet (top), OccNetCB (bottom). The Occlusion Detection Networks with an additional branch to detect occluded keypoints and OccNetCB with interconnections between the visible and occluded branch.
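The paper's method text states that the ground-truth heatmaps for visible keypoints are set to zero in the occluded branch and vice versa. A minimal sketch of how such training targets could be built is given below. The Gaussian target shape, the `sigma` value, and all function names are assumptions for illustration; the source does not specify them here.

```python
import math


def gaussian_heatmap(height, width, cx, cy, sigma=2.0):
    """2D Gaussian with peak value 1.0 at column cx, row cy."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def zero_heatmap(height, width):
    return [[0.0] * width for _ in range(height)]


def two_branch_targets(keypoints, height, width, sigma=2.0):
    """Build per-joint targets for a visible and an occluded branch.

    keypoints: list of (x, y, is_occluded), one entry per joint type.
    Each joint gets a Gaussian in exactly one branch and an all-zero
    map in the other.
    """
    visible, occluded = [], []
    for x, y, is_occluded in keypoints:
        peak = gaussian_heatmap(height, width, x, y, sigma)
        blank = zero_heatmap(height, width)
        if is_occluded:
            visible.append(blank)
            occluded.append(peak)
        else:
            visible.append(peak)
            occluded.append(blank)
    return visible, occluded
```

With targets split this way, the branch in which a peak appears at inference time doubles as the occlusion prediction for that joint.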

**Figure 4.** Two images showing crowded situations. Based on the CrowdIndex Fig. 4a belongs to the class of hard cases and Fig. 4b to the medium cases.
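The paper evaluates with the adjusted Object Keypoint Similarity (OKS) of CrowdPose, following [6]. The sketch below implements the standard COCO-style OKS, in which each labeled joint contributes exp(-d² / (2 · s² · k²)), with d the distance between prediction and ground truth, s² the object area, and k a per-joint constant, and the contributions are averaged over labeled joints. The CrowdPose-specific adjustment is not reproduced here, and the function signature is an assumption for illustration.

```python
import math


def oks(pred, gt, visibility, area, k):
    """COCO-style Object Keypoint Similarity between two poses.

    pred, gt:    lists of (x, y) joint positions in the same order
    visibility:  per-joint flags, values > 0 mean the joint is labeled
    area:        object scale squared (s**2), e.g. the person's box area
    k:           per-joint falloff constants
    """
    total, labeled = 0.0, 0
    for (px, py), (gx, gy), v, k_i in zip(pred, gt, visibility, k):
        if v <= 0:
            continue  # unlabeled joints do not count
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        total += math.exp(-d2 / (2.0 * area * k_i ** 2))
        labeled += 1
    return total / labeled if labeled else 0.0
```

AP is then obtained by thresholding OKS, in the same way that box detection thresholds on overlap, so a reported AP figure is only interpretable together with the OKS thresholds and per-joint constants used.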

**Figure 5.** 2d histograms of selected keypoint types of CrowdPose and JTA. It is evident that the keypoint distribution of CrowdPose is more spread out than the distributions of JTA. However, the distributions of JTA are multivariate, because the persons in JTA face the camera just as often as they do not, whereas the persons in CrowdPose are primarily facing the camera. This is also represented in the average poses f…

**Figure 6.** Visualized results of the OccNetCB. Detected occluded keypoints are marked in red and visible keypoints in green. The darker shade of keypoints denotes uncertain keypoints (i.e. with a likelihood lower than 0.7). The two images on the left show persons from JTA and the images on the right show selected images from CrowdPose [6].

**Figure 7.** Qualitative results. The left column reports results generated by our Baseline-50 method, the right column of our final model. The latter detects more keypoints and delivers legit estimates for occluded ones (see first and second row). Furthermore, at smaller scale more poses can be detected (third row). The crops have heights of 254px, 239px, and 97px for the top, middle and last row respectively.

Original abstract

Human pose estimation has recently made significant progress with the adoption of deep convolutional neural networks. Its many applications have attracted tremendous interest in recent years. However, many practical applications require pose estimation for human crowds, which still is a rarely addressed problem. In this work, we explore methods to optimize pose estimation for human crowds, focusing on challenges introduced with dense crowds, such as occlusions, people in close proximity to each other, and partial visibility of people. In order to address these challenges, we evaluate three aspects of a pose detection approach: i) a data augmentation method to introduce robustness to occlusions, ii) the explicit detection of occluded body parts, and iii) the use of the synthetic generated datasets. The first approach to improve the accuracy in crowded scenarios is to generate occlusions at training time using person and object cutouts from the object recognition dataset COCO (Common Objects in Context). Furthermore, the synthetically generated dataset JTA (Joint Track Auto) is evaluated for the use in real-world crowd applications. In order to overcome the transfer gap of JTA originating from a low pose variety and less dense crowds, an extension dataset is created to ease the use for real-world applications. Additionally, the occlusion flags provided with JTA are utilized to train a model, which explicitly distinguishes between occluded and visible body parts in two distinct branches. The combination of the proposed additions to the baseline method help to improve the overall accuracy by 4.7% AP and thereby provide comparable results to current state-of-the-art approaches on the respective dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

A modest 4.7% AP gain on crowded pose estimation from COCO cutouts, an occlusion branch, and an extended JTA dataset, but no checks that the synthetic distributions match real crowds.

This paper takes a standard pose estimator and layers on three targeted changes for dense scenes: pasting person and object cutouts from COCO to create occlusions at training time, adding a separate branch that predicts occlusion flags using JTA labels, and extending the JTA synthetic dataset to increase pose variety and crowd density. The result is a 4.7% AP lift that brings performance in line with current top numbers on their test set.

The approach is straightforward and directly attacks the stated problems of occlusion and proximity in crowds. Using existing COCO data for augmentation keeps the method practical, and training the visibility branch explicitly is a logical split of the signal. Extending JTA to close the transfer gap is also a sensible move rather than just hoping a general model scales.

The main gap is the missing validation that the generated training data actually reproduces the occlusion rates and inter-person distances found in real crowded test images. Without those statistics or overlap histograms, the measured improvement could be tied to artifacts in how the synthetic scenes were built. The abstract also skips the usual experimental details on baseline code, run-to-run variance, or exact metric definitions, which leaves the 4.7% figure harder to assess.

This is the sort of paper that matters to people building surveillance or robotics systems that must handle busy public spaces. It will not move the core of pose estimation research, but the concrete fixes are worth knowing if you already work with similar constraints.

I would send it to peer review. The problem is clearly scoped, the methods are transparent, and the reported gain is large enough to justify referee time even if more evidence on data realism is required.

Referee Report

2 major / 1 minor

Summary. The paper explores improvements to human pose estimation in crowded real-world scenarios by evaluating three modifications to a baseline: (i) occlusion augmentation via person/object cutouts from COCO, (ii) a two-branch architecture that explicitly detects occluded vs. visible body parts, and (iii) training on an extended version of the synthetic JTA dataset created to increase pose variety and crowd density. The authors report that the combination yields a 4.7% AP gain and reaches performance comparable to current state-of-the-art methods on the evaluated dataset.

Significance. If the reported gains are shown to be robust and the synthetic data distributions are validated to match real-world test conditions, the work would provide a practical set of techniques for handling occlusions and proximity in crowd pose estimation, which remains an important but under-addressed application area.

major comments (2)

[Abstract, §4] Abstract and §4 (dataset extension description): the central 4.7% AP claim and SOTA-comparable result rest on the assumption that the JTA extension plus COCO cutout augmentation produce occlusion statistics and crowd densities that match real-world test distributions. No quantitative validation (e.g., histograms of inter-person distances, occlusion fractions, or pose entropy) is described; without it the measured improvement could be an artifact of the synthetic distribution rather than a general solution.
[§3] §3 (experimental setup): the abstract reports a 4.7% AP improvement but provides no details on the baseline architecture, exact evaluation metric (AP@? IoU), dataset splits, statistical significance, or number of runs. These omissions make it impossible to assess whether the gain is load-bearing for the three proposed modifications or could be explained by confounds.
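The distribution check requested in the first major comment could take a form like the sketch below: bin a per-image statistic (for example an occlusion fraction or an inter-person distance) for the training and test sets and measure how much the two normalized histograms overlap. Histogram intersection is one possible choice of overlap measure, picked here for simplicity; the function names and binning parameters are assumptions, and nothing here comes from the paper itself.

```python
def normalized_histogram(values, bins, lo, hi):
    """Histogram of `values` over [lo, hi] with `bins` equal-width bins,
    normalized to sum to 1. Values outside the range are ignored."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        if lo <= v <= hi:
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * bins


def histogram_intersection(p, q):
    """Overlap of two normalized histograms: 1.0 if identical,
    0.0 if they share no occupied bin."""
    return sum(min(a, b) for a, b in zip(p, q))
```

A score close to 1.0 between synthetic training data and real test data would support the premise that the two distributions match; a low score would point to the transfer gap the referee is worried about.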

minor comments (1)

[Abstract] The phrase 'the respective dataset' in the abstract is ambiguous; the manuscript should explicitly name the test set(s) used for the final comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly where needed.

Referee: [Abstract, §4] Abstract and §4 (dataset extension description): the central 4.7% AP claim and SOTA-comparable result rest on the assumption that the JTA extension plus COCO cutout augmentation produce occlusion statistics and crowd densities that match real-world test distributions. No quantitative validation (e.g., histograms of inter-person distances, occlusion fractions, or pose entropy) is described; without it the measured improvement could be an artifact of the synthetic distribution rather than a general solution.

Authors: We agree that the manuscript would be strengthened by explicit quantitative validation of the distribution match. Although gains are measured on real-world test data, no histograms or entropy comparisons between the extended JTA/COCO augmentations and the test set are currently provided. In revision we will add these analyses (inter-person distance histograms, occlusion fraction distributions, and pose entropy) to support the claim that the training distributions are representative. revision: yes
Referee: [§3] §3 (experimental setup): the abstract reports a 4.7% AP improvement but provides no details on the baseline architecture, exact evaluation metric (AP@? IoU), dataset splits, statistical significance, or number of runs. These omissions make it impossible to assess whether the gain is load-bearing for the three proposed modifications or could be explained by confounds.

Authors: We acknowledge these details are missing from the abstract and experimental setup section. The revised manuscript will explicitly describe the baseline architecture, state the precise AP metric including the IoU threshold, document the dataset splits, and report results with standard deviations across multiple runs to demonstrate statistical significance and rule out confounds. revision: yes

Circularity Check

0 steps flagged

No significant circularity; purely empirical evaluation of modifications

The paper reports measured AP gains from three empirical modifications (COCO-based occlusion augmentation, JTA extension for density/pose variety, and explicit occlusion branches) evaluated on the respective dataset. No derivation chain, first-principles predictions, fitted parameters renamed as outputs, or self-citation load-bearing uniqueness theorems appear. The 4.7% improvement is a direct experimental outcome, not a quantity forced by construction from the inputs. The dataset-simulation assumption is a validity concern, not a circularity reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim depends on the transferability of the proposed training strategies from synthetic and augmented data to real crowded scenes, with no free parameters explicitly fitted beyond standard model training.

axioms (1)

domain assumption: Convolutional neural networks trained on augmented data can generalize to occluded poses in crowds.
Core assumption underlying the data augmentation and training approach.

pith-pipeline@v0.9.0 · 5814 in / 1331 out tokens · 31984 ms · 2026-05-24T21:09:15.606819+00:00 · methodology

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 5 internal anchors

[1] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2D Human Pose Estimation: New Benchmark and State of the Art Analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.

[2] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun. Cascaded pyramid network for multi-person pose estimation. CoRR, abs/1711.07319, 2017.

[3] M. Fabbri, F. Lanzi, S. Calderara, A. Palazzi, R. Vezzani, and R. Cucchiara. Learning to detect and track visible and occluded body joints in a virtual world. In European Conference on Computer Vision (ECCV), 2018.

[4] H.-S. Fang, S. Xie, Y.-W. Tai, and C. Lu. RMPE: Regional multi-person pose estimation. In ICCV, 2017.

[5] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, June 2016.

[6] J. Li, C. Wang, H. Zhu, Y. Mao, H.-S. Fang, and C. Lu. Crowdpose: Efficient crowded scenes pose estimation and a new benchmark. arXiv preprint arXiv:1812.00324, 2018.

[7] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: common objects in context. CoRR, abs/1405.0312, 2014.

[8] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. CoRR, abs/1603.06937.

[9] J. Redmon and A. Farhadi. Yolov3: An incremental improvement. CoRR, abs/1804.02767, 2018.

[10] I. Sárándi, T. Linder, K. O. Arras, and B. Leibe. How robust is 3d human pose estimation to occlusion? In IROS Workshop - Robotic Co-workers 4.0, 2018.

[11] B. Xiao, H. Wu, and Y. Wei. Simple baselines for human pose estimation and tracking. In European Conference on Computer Vision (ECCV), 2018.

[12] Y. Xiu, J. Li, H. Wang, Y. Fang, and C. Lu. Pose Flow: Efficient online pose tracking. In BMVC, 2018.
