pith. machine review for the scientific record. sign in

arxiv: 2604.09715 · v1 · submitted 2026-04-08 · 💻 cs.CV · eess.IV

Recognition: no theorem link

MuPPet: Multi-person 2D-to-3D Pose Lifting

Authors on Pith no claims yet

Pith reviewed 2026-05-10 19:14 UTC · model grok-4.3

classification 💻 cs.CV eess.IV
keywords multi-person pose estimation2D-to-3D liftinginter-person correlationsocclusion handlingattention mechanismsgroup interactionshuman pose estimation
0
0 comments X

The pith

Explicitly modeling relationships between people allows more accurate lifting of their 2D poses into 3D, especially when some are occluded.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that 2D-to-3D pose lifting works better in groups when the model accounts for how the individuals relate to one another rather than treating each person in isolation. Existing approaches either ignore these links or cannot scale to different numbers of people in a scene. MuPPet adds three targeted pieces: an encoding step that organizes each person's features, a training step that permutes people to create more varied examples, and an attention step that lets the model weigh connections between people dynamically. Tests on datasets showing group interactions confirm higher accuracy than prior single-person and multi-person methods, with particular gains when bodies block one another.

Core claim

MuPPet is a multi-person 2D-to-3D pose lifting framework that explicitly models inter-person correlations. It does so through Person Encoding to structure individual representations, Permutation Augmentation to enhance training diversity, and Dynamic Multi-Person Attention to adaptively model correlations between individuals. Extensive experiments on group interaction datasets show that this approach significantly outperforms state-of-the-art single- and multi-person 2D-to-3D pose lifting methods while improving robustness in occlusion scenarios.

What carries the argument

Dynamic Multi-Person Attention that adaptively models correlations between individuals, supported by Person Encoding to structure representations and Permutation Augmentation to increase training variety.

If this is right

  • The model can process scenes with any number of people without requiring a fixed group size.
  • Accuracy holds up better when one or more individuals are partially hidden from the camera.
  • The resulting 3D poses carry richer information about social spatial arrangements.
  • Downstream tasks that rely on group pose, such as interaction analysis, receive more reliable input.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the same attention structure to video frames could enforce consistency across time without major redesign.
  • The emphasis on group context may transfer to related problems like multi-person tracking or collective activity recognition.
  • Real-world deployment would benefit from checking performance on crowds larger than those in current benchmark datasets.

Load-bearing premise

That adding explicit modeling of inter-person relationships through encoding, permutation, and attention will consistently raise 3D accuracy and handle changing group sizes and occlusions better than single-person methods.

What would settle it

A new test set containing large groups with frequent mutual occlusions on which MuPPet fails to exceed the accuracy of the strongest single-person lifting baseline.

Figures

Figures reproduced from arXiv: 2604.09715 by Jan van Gemert, Jouh Yeong Chew, Thomas Markhorst, Xucong Zhang, Zhi-Yi Lin.

Figure 1
Figure 1. Figure 1: We exploit social inter-person correlations for 3D pose [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of our MuPPet pipeline. Given a sequence of detected 2D human pose joints from multiple persons [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Example of permutation learning for a five person scene. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: We show the performance of different levels of occlusion on the Haggling dataset, in comparison with the adapted D3DP [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative results on an in-the-wild setting predicted by [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
read the original abstract

Multi-person social interactions are inherently built on coherence and relationships among all individuals within the group, making multi-person localization and body pose estimation essential to understanding these social dynamics. One promising approach is 2D-to-3D pose lifting which provides a 3D human pose consisting of rich spatial details by building on the significant advances in 2D pose estimation. However, the existing 2D-to-3D pose lifting methods often neglect inter-person relationships or cannot handle varying group sizes, limiting their effectiveness in multi-person settings. We propose MuPPet, a novel multi-person 2D-to-3D pose lifting framework that explicitly models inter-person correlations. To leverage these inter-person dependencies, our approach introduces Person Encoding to structure individual representations, Permutation Augmentation to enhance training diversity, and Dynamic Multi-Person Attention to adaptively model correlations between individuals. Extensive experiments on group interaction datasets demonstrate MuPPet significantly outperforms state-of-the-art single- and multi-person 2D-to-3D pose lifting methods, and improves robustness in occlusion scenarios. Our findings highlight the importance of modeling inter-person correlations, paving the way for accurate and socially-aware 3D pose estimation. Our code is available at: https://github.com/Thomas-Markhorst/MuPPet

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes MuPPet, a multi-person 2D-to-3D pose lifting framework that explicitly models inter-person correlations via three introduced components: Person Encoding to structure individual representations, Permutation Augmentation to enhance training diversity, and Dynamic Multi-Person Attention to adaptively model correlations between individuals. It evaluates the approach on group interaction datasets and claims significant outperformance over state-of-the-art single- and multi-person 2D-to-3D lifting methods along with improved robustness under occlusion.

Significance. If the empirical results hold with proper validation, the work would be significant for the field by shifting 2D-to-3D lifting from single-person assumptions to explicit inter-person modeling, which is relevant for social scene understanding and related applications. The release of code at the provided GitHub link supports reproducibility.

major comments (2)
  1. [Abstract and Experiments] Abstract and Experiments section: The central claim that MuPPet 'significantly outperforms' SOTA methods and improves occlusion robustness is asserted without any quantitative metrics (e.g., MPJPE or PCK), ablation results isolating the contribution of each of the three components, dataset details, or error analysis. This is load-bearing for the empirical claim and prevents verification that the data supports the assertions.
  2. [Method] Method section (Dynamic Multi-Person Attention): The description of how the attention mechanism handles varying group sizes (e.g., via masking, padding, or fixed-size assumptions) is not specified in sufficient detail to assess whether it generalizes as claimed beyond the tested datasets.
minor comments (2)
  1. [Abstract] The abstract would benefit from including one or two key quantitative results and the names of the group interaction datasets used to make the contribution clearer at a glance.
  2. [Figures and Tables] Ensure that all figures (e.g., architecture diagrams) include clear labels for the three proposed components and that tables compare against both single-person and multi-person baselines with consistent metrics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our empirical results and methodological details. We address each point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract and Experiments] Abstract and Experiments section: The central claim that MuPPet 'significantly outperforms' SOTA methods and improves occlusion robustness is asserted without any quantitative metrics (e.g., MPJPE or PCK), ablation results isolating the contribution of each of the three components, dataset details, or error analysis. This is load-bearing for the empirical claim and prevents verification that the data supports the assertions.

    Authors: We agree that the abstract states the performance claim at a high level without numbers (standard for abstracts) and that the Experiments section would benefit from greater explicitness. The current version includes MPJPE results on group interaction datasets and some comparisons, but we acknowledge the absence of dedicated ablations isolating Person Encoding, Permutation Augmentation, and Dynamic Multi-Person Attention, as well as limited dataset statistics and occlusion-specific error analysis. In the revision we will add: (1) a summary table of key MPJPE/PCK numbers in the Experiments section, (2) an ablation study subsection quantifying each component's contribution, (3) expanded dataset descriptions (including group size distributions and occlusion statistics), and (4) error analysis focused on occlusion robustness. These changes will make the supporting evidence fully verifiable. revision: yes

  2. Referee: [Method] Method section (Dynamic Multi-Person Attention): The description of how the attention mechanism handles varying group sizes (e.g., via masking, padding, or fixed-size assumptions) is not specified in sufficient detail to assess whether it generalizes as claimed beyond the tested datasets.

    Authors: We thank the referee for highlighting this gap in clarity. The Dynamic Multi-Person Attention handles variable group sizes by padding sequences to a fixed maximum length and applying a binary mask that excludes padded tokens from the attention computation (preventing any fixed-size assumption on the actual number of people). We will revise the Method section to include a precise description of the padding and masking procedure, the corresponding equations, and a short pseudocode snippet. This will explicitly demonstrate generalization to arbitrary group sizes within the tested range and beyond. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper introduces MuPPet as a new framework that explicitly adds three components—Person Encoding, Permutation Augmentation, and Dynamic Multi-Person Attention—to model inter-person correlations on top of existing 2D pose estimators. These additions are presented as novel remedies to limitations of prior single-person lifting methods, with performance gains shown via experiments on group datasets. No derivation step reduces by construction to a fitted parameter, self-definition, or a load-bearing self-citation chain; the central claims rest on the independent design of the new modules and external empirical validation rather than renaming or re-deriving inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is limited; the approach rests on the domain assumption that inter-person correlations are load-bearing for accuracy and that the three introduced modules can capture them effectively.

free parameters (1)
  • Neural network hyperparameters
    Training of attention-based models typically involves multiple hyperparameters chosen to fit the group interaction datasets.
axioms (1)
  • domain assumption Inter-person relationships improve 3D pose lifting accuracy in group settings
    The framework is built around this premise to justify the new components.

pith-pipeline@v0.9.0 · 5544 in / 1401 out tokens · 39014 ms · 2026-05-10T19:14:12.856627+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

77 extracted references · 12 canonical work pages · 4 internal anchors

  1. [1]

    In5th IEEE-RAS International Conference on Humanoid Robots, 2005., pages 418–423

    Towards a humanoid museum guide robot that interacts with multiple persons. In5th IEEE-RAS International Conference on Humanoid Robots, 2005., pages 418–423. IEEE, 2005. 1

  2. [2]

    Analyzing free-standing conversational groups: A multimodal approach

    Xavier Alameda-Pineda, Yan Yan, Elisa Ricci, Oswald Lanz, and Nicu Sebe. Analyzing free-standing conversational groups: A multimodal approach. InProceedings of the 23rd ACM international conference on Multimedia, pages 5–14,

  3. [3]

    Blended diffusion for text-driven editing of natural images

    Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images.CoRR, abs/2111.14818, 2021. 3

  4. [4]

    Bodily be- haviors in social interaction: Novel annotations and state-of- the-art evaluation

    Michal Balazia, Philipp M ¨uller, ´Akos Levente T´anczos, Au- gust von Liechtenstein, and Francois Bremond. Bodily be- haviors in social interaction: Novel annotations and state-of- the-art evaluation. InProceedings of the 30th ACM Interna- tional Conference on Multimedia, pages 70–79, 2022. 1

  5. [5]

    Conditional image generation with score-based diffusion models.arXiv preprint arXiv:2111.13606, 2021

    Georgios Batzolis, Jan Stanczuk, Carola-Bibiane Sch ¨onlieb, and Christian Etmann. Conditional image generation with score-based diffusion models.CoRR, abs/2111.13606, 2021. 3

  6. [6]

    Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y . A. Sheikh. Openpose: Realtime multi-person 2d pose estima- tion using part affinity fields.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019. 5

  7. [7]

    Ab- sposelifter: Absolute 3d human pose lifting network from a single noisy 2d human pose.CoRR, abs/1910.12029, 2019

    Ju Yong Chang, Gyeongsik Moon, and Kyoung Mu Lee. Ab- sposelifter: Absolute 3d human pose lifting network from a single noisy 2d human pose.CoRR, abs/1910.12029, 2019. 2

  8. [8]

    3d human pose es- timation= 2d pose estimation+ matching

    Ching-Hang Chen and Deva Ramanan. 3d human pose es- timation= 2d pose estimation+ matching. InProceedings of the IEEE conference on computer vision and pattern recog- nition, pages 7035–7043, 2017. 2

  9. [9]

    3dcrowdnet: 2d human pose-guided3d crowd human pose and shape estimation in the wild.CoRR, abs/2104.07300, 2021

    Hongsuk Choi, Gyeongsik Moon, JoonKyu Park, and Ky- oung Mu Lee. 3dcrowdnet: 2d human pose-guided3d crowd human pose and shape estimation in the wild.CoRR, abs/2104.07300, 2021. 2

  10. [10]

    Jeongjun Choi, Dongseok Shim, and H. Jin Kim. DiffuPose: Monocular 3d human pose estimation via denoising diffu- sion probabilistic model. In2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3773–3780. ISSN: 2153-0866. 2, 3

  11. [11]

    Op- timizing network structure for 3d human pose estimation

    Hai Ci, Chunyu Wang, Xiaoxuan Ma, and Yizhou Wang. Op- timizing network structure for 3d human pose estimation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2262–2271, 2019. 2

  12. [12]

    Gfpose: Learning 3d human pose prior with gradient fields, 2022

    Hai Ci, Mingdong Wu, Wentao Zhu, Xiaoxuan Ma, Hao Dong, Fangwei Zhong, and Yizhou Wang. Gfpose: Learning 3d human pose prior with gradient fields, 2022. 2

  13. [13]

    See- ing is believing: body motion dominates in multisensory conversations.ACM Transactions on Graphics (TOG), 29 (4):1–9, 2010

    Cathy Ennis, Rachel McDonnell, and Carol O’Sullivan. See- ing is believing: body motion dominates in multisensory conversations.ACM Transactions on Graphics (TOG), 29 (4):1–9, 2010. 1

  14. [14]

    Conducting interaction: Patterns of behavior in focused encounters

    Susan Fiksdal. Conducting interaction: Patterns of behavior in focused encounters. adam kendon. cambridge: Cambridge university press, 1990. pp. vii+ 292. 16.95 paper.Studies in Second Language Acquisition, 15(1):116–117, 1993. 8

  15. [15]

    Poseaug: A differentiable pose augmentation framework for 3d hu- man pose estimation

    Kehong Gong, Jianfeng Zhang, and Jiashi Feng. Poseaug: A differentiable pose augmentation framework for 3d hu- man pose estimation. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 8575–8584, 2021. 2

  16. [16]

    Nonverbal communication.Annual review of psychology, 70 (2019):271–294, 2019

    Judith A Hall, Terrence G Horgan, and Nora A Murphy. Nonverbal communication.Annual review of psychology, 70 (2019):271–294, 2019. 1

  17. [17]

    Denoising Diffusion Probabilistic Models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models.CoRR, abs/2006.11239, 2020. 3, 4

  18. [18]

    Cascaded diffusion models for high fidelity image generation

    Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded dif- fusion models for high fidelity image generation.CoRR, abs/2106.15282, 2021. 3

  19. [19]

    DiffPose: Multi- hypothesis human pose estimation using diffusion models

    Karl Holmquist and Bastian Wandt. DiffPose: Multi- hypothesis human pose estimation using diffusion models. In2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 15931–15941. IEEE. 3, 4

  20. [20]

    Exploiting tem- poral information for 3d human pose estimation

    Mir Rayat Imtiaz Hossain and James J Little. Exploiting tem- poral information for 3d human pose estimation. InProceed- ings of the European conference on computer vision (ECCV), pages 68–84, 2018. 2

  21. [21]

    Prodiff: Progressive fast diffusion model for high-quality text-to-speech, 2022

    Rongjie Huang, Zhou Zhao, Huadai Liu, Jinglin Liu, Chenye Cui, and Yi Ren. Prodiff: Progressive fast diffusion model for high-quality text-to-speech, 2022. 3

  22. [22]

    Back to optimization: Diffusion-based zero-shot 3d human pose estimation

    Zhongyu Jiang, Zhuoran Zhou, Lei Li, Wenhao Chai, Cheng- Yen Yang, and Jenq-Neng Hwang. Back to optimization: Diffusion-based zero-shot 3d human pose estimation. In 2024 IEEE/CVF Winter Conference on Applications of Com- puter Vision (WACV), pages 6130–6140. IEEE. 2, 3

  23. [23]

    Towards social artificial intelligence: Nonverbal social signal prediction in a triadic interaction

    Hanbyul Joo, Tomas Simon, Mina Cikara, and Yaser Sheikh. Towards social artificial intelligence: Nonverbal social signal prediction in a triadic interaction. 5

  24. [24]

    Nabbe, Iain A

    Hanbyul Joo, Tomas Simon, Xulong Li, Hao Liu, Lei Tan, Lin Gui, Sean Banerjee, Timothy Godisart, Bart C. Nabbe, Iain A. Matthews, Takeo Kanade, Shohei Nobuhara, and Yaser Sheikh. Panoptic studio: A massively multiview sys- tem for social interaction capture.CoRR, abs/1612.03153,

  25. [25]

    Harold W. Kuhn. The Hungarian Method for the Assignment Problem.Naval Research Logistics Quarterly, 2(1–2):83– 97, 1955. 8

  26. [26]

    Talking with hands 16.2 m: A large-scale dataset of synchronized body- finger motion and audio for conversational motion analy- sis and synthesis

    Gilwoo Lee, Zhiwei Deng, Shugao Ma, Takaaki Shiratori, Siddhartha S Srinivasa, and Yaser Sheikh. Talking with hands 16.2 m: A large-scale dataset of synchronized body- finger motion and audio for conversational motion analy- sis and synthesis. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 763–772,

  27. [27]

    Hierarchical graph networks for 3d human pose esti- mation.CoRR, abs/2111.11927, 2021

    Han Li, Bowen Shi, Wenrui Dai, Yabo Chen, Botao Wang, Yu Sun, Min Guo, Chenglin Li, Junni Zou, and Hongkai Xiong. Hierarchical graph networks for 3d human pose esti- mation.CoRR, abs/2111.11927, 2021. 2

  28. [28]

    MHFormer: Multi-hypothesis transformer for 3d human pose estimation

    Wenhao Li, Hong Liu, Hao Tang, Pichao Wang, and Luc Van Gool. MHFormer: Multi-hypothesis transformer for 3d human pose estimation. pages 13147–13156. 2

  29. [29]

    Hourglass tokenizer for efficient transformer-based 3d human pose estimation

    Wenhao Li, Mengyuan Liu, Hong Liu, Pichao Wang, Jialun Cai, and Nicu Sebe. Hourglass tokenizer for efficient transformer-based 3d human pose estimation. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 604–613, 2024. 2

  30. [30]

    Attention mechanism exploits tem- poral contexts: Real-time 3d human pose reconstruction

    Ruixu Liu, Ju Shen, He Wang, Chen Chen, Sen-ching Che- ung, and Vijayan Asari. Attention mechanism exploits tem- poral contexts: Real-time 3d human pose reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), 2020. 2

  31. [31]

    Matthew Loper, Naureen Mahmood, Javier Romero, Ger- ard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model.ACM Trans. Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, 2015. 2

  32. [32]

    Decoupled weight de- cay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight de- cay regularization. InInternational Conference on Learning Representations, 2019. 5

  33. [33]

    Body communicative cue extraction for conversational analysis

    Alvaro Marcos-Ramiro, Daniel Pizarro-Perez, Marta Marron-Romera, Laurent Nguyen, and Daniel Gatica-Perez. Body communicative cue extraction for conversational analysis. In2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), pages 1–8. IEEE, 2013. 1

  34. [34]

    A simple yet effective baseline for 3d human pose esti- mation

    Julieta Martinez, Rayat Hossain, Javier Romero, and James J Little. A simple yet effective baseline for 3d human pose esti- mation. InProceedings of the IEEE international conference on computer vision, pages 2640–2649, 2017. 2

  35. [35]

    Motion- AGFormer: Enhancing 3d human pose estimation with a transformer-GCNFormer network

    Soroush Mehraban, Vida Adeli, and Babak Taati. Motion- AGFormer: Enhancing 3d human pose estimation with a transformer-GCNFormer network. In2024 IEEE/CVF Win- ter Conference on Applications of Computer Vision (WACV), pages 6905–6915. IEEE. 2

  36. [36]

    Mo- tionagformer: Enhancing 3d human pose estimation with a transformer-gcnformer network

    Soroush Mehraban, Vida Adeli, and Babak Taati. Mo- tionagformer: Enhancing 3d human pose estimation with a transformer-gcnformer network. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 6920–6930, 2024. 1, 2

  37. [37]

    Single-shot multi-person 3d pose estimation from monocular RGB

    Dushyant Mehta, Oleksandr Sotnychenko, Franziska Mueller, Weipeng Xu, Srinath Sridhar, Gerard Pons-Moll, and Christian Theobalt. Single-shot multi-person 3d pose estimation from monocular RGB. 5, 8

  38. [38]

    Vnect: Real-time 3d human pose estimation with a single rgb cam- era.Acm transactions on graphics (tog), 36(4):1–14, 2017

    Dushyant Mehta, Srinath Sridhar, Oleksandr Sotnychenko, Helge Rhodin, Mohammad Shafiei, Hans-Peter Seidel, Weipeng Xu, Dan Casas, and Christian Theobalt. Vnect: Real-time 3d human pose estimation with a single rgb cam- era.Acm transactions on graphics (tog), 36(4):1–14, 2017. 2

  39. [39]

    A review of mo- tion analysis methods for human nonverbal communication computing.Image and Vision Computing, 31(6-7):421–433,

    Dimitris Metaxas and Shaoting Zhang. A review of mo- tion analysis methods for human nonverbal communication computing.Image and Vision Computing, 31(6-7):421–433,

  40. [40]

    Camera distance-aware top-down approach for 3d multi- person pose estimation from a single RGB image

    Gyeongsik Moon, Ju Yong Chang, and Kyoung Mu Lee. Camera distance-aware top-down approach for 3d multi- person pose estimation from a single RGB image. 2

  41. [41]

    The progress of human pose estimation: A survey and taxonomy of models applied in 2d human pose estimation.Ieee Access, 8:133330–133348, 2020

    Tewodros Legesse Munea, Yalew Zelalem Jembre, Hale- fom Tekle Weldegebriel, Longbiao Chen, Chenxi Huang, and Chenhui Yang. The progress of human pose estimation: A survey and taxonomy of models applied in 2d human pose estimation.Ieee Access, 8:133330–133348, 2020. 1

  42. [42]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: towards photorealistic image genera- tion and editing with text-guided diffusion models.CoRR, abs/2112.10741, 2021. 3

  43. [43]

    Towards robust and smooth 3d multi-person pose estimation from monocular videos in the wild

    Sungchan Park, Eunyi You, Inhoe Lee, and Joonseok Lee. Towards robust and smooth 3d multi-person pose estimation from monocular videos in the wild. In2023 IEEE/CVF In- ternational Conference on Computer Vision (ICCV), pages 14726–14736. IEEE. 1, 2, 4, 5, 6

  44. [44]

    3d human pose estimation in video with tem- poral convolutions and semi-supervised training, 2019

    Dario Pavllo, Christoph Feichtenhofer, David Grangier, and Michael Auli. 3d human pose estimation in video with tem- poral convolutions and semi-supervised training, 2019. 2

  45. [45]

    Ktpformer: Kinematics and trajectory prior knowledge-enhanced trans- former for 3d human pose estimation

    Jihua Peng, Yanghong Zhou, and PY Mok. Ktpformer: Kinematics and trajectory prior knowledge-enhanced trans- former for 3d human pose estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1123–1132, 2024. 2

  46. [46]

    A dual- augmentor framework for domain generalization in 3d hu- man pose estimation

    Qucheng Peng, Ce Zheng, and Chen Chen. A dual- augmentor framework for domain generalization in 3d hu- man pose estimation. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 2240–2249, 2024. 1, 2

  47. [47]

    HSTFormer: Hierarchical spatial-temporal transformers for 3d human pose estimation

    Xiaoye Qian, Youbao Tang, Ning Zhang, Mei Han, Jing Xiao, Ming-Chun Huang, and Ruei-Sung Lin. HSTFormer: Hierarchical spatial-temporal transformers for 3d human pose estimation. 2

  48. [48]

    High-resolution image syn- thesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨orn Ommer. High-resolution image syn- thesis with latent diffusion models. 3

  49. [49]

    3d human pose estimation: A review of the literature and analysis of covariates.Computer Vision and Image Understanding, 152:1–20, 2016

    Nikolaos Sarafianos, Bogdan Boteanu, Bogdan Ionescu, and Ioannis A Kakadiaris. 3d human pose estimation: A review of the literature and analysis of covariates.Computer Vision and Image Understanding, 152:1–20, 2016. 2

  50. [50]

    Diffusion-based 3d human pose estimation with multi- hypothesis aggregation

    Wenkang Shan, Zhenhua Liu, Xinfeng Zhang, Zhao Wang, Kai Han, Shanshe Wang, Siwei Ma, and Wen Gao. Diffusion-based 3d human pose estimation with multi- hypothesis aggregation. In2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 14715– 14725. IEEE. 2, 3, 4, 5, 6, 8

  51. [51]

    Deep Unsupervised Learning using Nonequilibrium Thermodynamics

    Jascha Sohl-Dickstein, Eric A. Weiss, Niru Mah- eswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics.CoRR, abs/1503.03585, 2015. 3

  52. [52]

    Human body model fitting by learned gradient descent

    Jie Song, Xu Chen, and Otmar Hilliges. Human body model fitting by learned gradient descent. InEuropean Conference on Computer Vision, pages 744–760. Springer, 2020. 2

  53. [53]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denois- ing diffusion implicit models.CoRR, abs/2010.02502, 2020. 4

  54. [54]

    Body movement mirroring and synchrony in human–robot interac- tion.J

    Darja Stoeva, Andreas Kriegler, and Margrit Gelautz. Body movement mirroring and synchrony in human–robot interac- tion.J. Hum.-Robot Interact., 13(4), 2024. 1

  55. [55]

    VirtualPose: Learning generalizable 3d hu- man pose models from virtual data

    Jiajun Su, Chunyu Wang, Xiaoxuan Ma, Wenjun Zeng, and Yizhou Wang. VirtualPose: Learning generalizable 3d hu- man pose models from virtual data. 1, 2, 5, 6

  56. [56]

    Yu Sun, Wu Liu, Qian Bao, Yili Fu, Tao Mei, and Michael J. Black. Putting people in their place: Monocular regression of 3d people in depth. In2022 IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 13233– 13242. IEEE. 2, 5

  57. [57]

    Monocular, one-stage, regression of multiple 3d people,

    Yu Sun, Qian Bao, Wu Liu, Yili Fu, and Tao Mei. Centerhmr: a bottom-up single-shot method for multi-person 3d mesh recovery from a single image.CoRR, abs/2008.12272, 2020. 1, 2

  58. [58]

    3d human pose estimation with spatio- temporal criss-cross attention

    Zhenhua Tang, Zhaofan Qiu, Yanbin Hao, Richang Hong, and Ting Yao. 3d human pose estimation with spatio- temporal criss-cross attention. In2023 IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), pages 4790–4799. IEEE. 2

  59. [59]

    Social diffusion: Long-term multiple hu- man motion anticipation

    Julian Tanke, Linguang Zhang, Amy Zhao, Chengcheng Tang, Yujun Cai, Lezi Wang, Po-Chen Wu, Juergen Gall, and Cem Keskin. Social diffusion: Long-term multiple hu- man motion anticipation. In2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9567–9577. IEEE. 5

  60. [60]

    Deeppose: Human pose estimation via deep neural networks

    Alexander Toshev and Christian Szegedy. Deeppose: Human pose estimation via deep neural networks. InProceedings of the IEEE conference on computer vision and pattern recog- nition, pages 1653–1660, 2014. 1

  61. [61]

    Joint estimation of human pose and conversa- tional groups from social scenes.International Journal of Computer Vision, 126(2):410–429, 2018

    Jagannadan Varadarajan, Ramanathan Subramanian, Samuel Rota Bul `o, Narendra Ahuja, Oswald Lanz, and Elisa Ricci. Joint estimation of human pose and conversa- tional groups from social scenes.International Journal of Computer Vision, 126(2):410–429, 2018. 1, 8

  62. [62]

    Black, Bodo Rosenhahn, and Gerard Pons-Moll

    Timo von Marcard, Roberto Henschel, Michael J. Black, Bodo Rosenhahn, and Gerard Pons-Moll. Recovering ac- curate 3d human pose in the wild using imus and a mov- ing camera. InProceedings of the European Conference on Computer Vision (ECCV), 2018. 5

  63. [63]

    Deep 3d human pose estimation: A review.Computer Vision and Image Under- standing, 210:103225, 2021

    Jinbao Wang, Shujie Tan, Xiantong Zhen, Shuo Xu, Feng Zheng, Zhenyu He, and Ling Shao. Deep 3d human pose estimation: A review.Computer Vision and Image Under- standing, 210:103225, 2021. 1, 2

  64. [64]

    Finepose: Fine- grained prompt-driven 3d human pose estimation via diffu- sion models

    Jinglin Xu, Yijie Guo, and Yuxin Peng. Finepose: Fine- grained prompt-driven 3d human pose estimation via diffu- sion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 561–570,

  65. [65]

    Simpoe: Simulated character control for 3d hu- man pose estimation

    Ye Yuan, Shih-En Wei, Tomas Simon, Kris Kitani, and Ja- son Saragih. Simpoe: Simulated character control for 3d hu- man pose estimation. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 7159–7169, 2021. 2

  66. [66]

    Deep network for the in- tegrated 3d sensing of multiple people in natural images

    Andrei Zanfir, Elisabeta Marinoiu, Mihai Zanfir, Alin-Ionut Popa, and Cristian Sminchisescu. Deep network for the in- tegrated 3d sensing of multiple people in natural images. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2018. 6

  67. [67]

    Fast human pose estimation

    Feng Zhang, Xiatian Zhu, and Mao Ye. Fast human pose estimation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3517–3526,

  68. [68]

    MixSTE: Seq2seq mixed spatio-temporal en- coder for 3d human pose estimation in video

    Jinlu Zhang, Zhigang Tu, Jianyu Yang, Yujin Chen, and Jun- song Yuan. MixSTE: Seq2seq mixed spatio-temporal en- coder for 3d human pose estimation in video. 2, 3, 5

  69. [69]

    Multi-view emo- tional expressions dataset using 2d pose estimation.Scien- tific Data, 10:649, 2023

    Mingming Zhang, Yanan Zhou, Xinye Xu, Zhiwei Ren, Yi- han Zhang, Shenglan Liu, and Wenbo Luo. Multi-view emo- tional expressions dataset using 2d pose estimation.Scien- tific Data, 10:649, 2023. 1

  70. [70]

    Dynamic inertial poser (dynaip): Part- based motion dynamics learning for enhanced human pose estimation with sparse inertial sensors

    Yu Zhang, Songpengcheng Xia, Lei Chu, Jiarui Yang, Qi Wu, and Ling Pei. Dynamic inertial poser (dynaip): Part- based motion dynamics learning for enhanced human pose estimation with sparse inertial sensors. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1889–1899, 2024. 2

  71. [71]

    Poseformerv2: Exploring frequency domain for efficient and robust 3d human pose estimation

    Qitao Zhao, Ce Zheng, Mengyuan Liu, Pichao Wang, and Chen Chen. Poseformerv2: Exploring frequency domain for efficient and robust 3d human pose estimation. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8877–8886, 2023. 2

  72. [72]

    Unleashing text-to-image diffu- sion models for visual perception

    Wenliang Zhao, Yongming Rao, Zuyan Liu, Benlin Liu, Jie Zhou, and Jiwen Lu. Unleashing text-to-image diffu- sion models for visual perception. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5729–5739, 2023. 8

  73. [73]

    SMAP: Single-shot multi- person absolute 3d pose estimation

    Jianan Zhen, Qi Fang, Jiaming Sun, Wentao Liu, Wei Jiang, Hujun Bao, and Xiaowei Zhou. SMAP: Single-shot multi- person absolute 3d pose estimation. 2, 5, 6

  74. [74]

    3d human pose estima- tion with spatial and temporal transformers

    Ce Zheng, Sijie Zhu, Matias Mendieta, Taojiannan Yang, Chen Chen, and Zhengming Ding. 3d human pose estima- tion with spatial and temporal transformers. InProceedings of the IEEE/CVF international conference on computer vi- sion, pages 11656–11665, 2021. 2

  75. [75]

    Deep learning-based human pose estimation: A survey.ACM Computing Surveys, 56(1):1–37, 2023

    Ce Zheng, Wenhan Wu, Chen Chen, Taojiannan Yang, Si- jie Zhu, Ju Shen, Nasser Kehtarnavaz, and Mubarak Shah. Deep learning-based human pose estimation: A survey.ACM Computing Surveys, 56(1):1–37, 2023. 1

  76. [76]

    MotionBERT: A unified perspective on learning human motion representations

    Wentao Zhu, Xiaoxuan Ma, Zhaoyang Liu, Libin Liu, Wayne Wu, and Yizhou Wang. MotionBERT: A unified perspective on learning human motion representations. version: 5. 2

  77. [77]

    Modulated graph convolutional network for 3d human pose estimation

    Zhiming Zou and Wei Tang. Modulated graph convolutional network for 3d human pose estimation. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11477–11487, 2021. 2