SocialMirror: Reconstructing 3D Human Interaction Behaviors from Monocular Videos with Semantic and Geometric Guidance
Pith reviewed 2026-05-10 13:37 UTC · model grok-4.3
The pith
SocialMirror reconstructs 3D human interaction behaviors from monocular videos by integrating semantic cues from vision-language models with geometric constraints in a diffusion framework.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SocialMirror is a diffusion-based framework with two stages. First, high-level interaction descriptions generated by a vision-language model guide a semantic-guided motion infiller, which hallucinates occluded bodies and resolves local pose ambiguities. Second, a sequence-level temporal refiner enforces smooth, jitter-free motion while incorporating geometric constraints during sampling to ensure plausible contact and spatial relationships.
What carries the argument
Semantic-guided motion infiller directed by vision-language model descriptions, combined with geometric constraints enforced during diffusion sampling in the temporal refiner.
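The load-bearing machinery here is folding geometric constraints into diffusion sampling. A deliberately tiny sketch of that idea, assuming a made-up one-dimensional "two joints must stay apart" constraint and a toy denoising loop (nothing below is the paper's actual model, which operates on SMPL pose sequences):

```python
import random

def constraint_penalty_grad(x, min_dist=1.0):
    """Gradient of a hinge penalty pushing two scalar 'joints' apart until
    they respect a minimum distance (a stand-in for contact and
    interpenetration constraints on real meshes)."""
    d = x[0] - x[1]
    if abs(d) >= min_dist:
        return [0.0, 0.0]
    sign = 1.0 if d >= 0 else -1.0
    # penalty = (min_dist - |d|)^2; gradient pushes x[0] and x[1] apart
    g = -2.0 * (min_dist - abs(d)) * sign
    return [g, -g]

def guided_sampling(steps=50, guidance=0.1, seed=0):
    """Toy DDPM-style loop: each step 'denoises' toward the prior mean,
    then takes a gradient step on the geometric penalty."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1), rng.gauss(0, 1)]
    for t in range(steps, 0, -1):
        noise_scale = t / steps
        # shrink toward the mean, re-inject a little (annealed) noise
        x = [0.9 * xi + 0.1 * noise_scale * rng.gauss(0, 1) for xi in x]
        # geometric guidance: descend the constraint penalty
        g = constraint_penalty_grad(x)
        x = [xi - guidance * gi for xi, gi in zip(x, g)]
    return x

guided = guided_sampling(guidance=0.1)  # joints stay separated
free = guided_sampling(guidance=0.0)    # joints collapse together
```

The point of the sketch is only the control flow: the sampler alternates a denoising update with a constraint-gradient update, so geometric plausibility is enforced during sampling rather than as a post-hoc fix.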
If this is right
- Reconstructed meshes enable realistic virtual avatars that interact naturally in augmented reality scenes.
- Precise capture of contact dynamics supports detailed performance analysis in sports training tools.
- Natural collaborative motion patterns become usable for planning and executing tasks with robots.
- Strong generalization to new datasets and in-the-wild videos reduces the requirement for domain-specific retraining.
Where Pith is reading between the lines
- The framework could extend to reconstructing interactions involving more than two people in crowded environments by scaling the infiller and constraint modules.
- Replacing the separate vision-language model with an integrated semantic encoder trained jointly might reduce dependency on external description quality.
- Applying the same geometric sampling constraints to longer sequences would test whether temporal smoothness holds over extended durations.
Load-bearing premise
High-level interaction descriptions produced by the vision-language model are accurate enough to correctly guide hallucination of occluded body parts without introducing semantic or pose errors.
What would settle it
Comparison against ground-truth 3D meshes on occluded interaction videos would settle it: outputs exhibiting body intersections, implausible contacts, or incorrect spatial distances between individuals would show that the method fails to resolve ambiguities.
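Those failure signals are mechanically checkable. A minimal sketch, assuming joints approximated by fixed-radius spheres (an illustrative assumption; real evaluations intersect the recovered meshes themselves):

```python
import math

def penetration_count(joints_a, joints_b, radius=0.07):
    """Count cross-person joint pairs whose bounding spheres overlap.

    joints_a, joints_b: lists of (x, y, z) joint positions in meters.
    radius: per-joint sphere radius (0.07 m is an illustrative value).
    A crude proxy for mesh intersection tests, but it flags the same
    failure mode: two bodies occupying the same space.
    """
    return sum(
        1
        for pa in joints_a
        for pb in joints_b
        if math.dist(pa, pb) < 2 * radius
    )

# Two toy two-joint "skeletons": one pair apart, one overlapping.
apart = penetration_count([(0, 0, 0), (0, 0, 1)], [(1, 0, 0), (1, 0, 1)])
overlap = penetration_count([(0, 0, 0), (0, 0, 1)], [(0.05, 0, 0), (0.05, 0, 1)])
```

The same loop generalizes to contact checks by inverting the predicate: joints that *should* touch but sit far apart are the complementary failure signal.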
Original abstract
Accurately reconstructing human behavior in close-interaction scenarios is crucial for enabling realistic virtual interactions in augmented reality, precise motion analysis in sports, and natural collaborative behavior in human-robot tasks. Reliable reconstruction in these contexts significantly enhances the realism and effectiveness of AI-driven interactive applications. However, human reconstruction from monocular videos in close-interaction scenarios remains challenging due to severe mutual occlusions, leading local motion ambiguity, disrupted temporal continuity and spatial relationship error. In this paper, we propose SocialMirror, a diffusion-based framework that integrates semantic and geometric cues to effectively address these issues. Specifically, we first leverage high-level interaction descriptions generated by a vision-language model to guide a semantic-guided motion infiller, hallucinating occluded bodies and resolving local pose ambiguities. Next, we propose a sequence-level temporal refiner that enforces smooth, jitter-free motions, while incorporating geometric constraints during sampling to ensure plausible contact and spatial relationships. Evaluations on multiple interaction benchmarks show that SocialMirror achieves state-of-the-art performance in reconstructing interactive human meshes, demonstrating strong generalization across unseen datasets and in-the-wild scenarios. The code will be released upon publication.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes SocialMirror, a diffusion-based framework for reconstructing 3D human meshes in close-interaction scenarios from monocular videos. It first generates high-level interaction descriptions via a vision-language model to condition a semantic-guided motion infiller that hallucinates occluded body parts and resolves local pose ambiguities; a subsequent sequence-level temporal refiner then enforces temporal smoothness while applying geometric constraints during diffusion sampling to ensure plausible contacts and spatial relationships. The central claim is that this yields state-of-the-art performance on interaction benchmarks together with strong generalization to unseen datasets and in-the-wild videos.
Significance. If the quantitative results and ablations hold, the work would represent a meaningful advance in monocular 3D human reconstruction under severe mutual occlusion, directly benefiting applications in AR/VR, sports motion analysis, and collaborative robotics. The explicit combination of semantic (VLM) and geometric cues during diffusion sampling is a timely direction; the planned code release would further strengthen reproducibility.
major comments (3)
- [Abstract / Experiments] Abstract and Experiments section: the headline SOTA claim on multiple interaction benchmarks is asserted without any reported quantitative metrics, baseline comparisons, error bars, or table of results. This absence prevents evaluation of whether the semantic infiller plus geometric constraints actually deliver the claimed gains.
- [Method (semantic-guided motion infiller)] Method section (semantic-guided motion infiller): no ablation isolates the contribution of VLM-generated descriptions from the geometric sampling constraints, nor is there a quantitative measure of VLM description accuracy on close-interaction videos. Without these, the weakest assumption—that high-level descriptions reliably guide hallucination of occluded limbs without systematic errors—remains untested and load-bearing for the generalization claims.
- [Experiments] Experiments / Results: the manuscript provides no failure-case analysis or qualitative examples where VLM outputs are ambiguous or incorrect, which could introduce new artifacts in contact or pose that the geometric refiner cannot fully correct. This directly affects the reliability of the reported benchmark improvements.
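The second major comment asks for a quantitative measure of VLM description accuracy. Before committing to an embedding model or human raters, the shape of such a metric can be prototyped with a deliberately crude bag-of-words cosine similarity (a toy proxy, not the evaluation a referee would ultimately accept):

```python
from collections import Counter
import math

def cosine_bow(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two interaction
    descriptions. A crude proxy: a real evaluation would use sentence
    embeddings or human judgment, but the interface is the same —
    higher means the VLM description agrees more with a reference."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Averaging such a score between VLM outputs and reference annotations over a benchmark subset would give the missing number, whatever similarity backend is ultimately chosen.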
minor comments (2)
- [Abstract] Abstract: the phrase 'leading local motion ambiguity' appears to be missing 'to' and should read 'leading to local motion ambiguity' for grammatical clarity.
- [Method] Notation: the distinction between the motion infiller and the temporal refiner should be made explicit with consistent variable names or module labels when first introduced.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and for recognizing the potential impact of our work on monocular 3D human reconstruction in close-interaction scenarios. We will revise the manuscript to address each point raised, improving the clarity and completeness of the experimental validation.
Point-by-point responses
-
Referee: [Abstract / Experiments] Abstract and Experiments section: the headline SOTA claim on multiple interaction benchmarks is asserted without any reported quantitative metrics, baseline comparisons, error bars, or table of results. This absence prevents evaluation of whether the semantic infiller plus geometric constraints actually deliver the claimed gains.
Authors: We acknowledge the need for explicit quantitative support. In the revised manuscript, we will add a comprehensive table in the Experiments section reporting quantitative metrics (such as MPJPE, PA-MPJPE, and contact accuracy) on the interaction benchmarks, along with comparisons to relevant baselines, error bars from repeated evaluations, and clear indications of the contributions from the semantic infiller and geometric constraints. revision: yes
-
Referee: [Method (semantic-guided motion infiller)] Method section (semantic-guided motion infiller): no ablation isolates the contribution of VLM-generated descriptions from the geometric sampling constraints, nor is there a quantitative measure of VLM description accuracy on close-interaction videos. Without these, the weakest assumption—that high-level descriptions reliably guide hallucination of occluded limbs without systematic errors—remains untested and load-bearing for the generalization claims.
Authors: We agree that isolating these components is essential for validating the approach. We will add an ablation study in the Experiments section that separately evaluates the effect of VLM-generated descriptions versus geometric sampling constraints. We will also include a quantitative measure of VLM description accuracy on close-interaction videos, for example through semantic similarity metrics or human evaluation on benchmark subsets. revision: yes
-
Referee: [Experiments] Experiments / Results: the manuscript provides no failure-case analysis or qualitative examples where VLM outputs are ambiguous or incorrect, which could introduce new artifacts in contact or pose that the geometric refiner cannot fully correct. This directly affects the reliability of the reported benchmark improvements.
Authors: We will add a dedicated subsection on failure cases and limitations in the Experiments section. This will include qualitative examples of videos where VLM outputs are ambiguous or incorrect, showing the resulting 3D reconstructions, and analyzing any introduced artifacts in contact or pose along with the mitigating effects (or limitations) of the sequence-level temporal refiner and geometric constraints. revision: yes
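The metrics the rebuttal commits to (MPJPE, PA-MPJPE) are standard and easy to pin down. A minimal numpy sketch; the 17-joint skeleton used in examples is an illustrative size, and units follow the inputs:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance
    between predicted and ground-truth joints (N x 3 arrays)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment: remove the best-fit global
    rotation, uniform scale, and translation before measuring error."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(p.T @ g)          # cross-covariance SVD
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T                         # optimal rotation
    s = (S * np.diag(D)).sum() / (p ** 2).sum()  # optimal uniform scale
    return mpjpe(s * p @ R.T + mu_g, gt)
```

PA-MPJPE isolates articulation error from global pose error, which is why papers in this area report both: a method can have low PA-MPJPE yet place the two people at the wrong relative distance, exactly the failure the geometric constraints target.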
Circularity Check
No circularity: method assembles standard components without self-referential reductions
Full rationale
The paper describes a diffusion-based pipeline that conditions a motion infiller on VLM-generated interaction descriptions and applies geometric constraints during sampling. No equations, fitted parameters, or predictions are shown to reduce to their own inputs by construction. No load-bearing self-citations or uniqueness theorems from the same authors are invoked. The framework is presented as a composition of existing techniques (diffusion models, VLMs, geometric priors) whose performance claims rest on external benchmarks rather than internal redefinitions.
Appendix excerpts

Implementation details: "Our model was implemented using PyTorch and trained on an NVIDIA RTX 3090 GPU. The batch size was set to 32 for the Semantic-Guided Motion Infiller and 64 for the Geometry Optimizer. We employed the AdamW optimizer with CyclicLRWithRestarts, where the learning rate was initially set to 0.0001, with parameters..."

Ablation on Geometry Optimizer: "Geometry Optimizer focuses on processing 3D joint positions to provide geometric guidance information. To validate the effectiveness of our encoding layer design for the auxiliary model, we conducted an ablation study by implementing the motion embedding layer with either STGCN or a Linear lay..."

Reconstruction under VLM limitations: "Based on our user study, the text descriptions generated by the VLM are, on average, superior to those produced by human annotators. As shown in Fig. 5, VLM annotations can capture not only macroscopic actions but also fine-grained contact relationships between specific joints (e.g., 'A person leads the..."

Use of large language models: "We declare that vision-language models (VLMs) in this paper are used primarily as a VLM Annotator to produce textual descriptions of interactions in image sequences and spatio-temporal joint contact pairs. LLMs are used only for light text polishing and grammar fixes. The research approach, core ideas, reasoning, a..."
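The implementation details above mention AdamW with CyclicLRWithRestarts, a third-party scheduler. Its overall shape, cosine annealing with warm restarts (SGDR-style), can be sketched framework-free; cycle_len and t_mult below are illustrative values, not the paper's:

```python
import math

def cyclic_lr_with_restarts(step, base_lr=1e-4, cycle_len=100, t_mult=2.0):
    """Cosine-annealed learning rate with warm restarts (SGDR-style).
    The rate decays from base_lr toward 0 over a cycle, then restarts
    at base_lr with the next cycle t_mult times longer."""
    t, length = step, float(cycle_len)
    while t >= length:
        t -= length
        length *= t_mult
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t / length))
```

At step 0 the rate is base_lr, it decays toward zero by the end of each cycle, and each restart jumps back to base_lr — the periodic "kick" that such schedules use to escape shallow minima.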