arxiv: 2606.02350 · v1 · pith:PAS6DJQHnew · submitted 2026-06-01 · 💻 cs.CV

TROPHIES: Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos

Pith reviewed 2026-06-28 15:01 UTC · model grok-4.3

classification 💻 cs.CV
keywords 4D reconstruction · multi-view videos · human reconstruction · scene reconstruction · camera pose estimation · temporal coherence · human-scene interaction

The pith

A unified framework jointly reconstructs dynamic humans, static scenes, and camera poses from multi-view videos in one global frame.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the task of unified human-scene-camera reconstruction from multi-view videos to overcome limitations of prior works that assume single-view inputs or decouple humans, scenes, and cameras. It proposes TROPHIES, which uses a Human Branch for modeling humans with temporal and spatial reasoning, a Scene Branch for static geometry with human-aware attention, and a global alignment module to enforce consistency. This approach aims to produce coherent geometry, stable motion, and physically aligned trajectories. A sympathetic reader would care because a single globally consistent reconstruction supports perception of humans and their environments in 4D space while avoiding the inconsistencies that arise when these elements are estimated separately.

Core claim

TROPHIES achieves globally aligned, physically plausible 4D reconstructions by jointly estimating dynamic humans, static scenes, and camera poses in one global coordinate frame from multi-view videos, outperforming existing paradigms in global fidelity and human-scene consistency.

What carries the argument

TROPHIES framework with Human Branch for temporal and spatial human modeling, Scene Branch for static geometry with human-aware attention, and global alignment module enforcing scale consistency, contact priors, and cross-view temporal coherence.

If this is right

  • Joint estimation produces coherent geometry and stable motion across humans and scenes.
  • The global alignment module enforces physically aligned trajectories for dynamic elements in a shared coordinate frame.
  • Yields higher global fidelity and human-scene consistency than decoupled methods on EgoHuman and EgoExo4D.
  • Establishes a new unified task that couples human, scene, and camera reconstruction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This joint approach could support more reliable motion prediction in extended video sequences by maintaining cross-view temporal coherence.
  • Applications in robotics might benefit from the enforced contact priors to improve interaction planning with environments.
  • Testing on videos with varying numbers of views could reveal the minimum overlap needed for stable alignment.

Load-bearing premise

Multi-view videos supply enough overlapping information and constraints to allow joint estimation of dynamic humans, static scenes, and camera poses in one global frame without the inconsistencies that arise when these elements are decoupled.

What would settle it

The claim would be undermined if the output 4D models exhibited scale inconsistencies or non-physical human-scene contacts when tested against ground-truth measurements from held-out multi-view video sequences.
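
The two failure signals named above can be made concrete. The following is a minimal sketch, not the paper's evaluation protocol: it assumes predicted and ground-truth 3D points are already in correspondence and rotationally aligned, that the ground is a single plane with a unit normal, and that the tolerances (5% scale deviation, 0.02 in scene units) are illustrative placeholders. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def _centroid(pts: Sequence[Point]) -> List[float]:
    n = len(pts)
    return [sum(p[k] for p in pts) / n for k in range(3)]


def best_scale(pred: Sequence[Point], gt: Sequence[Point]) -> float:
    """Least-squares scale s minimising sum ||s * (p - mean_p) - (g - mean_g)||^2.

    Assumes pred[i] corresponds to gt[i] and both share the same orientation.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and of equal length")
    cp, cg = _centroid(pred), _centroid(gt)
    num = den = 0.0
    for p, g in zip(pred, gt):
        dp = [p[k] - cp[k] for k in range(3)]
        dg = [g[k] - cg[k] for k in range(3)]
        num += sum(a * b for a, b in zip(dp, dg))
        den += sum(a * a for a in dp)
    if den == 0.0:
        raise ValueError("predicted points are degenerate (all identical)")
    return num / den


def scale_is_consistent(pred: Sequence[Point], gt: Sequence[Point], tol: float = 0.05) -> bool:
    """True if the recovered scale is within tol of 1 (tol is an assumed threshold)."""
    return abs(best_scale(pred, gt) - 1.0) <= tol


def signed_height(point: Point, normal: Point, offset: float) -> float:
    """Signed distance of a point to the plane normal . x + offset = 0 (unit normal assumed)."""
    return sum(n * x for n, x in zip(normal, point)) + offset


def classify_contact(height: float, tol: float = 0.02) -> str:
    """Label a foot point that is expected to touch the ground (tol is an assumed threshold)."""
    if height > tol:
        return "floating"
    if height < -tol:
        return "penetrating"
    return "contact"
```

A reconstruction at half the true size yields a recovered scale of 2, so `scale_is_consistent` returns `False`. A foot point well above or below the plane is labelled `floating` or `penetrating`, the two artifacts the Figure 5 caption highlights.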

Figures

Figures reproduced from arXiv: 2606.02350 by Jinpeng Liu, Xingyu Liu, Yukang Xu, Yutong Li.

Figure 1: Overview of Trophies. Given temporally synchronized video streams, Trophies jointly reconstructs dynamic humans, static scene geometry, and camera trajectories within a globally consistent 4D space. Our method couples a human branch and a scene branch through a global alignment and optimization stage that enforces scale, contact, and gravity consistency. This unified reconstruction produces temporally stab…
Figure 2: Pipeline Overview. Our framework consists of three components: a Scene Branch that reconstructs the static environment with human-aware attention (implemented as a plug-and-play module applicable to DUSt3R, MonST3R, and CUT3R backbones); a Human Branch that estimates temporally coherent body parameters from multi-view videos via symmetric and anchor-referenced attention; and a global Align and Optimizati…
Figure 3: Human-aware attention. Each frame is divided into human (colored) and non-human (transparent) patches. For views at the same time, all patches share information to enforce multi-view consistency. Across different time steps, only non-human patches exchange information, while human patches are masked in the attention layer to avoid motion-induced inconsistency. reweighting [5, 58, 65] or multi-memory reason…
Figure 4: Our method takes synchronized multi-view video frames as input and processes them with shared Human Video Transformers.
Figure 5: Qualitative results. Comparison of multi-view reconstructions before and after global optimization (Scenes 1–2) and against prior work (Scene 3). For Scenes 1 and 2, the initial results (a,c) exhibit misalignment between humans and scenes, leading to interpenetration, floating feet, and incorrect grounding (red boxes). Our global optimization (b,d) leads to physically coherent and well-grounded reconstruct…
Original abstract

Reconstructing humans and their surrounding environments in a globally consistent 4D space is essential for comprehensive perception. However, prior works typically assume single-view inputs or decouple humans, scenes, and cameras, making them unable to recover coherent geometry, stable motion, and physically aligned trajectories. These limitations motivate us to introduce a new task: unified human-scene-camera reconstruction from multi-view videos, which aims to jointly estimate dynamic humans, static scenes, and camera poses in one global coordinate frame. We propose TROPHIES (Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos), a unified framework tailored for this task. TROPHIES features a Human Branch that models humans through temporal and spatial reasoning, and a Scene Branch that reconstructs static geometry with human-aware attention. A global alignment and optimization module couples both branches by enforcing scale consistency, contact priors, and cross-view temporal coherence. Experiments on EgoHuman and EgoExo4D demonstrate that TROPHIES achieves globally aligned, physically plausible 4D reconstructions and consistently outperforms existing paradigms in both global fidelity and human-scene consistency.
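
The human-aware attention rule stated in the Figure 3 caption (same time step: all patches share information; different time steps: only non-human patches do) can be sketched as a boolean attention mask. This is an illustrative reading, not the authors' implementation: the function name is hypothetical, tokens are assumed to be flattened across views and time steps, and the sketch assumes a cross-time pair is allowed only when both the query and the key patch are non-human.

```python
from typing import List


def human_aware_mask(times: List[int], is_human: List[bool]) -> List[List[bool]]:
    """Return allow[i][j]: True if query token i may attend to key token j.

    times[i]    -- time step of the frame that token i comes from (any view)
    is_human[i] -- True if token i is a human patch
    """
    if len(times) != len(is_human):
        raise ValueError("times and is_human must have the same length")
    n = len(times)
    allow = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if times[i] == times[j]:
                # Same time step, possibly different views: every patch is visible.
                allow[i][j] = True
            else:
                # Different time steps: only non-human patches exchange information
                # (assumption: both sides of the pair must be non-human).
                allow[i][j] = (not is_human[i]) and (not is_human[j])
    return allow
```

With two views at time 0 and one view at time 1, each contributing one human and one background patch, a human patch attends to the human patch in the other view at the same time but not to any patch at the other time step, while background patches attend to each other across time. In an attention layer, the `False` entries would typically be converted to a large negative bias before the softmax.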

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces a new task of unified human-scene-camera reconstruction from multi-view videos. It proposes the TROPHIES framework consisting of a Human Branch for temporal and spatial reasoning on dynamic humans, a Scene Branch for static geometry reconstruction using human-aware attention, and a global alignment and optimization module that couples the branches via scale consistency, contact priors, and cross-view temporal coherence. Experiments on EgoHuman and EgoExo4D are reported to yield globally aligned, physically plausible 4D reconstructions that outperform prior decoupled approaches in global fidelity and human-scene consistency.

Significance. If the quantitative claims hold under full scrutiny, the work is significant for addressing inconsistencies that arise from separately estimating humans, scenes, and cameras. The explicit coupling mechanisms target a recognized limitation in the field and could enable more coherent 4D models for downstream applications. The framework's design choices around contact priors and temporal coherence represent a direct response to the weakest assumption noted in the stress test.

minor comments (3)
  1. The abstract states that TROPHIES 'consistently outperforms existing paradigms' but does not name the specific baselines or report numerical deltas; the experiments section should include a clear comparison table with metrics and error bars.
  2. Implementation details for the human-aware attention mechanism and the optimization module (e.g., loss weights, convergence criteria) are referenced at a high level; adding pseudocode or a dedicated subsection would improve reproducibility.
  3. The manuscript should explicitly discuss failure cases or limitations when multi-view overlap is limited, to address the weakest assumption about sufficient overlapping information.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive evaluation of TROPHIES, the recognition of its significance in addressing inconsistencies in decoupled human-scene-camera estimation, and the recommendation for minor revision. No major comments were listed in the report.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper introduces a new task of unified human-scene-camera reconstruction and describes TROPHIES as a framework with separate Human and Scene Branches plus a global alignment module that enforces scale consistency, contact priors, and temporal coherence. All performance claims are presented as empirical results on the external datasets EgoHuman and EgoExo4D rather than as quantities derived by construction from fitted parameters or prior self-citations. No equations, ansatzes, or uniqueness theorems are shown that reduce the central outputs to the inputs by definition, and the construction is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No information is available from the abstract to identify free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5730 in / 907 out tokens · 35766 ms · 2026-06-28T15:01:24.661044+00:00 · methodology

Reference graph

Works this paper leans on

66 extracted references · 11 canonical work pages · 5 internal anchors

  1. [1]

    Visual imitation enables contextual humanoid control

    Arthur Allshire, Hongsuk Choi, Junyi Zhang, David McAllister, Anthony Zhang, Chung Min Kim, Trevor Darrell, Pieter Abbeel, Jitendra Malik, and Angjoo Kanazawa. Visual imitation enables contextual humanoid control. In CoRL, 2025.

  2. [2]

    ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

    Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288, 2023.

  3. [3]

    Behave: Dataset and method for tracking human object interactions

    Bharat Lal Bhatnagar, Xianghui Xie, Ilya Petrov, Cristian Sminchisescu, Christian Theobalt, and Gerard Pons-Moll. Behave: Dataset and method for tracking human object interactions. In CVPR, 2022.

  4. [4]

    Bedlam: A synthetic dataset of bodies exhibiting detailed lifelike animated motion

    Michael J Black, Priyanka Patel, Joachim Tesch, and Jinlong Yang. Bedlam: A synthetic dataset of bodies exhibiting detailed lifelike animated motion. In CVPR, 2023.

  5. [5]

    Easi3r: Estimating disentangled motion from dust3r without training

    Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. Easi3r: Estimating disentangled motion from dust3r without training. arXiv preprint arXiv:2503.24391.

  6. [6]

    Human3r: Everyone everywhere all at once

    Yue Chen, Xingyu Chen, Yuxuan Xue, Anpei Chen, Yuliang Xiu, and Gerard Pons-Moll. Human3r: Everyone everywhere all at once. arXiv preprint arXiv:2510.06219, 2025.

  7. [7]

    Beyond static features for temporally consistent 3d human pose and shape from a video

    Hongsuk Choi, Gyeongsik Moon, Ju Yong Chang, and Kyoung Mu Lee. Beyond static features for temporally consistent 3d human pose and shape from a video. In CVPR, 2021.

  8. [8]

    Hsc4d: Human-centered 4d scene capture in large-scale indoor-outdoor space using wearable imus and lidar

    Yudi Dai, Yitai Lin, Chenglu Wen, Siqi Shen, Lan Xu, Jingyi Yu, Yuexin Ma, and Cheng Wang. Hsc4d: Human-centered 4d scene capture in large-scale indoor-outdoor space using wearable imus and lidar. In CVPR, 2022.

  9. [9]

    Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography

    Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 1981.

  10. [10]

    Humans in 4D: Reconstructing and tracking humans with transformers

    Shubham Goel, Georgios Pavlakos, Jathushan Rajasegaran, Angjoo Kanazawa, and Jitendra Malik. Humans in 4D: Reconstructing and tracking humans with transformers. In ICCV, 2023.

  11. [11]

    Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives

    Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives. In CVPR, 2024.

  12. [12]

    Human poseitioning system (hps): 3d human pose estimation and self-localization in large scenes from body-mounted sensors

    Vladimir Guzov, Aymen Mir, Torsten Sattler, and Gerard Pons-Moll. Human poseitioning system (hps): 3d human pose estimation and self-localization in large scenes from body-mounted sensors. In CVPR, 2021.

  13. [13]

    Real-time deep dynamic characters

    Marc Habermann et al. Real-time deep dynamic characters. In SIGGRAPH, 2021.

  14. [14]

    Populating 3d scenes by learning human-scene interaction

    Mohamed Hassan, Partha Ghosh, Joachim Tesch, Dimitrios Tzionas, and Michael J Black. Populating 3d scenes by learning human-scene interaction. In CVPR, 2021.

  15. [15]

    Populating 3d scenes by learning human-scene interaction

    Mohamed Hassan, Partha Ghosh, Joachim Tesch, Dimitrios Tzionas, and Michael J Black. Populating 3d scenes by learning human-scene interaction. In CVPR, 2021.

  16. [16]

    Deep learning for character motion synthesis

    Daniel Holden et al. Deep learning for character motion synthesis. In SIGGRAPH, 2016.

  17. [17]

    Reconstructing groups of people with hypergraph relational reasoning

    Buzhen Huang, Jingyi Ju, Zhihao Li, and Yangang Wang. Reconstructing groups of people with hypergraph relational reasoning. In ICCV, 2023.

  18. [18]

    Capturing and inferring dense full-body human-scene contact

    Chun-Hao P. Huang, Hongwei Yi, Markus Höschle, Matvey Safroshkin, Tsvetelina Alexiadis, Senya Polikovsky, Daniel Scharstein, and Michael J. Black. Capturing and inferring dense full-body human-scene contact. In CVPR, 2022.

  19. [19]

    InterCap: Joint markerless 3D tracking of humans and objects in interaction

    Yinghao Huang, Omid Taheri, Michael J. Black, and Dimitrios Tzionas. InterCap: Joint markerless 3D tracking of humans and objects in interaction. In GCPR, 2022.

  20. [20]

    InterCap: Joint markerless 3D tracking of humans and objects in interaction from multi-view RGB-D images

    Yinghao Huang, Omid Taheri, Michael J. Black, and Dimitrios Tzionas. InterCap: Joint markerless 3D tracking of humans and objects in interaction from multi-view RGB-D images. IJCV, 2024.

  21. [21]

    Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. PAMI, 2013.

  22. [22]

    Towards immersive human-x interaction: A real-time framework for physically plausible motion synthesis

    Kaiyang Ji, Ye Shi, Zichen Jin, Kangyi Chen, Lan Xu, Yuexin Ma, Jingyi Yu, and Jingya Wang. Towards immersive human-x interaction: A real-time framework for physically plausible motion synthesis. arXiv preprint arXiv:2508.02106, 2025.

  23. [23]

    H4d: Human 4d modeling by learning neural compositional representation

    Boyan Jiang, Yinda Zhang, Xingkui Wei, Xiangyang Xue, and Yanwei Fu. H4d: Human 4d modeling by learning neural compositional representation. In CVPR, 2022.

  24. [24]

    Scaling up dynamic human-scene interaction modeling

    Nan Jiang, Zhiyuan Zhang, Hongjie Li, Xiaoxuan Ma, Zan Wang, Yixin Chen, Tengyu Liu, Yixin Zhu, and Siyuan Huang. Scaling up dynamic human-scene interaction modeling. In CVPR, 2024.

  25. [25]

    End-to-end recovery of human shape and pose

    Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. In CVPR, 2018.

  26. [26]

    Ego-humans: An egocentric 3d multi-human benchmark

    Rawal Khirodkar, Aayush Bansal, Lingni Ma, Richard Newcombe, Minh Vo, and Kris Kitani. Ego-humans: An egocentric 3d multi-human benchmark. In ICCV, 2023.

  27. [27]

    Segment Anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. arXiv:2304.02643, 2023.

  28. [28]

    Muhammed Kocabas, Nikos Athanasiou, and Michael J. Black. Vibe: Video inference for human body pose and shape estimation. In CVPR, 2020.

  29. [29]

    PARE: Part attention regressor for 3D human body estimation

    Muhammed Kocabas, Chun-Hao P. Huang, Otmar Hilliges, and Michael J. Black. PARE: Part attention regressor for 3D human body estimation. In ICCV, 2021.

  30. [30]

    SPEC: Seeing people in the wild with an estimated camera

    Muhammed Kocabas, Chun-Hao P. Huang, Joachim Tesch, Lea Müller, Otmar Hilliges, and Michael J. Black. SPEC: Seeing people in the wild with an estimated camera. In ICCV, 2021.

  31. [31]

    Learning to reconstruct 3d human pose and shape via model-fitting in the loop

    Nikos Kolotouros, Georgios Pavlakos, Michael J Black, and Kostas Daniilidis. Learning to reconstruct 3d human pose and shape via model-fitting in the loop. In ICCV, 2019.

  32. [32]

    Grounding image matching in 3d with mast3r

    Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. In ECCV, 2024.

  33. [33]

    Joint Optimization for 4D Human-Scene Reconstruction in the Wild

    Zhizheng Liu, Joe Lin, Wayne Wu, and Bolei Zhou. Joint optimization for 4d human-scene reconstruction in the wild. arXiv preprint arXiv:2501.02158, 2025.

  34. [34]

    Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. SIGGRAPH Asia, 2015.

  35. [35]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

  36. [36]

    VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold

    Dominic Maggio, Hyungtae Lim, and Luca Carlone. Vggt-slam: Dense rgb slam optimized on the sl(4) manifold. arXiv preprint arXiv:2505.12549, 2025.

  37. [37]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.

  38. [38]

    Reconstructing people, places, and cameras

    Lea Müller, Hongsuk Choi, Anthony Zhang, Brent Yi, Jitendra Malik, and Angjoo Kanazawa. Reconstructing people, places, and cameras. arXiv:2412.17806, 2024.

  39. [39]

    Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3d hands, face, and body from a single image. In CVPR, 2019.

  40. [40]

    TMR: Text-to-motion retrieval using contrastive 3D human motion synthesis

    Mathis Petrovich, Michael J. Black, and Gül Varol. TMR: Text-to-motion retrieval using contrastive 3D human motion synthesis. In ICCV, 2023.

  41. [41]

    Sam 2: Segment anything in images and videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos, 2024.

  42. [42]

    Slahmr: Scale-aware human motion recovery from monocular videos

    Davis Rempe et al. Slahmr: Scale-aware human motion recovery from monocular videos. In CVPR, 2023.

  43. [43]

    Grounding dino 1.5: Advance the "edge" of open-set object detection

    Tianhe Ren, Qing Jiang, Shilong Liu, Zhaoyang Zeng, Wenlong Liu, Han Gao, Hongjie Huang, Zhengyu Ma, Xiaoke Jiang, Yihao Chen, Yuda Xiong, Hao Zhang, Feng Li, Peijun Tang, Kent Yu, and Lei Zhang. Grounding dino 1.5: Advance the "edge" of open-set object detection, 2024.

  44. [44]

    Grounded sam: Assembling open-world models for diverse visual tasks

    Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, and Lei Zhang. Grounded sam: Assembling open-world models for diverse visual tasks.

  45. [45]

    Javier Romero, Dimitrios Tzionas, and Michael J. Black. Embodied hands: Modeling and capturing hands and bodies together. SIGGRAPH Asia, 2017.

  46. [46]

    Structure-from-motion revisited

    Johannes L. Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, 2016.

  47. [47]

    A multi-view stereo benchmark with high-resolution images and multi-camera videos

    Thomas Schops, Johannes L Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In CVPR, 2017.

  48. [48]

    World-grounded human motion recovery via gravity-view coordinates

    Zehong Shen, Huaijin Pi, Yan Xia, Zhi Cen, Sida Peng, Zechen Hu, Hujun Bao, Ruizhen Hu, and Xiaowei Zhou. World-grounded human motion recovery via gravity-view coordinates. In SIGGRAPH Asia, 2024.

  49. [49]

    Soyong Shin, Juyong Kim, Eni Halilaj, and Michael J. Black. WHAM: Reconstructing world-grounded humans with accurate 3D motion. In CVPR, 2024.

  50. [50]

    Lgm: Large multi-view gaussian model for high-resolution 3d content creation

    Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and Ziwei Liu. Lgm: Large multi-view gaussian model for high-resolution 3d content creation. In ECCV.

  51. [51]

    Flag3d: A 3d fitness activity dataset with language instruction

    Yansong Tang, Jinpeng Liu, Aoyang Liu, Bin Yang, Wenxun Dai, Yongming Rao, Jiwen Lu, Jie Zhou, and Xiu Li. Flag3d: A 3d fitness activity dataset with language instruction. In CVPR, 2023.

  52. [52]

    Flag3d++: A benchmark for 3d fitness activity comprehension with language instruction

    Yansong Tang, Aoyang Liu, Jinpeng Liu, Shiyi Zhang, Wenxun Dai, Jie Zhou, Xiu Li, and Jiwen Lu. Flag3d++: A benchmark for 3d fitness activity comprehension with language instruction. PAMI, 2025.

  53. [53]

    Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras

    Zachary Teed and Jia Deng. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. NIPS, 2021.

  54. [54]

    Recovering accurate 3d human pose in the wild using imus and a moving camera

    Timo Von Marcard, Roberto Henschel, Michael J Black, Bodo Rosenhahn, and Gerard Pons-Moll. Recovering accurate 3d human pose in the wild using imus and a moving camera. In ECCV, 2018.

  55. [55]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In CVPR, 2025.

  56. [56]

    Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction

    Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. In NIPS, 2021.

  57. [57]

    Continuous 3d perception model with persistent state

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. In CVPR, 2025.

  58. [58]

    Dust3r: Geometric 3d vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In CVPR, 2024.

  59. [59]

    Tram: Global trajectory and motion of 3d humans from in-the-wild videos

    Yufu Wang, Ziyun Wang, Lingjie Liu, and Kostas Daniilidis. Tram: Global trajectory and motion of 3d humans from in-the-wild videos. In ECCV, 2024.

  60. [60]

    Unified human-scene interaction via prompted chain-of-contacts

    Zeqi Xiao, Tai Wang, Jingbo Wang, Jinkun Cao, Wenwei Zhang, Bo Dai, Dahua Lin, and Jiangmiao Pang. Unified human-scene interaction via prompted chain-of-contacts. arXiv preprint arXiv:2309.07918, 2023.

  61. [61]

    ViTPose: Simple vision transformer baselines for human pose estimation

    Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao. ViTPose: Simple vision transformer baselines for human pose estimation. In Advances in Neural Information Processing Systems, 2022.

  62. [62]

    Vitpose+: Vision transformer foundation model for generic body pose estimation

    Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao. Vitpose+: Vision transformer foundation model for generic body pose estimation. arXiv preprint arXiv:2212.04246, 2022.

  63. [63]

    Staf: 3d human mesh recovery from video with spatio-temporal alignment fusion

    Wei Yao, Hongwen Zhang, Yunlian Sun, and Jinhui Tang. Staf: 3d human mesh recovery from video with spatio-temporal alignment fusion. TCSVT, 2024.

  64. [64]

    Human-aware object placement for visual environment reconstruction

    Hongwei Yi, Chun-Hao P Huang, Dimitrios Tzionas, Muhammed Kocabas, Mohamed Hassan, Siyu Tang, Justus Thies, and Michael J Black. Human-aware object placement for visual environment reconstruction. In CVPR, 2022.

  65. [65]

    Monst3r: A simple approach for estimating geometry in the presence of motion

    Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. Monst3r: A simple approach for estimating geometry in the presence of motion. ICLR, 2025.

  66. [66]

    On the continuity of rotation representations in neural networks

    Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. In CVPR, 2019.