CoGS: Compositional Dynamic Human-Object Scenes Gaussian Splatting from Monocular Video

Jerrin Bright; John Zelek

arxiv: 2606.28820 · v1 · pith:C4E53BFAnew · submitted 2026-06-27 · 💻 cs.CV

CoGS: Compositional Dynamic Human-Object Scenes Gaussian Splatting from Monocular Video

Jerrin Bright , John Zelek This is my paper

Pith reviewed 2026-06-30 10:22 UTC · model grok-4.3

classification 💻 cs.CV

keywords dynamic scene reconstructionGaussian splattinghuman-object interactionmonocular videocompositional decompositionarticulated humanrigid objectscene rendering

0 comments

The pith

CoGS decomposes monocular video into separate Gaussian branches for articulated humans, rigid objects, and static scenes, stabilized by a six-stage optimization schedule.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CoGS as a compositional Gaussian splatting approach for reconstructing dynamic human-object interaction scenes from a single video. It splits the problem into three branches that each handle one motion type: an articulated human from a canonical prior, a rigid object driven by its trajectory, and a static scene with optional planar regularization. A six-stage schedule first optimizes the human and object on their own, then fuses everything under full-image supervision and component-specific constraints. This separation prevents motion from one element from contaminating the others. A reader would care because shared pixels in video make it easy for object movement to distort the human or background in prior methods.

Core claim

CoGS decomposes the video into three coordinated branches: an articulated human initialized from a complete canonical prior, a rigid object field driven by an estimated object trajectory, and a static scene field regularized by weak scene-only planar primitives when available. A six-stage optimization schedule first stabilizes the human and object independently, then fuses them with the scene under full-image supervision, visibility-aware human anchoring, object silhouette and motion constraints, and delayed scene regularization. This design keeps each component responsible for its own geometry and motion while allowing photometric evidence to correct the final composite.

What carries the argument

The compositional framework of three coordinated Gaussian branches managed by a six-stage optimization schedule that stabilizes components independently before fusion.

If this is right

Improved human-object interaction reconstruction on HOSNeRF without motion leakage into the human or scene.
Stronger in-the-wild human-scene rendering on NeuMan across full-frame and human-focused metrics.
Higher fidelity and perceptual quality than prior dynamic radiance-field or Gaussian methods that entangle components.
Each branch remains responsible for its geometry and motion while the final composite is corrected by photometric evidence.
Delayed scene regularization and object silhouette constraints keep the static background stable during fusion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The staged independent stabilization could be tested on other multi-motion scenes such as multiple rigid objects or animals.
Explicit branch separation might support post-reconstruction editing of the human or object in isolation.
Removing the canonical human prior would likely require new initialization constraints to maintain the same separation.
The method suggests that explicit motion-model decomposition is useful whenever pixels are shared across differently moving elements.

Load-bearing premise

The articulated human can be initialized from a complete canonical prior and the six-stage schedule can stabilize the human and object independently before fusion without motion leakage between components.

What would settle it

A single entangled Gaussian field or an optimization without the staged separation producing equal or better fidelity and perceptual scores on the HOSNeRF and NeuMan datasets would show the decomposition is not required.

Figures

Figures reproduced from arXiv: 2606.28820 by Jerrin Bright, John Zelek.

**Figure 1.** Figure 1: Visibility-aware human anchoring. Well-observed Gaussians (blue) are corrected by image evidence, while rarely observed or occluded regions (orange/red) remain softly tethered to the canonical prior. dence is weak, while planar primitives weakly regularize large static surfaces after warmup. These priors are not treated as fixed outputs; they provide structure that can be corrected by sequence-level photom… view at source ↗

**Figure 2.** Figure 2: Overview of CoGS: a trust-then-correct compositional pipeline. From a monocular human–object video, preprocessing yields masks, COLMAP camera calibration and sparse geometry, SMPL motion, an object trajectory, optional scene planar primitives, and a complete canonical human seed. The human branch decodes a canonical Gaussian human from a triplane–MLP field and poses it by linear blend skinning (LBS) under … view at source ↗

**Figure 3.** Figure 3: Qualitative comparison on NeuMan validation frames. We compare CoGS with some SOTA models on frames from four NeuMan sequences. Red boxes indicate general reconstruction artifacts such as ghosting, geometry leakage, missing body structure, or inconsistent foreground placement, while blue boxes highlight color-intensity errors where the rendering is noticeably lighter or darker than the ground truth. Across… view at source ↗

**Figure 4.** Figure 4: Object-level qualitative comparison. Each pair shows the ground-truth object on the left and the corresponding CoGS rendering on the right. CoGS preserves object shape, color, and contact-consistent appearance across held-out examples, with the thin tennis racket remaining the hardest case. improves structural consistency. Removing human foreground losses causes the larger drop because the compositor lose… view at source ↗

read the original abstract

Reconstructing dynamic human--object interaction scenes from monocular video is difficult because the human, manipulated object, and background obey different motion models while sharing the same pixels. Existing dynamic radiance-field and Gaussian-splatting methods often entangle these components, causing object motion to leak into the human or static scene, and monocular human reconstruction remains underconstrained in regions that are rarely observed. We present CoGS, a compositional Gaussian-splatting framework for monocular human--object scene reconstruction. CoGS decomposes the video into three coordinated branches: an articulated human initialized from a complete canonical prior, a rigid object field driven by an estimated object trajectory, and a static scene field regularized by weak scene-only planar primitives when available. A six-stage optimization schedule first stabilizes the human and object independently, then fuses them with the scene under full-image supervision, visibility-aware human anchoring, object silhouette and motion constraints, and delayed scene regularization. This design keeps each component responsible for its own geometry and motion while allowing photometric evidence to correct the final composite. Experiments on HOSNeRF and NeuMan show that CoGS improves both human--object interaction reconstruction and in-the-wild human--scene rendering, achieving stronger fidelity and perceptual quality across full-frame and human-focused evaluations. Code will be released upon publication.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CoGS introduces a three-branch compositional split plus a six-stage schedule to reduce motion leakage in monocular human-object Gaussian splatting, but the abstract gives no numbers or ablations to check whether the decoupling actually works.

read the letter

The core idea is a clean decomposition: one branch for an articulated human seeded from a canonical prior, one for a rigid object driven by its own trajectory, and one for the static scene with planar regularization. A six-stage schedule stabilizes the human and object first, then brings in the scene under visibility and silhouette constraints. This directly targets the entanglement problem that shows up in earlier dynamic radiance fields and Gaussian splatting work.

The approach is sensible on paper. Treating the three components as having distinct motion models and giving them separate optimization phases is a practical way to keep object motion from bleeding into the human or background. The visibility-aware anchoring and delayed scene regularization are reasonable engineering choices for monocular data.

The soft spot is the lack of supporting evidence. The abstract states that experiments on HOSNeRF and NeuMan show better fidelity and perceptual quality, yet it supplies no quantitative scores, error bars, ablation tables, or leakage metrics. Without those, it is impossible to tell whether the staged losses actually enforce separation when photometric gradients are shared across branches. The stress-test concern about unverified assumptions on the canonical prior and independent stabilization therefore stands; the abstract describes the schedule but does not demonstrate that it succeeds under in-the-wild monocular conditions.

This is aimed at people already working on dynamic Gaussian splatting or monocular human-object reconstruction for AR/VR or robotics. A reader who needs a new baseline or a concrete schedule to try could get something useful from the full paper if the experiments hold up. The work shows clear thinking about the decomposition problem and engages the relevant prior literature, so it is coherent on its own terms.

I would send it to peer review so the experiments and ablations can be checked properly.

Referee Report

1 major / 0 minor

Summary. The paper presents CoGS, a compositional Gaussian-splatting framework for monocular reconstruction of dynamic human-object interaction scenes. It decomposes the scene into three branches—an articulated human initialized from a canonical prior, a rigid object driven by estimated trajectory, and a static scene field—and optimizes them via a six-stage schedule that first stabilizes human and object independently before fusing under full-image supervision with visibility-aware anchoring, silhouette constraints, and delayed scene regularization. Experiments on HOSNeRF and NeuMan are claimed to show improved fidelity and perceptual quality over HOSNeRF and NeuMan baselines for both interaction reconstruction and in-the-wild rendering.

Significance. If the staged optimization demonstrably prevents motion leakage while preserving component-specific geometry, the method would offer a practical advance over entangled dynamic radiance fields by explicitly handling differing motion models within shared pixels, with potential utility for in-the-wild human-scene applications.

major comments (1)

[Abstract and method description of six-stage optimization schedule] The central claim that the six-stage schedule decouples the articulated human (from complete canonical prior) and rigid object (from estimated trajectory) without motion leakage into each other or the static scene rests on unverified assumptions; no ablations, leakage metrics, or failure-case analysis are referenced to show that visibility-aware anchoring and silhouette constraints suffice when photometric evidence is shared across branches under monocular in-the-wild conditions.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on the central claim regarding the six-stage optimization. We address the concern point by point below.

read point-by-point responses

Referee: [Abstract and method description of six-stage optimization schedule] The central claim that the six-stage schedule decouples the articulated human (from complete canonical prior) and rigid object (from estimated trajectory) without motion leakage into each other or the static scene rests on unverified assumptions; no ablations, leakage metrics, or failure-case analysis are referenced to show that visibility-aware anchoring and silhouette constraints suffice when photometric evidence is shared across branches under monocular in-the-wild conditions.

Authors: We acknowledge that the current manuscript does not include dedicated ablations, quantitative leakage metrics, or explicit failure-case analysis isolating the effect of the six-stage schedule on motion leakage. The design rationale is that independent stabilization of the human (via canonical prior) and object (via trajectory) in early stages, followed by visibility-aware anchoring, silhouette constraints, and delayed scene regularization, assigns responsibility for geometry and motion to each branch before full photometric supervision. The reported improvements over HOSNeRF and NeuMan baselines on both datasets are presented as indirect evidence that leakage is mitigated relative to entangled baselines. However, this does not constitute direct verification under shared-pixel monocular conditions. We will therefore add, in the revised manuscript: (i) an ablation removing individual stages of the schedule, (ii) a leakage metric based on cross-component optical-flow consistency between branches, and (iii) qualitative failure-case analysis on sequences with heavy occlusion or ambiguous photometric evidence. These additions will directly test whether the anchoring and silhouette constraints suffice. revision: yes

Circularity Check

0 steps flagged

No circularity: framework relies on external priors, datasets, and staged optimization without self-referential reductions.

full rationale

The abstract and available description present CoGS as a new compositional Gaussian-splatting pipeline that decomposes scenes into human, object, and scene branches, initialized from a canonical prior and optimized via a six-stage schedule. No equations, fitted parameters renamed as predictions, or self-citations are shown that would make any claimed result equivalent to its inputs by construction. The method is benchmarked on external datasets (HOSNeRF, NeuMan) with standard techniques, making the derivation self-contained against those benchmarks. No load-bearing step reduces to a self-definition or fitted input.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 1 invented entities

Based solely on the abstract; limited visibility into implementation details means the ledger captures only explicitly mentioned elements.

free parameters (1)

stage-specific regularization weights and constraints
The six-stage optimization schedule implies multiple hyperparameters for stabilization, anchoring, silhouette constraints, and delayed scene regularization.

axioms (2)

domain assumption Human model can be initialized from a complete canonical prior
Abstract states the articulated human is initialized from a complete canonical prior.
domain assumption Object trajectory can be estimated independently to drive the rigid field
Abstract describes the rigid object field driven by an estimated object trajectory.

invented entities (1)

Three coordinated compositional branches no independent evidence
purpose: To decompose entangled human, object, and scene components
The framework introduces human, object, and scene branches as the core decomposition strategy.

pith-pipeline@v0.9.1-grok · 5760 in / 1464 out tokens · 57350 ms · 2026-06-30T10:22:14.912699+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

43 extracted references · 7 canonical work pages · 2 internal anchors

[1]

Per-Gaussian Embedding- Based Deformation for Deformable 3D Gaussian Splatting

Jeongmin Bae, Seoha Kim, Youngsik Yun, Hahyun Lee, Gun Bang, and Youngjung Uh. Per-Gaussian Embedding- Based Deformation for Deformable 3D Gaussian Splatting. InEuropean Conference on Computer Vision, pages 321–
[2]

Springer, 2024. 1, 2, 6

2024
[3]

BEHA VE: Dataset and Method for Tracking Human Object Interactions

Bharat Lal Bhatnagar, Xianghui Xie, Ilya A Petrov, Cristian Sminchisescu, Christian Theobalt, and Gerard Pons-Moll. BEHA VE: Dataset and Method for Tracking Human Object Interactions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15935– 15946, 2022. 2

2022
[4]

Distribution and depth- aware transformers for 3d human mesh recovery.arXiv preprint arXiv:2403.09063, 4(6):7, 2024

Jerrin Bright, Bavesh Balaji, Harish Prakash, Yuhao Chen, David A Clausi, and John Zelek. Distribution and depth- aware transformers for 3d human mesh recovery.arXiv preprint arXiv:2403.09063, 4(6):7, 2024. 2

work page arXiv 2024
[5]

Zelek, and David A

Jerrin Bright, Zhibo Wang, Yuhao Chen, Sirisha Rambhatla, John S. Zelek, and David A. Clausi. Gen4d: Synthesizing humans and scenes in the wild. InWorkshop on Computer Vision in the Wild at CVPR, 2025. 2

2025
[6]

HexPlane: A Fast Representa- tion for Dynamic Scenes

Ang Cao and Justin Johnson. HexPlane: A Fast Representa- tion for Dynamic Scenes. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 130–141, 2023. 2

2023
[7]

ARCTIC: A Dataset for Dexterous Bimanual Hand- Object Manipulation

Zicong Fan, Omid Taheri, Dimitrios Tzionas, Muhammed Kocabas, Manuel Kaufmann, Michael J Black, and Otmar Hilliges. ARCTIC: A Dataset for Dexterous Bimanual Hand- Object Manipulation. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 12943–12954, 2023. 2

2023
[8]

HOLD: Category-Agnostic 3D Reconstruction of Interact- ing Hands and Objects from Video

Zicong Fan, Maria Parelli, Maria Eleni Kadoglou, Xu Chen, Muhammed Kocabas, Michael J Black, and Otmar Hilliges. HOLD: Category-Agnostic 3D Reconstruction of Interact- ing Hands and Objects from Video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 494–504, 2024. 2

2024
[9]

K- Planes: Explicit Radiance Fields in Space, Time, and Ap- pearance

Sara Fridovich-Keil, Giacomo Meanti, Frederik Rahbæk Warburg, Benjamin Recht, and Angjoo Kanazawa. K- Planes: Explicit Radiance Fields in Space, Time, and Ap- pearance. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12479– 12488, 2023. 2, 6

2023
[10]

Vid2Avatar: 3D Avatar Reconstruction from Videos in the Wild via Self-Supervised Scene Decomposi- tion

Chen Guo, Tianjian Jiang, Xu Chen, Jie Song, and Ot- mar Hilliges. Vid2Avatar: 3D Avatar Reconstruction from Videos in the Wild via Self-Supervised Scene Decomposi- tion. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 12858–12868,
[11]

GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians

Liangxiao Hu, Hongwen Zhang, Yuxiang Zhang, Boyao Zhou, Boning Liu, Shengping Zhang, and Liqiang Nie. GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 634–644, 2024. 2

2024
[12]

GauHuman: Artic- ulated Gaussian Splatting from Monocular Human Videos

Shoukang Hu, Tao Hu, and Ziwei Liu. GauHuman: Artic- ulated Gaussian Splatting from Monocular Human Videos. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 20418–20431, 2024. 2

2024
[13]

NeuMan: Neural Human Radiance Field from a Single Video

Wei Jiang, Kwang Moo Yi, Golnoosh Samei, Oncel Tuzel, and Anurag Ranjan. NeuMan: Neural Human Radiance Field from a Single Video. InEuropean Conference on Com- puter Vision, pages 402–418. Springer, 2022. 1, 2, 6, 7

2022
[14]

3D Gaussian Splatting for Real-Time Radiance Field Rendering.ACM Trans

Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3D Gaussian Splatting for Real-Time Radiance Field Rendering.ACM Trans. Graph., 42(4):139– 1, 2023. 1, 2

2023
[15]

HUGS: Human Gaussian Splats

Muhammed Kocabas, Jen-Hao Rick Chang, James Gabriel, Oncel Tuzel, and Anurag Ranjan. HUGS: Human Gaussian Splats. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 505–515, 2024. 1, 2

2024
[16]

Fully Explicit Dynamic Gaussian Splat- ting.Advances in Neural Information Processing Systems, 37:5384–5409, 2024

Junoh Lee, ChangYeon Won, Hyunjun Jung, Inhwan Bae, and Hae-Gon Jeon. Fully Explicit Dynamic Gaussian Splat- ting.Advances in Neural Information Processing Systems, 37:5384–5409, 2024. 1, 2, 6

2024
[17]

HOSNeRF: Dynamic Human-Object-Scene Neural Radiance Fields from a Single Video

Jia-Wei Liu, Yan-Pei Cao, Tianyuan Yang, Zhongcong Xu, Jussi Keppo, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. HOSNeRF: Dynamic Human-Object-Scene Neural Radiance Fields from a Single Video. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 18483–18494, 2023. 1, 2, 6

2023
[18]

MoDGS: Dy- namic gaussian splatting from casually-captured monocular videos with depth priors

Qingming LIU, Yuan Liu, Jiepeng Wang, Xianqiang Lyu, Peng Wang, Wenping Wang, and Junhui Hou. MoDGS: Dy- namic gaussian splatting from casually-captured monocular videos with depth priors. InThe Thirteenth International Conference on Learning Representations, 2025. 1, 2

2025
[19]

SMPL: A Skinned Multi-Person Linear Model.ACM Transactions on Graphics (TOG), 34(6):248:1–248:16, 2015

Matthew Loper, Naureen Mahmood, Javier Romero, Ger- ard Pons-Moll, and Michael J Black. SMPL: A Skinned Multi-Person Linear Model.ACM Transactions on Graphics (TOG), 34(6):248:1–248:16, 2015. 1, 2

2015
[20]

NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis.Communications of the ACM, 65(1):99–106,

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis.Communications of the ACM, 65(1):99–106,
[21]

GASPACHO: Gaus- sian Splatting for Controllable Humans and Objects.arXiv preprint arXiv:2503.09342, 2025

Aymen Mir, Arthur Moreau, Helisa Dhamo, Zhensong Zhang, and Eduardo P ´erez-Pellitero. GASPACHO: Gaus- sian Splatting for Controllable Humans and Objects.arXiv preprint arXiv:2503.09342, 2025. 2

work page arXiv 2025
[22]

Expressive Whole-Body 3D Gaussian Avatar

Gyeongsik Moon, Takaaki Shiratori, and Shunsuke Saito. Expressive Whole-Body 3D Gaussian Avatar. InEuropean Conference on Computer Vision, pages 19–35. Springer,
[23]

Nerfies: Deformable Neural Radiance Fields

Keunhong Park, Utkarsh Sinha, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Steven M Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable Neural Radiance Fields. InProceedings of the IEEE/CVF international con- ference on computer vision, pages 5865–5874, 2021. 1, 2, 6

2021
[24]

HyperNeRF: A Higher- Dimensional Representation for Topologically Varying Neu- 9 ral Radiance Fields.arXiv preprint arXiv:2106.13228, 2021

Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Ricardo Martin- Brualla, and Steven M Seitz. HyperNeRF: A Higher- Dimensional Representation for Topologically Varying Neu- 9 ral Radiance Fields.arXiv preprint arXiv:2106.13228, 2021. 1, 2, 6

work page arXiv 2021
[25]

Priyanka Patel and Michael J. Black. CameraHMR: Aligning People with Perspective.arXiv preprint arXiv:2411.08128,

work page arXiv
[26]

Expressive Body Capture: 3D Hands, Face, and Body from a Single Image

Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed AA Osman, Dimitrios Tzionas, and Michael J Black. Expressive Body Capture: 3D Hands, Face, and Body from a Single Image. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10975–10985, 2019. 1, 2

2019
[27]

Neural Body: Implicit Neural Representations with Structured La- tent Codes for Novel View Synthesis of Dynamic Humans

Sida Peng, Yuanqing Zhang, Yinghao Xu, Qianqian Wang, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Neural Body: Implicit Neural Representations with Structured La- tent Codes for Novel View Synthesis of Dynamic Humans. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 9054–9063, 2021. 2

2021
[28]

3DGS-Avatar: Animatable Avatars via Deformable 3D Gaussian Splatting

Zhiyin Qian, Shaofei Wang, Marko Mihajlovic, Andreas Geiger, and Siyu Tang. 3DGS-Avatar: Animatable Avatars via Deformable 3D Gaussian Splatting. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5020–5030, 2024. 2

2024
[29]

SAM 2: Segment Anything in Images and Videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junt- ing Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao- Yuan Wu, Ross Girshick, Piotr Doll´ar, and Christoph Feicht- enhofer. SAM 2: Segment Anything in Images and Videos. arXiv preprint arXiv:...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digi- tization

Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Mor- ishima, Angjoo Kanazawa, and Hao Li. PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digi- tization. InProceedings of the IEEE/CVF international con- ference on computer vision, pages 2304–2314, 2019. 2

2019
[31]

PIFuHD: Multi-Level Pixel-Aligned Implicit Function for High-Resolution 3D Human Digitization

Shunsuke Saito, Tomas Simon, Jason Saragih, and Hanbyul Joo. PIFuHD: Multi-Level Pixel-Aligned Implicit Function for High-Resolution 3D Human Digitization. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 84–93, 2020. 2

2020
[32]

Structure- from-Motion Revisited

Johannes L Schonberger and Jan-Michael Frahm. Structure- from-Motion Revisited. InProceedings of the IEEE con- ference on computer vision and pattern recognition, pages 4104–4113, 2016. 1

2016
[33]

DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation

Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. DreamGaussian: Generative Gaussian Splat- ting for Efficient 3D Content Creation.arXiv preprint arXiv:2309.16653, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

Hu- manNeRF: Free-Viewpoint Rendering of Moving People from Monocular Video

Chung-Yi Weng, Brian Curless, Pratul P Srinivasan, Jonathan T Barron, and Ira Kemelmacher-Shlizerman. Hu- manNeRF: Free-Viewpoint Rendering of Moving People from Monocular Video. InProceedings of the IEEE/CVF conference on computer vision and pattern Recognition, pages 16210–16220, 2022. 2

2022
[35]

4D Gaussian Splatting for Real-Time Dynamic Scene Ren- dering

Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4D Gaussian Splatting for Real-Time Dynamic Scene Ren- dering. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 20310–20320,
[36]

Dˆ2-NeRF: Self-Supervised Decoupling of Dynamic and Static Objects from a Monocu- lar Video.Advances in neural information processing sys- tems, 35:32653–32666, 2022

Tianhao Wu, Fangcheng Zhong, Andrea Tagliasacchi, For- rester Cole, and Cengiz Oztireli. Dˆ2-NeRF: Self-Supervised Decoupling of Dynamic and Static Objects from a Monocu- lar Video.Advances in neural information processing sys- tems, 35:32653–32666, 2022. 2, 6

2022
[37]

Rendering Humans from Object-Occluded Monoc- ular Videos

Tiange Xiang, Adam Sun, Jiajun Wu, Ehsan Adeli, and Li Fei-Fei. Rendering Humans from Object-Occluded Monoc- ular Videos. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3239–3250, 2023. 1, 2

2023
[38]

Wild2Avatar: Rendering Humans Behind Occlusions.arXiv preprint arXiv:2401.00431, 2024

Tiange Xiang, Adam Sun, Scott Delp, Kazuki Kozuka, Li Fei-Fei, and Ehsan Adeli. Wild2Avatar: Rendering Humans Behind Occlusions.arXiv preprint arXiv:2401.00431, 2024. 1, 2

work page arXiv 2024
[39]

ICON: Implicit Clothed Humans Obtained from Normals

Yuliang Xiu, Jinlong Yang, Dimitrios Tzionas, and Michael J Black. ICON: Implicit Clothed Humans Obtained from Normals. In2022 IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), pages 13286–13296. IEEE, 2022. 2

2022
[40]

ECON: Explicit Clothed Humans Optimized via Normal Integration

Yuliang Xiu, Jinlong Yang, Xu Cao, Dimitrios Tzionas, and Michael J Black. ECON: Explicit Clothed Humans Optimized via Normal Integration. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 512–523, 2023. 2

2023
[41]

HSR: Holistic 3D Human-Scene Recon- struction from Monocular Videos

Lixin Xue, Chen Guo, Chengwei Zheng, Fangjinhua Wang, Tianjian Jiang, Hsuan-I Ho, Manuel Kaufmann, Jie Song, and Otmar Hilliges. HSR: Holistic 3D Human-Scene Recon- struction from Monocular Videos. InEuropean Conference on Computer Vision, 2024. 2

2024
[42]

Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction

Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 20331–20341, 2024. 1, 2, 6

2024
[43]

Zetong Zhang, Manuel Kaufmann, Lixin Xue, Jie Song, and Martin R. Oswald. ODHSR: Online Dense 3D Reconstruc- tion of Humans and Scenes from Monocular Videos. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 2 10

2025

[1] [1]

Per-Gaussian Embedding- Based Deformation for Deformable 3D Gaussian Splatting

Jeongmin Bae, Seoha Kim, Youngsik Yun, Hahyun Lee, Gun Bang, and Youngjung Uh. Per-Gaussian Embedding- Based Deformation for Deformable 3D Gaussian Splatting. InEuropean Conference on Computer Vision, pages 321–

[2] [2]

Springer, 2024. 1, 2, 6

2024

[3] [3]

BEHA VE: Dataset and Method for Tracking Human Object Interactions

Bharat Lal Bhatnagar, Xianghui Xie, Ilya A Petrov, Cristian Sminchisescu, Christian Theobalt, and Gerard Pons-Moll. BEHA VE: Dataset and Method for Tracking Human Object Interactions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15935– 15946, 2022. 2

2022

[4] [4]

Distribution and depth- aware transformers for 3d human mesh recovery.arXiv preprint arXiv:2403.09063, 4(6):7, 2024

Jerrin Bright, Bavesh Balaji, Harish Prakash, Yuhao Chen, David A Clausi, and John Zelek. Distribution and depth- aware transformers for 3d human mesh recovery.arXiv preprint arXiv:2403.09063, 4(6):7, 2024. 2

work page arXiv 2024

[5] [5]

Zelek, and David A

Jerrin Bright, Zhibo Wang, Yuhao Chen, Sirisha Rambhatla, John S. Zelek, and David A. Clausi. Gen4d: Synthesizing humans and scenes in the wild. InWorkshop on Computer Vision in the Wild at CVPR, 2025. 2

2025

[6] [6]

HexPlane: A Fast Representa- tion for Dynamic Scenes

Ang Cao and Justin Johnson. HexPlane: A Fast Representa- tion for Dynamic Scenes. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 130–141, 2023. 2

2023

[7] [7]

ARCTIC: A Dataset for Dexterous Bimanual Hand- Object Manipulation

Zicong Fan, Omid Taheri, Dimitrios Tzionas, Muhammed Kocabas, Manuel Kaufmann, Michael J Black, and Otmar Hilliges. ARCTIC: A Dataset for Dexterous Bimanual Hand- Object Manipulation. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 12943–12954, 2023. 2

2023

[8] [8]

HOLD: Category-Agnostic 3D Reconstruction of Interact- ing Hands and Objects from Video

Zicong Fan, Maria Parelli, Maria Eleni Kadoglou, Xu Chen, Muhammed Kocabas, Michael J Black, and Otmar Hilliges. HOLD: Category-Agnostic 3D Reconstruction of Interact- ing Hands and Objects from Video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 494–504, 2024. 2

2024

[9] [9]

K- Planes: Explicit Radiance Fields in Space, Time, and Ap- pearance

Sara Fridovich-Keil, Giacomo Meanti, Frederik Rahbæk Warburg, Benjamin Recht, and Angjoo Kanazawa. K- Planes: Explicit Radiance Fields in Space, Time, and Ap- pearance. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12479– 12488, 2023. 2, 6

2023

[10] [10]

Vid2Avatar: 3D Avatar Reconstruction from Videos in the Wild via Self-Supervised Scene Decomposi- tion

Chen Guo, Tianjian Jiang, Xu Chen, Jie Song, and Ot- mar Hilliges. Vid2Avatar: 3D Avatar Reconstruction from Videos in the Wild via Self-Supervised Scene Decomposi- tion. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 12858–12868,

[11] [11]

GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians

Liangxiao Hu, Hongwen Zhang, Yuxiang Zhang, Boyao Zhou, Boning Liu, Shengping Zhang, and Liqiang Nie. GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 634–644, 2024. 2

2024

[12] [12]

GauHuman: Artic- ulated Gaussian Splatting from Monocular Human Videos

Shoukang Hu, Tao Hu, and Ziwei Liu. GauHuman: Artic- ulated Gaussian Splatting from Monocular Human Videos. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 20418–20431, 2024. 2

2024

[13] [13]

NeuMan: Neural Human Radiance Field from a Single Video

Wei Jiang, Kwang Moo Yi, Golnoosh Samei, Oncel Tuzel, and Anurag Ranjan. NeuMan: Neural Human Radiance Field from a Single Video. InEuropean Conference on Com- puter Vision, pages 402–418. Springer, 2022. 1, 2, 6, 7

2022

[14] [14]

3D Gaussian Splatting for Real-Time Radiance Field Rendering.ACM Trans

Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3D Gaussian Splatting for Real-Time Radiance Field Rendering.ACM Trans. Graph., 42(4):139– 1, 2023. 1, 2

2023

[15] [15]

HUGS: Human Gaussian Splats

Muhammed Kocabas, Jen-Hao Rick Chang, James Gabriel, Oncel Tuzel, and Anurag Ranjan. HUGS: Human Gaussian Splats. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 505–515, 2024. 1, 2

2024

[16] [16]

Fully Explicit Dynamic Gaussian Splat- ting.Advances in Neural Information Processing Systems, 37:5384–5409, 2024

Junoh Lee, ChangYeon Won, Hyunjun Jung, Inhwan Bae, and Hae-Gon Jeon. Fully Explicit Dynamic Gaussian Splat- ting.Advances in Neural Information Processing Systems, 37:5384–5409, 2024. 1, 2, 6

2024

[17] [17]

HOSNeRF: Dynamic Human-Object-Scene Neural Radiance Fields from a Single Video

Jia-Wei Liu, Yan-Pei Cao, Tianyuan Yang, Zhongcong Xu, Jussi Keppo, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. HOSNeRF: Dynamic Human-Object-Scene Neural Radiance Fields from a Single Video. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 18483–18494, 2023. 1, 2, 6

2023

[18] [18]

MoDGS: Dy- namic gaussian splatting from casually-captured monocular videos with depth priors

Qingming LIU, Yuan Liu, Jiepeng Wang, Xianqiang Lyu, Peng Wang, Wenping Wang, and Junhui Hou. MoDGS: Dy- namic gaussian splatting from casually-captured monocular videos with depth priors. InThe Thirteenth International Conference on Learning Representations, 2025. 1, 2

2025

[19] [19]

SMPL: A Skinned Multi-Person Linear Model.ACM Transactions on Graphics (TOG), 34(6):248:1–248:16, 2015

Matthew Loper, Naureen Mahmood, Javier Romero, Ger- ard Pons-Moll, and Michael J Black. SMPL: A Skinned Multi-Person Linear Model.ACM Transactions on Graphics (TOG), 34(6):248:1–248:16, 2015. 1, 2

2015

[20] [20]

NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis.Communications of the ACM, 65(1):99–106,

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis.Communications of the ACM, 65(1):99–106,

[21] [21]

GASPACHO: Gaus- sian Splatting for Controllable Humans and Objects.arXiv preprint arXiv:2503.09342, 2025

Aymen Mir, Arthur Moreau, Helisa Dhamo, Zhensong Zhang, and Eduardo P ´erez-Pellitero. GASPACHO: Gaus- sian Splatting for Controllable Humans and Objects.arXiv preprint arXiv:2503.09342, 2025. 2

work page arXiv 2025

[22] [22]

Expressive Whole-Body 3D Gaussian Avatar

Gyeongsik Moon, Takaaki Shiratori, and Shunsuke Saito. Expressive Whole-Body 3D Gaussian Avatar. InEuropean Conference on Computer Vision, pages 19–35. Springer,

[23] [23]

Nerfies: Deformable Neural Radiance Fields

Keunhong Park, Utkarsh Sinha, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Steven M Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable Neural Radiance Fields. InProceedings of the IEEE/CVF international con- ference on computer vision, pages 5865–5874, 2021. 1, 2, 6

2021

[24] [24]

HyperNeRF: A Higher- Dimensional Representation for Topologically Varying Neu- 9 ral Radiance Fields.arXiv preprint arXiv:2106.13228, 2021

Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Ricardo Martin- Brualla, and Steven M Seitz. HyperNeRF: A Higher- Dimensional Representation for Topologically Varying Neu- 9 ral Radiance Fields.arXiv preprint arXiv:2106.13228, 2021. 1, 2, 6

work page arXiv 2021

[25] [25]

Priyanka Patel and Michael J. Black. CameraHMR: Aligning People with Perspective.arXiv preprint arXiv:2411.08128,

work page arXiv

[26] [26]

Expressive Body Capture: 3D Hands, Face, and Body from a Single Image

Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed AA Osman, Dimitrios Tzionas, and Michael J Black. Expressive Body Capture: 3D Hands, Face, and Body from a Single Image. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10975–10985, 2019. 1, 2

2019

[27] [27]

Neural Body: Implicit Neural Representations with Structured La- tent Codes for Novel View Synthesis of Dynamic Humans

Sida Peng, Yuanqing Zhang, Yinghao Xu, Qianqian Wang, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Neural Body: Implicit Neural Representations with Structured La- tent Codes for Novel View Synthesis of Dynamic Humans. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 9054–9063, 2021. 2

2021

[28] [28]

3DGS-Avatar: Animatable Avatars via Deformable 3D Gaussian Splatting

Zhiyin Qian, Shaofei Wang, Marko Mihajlovic, Andreas Geiger, and Siyu Tang. 3DGS-Avatar: Animatable Avatars via Deformable 3D Gaussian Splatting. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5020–5030, 2024. 2

2024

[29] [29]

SAM 2: Segment Anything in Images and Videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junt- ing Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao- Yuan Wu, Ross Girshick, Piotr Doll´ar, and Christoph Feicht- enhofer. SAM 2: Segment Anything in Images and Videos. arXiv preprint arXiv:...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[30] [30]

PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digi- tization

Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Mor- ishima, Angjoo Kanazawa, and Hao Li. PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digi- tization. InProceedings of the IEEE/CVF international con- ference on computer vision, pages 2304–2314, 2019. 2

2019

[31] [31]

PIFuHD: Multi-Level Pixel-Aligned Implicit Function for High-Resolution 3D Human Digitization

Shunsuke Saito, Tomas Simon, Jason Saragih, and Hanbyul Joo. PIFuHD: Multi-Level Pixel-Aligned Implicit Function for High-Resolution 3D Human Digitization. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 84–93, 2020. 2

2020

[32] [32]

Structure- from-Motion Revisited

Johannes L Schonberger and Jan-Michael Frahm. Structure- from-Motion Revisited. InProceedings of the IEEE con- ference on computer vision and pattern recognition, pages 4104–4113, 2016. 1

2016

[33] [33]

DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation

Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. DreamGaussian: Generative Gaussian Splat- ting for Efficient 3D Content Creation.arXiv preprint arXiv:2309.16653, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023

[34] [34]

Hu- manNeRF: Free-Viewpoint Rendering of Moving People from Monocular Video

Chung-Yi Weng, Brian Curless, Pratul P Srinivasan, Jonathan T Barron, and Ira Kemelmacher-Shlizerman. Hu- manNeRF: Free-Viewpoint Rendering of Moving People from Monocular Video. InProceedings of the IEEE/CVF conference on computer vision and pattern Recognition, pages 16210–16220, 2022. 2

2022

[35] [35]

4D Gaussian Splatting for Real-Time Dynamic Scene Ren- dering

Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4D Gaussian Splatting for Real-Time Dynamic Scene Ren- dering. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 20310–20320,

[36] [36]

Dˆ2-NeRF: Self-Supervised Decoupling of Dynamic and Static Objects from a Monocu- lar Video.Advances in neural information processing sys- tems, 35:32653–32666, 2022

Tianhao Wu, Fangcheng Zhong, Andrea Tagliasacchi, For- rester Cole, and Cengiz Oztireli. Dˆ2-NeRF: Self-Supervised Decoupling of Dynamic and Static Objects from a Monocu- lar Video.Advances in neural information processing sys- tems, 35:32653–32666, 2022. 2, 6

2022

[37] [37]

Rendering Humans from Object-Occluded Monoc- ular Videos

Tiange Xiang, Adam Sun, Jiajun Wu, Ehsan Adeli, and Li Fei-Fei. Rendering Humans from Object-Occluded Monoc- ular Videos. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3239–3250, 2023. 1, 2

2023

[38] [38]

Wild2Avatar: Rendering Humans Behind Occlusions.arXiv preprint arXiv:2401.00431, 2024

Tiange Xiang, Adam Sun, Scott Delp, Kazuki Kozuka, Li Fei-Fei, and Ehsan Adeli. Wild2Avatar: Rendering Humans Behind Occlusions.arXiv preprint arXiv:2401.00431, 2024. 1, 2

work page arXiv 2024

[39] [39]

ICON: Implicit Clothed Humans Obtained from Normals

Yuliang Xiu, Jinlong Yang, Dimitrios Tzionas, and Michael J Black. ICON: Implicit Clothed Humans Obtained from Normals. In2022 IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), pages 13286–13296. IEEE, 2022. 2

2022

[40] [40]

ECON: Explicit Clothed Humans Optimized via Normal Integration

Yuliang Xiu, Jinlong Yang, Xu Cao, Dimitrios Tzionas, and Michael J Black. ECON: Explicit Clothed Humans Optimized via Normal Integration. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 512–523, 2023. 2

2023

[41] [41]

HSR: Holistic 3D Human-Scene Recon- struction from Monocular Videos

Lixin Xue, Chen Guo, Chengwei Zheng, Fangjinhua Wang, Tianjian Jiang, Hsuan-I Ho, Manuel Kaufmann, Jie Song, and Otmar Hilliges. HSR: Holistic 3D Human-Scene Recon- struction from Monocular Videos. InEuropean Conference on Computer Vision, 2024. 2

2024

[42] [42]

Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction

Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 20331–20341, 2024. 1, 2, 6

2024

[43] [43]

Zetong Zhang, Manuel Kaufmann, Lixin Xue, Jie Song, and Martin R. Oswald. ODHSR: Online Dense 3D Reconstruc- tion of Humans and Scenes from Monocular Videos. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 2 10

2025