pith. machine review for the scientific record.

arxiv: 2604.04016 · v1 · submitted 2026-04-05 · 💻 cs.CV · cs.AI

Recognition: no theorem link

HOIGS: Human-Object Interaction Gaussian Splatting

Dongyoon Wee, Geonho Cha, Jaehyun Pyun, Joonsik Nam, Kyeongbo Kong, Suk-Ju Kang, Suwoong Yeom, Taewoo Kim, Yun-Seong Jeong

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 17:20 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords human-object interaction · gaussian splatting · dynamic scene reconstruction · cross attention · deformation modeling · 4d reconstruction · computer vision · neural rendering

The pith

HOIGS reconstructs interactive human-object scenes by fusing distinct deformation models with cross-attention.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a method to reconstruct dynamic scenes involving humans and objects by explicitly modeling their interactions. It separates deformation modeling for humans using HexPlane and for objects using Cubic Hermite Spline, then fuses them with a cross-attention module to capture how interactions affect each other's motions. This addresses limitations in prior Gaussian Splatting approaches that either ignore objects or treat all motions uniformly. The result is improved handling of complex scenarios like occlusion and contact, leading to higher fidelity reconstructions as validated on several datasets. Readers care because accurate modeling of such interactions is essential for realistic virtual environments and simulation.

Core claim

By integrating heterogeneous features from HexPlane for humans and Cubic Hermite Spline for objects via a cross-attention-based HOI module, HOIGS effectively captures interdependent motions and improves deformation estimation in scenarios involving occlusion, contact, and object manipulation. Comprehensive experiments on multiple datasets demonstrate that our method consistently outperforms state-of-the-art human-centric and 4D Gaussian approaches, highlighting the importance of explicitly modeling human-object interactions for high-fidelity reconstruction.

What carries the argument

Cross-attention-based HOI module integrating HexPlane human features and Cubic Hermite Spline object features to model interaction-induced deformations.
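
To make the fusion concrete, here is a minimal PyTorch sketch of a cross-attention block in this style. Only the token shapes come from the paper (16 SMPL-X part tokens of dimension 96, per the Figure 4 caption); the class name, head count, and deformation head are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class HOICrossAttention(nn.Module):
        """Hypothetical cross-attention fusion in the style of the HOI module.

        Per-Gaussian object features (queries) attend to human part
        features (keys/values), so each object Gaussian's deformation
        can depend on nearby human motion.
        """

        def __init__(self, dim: int = 96, num_heads: int = 4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)
            self.deform_head = nn.Linear(dim, 3)  # per-Gaussian xyz offset

        def forward(self, object_feats, human_parts):
            # object_feats: (B, N, 96) CHS-derived per-Gaussian features
            # human_parts:  (B, 16, 96) time-queried HexPlane part tokens
            fused, _ = self.attn(object_feats, human_parts, human_parts)
            fused = self.norm(object_feats + fused)  # residual + norm
            return self.deform_head(fused)           # (B, N, 3) offsets

    # Toy usage with random tensors:
    module = HOICrossAttention()
    offsets = module(torch.randn(1, 1024, 96), torch.randn(1, 16, 96))

Whether the real module regresses offsets, attention weights, or full deformation parameters is not recoverable from the figure caption alone; the point is only that queries and keys come from different deformation baselines.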

If this is right

  • Explicitly modeling human-object interactions leads to better deformation estimation under occlusion and contact.
  • The method outperforms existing human-centric and 4D Gaussian Splatting approaches on multiple datasets.
  • Interdependent motions between humans and objects are captured more accurately through feature fusion.
  • High-fidelity reconstruction of manipulation scenarios is achieved by using heterogeneous deformation baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This fusion strategy might be applicable to other multi-entity dynamic scenes, such as multiple humans or animal-object interactions.
  • If the module generalizes well, it could enable more robust real-world applications in robotics for predicting interaction outcomes.
  • Extending the approach with physics constraints could further improve accuracy in contact-rich scenarios.

Load-bearing premise

The cross-attention HOI module reliably extracts and fuses interaction-induced deformations from the HexPlane and Cubic Hermite Spline baselines without introducing new artifacts or requiring scene-specific tuning.

What would settle it

A direct comparison on datasets with significant occlusion and contact: if HOIGS produced more artifacts or lower-quality reconstructions than a unified motion-field approach, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2604.04016 by Dongyoon Wee, Geonho Cha, Jaehyun Pyun, Joonsik Nam, Kyeongbo Kong, Suk-Ju Kang, Suwoong Yeom, Taewoo Kim, Yun-Seong Jeong.

Figure 1
Figure 1: 4DGS-based and human-centric models fail to render HOI scenes. Achieving high-quality view synthesis and 4D reconstruction in these scenarios is critical for downstream applications such as immersive virtual reality, metaverse content creation, and 3D animation. However, accurately modeling the complex spatial and temporal dependencies between humans and objects remains a challenging task, particularly gi… view at source ↗
Figure 2
Figure 2: Overview of the Proposed Framework. Given an input video sequence, we first extract object-specific information to reconstruct the 3D object shape via a diffusion prior. Based on the reconstructed shape, we initialize 3D Gaussians for each keyframe and use a spline-based deformation as our baseline. Here, time-invariant and time-varying HexPlane features are employed for canonical human and interaction mod… view at source ↗
Figure 3
Figure 3: Spline-based Object Motion Modeling. To ensure temporally continuous and physically plausible object trajectories, we parameterize the sequence of 3D Gaussians using a Cubic Hermite Spline (CHS). Since these canonical Gaussians lack physical scale and world alignment, we explicitly transform them into the world coordinate system. To compute the projected 3D Gaussian bounding box area (A_t^proj), we project … view at source ↗ (a minimal CHS sketch follows the figure list)
Figure 4
Figure 4: Cross-Attention-based HOI Module. The proposed architecture estimates human-object interactions using human body part features and per-Gaussian object representations, where V_p denotes the set of vertices in part p, and ϕ queries the HexPlane feature at time t. Each part feature f_p(t) ∈ R^96, and the human feature set comprises 16 tokens, yielding a 16 × 96 representation corresponding to SMPL-X semantic p… view at source ↗
Figure 5
Figure 5: Qualitative comparison of reconstructed rendered views. We compare our method (HOIGS) against state-of-the-art baselines on both the HOSNeRF [26] and BEHAVE [3] datasets. We display the full-frame (top) rendering and a zoom-in (bottom) of the red Region of Interest (ROI). view at source ↗
Figure 6
Figure 6: Qualitative comparison of reconstructed rendered view results on the ARCTIC [9] dataset. view at source ↗
Figure 7
Figure 7: More qualitative comparison between ground truth and our HOIGS on the HOSNeRF [26] dataset (panels: Ground Truth, Ours, ExAvatar [31], E-D3DGS [2], 4DGS [44]). view at source ↗
Figure 8
Figure 8: 4DGS-based and human-centric models fail to render HOI scenes on the BEHAVE [3] dataset. view at source ↗
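
The spline baseline in Figure 3 is standard enough to sketch. Below is a generic Cubic Hermite Spline evaluator over keyframed Gaussian centers, written as a minimal illustration under the assumption that keyframe positions and tangents are given; it is not the authors' implementation.

    import torch

    def cubic_hermite(p0, p1, m0, m1, t):
        """Evaluate one Cubic Hermite Spline segment at t in [0, 1].

        p0, p1: (N, 3) Gaussian centers at the two keyframes.
        m0, m1: (N, 3) tangents at those keyframes.
        Returns interpolated centers of shape (N, 3).
        """
        t2, t3 = t * t, t * t * t
        h00 = 2 * t3 - 3 * t2 + 1   # Hermite basis functions
        h10 = t3 - 2 * t2 + t
        h01 = -2 * t3 + 3 * t2
        h11 = t3 - t2
        return h00 * p0 + h10 * m0 + h01 * p1 + h11 * m1

    # Toy usage: interpolate 100 Gaussian centers halfway between keyframes.
    p0, p1 = torch.randn(100, 3), torch.randn(100, 3)
    m0, m1 = torch.zeros(100, 3), torch.zeros(100, 3)  # finite-difference tangents in practice
    centers = cubic_hermite(p0, p1, m0, m1, t=0.5)

How the tangents are chosen (learned, finite-difference, or Catmull-Rom style) determines how smooth the object trajectory is across keyframes; the caption does not say which variant the paper uses.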
read the original abstract

Reconstructing dynamic scenes with complex human-object interactions is a fundamental challenge in computer vision and graphics. Existing Gaussian Splatting methods either rely on human pose priors while neglecting dynamic objects, or approximate all motions within a single field, limiting their ability to capture interaction-rich dynamics. To address this gap, we propose Human-Object Interaction Gaussian Splatting (HOIGS), which explicitly models interaction-induced deformation between humans and objects through a cross-attention-based HOI module. Distinct deformation baselines are employed to extract features: HexPlane for humans and Cubic Hermite Spline (CHS) for objects. By integrating these heterogeneous features, HOIGS effectively captures interdependent motions and improves deformation estimation in scenarios involving occlusion, contact, and object manipulation. Comprehensive experiments on multiple datasets demonstrate that our method consistently outperforms state-of-the-art human-centric and 4D Gaussian approaches, highlighting the importance of explicitly modeling human-object interactions for high-fidelity reconstruction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces HOIGS, a Gaussian Splatting method for dynamic human-object interaction scenes. It employs separate deformation baselines—HexPlane for humans and Cubic Hermite Splines for objects—fused via a cross-attention HOI module to capture interdependent motions, claiming consistent outperformance over human-centric and 4D Gaussian baselines on multiple datasets, particularly in occlusion, contact, and manipulation scenarios.

Significance. If the central claim holds, the work advances dynamic scene reconstruction by explicitly separating and fusing heterogeneous deformation representations rather than relying on a single field or human-pose priors alone. This addresses a clear gap in handling interaction-induced deformations and could improve fidelity for downstream applications in graphics and robotics.

major comments (2)
  1. §3.2 (HOI module): The cross-attention fusion of HexPlane (volumetric 4D planes) and Cubic Hermite Spline (parametric curve) features assumes the raw feature vectors are already commensurable in space and time, without any explicit alignment, contact regularization, or projection step. This is load-bearing for the claim that improvements arise from interaction modeling rather than added capacity; no derivation or constraint guarantees consistency at contact points or under occlusion.
  2. §4 (Experiments): The reported gains on multiple datasets are presented without an ablation isolating the HOI module's contribution from the capacity increase of the HexPlane+CHS baselines, and without details on dataset splits or implementation specifics. This undermines the ability to attribute outperformance specifically to interdependent motion capture.
minor comments (2)
  1. Notation for the fused feature vectors in the HOI module is introduced without a clear diagram or equation reference showing the exact attention formulation and output dimensionality.
  2. Figure 5 (qualitative results) would benefit from side-by-side error maps highlighting deformation inconsistencies at contact regions to visually support the central claim.
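
For reference, the error maps this comment asks for are cheap to produce from a rendered frame and its ground truth; a minimal sketch, assuming (3, H, W) float images in [0, 1] (the function name is ours, not from the paper):

    import torch

    def l1_error_map(rendered: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
        """Per-pixel L1 error, averaged over color channels.

        rendered, gt: (3, H, W) images in [0, 1].
        Returns an (H, W) heatmap; bright regions would flag
        deformation inconsistencies, e.g., at contact boundaries.
        """
        return (rendered - gt).abs().mean(dim=0)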

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each major point below with clarifications on our design and planned revisions to strengthen the presentation of the HOI module and experimental validation.

read point-by-point responses
  1. Referee: §3.2 (HOI module): The cross-attention fusion of HexPlane (volumetric 4D planes) and Cubic Hermite Spline (parametric curve) features assumes the raw feature vectors are already commensurable in space and time, without any explicit alignment, contact regularization, or projection step. This is load-bearing for the claim that improvements arise from interaction modeling rather than added capacity; no derivation or constraint guarantees consistency at contact points or under occlusion.

    Authors: The cross-attention mechanism in the HOI module is designed to learn dynamic correspondences and weighting between the heterogeneous HexPlane and CHS features during optimization, rather than assuming pre-aligned inputs. The attention operates on the extracted feature vectors to capture interaction-induced deformations in a data-driven manner, with consistency at contact points and under occlusion enforced implicitly through the photometric and geometric reconstruction losses on real interaction sequences. We acknowledge the absence of an explicit theoretical derivation or contact regularization term guaranteeing these properties; the improvements are supported empirically. To better isolate the contribution beyond added capacity, we will add an ablation removing the cross-attention fusion in the revision. revision: partial

  2. Referee: §4 (Experiments): The reported gains on multiple datasets are presented without an ablation isolating the HOI module's contribution from the capacity increase of the HexPlane+CHS baselines, and without details on dataset splits or implementation specifics. This undermines the ability to attribute outperformance specifically to interdependent motion capture.

    Authors: We agree that an explicit ablation isolating the HOI module is essential to attribute gains to interaction modeling. In the revised version we will include a new ablation comparing the full model against a variant that uses separate HexPlane and CHS baselines without the cross-attention fusion. We will also expand the supplementary material with precise dataset splits (train/test sequence IDs), training hyperparameters, and implementation details to enable full reproducibility. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation; new architectural components introduced without self-referential fits or load-bearing self-citations

full rationale

The paper presents HOIGS as a novel integration of HexPlane for human deformation and Cubic Hermite Splines for objects, fused via a cross-attention HOI module. No equations are shown that reduce by construction to fitted inputs or prior self-citations; the central claim rests on the new module's ability to handle heterogeneous features rather than renaming or re-deriving existing results. The derivation chain is self-contained with independent architectural choices; the reader's informal assessment of 2.0 is consistent with this, and the paper registers 0 flagged steps under strict rules requiring explicit reduction quotes.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that interaction deformations can be captured by fusing two pre-existing motion representations through attention; no new physical axioms or invented entities are introduced.

pith-pipeline@v0.9.0 · 5487 in / 1087 out tokens · 31455 ms · 2026-05-13T17:20:52.985578+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 1 internal anchor

  1. [1] Alldieck, T., Zanfir, M., Sminchisescu, C.: Photorealistic monocular 3d reconstruction of humans wearing clothing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1506–1515 (2022)

  2. [2] Bae, J., Kim, S., Yun, Y., Lee, H., Bang, G., Uh, Y.: Per-gaussian embedding-based deformation for deformable 3d gaussian splatting. In: European Conference on Computer Vision. pp. 321–335. Springer (2024)

  3. [3] Bhatnagar, B.L., Xie, X., Petrov, I.A., Sminchisescu, C., Theobalt, C., Pons-Moll, G.: Behave: Dataset and method for tracking human object interactions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15935–15946 (2022)

  4. [4] Cao, A., Johnson, J.: Hexplane: A fast representation for dynamic scenes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 130–141 (2023)

  5. [5] Chen, S., Guo, H., Zhu, S., Zhang, F., Huang, Z., Feng, J., Kang, B.: Video depth anything: Consistent depth estimation for super-long videos. arXiv preprint arXiv:2501.12375 (2025)

  6. [6] Chu, W.H., Ke, L., Fragkiadaki, K.: Dreamscene4d: Dynamic multi-object scene generation from monocular videos. NeurIPS (2024)

  7. [7] Duisterhof, B.P., Zust, L., Weinzaepfel, P., Leroy, V., Cabon, Y., Revaud, J.: MASt3R-SfM: a fully-integrated solution for unconstrained structure-from-motion. In: International Conference on 3D Vision 2025 (2025), https://openreview.net/forum?id=5uw1GRBFoT

  8. [8] Fan, Z., Parelli, M., Kadoglou, M.E., Chen, X., Kocabas, M., Black, M.J., Hilliges, O.: Hold: Category-agnostic 3d reconstruction of interacting hands and objects from video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 494–504 (2024)

  9. [9] Fan, Z., Taheri, O., Tzionas, D., Kocabas, M., Kaufmann, M., Black, M.J., Hilliges, O.: Arctic: A dataset for dexterous bimanual hand-object manipulation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12943–12954 (2023)

  10. [10] Fridovich-Keil, S., Meanti, G., Warburg, F.R., Recht, B., Kanazawa, A.: K-planes: Explicit radiance fields in space, time, and appearance. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12479–12488 (2023)

  11. [11] Guo, C., Jiang, T., Chen, X., Song, J., Hilliges, O.: Vid2avatar: 3d avatar reconstruction from videos in the wild via self-supervised scene decomposition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12858–12868 (2023)

  12. [12] Hu, L., Zhang, H., Zhang, Y., Zhou, B., Liu, B., Zhang, S., Nie, L.: Gaussianavatar: Towards realistic human avatar modeling from a single video via animatable 3d gaussians. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 634–644 (2024)

  13. [13] Hu, M., Yin, W., Zhang, C., Cai, Z., Long, X., Chen, H., Wang, K., Yu, G., Shen, C., Shen, S.: Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)

  14. [14] Hu, S., Hu, T., Liu, Z.: Gauhuman: Articulated gaussian splatting from monocular human videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20418–20431 (2024)

  15. [15] Hu, W., Gao, X., Li, X., Zhao, S., Cun, X., Zhang, Y., Quan, L., Shan, Y.: Depthcrafter: Generating consistent long depth sequences for open-world videos. In: CVPR (2025)

  16. [16] Huang, Z., Xu, Y., Lassner, C., Li, H., Tung, T.: Arch: Animatable reconstruction of clothed humans. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3093–3102 (2020)

  17. [17] Jiang, W., Yi, K.M., Samei, G., Tuzel, O., Ranjan, A.: Neuman: Neural human radiance field from a single video. In: European Conference on Computer Vision. pp. 402–418. Springer (2022)

  18. [18] Jung, H., Brasch, N., Song, J., Perez-Pellitero, E., Zhou, Y., Li, Z., Navab, N., Busam, B.: Deformable 3d gaussian splatting for animatable human avatars. arXiv preprint arXiv:2312.15059 (2023)

  19. [19] Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J.: End-to-end recovery of human shape and pose. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7122–7131 (2018)

  20. [20] Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 42(4), 139:1–139:14 (2023)

  21. [21] Kim, S., Do, S., Park, J.: Showmak3r: Compositional tv show reconstruction. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 864–874 (2025)

  22. [22] Kocabas, M., Chang, J.H.R., Gabriel, J., Tuzel, O., Ranjan, A.: Hugs: Human gaussian splats. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 505–515 (2024)

  23. [23] Lee, J., Won, C., Jung, H., Bae, I., Jeon, H.G.: Fully explicit dynamic gaussian splatting. Advances in Neural Information Processing Systems 37, 5384–5409 (2024)

  24. [24] Li, Z., Chen, Z., Li, Z., Xu, Y.: Spacetime gaussian feature splatting for real-time dynamic view synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8508–8520 (2024)

  25. [25] Liao, T., Zhang, X., Xiu, Y., Yi, H., Liu, X., Qi, G.J., Zhang, Y., Wang, X., Zhu, X., Lei, Z.: High-fidelity clothed avatar reconstruction from a single image. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8662–8672 (2023)

  26. [26] Liu, J.W., Cao, Y.P., Yang, T., Xu, Z., Keppo, J., Shan, Y., Qie, X., Shou, M.Z.: Hosnerf: Dynamic human-object-scene neural radiance fields from a single video. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 18483–18494 (2023)

  27. [27] Liu, Y., Huang, X., Qin, M., Lin, Q., Wang, H.: Animatable 3d gaussian: Fast and high-quality reconstruction of multiple human avatars. In: Proceedings of the 32nd ACM International Conference on Multimedia. pp. 1120–1129 (2024)

  28. [28] Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: Smpl: A skinned multi-person linear model. In: Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pp. 851–866 (2023)

  29. [29] Massa, F., Girshick, R.: maskrcnn-benchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch. https://github.com/facebookresearch/maskrcnn-benchmark (2018)

  30. [30] Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM 65(1), 99–106 (2021)

  31. [31] Moon, G., Shiratori, T., Saito, S.: Expressive whole-body 3d gaussian avatar. In: European Conference on Computer Vision. pp. 19–35. Springer (2024)

  32. [32] Park, K., Sinha, U., Barron, J.T., Bouaziz, S., Goldman, D.B., Seitz, S.M., Martin-Brualla, R.: Nerfies: Deformable neural radiance fields. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5865–5874 (2021)

  33. [33] Park, K., Sinha, U., Hedman, P., Barron, J.T., Bouaziz, S., Goldman, D.B., Martin-Brualla, R., Seitz, S.M.: Hypernerf: A higher-dimensional representation for topologically varying neural radiance fields. arXiv preprint arXiv:2106.13228 (2021)

  34. [34] Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A., Tzionas, D., Black, M.J.: Expressive body capture: 3d hands, face, and body from a single image. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10975–10985 (2019)

  35. [35] Peng, S., Zhang, Y., Xu, Y., Wang, Q., Shuai, Q., Bao, H., Zhou, X.: Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9054–9063 (2021)

  36. [36] Qian, Z., Wang, S., Mihajlovic, M., Geiger, A., Tang, S.: 3dgs-avatar: Animatable avatars via deformable 3d gaussian splatting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5020–5030 (2024)

  37. [37] Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K.V., Carion, N., Wu, C.Y., Girshick, R., Dollár, P., Feichtenhofer, C.: Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024), https://arxiv.org/abs/2408.00714

  38. [38] Saito, S., Huang, Z., Natsume, R., Morishima, S., Kanazawa, A., Li, H.: Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2304–2314 (2019)

  39. [39] Saito, S., Simon, T., Saragih, J., Joo, H.: Pifuhd: Multi-level pixel-aligned implicit function for high-resolution 3d human digitization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 84–93 (2020)

  40. [40] Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4104–4113 (2016)

  41. [41] Tang, J., Ren, J., Zhou, H., Liu, Z., Zeng, G.: Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653 (2023)

  42. [42] Wen, J., Zhao, X., Ren, Z., Schwing, A.G., Wang, S.: Gomavatar: Efficient animatable human modeling from monocular video using gaussians-on-mesh. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2059–2069 (2024)

  43. [43] Weng, C.Y., Curless, B., Srinivasan, P.P., Barron, J.T., Kemelmacher-Shlizerman, I.: Humannerf: Free-viewpoint rendering of moving people from monocular video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16210–16220 (2022)

  44. [44] Wu, G., Yi, T., Fang, J., Xie, L., Zhang, X., Wei, W., Liu, W., Tian, Q., Wang, X.: 4d gaussian splatting for real-time dynamic scene rendering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20310–20320 (2024)

  45. [45] Wu, T., Zhong, F., Tagliasacchi, A., Cole, F., Oztireli, C.: D^2nerf: Self-supervised decoupling of dynamic and static objects from a monocular video. Advances in Neural Information Processing Systems 35, 32653–32666 (2022)

  46. [46] Xiu, Y., Yang, J., Cao, X., Tzionas, D., Black, M.J.: Econ: Explicit clothed humans optimized via normal integration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 512–523 (2023)

  47. [47] Xiu, Y., Yang, J., Tzionas, D., Black, M.J.: Icon: Implicit clothed humans obtained from normals. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 13286–13296. IEEE (2022)

  48. [48] Yang, C.Y., Huang, H.W., Chai, W., Jiang, Z., Hwang, J.N.: Samurai: Adapting segment anything model for zero-shot visual tracking with motion-aware memory (2024), https://arxiv.org/abs/2411.11922

  49. [49] Yang, J., Gao, M., Li, Z., Gao, S., Wang, F., Zheng, F.: Track anything: Segment anything meets videos (2023)

  50. [50] Yang, Z., Yang, H., Pan, Z., Zhang, L.: Real-time photorealistic dynamic scene representation and rendering with 4d gaussian splatting. arXiv preprint arXiv:2310.10642 (2023)

  51. [51] Yang, Z., Gao, X., Zhou, W., Jiao, S., Zhang, Y., Jin, X.: Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20331–20341 (2024)

  52. [52] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 586–595 (2018)