High-Fidelity 4D Hand-Object Capture via Multi-View Spatiotemporal Tracking and Physics-Aware Gaussians

Bo Peng; Deying Kong; Hidenobu Matsuki; Jingjing Shen; Juyong Zhang; Mingsong Dou; Xu Chen; Yi Gu; Zhengyang Shen

arxiv: 2606.15908 · v2 · pith:Z7CHG7KSnew · submitted 2026-06-14 · 💻 cs.CV

High-Fidelity 4D Hand-Object Capture via Multi-View Spatiotemporal Tracking and Physics-Aware Gaussians

Bo Peng , Xu Chen , Yi Gu , Hidenobu Matsuki , Mingsong Dou , Jingjing Shen , Deying Kong , Juyong Zhang

show 1 more author

Zhengyang Shen

This is my paper

Pith reviewed 2026-06-27 03:29 UTC · model grok-4.3

classification 💻 cs.CV

keywords 4D reconstructionhand-object interactionmulti-view trackingGaussian optimizationphysics-aware reconstructiontemplate-free capturespatiotemporal transformer

0 comments

The pith

A multi-view transformer initializes hand-object poses from video, then physics-aware Gaussians refine them into template-free 4D reconstructions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to remove reliance on pre-scanned object templates and physical markers when capturing 4D hand-object interactions from synchronized multi-view videos. It does so by first running a feed-forward transformer that fuses cross-view geometry with temporal information to produce consistent starting values for hand and object poses plus dense object shape. These initials then feed into a second stage that optimizes a set of Gaussians under explicit tetrahedral, collision, and appearance constraints, yielding reconstructions that remain physically plausible and visually clean even when hands and objects heavily occlude each other. A reader would care because the method promises to turn ordinary calibrated video into high-fidelity 4D assets at scale for embodied AI and spatial computing.

Core claim

The central claim is that a two-component pipeline—a multi-view feed-forward transformer supplying metric-consistent initial poses and dense geometry, followed by a physics-aware Gaussian optimization that adds tetrahedral constraints, collision refinement, and appearance decomposition—delivers robust, artifact-free 4D hand-object reconstructions from calibrated multi-view video without any object templates or markers.

What carries the argument

The multi-view feed-forward transformer for cross-view and temporal initialization together with the physics-aware Gaussian optimization framework that enforces tetrahedral constraints and collision refinement.

If this is right

The method produces physically plausible and visually accurate 4D hand-object reconstructions from ordinary synchronized multi-view video.
No pre-scanned object templates or physical markers are required at capture time.
The reconstructions remain robust and artifact-free even under the occlusions inherent to hand-object interactions.
The pipeline supplies an efficient route to automated generation of large-scale 4D interaction assets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same initialization-plus-physics pipeline could be tested on fewer cameras to determine how many views are strictly necessary for reliable convergence.
The physics constraints may incidentally improve compatibility with downstream physics simulators that consume the reconstructed meshes.
Because the approach avoids object-specific templates, it could be applied directly to novel everyday objects without additional scanning steps.

Load-bearing premise

The transformer stage supplies initial pose and geometry estimates accurate enough for the physics-aware Gaussian optimizer to reach a stable, artifact-free solution under the severe occlusions typical of hand-object interactions.

What would settle it

Apply the full pipeline to a multi-view sequence containing known ground-truth poses and dense geometry under heavy occlusion; if the final output exhibits visible interpenetrations or large surface errors relative to the ground truth, the claim is falsified.

Figures

Figures reproduced from arXiv: 2606.15908 by Bo Peng, Deying Kong, Hidenobu Matsuki, Jingjing Shen, Juyong Zhang, Mingsong Dou, Xu Chen, Yi Gu, Zhengyang Shen.

**Figure 1.** Figure 1: Overview of our framework. Our pipeline operates in two stages. First, the Hand Object Spatiotemporal Transformer (HOST) processes multi-view videos to robustly regress parametric hand/object poses, dense object point clouds, and segmentation masks, yielding a metric 3D initialization. Second, the Hand Object Physicsaware Gaussian (HOPG) module leverages this initialization to optimize a hybrid 2D Gaussi… view at source ↗

**Figure 2.** Figure 2: Overview of our composite training dataset. Left: [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗

**Figure 3.** Figure 3: Automated, Scalable High-Fidelity Capture across Diverse Objects. [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗

**Figure 4.** Figure 4: High-Fidelity Object Geometry. Detailed 3D meshes of complex, realworld objects extracted via TSDF fusion. 4DGS Dfm3DGS Ours GT 4DGS Dfm3DGS Ours GT [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗

**Figure 5.** Figure 5: Qualitative comparison on our Custom-Captured Dataset. [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗

**Figure 6.** Figure 6: InterHand2.6M Validation and View Robustness. (a) [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗

**Figure 7.** Figure 7: Ablation studies. (a) Geometry: Our tetrahedral formulation (Tet + 3D ARAP) prevents the severe geometric flattening and volume loss seen in surface-only baselines. (b) Collision: Our offline collision optimization successfully resolves nonphysical inter-penetrations. ground-truth segmentation masks, optimizing exclusively via photometric rendering loss exposes the critical weakness of 2D ARAP: its inabi… view at source ↗

**Figure 8.** Figure 8: Qualitative Ablation on Kinematic Representation and ICP Refine [PITH_FULL_IMAGE:figures/full_fig_p021_8.png] view at source ↗

**Figure 9.** Figure 9: Detailed Architecture of HOST. The network extracts features using a frozen DINOv2 backbone integrated with camera and time embeddings. Given an input sequence of F frames from V views (S = F ×V images total), 12 spatiotemporal blocks process the tokens by alternating between spatial (cross-view) and temporal (crossframe) attention. These shared feature representations are then fed into task-specific head… view at source ↗

read the original abstract

The growing demand for high-fidelity 4D hand-object interaction (HOI) data in embodied AI and spatial computing is currently bottlenecked by the reliance on pre-scanned object templates and physical markers. While recent methods have demonstrated promising results in reconstructing 4D hand-object interaction from videos, they are highly sensitive to initial estimates of hand and object poses. Yet, estimating these poses from images is challenging, in particular under severe occlusion which is inherent in hand-object interaction scenarios. We propose a novel system for the robust and accurate reconstruction of hands and objects from synchronized and calibrated multi-view videos without requiring any templates or markers. Our system consists of two main components with key innovations: (1) a multi-view feed-forward transformer model that aggregates cross-view geometry and temporal cues to provide a reliable, metric-consistent initialization for both poses and dense object geometry, and (2) a hand-object physics-aware Gaussian-based optimization framework to refine the initial estimates, integrating tetrahedral constraints, collision refinement, and appearance decomposition to produce physically plausible and visually accurate reconstruction. Validated on public benchmarks and an extensive internal dataset, our pipeline achieves highly robust, artifact-free reconstruction, providing an efficient foundation for automated 4D asset generation. Our project page are available at https://zyshen021.github.io/HOSTPG/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a practical two-stage pipeline for markerless 4D hand-object capture that pairs a multi-view transformer initializer with physics-constrained Gaussian refinement, but the abstract leaves the quantitative gains and failure modes unclear.

read the letter

The core contribution is a system that first runs a feed-forward multi-view transformer to produce initial hand and object poses plus dense geometry, then feeds those into a Gaussian optimization stage that adds tetrahedral constraints, collision terms, and appearance decomposition. This directly targets the initialization sensitivity that has limited earlier HOI methods under heavy occlusion.

The approach is sensible on paper. Removing the need for pre-scanned templates and markers is a real practical step forward for scaling capture, and layering physics constraints on top of learned initialization is a reasonable way to stabilize the output. The claim of artifact-free results on public benchmarks plus an internal dataset suggests the pipeline can handle real interaction footage.

The main weakness is that the abstract supplies no numbers, no ablation tables, and no error breakdowns, so it is impossible to judge how much the transformer actually improves the starting point or how critical the physics terms are versus standard Gaussian splatting. The internal dataset also cannot be checked for independence or diversity. Without those details the central handoff from initialization to refinement remains an assumption rather than a demonstrated result.

This is the kind of work that matters to groups building embodied AI datasets or spatial computing tools. A reader who needs concrete 4D HOI assets would get value from the method description and project page even before full replication. It is worth sending to peer review so the experiments and implementation choices can be examined properly.

Referee Report

2 major / 1 minor

Summary. The paper proposes a template-free, marker-free system for high-fidelity 4D hand-object interaction reconstruction from synchronized multi-view videos. It comprises two stages: (1) a multi-view feed-forward transformer that aggregates cross-view geometry and temporal information to produce metric-consistent initial hand poses and dense object geometry, and (2) a physics-aware Gaussian optimization stage that refines these estimates via tetrahedral constraints, collision refinement, and appearance decomposition. The authors claim the pipeline yields robust, artifact-free results on public benchmarks and an internal dataset.

Significance. If the central claims hold, the work would be significant for embodied AI and spatial computing by removing the practical requirement for pre-scanned object templates and markers, thereby enabling scalable 4D HOI asset generation. The combination of a learned multi-view initializer with a physics-constrained Gaussian refinement is a coherent technical direction. However, the manuscript supplies no quantitative metrics, ablation studies, or error analysis, preventing any assessment of whether the claimed robustness is actually achieved.

major comments (2)

[Abstract] Abstract: the assertion that the pipeline 'achieves highly robust, artifact-free reconstruction' on benchmarks and an internal dataset is unsupported by any reported metrics (e.g., MPJPE, object Chamfer distance, or success rates under occlusion), ablation studies, or implementation details. This absence is load-bearing for the central claim that the method overcomes the initialization sensitivity of prior work.
[Abstract] Abstract (and implied methods): the handoff from the multi-view transformer initialization to the physics-aware Gaussian optimization is presented as reliable under severe occlusions, yet no analysis of convergence behavior, failure modes, or sensitivity to initialization error is provided. Without such evidence the weakest assumption of the pipeline remains untested.

minor comments (1)

[Abstract] Abstract: grammatical error in 'Our project page are available' (should read 'is available').

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We will revise the manuscript to include quantitative metrics, ablation studies, and analysis of the optimization stage to support the claims made in the abstract.

read point-by-point responses

Referee: [Abstract] Abstract: the assertion that the pipeline 'achieves highly robust, artifact-free reconstruction' on benchmarks and an internal dataset is unsupported by any reported metrics (e.g., MPJPE, object Chamfer distance, or success rates under occlusion), ablation studies, or implementation details. This absence is load-bearing for the central claim that the method overcomes the initialization sensitivity of prior work.

Authors: We agree that the abstract's claim requires supporting quantitative evidence. The full manuscript includes validation on benchmarks, but to make this explicit and address the concern, we will add detailed quantitative results, including MPJPE, Chamfer distances, and ablation studies in a new or expanded results section in the revised version. revision: yes
Referee: [Abstract] Abstract (and implied methods): the handoff from the multi-view transformer initialization to the physics-aware Gaussian optimization is presented as reliable under severe occlusions, yet no analysis of convergence behavior, failure modes, or sensitivity to initialization error is provided. Without such evidence the weakest assumption of the pipeline remains untested.

Authors: We acknowledge the need for analysis on the reliability of the handoff between stages, particularly under occlusions. In the revision, we will include an analysis of convergence behavior, potential failure modes, and sensitivity to initialization errors, perhaps through additional experiments or visualizations. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The abstract and context describe a two-component pipeline: a multi-view feed-forward transformer providing initialization, followed by a separate physics-aware Gaussian optimization stage that refines estimates using tetrahedral constraints, collision refinement, and appearance decomposition. No equations, parameter-fitting steps, self-citations, or uniqueness theorems are supplied that would reduce any claimed output to the inputs by construction. Validation is stated to occur on public benchmarks plus an internal dataset, with no load-bearing dependency on prior author work that itself lacks independent verification. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no equations, implementation details, or experimental sections are present from which free parameters, axioms, or invented entities can be extracted.

pith-pipeline@v0.9.1-grok · 5797 in / 1089 out tokens · 46349 ms · 2026-06-27T03:29:57.580167+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

68 extracted references · 15 canonical work pages · 10 internal anchors

[1]

Anonymous: View-efficient photometric hand performance capture at scale (2026), under review

2026
[2]

In: Sensor fusion IV: control paradigms and data structures

Besl, P.J., McKay, N.D.: Method for registration of 3-d shapes. In: Sensor fusion IV: control paradigms and data structures. vol. 1611, pp. 586–606. Spie (1992)

1992
[3]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Chao,Y.W.,Yang,W.,Xiang, Y.,Molchanov,P.,Handa,A., Tremblay, J.,Narang, Y.S., Van Wyk, K., Iqbal, U., Birchfield, S., et al.: Dexycb: A benchmark for capturing hand grasping of objects. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9044–9053 (2021)

2021
[4]

IEEE Transactions on Visualization and Computer Graphics 31(9), 6100–6111 (2024) 16 B

Chen, D., Li, H., Ye, W., Wang, Y., Xie, W., Zhai, S., Wang, N., Liu, H., Bao, H., Zhang, G.: Pgsr: Planar-based gaussian splatting for efficient and high-fidelity sur- face reconstruction. IEEE Transactions on Visualization and Computer Graphics 31(9), 6100–6111 (2024) 16 B. Peng et al

2024
[6]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)

Chen, X., Wang, B., Shum, H.Y.: Hand avatar: Free-pose hand animation and rendering from monocular video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)

2023
[7]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Chen, X., Wang, B., Shum, H.Y.: Hand avatar: Free-pose hand animation and rendering from monocular video. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8683–8693 (2023)

2023
[8]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Chen,Z.,Potamias,R.A.,Chen,S.,Schmid,C.:Hort:Monocularhand-heldobjects reconstruction with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 6046–6057 (2025)

2025
[9]

In: Proceedings of the IEEE/CVF Interna- tional Conference on Computer Vision

Cong, X., Xing, A., Pokhariya, C., Fu, R., Sridhar, S.: Dytact: Capturing dynamic contacts in hand-object manipulation. In: Proceedings of the IEEE/CVF Interna- tional Conference on Computer Vision. pp. 7191–7203 (2025)

2025
[10]

In: Proceedings of the 23rd annual conference on Computer graphics and interactive techniques

Curless, B., Levoy, M.: A volumetric method for building complex models from range images. In: Proceedings of the 23rd annual conference on Computer graphics and interactive techniques. pp. 303–312 (1996)

1996
[11]

In: Proceedings of the European conference on computer vision (ECCV)

Damen, D., Doughty, H., Farinella, G.M., Fidler, S., Furnari, A., Kazakos, E., Moltisanti, D., Munro, J., Perrett, T., Price, W., et al.: Scaling egocentric vision: The epic-kitchens dataset. In: Proceedings of the European conference on computer vision (ECCV). pp. 720–736 (2018)

2018
[12]

Georgia Institute of Technology, Tech

Dellaert, F.: Factor graphs and gtsam: A hands-on introduction. Georgia Institute of Technology, Tech. Rep2(4) (2012)

2012
[13]

arXiv preprint arXiv:2503.14736 (2025)

Dong, Y., Wang, W., Wang, Q., Yang, J., Liu, H., Zhu, X., Slabaugh, G., Yuan, S.: Handscs: Structural coordinate space for animatable hand gaussian splatting. arXiv preprint arXiv:2503.14736 (2025)

work page arXiv 2025
[14]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2010
[15]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Fan, Z., Parelli, M., Kadoglou, M.E., Chen, X., Kocabas, M., Black, M.J., Hilliges, O.: Hold: Category-agnostic 3d reconstruction of interacting hands and objects from video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 494–504 (2024)

2024
[16]

In: Pro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition

Fan, Z., Taheri, O., Tzionas, D., Kocabas, M., Kaufmann, M., Black, M.J., Hilliges, O.: Arctic: A dataset for dexterous bimanual hand-object manipulation. In: Pro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12943–12954 (2023)

2023
[17]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Fu, R., Zhang, D., Jiang, A., Fu, W., Funk, A., Ritchie, D., Sridhar, S.: Gigahands: A massive annotated dataset of bimanual hand activities. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 17461–17474 (2025)

2025
[18]

In: Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition

Grauman, K., Westbury, A., Byrne, E., Chavis, Z., Furnari, A., Girdhar, R., Ham- burger, J., Jiang, H., Liu, M., Liu, X., et al.: Ego4d: Around the world in 3,000 hours of egocentric video. In: Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition. pp. 18995–19012 (2022)

2022
[19]

IEEE Transactions on Pattern Anal- ysis and Machine Intelligence (2025)

Guo, Z., Zhou, W., Wang, M., Li, L., Li, H.: Handnerf++: Modeling animatable interacting hands with neural radiance fields. IEEE Transactions on Pattern Anal- ysis and Machine Intelligence (2025)

2025
[20]

In: CVPR (2020) High-Fidelity 4D Hand-Object Capture 17

Hampali, S., Rad, M., Oberweger, M., Lepetit, V.: Honnotate: A method for 3d annotation of hand and object poses. In: CVPR (2020) High-Fidelity 4D Hand-Object Capture 17

2020
[21]

In: CVPR (2022)

Hampali, S., Sarkar, S.D., Rad, M., Lepetit, V.: Keypoint transformer: Solving joint identification in challenging hands and object interactions for accurate 3d pose estimation. In: CVPR (2022)

2022
[22]

International journal of computer vision103(3), 267–305 (2013)

Hartley, R., Trumpf, J., Dai, Y., Li, H.: Rotation averaging. International journal of computer vision103(3), 267–305 (2013)

2013
[23]

LRM: Large Reconstruction Model for Single Image to 3D

Hong, Y., Zhang, K., Gu, J., Bi, S., Zhou, Y., Liu, D., Liu, F., Sunkavalli, K., Bui, T., Tan, H.: Lrm: Large reconstruction model for single image to 3d. arXiv preprint arXiv:2311.04400 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Hu, S., Hu, T., Liu, Z.: Gauhuman: Articulated gaussian splatting from monocular human videos. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 20418–20431 (2024)

2024
[25]

In: ACM SIGGRAPH 2024 conference papers

Huang, B., Yu, Z., Chen, A., Geiger, A., Gao, S.: 2d gaussian splatting for geo- metrically accurate radiance fields. In: ACM SIGGRAPH 2024 conference papers. pp. 1–11 (2024)

2024
[26]

Journal of Math- ematical Imaging and Vision35(2), 155–164 (2009)

Huynh, D.Q.: Metrics for 3d rotations: Comparison and analysis. Journal of Math- ematical Imaging and Vision35(2), 155–164 (2009)

2009
[27]

In: 2025 IEEE 19th International Conference on Auto- matic Face and Gesture Recognition (FG)

Ivashechkin, M., Mendez, O., Bowden, R.: Handocc: Nerf-based hand rendering with occupancy networks. In: 2025 IEEE 19th International Conference on Auto- matic Face and Gesture Recognition (FG). pp. 1–10. IEEE (2025)

2025
[28]

MapAnything: Universal Feed-Forward Metric 3D Reconstruction

Keetha, N., Müller, N., Schönberger, J., Porzi, L., Zhang, Y., Fischer, T., Knapitsch, A., Zauss, D., Weber, E., Antunes, N., et al.: Mapanything: Univer- sal feed-forward metric 3d reconstruction. arXiv preprint arXiv:2509.13414 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

ACM Trans

Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G., et al.: 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph.42(4), 139–1 (2023)

2023
[30]

In: Proceedings of the IEEE/CVF international conference on computer vision

Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4015–4026 (2023)

2023
[31]

In: 2011 IEEE international conference on robotics and automation

Kümmerle, R., Grisetti, G., Strasdat, H., Konolige, K., Burgard, W.: g 2 o: A general framework for graph optimization. In: 2011 IEEE international conference on robotics and automation. pp. 3607–3613. IEEE (2011)

2011
[32]

In: European conference on computer vision

Leroy, V., Cabon, Y., Revaud, J.: Grounding image matching in 3d with mast3r. In: European conference on computer vision. pp. 71–91. Springer (2024)

2024
[33]

Li, B.: Mv-sam3d: Adaptive multi-view 3d reconstruction.https://github.com/ devinli123/MV-SAM3D(2025), accessed: March 2026

2025
[34]

In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)

Liu, X., Ren, P., Qi, Q., Sun, H., Zhuang, Z., Wang, J., Liao, J., Wang, J.: Gen- eralizable hand-object modeling from monocular rgb images via 3d gaussians. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)

2025
[35]

arXiv preprint arXiv:2603.23997 (2026)

Liu, Y., Long, X.X., Habermann, M., Yang, X., Lin, C., Liu, Y., Ma, Y., Wang, W., Liu, L.: Hggt: Robust and flexible 3d hand mesh reconstruction from uncalibrated images. arXiv preprint arXiv:2603.23997 (2026)

work page arXiv 2026
[36]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Liu, Y., Long, X., Yang, Z., Liu, Y., Habermann, M., Theobalt, C., Ma, Y., Wang, W.: Easyhoi: Unleashing the power of large models for reconstructing hand-object interactions in the wild. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 7037–7047 (2025)

2025
[37]

6m: A dataset and baseline for 3d interacting hand pose estimation from a single rgb image

Moon, G., Yu, S.I., Wen, H., Shiratori, T., Lee, K.M.: Interhand2. 6m: A dataset and baseline for 3d interacting hand pose estimation from a single rgb image. In: European Conference on Computer Vision. pp. 548–564. Springer (2020)

2020
[38]

In: Proceedings of the IEEE/CVF international conference on computer vision

Mundra, A., Wang, J., Habermann, M., Theobalt, C., Elgharib, M., et al.: Live- hand: Real-time and photorealistic neural hand rendering. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 18035–18045 (2023) 18 B. Peng et al

2023
[39]

DINOv2: Learning Robust Visual Features without Supervision

Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[40]

IEEE Transactions on Visualization and Computer Graphics (2025)

Peng, B., Tao, Y., Zhan, H., Guo, Y., Zhang, J.: Pica: Physics-integrated clothed avatar. IEEE Transactions on Visualization and Computer Graphics (2025)

2025
[41]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Pokhariya, C., Shah, I.N., Xing, A., Li, Z., Chen, K., Sharma, A., Sridhar, S.: Manus: Markerless grasp capture using articulated 3d gaussians. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2197–2208 (2024)

2024
[42]

In: Proceedings of the IEEE/CVF international conference on computer vision

Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 12179–12188 (2021)

2021
[43]

SAM 2: Segment Anything in Images and Videos

Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al.: Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[44]

ACM Transactions on Graphics, (Proc

Romero, J., Tzionas, D., Black, M.J.: Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia)36(6) (Nov 2017)

2017
[45]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Saito, S., Schwartz, G., Simon, T., Li, J., Nam, G.: Relightable gaussian codec avatars. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 130–141 (2024)

2024
[46]

FastVGGT: Training-Free Acceleration of Visual Geometry Transformer

Shen, Y., Zhang, Z., Qu, Y., Zheng, X., Ji, J., Zhang, S., Cao, L.: Fastvggt: Training-free acceleration of visual geometry transformer. arXiv preprint arXiv:2509.02560 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[47]

In: Symposium on Geometry processing

Sorkine, O., Alexa, M., et al.: As-rigid-as-possible surface modeling. In: Symposium on Geometry processing. vol. 4, pp. 109–116 (2007)

2007
[48]

In: European conference on computer vision

Taheri, O., Ghorbani, N., Black, M.J., Tzionas, D.: Grab: A dataset of whole- body human grasping of objects. In: European conference on computer vision. pp. 581–600. Springer (2020)

2020
[49]

SAM 3D: 3Dfy Anything in Images

Team, S.D., Chen, X., Chu, F.J., Gleize, P., Liang, K.J., Sax, A., Tang, H., Wang, W., Guo, M., Hardin, T., Li, X., Lin, A., Liu, J., Ma, Z., Sagar, A., Song, B., Wang, X., Yang, J., Zhang, B., Dollár, P., Gkioxari, G., Feiszli, M., Malik, J.: Sam 3d: 3dfy anything in images. arXiv preprint arXiv:2511.16624 (2025),https: //arxiv.org/abs/2511.16624

work page internal anchor Pith review Pith/arXiv arXiv 2025
[50]

IEEE Transactions on Pattern Analysis and Machine Intelligence47(11), 9426–9437 (2024)

Verbin, D., Hedman, P., Mildenhall, B., Zickler, T., Barron, J.T., Srinivasan, P.P.: Ref-nerf: Structured view-dependent appearance for neural radiance fields. IEEE Transactions on Pattern Analysis and Machine Intelligence47(11), 9426–9437 (2024)

2024
[51]

arXiv preprint arXiv:2511.18416 (2025)

Wang, H., Zhou, H., Liu, H., Yan, L.: 4d-vggt: A general foundation model with spatiotemporal awareness for dynamic scene geometry estimation. arXiv preprint arXiv:2511.18416 (2025)

work page arXiv 2025
[52]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geometry grounded transformer. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 5294–5306 (2025)

2025
[53]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Wang, S., He, H., Parelli, M., Gebhardt, C., Fan, Z., Song, J.: Magichoi: Leveraging 3d priors for accurate hand-object reconstruction from short monocular video clips. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5957–5968 (2025)

2025
[54]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Wang, S., Leroy, V., Cabon, Y., Chidlovskii, B., Revaud, J.: Dust3r: Geometric 3d vision made easy. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 20697–20709 (2024) High-Fidelity 4D Hand-Object Capture 19

2024
[55]

$\pi^3$: Permutation-Equivariant Visual Geometry Learning

Wang, Y., Zhou, J., Zhu, H., Chang, W., Zhou, Y., Li, Z., Chen, J., Pang, J., Shen, C., He, T.:π 3: Permutation-equivariant visual geometry learning. arXiv preprint arXiv:2507.13347 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[56]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Wang, Y., Xu, H., Heng, P.A., Fu, C.W.: Unihope: A unified approach for hand- only and hand-object pose estimation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 12231–12241 (2025)

2025
[57]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Wu, G., Yi, T., Fang, J., Xie, L., Zhang, X., Wei, W., Liu, W., Tian, Q., Wang, X.: 4d gaussian splatting for real-time dynamic scene rendering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 20310– 20320 (2024)

2024
[58]

arXiv preprint arXiv:2512.05060 (2025)

Wu, X., Bai, Y., Li, M., Wu, X., Zhao, X., Lai, Z., Liu, W., Wang, X.: 4dlangvggt: 4d language-visual geometry grounded transformer. arXiv preprint arXiv:2512.05060 (2025)

work page arXiv 2025
[59]

CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction

Xie, X., Wen, B., Chang, Y., Rabeti, H., Li, J., Yuan, Y., Pons-Moll, G., Birchfield, S.: Cari4d: Category agnostic 4d reconstruction of human-object interaction. arXiv preprint arXiv:2512.11988 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[60]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Yang, J., Sax, A., Liang, K.J., Henaff, M., Tang, H., Cao, A., Chai, J., Meier, F., Feiszli, M.: Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 21924–21935 (2025)

2025
[61]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Yang, L., Li, K., Zhan, X., Wu, F., Xu, A., Liu, L., Lu, C.: Oakink: A large-scale knowledge repository for understanding hand-object interaction. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 20953–20962 (2022)

2022
[62]

IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

Yang, L., Zhong, L., Zhu, P., Zhan, X., Kong, J., Xu, J., Lu, C.: Multi-view hand reconstruction with a point-embedded transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

2025
[63]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Yang, Z., Gao, X., Zhou, W., Jiao, S., Zhang, Y., Jin, X.: Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 20331– 20341 (2024)

2024
[64]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Ye, Y., Gupta, A., Tulsiani, S.: What’s in your hands? 3d reconstruction of generic objects in hands. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3895–3905 (2022)

2022
[65]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Yu, Z., Zafeiriou, S., Birdal, T.: Dyn-hamr: Recovering 4d interacting hand motion from a dynamic camera. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 27716–27726 (2025)

2025
[66]

MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion

Zhang, J., Herrmann, C., Hur, J., Jampani, V., Darrell, T., Cole, F., Sun, D., Yang, M.H.: Monst3r: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[67]

arXiv preprint arXiv:2508.10868 (2025)

Zhang, Y., Zhang, L., Ma, R., Cao, N.: Texverse: A universe of 3d objects with high-resolution textures. arXiv preprint arXiv:2508.10868 (2025)

work page arXiv 2025
[68]

In: 2025 International Conference on 3D Vision (3DV)

Zhenyuan, L., Guo, Y., Li, X., Bickel, B., Zhang, R.: Bigs: Bidirectional primi- tives for relightable 3d gaussian splatting. In: 2025 International Conference on 3D Vision (3DV). pp. 1022–1031. IEEE (2025)

2025
[69]

In: European conference on computer vision

Zhou, Q.Y., Park, J., Koltun, V.: Fast global registration. In: European conference on computer vision. pp. 766–782. Springer (2016) 20 B. Peng et al. Supplementary Material This supplementary document provides additional details omitted from the main text due to space constraints. It is organized into two primary sections: Section A presents extended abl...

2016

[1] [1]

Anonymous: View-efficient photometric hand performance capture at scale (2026), under review

2026

[2] [2]

In: Sensor fusion IV: control paradigms and data structures

Besl, P.J., McKay, N.D.: Method for registration of 3-d shapes. In: Sensor fusion IV: control paradigms and data structures. vol. 1611, pp. 586–606. Spie (1992)

1992

[3] [3]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Chao,Y.W.,Yang,W.,Xiang, Y.,Molchanov,P.,Handa,A., Tremblay, J.,Narang, Y.S., Van Wyk, K., Iqbal, U., Birchfield, S., et al.: Dexycb: A benchmark for capturing hand grasping of objects. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9044–9053 (2021)

2021

[4] [4]

IEEE Transactions on Visualization and Computer Graphics 31(9), 6100–6111 (2024) 16 B

Chen, D., Li, H., Ye, W., Wang, Y., Xie, W., Zhai, S., Wang, N., Liu, H., Bao, H., Zhang, G.: Pgsr: Planar-based gaussian splatting for efficient and high-fidelity sur- face reconstruction. IEEE Transactions on Visualization and Computer Graphics 31(9), 6100–6111 (2024) 16 B. Peng et al

2024

[5] [6]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)

Chen, X., Wang, B., Shum, H.Y.: Hand avatar: Free-pose hand animation and rendering from monocular video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)

2023

[6] [7]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Chen, X., Wang, B., Shum, H.Y.: Hand avatar: Free-pose hand animation and rendering from monocular video. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8683–8693 (2023)

2023

[7] [8]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Chen,Z.,Potamias,R.A.,Chen,S.,Schmid,C.:Hort:Monocularhand-heldobjects reconstruction with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 6046–6057 (2025)

2025

[8] [9]

In: Proceedings of the IEEE/CVF Interna- tional Conference on Computer Vision

Cong, X., Xing, A., Pokhariya, C., Fu, R., Sridhar, S.: Dytact: Capturing dynamic contacts in hand-object manipulation. In: Proceedings of the IEEE/CVF Interna- tional Conference on Computer Vision. pp. 7191–7203 (2025)

2025

[9] [10]

In: Proceedings of the 23rd annual conference on Computer graphics and interactive techniques

Curless, B., Levoy, M.: A volumetric method for building complex models from range images. In: Proceedings of the 23rd annual conference on Computer graphics and interactive techniques. pp. 303–312 (1996)

1996

[10] [11]

In: Proceedings of the European conference on computer vision (ECCV)

Damen, D., Doughty, H., Farinella, G.M., Fidler, S., Furnari, A., Kazakos, E., Moltisanti, D., Munro, J., Perrett, T., Price, W., et al.: Scaling egocentric vision: The epic-kitchens dataset. In: Proceedings of the European conference on computer vision (ECCV). pp. 720–736 (2018)

2018

[11] [12]

Georgia Institute of Technology, Tech

Dellaert, F.: Factor graphs and gtsam: A hands-on introduction. Georgia Institute of Technology, Tech. Rep2(4) (2012)

2012

[12] [13]

arXiv preprint arXiv:2503.14736 (2025)

Dong, Y., Wang, W., Wang, Q., Yang, J., Liu, H., Zhu, X., Slabaugh, G., Yuan, S.: Handscs: Structural coordinate space for animatable hand gaussian splatting. arXiv preprint arXiv:2503.14736 (2025)

work page arXiv 2025

[13] [14]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2010

[14] [15]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Fan, Z., Parelli, M., Kadoglou, M.E., Chen, X., Kocabas, M., Black, M.J., Hilliges, O.: Hold: Category-agnostic 3d reconstruction of interacting hands and objects from video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 494–504 (2024)

2024

[15] [16]

In: Pro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition

Fan, Z., Taheri, O., Tzionas, D., Kocabas, M., Kaufmann, M., Black, M.J., Hilliges, O.: Arctic: A dataset for dexterous bimanual hand-object manipulation. In: Pro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12943–12954 (2023)

2023

[16] [17]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Fu, R., Zhang, D., Jiang, A., Fu, W., Funk, A., Ritchie, D., Sridhar, S.: Gigahands: A massive annotated dataset of bimanual hand activities. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 17461–17474 (2025)

2025

[17] [18]

In: Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition

Grauman, K., Westbury, A., Byrne, E., Chavis, Z., Furnari, A., Girdhar, R., Ham- burger, J., Jiang, H., Liu, M., Liu, X., et al.: Ego4d: Around the world in 3,000 hours of egocentric video. In: Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition. pp. 18995–19012 (2022)

2022

[18] [19]

IEEE Transactions on Pattern Anal- ysis and Machine Intelligence (2025)

Guo, Z., Zhou, W., Wang, M., Li, L., Li, H.: Handnerf++: Modeling animatable interacting hands with neural radiance fields. IEEE Transactions on Pattern Anal- ysis and Machine Intelligence (2025)

2025

[19] [20]

In: CVPR (2020) High-Fidelity 4D Hand-Object Capture 17

Hampali, S., Rad, M., Oberweger, M., Lepetit, V.: Honnotate: A method for 3d annotation of hand and object poses. In: CVPR (2020) High-Fidelity 4D Hand-Object Capture 17

2020

[20] [21]

In: CVPR (2022)

Hampali, S., Sarkar, S.D., Rad, M., Lepetit, V.: Keypoint transformer: Solving joint identification in challenging hands and object interactions for accurate 3d pose estimation. In: CVPR (2022)

2022

[21] [22]

International journal of computer vision103(3), 267–305 (2013)

Hartley, R., Trumpf, J., Dai, Y., Li, H.: Rotation averaging. International journal of computer vision103(3), 267–305 (2013)

2013

[22] [23]

LRM: Large Reconstruction Model for Single Image to 3D

Hong, Y., Zhang, K., Gu, J., Bi, S., Zhou, Y., Liu, D., Liu, F., Sunkavalli, K., Bui, T., Tan, H.: Lrm: Large reconstruction model for single image to 3d. arXiv preprint arXiv:2311.04400 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[23] [24]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Hu, S., Hu, T., Liu, Z.: Gauhuman: Articulated gaussian splatting from monocular human videos. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 20418–20431 (2024)

2024

[24] [25]

In: ACM SIGGRAPH 2024 conference papers

Huang, B., Yu, Z., Chen, A., Geiger, A., Gao, S.: 2d gaussian splatting for geo- metrically accurate radiance fields. In: ACM SIGGRAPH 2024 conference papers. pp. 1–11 (2024)

2024

[25] [26]

Journal of Math- ematical Imaging and Vision35(2), 155–164 (2009)

Huynh, D.Q.: Metrics for 3d rotations: Comparison and analysis. Journal of Math- ematical Imaging and Vision35(2), 155–164 (2009)

2009

[26] [27]

In: 2025 IEEE 19th International Conference on Auto- matic Face and Gesture Recognition (FG)

Ivashechkin, M., Mendez, O., Bowden, R.: Handocc: Nerf-based hand rendering with occupancy networks. In: 2025 IEEE 19th International Conference on Auto- matic Face and Gesture Recognition (FG). pp. 1–10. IEEE (2025)

2025

[27] [28]

MapAnything: Universal Feed-Forward Metric 3D Reconstruction

Keetha, N., Müller, N., Schönberger, J., Porzi, L., Zhang, Y., Fischer, T., Knapitsch, A., Zauss, D., Weber, E., Antunes, N., et al.: Mapanything: Univer- sal feed-forward metric 3d reconstruction. arXiv preprint arXiv:2509.13414 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[28] [29]

ACM Trans

Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G., et al.: 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph.42(4), 139–1 (2023)

2023

[29] [30]

In: Proceedings of the IEEE/CVF international conference on computer vision

Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4015–4026 (2023)

2023

[30] [31]

In: 2011 IEEE international conference on robotics and automation

Kümmerle, R., Grisetti, G., Strasdat, H., Konolige, K., Burgard, W.: g 2 o: A general framework for graph optimization. In: 2011 IEEE international conference on robotics and automation. pp. 3607–3613. IEEE (2011)

2011

[31] [32]

In: European conference on computer vision

Leroy, V., Cabon, Y., Revaud, J.: Grounding image matching in 3d with mast3r. In: European conference on computer vision. pp. 71–91. Springer (2024)

2024

[32] [33]

Li, B.: Mv-sam3d: Adaptive multi-view 3d reconstruction.https://github.com/ devinli123/MV-SAM3D(2025), accessed: March 2026

2025

[33] [34]

In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)

Liu, X., Ren, P., Qi, Q., Sun, H., Zhuang, Z., Wang, J., Liao, J., Wang, J.: Gen- eralizable hand-object modeling from monocular rgb images via 3d gaussians. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)

2025

[34] [35]

arXiv preprint arXiv:2603.23997 (2026)

Liu, Y., Long, X.X., Habermann, M., Yang, X., Lin, C., Liu, Y., Ma, Y., Wang, W., Liu, L.: Hggt: Robust and flexible 3d hand mesh reconstruction from uncalibrated images. arXiv preprint arXiv:2603.23997 (2026)

work page arXiv 2026

[35] [36]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Liu, Y., Long, X., Yang, Z., Liu, Y., Habermann, M., Theobalt, C., Ma, Y., Wang, W.: Easyhoi: Unleashing the power of large models for reconstructing hand-object interactions in the wild. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 7037–7047 (2025)

2025

[36] [37]

6m: A dataset and baseline for 3d interacting hand pose estimation from a single rgb image

Moon, G., Yu, S.I., Wen, H., Shiratori, T., Lee, K.M.: Interhand2. 6m: A dataset and baseline for 3d interacting hand pose estimation from a single rgb image. In: European Conference on Computer Vision. pp. 548–564. Springer (2020)

2020

[37] [38]

In: Proceedings of the IEEE/CVF international conference on computer vision

Mundra, A., Wang, J., Habermann, M., Theobalt, C., Elgharib, M., et al.: Live- hand: Real-time and photorealistic neural hand rendering. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 18035–18045 (2023) 18 B. Peng et al

2023

[38] [39]

DINOv2: Learning Robust Visual Features without Supervision

Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[39] [40]

IEEE Transactions on Visualization and Computer Graphics (2025)

Peng, B., Tao, Y., Zhan, H., Guo, Y., Zhang, J.: Pica: Physics-integrated clothed avatar. IEEE Transactions on Visualization and Computer Graphics (2025)

2025

[40] [41]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Pokhariya, C., Shah, I.N., Xing, A., Li, Z., Chen, K., Sharma, A., Sridhar, S.: Manus: Markerless grasp capture using articulated 3d gaussians. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2197–2208 (2024)

2024

[41] [42]

In: Proceedings of the IEEE/CVF international conference on computer vision

Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 12179–12188 (2021)

2021

[42] [43]

SAM 2: Segment Anything in Images and Videos

Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al.: Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[43] [44]

ACM Transactions on Graphics, (Proc

Romero, J., Tzionas, D., Black, M.J.: Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia)36(6) (Nov 2017)

2017

[44] [45]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Saito, S., Schwartz, G., Simon, T., Li, J., Nam, G.: Relightable gaussian codec avatars. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 130–141 (2024)

2024

[45] [46]

FastVGGT: Training-Free Acceleration of Visual Geometry Transformer

Shen, Y., Zhang, Z., Qu, Y., Zheng, X., Ji, J., Zhang, S., Cao, L.: Fastvggt: Training-free acceleration of visual geometry transformer. arXiv preprint arXiv:2509.02560 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[46] [47]

In: Symposium on Geometry processing

Sorkine, O., Alexa, M., et al.: As-rigid-as-possible surface modeling. In: Symposium on Geometry processing. vol. 4, pp. 109–116 (2007)

2007

[47] [48]

In: European conference on computer vision

Taheri, O., Ghorbani, N., Black, M.J., Tzionas, D.: Grab: A dataset of whole- body human grasping of objects. In: European conference on computer vision. pp. 581–600. Springer (2020)

2020

[48] [49]

SAM 3D: 3Dfy Anything in Images

Team, S.D., Chen, X., Chu, F.J., Gleize, P., Liang, K.J., Sax, A., Tang, H., Wang, W., Guo, M., Hardin, T., Li, X., Lin, A., Liu, J., Ma, Z., Sagar, A., Song, B., Wang, X., Yang, J., Zhang, B., Dollár, P., Gkioxari, G., Feiszli, M., Malik, J.: Sam 3d: 3dfy anything in images. arXiv preprint arXiv:2511.16624 (2025),https: //arxiv.org/abs/2511.16624

work page internal anchor Pith review Pith/arXiv arXiv 2025

[49] [50]

IEEE Transactions on Pattern Analysis and Machine Intelligence47(11), 9426–9437 (2024)

Verbin, D., Hedman, P., Mildenhall, B., Zickler, T., Barron, J.T., Srinivasan, P.P.: Ref-nerf: Structured view-dependent appearance for neural radiance fields. IEEE Transactions on Pattern Analysis and Machine Intelligence47(11), 9426–9437 (2024)

2024

[50] [51]

arXiv preprint arXiv:2511.18416 (2025)

Wang, H., Zhou, H., Liu, H., Yan, L.: 4d-vggt: A general foundation model with spatiotemporal awareness for dynamic scene geometry estimation. arXiv preprint arXiv:2511.18416 (2025)

work page arXiv 2025

[51] [52]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geometry grounded transformer. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 5294–5306 (2025)

2025

[52] [53]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Wang, S., He, H., Parelli, M., Gebhardt, C., Fan, Z., Song, J.: Magichoi: Leveraging 3d priors for accurate hand-object reconstruction from short monocular video clips. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5957–5968 (2025)

2025

[53] [54]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Wang, S., Leroy, V., Cabon, Y., Chidlovskii, B., Revaud, J.: Dust3r: Geometric 3d vision made easy. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 20697–20709 (2024) High-Fidelity 4D Hand-Object Capture 19

2024

[54] [55]

$\pi^3$: Permutation-Equivariant Visual Geometry Learning

Wang, Y., Zhou, J., Zhu, H., Chang, W., Zhou, Y., Li, Z., Chen, J., Pang, J., Shen, C., He, T.:π 3: Permutation-equivariant visual geometry learning. arXiv preprint arXiv:2507.13347 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[55] [56]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Wang, Y., Xu, H., Heng, P.A., Fu, C.W.: Unihope: A unified approach for hand- only and hand-object pose estimation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 12231–12241 (2025)

2025

[56] [57]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Wu, G., Yi, T., Fang, J., Xie, L., Zhang, X., Wei, W., Liu, W., Tian, Q., Wang, X.: 4d gaussian splatting for real-time dynamic scene rendering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 20310– 20320 (2024)

2024

[57] [58]

arXiv preprint arXiv:2512.05060 (2025)

Wu, X., Bai, Y., Li, M., Wu, X., Zhao, X., Lai, Z., Liu, W., Wang, X.: 4dlangvggt: 4d language-visual geometry grounded transformer. arXiv preprint arXiv:2512.05060 (2025)

work page arXiv 2025

[58] [59]

CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction

Xie, X., Wen, B., Chang, Y., Rabeti, H., Li, J., Yuan, Y., Pons-Moll, G., Birchfield, S.: Cari4d: Category agnostic 4d reconstruction of human-object interaction. arXiv preprint arXiv:2512.11988 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[59] [60]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Yang, J., Sax, A., Liang, K.J., Henaff, M., Tang, H., Cao, A., Chai, J., Meier, F., Feiszli, M.: Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 21924–21935 (2025)

2025

[60] [61]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Yang, L., Li, K., Zhan, X., Wu, F., Xu, A., Liu, L., Lu, C.: Oakink: A large-scale knowledge repository for understanding hand-object interaction. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 20953–20962 (2022)

2022

[61] [62]

IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

Yang, L., Zhong, L., Zhu, P., Zhan, X., Kong, J., Xu, J., Lu, C.: Multi-view hand reconstruction with a point-embedded transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

2025

[62] [63]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Yang, Z., Gao, X., Zhou, W., Jiao, S., Zhang, Y., Jin, X.: Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 20331– 20341 (2024)

2024

[63] [64]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Ye, Y., Gupta, A., Tulsiani, S.: What’s in your hands? 3d reconstruction of generic objects in hands. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3895–3905 (2022)

2022

[64] [65]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Yu, Z., Zafeiriou, S., Birdal, T.: Dyn-hamr: Recovering 4d interacting hand motion from a dynamic camera. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 27716–27726 (2025)

2025

[65] [66]

MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion

Zhang, J., Herrmann, C., Hur, J., Jampani, V., Darrell, T., Cole, F., Sun, D., Yang, M.H.: Monst3r: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[66] [67]

arXiv preprint arXiv:2508.10868 (2025)

Zhang, Y., Zhang, L., Ma, R., Cao, N.: Texverse: A universe of 3d objects with high-resolution textures. arXiv preprint arXiv:2508.10868 (2025)

work page arXiv 2025

[67] [68]

In: 2025 International Conference on 3D Vision (3DV)

Zhenyuan, L., Guo, Y., Li, X., Bickel, B., Zhang, R.: Bigs: Bidirectional primi- tives for relightable 3d gaussian splatting. In: 2025 International Conference on 3D Vision (3DV). pp. 1022–1031. IEEE (2025)

2025

[68] [69]

In: European conference on computer vision

Zhou, Q.Y., Park, J., Koltun, V.: Fast global registration. In: European conference on computer vision. pp. 766–782. Springer (2016) 20 B. Peng et al. Supplementary Material This supplementary document provides additional details omitted from the main text due to space constraints. It is organized into two primary sections: Section A presents extended abl...

2016