ArtSplat: Feed-Forward Articulated 3D Gaussian Splatting from Sparse Multi-State Uncalibrated Views

Eugene Sohn; Inseo Lee; Jin-Hwa Kim; Jiwoong Lee; Joonseok Lee; Jungmin You; Yoonji Kim

arxiv: 2605.24304 · v1 · pith:QAYEWZ7Lnew · submitted 2026-05-23 · 💻 cs.CV · cs.AI

Inseo Lee, Yoonji Kim, Eugene Sohn, Jiwoong Lee, Jungmin You, Joonseok Lee, Jin-Hwa Kim

Pith reviewed 2026-06-30 14:24 UTC · model grok-4.3

keywords articulated object reconstruction · 3D Gaussian Splatting · feed-forward network · sparse multi-view · joint parameter estimation · cross-state attention · multi-state images · PartNet-Mobility

The pith

A single forward pass reconstructs both 3D Gaussian geometry and joint parameters of articulated objects from sparse uncalibrated multi-state views.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ArtSplat as the first feed-forward method that turns a handful of uncalibrated images taken at different articulation states into a complete 3D Gaussian splat model plus the object's joint parameters. Earlier techniques required dense views, known depth, predefined joint counts, or slow per-object optimization to solve the same ill-posed problem. The method encodes joints as per-pixel maps and uses a Cross-State Attention block with state tokens to link information across the input states inside one network forward pass. On 68 objects from PartNet-Mobility the approach is reported to be competitive with slower optimization-based baselines in geometry and joint accuracy while running more than 400 times faster. A reader would care because the speed gain removes the main barrier to using articulated reconstruction in interactive or real-time settings.

Core claim

ArtSplat is a feed-forward network that ingests sparse multi-view images captured at multiple articulation states and directly outputs 3D Gaussian primitives together with the object's joint parameters. It solves the joint geometry-and-articulation inference task by representing articulation via a per-pixel joint map and by applying a Cross-State Attention mechanism that uses learned state tokens to model discrete motion between the input states, all without per-object optimization or strong external priors.
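To make the idea of a per-pixel joint map concrete, here is a minimal sketch of how per-pixel joint parameters could move Gaussian centers to a new articulation state. The abstract gives no formulation, so everything here is an assumption for illustration: the eight-column layout of `joint_map`, the constants `STATIC`, `REVOLUTE`, and `PRISMATIC`, and the function `articulate` are hypothetical and are not the paper's method.

```python
import numpy as np

# Hypothetical joint-type codes; the paper's actual encoding is not given.
STATIC, REVOLUTE, PRISMATIC = 0, 1, 2

def rodrigues(axis, angle):
    """Rotation matrix for `angle` radians about the direction `axis`."""
    k = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def articulate(centers, joint_map, state):
    """Move per-pixel Gaussian centers to articulation `state` in [0, 1].

    centers:   (N, 3) one Gaussian center per pixel.
    joint_map: (N, 8) assumed layout per pixel:
               [joint_type, axis(3), pivot(3), magnitude]
               magnitude is radians (revolute) or length (prismatic).
    """
    out = centers.copy()
    for i, (p, row) in enumerate(zip(centers, joint_map)):
        jtype, axis, pivot, mag = int(row[0]), row[1:4], row[4:7], row[7]
        if jtype == REVOLUTE:
            # Rotate about the line through `pivot` with direction `axis`.
            out[i] = rodrigues(axis, state * mag) @ (p - pivot) + pivot
        elif jtype == PRISMATIC:
            # Slide along the normalized axis direction.
            out[i] = p + state * mag * axis / np.linalg.norm(axis)
        # STATIC pixels are left where they are.
    return out

centers = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [2.0, 2.0, 2.0]])
joint_map = np.array([
    [REVOLUTE, 0, 0, 1, 0, 0, 0, np.pi / 2],   # quarter turn about z
    [PRISMATIC, 0, 0, 2, 0, 0, 0, 0.5],        # slide 0.5 along z
    [STATIC, 0, 0, 1, 0, 0, 0, 0.0],
])
moved = articulate(centers, joint_map, state=1.0)
half = articulate(centers, joint_map, state=0.5)
```

This sketch moves only the centers. A full articulated splat would also need to rotate each Gaussian's orientation, which is omitted here. Fractional values of `state` give intermediate poses between the two endpoint states.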

What carries the argument

The per-pixel joint map representation together with the Cross-State Attention mechanism that employs state tokens to capture discrete motion across input states.
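As a rough illustration of attention across states with state tokens, the sketch below prepends a learned per-state token to each state's image tokens and lets one state's sequence attend to the other's. This is a generic single-head cross-attention written under assumptions, not the paper's CSA block: the token layout, the residual connection, the random weights, and the function name `cross_state_attention` are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_state_attention(seq_a, seq_b, Wq, Wk, Wv):
    """Tokens of state A (queries) attend to tokens of state B (keys/values).

    seq_a, seq_b: (T + 1, D) sequences, each a state token followed by
                  T image tokens (assumed layout).
    Returns the updated state-A sequence and the attention weights.
    """
    q, k, v = seq_a @ Wq, seq_b @ Wk, seq_b @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return seq_a + attn @ v, attn  # residual update of state A

rng = np.random.default_rng(0)
D, T = 16, 5
state_tokens = rng.normal(size=(2, D))   # stand-ins for learned state tokens
feats = rng.normal(size=(2, T, D))       # stand-ins for per-state image tokens
seqs = [np.vstack([state_tokens[s][None], feats[s]]) for s in range(2)]
Wq, Wk, Wv = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))

out0, attn0 = cross_state_attention(seqs[0], seqs[1], Wq, Wk, Wv)
out1, attn1 = cross_state_attention(seqs[1], seqs[0], Wq, Wk, Wv)
```

Running the block in both directions, as in the last two lines, lets each state see the other. How the actual network stacks such blocks or handles more than two states is not described in the abstract.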

If this is right

Both geometry and joint parameters are recovered jointly inside a single network pass instead of separate optimization stages.
The same architecture handles both single-joint and multi-joint objects without requiring the number of joints to be known in advance.
Inference becomes more than 400 times faster than optimization-based baselines while remaining competitive in geometry and joint accuracy.
Reconstruction no longer depends on dense views, depth maps, or predefined joint types.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The feed-forward design could be inserted into video pipelines to track articulated objects across continuous motion sequences.
If the joint-map representation generalizes, similar per-pixel structure tokens might help other inverse-graphics tasks that must infer hidden parameters from images.
Real-time robotics applications could acquire movable-object models from a few casual phone photos taken while the object is moved by hand.
Extending the state-token mechanism to handle more than the tested number of discrete states might reduce errors on highly articulated objects.

Load-bearing premise

That a per-pixel joint map plus cross-state attention suffices to resolve the ambiguities of simultaneous geometry and articulation recovery from sparse uncalibrated multi-state views.

What would settle it

Running the model on a set of objects whose joint types or counts lie outside the PartNet-Mobility single- and multi-joint configurations used in training and checking whether the predicted joint parameters produce geometrically inconsistent splats across states.
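One way such a consistency check could be scripted is sketched below: move the state-0 geometry with the predicted joint and compare it with the observed state-1 geometry using a Chamfer distance. The data is synthetic and the helper names `chamfer` and `prismatic_move` are hypothetical. A prismatic joint is used only because it keeps the example short.

```python
import numpy as np

def chamfer(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def prismatic_move(points, axis, distance):
    """Translate points by `distance` along the normalized `axis`."""
    axis = axis / np.linalg.norm(axis)
    return points + distance * axis

rng = np.random.default_rng(0)
part_state0 = rng.uniform(size=(200, 3))           # synthetic movable part
true_axis = np.array([1.0, 0.0, 0.0])
part_state1 = prismatic_move(part_state0, true_axis, 0.3)

# A correct axis prediction reproduces state 1; a wrong one does not.
good = chamfer(prismatic_move(part_state0, true_axis, 0.3), part_state1)
bad = chamfer(
    prismatic_move(part_state0, np.array([0.0, 1.0, 0.0]), 0.3), part_state1
)
```

A large gap between the moved state-0 geometry and the observed state-1 geometry on out-of-distribution objects would be the kind of inconsistency this test looks for.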

Figures

Figures reproduced from arXiv: 2605.24304 by Eugene Sohn, Inseo Lee, Jin-Hwa Kim, Jiwoong Lee, Joonseok Lee, Jungmin You, Yoonji Kim.

**Figure 1.** Overview. Given sparse multi-view images across two states, our model predicts geometry and joint parameters in a forward pass. Depth and Gaussian predictions are integrated with the joint maps to produce a state-conditioned Gaussian set, enabling articulated novel-state rendering without per-object optimization.

**Figure 2.** Qualitative comparison of novel-view renderings via Gaussian rasterization. Baselines exhibit ghosting and misaligned edges around the joints due to inaccurate axis estimation, whereas ArtSplat produces clean renderings of both static and movable parts. Panel labels: Articulated object, PARIS, DTA, ArtGS, ScrewSplat, ArtSplat (Ours); State 0, State 1. Legend: prismatic joint axis / part; revolute joint axis / part.

**Figure 3.** Qualitative comparison of extracted meshes and predicted joint axes. Panel labels: State 0, State 0.25, State 0.5, State 0.75, State 1.

Original abstract

Articulated object reconstruction from sparse-view images is an ill-posed problem that requires simultaneous inference of geometry and underlying articulation structure. Existing methods for articulated object reconstruction based on NeRF and 3D Gaussian Splatting (3DGS) typically rely on dense views or strong priors (e.g., depth maps, joint types, predefined number of joints) and require costly per-object optimization. In this paper, we propose ArtSplat, the first feed-forward framework for articulated 3D Gaussian Splatting. It reconstructs both geometry and joint parameters from sparse multi-view images across multiple articulation states in a single forward pass. To address the challenges of single-pass articulated reconstruction, we introduce a per-pixel joint map representation that enables the integration of joint parameter estimation into the feed-forward pipeline. We further propose a Cross-State Attention (CSA) mechanism with state tokens, which effectively captures discrete motion across input states. Experiments on 68 articulated objects from PartNet-Mobility, including both single- and multi-joint configurations, demonstrate that ArtSplat achieves competitive performance in both geometry and joint estimation, while being over 400 times faster than baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ArtSplat puts forward a feed-forward pipeline for articulated 3DGS using a per-pixel joint map and cross-state attention, and claims a speedup of more than 400x over baselines on PartNet-Mobility objects.

The core contribution is a single-pass network that takes sparse uncalibrated views from multiple articulation states and directly predicts both 3D Gaussians and joint parameters. The two new pieces are the per-pixel joint map, which lets the model output articulation info alongside geometry, and the Cross-State Attention module that uses state tokens to link information across the input states.

Those components let the method avoid the usual per-object optimization loop, which is the practical advance. On the 68-object PartNet-Mobility test set the abstract reports competitive geometry and joint accuracy at more than 400 times the speed of prior baselines. That speed difference is the clearest evidence the feed-forward route works at all.

The main limitation visible so far is that the abstract supplies no equations, loss terms, or training protocol, so it is impossible to judge whether the joint map actually resolves the ill-posed geometry-plus-articulation problem or simply fits the training distribution. The evaluation set is also modest and the paper does not show error bars or ablations on the attention mechanism. Without those details the competitive claim stays provisional.

The work is aimed at researchers who need fast articulated reconstruction for robotics or graphics pipelines. Anyone already using 3DGS feed-forward models would find the new representation and attention pattern worth examining.

I would send it to peer review. The idea is coherent and the speed result is worth a full technical check even if the current evidence is thin.

Referee Report

2 major / 0 minor

Summary. The paper proposes ArtSplat, the first feed-forward framework for articulated 3D Gaussian Splatting. It reconstructs both geometry and joint parameters from sparse multi-view images across multiple articulation states in a single forward pass. The method introduces a per-pixel joint map representation and a Cross-State Attention (CSA) mechanism with state tokens to handle the ill-posed problem of simultaneous geometry and articulation inference. Experiments on 68 articulated objects from PartNet-Mobility (single- and multi-joint) report competitive performance in geometry and joint estimation while being over 400 times faster than baselines.

Significance. If the results hold, this would represent a meaningful advance by moving articulated 3DGS reconstruction from per-object optimization to feed-forward inference, addressing a key scalability bottleneck in prior NeRF/3DGS-based articulated methods. The per-pixel joint map and CSA approach could enable practical use in settings requiring rapid reconstruction from limited uncalibrated multi-state views.

major comments (2)

Abstract and method description: the central claim that the per-pixel joint map together with CSA using state tokens resolves the ill-posed simultaneous geometry and articulation inference from sparse uncalibrated multi-state views cannot be evaluated, as no equations, network architecture diagrams, loss formulations, or training details are provided to show how joint parameters are regressed or how CSA integrates discrete motion across states.
Experiments section: the claims of 'competitive performance' and '>400 times faster' on 68 PartNet-Mobility objects lack supporting quantitative tables, metrics (e.g., PSNR, Chamfer distance, joint angle error), baselines, error bars, or ablation studies, making it impossible to verify whether the reported results actually support the feed-forward advantage or the handling of single- vs. multi-joint cases.
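For reference, the joint metrics the second comment asks for can be stated compactly. The sketch below computes an axis direction error in degrees and an axis position error as the shortest distance between two 3D lines. These are common choices in articulated-reconstruction evaluations, but whether the paper uses exactly these definitions is an assumption, and the function names are hypothetical.

```python
import numpy as np

def axis_angle_error_deg(pred, gt):
    """Angle in degrees between two axis directions, ignoring sign."""
    pred = pred / np.linalg.norm(pred)
    gt = gt / np.linalg.norm(gt)
    cos = np.clip(abs(pred @ gt), 0.0, 1.0)
    return np.degrees(np.arccos(cos))

def axis_position_error(pred_axis, pred_pivot, gt_axis, gt_pivot):
    """Shortest distance between the lines pivot + t * axis."""
    n = np.cross(pred_axis, gt_axis)
    diff = gt_pivot - pred_pivot
    if np.linalg.norm(n) < 1e-8:
        # Parallel axes: distance from one pivot to the other line.
        a = gt_axis / np.linalg.norm(gt_axis)
        return np.linalg.norm(diff - (diff @ a) * a)
    return abs(diff @ n) / np.linalg.norm(n)

x = np.array([1.0, 0.0, 0.0])
y = np.array([0.0, 1.0, 0.0])
z = np.array([0.0, 0.0, 1.0])
origin = np.zeros(3)

err_perpendicular = axis_angle_error_deg(x, y)
err_flipped = axis_angle_error_deg(z, -z)
dist_parallel = axis_position_error(z, origin, z, np.array([1.0, 0.0, 5.0]))
dist_skew = axis_position_error(x, origin, y, np.array([0.0, 0.0, 2.0]))
```

Ignoring the sign of the axis treats a flipped direction as correct, which is a design choice: for a revolute joint the flipped axis with a negated angle describes the same motion. The position error is typically meaningful only for revolute joints, since a prismatic axis has no unique location.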

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their review. We address the two major comments below. Both comments correctly identify that the provided manuscript text consists only of the abstract and lacks the requested technical details and results; we will revise the manuscript to incorporate them.

Referee: Abstract and method description: the central claim that the per-pixel joint map together with CSA using state tokens resolves the ill-posed simultaneous geometry and articulation inference from sparse uncalibrated multi-state views cannot be evaluated, as no equations, network architecture diagrams, loss formulations, or training details are provided to show how joint parameters are regressed or how CSA integrates discrete motion across states.

Authors: The referee is correct that the abstract alone does not contain equations, diagrams, loss terms, or training details. We will expand the Methods section in the revised manuscript to include: (1) the per-pixel joint map formulation and regression head, (2) the CSA mechanism with state tokens and cross-state attention equations, (3) a network architecture diagram, (4) the full loss formulation combining reconstruction, joint, and regularization terms, and (5) training hyperparameters and data preprocessing details. Revision: yes.

Referee: Experiments section: the claims of 'competitive performance' and '>400 times faster' on 68 PartNet-Mobility objects lack supporting quantitative tables, metrics (e.g., PSNR, Chamfer distance, joint angle error), baselines, error bars, or ablation studies, making it impossible to verify whether the reported results actually support the feed-forward advantage or the handling of single- vs. multi-joint cases.

Authors: The referee is correct that the abstract provides no quantitative tables or metrics. We will add a dedicated Experiments section containing: Table 1 reporting PSNR/SSIM/LPIPS and Chamfer distance for geometry reconstruction, Table 2 reporting joint angle and axis errors for single- and multi-joint objects, direct comparisons against optimization-based baselines with runtime measurements confirming the >400x speedup, error bars from repeated runs, and ablation studies isolating the contribution of the joint map and CSA components. Revision: yes.

Circularity Check

0 steps flagged

No significant circularity identified

The abstract and provided context describe a proposed feed-forward architecture (per-pixel joint map + Cross-State Attention with state tokens) for articulated 3DGS reconstruction, presented as an empirical engineering contribution evaluated on PartNet-Mobility. No equations, derivation chains, fitted-parameter predictions, or self-citation load-bearing steps are visible in the given material. The method is introduced as a new representation and mechanism rather than derived from prior results by construction, so the central claims remain independent of any circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, training procedures, or architectural details, so no free parameters, axioms, or invented entities can be extracted.

Reference graph

Works this paper leans on

41 extracted references · 4 canonical work pages · 4 internal anchors

[1] R. J. Campello, D. Moulavi, and J. Sander. Density-Based Clustering Based on Hierarchical Density Estimates. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2013.

[2] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. ShapeNet: An Information-Rich 3D Model Repository. arXiv:1512.03012, 2015.

[3] D. Charatan, S. L. Li, A. Tagliasacchi, and V. Sitzmann. pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[4] X. Chen, Y. Chen, Y. Xiu, A. Geiger, and A. Chen. Easi3R: Estimating Disentangled Motion from DUSt3R Without Training. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025.

[5] Y. Chen, H. Xu, C. Zheng, B. Zhuang, M. Pollefeys, A. Geiger, T.-J. Cham, and J. Cai. MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images. In Proceedings of the European Conference on Computer Vision (ECCV), 2024.

[6] J. Guo, Y. Xin, G. Liu, K. Xu, L. Liu, and R. Hu. ArticulatedGS: Self-supervised Digital Twin Modeling of Articulated Objects using 3D Gaussian Splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

[7] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.

[8] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the International Conference on Learning Representations (ICLR), 2022.

[9] B. Huang, Z. Yu, A. Chen, A. Geiger, and S. Gao. 2D Gaussian Splatting for Geometrically Accurate Radiance Fields. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024.

[10] P. J. Huber. Robust Estimation of a Location Parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.

[11] L. Jiang, Y. Mao, L. Xu, T. Lu, K. Ren, Y. Jin, X. Xu, M. Yu, J. Pang, F. Zhao, et al. AnySplat: Feed-forward 3D Gaussian Splatting from Unconstrained Views. ACM Transactions on Graphics (TOG), 44(6):1–16, 2025.

[12] Z. Jiang, C.-C. Hsu, and Y. Zhu. Ditto: Building Digital Twins of Articulated Objects from Interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

[13] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Transactions on Graphics, 42(4):139:1–139:14, 2023.

[14] S. Kim, J. Ha, Y. H. Kim, Y. Lee, and F. C. Park. ScrewSplat: An End-to-End Method for Articulated Object Recognition. In Proceedings of the Conference on Robot Learning (CoRL), 2025.

[15] V. Leroy, Y. Cabon, and J. Revaud. Grounding Image Matching in 3D with MASt3R. In Proceedings of the European Conference on Computer Vision (ECCV), 2024.

[16] Z. Li, C. Zhang, Z. Li, H. Howard-Jenkins, Z. Lv, C. Geng, J. Wu, R. Newcombe, J. Engel, and Z. Dong. ART: Articulated Reconstruction Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026.

[17] S. Lin, J. Fang, M. Z. Irshad, V. C. Guizilini, R. A. Ambrus, G. Shakhnarovich, and M. R. Walter. SplArt: Articulation Estimation and Part-Level Reconstruction with 3D Gaussian Splatting. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025.

[18] J. Liu, A. Mahdavi-Amiri, and M. Savva. PARIS: Part-level Reconstruction and Motion Analysis for Articulated Objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.

[19] Y. Liu, B. Jia, R. Lu, J. Ni, S.-C. Zhu, and S. Huang. ArtGS: Building Interactable Replicas of Complex Articulated Objects via Gaussian Splatting. In Proceedings of the International Conference on Learning Representations (ICLR), 2025.

[20] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. Communications of the ACM, 65(1):99–106, 2021.

[21] K. Mo, S. Zhu, A. X. Chang, L. Yi, S. Tripathi, L. J. Guibas, and H. Su. PartNet: A Large-Scale Benchmark for Fine-Grained and Hierarchical Part-Level 3D Object Understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[22] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al. DINOv2: Learning Robust Visual Features without Supervision. arXiv:2304.07193, 2023.

[23] E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville. FiLM: Visual Reasoning with a General Conditioning Layer. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2018.

[24] R. Ranftl, A. Bochkovskiy, and V. Koltun. Vision Transformers for Dense Prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

[25] L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear Total Variation Based Noise Removal Algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268, 1992.

[26] L. Shen, S. Zhang, H. Li, P. Yang, Z. Huang, Z. Zhang, and H. Zhao. GaussianArt: Unified Modeling of Geometry and Motion for Articulated Objects. In Proceedings of the International Conference on 3D Vision (3DV), 2025.

[27] B. Smart, C. Zheng, I. Laina, and V. A. Prisacariu. Splatt3R: Zero-shot Gaussian Splatting from Uncalibrated Image Pairs. arXiv:2408.13912, 2024.

[28] W.-C. Tseng, H.-J. Liao, L. Yen-Chen, and M. Sun. CLA-NeRF: Category-Level Articulated Neural Radiance Field. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2022.

[29] J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny. VGGT: Visual Geometry Grounded Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

[30] Q. Wang, Y. Zhang, A. Holynski, A. A. Efros, and A. Kanazawa. Continuous 3D Perception Model with Persistent State. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

[31] S. Wang, V. Leroy, Y. Cabon, B. Chidlovskii, and J. Revaud. DUSt3R: Geometric 3D Vision Made Easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[32] Y. Weng, B. Wen, J. Tremblay, V. Blukis, D. Fox, L. Guibas, and S. Birchfield. Neural Implicit Representation for Building Digital Twins of Unknown Articulated Objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[33] D. Wu, L. Liu, Z. Linli, A. Huang, L. Song, Q. Yu, Q. Wu, and C. Lu. REArtGS: Reconstructing and Generating Articulated Objects via 3D Gaussian Splatting with Geometric and Motion Constraints. In Advances in Neural Information Processing Systems (NeurIPS), 2025.

[34] F. Xiang, Y. Qin, K. Mo, Y. Xia, H. Zhu, F. Liu, M. Liu, H. Jiang, Y. Yuan, H. Wang, L. Yi, A. X. Chang, L. J. Guibas, and H. Su. SAPIEN: A Simulated Part-Based Interactive Environment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[35] H. Xu, S. Peng, F. Wang, H. Blum, D. Barath, A. Geiger, and M. Pollefeys. DepthSplat: Connecting Gaussian Splatting and Depth. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

[36] J. Yang, A. Sax, K. J. Liang, M. Henaff, H. Tang, A. Cao, J. Chai, F. Meier, and M. Feiszli. Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

[37] B. Ye, S. Liu, H. Xu, X. Li, M. Pollefeys, M.-H. Yang, and S. Peng. No Pose, No Problem: Surprisingly Simple 3D Gaussian Splats from Sparse Unposed Images. In Proceedings of the International Conference on Learning Representations (ICLR), 2025.

[38] T. Yu, V. Shah, M. Wahed, Y. Shen, K. A. Nguyen, and I. Lourentzou. Part2GS: Part-aware Modeling of Articulated Objects using 3D Gaussian Splatting. arXiv:2506.17212, 2025.

[39] S. Yuan, R. Shi, X. Wei, X. Zhang, H. Su, and M. Liu. LARM: A Large Articulated Object Reconstruction Model. In Proceedings of the SIGGRAPH Asia Conference Papers (SIGGRAPH Asia), 2025.

[40] J. Zhang, C. Herrmann, J. Hur, V. Jampani, T. Darrell, F. Cole, D. Sun, and M.-H. Yang. MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion. In Proceedings of the International Conference on Learning Representations (ICLR), 2025.

[41] S. Zhang, J. Wang, Y. Xu, N. Xue, C. Rupprecht, X. Zhou, Y. Shen, and G. Wetzstein. FLARE: Feed-forward Geometry, Appearance and Camera Estimation from Uncalibrated Sparse Views. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.