arXiv: 2602.08058 · v2 · submitted 2026-02-08 · cs.CV · cs.AI · cs.RO · cs.SY · eess.SY

Picasso: Holistic Scene Reconstruction with Physics-Constrained Sampling

Xihang Yu, Rajat Talak, Lorenzo Shaikewitz, Luca Carlone

Pith reviewed 2026-05-16 05:52 UTC · model grok-4.3

keywords: scene reconstruction · physics-constrained sampling · object contact graph · rejection sampling · pose and shape estimation · physical plausibility · multi-object interactions · occlusion handling

The pith

Picasso reconstructs multi-object scenes by jointly enforcing geometry, non-penetration, and physics through contact-graph-guided rejection sampling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper contends that separate per-object pose and shape estimation produces geometrically faithful but physically invalid scenes, such as interpenetrating or unstable object arrangements, especially under occlusion and sensor noise. It argues that reliable reconstruction instead requires holistic reasoning that accounts for object interactions and physical plausibility so the resulting models can be used directly in simulators for planning and control. Picasso implements this idea with a fast rejection sampler that first infers an object contact graph and then uses it to bias sample generation toward valid configurations. The authors release a new benchmark of ten real-world contact-rich scenes together with a physical-plausibility metric and demonstrate that the method yields more stable and human-aligned results than prior techniques on both this dataset and YCB-V. If correct, the work shows that physics constraints can be folded into the core estimation loop without sacrificing speed or accuracy.

Core claim

Picasso is a reconstruction pipeline that builds multi-object scenes by considering geometry, non-penetration, and physics together. It relies on a fast rejection sampling method that reasons over multi-object interactions by leveraging an inferred object contact graph to guide samples. The resulting estimates are both geometrically consistent with sensor data and physically plausible, allowing direct import into simulators without manual correction.

What carries the argument

The central mechanism is physics-constrained rejection sampling guided by an inferred object contact graph that directs the sampler toward non-penetrating and stable configurations.
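
To make the mechanism concrete, here is a minimal Python sketch of contact-graph-guided rejection sampling on a toy one-dimensional stacking problem. It is not the authors' implementation: the scene (a box on the ground with a mug on top), the names `sample_scene` and `acceptance_rate`, and the values of `sigma` and `tol` are all illustrative assumptions. Each object is reduced to the height of its bottom face, and a candidate is rejected if it penetrates its support, floats above it, or strays too far from the noisy measurement.

```python
import random

# All numbers below are made up for illustration (metres).
HEIGHTS = {"box": 0.10, "mug": 0.08}         # object heights
MEASURED = {"box": 0.004, "mug": 0.098}      # noisy bottom-face heights from the "sensor"
CONTACTS = {"box": "ground", "mug": "box"}   # hypothetical contact graph: object -> support


def support_top(support, z):
    """Height of the top face of `support`, given the bottoms sampled so far."""
    return 0.0 if support == "ground" else z[support] + HEIGHTS[support]


def sample_scene(guided, rng, sigma=0.02, tol=0.002):
    """Draw one candidate scene; return it, or None if it is rejected."""
    z = {}
    for name in MEASURED:  # supports are listed before the objects they carry
        if guided:
            # Guided proposal: place the object within `tol` above its support.
            z[name] = support_top(CONTACTS[name], z) + rng.uniform(0.0, tol)
        else:
            # Unguided proposal: perturb each measurement independently.
            z[name] = rng.gauss(MEASURED[name], sigma)
    for name, support in CONTACTS.items():
        gap = z[name] - support_top(support, z)
        if gap < 0.0 or gap > tol + 1e-12:   # penetrating or floating
            return None
    if any(abs(z[n] - MEASURED[n]) > 3 * sigma for n in MEASURED):
        return None                          # inconsistent with the sensor data
    return z


def acceptance_rate(guided, n=2000, seed=0):
    """Fraction of candidates that survive all rejection tests."""
    rng = random.Random(seed)
    accepted = sum(sample_scene(guided, rng) is not None for _ in range(n))
    return accepted / n


if __name__ == "__main__":
    print("guided:  ", acceptance_rate(guided=True))
    print("unguided:", acceptance_rate(guided=False))
```

In this toy setting the guided proposal places each object just above its support, so candidates pass the physics checks by construction, while the unguided proposal rarely lands inside the contact tolerance. Real scenes involve full 6D poses and shapes, so the gap in acceptance rates here should be read only as intuition for why the graph is used to steer samples.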

If this is right

Reconstructed scenes can be imported directly into simulators to predict dynamic behavior without corrective post-processing.
Performance gains appear in contact-rich environments where inter-object constraints dominate the solution space.
The same pipeline improves results on established benchmarks such as YCB-V while adding physical validity guarantees.
Digital twins built from these reconstructions support more reliable simulation-based planning for contact-rich robotic tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Jointly optimizing the contact graph together with the pose estimates rather than inferring it first could further reduce rejection rates on ambiguous scenes.
Extending the sampler to incorporate temporal consistency across video frames would allow reconstruction of moving scenes without separate tracking.
The physical-plausibility metric introduced in the benchmark could serve as a training signal for learning-based reconstructors that currently optimize only geometric error.
Scaling the approach to scenes with dozens of objects will likely require more efficient graph inference or learned proposal distributions to keep the rejection sampler tractable.

Load-bearing premise

The inferred object contact graph is accurate enough to steer sampling toward valid solutions without excluding good configurations or requiring an impractical number of rejections.

What would settle it

A controlled experiment in which the contact-graph inference is deliberately corrupted on an otherwise solvable scene and the sampler either fails to return any valid configuration within a fixed budget or returns only interpenetrating or unstable arrangements.
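
One way to set up the corruption step of such an experiment is sketched below. This is an assumption about how the test could be run, not something described in the paper: `corrupt_contact_graph` is a hypothetical helper that flips each possible contact edge (adding a missing one or removing an existing one) with probability `flip_prob`, so that the sampler's success rate can be plotted against the corruption level.

```python
import itertools
import random


def corrupt_contact_graph(objects, edges, flip_prob, rng):
    """Return a copy of an undirected contact graph with edges randomly flipped.

    objects:   iterable of object names
    edges:     iterable of (a, b) pairs that are in contact
    flip_prob: probability of toggling each possible pair
    """
    clean = {frozenset(e) for e in edges}
    corrupted = set()
    for pair in itertools.combinations(sorted(objects), 2):
        key = frozenset(pair)
        present = key in clean
        if rng.random() < flip_prob:
            present = not present      # add a false contact or drop a true one
        if present:
            corrupted.add(key)
    return corrupted


if __name__ == "__main__":
    objects = ["table", "box", "mug"]
    edges = [("table", "box"), ("box", "mug")]
    rng = random.Random(0)
    for p in (0.0, 0.25, 0.5):
        noisy = corrupt_contact_graph(objects, edges, p, rng)
        print(p, sorted(tuple(sorted(e)) for e in noisy))
```

Feeding each corrupted graph to the sampler under a fixed sample budget, and recording whether a valid configuration is returned, would give the failure curve the experiment calls for.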

Figures

Figures reproduced from arXiv: 2602.08058 by Lorenzo Shaikewitz, Luca Carlone, Rajat Talak, Xihang Yu.

**Figure 1.** We propose Picasso, an approach to build multi-object scene reconstructions by accounting for object geometry, nonpenetration, and physics (i.e., objects should be in a stable equilibrium for the scene to be static). We also release the Picasso dataset: a collection of 10 contact-rich real-world scenes we use to test physical plausibility of scene reconstructions. The figure shows the digital twins genera…

**Figure 2.** An example illustrating that a 3D scene reconstruction…

**Figure 4.** Sample image and corresponding contact scene graph…

**Figure 5.** Conceptual illustration of the loss landscape on the…

**Figure 6.** Visualization of 3D scene reconstruction from the Pi…

**Figure 7.** VLM prompt for contact scene graph generation.

**Figure 8.** Examples of Picasso Dataset. Left: RGB images. Middle: Depth maps. Right: Reconstructed 3D models.

TABLE VI: Evaluation results on the ADD-S (×10⁻³ m) and ADD-S (AUC %) metrics for the YCB-V dataset. CRISP-Syn+Picasso w/o phy: CRISP-Syn+Picasso with physics constraints turned off.

| Method | ADD-S ↓ Mean | ADD-S ↓ Median | ADD-S (AUC %) ↑ 1 cm | ADD-S (AUC %) ↑ 2 cm | ADD-S (AUC %) ↑ 3 cm |
|---|---|---|---|---|---|
| CRISP-Syn+Picasso w/o phy | 8.35 | 3.04 | 51.8 | 68.9 | 77.1 |
| CRISP-Syn+Picas… | | | | | |

**Figure 10.** A failure case due to noisy and partial depth point…

**Figure 11.** Approximation on contact scene graph inference can…

**Figure 12.** Human evaluation and SPS comparison between SAM3D and SAM3D+Picasso. Top: SAM3D. Bottom: SAM3D+Picasso. For both Experts and Public, SAM3D+Picasso achieves higher physics plausibility.

TABLE X: Human evaluation of physics plausibility and SPS across 12 YCB-Video trajectories. Human scores are on a 1-7 scale (higher is better) across 83 participants. SPS is averaged across 3 frames per trajectory. S: SAM…

Original abstract

In the presence of occlusions and measurement noise, geometrically accurate scene reconstructions -- which fit the sensor data -- can still be physically incorrect. For instance, when estimating the poses and shapes of objects in the scene and importing the resulting estimates into a simulator, small errors might translate to implausible configurations including object interpenetration or unstable equilibrium. This makes it difficult to predict the dynamic behavior of the scene using a digital twin, an important step in simulation-based planning and control of contact-rich behaviors. In this paper, we posit that object pose and shape estimation requires reasoning holistically over the scene (instead of reasoning about each object in isolation), accounting for object interactions and physical plausibility. Towards this goal, our first contribution is Picasso, a physics-constrained reconstruction pipeline that builds multi-object scene reconstructions by considering geometry, non-penetration, and physics. Picasso relies on a fast rejection sampling method that reasons over multi-object interactions, leveraging an inferred object contact graph to guide samples. Second, we propose the Picasso dataset, a collection of 10 contact-rich real-world scenes with ground truth annotations, as well as a metric to quantify physical plausibility, which we open-source as part of our benchmark. Finally, we provide an extensive evaluation of Picasso on our newly introduced dataset and on the YCB-V dataset, and show it largely outperforms the state of the art while providing reconstructions that are both physically plausible and more aligned with human intuition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Picasso adds contact-graph guided rejection sampling to enforce physical plausibility in multi-object scene reconstruction and ships a 10-scene dataset plus metric, but the small scale and missing ablations leave the gains hard to judge.

The main point is that Picasso reconstructs scenes by sampling poses and shapes while using an inferred contact graph to reject physically invalid configurations like interpenetrations or unstable stacks. This holistic step replaces isolated per-object fitting and produces outputs that work better when dropped into a simulator. They also release a new 10-scene real-world dataset with ground-truth annotations and a physical-plausibility metric, plus results on YCB-V that beat prior methods on their metric and look more intuitive to people.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that Picasso, a physics-constrained scene reconstruction pipeline, produces physically plausible multi-object reconstructions by using fast rejection sampling guided by an inferred object contact graph. It introduces a new 10-scene real-world dataset with ground-truth annotations and a physical plausibility metric, demonstrating outperformance over prior methods on this dataset and on YCB-V while yielding results more aligned with human intuition.

Significance. If the results hold, the work could advance simulation-based planning and control by enabling more reliable digital twins for contact-rich scenes. The new dataset and plausibility metric are valuable open contributions that address a gap in evaluating physical correctness beyond geometric fit. The holistic treatment of object interactions via the contact graph is a promising direction, though its robustness remains to be fully substantiated.

major comments (3)

§5 (Experiments): No ablation study isolates the contribution of the inferred contact graph to sampling efficiency or reconstruction quality. Without removing or replacing this component, it is impossible to determine whether the reported gains in physical plausibility derive from the graph-guided rejection sampling or from other elements of the pipeline.
§5.2 and Table 2: The evaluation provides no quantitative analysis of contact-graph inference accuracy, rejection rates, or failure cases in contact-rich scenes. This leaves the central assumption (that the graph inferred from noisy geometry reliably guides sampling without excessive rejections or exclusion of valid configurations) unsupported by direct evidence.
§5.1: Baseline comparisons lack full details on implementation, hyper-parameter tuning, and error bars on the new plausibility metric. The claim of outperformance is therefore only moderately supported, as variance and reproducibility cannot be assessed.

minor comments (2)

Figure 3 and §4.2: The contact-graph visualization would benefit from explicit annotation of false-positive/negative edges to illustrate inference errors on real data.
§3: Notation for the rejection-sampling acceptance probability could be clarified with a short pseudocode block to avoid ambiguity in the multi-object interaction term.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We agree that additional ablations, quantitative analyses of the contact graph, and greater transparency in baseline comparisons will strengthen the paper. We will incorporate these elements in the revised version. Below we address each major comment point by point.

Referee: §5 (Experiments): No ablation study isolates the contribution of the inferred contact graph to sampling efficiency or reconstruction quality. Without removing or replacing this component, it is impossible to determine whether the reported gains in physical plausibility derive from the graph-guided rejection sampling or from other elements of the pipeline.

Authors: We agree that an ablation isolating the contact graph's contribution is valuable. In the revised manuscript, we will add an ablation comparing the full Picasso pipeline to a variant using rejection sampling without contact-graph guidance. We will report differences in sampling efficiency (rejection rates and runtime) and reconstruction quality (geometric accuracy and physical plausibility metrics) on the Picasso dataset and YCB-V to clarify the graph's role. Revision: yes.

Referee: §5.2 and Table 2: The evaluation provides no quantitative analysis of contact-graph inference accuracy, rejection rates, or failure cases in contact-rich scenes. This leaves the central assumption (that the graph inferred from noisy geometry reliably guides sampling without excessive rejections or exclusion of valid configurations) unsupported by direct evidence.

Authors: We will add a new analysis subsection in the revision. This will include quantitative metrics on contact-graph inference accuracy (precision/recall against ground-truth contacts from our dataset annotations), average rejection rates during sampling, and a discussion of observed failure cases in contact-rich scenes. These results will directly support the reliability of the graph-guided approach. Revision: yes.

Referee: §5.1: Baseline comparisons lack full details on implementation, hyper-parameter tuning, and error bars on the new plausibility metric. The claim of outperformance is therefore only moderately supported, as variance and reproducibility cannot be assessed.

Authors: We acknowledge the need for greater reproducibility. In the revised manuscript, we will expand the baseline section with full implementation details, specific hyper-parameter values and tuning procedures for each method, and error bars (standard deviations over multiple runs) for the physical plausibility metric on both datasets. This will allow proper assessment of variance and strengthen the outperformance claims. Revision: yes.
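
The precision/recall analysis mentioned in the second response could be computed along the following lines. This is a minimal sketch under stated assumptions: contacts are treated as undirected object pairs, the function name `edge_precision_recall` is hypothetical, and the example edges are invented for illustration rather than taken from the Picasso dataset.

```python
def edge_precision_recall(predicted, ground_truth):
    """Precision and recall of predicted contact edges against annotated ones.

    Both arguments are iterables of (a, b) object pairs; order within a pair
    is ignored because contact is treated as symmetric here.
    """
    pred = {frozenset(e) for e in predicted}
    true = {frozenset(e) for e in ground_truth}
    true_positives = len(pred & true)
    # Convention (an assumption): an empty set scores 1.0 rather than dividing by zero.
    precision = true_positives / len(pred) if pred else 1.0
    recall = true_positives / len(true) if true else 1.0
    return precision, recall


if __name__ == "__main__":
    predicted = [("table", "box"), ("box", "mug"), ("mug", "can")]      # invented
    ground_truth = [("box", "table"), ("mug", "box"), ("table", "can")]  # invented
    p, r = edge_precision_recall(predicted, ground_truth)
    print(f"precision={p:.2f} recall={r:.2f}")
```

A false-positive edge can pull the sampler toward a contact that does not exist, while a false-negative edge removes guidance where it is needed, so reporting the two numbers separately is more informative than a single accuracy figure.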

Circularity Check

0 steps flagged

New rejection sampling and dataset avoid circular derivation

The paper introduces Picasso as a novel physics-constrained pipeline relying on rejection sampling guided by an inferred contact graph, plus a new 10-scene dataset and physical plausibility metric. No equations or claims reduce by construction to prior fitted parameters; evaluations on the new dataset and YCB-V provide independent content. Minor self-citations may exist for background but are not load-bearing for the central reconstruction claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard rigid-body physics and contact assumptions plus the claim that rejection sampling guided by an inferred graph can efficiently locate valid configurations; no new free parameters or invented entities are introduced beyond conventional sampling hyperparameters.

axioms (1)

Domain assumption: Rigid-body non-penetration and equilibrium constraints are sufficient to define physical plausibility for the target scenes.
Invoked to justify rejecting samples that interpenetrate or are unstable.
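
As a rough illustration of what these two constraints mean in the simplest possible case, the sketch below checks non-penetration and a crude equilibrium condition for axis-aligned boxes in a 2D side view. Everything here is an assumption made for illustration: the `Box` type, the function names, and the stability rule (the upper box touches the lower one and its centre of mass lies over the shared footprint, assuming uniform density). The paper's actual constraints operate on full 3D shapes and are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in a 2D side view (x horizontal, z vertical), in metres."""
    x_min: float
    x_max: float
    z_min: float
    z_max: float


def penetrates(a, b, eps=1e-9):
    """True if the interiors of the two boxes overlap (touching faces are allowed)."""
    overlap_x = min(a.x_max, b.x_max) - max(a.x_min, b.x_min)
    overlap_z = min(a.z_max, b.z_max) - max(a.z_min, b.z_min)
    return overlap_x > eps and overlap_z > eps


def rests_stably_on(upper, lower, tol=1e-3):
    """Crude equilibrium test for a single box resting on a single support."""
    touching = abs(upper.z_min - lower.z_max) <= tol
    com_x = 0.5 * (upper.x_min + upper.x_max)   # uniform density assumed
    lo = max(upper.x_min, lower.x_min)          # shared footprint along x
    hi = min(upper.x_max, lower.x_max)
    return touching and lo <= com_x <= hi


if __name__ == "__main__":
    table = Box(0.0, 1.0, 0.0, 0.5)
    mug_ok = Box(0.4, 0.6, 0.5, 0.7)        # sits on the table
    mug_sunk = Box(0.4, 0.6, 0.45, 0.65)    # sinks 5 cm into the table
    mug_edge = Box(0.95, 1.25, 0.5, 0.7)    # overhangs the table edge
    print(penetrates(table, mug_sunk), rests_stably_on(mug_ok, table),
          rests_stably_on(mug_edge, table))
```

A rejection test built from checks like these would discard `mug_sunk` for penetration and `mug_edge` for instability, which is the role the axiom assigns to the two constraints.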

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Reconstruction by Generation: 3D Multi-Object Scene Reconstruction from Sparse Observations
cs.CV · 2026-04 · unverdicted · novelty 6.0

RecGen achieves state-of-the-art 3D multi-object scene reconstruction from sparse RGB-D views by combining compositional synthetic scene generation with strong 3D shape priors, outperforming SAM3D by 30%+ in shape qua...

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 1 Pith paper · 3 internal anchors

[1]

Scenecom- plete: Open-world 3d scene completion in cluttered real world environments for robot manipulation.IEEE Robotics and Automation Letters, 11(1):482–489, 2025

Aditya Agarwal, Gaurav Singh, Bipasha Sen, Tom ´as Lozano-P´erez, and Leslie Pack Kaelbling. Scenecom- plete: Open-world 3d scene completion in cluttered real world environments for robot manipulation.IEEE Robotics and Automation Letters, 11(1):482–489, 2025

work page 2025
[2]

Amodal 3d reconstruction for robotic manipulation via stability and connectivity

William Agnew, Christopher Xie, Aaron Walsman, Oc- tavian Murad, Yubo Wang, Pedro Domingos, and Sid- dhartha Srinivasa. Amodal 3d reconstruction for robotic manipulation via stability and connectivity. InCon- ference on Robot Learning (CoRL), pages 1498–1508. PMLR, 2021

work page 2021
[3]

A general and adaptive robust loss function

Jonathan T Barron. A general and adaptive robust loss function. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4331– 4339, 2019

work page 2019
[4]

Spatial functa: Scaling functa to imagenet classifi- cation and generation.arXiv preprint arXiv:2302.03130, 2023

Matthias Bauer, Emilien Dupont, Andy Brock, Dan Rosenbaum, Jonathan Richard Schwarz, and Hyunjik Kim. Spatial functa: Scaling functa to imagenet classifi- cation and generation.arXiv preprint arXiv:2302.03130, 2023

work page arXiv 2023
[5]

Vysics: Ob- ject reconstruction under occlusion by fusing vision and contact-rich physics.arXiv preprint arXiv:2504.18719, 2025

Bibit Bianchini, Minghan Zhu, Mengti Sun, Bowen Jiang, Camillo J Taylor, and Michael Posa. Vysics: Ob- ject reconstruction under occlusion by fusing vision and contact-rich physics.arXiv preprint arXiv:2504.18719, 2025

work page arXiv 2025
[6]

Strictly constrained generative modeling via split augmented langevin sampling.arXiv preprint arXiv:2505.18017, 2025

Matthieu Blanke, Yongquan Qu, Sara Shamekh, and Pierre Gentine. Strictly constrained generative modeling via split augmented langevin sampling.arXiv preprint arXiv:2505.18017, 2025

work page arXiv 2025
[7]

T. M. Breuel. Implementation techniques for geomet- ric branch-and-bound matching methods.Comput. Vis. Image Underst., 90(3):258–294, 2003

work page 2003
[8]

ShapeNet: An Information-Rich 3D Model Repository

Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository.arXiv preprint arXiv:1512.03012, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[9]

SAM 3D: 3Dfy Anything in Images

Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, et al. Sam 3d: 3dfy anything in images.arXiv preprint arXiv:2511.16624, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Blender Foundation, Stichting Blender Foundation, Amsterdam, 2018

Blender Online Community.Blender - a 3D modelling and rendering package. Blender Foundation, Stichting Blender Foundation, Amsterdam, 2018. URL http:// www.blender.org

work page 2018
[11]

G.F. Cooper. The computational complexity of proba- bilistic inference using Bayesian belief networks.Arti- ficial Intelligence, 42(2-3):393–405, 1990. ISSN 0004- 3702

work page 1990
[12]

Objaverse-xl: A universe of 10m+ 3d objects.Advances in Neural Information Processing Systems (NeurIPS), 36: 35799–35813, 2023

Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram V oleti, Samir Yitzhak Gadre, et al. Objaverse-xl: A universe of 10m+ 3d objects.Advances in Neural Information Processing Systems (NeurIPS), 36: 35799–35813, 2023

work page 2023
[13]

Blenderproc: Reducing the re- ality gap with photorealistic rendering

Maximilian Denninger, Martin Sundermeyer, Dominik Winkelbauer, Dmitry Olefir, Tomas Hodan, Youssef Zi- dan, Mohamad Elbadrawy, Markus Knauer, Harinandan Katam, and Ahsan Lodhi. Blenderproc: Reducing the re- ality gap with photorealistic rendering. In16th Robotics: Science and Systems, RSS 2020, Workshops, 2020

work page 2020
[14]

Google scanned objects: A high-quality dataset of 3d scanned household items

Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B McHugh, and Vincent Vanhoucke. Google scanned objects: A high-quality dataset of 3d scanned household items. InIEEE Intl. Conf. on Robotics and Automation (ICRA), pages 2553–2560. IEEE, 2022

work page 2022
[15]

From data to functa: Your data point is a function and you can treat it like one.arXiv preprint arXiv:2201.12204, 2022

Emilien Dupont, Hyunjik Kim, SM Eslami, Danilo Rezende, and Dan Rosenbaum. From data to functa: Your data point is a function and you can treat it like one.arXiv preprint arXiv:2201.12204, 2022

work page arXiv 2022
[16]

Fischler and R

M. Fischler and R. Bolles. Random sample consensus: a paradigm for model fitting with application to image analysis and automated cartography.Commun. ACM, 24: 381–395, 1981

work page 1981
[17]

Diffusion models for constrained domains.arXiv preprint arXiv:2304.05364, 2023

Nic Fishman, Leo Klarner, Valentin De Bortoli, Emile Mathieu, and Michael Hutchinson. Diffusion models for constrained domains.arXiv preprint arXiv:2304.05364, 2023

work page arXiv 2023
[18]

Yang Fu and Xiaolong Wang. Category-level 6d object pose estimation in the wild: A semi-supervised learning approach and a new dataset.Advances in Neural Infor- mation Processing Systems (NeurIPS), 35:27469–27483, 2022

work page 2022
[19]

Gothoskar, M

N. Gothoskar, M. Cusumano-Towner, B. Zinberg, M. Ghavamizadeh, F. Pollok, A. Garrett, J.B. Tenen- baum, D. Gutfreund, and V .K. Mansinghka. 3DP3: 3D scene perception via probabilistic programming. InarXiv preprint: 2111.00312, 2021

work page arXiv 2021
[20]

Bayes3d: fast learning and inference in structured generative models of 3d objects and scenes

Nishad Gothoskar, Matin Ghavami, Eric Li, Aidan Curtis, Michael Noseworthy, Karen Chung, Brian Pat- ton, William T Freeman, Joshua B Tenenbaum, Mirko Klukas, et al. Bayes3d: fast learning and inference in structured generative models of 3d objects and scenes. arXiv preprint arXiv:2312.08715, 2023

work page arXiv 2023
[21]

Nonconvex rigid bodies with stacking.ACM transactions on graphics (TOG), 22(3):871–878, 2003

Eran Guendelman, Robert Bridson, and Ronald Fedkiw. Nonconvex rigid bodies with stacking.ACM transactions on graphics (TOG), 22(3):871–878, 2003

work page 2003
[22]

Realistic animation of rigid bodies.ACM Siggraph computer graphics, 22(4):299–308, 1988

James K Hahn. Realistic animation of rigid bodies.ACM Siggraph computer graphics, 22(4):299–308, 1988

work page 1988
[23]

Hartley and F

R.I. Hartley and F. Kahl. Global optimization through rotation space search.Intl. J. of Computer Vision, 82(1): 64–79, 2009

work page 2009
[24]

Zero-shot multi-object scene completion

Shun Iwase, Katherine Liu, Vitor Guizilini, Adrien Gaidon, Kris Kitani, Rares ¸ Ambrus ¸, and Sergey Za- kharov. Zero-shot multi-object scene completion. In European Conf. on Computer Vision (ECCV), pages 96–

work page
[25]

Ze- rograsp: Zero-shot shape reconstruction enabled robotic grasping

Shun Iwase, Muhammad Zubair Irshad, Katherine Liu, Vitor Guizilini, Robert Lee, Takuya Ikeda, Ayako Amma, Koichi Nishiwaki, Kris Kitani, Rares Ambrus, et al. Ze- rograsp: Zero-shot shape reconstruction enabled robotic grasping. InIEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 17405–17415, 2025

work page 2025
[26]

Izatt, H

G. Izatt, H. Dai, and R. Tedrake. Globally optimal object pose estimation in point clouds with mixed-integer programming. InProc. of the Intl. Symp. of Robotics Research (ISRR), 2017

work page 2017
[27]

libigl: A simple C++ geometry processing library, 2018

Alec Jacobson, Daniele Panozzo, et al. libigl: A simple C++ geometry processing library, 2018. https://libigl.github.io/

work page 2018
[28]

Phystwin: Physics- informed reconstruction and simulation of deformable objects from videos.arXiv preprint arXiv:2503.17973, 2025

Hanxiao Jiang, Hao-Yu Hsu, Kaifeng Zhang, Hsin-Ni Yu, Shenlong Wang, and Yunzhu Li. Phystwin: Physics- informed reconstruction and simulation of deformable objects from videos.arXiv preprint arXiv:2503.17973, 2025

work page arXiv 2025
[29]

small gicp: Efficient and parallel algo- rithms for point cloud registration.Journal of Open Source Software, 9(100):6948, August 2024

Kenji Koide. small gicp: Efficient and parallel algo- rithms for point cloud registration.Journal of Open Source Software, 9(100):6948, August 2024. doi: 10. 21105/joss.06948

work page 2024
[30]

Labbe, J

Y . Labbe, J. Carpentier, M. Aubry, and J. Sivic. Cosy- Pose: Consistent multi-view multi-object 6D pose esti- mation. InEuropean Conf. on Computer Vision (ECCV), 2020

work page 2020
[31]

Megapose: 6d pose estimation of novel objects via render & compare

Yann Labb ´e, Lucas Manuelli, Arsalan Mousavian, Stephen Tyree, Stan Birchfield, Jonathan Tremblay, Justin Carpentier, Mathieu Aubry, Dieter Fox, and Josef Sivic. Megapose: 6d pose estimation of novel objects via render & compare. 2022

work page 2022
[32]

H. Lim, D. Kim, G. Shin, J. Shi, I. Vizzo, H. Myung, J. Park, and L. Carlone. KISS-Matcher: Fast and robust point cloud registration revisited. InIEEE Intl. Conf. on Robotics and Automation (ICRA), 2025

work page 2025
[33]

Xingyu Liu, Ruida Zhang, Chenyangguang Zhang, Bowen Fu, Jiwen Tang, Xiquan Liang, Jingyi Tang, Xiao- tian Cheng, Yukang Zhang, Gu Wang, and Xiangyang Ji. Gdrnpp. https://github.com/shanice-l/gdrnpp bop2022, 2022

work page 2022
[34]

Physpose: Refining 6d object poses with physical constraints.arXiv preprint arXiv:2503.23587, 2025

Martin Malenick `y, Martin C´ıfka, M´ed´eric Fourmy, Louis Montaut, Justin Carpentier, Josef Sivic, and Vladimir Petrik. Physpose: Refining 6d object poses with physical constraints.arXiv preprint arXiv:2503.23587, 2025

work page arXiv 2025
[35]

Phyrecon: Physically plausible neural scene reconstruction.Advances in Neural Information Processing Systems (NeurIPS), 37:25747–25780, 2024

Junfeng Ni, Yixin Chen, Bohan Jing, Nan Jiang, Bin Wang, Bo Dai, Puhao Li, Yixin Zhu, Song-Chun Zhu, and Siyuan Huang. Phyrecon: Physically plausible neural scene reconstruction.Advances in Neural Information Processing Systems (NeurIPS), 37:25747–25780, 2024

work page 2024
[36]

Score-based con- strained generative modeling via langevin diffusions with boundary conditions.arXiv preprint arXiv:2510.23985, 2025

Adam Nordenh ¨og and Akash Sharma. Score-based con- strained generative modeling via langevin diffusions with boundary conditions.arXiv preprint arXiv:2510.23985, 2025

work page arXiv 2025
[37]

Parra Bustos, T

´A. Parra Bustos, T. J. Chin, and D. Suter. Fast rotation search with stereographic projections for 3d registration. InIEEE Conf. on Computer Vision and Pattern Recog- nition (CVPR), pages 3930–3937, 2014

work page 2014
[38]

Pytorch: An imperative style, high-performance deep learning library.Advances in neural information processing systems, 32, 2019

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zem- ing Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library.Advances in neural information processing systems, 32, 2019

work page 2019
[39]

Pavlakos, X

G. Pavlakos, X. Zhou, A. Chan, K. Derpanis, and K. Daniilidis. 6-dof object pose from semantic keypoints. InIEEE Intl. Conf. on Robotics and Automation (ICRA), 2017

work page 2017
[40]

Quantiphy: A quantitative benchmark evaluating physical reasoning abilities of vision-language models.arXiv preprint arXiv:2512.19526, 2025

Li Puyin, Tiange Xiang, Ella Mao, Shirley Wei, Xinye Chen, Adnan Masood, Li Fei-Fei, and Ehsan Adeli. Quantiphy: A quantitative benchmark evaluating physical reasoning abilities of vision-language models.arXiv preprint arXiv:2512.19526, 2025

work page arXiv 2025
[41]

N. Ravi, V . Gabeur, Y-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. R¨adle, C. Rolland, L. Gustafson, E. Mintun, J. Pan, K.V . Alwala, N. Carion, C-Y . Wu, R. Girshick, P. Doll ´ar, and C. Feichtenhofer. SAM 2: Segment anything in images and videos, 2024. URL https://arxiv. org/abs/2408.00714

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

Common objects in 3d: Large-scale learning and eval- uation of real-life 3d category reconstruction

Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3d: Large-scale learning and eval- uation of real-life 3d category reconstruction. InIntl. Conf. on Computer Vision (ICCV), pages 10901–10911, 2021

work page 2021
[43]

Generalized ICP

Aleksandr Segal, Dirk Haehnel, and Sebastian Thrun. Generalized ICP. InRobotics: Science and Systems (RSS), Jun. 2009. doi: 10.15607/RSS.2009.V .021

work page doi:10.15607/rss.2009.v 2009
[44]

J. Shi*, R. Talak*, D. Maggio, and L. Carlone. A correct- and-certify approach to self-supervise object pose estima- tors via ensemble self-training. InRobotics: Science and Systems (RSS), 2023. https://arxiv.org/pdf/2302.06019. pdf

work page arXiv 2023
[45]

J. Shi, R. Talak, H. Zhang, D. Jin, and L. Carlone. CRISP: Object pose and shape estimation with test-time adaptation. InIEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2025

work page 2025
[46]

Talak, L

R. Talak, L. Peng, and L. Carlone. Certifiable 3D object pose estimation: Foundations, learning models, and self- training.IEEE Trans. Robotics, 39(4):2805–2824, 2023. https://arxiv.org/pdf/2206.11215.pdf

work page arXiv 2023
[47]

Shape prior deformation for categorical 6d object pose and size estimation

Meng Tian, Marcelo H Ang, and Gim Hee Lee. Shape prior deformation for categorical 6d object pose and size estimation. InEuropean Conf. on Computer Vision (ECCV), pages 530–546. Springer, 2020

work page 2020
[48]

H. Wang, S. Sridhar, J. Huang, J. Valentin, S. Song, and L. Guibas. Normalized object coordinate space for category-level 6d object pose and size estimation. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 2642–2651, 2019

work page 2019
[49]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InIEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 5294–5306, 2025

work page 2025
[50]

Bundlesdf: Neural 6-dof tracking and 3d reconstruction of unknown objects

Bowen Wen, Jonathan Tremblay, Valts Blukis, Stephen Tyree, Thomas M ¨uller, Alex Evans, Dieter Fox, Jan Kautz, and Stan Birchfield. Bundlesdf: Neural 6-dof tracking and 3d reconstruction of unknown objects. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 606–617, 2023

work page 2023
[51]

FoundationPose: Unified 6D Pose Estimation and Track- ing of Novel Objects

Bowen Wen, Wei Yang, Jan Kautz, and Stan Birchfield. FoundationPose: Unified 6D Pose Estimation and Track- ing of Novel Objects . In2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17868–17879, Los Alamitos, CA, USA, June 2024. IEEE Computer Society. doi: 10.1109/CVPR52733. 2024.01692. URL https://doi.ieeecomputersociety.o...

[52]

Jiajun Wu, Ilker Yildirim, Joseph J Lim, Bill Freeman, and Josh Tenenbaum. Galileo: Perceiving physical object properties by integrating a physics engine with deep learning. Advances in Neural Information Processing Systems (NeurIPS), 28, 2015

[53]

Tong Wu, Jiarui Zhang, Xiao Fu, Yuxin Wang, Jiawei Ren, Liang Pan, Wayne Wu, Lei Yang, Jiaqi Wang, Chen Qian, et al. Omniobject3d: Large-vocabulary 3d object dataset for realistic perception, reconstruction and generation. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 803–814, 2023

[54]

Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes. In Robotics: Science and Systems (RSS), 2018

[55]

H. Yang, J. Shi, and L. Carlone. TEASER: Fast and Certifiable Point Cloud Registration. IEEE Trans. Robotics, 37(2):314–333, 2020. Extended arXiv version 2001.07715, https://arxiv.org/pdf/2001.07715.pdf

[56]

J. Yang, H. Li, D. Campbell, and Y. Jia. Go-ICP: A globally optimal solution to 3D ICP point-set registration. IEEE Trans. Pattern Anal. Machine Intell., 38(11):2241–2254, November 2016. ISSN 0162-8828

[57]

Wen Yang, Zhixian Xie, Xuechao Zhang, Heni Ben Amor, Shan Lin, and Wanxin Jin. Twintrack: Bridging vision and contact physics for real-time tracking of unknown dynamic objects. arXiv preprint arXiv:2505.22882, 2025

[58]

Kaixin Yao, Longwen Zhang, Xinhao Yan, Yan Zeng, Qixuan Zhang, Lan Xu, Wei Yang, Jiayuan Gu, and Jingyi Yu. Cast: Component-aligned 3d scene reconstruction from an rgb image. ACM Transactions on Graphics (TOG), 44(4):1–19, 2025

[59]

Xihang Yu, Rajat Talak, Jingnan Shi, Ulrich Viereck, Igor Gilitschenski, and Luca Carlone. Box pose and shape estimation and domain adaptation for large-scale warehouse automation. arXiv preprint arXiv:2507.00984, 2025

[60]

Mengchao Zhang and Kris Hauser. Non-penetration iterative closest points for single-view multi-object 6d pose estimation. In IEEE Intl. Conf. on Robotics and Automation (ICRA), pages 1520–1526. IEEE, 2022

[61]

B. Zheng, Y. Zhao, J. C. Yu, K. Ikeuchi, and S. Zhu. Beyond point clouds: Scene understanding by reasoning geometry and physics. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 3127–3134, 2013

[62]

Guangyao Zhou, Nishad Gothoskar, Lirui Wang, Joshua B Tenenbaum, Dan Gutfreund, Miguel Lázaro-Gredilla, Dileep George, and Vikash K Mansinghka. 3d neural embedding likelihood: Probabilistic inverse graphics for robust 6d pose estimation. In Intl. Conf. on Computer Vision (ICCV), pages 21625–21636, 2023

[63]

Minghan Zhu, Zhiyi Wang, Qihang Sun, Maani Ghaffari, and Michael Posa. Object reconstruction under occlusion with generative priors and contact-induced constraints. arXiv preprint arXiv:2512.05079, 2025.

Given the masks of objects, RGB image (second to the last) and depth map (last), give me contact dependency graph (adjacency list). Use the indices of th...
