pith. machine review for the scientific record.

arxiv: 2604.09100 · v1 · submitted 2026-04-10 · 💻 cs.CV · cs.RO

Recognition: no theorem link

Physically Grounded 3D Generative Reconstruction under Hand Occlusion using Proprioception and Multi-Contact Touch

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:18 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords 3D reconstruction · hand occlusion · proprioception · tactile sensing · signed distance field · diffusion model · robotic manipulation · amodal completion

The pith

Proprioception and multi-contact touch enable metric-scale 3D object reconstruction under severe hand occlusion by constraining surfaces with physical signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that signals from the interacting hand itself—its known geometry from proprioception and surface contact points from touch—can resolve ambiguity in occluded regions during 3D reconstruction. It encodes objects as pose-aware signed distance fields inside a compact latent space learned by a Structure-VAE, then trains a conditional flow-matching diffusion model on vision, hand latents, and tactile data while adding physics-based losses and decoder guidance to enforce non-interpenetration and contact alignment. If the approach holds, robots could maintain accurate, scale-correct object models during manipulation tasks even when their own hand blocks the camera, allowing downstream modules to refine geometry and appearance without vision-only failures.
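
To fix ideas, here is a minimal sketch of what a Structure-VAE over pose-aware SDF grids could look like (PyTorch; the 32^3 grid resolution, channel widths, and 256-dim latent are illustrative assumptions, not the paper's reported architecture).

import torch
import torch.nn as nn

class StructureVAE(nn.Module):
    """Hypothetical VAE over pose-aware SDF voxel grids, input shape (B, 1, 32, 32, 32)."""
    def __init__(self, latent_dim: int = 256):
        super().__init__()
        # Encoder: SDF grid -> concatenated latent mean and log-variance.
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 32, 4, stride=2, padding=1),   # 32^3 -> 16^3
            nn.ReLU(),
            nn.Conv3d(32, 64, 4, stride=2, padding=1),  # 16^3 -> 8^3
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 8 ** 3, 2 * latent_dim),
        )
        # Decoder: latent -> reconstructed SDF grid.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 8 ** 3),
            nn.Unflatten(1, (64, 8, 8, 8)),
            nn.ReLU(),
            nn.ConvTranspose3d(64, 32, 4, stride=2, padding=1),  # 8^3 -> 16^3
            nn.ReLU(),
            nn.ConvTranspose3d(32, 1, 4, stride=2, padding=1),   # 16^3 -> 32^3
        )

    def forward(self, sdf: torch.Tensor):
        mu, logvar = self.encoder(sdf).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        return self.decoder(z), mu, logvar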

Core claim

We represent object structure as a pose-aware, camera-aligned signed distance field (SDF) and learn a compact latent space with a Structure-VAE. In this latent space, we train a conditional flow-matching diffusion model, pretraining on vision-only images and finetuning on occluded manipulation scenes while conditioning on visible RGB evidence, occluder/visibility masks, the hand latent representation, and tactile information. Crucially, we incorporate physics-based objectives and differentiable decoder-guidance during finetuning and inference to reduce hand–object interpenetration and to align the reconstructed surface with contact observations.
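
For readers unfamiliar with flow matching, here is a minimal sketch of the conditional objective in latent space, following the linear-interpolant form of the flow-matching literature the paper builds on; velocity_net and the conditioning dict are placeholders, not the paper's transformer or its conditioning encoders.

import torch

def flow_matching_loss(velocity_net, z1, cond):
    """z1: clean object latents of shape (batch, latent_dim) from the frozen
    Structure-VAE encoder. cond: dict of conditioning signals (RGB evidence,
    masks, hand latent, tactile encoding) -- encoders assumed, not specified."""
    z0 = torch.randn_like(z1)                   # noise endpoint of the path
    t = torch.rand(z1.shape[0], 1, device=z1.device)
    zt = (1.0 - t) * z0 + t * z1                # point on the straight path
    target_v = z1 - z0                          # constant velocity along it
    pred_v = velocity_net(zt, t, cond)          # conditional velocity field
    return ((pred_v - target_v) ** 2).mean()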

What carries the argument

A conditional flow-matching diffusion model in the latent space of a Structure-VAE for pose-aware SDFs, conditioned on RGB, masks, hand proprioception, and multi-contact touch while guided by physics objectives for non-interpenetration and contact matching.
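
A hedged sketch of how differentiable decoder guidance could enter sampling: integrate the learned velocity field with Euler steps and, at each step, nudge the latent down the gradient of a physics penalty computed through the VAE decoder. The step count, guidance scale, and physics_penalty callable are illustrative assumptions (one candidate penalty is sketched under the load-bearing premise below), not the paper's settings.

import torch

@torch.no_grad()
def guided_sample(velocity_net, decoder, cond, physics_penalty,
                  latent_shape, steps=50, guidance_scale=0.1):
    z = torch.randn(latent_shape)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((latent_shape[0], 1), i * dt)
        z = z + dt * velocity_net(z, t, cond)        # Euler step along the flow
        with torch.enable_grad():                    # re-enable grad for guidance
            z_req = z.detach().requires_grad_(True)
            penalty = physics_penalty(decoder(z_req))  # decode latent -> SDF
            grad = torch.autograd.grad(penalty, z_req)[0]
        z = z - guidance_scale * grad                # push toward physical validity
    return decoder(z)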

If this is right

  • Substantially improves completion of occluded object parts compared to vision-only baselines.
  • Yields physically plausible reconstructions at correct real-world scale.
  • Reduces hand-object interpenetration through the added physics objectives.
  • Transfers to real humanoid robots even when the end-effector differs from training data.
  • Integrates directly into two-stage pipelines where a downstream module can refine geometry and predict appearance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Robots could maintain object models continuously during grasping sequences without pausing for clear views.
  • The same conditioning strategy might apply to other occluders if their geometry and contact data are available.
  • Real-time versions could support closed-loop control by updating the object estimate as contacts change.
  • Reducing dependence on external cameras could simplify workspace setups in cluttered or mobile manipulation scenarios.

Load-bearing premise

That physics-based objectives and differentiable decoder guidance during finetuning and inference will reliably reduce hand-object interpenetration and align surfaces with contacts without introducing artifacts or scale errors.
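
A plausible form for the two physics terms this premise leans on, assuming the decoded SDF can be queried at arbitrary 3D points (e.g., by trilinear interpolation into the grid); these are reconstructions from the paper's description, not its exact equations.

import torch
import torch.nn.functional as F

def physics_losses(object_sdf_at, hand_points, contact_points):
    """object_sdf_at: differentiable callable mapping (N, 3) points to signed
    distances of the decoded object. hand_points: samples on the posed hand
    surface from proprioception. contact_points: tactile contact locations."""
    # Non-interpenetration: hand surface points must lie outside the object,
    # so penalize any negative object SDF values there.
    interpenetration = F.relu(-object_sdf_at(hand_points)).mean()
    # Contact alignment: observed contacts should sit on the zero level set.
    contact = object_sdf_at(contact_points).abs().mean()
    return interpenetration, contact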

What would settle it

Reconstructed meshes that still penetrate the hand model or fail to match observed contact locations at test time, or outputs whose real-world scale deviates from ground-truth measurements on the robot.

Figures

Figures reproduced from arXiv: 2604.09100 by Gabriele Mario Caddeo, Lorenzo Natale, Pasquale Marra.

Figure 1: Inference pipeline. We present a method to generate physically plausible 3D shape reconstructions by fusing vision with contact. In particular, we consider the active information of contact coming from tactile sensors distributed on the hand, and the negative information (non-interpenetration) coming from the hand geometry. Apart from this information, the method requires the Egocentric RGB Image of the o…
Figure 2: Training procedure and architecture. a1): We train a Structure-VAE autoencoder that reconstructs pose-aware object SDFs. a2): Using the frozen VAE encoder, we build latent datasets and train a conditional flow transformer from scratch using pose-consistent, unoccluded object images. a3): We finetune the Structure-flow on occluded manipulation scenes, conditioning on visible RGB evidence, occluder/visibility masks, …
Figure 3: Qualitative results in simulation. Vision-only baselines often exhibit artifacts (e.g., holes or inconsistent relative dimensions) under occlusion. By leveraging contact cues, our method produces more physically plausible reconstructions.
Figure 4: Qualitative results on real-world data. Contact cues improve physical plausibility under occlusion. Artifacts can arise when camera–hand calibration/forward kinematics are inaccurate, leading to hand–object misalignment (third row).
original abstract

We propose a multimodal, physically grounded approach for metric-scale amodal object reconstruction and pose estimation under severe hand occlusion. Unlike prior occlusion-aware 3D generation methods that rely only on vision, we leverage physical interaction signals: proprioception provides the posed hand geometry, and multi-contact touch constrains where the object surface must lie, reducing ambiguity in occluded regions. We represent object structure as a pose-aware, camera-aligned signed distance field (SDF) and learn a compact latent space with a Structure-VAE. In this latent space, we train a conditional flow-matching diffusion model, pretraining on vision-only images and finetuning on occluded manipulation scenes while conditioning on visible RGB evidence, occluder/visibility masks, the hand latent representation, and tactile information. Crucially, we incorporate physics-based objectives and differentiable decoder-guidance during finetuning and inference to reduce hand–object interpenetration and to align the reconstructed surface with contact observations. Because our method produces a metric, physically consistent structure estimate, it integrates naturally into existing two-stage reconstruction pipelines, where a downstream module refines geometry and predicts appearance. Experiments in simulation show that adding proprioception and touch substantially improves completion under occlusion and yields physically plausible reconstructions at correct real-world scale compared to vision-only baselines; we further validate transfer by deploying the model on a real humanoid robot with an end-effector different from those used during training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper proposes a multimodal, physically grounded method for metric-scale amodal 3D object reconstruction and pose estimation under severe hand occlusion. It represents objects as pose-aware camera-aligned SDFs via a Structure-VAE, then trains a conditional flow-matching diffusion model on vision, occluder masks, hand latent representations, and tactile signals. Physics-based objectives and differentiable decoder guidance are added during finetuning and inference to enforce physical consistency. Simulation experiments claim substantial gains in occluded completion and scale accuracy over vision-only baselines, with further validation via deployment on a real humanoid robot using a novel end-effector.

Significance. If the results hold, the work meaningfully advances grounded 3D generation for robotics by showing how proprioception and multi-contact touch can resolve visual ambiguity in manipulation scenes while producing metric, non-penetrating reconstructions. The sim-to-real transfer with an unseen end-effector and the integration path into two-stage pipelines are practical strengths that could influence downstream tasks such as grasping and interaction planning.

major comments (2)
  1. Abstract: the central claim that proprioception and touch 'substantially improves completion under occlusion and yields physically plausible reconstructions at correct real-world scale' is presented without any quantitative metrics, error bars, ablation tables, or statistical tests in the abstract itself; this makes it impossible to judge the magnitude or reliability of the reported gains from the provided text alone.
  2. Method description: the physics-based objectives and differentiable decoder-guidance are described as key to reducing interpenetration and aligning surfaces with contacts, yet no explicit formulation, weighting schedule, or ablation isolating their contribution is referenced, leaving open whether they reliably avoid artifacts or scale drift as assumed.
minor comments (3)
  1. Abstract: expand the experimental summary to include at least one key quantitative result (e.g., IoU or Chamfer distance improvement) so readers can immediately gauge the effect size.
  2. Notation: confirm that 'SDF' is expanded on first use and that all conditioning signals (RGB, masks, hand latent, tactile) are consistently denoted across text and figures.
  3. Figures: ensure simulation and real-robot result panels are clearly labeled with baseline comparisons and scale references so the physical-consistency claim is visually verifiable.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of our work and the constructive feedback. We address each major comment below and have revised the manuscript accordingly to improve clarity and completeness.

point-by-point responses
  1. Referee: Abstract: the central claim that proprioception and touch 'substantially improves completion under occlusion and yields physically plausible reconstructions at correct real-world scale' is presented without any quantitative metrics, error bars, ablation tables, or statistical tests in the abstract itself; this makes it impossible to judge the magnitude or reliability of the reported gains from the provided text alone.

    Authors: We agree that the abstract would benefit from including key quantitative indicators to convey the scale of improvements. We have revised the abstract to reference representative metrics from our simulation experiments (e.g., improvements in completion IoU and scale accuracy relative to vision-only baselines) while preserving brevity, with full details, error bars, and statistical comparisons remaining in the main results section and tables. revision: yes

  2. Referee: Method description: the physics-based objectives and differentiable decoder-guidance are described as key to reducing interpenetration and aligning surfaces with contacts, yet no explicit formulation, weighting schedule, or ablation isolating their contribution is referenced, leaving open whether they reliably avoid artifacts or scale drift as assumed.

    Authors: The explicit mathematical formulations for the physics-based objectives (interpenetration penalty and contact alignment terms) and the differentiable decoder guidance appear in Section 3.4, with the weighting schedule and training procedure described in the implementation details. To address the concern directly, we have added cross-references to the relevant equations in the method overview and included a new ablation study in the revised manuscript that isolates the contribution of these components, confirming their role in reducing interpenetration and scale drift. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper's core pipeline—Structure-VAE for latent representation of pose-aware SDFs, followed by conditional flow-matching diffusion pretrained on vision and finetuned with proprioception/touch conditioning plus physics-based losses—is built from standard generative modeling components. No equation or claim reduces the reconstructed output to a fitted input by construction, nor does any uniqueness theorem or ansatz rely on self-citation chains. Simulation gains and real-robot transfer with a novel end-effector are presented as empirical validation rather than definitional consequences. The approach therefore contains independent content and does not trigger any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard robotics sensor assumptions and ML training practices rather than new physical postulates.

free parameters (2)
  • latent space dimension
    Size of the compact latent space learned by the Structure-VAE is a modeling choice.
  • conditioning signal weights
    Relative influence of RGB, masks, hand latent, and tactile inputs during diffusion training.
axioms (2)
  • domain assumption Proprioception provides accurate posed hand geometry
    Invoked to supply occluder geometry and reduce reconstruction ambiguity.
  • domain assumption Multi-contact touch observations constrain object surface location
    Used to guide the SDF in occluded regions during finetuning and inference.

pith-pipeline@v0.9.0 · 5561 in / 1459 out tokens · 65285 ms · 2026-05-10T18:18:25.267076+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

75 extracted references · 24 canonical work pages · 5 internal anchors

  1. [1] Albergo, M.S., Vanden-Eijnden, E.: Building normalizing flows with stochastic interpolants. arXiv preprint arXiv:2209.15571 (2022)
  2. [2] Caddeo, G.M., Maracani, A., Alfano, P.D., Piga, N.A., Rosasco, L., Natale, L.: Sim2Surf: A sim2real surface classifier for vision-based tactile sensors with a bilevel adaptation pipeline. IEEE Sensors Journal 25(5), 8697–8709 (2025). https://doi.org/10.1109/JSEN.2025.3530712
  3. [3] Calli, B., Walsman, A., Singh, A., Srinivasa, S., Abbeel, P., Dollar, A.M.: Benchmarking in manipulation research: Using the Yale-CMU-Berkeley object and model set. IEEE Robotics & Automation Magazine 22(3), 36–52 (2015). https://doi.org/10.1109/MRA.2015.2448951
  4. [4] Chan, E.R., Lin, C.Z., Chan, M.A., Nagano, K., Pan, B., De Mello, S., Gallo, O., Guibas, L.J., Tremblay, J., Khamis, S., et al.: Efficient geometry-aware 3D generative adversarial networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16123–16133 (2022)
  5. [5] Chan, E.R., Monteiro, M., Kellnhofer, P., Wu, J., Wetzstein, G.: pi-GAN: Periodic implicit generative adversarial networks for 3D-aware image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5799–5809 (2021)
  6. [6] Chen, H., Gu, J., Chen, A., Tian, W., Tu, Z., Liu, L., Su, H.: Single-stage diffusion NeRF: A unified approach to 3D generation and reconstruction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2416–2425 (2023)
  7. [7] Chen, Y., Tekden, A.E., Deisenroth, M.P., Bekiroglu, Y.: Sliding touch-based exploration for modeling unknown object shape with multi-fingered hands. In: 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 8943–8950. IEEE (2023)
  8. [8] Chen, Y., Xie, T., Zong, Z., Li, X., Gao, F., Yang, Y., Wu, Y.N., Jiang, C.: Atlas3D: Physically constrained self-supporting text-to-3D for simulation and fabrication. Advances in Neural Information Processing Systems 37, 102501–102530 (2024)
  9. [9] Chen, Z., Tang, J., Dong, Y., Cao, Z., Hong, F., Lan, Y., Wang, T., Xie, H., Wu, T., Saito, S., et al.: 3DTopia-XL: Scaling high-quality 3D asset generation via primitive diffusion. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 26576–26586 (2025)
  10. [10] Chi, S., Sachdeva, E., Huang, P.H., Lee, K.: Contact-aware amodal completion for human-object interaction via multi-regional inpainting (2025). https://arxiv.org/abs/2508.00427
  11. [11] Cho, J., Youwang, K., Yang, H., Oh, T.H.: Robust 3D shape reconstruction in zero-shot from a single image in the wild. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 22786–22798 (2025)
  12. [12] Chu, R., Xie, E., Mo, S., Li, Z., Nießner, M., Fu, C.W., Jia, J.: DiffComplete: Diffusion-based generative 3D shape completion (2023). https://arxiv.org/abs/2306.16329
  13. [13] Collins, J., Goel, S., Deng, K., Luthra, A., Xu, L., Gundogdu, E., Zhang, X., Vicente, T.F.Y., Dideriksen, T., Arora, H., et al.: ABO: Dataset and benchmarks for real-world 3D object understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21126–21136 (2022)
  14. [14] Comi, M., Lin, Y., Church, A., Tonioni, A., Aitchison, L., Lepora, N.F.: TouchSDF: A DeepSDF approach for 3D shape reconstruction using vision-based tactile sensing. IEEE Robotics and Automation Letters 9(6), 5719–5726 (2024)
  15. [15] Comi, M., Tonioni, A., Tremblay, J., Yang, M., Blukis, V., Lin, Y., Lepora, N.F., Aitchison, L.: Snap-it, tap-it, splat-it: Tactile-informed 3D Gaussian splatting for reconstructing challenging surfaces. In: 2025 International Conference on 3D Vision (3DV). pp. 1134–1143. IEEE (2025)
  16. [16] Cui, R., Liu, W., Sun, W., Wang, S., Shang, T., Li, Y., Song, X., Yan, H., Wu, Z., Chen, S., et al.: NeuSDFusion: A spatial-aware generative model for 3D shape completion, reconstruction, and generation. In: European Conference on Computer Vision. pp. 1–18. Springer (2024)
  17. [17] Deitke, M., Liu, R., Wallingford, M., Ngo, H., Michel, O., Kusupati, A., Fan, A., Laforte, C., Voleti, V., Gadre, S.Y., et al.: Objaverse-XL: A universe of 10M+ 3D objects. Advances in Neural Information Processing Systems 36, 35799–35813 (2023)
  18. [18] Downs, L., Francis, A., Koenig, N., Kinman, B., Hickman, R., Reymann, K., McHugh, T.B., Vanhoucke, V.: Google Scanned Objects: A high-quality dataset of 3D scanned household items. In: 2022 International Conference on Robotics and Automation (ICRA). pp. 2553–2560. IEEE (2022)
  19. [19] Fu, H., Jia, R., Gao, L., Gong, M., Zhao, B., Maybank, S., Tao, D.: 3D-FUTURE: 3D furniture shape with texture. International Journal of Computer Vision 129(12), 3313–3337 (2021)
  20. [20] Gao, J., Shen, T., Wang, Z., Chen, W., Yin, K., Li, D., Litany, O., Gojcic, Z., Fidler, S.: GET3D: A generative model of high quality 3D textured shapes learned from images. Advances in Neural Information Processing Systems 35, 31841–31854 (2022)
  21. [21] Gao, R., Deng, K., Yang, G., Yuan, W., Zhu, J.Y.: Tactile DreamFusion: Exploiting tactile sensing for 3D generation. Advances in Neural Information Processing Systems 37, 29839–29863 (2024)
  22. [22] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020)
  23. [23] Guo, M., Wang, B., Ma, P., Zhang, T., Owens, C., Gan, C., Tenenbaum, J., He, K., Matusik, W.: Physically compatible 3D object modeling from a single image. Advances in Neural Information Processing Systems 37, 119260–119282 (2024)
  24. [24] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, 6840–6851 (2020)
  25. [25] Huang, Z., Yu, Y., Xu, J., Ni, F., Le, X.: PF-Net: Point fractal network for 3D point cloud completion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7662–7670 (2020)
  26. [26] Jamali, N., Ciliberto, C., Rosasco, L., Natale, L.: Active perception: Building objects' models using tactile exploration. pp. 179–185 (2016). https://doi.org/10.1109/HUMANOIDS.2016.7803275
  27. [27] Jiang, J., Cao, G., Butterworth, A., Do, T.T., Luo, S.: Where shall I touch? Vision-guided tactile poking for transparent object grasping (2022). https://arxiv.org/abs/2208.09743
  28. [28] Khanna, M., Mao, Y., Jiang, H., Haresh, S., Shacklett, B., Batra, D., Clegg, A., Undersander, E., Chang, A.X., Savva, M.: Habitat Synthetic Scenes Dataset (HSSD-200): An analysis of 3D scene scale and realism tradeoffs for ObjectGoal navigation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16384–16393 (2024)
  29. [29] Li, M., Duan, Y., Zhou, J., Lu, J.: Diffusion-SDF: Text-to-shape via voxelized diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12642–12651 (2023)
  30. [30] Li, R., Zheng, C., Rupprecht, C., Vedaldi, A.: DSO: Aligning 3D generators with simulation feedback for physical soundness. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 6772–6783 (2025)
  31. [31] Lin, C.H., Kong, C., Lucey, S.: Learning efficient point cloud generation for dense 3D object reconstruction. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 32 (2018)
  32. [32] Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)
  33. [33] Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022)
  34. [34] Luo, S., Lepora, N.F., Yuan, W., Althoefer, K., Cheng, G., Dahiya, R.: Tactile robotics: An outlook. IEEE Transactions on Robotics (2025)
  35. [35] Luo, S., Yuan, W., Adelson, E., Cohn, A.G., Fuentes, R.: ViTac: Feature sharing between vision and tactile sensing for cloth texture recognition. In: 2018 IEEE International Conference on Robotics and Automation (ICRA). pp. 2722–2727 (2018). https://doi.org/10.1109/ICRA.2018.8460494
  36. [36] Luo, S., Hu, W.: Diffusion probabilistic models for 3D point cloud generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2837–2845 (2021)
  37. [37] Melas-Kyriazi, L., Rupprecht, C., Vedaldi, A.: PC2: Projection-conditioned point cloud diffusion for single-image 3D reconstruction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12923–12932 (2023)
  38. [38] Müller, N., Siddiqui, Y., Porzi, L., Bulo, S.R., Kontschieder, P., Nießner, M.: DiffRF: Rendering-guided 3D radiance field diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4328–4338 (2023)
  39. [39] Ni, J., Chen, Y., Jing, B., Jiang, N., Wang, B., Dai, B., Li, P., Zhu, Y., Zhu, S.C., Huang, S.: PhyRecon: Physically plausible neural scene reconstruction. Advances in Neural Information Processing Systems 37, 25747–25780 (2024)
  40. [40] Niemeyer, M., Geiger, A.: GIRAFFE: Representing scenes as compositional generative neural feature fields. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11453–11464 (2021)
  41. [41] Oller, M., Planas, M., Berenson, D., Fazeli, N.: Manipulation via membranes: High-resolution and highly deformable tactile sensing and control (2022). https://arxiv.org/abs/2209.13432
  42. [42] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.Y., Li, S.W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: DINOv2: Learning robust visual features without supervision (2023)
  43. [43] Ren, X., Huang, J., Zeng, X., Museth, K., Fidler, S., Williams, F.: XCube: Large-scale 3D generative modeling using sparse voxel hierarchies. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4209–4219 (2024)
  44. [44] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10684–10695 (2022)
  45. [45] Schwarz, K., Liao, Y., Niemeyer, M., Geiger, A.: GRAF: Generative radiance fields for 3D-aware image synthesis. Advances in Neural Information Processing Systems 33, 20154–20166 (2020)
  46. [46] Shahidzadeh, A.H., Caddeo, G.M., Alapati, K., Natale, L., Fermüller, C., Aloimonos, Y.: FeelAnyForce: Estimating contact force feedback from tactile sensation for vision-based tactile sensors. In: 2025 IEEE International Conference on Robotics and Automation (ICRA). pp. 251–257 (2025)
  47. [47] Shahidzadeh, A.H., Yoo, S.J., Mantripragada, P., Singh, C.D., Fermüller, C., Aloimonos, Y.: AcTExplore: Active tactile exploration on unknown objects. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). pp. 3411–…
  48. [48] Shue, J.R., Chan, E.R., Po, R., Ankner, Z., Wu, J., Wetzstein, G.: 3D neural field generation using triplane diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20875–20886 (2023)
  49. [49] Smith, E., Calandra, R., Romero, A., Gkioxari, G., Meger, D., Malik, J., Drozdzal, M.: 3D shape reconstruction from vision and touch. Advances in Neural Information Processing Systems 33, 14193–14206 (2020)
  50. [50] Smith, E., Meger, D., Pineda, L., Calandra, R., Malik, J., Romero Soriano, A., Drozdzal, M.: Active 3D shape reconstruction from vision and touch. Advances in Neural Information Processing Systems 34, 16064–16078 (2021)
  51. [51] Sodhi, P., Kaess, M., Mukadam, M., Anderson, S.: PatchGraph: In-hand tactile tracking with learned surface normals. In: 2022 International Conference on Robotics and Automation (ICRA). pp. 2164–2170 (2022). https://doi.org/10.1109/ICRA46639.2022.9811953
  52. [52] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep unsupervised learning using nonequilibrium thermodynamics. In: International Conference on Machine Learning. pp. 2256–2265. PMLR (2015)
  53. [53] Suresh, S., Qi, H., Wu, T., Fan, T., Pineda, L., Lambeta, M., Malik, J., Kalakrishnan, M., Calandra, R., Kaess, M., et al.: NeuralFeels with neural fields: Visuotactile perception for in-hand manipulation. Science Robotics 9(96), eadl0628 (2024)
  54. [54] Suresh, S., Si, Z., Mangelson, J.G., Yuan, W., Kaess, M.: ShapeMap 3-D: Efficient shape mapping through dense touch and vision. In: 2022 International Conference on Robotics and Automation (ICRA). pp. 7073–7080. IEEE (2022)
  55. [55] Swann, A., Strong, M., Do, W.K., Camps, G.S., Schwager, M., Kennedy, M.: Touch-GS: Visual-tactile supervised 3D Gaussian splatting. In: 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 10511–10518. IEEE (2024)
  56. [56] Tahoun, M., Tahri, O., Ramón, J.A.C., Mezouar, Y.: Visual-tactile fusion for 3D objects reconstruction from a single depth view and a single gripper touch for robotics tasks. In: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 6786–6793. IEEE (2021)
  57. [57] Tang, B., Akinola, I., Xu, J., Wen, B., Handa, A., Wyk, K.V., Fox, D., Sukhatme, G.S., Ramos, F., Narang, Y.: AutoMate: Specialist and generalist assembly policies over diverse geometries (2024). https://arxiv.org/abs/2407.08028
  58. [58] Team, S.D., Chen, X., Chu, F.J., Gleize, P., Liang, K.J., Sax, A., Tang, H., Wang, W., Guo, M., Hardin, T., Li, X., Lin, A., Liu, J., Ma, Z., Sagar, A., Song, B., Wang, X., Yang, J., Zhang, B., Dollár, P., Gkioxari, G., Feiszli, M., Malik, J.: SAM 3D: 3Dfy anything in images (2025). https://arxiv.org/abs/2511.16624
  59. [59] Vahdat, A., Williams, F., Gojcic, Z., Litany, O., Fidler, S., Kreis, K., et al.: LION: Latent point diffusion models for 3D shape generation. Advances in Neural Information Processing Systems 35, 10021–10039 (2022)
  60. [60] Vezzani, G., Pattacini, U., Battistelli, G., Chisci, L., Natale, L.: Memory unscented particle filter for 6-DOF tactile localization. IEEE Transactions on Robotics 33(5), 1139–1155 (2017). https://doi.org/10.1109/TRO.2017.2707092
  61. [61] Wang, L., Cheng, C., Liao, Y., Qu, Y., Liu, G.: Training free guided flow matching with optimal control. arXiv preprint arXiv:2410.18070 (2024)
  62. [62] Wang, P.S., Liu, Y., Tong, X.: Dual octree graph networks for learning adaptive volumetric shape representations. ACM Transactions on Graphics (SIGGRAPH) 41(4) (2022)
  63. [63] Wang, S., Wu, J., Sun, X., Yuan, W., Freeman, W.T., Tenenbaum, J.B., Adelson, E.H.: 3D shape perception from monocular vision, touch, and shape priors. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 1606–1613. IEEE (2018)
  64. [64] Wei, Z., Xu, Z., Guo, J., Hou, Y., Gao, C., Cai, Z., Luo, J., Shao, L.: D(R,O) Grasp: A unified representation of robot and object interaction for cross-embodiment dexterous grasping (2025). https://arxiv.org/abs/2410.01702
  65. [65] Wen, B., Yang, W., Kautz, J., Birchfield, S.: FoundationPose: Unified 6D pose estimation and tracking of novel objects (2024). https://arxiv.org/abs/2312.08344
  66. [66] Wu, T., Zheng, C., Guan, F., Vedaldi, A., Cham, T.J.: Amodal3R: Amodal 3D reconstruction from occluded 2D images (2025). https://arxiv.org/abs/2503.13439
  67. [67] Wu, Z., Wang, Y., Feng, M., Xie, H., Mian, A.: Sketch and text guided diffusion model for colored point cloud generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8929–8939 (2023)
  68. [68] Xiang, J., Lv, Z., Xu, S., Deng, Y., Wang, R., Zhang, B., Chen, D., Tong, X., Yang, J.: Structured 3D latents for scalable and versatile 3D generation. arXiv preprint arXiv:2412.01506 (2024)
  69. [69] Xiang, Y., Schmidt, T., Narayanan, V., Fox, D.: PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes (2018). https://arxiv.org/abs/1711.00199
  70. [70] Xu, J., Lin, H., Song, S., Ciocarlie, M.: TANDEM3D: Active tactile exploration for 3D object recognition (2023). https://arxiv.org/abs/2209.08772
  71. [71] Yang, Y., Jia, B., Zhi, P., Huang, S.: PhyScene: Physically interactable 3D scene synthesis for embodied AI. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16262–16272 (2024)
  72. [72] Zhang, B., Cheng, Y., Wang, C., Zhang, T., Yang, J., Tang, Y., Zhao, F., Chen, D., Guo, B.: RodinHD: High-fidelity 3D avatar generation with diffusion models. In: European Conference on Computer Vision. pp. 465–483. Springer (2024)
  73. [73] Zhang, B., Cheng, Y., Yang, J., Wang, C., Zhao, F., Tang, Y., Chen, D., Guo, B.: GaussianCube: A structured and explicit radiance representation for 3D generative modeling. arXiv preprint arXiv:2403.19655 (2024)
  74. [74] Zhang, L., Wang, Z., Zhang, Q., Qiu, Q., Pang, A., Jiang, H., Yang, W., Xu, L., Yu, J.: CLAY: A controllable large-scale generative model for creating high-quality 3D assets. ACM Transactions on Graphics (TOG) 43(4), 1–20 (2024)
  75. [75] Zheng, H., Jalba, A., Cuijpers, R.H., IJsselsteijn, W., Schoenmakers, S.: A Bayesian framework for active tactile object recognition, pose estimation and shape transfer learning (2024). https://arxiv.org/abs/2409.06912