pith. machine review for the scientific record.

arxiv: 2602.13833 · v2 · submitted 2026-02-14 · 💻 cs.RO

Recognition: no theorem link

Semantic-Contact Fields for Category-Level Generalizable Tactile Tool Manipulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 22:16 UTC · model grok-4.3

classification 💻 cs.RO
keywords semantic contact fields · tactile sensing · tool manipulation · sim-to-real transfer · category-level generalization · diffusion policy · contact estimation · robot manipulation

The pith

Semantic-Contact Fields fuse visual semantics with dense contact probability and force estimates to support category-level generalization in tactile tool manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to give robots both high-level semantic understanding and low-level physical contact awareness for tasks like scraping, drawing, and peeling. It does so by learning a single 3D representation that combines camera semantics with tactile contact signals, trained first in large-scale simulation and then adapted to real sensors with only a small amount of pseudo-labeled data. A reader should care because current vision-language policies lack reliable physical grounding while pure tactile methods stay tied to one specific tool shape. If the approach works, robots could apply the same learned policy to many unseen tools within a category without collecting new real-world data for each one.

Core claim

SCFields is a unified 3D representation that fuses visual semantics with dense extrinsic contact estimates, covering both contact probability and force magnitude. It is obtained via a two-stage pipeline: pre-training on large-scale simulation to acquire geometry-aware contact priors, followed by fine-tuning on a modest real dataset whose labels are generated by geometric heuristics and force optimization that align real tactile readings with the simulated priors. The resulting force-aware field is then supplied as the dense observation to a diffusion policy that performs the manipulation, producing category-level generalization to unseen tool instances.

What carries the argument

Semantic-Contact Fields (SCFields), a 3D representation that merges visual semantics with dense estimates of contact probability and force; it supplies the observation to the diffusion policy.
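As a concrete mental model, the per-point layout such a field might take can be sketched as follows; the channel widths and dictionary layout are illustrative assumptions, not the paper's actual dimensions or API:

```python
import numpy as np

# Hedged sketch of the per-point SCFields layout described in the review:
# each point carries a 3D position, a semantic descriptor, a contact
# probability, and a 3D extrinsic-force vector. Sizes are assumptions.
N_POINTS = 1024
SEM_DIM = 32  # assumed semantic-feature width

rng = np.random.default_rng(0)
scfield = {
    "xyz": rng.normal(size=(N_POINTS, 3)),             # point positions
    "semantic": rng.normal(size=(N_POINTS, SEM_DIM)),  # category-level semantics
    "contact_prob": rng.uniform(size=(N_POINTS,)),     # dense contact probability
    "force": rng.normal(size=(N_POINTS, 3)),           # extrinsic force vectors
}

def flatten_observation(field):
    """Concatenate per-point channels into the kind of dense observation a
    diffusion policy could consume (illustrative, not the paper's API)."""
    return np.concatenate(
        [field["xyz"], field["semantic"],
         field["contact_prob"][:, None], field["force"]], axis=1)

obs = flatten_observation(scfield)
print(obs.shape)  # (1024, 39): 3 + 32 + 1 + 3 channels per point
```

The point is only that semantics and contact physics live in one per-point array, so a single policy conditioned on it can transfer across tool instances within a category.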

If this is right

  • The same learned representation supports multiple contact-rich tasks (scraping, crayon drawing, peeling) across different tool shapes without retraining.
  • Performance exceeds both vision-only and raw-tactile baselines on the reported real-robot experiments.
  • Only a small amount of real data is required after large-scale simulation pre-training.
  • The force component of the field improves physical control beyond what contact probability alone provides.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same representation could be attached to existing vision-language-action models to supply missing physical grounding without retraining the entire policy.
  • Extending the two-stage pipeline to additional sensor types or dynamic contact scenarios would test whether the alignment method generalizes beyond the three tasks shown.
  • If the force estimates prove reliable, they could support downstream planning that reasons explicitly about expected reaction forces rather than treating contact as binary.

Load-bearing premise

The geometric heuristics plus force optimization used to pseudo-label the small real dataset can correctly align real tactile signals with simulation despite the nonlinear deformation of soft sensors.

What would settle it

A controlled test in which a new tool geometry produces systematically mismatched contact-force predictions between the simulated priors and the real sensor readings, causing the diffusion policy to fail on the same task that succeeded in simulation.

Figures

Figures reproduced from arXiv: 2602.13833 by Heng Zhang, Kevin Yuchen Ma, Mike Zheng Shou, Weisi Lin, Yan Wu.

Figure 1. Semantic-Contact Fields (SCFields) Overview. 1. Multimodal Inputs: the system takes RGB-D observations and tactile readings from GelSight sensors. 2. SCFields Generation: a unified perception module fuses these inputs into a dense point cloud representation containing both category-level semantics (blue/green heatmap) and extrinsic contact force vectors (green arrows). 3. Policy Execution: a diffusion po…
Figure 2. Method Overview. Left: Contact Field Learning (III-B). Stage 1 learns general geometry-aware contact priors from simulated data; Stage 2 aligns the model to real tactile sensor responses using pseudo-labeled real data. Right: Policy Learning (III-C). A Diffusion Policy is conditioned on the resulting SCFields to achieve robust tool manipulation. …at the sensor surface, rather than grounding contact on the ext…
Figure 3. Contact field model architecture. The network fuses…
Figure 4. Left: Real robot experiment setup: a Franka Emika Panda robot with 2 GelSight Mini tactile sensors mounted on the gripper fingers, and 3 RealSense D435 cameras to capture RGB-D observations. Right: Training and Testing Tools. • Ablation – Loss Function: replaces Focal Loss with standard BCE Loss to evaluate robustness to class imbalance. 1) Evaluation 1: Architecture Validation (Sim-to-Sim): We firs…
Figure 5. Qualitative comparison of contact-field estimation on the Peeler.
Figure 6. Rollouts of contact-rich tasks with unseen tools.
Figure 7. Example failure modes of baseline/ablation methods.
Figure 8. Peeler meshes used in simulation.
Listing 1: Key Simulation and TacSL parameters.
TacSL:
  compliance_stiffness_range: [1400, 1500]
  compliant_damping_range: [1.5, 2.5]
  elastomer_friction: 5.0
IsaacGym:
  substeps: 4
  physx:
    num_pos_iterations: 32
    num_vel_iterations: 2
    contact_offset: 0.002
    max_depenetration_vel: 1.0
    friction_corr_dist: 0.001
Robot_Control:
  gripper_prop_gains: [800, 800]
  gripper_deriv_gains: [4, …
Figure 9. Illustration of the crayon picking setup, Semantic Field visualization, and successful rollouts on both seen and unseen…
Figure 10. Qualitative results in simulation. The predicted contact probabilities (bottom row) closely match the ground truth fields…
Figure 11. Qualitative results in the real world. We compare the predicted contact fields by…
original abstract

Generalizing tool manipulation requires both semantic planning and precise physical control. Modern generalist robot policies, such as Vision-Language-Action (VLA) models, often lack the physical grounding required for contact-rich tool manipulation. Conversely, existing contact-aware policies that leverage tactile or haptic sensing are typically instance-specific and fail to generalize across diverse tool geometries. Bridging this gap requires learning representations that are both semantically transferable and physically grounded, yet a fundamental barrier remains: diverse real-world tactile data are prohibitive to collect at scale, while direct zero-shot sim-to-real transfer is challenging due to the complex nonlinear deformation of soft tactile sensors. To address this, we propose Semantic-Contact Fields (SCFields), a unified 3D representation that fuses visual semantics with dense extrinsic contact estimates, including contact probability and force. SCFields is learned through a two-stage Sim-to-Real Contact Learning Pipeline: we first pre-train on large-scale simulation to learn geometry-aware contact priors, then fine-tune on a small set of real data pseudo-labeled via geometric heuristics and force optimization to align real tactile signals. The resulting force-aware representation serves as the dense observation input to a diffusion policy, enabling physical generalization to unseen tool instances. Experiments on scraping, crayon drawing, and peeling demonstrate robust category-level generalization, significantly outperforming vision-only and raw-tactile baselines. Project page: https://kevinskwk.github.io/SCFields/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Semantic-Contact Fields (SCFields), a unified 3D representation fusing visual semantics with dense extrinsic contact probability and force estimates. SCFields are learned via a two-stage Sim-to-Real Contact Learning Pipeline (simulation pre-training on geometry-aware priors followed by fine-tuning on a small real dataset whose labels are generated by geometric heuristics plus force optimization). The resulting representation is fed as dense observation to a diffusion policy, with experiments on scraping, crayon drawing, and peeling claimed to show robust category-level generalization that significantly outperforms vision-only and raw-tactile baselines.

Significance. If the pseudo-labels prove accurate, the approach would meaningfully advance category-level tactile tool manipulation by enabling physically grounded representations with limited real data, addressing the sim-to-real gap for soft sensors and the semantic-physical disconnect in current VLA models. The explicit separation of simulation pre-training from heuristic-based real fine-tuning is a pragmatic strength that could generalize beyond the three demonstrated tasks.

major comments (2)
  1. [Two-stage Sim-to-Real Contact Learning Pipeline] Two-stage pipeline (abstract and method description): no quantitative validation (IoU, force MSE, or similar) of the geometric-heuristic pseudo-labels against independent real-world ground truth on the same soft-sensor hardware is reported. Without such checks, it is impossible to confirm that the diffusion policy receives physically accurate dense observations, directly undermining attribution of the reported outperformance to SCFields rather than to heuristic artifacts.
  2. [Experiments] Experiments section: the claim of 'significantly outperforming' vision-only and raw-tactile baselines is stated without any numerical metrics, ablation tables, error bars, or statistical tests. This absence prevents verification of the central generalization result and makes the superiority claim unverifiable from the provided evidence.
minor comments (1)
  1. The abstract and method description would benefit from an explicit statement of the number of real-world trajectories used for fine-tuning and the precise form of the force-optimization objective.
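The pseudo-label validation the referee requests could take roughly this shape; `contact_iou` and `force_mse` are hypothetical helpers, and the threshold and arrays below are placeholders, not the paper's data:

```python
import numpy as np

# Hedged sketch of validating heuristic pseudo-labels against manually
# annotated ground truth: IoU for thresholded contact probability, MSE
# for per-point force vectors. Threshold choice is an assumption.
def contact_iou(prob_pred, mask_gt, threshold=0.5):
    """Intersection-over-union of the thresholded contact-probability
    field against a binary ground-truth contact mask."""
    pred = prob_pred >= threshold
    inter = np.logical_and(pred, mask_gt).sum()
    union = np.logical_or(pred, mask_gt).sum()
    return 1.0 if union == 0 else float(inter / union)

def force_mse(force_pred, force_gt):
    """Mean squared error over per-point 3D force vectors."""
    return float(np.mean(np.sum((force_pred - force_gt) ** 2, axis=-1)))

# Placeholder example: 4 points, 3 truly in contact.
prob = np.array([0.9, 0.8, 0.2, 0.1])
gt = np.array([True, True, True, False])
print(contact_iou(prob, gt))  # 2 predicted of 3 true -> 2/3
```

Reporting these two numbers on held-out annotated frames is what would separate genuine label accuracy from heuristic artifacts.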

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript to strengthen the validation and presentation of results.

point-by-point responses
  1. Referee: [Two-stage Sim-to-Real Contact Learning Pipeline] Two-stage pipeline (abstract and method description): no quantitative validation (IoU, force MSE, or similar) of the geometric-heuristic pseudo-labels against independent real-world ground truth on the same soft-sensor hardware is reported. Without such checks, it is impossible to confirm that the diffusion policy receives physically accurate dense observations, directly undermining attribution of the reported outperformance to SCFields rather than to heuristic artifacts.

    Authors: We agree that direct quantitative validation of the pseudo-labels would strengthen the paper. In the revised manuscript we will add a dedicated subsection reporting IoU for contact probability and MSE for force estimates, obtained by comparing the heuristic-generated labels against a small set of manually annotated real-world ground-truth data collected on the same soft-sensor hardware. This will allow readers to assess label accuracy and support attribution of policy gains to SCFields. revision: yes

  2. Referee: [Experiments] Experiments section: the claim of 'significantly outperforming' vision-only and raw-tactile baselines is stated without any numerical metrics, ablation tables, error bars, or statistical tests. This absence prevents verification of the central generalization result and makes the superiority claim unverifiable from the provided evidence.

    Authors: We acknowledge that the current draft presents the outperformance claim without sufficient numerical detail. In the revision we will expand the experiments section with explicit success-rate tables (including means and standard deviations), ablation studies isolating semantic versus contact components, error-bar visualizations, and statistical significance tests (e.g., paired t-tests or Wilcoxon tests) across tool instances. These additions will make the category-level generalization results fully verifiable. revision: yes
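The paired comparison the authors propose can be sketched without any statistics library; the success rates below are fabricated placeholders, and the critical value is the standard two-sided t threshold for five degrees of freedom at α = 0.05:

```python
import math

# Hedged sketch of a paired t-test on per-tool success rates for
# SCFields vs. a vision-only baseline. All numbers are placeholders,
# not the paper's results.
scfields = [0.90, 0.85, 0.80, 0.95, 0.75, 0.88]
vision_only = [0.60, 0.55, 0.70, 0.65, 0.50, 0.62]

def paired_t(a, b):
    """Paired t statistic: mean per-tool difference over its standard error."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

t = paired_t(scfields, vision_only)
# Two-sided critical value for df = 5 at alpha = 0.05 is about 2.571.
print(t > 2.571)  # True for these placeholder numbers
```

A Wilcoxon signed-rank test on the same per-tool pairs would drop the normality assumption at the cost of power with so few tools.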

Circularity Check

0 steps flagged

No circularity: pipeline uses independent simulation pre-training and external geometric heuristics

full rationale

The paper's central derivation chain consists of pre-training SCFields on large-scale simulation to obtain geometry-aware contact priors, followed by fine-tuning on a small real dataset whose contact probability and force labels are generated by geometric heuristics plus force optimization. These heuristics are described as independent external procedures, not derived from or equivalent to the SCFields model itself. The resulting representation is then fed as dense observation to a diffusion policy. No equation reduces the claimed category-level generalization to a fitted parameter defined by the same equations, no self-citation chain is load-bearing for the uniqueness of the representation, and the experimental results are presented as empirical outcomes rather than mathematical necessities. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the feasibility of sim-to-real transfer for nonlinear tactile sensor deformations and the accuracy of heuristic-based pseudo-labeling; no free parameters are explicitly named in the abstract.

axioms (1)
  • domain assumption Sim-to-real transfer remains feasible for soft tactile sensors despite complex nonlinear deformations
    The abstract identifies this as the fundamental barrier yet assumes the two-stage pipeline solves it.
invented entities (1)
  • Semantic-Contact Fields (SCFields) no independent evidence
    purpose: Unified 3D representation that fuses visual semantics with dense contact probability and force estimates
    Newly proposed construct serving as the core observation for the diffusion policy.

pith-pipeline@v0.9.0 · 5561 in / 1419 out tokens · 64950 ms · 2026-05-15T22:16:48.036527+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 3 internal anchors

  1. [1] Iretiayo Akinola, Jie Xu, Jan Carius, Dieter Fox, and Yashraj Narang. TacSL: A library for visuotactile sensor simulation and learning. IEEE Transactions on Robotics, 2025.
  2. [2] Jianxin Bi, Kevin Yuchen Ma, Ce Hao, Mike Zheng Shou, and Harold Soh. VLA-Touch: Enhancing vision-language-action models with dual-level tactile feedback. arXiv preprint arXiv:2507.17294, 2025.
  3. [3] Zihan Ding, Ya-Yen Tsai, Wang Wei Lee, and Bidan Huang. Sim-to-real transfer for robotic manipulation with tactile sensory. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 6778–6785. IEEE, 2021.
  4. [4] A. Domahidi, E. Chu, and S. Boyd. ECOS: An SOCP solver for embedded systems. In European Control Conference (ECC), pages 3071–3076, 2013.
  5. [5] Theophile Gervet, Zhou Xian, Nikolaos Gkanatsios, and Katerina Fragkiadaki. Act3D: 3D feature field transformers for multi-task robotic manipulation. In Conference on Robot Learning, pages 3949–3965. PMLR, 2023.
  6. [6] Zihao He, Hongjie Fang, Jingjing Chen, Hao-Shu Fang, and Cewu Lu. FoAR: Force-aware reactive policy for contact-rich robotic manipulation. IEEE Robotics and Automation Letters, 10(6):5625–5632, 2025. doi: 10.1109/LRA.2025.3560871.
  7. [7] Carolina Higuera, Siyuan Dong, Byron Boots, and Mustafa Mukadam. Neural contact fields: Tracking extrinsic contact with tactile sensing. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 12576–12582. IEEE, 2023.
  8. [8] Carolina Higuera, Joseph Ortiz, Haozhi Qi, Luis Pineda, Byron Boots, and Mustafa Mukadam. Perceiving extrinsic contacts from touch improves learning insertion policies. arXiv preprint arXiv:2309.16652, 2023.
  9. [9] Carolina Higuera, Akash Sharma, Chaithanya Krishna Bodduluri, Taosha Fan, Patrick Lancaster, Mrinal Kalakrishnan, Michael Kaess, Byron Boots, Mike Lambeta, Tingfan Wu, and Mustafa Mukadam. Sparsh: Self-supervised touch representations for vision-based tactile sensing. In 8th Annual Conference on Robot Learning, 2024.
  10. [10] Binghao Huang, Yixuan Wang, Xinyi Yang, Yiyue Luo, and Yunzhu Li. 3D-ViTac: Learning fine-grained manipulation with visuo-tactile sensing. In 8th Annual Conference on Robot Learning, 2024. URL https://openreview.net/forum?id=bk28WlkqZn.
  11. [11] Tsung-Wei Ke, Nikolaos Gkanatsios, and Katerina Fragkiadaki. 3D diffuser actor: Policy diffusion with 3D scene representations. In 8th Annual Conference on Robot Learning, 2024.
  12. [12] Leon Kim, Yunshuang Li, Michael Posa, and Dinesh Jayaraman. Im2Contact: Vision-based contact localization without touch or force sensing. In Conference on Robot Learning, pages 1533–1546. PMLR, 2023.
  13. [13] Jayjun Lee and Nima Fazeli. Vitascope: Visuo-tactile implicit representation for in-hand pose and extrinsic contact estimation. In Robotics: Science and Systems (RSS), 2025.
  14. [14] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
  15. [15] Yijiong Lin, Alex Church, Max Yang, Haoran Li, John Lloyd, Dandan Zhang, and Nathan F. Lepora. Bi-Touch: Bimanual tactile manipulation with sim-to-real deep reinforcement learning. IEEE Robotics and Automation Letters, 8(9):5472–5479, 2023.
  16. [16] Fangchen Liu, Kuan Fang, Pieter Abbeel, and Sergey Levine. MOKA: Open-world robotic manipulation through mark-based visual prompting. arXiv preprint arXiv:2403.03174, 2024.
  17. [17] Jason Jingzhou Liu, Yulong Li, Kenneth Shaw, Tony Tao, Ruslan Salakhutdinov, and Deepak Pathak. FACTR: Force-attending curriculum training for contact-rich policy learning. arXiv preprint arXiv:2502.17432, 2025.
  18. [18] Wenhai Liu, Junbo Wang, Yiming Wang, Weiming Wang, and Cewu Lu. ForceMimic: Force-centric imitation learning with force-motion capture system for contact-rich manipulation. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 1105–1112. IEEE, 2025.
  19. [19] Daolin Ma, Siyuan Dong, and Alberto Rodriguez. Extrinsic contact sensing with relative-motion tracking from distributed tactile measurements. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 11262–11268. IEEE, 2021.
  20. [20] Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac Gym: High performance GPU-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021.
  21. [21] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. Advances in Neural Information Processing Systems, 30, 2017.
  22. [22] Daniel Seita, Yufei Wang, Sarthak J. Shetty, Edward Yao Li, Zackory Erickson, and David Held. ToolFlowNet: Robotic manipulation with tools via predicting tool flow from point clouds. In Conference on Robot Learning, pages 1038–1049. PMLR, 2023.
  23. [23] Zilin Si and Wenzhen Yuan. Taxim: An example-based simulation model for GelSight tactile sensors. IEEE Robotics and Automation Letters, 7(2):2361–2368, 2022.
  24. [24] Anukriti Singh, Kasra Torshizi, Khuzema Habib, Kelin Yu, Ruohan Gao, and Pratap Tokekar. AFFORD2ACT: Affordance-guided automatic keypoint selection for generalizable and lightweight robotic manipulation. arXiv preprint arXiv:2510.01433, 2025.
  25. [25] Chao Tang, Anxing Xiao, Yuhong Deng, Tianrun Hu, Wenlong Dong, Hanbo Zhang, David Hsu, and Hong Zhang. FUNCTO: Function-centric one-shot imitation learning for tool manipulation. arXiv preprint arXiv:2502.11744, 2025.
  26. [26] Chao Tang, Anxing Xiao, Yuhong Deng, Tianrun Hu, Wenlong Dong, Hanbo Zhang, David Hsu, and Hong Zhang. MimicFunc: Imitating tool manipulation from a single human video via functional correspondence. In Conference on Robot Learning, pages 4473–4492. PMLR, 2025.
  27. [27] Shaoxiong Wang, Mike Lambeta, Po-Wei Chou, and Roberto Calandra. TACTO: A fast, flexible, and open-source simulator for high-resolution vision-based tactile sensors. IEEE Robotics and Automation Letters, 7(2):3930–3937, 2022.
  28. [28] Yixuan Wang, Mingtong Zhang, Zhuoran Li, Katherine Rose Driggs-Campbell, Jiajun Wu, Li Fei-Fei, and Yunzhu Li. D3Fields: Dynamic 3D descriptor fields for zero-shot generalizable robotic manipulation. In ICRA 2024 Workshop on 3D Visual Representations for Robot Manipulation, 2023.
  29. [29] Yixuan Wang, Guang Yin, Binghao Huang, Tarik Kelestemur, Jiuguang Wang, and Yunzhu Li. GenDP: 3D semantic fields for category-level generalizable diffusion policy. In 8th Annual Conference on Robot Learning, volume 2, 2024.
  30. [30] Yunlong Wang, Lei Zhang, Yuyang Tu, Hui Zhang, Kaixin Bai, Zhaopeng Chen, and Jianwei Zhang. ToolEENet: Tool affordance 6D pose estimation. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 10519–10526. IEEE, 2024.
  31. [31] Quantao Yang, Michael C. Welle, Danica Kragic, and Olov Andersson. S2-Diffusion: Generalizing from instance-level to category-level skills in robot manipulation. IEEE Robotics and Automation Letters, 2025.
  32. [32] Wenzhen Yuan, Siyuan Dong, and Edward H. Adelson. GelSight: High-resolution robot tactile sensors for estimating geometry and force. Sensors, 17(12):2762, 2017.
  33. [33] Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3D Diffusion Policy: Generalizable visuomotor policy learning via simple 3D representations. arXiv preprint arXiv:2403.03954, 2024.
  34. [34] Jialiang Zhao, Yuxiang Ma, Lirui Wang, and Edward Adelson. Transferable tactile transformers for representation learning across diverse sensors and tasks. In 8th Annual Conference on Robot Learning, 2024. URL https://openreview.net/forum?id=KXsropnmNI.
  35. [35] Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3D-VLA: A 3D vision-language-action generative world model. In International Conference on Machine Learning, pages 61229–61245. PMLR, 2024.

  Appendix A. Network Architecture Details: We utilize a PointNet++ [21] architecture to process the heterogeneous inpu…

  36. [36] Simulation Environments and Tactile Modeling: Our simulation pipeline utilizes the TacSL framework [1] to model the physics of the GelSight sensor within IsaacGym [20]. We define a uniform 7×9 marker grid that matches the physical distribution of the GelSight Mini sensors used in our real-world experiments. We employ TacSL's penalty-based tactile model to de…

  37. [37] Tactile Data Post-Processing: To improve the quality of the tactile signal, we apply a multi-stage post-processing pipeline to the raw simulated tactile data. This includes spatial filtering to emulate the elastic diffusion of the elastomer, temporal filtering to reduce simulation jitter, and contact-phase smoothing to ensure a clean baseline. Spatial and …

  38. [38] Soft Contact Probability Labeling: Rigid-body simulators typically treat contact as a binary and unstable state. To generate smooth, learnable contact probability labels, we utilize the Signed Distance Function (SDF) computed in Open3D. We map the penetration depth d_i (where d_i < 0 indicates penetration) of each point p_i on the tool surface to a continuous c…
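The excerpt truncates before stating the exact depth-to-probability mapping; a minimal sketch, assuming a sigmoid of the SDF value (a common smooth choice, not confirmed by the paper, with the scale as a free assumption):

```python
import math

# Hedged sketch of soft contact-probability labeling from SDF values.
# Convention from the excerpt: d < 0 means penetration. The sigmoid and
# its scale (1/m) are illustrative assumptions, not the paper's mapping.
def contact_probability(sdf_value, scale=1000.0):
    """Map a signed distance (metres) to a probability in (0, 1);
    deeper penetration (more negative SDF) -> probability nearer 1."""
    return 1.0 / (1.0 + math.exp(scale * sdf_value))

print(round(contact_probability(-0.005), 3))  # 5 mm penetration -> ~0.993
print(round(contact_probability(0.005), 3))   # 5 mm clear of surface -> ~0.007
```

Any monotone smooth map of penetration depth would serve the same role: replacing the simulator's unstable binary contact flag with a learnable target.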

  39. [39] Dense Force Labeling by Extrapolation: PyBullet provides discrete contact manifolds consisting of a sparse set of contact positions {x_j}, normal vectors {n_j}, and force magnitudes {F_j}. To transform these sparse interactions into a dense force field f_ext_i defined over the tool's point cloud, we employ a distance-weighted kernel interpolation modulated by …
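The modulation term is truncated in the excerpt; a minimal sketch of the un-modulated core, assuming a plain Gaussian distance-weighted interpolation of the sparse contact forces onto the dense point cloud:

```python
import numpy as np

# Hedged sketch of densifying sparse simulator contacts into a per-point
# force field. The kernel bandwidth sigma and the absence of the paper's
# (truncated) modulation term are illustrative assumptions.
def densify_forces(points, contact_pos, contact_force, sigma=0.01):
    """points: (N,3) tool-surface points; contact_pos: (M,3) sparse contact
    locations; contact_force: (M,3) force vectors. Returns an (N,3) field."""
    # squared distances between every surface point and every contact
    d2 = np.sum((points[:, None, :] - contact_pos[None, :, :]) ** 2, axis=-1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))  # (N, M) Gaussian weights
    return w @ contact_force              # weighted sum of contact forces

pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
cpos = np.array([[0.0, 0.0, 0.0]])
cforce = np.array([[0.0, 0.0, 5.0]])
field = densify_forces(pts, cpos, cforce)
print(field[0])  # the coincident point receives the full contact force
```

Points far from any contact receive near-zero force, so the dense label stays consistent with the sparse manifold the simulator reports.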

  40. [40] Real-World Sensor Calibration: To bridge the gap between simulated and real tactile readings, we perform a force calibration procedure on the real GelSight sensors. The calibration involves making the gripper grasp a reference block and then applying a known external force by placing a calibrated weight on the block. During this interaction, we monitor the…

  41. [41] Contact Field Loss Functions: We optimize the network using a composite loss function: L_total = λ_prob · L_prob + λ_force · (λ_mag · L_mag + λ_dir · L_dir) (7), where λ_prob = 1.0 and λ_force = 2.0. Within the force term, the components are weighted by λ_mag = 1.5 and λ_dir = 1.0. Contact Probability Loss (L_prob): We use the Focal Loss to handle the extreme class imbalance (contac…
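The weighted composite and a binary focal loss can be sketched as follows; the focusing parameter `gamma = 2.0` is an assumption (the excerpt truncates before specifying it), and the loss values passed in at the end are placeholders:

```python
import numpy as np

# Hedged sketch of the composite contact-field loss from the appendix:
# L_total = lam_prob*L_prob + lam_force*(lam_mag*L_mag + lam_dir*L_dir),
# with lam_prob=1.0, lam_force=2.0, lam_mag=1.5, lam_dir=1.0 as stated.
def focal_loss(p, y, gamma=2.0, eps=1e-7):
    """Binary focal loss: down-weights well-classified points, which
    helps with the extreme contact/no-contact class imbalance."""
    p = np.clip(p, eps, 1 - eps)
    pt = np.where(y == 1, p, 1 - p)  # probability of the true class
    return float(np.mean(-((1 - pt) ** gamma) * np.log(pt)))

def composite_loss(l_prob, l_mag, l_dir,
                   lam_prob=1.0, lam_force=2.0, lam_mag=1.5, lam_dir=1.0):
    """Weighted total loss per Eq. (7) of the appendix excerpt."""
    return lam_prob * l_prob + lam_force * (lam_mag * l_mag + lam_dir * l_dir)

# Placeholder component losses, just to show the weighting:
print(composite_loss(0.1, 0.2, 0.3))  # 0.1 + 2*(1.5*0.2 + 1.0*0.3) = 1.3
```

Confident predictions on the majority no-contact class contribute almost nothing to the focal term, which is the point of swapping it in for plain BCE.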

  42. [42] Contact Field Training Schedule: Table VI summarizes the training parameters.
      TABLE VI: Contact Field Model Training Hyperparameters
      Parameter          Stage 1 (Sim)       Stage 2 (Real)
      Optimizer          AdamW               AdamW
      Learning rate      1e-4                5e-6
      LR scheduler       ReduceLROnPlateau   ReduceLROnPlateau
      Batch size         320                 128
      Epochs             400                 60
      Point translation  ±0.1 m              ±0.05 m
      Point rotation     ±30° (Z-axis)       ±…

  43. [43] Diffusion Policy Hyperparameters: We utilize a Diffusion Policy modeled as a conditional U-Net to predict robot actions. The policy takes a history of T_obs = 3 observations and predicts a sequence of action steps with a prediction horizon of T = 16, executing T_action = 8 steps before replanning. The specific hyperparameters are detailed in Table VII…
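The receding-horizon loop these hyperparameters imply can be sketched as follows; `predict_actions` is a hypothetical stand-in for the diffusion policy's denoising call, and the observation bookkeeping is illustrative:

```python
# Hedged sketch of receding-horizon execution: the policy sees
# T_obs = 3 past observations, predicts T = 16 action steps, executes
# T_action = 8 of them, then replans, matching the appendix excerpt.
T_OBS, T_PRED, T_ACTION = 3, 16, 8

def predict_actions(obs_history):
    """Placeholder for the diffusion policy's denoised action sequence."""
    assert len(obs_history) == T_OBS
    return list(range(T_PRED))  # dummy 16-step plan

executed = []
obs_history = [0, 1, 2]                 # last T_obs observations
for _ in range(2):                      # two replanning cycles
    plan = predict_actions(obs_history)
    executed.extend(plan[:T_ACTION])    # execute first 8 predicted steps
    obs_history = obs_history[-T_OBS:]  # refresh history (illustrative)
print(len(executed))                    # 16 actions over two cycles
```

Executing only half the predicted horizon before replanning is the usual trade between open-loop smoothness and closed-loop reactivity in diffusion-policy deployments.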

  44. [44] Details on Evaluation Metrics for Scraping Task: In the scraping task described in the main text, we employ two primary metrics to evaluate performance: Scraping Efficiency (Eff) and Normalized Scraping Efficiency (Eff_Norm). Scraping Efficiency (Eff): this metric measures the percentage of debris successfully removed. We weigh the debris pushed behind the…

  45. [45] Qualitative Evaluation of Contact Field Prediction: To evaluate the robustness of our contact field estimation, we provide qualitative comparisons between the model's predictions and the ground truth (or pseudo-ground truth) data across both simulated and real-world domains. a) Simulation Results: Figure 10 illustrates the contact field prediction in the si…