pith. sign in

arxiv: 2606.21148 · v1 · pith:KVZB5OB2new · submitted 2026-06-19 · 💻 cs.RO

Pose-Agnostic Robotic Functional Grasping via Observation-Action Canonicalization

Pith reviewed 2026-06-26 14:24 UTC · model grok-4.3

classification 💻 cs.RO
keywords functional graspingvisuomotor policyobservation-action canonicalizationobject-centric framesim-to-real transferreinforcement learningmug handle graspingpose-agnostic control
0
0 comments X

The pith

Transforming both depth views and robot actions into a shared mug-centered frame lets one learned policy grasp handles regardless of upright or inverted placement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that functional grasping of mug handles fails when policies must implicitly absorb large changes in both visual appearance and required action direction caused by different object poses. AnyMug removes that coupling by mapping the incoming depth image and the outgoing end-effector command into one fixed object-centric coordinate system before the policy sees them. Because the policy always operates on the same mug-centered view and always outputs actions in the same canonical direction, the same weights succeed on both upright and inverted mugs without retraining. A handle-aware reward plus a pose curriculum and domain randomization close the remaining sim-to-real gap, producing zero-shot transfer to a physical Franka Panda.

Core claim

AnyMug introduces observation-action canonicalization that converts the depth observation and the predicted end-effector action into a shared object-centric frame so the policy receives a consistent mug-centered view and emits actions in a canonical direction independent of the mug's placement; combined with a handle-aware reward, pose curriculum, and domain randomization, the same closed-loop policy trained entirely in simulation reaches over 93 percent success on unseen upright and inverted mugs in simulation and 80 percent success zero-shot on five held-out physical mugs.

What carries the argument

Observation-action canonicalization, which maps both the depth image and the action vector into one fixed object-centric coordinate frame before the policy processes them.

If this is right

  • A single set of policy weights suffices for both upright and inverted placements instead of requiring separate policies or explicit pose estimation.
  • Grasp pose detection is replaced by closed-loop visuomotor control that continuously corrects for perception error on thin handles.
  • Domain randomization and a pose curriculum together suffice for zero-shot sim-to-real transfer on this task without additional real-world fine-tuning.
  • The same canonicalization step can be applied to other functional grasping tasks whose contact geometry is defined relative to an object feature rather than the world frame.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the object-centric frame can be maintained across a wider class of objects, the method may reduce the need for instance-specific training data in other contact-rich manipulation tasks.
  • The approach implicitly assumes that the gripper can always reach the canonical action direction without kinematic collision; testing on robots with different arm geometries would reveal whether that assumption holds.
  • Extending the canonical frame to include velocity or force information could allow the same policy to handle compliant or slippery handles without changing the observation-action mapping.

Load-bearing premise

A reliable object-centric frame can be extracted from depth observations alone even when the mug is inverted or partially occluded.

What would settle it

Deploy the policy on a real mug whose handle lies in a plane that prevents reliable extraction of the object-centric frame from a single depth image and measure whether success rate collapses below the reported 80 percent.

Figures

Figures reproduced from arXiv: 2606.21148 by Cole Harrison, Jiankai Sun, Le Qiu, Marco Pavone, Qianzhong Chen, Suning Huang, Yang You, Yao Liu.

Figure 1
Figure 1. Figure 1: Overview of AnyMug. (a) Challenge: Functional mug-handle grasping must generalize across mug geometries, placements, and pose categories. (b) AnyMug: AnyMug canonicalizes the top-view depth image Dt into a mug-centric frame by centering the mug and aligning the handle to a fixed direction. A shared policy πθ takes the canonical depth, mug-frame end-effector pose, and pose indicator, and predicts a canonica… view at source ↗
Figure 2
Figure 2. Figure 2: Observation-action canonicalization. Left: We canonicalize each top-view depth ob￾servation Dt by projecting the mug center and handle center into the image, estimating the handle direction αt, and applying an affine warp that centers the mug and aligns the handle to a fixed canonical direction. This produces a canonical depth observation Dc t . Right: The policy predicts a canonical action a c t = (∆p c t… view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative rollouts of AnyMug. Simulation rollouts (a,b) and real-world rollouts (c,d). In both domains, the policy approaches the handle, aligns the gripper, closes around the handle, and lifts the mug to verify grasp stability. 5 Experiments We evaluate AnyMug in simulation and on a real Franka Panda robot. Our experiments answer three questions: (1) Can a single policy perform functional handle graspin… view at source ↗
read the original abstract

Functional robotic grasping requires a policy that generalizes across diverse object geometries and poses while maintaining task-specific contact precision. We study this challenge through mug-handle grasping, where thin handles, instance variation, and upright or inverted placements make both perception and control sensitive to object configuration. Grasp pose detection methods operate open-loop and are sensitive to estimation errors on thin handle structures. Learned visuomotor policies must implicitly learn to handle the coupled variation in visual appearance and action direction induced by different object placements, limiting generalization. We propose AnyMug, a canonicalized visuomotor reinforcement learning framework for functional grasping that trains a single closed-loop policy entirely in simulation and deploys it zero-shot on a real robot. AnyMug introduces observation-action canonicalization, which transforms both the depth observation and the predicted end-effector action into a shared object-centric frame. The policy therefore sees a consistent mug-centered view and emits actions in a canonical direction regardless of mug placement, allowing the same grasping behavior to be reused across configurations. A handle-aware reward further encourages precise approach, gripper alignment, and opposing-finger placement, while a pose curriculum and domain randomization improve training stability and sim-to-real transfer. In simulation, AnyMug achieves over 93% success rate on both unseen upright and inverted mugs and transfers zero-shot to a real Franka Panda, reaching 80% success rate on 5 held-out physical mugs across both pose categories.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents AnyMug, a visuomotor reinforcement learning framework for functional grasping of mugs that introduces observation-action canonicalization: depth observations and predicted end-effector actions are transformed into a shared object-centric frame so that a single policy sees a consistent mug-centered view and produces canonical actions independent of mug pose. The policy is trained entirely in simulation with a handle-aware reward, pose curriculum, and domain randomization, then deployed zero-shot on a real Franka Panda, reporting >93% success on unseen upright/inverted mugs in simulation and 80% success on five held-out physical mugs across both pose categories.

Significance. If the canonicalization mechanism is robust to real depth noise, the approach provides a concrete way to factor out object pose variation from the learned policy, which could reduce the data requirements for generalizing functional grasping across instance and placement diversity. The reported zero-shot sim-to-real numbers on a challenging thin-handle task would be a useful data point for the community if supported by the necessary ablation and error analysis.

major comments (2)
  1. [Abstract] Abstract and implied methods: the central claim attributes the 80% real-world success to observation-action canonicalization, yet the manuscript provides no quantitative measurement of 6-DoF pose estimation error from real depth images on thin handles or inverted placements, nor any sensitivity study showing how such errors affect the canonicalized policy input. Without these, it is impossible to confirm that the reported performance stems from the claimed mechanism rather than from the reward or randomization alone.
  2. [Abstract / Methods] The weakest assumption listed in the stress-test note is load-bearing: the framework presupposes that a reliable object-centric frame can be recovered from depth alone in both sim and reality. If this step fails on real sensor data for the handle geometry, the canonicalization benefit is nullified and the zero-shot transfer result cannot be interpreted as evidence for the method.
minor comments (2)
  1. [Abstract] Abstract: success rates are stated as 'over 93%' and '80%' without trial counts, standard deviations, or number of evaluation episodes, making it difficult to judge reproducibility or statistical reliability.
  2. [Methods] The description of the handle-aware reward and pose curriculum is high-level; explicit equations or pseudocode for the reward terms and curriculum schedule would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for stronger validation of the canonicalization mechanism. We respond point-by-point to the major comments.

read point-by-point responses
  1. Referee: [Abstract] Abstract and implied methods: the central claim attributes the 80% real-world success to observation-action canonicalization, yet the manuscript provides no quantitative measurement of 6-DoF pose estimation error from real depth images on thin handles or inverted placements, nor any sensitivity study showing how such errors affect the canonicalized policy input. Without these, it is impossible to confirm that the reported performance stems from the claimed mechanism rather than from the reward or randomization alone.

    Authors: We agree that the manuscript lacks a direct quantitative measurement of 6-DoF pose estimation error on real depth images (particularly for thin handles and inverted placements) and a corresponding sensitivity study. While the paper includes ablation studies isolating the contribution of canonicalization versus reward and randomization, these do not substitute for explicit error analysis. In the revised manuscript we will add (i) measured 6-DoF pose estimation accuracy on the five held-out physical mugs in both upright and inverted configurations and (ii) a sensitivity study that perturbs the estimated object frame and reports resulting policy success rates. revision: yes

  2. Referee: [Abstract / Methods] The weakest assumption listed in the stress-test note is load-bearing: the framework presupposes that a reliable object-centric frame can be recovered from depth alone in both sim and reality. If this step fails on real sensor data for the handle geometry, the canonicalization benefit is nullified and the zero-shot transfer result cannot be interpreted as evidence for the method.

    Authors: Recovering a reliable object-centric frame from depth is indeed a central assumption. In simulation ground-truth poses are used; on the real robot we rely on depth-based estimation whose accuracy is not separately quantified in the current manuscript. The 80% zero-shot success rate provides indirect support that frame recovery is adequate for the evaluated mugs and poses, but we accept that this does not fully substantiate the assumption. The revision will therefore include an explicit description of the real-world pose estimation procedure together with its observed accuracy and any failure modes on thin-handle geometry. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method is a standard RL design with empirical validation

full rationale

The paper describes a visuomotor RL framework that applies observation-action canonicalization to an object-centric frame using object pose (available as ground truth in simulation), combined with domain randomization, a pose curriculum, and a handle-aware reward. Reported performance metrics are direct empirical results from training and zero-shot deployment rather than any derivation that reduces by construction to fitted parameters or self-citations. No load-bearing equations, uniqueness theorems, or ansatzes are presented that collapse the central claim into its inputs. The approach is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed from abstract only; no explicit free parameters, axioms, or invented entities are identifiable.

pith-pipeline@v0.9.1-grok · 5804 in / 1222 out tokens · 28279 ms · 2026-06-26T14:24:45.093721+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

38 extracted references · 5 linked inside Pith

  1. [1]

    Bicchi and V

    A. Bicchi and V . Kumar. Robotic grasping and contact: A review. InProceedings 2000 ICRA. Millennium conference. IEEE international conference on robotics and automation. Symposia proceedings (Cat. No. 00CH37065), volume 1, pages 348–353. IEEE, 2000

  2. [2]

    J. Bohg, A. Morales, T. Asfour, and D. Kragic. Data-driven grasp synthesis—a survey.IEEE Transactions on robotics, 30(2):289–309, 2013

  3. [3]

    J. Sun, A. Curtis, Y . You, Y . Xu, M. Koehle, Q. Chen, S. Huang, L. Guibas, S. Chitta, M. Schwager, et al. Arch: Hierarchical hybrid learning for long-horizon contact-rich robotic assembly.CoRL, 2025

  4. [4]

    X. Chen, Z. Ye, J. Sun, Y . Fan, F. Hu, C. Wang, and C. Lu. Transferable active grasping and real embodied dataset. In2020 IEEE International Conference on Robotics and Automation (ICRA), pages 3611–3618. IEEE, 2020

  5. [5]

    Q. Chen, H. Zheng, J. Yu, S. Huang, J. Sun, K. Goldberg, C. Wen, P. Abbeel, Y . Shentu, P. Wu, et al. Sarm2: Multi-task stage aware reward modeling for self improving robotic manipulation. arXiv preprint arXiv:2606.10305, 2026

  6. [6]

    Huang, J

    S. Huang, J. Shao, K. Wang, Q. Chen, J. Sun, Y . Guo, M. Schwager, and J. Bohg. Breaking lock-in: Preserving steerability under low-data vla post-training.arXiv preprint arXiv:2604.23121, 2026

  7. [7]

    Huang, T

    Y . Huang, T. Davies, J. Yan, J. Sun, X. Chen, and L. Hu. Spatial robograsp: Generalized robotic grasping control policy.arXiv preprint arXiv:2505.20814, 2025

  8. [8]

    C. Wang, L. Fan, J. Sun, R. Zhang, L. Fei-Fei, D. Xu, Y . Zhu, and A. Anandkumar. Mimicplay: Long-horizon imitation learning by watching human play.CoRL, 2023

  9. [9]

    Huang, Q

    S. Huang, Q. Chen, X. Zhang, J. Sun, and M. Schwager. Particleformer: A 3d point cloud world model for multi-object, multi-material robotic manipulation.CoRL, 2025

  10. [10]

    Mahler, J

    J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg. Dex- net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics.arXiv preprint arXiv:1703.09312, 2017

  11. [11]

    Mousavian, C

    A. Mousavian, C. Eppner, and D. Fox. 6-dof graspnet: Variational grasp generation for object manipulation. InProceedings of the IEEE/CVF international conference on computer vision, pages 2901–2910, 2019

  12. [12]

    H.-S. Fang, C. Wang, H. Fang, M. Gou, J. Liu, H. Yan, W. Liu, Y . Xie, and C. Lu. Anygrasp: Robust and efficient grasp perception in spatial and temporal domains.IEEE Transactions on Robotics, 39(5):3929–3945, 2023

  13. [13]

    C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660, 2017

  14. [14]

    Y . You, W. He, J. Liu, H. Xiong, W. Wang, and C. Lu. Cppf++: Uncertainty-aware sim2real ob- ject pose estimation by vote aggregation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):9239–9254, 2024

  15. [15]

    Levine, P

    S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection.The International journal of robotics research, 37(4-5):421–436, 2018

  16. [16]

    Dmitry, I

    K. Dmitry, I. Alex, P. Peter, I. Julian, H. Alexander, J. Eric, Q. Deirdre, H. Ethan, K. Mri- nal, V . Vincent, et al. Qt-opt. scalable deep reinforcement learning for vision-based robotic manipulation.arXiv preprint, 2018. 9

  17. [17]

    C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

  18. [18]

    Zitkovich, T

    B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023

  19. [19]

    Makoviychuk, L

    V . Makoviychuk, L. Wawrzyniak, Y . Guo, M. Lu, K. Storey, M. Macklin, D. Hoeller, N. Rudin, A. Allshire, A. Handa, et al. Isaac gym: High performance gpu-based physics simulation for robot learning.arXiv preprint arXiv:2108.10470, 2021

  20. [20]

    Schulman, F

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  21. [21]

    D. Wang, S. Hart, D. Surovik, T. Kelestemur, H. Huang, H. Zhao, M. Yeatman, J. Wang, R. Walters, and R. Platt. Equivariant diffusion policy. InConference on Robot Learning (CoRL), 2024

  22. [22]

    Yang, Z.-a

    J. Yang, Z.-a. Cao, C. Deng, R. Antonova, S. Song, and J. Bohg. EquiBot: SIM(3)-equivariant diffusion policy for generalizable and data-efficient learning. InConference on Robot Learning (CoRL), 2024

  23. [23]

    A. T. Miller and P. K. Allen. Graspit! a versatile simulator for robotic grasping.IEEE Robotics & Automation Magazine, 11(4):110–122, 2004

  24. [24]

    Redmon and A

    J. Redmon and A. Angelova. Real-time grasp detection using convolutional neural networks. In2015 IEEE international conference on robotics and automation (ICRA), pages 1316–1322. IEEE, 2015

  25. [25]

    Pinto and A

    L. Pinto and A. Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In2016 IEEE international conference on robotics and automation (ICRA), pages 3406–3413. IEEE, 2016

  26. [26]

    Mahler, F

    J. Mahler, F. T. Pokorny, B. Hou, M. Roderick, M. Laskey, M. Aubry, K. Kohlhoff, T. Kr¨oger, J. Kuffner, and K. Goldberg. Dex-net 1.0: A cloud-based network of 3d objects for robust grasp planning using a multi-armed bandit model with correlated rewards. In2016 IEEE international conference on robotics and automation (ICRA), pages 1957–1964. IEEE, 2016

  27. [27]

    Mahler, M

    J. Mahler, M. Matl, V . Satish, M. Danielczuk, B. DeRose, S. McKinley, and K. Goldberg. Learning ambidextrous robot grasping policies.Science robotics, 4(26):eaau4984, 2019

  28. [28]

    Ten Pas, M

    A. Ten Pas, M. Gualtieri, K. Saenko, and R. Platt. Grasp pose detection in point clouds.The International Journal of Robotics Research, 36(13-14):1455–1473, 2017

  29. [29]

    X. Yan, J. Hsu, M. Khansari, Y . Bai, A. Pathak, A. Gupta, J. Davidson, and H. Lee. Learning 6-dof grasping interaction via deep geometry-aware 3d representations. In2018 IEEE Inter- national Conference on Robotics and Automation (ICRA), pages 3766–3773. IEEE, 2018

  30. [30]

    Viereck, A

    U. Viereck, A. Pas, K. Saenko, and R. Platt. Learning a visuomotor controller for real world robotic grasping using simulated depth images. InConference on robot learning, pages 291–

  31. [31]

    A. Zeng, S. Song, S. Welker, J. Lee, A. Rodriguez, and T. Funkhouser. Learning synergies between pushing and grasping with self-supervised deep reinforcement learning. In2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4238–

  32. [32]

    C.-C. Hsu, B. Wen, J. Xu, Y . Narang, X. Wang, Y . Zhu, J. Biswas, and S. Birchfield. Spot: Se (3) pose trajectory diffusion for object-centric manipulation. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 4853–4860. IEEE, 2025

  33. [33]

    H. Wang, S. Sridhar, J. Huang, J. Valentin, S. Song, and L. J. Guibas. Normalized object coordinate space for category-level 6d object pose and size estimation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2642–2651, 2019

  34. [34]

    A. Zeng, P. Florence, J. Tompson, S. Welker, J. Chien, M. Attarian, T. Armstrong, I. Krasin, D. Duong, V . Sindhwani, et al. Transporter networks: Rearranging the visual world for robotic manipulation. InConference on Robot Learning, pages 726–747. PMLR, 2021

  35. [35]

    Huang, D

    H. Huang, D. Wang, R. Walters, and R. Platt. Equivariant transporter network.arXiv preprint arXiv:2202.09400, 2022

  36. [36]

    X. Zhu, D. Wang, G. Su, O. Biza, R. Walters, and R. Platt. On robot grasp learning using equivariant models.Autonomous Robots, 47(8):1175–1193, 2023

  37. [37]

    D. Wang, R. Walters, and R. Platt. SO(2)-equivariant reinforcement learning. InInternational Conference on Learning Representations (ICLR), 2022

  38. [38]

    D. Wang, M. Jia, X. Zhu, R. Walters, and R. Platt. On-robot learning with equivariant models. InConference on Robot Learning (CoRL), 2022. 11 Appendix A Related Work A.1 Grasp Pose Detection A popular paradigm in robotic grasping is to predict a target grasp pose from visual input. Classical methods [1, 23, 2] use analytical grasp metrics or geometric heu...