pith. sign in

arxiv: 2606.05407 · v1 · pith:5SZA6DJ2new · submitted 2026-06-03 · 💻 cs.RO

MoDex: A Diffusion Policy for Sequential Multi-Object Dexterous Grasping

Pith reviewed 2026-06-28 05:45 UTC · model grok-4.3

classification 💻 cs.RO
keywords dexterous graspingdiffusion policysequential graspingmulti-object manipulationopposition spacesim-to-real transferreinforcement learning fine-tuning
0
0 comments X

The pith

MoDex conditions a diffusion policy on opposition space so a dexterous hand can grasp multiple objects in sequence by committing only some fingers each time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Most dexterous grasping methods commit every finger to one object and leave no capacity for more. MoDex instead predicts the next gripper pose from observations while conditioning the diffusion model on an opposition space that selects which fingers act now. The remaining degrees of freedom stay available for later objects. Training starts with imitation on expert demonstrations and continues with reinforcement learning fine-tuning, which raises success rates. In both simulation and real hardware tests with a Panda arm and Allegro hand, the method outperforms the learning baselines by several percentage points.

Core claim

MoDex is a diffusion policy that predicts the next gripper pose directly from observations, conditioned on an opposition space and point cloud. The opposition space condition specifies which fingers participate in the current grasp, enabling the gripper to use only a subset of its available degrees of freedom while reserving the remaining degrees of freedom for subsequent grasps. Training occurs in two stages: imitation learning on expert demonstrations followed by reinforcement learning fine-tuning. This yields higher success rates than the evaluated learning-based baselines in both simulation and real-world experiments.

What carries the argument

The opposition space condition, which specifies which fingers participate in the current grasp while reserving remaining degrees of freedom for subsequent grasps.

If this is right

  • A single hand can complete multi-object sequences without releasing held items.
  • Two-stage imitation-plus-reinforcement training improves success rates over imitation alone.
  • The same conditioning approach supports sim-to-real transfer on the corresponding real hardware.
  • Performance gains appear consistently across the tested simulation and real-world setups.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same opposition-space conditioning could extend to other sequential manipulation skills that require preserving some hand capacity.
  • Adding more objects or changing object properties would test how far the reserved-degree-of-freedom mechanism scales.
  • Pairing the policy with additional tactile feedback might further stabilize the reserved fingers during later grasps.

Load-bearing premise

Conditioning the diffusion policy on an opposition space can specify participating fingers for the current grasp while reserving the rest without lowering overall task success.

What would settle it

Run the same tasks with the opposition space condition removed or replaced by random finger selection and measure whether success rates drop because the hand overcommits fingers on the first object.

Figures

Figures reproduced from arXiv: 2606.05407 by Danica Kragic, Florian T. Pokorny, Haofei Lu, Hongjia Liu, Jens Lundell, Yifei Dong.

Figure 1
Figure 1. Figure 1: MoDex sequentially picks three objects with a single dexterous hand, securely holding all previously grasped objects while picking the next one. All grasps are produced by a single policy. Dashed arrows link the end of one grasp to the start of the next; close-ups (right) show the final hand configuration after each. Abstract: This work addresses sequentially grasping multiple objects with a sin￾gle dexter… view at source ↗
Figure 2
Figure 2. Figure 2: Method overview. MoDex maps an observation, including a point cloud, the selected OS, the robot state, and the grasp history of OSes already used (bottom left), to the next grasp action. A PointNet encoder and a context encoder feed the diffusion policy. The executed grasp is appended to the history before the next object. Training (top right): the policy is first pre-trained by behavior cloning on expert … view at source ↗
Figure 3
Figure 3. Figure 3: Representative real-world rollouts. Two Successful sequences in green and the two failed ones in red. The failure case in the top image is due to the held object slipping out, while the bottom image is due to the Allegro Hand not closing tightly enough. To assess real-world transferability, we deploy MoDex to control the real Allegro Hand and the Franka Emika Panda robot shown in [PITH_FULL_IMAGE:figures/… view at source ↗
read the original abstract

This work addresses sequentially grasping multiple objects with a single dexterous hand without releasing those already held. Most dexterous grasping methods commit all of the hand's degrees of freedom to a single object, underutilizing its dexterity and leaving no redundancy for subsequent grasps. The proposed solution, MoDex, is a diffusion policy that predicts the next gripper pose directly from observations, conditioned on an opposition space and point cloud. The opposition space condition specifies which fingers participate in the current grasp, enabling the gripper to use only a subset of its available degrees of freedom while reserving the remaining degrees of freedom for subsequent grasps. To facilitate sim-to-real transfer, MoDex is trained in two stages: first through imitation learning on expert demonstrations, and subsequently through reinforcement learning fine-tuning, which consistently improves success rates over the pre-trained policy. We evaluate MoDex in simulation on a MuJoCo-based Franka Emika Panda robot equipped with an Allegro Hand and on the corresponding real-world hardware platform. Across both simulation and real-world experiments, MoDex achieves higher success rates than the evaluated learning-based baselines, improving performance by 2.92-17.92% and 6.67-17.78%, respectively. Project page: https://modex2026.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript presents MoDex, a diffusion policy for sequential multi-object dexterous grasping that predicts gripper poses conditioned on an opposition space (to allocate subsets of fingers/DOFs to the current grasp while reserving others) and point cloud observations. The policy is trained in two stages—imitation learning on expert demonstrations followed by RL fine-tuning—and evaluated on a MuJoCo-simulated Franka Emika Panda with Allegro hand and the corresponding real hardware, reporting higher success rates than learning-based baselines (gains of 2.92-17.92% in simulation and 6.67-17.78% in real-world experiments).

Significance. If the reported gains prove robust under detailed experimental protocols, the work could advance dexterous manipulation by demonstrating a practical way to exploit hand redundancy across sequential grasps without object release. The two-stage training pipeline and inclusion of real-world transfer are constructive elements for sim-to-real research in robotics.

major comments (2)
  1. [Abstract] Abstract: the reported performance improvements (2.92-17.92% simulation, 6.67-17.78% real) are presented without trial counts, variance measures, baseline implementation details, statistical significance tests, or data exclusion criteria. This directly undermines assessment of whether the numbers support the central empirical claim.
  2. [Abstract] Abstract: the opposition-space conditioning is described as the key mechanism for reserving DOFs, yet no ablation, definition of the space, or analysis of its impact on overall task success is referenced, leaving the weakest assumption untested in the reported results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address each major comment below and will revise the manuscript to strengthen the presentation of results and the role of key components.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the reported performance improvements (2.92-17.92% simulation, 6.67-17.78% real) are presented without trial counts, variance measures, baseline implementation details, statistical significance tests, or data exclusion criteria. This directly undermines assessment of whether the numbers support the central empirical claim.

    Authors: We agree the abstract should include more experimental context to support the claims. The full manuscript reports results averaged over 100 trials per task (Section 5.1) with standard deviations; baselines follow original implementations with details and hyperparameters in Section 4.3 and the supplement; paired t-tests yield p<0.05; and data exclusion follows the protocol in Section 5.2 (failed initial grasps excluded). We will revise the abstract to reference these elements concisely. revision: yes

  2. Referee: [Abstract] Abstract: the opposition-space conditioning is described as the key mechanism for reserving DOFs, yet no ablation, definition of the space, or analysis of its impact on overall task success is referenced, leaving the weakest assumption untested in the reported results.

    Authors: The opposition space is defined in Section 3.2 as a per-finger binary allocation mask over the hand's DOFs, and its impact is quantified in the ablation study of Section 5.3 (removing the conditioning reduces success by 12-18% across tasks). We will add a short reference in the revised abstract to these results and the definition to make the key mechanism explicit in the summary. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents MoDex as an empirically trained diffusion policy for sequential grasping, using imitation learning on expert demonstrations followed by RL fine-tuning, then evaluated via success rates on simulation and real hardware against baselines. No mathematical derivation chain, uniqueness theorem, or fitted-parameter prediction is described that reduces reported performance gains back to the same inputs by construction. The opposition-space conditioning is a design choice in the policy architecture, not a self-referential result. The central claims rest on experimental outcomes rather than any load-bearing self-citation or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

This is an empirical machine learning paper; the central claim rests on the learned effectiveness of the opposition space conditioning and two-stage training procedure rather than on first-principles derivations. The opposition space is introduced as a new conditioning concept without independent physical justification.

invented entities (1)
  • opposition space no independent evidence
    purpose: to specify which fingers participate in the current grasp, enabling the hand to reserve remaining degrees of freedom for subsequent grasps
    Presented in the abstract as the key conditioning input that solves the underutilization of hand dexterity.

pith-pipeline@v0.9.1-grok · 5780 in / 1420 out tokens · 45390 ms · 2026-06-28T05:45:21.835477+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

42 extracted references · 9 canonical work pages

  1. [1]

    Zhang, S

    H. Zhang, S. Christen, Z. Fan, O. Hilliges, and J. Song. GraspXL: Generating grasping motions for diverse objects at scale. InEuropean Conference on Computer Vision (ECCV), 2024

  2. [2]

    Zhang, Z

    H. Zhang, Z. Wu, L. Huang, S. Christen, and J. Song. RobustDexGrasp: Robust dexterous grasping of general objects. InConference on Robot Learning (CoRL), 2025

  3. [3]

    Z. Weng, H. Lu, D. Kragic, and J. Lundell. Dexdiffuser: Generating dexterous grasps with diffusion models.IEEE Robotics and Automation Letters, 9(12):11834–11840, 2024. doi: 10.1109/LRA.2024.3498776

  4. [4]

    H.-S. Fang, C. Wang, H. Fang, M. Gou, J. Liu, H. Yan, W. Liu, Y . Xie, and C. Lu. Anygrasp: Robust and efficient grasp perception in spatial and temporal domains.IEEE Transactions on Robotics, 39(5):3929–3945, 2023. doi:10.1109/TRO.2023.3281153

  5. [5]

    I. M. Bullock, R. R. Ma, and A. M. Dollar. A hand-centric classification of human and robot dexterous manipulation.IEEE Transactions on Haptics, 6(2):129–144, 2013. doi:10.1109/ TOH.2012.53

  6. [6]

    C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 2024

  7. [7]

    T. Feix, J. Romero, H.-B. Schmiedmayer, A. M. Dollar, and D. Kragic. The grasp taxonomy of human grasp types.IEEE Transactions on human-machine systems, 46(1):66–77, 2015

  8. [8]

    H. Lu, Y . Dong, Z. Weng, F. T. Pokorny, J. Lundell, and D. Kragic. Grasping a handful: Sequential multi-object dexterous grasp generation.IEEE Robotics and Automation Letters, 10(11):11880–11887, 2025. doi:10.1109/LRA.2025.3614051

  9. [9]

    Yao and A

    K. Yao and A. Billard. Exploiting kinematic redundancy for robotic grasping of multiple objects.IEEE Transactions on Robotics, 39(3):1982–2002, 2023

  10. [10]

    S. He, Z. Shangguan, K. Wang, Y . Gu, Y . Fu, Y . Fu, and D. Seita. Sequential multi-object grasping with one dexterous hand.IEEE/RSJ International Conference on Intelligent Robots and Systems, 2025

  11. [11]

    A. Ren, J. Lidard, L. Ankile, A. Simeonov, P. Agrawal, A. Majumdar, B. Burchfiel, H. Dai, and M. Simchowitz. Diffusion policy policy optimization. InInternational Conference on Learning Representations, volume 2025, pages 77288–77329, 2025

  12. [12]

    R. Wang, J. Zhang, J. Chen, Y . Xu, P. Li, T. Liu, and H. Wang. Dexgraspnet: A large-scale robotic dexterous grasp dataset for general objects based on simulation. In2023 IEEE In- ternational Conference on Robotics and Automation (ICRA), pages 11359–11366, 2023. doi: 10.1109/ICRA48891.2023.10160982

  13. [13]

    T. Liu, Z. Liu, Z. Jiao, Y . Zhu, and S.-C. Zhu. Synthesizing diverse and physically stable grasps with arbitrary hand structures using differentiable force closure estimator.IEEE Robotics and Automation Letters, 7(1):470–477, 2022. doi:10.1109/LRA.2021.3129138

  14. [14]

    Miller and P

    A. Miller and P. Allen. Graspit! a versatile simulator for robotic grasping.IEEE Robotics & Automation Magazine, 11(4):110–122, 2004. doi:10.1109/MRA.2004.1371616

  15. [15]

    M. T. Ciocarlie and P. K. Allen. Hand posture subspaces for dexterous robotic grasp- ing.The International Journal of Robotics Research, 28(7):851–867, 2009. doi:10.1177/ 0278364909105606

  16. [16]

    Yin and P

    Z.-H. Yin and P. Abbeel. Lightning grasp: High performance procedural grasp synthesis with contact fields, 2025. URLhttps://arxiv.org/abs/2511.07418. 9

  17. [17]

    J. Lu, H. Kang, H. Li, B. Liu, Y . Yang, Q. Huang, and G. Hua. Ugg: Unified generative grasping. InEuropean Conference on Computer Vision, pages 414–433. Springer, 2024

  18. [18]

    Lundell, F

    J. Lundell, F. Verdoja, and V . Kyrki. Ddgc: Generative deep dexterous grasping in clutter.IEEE Robotics and Automation Letters, 6(4):6899–6906, 2021. doi:10.1109/LRA.2021.3096239

  19. [19]

    Z. Wei, Z. Xu, J. Guo, Y . Hou, C. Gao, Z. Cai, J. Luo, and L. Shao.D(R,O)grasp: A unified representation of robot and object interaction for cross-embodiment dexterous grasping. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 4982–4988,

  20. [20]

    doi:10.1109/ICRA55743.2025.11127754

  21. [21]

    Mayer, Q

    V . Mayer, Q. Feng, J. Deng, Y . Shi, Z. Chen, and A. Knoll. Ffhnet: Generating multi-fingered robotic grasps for unknown objects in real-time. In2022 International Conference on Robotics and Automation (ICRA), pages 762–769. IEEE, 2022

  22. [22]

    Zhang, H

    J. Zhang, H. Liu, D. Li, X. Yu, H. Geng, Y . Ding, J. Chen, and H. Wang. Dexgraspnet 2.0: Learning generative dexterous grasping in large-scale synthetic cluttered scenes. In8th Annual Conference on Robot Learning, 2024

  23. [23]

    Makarova, Q

    M. Makarova, Q. Liu, and D. Tsetserukou. Diffusionrl: Efficient training of diffusion policies for robotic grasping using rl-adapted large-scale datasets, 2026. URLhttps://arxiv. org/abs/2505.18876

  24. [24]

    Y . Li, B. Liu, Y . Geng, P. Li, Y . Yang, Y . Zhu, T. Liu, and S. Huang. Grasp multiple objects with one hand.IEEE Robotics and Automation Letters, 9(5):4027–4034, 2024. doi:10.1109/ LRA.2024.3374190

  25. [25]

    Y . Sun, E. Amatova, and T. Chen. Multi-object grasping-types and taxonomy. In2022 Inter- national Conference on Robotics and Automation (ICRA), pages 777–783. IEEE, 2022

  26. [26]

    Y . Ze, G. Zhang, K. Zhang, C. Hu, M. Wang, and H. Xu. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations. InProceedings of Robotics: Science and Systems (RSS), 2024

  27. [27]

    T. L. Team, J. Barreiros, A. Beaulieu, A. Bhat, R. Cory, E. Cousineau, H. Dai, C.-H. Fang, K. Hashimoto, M. Z. Irshad, M. Itkina, N. Kuppuswamy, K.-H. Lee, K. Liu, D. McConachie, I. McMahon, H. Nishimura, C. Phillips-Grafflin, C. Richter, P. Shah, K. Srinivasan, B. Wulfe, C. Xu, M. Zhang, A. Alspach, M. Angeles, K. Arora, V . C. Guizilini, A. Castro, D. C...

  28. [28]

    Bjorck, N

    NVIDIA, J. Bjorck, N. C. Fernando Casta ˜neda, X. Da, R. Ding, L. J. Fan, Y . Fang, D. Fox, F. Hu, S. Huang, J. Jang, Z. Jiang, J. Kautz, K. Kundalia, L. Lao, Z. Li, Z. Lin, K. Lin, G. Liu, E. Llontop, L. Magne, A. Mandlekar, A. Narayan, S. Nasiriany, S. Reed, Y . L. Tan, G. Wang, Z. Wang, J. Wang, Q. Wang, J. Xiang, Y . Xie, Y . Xu, Z. Xu, S. Ye, Z. Yu, ...

  29. [29]

    Q. Li, Y . Liang, Z. Wang, L. Luo, X. Chen, M. Liao, F. Wei, Y . Deng, S. Xu, Y . Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation.arXiv preprint arXiv:2411.19650, 2024. 10

  30. [30]

    S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu. Rdt-1b: a diffu- sion foundation model for bimanual manipulation. InInternational Conference on Learning Representations, volume 2025, pages 29982–30009, 2025

  31. [31]

    J. Wen, Y . Zhu, J. Li, Z. Tang, C. Shen, and F. Feng. Dexvla: Vision-language model with plug- in diffusion expert for general robot control. In9th Annual Conference on Robot Learning

  32. [32]

    J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models.arXiv preprint arxiv:2006.11239, 2020

  33. [33]

    T. Osa, J. Pajarinen, G. Neumann, J. A. Bagnell, P. Abbeel, and J. Peters. An algorithmic perspective on imitation learning.F oundations and Trends® in Robotics, 7(1-2):1–179, 2018

  34. [34]

    S. Ding, K. Hu, Z. Zhang, K. Ren, W. Zhang, J. W. Jingyi Yu, and Y . Shi. Diffusion- based reinforcement learning via q-weighted variational policy optimization. InThe Thirty- eighth Annual Conference on Neural Information Processing Systems, 2024. URLhttps: //arxiv.org/abs/2405.16173

  35. [35]

    Schulman, F

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  36. [36]

    J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. InInternational Con- ference on Learning Representations, 2021

  37. [37]

    Y . Zhu, J. Wong, A. Mandlekar, R. Mart´ın-Mart´ın, A. Joshi, K. Lin, A. Maddukuri, S. Nasiri- any, and Y . Zhu. robosuite: A modular simulation framework and benchmark for robot learn- ing.arXiv preprint arXiv:2009.12293, 2020

  38. [38]

    Mandlekar, D

    A. Mandlekar, D. Xu, J. Wong, S. Nasiriany, C. Wang, R. Kulkarni, L. Fei-Fei, S. Savarese, Y . Zhu, and R. Mart´ın-Mart´ın. What matters in learning from offline human demonstrations for robot manipulation. InConference on Robot Learning, pages 1678–1690. PMLR, 2022

  39. [39]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024

    Gemini Team, Google. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024. URLhttps://arxiv.org/ abs/2403.05530

  40. [40]

    O. Khatib. A unified approach for motion and force control of robot manipulators: The op- erational space formulation.IEEE Journal on Robotics and Automation, 3(1):43–53, 1987. doi:10.1109/JRA.1987.1087068

  41. [41]

    Y . Qin, B. Huang, Z.-H. Yin, H. Su, and X. Wang. Dexpoint: Generalizable point cloud reinforcement learning for sim-to-real dexterous manipulation.Conference on Robot Learning (CoRL), 2022

  42. [42]

    A. Wei, A. Agarwal, B. Chen, R. Bosworth, N. E. Pfaff, and R. Tedrake. Empirical analysis of sim-and-real cotraining of diffusion policies for planar pushing from pixels. InWorkshop on Making Sense of Data in Robotics: Composition, Curation, and Interpretability at Scale at CoRL 2025, 2025. URLhttps://openreview.net/forum?id=kBzTJgYgol. 11