pith. machine review for the scientific record.

arxiv: 2605.13117 · v1 · submitted 2026-05-13 · 💻 cs.RO · cs.AI

Recognition: 2 theorem links · Lean Theorem

SECOND-Grasp: Semantic Contact-guided Dexterous Grasping

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 18:36 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords dexterous grasping · semantic contact · robotic manipulation · vision-language reasoning · grasp policy learning · inverse kinematics · contact map refinement

The pith

SECOND-Grasp derives dexterous grasp supervision from language-inferred contacts refined across views to reach 98 percent lifting success on seen objects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to unify physical stability in dexterous grasping with semantic guidance drawn from task language. It obtains initial contact proposals via vision-language reasoning on object properties and intent, then applies segmentation and a Semantic-Geometric Consistency Refinement step to produce reliable 3D contact maps. Inverse kinematics converts each map into a feasible hand pose that supplies the training signal for the grasping policy. When trained on DexGraspNet, the resulting system outperforms baselines on both seen and unseen object categories while raising intent-aware grasping accuracy.
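
To make the data flow concrete, here is a minimal Python sketch of that supervision pipeline. Every helper in it (propose_contacts_vlm, segment_region, refine_sgcr, solve_ik) is a hypothetical stand-in with toy logic, not the authors' code; real versions would call a vision-language model, a segmentation model, the SGCR refinement, and a hand-specific inverse-kinematics solver.

```python
# Hypothetical sketch of the SECOND-Grasp supervision flow described above.
# All helpers are toy stand-ins; only the step ordering mirrors the paper:
# language + multi-view images -> coarse 2D proposals -> refined 3D contact map
# -> IK hand pose -> supervision label for the grasping policy.
import numpy as np

def propose_contacts_vlm(view: np.ndarray, intent: str) -> np.ndarray:
    # Stand-in for vision-language reasoning: a coarse per-pixel contact score map.
    return np.random.rand(*view.shape[:2])

def segment_region(view: np.ndarray, score_map: np.ndarray) -> np.ndarray:
    # Stand-in for segmentation: binarize the coarse proposal.
    return score_map > 0.8

def refine_sgcr(masks, points_3d: np.ndarray) -> np.ndarray:
    # Stand-in for SGCR. The real step back-projects the 2D masks into 3D and
    # enforces cross-view semantic and geometric consistency; lacking camera
    # geometry, this toy scores every point by average mask coverage.
    coverage = float(np.mean([m.mean() for m in masks]))
    return np.full(len(points_3d), coverage)  # per-point contact confidence

def solve_ik(contact_map: np.ndarray, points_3d: np.ndarray) -> np.ndarray:
    # Stand-in for hand IK: a dummy joint vector for a 22-DoF hand.
    return np.zeros(22)

def build_supervision(views, intent: str, points_3d: np.ndarray) -> dict:
    proposals = [propose_contacts_vlm(v, intent) for v in views]      # step 1: VLM proposals
    masks = [segment_region(v, p) for v, p in zip(views, proposals)]  # step 2: segmentation
    contact_map = refine_sgcr(masks, points_3d)                       # step 3: SGCR refinement
    hand_pose = solve_ik(contact_map, points_3d)                      # step 4: IK pose
    return {"contact_map": contact_map, "hand_pose": hand_pose}      # policy supervision

if __name__ == "__main__":
    views = [np.zeros((64, 64, 3)) for _ in range(4)]
    points = np.random.rand(1024, 3)
    label = build_supervision(views, "hold the mug by its handle", points)
    print(label["contact_map"].shape, label["hand_pose"].shape)
```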

Core claim

Deriving policy supervision from 3D contact maps that have been made consistent across views through semantic and geometric checks allows dexterous hands to produce grasps that are simultaneously stable for lifting and aligned with task semantics.

What carries the argument

Semantic-Geometric Consistency Refinement (SGCR), which enforces semantic consistency across multiple viewpoints and discards geometrically invalid regions to turn coarse vision-language contact proposals into accurate 3D contact maps used for inverse-kinematics supervision.
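
As a rough illustration of that two-stage filtering, the sketch below fuses per-view contact votes and then discards geometrically implausible points. The function name, thresholds, and the neighbor-density test are assumptions for the example; the density check is only a crude proxy for the paper's local-convexity and depth-consistency criteria.

```python
# Hypothetical stand-in for the SGCR idea: cross-view vote fusion followed by a
# geometric plausibility filter. Not the paper's algorithm; the geometric test
# here is a simple neighbor-density proxy.
import numpy as np

def refine_contact_map(points: np.ndarray,       # (N, 3) object surface points
                       view_votes: np.ndarray,   # (V, N) 1 if view v marks point n as contact
                       min_views: int = 2,
                       radius: float = 0.02,
                       min_neighbors: int = 5) -> np.ndarray:
    # Semantic consistency: keep points proposed as contact in at least `min_views` views.
    semantic_ok = view_votes.sum(axis=0) >= min_views

    # Geometric plausibility (proxy): a kept contact point should not be an isolated
    # straggler; require enough surviving neighbors within `radius`.
    kept = np.where(semantic_ok)[0]
    dists = np.linalg.norm(points[kept, None, :] - points[None, kept, :], axis=-1)
    dense_enough = (dists < radius).sum(axis=1) - 1 >= min_neighbors  # exclude self
    refined = np.zeros(len(points), dtype=bool)
    refined[kept[dense_enough]] = True
    return refined

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.random((500, 3)) * 0.1                   # points on a ~10 cm object
    votes = (rng.random((4, 500)) < 0.3).astype(int)   # 4 views of noisy proposals
    contact = refine_contact_map(pts, votes)
    print(f"{contact.sum()} of {len(pts)} points kept as contact")
```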

If this is right

  • Policies trained this way generalize to unseen object categories while preserving high lifting success.
  • Intent-aware grasping accuracy improves by more than 12 percent over prior methods.
  • The same contact-to-pose pipeline transfers to different robotic hands such as the Shadow Hand and Allegro Hand (a generic inverse-kinematics sketch follows this list).
  • Supervision can be generated from existing datasets without requiring manual 3D contact annotations.
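
The material summarized here does not commit to a particular solver for the contact-to-pose step; a damped-least-squares update, standard in the inverse-kinematics literature the paper draws on, is one common choice. The planar two-link chain below is purely illustrative of that numerical core and is not the dexterous-hand solver.

```python
# Illustrative damped-least-squares IK on a toy planar 2-link chain.
# A hand-scale solver would use the full-hand Jacobian and contact targets
# from the refined contact map; only the update rule is shown here.
import numpy as np

L1, L2 = 1.0, 0.8  # link lengths of the toy chain

def forward(q: np.ndarray) -> np.ndarray:
    # End-effector position of the 2-link planar arm.
    return np.array([L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1]),
                     L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1])])

def jacobian(q: np.ndarray) -> np.ndarray:
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([[-L1 * s1 - L2 * s12, -L2 * s12],
                     [ L1 * c1 + L2 * c12,  L2 * c12]])

def ik_dls(target: np.ndarray, q0: np.ndarray, damping: float = 0.05, iters: int = 200) -> np.ndarray:
    q = q0.copy()
    for _ in range(iters):
        err = target - forward(q)
        if np.linalg.norm(err) < 1e-5:
            break
        J = jacobian(q)
        # Damped least squares: dq = J^T (J J^T + lambda^2 I)^{-1} err
        dq = J.T @ np.linalg.solve(J @ J.T + damping**2 * np.eye(2), err)
        q = q + dq
    return q

if __name__ == "__main__":
    target = np.array([1.2, 0.9])   # a reachable "contact point" for the toy arm
    q = ik_dls(target, np.array([0.3, 0.3]))
    print("reached:", forward(q).round(4), "target:", target)
```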

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same contact-refinement logic could be applied to more complex manipulation sequences where task language specifies not only grasp but also subsequent motion.
  • Better vision-language models would directly increase the quality of the initial contact proposals and therefore the final grasp reliability.
  • Explicit consistency checks between semantic and geometric cues may prove useful in other robotic perception tasks that currently treat language and geometry separately.

Load-bearing premise

Vision-language reasoning produces contact proposals whose refinement yields 3D maps that accurately capture both semantic intent and physical reachability without introducing errors that invalidate the downstream hand-pose supervision.

What would settle it

Measuring whether lifting success falls below baseline levels when the same method is tested on objects whose language descriptions admit multiple valid but geometrically incompatible contact sets.

Figures

Figures reproduced from arXiv: 2605.13117 by Han Yi Shin, Heeju Ko, Honglak Lee, Jaehyeok Lee, Jaewon Mun, Qixing Huang, Sangpil Kim, Sujin Jang, Sung June Kim.

Figure 1. Contact-based grasping ensures physical feasibility but lacks semantic reasoning, while …
Figure 2. Overview. Given multi-view observations, SECOND-Grasp first infers semantic contact regions using vision-language reasoning (Section 4.1). These 2D proposals are then projected into 3D and refined through Semantic-Geometric Consistency Refinement, which enforces cross-view semantic consistency and geometric consistency based on local convexity (Section 4.2). The refined semantic-geometric contact map …
Figure 3. Qualitative results of intent-aware grasping. Given different intents on the same object, SECOND-Grasp adapts its grasp behavior to the specified contact region. (Panels: Grasp Intention, Initial Contact Map, Semantic Refinement, Geometric Refinement; example intents: "lens of the camera", "barrel of the pistol"; color bar: contact confidence, high to low.)
Figure 4. Qualitative visualization of SGCR. Given a grasp intention, SGCR refines coarse contact proposals into localized and geometrically coherent contact maps. Warmer colors indicate higher contact confidence.
Figure 5. Prompt used for VLM-based grasp intent proposal, including the JSON output format.
Figure 6. Overview of Semantic Contact Map Generation. Visual illustration of the geometric …
Figure 7. Without pseudo-pose supervision, the policy improves only after a long delay, indicating …
Figure 8. Additional qualitative results of intent-aware grasping on the Allegro Hand.
Figure 9. Cross-dataset qualitative results of SGCR contact-map refinement. Coarse contact proposals …
Figure 10. Additional qualitative results of intent-aware grasping. SECOND-Grasp adapts the refined …
read the original abstract

Achieving reliable robotic manipulation, such as dexterous grasping, requires a synergy between physically stable interactions and semantic task guidance, yet these objectives are often treated as separate, disjoint goals. In this paper, we investigate how to integrate dexterous grasping techniques, i.e., physically stable grasps for object lifting and language-guided grasp generation, to achieve both physical stability and semantic understanding. To this end, we propose SECOND-Grasp (SEmantic CONtact-guided Dexterous Grasping), a unified framework that enables robotic hands to dynamically adjust grasping strategies based on semantic reasoning while ensuring physical feasibility. We begin by obtaining coarse contact proposals through vision-language reasoning to infer where contacts should occur based on object properties, followed by segmentation to localize these regions across views. To further ensure consistency across multiple viewpoints, we introduce Semantic-Geometric Consistency Refinement (SGCR), which refines initial contact predictions by enforcing semantic consistency across views and removing geometrically invalid regions, yielding reliable 3D contact maps. Then, we derive a feasible hand pose for each contact map via inverse kinematics, generating a supervision signal for policy learning. Our approach, trained on DexGraspNet, consistently outperforms baselines in lifting success rate on both seen and unseen categories, achieving 98.2% and 97.7%, respectively, while also improving intent-aware grasping by 12.8% and 26.2%. We further show promising results on additional datasets and robotic hands, including Shadow Hand and Allegro Hand.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents SECOND-Grasp, a framework for semantic contact-guided dexterous grasping. It uses vision-language models to generate coarse contact proposals from object properties and task language, refines them via Semantic-Geometric Consistency Refinement (SGCR) to produce consistent 3D contact maps across views, computes feasible hand poses using inverse kinematics, and uses these as supervision for policy learning. Trained on DexGraspNet, it reports lifting success rates of 98.2% on seen categories and 97.7% on unseen categories, with additional gains in intent-aware grasping of 12.8% and 26.2%.

Significance. If the contact maps generated by SGCR are sufficiently accurate, the approach offers a promising way to bridge semantic task understanding with physical grasp stability in dexterous manipulation. The reported high success rates on both seen and unseen objects indicate potential for generalization, and extension to different robotic hands strengthens the contribution. The method's reliance on public benchmarks allows for reproducibility in principle.

major comments (3)
  1. [SGCR description] The accuracy of the Semantic-Geometric Consistency Refinement (SGCR) step is central to generating valid supervision signals for inverse kinematics and policy learning, yet the paper provides no quantitative validation such as contact IoU, point-wise error, or cross-view consistency metrics for the refined 3D contact maps. Without this, it is unclear whether the high lifting success rates reflect true semantic-physical synergy or artifacts from the refinement process. (An illustrative computation of such metrics follows this report.)
  2. [Experimental results] The experimental section lacks details on baseline implementations, including whether they were re-implemented with the same data splits and training protocols from DexGraspNet, as well as any statistical significance tests or error bars for the reported success rates of 98.2% and 97.7%. This weakens the ability to confidently attribute improvements to the proposed method.
  3. [Ablation studies] No ablation studies are presented to isolate the effects of the vision-language contact proposal versus the SGCR refinement, making it difficult to determine which component drives the performance gains on intent-aware grasping.
minor comments (2)
  1. [Abstract] The abstract mentions improvements of 12.8% and 26.2% but does not specify the baseline values or the exact metric for intent-aware grasping.
  2. [Policy learning] Clarify the exact network architecture and loss functions used for the policy learning stage.
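
For concreteness, the contact-map metrics requested in major comment 1 could be computed along the lines below. The IoU convention, the chamfer-style point error, and the toy thresholds are illustrative assumptions, not the paper's evaluation protocol.

```python
# Illustrative contact-map quality metrics: IoU between predicted and reference
# binary contact maps over the same point cloud, and a symmetric nearest-neighbour
# (chamfer-style) distance between the two contact regions.
import numpy as np

def contact_iou(pred: np.ndarray, ref: np.ndarray) -> float:
    pred, ref = pred.astype(bool), ref.astype(bool)
    union = np.logical_or(pred, ref).sum()
    return float(np.logical_and(pred, ref).sum() / union) if union else 1.0

def chamfer_contact_error(points: np.ndarray, pred: np.ndarray, ref: np.ndarray) -> float:
    a, b = points[pred.astype(bool)], points[ref.astype(bool)]
    if len(a) == 0 or len(b) == 0:
        return float("inf")
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return float(d.min(axis=1).mean() + d.min(axis=0).mean()) / 2.0

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    pts = rng.random((2000, 3))
    ref = pts[:, 0] < 0.20    # "reference" contact region (toy)
    pred = pts[:, 0] < 0.25   # slightly over-segmented prediction (toy)
    print("contact IoU:", round(contact_iou(pred, ref), 3))
    print("chamfer error:", round(chamfer_contact_error(pts, pred, ref), 4))
```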

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation.

read point-by-point responses
  1. Referee: [SGCR description] The accuracy of the Semantic-Geometric Consistency Refinement (SGCR) step is central to generating valid supervision signals for inverse kinematics and policy learning, yet the paper provides no quantitative validation such as contact IoU, point-wise error, or cross-view consistency metrics for the refined 3D contact maps. Without this, it is unclear whether the high lifting success rates reflect true semantic-physical synergy or artifacts from the refinement process.

    Authors: We agree that quantitative validation of SGCR is important for substantiating its role. In the revised manuscript we will add a dedicated evaluation subsection reporting contact IoU, average point-wise error, and cross-view consistency metrics computed on a held-out validation split of DexGraspNet. These metrics will quantify the improvement achieved by the refinement step over the initial vision-language proposals. revision: yes

  2. Referee: [Experimental results] The experimental section lacks details on baseline implementations, including whether they were re-implemented with the same data splits and training protocols from DexGraspNet, as well as any statistical significance tests or error bars for the reported success rates of 98.2% and 97.7%. This weakens the ability to confidently attribute improvements to the proposed method.

    Authors: We will clarify the experimental protocol by explicitly stating that all baselines were re-implemented using the identical data splits, preprocessing, and training schedules provided in DexGraspNet. We will also report standard deviations across five random seeds and include paired t-test p-values to establish statistical significance of the reported gains (an illustrative seed-wise computation follows these responses). revision: yes

  3. Referee: [Ablation studies] No ablation studies are presented to isolate the effects of the vision-language contact proposal versus the SGCR refinement, making it difficult to determine which component drives the performance gains on intent-aware grasping.

    Authors: We acknowledge that component-wise ablations would help attribute the observed gains. The revised paper will include ablation experiments that (i) remove the vision-language proposal stage and (ii) disable the SGCR refinement, reporting the resulting intent-aware grasping success rates on both seen and unseen categories. This will isolate the contribution of each module. revision: yes
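
The seed-wise reporting promised in response 2 might look like the following; the success-rate values are made-up placeholders, and SciPy's paired t-test is used only to illustrate the statistic, not to reproduce any result from the paper.

```python
# Illustrative seed-wise comparison: mean, standard deviation, and a paired
# t-test over matched random seeds. All numbers below are placeholders.
import numpy as np
from scipy import stats

baseline = np.array([95.1, 94.8, 95.6, 94.9, 95.3])  # hypothetical lifting success (%) per seed
method   = np.array([98.0, 98.4, 97.9, 98.3, 98.1])  # hypothetical method results, same seeds

t_stat, p_value = stats.ttest_rel(method, baseline)  # paired across seeds
print(f"baseline: {baseline.mean():.1f} ± {baseline.std(ddof=1):.2f}")
print(f"method:   {method.mean():.1f} ± {method.std(ddof=1):.2f}")
print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
```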

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's central claims rest on empirical evaluation of lifting success rates (98.2%/97.7%) and intent-aware grasping improvements on held-out test splits from DexGraspNet. The method chain—vision-language contact proposals, SGCR refinement to 3D maps, inverse-kinematics pose derivation, and policy supervision—is a procedural pipeline whose outputs are measured against external benchmarks rather than being algebraically or definitionally forced by the reported metrics. No equations, fitted parameters renamed as predictions, or load-bearing self-citations reduce the success rates to quantities internal to the training loop. The derivation remains self-contained against the external dataset and evaluation protocol.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on standard robotics assumptions rather than new free parameters or invented entities; the main untested premise is the accuracy of off-the-shelf vision-language models for contact inference.

axioms (2)
  • domain assumption Inverse kinematics yields feasible hand poses from valid contact maps
    Used to generate supervision signals for policy learning
  • domain assumption Vision-language models can infer task-appropriate contact regions from images and language
    Foundation of the initial coarse contact proposals

pith-pipeline@v0.9.0 · 5597 in / 1332 out tokens · 39142 ms · 2026-05-14T18:36:22.143017+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

