Task-Aware Bimanual Affordance Prediction via VLM-Guided Semantic-Geometric Reasoning

Alap Kshirsagar; Fabian Hahne; Georgia Chalvatzaki; Jan Peters; Vignesh Prasad

arxiv: 2604.08726 · v1 · submitted 2026-04-09 · 💻 cs.RO

Pith reviewed 2026-05-10 16:55 UTC · model grok-4.3

keywords: bimanual manipulation, affordance prediction, vision-language models, task-oriented grasping, robotic manipulation, semantic-geometric fusion, dual-arm systems

The pith

A VLM-guided method jointly localizes task affordances and assigns arms to raise bimanual grasping success over geometry-only baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to establish that fusing geometric grasp generation with vision-language model queries for task-relevant regions and arm allocation produces higher real-world task completion rates than either geometric heuristics or coarse semantic segmentation. A reader would care because bimanual robots currently cannot decide which part of an object matters for a given goal, such as stabilizing one side while acting on the other. The system builds a unified 3D scene from multiple RGB-D views, proposes candidate grasps, then queries the VLM to filter those candidates by semantic fit and arm choice without any per-object retraining. If the claim holds, dual-arm systems become usable in homes or factories where task intent varies and objects are unfamiliar.

Core claim

The paper claims that its hierarchical pipeline, which first fuses multi-view RGB-D data into a consistent 3D representation and generates global 6-DoF grasp candidates, then applies VLM queries to select task-relevant affordance regions and perform arm allocation, yields consistently higher success rates than geometric and semantic baselines on nine real-world tasks across parallel manipulation, coordinated stabilization, tool use, and human handover.

What carries the argument

VLM queries that spatially and semantically filter 6-DoF grasp candidates for task-specific affordance regions and arm allocation.
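The filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the VLM-selected affordance region has already been lifted into an axis-aligned 3D box per object, and the names `Grasp`, `filter_grasps`, `region_boxes`, and `arm_for_object` are hypothetical.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Grasp:
    position: np.ndarray  # (3,) grasp centre in the world frame, metres
    score: float          # geometric quality score from the grasp generator
    object_id: str        # object the candidate belongs to


def filter_grasps(grasps, region_boxes, arm_for_object):
    """Keep the best-scoring grasp per arm that lies inside the selected region.

    region_boxes:   object_id -> (lower_corner, upper_corner), an assumed
                    box-shaped stand-in for the VLM's affordance region.
    arm_for_object: object_id -> "left" or "right", the assumed VLM arm label.
    """
    selected = {}
    for g in grasps:
        box = region_boxes.get(g.object_id)
        arm = arm_for_object.get(g.object_id)
        if box is None or arm is None:
            continue  # no semantic answer for this object: drop the candidate
        lower, upper = box
        inside = np.all(g.position >= lower) and np.all(g.position <= upper)
        if not inside:
            continue  # geometrically valid but outside the task-relevant region
        best = selected.get(arm)
        if best is None or g.score > best.score:
            selected[arm] = g
    return selected
```

Note that a high-scoring grasp on the wrong part of an object is rejected here, which is the behaviour the geometry-only baseline lacks.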

If this is right

Higher completion rates hold across the four task categories of parallel manipulation, coordinated stabilization, tool use, and handover.
The method generalizes to unseen object categories and task wordings because it relies on zero-shot VLM reasoning rather than learned per-class models.
Geometric validity is preserved while task semantics are enforced through the two-stage spatial-then-semantic filtering step.
The same pipeline supports reliable bimanual operation in unstructured scenes where task intent must be inferred from language.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same VLM filtering step could be inserted into existing single-arm grasp planners to add task awareness at low cost.
Replacing the current VLM with a larger or fine-tuned model would likely tighten the gap between predicted and executed arm allocations.
Running the method inside a closed-loop planner could let the robot recover from initial misallocations by re-querying the VLM after partial execution.

Load-bearing premise

A general-purpose vision-language model can reliably name the correct contact regions and assign the right arm for any described task and object category without category-specific training or fine-tuning.

What would settle it

A test set of new objects and task instructions on which the VLM repeatedly selects wrong affordance patches or arm assignments, producing lower success rates than a pure geometric baseline.

Figures

Figures reproduced from arXiv: 2604.08726 by Alap Kshirsagar, Fabian Hahne, Georgia Chalvatzaki, Jan Peters, Vignesh Prasad.

**Figure 2.** Overview of our proposed approach. Given a scene RGB image, a vision–language model (VLM) performs 2D object …

**Figure 4.** This figure presents an example of the cropped and …

**Figure 5.** Representative successful executions of our language-conditioned hierarchical grasp planning framework. The …

**Figure 5 (accompanying text, Section E: Overall Performance).** Our method achieves the highest mean strategy alignment rate (88.9%), substantially outperforming Geometry-Only (9.0%), Arm Only (45.6%), Region Only (55.6%), and VLPart (55.6%). The Geometry-Only baseline collapses across nearly all tasks, indicating that geometric feasibility alone is insufficient for strategy-consistent manipulation. In contrast, the proposed hierarchical framewo…

Original abstract

Bimanual manipulation requires reasoning about where to interact with an object and which arm should perform each action, a joint affordance localization and arm allocation problem that geometry-only planners cannot resolve without semantic understanding of task intent. Existing approaches either treat affordance prediction as coarse part segmentation or rely on geometric heuristics for arm assignment, failing to jointly reason about task-relevant contact regions and arm allocation. We reframe bimanual manipulation as a joint affordance localization and arm allocation problem and propose a hierarchical framework for task-aware bimanual affordance prediction that leverages a Vision-Language Model (VLM) to generalize across object categories and task descriptions without requiring category-specific training. Our approach fuses multi-view RGB-D observations into a consistent 3D scene representation and generates global 6-DoF grasp candidates, which are then spatially and semantically filtered by querying the VLM for task-relevant affordance regions on each object, as well as for arm allocation to the individual objects, thereby ensuring geometric validity while respecting task semantics. We evaluate our method on a dual-arm platform across nine real-world manipulation tasks spanning four categories: parallel manipulation, coordinated stabilization, tool use, and human handover. Our approach achieves consistently higher task success rates than geometric and semantic baselines for task-oriented grasping, demonstrating that explicit semantic reasoning over affordances and arm allocation helps enable reliable bimanual manipulation in unstructured environments.
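The multi-view fusion step mentioned in the abstract can be illustrated with a generic point-cloud merge. This is a sketch under stated assumptions, not the paper's pipeline: it assumes a pinhole camera model with known intrinsics `K` and known camera-to-world transforms, and the function names `backproject` and `fuse_views` are hypothetical.

```python
import numpy as np


def backproject(depth, K):
    """Convert a depth image (H, W) in metres to (N, 3) camera-frame points.

    Pixels with zero depth are treated as invalid and skipped.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column and row grids
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)


def fuse_views(depths, intrinsics, cam_to_world):
    """Merge several depth views into one world-frame point cloud.

    cam_to_world holds one 4x4 homogeneous transform per view (assumed known
    from calibration).
    """
    clouds = []
    for depth, K, T in zip(depths, intrinsics, cam_to_world):
        pts = backproject(depth, K)
        pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # homogeneous coords
        clouds.append((T @ pts_h.T).T[:, :3])
    return np.vstack(clouds)
```

A real system would typically also denoise and downsample the merged cloud before grasp generation; those steps are omitted here.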

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VLM-guided filtering for bimanual affordance and arm allocation works in practice but lacks the ablations to prove the semantic part is essential.

Your colleague should know that this work takes a VLM and applies it to filter 3D grasp candidates for both contact regions and arm selection in bimanual manipulation. The evaluation on real hardware across nine tasks shows better performance than the baselines they compare against, but the paper does not provide the details needed to confirm that the VLM is the key driver.

What is new is the specific way they combine semantic queries with geometric candidates in a single 3D scene representation. They start with multi-view RGB-D fusion, generate global 6-DoF grasps, then query the VLM twice: once for task-relevant affordance regions on objects and once for assigning which arm handles which object. This avoids category-specific training and aims to handle task intent directly. The real-robot experiments cover parallel actions, stabilization, tool use, and handovers, which is a reasonable test set for the claim.

The approach does well in being end-to-end practical. It respects geometric validity while adding semantics without retraining models for each object type. The hierarchical structure seems straightforward to implement on a dual-arm setup.

The main soft spot is the lack of supporting analysis for the central claim. Success rates are said to be higher, but without the actual percentages, variance, or statistical significance, it is difficult to judge the effect size. More importantly, there are no ablations that isolate the VLM's contribution. For instance, what happens if you skip the semantic queries and just use the geometric candidates with some heuristic for arm choice? Or how often does the VLM give wrong affordance or arm suggestions? Without those, the improvement could come from better scene representation or the filtering process itself rather than the semantic understanding. The assumption that the VLM consistently identifies correct regions and allocations across varied tasks is taken as given but not verified in the reported results.

This paper is aimed at robotics labs working on vision-language integration for manipulation. A reader who wants to see how VLMs can be plugged into existing grasp planners for bimanual cases will find a concrete example. It is not for someone seeking a theoretical advance or a fully solved method. I would send it to peer review. The topic matters for practical robots, the method is described clearly enough, and the real-world testing is a plus, even if the current evidence needs bolstering to make the attribution convincing.

Referee Report

3 major / 2 minor

Summary. The paper proposes a hierarchical framework for task-aware bimanual affordance prediction that fuses multi-view RGB-D data into a 3D scene representation, generates 6-DoF grasp candidates, and uses VLM queries to filter candidates by task-relevant affordance regions and arm allocation. It evaluates the method on a dual-arm robot across nine real-world tasks in four categories (parallel manipulation, coordinated stabilization, tool use, human handover) and claims consistently higher task success rates than geometric and semantic baselines, attributing the gains to explicit joint semantic-geometric reasoning without category-specific training.

Significance. If the central attribution holds, the work would demonstrate a practical way to combine pre-trained VLMs with geometric planning for reliable bimanual manipulation in unstructured environments, extending beyond part-segmentation or heuristic arm-assignment methods. The real-robot evaluation on diverse tasks and the parameter-free use of off-the-shelf VLMs are strengths that could influence downstream research in task-oriented grasping and human-robot handover.

major comments (3)

[§4] §4 (Experiments): The manuscript reports aggregate higher success rates but provides neither per-task numerical success rates with trial counts and variance, nor statistical significance tests against the geometric and semantic baselines, preventing assessment of whether the claimed improvement is robust or task-dependent.
[§4] §4.2 (Ablations, if present) or results discussion: No ablation disables the VLM queries while retaining the 3D fusion, grasp-candidate generation, and hierarchical filtering pipeline; without this, it is impossible to isolate whether performance gains derive from semantic reasoning or from the geometric components alone, directly undermining the central claim that 'explicit semantic reasoning over affordances and arm allocation' is decisive.
[§4] §4.1 (Evaluation protocol): The paper supplies no quantitative VLM correctness metrics (e.g., precision/recall of affordance-region identification or arm-allocation accuracy per query) and no failure-mode breakdown separating VLM errors from geometric or execution failures, leaving the weakest assumption—that the VLM reliably identifies task-relevant regions and allocations across categories—unverified.

minor comments (2)

[Abstract] The abstract states 'consistently higher task success rates' without any numbers; the results section should include a summary table with exact percentages, number of trials, and baseline definitions for immediate readability.
[§3] Notation for the VLM query outputs (affordance masks and arm labels) should be formalized with a clear equation or pseudocode in §3 to avoid ambiguity when describing the spatial-semantic filtering step.
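
As an illustration of the kind of formalization the last comment asks for, one possible notation is sketched below. Every symbol is an assumption introduced for this note and none is taken from the paper: $I$ is the scene image, $\ell$ the task instruction, $o_k$ the $k$-th object, $\mathcal{G}$ the set of 6-DoF grasp candidates, $p(g)$ the position of grasp $g$, $s(g)$ its geometric score, and $R_k$ the affordance mask lifted into 3D.

```latex
% Illustrative notation only; none of these symbols come from the paper.
\begin{align}
  (R_k, a_k) &= \mathrm{VLM}(I, \ell, o_k),
      \qquad R_k \subset \mathbb{R}^3,\quad a_k \in \{\mathrm{left}, \mathrm{right}\} \\
  \mathcal{G}_k &= \{\, g \in \mathcal{G} : p(g) \in R_k \,\} \\
  g_k^{*} &= \arg\max_{g \in \mathcal{G}_k} s(g)
\end{align}
```

Under this reading, the arm label $a_k$ and the selected grasp $g_k^{*}$ together define what the arm assigned to object $o_k$ executes.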

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the experimental evaluation. We will revise the manuscript to address each of the points raised, providing additional quantitative details, a targeted ablation, and VLM performance analysis to strengthen the claims.

Referee: [§4] §4 (Experiments): The manuscript reports aggregate higher success rates but provides neither per-task numerical success rates with trial counts and variance, nor statistical significance tests against the geometric and semantic baselines, preventing assessment of whether the claimed improvement is robust or task-dependent.

Authors: We agree that aggregate results alone limit assessment of robustness and task dependence. In the revised manuscript, we will expand the results section with a detailed table reporting per-task success rates for our method and both baselines across all nine tasks. Each entry will include the number of trials performed (10 trials per task per method), standard deviations, and statistical significance via the McNemar test for paired binary outcomes, allowing clear evaluation of whether improvements are consistent or vary by task category.

revision: yes

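
The McNemar test named in this response can be sketched in its exact (binomial) form, which suits small samples such as 10 trials per task. This is a generic illustration, not the authors' analysis code; it assumes trials are genuinely paired (the same scene setup run with both methods), and the function name `mcnemar_exact` is hypothetical.

```python
from math import comb


def mcnemar_exact(ours, baseline):
    """Two-sided exact McNemar p-value for paired binary outcomes (1 = success).

    Only discordant pairs carry information: b counts trials where our method
    succeeded and the baseline failed, c counts the reverse.
    """
    b = sum(1 for o, s in zip(ours, baseline) if o and not s)
    c = sum(1 for o, s in zip(ours, baseline) if s and not o)
    n = b + c
    if n == 0:
        return 1.0  # the methods never disagree, so there is no evidence
    k = min(b, c)
    # Binomial(n, 0.5) probability of a split at least as lopsided as observed.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With at most 10 discordant pairs per task, the smallest attainable two-sided p-value is 2/1024, so per-task significance is only reachable when the methods disagree on most trials.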
Referee: [§4] §4.2 (Ablations, if present) or results discussion: No ablation disables the VLM queries while retaining the 3D fusion, grasp-candidate generation, and hierarchical filtering pipeline; without this, it is impossible to isolate whether performance gains derive from semantic reasoning or from the geometric components alone, directly undermining the central claim that 'explicit semantic reasoning over affordances and arm allocation' is decisive.

Authors: We acknowledge that isolating the VLM contribution requires a specific ablation. While the existing geometric baseline removes semantic filtering and the semantic baseline applies VLM differently, we will add a new ablation in the revised paper. This ablation retains the full 3D fusion, 6-DoF grasp generation, and hierarchical pipeline but replaces VLM-based affordance region and arm-allocation queries with geometric heuristics (e.g., nearest-point selection and fixed left/right assignment). Comparative results will be reported to directly quantify the benefit of the semantic components.

revision: yes

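
One way to read the proposed heuristic ablation is sketched below. The response does not define either heuristic, so both are assumptions here: "nearest-point selection" is taken as the grasp closest to the object centroid, and "fixed left/right assignment" as choosing the arm by the sign of the centroid's y-coordinate in a base frame with +y pointing left (the ROS REP 103 convention). The name `heuristic_plan` is hypothetical.

```python
import numpy as np


def heuristic_plan(grasp_positions, object_ids, centroids):
    """Semantics-free stand-in for the VLM queries.

    grasp_positions: (N, 3) array of candidate grasp centres.
    object_ids:      length-N list naming the object of each candidate.
    centroids:       object_id -> (3,) centroid in the robot base frame.
    Returns object_id -> (arm, index of the chosen grasp).
    """
    plan = {}
    for obj, centroid in centroids.items():
        idx = [i for i, o in enumerate(object_ids) if o == obj]
        if not idx:
            continue  # no geometric candidates for this object
        dists = np.linalg.norm(grasp_positions[idx] - centroid, axis=1)
        best = idx[int(np.argmin(dists))]
        # Assumed rule: objects on the robot's left go to the left arm.
        arm = "left" if centroid[1] > 0 else "right"
        plan[obj] = (arm, best)
    return plan
```

Comparing this plan against the VLM-guided one on the same candidates would isolate the semantic contribution, since fusion and grasp generation are shared.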
Referee: [§4] §4.1 (Evaluation protocol): The paper supplies no quantitative VLM correctness metrics (e.g., precision/recall of affordance-region identification or arm-allocation accuracy per query) and no failure-mode breakdown separating VLM errors from geometric or execution failures, leaving the weakest assumption—that the VLM reliably identifies task-relevant regions and allocations across categories—unverified.

Authors: We agree that direct verification of VLM reliability is important. In the revision, we will add quantitative VLM metrics by manually annotating a subset of queries (approximately 50 across the task categories) to compute precision and recall for affordance-region identification as well as accuracy for arm-allocation decisions. We will also include a failure-mode breakdown in the results, classifying unsuccessful trials into VLM misidentification, geometric planning errors, and execution failures based on post-hoc video and log analysis. This will provide evidence supporting the assumption across categories.

revision: yes
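
The VLM correctness metrics promised here could be computed along the following lines. This is a sketch under assumptions: affordance regions are compared as binary pixel masks against manual annotations (the response does not say whether scoring is per pixel or per query), and the function names are hypothetical.

```python
import numpy as np


def mask_precision_recall(pred, truth):
    """Pixel-wise precision and recall of a predicted affordance mask.

    Both inputs are array-likes of 0/1 values with the same shape.
    An empty prediction or empty annotation yields 0.0 for that metric.
    """
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.logical_and(pred, truth).sum()
    precision = tp / pred.sum() if pred.sum() else 0.0
    recall = tp / truth.sum() if truth.sum() else 0.0
    return float(precision), float(recall)


def arm_accuracy(pred_arms, true_arms):
    """Fraction of queries where the predicted arm matches the annotation."""
    if not true_arms:
        raise ValueError("no annotated queries")
    hits = sum(p == t for p, t in zip(pred_arms, true_arms))
    return hits / len(true_arms)
```

Reporting these per task category, rather than pooled, would show whether VLM errors concentrate in particular kinds of tasks.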

Circularity Check

0 steps flagged

No circularity; pipeline uses external VLM and geometric modules with empirical validation

The paper describes a hierarchical framework that fuses multi-view RGB-D data into a 3D scene, generates 6-DoF grasp candidates geometrically, then applies VLM queries for task-relevant affordance regions and arm allocation as independent filtering steps. The central claim of higher task success rates rests on real-world experiments across nine tasks and comparisons to geometric/semantic baselines, without any self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations that reduce the derivation to its inputs. The method treats VLM outputs and geometric processing as external, falsifiable components whose correctness is evaluated via aggregate success metrics rather than by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the premise that an off-the-shelf VLM supplies reliable task semantics for both localization and allocation without training; this is an unverified domain assumption rather than a derived result.

axioms (1)

domain assumption: Vision-language models can reliably identify task-relevant affordance regions and perform arm allocation from natural language task descriptions across unseen object categories.
The hierarchical filtering step depends entirely on the VLM returning accurate and geometrically consistent answers without category-specific fine-tuning.

