GROW²: Grounding Which and Where for Robot Tool Use
Pith reviewed 2026-06-30 04:51 UTC · model grok-4.3
The pith
GROW² splits affordance grounding into semantic and geometric stages via object parts so robots can select and locate action regions on open-category tools.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GROW² grounds which object to use and where to act by hierarchically splitting the process with object parts as the connecting abstraction: vision-language models perform semantic grounding by parsing the task instruction, selecting an open-category tool object, and naming task-relevant parts on both tool and target, while vision foundation models then perform geometric grounding by mapping those parts to precise 3D regions in an RGB-D image. The method outperforms state-of-the-art baselines on affordance prediction benchmarks, achieves zero-shot generalization over open-category objects, and records higher success rates than baselines in both simulated and real-world robot tool-use experime
What carries the argument
Object parts as a natural abstraction that splits the grounding process hierarchically into semantic (VLM) and geometric (vision foundation model) levels.
If this is right
- Robots gain the ability to improvise tools from open-category objects without collecting new task-specific training data.
- Affordance prediction accuracy exceeds existing state-of-the-art methods on established benchmarks.
- Zero-shot transfer succeeds to object categories never seen during any training phase.
- Higher task success rates appear in both simulated environments and physical robot experiments.
Where Pith is reading between the lines
- The part-based split could be combined with existing motion planners to handle sequences of improvised tool uses in changing scenes.
- Extending the same abstraction to multiple interacting objects might support more complex household tasks.
- Lower data requirements could make deployment feasible in new environments where collecting robot demonstrations is costly.
Load-bearing premise
Vision-language models possess reliable commonsense reasoning to correctly parse natural-language instructions, select suitable open-category objects as tools, and identify task-relevant parts without systematic errors or hallucinations.
What would settle it
A trial in which the VLM repeatedly selects an unsuitable object or names incorrect parts for a given instruction, causing the robot's success rate on tool-use tasks to fall below the reported baselines.
Figures
read the original abstract
Can the robot use a plate to cut a cake if no knife is available? Tool use greatly expands robot capabilities, but to use tools creatively beyond their intended functions, the robot faces the challenge of $\textit{open-world affordance grounding}$: select an open-category object to act as a tool and localize its specific region of action. To this end, we introduce GROW$^2$ (GROunding Which and Where), which leverages object parts as a natural abstraction to split the grounding process hierarchically into semantic and geometric levels, thus bypassing the need for data-heavy, end-to-end training. Semantically, GROW$^2$ harnesses the commonsense reasoning of Vision-Language Models (VLMs) to parse a natural-language task instruction, select a suitable object as the tool, and identify task-relevant parts on the tool and the target object. Geometrically, vision foundation models then ground the selected parts into precise 3D regions from a single RGB-D image. Experiments on established benchmarks show that GROW$^2$ outperforms state-of-the-art baselines on affordance prediction benchmarks. Further, it achieves zero-shot generalization over open-category objects and outperforms baselines in both simulated and real-world robot tool use experiments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GROW², a hierarchical method for open-world affordance grounding in robot tool use. It decomposes the task into a semantic stage that uses VLMs to parse natural-language instructions, select open-category tools, and identify task-relevant parts on the tool and target, followed by a geometric stage that applies vision foundation models to localize those parts in 3D from a single RGB-D image. The work claims outperformance over state-of-the-art baselines on affordance prediction benchmarks, zero-shot generalization to open-category objects, and superior results in both simulated and real-world robot tool-use experiments.
Significance. If the empirical claims hold after verification, the approach could meaningfully advance flexible robot tool use by avoiding data-intensive end-to-end training and instead composing existing foundation models; the part-based abstraction offers a clean separation of semantic and geometric reasoning that may generalize beyond the evaluated tasks.
major comments (2)
- [Experiments] The central claims of benchmark outperformance, zero-shot generalization, and robot experiment gains all rest on the assumption that the VLM stage reliably selects tools and labels task-relevant parts without systematic hallucinations or mis-selections; however, the manuscript provides no quantitative isolation of VLM error rates, failure-mode analysis, or robustness tests across instruction diversity (see Experiments section and the method description of the semantic stage).
- [Abstract and §4] The abstract and method overview assert superiority over baselines and zero-shot results, yet the manuscript supplies no detailed baseline implementations, dataset statistics, quantitative metrics (e.g., success rates, IoU scores), or error breakdowns to allow verification of these claims against the data.
minor comments (2)
- [Method] Clarify the exact VLM and vision foundation model versions used, including any prompting strategies or post-processing steps, to improve reproducibility.
- [Discussion] Add explicit discussion of failure cases where the VLM selects an inappropriate tool or part, even if only qualitative.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. The comments highlight important areas where additional empirical detail can strengthen the presentation of our results. We address each major comment below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Experiments] The central claims of benchmark outperformance, zero-shot generalization, and robot experiment gains all rest on the assumption that the VLM stage reliably selects tools and labels task-relevant parts without systematic hallucinations or mis-selections; however, the manuscript provides no quantitative isolation of VLM error rates, failure-mode analysis, or robustness tests across instruction diversity (see Experiments section and the method description of the semantic stage).
Authors: We agree that isolating the contribution and reliability of the VLM-based semantic stage is valuable for supporting the central claims. The current manuscript reports end-to-end results but does not provide a dedicated quantitative breakdown of VLM error rates or failure modes. In the revised version we will add a new analysis subsection (and corresponding table) that measures tool-selection accuracy, part-labeling precision/recall, and hallucination rates on the evaluation instructions. We will also include robustness tests across paraphrased instructions and report common failure categories. These additions will be placed in the Experiments section and will directly address the concern. revision: yes
-
Referee: [Abstract and §4] The abstract and method overview assert superiority over baselines and zero-shot results, yet the manuscript supplies no detailed baseline implementations, dataset statistics, quantitative metrics (e.g., success rates, IoU scores), or error breakdowns to allow verification of these claims against the data.
Authors: The manuscript does contain quantitative results and some baseline descriptions in §4, yet we acknowledge that the level of detail on implementation choices, dataset statistics, full metric tables (including IoU and success rates), and error breakdowns is limited. In the revision we will expand §4 with explicit baseline implementation details (hyperparameters, prompting, and code references), complete dataset statistics, additional quantitative metrics, and per-category error breakdowns. A concise summary of the key metrics will also be added to the abstract. These changes will make the empirical claims verifiable without altering the reported outcomes. revision: yes
Circularity Check
No circularity: method is compositional over external pre-trained models
full rationale
The paper describes GROW² as a hierarchical pipeline that applies off-the-shelf VLMs for instruction parsing, tool selection, and part identification, then applies separate vision foundation models for 3D localization; no equations, parameter fitting to the paper's own outputs, self-citations, or internal derivations are present that would reduce any claimed result to its inputs by construction. Performance claims rest on empirical comparison against baselines rather than any self-referential construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Vision-language models can reliably perform commonsense reasoning for object and part selection from natural language instructions.
Reference graph
Works this paper leans on
-
[1]
Fitzgerald, A
T. Fitzgerald, A. Goel, and A. Thomaz. Modeling and learning constraints for creative tool use.Frontiers in Robotics and AI, 8:674292, 2021
2021
-
[2]
C. P. Van Schaik and G. R. Pradhan. A model for tool-use traditions in primates: implications for the coevolution of culture and cognition.Journal of Human Evolution, 44(6):645–664, 2003
2003
-
[3]
A. Xie, F. Ebert, S. Levine, and C. Finn. Improvisation through physical understanding: Using novel objects as tools with visual foresight. InProceedings of Robotics: Science and Systems, 2019
2019
-
[4]
K. Fang, Y . Zhu, A. Garg, A. Kurenkov, V . Mehta, L. Fei-Fei, and S. Savarese. Learning task- oriented grasping for tool manipulation from simulated self-supervision.The International Journal of Robotics Research, 39(2-3):202–216, 2020
2020
-
[5]
Holladay, T
R. Holladay, T. Lozano-P ´erez, and A. Rodriguez. Force-and-motion constrained planning for tool use. InIEEE/RSJ International Conference on Intelligent Robots and Systems, 2019
2019
-
[6]
Z. Xu, Z. Xian, X. Lin, C. Chi, Z. Huang, C. Gan, and S. Song. Roboninja: Learning an adaptive cutting policy for multi-material objects. InProceedings of Robotics: Science and Systems, 2023
2023
-
[7]
C. G. Jensen, W. E. Red, and J. Pi. Tool selection for five-axis curvature matched machining. Computer-Aided Design, 34(3):251–266, 2002
2002
-
[8]
H. Shi, H. Xu, S. Clarke, Y . Li, and J. Wu. Robocook: Long-horizon elasto-plastic object manipulation with diverse tools. InConference on Robot Learning, 2023
2023
-
[9]
C. Tang, A. Xiao, Y . Deng, T. Hu, W. Dong, H. Zhang, D. Hsu, and H. Zhang. Mimicfunc: Imitating tool manipulation from a single human video via functional correspondence. In Conference on Robot Learning, 2025
2025
-
[10]
Huang, I
S. Huang, I. Ponomarenko, Z. Jiang, X. Li, X. Hu, P. Gao, H. Li, and H. Dong. Manipvqa: Injecting robotic affordance and physically grounded information into multi-modal large lan- guage models. InIEEE/RSJ International Conference on Intelligent Robots and Systems, 2024
2024
-
[11]
Y . Li, N. Zhao, J. Xiao, C. Feng, X. Wang, and T.-s. Chua. Laso: Language-guided affordance segmentation on 3d object. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
2024
-
[12]
H. Luo, W. Zhai, J. Zhang, Y . Cao, and D. Tao. Learning affordance grounding from exo- centric images. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
2022
-
[13]
Huang, C
W. Huang, C. Wang, R. Zhang, Y . Li, J. Wu, and L. Fei-Fei. V oxposer: Composable 3d value maps for robotic manipulation with language models. InConference on Robot Learning, 2023
2023
-
[14]
W. Shen, G. Yang, A. Yu, J. Wong, L. P. Kaelbling, and P. Isola. Distilled feature fields enable few-shot language-guided manipulation. InConference on Robot Learning, 2023
2023
-
[15]
Y . Wang, M. Zhang, Z. Li, T. Kelestemur, K. Driggs-Campbell, J. Wu, L. Fei-Fei, and Y . Li. D3 fields: Dynamic 3d descriptor fields for zero-shot generalizable rearrangement. InConference on Robot Learning, 2024
2024
-
[16]
S. Bahl, R. Mendonca, L. Chen, U. Jain, and D. Pathak. Affordances from human videos as a versatile representation for robotics. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. 9
2023
-
[17]
S ¸ahin, M
E. S ¸ahin, M. Cakmak, M. R. Do ˘gar, E. U ˘gur, and G. ¨Uc ¸oluk. To afford or not to afford: A new formalization of affordances toward affordance-based robot control.Adaptive Behavior, 15(4):447–472, 2007
2007
-
[18]
W. Yuan, J. Duan, V . Blukis, W. Pumacay, R. Krishna, A. Murali, A. Mousavian, and D. Fox. Robopoint: A vision-language model for spatial affordance prediction for robotics. InConfer- ence on Robot Learning, 2024
2024
-
[19]
Deng and D
Y . Deng and D. Hsu. General-purpose clothes manipulation with semantic keypoints. InIEEE International Conference on Robotics and Automation, 2025
2025
-
[20]
J. Gao, Z. Tao, N. Jaquier, and T. Asfour. K-vil: Keypoints-based visual imitation learning. IEEE Transactions on Robotics, 39(5):3888–3908, 2023
2023
-
[21]
Manuelli, Y
L. Manuelli, Y . Li, P. Florence, and R. Tedrake. Keypoints into the future: Self-supervised correspondence in model-based reinforcement learning. InConference on Robot Learning, 2021
2021
-
[22]
Turpin, L
D. Turpin, L. Wang, S. Tsogkas, S. Dickinson, and A. Garg. Gift: Generalizable interaction- aware functional tool affordances without labels. InProceedings of Robotics: Science and Systems, 2021
2021
-
[23]
Y . Ju, K. Hu, G. Zhang, G. Zhang, M. Jiang, and H. Xu. Robo-abc: Affordance general- ization beyond categories via semantic correspondence for robot manipulation. InEuropean Conference on Computer Vision, 2024
2024
-
[24]
G. Li, D. Sun, L. Sevilla-Lara, and V . Jampani. One-shot open affordance learning with foun- dation models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
2024
-
[25]
Stojanov, L
S. Stojanov, L. Zhao, Y . Zhang, D. L. Yamins, and J. Wu. Weakly-supervised learning of dense functional correspondences. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2025
2025
-
[26]
Y . Tang, W. Huang, Y . Wang, C. Li, R. Yuan, R. Zhang, J. Wu, and L. Fei-Fei. Uad: Unsuper- vised affordance distillation for generalization in robotic manipulation. InIEEE International Conference on Robotics and Automation, 2025
2025
-
[27]
D. Lu, L. Kong, T. Huang, and G. H. Lee. Geal: Generalizable 3d affordance learning with cross-modal consistency. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025
2025
-
[28]
J. Zhu, Y . Ju, J. Zhang, M. Wang, Z. Yuan, K. Hu, and H. Xu. Densematcher: Learning 3d semantic correspondence for category-level manipulation from a single demo. InInternational Conference on Learning Representations, 2025
2025
-
[29]
A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation
J. Barreiros, A. Beaulieu, A. Bhat, R. Cory, E. Cousineau, H. Dai, C.-H. Fang, K. Hashimoto, M. Z. Irshad, M. Itkina, et al. A careful examination of large behavior models for multitask dexterous manipulation.arXiv preprint arXiv:2507.05331, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[30]
Pfeiffer, A
K. Pfeiffer, A. Escande, and A. Kheddar. Nut fastening with a humanoid robot. InIEEE/RSJ International Conference on Intelligent Robots and Systems, 2017
2017
-
[31]
C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025
2025
-
[32]
L. Ke, J. Wang, T. Bhattacharjee, B. Boots, and S. Srinivasa. Grasping with chopsticks: Com- bating covariate shift in model-free imitation learning for fine manipulation. InIEEE Interna- tional Conference on Robotics and Automation, 2021. 10
2021
-
[33]
Black, N
K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, M. Y . Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. Vuong, H. Walke...
2025
-
[34]
A. Z. Ren, B. Govil, T.-Y . Yang, K. R. Narasimhan, and A. Majumdar. Leveraging language for accelerated learning of tool manipulation. InConference on Robot Learning, 2022
2022
-
[35]
Seita, Y
D. Seita, Y . Wang, S. J. Shetty, E. Y . Li, Z. Erickson, and D. Held. Toolflownet: Robotic manipulation with tools via predicting tool flow from point clouds. InConference on Robot Learning, 2022
2022
- [36]
-
[37]
Asada and M
H. Asada and M. Brady. The curvature primal sketch.IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(1):2–14, 1986
1986
-
[38]
M. A. Toussaint, K. R. Allen, K. A. Smith, and J. B. Tenenbaum. Differentiable physics and stable modes for tool-use and manipulation planning. InProceedings of Robotics: Science and Systems, 2018
2018
-
[39]
Saito, K
N. Saito, K. Kim, S. Murata, T. Ogata, and S. Sugano. Tool-use model considering tool se- lection by a robot using deep learning. InIEEE-RAS International Conference on Humanoid Robots, 2018
2018
-
[40]
J. J. Gibson. The theory of affordances.Hilldale, USA, 1(2):67–82, 1977
1977
-
[41]
SAM 3: Segment Anything with Concepts
N. Carion, L. Gustafson, Y .-T. Hu, S. Debnath, R. Hu, D. Suris, C. Ryali, K. V . Alwala, H. Khedr, A. Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[42]
SAM 3D: 3Dfy Anything in Images
X. Chen, F.-J. Chu, P. Gleize, K. J. Liang, A. Sax, H. Tang, W. Wang, M. Guo, T. Hardin, X. Li, et al. Sam 3d: 3dfy anything in images.arXiv preprint arXiv:2511.16624, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[43]
S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, W. Ge, Z. Guo, Q. Huang, J. Huang, F. Huang, B. Hui, S. Jiang, Z. Li, M. Li, M. Li, K. Li, Z. Lin, J. Lin, X. Liu, J. Liu, C. Liu, Y . Liu, D. Liu, S. Liu, D. Lu, R. Luo, C. Lv, R. Men, L. Meng, X. Ren, X. Ren, S. Song, Y . Sun, J. Tang, J. Tu, J. Wan, P. Wang, P. Wang,...
Pith/arXiv arXiv 2025
-
[44]
Ester, H.-P
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. InKDD, 1996
1996
-
[45]
G. Li, V . Jampani, D. Sun, and L. Sevilla-Lara. Locate: Localize and transfer object parts for weakly supervised affordance grounding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023
2023
-
[46]
Qian and D
S. Qian and D. F. Fouhey. Understanding 3d object interaction from a single image. InPro- ceedings of the IEEE/CVF International Conference on Computer Vision, 2023
2023
-
[47]
S. Qian, W. Chen, M. Bai, X. Zhou, Z. Tu, and L. E. Li. Affordancellm: Grounding affordance from vision language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. 11
2024
-
[48]
D. Xu, D. Anguelov, and A. Jain. Pointfusion: Deep sensor fusion for 3d bounding box estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018
2018
-
[49]
Aiello, D
E. Aiello, D. Valsesia, and E. Magli. Cross-modal learning for image-guided point cloud shape completion. InAdvances in Neural Information Processing Systems, 2022
2022
-
[50]
Y . Yang, W. Zhai, H. Luo, Y . Cao, J. Luo, and Z.-J. Zha. Grounding 3d object affordance from 2d interactions in images. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2023
2023
-
[51]
O. Sim ´eoni, H. V . V o, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V . Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[52]
Zhang, C
J. Zhang, C. Herrmann, J. Hur, E. Chen, V . Jampani, D. Sun, and M.-H. Yang. Telling left from right: Identifying geometry-aware semantic correspondence. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
2024
-
[53]
F. Liu, K. Fang, P. Abbeel, and S. Levine. Moka: Open-world robotic manipulation through mark-based visual prompting. InProceedings of Robotics: Science and Systems, 2024
2024
-
[54]
Besl and N
P. Besl and N. D. McKay. A method for registration of 3-d shapes.IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):239–256, 1992
1992
-
[55]
Xiang, Y
F. Xiang, Y . Qin, K. Mo, Y . Xia, H. Zhu, F. Liu, M. Liu, H. Jiang, Y . Yuan, H. Wang, L. Yi, A. X. Chang, L. J. Guibas, and H. Su. SAPIEN: A simulated part-based interactive environ- ment. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, 2020
2020
-
[56]
Sundermeyer, A
M. Sundermeyer, A. Mousavian, R. Triebel, and D. Fox. Contact-graspnet: Efficient 6-dof grasp generation in cluttered scenes. InIEEE International Conference on Robotics and Au- tomation, 2021
2021
-
[57]
Gpt-5.2: Enhanced reasoning and general-purpose foundation model, 2025
OpenAI. Gpt-5.2: Enhanced reasoning and general-purpose foundation model, 2025. URL https://openai.com/index/introducing-gpt-5-2/
2025
-
[58]
Flux.1-dev-controlnet-depth.https://huggingface.co/ Shakker-Labs/FLUX.1-dev-ControlNet-Depth, 2024
Shakker Labs and InstantX Team. Flux.1-dev-controlnet-depth.https://huggingface.co/ Shakker-Labs/FLUX.1-dev-ControlNet-Depth, 2024. Hugging Face model repository
2024
-
[59]
Gemini 3 pro, 2026
Google. Gemini 3 pro, 2026. URLhttps://deepmind.google/models/gemini/
2026
-
[60]
Nano banana pro, 2026
Google. Nano banana pro, 2026. URLhttps://deepmind.google/models/ gemini-image/
2026
-
[61]
Rombach, A
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image syn- thesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
2022
-
[62]
DINOv2: Learning Robust Visual Features without Supervision
M. Oquab, T. Darcet, T. Moutakanni, H. V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. Haz- iza, F. Massa, A. El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[63]
Dekel, S
T. Dekel, S. Oron, M. Rubinstein, S. Avidan, and W. T. Freeman. Best-buddies similarity for robust template matching. InProceedings of the IEEE/CVF Computer Vision and Pattern Recognition Conference, 2015
2015
-
[64]
Fang, B.-R
X. Fang, B.-R. Huang, J. Mao, J. Shone, J. B. Tenenbaum, T. Lozano-P ´erez, and L. P. Kael- bling. Kalm: Keypoint abstraction using large models for object-relative imitation learning. In IEEE International Conference on Robotics and Automation, 2025. 12
2025
-
[65]
O. Y . Lee, A. Xie, K. Fang, K. Pertsch, and C. Finn. Affordance-guided reinforcement learn- ing via visual prompting. InIEEE/RSJ International Conference on Intelligent Robots and Systems, 2025
2025
-
[66]
W. Liu, J. Mao, J. Hsu, T. Hermans, A. Garg, and J. Wu. Composable part-based manipulation. InConference on Robot Learning, 2023
2023
-
[67]
G. Yin, Y . Li, Y . Wang, D. McConachie, P. Shah, K. Hashimoto, H. Zhang, K. Liu, and Y . Li. Codediffuser: Attention-enhanced diffusion policy via vlm-generated code for instruction am- biguity. InProceedings of Robotics: Science and Systems, 2025
2025
-
[68]
Gao and R
W. Gao and R. Tedrake. kpam-sc: Generalizable manipulation planning using keypoint affor- dance and shape completion. InIEEE International Conference on Robotics and Automation, 2021
2021
-
[69]
Y . Liu, J. Mao, J. B. Tenenbaum, T. Lozano-P´erez, and L. P. Kaelbling. One-shot manipulation strategy learning by making contact analogies. InIEEE International Conference on Robotics and Automation, 2025
2025
-
[70]
Manuelli, W
L. Manuelli, W. Gao, P. Florence, and R. Tedrake. kpam: Keypoint affordances for category- level robotic manipulation. InThe International Symposium of Robotics Research, 2019
2019
-
[71]
Z. Qin, K. Fang, Y . Zhu, L. Fei-Fei, and S. Savarese. Keto: Learning keypoint representations for tool manipulation. InIEEE International Conference on Robotics and Automation, 2020
2020
-
[72]
S. M. LaValle and J. J. Kuffner Jr. Randomized kinodynamic planning.The International Journal of Robotics Research, 20(5):378–400, 2001
2001
-
[73]
objects on table
R. B. Rusu, N. Blodow, and M. Beetz. Fast point feature histograms (fpfh) for 3d registration. InIEEE International Conference on Robotics and Automation, 2009. 13 Appendices A Prompts 15 B Details of Affordance Prediction Experiments 16 B.1 AGD20K . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 B.1.1 Baselines . . . . . ....
2009
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.