AFUN: Towards an Affordance Foundation Model for Functionality Understanding

Henrik I. Christensen; Jiawei Fu; Jun Gao; Yi Zhong; Zhaoning Wang

arxiv: 2606.02551 · v1 · pith:QJLIEIG7new · submitted 2026-06-01 · 💻 cs.RO · cs.CV

AFUN: Towards an Affordance Foundation Model for Functionality Understanding

Zhaoning Wang , Yi Zhong , Jiawei Fu , Henrik I. Christensen , Jun Gao This is my paper

Pith reviewed 2026-06-28 14:11 UTC · model grok-4.3

classification 💻 cs.RO cs.CV

keywords affordance understandingfunctional mask3D motion predictionrobot manipulationRGB-Dlanguage-conditionedfoundation modelopen-world generalization

0 comments

The pith

AFUN predicts task-conditional functional masks and 3D post-contact motion curves from one RGB-D view plus language.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that a single model can determine both where and how a robot should interact with objects to complete a described task, using only one RGB-D image and a language instruction. It supports this by training on a newly unified dataset drawn from robot demonstrations, human videos, simulations, and real scans, all converted to consistent language, mask, and 3D motion labels. If correct, the approach would let the same model work across many objects, environments, and tasks without retraining for each robot body or scenario. The reported gains appear in three evaluations: large improvements in segmenting task-relevant regions, higher accuracy on contact points, and better 3D motion curves on held-out sets. The model is shown running directly on physical robots for manipulation without further tuning or task-specific rules.

Core claim

From a single RGB-D observation and a language task description, AFUN outputs a task-conditional functional mask that marks where interaction should occur and an object-centric 3D post-contact motion curve that specifies how the interaction should proceed. The model is trained on a large-scale data pipeline that converts robot, human, simulation, and scan data into one shared schema of language labels, masks, and 3D motion annotations. On affordance segmentation across eight test sets the model improves mean gIoU and cIoU by 23.9 and 26.3 points over prior methods; it also records higher contact-point hit rates and superior 3D motion accuracy on three separate test collections. The same weig

What carries the argument

Task-conditional functional mask paired with object-centric 3D post-contact motion curve, produced by a model trained on a unified affordance schema that merges heterogeneous data sources.

If this is right

A single set of weights handles affordance segmentation, contact prediction, and motion forecasting across multiple benchmarks.
Real-robot manipulation succeeds without per-embodiment retraining or task-specific rules.
Open-world tasks become feasible because the same model adapts to varied objects and instructions from the unified schema.
Contact-point accuracy improves by 12.7 to 61.3 percent over the strongest baseline on the evaluated sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The unified schema could be reused by other researchers to train models that combine affordance output with higher-level planners.
If the motion curves are accurate enough, they might serve as direct inputs to low-level controllers without intermediate trajectory optimization.
Extending the pipeline to include more sensor types, such as tactile data, would test whether the same schema remains stable.
The separation of mask and motion outputs might allow independent debugging or replacement of either component in future systems.

Load-bearing premise

Heterogeneous robot, human, simulation, and scan data can be converted into one consistent set of language, mask, and 3D motion labels that supports generalization without embodiment-specific fine-tuning.

What would settle it

The model fails to produce usable masks or motion curves on a new object category or robot embodiment without any additional training data or fine-tuning.

Figures

Figures reproduced from arXiv: 2606.02551 by Henrik I. Christensen, Jiawei Fu, Jun Gao, Yi Zhong, Zhaoning Wang.

**Figure 1.** Figure 1: Overview of AFUN. (a) We first build a data pipeline to gather a large-scale diverse dataset for affordance understanding. (b) With such a dataset, we then train AFUN to predict a task-conditional functional segmentation mask and a 3D motion trajectory, conditioned on an RGBD observation and a language task phrase. (c) AFUN can generalize to open-world images for functionality understanding and (d) is dir… view at source ↗

**Figure 2.** Figure 2: Unified data collection pipeline. We first aggregate data from various sources into a unified format (gray), then use Qwen3-VL [3] and SAM3 [50] to generate a functional affordance mask and a 2D tracking (green), and finally back-project them to obtain a 3D trajectory, which can be fit to a Bézier spline curve (blue). This scalable pipeline yields standardized, high-fidelity annotations with RGB-D observat… view at source ↗

**Figure 3.** Figure 3: Object trajectory vs. gripper heuristics. Prior datasets often use hand or gripper trajectories as motion heuristics, but these can involve unwanted pre-contact motion (right). We track the object motion itself, which is more straightforward (left). Annotating Object Tracks and Masks. Prior works often use hand or gripper trajectories as the motion signal for affordance. However, this can entangle the unde… view at source ↗

**Figure 4.** Figure 4: AFUN architecture. Starting from an RGB-D input and a task prompt, a frozen Qwen VLM encodes the language instruction into semantic tokens and motion tokens, and a 3D encoder converts the depth observation into geometric features. With the language information encoded, the segmentation model generates the affordance segmentation mask from the RGB, while the motion decoder takes 3D features, task-conditione… view at source ↗

**Figure 5.** Figure 5: Qualitative Examples on Affordance Segmentation. AFUN accurately segments taskspecific affordance regions, including complex scissor handles with holes and intent-dependent regions such as shovel blades for containing or hammer handles for using. We provide more examples in Appendix B.3. In practice, we find this point-sampling supervision substantially more effective than directly regressing the locatio… view at source ↗

**Figure 6.** Figure 6: Qualitative motion prediction results. AFUN accurately localizes the actionable object region and predicts smooth, task-aligned 3D motion curves, whereas the baselines often fail to identify the relevant affordance region or produce physically plausible motion. † General Flow used the mask prediction from our AFUN for its starting query points. 5.2.2 Contact Point Evaluation Beyond using masks for affordan… view at source ↗

**Figure 7.** Figure 7: Real-robot deployment (Franka). AFUN can be directly deployed to a real robotic system without any additional task-specific heuristics. Given a task from the user, our model can accurately locate the actionable (grasping) region and produce an accurate post-contact trajectory for robot manipulation. 6 Conclusion In this paper, we present AFUN, a step towards an affordance foundation model for understanding… view at source ↗

**Figure 8.** Figure 8: Example of Failed Cases. As there is nothing similar to Spray Bottles or Sun Visors in the [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗

**Figure 9.** Figure 9: Qualitative gallery on AFUN dataset, Part 1. [PITH_FULL_IMAGE:figures/full_fig_p024_9.png] view at source ↗

**Figure 10.** Figure 10: Qualitative gallery on AFUN dataset, Part 2. [PITH_FULL_IMAGE:figures/full_fig_p025_10.png] view at source ↗

**Figure 11.** Figure 11: This example shows the grippers with our model successfully locating the action part of [PITH_FULL_IMAGE:figures/full_fig_p026_11.png] view at source ↗

**Figure 12.** Figure 12: Qualitative gallery on the affordance segmentation evaluation, Part 1. [PITH_FULL_IMAGE:figures/full_fig_p027_12.png] view at source ↗

**Figure 13.** Figure 13: Qualitative gallery on the affordance segmentation evaluation, Part 2. [PITH_FULL_IMAGE:figures/full_fig_p028_13.png] view at source ↗

**Figure 14.** Figure 14: Qualitative gallery on the affordance segmentation evaluation, Part 3. [PITH_FULL_IMAGE:figures/full_fig_p029_14.png] view at source ↗

**Figure 15.** Figure 15: Qualitative gallery on the affordance segmentation evaluation, Part 4. [PITH_FULL_IMAGE:figures/full_fig_p030_15.png] view at source ↗

**Figure 16.** Figure 16: Qualitative gallery on the 3D motion evaluation, part 1. [PITH_FULL_IMAGE:figures/full_fig_p031_16.png] view at source ↗

**Figure 17.** Figure 17: Qualitative gallery on the 3D motion evaluation, part 2. [PITH_FULL_IMAGE:figures/full_fig_p032_17.png] view at source ↗

**Figure 18.** Figure 18: Qualitative gallery on the 3D motion evaluation, part 3. [PITH_FULL_IMAGE:figures/full_fig_p033_18.png] view at source ↗

**Figure 19.** Figure 19: Qualitative gallery on the 3D motion evaluation, part 4. [PITH_FULL_IMAGE:figures/full_fig_p034_19.png] view at source ↗

**Figure 20.** Figure 20: Qualitative gallery on the 3D motion evaluation, part 5. [PITH_FULL_IMAGE:figures/full_fig_p035_20.png] view at source ↗

read the original abstract

Affordance understanding bridges visual perception and physical action, serving as an explainable interface for robot manipulation in open and unstructured real-world environments. Yet, building an affordance foundation model that not only understands where and how the interaction should happen, but also generalizes across diverse environments, objects, and tasks, remains a long-standing research challenge. Existing methods typically address only part of this challenge, either localizing task-relevant regions without specifying executable motion, or predicting motion but with limited scalability. In this paper, we present ourmodel, a step towards an affordance foundation model for functionality understanding. From a single RGB-D observation and a language task description, ourmodel predicts a task-conditional functional mask (where to interact) and a 3D post-contact motion curve (how to interact). To support open-world generalization, we build a large-scale standardized data pipeline that converts heterogeneous robot, human, simulation, and real-world scan data into a shared affordance schema with language, masks, and object-centric 3D motion labels. We evaluate ourmodel from three aspects: for affordance segmentation, ourmodel outperforms all baselines by a large margin across 8 test sets from 4 benchmarks, improving mean gIoU/cIoU by +23.9/+26.3; for contact-point prediction, it predicts substantially more accurate points, with a 12.7--61.3% hit-rate gain over the best baseline; and for 3D motion, it achieves the best performance on all three test sets. ourmodel can be deployed for real-world robot manipulation without finetuning for robot embodiment or using task-specific heuristics, demonstrating the ability to adapt to open-world affordance tasks. Project page: https://www.zhaoningwang.com/AFUN

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AFUN combines mask and 3D curve prediction with a multi-source data pipeline and shows clear benchmark gains, but label consistency across data types is the part that still needs checking.

read the letter

The paper's main point is a model that takes one RGB-D view plus a language task and outputs both a functional mask and a 3D post-contact motion curve. They also describe a pipeline that turns robot, human, simulation, and scan data into matching language, mask, and object-centric 3D labels.

The joint output and the standardization effort are the actual new pieces. Earlier affordance work handled one or the other but not both together at this scale with the cross-source collection. The numbers are straightforward: +23.9 gIoU and +26.3 cIoU on segmentation over eight test sets, plus higher contact hit rates and best-in-set motion results on the three motion benchmarks. The claim that it runs on real robots without embodiment fine-tuning or task heuristics is the practical part worth watching.

The soft spot is exactly the one the stress-test note flags. The pipeline is supposed to produce consistent 3D motion curves no matter the source, yet the abstract gives no numbers on inter-source variance, alignment method, or how contact parameterization is unified. If those curves carry source-specific biases, the generalization story weakens even if the benchmark scores look good. No ablation on the pipeline itself is mentioned either.

The work is empirical rather than theoretical, so the citation pattern is mostly prior affordance papers and the math is light. That is fine for this kind of paper.

This is for robotics and embodied AI people who care about scalable perception-to-action interfaces and are tired of per-task engineering. A reader who wants to see whether multi-source data can actually reduce embodiment dependence will get concrete results to evaluate.

It deserves peer review. The experiments cover enough benchmarks to be worth referee time, the deployment claim is testable, and the data-pipeline question is a real one that referees can press on.

Referee Report

2 major / 2 minor

Summary. The paper introduces AFUN, a model that takes a single RGB-D image and language task description as input and outputs a task-conditional functional mask (where to interact) together with a 3D post-contact motion curve (how to interact). A large-scale data pipeline is presented that unifies robot, human, simulation, and real-world scan data into a shared affordance schema with consistent language, mask, and object-centric 3D motion labels. Experiments report large gains over baselines on affordance segmentation (+23.9/+26.3 mean gIoU/cIoU across 8 test sets), contact-point hit-rate (12.7–61.3 % improvement), and 3D motion prediction on three test sets, with zero-shot real-robot deployment claimed.

Significance. If the data pipeline truly yields embodiment-agnostic, consistent functional labels, the work would constitute a meaningful step toward scalable affordance foundation models that jointly address localization and executable motion. The reported cross-benchmark gains and real-world transfer without embodiment-specific fine-tuning would be notable contributions to robot manipulation research.

major comments (2)

[Data pipeline / Methods] Data pipeline section (exact section number not visible in provided text but referenced in abstract): the central generalization claim rests on the assertion that heterogeneous sources are converted into a shared schema with consistent object-centric 3D motion curves. No quantitative checks (e.g., inter-source variance on matched tasks, alignment error statistics, or ablation removing source-specific normalization) are described, leaving open the possibility that learned representations encode source biases rather than pure functionality.
[Experiments / Results] Results tables (affordance segmentation and 3D motion sections): while aggregate gains are reported, the manuscript does not provide per-source breakdowns or controls that isolate the contribution of the unified labeling scheme versus model architecture; without these, attribution of the +23.9/+26.3 gIoU/cIoU improvement to the pipeline remains under-supported.

minor comments (2)

[Abstract] Abstract contains repeated instances of the token "ourmodel" instead of the model name or "AFUN"; this should be corrected for readability.
[Method] Notation for the 3D motion curve (contact parameterization, coordinate frame, trajectory representation) is introduced but not given an explicit equation or diagram reference in the provided abstract; a clear definition would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the data pipeline and experimental attribution. We address each point below and will revise the manuscript to include the requested quantitative checks and breakdowns.

read point-by-point responses

Referee: [Data pipeline / Methods] Data pipeline section (exact section number not visible in provided text but referenced in abstract): the central generalization claim rests on the assertion that heterogeneous sources are converted into a shared schema with consistent object-centric 3D motion curves. No quantitative checks (e.g., inter-source variance on matched tasks, alignment error statistics, or ablation removing source-specific normalization) are described, leaving open the possibility that learned representations encode source biases rather than pure functionality.

Authors: We agree that explicit quantitative validation of cross-source consistency would strengthen the generalization argument. The pipeline enforces a shared schema for language, masks, and object-centric 3D motion, and the large gains across diverse benchmarks provide supporting evidence of effective unification. To directly address the concern, the revised manuscript will add inter-source variance statistics on matched tasks, alignment error metrics, and an ablation removing source-specific normalization steps. revision: yes
Referee: [Experiments / Results] Results tables (affordance segmentation and 3D motion sections): while aggregate gains are reported, the manuscript does not provide per-source breakdowns or controls that isolate the contribution of the unified labeling scheme versus model architecture; without these, attribution of the +23.9/+26.3 gIoU/cIoU improvement to the pipeline remains under-supported.

Authors: We thank the referee for this observation. Current tables report aggregate results across eight test sets drawn from four benchmarks to highlight broad applicability. To better isolate the unified labeling scheme from architectural contributions, the revision will add per-source performance breakdowns together with controls that compare the full pipeline against variants trained without the cross-source unification step. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical model with independent benchmark evaluation

full rationale

The paper presents an empirical ML system: a data pipeline unifies heterogeneous sources into labels, a model is trained to predict masks and motion curves from RGB-D + language, and performance is measured via gIoU/cIoU, hit-rate, and 3D metrics against external baselines on 8 test sets. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the abstract or described claims. The central results are falsifiable via held-out benchmarks and do not reduce to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the central claims rest on the unstated assumption that the data conversion process preserves task-relevant information across sources.

pith-pipeline@v0.9.1-grok · 5866 in / 1093 out tokens · 24134 ms · 2026-06-28T14:11:15.159341+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

106 extracted references · 18 canonical work pages · 1 internal anchor

[1]

AgiBot World Colosseo: A large- scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669, 2025

AgiBot-World-Contributors, Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xu Huang, et al. AgiBot World Colosseo: A large- scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669, 2025

Pith/arXiv arXiv 2025
[2]

Affordances from human videos as a versatile representation for robotics

Shikhar Bahl, Russell Mendonca, Lili Chen, Unnat Jain, and Deepak Pathak. Affordances from human videos as a versatile representation for robotics. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

2023
[3]

Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

Pith/arXiv arXiv 2025
[4]

HOT3D: Hand and object tracking in 3D from egocentric multi-view videos

Prithviraj Banerjee, Sindi Shkodrani, Pierre Moulon, Shreyas Hampali, Shangchen Han, Fan Zhang, Linguang Zhang, Jade Fountain, Edward Miller, Selen Basol, Richard Newcombe, Robert Wang, Jakob Julian Engel, and Tomas Hodan. HOT3D: Hand and object tracking in 3D from egocentric multi-view videos. InProceedings of the IEEE/CVF Conference on Computer Vision a...

2025
[5]

Track2Act: Predicting point tracks from internet videos enables generalizable robot manipulation

Homanga Bharadhwaj, Roozbeh Mottaghi, Abhinav Gupta, and Shubham Tulsiani. Track2Act: Predicting point tracks from internet videos enables generalizable robot manipulation. In European Conference on Computer Vision (ECCV). Springer, 2024

2024
[6]

Worldafford: Affordance grounding based on natural language instructions

Changmao Chen, Yuren Cong, and Zhen Kan. Worldafford: Affordance grounding based on natural language instructions. In36th IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2024, Herndon, VA, USA, October 28-30, 2024, pages 822–828. IEEE, 2024. doi: 10.1109/ICTAI62512.2024.00120. URL https://doi.org/10.1109/ICTAI62512. 2024.00120

work page doi:10.1109/ictai62512.2024.00120 2024
[7]

VidBot: Learning generalizable 3D actions from in-the-wild 2D human videos for zero-shot robotic manipulation

Hanzhi Chen, Boyang Sun, Anran Zhang, Marc Pollefeys, and Stefan Leutenegger. VidBot: Learning generalizable 3D actions from in-the-wild 2D human videos for zero-shot robotic manipulation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 27661–27672, 2025

2025
[8]

G3Flow: Generative 3D semantic flow for pose-aware and generalizable object manipulation

Tianxing Chen, Yao Mu, Zhixuan Liang, Zanxin Chen, Shijia Peng, Qiangyu Chen, Mingkun Xu, Ruizhen Hu, Hongyuan Zhang, Xuelong Li, and Ping Luo. G3Flow: Generative 3D semantic flow for pose-aware and generalizable object manipulation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1735–1744, June 2025

2025
[9]

3d-affordancellm: Harnessing large language models for open-vocabulary affordance detection in 3d worlds

Hengshuo Chu, Xiang Deng, Qi Lv, Xiaoyang Chen, Yinchuan Li, Jianye Hao, and Liqiang Nie. 3d-affordancellm: Harnessing large language models for open-vocabulary affordance detection in 3d worlds. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL https://openreview. net...

2025
[10]

GanHand: Predicting human grasp affordances in multi-object scenes

Enric Corona, Albert Pumarola, Guillem Alenya, Francesc Moreno-Noguer, and Grégory Rogez. GanHand: Predicting human grasp affordances in multi-object scenes. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5031–5041, 2020

2020
[11]

Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Jian Ma, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100. International Journal of Computer Vision (IJCV), 130(1):33–55, 2022. 11

2022
[12]

SceneFun3D: Fine-grained functionality and affordance understanding in 3D scenes

Alexandros Delitzas, Ayça Takmaz, Federico Tombari, Robert Sumner, Marc Pollefeys, and Francis Engelmann. SceneFun3D: Fine-grained functionality and affordance understanding in 3D scenes. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

2024
[13]

3d affordancenet: A benchmark for visual object affordance understanding

Shengheng Deng, Xun Xu, Chaozheng Wu, Ke Chen, and Kui Jia. 3d affordancenet: A benchmark for visual object affordance understanding. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

2021
[14]

Affordancenet: An end-to-end deep learning ap- proach for object affordance detection

Thanh-Toan Do, Anh Nguyen, and Ian Reid. Affordancenet: An end-to-end deep learning ap- proach for object affordance detection. InInternational Conference on Robotics and Automation (ICRA), 2018

2018
[16]

FlowBot3D: Learning 3D articulation flow to manipulate articulated objects

Ben Eisner, Harry Zhang, and David Held. FlowBot3D: Learning 3D articulation flow to manipulate articulated objects. InProceedings of Robotics: Science and Systems, New York City, NY , USA, June 2022. doi: 10.15607/RSS.2022.XVIII.018

work page doi:10.15607/rss.2022.xviii.018 2022
[17]

RH20T: A comprehensive robotic dataset for learning diverse skills in one-shot.arXiv preprint arXiv:2307.00595, 2023

Hao-Shu Fang, Hongjie Fang, Zhenyu Tang, Jirong Liu, Junbo Wang, Haoyi Zhu, and Cewu Lu. RH20T: A comprehensive robotic dataset for learning diverse skills in one-shot.arXiv preprint arXiv:2307.00595, 2023

arXiv 2023
[18]

Anygrasp: Robust and efficient grasp perception in spatial and temporal domains.IEEE Transactions on Robotics, 2023

Hao-Shu Fang, Chenxi Wang, Hongjie Fang, Minghao Gou, Jirong Liu, Hengxu Yan, Wenhai Liu, Yichen Xie, and Cewu Lu. Anygrasp: Robust and efficient grasp perception in spatial and temporal domains.IEEE Transactions on Robotics, 2023. doi: 10.1109/TRO.2023.3281153

work page doi:10.1109/tro.2023.3281153 2023
[19]

Poles of inaccessibility: A calculation algorithm for the remotest places on earth.Scottish Geographical Journal, 123(3):227–233, September 2007

Daniel Garcia-Castellanos and Umberto Lombardo. Poles of inaccessibility: A calculation algorithm for the remotest places on earth.Scottish Geographical Journal, 123(3):227–233, September 2007. ISSN 1751-665X. doi: 10.1080/14702540801897809. URL http://dx. doi.org/10.1080/14702540801897809

work page doi:10.1080/14702540801897809 2007
[20]

Gibson.The Ecological Approach to Visual Perception

James J. Gibson.The Ecological Approach to Visual Perception. Houghton Mifflin, 1979

1979
[21]

Ego4D: Around the world in 3,000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4D: Around the world in 3,000 hours of egocentric video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18995–19012, 2022

2022
[22]

Fsag: Enhancing human-to-dexterous-hand finger-specific affordance grounding via diffusion models, 2026

Yifan Han, Yichuan Peng, Pengfei Yi, Junyan Li, Hanqing Wang, Gaojing Zhang, Qi Peng Liu, and Wenzhao Lian. Fsag: Enhancing human-to-dexterous-hand finger-specific affordance grounding via diffusion models, 2026. URLhttps://arxiv.org/abs/2601.08246

arXiv 2026
[23]

Roboafford++: A generative ai-enhanced dataset for multimodal affordance learning in robotic manipulation and navigation.arXiv preprint arXiv:2511.12436, 2025

Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Yanbiao Ma, Yunfeng Diao, Ziyu Jia, Wenbo Ding, Hangjun Ye, and Long Chen. Roboafford++: A generative ai-enhanced dataset for multimodal affordance learning in robotic manipulation and navigation.arXiv preprint arXiv:2511.12436, 2025

arXiv 2025
[24]

2handedafforder: Learning precise actionable bimanual affordances from human videos

Marvin Heidinger, Snehal Jauhri, Vignesh Prasad, and Georgia Chalvatzaki. 2handedafforder: Learning precise actionable bimanual affordances from human videos. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 14743–14753, October 2025

2025
[25]

RoboMIND 2.0: A multimodal, bimanual mobile ma- nipulation dataset for generalizable embodied intelligence.arXiv preprint arXiv:2512.24653, 2025

Chengkai Hou, Kun Wu, Jiaming Liu, Zhengping Che, Di Wu, Fei Liao, Guangrun Li, Jingyang He, Qiuxuan Feng, Zhao Jin, et al. RoboMIND 2.0: A multimodal, bimanual mobile ma- nipulation dataset for generalizable embodied intelligence.arXiv preprint arXiv:2512.24653, 2025

arXiv 2025
[26]

a fer, Andrew Wing Keung To, Kuan-Ho Lao, Murat Cubuktepe, Matthew Haley, Peter B \

Siyuan Huang, Iaroslav Ponomarenko, Zhengkai Jiang, Xiaoqi Li, Xiaobin Hu, Peng Gao, Hongsheng Li, and Hao Dong. Manipvqa: Injecting robotic affordance and physically grounded information into multi-modal large language models. InIEEE/RSJ International Conference 12 on Intelligent Robots and Systems, IROS 2024, Abu Dhabi, United Arab Emirates, October 14-...

work page doi:10.1109/iros58592.2024.10801993 2024
[27]

Cruise: Cooperative reconstruction and editing in v2x scenarios using gaussian splatting

Yizhou Huang, Fan Yang, Guoliang Zhu, Gen Li, Hao Shi, Yukun Zuo, Wenrui Chen, Zhiyong Li, and Kailun Yang. Resource-efficient affordance grounding with complementary depth and semantic prompts. InIEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2025, Hangzhou, China, October 19-25, 2025, pages 7788–7795. IEEE, 2025. doi: 10.1109/...

work page doi:10.1109/iros60139.2025.11245943 2025
[28]

Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J. Davison. RLBench: The robot learning benchmark & learning environment.IEEE Robotics and Automation Letters (RA-L), 5(2):3019–3026, 2020

2020
[29]

INTRA: Interaction relationship-aware weakly supervised affordance grounding

Ji Ha Jang, Hoigi Seo, and Se Young Chun. INTRA: Interaction relationship-aware weakly supervised affordance grounding. InEuropean Conference on Computer Vision (ECCV), pages 18–34. Springer, 2024

2024
[30]

Affordancesam: Segment anything once more in affordance grounding, 2025

Dengyang Jiang, Zanyi Wang, Hengzhuang Li, Sizhe Dang, Teli Ma, Wei Wei, Guang Dai, Lei Zhang, and Mengmeng Wang. Affordancesam: Segment anything once more in affordance grounding, 2025. URLhttps://arxiv.org/abs/2504.15650

arXiv 2025
[31]

Hanxiao Jiang, Yongsen Mao, Manolis Savva, and Angel X. Chang. OPD: Single-view 3d openable part detection. InEuropean Conference on Computer Vision (ECCV), 2022

2022
[32]

Robo- abc: Affordance generalization beyond categories via semantic correspondence for robot manipulation

Yuanchen Ju, Kaizhe Hu, Guowei Zhang, Gu Zhang, Mingrun Jiang, and Huazhe Xu. Robo- abc: Affordance generalization beyond categories via semantic correspondence for robot manipulation. InEuropean Conference on Computer Vision, pages 222–239. Springer, 2024

2024
[33]

DROID: A large-scale in-the-wild robot manipulation dataset

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. DROID: A large-scale in-the-wild robot manipulation dataset. InProceedings of Robotics: Science and Systems (RSS), 2024. arXiv:2403.12945

Pith/arXiv arXiv 2024
[34]

Shamma, Michael S

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision (IJCV), 123(1):32–73, 2017

2017
[35]

RAM: Retrieval-based affordance transfer for generalizable zero- shot robotic manipulation

Yuxuan Kuang, Junjie Ye, Haoran Geng, Jiageng Mao, Congyue Deng, Leonidas Guibas, He Wang, and Yue Wang. RAM: Retrieval-based affordance transfer for generalizable zero- shot robotic manipulation. InProceedings of The 8th Conference on Robot Learning (CoRL), volume 270 ofProceedings of Machine Learning Research, pages 547–565. PMLR, 2024. URL https://proc...

2024
[36]

LISA: Reasoning segmentation via large language model

Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. LISA: Reasoning segmentation via large language model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9579–9589, 2024

2024
[37]

Locate: Localize and transfer object parts for weakly supervised affordance grounding

Gen Li, Varun Jampani, Deqing Sun, and Laura Sevilla-Lara. Locate: Localize and transfer object parts for weakly supervised affordance grounding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

2023
[38]

Learning precise affordances from egocentric videos for robotic manipulation

Gen Li, Nikolaos Tsagkas, Jifei Song, Ruaridh Mon-Williams, Sethu Vijayakumar, Kun Shao, and Laura Sevilla-Lara. Learning precise affordances from egocentric videos for robotic manipulation. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2025

2025
[39]

Locate n’ Rotate: Two-stage openable part detection with geometric foundation model priors

Siqi Li, Xiaoxue Chen, Haoyu Cheng, Guyue Zhou, Hao Zhao, and Guanzhong Tian. Locate n’ Rotate: Two-stage openable part detection with geometric foundation model priors. In Minsu Cho, Ivan Laptev, Du Tran, Angela Yao, and Hongbin Zha, editors,Computer Vision - ACCV 2024 - 17th Asian Conference on Computer Vision, Hanoi, Vietnam, December 8-12, 2024, 13 Pr...

work page doi:10.1007/978-981-96-0963-5 2024
[40]

URL https://proceedings.mlr

Xiaoqi Li, Mingxu Zhang, Yiran Geng, Haoran Geng, Yuxing Long, Yan Shen, Renrui Zhang, Jiaming Liu, and Hao Dong. Manipllm: Embodied multimodal large language model for object-centric robotic manipulation. InIEEE/CVF Conference on Computer Vision and Pat- tern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 18061–18070. IEEE, 2024. doi: ...

work page doi:10.1109/cvpr52733.2024.01710 2024
[41]

Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang

Haotong Lin, Sili Chen, Junhao Liew, Donny Y . Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

Pith/arXiv arXiv 2025
[42]

Fast interactive object annotation with Curve-GCN

Huan Ling, Jun Gao, Amlan Kar, Wenzheng Chen, and Sanja Fidler. Fast interactive object annotation with Curve-GCN. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019

2019
[43]

Joint hand motion and interaction hotspots prediction from egocentric videos

Shaowei Liu, Subarna Tripathi, Somdeb Majumdar, and Xiaolong Wang. Joint hand motion and interaction hotspots prediction from egocentric videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

2022
[44]

a fer, Andrew Wing Keung To, Kuan-Ho Lao, Murat Cubuktepe, Matthew Haley, Peter B \

Zhiyang Liu, Ruiteng Zhao, Lei Zhou, Chengran Yuan, Yuwei Wu, Sheng Guo, Zhengshen Zhang, Chenchen Liu, Marcelo H. Ang, and Francis E. H. Tay. 3d affordance keypoint detection for robotic manipulation. InIEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2024, Abu Dhabi, United Arab Emirates, October 14-18, 2024, pages 7528–7534. IE...

work page doi:10.1109/iros58592.2024.10801792 2024
[45]

Geal: Generalizable 3d affordance learning with cross-modal consistency

Dongyue Lu, Lingdong Kong, Tianxin Huang, and Gim Hee Lee. Geal: Generalizable 3d affordance learning with cross-modal consistency. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1680–1690, June 2025

2025
[46]

Learning affordance grounding from exocentric images

Hongchen Luo, Wei Zhai, Jing Zhang, Yang Cao, and Dacheng Tao. Learning affordance grounding from exocentric images. InCVPR, 2022

2022
[47]

GLOVER++: Unleashing the potential of affordance learning from human behaviors for robotic manipulation

Teli Ma, Jia Zheng, Zifan Wang, Ziyao Gao, Jiaming Zhou, and Junwei Liang. GLOVER++: Unleashing the potential of affordance learning from human behaviors for robotic manipulation. InProceedings of The 9th Conference on Robot Learning (CoRL), volume 305 ofProceedings of Machine Learning Research, pages 3972–3994. PMLR, 2025. URL https://proceedings. mlr.pr...

2025
[48]

Vagnet: Grounding 3d affordance from human-object interactions in videos, 2026

Aihua Mao, Kaihang Huang, Yong-Jin Liu, Chee Seng Chan, and Ying He. Vagnet: Grounding 3d affordance from human-object interactions in videos, 2026. URL https://arxiv.org/ abs/2602.20608

arXiv 2026
[49]

CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks.IEEE Robotics and Automation Letters (RA-L), 7(3):7327–7334, 2022

Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks.IEEE Robotics and Automation Letters (RA-L), 7(3):7327–7334, 2022

2022
[50]

SAM 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025

Meta AI Research. SAM 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025

Pith/arXiv arXiv 2025
[51]

VITRA: Scalable vision-language-action model pretraining for robotic manipulation with real-life human activity videos.arXiv preprint arXiv:2510.21571, 2025

Microsoft VITRA Team. VITRA: Scalable vision-language-action model pretraining for robotic manipulation with real-life human activity videos.arXiv preprint arXiv:2510.21571, 2025

arXiv 2025
[52]

Where2act: From pixels to actions for articulated 3d objects

Kaichun Mo, Leonidas Guibas, Mustafa Mukadam, Abhinav Gupta, and Shubham Tulsiani. Where2act: From pixels to actions for articulated 3d objects. InInternational Conference on Computer Vision (ICCV), 2021

2021
[53]

Teo, Cornelia Fermüller, and Yiannis Aloimonos

Austin Myers, Ching L. Teo, Cornelia Fermüller, and Yiannis Aloimonos. Affordance detection of tool parts from geometric features. InICRA, 2015. 14

2015
[55]

Object-based affordances detection with convolutional neural networks and dense conditional random fields

Anh Nguyen, Dimitrios Kanoulas, Darwin G Caldwell, and Nikos G Tsagarakis. Object-based affordances detection with convolutional neural networks and dense conditional random fields. InIEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017

2017
[56]

Open-vocabulary affordance detection in 3D point clouds

Toan Nguyen, Minh Nhat Vu, An Vuong, Dzung Nguyen, Thieu V o, Ngan Le, and Anh Nguyen. Open-vocabulary affordance detection in 3D point clouds. InIEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023

2023
[57]

Transfer between modalities with MetaQueries.arXiv preprint arXiv:2504.06256, 2025

Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, Ji Hou, and Saining Xie. Transfer between modalities with MetaQueries.arXiv preprint arXiv:2504.06256, 2025

Pith/arXiv arXiv 2025
[58]

In: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp 3347–3356

Shengyi Qian, Weifeng Chen, Min Bai, Xiong Zhou, Zhuowen Tu, and Li Erran Li. Affor- dancellm: Grounding affordance from vision language models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024 - Workshops, Seattle, WA, USA, June 17-18, 2024, pages 7587–7597. IEEE, 2024. doi: 10.1109/CVPRW63382.2024.00754. URL https://doi.org/10...

work page doi:10.1109/cvprw63382.2024.00754 2024
[59]

ToolFlowNet: Robotic manipulation with tools via predicting tool flow from point clouds

Daniel Seita, Yufei Wang, Sarthak Shetty, Edward Li, Zackory Erickson, and David Held. ToolFlowNet: Robotic manipulation with tools via predicting tool flow from point clouds. In Conference on Robot Learning (CoRL), 2022

2022
[60]

GREAT: Geometry-intention collaborative inference for open-vocabulary 3D object affordance grounding

Yawen Shao, Wei Zhai, Yuhang Yang, Hongchen Luo, Yang Cao, and Zheng-Jun Zha. GREAT: Geometry-intention collaborative inference for open-vocabulary 3D object affordance grounding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17326–17336, 2025

2025
[61]

Xiaohao Sun, Hanxiao Jiang, Manolis Savva, and Angel X. Chang. OPDMulti: Openable part detection for multiple objects. InInternational Conference on 3D Vision, 3DV 2024, Davos, Switzerland, March 18-21, 2024, pages 169–178. IEEE, 2024. doi: 10.1109/3DV62453.2024. 00100. URLhttps://doi.org/10.1109/3DV62453.2024.00100

work page doi:10.1109/3dv62453.2024 2024
[62]

Masked depth modeling for spatial perception.arXiv preprint arXiv:2601.17895, 2026

Bin Tan, Changjiang Sun, Xiage Qin, Hanat Adai, Zelin Fu, Tianxiang Zhou, Han Zhang, Yinghao Xu, Xing Zhu, Yujun Shen, and Nan Xue. Masked depth modeling for spatial perception.arXiv preprint arXiv:2601.17895, 2026

arXiv 2026
[63]

UAD: Unsupervised affordance distillation for generalization in robotic manipulation

Yihe Tang, Wenlong Huang, Yingke Wang, Chengshu Li, Roy Yuan, Ruohan Zhang, Jiajun Wu, and Li Fei-Fei. UAD: Unsupervised affordance distillation for generalization in robotic manipulation. InIEEE International Conference on Robotics and Automation (ICRA), 2025

2025
[64]

Roboafford: A dataset and benchmark for enhancing object and spatial affordance learning in robot ma- nipulation

Yingbo Tang, Lingfeng Zhang, Shuyi Zhang, Yinuo Zhao, and Xiaoshuai Hao. Roboafford: A dataset and benchmark for enhancing object and spatial affordance learning in robot ma- nipulation. InProceedings of the 33rd ACM International Conference on Multimedia, pages 12706–12713, 2025

2025
[65]

Cruise: Cooperative reconstruction and editing in v2x scenarios using gaussian splatting

Yingbo Tang, Shuaike Zhang, Xiaoshuai Hao, Pengwei Wang, Jianlong Wu, Zhongyuan Wang, and Shanghang Zhang. Affordgrasp: In-context affordance reasoning for open-vocabulary task- oriented grasping in clutter. InIEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2025, Hangzhou, China, October 19-25, 2025, pages 9433–9439. IEEE, 2025. ...

work page doi:10.1109/iros60139.2025.11245995 2025
[66]

Qwen3.5: Accelerating productivity with native multimodal agents, February

Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February
[67]

URLhttps://qwen.ai/blog?id=qwen3.5
[68]

O3afford: One-shot 3d object-to-object affordance grounding for generalizable robotic manipulation

Tongxuan Tian, Xuhui Kang, and Yen-Ling Kuo. O3afford: One-shot 3d object-to-object affordance grounding for generalizable robotic manipulation. 2025. 15

2025
[69]

InstructPart: Task-oriented part segmentation with instruction reasoning

Zifu Wan, Yaqi Xie, Ce Zhang, Zhiqiu Lin, Zihan Wang, Simon Stepputtis, Deva Ramanan, and Katia Sycara. InstructPart: Task-oriented part segmentation with instruction reasoning. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), 2025

2025
[70]

Videoafford: Grounding 3d affordance from human-object-interaction videos via multimodal large language model,

Hanqing Wang, Mingyu Liu, Xiaoyu Chen, Chengwei Ma, Yiming Zhong, Wenti Yin, Yuhao Liu, Zhiqing Cui, Jiahao Yuan, Lu Dai, Zhiyuan Ma, and Hui Xiong. Videoafford: Grounding 3d affordance from human-object-interaction videos via multimodal large language model,
[71]

URLhttps://arxiv.org/abs/2602.09638

arXiv
[72]

Affordance-R1: Reinforcement learning for generalizable affordance reasoning in multimodal large language models

Hanqing Wang, Shaoyang Wang, Yiming Zhong, Zemin Yang, Jiamin Wang, Zhiqing Cui, Jiahao Yuan, Yifan Han, Mingyu Liu, and Yuexin Ma. Affordance-R1: Reinforcement learning for generalizable affordance reasoning in multimodal large language models. InProceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2026

2026
[73]

AffordBot: 3D fine-grained embodied reasoning via multimodal large language models

Xinyi Wang, Xun Yang, Yanlong Xu, Yuchen Wu, Zhen Li, and Na Zhao. AffordBot: 3D fine-grained embodied reasoning via multimodal large language models. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

2025
[74]

AffordDexGrasp: Open-set language-guided dexterous grasp with generalizable- instructive affordance

Yi-Lin Wei, Mu Lin, Yuhao Lin, Jian-Jian Jiang, Xiao-Ming Wu, Ling-An Zeng, and Wei- Shi Zheng. AffordDexGrasp: Open-set language-guided dexterous grasp with generalizable- instructive affordance. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

2025
[75]

RAGNet: Large-scale reasoning-based affordance segmentation benchmark towards general grasping

Dongming Wu, Yanping Fu, Saike Huang, Yingfei Liu, Fan Jia, Nian Liu, Feng Dai, Tiancai Wang, Rao Muhammad Anwer, Fahad Shahbaz Khan, and Jianbing Shen. RAGNet: Large-scale reasoning-based affordance segmentation benchmark towards general grasping. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

2025
[76]

RoboMIND: Benchmark on multi-embodiment intelligence normative data for robot manipulation.arXiv preprint arXiv:2412.13877, 2024

Kun Wu, Chengkai Hou, Jiaming Liu, Zhengping Che, Xiaozhu Ju, Zhuqin Yang, Meng Li, Yinuo Zhao, Zhiyuan Xu, Guang Yang, et al. RoboMIND: Benchmark on multi-embodiment intelligence normative data for robot manipulation.arXiv preprint arXiv:2412.13877, 2024

Pith/arXiv arXiv 2024
[77]

V AT-mart: Learning visual action trajectory proposals for manipulating 3d ARTiculated objects

Ruihai Wu, Yan Zhao, Kaichun Mo, Zizheng Guo, Yian Wang, Tianhao Wu, Qingnan Fan, Xuelin Chen, Leonidas Guibas, and Hao Dong. V AT-mart: Learning visual action trajectory proposals for manipulating 3d ARTiculated objects. InInternational Conference on Learning Representations, 2022. URLhttps://openreview.net/forum?id=iEx3PiooLy

2022
[78]

AffordDP: Generalizable diffusion policy with transferable affordance

Shijie Wu, Yihang Zhu, Yunao Huang, Kaizhen Zhu, Jiayuan Gu, Jingyi Yu, Ye Shi, and Jingya Wang. AffordDP: Generalizable diffusion policy with transferable affordance. InProceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 6971–6980, 2025

2025
[79]

Sonata: Self-supervised learning of reliable point representations

Xiaoyang Wu, Daniel DeTone, Duncan Frost, Tianwei Shen, Chris Xie, Nan Yang, Jakob Engel, Richard Newcombe, Hengshuang Zhao, and Julian Straub. Sonata: Self-supervised learning of reliable point representations. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

2025
[80]

Weakly-supervised affordance grounding guided by part-level semantic priors

Peiran Xu and Yadong MU. Weakly-supervised affordance grounding guided by part-level semantic priors. InThe Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview.net/forum?id=0823rvTIhs

2025
[81]

A0: An affordance-aware hierarchical model for general robotic manipulation

Rongtao Xu, Jian Zhang, Minghao Guo, Youpeng Wen, Haoting Yang, Min Lin, Jianzheng Huang, Zhe Li, Kaidong Zhang, Liqiong Wang, Yuxuan Kuang, Meng Cao, Feng Zheng, and Xiaodan Liang. A0: An affordance-aware hierarchical model for general robotic manipulation. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

2025
[82]

Ground- ing 3d object affordance from 2d interactions in images

Yuhang Yang, Wei Zhai, Hongchen Luo, Yang Cao, Jiebo Luo, and Zheng-Jun Zha. Ground- ing 3d object affordance from 2d interactions in images. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10905–10915, October 2023

2023

Showing first 80 references.

[1] [1]

AgiBot World Colosseo: A large- scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669, 2025

AgiBot-World-Contributors, Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xu Huang, et al. AgiBot World Colosseo: A large- scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669, 2025

Pith/arXiv arXiv 2025

[2] [2]

Affordances from human videos as a versatile representation for robotics

Shikhar Bahl, Russell Mendonca, Lili Chen, Unnat Jain, and Deepak Pathak. Affordances from human videos as a versatile representation for robotics. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

2023

[3] [3]

Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

Pith/arXiv arXiv 2025

[4] [4]

HOT3D: Hand and object tracking in 3D from egocentric multi-view videos

Prithviraj Banerjee, Sindi Shkodrani, Pierre Moulon, Shreyas Hampali, Shangchen Han, Fan Zhang, Linguang Zhang, Jade Fountain, Edward Miller, Selen Basol, Richard Newcombe, Robert Wang, Jakob Julian Engel, and Tomas Hodan. HOT3D: Hand and object tracking in 3D from egocentric multi-view videos. InProceedings of the IEEE/CVF Conference on Computer Vision a...

2025

[5] [5]

Track2Act: Predicting point tracks from internet videos enables generalizable robot manipulation

Homanga Bharadhwaj, Roozbeh Mottaghi, Abhinav Gupta, and Shubham Tulsiani. Track2Act: Predicting point tracks from internet videos enables generalizable robot manipulation. In European Conference on Computer Vision (ECCV). Springer, 2024

2024

[6] [6]

Worldafford: Affordance grounding based on natural language instructions

Changmao Chen, Yuren Cong, and Zhen Kan. Worldafford: Affordance grounding based on natural language instructions. In36th IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2024, Herndon, VA, USA, October 28-30, 2024, pages 822–828. IEEE, 2024. doi: 10.1109/ICTAI62512.2024.00120. URL https://doi.org/10.1109/ICTAI62512. 2024.00120

work page doi:10.1109/ictai62512.2024.00120 2024

[7] [7]

VidBot: Learning generalizable 3D actions from in-the-wild 2D human videos for zero-shot robotic manipulation

Hanzhi Chen, Boyang Sun, Anran Zhang, Marc Pollefeys, and Stefan Leutenegger. VidBot: Learning generalizable 3D actions from in-the-wild 2D human videos for zero-shot robotic manipulation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 27661–27672, 2025

2025

[8] [8]

G3Flow: Generative 3D semantic flow for pose-aware and generalizable object manipulation

Tianxing Chen, Yao Mu, Zhixuan Liang, Zanxin Chen, Shijia Peng, Qiangyu Chen, Mingkun Xu, Ruizhen Hu, Hongyuan Zhang, Xuelong Li, and Ping Luo. G3Flow: Generative 3D semantic flow for pose-aware and generalizable object manipulation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1735–1744, June 2025

2025

[9] [9]

3d-affordancellm: Harnessing large language models for open-vocabulary affordance detection in 3d worlds

Hengshuo Chu, Xiang Deng, Qi Lv, Xiaoyang Chen, Yinchuan Li, Jianye Hao, and Liqiang Nie. 3d-affordancellm: Harnessing large language models for open-vocabulary affordance detection in 3d worlds. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL https://openreview. net...

2025

[10] [10]

GanHand: Predicting human grasp affordances in multi-object scenes

Enric Corona, Albert Pumarola, Guillem Alenya, Francesc Moreno-Noguer, and Grégory Rogez. GanHand: Predicting human grasp affordances in multi-object scenes. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5031–5041, 2020

2020

[11] [11]

Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Jian Ma, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100. International Journal of Computer Vision (IJCV), 130(1):33–55, 2022. 11

2022

[12] [12]

SceneFun3D: Fine-grained functionality and affordance understanding in 3D scenes

Alexandros Delitzas, Ayça Takmaz, Federico Tombari, Robert Sumner, Marc Pollefeys, and Francis Engelmann. SceneFun3D: Fine-grained functionality and affordance understanding in 3D scenes. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

2024

[13] [13]

3d affordancenet: A benchmark for visual object affordance understanding

Shengheng Deng, Xun Xu, Chaozheng Wu, Ke Chen, and Kui Jia. 3d affordancenet: A benchmark for visual object affordance understanding. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

2021

[14] [14]

Affordancenet: An end-to-end deep learning ap- proach for object affordance detection

Thanh-Toan Do, Anh Nguyen, and Ian Reid. Affordancenet: An end-to-end deep learning ap- proach for object affordance detection. InInternational Conference on Robotics and Automation (ICRA), 2018

2018

[15] [16]

FlowBot3D: Learning 3D articulation flow to manipulate articulated objects

Ben Eisner, Harry Zhang, and David Held. FlowBot3D: Learning 3D articulation flow to manipulate articulated objects. InProceedings of Robotics: Science and Systems, New York City, NY , USA, June 2022. doi: 10.15607/RSS.2022.XVIII.018

work page doi:10.15607/rss.2022.xviii.018 2022

[16] [17]

RH20T: A comprehensive robotic dataset for learning diverse skills in one-shot.arXiv preprint arXiv:2307.00595, 2023

Hao-Shu Fang, Hongjie Fang, Zhenyu Tang, Jirong Liu, Junbo Wang, Haoyi Zhu, and Cewu Lu. RH20T: A comprehensive robotic dataset for learning diverse skills in one-shot.arXiv preprint arXiv:2307.00595, 2023

arXiv 2023

[17] [18]

Anygrasp: Robust and efficient grasp perception in spatial and temporal domains.IEEE Transactions on Robotics, 2023

Hao-Shu Fang, Chenxi Wang, Hongjie Fang, Minghao Gou, Jirong Liu, Hengxu Yan, Wenhai Liu, Yichen Xie, and Cewu Lu. Anygrasp: Robust and efficient grasp perception in spatial and temporal domains.IEEE Transactions on Robotics, 2023. doi: 10.1109/TRO.2023.3281153

work page doi:10.1109/tro.2023.3281153 2023

[18] [19]

Poles of inaccessibility: A calculation algorithm for the remotest places on earth.Scottish Geographical Journal, 123(3):227–233, September 2007

Daniel Garcia-Castellanos and Umberto Lombardo. Poles of inaccessibility: A calculation algorithm for the remotest places on earth.Scottish Geographical Journal, 123(3):227–233, September 2007. ISSN 1751-665X. doi: 10.1080/14702540801897809. URL http://dx. doi.org/10.1080/14702540801897809

work page doi:10.1080/14702540801897809 2007

[19] [20]

Gibson.The Ecological Approach to Visual Perception

James J. Gibson.The Ecological Approach to Visual Perception. Houghton Mifflin, 1979

1979

[20] [21]

Ego4D: Around the world in 3,000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4D: Around the world in 3,000 hours of egocentric video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18995–19012, 2022

2022

[21] [22]

Fsag: Enhancing human-to-dexterous-hand finger-specific affordance grounding via diffusion models, 2026

Yifan Han, Yichuan Peng, Pengfei Yi, Junyan Li, Hanqing Wang, Gaojing Zhang, Qi Peng Liu, and Wenzhao Lian. Fsag: Enhancing human-to-dexterous-hand finger-specific affordance grounding via diffusion models, 2026. URLhttps://arxiv.org/abs/2601.08246

arXiv 2026

[22] [23]

Roboafford++: A generative ai-enhanced dataset for multimodal affordance learning in robotic manipulation and navigation.arXiv preprint arXiv:2511.12436, 2025

Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Yanbiao Ma, Yunfeng Diao, Ziyu Jia, Wenbo Ding, Hangjun Ye, and Long Chen. Roboafford++: A generative ai-enhanced dataset for multimodal affordance learning in robotic manipulation and navigation.arXiv preprint arXiv:2511.12436, 2025

arXiv 2025

[23] [24]

2handedafforder: Learning precise actionable bimanual affordances from human videos

Marvin Heidinger, Snehal Jauhri, Vignesh Prasad, and Georgia Chalvatzaki. 2handedafforder: Learning precise actionable bimanual affordances from human videos. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 14743–14753, October 2025

2025

[24] [25]

RoboMIND 2.0: A multimodal, bimanual mobile ma- nipulation dataset for generalizable embodied intelligence.arXiv preprint arXiv:2512.24653, 2025

Chengkai Hou, Kun Wu, Jiaming Liu, Zhengping Che, Di Wu, Fei Liao, Guangrun Li, Jingyang He, Qiuxuan Feng, Zhao Jin, et al. RoboMIND 2.0: A multimodal, bimanual mobile ma- nipulation dataset for generalizable embodied intelligence.arXiv preprint arXiv:2512.24653, 2025

arXiv 2025

[25] [26]

a fer, Andrew Wing Keung To, Kuan-Ho Lao, Murat Cubuktepe, Matthew Haley, Peter B \

Siyuan Huang, Iaroslav Ponomarenko, Zhengkai Jiang, Xiaoqi Li, Xiaobin Hu, Peng Gao, Hongsheng Li, and Hao Dong. Manipvqa: Injecting robotic affordance and physically grounded information into multi-modal large language models. InIEEE/RSJ International Conference 12 on Intelligent Robots and Systems, IROS 2024, Abu Dhabi, United Arab Emirates, October 14-...

work page doi:10.1109/iros58592.2024.10801993 2024

[26] [27]

Cruise: Cooperative reconstruction and editing in v2x scenarios using gaussian splatting

Yizhou Huang, Fan Yang, Guoliang Zhu, Gen Li, Hao Shi, Yukun Zuo, Wenrui Chen, Zhiyong Li, and Kailun Yang. Resource-efficient affordance grounding with complementary depth and semantic prompts. InIEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2025, Hangzhou, China, October 19-25, 2025, pages 7788–7795. IEEE, 2025. doi: 10.1109/...

work page doi:10.1109/iros60139.2025.11245943 2025

[27] [28]

Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J. Davison. RLBench: The robot learning benchmark & learning environment.IEEE Robotics and Automation Letters (RA-L), 5(2):3019–3026, 2020

2020

[28] [29]

INTRA: Interaction relationship-aware weakly supervised affordance grounding

Ji Ha Jang, Hoigi Seo, and Se Young Chun. INTRA: Interaction relationship-aware weakly supervised affordance grounding. InEuropean Conference on Computer Vision (ECCV), pages 18–34. Springer, 2024

2024

[29] [30]

Affordancesam: Segment anything once more in affordance grounding, 2025

Dengyang Jiang, Zanyi Wang, Hengzhuang Li, Sizhe Dang, Teli Ma, Wei Wei, Guang Dai, Lei Zhang, and Mengmeng Wang. Affordancesam: Segment anything once more in affordance grounding, 2025. URLhttps://arxiv.org/abs/2504.15650

arXiv 2025

[30] [31]

Hanxiao Jiang, Yongsen Mao, Manolis Savva, and Angel X. Chang. OPD: Single-view 3d openable part detection. InEuropean Conference on Computer Vision (ECCV), 2022

2022

[31] [32]

Robo- abc: Affordance generalization beyond categories via semantic correspondence for robot manipulation

Yuanchen Ju, Kaizhe Hu, Guowei Zhang, Gu Zhang, Mingrun Jiang, and Huazhe Xu. Robo- abc: Affordance generalization beyond categories via semantic correspondence for robot manipulation. InEuropean Conference on Computer Vision, pages 222–239. Springer, 2024

2024

[32] [33]

DROID: A large-scale in-the-wild robot manipulation dataset

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. DROID: A large-scale in-the-wild robot manipulation dataset. InProceedings of Robotics: Science and Systems (RSS), 2024. arXiv:2403.12945

Pith/arXiv arXiv 2024

[33] [34]

Shamma, Michael S

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision (IJCV), 123(1):32–73, 2017

2017

[34] [35]

RAM: Retrieval-based affordance transfer for generalizable zero- shot robotic manipulation

Yuxuan Kuang, Junjie Ye, Haoran Geng, Jiageng Mao, Congyue Deng, Leonidas Guibas, He Wang, and Yue Wang. RAM: Retrieval-based affordance transfer for generalizable zero- shot robotic manipulation. InProceedings of The 8th Conference on Robot Learning (CoRL), volume 270 ofProceedings of Machine Learning Research, pages 547–565. PMLR, 2024. URL https://proc...

2024

[35] [36]

LISA: Reasoning segmentation via large language model

Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. LISA: Reasoning segmentation via large language model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9579–9589, 2024

2024

[36] [37]

Locate: Localize and transfer object parts for weakly supervised affordance grounding

Gen Li, Varun Jampani, Deqing Sun, and Laura Sevilla-Lara. Locate: Localize and transfer object parts for weakly supervised affordance grounding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

2023

[37] [38]

Learning precise affordances from egocentric videos for robotic manipulation

Gen Li, Nikolaos Tsagkas, Jifei Song, Ruaridh Mon-Williams, Sethu Vijayakumar, Kun Shao, and Laura Sevilla-Lara. Learning precise affordances from egocentric videos for robotic manipulation. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2025

2025

[38] [39]

Locate n’ Rotate: Two-stage openable part detection with geometric foundation model priors

Siqi Li, Xiaoxue Chen, Haoyu Cheng, Guyue Zhou, Hao Zhao, and Guanzhong Tian. Locate n’ Rotate: Two-stage openable part detection with geometric foundation model priors. In Minsu Cho, Ivan Laptev, Du Tran, Angela Yao, and Hongbin Zha, editors,Computer Vision - ACCV 2024 - 17th Asian Conference on Computer Vision, Hanoi, Vietnam, December 8-12, 2024, 13 Pr...

work page doi:10.1007/978-981-96-0963-5 2024

[39] [40]

URL https://proceedings.mlr

Xiaoqi Li, Mingxu Zhang, Yiran Geng, Haoran Geng, Yuxing Long, Yan Shen, Renrui Zhang, Jiaming Liu, and Hao Dong. Manipllm: Embodied multimodal large language model for object-centric robotic manipulation. InIEEE/CVF Conference on Computer Vision and Pat- tern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 18061–18070. IEEE, 2024. doi: ...

work page doi:10.1109/cvpr52733.2024.01710 2024

[40] [41]

Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang

Haotong Lin, Sili Chen, Junhao Liew, Donny Y . Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

Pith/arXiv arXiv 2025

[41] [42]

Fast interactive object annotation with Curve-GCN

Huan Ling, Jun Gao, Amlan Kar, Wenzheng Chen, and Sanja Fidler. Fast interactive object annotation with Curve-GCN. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019

2019

[42] [43]

Joint hand motion and interaction hotspots prediction from egocentric videos

Shaowei Liu, Subarna Tripathi, Somdeb Majumdar, and Xiaolong Wang. Joint hand motion and interaction hotspots prediction from egocentric videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

2022

[43] [44]

a fer, Andrew Wing Keung To, Kuan-Ho Lao, Murat Cubuktepe, Matthew Haley, Peter B \

Zhiyang Liu, Ruiteng Zhao, Lei Zhou, Chengran Yuan, Yuwei Wu, Sheng Guo, Zhengshen Zhang, Chenchen Liu, Marcelo H. Ang, and Francis E. H. Tay. 3d affordance keypoint detection for robotic manipulation. InIEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2024, Abu Dhabi, United Arab Emirates, October 14-18, 2024, pages 7528–7534. IE...

work page doi:10.1109/iros58592.2024.10801792 2024

[44] [45]

Geal: Generalizable 3d affordance learning with cross-modal consistency

Dongyue Lu, Lingdong Kong, Tianxin Huang, and Gim Hee Lee. Geal: Generalizable 3d affordance learning with cross-modal consistency. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1680–1690, June 2025

2025

[45] [46]

Learning affordance grounding from exocentric images

Hongchen Luo, Wei Zhai, Jing Zhang, Yang Cao, and Dacheng Tao. Learning affordance grounding from exocentric images. InCVPR, 2022

2022

[46] [47]

GLOVER++: Unleashing the potential of affordance learning from human behaviors for robotic manipulation

Teli Ma, Jia Zheng, Zifan Wang, Ziyao Gao, Jiaming Zhou, and Junwei Liang. GLOVER++: Unleashing the potential of affordance learning from human behaviors for robotic manipulation. InProceedings of The 9th Conference on Robot Learning (CoRL), volume 305 ofProceedings of Machine Learning Research, pages 3972–3994. PMLR, 2025. URL https://proceedings. mlr.pr...

2025

[47] [48]

Vagnet: Grounding 3d affordance from human-object interactions in videos, 2026

Aihua Mao, Kaihang Huang, Yong-Jin Liu, Chee Seng Chan, and Ying He. Vagnet: Grounding 3d affordance from human-object interactions in videos, 2026. URL https://arxiv.org/ abs/2602.20608

arXiv 2026

[48] [49]

CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks.IEEE Robotics and Automation Letters (RA-L), 7(3):7327–7334, 2022

Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks.IEEE Robotics and Automation Letters (RA-L), 7(3):7327–7334, 2022

2022

[49] [50]

SAM 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025

Meta AI Research. SAM 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025

Pith/arXiv arXiv 2025

[50] [51]

VITRA: Scalable vision-language-action model pretraining for robotic manipulation with real-life human activity videos.arXiv preprint arXiv:2510.21571, 2025

Microsoft VITRA Team. VITRA: Scalable vision-language-action model pretraining for robotic manipulation with real-life human activity videos.arXiv preprint arXiv:2510.21571, 2025

arXiv 2025

[51] [52]

Where2act: From pixels to actions for articulated 3d objects

Kaichun Mo, Leonidas Guibas, Mustafa Mukadam, Abhinav Gupta, and Shubham Tulsiani. Where2act: From pixels to actions for articulated 3d objects. InInternational Conference on Computer Vision (ICCV), 2021

2021

[52] [53]

Teo, Cornelia Fermüller, and Yiannis Aloimonos

Austin Myers, Ching L. Teo, Cornelia Fermüller, and Yiannis Aloimonos. Affordance detection of tool parts from geometric features. InICRA, 2015. 14

2015

[53] [55]

Object-based affordances detection with convolutional neural networks and dense conditional random fields

Anh Nguyen, Dimitrios Kanoulas, Darwin G Caldwell, and Nikos G Tsagarakis. Object-based affordances detection with convolutional neural networks and dense conditional random fields. InIEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017

2017

[54] [56]

Open-vocabulary affordance detection in 3D point clouds

Toan Nguyen, Minh Nhat Vu, An Vuong, Dzung Nguyen, Thieu V o, Ngan Le, and Anh Nguyen. Open-vocabulary affordance detection in 3D point clouds. InIEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023

2023

[55] [57]

Transfer between modalities with MetaQueries.arXiv preprint arXiv:2504.06256, 2025

Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, Ji Hou, and Saining Xie. Transfer between modalities with MetaQueries.arXiv preprint arXiv:2504.06256, 2025

Pith/arXiv arXiv 2025

[56] [58]

In: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp 3347–3356

Shengyi Qian, Weifeng Chen, Min Bai, Xiong Zhou, Zhuowen Tu, and Li Erran Li. Affor- dancellm: Grounding affordance from vision language models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024 - Workshops, Seattle, WA, USA, June 17-18, 2024, pages 7587–7597. IEEE, 2024. doi: 10.1109/CVPRW63382.2024.00754. URL https://doi.org/10...

work page doi:10.1109/cvprw63382.2024.00754 2024

[57] [59]

ToolFlowNet: Robotic manipulation with tools via predicting tool flow from point clouds

Daniel Seita, Yufei Wang, Sarthak Shetty, Edward Li, Zackory Erickson, and David Held. ToolFlowNet: Robotic manipulation with tools via predicting tool flow from point clouds. In Conference on Robot Learning (CoRL), 2022

2022

[58] [60]

GREAT: Geometry-intention collaborative inference for open-vocabulary 3D object affordance grounding

Yawen Shao, Wei Zhai, Yuhang Yang, Hongchen Luo, Yang Cao, and Zheng-Jun Zha. GREAT: Geometry-intention collaborative inference for open-vocabulary 3D object affordance grounding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17326–17336, 2025

2025

[59] [61]

Xiaohao Sun, Hanxiao Jiang, Manolis Savva, and Angel X. Chang. OPDMulti: Openable part detection for multiple objects. InInternational Conference on 3D Vision, 3DV 2024, Davos, Switzerland, March 18-21, 2024, pages 169–178. IEEE, 2024. doi: 10.1109/3DV62453.2024. 00100. URLhttps://doi.org/10.1109/3DV62453.2024.00100

work page doi:10.1109/3dv62453.2024 2024

[60] [62]

Masked depth modeling for spatial perception.arXiv preprint arXiv:2601.17895, 2026

Bin Tan, Changjiang Sun, Xiage Qin, Hanat Adai, Zelin Fu, Tianxiang Zhou, Han Zhang, Yinghao Xu, Xing Zhu, Yujun Shen, and Nan Xue. Masked depth modeling for spatial perception.arXiv preprint arXiv:2601.17895, 2026

arXiv 2026

[61] [63]

UAD: Unsupervised affordance distillation for generalization in robotic manipulation

Yihe Tang, Wenlong Huang, Yingke Wang, Chengshu Li, Roy Yuan, Ruohan Zhang, Jiajun Wu, and Li Fei-Fei. UAD: Unsupervised affordance distillation for generalization in robotic manipulation. InIEEE International Conference on Robotics and Automation (ICRA), 2025

2025

[62] [64]

Roboafford: A dataset and benchmark for enhancing object and spatial affordance learning in robot ma- nipulation

Yingbo Tang, Lingfeng Zhang, Shuyi Zhang, Yinuo Zhao, and Xiaoshuai Hao. Roboafford: A dataset and benchmark for enhancing object and spatial affordance learning in robot ma- nipulation. InProceedings of the 33rd ACM International Conference on Multimedia, pages 12706–12713, 2025

2025

[63] [65]

Cruise: Cooperative reconstruction and editing in v2x scenarios using gaussian splatting

Yingbo Tang, Shuaike Zhang, Xiaoshuai Hao, Pengwei Wang, Jianlong Wu, Zhongyuan Wang, and Shanghang Zhang. Affordgrasp: In-context affordance reasoning for open-vocabulary task- oriented grasping in clutter. InIEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2025, Hangzhou, China, October 19-25, 2025, pages 9433–9439. IEEE, 2025. ...

work page doi:10.1109/iros60139.2025.11245995 2025

[64] [66]

Qwen3.5: Accelerating productivity with native multimodal agents, February

Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February

[65] [67]

URLhttps://qwen.ai/blog?id=qwen3.5

[66] [68]

O3afford: One-shot 3d object-to-object affordance grounding for generalizable robotic manipulation

Tongxuan Tian, Xuhui Kang, and Yen-Ling Kuo. O3afford: One-shot 3d object-to-object affordance grounding for generalizable robotic manipulation. 2025. 15

2025

[67] [69]

InstructPart: Task-oriented part segmentation with instruction reasoning

Zifu Wan, Yaqi Xie, Ce Zhang, Zhiqiu Lin, Zihan Wang, Simon Stepputtis, Deva Ramanan, and Katia Sycara. InstructPart: Task-oriented part segmentation with instruction reasoning. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), 2025

2025

[68] [70]

Videoafford: Grounding 3d affordance from human-object-interaction videos via multimodal large language model,

Hanqing Wang, Mingyu Liu, Xiaoyu Chen, Chengwei Ma, Yiming Zhong, Wenti Yin, Yuhao Liu, Zhiqing Cui, Jiahao Yuan, Lu Dai, Zhiyuan Ma, and Hui Xiong. Videoafford: Grounding 3d affordance from human-object-interaction videos via multimodal large language model,

[69] [71]

URLhttps://arxiv.org/abs/2602.09638

arXiv

[70] [72]

Affordance-R1: Reinforcement learning for generalizable affordance reasoning in multimodal large language models

Hanqing Wang, Shaoyang Wang, Yiming Zhong, Zemin Yang, Jiamin Wang, Zhiqing Cui, Jiahao Yuan, Yifan Han, Mingyu Liu, and Yuexin Ma. Affordance-R1: Reinforcement learning for generalizable affordance reasoning in multimodal large language models. InProceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2026

2026

[71] [73]

AffordBot: 3D fine-grained embodied reasoning via multimodal large language models

Xinyi Wang, Xun Yang, Yanlong Xu, Yuchen Wu, Zhen Li, and Na Zhao. AffordBot: 3D fine-grained embodied reasoning via multimodal large language models. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

2025

[72] [74]

AffordDexGrasp: Open-set language-guided dexterous grasp with generalizable- instructive affordance

Yi-Lin Wei, Mu Lin, Yuhao Lin, Jian-Jian Jiang, Xiao-Ming Wu, Ling-An Zeng, and Wei- Shi Zheng. AffordDexGrasp: Open-set language-guided dexterous grasp with generalizable- instructive affordance. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

2025

[73] [75]

RAGNet: Large-scale reasoning-based affordance segmentation benchmark towards general grasping

Dongming Wu, Yanping Fu, Saike Huang, Yingfei Liu, Fan Jia, Nian Liu, Feng Dai, Tiancai Wang, Rao Muhammad Anwer, Fahad Shahbaz Khan, and Jianbing Shen. RAGNet: Large-scale reasoning-based affordance segmentation benchmark towards general grasping. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

2025

[74] [76]

RoboMIND: Benchmark on multi-embodiment intelligence normative data for robot manipulation.arXiv preprint arXiv:2412.13877, 2024

Kun Wu, Chengkai Hou, Jiaming Liu, Zhengping Che, Xiaozhu Ju, Zhuqin Yang, Meng Li, Yinuo Zhao, Zhiyuan Xu, Guang Yang, et al. RoboMIND: Benchmark on multi-embodiment intelligence normative data for robot manipulation.arXiv preprint arXiv:2412.13877, 2024

Pith/arXiv arXiv 2024

[75] [77]

V AT-mart: Learning visual action trajectory proposals for manipulating 3d ARTiculated objects

Ruihai Wu, Yan Zhao, Kaichun Mo, Zizheng Guo, Yian Wang, Tianhao Wu, Qingnan Fan, Xuelin Chen, Leonidas Guibas, and Hao Dong. V AT-mart: Learning visual action trajectory proposals for manipulating 3d ARTiculated objects. InInternational Conference on Learning Representations, 2022. URLhttps://openreview.net/forum?id=iEx3PiooLy

2022

[76] [78]

AffordDP: Generalizable diffusion policy with transferable affordance

Shijie Wu, Yihang Zhu, Yunao Huang, Kaizhen Zhu, Jiayuan Gu, Jingyi Yu, Ye Shi, and Jingya Wang. AffordDP: Generalizable diffusion policy with transferable affordance. InProceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 6971–6980, 2025

2025

[77] [79]

Sonata: Self-supervised learning of reliable point representations

Xiaoyang Wu, Daniel DeTone, Duncan Frost, Tianwei Shen, Chris Xie, Nan Yang, Jakob Engel, Richard Newcombe, Hengshuang Zhao, and Julian Straub. Sonata: Self-supervised learning of reliable point representations. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

2025

[78] [80]

Weakly-supervised affordance grounding guided by part-level semantic priors

Peiran Xu and Yadong MU. Weakly-supervised affordance grounding guided by part-level semantic priors. InThe Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview.net/forum?id=0823rvTIhs

2025

[79] [81]

A0: An affordance-aware hierarchical model for general robotic manipulation

Rongtao Xu, Jian Zhang, Minghao Guo, Youpeng Wen, Haoting Yang, Min Lin, Jianzheng Huang, Zhe Li, Kaidong Zhang, Liqiong Wang, Yuxuan Kuang, Meng Cao, Feng Zheng, and Xiaodan Liang. A0: An affordance-aware hierarchical model for general robotic manipulation. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

2025

[80] [82]

Ground- ing 3d object affordance from 2d interactions in images

Yuhang Yang, Wei Zhai, Hongchen Luo, Yang Cao, Jiebo Luo, and Zheng-Jun Zha. Ground- ing 3d object affordance from 2d interactions in images. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10905–10915, October 2023

2023