AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding
Pith reviewed 2026-06-28 01:21 UTC · model grok-4.3
The pith
AffordanceVLA adds structured affordance forecasting as an intermediate step to tighten the link from vision-language inputs to robotic actions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AffordanceVLA progressively models manipulation priors through three components: Which2Act (object-centric grounding via visual latent prediction), Where2Act (2D interaction localization via affordance map estimation), and How2Act (3D geometric reasoning). The paper claims that these supply spatially grounded, semantically conditioned, and action-coupled intermediate representations that bridge vision, language, and action inside a Mixture-of-Transformer architecture, and that, after three-stage training with an automated data-augmentation pipeline, they yield stronger performance on diverse manipulation tasks.
What carries the argument
Three complementary affordance modules (Which2Act, Where2Act, and How2Act) inside a Mixture-of-Transformer with specialized experts, trained via a progressive data curriculum.
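To make "specialized experts" concrete, here is a toy sketch of per-token expert dispatch in the spirit of a Mixture-of-Transformer. It is not the paper's implementation: the expert names (`vlm`, `affordance`, `action`), the scalar "features", and the function `route_tokens` are all hypothetical, and the shared self-attention that a real MoT applies across the whole sequence is omitted.

```python
from typing import Callable, Dict, List, Tuple

# A token is (expert tag, toy scalar feature). Real tokens are vectors;
# scalars keep the sketch dependency-free.
Token = Tuple[str, float]


def route_tokens(
    tokens: List[Token],
    experts: Dict[str, Callable[[float], float]],
) -> List[float]:
    """Apply to each token the expert matching its tag, preserving order."""
    outputs = []
    for tag, value in tokens:
        if tag not in experts:
            raise KeyError(f"no expert registered for tag {tag!r}")
        outputs.append(experts[tag](value))
    return outputs


# Hypothetical expert partition (an assumption, not taken from the paper).
EXPERTS: Dict[str, Callable[[float], float]] = {
    "vlm": lambda x: x * 2.0,         # stands in for the vision-language expert
    "affordance": lambda x: x + 1.0,  # stands in for the affordance expert
    "action": lambda x: -x,           # stands in for the action expert
}

sequence = [("vlm", 1.0), ("vlm", 2.0), ("affordance", 3.0), ("action", 4.0)]
routed = route_tokens(sequence, EXPERTS)
```

The point of the sketch is only that tokens of different types pass through different parameters while staying in one ordered sequence; how AffordanceVLA actually partitions its experts is not stated in the text above.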
If this is right
- Object-centric visual latent prediction suppresses visual distractions before action planning.
- Affordance-map estimation supplies explicit 2D localization for interaction points.
- 3D geometric reasoning from the How2Act module directly shapes the output manipulation policy.
- The three-stage training strategy with its progressive data curriculum lets the model learn the affordance representations before full action generation.
- The overall pipeline produces measurable gains in both simulated and real-world manipulation success rates.
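One way to picture a three-stage curriculum is as a schedule that switches loss terms on over time. The sketch below is purely illustrative: the source says only that there are three stages with a progressive data curriculum, so the stage names, the assignment of losses to stages, and the weights are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Stage:
    name: str
    loss_weights: Dict[str, float]  # loss term -> weight; absent terms are off


# Hypothetical schedule: affordance objectives first, action loss last.
CURRICULUM: List[Stage] = [
    Stage("stage1_grounding", {"which2act": 1.0}),
    Stage("stage2_affordance", {"which2act": 1.0, "where2act": 1.0, "how2act": 1.0}),
    Stage(
        "stage3_action",
        {"which2act": 0.5, "where2act": 0.5, "how2act": 0.5, "action": 1.0},
    ),
]


def stage_loss(stage: Stage, losses: Dict[str, float]) -> float:
    """Weighted sum of the loss terms that are active in this stage."""
    return sum(weight * losses[name] for name, weight in stage.loss_weights.items())


# Toy per-term loss values, used only to exercise the function.
example_losses = {"which2act": 0.4, "where2act": 0.2, "how2act": 0.6, "action": 1.0}
per_stage = [stage_loss(stage, example_losses) for stage in CURRICULUM]
```

Under this assumed schedule the action loss contributes nothing until the final stage, which is the property the bullet above describes.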
Where Pith is reading between the lines
- The same intermediate-representation strategy could be tested on navigation or multi-step assembly tasks that also require bridging high-level instructions to low-level control.
- If the augmentation pipeline scales reliably, it could reduce dependence on manually labeled robotic data when extending the model to new object categories.
- The modular separation of grounding, localization, and geometry might allow independent upgrades to any one component without retraining the entire policy head.
Load-bearing premise
The automated data augmentation pipeline can produce dense, accurate affordance labels in sufficient quantity to train the three modules despite the scarcity of such labels in existing robotic datasets.
What would settle it
A head-to-head comparison on identical manipulation benchmarks showing that removing the three affordance modules and training only the base VLA yields equal or higher success rates.
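A minimal sketch of how such a head-to-head comparison could be scored, assuming each model is run for a fixed number of independent trials per benchmark. The function name `success_rate_gap` and the use of a normal-approximation (Wald) interval are choices made here for illustration, and the counts in the example are placeholders, not results from the paper.

```python
import math
from typing import Tuple


def success_rate_gap(
    succ_full: int, n_full: int, succ_base: int, n_base: int, z: float = 1.96
) -> Tuple[float, float, float]:
    """Return (gap, low, high): the difference in success rate between the
    full model and the ablated baseline, with an approximate 95% interval."""
    if n_full <= 0 or n_base <= 0:
        raise ValueError("trial counts must be positive")
    p_full = succ_full / n_full
    p_base = succ_base / n_base
    gap = p_full - p_base
    std_err = math.sqrt(
        p_full * (1 - p_full) / n_full + p_base * (1 - p_base) / n_base
    )
    return gap, gap - z * std_err, gap + z * std_err


# Placeholder counts only: 45/50 successes with modules, 35/50 without.
gap, low, high = success_rate_gap(45, 50, 35, 50)

# The affordance modules are supported only if the whole interval is above 0.
modules_help = low > 0
```

If the interval includes zero or lies below it, the comparison would count against the claim that the affordance modules drive the gains.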
Original abstract
Vision-Language-Action (VLA) models leverage the rich world knowledge of pretrained vision-language models (VLMs) to enable instruction-following robotic manipulation. However, the structural mismatch between VLM semantic spaces and embodied control policies often hinders the learning of precise perception-action mappings. To address this challenge, we propose AffordanceVLA, a unified framework that introduces structured affordance forecasting as a task-oriented intermediate representation to establish a more precise and robust perception-action mapping. Specifically, we progressively model manipulation priors through three complementary components: 1) Which2Act for object-centric grounding via visual latent prediction to suppress distractions; 2) Where2Act for 2D interaction localization via affordance map estimation; and 3) How2Act for 3D geometric reasoning to guide manipulation policies. These affordance cues provide spatially grounded, semantically conditioned, and action-coupled intermediate representations, thereby naturally bridging vision, language, and action. We integrate these modules into a Mixture-of-Transformer (MoT) architecture with specialized experts and train the model using a three-stage training strategy with a progressive data curriculum. To overcome the scarcity of dense affordance labels in robotic datasets, we also develop a robust automated data augmentation pipeline. Extensive experiments in simulation and the real world demonstrate that AffordanceVLA achieves strong performance across diverse manipulation scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AffordanceVLA, a VLA model that introduces structured affordance forecasting as an intermediate representation via three modules, Which2Act (object-centric grounding via visual latent prediction), Where2Act (2D affordance map estimation), and How2Act (3D geometric reasoning), integrated into a Mixture-of-Transformer (MoT) architecture with specialized experts. It uses a three-stage training strategy with a progressive data curriculum and an automated data augmentation pipeline to address the scarcity of dense affordance labels in robotic datasets. The paper claims that this establishes a more precise perception-action mapping and yields strong performance across simulation and real-world manipulation scenarios.
Significance. If the empirical claims hold after validation, the work could advance VLA models by demonstrating how task-oriented affordance priors can bridge VLM semantic spaces with embodied control, using complementary 2D/3D cues and MoT experts. The progressive curriculum and pipeline for label generation are practical contributions worth testing in follow-up work.
Major comments (2)
- [Abstract and pipeline description in Methods] The automated data augmentation pipeline (described in the abstract and methods) is presented as 'robust' and central to training the three affordance modules, yet no quantitative validation is provided (e.g., label accuracy vs. human ground truth, spatial precision metrics, or sensitivity analysis). This is load-bearing for the central claim that the modules produce 'spatially grounded, semantically conditioned' representations; without it, reported gains could arise from the MoT backbone or VLM pretraining rather than the affordance components.
- [Abstract and Experiments section] The abstract states 'extensive experiments... demonstrate that AffordanceVLA achieves strong performance' but supplies no quantitative results, baselines, metrics, or ablation studies on the individual modules (Which2Act/Where2Act/How2Act). This prevents assessment of whether the affordance forecasting improves the perception-action mapping as claimed.
Minor comments (1)
- [Architecture description] Clarify the precise mechanism by which the three affordance outputs are fused into the MoT experts and action decoder (e.g., via equations or a diagram in the architecture section).
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which identifies key areas where additional evidence is needed to support our claims about the affordance modules and data pipeline. We address each major comment below.
Point-by-point responses
- Referee: [Abstract and pipeline description in Methods] The automated data augmentation pipeline (described in the abstract and methods) is presented as 'robust' and central to training the three affordance modules, yet no quantitative validation is provided (e.g., label accuracy vs. human ground truth, spatial precision metrics, or sensitivity analysis). This is load-bearing for the central claim that the modules produce 'spatially grounded, semantically conditioned' representations; without it, reported gains could arise from the MoT backbone or VLM pretraining rather than the affordance components.
  Authors: We agree that the manuscript currently lacks quantitative validation of the automated data augmentation pipeline, which is a substantive gap given its role in generating labels for the affordance modules. The description in the abstract and methods presents the pipeline as robust without supporting metrics. In the revised manuscript, we will add a new evaluation subsection that reports label accuracy against human ground truth on a sampled subset of data, along with spatial precision metrics (e.g., IoU for 2D affordance maps and 3D geometric error) and a sensitivity analysis to key augmentation parameters. This will help demonstrate that the gains stem from the affordance components rather than the backbone alone.
  Revision: yes
- Referee: [Abstract and Experiments section] The abstract states 'extensive experiments... demonstrate that AffordanceVLA achieves strong performance' but supplies no quantitative results, baselines, metrics, or ablation studies on the individual modules (Which2Act/Where2Act/How2Act). This prevents assessment of whether the affordance forecasting improves the perception-action mapping as claimed.
  Authors: The abstract is intentionally high-level and does not include numbers, which is conventional. However, we acknowledge that the experiments section as described does not provide the requested quantitative details, baselines, metrics, or module ablations, limiting the ability to evaluate the contribution of the affordance forecasting. We will revise the experiments section to include detailed tables with performance metrics on simulation and real-world tasks, comparisons against relevant VLA baselines, and ablation studies that isolate the effect of each module (Which2Act, Where2Act, How2Act) on overall task success and perception-action alignment.
  Revision: yes
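The label-quality metrics named in the first response (IoU for 2D affordance maps and a 3D geometric error) could be computed roughly as follows. This is a sketch under stated assumptions: the 0.5 binarization threshold, the choice of mean Euclidean distance as the 3D error, and the function names `mask_iou` and `mean_point_error` are all illustrative and not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def mask_iou(
    pred: Sequence[Sequence[float]],
    truth: Sequence[Sequence[int]],
    threshold: float = 0.5,
) -> float:
    """Intersection-over-union between a thresholded predicted affordance
    map and a binary human-annotated mask of the same shape."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            pred_on = p >= threshold
            truth_on = bool(t)
            intersection += int(pred_on and truth_on)
            union += int(pred_on or truth_on)
    # Two empty masks agree perfectly by convention.
    return intersection / union if union else 1.0


def mean_point_error(pred: List[Point3D], truth: List[Point3D]) -> float:
    """Mean Euclidean distance between paired predicted and reference points."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("need equally sized, non-empty point lists")
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(pred)


# Toy inputs used only to exercise the functions.
iou = mask_iou([[0.9, 0.2], [0.7, 0.6]], [[1, 0], [0, 1]])
err = mean_point_error([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 0, 2)])
```

Reporting both numbers on a human-labeled subset would show directly how accurate the automatically generated affordance labels are.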
Circularity Check
No significant circularity detected
Full rationale
The paper introduces an empirical VLA architecture with three new affordance modules (Which2Act, Where2Act, How2Act) and a data-augmentation pipeline as practical engineering components, trained end-to-end on robotic tasks. No equations, uniqueness theorems, or first-principles derivations are presented that reduce by construction to fitted parameters or self-citations; performance claims rest on experimental results rather than self-referential definitions. The approach is self-contained against external benchmarks and does not invoke load-bearing self-citations or rename known results as novel derivations.
Axiom & Free-Parameter Ledger
invented entities (1)
- Affordance forecasting modules (Which2Act, Where2Act, How2Act): no independent evidence