pith. machine review for the scientific record.

arxiv: 2604.05614 · v1 · submitted 2026-04-07 · 💻 cs.RO

Recognition: no theorem link

Grounding Hierarchical Vision-Language-Action Models Through Explicit Language-Action Alignment

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:38 UTC · model grok-4.3

classification 💻 cs.RO
keywords: vision-language-action models · hierarchical VLA · contrastive learning · language-action alignment · offline preference learning · robot grounding · LanguageTable dataset · robot transparency

The pith

A contrastive model ranks language-trajectory pairs to explicitly ground hierarchical vision-language-action models without full supervision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a training framework for hierarchical vision-language-action (VLA) models that explicitly aligns sub-task language descriptions with visual observations and action trajectories. A contrastive model evaluates and ranks different language-trajectory pairs by their alignment, and this ranking drives offline preference learning to refine the VLA model. Applied to the LanguageTable dataset of annotated robot trajectories, the approach reaches performance comparable to fully supervised fine-tuning while reducing the need for extensive annotation. This matters for robot transparency: in human collaboration, a robot's words should match its deeds.
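As a concrete sketch of the ranking step, a contrastive scorer can rank candidate sub-task descriptions against a trajectory embedding. The embeddings and the cosine scorer below are illustrative stand-ins, not the paper's actual architecture:

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_pairs(lang_embs, traj_emb):
    """Score each candidate sub-task description against one trajectory
    embedding; return (score, index) pairs, best-aligned first."""
    scores = [(cosine(l, traj_emb), i) for i, l in enumerate(lang_embs)]
    return sorted(scores, reverse=True)

# Toy embeddings: candidate 0 is well aligned with the trajectory,
# candidate 1 is not.
candidates = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9]]
trajectory = [1.0, 0.0, 0.1]
ranking = rank_pairs(candidates, trajectory)
best = ranking[0][1]  # index of the preferred language-trajectory pair
```

The ranked pairs then serve as the preference signal for offline optimization: the highest-scoring pair becomes the "preferred" sample, the lowest the "dispreferred" one.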

Core claim

The central claim is that training a contrastive model to assess the alignment between generated language and action trajectories, and using it to rank pairs for offline preference learning, grounds hierarchical VLA models explicitly in the task and environment. On the LanguageTable dataset, this yields performance comparable to supervised fine-tuning at a substantially reduced annotation cost.

What carries the argument

A contrastive model for assessing and ranking the alignment between language descriptions and corresponding action trajectories, which enables preference learning to refine the VLA grounding.
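One common form of offline preference learning from such rankings is a DPO-style objective. The paper's exact objective is not specified in this summary, so the loss below is a minimal sketch of the general mechanism:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO-style offline preference loss for one ranked pair.

    logp_w / logp_l: policy log-likelihoods of the preferred (winner)
    and dispreferred (loser) language-trajectory pair; ref_* are the
    frozen reference model's log-likelihoods."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# The loss shrinks as the policy puts relatively more probability mass
# on the pair the contrastive ranker preferred.
loose = dpo_loss(logp_w=-5.0, logp_l=-5.0, ref_logp_w=-5.0, ref_logp_l=-5.0)
tight = dpo_loss(logp_w=-4.0, logp_l=-6.0, ref_logp_w=-5.0, ref_logp_l=-5.0)
```

With no separation between winner and loser the loss sits at log 2; widening the gap drives it toward zero, which is what lets the contrastive ranking refine the VLA without fresh human labels.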

If this is right

  • Hierarchical VLA models produce language that is more consistent with their executed actions.
  • The need for costly human annotations for training data is reduced.
  • Robots achieve greater transparency through explicit multimodal grounding.
  • Performance on language-annotated trajectory tasks matches that of fully supervised approaches.
  • Insights are provided into multimodal grounding representations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This alignment approach could be applied to other robot learning domains to improve interpretability.
  • Future work might test whether the contrastive rankings generalize to real-world robot deployments beyond the LanguageTable benchmark.
  • It suggests a path to reduce reliance on large labeled datasets in vision-language-action systems.

Load-bearing premise

The contrastive model reliably identifies true alignments between language and actions rather than learning spurious correlations from the dataset.

What would settle it

The claim would be undermined by a direct comparison showing that, after preference learning, the VLA model's generated language ranks no higher in alignment score against its own actions than before, or that task performance on LanguageTable does not improve over a non-aligned baseline.

Figures

Figures reproduced from arXiv: 2604.05614 by Angelo Cangelosi, Federico Tavella, Manith Adikari, Rahul Singh Maharjan, Theodor Wulff.

Figure 1. Method Overview. We extend a regular VLA (left) into a hierarchical VLA by adding a high-level VLM module to break a high-level instruction down into executable low-level instructions (center), following recent trends on hierarchical VLAs [4, 45]. To align the intermediate low-level instruction and the generated trajectory, we invoke a separately trained ranking model, which ranks N sampled output pairs ba…
Figure 2. Action-Conditioned Grounding Model. We extend a pre-trained SigLIP 2 by conditioning the visual features of the SigLIP 2 Vision Encoder on the encoded trajectories. Using a contrastive loss, we align the vision-action pairs with the low-level instructions.
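The caption describes conditioning the vision encoder's features on encoded trajectories. One standard mechanism for this kind of feature-wise conditioning is FiLM [40], which appears in the reference list; whether the paper uses exactly this scheme is not stated here, so the sketch below is a hypothetical illustration with toy shapes:

```python
def film_condition(visual_feats, traj_code, gamma_w, beta_w):
    """FiLM-style conditioning: scale and shift each visual feature
    channel by affine parameters predicted from the trajectory code.
    gamma_w / beta_w are (len(traj_code) x len(visual_feats)) weight
    matrices of a hypothetical linear predictor."""
    def linear(w):
        return [sum(t * w[i][j] for i, t in enumerate(traj_code))
                for j in range(len(visual_feats))]
    gamma, beta = linear(gamma_w), linear(beta_w)
    return [g * v + b for v, g, b in zip(visual_feats, gamma, beta)]

# Two visual channels conditioned on a one-dimensional trajectory code.
conditioned = film_condition([3.0, 4.0], [1.0], [[2.0, 0.5]], [[0.0, 1.0]])
```

The conditioned visual embedding can then be contrasted against the low-level instruction embedding exactly as in an unconditioned SigLIP setup.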
Figure 4. Quantitative Evaluation of Generated Trajectories.
Figure 5. t-SNE visualizations of all grounding models on the LanguageTable dataset using visual inputs and low-level instructions.
Figure 6. Prompt template used for robotic agent instructions.
Original abstract

Achieving robot transparency is a critical step toward effective human-robot collaboration. To be transparent, a robot's natural language communication must be consistent with its actions and explicitly grounded in the task and environment. Existing hierarchical Vision-Language-Action (VLA) models can generate language (e.g., through chain-of-thought) and low-level actions. However, current work does not consider explicit alignment between these modalities during training. To address this crucial gap, we propose a novel training framework that explicitly grounds hierarchical VLA sub-task descriptions with respect to the visual observation and action space. Our framework uses a contrastive model to assess the alignment between generated language and corresponding action trajectories. This contrastive model enables direct ranking of different language-trajectory pairs based on their alignment, allowing us to refine the grounding of our hierarchical VLA through offline preference learning. We apply our framework to the LanguageTable dataset, a benchmark dataset of human language-annotated trajectories, and provide critical insights into multimodal grounding representations, all while establishing a strong baseline that achieves performance comparable to fully supervised fine-tuning and minimizing the need for costly data annotations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a novel training framework for hierarchical Vision-Language-Action (VLA) models that explicitly aligns generated sub-task language descriptions with visual observations and action trajectories. It introduces a contrastive model to score alignment, enabling ranking of language-trajectory pairs and subsequent refinement of the VLA via offline preference learning. The framework is evaluated on the LanguageTable dataset of human-annotated trajectories and claims to achieve performance comparable to fully supervised fine-tuning while reducing reliance on costly annotations.

Significance. If the experimental claims hold with proper validation, the work could offer a practical method for improving multimodal grounding and transparency in robotic VLA systems without proportional increases in human annotation effort. The contrastive ranking approach for preference optimization represents a potentially useful direction for self-supervised alignment in robotics, though its independence from existing annotations must be demonstrated.

major comments (2)
  1. Abstract: The central claim of achieving 'performance comparable to fully supervised fine-tuning' is presented without any quantitative metrics, ablation studies, error bars, or baseline comparisons. This absence makes it impossible to evaluate whether the contrastive alignment step contributes meaningfully beyond standard supervised training.
  2. Framework description (as summarized in abstract): The contrastive model is described as assessing alignment between generated language and action trajectories to produce preference rankings, but no details are given on its training data, pre-training, or validation. If it is derived from the same human-annotated LanguageTable trajectories used for the VLA, the offline preference learning risks circularity, simply re-encoding existing annotation patterns rather than supplying independent grounding information.
minor comments (2)
  1. The abstract would benefit from explicit mention of the evaluation metrics (e.g., success rate, language consistency score) used to claim comparability with supervised fine-tuning.
  2. Notation for the contrastive scoring function and preference optimization objective should be introduced with clear definitions to aid reproducibility.
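On minor comment 2, one plausible notation, assuming the SigLIP-style sigmoid contrastive objective the figure captions suggest (all symbols here are illustrative, not taken from the paper):

```latex
% Sigmoid contrastive objective over a batch of N instruction--(vision, action) pairs.
% s_{ij}: similarity between instruction embedding i and the action-conditioned
% visual embedding j; z_{ij} = +1 if i = j (matched pair), else -1;
% \tau and b: learnable temperature and bias, as in SigLIP.
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N}
  \log \sigma\!\left( z_{ij} \, (\tau \, s_{ij} + b) \right)
```

A definition in this form, together with the preference optimization objective it feeds, would make the ranking step reproducible.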

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments. We address each major comment point by point below, providing clarifications from the manuscript and indicating where revisions will strengthen the presentation of our results and methods.

Point-by-point responses
  1. Referee: Abstract: The central claim of achieving 'performance comparable to fully supervised fine-tuning' is presented without any quantitative metrics, ablation studies, error bars, or baseline comparisons. This absence makes it impossible to evaluate whether the contrastive alignment step contributes meaningfully beyond standard supervised training.

    Authors: We agree that the abstract, as a concise summary, does not include specific quantitative metrics or ablations, which limits immediate evaluation of the claim. The full manuscript reports these details in the Experiments section, with tables comparing our method to fully supervised fine-tuning on LanguageTable, ablations isolating the contrastive alignment and offline preference learning components, and performance metrics with error bars from multiple random seeds. These results support the comparability claim while showing reduced annotation requirements. To address the concern directly, we will revise the abstract to include key quantitative highlights (e.g., relative success rates or performance deltas) within length constraints, making the central contribution clearer to readers. revision: yes

  2. Referee: Framework description (as summarized in abstract): The contrastive model is described as assessing alignment between generated language and action trajectories to produce preference rankings, but no details are given on its training data, pre-training, or validation. If it is derived from the same human-annotated LanguageTable trajectories used for the VLA, the offline preference learning risks circularity, simply re-encoding existing annotation patterns rather than supplying independent grounding information.

    Authors: We thank the referee for raising this important point on potential circularity. The manuscript outlines the high-level framework but indeed provides limited implementation specifics for the contrastive model. In our setup, the contrastive model is trained on LanguageTable trajectories to learn general multimodal alignment between language descriptions, visual observations, and action sequences using a contrastive loss; it is not simply a re-encoding of the VLA's supervised objective. Preference rankings are then derived by having the VLA generate alternative language-trajectory pairs, which the contrastive model scores to create a self-supervised preference signal for offline optimization. This supplies additional grounding beyond the original human annotations. To resolve the concern and demonstrate independence, we will add a dedicated subsection in the Methods detailing the contrastive model's training data splits, architecture, any pre-training, validation metrics, and how generated pairs avoid direct re-use of annotation patterns. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces a contrastive model to assess language-trajectory alignment and applies it for ranking and offline preference learning on the LanguageTable dataset. No equations, self-citations, or explicit training details in the provided text demonstrate that any prediction or ranking step reduces by construction to the input annotations or fitted parameters. The framework is presented as adding an explicit alignment mechanism during training, and the claim of minimizing annotation needs is not shown to be tautological with the dataset usage. The derivation chain remains self-contained without load-bearing reductions to prior inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no equations or training details provided, so free parameters, axioms, and invented entities cannot be enumerated beyond the high-level components named.

pith-pipeline@v0.9.0 · 5506 in / 1292 out tokens · 24237 ms · 2026-05-10T19:38:20.092567+00:00 · methodology


Reference graph

Works this paper leans on

61 extracted references · 13 canonical work pages · 10 internal anchors

  1. [1] Muhammad Awais, Muzammal Naseer, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Foundation Models Defining a New Era in Vision: A Survey and Outlook. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(4):2245–2264, 2025.
  2. [2] Randall Balestriero, Mark Ibrahim, Vlad Sobal, Ari Morcos, Shashank Shekhar, Tom Goldstein, Florian Bordes, Adrien Bardes, Gregoire Mialon, Yuandong Tian, Avi Schwarzschild, Andrew Gordon Wilson, Jonas Geiping, Quentin Garrido, Pierre Fernandez, Amir Bar, Hamed Pirsiavash, Yann LeCun, and Micah Goldblum. A Cookbook of Self-Supervised Learning. arXiv preprint arXiv:2304.12210, 2023.
  3. [3] Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005.
  4. [4] Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quan Vuong, Jonathan Tompson, Yevgen Chebotar, Debidatta Dwibedi, and Dorsa Sadigh. RT-H: Action Hierarchies using Language. In Robotics: Science and Systems.
  5. [5] Ondrej Biza, Thomas Weng, Lingfeng Sun, Karl Schmeckpeper, Tarik Kelestemur, Yecheng Jason Ma, Robert Platt, Jan-Willem van de Meent, and Lawson L.S. Wong. On-Robot Reinforcement Learning with Goal-Contrastive Rewards. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 4797–4805, 2025.
  6. [6] Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, Joel Jang, Zhenyu Jiang, Jan Kautz, Kaushil Kundalia, Lawrence Lao, Zhiqi Li, Zongyu Lin, Kevin Lin, Guilin Liu, Edith LLontop, Loic Magne, Ajay Mandlekar, Avnish Narayan, Soroush Nasiriany, Scott Reed, You Liang Tan, Guanz… GR00T N1: An Open Foundation Model for Generalist Humanoid Robots.
  7. [7] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. π0: A Visi…
  8. [8] Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. …
  9. [9] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, Nikhil J. Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Utsav …
  10. [10] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A Simple Framework for Contrastive Learning of Visual Representations. In Proceedings of the 37th International Conference on Machine Learning, pages 1597–
  11. [11] An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models. In Advances in Neural Information Processing Systems, pages 135062–135093. Curran Associates, Inc., 2024.
  12. [12] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep Reinforcement Learning from Human Preferences. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017.
  13. [13] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurélien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozière, B…
  14. [14] Benjamin Eysenbach, Tianjun Zhang, Sergey Levine, and Ruslan Salakhutdinov. Contrastive Learning as Goal-Conditioned Reinforcement Learning. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022.
  15. [15] Giorgio Giannone, Ruoteng Li, Qianli Feng, Evgeny Perevodchikov, Rui Chen, and Aleix Martinez. Feedback-Driven Vision-Language Alignment with Minimal Human Supervision, 2025. arXiv:2501.04568 [cs.CV].
  16. [16] Zhao Han, Elizabeth Phillips, and Holly A. Yanco. The Need for Verbal Robot Explanations and How People Would Like a Robot to Explain Itself. J. Hum.-Robot Interact., 10(4),
  17. [17] Stevan Harnad. The Symbol Grounding Problem. Physica D: Nonlinear Phenomena, 42(1):335–346, 1990.
  18. [18] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum Contrast for Unsupervised Visual Representation Learning. CVPR, 2020.
  19. [19] Brian Ichter, Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, Dmitry Kalashnikov, Sergey Levine, Yao Lu, Carolina Parada, Kanishka Rao, Pierre Sermanet, Alexander T. Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Mengyuan Yan, Noah Brown, Michael Ahn, …
  20. [20] Donald Joseph Hejna III and Dorsa Sadigh. Few-Shot Preference Learning for Human-in-the-Loop RL. In 6th Annual Conference on Robot Learning, 2022.
  21. [21] Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-Bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Lucas Bey…
  22. [22] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. OpenVLA: An Open-Source Vision-Language-Action Model. In Proceedings of The 8th Conference on…
  23. [23] Kimin Lee, Laura M. Smith, and P. Abbeel. PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training. In International Conference on Machine Learning, 2021.
  24. [24] Chin-Yew Lin. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, 2004. Association for Computational Linguistics.
  25. [25] Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive Language: Talking to Robots in Real Time. IEEE Robotics and Automation Letters, pages 1–8, 2023.
  26. [26] Teli Ma, Jiaming Zhou, Zifan Wang, Ronghe Qiu, and Junwei Liang. Contrastive Imitation Learning for Language-guided Multi-Task Robotic Manipulation. In Proceedings of The 8th Conference on Robot Learning, pages 4651–4669. PMLR, 2025.
  27. [27] Yueen Ma, Zixing Song, Yuzheng Zhuang, Jianye Hao, and Irwin King. A Survey on Vision-Language-Action Models for Embodied AI, 2025. arXiv:2405.14093 [cs.RO].
  28. [28] Yecheng Jason Ma, Vikash Kumar, Amy Zhang, Osbert Bastani, and Dinesh Jayaraman. LIV: Language-Image Representations and Rewards for Robotic Control. In Workshop on Reincarnating Reinforcement Learning at ICLR 2023, 2023.
  29. [29] Yecheng Jason Ma, Shagun Sodhani, Dinesh Jayaraman, Osbert Bastani, Vikash Kumar, and Amy Zhang. VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training, 2023. arXiv:2210.00030 [cs.RO].
  30. [30] Oier Mees, Dibya Ghosh, Karl Pertsch, Kevin Black, Homer Rich Walke, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An Open-Source Generalist Robot Policy. In First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024, 2024.
  31. [31] Yu Meng, Mengzhou Xia, and Danqi Chen. SimPO: Simple Preference Optimization with a Reference-Free Reward. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
  32. [32] Vivek Myers, Erdem Bıyık, and Dorsa Sadigh. Active reward learning from online preferences. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 7511–7518, 2023.
  33. [33] Suraj Nair, Eric Mitchell, Kevin Chen, Brian Ichter, Silvio Savarese, and Chelsea Finn. Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation. In Proceedings of the 5th Conference on Robot Learning, pages 1303–1315. PMLR, 2022.
  34. [34] Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3M: A Universal Visual Representation for Robot Manipulation. In 6th Annual Conference on Robot Learning (CoRL), 2022.
  35. [35] OpenAI. GPT-4 Technical Report, 2023. arXiv preprint arXiv:2303.08774 [cs.CL].
  36. [36] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human fe…
  37. [37] Abby O'Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Anikait Singh, Animesh Garg, Aniruddha Kembhavi, Annie Xie, Anthony Brohan, Antonin Raf…
  38. [38] Norman Di Palo and Edward Johns. Keypoint Action Tokens Enable In-Context Imitation Learning in Robotics. In First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024, 2024.
  39. [39] Kishore Papineni et al. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2002.
  40. [40] Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual Reasoning with a General Conditioning Layer. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), 2018.
  41. [41] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  42. [42] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. In Advances in Neural Information Processing Systems, pages 53728–53741. Curran Associates, Inc., 2023.
  43. [43] Roya Salehzadeh, Jiaqi Gong, and Nader Jalili. Purposeful Communication in Human–Robot Collaboration: A Review of Modern Approaches in Manufacturing. IEEE Access, 10:129344–129361, 2022.
  44. [44] Lin Shao, Toki Migimatsu, Qiang Zhang, Karen Yang, and Jeannette Bohg. Concept2Robot: Learning Manipulation Concepts from Instructions and Human Demonstrations. In Proceedings of Robotics: Science and Systems (RSS), 2020.
  45. [45] Lucy Xiaoyang Shi, Brian Ichter, Michael Robert Equi, Liyiming Ke, Karl Pertsch, Quan Vuong, James Tanner, Anna Walling, Haohuan Wang, Niccolo Fusai, Adrian Li-Bell, Danny Driess, Lachy Groom, Sergey Levine, and Chelsea Finn. Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models. In Forty-second International Confer…
  46. [46] Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, Simon Alibert, Matthieu Cord, Thomas Wolf, and Remi Cadene. SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics, 2025. arXiv:2506.01844 [cs.RO].
  47. [47] Sumedh Anand Sontakke, Jesse Zhang, Séb Arnold, Karl Pertsch, Erdem Biyik, Dorsa Sadigh, Chelsea Finn, and Laurent Itti. RoboCLIP: One Demonstration is Enough to Learn Robot Policies. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  48. [48] Simon Stepputtis, Joseph Campbell, Mariano Phielipp, Stefan Lee, Chitta Baral, and Heni Ben Amor. Language-Conditioned Imitation Learning for Robot Manipulation Tasks. In Advances in Neural Information Processing Systems, pages 13139–13150. Curran Associates, Inc., 2020.
  49. [49] Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to Summarize with Human Feedback. In Advances in Neural Information Processing Systems, pages 3008–3021. Curran Associates, Inc., 2020.
  50. [50] Juan Terven, Diana-Margarita Cordova-Esparza, Julio-Alejandro Romero-González, Alfonso Ramírez-Pedraza, and E. A. Chávez-Urbiola. A Comprehensive Survey of Loss Functions and Metrics in Deep Learning. Artificial Intelligence Review, 58(7):195, 2025.
  51. [51] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features.
  52. [52] Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman. Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting. In Advances in Neural Information Processing Systems, pages 74952–74965. Curran Associates, Inc., 2023.
  53. [53] Yufei Wang, Zhanyi Sun, Jesse Zhang, Zhou Xian, Erdem Biyik, David Held, and Zackory Erickson. RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback. In Proceedings of the 41st International Conference on Machine Learning, 2024.
  54. [54] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
  55. [55] Theodor Wulff, Rahul Singh Maharjan, Xinyun Chi, and Angelo Cangelosi. Joint Action Language Modelling for Transparent Policy Execution, 2025. arXiv:2504.10055 [cs.RO].
  56. [56] Daniel Yang, Davin Tjia, Jacob Berg, Dima Damen, Pulkit Agrawal, and Abhishek Gupta. Rank2Reward: Learning Shaped Reward Functions from Passive Video. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 2806–2813, 2024.
  57. [57] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid Loss for Language Image Pre-Training. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 11941–11952, Paris, France,
  58. [58] Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations, 2020.
  59. [59] Zijian Zhang, Kaiyuan Zheng, Zhaorun Chen, Joel Jang, Yi Li, Siwei Han, Chaoqi Wang, Mingyu Ding, Dieter Fox, and Huaxiu Yao. GRAPE: Generalizing robot policy via preference alignment. In Workshop on Reasoning and Planning for Large Language Models, 2025.
  60. [60] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V Le, and Ed H. Chi. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. In The Eleventh International Conference on Learning Representations, 2023.
  61. [61] Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, Quan Vuong, Vincent Vanhoucke, Huong Tran, Radu Soricut, Anikait Singh, Jaspiar Singh, Pierre Sermanet, Pannag R Sanketi, Grecia Salazar, Michael S Ryoo, Krista Reymann, Kanishka Rao, Karl Pertsch, Igor Mordatch, Henryk Michalews… RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.