pith. machine review for the scientific record.

arxiv: 2604.05614 · v1 · submitted 2026-04-07 · 💻 cs.RO

Recognition: no theorem link

Grounding Hierarchical Vision-Language-Action Models Through Explicit Language-Action Alignment

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:38 UTC · model grok-4.3

classification 💻 cs.RO
keywords: vision-language-action models · hierarchical VLA · contrastive learning · language-action alignment · offline preference learning · robot grounding · LanguageTable dataset · robot transparency

The pith

A contrastive model ranks language-trajectory pairs to explicitly ground hierarchical vision-language-action models without full supervision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a training framework for hierarchical vision-language-action (VLA) models that explicitly aligns sub-task language descriptions with visual observations and action trajectories. A contrastive model evaluates and ranks different language-trajectory pairs by their alignment, and this ranking drives offline preference learning to refine the VLA model. Applied to the LanguageTable dataset of annotated robot trajectories, the approach reaches performance comparable to fully supervised fine-tuning while reducing the need for extensive annotation. This matters for robot transparency: in human collaboration, a robot's words should match its deeds.
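As a concrete sketch of the ranking step, a contrastive scorer can rank candidate sub-task descriptions against a trajectory embedding. The embeddings and the cosine scorer below are illustrative stand-ins, not the paper's actual architecture:

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_pairs(lang_embs, traj_emb):
    """Score each candidate sub-task description against one trajectory
    embedding; return (score, index) pairs, best-aligned first."""
    scores = [(cosine(l, traj_emb), i) for i, l in enumerate(lang_embs)]
    return sorted(scores, reverse=True)

# Toy embeddings: candidate 0 is well aligned with the trajectory,
# candidate 1 is not.
candidates = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9]]
trajectory = [1.0, 0.0, 0.1]
ranking = rank_pairs(candidates, trajectory)
best = ranking[0][1]  # index of the preferred language-trajectory pair
```

The ranked pairs then serve as the preference signal for offline optimization: the highest-scoring pair becomes the "preferred" sample, the lowest the "dispreferred" one.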

Core claim

The central claim is that training a contrastive model to assess the alignment between generated language and action trajectories, and using it to rank pairs for offline preference learning, grounds hierarchical VLA models explicitly in the task and environment. On the LanguageTable dataset, this yields performance comparable to supervised fine-tuning at a substantially reduced annotation cost.

What carries the argument

A contrastive model for assessing and ranking the alignment between language descriptions and corresponding action trajectories, which enables preference learning to refine the VLA grounding.
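One common form of offline preference learning from such rankings is a DPO-style objective. The paper's exact objective is not specified in this summary, so the loss below is a minimal sketch of the general mechanism:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO-style offline preference loss for one ranked pair.

    logp_w / logp_l: policy log-likelihoods of the preferred (winner)
    and dispreferred (loser) language-trajectory pair; ref_* are the
    frozen reference model's log-likelihoods."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# The loss shrinks as the policy puts relatively more probability mass
# on the pair the contrastive ranker preferred.
loose = dpo_loss(logp_w=-5.0, logp_l=-5.0, ref_logp_w=-5.0, ref_logp_l=-5.0)
tight = dpo_loss(logp_w=-4.0, logp_l=-6.0, ref_logp_w=-5.0, ref_logp_l=-5.0)
```

With no separation between winner and loser the loss sits at log 2; widening the gap drives it toward zero, which is what lets the contrastive ranking refine the VLA without fresh human labels.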

If this is right

  • Hierarchical VLA models produce language that is more consistent with their executed actions.
  • The need for costly human annotations for training data is reduced.
  • Robots achieve greater transparency through explicit multimodal grounding.
  • Performance on language-annotated trajectory tasks matches that of fully supervised approaches.
  • Insights are provided into multimodal grounding representations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This alignment approach could be applied to other robot learning domains to improve interpretability.
  • Future work might test whether the contrastive rankings generalize to real-world robot deployments beyond the LanguageTable benchmark.
  • It suggests a path to reduce reliance on large labeled datasets in vision-language-action systems.

Load-bearing premise

The contrastive model reliably identifies true alignments between language and actions rather than learning spurious correlations from the dataset.

What would settle it

The claim would be undermined by a direct comparison showing that, after preference learning, the VLA model's generated language ranks no higher in alignment score against its own actions than before, or that task performance on LanguageTable does not improve over a non-aligned baseline.

Figures

Figures reproduced from arXiv: 2604.05614 by Angelo Cangelosi, Federico Tavella, Manith Adikari, Rahul Singh Maharjan, Theodor Wulff.

Figure 1. Method Overview. We extend a regular VLA (left) into a hierarchical VLA by adding a high-level VLM module to break a high-level instruction down into executable low-level instructions (center), following recent trends on hierarchical VLAs [4, 45]. To align the intermediate low-level instruction and the generated trajectory, we invoke a separately trained ranking model, which ranks N sampled output pairs ba…
Figure 2. Action-Conditioned Grounding Model. We extend a pre-trained SigLIP 2 by conditioning the visual features of the SigLIP 2 Vision Encoder on the encoded trajectories. Using a contrastive loss, we align the vision-action pairs with the low-level instructions.
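The caption describes conditioning the vision encoder's features on encoded trajectories. One standard mechanism for this kind of feature-wise conditioning is FiLM [40], which appears in the reference list; whether the paper uses exactly this scheme is not stated here, so the sketch below is a hypothetical illustration with toy shapes:

```python
def film_condition(visual_feats, traj_code, gamma_w, beta_w):
    """FiLM-style conditioning: scale and shift each visual feature
    channel by affine parameters predicted from the trajectory code.
    gamma_w / beta_w are (len(traj_code) x len(visual_feats)) weight
    matrices of a hypothetical linear predictor."""
    def linear(w):
        return [sum(t * w[i][j] for i, t in enumerate(traj_code))
                for j in range(len(visual_feats))]
    gamma, beta = linear(gamma_w), linear(beta_w)
    return [g * v + b for v, g, b in zip(visual_feats, gamma, beta)]

# Two visual channels conditioned on a one-dimensional trajectory code.
conditioned = film_condition([3.0, 4.0], [1.0], [[2.0, 0.5]], [[0.0, 1.0]])
```

The conditioned visual embedding can then be contrasted against the low-level instruction embedding exactly as in an unconditioned SigLIP setup.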
Figure 4. Quantitative Evaluation of Generated Trajectories.
Figure 5. t-SNE visualizations of all grounding models on the LanguageTable dataset using visual inputs and low-level instructions.
Figure 6. Prompt template used for robotic agent instructions.
Original abstract

Achieving robot transparency is a critical step toward effective human-robot collaboration. To be transparent, a robot's natural language communication must be consistent with its actions and explicitly grounded in the task and environment. Existing hierarchical Vision-Language-Action (VLA) models can generate language (e.g., through chain-of-thought) and low-level actions. However, current work does not consider explicit alignment between these modalities during training. To address this crucial gap, we propose a novel training framework that explicitly grounds hierarchical VLA sub-task descriptions with respect to the visual observation and action space. Our framework uses a contrastive model to assess the alignment between generated language and corresponding action trajectories. This contrastive model enables direct ranking of different language-trajectory pairs based on their alignment, allowing us to refine the grounding of our hierarchical VLA through offline preference learning. We apply our framework to the LanguageTable dataset, a benchmark dataset of human language-annotated trajectories, and provide critical insights into multimodal grounding representations, all while establishing a strong baseline that achieves performance comparable to fully supervised fine-tuning and minimizing the need for costly data annotations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a novel training framework for hierarchical Vision-Language-Action (VLA) models that explicitly aligns generated sub-task language descriptions with visual observations and action trajectories. It introduces a contrastive model to score alignment, enabling ranking of language-trajectory pairs and subsequent refinement of the VLA via offline preference learning. The framework is evaluated on the LanguageTable dataset of human-annotated trajectories and claims to achieve performance comparable to fully supervised fine-tuning while reducing reliance on costly annotations.

Significance. If the experimental claims hold with proper validation, the work could offer a practical method for improving multimodal grounding and transparency in robotic VLA systems without proportional increases in human annotation effort. The contrastive ranking approach for preference optimization represents a potentially useful direction for self-supervised alignment in robotics, though its independence from existing annotations must be demonstrated.

major comments (2)
  1. Abstract: The central claim of achieving 'performance comparable to fully supervised fine-tuning' is presented without any quantitative metrics, ablation studies, error bars, or baseline comparisons. This absence makes it impossible to evaluate whether the contrastive alignment step contributes meaningfully beyond standard supervised training.
  2. Framework description (as summarized in abstract): The contrastive model is described as assessing alignment between generated language and action trajectories to produce preference rankings, but no details are given on its training data, pre-training, or validation. If it is derived from the same human-annotated LanguageTable trajectories used for the VLA, the offline preference learning risks circularity, simply re-encoding existing annotation patterns rather than supplying independent grounding information.
minor comments (2)
  1. The abstract would benefit from explicit mention of the evaluation metrics (e.g., success rate, language consistency score) used to claim comparability with supervised fine-tuning.
  2. Notation for the contrastive scoring function and preference optimization objective should be introduced with clear definitions to aid reproducibility.
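On minor comment 2, one plausible notation, assuming the SigLIP-style sigmoid contrastive objective the figure captions suggest (all symbols here are illustrative, not taken from the paper):

```latex
% Sigmoid contrastive objective over a batch of N instruction--(vision, action) pairs.
% s_{ij}: similarity between instruction embedding i and the action-conditioned
% visual embedding j; z_{ij} = +1 if i = j (matched pair), else -1;
% \tau and b: learnable temperature and bias, as in SigLIP.
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N}
  \log \sigma\!\left( z_{ij} \, (\tau \, s_{ij} + b) \right)
```

A definition in this form, together with the preference optimization objective it feeds, would make the ranking step reproducible.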

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments. We address each major comment point by point below, providing clarifications from the manuscript and indicating where revisions will strengthen the presentation of our results and methods.

Point-by-point responses
  1. Referee: Abstract: The central claim of achieving 'performance comparable to fully supervised fine-tuning' is presented without any quantitative metrics, ablation studies, error bars, or baseline comparisons. This absence makes it impossible to evaluate whether the contrastive alignment step contributes meaningfully beyond standard supervised training.

    Authors: We agree that the abstract, as a concise summary, does not include specific quantitative metrics or ablations, which limits immediate evaluation of the claim. The full manuscript reports these details in the Experiments section, with tables comparing our method to fully supervised fine-tuning on LanguageTable, ablations isolating the contrastive alignment and offline preference learning components, and performance metrics with error bars from multiple random seeds. These results support the comparability claim while showing reduced annotation requirements. To address the concern directly, we will revise the abstract to include key quantitative highlights (e.g., relative success rates or performance deltas) within length constraints, making the central contribution clearer to readers. revision: yes

  2. Referee: Framework description (as summarized in abstract): The contrastive model is described as assessing alignment between generated language and action trajectories to produce preference rankings, but no details are given on its training data, pre-training, or validation. If it is derived from the same human-annotated LanguageTable trajectories used for the VLA, the offline preference learning risks circularity, simply re-encoding existing annotation patterns rather than supplying independent grounding information.

    Authors: We thank the referee for raising this important point on potential circularity. The manuscript outlines the high-level framework but indeed provides limited implementation specifics for the contrastive model. In our setup, the contrastive model is trained on LanguageTable trajectories to learn general multimodal alignment between language descriptions, visual observations, and action sequences using a contrastive loss; it is not simply a re-encoding of the VLA's supervised objective. Preference rankings are then derived by having the VLA generate alternative language-trajectory pairs, which the contrastive model scores to create a self-supervised preference signal for offline optimization. This supplies additional grounding beyond the original human annotations. To resolve the concern and demonstrate independence, we will add a dedicated subsection in the Methods detailing the contrastive model's training data splits, architecture, any pre-training, validation metrics, and how generated pairs avoid direct re-use of annotation patterns. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces a contrastive model to assess language-trajectory alignment and applies it for ranking and offline preference learning on the LanguageTable dataset. No equations, self-citations, or explicit training details in the provided text demonstrate that any prediction or ranking step reduces by construction to the input annotations or fitted parameters. The framework is presented as adding an explicit alignment mechanism during training, and the claim of minimizing annotation needs is not shown to be tautological with the dataset usage. The derivation chain remains self-contained without load-bearing reductions to prior inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no equations or training details provided, so free parameters, axioms, and invented entities cannot be enumerated beyond the high-level components named.

pith-pipeline@v0.9.0 · 5506 in / 1292 out tokens · 24237 ms · 2026-05-10T19:38:20.092567+00:00 · methodology


Reference graph

Works this paper leans on

61 extracted references · 13 canonical work pages · 10 internal anchors

  1. [1] Muhammad Awais, Muzammal Naseer, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Foundation Models Defining a New Era in Vision: A Survey and Outlook. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(4):2245–2264, 2025.
  2. [2] Randall Balestriero, Mark Ibrahim, Vlad Sobal, Ari Morcos, Shashank Shekhar, Tom Goldstein, Florian Bordes, Adrien Bardes, Gregoire Mialon, Yuandong Tian, Avi Schwarzschild, Andrew Gordon Wilson, Jonas Geiping, Quentin Garrido, Pierre Fernandez, Amir Bar, Hamed Pirsiavash, Yann LeCun, and Micah Goldblum. A Cookbook of Self-Supervised Learning. arXiv preprint arXiv:2304.12210, 2023.
  3. [3] Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005.
  4. [4] Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quan Vuong, Jonathan Tompson, Yevgen Chebotar, Debidatta Dwibedi, and Dorsa Sadigh. RT-H: Action Hierarchies using Language. In Robotics: Science and Systems.
  5. [5] Ondrej Biza, Thomas Weng, Lingfeng Sun, Karl Schmeckpeper, Tarik Kelestemur, Yecheng Jason Ma, Robert Platt, Jan-Willem van de Meent, and Lawson L.S. Wong. On-Robot Reinforcement Learning with Goal-Contrastive Rewards. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 4797–4805, 2025.
  6. [6] Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, Joel Jang, Zhenyu Jiang, Jan Kautz, Kaushil Kundalia, Lawrence Lao, Zhiqi Li, Zongyu Lin, Kevin Lin, Guilin Liu, Edith LLontop, Loic Magne, Ajay Mandlekar, Avnish Narayan, Soroush Nasiriany, Scott Reed, You Liang Tan, Guanz… GR00T N1: An Open Foundation Model for Generalist Humanoid Robots.
  7. [7] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. π0: A Visi…
  8. [8] Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. …
  9. [9] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, Nikhil J. Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Utsav …
  10. [10] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A Simple Framework for Contrastive Learning of Visual Representations. In Proceedings of the 37th International Conference on Machine Learning, pages 1597–
  11. [11] An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models. In Advances in Neural Information Processing Systems, pages 135062–135093. Curran Associates, Inc., 2024.
  12. [12] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep Reinforcement Learning from Human Preferences. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017.
  13. [13] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurélien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozière, B…
  14. [14] Benjamin Eysenbach, Tianjun Zhang, Sergey Levine, and Ruslan Salakhutdinov. Contrastive Learning as Goal-Conditioned Reinforcement Learning. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022.
  15. [15] Giorgio Giannone, Ruoteng Li, Qianli Feng, Evgeny Perevodchikov, Rui Chen, and Aleix Martinez. Feedback-Driven Vision-Language Alignment with Minimal Human Supervision, 2025. arXiv:2501.04568 [cs.CV].
  16. [16] Zhao Han, Elizabeth Phillips, and Holly A. Yanco. The Need for Verbal Robot Explanations and How People Would Like a Robot to Explain Itself. J. Hum.-Robot Interact., 10(4),
  17. [17] Stevan Harnad. The Symbol Grounding Problem. Physica D: Nonlinear Phenomena, 42(1):335–346, 1990.
  18. [18] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum Contrast for Unsupervised Visual Representation Learning. CVPR, 2020.
  19. [19] Brian Ichter, Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, Dmitry Kalashnikov, Sergey Levine, Yao Lu, Carolina Parada, Kanishka Rao, Pierre Sermanet, Alexander T. Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Mengyuan Yan, Noah Brown, Michael Ahn, …
  20. [20] Donald Joseph Hejna III and Dorsa Sadigh. Few-Shot Preference Learning for Human-in-the-Loop RL. In 6th Annual Conference on Robot Learning, 2022.
  21. [21] Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-Bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Lucas Bey…
  22. [22] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. OpenVLA: An Open-Source Vision-Language-Action Model. In Proceedings of The 8th Conference on…
  23. [23] Kimin Lee, Laura M. Smith, and P. Abbeel. PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training. In International Conference on Machine Learning, 2021.
  24. [24] Chin-Yew Lin. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, 2004. Association for Computational Linguistics.
  25. [25] Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive Language: Talking to Robots in Real Time. IEEE Robotics and Automation Letters, pages 1–8, 2023.
  26. [26] Teli Ma, Jiaming Zhou, Zifan Wang, Ronghe Qiu, and Junwei Liang. Contrastive Imitation Learning for Language-guided Multi-Task Robotic Manipulation. In Proceedings of The 8th Conference on Robot Learning, pages 4651–4669. PMLR, 2025.
  27. [27] Yueen Ma, Zixing Song, Yuzheng Zhuang, Jianye Hao, and Irwin King. A Survey on Vision-Language-Action Models for Embodied AI, 2025. arXiv:2405.14093 [cs.RO].
  28. [28] Yecheng Jason Ma, Vikash Kumar, Amy Zhang, Osbert Bastani, and Dinesh Jayaraman. LIV: Language-Image Representations and Rewards for Robotic Control. In Workshop on Reincarnating Reinforcement Learning at ICLR 2023, 2023.
  29. [29] Yecheng Jason Ma, Shagun Sodhani, Dinesh Jayaraman, Osbert Bastani, Vikash Kumar, and Amy Zhang. VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training, 2023. arXiv:2210.00030 [cs.RO].
  30. [30] Oier Mees, Dibya Ghosh, Karl Pertsch, Kevin Black, Homer Rich Walke, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An Open-Source Generalist Robot Policy. In First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024, 2024.
  31. [31] Yu Meng, Mengzhou Xia, and Danqi Chen. SimPO: Simple Preference Optimization with a Reference-Free Reward. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
  32. [32] Vivek Myers, Erdem Bıyık, and Dorsa Sadigh. Active reward learning from online preferences. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 7511–7518, 2023.
  33. [33] Suraj Nair, Eric Mitchell, Kevin Chen, Brian Ichter, Silvio Savarese, and Chelsea Finn. Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation. In Proceedings of the 5th Conference on Robot Learning, pages 1303–1315. PMLR, 2022.
  34. [34] Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3M: A Universal Visual Representation for Robot Manipulation. In 6th Annual Conference on Robot Learning (CoRL), 2022.
  35. [35] OpenAI. GPT-4 Technical Report, 2023. arXiv preprint arXiv:2303.08774 [cs.CL].
  36. [36] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human fe…
  37. [37] Abby O'Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Anikait Singh, Animesh Garg, Aniruddha Kembhavi, Annie Xie, Anthony Brohan, Antonin Raf…
  38. [38] Norman Di Palo and Edward Johns. Keypoint Action Tokens Enable In-Context Imitation Learning in Robotics. In First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024, 2024.
  39. [39] Kishore Papineni et al. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2002.
  40. [40] Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual Reasoning with a General Conditioning Layer. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), 2018.
  41. [41] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  42. [42] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. In Advances in Neural Information Processing Systems, pages 53728–53741. Curran Associates, Inc., 2023.
  43. [43] Roya Salehzadeh, Jiaqi Gong, and Nader Jalili. Purposeful Communication in Human–Robot Collaboration: A Review of Modern Approaches in Manufacturing. IEEE Access, 10:129344–129361, 2022.
  44. [44] Lin Shao, Toki Migimatsu, Qiang Zhang, Karen Yang, and Jeannette Bohg. Concept2Robot: Learning Manipulation Concepts from Instructions and Human Demonstrations. In Proceedings of Robotics: Science and Systems (RSS), 2020.
  45. [45] Lucy Xiaoyang Shi, Brian Ichter, Michael Robert Equi, Liyiming Ke, Karl Pertsch, Quan Vuong, James Tanner, Anna Walling, Haohuan Wang, Niccolo Fusai, Adrian Li-Bell, Danny Driess, Lachy Groom, Sergey Levine, and Chelsea Finn. Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models. In Forty-second International Confer…
  46. [46] Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, Simon Alibert, Matthieu Cord, Thomas Wolf, and Remi Cadene. SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics, 2025. arXiv:2506.01844 [cs.RO].
  47. [47] Sumedh Anand Sontakke, Jesse Zhang, Séb Arnold, Karl Pertsch, Erdem Biyik, Dorsa Sadigh, Chelsea Finn, and Laurent Itti. RoboCLIP: One Demonstration is Enough to Learn Robot Policies. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  48. [48] Simon Stepputtis, Joseph Campbell, Mariano Phielipp, Stefan Lee, Chitta Baral, and Heni Ben Amor. Language-Conditioned Imitation Learning for Robot Manipulation Tasks. In Advances in Neural Information Processing Systems, pages 13139–13150. Curran Associates, Inc., 2020.
  49. [49] Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to Summarize with Human Feedback. In Advances in Neural Information Processing Systems, pages 3008–3021. Curran Associates, Inc., 2020.
  50. [50] Juan Terven, Diana-Margarita Cordova-Esparza, Julio-Alejandro Romero-González, Alfonso Ramírez-Pedraza, and E. A. Chávez-Urbiola. A Comprehensive Survey of Loss Functions and Metrics in Deep Learning. Artificial Intelligence Review, 58(7):195, 2025.
  51. [51] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features.
  52. [52] Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman. Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting. In Advances in Neural Information Processing Systems, pages 74952–74965. Curran Associates, Inc., 2023.
  53. [53] Yufei Wang, Zhanyi Sun, Jesse Zhang, Zhou Xian, Erdem Biyik, David Held, and Zackory Erickson. RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback. In Proceedings of the 41st International Conference on Machine Learning, 2024.
  54. [54] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
  55. [55] Theodor Wulff, Rahul Singh Maharjan, Xinyun Chi, and Angelo Cangelosi. Joint Action Language Modelling for Transparent Policy Execution, 2025. arXiv:2504.10055 [cs.RO].
  56. [56] Daniel Yang, Davin Tjia, Jacob Berg, Dima Damen, Pulkit Agrawal, and Abhishek Gupta. Rank2Reward: Learning Shaped Reward Functions from Passive Video. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 2806–2813, 2024.
  57. [57] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid Loss for Language Image Pre-Training. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 11941–11952, Paris, France,
  58. [58] Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations, 2020.
  59. [59] Zijian Zhang, Kaiyuan Zheng, Zhaorun Chen, Joel Jang, Yi Li, Siwei Han, Chaoqi Wang, Mingyu Ding, Dieter Fox, and Huaxiu Yao. GRAPE: Generalizing robot policy via preference alignment. In Workshop on Reasoning and Planning for Large Language Models, 2025.
  60. [60] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V Le, and Ed H. Chi. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. In The Eleventh International Conference on Learning Representations, 2023.
  61. [61] Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, Quan Vuong, Vincent Vanhoucke, Huong Tran, Radu Soricut, Anikait Singh, Jaspiar Singh, Pierre Sermanet, Pannag R Sanketi, Grecia Salazar, Michael S Ryoo, Krista Reymann, Kanishka Rao, Karl Pertsch, Igor Mordatch, Henryk Michalews… RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.