Grounding Hierarchical Vision-Language-Action Models Through Explicit Language-Action Alignment
Pith reviewed 2026-05-10 19:38 UTC · model grok-4.3
The pith
A contrastive model ranks language-trajectory pairs to ground hierarchical vision-language-action models explicitly, without full supervision.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that training a contrastive model to assess the alignment between generated language and action trajectories, and using its rankings of language-trajectory pairs for offline preference learning, explicitly grounds hierarchical VLA models in the task and environment. On the LanguageTable dataset, this yields performance comparable to supervised fine-tuning while minimizing annotation costs.
What carries the argument
A contrastive model for assessing and ranking the alignment between language descriptions and corresponding action trajectories, which enables preference learning to refine the VLA grounding.
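As a minimal sketch of how such a contrastive alignment scorer could work (illustrative only; the embedding representation, cosine scoring, temperature, and InfoNCE batch construction are assumptions, not details taken from the paper), matched language-trajectory pairs are trained to score above mismatched ones:

```python
import math

def alignment_score(lang_emb, traj_emb):
    # Cosine similarity as the alignment score between a language
    # embedding and a trajectory embedding (both plain lists of floats).
    dot = sum(a * b for a, b in zip(lang_emb, traj_emb))
    n_l = math.sqrt(sum(a * a for a in lang_emb))
    n_t = math.sqrt(sum(b * b for b in traj_emb))
    return dot / (n_l * n_t)

def infonce_loss(lang_embs, traj_embs, temperature=0.1):
    # InfoNCE over a batch of matched (language, trajectory) pairs:
    # each language embedding should score highest on its own trajectory,
    # with every other trajectory in the batch serving as a negative.
    n = len(lang_embs)
    total = 0.0
    for i in range(n):
        logits = [alignment_score(lang_embs[i], traj_embs[j]) / temperature
                  for j in range(n)]
        m = max(logits)  # log-sum-exp stabilization
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += -(logits[i] - log_denom)
    return total / n
```

Once trained, the scalar score can rank alternative language-trajectory pairs, which is the signal the framework feeds into offline preference learning.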
If this is right
- Hierarchical VLA models produce language that is more consistent with their executed actions.
- The need for costly human annotations for training data is reduced.
- Robots achieve greater transparency through explicit multimodal grounding.
- Performance on language-annotated trajectory tasks matches that of fully supervised approaches.
- The approach yields insights into multimodal grounding representations.
Where Pith is reading between the lines
- This alignment approach could be applied to other robot learning domains to improve interpretability.
- Future work might test whether the contrastive rankings generalize to real-world robot deployments beyond the LanguageTable benchmark.
- It suggests a path to reduce reliance on large labeled datasets in vision-language-action systems.
Load-bearing premise
The contrastive model reliably identifies true alignments between language and actions rather than learning spurious correlations from the dataset.
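This premise is testable. One hedged sanity check (a sketch under assumptions; `score_fn` is a hypothetical stand-in for the trained contrastive model) compares alignment scores on true pairs against randomly re-paired ones:

```python
import random
import statistics

def shuffled_pair_gap(score_fn, langs, trajs, seed=0):
    # Mean alignment score on matched pairs minus mean score on randomly
    # re-paired ones. A gap near zero is a warning sign that the scorer
    # keys on dataset-wide regularities rather than true alignment.
    matched = statistics.mean(score_fn(l, t) for l, t in zip(langs, trajs))
    perm = list(range(len(trajs)))
    random.Random(seed).shuffle(perm)
    shuffled = statistics.mean(score_fn(langs[i], trajs[perm[i]])
                               for i in range(len(langs)))
    return matched - shuffled
```

A large positive gap is consistent with (though not proof of) genuine alignment; a near-zero gap would support the spurious-correlation worry.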
What would settle it
The claim would be undermined by a direct comparison showing that, after preference learning, the VLA model's generated language ranks no higher in alignment score with its actions than before, or that task performance on LanguageTable does not improve over a non-aligned baseline.
Original abstract
Achieving robot transparency is a critical step toward effective human-robot collaboration. To be transparent, a robot's natural language communication must be consistent with its actions and explicitly grounded in the task and environment. Existing hierarchical Vision-Language-Action (VLA) models can generate language (e.g., through chain-of-thought) and low-level actions. However, current work does not consider explicit alignment between these modalities during training. To address this crucial gap, we propose a novel training framework that explicitly grounds hierarchical VLA sub-task descriptions with respect to the visual observation and action space. Our framework uses a contrastive model to assess the alignment between generated language and corresponding action trajectories. This contrastive model enables direct ranking of different language-trajectory pairs based on their alignment, allowing us to refine the grounding of our hierarchical VLA through offline preference learning. We apply our framework to the LanguageTable dataset, a benchmark dataset of human language-annotated trajectories, and provide critical insights into multimodal grounding representations, all while establishing a strong baseline that achieves performance comparable to fully supervised fine-tuning and minimizing the need for costly data annotations.
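To make the ranking-then-preference-learning pipeline concrete, here is a hedged sketch. The pair construction follows the abstract's description; the DPO-style objective is one standard choice for offline preference learning and is an assumption here, not necessarily the paper's exact objective:

```python
import math

def rank_pairs(score_fn, candidates):
    # Rank candidate (language, trajectory) pairs by contrastive alignment
    # score and return the best- and worst-aligned as a preference pair.
    ranked = sorted(candidates, key=lambda pair: score_fn(*pair), reverse=True)
    return ranked[0], ranked[-1]

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # DPO-style pairwise objective (an assumed instantiation): push the
    # policy's log-probability of the preferred sample up, relative to a
    # frozen reference policy, and of the dispreferred sample down.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Under this reading, the contrastive scorer replaces human preference labels: the VLA generates alternative language-trajectory pairs, the scorer ranks them, and the winner/loser pair feeds the offline objective.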
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a novel training framework for hierarchical Vision-Language-Action (VLA) models that explicitly aligns generated sub-task language descriptions with visual observations and action trajectories. It introduces a contrastive model to score alignment, enabling ranking of language-trajectory pairs and subsequent refinement of the VLA via offline preference learning. The framework is evaluated on the LanguageTable dataset of human-annotated trajectories and claims to achieve performance comparable to fully supervised fine-tuning while reducing reliance on costly annotations.
Significance. If the experimental claims hold with proper validation, the work could offer a practical method for improving multimodal grounding and transparency in robotic VLA systems without proportional increases in human annotation effort. The contrastive ranking approach for preference optimization represents a potentially useful direction for self-supervised alignment in robotics, though its independence from existing annotations must be demonstrated.
Major comments (2)
- Abstract: The central claim of achieving 'performance comparable to fully supervised fine-tuning' is presented without any quantitative metrics, ablation studies, error bars, or baseline comparisons. This absence makes it impossible to evaluate whether the contrastive alignment step contributes meaningfully beyond standard supervised training.
- Framework description (as summarized in abstract): The contrastive model is described as assessing alignment between generated language and action trajectories to produce preference rankings, but no details are given on its training data, pre-training, or validation. If it is derived from the same human-annotated LanguageTable trajectories used for the VLA, the offline preference learning risks circularity, simply re-encoding existing annotation patterns rather than supplying independent grounding information.
Minor comments (2)
- The abstract would benefit from explicit mention of the evaluation metrics (e.g., success rate, language consistency score) used to claim comparability with supervised fine-tuning.
- Notation for the contrastive scoring function and preference optimization objective should be introduced with clear definitions to aid reproducibility.
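Such notation might look like the following (a sketch; all symbols are illustrative placeholders, not taken from the paper):

```latex
% Hypothetical notation; s_\theta is an assumed contrastive scorer.
% Alignment score s_\theta(\ell, \tau) between a language description
% \ell and an action trajectory \tau, trained with an InfoNCE-style
% loss over a batch of N matched pairs at temperature T:
\mathcal{L}_{\mathrm{con}}
  = -\frac{1}{N} \sum_{i=1}^{N}
    \log \frac{\exp\big(s_\theta(\ell_i, \tau_i)/T\big)}
              {\sum_{j=1}^{N} \exp\big(s_\theta(\ell_i, \tau_j)/T\big)}
% Offline preference objective over ranked pairs, where
% (\ell^{+}, \tau^{+}) outscores (\ell^{-}, \tau^{-}) under s_\theta
% and \Delta is the policy's preference margin:
\mathcal{L}_{\mathrm{pref}}
  = -\,\mathbb{E}\Big[\log \sigma\big(\beta\,
      \Delta(\ell^{+}, \tau^{+}, \ell^{-}, \tau^{-})\big)\Big]
```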
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful comments. We address each major comment point by point below, providing clarifications from the manuscript and indicating where revisions will strengthen the presentation of our results and methods.
Point-by-point responses
- Referee: Abstract: The central claim of achieving 'performance comparable to fully supervised fine-tuning' is presented without any quantitative metrics, ablation studies, error bars, or baseline comparisons. This absence makes it impossible to evaluate whether the contrastive alignment step contributes meaningfully beyond standard supervised training.
Authors: We agree that the abstract, as a concise summary, does not include specific quantitative metrics or ablations, which limits immediate evaluation of the claim. The full manuscript reports these details in the Experiments section, with tables comparing our method to fully supervised fine-tuning on LanguageTable, ablations isolating the contrastive alignment and offline preference learning components, and performance metrics with error bars from multiple random seeds. These results support the comparability claim while showing reduced annotation requirements. To address the concern directly, we will revise the abstract to include key quantitative highlights (e.g., relative success rates or performance deltas) within length constraints, making the central contribution clearer to readers. revision: yes
- Referee: Framework description (as summarized in abstract): The contrastive model is described as assessing alignment between generated language and action trajectories to produce preference rankings, but no details are given on its training data, pre-training, or validation. If it is derived from the same human-annotated LanguageTable trajectories used for the VLA, the offline preference learning risks circularity, simply re-encoding existing annotation patterns rather than supplying independent grounding information.
Authors: We thank the referee for raising this important point on potential circularity. The manuscript outlines the high-level framework but indeed provides limited implementation specifics for the contrastive model. In our setup, the contrastive model is trained on LanguageTable trajectories to learn general multimodal alignment between language descriptions, visual observations, and action sequences using a contrastive loss; it is not simply a re-encoding of the VLA's supervised objective. Preference rankings are then derived by having the VLA generate alternative language-trajectory pairs, which the contrastive model scores to create a self-supervised preference signal for offline optimization. This supplies additional grounding beyond the original human annotations. To resolve the concern and demonstrate independence, we will add a dedicated subsection in the Methods detailing the contrastive model's training data splits, architecture, any pre-training, validation metrics, and how generated pairs avoid direct re-use of annotation patterns. revision: yes
Circularity Check
No significant circularity detected
Full rationale
The paper introduces a contrastive model to assess language-trajectory alignment and applies it for ranking and offline preference learning on the LanguageTable dataset. No equations, self-citations, or explicit training details in the provided text demonstrate that any prediction or ranking step reduces by construction to the input annotations or fitted parameters. The framework is presented as adding an explicit alignment mechanism during training, and the claim of minimizing annotation needs is not shown to be tautological with the dataset usage. The derivation chain remains self-contained without load-bearing reductions to prior inputs.