ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning
Pith reviewed 2026-05-10 05:02 UTC · model grok-4.3
The pith
ReFineVLA fine-tunes vision-language-action models on teacher-generated reasoning rationales to improve robotic task performance and multimodal alignment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ReFineVLA augments robotic datasets with reasoning rationales from an expert teacher model to guide VLA models in learning to reason about their actions, then fine-tunes pre-trained VLAs on these reasoning-enriched datasets to enhance reasoning while preserving generalization, as evidenced by attention map alignment between visual observations, linguistic prompts, and actions.
What carries the argument
Teacher-guided augmentation and fine-tuning of VLAs with reasoning rationales
Where Pith is reading between the lines
- Such teacher-guided reasoning could be applied to other multimodal models beyond robotics to improve interpretability.
- Future work might test if the same method transfers to real robot hardware where environmental uncertainty is higher.
- The explicit rationales might enable better human oversight or debugging of robot behaviors.
Load-bearing premise
The external teacher model produces accurate reasoning rationales that, when used for fine-tuning, enhance action quality without reducing the base VLA model's ability to generalize to new tasks.
What would settle it
A controlled experiment showing that fine-tuning on the reasoning-augmented dataset yields equal or lower success rates on the benchmarks, or poorer performance on held-out tasks, would falsify the claimed benefit of the reasoning step.
Original abstract
Vision-Language-Action (VLA) models have gained much attention from the research community thanks to their strength in translating multimodal observations with linguistic instructions into desired robotic actions. Despite their advancements, VLAs often overlook explicit reasoning and learn only functional input-action mappings, omitting crucial logical steps; this shortcoming is especially pronounced in interpretability and generalization for complex, long-horizon manipulation tasks. In this work, we propose ReFineVLA, a multimodal reasoning-aware framework that fine-tunes VLAs with teacher-guided reasoning. We first augment robotic datasets with reasoning rationales generated by an expert teacher model, guiding VLA models to learn to reason about their actions. Then, we fine-tune pre-trained VLAs on the reasoning-enriched datasets with ReFineVLA, maintaining the underlying generalization abilities while boosting reasoning capabilities. We also conduct attention-map visualization to analyze the alignment among visual observations, linguistic prompts, and to-be-executed actions of ReFineVLA, reflecting the model's ability to focus on relevant tasks and actions. Through this additional step, we find that ReFineVLA-trained models exhibit a meaningful agreement between the vision-language and action domains, highlighting enhanced multimodal understanding and generalization. Evaluated across a suite of simulated manipulation benchmarks in SimplerEnv with both WidowX and Google Robot tasks, ReFineVLA achieves state-of-the-art performance, improving success rate over the second-best method on both the WidowX benchmark and Google Robot tasks.
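As a concrete illustration of the augmentation step the abstract describes, the sketch below shows one plausible way an external teacher could attach rationales to demonstration steps and how those rationales might be serialized as extra text tokens for fine-tuning. The `Step` fields, the `query_teacher` helper, and the serialization format are hypothetical assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of teacher-guided rationale augmentation (not the authors' code).
# Assumes each demonstration step carries an image, an instruction, and a ground-truth action,
# and that `query_teacher` wraps an external expert teacher model.

from dataclasses import dataclass


@dataclass
class Step:
    image_path: str       # visual observation
    instruction: str      # linguistic prompt
    action: list[float]   # ground-truth robot action (e.g., end-effector delta)
    rationale: str = ""   # to be filled in by the teacher


def query_teacher(image_path: str, instruction: str, action: list[float]) -> str:
    """Placeholder for an expert teacher model that explains why the given
    action is appropriate for this observation and instruction."""
    raise NotImplementedError("Call an external teacher model here.")


def augment_with_rationales(steps: list[Step]) -> list[Step]:
    """Attach a teacher-generated rationale to every demonstration step."""
    for step in steps:
        step.rationale = query_teacher(step.image_path, step.instruction, step.action)
    return steps


def to_training_text(step: Step) -> str:
    """One plausible serialization: rationale as extra text tokens before the action.
    The actual token format used by ReFineVLA is not specified in the abstract."""
    return (
        f"Instruction: {step.instruction}\n"
        f"Reasoning: {step.rationale}\n"
        f"Action: {step.action}"
    )
```

Whether rationales are interleaved as plain text tokens or injected as a separate modality is exactly the detail flagged in the minor comments below.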
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes ReFineVLA, a multimodal reasoning-aware framework that augments robotic datasets with reasoning rationales generated by an external expert teacher model and then fine-tunes pre-trained Vision-Language-Action (VLA) models on the enriched data. The goal is to improve explicit reasoning for complex, long-horizon manipulation tasks while preserving the base model's generalization. The work includes attention-map visualizations to demonstrate alignment between visual observations, language prompts, and actions, and reports state-of-the-art success rates on SimplerEnv benchmarks using both WidowX and Google Robot tasks.
Significance. If the performance gains can be causally attributed to the teacher-guided reasoning component rather than generic fine-tuning effects, the approach would offer a practical route to more interpretable and capable generalist robotic policies. The attention visualization provides a useful diagnostic for multimodal alignment, and the pipeline is straightforward to implement on existing VLAs. These elements would be of interest to the robotics and multimodal learning communities, particularly for long-horizon tasks where implicit mapping alone is insufficient.
major comments (1)
- [Experiments] Experiments section: The central SOTA claim attributes performance improvements on WidowX and Google Robot tasks to the multimodal reasoning-aware fine-tuning step. However, the manuscript provides no ablation that holds dataset size, fine-tuning procedure, and base VLA fixed while removing or randomizing the content of the teacher-generated rationales. Without this control, observed gains could stem from additional training data volume or optimization rather than the reasoning mechanism itself, weakening the attribution required by the title and abstract.
minor comments (2)
- [Abstract] Abstract: The claim of state-of-the-art performance, with an improvement in success rate over the second-best method, is stated without any numerical values, baseline names, or trial counts, making the headline result impossible to assess from the abstract alone.
- [Method] Method description: The precise format in which the generated rationales are concatenated or tokenized into the VLA input (e.g., as additional text tokens or separate modality) is not specified, hindering reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comment below and describe the revisions we will make to strengthen the attribution of our results.
Point-by-point responses
- Referee: The central SOTA claim attributes performance improvements on WidowX and Google Robot tasks to the multimodal reasoning-aware fine-tuning step. However, the manuscript provides no ablation that holds dataset size, fine-tuning procedure, and base VLA fixed while removing or randomizing the content of the teacher-generated rationales. Without this control, observed gains could stem from additional training data volume or optimization rather than the reasoning mechanism itself, weakening the attribution required by the title and abstract.
Authors: We agree that a controlled ablation isolating the effect of the reasoning content is necessary to strengthen the causal attribution of performance gains to the teacher-guided reasoning mechanism. While the current manuscript reports comparisons against multiple baselines (including other fine-tuned VLAs) on the SimplerEnv WidowX and Google Robot benchmarks, it does not include an ablation that keeps dataset size, fine-tuning procedure, and base model identical while replacing the teacher-generated rationales with random or null content. In the revised manuscript, we will add this ablation: we will fine-tune the base VLA on the same augmented dataset volume but with randomized rationales (e.g., shuffled sentences or empty strings) and report the resulting success rates alongside the original ReFineVLA results. This will clarify whether the structured reasoning provides benefits beyond additional training data or optimization. revision: yes
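A minimal sketch of the control the rebuttal proposes, assuming a simple list-of-dicts dataset with a `rationale` field: data volume and structure stay fixed while the rationale content is shuffled across examples or emptied, so any remaining gap in success rate can be attributed to the reasoning content itself. The function and field names are illustrative assumptions, not the authors' planned code.

```python
# Illustrative sketch of the randomized-rationale ablation described in the rebuttal
# (field names and shuffling scheme are assumptions, not the authors' implementation).
import random


def randomize_rationales(dataset: list[dict], mode: str = "shuffle", seed: int = 0) -> list[dict]:
    """Return a copy of the dataset with identical size and structure but with
    rationale content that carries no task-specific reasoning signal."""
    rng = random.Random(seed)
    rationales = [ex["rationale"] for ex in dataset]
    if mode == "shuffle":
        rng.shuffle(rationales)           # mismatched rationales, same text distribution
    elif mode == "empty":
        rationales = [""] * len(dataset)  # no rationale content at all
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [{**ex, "rationale": r} for ex, r in zip(dataset, rationales)]


# Fine-tune the same base VLA on the original and on the randomized dataset with
# identical hyperparameters; the resulting success-rate gap isolates the reasoning content.
```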
Circularity Check
No significant circularity in empirical fine-tuning pipeline
full rationale
The paper presents an empirical fine-tuning framework that augments datasets with external teacher-generated rationales and evaluates resulting VLA policies on WidowX and Google Robot benchmarks. No equations, derivations, or analytical reductions appear in the abstract or described method; performance claims rest on reported success rates rather than any quantity defined in terms of itself. No self-citation chains, fitted inputs renamed as predictions, or ansatzes smuggled via prior work are present. The evaluation is externally benchmarked and does not reduce to internal definitions by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- [domain assumption] An external teacher model can produce useful and accurate reasoning rationales for robotic actions.
invented entities (1)
- ReFineVLA framework: no independent evidence.