pith. machine review for the scientific record.

arxiv: 2307.15818 · v1 · submitted 2023-07-28 · 💻 cs.RO · cs.CL · cs.CV · cs.LG

Recognition: 1 theorem link

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Alexander Herzog, Alex Irpan, Anikait Singh, Anthony Brohan, Avinava Dubey, Ayzaan Wahid, Brian Ichter, Brianna Zitkovich, Chelsea Finn, Chuyuan Fu, Danny Driess, Dmitry Kalashnikov, Fei Xia, Grecia Salazar, Henryk Michalewski, Huong Tran, Igor Mordatch, Isabel Leal, Jasmine Hsu, Jaspiar Singh, Jialin Wu, Justice Carbajal, Kanishka Rao, Karl Pertsch, Karol Hausman, Keerthana Gopalakrishnan, Kehang Han, Krista Reymann, Krzysztof Choromanski, Lisa Lee, Michael Ryoo, Montse Gonzalez Arenas, Nikhil Joshi, Noah Brown, Pannag Sanketi, Paul Wohlhart, Peng Xu, Pete Florence, Pierre Sermanet, Quan Vuong, Radu Soricut, Ryan Julian, Sergey Levine, Sichun Xu, Stefan Welker, Ted Xiao, Tianhe Yu, Tianli Ding, Tsang-Wei Edward Lee, Vincent Vanhoucke, Xi Chen, Yao Lu, Yevgen Chebotar, Yuheng Kuang

Pith reviewed 2026-05-10 22:30 UTC · model grok-4.3

classification 💻 cs.RO · cs.CL · cs.CV · cs.LG
keywords vision-language-action models · robotic control · emergent capabilities · generalization to novel objects · web-scale pretraining · co-fine-tuning · chain of thought reasoning · RT-2

The pith

Vision-language models trained on web data transfer semantic knowledge to robotic control by encoding actions as text tokens, yielding emergent generalization and reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that large vision-language models pretrained on internet-scale data can be adapted for end-to-end robotic control. By representing robot actions as sequences of text tokens, the same model is co-fine-tuned on both web vision-language tasks and robot trajectory demonstrations. This joint training lets the robot inherit broad semantic understanding, such as recognizing new objects or interpreting instructions involving numbers, icons, sizes, or proximity. Extensive tests across six thousand trials show the resulting policies perform well on standard tasks while gaining abilities like basic reasoning about object properties or improvised tool use.

Core claim

By expressing robotic actions as text tokens and co-fine-tuning state-of-the-art vision-language models on both robotic trajectory data and Internet-scale vision-language tasks such as visual question answering, we obtain vision-language-action models that map observations to actions while retaining the benefits of web pretraining, producing performant policies that generalize to novel objects, follow previously unseen commands, and perform rudimentary reasoning such as selecting the smallest object or choosing an improvised hammer.

What carries the argument

The vision-language-action (VLA) model, formed by treating actions as text tokens so they fit directly into the same training format as natural language responses, enabling joint optimization on web data and robot trajectories without task-specific architectural changes.
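
The recipe is concrete enough to sketch. Below is a minimal illustration of the action-as-text-tokens idea, assuming each continuous action dimension is discretized into 256 uniform bins, as the paper describes; the token names, normalization ranges, and example values are placeholders rather than the authors' implementation, which reuses tokens from the base VLM's existing vocabulary.

```python
import numpy as np

NUM_BINS = 256  # per-dimension discretization granularity (a free parameter)

def encode_action(action, low, high, num_bins=NUM_BINS):
    """Quantize a continuous action vector and emit one token string per dimension."""
    action = np.asarray(action, dtype=np.float64)
    normed = (action - low) / (high - low)                      # scale to [0, 1]
    bins = np.clip((normed * num_bins).astype(int), 0, num_bins - 1)
    # One "word" per dimension, so an action fits the same next-token
    # objective and output space as a natural-language answer.
    return [f"<act_{b}>" for b in bins]

def decode_action(tokens, low, high, num_bins=NUM_BINS):
    """Map token strings back to bin centers (error is at most half a bin)."""
    bins = np.array([int(t[len("<act_"):-1]) for t in tokens])
    return low + (bins + 0.5) / num_bins * (high - low)

# Hypothetical 8-dim action: terminate flag, 6-DoF end-effector delta, gripper.
low  = np.array([0.0, -0.1, -0.1, -0.1, -0.5, -0.5, -0.5, 0.0])
high = np.array([1.0,  0.1,  0.1,  0.1,  0.5,  0.5,  0.5, 1.0])
a    = np.array([0.0, 0.02, -0.01, 0.05, 0.10, 0.00, -0.20, 1.0])
tokens = encode_action(a, low, high)          # e.g. ['<act_0>', '<act_153>', ...]
recovered = decode_action(tokens, low, high)  # close to a, up to quantization error
```

Under this scheme the only precision lost is the half-bin quantization error, which is exactly what the load-bearing premise below asks the reader to accept as negligible.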

If this is right

  • Robotic policies gain the ability to interpret commands involving concepts absent from robot data, such as placing an object on a specific number or icon.
  • Basic reasoning emerges, including selecting objects by relative size or proximity and choosing appropriate tools or items for a described need.
  • Chain-of-thought prompting extends the model to multi-stage semantic planning, such as identifying an improvised hammer or suitable drink (see the sketch after this list).
  • A single end-to-end model handles perception, language understanding, and control across a wide range of tasks without separate modules.
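
The chain-of-thought extension mentioned above can be made concrete with a sketch of the kind of augmented training target the paper describes, where a short natural-language plan precedes the action tokens. The field labels and token strings here are illustrative assumptions, not the authors' verbatim format.

```python
# Illustrative RT-2-style chain-of-thought target. The "Plan:"/"Action:" labels
# and <act_*> strings are assumptions for exposition, not the paper's exact data.
# The model is trained to emit a brief plan in natural language and then the
# discretized action tokens, all under one next-token objective.
instruction = "Pick up the object that could be used as an improvised hammer."

target = (
    "Plan: the rock is hard and heavy, so pick up the rock. "
    "Action: <act_0> <act_131> <act_127> <act_128> <act_129> <act_127> <act_128> <act_255>"
)
```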

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Scaling web pretraining further could reduce the volume of robot-specific demonstrations needed for new skills.
  • The same token-based approach might extend to other embodied domains such as navigation or manipulation in unstructured environments.
  • Combining VLA models with longer-horizon planning could produce agents that decompose complex household tasks using web-derived knowledge.

Load-bearing premise

Representing actions as text tokens will allow web-scale semantic knowledge to transfer to robotic policies without degrading action precision or needing extra model components.

What would settle it

A controlled comparison in which the co-fine-tuned model matches or underperforms a robot-only baseline on novel-object tasks or unseen-command generalization would show that web knowledge did not transfer.
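
In practice that comparison reduces to per-trial binary outcomes for the two models on the same held-out tasks. A minimal sketch of how the gap could be quantified, assuming paired lists of trial results; the trial counts and success rates below are placeholders, not the paper's data.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_diff_ci(success_a, success_b, n_boot=10_000, alpha=0.05):
    """Percentile-bootstrap CI for the difference in success rates (A minus B)."""
    a = np.asarray(success_a, dtype=float)   # 1 = trial succeeded, 0 = failed
    b = np.asarray(success_b, dtype=float)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        diffs[i] = rng.choice(a, a.size).mean() - rng.choice(b, b.size).mean()
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return a.mean() - b.mean(), (lo, hi)

# Placeholder outcomes for illustration only (not the paper's 6k-trial results).
co_finetuned = rng.binomial(1, 0.62, size=200)   # hypothetical novel-object trials
robot_only   = rng.binomial(1, 0.32, size=200)
diff, (lo, hi) = bootstrap_diff_ci(co_finetuned, robot_only)
print(f"success-rate gap = {diff:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

If the interval straddles zero on novel-object or unseen-command tasks, the transfer claim is not supported; a clearly positive interval is the kind of evidence the referee report below asks to see quantified.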

read the original abstract

We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. Our goal is to enable a single end-to-end trained model to both learn to map robot observations to actions and enjoy the benefits of large-scale pretraining on language and vision-language data from the web. To this end, we propose to co-fine-tune state-of-the-art vision-language models on both robotic trajectory data and Internet-scale vision-language tasks, such as visual question answering. In contrast to other approaches, we propose a simple, general recipe to achieve this goal: in order to fit both natural language responses and robotic actions into the same format, we express the actions as text tokens and incorporate them directly into the training set of the model in the same way as natural language tokens. We refer to such category of models as vision-language-action models (VLA) and instantiate an example of such a model, which we call RT-2. Our extensive evaluation (6k evaluation trials) shows that our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training. This includes significantly improved generalization to novel objects, the ability to interpret commands not present in the robot training data (such as placing an object onto a particular number or icon), and the ability to perform rudimentary reasoning in response to user commands (such as picking up the smallest or largest object, or the one closest to another object). We further show that incorporating chain of thought reasoning allows RT-2 to perform multi-stage semantic reasoning, for example figuring out which object to pick up for use as an improvised hammer (a rock), or which type of drink is best suited for someone who is tired (an energy drink).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes vision-language-action (VLA) models, instantiated as RT-2, by co-fine-tuning pretrained vision-language models on robotic trajectory data (with actions represented as text tokens) together with internet-scale vision-language tasks such as VQA. The central claim is that this simple recipe transfers semantic knowledge from web-scale pretraining to produce performant end-to-end robotic policies with emergent capabilities, including improved generalization to novel objects, interpretation of commands absent from robot training data, and rudimentary reasoning (e.g., selecting smallest/largest objects or using objects as improvised tools), supported by 6k evaluation trials.

Significance. If the results hold, the work demonstrates a practical route to leverage large-scale web pretraining for robotic generalization and semantic reasoning without task-specific architectures or separate modules, advancing end-to-end learning in robotics. The extensive 6k-trial evaluation and demonstration of chain-of-thought reasoning for multi-stage tasks are notable strengths that provide concrete evidence for the transfer effect.

major comments (2)
  1. [§4 and §5] §4 (Methods) and §5 (Experiments): The claim that co-fine-tuning on vision-language tasks is what enables transfer of web-derived semantic capabilities (preventing degradation while learning actions) is load-bearing for the central contribution. The reported comparisons are to RT-1 and other baselines, but no control is presented that fine-tunes the identical base VLM solely on robotic trajectories while omitting the internet VQA data. This ablation is required to isolate whether the observed generalization and reasoning emerge from the co-training mixture or simply from the pretrained weights.
  2. [§5.1] §5.1 (Evaluation setup): The 6k evaluation trials are cited as evidence for causal attribution to web pretraining, yet the manuscript does not report statistical tests, confidence intervals, or precise train/test splits and data mixture ratios for the novel-object and reasoning tasks. Without these, it is difficult to rule out that performance differences arise from other training factors rather than the VLA co-fine-tuning.

minor comments (2)
  1. [Abstract and §3] Abstract and §3: The definition of 'Vision-Language-Action (VLA) model' and the precise tokenization scheme for actions (e.g., discretization granularity) could be stated more explicitly on first use to avoid ambiguity with standard VLM terminology.
  2. [Figure 3 and §5.3] Figure 3 and §5.3: Some qualitative examples of chain-of-thought reasoning would benefit from additional quantitative metrics (success rates across multiple trials) rather than single illustrative rollouts.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and constructive feedback on our manuscript. We address the major comments below and will incorporate revisions accordingly.

read point-by-point responses
  1. Referee: [§4 and §5] §4 (Methods) and §5 (Experiments): The claim that co-fine-tuning on vision-language tasks is what enables transfer of web-derived semantic capabilities (preventing degradation while learning actions) is load-bearing for the central contribution. The reported comparisons are to RT-1 and other baselines, but no control is presented that fine-tunes the identical base VLM solely on robotic trajectories while omitting the internet VQA data. This ablation is required to isolate whether the observed generalization and reasoning emerge from the co-training mixture or simply from the pretrained weights.

    Authors: We agree that this ablation would provide stronger causal evidence for the role of co-fine-tuning with vision-language data. The current manuscript compares RT-2 to RT-1, which is trained only on robotic trajectories but employs a different model architecture and training procedure. To directly address this, we will conduct and report an ablation where the same base VLM is fine-tuned solely on robotic data without the VQA mixture in the revised manuscript. This will help isolate the contribution of the co-training. revision: yes

  2. Referee: [§5.1] §5.1 (Evaluation setup): The 6k evaluation trials are cited as evidence for causal attribution to web pretraining, yet the manuscript does not report statistical tests, confidence intervals, or precise train/test splits and data mixture ratios for the novel-object and reasoning tasks. Without these, it is difficult to rule out that performance differences arise from other training factors rather than the VLA co-fine-tuning.

    Authors: We acknowledge the importance of statistical rigor and detailed reporting of experimental setup. In the revised version, we will include statistical tests (e.g., t-tests or bootstrap confidence intervals) for the key comparisons, along with precise details on train/test splits and the data mixture ratios used for the novel object and reasoning evaluations. The 6k trials aggregate results across multiple tasks and conditions, but we will provide more granular breakdowns and uncertainty estimates to strengthen the claims. revision: yes
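
Both responses turn on how the training mixture is constructed. A minimal sketch of that logic, assuming a simple weighted sampler over the two data sources; the weight, dataset contents, and field names are illustrative, and the paper's actual mixture schedule is not reproduced here.

```python
import random

def make_cofinetune_sampler(web_vqa, robot_traj, web_weight=0.5, seed=0):
    """Yield training examples from a weighted mixture of web vision-language
    data and tokenized robot trajectories. web_weight=0.0 reproduces the
    robot-only control the referee asks for; web_weight > 0 keeps web
    supervision present throughout fine-tuning (co-fine-tuning)."""
    rng = random.Random(seed)
    while True:
        source = web_vqa if rng.random() < web_weight else robot_traj
        yield rng.choice(source)

# Placeholder datasets for illustration only.
web_vqa    = [{"image": "web_000.jpg", "text": "Q: what is in the image? A: a cup"}]
robot_traj = [{"image": "robot_000.jpg",
               "text": "pick up the apple -> <act_0> <act_131> ... <act_255>"}]

co_finetune = make_cofinetune_sampler(web_vqa, robot_traj, web_weight=0.5)
robot_only  = make_cofinetune_sampler(web_vqa, robot_traj, web_weight=0.0)
batch = [next(co_finetune) for _ in range(8)]
```

The point of the sketch is that the requested ablation is a one-line change to the mixture weight rather than a new architecture, which is why the referee treats its absence as an isolatable gap.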

Circularity Check

0 steps flagged

No circularity; empirical training and held-out evaluation

full rationale

The paper presents an empirical recipe for co-fine-tuning vision-language models on robot trajectories (actions expressed as text tokens) plus internet-scale vision-language tasks, then reports performance on 6k held-out robotic trials. There is no derivation chain, set of equations, or first-principles prediction that could reduce to its own inputs by construction. Claims of emergent generalization and reasoning rest on direct experimental results rather than any self-referential fitting or load-bearing self-citation. The method is checked against external benchmarks rather than against its own outputs.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that web-scale semantic knowledge transfers through a shared text-token space with actions. Many training hyperparameters and the precise action discretization are free parameters chosen during development. The VLA category itself is introduced as a new framing.

free parameters (2)
  • action tokenization discretization
    How continuous robot actions are mapped to discrete text tokens is a design choice that affects what the model can learn.
  • co-fine-tuning mixture weights
    The relative amounts of robotic trajectory data versus web vision-language data are chosen to balance the two objectives.
axioms (1)
  • domain assumption: Semantic knowledge acquired from internet-scale vision-language data remains useful when the output space is extended to include robotic actions.
    Invoked to explain why emergent generalization and reasoning appear after co-fine-tuning.
invented entities (1)
  • Vision-Language-Action (VLA) model · no independent evidence
    purpose: A single model that processes vision, language, and actions in one token space.
    New category introduced to describe the architecture; no independent external evidence provided beyond the paper's own experiments.

pith-pipeline@v0.9.0 · 5865 in / 1521 out tokens · 64062 ms · 2026-05-10T22:30:42.657360+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Membership Inference Attacks on Vision-Language-Action Models

    cs.CR 2026-05 unverdicted novelty 8.0

    Vision-language-action models are highly vulnerable to membership inference attacks, including practical black-box versions that exploit generated actions and motion trajectories.

  2. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

    cs.AI 2024-04 accept novelty 8.0

    OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

  3. Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation

    cs.RO 2026-05 conditional novelty 7.0

    A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.

  4. SABER: A Scalable Action-Based Embodied Dataset for Real-World VLA Adaptation

    cs.RO 2026-05 unverdicted novelty 7.0

    SABER provides 44.8K multi-representation action samples from unscripted retail environments that raise a VLA model's mean success rate on ten manipulation tasks from 13.4% to 29.3%.

  5. OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

  6. Atomic-Probe Governance for Skill Updates in Compositional Robot Policies

    cs.RO 2026-04 unverdicted novelty 7.0

    A cross-version swap protocol reveals dominant skills that swing composition success by up to 50 percentage points, and an atomic probe with selective revalidation governs updates at lower cost than always re-testing ...

  7. CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies

    cs.CV 2026-04 unverdicted novelty 7.0

    CF-VLA uses a coarse initialization over endpoint velocity followed by single-step refinement to achieve strong performance with low inference steps on CALVIN, LIBERO, and real-robot tasks.

  8. Mask World Model: Predicting What Matters for Robust Robot Policy Learning

    cs.RO 2026-04 unverdicted novelty 7.0

    Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization...

  9. AeroBridge-TTA: Test-Time Adaptive Language-Conditioned Control for UAVs

    cs.RO 2026-04 unverdicted novelty 7.0

    AeroBridge-TTA achieves +22 pt average gains on out-of-distribution UAV dynamics mismatches by updating a latent state online from observed transitions in a language-conditioned policy.

  10. Harnessing Embodied Agents: Runtime Governance for Policy-Constrained Execution

    cs.RO 2026-04 unverdicted novelty 7.0

    A runtime governance framework for embodied agents achieves 96.2% interception of unauthorized actions and 91.4% recovery success in 1000 simulation trials by externalizing policy enforcement.

  11. BiCoord: A Bimanual Manipulation Benchmark towards Long-Horizon Spatial-Temporal Coordination

    cs.RO 2026-04 conditional novelty 7.0

    BiCoord is a new benchmark for long-horizon tightly coordinated bimanual manipulation that includes quantitative metrics and shows existing policies like DP, RDT, Pi0 and OpenVLA-OFT struggle on such tasks.

  12. 3D-VLA: A 3D Vision-Language-Action Generative World Model

    cs.CV 2024-03 unverdicted novelty 7.0

    3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.

  13. Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    cs.RO 2023-10 unverdicted novelty 7.0

    A collaborative dataset spanning 22 robots and 527 skills enables RT-X models that transfer capabilities across different robot embodiments.

  14. Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.

  15. Action Emergence from Streaming Intent

    cs.RO 2026-05 unverdicted novelty 6.0

    Streaming Intent lets a VLA model derive driving intent via streamed chain-of-thought reasoning and use it to steer a flow-matching action head, yielding competitive Waymo scores plus intent-based trajectory control w...

  16. Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models

    cs.RO 2026-05 unverdicted novelty 6.0

    Pace-and-Path Correction is a closed-form inference-time operator that decomposes a quadratic cost minimization into orthogonal pace compression and path offset channels to correct dynamics-blindness in chunked-action...

  17. StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception

    cs.RO 2026-05 unverdicted novelty 6.0

    StereoPolicy fuses stereo image pairs via a Stereo Transformer on pretrained 2D encoders to boost robotic manipulation policies, showing gains over monocular, RGB-D, point cloud, and multi-view methods in simulations ...

  18. Weather-Robust Scene Semantics with Vision-Aligned 4D Radar

    cs.RO 2026-05 unverdicted novelty 6.0

    Radar encoders aligned to frozen SigLIP embeddings enable weather-robust scene captioning via a frozen VLM with 7M trainable parameters, outperforming cameras on held-out adverse-weather sequences in K-RADAR.

  19. ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.

  20. Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    Decompose and Recompose decomposes seen robotic demonstrations into skill-action alignments and recomposes them via visual-semantic retrieval and planning to enable zero-shot cross-task generalization.

  21. Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression

    cs.CL 2026-05 unverdicted novelty 6.0

    A Group Relative Policy Optimization framework with concordance correlation coefficient rewards improves MLLM regression accuracy on long-tailed distributions, especially in medium- and few-shot regimes, without model...

  22. Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression

    cs.CL 2026-05 unverdicted novelty 6.0

    A plug-and-play RL method adds batch-level distributional supervision via CCC rewards to reduce regression-to-the-mean in MLLMs on imbalanced regression benchmarks.

  23. Physically Native World Models: A Hamiltonian Perspective on Generative World Modeling

    cs.AI 2026-05 unverdicted novelty 6.0

    Hamiltonian World Models structure latent dynamics around energy-conserving Hamiltonian evolution to produce physically grounded, action-controllable predictions for embodied decision making.

  24. Atomic-Probe Governance for Skill Updates in Compositional Robot Policies

    cs.RO 2026-04 unverdicted novelty 6.0

    Empirical study on robosuite tasks reveals a dominant-skill effect in compositions and shows that an atomic probe approximates full revalidation for skill updates at much lower cost.

  25. DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation

    cs.CV 2026-04 unverdicted novelty 6.0

    A scalable training-free pipeline using video segmentation, filtering, and off-the-shelf multimodal models creates DenseStep2M, a dataset of 100K videos and 2M detailed instructional steps that improves dense captioni...

  26. AsyncShield: A Plug-and-Play Edge Adapter for Asynchronous Cloud-based VLA Navigation

    cs.RO 2026-04 unverdicted novelty 6.0

    AsyncShield restores VLA geometric intent from latency via kinematic pose mapping and uses PPO-Lagrangian to balance tracking with LiDAR safety constraints in a plug-and-play module.

  27. RL Token: Bootstrapping Online RL with Vision-Language-Action Models

    cs.LG 2026-04 unverdicted novelty 6.0

    RL Token enables sample-efficient online RL fine-tuning of large VLAs, delivering up to 3x speed gains and higher success rates on real-robot manipulation tasks within minutes to hours.

  28. dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model

    cs.RO 2026-04 unverdicted novelty 6.0

    A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.

  29. CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors

    cs.RO 2026-04 unverdicted novelty 6.0

    CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.

  30. Navigating the Clutter: Waypoint-Based Bi-Level Planning for Multi-Robot Systems

    cs.RO 2026-04 unverdicted novelty 6.0

    Waypoint-based bi-level planning with curriculum RLVR improves multi-robot task success rates in dense-obstacle benchmarks over motion-agnostic and VLA baselines.

  31. SpaceDex: Generalizable Dexterous Grasping in Tiered Workspaces

    cs.RO 2026-04 unverdicted novelty 6.0

    SpaceDex achieves 63% success grasping unseen objects in tiered workspaces via VLM spatial planning and arm-hand feature separation, beating a 39% tabletop baseline in 100 real trials.

  32. Chain Of Interaction Benchmark (COIN): When Reasoning meets Embodied Interaction

    cs.RO 2026-04 unverdicted novelty 6.0

    COIN provides 50 interactive robotic tasks, a 1000-demonstration dataset collected via AR teleoperation, and metrics showing that CodeAsPolicy, VLA, and H-VLA models fail at causally-dependent interactive reasoning du...

  33. Grounded World Model for Semantically Generalizable Planning

    cs.RO 2026-04 conditional novelty 6.0

    A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.

  34. EmbodiedGovBench: A Benchmark for Governance, Recovery, and Upgrade Safety in Embodied Agent Systems

    cs.RO 2026-04 unverdicted novelty 6.0

    EmbodiedGovBench is a new benchmark framework that measures embodied agent systems on seven governance dimensions including policy adherence, recovery success, and upgrade safety.

  35. WARPED: Wrist-Aligned Rendering for Robot Policy Learning from Egocentric Human Demonstrations

    cs.RO 2026-04 unverdicted novelty 6.0

    WARPED synthesizes realistic wrist-view observations from monocular egocentric human videos via foundation models, hand-object tracking, retargeting, and Gaussian Splatting to train visuomotor policies that match tele...

  36. Learning Without Losing Identity: Capability Evolution for Embodied Agents

    cs.RO 2026-04 unverdicted novelty 6.0

    Embodied agents maintain a persistent identity while evolving capabilities via modular ECMs, raising simulated task success from 32.4% to 91.3% over 20 iterations with zero policy drift or safety violations.

  37. Neural Operators for Multi-Task Control and Adaptation

    cs.LG 2026-04 unverdicted novelty 6.0

    Neural operators approximate the solution operator for multi-task optimal control, generalizing to new tasks and enabling efficient adaptation via branch-trunk structure and meta-training.

  38. Learning Structured Robot Policies from Vision-Language Models via Synthetic Neuro-Symbolic Supervision

    cs.RO 2026-04 unverdicted novelty 6.0

    Vision-language models generate executable Behavior Tree policies for robots from synthetic vision-language data, with successful transfer demonstrated on two real manipulators.

  39. DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA

    cs.RO 2026-03 unverdicted novelty 6.0

    DIAL decouples intent from action in end-to-end VLAs using a latent visual foresight bottleneck and two-stage training, reaching SOTA on RoboCasa with 10x fewer demonstrations and zero-shot real-world transfer.

  40. World Action Models are Zero-shot Policies

    cs.RO 2026-02 unverdicted novelty 6.0

    DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...

  41. Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

    cs.AI 2026-01 conditional novelty 6.0

    Single-stage fine-tuning of a video model to generate actions as latent frames plus future states and values yields state-of-the-art robot policy performance on LIBERO, RoboCasa, and bimanual tasks.

  42. InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

    cs.RO 2025-10 unverdicted novelty 6.0

    InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.

  43. RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    cs.RO 2025-06 unverdicted novelty 6.0

    RoboTwin 2.0 automates diverse synthetic data creation for dual-arm robots via MLLMs and five-axis domain randomization, leading to 228-367% gains in manipulation success.

  44. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    cs.AI 2025-06 unverdicted novelty 6.0

    V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...

  45. SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

    cs.LG 2025-06 unverdicted novelty 6.0

    SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.

  46. GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    cs.RO 2025-03 unverdicted novelty 6.0

    GR00T N1 is a new open VLA foundation model for humanoid robots that outperforms imitation learning baselines in simulation and shows strong performance on real-world bimanual manipulation tasks.

  47. Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    cs.RO 2025-02 accept novelty 6.0

    OpenVLA-OFT fine-tuning boosts LIBERO success rate from 76.5% to 97.1%, speeds action generation 26x, and outperforms baselines on real bimanual dexterous tasks.

  48. DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

    cs.RO 2025-02 unverdicted novelty 6.0

    DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.

  49. FAST: Efficient Action Tokenization for Vision-Language-Action Models

    cs.RO 2025-01 unverdicted novelty 6.0

    FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diff...

  50. Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

    cs.CV 2024-12 unverdicted novelty 6.0

    Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.

  51. CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    cs.RO 2024-11 unverdicted novelty 6.0

    CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new ro...

  52. π₀: A Vision-Language-Action Flow Model for General Robot Control

    cs.LG 2024-10 unverdicted novelty 6.0

    π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.

  53. OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

    cs.CL 2024-10 unverdicted novelty 6.0

    OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.

  54. GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    cs.RO 2024-10 unverdicted novelty 6.0

    GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.

  55. OpenVLA: An Open-Source Vision-Language-Action Model

    cs.RO 2024-06 unverdicted novelty 6.0

    OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.

  56. DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

    cs.CV 2024-02 unverdicted novelty 6.0

    DriveVLM adds vision-language models with scene description, analysis, and hierarchical planning modules to autonomous driving, paired with a hybrid DriveVLM-Dual system tested on nuScenes and SUP-AD datasets and depl...

  57. Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

    cs.RO 2023-12 conditional novelty 6.0

    A GPT-style model pre-trained on large video datasets achieves 94.9% success on CALVIN multi-task manipulation and 85.4% zero-shot generalization, outperforming prior baselines.

  58. TD-MPC2: Scalable, Robust World Models for Continuous Control

    cs.LG 2023-10 conditional novelty 6.0

    TD-MPC2 scales an implicit world-model RL method to a 317M-parameter agent that masters 80 tasks across four domains with a single hyperparameter configuration.

  59. Nautilus: From One Prompt to Plug-and-Play Robot Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.

  60. ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 5.0

    ProcVLM learns procedure-grounded dense progress rewards for robotic manipulation via a reasoning-before-estimation VLM trained on a 60M-frame synthesized corpus from 30 embodied datasets.