Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
Pith reviewed 2026-05-11 04:31 UTC · model grok-4.3
The pith
A specific fine-tuning recipe for vision-language-action models raises OpenVLA's average success rate on LIBERO from 76.5% to 97.1% while increasing action generation throughput by 26×.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Integrating parallel decoding, action chunking, continuous action representations, and an L1 regression learning objective into an Optimized Fine-Tuning recipe substantially raises policy success rates and action generation throughput for vision-language-action models, as shown by OpenVLA-OFT achieving 97.1% average success across LIBERO task suites and 26 times higher throughput than the base OpenVLA.
What carries the argument
The Optimized Fine-Tuning (OFT) recipe that combines parallel decoding, action chunking, continuous action representations, and L1 regression objectives.
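To make the recipe concrete, here is a minimal, illustrative PyTorch sketch (not the authors' released code) of its two central pieces: an MLP head that regresses a chunk of K continuous actions from the backbone's hidden states in a single parallel decoding pass, trained with a plain L1 loss instead of cross-entropy over discrete action tokens. Module names, dimensions, and the chunk length are assumptions for illustration.

```python
# Minimal sketch of an OFT-style action head, under assumed shapes:
# a chunk of K continuous actions is regressed in one forward pass from the
# VLA backbone's hidden states at K placeholder positions, with an L1 loss.
import torch
import torch.nn as nn

class ChunkedActionHead(nn.Module):
    def __init__(self, hidden_dim: int = 4096, action_dim: int = 7, chunk_len: int = 8):
        super().__init__()
        self.chunk_len = chunk_len
        # Small MLP mapping one hidden state to one continuous action vector.
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, action_dim),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, chunk_len, hidden_dim), produced by the backbone
        # in a single (parallel) decoding pass rather than token by token.
        return self.mlp(hidden_states)  # (batch, chunk_len, action_dim)

def l1_chunk_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Plain L1 regression over the whole action chunk, replacing the
    # next-token cross-entropy used for discrete action tokens.
    return (pred - target).abs().mean()
```

Because the whole chunk comes out of one forward pass, per-chunk inference cost is close to that of a single decoding step, which is the mechanism the paper credits for its throughput gain.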
If this is right
- OpenVLA-OFT executes dexterous, high-frequency control tasks on a bimanual ALOHA robot.
- The model outperforms other VLAs fine-tuned with their default recipes, as well as strong imitation learning policies trained from scratch, by up to 15% (absolute) in average success rate in real-world tests.
- The approach provides greater flexibility in the model's input-output specifications.
- Inference efficiency improves enough to support real-time control on physical hardware.
Where Pith is reading between the lines
- The recipe's emphasis on simplicity in action representation and loss may reduce engineering overhead when adapting VLAs to new domains.
- If the gains hold across embodiments, practitioners could standardize on one fine-tuning pipeline rather than searching over many options for each robot.
- Continuous action outputs paired with chunking could be tested on non-VLA policies to isolate whether the speed benefit is architecture-specific.
Load-bearing premise
The design choices found effective for OpenVLA will transfer to other base vision-language-action models, robot bodies, and task distributions without further per-setup adjustments.
What would settle it
Applying the same OFT recipe to a different base VLA model or to a previously unseen robot embodiment and measuring whether success and speed gains remain comparable without additional tuning.
read the original abstract
Recent vision-language-action models (VLAs) build upon pretrained vision-language models and leverage diverse robot datasets to demonstrate strong task execution, language following ability, and semantic generalization. Despite these successes, VLAs struggle with novel robot setups and require fine-tuning to achieve good performance, yet how to most effectively fine-tune them is unclear given many possible strategies. In this work, we study key VLA adaptation design choices such as different action decoding schemes, action representations, and learning objectives for fine-tuning, using OpenVLA as our representative base model. Our empirical analysis informs an Optimized Fine-Tuning (OFT) recipe that integrates parallel decoding, action chunking, a continuous action representation, and a simple L1 regression-based learning objective to altogether improve inference efficiency, policy performance, and flexibility in the model's input-output specifications. We propose OpenVLA-OFT, an instantiation of this recipe, which sets a new state of the art on the LIBERO simulation benchmark, significantly boosting OpenVLA's average success rate across four task suites from 76.5% to 97.1% while increasing action generation throughput by 26$\times$. In real-world evaluations, our fine-tuning recipe enables OpenVLA to successfully execute dexterous, high-frequency control tasks on a bimanual ALOHA robot and outperform other VLAs ($\pi_0$ and RDT-1B) fine-tuned using their default recipes, as well as strong imitation learning policies trained from scratch (Diffusion Policy and ACT) by up to 15% (absolute) in average success rate. We release code for OFT and pretrained model checkpoints at https://openvla-oft.github.io/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines design choices for fine-tuning vision-language-action (VLA) models, using OpenVLA as a case study. Through empirical analysis, it develops an Optimized Fine-Tuning (OFT) recipe that incorporates parallel decoding, action chunking, continuous action representations, and an L1 regression loss. Applying this recipe to create OpenVLA-OFT sets a new state of the art on the LIBERO benchmark, raising the average success rate across four task suites from 76.5% to 97.1% and achieving a 26-fold increase in action generation throughput. Real-world tests on a bimanual ALOHA robot show the method outperforming other fine-tuned VLAs and imitation learning policies trained from scratch.
Significance. If these results hold, the work provides a significant practical advancement in adapting VLAs for robotics applications by offering a clear, effective fine-tuning strategy that enhances both success rates and computational efficiency. The inclusion of both simulation and real-robot experiments, along with the public release of code and model checkpoints, adds substantial value for the community and supports reproducibility. The direct comparisons to published baselines on LIBERO and ALOHA strengthen the empirical grounding.
minor comments (2)
- [Abstract] The real-world claim of outperforming baselines 'by up to 15% (absolute) in average success rate' would benefit from explicit clarification of whether this refers to the peak single-task gain or the mean across the evaluated tasks.
- [Experiments] The manuscript would be strengthened by a consolidated table in the experiments section listing the ablation results for each OFT component (parallel decoding, chunking, continuous actions, L1 loss) to isolate their individual contributions to the reported gains.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of our work and their recommendation to accept the manuscript. We appreciate the recognition of the empirical contributions, the practical value of the OFT recipe, the inclusion of both simulation and real-robot experiments, and the emphasis on reproducibility through code and checkpoint releases.
Circularity Check
No significant circularity: empirical evaluation on external benchmarks
full rationale
The paper conducts an empirical study of fine-tuning design choices (action decoding, representations, objectives) for the OpenVLA base model, selects an Optimized Fine-Tuning recipe based on observed performance, and validates it via direct experiments on the LIBERO benchmark suites and real-robot tasks. All headline metrics (97.1% success rate, 26× throughput) are computed from held-out evaluations against published baselines and other VLAs, with no mathematical derivations, parameter fits renamed as predictions, or load-bearing self-citations that reduce claims to inputs by construction. The chain is self-contained through standard experimental comparison.
Forward citations
Cited by 60 Pith papers
-
RotVLA: Rotational Latent Action for Vision-Language-Action Model
RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.
-
Beyond World-Frame Action Heads: Motion-Centric Action Frames for Vision-Language-Action Models
MCF-Proto adds a motion-centric local action frame and prototype parameterization to VLA models, inducing emergent geometric structure and improved robustness from standard demonstrations alone.
-
CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models
Capability vectors extracted from parameter differences between standard and auxiliary-finetuned VLA models can be merged into pretrained weights to match auxiliary-training performance while reducing computational ov...
-
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
-
VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models
VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.
-
When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning
State-conditioned commitment depth in a vision-language policy Pareto-dominates fixed-depth baselines on Sliding Puzzle and Sokoban, raising solve rates by up to 12.5 points while using 25% fewer actions and beating l...
-
Towards Backdoor-Based Ownership Verification for Vision-Language-Action Models
GuardVLA embeds a stealthy backdoor watermark in VLAs via secret messages in visual data and uses a swap-and-detect mechanism for post-release ownership verification that preserves task performance.
-
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
-
VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
VLA-GSE improves VLA adaptation by initializing generalized shared experts and specialized routed experts via spectral decomposition of the backbone, outperforming full fine-tuning and other PEFT methods on robotic be...
-
ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
ReflectDrive-2 achieves 91.0 PDMS on NAVSIM with camera input by training a discrete diffusion model to self-edit trajectories via RL-aligned AutoEdit.
-
MolmoAct2: Action Reasoning Models for Real-world Deployment
MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.
-
CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation
CoRAL lets LLMs act as adaptive cost designers for motion planners while using VLM priors and online identification to handle unknown physics, achieving over 50% higher success rates than baselines in unseen contact-r...
-
VUDA: Breaking CUDA-Vulkan Isolation for Spatial Sharing of Compute and Graphics on the Same GPU
VUDA enables spatial sharing between CUDA and Vulkan on GPUs via channel redirection and page-table grafting, achieving up to 85% higher throughput than temporal baselines in embodied AI tasks.
-
Thinking in Text and Images: Interleaved Vision--Language Reasoning Traces for Long-Horizon Robot Manipulation
A multimodal transformer generates and caches interleaved text-image traces to guide closed-loop actions, achieving 92.4% success on LIBERO-Long and 95.5% average on LIBERO.
-
Being-H0.7: A Latent World-Action Model from Egocentric Videos
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
-
3D Generation for Embodied AI and Robotic Simulation: A Survey
3D generation for embodied AI is shifting from visual realism toward interaction readiness, organized into data generation, simulation environments, and sim-to-real bridging roles.
-
DiscreteRTC: Discrete Diffusion Policies are Natural Asynchronous Executors
Discrete diffusion policies support native asynchronous execution via unmasking for real-time chunking, delivering higher success rates and 0.7x inference cost versus flow-matching RTC on dynamic robotics benchmarks a...
-
Libra-VLA: Achieving Learning Equilibrium via Asynchronous Coarse-to-Fine Dual-System
Libra-VLA introduces a coarse-to-fine dual-system architecture for VLA models that decouples discrete macro-directional planning from continuous micro-pose refinement, with performance peaking at balanced learning difficulty.
-
SpecRLBench: A Benchmark for Generalization in Specification-Guided Reinforcement Learning
SpecRLBench is a new benchmark evaluating generalization of LTL-guided RL methods across navigation and manipulation domains with static/dynamic environments and varied robot dynamics.
-
CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies
CF-VLA uses a coarse initialization over endpoint velocity followed by single-step refinement to achieve strong performance with low inference steps on CALVIN, LIBERO, and real-robot tasks.
-
Mini-BEHAVIOR-Gran: Revealing U-Shaped Effects of Instruction Granularity on Language-Guided Embodied Agents
Mini-BEHAVIOR-Gran benchmark reveals a U-shaped effect of instruction granularity on embodied agent performance, with planning-width correlating best and coarse instructions linked to vision-dominant shallow policies.
-
HazardArena: Evaluating Semantic Safety in Vision-Language-Action Models
HazardArena shows VLA models trained on safe data frequently produce unsafe actions in semantically risky but visually similar settings, and a training-free Safety Option Layer reduces those failures with little perfo...
-
STRONG-VLA: Decoupled Robustness Learning for Vision-Language-Action Models under Multimodal Perturbations
STRONG-VLA uses decoupled two-stage training to improve VLA model robustness, yielding up to 16% higher task success rates under seen and unseen perturbations on the LIBERO benchmark.
-
CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation
CT-1 transfers spatial reasoning from vision-language models to estimate camera trajectories, which are then used in a video diffusion model with wavelet regularization to produce controllable videos, claiming 25.7% b...
-
BiCoord: A Bimanual Manipulation Benchmark towards Long-Horizon Spatial-Temporal Coordination
BiCoord is a new benchmark for long-horizon tightly coordinated bimanual manipulation that includes quantitative metrics and shows existing policies like DP, RDT, Pi0 and OpenVLA-OFT struggle on such tasks.
-
VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models
VP-VLA decouples high-level reasoning from low-level control in VLA models by rendering spatial anchors as visual prompts directly in the RGB observation space, outperforming end-to-end baselines.
-
FrameSkip: Learning from Fewer but More Informative Frames in VLA Training
FrameSkip improves VLA policy training success from 66.50% to 76.15% by selecting high-importance frames and retaining only 20% of unique frames across three benchmarks.
-
Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models
GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.
-
Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models
VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.
-
GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization
GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.
-
Reinforcing VLAs in Task-Agnostic World Models
RAW-Dream lets VLAs learn new tasks in zero-shot imagination by using a world model pre-trained only on task-free behaviors and an unmodified VLM to supply rewards, with dual-noise verification to limit hallucinations.
-
See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model
GridS reduces visual tokens in VLA models to under 10% of the original count via task-aware differentiable resampling, delivering 76% lower FLOPs with no drop in task success rate on benchmarks and real robots.
-
HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models
HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.
-
Unified Noise Steering for Efficient Human-Guided VLA Adaptation
UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.
-
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
-
Adaptive Action Chunking via Multi-Chunk Q Value Estimation
ACH lets RL policies dynamically pick action chunk lengths by jointly estimating Q-values for all candidate lengths via a single Transformer pass.
-
ElasticFlow: One-Step Physics-Consistent Policy with Elastic Time Horizons for Language-Guided Manipulation
ElasticFlow delivers one-step physics-consistent diffusion policies for language-guided robot control by modeling average velocity fields and using elastic time horizons to overcome spectral bias.
-
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.
-
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.
-
ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations
ForgeVLA enables federated VLA model training from unlabeled vision-action pairs by recovering language via embodied classifiers and using contrastive planning plus adaptive aggregation to avoid feature collapse.
-
Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation
Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.
-
Predictive but Not Plannable: RC-aux for Latent World Models
RC-aux corrects spatiotemporal mismatch in reconstruction-free latent world models by adding multi-horizon prediction and reachability supervision, improving planning performance on goal-conditioned pixel-control tasks.
-
TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation
TriRelVLA introduces triadic object-hand-task relational representations and a task-grounded graph transformer with a relational bottleneck to improve generalization in robotic manipulation across scenes, objects, and tasks.
-
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation
ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
-
From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models
A unified comparison of latent action supervision strategies for VLA models reveals task-specific benefits, with image-based approaches aiding reasoning and generalization, action-based aiding motor control, and discr...
-
ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.
-
MolmoAct2: Action Reasoning Models for Real-world Deployment
MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture chang...
-
VLA-ATTC: Adaptive Test-Time Compute for VLA Models with Relative Action Critic Model
VLA-ATTC equips VLA models with adaptive test-time compute via an uncertainty clutch and relative action critic, cutting failure rates by over 50% on LIBERO-LONG.
-
Embodied Interpretability: Linking Causal Understanding to Generalization in Vision-Language-Action Models
Interventional attribution via ISS and NMR diagnoses causal misalignment in VLA policies and predicts their generalization performance across manipulation tasks.
-
LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.
-
LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
LaST-R1 reaches 99.8% average success on the LIBERO benchmark using one-shot warm-up plus LAPO reinforcement learning on latent physical reasoning, with up to 44% real-world gains on complex single- and dual-arm tasks.
-
PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations
PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.
-
Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation
MoT-HRA learns embodiment-agnostic human-intention priors from the HA-2.2M dataset of 2.2M human video episodes through a three-expert hierarchy to improve robotic motion plausibility and robustness under distribution shift.
-
dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model
A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.
-
CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors
CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.
-
Temporal Difference Calibration in Sequential Tasks: Application to Vision-Language-Action Models
Temporal difference calibration aligns uncertainty estimates in vision-language-action models with their value functions for better sequential performance.
-
UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling
UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.
-
Test-Time Perturbation Learning with Delayed Feedback for Vision-Language-Action Models
PDF improves VLA success rates on LIBERO and Atari by applying test-time perturbation learning with delayed feedback to correct trajectory overfitting and overconfidence.
-
OFlow: Injecting Object-Aware Temporal Flow Matching for Robust Robotic Manipulation
OFlow unifies temporal foresight and object-aware reasoning inside a shared latent space via flow matching to improve VLA robustness in robotic manipulation under distribution shifts.
-
LongBench: Evaluating Robotic Manipulation Policies on Real-World Long-Horizon Tasks
LongBench is a new real-world benchmark that separates execution robustness from context-dependent reasoning in long-horizon robotic manipulation and shows these are distinct challenges not uniformly solved by memory-...
Reference graph
Works this paper leans on
-
[1]
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang,...
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[2]
Minivla: A better vla with a smaller footprint, 2024
Suneel Belkhale and Dorsa Sadigh. Minivla: A better vla with a smaller footprint, 2024. URL https://github.com/Stanford-ILIAD/openvla-mini
work page 2024
-
[3]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. pi0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818 , 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[5]
Diffusion policy: Visuomotor policy learning via action diffusion
Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, page 02783649241273668, 2023
work page 2023
-
[6]
Manipulate-anything: Automating real-world robots using vision-language models
Jiafei Duan, Wentao Yuan, Wilbert Pumacay, Yi Ru Wang, Kiana Ehsani, Dieter Fox, and Ranjay Krishna. Manipulate-anything: Automating real-world robots using vision-language models. arXiv preprint arXiv:2406.18915, 2024
-
[7]
An interactive agent foundation model
Zane Durante, Bidipta Sarkar, Ran Gong, Rohan Taori, Yusuke Noda, Paul Tang, Ehsan Adeli, Shrinidhi Kowshika Lakshmikanth, Kevin Schulman, Arnold Milstein, et al. An interactive agent foundation model. arXiv preprint arXiv:2402.05929, 2024
-
[8]
Introducing rfm-1: Giving robots human-like reasoning capabilities, 2024
Andrew Sohn et al. Introducing rfm-1: Giving robots human-like reasoning capabilities, 2024. URL https://covariant.ai/insights/introducing-rfm-1-giving-robots-human-like-reasoning-capabilities/
work page 2024
-
[9]
Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation
Samir Yitzhak Gadre, Mitchell Wortsman, Gabriel Ilharco, Ludwig Schmidt, and Shuran Song. Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23171–23181, 2023
work page 2023
-
[10]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 770–778, 2016
work page 2016
-
[11]
Gaussian Error Linear Units (GELUs)
Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415 , 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[12]
Denoising diffusion probabilistic models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020
work page 2020
-
[13]
Diffusion transformer policy: Scaling diffusion transformer for generalist vision-language-action learning
Zhi Hou, Tianyi Zhang, Yuwen Xiong, Hengjun Pu, Chengyang Zhao, Ronglei Tong, Yu Qiao, Jifeng Dai, and Yuntao Chen. Diffusion transformer policy: Scaling diffusion transformer for generalist vision-language-action learning, 2025. URL https://arxiv.org/abs/2410.15959
-
[14]
LoRA: Low-Rank Adaptation of Large Language Models
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[15]
An embodied generalist agent in 3d world
Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3d world. In Proceedings of the International Conference on Machine Learning (ICML) , 2024
work page 2024
-
[16]
Language models as zero-shot planners: Extracting actionable knowledge for embodied agents
Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents, 2022. URL https://arxiv.org/abs/2201.07207
-
[17]
Inner Monologue: Embodied Reasoning through Planning with Language Models
Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Noah Brown, Tomas Jackson, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. Inner monologue: Embodied reasoning through planning with language models, 2022. URL https://arxiv.org/abs/2207.05608
work page internal anchor Pith review arXiv 2022
-
[18]
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. Voxposer: Composable 3d value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973, 2023
work page internal anchor Pith review arXiv 2023
-
[19]
Refined policy distillation: From vla generalists to rl experts
Tobias Jülg, Wolfram Burgard, and Florian Walter. Refined policy distillation: From vla generalists to rl experts. arXiv preprint arXiv:2503.05833, 2025
-
[20]
Language-driven representation learning for robotics
Siddharth Karamcheti, Suraj Nair, Annie S. Chen, Thomas Kollar, Chelsea Finn, Dorsa Sadigh, and Percy Liang. Language-driven representation learning for robotics. ArXiv, abs/2302.12766, 2023. URL https://api.semanticscholar.org/CorpusID:257205716
-
[21]
Prismatic vlms: Investigating the design space of visually-conditioned language models
Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. Prismatic vlms: Investigating the design space of visually-conditioned language models. arXiv preprint arXiv:2402.07865, 2024
-
[22]
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024
work page internal anchor Pith review arXiv 2024
-
[23]
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[24]
Vision-language foundation models as effective robot imitators
Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, et al. Vision-language foundation models as effective robot imitators. arXiv preprint arXiv:2311.01378, 2023
-
[25]
Flow Matching for Generative Modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[26]
Libero: Benchmarking knowledge transfer for lifelong robot learning
Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems , 36, 2024
work page 2024
-
[27]
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864 , 2024
work page internal anchor Pith review arXiv 2024
-
[28]
Bidirectional decoding: Improving action chunking via closed-loop resampling
Yuejiang Liu, Jubayer Ibn Hamid, Annie Xie, Yoonho Lee, Maximilian Du, and Chelsea Finn. Bidirectional decoding: Improving action chunking via closed-loop resampling. arXiv preprint arXiv:2408.17355 , 2024
-
[29]
DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models
Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022
-
[30]
VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training
Yecheng Jason Ma, Shagun Sodhani, Dinesh Jayaraman, Osbert Bastani, Vikash Kumar, and Amy Zhang. Vip: Towards universal visual reward and representation via value-implicit pre-training. arXiv preprint arXiv:2210.00030, 2022
work page internal anchor Pith review arXiv 2022
-
[31]
Liv: Language-image representations and rewards for robotic control
Yecheng Jason Ma, Vikash Kumar, Amy Zhang, Osbert Bastani, and Dinesh Jayaraman. Liv: Language-image representations and rewards for robotic control. In International Conference on Machine Learning , pages 23301–23320. PMLR, 2023
work page 2023
-
[32]
Where are we in the search for an artificial visual cortex for embodied intelligence?
Arjun Majumdar, Karmesh Yadav, Sergio Arnaud, Jason Ma, Claire Chen, Sneha Silwal, Aryan Jain, Vincent-Pierre Berges, Tingfan Wu, Jay Vakil, et al. Where are we in the search for an artificial visual cortex for embodied intelligence? Advances in Neural Information Processing Systems, 36:655–677, 2023
work page 2023
-
[33]
R3m: A universal visual representation for robot manipulation
Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation. In CoRL, 2022
work page 2022
-
[34]
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Abby O'Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, et al. Open x-embodiment: Robotic learning datasets and rt-x models. arXiv preprint arXiv:2310.08864, 2023
work page internal anchor Pith review arXiv 2023
-
[35]
DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[36]
Vision-language-action model and diffusion policy switching enables dexterous control of an anthropomorphic hand
Cheng Pan, Kai Junge, and Josie Hughes. Vision-language-action model and diffusion policy switching enables dexterous control of an anthropomorphic hand. arXiv preprint arXiv:2410.14022, 2024
work page internal anchor Pith review arXiv 2024
-
[37]
Film: Visual reasoning with a general conditioning layer
Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI conference on artificial intelligence , volume 32, 2018
work page 2018
-
[38]
FAST: Efficient Action Tokenization for Vision-Language-Action Models
Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models, 2025. URL https://arxiv.org/abs/2501.09747
work page internal anchor Pith review arXiv 2025
-
[39]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021
work page 2021
-
[40]
Multimodal diffusion transformer: Learning versatile behavior from multimodal goals
Moritz Reuss, Ömer Erdinç Yağmurlu, Fabian Wenzel, and Rudolf Lioutikov. Multimodal diffusion transformer: Learning versatile behavior from multimodal goals, 2024. URL https://arxiv.org/abs/2407.05996
-
[41]
A reduction of imitation learning and structured prediction to no-regret online learning
Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011
work page 2011
-
[42]
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
V Sanh. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019
work page internal anchor Pith review arXiv 2019
-
[43]
Yell at your robot: Improving on-the-fly from language corrections
Lucy Xiaoyang Shi, Zheyuan Hu, Tony Z Zhao, Archit Sharma, Karl Pertsch, Jianlan Luo, Sergey Levine, and Chelsea Finn. Yell at your robot: Improving on-the-fly from language corrections. arXiv preprint arXiv:2403.12910, 2024
-
[44]
Progprompt: Generating situated robot task plans using large language models, 2022
Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason, and Animesh Garg. Progprompt: Generating situated robot task plans using large language models, 2022. URL https://arxiv.org/abs/2209.11302
work page 2022
-
[45]
Llm-planner: Few-shot grounded planning for embodied agents with large language models
Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M. Sadler, Wei-Lun Chao, and Yu Su. Llm-planner: Few-shot grounded planning for embodied agents with large language models, 2023. URL https://arxiv.org/abs/2212.04088
-
[46]
Denoising Diffusion Implicit Models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2020
- [47]
-
[48]
Efficientnet: Rethinking model scaling for convolutional neural networks
Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning , pages 6105–6114. PMLR, 2019
work page 2019
-
[49]
Octo: An Open-Source Generalist Robot Policy
Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[50]
Predictive inverse dynamics models are scalable learners for robotic manipulation
Yang Tian, Sizhe Yang, Jia Zeng, Ping Wang, Dahua Lin, Hao Dong, and Jiangmiao Pang. Predictive inverse dynamics models are scalable learners for robotic manipulation. arXiv preprint arXiv:2412.15109, 2024
-
[51]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 , 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[52]
Bridgedata v2: A dataset for robot learning at scale
Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning, pages 1723–
- [53]
-
[54]
URL https://wayve.ai/thinking/lingo-2-driving-with-language/
-
[55]
Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation, 2024
Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Kun Wu, Zhiyuan Xu, Ran Cheng, Chaomin Shen, Yaxin Peng, Feifei Feng, et al. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation. arXiv preprint arXiv:2409.12514, 2024
-
[56]
Sigmoid loss for language image pre-training
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023
work page 2023
-
[57]
Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705 , 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[58]
Aloha unleashed: A simple recipe for robot dexterity
Tony Z Zhao, Jonathan Tompson, Danny Driess, Pete Florence, Kamyar Ghasemipour, Chelsea Finn, and Ayzaan Wahid. Aloha unleashed: A simple recipe for robot dexterity. arXiv preprint arXiv:2410.13126 , 2024
-
[59]
3D-VLA: A 3D Vision-Language-Action Generative World Model
Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3d-vla: 3d vision-language-action generative world model. arXiv preprint arXiv:2403.09631, 2024
work page internal anchor Pith review arXiv 2024
-
[60]
Language-conditioned learning for robotic manipulation: A survey
Hongkuan Zhou, Xiangtong Yao, Yuan Meng, Siming Sun, Zhenshan Bing, Kai Huang, and Alois Knoll. Language-conditioned learning for robotic manipulation: A survey. arXiv preprint arXiv:2312.10807, 2023
APPENDIX A. Model Architecture Details. Base OpenVLA Architecture: OpenVLA combines a fused vision backbone (with both SigLIP [55] and DINOv2 [35] vision tr...
-
[61]
processes multiple input images (e.g., third-person image plus wrist camera images) through the shared SigLIP-DINOv2 backbone
-
[62]
projects robot proprioceptive state to language embedding space via a 2-layer MLP with GELU activation
-
[63]
replaces causal attention with bidirectional attention for parallel decoding
-
[64]
substitutes the language model decoder output layer with a 4-layer MLP (ReLU activation) for generation of continuous actions (instead of discrete actions)
-
[65]
outputs chunks of K actions instead of single-timestep actions
-
[66]
(for OpenVLA-OFT+) adds FiLM [37] modules that use the average task language embedding to modulate visual features in both SigLIP and DINOv2 vision transformers (see Appendix C for details). The complete OpenVLA-OFT+ architecture is illustrated in Figure 1. B. Implementation Details
-
[67]
A causal attention mask ensures the model only attends to current and previous tokens
Parallel Decoding Implementation: In the original OpenVLA autoregressive training scheme, the model receives ground-truth action tokens shifted right by one position as input (a setup known as teacher forcing). A causal attention mask ensures the model only attends to current and previous tokens. At test time, each predicted token is fed back as input for...
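A minimal sketch, under assumed shapes and with hypothetical helper names, of the masking change this fragment describes: the causal mask used for autoregressive teacher forcing versus a bidirectional mask that lets all action-chunk positions be decoded from placeholder embeddings in a single pass.

```python
# Sketch of causal vs. bidirectional masks for parallel decoding (illustrative only).
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Autoregressive setup: position i may attend only to positions <= i.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def bidirectional_mask(seq_len: int) -> torch.Tensor:
    # Parallel decoding: every position attends to every other position,
    # so all K action slots are predicted together in one forward pass.
    return torch.ones(seq_len, seq_len, dtype=torch.bool)

def decode_chunk(backbone, prompt_embeds, action_queries):
    # prompt_embeds: (B, P, D) vision + language tokens; action_queries: (B, K, D)
    # placeholder embeddings standing in for the K actions of a chunk.
    # `backbone` is a hypothetical transformer callable accepting an attention mask.
    tokens = torch.cat([prompt_embeds, action_queries], dim=1)
    mask = bidirectional_mask(tokens.shape[1])
    hidden = backbone(tokens, attention_mask=mask)   # one pass, no feedback loop
    return hidden[:, -action_queries.shape[1]:, :]   # hidden states at the action slots
```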
-
[68]
Continuous Action Representations: For discrete actions, increasing the number of bins used for discretization improves precision but reduces the frequency of individual tokens in the training data, potentially hurting generalization. On the other hand, with a continuous action representation, the VLA can directly model the action distribution without los...
-
[69]
Input Processing Details: Passing each input image through the OpenVLA fused vision encoder produces 256 patch embeddings, which are projected to the language model embedding space via a 3-layer MLP with GELU activation [11]. Low-dimensional robot states are also projected to the language embedding space through a 2-layer MLP with GELU activation. C. Feat...
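A minimal sketch, with assumed dimensions, of the two input projections this fragment describes: a 3-layer GELU MLP mapping each patch embedding into the language model's embedding space, and a 2-layer GELU MLP doing the same for the low-dimensional robot state.

```python
# Illustrative input projectors (dimensions are assumptions, not the paper's exact values).
import torch.nn as nn

def vision_projector(patch_dim: int = 2176, llm_dim: int = 4096) -> nn.Module:
    # 3-layer MLP with GELU: applied to each of the 256 patch embeddings per image.
    return nn.Sequential(
        nn.Linear(patch_dim, llm_dim), nn.GELU(),
        nn.Linear(llm_dim, llm_dim), nn.GELU(),
        nn.Linear(llm_dim, llm_dim),
    )

def proprio_projector(state_dim: int = 14, llm_dim: int = 4096) -> nn.Module:
    # 2-layer MLP with GELU: maps the low-dimensional proprioceptive state
    # into the same language-model embedding space.
    return nn.Sequential(
        nn.Linear(state_dim, llm_dim), nn.GELU(),
        nn.Linear(llm_dim, llm_dim),
    )
```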
-
[70]
ALOHA Task Suite Details: Below are detailed specifications for each task in our ALOHA experiments:
-
[71]
“fold shorts” • Task: Bimanual folding of white shorts with two synchronized folds • Dataset: 20 demonstrations (19 training, 1 validation) • Episode length: 1000 timesteps (40 seconds) • Evaluation: 10 trials • Initial states: See Figure 9
-
[72]
“fold shirt” • Task: Long-horizon T-shirt folding with multiple synchronized bimanual folds • Dataset: 30 demonstrations (29 training, 1 validation) • Episode length: 1250 timesteps (50 seconds) • Evaluation: 10 trials • Initial states: See Figure 10
-
[73]
“scoop X into bowl” • Task: Move bowl to center, scoop specified ingredient (raisins, almonds and green M&Ms, or pretzels) into bowl • Dataset: 45 demonstrations (15 per target; 42 training, 3 validation) • Episode length: 900 timesteps (36 seconds) • Evaluation: 12 trials • Initial states: See Figure 11
-
[74]
“put X into pot” • Task: Open pot, place specified item (green pepper, red pepper, or yellow corn) into pot, close pot • Dataset: 300 demonstrations (100 per target; 285 training, 15 validation)† • Initial variation: 45 cm horizontal, 20 cm vertical for food items; fixed pot pose • Episode length: 400 timesteps (16 seconds) • Evaluation: 24 trials (12 in-...
-
[75]
ALOHA Task Scoring Rubric: The scoring rubrics and detailed results for the four ALOHA tasks are shown in Tables X, XI, XII, and XIII. G. Additional Experiments
-
[76]
Single OpenVLA-OFT Policy for All LIBERO Task Suites Combined: In Section V and Table I, we report results with OpenVLA-OFT policies trained on each task suite independently. In this section, we assess whether our method scales to larger fine-tuning datasets by training one OpenVLA-OFT policy on all four task suites combined. As shown in Table XIV, thi...
-
[77]
Ablating FiLM in LIBERO: The FiLM ablation study in Section VI suggests that FiLM is crucial for enabling strong language following in real-world ALOHA robot tasks. In this †This relatively large number of demonstrations for the “put X into pot” task is not necessary for satisfactory performance. It simply reflects an earlier investigative phase of this w...
-
[78]
Ablating the OpenVLA Pretrained Representation: We evaluate the performance of OpenVLA-OFT policies produced by fine-tuning the underlying Prismatic VLM [21] directly on the LIBERO downstream datasets without OpenVLA’s Open X-Embodiment [34] robot pretraining. This ablation study investigates whether OpenVLA’s robot-pretrained representation remains val...
-
[79]
Scaling Up OpenVLA-OFT to a Larger Real-World Dataset (BridgeData V2): In Appendix G1, we observe that a single OpenVLA-OFT policy can effectively fit all four LIBERO task suite datasets combined, confirming that the proposed method scales to larger fine-tuning datasets. In this section, we scale up the fine-tuning data further and assess whether OpenVLA-...