A Pragmatic VLA Foundation Model

Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, Fangjing Wang · 2026 · cs.RO · arXiv 2601.18692

Canonical reference, cited by 29 Pith papers. Of the classified citations, 73% cite this work as background.

abstract

Offering great potential in robotic manipulation, a capable Vision-Language-Action (VLA) foundation model is expected to faithfully generalize across tasks and platforms while ensuring cost efficiency (e.g., data and GPU hours required for adaptation). To this end, we develop LingBot-VLA with around 20,000 hours of real-world data from 9 popular dual-arm robot configurations. Through a systematic assessment on 3 robotic platforms, each completing 100 tasks with 130 post-training episodes per task, our model achieves clear superiority over competitors, showcasing its strong performance and broad generalizability. We have also built an efficient codebase, which delivers a throughput of 261 samples per second with an 8-GPU training setup, representing a 1.5–2.8× (depending on the relied VLM base model) speedup over existing VLA-oriented codebases. The above features ensure that our model is well-suited for real-world deployment. To advance the field of robot learning, we provide open access to the code, base model, and benchmark data, with a focus on enabling more challenging tasks and promoting sound evaluation standards.
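
The figures quoted in the abstract imply a few derived quantities that are easy to work out. The sketch below does only that arithmetic. It assumes that 261 samples per second is the aggregate over all 8 GPUs, that the compared codebases run on the same 8-GPU setup, and that each of the 3 platforms has its own set of 100 tasks; the abstract does not confirm any of these, and all variable names are illustrative.

```python
# Back-of-envelope arithmetic on figures quoted in the abstract.
# Assumptions (not stated in the abstract): 261 samples/s is the aggregate over
# all 8 GPUs, the baselines run on the same 8-GPU setup, and each of the
# 3 platforms has its own 100-task set.

THROUGHPUT = 261.0                    # samples per second (reported)
NUM_GPUS = 8                          # reported training setup
SPEEDUP_LOW, SPEEDUP_HIGH = 1.5, 2.8  # reported speedup range
PLATFORMS, TASKS, EPISODES_PER_TASK = 3, 100, 130

per_gpu = THROUGHPUT / NUM_GPUS
# A larger speedup implies a slower baseline, so the bounds swap.
baseline_slowest = THROUGHPUT / SPEEDUP_HIGH
baseline_fastest = THROUGHPUT / SPEEDUP_LOW

episodes_per_platform = TASKS * EPISODES_PER_TASK
episodes_total = PLATFORMS * episodes_per_platform

print(f"per-GPU throughput: {per_gpu:.3f} samples/s")
print(f"implied baseline throughput: {baseline_slowest:.1f} to {baseline_fastest:.1f} samples/s")
print(f"post-training episodes: {episodes_per_platform} per platform, {episodes_total} in total")
```

Under these assumptions the reported numbers correspond to roughly 32.6 samples per second per GPU, baseline codebases running at about 93 to 174 samples per second, and 39,000 post-training episodes across the three platforms.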

citation-role summary

background 8 · baseline 2 · method 1

citation-polarity summary

background 8 · baseline 2 · use method 1
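
The 73% background share reported for this hub can be recomputed from the role counts. The sketch below assumes the percentage is taken over classified citations only (11 in total) and not over all 29 citing papers; the dictionary name is illustrative.

```python
# Recompute the share of each citation role from the counts listed on this page.
# Assumption: percentages are relative to classified citations, not all citing papers.
role_counts = {"background": 8, "baseline": 2, "method": 1}

classified = sum(role_counts.values())
shares = {role: 100 * n / classified for role, n in role_counts.items()}

for role, pct in shares.items():
    print(f"{role}: {pct:.0f}% of {classified} classified citations")
```

With these counts, background accounts for 8 of 11 classified citations, which rounds to the 73% shown.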

representative citing papers

X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining

cs.CV · 2026-06-07 · unverdicted · novelty 7.0

X-Tokenizer creates semantic action tokens via asymmetric residual quantization and contrastive pretraining on large trajectory data, outperforming prior methods like FAST on robotic tasks.

EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control

cs.RO · 2026-05-21 · conditional · novelty 7.0

EvoScene-VLA maintains an action-updated scene prior across control chunks in VLA policies, raising success rates on RoboTwin tasks from 87.2% to 89.1% fixed and 86.1% to 88.5% randomized while outperforming baselines on a real robot.

RotVLA: Rotational Latent Action for Vision-Language-Action Model

cs.RO · 2026-05-13 · unverdicted · novelty 7.0

RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.

RIO: Flexible Real-Time Robot I/O for Cross-Embodiment Robot Learning

cs.RO · 2026-05-12 · unverdicted · novelty 7.0

RIO introduces a lightweight open-source framework that abstracts real-time robot I/O to support easy switching between embodiments and platforms for collecting data and deploying VLAs.

Being-H0.7: A Latent World-Action Model from Egocentric Videos

cs.RO · 2026-04-30 · unverdicted · novelty 7.0

Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

Mini-BEHAVIOR-Gran: Revealing U-Shaped Effects of Instruction Granularity on Language-Guided Embodied Agents

cs.AI · 2026-04-18 · unverdicted · novelty 7.0

Mini-BEHAVIOR-Gran benchmark reveals a U-shaped effect of instruction granularity on embodied agent performance, with planning-width correlating best and coarse instructions linked to vision-dominant shallow policies.

Training Vision-Language-Action Models with Dense Embodied Chain-of-Thought Supervision

cs.RO · 2026-06-29 · unverdicted · novelty 6.0

ZR-0 trains a dual-stream VLA model on 60 million frames with dense ECoT annotations so that high-level reasoning transfers across single-arm, bimanual, and humanoid embodiments while skipping reasoning at inference.

Keypose Exploration: Efficient Automatic Trajectory Labelling and Cross-Embodiment Policy Transfer

cs.RO · 2026-06-27 · unverdicted · novelty 6.0

An automatic single-demo VLM trajectory labelling pipeline enables keypose-guided diffusion policies that match baseline performance and show preliminary benefits for cross-embodiment transfer on robomimic tasks.

Translation as a Bridging Action: Transferring Manipulation Skills from Humans to Robots

cs.RO · 2026-06-26 · unverdicted · novelty 6.0

A relative wrist translation bridging action with a vision-language-action model using interleaved tokens and attention masking transfers human manipulation skills to robots more effectively than 6DoF actions.

FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies

cs.RO · 2026-05-26 · unverdicted · novelty 6.0

FineVLA unifies robot datasets into 47k fine-grained trajectories, adds a VLM annotator and benchmark, and shows that mixing fine-grained and goal-level instructions improves steerable control without hurting task success.

HumanNet: Scaling Human-centric Video Learning to One Million Hours

cs.CV · 2026-05-07 · unverdicted · novelty 6.0

HumanNet is a 1M-hour human-centric video dataset with interaction annotations that enables better vision-language-action model performance than equivalent robot data in a controlled test.

PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations

cs.AI · 2026-04-30 · unverdicted · novelty 6.0

PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.

Long-Horizon Manipulation via Trace-Conditioned VLA Planning

cs.RO · 2026-04-23 · unverdicted · novelty 6.0

LoHo-Manip enables robust long-horizon robot manipulation by using a receding-horizon VLM manager to output progress-aware subtask sequences and 2D visual traces that condition a VLA executor for automatic replanning.

Human Cognition in Machines: A Unified Perspective of World Models

cs.RO · 2026-04-17 · unverdicted · novelty 6.0

The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and proposes Epistemic World Models as a new category for scientific discovery agents.

VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis

cs.RO · 2026-04-10 · unverdicted · novelty 6.0

VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.

SABER: A Stealthy Agentic Black-Box Attack Framework for Vision-Language-Action Models

cs.RO · 2026-03-26 · unverdicted · novelty 6.0

SABER uses a trained ReAct agent to produce bounded adversarial edits to robot instructions, cutting task success by 20.6% and increasing execution length and violations on the LIBERO benchmark across six VLA models.

FASTER: Rethinking Real-Time Flow VLAs

cs.RO · 2026-03-19 · unverdicted · novelty 6.0 · 2 refs

FASTER adds a Horizon-Aware Schedule to flow VLAs that compresses immediate-action denoising to one step while keeping long-horizon trajectory quality, lowering real-robot reaction latency.

Seeing Touch from Motion: A Unified Modality-Aware Visuo-Tactile Policy with Tactile Motion Correlation

cs.RO · 2026-06-29 · unverdicted · novelty 5.0

A visuo-tactile policy learning method that exploits tactile motion correlation for contact state distinction and Mixture-of-Transformers for cross-modal fusion.

DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation

cs.RO · 2026-05-29 · unverdicted · novelty 5.0

DeMaVLA is a VLA foundation model using a pruned action expert and flow matching, pre-trained on 5000 hours of real demonstrations and post-trained on multi-task folding data with human-in-the-loop correction, reporting competitive benchmark and real-world folding performance.

Wall-OSS-0.5 Technical Report

cs.RO · 2026-05-29 · unverdicted · novelty 5.0

Wall-OSS-0.5 is a 4B VLA model pretrained across many embodiments that achieves zero-shot real-robot performance on a 17-task suite and outperforms π_0.5 after fine-tuning.

GEM: Generative Supervision Helps Embodied Intelligence

cs.CV · 2026-05-27 · unverdicted · novelty 5.0

GEM adds generative depth supervision to VLM pre-training and reports improved results on embodied benchmarks plus real-world robot execution.

Pelican-Unify 1.0: A Unified Embodied Intelligence Model for Understanding, Reasoning, Imagination and Action

cs.RO · 2026-05-14 · unverdicted · novelty 5.0 · 2 refs

A unified embodied foundation model uses one VLM for understanding and reasoning plus a joint video-action future generator, reporting competitive scores on VLM, world modeling, and robot benchmarks without apparent compromise.

Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT

cs.RO · 2026-05-09 · unverdicted · novelty 5.0 · 2 refs

ConSFT is a gradient-scaling fine-tuning objective for flow-matching VLAs that bounds parameter disruption via model-confidence weighting, yielding over 20% better capability retention than vanilla SFT on LIBERO and RoboTwin.

Goal2Skill: Long-Horizon Manipulation with Adaptive Planning and Reflection

cs.RO · 2026-04-15 · unverdicted · novelty 5.0

A dual VLM-VLA framework for long-horizon robot manipulation achieves 32.4% success on RMBench tasks versus 9.8% for the strongest baseline via structured memory and closed-loop adaptive replanning.

citing papers explorer

Showing 29 of 29 citing papers.

X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining · cs.CV · 2026-06-07 · unverdicted · none · ref 20 · internal anchor
EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control · cs.RO · 2026-05-21 · conditional · none · ref 40 · internal anchor
RotVLA: Rotational Latent Action for Vision-Language-Action Model · cs.RO · 2026-05-13 · unverdicted · none · ref 45 · internal anchor
RIO: Flexible Real-Time Robot I/O for Cross-Embodiment Robot Learning · cs.RO · 2026-05-12 · unverdicted · none · ref 48 · internal anchor
Being-H0.7: A Latent World-Action Model from Egocentric Videos · cs.RO · 2026-04-30 · unverdicted · none · ref 116 · internal anchor
Mini-BEHAVIOR-Gran: Revealing U-Shaped Effects of Instruction Granularity on Language-Guided Embodied Agents · cs.AI · 2026-04-18 · unverdicted · none · ref 6 · internal anchor
Training Vision-Language-Action Models with Dense Embodied Chain-of-Thought Supervision · cs.RO · 2026-06-29 · unverdicted · none · ref 55 · internal anchor
Keypose Exploration: Efficient Automatic Trajectory Labelling and Cross-Embodiment Policy Transfer · cs.RO · 2026-06-27 · unverdicted · none · ref 7 · internal anchor
Translation as a Bridging Action: Transferring Manipulation Skills from Humans to Robots · cs.RO · 2026-06-26 · unverdicted · none · ref 59 · internal anchor
FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies · cs.RO · 2026-05-26 · unverdicted · none · ref 2 · internal anchor
HumanNet: Scaling Human-centric Video Learning to One Million Hours · cs.CV · 2026-05-07 · unverdicted · none · ref 36 · internal anchor
PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations · cs.AI · 2026-04-30 · unverdicted · none · ref 31 · internal anchor
Long-Horizon Manipulation via Trace-Conditioned VLA Planning · cs.RO · 2026-04-23 · unverdicted · none · ref 61 · internal anchor
Human Cognition in Machines: A Unified Perspective of World Models · cs.RO · 2026-04-17 · unverdicted · none · ref 190 · internal anchor
VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis · cs.RO · 2026-04-10 · unverdicted · none · ref 73 · internal anchor
SABER: A Stealthy Agentic Black-Box Attack Framework for Vision-Language-Action Models · cs.RO · 2026-03-26 · unverdicted · none · ref 3 · internal anchor
FASTER: Rethinking Real-Time Flow VLAs · cs.RO · 2026-03-19 · unverdicted · none · ref 93 · 2 links · internal anchor
Seeing Touch from Motion: A Unified Modality-Aware Visuo-Tactile Policy with Tactile Motion Correlation · cs.RO · 2026-06-29 · unverdicted · none · ref 60 · internal anchor
DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation · cs.RO · 2026-05-29 · unverdicted · none · ref 32 · internal anchor
Wall-OSS-0.5 Technical Report · cs.RO · 2026-05-29 · unverdicted · none · ref 10 · internal anchor
GEM: Generative Supervision Helps Embodied Intelligence · cs.CV · 2026-05-27 · unverdicted · none · ref 76 · internal anchor
Pelican-Unify 1.0: A Unified Embodied Intelligence Model for Understanding, Reasoning, Imagination and Action · cs.RO · 2026-05-14 · unverdicted · none · ref 45 · 2 links · internal anchor
Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT · cs.RO · 2026-05-09 · unverdicted · none · ref 1 · 2 links · internal anchor
Goal2Skill: Long-Horizon Manipulation with Adaptive Planning and Reflection · cs.RO · 2026-04-15 · unverdicted · none · ref 17 · internal anchor
CoEnv: Driving Embodied Multi-Agent Collaboration via Compositional Environment · cs.RO · 2026-04-07 · unverdicted · none · ref 60 · internal anchor
CoEnv introduces a compositional environment that integrates real and simulated spaces for multi-agent robotic collaboration, using real-to-sim reconstruction, VLM action synthesis, and validated sim-to-real transfer to achieve high success rates on multi-arm manipulation tasks.
Evo-Depth: A Lightweight Depth-Enhanced Vision-Language-Action Model · cs.CV · 2026-05-14 · unverdicted · none · ref 45 · internal anchor
Evo-Depth is a compact VLA model using a lightweight implicit depth encoder from RGB views plus progressive alignment to boost manipulation performance without added hardware.
JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy · cs.RO · 2026-04-22 · unverdicted · none · ref 38 · internal anchor
JoyAI-RA is a multi-source pretrained VLA model that claims to bridge human-to-robot embodiment gaps via data unification and outperforms prior methods on generalization-heavy robotic tasks.
RhinoVLA Technical Report · cs.RO · 2026-06-05 · unverdicted · none · ref 42 · internal anchor
RhinoVLA cuts VLM tokens with a Qwen3-VL backbone and continuous action expert, adds a unified cross-robot interface, and reaches real-time 11.69 Hz on Huixi R1 while matching π0.5 downstream performance.
World Model for Robot Learning: A Comprehensive Survey · cs.RO · 2026-04-30 · unverdicted · none · ref 59 · internal anchor
A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datasets, and benchmarks.
