super hub Canonical reference

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Adnan Esmail, Chelsea Finn, Danny Driess, Kevin Black, Michael Equi, Noah Brown · 2024 · cs.LG · arXiv 2410.24164

Canonical reference. 72% of citing Pith papers cite this work as background.

700 Pith papers citing it

Background 72% of classified citations

open full Pith review browse 700 citing papers more from Adnan Esmail arXiv PDF

abstract

Robot learning holds tremendous promise to unlock the full potential of flexible, general, and dexterous robot systems, as well as to address some of the deepest questions in artificial intelligence. However, bringing robot learning to the level of generality required for effective real-world systems faces major obstacles in terms of data, generalization, and robustness. In this paper, we discuss how generalist robot policies (i.e., robot foundation models) can address these challenges, and how we can design effective generalist robot policies for complex and highly dexterous tasks. We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge. We then discuss how this model can be trained on a large and diverse dataset from multiple dexterous robot platforms, including single-arm robots, dual-arm robots, and mobile manipulators. We evaluate our model in terms of its ability to perform tasks in zero shot after pre-training, follow language instructions from people and from a high-level VLM policy, and its ability to acquire new skills via fine-tuning. Our results cover a wide variety of tasks, such as laundry folding, table cleaning, and assembling boxes.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 121 baseline 25 method 13 dataset 1 other 1

citation-polarity summary

background 116 baseline 25 use method 12 unclear 6 support 1 use dataset 1

claims ledger

abstract Robot learning holds tremendous promise to unlock the full potential of flexible, general, and dexterous robot systems, as well as to address some of the deepest questions in artificial intelligence. However, bringing robot learning to the level of generality required for effective real-world systems faces major obstacles in terms of data, generalization, and robustness. In this paper, we discuss how generalist robot policies (i.e., robot foundation models) can address these challenges, and how we can design effective generalist robot policies for complex and highly dexterous tasks. We propose

authors

Adnan Esmail Chelsea Finn Danny Driess Kevin Black Michael Equi Noah Brown

co-cited works

representative citing papers

HABIT: Human-Aware Behavior and Interaction Training Dataset for Robot Manipulation

cs.RO · 2026-06-30 · unverdicted · novelty 8.0

HABIT is a large-scale robot demonstration dataset for human-present environments that elicits spatiotemporal synchronization, yielding, and gesture grounding behaviors absent from robot-only training data.

Test-time Adversarial Takeover: A Real-time Hijacking Interface against Robotic Diffusion Policies

cs.RO · 2026-06-09 · unverdicted · novelty 8.0 · 2 refs

TAKO demonstrates real-time adversarial takeover of robotic diffusion policies via reusable universal patches on visual inputs, achieving 100% success in steering attacker-chosen trajectories across multiple tasks, encoders, and diffusion methods.

Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?

cs.CV · 2026-05-31 · accept · novelty 8.0

Introduces the TVR active viewpoint-matching task and TVRBench indoor simulation benchmark, where foundation models start at low single-digit success rates but reach 51.4% after visual-action SFT and multi-turn GRPO post-training.

TAVIS: A Benchmark for Egocentric Active Vision and Anticipatory Gaze in Imitation Learning

cs.RO · 2026-05-08 · accept · novelty 8.0

TAVIS is a released benchmark showing active vision improves imitation learning in a task-dependent manner, multi-task policies struggle with shifts, and imitation produces human-like anticipatory gaze.

Membership Inference Attacks on Vision-Language-Action Models

cs.CR · 2026-05-08 · unverdicted · novelty 8.0

Vision-language-action models are highly vulnerable to membership inference attacks, including practical black-box versions that exploit generated actions and motion trajectories.

Adversarial Imitation Learning with General Function Approximation: Theoretical Analysis and Practical Algorithms

cs.LG · 2026-05-03 · unverdicted · novelty 8.0

OPT-AIL provides the first provably efficient adversarial imitation learning algorithms under general function approximation, achieving polynomial expert sample and interaction complexity.

RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies

cs.RO · 2026-04-10 · unverdicted · novelty 8.0 · 2 refs

RoboLab is a new simulation benchmark with 120 tasks across visual, procedural, and relational axes that quantifies generalization gaps and perturbation sensitivity in task-generalist robotic policies.

Embodied.cpp: A Portable Inference Runtime of Embodied AI Models on Heterogeneous Robots

cs.RO · 2026-07-02 · unverdicted · novelty 7.0

Embodied.cpp introduces a portable C++ inference runtime with modular layers for deploying VLA and WAM models on heterogeneous robots, reporting 100% and 91% task success on two models plus memory reduction on a WAM benchmark.

LIME: Learning Intent-aware Camera Motion from Egocentric Video

cs.RO · 2026-07-02 · unverdicted · novelty 7.0

LIME formulates language-conditioned camera motion as predicting SE(3) target poses from RGB and intent text, using mined multi-intent supervision from egocentric video and a flow-matching pose head.

OopsieVerse: A Safety Benchmark with Damage-Aware Simulation for Robot Manipulation

cs.RO · 2026-06-30 · unverdicted · novelty 7.0

OOPSIEVERSE is a new damage-aware simulation benchmark for household robot manipulation that converts contact, thermal, and fluid signals into task-agnostic damage metrics and demonstrates uses in safer policy learning and benchmarking.

Adapting Generalist Robot Policies with Semantic Reinforcement Learning

cs.RO · 2026-06-30 · unverdicted · novelty 7.0

SARL optimizes language prompt inputs to generalist vision-language-action policies through online RL to solve complex long-horizon tasks by composing existing skills.

Labimus: A Simulation and Benchmark for Humanoid Dexterous Manipulation in Chemical Laboratory

cs.RO · 2026-06-30 · unverdicted · novelty 7.0

Labimus is the first benchmark for humanoid dexterous manipulation in organic chemistry laboratories, exposing a gap between task completion and required experimental precision.

Pondering the Way: Spatial-perceiving World Action Model for Embodied Navigation

cs.RO · 2026-06-29 · unverdicted · novelty 7.0

SWAM jointly generates intermediate RGB-D sequences and action trajectories from monocular RGB start/goal observations for embodied navigation.

SurgVLA-Bench: Towards Evaluating Vision-Language-Action Models for Laparoscopic Surgical Robotics

cs.AI · 2026-06-28 · unverdicted · novelty 7.0

SurgVLA-Bench supplies a hierarchical task taxonomy and multi-dimensional evaluation framework for VLA models in laparoscopic robotics simulation, showing autoregressive models excel at semantics while flow-matching models achieve higher precision but all fall short due to endoscopic view constraint

World Action Models Enable Continual Imitation Learning with Recurrent Generative Replays

cs.RO · 2026-06-25 · unverdicted · novelty 7.0

REGEN uses recurrent generative replays from World Action Models to cut catastrophic forgetting by up to 50% in continual imitation learning compared to sequential fine-tuning.

ForesightSafety-VLA: A Unified Diagnostic Safety Benchmark for Vision-Language-Action Models

cs.RO · 2026-06-25 · unverdicted · novelty 7.0 · 2 refs

ForesightSafety-VLA creates a diagnostic benchmark for VLA safety with taxonomy across physical, language, and visual risks, showing perception and structure variations cause more safety degradation than language changes in tested models.

USS: Unified Spatial-Semantic Prompts for Embodied Visual Tracking with Latent Dynamics Learning

cs.CV · 2026-06-24 · unverdicted · novelty 7.0

USS is an end-to-end framework for embodied visual tracking that fuses text, point, box, and mask prompts via modality-specific encoders and hybrid attention, augmented by a latent world model, and demonstrates higher success rates with spatial cues on real robots and competitive simulation performa

LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models

cs.RO · 2026-06-22 · unverdicted · novelty 7.0 · 2 refs

LIBERO-Safety supplies a scalable benchmark, data-generation pipeline, and 19,664-demonstration dataset that exposes a generalization-safety tension in current VLA models where diverse training improves collision avoidance but task success stays limited by trajectory quality and semantic understandi

Cloak: Zero-Shot Cross-Embodiment Manipulation by Masking the End-Effector from the VLA

cs.RO · 2026-06-22 · unverdicted · novelty 7.0

Masking the end-effector from wrist views during training lets a single-gripper VLA transfer zero-shot to other grippers, arms, and five-fingered hands while keeping original performance.

NAC: Neural Action Codec for Vision-Language-Action Models

cs.RO · 2026-06-19 · unverdicted · novelty 7.0

NAC adapts multi-scale RVQGAN audio codecs with kinematic-specific losses to produce ordered action tokens that yield lower reconstruction error and higher task success than prior tokenizers in VLA models.

Execution-State Capsules: Graph-Bound Execution-State Checkpoint and Restore for Low-Latency, Small-Batch, On-Device Physical-AI Serving

cs.LG · 2026-06-18 · unverdicted · novelty 7.0

Execution-state capsules enable graph-bound full-state checkpointing and sub-millisecond restore for LLMs including KV and recurrent states, yielding 3.9x-27x TTFT speedups in on-device physical-AI serving.

HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining

cs.CV · 2026-06-18 · unverdicted · novelty 7.0

Processed egocentric human video outperforms teleoperated real-robot trajectories as pretraining data for embodied foundation models, delivering 24% lower validation loss and 52.5-90% higher task success rates under matched post-training protocols.

Frequency-Aware Flow Matching for Continuous and Consistent Robotic Action Generation

cs.RO · 2026-06-18 · unverdicted · novelty 7.0

FAFM performs flow matching in the frequency domain using DCT on action sequences to produce continuous temporally consistent robotic actions with a Sobolev-style smoothness regularizer.

EquiVLA: A General Framework for Rotationally Equivariant Vision-Language-Action Models

cs.RO · 2026-06-18 · unverdicted · novelty 7.0

EquiVLA is the first general framework for end-to-end SO(2)-equivariant VLA models using EquiPerceptor and EquiActor modules, reporting improved success rates on LIBERO, CALVIN, and real-robot benchmarks.

citing papers explorer

Showing 39 of 39 citing papers after filters.

Imitation Learning for Robot Assistance in Open Surgery: A Multi-Policy Evaluation on Suture Following cs.RO · 2026-05-27 · conditional · none · ref 2 · internal anchor
Benchmarking ACT, Diffusion Policy, SmolVLA, and π0 on suture following yields 50-75% success under ideal conditions and 92% stitch completion with π0 in a surgeon-robot trial.
EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control cs.RO · 2026-05-21 · conditional · none · ref 1 · internal anchor
EvoScene-VLA maintains an action-updated scene prior across control chunks in VLA policies, raising success rates on RoboTwin tasks from 87.2% to 89.1% fixed and 86.1% to 88.5% randomized while outperforming baselines on a real robot.
Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR cs.LG · 2026-05-19 · conditional · none · ref 4 · internal anchor
Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.
Event-Grounded Sparse Autoencoders for Vision-Language-Action Policies cs.RO · 2026-05-17 · conditional · none · ref 2 · internal anchor
Event-grounded SAE analysis in VLA policies produces stronger causal effects on robot behavior than standard methods by anchoring features to clustered end-effector keyframes across simulations and real-robot tests.
Morphologically Equivariant Flow Matching for Bimanual Mobile Manipulation cs.RO · 2026-05-12 · conditional · none · ref 4 · internal anchor
A morphologically equivariant flow matching policy for bimanual robots enforces reflective symmetry to improve sample efficiency and enable zero-shot generalization to mirrored task configurations.
See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model cs.RO · 2026-05-12 · conditional · none · ref 3 · internal anchor
GridS is a plug-and-play differentiable module for geometry-aware visual token resampling in VLA models that achieves under 10% token retention and 76% FLOPs reduction with no success-rate loss.
Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation cs.RO · 2026-05-12 · conditional · none · ref 5 · internal anchor
A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.
Libra-VLA: Achieving Learning Equilibrium via Asynchronous Coarse-to-Fine Dual-System cs.RO · 2026-04-27 · conditional · none · ref 1 · internal anchor
Libra-VLA introduces a coarse-to-fine dual-system architecture for VLA models that decouples discrete macro-directional planning from continuous micro-pose refinement, with performance peaking at balanced learning difficulty.
BiCoord: A Bimanual Manipulation Benchmark towards Long-Horizon Spatial-Temporal Coordination cs.RO · 2026-04-07 · conditional · none · ref 2 · internal anchor
BiCoord is a new benchmark for long-horizon tightly coordinated bimanual manipulation that includes quantitative metrics and shows existing policies like DP, RDT, Pi0 and OpenVLA-OFT struggle on such tasks.
DFM-VLA: Iterative Action Refinement for Robot Manipulation via Discrete Flow Matching cs.RO · 2026-03-27 · conditional · none · ref 2 · internal anchor
DFM-VLA uses discrete flow matching to iteratively refine action tokens in VLA models, outperforming autoregressive and diffusion baselines with 4.44 average success length on CALVIN and 95.7% success on LIBERO.
Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control cs.RO · 2026-03-18 · conditional · none · ref 4 · internal anchor
GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.
Large Video Planner Enables Generalizable Robot Control cs.RO · 2025-12-17 · conditional · none · ref 10 · internal anchor
A video foundation model trained on human demonstrations generates zero-shot plans that convert to executable robot actions on novel scenes and tasks.
EXPO: Stable Reinforcement Learning with Expressive Policies cs.LG · 2025-07-10 · conditional · none · ref 2 · internal anchor
EXPO stabilizes online RL for expressive policies by training a base policy with imitation and using a lightweight Gaussian edit policy to select higher-value actions on the fly for sampling and TD backups.
EgoWalk: A Multimodal Dataset for Robot Navigation in the Wild cs.RO · 2025-05-27 · conditional · none · ref 6 · internal anchor
EgoWalk supplies 50 hours of real-world multimodal human navigation data in varied indoor/outdoor settings together with open pipelines that auto-generate language goal annotations and traversability masks.
Learning to Assist: Collaborative VLAs for Implicit Human-Robot Collaboration cs.RO · 2026-06-10 · conditional · none · ref 4 · internal anchor
VLA models with inference-time steering mitigate action leakage in implicit human-robot collaboration, supporting longer horizons and yielding faster, more reliable assembly than shorter-horizon baselines in a 16-person study.
OpenEAI-Platform: An Open-source Embodied Artificial Intelligence Hardware-Software Unified Platform cs.RO · 2026-06-02 · conditional · none · ref 1 · internal anchor
OpenEAI-Platform delivers an open-source low-cost robotic arm and VLA model that outperforms commercial arms and matches large pretrained baselines on four real-world manipulation tasks using limited open data.
Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models cs.RO · 2026-05-13 · conditional · none · ref 3 · internal anchor
GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.
What to Ignore, What to React: Visually Robust RL Fine-Tuning of VLA Models cs.RO · 2026-05-13 · conditional · none · ref 5 · internal anchor
PAIR-VLA adds invariance and sensitivity objectives over paired visual variants during PPO fine-tuning of VLA models, yielding 9-16% average gains on ManiSkill3 under distractors, textures, poses, viewpoints, and lighting shifts.
Grounded World Model for Semantically Generalizable Planning cs.RO · 2026-04-13 · conditional · none · ref 4 · internal anchor
A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.
Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model cs.RO · 2026-04-03 · conditional · none · ref 3 · internal anchor
MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.
ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors cs.RO · 2026-03-16 · conditional · none · ref 14 · internal anchor
ExpertGen generates high-success expert policies in simulation from imperfect priors by freezing a diffusion behavior model and optimizing its initial noise via RL, then distills them for real-robot deployment.
Force-Aware Residual DAgger via Trajectory Editing for Precision Insertion with Impedance Control cs.RO · 2026-03-04 · conditional · none · ref 4 · internal anchor
TER-DAgger improves robotic precision insertion success rates by over 37% via residual policies from edited trajectories and force-aware intervention triggers.
VLANeXt: Recipes for Building Strong VLA Models cs.CV · 2026-02-20 · conditional · none · ref 5 · internal anchor
VLANeXt distills 12 design insights from a unified VLA study into a model that outperforms prior methods on LIBERO benchmarks while releasing code for further exploration.
Action Hallucination in Generative Vision-Language-Action Models cs.RO · 2026-02-06 · conditional · none · ref 2 · internal anchor
Generative VLAs hallucinate physically invalid actions due to topological, precision, and horizon mismatches between model architectures and feasible robot behavior.
Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning cs.AI · 2026-01-22 · conditional · none · ref 3 · internal anchor
Single-stage fine-tuning of a video model to generate actions as latent frames plus future states and values yields state-of-the-art robot policy performance on LIBERO, RoboCasa, and bimanual tasks.
Clutter-Robust Vision-Language-Action Models through Object-Centric and Geometry Grounding cs.RO · 2025-12-27 · conditional · none · ref 4 · internal anchor
OBEYED-VLA improves VLA robustness in cluttered real-world manipulation by disentangling perception into VLM-based object-centric grounding and geometry-aware stages, then fine-tuning the policy only on single-object demonstrations.
ESPADA: Execution Speedup via Semantics Aware Demonstration Data Downsampling for Imitation Learning cs.RO · 2025-12-08 · conditional · none · ref 3 · internal anchor
ESPADA uses semantic segmentation from VLMs and LLMs plus DTW to downsample non-critical segments in demonstrations, delivering about 2x faster robot execution in behavior cloning while maintaining task success rates.
ActDistill: General Action-Guided Self-Derived Distillation for Efficient Vision-Language-Action Models cs.CV · 2025-11-22 · conditional · none · ref 4 · internal anchor
ActDistill transfers action knowledge from heavy VLA teacher models to lightweight students via graph-encapsulated hierarchies and action-guided dynamic routing, delivering over 50% computation reduction and 1.67x speedup with comparable or better performance on embodied tasks.
Unify Robot Actions in Camera Frame cs.RO · 2025-11-21 · conditional · none · ref 1 · internal anchor
CalibAll estimates camera extrinsics on existing datasets to convert robot actions into a unified camera-frame representation, enabling stronger cross-embodiment pretraining.
SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning cs.RO · 2025-09-11 · conditional · none · ref 12 · internal anchor
SimpleVLA-RL applies tailored reinforcement learning to VLA models, reaching SoTA on LIBERO, outperforming π₀ on RoboTwin, and surpassing SFT in real-world tasks while reducing data needs and identifying a 'pushcut' phenomenon.
VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning cs.RO · 2025-05-24 · conditional · none · ref 7 · internal anchor
VLA-RL applies online RL to pretrained VLAs, yielding a 4.5% gain over strong baselines on 40 LIBERO manipulation tasks and matching commercial models like π₀-FAST.
Policy Contrastive Decoding for Robotic Foundation Models cs.RO · 2025-05-19 · conditional · none · ref 2 · internal anchor
PCD redirects robotic policies toward object-relevant visual features via contrastive decoding on masked inputs, improving generalization without retraining or weight access.
AhaRobot: A Low-Cost Open-Source Bimanual Mobile Manipulator for Embodied AI cs.RO · 2025-03-13 · conditional · none · ref 33 · internal anchor
AhaRobot delivers 0.7 mm repeatability on a $1000 bimanual platform using dual-motor compensation and a novel 26-faced marker handle that cuts tracking error 80% versus a 6-faced baseline.
Position: Vision-Language-Action Models Cannot Be Verified to Perform Physical Reasoning cs.RO · 2026-06-28 · conditional · none · ref 11 · internal anchor
VLA benchmark success rates cannot distinguish semantic generalization from physical reasoning due to an identifiability gap in current evaluation protocols.
Per-Group Error, Not Total MSE: Fine-Tuning Vision-Language-Action Models for 11-DoF Mobile Manipulation cs.RO · 2026-05-29 · conditional · none · ref 4 · internal anchor
Per-group action error outperforms total MSE for selecting VLA fine-tuning checkpoints that succeed on real 11-DoF mobile manipulators.
Memory-Bound but Not Bandwidth-Limited: The Physical AI Inference Gap in Batch-1 LLM Decode cs.AR · 2026-05-28 · conditional · none · ref 3 · internal anchor
Batch-1 autoregressive decode is memory-dominated yet launch overhead caps gains from higher-bandwidth GPUs, shown by measurements and CUDA Graphs ablation across four NVIDIA GPUs.
VLA-REPLICA: A Low-Cost, Reproducible Benchmark for Real-World Evaluation of Vision-Language-Action Models cs.RO · 2026-05-20 · conditional · none · ref 4 · internal anchor
VLA-REPLICA is a low-cost and reproducible real-world benchmark for evaluating VLA models in robotic manipulation tasks.
Early Warning Signals for OpenVLA Failure under Visual Distribution Shift cs.CV · 2026-06-29 · conditional · none · ref 5 · internal anchor
OpenVLA layer-16 activations allow a logistic probe to predict failure within 15 steps under occlusion (AUROC 0.972) better than baselines, with some transfer to camera jitter.
Vision-Based Hand Shadowing for Robotic Manipulation via Inverse Kinematics cs.RO · 2026-03-11 · conditional · none · ref 6 · internal anchor
An egocentric vision pipeline with MediaPipe hand tracking and damped-least-squares IK achieves 86.7% success on structured pick-and-place for the SO-ARM101 robot but falls to 9.3% in real-world environments with occlusions.

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer