Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Mixed citation behavior. Most common role is background (53%).
abstract
Large, high-capacity models trained on diverse datasets have shown remarkable success in efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train a generalist X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160,266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. More details can be found on the project website: https://robotics-transformer-x.github.io.
representative citing papers
VLA language backbones show high redundancy on manipulation benchmarks, with half the LLM blocks removable and even two blocks sufficient to recover baseline performance after fine-tuning, unlike vision and action pathways.
citing papers explorer
- RobotValues: Evaluating Household Robots When Human Values Conflict
RobotValues is a benchmark of 10K value-conflict scenarios that reveals VLMs default to safety and accommodation while failing to follow instructions to prioritize other values 80% of the time.
- Data Sharing and Competition in Learning-by-Deploying Industries: Insights from Robotics and Beyond
In a two-period game-theoretic model of learning-by-deploying, data pooling raises welfare with fixed prices but can turn privately unprofitable under Cournot competition, with a sustainability threshold set by demand elasticity.
- HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining
Processed egocentric human video outperforms teleoperated real-robot trajectories as pretraining data for embodied foundation models, delivering 24% lower validation loss and 52.5-90% higher task success rates under matched post-training protocols.
- EgoCS-400K: An Egocentric Gameplay Dataset for World Models
EgoCS-400K is a new 400K-video egocentric CS dataset with action-state-event alignment from public match demos for world model training.
- ThinkingVLA: Interleaved Vision and Language Reasoning for Robotic Manipulation
ThinkingVLA is a Mixture-of-Transformers VLA model that performs interleaved forward CoT for subgoal and image prediction followed by inverse CoT grounded on the predicted image to generate actions.
- Ambient Diffusion Policy: Imitation Learning from Suboptimal Data in Robotics
Ambient Diffusion Policy enables better imitation learning from suboptimal robot data by leveraging spectral properties to restrict data usage to specific diffusion times.
- UMI-Bench 1.0: An Open and Reproducible Real-World Benchmark for Tabletop Robotic Manipulation with UMI Data
UMI-Bench 1.0 is presented as the first open benchmark dedicated to reproducible real-world evaluation of Universal Manipulation Interface policies.
- Targeting World Models to Compromise Robot Learning Pipelines
World models introduce a stealthy poisoning vector into robot learning pipelines where malicious prompts or dynamics in teleoperated data activate only during synthetic trajectory generation, enabling backdoors in downstream policies.
- ActionMap: Robot Policy Learning via Voxel Action Heatmap
ActionMap introduces a voxel heatmap action head for VLA models that improves policy learning by exploiting geometric structure in the action space.
- Auditing Demonstration Curation Metrics: Action-Only Scorers Fail on the Structural Defects That Degrade Imitation Policies
Action-only curation metrics for imitation learning fail to detect structural defects that degrade policies, while state-aware metrics recover roughly one-third of the performance gap.
- Denoising Tells When to Replan: Denoising-Variance Adaptive Chunking for Flow-Based Robot Policies
DVAC uses denoising variance as an intrinsic signal to adaptively chunk actions in flow-based robot policies, improving success rates and cutting replans on LIBERO, RoboTwin, CALVIN, and real-world tasks.
- Same Weights, Different Robot: A Deployment Safety View of VLA Policies
The paper identifies a deployment safety gap in VLA policies where identical checkpoints can be executable-inequivalent due to action metadata mismatches, supported by a derived closed-form transform and empirical drift measurements on LIBERO benchmarks.
- TTT-VLA: Test-Time Latent Prompt Optimization for Vision-Language-Action Models
TTT-VLA performs test-time training for VLA models by optimizing only a latent prompt on new interaction data via a proxy self-supervised signal, yielding higher task success rates on SimplerEnv in single- and multi-embodiment settings.
- BOKBO (Best of K Bad Options): Calibrated Abstention for VLA Policies
BOKBO is the first conformal abstention method for K-sample VLA policies that supplies finite-sample distribution-free guarantees on executed violation rates, with global and Mondrian per-task variants.
- PhAIL: A Real-Robot VLA Benchmark and Distributional Methodology
PhAIL provides an open benchmark and distributional evaluation method for real-robot VLA policies using time-to-success CDF, HRT scoring, and KS significance tests.
- SkiP: When to Skip and When to Refine for Efficient Robot Manipulation
SkiP introduces action relabeling and Motion Spectrum Keying to skip redundant steps in robot trajectories, cutting executed steps by 15-40% while maintaining success rates across 72 simulated and 3 real tasks.
- Aligning Flow Map Policies with Optimal Q-Guidance
Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.
- Dynamic Execution Commitment of Vision-Language-Action Models
A3 reframes dynamic action chunk commitment in VLA models as self-speculative prefix verification, accepting the longest continuous sequence of actions that satisfies consensus-ordered conditional invariance and prefix-closed sequential consistency.
- SABER: A Scalable Action-Based Embodied Dataset for Real-World VLA Adaptation
SABER provides 44.8K multi-representation action samples from unscripted retail environments that raise a VLA model's mean success rate on ten manipulation tasks from 13.4% to 29.3%.
- OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
- Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion
Action Agent pairs LLM-driven video generation with a flow-constrained diffusion transformer to produce velocity commands, raising video success to 86% and delivering 64.7% real-world navigation on a Unitree G1 humanoid.
- OmniRobotHome: A Multi-Camera Platform for Real-Time Multiadic Human-Robot Interaction
A 48-camera residential platform delivers real-time occlusion-robust 3D perception and coordinated actuation for multi-human multi-robot interaction in a shared home workspace.
- Atomic-Probe Governance for Skill Updates in Compositional Robot Policies
A cross-version swap protocol reveals dominant skills that swing composition success by up to 50 percentage points, and an atomic probe with selective revalidation governs updates at lower cost than always re-testing full compositions.
- π₀.₇: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities
π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.
- PhysMem: Scaling Test-Time Memory for Embodied Physical Reasoning
PhysMem enables VLM-based robot planners to learn and verify physical properties through test-time interaction and hypothesis testing, raising success on a brick insertion task from 23% to 76%.
- UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models
UniLACT improves VLA models by adding depth-aware unified latent action pretraining that outperforms RGB-only baselines on seen and unseen manipulation tasks.
- VLAs are Confined yet Capable of Generalizing to Novel Instructions
Averaging and temporally interpolating text latents in VLAs enables 83% success on novel task combinations in the libero-ood benchmark where SOTA models achieve under 15%.
- Kaiwu: A Multimodal Manipulation Dataset and Framework for Robot Learning and Human-Robot Interaction
Introduces the Kaiwu multimodal dataset and framework with 11,664 synchronized assembling demonstrations including hand motions, pressures, sounds, multi-view videos, motion capture, eye gaze, and EMG signals with timestamp-based and semantic annotations.
- RoboDreamer: Learning Compositional World Models for Robot Imagination
RoboDreamer factorizes video generation using language primitives to achieve compositional generalization in robot world models, outperforming monolithic baselines on unseen goals in RT-X.
- 3D-VLA: A 3D Vision-Language-Action Generative World Model
3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.
- Learning to Move Before Learning to Do: Task-Agnostic pretraining for VLAs
TAP uses two-stage pretraining on unlabeled data to learn physical competence before language grounding, matching 1M-expert models with far less labeled data and showing robustness on real robots.
- ABot-M0.5: Unified Mobility-and-Manipulation World Action Model
ABot-M0.5 proposes a unified mobility-and-manipulation world action model using three alignment strategies that achieves state-of-the-art performance on mobile and fine-grained manipulation benchmarks.
- Sequential Planning via Anchored Robotic Keypoints
SPARK reaches 43.7% success on six LIBERO-PRO cells using LLM-generated typed behavior trees plus multi-prompt perception and recovery, more than doubling the success of CaP-Agent0 and VLA baselines.
- Chronos: A Physics-Informed Full-History Framework for Non-Markovian Long-Horizon Manipulation
Chronos elevates full observation history to the policy's latent state via selective SSM tokens and a Schrödinger-inspired acceleration bridge, achieving large gains on memory-dependent robot tasks with fewer parameters.
- Critical Interval MSE: Toward Reliable Offline Validation for Robot Manipulation Policies
CI-MSE improves Spearman's rank correlation between offline validation error and real rollout performance from -0.61 (raw MSE) to -0.87 across policy checkpoints in simulation and real-world robot manipulation experiments.
- Direct Action-Head Injection of A Grounded 3D Point Unlocks Spatial and Task Generalization
Direct 3D point grounding injected into the action head via a two-layer MLP and adaptive layer norm boosts VLA success rates by 32-46 points on spatial and task perturbations in LIBERO-PRO.
- CoStream: Composing Simple Behaviors for Generalizable Complex Manipulation
CoStream composes semantic, predictive, and reactive behaviors on an SE(3) interface to enable precise, generalizable performance on eight real-world contact-rich manipulation tasks.
- ASCII Art Turns LLMs into VLA Controllers
ASCII rendering of visual states enables fine-tuned text-only LLMs to serve as VLA controllers that identify objects and generate feasible action sequences in 2D manipulation benchmarks in simulation and on hardware.
- PoLAR: Factorizing Extent and Mode in Latent Actions for Robot Policy Learning
PoLAR imposes radial structure on latent actions in hyperbolic space to factorize extent and mode, improving robot policy performance over baselines.
- Inductive Generalization for Robotic Manipulation
The paper introduces an inductive generalization evaluation protocol for manipulation policies and shows that SOTA vision-language-action models fail on progressively harder task variants.
- Contrastive Action-Image Pre-training for Visuomotor Control
CAIP learns action-aligned visual representations via contrastive pre-training on human hand keypoints from egocentric video, outperforming DINOv2, SigLIP, MVP, and R3M with >30% gains on real dexterous manipulation tasks.
- T-Rex: Tactile-Reactive Dexterous Manipulation
T-Rex introduces a large tactile dataset and MoT architecture that achieves over 30% higher success rates than baselines on 12 tasks requiring force control and deformable object handling.
- Geometric Action Model for Robot Policy Learning
GAM splits a geometric foundation model to enable language-conditioned future geometry prediction and action decoding for robot policies, claiming superior performance on manipulation benchmarks.
- SafeDojo: Safe Reinforcement Learning for VLA via Interactive World Model
SafeDojo is a new world model-based safe RL framework for VLA that outperforms baselines on SafeLIBERO and real robot tasks.
- Transferring Contact, Not Just Motion: Compliant Grasping Across Dexterous Hands
A cross-embodiment force-position interface with system-identified torque calibration enables a flow-matching policy to perform transferable compliant grasping on heterogeneous dexterous hands.
- RoboProcessBench: Benchmarking Process-Aware Understanding in Vision-Language Robotic Manipulation
RoboProcessBench is a new benchmark decomposing process-aware understanding into static monitoring and dynamic reasoning across 12 question families, with evaluations showing VLM limitations but post-training gains on the provided data.
- APT: Action Expert Pretraining Improves Instruction Generalization of Vision-Language-Action Policies
APT pretrains the action expert as a vision-action prior on frozen VLM features then adds language through gated fusion to improve OOD instruction generalization in continuous-action VLA policies.
- What Demonstration Curation Metrics Do to Your Policy
On a LIBERO pick-and-place task with gripper defects, curation metrics with highest defect-detection AUROC produce the worst policies while lower-AUROC metrics nearly match the oracle, and many metrics rely on episode length as a proxy.
- Uncertainty-Aware Intention Prediction for Human-to-Robot Assembly Teleoperation
Human-to-robot transfer learning with conformal prediction improves robot assembly action segmentation Edit score from 70.50 to 80.70 using only 16 robot demonstrations.
- TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies
TempoVLA learns a single VLA policy with controllable execution speed via variable-speed trajectory augmentation and explicit speed conditioning.