super hub Mixed citations

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Mixed citation behavior. Most common role is background (53%).

104 Pith papers citing it
Background 53% of classified citations
abstract

Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train a generalist X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160,266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. More details can be found on the project website https://robotics-transformer-x.github.io.

citation-role summary

background 22 · dataset 18 · baseline 2 · method 1

representative citing papers

Aligning Flow Map Policies with Optimal Q-Guidance

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving state-of-the-art performance.

Atomic-Probe Governance for Skill Updates in Compositional Robot Policies

cs.RO · 2026-04-29 · unverdicted · novelty 7.0 · 2 refs

A cross-version swap protocol reveals dominant skills that swing composition success by up to 50 percentage points, and an atomic probe with selective revalidation governs updates at lower cost than always re-testing full compositions.

3D-VLA: A 3D Vision-Language-Action Generative World Model

cs.CV · 2024-03-14 · unverdicted · novelty 7.0

3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.

RT-H: Action Hierarchies Using Language

cs.RO · 2024-03-04 · conditional · novelty 7.0

RT-H learns robot policies by first predicting language motions as an intermediate representation and then mapping those plus the high-level task to actions, yielding more robust multi-task performance and the ability to learn from language interventions.

Any-point Trajectory Modeling for Policy Learning

cs.RO · 2023-12-28 · conditional · novelty 7.0

ATM pre-trains models to predict the trajectories of arbitrary points in videos, then uses those predictions to learn strong visuomotor policies from minimal action labels, beating baselines by 80% on 130+ tasks.

Action with Visual Primitives

cs.RO · 2026-05-21 · unverdicted · novelty 6.0

The AVP architecture has a VLM emit visual-primitive tokens that condition a flow-matching action expert, yielding a 27.61% higher success rate than pi_0.5 on real-robot pick-and-place tasks.
