super hub Mixed citations

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

Linxi "Jim" Fan, Nikita Cherniadev, NVIDIA: Johan Bjorck, Runyu Ding, Xingye Da · 2025 · cs.RO · arXiv 2503.14734

Mixed citation behavior. Most common role is background (66%).

209 Pith papers citing it

Background 66% of classified citations

open full Pith review browse 209 citing papers more from Linxi "Jim" Fan arXiv PDF

abstract

General-purpose robots need a versatile body and an intelligent mind. Recent advancements in humanoid robots have shown great promise as a hardware platform for building generalist autonomy in the human world. A robot foundation model, trained on massive and diverse data sources, is essential for enabling the robots to reason about novel situations, robustly handle real-world variability, and rapidly learn new tasks. To this end, we introduce GR00T N1, an open foundation model for humanoid robots. GR00T N1 is a Vision-Language-Action (VLA) model with a dual-system architecture. The vision-language module (System 2) interprets the environment through vision and language instructions. The subsequent diffusion transformer module (System 1) generates fluid motor actions in real time. Both modules are tightly coupled and jointly trained end-to-end. We train GR00T N1 with a heterogeneous mixture of real-robot trajectories, human videos, and synthetically generated datasets. We show that our generalist robot model GR00T N1 outperforms the state-of-the-art imitation learning baselines on standard simulation benchmarks across multiple robot embodiments. Furthermore, we deploy our model on the Fourier GR-1 humanoid robot for language-conditioned bimanual manipulation tasks, achieving strong performance with high data efficiency.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 63 baseline 18 method 7 other 1

citation-polarity summary

background 59 baseline 19 unclear 6 use method 5

claims ledger

abstract General-purpose robots need a versatile body and an intelligent mind. Recent advancements in humanoid robots have shown great promise as a hardware platform for building generalist autonomy in the human world. A robot foundation model, trained on massive and diverse data sources, is essential for enabling the robots to reason about novel situations, robustly handle real-world variability, and rapidly learn new tasks. To this end, we introduce GR00T N1, an open foundation model for humanoid robots. GR00T N1 is a Vision-Language-Action (VLA) model with a dual-system architecture. The vision-lang

authors

Fernando Casta\~neda Linxi "Jim" Fan Nikita Cherniadev NVIDIA: Johan Bjorck Runyu Ding Xingye Da

co-cited works

representative citing papers

FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models

cs.CV · 2026-03-30 · unverdicted · novelty 8.0

FlowHijack is the first dynamics-aware backdoor attack on flow-matching VLAs that achieves high success rates with stealthy triggers while preserving benign performance and making malicious actions kinematically indistinguishable from normal ones.

GesVLA: Gesture-Aware Vision-Language-Action Model Embedded Representations

cs.RO · 2026-05-21 · unverdicted · novelty 7.0

GesVLA encodes gesture features directly into the latent space of VLA models using a dual-VLM architecture and a rendering-based data pipeline, yielding improved target grounding in real robotic tasks.

Closed Loop Dynamic Driving Data Mixture for Real-Synthetic Co-Training

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

AutoScale is a closed-loop data engine using Graph-RAE for scene representation and Cluster-GA for importance-based retrieval to improve real-synthetic co-training for autonomous driving.

Dexora: Open-source VLA for High-DoF Bimanual Dexterity

cs.RO · 2026-05-18 · unverdicted · novelty 7.0

Dexora is the first open-source VLA system for dual-arm dual-hand high-DoF manipulation, trained on 100K simulated and 10K real teleoperated trajectories with a discriminator-weighted diffusion policy, achieving 66.7% dexterous success versus 51.7% for baselines.

RoboFlow4D: A Lightweight Flow World Model Toward Real-Time Flow-Guided Robotic Manipulation

cs.RO · 2026-05-17 · unverdicted · novelty 7.0

RoboFlow4D is an end-to-end lightweight flow world model that predicts multi-frame 3D flows from visual observations and textual instructions to provide explicit planning for real-time robotic manipulation.

RotVLA: Rotational Latent Action for Vision-Language-Action Model

cs.RO · 2026-05-13 · unverdicted · novelty 7.0

RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.

From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

cs.RO · 2026-05-12 · unverdicted · novelty 7.0

MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

Premover: Fast Vision-Language-Action Control by Acting Before Instructions Are Complete

cs.RO · 2026-05-12 · unverdicted · novelty 7.0

Premover enables VLA policies to act on partial instructions by precomputing focus maps from intermediate backbone layers, reducing wall-clock time 13.6 percent on LIBERO while preserving 95 percent success rate.

DreamAvoid: Critical-Phase Test-Time Dreaming to Avoid Failures in VLA Policies

cs.RO · 2026-05-12 · unverdicted · novelty 7.0

DreamAvoid uses a Dream Trigger, Action Proposer, and Dream Evaluator trained on success/failure/boundary data to let VLA policies avoid critical-phase failures via test-time future dreaming.

RIO: Flexible Real-Time Robot I/O for Cross-Embodiment Robot Learning

cs.RO · 2026-05-12 · unverdicted · novelty 7.0

RIO introduces a lightweight open-source framework that abstracts real-time robot I/O to support easy switching between embodiments and platforms for collecting data and deploying VLAs.

Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation

cs.RO · 2026-05-12 · conditional · novelty 7.0

A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.

Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models

cs.RO · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Pace-and-Path Correction decomposes a quadratic cost minimization into orthogonal pace and path channels to correct chunked actions in VLA models, raising success rates by up to 28.8% in dynamic settings.

CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

Capability vectors extracted from parameter differences between standard and auxiliary-finetuned VLA models can be merged into pretrained weights to match auxiliary-training performance while reducing computational overhead during adaptation.

VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models

cs.RO · 2026-05-11 · unverdicted · novelty 7.0

VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.

BatchWeave: A Consistent Object-Store-Native Data Plane for Large Foundation Model Training

cs.DC · 2026-05-11 · unverdicted · novelty 7.0 · 2 refs

BatchWeave delivers an object-store-native data plane for distributed large foundation model training via transactional global batches and a decentralized adaptive commit algorithm.

LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.

Octopus Protocol: One-Shot Hardware Discovery and Control for AI Agents via Infrastructure-as-Prompts

cs.RO · 2026-05-09 · unverdicted · novelty 7.0

Octopus Protocol enables one-shot hardware onboarding for AI agents by running a five-stage LLM-driven pipeline that probes devices, infers capabilities, generates an MCP server, and deploys it for closed-loop control.

One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

cs.CV · 2026-05-08 · conditional · novelty 7.0 · 3 refs

Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.

NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models

cs.RO · 2026-05-08 · unverdicted · novelty 7.0

NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.

AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models

cs.RO · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

AT-VLA proposes adaptive tactile injection and a dual-stream tactile reaction mechanism to enhance VLA models for contact-rich robotic manipulation with real-time responses.

Latent Bridge: Feature Delta Prediction for Efficient Dual-System Vision-Language-Action Model Inference

cs.RO · 2026-05-04 · unverdicted · novelty 7.0

Latent Bridge predicts VLM feature deltas to reduce VLM calls by 50-75% in dual-system VLA models while retaining 95-100% performance and achieving 1.65-1.73x speedup across LIBERO, RoboCasa, and ALOHA benchmarks.

Phone2Act: A Low-Cost, Hardware-Agnostic Teleoperation System for Scalable VLA Data Collection

cs.RO · 2026-05-03 · unverdicted · novelty 7.0

Phone2Act is a smartphone-based teleoperation system that collects synchronized multi-camera robot manipulation data in LeRobot format without custom hardware, validated by fine-tuning GR00T-N1.5 to 90% success on a real Dobot CR5 pick-and-place task using 130 episodes.

Being-H0.7: A Latent World-Action Model from Egocentric Videos

cs.RO · 2026-04-30 · unverdicted · novelty 7.0

Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment

cs.RO · 2026-04-27 · unverdicted · novelty 7.0

VLA models exhibit a compute-bound VLM phase followed by a memory-bound action phase on edge hardware; DP-Cache and V-AEFusion reduce redundancy and enable pipeline parallelism for up to 6x speedup on NPUs with marginal task degradation.

citing papers explorer

Showing 41 of 41 citing papers after filters.

USIM and U0: A Vision-Language-Action Dataset and Model for General Underwater Robots cs.RO · 2025-10-09 · unverdicted · none · ref 10 · internal anchor
Introduces USIM simulation dataset and U0 VLA model with CAP module for general underwater robot tasks, reporting 0.0359 offline error and 43.1% online success rate.
Steering Your Diffusion Policy with Latent Space Reinforcement Learning cs.RO · 2025-06-18 · unverdicted · none · ref 20 · internal anchor
DSRL steers pretrained diffusion policies for robotics by applying RL to their latent noise inputs, achieving sample-efficient real-world adaptation with only black-box access.
ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving cs.CV · 2025-06-09 · unverdicted · none · ref 4 · internal anchor
ReCogDrive unifies VLM scene understanding with a diffusion planner reinforced by DiffGRPO to reach state-of-the-art results on NAVSIM and Bench2Drive benchmarks.
DreamGen: Unlocking Generalization in Robot Learning through Video World Models cs.RO · 2025-05-19 · unverdicted · none · ref 5 · internal anchor
DreamGen trains robot policies on synthetic trajectories from adapted video world models, enabling a humanoid robot to perform 22 new behaviors in seen and unseen environments from a single pick-and-place teleoperation dataset.
Learning to Feel the Future: DreamTacVLA for Contact-Rich Manipulation cs.RO · 2025-12-29 · unverdicted · none · ref 2 · internal anchor
DreamTacVLA grounds VLA models in contact physics by aligning multi-scale vision-tactile inputs and predicting future tactile states, reaching up to 95% success on contact-rich tasks.
Clutter-Robust Vision-Language-Action Models through Object-Centric and Geometry Grounding cs.RO · 2025-12-27 · conditional · none · ref 6 · internal anchor
OBEYED-VLA improves VLA robustness in cluttered real-world manipulation by disentangling perception into VLM-based object-centric grounding and geometry-aware stages, then fine-tuning the policy only on single-object demonstrations.
AstraNav-World: World Model for Foresight Control and Consistency cs.CV · 2025-12-25 · unverdicted · none · ref 17 · internal anchor
AstraNav-World unifies diffusion video generation and vision-language action planning in a single bidirectional model that improves trajectory accuracy, success rates, and zero-shot real-world adaptation in embodied navigation.
HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action Models cs.RO · 2025-12-10 · unverdicted · none · ref 3 · internal anchor
HiF-VLA improves long-horizon robotic manipulation by encoding past motion as hindsight priors and anticipating future motion through foresight reasoning inside a VLA framework.
IGen: Scalable Data Generation for Robot Learning from Open-World Images cs.RO · 2025-12-01 · unverdicted · none · ref 4 · internal anchor
IGen generates realistic visuomotor training data including actions and temporally coherent visuals from unstructured open-world images via 3D reconstruction and VLM reasoning.
SimScale: Learning to Drive via Real-World Simulation at Scale cs.CV · 2025-11-28 · conditional · none · ref 65 · internal anchor
SimScale synthesizes unseen driving states from real logs via neural rendering and reactive environments, generates pseudo-expert trajectories, and shows that co-training on real plus simulated data improves planning robustness and generalization on real benchmarks, with gains scaling by simulation
Commanding Humanoid by Free-form Language: A Large Language Action Model with Unified Motion Vocabulary cs.RO · 2025-11-28 · unverdicted · none · ref 4 · internal anchor
Humanoid-LLA converts unconstrained natural language commands into stable whole-body motions for humanoid robots using a unified motion vocabulary and two-stage supervised-plus-reinforcement fine-tuning.
ActDistill: General Action-Guided Self-Derived Distillation for Efficient Vision-Language-Action Models cs.CV · 2025-11-22 · conditional · none · ref 3 · internal anchor
ActDistill transfers action knowledge from heavy VLA teacher models to lightweight students via graph-encapsulated hierarchies and action-guided dynamic routing, delivering over 50% computation reduction and 1.67x speedup with comparable or better performance on embodied tasks.
SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control cs.RO · 2025-11-11 · unverdicted · none · ref 6 · internal anchor
Scaling motion tracking models along size, data volume, and compute produces a foundation model for natural, robust humanoid whole-body control with downstream uses in kinematic planning and vision-language-action models.
Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning cs.RO · 2025-11-06 · unverdicted · none · ref 9 · internal anchor
Isaac Lab is a unified GPU-native platform combining high-fidelity physics, photorealistic rendering, multi-frequency sensors, domain randomization, and learning pipelines for scalable multi-modal robot policy training.
DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models cs.LG · 2025-10-31 · unverdicted · none · ref 2 · internal anchor
DeepThinkVLA shows CoT improves VLA models only under decoding and causal alignment, delivering 97% success on LIBERO and 21.7-point gains via hybrid attention and SFT-RL training.
Co-Evolving Latent Action World Models cs.LG · 2025-10-30 · unverdicted · none · ref 26 · internal anchor
CoLA-World jointly trains latent action models and world models with a warm-up phase to achieve co-evolution, matching or exceeding prior two-stage methods in video simulation quality and visual planning performance.
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy cs.RO · 2025-10-15 · unverdicted · none · ref 2 · internal anchor
InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.
Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation cs.RO · 2025-08-07 · unverdicted · none · ref 5 · internal anchor
Genie Envisioner unifies robotic policy learning, simulation, and evaluation inside one instruction-conditioned video diffusion framework using GE-Base, GE-Act, and GE-Sim.
Video Generators are Robot Policies cs.RO · 2025-08-01 · conditional · none · ref 54 · internal anchor
Training models to generate videos of robot actions produces policies that generalize better to new objects and tasks while using far less demonstration data than standard behavior cloning.
villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models cs.RO · 2025-07-31 · unverdicted · none · ref 52 · internal anchor
villa-X enhances latent action modeling in VLA models to support zero-shot action planning for unseen robot embodiments and open-vocabulary instructions, yielding better manipulation results in simulation and real-world tests.
DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge cs.CV · 2025-07-06 · unverdicted · none · ref 39 · internal anchor
DreamVLA uses dynamic-region-guided world knowledge prediction, block-wise attention to disentangle information types, and a diffusion transformer for actions, reaching 76.7% success on real robot tasks and 4.44 average length on CALVIN ABC-D.
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning cs.AI · 2025-06-11 · unverdicted · none · ref 4 · internal anchor
V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 hours of unlabeled robot video.
Real-Time Execution of Action Chunking Flow Policies cs.RO · 2025-06-09 · unverdicted · none · ref 4 · internal anchor
Real-time chunking (RTC) allows diffusion- and flow-based action chunking policies to execute smoothly and asynchronously, maintaining high success rates on dynamic tasks even with significant inference latency.
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics cs.LG · 2025-06-02 · unverdicted · none · ref 4 · internal anchor
SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.
DreamPolicy: A Unified World-model Policy for Scalable Humanoid Locomotion cs.RO · 2025-05-24 · unverdicted · none · ref 45 · internal anchor
DreamPolicy integrates an autoregressive diffusion world model with policy learning to produce a single scalable policy that generalizes to unseen composite terrains for humanoid locomotion.
FLARE: Robot Learning with Implicit World Modeling cs.RO · 2025-05-21 · unverdicted · none · ref 8 · internal anchor
FLARE integrates predictive latent world modeling into diffusion transformer policies for robots, delivering up to 26% gains on multitask manipulation benchmarks and enabling co-training with action-free human videos.
GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data cs.RO · 2025-05-06 · unverdicted · none · ref 8 · internal anchor
GraspVLA shows that pretraining a grasping model on a billion synthetic action frames enables zero-shot open-vocabulary performance and sim-to-real transfer.
$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization cs.LG · 2025-04-22 · unverdicted · none · ref 7 · internal anchor
π_{0.5} is a VLA model that achieves long-horizon dexterous manipulation in entirely new homes through co-training on heterogeneous tasks and multi-source data including web and semantic predictions.
HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model cs.CV · 2025-03-13 · unverdicted · none · ref 12 · internal anchor
HybridVLA unifies diffusion and autoregression in a single VLA model via collaborative training and ensemble to raise robot manipulation success rates by 14% in simulation and 19% in real-world tasks.
DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control cs.RO · 2025-02-09 · unverdicted · none · ref 16 · internal anchor
DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.
Motus: A Unified Latent Action World Model cs.CV · 2025-12-15 · unverdicted · none · ref 6 · internal anchor
Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.
AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention cs.LG · 2025-11-24 · unverdicted · none · ref 2 · internal anchor
AVA-VLA reformulates VLA learning as a POMDP using recurrent states and active visual attention to achieve state-of-the-art results on LIBERO, CALVIN, and real dual-arm tasks.
XR-1: Towards Versatile Vision-Language-Action Models via Learning Unified Vision-Motion Representations cs.RO · 2025-11-04 · unverdicted · none · ref 8 · internal anchor
XR-1 introduces Unified Vision-Motion Codes learned by dual-branch VQ-VAE and applies them in a three-stage training pipeline to outperform prior VLA models on 120+ real-world manipulation tasks across six robot embodiments.
Reflection-Based Task Adaptation for Self-Improving VLA cs.RO · 2025-10-14 · unverdicted · none · ref 13 · internal anchor
Reflective Self-Adaptation combines failure-reflective reinforcement learning with success-guided imitation learning to enable faster and more reliable task adaptation for pre-trained Vision-Language-Action models.
Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey cs.RO · 2025-08-18 · unverdicted · none · ref 32 · internal anchor
This survey organizes large VLM-based VLA models for robotic manipulation into monolithic and hierarchical paradigms, reviews their integrations and datasets, and outlines future directions.
ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning cs.CV · 2025-07-22 · unverdicted · none · ref 3 · internal anchor
ThinkAct introduces reinforced visual latent planning in a dual VLA system to enable better long-horizon reasoning and adaptation for embodied tasks.
GR-3 Technical Report cs.RO · 2025-07-21 · unverdicted · none · ref 7 · internal anchor
GR-3 is a VLA model that generalizes to novel objects, environments, and abstract instructions, outperforms the π0 baseline, and integrates with the new ByteMini bi-manual mobile robot.
A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation cs.RO · 2025-07-07 · accept · none · ref 6 · internal anchor
Multi-task pretraining of diffusion policies on diverse robot data produces more successful, robust, and data-efficient policies for dexterous manipulation than single-task baselines, with performance scaling with pretraining size and diversity.
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective cs.RO · 2025-07-02 · unverdicted · none · ref 265 · internal anchor
The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.
World Simulation with Video Foundation Models for Physical AI cs.CV · 2025-10-28 · unverdicted · none · ref 9 · internal anchor
Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.
Aligning Perception, Reasoning, Modeling and Interaction: A Survey on Physical AI cs.AI · 2025-10-06 · unverdicted · none · ref 230 · internal anchor
A survey of physical AI that distinguishes theoretical physics reasoning from applied understanding and synthesizes advances in symbolic reasoning, embodied systems, and generative models to advocate for physics-grounded world models.

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer