super hub Mixed citations

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

Linxi "Jim" Fan, Nikita Cherniadev, NVIDIA: Johan Bjorck, Runyu Ding, Xingye Da · 2025 · cs.RO · arXiv 2503.14734

Mixed citation behavior. Most common role is background (66%).

338 Pith papers citing it

Background 66% of classified citations

open full Pith review browse 338 citing papers more from Linxi "Jim" Fan arXiv PDF

abstract

General-purpose robots need a versatile body and an intelligent mind. Recent advancements in humanoid robots have shown great promise as a hardware platform for building generalist autonomy in the human world. A robot foundation model, trained on massive and diverse data sources, is essential for enabling the robots to reason about novel situations, robustly handle real-world variability, and rapidly learn new tasks. To this end, we introduce GR00T N1, an open foundation model for humanoid robots. GR00T N1 is a Vision-Language-Action (VLA) model with a dual-system architecture. The vision-language module (System 2) interprets the environment through vision and language instructions. The subsequent diffusion transformer module (System 1) generates fluid motor actions in real time. Both modules are tightly coupled and jointly trained end-to-end. We train GR00T N1 with a heterogeneous mixture of real-robot trajectories, human videos, and synthetically generated datasets. We show that our generalist robot model GR00T N1 outperforms the state-of-the-art imitation learning baselines on standard simulation benchmarks across multiple robot embodiments. Furthermore, we deploy our model on the Fourier GR-1 humanoid robot for language-conditioned bimanual manipulation tasks, achieving strong performance with high data efficiency.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 63 baseline 18 method 7 other 1

citation-polarity summary

background 59 baseline 19 unclear 6 use method 5

claims ledger

abstract General-purpose robots need a versatile body and an intelligent mind. Recent advancements in humanoid robots have shown great promise as a hardware platform for building generalist autonomy in the human world. A robot foundation model, trained on massive and diverse data sources, is essential for enabling the robots to reason about novel situations, robustly handle real-world variability, and rapidly learn new tasks. To this end, we introduce GR00T N1, an open foundation model for humanoid robots. GR00T N1 is a Vision-Language-Action (VLA) model with a dual-system architecture. The vision-lang

authors

Fernando Casta\~neda Linxi "Jim" Fan Nikita Cherniadev NVIDIA: Johan Bjorck Runyu Ding Xingye Da

co-cited works

representative citing papers

HABIT: Human-Aware Behavior and Interaction Training Dataset for Robot Manipulation

cs.RO · 2026-06-30 · unverdicted · novelty 8.0

HABIT is a large-scale robot demonstration dataset for human-present environments that elicits spatiotemporal synchronization, yielding, and gesture grounding behaviors absent from robot-only training data.

FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models

cs.CV · 2026-03-30 · unverdicted · novelty 8.0

FlowHijack is the first dynamics-aware backdoor attack on flow-matching VLAs that achieves high success rates with stealthy triggers while preserving benign performance and making malicious actions kinematically indistinguishable from normal ones.

Embodied.cpp: A Portable Inference Runtime of Embodied AI Models on Heterogeneous Robots

cs.RO · 2026-07-02 · unverdicted · novelty 7.0

Embodied.cpp introduces a portable C++ inference runtime with modular layers for deploying VLA and WAM models on heterogeneous robots, reporting 100% and 91% task success on two models plus memory reduction on a WAM benchmark.

LongEgoRefer: A Benchmark for Long-Form Egocentric Video Referring Expression Comprehension

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

LongEgoRefer is a new benchmark of 1,498 referring expressions in 45-minute average egocentric videos that exposes the failure of existing Video REC models on sparse long-form spatio-temporal grounding.

Adapting Generalist Robot Policies with Semantic Reinforcement Learning

cs.RO · 2026-06-30 · unverdicted · novelty 7.0

SARL optimizes language prompt inputs to generalist vision-language-action policies through online RL to solve complex long-horizon tasks by composing existing skills.

Revisiting Parameter Redundancy in Vision-Language-Action Models: Insights from VLM-to-VLA Adaptation

cs.RO · 2026-06-30 · unverdicted · novelty 7.0

VLA models from VLM adaptation can be pruned 12-30% via multi-module joint scheme based on divergence signals while keeping ~90% performance on LIBERO without post-pruning recovery, unlike standard criteria that collapse.

LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models

cs.RO · 2026-06-22 · unverdicted · novelty 7.0

LIBERO-Safety supplies a scalable benchmark, data-generation pipeline, and 19,664-demonstration dataset that exposes a generalization-safety tension in current VLA models where diverse training improves collision avoidance but task success stays limited by trajectory quality and semantic understandi

Improving Robotic Generalist Policies via Flow Reversal Steering

cs.RO · 2026-06-11 · unverdicted · novelty 7.0

Flow Reversal Steering steers flow matching generalist policies by reversing suboptimal actions to nearby better modes, enabling improved zero-shot control, quick distillation, and RL bootstrapping in robotic manipulation.

FTP-1: A Generalist Foundation Tactile Policy Across Tactile Sensors for Contact-Rich Manipulation

cs.RO · 2026-06-11 · unverdicted · novelty 7.0

FTP-1 is the first foundation tactile policy pretrained on ~3000 hours of data from 26 sources across 21 sensors that improves performance on seen setups by 17.2% and transfers to unseen sensors with 31% success rate gain.

Trajectory-Level Redirection Attacks on Vision-Language-Action Models

cs.RO · 2026-06-11 · unverdicted · novelty 7.0

A prompt-only attack called command-preserving trajectory redirection can steer VLA robot behavior to attacker-chosen physical outcomes while the text still appears to match the intended task.

Ambient Diffusion Policy: Imitation Learning from Suboptimal Data in Robotics

cs.RO · 2026-06-10 · unverdicted · novelty 7.0

Ambient Diffusion Policy enables better imitation learning from suboptimal robot data by leveraging spectral properties to restrict data usage to specific diffusion times.

World Model Self-Distillation: Training World Models to Solve General Tasks

cs.CV · 2026-06-10 · unverdicted · novelty 7.0

Self-distillation from a caption-conditioned video diffusion model to an image-and-prompt-conditioned executor, enhanced by RL from VLM feedback, enables task solving in world models.

X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining

cs.CV · 2026-06-07 · unverdicted · novelty 7.0

X-Tokenizer creates semantic action tokens via asymmetric residual quantization and contrastive pretraining on large trajectory data, outperforming prior methods like FAST on robotic tasks.

ActProbe: Action-Space Probe for Early Failure Detection of Generative Robot Policies

cs.RO · 2026-06-07 · unverdicted · novelty 7.0

ActProbe is an action-space detector that uses temporal consistency error and action chunk magnitude from policy outputs, mapped via LSTM-MLP, to predict failures earlier than baselines across policies and real-robot tasks.

ActionMap: Robot Policy Learning via Voxel Action Heatmap

cs.RO · 2026-06-05 · unverdicted · novelty 7.0

ActionMap introduces a voxel heatmap action head for VLA models that improves policy learning by exploiting geometric structure in the action space.

Benchmarking Visual State Tracking in Multimodal Video Understanding

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

VSTAT benchmark shows state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines on visual state tracking, failing at visual perception despite correct textual reasoning.

RoboTrustBench: Benchmarking the Trustworthiness of Video World Models for Robotic Manipulation

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

RoboTrustBench evaluates seven video world models on trustworthiness using four scenarios, six dimensions, and 13 criteria, finding gaps in constraint reasoning and unsafe instruction handling.

Why Far Looks Up: Probing Spatial Representation in Vision-Language Models

cs.CV · 2026-05-28 · conditional · novelty 7.0

VLMs exhibit consistent vertical-distance entanglement in embeddings from perspective bias in natural images, producing accuracy gaps that a new synthetic benchmark SpatialTunnel exposes as model-intrinsic.

PhAIL: A Real-Robot VLA Benchmark and Distributional Methodology

cs.RO · 2026-05-28 · unverdicted · novelty 7.0

PhAIL provides an open benchmark and distributional evaluation method for real-robot VLA policies using time-to-success CDF, HRT scoring, and KS significance tests.

MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models

cs.AI · 2026-05-28 · unverdicted · novelty 7.0

MiraBench defines action-conditioned reliability via three levels (physics adherence, action-following fidelity, optimism bias detection) and applies it to 12 model configurations using a 16,000-judgment human corpus, finding visual fidelity a poor proxy for action fidelity, no reliable scale benefi

Beyond Binary: Sim-to-Real Dexterous Manipulation with Physics-Grounded Contact Representation

cs.RO · 2026-05-27 · unverdicted · novelty 7.0

CoP tactile representation with differentiable calibration enables zero-shot sim-to-real transfer and outperforms binary and raw-taxel baselines on peg-in-hole insertion and ball balancing with a multi-fingered hand.

{\Omega}-QVLA: Robust Quantization for Vision-Language-Action Models via Composite Rotation and Per-step Scaling

cs.CV · 2026-05-27 · unverdicted · novelty 7.0

Omega-QVLA is a post-training quantization framework achieving uniform W4A4 for VLA models' LLM backbone and DiT action head via composite SVD-Hadamard rotation and per-step scaling, matching FP16 success rates on LIBERO.

GesVLA: Gesture-Aware Vision-Language-Action Model Embedded Representations

cs.RO · 2026-05-21 · unverdicted · novelty 7.0

GesVLA encodes gesture features directly into the latent space of VLA models using a dual-VLM architecture and a rendering-based data pipeline, yielding improved target grounding in real robotic tasks.

Closed Loop Dynamic Driving Data Mixture for Real-Synthetic Co-Training

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

AutoScale is a closed-loop data engine using Graph-RAE for scene representation and Cluster-GA for importance-based retrieval to improve real-synthetic co-training for autonomous driving.

citing papers explorer

Showing 50 of 215 citing papers after filters.

HABIT: Human-Aware Behavior and Interaction Training Dataset for Robot Manipulation cs.RO · 2026-06-30 · unverdicted · none · ref 14 · internal anchor
HABIT is a large-scale robot demonstration dataset for human-present environments that elicits spatiotemporal synchronization, yielding, and gesture grounding behaviors absent from robot-only training data.
Embodied.cpp: A Portable Inference Runtime of Embodied AI Models on Heterogeneous Robots cs.RO · 2026-07-02 · unverdicted · none · ref 4 · internal anchor
Embodied.cpp introduces a portable C++ inference runtime with modular layers for deploying VLA and WAM models on heterogeneous robots, reporting 100% and 91% task success on two models plus memory reduction on a WAM benchmark.
Adapting Generalist Robot Policies with Semantic Reinforcement Learning cs.RO · 2026-06-30 · unverdicted · none · ref 8 · internal anchor
SARL optimizes language prompt inputs to generalist vision-language-action policies through online RL to solve complex long-horizon tasks by composing existing skills.
Revisiting Parameter Redundancy in Vision-Language-Action Models: Insights from VLM-to-VLA Adaptation cs.RO · 2026-06-30 · unverdicted · none · ref 3 · internal anchor
VLA models from VLM adaptation can be pruned 12-30% via multi-module joint scheme based on divergence signals while keeping ~90% performance on LIBERO without post-pruning recovery, unlike standard criteria that collapse.
LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models cs.RO · 2026-06-22 · unverdicted · none · ref 4 · internal anchor
LIBERO-Safety supplies a scalable benchmark, data-generation pipeline, and 19,664-demonstration dataset that exposes a generalization-safety tension in current VLA models where diverse training improves collision avoidance but task success stays limited by trajectory quality and semantic understandi
Improving Robotic Generalist Policies via Flow Reversal Steering cs.RO · 2026-06-11 · unverdicted · none · ref 46 · internal anchor
Flow Reversal Steering steers flow matching generalist policies by reversing suboptimal actions to nearby better modes, enabling improved zero-shot control, quick distillation, and RL bootstrapping in robotic manipulation.
FTP-1: A Generalist Foundation Tactile Policy Across Tactile Sensors for Contact-Rich Manipulation cs.RO · 2026-06-11 · unverdicted · none · ref 2 · internal anchor
FTP-1 is the first foundation tactile policy pretrained on ~3000 hours of data from 26 sources across 21 sensors that improves performance on seen setups by 17.2% and transfers to unseen sensors with 31% success rate gain.
Trajectory-Level Redirection Attacks on Vision-Language-Action Models cs.RO · 2026-06-11 · unverdicted · none · ref 57 · internal anchor
A prompt-only attack called command-preserving trajectory redirection can steer VLA robot behavior to attacker-chosen physical outcomes while the text still appears to match the intended task.
Ambient Diffusion Policy: Imitation Learning from Suboptimal Data in Robotics cs.RO · 2026-06-10 · unverdicted · none · ref 3 · internal anchor
Ambient Diffusion Policy enables better imitation learning from suboptimal robot data by leveraging spectral properties to restrict data usage to specific diffusion times.
ActProbe: Action-Space Probe for Early Failure Detection of Generative Robot Policies cs.RO · 2026-06-07 · unverdicted · none · ref 6 · internal anchor
ActProbe is an action-space detector that uses temporal consistency error and action chunk magnitude from policy outputs, mapped via LSTM-MLP, to predict failures earlier than baselines across policies and real-robot tasks.
ActionMap: Robot Policy Learning via Voxel Action Heatmap cs.RO · 2026-06-05 · unverdicted · none · ref 7 · internal anchor
ActionMap introduces a voxel heatmap action head for VLA models that improves policy learning by exploiting geometric structure in the action space.
PhAIL: A Real-Robot VLA Benchmark and Distributional Methodology cs.RO · 2026-05-28 · unverdicted · none · ref 30 · internal anchor
PhAIL provides an open benchmark and distributional evaluation method for real-robot VLA policies using time-to-success CDF, HRT scoring, and KS significance tests.
Beyond Binary: Sim-to-Real Dexterous Manipulation with Physics-Grounded Contact Representation cs.RO · 2026-05-27 · unverdicted · none · ref 1 · internal anchor
CoP tactile representation with differentiable calibration enables zero-shot sim-to-real transfer and outperforms binary and raw-taxel baselines on peg-in-hole insertion and ball balancing with a multi-fingered hand.
GesVLA: Gesture-Aware Vision-Language-Action Model Embedded Representations cs.RO · 2026-05-21 · unverdicted · none · ref 14 · internal anchor
GesVLA encodes gesture features directly into the latent space of VLA models using a dual-VLM architecture and a rendering-based data pipeline, yielding improved target grounding in real robotic tasks.
Dexora: Open-source VLA for High-DoF Bimanual Dexterity cs.RO · 2026-05-18 · unverdicted · none · ref 7 · internal anchor
Dexora is the first open-source VLA system for dual-arm dual-hand high-DoF manipulation, trained on 100K simulated and 10K real teleoperated trajectories with a discriminator-weighted diffusion policy, achieving 66.7% dexterous success versus 51.7% for baselines.
RoboFlow4D: A Lightweight Flow World Model Toward Real-Time Flow-Guided Robotic Manipulation cs.RO · 2026-05-17 · unverdicted · none · ref 2 · internal anchor
RoboFlow4D is an end-to-end lightweight flow world model that predicts multi-frame 3D flows from visual observations and textual instructions to provide explicit planning for real-time robotic manipulation.
RotVLA: Rotational Latent Action for Vision-Language-Action Model cs.RO · 2026-05-13 · unverdicted · none · ref 22 · internal anchor
RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.
SafeManip: A Property-Driven Benchmark for Temporal Safety Evaluation in Robotic Manipulation cs.RO · 2026-05-12 · unverdicted · none · ref 1 · 2 links · internal anchor
SafeManip is a benchmark applying reusable LTLf templates across eight safety categories to evaluate temporal properties in robotic manipulation on VLA policies.
From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation cs.RO · 2026-05-12 · unverdicted · none · ref 3 · internal anchor
MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
Premover: Fast Vision-Language-Action Control by Acting Before Instructions Are Complete cs.RO · 2026-05-12 · unverdicted · none · ref 15 · internal anchor
Premover enables VLA policies to act on partial instructions by precomputing focus maps from intermediate backbone layers, reducing wall-clock time 13.6 percent on LIBERO while preserving 95 percent success rate.
DreamAvoid: Critical-Phase Test-Time Dreaming to Avoid Failures in VLA Policies cs.RO · 2026-05-12 · unverdicted · none · ref 30 · internal anchor
DreamAvoid uses a Dream Trigger, Action Proposer, and Dream Evaluator trained on success/failure/boundary data to let VLA policies avoid critical-phase failures via test-time future dreaming.
RIO: Flexible Real-Time Robot I/O for Cross-Embodiment Robot Learning cs.RO · 2026-05-12 · unverdicted · none · ref 4 · internal anchor
RIO introduces a lightweight open-source framework that abstracts real-time robot I/O to support easy switching between embodiments and platforms for collecting data and deploying VLAs.
Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation cs.RO · 2026-05-12 · conditional · none · ref 4 · internal anchor
A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.
Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models cs.RO · 2026-05-12 · unverdicted · none · ref 12 · 2 links · internal anchor
Pace-and-Path Correction decomposes a quadratic cost minimization into orthogonal pace and path channels to correct chunked actions in VLA models, raising success rates by up to 28.8% in dynamic settings.
VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models cs.RO · 2026-05-11 · unverdicted · none · ref 2 · internal anchor
VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.
Octopus Protocol: One-Shot Hardware Discovery and Control for AI Agents via Infrastructure-as-Prompts cs.RO · 2026-05-09 · unverdicted · none · ref 21 · internal anchor
Octopus Protocol enables one-shot hardware onboarding for AI agents by running a five-stage LLM-driven pipeline that probes devices, infers capabilities, generates an MCP server, and deploys it for closed-loop control.
NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models cs.RO · 2026-05-08 · unverdicted · none · ref 15 · internal anchor
NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.
AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models cs.RO · 2026-05-08 · unverdicted · none · ref 5 · 2 links · internal anchor
AT-VLA proposes adaptive tactile injection and a dual-stream tactile reaction mechanism to enhance VLA models for contact-rich robotic manipulation with real-time responses.
Latent Bridge: Feature Delta Prediction for Efficient Dual-System Vision-Language-Action Model Inference cs.RO · 2026-05-04 · unverdicted · none · ref 2 · internal anchor
Latent Bridge predicts VLM feature deltas to reduce VLM calls by 50-75% in dual-system VLA models while retaining 95-100% performance and achieving 1.65-1.73x speedup across LIBERO, RoboCasa, and ALOHA benchmarks.
Phone2Act: A Low-Cost, Hardware-Agnostic Teleoperation System for Scalable VLA Data Collection cs.RO · 2026-05-03 · unverdicted · none · ref 5 · internal anchor
Phone2Act is a smartphone-based teleoperation system that collects synchronized multi-camera robot manipulation data in LeRobot format without custom hardware, validated by fine-tuning GR00T-N1.5 to 90% success on a real Dobot CR5 pick-and-place task using 130 episodes.
Being-H0.7: A Latent World-Action Model from Egocentric Videos cs.RO · 2026-04-30 · unverdicted · none · ref 5 · internal anchor
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
DiscreteRTC: Discrete Diffusion Policies are Natural Asynchronous Executors cs.RO · 2026-04-27 · unverdicted · none · ref 6 · 2 links · internal anchor
Discrete diffusion policies act as natural asynchronous executors for robotics by treating action generation as iterative unmasking, yielding higher success rates and lower computation than flow-matching real-time chunking in dynamic tasks.
Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment cs.RO · 2026-04-27 · unverdicted · none · ref 14 · internal anchor
VLA models exhibit a compute-bound VLM phase followed by a memory-bound action phase on edge hardware; DP-Cache and V-AEFusion reduce redundancy and enable pipeline parallelism for up to 6x speedup on NPUs with marginal task degradation.
Modular Sensory Stream for Integrating Physical Feedback in Vision-Language-Action Models cs.RO · 2026-04-25 · unverdicted · none · ref 5 · internal anchor
MoSS augments VLAs with decoupled modality streams for multiple physical signals, achieving synergistic gains in real-world robot tasks via joint attention and auxiliary future-signal prediction.
RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation cs.RO · 2026-04-21 · unverdicted · none · ref 7 · 2 links · internal anchor
RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.
Flow Motion Policy: Manipulator Motion Planning with Flow Matching Models cs.RO · 2026-04-08 · unverdicted · none · ref 36 · internal anchor
Flow Motion Policy uses flow matching to model distributions over feasible manipulator paths, enabling best-of-N sampling with post-generation collision filtering to improve success and efficiency over prior neural and sampling-based planners.
DFM-VLA: Iterative Action Refinement for Robot Manipulation via Discrete Flow Matching cs.RO · 2026-03-27 · conditional · none · ref 1 · internal anchor
DFM-VLA uses discrete flow matching to iteratively refine action tokens in VLA models, outperforming autoregressive and diffusion baselines with 4.44 average success length on CALVIN and 95.7% success on LIBERO.
VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models cs.RO · 2026-03-23 · unverdicted · none · ref 2 · internal anchor
VP-VLA decouples high-level reasoning from low-level control in VLA models by rendering spatial anchors as visual prompts directly in the RGB observation space, outperforming end-to-end baselines.
AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models cs.RO · 2026-03-10 · unverdicted · none · ref 4 · internal anchor
AR-VLA introduces a standalone autoregressive action expert with long-lived memory that generates context-aware continuous actions for VLAs, replacing chunk-based heads with smoother trajectories and maintained task success.
Force Policy: Learning Hybrid Force-Position Control Policy under Interaction Frame for Contact-Rich Manipulation cs.RO · 2026-02-25 · unverdicted · none · ref 6 · internal anchor
Force Policy learns a global vision policy for free space and a local force-feedback policy that recovers an interaction frame to execute stable hybrid force-position control in contact-rich manipulation.
PhysMem: Scaling Test-Time Memory for Embodied Physical Reasoning cs.RO · 2026-02-23 · unverdicted · none · ref 44 · internal anchor
PhysMem enables VLM-based robot planners to learn and verify physical properties through test-time interaction and hypothesis testing, raising success on a brick insertion task from 23% to 76%.
UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models cs.RO · 2026-02-23 · unverdicted · none · ref 25 · internal anchor
UniLACT improves VLA models by adding depth-aware unified latent action pretraining that outperforms RGB-only baselines on seen and unseen manipulation tasks.
Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation cs.RO · 2026-02-18 · unverdicted · none · ref 1 · internal anchor
PhysGen uses video models to learn physics for robots, outperforming baselines by up to 13.8% on Libero and matching specialized models in real-world tasks.
Action-to-Action Flow Matching cs.RO · 2026-02-07 · unverdicted · none · ref 2 · internal anchor
A2A flow matching starts action generation from prior proprioceptive actions in latent space to enable single-step high-quality predictions in robotic policies.
DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos cs.RO · 2026-02-06 · unverdicted · none · ref 8 · internal anchor
DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.
VLA-Corrector: Lightweight Detect-and-Correct Inference for Adaptive Action Horizon cs.RO · 2026-07-02 · unverdicted · none · ref 1 · internal anchor
VLA-Corrector adds a detect-and-correct inference layer using a latent vision monitor and online gradient guidance to enable adaptive action horizons in chunked VLA policies.
FurnitureVLA: Learning Long-Horizon Bimanual Furniture Assembly with Vision-Language-Action Model cs.RO · 2026-07-01 · unverdicted · none · ref 32 · internal anchor
Progress-enhanced VLA model raises simulated bimanual furniture assembly success from 48% to 80% across three furniture types and shows 16% drop on real Kinova robot.
ROSA: A Robotics Foundation Model Serving System for Robot Factories cs.RO · 2026-07-01 · unverdicted · none · ref 6 · internal anchor
ROSA introduces shared GPU-pool serving, robotics-aware abstractions for multi-model pipelines, and factory-productivity scheduling that improves output by up to 12.06x over dedicated per-robot systems.
Human-Centric Transferable Tactile Pre-Training for Dexterous Robotic Manipulation cs.RO · 2026-07-01 · unverdicted · none · ref 49 · internal anchor
Introduces H-Tac human tactile-action dataset and TTP pre-training that unifies spaces and predicts future tactile signals to improve robotic dexterous manipulation transfer.
RoboWorld: Fast and Reliable Neural Simulators for Generalist Robot Policy Evaluation cs.RO · 2026-07-01 · unverdicted · none · ref 10 · internal anchor
RoboWorld introduces an automated pipeline using autoregressive video world models and task-progress VLM scoring, plus Step Forcing for long-horizon stability, to achieve high correlation with real robot policy evaluation.

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer