HABIT is a large-scale robot demonstration dataset for human-present environments that elicits spatiotemporal synchronization, yielding, and gesture grounding behaviors absent from robot-only training data.
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
Mixed citation behavior. Most common role is background (66%).
abstract
General-purpose robots need a versatile body and an intelligent mind. Recent advancements in humanoid robots have shown great promise as a hardware platform for building generalist autonomy in the human world. A robot foundation model, trained on massive and diverse data sources, is essential for enabling the robots to reason about novel situations, robustly handle real-world variability, and rapidly learn new tasks. To this end, we introduce GR00T N1, an open foundation model for humanoid robots. GR00T N1 is a Vision-Language-Action (VLA) model with a dual-system architecture. The vision-language module (System 2) interprets the environment through vision and language instructions. The subsequent diffusion transformer module (System 1) generates fluid motor actions in real time. Both modules are tightly coupled and jointly trained end-to-end. We train GR00T N1 with a heterogeneous mixture of real-robot trajectories, human videos, and synthetically generated datasets. We show that our generalist robot model GR00T N1 outperforms the state-of-the-art imitation learning baselines on standard simulation benchmarks across multiple robot embodiments. Furthermore, we deploy our model on the Fourier GR-1 humanoid robot for language-conditioned bimanual manipulation tasks, achieving strong performance with high data efficiency.
representative citing papers
FlowHijack is the first dynamics-aware backdoor attack on flow-matching VLAs that achieves high success rates with stealthy triggers while preserving benign performance and making malicious actions kinematically indistinguishable from normal ones.
Embodied.cpp introduces a portable C++ inference runtime with modular layers for deploying VLA and WAM models on heterogeneous robots, reporting 100% and 91% task success on two models plus memory reduction on a WAM benchmark.
LongEgoRefer is a new benchmark of 1,498 referring expressions in 45-minute average egocentric videos that exposes the failure of existing Video REC models on sparse long-form spatio-temporal grounding.
SARL optimizes language prompt inputs to generalist vision-language-action policies through online RL to solve complex long-horizon tasks by composing existing skills.
VLA models adapted from VLMs can be pruned by 12-30% via a multi-module joint scheme based on divergence signals while keeping ~90% performance on LIBERO without post-pruning recovery, whereas standard pruning criteria collapse.
LIBERO-Safety supplies a scalable benchmark, data-generation pipeline, and 19,664-demonstration dataset that exposes a generalization-safety tension in current VLA models where diverse training improves collision avoidance but task success stays limited by trajectory quality and semantic understandi
Mix-QVLA is a task-evidence-aware mixed-precision PTQ framework for VLA models that preserves task-relevant evidence via evidence-mass and attribution-distribution metrics to guide bit allocation under memory and BitOps constraints.
EBench is a benchmark that evaluates generalist mobile manipulation policies on 26 tasks across 5 capability and 4 generalization dimensions, revealing distinct capability profiles among models with similar success rates.
ThinkingVLA is a Mixture-of-Transformers VLA model that performs interleaved forward CoT for subgoal and image prediction followed by inverse CoT grounded on the predicted image to generate actions.
MuseVLA adds on-demand sensor selection via tokens and converts readings into grounded sensor images for multimodal fusion, reporting 80.6% average success on real-robot dexterous tasks that need non-visual sensing.
LeaP introduces a learnable proprioception-conditioned diagonal Gaussian source prior for generative robot policies, raising average success rates over baselines by 6.5-25.5 points on 15 RoboTwin tasks.
Flow Reversal Steering steers flow matching generalist policies by reversing suboptimal actions to nearby better modes, enabling improved zero-shot control, quick distillation, and RL bootstrapping in robotic manipulation.
FTP-1 is the first foundation tactile policy, pretrained on ~3000 hours of data from 26 sources across 21 sensors; it improves performance on seen setups by 17.2% and transfers to unseen sensors with a 31% success-rate gain.
A prompt-only attack called command-preserving trajectory redirection can steer VLA robot behavior to attacker-chosen physical outcomes while the text still appears to match the intended task.
Ambient Diffusion Policy enables better imitation learning from suboptimal robot data by leveraging spectral properties to restrict data usage to specific diffusion times.
Self-distillation from a caption-conditioned video diffusion model to an image-and-prompt-conditioned executor, enhanced by RL from VLM feedback, enables task solving in world models.
X-Tokenizer creates semantic action tokens via asymmetric residual quantization and contrastive pretraining on large trajectory data, outperforming prior methods like FAST on robotic tasks.
ActProbe is an action-space detector that uses temporal consistency error and action chunk magnitude from policy outputs, mapped via LSTM-MLP, to predict failures earlier than baselines across policies and real-robot tasks.
ActionMap introduces a voxel heatmap action head for VLA models that improves policy learning by exploiting geometric structure in the action space.
VSTAT benchmark shows state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines on visual state tracking, failing at visual perception despite correct textual reasoning.
RoboTrustBench evaluates seven video world models on trustworthiness using four scenarios, six dimensions, and 13 criteria, finding gaps in constraint reasoning and unsafe instruction handling.
VLMs exhibit consistent vertical-distance entanglement in embeddings from perspective bias in natural images, producing accuracy gaps that a new synthetic benchmark SpatialTunnel exposes as model-intrinsic.
PhAIL provides an open benchmark and distributional evaluation method for real-robot VLA policies using time-to-success CDF, HRT scoring, and KS significance tests.
citing papers explorer
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 hours of unlabeled robot video.