robosuite: A Modular Simulation Framework and Benchmark for Robot Learning
Mixed citation behavior. Most common role is background (58%).
abstract
robosuite is a simulation framework for robot learning powered by the MuJoCo physics engine. It offers a modular design for creating robotic tasks as well as a suite of benchmark environments for reproducible research. This paper discusses the key system modules and the benchmark environments of our new release robosuite v1.5.
citing papers explorer
-
TAVIS: A Benchmark for Egocentric Active Vision and Anticipatory Gaze in Imitation Learning
TAVIS is a released benchmark showing active vision improves imitation learning in a task-dependent manner, multi-task policies struggle with shifts, and imitation produces human-like anticipatory gaze.
-
RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies
RoboLab is a new simulation benchmark with 120 tasks across visual, procedural, and relational axes that quantifies generalization gaps and perturbation sensitivity in task-generalist robotic policies.
-
BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation
BEHAVIOR-1K introduces a benchmark of 1,000 human everyday activities in realistic simulated scenes together with the OMNIGIBSON physics simulator to evaluate embodied AI.
-
LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning
LIBERO is a new benchmark for lifelong robot learning that evaluates transfer of declarative, procedural, and mixed knowledge across 130 manipulation tasks with provided demonstration data.
-
OopsieVerse: A Safety Benchmark with Damage-Aware Simulation for Robot Manipulation
OOPSIEVERSE is a new damage-aware simulation benchmark for household robot manipulation that converts contact, thermal, and fluid signals into task-agnostic damage metrics and demonstrates uses in safer policy learning and benchmarking.
-
VoLo: A Physical Orchestrator for Open-Vocabulary Long-Horizon Manipulation
VoLoAgent uses a VLM to steer heterogeneous robot capabilities as interruptible tools for long-horizon manipulation and introduces the RoboVoLo benchmark, claiming substantial outperformance over single VLA/VLM or tool-based systems with real-robot validation.
-
HapTile: A Haptic-Informed Vision-Tactile-Language-Action Dataset for Contact-Rich Imitation Learning
HapTile introduces a visuotactile dataset with haptic-informed teleoperation for language-conditioned contact-rich manipulation tasks and provides baseline policy benchmarks.
-
Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?
Dream.exe evaluates 8 video generation models on 101 manipulation tasks by converting generated videos into executable robot trajectories in a simulator, finding measurable success rates that visual metrics do not predict.
-
Same Weights, Different Robot: A Deployment Safety View of VLA Policies
The paper identifies a deployment safety gap in VLA policies where identical checkpoints can be executable-inequivalent due to action metadata mismatches, supported by a derived closed-form transform and empirical drift measurements on LIBERO benchmarks.
-
CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models
Capability vectors extracted from parameter differences between standard and auxiliary-finetuned VLA models can be merged into pretrained weights to match auxiliary-training performance while reducing computational overhead during adaptation.
-
CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation
CoRAL lets LLMs act as adaptive cost designers for motion planners while using VLM priors and online identification to handle unknown physics, achieving over 50% higher success rates than baselines in unseen contact-rich robotic scenarios.
-
Atomic-Probe Governance for Skill Updates in Compositional Robot Policies
A cross-version swap protocol reveals dominant skills that swing composition success by up to 50 percentage points, and an atomic probe with selective revalidation governs updates at lower cost than always re-testing full compositions.
-
HANDFUL: Sequential Grasp-Conditioned Dexterous Manipulation with Resource Awareness
HANDFUL learns resource-aware grasps using finger contact rewards and curriculum learning to improve success on sequential dexterous tasks in simulation and on a real LEAP hand.
-
Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations
ACO-MoE recovers 95.3% of clean-input performance in visual control tasks under Markov-switching corruptions by routing restoration experts and anchoring representations to clean foreground masks.
-
BiCoord: A Bimanual Manipulation Benchmark towards Long-Horizon Spatial-Temporal Coordination
BiCoord is a new benchmark for long-horizon tightly coordinated bimanual manipulation that includes quantitative metrics and shows existing policies like DP, RDT, Pi0 and OpenVLA-OFT struggle on such tasks.
-
ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs
ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.
-
MIMIC-D: Multi-modal Imitation for MultI-agent Coordination with Decentralized Diffusion Policies
MIMIC-D enables multi-modal multi-agent coordination via joint training of decentralized diffusion policies using only local information.
-
Voyager: An Open-Ended Embodied Agent with Large Language Models
Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more unique items and 15.3x faster milestone unlocks than prior methods while generalizing skills
-
Freeform Preference Learning for Robotic Manipulation
Freeform Preference Learning trains language-conditioned multi-axis reward models from human pairwise preferences to produce steerable and compositional robot policies that outperform sparse and binary-preference baselines by 38 percentage points.
-
Sequential Planning via Anchored Robotic Keypoints
SPARK reaches 43.7% success on six LIBERO-PRO cells using LLM-generated typed behavior trees plus multi-prompt perception and recovery, more than doubling CaP-Agent0 and VLA baselines.
-
What Demonstration Curation Metrics Do to Your Policy
On a LIBERO pick-and-place task with gripper defects, curation metrics with the highest defect-detection AUROC produce the worst policies while lower-AUROC metrics nearly match the oracle, and many metrics rely on episode length as a proxy.
-
iMaC: Translating Actions into Motion and Contact Images for Embodied World Models
iMaC introduces image-based action tokens in a dual-branch architecture to improve future state prediction and control in embodied world models over vector-based baselines.
-
Difference-Aware Retrieval Policies for Imitation Learning
DARP reparameterizes imitation learning around local neighborhood structure using k-NN expert states, actions, and relative distance vectors, delivering 15-46% gains over behavior cloning in control and manipulation tasks.
-
MoDex: A Diffusion Policy for Sequential Multi-Object Dexterous Grasping
MoDex is a diffusion policy conditioned on opposition space and point cloud, trained first by imitation learning then RL fine-tuning, that reports higher success rates than baselines for sequential multi-object dexterous grasping in simulation and real-world tests.
-
What Are We Actually Benchmarking in Robot Manipulation?
LIBERO and CALVIN fail multiple proposed diagnostics for shortcut solvability, statistical significance, overfitting, and data dependence, while a tiny 0.09B probe reaches near-SOTA on LIBERO.
-
CLAW: Learning Continuous Latent Action World Models via Adversarial Latent Regularization
CLAW is an end-to-end self-supervised method that learns semantically meaningful continuous latent actions and predictive world models from action-free videos to support imitation learning and goal-directed planning.
-
Set-Supervised Diffusion Policy: Learning Action-Chunking Diffusion through Corrections
SDP constructs sets of desired action-chunks from human correction pairs and trains diffusion policies to align with those sets, yielding better performance and robustness than standard behavior cloning on robotic tasks.
-
Drift Q-Learning
DriftQL is a single-pass offline RL algorithm using drift regularization that outperforms diffusion and flow policies on standard benchmarks.
-
RoboWits: Unexpected Challenges for Robotic Creative Problem Solving
The RoboWits benchmark, with 238 tasks, shows that pre-trained VLAs succeed on seed tasks but fail on mutated ones, highlighting brittleness in reasoning.
-
Frequency-Guided Action Diffusion via Sub-Frequency Manifold Traversal
FGO guides diffusion policy generation via expanding spectral bands on sub-frequency manifolds to improve action smoothness on 15 robotic manipulation tasks.
-
Behavior-Consistent Deep Reinforcement Learning
QED bounds cross-run KL divergence in Boltzmann policies by setting temperature proportional to Q-disagreement and reduces return variance by two orders of magnitude on 18 continuous-control tasks without performance loss.
-
Beyond Action Residuals: Real-World Robot Policy Steering via Bottleneck Latent Reinforcement Learning
ZPRL adapts frozen flow-matching imitation policies via RL perturbations on a task-relevant bottleneck latent, yielding 33.7% higher average success on four real-world manipulation tasks than action-residual baselines.
-
COBALT: Crowdsourcing Robot Learning via Cloud-Based Teleoperation with Smartphones
COBALT enables scalable crowdsourced teleoperation of robots using smartphones, supporting concurrent users with low latency and yielding a 7500+ demonstration dataset validated on imitation learning tasks.
-
DexJoCo: A Benchmark and Toolkit for Task-Oriented Dexterous Manipulation on MuJoCo
DexJoCo is a benchmark and toolkit with 11 functionally grounded tasks, 1.1K trajectories, and empirical benchmarks for task-oriented dexterous manipulation on MuJoCo.
-
Ada-Diffuser: Latent-Aware Adaptive Diffusion for Decision-Making
Ada-Diffuser is a causal diffusion model that jointly learns observed interaction structure and underlying latent dynamics from minimal observations for adaptive planning and policy learning.
-
Slot-MPC: Goal-Conditioned Model Predictive Control with Object-Centric Representations
Slot-MPC learns slot representations to build a differentiable object-centric dynamics model that supports efficient gradient-based MPC for robotic manipulation in novel situations.
-
HeteroGenManip: Generalizable Manipulation For Heterogeneous Object Interactions
A task-conditioned two-stage system decouples grasp localization from interaction trajectory planning using specialized foundation models to improve generalization across heterogeneous object types.
-
Kintsugi: Learning Policies by Repairing Executable Knowledge Bases
Kintsugi learns policies by repairing composable executable knowledge bases through agentic diagnosis, localized typed edits, and deterministic verification gates that admit only improvements.
-
BEACON: Cross-Domain Co-Training of Generative Robot Policies via Best-Effort Adaptation
BEACON uses discrepancy-aware importance reweighting to jointly train diffusion-based robot policies and source sample weights, improving performance over target-only and fixed-ratio baselines in cross-domain manipulation tasks.
-
GS-Playground: A High-Throughput Photorealistic Simulator for Vision-Informed Robot Learning
GS-Playground delivers a high-throughput photorealistic simulator for vision-informed robot learning via parallel physics integrated with batch 3D Gaussian Splatting at 10^4 FPS and an automated Real2Sim workflow for consistent environments.
-
Visual-Tactile Peg-in-Hole Assembly Learning from Peg-out-of-Hole Disassembly
A visual-tactile RL method learns peg-in-hole assembly from reversed peg-out-of-hole disassembly trajectories, reaching 87.5% success on seen objects and 77.1% on unseen objects while lowering contact forces.
-
A Mechanistic Analysis of Sim-and-Real Co-Training in Generative Robot Policies
Sim-and-real co-training for robot policies is driven primarily by balanced cross-domain representation alignment and secondarily by domain-dependent action reweighting.
-
Learning Without Losing Identity: Capability Evolution for Embodied Agents
Embodied agents maintain persistent identity while evolving modular capabilities through a closed-loop process, raising simulated task success from 32.4% to 91.3% with zero policy drift.
-
Unify Robot Actions in Camera Frame
CalibAll estimates camera extrinsics on existing datasets to convert robot actions into a unified camera-frame representation, enabling stronger cross-embodiment pretraining.
-
RoboEval: Where Robotic Manipulation Meets Structured and Scalable Evaluation
RoboEval is a new benchmark providing eight bimanual tasks, thousands of expert demonstrations, and standardized metrics for efficiency, coordination, safety, and failure localization in robotic manipulation.
-
RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation
RoboTwin 2.0 automates diverse synthetic data creation for dual-arm robots via MLLMs and five-axis domain randomization, leading to 228-367% gains in manipulation success.
-
From Action Labels to Sets: Rethinking Action Supervision for Imitation Learning from Corrective Feedback
CLIC uses set-valued action targets from interactive human corrections instead of pointwise labels to train more robust imitation learning policies.
-
RoboMD: Uncovering Robot Vulnerabilities through Semantic Potential Fields
A deep RL vulnerability-prediction policy trained in semantic embedding space finds up to 23% more unique robot manipulation failures than vision-language baselines and enables more efficient fine-tuning.
-
RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots
RoboCasa supplies a large-scale kitchen simulator, generative assets, 100 tasks, and automated data pipelines that produce a clear scaling trend in imitation learning for generalist robots.
-
Evaluating Real-World Robot Manipulation Policies in Simulation
SIMPLER simulated environments yield policy performance that correlates strongly with real-world robot manipulation results and captures similar sensitivity to distribution shifts.