RoboLab is a new simulation benchmark with 120 tasks across visual, procedural, and relational axes that quantifies generalization gaps and perturbation sensitivity in task-generalist robotic policies.
hub Mixed citations
Maniskill2: A unified benchmark for generalizable manipulation skills
Mixed citation behavior. Most common role is background (45%).
hub tools
citation-role summary
citation-polarity summary
representative citing papers
LIBERO is a new benchmark for lifelong robot learning that evaluates transfer of declarative, procedural, and mixed knowledge across 130 manipulation tasks with provided demonstration data.
ForesightSafety-VLA creates a diagnostic benchmark for VLA safety with taxonomy across physical, language, and visual risks, showing perception and structure variations cause more safety degradation than language changes in tested models.
PhysEditWorld is a new dataset of over 60 million frames from 12 UE5 cinematic scenes with synchronized multimodal signals and explicit gravity labels, built via replay to support physics-editable world models.
SWAAP is the first two-stage poisoning framework that identifies a harmful target world model via bilevel optimization and realizes it through stealth-constrained gradient matching on a limited fraction of fine-tuning transitions.
HARBOR is a new agentic harness framework that automates robot RL workflows end-to-end across 16 tasks in manipulation, locomotion, and dexterous control, matching or exceeding default configurations while enabling sim-to-real transfer.
VHYDRO is a support-safe variational hybrid filter that jointly recovers continuous latent states, discrete contact modes, and sparse port-Hamiltonian laws per regime while preventing loss of feasible transitions.
Pace-and-Path Correction decomposes a quadratic cost minimization into orthogonal pace and path channels to correct chunked actions in VLA models, raising success rates by up to 28.8% in dynamic settings.
ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.
Pipette supplies an open wet-lab simulation platform, 11-task benchmark, and perturbation-based augmentation pipeline that raises VLA success rates on sample handling and device tasks from limited demonstrations.
FATE-VLA reframes VLA evaluation as active failure discovery and reports uncovering up to 29.7% more failures across four models while revealing diverse failure modes.
InfoAtlas is a pretrained neural model for zero-shot mutual information estimation that matches state-of-the-art accuracy with 100x speedup and handles varying dimensions via a single model.
RoboWits benchmark with 238 tasks shows pre-trained VLAs succeed on seed tasks but fail on mutated ones, highlighting brittleness in reasoning.
DexHoldem is a new benchmark providing 1,470 teleoperated demonstrations across 14 manipulation primitives, plus standardized tests for dexterous policy execution and agentic perception in a physical Texas Hold'em setting.
ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.
AffordSim integrates open-vocabulary 3D affordance prediction into simulation trajectory generation to create a 50-task benchmark that reaches 93% of manual annotation success rates and enables 24% average zero-shot success on a real Franka FR3.
SIM1 converts sparse real demonstrations into high-fidelity synthetic data through physics-aligned simulation, yielding policies that match real-data performance at a 1:15 ratio with 90% zero-shot success on deformable manipulation.
VideoPhy benchmark shows state-of-the-art text-to-video models follow physical commonsense and text prompts in only 39.6% of cases for the best model.
RoboCasa supplies a large-scale kitchen simulator, generative assets, 100 tasks, and automated data pipelines that produce a clear scaling trend in imitation learning for generalist robots.
LaST-HD creates a shared latent dynamics space via a world model to transfer physical reasoning from scalable human-hand demonstrations to robots, achieving over 90% accuracy with 20 minutes of new data after mixed training.
MagicSim is a unified embodied interaction infrastructure built on a deterministic batched runtime and shared MDP that supports diverse world construction, execution, task evaluation, automatic rollout generation, and interactive agent interfaces.
LabVLA uses RoboGenesis simulation data and a two-stage FAST pretraining plus flow matching recipe on a Qwen3-VL backbone to achieve the highest success rates on LabUtopia under in- and out-of-distribution conditions.
A survey of test-time scaling for multimodal foundation models that introduces a three-way taxonomy of sampling, feedback, and search approaches along with applications and benchmarks.
citing papers explorer
-
RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies
RoboLab is a new simulation benchmark with 120 tasks across visual, procedural, and relational axes that quantifies generalization gaps and perturbation sensitivity in task-generalist robotic policies.
-
ForesightSafety-VLA: A Unified Diagnostic Safety Benchmark for Vision-Language-Action Models
ForesightSafety-VLA creates a diagnostic benchmark for VLA safety with taxonomy across physical, language, and visual risks, showing perception and structure variations cause more safety degradation than language changes in tested models.
-
PhysEditWorld: A Large-Scale Dataset Toward Physics-Editable World Models
PhysEditWorld is a new dataset of over 60 million frames from 12 UE5 cinematic scenes with synchronized multimodal signals and explicit gravity labels, built via replay to support physics-editable world models.
-
Stealthy World Model Manipulation via Data Poisoning
SWAAP is the first two-stage poisoning framework that identifies a harmful target world model via bilevel optimization and realizes it through stealth-constrained gradient matching on a limited fraction of fine-tuning transitions.
-
HARBOR: A Harness Framework for Agentic Robot Reinforcement Learning
HARBOR is a new agentic harness framework that automates robot RL workflows end-to-end across 16 tasks in manipulation, locomotion, and dexterous control, matching or exceeding default configurations while enabling sim-to-real transfer.
-
Support-Safe Variational Hybrid Filtering for Contact-Mode and Sparse-Law Recovery
VHYDRO is a support-safe variational hybrid filter that jointly recovers continuous latent states, discrete contact modes, and sparse port-Hamiltonian laws per regime while preventing loss of feasible transitions.
-
Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models
Pace-and-Path Correction decomposes a quadratic cost minimization into orthogonal pace and path channels to correct chunked actions in VLA models, raising success rates by up to 28.8% in dynamic settings.
-
ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs
ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.
-
An Embodied Simulation Platform, Benchmark, and Data-Efficient Augmentation Framework for Wet-Lab Robotics
Pipette supplies an open wet-lab simulation platform, 11-task benchmark, and perturbation-based augmentation pipeline that raises VLA success rates on sample handling and device tasks from limited demonstrations.
-
FATE-VLA:Failue-aware test generation for vision-language-action models
FATE-VLA reframes VLA evaluation as active failure discovery and reports uncovering up to 29.7% more failures across four models while revealing diverse failure modes.
-
InfoAtlas: A Foundation Model for Zero-Shot Statistical Dependence Estimate
InfoAtlas is a pretrained neural model for zero-shot mutual information estimation that matches state-of-the-art accuracy with 100x speedup and handles varying dimensions via a single model.
-
RoboWits: Unexpected Challenges for Robotic Creative Problem Solving
RoboWits benchmark with 238 tasks shows pre-trained VLAs succeed on seed tasks but fail on mutated ones, highlighting brittleness in reasoning.
-
DexHoldem: Playing Texas Hold'em with Dexterous Embodied System
DexHoldem is a new benchmark providing 1,470 teleoperated demonstrations across 14 manipulation primitives, plus standardized tests for dexterous policy execution and agentic perception in a physical Texas Hold'em setting.
-
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation
ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
-
dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model
A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.
-
AffordSim: A Scalable Data Generator and Benchmark for Affordance-Aware Robotic Manipulation
AffordSim integrates open-vocabulary 3D affordance prediction into simulation trajectory generation to create a 50-task benchmark that reaches 93% of manual annotation success rates and enables 24% average zero-shot success on a real Franka FR3.
-
SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds
SIM1 converts sparse real demonstrations into high-fidelity synthetic data through physics-aligned simulation, yielding policies that match real-data performance at a 1:15 ratio with 90% zero-shot success on deformable manipulation.
-
LaST-HD: Learning Latent Physical Reasoning from Scalable Human Data for Robot Manipulation
LaST-HD creates a shared latent dynamics space via a world model to transfer physical reasoning from scalable human-hand demonstrations to robots, achieving over 90% accuracy with 20 minutes of new data after mixed training.
-
MagicSim: A Unified Infrastructure for Executable Embodied Interaction
MagicSim is a unified embodied interaction infrastructure built on a deterministic batched runtime and shared MDP that supports diverse world construction, execution, task evaluation, automatic rollout generation, and interactive agent interfaces.
-
LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories
LabVLA uses RoboGenesis simulation data and a two-stage FAST pretraining plus flow matching recipe on a Qwen3-VL backbone to achieve the highest success rates on LabUtopia under in- and out-of-distribution conditions.
-
Test-Time Scaling in Multimodal Foundation Models: A Comprehensive Survey of Generation and Reasoning
A survey of test-time scaling for multimodal foundation models that introduces a three-way taxonomy of sampling, feedback, and search approaches along with applications and benchmarks.
-
Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization
MBDPO reformulates policy optimization as a diffusion process over searched trajectories in latent world models to reduce misalignment between search and value learning.
-
Learning Structural Latent Points for Efficient Visual Representations in Robotic Manipulation
A hybrid structural latent points representation is learned by inserting a point-wise latent VAE into a point-cloud autoencoder and regularizing toward a Gaussian prior, paired with a lightweight 3DGS rendering pipeline, yielding gains on RLBench and ManiSkill2 benchmarks.
-
E$^2$DT: Efficient and Effective Decision Transformer with Experience-Aware Sampling for Robotic Manipulation
E²DT couples a Decision Transformer with a k-Determinantal Point Process that scores trajectories on return-to-go quantiles, predictive uncertainty, and stage coverage to improve sample efficiency and policy quality in robotic manipulation.
-
R3D: Revisiting 3D Policy Learning
A transformer 3D encoder plus diffusion decoder architecture, with 3D-specific augmentations, outperforms prior 3D policy methods on manipulation benchmarks by improving training stability.
-
EmbodiedClaw: Conversational Workflow Execution for Embodied AI Development
EmbodiedClaw automates embodied AI development workflows through conversation, reducing manual effort and improving consistency and reproducibility.
-
Co-policy: Responsive Human-Robot Co-Creation for Musical Performances
Co-policy framework uses a fine-tuned Qwen-vl planner for semantic plans and a Gaussian-mixture visuomotor policy for low-latency robot actions to enable constrained human-robot musical co-creation.
-
GE-Sim 2.0: A Roadmap Towards Comprehensive Closed-loop Video World Simulators for Robotic Manipulation
GE-Sim 2.0 is a video-based closed-loop simulator for robotic manipulation that adds state expert, world judge, and acceleration modules on top of prior video generation to support policy learning and evaluation.
-
Silent Failures in Physical AI: A Literature Review of Runtime Action Authorization for Autonomous Systems
A literature review that defines silent physical-action failures in Physical AI and identifies the lack of complete runtime authorization boundaries across surveyed technical streams.
-
World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
-
World Action Models: A Survey
A survey that clarifies boundaries and organizes World Action Models by generation requirements and predictive substrates, identifying a trend toward generating less of the future.
- From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data