ManiSoft is a new benchmark featuring a soft-body simulator, four deformable control tasks, and an automated pipeline generating 6300 scenes with expert trajectories for training and evaluating vision-language policies on continuum robots.
hub Baseline reference
RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation
Baseline reference. 51% of citing Pith papers use this work as a benchmark or comparison.
abstract
Simulation-based data synthesis has emerged as a powerful paradigm for advancing real-world robotic manipulation. Yet existing datasets remain insufficient for robust bimanual manipulation due to (1) the lack of scalable task generation methods and (2) oversimplified simulation environments. We present RoboTwin 2.0, a scalable framework for automated, large-scale generation of diverse and realistic data, together with unified evaluation protocols for dual-arm manipulation. At its core is RoboTwin-OD, an object library of 731 instances across 147 categories with semantic and manipulation-relevant annotations. Building on this, we design an expert data synthesis pipeline that leverages multimodal language models (MLLMs) and simulation-in-the-loop refinement to automatically generate task-level execution code. To improve sim-to-real transfer, RoboTwin 2.0 applies structured domain randomization along five axes: clutter, lighting, background, tabletop height, and language, enhancing data diversity and policy robustness. The framework is instantiated across 50 dual-arm tasks and five robot embodiments. Empirically, it yields a 10.9% gain in code generation success rate. For downstream policy learning, a VLA model trained with synthetic data plus only 10 real demonstrations achieves a 367% relative improvement over the 10-demo baseline, while zero-shot models trained solely on synthetic data obtain a 228% gain. These results highlight the effectiveness of RoboTwin 2.0 in strengthening sim-to-real transfer and robustness to environmental variations. We release the data generator, benchmark, dataset, and code to support scalable research in robust bimanual manipulation. Project Page: https://robotwin-platform.github.io/, Code: https://github.com/robotwin-Platform/robotwin/.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Simulation-based data synthesis has emerged as a powerful paradigm for advancing real-world robotic manipulation. Yet existing datasets remain insufficient for robust bimanual manipulation due to (1) the lack of scalable task generation methods and (2) oversimplified simulation environments. We present RoboTwin 2.0, a scalable framework for automated, large-scale generation of diverse and realistic data, together with unified evaluation protocols for dual-arm manipulation. At its core is RoboTwin-OD, an object library of 731 instances across 147 categories with semantic and manipulation-releva
co-cited works
representative citing papers
DSSP is a history-conditioned diffusion state space policy that uses SSMs to encode full observation streams with an auxiliary dynamics objective and hierarchical fusion, achieving SOTA results with reduced model size in robot manipulation.
CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight bimanual manipulation tasks.
RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.
MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
GridS is a plug-and-play differentiable module for geometry-aware visual token resampling in VLA models that achieves under 10% token retention and 76% FLOPs reduction with no success-rate loss.
RIO introduces a lightweight open-source framework that abstracts real-time robot I/O to support easy switching between embodiments and platforms for collecting data and deploying VLAs.
Capability vectors extracted from parameter differences between standard and auxiliary-finetuned VLA models can be merged into pretrained weights to match auxiliary-training performance while reducing computational overhead during adaptation.
VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.
NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.
VUDA enables spatial sharing between CUDA and Vulkan on GPUs via channel redirection and page-table grafting, achieving up to 85% higher throughput than temporal baselines in embodied AI tasks.
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
HiPolicy is a new hierarchical multi-frequency action chunking method for imitation learning that jointly generates coarse and fine action sequences with entropy-guided execution to improve performance and efficiency in robotic manipulation.
BiCoord is a new benchmark for long-horizon tightly coordinated bimanual manipulation that includes quantitative metrics and shows existing policies like DP, RDT, Pi0 and OpenVLA-OFT struggle on such tasks.
JailWAM is the first dedicated jailbreak framework for World Action Models, achieving 84.2% attack success rate on LingBot-VA in RoboTwin simulation and enabling safety evaluation of robotic AI.
DOMINO dataset and PUMA architecture enable better dynamic robotic manipulation by incorporating motion history, delivering 6.3% higher success rates than prior VLA models.
ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.
RL-VLA³ is an asynchronous RL framework for VLA training that delivers up to 85.2% higher throughput than synchronous baselines while preserving identical sample efficiency and scaling to 256 GPUs.
Variational Regularization imposes an adaptive information bottleneck on noisy intermediate features in DP3-UNet and DP3-DiT policies, consistently raising task success rates on RoboTwin2.0, Adroit, and MetaWorld while achieving new state-of-the-art results.
TouchGuide improves contact-rich robot manipulation by steering diffusion or flow-matching visuomotor policies with tactile feasibility scores from a contrastively trained Contact Physical Model.
DexHoldem is a new benchmark providing 1,470 teleoperated demonstrations across 14 manipulation primitives, plus standardized tests for dexterous policy execution and agentic perception in a physical Texas Hold'em setting.
STABLE generates simulation-ready tabletop scenes by alternating a semantic LLM reasoner for task-aligned coarse layouts with a physics corrector for physical plausibility using progressive scene expansion.
UAM adds a Dorsal Expert initialized from a generative model and trained on visual dynamics prediction to preserve over 95% of VLM multimodal ability in VLA training while achieving top success rates on manipulation tasks including OOD cases.
SCAR proposes a joint inverse-forward dynamics framework to learn transferable continuous action representations across embodiments from visual data using regularization and adversarial invariance.
citing papers explorer
-
ManiSoft: Towards Vision-Language Manipulation for Soft Continuum Robotics
ManiSoft is a new benchmark featuring a soft-body simulator, four deformable control tasks, and an automated pipeline generating 6300 scenes with expert trajectories for training and evaluating vision-language policies on continuum robots.
-
DSSP: Diffusion State Space Policy with Full-History Encoding
DSSP is a history-conditioned diffusion state space policy that uses SSMs to encode full observation streams with an auxiliary dynamics objective and hierarchical fusion, achieving SOTA results with reduced model size in robot manipulation.
-
RotVLA: Rotational Latent Action for Vision-Language-Action Model
RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.
-
From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation
MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
-
See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model
GridS is a plug-and-play differentiable module for geometry-aware visual token resampling in VLA models that achieves under 10% token retention and 76% FLOPs reduction with no success-rate loss.
-
RIO: Flexible Real-Time Robot I/O for Cross-Embodiment Robot Learning
RIO introduces a lightweight open-source framework that abstracts real-time robot I/O to support easy switching between embodiments and platforms for collecting data and deploying VLAs.
-
VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models
VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.
-
NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models
NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.
-
Being-H0.7: A Latent World-Action Model from Egocentric Videos
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
-
HiPolicy: Hierarchical Multi-Frequency Action Chunking for Policy Learning
HiPolicy is a new hierarchical multi-frequency action chunking method for imitation learning that jointly generates coarse and fine action sequences with entropy-guided execution to improve performance and efficiency in robotic manipulation.
-
BiCoord: A Bimanual Manipulation Benchmark towards Long-Horizon Spatial-Temporal Coordination
BiCoord is a new benchmark for long-horizon tightly coordinated bimanual manipulation that includes quantitative metrics and shows existing policies like DP, RDT, Pi0 and OpenVLA-OFT struggle on such tasks.
-
JailWAM: Jailbreaking World Action Models in Robot Control
JailWAM is the first dedicated jailbreak framework for World Action Models, achieving 84.2% attack success rate on LingBot-VA in RoboTwin simulation and enabling safety evaluation of robotic AI.
-
ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs
ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.
-
Information Filtering via Variational Regularization for Robot Manipulation
Variational Regularization imposes an adaptive information bottleneck on noisy intermediate features in DP3-UNet and DP3-DiT policies, consistently raising task success rates on RoboTwin2.0, Adroit, and MetaWorld while achieving new state-of-the-art results.
-
TouchGuide: Inference-Time Steering of Visuomotor Policies via Touch Guidance
TouchGuide improves contact-rich robot manipulation by steering diffusion or flow-matching visuomotor policies with tactile feasibility scores from a contrastively trained Contact Physical Model.
-
DexHoldem: Playing Texas Hold'em with Dexterous Embodied System
DexHoldem is a new benchmark providing 1,470 teleoperated demonstrations across 14 manipulation primitives, plus standardized tests for dexterous policy execution and agentic perception in a physical Texas Hold'em setting.
-
SCAR: Self-Supervised Continuous Action Representation Learning
SCAR proposes a joint inverse-forward dynamics framework to learn transferable continuous action representations across embodiments from visual data using regularization and adversarial invariance.
-
From Reaction to Anticipation: Proactive Failure Recovery through Agentic Task Graph for Robotic Manipulation
AgentChord models manipulation tasks as directed graphs enriched with anticipatory recovery branches, using specialized agents to enable immediate, low-latency failure responses and improve success on long-horizon bimanual tasks.
-
PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models
PriorVLA preserves pretrained priors in VLA models through a frozen Prior Expert and trained Adaptation Expert, delivering better robot manipulation performance than full fine-tuning with only 25% of the parameter updates.
-
HeteroGenManip: Generalizable Manipulation For Heterogeneous Object Interactions
A task-conditioned two-stage system decouples grasp localization from interaction trajectory planning using specialized foundation models to improve generalization across heterogeneous object types.
-
Toward Visually Realistic Simulation: A Benchmark for Evaluating Robot Manipulation in Simulation
VISER is a new visually realistic simulation benchmark for robot manipulation tasks that uses PBR materials and MLLM-assisted asset generation, achieving 0.92 Pearson correlation with real-world policy performance.
-
When to Trust Imagination: Adaptive Action Execution for World Action Models
A verifier called Future Forward Dynamics Causal Attention enables adaptive action execution in World Action Models, reducing model inferences by 69% and improving success rates in robotic tasks.
-
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation
ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
-
From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models
A unified comparison of latent action supervision strategies for VLA models reveals task-specific benefits, with image-based approaches aiding reasoning and generalization, action-based aiding motor control, and discrete tokens proving most effective.
-
MotuBrain: An Advanced World Action Model for Robot Control
MotuBrain jointly models video and action via a three-stream Mixture-of-Transformers UniDiffuser to reach 95.8-96.1% success on RoboTwin 2.0 benchmarks, top EWMScore, and fast 11 Hz inference while adapting to new robots with 50-100 trajectories.
-
Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.
-
LeHome: A Simulation Environment for Deformable Object Manipulation in Household Scenarios
LeHome is a simulation platform offering high-fidelity dynamics for robotic manipulation of varied deformable objects in household settings, with support for multiple robot embodiments including low-cost hardware.
-
From Seeing to Simulating: Generative High-Fidelity Simulation with Digital Cousins for Generalizable Robot Learning and Evaluation
Digital Cousins is a generative real-to-sim method that creates diverse high-fidelity simulation scenes from real panoramas to improve generalization in robot learning and evaluation.
-
Grounded World Model for Semantically Generalizable Planning
A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.
-
AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps
AIM predicts aligned spatial value maps inside a shared video-generation transformer to produce reliable robot actions, reaching 94% success on RoboTwin 2.0 with larger gains on long-horizon and contact-rich tasks.
-
V-CAGE: Vision-Closed-Loop Agentic Generation Engine for Robotic Manipulation
V-CAGE automates the creation of scalable, high-quality robotic manipulation datasets through context-aware scene construction, closed-loop visual verification, and perceptually-driven compression.
-
SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds
SIM1 converts sparse real demonstrations into high-fidelity synthetic data through physics-aligned simulation, yielding policies that match real-data performance at a 1:15 ratio with 90% zero-shot success on deformable manipulation.
-
ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors
ExpertGen generates high-success expert policies in simulation from imperfect priors by freezing a diffusion behavior model and optimizing its initial noise via RL, then distills them for real-robot deployment.
-
SeedPolicy: Horizon Scaling via Self-Evolving Diffusion Policy for Robot Manipulation
SeedPolicy introduces self-evolving gated attention to extend the temporal horizon of diffusion policies, yielding 36.8% and 169% relative gains over standard DP on clean and randomized RoboTwin 2.0 tasks.
-
Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation
OptimusVLA augments hierarchical VLA models with Global Prior Memory for shorter generative paths and Local Consistency Memory for temporal coherence, yielding higher success rates and 2.9x faster inference on simulation and real-world robotic benchmarks.
-
RISE: Self-Improving Robot Policy with Compositional World Model
RISE combines a controllable dynamics model and progress value model into a closed-loop self-improving pipeline that updates robot policies entirely in imagination, reporting over 35% absolute gains on three real-world tasks.
-
A Pragmatic VLA Foundation Model
LingBot-VLA is a VLA foundation model trained on massive real robot data that shows superior generalization across tasks and platforms with fast training throughput.
-
Genie Sim 3.0 : A High-Fidelity Comprehensive Simulation Platform for Humanoid Robot
Genie Sim 3.0 introduces an LLM-powered scene generator, the first LLM-based automated evaluation benchmark, and a large open synthetic dataset that demonstrates zero-shot sim-to-real transfer for robotic manipulation policies.
-
SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
SimpleVLA-RL applies tailored reinforcement learning to VLA models, reaching SoTA on LIBERO, outperforming π₀ on RoboTwin, and surpassing SFT in real-world tasks while reducing data needs and identifying a 'pushcut' phenomenon.
-
Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation
Genie Envisioner unifies robotic policy learning, simulation, and evaluation inside one instruction-conditioned video diffusion framework using GE-Base, GE-Act, and GE-Sim.
-
RoVLA: Multi-Consistency Constraints for Robust Vision-Language-Action Models
RoVLA enforces instructional, evolutionary, and observational consistency to improve robustness of VLA policies on manipulation benchmarks and real robots.
-
PAPO-VLA: Planning-Aware Policy Optimization for Vision-Language-Action Models
PAPO-VLA identifies planning actions via variation and outcome, estimates their causal importance, and folds that importance into GRPO to emphasize key decisions while still using full-trajectory feedback.
-
Key-Gram: Extensible World Knowledge for Embodied Manipulation
Key-Gram uses a memory module with key-grams and hashed lookup to inject static linguistic priors into vision-language-action backbones, yielding reported gains on manipulation benchmarks.
-
WorldArena 2.0: Extending Embodied World Model Benchmarking on Modality, Functionality and Platform
WorldArena 2.0 extends embodied world model benchmarks to visuotactile perception, interactive policy training, and diverse real and simulated robotic platforms under a unified protocol.
-
AffordVLA: Injecting Affordance Representations into Vision-Language-Action Models via Implicit Feature Alignment
AffordVLA improves VLA models for robotic manipulation by implicitly injecting affordance perception through feature alignment with a zero-shot teacher, claiming SOTA results in simulation and real-world tests.
-
DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization
DyGRO-VLA is a two-stage optimization framework for cross-task scaling of Vision-Language-Action models via dynamic grouped residual optimization in RL.
-
X-Imitator: Spatial-Aware Imitation Learning via Bidirectional Action-Pose Interaction
X-Imitator is a bidirectional action-pose interaction framework for spatial-aware imitation learning that outperforms vanilla policies and explicit pose guidance on 24 simulated and 3 real-world robotic tasks.
-
Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation
The method uses multi-view diffusion priors and action manifold learning to resolve depth ambiguity and improve action prediction in VLA robotic manipulation models, reporting higher success rates than baselines on LIBERO, RoboTwin, and real-robot tasks.
-
Nautilus: From One Prompt to Plug-and-Play Robot Learning
NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.
-
ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation
ProcVLM learns procedure-grounded dense progress rewards for robotic manipulation via a reasoning-before-estimation VLM trained on a 60M-frame synthesized corpus from 30 embodied datasets.