RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

Baijun Chen, Tianxing Chen, Yibin Liu, Zanxin Chen, Zijian Cai, Zixuan Li · 2025 · cs.RO · arXiv 2506.18088

Baseline reference. 51% of citing Pith papers use this work as a benchmark or comparison.

125 Pith papers citing it

Baseline: 51% of classified citations

abstract

Simulation-based data synthesis has emerged as a powerful paradigm for advancing real-world robotic manipulation. Yet existing datasets remain insufficient for robust bimanual manipulation due to (1) the lack of scalable task generation methods and (2) oversimplified simulation environments. We present RoboTwin 2.0, a scalable framework for automated, large-scale generation of diverse and realistic data, together with unified evaluation protocols for dual-arm manipulation. At its core is RoboTwin-OD, an object library of 731 instances across 147 categories with semantic and manipulation-relevant annotations. Building on this, we design an expert data synthesis pipeline that leverages multimodal language models (MLLMs) and simulation-in-the-loop refinement to automatically generate task-level execution code. To improve sim-to-real transfer, RoboTwin 2.0 applies structured domain randomization along five axes: clutter, lighting, background, tabletop height, and language, enhancing data diversity and policy robustness. The framework is instantiated across 50 dual-arm tasks and five robot embodiments. Empirically, it yields a 10.9% gain in code generation success rate. For downstream policy learning, a VLA model trained with synthetic data plus only 10 real demonstrations achieves a 367% relative improvement over the 10-demo baseline, while zero-shot models trained solely on synthetic data obtain a 228% gain. These results highlight the effectiveness of RoboTwin 2.0 in strengthening sim-to-real transfer and robustness to environmental variations. We release the data generator, benchmark, dataset, and code to support scalable research in robust bimanual manipulation. Project Page: https://robotwin-platform.github.io/, Code: https://github.com/robotwin-Platform/robotwin/.

citation-role summary

dataset: 21 · background: 17 · other: 1

citation-polarity summary

use dataset: 20 · background: 16 · unclear: 3

representative citing papers

SpatialBench: Is Your Spatial Foundation Model an All-Round Player?

cs.CV · 2026-05-26 · unverdicted · novelty 8.0

SpatialBench evaluates 41 spatial foundation models across 6 paradigms and 5 task suites, finds they are not all-round players, and introduces the DA-Next-5M dataset plus DA-Next baseline model.

Labimus: A Simulation and Benchmark for Humanoid Dexterous Manipulation in Chemical Laboratory

cs.RO · 2026-06-30 · unverdicted · novelty 7.0

Labimus is the first benchmark for humanoid dexterous manipulation in organic chemistry laboratories, exposing a gap between task completion and required experimental precision.

ForesightSafety-VLA: A Unified Diagnostic Safety Benchmark for Vision-Language-Action Models

cs.RO · 2026-06-25 · unverdicted · novelty 7.0

ForesightSafety-VLA creates a diagnostic benchmark for VLA safety with taxonomy across physical, language, and visual risks, showing perception and structure variations cause more safety degradation than language changes in tested models.

LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models

cs.RO · 2026-06-22 · unverdicted · novelty 7.0

LIBERO-Safety supplies a scalable benchmark, data-generation pipeline, and 19,664-demonstration dataset that exposes a generalization-safety tension in current VLA models where diverse training improves collision avoidance but task success stays limited by trajectory quality and semantic understandi

X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining

cs.CV · 2026-06-07 · unverdicted · novelty 7.0

X-Tokenizer creates semantic action tokens via asymmetric residual quantization and contrastive pretraining on large trajectory data, outperforming prior methods like FAST on robotic tasks.

ManiSoft: Towards Vision-Language Manipulation for Soft Continuum Robotics

cs.RO · 2026-05-18 · unverdicted · novelty 7.0

ManiSoft is a new benchmark featuring a soft-body simulator, four deformable control tasks, and an automated pipeline generating 6300 scenes with expert trajectories for training and evaluating vision-language policies on continuum robots.

DSSP: Diffusion State Space Policy with Full-History Encoding

cs.RO · 2026-05-14 · conditional · novelty 7.0

DSSP is a history-conditioned diffusion state space policy that uses SSMs to encode full observation streams with an auxiliary dynamics objective and hierarchical fusion, achieving SOTA results with reduced model size in robot manipulation.

CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL

cs.CV · 2026-05-14 · conditional · novelty 7.0

CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight bimanual manipulation tasks.

RotVLA: Rotational Latent Action for Vision-Language-Action Model

cs.RO · 2026-05-13 · unverdicted · novelty 7.0

RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.

From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

cs.RO · 2026-05-12 · unverdicted · novelty 7.0

MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model

cs.RO · 2026-05-12 · conditional · novelty 7.0

GridS is a plug-and-play differentiable module for geometry-aware visual token resampling in VLA models that achieves under 10% token retention and 76% FLOPs reduction with no success-rate loss.

RIO: Flexible Real-Time Robot I/O for Cross-Embodiment Robot Learning

cs.RO · 2026-05-12 · unverdicted · novelty 7.0

RIO introduces a lightweight open-source framework that abstracts real-time robot I/O to support easy switching between embodiments and platforms for collecting data and deploying VLAs.

CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

Capability vectors extracted from parameter differences between standard and auxiliary-finetuned VLA models can be merged into pretrained weights to match auxiliary-training performance while reducing computational overhead during adaptation.

VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models

cs.RO · 2026-05-11 · unverdicted · novelty 7.0

VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.

NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models

cs.RO · 2026-05-08 · unverdicted · novelty 7.0

NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.

Hyper-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control

cs.RO · 2026-05-02 · unverdicted · novelty 7.0 · 4 refs

HDP3 is a pocket-scale 3D diffusion policy with a Diffusion Mixer decoder that achieves state-of-the-art visuomotor control using two-step DDIM inference and under 1% of the parameters of prior 3D diffusion policies.

VUDA: Breaking CUDA-Vulkan Isolation for Spatial Sharing of Compute and Graphics on the Same GPU

cs.OS · 2026-05-02 · unverdicted · novelty 7.0

VUDA enables spatial sharing between CUDA and Vulkan on GPUs via channel redirection and page-table grafting, achieving up to 85% higher throughput than temporal baselines in embodied AI tasks.

Being-H0.7: A Latent World-Action Model from Egocentric Videos

cs.RO · 2026-04-30 · unverdicted · novelty 7.0

Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

HiPolicy: Hierarchical Multi-Frequency Action Chunking for Policy Learning

cs.RO · 2026-04-07 · unverdicted · novelty 7.0

HiPolicy is a new hierarchical multi-frequency action chunking method for imitation learning that jointly generates coarse and fine action sequences with entropy-guided execution to improve performance and efficiency in robotic manipulation.

BiCoord: A Bimanual Manipulation Benchmark towards Long-Horizon Spatial-Temporal Coordination

cs.RO · 2026-04-07 · conditional · novelty 7.0

BiCoord is a new benchmark for long-horizon tightly coordinated bimanual manipulation that includes quantitative metrics and shows existing policies like DP, RDT, Pi0 and OpenVLA-OFT struggle on such tasks.

JailWAM: Jailbreaking World Action Models in Robot Control

cs.RO · 2026-04-07 · unverdicted · novelty 7.0

JailWAM is the first dedicated jailbreak framework for World Action Models, achieving 84.2% attack success rate on LingBot-VA in RoboTwin simulation and enabling safety evaluation of robotic AI.

ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs

cs.RO · 2026-02-09 · unverdicted · novelty 7.0

ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.

RL-VLA$^3$: A Flexible and Asynchronous Reinforcement Learning Framework for VLA Training

cs.AI · 2026-02-05 · unverdicted · novelty 7.0

RL-VLA³ is an asynchronous RL framework for VLA training that delivers up to 85.2% higher throughput than synchronous baselines while preserving identical sample efficiency and scaling to 256 GPUs.

Information Filtering via Variational Regularization for Robot Manipulation

cs.RO · 2026-01-29 · unverdicted · novelty 7.0

Variational Regularization imposes an adaptive information bottleneck on noisy intermediate features in DP3-UNet and DP3-DiT policies, consistently raising task success rates on RoboTwin2.0, Adroit, and MetaWorld while achieving new state-of-the-art results.

citing papers explorer

Showing 50 of 125 citing papers.

ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation

cs.RO · 2026-05-06 · unverdicted · none · ref 14 · internal anchor

ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.

From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models

cs.RO · 2026-05-06 · unverdicted · none · ref 17 · internal anchor

A unified comparison of latent action supervision strategies for VLA models reveals task-specific benefits, with image-based approaches aiding reasoning and generalization, action-based aiding motor control, and discrete tokens proving most effective.

Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation

cs.CV · 2026-05-04 · unverdicted · none · ref 7 · internal anchor

A video transfer pipeline augments simulated VLA data into realistic videos while preserving actions, yielding consistent performance gains on robot benchmarks such as 8% on Robotwin 2.0.

MotuBrain: An Advanced World Action Model for Robot Control

cs.RO · 2026-04-30 · unverdicted · none · ref 11 · internal anchor

MotuBrain jointly models video and action via a three-stream Mixture-of-Transformers UniDiffuser to reach 95.8-96.1% success on RoboTwin 2.0 benchmarks, top EWMScore, and fast 11 Hz inference while adapting to new robots with 50-100 trajectories.

Reconstruction by Generation: 3D Multi-Object Scene Reconstruction from Sparse Observations

cs.CV · 2026-04-29 · unverdicted · none · ref 4 · internal anchor

RecGen achieves state-of-the-art 3D multi-object scene reconstruction from sparse RGB-D views by combining compositional synthetic scene generation with strong 3D shape priors, outperforming SAM3D by 30%+ in shape quality and pose accuracy while using 80% fewer meshes.

Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

cs.RO · 2026-04-29 · unverdicted · none · ref 71 · 2 links · internal anchor

X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.

LeHome: A Simulation Environment for Deformable Object Manipulation in Household Scenarios

cs.RO · 2026-04-24 · unverdicted · none · ref 4 · internal anchor

LeHome is a simulation platform offering high-fidelity dynamics for robotic manipulation of varied deformable objects in household settings, with support for multiple robot embodiments including low-cost hardware.

From Seeing to Simulating: Generative High-Fidelity Simulation with Digital Cousins for Generalizable Robot Learning and Evaluation

cs.RO · 2026-04-17 · unverdicted · none · ref 6 · internal anchor

Digital Cousins is a generative real-to-sim method that creates diverse high-fidelity simulation scenes from real panoramas to improve generalization in robot learning and evaluation.

Grounded World Model for Semantically Generalizable Planning

cs.RO · 2026-04-13 · conditional · none · ref 13 · internal anchor

A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.

AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps

cs.RO · 2026-04-13 · unverdicted · none · ref 2 · internal anchor

AIM predicts aligned spatial value maps inside a shared video-generation transformer to produce reliable robot actions, reaching 94% success on RoboTwin 2.0 with larger gains on long-horizon and contact-rich tasks.

DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks

cs.CV · 2026-04-13 · unverdicted · none · ref 7 · internal anchor

CLWM with DINOv3 targets, O(1) TTT memory, SAI latency masking, and EmbodiChain training achieves SOTA dual-arm simulation performance and zero-shot sim-to-real transfer that beats real-data finetuned baselines.

V-CAGE: Vision-Closed-Loop Agentic Generation Engine for Robotic Manipulation

cs.RO · 2026-04-10 · unverdicted · none · ref 6 · internal anchor

V-CAGE automates the creation of scalable, high-quality robotic manipulation datasets through context-aware scene construction, closed-loop visual verification, and perceptually-driven compression.

SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds

cs.RO · 2026-04-09 · unverdicted · none · ref 13 · internal anchor

SIM1 converts sparse real demonstrations into high-fidelity synthetic data through physics-aligned simulation, yielding policies that match real-data performance at a 1:15 ratio with 90% zero-shot success on deformable manipulation.

Fast-WAM: Do World Action Models Need Test-time Future Imagination?

cs.CV · 2026-03-17 · unverdicted · none · ref 41 · internal anchor

Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.

ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors

cs.RO · 2026-03-16 · conditional · none · ref 45 · internal anchor

ExpertGen generates high-success expert policies in simulation from imperfect priors by freezing a diffusion behavior model and optimizing its initial noise via RL, then distills them for real-robot deployment.

SeedPolicy: Horizon Scaling via Self-Evolving Diffusion Policy for Robot Manipulation

cs.RO · 2026-03-05 · conditional · none · ref 6 · internal anchor

SeedPolicy introduces self-evolving gated attention to extend the temporal horizon of diffusion policies, yielding 36.8% and 169% relative gains over standard DP on clean and randomized RoboTwin 2.0 tasks.

Universal Pose Pretraining for Generalizable Vision-Language-Action Policies

cs.CV · 2026-02-23 · unverdicted · none · ref 10 · internal anchor

Pose-VLA uses a decoupled two-stage pre-training with discrete pose tokens to extract universal 3D spatial priors from 3D datasets and robotic trajectories, achieving 79.5% success on RoboTwin 2.0 and 96.0% on LIBERO.

Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation

cs.RO · 2026-02-22 · unverdicted · none · ref 5 · internal anchor

OptimusVLA augments hierarchical VLA models with Global Prior Memory for shorter generative paths and Local Consistency Memory for temporal coherence, yielding higher success rates and 2.9x faster inference on simulation and real-world robotic benchmarks.

RISE: Self-Improving Robot Policy with Compositional World Model

cs.RO · 2026-02-11 · unverdicted · none · ref 16 · internal anchor

RISE combines a controllable dynamics model and progress value model into a closed-loop self-improving pipeline that updates robot policies entirely in imagination, reporting over 35% absolute gains on three real-world tasks.

ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning

cs.CV · 2026-02-11 · unverdicted · none · ref 11 · internal anchor

ABot-M0 unifies heterogeneous robot data into a 6-million-trajectory dataset and introduces Action Manifold Learning to predict stable actions on a low-dimensional manifold using a DiT backbone.

A Pragmatic VLA Foundation Model

cs.RO · 2026-01-26 · unverdicted · none · ref 8 · internal anchor

LingBot-VLA is a VLA foundation model trained on massive real robot data that shows superior generalization across tasks and platforms with fast training throughput.

SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

cs.RO · 2025-09-11 · conditional · none · ref 21 · internal anchor

SimpleVLA-RL applies tailored reinforcement learning to VLA models, reaching SoTA on LIBERO, outperforming π₀ on RoboTwin, and surpassing SFT in real-world tasks while reducing data needs and identifying a 'pushcut' phenomenon.

Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

cs.RO · 2025-08-07 · unverdicted · none · ref 10 · internal anchor

Genie Envisioner unifies robotic policy learning, simulation, and evaluation inside one instruction-conditioned video diffusion framework using GE-Base, GE-Act, and GE-Sim.

Vidar: Embodied Video Diffusion Model for Generalist Manipulation

cs.LG · 2025-07-17 · unverdicted · none · ref 13 · internal anchor

Vidar shows that a video diffusion prior continuously pre-trained on 750K multi-view robot trajectories plus a label-free masked inverse dynamics adapter can generalize manipulation to new robot embodiments with 1% of typical demonstration data.

CORE: Common Outcome Regularities from Action-Free Visual Demonstrations for Robot Manipulation

cs.RO · 2026-06-28 · unverdicted · none · ref 2 · internal anchor

CORE extracts visual goal prototypes from terminal embeddings in action-free demonstrations to condition robot policies, reporting success rate gains of up to 17 percentage points on manipulation benchmarks.

PhysisForcing: Physics Reinforced World Simulator for Robotic Manipulation

cs.CV · 2026-06-26 · unverdicted · none · ref 10 · internal anchor

PhysisForcing applies trajectory and relational alignment losses to DiT features in video models, improving physical plausibility on R-Bench, PAI-Bench, and EZS-Bench while raising closed-loop robotic success rates from 16% to 24%.

Scalable Multi-Task Data Generation via Reinforcement Learning for Language-Conditioned Bimanual Dexterous Manipulation

cs.RO · 2026-06-21 · unverdicted · none · ref 13 · internal anchor

An RL data generation pipeline with generalizable rewards and language annotations produces diverse synthetic datasets that improve multi-task policy generalization on three bimanual manipulation tasks.

GeoSem-WAM: Geometry- and Semantic-Aware World Action Models

cs.RO · 2026-06-02 · unverdicted · none · ref 29 · internal anchor

GeoSem-WAM adds geometric and semantic auxiliary prediction tasks to World Action Models during training to improve latent representations and action prediction accuracy while keeping inference efficient by avoiding explicit future rollouts.

DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation

cs.RO · 2026-05-29 · unverdicted · none · ref 11 · internal anchor

DeMaVLA is a VLA foundation model using a pruned action expert and flow matching, pre-trained on 5000 hours of real demonstrations and post-trained on multi-task folding data with human-in-the-loop correction, reporting competitive benchmark and real-world folding performance.

VLA-Pro: Cross-Task Procedural Memory Transfer for Vision-Language-Action Models

cs.RO · 2026-05-28 · unverdicted · none · ref 7 · internal anchor

VLA-Pro improves cross-task generalization in vision-language-action models by storing task-specific LoRA adapters as procedural memories and retrieving/fusing them at inference.

World Models for Robotic Manipulation: A Survey

cs.RO · 2026-05-27 · accept · none · ref 126 · internal anchor

Survey organizing world models for robotic manipulation into representation families, a functional taxonomy, and infrastructure roles across pretraining, post-training, and inference, while reviewing 34 datasets and evaluation protocols.

SANTS: A State-Adaptive Scheduler for World Action Models

cs.RO · 2026-05-27 · unverdicted · none · ref 17 · internal anchor

SANTS adaptively chooses denoising depth in video-based robot action diffusion policies using a state-dependent stopping hazard and noise ratio, trained via downstream action reward to reduce latency.

HyperSim: A Holistic Sim-To-Real Framework For Robust Robotic Manipulation

cs.RO · 2026-05-26 · unverdicted · none · ref 12 · internal anchor

HyperSim reports 80% and 95% sim-to-real success on two manipulation policies across 400 real executions by combining synthetic environment synthesis, adversarial trajectories, and co-training.

QuoVLA: Quotient Space for Vision-Language-Action Models

cs.CV · 2026-05-24 · unverdicted · none · ref 4 · internal anchor

QuoVLA introduces a quotient-space framework that compresses VLM latents into action-sufficient representations via quantization and dual-branch design for better VLA generalization.

RoVLA: Multi-Consistency Constraints for Robust Vision-Language-Action Models

cs.RO · 2026-05-19 · unverdicted · none · ref 21 · internal anchor

RoVLA enforces instructional, evolutionary, and observational consistency to improve robustness of VLA policies on manipulation benchmarks and real robots.

PAPO-VLA: Planning-Aware Policy Optimization for Vision-Language-Action Models

cs.RO · 2026-05-19 · unverdicted · none · ref 4 · internal anchor

PAPO-VLA identifies planning actions via variation and outcome, estimates their causal importance, and folds that importance into GRPO to emphasize key decisions while still using full-trajectory feedback.

Key-Gram: Extensible World Knowledge for Embodied Manipulation

cs.RO · 2026-05-18 · unverdicted · none · ref 17 · internal anchor

Key-Gram uses a memory module with key-grams and hashed lookup to inject static linguistic priors into vision-language-action backbones, yielding reported gains on manipulation benchmarks.

WorldArena 2.0: Extending Embodied World Model Benchmarking on Modality, Functionality and Platform

cs.RO · 2026-05-18 · unverdicted · none · ref 49 · internal anchor

WorldArena 2.0 extends embodied world model benchmarks to visuotactile perception, interactive policy training, and diverse real and simulated robotic platforms under a unified protocol.

AffordVLA: Injecting Affordance Representations into Vision-Language-Action Models via Implicit Feature Alignment

cs.RO · 2026-05-17 · unverdicted · none · ref 68 · internal anchor

AffordVLA improves VLA models for robotic manipulation by implicitly injecting affordance perception through feature alignment with a zero-shot teacher, claiming SOTA results in simulation and real-world tests.

DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization

cs.RO · 2026-05-17 · unverdicted · none · ref 168 · internal anchor

DyGRO-VLA is a two-stage optimization framework for cross-task scaling of Vision-Language-Action models via dynamic grouped residual optimization in RL.

IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation

cs.RO · 2026-05-14 · unverdicted · none · ref 6 · internal anchor

IntentVLA conditions VLA chunk generation on a compact intent code from recent observations and introduces AliasBench to evaluate stability under short-horizon observation aliasing, reporting gains on multiple robot benchmarks.

X-Imitator: Spatial-Aware Imitation Learning via Bidirectional Action-Pose Interaction

cs.RO · 2026-05-12 · unverdicted · none · ref 9 · internal anchor

X-Imitator is a bidirectional action-pose interaction framework for spatial-aware imitation learning that outperforms vanilla policies and explicit pose guidance on 24 simulated and 3 real-world robotic tasks.

Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation

cs.RO · 2026-05-12 · unverdicted · none · ref 29 · internal anchor

The method uses multi-view diffusion priors and action manifold learning to resolve depth ambiguity and improve action prediction in VLA robotic manipulation models, reporting higher success rates than baselines on LIBERO, RoboTwin, and real-robot tasks.

Nautilus: From One Prompt to Plug-and-Play Robot Learning

cs.RO · 2026-05-12 · unverdicted · none · ref 19 · internal anchor

NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.

ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation

cs.RO · 2026-05-09 · unverdicted · none · ref 11 · internal anchor

ProcVLM learns procedure-grounded dense progress rewards for robotic manipulation via a reasoning-before-estimation VLM trained on a 60M-frame synthesized corpus from 30 embodied datasets.

Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models

cs.RO · 2026-05-08 · unverdicted · none · ref 9 · internal anchor

Action-state consistency in World Action Models distinguishes successful from failed imagined futures and supports value-free selection of better rollouts via consensus among predictions.

StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

cs.CL · 2026-05-07 · unverdicted · none · ref 24 · internal anchor

StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.

STARRY: Spatial-Temporal Action-Centric World Modeling for Robotic Manipulation

cs.RO · 2026-04-29 · unverdicted · none · ref 10 · internal anchor

STARRY uses unified diffusion to align spatial-temporal world predictions with action generation plus GASAM for geometry-aware attention, reaching 93.82%/93.30% success on 50 bimanual tasks in simulation and raising real-world success from 42.5% to 70.8%.

StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement

cs.RO · 2026-04-20 · unverdicted · none · ref 14 · internal anchor

StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict action accuracy on AgiBot and 9.7-17.6% gains in real-robot tasks.

R3D: Revisiting 3D Policy Learning

cs.CV · 2026-04-16 · unverdicted · none · ref 6 · internal anchor

A transformer 3D encoder plus diffusion decoder architecture, with 3D-specific augmentations, outperforms prior 3D policy methods on manipulation benchmarks by improving training stability.
